# Review of ‘The web was done by amateurs’ by Marco Aiello

Via one of those friend-of-a-friend likes on social media that popped up in my stream, I stumbled upon the recently published book “The web was done by amateurs” (there’s also a related talk) by Marco Aiello, which piqued my interest both concerning the title and the author. I’ve met Aiello once in Trento, when a colleague and he had a departing party, with Aiello leaving for Groningen. He probably doesn’t remember me, nor do I remember much of him—other than his lamentations about Italian academia and going for greener pastures. Turns out he’s done very well for himself academically, and the foray into writing for the general public has been, in my opinion, a fairly successful attempt with this book.

The short book—it easily can be read in a weekend—starts in the first part with historical notes on who did what for the Internet (the infrastructure) and the multiple predecessor proposals and applications of hyperlinking across documents that Tim Berners-Lee (TBL) apparently was blissfully unaware of. It’s surely a more interesting and useful read than the first Google hit, the few factoids from W3C, or Wikipedia one can find online with a simple search—or: it pays off to read books still in this day and age :). The second part is for most readers, perhaps, also still history: the ‘birth’ of the Web and the browser wars in the mid 1990s.

Part III is, in my opinion, the most fun to read: it discusses various extensions to the original design of TBL’s Web that fixes, or at least aims to fix, a shortcoming of the Web’s basics, i.e., they’re presented as “patches” to patch up a too basic—or: rank-amateur—design of the original Web. They are, among others, persistence with cookies to mimic statefulness for Web-based transactions (for, e.g., buying things on the web), trying to get some executable instructions with Java (ActiveX, Flash), and web services (from CORBA, service-oriented computing, to REST and the cloud and such). Interestingly, they all originate in the 1990s in the time of the browser wars.

There are more names in the distant and recent history of the Web that I knew of, so even I picked up a few things here or there. IIRC, they’re all men, though. Surely there would be at least one woman worthy of mention? I probably ought to know, but didn’t, so I searched the Web and easily stumbled upon the Internet Hall of Fame. That list includes Susan Estrada among the pioneers, who founded CERFnet that “grew the network from 25 sites to hundreds of sites.”, and, after that, Anriette Esterhuysen and Nancy Hafkin for the network in Africa, Qiheng Hu for doing this for China, and Ida Holz for the same in Latin America (in ‘global connections’). Web innovators specifically include Anne-Marie Eklund Löwinder for DNS security extensions (DNSSEC, noted on p143 but not by its inventor’s name) and Elizabeth Feinler for the “first query-based network host name and address (WHOIS) server” and “she and her group developed the top-level domain-naming scheme of .com, .edu, .gov, .mil, .org, and .net, which are still in use today”.

One patch to the Web that I really missed in the overview of the early patches, is the “Web 2.0”. I know that, technologically, it is a trivial extension to TBL’s original proposal: the move from static web pages in 1:n communication from content provider to many passive readers, to m:n communication with comment sections (fancy forms), or: instead of the surfer being just a recipient of information by reading one webpage after another and thinking her own thing of it, to be able to respond and interact, i.e., the chatrooms, the article and blog comment features, and, in the 2000s, the likes of MySpace and Facebook. It got so many more people involved in it all.

Continuing with the book’s content, cloud computing and the fog (section 7.9) are from this millennium, as is, what Aiello dubbed, the “Mother of All Patches.”: the Semantic Web. Regarding the latter, early on in the book (pp. vii-viii) there is already an off-hand comment that does not bode well: “Chap. 8 on the Semantic Web is slightly more technical than the rest and can be safely skipped.” (emphasis added). The way Chapter 8 is written, perhaps. Before discussing his main claim there, a few minor quibbles: it’s the Web Ontology Language OWL, not “Ontology Web Language” (p105), and there’s OWL 2 as successor of the OWL of 2004. “RDF is a nifty combination of being a simple modeling language while also functioning as an expressive ontological language” (p104), no: RDF is for representing data, not really for modeling, and most certainly would not be considered an ontology language (one can serialize an ontology in RDF/XML, but that’s different). Class satisfiability example: no, that’s not what it does, or: the simplification does not faithfully capture it; an example with a MammalFish that cannot have any instances (as subclass of both Mammal and Fish that are disjoint), would have been (regardless the real world).

The main claim of Aiello regarding the Semantic Web, however, is that it’s been that time to throw in the towel, because there hasn’t been widespread uptake of Semantic Web technologies on the Web even though it was proposed already around the turn of the millenium. I lean towards that as well and have reduced the time spent on it from my ontology engineering course over the years, but don’t want to throw out the baby with the bathwater just yet, for two reasons. First, scientific results tend to take a long time to trickle down. Second, I am not convinced that the ‘semantic’ part of the Web is the same level of end-user stuff as playing with HTML is. I still have an HTML book from 1997. It has instructions to “design your first page in 10 minutes!”. I cannot recall if it was indeed <10 minutes, but it sure was fast back in 1998-1999 when I made my first pages, as a non-IT interested layperson. I’m not sure if the whole semantics thing can be done even on the proverbial rainy Sunday afternoon, but the dumbed down version with schema.org sort of works. This schema.org brings me to p110 of Aiello’s book, which states that Google can make do with just statistics for optimal search results because of its sheer volume (so bye-bye Semantic Web). But it is not just stats-based: even Google is trying with schema.org and its “knowledge graph”; admitted, it’s extremely lightweight, but it’s more than stats-only. Perhaps the schema.org and knowledge graph sort of thing are to the Semantic Web what TBL’s proposal for the Web was to, say, the fancier HyperCard.

I don’t know if people within the Semantic Web research community would think of its tooling as technologies for the general public. I suspect not. I consider the development and use of ontologies in ontology-driven information systems as part of the ‘back office’ technologies, notwithstanding my occasional attempts to explain to friends and family what sort of things I’m working on.

What I did find curious, is that one of Aiello’s arguments for the Semantic Web’s failure was that “Using ontologies and defining what the meaning of a page is can be much more easily exploited by malicious users” (p110). It can be exploited, for sure, but statistics can go bad, very bad, too, especially on associations of search terms, the creepy amount of data collection on the Web, and bias built into the Machine Learning algorithms. Search engine optimization is just the polite terms for messing with ‘honest’ stats and algorithms. With the Semantic Web, it would a conscious decision to mess around and that’s easily traceable, but with all the stats-based approaches, it sneakishly can creep in whilst trying to keep up the veneer of impartiality, which is harder to detect. If it were a choice between two technology evils, I prefer the honest bastard cf. being stabbed in the back. (That the users of the current Web are opting for the latter does not make it the lesser of two evils.)

As to two possible new patches (not in the book and one can debate whether they are), time will tell whether a few recent calls for “decentralizing” the Web will take hold, or more fine-grained privacy that also entails more fine-grained recording of events (e.g., TBL’s solid project). The app-fication discussion (Section 10.1) was an interesting one—I hardly use mobile apps and so am not really into it—and the lock-in it entails is indeed a cause for concern for the Web and all it offers. Another section in Chapter 10 is IoT, which sounds promising and potentially scary (what would the data-hungry ML algorithms of the Web infer from my fridge contents, and from that, about me??)—for the past 10 years or so. Lastly, the final chapter has the tempting-to-read title “Should a new Web be designed?”, but the answer is not a clear yes or no. Evolve, it will.

Would I have read the book if I weren’t on sabbatical now? Probably still, on an otherwise ‘lost time’ intercontinental trip to a conference. So, overall, besides the occasional gap and one could quibble a bit here and there, the book is a nice read on the whole for any lay-person interested in learning something about the ubiquitous Web, any expert who’s using only a little corner of it, and certainly for the younger generation to get a feel for how the current Web came about and how technologies get shaped in praxis.

# On ‘open access’ CS conference proceedings

It perhaps sounds nice and doing-good-like, for the doe-eyed ones at least: publish computer science conference proceedings as open access so that anyone in the world can access the scientific advances for free. Yay. Free access to scientific materials is good for a multitude of reasons. There’s downside in the set-up in the way some try to push this now, though, which amounts to making people pay for what used to be, and still mostly is, for free already. I take issue with that. Instead of individualising a downside of open access by heaping more costs onto the individual researchers, the free flow of knowledge should be—and remain—a collectivised effort.

It is, and used to be, the case that most authors put the camera-ready-copy (CRC) on their respective homepages and/or institutional repositories, and it used to be typically even before the conference (e.g., mine are here). Putting the CRC on one’s website or in an openly accessible institutional repository seems to happen slightly less often now, even though it is legal to do so. I don’t know why. Even if it were not entirely legal, a collective disobedience is not something that the publishers easily can fight. It doesn’t help that Google indexes the publisher quicker than the academics’ webpages, so the CRCs on the authors’ pages don’t turn up immediately in the search results even whey the CRCs are online, but that would be a pathetic reason for not uploading the CRC. It’s a little extra effort to lookup an author’s website, but acceptable as long as the file is still online and freely available.

Besides the established hallelujah’s to principles of knowledge sharing, there’s since recently a drive at various computer science (CS) conferences to make sure the proceedings will be open access (OA). Like for OA journal papers in an OA or hybrid journal, someone’s going to have to pay for the ‘article processing charges’. The instances that I’ve seen close-up, put those costs for all papers of the proceedings in the conference budget and therewith increase the conference registration costs. Depending on 1) how good or bad the deal is that the organisers made, 2) how many people are expected to attend, and 3) how many papers will go in the volume, it hikes up the registration costs by some 50 euro. This is new money that the publishing house is making that they did not use to make before, and I’m pretty sure they wouldn’t offer an OA option if it were to result in them making less profit from the obscenely lucrative science publishing business.

So, who pays? Different universities have different funding schemes, as have different funders as to what they fund. For instance, there exist funds for contributing to OA journal article publishing (also at UCT, and Springer even has a list of OA funders in several countries), but that cannot be used in this case, for the OA costs are hidden in the conference registration fee. There are also conference travel funds, but they fund part of it or cap it to a maximum, and the more the whole thing costs, the greater the shortfall that one then will have to pay out of one’s own research fund or one’s own pocket.

A colleague (at another university) who’s pushing for the OA for CS conference proceedings said that his institution is paying for all the OA anyway, not him—he easily can have principles, as it doesn’t cost him anything anyway. Some academics have their universities pay for the conference proceedings access already anyway, as part of the subscription package; it’s typically the higher-ranking technical universities that have access. Those I spoke to, didn’t like the idea that now they’d have to pay for access in this way, for they already had ‘free’ (to them) access, as the registration fees come from their own research funds. For me, it is my own research funds as well, i.e., those funds that I have to scramble together through project proposal applications with their low acceptance rates. If I’d go to/have papers at, say, 5 such conferences per year (in the past several years, it was more like double that), that’s the same amount as paying a student/scientific programmer for almost a week and about a monthly salary for the lowest-paid in South Africa, or travel costs or accommodation for the national CS&IT conference (or both) or its registration fees. That is, with increased registration fees to cover the additional OA costs, at least one of my students or I would lose out on participating in even a local conference, or students would be less exposed to doing research and obtaining programming experience that helps them to get a better job or better chance at obtaining a scholarship for postgraduate studies. To name but a few trade-offs.

Effectively, the system has moved from “free access to the scientific literature anyway” (the online CRCs), to “free access plus losing money (i.e.: all that I could have done with it) in the process”. That’s not an improvement on the ground.

Further, my hard-earned research funds are mine, and I’d like to decide what to do with it, rather than having that decision been taken for me. Who do the rich boys up North think they are to say that I should spend it on OA when the papers were already free, rather than giving a student an opportunity to go to a national conference or devise and implement an algorithm, or participate in an experiment etc.! (Setting aside them trying to reprimand and ‘educate’ me on the goodness—tsk! as if I don’t know that the free flow of scientific information is a good thing.)

Tell me, why should the OA principles trump the capacity building when the papers are free access already anyway? I’ve not seen OA advocates actually weighing up any alternatives on what would be the better good to spend money on. As to possible answers, note that an “it ought to be the case that there would be enough money for both” is not a valid answer in discussing trade-offs, nor is a “we might add a bit of patching up as conference registration reduction for those needy that are not in the rich inner core” for it hardly ever happens, nor is a “it’s not much for each instance, you really should be able to cover it” because many instances do add up. We all know that funding for universities and for research in general is being squeezed left, right, and centre in most countries, especially over the past 8-10 years, and such choices will have to, and are being, made already. These are not just choices we face in Africa, but this holds also in richer countries, like in the EU (fewer resources in relative or absolute terms and greater divides), although a 250 euro (the 5 conferences scenario) won’t go as far there as in low-income countries.

Also, and regardless the funding squeeze: why should we start paying for free access that already was a de facto, and with most CS proceedings publishers, also a de jure, free access anyway? I’m seriously starting to wonder who’s getting kickbacks for promoting and pushing this sort of scheme. It’s certainly not me, and nor would I take it if some publisher would offer it to me, as it contributes to the flow of even more money from universities and research institutes to the profits of multinationals. If it’s not kickbacks, then to all those new ‘conference proceedings need to be OA’ advocates: why do you advocate paying for a right that we had for free? Why isn’t it enough for you to just pay for a principle yourself as you so desire, but instead insist to force others to do so too even when there is already a tacit and functioning agreement going on that realises that aim of free flow of knowledge?

Sure, the publisher has a responsibility to keep the papers available in perpetuity, which I don’t, and link rot does exist. One easily could write a script to search all academics’ websites and get the files, like citeseer used to do well. They get funding for such projects for long-term archiving, like arxiv.org does as well, and philpapers, and SSRN as popular ones (see also a comprehensive list of preprint servers), and most institution’s repositories, too (e.g., the CS@UCT pubs repository). So, the perpetuity argument can also be taken care of that way, without the researchers actually having to pay more.

Really, if you’re swimming in so much research money that you want to pay for a principle that was realised without costs to researchers, then perhaps instead do fund the event so that, say, some student grants can be given out, that it can contribute to some nice networking activity, or whatever part of the costs. The new “we should pay for OA, notwithstanding that no one was suffering when it was for free” attitude for CS conference proceedings is way too fishy to actually being honest; if you’re honest and not getting kickbacks, then it’s a very dumb thing to advocate for.

For the two events where this scheme is happening that I’m involved in, I admit I didn’t forcefully object at the time it was mentioned (nor had I really thought through the consequences). I should have, though. I will do so a next time.

# An Ontology Engineering textbook

My first textbook “An Introduction to Ontology Engineering” (pdf) is just released as an open textbook. I have revised, updated, and extended my earlier lecture notes on ontology engineering, amounting to about 1/3 more new content cf. its predecessor. Its main aim is to provide an introductory overview of ontology engineering and its secondary aim is to provide hands-on experience in ontology development that illustrate the theory.

The contents and narrative is aimed at advanced undergraduate and postgraduate level in computing (e.g., as a semester-long course), and the book is structured accordingly. After an introductory chapter, there are three blocks:

• Logic foundations for ontologies: languages (FOL, DLs, OWL species) and automated reasoning (principles and the basics of tableau);
• Developing good ontologies with methods and methodologies, the top-down approach with foundational ontologies, and the bottom-up approach to extract as much useful content as possible from legacy material;
• Advanced topics that has a selection of sub-topics: Ontology-Based Data Access, interactions between ontologies and natural languages, and advanced modelling with additional language features (fuzzy and temporal).

Each chapter has several review questions and exercises to explore one or more aspects of the theory, as well as descriptions of two assignments that require using several sub-topics at once. More information is available on the textbook’s page [also here] (including the links to the ontologies used in the exercises), or you can click here for the pdf (7MB).

Feedback is welcome, of course. Also, if you happen to use it in whole or in part for your course, I’d be grateful if you would let me know. Finally, if this textbook will be used half (or even a quarter) as much as the 2009/2010 blogposts have been visited (around 10K unique visitors since posting them), that would mean there are a lot of people learning about ontology engineering and then I’ll have achieved more than I hoped for.

UPDATE: meanwhile, it has been added to several open (text)book repositories, such as OpenUCT and the Open Textbook Archive, and it has been featured on unglue.it in the week of 13-8 (out of its 14K free ebooks).

# ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already underresourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as she “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2)). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations was held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which therewith reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieve redress of the injustices of the past. This route does tick multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence, not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?

Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely-spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent, of which the idea is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.

Data-driven approaches that use lots of text

The rules-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lost of text are on the rise: it offers the hope to achieve more with less effort, not even having to learn the language, and easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that learning the statistics-driven language model on old texts like the bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, illustrated in Figure 3, rather than a dictionary-based approach that is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding it a small isiXhosa corpus: it achieved about 80% accuracy already even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data of parallel sentences in both languages from which it has to learn the patterns, such as such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’ came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.

Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft put resources into development of HLTs for African languages already. Languages, and the identities and cultures intertwined with them, are a national resource, however; hence, suggesting the need for more research and the creation of a substantial public good of a wide range of HLTs to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623, pp. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

# Updated isiZulu spellchecker and new isiXhosa spellchecker

Noting that February is the month of language activism in South Africa and that 21 February is the International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development has progressed quite a bit, so that we have released a new ‘version 2’ of the spellchecker. For those not in-the-know: isiZulu and isiXhosa are both among the 11 official languages of South Africa, with isiZulu the largest language in the country by first language speakers and isiXhosa is slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:

• first error correction algorithms for isiZulu;
• improved error detection with a few basic rules, also for isiZulu;
• new isiXhosa error detection and correction;

The source code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):

Single word error *khupels that has a substitution error s for a in the intended word (khupela)

The error corrector can propose possible corrections for single-word errors that are either transpositions, substitutions, insertions, or deletions. So, for instance, *eybo, *yrbo, *yeebo, and *ybo, respectively, cf. the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, with, for error correction, a lot more statistics-based algorithms cf. the error detection-only algorithms. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it at least can suggest something) is 95% and suggestion relevance (that it contains the intended word) made it to 61%, mainly due to weak results of corrections for insertion errors (they mess too much with the trigrams).

The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semi-colons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored rather than the words adjacent to it being detected as probably misspelt. It works better than spellchecker.net’s version, which is the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas the spellchecker.net’s version got to 82.66% on the same text. Put differently: spellchecker.net flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).

Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption of using them for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) with more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than the statistical language model of Nthabiseng, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format and they may be useful for other endeavours, such as morphological analysers.

A section of the isiXhosa Wikipedia entry about the UN (*ukuez should be ukuze, which is among the proposed words).

Further improvements are possible, which are being scoped for a v3 some time later. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? If we had known that, it would have been an easy way to boost the performance. Can we find optimisations to substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in that or related questions, or you would like to use the algorithms in your tool, please contact me.

# Orchestrating 28 logical theories of mereo(topo)logy

Parts and wholes, again. This time it’s about the logic-aspects of theories of parthood (cf. aligning different hierarchies of (part-whole) relations and make them compatible with foundational ontologies). I intended to write this post before the Ninth Conference on Knowledge Capture (K-CAP 2017), where the paper describing the new material would be presented by my co-author, Oliver Kutz. Now, afterwards, I can add that “Orchestrating a Network of Mereo(topo) logical Theories” [1] even won the Best Paper Award. The novelties, in broad strokes, are that we figured out and structured some hitherto messy and confusing state of affairs, showed that one can do more than generally assumed especially with a new logics orchestration framework, and we proposed first steps toward conflict resolution to sort out expressivity and logic limitations trade-offs. Constructing a tweet-size “tl;dr” version of the contents is not easy, and as I have as much space here on my blog as I like, it ended up to be three paragraphs here: scene-setting, solution, and a few examples to illustrate some of it.

Problems

As ontologists know, parthood is used widely in ontologies across most subject domains, such as biomedicine, geographic information systems, architecture, and so on. Ontology (the philosophers) offer a parthood relation that has a bunch of computationally unpleasant properties that are structured in a plethora of mereologicial and meretopological theories such that it has become hard to see the forest for the trees. This is then complicated in practice because there are multiple logics of varying expressivity (support more or less language features), with the result that only certain fragments of the mereo(topo)logical theories can be represented. However, it’s mostly not clear what can be used when, during the ontology authoring stage one may want to have all those features so as to check correctness, and it’s not easy to predict what will happen when one aligns ontologies with different fragments of mereo(topo)logy.

Solution

We solved these problems by specifying a structured network of theories formulated in multiple logics that are glued together by the various linking constructs of the Distributed Ontology, Model, and Specification Language (DOL). The ‘structured network of theories’-part concerns all the maximal expressible fragments of the KGEMT mereotopological theory and five of its most well-recognised sub-theories (like GEM and MT) in the seven Description Logics-based OWL species, first-order logic, and higher order logic. The ‘glued together’-part refers to relating the resultant 28 theories within DOL (in Ontohub), which is a non-trivial (understatement, unfortunately) metalanguage that has the constructors for the glue, such as enabling one to declare to merge two theories/modules represented in different logics, extending a theory (ontology) with axioms that go beyond that language without messing up the original (expressivity-restricted) ontology, and more. Further, because the annoying thing of merging two ontologies/modules can be that the merged ontology may be in a different language than the two original ones, which is very hard to predict, we have a cute proof-of-concept tool so that it assists with steps toward resolution of language feature conflicts by pinpointing profile violations.

Examples

The paper describes nine mechanisms with DOL and the mereotopological theories. Here I’ll start with a simple one: we have Minimal Topology (MT) partially represented in OWL 2 EL/QL in “theory8” where the connection relation (C) is just reflexive (among other axioms; see table in the paper for details). Now what if we add connection’s symmetry, which results in “theory4”? First, we do this by not harming theory8, in DOL syntax (see also the ESSLI’16 tutorial):

logic OWL2.QL
ontology theory4 =
theory8
then
ObjectProperty: C Characteristics: Symmetric %(t7)

What is the logic of theory4? Still in OWL, and if so, which species? The Owl classifier shows the result:

Another case is that OWL does not let one define an object property; at best, one can add domain and range axioms and the occasional ‘characteristic’ (like aforementioned symmetry), for allowing arbitrary full definitions pushes it out of the decidable fragment. One can add them, though, in a system that can handle first order logic, such as the Heterogeneous toolset (Hets); for instance, where in OWL one can add only “overlap” as a primitive relation (vocabulary element without definition), we can take such a theory and declare that definition:

logic CASL.FOL
ontology theory20 =
theory6_plus_antisym_and_WS
then %wdef
. forall x,y:Thing . O(x,y) <=> exists z:Thing (P(z,x) /\ P(z,y)) %(t21)
. forall x,y:Thing . EQ(x,y) <=> P(x,y) /\ P(y,x) %(t22)

As last example, let me illustrate the notion of the conflict resolution. Consider theory19—ground mereology, partially—that is within OWL 2 EL expressivity and theory18—also ground mereology, partially—that is within OWL 2 DL expressivity. So, they can’t be the same; the difference is that theory18 has parthood reflexive and transitive and proper parthood asymmetric and irreflexive, whereas theory19 has both parthood and proper parthood transitive. What happens if one aligns the ontologies that contain these theories, say, O1 (with theory18) and O2 (with theory19)? The Owl classifier provides easy pinpointing and tells you the profile: OWL 2 full (or: first order logic, or: beyond OWL 2 DL—top row) and why (bottom section):

Now, what can one do? The conflict resolution cannot be fully automated, because it depends on what the modeller wants or needs, but there’s enough data generated already and there are known trade-offs so that it is possible to describe the consequences:

• Choose the O1 axioms (with irreflexivity and asymmetry on proper part of), which will make the ontology interoperable with other ontologies in OWL 2 DL, FOL or HOL.
• Choose O2’s axioms (with transitivity on part of and proper part of), which will facilitate linking to ontologies in OWL 2 RL, 2 EL, 2 DL, FOL, and HOL.
• Choose to keep both sets will result in an OWL 2 Full ontology that is undecidable, and it is then compatible only with FOL and HOL ontologies.

As serious final note: there’s still fun to be had on the logic side of things with countermodels and sub-networks and such, and with refining the conflict resolution to assist ontology engineers better. (or: TBC)

As less serious final note: the working title of early drafts of the paper was “DOLifying mereo(topo)logy”, but at some point we chickened out and let go of that frivolity.

References

[1] Keet, C.M., Kutz, O. Orchestrating a Network of Mereo(topo)logical Theories. Ninth International Conference on Knowledge Capture (K-CAP’17), Austin, Texas, USA, December 4-6, 2017. ACM Proceedings.

# Figuring out the verbalisation of temporal constraints in ontologies and conceptual models

Temporal conceptual models, ontologies, and their logics are nothing new, but that sort of information and knowledge representation still doesn’t gain a lot of traction (cf. say, formal methods for verification). This is in no small part because modelling temporal information is not easy. Several conceptual modelling languages do have various temporal extensions, but most modellers don’t even use all of the default language features yet [1]. How could one at least reduce the barrier to adoption of temporal logics and modelling languages? The two principle approaches are visualisation with a diagrammatic language and rendering it in a (pseudo-)natural language. One of my postgraduate students looked at the former, trying to figure out what would be the best icons and such, which showed there was still a steep learning curve [2]. Before examining whether that could be optimised, I wondered whether the natural language option might be promising. The problem was, that no-one had yet tried to determine what the natural language counterpart of the temporal constraints were supposed to be, let alone whether they be ‘adequate’ or the ‘best’ way of rendering the temporal constraints in tolerable natural language sentences. I wanted to know that badly enough that I tried to find out.

Given that using templates is a tried-and-tested relatively successful approach for atemporal conceptual models and ontologies (e.g., for ORM, the ACE system), it makes sense to do something similar, but then for some temporal extension. As temporal conceptual modelling language I used one that has a Description Logics foundation (DLRUS [3,4]) for that easily links to ontologies as well, added a few known temporal constraints (like for relationships/DL roles, mandatory) and removing others (some didn’t seem all that interesting), which resulted in 34 constraints, still. For each one, I tried to devise more and less reasonable templates, resulting in 101 templates overall. Those templates were evaluated on semantics and preference by three temporal logic experts and five ‘mixed experts’ (experts in natural language generation, logic, or modelling). This resulted in a final set of preferred templates to verbalise the temporal constraints. The remainder of this post first describes a bit about the templates and then the results of which I think they are most interesting.

Templates

The basic idea of a template—in the context of the verbalisation of conceptual models and ontologies—is to have some natural language for the constraint where then the vocabulary gets slotted in at runtime. Take, for instance, simple named class subsumption in an ontology, $C \sqsubseteq D$, for which one could define a template “Each [C] is a(n) [D]”, so that with some axiom $Manager \sqsubseteq Employee$, it would generate the sentence “Each Manager is an Employee”. One also could have devised the template “All [C] are [D]” and then it would have generated “All Managers are Employees”. The choice between the two templates in this case is just taste, for in both cases, the semantics is the same. More complex axioms are not always that straightforward. For instance, for the axiom type $C \sqsubseteq \exists R.D$, would “Each [C] [R] some [D]” be good enough, or would perhaps “Each [C] must [R] at least one [D]” be better? E.g., “Each Professor teaches some Course” vs “Each Professor must teach at least one Course”.

The same can be done for the temporal constraints. To get there, I did a bit of a linguistic detour that informed the template design (described in the paper [5]). Let us take as first example for templates temporal class that has a semantics of $o \in C^{\mathcal{I}(t)} \rightarrow \exists t' \neq t. o \notin C^{\mathcal{I}(t')}$; for instance, UndergraduateStudent (assuming they graduate and end up as alumni or as drop outs, and weren’t undergrads from birth):

1. If an object is an instance of entity type [C], then there is some time where it is not a(n) [C].
2. [C] is an entity type whose objects are, for some time in their existence, not instances of [C].
3. [C] is an entity type of which each object is not a(n) [C] for some time during its existence.
4. All instances of entity type [C] are not a(n) [C] for some time.
5. Each [C] is not a(n) [C] for some time.
6. Each [C] is for some time not a(n) [C].

Which one(s) do you think captures the semantics, and which one(s) do you prefer?

A more elaborate constraint for relationships is ‘dynamic extension for relationships, past, mandatory], which is formalised as $\langle o , o' \rangle \in \mbox{{\sc RDexM}-}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow (\langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t' where $\langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow ( \langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t'>t. \langle o , o' \rangle \in {\tt R_2}^{\mathcal{I}(t')})$.; e.g., every passenger who boards a flight must have checked in for that flight. Two options could be:

1. Each ..C_1.. ..R_1.. ..C_2.. was preceded by ..C_1.. ..R_2.. ..C_2.. some time earlier.
2. Each ..C_1.. ..R_1.. ..C_2.. must be preceded by ..C_1.. ..R_2.. ..C_2.. .

I’m not saying they are all correct; they were some of the options given, which the participants could choose from and comment on. The full list of constraints and template options are available in the supplementary material, which also contains a file where you can fill in your own answers, see what the (anonymised) participants said, and it has the final list of ‘best’ constraints.

Results

The main aggregate quantitative results are shown in the following table.

Many observations can be made from the data (see the paper for details). Some of the salient aspects are that there was low inter-annotator agreement among the experts, despite that they know each other (temporal logics is a small community) and that the ‘mixed group’ deemed many sentences correct that the experts deemed wrong in the sense of not properly capturing the semantics of the constraint. Put differently, it looks like the mixed experts, as a group, did not fully grasp some subtle distinction in the temporal constraints.

With respect to the templates, the preferred ones don’t follow the structure of the logic, but are, in a way, a separate rendering, or: there’s no neat 1:1 mapping between axiom type and template structure. That said, that doesn’t mean that they always chose the shortest template: the experts definitely did not, while the mixed experts leaned a bit toward preferring templates with fewer words even though they were surely not always the semantically correct option.

It may not look good that the experts preferred different templates, but in a follow-up interview with one of the experts, the expert noted that it was not really a problem “for there is the logic that does have the precise meaning anyway” and thus “resolves any confusion that may arise from using slightly different terminology”. The temporal logic expert does have a point from the expert’s view, fair enough, but that pretty much defeats my aim with the experiment. Asking more non-experts may not be a good strategy either, for they are, on average, too lenient.

So, for now, we do have a set of, relatively, ‘best’ templates to verbalise temporal constraints in temporal conceptual models and ontologies. The next step is to compare that with the diagrammatic representation. This we did [6], and I’ll describe those results informally in a next post.

I’ll present more details at the upcoming CREOL: Contextual Representation of Events and Objects in Language Workshop that is part of the Joint Ontology Workshops 2017, which will be held next week (21-23 September) in Bolzano, Italy. As the KRDB group at FUB in Bolzano has a few temporal logic experts, I’m looking forward to the discussions! Also, I’d be happy if you would be willing to fill in the spreadsheet with your preferences (before looking at the answers given by the participants!), and send them to me.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[2] T. Shunmugam. Adoption of a visual model for temporal database representation. M. IT thesis, Department of Computer Science, University of Cape Town, South Africa, 2016.

[3] A. Artale, E. Franconi, F. Wolter, and M. Zakharyaschev. A temporal description logic for reasoning about conceptual schemas and queries. In S. Flesca, S. Greco, N. Leone, and G. Ianni, editors, Proceedings of the 8th Joint European Conference on Logics in Artificial Intelligence (JELIA-02), volume 2424 of LNAI, pages 98-110. Springer Verlag, 2002.

[4] A. Artale, C. Parent, and S. Spaccapietra. Evolving objects in temporal information systems. Annals of Mathematics and Artificial Intelligence, 50(1-2):5-38, 2007.

[5] Keet, C.M. Natural language template selection for temporal constraints. CREOL: Contextual Representation of Events and Objects in Language, Joint Ontology Workshops 2017, 21-23 September 2017, Bolzano, Italy. CEUR-WS Vol. (in print).

[6] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Springer LNCS. 6-9 Nov 2017, Valencia, Spain. (in print)