On that “shared” conceptualization and other definitions of an ontology

It’s a topic that never failed to generate a discussion on all 10 instalments of the ontology engineering course I taught from BSc(hons) up to participants studying toward or already having a PhD: those pesky definitions of what an ontology is. To top it off, like I didn’t know, I also got a snarky reviewer’s comment about it on my Stuff ontology paper [1]:

A comment that might be superficial but I cannot help: since an ontology is usually (in Borst’s terms) assumed to be a ‘shared’ conceptualization, I find a little surprising for such a complex model to have been designed by a sole author. While I acknowledge the huge amount of literature carefully analyzed, it still seems that the concrete modeling decisions eventually relied on the background of a single ontologist

Is that bad? Does that make the Stuff Ontology a ‘nontology’? And, by the by, what about all those loner philosophers who write single-author papers on ontology; should that whole field be discarded because most of the ontology insights were “shared” only from paper submission and publication?

Anyway, let’s start from the beginning. There’s the much-criticized definition of an ontology from Gruber that, it seems, only novices keep quoting (to my irritation, indeed):

An ontology is a specification of a conceptualization. [2]

If you wonder why quite a bit has been written about it: try to answer what “specification” really means and how it is specified, and what exactly a “conceptualization” is. The real fun starts with Borst et al.’s [3] and then Studer et al.’s [4] refinement of Gruber’s version, which the reviewer quoted above alluded to:

An ontology is a formal, explicit specification of a shared conceptualization. [4]

At least there’s the “formal” (be it in the sense of logic or formal ontology), and “explicit”, so something is being made explicit and precise. But “shared”? Shared with whom? How? Is a logical theory that not one, but two, people write down an ontology, then? Or one person develops an ontology and then emails it to a few colleagues or puts it online in, say, the open BioPortal ontology repository: does that count as “shared” then? Or is it only “shared” if at least one other person agrees with it as is (all reviewers of the Stuff Ontology did, btw), or perhaps with (most or all of) the ‘conceptualization’ of it, while a few axioms would need a bit of tweaking and cleaning up? Do you need at least a group of people to develop an ontology, and if so, how large should that group be, and should that group consist of independent sub-groups that adopt the ontology (and if so, how many endorsers)? Is a lightweight low-hanging-fruit ontology that is used by a large company a real or successful ontology, but a highly axiomatised ontology with a high tangledness that is used by a specialist organization, not? And even if you canvass and get a large group and/or organization to buy into that formal explicit specification, what if they are all wrong on the reality it is supposed to represent? Does it still count as an ontology no matter how wrong the conceptualization is, just because it’s formal, explicit, and shared? Is a tailor-made module of, say, the DOLCE ontology not also an ontology, even if the module was made by one person and made available in an online repository like ROMULUS?

Perhaps one shouldn’t start top-down, but bottom-up: take some things and decide (who?) whether it is an ontology or not. Case one: the taxonomy of part-whole relations is a mini-ontology, and although at the start only ‘shared’ with my co-author and published in the Applied Ontology journal [5], it has been used by quite a few researchers for various (and unintended) purposes afterward, notably in NLP (e.g., [6]). An ontology? If so, since when? Case two: Noy et al. converted the representation of the NCI thesaurus into OWL DL [7]. Does changing the serialisation of a multi-authored thesaurus from one format into another make it an ontology? (more on that below.) Case three: a group of 5 people try to represent the subject domain of, say, breast cancer, but it is replete with mistakes both regarding the reality it ought to represent and unintended modelling errors (such as confusing is-a with part-of). Is it still an ontology, albeit a bad one?

It gets more muddled when the representation language is thrown in (as with case 2 above). What if the ontology turns out to be unsatisfiable? From a logic viewpoint, it’s not a theory then (a consistent set of sentences, is), but if it’s formal, explicit, and shared, is it acceptable that those people who developed the artefact simply have an inconsistent conceptualization and that it still counts as an ontology?

Horrocks et al. [8] simplify the whole thing by eliminating the ‘shared’ aspect:

an ontology being equivalent to a Description Logic knowledge base. [8]

However, this generates a set of questions and problems of its own that also matter in practice. For instance: 1) whether transforming a UML Class Diagram into OWL ‘magically’ makes it an ontology (answer: no); 2) whether converting the NCI Thesaurus to OWL does (answer: no); or 3) whether, if you used, say, Common Logic to represent it, it could not be an ontology because it’s not formalised in Description Logics (answer: it sure can be one).

There are more attempts to give a definition or a description, notably by Nicola Guarino in [9] (a key paper in the field):

An ontology is a logical theory accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualization of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models. [9]

That’s a mouthful, but at least no “shared” in there, either. And, finally, among the many definitions in [10], here’s Barry Smith and colleagues’ take on it:

An ONTOLOGY is a representational artifact, comprising a taxonomy as proper part, whose representational units are intended to designate some combination of universals, defined classes, and certain relations between them. [10]

And again, no “shared” either in this definition. Of course, also with Smith’s definition, there are things one can debate about and pose it against Guarino’s definition, like the “universals” vs. “conceptualization” etc., but that’s a story for another time.

So, to sum up: there is that problem on how to interpret “shared”, which is untenable, and one just as well can pick a definition of an ontology from a widely cited paper that doesn’t include that in the definition.

That said, all this doesn’t help my students to grapple with the notion of ‘an ontology’. Examples help, and it would be good if someone, or, say, the International Association for Ontology and its Applications (IAOA) would have a list of “exemplar ontologies” sooner rather than later. (Yes, I have a list, but it still needs to be annotated better). Another aspect that helps explaining it comes from Guarino’s slides on going “from logical to ontological level” and on good and bad ontologies. This first screenshot (taken from my slides, which are easier to find) shows there’s “something more” to an ontology than just the logic, with a hint to reasons why (note to my students: more about that later in the course). The second screenshot shows that, yes, we can have the good, bad, and ugly: the yellow oval denotes the intended models (what it should be), and the other ovals denote the various approximations that one may have tried to represent in an ontology. For instance, representing ‘each human has exactly one brain’ is more precise (“good”) than stating ‘each human has at least one brain’ (“less good”) or not saying anything at all about it in an ontology of human anatomy (“bad”), and even “worse” it would be if that ontology were to state ‘each human has exactly two tails’.
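To make the ‘approximating the intended models’ idea a bit more tangible, here is a minimal sketch in Python. It is a toy of my own making, not something from Guarino’s slides: a “model” is reduced to a pair (number of brains, number of tails) for one human, the candidate range is an arbitrary assumption, and each “ontology” is just a constraint on that pair. It counts how many intended and unintended models each variant admits.

```python
from itertools import product

# Hypothetical toy universe: a "model" is only (number of brains, number of tails)
# for a single human. Real models of a logical theory are far richer than this.
CANDIDATE_MODELS = list(product(range(4), range(3)))


def intended(model):
    """The one situation we actually want to describe: one brain, no tails."""
    brains, tails = model
    return brains == 1 and tails == 0


# Each "ontology" is a constraint stating which candidate models it admits.
ONTOLOGIES = {
    "exactly one brain": lambda m: m[0] == 1,
    "at least one brain": lambda m: m[0] >= 1,
    "says nothing": lambda m: True,
    "exactly two tails": lambda m: m[1] == 2,
}


def evaluate(admits):
    """Return (intended models covered, unintended models admitted)."""
    admitted = [m for m in CANDIDATE_MODELS if admits(m)]
    covered = [m for m in admitted if intended(m)]
    return len(covered), len(admitted) - len(covered)


RESULTS = {name: evaluate(admits) for name, admits in ONTOLOGIES.items()}

for name, (covered, unintended) in RESULTS.items():
    print(f"{name}: intended covered={covered}, unintended admitted={unintended}")
```

The fewer unintended models an ontology admits while still covering the intended one, the more precise it is; the ‘two tails’ variant is worse than saying nothing, because it excludes the intended model altogether.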


Maybe we can’t do better than ‘intuition’ or ‘very unwieldy explanation’. If this were a local installation of WordPress, I’d have added a poll on definitions and the subjectivity on the shared-ness factor (though knowing well that science isn’t governed as a democracy). In lieu of that: comments, preferences for one definition or the other, or any better suggestions for definitions are most welcome! (The next instalment of my Ontology Engineering course will start in a few weeks’ time.)



[1] Keet, C.M. A core ontology of macroscopic stuff. 19th International Conference on Knowledge Engineering and Knowledge Management (EKAW’14). K. Janowicz et al. (Eds.). 24-28 Nov, 2014, Linkoping, Sweden. Springer LNAI vol. 8876, 209-224.

[2] Gruber, T. R. A translation approach to portable ontology specifications. Knowledge Acquisition, 1993, 5(2):199-220.

[3] Borst, W.N., Akkermans, J.M. Engineering Ontologies. International Journal of Human-Computer Studies, 1997, 46(2-3):365-406.

[4] Studer, R., Benjamins, R., and Fensel, D. Knowledge engineering: Principles and methods. Data & Knowledge Engineering, 1998, 25(1-2):161-198.

[5] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[6] Tandon, N., Hariman, C., Urbani, J., Rohrbach, A., Rohrbach, M., Weikum, G.: Commonsense in parts: Mining part-whole relations from the web and image tags. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16). pp. 243-250. AAAI Press (2016)

[7] Noy, N.F., de Coronado, S., Solbrig, H., Fragoso, G., Hartel, F.W., Musen, M. Representing the NCI Thesaurus in OWL DL: Modeling tools help modeling languages. Applied Ontology, 2008, 3(3):173-190.

[8] Horrocks, I., Patel-Schneider, P. F., and van Harmelen, F. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 2003, 1(1):7.

[9] Guarino, N. (1998). Formal ontology and information systems. In Guarino, N., editor, Proceedings of Formal Ontology in Information Systems (FOIS’98), Frontiers in Artificial intelligence and Applications, pages 3-15. Amsterdam: IOS Press.

[10] Smith, B., Kusnierczyk, W., Schober, D., Ceusters, W. Towards a Reference Terminology for Ontology Research and Development in the Biomedical Domain. KR-MED 2006 “Biomedical Ontology in Action”. November 8, 2006, Baltimore, Maryland, USA.

Book reviews for 2016

I can’t resist adding another instalment of brief reviews of some of the books I’ve read over the past year, following the previous five editions and the gender analysis of them (with POC/non-POC added on request at the end). This time, there are three (well, four) non-fiction books and four fiction novels discussed in the remainder of the post. The links to the books used to be mostly to Kalahari.com online (an SA-owned bookstore), but they have been usurped by the awfully-sounding TakeALot, so the links to the books are diversified a bit more now.


Writing what we like—a new generation speaks, edited by Yolisa Qunta (2016). This is a collection of short essays about how society is perceived by young adults in South Africa. I think this stock-taking of events and opinions thereof is a must-read for anyone wanting to know what goes on and willing to look a bit beyond the #FeesMustFall sound bites on Twitter and Facebook. For instance, “A story of privilege” by Shaka Sisulu describing his experiences coming to study at UCT, and Sophokuhle Mathe in “White supremacy vs transformation” on UCT’s new admissions policy, the need for transformation, and going to hold the university to account; Yolisa Qunta’s “Spider’s web” on the ghost of apartheid with the every-day racist incidents and the anger that comes with it; “Cape Town’s pretend partnership” by Ilham Rawoot on his observations of exclusion of most Capetonians regarding preparations of the World Design Capital in 2014. There are a few ‘lighter’ essays as well, like the fun side of taking the taxi (minibus) in “life lessons learnt from taking the taxi” by Qunta (indeed, travelling by taxi can be fun).

Elephants on Acid by Alex Boese (2007). This is a fun book about the weird and outright should-not-have-been-done research, and why we have ethics committees now. There are of course the ‘usual suspects’ (gorillas in our midst, Milgram’s experiment), the weird ones (testing LSD on elephants; didn’t turn out alright), funny ones (will your dog get help if you are in trouble [no]; how much pubic hair you lose during intercourse [not enough for the CSI people]; social facilitation with cockroach games; trying to weigh the mass of a soul), but also those of the do-not-repeat variety. The latter include trying to figure out whether a person under the guillotine will realise his head has been ‘separated’ from his body, Little Albert, and the “depatterning” of ‘beneficial brainwashing’ (it wasn’t beneficial at all). The book is written in an entertaining way, either along the lines of ‘what on earth was their hypothesis to devise such an experiment?’, or, knowing the hypothesis, with some morbid fascination to see whether it was falsified. Most of the research referenced is, for obvious reasons, older. But well, that doesn’t mean there wouldn’t be any outrageous experiments being conducted nowadays when we look back in, say, 20 years’ time.

What if? by Randall Munroe (2014, Dutch translation, dwarsligger). Great; read it. Weird and outright absurd questions asked by xkcd readers are answered sort of seriously from a STEM perspective.

Say again? The other side of South African English by Jean Branford and Malcolm Venter (2016). This short review ended up a lot longer, so it got its own blog post two weeks ago.


Red ink by Angela Makholwa (2007). This is a juicy crime novel, as the Black Widow Society by the same author is (that I reviewed last year), and definitely a recommendable read. The protagonist, Lucy Khambule, is a PR consultant setting up her company in Johannesburg, but used to be a gutsy journalist who had sent a convicted serial killer a letter asking for an interview. Five years hence, he invites her for that interview and asks her to write a book about him. As writing a book was her dream, she takes up the offer. Things get messy, partly as a result of that: more murders, intrigues, and some love and friendship (the latter with other people, not the serial killer) that put the people close to Lucy in harm’s way. As with the Black Widow Society, it ends well for some but not for others.

Things fall apart by Chinua Achebe (1958 [2008 edition]). This is a well-known book in Africa at least, and there are many analyses available online, so I’m not going to repeat all that. The story documents both the mores in a rural village and how things (more precisely: the society) fall apart due to several reasons, both on how the society was organised and the influence of the colonialists and their religion. The storytelling has a slow start, but picks up in pace after a short while, and it is worthwhile to bite through that slow start. You can’t help but feel a powerless onlooker to how the events unfold, and sorry about how things turn out.

Kassandra by Christa Wolf (1983, Dutch translation [1990] from the German original; also available in English). Greeks, Trojans, Achilles, Trojan Horse, and all that. Kassandra, the seer and daughter of king Priamos and queen Hekabe, is an independent woman who rambles on, analysing her life’s main moments before her execution. It has an awkward prose that one needs to get used to, but there are some interesting nuggets. One is on approaching things only in duals, or alternative options, like endlessly winning or losing wars, versus the third option of living. It was a present from the last century that I ought to have read earlier; but better late than never.

De midlife club by Karin Belt (2014, in Dutch, dwarsligger). The story describes four women in their early 40s living in a province in the Netherlands (the author is from a city nearby where I grew up), for whom life didn’t quite turn out as they fantasised about in their early twenties, due to one life choice after another. Superficially, things seem ok, but something is simmering underneath, which comes to the surface when they go to a holiday house in France for a short retreat. (I’m not going to include spoilers). It was nice to read a Dutch novel with recognisable scenes and that contemplates choices. The suspense and twists were fun such that I really had to finish reading it as soon as possible.

As I still have some 150 pages to go to finish the 700-page tome of Indaba, my children by Credo Mutwa, a review will have to wait until next year. But I can already highly recommend it.

Robot peppers, monkey gland sauce, and go well—Say again? reviewed

The previous post about TDDonto2 had as toy example a pool braai, which does exist in South Africa at least, but perhaps also elsewhere under a different name: the braai is the ‘South African English’ (SAE) for the barbecue. There are more such words and phrases peculiar to SAE, and after the paper deadline last week, I did finish reading the book Say again? The other side of South African English by Jean Branford and Malcolm Venter (published earlier this year) that has many more examples of SAE and a bit of sociolinguistics and some etymology of that. Anyone visiting South Africa will encounter at least several of the words and sentence constructions that are SAE, but probably would raise eyebrows elsewhere. Let me start with some examples.

Besides the braai, one certainly will encounter the robot, which is a traffic light (automating the human police officer). A minor extension to that term can be found in the supermarket (see figure on the right): robot peppers, being a bag of three peppers in the colours of red, yellow, and green (no vegetable AI, sorry).

How familiar the other ones discussed in the book are, depends on how much you interact with South Africans, where you stay(ed), and how much you read and knew about the country before visiting it, I suppose. For instance, when I visited Pretoria in 2008, I had not come across the bunny, but did so upon my first visit in Durban in 2010 (it’s a hollowed-out half a loaf of bread, filled with a curry) and bush college upon starting to work at a university (UKZN) here in 2011. The latter is a derogatory term that was used for universities for non-white students in the Apartheid era, with the non-white being its own loaded term from the same regime. (It’s better not to use it—all terms for classifying people one way or another are a bit of a mine field, whose nuances I’m still trying to figure out; the book didn’t help with that).

Then there’s the category of words one may know from ‘general English’, but are by the authors claimed to have a different meaning here. One is the sell-outs, which is “to apply particularly to black people who were thought to have betrayed their people” (p143), though I have the impression it can be applied generally. Another is townhouse, which supposedly has narrowed its meaning cf. British English (p155), but from having lived on the isles some years ago, it was used in the very same way as it is here; the book’s authors just stick to its older meaning and assume the British and Irish do so too (they don’t, though). One that indeed does fall in the category ‘meaning restriction’ is transformation (an explanation of the narrower sense will take up too much space). While I’ve learned a bunch of the ‘unusual’ usual words in the meantime I’ve worked here, there were others that I still did wonder about. For instance, the lay-bye, which the book explained to be the situation when the shop sets aside a product the customer wants, and the customer pays the price in instalments until it is fully paid before taking the product home. The monkey gland sauce one can buy in the supermarket is another, which is a sauce based on ketchup and onion with some chutney in it—no monkeys and no glands—but, I’ll readily admit, I still have not tried it due to its awful name.

There are many more terms described and discussed in the book, and it has a useful index at the end, especially given that it gives the impression of being a very popsci-like book. The content is very nicely typeset, with news item snippets and aside-boxes and such. Overall, though, while it’s ok to read in the gym on the bicycle for a foreigner who sometimes wonders about certain terms and constructions, it is rather uni-dimensional from a British White South African perspective and the authors are clearly Cape Town-based, with the majority of examples from SA media from Cape Town’s news outlets. They take a heavily Afrikaans-influence-only bias, with, iirc, only four examples of the influence of, e.g., isiZulu on SAE (e.g., the ‘go well’ literal translation of isiZulu’s hamba kahle), which is a missed opportunity. A quick online search reveals quite a list of words from indigenous languages that have been adopted (and more here and here and here and here) such as muti (medicine; from the isiZulu umuthi) and maas (thick sour milk; from the isiZulu amasi) and dagga (marijuana; from the Khoe daxa-b), not to mention the many loan words, such as indaba (conference; isiZulu) and ubuntu (the concept, not the operating system; the authors seem to be a bit short of it, given the near blind spot on import of words with a local origin). If that does not make you hesitant to read it, then let me illustrate some more inaccuracies beyond the aforementioned townhouse squabble, which mean one probably has to take the book’s contents with a grain of salt and heavily contextualise them, and/or at least fact-check them yourself. They fall in at least three categories: vocabulary, grammar, and etymology.

To quote: “This came about because the Dutch term tijger means either tiger or leopard” (p219): no, we do have a word for leopard: luipaard. That word is included even in a pocket-size Prisma English-Dutch dictionary or any online EN-NL dictionary, so a simple look-up to fact-check would have sufficed (and it existed already in Dutch before a bunch of them started colonising South Africa in 1652; originating from old French in ~1200). Not having done so smells of either sloppiness or arrogance. And I’m not so sure about the widespread use of pavement special (stray or mongrel dogs or cats), as my backyard neighbours use just stray for ‘my’ stray cat (whom they want to sterilise because he meows in the morning). It is a fun term, though.

Then there’s stunted etymology of words. The coconut is not a term that emerged in the “new South Africa” (pp145-146), but is transferred from the Americas where it was already in use since at least the 1970s to denote the same concept (in short: a brown skinned person who is White on the inside) but then applied to some people from Central and South America [Latino/Hispanic; take your pick].

Extending the criticism also to the grammar explanations, the “with” aside box on pp203-204 is wrong as well, though perhaps not as blatantly obvious as the leopard and coconut ones. The authors stipulate that phrases like “Is So-and-So coming with?” (p203) are Afrikaans influence of kom saam “where saam sounds like ‘with’” (p203) (uh, no, it doesn’t), and as more guessing they drag a bit of German influence in US English into it. This use, and the related examples like the “…I have to take all my food with” (p204), is the same construction and similar word order as for the Dutch adverb mee ‘with’ (and German mit), such as in the infinitives meekomen ‘to come with’ (komen = to come), meenemen ‘to take with’, meebrengen ‘to bring with’, and meegaan ‘to go with’. In a sentence, the mee may be separated from the rest of the verb and put somewhere, including at the end of the sentence, like in ik neem mijn eten mee ‘I take my food with’ (word-by-word translated) and komt d’n dieje mee? ‘comes so-and-so with?’ (word-by-word translated, with a bit of ABB in the Dutch). German has similar infinitives (mitkommen, mitnehmen, mitbringen, and mitgehen, respectively), sure, but the grammar construction the book’s authors highlight is so much more likely to come from Dutch as first step of tracing it back, given that Afrikaans is a ‘simplified’ version of Dutch, not of German. (My guess would be that the Dutch mee- can be traced back, in turn, to the German mit, as Dutch is a sort of ‘simplified’ German, but that’s a separate story.)

In closing, I could go on with examples and corrections, and maybe I should, but I think I made the point clear. The book didn’t read as badly as it may seem from this review, but writing the review required me to fact-check a few things, rather than taking most of it at face value, which made it turn out more and more mediocre than the couple of irritations I had whilst reading it.

Improved! TDDonto v2—more types of axioms supported and better feedback

Yes, the title almost sounds like a silly washing powder ad, but version 2 of TDDonto really does more than the TDDonto tool for Test-Driven Development of ontologies [1,2] that was introduced earlier this year. There are two principal novelties, largely thanks to Kieren Davies (also at UCT): more types of axioms are supported—arbitrary class expressions on both sides of the inclusion and ABox assertions—and differentiated test feedback beyond just pass/fail/unknown. TDDonto2 obviously still uses a test-first approach rather than test-last for ontology authoring, i.e., checking whether the axiom is already entailed in the ontology or would cause problems before actually adding it, saving yourself a lot of classification time overhead in the ontology authoring process. 

On the first item, TDDonto (or TawnyOwl or Scone) could not handle, e.g., Carnivore \sqcup Herbivore \sqsubseteq Animal or some domain restriction \forall eats.Animal \sqsubseteq Carnivore , or whether some individual is different/same from another. TDDonto2 can. This required a new set of algorithms, some nifty orchestration of several functions offered by an automated reasoner (of the DL/OWL variety), and extending the Protégé 5 functionality with parsing Manchester syntax keyword constructs for individuals as well (another 3600 lines of code). The Protégé 5 plugin works. Correctness of those algorithms has been proven, so you can rely on it just like you can with the test-last approach of add-axiom-and-then-run-the-reasoner (I’ll save you from those details).

On the second item (and also beyond the current TDD tools): now it can tell you not only just ‘pass’ (i.e., the axiom is entailed), but the ‘failed’ has been refined into the various possible cases: that adding the axiom to the ontology would cause the ontology to become inconsistent, or that it would cause a class to become unsatisfiable (incoherent), or it may be none of the three (absent), so it would be ‘safe’ to add the axiom under test to the ontology (that is: at least not cause inconsistency, incoherence, or redundancy). Further, we’ve added ‘pre-real TDD unit test’ checks: if the ontology is already inconsistent, there’s no point in testing the axiom; if the ontology already has unsatisfiable classes, then one should fix that first; and if there is an entity in the test axiom that is not in the ontology, then it should be added first.
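For those who prefer to see the outcome categories as code, here is a rough sketch in Python. It is emphatically not the TDDonto2 implementation: it naively extends a copy of the ontology and re-checks it, which is exactly the overhead the tool’s algorithms are meant to avoid, and it only handles atomic subclass tests. The tuple encoding, the function names, and the little wildlife axioms are all assumptions made up for this illustration.

```python
from enum import Enum


class Outcome(Enum):
    ENTAILED = "entailed"
    INCONSISTENT = "inconsistent"
    INCOHERENT = "incoherent"
    ABSENT = "absent"
    FAILED_PRECONDITION = "failed precondition"
    MISSING_ENTITY = "missing entity"


# Hypothetical encoding: ("sub", A, B) is A SubClassOf B, ("disjoint", A, B) is
# DisjointClasses, and ("type", i, A) asserts that individual i is an A.
def superclasses(onto, cls):
    """All named superclasses of cls, reflexive and transitive."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for kind, a, b in onto:
            if kind == "sub" and a == current and b not in seen:
                seen.add(b)
                todo.append(b)
    return seen


def has_clash(onto, class_set):
    return any(k == "disjoint" and a in class_set and b in class_set
               for k, a, b in onto)


def classes(onto):
    return {x for k, a, b in onto for x in ((b,) if k == "type" else (a, b))}


def unsat_classes(onto):
    return {c for c in classes(onto) if has_clash(onto, superclasses(onto, c))}


def is_consistent(onto):
    """Inconsistent when some individual ends up in two disjoint classes."""
    types = {}
    for kind, ind, cls in onto:
        if kind == "type":
            types.setdefault(ind, set()).update(superclasses(onto, cls))
    return not any(has_clash(onto, t) for t in types.values())


def evaluate(onto, axiom):
    """Classify a ("sub", A, B) test axiom; other axiom types are not covered."""
    if not is_consistent(onto) or unsat_classes(onto):
        return Outcome.FAILED_PRECONDITION
    _, sub, sup = axiom
    if not {sub, sup} <= classes(onto):
        return Outcome.MISSING_ENTITY
    if sup in superclasses(onto, sub):
        return Outcome.ENTAILED
    extended = onto | {axiom}  # naive: tentatively add, then re-check
    if not is_consistent(extended):
        return Outcome.INCONSISTENT
    if unsat_classes(extended):
        return Outcome.INCOHERENT
    return Outcome.ABSENT


# Made-up toy ontology, only for illustration.
ONTO = {
    ("sub", "Lion", "Carnivore"), ("sub", "Carnivore", "Animal"),
    ("sub", "Herbivore", "Animal"), ("sub", "Warthog", "Herbivore"),
    ("sub", "Mammal", "Animal"), ("disjoint", "Carnivore", "Herbivore"),
    ("type", "leo", "Lion"),
}

for test in [("sub", "Lion", "Animal"), ("sub", "Lion", "Herbivore"),
             ("sub", "Warthog", "Carnivore"), ("sub", "Lion", "Mammal"),
             ("sub", "Zebra", "Herbivore")]:
    print(test, "->", evaluate(ONTO, test).value)
```

The order of the checks matters in this sketch: preconditions first, then entailment, and only then the ‘what would happen if I added it’ cases, so that a test never reports a problem that was already in the ontology before the axiom came along.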

The remainder of the post mainly just shows off some of the functionality. Put the JAR file in the plugins directory, and then place the view somewhere via Window – Views – Ontology views – TDDonto2. As toy ontology, I tend to end up with examples of the African Wildlife Ontology, which I use for exercises in my Ontology Engineering course, but as it is almost summer holiday here, I’ve conjured up a different example. That test ontology contains the following knowledge at the start:

ServiceObject \equiv Facility \sqcup Attraction

Pool \sqsubseteq Facility

Braai \sqsubseteq Facility

Pool \sqcap Braai \sqsubseteq \bot

Hotel \sqsubseteq Accommodation

BedAndBreakfast \sqsubseteq Accommodation

BedAndBreakfast \sqcap Hotel \sqsubseteq \bot

Facility \sqsubseteq \exists offeredBy.Accommodation

Hotel \sqsubseteq =1 offers.Pool


The first test is to see whether \exists offeredBy.Accommodation \sqsubseteq Facility , to show that TDDonto2 can handle class expressions on the left-hand side of the inclusion axiom. It can, and it is clearly absent from the toy ontology; see screenshot below, first line in the middle section. Likewise for the second and third test, where a typical novice ontology authoring mixup is made between ‘and’ and ‘or’, with different test results: one is absent, the other entailed.

Then some more fun: the pool braai. First of all, PoolBraai is not in our ontology, so TDDonto2 returns an error: it can be seen from Protégé’s handling (red dotted line below PoolBraai and red-lined text box in the screenshot above), and TDDonto2 will not let you add it to the set of tests (pop-up box, not shown). After adding it and testing “PoolBraai SubClassOf: Pool and Braai”, then if we were to add that axiom to the ontology, it will be incoherent (because Pool and Braai are disjoint):

Doing this nonetheless, by selecting the axiom and adding it to our ontology by pressing the “Add selected to ontology” button, and running all tests again by pressing the “Evaluate all” button (or select it and click “Evaluate selected”), the results look like this:

That is, we failed a precondition, because PoolBraai is unsatisfiable, so no tests are being executed until this is fixed. Did I make this up just to have a silly toy ontology example? No, the pool braai does exist, in South Africa at least: it is a stainless steel barbecue table-set that one can place in a small backyard pool. So, we remove PoolBraai \sqsubseteq Pool \sqcap Braai from the ontology and add PoolBraai \sqsubseteq Braai , so that we can do a few more tests.

Let’s assume we want to explore more about accommodations and their facilities, and add some knowledge about that (tests 5-7):

Finally, let’s check something about any instances in the ontology. First, whether LagoonBeach is a hotel “LagoonBeach Type: Hotel”, which it is (with a view on Table Mountain), and whether it also could be a B&B, which it cannot be, because hotel and B&B are disjoint. Adding another individual to the ontology for the sake of example, SinCity (an owl:Thing), we want to know whether SinCity can be the same as LagoonBeach, or asserted as different (the last two tests in the list): the tests return absent, i.e., they can be either, for nothing is known about SinCity.


Now let’s remove a selection of the tests because they would cause problems in the ontology, and add the remaining five in one go:

This change requires one to classify the ontology, and subsequently you’re expected to run all the tests again to check that they are all entailed and do not cause some new problem, which they don’t:


And, finally, a few arbitrary ones that are ontologically a bit off, but they show that yes, something arbitrary both on the left-hand side and right-hand side of the inclusion (or equivalence) works (first test, below), disjointness still works (test 2) and now also with arbitrary class expressions (test 5), and the same/different individuals can take more than two arguments (tests 3 and 4).


The source code and JAR file are freely available (GPL licence) to use, examine, or extend. A paper with the details has been submitted, so you’ll have to make do with just the tool for the moment. If you have any feedback on the tool, please let us know.



[1] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. 13th Extended Semantic Web Conference (ESWC’16). H. Sack et al. (Eds.). Springer LNCS vol. 9678, pp642-657. 29 May – 2 June, 2016, Crete, Greece.

[2] Lawrynowicz, A., Keet, C.M. The TDDonto Tool for Test-Driven Development of DL Knowledge bases. 29th International Workshop on Description Logics (DL’16). April 22-25, Cape Town, South Africa. CEUR WS vol. 1577.

Launch of the isiZulu spellchecker


Langa Khumalo, ULPDO director, giving the spellchecker demo, pointing out a detected spelling error in the text. On his left, Mpho Monareng, CEO of PanSALB.

Yesterday, the isiZulu spellchecker was launched at UKZN’s “Launch of the UKZN isiZulu Books and Human Language Technologies” event, which was also featured on 702 live radio, SABC 2 Morning Live, and e-news during the day. What we at UCT have to do with it is that both the theory and the spellchecker tool were developed in-house by members of the Department of Computer Science at UCT. The connection with UKZN’s University Language Planning & Development Office is that we used a section of their isiZulu National Corpus (INC) [1] to train the spellchecker with, and that they wanted a spellchecker (the latter came first).

The theory behind the spellchecker was described briefly in an earlier post and it has been presented at IST-Africa 2016 [2]. Basically, we don’t use a wordlist + rules-based approach as some experiments of 20 years ago did, nor a wordlist + a few rules of the now-defunct translate.org.za OpenOffice v3 plugin seven years ago, but a data-driven approach with a statistical language model that uses trigrams. The section of the INC we used consisted of novels and news items, so it includes present-day isiZulu texts. At the time of the IST-Africa’16 paper, based on Balone Ndaba’s BSc CS honours project, the spell checking was very much proof-of-concept, but it showed that it could be done and still achieve a good enough accuracy. We used that approach to create an enduser-usable isiZulu spellchecker, which saw the light of day thanks to our 3rd-year CS@UCT student Norman Pilusa, who both developed the front-end and optimised the backend so that it has an excellent performance.
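
To give a flavour of the data-driven idea, here is a minimal sketch in Python of error detection with trigrams. It is not the code of the actual tool: I assume here that the trigrams are over characters, and the toy word list, the threshold, and the function names are all made up for illustration (the real model is trained on a section of the INC).

```python
from collections import Counter

def char_trigrams(word):
    """Return the character trigrams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(corpus_words):
    """Count how often each trigram occurs in the training words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_trigrams(word))
    return counts

def probably_misspelled(word, counts, max_unseen=0):
    """Flag a word if more than max_unseen of its trigrams never occurred in training."""
    unseen = sum(1 for tri in char_trigrams(word) if counts[tri] == 0)
    return unseen > max_unseen

# Toy training data (hypothetical; a real model needs a large corpus).
corpus = ["abafundi", "umfundi", "imali", "kwemali", "umuntu", "abantu"]
model = train(corpus)

print(probably_misspelled("abafundi", model))  # word whose trigrams were all seen
print(probably_misspelled("abafxndi", model))  # word with unseen trigrams
```

A threshold like `max_unseen` is one way to trade off flagging too many versus too few words; how the actual tool makes that decision is described in the paper [2], not here.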

Upon starting the platform-independent isiZulu_spellchecker.jar file, the English interface version looks like this:


You can write text in the text box, or open a txt or docx file, which then is displayed in the text box. Click “Run”. Now there are two options: you can choose to step through the words that are detected as misspelled one at a time, or “Show All” words that are detected as misspelled. Both are shown for some sample text in the screenshot below.


processing one error at a time


highlighting all words detected as very probably misspelled

Then it is up to you to choose what to do with it: correct it in the textbox, “Ignore once”, “Ignore all”, or “Add” the word to your (local) dictionary. If you have modified the text, you can save it with the changes made by clicking “Save correction”. You also can switch the interface from the default English to isiZulu by clicking “File – Use English”, and back to English via “iFayela – ulimi lesingisi”. You can download the isiZulu spellchecker from the ULPDO website and from the GitHub repository for those who want to get their hands on the source code.

To anticipate some possible questions you may have: incorporating it as a plugin to Microsoft Word, OpenOffice/LibreOffice, and Mozilla Firefox was planned. The former is technologically ‘closed source’, however, and the latter two have a certain way of doing spellchecking that is not amenable to the data-driven approach with the trigrams. So, for now, it is a standalone tool. By design, it is desktop-based rather than for mobile phones, because according to the client (ULPDO@UKZN), they expect the first users to be professionals with admin documents and emails, journalists writing articles, and such, writing on PCs and laptops.

There was also a trade-off concerning a particular sort of error: the tool now flags more words as probably incorrect than it could have, yet it correctly recognises (a subset of) capitalisation, such as KwaZulu-Natal, whilst flagging some of the deviant spellings that go around, as shown in the screenshot below.

The customer preferred recognising such capitalisation.

Error correction sounds like an obvious feature as well, but that will require a bit more work, not just technologically, but also the underlying theory. It will probably be an honours project topic for next year.

In the grand scheme of things, the current v1 of the spellchecker is only a small step—yet, many such small steps in succession will get one far eventually.

The launch itself saw an impressive line-up of speeches and introductions: the keynote address was given by Dr Zweli Mkhize, UKZN Chancellor and member of the ANC NEC; Prof Ramesh Krishnamurthy, from Aston University UK, gave the opening address; Mpho Monareng, CEO of PanSALB gave an address and co-launched the human language technologies; UKZN’s VC Andre van Jaarsveld provided the official welcome; and two of UKZN’s DVCs, Prof Renuka Vithal and Prof Cheryl Potgieter, gave presentations. Besides our ‘5-minutes of fame’ with the isiZulu spellchecker, the event also launched the isiZulu National Corpus, the isiZulu Term Bank, the ZuluLex mobile-compatible application (Android and iPhone), and two isiZulu books on collected short stories and an English-isiZulu architecture glossary.



[1] Khumalo, L. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[2] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

Relations with roles / verbalising object properties in isiZulu

The narratives can be very different for the paper “A model for verbalising relations with roles in multiple languages” that was recently accepted at the 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW’16), for the paper makes a nice smoothie of the three ingredients of language, logic, and ontology. The natural language part zooms in on isiZulu as use case (possibly losing some ontologist or logician readers), then there are the logics about mapping the Description Logic DLR’s role components with OWL (lose possible interest of the natural language researchers), and a bit of philosophy (and lose most people…). It solves some thorny issues when trying to verbalise complicated verbs that we need for knowledge-to-text natural language generation in isiZulu and some other languages (e.g., German). And it solves the matching of logic-based representations popularised in mainly UML and ORM (that typically uses a logic in the DLR family of Description Logic languages) with the more commonly used OWL. The latter is even implemented as a Protégé plugin.

Let me start with some use-cases that cause problems that need to be solved. It is well-known that natural language renderings of ontologies facilitate communication with domain experts who are expected to model and validate the represented knowledge. This is doable for English, with ACE in the lead, but it isn’t for grammatically richer languages. There, there are complications, such as conjugation of verbs, an article that may be dependent on the preposition, or a preposition that may modify the noun. For instance, works for, made by, located in, and is part of are quite common names for object properties in ontologies. They all have a dependent preposition; however, there are different verb tenses, and the last one has a copulative and noun rather than just a verb. All that goes into the object property’s name in an ‘English-based ontology’ and does not really have to be processed further in ontology verbalisation other than beautification. Not so in multiple other languages. For instance, the ‘in’ of located in ends up as affixes to the noun representing the object that the other object is located in. Like, imvilophu ‘envelope’ and emvilophini ‘in the envelope’ (locative underlined). Even something straightforward like a property eats can end up having to be conjugated differently depending on who’s eating: when a human eats, it is udla in isiZulu, but for, say, a dog, it is idla (modification underlined), which is driven by the system of noun classes, of which there are 17 in isiZulu. Many more examples illustrating different issues are described in the paper. To make a long story short, there are gradations in complicating effects, from no effect where a preposition can be squeezed in with the verb in naming an OP, to phonological conditioning, to modifying the article of the noun to modifying the noun. A ‘3rd pers. sg.’ may thus be context-dependent, and notions of prepositions may modify the verb or the noun or the article of the noun, or both.
For a setting other than English ontologies (e.g., Greek, German, Lithuanian), a preposition may belong neither to the verb nor to the noun, but instead to the role that the object plays in the relation described by the verb in the sentence. For instance, one obtains yomuntu, rather than the basic noun umuntu, if it plays the role of the whole in a part-whole relation like in ‘heart is part of a human’ (inhliziyo iyingxenye yomuntu).

The question then becomes how to handle such a representation that also has to include roles? This is quite common in conceptual data modelling languages and in the DLR family of DL languages, which is known in ontology as positionalism [2]. Bumping up the role to an element in the representation language—thus, in addition to the relationship—enables one to attach information to it, like whether there is a (deep) preposition associated with it, the tense, or the case. Such role-based annotations can then be used to generate the right element, like einen Betrieb ‘some company’ to adjust the article for the case it goes with in German, or ya+umuntu=yomuntu ‘of a human’, modifying the noun in the object position in the sentence.

To get this working properly, with a solid theoretical foundation, we reused a part of the conceptual modelling languages’ metamodel [3] to create a language model for such annotations, in particular regarding the attributes of the classes in the metamodel. On its own, however, it is rather isolated and not immediately useful for ontologies that we set out to be in need of verbalising. To this end, it links to the ‘OWL way of representing relations’ (ontologically: the so-called standard view), and we separate out the logic-based representation from the readings that one can generate with the structured representation of the knowledge. All in all, the simplified high-level model looks like the picture below.

Simplified diagram in UML Class Diagram notation of the main components (see paper for attributes), linking a section of the metamodel (orange; positionalist commitment) to predicates (green; standard view) and their verbalisation (yellow). (Source: [1])


That much for the conceptual part; more details are described in the paper.

Just a fluffy colourful diagram isn’t enough for a solid implementation, however. To this end, we mapped one of the logics that adhere to positionalism to one of the standard view, being DLR [4] and OWL, respectively. It equally well could have been done for other pairs of languages (e.g., with Common Logic), but these two are more popular in terms of theory and tools.

Having the conceptual and logical foundations in place, we did implement it to see whether it actually can be done and to check whether the theory was sufficient. The Protégé plugin is called iMPALA—it could be an abbreviation for ‘Model for Positionalism And Language Annotation’—that both writes all the non-OWL annotations in a separate XML file and takes care of the renderings in Protégé. It works; yay. Specifically, it handles the interaction between the OWL file, the positionalist elements, and the annotations/attributes, plus the additional feature that one can add new linguistic annotation properties, so as to cater for extensibility. Here are a few screenshots:

OWL’s arbeitetFuer ‘works for’ is linked to the relationship arbeiten.


The prey role in the axiom of the impala being eaten by the ibhubesi.


 Annotations of the prey role itself, which is a role in the relationship ukudla.


We did test it a bit, from just the regular feature testing to the African Wildlife ontology that was translated into isiZulu (spoken in South Africa) and a people and pets ontology in ciShona (spoken in Zimbabwe). These details are available in the online supplementary material.

The next step is to tie it all together, being the verbalisation patterns for isiZulu [5,6] and the OWL ontologies to generate full sentences, correctly. This is set to happen soon (provided all the protests don’t mess up the planning too much). If you want to know more details that are not, or not clearly, in the paper, then please have a look at the project page of A Grammar engine for Nguni natural language interfaces (GeNi), or come visit EKAW16 that will be held from 21-23 November in Bologna, Italy, where I will present the paper.



[1] Keet, C.M., Chirema, T. A model for verbalising relations with roles in multiple languages. 20th International Conference on Knowledge Engineering and Knowledge Management (EKAW’16). Springer LNAI, 19-23 November 2016, Bologna, Italy. (in print)

[2] Leo, J. Modeling relations. Journal of Philosophical Logic, 2008, 37:353-385.

[3] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[4] Calvanese, D., De Giacomo, G. The Description Logics Handbook: Theory, Implementation and Applications, chap. Expressive description logics, pp. 178-218. Cambridge University Press (2003).

[5] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016, in print.

[6] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. Proceedings of the 9th International Natural Language Generation conference 2016 (INLG’16), Edinburgh, Scotland, Sept 2016. ACL, 174-183.

Surprising similarities and differences in orthography across several African languages

It is well-known that natural language interfaces and tools in one’s own language are useful in ICT-mediated communication. For instance, tools like spellcheckers and Web search engines, machine translation, or even just straightforward natural language processing to at least ‘understand’ documents and find the right one with a keyword search. Most languages in Southern Africa, and those in the (linguistically called) Bantu language family, are still under-resourced, however, so this is not a trivial task due to the limited data and the limited researched and documented grammar. Any possibility to ‘bootstrap’ theory, techniques, and tools developed for one language and to fiddle just a bit to make it work for a similar one will save many resources compared to starting from scratch time and again. Likewise, it would be very useful if both the generic and the few language-specific NLP tools for the well-resourced languages could be reused or easily adapted across languages. The question is: does that work? We know very little about whether it does. Taking one step back, then: for that bootstrapping to work well, we need to have insight into how similar the languages are. And we may be able to find that out if only we knew how to measure similarity of languages.

The best-known qualitative way of determining some notion of similarity started with Meinhof’s noun class system [1] and the Guthrie zones. That’s interesting, but not nearly enough for computational tools. An experiment has been done for morphological analysers [2], with promising results, yet it also had more of a qualitative flavour to it.

I’m adding here another proverbial “2 cents” to it, by taking a mostly quantitative approach to it, and focusing on orthography (how things are written down) in text documents and corpora. This was a two-step process. First, 12 versions of the Universal Declaration of Human Rights were examined on tokens and their word length; second, because the UDHR is a quite small document, isiZulu corpora were examined to see whether the UDHR was a representative sample, i.e., whether extrapolation from its results may be justified. The methods, results, and discussion are described in “An assessment of orthographic similarity measures for several African languages” [3].

The really cool thing about the language comparison is that it shows clusters of languages, indicating where bootstrapping may have more or less success, and they do not quite match with Guthrie zones. The cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa are shown in the figure below, where the names of the languages are those of the file names of the NLTK data kit that contains the quality translations of the UDHR.

Cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa (Source: [3]).
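
For those who want to try this at home: a cumulative frequency distribution of word length can be computed along the following lines. This is a minimal sketch in plain Python, not the script used for the paper; the function name and the crude whitespace tokenisation are my assumptions here, and the sample is just the opening of Article 1 of the English UDHR without punctuation.

```python
from collections import Counter

def cumulative_word_length_distribution(text):
    """Map each word length to the fraction of tokens with that length or shorter."""
    # Crude tokenisation: split on whitespace, keep tokens containing a letter.
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]
    # Strip surrounding punctuation before measuring the length.
    lengths = Counter(len(t.strip(".,;:!?()\"'")) for t in tokens)
    total = sum(lengths.values())
    cumulative, running = {}, 0
    for length in sorted(lengths):
        running += lengths[length]
        cumulative[length] = running / total
    return cumulative

english = "all human beings are born free and equal in dignity and rights"
cfd = cumulative_word_length_distribution(english)
for length, fraction in cfd.items():
    print(length, round(fraction, 2))
```

Plotting such a dictionary for each language in one chart yields curves of the kind shown in the figure: the sooner a curve climbs, the shorter the words of that text tend to be.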


The paper contains some statistical tests, showing that the languages in the bottom cluster are not statistically significantly different from each other, but they are from those in the ‘middle’ cluster. So, the word length distribution of Kiswahili is substantially different from that of, among others, isiZulu, in that it has more shorter words and isiZulu more longer words, but Kiswahili’s pattern is similar to that of Afrikaans and English. This is important for NLP, for isiZulu is known to be highly agglutinating, but English (and thus also Kiswahili) is disjunctive. How important is such a difference? The simple answer is that grammatical elements of a sentence get ‘glued’ together in isiZulu, whereas at least some of them are written as separate words in Kiswahili. This is not to be conflated with, say, German, Dutch, and Afrikaans, where nouns can be concatenated to form new words; rather, e.g., a preposition is glued onto a noun. For instance, ‘of clay’ is ngobumba, contracting nga+ubumba with a vowel coalescence rule (-a + u- = -o-), which thus happens much less often in a language with disjunctive orthography. This, in turn, affects the algorithms needed to computationally process the languages, hence, the prospects for bootstrapping.
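
To make that ‘gluing’ concrete, here is a small sketch in Python of the vowel coalescence step. Only the -a + u- = -o- rule is taken from the example above; the other two entries in the table (a + a = a, a + i = e) are my addition from standard descriptions of isiZulu, and the function name is hypothetical. It is a toy, not a morphological analyser.

```python
# a + u -> o is the rule from the text; a + a -> a and a + i -> e are added
# here as an assumption, based on standard descriptions of isiZulu.
COALESCENCE = {("a", "a"): "a", ("a", "i"): "e", ("a", "u"): "o"}

def coalesce(prefix, noun):
    """Attach a prefix ending in a vowel to a noun starting with a vowel."""
    merged = COALESCENCE.get((prefix[-1], noun[0]))
    if merged is None:
        return prefix + noun  # no rule known in this sketch: plain concatenation
    return prefix[:-1] + merged + noun[1:]

print(coalesce("nga", "ubumba"))  # ngobumba
print(coalesce("ya", "umuntu"))   # yomuntu
```

Going in the other direction, i.e., splitting ngobumba back into nga and ubumba, is the harder problem, and it is one that a tool for a disjunctive orthography mostly does not have to solve.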

Note that the middle cluster looks deceptively isolating, but it isn’t. Sesotho and Setswana are statistically significantly different from the others, in that they are even more disjunctive than English. Sepedi (top-most line) even more so. While I don’t know that language, a hypothetical example suffices to illustrate this notion. There is conjugation of verbs, like ‘works’ or trabajas or usebenza (inflection underlined), but some orthographer a while ago could have decided to write that separate from the verb stem (e.g., trabaj as and u sebenza instead), hence, generating more tokens with fewer characters.

There are other aspects of language and orthography one can ‘play’ with to analyse quantitatively, like whether words mainly end in a vowel or not, and which vowel mostly, and whether two successive vowels are acceptable for a language (for some, it isn’t). This is further described in the paper [3].

Yet, the UDHR is just one document. To examine the generalisability of these observations, we need to know whether the UDHR text is a ‘typical’ one. This was assessed in more detail by zooming in on isiZulu both quantitatively and qualitatively with four other corpora and texts in different genres. The results show that the UDHR is a typical text document orthographically, at least for the cumulative frequency distribution of the word length.

There were some other differences across the other corpora, which have to do with genre and datedness, which was observed elsewhere for whole words [4]. For instance, news items of isiZulu newspapers nowadays include words like iFacebook and EFF, which surely don’t occur in a century-old bible translation. They do violate the ‘no two successive vowels’ rule and the ‘final vowel’ rule, though.

On the qualitative side of the matter, and which will have an effect on searching for information in texts, text summarization, and error correction of spellcheckers, is, again, that agglutination. For instance, searching on imali ‘money’ alone would be woefully inadequate to find all relevant texts; e.g., those news items also include kwemali, yimali, onemali, osozimali, kwezimali, and ngezimali, which are, respectively of -, and -, that/which/who has -, of – (pl.), about/by/with/per – (pl.) money. Searching on the stem or root only is not going to help you much either, however. Take, for instance -fund-, of which the results of just two days of Isolezwe news articles is shown in the table below (articles from 2015, when there were protests, too). Depending on what comes before fund and what comes after it, it can have a different meaning, such as abafundi ‘students’ and azifundi ‘they do not learn’.
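
The search problem can be illustrated with a few lines of Python. This is a toy sketch only: the word list is the handful of forms mentioned above, and the two helper functions are hypothetical names for exact-match and substring search.

```python
words = ["imali", "kwemali", "yimali", "onemali", "osozimali",
         "kwezimali", "ngezimali", "abafundi", "azifundi"]

def exact_matches(query, tokens):
    """Return the tokens that are identical to the query."""
    return [t for t in tokens if t == query]

def substring_matches(fragment, tokens):
    """Return the tokens that contain the fragment anywhere."""
    return [t for t in tokens if fragment in t]

print(exact_matches("imali", words))     # finds only the bare noun
print(substring_matches("mali", words))  # finds all seven 'money' forms
print(substring_matches("fund", words))  # lumps 'students' with 'they do not learn'
```

So exact matching misses most of the relevant forms, whereas matching on a fragment retrieves forms with different meanings; neither is adequate without some morphological processing.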


Placing this in the broader NLP scope, it also affects the widely-used notion of lexical diversity, which, in its basic form, is a type-to-token ratio. Lexical diversity is used as a proxy measure for ‘difficulty’ or level of a text (the higher the more difficult), language development in humans as they grow up, second-language learning, and related topics. Letting that loose on isiZulu text, it will count abafundi, bafundi, and nabafundi as three different types, so wheehee, high lexical diversity, yet in English, it amounts to ‘students’, ‘students’ and ‘and the students’. Put differently, somehow we have to come up with a more meaningful notion of lexical diversity for agglutinating languages. A first attempt is made in the paper in its section 4 [3].
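
In code, the basic measure and its inflated outcome for the example look as follows. It is a minimal sketch of the plain type-to-token ratio only (the function name is mine), not of the adapted measure proposed in the paper.

```python
def type_token_ratio(tokens):
    """Basic lexical diversity: number of distinct word forms divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

zulu = ["abafundi", "bafundi", "nabafundi"]
english = "students students and the students".split()

print(type_token_ratio(zulu))     # 1.0: every token is a different type
print(type_token_ratio(english))  # 0.6: 3 types over 5 tokens
```

The same content thus scores as maximally diverse in isiZulu and substantially lower in English, purely because of how the words are written.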

Thus, the last word has not been said yet about orthographic similarity, yet we now do have more insight into it. The surprising similarity of isiZulu (South Africa) with Runyankore (Uganda) was exploited in another research activity, and shown to be very amenable to bootstrapping [5], so, in its own way providing supporting evidence for bootstrapping potential that the figure above also indicated as promising.

As a final comment on the tooling side of things, I did use NLTK (Python). It worked well for basic analyses of text, but it (and similar NLP tools) will need considerable customization for the agglutinating languages.



[1] C. Meinhof. 1932. Introduction to the phonology of the Bantu languages. Dietrich Reiner/Ernst Vohsen, Johannesburg. Translated, revised and enlarged in collaboration with the author and Dr. Alice Werner by N.J. Van Warmelo.

[2] L. Pretorius and S. Bosch. Exploiting cross-linguistic similarities in Zulu and Xhosa computational morphology: Facing the challenge of a disjunctive orthography. In Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 96–103, 2009.

[3] C.M. Keet. An assessment of orthographic similarity measures for several African languages. Technical report, arxiv 1608.03065. August 2016.

[4] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[5] J. Byamugisha, C. M. Keet, and B. DeRenzi. Bootstrapping a Runyankore CNL from an isiZulu CNL. In B. Davis et al., editors, 5th Workshop on Controlled Natural Language (CNL’16), volume 9767 of LNAI, pages 25–36. Springer, 2016. 25-27 July 2016, Aberdeen, UK.