# Aligning different relations: the case of part-whole relations—LDK2017

Despite the best intentions, I did not get around to writing a post on the paper that I presented last week at the First International Conference on Language, Data and Knowledge 2017, 19-20 June, Galway, Ireland, and now Paul Groth also ‘beat’ me to it writing a nice conference report of it. On the bright side, it is an opportunity to say upfront I really enjoyed the conference and look forward to the next edition in 2019. The ESWC’17 organisers might be slightly disappointed that there was no special track on the multilingual semantic web after all, but I did get the distinct impression that the LDK17 authors might just all have gambled on LDK17—an opportunity to binge two days on all things natural language & Semantic Web—rather than on one track at an overpriced conference (despite the allure of it being A-rated).

So, what was my paper about that could have been submitted to either? I ended up struggling—and solving—an issue with aligning OWL object properties that were not simple 1:1 mappings, in a similar scope as our ESWC17 paper (introduced here) [4], but then with too many complications. Complications were due to the different conceptualisations of part-whole relations and that one of the requirements was to solve what to do with an object property (relation, relationship) that does not have a neat, single, label, and therewith neither fitting with the common OWL modelling paradigm nor with the recently agreed-upon ontolex-lemon model for linguistic annotations.

The start of all this sounded nice and doable: we need to generate natural language for healthcare, using, e.g., SNOMED CT, in local languages in South Africa, focussing on the largest one, being isiZulu. Medical terminologies are riddled with part-whole relations, so we sought to address that one (simple existentials already having been solved), availing of a standard list of part-whole relations (e.g. [1]). That turned out to be a non-trivial exercise, but doable eventually [2]. What wasn’t addressed in [2] was that some ‘common’ part-whole relations, such as membership and containment, weren’t like that in isiZulu, at all. Moreover, it wasn’t just a language issue, but ontological as well. The LDK17 paper “Representing and aligning similar relations: parts and wholes in isiZulu vs English” [3] describes this in some detail.

Here’s a (simplified) list of (assumed to be) common part-whole relations, which takes into account both transitivity differences and domain and range:

Now here’s the one based on the isiZulu language and some ontological analysis of that:

That is: there are both generalisations—some distinctions are not being made—and specialisations—some distinctions are made here but not elsewhere. For instance, ‘musician is part of some orchestra’ and ‘heart is part of some human’ (or vv.) is all done and described in the same way (ingxenye ‘part of’ and SC+CONJ for ‘has part’ [more about that below]). Yet, there is a difference between an individual (e.g., a voter) participating in some process and a collective (e.g., the electorate) participating in a process, or vv. The paper describes this more precisely, going into some detail regarding the differences in categories of domain and range and into the consequences on transitivity of mereological parthood.

The other ‘odd thing’—cf. current multilingual Semantic Web assumptions and technologies, that is—is that while the conceptualisation of ‘has part’ exists, it does not have a single label as in English (or in several other languages, such as heeft as deel), but it is dependent on the noun class of the noun of the class that play the part and play the whole in the relation. It combines the subject concord (~conjugation) of the noun class of the noun that plays the whole with a conjunction that is phonologically conditioned based on the first letter of the noun that plays the part; with verbalisation in the plural and three phonological cases, there are 18 possible strings all denoting ‘has part’. This still could be sorted with a language with inverses, provided the part-of direction has a name, like the ingxenye. This is not the case for containment, however. Instead of the relation (object property) having a name—be this a verb like ‘contained in’ or some noun phrase—it is the noun that plays the whole (the container, if you will) that gets modified. For instance, imvilophu ‘envelope’ and emvilophini denoting ‘contained in the envelope’, or, for individuals and locations, the city iTheku ‘Durban’ and eThekwini meaning ‘located in Durban’ (no typo—there’s some phonological conditioning I’m brushing over). While I have gotten used to such constructions, it generated some surprise among several attendees that one can have notions, concepts, views on or interpretations or descriptions of reality, that exist but do not have even one single string of text throughout to refer to regardless the context it is used.

The naming issue was solved by adding some arbitrary string as ‘name’ of the object property, and relating that to the function that verbalises that specific part-whole relation. The former issue, i.e., not all the same part-whole relations, required a bit more work, using ontology pattern alignments, by extending one correspondence pattern from the ODP catalogue and introducing a new one (see paper for the formal details), using the same broad framework of formalisation as proposed in [4].

All this was then implemented and aligned, and verified to not result in some unsatisfiable classes, object properties, or inconsistency (files). This also works in the isiZulu verbalisation tool we demo-ed at ESWC17 (described in the previous post) [5], all as part of the NRF-funded GeNI project.

Now, ideally, I already would have had the time to read the papers I flagged in my LDK17 conference notes with “check paper”. I haven’t yet due to end-of-semester tasks. So, on the basis of just a positive-seeming presentation, here are a few that are on the top of my list to check out first, for quite different reasons:

• Interaction between natural language reading capabilities and math education, focusing on language production (i.e., ‘can you talk about it?’) [6], mainly because math education in South Africa faces a lot of problems. It also generated a lively discussion in the Q&A session.
• The OnLiT ontology for linguistic [7] and LLODifying linguistic glosses [8] terminology (also: one of the two also won the best paper award).
• Deep text generation, for it was looking at trying to address skewed or limited data to learn from [9], which is an issue we face when trying to do some NLP with most South African languages.

References

[1] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[2] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), September 5-8, 2016, Edinburgh, UK. ACL.

[3] Keet, C.M. Representing and aligning similar relations: parts and wholes in isiZulu vs English. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 58-73.

[4] Fillottrani, P.R., Keet, C.M. Patterns for Heterogeneous TBox Mappings to Bridge Different Modelling Decisions. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017.

[5] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[6] Crossley, S., Kostyuk, V. Letting the genie out of the lamp: using natural language processing tools to predict math performance. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 330-342.

[7] Klimek, B., McCrae, J.P., Lehmann, C., Chiarcos, C., Hellmann, S. OnLiT: and ontology for linguistic terminology. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 42-57.

[8] Chiarcos, C., Ionov, M. Rind-Pawlowski, M., Fäth, C., Wichers Schreur, J., Nevskaya. I. LLODifying linguistic glosses. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 89-103.

[9] Dethlefs N., Turner A. Deep Text Generation — Using Hierarchical Decomposition to Mitigate the Effect of Rare Data Points. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 290-298.

# Our ESWC17 demos: TDDonto2 and an OWL verbaliser for isiZulu

Besides the full paper on heterogeneous alignments for 14th Extended Semantic Web Conference (ESWC’17) that will take place next week in Portoroz, Slovenia, we also managed to squeeze out two demo papers. You may already know of TDDonto2 with Kieren Davies and Agnieszka Lawrynowicz, which was discussed in an earlier post that has been updated with a tutorial video. It now has a demo paper as well [1], which describes the rationale and a few scenarios. The other demo, with Musa Xakaza and Langa Khumalo, is new-new, but the regular reader might have seen it coming: we finally managed to link the verbalisation patterns for certain Description Logic axiom types [2,3] to those in OWL ontologies. The tool takes as input an ontology in isiZulu and the verbalisation algorithms, and out come the isiZulu sentences, be this in plain text for further processing or in a GUI for inspection by a domain expert [4]. There is a basic demo-screencast to show it’s all working.

The overall architecture may be of interest, for it deviates from most OWL verbalisers. It is shown in the following figure:

For instance, we use the Python-based OWL API Owlready, rather than a Java-based app, for Python is rather popular in NLP and the verbalisation algorithms may be used elsewhere as well. We made more such decisions with the aim to make whatever we did as multi-purpose usable as possible, like the list of nouns with noun classes (surprisingly, and annoyingly, there is no such readily available list yet, though isizulu.net probably will have it somewhere but inaccessible), verb roots, and exceptions in pluralisation. (Problems for integrating the verbaliser with, say, Protégé will be interesting to discuss during the demo session!)

The text-based output doesn’t look as nice as the GUI interface, so I will show here only the GUI interface, which is adorned with some annotations to illustrate that those verbalisation algorithms in the background are far from trivial templates:

For instance, while in English the universal quantification is always ‘Each’ or ‘All’ regardless the named class quantified over, in isiZulu it depends on the noun class of the noun that is the name of the OWL class. For instance, in the figure above, izingwe ‘leopards’ is in noun class 10, so the ‘Each/All’ is Zonke, amavazi ‘vases’ is in noun class 6, so ‘Each/All’ then becomes Onke, and abantu ‘people’/’humans’ is in noun class 2, making Bonke. There are 17 noun classes. They also determine the subject concords (SC, alike conjugation) for the verbs, with zi- for noun class 10, ­a- for noun class 6, and ba- for noun class 2, to name a few. How this all works is described in [2,3]. We’ve implemented all those algorithms and integrated the pluraliser [5] in it to make it work. The source files are available to check and play with already, you can do so and ask us during the ESWC17 demo session, and/or also have a look at the related outputs of the NRF-funded project Grammar Engine for Nguni natural language interfaces (GeNi).

References

[1] Davies, K. Keet, C.M., Lawrynowicz, A. TDDonto2: A Test-Driven Development Plugin for arbitrary TBox and ABox axioms. Extended Semantic Web Conference (ESWC’17), Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157.

[3] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), 5-8 September, 2016, Edinburgh, UK. Association for Computational Linguistics, 174-183.

[4] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[5] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey.

# Launch of the isiZulu spellchecker

Langa Khumalo, ULPDO director, giving the spellchecker demo, pointing out a detected spelling error in the text. On his left, Mpho Monareng, CEO of PanSALB.

Yesterday, the isiZulu spellchecker was launched at UKZN’s “Launch of the UKZN isiZulu Books and Human Language Technologies” event, which was also featured on 702 live radio, SABC 2 Morning Live, and e-news during the day. What we at UCT have to do with it is that both the theory and the spellchecker tool were developed in-house by members of the Department of Computer Science at UCT. The connection with UKZN’s University Language Planning & Development Office is that we used a section of their isiZulu National Corpus (INC) [1] to train the spellchecker with, and that they wanted a spellchecker (the latter came first).

The theory behind the spellchecker was described briefly in an earlier post and it has been presented at IST-Africa 2016 [2]. Basically, we don’t use a wordlist + rules-based approach as some experiments of 20 years ago did, nor a wordlist + a few rules of the now-defunct translate.org.za OpenOffice v3 plugin seven years ago, but a data-driven approach with a statistical language model that uses tri-grams. The section of the INC we used were novels and news items, so, including present-day isiZulu texts. At the time of the IST-Africa’16 paper, based on Balone Ndaba’s BSc CS honours project, the spell checking was very proof-of-concept, but it showed that it could be done and still achieve a good enough accuracy. We used that approach to create an enduser-usable isiZulu spellchecker, which saw the light of day thanks to our 3rd-year CS@UCT student Norman Pilusa, who both developed the front-end and optimised the backend so that it has an excellent performance.

Upon starting the platform-independent isiZulu_spellchecker.jar file, the English interface version looks like this:

You can write text in the text box, or open a txt or docx file, which then is displayed in the textbox. Click “Run”. Now there are two options: you can choose to step-through the words that are detected as misspelled one at a time or “Show All” words that are detected as misspelled. Both are shown for some sample text in the screenshot below.

processing one error at a time

highlighting all words detected as very probably misspelled

Then it is up to you to choose what to do with it: correct it in the textbox, “Ignore once”, “Ignore all”, or “Add” the word to your (local) dictionary. If you have modified the text, you can save it with the changes made by clicking “Save correction”. You also can switch the interface from the default English to isiZulu by clicking “File – Use English”, and back to English via “iFayela – ulimi lesingisi”. You can download the isiZulu spellchecker from the ULPDO website and from the GitHub repository for those who want to get their hands on the source code.

To anticipate some possible questions you may have: incorporating it as a plugin to Microsoft word, OpenOffice/LibreOffice, and Mozilla Firefox was in the planning. The former is technologically ‘closed source’, however, and the latter two have a certain way of doing spellchecking that is not amenable to the data-driven approach with the trigrams. So, for now, it is a standalone tool. By design, it is desktop-based rather than for mobile phones, because according to the client (ULPDO@UKZN), they expect the first users to be professionals with admin documents and emails, journalists writing articles, and such, writing on PCs and laptops.

There was also a trade-off between a particular sort of error: the tool now flags more words as probably incorrect than it could have, yet it will detect (a subset of) capitalization, correctly, such as KwaZulu-Natal whilst flagging some of the deviant spellings that go around, as shown in the screenshot below.

The customer preferred recognising such capitalisation.

Error correction sounds like an obvious feature as well, but that will require a bit more work, not just technologically, but also the underlying theory. It will probably be an honours project topic for next year.

In the grand scheme of things, the current v1 of the spellchecker is only a small step—yet, many such small steps in succession will get one far eventually.

The launch itself saw an impressive line-up of speeches and introductions: the keynote address was given by Dr Zweli Mkhize, UKZN Chancellor and member of the ANC NEC; Prof Ramesh Krishnamurthy, from Aston University UK, gave the opening address; Mpho Monareng, CEO of PanSALB gave an address and co-launched the human language technologies; UKZN’s VC Andre van Jaarsveld provided the official welcome; and two of UKZN’s DVCs, Prof Renuka Vithal and Prof Cheryl Potgieter, gave presentations. Besides our ‘5-minutes of fame’ with the isiZulu spellchecker, the event also launched the isiZulu National Corpus, the isiZulu Term Bank, the ZuluLex mobile-compatible application (Android and iPhone), and two isiZulu books on collected short stories and an English-isiZulu architecture glossary.

References

[1] Khumalo, L. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[2] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

# Relations with roles / verbalising object properties in isiZulu

The narratives can be very different for the paper “A model for verbalising relations with roles in multiple languages” that was recently accepted paper at the 20th International Conference on Knowledge Engineering and Knowledge management (EKAW’16), for the paper makes a nice smoothie of the three ingredients of language, logic, and ontology. The natural language part zooms in on isiZulu as use case (possibly losing some ontologist or logician readers), then there are the logics about mapping the Description Logic DLR’s role components with OWL (lose possible interest of the natural language researchers), and a bit of philosophy (and lose most people…). It solves some thorny issues when trying to verbalise complicated verbs that we need for knowledge-to-text natural language generation in isiZulu and some other languages (e.g., German). And it solves the matching of logic-based representations popularised in mainly UML and ORM (that typically uses a logic in the DLR family of Description Logic languages) with the more commonly used OWL. The latter is even implemented as a Protégé plugin.

Let me start with some use-cases that cause problems that need to be solved. It is well-known that natural language renderings of ontologies facilitate communication with domain experts who are expected to model and validate the represented knowledge. This is doable for English, with ACE in the lead, but it isn’t for grammatically richer languages. There, there are complications, such as conjugation of verbs, an article that may be dependent on the preposition, or a preposition may modify the noun. For instance, works for, made by, located in, and is part of are quite common names for object properties in ontologies. They all do have a dependent preposition, however, there are different verb tenses, and the latter has a copulative and noun rather than just a verb. All that goes into the object properties name in an ‘English-based ontology’ and does not really have to be processed further in ontology verbalisation other than beautification. Not so in multiple other languages. For instance, the ‘in’ of located in ends up as affixes to the noun representing the object that the other object is located in. Like, imvilophu ‘envelope’ and emvilophini ‘in the envelope’ (locative underlined). Even something straightforward like a property eats can end up having to be conjugated differently depending on who’s eating: when a human eats, it is udla in isiZulu, but for, say, a dog, it is idla (modification underlined), which is driven by the system of noun classes, of which there are 17 in isiZulu. Many more examples illustrating different issues are described in the paper. To make a long story short, there are gradations in complicating effects, from no effect where a preposition can be squeezed in with the verb in naming an OP, to phonological conditioning, to modifying the article of the noun to modifying the noun. A ‘3rd pers. sg.’ may thus be context-dependent, and notions of prepositions may modify the verb or the noun or the article of the noun, or both. For a setting other than English ontologies (e.g., Greek, German, Lithuanian), a preposition may belong neither to the verb nor to the noun, but instead to the role that the object plays in the relation described by the verb in the sentence. For instance, one obtains yomuntu, rather than the basic noun umuntu, if it plays the role of the whole in a part-whole relation like in ‘heart is part of a human’ (inhliziyo iyingxenye yomuntu).

The question then becomes how to handle such a representation that also has to include roles? This is quite common in conceptual data modelling languages and in the DLR family of DL languages, which is known in ontology as positionalism [2]. Bumping up the role to an element in the representation language—thus, in addition to the relationship—enables one to attach information to it, like whether there is a (deep) preposition associated with it, the tense, or the case. Such role-based annotations can then be used to generate the right element, like einen Betrieb ‘some company’ to adjust the article for the case it goes with in German, or ya+umuntu=yomuntu ‘of a human’, modifying the noun in the object position in the sentence.

To get this working properly, with a solid theoretical foundation, we reused a part of the conceptual modelling languages’ metamodel [3] to create a language model for such annotations, in particular regarding the attributes of the classes in the metamodel. On its own, however, it is rather isolated and not immediately useful for ontologies that we set out to be in need of verbalising. To this end, it links to the ‘OWL way of representing relations’ (ontologically: the so-called standard view), and we separate out the logic-based representation from the readings that one can generate with the structured representation of the knowledge. All in all, the simplified high-level model looks like the picture below.

Simplified diagram in UML Class Diagram notation of the main components (see paper for attributes), linking a section of the metamodel (orange; positionalist commitment) to predicates (green; standard view) and their verbalisation (yellow). (Source: [1])

That much for the conceptual part; more details are described in the paper.

Just a fluffy colourful diagram isn’t enough for a solid implementation, however. To this end, we mapped one of the logics that adhere to positionalism to one of the standard view, being DLR [4] and OWL, respectively. It equally well could have been done for other pairs of languages (e.g., with Common Logic), but these two are more popular in terms of theory and tools.

Having the conceptual and logical foundations in place, we did implement it to see whether it actually can be done and to check whether the theory was sufficient. The Protégé plugin is called iMPALA—it could be an abbreviation for ‘Model for Positionalism And Language Annotation’—that both writes all the non-OWL annotations in a separate XML file and takes care of the renderings in Protégé. It works; yay. Specifically, it handles the interaction between the OWL file, the positionalist elements, and the annotations/attributes, plus the additional feature that one can add new linguistic annotation properties, so as to cater for extensibility. Here are a few screenshots:

OWL’s arbeitetFuer ‘works for’ is linked to the relationship arbeiten.

The prey role in the axiom of the impala being eaten by the ibhubesi.

Annotations of the prey role itself, which is a role in the relationship ukudla.

We did test it a bit, from just the regular feature testing to the African Wildlife ontology that was translated into isiZulu (spoken in South Africa) and a people and pets ontology in ciShona (spoken in Zimbabwe). These details are available in the online supplementary material.

The next step is to tie it all together, being the verbalisation patterns for isiZulu [5,6] and the OWL ontologies to generate full sentences, correctly. This is set to happen soon (provided all the protests don’t mess up the planning too much). If you want to know more details that are not, or not clearly, in the paper, then please have a look at the project page of A Grammar engine for Nguni natural language interfaces (GeNi), or come visit EKAW16 that will be held from 21-23 November in Bologna, Italy, where I will present the paper.

References

[1] Keet, C.M., Chirema, T. A model for verbalising relations with roles in multiple languages. 20th International Conference on Knowledge Engineering and Knowledge Management EKAW’16). Springer LNAI, 19-23 November 2016, Bologna, Italy. (in print)

[2] Leo, J. Modeling relations. Journal of Philosophical Logic, 2008, 37:353-385.

[3] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[4] Calvanese, D., De Giacomo, G. The Description Logics Handbook: Theory, Implementation and Applications, chap. Expressive description logics, pp. 178-218. Cambridge University Press (2003).

[5] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016, in print.

[6] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. Proceedings of the 9th International Natural Language Generation conference 2016 (INLG’16), Edinburgh, Scotland, Sept 2016. ACL, 174-183.

# Surprising similarities and differences in orthography across several African languages

It is well-known that natural language interfaces and tools in one’s own language are known to be useful in ICT-mediated communication. For instance, tools like spellcheckers and Web search engines, machine translation, or even just straight-forward natural language processing to at least ‘understand’ documents and find the right one with a keyword search. Most languages in Southern Africa, and those in the (linguistically called) Bantu language family, are still under-resourced, however, so this is not a trivial task due to the limited data and researched and documented grammar. Any possibility to ‘bootstrap’ theory, techniques, and tools developed for one language and to fiddle just a bit to make it work for a similar one will save many resources compared to starting from scratch time and again. Likewise, it would be very useful if both the generic and the few language-specific NLP tools for the well-resourced languages could be reused or easily adapted across languages. The question is: does that work? We know very little about whether it does. Taking one step back, then: for that bootstrapping to work well, we need to have insight into how similar the languages are. And we may be able to find that out if only we knew how to measure similarity of languages.

The most well-know qualitative way for determining some notion of similarity started with Meinhof’s noun class system [1] and the Guthrie zones. That’s interesting, but not nearly enough for computational tools. An experiment has been done for morphological analysers [2], with promising results, yet it also had more of a qualitative flavour to it.

I’m adding here another proverbial “2 cents” to it, by taking a mostly quantitative approach to it, and focusing on orthography (how things are written down) in text documents and corpora. This was a two-step process. First, 12 versions of the Universal Declaration of Human Rights were examined on tokens and their word length; second, because the UDHR is a quite small document, isiZulu corpora were examined to see whether the UDHR was a representative sample, i.e., whether extrapolation from its results may be justified. The methods, results, and discussion are described in “An assessment of orthographic similarity measures for several African languages” [3].

The really cool thing of the language comparison is that it shows clusters of languages, indicating where bootstrapping may have more or less success, and they do not quite match with Guthrie zones. The cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa is shown in the figure below, where the names of the languages are those of the file names of the NLTK data kit that contains the quality translations of the UDHR.

Cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa (Source: [3]).

The paper contains some statistical tests, showing that the bottom cluster are not statistically significantly different form each other, but they are from the ‘middle’ cluster. So, the word length distribution of Kiswahili is substantially different from that of, among others, isiZulu, in that it has more shorter words and isiZulu more longer words, but Kiswahili’s pattern is similar to that of Afrikaans and English. This is important for NLP, for isiZulu is known to be highly agglutinating, but English (and thus also Kiswahili) is disjunctive. How important is such a difference? The simple answer is that grammatical elements of a sentences get ‘glued’ together in isiZulu, whereas at least some of them are written as separate words in Kiswahili. This is not to be conflated with, say, German, Dutch, and Afrikaans, where nouns can be concatenated to form new words, but, e.g., a preposition is glued onto a noun. For instance, ‘of clay’ is ngobumba, contracting nga+ubumba with a vowel coalescence rule (-a + u- = -o-), which thus happens much less often in a language with disjunctive orthography. This, in turn, affects the algorithms needed to computationally process the languages, hence, the prospects for bootstrapping.

Note that middle cluster looks deceptively isolating, but it isn’t. Sesotho and Setswana are statistically significantly different from the others, in that they are even more disjunctive than English. Sepedi (top-most line) even more so. While I don’t know that language, a hypothetical example suffice to illustrate this notion. There is conjugation of verbs, like ‘works’ or trabajas or usebenza (inflection underlined), but some orthographer a while ago could have decided to write that separate from the verb stem (e.g., trabaj as and u sebenza instead), hence, generating more tokens with fewer characters.

There are other aspects of language and orthography one can ‘play’ with to analyse quantitatively, like whether words mainly end in a vowel or not, and which vowel mostly, and whether two successive vowels are acceptable for a language (for some, it isn’t). This is further described in the paper [3].

Yet, the UDHR is just one document. To examine the generalisability of these observations, we need to know whether the UDHR text is a ‘typical’ one. This was assessed in more detail by zooming in on isiZulu both quantitatively and qualitatively with four other corpora and texts in different genres. The results show that the UHDR is a typical text document orthographically, at least for the cumulative frequency distribution of the word length.

There were some other differences across the other corpora, which have to do with genre and datedness, which was observed elsewhere for whole words [4]. For instance, news items of isiZulu newspapers nowadays include words like iFacebook and EFF, which surely don’t occur in a century-old bible translation. They do violate the ‘no two successive vowels’ rule and the ‘final vowel’ rule, though.

On the qualitative side of the matter, and which will have an effect on searching for information in texts, text summarization, and error correction of spellcheckers, is, again, that agglutination. For instance, searching on imali ‘money’ alone would be woefully inadequate to find all relevant texts; e.g., those news items also include kwemali, yimali, onemali, osozimali, kwezimali, and ngezimali, which are, respectively of -, and -, that/which/who has -, of – (pl.), about/by/with/per – (pl.) money. Searching on the stem or root only is not going to help you much either, however. Take, for instance -fund-, of which the results of just two days of Isolezwe news articles is shown in the table below (articles from 2015, when there were protests, too). Depending on what comes before fund and what comes after it, it can have a different meaning, such as abafundi ‘students’ and azifundi ‘they do not learn’.

Placing this is the broader NLP scope, it also affects the widely-used notion of lexical diversity, which, in its basic form, is a type-to-token ratio. Lexical diversity is used as a proxy measure for ‘difficulty’ or level of a text (the higher the more difficult), language development in humans as they grow up, second-language learning, and related topics. Letting that loose on isiZulu text, it will count abafundi, bafundi, and nabafundi as three different tokens, so wheehee, high lexical diversity, yet in English, it amounts to ‘students’, ‘students’ and ‘and the students’. Put differently, somehow we have to come up with a more meaningful notion of lexical diversity for agglutinating languages. A first attempt is made in the paper in its section 4 [3].

Thus, the last word has not been said yet about orthographic similarity, yet we now do have more insight into it. The surprising similarity of isiZulu (South Africa) with Runyankore (Uganda) was exploited in another research activity, and shown to be very amenable to bootstrapping [5], so, in its own way providing supporting evidence for bootstrapping potential that the figure above also indicated as promising.

As a final comment on the tooling side of things, I did use NLTK (Python). It worked well for basic analyses of text, but it (and similar NLP tools) will need considerable customization for the agglutinating languages.

References

[1] C. Meinhof. 1932. Introduction to the phonology of the Bantu languages . Dietrich Reiner/Ernst Vohsen, Johannesburg. Translated, revised and enlarged in collaboration with the author and Dr. Alice Werner by N.J. Van Warmelo.

[2] L. Pretorius and S. Bosch. Exploiting cross-linguistic similarities in Zulu and Xhosa computational morphology: Facing the challenge of a disjunctive orthography. In Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 96–103, 2009.

[3] C.M. Keet. An assessment of orthographic similarity measures for several African languages. Technical report, arxiv 1608.03065. August 2016.

[4] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[5] J. Byamugisha, C. M. Keet, and B. DeRenzi. Bootstrapping a Runyankore CNL from an isiZulu CNL. In B. Davis et al., editors, 5th Workshop on Controlled Natural Language (CNL’16), volume 9767 of LNAI, pages 25–36. Springer, 2016. 25-27 July 2016, Aberdeen, UK.

# On generating isiZulu sentences with part-whole relations

It all sounded so easy… We have a pretty good and stable idea about part-whole relations and their properties (see, e.g., [1]), we know how to ‘verbalise’/generate a natural language sentence from basic description logic axioms with object properties that use simple verbs [2], like $Professor \sqsubseteq \exists teaches.Course$ ‘each professor teaches at least one course’, and SNOMED CT is full of logically ‘simple’ axioms (it’s in OWL 2 EL, after all) and has lots of part-whole relations. So why not combine that? We did, but it took some more time than initially anticipated. The outcomes are described in the paper “On the verbalization patterns of part-whole relations in isiZulu”, which was recently accepted at the 9th International Natural Language Generation Conference (INLG’16) that will be held 6-8 September in Edinburgh, Scotland.

What it ended up to be, is that notions of ‘part’ in isiZulu are at times less precise and other times more precise compared to the taxonomy of part-whole relations. This interfered with devising the sentence generation patterns, it pushed the number of ‘elements’ to deal with in the language up to 13 constituents, and there was no way to avoid proper phonological conditioning. We already could handle quantitative, relative, and subject concords, the copulative, and conjunction, but what had to be added were, in particular, the possessive concord, locative affixes, a preposition (just the nga in this context), epenthetic, and the passive tense (with modified final vowel). As practically every element has to be ‘completed’ based on the context (notably the noun class), one can’t really speak of a template-based approach anymore, but a bunch of patterns and partial grammar engine instead. For instance, plain parthood, structural parthood, involvement, membership all have:

• (‘each whole has some part’) $QCall_{nc_{x,pl}}$ $W_{nc_{x,pl}}$ $SC_{nc_{x,pl}}-CONJ-P_{nc_y}$ $RC_{nc_y}-QC_{nc_y}-$dwa
• (‘each part is part of some whole’) $QCall_{nc_{x,pl}}$ $P_{nc_{x,pl}}$ $SC_{nc_{x,pl}}-COP-$ingxenye $PC_{\mbox{\em ingxenye}}-W_{nc_y}$ $RC_{nc_y}-QC _{nc_y}-$dwa

There are a couple of noteworthy things here. First, the whole-part relation does not have one single string, like a ‘has part’ in English, but it is composed of the subject concord (SC) for the noun class (nc) of the noun that play the role of the whole ( W ) together with the phonologically conditioned conjunction na- ‘and’ (the “SC-CONJ”, above) and glued onto the noun of the entity that play the role of the part (P). Thus, the surface realisation of what is conceptually ‘has part’ is dependent on both the noun class of the whole (as the SC is) and on the first letter of the name of the part (e.g., na-+i-=ne-). The ‘is part of’ reading direction is made up of ingxenye ‘part’, which is a noun that is preceded with the copula (COP) y– and together then amounts to ‘is part’. The ‘of’ of the ‘is part of’ is handled by the possessive concord (PC) of ingxenye, and with ingxenye being in noun class 9, the PC is ya-. This ya- is then made into one word together with the noun for the object that plays the role of the whole, taking into account vowel coalescence (e.g., ya-+u-=yo-). Let’s illustrate this with heart (inhliziyo, nc9) standing in a part-whole relation to human (umuntu, NC1), with the ‘has part’ and ‘is part of’ underlined:

• bonke abantu banenhliziyo eyodwa ‘All humans have as part at least one heart’
• The algorithm, in short, to get this sentence from, say $Human \sqsubseteq \exists hasPart.Heart$: 1) it looks up the noun class of umuntu (nc1); 2) it pluralises umuntu into abantu (nc2); 3) it looks up the quantitative concord for universal quantification (QCall) for nc2 (bonke); 4) it looks up the SC for nc2 (ba); 5) then it uses the phonological conditioning rules to add na- to the part inhliziyo, resulting in nenhliziyo and strings it together with the subject concord to banenhliziyo; 6) and finally it looks up the noun class of inhliziyo, which is nc9, and from that it looks up the relative concord (RC) for nc9 (e-) and the quantitative concord for existential quantification (QC) for nc9 (being yo-), and strings it together with –dwa to eyodwa.
• zonke izinhliziyo ziyingxenye yomuntu oyedwa ‘All hearts are part of at least one human’
• The algorithm, in short, to get this sentence from $Heart \sqsubseteq \exists isPartOf.Human$: 1) it looks up the noun class of inhliziyo (nc9); 2) it pluralises inhliziyo to izinhliziyo (nc10); 3) it looks up the QCall for nc10 (zonke); 4) it looks up the SC for nc10 (zi-), takes y- (the COP) and adds them to ingxenye to form ziyingxenye; 5) then it uses the phonological conditioning rules to add ya- to the whole umuntu, resulting in yomuntu; 6) and finally it looks up the noun class of umuntu, which is nc1, and from that the RC for nc10 (o-) and the QC for nc10 (being ye-), and strings it together with –dwa to oyedwa.

For subquantities, we end up with three variants: one for stuff-parts (as in ‘urine has part water’, still with ingxenye for ‘part’), one for portions of solid objects (as in ‘tissue sample is a subquantity of tissue’ or a slice of the cake) that uses umunxa instead of ingxenye, and one ‘spatial’ notion of portion, like that an operating theatre is a portion of a hospital, or the area of the kitchen where the kitchen utensils are is a portion of the kitchen, which uses isiqephu instead of ingxenye. Umunxa is in nc3, so the PC is wa- so that with, e.g., isbhedlela ‘hospital’ it becomes wesibhedlela ‘of the hospital’, and the COP is ng- instead of y-, because umunxa starts with an u. And yet again part-whole relations use locatives (like the containment type of part-whole relation). The paper has all those sentence generation patterns, examples for each, and explanations for them.

The meronymic part-whole relations participation and constitution have added aspects for the verb, such as generating the passive for ‘constituted of’: –akha is ‘to build’ for objects that are made/constituted of some matter in some structural sense, else –enza is used. They are both ‘irregular’ in the sense that it is uncommon that a verb stem starts with a vowel, so this means additional vowel processing (called hiatus resolution in this case) to put the SC together with the verb stem. Then, for instance za+akhiwe=zakhiwe but u+akhiwe=yakhiwe (see rules in paper).

Finally, this was not just a theoretical exercise, but it also has been implemented. I’ll readily admit that the Python code isn’t beautiful and can do with some refactoring, but it does the job. We gave it 42 test cases, of which 38 were answered correctly; the remaining errors were due to an ‘incomplete’ (and unresolvable case for any?) pluraliser and that we don’t know how to systematically encode when to pick akha and when enza, for that requires some more semantics of the nouns. Here is a screenshot with some examples:

The ‘wp’ ones are that a whole has some part, and the ‘pw’ ones that the part is part of the whole and, in terms of the type of axiom that each function verbalises, they are of the so-called ‘all some’ pattern.

The source code, additional files, and the (slightly annotated) test sentences are available from the GENI project’s website. If you want to test it with other nouns, please check whether the noun is already in nncPairs.txt; if not, you can add it, and then invoke the function again. (This remains this ‘clumsily’ until we make a softcopy of all isiZulu nouns with their noun classes. Without the noun class explicitly given, the automatic detection of the noun class is not, and cannot be, more than about 50%, but with noun class information, we can get it up to 90-100% correct in the pluralisation step of the sentence generation [4].)

References

[1] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[2] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS vol. 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[3] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), September 5-8, 2016, Edinburgh, UK. (in print)

[4] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey. (in print)

# Preliminary promising results on a data-driven spellchecker for isiZulu

While developing a spellchecker for isiZulu is not new, the one for Open Office v3 doesn’t work with the more up-to-date versions of Open Office, it was of limited quality, the other papers on isiZulu spellcheckers have no tools for it, and the techniques used were tailored to the language. The demand for it has not decreased, however. Last year’s honours student I co-supervised, Balone Ndaba, set out to address this using n-grams and several corpora, so if this were to work, then this essentially language-independent solution may be ported to other Bantu languages. The data-driven approach worked reasonably well (up to 89% accuracy), so it all ended up as a paper entitled “The effects of a corpus on isiZulu spellcheckers based on n-grams” [1], co-authored with Balone Ndaba, Hussein Suleman, and Langa Khumalo. It has been accepted at the IST Africa’16 conference that will take place in less than two weeks in Durban, South Africa. The material can be downloaded from Balone’s honours project page.

In a nutshell, well less than one page cf. the paper’s 9 pages, the paper describes the design, implementation, and experimental results of a statistics-based spellchecker. One option in a data-driven approach is to make long wordlists to look up the word to be spellchecked, indeed, but this is not doable due to the agglutination in the language (i.e., way too many options), and there are limited curated resources anyhow. Instead, we let it ‘learn’ using a statistical language model using n-grams created from a corpus. For instance, take the word yebo ‘yes’, then its bigrams are ye, eb, and bo, its trigrams are yeb and ebo, and quadrigram (i.e., 4 characters) yebo. Feeding the algorithm a lot of text generates a lot of n-grams, some of which occur much more often than others, whereas the very rare ones were probably typos or a loan word in the original text. The hope is then that when it is fed a word that it has not been trained on, it can recognize that it is (very probably) still a valid word in the language, as the sequence of the characters in the string are deemed common enough, or, conversely, that it is very likely to be misspelled.

It was not clear upfront which of the variables will affect the quality most, and which combination leads to the best spellchecker performance. This could be the threshold for when to include the n-gram, how big the n-gram should be, and whether the training corpus had any effect. So, all this was tested (see paper for details).

Trigrams worked better than quadrigrams, and a threshold of 0.003 worked better than the other thresholds. This is the ‘easy’ part, in a way. What complicated matters were the corpora, where one was much better to learn from or test on than the other. This is summarised in the table below. UC stands for the Ukwabelana corpus [2] that consists of the bible and a few copyright-expired novels, S. INC for a sample of the isiZulu National Corpus that is being developed at the University of KwaZulu-Natal [3], and INC is a small corpus of news items from the Isolezwe and news24 in isiZulu news websites (extended from the earlier playing with some of those articles). The INC and NIC work well on each other, but not at all with UC, and if trained with UC then only mediocre on the more recent texts of NIC and INC. So, the training corpus has a rather large effect on the spellchecker’s performance.

Lexical recall with n-grams: 10-fold cross-validation (i.e., with itself) and tested against the other corpora (t. = test).

 10-fold Train UC Train NIC Train S. INC 3-gr. 4-gr. t. NIC t. INC t. INC t. UC t. UC t. NIC UC 85 80 30 41 S. INC 66 63 54 89 NIC 86 79 70 89

We did look into the corpora a little more, but there’s more to analyse. Notably, the number of unique words in the corpus seems to matter more than its size, and the datedness and quality of the text suggest room for additional evaluation. UC contains some archaic terms that wouldn’t be used in modern-day isiZulu, and there have been some errors introduced in the words, assumed to be due to OCR (words with three successive vowels are a no-no in isiZulu, verbs missing the final vowel).

Overall, though, a lexical recall and an accuracy of 89% is quite alright for a first-pass, meriting resources for fine-tuning for isiZulu and it being promising as approach for related languages.

Balone will present the paper at IST Africa, and Hussein and Langa will also participate, so you can ask them more details in person at the conference. (I’ll be holding a final training for the team representing UCT in the ACM ICPC World Finals and pack my bags to join them as coach in Phuket in Thailand.)

References

[1] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[2] S. Spiegler, A. van der Spuy, P. Flach. Ukwabelana – An Open-source Morphological Zulu Corpus. Proceedings of the 23rd International Conference on Computational Linguistics (COLING, 2010). ACL. 1020-1028.

[3] Khumalo. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.