# A set of competency questions and SPARQL-OWL queries, with analysis

As a good beginning of the new year, our Data in Brief article Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations [1] was accepted and came online this week. It accompanies our Journal of Web Semantics article Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL [2], which was published in December 2019. Here ‘our’ refers to my collaborators in Poznan, Dawid Wisniewski, Jedrzej Potoniec, and Agnieszka Lawrynowicz, and myself. The former article describes in extensive detail a dataset we created; the latter describes the analysis of that dataset and the new insights it provided.

The dataset

In short, we tried to find existing good TBox-level competency questions (CQs) for available ontologies and manually formulate (i.e., formalise the CQ in) SPARQL-OWL queries for each of the CQs over said ontologies. We ended up with 234 CQs for 5 ontologies, with 131 accompanying SPARQL-OWL queries. This constitutes the first gold standard pipeline for verifying an ontology’s requirements and it presents the systematic analyses of what is translatable from the CQs and what not, and when not, why not. This may assist in further research and tool development on CQs, automating CQ verification, assessing the main query language constructs and therewith language optimisation, among others. The dataset itself is indeed independently reusable for other experiments, and has been reused already [3].

The key insights

The first analysis we conducted on it, reported in [2], revealed several insights. First, a larger set of CQs (cf. earlier work) indeed did increase the number of CQ patterns. There are recurring patterns in the shape of the CQs, when analysed linguistically; a popular one is What EC1 PC1 EC2? obtained from CQs like “What data are collected for the trail making test?” (a Dem@care CQ). Observe that, yes, indeed, we did decouple the language layer from the formalisation layer rather than mixing the two; hence, the ECs (resp. PCs) are not necessarily classes (resp. object properties) in an ontology. The SPARQL-OWL queries were also analysed as to which constructs of that query language are actually used, and which are used most often (see Table 7 of the paper).

Second, these characteristics are not the same across CQ sets by different authors of different ontologies in different subject domains, although some patterns do recur and are thus somehow ‘popular’ regardless. Third, the relation CQ (pattern or not) : SPARQL-OWL query (or its signature) is m:n, not 1:1. That is, a CQ may have multiple SPARQL-OWL queries or signatures, and a SPARQL-OWL query or signature may be put into a natural language question (CQ) in different ways. The latter sucks for any aim of automated verification, but unfortunately, there doesn’t seem to be an easy way around that: 1) there are different ways to say the same thing, and 2) the same knowledge can be represented in different ways, which leads to a different shape of the query. Some possible ways to mitigate either are being looked into, like specifying a CQ controlled natural language [3] and modelling styles [4] so that one might be able to devise an algorithm to find and link or swap or choose one of them [5,6], but all that is still in the preliminary stages.
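To make the m:n relation a bit more tangible, here is a minimal Python sketch of one CQ pattern that maps to two differently shaped SPARQL-OWL queries, depending on how the knowledge was modelled. Everything in it is made up for illustration: the two query shapes, the `:` prefix (whose PREFIX declaration is omitted), and the names `collectedFor`, `hasCollectedData`, and `TrailMakingTest` are assumptions, not taken from the dataset or from the Dem@care ontology.

```python
from string import Template

# Linguistic pattern from the analysis; EC/PC are chunks of the question,
# not necessarily classes/object properties in the ontology.
CQ_PATTERN = "What EC1 PC1 EC2?"

# Two hypothetical query shapes for the same pattern (assumed, not from the dataset).
QUERY_SHAPES = {
    # Style A: the answer classes carry an existential restriction to EC2.
    "restriction_on_answer": Template(
        "SELECT ?x WHERE { ?x rdfs:subClassOf "
        "[ a owl:Restriction ; owl:onProperty :$pc1 ; owl:someValuesFrom :$ec2 ] . }"
    ),
    # Style B: EC2 carries the restriction, via an inverse-direction property.
    "restriction_on_ec2": Template(
        "SELECT ?x WHERE { :$ec2 rdfs:subClassOf "
        "[ a owl:Restriction ; owl:onProperty :$pc1_inverse ; owl:someValuesFrom ?x ] . }"
    ),
}


def candidate_queries(pc1, ec2, pc1_inverse):
    """Fill every query shape with the vocabulary chosen for the CQ chunks."""
    values = {"pc1": pc1, "ec2": ec2, "pc1_inverse": pc1_inverse}
    return {name: shape.substitute(values) for name, shape in QUERY_SHAPES.items()}


queries = candidate_queries("collectedFor", "TrailMakingTest", "hasCollectedData")
for name, query in queries.items():
    print(name, "->", query)
```

Which of the candidates actually returns something depends on the modelling style of the ontology at hand, which is exactly why a 1:1 translation from CQ to query is not to be expected.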

Meanwhile, there is that freely available dataset and the in-depth rigorous analysis, so that, hopefully, a solution may be found sooner rather than later.

References

[1] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, in press.

[2] Wisniewski, D., Potoniec, J., Lawrynowicz, A., Keet, C.M. Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL. Journal of Web Semantics, 2019, 59:100534.

[3] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol 1057, 3-15.

[4] Fillottrani, P.R., Keet, C.M. Dimensions Affecting Representation Styles in Ontologies. 1st Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’19). Springer CCIS vol 1029, 186-200. 24-28 June 2019, Villa Clara, Cuba.

[5] Fillottrani, P.R., Keet, C.M. Patterns for Heterogeneous TBox Mappings to Bridge Different Modelling Decisions. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS vol 10249, 371-386. Portoroz, Slovenia, May 28 – June 2, 2017.

[6] Khan, Z.C., Keet, C.M. Automatically changing modules in modular ontology development and management. Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT’17). ACM Proceedings, 19:1-19:10. Thaba Nchu, South Africa. September 26-28, 2017.

# Localising Protégé with Manchester syntax into your language of choice

Some people like a quasi natural language interface in ontology development tools, which is why Manchester Syntax was proposed [1]. A downside is that it locks the ontology developer into English, so that weird chimaeras are generated in the interface if the author prefers another language for the ontology, such as, e.g., the “jirafa come only (oja or ramita)” mentioned in an earlier post and that was deemed unpleasant in an experiment a while ago [2]. Those who prefer the quasi natural language components will have to resort to localising Manchester syntax and the tool’s interface.

This is precisely what two of my former students—Adam Kaliski and Casey O’Donnell—did during their mini-project in the ontology engineering course of 2017. A localisation in Afrikaans, as the case turned out to be. To make this publicly available, Michael Harrison brushed up the code a bit and tested it worked also in the new version of Protégé. It turned out it wasn’t that easy to localise it to another language the way it was done, so one of my PhD students, Toky Raboanary, redesigned the whole thing. This was then tested with Spanish, and found to be working. The remainder of the post describes informally some main aspects of it. If you don’t want to read all that but want to play with it right away: here are the jar files, open source code, and localisation instructions for if you want to create, say, a French or Dutch variant.

Some sensible constraints, some slightly contrived ones (and some bad ones), for the purpose of showing the localisation of the interface for the various keywords. The view in English is included in the screenshot to facilitate comparison.

The localisation functions as a plugin for Protégé as a ‘view’ component. It can be selected under “Windows – Views – Class views” and then Beskrywing for the Afrikaans and Descripción for Spanish, and dragged into the desired position; this is likewise for object properties.

Instead of burying the translations in the code, they are specified in a separate XML file, whose content is fetched during the rendering. Adding a new ‘simple’ (more about that later) language merely amounts to adding a new XML file with the translations of the Protégé labels and of the relevant Manchester syntax. Here are the ‘simple’ translations—i.e., where both are fixed strings—for Afrikaans for the relevant tool interface components:

| Class Description (Label) | Klasbeskrywing (Label in Afrikaans) |
|---|---|
| Equivalent To | Dieselfde as |
| SubClass Of | Subklas van |
| General Class axioms | Algemene Klasaksiomas |
| SubClass Of (Anonymous Ancestor) | Subklas van (Naamlose Voorvader) |
| Disjoint With | Disjunkte van |
| Disjoint Union Of | Disjunkte Unie van |

The second set of translations is for the Manchester syntax, so as to render that also in the target language. The relevant mappings for Afrikaans class description keywords are listed in the table below, which contain the final choices made by the students who developed the original plugin. For instance, min and max could have been rendered as minimum and maksimum, but the ten minste and by die meeste were deemed more readable despite being multi-word strings. Another interesting bit in the translation is negation, where there has to be a second ‘no’ since Afrikaans has double negation in this construction, so that it renders it as nie <expression> nie. That final rendering is not grammatically perfect, but (hopefully) sufficiently clear:

An attempt at double negation with a fixed string

| Manchester OWL Keyword | Afrikaans Manchester OWL Keyword or phrase |
|---|---|
| some | sommige |
| only | slegs |
| min | ten minste |
| max | by die meeste |
| exactly | precies |
| and | en |
| or | of |
| not | nie nie |
| SubClassOf | SubklasVan |
| EquivalentTo | DieselfdeAs |
| DisjointWith | DisjunkteVan |

The people involved in the translations for the object properties view for Afrikaans are Toky, my colleague Tommie Meyer (also at UCT), and myself; snyding for ‘intersection’ sounds somewhat odd to me, but the real tough one to translate was ‘SuperProperty’. Of the four options that were considered (SuperEienskap, SuperVerwantskap, SuperRelasie, and SuperVerband), SuperVerwantskap was chosen, with Tommie having had the final vote; it is a semantic translation, not a literal one.

Screenshot of the object properties description, with comparison to the English

The Spanish version also has multi-word strings, but at least does not do double negation. On the other hand, it has accents. To generate the Spanish version, my collaborator Pablo Fillottrani from the Universidad Nacional del Sur, Argentina, Toky, and I had a go at translating the terms. This was then implemented with the XML file. If you want neither to dig into the XML file nor to install the plugin, but just want a quick look at the translations, they are as follows for the class description view:

| Class Description (Label) | Descripción (Label in Spanish) |
|---|---|
| Equivalent To | Equivalente a |
| SubClass Of | Subclase de |
| General Class axioms | Axiomas generales de clase |
| SubClass Of (Anonymous Ancestor) | Subclase de (Ancestro Anónimo) |
| Disjoint With | Disjunto con |
| Disjoint Union Of | Unión Disjunta de |
| Instances | Instancias |

| Manchester OWL Keyword | Spanish Manchester OWL Keyword |
|---|---|
| some | al menos uno |
| only | sólo |
| min | al mínimo |
| max | al máximo |
| and | y |
| or | o |
| not | no |
| exactly | exactamente |
| SubClassOf | SubclaseDe |
| EquivalentTo | EquivalenteA |
| DisjointWith | DisjuntoCon |
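For those curious how fixed-string localisation from a separate XML file could work in principle, here is a minimal sketch in Python (the plugin itself is a Protégé view component, so this is not its code). The XML layout, with `localisation` and `keyword` elements and an `en` attribute, is an assumption made for this example; the plugin's actual file format may well differ. The keyword translations are a subset of the Spanish ones listed above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical file layout; the real plugin's XML schema may differ.
SPANISH_XML = """
<localisation lang="es">
  <keyword en="some">al menos uno</keyword>
  <keyword en="only">sólo</keyword>
  <keyword en="min">al mínimo</keyword>
  <keyword en="max">al máximo</keyword>
  <keyword en="and">y</keyword>
  <keyword en="or">o</keyword>
  <keyword en="not">no</keyword>
  <keyword en="exactly">exactamente</keyword>
</localisation>
"""


def load_keywords(xml_text):
    """Read the English-to-target keyword mapping from the XML text."""
    root = ET.fromstring(xml_text.strip())
    return {element.get("en"): element.text for element in root.iter("keyword")}


def localise(expression, keywords):
    """Replace whole-word Manchester syntax keywords; leave entity names alone."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r")\b")
    return pattern.sub(lambda match: keywords[match.group(1)], expression)


keywords = load_keywords(SPANISH_XML)
print(localise("jirafa come only (oja or ramita)", keywords))
```

The substitution happens in a single pass, so a translated keyword is never translated again. An entity whose name happens to equal a keyword would be mangled by this naive version; a real implementation would work on the parsed expression rather than on the rendered string.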

And here’s a rendering of a real ontology, for geo linked data in Spanish, rather than African wildlife yet again:

screenshot of the plugin behaviour with someone else’s ontology in Spanish

One final comment remains, which has to do with the ‘simple’ mentioned above. The approach of localisation presented here works only with fixed strings, i.e., the strings do not have to change depending on the context in which they are used. It won’t work with, say, isiZulu (a highly agglutinating and inflectional language), because isiZulu doesn’t have fixed strings for the Manchester syntax keywords nor for some other labels. For instance, ‘at least one’ has seven variants for nouns in the singular, depending on the noun class of the noun of the OWL class it quantifies over; e.g., elilodwa for ‘at least one’ apple, and esisodwa for ‘at least one’ twig. Also, the conjugation of the verb for the object property depends on the noun class of the noun of the OWL class, but in this case for the one that plays the subject; e.g., it’s “eats” in English for both humans and elephants eating, say, fruit, so one string for the name of the object property, but that’s udla and idla, respectively, in isiZulu. This requires annotations of the classes with ontolex-lemon or a similar approach and a set of rules (which we have, btw) to determine what to do in which case, which requires on-the-fly modifications to Manchester syntax keywords and elements’ names or labels. And then there’s still phonological conditioning to account for. It surely can be done, but it is not as doable as with the ‘simple’ languages that have at least a disjunctive orthography and far fewer genders or noun classes for the nouns.
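The difference with the fixed-string approach can be sketched in a few lines of Python. Instead of one string per keyword, the string is looked up (or assembled) based on the noun class of the relevant noun. The noun class numbers used as keys here (5 and 7 for the quantifier, 1 and 9 for the subject concord) are my own annotation for this illustration, and only the forms mentioned in this post are included; phonological conditioning is ignored altogether.

```python
# 'at least one' depends on the noun class of the noun it quantifies over.
# Only two of the seven singular variants are listed; class numbers are assumed.
AT_LEAST_ONE = {
    5: "elilodwa",  # e.g., with the noun for 'apple'
    7: "esisodwa",  # e.g., with the noun for 'twig'
}

# The subject concord of the verb depends on the noun class of the subject.
SUBJECT_CONCORD = {
    1: "u",  # e.g., a human as subject
    9: "i",  # e.g., an elephant as subject
}


def at_least_one(noun_class):
    """Pick the quantifier that agrees with the quantified noun's class."""
    return AT_LEAST_ONE[noun_class]


def conjugate(verb_root, subject_noun_class, final_vowel="a"):
    """Build subject concord + verb root + final vowel for the object property."""
    return SUBJECT_CONCORD[subject_noun_class] + verb_root + final_vowel


print(conjugate("dl", 1), conjugate("dl", 9))  # the two renderings of 'eats'
print(at_least_one(5), at_least_one(7))
```

So where the Afrikaans and Spanish localisations need one dictionary lookup per keyword, an isiZulu one needs the noun class of the classes involved first, which is where the annotations and rules come in.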

In closing, while there’s indeed more to translate in the Protégé interface in order to fully localise it, hopefully this already helps as-is either for reading at least a whole axiom in one’s language or as stepping stone to extend it further for the other terms in the Manchester syntax and the interface. Feel free to extend our open source code.

References

[1] Matthew Horridge, Nicholas Drummond, John Goodwin, Alan Rector, Robert Stevens and Hai Wang (2006). The Manchester OWL syntax. OWL: Experiences and Directions (OWLED’06), Athens, Georgia, USA, 10-11 Nov 2006, CEUR-WS vol 216.

[2] Keet, C.M. The use of foundational ontologies in ontology development: an empirical assessment. 8th Extended Semantic Web Conference (ESWC’11), G. Antoniou et al (Eds.), Heraklion, Crete, Greece, 29 May-2 June, 2011. Springer LNCS 6643, 321-335.

[3] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157.

# From ontology verbalisation to language learning exercises

I’m aware that to most people ‘playing with’ (investigating) ontologies and isiZulu does not sound particularly useful on the face of it. Yet, there’s some long-term future music, like eventually being able to generate patient discharge notes in one’s own language, which will do its bit to ameliorate the language barrier in healthcare in South Africa so that patients at least will adhere to the treatment instructions a little better, and therewith receive better quality healthcare. But benefits in the short term might serve something as well. To that end, I proposed an honours project last year, which has been completed in the meantime, and one of the two interesting outcomes has made it into a publication already [1]. As you may have guessed from the title, it’s about automation for language learning exercises. The results will be presented at the 6th Workshop on Controlled Natural Language, in Maynooth, Ireland in about two weeks’ time (27-28 August). In the remainder of this post, I highlight the main contributions described in the paper.

First, regarding the post’s title, one might wonder what ontology verbalisation has to do with language learning. Nothing, really, except that we could reuse the algorithms from the controlled natural language (CNL) for ontology verbalisation to generate (computer-assisted) language learning exercises whose answers can be computed and marked automatically. That is, the original design of the CNL for things like pluralising nouns, verb conjugation, and negation that is used for verbalising ontologies in isiZulu in theory [2] and in practice [3], was such that the sentence generator is a detachable module that could be plugged in elsewhere for another task that needs such operations.

Practically, the student who designed and developed the back-end, Nikhil Gilbert, preferred Java over Python, so he converted most parts into Java, and added a bit more, notably the ‘singulariser’, a sentence scrabble, and a sentence generator. Regarding the sentence generator, this is used as part of the exercises & answers generator. For instance, we know that humans and the roles they play (father, aunt, doctor, etc.) are mostly in isiZulu’s noun classes 1, 2, 1a, 2a, or 3a, that those classes do not (or rarely?) have non-human nouns and generally it holds for all humans and their roles that they can ‘eat’, ‘talk’ etc. This makes it relatively easy to create a noun chain and a verb chain list to mix and match nouns with verbs accordingly (hurrah! for the semantics-based noun class system). Then, with the 231 nouns and 59 verbs in the newly constructed mini-corpus, the noun chain and the verb chain, 39501 unique question sentences could be generated, using the following overall architecture of the system:

Architecture of the CNL-driven CALL system. The arrows indicate which upper layer components make use of the lower layer components. (Source: [1])

From a CNL perspective as well as the language learning perspective, the actual templates for the exercises may be of interest. For instance, when a learner is learning about pluralising nouns and their associated verb, the system uses the following two templates for the questions and answers:

Q: <prefixSG+stem> <SGSC+VerbRoot+FV>
A: <prefixPL+stem> <PLSC+VerbRoot+FV>
Q: <prefixSG+stem> <SGSC+VerbRoot+FV> <prefixSG+stem>
A: <prefixPL+stem> <PLSC+VerbRoot+FV> <prefixPL+stem>

The answers can be generated automatically with the algorithms that generate the plural noun (from ‘prefixSG’ to ‘prefixPL’) and add the plural subject concord (from ‘SGSC’ to ‘PLSC’, in agreement with ‘prefixPL’), which were developed as part of the GeNI project on ontology verbalization. This can then be checked against what the learner has typed. For instance, a generated question could be umfowethu usula inkomishi and the correct answer generated (to check the learner’s response against) is abafowethu basula izinkomishi. Another example is generation of the negation from the positive, or vice versa; e.g.:

Q: <PLSC+VerbRoot+FV>
A: <PLNEGSC+VerbRoot+NEGFV>

For instance, the question may present batotoba and the correct answer is then abatotobi. In total, there are six different types of sentences, with two double, like the plural above, hence a total of 16 templates. It is not a lot, but it turned out it is one of the very few attempts to use a CNL in such a way: there is one paper that also will be presented at CNL’18 in the same session [4], and an earlier one [5] uses a fancy grammar system (that we don’t have yet computationally for isiZulu). This is not to be misunderstood as a claim that this is one of the first CNL/NLG-based systems for computer-assisted language learning (e.g., there’s assistance in essay writing, grammar concept question generation, reading understanding question generation), but curiously there is very little on CNLs or NLG for the standard entry-level type of questions to learn the grammar. Perhaps the latter is considered ‘boring’ for English by now, given all the resources. However, thousands of students take introduction courses in isiZulu each year, and some automation can alleviate the pressure of routine activities from the lecturers. We have done some evaluations with learners, with encouraging results, and plan to do some more, so that it may eventually transition to actual use in the courses; that is: TBC…
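To give an idea of how such question and answer templates can be filled and marked, here is a toy sketch in Python (the actual back-end was written in Java, and its pluraliser is far more elaborate). The lookup tables contain only what is needed for the two examples in this post; the noun class numbers, the split of each word into prefix, root, and final vowel, and the function names are my assumptions for the illustration, and phonological conditioning is left out.

```python
# Tiny lookup tables keyed by noun class; just enough for the two examples.
PLURAL_CLASS = {1: 2, 9: 10}                        # singular class -> plural class
NOUN_PREFIX = {1: "um", 2: "aba", 9: "in", 10: "izin"}
SUBJECT_CONCORD = {1: "u", 2: "ba", 9: "i", 10: "zi"}
NEG_SUBJECT_CONCORD = {2: "aba"}                    # only the class used below


def noun(stem, noun_class):
    return NOUN_PREFIX[noun_class] + stem


def verb(root, noun_class, final_vowel="a"):
    return SUBJECT_CONCORD[noun_class] + root + final_vowel


def plural_exercise(subj_stem, subj_nc, root, obj_stem, obj_nc):
    """Q: <prefixSG+stem> <SGSC+VerbRoot+FV> <prefixSG+stem>, A: the PL version."""
    question = " ".join([noun(subj_stem, subj_nc), verb(root, subj_nc), noun(obj_stem, obj_nc)])
    subj_pl, obj_pl = PLURAL_CLASS[subj_nc], PLURAL_CLASS[obj_nc]
    answer = " ".join([noun(subj_stem, subj_pl), verb(root, subj_pl), noun(obj_stem, obj_pl)])
    return question, answer


def negation_exercise(root, plural_nc):
    """Q: <PLSC+VerbRoot+FV>, A: <PLNEGSC+VerbRoot+NEGFV>."""
    return verb(root, plural_nc), NEG_SUBJECT_CONCORD[plural_nc] + root + "i"


def mark(learner_response, answer):
    """Marking amounts to comparing the typed response with the generated answer."""
    return learner_response.strip().lower() == answer


question, answer = plural_exercise("fowethu", 1, "sul", "komishi", 9)
print(question, "->", answer)
print(negation_exercise("totob", 2))
```

Because the answer is computed from the same building blocks as the question, no answer key has to be written by hand, which is the whole point of reusing the CNL's algorithms.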

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[3] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol. 10577, 59-64.

[4] Lange, H., Ljunglof, P. Putting control into language learning. 6th International Workshop on Controlled Natural language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[5] Gardent, C., Perez-Beltrachini, L. Using FB-LTAG Derivation Trees to Generate Transformation-Based Grammar Exercises. Proc. of TAG+11, Sep 2012, Paris, France. pp117-125, 2012.

# An Ontology Engineering textbook

My first textbook “An Introduction to Ontology Engineering” (pdf) has just been released as an open textbook. I have revised, updated, and extended my earlier lecture notes on ontology engineering, amounting to about 1/3 more new content cf. its predecessor. Its main aim is to provide an introductory overview of ontology engineering and its secondary aim is to provide hands-on experience in ontology development that illustrates the theory.

The contents and narrative are aimed at advanced undergraduate and postgraduate level in computing (e.g., as a semester-long course), and the book is structured accordingly. After an introductory chapter, there are three blocks:

• Logic foundations for ontologies: languages (FOL, DLs, OWL species) and automated reasoning (principles and the basics of tableau);
• Developing good ontologies with methods and methodologies, the top-down approach with foundational ontologies, and the bottom-up approach to extract as much useful content as possible from legacy material;
• Advanced topics, with a selection of sub-topics: Ontology-Based Data Access, interactions between ontologies and natural languages, and advanced modelling with additional language features (fuzzy and temporal).

Each chapter has several review questions and exercises to explore one or more aspects of the theory, as well as descriptions of two assignments that require using several sub-topics at once. More information is available on the textbook’s page [also here] (including the links to the ontologies used in the exercises), or you can click here for the pdf (7MB).

Feedback is welcome, of course. Also, if you happen to use it in whole or in part for your course, I’d be grateful if you would let me know. Finally, if this textbook will be used half (or even a quarter) as much as the 2009/2010 blogposts have been visited (around 10K unique visitors since posting them), that would mean there are a lot of people learning about ontology engineering and then I’ll have achieved more than I hoped for.

UPDATE: meanwhile, it has been added to several open (text)book repositories, such as OpenUCT and the Open Textbook Archive, and it has been featured on unglue.it in the week of 13-8 (out of its 14K free ebooks).

# ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already underresourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as she “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2)). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations were held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which therewith reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieve redress of the injustices of the past. This route does tick multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence, not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?

Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely-spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent, of which the idea is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.
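As an aside on the claim that syntax-based rules suffice for English pluralisation: the gist of it can be shown with a few regular expressions. The sketch below is in Python rather than Perl and is a drastically reduced toy version with just three rules; it is not the algorithm of [4], which covers far more cases, including irregular nouns.

```python
import re

# Ordered (pattern, replacement) rules that look only at how the word is written.
RULES = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),   # box -> boxes, church -> churches
    (re.compile(r"([^aeiou])y$"), r"\1ies"),    # city -> cities
]


def pluralise(noun):
    """Apply the first matching rule; fall back to appending -s."""
    for pattern, replacement in RULES:
        if pattern.search(noun):
            return pattern.sub(replacement, noun, count=1)
    return noun + "s"


for word in ["professor", "box", "city"]:
    print(word, "->", pluralise(word))
```

No knowledge about what the noun means is needed here, whereas for isiZulu the spelling of the prefix alone does not always determine the noun class, and therewith not the plural either.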

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.

Data-driven approaches that use lots of text

The rules-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lots of text are on the rise: they offer the hope to achieve more with less effort, not even having to learn the language, and easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that learning the statistics-driven language model on old texts like the Bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, illustrated in Figure 3, rather than a dictionary-based approach that is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding it a small isiXhosa corpus: it achieved about 80% accuracy already even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker
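The basic idea of trigram-based error detection can be sketched as follows in Python. This is a simplification under my own assumptions, not the spellchecker's actual code: it uses character trigrams with word-boundary markers, relative frequencies as the ‘probabilities’, and a fixed threshold, and the ‘corpus’ is a handful of isiZulu words rather than a real corpus.

```python
from collections import Counter


def char_trigrams(word):
    """Character trigrams of a word, with ^ and $ marking the word boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(corpus_words):
    """Relative frequency of each trigram in the corpus."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_trigrams(word))
    total = sum(counts.values())
    return {trigram: n / total for trigram, n in counts.items()}


def looks_correct(word, model, threshold=0.0):
    """Flag a word as a likely error if any of its trigrams is too improbable."""
    return all(model.get(trigram, 0.0) > threshold for trigram in char_trigrams(word))


# Toy corpus; a real one has many thousands of words.
corpus = ["yebo", "khupela", "umfowethu", "usula", "inkomishi",
          "abafowethu", "basula", "izinkomishi"]
model = train(corpus)
for candidate in ["yebo", "khupels"]:
    print(candidate, "ok" if looks_correct(candidate, model) else "possible error")
```

Since no word list is consulted, a correctly spelled word that is not in the corpus can still pass as long as its trigrams are common enough, which is what makes the approach attractive for agglutinating languages, and also why the choice of corpus matters so much.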

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data of parallel sentences in both languages from which they have to learn the patterns, such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’ came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.

Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft put resources into development of HLTs for African languages already. Languages, and the identities and cultures intertwined with them, are a national resource, however. This suggests the need for more research and for the creation of a substantial public good in the form of a wide range of HLTs that assist people in the use of their language in the digital age and contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623, pp. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

# Updated isiZulu spellchecker and new isiXhosa spellchecker

Noting that February is the month of language activism in South Africa and that 21 February is the International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development has progressed quite a bit, so that we have released a new ‘version 2’ of the spellchecker. For those not in-the-know: isiZulu and isiXhosa are both among the 11 official languages of South Africa, with isiZulu the largest language in the country by first language speakers and isiXhosa is slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:

• first error correction algorithms for isiZulu;
• improved error detection with a few basic rules, also for isiZulu;
• new isiXhosa error detection and correction.

The source code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):

Single word error *khupels that has a substitution error s for a in the intended word (khupela)

The error corrector can propose possible corrections for single-word errors that are transpositions, substitutions, insertions, or deletions; for instance, *eybo, *yrbo, *yeebo, and *ybo, respectively, cf. the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, although error correction uses considerably more statistics-based algorithms than the detection-only algorithms do. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it at least can suggest something) is 95% and suggestion relevance (that the suggestions contain the intended word) made it to 61%, mainly due to weak results of corrections for insertion errors (they mess too much with the trigrams).
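To make the four error types concrete, here is a minimal Python sketch that undoes one typing error of each kind and keeps the candidates found in a word list. It is not the corrector's actual algorithm, which is statistics-based and works with trigrams; the function names and the one-word lexicon are made up for illustration.

```python
import string

ALPHABET = string.ascii_lowercase


def single_edit_candidates(word):
    """Map each error type to the strings that `word` could have been
    before one typing error of that type was made."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        # two adjacent letters were swapped: swap them back
        "transposition": {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1},
        # one letter was replaced: try every other letter in that position
        "substitution": {l + c + r[1:] for l, r in splits if r
                         for c in ALPHABET if c != r[0]},
        # an extra letter was typed: remove one letter
        "insertion": {l + r[1:] for l, r in splits if r},
        # a letter was left out: add one letter
        "deletion": {l + c + r for l, r in splits for c in ALPHABET},
    }


def suggest(word, lexicon):
    """Return, per error type, the candidates that occur in the lexicon."""
    found = {}
    for kind, candidates in single_edit_candidates(word).items():
        hits = sorted(candidates & lexicon)
        if hits:
            found[kind] = hits
    return found


lexicon = {"yebo"}  # toy word list, for illustration only
for typo in ["eybo", "yrbo", "yeebo", "ybo"]:
    print(typo, suggest(typo, lexicon))
```

Each of the four typos from the text is traced back to yebo under exactly one error type. A real corrector would still have to rank many competing candidates, which is where the statistics come in.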

The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semi-colons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored rather than the words adjacent to them being detected as probably misspelt. It works better than spellchecker.net’s version, which is the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas spellchecker.net’s version got to 82.66% on the same text. Put differently: spellchecker.net flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).

Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption of using them for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) with more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than the statistical language model of Nthabiseng, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format and they may be useful for other endeavours, such as morphological analysers.

A section of the isiXhosa Wikipedia entry about the UN (*ukuez should be ukuze, which is among the proposed words).

Further improvements are possible, which are being scoped for a v3 some time later. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? If we had known that, it would have been an easy way to boost the performance. Can we find optimisations to substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in that or related questions, or you would like to use the algorithms in your tool, please contact me.

# The isiZulu spellchecker seems to contribute to ‘intellectualisation’ of isiZulu

Perhaps putting ‘intellectualisation’ in sneer quotes isn’t nice, but I still find it an odd term to refer to a process of (in short, from [1]) coming up with new vocabulary for scientific speech, expression, objective thinking, and logical judgments in a natural language. In the country where I grew up, terms in our language were, and still are, invented more because of a push against cultural imperialism and for home language promotion than because of some explicit process to intellectualise the language in the sense of “let’s invent some terms because we need to talk about science in our own language” or “the language needs to grow up” sort of discourses. Examples of such inventions are the beautiful word geheugensanering (NL), which captures the concept of ‘garbage collection’ (in computing) way better than the English joke-term for it, elektronische Datenverarbeitung (DE) for ‘ICT’, técnicas de barrido (ES) for ‘sweep line’ algorithms, and mot-dièse (FR) for [twitter] ‘hashtag’, to name but a few.

Be that as it may, here in South Africa, it goes under the banner of intellectualisation, with particular reference to the indigenous languages [2]; e.g., having introduced umakhalekhukhwini ‘cell/mobile phone’ (decomposed: ‘the thing that rings in your pocket’) and ukudlulisa ikheli for ‘pass by reference’ in programming (longer list of isiZulu-English computing and ICT terms), which is occurring for multiple subject domains [3]. Now I ended up as co-author of a paper that has ‘intellectualisation’ in its title [4]: Evaluation of the effects of a spellchecker on the intellectualization of isiZulu that appeared just this week in the Alternation journal.

The main general question we sought to answer was whether human language technologies, and in particular the isiZulu spellchecker launched last year, contribute to the language’s intellectualisation. More specifically, we aimed to answer the following three questions:

1. Is the spellchecker meeting end-user needs and expectations?
2. Is the spellchecker enabling the intellectualisation of the language?
3. Is the lexicon growing upon using the spellchecker?

The answers in a nutshell are: 1) yes, the spellchecker does meet end-user needs and expectations (but there are suggestions for further improving its functionality), 2) users perceive that the spellchecker enables the intellectualisation of the language, and 3) non-dictionary words were added, i.e., the lexicon is indeed growing.

The answer to the last question provides some interesting data for linguists to sink their teeth into. For instance, a user had added to the spellchecker’s dictionary LikaSekelaShansela, which is an inflected form of isekelashansela ‘Vice Chancellor’ (that is recognised as correct by the spellchecker). Some inconsistencies in word formation, from a rule-of-thumb viewpoint, were also observed; e.g., usosayensi ‘scientist’ vs. unompilo ‘nurse’. If one were to follow consistently the word formation process for various types of experts in isiZulu, such as usosayensi ‘scientist’, usolwazi ‘professor’, and usomahlaya ‘comedian’, then one reasonably could expect ‘nurse’ to be *usompilo rather than unompilo. Why it isn’t, we don’t know. Regardless, the “add to dictionary” option of the spellchecker proved to be a nice extra feature for a data-driven approach to investigate intellectualisation of a language.

Version 1 of the isiZulu spellchecker that was used in the evaluation was ok and reasonably could not have interfered negatively with any possible intellectualisation (average SUS score of 75 and median 82.5, so ‘good’). It was ok in the sense that a majority of respondents thought that the entire tool was helpful, no features should be removed, it enhances their work, and so on (see paper for details). For the software developers among you who have spare time: they’d like, mainly, to have it as a Chrome and MS Word plugin, predictive text/autocomplete, and have it working on the mobile phone. The spellchecker has improved in the meantime thanks to two honours students, and I will write another blog post about that next.
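For readers unfamiliar with SUS (the System Usability Scale): the post does not spell out how the scores were computed, so the sketch below assumes the standard SUS scoring of a ten-item questionnaire answered on a 1 to 5 scale. The function name and the responses are made up for illustration and are not the study’s data.

```python
from statistics import mean, median


def sus_score(responses):
    """Standard SUS scoring: odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer); the sum is scaled by 2.5
    so that the result lies between 0 and 100."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected ten answers, each between 1 and 5")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))  # i == 0 is item 1
    return total * 2.5


# Hypothetical respondents, not the participants of the evaluation.
respondents = [
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
scores = [sus_score(r) for r in respondents]
print(scores, mean(scores), median(scores))
```

Because a few low scores pull the mean down more than the median, a sample can have a median above its mean, as was the case in the evaluation.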

As a final reflection: it turned out there isn’t a way to measure the level of intellectualisation in a ‘hard sciences’ way, so we concluded the other answers based on data that came from the somewhat fluffy approach of a survey and in-depth interviews (a ‘mixed-methods’ approach, to give it a name). It would be nice to have a way to measure it, though, so one would be able to say which languages are more or less intellectualised, what level of intellectualisation is needed to have a language as language of instruction and science at tertiary level of education and for dissemination of scientific knowledge, and to what extent some policy x, tool y, or activity z contributes to the intellectualization of a language.

References

[1] Havránek, B. 1932. The functions of literary language and its cultivation. In Havránek, B and Weingart, M. (Eds.). A Prague School Reader on Esthetics, Literary Structure and Style. Prague: Melantrich: 32-84.

[2] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[3] Khumalo, L. Intellectualization through terminology development. Lexikos, 2017, 27: 252-264.

[4] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

# Figuring out the verbalisation of temporal constraints in ontologies and conceptual models

Temporal conceptual models, ontologies, and their logics are nothing new, but that sort of information and knowledge representation still doesn’t gain a lot of traction (cf. say, formal methods for verification). This is in no small part because modelling temporal information is not easy. Several conceptual modelling languages do have various temporal extensions, but most modellers don’t even use all of the default language features yet [1]. How could one at least reduce the barrier to adoption of temporal logics and modelling languages? The two principal approaches are visualisation with a diagrammatic language and rendering it in a (pseudo-)natural language. One of my postgraduate students looked at the former, trying to figure out what would be the best icons and such, which showed there was still a steep learning curve [2]. Before examining whether that could be optimised, I wondered whether the natural language option might be promising. The problem was that no-one had yet tried to determine what the natural language counterparts of the temporal constraints were supposed to be, let alone whether they would be an ‘adequate’ or the ‘best’ way of rendering the temporal constraints in tolerable natural language sentences. I wanted to know that badly enough that I tried to find out.

Given that using templates is a tried-and-tested relatively successful approach for atemporal conceptual models and ontologies (e.g., for ORM, the ACE system), it makes sense to do something similar, but then for some temporal extension. As temporal conceptual modelling language I used one that has a Description Logics foundation (DLRUS [3,4]), for that easily links to ontologies as well. I added a few known temporal constraints (such as mandatory ones for relationships/DL roles) and removed others (some didn’t seem all that interesting), which still resulted in 34 constraints. For each one, I tried to devise more and less reasonable templates, resulting in 101 templates overall. Those templates were evaluated on semantics and preference by three temporal logic experts and five ‘mixed experts’ (experts in natural language generation, logic, or modelling). This resulted in a final set of preferred templates to verbalise the temporal constraints. The remainder of this post first describes a bit about the templates and then the results that I think are most interesting.

Templates

The basic idea of a template—in the context of the verbalisation of conceptual models and ontologies—is to have some natural language for the constraint where then the vocabulary gets slotted in at runtime. Take, for instance, simple named class subsumption in an ontology, $C \sqsubseteq D$, for which one could define a template “Each [C] is a(n) [D]”, so that with some axiom $Manager \sqsubseteq Employee$, it would generate the sentence “Each Manager is an Employee”. One also could have devised the template “All [C] are [D]” and then it would have generated “All Managers are Employees”. The choice between the two templates in this case is just taste, for in both cases, the semantics is the same. More complex axioms are not always that straightforward. For instance, for the axiom type $C \sqsubseteq \exists R.D$, would “Each [C] [R] some [D]” be good enough, or would perhaps “Each [C] must [R] at least one [D]” be better? E.g., “Each Professor teaches some Course” vs “Each Professor must teach at least one Course”.
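The slot-filling idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the template keys, the `verbalise` function, and the naive a/an rule are made up here and are not taken from any particular verbalisation tool.

```python
def article(noun):
    """Naive choice between 'a' and 'an', based on the first letter only."""
    return "an" if noun[0].lower() in "aeiou" else "a"


# Hypothetical template store: one string per axiom type, with named slots.
TEMPLATES = {
    "subsumption": "Each {C} is {a} {D}",
    "existential": "Each {C} {R} some {D}",
    "existential_mandatory": "Each {C} must {R} at least one {D}",
}


def verbalise(kind, **slots):
    """Fill the vocabulary of one axiom into the template for its type."""
    if "D" in slots:
        slots["a"] = article(slots["D"])  # only the subsumption template uses it
    return TEMPLATES[kind].format(**slots)


print(verbalise("subsumption", C="Manager", D="Employee"))
print(verbalise("existential", C="Professor", R="teaches", D="Course"))
print(verbalise("existential_mandatory", C="Professor", R="teach", D="Course"))
```

Note that the two existential templates need different verb forms (‘teaches’ vs ‘teach’), and that a template such as “All [C] are [D]” would additionally need pluralisation. So even for English, filling the slots is not always plain string substitution.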

The same can be done for the temporal constraints. To get there, I did a bit of a linguistic detour that informed the template design (described in the paper [5]). As a first example for templates, let us take the temporal class, which has the semantics $o \in C^{\mathcal{I}(t)} \rightarrow \exists t' \neq t. o \notin C^{\mathcal{I}(t')}$; for instance, UndergraduateStudent (assuming they graduate and end up as alumni or as drop outs, and weren’t undergrads from birth):

1. If an object is an instance of entity type [C], then there is some time where it is not a(n) [C].
2. [C] is an entity type whose objects are, for some time in their existence, not instances of [C].
3. [C] is an entity type of which each object is not a(n) [C] for some time during its existence.
4. All instances of entity type [C] are not a(n) [C] for some time.
5. Each [C] is not a(n) [C] for some time.
6. Each [C] is for some time not a(n) [C].

Which one(s) do you think captures the semantics, and which one(s) do you prefer?
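When judging the options, it may help to see the semantics of a temporal class at work on a small, finite example. The sketch below assumes a toy interpretation with a handful of time points and made-up objects; the function name and the data are hypothetical and only serve to illustrate the formula.

```python
def is_temporal_class(extension, timeline):
    """Check, over a finite timeline, that every object that is in C at some
    time t is not in C at some other time t' != t.

    extension maps a time point to the set of objects in C at that time."""
    for t in timeline:
        for obj in extension.get(t, set()):
            outside_somewhere = any(
                obj not in extension.get(t2, set())
                for t2 in timeline if t2 != t
            )
            if not outside_somewhere:
                return False
    return True


timeline = range(4)

# Thandi is an undergraduate at times 1 and 2 only, so the constraint holds.
undergrad = {1: {"thandi"}, 2: {"thandi"}}
print(is_temporal_class(undergrad, timeline))

# Sipho is in the class at every time point, which violates the constraint.
always_member = {t: {"sipho"} for t in timeline}
print(is_temporal_class(always_member, timeline))
```

A template captures the semantics only if it neither demands more than this check (e.g., that the object is never a C again) nor less.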

A more elaborate constraint for relationships is ‘dynamic extension for relationships, past, mandatory’, which is formalised as $\langle o , o' \rangle \in \mbox{{\sc RDexM}-}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow (\langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t'<t. \langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t')})$, where $\langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow ( \langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t'>t. \langle o , o' \rangle \in {\tt R_2}^{\mathcal{I}(t')})$; e.g., every passenger who boards a flight must have checked in for that flight. Two options could be:

1. Each ..C_1.. ..R_1.. ..C_2.. was preceded by ..C_1.. ..R_2.. ..C_2.. some time earlier.
2. Each ..C_1.. ..R_1.. ..C_2.. must be preceded by ..C_1.. ..R_2.. ..C_2.. .

I’m not saying they are all correct; they were some of the options given, which the participants could choose from and comment on. The full list of constraints and template options is available in the supplementary material, which also contains a file where you can fill in your own answers and see what the (anonymised) participants said, and it has the final list of ‘best’ constraints.

Results

The main aggregate quantitative results are shown in the following table.

Many observations can be made from the data (see the paper for details). Some of the salient aspects are that there was low inter-annotator agreement among the experts, even though they know each other (temporal logics is a small community), and that the ‘mixed group’ deemed many sentences correct that the experts deemed wrong in the sense of not properly capturing the semantics of the constraint. Put differently, it looks like the mixed experts, as a group, did not fully grasp some subtle distinctions in the temporal constraints.

With respect to the templates, the preferred ones don’t follow the structure of the logic, but are, in a way, a separate rendering, or: there’s no neat 1:1 mapping between axiom type and template structure. That said, that doesn’t mean that they always chose the shortest template: the experts definitely did not, while the mixed experts leaned a bit toward preferring templates with fewer words even though they were surely not always the semantically correct option.

It may not look good that the experts preferred different templates, but in a follow-up interview with one of the experts, the expert noted that it was not really a problem “for there is the logic that does have the precise meaning anyway” and thus “resolves any confusion that may arise from using slightly different terminology”. The temporal logic expert does have a point from the expert’s view, fair enough, but that pretty much defeats my aim with the experiment. Asking more non-experts may not be a good strategy either, for they are, on average, too lenient.

So, for now, we do have a set of, relatively, ‘best’ templates to verbalise temporal constraints in temporal conceptual models and ontologies. The next step is to compare that with the diagrammatic representation. This we did [6], and I’ll describe those results informally in a next post.

I’ll present more details at the upcoming CREOL: Contextual Representation of Events and Objects in Language Workshop that is part of the Joint Ontology Workshops 2017, which will be held next week (21-23 September) in Bolzano, Italy. As the KRDB group at FUB in Bolzano has a few temporal logic experts, I’m looking forward to the discussions! Also, I’d be happy if you would be willing to fill in the spreadsheet with your preferences (before looking at the answers given by the participants!), and send them to me.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[2] T. Shunmugam. Adoption of a visual model for temporal database representation. M. IT thesis, Department of Computer Science, University of Cape Town, South Africa, 2016.

[3] A. Artale, E. Franconi, F. Wolter, and M. Zakharyaschev. A temporal description logic for reasoning about conceptual schemas and queries. In S. Flesca, S. Greco, N. Leone, and G. Ianni, editors, Proceedings of the 8th Joint European Conference on Logics in Artificial Intelligence (JELIA-02), volume 2424 of LNAI, pages 98-110. Springer Verlag, 2002.

[4] A. Artale, C. Parent, and S. Spaccapietra. Evolving objects in temporal information systems. Annals of Mathematics and Artificial Intelligence, 50(1-2):5-38, 2007.

[5] Keet, C.M. Natural language template selection for temporal constraints. CREOL: Contextual Representation of Events and Objects in Language, Joint Ontology Workshops 2017, 21-23 September 2017, Bolzano, Italy. CEUR-WS Vol. (in print).

[6] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Springer LNCS. 6-9 Nov 2017, Valencia, Spain. (in print)

# A grammar of the isiZulu verb (present tense)

If you have read any of the blog posts on (automated) natural language generation for isiZulu, then you’ll probably agree with me that isiZulu verbs are non-trivial. True, verbs in other languages are most likely not as easy as in English, or Afrikaans for that matter (e.g., they made irregular verbs regular), but there are many little ‘bits and pieces’ ‘glued’ onto the verb root that make it semantically a ‘heavy’ element in a sentence. For instance:

• Aba-shana ba-ya-zi-theng-is-el-an-a                izimpahla
• ‘The children are selling the clothes to each other’

The ba is the subject concord (~conjugation) to match with the noun class (which is 2) of the noun that plays the subject in the sentence (abashana), the ya denotes a continuous action (‘are doing something’ in the present), the zi is the object concord for the noun class (8) of the noun that plays the object in the sentence (izimpahla), theng is the verb root, then comes the CARP extension with is the causative (turning ‘buy’ into ‘sell’), and el the applicative and an the reciprocative, which take care of the ‘to each other’, and then finally the final vowel a.

More precisely, the general basic structure of the verb is as follows:

where NEG is the negative; SC the subject concord; T/A denotes tense/aspect; MOD the mood; OC the object concord; Verb Rad the verb radical; C the causative; A the applicative; R the reciprocal; and P the passive. For instance, if the children were not selling the clothes to each other, then instead of the SC, there would be the NEG SC in that position, making the verb abayazithengiselana.
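As a rough illustration of the slot structure, the following Python sketch concatenates morphemes in a fixed slot order. The order is an assumption read off the legend above, with the final vowel (FV) from the earlier example appended at the end; the slot names and the function are made up, and the sketch ignores phonological conditioning and all co-occurrence restrictions between slots.

```python
# Assumed slot order, following the legend; FV is the final vowel.
SLOT_ORDER = ["NEG", "SC", "TA", "MOD", "OC", "RAD", "C", "A", "R", "P", "FV"]


def assemble_verb(**morphemes):
    """Concatenate the given morphemes in slot order; empty slots are skipped."""
    unknown = set(morphemes) - set(SLOT_ORDER)
    if unknown:
        raise ValueError(f"unknown slot(s): {sorted(unknown)}")
    return "".join(morphemes.get(slot, "") for slot in SLOT_ORDER)


# The verb from the example sentence: ba-ya-zi-theng-is-el-an-a
verb = assemble_verb(SC="ba", TA="ya", OC="zi", RAD="theng",
                     C="is", A="el", R="an", FV="a")
print(verb)  # bayazithengiselana
```

Filling the NEG slot with `a` in the same call yields the negated string given above. Plain concatenation like this accepts any combination of fillers, which is exactly the kind of overgeneration that grammar rules have to rein in.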

To make sense of all this in a way that it would be amenable to computation, we—my co-author Langa Khumalo and I—specified the grammar of the complex verb for the present tense in a CFG using an incremental process of development. To the best of our (and the reviewer’s) knowledge, the outcome of the lengthy exercise is (1) the first comprehensive and precisely formulated documentation of the grammar rules for the isiZulu verb present tense, (2) all together in one place (cf. fragments sprinkled around in different papers, Wikipedia, and outdated literature (Doke in 1927 and 1935)), and (3) goes well beyond handling just one of the CARP, among others. The figure below summarises those rules, which are explained in detail in the forthcoming paper “Grammar rules for the isiZulu complex verb”, which will be published in the Southern African Linguistics and Applied Language Studies [1] (finally in print, yay!).

It is one thing to write these rules down on paper, and another to verify whether they’re actually doing what they’re supposed to be doing. Instead of fallible and laborious manual checking, we put them in JFLAP (for the lack of a better alternative at the time; discussed in the paper) and tested the CFG both on generation and recognition. The tests went reasonably well, and they helped fix a rule during the testing phase.

Because the CFG doesn’t take into account phonological conditioning for the vowels, it generates strings not in the language. Such phonological conditioning is considered to be a post-processing step and was beyond the scope of elucidating and specifying the rules themselves. There are other causes of overgeneration that we did not get around to addressing, for various reasons: there are rules that go across the verb root, which are simple to represent in coding-style notation (see paper) but not so much in a CFG, and there are rules for different types of verbs, but there’s no available resource that lists which verb roots are intransitive, which are monosyllabic, and so on. We have started with scoping rules and solving issues for the latter, and do have a subset of phonological conditioning rules; so, to be continued… For now, though, we have completed at least one of the milestones.

Last, but not least, in case you wonder what’s the use of all this besides the linguistics to satisfy one’s curiosity and investigate and document an underresourced language: natural language generation for intelligent user interfaces in localised software, spellcheckers, and grammar checkers, among others.

References

[1] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, (in print). Submitted version (the rules are the same as in the final version)

# Aligning different relations: the case of part-whole relations—LDK2017

Despite the best intentions, I did not get around to writing a post on the paper that I presented last week at the First International Conference on Language, Data and Knowledge 2017, 19-20 June, Galway, Ireland, and now Paul Groth also ‘beat’ me to it writing a nice conference report of it. On the bright side, it is an opportunity to say upfront I really enjoyed the conference and look forward to the next edition in 2019. The ESWC’17 organisers might be slightly disappointed that there was no special track on the multilingual semantic web after all, but I did get the distinct impression that the LDK17 authors might just all have gambled on LDK17—an opportunity to binge two days on all things natural language & Semantic Web—rather than on one track at an overpriced conference (despite the allure of it being A-rated).

So, what was my paper about that could have been submitted to either? I ended up struggling with, and solving, an issue with aligning OWL object properties that were not simple 1:1 mappings, in a similar scope as our ESWC17 paper (introduced here) [4], but then with too many complications. Complications were due to the different conceptualisations of part-whole relations and to the requirement to decide what to do with an object property (relation, relationship) that does not have a neat, single label, and therewith fits neither the common OWL modelling paradigm nor the recently agreed-upon ontolex-lemon model for linguistic annotations.

The start of all this sounded nice and doable: we need to generate natural language for healthcare, using, e.g., SNOMED CT, in local languages in South Africa, focussing on the largest one, being isiZulu. Medical terminologies are riddled with part-whole relations, so we sought to address that one (simple existentials already having been solved), availing of a standard list of part-whole relations (e.g. [1]). That turned out to be a non-trivial exercise, but doable eventually [2]. What wasn’t addressed in [2] was that some ‘common’ part-whole relations, such as membership and containment, weren’t like that in isiZulu, at all. Moreover, it wasn’t just a language issue, but ontological as well. The LDK17 paper “Representing and aligning similar relations: parts and wholes in isiZulu vs English” [3] describes this in some detail.

Here’s a (simplified) list of (assumed to be) common part-whole relations, which takes into account both transitivity differences and domain and range:

Now here’s the one based on the isiZulu language and some ontological analysis of that:

That is: there are both generalisations—some distinctions are not being made—and specialisations—some distinctions are made here but not elsewhere. For instance, ‘musician is part of some orchestra’ and ‘heart is part of some human’ (or vv.) is all done and described in the same way (ingxenye ‘part of’ and SC+CONJ for ‘has part’ [more about that below]). Yet, there is a difference between an individual (e.g., a voter) participating in some process and a collective (e.g., the electorate) participating in a process, or vv. The paper describes this more precisely, going into some detail regarding the differences in categories of domain and range and into the consequences on transitivity of mereological parthood.

The other ‘odd thing’ (cf. current multilingual Semantic Web assumptions and technologies, that is) is that while the conceptualisation of ‘has part’ exists, it does not have a single label as in English (or in several other languages, such as heeft als deel), but depends on the nouns of the classes that play the part and the whole in the relation. It combines the subject concord (~conjugation) of the noun class of the noun that plays the whole with a conjunction that is phonologically conditioned based on the first letter of the noun that plays the part; with verbalisation in the plural and three phonological cases, there are 18 possible strings all denoting ‘has part’. This still could be sorted with a language with inverses, provided the part-of direction has a name, like the ingxenye. This is not the case for containment, however. Instead of the relation (object property) having a name, be this a verb like ‘contained in’ or some noun phrase, it is the noun that plays the whole (the container, if you will) that gets modified. For instance, imvilophu ‘envelope’ and emvilophini denoting ‘contained in the envelope’, or, for individuals and locations, the city iTheku ‘Durban’ and eThekwini meaning ‘located in Durban’ (no typo; there’s some phonological conditioning I’m brushing over). While I have gotten used to such constructions, it generated some surprise among several attendees that one can have notions, concepts, views on or interpretations or descriptions of reality, that exist but do not have even one single string of text that refers to them regardless of the context in which it is used.

The naming issue was solved by adding some arbitrary string as ‘name’ of the object property, and relating that to the function that verbalises that specific part-whole relation. The former issue, i.e., not all the same part-whole relations, required a bit more work, using ontology pattern alignments, by extending one correspondence pattern from the ODP catalogue and introducing a new one (see paper for the formal details), using the same broad framework of formalisation as proposed in [4].

All this was then implemented and aligned, and verified to not result in some unsatisfiable classes, object properties, or inconsistency (files). This also works in the isiZulu verbalisation tool we demo-ed at ESWC17 (described in the previous post) [5], all as part of the NRF-funded GeNI project.

Now, ideally, I already would have had the time to read the papers I flagged in my LDK17 conference notes with “check paper”. I haven’t yet due to end-of-semester tasks. So, on the basis of just a positive-seeming presentation, here are a few that are on the top of my list to check out first, for quite different reasons:

• Interaction between natural language reading capabilities and math education, focusing on language production (i.e., ‘can you talk about it?’) [6], mainly because math education in South Africa faces a lot of problems. It also generated a lively discussion in the Q&A session.
• The OnLiT ontology for linguistic terminology [7] and LLODifying linguistic glosses [8] (also: one of the two won the best paper award).
• Deep text generation, for it was looking at trying to address skewed or limited data to learn from [9], which is an issue we face when trying to do some NLP with most South African languages.

References

[1] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[2] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), September 5-8, 2016, Edinburgh, UK. ACL.

[3] Keet, C.M. Representing and aligning similar relations: parts and wholes in isiZulu vs English. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 58-73.

[4] Fillottrani, P.R., Keet, C.M. Patterns for Heterogeneous TBox Mappings to Bridge Different Modelling Decisions. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017.

[5] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[6] Crossley, S., Kostyuk, V. Letting the genie out of the lamp: using natural language processing tools to predict math performance. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 330-342.

[7] Klimek, B., McCrae, J.P., Lehmann, C., Chiarcos, C., Hellmann, S. OnLiT: an ontology for linguistic terminology. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 42-57.

[8] Chiarcos, C., Ionov, M., Rind-Pawlowski, M., Fäth, C., Wichers Schreur, J., Nevskaya, I. LLODifying linguistic glosses. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 89-103.

[9] Dethlefs N., Turner A. Deep Text Generation — Using Hierarchical Decomposition to Mitigate the Effect of Rare Data Points. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 290-298.