# The isiZulu spellchecker seems to contribute to ‘intellectualisation’ of isiZulu

Perhaps putting ‘intellectualisation’ in sneer quotes isn’t nice, but I still find it an odd term to refer to a process of (in short, from [1]) coming up with new vocabulary for scientific speech, expression, objective thinking, and logical judgments in a natural language. In the country I grew up, terms in our language were, and still are, invented more because of a push against cultural imperialism and for home language promotion rather than some explicit process to intellectualise the language in the sense of “let’s invent some terms because we need to talk about science in our own language” or “the language needs to grow up” sort of discourses. For instance, having introduced the beautiful word geheugensanering (NL) that captures the concept of ‘garbage collection’ (in computing) way better than the English joke-term for it, elektronische Datenverarbeitung (DE) for ‘ICT’, técnicas de barrido (ES) for ‘sweep line’ algorithms, and mot-dièse (FR) for [twitter] ‘hashtag’, to name but a few inventions.

Be that as it may, here in South Africa, it goes under the banner of intellectualisation, with particular reference to the indigenous languages [2]; e.g., having introduced umakhalekhukhwini ‘cell/mobile phone’ (decomposed: ‘the thing that rings in your pocket’) and ukudlulisa ikheli for ‘pass by reference’ in programming (longer list of isiZulu-English computing and ICT terms), which is occurring for multiple subject domains [3]. Now I ended up as co-author of a paper that has ‘intellectualisation’ in its title [4]: Evaluation of the effects of a spellchecker on the intellectualization of isiZulu that appeared just this week in the Alternation journal.

The main general question we sought to answer was whether human language technologies, and in particular the isiZulu spellchecker launched last year, contribute to the language’s intellectualisation. More specifically, we aimed to answer the following three questions:

1. Is the spellchecker meeting end-user needs and expectations?
2. Is the spellchecker enabling the intellectualisation of the language?
3. Is the lexicon growing upon using the spellchecker?

The answers in a nutshell are: 1) yes, the spellchecker does meet end-user needs and expectations (but there are suggestions further improving its functionality), 2) users perceive that the spellchecker enables the intellectualisation of the language, and 3) non-dictionary words were added, i.e., the lexicon is indeed growing.

The answer to the last question provides some interesting data for linguists to bite their teeth in. For instance, a user had added to the spellchecker’s dictionary LikaSekelaShansela, which is an inflected form of isekelashansela ‘Vice Chancellor’ (that is recognised as correct by the spellchecker). Also some inconsistencies—from a rule-of-thumb viewpoint—in word formation were observed; e.g., usosayensi ‘scientist’ vs. unompilo ‘nurse’. If one were to follow consistently the word formation process for various types of experts in isiZulu, such as usosayensi ‘scientist’, usolwazi ‘professor’, and usomahlaya ‘comedian’, then one reasonably could expect ‘nurse’ to be *usompilo rather than unompilo. Why it isn’t, we don’t know. Regardless, the “add to dictionary” option of the spellchecker proved to be a nice extra feature for a data-driven approach to investigate intellectualisation of a language.

Version 1 of the isiZulu spellchecker that was used in the evaluation was ok and reasonably could not have interfered negatively with any possible intellectualisation (average SUS score of 75 and median 82.5, so ‘good’). It was ok in the sense that a majority of respondents thought that the entire tool was helpful, no features should be removed, it enhances their work, and so on (see paper for details). For the software developers among you who have spare time: they’d like, mainly, to have it as a Chrome and MS Word plugin, predictive text/autocomplete, and have it working on the mobile phone. The spellchecker has improved in the meantime thanks to two honours students, and I will write another blog post about that next.

As a final reflection: it turned out there isn’t a way to measure the level of intellectualisation in a ‘hard sciences’ way, so we concluded the other answers based on data that came from the somewhat fluffy approach of a survey and in-depth interviews (a ‘mixed-methods’ approach, to give it a name). It would be nice to have a way to measure it, though, so one would be able to say which languages are more or less intellectualised, what level of intellectualisation is needed to have a language as language of instruction and science at tertiary level of education and for dissemination of scientific knowledge, and to what extent some policy x, tool y, or activity z contributes to the intellectualization of a language.

References

[1] Havránek, B. 1932. The functions of literary language and its cultivation. In Havránek, B and Weingart, M. (Eds.). A Prague School Reader on Esthetics, Literary Structure and Style. Prague: Melantrich: 32-84.

[2] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[3]Khumalo, L. Intellectualization through terminology development. Lexikos, 2017, 27: 252-264.

[4] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

# Figuring out the verbalisation of temporal constraints in ontologies and conceptual models

Temporal conceptual models, ontologies, and their logics are nothing new, but that sort of information and knowledge representation still doesn’t gain a lot of traction (cf. say, formal methods for verification). This is in no small part because modelling temporal information is not easy. Several conceptual modelling languages do have various temporal extensions, but most modellers don’t even use all of the default language features yet [1]. How could one at least reduce the barrier to adoption of temporal logics and modelling languages? The two principle approaches are visualisation with a diagrammatic language and rendering it in a (pseudo-)natural language. One of my postgraduate students looked at the former, trying to figure out what would be the best icons and such, which showed there was still a steep learning curve [2]. Before examining whether that could be optimised, I wondered whether the natural language option might be promising. The problem was, that no-one had yet tried to determine what the natural language counterpart of the temporal constraints were supposed to be, let alone whether they be ‘adequate’ or the ‘best’ way of rendering the temporal constraints in tolerable natural language sentences. I wanted to know that badly enough that I tried to find out.

Given that using templates is a tried-and-tested relatively successful approach for atemporal conceptual models and ontologies (e.g., for ORM, the ACE system), it makes sense to do something similar, but then for some temporal extension. As temporal conceptual modelling language I used one that has a Description Logics foundation (DLRUS [3,4]) for that easily links to ontologies as well, added a few known temporal constraints (like for relationships/DL roles, mandatory) and removing others (some didn’t seem all that interesting), which resulted in 34 constraints, still. For each one, I tried to devise more and less reasonable templates, resulting in 101 templates overall. Those templates were evaluated on semantics and preference by three temporal logic experts and five ‘mixed experts’ (experts in natural language generation, logic, or modelling). This resulted in a final set of preferred templates to verbalise the temporal constraints. The remainder of this post first describes a bit about the templates and then the results of which I think they are most interesting.

Templates

The basic idea of a template—in the context of the verbalisation of conceptual models and ontologies—is to have some natural language for the constraint where then the vocabulary gets slotted in at runtime. Take, for instance, simple named class subsumption in an ontology, $C \sqsubseteq D$, for which one could define a template “Each [C] is a(n) [D]”, so that with some axiom $Manager \sqsubseteq Employee$, it would generate the sentence “Each Manager is an Employee”. One also could have devised the template “All [C] are [D]” and then it would have generated “All Managers are Employees”. The choice between the two templates in this case is just taste, for in both cases, the semantics is the same. More complex axioms are not always that straightforward. For instance, for the axiom type $C \sqsubseteq \exists R.D$, would “Each [C] [R] some [D]” be good enough, or would perhaps “Each [C] must [R] at least one [D]” be better? E.g., “Each Professor teaches some Course” vs “Each Professor must teach at least one Course”.

The same can be done for the temporal constraints. To get there, I did a bit of a linguistic detour that informed the template design (described in the paper [5]). Let us take as first example for templates temporal class that has a semantics of $o \in C^{\mathcal{I}(t)} \rightarrow \exists t' \neq t. o \notin C^{\mathcal{I}(t')}$; for instance, UndergraduateStudent (assuming they graduate and end up as alumni or as drop outs, and weren’t undergrads from birth):

1. If an object is an instance of entity type [C], then there is some time where it is not a(n) [C].
2. [C] is an entity type whose objects are, for some time in their existence, not instances of [C].
3. [C] is an entity type of which each object is not a(n) [C] for some time during its existence.
4. All instances of entity type [C] are not a(n) [C] for some time.
5. Each [C] is not a(n) [C] for some time.
6. Each [C] is for some time not a(n) [C].

Which one(s) do you think captures the semantics, and which one(s) do you prefer?

A more elaborate constraint for relationships is ‘dynamic extension for relationships, past, mandatory], which is formalised as $\langle o , o' \rangle \in \mbox{{\sc RDexM}-}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow (\langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t' where $\langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow ( \langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t'>t. \langle o , o' \rangle \in {\tt R_2}^{\mathcal{I}(t')})$.; e.g., every passenger who boards a flight must have checked in for that flight. Two options could be:

1. Each ..C_1.. ..R_1.. ..C_2.. was preceded by ..C_1.. ..R_2.. ..C_2.. some time earlier.
2. Each ..C_1.. ..R_1.. ..C_2.. must be preceded by ..C_1.. ..R_2.. ..C_2.. .

I’m not saying they are all correct; they were some of the options given, which the participants could choose from and comment on. The full list of constraints and template options are available in the supplementary material, which also contains a file where you can fill in your own answers, see what the (anonymised) participants said, and it has the final list of ‘best’ constraints.

Results

The main aggregate quantitative results are shown in the following table.

Many observations can be made from the data (see the paper for details). Some of the salient aspects are that there was low inter-annotator agreement among the experts, despite that they know each other (temporal logics is a small community) and that the ‘mixed group’ deemed many sentences correct that the experts deemed wrong in the sense of not properly capturing the semantics of the constraint. Put differently, it looks like the mixed experts, as a group, did not fully grasp some subtle distinction in the temporal constraints.

With respect to the templates, the preferred ones don’t follow the structure of the logic, but are, in a way, a separate rendering, or: there’s no neat 1:1 mapping between axiom type and template structure. That said, that doesn’t mean that they always chose the shortest template: the experts definitely did not, while the mixed experts leaned a bit toward preferring templates with fewer words even though they were surely not always the semantically correct option.

It may not look good that the experts preferred different templates, but in a follow-up interview with one of the experts, the expert noted that it was not really a problem “for there is the logic that does have the precise meaning anyway” and thus “resolves any confusion that may arise from using slightly different terminology”. The temporal logic expert does have a point from the expert’s view, fair enough, but that pretty much defeats my aim with the experiment. Asking more non-experts may not be a good strategy either, for they are, on average, too lenient.

So, for now, we do have a set of, relatively, ‘best’ templates to verbalise temporal constraints in temporal conceptual models and ontologies. The next step is to compare that with the diagrammatic representation. This we did [6], and I’ll describe those results informally in a next post.

I’ll present more details at the upcoming CREOL: Contextual Representation of Events and Objects in Language Workshop that is part of the Joint Ontology Workshops 2017, which will be held next week (21-23 September) in Bolzano, Italy. As the KRDB group at FUB in Bolzano has a few temporal logic experts, I’m looking forward to the discussions! Also, I’d be happy if you would be willing to fill in the spreadsheet with your preferences (before looking at the answers given by the participants!), and send them to me.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[2] T. Shunmugam. Adoption of a visual model for temporal database representation. M. IT thesis, Department of Computer Science, University of Cape Town, South Africa, 2016.

[3] A. Artale, E. Franconi, F. Wolter, and M. Zakharyaschev. A temporal description logic for reasoning about conceptual schemas and queries. In S. Flesca, S. Greco, N. Leone, and G. Ianni, editors, Proceedings of the 8th Joint European Conference on Logics in Artificial Intelligence (JELIA-02), volume 2424 of LNAI, pages 98-110. Springer Verlag, 2002.

[4] A. Artale, C. Parent, and S. Spaccapietra. Evolving objects in temporal information systems. Annals of Mathematics and Artificial Intelligence, 50(1-2):5-38, 2007.

[5] Keet, C.M. Natural language template selection for temporal constraints. CREOL: Contextual Representation of Events and Objects in Language, Joint Ontology Workshops 2017, 21-23 September 2017, Bolzano, Italy. CEUR-WS Vol. (in print).

[6] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Springer LNCS. 6-9 Nov 2017, Valencia, Spain. (in print)

# A grammar of the isiZulu verb (present tense)

If you have read any of the blog posts on (automated) natural language generation for isiZulu, then you’ll probably agree with me that isiZulu verbs are non-trivial. True, verbs in other languages are most likely not as easy as in English, or Afrikaans for that matter (e.g., they made irregular verbs regular), but there are many little ‘bits and pieces’ ‘glued’ onto the verb root that make it semantically a ‘heavy’ element in a sentence. For instance:

• Aba-shana ba-ya-zi-theng-is-el-an-a                izimpahla
• ‘The children are selling the clothes to each other’

The ba is the subject concord (~conjugation) to match with the noun class (which is 2) of the noun that plays the subject in the sentence (abashana), the ya denotes a continuous action (‘are doing something’ in the present), the zi is the object concord for the noun class (8) of the noun that plays the object in the sentence (izimpahla), theng is the verb root, then comes the CARP extension with is the causative (turning ‘buy’ into ‘sell’), and el the applicative and an the reciprocative, which take care of the ‘to each other’, and then finally the final vowel a.

More precisely, the general basic structure of the verb is as follows:

where NEG is the negative; SC the subject concord; T/A denotes tense/aspect; MOD the mood; OC the object concord; Verb Rad the verb radical; C the causative; A the applicative; R the reciprocal; and P the passive. For instance, if the children were not selling the clothes to each other, then instead of the SC, there would be the NEG SC in that position, making the verb abayazithengiselana.

To make sense of all this in a way that it would be amenable to computation, we—my co-author Langa Khumalo and I—specified the grammar of the complex verb for the present tense in a CFG using an incremental process of development. To the best of our (and the reviewer’s) knowledge, the outcome of the lengthy exercise is (1) the first comprehensive and precisely formulated documentation of the grammar rules for the isiZulu verb present tense, (2) all together in one place (cf. fragments sprinkled around in different papers, Wikipedia, and outdated literature (Doke in 1927 and 1935)), and (3) goes well beyond handling just one of the CARP, among others. The figure below summarises those rules, which are explained in detail in the forthcoming paper “Grammar rules for the isiZulu complex verb”, which will be published in the Southern African Linguistics and Applied Language Studies [1] (finally in print, yay!).

It is one thing to write these rules down on paper, and another to verify whether they’re actually doing what they’re supposed to be doing. Instead of fallible and laborious manual checking, we put them in JFLAP (for the lack of a better alternative at the time; discussed in the paper) and tested the CFG both on generation and recognition. The tests went reasonably well, and it helped fixing a rule during the testing phase.

Because the CFG doesn’t take into account phonological conditioning for the vowels, it generates strings not in the language. Such phonological conditioning is considered to be a post-processing step and was beyond the scope of elucidating and specifying the rules themselves. There are other causes of overgeneration that we did not get around to doing, for various reasons: there are rules that go across the verb root, which are simple to represent in coding-style notation (see paper) but not so much in a CFG, and rules for different types of verbs, but there’s no available resource that lists which verb roots are intransitive, which as monosyllabic and so on. We have started with scoping rules and solving issues for the latter, and do have a subset of phonological conditioning rules; so, to be continued… For now, though, we have completed at least one of the milestones.

Last, but not least, in case you wonder what’s the use of all this besides the linguistics to satisfy one’s curiosity and investigate and document an underresourced language: natural language generation for intelligent user interfaces in localised software, spellcheckers, and grammar checkers, among others.

References

[1] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, (in print). Submitted version (the rules are the same as in the final version)

# Aligning different relations: the case of part-whole relations—LDK2017

Despite the best intentions, I did not get around to writing a post on the paper that I presented last week at the First International Conference on Language, Data and Knowledge 2017, 19-20 June, Galway, Ireland, and now Paul Groth also ‘beat’ me to it writing a nice conference report of it. On the bright side, it is an opportunity to say upfront I really enjoyed the conference and look forward to the next edition in 2019. The ESWC’17 organisers might be slightly disappointed that there was no special track on the multilingual semantic web after all, but I did get the distinct impression that the LDK17 authors might just all have gambled on LDK17—an opportunity to binge two days on all things natural language & Semantic Web—rather than on one track at an overpriced conference (despite the allure of it being A-rated).

So, what was my paper about that could have been submitted to either? I ended up struggling—and solving—an issue with aligning OWL object properties that were not simple 1:1 mappings, in a similar scope as our ESWC17 paper (introduced here) [4], but then with too many complications. Complications were due to the different conceptualisations of part-whole relations and that one of the requirements was to solve what to do with an object property (relation, relationship) that does not have a neat, single, label, and therewith neither fitting with the common OWL modelling paradigm nor with the recently agreed-upon ontolex-lemon model for linguistic annotations.

The start of all this sounded nice and doable: we need to generate natural language for healthcare, using, e.g., SNOMED CT, in local languages in South Africa, focussing on the largest one, being isiZulu. Medical terminologies are riddled with part-whole relations, so we sought to address that one (simple existentials already having been solved), availing of a standard list of part-whole relations (e.g. [1]). That turned out to be a non-trivial exercise, but doable eventually [2]. What wasn’t addressed in [2] was that some ‘common’ part-whole relations, such as membership and containment, weren’t like that in isiZulu, at all. Moreover, it wasn’t just a language issue, but ontological as well. The LDK17 paper “Representing and aligning similar relations: parts and wholes in isiZulu vs English” [3] describes this in some detail.

Here’s a (simplified) list of (assumed to be) common part-whole relations, which takes into account both transitivity differences and domain and range:

Now here’s the one based on the isiZulu language and some ontological analysis of that:

That is: there are both generalisations—some distinctions are not being made—and specialisations—some distinctions are made here but not elsewhere. For instance, ‘musician is part of some orchestra’ and ‘heart is part of some human’ (or vv.) is all done and described in the same way (ingxenye ‘part of’ and SC+CONJ for ‘has part’ [more about that below]). Yet, there is a difference between an individual (e.g., a voter) participating in some process and a collective (e.g., the electorate) participating in a process, or vv. The paper describes this more precisely, going into some detail regarding the differences in categories of domain and range and into the consequences on transitivity of mereological parthood.

The other ‘odd thing’—cf. current multilingual Semantic Web assumptions and technologies, that is—is that while the conceptualisation of ‘has part’ exists, it does not have a single label as in English (or in several other languages, such as heeft as deel), but it is dependent on the noun class of the noun of the class that play the part and play the whole in the relation. It combines the subject concord (~conjugation) of the noun class of the noun that plays the whole with a conjunction that is phonologically conditioned based on the first letter of the noun that plays the part; with verbalisation in the plural and three phonological cases, there are 18 possible strings all denoting ‘has part’. This still could be sorted with a language with inverses, provided the part-of direction has a name, like the ingxenye. This is not the case for containment, however. Instead of the relation (object property) having a name—be this a verb like ‘contained in’ or some noun phrase—it is the noun that plays the whole (the container, if you will) that gets modified. For instance, imvilophu ‘envelope’ and emvilophini denoting ‘contained in the envelope’, or, for individuals and locations, the city iTheku ‘Durban’ and eThekwini meaning ‘located in Durban’ (no typo—there’s some phonological conditioning I’m brushing over). While I have gotten used to such constructions, it generated some surprise among several attendees that one can have notions, concepts, views on or interpretations or descriptions of reality, that exist but do not have even one single string of text throughout to refer to regardless the context it is used.

The naming issue was solved by adding some arbitrary string as ‘name’ of the object property, and relating that to the function that verbalises that specific part-whole relation. The former issue, i.e., not all the same part-whole relations, required a bit more work, using ontology pattern alignments, by extending one correspondence pattern from the ODP catalogue and introducing a new one (see paper for the formal details), using the same broad framework of formalisation as proposed in [4].

All this was then implemented and aligned, and verified to not result in some unsatisfiable classes, object properties, or inconsistency (files). This also works in the isiZulu verbalisation tool we demo-ed at ESWC17 (described in the previous post) [5], all as part of the NRF-funded GeNI project.

Now, ideally, I already would have had the time to read the papers I flagged in my LDK17 conference notes with “check paper”. I haven’t yet due to end-of-semester tasks. So, on the basis of just a positive-seeming presentation, here are a few that are on the top of my list to check out first, for quite different reasons:

• Interaction between natural language reading capabilities and math education, focusing on language production (i.e., ‘can you talk about it?’) [6], mainly because math education in South Africa faces a lot of problems. It also generated a lively discussion in the Q&A session.
• The OnLiT ontology for linguistic [7] and LLODifying linguistic glosses [8] terminology (also: one of the two also won the best paper award).
• Deep text generation, for it was looking at trying to address skewed or limited data to learn from [9], which is an issue we face when trying to do some NLP with most South African languages.

References

[1] Keet, C.M., Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2):91-110.

[2] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), September 5-8, 2016, Edinburgh, UK. ACL.

[3] Keet, C.M. Representing and aligning similar relations: parts and wholes in isiZulu vs English. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 58-73.

[4] Fillottrani, P.R., Keet, C.M. Patterns for Heterogeneous TBox Mappings to Bridge Different Modelling Decisions. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017.

[5] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[6] Crossley, S., Kostyuk, V. Letting the genie out of the lamp: using natural language processing tools to predict math performance. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 330-342.

[7] Klimek, B., McCrae, J.P., Lehmann, C., Chiarcos, C., Hellmann, S. OnLiT: and ontology for linguistic terminology. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 42-57.

[8] Chiarcos, C., Ionov, M. Rind-Pawlowski, M., Fäth, C., Wichers Schreur, J., Nevskaya. I. LLODifying linguistic glosses. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 89-103.

[9] Dethlefs N., Turner A. Deep Text Generation — Using Hierarchical Decomposition to Mitigate the Effect of Rare Data Points. In: Gracia J., Bond F., McCrae J., Buitelaar P., Chiarcos C., Hellmann S. (eds) Language, Data, and Knowledge LDK 2017. Springer LNAI vol 10318, 290-298.

# Our ESWC17 demos: TDDonto2 and an OWL verbaliser for isiZulu

Besides the full paper on heterogeneous alignments for 14th Extended Semantic Web Conference (ESWC’17) that will take place next week in Portoroz, Slovenia, we also managed to squeeze out two demo papers. You may already know of TDDonto2 with Kieren Davies and Agnieszka Lawrynowicz, which was discussed in an earlier post that has been updated with a tutorial video. It now has a demo paper as well [1], which describes the rationale and a few scenarios. The other demo, with Musa Xakaza and Langa Khumalo, is new-new, but the regular reader might have seen it coming: we finally managed to link the verbalisation patterns for certain Description Logic axiom types [2,3] to those in OWL ontologies. The tool takes as input an ontology in isiZulu and the verbalisation algorithms, and out come the isiZulu sentences, be this in plain text for further processing or in a GUI for inspection by a domain expert [4]. There is a basic demo-screencast to show it’s all working.

The overall architecture may be of interest, for it deviates from most OWL verbalisers. It is shown in the following figure:

For instance, we use the Python-based OWL API Owlready, rather than a Java-based app, for Python is rather popular in NLP and the verbalisation algorithms may be used elsewhere as well. We made more such decisions with the aim to make whatever we did as multi-purpose usable as possible, like the list of nouns with noun classes (surprisingly, and annoyingly, there is no such readily available list yet, though isizulu.net probably will have it somewhere but inaccessible), verb roots, and exceptions in pluralisation. (Problems for integrating the verbaliser with, say, Protégé will be interesting to discuss during the demo session!)

The text-based output doesn’t look as nice as the GUI interface, so I will show here only the GUI interface, which is adorned with some annotations to illustrate that those verbalisation algorithms in the background are far from trivial templates:

For instance, while in English the universal quantification is always ‘Each’ or ‘All’ regardless the named class quantified over, in isiZulu it depends on the noun class of the noun that is the name of the OWL class. For instance, in the figure above, izingwe ‘leopards’ is in noun class 10, so the ‘Each/All’ is Zonke, amavazi ‘vases’ is in noun class 6, so ‘Each/All’ then becomes Onke, and abantu ‘people’/’humans’ is in noun class 2, making Bonke. There are 17 noun classes. They also determine the subject concords (SC, alike conjugation) for the verbs, with zi- for noun class 10, ­a- for noun class 6, and ba- for noun class 2, to name a few. How this all works is described in [2,3]. We’ve implemented all those algorithms and integrated the pluraliser [5] in it to make it work. The source files are available to check and play with already, you can do so and ask us during the ESWC17 demo session, and/or also have a look at the related outputs of the NRF-funded project Grammar Engine for Nguni natural language interfaces (GeNi).

References

[1] Davies, K. Keet, C.M., Lawrynowicz, A. TDDonto2: A Test-Driven Development Plugin for arbitrary TBox and ABox axioms. Extended Semantic Web Conference (ESWC’17), Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157.

[3] Keet, C.M., Khumalo, L. On the verbalization patterns of part-whole relations in isiZulu. 9th International Natural Language Generation conference (INLG’16), 5-8 September, 2016, Edinburgh, UK. Association for Computational Linguistics, 174-183.

[4] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS. Portoroz, Slovenia, May 28 – June 2, 2017. (demo paper)

[5] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey.

# Robot peppers, monkey gland sauce, and go well—Say again? reviewed

The previous post about TDDonto2 had as toy example a pool braai, which does exist in South Africa at least, but perhaps also elsewhere under a different name: the braai is the ‘South African English’ (SAE) for the barbecue. There are more such words and phrases peculiar to SAE, and after the paper deadline last week, I did finish reading the book Say again? The other side of South African English by Jean Branford and Malcolm Venter (published earlier this year) that has many more examples of SAE and a bit of sociolinguistics and some etymology of that. Anyone visiting South Africa will encounter at least several of the words and sentence constructions that are SAE, but probably would raise eyebrows elsewhere. Let me start with some examples.

Besides the braai, one certainly will encounter the robot, which is a traffic light (automating the human police officer). A minor extension to that term can be found in the supermarket (see figure on the right): robot peppers, being a bag of three peppers in the colours of red, yellow, and green—no vegetable AI, sorry.

How familiar the other ones discussed in the book are, depends on how much you interact with South Africans, where you stay(ed), and how much you read and knew about the country before visiting it, I suppose. For instance, when I visited Pretoria in 2008, I had not come across the bunny, but did so upon my first visit in Durban in 2010 (it’s a hollowed-out half a loaf of bread, filled with a curry) and bush college upon starting to work at a university (UKZN) here in 2011. The latter is a derogatory term that was used for universities for non-white students in the Apartheid era, with the non-white being its own loaded term from the same regime. (It’s better not to use it—all terms for classifying people one way or another are a bit of a mine field, whose nuances I’m still trying to figure out; the book didn’t help with that).

Then there’s the category of words one may know from ‘general English’, but are by the authors claimed to have a different meaning here. One is the sell-outs, which is “to apply particularly to black people who were thought to have betrayed their people” (p143), though I have the impression it can be applied generally. Another is townhouse, which supposedly has narrowed its meaning cf. British English (p155), but from having lived on the isles some years ago, it was used in the very same way as it is here; the book’s authors just stick to its older meaning and assume the British and Irish do so too (they don’t, though). One that indeed does fall in the category ‘meaning restriction’ is transformation (an explanation of the narrower sense will take up too much space). While I’ve learned a bunch of the ‘unusual’ usual words in the meantime I’ve worked here, there were others that I still did wonder about. For instance, the lay-bye, which the book explained to be the situation when the shop sets aside a product the customer wants, and the customer pays the price in instalments until it is fully paid before taking the product home. The monkey gland sauce one can buy in the supermarket is another, which is a sauce based on ketchup and onion with some chutney in it—no monkeys and no glands—but, I’ll readily admit, I still have not tried it due to its awful name.

There are many more terms described and discussed in the book, and it has a useful index at the end, especially given that it gives the impression to be a very popsci-like book. The content is very nicely typesetted, with news item snippets and aside-boxes and such. Overall, though, while it’s ok to read in the gym on the bicycle for a foreigner who sometimes wonders about certain terms and constructions, it is rather uni-dimensional from a British White South African perspective and the authors are clearly Cape Town-based, with the majority of examples from SA media from Cape Town’s news outlets. They take a heavily Afrikaans-influence-only bias, with, iirc, only four examples of the influence of, e.g., isiZulu on SAE (e.g., the ‘go well’ literal translation of isiZulu’s hamba kahle), which is a missed opportunity. A quick online search reveals quite a list of words from indigenous languages that have been adopted (and more here and here and here and here) such as muti (medicine; from the isiZulu umuthi) and maas (thick sour milk; from the isiZulu amasi) and dagga (marijuana; from the Khoe daxa-b), not to mention the many loan words, such as indaba (conference; isiZulu) and ubuntu (the concept, not the operating system—which the authors seem to be a bit short of, given the near blind spot on import of words with a local origin). If that does not make you hesitant to read it, then let me illustrate some more inaccuracies beyond the aforementioned townhouse squabble, which results in having to take the book’s contents probably with a grain of salt and heavily contextualise it, and/or at least fact-check it yourself. They fall in at least three categories: vocabulary, grammar, and etymology.

To quote: “This came about because the Dutch term tijger means either tiger or leopard” (p219): no, we do have a word for leopard: luipaard. That word is included even in a pocket-size Prisma English-Dutch dictionary or any online EN-NL dictionary, so a simple look-up to fact-check would have sufficed (and it existed already in Dutch before a bunch of them started colonising South Africa in 1652; originating from old French in ~1200). Not having done so smells of either sloppiness or arrogance. And I’m not so sure about the widespread use of pavement special (stray or mongrel dogs or cats), as my backyard neighbours use just stray for ‘my’ stray cat (whom they want to sterilise because he meows in the morning). It is a fun term, though.

Then there’s stunted etymology of words. The coconut is not a term that emerged in the “new South Africa” (pp145-146), but is transferred from the Americas where it was already in use for at least since the 1970s to denote the same concept (in short: a brown skinned person who is White on the inside) but then applied to some people from Central and South America [Latino/Hispanic; take your pick].

Extending the criticism also to the grammar explanations, the “with” aside box on pp203-204 is wrong as well, though perhaps not as blatantly obvious as the leopard and coconut ones. The authors stipulate that phrases like “Is So-and-So coming with?” (p203) is Afrikaans influence of kom saam “where saam sounds like ‘with’” (p203) (uh, no, it doesn’t), and as more guessing they drag a bit of German influence in US English into it. This use, and the related examples like the “…I have to take all my food with” (p204) is the same construction and similar word order for the Dutch adverb mee ‘with’ (and German mit), such as in the infinitives meekomen ‘to come with’ (komen = to come), meenemen ‘to take with’, meebrengen ‘to bring with’, and meegaan ‘to go with’. In a sentence, the mee may be separated from the rest of the verb and put somewhere, including at the end of the sentence, like in ik neem mijn eten mee ‘I take my food with’ (word-by-word translated) en komt d’n dieje mee? ‘comes so-and-so with?’ (word-by-word translated, with a bit of ABB in the Dutch). German has similar infinitives—mitkommen, mitnehmen, mitbringen, and mitgehen, respectively—sure, but the grammar construction the book’s authors highlight is so much more likely to come from Dutch as first step of tracing it back, given that Afrikaans is a ‘simplified’ version of Dutch, not of German. (My guess would be that the Dutch mee- can be traced back, in turn, to the German mit, as Dutch is a sort of ‘simplified’ German, but that’s a separate story.)

In closing, I could go on with examples and corrections, and maybe I should, but I think I made the point clear. The book didn’t read as badly as it may seem from this review, but writing the review required me to fact-check a few things, rather than taking most of it at face value, which made it turn out more and more mediocre than the couple of irritations I had whilst reading it.

# Launch of the isiZulu spellchecker

Langa Khumalo, ULPDO director, giving the spellchecker demo, pointing out a detected spelling error in the text. On his left, Mpho Monareng, CEO of PanSALB.

Yesterday, the isiZulu spellchecker was launched at UKZN’s “Launch of the UKZN isiZulu Books and Human Language Technologies” event, which was also featured on 702 live radio, SABC 2 Morning Live, and e-news during the day. What we at UCT have to do with it is that both the theory and the spellchecker tool were developed in-house by members of the Department of Computer Science at UCT. The connection with UKZN’s University Language Planning & Development Office is that we used a section of their isiZulu National Corpus (INC) [1] to train the spellchecker with, and that they wanted a spellchecker (the latter came first).

The theory behind the spellchecker was described briefly in an earlier post and it has been presented at IST-Africa 2016 [2]. Basically, we don’t use a wordlist + rules-based approach as some experiments of 20 years ago did, nor a wordlist + a few rules of the now-defunct translate.org.za OpenOffice v3 plugin seven years ago, but a data-driven approach with a statistical language model that uses tri-grams. The section of the INC we used were novels and news items, so, including present-day isiZulu texts. At the time of the IST-Africa’16 paper, based on Balone Ndaba’s BSc CS honours project, the spell checking was very proof-of-concept, but it showed that it could be done and still achieve a good enough accuracy. We used that approach to create an enduser-usable isiZulu spellchecker, which saw the light of day thanks to our 3rd-year CS@UCT student Norman Pilusa, who both developed the front-end and optimised the backend so that it has an excellent performance.

Upon starting the platform-independent isiZulu_spellchecker.jar file, the English interface version looks like this:

You can write text in the text box, or open a txt or docx file, which then is displayed in the textbox. Click “Run”. Now there are two options: you can choose to step-through the words that are detected as misspelled one at a time or “Show All” words that are detected as misspelled. Both are shown for some sample text in the screenshot below.

processing one error at a time

highlighting all words detected as very probably misspelled

Then it is up to you to choose what to do with it: correct it in the textbox, “Ignore once”, “Ignore all”, or “Add” the word to your (local) dictionary. If you have modified the text, you can save it with the changes made by clicking “Save correction”. You also can switch the interface from the default English to isiZulu by clicking “File – Use English”, and back to English via “iFayela – ulimi lesingisi”. You can download the isiZulu spellchecker from the ULPDO website and from the GitHub repository for those who want to get their hands on the source code.

To anticipate some possible questions you may have: incorporating it as a plugin to Microsoft word, OpenOffice/LibreOffice, and Mozilla Firefox was in the planning. The former is technologically ‘closed source’, however, and the latter two have a certain way of doing spellchecking that is not amenable to the data-driven approach with the trigrams. So, for now, it is a standalone tool. By design, it is desktop-based rather than for mobile phones, because according to the client (ULPDO@UKZN), they expect the first users to be professionals with admin documents and emails, journalists writing articles, and such, writing on PCs and laptops.

There was also a trade-off between a particular sort of error: the tool now flags more words as probably incorrect than it could have, yet it will detect (a subset of) capitalization, correctly, such as KwaZulu-Natal whilst flagging some of the deviant spellings that go around, as shown in the screenshot below.

The customer preferred recognising such capitalisation.

Error correction sounds like an obvious feature as well, but that will require a bit more work, not just technologically, but also the underlying theory. It will probably be an honours project topic for next year.

In the grand scheme of things, the current v1 of the spellchecker is only a small step—yet, many such small steps in succession will get one far eventually.

The launch itself saw an impressive line-up of speeches and introductions: the keynote address was given by Dr Zweli Mkhize, UKZN Chancellor and member of the ANC NEC; Prof Ramesh Krishnamurthy, from Aston University UK, gave the opening address; Mpho Monareng, CEO of PanSALB gave an address and co-launched the human language technologies; UKZN’s VC Andre van Jaarsveld provided the official welcome; and two of UKZN’s DVCs, Prof Renuka Vithal and Prof Cheryl Potgieter, gave presentations. Besides our ‘5-minutes of fame’ with the isiZulu spellchecker, the event also launched the isiZulu National Corpus, the isiZulu Term Bank, the ZuluLex mobile-compatible application (Android and iPhone), and two isiZulu books on collected short stories and an English-isiZulu architecture glossary.

References

[1] Khumalo, L. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[2] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.