Pluralising isiZulu nouns, automatically

Generating the plural of a noun in the singular looked so trivial… one set of ‘boring’ rules derived from the prefixes listed in the standard table of the isiZulu noun class system (neatly paired by singular and plural) and you’re done. That is also why the original verbalisation algorithm in the RuleML’14 paper had just one method for it [1]. Then came the testing with a set of nouns, and some nouns did not quite stick to that neat table. For example, indoda ‘man’ is in noun class 9, so it ought to take the prefix of noun class 10 for its plural, yet the plural is amadoda ‘men’ (noun class 6). Or take noun class pair 7 and 8, with prefixes isi- and izi- for singular and plural, respectively: that generally holds, just not when the stem commences with a vowel. We bumped into a few of those, which required us to take a step back and investigate pluralising isiZulu nouns systematically: how well does that ‘standard table’ actually work, what can be said about the nouns that don’t quite adhere to it, and do their underlying causes also occur in other Bantu languages?

We now have the answers to those questions, described in the paper “Pluralising Nouns in isiZulu and Related Languages” [2], which was recently accepted at the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), taking place next week in Konya, Turkey.

Pluralising nouns is nothing novel per se: the general approach is to take some grammar book and write a bunch of regular expressions or rules, or to use a data-oriented approach and find all the rules that way. That does not quite work for isiZulu, because it is an under-resourced language, so we ended up designing a set of basic rules manually and then finding further rules through experimentation, i.e., a combined knowledge and data approach.

The knowledge-based part was concerned, first, with whether regular expressions would suffice (a syntax-only approach), for which we designed a few automata using the prefixes in that standard table. This made it evident that such an approach was unlikely to work well: some noun classes, e.g., 1 and 3, take the same prefix (um or umu) yet have a different prefix for their plural counterpart (aba for noun class 2 and imi for noun class 4), and the table does not say when it is um and when umu. The latter prefix is used with monosyllabic stems, but we have no way of identifying a monosyllabic stem automatically. Thus, we would need the semantics (the noun class) to go with the regular expression as input, which is unlike any pluralisation algorithm for other languages that we know of. For the size of the automaton, it does not matter whether one checks the prefix first and then the noun’s noun class, or vice versa.
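To make the ambiguity concrete, here is a minimal Python sketch. It is not the algorithm from the paper, and the table below is only a four-class excerpt of the standard prefix table, but it shows why the noun class has to be supplied together with the noun: classes 1 and 3 share the singular prefix um(u)-, yet take different plural prefixes.

```python
# A minimal sketch, not the paper's algorithm: pluralise by swapping the singular
# prefix for the plural prefix of the paired noun class. Only four singular
# classes are included here, for illustration.
PREFIX_TABLE = {
    1: (("umu", "um"), "aba"),   # class 1 -> 2: umuntu -> abantu 'people'
    3: (("umu", "um"), "imi"),   # class 3 -> 4: umuzi -> imizi 'homesteads'
    7: (("isi",),      "izi"),   # class 7 -> 8: isitsha -> izitsha 'dishes'
    9: (("in",),       "izin"),  # class 9 -> 10: inja -> izinja 'dogs'
}

def pluralise(noun, noun_class):
    singular_prefixes, plural_prefix = PREFIX_TABLE[noun_class]
    for sg in singular_prefixes:          # try the long form umu- before um-
        if noun.startswith(sg):
            return plural_prefix + noun[len(sg):]
    return noun  # prefix not recognised; such cases are not handled in this sketch

# umfana (class 1) and umfula (class 3) start with the same prefix, yet pluralise
# differently, so the prefix alone cannot determine the plural:
print(pluralise("umfana", 1))  # abafana 'boys'
print(pluralise("umfula", 3))  # imifula 'rivers'
```

Note that trying the longer prefix form first merely sidesteps the um/umu alternation; it does not actually detect monosyllabic stems, which is the underlying issue mentioned above.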

For the experiment, we compiled two sets of nouns: one was constructed in English by taking class names from multiple ontologies (set 1), the other by picking the first-listed noun on each left-hand page of an isiZulu dictionary (set 2). To test for generalisability, we also took Runyankore, a language spoken in Uganda, which is in a different subfamily from isiZulu.

So, just how bad is nouns-only, i.e., just some regular expressions based on the standard table? The accuracy was about 50%, which is pretty dismal. Just adding the noun’s noun class made the accuracy jump to about 80-90%! The accuracy of each iteration is shown in the table below.

Accuracy of pluralisation with the incremental versions of the pluraliser (source: [2])
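For concreteness, accuracy here is simply the proportion of nouns whose generated plural matches the attested one. A small Python sketch of that measure follows; the gold entries are illustrative only, and the actual test sets and scripts are on the GeNI project page mentioned at the end of this post.

```python
# A sketch of the accuracy measure: the fraction of nouns whose generated plural
# matches the gold standard. The entries below are illustrative, not the test sets.
def accuracy(pluralise, gold):
    """pluralise: a function (noun, noun_class) -> plural; gold: list of triples."""
    hits = sum(1 for noun, nc, plural in gold if pluralise(noun, nc) == plural)
    return hits / len(gold)

gold = [
    ("umfana", 1, "abafana"),
    ("umfula", 3, "imifula"),
    ("isitsha", 7, "izitsha"),
    ("indoda", 9, "amadoda"),   # a true exception: a class 9 noun with a class 6 plural
]
# e.g., with the table-driven sketch from above:
# accuracy(pluralise, gold)  -> 0.75, because 'amadoda' is missed
```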

So, what were the other culprits? First, compound nouns, such as indawo yokubhukuda ‘swimming pool’: not only does the main noun have to be pluralised (indawo to izindawo), but the other word still has to agree with the first, so in the plural it is not yokubhukuda but zokubhukuda. We managed to devise four additional rules for those cases. Another issue were the mass nouns (amounts of matter): they either exist in the singular and stay in the singular (iwayini ‘wine’), or are already in a plural class (amanzi ‘water’). There is no way to recognise those automatically, so they have to be annotated. The plural exceptions (true exceptions) and prefix exceptions (regular ‘exceptions’) were mentioned earlier in this post, and some new rules were added for those as well. Adding all that resulted in 92% accuracy for the second set of nouns, and 100% for the first set. For the remaining errors we have some ideas on how to resolve them, but linguists first have to check (see the paper for details).
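As a rough illustration of two of those cases, the sketch below handles the vowel-commencing stems of class 7/8 and the concord agreement in compounds. These are not the exact rules from the paper, and the concord table is a two-entry excerpt made up for this example.

```python
VOWELS = "aeiou"

# Class 7/8 nouns whose stem commences with a vowel take is-/iz- instead of
# isi-/izi-: isandla 'hand' (is- + -andla) pluralises to izandla, not *iziandla.
def pluralise_class7(noun):
    stem = noun[3:] if noun.startswith("isi") else noun[2:]   # strip isi- or is-
    return ("iz" if stem[0] in VOWELS else "izi") + stem

print(pluralise_class7("isandla"))   # izandla 'hands'
print(pluralise_class7("isitsha"))   # izitsha 'dishes'

# For a compound such as indawo yokubhukuda 'swimming pool', the head noun is
# pluralised as usual and the possessive concord on the second word is changed
# to agree with the plural class: class 9 yo-/ya- becomes class 10 zo-/za-.
CONCORDS = {"yo": "zo", "ya": "za"}   # excerpt only, keyed on the singular concord

def pluralise_compound(plural_head, modifier):
    return plural_head + " " + CONCORDS[modifier[:2]] + modifier[2:]

print(pluralise_compound("izindawo", "yokubhukuda"))   # izindawo zokubhukuda
```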

The initial results for Runyankore were better than those for isiZulu, thanks to fewer recurring prefixes, but, interestingly, Runyankore had similar issues overall, notably with compound nouns, mass nouns, and true exceptions, and some issues with loan words. Tone, which is not indicated in the orthography, popped up as well; this also holds for some isiZulu nouns, but those did not happen to be in the test sets.

The data, analysis, and the software (including the source code) are available from the GeNI project page. The isiZulu pluraliser was coded in Python and the Runyankore one in Java. Note that they are at the proof-of-concept level, not industry-grade tools, but you’re free to take them and make industry-grade software out of them :-).


References

[1] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. Proceedings of the 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[2] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey. (in print)


On the need for bottom-up language-specific terminology development

Peoples of several languages intellectualise their vocabulary so as to maintain their own language as medium of instruction (or: LoLT, language of learning and teaching), to conduct scientific discussions among peers, and, in some cases, still publish research in their own language. Some languages I know of that do this are French, Spanish, German, and Italian; e.g., the English ‘set’ is conjunto (Sp.) and insieme (It.), and the Dutch for ‘garbage collection’ (in computing) is geheugensanering. I found out the hard way last month that my Italian scientific vocabulary is better than my Dutch one, never really having practised the latter in my field of specialisation. I also noticed that, over the years I have been globetrotting, quite a few Anglicisms in Dutch have been replaced with Dutch words, and some had been there for a while already (my excuse: I studied a different discipline in the Netherlands). How do these new words come about? There are many ways of word creation, and then it depends on the country or language region how a word gets incorporated into the language. For instance, French uses a top-down approach with the Académie Française, and Spain has the Real Academia Española. The Netherlands has De Nederlandse Taalunie, which isn’t as autocratic, it seems; for instance, to follow suit with the French mot dièse for the Twitter ‘hashtag’, there was some consultation and online voting (sound file) to come up with an agreeable Dutch term for hashtag. But how does that happen elsewhere?

We found out that there is a mode of practice for language-specific terminology development that happens in small ‘workshops’ of some 13-15 people, consisting mainly of terminologists and linguists, and 1-3 subject matter experts. There may be a consultative event with stakeholders, who are not necessarily subject matter experts either. Shocking. The sheer arrogance of the former, who ‘magically’ grasp the concepts that, when it comes to science, typically take a while to understand, yet supposedly understand them well enough to come up with a meaningful local-language word. But maybe, you say, I’m too arrogant in thinking that subject matter experts, such as myself, can come up with decent local-language terms. Maybe that’s partially true, but what may be more problematic is that only a few subject matter experts are involved, so there is an over-reliance on those mere few. Maybe, you say, that’s not a problem. We put that to the test for computing and computer literacy terminology development in isiZulu, and found out that it is a problem: what comes out of the term harvesting and term preference depends on whom you ask. And then asking just a few people is a problem for a term’s uptake. (The students involved in the experiments did not even know there was a computer literacy term list from the South African Department of Arts and Culture, published in 2005, and booed away several of the terms.)

The way we tested it was with three experiments. The first experiment was an experts-only workshop, with ‘experts’ being 4th-year computer science students who have isiZulu as home language, as there were no isiZulu-speaking MSc and PhD students, nor colleagues, in CS at the University of KwaZulu-Natal, where we did the experiment. The second experiment was an isiZulu-localised survey among undergraduate CS students to collect terms, where we hoped to see a difference between being given the entity with an English name and being given the entity as a picture. The third experiment was a survey in which computer literacy students (1st-year science students) could vote for terms for which more than one isiZulu term had been proposed. The details of the set-up and the results have been published recently in the open-access Alternation journal article “Limitations of Regular Terminology Development Practices: The Case of isiZulu Computing Terminology” [1], in the special issue on “Re-envisioning African Higher Education: Alternative Paradigms, Emerging Trends and New Directions”, edited by Rubby Dhunpath, Nyna Amin and Thabo Msibi. It describes which isiZulu terms from which source are affected, ranging from a higher incidence of ‘zulufying’ English terms in the aforementioned list by the South African Department of Arts and Culture compared to the proposals by the experiments’ participants, to, e.g., expert consensus on inqolobane for database versus a preference for imininingo egciniwe by the computer literacy students (see the paper for more cases). Further, when all respondents across the surveys are aggregated and one goes for majority voting, the terms proposed by the experts are snowed under. The latter is particularly troublesome in a country where computing is a designated critical skill (or: there aren’t nearly enough computing professionals).

A by-product of the experiments is that we have collected the longest list of isiZulu computing terms to date, which has in the meantime gone through a standardisation process. The latter is mainly thanks to the tireless efforts of Khumbulani Mngadi of the ULPDO of UKZN, and the two expert CS honours students who volunteered in the process, Sibonelo Dlamini and Tanita Singano.

Our approach was already less exclusionary compared to the aforementioned traditional/standard way, but it also shows that broader participation is needed both to collect and to choose terms; or, in the words of the special issue editors [2]: a “democratization of the terminology development process” that “transcends the insularity and purism which characterises traditional laboratory approaches to development”. We are still working on and off to achieve this with crowdsourcing, and maybe we should start thinking of crowdfunding that crowdsourcing effort to speed up the whole thing and complete the commuterm project.

As a last note: in case you are interested in other contributions to “re-envisioning African higher education”: scan through the main page online, read the editorial [2] for the main outcomes of each of the papers, and/or read the papers, on topics as diverse as postgrad supervision in isiZulu, teaching sexual and gender diversity to pre-service teachers, maths education, IKS in HE, and much more.

References

[1] Keet, C.M., Barbour, G. Limitations of Regular Terminology Development Practices: The Case of isiZulu Computing Terminology. Alternation, 2014, 12: 13-48.

[2] Dhunpath, R., Amin, N., Msibi, T. Editorial: Re-envisioning African Higher Education: Alternative Paradigms, Emerging Trends and New Directions. Alternation, 2014, 12: 1-12.

VocabLift to learn some isiZulu, Shona, French, and English words

While I’ll be at EKAW’14 to network, present the stuff ontology, and support SUGOI, some of my students will hold the fort locally at the African Language Technologies Workshop (AFLaT’14) on 27-28 November in Cape Town. One of the two posters & demos I contributed to is about a cute tool that two 3rd-year students—Ntokozo Zwane and Sungunani Silubonde—designed and implemented as their capstone project for software engineering, which they called VocabLift (zip). The capstone groups’ task was to develop a tool that can help someone to learn vocabulary in a playful way, which had some leeway to be creative in how to realize that.

The context is that everyone has to learn vocabulary over the years, from basic words in primary school to scientific terminology at university, and any time one learns a new language. Besides memorising ‘boring’ lists of words from a sheet of paper, there are more playful ways to do this, like the multi-player dictionary game and hangman, or the single-player memory card games on the EuroTalk DVDs. There are indeed many word games online, e.g., for English, and one can learn a foreign language on Duolingo, but there is less for multilingualism and for the languages of Southern Africa. EuroTalk DVDs for Zulu, Shona, Swahili, Yoruba and a few other African languages do exist, true, but at a cost, and they are inflexible in a teaching setting. Enter VocabLift, which is interesting both technologically and for the target languages chosen: isiZulu and Shona, alongside English and French. Conceptually, it is based on natural-language-independent root questions that are mapped to the language of choice, so another language can easily be added, and, unlike the usual ‘closed’ world of computer-based language games, a teacher can add words to the dictionary, making it in principle adaptable to the desired level of language learning.
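To make that design concrete, here is a small conceptual sketch in Python. VocabLift itself is implemented with JavaFX and XML (see further below), and the class and method names here are made up for illustration: each entry is keyed by a language-independent concept and maps to the word in each supported language, so adding a language or a word only extends the data, not the games.

```python
from collections import defaultdict

class VocabDictionary:
    """Illustrative only: language-independent concepts mapped to per-language words."""
    def __init__(self):
        self.entries = defaultdict(dict)   # concept -> {language: word}

    def add_word(self, concept, language, word):
        # what the teacher/administrator 'dictionary mode' does: extend the vocabulary at runtime
        self.entries[concept][language] = word

    def word_for(self, concept, language):
        return self.entries[concept].get(language)

vocab = VocabDictionary()
vocab.add_word("green", "English", "green")
vocab.add_word("green", "French", "vert")
# a teacher would likewise add the isiZulu and Shona words here
print(vocab.word_for("green", "French"))   # vert
```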

Currently, VocabLift has three games: Picture Matcher, Vocab Trainer, and Word Tetris. In Picture Matcher, the user has to provide the name of the object in the picture, with the objective of improving memory and spelling in the chosen language; screenshots for avocado in isiZulu and pineapple in Shona are shown below.

Avocado in isiZulu, right before selecting ‘confirm word’

Pineapple in Shona, after I clicked ‘I don’t know’

Vocab Trainer tests the user’s ability to recall, in the target language, the word given in English; screenshots for green in isiZulu and gray in French are shown below.

Choosing the right word for ‘green’ in isiZulu (the answer also can be found further below in another screenshot)

Same story, and just to show it works for French, too.

The third game, Word Tetris, is included so that the user can learn to match the word to the picture. The user has to type the word associated with the picture before it falls below the bar; a screenshot is shown below (I lost points due to trying to make nice screenshots, really).

Halfway playing ‘word tetris’

One needs to be logged in as administrator to add words (admin 1234 will do the trick) and use the tool in ‘dictionary mode’, as illustrated in the next two screenshots.

Adding terms, having selected to add it to the isiZulu dictionary

Cellphone was added (and you also can find the answer to ‘green’, above)

VocabLift has been implemented using JavaFX and XML, making the tool platform-independent (click the VocabLift.jar file once the downloaded zip file is unzipped). I’ll readily admit it is very well possible to add more features, adapt it to run on a mobile device as well, or refine the HCI, and some educational technology researcher may want to investigate whether this, or something like it, improves learning outcomes significantly. But it was a fun software engineering project over a timespan of a mere two months (with other parts of the course being taught in parallel), and words surely can be learnt with it (at least I did, and so did the students).

The AFLaT’14 poster and demo session runs from 15:30 to 16:30 on the 27th, the posters will remain up during the breaks on the 28th as well, and Ntokozo and Sungunani will be happy to demo the tool for you and tell you more about it.