Surprising similarities and differences in orthography across several African languages

It is well-known that natural language interfaces and tools in one’s own language are known to be useful in ICT-mediated communication. For instance, tools like spellcheckers and Web search engines, machine translation, or even just straight-forward natural language processing to at least ‘understand’ documents and find the right one with a keyword search. Most languages in Southern Africa, and those in the (linguistically called) Bantu language family, are still under-resourced, however, so this is not a trivial task due to the limited data and researched and documented grammar. Any possibility to ‘bootstrap’ theory, techniques, and tools developed for one language and to fiddle just a bit to make it work for a similar one will save many resources compared to starting from scratch time and again. Likewise, it would be very useful if both the generic and the few language-specific NLP tools for the well-resourced languages could be reused or easily adapted across languages. The question is: does that work? We know very little about whether it does. Taking one step back, then: for that bootstrapping to work well, we need to have insight into how similar the languages are. And we may be able to find that out if only we knew how to measure similarity of languages.

The most well-know qualitative way for determining some notion of similarity started with Meinhof’s noun class system [1] and the Guthrie zones. That’s interesting, but not nearly enough for computational tools. An experiment has been done for morphological analysers [2], with promising results, yet it also had more of a qualitative flavour to it.

I’m adding here another proverbial “2 cents” to it, by taking a mostly quantitative approach to it, and focusing on orthography (how things are written down) in text documents and corpora. This was a two-step process. First, 12 versions of the Universal Declaration of Human Rights were examined on tokens and their word length; second, because the UDHR is a quite small document, isiZulu corpora were examined to see whether the UDHR was a representative sample, i.e., whether extrapolation from its results may be justified. The methods, results, and discussion are described in “An assessment of orthographic similarity measures for several African languages” [3].

The really cool thing of the language comparison is that it shows clusters of languages, indicating where bootstrapping may have more or less success, and they do not quite match with Guthrie zones. The cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa is shown in the figure below, where the names of the languages are those of the file names of the NLTK data kit that contains the quality translations of the UDHR.

Cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa (Source: [3]).

Cumulative frequency distributions of the words in the UDHR of several languages spoken in Sub-Saharan Africa (Source: [3]).

The paper contains some statistical tests, showing that the bottom cluster are not statistically significantly different form each other, but they are from the ‘middle’ cluster. So, the word length distribution of Kiswahili is substantially different from that of, among others, isiZulu, in that it has more shorter words and isiZulu more longer words, but Kiswahili’s pattern is similar to that of Afrikaans and English. This is important for NLP, for isiZulu is known to be highly agglutinating, but English (and thus also Kiswahili) is disjunctive. How important is such a difference? The simple answer is that grammatical elements of a sentences get ‘glued’ together in isiZulu, whereas at least some of them are written as separate words in Kiswahili. This is not to be conflated with, say, German, Dutch, and Afrikaans, where nouns can be concatenated to form new words, but, e.g., a preposition is glued onto a noun. For instance, ‘of clay’ is ngobumba, contracting nga+ubumba with a vowel coalescence rule (-a + u- = -o-), which thus happens much less often in a language with disjunctive orthography. This, in turn, affects the algorithms needed to computationally process the languages, hence, the prospects for bootstrapping.

Note that middle cluster looks deceptively isolating, but it isn’t. Sesotho and Setswana are statistically significantly different from the others, in that they are even more disjunctive than English. Sepedi (top-most line) even more so. While I don’t know that language, a hypothetical example suffice to illustrate this notion. There is conjugation of verbs, like ‘works’ or trabajas or usebenza (inflection underlined), but some orthographer a while ago could have decided to write that separate from the verb stem (e.g., trabaj as and u sebenza instead), hence, generating more tokens with fewer characters.

There are other aspects of language and orthography one can ‘play’ with to analyse quantitatively, like whether words mainly end in a vowel or not, and which vowel mostly, and whether two successive vowels are acceptable for a language (for some, it isn’t). This is further described in the paper [3].

Yet, the UDHR is just one document. To examine the generalisability of these observations, we need to know whether the UDHR text is a ‘typical’ one. This was assessed in more detail by zooming in on isiZulu both quantitatively and qualitatively with four other corpora and texts in different genres. The results show that the UHDR is a typical text document orthographically, at least for the cumulative frequency distribution of the word length.

There were some other differences across the other corpora, which have to do with genre and datedness, which was observed elsewhere for whole words [4]. For instance, news items of isiZulu newspapers nowadays include words like iFacebook and EFF, which surely don’t occur in a century-old bible translation. They do violate the ‘no two successive vowels’ rule and the ‘final vowel’ rule, though.

On the qualitative side of the matter, and which will have an effect on searching for information in texts, text summarization, and error correction of spellcheckers, is, again, that agglutination. For instance, searching on imali ‘money’ alone would be woefully inadequate to find all relevant texts; e.g., those news items also include kwemali, yimali, onemali, osozimali, kwezimali, and ngezimali, which are, respectively of -, and -, that/which/who has -, of – (pl.), about/by/with/per – (pl.) money. Searching on the stem or root only is not going to help you much either, however. Take, for instance -fund-, of which the results of just two days of Isolezwe news articles is shown in the table below (articles from 2015, when there were protests, too). Depending on what comes before fund and what comes after it, it can have a different meaning, such as abafundi ‘students’ and azifundi ‘they do not learn’.

isolezwefund

Placing this is the broader NLP scope, it also affects the widely-used notion of lexical diversity, which, in its basic form, is a type-to-token ratio. Lexical diversity is used as a proxy measure for ‘difficulty’ or level of a text (the higher the more difficult), language development in humans as they grow up, second-language learning, and related topics. Letting that loose on isiZulu text, it will count abafundi, bafundi, and nabafundi as three different tokens, so wheehee, high lexical diversity, yet in English, it amounts to ‘students’, ‘students’ and ‘and the students’. Put differently, somehow we have to come up with a more meaningful notion of lexical diversity for agglutinating languages. A first attempt is made in the paper in its section 4 [3].

Thus, the last word has not been said yet about orthographic similarity, yet we now do have more insight into it. The surprising similarity of isiZulu (South Africa) with Runyankore (Uganda) was exploited in another research activity, and shown to be very amenable to bootstrapping [5], so, in its own way providing supporting evidence for bootstrapping potential that the figure above also indicated as promising.

As a final comment on the tooling side of things, I did use NLTK (Python). It worked well for basic analyses of text, but it (and similar NLP tools) will need considerable customization for the agglutinating languages.

 

References

[1] C. Meinhof. 1932. Introduction to the phonology of the Bantu languages . Dietrich Reiner/Ernst Vohsen, Johannesburg. Translated, revised and enlarged in collaboration with the author and Dr. Alice Werner by N.J. Van Warmelo.

[2] L. Pretorius and S. Bosch. Exploiting cross-linguistic similarities in Zulu and Xhosa computational morphology: Facing the challenge of a disjunctive orthography. In Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages – AfLaT 2009, pages 96–103, 2009.

[3] C.M. Keet. An assessment of orthographic similarity measures for several African languages. Technical report, arxiv 1608.03065. August 2016.

[4] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[5] J. Byamugisha, C. M. Keet, and B. DeRenzi. Bootstrapping a Runyankore CNL from an isiZulu CNL. In B. Davis et al., editors, 5th Workshop on Controlled Natural Language (CNL’16), volume 9767 of LNAI, pages 25–36. Springer, 2016. 25-27 July 2016, Aberdeen, UK.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s