ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already underresourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as she “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2)). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations was held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which therewith reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieve redress of the injustices of the past. This route does tick multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence, not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?

 

Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely-spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent, of which the idea is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.

 

Data-driven approaches that use lots of text

The rules-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lost of text are on the rise: it offers the hope to achieve more with less effort, not even having to learn the language, and easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that learning the statistics-driven language model on old texts like the bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, illustrated in Figure 3, rather than a dictionary-based approach that is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding it a small isiXhosa corpus: it achieved about 80% accuracy already even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data of parallel sentences in both languages from which it has to learn the patterns, such as such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’ came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.

 

Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft put resources into development of HLTs for African languages already. Languages, and the identities and cultures intertwined with them, are a national resource, however; hence, suggesting the need for more research and the creation of a substantial public good of a wide range of HLTs to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623, pp. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

Advertisement

Updated isiZulu spellchecker and new isiXhosa spellchecker

Noting that February is the month of language activism in South Africa and that 21 February is the International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development has progressed quite a bit, so that we have released a new ‘version 2’ of the spellchecker. For those not in-the-know: isiZulu and isiXhosa are both among the 11 official languages of South Africa, with isiZulu the largest language in the country by first language speakers and isiXhosa is slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:

  • first error correction algorithms for isiZulu;
  • improved error detection with a few basic rules, also for isiZulu;
  • new isiXhosa error detection and correction;

The source code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):

Single word error *khupels that has a substitution error s for a in the intended word (khupela)

The error corrector can propose possible corrections for single-word errors that are either transpositions, substitutions, insertions, or deletions. So, for instance, *eybo, *yrbo, *yeebo, and *ybo, respectively, cf. the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, with, for error correction, a lot more statistics-based algorithms cf. the error detection-only algorithms. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it at least can suggest something) is 95% and suggestion relevance (that it contains the intended word) made it to 61%, mainly due to weak results of corrections for insertion errors (they mess too much with the trigrams).

The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semi-colons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored rather than the words adjacent to it being detected as probably misspelt. It works better than spellchecker.net’s version, which is the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas the spellchecker.net’s version got to 82.66% on the same text. Put differently: spellchecker.net flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).

Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption of using them for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) with more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than the statistical language model of Nthabiseng, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format and they may be useful for other endeavours, such as morphological analysers.

A section of the isiXhosa Wikipedia entry about the UN (*ukuez should be ukuze, which is among the proposed words).

Further improvements are possible, which are being scoped for a v3 some time later. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? If we had known that, it would have been an easy way to boost the performance. Can we find optimisations to substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in that or related questions, or you would like to use the algorithms in your tool, please contact me.

The isiZulu spellchecker seems to contribute to ‘intellectualisation’ of isiZulu

Perhaps putting ‘intellectualisation’ in sneer quotes isn’t nice, but I still find it an odd term to refer to a process of (in short, from [1]) coming up with new vocabulary for scientific speech, expression, objective thinking, and logical judgments in a natural language. In the country I grew up, terms in our language were, and still are, invented more because of a push against cultural imperialism and for home language promotion rather than some explicit process to intellectualise the language in the sense of “let’s invent some terms because we need to talk about science in our own language” or “the language needs to grow up” sort of discourses. For instance, having introduced the beautiful word geheugensanering (NL) that captures the concept of ‘garbage collection’ (in computing) way better than the English joke-term for it, elektronische Datenverarbeitung (DE) for ‘ICT’, técnicas de barrido (ES) for ‘sweep line’ algorithms, and mot-dièse (FR) for [twitter] ‘hashtag’, to name but a few inventions.

Be that as it may, here in South Africa, it goes under the banner of intellectualisation, with particular reference to the indigenous languages [2]; e.g., having introduced umakhalekhukhwini ‘cell/mobile phone’ (decomposed: ‘the thing that rings in your pocket’) and ukudlulisa ikheli for ‘pass by reference’ in programming (longer list of isiZulu-English computing and ICT terms), which is occurring for multiple subject domains [3]. Now I ended up as co-author of a paper that has ‘intellectualisation’ in its title [4]: Evaluation of the effects of a spellchecker on the intellectualization of isiZulu that appeared just this week in the Alternation journal.

The main general question we sought to answer was whether human language technologies, and in particular the isiZulu spellchecker launched last year, contribute to the language’s intellectualisation. More specifically, we aimed to answer the following three questions:

  1. Is the spellchecker meeting end-user needs and expectations?
  2. Is the spellchecker enabling the intellectualisation of the language?
  3. Is the lexicon growing upon using the spellchecker?

The answers in a nutshell are: 1) yes, the spellchecker does meet end-user needs and expectations (but there are suggestions further improving its functionality), 2) users perceive that the spellchecker enables the intellectualisation of the language, and 3) non-dictionary words were added, i.e., the lexicon is indeed growing.

The answer to the last question provides some interesting data for linguists to bite their teeth in. For instance, a user had added to the spellchecker’s dictionary LikaSekelaShansela, which is an inflected form of isekelashansela ‘Vice Chancellor’ (that is recognised as correct by the spellchecker). Also some inconsistencies—from a rule-of-thumb viewpoint—in word formation were observed; e.g., usosayensi ‘scientist’ vs. unompilo ‘nurse’. If one were to follow consistently the word formation process for various types of experts in isiZulu, such as usosayensi ‘scientist’, usolwazi ‘professor’, and usomahlaya ‘comedian’, then one reasonably could expect ‘nurse’ to be *usompilo rather than unompilo. Why it isn’t, we don’t know. Regardless, the “add to dictionary” option of the spellchecker proved to be a nice extra feature for a data-driven approach to investigate intellectualisation of a language.

Version 1 of the isiZulu spellchecker that was used in the evaluation was ok and reasonably could not have interfered negatively with any possible intellectualisation (average SUS score of 75 and median 82.5, so ‘good’). It was ok in the sense that a majority of respondents thought that the entire tool was helpful, no features should be removed, it enhances their work, and so on (see paper for details). For the software developers among you who have spare time: they’d like, mainly, to have it as a Chrome and MS Word plugin, predictive text/autocomplete, and have it working on the mobile phone. The spellchecker has improved in the meantime thanks to two honours students, and I will write another blog post about that next.

As a final reflection: it turned out there isn’t a way to measure the level of intellectualisation in a ‘hard sciences’ way, so we concluded the other answers based on data that came from the somewhat fluffy approach of a survey and in-depth interviews (a ‘mixed-methods’ approach, to give it a name). It would be nice to have a way to measure it, though, so one would be able to say which languages are more or less intellectualised, what level of intellectualisation is needed to have a language as language of instruction and science at tertiary level of education and for dissemination of scientific knowledge, and to what extent some policy x, tool y, or activity z contributes to the intellectualization of a language.

 

References

[1] Havránek, B. 1932. The functions of literary language and its cultivation. In Havránek, B and Weingart, M. (Eds.). A Prague School Reader on Esthetics, Literary Structure and Style. Prague: Melantrich: 32-84.

[2] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[3]Khumalo, L. Intellectualization through terminology development. Lexikos, 2017, 27: 252-264.

[4] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

Launch of the isiZulu spellchecker

launchspellchecker

Langa Khumalo, ULPDO director, giving the spellchecker demo, pointing out a detected spelling error in the text. On his left, Mpho Monareng, CEO of PanSALB.

Yesterday, the isiZulu spellchecker was launched at UKZN’s “Launch of the UKZN isiZulu Books and Human Language Technologies” event, which was also featured on 702 live radio, SABC 2 Morning Live, and e-news during the day. What we at UCT have to do with it is that both the theory and the spellchecker tool were developed in-house by members of the Department of Computer Science at UCT. The connection with UKZN’s University Language Planning & Development Office is that we used a section of their isiZulu National Corpus (INC) [1] to train the spellchecker with, and that they wanted a spellchecker (the latter came first).

The theory behind the spellchecker was described briefly in an earlier post and it has been presented at IST-Africa 2016 [2]. Basically, we don’t use a wordlist + rules-based approach as some experiments of 20 years ago did, nor a wordlist + a few rules of the now-defunct translate.org.za OpenOffice v3 plugin seven years ago, but a data-driven approach with a statistical language model that uses tri-grams. The section of the INC we used were novels and news items, so, including present-day isiZulu texts. At the time of the IST-Africa’16 paper, based on Balone Ndaba’s BSc CS honours project, the spell checking was very proof-of-concept, but it showed that it could be done and still achieve a good enough accuracy. We used that approach to create an enduser-usable isiZulu spellchecker, which saw the light of day thanks to our 3rd-year CS@UCT student Norman Pilusa, who both developed the front-end and optimised the backend so that it has an excellent performance.

Upon starting the platform-independent isiZulu_spellchecker.jar file, the English interface version looks like this:

zuspellopen

You can write text in the text box, or open a txt or docx file, which then is displayed in the textbox. Click “Run”. Now there are two options: you can choose to step-through the words that are detected as misspelled one at a time or “Show All” words that are detected as misspelled. Both are shown for some sample text in the screenshot below.

zuspellonessection

processing one error at a time

zuspellallsection

highlighting all words detected as very probably misspelled

Then it is up to you to choose what to do with it: correct it in the textbox, “Ignore once”, “Ignore all”, or “Add” the word to your (local) dictionary. If you have modified the text, you can save it with the changes made by clicking “Save correction”. You also can switch the interface from the default English to isiZulu by clicking “File – Use English”, and back to English via “iFayela – ulimi lesingisi”. You can download the isiZulu spellchecker from the ULPDO website and from the GitHub repository for those who want to get their hands on the source code.

To anticipate some possible questions you may have: incorporating it as a plugin to Microsoft word, OpenOffice/LibreOffice, and Mozilla Firefox was in the planning. The former is technologically ‘closed source’, however, and the latter two have a certain way of doing spellchecking that is not amenable to the data-driven approach with the trigrams. So, for now, it is a standalone tool. By design, it is desktop-based rather than for mobile phones, because according to the client (ULPDO@UKZN), they expect the first users to be professionals with admin documents and emails, journalists writing articles, and such, writing on PCs and laptops.

There was also a trade-off between a particular sort of error: the tool now flags more words as probably incorrect than it could have, yet it will detect (a subset of) capitalization, correctly, such as KwaZulu-Natal whilst flagging some of the deviant spellings that go around, as shown in the screenshot below.

zuspellkznThe customer preferred recognising such capitalisation.

Error correction sounds like an obvious feature as well, but that will require a bit more work, not just technologically, but also the underlying theory. It will probably be an honours project topic for next year.

In the grand scheme of things, the current v1 of the spellchecker is only a small step—yet, many such small steps in succession will get one far eventually.

The launch itself saw an impressive line-up of speeches and introductions: the keynote address was given by Dr Zweli Mkhize, UKZN Chancellor and member of the ANC NEC; Prof Ramesh Krishnamurthy, from Aston University UK, gave the opening address; Mpho Monareng, CEO of PanSALB gave an address and co-launched the human language technologies; UKZN’s VC Andre van Jaarsveld provided the official welcome; and two of UKZN’s DVCs, Prof Renuka Vithal and Prof Cheryl Potgieter, gave presentations. Besides our ‘5-minutes of fame’ with the isiZulu spellchecker, the event also launched the isiZulu National Corpus, the isiZulu Term Bank, the ZuluLex mobile-compatible application (Android and iPhone), and two isiZulu books on collected short stories and an English-isiZulu architecture glossary.

 

References

[1] Khumalo, L. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[2] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

Preliminary promising results on a data-driven spellchecker for isiZulu

While developing a spellchecker for isiZulu is not new, the one for Open Office v3 doesn’t work with the more up-to-date versions of Open Office, it was of limited quality, the other papers on isiZulu spellcheckers have no tools for it, and the techniques used were tailored to the language. The demand for it has not decreased, however. Last year’s honours student I co-supervised, Balone Ndaba, set out to address this using n-grams and several corpora, so if this were to work, then this essentially language-independent solution may be ported to other Bantu languages. The data-driven approach worked reasonably well (up to 89% accuracy), so it all ended up as a paper entitled “The effects of a corpus on isiZulu spellcheckers based on n-grams” [1], co-authored with Balone Ndaba, Hussein Suleman, and Langa Khumalo. It has been accepted at the IST Africa’16 conference that will take place in less than two weeks in Durban, South Africa. The material can be downloaded from Balone’s honours project page.

In a nutshell, well less than one page cf. the paper’s 9 pages, the paper describes the design, implementation, and experimental results of a statistics-based spellchecker. One option in a data-driven approach is to make long wordlists to look up the word to be spellchecked, indeed, but this is not doable due to the agglutination in the language (i.e., way too many options), and there are limited curated resources anyhow. Instead, we let it ‘learn’ using a statistical language model using n-grams created from a corpus. For instance, take the word yebo ‘yes’, then its bigrams are ye, eb, and bo, its trigrams are yeb and ebo, and quadrigram (i.e., 4 characters) yebo. Feeding the algorithm a lot of text generates a lot of n-grams, some of which occur much more often than others, whereas the very rare ones were probably typos or a loan word in the original text. The hope is then that when it is fed a word that it has not been trained on, it can recognize that it is (very probably) still a valid word in the language, as the sequence of the characters in the string are deemed common enough, or, conversely, that it is very likely to be misspelled.

It was not clear upfront which of the variables will affect the quality most, and which combination leads to the best spellchecker performance. This could be the threshold for when to include the n-gram, how big the n-gram should be, and whether the training corpus had any effect. So, all this was tested (see paper for details).

Trigrams worked better than quadrigrams, and a threshold of 0.003 worked better than the other thresholds. This is the ‘easy’ part, in a way. What complicated matters were the corpora, where one was much better to learn from or test on than the other. This is summarised in the table below. UC stands for the Ukwabelana corpus [2] that consists of the bible and a few copyright-expired novels, S. INC for a sample of the isiZulu National Corpus that is being developed at the University of KwaZulu-Natal [3], and INC is a small corpus of news items from the Isolezwe and news24 in isiZulu news websites (extended from the earlier playing with some of those articles). The INC and NIC work well on each other, but not at all with UC, and if trained with UC then only mediocre on the more recent texts of NIC and INC. So, the training corpus has a rather large effect on the spellchecker’s performance.

 

Lexical recall with n-grams: 10-fold cross-validation (i.e., with itself) and tested against the other corpora (t. = test).

10-fold Train UC Train NIC Train S. INC
3-gr. 4-gr. t. NIC t. INC t. INC t. UC t. UC t. NIC
UC 85 80 30 41
S. INC 66 63 54 89
NIC 86 79 70 89

 

We did look into the corpora a little more, but there’s more to analyse. Notably, the number of unique words in the corpus seems to matter more than its size, and the datedness and quality of the text suggest room for additional evaluation. UC contains some archaic terms that wouldn’t be used in modern-day isiZulu, and there have been some errors introduced in the words, assumed to be due to OCR (words with three successive vowels are a no-no in isiZulu, verbs missing the final vowel).

Overall, though, a lexical recall and an accuracy of 89% is quite alright for a first-pass, meriting resources for fine-tuning for isiZulu and it being promising as approach for related languages.

Balone will present the paper at IST Africa, and Hussein and Langa will also participate, so you can ask them more details in person at the conference. (I’ll be holding a final training for the team representing UCT in the ACM ICPC World Finals and pack my bags to join them as coach in Phuket in Thailand.)

 

References

[1] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[2] S. Spiegler, A. van der Spuy, P. Flach. Ukwabelana – An Open-source Morphological Zulu Corpus. Proceedings of the 23rd International Conference on Computational Linguistics (COLING, 2010). ACL. 1020-1028.

[3] Khumalo. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.