Preliminary promising results on a data-driven spellchecker for isiZulu

While developing a spellchecker for isiZulu is not new, the one for Open Office v3 doesn’t work with the more up-to-date versions of Open Office, it was of limited quality, the other papers on isiZulu spellcheckers have no tools for it, and the techniques used were tailored to the language. The demand for it has not decreased, however. Last year’s honours student I co-supervised, Balone Ndaba, set out to address this using n-grams and several corpora, so if this were to work, then this essentially language-independent solution may be ported to other Bantu languages. The data-driven approach worked reasonably well (up to 89% accuracy), so it all ended up as a paper entitled “The effects of a corpus on isiZulu spellcheckers based on n-grams” [1], co-authored with Balone Ndaba, Hussein Suleman, and Langa Khumalo. It has been accepted at the IST Africa’16 conference that will take place in less than two weeks in Durban, South Africa. The material can be downloaded from Balone’s honours project page.

In a nutshell, well less than one page cf. the paper’s 9 pages, the paper describes the design, implementation, and experimental results of a statistics-based spellchecker. One option in a data-driven approach is to make long wordlists to look up the word to be spellchecked, indeed, but this is not doable due to the agglutination in the language (i.e., way too many options), and there are limited curated resources anyhow. Instead, we let it ‘learn’ using a statistical language model using n-grams created from a corpus. For instance, take the word yebo ‘yes’, then its bigrams are ye, eb, and bo, its trigrams are yeb and ebo, and quadrigram (i.e., 4 characters) yebo. Feeding the algorithm a lot of text generates a lot of n-grams, some of which occur much more often than others, whereas the very rare ones were probably typos or a loan word in the original text. The hope is then that when it is fed a word that it has not been trained on, it can recognize that it is (very probably) still a valid word in the language, as the sequence of the characters in the string are deemed common enough, or, conversely, that it is very likely to be misspelled.

It was not clear upfront which of the variables will affect the quality most, and which combination leads to the best spellchecker performance. This could be the threshold for when to include the n-gram, how big the n-gram should be, and whether the training corpus had any effect. So, all this was tested (see paper for details).

Trigrams worked better than quadrigrams, and a threshold of 0.003 worked better than the other thresholds. This is the ‘easy’ part, in a way. What complicated matters were the corpora, where one was much better to learn from or test on than the other. This is summarised in the table below. UC stands for the Ukwabelana corpus [2] that consists of the bible and a few copyright-expired novels, S. INC for a sample of the isiZulu National Corpus that is being developed at the University of KwaZulu-Natal [3], and INC is a small corpus of news items from the Isolezwe and news24 in isiZulu news websites (extended from the earlier playing with some of those articles). The INC and NIC work well on each other, but not at all with UC, and if trained with UC then only mediocre on the more recent texts of NIC and INC. So, the training corpus has a rather large effect on the spellchecker’s performance.

 

Lexical recall with n-grams: 10-fold cross-validation (i.e., with itself) and tested against the other corpora (t. = test).

10-fold Train UC Train NIC Train S. INC
3-gr. 4-gr. t. NIC t. INC t. INC t. UC t. UC t. NIC
UC 85 80 30 41
S. INC 66 63 54 89
NIC 86 79 70 89

 

We did look into the corpora a little more, but there’s more to analyse. Notably, the number of unique words in the corpus seems to matter more than its size, and the datedness and quality of the text suggest room for additional evaluation. UC contains some archaic terms that wouldn’t be used in modern-day isiZulu, and there have been some errors introduced in the words, assumed to be due to OCR (words with three successive vowels are a no-no in isiZulu, verbs missing the final vowel).

Overall, though, a lexical recall and an accuracy of 89% is quite alright for a first-pass, meriting resources for fine-tuning for isiZulu and it being promising as approach for related languages.

Balone will present the paper at IST Africa, and Hussein and Langa will also participate, so you can ask them more details in person at the conference. (I’ll be holding a final training for the team representing UCT in the ACM ICPC World Finals and pack my bags to join them as coach in Phuket in Thailand.)

 

References

[1] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. IST-Africa 2016. May 11-13, 2016, Durban, South Africa.

[2] S. Spiegler, A. van der Spuy, P. Flach. Ukwabelana – An Open-source Morphological Zulu Corpus. Proceedings of the 23rd International Conference on Computational Linguistics (COLING, 2010). ACL. 1020-1028.

[3] Khumalo. Advances in developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

Advertisements