Noting that February is the month of language activism in South Africa and that 21 February is the International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development has progressed quite a bit, so that we have released a new ‘version 2’ of the spellchecker. For those not in-the-know: isiZulu and isiXhosa are both among the 11 official languages of South Africa, with isiZulu the largest language in the country by first language speakers and isiXhosa is slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:
- first error correction algorithms for isiZulu;
- improved error detection with a few basic rules, also for isiZulu;
- new isiXhosa error detection and correction;
The source code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):
The error corrector can propose possible corrections for single-word errors that are either transpositions, substitutions, insertions, or deletions. So, for instance, *eybo, *yrbo, *yeebo, and *ybo, respectively, cf. the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, with, for error correction, a lot more statistics-based algorithms cf. the error detection-only algorithms. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it at least can suggest something) is 95% and suggestion relevance (that it contains the intended word) made it to 61%, mainly due to weak results of corrections for insertion errors (they mess too much with the trigrams).
The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semi-colons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored rather than the words adjacent to it being detected as probably misspelt. It works better than spellchecker.net’s version, which is the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas the spellchecker.net’s version got to 82.66% on the same text. Put differently: spellchecker.net flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).
Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption of using them for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) with more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than the statistical language model of Nthabiseng, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format and they may be useful for other endeavours, such as morphological analysers.
Further improvements are possible, which are being scoped for a v3 some time later. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? If we had known that, it would have been an easy way to boost the performance. Can we find optimisations to substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in that or related questions, or you would like to use the algorithms in your tool, please contact me.