ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages, with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already under-resourced. This trend runs counter to citizens’ constitutional rights and the state’s obligations, as the state “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2) of the Constitution). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations were held in the patient’s home language [1], with the other 94% essentially not receiving the quality care they deserve due to language barriers [2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which in turn reduces divisions in society, fosters cohesion-building and inclusion, and may contribute to achieving redress of the injustices of the past. This route ticks multiple boxes of the aims presented in the National Development Plan 2030 (NDP). How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, and hence do not return the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?

Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent; the idea is illustrated in Figure 1[6].
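The contrast between spelling-only pluralisation for English and noun-class-aware pluralisation for isiZulu can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the algorithm of [3] or the Perl rules of [4]: the function names, the handful of English rules, and the four-entry noun class table are chosen for illustration only, and phonological conditioning and exceptions are ignored.

```python
import re

def pluralise_english(noun: str) -> str:
    """Spelling-only rules; a tiny, incomplete subset for illustration."""
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    return noun + "s"

# Illustrative subset: singular noun class -> (singular prefixes, plural prefix).
# Longer prefixes are listed first so that 'umu' is tried before 'um'.
ISIZULU_CLASSES = {
    "1": (("umu", "um"), "aba"),
    "1a": (("u",), "o"),
    "3": (("umu", "um"), "imi"),
    "7": (("isi",), "izi"),
}

def pluralise_isizulu(noun: str, noun_class: str) -> str:
    """Swap the singular prefix for the plural one, given the noun class."""
    prefixes, plural_prefix = ISIZULU_CLASSES[noun_class]
    for prefix in prefixes:
        if noun.startswith(prefix):
            return plural_prefix + noun[len(prefix):]
    raise ValueError(f"{noun!r} does not start with a class {noun_class} prefix")

if __name__ == "__main__":
    print(pluralise_english("box"))
    # Same written prefix 'um-', different noun class, different plural:
    print(pluralise_isizulu("umfana", "1"))   # boy
    print(pluralise_isizulu("umfula", "3"))   # river
```

The last two calls show why spelling alone does not suffice: both nouns start with um-, so only the noun class determines which plural prefix applies.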

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.
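To make the difference between a fixed template and noun-class-governed agreement concrete, here is a minimal sketch in Python. It is not the grammar engine of [6,7]: it covers only the quantified subject and the verb (the 'some' part of the axiom is omitted), the function names are hypothetical, and the small agreement table lists just three noun classes.

```python
def verbalise_all_english(plural_noun: str, verb: str) -> str:
    """Fixed template: the surrounding words never change."""
    return f"All {plural_noun} {verb}"

# Illustrative subset: noun class of the plural noun ->
# (universal quantifier, subject concord prefixed to the verb stem).
AGREEMENT = {
    2: ("bonke", "ba"),
    6: ("onke", "a"),
    10: ("zonke", "zi"),
}

def verbalise_all_isizulu(plural_noun: str, noun_class: int, verb_stem: str) -> str:
    """Pick the quantifier and the subject concord from the noun class."""
    quantifier, concord = AGREEMENT[noun_class]
    return f"{quantifier.capitalize()} {plural_noun} {concord}{verb_stem}"

if __name__ == "__main__":
    print(verbalise_all_english("horses", "eat"))
    for noun, nc in [("abafana", 2), ("amahhashi", 6), ("izinja", 10)]:
        print(verbalise_all_isizulu(noun, nc, "dla"))
```

In the English template only the slotted values vary, whereas in the isiZulu version both the quantifier and the verb form (badla, adla, zidla) change with the noun class of the noun that is slotted in.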

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.

Data-driven approaches that use lots of text

The rule-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches that use lots of text are on the rise: they offer the hope of achieving more with less effort, of not even having to learn the language, and of easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that a statistics-driven language model learned from old texts like the Bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation [11] of isiZulu [summary][12]. Its algorithms use trigrams and the probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, as illustrated in Figure 3, rather than a dictionary-based approach, which is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding them a small isiXhosa corpus: this already achieved about 80% accuracy, even without optimisations.
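The general idea of trigram-based error detection can be sketched as follows. This is a simplified illustration and not the spellchecker's actual model: it uses character trigrams, a six-word toy corpus, and a hypothetical threshold, and it flags a word when any of its trigrams is rarer in the corpus than that threshold.

```python
from collections import Counter

def char_trigrams(word: str) -> list[str]:
    """Character trigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(corpus_words: list[str]) -> Counter:
    """Count how often each trigram occurs in the corpus."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_trigrams(word))
    return counts

def looks_correct(word: str, counts: Counter, threshold: float = 1e-6) -> bool:
    """Accept a word only if all its trigrams are frequent enough.

    The threshold is a hypothetical value; unseen trigrams have
    relative frequency 0 and therefore always fall below it.
    """
    total = sum(counts.values())
    return all(counts[t] / total >= threshold for t in char_trigrams(word))

if __name__ == "__main__":
    # Toy corpus; a real model would be trained on a large text collection.
    corpus = ["abantu", "abafana", "izinja", "izitsha", "umuntu", "umfana"]
    model = train(corpus)
    print(looks_correct("abantu", model))
    print(looks_correct("abxntu", model))
```

Because the check works on parts of words rather than whole words, a word that never occurred in the corpus can still pass if all its trigrams are familiar, which is what makes the approach attractive for agglutinating languages where a full word list is impractical.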

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data, i.e., the parallel sentences in both languages from which they have to learn the patterns, such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’, came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality so that, it is hoped, the resulting system will eventually replace the aforementioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.

Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft already put resources into the development of HLTs for African languages. Languages, and the identities and cultures intertwined with them, are a national resource, however. This suggests the need for more research and for the creation of a substantial public good, consisting of a wide range of HLTs, to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623, pp. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M. An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly, San Jose, USA, 17-20 August 1998.

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] Khumalo, L. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. IST-Africa 2016, May 11-13, 2016, Durban, South Africa. IIMC, 2016, 1-10.

[11] Finlayson, R., Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

5 responses to “ICTs for South Africa’s indigenous languages should be a national imperative, too”

Alan says:

February 22, 2020 at 12:57 PM

Hi
I agree with your sentiment entirely. I have already secured
a partnership with two major companies to do exactly what
you are already doing. One of these Companies is based overseas and is now recognised as the most accurate of all the voice recognition platforms.
I would very much like to talk to you directly.
