Not sorry at all—Review of “Sorry, not Sorry” by Haji Dawjee

Some papers are in the review pipeline for longer than they ought to be, and the travel part of conference attendance is a good opportunity to read books. So, instead of writing more about research, here’s a blog post with a book review instead: Sorry, not sorry—Experiences of a brown woman in a white South Africa by South African journalist Haji Mohamed Dawjee. It’s one of those books I bought out of curiosity, as the main title intrigued me on two aspects. First, it contradicts itself—if you’re not sorry, then don’t apologise for not being so. Second, the subtitle, as it can be useful to read what people who don’t get much media coverage have to say. It turned out to have been published only last month, so let me break with the usual pattern and write a review now rather than wait until the usual January instalments.

The book contains 20 essays on Dawjee’s experiences, broadly and through many specific events, and her reflections on growing up and working in South Africa. Depending on your background, you’ll find more or fewer recognisable points in it, or perhaps none at all and you’ll just eat the whole spiced dish served, but if you’re a woke South African white or think of yourself as a do-gooder white, you probably won’t like certain sections of it. As it is not my intention to write a very long review, I’ve picked a few essays to comment on, but there’s no clear single favourite among the essays. There are two essays that I think the book could have done without, but, well, I suppose the author is asserting something with them that has to do with the first essay and I’m just missing the point. That first essay is entitled ‘We don’t really write what we like’ and relates back to Biko’s statement and essay collection I write what I like, not the Writing what we like essay collection of 2016. It describes the media landscape, the difficulties people of colour face in getting published, and that their articles are always expected to have some relevance and insight—“having to be on the frontlines of critical thinking”—rather than some drivel that white guys can get away with, as “We too have nice experiences. We think about things and dream and have magic in us. We have fuzzy fables to share.”. Dawjee doesn’t consider such airy-fairy stories by the white guys to be brave, but to exhibit opportunity and privilege, and she wants to have that opportunity and privilege, too. This book, however, is mainly of the not-drivel and making-a-point sort of writing rather than flowery language devoid of a message.

For instance, what it was like on the journalism side when Mandela died, and how the magazine she was working for changed her story about a successful black guy into one “more Tsotsi-like”, because “[t]he obvious reason for the editorial manipulation was that no-one wanted a story of a good black kid. Only white kids are intrinsically exceptional.” (discussed in the essay ‘The curious case of the old white architect’). Several essays describe unpleasant behind-the-scenes experiences in journalism, such as at YOU magazine, and provide context for her article Maid in South Africa, which had as blurb “White people can walk their dogs, but not their children” and apparently caused a shitstorm on social media. There was an opinion-piece response by one of Dawjee’s colleagues, “coming to my ‘rescue’” and who “needed to whitesplain my thoughts and sanitise them with her ‘wokeness’” (p190). It’s a prelude to finishing on a high note (more about that further below), and it illustrates one of the recurring topics—the major irritation with the do-gooders, the woke whites, the ones who put themselves in the ‘good whites’ and ‘liberal left’ boxes but who nonetheless still contribute to systemic racism. This relates to Biko’s essay on the problems with white liberals and similar essays in his I write what I like, where they are described as a category, and in Dawjee’s book it is illustrated with multiple examples.


In an essay quite different in style, ‘Why I’m down with Downton Abbey’ (the TV series), Dawjee revels in the joys of seeing white servants doing the scurrying around, cooking, cleaning etc. for the rich. On the one hand, knowing a little of South African society by now: understandable. On the other hand, it leaves me wondering just how messed up the media are that people here still seem to think (this is not the first or second time I came across this topic) that up in Europe most or all families also have maids and gardeners. They don’t. As one Irish placard put it, “clean up your own shite” is the standard, as is DIY gardening and cooking. Those chores, or joys, are done by the women, children, and men of the nuclear family, not by hired help.

Related to that latter point—who’s doing the chores—two essays have to do with feminism and Islam. The essay title ‘And how the women of Islam did slay’ speaks for itself. And, yes, as Dawjee says, it cannot be repeated often enough that there were strong, successful, and intelligent women at the bedrock of Islam and women actually do have rights (unlike under Christianity); in case you want some references on women’s rights under Islam, have a look at the essay I wrote a while ago about it. ‘My mother, the true radical’ touches upon notions of feminism and who gets to decide who is feminist when and in what way.


I do not quite agree with Dawjee’s conclusion drawn from her Tinder experiences in ‘Tinder is a pocket full of rejection, in two parts’. On p129 she writes: “Tinder in South Africa is nothing but fertile ground for race-based rejection.” If it were a straightforward case of just race-based swiping, then, statistically, I should have had lots of matches with SA white guys, as I surely look white with my pale skin, blue eyes, and dark blonde hair (that I ended up in the 0.6% ‘other’ box in the SA census in 2011 is a separate story). But, nada. In my 1.5 years of Tinder experiment in Cape Town, I never got a match with a white guy from SA either, but plenty of matches with blacks, broad and narrow. I still hypothesise that the lack of matches with the white guys is because I list my employer, which scares away men who do not like women who’ve enjoyed some higher education, as it has scared away countless men in several other countries as well. Having educated oneself out of the marriage market, it is also called. There’s a realistic chance that a majority of those South African whites who swiped left on Dawjee are racist, but, sadly, their distorted views on humanity include insecurities on more than one front, and I’m willing to bet that Dawjee having an honours degree under her belt will have contributed to it. That said, two anecdotes don’t make data, and an OKCupid-type analysis like Rudder’s Dataclysm (review) but then with Tinder data would be interesting, so as to get to the bottom of it.


The two, imho, skippable essays are ‘Joining a cult is a terrible idea’ (duh) and ‘Depression: A journal’. I’m not into too-personal revelations, and would have preferred a general analysis of how society deals, or doesn’t, with mental illness, or, if something more concrete, relating it to, say, the Life Esidimeni case from whichever angle.


Meandering through the various serious subtopics and digressions, the essays as a whole combine into chronicling the road Dawjee took to decolonise her mind, culminating in a fine series of statements in the last part of the last essay. She is not sorry for refusing to be a doormat, for saying so, and for the consequences that will have for those who perpetuate and benefit from systemic racism, and she now lives from a position of strength rather than struggling and doubting on the receiving end of it.


Overall, it was an interesting book and worthwhile to have read. The writing style is very accessible, so one can read the whole book in a day or so. In case you are still unsure whether you want to read it or not: there are free book extracts of ‘We don’t really write what we like’, ‘Begging to be white?’, and ‘And how the women of Islam did slay’ and, at the time of writing this blog post, one written review on News24 and Eusebius McKaiser’s Radio 702 interview with Dawjee (both also positive about the book).


‘Problem shopping’ and networking at IST-Africa’18 in Gaborone

There are several local and regional conferences in (Sub-Saharan) Africa with a focus on Africa in one way or another, be it for, say, computer science and information systems in (mainly) South Africa, computer networks in Africa, or for (computer) engineers. The IST-Africa series covers a broad set of topics, and papers must explicitly state what all that research output is good for, and how, within an African context; hence, a considerable proportion of its scope falls within the ICT for Development sphere. I had heard from colleagues that it was a good networking opportunity, one of my students had obtained some publishable results during her CS honours project that could be whipped into paper shape [1], I hadn’t been to Botswana before, and I’m on sabbatical so I have some time. To make a long story short: the conference has just finished, and I’ll write a bit about the experiences in the remainder of this post.

First, regarding the title of the post: I’m not quite an ICT4D researcher, but I do prefer to work on computer science problems that are based on actual problems that don’t have a solution yet, rather than on invented toy examples. A multitude of papers presented at the conference elaborated on problem specification: the authors had gone out in the field and done contextual inquiries, attitude surveys, and the like so as to better understand the multifaceted problems themselves before working toward a solution that will actually work (cf. the white elephants littered around the continent). So, in a way, the conference also doubled as a ‘problem shopping’ event, though note that many solutions were presented as well. Here’s a brief smorgasbord:

  • Obstacles to eLearning in, say, Tanzania: internet access (40% only), lack of support, lack of local digital content, and too few data-driven analyses of experiments [2].
  • Digital content for healthcare students and practitioners in WikiTropica [3], which has the ‘usual’ problems of low-resource needs (e.g., a textbook with lots of pictures that nonetheless has to work on a mobile phone or tablet), the last mile, and language. Also: how does one get people to participate in developing such resources? That’s still an open question; students of my colleague Hussein Suleman have been trying to figure out how to motivate them. As to the 24 responses by participants to the question “…Which incentive do you need?”, the results were: 7 money/devices, 7 recognition, 4 none, 4 humanity/care/usefulness, 1 share & learn, and 1 not sure (my encoding).

    [Thumbnail: content collaboration perceptions]

    [Thumbnail: information sharing perceptions]

    With respect to practices and attitudes toward information sharing, the answers were not quite encouraging (see thumbnails). Of course, all this is but a snapshot, but still.

  • The workshop on geospatial sciences & land administration had a paper on building a national database infrastructure, which wasn’t free of challenges, among others: buying data is costly, data may be available but without metadata, privacy issues, and data that was collected for one purpose can’t easily be repurposed because consent can’t be asked again (p16) [4].
  • How to overcome the (perceived to be the main) hurdle of lack of trust in electronic voting in Kenya [5]. In Thiga’s case, they let the students help with coding the voting software and kept things ‘offline’, with a local network in the voting room and the server in sight [5]. There were lively comments in the whole session on voting (session 8c), including privacy issues, auditability, whether blockchain could help (yes on auditability and also anonymity, but it consumes a lot of [too much?] electricity, according to a Namibian delegate also in attendance), and scaling up to the population or not (probably not for a while, due to digital literacy and access issues, in addition to the trust issue). The research and experiments continue.
  • Headaches of data integration in Buffalo City to get the water billing information system working properly [6]. The usual culprits in system integration from the information systems viewpoint (e.g., no buy-in by top management or users) were held against the case in the city (cf. the CS side of the equation, like noisy data, gaps, vocabulary alignment, etc.). Upon further inquiry, specific issues came to the surface, like water meters not having been read for several years while customers paid some guesstimate all the while, and interactions between the water billing system and the (separate) electricity billing system that cause problems for customers even when they have paid [6]. A framework was proposed, but that hasn’t solved the actual data integration problem.

There were five parallel sessions over the three days (programme), so there are many papers to check out still.

As to networking with people in Africa, it was good especially to meet African ontologists and Semantic Web enthusiasts, and to learn of the Botswana National Productivity Centre (a spellchecker might help there, though that would need a bit more research for seTswana), and, completely unrelatedly, I ended up bringing up the software-based clicker system we developed a few years ago (and which still works). The sessions were well attended—notwithstanding that most of us also saw monkeys and beautiful sunsets, and did game drives and such—and for many it was a unique opportunity, with attendees ranging from lucky postgrads with some funding to professors from the various institutions. A quick scan through the participants list showed that relatively many participants are affiliated with institutions from South Africa, Botswana, Tanzania, Kenya, and Uganda, but also a few from Cameroon, Burkina Faso, Senegal, Angola, and Malawi, among others, and a few from outside Africa, such as the USA, Finland, Canada, and Germany. There was also a representative from the EU’s DEVCO and from GEANT (the one behind Eduroam). Last, but not least, not only the Minister of Transport and Communication, Onkokame Kitso, was present at the conference’s opening ceremony, but also the brand new—39 days and counting—President of Botswana, Mokgweetsi Masisi.

No doubt there will be a 14th installment of the conference next year. The paper deadline tends to be in December and extended into January.



(papers are now only on the USB stick but will appear in IEEE Xplore soon)

[1] Mjaria F, Keet CM. A statistical approach to error correction for isiZulu spellcheckers. IST-Africa 2018.

[2] Mtebe J, Raphael C. A critical review of eLearning Research trends in Tanzania. IST-Africa 2018.

[3] Kennis J. WikiTropica: collaborative knowledge management in the field of tropical medicine and international health. IST-Africa 2018.

[4] Maphanyane J, Nkwae B, Oitsile T, Serame T, Jakoba K. Towards the Building of a Robust National Database Infrastructure (NSDI) Developing Country Needs: Botswana Case Study. IST-Africa 2018.

[5] Thiga M, Chebon V, Kiptoo S, Okumu E, Onyango D. Electronic Voting System for University Student Elections: The Case of Kabarak University, Kenya. IST-Africa 2018.

[6] Naki A, Boucher D, Nzewi O. A Framework to Mitigate Water Billing Information Systems Integration Challenges at Municipalities. IST-Africa 2018.

CFP 6th Controlled Natural Languages workshop

Here’s some advertisement to submit a paper to a great scientific event that has a constructive and stimulating atmosphere. How can one claim these positive aspects upfront, one might wonder. I have participated in previous editions (e.g., this time and another time) and now I’m also a member of the organising committee for this 6th edition of the workshop, and we’ll do our best to make it a great event again.



Final Call for Papers

Sixth Workshop on Controlled Natural Language (CNL 2018)

Submission deadline (All papers): 15 April 2018

Workshop: 27-28 August 2018 in Maynooth, Co Kildare, Ireland

This workshop on Controlled Natural Language (CNL) has a broad scope and embraces all approaches that are based on natural language and apply restrictions on vocabulary, grammar, and/or semantics.

The workshop proceedings will be published open access by IOS Press.

For further information, please see:

ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages, with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already underresourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as it “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6(2)). But the obligations go beyond just language promotion. Take, e.g., the right of access to the public health system: one study showed that only 6% of patient-doctor consultations were held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieving redress of the injustices of the past. This route ticks multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?


Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, syntax-based rules alone can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent; the idea is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.
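To make that contrast concrete, here is a toy sketch of the syntax-only English route, in the spirit of the regular-expression approach mentioned above (the rule set and function name are my own illustration and cover only a tiny subset of English): a handful of orthographic rewrite rules suffice, with no meaning involved. For isiZulu no such string-only rule list can work, since the plural prefix depends on the noun class (e.g., umfundi ‘student’, class 1, pluralises to abafundi, class 2).

```python
import re

# A toy, syntax-only English pluraliser (illustrative subset of rules only):
# the first matching orthographic rule wins; no semantics is needed.
RULES = [
    (r'^(.*[^aeiou])y$', r'\1ies'),       # city -> cities
    (r'^(.*(?:s|x|z|ch|sh))$', r'\1es'),  # box -> boxes, bus -> buses
    (r'^(.*)$', r'\1s'),                  # professor -> professors
]

def pluralise_en(noun: str) -> str:
    for pattern, repl in RULES:
        if re.match(pattern, noun):
            return re.sub(pattern, repl, noun)
    return noun
```

For isiZulu, by contrast, the algorithm first has to know (or guess) the noun class before it can even choose the plural prefix, which is where the semantics comes in.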

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.
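As a minimal sketch of why fixed templates fail (this is my own simplification for illustration, not the actual grammar engine): the subject concord on the verb has to be computed from the subject’s noun class, so the verb cannot be a fixed string in a template. The concord table below covers only a few noun classes and ignores phonological conditioning, tense, negation, and the rest of what the real engine handles.

```python
# Illustrative subset of isiZulu subject concords by noun class (nc);
# the real engine covers all classes plus phonological conditioning,
# tense, negation, etc.
SUBJECT_CONCORD = {'1': 'u', '2': 'ba', '6': 'a', '10': 'zi'}

def verbalise(subject_noun: str, nc: str, verb_root: str) -> str:
    # the verb's surface form is computed from the subject's noun class,
    # so it cannot be slotted in as a fixed string
    return f'{subject_noun} {SUBJECT_CONCORD[nc]}{verb_root}'
```

For instance, with the verb root -dla ‘eat’, a class 2 subject yields badla but a class 10 subject yields zidla, as in Figure 1.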


Data-driven approaches that use lots of text

The rule-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lots of text are on the rise: they offer the hope of achieving more with less effort, of not even having to learn the language, and of easier bootstrapping of tools for related languages. This can work, provided one has a lot of good-quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that a statistics-driven language model learned on old texts like the Bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and the probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, as illustrated in Figure 3, rather than a dictionary-based approach, which is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding them a small isiXhosa corpus: they achieved about 80% accuracy already, even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker
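The trigram idea can be sketched as follows (a simplification with a made-up threshold and none of the smoothing and tuning of the actual tool): learn character-trigram probabilities from a corpus, then flag any word whose trigrams are collectively too improbable.

```python
from collections import Counter

def train(corpus_words):
    """Character-trigram probabilities estimated from a (toy) corpus of words."""
    counts = Counter()
    for w in corpus_words:
        padded = f'${w}$'  # '$' marks the word boundaries
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def looks_correct(word, model, threshold=1e-4):
    """Accept a word only if its trigrams are jointly probable enough."""
    padded = f'${word}$'
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    prob = 1.0
    for t in trigrams:
        prob *= model.get(t, 0.0)  # an unseen trigram gives probability 0
    # compare the per-trigram geometric mean against the threshold
    return prob ** (1.0 / len(trigrams)) > threshold
```

Note that no dictionary of full words is needed, which is exactly what makes the approach feasible for agglutinating languages with their huge number of possible word forms.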

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data, i.e., the parallel sentences in both languages from which the system has to learn patterns such as the concordial agreement izi- zi- (see Figure 1). In one of our experiments, where the MT system learned from software localization texts, an isiXhosa sentence in the context of healthcare, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’, came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.


Because of the ubiquity of ICTs throughout society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft are already putting resources into the development of HLTs for African languages. Languages, and the identities and cultures intertwined with them, are a national resource, however; hence the need for more research and for the creation of a substantial public good of a wide range of HLTs to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

Logics and other math for computing (LAC18 report)

Last week I participated in the Workshop on Logic, Algebra, and Category theory (LAC2018) (and their applications in computer science), which was held 12-16 February at La Trobe University in Melbourne, Australia. It’s not fully in my research area, so there was lots of fun stuff to learn. There were tutorials in the morning and talks in the afternoon, and, of course, networking and collaborations over lunch and in the evenings.

I finally learned some (hardcore) foundations of institutions, which underpin the OMG-standardised Distributed Ontology, Model, and Specification Language DOL, whose standard we used in the (award-winning) K-CAP17 paper. It concerns the mathematical foundations to handle different languages in one overarching framework. That framework takes care of the ‘repetitive stuff’—like all languages dealing with sentences, signatures, models, satisfaction etc.—in one fell swoop instead of repeating that for each language (logic). The 5-day tutorial was given by Andrzej Tarlecki from the University of Warsaw (slides).

Oliver Kutz, from the Free University of Bozen-Bolzano, presented our K-CAP paper as part of his DOL tutorial (slides), as well as some more practical motivations for and requirements that went into DOL, or: why ontology engineers need DOL to solve some of the problems.

Dirk Pattinson from the Australian National University started gently with modal logics, but it soon got more involved with coalgebraic logics later on in the week.

The afternoons had two presentations each. The ones of most interest to me included, among others, CSP by Michael Jackson; José Fiadeiro’s fun flexible modal logic for specifying actor networks for, e.g., robots and security breaches (that looks hopeless for implementations, but that as an aside); Ionuț Țuțu’s presentation on model transformations focusing on the maths foundations (cf the boxes-and-lines in, say, Eclipse); and Adrian Rodriguez’s program analysis with Maude (slides). My own presentation was about ontological and logical foundations for interoperability among the main conceptual data modelling languages (slides). They covered some of the outcomes from the bilateral project with Pablo Fillottrani and some new results obtained afterward.

Last, but not least, emeritus Prof Jennifer Seberry gave a presentation about a topic we probably all should have known about: Hadamard matrices and transformations, which appear to be used widely in, among others, error correction, cryptography, spectroscopy and NMR, data encryption, and compression algorithms such as MPEG-4.

Lots of thanks go to Daniel Găină for taking care of most of the organization of the successful event. (and thanks to the generous funders, which made it possible for all of us to fly over to Australia and stay for the week 🙂 ). My many pages of notes will keep me occupied for a while!

Updated isiZulu spellchecker and new isiXhosa spellchecker

Noting that February is the month of language activism in South Africa and that 21 February is International Mother Language Day (a United Nations event since 2000), let me add my proverbial two cents to that. Since the launch of the isiZulu spellchecker in November 2016, research and development have progressed quite a bit, so that we have released a new ‘version 2’ of the spellchecker. For those not in the know: isiZulu and isiXhosa are both among the 11 official languages of South Africa, with isiZulu the largest language in the country by first-language speakers and isiXhosa slated to make an international breakthrough, as it’s used in the Black Panther movie that was released this weekend. Anyhow, the main novelties of the updated spellchecker are:

  • first error correction algorithms for isiZulu;
  • improved error detection with a few basic rules, also for isiZulu;
  • new isiXhosa error detection and correction.

The source code is open source, and, due to various tool limitations beyond our control, it’s still a standalone jar file (zipped for download). Here’s a screenshot of the tool, where it checks a piece of text from a novel in isiZulu, illustrating that *khupels has a substitution error (khupela was the intended word):

Single word error *khupels that has a substitution error s for a in the intended word (khupela)

The error corrector can propose possible corrections for single-word errors that are transpositions, substitutions, insertions, or deletions—so, for instance, *eybo, *yrbo, *yeebo, and *ybo, respectively, cf. the correctly spelled yebo ‘yes’. It doesn’t perform equally well on each type of typo yet, with the best results obtained for transpositions. As with the error detector, it relies on a data-driven approach, with, for error correction, considerably more statistics-based algorithms than for the detection-only case. They are described in detail in Frida Mjaria’s 2017 CS honours project. Suggestion accuracy (i.e., that it can at least suggest something) is 95% and suggestion relevance (that the suggestions contain the intended word) made it to 61%, mainly due to weak results for corrections of insertion errors (they mess too much with the trigrams).
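A common way to realise such a corrector (shown here as an illustrative sketch in the style of a classic single-edit candidate generator, not necessarily the algorithms of the actual tool) is to generate all strings one edit away, covering the four error types, and then rank the candidates with the trigram language model:

```python
def edit1_candidates(word, alphabet='abcdefghijklmnopqrstuvwxyz'):
    """All strings one edit away: deletion, transposition, substitution, insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts) - {word}
```

For example, yebo is among the candidates generated for each of *eybo, *yrbo, *yeebo, and *ybo; the trigram probabilities would then decide which candidates are plausible isiZulu words and in what order to suggest them.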

The error detection accuracy has been improved mainly through better handling of punctuation, thanks to Norman Pilusa’s programming efforts. This was done through a series of rules on top of the data-driven approach, for it is too hard to learn those from a large corpus. For instance, semicolons, end-of-sentence periods, and numbers (written in isiZulu like, e.g., ngu-42 rather than just 42) are now mostly ignored, rather than the words adjacent to them being flagged as probably misspelt. It works better than the only other available isiZulu spellchecker: on a random selection of actual pieces of text, our tool obtained 91.71% lexical recall for error detection, whereas the other one got to 82.66% on the same text. Put differently: the other spellchecker flagged about twice as many words as incorrect as ours did (so there wasn’t much point in comparing error corrections).
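The kind of rule layer described above might look like this (an illustrative sketch; the actual tool’s rules are richer): tokens that are plain numbers, isiZulu-style prefixed numbers such as ngu-42, or bare punctuation are filtered out before the statistical detector ever sees them.

```python
import re

# Tokens matching any of these patterns are skipped, not spellchecked:
# plain numbers, prefixed numbers in the style of ngu-42, bare punctuation.
SKIP = re.compile(r'^(?:\d+|[a-z]+-\d+|\W+)$', re.IGNORECASE)

def tokens_to_check(text):
    """Tokens worth handing to the statistical error detector."""
    out = []
    for tok in text.split():
        tok = tok.strip('.,;:!?()"\'')  # shed adjacent punctuation first
        if tok and not SKIP.match(tok):
            out.append(tok)
    return out
```

Stripping the punctuation before matching is what prevents an otherwise correctly spelled word followed by a semicolon from being flagged.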

Finally, because all the algorithms are essentially language-independent (ok, there’s an underlying assumption that they’re used for highly agglutinative languages), we fed the algorithms a large isiXhosa corpus that is being developed as part of another project, and incorporated that into the spellchecker. There’s room for some fine-tuning, especially for the corrector, but at least now there is one, thanks to Norman Pilusa’s software development contributions. That we thought we could get away with this approach is thanks to Nthabiseng Mashiane’s 2017 CS honours project, which showed that the results would be fairly good (>80% error detection) given more data. We also tried a rules-based approach for isiXhosa. It obtained better accuracies than Nthabiseng’s statistical language model, but only for those parts of speech covered by the rules, which is a subset of all types of words. If you’re interested in those rules, please check out Siseko Neti’s 2017 CS Honours project. To the best of my knowledge, it’s the first time those rules have been formally represented in a computer-usable format, and they may be useful for other endeavours, such as morphological analysers.
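The language independence comes from the fact that the detector only learns character trigram statistics from whatever corpus it is given. A minimal sketch of that idea, with an illustrative threshold and word-boundary padding of my own choosing (the actual projects use more refined statistics):

```python
from collections import Counter

def trigrams(word):
    """Character trigrams of `word`, padded with word-boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(corpus_words):
    """Count trigram occurrences over a (toy) training corpus."""
    counts = Counter()
    for w in corpus_words:
        counts.update(trigrams(w))
    return counts

def looks_misspelt(word, counts, threshold=1):
    """Flag a word containing any trigram seen fewer than `threshold` times."""
    return any(counts[t] < threshold for t in trigrams(word))

model = train(["yebo", "khupela", "ukuze"] * 10)
print(looks_misspelt("yebo", model))     # False
print(looks_misspelt("khupels", model))  # True: trigram 'ls$' never seen
```

Swapping in an isiXhosa corpus instead of an isiZulu one is then just a matter of retraining, which is essentially what was done here.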

A section of the isiXhosa Wikipedia entry about the UN (*ukuez should be ukuze, which is among the proposed words).

Further improvements are possible, and are being scoped for a v3 later on. For instance, for the linguists and language scholars: what are the most common typos? What are the most commonly used words? Had we known that, it would have been an easy way to boost performance. Can we find optimisations for substitutions, insertions, and deletions similar to the one for transpositions? Should some syntax rules be added for further optimisation? These are some of the outstanding questions. If you’re interested in these or related questions, or would like to use the algorithms in your tool, please contact me.

Ontology pub quiz questions of ISAO 2016 and JOWO 2017

In 2016, when I was a PC chair of the International School for Applied Ontology (ISAO 2016), the idea of organising a contest for the participants somehow turned into a pub quiz. The lecturers provided one or more questions on the topics they’d be teaching and I added a few as well. This set of ISAO16 ontology pub quiz questions is now finally online. It comes with the warning that it is biased toward the topics covered at ISAO 2016, and it turned out that a few questions were not well formulated and/or not everyone agreed with the answer.

Notwithstanding, the idea was deemed sufficiently successful that the general chair of the Joint Ontology Workshops (JOWO 2017) wanted one for JOWO 2017 as well. Several questions were thrown out of the ISAO16 set for various reasons and more general Ontology questions made their way in, as well as a few ‘fun’ and trivia ones in the hope of adding some more entertainment to the ontology pub quiz. The JOWO17 pub quiz question set with answers is now also online to play with, and it is, in my opinion, a nicer set than the ISAO16 one. Here are a few questions to give you a taste of it:

  • Where/when can a pointless theory be relevant?
  • What is the goal of guerrilla ontology?
  • No Italian pizza has fruit as topping. Which of the following is (an)/are Italian pizza(s)? Pizza Hawaii, Pizza margherita, Pizza bianca romana (‘white roman pizza’)
  • When was the earliest published occurrence of the word “ontology”?

It turned out that it still was not entirely free of debate. If you disagree with one of the answers now, then let me paraphrase Stefano Borgo, who co-ran the JOWO17 pub quiz at the Irish pub in Bolzano on 23 September: maybe there’s something there to write up and submit to FOIS 2018 :-). Or you can note it in the comments section below, so that those questions won’t be recycled and I can add longer answers to them.