“Grammar-infused” templates for NLG

In natural language generation and controlled natural languages, it’s hardly ever entirely one extreme or the other. Rarely can one get away with simplistic ‘just fill in the blanks’ templates that do no grammatical or phonological processing to improve the output; if one still needs convincing, our technical report on work done some 17 years ago was a case in point on those limitations [1]. But where does NLG start? I agree with Ehud Reiter that it isn’t a matter of template versus NLG, but of levels of sophistication: fill-in-the-blank templates definitely don’t count as NLG and full-fledged grammar-only systems definitely do, with anything in between a grey area. Adding word-level grammatical functions to templates makes them lean toward NLG, or even count as such if there are relatively many such rules, and dynamically creating nicely readable sentences with aggregation and connectives certainly counts as NLG, too.

With that in mind, we struggled with how to name the beasts we had created for generating sentences in isiZulu [2], a Niger-Congo B language: nearly every word in the generated sentences required a number of grammar rules to render sufficiently well (i.e., at least grammatically acceptable and understandable). Since we didn’t have a proper grammar engine yet, but knew they could never be fill-in-the-blank templates either, we dubbed them verbalisation patterns. Most systems (by number of systems) use either only templates or templates+grammar, so our implemented system [3] was in good company. It may sound like oldskool technology, but go ask Meta with their Galactica whether an ML/DL-based approach is great for generating sensible text that doesn’t hallucinate… and does it well for languages other than English.

That said, honestly, those first attempts we did for isiZulu were not ideal for reusability and maintainability – that was not the focus – and they opened up another can of worms: how do you link templates to (partial) grammar rules? The ‘partial’ is motivated by taking it one step at a time in grammar engine development, as a sort of agile engine development process that is relevant especially for languages that are not well-resourced.

We looked into this recently. There turn out to be three key mechanisms for linking templates to computational grammar rules: embedding (E), where grammar rules are mixed in with the template specifications and therewith co-dependent, and compulsory (C) and partial (P) attachment, where the grammar rules have, or can have, an independent existence.

Attachment of grammar rules (that can be separated) vs embedding of grammar rules in a system (intertwined with templates) (Source: [6])

The difference between the latter two is subtle but important for the use and reuse of grammar rules in the software system and for the NLG-ness of it: if each template must use at least one rule from the set of grammar rules and each rule is used somewhere, then the set of rules is compulsorily attached. Conversely, it is partially attached if there are templates in that system that don’t have any grammar rules attached. Whether it is partial because attachment isn’t needed (e.g., the natural language’s grammar is pretty basic) or because the system is on the fill-in-the-blank not-NLG end of the spectrum is a separate question, but the compulsory variant is certainly more on the NLG side of things. Also, a system may use more than one of these mechanisms in different places; e.g., EC, both embedding and compulsory attachment. This was introduced in [4] in 2019 and expanded upon in a journal article entitled Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation [5] that was published in IJMSO, and even more detail can be found in Zola Mahlaza’s recently completed PhD thesis [6]. These papers have various examples, illustrations of how to categorise a system, and why one system was categorised in one way and not another. Here’s a table with several systems that combine templates and computational grammar rules and how they are categorised:

Source: [5]

We needed a short-hand name to refer to the cumbersome and wordy description of ‘combining templates with grammar rules in a [theoretical or implemented] system in some way’, which ended up being grammar-infused templates.

Why write about this now? Besides certain pandemic-induced priorities in 2021, the recently proposed template language for Abstract Wikipedia that I blogged about before may mix compulsory and partial attachment, but ought not to permit the messy embedding of grammar in a template. This may not have been clear in v1 of the proposal, but hopefully it is a little bit more so in this new version that was put online over the past few days. To make a long story short: besides a few notes at the start of its Section 3, there’s a generic description of an idea for a realisation algorithm. Its details don’t matter if you don’t intend to design a new realiser from scratch, and maybe not even if you want to link it to your existing system. The key take-away from that section is that that is where the real grammar and phonological conditioning happens, if it’s needed. For example, for the ‘age in years’ sub-template for isiZulu, recall that’s:

Year_zu(years):"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounPrefix()}-{nummod:Cardinal(years)}"

The template language sets some boundaries for declaring such a template, but it is a realiser that has to interpret ‘keywords’, such as root, concord, and RelativeConcord, and do something with them so that the output ends up correct; in this case, from ‘year’ + ‘25’ as input data to iminyaka engama-25 as output text. That process might be done in line with Ariel Gutman’s realiser pipeline for Abstract Wikipedia and his proof-of-concept implementation with Scribunto, or with any other realiser architecture or system, such as Grammatical Framework, SimpleNLG, NinaiUdiron, or Zola’s Nguni Grammar Engine, among several options for multilingual text generation. It might sound silly to put templates on top of the heavy machinery of a grammar engine, but it will make it more accessible to the general public so that they can specify how sentences should be generated. And, hopefully, it will permit a rules-as-you-go approach as well.

It is then up to the realiser (including grammar) engine, with the partially or compulsorily attached computational grammar rules and other algorithms, to work with the template. For the example, when it sees root and that the lemma fetched is a noun (L686326 is unyaka ‘year’), it also fetches the value of the noun class (a grammatical feature stored with the noun), which we always need somewhere for isiZulu NLG. It then needs to figure out that it must make a plural out of ‘year’, which it knows thanks to the years value fetched for the instance (i.e., 25, which is plural) and the nummod that links to the root by virtue of the design and the assumption that there’s a (dependency) grammar. Then, with concord:RelativeConcord, it will fetch the relative concord for that noun class, since concord also links to root. We have been able to do the concordial agreements and pluralising of nouns (and much more!) for isiZulu for several years already. The only hurdle is that that code would need to become interoperable with the template language specification, in that our realisers will have to be able to recognise and properly process those ‘keywords’. Those words are part of an extensible set of words inspired by dependency grammars.
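To make that a little more concrete, here is a minimal sketch in Python of what a realiser could do with that sub-template. The dictionaries and function names are illustrative stand-ins for the lexeme fetching and the attached grammar rules of a real engine, not the code of any existing implementation:

# Illustrative only: hard-coded lookups stand in for fetching the lexeme and
# for the attached grammar rules of a real grammar engine.
LEXEMES = {"L686326": {"sg": "unyaka", "pl": "iminyaka", "nc_sg": 3, "nc_pl": 4}}  # 'year'
RELATIVE_CONCORD = {4: "e"}    # relative concord per noun class (partial table)
NOUN_PREFIX = {6: "ama"}       # noun prefix per noun class (partial table)

def cardinal_noun_class(n):
    # the tens ('amashumi ...') are in noun class 6; a real engine computes this properly
    return 6

def year_zu(years):
    lex = LEXEMES["L686326"]
    root = lex["pl"] if years > 1 else lex["sg"]        # nummod forces the plural of 'year'
    nc = lex["nc_pl"] if years > 1 else lex["nc_sg"]
    concord = RELATIVE_CONCORD[nc]                      # concord:RelativeConcord()
    copula = "ng"                                       # Copula(), simplified
    prefix = NOUN_PREFIX[cardinal_noun_class(years)]    # concord_1<nummod:NounPrefix()
    return f"{root} {concord}{copula}{prefix}-{years}"  # nummod:Cardinal(years), as digits

print(year_zu(25))   # iminyaka engama-25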

How this is all supposed to interact smoothly still has to be figured out. Part of that is touched upon in the section about instrumentalising the template language: you could, for instance, specify it as functions in Wikifunctions, which are instantly editable, facilitating an add-rules-as-you-go approach. Or it can be done less flexibly, by mapping or transforming it to another template language or to the specification of an external realiser (since it’s the principle of attachment, not embedding, of computational grammar rules).

In closing, whether the term “grammar-infused templates” will stick remains to be seen, but combining templates with grammars in some way for NLG will have a solid future at least for as long as those ML/DL-based large language model systems keep hallucinating and don’t consider languages other than English, including the intended multilingual setting for Abstract Wikipedia.

References

[1] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157. (accepted version free access)

[3] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol 10577, 59-64. Portoroz, Slovenia, May 28 – June 2, 2017.

[4] Mahlaza, Z., Keet, C.M. A classification of grammar-infused templates for ontology and model verbalisation. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer vol. CCIS 1057, 64-76. 28-31 Oct 2019, Rome, Italy.

[5] Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

[6] Mahlaza, Z. Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu. PhD Thesis, Department of Computer Science, University of Cape Town, South Africa. 2022.


A review of NLG realizers and a new architecture

That last step in the process of generating text from some structured representation of data, information or knowledge is done by things called surface realizers. They take care of the ‘finishing touches’ – syntax, morphology, and orthography – to make good natural language sentences out of an ontology, conceptual data model, or Wikidata data, among many possible sources that can be used for declaring abstract representations. Besides theories, there are also many tools that try to get that working at least to some extent. Which ways, or system architectures, are available for generating the text? Which components do they all, or at least most of them, have? Where are the differences and how do they matter? Will they work for African languages? And if not, then what?

My soon-to-graduate PhD student Zola Mahlaza and I set out to answer these questions, and more, and the outcome is described in the article Surface realization architecture for low-resourced African languages that was recently accepted and is now in print with the ACM Transactions on Asian and Low-Resource Language Information Processing (ACM TALLIP) journal [1].

Zola examined 77 systems, which exhibited some 13 different principal architectures that could be classified into 6 distinct architecture categories. Purely by number of systems, manually coded and rule-based ones are the most popular, but there are a few hybrid and data-driven systems as well. A consensus architecture for realisers there is not. And none exhibit most of the software maintainability characteristics, like modularity, reusability, and analysability, that we need for African languages (even more so than for better-resourced languages). ‘African’ is narrowed down further in the paper to the languages in the Niger-Congo B (‘Bantu’) family. One of the tricky things is that there’s a lot going on at the sub-word level with these languages, whereas practically all extant realisers operate at the word level.

Hence, the next step was to create a new surface realizer architecture that is suitable for low-resourced African languages and that is maintainable. Perhaps unsurprisingly, since the paper is in print, this new architecture compares favourably against the required features. The new architecture also has ‘bonus’ features, like being guided by an ontology with a template ontology [2] for verification and interoperability. All its components and the rationale for putting it together this way are described in Section 5 of the article and the maintainability claims are discussed in its Section 6.

Source: [1]

There’s also a brief illustration of how one can redesign a realiser into the proposed architecture. As use case, we redesigned the architecture of OWLSIZ for question generation in isiZulu [3]. The code of that redesign of OWLSIZ is available, i.e., it’s not merely a case of having drawn a different diagram; it was proof-of-concept tested that it can be done.

While I obviously know what’s going on in the article, if you’d like many more details than are described there, I suggest you consult Zola, as the main author of the article, or his (soon to be available online) PhD thesis [4], which devotes roughly a chapter to this topic.

References

[1] Mahlaza, Z., Keet, C.M. Surface realisation architecture for low-resourced African languages. ACM Transactions on Asian and Low-Resource Language Information Processing, (in print). DOI: 10.1145/3567594.

[2] Mahlaza, Z., Keet, C.M. ToCT: A task ontology to manage complex templates. FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.

[3] Mahlaza, Z., Keet, C.M.: OWLSIZ: An isiZulu CNL for structured knowledge validation. In: Proc. of WebNLG+ 2020. pp. 15–25. ACL, Dublin, Ireland (Virtual).

[4] Mahlaza, Z. Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu. PhD Thesis, Department of Computer Science, University of Cape Town, South Africa. 2022.

A proposal for a template language for Abstract Wikipedia

Natural language generation applications have been ‘mainstreaming’ behind the scenes for the last couple of years, from automatically generating text for images, to weather forecasts, summarising news articles, digital assistants that mechanically blurt out text based on the structured information they have, and many more. Google, Reuters, BBC, Facebook – they all do it. Wikipedia is working on it as well, principally within the scope of Abstract Wikipedia, to try to build a better multilingual Wikipedia [1] that reaches more readers better. They all have some source of structured content – like data fetched from a database or spreadsheet, information from, say, a UML class diagram, or knowledge from some knowledge graph or ontology – and a specification as to what the structure of the sentence should be, typically with some grammar rules to at least prettify it, if not being essential to generate a grammatically correct sentence [2]. That specification is written in templates that are then filled with content.

For instance, a simple rendering of a template may be “Each [C1] [R1] at least one [C2]” or “[I1] is an instance of [C1]”, where the things within the square brackets are variables standing in for content that will be fetched from the source, like a class, relationship, or individual. Linking these to a knowledge graph about universities, it may generate, e.g., “Each academic teaches at least one course” and “Joanne Soap is an instance of Academic”. To get the computer to do this, just “Each [C1] [R1] at least one [C2]” as a template won’t do: we need to tell it what the components are so that the program can process it to generate that (pseudo-)natural language sentence.
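For illustration, the naive fill-in-the-blank reading of that first template is plain string substitution, which is precisely the sort of thing that does not count as NLG (a Python sketch with made-up variable names):

template = "Each [C1] [R1] at least one [C2]"
values = {"[C1]": "academic", "[R1]": "teaches", "[C2]": "course"}  # fetched from the knowledge graph
sentence = template
for slot, value in values.items():
    sentence = sentence.replace(slot, value)
print(sentence)   # Each academic teaches at least one course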

Many years ago, we did this for multiple languages and used XML to specify the templates for the key aspects of the content. The structured input was conceptual data models in ORM in the DOGMA tool, which had that verbalisation component [3]. As an example, the template for verbalising a mandatory constraint was as follows:

<Constraint xsi:type="Mandatory">
 <Text> - [Mandatory] Each</Text>
 <Object index="0"/>
 <Text>must</Text>
 <Role index="0"/>
 <Text>at least one</Text>
 <Object index="1"/>
</Constraint>

Besides demarcating the sentence and indicating the constraint, there’s fixed text within the <Text> … </Text> tags and there’s the variable part: the <Object… declares that the name of the object type has to be fetched and the <Role… declares that the name of the relationship (well, more precisely in this case: the reading label) has to be fetched from the model; these elements were declared in an XML Schema. With the same example as before, where Academic is in the object index “0” position and Course in the “1” position (see [3] for details), the software would then generate “ – [Mandatory] Each Academic must teaches at least one Course.”
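As a rough sketch of the processing side (not the actual DOGMA code), a few lines of Python suffice to walk such a template and slot in the names fetched from the model; the xmlns:xsi declaration is added here only so that the snippet parses stand-alone:

import xml.etree.ElementTree as ET

template = """<Constraint xsi:type="Mandatory"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <Text> - [Mandatory] Each</Text>
 <Object index="0"/>
 <Text>must</Text>
 <Role index="0"/>
 <Text>at least one</Text>
 <Object index="1"/>
</Constraint>"""

objects = ["Academic", "Course"]   # object type names from the ORM model
roles = ["teaches"]                # role reading labels from the ORM model

parts = []
for element in ET.fromstring(template):
    if element.tag == "Text":
        parts.append(element.text.strip())
    elif element.tag == "Object":
        parts.append(objects[int(element.get("index"))])
    elif element.tag == "Role":
        parts.append(roles[int(element.get("index"))])
print(" ".join(parts))   # - [Mandatory] Each Academic must teaches at least one Course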

This can be turned up several notches by adding grammatical features to it in order to handle, among other things, gender for nouns in German, because it affects the rendering of the ‘each’ and ‘one’ in the sample sentence, not to mention the noun classes of isiZulu and many other languages [4], where even the verb conjugation depends on the noun class of the noun that plays the role of subject in the sentence. Or you could add sentence aggregation to combine two templates into one larger one to generate more flowy text, like “Joanne Soap is an academic who teaches at least one course”. Or change the application scenario or the machinery for how to deal with the templates. For instance, instead of those variables in the template plus code elsewhere that does the content fetching and any linguistic processing, we could put part of that in the template specification. Then there are no variables as such in the template, but functions. The template specification for that same constraint in an ORM diagram might then look like this:

ConstraintIsMandatory {
 “[Mandatory] Each ”
 FetchObjectType(0)
 “ must ”
 MakeInfinitive(FetchRole(0))
 “ at least one ”
 FetchObjectType(1)}
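A sketch of how such a function-based specification could be executed, again in Python, with naive stand-ins for the fetching and linguistic functions (a real MakeInfinitive would consult a lexicon or morphological rules rather than strip a suffix):

object_types = ["Academic", "Course"]   # from the conceptual data model
roles = ["teaches"]                     # role reading labels

def FetchObjectType(i):
    return object_types[i]

def FetchRole(i):
    return roles[i]

def MakeInfinitive(verb):
    # naive: 'teaches' -> 'teach'; purely for illustration
    return verb[:-2] if verb.endswith("es") else verb.rstrip("s")

def ConstraintIsMandatory():
    return ("[Mandatory] Each " + FetchObjectType(0) + " must "
            + MakeInfinitive(FetchRole(0)) + " at least one " + FetchObjectType(1))

print(ConstraintIsMandatory())   # [Mandatory] Each Academic must teach at least one Course

Note that part of the linguistic processing now lives in the template’s functions rather than only in code elsewhere.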

If you want to go with newer technology than markup languages, you may prefer to specify it in JSON. If you’re excited about functional programming languages and see everything through the lens of functions, you even can turn the whole template specification into a bunch of only functions. Either way, there must be a specification of what those templates are permitted to look like, or: which elements can be used to make a valid template specification. This is so that the software will work properly: it should neither spit out garbage nor halt halfway without returning anything. What is permitted in a template language can be specified by means of a model, such as an XML Schema or a DTD, a JSON artefact, or even an ontology [5], by a formal definition in some notation of choice, or by defining a grammar (be it as a CFG or in BNF notation), and in any case with enough documentation to figure out what’s going on.

How might this look in the context of Abstract Wikipedia? For the natural language generation aspects and its first proposal for the realiser architecture, the structured content to be rendered in a natural language sentence is fetched from Wikidata, as is the lexicographic data, and the functions to do the various computations are to come from/go in Wikifunctions. They’re then combined with the templates in various stages in the realiser pipeline to generate those sentences. But there was still a gap as to what those templates in this context may look like. Ariel Gutman, a google.org fellow working on Abstract Wikipedia, and I gave it a try, and that proposal for a template language for Abstract Wikipedia is now accessible online for comment, feedback, and, if you happen to speak a grammatically rich language, for providing difficult examples so that we can check whether the language is expressive enough.

The proposal is – as any other proposal for a software system – some combination of theoretical foundations, software infrastructure peculiarities, reasoned and arbitrary design decisions, compromises, and time constraints. Here’s a diagram of the key aspects of the syntax, i.e., with the elements, how they relate, and the constraints holding between them, in ORM notation:

An illustrative diagram with the key features of the template language in ORM notation.

There’s also a version in CFG notation, and there are a few examples, each of which shows what the template looks like for verbalising one piece of information (Malala Yousafzai’s age) in Swedish, French, Hebrew, and isiZulu. Swedish is among the simplest, as English or Dutch would be, so let’s begin with a simple one, here in its Dutch rendering:

Persoon_leeftijd_nl(Entity,Age_in_years): “{Person(Entity)} is 
  {Age_in_years} jaar.”

Where the Person(Entity) fetches the name of the person (who is identified by an identifier) and the Age_in_years fetches the age. One may like to complicate matters and add a conditional statement, such that the last part renders not just as jaar ‘year(s)’ but, say, as jaar oud ‘years old’ for higher ages and jaar jong ‘years young’ for any age less than 30 – but where that dividing line lies is a sensitive topic for some and I will let that rest. In any case, in Dutch there’s no processing of the number itself to be able to render it in the sentence – 25 renders as 25 – but in other languages, such as isiZulu, there is. In that case, instead of simply fetching the number, we can put a function in the slot:

Person_AgeYr_zu(Entity,Age_in_years): “{subj:Person(Entity)} 
  {root:subjConcord()}na{Year(Age_in_years)}.”

That Year(Age_in_years) is a function that is based on either another function or a sub-template. For instance, it can be defined as follows:

Year_zu(years):"{root:Lexeme(L686326)} 
  {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounPrefix()}-
  {nummod:Cardinal(years)}"

Where Lexeme(L686326) is the word for ‘year’ in isiZulu, unyaka, and for the rest, it first links the age rendering to the ‘year’ with the RelativeConcord() of that word, which practically fetches e- for the ‘years’ (iminyaka, noun class 4), then gets the copulative (ng in this case), and then the prefix for the noun class of the number. Malala is in her 20s, which is amashumi amabili … (noun class 6, which is computed via Cardinal(years)), and thus the function NounPrefix will fetch ama-. So, for Malala’s age data, Year_zu(years) will return iminyaka engama-25. That then gets processed with the rest of the Person_AgeYr_zu template, such as adding a U to the name by subj:Person(Entity), and later steps in the pipeline that take care of things like phonological conditioning (-na- + i- = -ne-), to eventually output UMalala Yousafzai uneminyaka engama-25. In other words: such a template indeed can be specified with the proposed template syntax.
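Putting those last steps together, a toy Python sketch of that composition might look as follows; the subject concord u- for the person and the single phonological rule are hard-coded here for illustration, whereas a real realiser computes them:

def person_ageyr_zu(name, rendered_years):
    subj = "U" + name                   # subj:Person(Entity): prefix the name
    concord = "u"                       # root:subjConcord() for the person
    tail = "na" + rendered_years        # the literal 'na' from the template + sub-template output
    tail = tail.replace("nai", "ne")    # phonological conditioning: -na- + i- = -ne-
    return f"{subj} {concord}{tail}."

print(person_ageyr_zu("Malala Yousafzai", "iminyaka engama-25"))
# UMalala Yousafzai uneminyaka engama-25.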

There’s also a section in the proposal about how that template language then connects to the composition syntax so that it can be processed by the Wikifunctions Orchestrator component of the overall architecture. That helps hide a few complexities from the template declarations, but, yes, someone’s got to write those functions (or take them from existing grammar engines) that will take care of those more or less complicated processing steps. That’s a different problem to solve. You also could link it up with another realiser by means of a transformation to the input type it expects. For now, it’s the syntax of the declarative part for the templates.

If you have any questions or comments or suggestions on that proposal or interesting use cases to test with, please don’t hesitate to add something to the talk page of the proposal, leave a comment here, or contact either Ariel or me directly.


References

[1] Vrandečić, D. Building a multilingual Wikipedia. Communications of the ACM, 2021, 64(4), 38-41.

[2] Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

[3] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[4] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157.

[5] Mahlaza, Z., Keet, C. M. ToCT: A Task Ontology to Manage Complex Templates. Proceedings of the Joint Ontology Workshops 2021, FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.

NLG requirements for social robots in Sub-Saharan Africa

When the robots come rolling, or just trickling or seeping or slowly creeping, into daily life, I want them to be culturally aware, give contextually relevant responses, and do that in a language that the user can understand and speak well. Currently, they don’t. Since I work and live in South Africa, what does all that mean for the Southern African context? Would social robot use case scenarios be different here than in the Global North, where most of the robot research and development is happening, and if so, how? What is meant by contextually relevant responses? Which language(s) should the robot communicate in?

The question of which languages is the easiest to answer: those spoken in this region, which are mainly those in the Niger-Congo B [NCB] (aka ‘Bantu’) family of languages, and then also Portuguese, French, Afrikaans, and English. I’ve been working on theory and tools for NCB languages, and isiZulu in particular (and some isiXhosa and Runyankore), mainly as part of the two NRF-funded projects GeNI and MoReNL. However, if we don’t know how that human-robot interaction occurs and in which setting, we won’t know whether the algorithms designed so far can also be used for that, which may well go beyond the ontology verbalisation, a patient’s medicine prescription generation, weather forecasts, or language learning exercises that we roughly have covered for the controlled language and natural language generation aspects of it.

So then what about those use case scenarios and contextually relevant responses? Let me first give an example of the latter. A few years ago, in one of the social issues and professional practice lectures I was teaching, I brought in the Amazon Echo to illustrate precisely that, as well as privacy issues with Alexa and digital assistants (‘robot secretaries’) in general. Upon asking “What is the EFF?”, the whole class – some 300 students present at the time – was expecting that Alexa would respond with something like “The EFF is the Economic Freedom Fighters, a political party in South Africa”. Instead, Alexa fetched the international/US-based answer and responded with “The EFF is the Electronic Frontier Foundation”, which the class had never heard of, and that EFF doesn’t really do anything in South Africa (it does pass the revue later on in the module nonetheless, btw). There’s plenty of online content about the EFF as a political party, yet Alexa chose to ignore that and prioritise information from elsewhere. Go figure with lots of other information that has limited online presence and doesn’t score high in the search engine results because there are fewer queries about it. How to get the right answer in those cases is not my problem (area of expertise), but I take that as a solved black box and zoom in on the natural language aspects to automatically generate a sentence that has the answer, taken from some structured data or knowledge.

The other aspect of this instance is that the interactions both during and after the lecture were not 1:1 interactions of students with their own version of Siri or Cortana and the like; eager and curious students came in teams, so it was a 1:m interaction. While that particular class is relatively large and was already split into two sessions, larger classes are also not uncommon in several Sub-Saharan countries: for secondary school class sizes, the SADC average is 23.55 learners per class (the world average is 17), with the lowest in Botswana (13.8 learners) and the highest in Malawi, with a whopping 72.3 learners in a class, on average. An educational robot could well be a useful way to get out of that catch-22 and, given resource constraints, end up in a deployment scenario with a robot per study group, and that in a multilingual setting that permits code switching (going back and forth between different languages). While human-robot interaction experts still will need to do some contextual inquiries and such to get to the bottom of the exact requirements and sentences, this variation in use is on top of the hitherto known possible ways of using educational robots.

Going beyond this sort of informal chatter, I tried to structure it a bit and narrowed it down to a requirements analysis for the natural language generation aspects. After some contextualisation, I principally used two main use cases to elucidate natural language generation requirements and assessed those against key advances in research and technologies for NCB languages. Very, very briefly, any system will need to i) combine data-to-text and knowledge-to-text, ii) generate many more different types of sentences, including sentences for both written and spoken language in the NCB languages, which are grammatically rich and often agglutinating, and iii) process numbers, which is non-trivial for NCB languages because the surface realisation of a number depends on the noun class of the noun that is being counted. At present, no system out there can do all of that. A condensed version of the analysis was recently accepted as a paper entitled Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa [1] for the IST-Africa’21 conference, and it will be presented there next week at the virtual event, in the ‘next generation computing’ session no less, on Wednesday the 12th of May.

Screen grab of the recording of the conference presentation (link to recording available upon request)

Probably none of you has ever heard of this conference. IST-Africa is a yearly IT conference in Africa that aims to foster North-South and South-South networking, promote the academia->industry and academia->policy bridge-creation and knowledge transfer pipelines, and build capacity for paper writing and presentation. The topics covered are distinctly of regional relevance and, according to its call for papers, the “Technical, Policy, Social Implications Papers must present analysis of early/final Research or Implementation Project Results, or business, government, or societal sector Case Study”.

Why should I even bother with an event like that? It’s good to sometimes reflect on the context and ponder the relevance of one’s research: after all, part of the university’s income (and thus my salary) and a large part of the research project funding I have received so far come ultimately from the taxpayers – South African taxpayers, to be more precise, not the taxpayers of the Global North. I can ‘advertise’, ahem, my research area and its progress to a regional audience. Also, I don’t expect that the average scientist in the Global North would care about HRI in Africa, and even less so for NCB languages, but the analysis needed to be done and papers equate brownie points. Also, if everyone thinks it better not to participate in something local or regional, it won’t ever become a vibrant network of research, applied research, and technology. I attended the event once, in 2018, when we had a paper on error correction for isiZulu spellcheckers, and from my researcher viewpoint it was useful for networking and ‘shopping’ for interesting problems that I may be able to solve, based on other participants’ case studies and inquiries.

Time will tell whether attending that event then and now this paper and online attendance will be time wasted or well spent. Unlike the papers on the isiZulu spellcheckers that reported research and concrete results that a tech company easily could take up (feel free to do so), this is a ‘fluffy’ paper, but exploring the use of robots in Africa was an interesting activity to do, I learned a few things along the way, it will save other interested people time in the analysis phase, and hopefully it also will generate some interest and discussion about what sort of robots we’d want and what they could or should be doing to assist, rather than replace, humans.

p.s.: if you still were to think that there are no robots in Africa and deem all this to be irrelevant: besides robots in the automotive and mining industries by, e.g., Robotic Innovations and Robotic Handling Systems, there are robots in education (also in Cape Town, by RD-9), robot butlers in hotels that serve quarantined people with mild COVID-19 in Johannesburg, robots used for COVID-19 screening in Rwanda, and the Naledi personal banking app by Botlhale, to name but a few examples. Other tools are moving in that direction, such as, among others, Awezamed’s use of speech synthesis with (canned) text in isiZulu, isiXhosa and Afrikaans, and there’s of course my research group, where we look into knowledge-to-text generation in African languages.

References

[1] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. in print.

ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages, with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already under-resourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as it “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2)). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations were held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which therewith reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieving redress of the injustices of the past. This route does tick multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?


Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent, the idea of which is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.
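To give an idea of the word-level part of that contrast: English pluralisation can largely be done with string rules on the noun alone, whereas the isiZulu plural requires knowing the noun’s class in order to swap the class prefix. A much simplified Python sketch (the actual algorithms [3] handle many more cases and the full noun class system):

import re

def pluralise_en(noun):
    # toy English rules, in the spirit of the regular-expression approach of [4]
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"
    return noun + "s"

# For isiZulu the suffix is irrelevant: the noun class (a grammatical feature stored
# with the noun) governs the prefix swap. Partial, simplified table for illustration:
PLURAL_PREFIX = {           # singular class: (singular prefix, plural prefix)
    "1a": ("u", "o"),       # e.g., uSolwazi -> oSolwazi 'professor(s)'
    "3":  ("u", "imi"),     # e.g., unyaka -> iminyaka 'year(s)'
}

def pluralise_zu(noun, noun_class):
    sg_prefix, pl_prefix = PLURAL_PREFIX[noun_class]
    return pl_prefix + noun[len(sg_prefix):]

print(pluralise_en("professor"))        # professors
print(pluralise_zu("uSolwazi", "1a"))   # oSolwazi
print(pluralise_zu("unyaka", "3"))      # iminyaka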

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.


Data-driven approaches that use lots of text

The rule-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lots of text are on the rise: they offer the hope of achieving more with less effort, not even having to learn the language, and of easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that a statistics-driven language model learned on old texts like the bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and the probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, as illustrated in Figure 3, rather than a dictionary-based approach, which is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding them a small isiXhosa corpus: they achieved about 80% accuracy even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker
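As a rough illustration of that trigram idea (a much simplified Python sketch; the actual spellchecker [10] uses a language model trained on a large corpus, with tuned thresholds and further refinements):

import math
from collections import Counter

def train_trigrams(words):
    # count character trigrams over word-boundary-padded words
    counts = Counter()
    for w in words:
        padded = "^" + w.lower() + "$"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts, sum(counts.values())

def avg_logprob(word, counts, total, unseen=1e-6):
    # average log-probability of the word's trigrams; a low score suggests a misspelling
    padded = "^" + word.lower() + "$"
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return sum(math.log((counts[t] / total) if counts[t] else unseen) for t in trigrams) / len(trigrams)

# toy corpus of four words; a real model is trained on the isiZulu National Corpus
counts, total = train_trigrams(["iminyaka", "unyaka", "umuntu", "abantu"])
print(avg_logprob("iminyaka", counts, total) > avg_logprob("imnyaka", counts, total))   # True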

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data of parallel sentences in both languages from which it has to learn the patterns, such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments, where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’, came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.


Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft already put resources into the development of HLTs for African languages. Languages, and the identities and cultures intertwined with them, are a national resource, however; hence the need for more research and for the creation of a substantial public good of a wide range of HLTs to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

Bootstrapping a Runyankore CNL from an isiZulu one mostly works well

Earlier this week the 5th Workshop on Controlled Natural Language (CNL’16) was held in Aberdeen, Scotland, where I presented progress made on a Runyankore CNL [1], rather than my student, Joan Byamugisha, who did most of the work on it (she could not attend due to nasty immigration rules by the UK, not a funding issue).

“Runyankore?”, you might ask. It is one of the languages spoken in Uganda. As Runyankore is very under-resourced, any bootstrapping to take a ‘shortcut’ to developing language resources would be welcome. We have a CNL for isiZulu [2], but that is spoken in South Africa, which is a few thousand kilometres further south of Uganda, and it is in a different Guthrie zone of the – in linguistics still so called – Bantu languages, so it was a bit of a gamble to see whether those results could be repurposed for Runyankore. They could, needing only minor changes.

What stayed the same were the variables, or: the components needed to make up a grammatically correct sentence when generating a sentence within the context of OWL axioms (ALC, to be more precise). They are: the noun class of the name of the concept (each noun is assigned a noun class – there are 20 in Runyankore), the category of the concept (e.g., noun, adjective), whether the concept is atomic (a named OWL class) or an OWL class expression, the quantifier used in the axiom, and the position of the concept in the axiom. The only two real differences were that for universal quantification the word for the quantifier is the same when in the singular (cf. isiZulu, where it changes for both singular and plural), and for disjointness there is only one word, ti ‘is not’ (cf. isiZulu’s negative subject concord + pronominal). Two very minor differences are that for existential quantification ‘at least one’, the ‘at least’ is in a different place in the sentence but the ‘one’ behaves exactly the same, and ‘all’ for universal quantification comes after the head noun rather than before (but is also still dependent on the noun class).

It goes without saying that the vocabulary is different, but that is a minor aspect compared to figuring out the surface realisation for an axiom. Where the bootstrapping thus came in handy was that the arduous step of investigating from scratch the natural language grammar involved in verbalising OWL axioms could be skipped, and instead the results for isiZulu could be reused. Yay. This makes it look very promising to port to other languages in the Bantu language family. (Yes, I know, “one swallow does not a summer make” [some Dutch proverb], but one surely is justified in turning up one’s hopes a notch regarding generalisability and transferability of results.)

Joan also conducted a user survey to ascertain which surface realisation was preferred among Runyankore speakers, implemented the algorithms, and devised a new one for the ‘hasX’ naming scheme of OWL object properties (like hasSymptom and hasChild). All these details, as well as the details of the Runyankore CNL and the bootstrapping, are described in the paper [1].


I cannot resist a final comment on all this. There are people who like to pull it down and trivialise natural language interfaces for African languages, on the grounds of “who cares about text in those kinds of countries; we have to accommodate the illiteracy with pictures and icons and speech and such”. People are not as illiterate as is claimed here and there (including by still mentally colonised people from African countries) – if they were, then the likes of Google and Facebook and Microsoft would not invest in localising their interfaces in African languages. The term “illiterate” is used by those people to include also those who do not read/write in English (typically an/the official language of government), even though they can read and write in their local language. People who can read and write – whichever natural language it may be – are not illiterate, neither here in Africa nor anywhere else. English is not the yardstick of (il)literacy, and anyone who thinks it is should think again and reflect a bit on cultural imperialism for starters.


References

[1] Byamugisha, J., Keet, C.M., DeRenzi, B. Bootstrapping a Runyankore CNL from an isiZulu CNL. 5th Workshop on Controlled Natural Language (CNL’16), Springer LNAI vol. 9767, 25-36. 25-27 July 2016, Aberdeen, UK. Springer’s version

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016. DOI: 10.1007/s10579-016-9340-0 (in print) accepted version

72010 SemWebTech lecture 10: SWLS and text processing and ontologies

There is a lot to be said about how Ontology, ontologies, and natural language interact from a philosophical perspective up to the point that different commitments lead to different features and, moreover, limitations of a (Semantic Web) application. In this lecture on 22 Dec, however, we shall focus on the interaction of NLP and ontologies within a bio-domain from an engineering perspective.

During the bottom-up ontology development and methodologies lectures, it was already mentioned that natural language processing (NLP) can be useful for ontology development. In addition, NLP can be used as a component in an ontology-driven information system and an NLP application can be enhanced with an ontology. Which approaches and tools suit best depends on the goal (and background) of its developers and prospective users, ontological commitment, and available resources.

Summarising the possibilities for doing ‘something with natural language text’ and ontologies or ontology-like artifacts, we can:

  • Use ontologies to improve NLP: to enhance precision and recall of queries (including enhancing dialogue systems [1]), to sort results of an information retrieval query to the digital library (e.g. GoPubMed [2]), or to navigate literature (which amounts to linked data [3]).
  • Use NLP to develop ontologies (TBox): mainly to search for candidate terms and relations, which is part of the suite of techniques called ‘ontology learning’ [4].
  • Use NLP to populate ontologies (ABox): e.g., document retrieval enhanced by lexicalised ontologies and biomedical text mining [5].
  • Use it for natural language generation (NLG) from a formal language: this can be done using a template-based approach that works quite well for English but much less so for grammatically richer languages such as Italian [6], or with a full-fledged grammar engine as with Attempto Controlled English and its bi-directional mappings (see [7] for a discussion).

Intuitively, one may be led to think that simply taking the generic NLP or NLG tools will do fine also for the bio(medical) domain. Any application does indeed use those techniques and tools – Paul Buitelaar’s slides have examples and many references to NLP tools – but, generally, they do not suffice to obtain ‘acceptable’ results. Domain-specific peculiarities are many and wide-ranging: for instance, dealing with the variations of terms (scientific name, variant, common misspellings) and the grounding step (linking a term to an entity in a biological database) on the ontology-NLP preparation and instance classification side [5], characterising the question in a question answering system correctly [1], and finding ways to deal with the rather long strings that denote a biological entity or concept or universal [4]. Handling some of these peculiarities actually generates better overall results than in generic or other domain-specific usages of NLP tools, but it requires extra manual preparatory work and a basic understanding of the subject domain and its applications.

References

[1] K. Vila, A. Ferrández. Developing an Ontology for Improving Question Answering in the Agricultural Domain. In: Proceedings of MTSR’09. Springer CCIS 46, 245-256.

[2] Heiko Dietze, Dimitra Alexopoulou, Michael R. Alvers, Liliana Barrio-Alvers, Bill Andreopoulos, Andreas Doms, Joerg Hakenberg, Jan Moennich, Conrad Plake, Andreas Reischuck, Loic Royer, Thomas Waechter, Matthias Zschunke, and Michael Schroeder. GoPubMed: Exploring PubMed with Ontological Background Knowledge. In Stephen A. Krawetz, editor, Bioinformatics for Systems Biology. Humana Press, 2008.

[3] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784] (but see also some comments on the paper)

[4] Dimitra Alexopoulou, Thomas Waechter, Laura Pickersgill, Cecilia Eyre, and Michael Schroeder. Terminologies for text-mining: an experiment in the lipoprotein metabolism domain. BMC Bioinformatics, 9(Suppl4):S2, 2008

[5] Witte, R. Kappler, T. And Baker, C.J.O. Ontology design for biomedical text mining. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, Baker, C.J.O., Cheung, H. (eds), Springer: New York, 2007, pp 281-313.

[6] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[7] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear, G. Hart. A comparison of three controlled natural languages for OWL 1.1. Proc. of OWLED 2008 DC.

Note: references 4 and 5 are mandatory reading, and 1-3 and 6 are optional (recommended for the EMLCT students).

Lecture notes: lecture 10 – Text processing

Course website