“Grammar infused” templates for NLG

It’s hardly ever entirely one extreme or the other in natural language generation and controlled natural languages. Rarely can one get away with simplistic ‘just fill in the blanks’ templates that do not do any grammar or phonological processing to make the output better; our technical report about work done some 17 years ago was a case in point on the limitations thereof if one still needs to be convinced [1]. But where does NLG start? I agree with Ehud Reiter that it isn’t about template versus NLG, but a case of levels of sophistication: the fill-in-the-blank templates definitely don’t count as NLG and full-fledged grammar-only systems definitely do, with anything in-between a grey area. Adding word-level grammatical functions to templates makes it lean to NLG, or even indeed being so if there are relatively many such rules, and dynamically creating nicely readable sentences with aggregation and connectives counts as NLG for sure, too.

With that in mind, we struggled with how to name the beasts we had created for generating sentences in isiZulu [2], a Niger-Congo B language: nearly every resultant word in the generated sentences required a number of grammar rules to make it render sufficiently well (i.e., at least grammatically acceptable and understandable). Since we didn’t have a proper grammar engine yet, but we knew they could never be fill-in-the-blank templates either, we dubbed them verbalisation patterns. Most systems (by number of systems) use either only templates or templates+grammar, so our implemented system [3] was in good company. It may sound like oldskool technology, but you go ask Meta with their Galactica if a ML/DL-based approach is great for generating sensible text that doesn’t hallucinate… and does it well for languages other than English.

That said, honestly, those first attempts we did for isiZulu were not ideal for reusability and maintainability – that was not the focus – and it opened up another can of worms: how do you link templates to (partial) grammar rules? With the ‘partial’ motivated by taking it one step at a time in grammar engine development, as a sort of agile engine development process that is relevant especially for languages that are not well-resourced.

We looked into this recently. There turn out to be three key mechanisms for linking templates to computational grammar rules: embedding (E), where grammar rules are mixed with the templates specifications and therewith co-dependent, and compulsory (C) and partial (P) attachment where there is, or can be, an independent existence of the grammar rules.

Attachment of grammar rules (that can be separated) vs embedding of grammar rules in a system (intertwined with templates) (Source: [6])

The difference between the latter two is subtle but important for use and reuse of grammar rules in the software system and the NLG-ness of it: if each template must use at least one rule from the set of grammar rules and each rule is used somewhere, then the set of rules is compulsorily attached. Conversely, it is partially attached if there are templates in that system that don’t have any grammar rules attached. Whether it is partial because it’s not needed (e.g., the natural language’s grammar is pretty basic) or because the system is on the fill-in-the-blank not-NLG end of the spectrum, is a separate question, but for sure the compulsory one is more on the NLG side of things. Also, a system may use more than one of them in different places; e.g., EC, both embedding and compulsory attachment. This was introduced in [4] in 2019 and expanded upon in a journal article entitled Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation [5] that was published in IJMSO, and even more detail can be found in Zola Mahlaza’s recently completed PhD thesis [6]. These papers have various examples, illustrations how to categorise a system, and why one system was categorised in one way and not another. Here’s a table with several systems that combine templates and computational grammar rules and how they are categorised:

Source: [5]

We needed a short-hand name to refer to the cumbersome and wordy description of ‘combining templates with grammar rules in a [theoretical or implemented] system in some way’, which ended up to be grammar-infused templates.

Why write about this now? Besides certain pandemic-induced priorities in 2021, the recently proposed template language for Abstract Wikipedia that I blogged about before may mix Compulsory or Partial attachment, but ought not to permit the messy embedding of grammar in a template. This may not have been clear in v1 of the proposal, but hopefully it is a little bit more so in this new version that was put online over the past few days. To make that long story short: besides a few notes at the start of its Section 3, there’s a generic description of an idea for a realization algorithm. Its details don’t matter if you don’t intend to design a new realiser from scratch and maybe not either if you want to link it to your existing system. The key take-away from that section is that there’s where the real grammar and phonological conditioning stuff happens if it’s needed. For example, for the ‘age in years’ sub-template for isiZulu, recall that’s:

Year_zu(years):"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounPrefix()}-{nummod:Cardinal(years)}"

The template language sets some boundaries for declaring such a template, but it is a realiser that has to interpret ‘keywords’, such as root, concord, and RelativeConcord, and do something with it so that the output ends up correctly; in this case, from ‘year’ + ‘25’ as input data to iminyaka engama-25 as outputted text. That process might be done in line with Ariel Gutman’s realiser pipeline for Abstract Wikipedia and his proof-of-concept implementation with Scribunto or any other realizer architecture or system, such as Grammatical Framework, SimpleNLG, NinaiUdiron, or Zola’s Nguni Grammar Engine, among several options for multilingual text generation. It might sound silly to put templates on top of the heavy machinery of a grammar engine, but it will make it more accessible to the general public so that they can specify how sentences should be generated. And, hopefully, permit a rules-as-you-go approach as well.

It is then the realiser (including grammar) engine and the partially or compulsorily attached computational grammar rules and other algorithms that work with the template. For the example, when it sees root and that the lemma fetched is a noun (L686326 is unyaka ‘year’), it also fetches the value of the noun class (a grammatical feature stored with the noun), which we always need somewhere for isiZulu NLG. It then needs to figure out to make a plural out of ‘year’, which it will know that it must do thanks to the years fetched for the instance (i.e., 25, which is plural) and the nummod that links to the root by virtue of the design and the assumption there’s a (dependency) grammar. Then, with concord:RelativeConcord, it will fetch the relative concord for that noun class, since concord also links to root. We already can do the concordial agreements and pluralising of nouns (and much more!) for isiZulu since several years. The only hurdle is that that code would need to become interoperable with the template language specification, in that our realisers will have to be able to recognise and process properly those ‘keywords’. Those words are part of an extensible set of words inspired by dependency grammars.

How this is supposed to interact smoothly is to be figured out still. Part of that is touched upon in the section about instrumentalising the template language: you could, for instance, specify it as functions in Wikifunctions that is instantly editable, facilitating an add-rules-as-you-go approach. Or it can be done less flexibly, by mapping or transforming it to another template language or to the specification of an external realiser (since it’s the principle of attachment, not embedding, of computational grammar rules).

In closing, whether the term “grammar-infused templates” will stick remains to be seen, but combining templates with grammars in some way for NLG will have a solid future at least for as long as those ML/DL-based large language model systems keep hallucinating and don’t consider languages other than English, including the intended multilingual setting for Abstract Wikipedia.

References

[1] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157. (accepted version free access)

[3] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64. Portoroz, Slovenia, May 28 – June 2, 2017.

[4] Mahlaza, Z., Keet, C.M. A classification of grammar-infused templates for ontology and model verbalisation. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer vol. CCIS 1057, 64-76. 28-31 Oct 2019, Rome, Italy.

[5] Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

[6] Mahlaza, Z. Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu. PhD Thesis, Department of Computer Science, University of Cape Town, South Africa. 2022.

Advertisement

A proposal for a template language for Abstract Wikipedia

Natural language generation applications have been ‘mainstreaming’ behind the scenes for the last couple of years, from automatically generating text for images, to weather forecasts, summarising news articles, digital assistants that mechanically blurt out text based the structured information they have, and many more. Google, Reuters, BBC, Facebook – they all do it. Wikipedia is working on it as well, principally within the scope of Abstract Wikipedia to try to build a better multilingual Wikipedia [1] to reach more readers better. They all have some source of structured content – like data fetched from a database or spreadsheet, information from, say, a UML class diagram, or knowledge from some knowledge graph or ontology – and a specification as to what the structure of the sentence should be, typically with some grammar rules to at least prettify it, if not also being essential to generate a grammatically correct sentence [2]. That specification is written in templates that are then filled with content.

For instance, a simple rendering of a template may be “Each [C1] [R1] at least one [C2]” or “[I1] is an instance of [C1]”, where the things within the square brackets are variables standing in for content that will be fetched from the source, like a class, relationship, or individual. Linking these to a knowledge graph about universities, it may generate, e.g., “Each academic teaches at least one course” and “Joanne Soap is an instance of Academic”. To get the computer to do this, just “Each [C1] [R1] at least one [C2]” for template won’t do: we need to tell it what the components are so that the program can process it to generate that (pseudo-)natural language sentence.

Many years ago, we did this for multiple languages and used XML to specify the templates for the key aspects of the content. The structured input were conceptual data models in ORM in the DOGMA tool that had that verbalisation component [3]. As example, the template for verbalising a mandatory constraint was as follows:

<Constraint xsi:type="Mandatory">
 <Text> - [Mandatory] Each</Text>
 <Object index="0"/>
 <Text>must</Text>
 <Role index="0"/>
 <Text>at least one</Text>
 <Object index="1"/>
</Constraint>

Besides demarcating the sentence and indicating the constraint, there’s fixed text within the <text> … </text> tags and there’s the variable part with the <Object… that declares that the name of the object type has to be fetched and the <Role… that declares that the name of the relationship has to be fetched from the model (well, more precisely in this care: the reading label), which were elements declared in an XML Schema. With the same example as before, where Academic is in the object index “0” position and Course in the “1” position (see [3] for details), the software would then generate “ – [Mandatory] Each Academic must teaches at least one Course.”

This can be turned up several notches by adding grammatical features to it in order to handle, among others, gender for nouns in German, because they affect the rendering of the ‘each’ and ‘one’ in the sample sentence, not to mention the noun classes of isiZulu and many other languages [4], where even the verb conjugation depends on the noun class of the noun that plays the role of subject in the sentence. Or you could add sentence aggregation to combine two templates into one larger one to generate more flowy text, like a “Joanne Soap is an academic who teaches at least one course”. Or change the application scenario or the machinery for how to deal with the templates. For instance, instead of those variables in the template + code elsewhere that does the content fetching and any linguistic processing, we could put part of that in the template specification. Then there are no variables as such in the template, but functions. The template specification for that same constraint in an ORM diagram might then look like this:

ConstraintIsMandatory {
 “[Mandatory] Each ”
 FetchObjectType(0)
 “ must ”
 MakeInfinitive(FetchRole(0))
 “ at least one ”
 FetchObjectType(1)}

If you want to go with newer technology than markup languages, you may prefer to specify it in JSON. If you’re excited about functional programming languages and see everything through the lens of functions, you even can turn the whole template specification into a bunch of only functions. Either way: there must be a specification of how those templates are permitted to look like, or: what elements can be used to make a valid specification of a template. This so that the software will work properly so that it neither will spit out garbage nor will halt halfway before returning anything. What is permitted in a template language can be specified by means of a model, such as an XML Schema or a DTD, a JSON artefact, or even an ontology [5], a formal definition in some notation of choice, or by defining a grammar (be it a CFG or in BNF notation), and anyhow with enough documentation to figure out what’s going on.

How might this look like in the context of Abstract Wikipedia? For the natural language generation aspects and its first proposal for the realiser architecture, the structured content to be rendered in a natural language sentence is fetched from Wikidata, as is the lexicographic data, and the functions to do the various computations are to come from/go in Wikifunctions. They’re then combined with the templates in various stages in the realiser pipeline to generate those sentences. But there was still a gap as to what those templates in this context may look like. Ariel Gutman, a google.org fellow working on Abstract Wikipedia, and I gave it a try and that proposal for a template language for Abstract Wikipedia is now online accessible for comment, feedback, and, if you happen to speak a grammatically rich language, an option to provide difficult examples so that we can check whether the language is expressive enough.

The proposal is – as any other proposal for a software system – some combination of theoretical foundations, software infrastructure peculiarities, reasoned and arbitrary design decisions, compromises, and time constraints. Here’s a diagram of the key aspects of the syntax, i.e., with the elements, how they relate, and the constraints holding between them, in ORM notation:

An illustrative diagram with the key features of the template language in ORM notation.

There’s also a version in CFG notation, and there are a few examples, each of which shows how the template looks like for verbalising one piece of information (Malala Yousafzai’s age) in Swedish, French, Hebrew, and isiZulu. Swedish is the simplest one, as would English or Dutch be, so let’s begin with that:

Persoon_leeftijd_nl(Entity,Age_in_years): “{Person(Entity) is 
  {Age_in_years} jaar.}”

Where the Person(Entity) fetches the name of the person (that’s identified by an identifier) and the Age_in_years fetches the age. One may like to complicate matters and add a conditional statement, like that any age less than 30 will render that last part not just as jaar ‘year’, but as jaar oud ‘years old’ but jaar jong ‘years young’, but where that dividing line is, is a sensitive topic for some and I will let that rest. In any case, in Dutch, there’s no processing of the number itself to be able to render it in the sentence – 25 renders as 25 – but in other languages there is. For instance, in isiZulu. In that case, instead of a simple fetching of the number, we can put a function in the slot:

Person_AgeYr_zu(Entity,Age_in_years): “{subj:Person(Entity)} 
  {root:subjConcord()}na{Year(Age_in_years).}”

That Year(Age_in_years) is a function that is based on either another function or a sub-template. For instance, it can be defined as follows:

Year_zu(years):"{root:Lexeme(L686326)} 
  {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounPrefix()}-
  {nummod:Cardinal(years)}"

Where Lexeme(L686326) is the word for ‘year’ in isiZulu, unyaka, and for the rest, it first links the age rendering to the ‘year’ with the RelativeConcord() of that word, which practically fetches e- for the ‘years’ (iminyaka, noun class 4),  then gets the copulative (ng in this case), and then the concord for the noun class of the noun of the number. Malala is in her 20s, which is amashumi amabili ..  (noun class 6, which is computed via Cardinal(years)), and thus the function nounPrefix will fetch ama-. So, for Malala’s age data, Year_zu(years) will return iminyaka engama-25. That then gets processed with the rest of the Person_AgeYr_zu template, such as adding an U to the name by subj:Person(Entity), and later steps in the pipeline that take care of things like phonological conditioning (-na- + i- = –ne-), to eventually output UMalala Yousafzai uneminyaka engama-25. In other words: such a template indeed can be specified with the proposed template syntax.

There’s also a section in the proposal about how that template language then connects to the composition syntax so that it can be processed by the Wikifunctions Orchestrator component of the overall architecture. That helps hiding a few complexities from the template declarations, but, yes, someone’s got to write those functions (or take them from existing grammar engines) that will take care of those more or less complicated processing steps. That’s a different problem to solve. You also could link it up with another realiser by means of a transformation the the input type it expects. For now, it’s the syntax of the declarative part for the templates.

If you have any questions or comments or suggestions on that proposal or interesting use cases to test with, please don’t hesitate to add something to the talk page of the proposal, leave a comment here, or contact either Ariel or me directly.

 

References

[1] Vrandečić, D. Building a multilingual Wikipedia. Communications of the ACM, 2021, 64(4), 38-41.

[2] Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

[3] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[4] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157.

[5] Mahlaza, Z., Keet, C. M. ToCT: A Task Ontology to Manage Complex Templates. Proceedings of the Joint Ontology Workshops 2021, FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.

From ontology verbalisation to language learning exercises

I’m aware that to most people ‘playing with’ (investigating) ontologies and isiZulu does not sound particularly useful on the face of it. Yet, there’s the some long-term future music, like eventually being able to generate patient discharge notes in one’s own language, which will do its bit to ameliorate the language barrier in healthcare in South Africa so that patients at least will adhere to the treatment instructions a little better, and therewith receive better quality healthcare. But benefits in the short-term might serve something as well. To that end, I proposed an honours project last year, which has been completed in the meantime, and one of the two interesting outcomes has made it into a publication already [1]. As you may have guessed from the title, it’s about automation for language learning exercises. The results will be presented at the 6th Workshop on Controlled Natural Language, in Maynooth, Ireland in about 2 weeks time (27-28 August). In the remainder of this post, I highlight the main contributions described in the paper.

First, regarding the post’s title, one might wonder what ontology verbalisation has to do with language learning. Nothing, really, except that we could reuse the algorithms from the controlled natural language (CNL) for ontology verbalisation to generate (computer-assisted) language learning exercises whose answers can be computed and marked automatically. That is, the original design of the CNL for things like pluralising nouns, verb conjugation, and negation that is used for verbalising ontologies in isiZulu in theory [2] and in practice [3], was such that the sentence generator is a detachable module that could be plugged in elsewhere for another task that needs such operations.

Practically, the student who designed and developed the back-end, Nikhil Gilbert, preferred Java over Python, so he converted most parts into Java, and added a bit more, notably the ‘singulariser’, a sentence scrabble, and a sentence generator. Regarding the sentence generator, this is used as part of the exercises & answers generator. For instance, we know that humans and the roles they play (father, aunt, doctor, etc.) are mostly in isiZulu’s noun classes 1, 2, 1a, 2a, or 3a, that those classes do not (or rarely?) have non-human nouns and generally it holds for all humans and their roles that they can ‘eat’, ‘talk’ etc. This makes it relatively easy create a noun chain and a verb chain list to mix and match nouns with verbs accordingly (hurrah! for the semantics-based noun class system). Then, with the 231 nouns and 59 verbs in the newly constructed mini-corpus, the noun chain and the verb chain, 39501 unique question sentences could be generated, using the following overall architecture of the system:

Architecture of the CNL-driven CALL system. The arrows indicate which upper layer components make use of the lower layer components. (Source: [1])

From a CNL perspective as well as the language learning perspective, the actual templates for the exercises may be of interest. For instance, when a learner is learning about pluralising nouns and their associated verb, the system uses the following two templates for the questions and answers:

Q: <prefixSG+stem> <SGSC+VerbRoot+FV>
A: <prefixPL+stem> <PLSC+VerbRoot+FV>
Q: <prefixSG+stem> <SGSC+VerbRoot+FV> <prefixSG+stem>
A: <prefixPL+stem> <PLSC+VerbRoot+FV> <prefixPL+stem>

The answers can be generated automatically with the algorithms that generate the plural noun (from ‘prefixSG’ to ‘prefixPL’) and add the plural subject concord (from ‘SGSC’ to ‘PLSC’, in agreement with ‘prefixPL’), which were developed as part of the GeNI project on ontology verbalization. This can then be checked against what the learner has typed. For instance, a generated question could be umfowethu usula inkomishi and the correct answer generated (to check the learner’s response against) is abafowethu basula izinkomishi. Another example is generation of the negation from the positive, or, vv.; e.g.:

Q: <PLSC+VerbRoot+FV>
A: <PLNEGSC+VerbRoot+NEGFV>

For instance, the question may present batotoba and the correct answer is then abatotobi. In total, there are six different types of sentences, with two double, like the plural above, hence a total of 16 templates. It is not a lot, but it turned out it is one of the very few attempts to use a CNL in such way: there is one paper that also will be presented at CNL’18 in the same session [4], and an earlier one [5] uses a fancy grammar system (that we don’t have yet computationally for isiZulu). This is not to be misunderstood as that this is one of the first CNL/NLG-based system for computer-assisted language learning—e.g., there’s assistance in essay writing, grammar concept question generation, reading understanding question generation—but curiously very little on CNLs or NLG for the standard entry-level type of questions to learn the grammar. Perhaps the latter is considered ‘boring’ for English by now, given all the resources. However, thousands of students take introduction courses in isiZulu each year, and some automation can alleviate the pressure of routine activities from the lecturers. We have done some evaluations with learners—with encouraging results—and plan to do some more, so that it may eventually transition to actual use in the courses; that is: TBC…

 

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[3] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E. et al. (eds.). Springer LNCS vol. 10577, 59-64.

[4] Lange, H., Ljunglof, P. Putting control into language learning. 6th International Workshop on Controlled Natural language (CNL’18). IOS Press. Co. Kildare, Ireland, 27-28 August 2018. (in print)

[5] Gardent, C., Perez-Beltrachini, L. Using FB-LTAG Derivation Trees to Generate Transformation-Based Grammar Exercises. Proc. of TAG+11, Sep 2012, Paris, France. pp117-125, 2012.

ICTs for South Africa’s indigenous languages should be a national imperative, too

South Africa has 11 official languages with English as the language of business, as decided during the post-Apartheid negotiations. In practice, that decision has resulted in the other 10 being sidelined, which holds even more so for the nine indigenous languages, as they were already underresourced. This trend runs counter to the citizens’ constitutional rights and the state’s obligations, as she “must take practical and positive measures to elevate the status and advance the use of these languages” (Section 6 (2)). But the obligations go beyond just language promotion. Take, e.g., the right to have access to the public health system: one study showed that only 6% of patient-doctor consultations was held in the patient’s home language[1], with the other 94% essentially not receiving the quality care they deserve due to language barriers[2].

Learning 3-4 languages up to practical multilingualism is obviously a step toward achieving effective communication, which therewith reduces divisions in society, which in turn fosters cohesion-building and inclusion, and may contribute to achieve redress of the injustices of the past. This route does tick multiple boxes of the aims presented in the National Development Plan 2030. How to achieve all that is another matter. Moreover, just learning a language is not enough if there’s no infrastructure to support it. For instance, what’s the point of searching the Web in, say, isiXhosa when there are only a few online documents in isiXhosa and the search engine algorithms can’t process the words properly anyway, hence, not returning the results you’re looking for? Where are the spellcheckers to assist writing emails, school essays, or news articles? Can’t the language barrier in healthcare be bridged by on-the-fly machine translation for any pair of languages, rather than using the Mobile Translate MD system that is based on canned text (i.e., a small set of manually translated sentences)?

 

Rule-based approaches to develop tools

Research is being carried out to devise Human Language Technologies (HLTs) to answer such questions and contribute to realizing those aspects of the NDP. This is not simply a case of copying-and-pasting tools for the more widely-spoken languages. For instance, even just automatically generating the plural noun in isiZulu from a noun in the singular required a new approach that combined syntax (how it is written) with semantics (the meaning) through inclusion of the noun class system in the algorithms[3] [summary]. In contrast, for English, just syntax-based rules can do the job[4] (more precisely: regular expressions in a Perl script). Rule-based approaches are also preferred for morphological analysers for the regional languages[5], which split each word into its constituent parts, and for natural language generation (NLG). An NLG system generates natural language text from structured data, information, or knowledge, such as data in spreadsheets. A simple way of realizing that is to use templates where the software slots in the values given by the data. This is not possible for isiZulu, because the sentence constituents are context-dependent, of which the idea is illustrated in Figure 1[6].

Figure 1. Illustration of a template for the ‘all-some’ axiom type of a logical theory (structured knowledge) and some values that are slotted in, such as Professors, resp. oSolwazi, and eat, resp. adla and zidla; ‘nc’ denotes the noun class of the noun, which governs agreement across related words in a sentence. The four sample sentences in English and isiZulu represent the same information.

Therefore, a grammar engine is needed to generate even the most basic sentences correctly. The core aspects of the workflow in the grammar engine [summary] are presented schematically in Figure 2[7], which is being extended with more precise details of the verbs as a context-free grammar [summary][8]. Such NLG could contribute to, e.g., automatically generating patient discharge notes in one’s own language, text-based weather forecasts, or online language learning exercises.

Figure 2. The isiZulu grammar engine for knowledge-to-text consists conceptually of three components: the verbalisation patterns with their algorithms to generate natural language for a selection of axiom types, a way of representing the knowledge in a structured manner, and the linking of the two to realize the generation of the sentences on-the-fly. It has been implemented in Python and Owlready.

 

Data-driven approaches that use lots of text

The rules-based approach is known to be resource-intensive. Therefore, and in combination with the recent Big Data hype, data-driven approaches with lost of text are on the rise: it offers the hope to achieve more with less effort, not even having to learn the language, and easier bootstrapping of tools for related languages. This can work, provided one has a lot of good quality text (a corpus). Corpora are being developed, such as the isiZulu National Corpus[9], and the recently established South African Centre for Digital Language Resources (SADiLaR) aims to pool the resources. We investigated the effects of a corpus on the quality of an isiZulu spellchecker [summary], which showed that learning the statistics-driven language model on old texts like the bible does not transfer well to modern-day texts such as news items, nor vice versa[10]. The spellchecker has about 90% accuracy in single-word error detection and it seems to contribute to the intellectualisation[11] of isiZulu [summary][12]. Its algorithms use trigrams and probabilities of their occurrence in the corpus to compute the probability that a word is spelled correctly, illustrated in Figure 3, rather than a dictionary-based approach that is impractical for agglutinating languages. The algorithms were reused for isiXhosa simply by feeding it a small isiXhosa corpus: it achieved about 80% accuracy already even without optimisations.

Figure 3. Illustration of the underlying approach of the isiZulu spellchecker

Data-driven approaches are also pursued in information retrieval to, e.g., develop search engines for isiZulu and isiXhosa[13]. Algorithms for data-driven machine translation (MT), on the other hand, can easily be misled by out-of-domain training data of parallel sentences in both languages from which it has to learn the patterns, such as such as concordial agreement like izi- zi- (see Figure 1). In one of our experiments where the MT system learned from software localization texts, an isiXhosa sentence in the context of health care, Le nto ayiqhelekanga kodwa ngokwenene iyenzeka ‘This is not very common, but certainly happens.’ came out as ‘The file is not valid but cannot be deleted.’, which is just wrong. We are currently creating a domain-specific parallel corpus to improve the MT quality that, it is hoped, will eventually replace the afore-mentioned Mobile Translate MD system. It remains to be seen whether such a data-driven MT or an NLG approach, or a combination thereof, may eventually further alleviate the language barriers in healthcare.

 

Because of the ubiquity of ICTs in all of society in South Africa, HLTs for the indigenous languages have become a necessity, be it for human-human or human-computer interaction. Profit-driven multinationals such as Google, Facebook, and Microsoft put resources into development of HLTs for African languages already. Languages, and the identities and cultures intertwined with them, are a national resource, however; hence, suggesting the need for more research and the creation of a substantial public good of a wide range of HLTs to assist people in the use of their language in the digital age and to contribute to effective communication in society.

[1] Levin, M.E. Language as a barrier to care for Xhosa-speaking patients at a South African paediatric teaching hospital. S Afr Med J. 2006 Oct; 96 (10): 1076-9.

[2] Hussey, N. The Language Barrier: The overlooked challenge to equitable health care. SAHR, 2012/13, 189-195.

[3] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16). A. Gelbukh (Ed.). Springer LNCS vol 9623, pp. April 3-9, 2016, Konya, Turkey.

[4] Conway, D.M.: An algorithmic approach to English pluralization. In: Salzenberg, C. (ed.) Proceedings of the Second Annual Perl Conference. O’Reilly (1998), San Jose, USA, 17-20 August, 1998

[5] Pretorius, L. & Bosch, S.E. Enabling computer interaction in the indigenous languages of South Africa: The central role of computational morphology. ACM Interactions, 56 (March + April 2003).

[6] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51(1): 131-157.

[7] Keet, C.M. Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64.

[8] Keet, C.M., Khumalo, L. Grammar rules for the isiZulu complex verb. Southern African Linguistics and Applied Language Studies, 2017, 35(2): 183-200.

[9] L. Khumalo. Advances in Developing corpora in African languages. Kuwala, 2015, 1(2): 21-30.

[10] Ndaba, B., Suleman, H., Keet, C.M., Khumalo, L. The effects of a corpus on isiZulu spellcheckers based on N-grams. In IST-Africa.2016. (May 11-13, 2016). IIMC, Durban, South Africa, 2016, 1-10.

[11] Finlayson, R, Madiba, M. The intellectualization of the indigenous languages of South Africa: Challenges and prospects. Current Issues in Language Planning, 2002, 3(1): 40-61.

[12] Keet, C.M., Khumalo, L. Evaluation of the effects of a spellchecker on the intellectualization of isiZulu. Alternation, 2017, 24(2): 75-97.

[13] Malumba, N., Moukangwe, K., Suleman, H. AfriWeb: A Web Search Engine for a Marginalized Language. Proceedings of 2015 Asian Digital Library Conference, Seoul, South Korea, 9-12 December 2015.

Figuring out the verbalisation of temporal constraints in ontologies and conceptual models

Temporal conceptual models, ontologies, and their logics are nothing new, but that sort of information and knowledge representation still doesn’t gain a lot of traction (cf. say, formal methods for verification). This is in no small part because modelling temporal information is not easy. Several conceptual modelling languages do have various temporal extensions, but most modellers don’t even use all of the default language features yet [1]. How could one at least reduce the barrier to adoption of temporal logics and modelling languages? The two principle approaches are visualisation with a diagrammatic language and rendering it in a (pseudo-)natural language. One of my postgraduate students looked at the former, trying to figure out what would be the best icons and such, which showed there was still a steep learning curve [2]. Before examining whether that could be optimised, I wondered whether the natural language option might be promising. The problem was, that no-one had yet tried to determine what the natural language counterpart of the temporal constraints were supposed to be, let alone whether they be ‘adequate’ or the ‘best’ way of rendering the temporal constraints in tolerable natural language sentences. I wanted to know that badly enough that I tried to find out.

Given that using templates is a tried-and-tested relatively successful approach for atemporal conceptual models and ontologies (e.g., for ORM, the ACE system), it makes sense to do something similar, but then for some temporal extension. As temporal conceptual modelling language I used one that has a Description Logics foundation (DLRUS [3,4]) for that easily links to ontologies as well, added a few known temporal constraints (like for relationships/DL roles, mandatory) and removing others (some didn’t seem all that interesting), which resulted in 34 constraints, still. For each one, I tried to devise more and less reasonable templates, resulting in 101 templates overall. Those templates were evaluated on semantics and preference by three temporal logic experts and five ‘mixed experts’ (experts in natural language generation, logic, or modelling). This resulted in a final set of preferred templates to verbalise the temporal constraints. The remainder of this post first describes a bit about the templates and then the results of which I think they are most interesting.

Templates

The basic idea of a template—in the context of the verbalisation of conceptual models and ontologies—is to have some natural language for the constraint where then the vocabulary gets slotted in at runtime. Take, for instance, simple named class subsumption in an ontology, C \sqsubseteq D, for which one could define a template “Each [C] is a(n) [D]”, so that with some axiom Manager \sqsubseteq Employee, it would generate the sentence “Each Manager is an Employee”. One also could have devised the template “All [C] are [D]” and then it would have generated “All Managers are Employees”. The choice between the two templates in this case is just taste, for in both cases, the semantics is the same. More complex axioms are not always that straightforward. For instance, for the axiom type C \sqsubseteq \exists R.D, would “Each [C] [R] some [D]” be good enough, or would perhaps “Each [C] must [R] at least one [D]” be better? E.g., “Each Professor teaches some Course” vs “Each Professor must teach at least one Course”.

The same can be done for the temporal constraints. To get there, I did a bit of a linguistic detour that informed the template design (described in the paper [5]). Let us take as first example for templates temporal class that has a semantics of o \in C^{\mathcal{I}(t)} \rightarrow \exists t' \neq t. o \notin C^{\mathcal{I}(t')}; for instance, UndergraduateStudent (assuming they graduate and end up as alumni or as drop outs, and weren’t undergrads from birth):

  1. If an object is an instance of entity type [C], then there is some time where it is not a(n) [C].
  2. [C] is an entity type whose objects are, for some time in their existence, not instances of [C].
  3. [C] is an entity type of which each object is not a(n) [C] for some time during its existence.
  4. All instances of entity type [C] are not a(n) [C] for some time.
  5. Each [C] is not a(n) [C] for some time.
  6. Each [C] is for some time not a(n) [C].

Which one(s) do you think captures the semantics, and which one(s) do you prefer?

A more elaborate constraint for relationships is ‘dynamic extension for relationships, past, mandatory], which is formalised as \langle o , o' \rangle \in \mbox{{\sc RDexM}-}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow (\langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow \exists t'<t. \langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t')} where \langle o , o' \rangle \in \mbox{{\sc RDex}}_{R_1,R_2}^{\mathcal{I}(t)} \rightarrow ( \langle o , o' \rangle \in{\tt R_1}^{\mathcal{I}(t)} \rightarrow    \exists t'>t. \langle o , o' \rangle \in {\tt R_2}^{\mathcal{I}(t')}).; e.g., every passenger who boards a flight must have checked in for that flight. Two options could be:

  1. Each ..C_1.. ..R_1.. ..C_2.. was preceded by ..C_1.. ..R_2.. ..C_2.. some time earlier.
  2. Each ..C_1.. ..R_1.. ..C_2.. must be preceded by ..C_1.. ..R_2.. ..C_2.. .

I’m not saying they are all correct; they were some of the options given, which the participants could choose from and comment on. The full list of constraints and template options are available in the supplementary material, which also contains a file where you can fill in your own answers, see what the (anonymised) participants said, and it has the final list of ‘best’ constraints.

Results

The main aggregate quantitative results are shown in the following table.

Many observations can be made from the data (see the paper for details). Some of the salient aspects are that there was low inter-annotator agreement among the experts, despite that they know each other (temporal logics is a small community) and that the ‘mixed group’ deemed many sentences correct that the experts deemed wrong in the sense of not properly capturing the semantics of the constraint. Put differently, it looks like the mixed experts, as a group, did not fully grasp some subtle distinction in the temporal constraints.

With respect to the templates, the preferred ones don’t follow the structure of the logic, but are, in a way, a separate rendering, or: there’s no neat 1:1 mapping between axiom type and template structure. That said, that doesn’t mean that they always chose the shortest template: the experts definitely did not, while the mixed experts leaned a bit toward preferring templates with fewer words even though they were surely not always the semantically correct option.

It may not look good that the experts preferred different templates, but in a follow-up interview with one of the experts, the expert noted that it was not really a problem “for there is the logic that does have the precise meaning anyway” and thus “resolves any confusion that may arise from using slightly different terminology”. The temporal logic expert does have a point from the expert’s view, fair enough, but that pretty much defeats my aim with the experiment. Asking more non-experts may not be a good strategy either, for they are, on average, too lenient.

So, for now, we do have a set of, relatively, ‘best’ templates to verbalise temporal constraints in temporal conceptual models and ontologies. The next step is to compare that with the diagrammatic representation. This we did [6], and I’ll describe those results informally in a next post.

I’ll present more details at the upcoming CREOL: Contextual Representation of Events and Objects in Language Workshop that is part of the Joint Ontology Workshops 2017, which will be held next week (21-23 September) in Bolzano, Italy. As the KRDB group at FUB in Bolzano has a few temporal logic experts, I’m looking forward to the discussions! Also, I’d be happy if you would be willing to fill in the spreadsheet with your preferences (before looking at the answers given by the participants!), and send them to me.

 

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[2] T. Shunmugam. Adoption of a visual model for temporal database representation. M. IT thesis, Department of Computer Science, University of Cape Town, South Africa, 2016.

[3] A. Artale, E. Franconi, F. Wolter, and M. Zakharyaschev. A temporal description logic for reasoning about conceptual schemas and queries. In S. Flesca, S. Greco, N. Leone, and G. Ianni, editors, Proceedings of the 8th Joint European Conference on Logics in Artificial Intelligence (JELIA-02), volume 2424 of LNAI, pages 98-110. Springer Verlag, 2002.

[4] A. Artale, C. Parent, and S. Spaccapietra. Evolving objects in temporal information systems. Annals of Mathematics and Artificial Intelligence, 50(1-2):5-38, 2007.

[5] Keet, C.M. Natural language template selection for temporal constraints. CREOL: Contextual Representation of Events and Objects in Language, Joint Ontology Workshops 2017, 21-23 September 2017, Bolzano, Italy. CEUR-WS Vol. (in print).

[6] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Springer LNCS. 6-9 Nov 2017, Valencia, Spain. (in print)

Bootstrapping a Runyankore CNL from an isiZulu one mostly works well

Earlier this week the 5th Workshop on Controlled Natural Language (CNL’16) was held in Aberdeen, Scotland, where I presented progress made on a Runyankore CNL [1], rather than my student, Joan Byamugisha, who did most of the work on it (she could not attend due to nasty immigration rules by the UK, not a funding issue).

“Runyankore?”, you might ask. It is one of the languages spoken in Uganda. As Runyankore is very under-resourced, any bootstrapping to take a ‘shortcut’ to develop language resources would be welcome. We have a CNL for isiZulu [2], but that is spoken in South Africa, which is a few thousand kilometres further south of Uganda, and it is in a different Guthrie zone of the—in linguistics still called—Bantu languages, so it was a bit of a gamble to see whether those results could be repurposed for Runynakore. They could, needing only minor changes.

What stayed the same were the variables, or: components to make up a grammatically correct sentence when generating a sentence within the context of OWL axioms (ALC, to be more precise). They are: the noun class of the name of the concept (each noun is assigned a noun class—there are 20 in Runyankore), the category of the concept (e.g., noun, adjective), whether the concept is atomic (named OWL class) or an OWL class expression, the quantifier used in the axiom, and the position of the concept in the axiom. The only two real differences were that for universal quantification the word for the quantifier is the same when in the singular (cf. isiZulu, where it changes for both singular or plural), and for disjointness there is only one word, ti ‘is not’ (cf. isiZulu’s negative subject concord + pronomial). Two very minor differences are that for existential quantification ‘at least one’, the ‘at least’ is in a different place in the sentence but the ‘one’ behaves exactly the same, and ‘all’ for universal quantification comes after the head noun rather than before (but is also still dependent on the noun class).

It goes without saying that the vocabulary is different, but that is a minor aspect compared to figuring out the surface realisation for an axiom. Where the bootstrapping thus came in handy was that that arduous step of investigating from scratch the natural language grammar involved in verbalising OWL axioms could be skipped and instead the ones for isiZulu could be reused. Yay. This makes it look very promising to port to other languages in the Bantu language family. (yes, I know, “one swallow does not a summer make” [some Dutch proverb], but one surely is justified to turn up one’s hope a notch regarding generalizability and transferability of results.)

Joan also conducted a user survey to ascertain which surface realisation was preferred among Runyankore speakers, implemented the algorithms, and devised a new one for the ‘hasX’ naming scheme of OWL object properties (like hasSymptom and hasChild). All these details, as well as the details of the Runyankore CNL and the bootstrapping, are described in the paper [1].

 

I cannot resist a final comment on all this. There are people who like to pull it down and trivialise natural language interfaces for African languages, on the grounds of “who cares about text in those kind of countries; we have to accommodate the illiteracy with pictures and icons and speech and such”. People are not as illiterate as is claimed here and there (including by still mentally colonised people from African countries)—if they were, then the likes of Google and Facebook and Microsoft would not invest in localising their interfaces in African languages. The term “illiterate” is used by those people to include also those who do not read/write in English (typically an/the official language of government), even though they can read and write in their local language. People who can read and write—whichever natural language it may be—are not illiterate, neither here in Africa nor anywhere else. English is not the yardstick of (il)literacy, and anyone who thinks it is should think again and reflect a bit on cultural imperialism for starters.

 

References

[1] Byamugisha, J., Keet, C.M., DeRenzi, B. Bootstrapping a Runyankore CNL from an isiZulu CNL. 5th Workshop on Controlled Natural Language (CNL’16), Springer LNAI vol. 9767, 25-36. 25-27 July 2016, Aberdeen, UK. Springer’s version

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016. DOI: 10.1007/s10579-016-9340-0 (in print) accepted version

More results on a CNL for isiZulu

Although it has been a bit quiet here on the controlled natural languages for isiZulu front, lots of new stuff is in the pipeline, and the substantially extended version of our CNL14 and RuleML14 papers [1,2] is in print for publication in the Language Resources and Evaluation journal: Toward a knowledge-to-text controlled natural language of isiZulu [1] (online at LRE as well).

For those who haven’t read the other blog post or the papers on the topic, a brief introduction: for a plethora of reasons, one would want to generate natural language sentences based on some data, information, or knowledge stored on the computer. For instance, to generate automatically weather reports in isiZulu or to browse or query ‘intelligently’ online annotated newspaper text that is guided by an ontology behind-the-scenes in the inner workings of the interface. This means ‘converting’ structured input into structured natural language sentences, which amounts to a Controlled Natural Language (CNL) that is a fragment of the full natural language. For instance, class subsumption in DL (“\sqsubseteq “) is verbalised in English as ‘is a/an’. In isiZulu, it is y- or ng- depending on the first character of the name of the superclass. So, in its simplest form, indlovu \sqsubseteq isilwane (that is, elephant \sqsubseteq animal in an ‘English ontology’) would, with the appropriate algorithm, generate the sentence (be verbalized as) indlovu yisilwane (‘elephant is an animal’).

In the CNL14 and RuleML14 papers, we looked into what could be the verbalisation patterns for subsumption, disjointness, conjunction, and simple existential quantification, we evaluated which ones were preferred, and we designed algorithms for them, as none of them could be done with a template. The paper in the LRE journal extends those works with, mainly: a treatment of verbs (OWL object properties) and their conjugation, updated/extended algorithms to deal with that, design considerations for those algorithms, walk-throughs of the algorithms, and an exploratory evaluation to assess the correctness of the algorithm (is the sentence generated [un]grammatical and [un]ambiguous?). There’s also a longer discussion section and more related works.

Conjugation of the verb in isiZulu is not as trivial as in English, where, for verbalizing knowledge represented in ontologies, one simply uses the 3rd person singular (e.g., ‘eats’) or plural (‘eat’) anywhere it appears in an axiom. In isiZulu, it is conjugated based on the noun class of the noun to which it applies. There are 17 noun classes. For instance, umuntu ‘human’ is in noun class 1, and indlovu in noun class 9. Then, when a human eats something, it is umuntu udla whereas with the elephant, it is indlovu idla. Negating it is not simply putting a ‘not’ or ‘does not’ in front of it, as is the case in English (‘does not eat’), but it has its own conjugation (called negative subject concord) again for each noun class, and modifying the final vowel; the human not eating something then becomes umuntu akadli and for the elephant indovu ayidli. This is now precisely captured in the verbalization patterns and algorithms.

Though a bit tedious and not an easy ride compared to a template-based approach, but surely doable to put in an algorithm. Meanwhile, I did implement the algorithms. I’ll readily admit it’s a scruffy Python file and you’ll have to type the function in the interpreter rather than having it already linked to an ontology, but it works, and that’s what counts. (see that flag put in the sand? 😉 ) Here’s a screenshot with a few examples, just to show that it does what it should do.

Screenshot showing the working functions for verbalising subsumption, disjointness, universal quantificaiton, existential quantification and its negation, and conjunction.

Screenshot showing the working functions for verbalising subsumption, disjointness, universal quantificaiton, existential quantification and its negation, and conjunction.

The code and other files are available from the GeNi project page. The description of the implementation, and the refinements we made along the way in doing so (e.g., filling in that ‘pluralise it’ of the algorithm), is not part of the LRE article, for we were already pushing it beyond the page limit, so I’ll describe that in a later post.

 

References

[1] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), Davis, B, Kuhn, T, Kaljurand, K. (Eds.). Springer LNAI vol. 8625, 78-89. 20-22 August 2014, Galway, Ireland.

[2] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS vol. 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[3] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016: in print. DOI: 10.1007/s10579-016-9340-0

Enjoyable and interesting controlled natural languages workshop (CNL’14)

Conferencing in Ireland was a good experience again. Like EKAW 2012, the Fourth Workshop on Controlled Natural Language (CNL’14) was held in the Aula Maxima at the University of Galway, a beautiful ivy-covered building conducive of a stimulating scientific atmosphere and, as any good event, one leaves with plenty of ideas to pursue, and it was a good ambience to meet up again with colleagues as well as meeting new ones, such as Allan Third of the SWAT natural language tools that we use in the ROMULUS foundational ontology library. The remainder of this post is a quick write-up about several of the papers and presentations, written during an otherwise lost moment at Dublin airport.

If you’re not too familiar with CNLs, a useful brief overview to start with is Safwat and Davis’ state of the art [1]. However, some of you might first prefer to read something that is one of the answers to “what would it be good for?”; in that case, I can highly recommend the paper on automatically generating the Swiss avalanche bulleting in 4 languages [2], presented by Kurt Winkler: not only their participants found it very difficult to figure out which ones were manually generated and which ones automatically (55% correct, on average), but also the CNL attendees had trouble with ‘guessing’ it right (yeah, including me). From a technical perspective, it uses a catalogue-based translation system with chunks of text segments. Rather more theoretical were the two papers on the Grammatical Framework. The first one was the invited talk by Aarne Ranta [3] about embedded controlled languages. He provided a brief overview of GF (which started in 1998 at Xerox in Grenoble) up to the current state in the EU project Molto for multilingual machine translation, and different levels of quality of the generated text. Inari Listenmaa presented an extension to the system so that GF will be able to handle compositionality [4].

Interesting to me was the question whether CNLs exist for generating text about temporal events, in part because I’ve another strand of research on temporal conceptual data modelling. Not everyone agreed whether there was anything other than simple stories, but it was hard to find much about it (if you do or know of it, please leave a pointer in the comments). Gordon Pace presented results on verbalizing finite state machines (events with properties), in particular violation traces through the FSM [5]; e.g., when one has a process for logins and failed logins that is violated, the sysadmin needs to know what has happened, and ideally be informed about what essentially went wrong in an intelligible way, and summarized rather than having to pour over endless logs.

On the multilingual front for less common languages, there were two papers for Latvian involving FrameNet for their controlled natural language [6,7], and Langa Khumalo presented our joint paper about isiZulu natural language generation [8] about which I blogged earlier.

Last, but not least—and, more precisely: first—the best paper award. It was awarded to two papers, being to the paper on technical text authoring by Juyeon Kang and Patrick Saint-Dizier [9] and to the paper on style guides as controlled languages, by Karolina Suchowolec [10].

The next CNL workshop will be held in about 2 years time, also most likely co-located with a larger conference (now it was co-located with COLING in Dublin), and some other activities are also in the pipeline, such as a mailing list, wiki etc. so it will be easier for people to stay tuned with the latest developments in CNLs. I’m already looking forward to the next installment of the event.

References

Note: all links are to the CRCs posted on arxiv; the final versions formatted by Springer are on the Springer site (behind a paywall for most people).

[1] Hazem Safwat and Brian Davis. A Brief State of the Art of CNLs for Ontology Authoring. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 190-200. 20-22 Aug, 2014, Galway, Ireland.

[2] Kurt Winkler, Tobias Kuhn and Martin Volk. Evaluating the fully automatic multi-language translation of the Swiss avalanche bulletin. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 44-54. 20-22 Aug, 2014, Galway, Ireland.

[3] Aarne Ranta. Embedded Controlled Languages. (invited paper). Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 1-7. 20-22 Aug, 2014, Galway, Ireland.

[4] Ramona Enache, Inari Listenmaa and Prasanth Kolachina. Handling non-compositionality in multilingual CNLs. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 147-154. 20-22 Aug, 2014, Galway, Ireland.

[5] Gordon Pace and Michael Rosner. Explaining Violation Traces with Finite State Natural Language Generation Models. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 179-189. 20-22 Aug, 2014, Galway, Ireland.

[6] Guntis Barzdins. FrameNet CNL: a Knowledge Representation and Information Extraction Language. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 90-101. 20-22 Aug, 2014, Galway, Ireland.

[7] Dana Dannells and Normunds Gruzitis. Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 155-166. 20-22 Aug, 2014, Galway, Ireland.

[8] C. Maria Keet and Langa Khumalo. Toward verbalizing ontologies in isiZulu. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 78-89. 20-22 Aug, 2014, Galway, Ireland.

[9] Juyeon Kang and Patrick Saint-Dizier. Towards an Error Correction Memory to Enhance Technical Texts Authoring in LELIE. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 55-65. 20-22 Aug, 2014, Galway, Ireland.

[10] Karolina Suchowolec. Are Style Guides Controlled Languages? The Case of Koenig & Bauer AG. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 112-122. 20-22 Aug, 2014, Galway, Ireland.

First steps for isiZulu natural language generation

Yes, Google Translate English-isiZulu does exist, but it has many errors (some very funny) and there’s a lot more to Natural Language Generation (NLG) than machine translation, such as natural language-based query interfaces that has some AI behind it, and they are needed, too [1]. Why should one bother with isiZulu? Muendane has his lucid opinions about that [2], and in addition to that, it is the first language of about 23% of the population of South Africa (amounting to some 10 million people), about half can speak it, and it is a Bantu language, which is spoken by nearly 300 million people—what works for isiZulu grammar may well be transferrable to its related languages. Moreover, it being in a different language family than the more well-resourced languages, it can uncover some new problems to solve for NLG, and facilitate access to online information without the hurdle of having to learn English or French first, as is the case now in Sub-Saharan Africa.

The three principal approaches for NLG are canned text, templates, and grammars. While I knew from previous efforts [3] that the template-based approach is very well doable but has its limitations, and knowing some basic isiZulu, I guessed it might not work with the template-based approach but appealing if it would (for a range of reasons), that no single template could be identified so far was the other end of the spectrum. Put differently: we had to make a start with something resembling the foundations of a grammar engine.

Langa Khumalo, with the Linguistics program and director of the University Language Planning and Development Office at the University of KwaZulu-Natal, and I have been trying to come up with isiZulu NLG. We have patterns and algorithms for (‘simple’) universal and existential quantification, subsumption, negation (class disjointness), and conjunction; or: roughly OWL 2 EL and a restricted version of ALC. OWL 2 EL fist neatly with SNOMED CT, and therewith has the potential for interactive healthcare applications with the isiZulu healthcare terminologies that are being developed at UKZN.

The first results on isiZulu NLG are described in [4,5], which was not an act of salami-slicing, but we had more results than that fitted in a single paper. The first paper [4] will appear in the proceedings ofthe 4th workshop on Controlled Natural language (CNL’14), and is about finding those patterns and, for the options available, an attempt at figuring out which one would be best. The second paper [5], which will appear in the 8th International Web Rule Symposium (RuleML’14) conference proceedings, is more about devising the algorithms to make it work and how to actually generate those sentences. Langa and I plan to attend both events, so you can ask us about the details either in Prague (18-20 Aug) or Galway (20-22 Aug) in person. In the meantime, the CRCs of the papers are online (here and here).

Regarding the technical aspects, the main reasons why we cannot get away with devising templates to generate isiZulu controlled natural language is that isiZulu is non-trivial:

  • There is a whole system of noun classes: nouns are grouped in one of the 17 noun classes, each with their own peculiarities, which is illustrated in Figure 1, below;
  • Agglutination, informally: putting lots of bits and pieces together to make a word. A selection of those so-called ‘concords’ is included in Figure 2, below;
  • Phonological conditioned copulatives, meaning that the ‘is a’ depends on the term that comes after it (ng or y); and
  • Complex verb conjugation.

isiZulu noun classes with an example (source: [5]).

isiZulu noun classes with an example (source: [5]).

 

A selection of isiZulu concords (source: [5])

A selection of isiZulu concords (source: [5])

What does this mean for the verbalization? In English, we use ‘Each…’ or ‘For all…’ for the universal quantifier and it doesn’t matter over which noun it is quantified. In isiZulu, it does. Each noun class has its own ‘each’ and ‘for all’, and it is not acceptable (understandable) to use one for the wrong noun class. For disjointness, like “Cup is not a Glass” ({\sf Cup \sqsubseteq \neg Glass} in DL), in English we have the ‘is not a’ regardless what comes before or after the subsumption+negation, but in isiZulu, the copulative is omitted, the first noun (OWL class, if you will) brings in a so-called negative subject concord, the second noun brings in a pronominal, and they are glued together (e.g., Indebe akuyona Ingilazi, where the second word is composed of aku + yona), and to top it off, each noun class has its own concord and pronomial. A seemingly simple conjunction—just an ‘and’ in English—has to be divided into an and-when-it-is-used-in-an-enumeration and an and-when-it-is-a-connective, and when it is used in an enumeration, it depends on the first letter of the noun that comes after the ‘and’. Existential quantification is even more of a hassle. The table below shows a very brief summary comparing typical patterns in English with those for isiZulu.

A few DL symbols, their typical verbalization options in English, and an indication of possible patterns (source: [4])

A few DL symbols, their typical verbalization options in English, and an indication of possible patterns (source: [4])

We did ask isiZulu speakers which of the possible options they preferred (in a survey, with Limesurvey localized to isiZulu), but there wasn’t an overwhelming consistent agreement among them except for one of the options for existential quantification (the –dwa option), although there was more agreement among the linguists than among the non-linguists, possibly due to dialect influences (results can be found in [4]).

If you don’t feel like reading the two papers, but still would like to have some general overview and examples, you also can check out the slides of the CS colloquium I gave last week. I managed to ‘lure in’ also ICT4D people—and then smack them with a bit of logic and algorithms—but the other option, being talking about the other paper accepted at RuleML, probably would have had to be a ‘cookie colloquium’ to get anyone to attend (more about that paper in another post—it is fascinating, but possibly of less interest to a broader audience). If you want to skip the tedious bits and just get a feel of how one of the algorithms works out: check out the example starting on slide 63, which shows the steps to go from {\sf \forall x (uSolwazi(x) \rightarrow \exists y (ufundisa(x, y) \land Isifundo(y)))} in FOL, or {\sf uSolwazi \sqsubseteq \exists ufundisa.Isifundo} in DL (“Each professor teaches at least one course”, if the vocabulary were in English), to “Bonke oSolwazi bafundisa isifundo esisodwa”.

Clearly, a lot remains to be done.

References

[1] Alberts, R., Fogwill, T., Keet, C.M. Several Required OWL Features for Indigenous Knowledge Management Systems. 7th Workshop on OWL: Experiences and Directions (OWLED’12). 27-28 May, Heraklion, Crete, Greece. CEUR-WS Vol-849. 12p

[2] Muendane, N.M. I am an African. 2006, Soultalk CC.

[3] Jarrar, M., Keet, C.M., Dongilli, P. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[4] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), 20-22 August 2014, Galway, Ireland. Springer LNAI. (in press)

[5] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), August 18-20, 2014, Prague, Czech Republic. Springer LNCS (in press).