“Grammar-infused” templates for NLG

It’s hardly ever entirely one extreme or the other in natural language generation and controlled natural languages. Rarely can one get away with simplistic ‘just fill in the blanks’ templates that do not do any grammar or phonological processing to make the output better; our technical report about work done some 17 years ago was a case in point on those limitations, if one still needs to be convinced [1]. But where does NLG start? I agree with Ehud Reiter that it isn’t a matter of template versus NLG, but of levels of sophistication: fill-in-the-blank templates definitely don’t count as NLG and full-fledged grammar-only systems definitely do, with anything in between a grey area. Adding word-level grammatical functions to templates makes them lean toward NLG, or even be NLG if there are relatively many such rules, and dynamically creating nicely readable sentences with aggregation and connectives certainly counts as NLG, too.

With that in mind, we struggled with how to name the beasts we had created for generating sentences in isiZulu [2], a Niger-Congo B language: nearly every resultant word in the generated sentences required a number of grammar rules to make it render sufficiently well (i.e., at least grammatically acceptable and understandable). Since we didn’t have a proper grammar engine yet, but knew they could never be fill-in-the-blank templates either, we dubbed them verbalisation patterns. Most systems (by number of systems) use either only templates or templates+grammar, so our implemented system [3] was in good company. It may sound like oldskool technology, but go ask Meta with their Galactica whether an ML/DL-based approach is great for generating sensible text that doesn’t hallucinate… and does it well for languages other than English.

That said, honestly, those first attempts we did for isiZulu were not ideal for reusability and maintainability – that was not the focus – and they opened up another can of worms: how do you link templates to (partial) grammar rules? The ‘partial’ is motivated by taking it one step at a time in grammar engine development, as a sort of agile engine development process that is relevant especially for languages that are not well-resourced.

We looked into this recently. There turn out to be three key mechanisms for linking templates to computational grammar rules: embedding (E), where grammar rules are mixed in with the template specifications and therewith co-dependent, and compulsory (C) and partial (P) attachment, where the grammar rules have, or can have, an independent existence.

Attachment of grammar rules (that can be separated) vs embedding of grammar rules in a system (intertwined with templates) (Source: [6])

The difference between the latter two is subtle but important for the use and reuse of grammar rules in the software system and for the NLG-ness of it: if each template must use at least one rule from the set of grammar rules and each rule is used somewhere, then the set of rules is compulsorily attached. Conversely, it is partially attached if there are templates in that system that don’t have any grammar rules attached. Whether it is partial because no rules are needed (e.g., the natural language’s grammar is pretty basic) or because the system is on the fill-in-the-blank, not-NLG, end of the spectrum is a separate question, but the compulsory one certainly is more on the NLG side of things. Also, a system may use more than one mechanism in different places; e.g., EC: both embedding and compulsory attachment. This was introduced in [4] in 2019 and expanded upon in a journal article entitled Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation [5] that was published in IJMSO, and even more detail can be found in Zola Mahlaza’s recently completed PhD thesis [6]. These papers have various examples, illustrations of how to categorise a system, and why one system was categorised in one way and not another. Here’s a table with several systems that combine templates and computational grammar rules and how they are categorised:

Source: [5]

We needed a short-hand name for the cumbersome and wordy description of ‘combining templates with grammar rules in a [theoretical or implemented] system in some way’, which ended up being grammar-infused templates.
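To make the distinction between C and P a little more tangible: given a specification of which (separately existing) grammar rules each template uses, one can decide mechanically between the two, whereas embedding can’t be detected this way, since embedded rules live inside the template specifications themselves. A minimal sketch in Python, with made-up names:

```python
# Toy classifier for the attachment mechanisms (C vs P); all names are made up.
# A system is modelled as a mapping from each template to the set of
# (separately existing) grammar rules it uses.

def classify_attachment(templates: dict[str, set[str]], all_rules: set[str]) -> str:
    used_rules = set().union(*templates.values()) if templates else set()
    every_template_has_a_rule = all(rules for rules in templates.values())
    every_rule_is_used = all_rules <= used_rules
    if every_template_has_a_rule and every_rule_is_used:
        return "compulsory attachment (C)"
    return "partial attachment (P)"

# E.g., a system where one template is a plain fill-in-the-blank:
templates = {
    "age_in_years": {"RelativeConcord", "NounPrefix"},
    "greeting": set(),  # no grammar rules attached
}
print(classify_attachment(templates, {"RelativeConcord", "NounPrefix"}))
# -> partial attachment (P)
```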

Why write about this now? Besides certain pandemic-induced priorities in 2021, the recently proposed template language for Abstract Wikipedia that I blogged about before may mix compulsory or partial attachment, but ought not to permit the messy embedding of grammar in a template. This may not have been clear in v1 of the proposal, but hopefully it is a little more so in the new version that was put online over the past few days. To make that long story short: besides a few notes at the start of its Section 3, there’s a generic description of an idea for a realization algorithm. Its details don’t matter if you don’t intend to design a new realiser from scratch, and maybe not either if you want to link it to your existing system. The key take-away from that section is that that is where the real grammar and phonological conditioning happens, if it’s needed. For example, for the ‘age in years’ sub-template for isiZulu, recall that’s:

Year_zu(years):"{root:Lexeme(L686326)} {concord:RelativeConcord()}{Copula()}{concord_1<nummod:NounPrefix()}-{nummod:Cardinal(years)}"

The template language sets some boundaries for declaring such a template, but it is a realiser that has to interpret the ‘keywords’, such as root, concord, and RelativeConcord, and do something with them so that the output ends up correct; in this case, going from ‘year’ + ‘25’ as input data to iminyaka engama-25 as output text. That process might be done in line with Ariel Gutman’s realiser pipeline for Abstract Wikipedia and his proof-of-concept implementation with Scribunto, or with any other realiser architecture or system, such as Grammatical Framework, SimpleNLG, NinaiUdiron, or Zola’s Nguni Grammar Engine, among several options for multilingual text generation. It might sound silly to put templates on top of the heavy machinery of a grammar engine, but it will make that machinery more accessible to the general public, so that they can specify how sentences should be generated. And, hopefully, it will permit a rules-as-you-go approach as well.

It is then the realiser (including grammar) engine, with the partially or compulsorily attached computational grammar rules and other algorithms, that works with the template. For the example: when the realiser sees root and sees that the fetched lemma is a noun (L686326 is unyaka ‘year’), it also fetches the value of the noun class (a grammatical feature stored with the noun), which we always need somewhere for isiZulu NLG. It then needs to figure out that it must make a plural out of ‘year’, which it knows thanks to the years value fetched for the instance (i.e., 25, which is plural) and the nummod that links to the root by virtue of the design and the assumption that there’s a (dependency) grammar. Then, with concord:RelativeConcord, it fetches the relative concord for that noun class, since concord also links to root. We have been able to do the concordial agreements and pluralising of nouns (and much more!) for isiZulu for several years already. The only hurdle is that that code will need to become interoperable with the template language specification, in that our realisers will have to be able to recognise and properly process those ‘keywords’, which are part of an extensible set of words inspired by dependency grammars.
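By way of illustration only, a toy realiser fragment for just this one template could look as follows, where the hardcoded mini ‘lexicon’ and concord fragments are simplified stand-ins for what a real grammar engine fetches and computes:

```python
# Toy realisation of the 'age in years' template for isiZulu; illustration only.
# The mini 'lexicon' and rule tables below are hardcoded, simplified stand-ins
# for the real lexeme fetching and grammar rules.

LEXEMES = {"L686326": {"lemma": "unyaka", "noun_class": 3,
                       "plural": "iminyaka", "plural_class": 4}}
RELATIVE_CONCORD = {4: "e"}          # relative concord by noun class (fragment)
COPULA = "ng"
NOUN_PREFIX_OF_NUMBER = {25: "ama"}  # noun prefix governing the cardinal (fragment)

def year_zu(years: int) -> str:
    lex = LEXEMES["L686326"]
    # root: pluralise if the cardinal is plural, and track the noun class
    root = lex["plural"] if years > 1 else lex["lemma"]
    nc = lex["plural_class"] if years > 1 else lex["noun_class"]
    # concord:RelativeConcord() and Copula() agree with the root's noun class
    concord = RELATIVE_CONCORD[nc] + COPULA
    # concord_1<nummod:NounPrefix() is the prefix for the cardinal
    prefix = NOUN_PREFIX_OF_NUMBER[years]
    return f"{root} {concord}{prefix}-{years}"

print(year_zu(25))  # -> iminyaka engama-25
```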

How this is all supposed to interact smoothly is still to be figured out. Part of that is touched upon in the section about instrumentalising the template language: one could, for instance, specify it as functions in Wikifunctions that are instantly editable, facilitating an add-rules-as-you-go approach. Or it can be done less flexibly, by mapping or transforming it to another template language or to the specification of an external realiser (since it’s the principle of attachment, not embedding, of computational grammar rules).

In closing, whether the term “grammar-infused templates” will stick remains to be seen, but combining templates with grammars in some way for NLG will have a solid future at least for as long as those ML/DL-based large language model systems keep hallucinating and don’t consider languages other than English, including the intended multilingual setting for Abstract Wikipedia.

References

[1] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussel, Belgium. February 2006.

[2] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2017, 51:131-157. (accepted version free access)

[3] Keet, C.M., Xakaza, M., Khumalo, L. Verbalising OWL ontologies in isiZulu with Python. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 59-64. Portoroz, Slovenia, May 28 – June 2, 2017.

[4] Mahlaza, Z., Keet, C.M. A classification of grammar-infused templates for ontology and model verbalisation. 13th Metadata and Semantics Research Conference (MTSR’19). E. Garoufallou et al. (Eds.). Springer vol. CCIS 1057, 64-76. 28-31 Oct 2019, Rome, Italy.

[5] Mahlaza, Z., Keet, C.M. Formalisation and classification of grammar and template-mediated techniques to model and ontology verbalisation. International Journal of Metadata, Semantics and Ontologies, 2020, 14(3): 249-262.

[6] Mahlaza, Z. Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu. PhD Thesis, Department of Computer Science, University of Cape Town, South Africa. 2022.

A review of NLG realizers and a new architecture

That last step in the process of generating text from some structured representation of data, information or knowledge is done by things called surface realizers. They take care of the ‘finishing touches’ – syntax, morphology, and orthography – to make good natural language sentences out of an ontology, conceptual data model, or Wikidata data, among many possible sources that can be used for declaring abstract representations. Besides theories, there are also many tools that try to get that working at least to some extent. Which ways, or system architectures, are available for generating the text? Which components do they all, or at least most of them, have? Where are the differences and how do they matter? Will they work for African languages? And if not, then what?

My soon-to-graduate PhD student Zola Mahlaza and I set out to answer these questions, and more, and the outcome is described in the article Surface realization architecture for low-resourced African languages that was recently accepted and is now in print with the ACM Transactions on Asian and Low-Resource Language Information Processing (ACM TALLIP) journal [1].

Zola examined 77 systems, which exhibited some 13 different principal architectures that could be classified into 6 distinct architecture categories. Purely by number of systems, manually coded and rule-based ones are the most popular, but there are a few hybrid and data-driven systems as well. A consensus architecture for realisers there is not. And none exhibits most of the software maintainability characteristics, like the modularity, reusability, and analysability that we need for African languages (even more so than for better-resourced languages). ‘African’ is narrowed down further in the paper to the languages in the Niger-Congo B (‘Bantu’) family. One of the tricky things is that there’s a lot going on at the sub-word level in these languages, whereas practically all extant realisers operate at the word level.
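To give a flavour of the sub-word issue: a single word in an Nguni language is typically composed of several morphemes that each require their own rule, so a realiser has to compose strings below the word level. A much-simplified sketch with a fragment of the isiZulu subject concords (real rules involve more slots and phonological conditioning):

```python
# Why word-level templates fall short for Nguni languages: even a simple
# present-tense verb is built from multiple morphemes. Simplified fragment.

SUBJECT_CONCORD = {1: "u", 2: "ba"}  # subject concord by noun class (fragment)

def present_tense_verb(noun_class: int, verb_root: str) -> str:
    # subject concord + verb root + final vowel, composed at sub-word level
    return SUBJECT_CONCORD[noun_class] + verb_root + "a"

print(present_tense_verb(2, "cul"))  # -> bacula ('they sing')
```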

Hence, the next step was to create a new surface realizer architecture that is suitable for low-resourced African languages and that is maintainable. Perhaps unsurprisingly, since the paper is in print, this new architecture compares favourably against the required features. The new architecture also has ‘bonus’ features, like being guided by an ontology with a template ontology [2] for verification and interoperability. All its components and the rationale for putting it together this way are described in Section 5 of the article and the maintainability claims are discussed in its Section 6.

Source: [1]

There’s also a brief illustration of how one can redesign a realiser into the proposed architecture. As a use case, we redesigned the architecture of OWLSIZ for question generation in isiZulu [3]. The code of that redesign of OWLSIZ is available, i.e., it’s not merely a case of having drawn a different diagram; it was proof-of-concept tested that it can be done.

While I obviously know what’s going on in the article, if you’d like to know many more details than are described there, I suggest you consult Zola as the main author of the article, or his (soon to be available online) PhD thesis [4], which devotes roughly a chapter to this topic.

References

[1] Mahlaza, Z., Keet, C.M. Surface realisation architecture for low-resourced African languages. ACM Transactions on Asian and Low-Resource Language Information Processing, (in print). DOI: 10.1145/3567594.

[2] Mahlaza, Z., Keet, C.M. ToCT: A task ontology to manage complex templates. FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.

[3] Mahlaza, Z., Keet, C.M.: OWLSIZ: An isiZulu CNL for structured knowledge validation. In: Proc. of WebNLG+ 2020. pp. 15–25. ACL, Dublin, Ireland (Virtual).

[4] Mahlaza, Z. Foundations for reusable and maintainable surface realisers for isiXhosa and isiZulu. PhD Thesis, Department of Computer Science, University of Cape Town, South Africa. 2022.

Semantic interoperability of conceptual data modelling languages: FaCIL

Software systems aren’t getting any less complex to design, implement, and maintain, which applies to both the numerous diverse components and the myriad of people involved in the development processes. Even a straightforward configuration of a database back-end and an object-oriented front-end tool requires coordination among database analysts, programmers, HCI people, and, increasingly, domain experts and stakeholders. They each may prefer, and have different competencies in, certain specific design mechanisms; e.g., one may want EER for the database design, UML diagrams for the front-end app, and perhaps structured natural language sentences with SBVR or ORM for expressing the business rules. This requires multi-modal modelling in a plurality of paradigms, which in turn needs to be supported by hybrid tools that offer interoperability among those modelling languages, since such heterogeneity won’t go away any time soon, or ever.

Example of possible interactions between the various developers of a software system and the models they may be using.

It is far from trivial to have these people work together whilst each maintains their preferred view of a unified system’s design, let alone to do all this design in one system. In fact, there’s no tool that can seamlessly render such varied models across multiple modelling languages whilst preserving the semantics. At best, there’s either only theory that aims to do that, or only a subset of the respective languages’ features, or a subset of the required combinations. Well, more precisely: there wasn’t, until our efforts. We set out to fill this gap in functionality, both in a theoretically sound way and implemented as a proof-of-concept to demonstrate its feasibility. The latest progress was recently published in the paper entitled A framework for interoperability with hybrid tools in the Journal of Intelligent Information Systems [1], in collaboration with Germán Braun and Pablo Fillottrani.

First, we propose the Framework for semantiC Interoperability of conceptual data modelling Languages, FaCIL, which serves as the core orchestration mechanism for hybrid modelling tools, with relations between components and a workflow that uses them. At its centre, it has a metamodel that is used for the interchange between the various conceptual models represented in different languages, and it has sets of rules to and from the metamodel (and at the metamodel level) to ensure that the semantics is preserved when transforming a model in one language into a model in another language, and such that edits to one model automatically propagate correctly to the model in the other language. In addition, thanks to the metamodel-based approach, logic-based reconstructions of the modelling languages have also become easier to manage, and so a path to automated reasoning is integrated in FaCIL as well.
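As a rough sketch of that metamodel-as-pivot idea (not FaCIL’s actual rule specifications; the names are made up): each language maps into and out of the common metamodel, so n languages need 2n rule sets rather than n(n-1) pairwise directed ones.

```python
# Sketch of the metamodel-as-pivot idea; the rule functions are made-up
# stand-ins for the real transformation rules.

def uml_to_metamodel(uml_class: dict) -> dict:
    # rule: a UML class maps to an object type in the metamodel
    return {"kind": "ObjectType", "name": uml_class["class_name"]}

def metamodel_to_eer(mm: dict) -> dict:
    # rule: a metamodel object type maps to an EER entity type
    return {"entity_type": mm["name"]}

# An edit in the UML model propagates via the metamodel to the EER model:
uml = {"class_name": "Employee"}
print(metamodel_to_eer(uml_to_metamodel(uml)))  # -> {'entity_type': 'Employee'}
```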

This generic multi-modal modelling interoperability framework FaCIL was instantiated with a metamodel specifically for UML Class Diagrams, EER, and ORM2 interoperability [2] (introduced in 2015), called the KF metamodel [3], with its relevant rules (initial and implemented ones), an English controlled natural language, and a logic-based reconstruction into a fragment of OWL (the orchestration is shown graphically in the paper). This enables a range of different user interactions in the modelling process, of which an example of a possible workflow is shown in the following figure.

A sample workflow in the hybrid setting, showing interactions between visual conceptual data models (i.e., in their diagram version) and in their (pseudo-)natural language versions, with updates propagating to the others automatically. At the start (top), there’s a visual model in one’s preferred language from which a KF runtime model is generated. From there, it can go in various directions: verbalise, convert, or modify it. If the latter, then the KF runtime model is also updated and the changes are propagated to the other versions of the model, as often as needed. The elements in yellow/green/blue are thanks to FaCIL and the white ones are the usual tasks in the traditional one-off one-language modelling setting.

These theoretical foundations were implemented in the web-based crowd 2.0 tool (with source code). crowd 2.0 is the first hybrid tool of its kind, tying together all the pieces such that now, instead of partially or fully manual management of transformations and updates across multiple disparate tools, these tasks can be carried out automatically in one application, therewith also allowing diverse developers and stakeholders to work from a shared single system.

We also describe a use case scenario for it – on Covid-19, as pretty much all of the work for this paper was done during the worse-than-today’s stage of the pandemic – that has lots of screenshots from the tool in action, both in the paper (starting here, with details halfway in this section) and more online.

Besides evaluating the framework with an instantiation, a proof-of-concept implementation of that instantiation, and a use case, it was also assessed against the reference framework for conceptual data modelling of Delcambre and co-authors [4] and shown to meet those requirements. Finally, crowd 2.0’s features were assessed against five relevant tools, considering the key requirements for hybrid tools, and shown to compare favourably against them (see Table 2 in the paper).

Distinct advantages can be summed up as follows, from those 26 pages of the paper, where the, in my opinion, most useful ones are underlined here, and the most promising ones to solve another set of related problems with conceptual data modelling (in one fell swoop!) in italics:

  • One system for related tasks, including visual and text-based modelling in multiple modelling languages, automated transformations and update propagation between the models, as well as verification of the model on coherence and consistency.
  • Any visual and text-based conceptual model interaction with the logic has to be maintained only in one place rather than for each conceptual modelling and controlled natural language separately;
  • A controlled natural language can be specified on the KF metamodel elements so that it then can be applied throughout the models regardless the visual language and therewith eliminating duplicate work of re-specifications for each modelling language and fragment thereof;
  • Any further model management, especially in the case of large models, such as abstraction and modularisation, can be specified either on the logic or on the KF metamodel in one place and propagate to other models accordingly, rather than re-inventing or reworking the algorithms for each language over and over again;
  • The modular design of the framework allows for extensions of each component, including more variants of visual languages, more controlled languages in your natural language of choice, or different logic-based reconstructions.

Of course, more can be done to make it even better, but it is a milestone of sorts: research into the theoretical foundations of this particular line of work commenced 10 years ago with the DST/MINCyT-funded bilateral project on ontology-driven unification of conceptual data modelling languages. Back then, we fantasised that, with more theory, we might get something like this sometime in the future. And we did.

References

[1] Braun, G., Fillottrani, P.R., Keet, C.M. A framework for interoperability with hybrid tools. Journal of Intelligent Information Systems, in print since 29 July 2022.

[2] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98: 30-53.

[3] Fillottrani, P.R., Keet, C.M. KF metamodel formalization. Technical Report, Arxiv.org http://arxiv.org/abs/1412.6545. Dec 19, 2014. 26p.

[4] Delcambre, L.M.L., Liddle, S.W., Pastor, O., Storey, V.C. A reference framework for conceptual modeling. 37th International Conference on Conceptual Modeling (ER’18). Springer LNCS vol. 11157, 27-42.

More detail on the ontology of pandemic

When can we declare the COVID-19 pandemic to be over? I mulled over that earlier in January this year, when the omicron wave was fizzling out in South Africa, and wrote a blog post as a step toward trying to figure it out; a short general-public article was published by The Conversation (republished widely, including by The Next Web). That was not the end of it, though. In parallel – or, more precisely, behind the scenes – that ontological investigation did happen scientifically and in much more detail.

The conclusion is still the same, just with a more detailed analysis, which is now described in the paper entitled Exploring the ontology of pandemic [1], recently accepted at the International Conference on Biomedical Ontology 2022.

First, it includes a proper discussion of how the 9 relevant domain ontologies represent pandemic – as the same as epidemic, as a sibling thereof, or as a subclass, and why – and what sort of generic top-level entity it is asserted to be, plus a few more scientific references by domain experts.

Second, besides the two foundational ontologies whose alignment I discussed in the blog post (DOLCE and BFO), I tried five more foundational ontologies that were selected on several criteria: BORO, GFO, SUMO, UFO, and YAMATO. That mainly took up a whole lot more time, but it didn’t add substantially to the insights into what kind of entity pandemic is. It did, however, make clear that manual alignment is hard, and that it is difficult to get it as precise as it ought, and may need, to be, for several reasons (elaborated on in the paper).

Third, I dug deeper into the eight characteristics of pandemics according to the review by Morens, Folkers and Fauci (yes, him, from NIAID) [2] and disentangled what’s really going on with those, besides already having noted that several of them are fuzzy. Some of the characteristics aren’t really a property of the pandemic itself, but of closely related entities, such as the disease (see table below). There are so many intertwined entities and relations, in fact, that one could very well develop an ontology of just pandemics, rather than have it only as a single class in an ontology, as is now the case. For instance, there has to be a high attack rate, but ‘attack rate’ itself relies on the fact that there is an infectious agent that causes a disease, and on the R (reproduction) number that, in turn, is a complex thing that takes into account factors including susceptibility to infection, the social dynamics of a population, and the ability to measure infections.

Finally, there are different ways to represent all the knowledge, or a relevant part thereof, as I also elaborated on in my Bio-Ontologies keynote last month. For instance, the attack rate could be squashed into a single data property if the calculation is done elsewhere and you don’t care how it is calculated, or it can be represented in all its glorious detail, be it for the sake of it or for getting a clearer picture of what goes into computing the R number. For a scientific ontology, the latter is obviously the better choice, but there may be scenarios where the former is more practical.
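For instance, under the usual textbook definition, the attack rate is the number of new cases divided by the population at risk over some period, so the two options amount to storing the quotient versus modelling its constituents (sketch with made-up numbers):

```python
# Two ways of representing 'attack rate': as a single pre-computed value
# (one data property) vs from its constituents. The numbers are made up.

# Option 1: squashed into a single data property
attack_rate = 0.12

# Option 2: modelled from its constituents (new cases / population at risk)
new_cases = 1_200
population_at_risk = 10_000
attack_rate_computed = new_cases / population_at_risk
print(attack_rate_computed)  # -> 0.12
```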

The conclusion? The analysis cleared up a few things, but with some imprecise and highly complex properties as part of the mix to determine what is (and is not) a pandemic, there will be more than one optimum/finish line for a particular pandemic. To arrive at something more specific than in the paper, the domain experts may need to carry out a bit more research or come to a consensus on how to precisiate those properties that are currently still vague.

Last, but not least, on attending ICBO’22, which will be held from 25-28 September in Ann Arbor, MI, USA: it runs in hybrid format. At the moment, I’m looking into the logistics of trying to attend in person, now that we don’t have the highly anticipated ‘winter wave’ like the one we had last year that thwarted my conference travel planning. While that takes extra time and resources to sort out, there’s the very thick silver lining that we seem to be considerably closer to the real end of this pandemic (of the acute infections, at least). According to the draft characterisation of pandemic, one indeed might argue it’s over.

References

[1] Keet, C.M. Exploring the Ontology of Pandemic. 13th International Conference on Biomedical Ontology (ICBO’22). CEUR-WS. Michigan, USA, September 25-28, 2022.

[2] Morens, DM, Folkers, GK, Fauci, AS. What Is a Pandemic? The Journal of Infectious Diseases, 2009, 200(7): 1018-1021.

Conference report: SWAT4HCLS 2022

The things one can do when on sabbatical! For this week, it’s mainly attending the 13th Semantic Web Applications and Tools for Health Care and Life Science (SWAT4HCLS) conference and even having some time to write a conference report again. (The last post tagged with ‘conference report’ was FOIS2018, at the end of my previous sabbatical.) The conference consisted of a tutorial day, two conference days with several keynotes and invited talks, paper presentations and poster sessions, and, on the last day, a ‘hackathon’/unconference. This clearly has grown over the years from the early days of the event series (one day, workshop, life science).

A photo of the city where it was supposed to take place: Leiden (NL) (Source: here)

It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from, each promising that this time it will be solved once and for all, though practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.

Keynotes

The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool [1], a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help find information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome), or part thereof, and using UMLS and MetaMap to help find more connections. Unlike in a well-known domain, where well-known terminology can be used to formulate very specific queries over the academic literature, that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.

The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation, and who is involved in WikiPathways. She showed pretty pictures, like the COVID-19 Disease Map that culminated from a lot of effort by many research communities with lots of online data resources [2]; see also the WikiPathways one for COVID, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between the information in those diagrams and the databases [3]. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically [4]. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.

Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19 is not at all clear in Africa, but ideally it would be. Let me illustrate this: the worldwide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as many; that only about 10% of the COVID-19-positives are detected by the diagnostic tests—the rest don’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population has already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc. back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.

The last keynote, at the end of the conference, was by Amit Sheth, of the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on—you name it, somewhere it’s used in the systems.

Papers

Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’, and there’s no access to the papers yet to look up more details to check my notes and pad them out. I’m assuming the papers will be accessible online soon (CEUR-WS again, presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.

Ruduan Plug elaborated on the FAIR & GDPR and querying over integrated data within the above-mentioned VODAN-Africa project [5]. He also noted that South Africa’s PoPIA is stricter than the GDPR. I suspect that is due to the cross-border restrictions on the flow of data that the GDPR doesn’t have. (PoPIA is based on the GDPR principles, btw.)

Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of COVID [6], which I have to leave here as a cliff-hanger: I didn’t manage to make a note of which one won, because I had to go to a meeting that we were already starting later on account of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than is available to date).

Besides papers related to scientific research into all things COVID, another recurring topic was FAIR data—whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing a FAIR level makes more sense before-and-after changing a single resource than when pitting different vocabularies against each other, that “FAIR enough” can be enough (cf. demanding 100% compliance) [7], and that a FAIR vocabulary does not imply that it is also a good-quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis of the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of ontology engineering activities, that there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO [8]. What to do next with these insights remains to be seen.

Last, but not least—to try to keep the blog post at a just about readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do [9]. Even when they do, it’s just with labels, rather than with any of the language models, be it the ontolex-lemon one from the W3C community group or another, and if so, mainly for French and German.

Source: [9]

Does it matter? It depends on what your aims are. Our main motivations are ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just the health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same for the science about SARS-CoV-2 and COVID-19, and thus that the related bio-ontologies could also do with coordinated multilingualism, so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.

Other

There were also posters, which we could hang out by in gather.town, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more COVID stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without these semantics-driven tools—from sharing data, to searching data, to integrating data, to the analysis needed to develop the theory that figures out all its workings.

The conference was supposed to be mainly in person, but then, on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events, effective until, would you believe, 14 Jan—one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!

References

[1] Verspoor K. et al. Brief Description of COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research. In: Hiemstra D., Moens MF., Mothe J., Perego R., Potthast M., Sebastiani F. (eds). Advances in Information Retrieval. ECIR 2021. Springer LNCS, vol 12657, 559-564.

[2] Ostaszewski M. et al. COVID19 Disease Map, a computational knowledge repository of virus–host interaction mechanisms. Molecular Systems Biology, 2021, 17:e10387.

[3] Hanspers, K., Riutta, A., Summer-Kutmon, M. et al. Pathway information extracted from 25 years of pathway figures. Genome Biology, 2020, 21,273.

[4] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics, 2012, 45(3): 482-494. DOI: dx.doi.org/10.1016/j.jbi.2012.01.004.

[5] Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[6] Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[7] Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[8] César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[9] Frances Gillis-Webber and C. Maria Keet, A Survey of Multilingual OWL Ontologies in BioPortal. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

Progress on generating educational questions from ontologies

With increasing student numbers, but not as much more funding for schools and universities, and the desire to automate certain tasks anyhow, there have been multiple efforts to generate and mark educational exercises automatically. There are a number of efforts for the relatively easy tasks, such as for learning a language, which range from the entry level with simple vocabulary exercises to advanced ones of automatically marking essays. I’ve dabbled in that area as well, mainly with 3rd-year capstone projects and 4th-year honours student projects [1]. Then there’s one notch up, with fact recall and concept meaning recall questions, and further steps up, such as generating multiple-choice questions (MCQs) with not just obviously wrong distractors but good distractors to make the question harder. There’s quite a bit of work done on generating those MCQs in theory and in tooling, notably [2,3,4,5]. As a recent review [6] also notes, however, there are still quite a few gaps. Among others, about the generalisability of theory and systems – can you plug any structured data or knowledge source into the question templates? – and about the types of questions. Most of the research on ‘not-so-hard to generate and mark’ questions has been done for MCQs, but there are multiple other types of questions that should also be doable to generate automatically, such as true/false, yes/no, and enumerations. For instance, with an axiom such as impala ⊑ ∃livesOn.land in an ontology or knowledge graph, a suitable question generation system may then generate “Does an impala live on land?” or “True or false: An impala lives on land.”, among other options.
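A first approximation of that generation step, for just this one axiom shape, can be as simple as the following sketch (the templates and the tiny verbalisation map are illustrative only; a real system needs many more variants and proper morphosyntactic handling):

```python
# Toy generation of yes/no and true/false questions from an axiom of the
# shape 'C SubClassOf R some D'; illustrative only. The article 'an' is
# hardcoded here, which a real system would have to determine.

def questions_from_axiom(subclass: str, relation: str, filler: str) -> list[str]:
    # tiny verbalisation map: relation -> (infinitive, third person singular)
    verbs = {"livesOn": ("live on", "lives on")}
    inf, third = verbs.get(relation, (relation, relation))
    return [
        f"Does an {subclass} {inf} {filler}?",             # yes/no (answer: yes)
        f"True or false: An {subclass} {third} {filler}.", # true/false (answer: true)
    ]

for q in questions_from_axiom("impala", "livesOn", "land"):
    print(q)
# -> Does an impala live on land?
# -> True or false: An impala lives on land.
```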

We set out to make a start with tackling those sorts of questions, for the type-level information in an ontology (cf. facts in the ABox or a knowledge graph). The only work done there when we started was for the slick and fancy Inquire Biology [5], which did not have its tech available for inspection and use, so we had to start from scratch. In particular, we wanted to find a way to be able to plug any ontology into a system and generate those non-MCQ types of educational questions (10 in total), where the generated questions are at least grammatically good and the answers also can be generated automatically, so that we get to automated marking as well.

Initial explorations started in 2019 with an honours project to develop some basics and a baseline, which was then expanded upon. Meanwhile, we have designed, developed, and evaluated some more, which was written up in the paper “Generating Answerable Questions from Ontologies for Educational Exercises” [7], which has been accepted for publication and presentation at the 15th International Conference on Metadata and Semantics Research (MTSR’21) that will be held online next week.

In short:

  • Different types of questions and the answer they have to provide put different prerequisites on the content of the ontology with certain types of axioms. We specified those for 10 types of educational questions.
  • Three strategies of question generation were devised: a ‘simple’ one that takes the vocabulary and axioms and plugs them into a template, one guided by some more semantics in the ontology (a foundational ontology), and one that didn’t really care about either but rather took a natural language approach. Variants were added to cater for differences in naming and other variations, amounting to 75 question templates in total.
  • The human evaluation with questions generated from three ontologies showed that while the semantics-based one was slightly better than the baseline, the NLP-based one gave the best results on syntactic and semantic correctness of the sentences (according to the human evaluators).
  • It was tested with several ontologies in different domains, and the generalisability looks promising.
Graphical Abstract (made by Toky Raboanary)

To be honest to those getting their hopes up: there are some issues that prevent it from ever making it to ‘100% fabulous!’ if one still wants to design a system that can take any ontology as input. A main culprit is the naming of elements in the ontology, which varies widely across ontologies. There are several guidelines for how to name entities, such as using camel case or underscores, and those things easily can be coded into an algorithm, indeed, but developers don’t stick to them consistently, or there’s an ontology import that uses another naming convention, so that there likely will be a glitch in the generated sentences here or there. Or they name things within the context of the hierarchy where they put the class, but in the question it is out of that context and then looks weird or is even meaningless. I moaned about this before; e.g., ‘American’ as the name of the class that should have been named ‘American Pizza’ in the Pizza ontology. Or the word used for the name of the class can have different POS tags, such that it makes the generated sentence hard to read; e.g., ‘stuff’ as a noun or a verb.
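Some of that naming variation can indeed be coded away; e.g., a small normalisation step like the following handles camel case and underscores, but it cannot recover a context-dependent name like ‘American’ (sketch):

```python
import re

# Normalise common ontology naming conventions into space-separated words;
# this handles camel case and underscores, but not context-dependent names.

def name_to_words(name: str) -> str:
    name = name.replace("_", " ")
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)  # split camel case
    return name.lower()

print(name_to_words("DomesticAnimal"))   # -> domestic animal
print(name_to_words("domestic_animal"))  # -> domestic animal
print(name_to_words("American"))         # -> american (the 'pizza' context is lost)
```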

Be this as it may, overall, promising results were obtained and are being extended (more to follow). Some details can be found in the (CRC of the) paper, and the algorithms and data are available from the GitHub repo. The first author of the paper, Toky Raboanary, recently made a short presentation video about the paper for the yearly Open Evening/Showcase, which was held virtually; that page is still available online.

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural language (CNL’18). Davis, B., Keet, C.M., Wyner, A. (Eds.). IOS Press, FAIA vol. 304, 31-40. Co. Kildare, Ireland, 27-28 August 2018.

[2] Alsubait, T., Parsia, B., Sattler, U. Ontology-based multiple choice question generation. KI – Kuenstliche Intelligenz, 2016, 30(2), 183-188.

[3] Rodriguez Rocha, O., Faron Zucker, C. Automatic generation of quizzes from dbpedia according to educational standards. In: The Third Educational Knowledge Management Workshop. pp. 1035-1041 (2018), Lyon, France. April 23 – 27, 2018.

[4] Vega-Gorgojo, G. Clover Quiz: A trivia game powered by DBpedia. Semantic Web Journal, 2019, 10(4), 779-793.

[5] Chaudhri, V., Cheng, B., Overholtzer, A., Roschelle, J., Spaulding, A., Clark, P., Greaves, M., Gunning, D. Inquire biology: A textbook that answers questions. AI Magazine, 2013, 34(3), 55-72.

[6] Kurdi, G., Leo, J., Parsia, B., Sattler, U., Al-Emari, S. A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Edu, 2020, 30(1), 121-204.

[7] Raboanary, T., Wang, S., Keet, C.M. Generating Answerable Questions from Ontologies for Educational Exercises. 15th Metadata and Semantics Research Conference (MTSR’21). 29 Nov – 3 Dec, Madrid, Spain / online. Springer CCIS (in print).

Bias in ontologies?

Bias in models in the area of Machine Learning and Deep Learning is well known. It features in the news regularly with catchy headlines, and there are longer, more in-depth reports as well, such as Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven, bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs [1], one attempt toward a framework that looks at FOAF only [2]—which is stretching it a little for what passes as an ontology—and then, stretching it even further, an old one of mine on bias in relation to conceptual data models for databases [3].

So we simply don’t have bias in ontologies? That sounds a bit optimistic, since bias is pervasive elsewhere, and it is at least worthy of examination whether there is such a notion as bias in ontologies and, if so, what its sources may be. And, if one wants to dig deeper, there’s the Ontology question: what is bias anyhow? The popular media is much more liberal in its use of the term ‘bias’ than the scientific literature, and I’m not going to answer that last question here now. What I did do is try to identify sources of bias in the context of ontologies, and I took a relevant selection of Dimara et al’s list of 154 biases [4] (just as only a subset is relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.

The outcome of that exploratory analysis [5], in short, is: yes, there is such a notion as bias in ontologies as well. First, I identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined the three COVID-19 ontologies (CIDO, CODO, COVoc) on possible bias, and they indeed exhibited different subsets.

The sources can be philosophical, by purpose (commonly known as encoding bias), or from the ‘subject domain’, such as the scientific theory, granularity, linguistic, social-cultural, political or religious, and economic motivations, and they may be explicit choices or creep in implicitly.

Table 1. Summary of typical possible biases in ontologies grouped by source, with an indication whether such biases would be explicit choices or whether they may creep in unintentionally and lead to implicit bias. (Source: [5])

An example of an economic motivation is to (try to) categorise some disorder as a type of disease: the latter gets more resources for medicines, research, and treatments, and is more costly for insurers, who’d rather keep it out of the terminology altogether. Or modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see the paper for details). Terrorism ontologies can provide ample material for political views to creep in.

Besides the hand-picked examples, I did assess the three COVID-19 ontologies in more detail. Not because I wanted to pick on them—I actually think it’s laudable they tried in trying times—but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked both at the sources, which can be argued to be present, and identified some biases from a selection of Dimara et al’s list, such as the “mere exposure/familiarity” bias and the “false consensus” bias (see table below). How they are present is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), which is part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.

Table 2. Tentative presence of bias in the three COVID-19 ontologies, by cognitive bias; see paper for details.

Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much—or, at least, any inconsistencies it would have caused should have been detected and discussed during the ontology development stage already.

Will it matter for, say, annotating data or literature etc.? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not ‘female’ (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” they want to do, it’s going to be even harder to retrieve COVID-19 research papers relating to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access, bias may have negative effects. Take as an example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and suppose this ontology were used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA-approved drugs database:

Figure 1. OBDI scenario with CIDO, two databases, and a query over the system that returns a logically correct but undesirable result, due to some optimism that an experimental substance is already a drug.

The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect, and the error is due to the biased and erroneous class subsumption declared in the ontology, not to the data source itself.
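The logic of the example can be mimicked in a few lines (a mock-up, not the actual CIDO/OBDA setup): because the experimental drug class is asserted to be a subclass of ‘COVID-19 drug’, instance retrieval for the latter also returns the former’s instances.

```python
# Mock-up of the effect of the biased subsumption on instance retrieval;
# class and instance names are simplified.

subclass_of = {"COVID19ExperimentalDrugInTrial": "COVID19Drug"}
instances = {"COVID19ExperimentalDrugInTrial": {"hydroxychloroquine"}}

def retrieve(cls: str) -> set[str]:
    result = set(instances.get(cls, set()))
    for sub, sup in subclass_of.items():
        if sup == cls:
            result |= retrieve(sub)  # subclass instances are inferred instances of cls
    return result

print(retrieve("COVID19Drug"))  # -> {'hydroxychloroquine'}: logically correct, yet undesirable
```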

Some peculiarities of the content of an ontology may not be due to an underlying bias, but merely a case of having run out of time, say, rather than an act of omission due to a bias. Or it may not be an honest mistake due to bias but a mistake for some other reason, such as having clicked erroneously on a wrong button in the tool’s interface, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One also can have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. In other words: there seems to be enough to do yet.

Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but it’s increasingly looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.

References

[1] K. Janowicz, B. Yan, B. Regalia, R. Zhu, G. Mai, Debiasing knowledge graphs: Why female presidents are not like female popes, in: M. van Erp, M. Atre, V. Lopez, K. Srinivas, C. Fortuna (Eds.), Proceedings of ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, volume 2180 of CEUR-WS, 2018.

[2] D. L. Gomes, T. H. Bragato Barros, The bias in ontologies: An analysis of the FOAF ontology, in: M. Lykke, T. Svarre, M. Skov, D. Martínez-Ávila (Eds.), Proceedings of the Sixteenth International ISKO Conference, Ergon-Verlag, 2020, pp. 236 – 244.

[3] Keet, C.M. Dirty wars, databases, and indices. Peace & Conflict Review, 2009, 4(1):75-78.

[4] E. Dimara, S. Franconeri, C. Plaisant, A. Bezerianos, P. Dragicevic, A task-based taxonomy of cognitive biases for information visualization, IEEE Transactions on Visualization and Computer Graphics 26 (2020) 1413–1432.

[5] Keet, C.M. An exploration into cognitive bias in ontologies. Cognition And OntologieS (CAOS’21), part of JOWO’21, part of BoSK’21. 13-16 September 2021, Bolzano, Italy. (in print)

CLaRO v2.0: A larger CNL for competency questions for ontologies

The avid blog reader with a good memory might remember that in 2019 we developed a controlled natural language (CNL) that we called CLaRO, a Competency question Language for specifying Requirements for an Ontology, model, or specification [1], for specifying requirements on the contents of the TBox (type-level knowledge) specifically. The paper won the best student paper award at the MTSR’19 conference. Then COVID-19 came along.

Notwithstanding, we did take next steps and obtained some advances in the meantime, which resulted in a substantially extended CNL, called CLaRO v2 [2]. The paper describing how it came about has recently been accepted at the 7th Controlled Natural Language Workshop (CNL2020/21), which will be held on 8-9 September in Amsterdam, The Netherlands, in hybrid mode.

So, what is it about, being “new and improved!” compared to the first version? The first version was created in a bottom-up fashion, based on a dataset of 234 competency questions [3] from a few domains only. It turned out alright, with decent performance on coverage for unseen questions (88% overall), very significantly outperforming the others, but there were some nagging doubts about the feasibility of bottom-up approaches to template development—doubts that apply to every bottom-up approach: questions about the representativeness and quality of the source data. We used more questions as a basis to work from than others did, and had better coverage, but would coverage improve still further with even more questions? Would it matter for coverage if the CQs were to come from more diverse subject domains? Also, upon manual inspection of the original CQs, it could be seen that some CQs from the dataset were ill-formed, which propagated through to the final set of CLaRO templates. Would ‘cleaning’ the source data into presumably better-quality templates improve coverage?

One of the PhD students I supervise, Mary-Jane Antia, set out to find answers to these questions. The CQs were cleaned and vetted by a linguist, and the templates were recreated, compared, and evaluated—this time automatically, in a new testing pipeline. New CQs for ontologies were sourced by searching all over the place, finding some 70, to which we added 22 more variants by tweaking the wording of existing CQs such that they would still be potentially answerable by an ontology. They were tested on the templates, which resulted in a lower-than-ideal percentage of coverage, and so new templates were created from them and evaluated yet again. The key results:

  • An increase in coverage from 88% for CLaRO v1 to 94.1% for CLaRO v2.
  • The new CLaRO v2 has 147 main templates and another 59 variants to cater for minor differences (e.g., singular/plural, redundant words), up from 93 and 41 in CLaRO v1.
  • Increasing the number of domains that the CQs were drawn from had a larger effect on the CQ coverage than cleaning the source data.
Screenshot of the CLaRO CQ editor tool.

All the data, including the new templates, are available on Github and the details are described in the paper [2]. The CLaRO tool that supports the authoring is being updated to incorporate the v2 templates (currently it works with the v1 templates).
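For the curious, here is a minimal sketch in Python of the general template-coverage idea: each chunk placeholder in a template is treated as a wildcard, and a CQ counts as covered if it matches at least one template. The mini template set and the regex-based matching are illustrative assumptions on my part, not the actual CLaRO testing pipeline.

```python
import re

# A hypothetical mini-set of CLaRO-style templates, where EC = entity chunk
# and PC = predicate/relation chunk (the real v2 set has 147 main templates
# plus 59 variants).
TEMPLATES = [
    "What is [EC1]?",
    "Which [EC1] [PC1] [EC2]?",
    "How many [EC1] are there?",
]

def template_to_regex(template: str) -> re.Pattern:
    """Turn a template into a regex: each chunk placeholder becomes a
    wildcard matching a sequence of words."""
    escaped = re.escape(template)
    # re.escape turns [EC1] into \[EC1\]; replace each escaped placeholder
    # with a word-sequence wildcard (a lambda avoids escape processing in repl).
    pattern = re.sub(r"\\\[(?:EC|PC)\d+\\\]", lambda _: r"[\w\s'-]+", escaped)
    return re.compile(f"^{pattern}$", re.IGNORECASE)

def coverage(cqs: list, templates: list) -> float:
    """Fraction of competency questions matched by at least one template."""
    regexes = [template_to_regex(t) for t in templates]
    hits = sum(1 for cq in cqs if any(r.match(cq) for r in regexes))
    return hits / len(cqs)

cqs = ["What is a gene?", "Which rivers flow into the ocean?"]
print(f"coverage: {coverage(cqs, TEMPLATES):.0%}")  # 100% on this toy input
```

An unmatched CQ in such a pipeline is then a candidate source for a new template, which is essentially the bottom-up loop described above.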

I will try to make it to Amsterdam, where CNL’21 will take place, but travel restrictions aren’t cooperating with that plan just yet; else I’ll participate virtually. Mary-Jane will present the paper and, despite having funding for the trip as well, it increasingly looks like hers will be a virtual presentation too. On the bright side: at least there is a way to participate virtually.

References

[1] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol. 1075, 3-15.

[2] Antia, M.-J., Keet, C.M. Assessing and Enhancing Bottom-up CNL Design for Competency Questions for Ontologies. 7th International Workshop on Controlled Natural Language (CNL’21), 8-9 Sept. 2021, Amsterdam, the Netherlands. (in print)

[3] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, 29: 105098.

NLG requirements for social robots in Sub-Saharan Africa

When the robots come rolling, or just trickling or seeping or slowly creeping, into daily life, I want them to be culturally aware, give contextually relevant responses, and to do that in a language that the user can understand and speak well. Currently, they don’t. Since I work and live in South Africa, what does all that mean for the Southern African context? Would social robot use case scenarios be different here than in the Global North, where most of the robot research and development is happening, and if so, how? What is meant by contextually relevant responses? Which language(s) should the robot communicate in?

The question of which languages is the easiest to answer: those spoken in this region, which are mainly the languages in the Niger-Congo B [NCB] (aka ‘Bantu’) family, and then also Portuguese, French, Afrikaans, and English. I’ve been working on theory and tools for NCB languages, and isiZulu in particular (and some isiXhosa and Runyankore), mainly as part of the two NRF-funded projects GeNI and MoReNL. However, if we don’t know how that human-robot interaction occurs and in which setting, we won’t know whether the algorithms designed so far can also be used for it; the interaction may well go beyond the ontology verbalisation, patients’ medicine prescription generation, weather forecasts, and language learning exercises that we roughly have covered for the controlled language and natural language generation aspects.

So then what about those use case scenarios and contextually relevant responses? Let me first give an example of the latter. A few years ago, in one of the social issues and professional practice lectures I was teaching, I brought in the Amazon Echo to illustrate precisely that, as well as privacy issues with Alexa and digital assistants (‘robot secretaries’) in general. Upon asking “What is the EFF?”, the whole class—some 300 students present at the time—expected Alexa to respond with something like “The EFF is the Economic Freedom Fighters, a political party in South Africa”. Instead, Alexa fetched the international/US-based answer and responded with “The EFF is the Electronic Frontier Foundation”, which the class had never heard of and which doesn’t really do anything in South Africa (it does come up later in the module nonetheless, btw). There’s plenty of online content about the EFF as a political party, yet Alexa chose to ignore that and prioritise information from elsewhere. Go figure with lots of other information that has limited online presence and doesn’t score high in the search engine results because there are fewer queries about it. How to get the right answer in those cases is not my problem (area of expertise), so I take that as a solved black box and zoom in on the natural language aspects: automatically generating a sentence that contains the answer, taken from some structured data or knowledge.

The other aspect of this instance is that the interactions both during and after the lecture were not 1:1 interactions of students with their own version of Siri or Cortana and the like; rather, eager and curious students came in teams, so a 1:m interaction. While that particular class is relatively large and was already split into two sessions, larger classes are not uncommon in several Sub-Saharan countries: for secondary school class sizes, the SADC average is 23.55 learners per class (the world average is 17), with the lowest in Botswana (13.8 learners) and the highest in Malawi, with a whopping 72.3 learners in a class on average. An educational robot could well be a useful way out of that catch-22 and, given resource constraints, may end up in a deployment scenario with one robot per study group, in a multilingual setting that permits code switching (going back and forth between different languages). While human-robot interaction experts will still need to do contextual inquiries and the like to get to the bottom of the exact requirements and sentences, this variation in use is on top of the hitherto known possible uses of educational robots.

Going beyond this sort of informal chatter, I tried to structure it a bit and narrowed it down to a requirements analysis for the natural language generation aspects. After some contextualisation, I principally used two main use cases to elicit natural language generation requirements and assessed those against key advances in research and technologies for NCB languages. Very, very briefly, any such system will need to i) combine data-to-text and knowledge-to-text generation, ii) generate many more different types of sentences, including sentences for both written and spoken language in the NCB languages, which are grammatically rich and often agglutinating, and iii) process numbers, which is non-trivial for NCB languages because the surface realisation of a number depends on the noun class of the noun that is being counted. At present, no system out there can do all of that. A condensed version of the analysis was recently accepted as a paper entitled Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa [1] for the IST-Africa’21 conference, and it will be presented there next week at the virtual event, in the ‘next generation computing’ session no less, on Wednesday the 12th of May.
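To make requirement iii a bit more concrete: the numeral’s surface form has to agree with the noun class of the noun being counted. Here is a toy sketch in Python for isiZulu’s ‘two’, with the agreed forms hard-coded for just two noun classes; a real grammar engine computes these from the concords and morphophonological rules rather than from a lookup table.

```python
# Toy illustration only: real isiZulu number realisation is handled by
# grammar rules, not a lookup table; the forms below are hard-coded examples.
# Noun class -> surface form of the numeral stem -bili ('two') with its concord.
TWO_BY_NOUN_CLASS = {
    2: "ababili",    # class 2 concord aba-, e.g., for abantu ('people')
    10: "ezimbili",  # class 10 concord ezi- (with nasal change), e.g., for izincwadi ('books')
}

def two_of(noun: str, noun_class: int) -> str:
    """Render '<noun> two' with noun class agreement, e.g., 'abantu ababili'."""
    return f"{noun} {TWO_BY_NOUN_CLASS[noun_class]}"

print(two_of("abantu", 2))      # abantu ababili    ('two people')
print(two_of("izincwadi", 10))  # izincwadi ezimbili ('two books')
```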

Screen grab of the recording of the conference presentation (link to recording available upon request)

Probably none of you has ever heard of this conference. IST-Africa is a yearly IT conference in Africa that aims to foster North-South and South-South networking, promote the academia->industry and academia->policy bridge-creation and knowledge transfer pipelines, and build capacity for paper writing and presentation. The topics covered are distinctly of regional relevance and, according to its call for papers, the “Technical, Policy, Social Implications Papers must present analysis of early/final Research or Implementation Project Results, or business, government, or societal sector Case Study”.

Why should I even bother with an event like that? It’s good to sometimes reflect on the context and ponder the relevance of one’s research—after all, part of the university’s income (and thus my salary) and a large part of the research project funding I have received so far ultimately comes from taxpayers. South African taxpayers, to be more precise; not the taxpayers of the Global North. I can ‘advertise’, ahem, my research area and its progress to a regional audience. Also, I don’t expect that the average scientist in the Global North would care about HRI in Africa, and even less so for NCB languages, but the analysis needed to be done and papers equate to brownie points. Moreover, if everyone thinks it better not to participate in something local or regional, it won’t ever become a vibrant network of research, applied research, and technology. I attended the event once, in 2018, when we had a paper on error correction for isiZulu spellcheckers, and from my researcher’s viewpoint it was useful for networking and ‘shopping’ for interesting problems that I may be able to solve, based on other participants’ case studies and inquiries.

Time will tell whether attending that event then, and now this paper and online attendance, will be time wasted or well spent. Unlike the papers on the isiZulu spellcheckers, which reported research and concrete results that a tech company could easily take up (feel free to do so), this is a ‘fluffy’ paper, but exploring the use of robots in Africa was an interesting activity, I learned a few things along the way, it will save other interested people time in the analysis phase, and hopefully it will also generate some interest and discussion about what sort of robots we’d want and what they could or should be doing to assist, rather than replace, humans.

p.s.: if you still were to think that there are no robots in Africa and deem all this to be irrelevant: besides robots in the automotive and mining industries by, e.g., Robotic Innovations and Robotic Handling Systems, there are robots in education (also in Cape Town, by RD-9), robot butlers in hotels that serve quarantined people with mild COVID-19 in Johannesburg, robots used for COVID-19 screening in Rwanda, and the Naledi personal banking app by Botlhale, to name but a few examples. Other tools are moving in that direction, such as, among others, Awezamed’s use of speech synthesis with (canned) text in isiZulu, isiXhosa and Afrikaans, and there’s of course my research group, where we look into knowledge-to-text generation in African languages.

References

[1] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. (in print)

Automatically simplifying an ontology with NOMSA

Ever wanted only to get the gist of the ontology rather than wading manually through thousands of axioms, or to extract only a section of an ontology for reuse? Then the NOMSA tool may provide the solution to your problem.

screenshot of NOMSA in action (deleting classes further than two levels down in the hierarchy in BFO)

There are quite a number of ways to create modules for a range of purposes [1]. We zoomed in on the notion of abstraction: how to remove all sorts of details and create a new ontology module from that. It’s a long-standing topic in computer science that returns every couple of years with another few tries. My first attempts date back to 2005 [2], which referenced works from the mid-1990s on modules and abstractions for conceptual models and logical theories and, stretching the scope to granularity, even to 1985. Those efforts, however, tended to halt at the theory stage or worked for one very specific scenario only (e.g., clustering in ER diagrams). In this case, however, my former PhD student, now Senior Researcher at the CSIR, Zubeida Khan, went further: she also devised the algorithms for five types of abstraction, implemented them for OWL ontologies, and evaluated them on various metrics.

The tool itself, NOMSA, was presented very briefly at the EKAW 2018 Posters & Demos session [3] and has supplementary material, such as the definitions and algorithms, a very short screencast, and the source code. Five different ways of abstraction to generate ontology modules were implemented: i) removing participation constraints between classes (e.g., the ‘each X R at least one Y’ type of axioms), ii) removing vocabulary (e.g., removing all object properties to yield a bare class taxonomy), iii) keeping only a small number of levels in the hierarchy, iv) weighting based on how much an element is used (removing less-connected elements), and v) removing specific language profile features (e.g., qualified cardinality, object property characteristics).
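To give a flavour of type ii, here is a minimal sketch using the owlready2 Python library rather than NOMSA’s own implementation, and with placeholder file names; it strips all object properties from an ontology, leaving the bare class taxonomy:

```python
from owlready2 import get_ontology, destroy_entity

# A minimal sketch of abstraction type ii (vocabulary removal): strip all
# object properties to leave a bare class taxonomy. This uses owlready2,
# not NOMSA's own code; the file names are placeholders.
onto = get_ontology("file://biotop.owl").load()

# Materialise the generator first, since we destroy entities while iterating.
for prop in list(onto.object_properties()):
    destroy_entity(prop)  # removes the property and the relations that use it

onto.save(file="biotop_taxonomy.owl")
```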

In the meantime, we have added a categorisation of different ways of abstracting conceptual models and ontologies, a larger use case illustrating the five types of abstraction that were chosen for specification and implementation, and an evaluation of how well the abstraction algorithms work on a set of published ontologies. It was all written up and polished in 2018. Then it took a while in the publication pipeline, mixed with pandemic delays, but eventually it emerged as a book chapter entitled Structuring abstraction to achieve ontology modularisation [4] in the book “Advanced Concepts, Methods, and Applications in Semantic Computing”, edited by Olawande Daramola and Thomas Moser, in January 2021.

Since I bought new video editing software for the ‘physically distanced learning’ we’re in now at UCT, I decided to play a bit with the software’s features and record a more comprehensive screencast demo video. In the nearly 13 minutes, I illustrate NOMSA with four real ontologies: the AWO tutorial ontology, the BioTop top-domain ontology, the BFO top-level ontology, and the Stuff core ontology. Here’s a screengrab from somewhere in the middle of the presentation, where I have just removed all 76 object properties from BioTop automatically, with one click of a button:

screengrab of the demo video

The embedded video (below) may perhaps still be readable with really good eyesight; else you can view it here in a separate tab.

The source code is available from Zubeida’s website (and I have a local copy as well). If you have any questions or suggestions, please feel free to contact either of us. Under the fair use clause, we can also share the book chapter that contains the details.

References

[1] Khan, Z.C., Keet, C.M. An empirically-based framework for ontology modularization. Applied Ontology, 2015, 10(3-4):171-195.

[2] Keet, C.M. Using abstractions to facilitate management of large ORM models and ontologies. International Workshop on Object-Role Modeling (ORM’05). Cyprus, 3-4 November 2005. In: OTM Workshops 2005. Halpin, T., Meersman, R. (eds.), LNCS 3762. Berlin: Springer-Verlag, 2005. pp603-612.

[3] Khan, Z.C., Keet, C.M. NOMSA: Automated modularisation for abstraction modules. Proceedings of the EKAW 2018 Posters and Demonstrations Session (EKAW’18). CEUR-WS vol. 2262, pp13-16. 12-16 Nov. 2018, Nancy, France.

[4] Khan, Z.C., Keet, C.M. Structuring abstraction to achieve ontology modularisation. Advanced Concepts, Methods, and Applications in Semantic Computing. Daramola O, Moser T (Eds.). IGI Global. 2021, 296p. DOI: 10.4018/978-1-7998-6697-8.ch004