CLaRO v2.0: A larger CNL for competency questions for ontologies

The avid blog reader with a good memory might remember that in 2019 we developed a controlled natural language (CNL) called CLaRO, a Competency question Language for specifying Requirements for an Ontology, model, or specification [1], aimed specifically at requirements on the contents of the TBox (type-level) knowledge. The paper won the best student paper award at the MTSR’19 conference. Then COVID-19 came along.

Notwithstanding, we did take next steps and obtained some advances in the meantime, which resulted in a substantially extended CNL, called CLaRO v2 [2]. The paper describing how it came about has been accepted recently at the 7th Controlled Natural Language Workshop (CNL2020/21), which will be held on 8-9 September in Amsterdam, The Netherlands, in hybrid mode.

So, what is it about, being “new and improved!” compared to the first version? The first version was created in a bottom-up fashion based on a dataset of 234 competency questions (CQs) [3] in a few domains only. It turned out alright, with decent coverage of unseen questions (88% overall) and very significantly outperforming the others, but there were some nagging doubts about the feasibility of bottom-up template development. They concern issues at the heart of every bottom-up approach: the representativeness and quality of the source data. We used more questions as a basis to work from than others did and had better coverage, but would coverage improve further still with even more questions? Would it matter for coverage if the CQs were to come from more diverse subject domains? Also, manual inspection of the original CQs showed that some CQs in the dataset were ill-formed, and these propagated through to the final set of templates of CLaRO. Would ‘cleaning’ the source data, to obtain presumably better quality templates, improve coverage?

One of the PhD students I supervise, Mary-Jane Antia, set out to find answers to these questions. The CQs were cleaned and vetted by a linguist, and the templates were recreated, compared, and evaluated, this time automatically in a new testing pipeline. New CQs for ontologies were sourced by searching all over the place, which turned up some 70, to which we added 22 more variants by tweaking the wording of existing CQs such that they would still be potentially answerable by an ontology. They were tested on the templates, which resulted in a lower than ideal percentage of coverage, and so new templates were created from them and evaluated yet again. The key results:

  • An increase from 88% for CLaRO v1 to 94.1% for CLaRO v2 coverage.
  • The new CLaRO v2 has 147 main templates and another 59 variants to cater for minor differences (e.g., singular/plural, redundant words), up from 93 and 41, respectively, in CLaRO v1.
  • Increasing the number of domains that the CQs were drawn from had a larger effect on the CQ coverage than cleaning the source data.
Screenshot of the CLaRO CQ editor tool.
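To make the coverage figures in the results above a bit more concrete: coverage is the share of CQs that fit at least one template. The following is only a minimal sketch of that idea, not the testing pipeline from the paper [2]. The slot markers `EC1`, `PC1` and the two example templates are assumptions made for illustration, as are the function names.

```python
import re

# Hypothetical slot markers (assumed for this sketch): EC<n> for an entity
# chunk, PC<n> for a predicate chunk. Each slot matches any non-empty text.
SLOT = re.compile(r"\b(?:EC|PC)\d+\b")


def template_to_regex(template):
    """Turn a template string with slots into a compiled regular expression."""
    literal_parts = SLOT.split(template)
    pattern = r"(.+?)".join(re.escape(part) for part in literal_parts)
    return re.compile(pattern, re.IGNORECASE)


def coverage(questions, templates):
    """Fraction of questions that fully match at least one template."""
    if not questions:
        return 0.0
    regexes = [template_to_regex(t) for t in templates]
    matched = sum(
        1 for q in questions if any(rx.fullmatch(q.strip()) for rx in regexes)
    )
    return matched / len(questions)


# Illustrative templates and questions, not taken from the CLaRO data.
example_templates = ["Which EC1 PC1 EC2?", "What is EC1?"]
example_questions = [
    "Which animals eat plants?",
    "What is an elephant?",
    "How many legs?",
]
example_coverage = coverage(example_questions, example_templates)
```

Here two of the three example questions fit a template, so `example_coverage` is 2/3. Adding templates derived from the uncovered questions and re-running the check is the loop described above, in miniature.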

All the data, including the new templates, are available on GitHub and the details are described in the paper [2]. The CLaRO tool that supports the authoring is in the process of being updated so as to incorporate the v2 templates (currently it is working with the v1 templates).

I will try to make it to Amsterdam, where CNL’21 will take place, but travel restrictions aren’t cooperating with that plan just yet; else I’ll participate virtually. Mary-Jane will present the paper and, despite her also having funding for the trip, it increasingly looks like a virtual presentation for her as well. On the bright side: at least there is a way to participate virtually.

References

[1] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol. 1075, 3-15.

[2] Antia, M.-J., Keet, C.M. Assessing and Enhancing Bottom-up CNL Design for Competency Questions for Ontologies. 7th International Workshop on Controlled Natural Language (CNL’21), 8-9 Sept. 2021, Amsterdam, the Netherlands. (in print)

[3] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, 29: 105098.

More results on a CNL for isiZulu

Although it has been a bit quiet here on the controlled natural languages for isiZulu front, lots of new stuff is in the pipeline, and the substantially extended version of our CNL’14 and RuleML’14 papers [1,2] is in print for publication in the Language Resources and Evaluation journal: Toward a knowledge-to-text controlled natural language of isiZulu [3] (online at LRE as well).

For those who haven’t read the other blog post or the papers on the topic, a brief introduction: for a plethora of reasons, one would want to generate natural language sentences based on some data, information, or knowledge stored on the computer. For instance, to automatically generate weather reports in isiZulu, or to browse or query ‘intelligently’ online annotated newspaper text, guided by an ontology behind the scenes in the inner workings of the interface. This means ‘converting’ structured input into structured natural language sentences, which amounts to a Controlled Natural Language (CNL) that is a fragment of the full natural language. For instance, class subsumption in DL (“⊑”) is verbalised in English as ‘is a/an’. In isiZulu, it is y- or ng-, depending on the first character of the name of the superclass. So, in its simplest form, indlovu ⊑ isilwane (that is, elephant ⊑ animal in an ‘English ontology’) would, with the appropriate algorithm, generate the sentence (be verbalised as) indlovu yisilwane (‘elephant is an animal’).
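As a rough illustration of that simplest form of subsumption, here is a minimal Python sketch. It is not the algorithm from the papers: the function names are made up for this example, and the exact mapping used (y- before a name starting with i, ng- before a name starting with a, o, or u) is an assumption that goes beyond what is stated above, which only says that the choice depends on the first character.

```python
def copulative_prefix(superclass_name):
    """Pick y- or ng- from the first character of the superclass name.

    Assumption for this sketch: 'i' takes y-, and 'a', 'o', 'u' take ng-.
    """
    if not superclass_name:
        raise ValueError("superclass name must not be empty")
    first = superclass_name[0].lower()
    if first == "i":
        return "y"
    if first in ("a", "o", "u"):
        return "ng"
    raise ValueError(f"no copulative prefix known for {superclass_name!r}")


def verbalise_subsumption(subclass_name, superclass_name):
    """Verbalise 'subclass is a superclass' in its simplest form."""
    prefix = copulative_prefix(superclass_name)
    return f"{subclass_name} {prefix}{superclass_name}"


sentence = verbalise_subsumption("indlovu", "isilwane")
```

With the example from the text, `sentence` is `indlovu yisilwane`. Anything beyond this simplest case, such as quantification, needs the fuller algorithms described in the papers.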

In the CNL14 and RuleML14 papers, we looked into what could be the verbalisation patterns for subsumption, disjointness, conjunction, and simple existential quantification, we evaluated which ones were preferred, and we designed algorithms for them, as none of them could be done with a template. The paper in the LRE journal extends those works with, mainly: a treatment of verbs (OWL object properties) and their conjugation, updated/extended algorithms to deal with that, design considerations for those algorithms, walk-throughs of the algorithms, and an exploratory evaluation to assess the correctness of the algorithm (is the sentence generated [un]grammatical and [un]ambiguous?). There’s also a longer discussion section and more related works.

Conjugation of the verb in isiZulu is not as trivial as in English, where, for verbalizing knowledge represented in ontologies, one simply uses the 3rd person singular (e.g., ‘eats’) or plural (‘eat’) anywhere it appears in an axiom. In isiZulu, it is conjugated based on the noun class of the noun to which it applies. There are 17 noun classes. For instance, umuntu ‘human’ is in noun class 1, and indlovu in noun class 9. Then, when a human eats something, it is umuntu udla, whereas with the elephant, it is indlovu idla. Negating it is not simply a matter of putting a ‘not’ or ‘does not’ in front of it, as is the case in English (‘does not eat’): it has its own conjugation (called the negative subject concord), again for each noun class, and the final vowel is modified as well. The human not eating something then becomes umuntu akadli and, for the elephant, indlovu ayidli. This is now precisely captured in the verbalization patterns and algorithms.
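The two examples above can be put into a small lookup-based sketch in Python. It only covers noun classes 1 and 9, with the concords read off the examples in the text (u-/aka- and i-/ayi-), and it assumes a verb stem ending in -a whose final vowel becomes -i under negation. The table and function names are made up for illustration; this is not the implementation from the article.

```python
# Concords for the two noun classes used in the examples only; the
# remaining noun classes are deliberately left out of this sketch.
SUBJECT_CONCORD = {1: "u", 9: "i"}
NEGATIVE_SUBJECT_CONCORD = {1: "aka", 9: "ayi"}


def conjugate(noun, noun_class, verb_stem, negated=False):
    """Attach the (negative) subject concord of the noun's class to the verb."""
    if noun_class not in SUBJECT_CONCORD:
        raise KeyError(f"noun class {noun_class} is not covered by this sketch")
    if not negated:
        return f"{noun} {SUBJECT_CONCORD[noun_class]}{verb_stem}"
    if not verb_stem.endswith("a"):
        raise ValueError("this sketch assumes a verb stem ending in -a")
    # Negation: negative subject concord plus final vowel -a changed to -i.
    negative_verb = verb_stem[:-1] + "i"
    return f"{noun} {NEGATIVE_SUBJECT_CONCORD[noun_class]}{negative_verb}"


human_eats = conjugate("umuntu", 1, "dla")
elephant_eats = conjugate("indlovu", 9, "dla")
human_does_not_eat = conjugate("umuntu", 1, "dla", negated=True)
elephant_does_not_eat = conjugate("indlovu", 9, "dla", negated=True)
```

These four calls reproduce umuntu udla, indlovu idla, umuntu akadli, and indlovu ayidli. Passing the noun class in explicitly sidesteps the question of how to determine it from the noun, which is one of the things a proper algorithm has to deal with.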

It is a bit tedious and not an easy ride compared to a template-based approach, but surely doable to put in an algorithm. Meanwhile, I did implement the algorithms. I’ll readily admit it’s a scruffy Python file and you’ll have to type the function in the interpreter rather than having it already linked to an ontology, but it works, and that’s what counts. (see that flag put in the sand? 😉 ) Here’s a screenshot with a few examples, just to show that it does what it should do.

Screenshot showing the working functions for verbalising subsumption, disjointness, universal quantification, existential quantification and its negation, and conjunction.


The code and other files are available from the GeNi project page. The description of the implementation, and the refinements we made along the way in doing so (e.g., filling in that ‘pluralise it’ of the algorithm), is not part of the LRE article, for we were already pushing it beyond the page limit, so I’ll describe that in a later post.

 

References

[1] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), Davis, B, Kuhn, T, Kaljurand, K. (Eds.). Springer LNAI vol. 8625, 78-89. 20-22 August 2014, Galway, Ireland.

[2] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS vol. 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[3] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016: in print. DOI: 10.1007/s10579-016-9340-0