# Progress on generating educational questions from ontologies

With increasing student numbers, but no matching increase in funding for schools and universities, and the desire to automate certain tasks anyhow, there have been multiple efforts to generate and mark educational exercises automatically. There are a number of efforts for the relatively easy tasks, such as for learning a language, which range from entry-level simple vocabulary exercises to advanced ones such as automatically marking essays. I’ve dabbled in that area as well, mainly with 3rd-year capstone projects and 4th-year honours projects [1]. Then there’s one notch up with fact recall and concept meaning recall questions, and further steps up, such as generating multiple-choice questions (MCQs) with not just obviously wrong distractors but good distractors to make the question harder. There’s quite a bit of work done on generating those MCQs in theory and in tooling, notably [2,3,4,5]. As a recent review [6] also notes, however, there are still quite a few gaps. Among others, these concern the generalisability of theory and systems (can you plug in any structured data or knowledge source to question templates?) and the type of questions. Most of the research on ‘not-so-hard to generate and mark’ questions has been done for MCQs, but there are multiple other types of questions that should also be doable to generate automatically, such as true/false, yes/no, and enumerations. For instance, with an axiom such as $impala \sqsubseteq \exists livesOn.land$ in an ontology or knowledge graph, a suitable question generation system may then generate “Does an impala live on land?” or “True or false: An impala lives on land.”, among other options.
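To make the impala example a bit more concrete, here is a minimal sketch in Python of the template idea. It is not the system from the paper: the `ExistentialAxiom` class, the function names, and the hand-made split of `livesOn` into the verb ‘live’ and the preposition ‘on’ are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExistentialAxiom:
    """Toy stand-in for a type-level axiom of the form  sub ⊑ ∃ prop.filler."""
    sub: str     # subclass, e.g. "impala"
    verb: str    # verb stem of the object property, e.g. "live" (from livesOn)
    prep: str    # preposition of the object property, e.g. "on"
    filler: str  # class in the restriction, e.g. "land"


def article(noun: str) -> str:
    """Naive indefinite article choice; real text needs a pronunciation-aware rule."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def third_person(verb: str) -> str:
    """Naive 3rd person singular; irregular verbs are not handled."""
    return verb + "s"


def yes_no_question(ax: ExistentialAxiom) -> str:
    return f"Does {article(ax.sub)} {ax.sub} {ax.verb} {ax.prep} {ax.filler}?"


def true_false_question(ax: ExistentialAxiom) -> str:
    statement = (f"{article(ax.sub).capitalize()} {ax.sub} "
                 f"{third_person(ax.verb)} {ax.prep} {ax.filler}.")
    return f"True or false: {statement}"


if __name__ == "__main__":
    axiom = ExistentialAxiom("impala", "live", "on", "land")
    # The axiom is asserted in the ontology, so the expected answers are "yes" and "true".
    print(yes_no_question(axiom))
    print(true_false_question(axiom))
```

Since the axiom is asserted in the ontology, the answer can be generated along with the question, which is what makes automated marking possible.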

We set out to make a start with tackling those sorts of questions, for the type-level information from an ontology (cf. facts in the ABox or knowledge graph). The only work done there, when we started with it, was for the slick and fancy Inquire Biology [5], but its technology was not available for inspection and use, so we had to start from scratch. In particular, we wanted to find a way to plug any ontology into a system and generate those other, non-MCQ types of educational questions (10 in total), where the questions generated are at least grammatically good and for which the answers can also be generated automatically, so that we get to automated marking as well.

Initial explorations started in 2019 with an honours project to develop some basics and a baseline, which was then expanded upon. Meanwhile, we have designed, developed, and evaluated some more, which was written up in the paper “Generating Answerable Questions from Ontologies for Educational Exercises” [7] that has been accepted for publication and presentation at the 15th international conference on metadata and semantics research (MTSR’21) that will be held online next week.

In short:

• Different types of questions and the answer they have to provide put different prerequisites on the content of the ontology with certain types of axioms. We specified those for 10 types of educational questions.
• Three strategies of question generation were devised: a ‘simple’ one that takes the vocabulary and axioms and plugs them into a template, one guided by some more semantics in the ontology (a foundational ontology), and one that didn’t really care about either but rather took a natural language approach. Variants were added to cater for differences in naming and other variations, amounting to 75 question templates in total.
• The human evaluation with questions generated from three ontologies showed that while the semantics-based one was slightly better than the baseline, the NLP-based one gave the best results on syntactic and semantic correctness of the sentences (according to the human evaluators).
• It was tested with several ontologies in different domains, and the generalisability looks promising.

To be honest to those getting their hopes up: there are some issues that will keep it from ever reaching ‘100% fabulous!’ if one still wants to design a system that should be able to take any ontology as input. A main culprit is the naming of elements in the ontology, which varies widely across ontologies. There are several guidelines for how to name entities, such as using camel case or underscores, and those things can easily be coded into an algorithm, indeed, but developers don’t stick to them consistently or there’s an ontology import that uses another naming convention, so that there likely will be a glitch in the generated sentences here or there. Or they name things within the context of the hierarchy where they put the class, but in the question it is out of that context and then looks weird or is even meaningless. I moaned about this before; e.g., ‘American’ as the name of the class that should have been named ‘American Pizza’ in the Pizza ontology. Or the word used for the name of the class can have different POS tags such that it makes the generated sentence hard to read; e.g., ‘stuff’ as a noun or a verb.
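As an illustration of the ‘easily coded’ part, here is a small Python sketch that splits element names written in camel case or with underscores into words. The function name `split_name` and the regular expression are my own choices for this post, not code from the repo.

```python
import re


def split_name(name: str) -> list[str]:
    """Split an ontology element name into lower-case words.

    Handles underscores, hyphens, spaces, and camel case. Acronyms are kept
    together but lower-cased, which loses information; that is acceptable
    for a sketch only.
    """
    words: list[str] = []
    # Underscores, hyphens, and whitespace are explicit separators.
    for part in re.split(r"[_\-\s]+", name):
        # Within a part: an upper-case run (acronym), a capitalised word,
        # a lower-case word, or a run of digits.
        words.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", part)
        )
    return [w.lower() for w in words]


if __name__ == "__main__":
    for example in ["livesOn", "American_Pizza", "OWLClass", "has-part"]:
        print(example, "->", split_name(example))
```

Note what this does not solve: a class named just ‘American’ still comes out as ‘american’, because the missing context is not in the name, and no tokeniser can tell whether ‘stuff’ is meant as a noun or a verb.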

Be this as it may, overall, promising results were obtained and are being extended (more to follow). Some details can be found in the (CRC of the) paper and the algorithms and data are available from the GitHub repo. The first author of the paper, Toky Raboanary, recently made a short presentation video about the paper for the yearly Open Evening/Showcase, which was held virtually, and that page is still available online.

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural language (CNL’18). Davis, B., Keet, C.M., Wyner, A. (Eds.). IOS Press, FAIA vol. 304, 31-40. Co. Kildare, Ireland, 27-28 August 2018.

[2] Alsubait, T., Parsia, B., Sattler, U. Ontology-based multiple choice question generation. KI – Kuenstliche Intelligenz, 2016, 30(2), 183-188.

[3] Rodriguez Rocha, O., Faron Zucker, C. Automatic generation of quizzes from dbpedia according to educational standards. In: The Third Educational Knowledge Management Workshop. pp. 1035-1041 (2018), Lyon, France. April 23 – 27, 2018.

[4] Vega-Gorgojo, G. Clover Quiz: A trivia game powered by DBpedia. Semantic Web Journal, 2019, 10(4), 779-793.

[5] Chaudhri, V., Cheng, B., Overholtzer, A., Roschelle, J., Spaulding, A., Clark, P., Greaves, M., Gunning, D. Inquire biology: A textbook that answers questions. AI Magazine, 2013, 34(3), 55-72.

[6] Kurdi, G., Leo, J., Parsia, B., Sattler, U., Al-Emari, S. A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Edu, 2020, 30(1), 121-204.

[7] Raboanary, T., Wang, S., Keet, C.M. Generating Answerable Questions from Ontologies for Educational Exercises. 15th Metadata and Semantics Research Conference (MTSR’21). 29 Nov – 3 Dec, Madrid, Spain / online. Springer CCIS (in print).

# Bias in ontologies?

Bias in models in the area of Machine Learning and Deep Learning is well known. Such biases feature in the news regularly with catchy headlines and there are longer, more in-depth, reports as well, such as Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs [1], one attempt toward a framework but looking at FOAF only [2], which is stretching it a little for what passes as an ontology, and then stretching it even further, there’s an old one of mine on bias in relation to conceptual data models for databases [3].

Do we simply not have bias in ontologies? That sounds a bit optimistic, since it’s pervasive elsewhere, and it is at least worth examining whether there is such a notion as bias in ontologies and, if so, what the sources of that may be. And, if one wants to dig deeper, this being Ontology: what is bias anyhow? The popular media is much more liberal in the use of the term ‘bias’ than the scientific literature and I’m not going to answer that last question here now. What I did do is try to identify sources of bias in the context of ontologies, and I took a relevant selection of Dimara et al’s list of 154 biases [4] (just like only a subset is relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.

The outcome of that exploratory analysis [5], in short, is: yes, there is such a notion as bias in ontologies as well. First, I’ve identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined three COVID-19 ontologies (CIDO, CODO, COVoc) for possible bias, and they indeed exhibited different subsets.

The sources can be philosophical, by purpose (commonly known as encoding bias), and ‘subject domain’ sources, such as scientific theory, granularity, linguistic, social-cultural, political or religious, and economic motivations; they may be explicit choices or implicit.

An example of an economic motivation is to (try to) categorise some disorder as a type of disease: the latter gets more resources for medicines, research, and treatments, and is more costly for insurers, who’d rather keep it out of the terminology altogether. Or modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see paper for details). Terrorism ontologies can provide ample material for political views to creep in.

Besides the hand-picked examples, I did assess the three COVID-19 ontologies in more detail. Not because I wanted to pick on them (I actually think it’s laudable they tried in trying times), but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked at both the sources, which can be argued to be present, and the cognitive biases, identifying some from a selection of Dimara et al’s list, such as the “mere exposure/familiarity” bias and “false consensus” bias (see table below). How they are present is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), which is part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.

Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much, or, at least, any inconsistencies that it would have caused should have been detected and discussed during the ontology development stage, rather.

Will it matter for, say, annotating data or literature etc? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not female (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” it is intended for, it’s going to be even harder to retrieve COVID-19 research papers in relation to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access (OBDA), bias may have negative effects. Take as an example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and suppose this ontology were used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA approved drugs database:

The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect and the error is due to the biased and erroneous class subsumption declared in the ontology, not the data source itself.
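The mechanics of that error can be mimicked with a toy example in Python. This is only a sketch under my own assumptions: the two dictionaries are stand-ins for the ontology and the mapped data (they are not CIDO or the actual databases), and `instances_of` is a bare-bones substitute for what an OBDA-enabled reasoner does with class subsumption.

```python
# Toy TBox: child class -> parent class, mirroring the subsumption described above.
SUBCLASS_OF = {
    "COVID-19 experimental drug in a clinical trial": "COVID-19 drug",
}

# Toy ABox standing in for the mapped data source: individual -> asserted class.
ASSERTED_TYPE = {
    "hydroxychloroquine": "COVID-19 experimental drug in a clinical trial",
}


def superclasses(cls: str) -> set[str]:
    """Return the class itself plus all its ancestors in the toy hierarchy."""
    result = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.add(cls)
    return result


def instances_of(cls: str) -> list[str]:
    """Individuals whose asserted class is the given class or a subclass of it."""
    return sorted(
        individual
        for individual, asserted in ASSERTED_TYPE.items()
        if cls in superclasses(asserted)
    )


if __name__ == "__main__":
    # The data only says the drug is in a trial; the subsumption in the
    # ontology is what turns it into an answer to "which COVID-19 drugs?".
    print(instances_of("COVID-19 drug"))
```

Remove the entry from `SUBCLASS_OF` and the query comes back empty, which shows that the unwanted answer originates in the ontology rather than in the data.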

Some peculiarities of content in an ontology may not be due to an underlying bias, but merely a case of ‘ran out of time’ rather than an act of omission due to a bias, for instance. Or it may not be an honest mistake due to bias but a mistake because of some other reason, such as due to having clicked erroneously on a wrong button in the tool’s interface, say, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One also can have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or to try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. Or: there seems to be enough to do yet.

Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but it’s increasingly looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.

References

[1] K. Janowicz, B. Yan, B. Regalia, R. Zhu, G. Mai, Debiasing knowledge graphs: Why female presidents are not like female popes, in: M. van Erp, M. Atre, V. Lopez, K. Srinivas, C. Fortuna (Eds.), Proceeding of ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, volume 2180 of CEUR-WS, 2017.

[2] D. L. Gomes, T. H. Bragato Barros, The bias in ontologies: An analysis of the FOAF ontology, in: M. Lykke, T. Svarre, M. Skov, D. Martínez-Ávila (Eds.), Proceedings of the Sixteenth International ISKO Conference, Ergon-Verlag, 2020, pp. 236 – 244.

[3] Keet, C.M. Dirty wars, databases, and indices. Peace & Conflict Review, 2009, 4(1):75-78.

[4] E. Dimara, S. Franconeri, C. Plaisant, A. Bezerianos, P. Dragicevic, A task-based taxonomy of cognitive biases for information visualization, IEEE Transactions on Visualization and Computer Graphics 26 (2020) 1413–1432.

[5] Keet, C.M. An exploration into cognitive bias in ontologies. Cognition And OntologieS (CAOS’21), part of JOWO’21, part of BoSK’21. 13-16 September 2021, Bolzano, Italy. (in print)

# CLaRO v2.0: A larger CNL for competency questions for ontologies

The avid blog reader with a good memory might remember we had developed a controlled natural language (CNL) in 2019 that we called CLaRO, a Competency question Language for specifying Requirements for an Ontology, model, or specification [1], for specifying requirements on the contents of the TBox (type-level) knowledge specifically. The paper won the best student paper award at the MTSR’19 conference.  Then COVID-19 came along.

Notwithstanding, we did take next steps and obtained some advances in the meantime, which resulted in a substantially extended CNL, called CLaRO v2 [2]. The paper describing how it came about has been accepted recently at the 7th Controlled Natural Language Workshop (CNL2020/21), which will be held on 8-9 September in Amsterdam, The Netherlands, in hybrid mode.

So, what is it about, being “new and improved!” compared to the first version? The first version was created in a bottom-up fashion based on a dataset of 234 competency questions [3] in a few domains only. It turned out alright with decent performance on coverage for unseen questions (88% overall) and very significantly outperforming the others, but there were some nagging doubts about the feasibility of bottom-up approaches to template development, which are essentially at the heart of every bottom-up approach: questions about representativeness and quality of the source data. We used more questions as basis to work from than others and had better coverage, but would coverage improve further then still with even more questions? Would it matter for coverage if the CQs were to come from more diverse subject domains? Also, upon manual inspection of the original CQs, it could be seen that some CQs from the dataset were ill-formed, which propagated through to the final set of templates of CLaRO. Would ‘cleaning’ the source data to presumably better quality templates improve coverage?

One of the PhD students I supervise, Mary-Jane Antia, set out to find answers to these questions. CQs were cleaned and vetted by a linguist, and the templates were recreated, compared, and evaluated, this time automatically in a new testing pipeline. New CQs for ontologies were sourced by searching all over the place and finding some 70, to which we added 22 more variants by tweaking the wording of existing CQs such that they still would be potentially answerable by an ontology. They were tested on the templates, which resulted in a lower than ideal percentage of coverage, and so new templates were created from them and yet again evaluated. The key results:

• An increase from 88% for CLaRO v1 to 94.1% for CLaRO v2 coverage.
• The new CLaRO v2 has 147 main templates and another 59 variants to cater for minor differences (e.g., singular/plural, redundant words), up from 93 and 41 in CLaRO.
• Increasing the number of domains that the CQs were drawn from had a larger effect on the CQ coverage than cleaning the source data.
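For those wondering what ‘coverage’ amounts to computationally, here is a rough Python sketch. The two templates, the `<EC>`/`<PC>` slot notation, and the three questions are made up for this post and are not taken from CLaRO or its test sets; the regex-based matching is also a simplification of my own and not the testing pipeline of the paper.

```python
import re

# Hypothetical templates; <EC> marks an entity slot and <PC> a predicate slot.
TEMPLATES = [
    "What is <EC>?",
    "Which <EC> <PC> <EC>?",
]

SLOT = re.compile(r"<(?:EC|PC)>")


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template into a regex where each slot matches one or more words."""
    literal_parts = SLOT.split(template)
    body = r"(\w[\w\s-]*?)".join(re.escape(part) for part in literal_parts)
    return re.compile(rf"^{body}$", re.IGNORECASE)


def coverage(questions: list[str], templates: list[str]) -> float:
    """Fraction of questions that match at least one template."""
    regexes = [template_to_regex(t) for t in templates]
    matched = sum(
        1 for q in questions if any(r.match(q.strip()) for r in regexes)
    )
    return matched / len(questions)


if __name__ == "__main__":
    cqs = [
        "What is an impala?",
        "Which animals eat plants?",
        "How many legs does a lion have?",  # no matching template in this toy set
    ]
    print(f"coverage: {coverage(cqs, TEMPLATES):.1%}")
```

A question that no template matches lowers the coverage, and the remedy is the one described above: derive new templates from the unmatched questions and evaluate again.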

All the data, including the new templates, are available on GitHub and the details are described in the paper [2]. The CLaRO tool that supports the authoring is in the process of being updated so as to incorporate the v2 templates (currently it is working with the v1 templates).

I will try to make it to Amsterdam where CNL’21 will take place, but travel restrictions aren’t cooperating with that plan just yet; else I’ll participate virtually. Mary-Jane will present the paper, and also for her, despite also having funding for the trip, it increasingly looks like a virtual presentation. On the bright side: at least there is a way to participate virtually.

References

[1] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol. 1075, 3-15.

[2]  Antia, M.-J., Keet, C.M. Assessing and Enhancing Bottom-up CNL Design for Competency Questions for Ontologies. 7th International Workshop on Controlled Natural language (CNL’21), 8-9 Sept. 2021, Amsterdam, the Netherlands. (in print)

[3] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, 29: 105098.

# Automatically simplifying an ontology with NOMSA

Ever wanted only to get the gist of the ontology rather than wading manually through thousands of axioms, or to extract only a section of an ontology for reuse? Then the NOMSA tool may provide the solution to your problem.

There are quite a number of ways to create modules for a range of purposes [1]. We zoomed in on the notion of abstraction: how to remove all sorts of details and create a new ontology module of that. It’s a long-standing topic in computer science that returns every couple of years with another few tries. My first attempts date back to 2005 [2], which references modules & abstractions for conceptual models and logical theories to works published in the mid-1990s and, stretching the scope to granularity, to 1985, even. Those efforts, however, tended to halt at the theory stage or worked for one very specific scenario (e.g., clustering in ER diagrams). In this case, however, my former PhD student and now Senior Researcher at the CSIR, Zubeida Khan, went further and also devised the algorithms for five types of abstraction, implemented them for OWL ontologies, and evaluated them on various metrics.

The tool itself, NOMSA, was presented very briefly at the EKAW 2018 Posters & Demos session [3] and has supplementary material, such as the definitions and algorithms, a very short screencast and the source code. Five different ways of abstraction to generate ontology modules were implemented: i) removing participation constraints between classes (e.g., the ‘each X R at least one Y’ type of axioms), ii) removing vocabulary (e.g., remove all object properties to yield a bare taxonomy of classes), iii) keeping only a small number of levels in the hierarchy, iv) weightings based on how much some element is used (removing less-connected elements), and v) removing specific language profile features (e.g., qualified cardinality, object property characteristics).
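To give a feel for what such abstractions do, here is a toy sketch in Python of types i), ii), and iii). It is emphatically not NOMSA’s code: NOMSA works on OWL ontologies, whereas the `ToyOntology` structure, the function names, and the mini animal hierarchy below are assumptions invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOntology:
    subclass_of: dict[str, str]  # child -> parent; root classes are not keys
    object_properties: set[str] = field(default_factory=set)
    # (X, R, Y) triples standing in for 'each X R at least one Y' axioms
    existentials: set[tuple[str, str, str]] = field(default_factory=set)


def depth(onto: ToyOntology, cls: str) -> int:
    """Number of subclass steps from the class up to its root."""
    d = 0
    while cls in onto.subclass_of:
        cls = onto.subclass_of[cls]
        d += 1
    return d


def remove_participation_constraints(onto: ToyOntology) -> ToyOntology:
    """Type i): drop the 'at least one' axioms, keep the vocabulary."""
    return ToyOntology(dict(onto.subclass_of), set(onto.object_properties))


def remove_vocabulary(onto: ToyOntology) -> ToyOntology:
    """Type ii): drop all object properties, leaving a bare taxonomy."""
    return ToyOntology(dict(onto.subclass_of))


def limit_depth(onto: ToyOntology, max_depth: int) -> ToyOntology:
    """Type iii): keep only classes at most max_depth levels below a root."""
    def keep(cls: str) -> bool:
        return depth(onto, cls) <= max_depth

    return ToyOntology(
        {c: p for c, p in onto.subclass_of.items() if keep(c)},
        set(onto.object_properties),
        {(x, r, y) for (x, r, y) in onto.existentials if keep(x) and keep(y)},
    )


if __name__ == "__main__":
    onto = ToyOntology(
        {"carnivore": "animal", "herbivore": "animal",
         "lion": "carnivore", "impala": "herbivore"},
        {"eats"},
        {("lion", "eats", "impala"), ("carnivore", "eats", "animal")},
    )
    print(remove_vocabulary(onto))
    print(limit_depth(onto, 1))
```

Each function returns a new module and leaves the input untouched, so the abstractions can be applied one after another or compared side by side.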

In the meantime, we have added a categorisation of different ways of abstracting conceptual models and ontologies, a larger use case illustrating those five types of abstractions that were chosen for specification and implementation, and an evaluation to see how well the abstraction algorithms work on a set of published ontologies. It was all written up and polished in 2018. Then it took a while in the publication pipeline mixed with pandemic delays, but eventually it has emerged as a book chapter entitled Structuring abstraction to achieve ontology modularisation [4] in the book “Advanced Concepts, methods, and Applications in Semantic Computing” that was edited by Olawande Daramola and Thomas Moser, in January 2021.

Since I bought new video editing software for the ‘physically distanced learning’ that we’re in now at UCT, I decided to play a bit with the software’s features and record a more comprehensive screencast demo video. In the nearly 13 minutes, I illustrate NOMSA with four real ontologies, being the AWO tutorial ontology, BioTop top-domain ontology, BFO top-level ontology, and the Stuff core ontology. Here’s a screengrab from somewhere in the middle of the presentation, where I just automatically removed all 76 object properties from BioTop, with just one click of a button:

The embedded video (below) might keep it perhaps still readable with really good eyesight; else you can view it here in a separate tab.

The source code is available from Zubeida’s website (and I have a local copy as well). If you have any questions or suggestions, please feel free to contact either of us. Under the fair use clause, we also can share the book chapter that contains the details.

References

[1] Khan, Z.C., Keet, C.M. An empirically-based framework for ontology modularization. Applied Ontology, 2015, 10(3-4):171-195.

[2] Keet, C.M. Using abstractions to facilitate management of large ORM models and ontologies. International Workshop on Object-Role Modeling (ORM’05). Cyprus, 3-4 November 2005. In: OTM Workshops 2005. Halpin, T., Meersman, R. (eds.), LNCS 3762. Berlin: Springer-Verlag, 2005. pp603-612.

[3] Khan, Z.C., Keet, C.M. NOMSA: Automated modularisation for abstraction modules. Proceedings of the EKAW 2018 Posters and Demonstrations Session (EKAW’18). CEUR-WS vol. 2262, pp13-16. 12-16 Nov. 2018, Nancy, France.

[4] Khan, Z.C., Keet, C.M. Structuring abstraction to achieve ontology modularisation. Advanced Concepts, methods, and Applications in Semantic Computing. Daramola O, Moser T (Eds.). IGI Global. 2021, 296p. DOI: 10.4018/978-1-7998-6697-8.ch004

# The ontological commitments embedded in a representation language

Just like programming language preferences generate heated debates, this happens every now and then with languages to represent ontologies as well. Passionate dislikes for description logics or limitations of OWL are not unheard of, in favour of, say, Common Logic for more expressiveness and a different notation style, or of OBO because of its graph-based fundamentals, or that abuse of UML Class Diagram syntax  won’t do as approximation of an OWL file. But what is really going on here? Are they practically all just the same anyway and modellers merely stick with, and defend, what they know? If you could design your pet language, what would it look like?

The short answer is: they are not all the same and interchangeable. There are actually ontological commitments baked into the language, even though in most cases this is not explicitly stated as such. The ‘things’ one has in the language indicate what the fundamental building blocks are in the world (also called “epistemological primitives” [1]) and therewith assume some philosophical stance. For instance, a crisp vs vague world (say, plain OWL or a fuzzy variant thereof) or whether parthood is such a special relation that it deserves its own primitive next to class subsumption (like UML’s aggregation). Or maybe you want one type of class for things indicated with count nouns and another type of element for stuffs (substances generally denoted with mass nouns). This then raises the question as to what the sort of commitments are that are embedded in, or can go into, a language specification and that have an underlying philosophical point of view. This, in turn, raises the question about which philosophical stances actually can have a knock-on effect on the specification or selection of an ontology language.

My collaborator, Pablo Fillottrani, and I tried to answer these questions in the paper entitled An Analysis of Commitments in Ontology Language Design [2] that was published late last year as part of the proceedings of the 11th Conference on Formal Ontology in Information Systems 2020 that was supposed to have been held in September 2020 in Bolzano, Italy. In the paper, we identified and analysed ontological commitments that are, or could have been, embedded in logics, and we showed how they have been taken for well-known languages for representing ontologies and similar artefacts, such as OBO, SKOS, OWL 2 DL, DLRifd, and FOL. We organised them in four main categories: what the very fundamental furniture is (e.g., including roles or not, time), acknowledging refinements thereof (e.g., types of relations, types of classes), the logic’s interaction with natural language, and crisp vs various vagueness options. They are discussed over about 1/3 of the paper.

Obviously, engineering considerations can interfere in the design of the logic as well. They concern issues such as what the syntax should look like and whether scalability is an issue, but this is not the focus of the paper.

We did spend some time contextualising the language specification in an overall systematic engineering process of language design, which is summarised in the figure below (the paper focuses on the highlighted step).

While such a process can be used for the design of a new logic, it also can be used for post hoc reconstructions of past design processes of extant logics and conceptual data modelling languages, and for choosing which one you want to use. At present, the documentation of the vast majority of published languages does not describe much of the ‘softer’ design rationales, though.

We played with the design process to illustrate how it can work out, availing also of our requirements catalogue for ontology languages and we analysed several popular ontology languages on their commitments, which can be summed up as in the table shown below, also taken from the paper:

In a roundabout way, it also suggests some explanations as to why some of those transformation algorithms aren’t always working well; e.g., any UML-to-OWL or OBO-to-OWL transformation algorithm is trying to shoe-horn one ontological commitment into another, and that can only be approximated, at best. Things have to be dropped (e.g., roles, due to standard view vs positionalism) or cannot be enforced (e.g., labels, due to natural language layer vs embedding of it in the logic), and that’ll cause some hiccups here and there. Now you know why, and why that won’t ever work well.

Hopefully, all this will feed into a way to help with choosing a suitable language for the ontology one may want to develop, or assist with better understanding the language that you may be using, or perhaps spark new ideas for designing a new ontology language.

References

[1] Brachman R, Schmolze J. An overview of the KL-ONE Knowledge Representation System. Cognitive Science. 1985, 9:171–216.

[2] Fillottrani, P.R., Keet, C.M. An Analysis of Commitments in Ontology Language Design. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 46-60.

# An architecture for Knowledge-driven Information and Data access: KnowID

Advanced so-called ‘intelligent’ information systems may use an ontology or runtime-suitable conceptual data modelling techniques in the back end combined with efficient data management. Such a set-up aims to provide a way to better support informed decision-making and data integration, among others. A major challenge to create such systems, is to figure out which components to design and put together to realise a ‘knowledge to data’ pipeline, since each component and process has trade-offs; see e.g., the very recent overview of sub-topics and challenges [1]. A (very) high level categorization of the four principal approaches is shown in the following figure: put the knowledge and data together in the logical theory the AI way (left) or the database way (right), or bridge it by means of mappings or by means of transformations (centre two):

Among those variants, one can dig into considerations like which logic to design or choose in the AI-based “knowledge with (little) data” (e.g.: which OWL species? Common Logic? Other?), which type of database (relational, object-relational, or rather an RDF store), which query language to use or design, which reasoning services to support, and how expressive it all has to be and optimised for what purpose. None is best in all deployment scenarios. The AI-only one with, say, OWL 2 DL, is not scalable; the database-only one either lacks interesting reasoning services or supports few types of constraints.

Among the two in the middle, the “knowledge mapping data” is best known under the term ‘ontology-based data access’ (OBDA) and the Ontop system in particular [2] with its recent extension into ‘virtual knowledge graphs’ and the various use cases [3]. The distinguishing characteristic of its architecture is the mapping layer to bridge the knowledge to the data. In the “Data transformation knowledge” approach, the idea is to link the knowledge to the data through a series of transformations. No such system is available yet. Considering the requirements for that, it turned out that a good few components are already available and that just one crucial piece of transformations was needed to convincingly put it all together.

We did just that and devised a new knowledge-to-data architecture. We dub this the KnowID architecture (pronounced as ‘know it’), abbreviated from Knowledge-driven Information and Data access. KnowID adds novel transformation rules between suitably formalised EER diagrams as application ontology and Borgida, Toman & Weddell’s Abstract Relational Model with SQLP ([4,5]) to complete the pipeline (together with some recently proposed other components). Overall, it then looks like this:

Its details are described in the article entitled “KnowID: an architecture for efficient Knowledge-driven Information and Data access” [6], which was recently published in the Data Intelligence journal. In a nutshell: the logic-based EER diagram (with deductions materialised) is transformed into an abstract relational model (ARM) that is transformed into a traditional relational model and then onward to a database schema, where the original ‘background knowledge’ of the ARM is used for data completion (i.e., materialising the deductions w.r.t. the data), and then the query posed in SQLP (SQL + path queries) is answered over that ‘extended’ database.
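The data completion step is perhaps the least familiar one, so here is a minimal Python sketch of the idea with the standard library’s `sqlite3`. It is an illustration under my own assumptions, not the KnowID implementation: the two tables, the single subsumption ‘student is a person’, and the names in the rows are toy inputs, and the final query is plain SQL rather than SQLP.

```python
import sqlite3

# Background knowledge as (subclass table, superclass table) pairs.
# Assumption for this sketch: both tables share the columns id and name.
SUBSUMPTIONS = [("student", "person")]


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with toy data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
        INSERT INTO person  VALUES (1, 'Ada');
        INSERT INTO student VALUES (2, 'Ben');
    """)
    return conn


def complete(conn: sqlite3.Connection, subsumptions) -> None:
    """Materialise the deductions: copy subclass rows into the superclass table.

    A single pass suffices for one level of subsumption; a chain of
    subclasses would need repetition until nothing changes. Table names
    come from the trusted list above, never from user input.
    """
    for sub, sup in subsumptions:
        conn.execute(
            f"INSERT OR IGNORE INTO {sup} (id, name) SELECT id, name FROM {sub}"
        )
    conn.commit()


def names(conn: sqlite3.Connection, table: str) -> list[str]:
    return [row[0] for row in conn.execute(f"SELECT name FROM {table} ORDER BY id")]


if __name__ == "__main__":
    db = build_db()
    print("before completion:", names(db, "person"))
    complete(db, SUBSUMPTIONS)
    print("after completion: ", names(db, "person"))
```

After completion, a query over `person` also returns the students without any reasoning at query time, which is the gist of answering queries over the ‘extended’ database under the closed world assumption.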

Besides the description of the architecture and the new transformation rules, the open access journal article also describes several examples and it features a more detailed comparison of the four approaches shown in figure 1 above. For KnowID, compared to other ontology-based data access approaches, its key distinctive architectural features are that runtime use can avail of full SQL augmented with path queries, the closed world assumption commonly used in information systems, and it avoids a computationally costly mapping layer.

We are working on the implementation of the architecture. The transformation rules and corresponding algorithms were implemented last year [7] and two computer science honours students are currently finalising their 4th-year project, therewith contributing to the materialization and query formulation steps of the architecture. The latest results are available from the KnowID webpage. If you were to worry that it will suffer from link rot: the version associated with the Data Intelligence paper has been archived as supplementary material of the paper at [8]. The plan is, however, to steadily continue with putting the pieces together to make a functional software system.

References

[1] Schneider, T., Šimkus, M. Ontologies and Data Management: A Brief Survey. Künstl Intell 34, 329–353 (2020).

[2] Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web Journal, 2017, 8(3), 471-487.

[3] G. Xiao, L. Ding, B. Cogrel, & D. Calvanese. Virtual knowledge graphs: An overview of systems and use cases. Data Intelligence, 2019, 1, 201-223.

[4] A. Borgida, D. Toman & G.E. Weddell. On referring expressions in information systems derived from conceptual modeling. In: Proceedings of ER’16, 2016, pp. 183–197

[5] W. Ma, C.M. Keet, W. Oldford, D. Toman & G. Weddell. The utility of the abstract relational model and attribute paths in SQL. In: C. Faron Zucker, C. Ghidini, A. Napoli & Y. Toussaint (eds.) Proceedings of the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’18)), 2018, pp. 195–211.

[6] P.R. Fillottrani & C.M. Keet. KnowID: An architecture for efficient knowledge-driven information and data access. Data Intelligence, 2020 2(4), 487–512.

[7] Fillottrani, P.R., Jamieson, S., Keet, C.M. Connecting knowledge to data through transformations in KnowID: system description. Künstliche Intelligenz, 2020, 34, 373-379.

[8] Pablo Rubén Fillottrani, C. Maria Keet. KnowID. V1. Science Data Bank. http://www.dx.doi.org/10.11922/sciencedb.j00104.00015. (2020-09-30)

# Toward a framework for resolving conflicts in ontologies (with COVID-19 examples)

Among the many tasks involved in developing an ontology are deciding what part of the subject domain to include, and how. This may involve selecting a foundational ontology, reuse of related domain ontologies, and more detailed decisions for ontology authoring for specific axioms and design patterns. A recent example of reuse is that of the Infectious Diseases Ontology for schistosomiasis knowledge [1], but even before reuse, one may have to assess differences among ontologies, as Haendel et al. did for disease ontologies [2]. Put differently, even before throwing alignment tools at them or selecting one with an import statement and hoping for the best, issues may arise. For instance, two relevant domain ontologies may have been aligned to different foundational ontologies, a partOf relation could be set to be transitive in one ontology but also be used in a qualified cardinality constraint in the other (so then one cannot use an OWL 2 DL reasoner anymore when the ontologies are combined), something like Infection may be represented as a class in one ontology but as a property infectedby in another, or the ontologies differ on the science, like whether Virus is an organism or an inanimate object.

What to do then?

Upfront, it helps to be cognizant of the different types of conflict that may arise, and understand what their causes are. Then one would want to be able to find those automatically. And, most importantly, get some assistance in how to resolve them; if possible, even preventing conflicts from happening in the first place. This is what Rolf Grütter, from the Swiss Federal Research Institute WSL, and I have been working on since he visited UCT last year. The first results have been accepted for the International Conference on Biomedical Ontologies (ICBO) 2020, and are described in a paper entitled “Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring” [3]. A sample scenario of the process is illustrated informally in the following figure.

Summary of a sample scenario of detecting and resolving conflicts, illustrated with an ontology reuse scenario where Onto2 will be imported into Onto1. (source: [3])

The paper first defines and illustrates the notions of meaning negotiation and conflict resolution and summarises their main causes, to then go into some detail of the various categories of conflicts and ways to resolve them. The detection and resolution are assisted by the notion of a conflict set, which is a data structure that stores the details for further processing.
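As a rough illustration of what such a data structure could look like, consider the following Python sketch. The field names and the two classes are assumptions made for this post, not the definition of the conflict set given in the paper [3]; the point is only that each detected conflict is recorded with enough detail to process it later.

```python
from dataclasses import dataclass, field

@dataclass
class Conflict:
    """One detected conflict (all field names are hypothetical)."""
    category: str         # e.g. "language profile violation"
    ontologies: tuple     # the ontologies involved
    axioms: list          # the axioms that cause the conflict, as strings
    resolution: str = ""  # stays empty until a resolution has been chosen

@dataclass
class ConflictSet:
    """Collects conflicts so they can be inspected and resolved later."""
    conflicts: list = field(default_factory=list)

    def add(self, conflict):
        self.conflicts.append(conflict)

    def unresolved(self):
        return [c for c in self.conflicts if not c.resolution]

cs = ConflictSet()
cs.add(Conflict("language profile violation", ("Onto1", "Onto2"),
                ["A SubClassOf r only B"]))
cs.add(Conflict("class vs. instance modelling", ("Onto1", "Onto2"),
                ["C EquivalentTo {a, b}"], resolution="convert individuals to classes"))
print(len(cs.unresolved()))
```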

It was tested with a use case of an epizootic disease outbreak in the Lemanic Arc in Switzerland in 2006, due to H5N1 (avian influenza): an administrative ontology had to be merged with one about the epidemiology for infected birds and surveillance zones. With that use case in place already well before the spread of SARS-CoV-2 that caused the current pandemic, it was a small step to add a few examples to the paper about COVID-19. This was made possible thanks to recently developed relevant ontologies that were made available, including for COVID-19 specifically. Let’s highlight the examples here, also so that I can write a bit more about it than the terse text in the paper, since there are no page limits for a blog post.

Example 1: OWL profile violations

Medical terminologies tend to veer toward being represented in an ontology language that is at most as expressive as OWL 2 EL: this permits scalability, compatibility with typical OBO Foundry ontologies, as well as fitting with the popular SNOMED CT. As one may expect, there have been efforts in ontology development with content relevant for the current pandemic; e.g., the Coronavirus Infectious Disease Ontology (CIDO) [4]. The CIDO is not in OWL 2 EL, however: it has a class expression with a universal quantifier (ObjectAllValuesFrom) on the right-hand side; specifically (in DL notation): ‘Yale New Haven Hospital SARS-CoV-2 assay’ $\sqsubseteq \forall$ ‘EUA-authorized use at’.’FDA EUA-authorized organization’ or, in the Protégé interface:

(codes: CIDO_0000020, CIDO_0000024, and CIDO_0000031, respectively). It also imported many ontologies and either used them to cause some profile violations or the violations came with them, such as by having used the union operator (‘or’) in the following axiom for therapeutic vaccine function (VO_0000562):

How did I find that? Most certainly NOT by manually browsing through the more than 70000 axioms of the CIDO (including imports) to find the needle in the haystack. Instead, I burned the proverbial haystack to easily get the needles. In this case, the burning was done with the OWL Classifier, which automatically computes which axioms violate any of the OWL species, and lists them accordingly. Here are two examples, illustrating an OWL 2 EL violation (that aforementioned universal quantification) and an OWL 2 QL violation (a property chain with entities from BFO and RO); you can do likewise for OWL 2 RL violations.

Following the scenario with the assumption that the CIDO would have to stay in the OWL 2 EL profile, then it is easy to find the conflicting axioms and act accordingly, i.e., remove them. (It also indicates something did not go well with importing the NDF-RT.owl into the cido-base.owl, but that as an aside for this example.)
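The find-and-remove step can be sketched in a few lines of Python. This is not how the OWL Classifier works internally; it is a toy check over axioms written as nested tuples (a made-up representation), using a deliberately partial list of constructors that OWL 2 EL does not allow. The first axiom mirrors the universal quantification mentioned above; the other two use placeholder names.

```python
# Partial list of class expression constructors that are not in OWL 2 EL.
NOT_IN_EL = {
    "ObjectAllValuesFrom", "ObjectUnionOf", "ObjectComplementOf",
    "ObjectMinCardinality", "ObjectMaxCardinality", "ObjectExactCardinality",
}

def constructors(expr):
    """Yield every constructor name used in a nested-tuple expression."""
    if isinstance(expr, tuple):
        yield expr[0]
        for arg in expr[1:]:
            yield from constructors(arg)

def el_violations(axioms):
    """Return (axiom, offending constructors) for each axiom outside OWL 2 EL."""
    result = []
    for axiom in axioms:
        used = set(constructors(axiom)) & NOT_IN_EL
        if used:
            result.append((axiom, sorted(used)))
    return result

axioms = [
    ("SubClassOf", "Yale New Haven Hospital SARS-CoV-2 assay",
     ("ObjectAllValuesFrom", "EUA-authorized use at",
      "FDA EUA-authorized organization")),
    ("SubClassOf", "A", ("ObjectSomeValuesFrom", "r", "B")),  # placeholder, fine in EL
    ("SubClassOf", "C", ("ObjectUnionOf", "D", "E")),         # placeholder, not in EL
]

violations = el_violations(axioms)
offending = [axiom for axiom, _ in violations]
kept = [axiom for axiom in axioms if axiom not in offending]
for axiom, used in violations:
    print(axiom[1], "->", used)
```

Only the existential restriction survives in `kept`; a real check would, of course, cover the full OWL 2 EL specification and operate on an actual OWL file.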

Example 2: Modelling issues: same idea, different elements

Let’s take the CIDO again and now also the COviD Ontology for cases and patient information (CODO), which have some overlapping and complementary information, so perhaps could be merged. A not unimportant thing is the test for SARS-CoV-2 and its outcome. CODO has a ‘laboratory test finding’ $\equiv$ {positive, pending, negative}, i.e., the possible outcomes of the test are individuals made into a class using the ObjectOneOf constructor. Consulting CIDO for the test outcomes, it has a class ‘COVID-19 diagnosis’ with three subclasses: Negative, Positive, and Presumptive positive. Aside from the inexact matches of the test status that won’t simplify data integration efforts, this is an example of class vs. instance modeling of what is ontologically the same thing. Resolving this in any merging attempt means that either

1. the CODO has to change and bump up the test results from individuals to classes, or
2. the CIDO has to change the subclasses to individuals in the ABox, or
3. take an ‘outside option’ and represent it in yet a different way where both the CODO and the CIDO have to modify the ontology (e.g., take a conceptual data modeling approach by making the test outcome an attribute with a few possible values).
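Spotting candidates for this kind of class vs. instance mismatch can be approximated with simple label matching. The Python sketch below takes the individuals of the CODO enumeration and the CIDO subclass names as listed above; the function name and the normalisation (lowercasing only) are assumptions made for illustration, and anything beyond exact label matches would need a proper alignment tool.

```python
def normalise(label):
    """Very crude label normalisation: trim and lowercase."""
    return label.strip().lower()

def class_vs_instance_candidates(individuals, classes):
    """Pair up individuals and classes with the same normalised label."""
    ind = {normalise(i): i for i in individuals}
    cls = {normalise(c): c for c in classes}
    matches = [(ind[k], cls[k]) for k in sorted(ind.keys() & cls.keys())]
    only_individuals = sorted(ind[k] for k in ind.keys() - cls.keys())
    only_classes = sorted(cls[k] for k in cls.keys() - ind.keys())
    return matches, only_individuals, only_classes

# CODO: members of the enumerated class; CIDO: subclasses of the diagnosis class.
codo_individuals = {"positive", "pending", "negative"}
cido_subclasses = {"Negative", "Positive", "Presumptive positive"}

matches, only_codo, only_cido = class_vs_instance_candidates(
    codo_individuals, cido_subclasses)
print(matches)
print(only_codo, only_cido)
```

The two exact matches are the class vs. instance conflicts to resolve with one of the three options, whereas ‘pending’ and ‘Presumptive positive’ are the inexact matches that need a separate decision.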

The paper provides an attempt to systematize such types of conflict toward a library of common types of conflict, so that it should become easier to find them, and offers steps toward a proper framework to manage all that, which assisted with devising generic approaches to resolution of conflicts. We already have done more to realize all that (which could not all be squeezed into the 12 pages), but more is still to be done, so stay tuned.

Since COVID-19 is still doing the rounds and the international borders of South Africa are still closed (with a lockdown for some 5 months already), I can’t end the blog post with the usual ‘I hope to see you at ICBO 2020 in Bolzano in September’—well, not in the common sense understanding at least. Hopefully next year then.

References

[1] Cisse PA, Camara G, Dembele JM, Lo M. An Ontological Model for the Annotation of Infectious Disease Simulation Models. In: Bassioni G, Kebe CMF, Gueye A, Ndiaye A, editors. Innovations and Interdisciplinary Solutions for Underserved Areas. Springer LNICST, vol. 296, 82–91. 2019.

[2] Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annual Review of Biomedical Data Science, 2018, 1:305–331.

[3] Grütter R, Keet CM. Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring. 11th International Conference on Biomedical Ontologies (ICBO’20), 16-19 Sept 2020, Bolzano, Italy. CEUR-WS (in print).

[4] He Y, Yu H, Ong E, Wang Y, Liu Y, Huffman A, Huang H, Beverley J, Hur J, Yang X, Chen L, Omenn GS, Athey B, Smith B. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scientific Data, 2020, 7:181.

# A requirements catalogue for ontology languages

If you could ‘mail order’ a language for representing ontologies or fancy knowledge graphs, what features would you want it to have? Or, from an artefact development viewpoint: what requirements would it have to meet? Perhaps it may not be a ‘Christmas wish list’ in these days, but a COVID-19 lockdown ‘keep dreaming’ one instead, although perhaps it may even be feasible to realise if you don’t ask for too much. Either way, answering this on the spot may not be easy, and possibly incomplete. Therefore, I have created a sample catalogue, based on the published list of requirements and goals for OWL and CL, and I added a few more. The possible requirements to choose from currently are loosely structured into six groups: expressiveness/constructs/modelling features; features of the language as a whole; usability by a computer; usability for modelling by humans; interaction with ‘outside’, i.e., other languages and systems; and ontological decisions. If you think the current draft catalogue should be extended, please leave a comment on this post or contact the author, and I’ll update accordingly.

Expressiveness/constructs/modelling features

E-1 Equipped with basic language elements: predicates (1, 2, n-ary), classes, roles, properties, data-types, individuals, … [select or add as appropriate].

E-2 Equipped with language features/constraints/constructs: domain/range axioms, equality (for classes, for individuals), cardinality constraints, transitivity, … [select or add as appropriate].

E-3 Sufficiently expressive to express various commonly used ‘syntactic sugarings’ for logical forms or commonly used patterns of logical sentences.

E-4 Such that any assumptions about logical relationships between different expressions can be expressed in the logic directly.

Features of the language as a whole

F-1 It has to cater for meta-data; e.g., author, release notes, release date, copyright, … [select or add as appropriate].

F-2 An ontology represented in the language may change over time and it should be possible to track that.

F-3 Provide a general-purpose syntax for communicating logical expressions.

F-4 Unambiguous, i.e., not needed to have negotiation about syntactic roles of symbols, or translations between syntactic roles.

F-5 Such that every name has the same logical meaning at every node of the network.

F-6 Such that it is possible to refer to a local universe of discourse (roughly: a module).

F-7 Such that it is possible to relate the ontology to other such universes of discourse.

F-8 Specified with a particular semantics.

F-9 Should not make arbitrary assumptions about semantics.

F-10 Cater for internationalization (e.g., language tags, additional language model).

F-11 Extendable (e.g., regarding adding more axioms to same ontology, add more vocabulary, and/or in the sense of importing other ontologies).

F-12 Balance expressivity and complexity (e.g., for scalable applications, for decidable automated reasoning tasks).

F-13 Have a query language for the ontology.

F-14 Declared with Closed World Assumption.

F-15 Declared with Open World Assumption.

F-16 Use Unique Name Assumption.

F-17 Do not use Unique Name Assumption.

F-18 Ability to modify the language with the language features.

F-19 Ability to plug in language feature extensions; e.g., ‘loading’ a module for a temporal extension.

Usability by computer

UC-1 Be an (identifiable) object on the Web.

UC-2 Be usable on the Web.

UC-3 Using URIs and URI references that should be usable as names in the language.

UC-4 Using URIs to give names to expressions and sets of expressions, in order to facilitate Web operations such as retrieval, importation and cross reference.

UC-5 Have a serialisation in [XML/JSON/…] syntax.

UC-6 Have symbol support for the syntax in LaTeX/…

UC-7 Such that the same entailments are supported, everywhere on the network of ontologies.

UC-8 Able to be used by tools that can do subsumption reasoning/taxonomic classification.

UC-9 Able to be used by tools that can detect inconsistency.

UC-10 Possible to read and write in the document with simple tools, such as a text editor.

UC-11 Unambiguous and simple grammar to keep parsing documents as simple as possible.

Usability & modelling by humans

HU-1 Easy to use

HU-2 Have at least one compact, human-readable syntax defined which can be used to express the entire language

HU-3 Have at least one compact, human-readable syntax defined so that it can be easily typed up in emails

HU-4 Such that no agent should be able to limit the ability of another agent to refer to any entity or to make assertions about any entity

HU-5 Such that a modeller is free to invent new names and use them in published content.

HU-6 Have clearly defined syntactic sugar, such as a controlled natural language for authoring or rendering the ontology or an exhaustive diagrammatic notation

Interaction with outside

I-1 Shareable (e.g., on paper, on the computer, concurrent access)

I-2 Interoperable (with what?)

I-3 Compatible with existing standards (e.g., RDF, OWL, XML, URIs, Unicode)

I-4 Support open networks of ontologies

I-5 Possible to import ontologies (theories, files)

I-6 Option to declare inter-ontology assertions

Ontological decisions

O-1 3-Dimensionalist commitment, where entities are in space but one doesn’t care about time

O-2 3-Dimensionalist with a temporal extension

O-3 4-Dimensionalist commitment, where entities are in spacetime

O-4 Standard view of relations and relationships (there is an order in which the entities participate)

O-5 Positionalist relations and relationships (there’s no order, but entities play a role in the relation/relationship)

O-6 Have additional primitives, such as for subsumption, parthood, collective, stuff, sortal, anti-rigid entities, … [select or add as appropriate]

O-7 Statements are either true or false

O-8 Statements may be vague or uncertain; e.g., fuzzy, rough, probabilistic [select as appropriate]

O-9 There should be a clear separation between natural language and ontology

O-10 Ontology and natural language are intertwined
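When picking from the catalogue, some combinations will not go together. A small Python sketch of such a sanity check follows; note that the list of clashing pairs is an assumption based on a literal reading of the items above (closed vs. open world, with vs. without unique names, separated vs. intertwined natural language), not something the catalogue itself prescribes.

```python
# Pairs of requirement identifiers that look mutually exclusive when read
# literally (an assumption for this sketch, not part of the catalogue).
EXCLUSIVE = [("F-14", "F-15"), ("F-16", "F-17"), ("O-9", "O-10")]

def clashes(selection):
    """Return the pairs of selected requirements that exclude each other."""
    chosen = set(selection)
    return [(a, b) for a, b in EXCLUSIVE if a in chosen and b in chosen]

wish_list = ["E-1", "F-8", "F-14", "F-15", "UC-5", "O-7"]
found = clashes(wish_list)
print(found)
```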

That’s all, for now.

# A set of competency questions and SPARQL-OWL queries, with analysis

As a good beginning of the new year, our Data in Brief article Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations [1] was accepted and came online this week, which accompanies our Journal of Web Semantics article Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL [2] that was published in December 2019, with ‘our’ referring to my collaborators in Poznan, Dawid Wisniewski, Jedrzej Potoniec, and Agnieszka Lawrynowicz, and myself. The former article provides extensive detail of a dataset we created, which was subsequently used for an analysis that provided new insights, described in the latter article.

The dataset

In short, we tried to find existing good TBox-level competency questions (CQs) for available ontologies and manually formulate (i.e., formalise the CQ in) SPARQL-OWL queries for each of the CQs over said ontologies. We ended up with 234 CQs for 5 ontologies, with 131 accompanying SPARQL-OWL queries. This constitutes the first gold standard pipeline for verifying an ontology’s requirements and it presents the systematic analyses of what is translatable from the CQs and what not, and when not, why not. This may assist in further research and tool development on CQs, automating CQ verification, assessing the main query language constructs and therewith language optimisation, among others. The dataset itself is indeed independently reusable for other experiments, and has been reused already [3].

The key insights

The first analysis we conducted on it, reported in [2], revealed several insights. First, a larger set of CQs (cf. earlier work) indeed did increase the number of CQ patterns. There are recurring patterns in the shape of the CQs, when analysed linguistically; a popular one is What EC1 PC1 EC2? obtained from CQs like “What data are collected for the trail making test?” (a Dem@care CQ). Observe that, yes, indeed, we did decouple the language layer from the formalisation layer rather than mixing the two; hence, the ECs (resp. PCs) are not necessarily classes (resp. object properties) in an ontology. The SPARQL-OWL queries were also analysed as to what is really used of that query language, and what is used most often (see table 7 of the paper).
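How a CQ ends up as a pattern can be illustrated with a short Python sketch. The chunks are annotated by hand here, and the chunk boundaries are an assumed reading of how the example CQ maps onto the pattern; the function is only an illustration, not the procedure used to create the dataset.

```python
def to_pattern(cq, chunks):
    """Replace annotated chunks with numbered EC/PC placeholders.

    chunks: (text, kind) pairs in order of appearance, kind being "EC" or "PC".
    """
    counters = {"EC": 0, "PC": 0}
    parts, pos = [], 0
    for text, kind in chunks:
        start = cq.index(text, pos)   # raises ValueError if the chunk is absent
        parts.append(cq[pos:start])   # keep the words between chunks as they are
        counters[kind] += 1
        parts.append(f"{kind}{counters[kind]}")
        pos = start + len(text)
    parts.append(cq[pos:])
    return "".join(parts)

cq = "What data are collected for the trail making test?"
chunks = [
    ("data", "EC"),
    ("are collected for", "PC"),
    ("the trail making test", "EC"),
]
pattern = to_pattern(cq, chunks)
print(pattern)
```

Since the placeholders only abstract over the wording, nothing in the pattern says whether ‘data’ will turn out to be a class, or ‘are collected for’ an object property, in a given ontology.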

Second, these characteristics are not the same across CQ sets by different authors of different ontologies in different subject domains, although some patterns do recur and are thus somehow ‘popular’ regardless. Third, the relation CQ (pattern or not) : SPARQL-OWL query (or its signature) is m:n, not 1:1. That is, a CQ may have multiple SPARQL-OWL queries or signatures, and a SPARQL-OWL query or signature may be put into a natural language question (CQ) in different ways. The latter sucks for any aim of automated verification, but unfortunately, there doesn’t seem to be an easy way around that: 1) there are different ways to say the same thing, and 2) the same knowledge can be represented in different ways, therewith leading to a different shape of the query. Some possible ways to mitigate either are being looked into, like specifying a CQ controlled natural language [3] and modelling styles [4] so that one might be able to generate an algorithm to find and link or swap or choose one of them [5,6], but all that is still in the preliminary stages.

Meanwhile, there is that freely available dataset and the in-depth rigorous analysis, so that, hopefully, a solution may be found sooner rather than later.

References

[1] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, in press.

[2] Wisniewski, D., Potoniec, J., Lawrynowicz, A., Keet, C.M. Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL. Journal of Web Semantics, 2019, 59:100534.

[3] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol 1057, 3-15.

[4] Fillottrani, P.R., Keet, C.M. Dimensions Affecting Representation Styles in Ontologies. 1st Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’19). Springer CCIS vol 1029, 186-200. 24-28 June 2019, Villa Clara, Cuba.

[5] Fillottrani, P.R., Keet, C.M. Patterns for Heterogeneous TBox Mappings to Bridge Different Modelling Decisions. 14th Extended Semantic Web Conference (ESWC’17). Springer LNCS vol 10249, 371-386. Portoroz, Slovenia, May 28 – June 2, 2017.

[6] Khan, Z.C., Keet, C.M. Automatically changing modules in modular ontology development and management. Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT’17). ACM Proceedings, 19:1-19:10. Thaba Nchu, South Africa. September 26-28, 2017.

# More and better TDD for ontology authoring

Test-driven development (TDD) for ontology authoring [1] has received attention previously, including its accompanying tool TDDOnto [2] that was subsequently improved upon into the (also open source) TDDonto2 tool [3]. The TDDonto2 demo paper [3] did not contain the technical details about the new-and-improved algorithms and specification for TDD testing that we claimed it had. They are published just now in the International Journal on Artificial Intelligence Tools, as the article entitled More Effective Ontology Authoring with Test-Driven Development and the TDDonto2 tool [4]. The better algorithms cover more OWL language features than the original v1 of the theory and tool and it includes a specification for TDD testing such that there is not just pass/fail/absent as test result, but specific outcomes of the TDD test that are more informative, like that the ontology will become incoherent if that axiom were to be added. Given that model, the general flow for a simple standard case of a single TDD test (though more axioms can be tested at once) is as follows:

simplified view of the extended TDD process (source: adapted from [4])

The elements in the figure that are coloured light grey are the steps covered by the specification for TDD testing, algorithms, and TDDonto2 tool that is introduced in the paper.
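To give a feel for test results that say more than pass or fail, here is a toy version of a single TDD test in Python. It is a sketch under strong assumptions: the ‘ontology’ is just a set of named-class subsumptions plus disjointness pairs, the reasoning is a plain transitive closure rather than a DL reasoner, and the three outcome labels and the example classes are illustrative rather than the specification in [4].

```python
def superclasses(cls, subclass_of):
    """All classes that subsume cls (reflexive, transitive closure)."""
    seen, todo = {cls}, [cls]
    while todo:
        current = todo.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in seen:
                seen.add(sup)
                todo.append(sup)
    return seen

def tdd_test(sub, sup, subclass_of, disjoint):
    """Test the axiom 'sub SubClassOf sup' before adding it to the ontology."""
    if sup in superclasses(sub, subclass_of):
        return "entailed"    # already follows; nothing to add
    # Tentatively add the axiom and look for two disjoint superclasses of sub.
    after = superclasses(sub, subclass_of | {(sub, sup)})
    for a, b in disjoint:
        if a in after and b in after:
            return "incoherent"  # adding it would make sub unsatisfiable
    return "absent"          # safe to add, but not there yet

# Hypothetical toy ontology.
subclass_of = {("Impala", "Herbivore"), ("Herbivore", "Animal")}
disjoint = {("Animal", "Plant")}

results = [
    tdd_test("Impala", "Animal", subclass_of, disjoint),
    tdd_test("Impala", "Plant", subclass_of, disjoint),
    tdd_test("Impala", "Mammal", subclass_of, disjoint),
]
print(results)
```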

The paper’s title clearly also hints at another contribution: using TDDonto2 for ontology authoring is significantly more effective. It was compared against the commonly used (and test-last) Protégé interface, which showed that the participants completed a larger part of the task in less time and with fewer mistakes. It also requires fewer interactions (clicking and typing) in the interface, which we reported on in an earlier (longer) tech report [5].

screenshot of the outcome of running the four tests on the sample ontology, in TDDonto2

As usual with research, more can be done. This is especially so with respect to the white boxes in the figure above, i.e., the other aspects that would contribute toward a complete TDD methodology for ontology development. One step that we have been working on is the idea of turning competency questions into axioms for TDD, which now is doable from CQ to SPARQL-OWL query [6] (more about that later), a CNL that may contribute to the authoring [7], and trying to figure out the modelling styles more precisely [8], since they hamper automation of these first steps in the process to get those axioms into the TDD plugin in a user-friendly way.

References

[1] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. 13th Extended Semantic Web Conference (ESWC’16). Springer LNCS vol. 9678, 642-657. 29 May – 2 June, 2016, Crete, Greece.

[2] Lawrynowicz, A., Keet, C.M. The TDDonto Tool for Test-Driven Development of DL Knowledge bases. 29th International Workshop on Description Logics (DL’16). April 22-25, Cape Town, South Africa. CEUR WS vol. 1577.

[3] Davies, K., Keet, C.M., Lawrynowicz, A. TDDonto2: A Test-Driven Development Plugin for arbitrary TBox and ABox axioms. The Semantic Web: ESWC 2017 Satellite Events, Blomqvist, E et al. (eds.). Springer LNCS vol 10577, 120-125. Portoroz, Slovenia, May 28 – June 2, 2017.

[4] Davies, K., Keet, C.M., Lawrynowicz, A. More Effective Ontology Authoring with Test-Driven Development and the TDDonto2 tool. International Journal on Artificial Intelligence Tools, 2019, 28(7): 1950023.

[5] Keet, C.M., Davies, K., Lawrynowicz, A. More Effective Ontology Authoring with Test-Driven Development. Technical Report 1812.06015. December 2018

[6] Wisniewski, D., Potoniec, J., Lawrynowicz, A., Keet, C.M. Analysis of Ontology Competency Questions and their Formalisations in SPARQL-OWL. Journal of Web Semantics. (in print)

[7] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS. (in print)

[8] Fillottrani, P.R., Keet, C.M. Dimensions Affecting Representation Styles in Ontologies. 1st Iberoamerican conference on Knowledge Graphs and Semantic Web (KGSWC’19). Springer CCIS vol. 1029, 186-200. 23-30 June 2019, Villa Clara, Cuba.