Countless articles have announced the death of symbolic AI, which includes, among others, ontology engineering, in favour of data-driven AI with deep learning, even more loudly so since large language model-based apps like ChatGPT have captured the public’s attention and imagination. There are those who don’t even realise there is more to AI than deep learning with neural networks. But there is; have a look at the ACM Computing Classification or scroll down to the screenshots at the end of this post if you’re unaware of that. With all the hype and narrow focus, doom and gloom is being predicted with a new AI winter on the cards. But is it? It’s not like we all ditched mathematics at school when portable calculators became cheap gadgets, so why would we ditch the rest of AI now that there are machine and deep learning, Large Language Models (LLMs), and an app that attracts attention? Let me touch upon a few examples to illustrate that ontologies have not become obsolete, nor will they.
How exactly do you think data integration is done? Maybe ChatGPT can tell you what’s involved, superficially, but it won’t actually do it for you. Consider, for instance, a paper published earlier this month, on finding clusters of long Covid patient symptoms [Reese23], described in a press release: they obtained data of 20,532 relevant patients from 38 (!!) data partners, where the authors mapped the clinical findings taken from the electronic health records “to computable terms contained in the Human Phenotype Ontology (HPO), a standard framework for describing human traits … This allowed the researchers to analyze the data across the entire cohort.” (italics are mine). Here’s an illustration of the idea:
Could reliable data integration possibly be done by LLMs? No, not even in the future. NLP with electronic health records is an option, true, but it won’t harmonise terminology for you, nor will it integrate different electronic health record systems.
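To make the harmonisation step concrete, here is a minimal sketch of what mapping local terminology to shared ontology terms amounts to. Everything in it is an assumption for illustration: the partner names, the local labels, and the `TERM:` identifiers are placeholders (not real HPO identifiers), and it is not the pipeline used in [Reese23].

```python
from collections import defaultdict

# Placeholder shared vocabulary; these IDs are NOT real HPO identifiers.
SHARED_TERMS = {
    "TERM:0001": "Fatigue",
    "TERM:0002": "Shortness of breath",
}

# Hypothetical, manually curated mappings from each partner's local labels
# to the shared terms.
PARTNER_MAPPINGS = {
    "hospital_a": {"tiredness": "TERM:0001", "SOB": "TERM:0002"},
    "hospital_b": {"exhaustion": "TERM:0001", "dyspnoea": "TERM:0002"},
}


def harmonise(partner, local_label):
    """Return the shared term ID for a partner's local label, or None if unmapped."""
    return PARTNER_MAPPINGS.get(partner, {}).get(local_label)


def patients_per_term(records):
    """Group patient IDs by shared term so the cohort can be analysed as a whole."""
    cohort = defaultdict(set)
    for partner, patient, label in records:
        term = harmonise(partner, label)
        if term is not None:
            cohort[term].add(patient)
    return dict(cohort)


# Hypothetical records: (data partner, patient ID, locally recorded finding).
records = [
    ("hospital_a", "patient-1", "tiredness"),
    ("hospital_b", "patient-2", "exhaustion"),
    ("hospital_b", "patient-3", "dyspnoea"),
]
cohort = patients_per_term(records)
```

The point of the sketch is that two differently worded findings from two systems end up under one computable term, which is what makes a query across all partners possible in the first place.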
LLMs aren’t good at playing with data in the myriad of ways where ontologies are used to power ‘intelligent’ applications. Take data that’s generated in the automation of scientific experiments, for instance: cell types in the brain need to be annotated and processed to try to find new cell types, and annotations with those new types are then added, which are used downstream in queries and further analysis [Tan23]. There is no new stuff in off-the-shelf LLMs, so they can’t help; ontologies can – and do. Ontologies are used and extended as needed to document the new ground truth, which won’t ever be replaced by LLMs, nor by the approximations that machine learning’s outputs are.
What about intelligent analysis of real-time data? Those LLMs won’t be of assistance there either. Take, e.g., energy-optimised building systems control: the system takes real-time data that is linked to an ontology and then it can automatically derive energy conservation measures for the building and its use [Pruvost22].
Much has been written on ChatGPT and education. It’s an application domain that permits no mistakes on the teaching side of it and, in fact, demands vetted quality. There are many tasks, from content presentation to assessment. ChatGPT can generate quiz questions, indeed, but only on general knowledge. It can generate a response as well, but whether that will be the correct answer is another matter altogether. We also need other types of educational questions besides MCQs, in many disciplines, on specific texts and textbooks with their particular vocabulary, and have the answer computed for automated marking. Computing correct questions and answers can be done with ontologies and some basic automated reasoning services [Raboanary22]. One obtains precision with ontologies that cannot be had with probabilistic guessing. Or take the Foundational Model of Anatomy ontology as a concrete example, which is used to manage the topics in anatomy classes augmented with VR [Soergel22]. Ontologies can also be used as a method of teaching, in art history no less, to push students to dig into the details and be precise [Bertens22] – the opposite of the bland, handwavy, roughly, sort of, non-committal, and fickle responses ChatGPT provides, at times, to open questions.
They’re just a few application examples that I lazily came across in the timespan of a mere 15 minutes (including selecting them) – one via the LinkedIn timeline, a GS search on “ontologies” with a “since 2022” (17300 results this morning) and clicking a few links that sounded appealing, and one I’m involved in.
This post is not a cry of desperation before sinking, but, rather, mainly one of annoyance. Technology blinkers of any kind are no good and one better has more than just a hammer in one’s toolbox. Not everything can be solved by LLMs and deep learning, and Knowledge Representation (& Reasoning) is not dead. It may have been elbowed to the side by the new kids on the block. I suspect that those in the ‘symbolic AI is obsolete’ camp simply aren’t aware – or would like to pretend not to be aware – of the many different AI-driven computing tasks that need to be solved and implemented. Tasks for which there are no humongous amounts of text or non-text data to grab and learn from. Tasks that are not tolerant to outputs that are noisy or plain wrong. Tasks that require current data, not stale stuff that is a year old or older. Tasks where past data are not a good predictor for the future. Tasks in specialised domains. Tasks that are quirky to a locale. And so on. The NLP community already has recognised that LLMs’ outputs need fixing, which I was pleasantly surprised by when I attended EMNLP’22 in December (see my EMNLP22 trip report for a few pointers).
Also, and casting the net a little wider, our academic year is about to start, where students need to choose projects and courses, including, among others, another installment of ontology engineering, of logic for AI, Computer Vision, and so on. Perhaps this might assist in choosing and in reflecting that computing as a whole is not going to be obsolete either. ChatGPT and CodePilot can probably pass our 1st-year practical assignments, but there’s so much more computing beyond that, that relies on students understanding the foundations and problem-solving methods. Why should the whole rest of AI, and even computing as a discipline, become obsolete the instant a tool can, at best, regurgitate the known coding solutions to common basic tasks? There are still mathematicians notwithstanding all the devices more powerful than a pocket calculator and there are linguists regardless of the free availability of Google Translate’s services; so why would software engineers not remain when there’s a code-completion tool for basic tasks?
Perhaps you still do not care about ontologies and knowledge representation & reasoning. That’s fine; everyone has their interests – just don’t mistake new interests for the obsolescence of established topics. In case you do want to know more about ontologies and ontology engineering: you may like to have a look at my award-winning open textbook, with exercises, tools, and slides.
p.s.: here are those screenshots on the ACM classification and AI, annotated:
Software systems aren’t getting any less complex to design, implement, and maintain, which applies to both the numerous diverse components and the myriad of people involved in the development processes. Even a straightforward configuration of a database back-end and an object-oriented front-end tool requires coordination among database analysts, programmers, HCI people, and increasing involvement of domain experts and stakeholders. They each may prefer, and have different competencies in, certain specific design mechanisms; e.g., one may want EER for the database design, UML diagrams for the front-end app, and perhaps structured natural language sentences with SBVR or ORM for expressing the business rules. This requires multi-modal modelling in a plurality of paradigms. This would then need to be supported by hybrid tools that offer interoperability among those modelling languages, since such heterogeneity won’t go away any time soon, or ever.
Example of possible interactions between the various developers of a software system and the models they may be using.
It is far from trivial to have these people work together whilst maintaining their preferred view of a unified system’s design, let alone doing all this design in one system. In fact, there’s no such tool that can seamlessly render such varied models across multiple modelling languages whilst preserving the semantics. At best, there’s either only theory that aims to do that, or only a subset of the respective languages’ features, or a subset of the required combinations. Well, more precisely, until our efforts. We set out to fill this gap in functionality, both in a theoretically sound way and implemented as proof-of-concept to demonstrate its feasibility. The latest progress was recently published in the paper entitled A framework for interoperability with hybrid tools in the Journal of Intelligent Information Systems [1], in collaboration with Germán Braun and Pablo Fillottrani.
First, we propose the Framework for semantiC Interoperability of conceptual data modelling Languages, FaCIL, which serves as the core orchestration mechanism for hybrid modelling tools with relations between components and a workflow that uses them. At its centre, it has a metamodel that is used for the interchange between the various conceptual models represented in different languages and it has sets of rules to and from the metamodel (and at the metamodel level) to ensure the semantics is preserved when transforming a model in one language into a model in a different language and such that edits to one model automatically propagate correctly to the model in another language. In addition, thanks to the metamodel-based approach, logic-based reconstructions of the modelling languages also have become easier to manage, and so a path to automated reasoning is integrated in FaCIL as well.
This generic multi-modal modelling interoperability framework FaCIL was instantiated with a metamodel for UML Class Diagrams, EER, and ORM2 interoperability specifically [2] (introduced in 2015), called the KF metamodel [3] with its relevant rules (initial and implemented ones), an English controlled natural language, and a logic-based reconstruction into a fragment of OWL (orchestration graphically from the paper). This enables a range of different user interactions in the modelling process, of which an example of a possible workflow is shown in the following figure.
A sample workflow in the hybrid setting, showing interactions between visual conceptual data models (i.e., in their diagram version) and in their (pseudo-)natural language versions, with updates propagating to the others automatically. At the start (top), there’s a visual model in one’s preferred language from which a KF runtime model is generated. From there, it can go in various directions: verbalise, convert, or modify it. If the latter, then the KF runtime model is also updated and the changes are propagated to the other versions of the model, as often as needed. The elements in yellow/green/blue are thanks to FaCIL and the white ones are the usual tasks in the traditional one-off one-language modelling setting.
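The idea of a runtime model at the centre, with updates propagating to each language-specific view, can be sketched in a few lines. This is a toy under stated assumptions, not the FaCIL rules or crowd 2.0’s code: the class `RuntimeModel`, the two-entry vocabulary, and the example model are all made up for illustration, and real transformation rules must also deal with constraints, attributes, and features that exist in one language but not another.

```python
from dataclasses import dataclass, field

# Hypothetical, heavily simplified vocabulary: each language-specific element
# kind maps to one shared metamodel element.
TO_METAMODEL = {
    ("UML", "Class"): "ObjectType",
    ("EER", "EntityType"): "ObjectType",
    ("ORM", "EntityType"): "ObjectType",
    ("UML", "Association"): "Relationship",
    ("EER", "Relationship"): "Relationship",
    ("ORM", "FactType"): "Relationship",
}
# Inverse direction: from a metamodel element back to a language-specific kind.
FROM_METAMODEL = {
    (language, meta): kind for (language, kind), meta in TO_METAMODEL.items()
}


@dataclass
class RuntimeModel:
    """Single shared model; every language-specific view is derived from it."""
    elements: dict = field(default_factory=dict)  # element name -> metamodel kind

    def load(self, language, model):
        """Generate the runtime model from a model in one's preferred language."""
        for name, kind in model.items():
            self.add(language, name, kind)

    def add(self, language, name, kind):
        """Record an edit made in one language at the metamodel level."""
        self.elements[name] = TO_METAMODEL[(language, kind)]

    def render(self, language):
        """Derive the view in the requested language from the runtime model."""
        return {
            name: FROM_METAMODEL[(language, meta)]
            for name, meta in self.elements.items()
        }


runtime = RuntimeModel()
runtime.load("UML", {"Patient": "Class", "treatedIn": "Association"})
runtime.add("EER", "Hospital", "EntityType")  # an edit made in the EER view
uml_view = runtime.render("UML")  # the edit shows up here as a UML class
orm_view = runtime.render("ORM")
```

Because each view is computed from the one runtime model, an edit made in any language only has to be handled once, which is the design choice the workflow figure describes.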
These theoretical foundations were implemented in the web-based crowd 2.0 tool (with source code). crowd 2.0 is the first hybrid tool of its kind, tying together all the pieces such that now, instead of partial or full manual model management of transformations and updates in multiple disparate tools, these tasks can be carried out automatically in one application and therewith also allow diverse developers and stakeholders to work from a shared single system.
We also describe a use case scenario for it – on Covid-19, as pretty much all of the work for this paper was done during the worse-than-today’s stage of the pandemic – that has lots of screenshots from the tool in action, both in the paper (starting here, with details halfway in this section) and more online.
Besides evaluating the framework with an instantiation, a proof-of-concept implementation of that instantiation, and a use case, it was also assessed against the reference framework for conceptual data modelling of Delcambre and co-authors [4] and shown to meet those requirements. Finally, crowd 2.0’s features were assessed against five relevant tools, considering the key requirements for hybrid tools, and shown to compare favourably against them (see Table 2 in the paper).
Distinct advantages can be summed up as follows, from those 26 pages of the paper, where the, in my opinion, most useful ones are underlined here, and the most promising ones to solve another set of related problems with conceptual data modelling (in one fell swoop!) in italics:
One system for related tasks, including visual and text-based modelling in multiple modelling languages, automated transformations and update propagation between the models, as well as verification of the model on coherence and consistency.
Any visual and text-based conceptual model interaction with the logic has to be maintained only in one place rather than for each conceptual modelling and controlled natural language separately;
A controlled natural language can be specified on the KF metamodel elements so that it then can be applied throughout the models regardless of the visual language, therewith eliminating the duplicate work of re-specifications for each modelling language and fragment thereof;
Any further model management, especially in the case of large models, such as abstraction and modularisation, can be specified either on the logic or on the KF metamodel in one place and propagate to other models accordingly, rather than re-inventing or reworking the algorithms for each language over and over again;
The modular design of the framework allows for extensions of each component, including more variants of visual languages, more controlled languages in your natural language of choice, or different logic-based reconstructions.
Of course, more can be done to make it even better, but it is a milestone of sorts: research into the theoretical foundations of this particular line of research had commenced 10 years ago with the DST/MINCyT-funded bi-lateral project on ontology-driven unification of conceptual data modelling languages. Back then, we fantasised that, with more theory, we might get something like this sometime in the future. And we did.
[4] Delcambre, L. M. L., Liddle, S. W., Pastor, O., & Storey, V. C. (2018). A reference framework for conceptual modeling. In: 37th International Conference on Conceptual Modeling (ER’18). LNCS. Springer, vol. 11157, 27–42.
It’s a question I’ve been asked several times. Students see ontology papers in venues such as FOIS, EKAW, KR, AAAI, Applied Ontology, or the FOUST workshops and it seems as if all that stuff just fell from the sky neatly into the paper, or that the authors perhaps played with mud and somehow got the paper’s contents to emerge neatly from it. Not quite. It’s just that none of the authors bothered to write a “methods and methodologies” or “procedure” section. That it’s not written doesn’t mean it didn’t happen.
To figure out how to go about doing such an ontological investigation, there are a few options available to you:
Read many such papers and try to distill commonalities with which one could reverse engineer a possible process that could have led to those documented outcomes.
Guess the processes and do something, submit the manuscript, swallow the critical reviews and act upon those suggestions; repeat this process until it makes it through the review system. Then try again with another topic to see if you can do it now by yourself in fewer iterations.
Try to get a supervisor or a mentor who has published such papers and be their apprentice or protégé formally or informally.
Enrol in an applied ontology course, where they should be introducing you to the mores of the field, including the process of doing ontological investigations. Or take up a major/minor in philosophy.
Pursuing all options likely will get you the best results. In a time of publish-or-perish, shortcuts may be welcome since the ever greater pressures are less forgiving to learning things the hard way.
Every discipline has its own ways for how to investigate something. At a very high level, it still will look the same: you arrive at a question, a hypothesis, or a problem that no one has answered/falsified/solved before, you do your thing and obtain results, discuss them, and conclude. For ontology, what hopefully rolls out of such an investigation is what the nature of the entity under investigation is. For instance, what dispositions are, a new insight on the transitivity of parthood, the nature of the relation between portions of stuff, or what a particular domain entity (e.g., money, peace, pandemic) means.
I haven’t seen cookbook instructions for how to go about doing this for applied ontology. I did do most of the options listed above: I read (and still read) a lot of articles, conducted a number of such investigations myself and managed to get them published, and even did a (small) dissertation in applied philosophy (mentorships are hard to come by for women in academia, let alone the next stage of being someone’s protégé). I think it is possible to distill some procedure from all of that, for applied ontology at least. While it’s still only a rough outline, it may be of interest to put it out there to get feedback on it to see whether this can be collectively refined or extended.
With X the subject of investigation, which could be anything—a feature such as the colour of objects, the nature of a relation, the roles people fulfill, causality, stuff, collectives, events, money, secrets—the following steps will get you at least closer to an answer, if not finding the answer outright:
1. (optional) Consult dictionaries and the like for what they say about X;
2. Do a scientific literature review on X and, if needed when there’s little on X, also look up attendant topics for possible ideas;
3. Criticise the related work for where they fall short and how, and narrow down the problem/question regarding X;
4. Put forth your view on the matter, by building up the argument step by step; e.g., as follows:
   a. From informal explanation to a possible intermediate stage with sketching a solution (in ad hoc notation for illustration or by abusing ORM or UML class diagram notation) to a formal characterisation of X, or the aspect of X if the scope was narrowed down.
   b. From each piece of informal explanation, create the theory one axiom or definition at a time.
   Either of the two may involve proofs for logical consequences and will have some iterations of looking up more scientific literature to finalise an axiom or definition.
5. (optional) Evaluate and implement.
6. Discuss where it gave new insight, note any shortcomings, and mention new questions it may generate or problems it doesn’t solve yet, and conclude.
For step 3, and as compared to scientific literature I’ve read in other disciplines, the ontologists are a rather blunt critical lot. The formalisation stage in step 4 is more flexible than indicated. For instance, you can choose your logic or make one up [1], but you do need at least something of that (more about that below). Few use tools, such as Isabelle, Prover9, and HeTS, to assist with the logic aspects, but I would recommend you do. Also within that grand step 4, is that philosophers typically would not use UML or ORM or the like, but use total freedom in drawing something, if there’s a drawing at all (and a good number would recoil at the very word ‘conceptual data modeling language’, but that’s for another time), and likewise for many a logician. Here are two sample sequences for that step 4:
A visualization of the ‘one definition or axiom at a time’ option (4b)
A visualization of the ‘iterating over a diagram first’ option (4a)
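To give a flavour of what ‘one axiom or definition at a time’ looks like on paper, here is a standard textbook-style fragment for parthood, taking $P(x,y)$ to read ‘x is part of y’ and $PP(x,y)$ ‘x is a proper part of y’. It is only an illustration of the kind of output step 4 produces, written in first-order logic as one possible choice of language; it is not taken from any of the papers discussed here.

```latex
\begin{align}
&\forall x\, P(x,x) && \text{(reflexivity)}\\
&\forall x,y\, \big(P(x,y) \land P(y,x) \rightarrow x = y\big) && \text{(antisymmetry)}\\
&\forall x,y,z\, \big(P(x,y) \land P(y,z) \rightarrow P(x,z)\big) && \text{(transitivity)}\\
&\forall x,y\, \big(PP(x,y) \leftrightarrow P(x,y) \land \lnot P(y,x)\big) && \text{(definition of proper part)}
\end{align}
```

Each line would be preceded by its informal justification in the text, and a consequence such as the transitivity of $PP$ is then something to prove rather than to assert.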
As an aside, the philosophical investigations are lonesome endeavours resulting in disproportionately more single-author articles and books. This is in stark contrast with ontologies, those artefacts in computing and IT: many of them are developed in teams or even in large consortia, ranging from a few modellers to hundreds of contributors. Possibly because there are more tasks and the scope often may be larger.
Is that all there is to it? Sort of, yes, but for different reasons, there may be different emphases on different components (and so it still may not get you through the publication process to tell the world about your awesome results). Different venues have different scopes, even if they use the same terminology in their respective CFPs. Venues such as KR and AAAI are very much logic oriented, so there must be a formalization and proving interesting properties will substantially increase the (very small) chance of getting the paper accepted. Toning down the philosophical musings and deliberations is unlikely to be detrimental. An example of that approach is our paper on essential vs immutable part-whole relations [2]. I wouldn’t expect the earlier papers, such as on social roles by Masolo et al [3] or temporal mereology by Donnelly and Bittner [4], to be able to make it through in the KR/AAAI/IJCAI venues nowadays (none of the IJCAI’22 papers sound even remotely like an ontology paper). But feel free to try. IJCAI 2023 will be in Cape Town, in case that information would help to motivate trying.
Venues such as EKAW and KCAP like some theory, but there’s got to be some implementation, (plausible) use, and/or evaluation to it for it to have a chance to make it through the review process. For instance, my theory on relations was evaluated on a few ontologies [5] and the stuff paper had the ontology also in OWL, modelling guidance for use, and notes on interoperability [6]. All those topics, which reside in the “step 5” above, come at the ‘cost’ of less logic and less detailed philosophical deliberations—research time and a paper’s page limits do have hard boundaries.
Ontology papers in FOIS and the like prefer to see more emphasis on the theory and what can be dragged in and used or adapted from advances in analytic philosophy, cognitive science, and attendant disciplines. Evaluation is not asked for as a separate item but assumed to be evident from the argumentation. I admit that sometimes I skip that as well when I write for such venues, e.g., in [7], but typically do put some evaluation in there nonetheless (recall [1]). And there still tends to be the assumption that one can write axioms flawlessly and oversee consequences without the assistance of automated model checkers and provers. For instance, have a look at the FOIS 2020 best paper award paper on a theory of secrets [8], which went through the steps mentioned above with the 4b route, and the one about the ontology of competition [9], which took the 4a route with OntoUML diagrams (with the logic implied by its use), and one more on mereology that first had other diagrams as part of the domain analysis to then go to the formalization with definitions and theorems and a version in CLIF [10]. That’s not to say you shouldn’t do an evaluation of sorts (of the variety use cases, checking against requirements, proving consistency, etc.), but just that you may be able to get away with not doing so (provided your argumentation is good enough and there’s enough novelty to it).
Finally, note that this is a blog post and it was not easy to keep it short. Side alleys, more explanations, illustrations, and details are quite possible. If you have comments on the high-level procedure, please don’t hesitate to leave a comment on the blog or contact me directly!
References
[1] Fillottrani, P.R., Keet, C.M. An analysis of commitments in ontology language design. Proceedings of the 11th International Conference on Formal Ontology in Information Systems 2020 (FOIS’20). Brodaric, B and Neuhaus, F. (Eds.). IOS Press, FAIA vol. 330, 46-60.
[2] Artale, A., Guarino, N., and Keet, C.M. Formalising temporal constraints on part-whole relations. Proceedings of the 11th International Conference on Principles of Knowledge Representation and Reasoning (KR’08). Gerhard Brewka, Jerome Lang (Eds.) AAAI Press, pp 673-683.
[3] Masolo, C., Vieu, L., Bottazzi, E., Catenacci, C., Ferrario, R., Gangemi, A., & Guarino, N. Social Roles and their Descriptions. Proceedings of the 9th International Conference on Principles of Knowledge Representation and Reasoning (KR’04). AAAI press. pp 267-277.
[6] Keet, C.M. A core ontology of macroscopic stuff. 19th International Conference on Knowledge Engineering and Knowledge Management (EKAW’14). K. Janowicz et al. (Eds.). Springer LNAI vol. 8876, 209-224.
[7] Keet, C.M. The computer program as a functional whole. Proceedings of the 11th International Conference on Formal Ontology in Information Systems 2020 (FOIS’20). Brodaric, B and Neuhaus, F. (Eds.). IOS Press, FAIA vol. 330, 216-230.
[8] Haythem O. Ismail, Merna Shafie. A commonsense theory of secrets. Proceedings of the 11th International Conference on Formal Ontology in Information Systems 2020 (FOIS’20). Brodaric, B and Neuhaus, F. (Eds.). IOS Press, FAIA vol. 330, 77-91.
[9] Tiago Prince Sales, Daniele Porello, Nicola Guarino, Giancarlo Guizzardi, John Mylopoulos. Ontological foundations of competition. Proceedings of the 10th International Conference on Formal Ontology in Information Systems 2018 (FOIS’18). Stefano Borgo, Pascal Hitzler, Oliver Kutz (eds.). IOS Press, FAIA vol. 306, 96-109.
[10] Michael Grüninger, Carmen Chui, Yi Ru, Jona Thai. A mereology for connected structures. Proceedings of the 11th International Conference on Formal Ontology in Information Systems 2020 (FOIS’20). Brodaric, B and Neuhaus, F. (Eds.). IOS Press, FAIA vol. 330, 171-185.
When can we declare the covid-19 pandemic to be over? I mulled over that earlier in January this year when the omicron wave was fizzling out in South Africa, and wrote a blog post as a step toward trying to figure it out, and a short general public article was published by The Conversation (republished widely, including by The Next Web). That was not the end of it. In parallel – or, more precisely, behind the scenes – that ontological investigation did happen scientifically and in much more detail.
First, it includes a proper discussion of how the 9 relevant domain ontologies have pandemic represented in the ontology – the same as epidemic, a sibling thereof, or as a subclass, and why – and what sort of generic top-level entity it is asserted to be, and a few more scientific references by domain experts.
Second, besides the two foundational ontologies that I discussed the alignment to (DOLCE and BFO) in the blog post, I tried with five more foundational ontologies that were selected for meeting several criteria: BORO, GFO, SUMO, UFO, and YAMATO. That mainly took up a whole lot more time, but it didn’t add substantially to insights into what kind of entity pandemic is. It did, however, make clear that manually aligning is hard and that it is difficult to get an alignment as precise as it ought to be, and may need to be, for several reasons (elaborated on in the paper).
Third, I dug deeper into the eight characteristics of pandemics according to the review by Morens, Folkers and Fauci (yes, him, from the NIAID) [2] and disentangled what’s really going on with those, besides already having noted that several of them are fuzzy. Some of the characteristics aren’t really a property of pandemic itself, but of closely related entities, such as the disease (see table below). There are so many intertwined entities and relations, in fact, that one could very well develop an ontology of just pandemics, rather than have it only as a single class in an ontology as is now the case. For instance, there has to be a high attack rate, but ‘attack rate’ itself relies on the fact that there is an infectious agent that causes a disease and on the R (reproduction) number, which, in turn, is a complex thing that takes into account factors including susceptibility to infection, social dynamics of a population, and the ability to measure infections.
Finally, there are different ways to represent all the knowledge, or a relevant part thereof, as I also elaborated on in my Bio-Ontologies keynote last month. For instance, the attack rate could be squashed into a single data property if the calculation is done elsewhere and you don’t care how it is calculated, or it can be represented in all its glorious detail for the sake of it or for getting a clearer picture of what goes into computing the R number. For a scientific ontology, the latter is obviously the better choice, but there may be scenarios where the former is more practical.
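The two modelling choices can be contrasted in a small sketch. It is an illustration only, using Python classes as a stand-in for an ontology language; the class and attribute names are made up, the numbers are invented, and the attack rate is computed with the usual textbook simplification of new cases divided by the population at risk.

```python
from dataclasses import dataclass


@dataclass
class PandemicCompact:
    """Option 1: the attack rate as a single value, computed elsewhere."""
    name: str
    attack_rate: float


@dataclass
class Outbreak:
    """Part of option 2: the ingredients of the calculation are made explicit."""
    new_cases: int
    population_at_risk: int

    def attack_rate(self):
        # Textbook simplification: new cases in a period / population at risk.
        return self.new_cases / self.population_at_risk


@dataclass
class PandemicDetailed:
    """Option 2: related entities are represented in their own right."""
    name: str
    infectious_agent: str
    disease: str
    outbreak: Outbreak


# Invented numbers, purely for illustration.
compact = PandemicCompact(name="example pandemic", attack_rate=0.15)
detailed = PandemicDetailed(
    name="example pandemic",
    infectious_agent="example virus",
    disease="example disease",
    outbreak=Outbreak(new_cases=150, population_at_risk=1000),
)
```

The compact version is easier to populate and query; the detailed version shows where the number comes from and which other entities it depends on.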
The conclusion? The analysis cleared up a few things, but with some imprecise and highly complex properties as part of the mix to determine what is (and is not) a pandemic, there will be more than one optimum/finish line for a particular pandemic. To arrive at something more specific than in the paper, the domain experts may need to carry out a bit more research or come up with a consensus on how to precisiate those properties that are currently still vague.
Last, but not least, on attending ICBO’22, which will be held from 25-28 September in Ann Arbor, MI, USA: it runs in hybrid format. At the moment, I’m looking into the logistics of trying to attend in person now that we don’t have the highly anticipated ‘winter wave’ like the one we had last year that thwarted my conference travel planning. While that takes extra time and resources to sort out, there’s the very thick silver lining that this also means we seem to be considerably closer to the real end of this pandemic (of the acute infections at least). According to the draft characterisation of pandemic, one indeed might argue it’s over.
References
[1] Keet, C.M. Exploring the Ontology of Pandemic. 13th International Conference on Biomedical Ontology (ICBO’22). CEUR-WS. Michigan, USA, September 25-28, 2022.
[2] Morens, DM, Folkers, GK, Fauci, AS. What Is a Pandemic? The Journal of Infectious Diseases, 2009, 200(7): 1018-1021.
Natural language generation applications have been ‘mainstreaming’ behind the scenes for the last couple of years, from automatically generating text for images, to weather forecasts, summarising news articles, digital assistants that mechanically blurt out text based on the structured information they have, and many more. Google, Reuters, BBC, Facebook – they all do it. Wikipedia is working on it as well, principally within the scope of Abstract Wikipedia to try to build a better multilingual Wikipedia [1] to reach more readers better. They all have some source of structured content – like data fetched from a database or spreadsheet, information from, say, a UML class diagram, or knowledge from some knowledge graph or ontology – and a specification as to what the structure of the sentence should be, typically with some grammar rules to at least prettify it, if not also being essential to generate a grammatically correct sentence [2]. That specification is written in templates that are then filled with content.
For instance, a simple rendering of a template may be “Each [C1] [R1] at least one [C2]” or “[I1] is an instance of [C1]”, where the things within the square brackets are variables standing in for content that will be fetched from the source, like a class, relationship, or individual. Linking these to a knowledge graph about universities, it may generate, e.g., “Each academic teaches at least one course” and “Joanne Soap is an instance of Academic”. To get the computer to do this, just “Each [C1] [R1] at least one [C2]” as the template won’t do: we need to tell it what the components are so that the program can process it to generate that (pseudo-)natural language sentence.
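As a minimal sketch of that slot-filling idea, assuming the bracketed variables are the only thing to process and that the content has already been fetched into a dictionary (the function name `fill_template` and the `bindings` argument are made up for this illustration):

```python
import re


def fill_template(template, bindings):
    """Replace each [slot] in the template with the value bound to that slot."""
    def substitute(match):
        # match.group(1) is the slot name without the square brackets.
        return bindings[match.group(1)]
    return re.sub(r"\[(\w+)\]", substitute, template)


class_sentence = fill_template(
    "Each [C1] [R1] at least one [C2]",
    {"C1": "academic", "R1": "teaches", "C2": "course"},
)
# class_sentence == "Each academic teaches at least one course"

instance_sentence = fill_template(
    "[I1] is an instance of [C1]",
    {"I1": "Joanne Soap", "C1": "Academic"},
)
# instance_sentence == "Joanne Soap is an instance of Academic"
```

This works for English sentences like these, where no word changes form; it does no linguistic processing at all, and an unbound slot simply raises a `KeyError`.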
Many years ago, we did this for multiple languages and used XML to specify the templates for the key aspects of the content. The structured input were conceptual data models in ORM in the DOGMA tool that had that verbalisation component [3]. As an example, the template for verbalising a mandatory constraint was as follows:
Besides demarcating the sentence and indicating the constraint, there’s fixed text within the <text> … </text> tags and there’s the variable part with the <Object… that declares that the name of the object type has to be fetched and the <Role… that declares that the name of the relationship has to be fetched from the model (well, more precisely in this care: the reading label), which were elements declared in an XML Schema. With the same example as before, where Academic is in the object index “0” position and Course in the “1” position (see [3] for details), the software would then generate “ – [Mandatory] Each Academic must teaches at least one Course.”
This can be turned up several notches by adding grammatical features to it in order to handle, among others, gender for nouns in German, because they affect the rendering of the ‘each’ and ‘one’ in the sample sentence, not to mention the noun classes of isiZulu and many other languages [4], where even the verb conjugation depends on the noun class of the noun that plays the role of subject in the sentence. Or you could add sentence aggregation to combine two templates into one larger one to generate more flowy text, like a “Joanne Soap is an academic who teaches at least one course”. Or change the application scenario or the machinery for how to deal with the templates. For instance, instead of those variables in the template + code elsewhere that does the content fetching and any linguistic processing, we could put part of that in the template specification. Then there are no variables as such in the template, but functions. The template specification for that same constraint in an ORM diagram might then look like this:
ConstraintIsMandatory {
“[Mandatory] Each ”
FetchObjectType(0)
“ must ”
MakeInfinitive(FetchRole(0))
“ at least one ”
FetchObjectType(1)}
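As a rough illustration of how such a function-based template could be executed, here is a sketch in Python. The `MODEL` dictionary stands in for the ORM diagram, and the tiny `INFINITIVES` lexicon stands in for a real grammar engine; both are assumptions made for this example only.

```python
# Hypothetical stand-in for the ORM model: object types and role readings by index.
MODEL = {"object_types": ["Academic", "Course"], "roles": ["teaches"]}

# Tiny illustrative lexicon; a real system would call a grammar engine instead.
INFINITIVES = {"teaches": "teach", "has": "have", "is": "be"}


def fetch_object_type(index):
    """Return the name of the object type at the given index in the model."""
    return MODEL["object_types"][index]


def fetch_role(index):
    """Return the reading label of the role at the given index in the model."""
    return MODEL["roles"][index]


def make_infinitive(verb):
    """Look the verb up in the lexicon; otherwise naively strip a final 's'."""
    if verb in INFINITIVES:
        return INFINITIVES[verb]
    return verb[:-1] if verb.endswith("s") else verb


def constraint_is_mandatory():
    """Evaluate the template: fixed text interleaved with function results."""
    parts = [
        "[Mandatory] Each ",
        fetch_object_type(0),
        " must ",
        make_infinitive(fetch_role(0)),
        " at least one ",
        fetch_object_type(1),
    ]
    return "".join(parts)


verbalisation = constraint_is_mandatory()
print(verbalisation)
```

With the infinitive function in the slot, the ungrammatical “must teaches” of the earlier output becomes “must teach”.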
If you want to go with newer technology than markup languages, you may prefer to specify it in JSON. If you’re excited about functional programming languages and see everything through the lens of functions, you can even turn the whole template specification into a bunch of only functions. Either way: there must be a specification of what those templates are permitted to look like, or: what elements can be used to make a valid specification of a template. This is so that the software works properly: it should neither spit out garbage nor halt halfway before returning anything. What is permitted in a template language can be specified by means of a model, such as an XML Schema or a DTD, a JSON artefact, or even an ontology [5], a formal definition in some notation of choice, or by defining a grammar (be it a CFG or in BNF notation), and anyhow with enough documentation to figure out what’s going on.
What might this look like in the context of Abstract Wikipedia? For the natural language generation aspects and its first proposal for the realiser architecture, the structured content to be rendered in a natural language sentence is fetched from Wikidata, as is the lexicographic data, and the functions to do the various computations are to come from/go in Wikifunctions. They’re then combined with the templates in various stages in the realiser pipeline to generate those sentences. But there was still a gap as to what those templates in this context may look like. Ariel Gutman, a google.org fellow working on Abstract Wikipedia, and I gave it a try and that proposal for a template language for Abstract Wikipedia is now accessible online for comment, feedback, and, if you happen to speak a grammatically rich language, an option to provide difficult examples so that we can check whether the language is expressive enough.
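To give a feel for what ‘a specification of what is permitted’ buys you, here is a small sketch in Python that checks a JSON template against a hand-written list of permitted elements. The element types (`text`, `object`, `role`) and their required keys are assumptions invented for this example; they are not the vocabulary of any of the template languages discussed here.

```python
import json

# Assumed vocabulary of a toy template language: element type -> required keys.
ALLOWED = {"text": {"value"}, "object": {"index"}, "role": {"index"}}


def validate_template(spec):
    """Return a list of problems; an empty list means the template is valid."""
    errors = []
    for position, element in enumerate(spec.get("elements", [])):
        kind = element.get("type")
        if kind not in ALLOWED:
            errors.append(f"element {position}: unknown type {kind!r}")
            continue
        missing = ALLOWED[kind] - set(element)
        if missing:
            errors.append(f"element {position}: missing {sorted(missing)}")
    return errors


good = json.loads("""
{"constraint": "mandatory",
 "elements": [{"type": "text", "value": "Each "},
              {"type": "object", "index": 0},
              {"type": "role", "index": 0}]}
""")
bad = json.loads("""
{"constraint": "mandatory",
 "elements": [{"type": "text"},
              {"type": "adjective", "index": 0}]}
""")
good_errors = validate_template(good)
bad_errors = validate_template(bad)
print(good_errors)
print(bad_errors)
```

Rejecting the second template before any sentence is generated is precisely what keeps the software from producing garbage or halting halfway.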
The proposal is – as any other proposal for a software system – some combination of theoretical foundations, software infrastructure peculiarities, reasoned and arbitrary design decisions, compromises, and time constraints. Here’s a diagram of the key aspects of the syntax, i.e., with the elements, how they relate, and the constraints holding between them, in ORM notation:
An illustrative diagram with the key features of the template language in ORM notation.
There’s also a version in CFG notation, and there are a few examples, each of which shows what the template looks like for verbalising one piece of information (Malala Yousafzai’s age) in Swedish, French, Hebrew, and isiZulu. Swedish is the simplest one, as would English or Dutch be, so let’s begin with that:
Persoon_leeftijd_nl(Entity,Age_in_years): “{Person(Entity) is
{Age_in_years} jaar.}”
Where the Person(Entity) fetches the name of the person (that’s identified by an identifier) and the Age_in_years fetches the age. One may like to complicate matters and add a conditional statement, like that any age less than 30 will render that last part not as jaar oud ‘years old’ but as jaar jong ‘years young’, but where that dividing line is, is a sensitive topic for some and I will let that rest. In any case, in Dutch, there’s no processing of the number itself to be able to render it in the sentence – 25 renders as 25 – but in other languages there is, for instance, in isiZulu. In that case, instead of a simple fetching of the number, we can put a function in the slot:
Where Lexeme(L686326) is the word for ‘year’ in isiZulu, unyaka, and for the rest, it first links the age rendering to the ‘year’ with the RelativeConcord() of that word, which practically fetches e- for the ‘years’ (iminyaka, noun class 4), then gets the copulative (ng in this case), and then the concord for the noun class of the noun of the number. Malala is in her 20s, which is amashumi amabili .. (noun class 6, which is computed via Cardinal(years)), and thus the function nounPrefix will fetch ama-. So, for Malala’s age data, Year_zu(years) will return iminyaka engama-25. That then gets processed with the rest of the Person_AgeYr_zu template, such as adding an U to the name by subj:Person(Entity), and later steps in the pipeline that take care of things like phonological conditioning (-na- + i- = –ne-), to eventually output UMalala Yousafzai uneminyaka engama-25. In other words: such a template indeed can be specified with the proposed template syntax.
There’s also a section in the proposal about how that template language then connects to the composition syntax so that it can be processed by the Wikifunctions Orchestrator component of the overall architecture. That helps hide a few complexities from the template declarations, but, yes, someone’s got to write those functions (or take them from existing grammar engines) that will take care of those more or less complicated processing steps. That’s a different problem to solve. You could also link it up with another realiser by means of a transformation to the input type it expects. For now, it’s the syntax of the declarative part for the templates.
If you have any questions or comments or suggestions on that proposal or interesting use cases to test with, please don’t hesitate to add something to the talk page of the proposal, leave a comment here, or contact either Ariel or me directly.
[5] Mahlaza, Z., Keet, C. M. ToCT: A Task Ontology to Manage Complex Templates. Proceedings of the Joint Ontology Workshops 2021, FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.
How do you know whether the ontology you developed or want to reuse is any good? It’s not a new question. It has been investigated quite a bit, and so the answer to that is not a short one. Based on a number of anecdotes, however, it seems ever more people are leaning toward a short answer along the line of “it’ll be fine if it can answer my competency questions”. That is most certainly not the right answer. Let me illustrate this.
Here’s a set of 5 competency questions and a bad ontology (with the OWL file), being a newly mutilated version of the African Wildlife Ontology [1] modified with a popular South African pastime: the braai, i.e., a barbecue.
CQ1: Which animals are served at a barbecue? (Sample answers: kudu, impala, warthog)
CQ2: What are the materials used for a barbecue? (Sample answers: tongs, skewers, poolbraai)
CQ3: What is the energy source for a braai device? (Sample answers: gas, coal)
CQ4: Which vegetables taste good with a braai? (Sample answers: tomatoes, onion, butternut)
CQ5: What food is eaten at a braai, or: what collection of edible things are offered?
The bad ontology does have answers to the competency questions, so a ‘CQs-only’ criterion for quality would suggest that the bad ontology is a good one. 100% good, even.
Why is it a bad one nonetheless?
That’s where years of methods, techniques, and tool development enter the stage (my textbook dedicates Section 5.2 to that): there are heuristics-based tips to prevent pitfalls [2] in general and for bio-ontologies with GoodOD, and there’s also a framework for ontology quality, OQuaRE [3], all of which aim to approach this issue of quality systematically. Let’s have a look at some of that.
Low-hanging fruit for a quick sanity check is to run the ontology through the Ontology Pitfall Scanner OOPS! [4]. Here’s the summary result, with two opened up that show what was flagged and why:
Mixing naming conventions is not neat. Examples of those in the badBBQ ontology are using CamelCase with PoolBraai but dash in tasty-plant and spaces converted to underscores in Food_Preparation_Material, and lower-case for some classes and upper case for others (PoolBraai and plant). An example of unconnected ontology element is Site: the idea is that if it isn’t really used anywhere in the ontology, then maybe it shouldn’t be in the ontology, or you forgot to add something there and OOPS! points you to that. Pitfall P11 may be contested, but if at all possible, one really should add domain and range to the object property so as to minimise unintended models and make the ontology closer to the reality (or understanding thereof) one aims to represent. For instance, surely eats should not have any of the braai equipment on the left-hand side in the domain position, because equipment does not eat: only organisms do.
At the other end of the spectrum are the philosophy and Ontology-inspired methods. The most well-known one is OntoClean [5], which is summarised in the textbook and there’s a tutorial for it in Appendix A. The, perhaps, most straightforward (and simplified) rule within that package is that anti-rigid classes cannot subsume rigid classes, or, in layperson terminology: (physical) entities cannot be subclasses of things that are roles that entities play. Person cannot be a subclass of Employee, since not all persons are always employees. For the badBBQ: Food is a role that an organism or part thereof plays in a certain context, and animals and plants are not always food—they are organisms (or part thereof) irrespective of the roles they may play (or, worded differently: of the roles that they are the ‘bearer of’).
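The simplified rule can be sketched as a check over subclass axioms in Python. This is only an illustration of that one rule, not the OntoClean method itself: the rigidity labels have to be assigned by a human modeller, and the axioms below are hypothetical ones in the spirit of the examples just given rather than a dump of the badBBQ OWL file. It also inspects only the asserted, direct subclass axioms.

```python
# Rigidity meta-properties as a modeller might assign them (illustrative).
RIGIDITY = {
    "Person": "rigid",
    "Animal": "rigid",
    "Plant": "rigid",
    "Organism": "rigid",
    "Employee": "anti-rigid",
    "Food": "anti-rigid",
}

# (subclass, superclass) pairs; hypothetical axioms for the example.
SUBCLASS_AXIOMS = [
    ("Person", "Employee"),
    ("Animal", "Food"),
    ("Plant", "Food"),
    ("Animal", "Organism"),
]


def rigidity_violations(axioms, rigidity):
    """Return the axioms where an anti-rigid class subsumes a rigid class."""
    return [
        (sub, sup)
        for sub, sup in axioms
        if rigidity.get(sub) == "rigid" and rigidity.get(sup) == "anti-rigid"
    ]


violations = rigidity_violations(SUBCLASS_AXIOMS, RIGIDITY)
for sub, sup in violations:
    print(f"{sub} (rigid) should not be a subclass of {sup} (anti-rigid)")
```

Note that Animal as a subclass of Organism passes, since both are rigid; it is only the role-like superclasses that get flagged.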
Then there are the methods and tools in-between these two extremes. Take, for instance, Advocatus Diaboli / PEW (Possible World Explorer) [6], which helps you find places where disjointness axioms ought to be added. This is in the same line of thinking as adding those domain and range axioms: it helps you to be more precise and find mistakes. For instance, Site and BraaiEquipment are definitely intended to be disjoint: some location cannot be a concrete physical object. Adding the disjointness axiom results in an error, however: the PoolBraai is unsatisfiable because it was declared to be both a subclass of Site and of BraaiEquipment. Pool braais do exist, as there are braais that can be placed in or next to a pool. What the issue is here, is that there are two different meanings of the same term: once that device for the barbecue and once the ‘braai area by the pool’. That is, they are two different entities, not one, and so they either have to appear as two different entities in the ontology, with different names, or the intended one chosen and one of the subsumption axioms removed.
I also put some ugly things in the description of Braai: both those two ways of the source of heating and the member. While one may say informally that a braai involves a collection of things (CQ5), ontologically, it won’t fly with ‘member’. Membership is not arbitrary. There are foundational (or top-level) ontologies whose developers already did the heavy lifting of ontological analysis of key elements and membership is one of them (see, among others, [7-9]). Such relations can simply be reused in one’s own ontology (e.g., imported from here), with their widely agreed-upon meaning; there’s even a tool to assist you with that [10]. If what you want is something else than that, then that relation is not membership but indeed something else. In this case, there are two options to fix it: 1) a braai as an event (rather than the device) will have objects (such as food, the tongs) participating in the event, or 2) for the braai as a device, it has accessories (related with hasAccessory, if you will), such as the tongs, and it is used for preparing (/barbecuing/cooking/frying) food (/meals/dinners).
Then the source of heating. The one-of construct (with the {…}) is relatively popular in conceptual data modelling when you know the set of values is ever only allowed to be that, like the days of the week. But in our open world of ontologies, more just might be added or removed. And, ontologically, coal, gas, and electricity are not individuals, so also that is incorrect. The other option, with heatedBy xsd:string, has its own set of problems, largely because data properties with their data types entail application implementation decisions that ought not to be in an ontology that is supposed to be usable across multiple applications (see Section 6.1 ‘attributions’ for a longer explanation). It can be addressed by granting them their rightful status as classes in the OWL file and relating that to the braai.
This is not an exhaustive analysis of the badBBQ ontology, nor even close to a full list of the latest methods and techniques for good ontology development, but I hope I’ve illustrated my point about not relying on just CQs as evaluation of your ontology. Sample changes made to the badBBQ are included in the improvedBBQ OWL file. Here’s a snapshot of the differences in the basic metrics (on the left). There’s room for another round of improvements, but I’ll leave that for later.
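What the reasoner reports here can be approximated with a few lines of Python, assuming the class hierarchy is available as a plain dictionary. This is a naive check over asserted subclass axioms only, not a description logic reasoner, and the Tongs class is added purely for contrast.

```python
# Asserted superclasses per class; Tongs is a hypothetical extra for contrast.
SUBCLASS_OF = {
    "PoolBraai": {"Site", "BraaiEquipment"},
    "Tongs": {"BraaiEquipment"},
}

# The disjointness axiom under consideration.
DISJOINT_PAIRS = [("Site", "BraaiEquipment")]


def ancestors(cls, subclass_of):
    """Collect all direct and indirect superclasses of cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes(subclass_of, disjoint_pairs):
    """Return classes that fall under both members of some disjoint pair."""
    result = []
    for cls in sorted(subclass_of):
        supers = ancestors(cls, subclass_of) | {cls}
        if any(a in supers and b in supers for a, b in disjoint_pairs):
            result.append(cls)
    return result


flagged = unsatisfiable_classes(SUBCLASS_OF, DISJOINT_PAIRS)
print(flagged)
```

Without the disjointness axiom nothing is flagged, which is the point of tools that nudge you to add such axioms: the mistake was there all along, but nothing could detect it.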
All this was not to say that competency questions are useless. They are not. They can be very useful to demarcate the scope of the ontology’s content, to keep on track with that since it’s easy to go astray from the intended scope once you begin or be subjected to scope creep, and to check whether at least the minimum content is in there somehow (and if not, why not). It’s the easy thing to check compared to the methods, techniques, and theory about good, sub-optimal, and bad ways of representing something. But such relative ease with CQs, perhaps unfortunately, does not mean it suffices to obtain a ‘good quality’ stamp of approval. Why the plethora of methods, techniques, theories, and tools aren’t used as often as they should, is a question I’d like to know the answer to, and may be a topic for another time.
[2] Keet, C.M., Suárez-Figueroa, M.C., Poveda-Villalón, M. Pitfalls in Ontologies and TIPS to Prevent Them. Knowledge Discovery, Knowledge Engineering and Knowledge Management: IC3K 2013 Selected Papers. A. Fred et al. (Eds.). Springer CCIS vol. 454, pp. 115-131, 2015. preprint
[3] Duque-Ramos, A. et al. OQuaRE: A SQuaRE-based approach for evaluating the quality of ontologies. Journal of research and practice in information technology, 2011, 43(2): 159-176
At some point in time, this COVID-19 pandemic will be over. Each time that thought crossed my mind, there was that little homunculus in my head whispering: but do you know the criteria for when it can be declared ‘over’? I tried to push that idea away by deferring it to a ‘whenever the WHO says it’s over’, but the thought kept nagging. Surely there would be a clear set of criteria lying on the shelf awaiting to be ticked off? Now, with the omicron peak well past us here in South Africa, and with comparatively little harm done in that fourth wave, there’s more talk publicly of perhaps having that end in sight – and thus also needing to know what the decisive factors are for calling it an end.
Then there are the anti-vaxxers. I know a few of them as well. One raged on with the argument that ‘they’ (the baddies in the governments in multiple countries) count the death toll entirely unfairly: “flu deaths count per season in a year, but for covid they keep adding up to the same counter from 2020 to make the death toll look much worse!! Trying to exaggerate the severity!” My response? Duh, well, yes they do count from early 2020, because a pandemic is one event and you count per event! Since the COVID-19 pandemic is a pandemic that is an event, we count from the start until the end – whenever that end is. It hadn’t even crossed my mind that someone wouldn’t count per event but, rather, wanted to chop up an event to pretend it would be smaller than it actually is.
So I did a little digging after all. What is the definition of a pandemic? What are its characteristics? Ontologically, what is that notion of ‘pandemic’, be it according to the analytic philosophers, ontologists, or modellers, or how it may be aligned to some of the foundational ontologies used in ontology engineering? From that, we then should be able to determine when all this COVID-19 has become a ‘is not a pandemic’ (whatever it may be classified into after the pandemic is over).
I could not find any works from the philosophers and theory-focussed ontologists that would have done the work for me already. (If there is and I missed it, please let me know.) Then, to start: what about definitions? There are some, like the recently updated one from dictionary.com where they tried to explain it from a language perspective, and lots of debate and misunderstandings in the debate about defining and describing a pandemic [1]. The WHO has descriptions, but not a clear definition, and pandemic phases. Formulations of definitions elsewhere vary slightly as well, except for the lowest common denominator: it’s a large epidemic.
Ontologically, that is an entirely unsatisfying answer. What is ‘large’? Some, like the CDC of the USA qualified it somewhat: it’s spread over the world or at least multiple regions and continents, and in those areas, it usually affects many people. The Australian Department of Health adds ‘new disease’ to it. Now we’re starting to get somewhere with inclusion of key properties of a pandemic. Kelly [2] adds another criterion to it, albeit focussed on influenza: besides worldwide/very wide area and affecting a large number of people, “almost simultaneous transmission takes place worldwide” and thus for a part of the world, there is an out-of-season influenza virus transmission.
Image credits: Miroslava Chrienova, taken from this page.
The best resource of all from an ontologist’s perspective, is a very clear, well-written, perspective article written by Morens, Folkers and Fauci – yes, that Fauci, from the NIAID – in the Journal of Infectious Diseases that, in their lack of wisdom, keeps the article paywalled (it somehow made it onto the webarchive with free access here anyhow). They’re experts and they trawled the literature to, if not define a pandemic, then at least describe it through trying to list the characteristics and the merits, or demerits, thereof. They are, in short, and with my annotation on what sort of attribute (/feature/characteristic, as loosely used term for now) it is:
1. Wide geographic extension; as aforementioned. That’s a scale or ‘fuzzy’ (imprecise in some way) feature, i.e., without a crisp cut-off point when ‘wide’ starts or ends.
2. Disease movement, i.e., there’s some transmission going on from place to place and that can be traced. That’s a yes/no characteristic.
3. High attack rates and explosiveness, i.e., lots of people affected in a short timespan. There’s no clear cut-off point on how fast the disease has to spread for counting as ‘fast spreading’, so a scale or fuzzy feature.
4. Minimal population immunity; while immunity is a “relative concept” (i.e., you have it to a degree), it’s a clear notion for a population when that exists or not; e.g., it certainly wasn’t there when SARS-CoV-2 started spreading. It is agnostic about how that population immunity is obtained. This may sound like a yes/no feature, perhaps, but is fuzzy, because practically we may not know and there’s for sure a grey area thanks to possible cross-immunity (natural or vaccine-induced) and due to the extent of immune-evasion of the infectious agent.
5. Novelty; the term speaks for itself, and clearly is a yes/no feature as well. It seems to me like ‘novel’ implies ‘minimal population immunity’, but that may not be the case.
6. Infectiousness; it’s got to be infectious, and so excluding non-infectious things, like obesity and smoking. Clear yes/no.
7. Contagiousness; this may be from person to person or through some other medium (like water for cholera). Perhaps as an attribute with categorical values; e.g., human-to-human, human-animal intermediary (e.g., fleas, rats), and human-environment (notably: water).
8. Severity; while the authors note that it’s not typically included, historically, the term ‘pandemic’ has been applied more often for diseases that are severe or with high fatality rates (e.g., HIV/AIDS) than for milder ones. Fuzzy concept for which a scale could be used.
And, at the end of their conclusions, “In summary, simply defining a pandemic as a large epidemic may make ultimate sense in terms of comprehensibility and consistency. We also suggest that use of the term is best reserved for infectious diseases that share many of the same epidemiologic features discussed above” (p1020), largely for simplifying it to the public, but where scientists and public health officials would maintain their more precise consensus understanding of the complex scientific concept.
Those imprecise/fuzzy properties and lack of clarity of cut-off points bug the epidemiologists, because they lead to different outcomes of their prediction models. From my ontologist viewpoint, however, we’re getting somewhere with these properties: SARS-CoV-2, at least early in 2020 when the pandemic was declared, ticked all those eight boxes and so any reasoner would classify the disease it causes, COVID-19, as a pandemic. Now, in early 2022 with/after the omicron variant of concern? Of those eight properties, numbers 4 and 8 much less so, and number 5 is the million-dollar-question two years into the pandemic. Either way, considering all those properties of a pandemic that have been reviewed here so far, calling an end to the pandemic is not as trivial as it initially may have sounded. WHO’s “post pandemic period” phase refers to “levels seen for seasonal influenza in most countries with adequate surveillance”. That is a clear specification operationally.
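To see why the missing cut-off points matter, here is a toy sketch in Python. Everything in it is an assumption for illustration: the scores are made up and are not measurements of any real outbreak, contagiousness is flattened to a yes/no value, a single shared cut-off is used for all fuzzy features, and treating all eight features as jointly required is only one possible reading.

```python
from dataclasses import dataclass


@dataclass
class OutbreakProfile:
    # Fuzzy features are scored between 0 and 1; crisp ones are booleans.
    wide_geographic_extension: float
    disease_movement: bool
    attack_rate_and_explosiveness: float
    minimal_population_immunity: float
    novelty: bool
    infectiousness: bool
    contagiousness: bool  # simplification of the categorical attribute
    severity: float


def ticks_all_eight_boxes(profile, cutoff):
    """True if every crisp feature holds and every fuzzy score reaches cutoff."""
    crisp = [
        profile.disease_movement,
        profile.novelty,
        profile.infectiousness,
        profile.contagiousness,
    ]
    fuzzy = [
        profile.wide_geographic_extension,
        profile.attack_rate_and_explosiveness,
        profile.minimal_population_immunity,
        profile.severity,
    ]
    return all(crisp) and all(score >= cutoff for score in fuzzy)


# Made-up scores for a hypothetical outbreak, not real data.
example = OutbreakProfile(0.9, True, 0.8, 0.6, True, True, True, 0.55)
lenient = ticks_all_eight_boxes(example, cutoff=0.5)
strict = ticks_all_eight_boxes(example, cutoff=0.7)
print(lenient, strict)
```

The same hypothetical outbreak counts as a pandemic under the lenient cut-off and not under the strict one, which is the disagreement in miniature.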
Ontologically, if we were to take these eight properties at face value, the next question then is: are all eight of them combined the necessary and sufficient conditions, or are some of them ‘more essential’ for calling it a pandemic, and the other ones would then be optional features? Etymologically, the pan in pandemic means ‘all’, so then as long as it rages across the world, it would remain a pandemic?
Now that things get ontologically more interesting, the ontological status. Informally, an epidemic is an occurrence (read: instance/individual entity) of an infectious disease at a particular time (read: an unspecified duration of time, not an instant) and that affects some community (be that a community of humans, chickens, or whatever other organisms that live in a community), and pandemic, as a minimum, extends the region that it affects and the number of organisms infected, and then some of those other features listed above.
A pandemic is in the same subject domain as an infectious disease, and so we can consult the OBO Foundry and see what they did, or first start with just the main BFO categories for a general sense of what it would align to. With our BFO Classifier, I get as far as process:
As to the last (optional) question: could one argue that a pandemic is a collection of disjoint part-processes? Not if the part-processes all have to be instances of different types of processes. The other loose end is that BFO’s processes need not have an end, but pandemics do. For now, what’s the most relevant is that the pandemic is distinctly in the occurrent branch of BFO, and occurrents have temporal parts.
Digging further into the OBO Foundry, they indeed did quite some work on infectious diseases and COVID-19 already [4], and following the trail from their Figure 1 (see below): disposition is a realizable entity is a specifically dependent continuant is a continuant; infectious disease course is a disease course is a process is an occurrent; and “realizable entity comes to be realized in the course of the process”.
Source: Figure 1 of [4].
In that approach, COVID-19 is the infectious disease being realised in the pandemic we’re in at the moment, with multiple infectious disease courses in humans and a few other animals. But where does that leave us with pandemic? Inspecting the Infectious Disease Ontology (IDO) since the article does not give a definition, infectious disease epidemic and infectious disease pandemic are siblings of infectious disease course, where disease course is described as “Totality of all processes through which a given disease instance is realized.” (presumably the totality of all processes in one human where there’s an instance of, say, COVID-19). Infectious disease pandemic is an atomic class with no properties or formal definitions, but there’s an annotation with a definition. Nice try; won’t work.
What’s the problem? There are three. The first, and key, problem is that pandemic is stated to be a collection of epidemics, but i) collections of individual things (collectives, aggregates) are categorically different kinds of entities than individual things, and ii) epidemic and pandemic are not categorically different things. Not just that, there’s a fiat boundary (along a continuum, really) between an epidemic evolving into becoming a pandemic and then subsiding into separate epidemics. A comparatively minor, or at least secondary, issue is how to determine the boundary of one epidemic from another to be able to construct a collective, since, more fundamentally: what are the respective identities of those co-occurring epidemics? One can’t get collections of things we can’t quite identify. For instance, is it one epidemic in two places that it jumped to, or do they count as two then, and what when two separate ones touch and presumably merge to become one large one? The third issue, and also minor for the current scope, is the definition for epidemic in the ontology’s annotation field, talking of “statistically significant increase in the infectious disease incidence” as determiner, but actually it’s based on a threshold.
Let’s try DOLCE as foundational ontology and see what we get there. With the DOLCE Decision Diagram [5], pandemic ends up as: Is [pandemic] something that is happening or occurring? Yes (perdurant – like BFO’s occurrent). Are you able to be present or participate in [a pandemic]? Yes (event). Is [a pandemic] atomic, i.e., has no subdivisions of it and has a definite end point? No (accomplishment). Not the greatest word choice to say that a pandemic is an accomplishment – almost right up there with the DOLCE developers’ example that death is an achievement – but it sure is an accomplishment from the perspective of the infectious agent. The nice thing of dolce:accomplishment over bfo:process is that it entails there’s a limited duration to it (DOLCE also has process that also can go on and on and on).
The last question in both decision diagrams made me pause. The instances of COVID-19 going around could possibly be going around after the pandemic is over, uninterrupted in the sense that there is no time interval where no-one is infected with SARS-CoV-2, or it could be interrupted with later flare-ups if it’s still SARS-CoV-2 and not substantially different, but the latter is a grey area (is it a flare-up or a COVID-2xxx?). The latter is not our problem now. The former would not be in contradiction with pandemic as accomplishment, because COVID-19-the-pandemic and COVID-19-the-disease are two different things. (How those two relate can be a separate story.)
To recap, we have pandemic as an occurrent/perdurant entity unfolding in time and, depending on one’s foundational ontology, something along the line of accomplishment. For an epidemic to be classified as a pandemic, there are a varying number of features that aren’t all crisp and for which the fuzzy boundaries haven’t been set.
To sketch this diagrammatically (hence, informally), it would look something like this:
where the clocks and the DEX and DEV arrows are borrowed from the TREND temporal conceptual data modelling language [6]: Epidemic and Pandemic are temporal entities, DEX (+dashed arrow) verbalised is “An epidemic may also become a pandemic” and DEV (+solid arrow): “Each pandemic must evolve to epidemic ceasing to be a pandemic” (hiding the logic at the back-end).
It isn’t a full answer as to what a pandemic is ontologically – hence, the title of the blog post still has that question mark – but we can already clear up the two issues from the introduction of this post, as follows.
Consequences
We already saw that with any definition, description, and list of properties proposed, there is no unambiguous and certain definite endpoint to a pandemic that can be deterministically computed. Well, other than the extremes of either 100% population immunity or the affected species is extinct such that there is no single instance of a disease course (in casu, of COVID-19) either way. Several measured values of the scales for the fuzzy variables will go down and immunity increase (further) as the pandemic unfolds, and then the pandemic phase is over eventually. Since there are no thresholds defined, there likely will be people who are forever disagreeing on when it can be called over. That is inherent in the current state of defining what a pandemic is. Perhaps it now also makes you appreciate the somewhat weak operational statement of the WHO post-pandemic period phase – specifying anything better is fraught with difficulties to date and unlikely to ever make everybody happy.
There’s that flawed argument of the anti-vaxxer to deal with still. Flu epidemics last about 10 weeks, on average [7]. They happen in the winter, which in the northern hemisphere may cross a New Year (although I can’t remember that has ever happened in all the years I’ve lived in Europe). And yet, they also count per epidemic and not per calendar year. School years run from September to July, which provides a different sort of year, and the flu epidemics there are typically reported as ‘flu season 2014/2015’, indicating just that. Because those epidemics are short-lived, you typically get only one of those in a year, and in-season only.
Contrast this with COVID-19: it’s been going round and round and round since late December 2019, with waves and lulls for all countries, regions, and continents, but never did it stop for a season in whole regions or continents. Most countries come close to a stop during a lull at some point between the waves; for South Africa, according to worldometers, the lowest 7-day moving average since the first wave in 2020 was 265 recorded infections per day, on 7 November 2021. Any out-of-season waves? Oh yes – beta came along in summer last year and it was awful; at least for this year’s summer we got a relatively harmless omicron. And it’s not just South Africa that has been having out-of-season spikes. Point is, the COVID-19 pandemic ‘accomplishment’ wasn’t over within the year – neither a calendar year nor a northern hemisphere school year – and so we keep counting with the same counter for as long as the event takes until the pandemic as event is over. There’s no nefarious plot of evil controlling scaremongering governments, just a ‘demic that takes a while longer than we’ve been used to until 2019.
In closing, it is, perhaps, not the last word on the ontological status of pandemic, but I hope the walkthrough provided a little bit of clarity in the meantime already.
[5] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. 2013.
A photo of the city where it was supposed to take place: Leiden (NL) (Source: here)
It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from that promise that this time it will be solved once and for all but where practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.
Keynotes
The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool [1], a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help find information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome) or part thereof, and using UMLS and MetaMap to help find more connections. A well-known domain has well-known terminology with which to formulate very specific queries over academic literature; that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.
The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease map that culminated from a lot of effort by many research communities with lots of online data resources [2]; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between information in those diagrams and the databases [3]. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically [4]. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.
Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19, is not at all clear in Africa, but ideally would be. Let me illustrate this: the world-wide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as much; that only about 10% of the COVID-19-positives are detected by the diagnostics tests—the rest doesn’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.
The last keynote at the end of the conference was by Amit Sheth, with the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on: you name it, somewhere it’s used in the systems.
Papers
Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’ and there’s no access to the papers yet to look up more details to check my notes and pad them. I’m assuming the papers will be online accessible soon (CEUR-WS again presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.
Ruduan Plug elaborated on FAIR and the GDPR, and on querying over integrated data, within the above-mentioned VODAN-Africa project [5]. He also noted that South Africa’s PoPIA is stricter than the GDPR. I suspect that is due to the cross-border restrictions on the flow of data that the GDPR won’t have. (PoPIA is based on the GDPR principles, btw.)
Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid [6], which I have to leave here with a cliff-hanger, because I didn’t manage to make a note of which one won because I had to go to a meeting that we were already starting later because of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than there is available to date).
Besides papers related to scientific research into all things covid, another recurring topic was FAIR data—whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing a FAIR level makes more sense before-and-after changing a single resource compared to pitting different vocabularies against each other, “FAIR enough” can be enough (cf. demanding 100% compliance) [7], and a FAIR vocabulary does not imply that it is also a good quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis on the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of activities of ontology engineering, there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO [8]. What to do next with these insights remains to be seen.
Last, but not least—to try to keep the blog post at a sort of just about readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL Ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do [9]. Even when they do, it’s just with labels rather than any of the language models, be they the ontolex-lemon from the W3C community group or another, and if so, mainly French and German.
Source: [9]
Does it matter? It depends on what your aims are. We use mainly the motivation of ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that also the related bio-ontologies can do with coordinated multilingualism so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.
Other
There were also posters where we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantic-driven tools, from sharing data, to searching data, to integrating data, and analysis to develop the theory figuring out all its workings.
The conference was supposed to be mainly in person, but then on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events effective until, would you believe, 14 Jan, one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!
[5] Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[6] Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[7] Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[8] César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
How to align your domain ontology to a foundational ontology? It’s a well-known question, and one that I’ve looked into before as well. In some of that earlier work, we used DOLCE as the foundational ontology to align to. We devised the DOLCE decision diagram as part of the FORZA method to assist with the alignment process and implemented that in the MoKI ontology development tool [1]. MoKI is no more, but the theory and the algorithm’s design approach still stand. Instead of re-implementing it as a Protégé plugin and having it go defunct in a few years again (due to incompatible version upgrades, say), it sounded like more fun to design one for BFO and make a stand-alone tool out of it. And that design and the evaluation thereof is precisely what two of my ontology engineering course students, Chiadika Emeruem and Steve Wang, did for their mini-project of the course. That was then finalised and implemented in a tool for general use as part of the DOT4D project extension for my (award-winning) OE textbook afterward.
More precisely, as first part, there’s a diagram specifically for BFO – well, for one of its 2.0-ish versions in existence at least. Deciding on which version to use and what would be good questions was not as trivial as it may sound. While the questions seem to work (as evaluated with several ontologies), it might still be of use to set up an experiment to assess usability from a modeller’s viewpoint.
BFO ‘decision diagram’ to assist trying to align one’s class of a domain or core ontology to BFO (click to enlarge, or navigate to the user guide at https://bfo-classifier.github.io/)
Be this as it may, this decision diagram was incorporated into the tool that wraps around it with a nice interface with user guidance and feedback, and it has the option to load an ontology and save the alignment into the ontology (along with BFO). The decision tree itself is stored as a separate XML file so that it easily can be replaced with any update thereto, be it to reflect changes in question formulation or to adjust it to some later version of BFO. The stand-alone tool is a jar file that can be downloaded from the GitHub repo, and the repo also has the source code that may be used/adapted (i.e., has an open source licence). There’s also a user guide with explanations and screenshots. Here’s another screenshot of the tool in action:
Example of the BFO classifier in use, trying to align CODO’s ‘Disease’ to BFO, the trail of questions answered to get to ‘Disposition’, and the subsumption axiom that can be added to the ontology.
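To give a rough idea of how a decision tree kept in a separate XML file can drive such a question-and-answer walk, here is a minimal sketch in Python. The XML layout, the element and attribute names, and the two simplified questions are all made up for illustration; they are not the file format or the question wording used by the actual tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, invented for this sketch; the real decision-tree
# file of the tool may be structured quite differently.
TREE_XML = """
<node question="Does the entity persist through time while keeping its identity?">
  <yes>
    <node question="Does it depend on another entity to exist?">
      <yes><leaf bfo="specifically dependent continuant"/></yes>
      <no><leaf bfo="independent continuant"/></no>
    </node>
  </yes>
  <no><leaf bfo="occurrent"/></no>
</node>
"""

def classify(node, answers):
    """Walk the tree with a sequence of 'yes'/'no' answers.

    Returns the BFO category at the leaf reached and the trail of
    (question, answer) pairs that led there.
    """
    trail = []
    answers = iter(answers)
    while node.tag == "node":
        answer = next(answers)
        trail.append((node.get("question"), answer))
        # <yes> and <no> each wrap exactly one child: a <node> or a <leaf>
        node = node.find(answer)[0]
    return node.get("bfo"), trail

root = ET.fromstring(TREE_XML)
category, trail = classify(root, ["yes", "yes"])
for question, answer in trail:
    print(f"{question} -> {answer}")
print("Suggested BFO category:", category)
```

Because the questions live in the data rather than in the code, swapping in a reworded question or a tree for a later BFO version would only require replacing the XML, which is the design rationale mentioned above.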
If you have any questions, please feel free to contact either of us.
References
[1] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.
With increasing student numbers, but not a matching increase in funding for schools and universities, and the desire to automate certain tasks anyhow, there have been multiple efforts to generate and mark educational exercises automatically. There are a number of efforts for the relatively easy tasks, such as for learning a language, which range from the entry level with simple vocabulary exercises to advanced ones of automatically marking essays. I’ve dabbled in that area as well, mainly with 3rd-year capstone projects and 4th-year honours student projects [1]. Then there’s one notch up with fact recall and concept meaning recall questions, and further steps up, such as generating multiple-choice questions (MCQs) with not just obviously wrong distractors but good distractors to make the question harder. There’s quite a bit of work done on generating those MCQs in theory and in tooling, notably [2,3,4,5]. As a recent review [6] also notes, however, there are still quite a few gaps. Among others, these concern the generalisability of theory and systems (can you plug any structured data or knowledge source into question templates?) and the type of questions. Most of the research on ‘not-so-hard to generate and mark’ questions has been done for MCQs, but there are multiple other types of questions that also should be doable to generate automatically, such as true/false, yes/no, and enumerations. For instance, with an axiom in an ontology or knowledge graph stating that impalas live on land, a suitable question generation system may then generate “Does an impala live on land?” or “True or false: An impala lives on land.”, among other options.
We set out to make a start with tackling those sorts of questions, for the type-level information from an ontology (cf. facts in the ABox or knowledge graph). The only work done there, when we started with it, was for the slick and fancy Inquire Biology [5], but its tech was not available for inspection and use, so we had to start from scratch. In particular, we wanted to find a way to plug any ontology into a system and generate those other, non-MCQ, types of educational questions (10 in total), where the questions generated are at least grammatically good and for which the answers can also be generated automatically, so that we get to automated marking as well.
Different types of questions and the answers they have to provide put different prerequisites on the content of the ontology, requiring certain types of axioms. We specified those for 10 types of educational questions.
Three strategies of question generation were devised: a ‘simple’ one that takes the vocabulary and axioms and plugs them into a template, one guided by some more semantics in the ontology (a foundational ontology), and one that didn’t really care about either but rather took a natural language approach. Variants were added to cater for differences in naming and other variations, amounting to 75 question templates in total.
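As a toy illustration of the template idea for those two question types, consider the following Python sketch. The dictionary is a hand-written stand-in for a type-level axiom read from an ontology, and the helper functions are deliberately naive assumptions of mine (regular verbs and singular nouns only); it is not how any particular system does it.

```python
# Simplified stand-in for a type-level axiom: "each impala lives on land".
# In a real system this would be extracted from the ontology, not hand-written.
axiom = {"subject": "impala", "verb": "live", "preposition": "on", "object": "land"}

def article(noun):
    """Pick 'a' or 'an' by first letter only (naive; ignores cases like 'hour')."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def third_person(verb):
    """Naive third person singular; only right for regular verbs like 'live'."""
    return verb + "s"

def yes_no_question(ax):
    """Template: Does <a/an> <subject> <verb> <preposition> <object>?"""
    s = ax["subject"]
    return f"Does {article(s)} {s} {ax['verb']} {ax['preposition']} {ax['object']}?"

def true_false_question(ax):
    """Template: True or false: <A/An> <subject> <verb+s> <preposition> <object>."""
    s = ax["subject"]
    return (f"True or false: {article(s).capitalize()} {s} "
            f"{third_person(ax['verb'])} {ax['preposition']} {ax['object']}.")

print(yes_no_question(axiom))
print(true_false_question(axiom))
```

Since the axiom is asserted in the source, the expected answer to both generated questions is known as well (“yes” and “true”), which is what makes automated marking possible in principle.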
The human evaluation with questions generated from three ontologies showed that while the semantics-based one was slightly better than the baseline, the NLP-based one gave the best results on syntactic and semantic correctness of the sentences (according to the human evaluators).
It was tested with several ontologies in different domains, and the generalisability looks promising.
Graphical Abstract (made by Toky Raboanary)
To be honest to those getting their hopes up: there are some issues that will keep it from ever reaching ‘100% fabulous!’ if one still wants to design a system that should be able to take any ontology as input. A main culprit is naming of elements in the ontology, which varies widely across ontologies. There are several guidelines for how to name entities, such as using camel case or underscores, and those things easily can be coded into an algorithm, indeed, but developers don’t stick to them consistently or there’s an ontology import that uses another naming convention so that there likely will be a glitch in the generated sentences here or there. Or they name things within the context of the hierarchy where they put the class, but in the question it is out of that context and then looks weird or is even meaningless. I moaned about this before; e.g., ‘American’ as the name of the class that should have been named ‘American Pizza’ in the Pizza ontology. Or the word used for the name of the class can have different POS tags such that it makes the generated sentence hard to read; e.g., ‘stuff’ as a noun or a verb.
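The part that easily can be coded, handling camel case and underscores, might look roughly like the sketch below. The function name and the exact rules are my own illustrative choices, and it only covers the mechanical side of the problem.

```python
import re

def split_name(name):
    """Turn an ontology element name into a list of lowercase words.

    Handles underscores, hyphens, and camel case; acronyms and other
    conventions are not handled in this sketch.
    """
    name = name.replace("_", " ").replace("-", " ")
    # Insert a space wherever a lowercase letter or digit is followed by an
    # uppercase letter, e.g. 'AmericanPizza' -> 'American Pizza'.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return name.lower().split()

print(split_name("AmericanPizza"))
print(split_name("lives_on"))
print(split_name("hasTopping"))
```

No amount of string splitting recovers what was never in the name, though: a class called just ‘American’ still comes out as ‘american’, so the out-of-context naming and POS ambiguity problems described above remain.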