When can we declare the covid-19 pandemic to be over? I mulled over that earlier in January this year when the omicron wave was fizzling out in South Africa, and wrote a blog post as a step toward trying to figure it out, and a short general public article was published by The Conversation (republished widely, including by The Next Web). That was not the end of it. In parallel – or, more precisely, behind the scenes – that ontological investigation did happen scientifically and in much more detail.
First, it includes a proper discussion of how the 9 relevant domain ontologies have pandemic represented in the ontology – the same as epidemic, a sibling thereof, or as a subclass, and why – and what sort of generic top-level entity it is asserted to be, and a few more scientific references by domain experts.
Second, besides the two foundational ontologies whose alignment I discussed in the blog post (DOLCE and BFO), I tried with five more foundational ontologies that were selected for meeting several criteria: BORO, GFO, SUMO, UFO, and YAMATO. That mainly took up a whole lot more time, but it didn’t add substantially to insights into what kind of entity pandemic is. It did, however, make clear that manually aligning is hard, and that it is difficult to get an alignment as precise as it ought to be, and may need to be, for several reasons (elaborated on in the paper).
Third, I dug deeper into the eight characteristics of pandemics according to the review by Morens, Folkers and Fauci (yes, him, from the NIAID) and disentangled what’s really going on with those, besides already having noted that several of them are fuzzy. Some of the characteristics aren’t really a property of pandemic itself, but of closely related entities, such as the disease (see table below). There are so many intertwined entities and relations, in fact, that one could very well develop an ontology of just pandemics, rather than have it only as a single class in an ontology as is now the case. For instance, there has to be a high attack rate, but ‘attack rate’ itself relies on the fact that there is an infectious agent that causes a disease and on the R (reproduction) number that, in turn, is a complex thing that takes into account factors including susceptibility to infection, social dynamics of a population, and the ability to measure infections.
Finally, there are different ways to represent all the knowledge, or a relevant part thereof, as I also elaborated on in my Bio-Ontologies keynote last month. For instance, the attack rate could be squashed into a single data property if the calculation is done elsewhere and you don’t care how it is calculated, or it can be represented in all its glorious detail for the sake of it or for getting a clearer picture of what goes into computing the R number. For a scientific ontology, the latter is obviously the better choice, but there may be scenarios where the former is more practical.
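Those two modelling options can be contrasted in a minimal Python sketch. The class and field names are hypothetical and not taken from any ontology, and the formula (new cases divided by population at risk) is only the common textbook simplification of attack rate, not the full machinery that goes into the R number.

```python
from dataclasses import dataclass


@dataclass
class PandemicCoarse:
    """Option 1: attack rate as a single opaque value, computed elsewhere."""
    name: str
    attack_rate: float  # how this value was obtained is not recorded


@dataclass
class PandemicDetailed:
    """Option 2: keep the ingredients and derive the value on demand."""
    name: str
    new_cases: int
    population_at_risk: int

    @property
    def attack_rate(self) -> float:
        # Textbook simplification (an assumption here): cases / population at risk.
        if self.population_at_risk <= 0:
            raise ValueError("population_at_risk must be positive")
        return self.new_cases / self.population_at_risk


# Toy values, purely illustrative.
coarse = PandemicCoarse("toy outbreak", 0.25)
detailed = PandemicDetailed("toy outbreak", new_cases=250, population_at_risk=1000)
```

Both objects answer the question "what is the attack rate?", but only the second one lets you inspect, or dispute, what went into it.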
The conclusion? The analysis cleared up a few things, but with some imprecise and highly complex properties as part of the mix to determine what is (and is not) a pandemic, there will be more than one optimum/finish line for a particular pandemic. To arrive at something more specific than in the paper, the domain experts may need to carry out a bit more research or come up with a consensus on how to precisiate those properties that are currently still vague.
Last, but not least, on attending ICBO’22, which will be held from 25-28 September in Ann Arbor, MI, USA: it runs in hybrid format. At the moment, I’m looking into the logistics of trying to attend in person now that we don’t have the highly anticipated ‘winter wave’ like the one we had last year and that thwarted my conference travel planning. While that takes extra time and resources to sort out, there’s that very thick silver lining that that also means we seem to be considerably closer to that real end of this pandemic (of the acute infections at least). According to the draft characterisation of pandemic, one indeed might argue it’s over.
At some point in time, this COVID-19 pandemic will be over. Each time that thought crossed my mind, there was that little homunculus in my head whispering: but do you know the criteria for when it can be declared ‘over’? I tried to push that idea away by deferring it to a ‘whenever the WHO says it’s over’, but the thought kept nagging. Surely there would be a clear set of criteria lying on the shelf waiting to be ticked off? Now, with the omicron peak well past us here in South Africa, and with comparatively little harm done in that fourth wave, there’s more talk publicly of perhaps having that end in sight – and thus also needing to know what the decisive factors are for calling it an end.
Then there are the anti-vaxxers. I know a few of them as well. One raged on with the argument that ‘they’ (the baddies in the governments in multiple countries) count the death toll entirely unfairly: “flu deaths count per season in a year, but for covid they keep adding up to the same counter from 2020 to make the death toll look much worse!! Trying to exaggerate the severity!” My response? Duh, well, yes they do count from early 2020, because a pandemic is one event and you count per event! Since the COVID-19 pandemic is a pandemic that is an event, we count from the start until the end – whenever that end is. It hadn’t even crossed my mind that someone wouldn’t count per event but, rather, wanted to chop up an event to pretend it would be smaller than it actually is.
So I did a little digging after all. What is the definition of a pandemic? What are its characteristics? Ontologically, what is that notion of ‘pandemic’, be it according to the analytic philosophers, ontologists, or modellers, or how it may be aligned to some of the foundational ontologies used in ontology engineering? From that, we then should be able to determine when all this COVID-19 has become a ‘is not a pandemic’ (whatever it may be classified into after the pandemic is over).
I could not find any works from the philosophers and theory-focussed ontologists that would have done the work for me already. (If there are and I missed them, please let me know.) Then, to start: what about definitions? There are some, like the recently updated one from dictionary.com where they tried to explain it from a language perspective, and there is lots of debate, and misunderstanding within that debate, about defining and describing a pandemic. The WHO has descriptions, but not a clear definition, and pandemic phases. Formulations of definitions elsewhere vary slightly as well, except for the lowest common denominator: it’s a large epidemic.
Ontologically, that is an entirely unsatisfying answer. What is ‘large’? Some, like the CDC of the USA, qualified it somewhat: it’s spread over the world or at least multiple regions and continents, and in those areas, it usually affects many people. The Australian Department of Health adds ‘new disease’ to it. Now we’re starting to get somewhere with inclusion of key properties of a pandemic. Kelly adds another criterion to it, albeit focussed on influenza: besides worldwide/very wide area and affecting a large number of people, “almost simultaneous transmission takes place worldwide” and thus for a part of the world, there is an out-of-season influenza virus transmission.
The best resource of all, from an ontologist’s perspective, is a very clear, well-written perspective article by Morens, Folkers and Fauci – yes, that Fauci from the NIAID – in the Journal of Infectious Diseases that, in its lack of wisdom, keeps the article paywalled (it somehow made it onto the webarchive with free access here anyhow). They’re experts and they trawled the literature to, if not define a pandemic, then at least describe it through trying to list the characteristics and the merits, or demerits, thereof. They are, in short, and with my annotation on what sort of attribute (/feature/characteristic, as a loosely used term for now) it is:
1. Wide geographic extension; as aforementioned. That’s a scale or ‘fuzzy’ (imprecise in some way) feature, i.e., without a crisp cut-off point when ‘wide’ starts or ends.
2. Disease movement, i.e., there’s some transmission going on from place to place and that can be traced. That’s a yes/no characteristic.
3. High attack rates and explosiveness, i.e., lots of people affected in a short timespan. There’s no clear cut-off point on how fast the disease has to spread for counting as ‘fast spreading’, so a scale or fuzzy feature.
4. Minimal population immunity; while immunity is a “relative concept” (i.e., you have it to a degree), it’s a clear notion for a population when that exists or not; e.g., it certainly wasn’t there when SARS-CoV-2 started spreading. It is agnostic about how that population immunity is obtained. This may sound like a yes/no feature, perhaps, but is fuzzy, because practically we may not know and there’s for sure a grey area thanks to possible cross-immunity (natural or vaccine-induced) and due to the extent of immune-evasion of the infectious agent.
5. Novelty; the term speaks for itself, and clearly is a yes/no feature as well. It seems to me like ‘novel’ implies ‘minimal population immunity’, but that may not be the case.
6. Infectiousness; it’s got to be infectious, and so excluding non-infectious things, like obesity and smoking. Clear yes/no.
7. Contagiousness; this may be from person to person or through some other medium (like water for cholera). Perhaps as an attribute with categorical values; e.g., human-to-human, human-animal intermediary (e.g., fleas, rats), and human-environment (notably: water).
8. Severity; while the authors note that it’s not typically included, historically, the term ‘pandemic’ has been applied more often for diseases that are severe or with high fatality rates (e.g., HIV/AIDS) than for milder ones. Fuzzy concept for which a scale could be used.
And, at the end of their conclusions, “In summary, simply defining a pandemic as a large epidemic may make ultimate sense in terms of comprehensibility and consistency. We also suggest that use of the term is best reserved for infectious diseases that share many of the same epidemiologic features discussed above” (p1020), largely for simplifying it to the public, but where scientists and public health officials would maintain their more precise consensus understanding of the complex scientific concept.
Those imprecise/fuzzy properties and lack of clarity of cut-off points bug the epidemiologists, because they lead to different outcomes of their prediction models. From my ontologist viewpoint, however, we’re getting somewhere with these properties: SARS-CoV-2, at least early in 2020 when the pandemic was declared, ticked all those eight boxes and so any reasoner would classify the disease it causes, COVID-19, as a pandemic. Now, in early 2022 with/after the omicron variant of concern? Of those eight properties, numbers 4 and 8 much less so, and number 5 is the million-dollar-question two years into the pandemic. Either way, considering all those properties of a pandemic that have passed in review here so far, calling an end to the pandemic is not as trivial as it initially may have sounded. WHO’s “post pandemic period” phase refers to “levels seen for seasonal influenza in most countries with adequate surveillance”. That is a clear specification operationally.
Ontologically, if we were to take these eight properties at face value, the next question then is: are all eight of them combined the necessary and sufficient conditions, or are some of them ‘more essential’ for calling it a pandemic, and the other ones would then be optional features? Etymologically, the pan in pandemic means ‘all’, so then as long as it rages across the world, it would remain a pandemic?
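To make the difference between those two readings concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the feature names paraphrase the eight characteristics, the 0..1 scores and the 0.5 thresholds are made up (no such thresholds have been defined by the domain experts), and the choice of which features count as ‘essential’ is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional

# Yes/no characteristics.
CRISP = ("disease_movement", "novelty", "infectiousness")
# Fuzzy characteristics, scored on a made-up 0..1 scale (higher = more so).
FUZZY = ("wide_geographic_extension", "high_attack_rate",
         "minimal_population_immunity", "severity")
ALL_FEATURES = CRISP + FUZZY + ("contagiousness",)


@dataclass
class Outbreak:
    crisp: dict                    # feature name -> bool
    fuzzy: dict                    # feature name -> score in [0, 1]
    contagion_mode: Optional[str]  # categorical, e.g. "human-to-human"


def holds(o: Outbreak, name: str, thresholds: dict) -> bool:
    """Does a single characteristic hold for this outbreak?"""
    if name == "contagiousness":
        return o.contagion_mode is not None
    if name in CRISP:
        return bool(o.crisp[name])
    return o.fuzzy[name] >= thresholds[name]


def is_pandemic(o: Outbreak, thresholds: dict, features=ALL_FEATURES) -> bool:
    """All features in `features` are treated as jointly necessary."""
    return all(holds(o, name, thresholds) for name in features)


thresholds = {name: 0.5 for name in FUZZY}  # arbitrary cut-offs
all_yes = {name: True for name in CRISP}
early = Outbreak(all_yes, {name: 0.9 for name in FUZZY}, "human-to-human")
later = Outbreak(
    all_yes,
    {**early.fuzzy, "minimal_population_immunity": 0.3, "severity": 0.4},
    "human-to-human",
)
# One possible 'essential' subset: everything except immunity and severity.
essential = [n for n in ALL_FEATURES
             if n not in ("minimal_population_immunity", "severity")]
```

With all eight required, `is_pandemic(later, thresholds)` is false, whereas `is_pandemic(later, thresholds, essential)` is true. The same outbreak flips category depending on which reading, and which cut-offs, one adopts.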
Now things get ontologically more interesting: the ontological status. Informally, an epidemic is an occurrence (read: instance/individual entity) of an infectious disease at a particular time (read: an unspecified duration of time, not an instant) and that affects some community (be that a community of humans, chickens, or whatever other organisms that live in a community), and a pandemic, as a minimum, extends the region that it affects and the number of organisms infected, and then adds some of those other features listed above.
A pandemic is in the same subject domain as an infectious disease, and so we can consult the OBO Foundry and see what they did, or first start with just the main BFO categories for a general sense of what it would align to. With our BFO Classifier, I get as far as process:
As to the last (optional) question: could one argue that a pandemic is a collection of disjoint part-processes? Not if the part-processes all have to be instances of different types of processes. The other loose end is that BFO’s processes need not have an end, but pandemics do. For now, what’s the most relevant is that the pandemic is distinctly in the occurrent branch of BFO, and occurrents have temporal parts.
Digging further into the OBO Foundry, they indeed did quite some work on infectious diseases and COVID-19 already, and following the trail from their Figure 1 (see below): disposition is a realizable entity is a specifically dependent continuant is a continuant; infectious disease course is a disease course is a process is an occurrent; and “realizable entity comes to be realized in the course of the process”.
In that approach, COVID-19 is the infectious disease being realised in the pandemic we’re in at the moment, with multiple infectious disease courses in humans and a few other animals. But where does that leave us with pandemic? Inspecting the Infectious Disease Ontology (IDO) since the article does not give a definition, infectious disease epidemic and infectious disease pandemic are siblings of infectious disease course, where disease course is described as “Totality of all processes through which a given disease instance is realized.” (presumably the totality of all processes in one human where there’s an instance of, say, COVID-19). Infectious disease pandemic is an atomic class with no properties or formal definitions, but there’s an annotation with a definition. Nice try; won’t work.
What’s the problem? There are three. The first, and key, problem is that pandemic is stated to be a collection of epidemics, but i) collections of individual things (collectives, aggregates) are categorically different kinds of entities than individual things, and ii) epidemic and pandemic are not categorically different things. Not just that, there’s a fiat boundary (along a continuum, really) between an epidemic evolving into becoming a pandemic and then subsiding into separate epidemics. A comparatively minor, or at least secondary, issue is how to determine the boundary of one epidemic from another to be able to construct a collective, since, more fundamentally: what are the respective identities of those co-occurring epidemics? One can’t get collections of things we can’t quite identify. For instance, is it one epidemic in two places that it jumped to, or do they count as two then, and what when two separate ones touch and presumably merge to become one large one? The third issue, and also minor for the current scope, is the definition for epidemic in the ontology’s annotation field, talking of “statistically significant increase in the infectious disease incidence” as determiner, but actually it’s based on a threshold.
Let’s try DOLCE as foundational ontology and see what we get there. With the DOLCE Decision Diagram, pandemic ends up as: Is [pandemic] something that is happening or occurring? Yes (perdurant – akin to BFO’s occurrent). Are you able to be present or participate in [a pandemic]? Yes (event). Is [a pandemic] atomic, i.e., has no subdivisions of it and has a definite end point? No (accomplishment). Not the greatest word choice to say that a pandemic is an accomplishment – almost right up there with the DOLCE developers’ example that death is an achievement – but it sure is an accomplishment from the perspective of the infectious agent. The nice thing about dolce:accomplishment over bfo:process is that it entails there’s a limited duration to it (DOLCE also has process that also can go on and on and on).
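That walk through the three questions can be written down as a tiny decision procedure. This is a sketch of only the path described above, not of the full DOLCE Decision Diagram: the function and parameter names are hypothetical, the ‘achievement’ outcome for an atomic event is my reading of DOLCE’s event categories, and the other branches are left as placeholders because they are not covered here.

```python
def classify_dolce(is_happening: bool, can_participate: bool, is_atomic: bool) -> str:
    """Follow the three yes/no questions in the order they were asked."""
    if not is_happening:
        # Not a perdurant; the diagram continues elsewhere.
        return "not a perdurant (branch not covered in this sketch)"
    if not can_participate:
        # A perdurant, but not an event.
        return "perdurant, not an event (branch not covered in this sketch)"
    # Events split on atomicity.
    return "achievement" if is_atomic else "accomplishment"


# The answers given above for 'pandemic': yes, yes, no.
pandemic_category = classify_dolce(
    is_happening=True, can_participate=True, is_atomic=False
)
```

Changing only the last answer to ‘yes’ would land on achievement instead, which is where the DOLCE developers’ example of death sits.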
The last question in both decision diagrams made me pause. The instances of COVID-19 going around could possibly be going around after the pandemic is over, uninterrupted in the sense that there is no time interval where no-one is infected with SARS-CoV-2, or it could be interrupted with later flare-ups if it’s still SARS-CoV-2 and not substantially different, but the latter is a grey area (is it a flare-up or a COVID-2xxx?). The latter is not our problem now. The former would not be in contradiction with pandemic as accomplishment, because COVID-19-the-pandemic and COVID-19-the-disease are two different things. (How those two relate can be a separate story.)
To recap, we have pandemic as an occurrent/perdurant entity unfolding in time and, depending on one’s foundational ontology, something along the line of accomplishment. For an epidemic to be classified as a pandemic, there are a varying number of features that aren’t all crisp and for which the fuzzy boundaries haven’t been set.
To sketch this diagrammatically (hence, informally), it would look something like this:
where the clocks and the DEX and DEV arrows are borrowed from the TREND temporal conceptual data modelling language: Epidemic and Pandemic are temporal entities, DEX (+dashed arrow) verbalised is “An epidemic may also become a pandemic” and DEV (+solid arrow): “Each pandemic must evolve to epidemic ceasing to be a pandemic” (hiding the logic at the back-end).
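For those who prefer code over diagrams, the two verbalised constraints can be mimicked with a small state tracker in Python. This is a loose sketch with hypothetical names, not TREND’s semantics: the temporal logic hidden at the back-end is reduced to checking that no pandemic phase is still open at the end of the recorded history.

```python
class Outbreak:
    """Tracks which classes an outbreak is an instance of over time."""

    def __init__(self) -> None:
        self.types = {"Epidemic"}
        self.history = [frozenset(self.types)]

    def _record(self) -> None:
        self.history.append(frozenset(self.types))

    def dex_to_pandemic(self) -> None:
        """DEX: an epidemic may ALSO become a pandemic (it stays an epidemic)."""
        if "Epidemic" not in self.types:
            raise ValueError("only an epidemic can become a pandemic")
        self.types.add("Pandemic")
        self._record()

    def dev_to_epidemic(self) -> None:
        """DEV: the pandemic evolves to epidemic, ceasing to be a pandemic."""
        if "Pandemic" not in self.types:
            raise ValueError("only a pandemic can evolve back to an epidemic")
        self.types = {"Epidemic"}
        self._record()


def dev_obligation_met(outbreak: Outbreak) -> bool:
    """Simplified check of the 'must': no pandemic phase is left open."""
    return "Pandemic" not in outbreak.history[-1]


o = Outbreak()
o.dex_to_pandemic()
still_open = not dev_obligation_met(o)  # the pandemic phase has not ended yet
o.dev_to_epidemic()
```

The optional DEX step is just a method one may or may not call; the mandatory DEV step shows up as an obligation that stays unmet until `dev_to_epidemic` has been called.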
It isn’t a full answer as to what a pandemic is ontologically – hence, the title of the blog post still has that question mark – but we can already clear up the two issues from the introduction of this post, as follows.
We already saw that with any definition, description, and list of properties proposed, there is no unambiguous and certain definite endpoint to a pandemic that can be deterministically computed. Well, other than the extremes of either 100% population immunity or the affected species is extinct such that there is no single instance of a disease course (in casu, of COVID-19) either way. Several measured values of the scales for the fuzzy variables will go down and immunity increase (further) as the pandemic unfolds, and then the pandemic phase is over eventually. Since there are no thresholds defined, there likely will be people who are forever disagreeing on when it can be called over. That is inherent in the current state of defining what a pandemic is. Perhaps it now also makes you appreciate the somewhat weak operational statement of the WHO post-pandemic period phase – specifying anything better is fraught with difficulties to date and unlikely to ever make everybody happy.
There’s that flawed argument of the anti-vaxxer to deal with still. Flu epidemics last about 10 weeks, on average. They happen in the winter, which in the northern hemisphere may cross a New Year (although I can’t remember that has ever happened in all the years I’ve lived in Europe). And yet, they also count per epidemic and not per calendar year. School years run from September to July, which provides a different sort of year, and the flu epidemics there are typically reported as ‘flu season 2014/2015’, indicating just that. Because those epidemics are short-lived, you typically get only one of those in a year, and in-season only.
Contrast this with COVID-19: it’s been going round and round and round since late December 2019, with waves and lulls for all countries, regions, and continents, but never did it stop for a season in whole regions or continents. Most countries come close to a stop during a lull at some point between the waves; for South Africa, according to worldometers, the lowest 7-day moving average since the first wave in 2020 was 265 recorded infections per day, on 7 November 2021. Any out-of-season waves? Oh yes – beta came along in summer last year and it was awful; at least for this year’s summer we got a relatively harmless omicron. And it’s not just South Africa that has been having out-of-season spikes. Point is, the COVID-19 pandemic ‘accomplishment’ wasn’t over within the year – neither a calendar year nor a northern hemisphere school year – and so we keep counting with the same counter for as long as the event takes until the pandemic as event is over. There’s no nefarious plot of evil controlling scaremongering governments, just a ‘demic that takes a while longer than we’ve been used to until 2019.
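The counting argument is simple enough to show in a few lines of Python. The records below are made-up toy numbers, not real data, and the function names are hypothetical; the point is only that chopping one event up per calendar year yields smaller numbers per slice without changing the total for the event.

```python
from collections import defaultdict
from datetime import date

# Toy records, NOT real data: (date of report, deaths reported).
records = [
    (date(2020, 6, 1), 10),
    (date(2020, 12, 15), 20),
    (date(2021, 1, 10), 30),
    (date(2021, 7, 5), 40),
]


def per_calendar_year(rows):
    """Chop the counts up by calendar year."""
    totals = defaultdict(int)
    for day, deaths in rows:
        totals[day.year] += deaths
    return dict(totals)


def per_event(rows, start, end):
    """Count from the start of the event until its end (inclusive)."""
    return sum(deaths for day, deaths in rows if start <= day <= end)


by_year = per_calendar_year(records)
whole_event = per_event(records, date(2020, 1, 1), date(2021, 12, 31))
```

Each yearly slice in `by_year` is smaller than `whole_event`, yet the slices add up to exactly the same total: slicing changes the presentation, not the size of the event.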
In closing, it is, perhaps, not the last word on the ontological status of pandemic, but I hope the walkthrough provided a little bit of clarity in the meantime already.
It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from that promise that this time it will be solved once and for all but where practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.
The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool, a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help find information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome) or part thereof, and using UMLS and MetaMap to help find more connections. In contrast to a well-known domain with well-known terminology to formulate very specific queries over academic literature, that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.
The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease map that culminated from a lot of effort by many research communities with lots of online data resources; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between information in those diagrams and the databases. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.
Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19, is not at all clear in Africa, but ideally would be. Let me illustrate this: the world-wide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as many; that only about 10% of the COVID-19-positives are detected by the diagnostics tests, while the rest don’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.
The last keynote at the end of the conference was by Amit Sheth, with the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on: you name it, somewhere it’s used in the systems.
Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’ and there’s no access to the papers yet to look up more details to check my notes and pad them. I’m assuming the papers will be online accessible soon (CEUR-WS again presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.
Ruduan Plug elaborated on the FAIR & GDPR and querying over integrated data within that above-mentioned VODAN-Africa project. He also noted that South Africa’s PoPIA is stricter than the GDPR. I’m suspecting that is due to the cross-border restrictions on the flow of data that the GDPR won’t have. (PoPIA is based on the GDPR principles, btw).
Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid, which I have to leave here with a cliff-hanger: I didn’t manage to make a note of which one won, because I had to go to a meeting that we were already starting later because of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than there is available to date).
Besides papers related to scientific research into all things covid, another recurring topic was FAIR data, i.e., whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing a FAIR level makes more sense before-and-after changing a single resource compared to pitting different vocabularies against each other, “FAIR enough” can be enough (cf. demanding 100% compliance), and a FAIR vocabulary does not imply that it is also a good quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis on the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of activities of ontology engineering, there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO. What to do next with these insights remains to be seen.
Last, but not least (to try to keep the blog post at a sort of just about readable length): our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL Ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do. Even when they do, it’s just with labels rather than any of the language models, be they the ontolex-lemon from the W3C community group or another, and if so, mainly French and German.
Does it matter? It depends on what your aims are. We use mainly the motivation of ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that also the related bio-ontologies can do with coordinated multilingualism so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.
There were also posters where we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantics-driven tools, from sharing data, to searching data, to integrating data, and analysis to develop the theory figuring out all its workings.
The conference was supposed to be mainly in person, but then on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events effective until, would you believe, 14 Jan, one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!
 Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
Bias in models in the area of Machine Learning and Deep Learning is well known. It features in the news regularly with catchy headlines and there are longer, more in-depth, reports as well, such as the Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs, one attempt toward a framework but looking at FOAF only, which is stretching it a little for what passes as an ontology, and then stretching it even further, there’s an old one of mine on bias in relation to conceptual data models for databases.
We simply don’t have bias in ontologies? That sounds a bit optimistic since it’s pervasive elsewhere, and at least worthy of examination whether there is such notion as bias in ontologies and if so, what the sources of that may be. And, if one wants to dig deeper, since Ontology: what is bias anyhow? The popular media is much more liberal in the use of the term ‘bias’ than scientific literature and I’m not going to answer that last question here now. What I did do, is try to identify sources of bias in the context of ontologies and I took a relevant selection of Dimara et al’s list of 154 biases  (just like only a subset is relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.
The outcome of that exploratory analysis , in short, is: yes, there is such notion as bias in ontologies as well. First, I’ve identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined the three COVID-19 ontologies (CIDO, CODO, COVoc) on possible bias, and they exhibited different subsets indeed.
The sources can be philosophical, purpose-driven (commonly known as encoding bias), or rooted in the ‘subject domain’, such as scientific theory, granularity, linguistic, socio-cultural, political or religious, and economic motivations. Each of them may be an explicit choice or implicit.
An example of an economic motivation is to (try to) categorise some disorder as a type of disease: the latter gets more resources for medicines, research, and treatments, and is more costly for insurers, who’d rather keep it out of the terminology altogether. Another is modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see paper for details). Terrorism ontologies can provide ample material for political views to creep in.
Besides the hand-picked examples, I did assess the three COVID-19 ontologies in more detail. Not because I wanted to pick on them (I actually think it’s laudable they tried in trying times), but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked at both the sources, which can be argued to be present, and at a selection of Dimara et al’s list, from which I identified some biases, such as the “mere exposure/familiarity” bias and “false consensus” bias (see table below). How they are present is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), which is part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.
Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much, or, at least, any inconsistencies that it would have caused should already have been detected and discussed during the ontology development stage.
Will it matter for, say, annotating data or literature? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not ‘female’ (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” its developers intend it for, it’s going to be even harder to retrieve COVID-19 research papers in relation to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access (OBDA), bias may have negative effects. Take as an example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and suppose this ontology is used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA approved drugs database:
The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect, and the error is due to the biased and erroneous class subsumption declared in the ontology, not the data source itself.
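To make the mechanism concrete, here is a minimal Python sketch of how a subsumption axiom propagates into query answers. It is not the actual OBDA setup nor the actual data: the two class names are taken from the CIDO example above, while the single data row, the dictionary-based "TBox", and the helper functions `ancestors` and `instances_of` are assumptions made purely for illustration.

```python
# Toy TBox: child class -> parent class.
# The class names come from the CIDO example; the representation is hypothetical.
BIASED_TBOX = {
    "COVID-19 experimental drug in a clinical trial": "COVID-19 drug",
}
# The same ontology without the contested subsumption axiom.
REVISED_TBOX = {}

# Toy ABox, as an OBDA mapping over a clinical trials table might populate it.
# Illustrative row only, not a record from the real databases.
ASSERTIONS = [
    ("hydroxychloroquine", "COVID-19 experimental drug in a clinical trial"),
]


def ancestors(cls, tbox):
    """Return cls together with every class that transitively subsumes it."""
    chain = [cls]
    while chain[-1] in tbox and tbox[chain[-1]] not in chain:
        chain.append(tbox[chain[-1]])
    return chain


def instances_of(target, assertions, tbox):
    """Answer 'which individuals are instances of target?' under the TBox."""
    return sorted(
        individual
        for individual, cls in assertions
        if target in ancestors(cls, tbox)
    )


with_bias = instances_of("COVID-19 drug", ASSERTIONS, BIASED_TBOX)
without_bias = instances_of("COVID-19 drug", ASSERTIONS, REVISED_TBOX)
print(with_bias)     # ['hydroxychloroquine']
print(without_bias)  # []
```

The data is identical in both runs; only the subclass axiom differs. That is the point of the scenario: the wrong answer originates in the ontology, not in the sources it is mapped to.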
Some peculiarities of content in an ontology may not be due to an underlying bias, but merely a case of ‘ran out of time’ rather than an act of omission due to a bias, for instance. Or it may not be an honest mistake due to bias but a mistake for some other reason, such as having clicked erroneously on a wrong button in the tool’s interface, say, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One can also have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. Or: there seems to be enough to do yet.
Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but that’s looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.