Riffling through readability metrics

I was interviewed recently about my ontology engineering textbook, after winning the 2021 UCT Open Textbook Award for it. The interviewer initially assumed it was a textbook for undergraduate students, because it has the word ‘Introduction’ in the title. Not quite. Soon thereafter, one of the 3rd-year computer science students who arrived early in class congratulated me on the award and laughed that it was an introduction at a different level altogether. It is, by design, but largely so with respect to the topics covered: it does not assume the reader knows anything about ontologies—hence, the ‘introduction’—but it does take for granted that the reader knows some of the basics in computer science or software engineering. For instance, there’s no explanation of what a database is, or a conceptual data model, or object-oriented software.

In addition, and getting to this post’s topic, I had tried to make the textbook readable, and definitely more accessible than the scientific papers and handbooks that were the only alternatives before this textbook saw the light of day. I think it is readable, and I have received feedback that the book was easily readable as well. Admittedly, though, the notion of assessing readability only came to the fore in the editing process of my memoir, which is aimed at a broader audience than the textbook. This raised a nagging question: what is it that makes some text readable?

It’s one of those easy questions that just do not have a simple answer. The quickest answer is “use a readability metric standardised by grade level” for a home language/mother tongue speaker. Scratching that surface lays bare the next question: which parameters have to be taken into account, and in what way, so as to come up with a score for the estimated grade level? Even the brief overview on the Wikipedia page on readability already lists 11 measurable parameters, and there are different ways to measure them and to possibly combine them as well. The same page lists 8 popular metrics and 4 advanced ones. That’s just for English. For instance, the Flesch reading ease is calculated as

206.835 – 1.015 * (total number of words / total number of sentences) – 84.6 * (total number of syllables / total number of words)

to result in rough bands of reading ease: for instance, 90-100 for an 11-year-old, 60-70 for ‘plain English’, and anything from below 30 down to 0 (and possibly even negative) for very to extremely difficult English texts, for professionals and graduate students. See also the figure on the right.

A rough categorisation of various texts for adults according to their respective Flesch reading ease scores. Source: https://blog.cathy-moore.com/2017/07/how-to-get-everyone-to-write-like-ernest-hemingway/.
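To get a feel for what such a computation involves, here is a minimal sketch in Python—my own naive take, not a reference implementation—with a crude vowel-group syllable counter (a proper implementation would use a pronunciation dictionary or hyphenation rules):

```python
import re

def count_syllables(word):
    # Naive heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_reading_ease(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

print(round(flesch_reading_ease("The cat sat on the mat. It was happy."), 2))
```

On that toy sentence, the sketch returns about 108, i.e., trivially easy text.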

The Gunning fog index has fewer fantastically tweaked multipliers:

Grade level = 0.4 * (average sentence length + percentage of Hard Words)

but there’s a wonderful Hard Words variable. What is that supposed to mean exactly? The readability page says that they are those words with two or more syllables, but the Gunning fog index page says three or more syllables (excluding proper nouns, familiar jargon, or compound words, and not counting common suffixes either).
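A sketch of the fog index along the same lines, taking the three-or-more-syllables reading of Hard Words and, for simplicity, ignoring the exclusions for proper nouns, familiar jargon, compounds, and common suffixes:

```python
import re

def count_syllables(word):
    # Same naive vowel-group heuristic as in the previous sketch.
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def gunning_fog(text):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Hard Words read as 'three or more syllables', without the exclusions.
    hard_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(hard_words) / len(words))
```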

Either way, the popular metrics are all easy to measure computationally without human intervention. Parameters such as fatigue or speed of perception or background knowledge are not. Proxies for reading speed surely will be available by now somewhere; e.g., in the form of algorithms that analyse page-turning in eBook readers and a visitor’s scrolling behaviour when reading a long article on a webpage (the system likely knows that you probably won’t finish reading this post).

I don’t know why I never thought about all that before writing the textbook and why none of the writing guidelines I have looked up over the years had mentioned it. The most I did for readability, especially when I was writing my PhD thesis, was the “read aloud test” that was proposed in one of those writing guidelines: read your text aloud, and if you can’t, then something is wrong with the sentence. I used the Acrobat built-in screen reader for that as a first pass. If the text-to-speech algorithm stumbled over it, then it was time to reconsider the phrasing. I would then read it aloud myself and decide whether the Acrobat algorithm had to be improved upon or my sentence had to be revised.

How does the ontology engineering textbook fare? Are my blog posts any more readable? How much worse are the scientific papers? Is it true that the English in science articles is a sort of pidgin English, whereas in other fields, notably the humanities, the erudition and wordsmithery shines through in the readability metrics scores? I have no good answers now, but it would be easy to compute with a fine dataset of texts and the Python py-readability-metrics module for some quick ‘n dirty checks, or to adapt some other open source code for batch processing (e.g., from here, among multiple options). Maybe later; there are some other kinks to straighten first.
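Using that module would amount to something like the following sketch—assuming the package and its NLTK dependency are installed, and with chapter1.txt as a hypothetical input file containing the 100+ words the module requires:

```python
# pip install py-readability-metrics && python -m nltk.downloader punkt
from readability import Readability

with open('chapter1.txt') as f:  # hypothetical input file
    text = f.read()

r = Readability(text)  # the module expects at least 100 words of input
print('Flesch reading ease:', r.flesch().score)
print('Gunning fog:', r.gunning_fog().score)
```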

Notably, one can game the system based on some of those key parameters. Besides sentence length—around 20 words is fine, I was told a long while ago—there are the number of syllables of the words and the vocabulary that are taken into account. More monosyllabic words in shorter sentences with fewer word types will come out as more easily readable, according to the metric that is.

But ‘easier’ or ‘better’ lies in the eyes of the beholder: it may become such confetti as to be awful to read, due to its lack of flow and coherence. Really. It is as I say. Don’t you think? It’s the way I see it. What say you? The “Really. … you?” sequence has a Flesch reading ease of 90.38 and a Gunning Fog index of 1.44, as the number of years of formal education you would have needed to easily understand it. The “Notably, … and coherence” before it in this paragraph has a Flesch reading ease of 50.52 and a Gunning Fog index of 13.82.

Based on random sampling from my textbook, at least one of the paragraphs (p34, ‘purposes’) got a Flesch reading ease of 9.29 and a Gunning Fog index of 22.73, while other parts are around 30 and some are even in the 50-70 region for reading ease.

The illustration out of the way, let’s look at limitations. First, not all polysyllabic words are difficult and not all monosyllabic words are simple; e.g., the common, and therewith easy, ‘education’ and ‘interesting’ vs. the semi-obscure ‘nub’, ‘sloop’, ‘gry’, and ‘squick’ (more here). The longest monosyllabic words, such as ‘scraunched’ and ‘strengthed’, aren’t exactly easy to read either.

Plenty of other languages have predominantly polysyllabic words with lots of syllables, such as Dutch or German, where new words can be formed by putting existing ones together. The Dutch word meervoudigepersoonlijkheidsstoornis puts together meervoudige, persoonlijkheid, and stoornis into one concept (‘multiple personality disorder’). Agglutinating languages, such as isiZulu, not only compose long words, but have so many meaningful pieces that a single word may well be a whole sentence in a disjunctive language. For instance, the 10-syllabic word that one of my former students used to make the point: titukakimureeterahoganu ‘we have never ever brought it to him’. You get used to long words, and there’s no reason why English speakers would be inherently incapable of handling that. Intelligence does not depend on one’s mother tongue. Perhaps, if one is used to a disjunctive orthography, one may have become lazy. Any use of the aforementioned readability metrics for ‘non-English’ clearly will have to be revised to tailor them to the language.

Then there’s foreign language background that interferes with reading ease. Many a supposedly ‘difficult’ word in English comes from French, Italian, Latin, or Greek; e.g., oxymoron (Gr), camaraderie (Fr), quotidian (It), and obfuscate (La). For instance, we use oxymoron in Dutch as well, so there’s no ‘difficulty’ to it for a Dutch person, or take maalstroom that is pronounced nearly the same as ‘maelstrom’, demagoog for ‘demagogue’ (also Greek origins, similar pronunciation), and algorithme for ‘algorithm’ (Persian origins, not an Anglicism), and recalcitrant is even spelled the same. The foreigner trying to speak or write English may not be erudite, but just winging it and hoping that the ‘copy and adapt’ works out. Conversely, supposedly ‘simpler’ words may not be: ‘wayward’ is a synonym for recalcitrant and, with only two syllables, it will make the readability score better. It would make the text less readable to Dutch, Spanish, and Italian readers, among others, who are trying to read English text, however, because there’s no connection with a familiar-looking word. About 80% of English words are borrowed from other languages.

Be that as it may, maybe I should reassess my textbook on the metric; maybe not. What does the algorithm know about computer science terminology anyhow? “Ontology Engineering is a specialisation in knowledge representation and reasoning.” has a Flesch reading ease of -31.73 and a Gunning Fog index of 20.00; a tough game it would be to get that back to a reading ease of 50.

It did affect a number of sentences in my memoir book. I don’t expect Joe and Joanne Soap to be interested, but teenagers who are shopping around for a university degree programme might be, as might professionals, students, and academics with a little spare time to relax and read. In other words: a reading ease of around 40-60. Some long sentences could indeed be split up without losing content, coherence, and flow.

There were others where the simplification didn’t feel like an improvement. For instance, compare “according to my opinion” with “the way I saw it”: the former flows smoothly whereas the latter sounds like a nagging firing off of words. The latter for sure improves the readability score, with all those monosyllabic words. The copy editor changed the former into the latter. It still bugs me. Why? After some further pondering, beyond just blaming the grating staccato of a sequence of monosyllabic words, perhaps it is because an opinion generally is (though need not be) formed after considering the facts and analysing them, whereas seeing something in some way may (but definitely need not) be based on facts and analysis. That is, on closer inspection, they’re not equivalent phrases, not at all. Nuances can be, and were, lost with shorter sentences and simpler words. One’s voice, too. So there’s that. Overall, though, I hope the balance leans toward more readable, to get the message across better to more readers.

Lastly, there seems to be plenty of scope for more research on readability metrics—ones that can be computed, that is. While there are several applications for other well-resourced languages, including easy web apps, such as for Spanish and German and even for Dutch, there are very many languages spoken around the globe that do not have such metrics and nice algorithms yet. But even the readability metrics for English could be tweaked, for instance, to tailor them to a genre or a discipline. Then it would be easier to determine whether a book is, say, an easy-reading popular science book for the holidays on the beach or one that requires some or even a lot of effort. For computer science, one could take Gunning Fog and adjust the Hard Words variable to exclude common jargon that is detrimental to the score, like ‘encapsulation’ and ‘representation’ (both 5 syllables); biochemistry would need that too, given the long names for chemical compounds. One could also add a penalty for too many successive monosyllabic words; a rough sketch of both tweaks follows below. There will be more options to tweak the formulae and test them, but such additional digging is something for another time.
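As a sketch of the sort of tweak I have in mind—with a hypothetical jargon whitelist and an invented penalty term, neither of which has been validated in any way:

```python
import re

CS_JARGON = {'encapsulation', 'representation', 'algorithm', 'ontology'}  # hypothetical whitelist

def count_syllables(word):
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def fog_for_cs(text, run_penalty=0.05):
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Hard Words, minus the whitelisted jargon of the discipline
    hard = [w for w in words
            if count_syllables(w) >= 3 and w.lower() not in CS_JARGON]
    # Invented penalty: each run of 5+ successive monosyllabic words
    # nudges the grade level up, to counter the staccato effect.
    runs, streak = 0, 0
    for w in words:
        streak = streak + 1 if count_syllables(w) == 1 else 0
        if streak == 5:
            runs += 1
    return (0.4 * (len(words) / len(sentences)
                   + 100 * len(hard) / len(words))
            + run_penalty * runs)
```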

As to my question in the introductory paragraph of this post, “What is it that makes some text readable?”: if you’ve made it all the way here reading this post, we’re all a bit wiser on readability, but a short and simple answer I still don’t have. It’s a long story with ifs and buts, and the last word is yet to be said about it.

As a bonus, here are a few hints to make something more readable, according to the readability calculator of the web-based editor tool of The Conversation:

Screenshot I took about halfway through working on an article for The Conversation.

p.s.: The ‘science of reading’ adds more to it, to the point that you wonder how there even can be metrics. But its scope is broader.

p.p.s.: The first full draft of this post had a reading ease of 52.37 and a Gunning Fog of 11.78, and the final one 54.37 and 11.18, respectively, which is fine by me. Length is probably more of an issue.

A handful of memoirs and autobiographies for computer science

Since I published my second book, that memoir on a scenic route into computer science, several people have asked me “why?” and “what makes yours stand out from the crowd?”. The answer to the latter is easy: there is no crowd. (The brief answer to ‘why’ is mentioned in the Introduction chapter). Let me elaborate a little.

In the early stage of writing the book, I dutifully did do my market research to answer the typical starter questions: What books in your genre or on your topic are already out there? How crowded is the field? Will your prospective book be just another one on that pile? Will it stand out as different? And if so, is that an interesting difference to at least some readership segment, so that it will have potential to be sold beyond a close circle of friends and family? So, I searched and searched and searched, in late 2020 and again twice in 2021, and even now when writing this post. Memoirs by female computer scientists, by male computer scientists, by computer scientists in academia of whatever gender. Autobiographies as well then. I stretched the search criteria further, into the not-in-their-own-words biographies of computer science professors.

Collage made with the respective covers or first page of the memoir and autobiography books listed and linked here.

If you take your time searching for those books, you should be able to find the following four books and booklets of the memoir or autobiography variety, by computer science professors, on computing, computing milieux, or computer science:

  • James Morris’ memoir that was published in the same week as mine was in late 2021. It covers his 60-year career in computer science and, according to the book’s tweet-size blurb, “is a search for intelligence across multiple facets of the human condition—religion and science, evolution, and innovation”.
  • Kenneth King’s professional memoir on the early years of academic computing, made available in 2014 (free pdf).
  • The unpublished memoir by Ray Miller, on 50 years in computing (1953-1993), available online from the IEEE Computer Society as part of its computer history museum.
  • Maurice Wilkes’ hardcopy autobiography from 1985 that is, consequently, hard to access.

That’s all. Four retired (and some meanwhile deceased) computer science professors telling their tale, three of which cover only the early days of computing.

Collage made with the covers or first page of the closely related memoir and autobiography books listed and linked here.

There are a few very recent memoirs by professors, in print or announced to go in print soon, on attendant topics, notably those in the collage above.

What there are lots of, are books about, and occasionally by, ‘celebrity’ people in IT and computing who made it in industry these days, such as Bill Gates, Steve Jobs, Elon Musk, Satya Nadella, and Sheryl Sandberg, and famous people in computing history, such as Ada Lovelace, Grace Hopper, George Boole, and Alan Turing (also about, not by). And there are short and long memoirs about tech by journalists and writers and by engineers and programmers who write, such as on Linux in Australia (here) or 10 years in Silicon Valley (here). There are also a few professional memoir essays and articles by computer science professors, such as about the development of the network time protocol by David Mills (here).

The people ‘out there’ – outside of the ivory tower of academia – do have lots of assumptions about computer science professors. When I mention to them that, yes, I’m one of those, at UCT even, a not uncommon reaction is an involuntary reflex of apprehension. The eyes move to a corner of the eye socket, the head turns a little and moves back, and the upper body follows, even if only slightly. I notice. But what do you really know about us? Nothing, really.

Even among academics in computer science, we have only sketchy information about our colleagues’ respective backgrounds. Yes, there are the privileged ones, who had early access to computers, tinkered with them in their spare time, got their pizza delivered, participated in programming contests, and so on. But there are others who made it. Who escaped persecution in Eastern Europe during the Cold War and had to find their way in a different country, whose first interaction with a computer was only at university, or who grew up in some hamlet with limited electricity and potable water. Who came from a broken home, or who had to leave family and friends to get that elusive job in the scarce academic job market many kilometers away, or whose relationships ran aground due to the two-body problem (a partner who is also an academic, but in a different city or country). Who made it against the odds. And there are those who defected from physics, or who took a stroll out of philosophy never to return, or who still flip-flop with chemistry, to name but a few, and who thus have at least two specialisations under their belt. Those who know about more stuff than just computing.

That’s just about an academic’s background. What do you know of our daily activities? Nothing really, either. Assumptions abound; there are about as many memes and jokes about our jobs as there are assumptions. And movies, TV series, and novels don’t necessarily depict it accurately either.

But us, in our own words? The memoir and autobiography books literally can be counted on one hand. I can assure you it’s not because we have no life and have nothing to say. We do. For instance, it takes about 10-30 years before the theories and techniques we investigate have matured enough to seep into wider society. Impactful, cool, and fun things happen along the way. Those ‘infoboxes’ from Google when it returns the search results? The theory and techniques behind them date back to the late 1990s with ontologies, and I was a part of that. Toy drones? There was one to play with at the European Conference on Artificial Intelligence 2006 (ECAI’06) that I attended, when the first small toy drones needed to be equipped with ‘intelligent’ processing of sensor data. The drone demo area was suitably demarcated with red-white coloured tape, for neither the engineers nor the organisers, nor us as attendees, were convinced it was safe to let it fly around without causing trouble.

Screengrab of “Dr Fill” in action in last year’s crossword puzzle contest (video: https://www.youtube.com/watch?v=aIjD-sIDCeE).

The demo session at ECAI’06 also had a crossword puzzle contest with WebCrow: researchers against an algorithm that trawled the Web for answers. The 25 of us onsite participants – perhaps the first ever to participate in such a contest – sat on uncomfortable plastic chairs in cinema style in a section of a large hall in the conference venue at Riva del Garda in Italy. Onlookers marveled that the event really took place, unsure about which horse to bet on. The algorithm won, but we had fun. Last year’s news that an algorithmic solver beat expert human puzzlers therefore seems a bit late, and old news. I can very well imagine what those human participants must have felt.

Maybe you don’t care about computer science professors or about early days of new theories and techniques and how they came about. We all have our interests and time is limited. That’s fine; I don’t read all books either. But, if you were to ever wonder about the human in the computer science academic, there are, for now, those four books listed above, mine, and the other three books that are quite close in scope. Happy reading!

What is a pandemic, ontologically?

At some point in time, this COVID-19 pandemic will be over. Each time that thought crossed my mind, there was that little homunculus in my head whispering: but do you know the criteria for when it can be declared ‘over’? I tried to push that idea away by deferring it to a ‘whenever the WHO says it’s over’, but the thought kept nagging. Surely there would be a clear set of criteria lying on the shelf, waiting to be ticked off? Now, with the omicron peak well past us here in South Africa, and with comparatively little harm done in that fourth wave, there’s more public talk of perhaps having that end in sight – and thus also of needing to know what the decisive factors are for calling it an end.

Then there are the anti-vaxxers. I know a few of them as well. One raged on with the argument that ‘they’ (the baddies in the governments of multiple countries) count the death toll entirely unfairly: “flu deaths count per season in a year, but for covid they keep adding up to the same counter from 2020 to make the death toll look much worse!! Trying to exaggerate the severity!” My response? Duh, well, yes, they do count from early 2020, because a pandemic is one event and you count per event! Since the COVID-19 pandemic is one event, we count from the start until the end – whenever that end is. It hadn’t even crossed my mind that someone wouldn’t count per event but, rather, would want to chop up an event to pretend it is smaller than it actually is.

So I did a little digging after all. What is the definition of a pandemic? What are its characteristics? Ontologically, what is that notion of ‘pandemic’, be it according to the analytic philosophers, ontologists, or modellers, or how it may be aligned to some of the foundational ontologies used in ontology engineering? From that, we then should be able to determine when all this COVID-19 will have become ‘not a pandemic’ (whatever it may be classified as after the pandemic is over).

I could not find any works from the philosophers and theory-focussed ontologists that would have done the work for me already. (If there is and I missed it, please let me know.) Then, to start: what about definitions? There are some, like the recently updated one from dictionary.com, where they tried to explain it from a language perspective, and there is lots of debate and misunderstanding about defining and describing a pandemic [1]. The WHO has descriptions, but not a clear definition, and pandemic phases. Formulations of definitions elsewhere vary slightly as well, except for the lowest common denominator: it’s a large epidemic.

Ontologically, that is an entirely unsatisfying answer. What is ‘large’? Some, like the CDC in the USA, qualified it somewhat: it’s spread over the world, or at least multiple regions and continents, and in those areas it usually affects many people. The Australian Department of Health adds ‘new disease’ to it. Now we’re starting to get somewhere, with the inclusion of key properties of a pandemic. Kelly [2] adds another criterion, albeit focussed on influenza: besides a worldwide/very wide area and affecting a large number of people, “almost simultaneous transmission takes place worldwide” and thus, for a part of the world, there is out-of-season influenza virus transmission.

Image credits: Miroslava Chrienova, taken from this page.

The best resource of all, from an ontologist’s perspective, is a very clear, well-written perspective article by Morens, Folkers and Fauci – yes, that Fauci – in the Journal of Infectious Diseases [3], which, in its lack of wisdom, keeps the article paywalled (it somehow made it onto the webarchive with free access here anyhow). They’re experts and they trawled the literature to, if not define a pandemic, then at least describe it by trying to list its characteristics and the merits, or demerits, thereof. The characteristics are, in short, and with my annotation on what sort of attribute (or feature or characteristic, as loosely used terms for now) each is:

  1. Wide geographic extension; as aforementioned. That’s a scale or ‘fuzzy’ (imprecise in some way) feature, i.e., without a crisp cut-off point when ‘wide’ starts or ends.
  2. Disease movement, i.e., there’s some transmission going on from place to place and that can be traced. That’s a yes/no characteristic.
  3. High attack rates and explosiveness, i.e., lots of people affected in a short timespan. There’s no clear cut-off point on how fast the disease has to spread for counting as ‘fast spreading’, so a scale or fuzzy feature.
  4. Minimal population immunity; while immunity is a “relative concept” (i.e., you have it to a degree), it’s a clear notion for a population when that exists or not; e.g., it certainly wasn’t there when SARS-CoV-2 started spreading. It is agnostic about how that population immunity is obtained. This may sound like a yes/no feature, perhaps, but is fuzzy, because practically we may not know and there’s for sure a grey area thanks to possible cross-immunity (natural or vaccine-induced) and due to the extent of immune-evasion of the infectious agent.
  5. Novelty; the term speaks for itself, and clearly is a yes/no feature as well. It seems to me like ‘novel’ implies ‘minimal population immunity’, but that may not be the case.
  6. Infectiousness; it’s got to be infectious, and so excluding non-infectious things, like obesity and smoking. Clear yes/no.
  7. Contagiousness; this may be from person to person or through some other medium (like water for cholera). Perhaps as an attribute with categorical values; e.g., human-to-human, human-animal intermediary (e.g., fleas, rats), and human-environment (notably: water).
  8. Severity; while the authors note that it’s not typically included, historically, the term ‘pandemic’ has been applied more often for diseases that are severe or with high fatality rates (e.g., HIV/AIDS) than for milder ones. Fuzzy concept for which a scale could be used.

And, at the end of their conclusions, “In summary, simply defining a pandemic as a large epidemic may make ultimate sense in terms of comprehensibility and consistency. We also suggest that use of the term is best reserved for infectious diseases that share many of the same epidemiologic features discussed above” (p1020), largely for simplifying it to the public, but where scientists and public health officials would maintain their more precise consensus understanding of the complex scientific concept.

Those imprecise/fuzzy properties and the lack of clarity on cut-off points bug the epidemiologists, because they lead to different outcomes of their prediction models. From my ontologist viewpoint, however, we’re getting somewhere with these properties: SARS-CoV-2, at least early in 2020 when the pandemic was declared, ticked all those eight boxes, and so any reasoner would classify the disease it causes, COVID-19, as a pandemic. Now, in early 2022, with/after the omicron variant of concern? Of those eight properties, numbers 4 and 8 much less so, and number 5 is the million-dollar question two years into the pandemic. Either way, considering all those properties of a pandemic that have passed the revue here so far, calling an end to the pandemic is not as trivial as it initially may have sounded. WHO’s “post pandemic period” phase refers to “levels seen for seasonal influenza in most countries with adequate surveillance”. Operationally, that is a clear specification.
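Just to illustrate that feature-based view computationally: a toy sketch with the eight features encoded naively and with entirely made-up thresholds for the fuzzy ones—not a validated model, let alone an ontology-backed reasoner:

```python
from dataclasses import dataclass

# Toy encoding of the eight features from Morens et al. [3]: the yes/no ones
# as booleans, the fuzzy ones as degrees in [0,1] with invented thresholds.
@dataclass
class DiseaseEvent:
    geographic_extension: float   # fuzzy: 0 = local .. 1 = worldwide
    disease_movement: bool        # traceable place-to-place transmission
    attack_rate: float            # fuzzy: explosiveness of spread
    population_immunity: float    # fuzzy: 0 = none .. 1 = full
    novelty: bool
    infectiousness: bool
    contagiousness: str           # categorical, e.g., 'human-to-human'
    severity: float               # fuzzy

def is_pandemic(d: DiseaseEvent) -> bool:
    # Naive 'ticks all eight boxes' check; all thresholds are made up.
    return (d.geographic_extension > 0.7 and d.disease_movement
            and d.attack_rate > 0.5 and d.population_immunity < 0.2
            and d.novelty and d.infectiousness
            and d.contagiousness != 'none' and d.severity > 0.3)

covid_early_2020 = DiseaseEvent(0.9, True, 0.8, 0.05, True, True,
                                'human-to-human', 0.7)
print(is_pandemic(covid_early_2020))  # True
```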

Ontologically, if we were to take these eight properties at face value, the next question then is: are all eight of them combined the necessary and sufficient conditions, or are some of them ‘more essential’ for calling it a pandemic, and the other ones would then be optional features? Etymologically, the pan in pandemic means ‘all’, so then as long as it rages across the world, it would remain a pandemic?

Now on to where things get ontologically more interesting: the ontological status. Informally, an epidemic is an occurrence (read: instance/individual entity) of an infectious disease at a particular time (read: an unspecified duration of time, not an instant) that affects some community (be that a community of humans, chicken, or whatever other organisms that live in a community), and a pandemic, as a minimum, extends the region that it affects and the number of organisms infected, and then adds some of those other features listed above.

A pandemic is in the same subject domain as an infectious disease, and so we can consult the OBO Foundry and see what they did, or first start with just the main BFO categories for a general sense of what it would align to. With our BFO Classifier, I get as far as process.

As to the last (optional) question: could one argue that a pandemic is a collection of disjoint part-processes? Not if the part-processes all have to be instances of different types of processes. The other loose end is that BFO’s processes need not have an end, but pandemics do. For now, what’s the most relevant is that the pandemic is distinctly in the occurrent branch of BFO, and occurrents have temporal parts.

Digging further into the OBO Foundry, they indeed did quite some work on infectious diseases and COVID-19 already [4], and following the trail from their Figure 1 (see below): disposition is a realizable entity is a specifically dependent continuant is a continuant; infectious disease course is a disease course is a process is an occurrent; and “realizable entity comes to be realized in the course of the process”.

Source: Figure 1 of [4].

In that approach, COVID-19 is the infectious disease being realised in the pandemic we’re in at the moment, with multiple infectious disease courses in humans and a few other animals. But where does that leave us with pandemic? Inspecting the Infectious Disease Ontology (IDO), since the article does not give a definition: infectious disease epidemic and infectious disease pandemic are siblings of infectious disease course, where disease course is described as “Totality of all processes through which a given disease instance is realized.” (presumably the totality of all processes in one human where there’s an instance of, say, COVID-19). Infectious disease pandemic is an atomic class with no properties or formal definitions, but there’s an annotation with a definition. Nice try; won’t work.

What’s the problem? There are three. The first, and key, problem is that pandemic is stated to be a collection of epidemics, but i) collections of individual things (collectives, aggregates) are a categorically different kind of entity than individual things, and ii) epidemic and pandemic are not categorically different things. Not just that: there’s a fiat boundary (along a continuum, really) between an epidemic evolving into a pandemic and then subsiding into separate epidemics. A comparatively minor, or at least secondary, issue is how to determine the boundary of one epidemic from another so as to be able to construct a collective, since, more fundamentally: what are the respective identities of those co-occurring epidemics? One can’t make collections of things one can’t quite identify. For instance, is it one epidemic in the two places it jumped to, or do they count as two then, and what about when two separate ones touch and presumably merge to become one large one? The third issue, also minor for the current scope, is the definition for epidemic in the ontology’s annotation field, which talks of a “statistically significant increase in the infectious disease incidence” as determiner, whereas actually it’s based on a threshold.

Let’s try DOLCE as foundational ontology and see what we get there. With the DOLCE Decision Diagram [5], pandemic ends up as follows. Is [pandemic] something that is happening or occurring? Yes (perdurant – akin to BFO’s occurrent). Are you able to be present or participate in [a pandemic]? Yes (event). Is [a pandemic] atomic, i.e., has no subdivisions of it and has a definite end point? No (accomplishment). Not the greatest word choice, to say that a pandemic is an accomplishment – almost right up there with the DOLCE developers’ example that death is an achievement – but it sure is an accomplishment from the perspective of the infectious agent. The nice thing of dolce:accomplishment over bfo:process is that it entails a limited duration (DOLCE also has process, which can go on and on and on).

The last question in both decision diagrams made me pause. The instances of COVID-19 going around could possibly still be going around after the pandemic is over, uninterrupted in the sense that there is no time interval during which no-one is infected with SARS-CoV-2, or it could be interrupted with later flare-ups if it’s still SARS-CoV-2 and not substantially different—though that is a grey area (is it a flare-up or a COVID-2xxx?) and not our problem now. The former would not be in contradiction with pandemic as accomplishment, because COVID-19-the-pandemic and COVID-19-the-disease are two different things. (How those two relate can be a separate story.)

To recap, we have pandemic as an occurrent/perdurant entity unfolding in time and, depending on one’s foundational ontology, something along the line of accomplishment. For an epidemic to be classified as a pandemic, there are a varying number of features that aren’t all crisp and for which the fuzzy boundaries haven’t been set.

To sketch this diagrammatically (hence, informally), it would look something like this:

where the clocks and the DEX and DEV arrows are borrowed from the TREND temporal conceptual data modelling language [6]: Epidemic and Pandemic are temporal entities, DEX (the dashed arrow) verbalises as “An epidemic may also become a pandemic”, and DEV (the solid arrow) as “Each pandemic must evolve to epidemic, ceasing to be a pandemic” (hiding the logic at the back-end).

It isn’t a full answer as to what a pandemic is ontologically – hence, the title of the blog post still has that question mark – but we can already clear up the two issues from the introduction of this post, as follows.

Consequences

We already saw that, with any definition, description, and list of properties proposed, there is no unambiguous, definite endpoint to a pandemic that can be deterministically computed—well, other than the extremes of either 100% population immunity or extinction of the affected species, such that either way there is not a single instance of a disease course (in casu, of COVID-19) left. Several measured values on the scales for the fuzzy variables will go down, and immunity will increase (further), as the pandemic unfolds, and then the pandemic phase is eventually over. Since there are no thresholds defined, there likely will be people who forever disagree on when it can be called over. That is inherent in the current state of defining what a pandemic is. Perhaps it now also makes you appreciate the somewhat weak operational statement of the WHO post-pandemic period phase – specifying anything better is fraught with difficulties to date and unlikely to ever make everybody happy.

There’s that flawed argument of the anti-vaxxer to deal with still. Flu epidemics last about 10 weeks, on average [7]. They happen in the winter, which in the northern hemisphere may cross a New Year (although I can’t remember that ever happening in all the years I lived in Europe). And yet, they also count per epidemic and not per calendar year. School years, running from September to July, provide a different sort of year, and the flu epidemics there are typically reported as ‘flu season 2014/2015’, indicating just that. Because those epidemics are short-lived, you typically get only one of those in a year, and in-season only.

Contrast this with COVID-19: it’s been going round and round and round since late December 2019, with waves and lulls for all countries, regions, and continents, but never did it stop for a season in whole regions or continents. Most countries come close to a stop during a lull at some point between the waves; for South Africa, according to worldometers, the lowest 7-day moving average since the first wave in 2020 was 265 recorded infections per day, on 7 November 2021. Any out-of-season waves? Oh yes – beta came along in summer last year and it was awful; at least for this year’s summer we got a relatively harmless omicron. And it’s not just South Africa that has been having out-of-season spikes. Point is, the COVID-19 pandemic ‘accomplishment’ wasn’t over within the year – neither a calendar year nor a northern hemisphere school year – and so we keep counting with the same counter for as long as the event takes, until the pandemic as event is over. There’s no nefarious plot of evil controlling scaremongering governments, just a ‘demic that takes a while longer than what we’d been used to until 2019.

In closing, it is, perhaps, not the last word on the ontological status of pandemic, but I hope the walkthrough provided a little bit of clarity in the meantime already.

References

[1] Doshi, P. The elusive definition of pandemic influenza. Bulletin of the World Health Organization, 2011, 89: 532-538.

[2] Kelly, H. The classical definition of a pandemic is not elusive. Bulletin of the World Health Organization, 2011, 89(7): 540-541.

[3] Morens, DM, Folkers, GK, Fauci, AS. What Is a Pandemic? The Journal of Infectious Diseases, 2009, 200(7): 1018-1021.

[4] Babcock, S., Beverley, J., Cowell, L.G. et al. The Infectious Disease Ontology in the age of COVID-19. Journal of Biomedical Semantics, 2021, 12, 13.

[5] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. 2013.

[6] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Springer LNCS 10650, 437-450. 6-9 Nov 2017, Valencia, Spain.

[7] Fleming DM, Zambon M, Bartelds AI, de Jong JC. The duration and magnitude of influenza epidemics: a study of surveillance data from sentinel general practices in England, Wales and the Netherlands. European Journal of Epidemiology, 1999, 15(5):467-73.

Conference report: SWAT4HCLS 2022

The things one can do when on sabbatical! For this week, it’s mainly attending the 13th Semantic Web Applications and Tools for Health Care and Life Sciences (SWAT4HCLS) conference, and even having some time to write a conference report again. (The last post tagged with conference report was FOIS2018, at the end of my previous sabbatical.) The conference consisted of a tutorial day, two conference days with several keynotes and invited talks, paper presentations and poster sessions, and, on the last day, a ‘hackathon’/unconference. This clearly has grown over the years from the early days of the event series (one day, workshop, life science).

A photo of the city where it was supposed to take place: Leiden (NL) (Source: here)

It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from, each promising that this time it will be solved once and for all, while practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.

Keynotes

The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool [1], a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help find information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome), or part thereof, and using UMLS and MetaMap to help find more connections. In a well-known domain with well-known terminology, one can formulate very specific queries over the academic literature; that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.

The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease Map that resulted from a lot of effort by many research communities with lots of online data resources [2]; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between the information in those diagrams and the databases [3]. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically [4]. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.

Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19 is not at all clear in Africa, but ideally it would be. Let me illustrate this: the world-wide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as many; that only about 10% of the COVID-19-positives are detected by the diagnostic tests—the rest doesn’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc. back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.

The last keynote, at the end of the conference, was by Amit Sheth, of the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots, and so on—you name it, somewhere it’s used in the systems.

Papers

Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’, and there’s no access to the papers yet to look up more details to check my notes and pad them. I’m assuming the papers will be accessible online soon (CEUR-WS again, presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.

Ruduan Plug elaborated on FAIR & GDPR and querying over integrated data within the above-mentioned VODAN-Africa project [5]. He also noted that South Africa’s PoPIA is stricter than the GDPR. I suspect that is due to the cross-border restrictions on the flow of data that the GDPR doesn’t have. (PoPIA is based on the GDPR principles, btw.)

Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid [6], which I have to leave here with a cliff-hanger: I didn’t manage to make a note of which one won, since I had to go to a meeting that we were already starting later because of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than is available to date).

Besides papers related to scientific research into all things covid, another recurring topic was FAIR data—whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing FAIR levels makes more sense before-and-after changing a single resource than when pitting different vocabularies against each other, that “FAIR enough” can be enough (cf. demanding 100% compliance) [7], and that a FAIR vocabulary does not imply that it is also a good quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis of the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of ontology engineering activities, that there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO [8]. What to do next with these insights remains to be seen.

Last, but not least—to try to keep the blog post at a sort of just-about-readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do [9]. Even when they do, it’s just with labels, rather than with any of the language models, be it the ontolex-lemon from the W3C community group or another, and if so, mainly for French and German.

Source: [9]

Does it matter? It depends on what your aims are. Our main motivations are ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just the health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that the related bio-ontologies also could do with coordinated multilingualism, so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.

Other

There were also posters, around which we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantics-driven tools, from sharing data, to searching data, to integrating data, to the analysis needed to develop the theory figuring out all its workings.

The conference was supposed to be mainly in person, but then, on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events, effective until, would you believe, 14 Jan—one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!

References

[1] Verspoor K. et al. Brief Description of COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research. In: Hiemstra D., Moens MF., Mothe J., Perego R., Potthast M., Sebastiani F. (eds). Advances in Information Retrieval. ECIR 2021. Springer LNCS, vol 12657, 559-564.

[2] Ostaszewski M. et al. COVID19 Disease Map, a computational knowledge repository of virus–host interaction mechanisms. Molecular Systems Biology, 2021, 17:e10387.

[3] Hanspers, K., Riutta, A., Summer-Kutmon, M. et al. Pathway information extracted from 25 years of pathway figures. Genome Biology, 2020, 21,273.

[4] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics, 2012, 45(3): 482-494. DOI: dx.doi.org/10.1016/j.jbi.2012.01.004.

[5] Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[6] Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[7] Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[8] César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

[9] Frances Gillis-Webber and C. Maria Keet, A Survey of Multilingual OWL Ontologies in BioPortal. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.

Trying to categorise popular science books

Some time last year, a colleague asked about good examples of popular science books, in order to read them and thereby get inspiration on how to write books at that level, or at least for first-year students at a university. I’ve read (and briefly reviewed) ‘quite a few’ across multiple disciplines and proposed to him a few that I enjoyed reading. One aspect that bubbled up at the time is that not all popsci books are of the same quality and, zooming in on this post’s topic: not all popsci books are of the same level, and they likely do not have the same target audience.

I’d say they range from targeting advanced interested laypersons to entertaining laypersons. The former entails that you’d be better off having covered the topic at school, an undergrad course or two will help as well in making it an enjoyable read, and you’d best be fully awake, not tired, when reading it. For the latter category, at the other end of the spectrum: having completed little more than primary school will do fine, no prior subject domain knowledge is required at all, and it’s good material for the beach; brain candy.

Either way, you’ll learn something from any popsci book, even if it’s too little for the time spent reading the book or too much to remember it all. But some of them are much more dense than others. Compare cramming the essence of a few scientific papers into a single page of a book with drawing out one scientific paper into a whole chapter. Then there’s humor—or the lack thereof—and lighthearted anecdotes (or not) to spice up the content to a greater or lesser extent. The author writing about fungi recounting eating magic mushrooms, say, or an economist being just as much of a sucker for summer sales in the shops as just about anyone. And, of course, there’s readability (more about that shortly in another post).

Putting all that in the mix, my groupings are as follows, with a selection of positive exemplars that I also enjoyed reading.

There are more popsci books that I thought were interesting to read, but I didn’t want to turn this into a laundry list. Also, books on politics, society, philosophy, and such seem to deserve their own discussion on categorisation, but that’s for another time. I also intentionally excluded computer science, information systems, and IT books, because I may be biased differently toward those books compared to the out-of-my-own-current-specialisation books listed above. For instance, Dataclysm by Christian Rudder, on data science mainly with OkCupid data (reviewed earlier), was of the ‘entertainment’ level to me, but probably isn’t so for the general audience.

Perhaps it is also of use to contrast them with ‘bad’ examples—well, not bad, but I think they did not succeed well in their aim. Two of them are Critical mass by Philip Ball (physics, social networks), because it was too wordy, drawn out, and dull, and This is your brain on music by Daniel Levitin (neuroscience, music), which was really interesting, but very, very dense. Looking up their scores on goodreads, those readers converge to that view for your brain on music (still a good 3.87 out of 5, from nearly 60000 ratings and well over 1500 reviews), as well as for the critical mass one (3.88 from some 1300 ratings and about 100 reviews). Compare that to a 4.39 for the award-winning Entangled life, 4.35 for Why we sleep, and 4.18 for Mama’s last hug. To be fair, not all books listed above have a rating above 4.

Be this as it may, I still recommend all of those listed in the four categories, and hopefully the sort of rough categorisation I added will assist in choosing a book among the very many vying for your attention and time.

Pushing the envelope categorising popsci books

Regarding book categories more generally, romance novels have subgenres, as does science fiction, so why not the non-fiction popsci books? Currently, they’re mostly either just listed (e.g., here or the new releases) or grouped by discipline, but not according to, say, their level of difficulty, humor, whether they mix science with politics, self-help, or philosophy, or some other quality dimension of the book along which they possibly could be assessed.

As an example that the latter might work for assigning attributes to the books: Why we sleep is 100% science, but a reader can distill some ideas from it to practice with, as self-help for sleeping better, whereas When: the scientific secrets of perfect timing is, contrary to what the title suggests, largely just self-help. Delusions of gender and Inside rebellion can, or, rather, should have some policy implications, and Why we sleep possibly as well (even if only to make school not start so early in the morning), whereas the sort of content of Elephants on acid already did (ethics review boards for scientific experiments, notably). And if you were not convinced of the existence of animal cognition, then Mama’s last hug may induce some philosophical reflection, and then have a knock-on effect on policies. Then there are some books that I can’t see having either a direct or an indirect effect on policy, such as Gastrophysics and Entangled life.

Let’s play a little more with that idea. What about vignettes composed of something like the following, shown in the table below?

Then a small section of the back cover of Entangled life would look like this, with the note that the humor is probably in between ‘yes’ and ‘some’ (I laughed harder with the book on drunkenness).

Mama’s last hug would then have something like:

And Why we sleep as follows (though I can’t recall for sure now whether it was ‘some’ or ‘no laughing matter’ and a friend has borrowed the book):

A real-life example of a categorisation box on a product; coffee suitable for moka pots, according to House of Coffees.

Of course, these are just mock-ups to demonstrate the idea visually and to try out whether it is even doable to classify the books. It is. There very well may be better icons than these scruffy ‘take a cc or public domain one and fiddle with it in MS Paint’ ones, or a mixed-mode approach could work, like on the packs of coffee (see image on the right).

Moreover: would you have created the same categorisation for the three examples? What (other) properties of popular science books could be useful? Also, and perhaps before going down that route: would something like this be useful to you, or to someone you know who reads popular science books? You may leave your comments below, on my Facebook page, or write an email, or we can meet in person some day.

p.s.: this is not a serious post on the ontology of popular science books — it is summer vacation time here and I used to write book reviews in the first week of the year and this is sort of related.

A brief reflection on maintaining a blog for 15 years (going on 16)

Fifteen years is a long time in IT, yet blogging software is still around and working—the same WordPress I started my blog with, even. At the time, in 2006, when WordPress was still only offering blogging functionality, it had the air of being respectable and at least somewhat serious compared to blogspot (which redirects to Blogger now), which hosted a larger share of the informal and whimsical blogs. Blogs are not nearly as popular now as they used to be; there seems to be a move to huddle together and take a ride on a branded bandwagon, like Medium and Substack, and all of the blog-providing companies have diversified the services they offer. WordPress now markets itself as website-builder software rather than blogging software.

One might even be tempted to argue that blogs are (nearly) obsolete, with TikTok and the like having come along over the years. Not so, claims a blogger here, as do some 10 more bloggers here, and blogs are even a necessity according to another, who does provide a list of links to data to back it up. (Just maybe don’t try making a living from it—there are plenty of people who like to read, but writing doesn’t pay well.)

Some data for this blog, then. It has 325 published posts, there are around 400-600 visitors per month in recent years (depending on the season and posting frequency), there are people still signed up to receive updates (78), some even like some of the posts, and some posts are shared on Twitter and other social media. The most visited post of all time got over 21000 visits and counting (since 2011), and the most visited post in the past year (after the home page) still had a fine 355 visitors and is on my research and teaching topic (see also the occasionally updated vox populi). So, obsolete it is not. Admittedly, the latter post had its heyday in 2010-2012 with about 2500 visits/year, and the former saw its best of times in 2014-2015 (4425 and 4948 visits in those years, respectively). The best-visited post of the mere 10 posts I wrote in 2021 is on bias in ontologies, having attracted the attention of 119 visitors. Summarizing this blog’s stats trends: numbers are down compared to 5-10 years ago, indeed, but insignificant it is not, and multiple posts have staying power.

Heatmap of monthly views to this blog over time.

I also can reveal that there’s no clear correlation between the time-to-write and number-of-visits variables, nor between either of them and the post’s topic, and not with post length either. With more time, there would have been more, and more polished, posts. There’s plenty to write about: not only the long-overdue posts for published papers that came out at an extra-busy time and therefore slipped through without a write-up, but also other interesting research that’s going on and deserves that extra bit of attention, some more book reviews, teaching updates, and so on. There’s no shortage of topics, which therewith turned out to be an unfounded worry from 15 years ago.

Will I go on for another 15 years? Perhaps, perhaps not. I’m still fence-sitting, as I have been from the very first post in 2006, which summed up the reasons for starting a blog, to this day: give it a try nonetheless and see when and where it will end.

Why still fence-sitting? I still don’t know whether it’s beneficial or harmful to one’s career, and, if beneficial, whether the time put into writing those posts could have been spent on alternative activities that would have yielded more benefit. What I do know is that, among other things, it has helped me learn to write better, and it made me take notes during conferences in order to write conference reports and therewith engage more productively with a conference, structure ideas and thoughts, and pitch papers. Also, the background searches for fact-checking, adding links, and trying to find pictures made me stumble into interesting detours. Some of the posts took a long time to write, but at least they were enjoyable pastimes or worktimes.

Uhm, so, the benefit is to (just?) me? I do hope the posts have been worthwhile to the readers. But it brings into view the question that’s well known to aspiring writers: should I write for myself or for my readers? The answer depends on whom you consult: blog for yourself, says the blogger from paradise; write for another, imaginary, reader persona, says the novelist; and go for bothsideism for the best results, according to the writer’s guide. I write for myself, and brush it up in an attempt to increase a post’s appeal. The brushing up mainly concerns the choice of words, phrases, and paragraphs and the ordering thereof, and the images to brighten up some of the otherwise text-only posts (like this one).

After so many years and posts, I ought to be able to say something more profound. It’s really just that, though: the joy of writing the posts, the hope it makes a difference to readers and to what I’ve written about, and the slight worry it may not be the best thing to do for advancing my career.

Be this as it may, over the past few days, I’ve added a bit more structure to the blog to assist readers in finding the topics they may be interested in. The key categories are now also accessible from the ‘Menu’: work-related topics (research and papers, software, and teaching) and posts on writing and publishing; the few posts that belong to neither can still be found on the complete list of posts. Happy reading!

p.s.: in case you wondered: yes, I intended to do a reflection when the blog turned a nice round 15 in late March, were it not for that blurry extension to 2020 and lots of extra teaching and teaching admin duties in 2021. The summer break has started now and there’s not much of a chance to properly go on holiday, and writing also counts as leisure activity, so there the opportunity was, just about three months shy of the blog turning 16. (In case the post’s title vaguely rings a bell: yes, there’s that cheesy song from one of the top-5 movie musicals of all time [according to imdb], depicting a happy moment with promise of staying together before Rolfe makes some more bad decisions, but that’s 16 going on 17.)

BFO decision diagram and alignment tool

How to align your domain ontology to a foundational ontology? It’s a well-known question, and one that I’ve looked into before as well. In some of that earlier work, we used DOLCE as the foundational ontology to align to. We devised the DOLCE decision diagram as part of the FORZA method to assist with the alignment process and implemented it in the MoKI ontology development tool [1]. MoKI is no more, but the theory and the algorithm’s design approach still stand. Instead of re-implementing it as a Protégé plugin and having it go defunct in a few years again (due to incompatible version upgrades, say), it sounded like more fun to design one for BFO and make a stand-alone tool out of it. And that design and the evaluation thereof is precisely what two of my ontology engineering course students—Chiadika Emeruem and Steve Wang—did for their mini-project of the course. It was then finalised and implemented in a tool for general use as part of the DOT4D project extension for my (award-winning) OE textbook.

More precisely, the first part is a diagram specifically for BFO – well, for one of its 2.0-ish versions in existence, at least. Deciding on which version to use and what would be good questions was not as trivial as it may sound. While the questions seem to work (as evaluated with several ontologies), it might still be of use to set up an experiment to assess usability from a modeller’s viewpoint.

BFO ‘decision diagram’ to assist trying to align one’s class of a domain or core ontology to BFO (click to enlarge, or navigate to the user guide at https://bfo-classifier.github.io/)

Be this as it may, this decision diagram was incorporated into the tool, which wraps around it with a nice interface offering user guidance and feedback, and it has the option to load an ontology and save the alignment into the ontology (along with BFO). The decision tree itself is stored as a separate XML file so that it can easily be replaced with any update thereto, be it to reflect changes in question formulation or to adjust it to some later version of BFO. The stand-alone tool is a jar file that can be downloaded from the GitHub repo, and the repo also has the source code that may be used or adapted (i.e., it has an open source licence). There’s also a user guide with explanations and screenshots. Here’s another screenshot of the tool in action:

Example of the BFO classifier in use, trying to align CODO’s ‘Disease’ to BFO, the trail of questions answered to get to ‘Disposition’, and the subsumption axiom that can be added to the ontology.
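As an aside on that replaceable XML file: the walking logic amounts to a simple recursive descent over question nodes. Here’s a minimal sketch, with a hypothetical node/question/category/yes/no structure; the actual element names are whatever the file in the tool’s repo uses:

```python
# Minimal sketch of walking such a decision-tree XML. The element and
# attribute names (node, question, category, yes, no) are hypothetical;
# the actual schema is the one shipped with the tool. Assumes a
# well-formed tree where every non-leaf has yes and no branches.
import xml.etree.ElementTree as ET

def classify(node: ET.Element) -> str:
    """Ask each question, follow the yes/no branch, return the leaf's BFO class."""
    if node.get("category"):                       # leaf: a BFO class to align to
        return node.get("category")
    answer = input(node.get("question") + " [y/n] ").strip().lower()
    branch = node.find("yes" if answer.startswith("y") else "no")
    return classify(branch.find("node"))

tree = ET.parse("bfo_decision_tree.xml")           # hypothetical file name
print("Align to:", classify(tree.getroot()))
```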

If you have any questions, please feel free to contact either of us.

References

[1] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp. 569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.

Some explorations into book publishing logistics

Writing a book is only one part of the whole process of publishing a book. The actual artefact eventually needs to get out into the wide world. Hard copy? E-book? Print-on-demand? All three or a subset only? Taking a step back: where are you as author located, where are the publisher and the printer, and where is the prospective audience? Is the prospective readership IT-savvy enough for e-books to even consider that option? Is the book’s content suitable for reading on devices with a gazillion different screen sizes? Here’s a brief digest from after my analysis paralysis over the too many options, none of which has it all – not ever, it seems.

I’ve written about book publishing logistics and choices for my open textbook, but that is, well, a textbook. My new book, No Taming of the Enthusiast, is of a different genre and aimed at a broader audience. Also, I’m a little wiser on the practicalities of hard copy publishing. For instance, it took nearly 1.5 months for the College Publications-published textbook to arrive in Cape Town, having travelled all the way from Europe where the publisher and printer are located. Admittedly, these days aren’t the best days for international cargo, but such a delivery time is a bit too long for the average book buyer. I’ve tried buying books with other overseas retailers and book sellers over the past few years—same story. On top of that, in South Africa, you then have to go to the post office to pick up the parcel and pay a picking-up-the-parcel fee (or whatever the fee is for), on top of the book’s cost and shipping fee. And it may get stuck in Customs limbo. This is not a good strategy if I want to reach South African readers. Also, it would be cool to get at least some books all the way onto the shelves of local book stores.

A local publisher then? That would be good for contributing my bit to stimulating the local economy as well. It has the hard copy logistics problem in reverse at least in part, however: how to get the books from so far down south to other places in the world where buyers may be located. Since the memoir is expected to have an international audience as well, some international distribution is a must. This requirement still gives three options: a multinational hard copy publisher that distributes to main cities with various shipping delays, print-on-demand (soft copy distributed, printed locally wherever it is bought), or e-book.

Let’s take the e-books detour for a short while. There is a low percentage of uptake of e-books – some 20% at best – and lively subjective opinions on why people don’t like e-books. I prefer hard copies as well, but tolerate soft copies for work. Both are useful for different types of use: a hard copy for serious reading and a soft copy for skimming and searching, so as to save oneself endless flicking to look up something. The same is happening with my textbook, to some extent at least: people pay to have it nicely printed and bound even though they can do that with the pdf themselves, or just read the pdf. For other genres: some are better in print in any case, such as colourful cookbooks, whereas others should tolerate e-readers quite well, such as fiction that’s just plain text.

In deciding whether to go for an e-book, I did explore usability and readability of e-books for non-work books, to form my own opinion on it. I really tried. I jumped into the rabbit hole of e-reader software with their pros and cons, and eventually settled on Calibre as the best fit. I read a fixed-size e-book in its entirety and it was fine, but there was a glitch in that it did not quite adjust to the screen size of the device easily, and navigating pages was awkward; I didn’t try to search. I also bought two e-book novels from Smashwords (epub format) and tested one for cross-device usability and readability. Regarding the ‘across devices’: I think I deserve to share and read e-books on all my devices when I have duly paid for the copyrighted books. And, lo and behold, I indeed could do so across unconnected devices by emailing myself on different email addresses. The flip side is that once any epub is downloaded by one buyer (separately, not into e-books software), it’s basically a free-for-all. There are also epub-to-pdf converters. The hurdles may be enough of a deterrent for an average reader, but it’s not even a real challenge for anyone in IT or computing.

After the tech tests, I read through the first few pages of one of the two epub e-books – and have abandoned it since. Although the epub file resized well, and I suppose that’s a pat on the back for the software developers, it renders ugly on both the dual laptop/tablet and the smartphone I checked it with. It offers nowhere near the same neat affordances as a physical book. For the time being, I’ll buy an e-book only if there’s no option to buy a hard copy and I really, really want to read it. Else I’ll just let it slide – there are plenty of interesting books that are accessible, and my reading time is limited.

Spoiler alert on how the logistics ended up eventually 🙂

So, now what for my new book? There is no perfect solution. I don’t want to be the author of something I would not want to read (the e-book), but it can be set up if there’s enough demand for it. Then, for the hard copies route: if you’re not already a best-selling author or a VIP who dabbles in writing, it’s not possible to get it both published ‘fast’ – in, say, at most 6 months, cf. the usual 1.5-2 years with a traditional publisher – and distributed ‘globally’. Even if you are quite the hotshot writer, you have to be rather patient and contend with limited reach.

Then what about me, a humble award-winning textbook writer who wrote a memoir as well, and who can be patient but generally isn’t for long? First, I still prefer hard copies, nonetheless. Second, there’s the decision to favour either local or global in the logistics. Eventually, I decided to favour local and found a willing South African publisher, Porcupine Press, to publish it under their imprint, and then went for print-on-demand for elsewhere. PoD will take a few days’ lead time for an outside-South-Africa buyer, but that’s little compared to international shipping times and costs.

How to do the PoD? A reader/buyer need not worry and simply will be able to buy it from the main online retailers later in the upcoming week, with the exact timing depending on how often they run their batch update scripts and how much manual post-processing they do.

From the publishing and distribution side: it turns out someone has thought about all that already. More precisely, IngramSpark has set up an international network of local distributors that has a wider reach than, notably, KDP for the Kindle, if that floats your boat (there are multiple comparisons of the two on many more parameters, e.g., here and here). You load the soft-copy files onto their system and then they push the book into some 40000 outlets, including the main international ones like Amazon and multiple national ones (e.g., Adlibris in Sweden, Agapea in Spain). Anyway, that’s how it works in theory. Let’s see how it works in practice. The ‘loading onto the system’ stage started last week and should be all done some time this upcoming week. Please let me know if it doesn’t work out; we’ll figure something out.

Meanwhile, for people in South Africa who can’t wait for the book store distribution that likely will take another few weeks to cover the Joburg/Pretoria and Cape Town book shops (and possibly on the shelf only in January): 1) it’s on its way for distribution through the usual sites, such as TakeALot and Loot, over the upcoming days (plus some days that they’ll take to update their online shop); 2) you’ll be able to buy it from the Porcupine Press website once they’ve updated their site when the currently-in-transit books arrive there in Gauteng; 3) for those of you in Cape Town, which is also where the company that did the actual printing is located (did I already mention logistics matter?): I received some copies for distribution on Thursday and I will bring copies to the book launch next weekend. If the impending ‘family meeting’ is going to mess up the launch plans due to an unpleasant, more impractical, adjusted lockdown level, or you simply can’t wait: you may contact me directly as well.

Progress on generating educational questions from ontologies

With increasing student numbers, but not as much more funding for schools and universities, and the desire to automate certain tasks anyhow, there have been multiple efforts to generate and mark educational exercises automatically. There are a number of efforts for the relatively easy tasks, such as learning a language, which range from entry-level simple vocabulary exercises to advanced ones of automatically marking essays. I’ve dabbled in that area as well, mainly with 3rd-year capstone projects and 4th-year honours student projects [1]. Then there’s one notch up, with fact recall and concept meaning recall questions, and further steps up, such as generating multiple-choice questions (MCQs) with not just obviously wrong distractors but good distractors to make the question harder. There’s quite a bit of work done on generating those MCQs in theory and in tooling, notably [2,3,4,5]. As a recent review [6] also notes, however, there are still quite a few gaps. Among others, about generalisability of theory and systems – can you plug any structured data or knowledge source into question templates? – and about the types of questions. Most of the research on ‘not-so-hard to generate and mark’ questions has been done for MCQs, but there are multiple other types of questions that also should be doable to generate automatically, such as true/false, yes/no, and enumerations. For instance, with an axiom such as impala \sqsubseteq \exists livesOn.land in an ontology or knowledge graph, a suitable question generation system may then generate “Does an impala live on land?” or “True or false: An impala lives on land.”, among other options.
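To make that concrete, here is a toy sketch of such template filling; it is merely an illustration of the idea, not the actual algorithm or templates from the work described below, and the verbalisation of livesOn is done by hand:

```python
# Toy illustration of filling question templates from an axiom of the form
# Sub ⊑ ∃prop.Obj. Not the paper's algorithm: the property is verbalised by
# hand and the article 'an' is hard-coded, glossing over exactly the naming
# issues discussed further below.
def generate_questions(sub: str, verb: str, verb_3sg: str, obj: str) -> list[str]:
    """Fill a yes/no and a true/false template with the axiom's vocabulary."""
    return [
        f"Does an {sub} {verb} {obj}?",                # yes/no
        f"True or false: An {sub} {verb_3sg} {obj}.",  # true/false
    ]

# impala ⊑ ∃livesOn.land
for q in generate_questions("impala", "live on", "lives on", "land"):
    print(q)
```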

We set out to make a start with tackling those sorts of questions, for the type-level information in an ontology (cf. facts in the ABox or a knowledge graph). The only work done there when we started was the slick and fancy Inquire Biology [5], which did not have its tech available for inspection and use, so we had to start from scratch. In particular, we wanted to find a way to be able to plug any ontology into a system and generate those non-MCQ types of educational questions (10 in total), where the questions generated are at least grammatically good and the answers also can be generated automatically, so that we get to automated marking as well.

Initial explorations started in 2019 with an honours project to develop some basics and a baseline, which was then expanded upon. Meanwhile, we have designed, developed, and evaluated some more, which was written up in the paper “Generating Answerable Questions from Ontologies for Educational Exercises” [7] that has been accepted for publication and presentation at the 15th International Conference on Metadata and Semantics Research (MTSR’21), which will be held online next week.

In short:

  • Different types of questions, and the answers they have to provide, put different prerequisites on the content of the ontology in terms of certain types of axioms. We specified those for 10 types of educational questions; a minimal version of one such check is sketched after this list.
  • Three strategies of question generation were devised: a ‘simple’ one that takes the vocabulary and axioms and plugs them into a template, one guided by some more semantics in the ontology (a foundational ontology), and one that didn’t really care about either but rather took a natural language approach. Variants were added to cater for differences in naming and other variations, amounting to 75 question templates in total.
  • The human evaluation with questions generated from three ontologies showed that while the semantics-based one was slightly better than the baseline, the NLP-based one gave the best results on syntactic and semantic correctness of the sentences (according to the human evaluators).
  • It was tested with several ontologies in different domains, and the generalisability looks promising.
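As promised, a minimal sketch of checking one such prerequisite: whether an ontology has axioms of the form C \sqsubseteq \exists p.D at all, which yes/no questions like the impala one need. It uses rdflib with a placeholder file name; the actual system checks prerequisites for all 10 question types, not just this one:

```python
# Minimal prerequisite check with rdflib: find axioms of the form C ⊑ ∃p.D.
# The file name is a placeholder; assumes an RDF/XML serialisation.
from rdflib import Graph

g = Graph()
g.parse("ontology.owl", format="xml")

q = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
SELECT ?c ?p ?d WHERE {
    ?c rdfs:subClassOf [ a owl:Restriction ;
                         owl:onProperty ?p ;
                         owl:someValuesFrom ?d ] .
}"""
for c, p, d in g.query(q):
    print(f"{c} ⊑ ∃{p}.{d}")   # candidate axioms for yes/no questions
```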
Graphical Abstract (made by Toky Raboanary)

To be honest to those getting their hopes up: there are some issues that cause it never to make it to ‘100% fabulous!’ if one still wants to design a system that should be able to take any ontology as input. A main culprit is the naming of elements in the ontology, which varies widely across ontologies. There are several guidelines for how to name entities, such as using camel case or underscores, and those things easily can be coded into an algorithm, indeed, but developers don’t stick to them consistently, or there’s an ontology import that uses another naming convention, so that there likely will be a glitch in the generated sentences here or there. Or they name things within the context of the hierarchy where they put the class, but in the question it is out of that context and then looks weird or is even meaningless. I moaned about this before; e.g., ‘American’ as the name of the class that should have been named ‘American Pizza’ in the Pizza ontology. Or the word used for the name of the class can have different POS tags such that it makes the generated sentence hard to read; e.g., ‘stuff’ as a noun or a verb.
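The mechanical part of that is easy enough. Here’s a toy sketch of handling those two conventions, which deliberately leaves open exactly the hard parts just mentioned:

```python
# Toy normalisation of entity names for verbalisation: underscores and
# camel case are mechanical; deciding on casing, POS tags, and context
# (the 'American' pizza problem) is exactly what remains hard.
import re

def name_to_words(name: str) -> str:
    """'livesOn' -> 'lives On'; 'American_Pizza' -> 'American Pizza'."""
    spaced = name.replace("_", " ")
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", spaced)

print(name_to_words("livesOn"))         # 'lives On' -- casing still to fix
print(name_to_words("American_Pizza"))  # 'American Pizza'
```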

Be this as it may, overall, promising results were obtained and are being extended (more to follow). Some details can be found in the (CRC of the) paper, and the algorithms and data are available from the GitHub repo. The first author of the paper, Toky Raboanary, recently made a short presentation video about the paper for the yearly Open Evening/Showcase, which was held virtually, and that page is still available online.

References

[1] Gilbert, N., Keet, C.M. Automating question generation and marking of language learning exercises for isiZulu. 6th International Workshop on Controlled Natural language (CNL’18). Davis, B., Keet, C.M., Wyner, A. (Eds.). IOS Press, FAIA vol. 304, 31-40. Co. Kildare, Ireland, 27-28 August 2018.

[2] Alsubait, T., Parsia, B., Sattler, U. Ontology-based multiple choice question generation. KI – Kuenstliche Intelligenz, 2016, 30(2), 183-188.

[3] Rodriguez Rocha, O., Faron Zucker, C. Automatic generation of quizzes from DBpedia according to educational standards. In: The Third Educational Knowledge Management Workshop. pp. 1035-1041. Lyon, France, April 23-27, 2018.

[4] Vega-Gorgojo, G. Clover Quiz: A trivia game powered by DBpedia. Semantic Web Journal, 2019, 10(4), 779-793.

[5] Chaudhri, V., Cheng, B., Overholtzer, A., Roschelle, J., Spaulding, A., Clark, P., Greaves, M., Gunning, D. Inquire biology: A textbook that answers questions. AI Magazine, 2013, 34(3), 55-72.

[6] Kurdi, G., Leo, J., Parsia, B., Sattler, U., Al-Emari, S. A systematic review of automatic question generation for educational purposes. Int. J. Artif. Intell. Edu, 2020, 30(1), 121-204.

[7] Raboanary, T., Wang, S., Keet, C.M. Generating Answerable Questions from Ontologies for Educational Exercises. 15th Metadata and Semantics Research Conference (MTSR’21). 29 Nov – 3 Dec, Madrid, Spain / online. Springer CCIS (in print).

Bias in ontologies?

Bias in models in the area of Machine Learning and Deep Learning is well known. Such models feature in the news regularly with catchy headlines, and there are longer, more in-depth reports as well, such as Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven, bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs [1], one attempt toward a framework, but looking at FOAF only [2], which is stretching it a little for what passes as an ontology, and then, stretching it even further, an old one of mine on bias in relation to conceptual data models for databases [3].

Do we simply not have bias in ontologies? That sounds a bit optimistic, since bias is pervasive elsewhere, and it is at least worthy of examination whether there is such a notion as bias in ontologies and, if so, what its sources may be. And, if one wants to dig deeper, down into Ontology: what is bias anyhow? The popular media is much more liberal in its use of the term ‘bias’ than the scientific literature, and I’m not going to answer that last question here now. What I did do is try to identify sources of bias in the context of ontologies, and I took a relevant selection of Dimara et al.’s list of 154 biases [4] (just as only a subset was relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.

The outcome of that exploratory analysis [5], in short, is: yes, there is such a notion as bias in ontologies as well. First, I identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined the three COVID-19 ontologies (CIDO, CODO, COVoc) for possible bias, and they indeed exhibited different subsets.

The sources can be philosophical, by purpose (commonly known as encoding bias), or rooted in the subject domain, such as scientific theory, granularity, linguistic, socio-cultural, political or religious, and economic motivations, and they may be explicit choices or creep in implicitly.

Table 1. Summary of typical possible biases in ontologies grouped by source, with an indication whether such biases would be explicit choices or whether they may creep in unintentionally and lead to implicit bias. (Source: [5])

An example of an economic motivation is to (try to) categorise some disorder as a type of disease: the latter gets more resources for medicines, research, and treatments, and is more costly for insurers, who’d rather keep it out of the terminology altogether. Or modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see paper for details). Terrorism ontologies can provide ample material for political views to creep in.

Besides the hand-picked examples, I assessed the three COVID-19 ontologies in more detail. Not because I wanted to pick on them—I actually think it’s laudable they tried in trying times—but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked both at the sources, which can be argued to be present, and identified some from a selection of Dimara et al.’s list, such as the “mere exposure/familiarity” bias and the “false consensus” bias (see table below). How they are present is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.

Table 2. Tentative presence of bias in the three COVID-19 ontologies, by cognitive bias; see paper for details.

Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much; or, at least, any inconsistencies that the bias would have caused should have been detected and discussed during the ontology development stage.

Will it matter for, say, annotating data or literature? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not ‘female’ (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” they want to do, it’s going to be even harder to retrieve COVID-19 research papers relating to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access (OBDA), bias may have negative effects. Take as an example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and suppose this ontology were used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA approved drugs database:

Figure 1. OBDI scenario with CIDO, two databases, and a query over the system that returns a logically correct but undesirable result due to some optimism that an experimental substance is already a drug.

The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect, and the error is due to the biased and erroneous class subsumption declared in the ontology, not to the data source itself.
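The effect is easy to reproduce in miniature. A toy sketch with rdflib, using made-up IRIs and one ‘data’ triple standing in for the OBDA-mapped databases; it illustrates the entailment, not the actual OBDA setup:

```python
# Toy reproduction of the biased entailment: the subsumption declared in
# the ontology makes the 'give me all COVID-19 drugs' query return the
# experimental substance. IRIs and data are made-up stand-ins for CIDO
# and the mapped databases.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/cido-toy#")
g = Graph()

# the biased axiom: COVID-19 experimental drug ⊑ COVID-19 drug
g.add((EX.COVID19ExperimentalDrug, RDFS.subClassOf, EX.COVID19Drug))
# one fact, as it would arrive from the clinical trials database via mappings
g.add((EX.hydroxychloroquine, RDF.type, EX.COVID19ExperimentalDrug))

q = """
PREFIX ex:   <http://example.org/cido-toy#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?x WHERE { ?x a/rdfs:subClassOf* ex:COVID19Drug . }
"""
for (x,) in g.query(q):
    print(x)  # ...#hydroxychloroquine: logically correct, yet undesirable
```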

Some peculiarities of content in an ontology may not be due to an underlying bias, but merely a case of ‘ran out of time’ rather than an act of omission, for instance. Or it may not be an honest mistake due to bias, but a mistake because of some other reason, such as having clicked erroneously on a wrong button in the tool’s interface, say, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One also can have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. Or: there seems to be enough to do yet.

Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but it’s looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.

References

[1] K. Janowicz, B. Yan, B. Regalia, R. Zhu, G. Mai, Debiasing knowledge graphs: Why female presidents are not like female popes, in: M. van Erp, M. Atre, V. Lopez, K. Srinivas, C. Fortuna (Eds.), Proceedings of ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, volume 2180 of CEUR-WS, 2018.

[2] D. L. Gomes, T. H. Bragato Barros, The bias in ontologies: An analysis of the FOAF ontology, in: M. Lykke, T. Svarre, M. Skov, D. Martínez-Ávila (Eds.), Proceedings of the Sixteenth International ISKO Conference, Ergon-Verlag, 2020, pp. 236 – 244.

[3] Keet, C.M. Dirty wars, databases, and indices. Peace & Conflict Review, 2009, 4(1):75-78.

[4] E. Dimara, S. Franconeri, C. Plaisant, A. Bezerianos, P. Dragicevic, A task-based taxonomy of cognitive biases for information visualization, IEEE Transactions on Visualization and Computer Graphics 26 (2020) 1413–1432.

[5] Keet, C.M. An exploration into cognitive bias in ontologies. Cognition And OntologieS (CAOS’21), part of JOWO’21, part of BoSK’21. 13-16 September 2021, Bolzano, Italy. (in print)