When can we declare the COVID-19 pandemic to be over? I mulled over that in January this year, when the omicron wave was fizzling out in South Africa, wrote a blog post as a step toward figuring it out, and a short article for the general public was published by The Conversation (republished widely, including by The Next Web). That was not the end of it, though. In parallel – or, more precisely, behind the scenes – that ontological investigation did happen scientifically and in much more detail.
First, it includes a proper discussion of how the nine relevant domain ontologies represent pandemic – as the same as epidemic, as a sibling thereof, or as a subclass, and why – and what sort of generic top-level entity it is asserted to be, plus a few more scientific references by domain experts.
Second, besides the two foundational ontologies whose alignments I discussed in the blog post (DOLCE and BFO), I tried five more foundational ontologies, selected on several criteria: BORO, GFO, SUMO, UFO, and YAMATO. That mainly took up a whole lot more time, but it didn’t add substantially to the insights into what kind of entity pandemic is. It did, however, make clear that manual alignment is hard, and that it is difficult to get it as precise as it ought, and may need, to be, for several reasons (elaborated on in the paper).
Third, I dug deeper into the eight characteristics of pandemics according to the review by Morens, Folkers and Fauci (yes, him, of the NIAID) [2] and disentangled what’s really going on with those, besides already having noted that several of them are fuzzy. Some of the characteristics aren’t really properties of the pandemic itself, but of closely related entities, such as the disease (see table below). There are so many intertwined entities and relations, in fact, that one could very well develop an ontology of just pandemics, rather than have it only as a single class in an ontology, as is now the case. For instance, there has to be a high attack rate, but ‘attack rate’ itself relies on the fact that there is an infectious agent that causes a disease, and on the R (reproduction) number, which, in turn, is a complex thing that takes into account factors including susceptibility to infection, social dynamics of a population, and the ability to measure infections.
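For reference – my rendering of the standard epidemiological definition, not a formula from the paper – the attack rate is

\[
\text{attack rate} = \frac{\text{number of new cases during the period}}{\text{population at risk at the start of the period}}
\]

which already presupposes that new cases and the population at risk can be identified and counted.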
Finally, there are different ways to represent all the knowledge, or a relevant part thereof, as I also elaborated on in my Bio-Ontologies keynote last month. For instance, the attack rate could be squashed into a single data property if the calculation is done elsewhere and you don’t care how it is calculated, or it can be represented in all its glorious detail, be it for the sake of it or for getting a clearer picture of what goes into computing the R number. For a scientific ontology, the latter is obviously the better choice, but there may be scenarios where the former is more practical.
The conclusion? The analysis cleared up a few things, but with some imprecise and highly complex properties as part of the mix to determine what is (and is not) a pandemic, there will be more than one optimum/finish line for a particular pandemic. To arrive at something more specific than in the paper, the domain experts may need to carry out a bit more research or come up with a consensus on how to precisiate those properties that are currently still vague.
Last, but not least, on attending ICBO’22, which will be held from 25-28 September in Ann Arbor, MI, USA: it runs in hybrid format. At the moment, I’m looking into the logistics of trying to attend in person, now that we don’t have the highly anticipated ‘winter wave’ like the one we had last year that thwarted my conference travel planning. While that takes extra time and resources to sort out, there’s the very thick silver lining that this also means we seem to be considerably closer to the real end of this pandemic (of the acute infections, at least). According to the draft characterisation of pandemic, one indeed might argue it’s over.
References
[1] Keet, C.M. Exploring the Ontology of Pandemic. 13th International Conference on Biomedical Ontology (ICBO’22). CEUR-WS. Michigan, USA, September 25-28, 2022.
[2] Morens, D.M., Folkers, G.K., Fauci, A.S. What Is a Pandemic? The Journal of Infectious Diseases, 2009, 200(7): 1018-1021.
Natural language generation applications have been ‘mainstreaming’ behind the scenes for the last couple of years, from automatically generating text for images, to weather forecasts, summarising news articles, digital assistants that mechanically blurt out text based on the structured information they have, and many more. Google, Reuters, BBC, Facebook – they all do it. Wikipedia is working on it as well, principally within the scope of Abstract Wikipedia, to try to build a better multilingual Wikipedia [1] and reach more readers. They all have some source of structured content – like data fetched from a database or spreadsheet, information from, say, a UML class diagram, or knowledge from some knowledge graph or ontology – and a specification as to what the structure of the sentence should be, typically with some grammar rules to at least prettify it, if not being essential to generate a grammatically correct sentence [2]. That specification is written in templates that are then filled with content.
For instance, a simple rendering of a template may be “Each [C1] [R1] at least one [C2]” or “[I1] is an instance of [C1]”, where the things within the square brackets are variables standing in for content that will be fetched from the source, like a class, relationship, or individual. Linking these to a knowledge graph about universities, it may generate, e.g., “Each academic teaches at least one course” and “Joanne Soap is an instance of Academic”. To get the computer to do this, just “Each [C1] [R1] at least one [C2]” as the template won’t do: we need to tell it what the components are, so that the program can process it to generate that (pseudo-)natural language sentence.
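The filling step itself is trivial once the components are known; a minimal sketch in Python (the dictionary stands in for content fetched from that university knowledge graph; all names here are mine, not an actual realiser’s):

template = "Each {C1} {R1} at least one {C2}"
# toy stand-in for content fetched from a knowledge graph about universities:
content = {"C1": "academic", "R1": "teaches", "C2": "course"}
print(template.format(**content))  # Each academic teaches at least one course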
Many years ago, we did this for multiple languages and used XML to specify the templates for the key aspects of the content. The structured input was conceptual data models in ORM, in the DOGMA tool that had that verbalisation component [3]. As an example, the template for verbalising a mandatory constraint was as follows:
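(In outline – the element names below are reconstructed from the description that follows, not quoted verbatim from [3]:)

<Sentence type="Mandatory">
  <text> Each </text>
  <Object index="0" />
  <text> must </text>
  <Role index="0" />
  <text> at least one </text>
  <Object index="1" />
</Sentence>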
Besides demarcating the sentence and indicating the constraint, there’s fixed text within the <text> … </text> tags, and there’s the variable part with the <Object… that declares that the name of the object type has to be fetched, and the <Role… that declares that the name of the relationship has to be fetched from the model (well, more precisely in this case: the reading label), all of which were elements declared in an XML Schema. With the same example as before, where Academic is in the object index “0” position and Course in the “1” position (see [3] for details), the software would then generate “ – [Mandatory] Each Academic must teaches at least one Course.”
This can be turned up several notches by adding grammatical features to it in order to handle, among others, gender for nouns in German, which affects the rendering of the ‘each’ and ‘one’ in the sample sentence, not to mention the noun classes of isiZulu and many other languages [4], where even the verb conjugation depends on the noun class of the noun that plays the role of subject in the sentence. Or you could add sentence aggregation to combine two templates into one larger one to generate more flowy text, like a “Joanne Soap is an academic who teaches at least one course”. Or change the application scenario or the machinery for how to deal with the templates. For instance, instead of those variables in the template plus code elsewhere that does the content fetching and any linguistic processing, we could put part of that in the template specification. Then there are no variables as such in the template, but functions. The template specification for that same constraint in an ORM diagram might then look like this:
ConstraintIsMandatory {
    “[Mandatory] Each ”
    FetchObjectType(0)
    “ must ”
    MakeInfinitive(FetchRole(0))
    “ at least one ”
    FetchObjectType(1)
}
If you want to go with newer technology than markup languages, you may prefer to specify it in JSON. If you’re excited about functional programming languages and see everything through the lens of functions, you can even turn the whole template specification into a bunch of only functions. Either way: there must be a specification of what those templates are permitted to look like, or: what elements can be used to make a valid specification of a template. This is so that the software will work properly: so that it neither spits out garbage nor halts halfway before returning anything. What is permitted in a template language can be specified by means of a model, such as an XML Schema or a DTD, a JSON artefact, or even an ontology [5], by a formal definition in some notation of choice, or by defining a grammar (be it a CFG or in BNF notation), and in any case with enough documentation to figure out what’s going on.
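As a toy illustration of the function-based reading of that template – all names and the simplistic infinitive rule here are invented for the sketch; this is not the DOGMA or Wikifunctions machinery – a few lines of Python can play interpreter:

model = {"objects": ["Academic", "Course"], "roles": ["teaches"]}

def fetch_object_type(i):
    return model["objects"][i]

def fetch_role(i):
    return model["roles"][i]

def make_infinitive(verb):
    # toy rule that happens to work for 'teaches'; a real grammar engine does more
    return verb[:-2] if verb.endswith("es") else verb

functions = {"FetchObjectType": fetch_object_type,
             "FetchRole": fetch_role,
             "MakeInfinitive": make_infinitive}

def evaluate(part):
    if isinstance(part, str):          # fixed text
        return part
    name, arg = part                   # (function name, argument) pair
    if isinstance(arg, tuple):         # nested call, e.g. MakeInfinitive(FetchRole(0))
        arg = evaluate(arg)
    return functions[name](arg)

constraint_is_mandatory = [
    "[Mandatory] Each ", ("FetchObjectType", 0),
    " must ", ("MakeInfinitive", ("FetchRole", 0)),
    " at least one ", ("FetchObjectType", 1),
]

print("".join(evaluate(p) for p in constraint_is_mandatory))
# prints: [Mandatory] Each Academic must teach at least one Course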
How might this look in the context of Abstract Wikipedia? For the natural language generation aspects and its first proposal for the realiser architecture, the structured content to be rendered in a natural language sentence is fetched from Wikidata, as is the lexicographic data, and the functions to do the various computations are to come from/go in Wikifunctions. They’re then combined with the templates in various stages of the realiser pipeline to generate those sentences. But there was still a gap as to what those templates in this context may look like. Ariel Gutman, a Google.org fellow working on Abstract Wikipedia, and I gave it a try, and that proposal for a template language for Abstract Wikipedia is now accessible online for comment, feedback, and, if you happen to speak a grammatically rich language, an option to provide difficult examples so that we can check whether the language is expressive enough.
The proposal is – as any other proposal for a software system – some combination of theoretical foundations, software infrastructure peculiarities, reasoned and arbitrary design decisions, compromises, and time constraints. Here’s a diagram of the key aspects of the syntax, i.e., with the elements, how they relate, and the constraints holding between them, in ORM notation:
An illustrative diagram with the key features of the template language in ORM notation.
There’s also a version in CFG notation, and there are a few examples, each of which shows what the template looks like for verbalising one piece of information (Malala Yousafzai’s age) in Swedish, French, Hebrew, and isiZulu. Swedish is the simplest one, as English or Dutch would be; here’s a Dutch analogue of it to begin with:
Persoon_leeftijd_nl(Entity,Age_in_years): “{Person(Entity) is {Age_in_years} jaar.}”
Where Person(Entity) fetches the name of the person (who is identified by an identifier) and Age_in_years fetches the age. One may like to complicate matters and add a conditional statement, like rendering that last part for any age under 30 not as jaar oud ‘years old’ but as jaar jong ‘years young’; where that dividing line lies is a sensitive topic for some, and I will let that rest. In any case, in Dutch there’s no processing of the number itself to be able to render it in the sentence – 25 renders as 25 – but in other languages there is. For instance, in isiZulu. In that case, instead of simply fetching the number, we can put a function in the slot:
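(In sketch form – the function names are those used in the explanation below, the exact syntax is assumed rather than quoted from the proposal, and the pluralisation of unyaka is elided:)

Year_zu(years): “{Lexeme(L686326)} {RelativeConcord()}{Copula()}{nounPrefix(Cardinal(years))}-{years}”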
Where Lexeme(L686326) is the word for ‘year’ in isiZulu, unyaka. For the rest, it first links the age rendering to the ‘year’ with the RelativeConcord() of that word, which practically fetches e- for the ‘years’ (iminyaka, noun class 4), then gets the copulative (ng in this case), and then the concord for the noun class of the noun of the number. Malala is in her 20s, which is amashumi amabili .. (noun class 6, which is computed via Cardinal(years)), and thus the function nounPrefix will fetch ama-. So, for Malala’s age data, Year_zu(years) will return iminyaka engama-25. That then gets processed with the rest of the Person_AgeYr_zu template, such as adding a U to the name by subj:Person(Entity), and later steps in the pipeline take care of things like phonological conditioning (-na- + i- = -ne-), to eventually output UMalala Yousafzai uneminyaka engama-25. In other words: such a template indeed can be specified with the proposed template syntax.
There’s also a section in the proposal about how that template language then connects to the composition syntax, so that it can be processed by the Wikifunctions Orchestrator component of the overall architecture. That helps hide a few complexities from the template declarations, but, yes, someone’s got to write those functions (or take them from existing grammar engines) that will take care of those more or less complicated processing steps. That’s a different problem to solve. You also could link it up with another realiser by means of a transformation to the input type it expects. For now, it’s the syntax of the declarative part for the templates.
If you have any questions or comments or suggestions on that proposal or interesting use cases to test with, please don’t hesitate to add something to the talk page of the proposal, leave a comment here, or contact either Ariel or me directly.
[5] Mahlaza, Z., Keet, C. M. ToCT: A Task Ontology to Manage Complex Templates. Proceedings of the Joint Ontology Workshops 2021, FOIS’21 Ontology Showcase. Sanfilippo, E.M. et al. (Eds.). CEUR-WS vol. 2969. 9p.
How do you know whether the ontology you developed or want to reuse is any good? It’s not a new question. It has been investigated quite a bit, and so the answer to it is not a short one. Based on a number of anecdotes, however, it seems ever more people are leaning toward a short answer along the lines of “it’ll be fine if it can answer my competency questions”. That is most certainly not the right answer. Let me illustrate this.
Here’s a set of 5 competency questions and a bad ontology (with the OWL file), being a newly mutilated version of the African Wildlife Ontology [1] modified with a popular South African pastime: the braai, i.e., a barbecue.
CQ1: Which animals are served at a barbecue? (Sample answers: kudu, impala, warthog)
CQ2: What are the materials used for a barbecue? (Sample answers: tongs, skewers, poolbraai)
CQ3: What is the energy source for a braai device? (Sample answers: gas, coal)
CQ4: Which vegetables taste good with a braai? (Sample answers: tomatoes, onion, butternut)
CQ5: What food is eaten at a braai, or: what collection of edible things are offered?
The bad ontology does have answers to the competency questions, so a ‘CQs-only’ criterion for quality would suggest that the bad ontology is a good one. 100% good, even.
Why is it a bad one nonetheless?
That’s where years of methods, techniques, and tool development enter the stage (my textbook dedicates Section 5.2 to that). There are heuristics-based tips to prevent pitfalls [2], in general and for bio-ontologies with GoodOD, and there’s also a framework for ontology quality, OQuaRE [3]; all of these aim to approach this issue of quality systematically. Let’s have a look at some of that.
Low-hanging fruit for a quick sanity check is to run the ontology through the Ontology Pitfall Scanner OOPS! [4]. Here’s the summary result, with two opened up that show what was flagged and why:
Mixing naming conventions is not neat. Examples in the badBBQ ontology are using CamelCase with PoolBraai but a dash in tasty-plant and spaces converted to underscores in Food_Preparation_Material, and lower case for some classes and upper case for others (plant and PoolBraai). An example of an unconnected ontology element is Site: the idea is that if it isn’t really used anywhere in the ontology, then maybe it shouldn’t be in the ontology, or you forgot to add something there, and OOPS! points you to that. Pitfall P11 may be contested, but if at all possible, one really should add a domain and range to an object property so as to minimise unintended models and make the ontology closer to the reality (or understanding thereof) one aims to represent. For instance, surely eats should not have any of the braai equipment on the left-hand side, in the domain position, because equipment does not eat – only organisms do.
At the other end of the spectrum are the philosophy and Ontology-inspired methods. The most well-known one is OntoClean [5], which is summarised in the textbook, with a tutorial for it in Appendix A. Perhaps the most straightforward (and simplified) rule within that package is that anti-rigid classes cannot subsume rigid classes, or, in layperson terminology: (physical) entities cannot be subclasses of things that are roles that entities play. Person cannot be a subclass of Employee, since not all persons are always employees. For the badBBQ: Food is a role that an organism or part thereof plays in a certain context, and animals and plants are not always food – they are organisms (or parts thereof) irrespective of the roles they may play (or, worded differently: of the roles that they are the ‘bearer of’).
Then there are the methods and tools in between these two extremes. Take, for instance, Advocatus Diaboli / PEW (Possible World Explorer) [6], which helps you find places where disjointness axioms ought to be added. This is in the same line of thinking as adding those domain and range axioms: it helps you to be more precise and find mistakes. For instance, Site and BraaiEquipment are definitely intended to be disjoint: some location cannot be a concrete physical object. Adding the disjointness axiom results in an error, however: PoolBraai is unsatisfiable, because it was declared to be a subclass of both Site and BraaiEquipment. Pool braais do exist, as there are braais that can be placed in or next to a pool. The issue here is that there are two different meanings of the same term: once as the device for the barbecue and once as the ‘braai area by the pool’. That is, they are two different entities, not one, and so they either have to appear as two different entities in the ontology, with different names, or the intended one has to be chosen and one of the subsumption axioms removed.
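To see that clash outside of Protégé: a quick sketch with owlready2, using a mock IRI and re-declaring just the three classes (so not the actual badBBQ file):

from owlready2 import Thing, AllDisjoint, get_ontology, sync_reasoner, default_world

onto = get_ontology("http://example.org/badbbq#")
with onto:
    class Site(Thing): pass
    class BraaiEquipment(Thing): pass
    AllDisjoint([Site, BraaiEquipment])          # the newly added disjointness
    class PoolBraai(Site, BraaiEquipment): pass  # declared under both

sync_reasoner()  # runs the bundled HermiT reasoner (needs Java)
print(list(default_world.inconsistent_classes()))  # PoolBraai shows up here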
I also put some ugly things in the description of Braai: the two ways of declaring the source of heating, and the member relation. While one may say informally that a braai involves a collection of things (CQ5), ontologically it won’t fly with ‘member’. Membership is not arbitrary. There are foundational (or top-level) ontologies whose developers already did the heavy lifting of ontological analysis of key elements, and membership is one of them (see, among others, [7-9]). Such relations can simply be reused in one’s own ontology (e.g., imported from here), with their widely agreed-upon meaning; there’s even a tool to assist you with that [10]. If what you want is something other than that, then that relation is not membership but indeed something else. In this case, there are two options to fix it: 1) a braai as an event (rather than the device) will have objects (such as the food and the tongs) participating in the event, or 2) for the braai as a device, it has accessories (related with hasAccessory, if you will), such as the tongs, and it is used for preparing (/barbecuing/cooking/frying) food (/meals/dinners).
Then the source of heating. The one-of construct (with the {…}) is relatively popular in conceptual data modelling when you know the set of values is only ever allowed to be that, like the days of the week. But in our open world of ontologies, more just might be added or removed. And, ontologically, coal, gas, and electricity are not individuals, so that is incorrect as well. The other option, with heatedBy xsd:string, has its own set of problems, largely because data properties with their data types entail application implementation decisions that ought not to be in an ontology that is supposed to be usable across multiple applications (see Section 6.1 ‘attributions’ for a longer explanation). It can be addressed by granting them their rightful status as classes in the OWL file and relating those to the braai.
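A sketch of that repair, again in owlready2 and with the class and property names being my own choice:

from owlready2 import Thing, ObjectProperty, get_ontology

onto = get_ontology("http://example.org/improvedbbq#")
with onto:
    class Braai(Thing): pass
    class EnergySource(Thing): pass   # instead of {coal, gas, electricity} or xsd:string
    class Coal(EnergySource): pass
    class Gas(EnergySource): pass
    class Electricity(EnergySource): pass
    class heatedBy(ObjectProperty):   # with domain and range declared, per pitfall P11
        domain = [Braai]
        range = [EnergySource]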
This is not an exhaustive analysis of the badBBQ ontology, nor even close to a full list of the latest methods and techniques for good ontology development, but I hope I’ve illustrated my point about not relying on just CQs as the evaluation of your ontology. Sample changes made to the badBBQ are included in the improvedBBQ OWL file. Here’s a snapshot of the differences in the basic metrics (on the left). There’s room for another round of improvements, but I’ll leave that for later.
All this was not to say that competency questions are useless. They are not. They can be very useful to demarcate the scope of the ontology’s content, to keep on track with that since it’s easy to go astray from the intended scope once you begin or be subjected to scope creep, and to check whether at least the minimum content is in there somehow (and if not, why not). It’s the easy thing to check compared to the methods, techniques, and theory about good, sub-optimal, and bad ways of representing something. But such relative ease with CQs, perhaps unfortunately, does not mean it suffices to obtain a ‘good quality’ stamp of approval. Why the plethora of methods, techniques, theories, and tools aren’t used as often as they should, is a question I’d like to know the answer to, and may be a topic for another time.
[2] Keet, C.M., Suárez-Figueroa, M.C., Poveda-Villalón, M. Pitfalls in Ontologies and TIPS to Prevent Them. Knowledge Discovery, Knowledge Engineering and Knowledge Management: IC3K 2013 Selected Papers. A. Fred et al. (Eds.). Springer CCIS vol. 454, pp. 115-131, 2015. preprint
[3] Duque-Ramos, A. et al. OQuaRE: A SQuaRE-based approach for evaluating the quality of ontologies. Journal of Research and Practice in Information Technology, 2011, 43(2): 159-176.
I was interviewed recently about my ontology engineering textbook, following having won the 2021 UCT Open Textbook Award for it. The interviewer assumed initially it was a textbook for undergraduate students because it has the word ‘Introduction’ in the title. Not quite. Soon thereafter, one of the 3rd-year computer science students who arrived early in class congratulated me on the award and laughed that that was an introduction at a different level altogether. It is, by design, but largely so with respect to the topics covered: it does not assume the reader knows anything about ontologies—hence, the ‘introduction’—but it does take for granted that the reader knows some of the basics in computer science or software engineering. For instance, there’s no explanation on what a database is, or a conceptual data model, or object-oriented software.
In addition, and getting to this post’s topic, I had tried to make the textbook readable, and at least definitely more accessible than the scientific papers and handbooks that were the only alternatives before this textbook saw the light of day. I think it is readable, and I also have received feedback that the book was easily readable. Admittedly, though, the notion of assessing readability only came to the fore in the editing process of my memoir, for it is aimed at a broader audience than the textbook. This raised a nagging question. What is it that makes some text readable?
It’s one of those easy questions that just do not have a simple answer. The quickest answer is “use a readability metric standardised by grade level” for a home language/mother tongue speaker. Scratching that surface lays bare the next question: what parameters have to be taken into account, and in what way, so as to come up with a score for the estimated grade level? Even the brief overview on the Wikipedia page on readability already lists 11 measurable parameters, and there are different ways to measure them and to possibly combine them as well. The same page lists 8 popular metrics and 4 advanced ones. That’s just for English. For instance, the Flesch reading ease is calculated as
206.835 – 1.015 * (total number of words / total number of sentences) – 84.6 * (total number of syllables / total number of words)
to result in rough bands of reading ease. For instance, 90-100 for an 11-year old, 60-70 as ‘plain English’, up to anything <30 down to 0 (and possibly even negative) for very to extremely difficult English texts and for professionals and graduate students. See also the figure on the right.
The Gunning fog index has fewer fantastically tweaked multipliers:
Grade level = 0.4 * (average sentence length + percentage of Hard Words)
but there’s a wonderful Hard Words variable. What is that supposed to mean exactly? The readability page says that they are those words with two or more syllables, but the Gunning fog index page says three or more syllables (excluding proper nouns, familiar jargon, or compound words, and not counting common suffixes either).
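Either way, both metrics are a few lines of code once those parameter choices are fixed; here’s a self-contained sketch, with a crude vowel-group syllable counter and the three-or-more-syllables reading of Hard Words as my assumptions:

import re

def syllables(word):
    # crude approximation: count runs of vowels; real tools use a dictionary
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    n_sent = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_syll = sum(syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / n_sent) - 84.6 * (n_syll / len(words))

def gunning_fog(text):
    n_sent = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    hard = [w for w in words if syllables(w) >= 3]   # the contested threshold
    return 0.4 * (len(words) / n_sent + 100 * len(hard) / len(words))

sample = "Really. It is as I say. Don't you think? What say you?"
print(flesch_reading_ease(sample), gunning_fog(sample))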
I don’t know why I never thought about all that before writing the textbook and why none of the writing guidelines I have looked up over the years had mentioned it. The most I did for readability, especially when I was writing my PhD thesis, was the “read aloud test” that was proposed in one of those writing guidelines: read your text aloud, and if you can’t, then something is wrong with the sentence. I used the Acrobat built-in screen reader for that as a first pass. If the text-to-speech algorithm stumbled over it, then it was time to reconsider the phrasing. I would then read it aloud myself and decide whether the Acrobat algorithm had to be improved upon or my sentence had to be revised.
How does the ontology engineering textbook fare? Are my blog posts any more readable? How much worse are the scientific papers? Is it true that the English in science articles is a sort of pidgin English, whereas in other fields, notably the humanities, the erudition and wordsmithery shines through in the readability metrics scores? I have no good answers now, but it would be easy to compute with a fine dataset of texts and the Python py-readability-metrics module for some quick ’n dirty checks, or to adapt some other open source code for batch processing (e.g., from here, among multiple options). Maybe later; there are some other kinks to straighten first.
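For the record, the quick ’n dirty check would be only a few lines, going by that module’s documented interface (from memory, so verify against its README; it needs NLTK’s tokeniser and texts of at least 100 words):

import nltk
nltk.download("punkt")                   # tokeniser the module relies on

from readability import Readability     # pip install py-readability-metrics

text = open("some_chapter.txt").read()  # hypothetical input file
r = Readability(text)
print(r.flesch().score, r.gunning_fog().score)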
Notably, one can game the system based on some of those key parameters. Besides sentence length – around 20 words is fine, I was told a long while ago – there are the number of syllables of the words and the vocabulary that are taken into account. More monosyllabic words in shorter sentences with fewer types will come out as more easily readable; according to the metric, that is.
But ‘easier’ or ‘better’ lies in the eyes of the beholder: it may be such confetti so as to have become awful to read due to its lack of flow and coherence. Really. It is as I say. Don’t you think? It’s the way I see it. What say you? The “Really. … What say you?” sequence has a Flesch reading ease of 90.38 and a Gunning fog index of 1.44 as the number of years of formal education you would have needed to easily understand it. The “Notably, … and coherence” before it in this paragraph has a Flesch reading ease of 50.52 and a Gunning fog index of 13.82.
Based on random sampling from my textbook, at least one of the paragraphs (p34, ‘purposes’) got a Flesch reading ease of 9.29 and a Gunning Fog index of 22.73, while other parts are around 30 and some are even in the 50-70 region for reading ease.
With the illustration out of the way, let’s look at limitations. First, not all polysyllabic words are difficult and not all monosyllabic words are simple; e.g., the common, and therewith easy, ‘education’ and ‘interesting’ vs. the semi-obscure ‘nub’, ‘sloop’, ‘gry’, and ‘squick’ (more here). The longest monosyllabic words, such as ‘scraunched’ and ‘strengthed’, aren’t exactly easy to read either.
Plenty of other languages have predominantly polysyllabic words with lots of syllables, such as Dutch or German, where new words can be formed by putting existing ones together. The Dutch word meervoudigepersoonlijkheidsstoornis (‘multiple personality disorder’) puts meervoudige, persoonlijkheid, and stoornis together into one concept. Agglutinating languages, such as isiZulu, not only compose long words, but have so many meaningful pieces that a single word may well be a whole sentence in a disjunctive language. For instance, the 10-syllable word that one of my former students used to make the point: titukakimureeterahoganu ‘we have never ever brought it to him’. You get used to long words, and there’s no reason why English speakers would be inherently incapable of handling that. Intelligence does not depend on one’s mother tongue. Perhaps, if one is used to a disjunctive orthography, one may have become lazy. Any use of the aforementioned readability metrics for ‘non-English’ clearly will have to be revised to tailor them to the language.
Then there’s foreign-language background that interferes with reading ease. Many a supposedly ‘difficult’ word in English comes from French, Italian, Latin, or Greek; e.g., oxymoron (Gr), camaraderie (Fr), quotidian (It), and obfuscate (La). For instance, we use oxymoron in Dutch as well, so there’s no ‘difficulty’ to it for a Dutch person; or take maalstroom, which is pronounced nearly the same as ‘maelstrom’, and demagoog for ‘demagogue’ (also Greek origins, similar pronunciation), and algoritme for ‘algorithm’ (Persian origins, not an Anglicism), and recalcitrant is even spelled the same. The foreigner trying to speak or write English may not be erudite, but just winging it and hoping that the ‘copy and adapt’ works out. Conversely, supposedly ‘simpler’ words may not be: ‘wayward’ is a synonym for recalcitrant and, with only two syllables, it will make the readability score better. It would make the text less readable to at least Dutch, Spanish, and Italian readers who are trying to read English text, however, because there’s no connection with a familiar-looking word. About 80% of English words are borrowed from other languages.
Be that as it may, maybe I should reassess my textbook on the metric; maybe not. What does the algorithm know about computer science terminology anyhow? “Ontology Engineering is a specialisation in knowledge representation and reasoning.” has a Flesch reading ease of -31.73 and a Gunning fog index of 20.00; a tough game it would be to get that back to a reading ease of 50.
It did affect a number of sentences in my memoir book. I don’t expect Joe and Joanne Soap to be interested, but teenagers who are shopping around for a university degree programme might, and then professionals, students, and academics with a little spare time to relax and read, too. In other words: a reading ease of around 40-60. Some long sentences could indeed be split up without losing content, coherence, and flow.
There were others where the simplification didn’t feel like an improvement. For instance, compare “according to my opinion” with “the way I saw it”: the former flows smoothly, whereas the latter sounds like a nagging firing-off. The latter for sure improves the readability score, with all those monosyllabic words. The copy editor changed the former into the latter. It still bugs me. Why? After some further pondering, beyond just blaming the grating staccato of a sequence of monosyllabic words, perhaps it is because an opinion generally is (though need not be) formed after considering the facts and analysing them, whereas seeing something in some way may (but definitely need not) be based on facts and analysis. That is, on closer inspection, they’re not equivalent phrases, not at all. Nuances can be, and were, lost with shorter sentences and simpler words. One’s voice, too. So there’s that. Overall, though, I hope the balance leans toward more readable, to get the message across better to more readers.
Lastly, there seems to be plenty of scope for more research on readability metrics – ones that can be computed, that is. While there are several applications for other well-resourced languages, including easy web apps, such as for Spanish and German and even for Dutch, there are very many languages spoken around the globe that do not have such metrics and nice algorithms yet. But even the readability metrics for English could be tweaked, for instance, to tailor them to a genre or a discipline. Then it would be easier to determine whether a book is, say, an easy-reading popular science book for the holidays on the beach, or one that requires some or even a lot of effort. For computer science, one could take Gunning fog and adjust the Hard Words variable to exclude common jargon that is detrimental to the score, like ‘encapsulation’ and ‘representation’ (both 5 syllables); biochemistry would need that too, given the long names for chemical compounds. And one could add a penalty for too many successive monosyllabic words. There will be more options to tweak the formulae and test them, but such additional digging is something for another time.
As to my question in the introductory paragraph of this post, “What is it that makes some text readable?”: if you’ve made it all the way here reading this post, we’re all a bit wiser on readability, but a short and simple answer I still don’t have. It’s a long story with ifs and buts, and the last word is yet to be said about it.
As a bonus, here are a few hints to make something more readable, according to the readability calculator of the web-based editor tool of The Conversation:
Screenshot I took about halfway through working on an article for The Conversation.
p.s.: The ‘science of reading‘ adds more to it, to the point that you wonder how there even can be metrics. But their scope is broader.
p.p.s.: The first full draft of this post had a reading ease of 52.37 and a Gunning fog of 11.78, and the final one 54.37 and 11.18, respectively, which is fine by me. Length is probably more of an issue.
Since I published my second book, that memoir on a scenic route into computer science, several people have asked me “why?” and “what makes yours stand out from the crowd?”. The answer to the latter is easy: there is no crowd. (The brief answer to ‘why’ is mentioned in the Introduction chapter). Let me elaborate a little.
In the early stage of writing the book, I dutifully did do my market research to answer the typical starter questions like: What books in your genre or on your topic are already out there? How crowded is the field? Will your prospective book be just another one on that pile? Will it stand out as different? And if so, is that an interesting difference to at least some readership segment so that it will have potential to be sold beyond a close circle of friends and family? So, I searched and searched and searched, in late 2020 and again twice in 2021, and even now when writing this post. Memoirs by female computer scientists, by male computer scientists, whatever gender computer scientist in academia. Autobiographies as well then. I stretched the search criteria further, into the not-in-their-own-words biographies of computer science professors.
Collage made with the respective covers or first page of the memoir and autobiography books listed and linked here.
If you take your time searching for those books, you should be able to find the following four books and booklets of the memoir or autobiography variety, by computer science professors, on computing, computing milieux, or computer science:
James Morris’ memoir, which was published in the same week as mine was in late 2021. It covers his 60-year career in computer science and, according to the book’s tweet-size blurb, “is a search for intelligence across multiple facets of the human condition—religion and science, evolution, and innovation”.
The unpublished memoir by Ray Miller, on 50 years in computing (1953-1993), available online from the IEEE Computer Society as part of its computer history museum.
That’s all. Four retired (and some meanwhile deceased) computer science professors telling their tale, three of which cover only the early days of computing.
Collage made with the covers or first page of the quite related memoir and autobiography books listed and linked here.
There are a few very recent memoirs by professors, in print or announced to appear in print soon, on attendant topics, notably:
Cecilia Aragon’s “Flying Free”, which was published in 2020. It is about becoming the first Latina pilot on the US Aerobatic Team. She is also a professor in computer science.
What there are lots of, are books about, and occasionally by, ‘celebrity’ people in IT and computing who made it in industry these days, such as Bill Gates, Steve Jobs, Elon Musk, Satya Nadella, and Sheryl Sandberg, and famous people in computing history, such as Ada Lovelace, Grace Hopper, George Boole, and Alan Turing (also about, not by). And there are short and long memoirs about tech by journalists and writers and by engineers and programmers who write, such as on Linux in Australia (here) or 10 years in Silicon Valley (here). There are also a few professional memoir essays and articles by computer science professors, such as about the development of the network time protocol by David Mills (here).
The people ‘out there’ – outside of the ivory tower of academia – do have lots of assumptions about computer science professors. When I mention to them that, yes, I’m one of those, at UCT even, a not uncommon reaction is an involuntary reflex of apprehension. The eyes move to a corner of the eye socket, the head turns a little and moves back, and the upper body follows, even if only slightly. I notice. But what do you really know about us? Nothing, really.
Even among academics in computer science, we have only sketchy information about our colleagues’ respective backgrounds. Yes, there are the privileged ones, who had early access to computers, tinkered with them in their spare time, got their pizza delivered, participated in programming contests, and so on. But there are others who made it. Who escaped persecution in Eastern Europe during the Cold War and had to find their way in a different country, whose first interaction with a computer was only at university, or who grew up in some hamlet with limited electricity and potable water. Who came from a broken home, or who had to leave family and friends to get that elusive job in the scarce academic job market many kilometres away, or whose relationships stranded due to the two-body problem (a partner who is also an academic, but in a different city or country). Who made it against the odds. And there are those who defected from physics, or who took a stroll out of philosophy to never return, or who still flip-flop with chemistry, to name but a few, and who thus have at least two specialisations under their belt. Those who know about more stuff than just computing.
That’s just about an academic’s background. What do you know of our daily activities? Nothing really, either. Assumptions abound; there are about as many memes and jokes about our jobs as there are assumptions. And the movies, TV series, and fiction novels aren’t necessarily depicting it accurately either.
But us, in our own words? The memoir and autobiography books literally can be counted on one hand. I can assure you it’s not because we have no life and nothing to say. We do. For instance, it takes about 10-30 years before the theories and techniques we investigate have matured enough to seep into wider society. Impactful, cool, and fun things happen along the way. Those ‘infoboxes’ from Google when it returns the search results? The theory and techniques behind them date back to the late 1990s, with ontologies, and I was a part of that. Toy drones? There was one to play with at the European Conference on Artificial Intelligence 2006 (ECAI’06) that I attended, when the first small toy drones needed to be equipped with ‘intelligent’ processing of sensor data. The drone demo area was suitably demarcated with red-and-white tape, for neither the engineers nor the organisers, nor we as attendees, were convinced it was safe to let it fly around without causing trouble.
The demo session at ECAI’06 also had a crossword puzzle contest with WebCrow: researchers against an algorithm that trawled the Web for answers. The 25 of us onsite participants – perhaps the first ever to participate in such a contest – sat on uncomfortable plastic chairs, cinema style, in a section of a large hall at the conference venue in Riva del Garda, Italy. Onlookers marveled that the event really took place, unsure about which horse to bet on. The algorithm won, but we had fun. Last year’s news that an algorithmic solver beat expert human puzzlers seems a bit late and old news, then. I can very well imagine what those human participants must have felt.
Maybe you don’t care about computer science professors or about early days of new theories and techniques and how they came about. We all have our interests and time is limited. That’s fine; I don’t read all books either. But, if you were to ever wonder about the human in the computer science academic, there are, for now, those four books listed above, mine, and the other three books that are quite nearby in scope. Happy reading!
At some point in time, this COVID-19 pandemic will be over. Each time that thought crossed my mind, there was that little homunculus in my head whispering: but do you know the criteria for when it can be declared ‘over’? I tried to push that idea away by deferring it to a ‘whenever the WHO says it’s over’, but the thought kept nagging. Surely there would be a clear set of criteria lying on the shelf awaiting to be ticked off? Now, with the omicron peak well past us here in South Africa, and with comparatively little harm done in that fourth wave, there’s more talk publicly of perhaps having that end in sight – and thus also needing to know what the decisive factors are for calling it an end.
Then there are the anti-vaxxers. I know a few of them as well. One raged on with the argument that ‘they’ (the baddies in the governments in multiple countries) count the death toll entirely unfairly: “flu deaths count per season in a year, but for covid they keep adding up to the same counter from 2020 to make the death toll look much worse!! Trying to exaggerate the severity!” My response? Duh, well, yes they do count from early 2020, because a pandemic is one event and you count per event! Since the COVID-19 pandemic is a pandemic that is an event, we count from the start until the end – whenever that end is. It hadn’t even crossed my mind that someone wouldn’t count per event but, rather, wanted to chop up an event to pretend it would be smaller than it actually is.
So I did a little digging after all. What is the definition of a pandemic? What are its characteristics? Ontologically, what is that notion of ‘pandemic’, be it according to the analytic philosophers, ontologists, or modellers, or how it may be aligned to some of the foundational ontologies used in ontology engineering? From that, we then should be able to determine when all this COVID-19 has become an ‘is not a pandemic’ (whatever it may be classified into after the pandemic is over).
I could not find any works from the philosophers and theory-focussed ontologists that would have done the work for me already. (If there is one and I missed it, please let me know.) Then, to start: what about definitions? There are some, like the recently updated one from dictionary.com, where they tried to explain it from a language perspective, and there is lots of debate and misunderstanding about defining and describing a pandemic [1]. The WHO has descriptions, but not a clear definition, and pandemic phases. Formulations of definitions elsewhere vary slightly as well, except for the lowest common denominator: it’s a large epidemic.
Ontologically, that is an entirely unsatisfying answer. What is ‘large’? Some, like the CDC of the USA, qualified it somewhat: it’s spread over the world, or at least multiple regions and continents, and in those areas it usually affects many people. The Australian Department of Health adds ‘new disease’ to it. Now we’re starting to get somewhere, with the inclusion of key properties of a pandemic. Kelly [2] adds another criterion to it, albeit focussed on influenza: besides a worldwide/very wide area and affecting a large number of people, “almost simultaneous transmission takes place worldwide”, and thus, for a part of the world, there is out-of-season influenza virus transmission.
Image credits: Miroslava Chrienova, taken from this page.
The best resource of all from an ontologist’s perspective is a very clear, well-written perspective article by Morens, Folkers and Fauci – yes, that Fauci, of the NIAID – in The Journal of Infectious Diseases, which, in its lack of wisdom, keeps the article paywalled (it somehow made it onto the web archive with free access here anyhow). They’re experts, and they trawled the literature to, if not define a pandemic, then at least describe it by trying to list its characteristics and the merits, or demerits, thereof. These are, in short, and with my annotation on what sort of attribute (/feature/characteristic, as a loosely used term for now) each one is:
Wide geographic extension; as aforementioned. That’s a scale or ‘fuzzy’ (imprecise in some way) feature, i.e., without a crisp cut-off point when ‘wide’ starts or ends.
Disease movement, i.e., there’s some transmission going on from place to place and that can be traced. That’s a yes/no characteristic.
High attack rates and explosiveness, i.e., lots of people affected in a short timespan. There’s no clear cut-off point on how fast the disease has to spread for counting as ‘fast spreading’, so a scale or fuzzy feature.
Minimal population immunity; while immunity is a “relative concept” (i.e., you have it to a degree), it’s a clear notion for a population when that exists or not; e.g., it certainly wasn’t there when SARS-CoV-2 started spreading. It is agnostic about how that population immunity is obtained. This may sound like a yes/no feature, perhaps, but is fuzzy, because practically we may not know and there’s for sure a grey area thanks to possible cross-immunity (natural or vaccine-induced) and due to the extent of immune-evasion of the infectious agent.
Novelty; the term speaks for itself, and clearly is a yes/no feature as well. It seems to me like ‘novel’ implies ‘minimal population immunity’, but that may not be the case.
Infectiousness; it’s got to be infectious, and so excluding non-infectious things, like obesity and smoking. Clear yes/no.
Contagiousness; this may be from person to person or through some other medium (like water for cholera). Perhaps as an attribute with categorical values; e.g., human-to-human, human-animal intermediary (e.g., fleas, rats), and human-environment (notably: water).
Severity; while the authors note that it’s not typically included, historically, the term ‘pandemic’ has been applied more often for diseases that are severe or with high fatality rates (e.g., HIV/AIDS) than for milder ones. Fuzzy concept for which a scale could be used.
And, at the end of their conclusions, “In summary, simply defining a pandemic as a large epidemic may make ultimate sense in terms of comprehensibility and consistency. We also suggest that use of the term is best reserved for infectious diseases that share many of the same epidemiologic features discussed above” (p1020), largely for simplifying it to the public, but where scientists and public health officials would maintain their more precise consensus understanding of the complex scientific concept.
Those imprecise/fuzzy properties and the lack of clarity on cut-off points bug the epidemiologists, because they lead to different outcomes of their prediction models. From my ontologist viewpoint, however, we’re getting somewhere with these properties: SARS-CoV-2, at least early in 2020 when the pandemic was declared, ticked all those eight boxes, and so any reasoner would classify the disease it causes, COVID-19, as a pandemic. Now, in early 2022, with/after the omicron variant of concern? Of those eight properties, numbers 4 and 8 much less so, and number 5 is the million-dollar question two years into the pandemic. Either way, considering all those properties of a pandemic that have passed the revue here so far, calling an end to the pandemic is not as trivial as it initially may have sounded. WHO’s “post pandemic period” phase refers to “levels seen for seasonal influenza in most countries with adequate surveillance”, which at least is a clear specification operationally.
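To make that non-triviality concrete, here’s a toy sketch of those eight checks in Python – my own illustration, where every threshold on the fuzzy features is an invented placeholder, i.e., precisely the numbers on which there is no consensus:

# the Morens-Folkers-Fauci characteristics, with made-up values and cut-offs
features = {
    "geographic_extension": 0.9,        # fuzzy, 0..1
    "disease_movement": True,           # yes/no
    "attack_rate_explosiveness": 0.8,   # fuzzy
    "population_immunity": 0.1,         # fuzzy; low means 'minimal immunity'
    "novelty": True,                    # yes/no
    "infectiousness": True,             # yes/no
    "contagiousness": True,             # yes/no
    "severity": 0.7,                    # fuzzy
}

def is_pandemic(f):
    crisp = (f["disease_movement"] and f["novelty"]
             and f["infectiousness"] and f["contagiousness"])
    fuzzy = (f["geographic_extension"] > 0.8       # placeholder cut-offs
             and f["attack_rate_explosiveness"] > 0.5
             and f["population_immunity"] < 0.3
             and f["severity"] > 0.5)
    return crisp and fuzzy

print(is_pandemic(features))  # True for early 2020; shift the values for early 2022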
Ontologically, if we were to take these eight properties at face value, the next question then is: are all eight of them combined the necessary and sufficient conditions, or are some of them ‘more essential’ for calling it a pandemic, and the other ones would then be optional features? Etymologically, the pan in pandemic means ‘all’, so then as long as it rages across the world, it would remain a pandemic?
Now that things get ontologically more interesting: the ontological status. Informally, an epidemic is an occurrence (read: instance/individual entity) of an infectious disease at a particular time (read: an unspecified duration of time, not an instant) that affects some community (be that a community of humans, chickens, or whatever other organisms that live in a community), and a pandemic, as a minimum, extends the region that it affects and the number of organisms infected, and then adds some of those other features listed above.
A pandemic is in the same subject domain as an infectious disease, and so we can consult the OBO Foundry and see what they did, or first start with just the main BFO categories for a general sense of what it would align to. With our BFO Classifier, I get as far as process:
As to the last (optional) question: could one argue that a pandemic is a collection of disjoint part-processes? Not if the part-processes all have to be instances of different types of processes. The other loose end is that BFO’s processes need not have an end, but pandemics do. For now, what’s the most relevant is that the pandemic is distinctly in the occurrent branch of BFO, and occurrents have temporal parts.
Digging further into the OBO Foundry, they indeed did quite some work on infectious diseases and COVID-19 already [4], and following the trail from their Figure 1 (see below): disposition is a realizable entity is a specifically dependent continuant is a continuant; infectious disease course is a disease course is a process is an occurrent; and “realizable entity comes to be realized in the course of the process”.
Source: Figure 1 of [4].
In that approach, COVID-19 is the infectious disease being realised in the pandemic we’re in at the moment, with multiple infectious disease courses in humans and a few other animals. But where does that leave us with pandemic? Inspecting the Infectious Disease Ontology (IDO) since the article does not give a definition, infectious disease epidemic and infectious disease pandemic are siblings of infectious disease course, where disease course is described as “Totality of all processes through which a given disease instance is realized.” (presumably the totality of all processes in one human where there’s an instance of, say, COVID-19). Infectious disease pandemic is an atomic class with no properties or formal definitions, but there’s an annotation with a definition. Nice try; won’t work.
What’s the problem? There are three. The first, and key, problem is that pandemic is stated to be a collection of epidemics, but i) collections of individual things (collectives, aggregates) are categorically different kinds of entities from individual things, and ii) epidemic and pandemic are not categorically different things. Not just that: there’s a fiat boundary (along a continuum, really) between an epidemic evolving into becoming a pandemic and then subsiding into separate epidemics. A comparatively minor, or at least secondary, issue is how to determine the boundary of one epidemic from another so as to be able to construct a collective, since, more fundamentally: what are the respective identities of those co-occurring epidemics? One can’t get collections of things we can’t quite identify. For instance, is it one epidemic in two places that it jumped to, or do they count as two then, and what about when two separate ones touch and presumably merge to become one large one? The third issue, also minor for the current scope, is the definition for epidemic in the ontology’s annotation field, which talks of a “statistically significant increase in the infectious disease incidence” as determiner, whereas actually it’s based on a threshold.
Let’s try DOLCE as the foundational ontology and see what we get there. With the DOLCE Decision Diagram [5], pandemic ends up as: Is [pandemic] something that is happening or occurring? Yes (perdurant – akin to BFO’s occurrent). Are you able to be present or participate in [a pandemic]? Yes (event). Is [a pandemic] atomic, i.e., has no subdivisions of it and has a definite end point? No (accomplishment). Not the greatest word choice to say that a pandemic is an accomplishment – almost right up there with the DOLCE developers’ example that death is an achievement – but it sure is an accomplishment from the perspective of the infectious agent. The nice thing of dolce:accomplishment over bfo:process is that it entails there’s a limited duration to it (DOLCE also has process, which can go on and on and on).
The last question in both decision diagrams made me pause. The instances of COVID-19 going around could possibly be going around after the pandemic is over, uninterrupted in the sense that there is no time interval where no-one is infected with SARS-CoV-2, or it could be interrupted with later flare-ups if it’s still SARS-CoV-2 and not substantially different, but the latter is a grey area (is it a flare-up or a COVID-2xxx?). The latter is not our problem now. The former would not be in contradiction with pandemic as accomplishment, because COVID-19-the-pandemic and COVID-19-the-disease are two different things. (How those two relate can be a separate story.)
To recap, we have pandemic as an occurrent/perdurant entity unfolding in time and, depending on one’s foundational ontology, something along the line of accomplishment. For an epidemic to be classified as a pandemic, there are a varying number of features that aren’t all crisp and for which the fuzzy boundaries haven’t been set.
To sketch this diagrammatically (hence, informally), it would look something like this:
where the clocks and the DEX and DEV arrows are borrowed from the TREND temporal conceptual data modelling language [6]: Epidemic and Pandemic are temporal entities, DEX (+dashed arrow) verbalised is “An epidemic may also become a pandemic” and DEV (+solid arrow): “Each pandemic must evolve to epidemic ceasing to be a pandemic” (hiding the logic at the back-end).
It isn’t a full answer as to what a pandemic is ontologically – hence, the title of the blog post still has that question mark – but we can already clear up the two issues from the introduction of this post, as follows.
Consequences
We already saw that, with any definition, description, or list of properties proposed, there is no unambiguous and certain definite endpoint to a pandemic that can be deterministically computed. Well, other than the extremes of either 100% population immunity or extinction of the affected species, such that there is not a single instance of a disease course (in casu, of COVID-19) either way. Several measured values on the scales for the fuzzy variables will go down, and immunity will increase (further) as the pandemic unfolds, and then the pandemic phase is over eventually. Since there are no thresholds defined, there likely will be people who are forever disagreeing on when it can be called over. That is inherent in the current state of defining what a pandemic is. Perhaps it now also makes you appreciate the somewhat weak operational statement of the WHO post-pandemic period phase – specifying anything better is fraught with difficulties to date and unlikely to ever make everybody happy.
There’s that flawed argument of the anti-vaxxer to deal with still. Flu epidemics last about 10 weeks, on average [7]. They happen in the winter, and in the northern hemisphere they may cross a New Year (although I can’t remember that ever happening in all the years I lived in Europe). And yet, those too are counted per epidemic and not per calendar year. School years run from September to July, which provides a different sort of year, and the flu epidemics there are typically reported as ‘flu season 2014/2015’, indicating just that. Because those epidemics are short-lived, you typically get only one of them in a year, and in-season only.
Contrast this with COVID-19: it’s been going round and round and round since late December 2019, with waves and lulls for all countries, regions, and continents, but never did it stop for a season in whole regions or continents. Most countries come close to a stop during a lull at some point between the waves; for South Africa, according to worldometers, the lowest 7-day moving average since the first wave in 2020 was 265 recorded infections per day, on 7 November 2021. Any out-of-season waves? Oh yes – beta came along in summer last year and it was awful; at least for this year’s summer we got a relatively harmless omicron. And it’s not just South Africa that has been having out-of-season spikes. Point is, the COVID-19 pandemic ‘accomplishment’ wasn’t over within the year – neither a calendar year nor a northern hemisphere school year – and so we keep counting with the same counter for as long as the event takes until the pandemic as event is over. There’s no nefarious plot of evil controlling scaremongering governments, just a ‘demic that takes a while longer than we’ve been used to until 2019.
In closing, it is, perhaps, not the last word on the ontological status of pandemic, but I hope the walkthrough provided a little bit of clarity in the meantime already.
[5] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. 2013.
A photo of the city where it was supposed to take place: Leiden (NL).
It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, and there are more technologies and standards to choose from, each promising that this time it will be solved once and for all, while practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.
Keynotes
The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool [1], a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help find information. Notably, it takes open questions, analyses the sentence by PICO (population, intervention, comparator, outcome) or part thereof, and uses UMLS and MetaMap to help find more connections. Unlike in a well-established domain, where well-known terminology can be used to formulate very specific queries over the academic literature, that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.
The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease map that culminated from a lot of effort by many research communities with lots of online data resources [2]; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between information in those diagrams and the databases [3]. It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically [4]. There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.
Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19 is not at all clear in Africa, but ideally it would be. Let me illustrate this: the worldwide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as many; that only about 10% of the COVID-19-positives are detected by the diagnostic tests—the rest doesn’t get tested (asymptomatic, the hassle, cost, etc.); and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with the algorithms and outcomes etc. back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.
The last keynote, at the end of the conference, was by Amit Sheth, of the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with asthma as one of the cases. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on—you name it, somewhere it’s used in the systems.
Papers
Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’, and there’s no access to the papers yet to look up more details to check and pad my notes. I’m assuming the papers will be accessible online soon (CEUR-WS again, presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.
Ruduan Plug elaborated on the FAIR & GDPR compliance and querying over integrated data within that above-mentioned VODAN-Africa project [5]. He also noted that South Africa’s PoPIA is stricter than the GDPR. I suspect that is due to the cross-border restrictions on the flow of data that the GDPR doesn’t have. (PoPIA is based on the GDPR principles, btw.)
Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid [6], which I have to leave here with a cliffhanger: I didn’t manage to make a note of which one won, because I had to go to a meeting that was already starting later on account of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than is available to date).
Besides papers related to scientific research into all things covid, another recurring topic was FAIR data—whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features of FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing FAIR levels makes more sense before-and-after changing a single resource than when pitting different vocabularies against each other, that “FAIR enough” can be enough (cf. demanding 100% compliance) [7], and that a FAIR vocabulary does not imply that it is also a good-quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis of the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of ontology engineering activities, that there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO [8]. What to do next with these insights remains to be seen.
Last, but not least—to try to keep the blog post at a just about readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL ontologies in BioPortal on whether, and if so how, they take the notion of multilingualism into account. TL;DR: they barely do [9]. Even when they do, it’s just with labels, rather than with any of the language models, be it the ontolex-lemon from the W3C community group or another, and then mainly in French and German.
Source: [9]
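For a flavour of the kind of check involved, here’s a minimal sketch that counts language tags on labels in a single OWL file with rdflib; the file name is hypothetical, and the actual survey in [9] was considerably more thorough than label tags alone:

```python
# Minimal sketch: tally language tags on labels in one OWL file with rdflib.
from collections import Counter
from rdflib import Graph
from rdflib.namespace import RDFS, SKOS

g = Graph()
g.parse("some_bioportal_ontology.owl")  # hypothetical local copy

langs = Counter()
for prop in (RDFS.label, SKOS.prefLabel):
    for _, _, label in g.triples((None, prop, None)):
        # Literals carry an optional language tag; anything else counts as untagged.
        langs[getattr(label, "language", None) or "untagged"] += 1

print(langs.most_common())  # e.g., [('untagged', 412), ('en', 37), ('fr', 5)]
```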
Does it matter? It depends on what your aims are. Our main motivation was ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally would also happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just the health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that the related bio-ontologies could also do with coordinated multilingualism, so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.
Other
There were also posters, where we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantics-driven tools, from sharing data, to searching data, to integrating data, to the analysis for developing the theory that figures out all its workings.
The conference was supposed to be mainly in person, but then on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events effective until, would you believe, 14 Jan—one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee for making it work under the difficult circumstances!
[5] Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[6] Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe: MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[7] Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot: Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
[8] César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos: The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
Some time last year, a colleague asked about good examples of popular science books, in order to read them and thereby get inspiration on how to write a book at that level, or at least for first-year university students. I’ve read (and briefly reviewed) ‘quite a few’ across multiple disciplines and proposed a few that I had enjoyed reading. One aspect that bubbled up at the time is that not all popsci books are of the same quality and, zooming in on this post’s topic: not all popsci books are of the same level, or, likely, have the same target audience.
I’d say they range from targeting advanced interested laypersons to entertaining laypersons. The former entails that you’d be better off having covered the topic at school, that an undergrad course or two will help as well to make it an enjoyable read, and that you should be fully awake, not tired, when reading it. For the latter category at the other end of the spectrum: having completed little more than primary school will do fine, no prior subject domain knowledge is required at all, and it’s good material for the beach; brain candy.
Either way you’ll learn something from any popsci book, even if it’s too little for the time spent reading the book or too much to remember it all. But some of them are much more dense than others. Compare cramming the essence of a few scientific papers in a book’s page to drawing out one scientific paper into a whole chapter. Then there’s humor—or the lack thereof—and lighthearted anecdotes (or not) to spice up the content to a greater or lesser extent. The author writing about fungi recounting eating magic mushrooms, say, or an economist being just as much of a sucker for summer sales in the shops as just about anyone. And, of course, there’s readability (more about that shortly in another post).
Putting all that in the mix, my groupings are as follows, with a selection of positive exemplars that I also enjoyed reading.
Advanced interested layperson level: Predictably irrational by Dan Ariely (behavioural economics) and Entangled life by Merlin Sheldrake (microbiology, on fungi).
There are more popsci books that I thought were interesting to read, but I didn’t want to turn this into a laundry list. Also, books on politics, society, philosophy, and such seem to deserve their own discussion on categorisation, but that’s for another time. I also intentionally excluded computer science, information systems, and IT books, because I may be biased differently toward those compared to the out-of-my-own-current-specialisation books listed above. For instance, Dataclysm by Christian Rudder, on data science mainly with OKCupid data (reviewed earlier), was of the ‘entertainment’ level to me, but probably isn’t so for the general audience.
Perhaps it is also of use to contrast them with ‘bad’ examples—well, not bad, but books that I think did not succeed well in their aim. Two of them are Critical mass by Philip Ball (physics, social networks), because it was too wordy, drawn out, and dull, and This is your brain on music by Daniel Levitin (neuroscience, music), which was really interesting, but very, very dense. Looking up their scores on goodreads, those readers converge to that view for your brain on music (still a good 3.87 out of 5, from nearly 60000 ratings and well over 1500 reviews), as well as for the critical mass one (3.88 from some 1300 ratings and about 100 reviews). Compare that to a 4.39 for the award-winning Entangled life, 4.35 for Why we sleep, and 4.18 for Mama’s last hug. To be fair, not all books listed above have a rating above 4.
Be this as it may, I still recommend all of those listed in the four categories, and hopefully the sort of rough categorisation I added will assist in choosing a book among the very many vying for your attention and time.
Pushing the envelope categorising popsci books
Regarding book categories more generally, romance novels have subgenres, as does science fiction, so why not non-fiction popsci books? Currently, they’re mostly either just listed (e.g., here or the new releases) or grouped by discipline, but not according to, say, their level of difficulty, humor, whether they mix science with politics, self-help, or philosophy, or some other quality dimension along which they could possibly be assessed.
As an example that the latter might work for assigning attributes to the books: Why we sleep is 100% science, but a reader can distill some ideas to practice with as self-help for sleeping better, whereas When: the scientific secrets of perfect timing is, contrary to what the title suggests, largely just self-help. Delusions of gender and Inside rebellion can, or, rather, should, have some policy implications, and Why we sleep possibly as well (even if only to make school not start so early in the morning), whereas the sort of content of Elephants on acid already did (ethics review boards for scientific experiments, notably). And if you were not convinced of the existence of animal cognition, then Mama’s last hug may induce some philosophical reflection, and then have a knock-on effect on policies. Then there are some books that I can’t see having either a direct or indirect effect on policy, such as Gastrophysics and Entangled life.
Let’s play a little more with that idea. What about vignettes composed of something like the following, shown in the table below?
Then a small section of the back cover of Entangled life would look like this, with the note that the humor is probably in between the ‘yes’ and ‘some’ (I laughed harder with the book on drunkenness).
Mama’s last hug would then have something like:
And Why we sleep as follows (though I can’t recall for sure now whether it was ‘some’ or ‘no laughing matter’, and a friend has borrowed the book):
A real-life example of a categorisation box on a product; coffee suitable for moka pots, according to House of Coffees.
Of course, these are just mock-ups to demonstrate the idea visually and to try out whether it is even doable to classify the books this way. It is. There may very well be better icons than these scruffy ‘take a cc or public domain one and fiddle with it in MS Paint’ ones, or a mixed-mode approach, like on the packs of coffee (see image on the right).
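For what it’s worth, the vignette idea also translates directly into a structured record; a throwaway sketch, where the dimensions and values are the mock-up ones from above rather than anything validated:

```python
# Throwaway sketch of the vignette idea as structured attributes;
# dimensions and values are the illustrative mock-up ones, nothing validated.
from dataclasses import dataclass

@dataclass
class PopSciVignette:
    title: str
    level: str          # 'advanced interested layperson' .. 'entertainment'
    humor: str          # 'yes' / 'some' / 'no laughing matter'
    policy_impact: str  # 'direct' / 'indirect' / 'none'

books = [
    PopSciVignette("Entangled life", "advanced interested layperson", "some", "none"),
    PopSciVignette("Why we sleep", "advanced interested layperson", "some", "indirect"),
]
print([b.title for b in books if b.policy_impact != "none"])  # ['Why we sleep']
```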
Moreover: would you have created the same categorisation for the three examples? What (other) properties of popular science books could be useful? Also, and perhaps before going down that route: would something like this be useful to you or to someone you know who reads popular science books? You may leave your comments below, on my facebook page, write an email, or we can meet in person some day.
p.s.: this is not a serious post on the ontology of popular science books — it is summer vacation time here and I used to write book reviews in the first week of the year and this is sort of related.
Fifteen years is a long time in IT, yet blogging software is still around and working—the same WordPress I started my blog with, even. At the time, in 2006, when WordPress was still only offering blogging functionality, it had the air of being respectable and at least somewhat serious compared to blogspot (which redirects to Blogger now), which hosted a larger share of the informal and whimsical blogs. Blogs are not nearly as popular now as they used to be; there seems to be a move to huddle together and take a ride on a branded bandwagon, like Medium and Substack, and all of the blog-providing companies have diversified the services they offer. WordPress now markets itself as website-builder software rather than blogging software.
One might even be tempted to argue that blogs are (nearly) obsolete, with TikTok and the like having come along over the years. Not so, claims a blogger here, and some 10 more bloggers here; they are even a necessity according to another, who does provide a list of links to data to back it up. (Just maybe don’t try making a living from it—there are plenty of people who like to read, but writing doesn’t pay well.)
Some data for this blog, then. It has 325 published posts, there are around 400-600 visitors per month in recent years (depending on the season and posting frequency), there are people still signed up to receive updates (78), some even like some of the posts, and some posts are shared on Twitter and other social media. The most visited post of all time got over 21000 visits and counting (since 2011), and the most visited post in the past year (after the home page) still had a fine 355 visitors and is on my research and teaching topic (see also the occasionally updated vox populi). So, obsolete it is not. Admitted, the latter post had its heyday in 2010-2012, with about 2500 visits/year, and the former saw its best of times in 2014-2015 (4425 and 4948 visits in those years, respectively). The best-visited post of the mere 10 posts I wrote in 2021 is on bias in ontologies, having attracted the attention of 119 visitors. Summarising this blog’s stats trends: the numbers are down compared to 5-10 years ago, indeed, but insignificant it is not, and multiple posts have staying power.
Heatmap of monthly views to this blog over time.
I also can reveal that there’s no clear correlation between the time-to-write and number-of-visits variables, nor between either of them and the post’s topic, nor with post length. With more time, there would have been more, and more polished, posts. There’s plenty to write about: not only the long-overdue posts for published papers that came out at an extra-busy time and therefore slipped through, but also other interesting research that’s going on and deserves that extra bit of attention, some more book reviews, teaching updates, and so on. There’s no shortage of topics to write about, which therewith turned out to be an unfounded worry from 15 years ago.
Will I go on for another 15 years? Perhaps, perhaps not. I’m still fence-sitting, from the very first post in 2006 that summed up the reasons for starting a blog to this day, to give it a try nonetheless and see when and where it will end.
Why still fence-sitting? I still don’t know whether it’s beneficial or harmful to one’s career and, if beneficial, whether the time put into writing those posts could have been spent better on alternative activities with more benefit. What I do know is that, among others, it has helped me learn to write better; it made me take notes during conferences in order to write conference reports and therewith engage more productively with a conference, structure ideas and thoughts, and pitch papers. Also, the background searches for fact-checking, adding links, and trying to find pictures made me stumble into interesting detours. Some of the posts took a long time to write, but at least they were enjoyable pastimes or worktimes.
Uhm, so, the benefit is to (just?) me? I do hope the posts have been worthwhile to the readers. But it brings into vision the question that’s well-known to aspiring writers: should I write for myself or for my readers? The answer depends on whom you consult: blog for yourself, says the blogger from paradise; write for another, imaginary reader persona, says the novelist; and go for bothsideism for the best results, according to the writer’s guide. I write for myself and brush it up in an attempt to increase a post’s appeal. The brushing up mainly concerns the choice and ordering of words, phrases, and paragraphs, and the images to brighten up some of the otherwise text-only posts (like this one).
After so many years and posts, I ought to be able to say something more profound. It’s really just that, though: the joy of writing the posts, the hope it makes a difference to readers and to what I’ve written about, and the slight worry it may not be the best thing to do for advancing my career.
Be this as it may, over the past few days I’ve added a bit more structure to the blog to assist readers in finding the topics they may be interested in. The key categories are now also accessible from the ‘Menu’: work-related topics (research and papers, software, and teaching), posts on writing and publishing, and the few posts that belong to neither, which can still be found on the complete list of posts. Happy reading!
p.s.: in case you wondered: yes, I had intended to do a reflection when the blog turned a nice round 15 in late March, were it not for that blurry extension of 2020 and lots of extra teaching and teaching admin duties in 2021. The summer break has started now and there’s not much of a chance to properly go on holiday, and writing also counts as a leisure activity, so there the opportunity was, just about three months shy of the blog turning 16. (In case the post’s title vaguely rings a bell: yes, there’s that cheesy song from one of the top-5 movie musicals of all time [according to imdb], depicting a happy moment with promise of staying together before Rolfe makes some more bad decisions, but that’s 16 going on 17.)
How to align your domain ontology to a foundational ontology? It’s a well-known question, and one that I’ve looked into before as well. In some of that earlier work, we used DOLCE as the ontology to align to. We devised the DOLCE decision diagram as part of the FORZA method to assist with the alignment process and implemented it in the MoKI ontology development tool [1]. MoKI is no more, but the theory and the algorithm’s design approach still stand. Instead of re-implementing it as a Protégé plugin and having it go defunct in a few years again (due to incompatible version upgrades, say), it sounded like more fun to design one for BFO and make a stand-alone tool out of it. And that design, and the evaluation thereof, is precisely what two of my ontology engineering course students—Chiadika Emeruem and Steve Wang—did for their mini-project of the course. That was afterward finalised and implemented in a tool for general use as part of the DOT4D project extension for my (award-winning) OE textbook.
More precisely, as a first part, there’s a diagram specifically for BFO – well, for one of its 2.0-ish versions in existence, at least. Deciding on which version to use and what would be good questions was not as trivial as it may sound. While the questions seem to work (as evaluated with several ontologies), it might still be of use to set up an experiment to assess usability from a modeller’s viewpoint.
BFO ‘decision diagram’ to assist trying to align one’s class of a domain or core ontology to BFO (click to enlarge, or navigate to the user guide at https://bfo-classifier.github.io/)
Be this as it may, the decision diagram was incorporated into a tool that wraps around it, with a nice interface offering user guidance and feedback, and with the option to load an ontology and save the alignment into the ontology (along with BFO). The decision tree itself is stored as a separate XML file so that it can easily be replaced with any update thereto, be it to reflect changes in question formulation or to adjust it to some later version of BFO. The stand-alone tool is a jar file that can be downloaded from the GitHub repo, and the repo also has the source code that may be used or adapted (i.e., it has an open source licence). There’s also a user guide with explanations and screenshots. Here’s another screenshot of the tool in action:
Example of the BFO classifier in use, trying to align CODO’s ‘Disease’ to BFO, the trail of questions answered to get to ‘Disposition’, and the subsumption axiom that can be added to the ontology.
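As for what the saved alignment boils down to: the effect on the ontology is essentially one added subsumption axiom. Here’s a small sketch of that effect with rdflib, where the CODO IRI is an assumption on my part and the actual jar tool may well go about it differently:

```python
# Sketch of the final step only: adding the resulting subsumption axiom
# (CODO's Disease as subclass of bfo:Disposition) with rdflib.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("codo.owl")  # hypothetical local copy of the CODO ontology

disease = URIRef("http://www.isibang.ac.in/ns/codo#Disease")        # assumed IRI
disposition = URIRef("http://purl.obolibrary.org/obo/BFO_0000016")  # bfo:Disposition

g.add((disease, RDFS.subClassOf, disposition))
g.serialize(destination="codo-aligned.owl", format="xml")
```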
If you have any questions, please feel free to contact either of us.
References
[1] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.