# Bias in ontologies?

Bias in models in the area of Machine Learning and Deep Learning is well known. It features in the news regularly with catchy headlines and there are longer, more in-depth reports as well, such as Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs [1], one attempt toward a framework but looking at FOAF only [2], which is stretching it a little for what passes as an ontology, and then stretching it even further, there’s an old one of mine on bias in relation to conceptual data models for databases [3].

We simply don’t have bias in ontologies? That sounds a bit optimistic since it’s pervasive elsewhere, and it is at least worth examining whether there is such a notion as bias in ontologies and, if so, what its sources may be. And, if one wants to dig deeper, this being Ontology: what is bias anyhow? The popular media is much more liberal in the use of the term ‘bias’ than scientific literature and I’m not going to answer that last question here now. What I did do is try to identify sources of bias in the context of ontologies, and I took a relevant selection of Dimara et al.’s list of 154 biases [4] (just like only a subset is relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.

The outcome of that exploratory analysis [5], in short, is: yes, there is such a notion as bias in ontologies as well. First, I’ve identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined the three COVID-19 ontologies (CIDO, CODO, COVoc) for possible bias, and they indeed exhibited different subsets.

The sources can be philosophical, by purpose (commonly known as encoding bias), or a ‘subject domain’ source, such as scientific theory, granularity, linguistic, social-cultural, political or religious, and economic motivations, and they may be explicit choices or implicit.

An example of an economic motivation is to (try to) categorise some disorder as a type of disease: the latter gets more resources for medicines, research, and treatments, and is more costly for insurers, who’d rather keep it out of the terminology altogether. Or modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see paper for details). Terrorism ontologies can provide ample material for political views to creep in.

Besides the hand-picked examples, I did assess the three COVID-19 ontologies in more detail. Not because I wanted to pick on them (I actually think it’s laudable they tried in trying times) but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked at the sources, which can be argued to be present, and also identified some biases from a selection of Dimara et al.’s list, such as the “mere exposure/familiarity” bias and “false consensus” bias (see table below). How they are present is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), which is part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.

Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much, or, at least, any inconsistencies that it would have caused should have been detected and discussed during the ontology development stage, rather.

Will it matter for, say, annotating data or literature etc? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not female (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” they want to use it for, then it’s going to be even harder to retrieve COVID-19 research papers in relation to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access (OBDA), bias may have negative effects. Take as example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and suppose this ontology is used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA approved drugs database:

The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect and the error is due to the biased and erroneous class subsumption declared in the ontology, not the data source itself.
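To make the mechanism concrete, here is a minimal sketch in Python. It is a toy stand-in and not the actual OBDA setup from the paper: the two small sets are made-up miniature versions of the trials and approved-drugs sources, the query ("which approved drugs are COVID-19 drugs?") is an assumed example, and the "reasoner" is nothing more than a transitive subclass lookup.

```python
# Toy illustration of how a contested subsumption axiom leaks into query answers.
# All data and names below are illustrative assumptions, not the real sources.

TRIAL_CLASS = "COVID-19 experimental drug in a clinical trial"
TARGET_CLASS = "COVID-19 drug"

# TBox: subclass axioms as child -> parent
SUBCLASS_OF = {TRIAL_CLASS: TARGET_CLASS}  # the contested axiom

# Data that the mappings would pull from the two sources (made-up miniatures)
TRIAL_DRUGS = {"hydroxychloroquine"}             # drugs listed in some COVID-19 trial
APPROVED_DRUGS = {"hydroxychloroquine", "aspirin"}  # approved for *some* indication


def superclasses(cls, axioms):
    """Return cls together with all its ancestors under the subclass axioms."""
    seen = [cls]
    while cls in axioms and axioms[cls] not in seen:
        cls = axioms[cls]
        seen.append(cls)
    return seen


def approved_covid_drugs(axioms):
    """Answer 'approved drugs that are COVID-19 drugs' under the given TBox."""
    covid_drugs = {
        drug for drug in TRIAL_DRUGS
        if TARGET_CLASS in superclasses(TRIAL_CLASS, axioms)
    }
    return covid_drugs & APPROVED_DRUGS


print(approved_covid_drugs(SUBCLASS_OF))  # with the axiom
print(approved_covid_drugs({}))           # without the axiom
```

With the axiom in place the toy query returns the trial drug as if it were a COVID-19 drug; drop the axiom and the answer set is empty, while the data stay exactly the same.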

Some peculiarities of content in an ontology may not be due to an underlying bias, but merely a case of ‘ran out of time’ rather than an act of omission due to a bias, for instance. Or it may not be an honest mistake due to bias but a mistake because of some other reason, such as having clicked erroneously on a wrong button in the tool’s interface, say, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One can also have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. Or: there seems to be enough to do yet.

Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but it’s increasingly looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.

References

[1] K. Janowicz, B. Yan, B. Regalia, R. Zhu, G. Mai, Debiasing knowledge graphs: Why female presidents are not like female popes, in: M. van Erp, M. Atre, V. Lopez, K. Srinivas, C. Fortuna (Eds.), Proceedings of ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, volume 2180 of CEUR-WS, 2018.

[2] D. L. Gomes, T. H. Bragato Barros, The bias in ontologies: An analysis of the FOAF ontology, in: M. Lykke, T. Svarre, M. Skov, D. Martínez-Ávila (Eds.), Proceedings of the Sixteenth International ISKO Conference, Ergon-Verlag, 2020, pp. 236 – 244.

[3] Keet, C.M. Dirty wars, databases, and indices. Peace & Conflict Review, 2009, 4(1):75-78.

[4] E. Dimara, S. Franconeri, C. Plaisant, A. Bezerianos, P. Dragicevic, A task-based taxonomy of cognitive biases for information visualization, IEEE Transactions on Visualization and Computer Graphics 26 (2020) 1413–1432.

[5] Keet, C.M. An exploration into cognitive bias in ontologies. Cognition And OntologieS (CAOS’21), part of JOWO’21, part of BoSK’21. 13-16 September 2021, Bolzano, Italy. (in print)

# CLaRO v2.0: A larger CNL for competency questions for ontologies

The avid blog reader with a good memory might remember we had developed a controlled natural language (CNL) in 2019 that we called CLaRO, a Competency question Language for specifying Requirements for an Ontology, model, or specification [1], for specifying requirements on the contents of the TBox (type-level) knowledge specifically. The paper won the best student paper award at the MTSR’19 conference. Then COVID-19 came along.

Notwithstanding, we did take next steps and obtained some advances in the meantime, which resulted in a substantially extended CNL, called CLaRO v2 [2]. The paper describing how it came about has been accepted recently at the 7th Controlled Natural Language Workshop (CNL2020/21), which will be held on 8-9 September in Amsterdam, The Netherlands, in hybrid mode.

So, what is it about, being “new and improved!” compared to the first version? The first version was created in a bottom-up fashion based on a dataset of 234 competency questions [3] in a few domains only. It turned out alright with decent performance on coverage for unseen questions (88% overall) and very significantly outperforming the others, but there were some nagging doubts about the feasibility of bottom-up approaches to template development, which are essentially at the heart of every bottom-up approach: questions about representativeness and quality of the source data. We used more questions as basis to work from than others and had better coverage, but would coverage improve further then still with even more questions? Would it matter for coverage if the CQs were to come from more diverse subject domains? Also, upon manual inspection of the original CQs, it could be seen that some CQs from the dataset were ill-formed, which propagated through to the final set of templates of CLaRO. Would ‘cleaning’ the source data to presumably better quality templates improve coverage?

One of the PhD students I supervise, Mary-Jane Antia, set out to find answers to these questions. CQs were cleaned and vetted by a linguist, and the templates were recreated, compared, and evaluated, this time automatically in a new testing pipeline. New CQs for ontologies were sourced by searching all over the place and finding some 70, to which we added 22 more variants by tweaking the wording of existing CQs such that they still would be potentially answerable by an ontology. They were tested on the templates, which resulted in a lower than ideal percentage of coverage and so new templates were created from them, and yet again evaluated. The key results:

• An increase from 88% for CLaRO v1 to 94.1% for CLaRO v2 coverage.
• The new CLaRO v2 has 147 main templates and another 59 variants to cater for minor differences (e.g., singular/plural, redundant words), up from 93 and 41 in CLaRO.
• Increasing the number of domains that the CQs were drawn from had a larger effect on the CQ coverage than cleaning the source data.
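
For a rough idea of what a coverage number like the ones above measures, here is a minimal sketch in Python. It assumes that coverage is the percentage of CQ patterns that match at least one template, and that CQs have already been abstracted into patterns with chunk placeholders. The templates, the placeholder names `EC1`, `PC1`, `EC2`, and the test patterns are made up for illustration; the actual templates and evaluation pipeline are the ones described in the paper [2].

```python
import re


def normalise(pattern: str) -> str:
    """Collapse whitespace, drop a trailing question mark, and lower-case."""
    collapsed = re.sub(r"\s+", " ", pattern).strip()
    return collapsed.rstrip("?").strip().lower()


def coverage(cq_patterns, templates):
    """Percentage of CQ patterns that equal at least one template after normalising."""
    if not cq_patterns:
        return 0.0
    template_set = {normalise(t) for t in templates}
    hits = sum(1 for p in cq_patterns if normalise(p) in template_set)
    return 100.0 * hits / len(cq_patterns)


# Hypothetical templates and unseen CQ patterns, for illustration only
templates = ["What EC1 PC1 EC2?", "Which EC1 PC1 EC2?", "Is EC1 a EC2?"]
unseen = [
    "which EC1  PC1 EC2?",    # matches after normalisation
    "Is EC1 a EC2",           # matches, question mark missing
    "How many EC1 PC1 EC2?",  # no template for this one
    "What EC1 PC1 EC2?",      # exact match
]

print(coverage(unseen, templates))  # 3 of 4 patterns covered -> 75.0
```

Exact matching after normalisation is the simplest possible choice here; handling variants such as singular/plural would need either extra templates or a more lenient matcher.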

All the data, including the new templates, are available on GitHub and the details are described in the paper [2]. The CLaRO tool that supports the authoring is in the process of being updated so as to incorporate the v2 templates (currently it is working with the v1 templates).

I will try to make it to Amsterdam where CNL’21 will take place, but travel restrictions aren’t cooperating with that plan just yet; else I’ll participate virtually. Mary-Jane will present the paper, and also for her, despite also having funding for the trip, it increasingly looks like a virtual presentation. On the bright side: at least there is a way to participate virtually.

References

[1] Keet, C.M., Mahlaza, Z., Antia, M.-J. CLaRO: a Controlled Language for Authoring Competency Questions. 13th Metadata and Semantics Research Conference (MTSR’19). 28-31 Oct 2019, Rome, Italy. Springer CCIS vol. 1075, 3-15.

[2] Antia, M.-J., Keet, C.M. Assessing and Enhancing Bottom-up CNL Design for Competency Questions for Ontologies. 7th International Workshop on Controlled Natural language (CNL’21), 8-9 Sept. 2021, Amsterdam, the Netherlands. (in print)

[3] Potoniec, J., Wisniewski, D., Lawrynowicz, A., Keet, C.M. Dataset of Ontology Competency Questions to SPARQL-OWL Queries Translations. Data in Brief, 2020, 29: 105098.

# What about ethics and responsible data integration and data firewalls?

With another level 4 lockdown and a curfew from 9pm for most of July, I eventually gave in and decided to buy a TV, for some diversion with the national TV channels. In the process of buying, it appeared that here in South Africa, you have to have a valid paid-up TV licence to be allowed to buy a TV. I had none yet. So there I was in the online shopping check-out on a Sunday evening being held up by a message that boiled down to a ‘we don’t recognise your ID or passport number as having a TV licence’. As advances in the state’s information systems would have it, you can register for a TV licence online and pay with credit card to obtain one near-instantly. The interesting question from an IT perspective then was: how long will it take for the online retailer to know I duly registered and paid for the licence? In other words: are the two systems integrated and if so, how? It definitely is not based on a simple live SPARQL query from the retailer to a SPARQL endpoint of the TV licences database, as I still failed the retailer’s TV licence check immediately after payment of the licence and confirmation of it. Some time passed with refreshing the page and trying again and writing a message to the retailer, perhaps 30-45 minutes or so. And then it worked! A periodic data push or pull it is then, either between the licence database and the retailer or within the state’s back-end system and any front-end query interface. Not bad, not bad at all.
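
The difference between a live query and a periodic pull is easy to simulate. The sketch below is in Python and purely hypothetical: the class names, the 30-minute interval, and the ID number are assumptions for illustration, and I have no knowledge of how the actual systems are implemented.

```python
class LicenceRegistry:
    """Stand-in for the TV licence database."""

    def __init__(self):
        self._licensed_ids = set()

    def register(self, id_number):
        self._licensed_ids.add(id_number)

    def snapshot(self):
        """Return an immutable copy of all licensed ID numbers."""
        return frozenset(self._licensed_ids)


class RetailerCheck:
    """Retailer-side check that works on a periodically refreshed copy."""

    def __init__(self, registry, sync_interval_min):
        self.registry = registry
        self.sync_interval_min = sync_interval_min
        self.cache = registry.snapshot()
        self.last_sync_min = 0

    def has_licence(self, id_number, now_min):
        # Pull a fresh copy only when the sync interval has elapsed.
        if now_min - self.last_sync_min >= self.sync_interval_min:
            self.cache = self.registry.snapshot()
            self.last_sync_min = now_min
        return id_number in self.cache


registry = LicenceRegistry()
retailer = RetailerCheck(registry, sync_interval_min=30)  # assumed interval

registry.register("ID-123")                         # licence paid at minute 5
print(retailer.has_licence("ID-123", now_min=6))    # False: copy is stale
print(retailer.has_licence("ID-123", now_min=30))   # True: pulled afresh
```

A live query would amount to calling `registry.snapshot()` on every check, which is what the failed check right after payment rules out.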

One may question from a privacy viewpoint whether this is the right process. Why could I not simply query by, say, just TV licence number and surname, rather than having to hand over my ID or passport number for the check? Should it even be the retailer’s responsibility to check whether their customer has paid the tax?

There are other places in the state’s systems where there’s some relatively advanced integration of data between the state and companies as well. Notably, the SA Revenue Service (SARS) system pulls data from any company you work for (or they submit that via some ETL process) and from any bank you’re banking with to check whether you paid the right amount (if you owe them, they send the payment order straight to your bank, but you still have to click ‘approve’ online). No doubt it will help reduce fraud, and by making it easier to fill in tax forms, it likely will increase the amount collected and will cause fewer errors that otherwise may be costly to fix. Clearly, the system amounts to reduced privacy, but it remains within the legal framework (someone trying to evade paying taxes is breaking the law, rather) and I support the notion of redistributive taxation and achieving that with as little admin as possible.

These examples do raise broader questions, though: when is data integration justified? Always? If not always, then when is it not? How to ensure that it won’t happen when it should not? Who regulates data integration, if anyone? Are there any guidelines or a checklist for doing it responsibly so that it at least won’t cause unintentional harm? Which steps in the data integration, if any, are crucial from a responsibility and ethical point of view?

I did search for academic literature, but found only one paper mentioning we should think of at least some of these sorts of questions [1]. There are plenty of ethics & Big Data papers (e.g., [2,3]), but those papers focus on the algorithms let loose on the data and consequences thereof once the data has been integrated, rather than yes/no integration or any of the preceding integration processes themselves. There are, among others, data cleaning, data harmonisation and algorithms for that, schema-based integration (LAV, GAV, or GLAV), conceptual model-based integration, ontology-driven integration, possibly recurring ETL processes and so on, and something may go wrong at each step or may be the fine-grained crucial component of the ethical considerations. I devised one toy example in the context of ontology-based data access and integration where things would go wrong because of a bias [4] in that COVID-19 ontology that has data integration as its explicit purpose [5]. There are also informal [page offline dd 25-7-2021] descriptions of cases where things went wrong, such as the data integration issues with the City of Johannesburg that caused multiple riots in 2011, and no doubt there will be more.

Taking the ‘non-science’ route further to see if I could find something, I did find a few websites with some ‘best practices’ and ‘guidelines’ for data integration (e.g., here and here), with the brand new and most comprehensive set of data integration guidelines at end-user level by UN’s ESCAP that focuses on data integration for statistics offices on what to do and where errors may creep in [6]. But that’s all. No substantive hits with ‘ethics in data integration’ and similar searches in the academic literature. Maybe I’m searching in the wrong places. Wading through all ‘data ethics’ papers to find the needle in the haystack may have to be done some other time. If you know of scientific literature that I missed specifically regarding data integration, I’d be most grateful if you’d let me know.

The ‘recurring reliables’ for issues: health and education

Meanwhile, to take a step toward an answer of at least a subset of the aforementioned questions, let me first mention two other recent cases, also from South Africa, although the second issue happened in the Netherlands as well.

The first one is about healthcare data. I’m trying to get a SARS-CoV-2 vaccine. Registration for the age group I’m in opened on the 14th in the evening and so I did register in the state’s electronic vaccination data system (EVDS), which is the basic requirement for getting a vaccine. The next day, it appeared that we could book a slot via the health insurance I’m a member of. Their database and the EVDS are definitely not integrated, and so my insurer spammed me for a while with online messages in red, via email, and via SMS that I should register with the EVDS, even though I had already done that well before trying out their app.

Perhaps the health data are not integrated because it’s health; perhaps it was just time pressure to not delay the SARS-CoV-2 vaccination programme rollout. People in some sectors, such as the basic education sector and then the police, got loaded into the EVDS by the respective state department in one go via some ETL process, rather than having to bother with individual registration. ID number, names, health insurance, dependants, home address, phone number, and whatnot that the EVDS asked for. And that regardless of whether you want the vaccine or not; at least most people do. I don’t recall anyone having had a problem with that back-end process having happened, aside from reported glitches in the basic education sector’s ETL process, with reports on missing foreign national teachers and employees of independent schools who wanted in but weren’t.

Both the IT systems for vaccination management and any app for a ‘pass’ for having been vaccinated enjoy some debate on privacy internationally. Should they be self-standing systems? If some integration is allowed, then with what? Should a healthcare provider or insurer be informed of the vaccination status of a member (and, consequently, act accordingly, whatever that may be), only if the member voluntarily discloses it (like with the vaccination scheduling app), or never? One’s employer? The movie theatre or mall you may want to enter? Perhaps airline companies want access to the vaccine database as well, so that they could choose to only let vaccinated people on their planes? The latter happens with other vaccinations for sure; e.g., yellow fever vaccination proof to enter SA from some countries, which the airline staff did ask for when I checked in in Argentina when travelling back to SA in 2012. That vaccination proof had gone into the physical yellow fever vaccination booklet that I carried with me; no app was involved in that process, ever. But now more things are digital. Must any such ‘covid-19 pass’ necessarily be digital? If so, who decides who, if anyone, will get access to the vaccination data, be it the EVDS data in SA or their homologous systems in other countries? To the best of my knowledge, no regulations exist yet. Since the EVDS is an IT system of the state, I presume they will decide. If they don’t, it will be up to the whims of each company, municipality, or province, and then it is bound to generate lots of confusion among people.

The other case, of a different nature, comes up in the news regularly; e.g., here, here, and here. It’s the tension that exists between children’s right to education and the paperwork to apply for a school. This runs into complications when they have an “undocumented” status, be it because of an absent birth certificate or their and their parents’ status as legal/illegal and their related ID documents or the absence thereof. It is forbidden for a school to contact Home Affairs to get the prospective pupil’s and their respective parents’/guardians’ status, and for Home Affairs to provide that data to the schools, let alone integrate those two databases at the ministerial level. Essentially, it is an intentional ‘Chinese wall’ between the two databases: the right to education of a child trumps any possible violation of legality of stay in the country or missing paperwork of the child or their parents/guardians.

Notwithstanding, exclusive or exclusionary schools try to filter them out by other means, such as by demanding that sort of data when you want to apply for admission; here’s an example, compared to public schools where evidence of an application for permission to stay suffices or at least evidence of efforts to engage with Home Affairs will do already. When the law says ‘no’ to the integration, how can you guarantee it won’t happen, neither through the software nor by other means (like by de facto requiring the relevant data stored in the Home Affairs database in an admission form)? Policing it? People reporting it somewhere? Would requesting such information now be a violation of the Protection of Personal Information Act (POPIA) that came into force on the 1st of July, since it asks for more personal data than needed by law?

Regulatory aspects

These cases, from the TV licence and SARS (the tax, not the syndrome) to the vaccine database and school admissions, are just a few anecdotes. Data integration clearly is not always allowed and when it is not, it has been a deliberate decision not to do so because its outcome is easy to predict and deemed unwanted. Notably for the education case, it is the government who devised the policy for a regulatory Chinese wall between its systems. The TV licence appears to lie at the other end of the spectrum. The broadcasting act of 1999 implicitly puts the onus on the seller of TVs: the licence is not a fee to watch public TV, it is a thing to give the licence holder the right to use a TV (article 27, if you must know), so if you don’t have the right to have it, then you can’t buy it. It’s analogous to having to be over 18 to buy alcohol, where the seller is held culpable if the buyer isn’t. That said, there are differences in what the seller requests from the customer: Makro requires the licence number only and asks for ID only if you can’t remember the licence number so as to ‘help you find it’, whereas takealot demands both ID and licence in any case, and therewith perhaps is asking for more than strictly needed. Either way, since any retailer thus should be able to access the licence information instantly to check whether you have the right to own a TV, it’s a bit as if “come in and take my data” is written all over the TV licence database. I haven’t seen any news articles about abuse.

For the SARS-CoV-2 vaccine and the EVDS data, there is, to the best of my knowledge, no specific regulation in place from the EVDS to third parties, other than that vaccination is voluntary and there is SA’s version of the GDPR, the aforementioned POPIA, which is based on the GDPR principles. I haven’t seen much debate about organisations requiring vaccination, but they can make vaccination mandatory if they want to, from which follows that there will have to be some data exchange either between the EVDS and third parties or from EVDS to the person and from there to the company. Would it then become another “come in and take my data”? We’ll cross that bridge when we come to it, I suppose; coverage is currently at about 10% of the population and not everyone who wants to could get vaccinated yet, so we’re still in limbo.

What could possibly go wrong with widespread access, as with the TV licence database? A lot, of course. There are the usual privacy and interoperability issues (also noted here), and there are calls even in the laissez faire USA to put a framework in place to provide companies with “standards and bounds”. They are unlikely to be solved by the CommonPass of the Commons Project bottom-up initiative, since there are so many countries with so many rules on privacy and data sharing. Interoperability between some systems is one thing; one world-wide system is another cup of tea.

What all this boils down to is not unlike Moshe Vardi’s argument, in that there’s the need for more policy to reduce and avoid ethical issues in IT, AI, and computing, rather than that computing would be facing an ethics crisis [7]. His claim is that failures of policy cause problems and that the “remedy is public policy, in the form of laws and regulations”, not some more “ethics outrage”. Presumably, there’s no ethics crisis, of the form that there would be a lack of understanding of ethical behaviour among computer scientists and their managers. Seeing each year how students’ arguments improve between the start of the ethics course and at the end in the essay and exam, I’d argue that basic sensitization is still needed, but on the whole, more and better policy could go a long way indeed.

More research on possible missteps in the various data integration processes would also be helpful, and that from a technical angle, as would learning from case studies and contextual inquiries [8], as well as a rigorous assessment of possible biases, as was done for software development processes [9]. Those outcomes then may end up as a set of guidelines for data integration practitioners and the companies they work for, and inform government to devise policies. For now, the ESCAP guidelines [6] probably will be of most use to a data integration practitioner. They won’t catch all biases and algorithmic issues & tools and they assume one is allowed to integrate already, but they are a step in the direction of responsible data integration. I’ll think about it a bit more, too, and for the time being I won’t bother my students with writing an essay about ethics of data integration just yet.

References

[1] Firmani, D., Tanca, L., Torlone, R. Data processing: reflection on ethics. International Workshop on Processing Information Ethically (PIE’19). CEUR-WS vol. 2417. 4 June 2019.

[2] Herschel, R., Miori, V.M. Ethics & Big Data. Technology in Society, 2017, 49:31‐36.

[3] Sax, M. Finders keepers, losers weepers. Ethics and Information Technology, 2016, 18: 25‐31.

[4] Keet, C.M. Bias in ontologies — a preliminary assessment. Technical Report, Arxiv.org, January 20, 2021. 10p

[5] He, Y., et al. 2020. CIDO: The Community-based Coronavirus Infectious Disease Ontology. In Hastings, J.; and Loebe, F., eds., Proceedings of the 11th international Conference on Biomedical Ontologies, CEUR-WS vol. 2807.

[6] Economic and Social Commission for Asia and the Pacific (ESCAP). Asia-Pacific Guidelines to Data Integration for Official Statistics. Training manual. 15 April 2021.

[7] Vardi, M.Y. Are We Having An Ethical Crisis in Computing? Communications of the ACM, 62(1):7

[8] McKeown, A., Cliffe, C., Arora, A. et al. Ethical challenges of integration across primary and secondary care: a qualitative and normative analysis. BMC Med Ethics 20, 42 (2019).

[9] R. Mohanani, I. Salman, B. Turhan, P. Rodriguez, P. Ralph, Cognitive biases in software engineering: A systematic mapping study, IEEE Transactions on Software Engineering, 46 (2020): 1318–1339.

# My road travelled from microbiology to computer science

From bites to bytes or, more precisely, from foods to formalisations, and that sprinkled with a handful of humanities and a dash of design. It does add up. The road I travelled into computer science has nothing to do with any ‘gender blabla’, nor with an idealistic drive to solve the world food problem by other means, nor that I would have become fed up with the broad theme of agriculture. But then what was it? I’m regularly asked about that road into computer science, for various reasons. There are those who are curious or nosy, some deem it improbable and that I must be making it up, and yet others chiefly speculate about where I obtained the money from to pay for it all. So here it goes, in a fairly large write-up since I did not take a straight path, let alone a shortcut.

If you’ve seen my CV, you know I studied “Food Science, free specialisation” at Wageningen University in the Netherlands. It is the university to go to for all things to do with agriculture in the broad sense. Somehow I made it into computer science, but it was not there. The motivation does come from there, thanks to it being at the forefront of science, and as such it has an ambiance that facilitates exposure to a wide range of topics and techniques within the education system and among fellow students. (Also, it really was the best quality education I ever had, which deserves to be said, and I’ve been around enough to have ample comparison material.)

And yet.

Perhaps it is conceivable to speculate that all the hurdles with mathematics and PC use when I was young were the motivation to turn to computing. Definitely not. Instead, it happened when I was working on my last, and major, Master’s thesis in the Molecular Ecology section of the Laboratory of Microbiology at Wageningen University, having drifted away a little from microbes in food science.

My thesis topic was about trying to clean up chemically contaminated soil by using bacteria that would eat the harmful compounds, rather than cleaning up the site by disrupting the ecosystem with excavations and chemical treatments of the soil. In this case, it was about 3-chlorobenzoate, which is an intermediate degradation product from, mainly, paint spillage that had been going on since the 1920s; said molecule substantially reduces growth and yield of maize, which is undesirable. I set out to examine a bunch of configurations of different amounts of 3-chlorobenzoate in the soil together with the Pseudomonas B13 bacteria and distance to the roots of the maize plants, and their effects on the growth of the maize plants. The bacteria were expected to clean up more of the 3-chlorobenzoate in the area near the roots (the rhizosphere), and there were some questions about what the bacteria would do once the 3-chlorobenzoate ran out (mainly: will they die or feed on other molecules?).

The bird’s-eye view still sounds interesting to me, but there was a lot of boring work to do to find the answer. There were days that the only excitement was to open the incubator to see whether my beasts had grown on the agar plate in the petri dish; if they had (yay!), I was punished with counting the colonies. Staring at dots on the agar plate in the petri dish and counting them. Then there were the analysis methods to be used, of which two turned out to be crucial for changing track, mixed with a minor logistical issue to top it off.

First, there was the PCR technique to sequence genetic material, which by now, during COVID-19 times, may be a familiar term. Nowadays there are machines that do the procedure automatically. In 1997, it was still a cumbersome procedure, which took about a day of near non-stop work to sequence the short ribosomal RNA (16S rRNA) strand that was extracted from the collected bacteria. That was how we could figure out whether any of those white dots in the petri dish were, say, the Pseudomonas B13 I had inoculated the soil with, or some other soil bacteria. You extract the genetic material, multiply it, sequence it and then compare it. It was the last step that was the coolest.

The 16S rRNA of a bacterium is, on average, around 1500 base pairs long, which is represented as a sequence of some 1500 capital letters consisting of A’s, C’s, G’s, and U’s. For comparison: the SARS-CoV-2 genome is about 30000 base pairs. You really don’t want to compare either one by hand against even one other similar sequence of letters, let alone manually check your newly PCR-ed sequence against many others to figure out which bacteria you likely had isolated or which one is phylogenetically most closely related. Instead, we sent the sequence, as a string of flat text with those ACGU letters, to a database called the RNABase and we received an answer with a list of more or less likely matches within a few hours to a day, depending on the time of submitting it to the database.

It was like magic. But how did it really do that? What is a database? How does it calculate the alignments? And since it can do this cool stuff that’s not doable by humans, what else can you do with such techniques to advance our knowledge about the world? How much faster can science advance with these things? I wanted to know. I needed to know.
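To give a rough flavour of what such a comparison involves, here is a minimal sketch in Python. It is not what the database actually ran (its method is not described here, and real tools use far more sophisticated alignment and scoring); it only ranks a few made-up reference strings by edit distance to a query. The names `edit_distance`, `rank_matches`, and the `reference_*` sequences are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def rank_matches(query: str, references: dict) -> list:
    """Return (distance, name) pairs, closest reference first."""
    return sorted((edit_distance(query, seq), name)
                  for name, seq in references.items())


# Made-up toy sequences; real 16S rRNA strands are around 1500 letters long.
references = {
    "reference_A": "ACGUACGUAC",
    "reference_B": "ACGUUCGUAC",
    "reference_C": "GGGUACCCAC",
}
query = "ACGUACGAAC"

for distance, name in rank_matches(query, references):
    print(name, distance)
```

The fewer edits needed to turn the query into a reference, the higher that reference ends up in the list, which is the "more or less likely matches" idea in miniature.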

The other technique I had to work with was not new to me, but I had to scale it up: the High-Performance Liquid Chromatography (HPLC). You give the machine a solution and it separates out the component molecules, so you can figure out what’s in the solution and how much of it is in there. Different types of molecules stick to the wall of the tube inside the machine at different places. The machine then spits out the result as a graph, where different peaks scattered across the x axis indicate different substances in the solution and the size of the peak indicates the concentration of that molecule in the sample.

I had taken multiple soil samples closer to and farther away from the rhizosphere of different boxes with maize plants with different treatments of the soil, rinsed them and tested the solution in the HPLC. The task then was to compare the resulting graphs to see if there was a difference in treatment. Having printed them all out, they covered a large table of about 1.5 by 2 metres, and I had to look closely at them and try to do some manual pattern matching on the shape and size of the graphs and sub-graphs. There was no program that could compare graphs automatically. I tried to overlay printouts and hold them in front of the ceiling light. With every printed graph about the size of 20x20cm, you can calculate how many I had and how many 1-by-1 comparisons that amounts to (this is left as an exercise to the reader). It felt primitive, especially considering all the fancy toys in the lab and on the PC. Couldn’t those software developers also develop a tool to compare graphs?! Now that would have been useful. But no. If only I could develop such a useful tool myself; then I would not have to wait on the software developers until they care to develop it.
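For those who prefer to let the computer do the exercise, a back-of-the-envelope sketch follows. It assumes (and these are assumptions, not recollections) that the table was exactly 150 by 200 cm, fully covered with whole, non-overlapping 20x20cm sheets, and that every graph is compared with every other one exactly once.

```python
from math import comb

# Assumed dimensions in centimetres, taken from the rough figures in the text.
table_cm = (150, 200)
sheet_cm = (20, 20)

# Whole sheets that fit along each side of the table, without overlap.
sheets = (table_cm[0] // sheet_cm[0]) * (table_cm[1] // sheet_cm[1])

# Each unordered pair of graphs is one 1-by-1 comparison.
comparisons = comb(sheets, 2)

print(sheets, comparisons)  # 70 2415
```

Under those assumptions that is 70 printouts and 2415 pairwise comparisons; allowing for partial overlap of the sheets only makes the number go up.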

On top of that manual analysis, it seemed unfair that I had to copy the data from the HPLC machine in the basement of the building onto a 3.5 inch floppy disk and walk upstairs to the third floor to the shared MSc thesis students’ desktop PCs to be able to process it, whereas the PCR data was accessible from my desktop PC even though the PCR machine was on the ground floor. The PC could access the internet and present data from all over the world, even, so surely it should be able to connect to the HPLC downstairs?! Enter questions about computer networks.

The first step in trying to get some answers was to inquire with the academics in the department. “Maybe there’s something like ‘theoretical microbiology’, or whatever it’s called that focuses on data analysis and modelling of microbiology? It is the fun part of the research, and it avoids lab work?”, I asked my supervisor and more generally in the lab. “Not really,” was the answer, continuing “ok, sure, there is some, but theory-only without the evidence from experiments isn’t it.” Despite all the advanced equipment, of which computing is an indispensable component, they still deemed that wetlab research trumped solely theory and computing. “Those technologies are there to assist answering faster the new and more advanced questions, but not replace the processes”, I was told.

Sigh. Pity. So be it, I supposed. But I still wanted answers to those computing questions. I also wanted to do a PhD in microbiology and then probably move to some other discipline, since I sensed that possibly after another 4-6 years I might become bored with microbiology. Then there was the logistical issue that I still could not walk well, which made wetlab work difficult; hence, it would make obtaining a PhD scholarship harder. Lab work was a hard requirement for a PhD in microbiology and it wasn’t exactly the most exciting part of studying bacteria. So, I might as well swap to something else straight away then. Since there were those questions in computing that I wanted answers to, there we have the inevitable conclusion to move to greener, or at least as green, pastures.

***

How to obtain those answers in computing? Signing up for a sort of ‘top up’ degree for the computing aspects would be nice, so as to do that brand new thing called bioinformatics. There were no such top-up degrees in the Netherlands at the time and the only one that came close was a full degree in medical informatics, which is not what I wanted. I didn’t want to know about all the horrible diseases people can get.

The only way to combine it was to enrol in the 1st year of a degree in computing. The snag was the money. I was finishing up my 5 years of state funding for the master’s degree (old system, so it included the BSc) and the state paid for only one such degree. The only way to be able to do it was to start working, save money, and pay for it myself at some point in the near future once I’d have enough money. Going into IT in industry out in the big wide world sounded somewhat interesting as a second-choice option, since it should be easier with such skills to work anywhere in the world, and I still wanted to travel the world as well.

Once I finished the thesis in molecular ecology and graduated with a master’s degree in January 1998, I started looking for work whilst receiving unemployment benefit. IT companies only offered ‘conversion’ courses, such as a crash course in Cobol (the Y2K bug was alive and well) or some IT admin course, including the Microsoft Certified Systems Engineer (MCSE) programme, with the catch that you’d have to keep working for the IT company for 3 years to pay off the debt of that training. That sounded like bonded labour and not particularly appealing.

Some day flicking through the newspapers on the lookout for interesting job offers, an advertisement caught my eye: a conversion course over a year for an MCSE consisting of five months full-time training and the rest of the year a practice period in industry whilst maintaining one’s unemployment benefit whose amount was just about sufficient to get by, and then all was paid off. A sizeable portion of funding came from the European Union. The programme was geared toward giving a second chance for basket cases, such as the long-term unemployed and the disabled. I was not a basket case, not yet at least. I tried nonetheless, applied for a position, and was invited for an interview. My main task was to try to convince them that I was basket case-like enough to qualify to be accepted in the programme, but good enough to pass fast and with good marks. The arguments worked and I was accepted for the programme. A foothold in the door.

We were a class of 16 people, 15 men and me the only woman. I completed the MCSE successfully, and then I also completed a range of other vocational training courses whilst employed in various IT jobs. Unix system administration, ITIL service management, a bit of Novell Netware and Cisco, and some more online self-study training sessions, which were all paid for by the companies I was employed at. The downside of those trainings is that they all were, in my humble opinion, superficial, the how-to technology changes fast, and the prospect of perpetual rote learning did not sound appealing to me. I wanted to know the underlying principles so that I wouldn’t have to keep updating myself with the latest trivia modification in an application. It was time to take the next step.

I was working for Eurologic Systems in Dublin, Ireland, at the time as a systems integration test engineer for fibre channel storage enclosures, which are boxes with many hard drives stacked up and connected for fast access to lots of data stored on the disks. They were a good employer, but they had only few training opportunities since it was an R&D company with experienced and highly educated engineers. I asked HR if I could sign up elsewhere, with, say, the Open University, and that they’d pay for some of it, maybe? “Yes,” the humane HR lady said, “that’s a good idea, and we’ll pay for every course you pass whilst in our employment.” Deal!

So, I enrolled with the Open University UK. I breezed through my first year even though I had skipped their 1st year courses and jumped straight into 2nd year courses. My second year went just as smoothly. The third year I paid myself, since I had opted for voluntary redundancy and was allowed to take it in the second round, since I wanted to get back on track of my original plan to go into bioinformatics. The dotcom bubble had burst and Eurologic could not escape some of its effects. While they were not fond of seeing me go, they knew I’d leave soon anyway and they were happy to see that the redundancy money would be put to good use to finish my Computing & IT degree. With that finished, I’d be able to finally do the bioinformatics that I was after since 1997, or so I thought.

My honours project was on database development, with a focus on conceptual data modelling languages. I rediscovered the Object-Role Modelling language from the lecture notes of the Saxion University of Applied Sciences that I had bought out of curiosity when I did the aforementioned MCSE course (in Enschede, the Netherlands). The database was about bacteriocins, which are produced by bacteria and they can be used in food for food safety and preservation. A first real step into bioinformatics. Bacteriocins have something to do with genes, too, and in searching for conceptual models about genes, I had stumbled into a new world in 2003, one with the Gene Ontology and the notion of ontologies to solve the data integration problem. Marking and marks processing took a bit longer than usual that year (the academics were on strike), and I was awarded the BSc(honours) degree (1st class) in March 2004. By that time, there were several bioinformatics conversion courses available. Ah, well.

The long route taken did give me some precious insight that no bioinformatics conversion top-up degree can give: a deeper understanding of indoctrination into disciplinary thinking and ways of doing science. That is, on what the respective mores are, how to question, how to identify a problem, looking at things, ways of answering questions and solving problems. Of course, when there’s, say, an experimental method, the principles of the methods are the same (hypothesis, set up experiment, do experiment, check results against hypothesis), as are some of the results processing tools (e.g., statistics), but there are substantive differences. For instance, in computing, you break down the problem, isolate it, and solve that piece of something that’s all human-made. In microbiology, it’s about trying to figure out how nature works, with all its interconnected parts that may interfere and complicate the picture. In the engineering side of food science, it was more along the line of, once we figure out what it does and what we really need, can we find something that does what we need or can we make it do it to solve the problem? It doesn’t necessarily mean one is less cool; just different. And hard to explain to someone who has ever studied only one degree in one discipline, most of whom invariably have the ‘my way or the highway’ attitude or think everyone is homologous to them. If you manage to create the chance to do a second full degree, take it.

***

I clearly did not have a Bachelor of arts, but I had done some courses roughly in that area in my degree in Wageningen and had done a range of extra-curricular activities. Perhaps that, and more, would help me persuade the selection committee? I put it all in detail in the application form in the hope it would increase my chances to try to make it look like I could pull this off and be accepted into the programme. I was accepted into the programme. Yay. Afterwards, I heard from one of the professors that it had been an easy decision, “since you already have a Masters degree, of science, no less”. Also this door was opened thanks to that first degree I had obtained that was paid for by the state merely because I qualified for the tertiary education. The money to pay for this study came from my savings and the severance package from Eurologic. I had earned too much money in industry to qualify for state subsidy in Ireland; fair enough.

Doing the courses, I could feel I was missing the foundations, both regarding the content of some established theories here and there and in tackling things. By that time, I was immersed in computing, where you break down things in smaller sub-components and that systematising is also reflected in the reports you write. My essays and reports have sections and subsections and suitably itemised lists—Ordnung muss sein. But no, we’re in a fluffy humanities space and it should have been ‘verbal diarrhoea’. That was my interpretation of some essay feedback I had received, which claimed that there was too much structure and that it should have been one long piece of text without visually identifiable begin, middle, and end. That was early in the first semester. A few months into the programme, I thought that the only way I’d be able to pull off the dissertation, was to drag the topic as much as I could into an area that I was comparatively good at: modelling and maths.

That is: to stick with my disciplinary indoctrinations as much as possible, rather than fully descend into what to me still resembled mud and quicksand. For sure, there’s much more to the humanities than meets an average scientist’s eye, and I gained an appreciation of it during that degree, but that does not mean I was comfortable with it. In addition, for thesis topic choice, there were still the ‘terrorists’ I was looking for an answer to. Combine the two, and voila, my dissertation topic: applying game theory to peace negotiations in the so-called ‘terrorist theatre’. Prof. Moxon-Browne was not only a willing, but also eager, supervisor, and a great one at that. The fact that he could not wait to see my progress was a good stimulator to work and achieve that progress.

In the end, the dissertation had some ‘fluffy’ theory, some mathematical modelling, and some experimentation. It looked into three-party negotiations, as opposed to the common zero-sum approach in the literature: the government and two aggrieved groups, of which one was the politically-oriented one and the other one the violent one. For instance, in the case of South Africa, the Apartheid government on the one side and the ANC and the MK on the other side, and in the case of Ireland, the UK/Northern Ireland government, Sinn Fein and the IRA. The strategic benefits of who teams up with whom during negotiations, if at all, depend on their relative strength: mathematically, in several identified power-dynamic circumstances, an aggrieved participant could obtain a larger slice of the pie for the victims if they were not in a coalition than if they were, and the desire, or not, for a coalition among aggrieved groups depended on their relative power. This deviated from the widespread assumption at the time that said that the aggrieved groups should always band together. I hoped it would still be enough for a pass.

It was awarded a distinction. It turned out that my approach was fairly novel. Perhaps therein lies a retort argument for the top-up degrees against the ‘do both’ advice I mentioned before: a fresh look on the matter, if not interdisciplinarity or transdisciplinarity. I can see it also with the dissertation topics of our conversion Masters in IT students as well. They’re all interesting and topics that perhaps no disciplinarian would have produced.

***

The final step, then. With a distinction in the MA in Peace & Development in my pocket and a first in the BSc(honours) in CS&IT at around the same time, what next? The humanities topics were becoming too depressing even with a detached scientific mind—too many devastating problems and too little agency to influence—and I had worked toward the plan to go into bioinformatics for so many years already. Looking for jobs in bioinformatics, they all demanded a PhD. With the knowledge and experience amassed studying for the two full degrees, I could do all those tasks they wanted the bioinformatician to do. However, without meeting that requirement for a PhD, there was no chance I’d make it through the first selection round. That’s what I thought at the time. I tried 1-2 regardless—reject because no PhD. Maybe I should have tried and applied more widely nonetheless, since, in hindsight, it was the system’s way of saying they wanted someone well-versed in both fields, not someone trained to become an academic, since most of those jobs are software development jobs anyway.

Disappointed that I still couldn’t be the bioinformatician I thought I would be able to be after those two degrees, I sighed and resigned to the idea that, gracious sakes, I’ll get that PhD, too, then, and defer the dream a little longer.

In a roundabout way I ended up at the Free University of Bozen-Bolzano (FUB), Italy. They paid for the scholarship and there was generous project funding to pay for conference attendance. Meanwhile in the bioinformatics field, things had moved on from databases for molecular biology to bio-ontologies to facilitate data integration. The KRDB research centre at FUB was into ontologies, but then rather from the logic side of things. Fairly soon after my commencement with the PhD studies, my supervisor, who did not even have a PhD in Computer Science, told me in no uncertain terms that I was enrolled in a PhD in computer science, that my scientific contributions had to be in computer science, and if I wanted to do something in ‘bio-whatever’, that was fine, but that I’d have to do that in my own time. Crystal clear.

The ‘bio-whatever’ petered out, since I had to step up the computer science content because I had only three years to complete the PhD. On the bright side, passion will come the more you investigate something. Modelling, with some examples in bio, and ontologies and conceptual modelling it was. I completed my PhD in three years(-ish); fully indoctrinated in the computer science way. Journey completed.

***

I’ve not yet mentioned the design I indicated at the start of the blog post. It has nothing to do with moving into computer science. At all. Weaving in the interior design into the narrative didn’t work well, and it falls under the “vocational training courses whilst employed in various IT jobs” phrase earlier on. The costs of the associate diploma at the Portobello Institute in Dublin? I earned most of the costs (1200 pound or so? I can’t recall exactly, but it was somewhere between 1-2K) together in a week: we got double pay for working a shift on New Year (the year 2000 no less) and then I volunteered for the double pay for 12h shifts instead of regular 8h shifts for the week thereafter. One week extra work for an interesting hobby in the evening hours for a year was a good deal in my opinion, and it allowed me to explore whether I liked the topic as much as I thought I might in secondary school. I passed with a distinction and also got Rhodec certified. I still enjoy playing around with interiors, as hobby, and have given up the initial idea (in 1999) to use IT with it, since tangible samples work fine.

So, yes, I really have completed degrees in science, engineering, and political science straddling into humanities, and a little bit of the arts. A substantial chunk was paid for by the state (‘full scholarships’), companies chimed in as well, and I paid some of it from my hard-earned money. On the motivations for the journey: I hope I made that clear despite cutting out some text in an attempt to reduce the post’s length. (Getting into university in the first place and staying in academia after completing a PhD are two different stories altogether, and left for another time.)

I still have many questions, but I also realise that many will remain unanswered even if the answer is known to humanity already, since to live means it’s finite and there’s simply not enough time to learn everything. In any case: do study what you want, not what anyone tells you to study. If the choice is a study or, say, a down payment on a mortgage for a house, then if completing the study will give good prospects and relieve you from a job you are not aiming for, go for it: that house may be bought later and be a tad bit smaller. It’s your life you’re living, not someone else’s.

# NLG requirements for social robots in Sub-Saharan Africa

When the robots come rolling, or just trickling or seeping or slowly creeping, into daily life, I want them to be culturally aware, give contextually relevant responses, and to do that in a language that the user can understand and speak well. Currently, they don’t. Since I work and live in South Africa, what does all that mean for the Southern Africa context? Would social robot use case scenarios be different here than in the Global North where most of the robot research and development is happening, and if so, how? What is meant by contextually relevant responses? Which language(s) should the robot communicate in?

The question of which languages is the easiest to answer: those spoken in this region, which are mainly those in the Niger-Congo B [NCB] (aka ‘Bantu’) family of languages, and then also Portuguese, French, Afrikaans, and English. I’ve been working on theory and tools for NCB languages, and isiZulu in particular (and some isiXhosa and Runyankore), whose research was mainly as part of the two NRF-funded projects GeNI and MoReNL. However, if we don’t know how that human-robot interaction occurs in which setting, we won’t know whether the algorithms designed so far can also be used for that, which may well be beyond the ontology verbalisation, a patient’s medicine prescription generation, weather forecasts, or language learning exercises that we roughly got covered for the controlled language and natural language generation aspects of it.

So then what about those use case scenarios and contextually relevant responses? Let me first give an example of the latter. A few years ago in one of the social issues and professional practice lectures I was teaching, I brought in the Amazon Echo to illustrate precisely that as well as privacy issues with Alexa and digital assistants (‘robot secretaries’) in general. Upon asking “What is the EFF?”, the whole class (some 300 students present at the time) was expecting that Alexa would respond with something like “The EFF is the Economic Freedom Fighters, a political party in South Africa”. Instead, Alexa fetched the international/US-based answer and responded with “The EFF is the Electronic Frontier Foundation”, which the class had never heard of and which doesn’t really do anything in South Africa (it does come up later on in the module nonetheless, btw). There’s plenty of online content about the EFF as a political party, yet Alexa chose to ignore that and prioritise information from elsewhere. Go figure what happens with lots of other information that has limited online presence and doesn’t score high in the search engine results because there are fewer queries about it. How to get the right answer in those cases is not my problem (area of expertise); I take that as a solved black box and zoom in on the natural language aspects to automatically generate a sentence that has the answer taken from some structured data or knowledge.

The other aspect of this instance is that the interactions both during and after the lecture were not 1:1 interactions of students with their own version of Siri or Cortana and the like, but eager and curious students came in teams, so a 1:m interaction. While that particular class is relatively large and was already split into two sessions, larger classes are also not uncommon in several Sub-Saharan countries: for secondary school class sizes, the SADC average is 23.55 learners per class (the world average is 17), with the lowest in Botswana (13.8 learners) and the highest in Malawi with a whopping 72.3 learners in a class, on average. An educational robot could well be a useful way to get out of that catch-22, and, given resource constraints, end up as a deployment scenario with a robot per study group, and that in a multilingual setting that permits code switching (going back and forth between different languages). While human-robot interaction experts still will need to do some contextual inquiries and such to get to the bottom of the exact requirements and sentences, this variation in use is on top of the hitherto known possible ways for educational robots.

Going beyond this sort of informal chatter, I tried to structure that a bit and narrowed it down to a requirements analysis for the natural language generation aspects of it. After some contextualisation, I principally used two main use cases to elucidate natural language generation requirements and assessed that against key advances in research and technologies for NCB languages. Very, very, briefly, any system will need to i) combine data-to-text and knowledge-to-text, ii) generate many more different types of sentences, including sentences for both written and spoken languages in the NCB languages that are grammatically rich and often agglutinating, and iii) process non-trivial numbers, which is hard to do for NCB languages because the surface realization of a number depends on the noun class of the noun that is being counted. At present, no system out there can do all of that. A condensed version of the analysis was recently accepted as a paper entitled Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa [1], for the IST-Africa’21 conference, and it will be presented there next week at the virtual event, in the ‘next generation computing’ session no less, on Wednesday the 12th of May.
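To make requirement iii) a little more tangible, here is a deliberately tiny sketch for isiZulu, assuming a plain lookup table suffices. It covers only the number two and only three plural noun classes; the table `TWO_BY_NOUN_CLASS` and the function `count_two` are hypothetical names for illustration, and an actual realiser would need rules for all numbers and all noun classes rather than a hand-written table.

```python
# Surface form of 'two' (stem -bili) per noun class; illustrative subset only.
TWO_BY_NOUN_CLASS = {
    2: "ababili",    # e.g., abantu 'people'
    6: "amabili",    # e.g., amadoda 'men'
    10: "ezimbili",  # e.g., izinja 'dogs'
}


def count_two(plural_noun: str, noun_class: int) -> str:
    """Return 'two <noun>' with the number agreeing with the noun class."""
    try:
        return f"{plural_noun} {TWO_BY_NOUN_CLASS[noun_class]}"
    except KeyError:
        raise ValueError(f"noun class {noun_class} not covered in this sketch")


print(count_two("abantu", 2))   # abantu ababili
print(count_two("amadoda", 6))  # amadoda amabili
print(count_two("izinja", 10))  # izinja ezimbili
```

The same number thus surfaces as three different words, so a template such as "there are [number] [noun]" with a fixed number word does not work; the generator has to know the noun class first.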

Probably none of you has ever heard of this conference. IST-Africa is a yearly IT conference in Africa that aims to foster North-South and South-South networking, promote the academia->industry and academia->policy bridge-creation and knowledge transfer pipelines, and capacity building for paper writing and presentation. The topics covered are distinctly of regional relevance and, according to its call for papers, the “Technical, Policy, Social Implications Papers must present analysis of early/final Research or Implementation Project Results, or business, government, or societal sector Case Study”.

Why should I even bother with an event like that? It’s good to sometimes reflect on the context and ponder about relevance of one’s research. After all, part of the university’s income (and thus my salary) and a large part of the research project funding I have received so far comes ultimately from the taxpayers. South African taxpayers, to be more precise; not the taxpayers of the Global North. I can ‘advertise’, ahem, my research area and its progress to a regional audience. Also, I don’t expect that the average scientist in the Global North would care about HRI in Africa and even less so for NCB languages, but the analysis needed to be done and papers equate to brownie points. Also, if everyone thinks it better not to participate in something locally or regionally, it won’t ever become a vibrant network of research, applied research, and technology. I’ve attended the event once, in 2018 when we had a paper on error correction for isiZulu spellcheckers, and from my researcher viewpoint, it was useful for networking and ‘shopping’ for interesting problems that I may be able to solve, based on other participants’ case studies and inquiries.

Time will tell whether attending that event then and now this paper and online attendance will be time wasted or well spent. Unlike the papers on the isiZulu spellcheckers that reported research and concrete results that a tech company easily could take up (feel free to do so), this is a ‘fluffy’ paper, but exploring the use of robots in Africa was an interesting activity to do, I learned a few things along the way, it will save other interested people time in the analysis phase, and hopefully it also will generate some interest and discussion about what sort of robots we’d want and what they could or should be doing to assist, rather than replace, humans.

p.s.: if you still were to think that there are no robots in Africa and deem all this to be irrelevant: besides robots in the automotive and mining industries by, e.g., Robotic Innovations and Robotic Handling Systems, there are robots in education (also in Cape Town, by RD-9), robot butlers in hotels that serve quarantined people with mild COVID-19 in Johannesburg, they’re used for COVID-19 screening in Rwanda, and the Naledi personal banking app by Botlhale, to name but a few examples. Other tools are moving in that direction, such as, among others, Awezamed’s use of speech synthesis with (canned) text in isiZulu, isiXhosa and Afrikaans and there’s of course my research group where we look into knowledge-to-text text generation in African languages.

References

[1] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. In print.

# Automatically simplifying an ontology with NOMSA

Ever wanted only to get the gist of the ontology rather than wading manually through thousands of axioms, or to extract only a section of an ontology for reuse? Then the NOMSA tool may provide the solution to your problem.

There are quite a number of ways to create modules for a range of purposes [1]. We zoomed in on the notion of abstraction: how to remove all sorts of details and create a new ontology module of that. It’s a long-standing topic in computer science that returns every couple of years with another few tries. My first attempts date back to 2005 [2], which references modules & abstractions for conceptual models and logical theories to works published in the mid-1990s and, stretching the scope to granularity, to 1985, even. Those efforts, however, tend to halt at the theory stage or worked for one very specific scenario (e.g., clustering in ER diagrams). In this case, however, my former PhD student and now Senior Researcher at the CSIR, Zubeida Khan, went further and also devised the algorithms for five types of abstraction, implemented them for OWL ontologies, and evaluated them on various metrics.

The tool itself, NOMSA, was presented very briefly at the EKAW 2018 Posters & Demos session [3] and has supplementary material, such as the definitions and algorithms, a very short screencast and the source code. Five different ways of abstraction to generate ontology modules were implemented: i) removing participation constraints between classes (e.g., the ‘each X R at least one Y’ type of axioms), ii) removing vocabulary (e.g., remove all object properties to yield a bare taxonomy of classes), iii) keeping only a small number of levels in the hierarchy, iv) weightings based on how much some element is used (removing less-connected elements), and v) removing specific language profile features (e.g., qualified cardinality, object property characteristics).
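As a toy illustration of option ii), removing vocabulary to obtain a bare taxonomy, consider the sketch below. It is emphatically not NOMSA’s code or its algorithm: it assumes a made-up in-memory representation (the `Axiom` class) rather than OWL files, and the class and property names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Axiom:
    """A simplified axiom: 'sub is-a sup' or 'each sub prop at least one sup'."""
    sub: str
    sup: str
    prop: Optional[str] = None  # None means a plain subclass axiom


def bare_taxonomy(axioms: list) -> list:
    """Drop every axiom that uses an object property, keeping the hierarchy."""
    return [ax for ax in axioms if ax.prop is None]


# Invented mini-ontology for illustration.
ontology = [
    Axiom("Lion", "Carnivore"),
    Axiom("Carnivore", "Animal"),
    Axiom("Impala", "Herbivore"),
    Axiom("Herbivore", "Animal"),
    Axiom("Lion", "Impala", prop="eats"),
    Axiom("Herbivore", "Plant", prop="eats"),
]

module = bare_taxonomy(ontology)
for ax in module:
    print(f"{ax.sub} SubClassOf {ax.sup}")
```

The four subclass axioms survive and the two `eats` axioms are gone. On a real OWL ontology one would also have to deal with property declarations, nested class expressions, and annotations, which this sketch ignores.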

In the meantime, we have added a categorisation of different ways of abstracting conceptual models and ontologies, a larger use case illustrating those five types of abstractions that were chosen for specification and implementation, and an evaluation to see how well the abstraction algorithms work on a set of published ontologies. It was all written up and polished in 2018. Then it took a while in the publication pipeline mixed with pandemic delays, but eventually, in January 2021, it emerged as a book chapter entitled Structuring abstraction to achieve ontology modularisation [4] in the book “Advanced Concepts, methods, and Applications in Semantic Computing”, edited by Olawande Daramola and Thomas Moser.

Since I bought new video editing software for the ‘physically distanced learning’ that we’re in now at UCT, I decided to play a bit with the software’s features and record a more comprehensive screencast demo video. In the nearly 13 minutes, I illustrate NOMSA with four real ontologies, being the AWO tutorial ontology, BioTop top-domain ontology, BFO top-level ontology, and the Stuff core ontology. Here’s a screengrab from somewhere in the middle of the presentation, where I just automatically removed all 76 object properties from BioTop, with just one click of a button:

The embedded video (below) may still be readable if you have really good eyesight; otherwise, you can view it here in a separate tab.

The source code is available from Zubeida’s website (and I have a local copy as well). If you have any questions or suggestions, please feel free to contact either of us. Under the fair use clause, we also can share the book chapter that contains the details.

References

[1] Khan, Z.C., Keet, C.M. An empirically-based framework for ontology modularization. Applied Ontology, 2015, 10(3-4):171-195.

[2] Keet, C.M. Using abstractions to facilitate management of large ORM models and ontologies. International Workshop on Object-Role Modeling (ORM’05). Cyprus, 3-4 November 2005. In: OTM Workshops 2005. Halpin, T., Meersman, R. (eds.), LNCS 3762. Berlin: Springer-Verlag, 2005. pp603-612.

[3] Khan, Z.C., Keet, C.M. NOMSA: Automated modularisation for abstraction modules. Proceedings of the EKAW 2018 Posters and Demonstrations Session (EKAW’18). CEUR-WS vol. 2262, pp13-16. 12-16 Nov. 2018, Nancy, France.

[4] Khan, Z.C., Keet, C.M. Structuring abstraction to achieve ontology modularisation. Advanced Concepts, methods, and Applications in Semantic Computing. Daramola O, Moser T (Eds.). IGI Global. 2021, 296p. DOI: 10.4018/978-1-7998-6697-8.ch004

# The ontological commitments embedded in a representation language

Just like programming language preferences generate heated debates, this happens every now and then with languages to represent ontologies as well. Passionate dislikes for description logics or limitations of OWL are not unheard of, in favour of, say, Common Logic for more expressiveness and a different notation style, or of OBO because of its graph-based fundamentals, or the claim that abusing UML Class Diagram syntax won’t do as an approximation of an OWL file. But what is really going on here? Are they practically all just the same anyway and modellers merely stick with, and defend, what they know? If you could design your pet language, what would it look like?

The short answer is: they are not all the same and interchangeable. There are actually ontological commitments baked into the language, even though in most cases this is not explicitly stated as such. The ‘things’ one has in the language indicate what the fundamental building blocks are in the world (also called “epistemological primitives” [1]) and therewith assume some philosophical stance. For instance, a crisp vs vague world (say, plain OWL or a fuzzy variant thereof) or whether parthood is such a special relation that it deserves its own primitive next to class subsumption (like UML’s aggregation). Or maybe you want one type of class for things indicated with count nouns and another type of element for stuffs (substances generally denoted with mass nouns). This then raises the question as to what sort of commitments are embedded in, or can go into, a language specification and have an underlying philosophical point of view. This, in turn, raises the question about which philosophical stances actually can have a knock-on effect on the specification or selection of an ontology language.

My collaborator, Pablo Fillottrani, and I tried to answer these questions in the paper entitled An Analysis of Commitments in Ontology Language Design [2] that was published late last year as part of the proceedings of the 11th Conference on Formal Ontology in Information Systems 2020 that was supposed to have been held in September 2020 in Bolzano, Italy. In the paper, we identified and analysed ontological commitments that are, or could have been, embedded in logics, and we showed which ones have been taken in well-known languages for representing ontologies and similar artefacts, such as OBO, SKOS, OWL 2 DL, DLRifd, and FOL. We organised them in four main categories: what the very fundamental furniture is (e.g., including roles or not, time), acknowledging refinements thereof (e.g., types of relations, types of classes), the logic’s interaction with natural language, and crisp vs various vagueness options. They are discussed over about 1/3 of the paper.

Obviously, engineering considerations can interfere in the design of the logic as well. They concern issues such as what the syntax should look like and whether scalability is an issue, but this is not the focus of the paper.

We did spend some time contextualising the language specification in an overall systematic engineering process of language design, which is summarised in the figure below (the paper focuses on the highlighted step).

While such a process can be used for the design of a new logic, it also can be used for post hoc reconstructions of past design processes of extant logics and conceptual data modelling languages, and for choosing which one you want to use. At present, the documentation of the vast majority of published languages does not describe much of the ‘softer’ design rationales, though.

We played with the design process to illustrate how it can work out, availing also of our requirements catalogue for ontology languages and we analysed several popular ontology languages on their commitments, which can be summed up as in the table shown below, also taken from the paper:

In a roundabout way, it also suggests some explanations as to why some of those transformation algorithms aren’t always working well; e.g., any UML-to-OWL or OBO-to-OWL transformation algorithm is trying to shoe-horn one ontological commitment into another, and that can only be approximated, at best. Things have to be dropped (e.g., roles, due to standard view vs positionalism) or cannot be enforced (e.g., labels, due to natural language layer vs embedding of it in the logic), and that’ll cause some hiccups here and there. Now you know why, and why it won’t ever work well.

Hopefully, all this will feed into a way to help choosing a suitable language for the ontology one may want to develop, or assist with understanding better the language that you may be using, or perhaps gain new ideas for designing a new ontology language.

References

[1] Brachman R, Schmolze J. An overview of the KL-ONE Knowledge Representation System. Cognitive Science. 1985, 9:171–216.

[2] Fillottrani, P.R., Keet, C.M. An Analysis of Commitments in Ontology Language Design. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 46-60.

# On computer program being a whole

Who cares whether some computer program is a whole, how, and why? Turns out, more people than you may think—and so should you, since it can be costly depending on the answer. Consider the following two scenarios: 1) you download a ‘pirated’ version of MS Office or Adobe Photoshop (the most popular ones still) and 2) you take the source code of a popular open source program, such as Notepad++, add a little code for some additional function, and put it up for sale only as an executable app called ‘Notepad++ extreme (NEXT)’ so as to try to earn money quickly. Are these actions legal?

In both cases, you’d break the law, but how many infringements took place, of the kind that you potentially could be fined for or face jail time? For the piracy case, is that once for the MS Office suite, or for each program in the suite, or for each file created upon installing MS Office, or for each source code file that went into making the suite during software development? For the open source case, was that violating its GNU GPL open source licence once for the zipped and downloaded or cloned source code, or for each file in the source code, of which there are hundreds? It is possible to construct similar questions for trade secret violations and patent infringements for programs, as well as other software artefacts, like illegal downloads of TV series episodes (going strong during COVID-19 lockdowns indeed). Just in case you think this sort of issue is merely hypothetical: recently, Arista paid Cisco $400 million for copyright damages and just before that, Zenimax got $500 million from Oculus (yes, the VR software) for trade secret violations, and Google vs Oracle is ongoing with “billions of dollars at stake”.

Let’s consider some principles first. To be able to answer the number of infringements, we first need to know whether a computer program is a whole or not and why, and if so, what’s ‘in’ (i.e., a part of it) and what’s ‘out’ (i.e., definitely not part of it). Spoiler alert: a computer program is a functional whole.

To get to that conclusion, I had to combine insights from theories of parthood (mereology), granularity, modularity, unity, and function and add a little more into the mix. The argumentation is available in two versions: a longer technical report [1], which I hope is readable by a wider audience, and a condensed version for a specialist audience [2] that was published in the Proceedings of the 11th Conference on Formal Ontology in Information Systems (FOIS’20) two weeks ago. Very briefly and informally, the state of affairs can be illustrated with the following picture:

This schematic representation shows, first, two levels of granularity: level 1 and level 2. At level 1, there’s some whole, like the a1 and a2 in the figure that could be referring to, say, a computer program, a module repository, an electorate, or a human body. At a more fine-grained level 2, there are different entities, which are in some way linked to the respective whole. This ‘link’ to the whole is indicated with the vertical dashed lines, and one can say that they are part of the whole. For the blue dots on the right residing at level 2, i.e., the parts of a1, there’s also a unifying relation among the parts, indicated with the solid lines with arrows, which makes a1 an integral whole. Moreover, for that sort of whole, it holds that if some object x (residing at level 2) is part of a1 then if there’s a y that is also part of a1, it participates in that unifying relation with x and vice versa (i.e., if y is in that unifying relation with x, then it must also be part of a1). For the computer program’s source code, that unifying relation can be the source tree graph.
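The condition on parts and the unifying relation can be read as a small check, sketched here informally. This is my loose paraphrase of the prose above, not the formalisation in the report or paper: the file names are made up, and “being in the same source tree” (the hypothetical `same_tree` function) stands in for the source tree graph as unifying relation.

```python
def is_integral_whole(parts, universe, related):
    """For every part x: any other y is a part exactly when it is
    in the unifying relation with x (and vice versa)."""
    return all(
        (y in parts) == related(x, y)
        for x in parts
        for y in universe
        if x != y
    )

# Hypothetical level-2 entities, each assigned to a source tree.
tree_of = {
    "main.c": "app",
    "util.c": "app",
    "util.h": "app",
    "README.txt": "docs",
}

def same_tree(x, y):
    """Stand-in unifying relation: both files sit in the same source tree."""
    return tree_of[x] == tree_of[y]

universe = set(tree_of)

# All and only the files of the 'app' tree: the condition holds.
print(is_integral_whole({"main.c", "util.c", "util.h"}, universe, same_tree))
# Leaving out util.h, or adding the unrelated README, breaks it.
print(is_integral_whole({"main.c", "util.c"}, universe, same_tree))
print(is_integral_whole({"main.c", "README.txt"}, universe, same_tree))
```

The notion of function and the optional vs mandatory vs essential parts are deliberately left out of this sketch.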

There is some nitty gritty detail also involving the notion of function—a source code file contributes to doing something—and optional vs mandatory vs essential part that you can read about in the report or in the paper [1,2], covering the formalisation, more argumentation, and examples.

How would it pan out for the infringements? The Notepad++ exploitation scenario would simply be a case of one infringement in total for all the files needed to create the executable, not one for each source code file. This conclusion from the theory turns out to be remarkably in line with the GNU GPL’s explanation of their licence, while providing a theoretical foundation for their intuition that there’s a difference between a mere aggregate where different things are bundled, loose coupling (e.g., sockets and pipes), and a single program (e.g., using function calls, being included in the same executable). The order of things perhaps should have been from there into the theory, but practically, I did the analysis and stumbled into a situation where I had to look up the GPL and its explanatory FAQ. On the bright side, in the other direction now then: just in case someone wants to take on copyleft principles of open source software, here are some theoretical foundations to support that there’s probably much less money to be gained than you might think.

For the MS Office suite case mentioned at the start, I’d need a look under the hood to determine how it ties together and one may have to argue about the sameness of, or difference between, a suite and a program. The easier case for a self-standing app, like the 3rd-place most pirated Windows app Internet Download Manager, is that it is one whole and so one infringement then.

It’s a pity that FOIS 2020 has been postponed to 2021, but at least I got to talk about some of this as expert witness for a litigation case and I managed to weave an exercise about the source tree with open source licences into the social issues and professional practice module I taught to some 750 students this past winter.

References

[1] Keet, C.M. Why a computer program is a functional whole. Technical report 2008.07273, arXiv. 21 July 2020. 25 pages.

[2] Keet, C.M. The computer program as a functional whole. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 216-230.

# An architecture for Knowledge-driven Information and Data access: KnowID

Advanced so-called ‘intelligent’ information systems may use an ontology or runtime-suitable conceptual data modelling techniques in the back end combined with efficient data management. Such a set-up aims to provide a way to better support informed decision-making and data integration, among others. A major challenge in creating such systems is to figure out which components to design and put together to realise a ‘knowledge to data’ pipeline, since each component and process has trade-offs; see e.g., the very recent overview of sub-topics and challenges [1]. A (very) high level categorization of the four principal approaches is shown in the following figure: put the knowledge and data together in the logical theory the AI way (left) or the database way (right), or bridge it by means of mappings or by means of transformations (centre two):

Among those variants, one can dig into considerations like which logic to design or choose in the AI-based “knowledge with (little) data” (e.g., which OWL species? Common Logic? Something else?), which type of database (relational, object-relational, or rather an RDF store), which query language to use or design, which reasoning services to support, how expressive it all has to be, and optimized for what purpose. None is best in all deployment scenarios. The AI-only one with, say, OWL 2 DL, is not scalable; the database-only one either lacks interesting reasoning services or supports few types of constraints.

Among the two in the middle, the “knowledge mapping data” is best known under the term ‘ontology-based data access’ (OBDA) and the Ontop system in particular [2] with its recent extension into ‘virtual knowledge graphs’ and the various use cases [3]. The distinguishing characteristic of its architecture is the mapping layer to bridge the knowledge to the data. In the “data transformation knowledge” approach, the idea is to link the knowledge to the data through a series of transformations. No such system is available yet. Considering the requirements for that, it turned out that a good few components are already available and that just one crucial set of transformations was needed to convincingly put them together.

We did just that and devised a new knowledge-to-data architecture. We dub this the KnowID architecture (pronounced as ‘know it’), abbreviated from Knowledge-driven Information and Data access. KnowID adds novel transformation rules between suitably formalised EER diagrams as application ontology and Borgida, Toman & Weddell’s Abstract Relational Model with SQLP ([4,5]) to complete the pipeline (together with some recently proposed other components). Overall, it then looks like this:

Its details are described in the article entitled “KnowID: an architecture for efficient Knowledge-driven Information and Data access” [6], which was recently published in the Data Intelligence journal. In a nutshell: the logic-based EER diagram (with deductions materialised) is transformed into an abstract relational model (ARM) that is transformed into a traditional relational model and then onward to a database schema, where the original ‘background knowledge’ of the ARM is used for data completion (i.e., materializing the deductions w.r.t. the data), and then the query posed in SQLP (SQL + path queries) is answered over that ‘extended’ database.
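The data completion step can be illustrated with a very small sketch using Python’s built-in `sqlite3` module. This is not the KnowID implementation: the tables, the names in them, and the single piece of background knowledge (“every student is a person”) are hypothetical, and plain SQL stands in for SQLP, so no path queries are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO person  VALUES (1, 'Ada');
    INSERT INTO student VALUES (2, 'Sipho');
""")

# Data completion: materialise the deduction 'every student is a person'
# (assumed background knowledge) by adding the missing rows to person.
conn.execute("""
    INSERT INTO person (id, name)
    SELECT id, name FROM student
    WHERE id NOT IN (SELECT id FROM person)
""")

# The query is then answered over the 'extended' database, with the usual
# closed world reading: whoever is not in the table is not a person.
people = [row[0] for row in conn.execute("SELECT name FROM person ORDER BY id")]
print(people)
```

The point of the design is visible even at this scale: the reasoning cost is paid once, at materialisation time, and querying afterwards is ordinary database work.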

Besides the description of the architecture and the new transformation rules, the open access journal article also describes several examples and it features a more detailed comparison of the four approaches shown in figure 1 above. For KnowID, compared to other ontology-based data access approaches, its key distinctive architectural features are that runtime use can avail of full SQL augmented with path queries, the closed world assumption commonly used in information systems, and it avoids a computationally costly mapping layer.

We are working on the implementation of the architecture. The transformation rules and corresponding algorithms were implemented last year [7] and two computer science honours students are currently finalising their 4th-year project, therewith contributing to the materialization and query formulation steps of the architecture. The latest results are available from the KnowID webpage. If you were to worry that it will suffer from link rot: the version associated with the Data Intelligence paper has been archived as supplementary material of the paper at [8]. The plan is, however, to steadily continue with putting the pieces together to make a functional software system.

References

[1] Schneider, T., Šimkus, M. Ontologies and Data Management: A Brief Survey. Künstl Intell 34, 329–353 (2020).

[2] Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web Journal, 2017, 8(3), 471-487.

[3] G. Xiao, L. Ding, B. Cogrel, & D. Calvanese. Virtual knowledge graphs: An overview of systems and use cases. Data Intelligence, 2019, 1, 201-223.

[4] A. Borgida, D. Toman & G.E. Weddell. On referring expressions in information systems derived from conceptual modeling. In: Proceedings of ER’16, 2016, pp. 183–197

[5] W. Ma, C.M. Keet, W. Oldford, D. Toman & G. Weddell. The utility of the abstract relational model and attribute paths in SQL. In: C. Faron Zucker, C. Ghidini, A. Napoli & Y. Toussaint (eds.) Proceedings of the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’18)), 2018, pp. 195–211.

[6] P.R. Fillottrani & C.M. Keet. KnowID: An architecture for efficient knowledge-driven information and data access. Data Intelligence, 2020 2(4), 487–512.

[7] Fillottrani, P.R., Jamieson, S., Keet, C.M. Connecting knowledge to data through transformations in KnowID: system description. Künstliche Intelligenz, 2020, 34, 373-379.

[8] Pablo Rubén Fillottrani, C. Maria Keet. KnowID. V1. Science Data Bank. http://www.dx.doi.org/10.11922/sciencedb.j00104.00015. (2020-09-30)

# Toward a framework for resolving conflicts in ontologies (with COVID-19 examples)

Among the many tasks involved in developing an ontology are deciding what part of the subject domain to include, and how. This may involve selecting a foundational ontology, reuse of related domain ontologies, and more detailed decisions for ontology authoring for specific axioms and design patterns. A recent example of reuse is that of the Infectious Diseases Ontology for schistosomiasis knowledge [1], but even before reuse, one may have to assess differences among ontologies, as Haendel et al did for disease ontologies [2]. Put differently, even before throwing alignment tools at them or selecting one with an import statement and hoping for the best, issues may arise. For instance, two relevant domain ontologies may have been aligned to different foundational ontologies, a partOf relation could be set to be transitive in one ontology but also be used in a qualified cardinality constraint in the other (so then one cannot use an OWL 2 DL reasoner anymore when the ontologies are combined), something like Infection may be represented as a class in one ontology but as a property infectedby in another, or the ontologies differ on the science, like whether Virus is an organism or an inanimate object.

What to do then?

Upfront, it helps to be cognizant of the different types of conflict that may arise, and understand what their causes are. Then one would want to be able to find those automatically. And, most importantly, get some assistance in how to resolve them; if possible, also even preventing conflicts from happening in the first place. This is what Rolf Grütter, from the Swiss Federal Research Institute WSL, and I have been working on since he visited UCT last year. The first results have been accepted for the International Conference on Biomedical Ontologies (ICBO) 2020, and are described in a paper entitled “Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring” [3]. A sample scenario of the process is illustrated informally in the following figure.

Summary of a sample scenario of detecting and resolving conflicts, illustrated with an ontology reuse scenario where Onto2 will be imported into Onto1. (source: [3])

The paper first defines and illustrates the notions of meaning negotiation and conflict resolution and summarises their main causes, to then go into some detail of the various categories of conflicts and ways to resolve them. The detection and resolution are assisted by the notion of a conflict set, which is a data structure that stores the details for further processing.
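As a rough illustration of what storing such details could look like, here is a sketch that detects the transitive-property-in-a-cardinality-constraint clash mentioned earlier and records it. The `Conflict` and `ConflictSet` classes, their fields, and the dictionary encoding of the two ontologies are my assumptions for this post; the paper’s definition of a conflict set may well differ.

```python
from dataclasses import dataclass, field

@dataclass
class Conflict:
    """One detected conflict (hypothetical fields)."""
    kind: str        # category of conflict
    entity: str      # the property or class involved
    sources: tuple   # names of the ontologies that clash
    note: str = ""

@dataclass
class ConflictSet:
    """Collects conflicts for later processing."""
    conflicts: list = field(default_factory=list)

    def add(self, conflict):
        self.conflicts.append(conflict)

def find_transitivity_conflicts(onto1, onto2):
    """Flag properties that are transitive in one ontology and used in a
    cardinality constraint in the other (not allowed together in OWL 2 DL)."""
    found = ConflictSet()
    for a, b in ((onto1, onto2), (onto2, onto1)):
        for prop in sorted(a["transitive"] & b["in_cardinality"]):
            found.add(Conflict(
                kind="language profile",
                entity=prop,
                sources=(a["name"], b["name"]),
                note="transitive property used in a cardinality constraint",
            ))
    return found

# Toy encodings of the two ontologies in the reuse scenario.
onto1 = {"name": "Onto1", "transitive": {"partOf"}, "in_cardinality": set()}
onto2 = {"name": "Onto2", "transitive": set(), "in_cardinality": {"partOf"}}

result = find_transitivity_conflicts(onto1, onto2)
for c in result.conflicts:
    print(c.kind, c.entity, c.sources, c.note)
```

Keeping detection and storage separate like this means further detectors (class vs property, differing foundational ontologies, and so on) can feed the same structure.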

It was tested with a use case of an epizootic disease outbreak in the Lemanic Arc in Switzerland in 2006, due to H5N1 (avian influenza): an administrative ontology had to be merged with one about the epidemiology for infected birds and surveillance zones. With that use case in place already well before the spread of SARS-CoV-2 that caused the current pandemic, it was a small step to add a few examples to the paper about COVID-19. This was made possible thanks to recently developed relevant ontologies that were made available, including for COVID-19 specifically. Let’s highlight the examples here, also so that I can write a bit more about it than the terse text in the paper, since there are no page limits for a blog post.

Example 1: OWL profile violations

Medical terminologies tend to veer toward being represented in an ontology language that is at most as expressive as OWL 2 EL: this permits scalability, compatibility with typical OBO Foundry ontologies, as well as fitting with the popular SNOMED CT. As one may expect, there have been efforts in ontology development with content relevant for the current pandemic; e.g., the Coronavirus Infectious Disease Ontology (CIDO) [4]. The CIDO is not in OWL 2 EL, however: it has a class expression with a universal quantifier (ObjectAllValuesFrom) on the right-hand side; specifically (in DL notation): ‘Yale New Haven Hospital SARS-CoV-2 assay’ $\sqsubseteq \forall$ ‘EUA-authorized use at’.’FDA EUA-authorized organization’ or, in the Protégé interface:

(codes: CIDO_0000020, CIDO_0000024, and CIDO_0000031, respectively). It also imported many ontologies and either used them to cause some profile violations or the violations came with them, such as by having used the union operator (‘or’) in the following axiom for therapeutic vaccine function (VO_0000562):

How did I find that? Most certainly NOT by manually browsing through the more than 70000 axioms of the CIDO (including imports) to find the needle in the haystack. Instead, I burned the proverbial haystack to easily get the needles. In this case, the burning was done with the OWL Classifier, which automatically computes which axioms violate any of the OWL species, and lists them accordingly. Here are two examples, illustrating an OWL 2 EL violation (that aforementioned universal quantification) and an OWL 2 QL violation (a property chain with entities from BFO and RO); you can do likewise for OWL 2 RL violations.

Following the scenario with the assumption that the CIDO would have to stay in the OWL 2 EL profile, then it is easy to find the conflicting axioms and act accordingly, i.e., remove them. (It also indicates something did not go well with importing the NDF-RT.owl into the cido-base.owl, but that as an aside for this example.)
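For those curious what ‘burning the haystack’ amounts to in principle, here is a toy sketch of a profile check. It is not how the OWL Classifier is implemented: it merely scans axioms written as functional-style syntax strings for a few constructors that fall outside OWL 2 EL. The `NOT_IN_EL` list is deliberately partial, and the axioms and entity names are made up.

```python
import re

# A partial list of class expression constructors that are not in OWL 2 EL.
NOT_IN_EL = {
    "ObjectAllValuesFrom",
    "ObjectUnionOf",
    "ObjectComplementOf",
    "ObjectMinCardinality",
    "ObjectMaxCardinality",
    "ObjectExactCardinality",
}

def el_violations(axioms):
    """Return (axiom, offending constructors) for each violating axiom."""
    violations = []
    for axiom in axioms:
        # Constructor names are the words directly followed by '('.
        used = set(re.findall(r"[A-Za-z]+(?=\()", axiom))
        offending = sorted(used & NOT_IN_EL)
        if offending:
            violations.append((axiom, offending))
    return violations

# Hypothetical axioms in functional-style syntax.
toy_axioms = [
    "SubClassOf(:Assay ObjectAllValuesFrom(:authorizedUseAt :Organization))",
    "SubClassOf(:Vaccine ObjectSomeValuesFrom(:hasFunction :Function))",
    "EquivalentClasses(:TherapeuticFunction ObjectUnionOf(:A :B))",
]

for axiom, offending in el_violations(toy_axioms):
    print(offending, "in", axiom)
```

A real checker has to work on the parsed ontology and cover the whole profile specification (property axioms, datatypes, enumerations, and so on), which is exactly why one wants a tool for it.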

Example 2: Modelling issues: same idea, different elements

Let’s take the CIDO again and now also the COviD Ontology for cases and patient information (CODO), which have some overlapping and complementary information, so perhaps could be merged. A not unimportant thing is the test for SARS-CoV-2 and its outcome. CODO has a ‘laboratory test finding’ $\equiv$ {positive, pending, negative}, i.e., the possible outcomes of the test are individuals made into a class using the ObjectOneOf constructor. Consulting CIDO for the test outcomes, it has a class ‘COVID-19 diagnosis’ with three subclasses: Negative, Positive, and Presumptive positive. Aside from the inexact matches of the test status that won’t simplify data integration efforts, this is an example of class vs. instance modeling of what is ontologically the same thing. Resolving this in any merging attempt means that either

1. the CODO has to change and bump up the test results from individuals to classes, or
2. the CIDO has to change the subclasses to individuals in the ABox, or
3. take an ‘outside option’ and represent it in yet a different way where both the CODO and the CIDO have to modify the ontology (e.g., take a conceptual data modeling approach by making the test outcome an attribute with a few possible values).
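To make the inexact matches and option 1 a bit more tangible, here is a small sketch. The outcome labels are the ones mentioned above; the `normalise` and `individuals_to_subclasses` helpers and the generated axiom strings are hypothetical and only show the shape of such a change, not what either ontology’s developers would actually do.

```python
codo_outcomes = {"positive", "pending", "negative"}               # individuals in CODO
cido_outcomes = {"Negative", "Positive", "Presumptive positive"}  # classes in CIDO

def normalise(label):
    """Compare labels case-insensitively and without stray whitespace."""
    return label.strip().lower()

codo = {normalise(x) for x in codo_outcomes}
cido = {normalise(x) for x in cido_outcomes}

shared = sorted(codo & cido)      # outcomes both ontologies agree on
only_codo = sorted(codo - cido)   # needs a decision when merging
only_cido = sorted(cido - codo)   # needs a decision when merging

def individuals_to_subclasses(parent, individuals):
    """Option 1: bump individuals up to subclasses of the parent class."""
    return [f"SubClassOf(:{ind.capitalize()} :{parent})" for ind in sorted(individuals)]

print(shared, only_codo, only_cido)
for axiom in individuals_to_subclasses("LaboratoryTestFinding", codo_outcomes):
    print(axiom)
```

Whether ‘pending’ and ‘presumptive positive’ should be aligned, kept apart, or both retained is a modelling decision that no script can make; the sketch only surfaces that a decision is needed.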

The paper provides an attempt to systematize such types of conflict toward a library of common types of conflict, so that it should become easier to find them, and offers steps toward a proper framework to manage all that, which assisted with devising generic approaches to resolution of conflicts. We already have done more to realize all that (which could not all be squeezed into the 12 pages), but more is still to be done, so stay tuned.

Since COVID-19 is still doing the rounds and the international borders of South Africa are still closed (with a lockdown for some 5 months already), I can’t end the blog post with the usual ‘I hope to see you at ICBO 2020 in Bolzano in September’—well, not in the common sense understanding at least. Hopefully next year then.

References

[1] Cisse PA, Camara G, Dembele JM, Lo M. An Ontological Model for the Annotation of Infectious Disease Simulation Models. In: Bassioni G, Kebe CMF, Gueye A, Ndiaye A, editors. Innovations and Interdisciplinary Solutions for Underserved Areas. Springer LNICST, vol. 296, 82–91. 2019.

[2] Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annual Review of Biomedical Data Science, 2018, 1:305–331.

[3] Grütter R, Keet CM. Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring. 11th International Conference on Biomedical Ontologies (ICBO’20), 16-19 Sept 2020, Bolzano, Italy. CEUR-WS (in print).

[4] He Y, Yu H, Ong E, Wang Y, Liu Y, Huffman A, Huang H, Beverley J, Hur J, Yang X, Chen L, Omenn GS, Athey B, Smith B. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scientific Data, 2020, 7:181.