# My road travelled from microbiology to computer science

From bites to bytes or, more precisely, from foods to formalisations, and that sprinkled with a handful of humanities and a dash of design. It does add up. The road I travelled into computer science has nothing to do with any ‘gender blabla’, nor with an idealistic drive to solve the world food problem by other means, nor with having become fed up with the broad theme of agriculture. But then what was it? I’m regularly asked about that road into computer science, for various reasons. There are those who are curious or nosy, some deem it improbable and think that I must be making it up, and yet others chiefly speculate about where I obtained the money to pay for it all. So here it goes, in a fairly large write-up since I did not take a straight path, let alone a shortcut.

If you’ve seen my CV, you know I studied “Food Science, free specialisation” at Wageningen University in the Netherlands. It is the university to go to for all things to do with agriculture in the broad sense. Somehow I made it into computer science, but it was not there. The motivation does come from there, thanks to it being at the forefront of science and, as such, having an ambiance that facilitates exposure to a wide range of topics and techniques within the education system and among fellow students. (Also, it really was the best quality education I ever had, which deserves to be said, and I’ve been around enough to have ample comparison material.)

And yet.

Perhaps it is conceivable to speculate that all the hurdles with mathematics and PC use when I was young were the motivation to turn to computing. Definitely not. Instead, it happened when I was working on my last, and major, Master’s thesis in the Molecular Ecology section of the Laboratory of Microbiology at Wageningen University, having drifted away a little from microbes in food science.

My thesis topic was about trying to clean up chemically contaminated soil by using bacteria that would eat the harmful compounds, rather than cleaning up the site by disrupting the ecosystem with excavations and chemical treatments of the soil. In this case, it was about 3-chlorobenzoate, an intermediate degradation product of, mainly, paint that had been spilled since the 1920s; the molecule substantially reduces growth and yield of maize, which is undesirable. I set out to examine a bunch of configurations of different amounts of 3-chlorobenzoate in the soil together with the Pseudomonas B13 bacteria and distance to the roots of the maize plants, and their effects on the growth of the maize plants. The bacteria were expected to clean up more of the 3-chlorobenzoate in the area near the roots (the rhizosphere), and there were some questions about what the bacteria would do once the 3-chlorobenzoate ran out (mainly: will they die or feed on other molecules?).

The bird’s-eye view still sounds interesting to me, but there was a lot of boring work to do to find the answer. There were days that the only excitement was to open the stove to see whether my beasts had grown on the agar plate in the petri dish; if they had (yay!), I was punished with counting the colonies. Staring at dots on the agar plate in the petri dish and counting them. Then there were the analysis methods to be used, of which two turned out to be crucial for changing track, mixed with a minor logistical issue to top it off.

First, there was the PCR technique to sequence genetic material, which by now, during COVID-19 times, may be a familiar term. There are machines that do the procedure automatically. In 1997, it was still a cumbersome procedure, which took about a day of near non-stop work to sequence the short ribosomal RNA (16S rRNA) strand that was extracted from the collected bacteria. That was how we could figure out whether any of those white dots in the petri dish were, say, the Pseudomonas B13 I had inoculated the soil with, or some other soil bacteria. You extract the genetic material, multiply it, sequence it and then compare it. It was the last step that was the coolest.

The 16S rRNA of a bacterium is around 1500 base pairs long on average, which is represented as a sequence of some 1500 capital letters consisting of A’s, C’s, G’s, and U’s. For comparison: the SARS-CoV-2 genome is about 30000 base pairs. You really don’t want to compare either one by hand against even one other similar sequence of letters, let alone manually check your newly PCR-ed sequence against many others to figure out which bacteria you likely had isolated or which one is phylogenetically most closely related. Instead, we sent the sequence, as a string of flat text with those ACGU letters, to a database called the RNABase and we received an answer with a list of more or less likely matches within a few hours to a day, depending on the time of submitting it to the database.

It was like magic. But how did it really do that? What is a database? How does it calculate the alignments? And since it can do this cool stuff that’s not doable by humans, what else can you do with such techniques to advance our knowledge about the world? How much faster can science advance with these things? I wanted to know. I needed to know.
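For a rough feel of what such a matching step involves, here is a minimal sketch in Python. It is not how that database worked (I don’t know its internals); it simply ranks a few made-up reference strings by edit distance to a made-up query. The names `edit_distance`, `rank_matches`, and the toy sequences are all assumptions for illustration, and real tools use far more refined alignment scoring on sequences of around 1500 letters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-letter edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete a letter from a
                current[j - 1] + 1,      # insert a letter into a
                previous[j - 1] + cost,  # substitute, or keep a matching letter
            ))
        previous = current
    return previous[-1]


def rank_matches(query: str, references: dict) -> list:
    """Return (name, distance) pairs, closest reference first."""
    scored = [(name, edit_distance(query, seq)) for name, seq in references.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical toy data; real 16S rRNA sequences are ~1500 letters long.
references = {
    "reference_A": "ACGUACGUAC",
    "reference_B": "ACGUUCGUAC",
    "reference_C": "GGGUUUCCCA",
}
query = "ACGUACGUAA"

ranking = rank_matches(query, references)
for name, distance in ranking:
    print(name, distance)
```

The smaller the distance, the more likely the match; a list of “more or less likely matches” is then just this ranking over a very large collection of reference sequences.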

The other technique I had to work with was not new to me, but I had to scale it up: the High-Performance Liquid Chromatography (HPLC). You give the machine a solution and it separates out the component molecules, so you can figure out what’s in the solution and how much of it is in there. Different types of molecules stick to the wall of the tube inside the machine at different places. The machine then spits out the result as a graph, where different peaks scattered across the x axis indicate different substances in the solution and the size of the peak indicates the concentration of that molecule in the sample.

I had taken multiple soil samples closer to and farther away from the rhizosphere of different boxes with maize plants with different treatments of the soil, rinsed them and tested the solution in the HPLC. The task then was to compare the resulting graphs to see if there was a difference in treatment. Having printed them all out, they covered a large table of about 1.5 by 2 metres, and I had to look closely at them and try to do some manual pattern matching on the shape and size of the graphs and sub-graphs. There was no program that could compare graphs automatically. I tried to overlay printouts and hold them in front of the ceiling light. With every printed graph about the size of 20x20cm, you can calculate how many I had and how many 1-by-1 comparisons that amounts to (this is left as an exercise to the reader). It felt primitive, especially considering all the fancy toys in the lab and on the PC. Couldn’t those software developers also develop a tool to compare graphs?! Now that would have been useful. But no. If only I could develop such a useful tool myself; then I would not have to wait on the software developers until they care to develop it.
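For readers who would rather not do that exercise by hand, here is a back-of-the-envelope sketch. It assumes the printouts lay edge to edge without overlap, that only whole printouts count, and that the “about” sizes are exact, so treat the outcome as an estimate rather than a record of what was on the table. The variable names are made up for the illustration.

```python
from math import comb

# Approximate sizes from the text, in centimetres (assumed exact here).
table_cm = (150, 200)
printout_cm = (20, 20)

# Whole printouts that fit along each side of the table.
fit_short_side = table_cm[0] // printout_cm[0]
fit_long_side = table_cm[1] // printout_cm[1]
graphs = fit_short_side * fit_long_side

# Every graph compared once with every other graph: n choose 2.
pairwise_comparisons = comb(graphs, 2)

print(graphs, pairwise_comparisons)
```

Under those assumptions that is 7 by 10 printouts, so 70 graphs and 70 × 69 / 2 = 2415 pairwise comparisons.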

On top of that manual analysis, it seemed unfair that I had to copy the data from the HPLC machine in the basement of the building onto a 3.5 inch floppy disk and walk upstairs to the third floor to the shared MSc thesis students’ desktop PCs to be able to process it, whereas the PCR data was accessible from my desktop PC even though the PCR machine was on the ground floor. The PC could access the internet and present data from all over the world, even, so surely it should be able to connect to the HPLC downstairs?! Enter questions about computer networks.

The first step in trying to get some answers was to inquire with the academics in the department. “Maybe there’s something like ‘theoretical microbiology’, or whatever it’s called that focuses on data analysis and modelling of microbiology? It is the fun part of the research—and avoids lab work?”, I asked my supervisor and more generally in the lab. “Not really,” was the answer, continuing “ok, sure, there is some, but theory-only without the evidence from experiments isn’t it.” Despite all the advanced equipment, of which computing is an indispensable component, they still deemed that wetlab research trumped solely theory and computing. “Those technologies are there to assist answering faster the new and more advanced questions, but not replace the processes”, I was told.

Sigh. Pity. So be it, I supposed. But I still wanted answers to those computing questions. I also wanted to do a PhD in microbiology and then probably move to some other discipline, since I sensed that possibly after another 4-6 years I might become bored with microbiology. Then there was the logistical issue that I still could not walk well, which made wetlab work difficult; hence, it would make obtaining a PhD scholarship harder. Lab work was a hard requirement for a PhD in microbiology and it wasn’t exactly the most exciting part of studying bacteria. So, I might as well swap to something else straight away then. Since there were those questions in computing that I wanted answers to, there we have the inevitable conclusion to move to greener, or at least as green, pastures.

***

How to obtain those answers in computing? Signing up for a sort of ‘top up’ degree for the computing aspects would be nice, so as to do that brand new thing called bioinformatics. There were no such top-up degrees in the Netherlands at the time and the only one that came close was a full degree in medical informatics, which is not what I wanted. I didn’t want to know about all the horrible diseases people can get.

The only way to combine it was to enrol in the 1st year of a degree in computing. The snag was the money. I was finishing up my 5 years of state funding for the master’s degree (old system, so it included the BSc) and the state paid for only one such degree. The only way to be able to do it was to start working, save money, and pay for it myself at some point in the near future once I’d have enough money. Going into IT in industry out in the big wide world sounded somewhat interesting as a second-choice option, since it should be easier with such skills to work anywhere in the world, and I still wanted to travel the world as well.

Once I finished the thesis in molecular ecology and graduated with a master’s degree in January 1998, I started looking for work whilst receiving unemployment benefit. IT companies only offered ‘conversion’ courses, such as a crash course in Cobol (the Y2K bug was alive and well) or some IT admin course, including the Microsoft Certified Systems Engineer (MCSE) programme, with the catch that you’d have to keep working for the IT company for 3 years to pay off the debt of that training. That sounded like bonded labour and not particularly appealing.

One day, flicking through the newspapers on the lookout for interesting job offers, an advertisement caught my eye: a conversion course over a year for an MCSE, consisting of five months full-time training and the rest of the year a practice period in industry whilst maintaining one’s unemployment benefit, whose amount was just about sufficient to get by, and then all was paid off. A sizeable portion of funding came from the European Union. The programme was geared toward giving a second chance for basket cases, such as the long-term unemployed and the disabled. I was not a basket case, not yet at least. I tried nonetheless, applied for a position, and was invited for an interview. My main task was to try to convince them that I was basket case-like enough to qualify to be accepted in the programme, but good enough to pass fast and with good marks. The arguments worked and I was accepted for the programme. A foot in the door.

We were a class of 16 people: 15 men and me, the only woman. I completed the MCSE successfully, and then I also completed a range of other vocational training courses whilst employed in various IT jobs. Unix system administration, ITIL service management, a bit of Novell Netware and Cisco, and some more online self-study training sessions, which were all paid for by the companies I was employed at. The downside of those trainings is that they all were, in my humble opinion, superficial; the how-to technology changes fast, and the prospect of perpetual rote learning did not sound appealing to me. I wanted to know the underlying principles so that I wouldn’t have to keep updating myself with the latest trivial modification in an application. It was time to take the next step.

I was working for Eurologic Systems in Dublin, Ireland, at the time as a systems integration test engineer for fibre channel storage enclosures, which are boxes with many hard drives stacked up and connected for fast access to lots of data stored on the disks. They were a good employer, but they had only a few training opportunities since it was an R&D company with experienced and highly educated engineers. I asked HR if I could sign up elsewhere, with, say, the Open University, and that they’d pay for some of it, maybe? “Yes,” the humane HR lady said, “that’s a good idea, and we’ll pay for every course you pass whilst in our employment.” Deal!

So, I enrolled with the Open University UK. I breezed through my first year even though I had skipped their 1st year courses and jumped straight into 2nd year courses. My second year went just as smoothly. The third year I paid for myself: I had opted for voluntary redundancy, and was allowed to take it in the second round, since I wanted to get back on track with my original plan to go into bioinformatics. The dotcom bubble had burst and Eurologic could not escape some of its effects. While they were not fond of seeing me go, they knew I’d leave soon anyway and they were happy to see that the redundancy money would be put to good use to finish my Computing & IT degree. With that finished, I’d be able to finally do the bioinformatics that I had been after since 1997, or so I thought.

My honours project was on database development, with a focus on conceptual data modelling languages. I rediscovered the Object-Role Modelling language from the lecture notes of the Saxion University of Applied Sciences that I had bought out of curiosity when I did the aforementioned MCSE course (in Enschede, the Netherlands). The database was about bacteriocins, which are produced by bacteria and they can be used in food for food safety and preservation. A first real step into bioinformatics. Bacteriocins have something to do with genes, too, and in searching for conceptual models about genes, I had stumbled into a new world in 2003, one with the Gene Ontology and the notion of ontologies to solve the data integration problem. Marking and marks processing took a bit longer than usual that year (the academics were on strike), and I was awarded the BSc(honours) degree (1st class) in March 2004. By that time, there were several bioinformatics conversion courses available. Ah, well.

The long route taken did give me some precious insight that no bioinformatics conversion top-up degree can give: a deeper understanding of indoctrination into disciplinary thinking and ways of doing science. That is, on what the respective mores are, how to question, how to identify a problem, ways of looking at things, and ways of answering questions and solving problems. Of course, when there’s, say, an experimental method, the principles of the methods are the same (hypothesis, set up experiment, do experiment, check results against hypothesis), as are some of the results processing tools (e.g., statistics), but there are substantive differences. For instance, in computing, you break down the problem, isolate it, and solve that piece of something that’s all human-made. In microbiology, it’s about trying to figure out how nature works, with all its interconnected parts that may interfere and complicate the picture. In the engineering side of food science, it was more along the line of, once we figure out what it does and what we really need, can we find something that does what we need or can we make it do it to solve the problem? It doesn’t necessarily mean one is less cool; just different. And hard to explain to someone who has ever studied only one degree in one discipline, most of whom invariably have the ‘my way or the highway’ attitude or think everyone is homologous to them. If you manage to create the chance to do a second full degree, take it.

***

I clearly did not have a Bachelor of Arts, but I had done some courses roughly in that area in my degree in Wageningen and had done a range of extra-curricular activities. Perhaps that, and more, would help me persuade the selection committee? I put it all in detail in the application form, in the hope it would make it look like I could pull this off and increase my chances of being accepted into the programme. I was accepted into the programme. Yay. Afterwards, I heard from one of the professors that it had been an easy decision, “since you already have a Masters degree, of science, no less”. This door, too, was opened thanks to that first degree I had obtained, which was paid for by the state merely because I qualified for tertiary education. The money to pay for this study came from my savings and the severance package from Eurologic. I had earned too much money in industry to qualify for state subsidy in Ireland; fair enough.

Doing the courses, I could feel I was missing the foundations, both regarding the content of some established theories here and there and in tackling things. By that time, I was immersed in computing, where you break down things into smaller sub-components, and that systematising is also reflected in the reports you write. My essays and reports have sections and subsections and suitably itemised lists; Ordnung muss sein. But no, we’re in a fluffy humanities space and it should have been ‘verbal diarrhoea’. That was my interpretation of some essay feedback I had received, which claimed that there was too much structure and that it should have been one long piece of text without visually identifiable beginning, middle, and end. That was early in the first semester. A few months into the programme, I thought that the only way I’d be able to pull off the dissertation was to drag the topic as much as I could into an area that I was comparatively good at: modelling and maths.

That is: to stick with my disciplinary indoctrinations as much as possible, rather than fully descend into what to me still resembled mud and quicksand. For sure, there’s much more to the humanities than meets an average scientist’s eye, and I gained an appreciation of it during that degree, but that does not mean I was comfortable with it. In addition, for thesis topic choice, there were still the ‘terrorists’ I was looking for an answer to. Combine the two, and voila, my dissertation topic: applying game theory to peace negotiations in the so-called ‘terrorist theatre’. Prof. Moxon-Browne was not only a willing, but also eager, supervisor, and a great one at that. The fact that he could not wait to see my progress was a good stimulator to work and achieve that progress.

In the end, the dissertation had some ‘fluffy’ theory, some mathematical modelling, and some experimentation. It looked into three-party negotiations cf. the common zero-sum approach in the literature: the government and two aggrieved groups, of which one was the politically-oriented one and the other one the violent one. For instance, in the case of South Africa, the Apartheid government on the one side and the ANC and the MK on the other side, and in the case of Ireland, the UK/Northern Ireland government, Sinn Fein and the IRA. The strategic benefits of who teams up with whom during negotiations, if at all, depend on their relative strength: mathematically, in several identified power-dynamic circumstances, an aggrieved participant could obtain a larger slice of the pie for the victims if they were not in a coalition than if they were, and the desire, or not, for a coalition among aggrieved groups depended on their relative power. This deviated from the widespread assumption at the time that the aggrieved groups should always band together. I hoped it would still be enough for a pass.

It was awarded a distinction. It turned out that my approach was fairly novel. Perhaps therein lies a retort argument for the top-up degrees against the ‘do both’ advice I mentioned before: a fresh look on the matter, if not interdisciplinarity or transdisciplinarity. I can see it also with the dissertation topics of our conversion Masters in IT students. They’re all interesting and topics that perhaps no disciplinarian would have produced.

***

The final step, then. With a distinction in the MA in Peace & Development in my pocket and a first in the BSc(honours) in CS&IT at around the same time, what next? The humanities topics were becoming too depressing even with a detached scientific mind—too many devastating problems and too little agency to influence—and I had worked toward the plan to go into bioinformatics for so many years already. Looking for jobs in bioinformatics, they all demanded a PhD. With the knowledge and experience amassed studying for the two full degrees, I could do all those tasks they wanted the bioinformatician to do. However, without meeting that requirement for a PhD, there was no chance I’d make it through the first selection round. That’s what I thought at the time. I tried 1-2 regardless—reject because no PhD. Maybe I should have tried and applied more widely nonetheless, since, in hindsight, it was the system’s way of saying they wanted someone well-versed in both fields, not someone trained to become an academic, since most of those jobs are software development jobs anyway.

Disappointed that I still couldn’t be the bioinformatician I thought I would be able to be after those two degrees, I sighed and resigned to the idea that, gracious sakes, I’ll get that PhD, too, then, and defer the dream a little longer.

In a roundabout way I ended up at the Free University of Bozen-Bolzano (FUB), Italy. They paid for the scholarship and there was generous project funding to pay for conference attendance. Meanwhile in the bioinformatics field, things had moved on from databases for molecular biology to bio-ontologies to facilitate data integration. The KRDB research centre at FUB was into ontologies, but then rather from the logic side of things. Fairly soon after my commencement with the PhD studies, my supervisor, who did not even have a PhD in Computer Science, told me in no uncertain terms that I was enrolled in a PhD in computer science, that my scientific contributions had to be in computer science, and if I wanted to do something in ‘bio-whatever’, that was fine, but that I’d have to do that in my own time. Crystal clear.

The ‘bio-whatever’ petered out, since I had to step up the computer science content because I had only three years to complete the PhD. On the bright side, passion will come the more you investigate something. Modelling, with some examples in bio, and ontologies and conceptual modelling it was. I completed my PhD in three years(-ish); fully indoctrinated in the computer science way. Journey completed.

***

I’ve not yet mentioned the design I indicated at the start of the blog post. It has nothing to do with moving into computer science. At all. Weaving the interior design into the narrative didn’t work well, and it falls under the “vocational training courses whilst employed in various IT jobs” phrase earlier on. The costs of the associate diploma at the Portobello Institute in Dublin? I earned most of the costs (1200 pounds or so? I can’t recall exactly, but it was somewhere between 1-2K) together in a week: we got double pay for working a shift on New Year (the year 2000 no less) and then I volunteered for the double pay for 12h shifts instead of regular 8h shifts for the week thereafter. One week extra work for an interesting hobby in the evening hours for a year was a good deal in my opinion, and it allowed me to explore whether I liked the topic as much as I thought I might in secondary school. I passed with a distinction and also got Rhodec certified. I still enjoy playing around with interiors, as a hobby, and have given up the initial idea (in 1999) to use IT with it, since tangible samples work fine.

So, yes, I really have completed degrees in science, engineering, and political science straddling into humanities, and a little bit of the arts. A substantial chunk was paid for by the state (‘full scholarships’), companies chimed in as well, and I paid some of it from my hard-earned money. On the motivations for the journey: I hope I made that clear despite cutting out some text in an attempt to reduce the post’s length. (Getting into university in the first place and staying in academia after completing a PhD are two different stories altogether, and left for another time.)

I still have many questions, but I also realise that many will remain unanswered even if the answer is known to humanity already, since to live means it’s finite and there’s simply not enough time to learn everything. In any case: do study what you want, not what anyone tells you to study. If the choice is a study or, say, a down payment on a mortgage for a house, then if completing the study will give good prospects and relieve you from a job you are not aiming for, go for it; that house may be bought later and be a tad bit smaller. It’s your life you’re living, not someone else’s.

# NLG requirements for social robots in Sub-Saharan Africa

When the robots come rolling, or just trickling or seeping or slowly creeping, into daily life, I want them to be culturally aware, give contextually relevant responses, and to do that in a language that the user can understand and speak well. Currently, they don’t. Since I work and live in South Africa, what does all that mean for the Southern Africa context? Would social robot use case scenarios be different here than in the Global North where most of the robot research and development is happening, and if so, how? What is meant by contextually relevant responses? Which language(s) should the robot communicate in?

The question of which languages is the easiest to answer: those spoken in this region, which are mainly those in the Niger-Congo B [NCB] (aka ‘Bantu’) family of languages, and then also Portuguese, French, Afrikaans, and English. I’ve been working on theory and tools for NCB languages, and isiZulu in particular (and some isiXhosa and Runyankore), mainly as part of the two NRF-funded projects GeNI and MoReNL. However, if we don’t know how that human-robot interaction occurs and in which setting, we won’t know whether the algorithms designed so far can also be used for that, which may well be beyond the ontology verbalisation, a patient’s medicine prescription generation, weather forecasts, or language learning exercises that we roughly got covered for the controlled language and natural language generation aspects of it.

So then what about those use case scenarios and contextually relevant responses? Let me first give an example of the latter. A few years ago in one of the social issues and professional practice lectures I was teaching, I brought in the Amazon Echo to illustrate precisely that as well as privacy issues with Alexa and digital assistants (‘robot secretaries’) in general. Upon asking “What is the EFF?”, the whole class (some 300 students present at the time) was expecting that Alexa would respond with something like “The EFF is the Economic Freedom Fighters, a political party in South Africa”. Instead, Alexa fetched the international/US-based answer and responded with “The EFF is the Electronic Frontier Foundation”, which the class had never heard of, and that EFF doesn’t really do anything in South Africa (it does come up later on in the module nonetheless, btw). There’s plenty of online content about the EFF as political party, yet Alexa chose to ignore that and prioritise information from elsewhere. Go figure with lots of other information that has limited online presence and doesn’t score high in the search engine results because there are fewer queries about it. How to get the right answer in those cases is not my problem (area of expertise), but I take that as a solved black box and zoom in on the natural language aspects to automatically generate a sentence that has the answer taken from some structured data or knowledge.

The other aspect of this instance is that the interactions both during and after the lecture were not 1:1 interactions of students with their own version of Siri or Cortana and the like, but eager and curious students came in teams, so a 1:m interaction. While that particular class is relatively large and was already split into two sessions, large classes are also not uncommon in several Sub-Saharan countries: for secondary school class sizes, the SADC average is 23.55 learners per class (the world average is 17), with the lowest in Botswana (13.8 learners) and the highest in Malawi with a whopping 72.3 learners in a class, on average. An educational robot could well be a useful way to get out of that catch-22, and, given resource constraints, end up as a deployment scenario with a robot per study group, and that in a multilingual setting that permits code switching (going back and forth between different languages). While human-robot interaction experts still will need to do some contextual inquiries and such to get to the bottom of the exact requirements and sentences, this variation in use is on top of the hitherto known possible uses of educational robots.

Going beyond this sort of informal chatter, I tried to structure that a bit and narrowed it down to a requirements analysis for the natural language generation aspects of it. After some contextualisation, I principally used two main use cases to elucidate natural language generation requirements and assessed that against key advances in research and technologies for NCB languages. Very, very, briefly, any system will need to i) combine data-to-text and knowledge-to-text, ii) generate many more different types of sentences, including sentences for both written and spoken languages in the NCB languages that are grammatically rich and often agglutinating, and iii) process numbers, which is non-trivial to do for NCB languages because the surface realization of the numbers depends on the noun class of the noun that is being counted. At present, no system out there can do all of that. A condensed version of the analysis was recently accepted as a paper entitled Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa [1], for the IST-Africa’21 conference, and it will be presented there next week at the virtual event, in the ‘next generation computing’ session no less, on Wednesday the 12th of May.
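To make the number issue in iii) a little more concrete, here is a toy sketch in Python. It is not the algorithm from the paper or from our tools; it only illustrates that the word for ‘two’ or ‘three’ in isiZulu changes shape with the noun class of what is counted. The names `CONCORD`, `STEM`, `NASAL_FORMS`, and `count_phrase` are made up for this illustration, only four noun classes and two numbers are covered, and the forms are the commonly taught ones as far as I know.

```python
# Agreement prefixes for a few isiZulu noun classes (illustrative subset only).
CONCORD = {2: "aba", 4: "emi", 6: "ama"}

# Stems for the numbers two and three.
STEM = {2: "bili", 3: "thathu"}

# Class 10 brings a nasal that alters the stem, so a plain prefix + stem
# concatenation would be wrong; these forms are stored whole instead.
NASAL_FORMS = {(10, 2): "ezimbili", (10, 3): "ezintathu"}


def number_word(noun_class: int, n: int) -> str:
    """Surface form of the number n when counting a noun of the given class."""
    if (noun_class, n) in NASAL_FORMS:
        return NASAL_FORMS[(noun_class, n)]
    return CONCORD[noun_class] + STEM[n]


def count_phrase(noun: str, noun_class: int, n: int) -> str:
    """Noun followed by the agreeing number word."""
    return f"{noun} {number_word(noun_class, n)}"


print(count_phrase("abantu", 2, 2))   # people, noun class 2
print(count_phrase("imithi", 4, 2))   # trees, noun class 4
print(count_phrase("izinja", 10, 2))  # dogs, noun class 10
```

Even in this tiny subset, ‘two’ surfaces as ababili, emibili, or ezimbili depending on the noun, and a lookup of whole forms is needed where simple concatenation breaks down; a robot that reads out quantities has to get this right for every noun class and for much larger numbers.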

Probably none of you has ever heard of this conference. IST-Africa is a yearly IT conference in Africa that aims to foster North-South and South-South networking, promote the academia->industry and academia->policy bridge-creation and knowledge transfer pipelines, and build capacity for paper writing and presentation. The topics covered are distinctly of regional relevance and, according to its call for papers, the “Technical, Policy, Social Implications Papers must present analysis of early/final Research or Implementation Project Results, or business, government, or societal sector Case Study”.

Why should I even bother with an event like that? It’s good to sometimes reflect on the context and ponder about relevance of one’s research; after all, part of the university’s income (and thus my salary) and a large part of the research project funding I have received so far comes ultimately from the taxpayers. South African taxpayers, to be more precise; not the taxpayers of the Global North. I can ‘advertise’, ahem, my research area and its progress to a regional audience. Also, I don’t expect that the average scientist in the Global North would care about HRI in Africa and even less so for NCB languages, but the analysis needed to be done and papers equate to brownie points. Also, if everyone thinks it better not to participate in something locally or regionally, it won’t ever become a vibrant network of research, applied research, and technology. I’ve attended the event once, in 2018 when we had a paper on error correction for isiZulu spellcheckers, and from my researcher viewpoint, it was useful for networking and ‘shopping’ for interesting problems that I may be able to solve, based on other participants’ case studies and inquiries.

Time will tell whether attending that event then and now this paper and online attendance will be time wasted or well spent. Unlike the papers on the isiZulu spellcheckers that reported research and concrete results that a tech company easily could take up (feel free to do so), this is a ‘fluffy’ paper, but exploring the use of robots in Africa was an interesting activity to do, I learned a few things along the way, it will save other interested people time in the analysis phase, and hopefully it also will generate some interest and discussion about what sort of robots we’d want and what they could or should be doing to assist, rather than replace, humans.

p.s.: if you still were to think that there are no robots in Africa and deem all this to be irrelevant: besides robots in the automotive and mining industries by, e.g., Robotic Innovations and Robotic Handling Systems, there are robots in education (also in Cape Town, by RD-9), robot butlers in hotels that serve quarantined people with mild COVID-19 in Johannesburg, they’re used for COVID-19 screening in Rwanda, and the Naledi personal banking app by Botlhale, to name but a few examples. Other tools are moving in that direction, such as, among others, Awezamed’s use of speech synthesis with (canned) text in isiZulu, isiXhosa and Afrikaans and there’s of course my research group where we look into knowledge-to-text text generation in African languages.

References

[1] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, 10-14 May 2021, online. in print.

# Automatically simplifying an ontology with NOMSA

Ever wanted only to get the gist of the ontology rather than wading manually through thousands of axioms, or to extract only a section of an ontology for reuse? Then the NOMSA tool may provide the solution to your problem.

There are quite a number of ways to create modules for a range of purposes [1]. We zoomed in on the notion of abstraction: how to remove all sorts of details and create a new ontology module of that. It’s a long-standing topic in computer science that returns every couple of years with another few tries. My first attempts date back to 2005 [2], which traces modules & abstractions for conceptual models and logical theories back to works published in the mid-1990s and, stretching the scope to granularity, to 1985, even. Those efforts, however, tended to halt at the theory stage or worked for one very specific scenario (e.g., clustering in ER diagrams). In this case, however, my former PhD student and now Senior Researcher at the CSIR, Zubeida Khan, went further and also devised the algorithms for five types of abstraction, implemented them for OWL ontologies, and evaluated them on various metrics.

The tool itself, NOMSA, was presented very briefly at the EKAW 2018 Posters & Demos session [3] and has supplementary material, such as the definitions and algorithms, a very short screencast and the source code. Five different ways of abstraction to generate ontology modules were implemented: i) removing participation constraints between classes (e.g., the ‘each X R at least one Y’ type of axioms), ii) removing vocabulary (e.g., remove all object properties to yield a bare taxonomy of classes), iii) keeping only a small number of levels in the hierarchy, iv) weightings based on how much some element is used (removing less-connected elements), and v) removing specific language profile features (e.g., qualified cardinality, object property characteristics).
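To give a feel for what two of those abstraction types amount to, here is a minimal sketch in Python. It is not NOMSA’s code or its algorithms: the `ToyOntology` structure, the function names, and the small animal taxonomy are all assumptions made up for illustration, and a real OWL ontology has far more kinds of axioms than the two kept here.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOntology:
    # child class -> parent class (single inheritance, to keep the sketch short)
    subclass_of: dict = field(default_factory=dict)
    # (subject class, object property, object class) triples
    property_axioms: list = field(default_factory=list)


def remove_object_properties(onto):
    """Abstraction by removing vocabulary: keep only the bare taxonomy."""
    return ToyOntology(dict(onto.subclass_of), [])


def depth(onto, cls):
    """Number of subclass steps from cls up to its root class."""
    steps = 0
    while cls in onto.subclass_of:
        cls = onto.subclass_of[cls]
        steps += 1
    return steps


def keep_top_levels(onto, max_depth):
    """Abstraction by hierarchy depth: drop classes deeper than max_depth."""
    kept = {c: p for c, p in onto.subclass_of.items() if depth(onto, c) <= max_depth}
    kept_classes = set(kept) | set(kept.values())
    # An axiom survives only if both classes it mentions survive.
    axioms = [a for a in onto.property_axioms
              if a[0] in kept_classes and a[2] in kept_classes]
    return ToyOntology(kept, axioms)


# Hypothetical example content, not taken from any of the ontologies mentioned.
onto = ToyOntology(
    subclass_of={"Carnivore": "Animal", "Herbivore": "Animal",
                 "Lion": "Carnivore", "Impala": "Herbivore"},
    property_axioms=[("Lion", "eats", "Impala")],
)
taxonomy_only = remove_object_properties(onto)
shallow = keep_top_levels(onto, max_depth=1)
```

Both functions return a new module and leave the input untouched, which mirrors the idea that an abstraction is a separate, simpler ontology derived from the original one.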

In the meantime, we have added a categorisation of different ways of abstracting conceptual models and ontologies, a larger use case illustrating those five types of abstractions that were chosen for specification and implementation, and an evaluation to see how well the abstraction algorithms work on a set of published ontologies. It was all written up and polished in 2018. Then it took a while in the publication pipeline mixed with pandemic delays, but eventually it has emerged as a book chapter entitled Structuring abstraction to achieve ontology modularisation [4] in the book “Advanced Concepts, methods, and Applications in Semantic Computing” that was edited by Olawande Daramola and Thomas Moser, in January 2021.

Since I bought new video editing software for the ‘physically distanced learning’ that we’re in now at UCT, I decided to play a bit with the software’s features and record a more comprehensive screencast demo video. In the nearly 13 minutes, I illustrate NOMSA with four real ontologies, being the AWO tutorial ontology, BioTop top-domain ontology, BFO top-level ontology, and the Stuff core ontology. Here’s a screengrab from somewhere in the middle of the presentation, where I just automatically removed all 76 object properties from BioTop, with just one click of a button:

The embedded video (below) might keep it perhaps still readable with really good eyesight; else you can view it here in a separate tab.

The source code is available from Zubeida’s website (and I have a local copy as well). If you have any questions or suggestions, please feel free to contact either of us. Under the fair use clause, we also can share the book chapter that contains the details.

References

[1] Khan, Z.C., Keet, C.M. An empirically-based framework for ontology modularization. Applied Ontology, 2015, 10(3-4):171-195.

[2] Keet, C.M. Using abstractions to facilitate management of large ORM models and ontologies. International Workshop on Object-Role Modeling (ORM’05). Cyprus, 3-4 November 2005. In: OTM Workshops 2005. Halpin, T., Meersman, R. (eds.), LNCS 3762. Berlin: Springer-Verlag, 2005. pp603-612.

[3] Khan, Z.C., Keet, C.M. NOMSA: Automated modularisation for abstraction modules. Proceedings of the EKAW 2018 Posters and Demonstrations Session (EKAW’18). CEUR-WS vol. 2262, pp13-16. 12-16 Nov. 2018, Nancy, France.

[4] Khan, Z.C., Keet, C.M. Structuring abstraction to achieve ontology modularisation. Advanced Concepts, methods, and Applications in Semantic Computing. Daramola O, Moser T (Eds.). IGI Global. 2021, 296p. DOI: 10.4018/978-1-7998-6697-8.ch004

# The ontological commitments embedded in a representation language

Just like programming language preferences generate heated debates, this happens every now and then with languages to represent ontologies as well. Passionate dislikes for description logics or limitations of OWL are not unheard of, in favour of, say, Common Logic for more expressiveness and a different notation style, or of OBO because of its graph-based fundamentals, or that abuse of UML Class Diagram syntax won’t do as approximation of an OWL file. But what is really going on here? Are they practically all just the same anyway and modellers merely stick with, and defend, what they know? If you could design your pet language, what would it look like?

The short answer is: they are not all the same and interchangeable. There are actually ontological commitments baked into the language, even though in most cases this is not explicitly stated as such. The ‘things’ one has in the language indicate what the fundamental building blocks are in the world (also called “epistemological primitives” [1]) and therewith assume some philosophical stance. For instance, a crisp vs vague world (say, plain OWL or a fuzzy variant thereof) or whether parthood is such a special relation that it deserves its own primitive next to class subsumption (alike UML’s aggregation). Or maybe you want one type of class for things indicated with count nouns and another type of element for stuffs (substances generally denoted with mass nouns). This then raises the question as to what the sort of commitments are that are embedded in, or can go into, a language specification and that have an underlying philosophical point of view. This, in turn, raises the question about which philosophical stances actually can have a knock-on effect on the specification or selection of an ontology language.

My collaborator, Pablo Fillottrani, and I tried to answer these questions in the paper entitled An Analysis of Commitments in Ontology Language Design [2] that was published late last year as part of the proceedings of the 11th Conference on Formal Ontology in Information Systems 2020 that was supposed to have been held in September 2020 in Bolzano, Italy. In the paper, we identified and analysed ontological commitments that are, or could have been, embedded in logics, and we showed how they have been taken for well-known languages for representing ontologies and similar artefacts, such as OBO, SKOS, OWL 2 DL, DLRifd, and FOL. We organised them in four main categories: what the very fundamental furniture is (e.g., including roles or not, time), acknowledging refinements thereof (e.g., types of relations, types of classes), the logic’s interaction with natural language, and crisp vs various vagueness options. They are discussed over about 1/3 of the paper.

Obviously, engineering considerations can interfere in the design of the logic as well. They concern issues such as what the syntax should look like and whether scalability is an issue, but this is not the focus of the paper.

We did spend some time contextualising the language specification in an overall systematic engineering process of language design, which is summarised in the figure below (the paper focuses on the highlighted step).

While such a process can be used for the design of a new logic, it also can be used for post hoc reconstructions of past design processes of extant logics and conceptual data modelling languages, and for choosing which one you want to use. At present, the documentation of the vast majority of published languages does not describe much of the ‘softer’ design rationales, though.

We played with the design process to illustrate how it can work out, availing also of our requirements catalogue for ontology languages and we analysed several popular ontology languages on their commitments, which can be summed up as in the table shown below, also taken from the paper:

In a roundabout way, it also suggests some explanations as to why some of those transformation algorithms aren’t always working well; e.g., any UML-to-OWL or OBO-to-OWL transformation algorithm is trying to shoe-horn one ontological commitment into another, and that can only be approximated, at best. Things have to be dropped (e.g., roles, due to standard view vs positionalism) or cannot be enforced (e.g., labels, due to natural language layer vs embedding of it in the logic), and that’ll cause some hiccups here and there. Now you know why, and why that won’t ever work perfectly.

Hopefully, all this will feed into a way to help choosing a suitable language for the ontology one may want to develop, or assist with understanding better the language that you may be using, or perhaps gain new ideas for designing a new ontology language.

References

[1] Brachman R, Schmolze J. An overview of the KL-ONE Knowledge Representation System. Cognitive Science. 1985, 9:171–216.

[2] Fillottrani, P.R., Keet, C.M. An Analysis of Commitments in Ontology Language Design. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 46-60.

# On computer program being a whole

Who cares whether some computer program is a whole, how, and why? Turns out, more people than you may think—and so should you, since it can be costly depending on the answer. Consider the following two scenarios: 1) you download a ‘pirated’ version of MS Office or Adobe Photoshop (the most popular ones still) and 2) you take the source code of a popular open source program, such as Notepad++, add a little code for some additional function, and put it up for sale only as an executable app called ‘Notepad++ extreme (NEXT)’ so as to try to earn money quickly. Are these actions legal?

In both cases, you’d break the law, but how many infringements took place, of the kind that you potentially could be fined for or face jail time? For the piracy case, is that once for the MS Office suite, or for each program in the suite, or for each file created upon installing MS Office, or for each source code file that went into making the suite during software development? For the open source case, was that violating its GNU GPL open source licence once for the zipped & downloaded or cloned source code or for each file in the source code, of which there are hundreds? It is possible to construct similar questions for trade secret violations and patent infringements for programs, as well as other software artefacts, like illegal downloads of TV series episodes (going strong during COVID-19 lockdowns indeed). Just in case you think this sort of issue is merely hypothetical: recently, Arista paid Cisco $400 million for copyright damages and just before that, Zenimax got $500 million from Oculus (yes, the VR software) for trade secret violations, and Google vs Oracle is ongoing with “billions of dollars at stake”.

Let’s consider some principles first. To be able to answer the number of infringements, we first need to know whether a computer program is a whole or not and why, and if so, what’s ‘in’ (i.e., a part of it) and what’s ‘out’ (i.e., definitely not part of it). Spoiler alert: a computer program is a functional whole.

To get to that conclusion, I had to combine insights from theories of parthood (mereology), granularity, modularity, unity, and function and add a little more into the mix. The argumentation comes in two versions: a longer technical report [1], which I hope is readable by a wider audience, and a condensed version for a specialist audience [2] that was published in the Proceedings of the 11th Conference on Formal Ontology in Information Systems (FOIS’20) two weeks ago. Very briefly and informally, the state of affairs can be illustrated with the following picture:

This schematic representation shows, first, two levels of granularity: level 1 and level 2. At level 1, there’s some whole, like the a1 and a2 in the figure that could be referring to, say, a computer program, a module repository, an electorate, or a human body. At a more fine-grained level 2, there are different entities, which are in some way linked to the respective whole. This ‘link’ to the whole is indicated with the vertical dashed lines, and one can say that they are part of the whole. For the blue dots on the right residing at level 2, i.e., the parts of a1, there’s also a unifying relation among the parts, indicated with the solid lines with arrows, which makes a1 an integral whole. Moreover, for that sort of whole, it holds that if some object x (residing at level 2) is part of a1 then if there’s a y that is also part of a1, it participates in that unifying relation with x and vice versa (i.e., if y is in that unifying relation with x, then it must also be part of a1). For the computer program’s source code, that unifying relation can be the source tree graph.
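The two-way condition on the unifying relation can be checked mechanically on a toy example. The sketch below is only one possible reading, under the assumption that ‘being in the unifying relation’ means ‘being connected in the source tree graph’; the formalisation in the report and paper [1,2] is richer than this (function, mandatory vs essential parts). The file names and the function names `reachable` and `is_integral_whole` are hypothetical.

```python
def reachable(edges, start):
    """All nodes connected to `start`, treating the edges as undirected."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


def is_integral_whole(parts, edges):
    """True if the parts are exactly what the unifying relation ties together.

    Every part must be connected to every other part, and anything connected
    to a part must itself be a part.
    """
    if not parts:
        return False
    anchor = next(iter(parts))
    return reachable(edges, anchor) == set(parts)


# Hypothetical source tree: directory-to-file edges.
source_tree = [("src", "src/main.cpp"), ("src", "src/utils.cpp")]
program_parts = {"src", "src/main.cpp", "src/utils.cpp"}

whole = is_integral_whole(program_parts, source_tree)
with_stray_file = is_integral_whole(program_parts | {"unrelated/notes.txt"}, source_tree)
with_missing_file = is_integral_whole(program_parts - {"src/utils.cpp"}, source_tree)
```

In this toy setting, adding a file that is not in the source tree, or leaving out one that is, both break the condition: the first violates ‘every part participates in the unifying relation’, the second violates the ‘vice versa’.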

There is some nitty gritty detail also involving the notion of function—a source code file contributes to doing something—and optional vs mandatory vs essential part that you can read about in the report or in the paper [1,2], covering the formalisation, more argumentation, and examples.

How would it pan out for the infringements? The Notepad++ exploitation scenario would simply be a case of one infringement in total for all the files needed to create the executable, not one for each source code file. This conclusion from the theory turns out to be remarkably in line with the GNU GPL’s explanation of their licence, albeit then providing a theoretical foundation for their intuition that there’s a difference between a mere aggregate where different things are bundled, loose coupling (e.g., sockets and pipes) and a single program (e.g., using function calls, being included in the same executable). The order of things perhaps should have been from there into the theory, but practically, I did the analysis and stumbled into a situation where I had to look up the GPL and its explanatory FAQ. On the bright side, going in the other direction: just in case someone wants to take on copyleft principles of open source software, here are some theoretical foundations to support that there’s probably much less money to be gained than you might think.

For the MS Office suite case mentioned at the start, I’d need a look under the hood to determine how it ties together and one may have to argue about the sameness of, or difference between, a suite and a program. The easier case for a self-standing app, like the 3rd-place most pirated Windows app Internet Download Manager, is that it is one whole and so one infringement then.

It’s a pity that FOIS 2020 has been postponed to 2021, but at least I got to talk about some of this as expert witness for a litigation case and I managed to weave an exercise about the source tree with open source licences into the social issues and professional practice module I taught to some 750 students this past winter.

References

[1] Keet, C.M. Why a computer program is a functional whole. Technical report 2008.07273, arXiv. 21 July 2020. 25 pages.

[2] Keet, C.M. The computer program as a functional whole. Proc. of FOIS 2020. Brodaric, B. and Neuhaus, F. (Eds.). IOS Press. FAIA vol. 330, 216-230.

# An architecture for Knowledge-driven Information and Data access: KnowID

Advanced so-called ‘intelligent’ information systems may use an ontology or runtime-suitable conceptual data modelling techniques in the back end combined with efficient data management. Such a set-up aims to provide a way to better support informed decision-making and data integration, among others. A major challenge in creating such systems is to figure out which components to design and put together to realise a ‘knowledge to data’ pipeline, since each component and process has trade-offs; see, e.g., the very recent overview of sub-topics and challenges [1]. A (very) high level categorization of the four principal approaches is shown in the following figure: put the knowledge and data together in the logical theory the AI way (left) or the database way (right), or bridge it by means of mappings or by means of transformations (centre two):

Among those variants, one can dig into considerations like which logic to design or choose in the AI-based “knowledge with (little) data” (e.g.: which OWL species? Common Logic? Other?), which type of database (relational, object-relational, or rather an RDF store), which query language to use or design, which reasoning services to support, how expressive it all has to be, and optimized for what purpose. None is best in all deployment scenarios. The AI-only one with, say, OWL 2 DL, is not scalable; the database-only one either lacks interesting reasoning services or supports few types of constraints.

Among the two in the middle, the “knowledge mapping data” is best known under the term ‘ontology-based data access’ (OBDA) and the Ontop system in particular [2] with its recent extension into ‘virtual knowledge graphs’ and the various use cases [3]. The distinguishing characteristic of its architecture is the mapping layer to bridge the knowledge to the data. In the “Data transformation knowledge” approach, the idea is to link the knowledge to the data through a series of transformations. No such system is available yet. Considering the requirements for that, it turned out that a good few components are already available and that just one crucial set of transformations was needed to convincingly put them together.

We did just that and devised a new knowledge-to-data architecture. We dub this the KnowID architecture (pronounced as ‘know it’), abbreviated from Knowledge-driven Information and Data access. KnowID adds novel transformation rules between suitably formalised EER diagrams as application ontology and Borgida, Toman & Weddell’s Abstract Relational Model with SQLP ([4,5]) to complete the pipeline (together with some recently proposed other components). Overall, it then looks like this:

Its details are described in the article entitled “KnowID: an architecture for efficient Knowledge-driven Information and Data access” [6], which was recently published in the Data Intelligence journal. In a nutshell: the logic-based EER diagram (with deductions materialised) is transformed into an abstract relational model (ARM) that is transformed into a traditional relational model and then onward to a database schema, where the original ‘background knowledge’ of the ARM is used for data completion (i.e., materializing the deductions w.r.t. the data), and then the query posed in SQLP (SQL + path queries) is answered over that ‘extended’ database.
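The data completion step followed by query answering can be illustrated with a toy example. This is a sketch under strong assumptions and not the KnowID implementation: it uses plain SQL in an in-memory SQLite database rather than SQLP, the only ‘background knowledge’ is one hypothetical subsumption (every professor is an employee), and the table names and the helper `complete_data` are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee  (id TEXT PRIMARY KEY);
    CREATE TABLE professor (id TEXT PRIMARY KEY);
    INSERT INTO employee  VALUES ('e1');
    INSERT INTO professor VALUES ('p1');
""")

# Hypothetical background knowledge: (subclass table, superclass table).
subsumptions = [("professor", "employee")]


def row_total(conn, tables):
    """Total number of rows over the given tables."""
    return sum(conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
               for t in tables)


def complete_data(conn, subsumptions):
    """Materialise the deductions: copy subclass members into superclass tables.

    Repeats until nothing new is added, so chains of subsumptions are covered.
    Table names are trusted constants here, hence the plain string formatting.
    """
    tables = {t for pair in subsumptions for t in pair}
    before = -1
    while before != row_total(conn, tables):
        before = row_total(conn, tables)
        for sub, sup in subsumptions:
            conn.execute(f"INSERT OR IGNORE INTO {sup} (id) SELECT id FROM {sub}")


complete_data(conn, subsumptions)

# The query now runs over the 'extended' database, under the closed world assumption.
answers = [row[0] for row in conn.execute("SELECT id FROM employee ORDER BY id")]
```

The point of the sketch is the order of operations: the deduction that `p1` is an employee is stored in the database before query time, so the query itself is ordinary database querying with no reasoning or mapping layer in between.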

Besides the description of the architecture and the new transformation rules, the open access journal article also describes several examples and it features a more detailed comparison of the four approaches shown in figure 1 above. For KnowID, compared to other ontology-based data access approaches, its key distinctive architectural features are that runtime use can avail of full SQL augmented with path queries, the closed world assumption commonly used in information systems, and it avoids a computationally costly mapping layer.

We are working on the implementation of the architecture. The transformation rules and corresponding algorithms were implemented last year [7] and two computer science honours students are currently finalising their 4th-year project, therewith contributing to the materialization and query formulation steps of the architecture. The latest results are available from the KnowID webpage. If you were to worry that it will suffer from link rot: the version associated with the Data Intelligence paper has been archived as supplementary material of the paper at [8]. The plan is, however, to steadily continue with putting the pieces together to make a functional software system.

References

[1] Schneider, T., Šimkus, M. Ontologies and Data Management: A Brief Survey. Künstl Intell 34, 329–353 (2020).

[2] Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web Journal, 2017, 8(3), 471-487.

[3] G. Xiao, L. Ding, B. Cogrel, & D. Calvanese. Virtual knowledge graphs: An overview of systems and use cases. Data Intelligence, 2019, 1, 201-223.

[4] A. Borgida, D. Toman & G.E. Weddell. On referring expressions in information systems derived from conceptual modeling. In: Proceedings of ER’16, 2016, pp. 183–197

[5] W. Ma, C.M. Keet, W. Oldford, D. Toman & G. Weddell. The utility of the abstract relational model and attribute paths in SQL. In: C. Faron Zucker, C. Ghidini, A. Napoli & Y. Toussaint (eds.) Proceedings of the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’18), 2018, pp. 195–211.

[6] P.R. Fillottrani & C.M. Keet. KnowID: An architecture for efficient knowledge-driven information and data access. Data Intelligence, 2020 2(4), 487–512.

[7] Fillottrani, P.R., Jamieson, S., Keet, C.M. Connecting knowledge to data through transformations in KnowID: system description. Künstliche Intelligenz, 2020, 34, 373-379.

[8] Pablo Rubén Fillottrani, C. Maria Keet. KnowID. V1. Science Data Bank. http://www.dx.doi.org/10.11922/sciencedb.j00104.00015. (2020-09-30)

# Toward a framework for resolving conflicts in ontologies (with COVID-19 examples)

Among the many tasks involved in developing an ontology are deciding what part of the subject domain to include, and how. This may involve selecting a foundational ontology, reuse of related domain ontologies, and more detailed decisions for ontology authoring for specific axioms and design patterns. A recent example of reuse is that of the Infectious Diseases Ontology for schistosomiasis knowledge [1], but even before reuse, one may have to assess differences among ontologies, as Haendel et al. did for disease ontologies [2]. Put differently, even before throwing alignment tools at them or selecting one with an import statement and hoping for the best, issues may arise. For instance, two relevant domain ontologies may have been aligned to different foundational ontologies, a partOf relation could be set to be transitive in one ontology but also be used in a qualified cardinality constraint in the other (so then one cannot use an OWL 2 DL reasoner anymore when the ontologies are combined), something like Infection may be represented as a class in one ontology but as a property infectedby in another, or the ontologies differ on the science, like whether Virus is an organism or an inanimate object.

What to do then?

Upfront, it helps to be cognizant of the different types of conflict that may arise, and understand what their causes are. Then one would want to be able to find those automatically. And, most importantly, get some assistance in how to resolve them; if possible, also even preventing conflicts from happening in the first place. This is what Rolf Grütter, from the Swiss Federal Research Institute WSL, and I have been working on since he visited UCT last year. The first results, described in a paper entitled “Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring” [3], have been accepted for the International Conference on Biomedical Ontologies (ICBO) 2020. A sample scenario of the process is illustrated informally in the following figure.

Summary of a sample scenario of detecting and resolving conflicts, illustrated with an ontology reuse scenario where Onto2 will be imported into Onto1. (source: [3])

The paper first defines and illustrates the notions of meaning negotiation and conflict resolution and summarises their main causes, to then go into some detail of the various categories of conflicts and ways how to resolve them. The detection and resolution is assisted by the notion of a conflict set, which is a data structure that stores the details for further processing.
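To make the idea of a conflict set a bit more tangible, here is a minimal sketch of what such a data structure could look like. The fields, the class names `Conflict` and `ConflictSet`, and the category labels are my assumptions for this post, not the definition from the paper [3]; the two example entries reuse the Infection and Virus cases mentioned earlier, with Onto1 and Onto2 as placeholder ontology names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Conflict:
    kind: str          # hypothetical category label, e.g. "class-vs-property"
    ontology_a: str
    element_a: str
    ontology_b: str
    element_b: str
    note: str = ""


@dataclass
class ConflictSet:
    conflicts: list = field(default_factory=list)

    def add(self, conflict):
        """Store a detected conflict once, keeping detection order."""
        if conflict not in self.conflicts:
            self.conflicts.append(conflict)

    def of_kind(self, kind):
        """All stored conflicts of one category, for targeted resolution."""
        return [c for c in self.conflicts if c.kind == kind]


conflict_set = ConflictSet()
conflict_set.add(Conflict("class-vs-property", "Onto1", "Infection",
                          "Onto2", "infectedby",
                          note="same idea, different kind of element"))
conflict_set.add(Conflict("domain-disagreement", "Onto1", "Virus",
                          "Onto2", "Virus",
                          note="organism vs inanimate object"))
# Adding the same conflict again leaves the set unchanged.
conflict_set.add(Conflict("domain-disagreement", "Onto1", "Virus",
                          "Onto2", "Virus",
                          note="organism vs inanimate object"))

modelling_conflicts = conflict_set.of_kind("class-vs-property")
```

Keeping the details of each conflict in one place is what allows detection and resolution to be separate steps: a detector only has to fill the set, and a resolution procedure can then pick the entries of the category it knows how to handle.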

It was tested with a use case of an epizootic disease outbreak in the Lemanic Arc in Switzerland in 2006, due to H5N1 (avian influenza): an administrative ontology had to be merged with one about the epidemiology for infected birds and surveillance zones. With that use case in place already well before the spread of SARS-CoV-2 that caused the current pandemic, it was a small step to add a few examples to the paper about COVID-19. This was made possible thanks to recently developed relevant ontologies that were made available, including for COVID-19 specifically. Let’s highlight the examples here, also so that I can write a bit more about it than the terse text in the paper, since there are no page limits for a blog post.

Example 1: OWL profile violations

Medical terminologies tend to veer toward being represented in an ontology language that is no more expressive than OWL 2 EL: this permits scalability, compatibility with typical OBO Foundry ontologies, as well as fitting with the popular SNOMED CT. As one may expect, there have been efforts in ontology development with content relevant for the current pandemic; e.g., the Coronavirus Infectious Disease Ontology (CIDO) [4]. The CIDO is not in OWL 2 EL, however: it has a class expression with a universal quantifier (ObjectAllValuesFrom) on the right-hand side; specifically (in DL notation): ‘Yale New Haven Hospital SARS-CoV-2 assay’ $\sqsubseteq \forall$ ‘EUA-authorized use at’.‘FDA EUA-authorized organization’ or, in the Protégé interface:

(codes: CIDO_0000020, CIDO_0000024, and CIDO_0000031, respectively). It also imported many ontologies and either used them to cause some profile violations or the violations came with them, such as by having used the union operator (‘or’) in the following axiom for therapeutic vaccine function (VO_0000562):

How did I find that? Most certainly NOT by manually browsing through the more than 70000 axioms of the CIDO (including imports) to find the needle in the haystack. Instead, I burned the proverbial haystack to easily get the needles. In this case, the burning was done with the OWL Classifier, which automatically computes which axioms violate any of the OWL species, and lists them accordingly. Here are two examples, illustrating an OWL 2 EL violation (that aforementioned universal quantification) and an OWL 2 QL violation (a property chain with entities from BFO and RO); you can do likewise for OWL 2 RL violations.

Following the scenario with the assumption that the CIDO would have to stay in the OWL 2 EL profile, then it is easy to find the conflicting axioms and act accordingly, i.e., remove them. (It also indicates something did not go well with importing the NDF-RT.owl into the cido-base.owl, but that as an aside for this example.)
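For a rough idea of what ‘burning the haystack’ amounts to, here is a deliberately naive sketch that flags axioms using constructs that fall outside OWL 2 EL. It is not how the OWL Classifier works: it merely scans axioms written as functional-style syntax strings for a partial list of disallowed constructs, and the three axioms with their abbreviated entity names are hypothetical stand-ins, not copied from the CIDO.

```python
# Partial list of OWL 2 constructs that are not part of the OWL 2 EL profile.
NOT_IN_EL = (
    "ObjectAllValuesFrom",
    "ObjectUnionOf",
    "ObjectComplementOf",
    "ObjectMinCardinality",
    "ObjectMaxCardinality",
    "ObjectExactCardinality",
    "InverseObjectProperties",
    "FunctionalObjectProperty",
    "SymmetricObjectProperty",
)


def el_violations(axioms):
    """Return (axiom, offending constructs) for each axiom outside OWL 2 EL."""
    found = []
    for axiom in axioms:
        # Naive textual check: a construct name followed by an opening bracket.
        offending = [c for c in NOT_IN_EL if c + "(" in axiom]
        if offending:
            found.append((axiom, offending))
    return found


# Hypothetical axioms in functional-style syntax.
axioms = [
    "SubClassOf(:Assay ObjectSomeValuesFrom(:detects :Virus))",
    "SubClassOf(:YaleAssay ObjectAllValuesFrom(:authorizedUseAt :AuthorizedOrganization))",
    "EquivalentClasses(:TherapeuticFunction ObjectUnionOf(:FunctionA :FunctionB))",
]

violations = el_violations(axioms)
to_keep = [a for a in axioms if a not in [v[0] for v in violations]]
```

Under the scenario’s assumption that the ontology has to stay within OWL 2 EL, `to_keep` is then what remains after removing the conflicting axioms; a real tool works on the parsed ontology rather than on strings and covers all profile restrictions.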

Example 2: Modelling issues: same idea, different elements

Let’s take the CIDO again and now also the COviD Ontology for cases and patient information (CODO), which have some overlapping and complementary information, so perhaps could be merged. A not unimportant thing is the test for SARS-CoV-2 and its outcome. CODO has a ‘laboratory test finding’ $\equiv$ {positive, pending, negative}, i.e., the possible outcomes of the test are individuals made into a class using the ObjectOneOf constructor. Consulting CIDO for the test outcomes, it has a class ‘COVID-19 diagnosis’ with three subclasses: Negative, Positive, and Presumptive positive. Aside from the inexact matches of the test status that won’t simplify data integration efforts, this is an example of class vs. instance modeling of what is ontologically the same thing. Resolving this in any merging attempt means that either

1. the CODO has to change and bump up the test results from individuals to classes, or
2. the CIDO has to change the subclasses to individuals in the ABox, or
3. take an ‘outside option’ and represent it in yet a different way where both the CODO and the CIDO have to modify the ontology (e.g., take a conceptual data modeling approach by making the test outcome an attribute with a few possible values).
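Detecting this kind of class vs. instance mismatch can be sketched in a few lines. The dictionaries below are a simplified, hand-made rendering of the two snippets discussed above, not an export from CODO or CIDO, and matching names case-insensitively is an assumption of mine rather than the method of the paper.

```python
# Hand-made fragments of the two ontologies, reduced to names only.
codo = {
    "classes": {"laboratory test finding"},
    "individuals": {"positive", "pending", "negative"},
}
cido = {
    "classes": {"COVID-19 diagnosis", "Negative", "Positive", "Presumptive positive"},
    "individuals": set(),
}


def class_vs_individual(a, b):
    """Pairs (individual in a, class in b) whose names match ignoring case."""
    a_individuals = {name.casefold(): name for name in a["individuals"]}
    b_classes = {name.casefold(): name for name in b["classes"]}
    shared = a_individuals.keys() & b_classes.keys()
    return sorted((a_individuals[k], b_classes[k]) for k in shared)


clashes = class_vs_individual(codo, cido)

# CODO outcomes without a same-named counterpart: the inexact matches.
matched = {individual for individual, _ in clashes}
unmatched = sorted(codo["individuals"] - matched)
```

Here `clashes` pairs up ‘positive’ and ‘negative’ across the two ontologies, which is the input needed for choosing among the three resolution options, while `unmatched` surfaces ‘pending’ as one of the inexact matches that still needs a human decision.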

The paper provides an attempt to systematize such types of conflict toward a library of common types of conflict, so that it should become easier to find them, and offers steps toward a proper framework to manage all that, which assisted with devising generic approaches to resolution of conflicts. We already have done more to realize all that (which could not all be squeezed into the 12 pages), but more is still to be done, so stay tuned.

Since COVID-19 is still doing the rounds and the international borders of South Africa are still closed (with a lockdown for some 5 months already), I can’t end the blog post with the usual ‘I hope to see you at ICBO 2020 in Bolzano in September’—well, not in the common sense understanding at least. Hopefully next year then.

References

[1] Cisse PA, Camara G, Dembele JM, Lo M. An Ontological Model for the Annotation of Infectious Disease Simulation Models. In: Bassioni G, Kebe CMF, Gueye A, Ndiaye A, editors. Innovations and Interdisciplinary Solutions for Underserved Areas. Springer LNICST, vol. 296, 82–91. 2019.

[2] Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annual Review of Biomedical Data Science, 2018, 1:305–331.

[3] Grütter R, Keet CM. Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring. 11th International Conference on Biomedical Ontologies (ICBO’20), 16-19 Sept 2020, Bolzano, Italy. CEUR-WS (in print).

[4] He Y, Yu H, Ong E, Wang Y, Liu Y, Huffman A, Huang H, Beverley J, Hur J, Yang X, Chen L, Omenn GS, Athey B, Smith B. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scientific Data, 2020, 7:181.

# A requirements catalogue for ontology languages

If you could ‘mail order’ a language for representing ontologies or fancy knowledge graphs, what features would you want it to have? Or, from an artefact development viewpoint: what requirements would it have to meet? Perhaps it may not be a ‘Christmas wish list’ in these days, but a COVID-19 lockdown ‘keep dreaming’ one instead, although perhaps it may even be feasible to realise if you don’t ask for too much. Either way, answering this on the spot may not be easy, and possibly incomplete. Therefore, I have created a sample catalogue, based on the published list of requirements and goals for OWL and CL, and I added a few more. The possible requirements to choose from currently are loosely structured into six groups: expressiveness/constructs/modelling features; features of the language as a whole; usability by a computer; usability for modelling by humans; interaction with ‘outside’, i.e., other languages and systems; and ontological decisions. If you think the current draft catalogue should be extended, please leave a comment on this post or contact the author, and I’ll update accordingly.

Expressiveness/constructs/modelling features

E-1 Equipped with basic language elements: predicates (1, 2, n-ary), classes, roles, properties, data-types, individuals, … [select or add as appropriate].

E-2 Equipped with language features/constraints/constructs: domain/range axioms, equality (for classes, for individuals), cardinality constraints, transitivity, … [select or add as appropriate].

E-3 Sufficiently expressive to express various commonly used ‘syntactic sugarings’ for logical forms or commonly used patterns of logical sentences.

E-4 Such that any assumptions about logical relationships between different expressions can be expressed in the logic directly.

Features of the language as a whole

F-1 It has to cater for meta-data; e.g., author, release notes, release date, copyright, … [select or add as appropriate].

F-2 An ontology represented in the language may change over time and it should be possible to track that.

F-3 Provide a general-purpose syntax for communicating logical expressions.

F-4 Unambiguous, i.e., there is no need to negotiate the syntactic roles of symbols or to translate between syntactic roles.

F-5 Such that every name has the same logical meaning at every node of the network.

F-6 Such that it is possible to refer to a local universe of discourse (roughly: a module).

F-7 Such that it is possible to relate the ontology to other such universes of discourse.

F-8 Specified with a particular semantics.

F-9 Should not make arbitrary assumptions about semantics.

F-10 Cater for internationalization (e.g., language tags, additional language model).

F-11 Extendable (e.g., regarding adding more axioms to same ontology, add more vocabulary, and/or in the sense of importing other ontologies).

F-12 Balance expressivity and complexity (e.g., for scalable applications, for decidable automated reasoning tasks).

F-13 Have a query language for the ontology.

F-14 Declared with Closed World Assumption.

F-15 Declared with Open World Assumption.

F-16 Use Unique Name Assumption.

F-17 Do not use Unique Name Assumption.

F-18 Ability to modify the language with the language features.

F-19 Ability to plug in language feature extensions; e.g., ‘loading’ a module for a temporal extension.

Usability by computer

UC-1 Be an (identifiable) object on the Web.

UC-2 Be usable on the Web.

UC-3 Using URIs and URI references that should be usable as names in the language.

UC-4 Using URIs to give names to expressions and sets of expressions, in order to facilitate Web operations such as retrieval, importation and cross reference.

UC-5 Have a serialisation in [XML/JSON/…] syntax.

UC-6 Have symbol support for the syntax in LaTeX/…

UC-7 Such that the same entailments are supported, everywhere on the network of ontologies.

UC-8 Able to be used by tools that can do subsumption reasoning/taxonomic classification.

UC-9 Able to be used by tools that can detect inconsistency.

UC-10 Possible to read and write in the document with simple tools, such as a text editor.

UC-11 Unambiguous and simple grammar, to keep parsing documents as simple as possible.

Usability & modelling by humans

HU-1 Easy to use

HU-2 Have at least one compact, human-readable syntax defined which can be used to express the entire language

HU-3 Have at least one compact, human-readable syntax defined so that it can be easily typed up in emails

HU-4 Such that no agent should be able to limit the ability of another agent to refer to any entity or to make assertions about any entity

HU-5 Such that a modeller is free to invent new names and use them in published content.

HU-6 Have clearly defined syntactic sugar, such as a controlled natural language for authoring or rendering the ontology or an exhaustive diagrammatic notation

Interaction with outside

I-1 Shareable (e.g., on paper, on the computer, concurrent access)

I-2 Interoperable (with what?)

I-3 Compatible with existing standards (e.g., RDF, OWL, XML, URIs, Unicode)

I-4 Support open networks of ontologies

I-5 Possible to import ontologies (theories, files)

I-6 Option to declare inter-ontology assertions

Ontological decisions

O-1 3-Dimensionalist commitment, where entities are in space but one doesn’t care about time

O-2 3-Dimensionalist with a temporal extension

O-3 4-Dimensionalist commitment, where entities are in spacetime

O-4 Standard view of relations and relationships (there is an order in which the entities participate)

O-5 Positionalist relations and relationships (there’s no order, but entities play a role in the relation/relationship)

O-6 Have additional primitives, such as for subsumption, parthood, collective, stuff, sortal, anti-rigid entities, … [select or add as appropriate]

O-7 Statements are either true or false

O-8 Statements may be vague or uncertain; e.g., fuzzy, rough, probabilistic [select as appropriate]

O-9 There should be a clear separation between natural language and ontology

O-10 Ontology and natural language are intertwined
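Some of the entries above pull in opposite directions, so a wish list drawn from the catalogue can be checked for clashes. The following is a minimal sketch of that idea, not part of the catalogue itself: the dictionary holds only a handful of the IDs listed above, and the set of mutually exclusive pairs is my own assumption for illustration, since the catalogue does not declare which requirements exclude each other.

```python
from itertools import combinations

# Hypothetical encoding of a few catalogue entries: requirement ID -> short label.
CATALOGUE = {
    "F-14": "Declared with Closed World Assumption",
    "F-15": "Declared with Open World Assumption",
    "F-16": "Use Unique Name Assumption",
    "F-17": "Do not use Unique Name Assumption",
    "O-9": "Clear separation between natural language and ontology",
    "O-10": "Ontology and natural language are intertwined",
    "UC-5": "Have a serialisation in XML/JSON/...",
}

# Assumption (not stated in the catalogue): these pairs cannot both be selected.
EXCLUSIVE = {
    frozenset({"F-14", "F-15"}),
    frozenset({"F-16", "F-17"}),
    frozenset({"O-9", "O-10"}),
}


def conflicts(selection):
    """Return the pairs of selected requirement IDs that are assumed to clash."""
    chosen = set(selection)
    unknown = chosen - set(CATALOGUE)
    if unknown:
        raise KeyError(f"not in the sample catalogue: {sorted(unknown)}")
    return [
        pair
        for pair in combinations(sorted(chosen), 2)
        if frozenset(pair) in EXCLUSIVE
    ]


if __name__ == "__main__":
    wish_list = ["F-14", "F-15", "UC-5"]
    for first, second in conflicts(wish_list):
        print(f"{first} ({CATALOGUE[first]}) clashes with {second} ({CATALOGUE[second]})")
```

A selection such as F-15, F-17 and O-9 passes this check without any reported clash, whereas asking for both F-14 and F-15 is flagged.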

That’s all, for now.

# What can you do when you have to stay at home?

Most people may not be used to having to stay at home. Due to a soccer (football) injury, I had to stay put for a long time, yet I hardly ever got bored (lonely, at times, yes, but doing things makes one forget about that, be content with one’s own company, and gain lots of new knowledge and experiences along the way). As a silver lining of that, and since I’m missing out on some social activities now as well, I’m compiling a (non-exhaustive) ‘what to do?’ list, which may give you some idea(s) to make good use of the time spent at home, besides working from home if you can or have to. They’re structured in three main categories: enriching the mind, being creative, and exercising the body, and there’s an ‘other’ category at the end.

Enrich the mind

If you haven’t signed up for the library, or aren’t allowed to go there anymore, here are a few sources that may distract you from the flood of COVID-19 news bites:

• Old novels for free: The Gutenberg project, where people have scanned and typed up old books.

Learning

• A new language to read, speak, and write. Currently, the most popular site for that is probably Duolingo. If you’re short on a dictionary: Wordreference is good for, at least, Spanish, Italian, and English, Leo for German<->English, and isiZulu.net for isiZulu<->English, to name but a few.
• A programming language. There are very many free lessons, textbooks, and video lectures for young and old. If you have never done this before, try Python.
• Dance. See ‘exercises’ below.
• Some academic topic. There are several websites with legally free textbooks, such as the Open Textbook Archive, and there is a drive toward open educational resources at several universities, including UCT’s OpenUCT (which also has our departmental course notes on computer ethics), and there are many MOOCs.
• Science experiments at home. Yes, some of those can be done at home, and they’re fun to do. A few suggestions: here (for kids, with household stuff), and here, or here, among many sites.

Be creative

Writing

• Keeping a diary may sound boring, but we live in interesting times. What you’re experiencing now may easily be blurred by whatever comes next. Write it down, so you can look back and reflect on the pandemic later.
• Write stories (though maybe don’t go down the road of apocalypses). You think you’re not creative enough for that? Then try to re-tell GoT to someone who hasn’t seen the series, or write a modern-day version of, say, red riding hood or Romeo & Juliet.
• Write about something else. For instance, writing this blog post took me as much time as I would otherwise have spent on two dance classes, this post took me three evenings + another 2-3 hours to write, and this series of posts eventually evolved into a textbook. Or you can add a few pages to Wikipedia.

Arts

These activities tend to call for lots of materials, but those shops are possibly closed already. The following list is an intersection of supermarket-materials and artsy creations.

• Durable ‘bread’ figures with salt dough, for if you have no clay. Regular dough for bread perishes, but add lots of salt, and after baking it, it will remain good for years. The solid dough allows for many creations.
• Food art with fruit and vegetables (and then eat it, of course); there are pictures for ideas, as well as YouTube videos.
• Paper-folding and cutting to make decorations, like paper doll chains, origami, kirigami.
• Painting with food paints, or make your own paint. For instance, when cooking beetroot, the water turns a very dark red-ish colour, so don’t throw that away. If I recall correctly, onion gives yellow and spinach gives green. This can be used for, among others, painting eggs and water-colour painting on paper. Or take a tea sieve and a toothbrush, cut out a desired figurine, dip the toothbrush in the colour-water and scrape it against the sieve to create small irregular drops and splashes.

• Life-size toilet roll elephant figures… or even toilet roll art (optionally with paper) 😉
• Knitting, sewing and all that. For instance, take some clothes that don’t fit anymore and rework them into something new (trousers into shorts, t-shirt as a top, insert colourful bands on the sides).
• Colourful thread art, which requires only a hammer, nails, and one or more colours of sewing thread.

Exercise that body

one of the many COVID-19 memes (source: passed by on FB)–Let’s try not to gain too much weight.

Barbie memes aside, it is very well possible to exercise at home, even if you have only about 1-2 square meters available. If you don’t: you get double the exercise by moving the furniture out of the way 🙂

• Yoga and pilates. There are several websites with posters and sheets demonstrating moves.
• Gym-free exercises, like running on the spot, push-ups, squats, crunches, and step exercises: make a ‘step’ from two piles of books and a plank, take the kitchen mini-ladder, or go up and down the stairs 20 times. There are several websites with examples of such exercises. If you need weights but don’t have them: fill two 500ml bottles with water or sand. Even the NHS has a page for it, and there are many other sites with ideas.
• Dance. True, for some dance styles, one needs a lot of space. Then again, think [back at/about] the clubs you frequent[ed]: they are crowded and there isn’t a lot of space, but you still manage(d) to dance and get tired. So, this is doable even with a small space available. For instance, the Kizomba World Project: while you’d be late for that now to submit a flashmob video, you still can practice it at home, using their instruction videos and dance together once all this is over. There are also websites with dance lessons (for-payment) and tons of free instruction videos on YouTube (e.g., for Salsa and Bachata—no partner? Search for ‘salsa shines’ or ‘bachata shines’ or footwork that can be done on your own, or try Bollywood or a belly dance workout [disclaimer: I did not watch these videos]).
• Zumba in the living room?

Other

Ontologically an awful category, but well, they still are good for keeping you occupied:

If you have more low-cost ideas that require little resources: please put them in the comments section.

p.s.: I did a good number of the activities listed above, but not all—yet.

# Digital Assistants and AMAs with configurable ethical theories

About a year ago, there was a bit of furore in the newspapers about digital assistants in a smart home, like Amazon Echo’s Alexa, Apple’s Siri, or Microsoft’s Cortana, possibly snitching on you if you’re the marijuana-smoking family member [1,2]. This may be relevant if you live in a conservative state or country, where it is still illegal to do so. Behind it is a multi-agent system that would do some argumentation among the stakeholders (the kids, the parents, and the police). That example sure did get the students’ attention in the computer ethics class I taught last year. It did so too with an undergraduate student (double majoring in compsci and philosophy) who opted to do the independent research module. Instead of the multiple actor scenario, however, we considered it may be useful to equip such a digital assistant, or an artificial moral agent (AMA) more broadly, with multiple moral theories, so that a user would be able to select their preferred theory and let the AMA make the appropriate decision for them on whichever dilemma comes up. This seems preferable over an at-most-one-theory AMA.

For instance, there’s the “Mia the alcoholic” moral dilemma [3]: Mia is disabled and has a new model of the carebot that can fetch her alcoholic drinks in the comfort of her home. At some point, she’s getting drunk but still orders the carebot to bring her one more tasty cocktail. Should the carebot comply? The answer depends on one’s ethical viewpoint. If you answered ‘yes’, you probably would not want to buy a carebot that would refuse to serve you, and vice versa. But how to make the AMA culturally and ethically more flexible, so that it can adjust to the user’s moral preferences?

The first step in that direction has now been made by that (undergrad) research student, George Rautenbach, whom I supervised. The first component is a three-layered approach, with at the top layer a ‘general ethical theory’ model (called Genet) that is expressive enough to be able to model a specific ethical theory, such as utilitarianism, ethical egoism, or Divine Command Theory. This was done for those three and Kantianism, so as to have a few differences in being consequence-based or not, the possible ‘patients’ of the action, the sort of principles, possible thresholds and such. These reside in the middle layer. Then there’s Mia’s egoism, the parent’s Kantian viewpoint about the marijuana, a train company’s utilitarianism to sort out the trolley problem, and so on at the bottom layer, which are instantiations of the respective specific ethical theories in the middle layer.
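To make the three layers a bit more tangible, here is a rough Python sketch of how they stack. It is not the Genet model itself (that one is described in the technical report); all class names, attribute names, and attribute values below are hypothetical and chosen only to mirror the kinds of differences mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class GeneralTheory:
    """Top layer: the attributes that any specific ethical theory must fill in."""
    # Hypothetical attribute names, loosely following the differences listed in the text.
    required: tuple = ("consequence_based", "patients", "principles")


@dataclass
class EthicalTheory:
    """Middle layer: one specific theory, such as ethical egoism or Kantianism."""
    name: str
    attributes: dict

    def conforms_to(self, general: GeneralTheory) -> bool:
        # A theory fits the general model if it provides every required attribute.
        return all(key in self.attributes for key in general.required)


@dataclass
class AgentTheory:
    """Bottom layer: a specific theory instantiated for one person or organisation."""
    agent: str
    theory: EthicalTheory
    settings: dict = field(default_factory=dict)  # e.g., personal thresholds


GENERAL = GeneralTheory()

# Illustrative values only; they are not taken from the published models.
egoism = EthicalTheory(
    "ethical egoism",
    {"consequence_based": True, "patients": ["self"], "principles": ["own interest"]},
)
kantianism = EthicalTheory(
    "Kantianism",
    {"consequence_based": False, "patients": ["all persons"], "principles": ["universalisable maxims"]},
)

mia = AgentTheory("Mia", egoism)
parent = AgentTheory("parent", kantianism)

if __name__ == "__main__":
    for instance in (mia, parent):
        ok = instance.theory.conforms_to(GENERAL)
        print(f"{instance.agent}: {instance.theory.name} (fits general model: {ok})")
```

The sketch stops at the structure on purpose: deciding whether the carebot fetches the drink requires the reasoning component, which this example does not attempt.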

The Genet model was evaluated by demonstrating that those four theories can be modelled with Genet and the individual theories were evaluated with a few use cases to show that the attributes stored are relevant and sufficient for those reasoning scenarios for the individuals. For instance, eventually, Mia’s egoism wouldn’t get her another drink fetched by the carebot, but as a Kantian, she would have been served.

The details are described in the technical report “Toward Equipping Artificial Moral Agents with multiple ethical theories” [4] and the models are also available for download as XML files and an OWL file. To get all this to work in a device, there’s still the actual reasoning component to implement (a few architectures exist for that) and for a user to figure out which theory they actually subscribe to so as to have the device configured accordingly. And of course, there is a range of ethical issues with digital assistants and AMAs, but that’s a topic perhaps better suited for the SIPP (FKA computer ethics) module in our compsci programme [5] and other departments.

p.s.: a genet is also an agile cat-like animal mostly living in Africa, just in case you were wondering about the abbreviation of the model.

References

[1] Swain, F. AIs could debate whether a smart assistant should snitch on you. New Scientist, 22 February 2019. Online: https://www.newscientist.com/article/2194613-ais-could-debate-whether-a-smart-assistant-should-snitch-on-you/ (last accessed: 5 March 2020).

[2] Liao, B., Slavkovik, M., van der Torre, L. Building Jiminy Cricket: An Architecture for Moral Agreements Among Stakeholders. ACM Conference on Artificial Intelligence, Ethics, and Society 2019, Hawaii, USA. Preprint: arXiv:1812.04741v2, 7 March 2019.

[3] Millar, J. An ethics evaluation tool for automating ethical decision-making in robots and self-driving cars. Applied Artificial Intelligence, 30(8):787–809, 2016.

[4] Rautenbach, G., Keet, C.M. Toward equipping Artificial Moral Agents with multiple ethical theories. University of Cape Town. arXiv:2003.00935, 2 March 2020.

[5] Computer Science Department. Social Issues and Professional Practice in IT & Computing. Lecture Notes. 6 December 2019.