It’s been a while since I looked in more detail into the life sciences and healthcare semantics-driven software ecosystems. The problems are largely the same, or more complex, with more technologies and standards to choose from that promise that this time it will be solved once and for all but where practitioners know it isn’t that easy. And lots of tooling for SARS-CoV-2 and COVID-19, of course. I’ll summarise and comment on a few presentations in the remainder of this post.
The first keynote speaker was Karin Verspoor from RMIT in Melbourne, Australia, who focussed her talk on their COVID-SEE tool , a Scientific Evidence Explorer for COVID-19 information that relies on advanced NLP and some semantics to help finding information, notably taking open questions where the sentence is analysed by PICO (population, intervention, comparator, outcome) or part thereof, and using UMLS and MetaMap to help find more connections. In contrast to a well-known domain with well-known terminology to formulate very specific queries over academic literature, that was (and still is) not so for COVID-19. Their “NLP+” approach helped to get better search results.
The second keynote was by Martina Summer-Kutmon from Maastricht University, the Netherlands, who focussed on metabolic pathways and computation and is involved in WikiPathways. With pretty pictures, like the COVID-19 Disease map that culminated from a lot of effort by many research communities with lots of online data resources ; see also the WikiPathways one for covid, where the work had commenced in February 2020 already. She also came to the idea that there’s a lot of semantics embedded in the varied pathway diagrams. They collected 64643 diagrams from the literature of the past 25 years, analysed them with ML, OCR, and manual curation, and managed to find gaps between information in those diagrams and the databases . It reminded me of my own observations and work on that with DiDOn, on how to get information from such diagrams into an ontology automatically . There’s clearly still lots more work to do, but substantive advances surely have been made over the past 10 years since I looked into it.
Then there were Mirjam van Reisen from Leiden UMC, the Netherlands, and Francisca Oladipo from the Federal University of Lokoja, Nigeria, who presented the VODAN-Africa project that tries to get Africa to buy into FAIR data, especially for COVID-19 health monitoring within this particular project, but also more generally to try to get Africans to share data fairly. Their software architecture with tooling is open source. Apart from, perhaps, South Africa, the disease burden picture for, and due to, COVID-19, is not at all clear in Africa, but ideally would be. Let me illustrate this: the world-wide trackers say there are some 3.5mln infections and 90000+ COVID-19 deaths in South Africa to date, and from far away, you might take this at face value. But we know from SA’s data at the SAMRC that deaths are about three times as much; that only about 10% of the COVID-19-positives are detected by the diagnostics tests—the rest doesn’t get tested [asymptomatic, the hassle, cost, etc.]; and that about 70-80% of the population already had it at least once (that amounts to about 45mln infected, not the 3.5mln recorded), among other things that have been pieced together from multiple credible sources. There are lots of issues with ‘sharing’ data for free with The North, but then not getting the know-how with algorithms and outcomes etc back (a key search term for that debate has become digital colonialism), so there’s some increased hesitancy. The VODAN project tries to contribute to addressing the underlying issues, starting with FAIR and the GDPR as basis.
The last keynote at the end of the conference was by Amit Shet, with the University of South Carolina, USA, whose talk focussed on how to get to augmented personalised health care systems, with as one of the cases being asthma. Big Data augmented with Smart Data, mainly, combining multiple techniques. Ontologies, knowledge graphs, sensor data, clinical data, machine learning, Bayesian networks, chatbots and so on—you name it, somewhere it’s used in the systems.
Reporting on the papers isn’t as easy and reliable as it used to be. Once upon a time, the papers were available online beforehand, so I could come prepared. Now it was a case of ‘rock up and listen’ and there’s no access to the papers yet to look up more details to check my notes and pad them. I’m assuming the papers will be online accessible soon (CEUR-WS again presumably). So, aside from our own paper, described further below, all of the following is based on notes, presentation screenshots, and any Q&A on Discord.
Ruduan Plug elaborated on the FAIR & GDPR and querying over integrated data within that above-mentioned VODAN-Africa project . He also noted that South Africa’s PoPIA is stricter than the GDPR. I’m suspecting that is due to the cross-border restrictions on the flow of data that the GDPR won’t have. (PoPIA is based on the GDPR principles, btw).
Deepak Sharma talked about FHIR with RDF and JSON-LD and ShEx and validation, which also related to the tutorial from the preceding day. The threesome Mercedes Arguello-Casteleiro, Chloe Henson, and Nava Maroto presented a comparison of MetaMap vs BERT in the context of covid , which I have to leave here with a cliff-hanger, because I didn’t manage to make a note of which one won because I had to go to a meeting that we were already starting later because of my conference attendance. My bet would be on the semantics (those deep learning models probably need more reliable data than there is available to date).
Besides papers related to scientific research into all things covid, another recurring topic was FAIR data—whether it’s findable, accessible, interoperable, and reusable. Fuqi Xu and collaborators assessed 11 features for FAIR vocabularies in practice, and how to use them properly. Some noteworthy observations were that comparing a FAIR level makes more sense before-and-after changing a single resource compared to pitting different vocabularies against each other, “FAIR enough” can be enough (cf. demanding 100% compliance) , and a FAIR vocabulary does not imply that it is also a good quality vocabulary. Arriving at the topic of quality, César Bernabé presented an analysis on the use of foundational ontologies in bioinformatics by means of a systematic literature mapping. It showed that they’re used in a range of activities of ontology engineering, there’s not enough empirical analysis of the pros and cons of using one, and, for the numbers game: 33 of the ontologies described in the selected literature used BFO, 16 DOLCE, 7 GFO, and 1 SUMO . What to do next with these insights remains to be seen.
Last, but not least—to try to keep the blog post at a sort of just about readable length—our paper, among the 15 that were accepted. Frances Gillis-Webber, a PhD student I supervise, did most of the work surveying OWL Ontologies in BioPortal on whether, and if so how, they take into account the notion of multilingualism in some way. TL;DR: they barely do . Even when they do, it’s just with labels rather than any of the language models, be they the ontolex-lemon from the W3C community group or another, and if so, mainly French and German.
Does it matter? It depends on what your aims are. We use mainly the motivation of ontology verbalisation and electronic health records with SNOMED CT and patient discharge note generation, which ideally also would happen for ‘non-English’. Another use case scenario, indicated by one of the participants, Marco Roos, was that the bio-ontologies—not just health care ones—could use it as well, especially in the case of rare diseases, where the patients are more involved and up-to-date with the science, and thus where science communication plays a larger role. One could argue the same way for the science about SARS-CoV-2 and COVID-19, and thus that also the related bio-ontologies can do with coordinated multilingualism so that it may assist in better communication with the public. There are lots of opportunities for follow-up work here as well.
There were also posters where we could hang out in gathertown, and more data and ontologies for a range of topics, such as protein sequences, patient data, pharmacovigilance, food and agriculture, bioschemas, and more covid stuff (like Wikidata on COVID-19, to name yet one more such resource). Put differently: the science can’t do without the semantic-driven tools, from sharing data, to searching data, to integrating data, and analysis to develop the theory figuring out all its workings.
The conference was supposed to be mainly in person, but then on 18 Dec, the Dutch government threw a curveball and imposed a relatively hard lockdown prohibiting all in-person events effective until, would you believe, 14 Jan—one day after the end of the event. This caused extra work with last-minute changes to the local organisation, but in the end it all worked out online. Hereby thanks to the organising committee to make it work under the difficult circumstances!
 Ruduan Plug, Yan Liang, Mariam Basajja, Aliya Aktau, Putu Jati, Samson Amare, Getu Taye, Mouhamad Mpezamihigo, Francisca Oladipo and Mirjam van Reisen: FAIR and GDPR Compliant Population Health Data Generation, Processing and Analytics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 Mercedes Arguello-Casteleiro, Chloe Henson, Nava Maroto, Saihong Li, Julio Des-Diz, Maria Jesus Fernandez-Prieto, Simon Peters, Timothy Furmston, Carlos Sevillano-Torrado, Diego Maseda-Fernandez, Manoj Kulshrestha, John Keane, Robert Stevens and Chris Wroe, MetaMap versus BERT models with explainable active learning: ontology-based experiments with prior knowledge for COVID-19. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 Fuqi Xu, Nick Juty, Carole Goble, Simon Jupp, Helen Parkinson and Mélanie Courtot, Features of a FAIR vocabulary. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
 César Bernabé, Núria Queralt-Rosinach, Vitor Souza, Luiz Santos, Annika Jacobsen, Barend Mons and Marco Roos, The use of Foundational Ontologies in Bioinformatics. SWAT4HCLS 2022. online/Leiden, the Netherlands, 10-13 January 2022.
Some time last year, a colleague asked about good examples of popular science books, in order to read and thereby to get inspiration on how to write books at that level, or at least for first-year students at a university. I’ve read (and briefly reviewed) ‘quite a few’ across multiple disciplines and proposed to him a few of them that I enjoyed reading. One aspect that bubbled up at the time, is that not all popsci books are of the same quality and, zooming in on this post’s topic: not all popsci books are of the same level, or, likely, do not have the same target audience.
I’d say they range from targeting advancedinterested laypersons to entertaining laypersons. The former entails that you’d be better off having covered the topic at school and an undergrad course or two will help as well for making it an enjoyable read, and be fully awake, not tired, when reading it. For the latter category at the other end of the spectrum: having completed little more than primary school will do fine and no prior subject domain knowledge is required, at all, and it’s good material for the beach; brain candy.
Either way you’ll learn something from any popsci book, even if it’s too little for the time spent reading the book or too much to remember it all. But some of them are much more dense than others. Compare cramming the essence of a few scientific papers in a book’s page to drawing out one scientific paper into a whole chapter. Then there’s humor—or the lack thereof—and lighthearted anecdotes (or not) to spice up the content to a greater or lesser extent. The author writing about fungi recounting eating magic mushrooms, say, or an economist being just as much of a sucker for summer sales in the shops as just about anyone. And, of course, there’s readability (more about that shortly in another post).
Putting all that in the mix, my groupings are as follows, with a selection of positive exemplars that I also enjoyed reading.
There are more popsci books of which I thought they were interesting to read, but I didn’t want to turn it into a laundry list. Also, it seemed that books on politics and society and philosophy and such seem to be deserving their own discussion on categorisation, but that’s for another time. I also intentionally excluded computer science, information systems, and IT books, because I may be differently biassed to those books compared to the out-of-my-own-current-specialisation books listed above. For instance, Dataclysm by Cristian Rudder on Data Science mainly with OKCupid data (reviewed earlier) was of the ‘entertainment’ level to me, but probably isn’t so for the general audience.
Perhaps it is also of use to contrast them to ‘bad’ examples—well, not bad, but I think they did not succeed well in their aim. Two of them are Critical mass by Phillip Ball (physics, social networks), because it was too wordy and drawn out and dull, and This is your brain on music by Daniel Levitin (neuroscience, music), which was really interesting, but very, very, dense. Looking up their scores on goodreads, those readers converge to that view for your brain on music as well (still a good 3.87 our of 5, from nearly 60000 ratings and well over 1500 reviews), as well as for the critical mass one (3.88 from some 1300 ratings and about 100 reviews). Compare that to a 4.39 for the award-wining Entangled life, 4.35 of Why we sleep, and 4.18 for Mama’s last hug. To be fair, not all books listed above have a rating above 4.
Be this as it may, I still recommend all of those listed in the four categories, and hopefully the sort of rough categorisation I added will assist in choosing a book among the very many vying for your attention and time.
Pushing the envelope categorising popsci books
Regarding book categories more generally, romance novels have subgenres, as does science fiction, so why not the non-fiction popsci books? Currently, they’re mostly either just listed (e.g., here or the new releases) or grouped by discipline, but not according to, say, their level of difficulty, humor, whether it mixes science with politics, self-help, or philosophy, or some other quality dimension of the book along which they possibly could be assessed.
As example that the latter might work for assigning attributes to the books: Why we sleep is 100% science but a reader can distill some ideas to practice with as self-help for sleeping better, whereas When: the scientific secrets of perfect timing is, contrary to what the title suggests, largely just self-help. Delusions of gender and Inside rebellion can, or, rather, should have some policy implications, and Why we sleep possibly as well (even if only to make school not start so early in the morning), whereas the sort of content of Elephants on acid already did (ethics review boards for scientific experiments, notably). And if you were not convinced of the presence of animal cognition, then Mama’s last hug may induce some philosophical reflecting, and then have a knock-on effect on policies. Then there are some books that I can’t see having either a direct or indirect effect on policy, such as Gastrophysics and Entangled life.
Let’s play a little more with that idea. What about vignettes composed of something like the followings shown in the table below?
Then a small section of the back cover of Entangled life would look like this, with the note that the humor is probably inbetween the ‘yes’ and ‘some’ (I laughed harder with the book on drunkenness).
Mama’s last hug would then have something like:
And Why we sleep as follows (though I can’t recall for sure now whether it was ‘some’ or ‘no laughing matter’ and a friend has borrowed the book):
Of course, these are just mock-ups to demonstrate the idea visually and to try out whether it is even doable to classify the books. They are. There very well may be better icons than these scruffy ‘take a cc or public domain one and fiddle with it in MS Paint’ or a mixed mode approach, like on the packs of coffee (see image on the right).
Moreover: would you have created the same categorisation for the three examples? What (other) properties of popular science books could useful? Also, and perhaps before going down that route: would something like that possibly be useful according to you or someone you know who reads popular science books? You may leave your comments below, on my facebook page, or write an email, or we can meet in person some day.
p.s.: this is not a serious post on the ontology of popular science books — it is summer vacation time here and I used to write book reviews in the first week of the year and this is sort of related.
Fifteen years is a long time in IT, yet blogging software is still around and working—the same WordPress I started my blog with, even. At the time, in 2006, when WordPress was still only offering blogging functionality, they had the air of being respectable and at least somewhat serious compared to blogspot (redirects to Blogger now) that hosted a larger share of the informal and whimsical blogs. Blogs are not nearly as popular now as they used to be, there seems to be a move to huddle together to take a ride on a branded bandwagon, like Medium and Substack, and all of the blog-providing companies have diversified the services they offer for blogging. WordPress now markets itself as website builder, rather than blogging, software.
One might even be tempted to argue that blogs are (nearly) obsolete, with TikTok and the like having come along over the years. No so, claims a blogger here, some 10 more more bloggers here, and even a necessity according to another that does provide a list of links to data to back it up. (Just maybe don’t try making a living from it—there are plenty of people who like to read, but writing doesn’t pay well.)
Some data for this blog, then. It has 325 published post, there are around 400-600 visitors per month in recent years (depending on the season and posting frequency), there are people still signed up to receive updates (78), some even like some of the posts, and some of them are shared Twitter and other social media. The most visited post of all time got over 21000 visits and counting (since 2011) and the most visited post in the past year (after the home page) still had a fine 355 visitors and is on my research and teaching topic (see also the occasionally updated vox populi). So, obsolete it is not. Admitted, the latter post had its heydays in 2010-2012 with about 2500 visits/year and the former saw its best of times in 2014-2015 (4425 and 4948 visits in each of those years alone, respectively). The best visited post of the mere 10 posts I wrote in 2021 is on bias in ontologies, having attracted the attention of 119 visitors. Summarizing this blog’s stats trends: numbers are down compared to 5-10 years ago, indeed, but insignificant it is not and multiple posts have staying power.
I also can reveal that there’s no clear correlation between the time-to-write and number-of-visits variables, nor between either of them and the post’s topic, and not with post length either. With more time, there would have been more, and more polished, posts. There’s plenty to write about, not only the long overdue posts for published papers that came out at an extra-busy time and therefore have slipped through writing about, but also other interesting research that’s going on and deserves that extra bit of attention, some more book reviews, teaching updates and so on. There’s no shortage of topics to write about, which therewith turned out to be an unfounded worry from 15 years ago.
Will I go on for another 15 years? Perhaps, perhaps not. I’m still fence-sitting, from the very first post in 2006 that summed up the reasons for starting a blog to this day, to give it a try nonetheless and see when and where it will end.
Why still fence-sitting? I still don’t know whether it’s beneficial or harmful to one’s career, and if beneficial, whether the time put into writing those posts could have been used better for obtaining more benefit from those alternative activities than from the blog post writing. What I do know, is that, among others, it has helped me to learn to write better, it made me take notes during conferences in order to write conference reports and therewith engage more productively with a conference, structure ideas and thoughts, and pitch papers. Also, the background searches for fact-checking, adding links, and trying to find pictures made me stumble into interesting detours as well. Some of the posts took a long time to write, but at least they were enjoyable pastimes or worktimes.
Uhm, so, the benefit is to (just?) me? I do hope the posts have been worthwhile to the readers. But, it brings into vision the question that’s well-known to aspiring writers: should I write for myself or for my readers? The answer depends on whom you consult: blog for yourself, says the blogger from paradise, write for another, imaginary, reader persona, says the novelist, and go for bothsideism for the best results according to the writer’s guide. I write for myself, and brush it up in an attempt to increase a post’s appeal. The brushing up mainly concerns the choice of words, phrases, and paragraphs and the ordering thereof, and the images to brighten up some of the otherwise text-only posts (like this one).
After so many years and posts, I ought to be able to say something more profound. It’s really just that, though: the joy of writing the posts, the hope it makes a difference to readers and to what I’ve written about, and the slight worry it may not be the best thing to do for advancing my career.
Be this as it may, over the past few days, I’ve added a bit more structure to the blog to assist readers finding the topics they may be interested in. The key different categories are now also accessible from the ‘Menu’, being work-related topics (research and papers, software, and teaching), posts on writing and publishing, and there are a few posts that belong to neither, which still can be found on the complete list of posts. Happy reading!
p.s.: in case you wondered: yes, I intended to do a reflection when the blog turned a nice round 15 in late March, were it not for that blurry extension to 2020 and lots of extra teaching and teaching admin duties in 2021. The summer break has started now and there’s not much of a chance to properly go on holiday, and writing also counts as leisure activity, so there the opportunity was, just about three months shy of the blog turning 16. (In case the post’s title vaguely rings a bell: yes, there’s that cheesy song from one of the top-5 movie musicals of all time [according to imdb], depicting a happy moment with promise of staying together before Rolfe makes some more bad decisions, but that’s 16 going on 17.)
How to align your domain ontology to a foundational ontology? It’s a well-known question, and one that I’ve looked into before as well. In some of that earlier work, we used DOLCE to align one’s ontology to. We devised the DOLCE decision diagram as part of the FORZA method to assist with the alignment process and implemented that in the MoKI ontology development tool . MoKI is no more, but the theory and the algorithm’s design approach still stand. Instead of re-implementing it as a Protégé plugin and have it go defunct in a few years again (due to incompatible version upgrades, say), it sounded like more fun to design one for BFO and make a stand-alone tool out of it. And that design and the evaluation thereof is precisely what two of my ontology engineering course students—Chiadika Emeruem and Steve Wang—did for their mini-project of the course. That was then finalised and implemented in a tool for general use as part of the DOT4D project extension for my (award-winning) OE textbook afterward.
More precisely, as first part, there’s a diagram specifically for BFO – well, for one of its 2.0-ish versions in existence at least. Deciding on which version to use and what would be good questions was not as trivial as it may sound. While the questions seem to work (as evaluated with several ontologies), it might still be of use to set up an experiment to assess usability from a modeller’s viewpoint.
If you have any questions, please feel free to contact either of us.
 Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.
Writing a book is only one part of the whole process of publishing a book. There’s the actual thing that eventually needs to get out into the wide world. Hard copy? E-book? Print-on-demand? All three or a subset only? Taking a step back: where are you as author located, where are the publisher and the printer, and where is the prospective audience? Is the prospective readership IT savvy enough for e-books to even consider that option? Is the book’s content suitable for reading on devices with a gazillion different screen sizes? Here’s a brief digest from after my analysis paralysis of the too many options where none has it all – not ever, it seems.
I’ve written about book publishing logistics and choices for my open textbook, but that is, well, a textbook. My new book, No Taming of the Enthusiast, is of a different genre and aimed at a broader audience. Also, I’m a little wiser on the practicalities of hard copy publishing. For instance, it took nearly 1.5 months for the College Publications-published textbook to arrive in Cape Town, having travelled all the way from Europe where the publisher and printer are located. Admittedly, these days aren’t the best days for international cargo, but such a delivery time is a bit too long for the average book buyer. I’ve tried buying books with other overseas retailers and book sellers over the past few years—same story. On top of that, in South Africa, you then have to go to the post office to pick up the parcel and pay a picking-up-the-parcel fee (or whatever the fee is for), on top of the book’s cost and shipping fee. And it may get stuck in Customs limbo. This is not a good strategy if I want to reach South African readers. Also, it would be cool to get at least some books all the way onto the shelves of local book stores.
A local publisher then? That would be good for contributing my bit to stimulating the local economy as well. It has the hard copy logistics problem in reverse at least in part, however: how to get the books from so far down south to other places in the world where buyers may be located. Since the memoir is expected to have an international audience as well, some international distribution is a must. This requirement still gives three options: a multinational hard copy publisher that distributes to main cities with various shipping delays, print-on-demand (soft copy distributed, printed locally wherever it is bought), or e-book.
Let’s take the e-books detour for a short while. There is a low percentage of uptake of e-books – some 20% at best – and lively subjective opinions on why people don’t like ebooks. I prefer hard copies as well, but tolerate soft copies for work. Both are useful for different types of use: a hard copy for serious reading and a soft copy for skimming and searching so as to save oneself endless flicking to look up something. It’s happening the same with my textbook as well, to some extent at least: people pay for it to have it nicely printed and bound even though they can do that with the pdf themselves or just read the pdf. For other genres, some are better in print in any case, such as colourful cookbooks, but others should tolerate e-readers quite well, such as fiction when it’s just plain text.
In deciding whether to go for an e-book, I did explore usability and readability of e-books for non-work books to form my own opinion on it. I really tried. I jumped into the rabbit hole of e-reader software with their pros and cons, and settled on Calibre eventually as best fit. I read a fixed-size e-book in its entirety and it was fine, but there was a glitch in that it did not quite adjust to the screen size of the device easily and navigating pages was awkward; I didn’t try to search. I also bought two e-book novels from smashwords (epub format) and tested one for cross-device usability and readability. Regarding the ‘across devices’: I think I deserve to share and read e-books on all my devices when I duly paid for the copyrighted books. And, lo and behold, I indeed could do so across unconnected devices through emailing myself on different email addresses. The flip side of that is that it means that once any epub is downloaded by one buyer (separately, not into e-books software), it’s basically a free-for-all. There are also epub to pdf converters. The hurdles to do so may be enough of a deterrent for an average reader, but it’s not even a real challenge for anyone in IT or computing.
After the tech tests, I’ve read through the first few pages of one of the two epub e-books – and abandoned it since. Although the epub file resized well, and I suppose that’s a pat on the back for the software developers, it renders ugly on the dual laptop/tablet and smartphone I checked it with. It offers not nearly the same neat affordances of a physical book. For the time being, I’ll buy an e-book only if there’s no option to buy a hard copy and I really, really, want to read it. Else to just let it slide – there are plenty of interesting books that are accessible and my reading time is limited.
So, now what for my new book? There is no perfect solution. I don’t want to be an author of something I would not want to read (the e-book), but it can be set up if there’s enough demand for it. Then, for the hard copies route, if you’re not already a best-selling author or a VIP who dabbles in writing, it’s not possible to get it both published ‘fast’ – in, say, at most 6 months cf. the usual 1.5-2 years with a traditional publisher – and have it distributed ‘globally’. Even if you are quite the hotshot writer, you have to be rather patient and contend with limited reach.
Then what about me, as humble award-wining textbook writer who wrote a memoir as well, and who can be patient but generally isn’t for long? First, I still prefer hard copies first and foremost nonetheless. Second, there’s the decision to either favour local or global in the logistics. Eventually, I decided to favour local and found a willing South African publisher, Porcupine Press, to publish it under their imprint and then went for the print-on-demand for elsewhere. PoD will take a few days lead time for an outside-South-Africa buyer, but that’s little compared to international shipping times and costs.
How to do the PoD? A reader/buyer need not worry and simply will be able to buy it from the main online retailers later in the upcoming week, with the exact timing depending on how often they run their batch update scripts and how much manual post-processing they do.
From the publishing and distribution side: it turns out someone has thought about all that already. More precisely, IngramSpark has set up an international network of local distributors that has a wider reach than, notably, KDP for the Kindle, if that floats your boat (there are multiple comparisons of the two on many more parameters, e.g., here and here). You load the softcopy files onto their system and then they push it into some 40000 outlets, including the main international ones like Amazon and multiple national ones (e.g., Adlibris in Sweden, Agapea in Spain). Anyway, that’s how it works in theory. Let’s see how that works in practice. The ‘loading onto the system’ stage started last week and should be all done some time this upcoming week. Please let me know if it doesn’t work out; we’ll figure something out.
Meanwhile for people in South Africa who can’t wait for the book store distribution that likely will take another few weeks to cover the Joburg/Pretoria and Cape Town book shops (an possibly on the shelf only in January): 1) it’s on its way for distribution through the usual sites, such as TakeALot and Loot, over the upcoming days (plus some days that they’ll take to update their online shop); 2) you’ll be able to buy it from the Porcupine Press website once they’ve updated their site when the currently-in-transit books arrive there in Gauteng; 3) for those of you in Cape Town, and where the company that did the actual printing is located (did I already mention logistics matter?): I received some copies for distribution on Thursday and I will bring copies to the book launch next weekend. If the impending ‘family meeting’ is going to mess up the launch plans due to an unpleasant more impractical adjusted lockdown level, or you simply can’t wait: you may contact me directly as well.
With increasing student numbers, but not as much more funding for schools and universities, and the desire to automate certain tasks anyhow, there have been multiple efforts to generate and mark educational exercises automatically. There are a number of efforts for the relatively easy tasks, such as for learning a language, which range from the entry level with simple vocabulary exercises to advanced ones of automatically marking essays. I’ve dabbled in that area as well, mainly with 3rd-year capstone projects and 4th-year honours project student projects . Then there’s one notch up with fact recall and concept meaning recall questions, and further steps up, such as generating multiple-choice questions (MCQs) with not just obviously wrong distractors but good distractors to make the question harder. There’s quite a bit of work done on generating those MCQs in theory and in tooling, notably [2,3,4,5]. As a recent review  also notes, however, there are still quite a few gaps. Among others, about generalisability of theory and systems – can you plug in any structured data or knowledge source to question templates – and the type of questions. Most of the research on ‘not-so-hard to generate and mark’ questions has been done for MCQs, but there are multiple of other types of questions that also should be doable to generate automatically, such as true/false, yes/no, and enumerations. For instance, with an axiom such as in a ontology or knowledge graph, a suitable question generation system may then generate “Does an impala live on land?” or “True or false: An impala lives on land.”, among other options.
We set out to make a start with tackling those sort of questions, for the type-level information from an ontology (cf. facts in the ABox or knowledge graph). The only work done there, when we started with it, was for the slick and fancy Inquire Biology , but which did not have their tech available for inspection and use, so we had to start from scratch. In particular, we wanted to find a way to be able to plug in any ontology into a system and generate those non-MCQ other types of educations questions (10 in total), where the questions generated are at least grammatically good and for which the answers also can be generated automatically, so that we get to automated marking as well.
Different types of questions and the answer they have to provide put different prerequisiteson the content of the ontology with certain types of axioms. We specified those for 10 types of educational questions.
Three strategies of question generation were devised, being ‘simple’ from the vocabulary and axioms and plug it into a template, guided by some more semantics in the ontology (a foundational ontology), and one that didn’t really care about either but rather took a natural language approach. Variants were added to cater for differences in naming and other variations, amounting to 75 question templates in total.
The human evaluation with questions generated from three ontologies showed that while the semantics-based one was slightly better than the baseline, the NLP-based one gave the best results on syntactic and semantic correctness of the sentences (according to the human evaluators).
It was tested with several ontologies in different domains, and the generalisability looks promising.
To be honest to those getting their hopes up: there are some issues that cause it never to make it to the ‘100% fabulous!’ if one still wants to designs a system that should be able to take any ontology as input. A main culprit is naming of elements in the ontology, which varies widely across ontologies. There are several guidelines for how to name entities, such as using camel case or underscores, and those things easily can be coded into an algorithm, indeed, but developers don’t stick to them consistently or there’s an ontology import that uses another naming convention so that there likely will be a glitch in the generated sentences here or there. Or they name things within the context of the hierarchy where they put the class, but in the question it is out of that context and then looks weird or is even meaningless. I moaned about this before; e.g., ‘American’ as the name of the class that should have been named ‘American Pizza’ in the Pizza ontology. Or the word used for the name of the class can have different POS tags such that it makes the generated sentence hard to read; e.g., ‘stuff’ as a noun or a verb.
Bias in models in the area of Machine Learning and Deep Learning are well known. They feature in the news regularly with catchy headlines and there are longer, more in-depth, reports as well, such as the Excavating AI by Crawford and Paglen and the book Weapons of Math Destruction by O’Neil (with many positive reviews). What about other types of ‘models’, like those that are not built in a data-driven bottom-up way from datasets that happen to lie around for the taking, but that are built by humans? Within Artificial Intelligence still, there are, notably, ontologies. I searched for papers about bias in ontologies, but could find only one vision paper with an anecdote for knowledge graphs , one attempt toward a framework but looking at FOAF only , which is stretching it a little for what passes as an ontology, and then stretching it even further, there’s an old one of mine on bias in relation to conceptual data models for databases .
We simply don’t have bias in ontologies? That sounds a bit optimistic since it’s pervasive elsewhere, and at least worthy of examination whether there is such notion as bias in ontologies and if so, what the sources of that may be. And, if one wants to dig deeper, since Ontology: what is bias anyhow? The popular media is much more liberal in the use of the term ‘bias’ than scientific literature and I’m not going to answer that last question here now. What I did do, is try to identify sources of bias in the context of ontologies and I took a relevant selection of Dimara et al’s list of 154 biases  (just like only a subset is relevant to their scope) to see whether they would apply to a set of existing ontologies in roughly the same domain.
The outcome of that exploratory analysis , in short, is: yes, there is such notion as bias in ontologies as well. First, I’ve identified 8 types of sources, described them, and illustrated them with hand-picked examples from extant ontologies. Second, I examined the three COVID-19 ontologies (CIDO, CODO, COVoc) on possible bias, and they exhibited different subsets indeed.
The sources can be philosophical, by purpose (commonly known as encoding bias), and ‘subject domain’ source, such as scientific theory, granularity, linguistic, social-cultural, political or religious, and economic motivations, and they may be explicit choices or implicit.
An example of an economic motivation is to (try to) categorise some disorder as a type of disease: there latter gets more resources for medicines, research, treatments and is more costly for insurers who’s rather keep it out of the terminology altogether. Or modifying the properties of a disease or disorder in the classification in the medical ontology so that more people will be categorised as having the disorder even when they don’t. It has happened (see paper for details). Terrorism ontologies can provide ample material for political views to creep in.
Besides the hand-picked examples, I did assess the three COVID-19 ontologies in more detail. Not because I wanted to pick on them—I actually think it’s laudable they tried in trying times—but because they were developed in the same timeframe by three different groups in relative isolation from each other. I looked at both the sources, which can be argued to be present and identified some from a selection of Dimara et al’s list, such as the “mere exposure/familiarity” bias and “false consensus” bias (see table below). How they are present, is also described in that same paper, entitled “An exploration into cognitive bias in ontologies”, which has recently been accepted at the workshop on Cognition And OntologieS V (CAOS’21), which is part of the Joint Ontology Workshops Episode VII at the Bolzano Summer of Knowledge.
Will it matter for automated reasoning when the ontologies are deployed in various information systems? For reasoning over the TBox only, perhaps not so much, or, at least, any inconsistencies that it would have caused should have been detected and discussed during the ontology development stage, rather.
Will it matter for, say, annotating data or literature etc? Some of it yes, for sure. For instance, COVoc has only ‘male’ in the vocabulary, not female (in line with a well-known issue in evidence-based medicine), so when it is used for the “scientific literature triage” they want to, then it’s going to be even harder to retrieve COVID-19 research papers in relation to women specifically. Similarly, when ontologies are used with data, such as for ontology-based data access, bias may have negative effects. Take as example CIDO’s optimism bias, where a ‘COVID-19 experimental drug in a clinical trial’ is a subclass of ‘COVID-19 drug’, and this ontology would be used for OBDA and data integration, as illustrated in the following use case scenario with actual data from the ClinicalTrials database and the FDA approved drugs database:
The data together with the OBDA-enabled reasoner will return ‘hydroxychloroquine’, which is incorrect and the error is due to the biased and erroneous class subsumption declared in the ontology, not the data source itself.
Some peculiarities of content in an ontology may not be due to an underlying bias, but merely a case of ‘ran out of time’ rather than an act of omission due to a bias, for instance. Or it may not be an honest mistake due to bias but a mistake because of some other reason, such as due to having clicked erroneously on a wrong button in the tool’s interface, say, or having misunderstood the modelling language’s features. Disentangling the notion of bias from attendant ontology quality issues is one of the possible avenues of future work. One also can have a go at those lists and mini-taxonomies of cognitive biases and make a better or more comprehensive one, or to try to harmonise the multitude of definitions of what bias is exactly. Methods and supporting software may also assist ontology developers more concretely further down the line. Or: there seems to be enough to do yet.
Lastly, I still hope that I’ll be allowed to present the paper in person at the CAOS workshop, but it’s increasingly looking less and less likely, as our third wave doesn’t seem to want to quiet down and Italy is putting up more hurdles. If not, I’ll try to make a fancy video presentation.
Notwithstanding, we did take next steps and obtained some advances in the meantime, which resulted in a substantially extended CNL, called CLaRO v2. The paper describing how it came about has been accepted recently at the 7th Controlled Natural Language Workshop (CNL2020/21), which will be held on 8-9 September in Amsterdam, The Netherlands, in hybrid mode.
So, what is it about, being “new and improved!” compared to the first version? The first version was created in a bottom-up fashion based on a dataset of 234 competency questions  in a few domains only. It turned out alright with decent performance on coverage for unseen questions (88% overall) and very significantly outperforming the others, but there were some nagging doubts about the feasibility of bottom-up approaches to template development, which are essentially at the heart of every bottom-up approach: questions aboutrepresentativeness and quality of the source data. We used more questions as basis to work from than others and had better coverage, but would coverage improve further then still with even more questions? Would it matter for coverage if the CQs were to come from more diverse subject domains? Also, upon manual inspection of the original CQs, it could be seen that some CQs from the dataset were ill-formed, which propagated through to the final set of templates of CLaRO. Would ‘cleaning’ the source data to presumably better quality templates improve coverage?
One of the PhD students I supervise, Mary-Jane Antia, set out to find answer to these questions. CQs were cleaned and vetted by a linguist, the templates recreated and compared and evaluated—this time automatically in a new testing pipeline. New CQs for ontologies were sourced by searching all over the place and finding some 70, to which we added 22 more variants by tweaking wording of existing CQs such that they still would be potentially answerable by an ontology. They were tested on the templates, which resulted in a lower than ideal percentage of coverage and so new templates were created from them, and yet again evaluated. The key results:
An increase from 88% for CLaRO v1 to 94.1% for CLaRO v2 coverage.
The new CLaRO v2 has 147 main templates and another 59 variants to cater for minor differences (e.g., singular/plural, redundant words), up from 93 and 41 in CLaRO.
Increasing the number of domains that the CQs were drawn from had a larger effect on the CQ coverage than cleaning the source data.
I will try to make it to Amsterdam where CNL’21 will take place, but travel restrictions aren’t cooperating with that plan just yet; else I’ll participate virtually. Mary-Jane will present the paper, and also for her, despite also having funding for the trip, it increasingly looks like a virtual presentation. On the bright side: at least there is a way to participate virtually.
With another level 4 lockdown and a curfew from 9pm for most of July, I eventually gave in and decided to buy a TV, for some diversion with the national TV channels. In the process of buying, it appeared that here in South Africa, you have to have a valid paid-up TV licence to be allowed to buy a TV. I had none yet. So there I was in the online shopping check-out on a Sunday evening being held up by a message that boiled down to a ‘we don’t recognise your ID or passport number as having a TV licence’. As advances in the state’s information systems would have it, you can register for a TV licence online and pay with credit card to obtain one near-instantly. The interesting question from an IT perspective then was: how long will it take for the online retailer to know I duly registered and paid for the licence? In other words: are the two systems integrated and if so, how? It definitely is not based on a simple live SPARQL query from the retailer to a SPARQL endpoint of the TV licences database, as I still failed the retailer’s TV licence check immediately after payment of the licence and confirmation of it. Some time passed with refreshing the page and trying again and writing a message to the retailer, perhaps 30-45 minutes or so. And then it worked! A periodic data push or pull it is then, either between the licence database and the retailer or within the state’s back-end system and any front-end query interface. Not bad, not bad at all.
One may question from a privacy viewpoint whether this is the right process. Why could I not simply query by, say, just TV licence number and surname, but having had to hand over my ID or passport number for the check? Should it even be the retailer’s responsibility to check whether their customer has paid the tax?
There are other places in the state’s systems where there’s some relatively advanced integration of data between the state and companies as well. Notably, the SA Revenue Service (SARS) system pulls data from any company you work for (or they submit that via some ETL process) and from any bank you’re banking with to check whether you paid the right amount (if you owe them, they send the payment order straight to your bank, but you still have to click ‘approve’ online). No doubt it will help reduce fraud, and by making it easier to fill in tax forms, it likely will increase the amount collected and will cause less errors that otherwise may be costly to fix. Clearly, the system amounts to reduced privacy, but it remains within the legal framework—someone trying to evade paying taxes is breaking the law, rather—and I support the notion of redistributive taxation and to achieve that will as little admin as possible.
These examples do raise broader questions, though: when is data integration justified? Always? If not always, then when is it not? How to ensure that it won’t happen when it should not? Who regulates data integration, if anyone? Are there any guidelines or a checklist for doing it responsibly so that it at least won’t cause unintentional harm? Which steps in the data integration, if any, are crucial from a responsibility and ethical point of view?
No good answers
I did search for academic literature, but found only one paper mentioning we should think of at least some of these sort of questions . There are plenty of ethics & Big Data papers (e.g., [2,3]), but those papers focus on the algorithms let loose on the data and consequences thereof once the data has been integrated, rather than yes/no integration or any of the preceding integration processes themselves. There are, among others, data cleaning, data harmonisation and algorithms for that, schema-based integration (LAV, GAV, or GLAV), conceptual model-based integration, ontology-driven integration, possibly recurring ETL processes and so on, and something may go wrong at each step or may be the fine-grained crucial component of the ethical considerations. I devised one toy example in the context of ontology-based data access and integration where things would go wrong because of a bias  in that COVID-19 ontology that has data integration as its explicit purpose . There are also informal [page offline dd 25-7-2021] descriptions of cases where things went wrong, such as the data integration issues with the City of Johannesburg that caused multiple riots in 2011, and no doubt there will be more.
Taking the ‘non-science’ route further to see if I could find something, I did find a few websites with some ‘best practices’ and ‘guidelines’ for data integration (e.g., here and here), with the brand new and most comprehensive set of data integration guidelines at end-user level by UN’s ESCAP that focuses on data integration for statistics offices on what to do and where errors may creep in . But that’s all. No substantive hits with ‘ethics in data integration’ and similar searches in the academic literature. Maybe I’m searching in the wrong places. Wading through all ‘data ethics’ papers to find the needle in the haystack may have to be done some other time. If you know of scientific literature that I missed specifically regarding data integration, I’d be most grateful if you’d let me know.
The ‘recurring reliables’ for issues: health and education
Meanwhile, to take a step toward an answer of at least a subset of the aforementioned questions, let me first mention two other recent cases, also from South Africa, although the second issue happened in the Netherlands as well.
The first one is about healthcare data. I’m trying to get a SARS-CoV-2 vaccine. Registration for the age group I’m in opened on the 14th in the evening and so I did register in the state’s electronic vaccination data system (EVDS), which is the basic requirement for getting a vaccine. The next day, it appeared that we could book a slot via the health insurance I’m a member of. Their database and the EVDS are definitely not integrated, and so my insurer spammed me for a while with online messages in red, via email, and via SMS that I should register with the EVDS, even though I had already done that well before trying out their app.
Perhaps the health data are not integrated because it’s health; perhaps it was just time pressure to not delay the SARS-CoV-2 vaccination programme rollout. For some sectors, such as the basic education sector and then the police, they got loaded into the EVDS by the respective state department in one go via some ETL process, rather than people having to bother with individual registration. ID number, names, health insurance, dependants, home address, phone number, and whatnot that the EVDS asked for. And that regardless whether you want the vaccine or not—at least most people do. I don’t recall anyone having had a problem with that back-end process that it happened, aside from reported glitches in the basic education sectors’ ETL process, with reports on missing foreign national teachers and employees of independent schools who wanted in but weren’t.
Both the IT systems for vaccination management and any app for a ‘pass’ for having been vaccinated enjoys some debates on privacy internationally. Should they be self-standing systems? If it is allowed some integration, then with what? Should a healthcare provider or insurer be informed of the vaccination status of a member (and, consequently, act accordingly, whatever that may be), only if the member voluntarily discloses it (like with the vaccination scheduling app), or never? One’s employer? The movie theatre or mall you may want to enter? Perhaps airline companies want access to the vaccine database as well, who could choose to only let vaccinated people on their planes? The latter happens with other vaccinations for sure; e.g., yellow fever vaccination proof to enter SA from some countries, which the airline staff did ask for when I checked in in Argentina when travelling back to SA in 2012. That vaccination proof had gone into the physical yellow fever vaccination booklet that I carried with me; no app was involved in that process, ever. But now more things are digital. Must any such ‘covid-19 pass’ necessarily be digital? If so, who decides who, if anyone, will get access to the vaccination data, be it the EVDS data in SA or their homologous systems in other countries? To the best of my knowledge, no regulations exist yet. Since the EVDS is an IT system of the state, I presume they will decide. If they don’t, it will be up to the whims of each company, municipality, or province, and then is bound to generate lots of confusion among people.
The other case of a different nature comes in the news regularly; e.g., here, here, and here. It’s the tension that exists between children’s right to education and the paperwork to apply for a school. This runs into complications when they have an “undocumented” status, be it because of an absent birth certificate or their and their parent’s status as legal/illegal and their related ID documents or the absence thereof. It is forbidden for a school to contact Home affairs to get the prospective pupil’s and their respective parents’/guardians’ status, and for Home Affairs to provide that data to the schools, let alone integrate those two database at the ministerial level. Essentially, it is an intentional ‘Chinese wall’ between the two databases: the right to education of a child trumps any possible violation of legality of stay in the country or missing paperwork of the child or their parents/guardians.
Notwithstanding, exclusive or exclusionary schools try to filter them out by other means, such as by demanding that sort of data when you want to apply for admission; here’s an example, compared to public schools where evidence of an application for permission to stay suffices or at least evidence of efforts to engage with Home Affairs will do already. When the law says ‘no’ to the integration, how can you guarantee it won’t happen, neither through the software nor by other means (like by de facto requiring the relevant data stored in the Home Affairs database in an admission form)? Policing it? People reporting it somewhere? Would requesting such information now be a violation of the Protection of Personal Information Act (POPIA) that came into force on the 1st of July, since it asks for more personal data than needed by law?
These cases—TV licence, SARS (the tax, not the syndrome), vaccine database, school admissions—are just a few anecdotes. Data integration clearly is not always allowed and when it is not, it has been a deliberate decision not to do so because its outcome is easy to predict and deemed unwanted. Notably for the education case, it is the government who devised the policy for a regulatory Chinese wall between its systems. The TV licence appears to lie at the other end of the spectrum. The broadcasting act of 1999 implicitly puts the onus on the seller of TVs: the licence is not a fee to watch public TV, it is a thing to give the licence holder the right to use a TV (article 27, if you must know), so if you don’t have the right to have it, then you can’t buy it. It’s analogous to having to be over 18 to buy alcohol, where the seller is held culpable if the buyer isn’t. That said, there are differences in what the seller requests from the customer: Makro requires the licence number only and asks for ID only if you can’t remember the licence number so as to ‘help you find it’, whereas takealot demands both ID and licence in any case, and therewith perhaps is then asking for more than strictly needed. Either way, since any retailer thus should be able to access the licence information instantly to check whether you have the right to own a TV, it’s a bit like as if “come in and take my data” is written all over the TV licence database. I haven’t seen any news articles about abuse.
For the SARS-CoV-2 vaccine and the EVDS data, there is, to the best of my knowledge, no specific regulation in place from the EVDS to third parties, other than that vaccination is voluntary and there is SA’s version of the GDPR, the aforementioned POPIA, which is based on the GDPR principles. I haven’t seen much debate about organisations requiring vaccination, but they can make vaccination mandatory if they want to, from which follows that there will have to be some data exchange either between the EVDS and third parties or from EVDS to the person and from there to the company. Would it then become another “come in and take my data”? We’ll cross that bridge when it comes, I suppose; coverage is currently at about 10% of the population and not everyone who wants to could get vaccinated yet, so we’re still in a limbo.
What could possibly go wrong with widespread access, alike with the TV licence database? A lot, of course. There are the usual privacy and interoperability issues (also noted here), and there are calls even in the laissez faire USA to put a framework in place to provide companies with “standards and bounds”. They are unlikely going to be solved by the CommonPass of the Commons Project bottom-up initiative, since there are so many countries with so many rules on privacy and data sharing. Interoperability between some systems is one thing; one world-wide system is another cup of tea.
What all this boils down to is not unlike Moshe Vardi’s argument, in that there’s the need for more policy to reduce and avoid ethical issues in IT, AI, and computing, rather than that computing would be facing an ethics crisis . His claim is that failures of policy cause problems and that the “remedy is public policy, in the form of laws and regulations”, not some more “ethics outrage”. Presumably, there’s no ethics crisis, of the form that there would be a lack of understanding of ethical behaviour among computer scientists and their managers. Seeing each year how students’ arguments improve between the start of the ethics course and at the end in the essay and exam, I’d argue that basic sensitization is still needed, but on the whole, more and better policy could go a long way indeed.
More research on possible missteps in the various data integration processes would also be helpful, and that from a technical angle, as would learning from case studies be, and contextual inquiries , as well as a rigorous assessment on possible biases, alike it was examined for software development processes . Those outcomes then may end up as a set of guidelines for data integration practitioners and the companies they work for, and inform government to devise policies. For now, the ESCAP guidelines  probably will be of most use to a data integration practitioner. It won’t catch all biases and algorithmic issues & tools and assumes one is allowed to integrate already, but it is a step in the direction of responsible data integration. I’ll think about it a bit more, too, and for the time being I won’t bother my students with writing an essay about ethics of data integration just yet.
From bites to bytes or, more precisely, from foods to formalisations, and that sprinkled with a handful of humanities and a dash of design. It does add up. The road I travelled into computer science has nothing to do with any ‘gender blabla’, nor with an idealistic drive to solve the world food problem by other means, nor that I would have become fed up with the broad theme of agriculture. But then what was it? I’m regularly asked about that road into computer science, for various reasons. There are those who are curious or nosy, some deem it improbable and that I must be making it up, and yet others chiefly speculate about where I obtained the money from to pay for it all. So here it goes, in a fairly large write-up since I did not take a straight path, let alone a shortcut.
If you’ve seen my CV, you know I studied “Food Science, free specialisation” at Wageningen University in the Netherlands. It is the university to go to for all things to do with agriculture in the broad sense. Somehow I made it into computer science, but it was not there. The motivation does come from there, thanks to it being at the forefront of science and such has an ambiance that facilitates exposure to a wide range of topics and techniques within the education system and among fellow students. (Also, it really was the best quality education I ever had, which deserves to be said—and I’ve been around to have ample comparison material.)
Perhaps it is conceivable to speculate that all the hurdles with mathematics and PC use when I was young were the motivation to turn to computing. Definitely not. Instead, it happened when I was working on my last, and major, Master’s thesis in the Molecular Ecology section of the Laboratory of Microbiology at Wageningen University, having drifted away a little from microbes in food science.
My thesis topic was about trying to clean up chemically contaminated soil by using bacteria that would eat the harmful compounds, rather than cleaning up the site by disrupting the ecosystem with excavations and chemical treatments of the soil. In this case, it was about 3-chlorobenzoate, which is an intermediate degradation product from, mainly, spilled paint that had been going on since the 1920s and said molecule substantially reduces growth and yield of maize, which is undesirable. I set out to examine a bunch of configurations of different amounts of 3-chlorobenzoate in the soil together with the Pseudomonas B13 bacteria and distance to the roots of the maize plants and their effects on the growth of the maize plants. The bacteria were expected to clean up more of the 3-chlorobenzoate in the area nearby the roots (the rhizosphere), and there were some questions about what the bacteria would do once the 3-chlorobenzoate ran out (mainly: will they die or feed on other molecules?).
The birds-eye view still sounds interesting to me, but there was a lot of boring work to do to find the answer. There were days that the only excitement was to open the stove to see whether my beasts had grown on the agar plate in the petri dish; if they had (yay!), I was punished with counting the colonies. Staring at dots on the agar plate in the petri dish and counting them. Then there were the analysis methods to be used, of which two turned out to be crucial for changing track, mixed with a minor logistical issue to top it off.
First, there was the PCR technique to sequence genetic material, which by now during COVID-19 times, may be a familiar term. There are machines that do the procedure automatically. In 1997, it was still a cumbersome procedure, which took about a day near non-stop work to sequence the short ribosomal RNA (16S rRNA) strand that was extracted from the collected bacteria. That was how we could figure out whether any of those white dots in the petri dish were, say, the Pseudomonas B13 I had inoculated the soil with, or some other soil bacteria. You extract the genetic material, multiply it, sequence it and then compare it. It was the last step that was the coolest.
The average number of base pairs of the 16S rRNA of a bacterium is around 1500 base pairs which is represented as a sequence of some 1500 capital letters consisting of A’s, C’s, G’s, and U’s. For comparison: the SARS-CoV-2 genome is about 30000 base pairs. You really don’t want to compare either one by hand against even one other similar sequence of letters, let alone manually checking your newly PCR-ed sequence against many others to figure out which bacteria you likely had isolated or which one is phylogenetically most closely related. Instead, we sent the sequence, as a string of flat text with those ACGU letters, to a database called the RNABase and we received an answer with a list of more or less likely matches within a few hours to a day, depending on the time of submitting it to the database.
It was like magic. But how did it really do that? What is a database? How does it calculate the alignments? And since it can do this cool stuff that’s not doable by humans, what else can you do with such techniques to advance our knowledge about the world? How much faster can science advance with these things? I wanted to know. I needed to know.
The other technique I had to work with was not new to me, but I had to scale it up: the High-Performance Liquid Chromatography (HPLC). You give the machine a solution and it separates out the component molecules, so you can figure out what’s in the solution and how much of it is in there. Different types of molecules stick to the wall of the tube inside the machine at different places. The machine then spits out the result as a graph, where different peaks scattered across the x axis indicate different substances in the solution and the size of the peak indicates the concentration of that molecule in the sample.
I had taken multiple soil samples closer and father away from the rhizosphere of different boxes with maize plants with different treatments of the soil, rinsed it and tested the solution in the HPLC. The task then was to compare the resulting graphs to see if there was a difference in treatment. Having printed them all out, they covered a large table of about 1.5 by 2 meter, and I had to look closely at them and try to do some manual pattern matching on the shape and size of the graphs and sub-graphs. There was no program that could compare graphs automatically. I tried to overlay printouts and hold them in front of the ceiling light. With every printed graph about the size of 20x20cm, you can calculate how many I had and how many 1-by-1 comparisons that amounts to (this is left as an exercise to the reader). It felt primitive, especially considering all the fancy toys in the lab and on the PC. Couldn’t those software developers not also develop a tool to compare graphs?! Now that would have been useful. But no. If only I could develop such a useful tool myself; then I would not have to wait on the software developers until they care to develop it.
On top of that manual analysis was that it seemed unfair that I had to copy the data from the HPLC machine in the basement of the building onto a 3.5 inch floppy disk and walk upstairs to the third floor to the shared MSc thesis students’ desktop PCs to be able to process it, whereas the PCR data was accessible from my desktop PC even though the PCR machine was on the ground floor. The PC could access the internet and present data from all over the world, even, so surely it should be able to connect to the HPLC downstairs?! Enter questions about computer networks.
The first step in trying to get some answers, was to inquire with the academics in the department. “Maybe there’s something like ‘theoretical microbiology’, or whatever it’s called that focuses on data analysis and modelling of microbiology? It is the fun part of the research—and avoids lab work?”, I asked my supervisor and more generally in the lab. “Not really,”, was the answer, continuing “ok, sure, there is some, but theory-only without the evidence from experiments isn’t it.” Despite all the advanced equipment, of which computing is an indispensable component, they still deemed that wetlab research trumped solely theory and computing. “Those technologies are there to assist answering faster the new and more advanced questions, but not replace the processes”, I was told.
Sigh. Pity. So be it, I supposed. But I still wanted answers to those computing questions. I also wanted to do a PhD in microbiology and then probably move to some other discipline, since I sensed that possibly after another 4-6 years I might become bored with microbiology. Then there was the logistical issue that I still could not walk well, which made wetlab work difficult; hence, it would make obtaining a PhD scholarship harder. Lab work was a hard requirement for a PhD in microbiology and it wasn’t exactly the most exciting part of studying bacteria. So, I might as well swap to something else straight away then. Since there were those questions in computing that I wanted answers to, there we have the inevitable conclusion to move to greener, or at least as green, pastures.
How to obtain those answers in computing? Signing up for a sort of ‘top up’ degree for the computing aspects would be nice, so as to do that brand new thing called bioinformatics. There were no such top-up degrees in the Netherlands at the time and the only one that came close was a full degree in medical informatics, which is not what I wanted. I didn’t want to know about all the horrible diseases people can get.
The only way to combine it, was to enrol in the 1st year of a degree in computing. The snag was the money. I was finishing up my 5 years of state funding for the master’s degree (old system, so it included the BSc) and the state paid for only one such degree. The only way to be able to do it, was to start working, save money, and pay for it myself at some point in the near future once I’d have enough money. Going into IT in industry out in the big wide world sounded somewhat interesting as second-choice option, since it should be easier with such skills to work anywhere in the world, and I still wanted to travel the world as well.
Once I finished the thesis in molecular ecology and graduated with a master’s degree in January 1998, I started looking for work whilst receiving unemployment benefit. IT companies only offered ‘conversion’ courses, such as a crash course in Cobol—the Y2K bug was alive and well—or some IT admin course, including Microsoft Certified System Engineer program (MCSE), with the catch that you’d have to keep working for the IT company for 3 years to pay off the debt of that training. That sounded like bonded labour and not particularly appealing.
Some day flicking through the newspapers on the lookout for interesting job offers, an advertisement caught my eye: a conversion course over a year for an MCSE consisting of five months full-time training and the rest of the year a practice period in industry whilst maintaining one’s unemployment benefit whose amount was just about sufficient to get by, and then all was paid off. A sizeable portion of funding came from the European Union. The programme was geared toward giving a second chance for basket cases, such as the long-term unemployed and the disabled. I was not a basket case, not yet at least. I tried nonetheless, applied for a position, and was invited for an interview. My main task was to try to convince them that I was basket case-like enough to qualify to be accepted in the programme, but good enough to pass fast and with good marks. The arguments worked and I was accepted for the programme. A foothold in the door.
We were a class of 16 people, 15 men and me the only woman. I completed the MCSE successfully, and then I also completed a range of other vocational training courses whilst employed in various IT jobs. Unix system administration, ITIL service management, a bit of Novell Netware and Cisco, and some more online self-study training sessions, which were all paid for by the companies I was employed at. The downside with those trainings, is that they all were, in my humble opinion, superficial and the how-to technology changes fast and the prospect or perpetual rote learning did not sound appealing to me. I wanted to know the underlying principles so that I wouldn’t have to keep updating myself with the latest trivia modification in an application. It was time to take the next step.
I was working for Eurologic Systems in Dublin, Ireland, at the time as a systems integration test engineer for fibre channel storage enclosures, which are boxes with many hard drives stacked up and connected for fast access to lots of data stored on the disks. They were a good employer, but they had only few training opportunities since it was an R&D company with experienced and highly educated engineers. I asked HR if I could sign up elsewhere, with, say, the Open University, and that they’d pay for some of it, maybe? “Yes,” the humane HR lady said, “that’s a good idea, and we’ll pay for every course you pass whilst in our employment.” Deal!
So, I enrolled with the Open University UK. I breezed through my first year even though I had skipped their 1st year courses and jumped straight into 2nd year courses. My second year went just as smoothly. The third year I paid myself, since I had opted for voluntary redundancy and was allowed to take it in the second round, since I wanted to get back on track of my original plan to go into bioinformatics. The dotcom bubble had burst and Eurologic could not escape some of its effects. While they were not fond of seeing me go, they knew I’d leave soon anyway and they were happy to see that the redundancy money would be put to good use to finish my Computing & IT degree. With that finished, I’d be able to finally do the bioinformatics that I was after since 1997, or so I thought.
My honours project was on database development, with a focus on conceptual data modelling languages. I rediscovered the Object-Role Modelling language from the lecture notes of the Saxion University of Applied Sciences that I had bought out of curiosity when I did the aforementioned MCSE course (in Enschede, the Netherlands). The database was about bacteriocins, which are produced by bacteria and they can be used in food for food safety and preservation. A first real step into bioinformatics. Bacteriocins have something to do with genes, too, and in searching for conceptual models about genes, I had stumbled into a new world in 2003, one with the Gene Ontology and the notion of ontologies to solve the data integration problem. Marking and marks processing took a bit longer than usual that year (the academics were on strike), and I was awarded the BSc(honours) degree (1st class) in March 2004. By that time, there were several bioinformatics conversion courses available. Ah, well.
The long route taken did give me some precious insight that no bioinformatics conversion top-up degree can give: a deeper understanding of indoctrination into disciplinary thinking and ways of doing science. That is, on what the respective mores are, how to question, how to identify a problem, looking at things, ways of answering questions and solving problems. Of course, when there’s, say, an experimental method, the principles of the methods are the same—hypothesis, set up experiment, do experiment, check results against hypothesis—as are some of the results processing tools the same (e.g., statistics), but there are substantive differences. For instance, in computing, you break down to problem, isolate it, and solve that piece of something that’s all human-made. In microbiology, it’s about trying to figure out how nature works, with all its interconnected parts that may interfere and complicate the picture. In the engineering side of food science, it was more along the line of, once we figure out what it does and what we really need, can we find something that does what we need or can we me make it do it to solve the problem? It doesn’t necessarily mean one is less cool; just different. And hard to explain to someone who has ever studied only one degree in one discipline, most of whom invariably have the ‘my way or the highway’ attitude or think everyone is homologous to them. If you manage to create the chance to do a second full degree, take it.
Who am I to say that a top-up degree is unlike the double indoctrination into a discipline’s mores? Because I also did a top-up degree, in yet another discipline. Besides studying for the last year in Computing & IT with a full-time load, I had also signed up for a conversion Master’s of Arts in Peace & Development studies at the University of Limerick, Ireland. The Computing & IT degree didn’t seem like it would be a lot of work, so I was looking for something to do on the side. I had also started exploring what to do after completing the degree, and in particular to maybe sign up for a masters or PhD in bioinformatics. And so it was that I stumbled upon the information about the Masters of Arts in Peace & Development studies in the postgraduate prospectus. Reading up on the aims and the courses, this coursework and dissertation masters looked like it might actually help me answer some questions I had that were nagging since I spent some time in Peru. Before going to Peru, I was a committed pacifist; violence doesn’t solve problems. Then Peru’s Moviemento Revolucionario de Tupac Amaru (MRTA) hijacked the Japanese embassy in Lima in late 1996 when I was in Lima. They were trying to draw attention to the plight of the people in the Andes and demanded more resources and investments there. I’d seen the situation there, with its malnutrition, limited potable water, and limited to no electricity, which was in stark contrast to the coastal region. The Peruvians I spoke to did not condone the MRTA’s methods, but they had a valid point, or so went the consensus. Can violence ever be justified? Maybe violence could be justified if all else had failed in trying to address injustices? If it is used, will it lead to something good, or merely a set-up for the next cycle of violence and oppression?
I clearly did not have a Bachelor of arts, but I had done some courses roughly in that area in my degree in Wageningen and had done a range of extra-curricular activities. Perhaps that, and more, would help me persuade the selection committee? I put it all in detail in the application form in the hope it would increase my chances to try to make it look like I could pull this off and be accepted into the programme. I was accepted into the programme. Yay. Afterwards, I heard from one of the professors that it had been an easy decision, “since you already have a Masters degree, of science, no less”. Also this door was opened thanks to that first degree I had obtained that was paid for by the state merely because I qualified for the tertiary education. The money to pay for this study came from my savings and the severance package from Eurologic. I had earned too much money in industry to qualify for state subsidy in Ireland; fair enough.
Doing the courses, I could feel I was missing the foundations, both regarding the content of some established theories here and there and in tackling things. By that time, I was immersed in computing, where you break down things in smaller sub-components and that systematising is also reflected in the reports you write. My essays and reports have sections and subsections and suitably itemised lists—Ordnung muss sein. But no, we’re in a fluffy humanities space and it should have been ‘verbal diarrhoea’. That was my interpretation of some essay feedback I had received, which claimed that there was too much structure and that it should have been one long piece of text without visually identifiable begin, middle, and end. That was early in the first semester. A few months into the programme, I thought that the only way I’d be able to pull off the dissertation, was to drag the topic as much as I could into an area that I was comparatively good at: modelling and maths.
That is: to stick with my disciplinary indoctrinations as much as possible, rather than fully descend into what to me still resembled mud and quicksand. For sure, there’s much more to the humanities than meets an average scientist’s eye, and I gained an appreciation of it during that degree, but that does not mean I was comfortable with it. In addition, for thesis topic choice, there were still the ‘terrorists’ I was looking for an answer to. Combine the two, and voila, my dissertation topic: applying game theory to peace negotiations in the so-called ‘terrorist theatre’. Prof. Moxon-Browne was not only a willing, but also eager, supervisor, and a great one at that. The fact that he could not wait to see my progress was a good stimulator to work and achieve that progress.
In the end, the dissertation had some ‘fluffy’ theory, some mathematical modelling, and some experimentation. It looked into three party negotiations cf. the common zero-sum approach in the literature: the government and two aggrieved groups, of which one was the politically-oriented one and the other one the violent one. For instance, in the case of South Africa, the Apartheid government on the one side and the ANC and the MK on the other side, and in case of Ireland, the UK/Northern Ireland government, Sinn Fein and the IRA. The strategic benefits of who teams up with whom during negotiations, if at all, depends on their relative strength: mathematically, in several identified power-dynamic circumstances, an aggrieved participant could obtain a larger slice of the pie for the victims if they were not in a coalition than if they were, and the desire, or not, for a coalition among aggrieved groups depended on their relative power. This deviated from the widespread assumption at the time that said that the aggrieved groups should always band together. I hoped it would still be enough for a pass.
It was awarded a distinction. It turned out that my approach was fairly novel. Perhaps therein lies a retort argument for the top-up degrees against the ‘do both’ advice I mentioned before: a fresh look on the matter, if not interdisciplinarity or transdisciplinarity. I can see it also with the dissertation topics of our conversion Masters in IT students as well. They’re all interesting and topics that perhaps no disciplinarian would have produced.
The final step, then. With a distinction in the MA in Peace & Development in my pocket and a first in the BSc(honours) in CS&IT at around the same time, what next? The humanities topics were becoming too depressing even with a detached scientific mind—too many devastating problems and too little agency to influence—and I had worked toward the plan to go into bioinformatics for so many years already. Looking for jobs in bioinformatics, they all demanded a PhD. With the knowledge and experience amassed studying for the two full degrees, I could do all those tasks they wanted the bioinformatician to do. However, without meeting that requirement for a PhD, there was no chance I’d make it through the first selection round. That’s what I thought at the time. I tried 1-2 regardless—reject because no PhD. Maybe I should have tried and applied more widely nonetheless, since, in hindsight, it was the system’s way of saying they wanted someone well-versed in both fields, not someone trained to become an academic, since most of those jobs are software development jobs anyway.
Disappointed that I still couldn’t be the bioinformatician I thought I would be able to be after those two degrees, I sighed and resigned to the idea that, gracious sakes, I’ll get that PhD, too, then, and defer the dream a little longer.
In a roundabout way I ended up at the Free University of Bozen-Bolzano (FUB), Italy. They paid for the scholarship and there was generous project funding to pay for conference attendance. Meanwhile in the bioinformatics field, things had moved on from databases for molecular biology to bio-ontologies to facilitate data integration. The KRDB research centre at FUB was into ontologies, but then rather from the logic side of things. Fairly soon after my commencement with the PhD studies, my supervisor, who did not even have a PhD in Computer Science, told me in no unclear terms that I was enrolled in a PhD in computer science, that my scientific contributions had to be in computer science, and if I wanted to do something in ‘bio-whatever’, that was fine, but that I’d have to do that in my own time. Crystal clear.
The `bio-whatever’ petered out, since I had to step up the computer science content because I had only three years to complete the PhD. On the bright side, passion will come the more you investigate something. Modelling, with some examples in bio, and ontologies and conceptual modelling it was. I completed my PhD in three year(-ish); fully indoctrinated in the computer science way. Journey completed.
I’ve not yet mentioned the design I indicated at the start of the blog post. It has nothing to do with moving into computer science. At all. Weaving in the interior design into the narrative didn’t work well, and it falls under the “vocational training courses whilst employed in various IT jobs” phrase earlier on. The costs of the associate diploma at the Portobello Institute in Dublin? I earned most of the costs (1200 pound or so? I can’t recall exactly, but it was somewhere between 1-2K) together in a week: we got double pay for working a shift on New Year (the year 2000 no less) and then I volunteered for the double pay for 12h shifts instead of regular 8h shifts for the week thereafter. One week extra work for an interesting hobby in the evening hours for a year was a good deal in my opinion, and it allowed me to explore whether I liked the topic as much as I thought I might in secondary school. I passed with a distinction and also got Rhodec certified. I still enjoy playing around with interiors, as hobby, and have given up the initial idea (in 1999) to use IT with it, since tangible samples work fine.
So, yes, I really have completed degrees in science, engineering, and political science straddling into humanities, and a little bit of the arts. A substantial chunck was paid for by the state (‘full scholarships’), companies chimed in as well, and I paid some of it from my hard earned money. On the motivations for the journey: I hope I made that clear despite cutting out some text in an attempt to reduce the post’s length. (Getting into university in the first place and staying in academia after completing a PhD are two different stories altogether, and left for another time.)
I still have many questions, but I also realise that many will remain unanswered even if the answer is known to humanity already, since to live means it’s finite and there’s simply not enough time to learn everything. In any case: do study what you want, not what anyone tells you to study. If the choice is a study or, say, a down payment on a mortgage for a house, then if completing the study will give good prospects and relieves you from a job you are not aiming for, go for it—that house may be bought later and be a tad bit smaller. It’s your life you’re living, not someone else’s.