A review on logics for conceptual data modelling

Pablo and I thought we could write the review quickly. We probably could have done so for a superficial review, describing the popular logics, formalisation decisions, and reasoning services for conceptual data models. Those were also the easiest sections to write, but reviewing some 30 years of research on only that theme was heading toward a ‘boring’ read. If the lingering draft review could have spoken to us last year, it would have begged to be fed and nurtured… and we listened, or, rather, we decided to put in some extra work.

There’s much more to the endeavour than a first glance would suggest, and so we started digging deeper to add more flavour and content. Clarifying the three main strands of logics for conceptual data modelling, for instance. Spelling out the key dimensions where one has to make choices when formalising a conceptual data model, just in case anyone else wants to give it a try, too. Elucidating distinctions between the two approaches to formalising the models, namely rule-based and mapping-based, and where and how exactly that affects the whole thing.

A conceptual model describing the characteristics of the two main approaches used for creating logic-based reconstructions of conceptual data models: mapping-based and rule-based. (See paper for details.)

Specifically, along the way in the paper, we try to answer four questions:

  • Q1: What are the tasks and challenges in that formalisation?
  • Q2: Which logics are popular for which (sub-)aim?
  • Q3: What are the known benefits of a logic-based reconstruction in terms of the outcome and in terms of reasoning services that one may use once a CDM is formalised?
  • Q4: What are some of the outstanding problems in logic-based conceptual data modelling?

Is there still anything to do on this topic, one may wonder, considering that it has been around since the 1990s? Few, if any, will care about just another formalisation, and you’re unlikely to get that published no matter how much effort it took you. Yet, Question 4 can indeed be answered, and the answer is far from a ‘no’.

We need more evidence-based research, more tools with more features, and conceptual modelling methodologies that incorporate the automated reasoner. There’s some work to do to integrate better with closely related areas, such as ontology-based data access and ShEx & SHACL with graphs, or at least to offer lessons learnt and have results re-purposed. One could use the logic foundations to explore new applications in contexts other than modelling that also need such rigour, such as automated generation and maintenance of conceptual data models, multilingual models and related tasks with controlled natural languages or summarisation (text generation from models), test data generation, and query optimisation, among others.

More details of all this can be found in the (open access) paper:  

Pablo R. Fillottrani and C. Maria Keet. Logics for Conceptual Data Modelling: A Review. In Special Issue on Trends in Graph Data and Knowledge – Part 2. Transactions on Graph Data and Knowledge (TGDK), Volume 2, Issue 1, pp. 4:1-4:30, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/TGDK.2.1.4

Notes on the ontology of International Standard Book Numbers

Preamble. Draft notes about experiences and peculiar findings on ISBNs were gathering dust and some people got curious: how could something as mundane as ISBNs, a popular topic for EER model design exercises, not be straightforward? It was not worth the effort of a scientific paper, but cute enough for a blog post on ontology and modelling in everyday life. (everyday ontology #1)

~~~

A glitch occurred with the ISBN of my memoir at a late stage in the publication process when that one memoir by that one publisher needed two ISBNs, one for the printed version and one for the print-on-demand version. The printed version can be re-printed by popular demand, with the same ISBN, but that’s different from printing more print-on-demand copies with that other ISBN. It made me explore the international standard of book numbers. ISBN basics and a dose of knowledge of database theory and techniques were the key to resolving that glitch in the publication process of the memoir. But questions had surfaced as to what the ISBN really means, questions that needed answers.

A quick search revealed the easy part to grasp, and the rule to adhere to: each format – hardcover, paperback, e-book – needs its own ISBN. A database needs to be able to distinguish which of the three has been sold in the shop in order to keep track of the inventory and for online retailers to send you the format you bought. The ‘same’ book published by a different publisher also requires its own ISBN, for each of these formats, because the right amount of money has to be sent to the right publisher. Different editions need their own ISBN as well, as they tend to differ, like having an extra preface or postscript. So far so good from the conceptual data modelling and retailer’s perspectives.

Ontologically, one might not be so happy. One might have assumed that an ISBN helps to identify a book, but clearly it provides identification at best, not identity. What is ‘the book’? What makes a book a book is its content, as the thing we refer to when someone says “Yes, I bought the book; the cheaper paperback, not the hard cover though”. In that sense ‘the book’ excludes at least the table of contents and any index, since page numbers will vary across formats, and front matter and cover image and text may differ. Then there’s a meaning of ‘the book’ as the physical manifestation, be it on the bookshelf or stored on disk in e-book format. Let us follow the direction of formats rather than the alley of an information artefact ontology. Curiously, for e-books, each of those formats also needs its own ISBN, like the EPUB and PDF and Kindle versions.

What does that say about the identity and identification of the object? The reasoning by the ISBN organisation is as follows:

“if a specific device or software is required to read the e-book or different usage constraints that control user functionality are offered (e.g. copy, print, lend etc.) then each separate version will be a distinct product. Each distinct product that is available must be identified by its own ISBN as it is a separate publication. Thus, a separate publication is normally defined by a combination of product form features or details and usage constraints.” (emphasis added)

So, a book number is a number for a unique product that is a publication; and since a publication need not be a book, a non-book artefact may have a book number. Also, the (PDF file for the) print-on-demand version requires a different ISBN from the PDF file for the regular batch printing of the same book, and is thus deemed a separate product. It may be the same book that is being printed from the PDF file, but nonetheless the softcover hardcopies are somehow different, counting as two publications instead of one.

For my no taming of the enthusiast book, the only difference was the change in ISBN itself (a circular issue) and the cut-off points of the width of the side of the cover due to the different paper type in overseas printers assumed by the international distributor. For my modelling book, there are, at least, the ISBN-10 3031396944, ISBN-13 softcover 978-3-031-39694-6, ISBN-13 e-book 978-3-031-39695-3, a digital object identifier (DOI) 10.1007/978-3-031-39695-3, and an Amazon Standard Identification Number (ASIN) for the Kindle version (B0CDP5KXT7). Declare sameAs in the knowledge graph or LOD cloud at your own peril. Yet, I doubt that the Department of Higher Education’s publication bean counters will count my modelling book as at least two, if not five, publications – with each publication earning a subsidy for the university.
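As an aside on that peril: here is a minimal sketch with rdflib of what such sameAs declarations commit one to (the URIs are made up for illustration; only owl:sameAs itself is standard):

# Minimal sketch of the owl:sameAs peril, with made-up URIs standing in
# for the identifiers of the modelling book listed above.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

EX = Namespace("http://example.org/book/")
g = Graph()
# Conflating the softcover, the e-book, and the Kindle product:
g.add((EX["isbn9783031396946"], OWL.sameAs, EX["isbn9783031396953"]))
g.add((EX["isbn9783031396953"], OWL.sameAs, EX["asinB0CDP5KXT7"]))
# owl:sameAs is symmetric and transitive, so an OWL reasoner will treat all
# three as one individual: every property asserted of one (format, price,
# DRM constraints) is then inferred to hold for the others -- precisely the
# distinctions that the separate identifiers were meant to keep apart.
print(g.serialize(format="turtle"))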

Section of an image found at https://kitaboo.com/online-ebook-publishing-5-easy-steps/ and copyrighted by “yanlev – Fotolia” or “Federico Caputo”

The content in each book is exactly the same otherwise. If either of the publishers were to create a PDF that disabled printing, it should obtain a separate ISBN, according to the ISBN description. If there’s software that determines 0, 1, 5, or whichever number of sequential or concurrent lendings for 30 minutes, a day, two weeks, or whichever amount of time, each variant would get a different ISBN. Access restrictions by country, be it due to censorship or just so, likewise. The ‘just so’ indeed does exist. My most recent experience was trying to get my hands on Ten Planets by Yuri Herrera: the sci-fi short story collection is not on AppleBooks at all according to the app, and Amazon didn’t make the Kindle version available to people physically in South Africa, yet Rakuten’s Kobo was eager to sell it to me and made me a happy reader. Each restrictive and, at times, opaque ‘digital rights management’ variant supposedly requires a separate ISBN. That’s absurd.

What’s going on? Book-loving people not grasping software or ontology? Or, given that ISBNs have to be bought, is it a money-making exercise to find more ways to collect money from the mostly poor and underpaid writers and struggling publishers? What is the criterion for “distinct”? Really only differences in basic “functionality”? Sure, an ontology-enhanced digital and interactive Inquire Biology is functionally different from the original printed textbook, but that’s thanks to its text mark-up, context-sensitive questions and answers, and semantic browsing and search.

In contrast, a lending constraint, say, is not intrinsically a functionality or feature of the book. If desired, such an accidental feature added to an e-book should be a constraint managed by the software rather than requiring new book identifiers. Digital Rights Management (DRM) technology adds a wrapper to the e-book for each variation of access control, including number of users and devices, time, and so on. Each variant is a newly wrapped e-book file and, apparently, needs its own ISBN.

There are countably infinite ways to declare usage constraints. One could create a unique DRM-ed version for each person in the world, multiplied by the number of devices we each can use it on, multiplied by region locations, etc. And that for each book published. About 4 million new books are released each year, including traditional (500K to 1mln), self-published (some 1.7mln), and other forms of publishing. With a more realistic 10-25 ISBNs per book, the numbers would not run out until well beyond the current global civilisation, if they were classless numbers.

But that’s not how it works. The 13 digits of ISBN-13 are class-based, so it won’t accommodate nearly ten trillion (9 999 999 999 999) books. A bar code is associated with it, and the first three digits are allocated to books: 978 and, for the sake of argument, let’s take all of 979 as well. That reduces the amount to 2 times 9 999 999 999, or about 20 billion. Of the remaining digits, between one and five are allocated to the group (language), four to the publisher, at least three to the title, and the last one is the checksum, which is fully determined by the other digits and so adds no freedom. So, we obtain at most 2 billion books with ISBN-13. Practically, and assuming no-one will take the ISBN organisation to task over their “product form features”, it’s probably enough, albeit a trivial exercise to ruin for anyone with enough money to buy ISBNs.
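A quick back-of-the-envelope check of those numbers, including the standard ISBN-13 check digit computation (the split into prefix, payload, and checksum follows the reasoning above):

# Back-of-the-envelope check of the ISBN-13 capacity claims above.
def isbn13_check_digit(first12: str) -> str:
    """Weighted sum of the first 12 digits (weights 1,3,1,3,...), then the
    amount needed to reach the next multiple of 10."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)

# Two prefixes (978, 979) times nine free digits for group+publisher+title;
# the check digit adds no freedom since it is computed from the rest:
print(2 * 10**9)   # 2000000000, i.e., at most 2 billion ISBN-13s

# Sanity check against the softcover ISBN-13 mentioned earlier:
assert isbn13_check_digit("978303139694") == "6"   # 978-3-031-39694-6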

Either way, ontologically, it’s clear what causes the conceptual mess. Different book numbers for intrinsic and extrinsic features of what makes a book a book, fine; different numbers for countably infinite combinations of accidental padded-on usage features for the same book, not. A DRM wrapper can be hacked, cracked, and circumvented and then it violates the numbering scheme. The EPUB format can be converted into PDF and into MS Word and so on, violating the app-specific identification principle. Convertible formats shouldn’t require distinct ISBNs just because different default software applications may be needed to open them. An e-book standard for interoperability would negate the application file format issue, which was precisely the point of the EPUB open standard.

The plethora of external accidental arbitrary constraints/features added to an e-book belong to a different category of features from those intrinsic to the book. They would need a book usage number or a DRM wrapper number for identification, not a book number, for it is an identification for a file on top of the actual e-book, not a new publication of the book itself by any definition of what a book is. In its own way, Amazon’s ASIN for Kindle editions of e-books does just that. But it’s a missed opportunity for the current ISBN standard.

How this affects a typical ‘library loan’ modelling homework exercise or an information artefact ontology is left as an exercise to the reader… Edraw’s examples would need to be updated and the IAO extended, for instance. Alternatively, the ISBN organisation could revise what merits a distinct ISBN.

On comparing models

Some of the readers of this blog are interested in modelling, mainly conceptual data models or ontologies. There are more types of models and modelling languages as well, such as mind maps, biological models, domain-specific languages, and so on. Can you confidently say—and justify!—which one is the best? Would such an answer be so elaborate as to lean towards the idea of, and support existing calls for, modelling as a specialisation in an IT or computing degree programme, if not deserving to be a separate discipline outright? If so: why? What sets it apart and what are recurring themes across the various types of models and ways of modelling, and their differences? These questions are easy to ask, but far from trivial to answer. I tried anyway, to some extent at least. The latest attempt written in an accessible way—i.e., more like popular science than textbook-like—can be found in Chapter 7 of my recently published book “The what and how of modelling information and knowledge: from mind maps to ontologies” (Springer; also available through Springer professional, online retailers such as Amazon, and university libraries). Instead of summarising that in this post, I did so in a guest post on Jordi Cabot’s blog, which can be read here: https://modeling-languages.com/on-comparing-modelling-languages/

Figure 1. Two example diagrams about espresso machines: a mind map and a conceptual data model. If you have no idea about what or how to compare yet: before reading about the comparisons, can you describe differences between these two examples?

Background readings to the “Melokuhle – good things” short story

==== WARNING: SPOILERS AHEAD ====

If you have not yet read the short story Melokuhle – good things, published recently by East of the Web, you are advised to do so before continuing to read this post, unless you are an academic who really insists on looking at some research first.

========

This post is not about analysing the story about the somewhat culturally aware cooperative care robot, but about some of the papers relating to its theory, technology, and ethical aspects. I did intend to practise writing fiction, yet somehow I couldn’t resist weaving some scientific and educational aspects into the story. Occupational hazard, I suppose. (I’ve been teaching computer ethics/Social Issues and Professional Practice since 2016 to some 500-700 first-year students in computing.)

The idea of the story was initially motivated by three research topics that came together into the story it has become. First, computing and AI have ‘scenario banks’ and websites with lists of ethics questions to debate as to what we, as scientists, engineers, system architects, or programmers, should make the machine do, or not. One of my former students picked a few he liked to use in his project, one of which was the ‘Mia the alcoholic’ 1-paragraph scenario described by Millar [Millar16]. In short, it concerns the question: should the care robot [be programmed such that it will] serve more alcoholic beverages to the physically challenged alcoholic user when they ask for it? Melokuhle – good things provides a nontrivial answer that can be fun to discuss.

Perhaps unsurprisingly, not everyone will give the same answer to that question. Probably the most popular demonstration of why this may be so is the research conducted with the MIT Moral Machine, which broadened the trolley problem to self-driving cars and a range of scenarios, like whether you’d swerve for 5 homeless people and let yourself die or not, or, say, drive into 5 dogs vs 1 toddler if it had to be a binary choice. It turned out that clusters of answers were found by region and, presumably, culture [Awad18]. Enter the idea of a culturally aware robot.

But what is a ‘culturally aware’ robot supposed to do differently from ‘just’ a robot? Around the same time, Stefano Borgo gave a stimulating talk in our department about culturally aware robots, based on his paper about the traits that such a culture module should have [BorgoBlanzieri18]. The appealing idea turned out to be fraught with questions and potential problems, and a lively debate ensued during and after the talk. When is a robot culturally aware and when does one encode tropes and societal prejudices in the robot’s actions? Research is ongoing to answer that question.

Enter the idea of making the user configure the robot upfront somehow, as a way to tailor the cultural awareness to their liking. Yet, user-configurable settings for every possible scenario are practically unrealistic: no-one is going to spend days answering questions before being able to use the machine. A possible solution is to have the user enter (somehow) their moral theory and then have the robot draw the logical conclusion for any particular scenario based on the chosen moral theory. For instance, if the user were a devout Muslim and had chosen Divine Command Theory, then with the ‘thou shalt not drink (alcohol)’ command in effect, the carebot’s actions for Mia, or Lubanzi in the short story, would be easy to determine: a resounding no—it wouldn’t even have poured him the first bottle of wine. (Refer to the SIPP lecture notes [CSDept19] for summaries of 8 other ethical/moral theories.)
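To make that idea concrete, here is a minimal sketch of a moral theory as a pluggable decision rule (entirely illustrative: the function names, rules, and harm threshold are made up for this post, not taken from our model):

# Illustrative sketch: a moral theory as a pluggable rule that a carebot
# consults before complying with a request. All names and rules here are
# hypothetical simplifications.

def divine_command_theory(action: str, context: dict) -> bool:
    """Permit an action unless a command forbids it outright."""
    forbidden = {"serve_alcohol"}   # e.g., 'thou shalt not drink (alcohol)'
    return action not in forbidden

def simple_consequentialism(action: str, context: dict) -> bool:
    """Permit an action if its expected net harm stays below a threshold."""
    return context.get("expected_harm", 0) < 5

THEORIES = {
    "divine_command": divine_command_theory,
    "consequentialism": simple_consequentialism,
}

def carebot_decide(theory: str, action: str, context: dict) -> str:
    return "comply" if THEORIES[theory](action, context) else "refuse"

# Lubanzi asks for another glass of wine:
print(carebot_decide("divine_command", "serve_alcohol", {}))  # refuse
print(carebot_decide("consequentialism", "serve_alcohol",
                     {"expected_harm": 8}))                   # refuse
print(carebot_decide("consequentialism", "serve_alcohol",
                     {"expected_harm": 2}))                   # comply

The point of the sketch is that swapping the theory changes the conclusion for the very same request, which is the behaviour elaborated on next.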

To be able to get this to work in an artificial moral agent, we need to be able to represent a bunch of moral theories computationally and then devise a reasoning module that can take the formally represented theory as input and apply it to the case at hand. We’ve worked on the first part and developed a model for representing moral theories [RautenbachKeet20] and illustrated that, yes, a different moral theory can lead to a different conclusion, and why. The reasoning module doesn’t exist as a piece of software; in fact, there’s not even a design for it on paper yet for how to do it. There are sketchy ideas and the occasional rules-based approaches for one theory, but generalising from that is a few steps beyond still. And there’s the task of theory elicitation, which the short story also alludes to; a student I currently supervise is finishing up his Masters in IT dissertation on that.

The natural language interface issues that came up in the story deserve their own post, or two or three. I wrote an IT outreach and discussion paper on some aspects of it and on requirements in the context of robots in Sub-Saharan Africa [Keet21], and I conduct research on natural language generation for, mainly, Nguni languages. That comment about lacking resources for isiZulu natural language generation that the programmer supposedly snuck into Melokuhle’s code? That was me, unapologetically: we have plenty of good ideas and really would like to receive more research funds…

Overall, much research needs to be done still to realise the capabilities that Melokuhle (the carebot) exhibits in the story—if that sort of robot is one you’d like to have, that is. And so, yes, the East of the Web team that published the short story rightly classified it in the Sci-Fi genre.

Lastly, I did embed a few other bits and pieces of computer ethics, like the notion of behaviour modification or so-called nudging of users by computing devices and a whiff of surveillance in the home if the robot were indeed to be permanently in listening mode. If the story were to be used in an educational setting, they could be elaborated on as well. Further, here are a few questions that may be used to direct a discussion, be it in class or a practical or tutorial group discussion:

  1. What does “culturally aware robot” mean for you?
  2. Is it acceptable to program behaviour modification, or at least ‘nudging’, of a user in a computing device?
  3. Should the care robot always, occasionally, or never comply with the request for more alcoholic beverages? Why?
  4. Would your answers be different if Lubanzi were to have been given a back story of being an alcoholic or an athlete or a retiree in their late 70s?

Also, I used the story for the essay assignment last year, which was on the ethics of deploying robot teachers. The students could receive a few marks if they included a paragraph that contained an answer to “does any issue raised in the short story apply to robot teachers as well?”. I did that partially as a way to reduce the chance that students would farm out the whole task to ChatGPT and partially to make them practise reasoning by analogy.

To eager learners who are about to register at UCT and will enroll in CSC1016S: I won’t be asking this of you in the second semester, as I’m taking a break from the module in 2024 to teach something else and anyhow we don’t repeat assignments in immediate successive years. I do have a few more draft short stories, though.

References

[Awad18] Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., Rahwan, I. The Moral Machine experiment. Nature, 2018, 563(7729): 59-64

[BorgoBlanzieri18] Borgo, S., Blanzieri, E. Trait-based Culture and its Organization: Developing a Culture Enabler for Artificial Agents. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 333-338, doi: 10.1109/IROS.2018.8593369.

[CSDept19] Computer Science Department. Social Issues and Professional Practice in IT & Computing. Lecture Notes. 6 December 2019.

[Keet21] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021 Conference Proceedings. IST-Africa Institute and IIMC, Ireland. Cunningham, M. and Cunningham, P. (Eds). 10-14 May 2021, online. (discussion paper)

[Millar16] Millar, J. An ethics evaluation tool for automating ethical decision-making in robots and self-driving cars. Applied Artificial Intelligence, 2016, 30(8):787–809.

[RautenbachKeet20] Rautenbach, J.G., Keet, C.M. Toward equipping Artificial Moral Agents with multiple ethical theories. RobOntics: International Workshop on Ontologies for Autonomous Robotics, co-located with BoSK’20, Bolzano. CEUR-WS vol. 2708, 5 (7 pages). Extended version: https://arxiv.org/abs/2003.00935

A few book reviews for 2023

It’s that time of year again to revisit books read over the year. Although I’ve been writing more than reading this year, publishing a new book, even, I’ve read a few brand-new non-fiction books that are worth mentioning.

The first two books are current affairs books that, while clearly focussing on South Africa, bring to the fore issues that transcend borders. The remainder of this post discusses mainly the first three books.

Corrupted and chronic dysfunction

Distinguished scholar, university executive manager, and thought leader Jonathan Jansen provides a critical analysis of what, where, how, and why things go wrong to various degrees at South African universities. His analysis is based on over 100 interviews and his extensive experience as academic, manager, and fixer at multiple universities, including universities under administration. It’s not a simple case of messed-up historically disadvantaged universities versus historically privileged ones, or centre versus periphery. But, “at the heart of the dysfunction in universities is an intense and sometimes deadly competition for resources especially on campuses located in impoverished communities.” Not only in the sense of limited resources but, importantly, in that all and sundry want to eat at the trough and have their slice of the cake.

His analysis shows that the academic project may take a back seat when the university is the key employer in the region, having turned into a combination of an unemployment-benefit distribution point of sorts and the focal point for corruption to amass more personal wealth. Then there’s a mix of politicised management and limited capacity that adds its own share to the dysfunction cocktail. It must also be said, as Jansen does in his book as well, that not all universities are utterly dysfunctional. The easily readable, gripping tale is a must-read for anyone who’d like to understand the South African university landscape better.

A missed opportunity for the book, if any, was, in my opinion, to look beyond the borders and make a connection to academia more broadly internationally, because many of the issues are not unique to South African universities. The ‘academic news/gossip weeklies’, such as UWN, CHE, and THE, regularly report on various issues with political meddling (e.g., DeSantis in Florida HE) and corruption at universities in multiple countries across the globe. And I know first-hand of one university that was originally created to revert the brain drain, open up the city a bit and stimulate its economy and vibe, allow for multilingual education, and provide employment opportunities for locals, not necessarily always in that order of priority (and that was no secret); they weren’t the first to do so and are unlikely to be the last, and they aren’t trailing in the university rankings either.

Truth to Power

De Ruyter’s book is full of juicy, sordid details of the back story to the rolling blackouts we’ve been trying to put up with. The power cuts used to happen in fits and bursts and, when happening, see-sawed between stage 1 (up to 2.5h/day without electricity) and stage 4 (around 6.5-7.5h/day). Then there was 2022, when the outages sky-rocketed to 4-5 times as bad as before, hitting stage 6 several times (9-11h/day without electricity). We were shocked. But then, then there’s 2023, where the amount and severity of outages surpassed the 2022 outages in the second week of May already. Why? Electricity is something many people take for granted as simply being there. There’s a very complex system behind it to make it work, however, and uncountable ways to thwart it.

For South Africa, it’s a complicated banquet of causes for why there are so many rolling blackouts for so many hours. In no particular order: problems in management, corruption, sabotage, crime, ageing coal-fired power plants and that nuclear one that is also being pushed to an extended lifespan, state government (among others: lack of timely action, conflicting interests across ministries), political power plays, attitude, cartels, the coal mafia, BEE subversion, and ineffective police. De Ruyter’s book delves into those issues, interspersed with his ideas and motivations of how to manage and lead. It was a fascinating read.

The insect crisis

This popular science book by Oliver Milman does not have anything to do with my research whatsoever, but it is well written, so as to make it easy and gripping holiday reading. It’s popular science writing of the ‘telling the story as it unfolds’ page-turner variety, rather than a dry summary of scientific papers.

Calling the rapidly dwindling numbers of insects merely a ‘crisis’ seems to be a euphemism; for both popular species and unsung heroes, it’s a disaster of epic proportions on all inhabited continents, with sightings and biomass dwindling anywhere from a few percent per year to a collapse of 25-97% over a decade or two (or three or four, depending on the species and region).

The book starts off with an apocalyptic prologue on what earth would look like when all insects have gone the way of the dodo. The main part of the book first spends several chapters recording the decline of insect populations and extinctions and their main causes—pesticides, other agricultural practices not supportive of insects, and climate change—and then devotes a chapter each to the challenges with bees and butterflies. The last two chapters try to answer the ‘what can we do?’ question. In a meandering way, they propose a range of options to pursue, from managing land less and using fewer pesticides, to more green corridors, to eating less meat (since that requires a disproportionate amount of arable land), to exploring what techno-optimism might be able to offer. And wishing insects will rebound when they’re given back a little space with a variety of plants.

Just in case you don’t like insects: after reading this book, you will, even if for ulterior motives, like wanting to keep eating chocolate and apples and almonds, to continue drinking coffee, and to enjoy watching geckos and birds and butterflies. The entomologists can’t quite proffer images to generate the aahhs and oohs of cute endangered animals for attention, but many insects have usefulness on their side, which just might convince the public of their importance and get it to act accordingly.

The ‘other’ basket

And then, then there are books you struggle through, or drag everywhere with the intention to read them during a lost moment but just can’t get yourself to do so. One such blocker is Raihani’s “The social instinct: what nature can teach us about working together” (Vintage, 2022). It still looks interesting and I will try my best to finish it, one day, but, some 8 months after opening it, I have made it to page 35 only.

The other one is “Rebel Talent: why it pays to break the rules in work and life” by Francesca Gino. It’s half memoir with TMI, half pop management research. And then it turns out she probably made up some of the research stuff, partly together with academic rock star Dan Ariely (I read his book, “Predictably Irrational”, too, which was a good read). On research about preventing dishonesty. It all blew up publicly from June 2023 onwards: first on Data Colada and in the NYT, then a long read appeared in The Atlantic in July ‘23, and Gino went ahead and sued the accusers. The saga is ongoing.

As usual, there are a few books on the bookshelf for which I haven’t had the time yet. I hope to get to at least one of them this week and I’m open to recommendations 🙂

On my new book about modelling

It was published last month by Springer: “The what and how of modelling information and knowledge: from mind maps to ontologies”. The book’s three character-limited unique selling points are that it “introduces models and modelling processes to improve analytical skills and precision; describes and compares five modelling approaches: mind maps, models in biology, conceptual data models, ontologies, and ontology; aims at readers looking for a digestible introduction to information modelling and knowledge representation”. The softcover hardcopy and the eBook are available from Springer, Springer professional, many national and international online retailers (e.g., Amazon), as well as university libraries, and hopefully soon in the ‘science’ section of select bookstores.

There’s also a back-flap blurb with the book’s motivations and aims, and intended readership. The remainder of this post consists of informal comments on it.

From my side, as author and having read many popular science books on a wide range of topics, I wanted to write a popular science book too, but about modelling. Modelling for the masses, as it were, or at least something that is comparatively easy to read for professionals who don’t have a computing background and who have had no, or very little, training in modelling, yet who can greatly benefit from it. And to some extent also for computing and IT professionals who’d like a refresher on information modelling or a concise introduction to ontologies but don’t want to (re-)open their textbook tomes from college. Modelling doesn’t lend itself well to juicy world-changing discoveries the same way that vaccines and fungi can be themes for page-turners, but a few tales and juicy details do exist.

The next consideration was which aspects of modelling to include and what sort of popular science book to aim for. I distinguished four types of popular science books based on my prior readings, ranging from ‘entertaining layperson’ level holiday reading to ‘advanced interested layperson’ level, where having at least a Bachelors in that field or a Master’s degree in an adjacent field may be needed to make it through the tiny-font book. I have no experience writing humour, and modelling is a rather dry topic compared to laugh-out-loud musings and investigations into stupidity, drunkenness, or elephants on acid—that entertainment can be found here, here, and here—so that was easily excluded. I’ve already tried out advanced texts tailored to specialists, in the form of an award-winning postgraduate textbook on ontology engineering, and wasn’t in the mood for writing another such book at the time I was exploring ideas, which was around late 2021 and early 2022. I think this modelling book ended up between the two extremes regarding the amount of content, difficulty, and readability.

And so, I chose a so-called ‘casual writing’ style to make it more readable, added a few anecdotes to enliven the text, as is customary for popular science books, and kept the first three chapters relatively easy in content compared to the later chapters. The difficulty level of the chapters’ contents is turned up a notch with each chapter from Chapters 2 to 6, as the journey passes by the five types of models covered in the book. Each successive chapter solves modelling limitations of the preceding chapter, and so it gets more challenging at least up to Chapter 5 (ontologies). Whether a reader finds Chapter 6 on Ontology (philosophy) even harder depends on their background; in other ways it is easier than ontologies, because we can set aside certain interfering practicalities.

Chapter 7 mixes easier use cases with theoretically more abstract sections, where we put things together, reflect on Chapters 2-6, and look ahead. There’s no avoiding a little challenge. But then, we read non-fiction/science/tech books to learn from them, and learning requires some effort.

Aside from the reader learning from reading the book, an author is supposed to gain new insights from writing it. And so did I. Moreover, when planning the book upfront, I tried to make sure I likely would. I mention a few salient points in the preface and I’ll select two for this blog post: the cladograms (Section 3.2.1) and the task-based evaluation (Section 7.1.2.2).

Diagrams/models in biology are sometimes ridiculed as “cartoons” by non-biologists. Cladograms would be the xkcd version of it, visually. I already knew that there are common practices, recurring icons, and rules governing the biological models drawn as diagrams. Digging deeper for more diagrams with rules governing their notation turned up cladograms. They visualise key aspects of the scientific theory of evolution. Conversely, drawing an evolutionary diagram that doesn’t adhere to those rules then amounts to misunderstanding evolution. I think the case deserves more attention, especially because a bunch of school textbooks have been shown to have errors, and there’s room for improvement in designing cladogram drawing software. Maybe clarifying matters and being more precise with such models helps resolve some debates on the topic as well.

The motivation for the task-based evaluation is easy to argue for in theory — actually doing it offered a deeper understanding, and writing the book spurred me to do so. One of my claims in the beginning of the book is that with better modelling—better than mind maps, not better mind maps—one learns more. The task-based evaluation is precisely about that. We take one page from a textbook and try to create a model of it, one for each type of model covered in the book. It demonstrates in a clear and straightforward way — assisted by Bloom’s taxonomy if you so fancy — why developing an ontology is much harder than developing a mind map or a conceptual data model, and in what way designing a conceptual data model of that textbook page is better for learning the content than creating a mind map of it.

There were more joys of writing the book. Like that the running example—dance—was also good for some additional interesting paper reading beyond what I already had read and engaged with in various projects. (There are also other subject domains in the examples and illustrations, such as fermentation, peace, labour law, and stuff, and a separate post will be dedicated to more content of the book.)

To jump the gun on questions like “why didn’t you include my preferred type of model or my language, being [DSL x/KG y/BPM z/etc.]?”: the point I wanted to make with this book was made with these five types of models and this was the shortest coherent story arc with which I could do it. The DSLs/KGs/BPMs/etc. are not less worthy, but they would have caused the number of pages to explode without adding to the argument. As consolation, perhaps: knowledge graphs (KGs) are likely to appear in a v2 of my ontology engineering textbook and BPM likely will be linked to the TREND temporal conceptual data modelling language, but that’s future music.

Last, I’ve created a web page for the book, which collates information about it, such as direct links where to buy it, media coverage and links to recent related blog posts (e.g., this one is a spin-off [with an add-on] of an early draft of section 6.3 and that one of a draft of section 7.3), and has extra supplementary material, including a longer illustration of a conceptual model design procedure using a prospective dance school database as example. Feedback is welcome!

An illustration of an “ERDP” to create an EER diagram: the dance school database

How to develop a conceptual data model, such as an EER diagram, UML Class Diagram, or ORM model? Besides dropping icons here and there on an empty canvas, a few strategies exist for approaching it systematically, or at least in an assisted way, be it for ‘small data’ or for ‘big data’. One that I found useful to experiment with when I started out many years ago with the ‘small data’ cases was the Conceptual Schema Design Procedure (CSDP) for ORM, as summarised in Table 1 below. The procedure is summarised in Halpin’s ORM white paper and its details span a few hundred pages in his book [Halpin01], which he further extended in later works. Extended Entity-Relationship modelling is more popular than Object-Role Modeling, however, and yet there’s no such CSDP for it. The elements don’t have the same names and the list of possible constraints to take into account is not the same in both families of languages either [KeetFillottrani15]. So, I amended the procedure to make it work for EER.

Table 1. CSDP as summarised by Halpin in the white paper about Object-Role Modeling.

Step | Description
1 | Transform familiar information examples into elementary facts, and apply quality checks
2 | Draw the fact types, and apply a population check
3 | Check for entity types that should be combined, and note any arithmetic derivations
4 | Add uniqueness constraints, and check arity of fact types
5 | Add mandatory role constraints, and check for logical derivations
6 | Add value, set comparison and subtyping constraints
7 | Add other constraints and perform final checks

Unsurprisingly, yes, it is feasible to rework the CSDP for ORM to also be of use for designing EER diagrams, in an “ERDP”, an ER Design Procedure, if you will. A basic first version is described in Chapter 4 of my new book that is currently in print with Springer [Keet23] (and available for pre-order from multiple online retailers already). I padded the CSDP-like procedure a bit on both ends. There’s an optional preceding ‘step 0’ to explore the domain in preparation for a client meeting. Steps 1-7 are summarised in Table 2: listing the sample facts, drawing the core elements, and then adding constraints: cardinality, mandatory/optional participation, value, disjointness, and completeness. Step 7 mostly amounts to adding nothing more, since EER has fewer constraints than ORM. Later steps may include quality improvements and various additions that some, but not all, EER variants have.

Table 2. Revised basic CSDP for EER diagrams.

Step | Description
0 | Universe of discourse (subject domain) exploration
1 | Transform familiar or provided sample examples into elementary facts, and apply quality checks
2 | Draw the entity types, relationships, and attributes
3 | Check for entity types that should be combined or generalised
4 | Add cardinality constraints, and check arity of fact types
5 | Add mandatory/optional constraints
6 | Add value constraints and subtyping constraints
7 | Add any other constraints of the EER variant used and perform final checks

The book’s chapter on conceptual data models also includes an example of a size that fits neatly given the page budget and the rest of the content. As bonus material, I have made a longer example available on this page, which is about developing an EER diagram for a database to manage data for a dance school.
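To give a flavour of the early steps before diving into the long example, here is a hypothetical mini-walk-through of steps 1 and 2 (the sample facts and names are invented for this post; the page and the book use a fuller example):

# Hypothetical walk-through of ERDP steps 1-2 for a dance school;
# the sample facts are invented for illustration.

# Step 1: transform sample information into elementary facts
facts = [
    ("Dancer", "Joan", "enrolled in", "Course", "Salsa Beginners"),
    ("Dancer", "Joan", "enrolled in", "Course", "Tango Beginners"),
    ("Course", "Salsa Beginners", "taught by", "Instructor", "Pedro"),
]

# Step 2: read off the entity types and relationships to draw
entity_types = {f[0] for f in facts} | {f[3] for f in facts}
relationships = {f[2] for f in facts}
print(entity_types)   # {'Dancer', 'Course', 'Instructor'} (in some order)
print(relationships)  # {'enrolled in', 'taught by'}

# Step 4 would then use such a sample population to propose cardinalities:
# Joan appears in two 'enrolled in' facts, so a Dancer may enrol in more
# than one Course, i.e., that side of the relationship is 'many'.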

Picture of our group dancing the “Ball de pastors del pirineo”.

I did go through a ‘step 0’ to explore the subject domain and probe my knowledge of dance schools, which was facilitated by having been a member of several dance schools over the years. The example then goes through the 7-step procedure, in a step-wise fashion with intermediate partial models, from devising elementary facts to the final diagram in Information Engineering notation, as shown in the following image:

Figure 1. The final EER diagram at the end of “step 6” of the procedure.

The dance school model description also hints at what lies beyond step 7, such as automated reasoning and ontology-driven aspects (not included in this basic version), and the page has a few notes on notations. I used IE notation because I really like the visuals of the crow’s feet for cardinality, but there’s a snag: some textbooks use Chen’s or a ‘Chen-like’ notation instead. Therefore, I added those variants near the end of the page.

Are the resulting models any better with such a basic procedure than without? I don’t know; it has never been tested. We have around 450 students who will have to learn EER in the first semester of their second year in computer science, so there may be plenty of participants for an experiment to make the conclusions more convincing. If you’re interested in teaming up for the research to find out, feel free to email me. 

References

[Halpin01] Halpin, T. Information Modeling and Relational Databases. San Francisco: Morgan Kaufmann Publishers. 2001.

[KeetFillottrani15] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[Keet23] Keet, C.M. The What and How of Modelling Information and Knowledge: From Mind Maps to Ontologies. Springer, in press. ISBN-10: 3031396944; ISBN-13: 978-3031396946.

Social impact issues with LLMs – a brief write-up of my list from the SIGdial’23 panel

The SIGdial 2023 organisers wanted a panel at the jointly held SIGdial 2023 and INLG 2023 conferences in Prague that took place last week. Svetlana Stoyanchev, as PC Chair in charge of it, proposed “Social impact of LLMs”. It was to follow the keynote talk by Ryan Lowe of OpenAI, the company behind the popular ChatGPT and also Whisper for speech, and he would also participate. I ended up on the panel as well (coming from the NLG angle of the matter), as did Ehud Reiter from the University of Aberdeen (UK) and Malihe Alikhani from Northeastern University (USA), with David Traum from the University of Southern California (USA) as moderator.

There was to be a 3-5 minute opening statement by each panel member, which I had duly prepared for, but that did not happen. What happened first was an unassuming one-liner with name, affiliation, and area of specialisation. It then proceeded with questions the likes of “can you provide your view on how LLMs benefit society?”, “What is more important: factualness or fluency?”, and “Which ethical concerns about LLMs are overstated?”.

I didn’t get on the panel for that sort of stuff. I was cajoled into saying ‘yes’ to the invitation because I already had compiled a partial list of social issues with LLMs. I teach a module on “social issues and professional practice” to first-years in computer science at UCT, in which I touched upon the topic in late July and early August at the start of the semester, and I had mentioned some of it at a research ethics workshop at UCT as well. Note that ‘issues’ can be interesting and provide ideas for new research projects, provided they’re not inherent limitations of the theory, method, technique, tool, or practice.

As preparation for the panel, I tried to structure the list into a taxonomy of sorts, to maximise information density in the short time I thought I would have. So when the moderator opened the floor for questions from the audience and no-one queued up instantly, I jumped in the gap. It might help to get the audience into action, too, or so I thought. And someone had to state the unpleasantries and challenges. So here’s that taxonomy-like list of social issues I managed to mention (in a nutshell and still incomplete):

1. In creation of LLMs

1.1 Resource usage (sensu climate change issues):

1.1.1 Electricity use for the computations training the LLMs;

1.1.2 Water use, for the cooling of the data centres where the computation takes place.

1.2 Exacerbating disparities, in that the less well off can’t compete with the rich corporations in The North and end up crowded out and as consumers only (and possibly also some colonialism, as noted by the speech researchers on Maori w.r.t. OpenAI’s Whisper).

1.3 Data (text) collection, notably regarding:

1.3.1 IP/copyright issues of the text ingested to generate the LLM;

1.3.2 The lack of trust (or the angst) about what data went, and goes, into the LLMs (the ‘could be your emails, googledocs, sharepoint files, etc.’): no-one was asked whether they consented to their content being grabbed for that purpose, and for those who would have disapproved of inclusion if they could, there’s the powerlessness that one seemingly can neither opt out nor verify that one’s text was excluded if opt-out were possible.

1.4 Psychological harm done unto the ‘garbage collectors’, such as the Kenyans in clickfarms, who are the manual labourers hired to remove the harmful content so that the system’s responses are clean and polite.

2. In content of LLMs

2.1 Bias, amplifying the bias in the source text the LLM trained on, which may be undesirable (e.g., gender bias).

2.2 Cultural imperialism:

2.2.1 Coverage/performance disparities. The LLM has ingested more from one region than another, so its output may not be relevant to the locale (say, to people in the RSA) or culture where it is used, but rather be something applicable to people in the USA, as if that were valid for the whole world;

2.2.2 Language. Whose language does it use in the interaction? It pushes out language varieties and dialects that are less well represented in the training dataset, reducing diversity in expression and steering towards homogenisation.

3. In use of LLMs

3.1 Work:

3.1.1 It creates more work without extra resources to do it; at least so far it has created more work for, among others, us lecturers than it purportedly would save (as if we didn’t have enough to do already);

3.1.2 It puts people out of jobs; as happens with many a novel computing technique, this should be managed, but isn’t.

3.2 Information-seeking behaviour affecting democracy. The ‘one answer’ versus equally easily accessible answer options for assessing multiple sources as part of information-seeking in democratic discourse. This is problematic due to fabrications (‘hallucinations’) and fickle dropping of properties (content), and an LLM may be amenable to manipulation for use as a propaganda machine.

3.3 Learning avoidance. There’s a difference between using LLMs as a time-saver when one already has the skill versus skipping learning competencies at school and university, such as writing and summarising course material when learning a subject.

3.x [there surely is more but I didn’t even have enough time to elaborate on item 3.3 already.]

The list in my lecture and workshop slides also included issues with misinformation, disinformation, privacy, and the unclear culpability attribution when there are bugs in the code it generates, which I hadn’t gotten around to including due to time constraints.

I can very well imagine the list will change, not only ending up longer, but also in that more research may solve some of the issues so they can be removed. For instance, currently, language varieties descend into getting mixed into one cocktail (they also did when David Traum tried with several Englishes), but it’s an interesting research question how one can (re)train an LLM to detect them in the training corpus and output them correctly, be this for written text or speech. It does not sound like an insurmountable problem to solve. Fear may be addressed with openness and education; policies might address some other issues.

Rotating Kafka head/disco ball in the city centre of Prague. (Source: I took it a few days before the INLG’23 conference)

While I was quickly going through my list, one attendee had walked over to the microphone, and so I ended it at item 3.3. The question was about the impact of LLMs on the research community. The panel was brought to a close soon thereafter and lively comments followed when we all strolled into the conference welcome reception that took place at the same venue. I was pleased to hear those comments. More public debate in the panel session, however, would have been better for everyone compared to relegating it to the reception. Whether the muted response during the panel session was due to it having been a long day already—a great keynote talk by Emmanuel Dupoux, two long-paper sessions with interesting research, a poster session, and Ryan’s keynote—or due to it being recorded, or for some other reason, I don’t know. Perhaps it is also up for debate whether it was wise to speak up. But no-one saying anything about some of the challenges with the social impact of LLMs in society was, in my view, not an acceptable option either.

To close off this blog post, I must note that there are more lists on social issues with LLMs, and there’s quite some overlap between those resources and the taxonomy-like list described above. Among others: I can suggest you read this or that paper, or, if you’re short on time, have a look here or here, which all have more explanatory text and references than this blog post.

CoSMo: a content selection modelling language

How do you declare content of interest, or specify a selection of content, from a database or RDF triple store, with the aim of converting that data into natural language text? Where that content may be individual objects or classes or both, the relations that hold between them, and even functions to compute stuff from it? To be able to store such a specification and build larger ones from smaller ones? And that by possibly anyone, in their own language if they so wish? These are non-trivial, broad requirements for, at least, the so-called “abstract representation” language for Abstract Wikipedia [Vrandecic21], with the data store being Wikidata (an RDF triple store that currently runs on Blazegraph). Examining related work, it turned out that no such language exists, so we set out to develop one.

Fortunately, we had an idea about how to go about designing a modelling language [FillottraniKeet21], but theory is not the same as praxis, and it hadn’t been tested with a real use case just yet, and a different type of modelling at that. One might as well try to kill two birds with one stone, and that is indeed what we did. The outcome is described in an extended technical report [ArrietaEtal23], which is, admittedly, still quite research-oriented, although we tried to explain things more than we normally do in scientific publications. We dubbed our language CoSMo, from Content Selection Modeling language. It makes one component of that magic box called “article generation”, as part of the overall orchestration of Abstract Wikipedia components, clear and precise.

CoSMo’s intended place in the high-level outline of Abstract Wikipedia, at step 1 of the NLG pipeline. (It can also be used outside the AW context.)

The remainder of this blog post contains a quick non-technical digest of that 32-page report.

The preparations: figuring out the requirements

In the Abstract Wikipedia NLG team, which focuses on the natural language generation (NLG) specifically, we all knew we needed something between the content of Wikidata and the natural language text as output. The content of that text would be represented abstractly, in a structured fashion, i.e., the so-called “abstract representation” of it, which would be written in self-contained pieces, called “constructors”. So far so good. Then came several stakeholder meetings with the Abstract Wikipedia team and other interested Wikipedians in September and October 2022. We discovered there were wildly different assumptions about the constructors, what they were for, and what one should be able to model with them.

Taking a step back, we all devised examples, and two proofs-of-concept were experimented with (Nina/Udiron by Mahir Morshed and a Scribunto-based implementation by Ariel Gutman). From there, it was possible to extract requirements for language design. Some of this was written up in a discussion document and put online late November 2022. It did not receive any feedback. More examples and implicit requirements were collected, including more ‘reasoning backwards’ from sample desired output to what data would need to be in Wikidata to be able to get such a sentence out of it, to inform what a constructor needs to have to bridge the two extremes. Various features were elaborated on and discussed in an Abstract Wikipedia open meeting of the NLG group (recorded). It was February 2023 already.

It had become clear that not everything could reasonably be done all at once, nor would one want to, so Pablo Fillottrani, Kutz Arrieta, and I made the decision to take one step at a time: the first step in the natural language generation pipeline is content selection, not also all sorts of linguistic structures that appear later in the generation pipeline. From these preliminaries, we selected a list of design principles and features we thought made most sense (see section 3.2 of the technical report for details), and moved on to the next step.

Defining a language

With the parameters and requirements in hand, it should have been straightforward to define the content selection language. It wasn’t. Sometimes a requirement can be met in more than one way, and we also wanted a graphical counterpart to any textual version, which caused a few iterations to make a constructor look pretty and easily readable. We tried to reuse existing icons as much as possible, so that at least a segment of users may already be familiar with them; for instance, the diamond shape from UML for composition of smaller constructors into larger ones. The report’s Tables 1-3 list the icons for the elements, connectors, and constraints, respectively, their longform notation when rendered in English, and a usage comment.

The syntax needed a semantics, which it got (see also Table 4 of the report). Working on the details and example diagrams, we admitted it did need a grammar as well, to make sure that constructors are specified in the way they’re supposed to, and to make the step towards an implementation easier. Those details can be found in Appendix A near the end of the report.

Behind the pretty graphics and longform renderings, a constructor specification looks uninviting and is not intended for human consumption, as is the case for most of Wikidata and Wikifunctions, due to the aim of natural language independence, or at least some form of multilingualism. Here’s an example of the not-for-humans specification of a constructor, where the CSMxxx numbers are elements in CoSMo, and the Q and P items exist in Wikidata:

CSM007:C5(
   CSM003(P106(r1,r2)),
   r1:CSM002(Q5),
   r2:CSM002(Q18844224),
   CSM003(P136(r3,r4)),
   r3:Q18844224,
   r4:Q24925,
   CSM002(Q5)={Q42})

In a text-based user interface, the identifiers are to be replaced with their respective labels in the chosen natural language. In a graphical user interface, the CSMxxx elements are to be rendered graphically. An example of a diagram that illustrates a case where three constructors are combined is shown next:

Where possible, the CSMxxx identifiers were mapped to Wikidata items, and from there we obtain the human-readable names, which are listed in the technical report’s Table 5 for English, Spanish, and Basque.
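For the programmatically inclined: that label substitution is conceptually simple. Here’s a minimal sketch in Python, assuming a plain lookup table per language; the CSMxxx labels below are invented for illustration, whereas the Q and P labels follow Wikidata (e.g., Q42 is Douglas Adams, Q5 is human, P106 is occupation, and P136 is genre):

import re

# Hypothetical label table: the CSMxxx labels are made up here,
# the Q and P labels are Wikidata's English labels.
labels_en = {
    "CSM007": "Constructor", "CSM003": "Statement", "CSM002": "InstanceOf",
    "Q5": "human", "Q42": "Douglas Adams",
    "P106": "occupation", "P136": "genre",
}

def render_longform(spec, labels):
    # Replace each CSMxxx, Q, or P identifier with its label in the chosen language.
    return re.sub(r"\b(CSM\d+|[QP]\d+)\b",
                  lambda m: labels.get(m.group(1), m.group(1)), spec)

print(render_longform("CSM003(P106(r1,r2))", labels_en))
# prints: Statement(occupation(r1,r2))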

Evaluation

There are several strategies to determine whether the language is any good. The first step is to check whether it meets the requirements it is supposed to meet. CoSMo does (see section 4.2). The second step asks ‘can one create a constructor (without too much effort)?’ and, pushing that further, ‘is it expressive enough to represent the examples we started out with?’. The answer to the first question is a resounding Yes. Further to the previous image, here’s one of those in CoSMo longform textual notation, where the elements are rendered in English and in Spanish (Appendix C has more examples):

The answer to the second question is trickier. In the very strict sense, the answer is No, which doesn’t sound good on first reading. The first reason why it’s not as bad as it sounds is that some examples from last year mixed content and natural language features, which was precisely what we needed to disentangle to make constructors manageable, usable, and reusable. The second reason is that some examples made assumptions about Wikidata content where either that content is represented differently or the task should be delegated to the recently launched Wikifunctions. Here’s a section of the San Francisco example from [Vrandecic21] in CoSMo graphical notation, which contains a function for computing the ranking on-the-fly rather than hard-coding it (the Z item number is made up, and such a function still has to be added to Wikifunctions, which launched just this month).

Finally, we conducted a feature comparison of CoSMo against similar-looking modelling languages, and, indeed, it is one of a kind.

In closing

Is this all there is, and can a system generate natural language text with it already? No, sorry, things don’t go that fast when only limited resources are at one’s disposal. A few other parts of the natural language generation pipeline are available for different NLG approaches, and CoSMo fills a gap in the procedure. Not only that: CoSMo can just as well be used outside Abstract Wikipedia, even when no natural language generation is in sight. From a data store design perspective, CoSMo offers an abstract representation, i.e., a modelling language, for data store view specifications, akin to SQL VIEW statements, but regardless of whether that is for a relational database, an RDF triple store, or even some JSON thing that tries to pass for a database. Or, from a graph algorithm viewpoint: CoSMo makes defining subgraphs (among other things) more accessible to a broader user base.
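To make the subgraph point a little more concrete: a hypothetical minimal sketch in Python (not CoSMo’s actual semantics), where a ‘view’ over a triple-based graph keeps only the triples with the selected properties, much like an SQL VIEW keeps only the rows that satisfy its definition:

# A toy edge-labelled graph as (subject, property, object) triples;
# the identifiers are illustrative, in the style of Wikidata's Q and P items.
triples = [
    ("Q42", "P106", "Q36180"),  # Douglas Adams - occupation - writer
    ("Q42", "P136", "Q24925"),  # Douglas Adams - genre - science fiction
    ("Q42", "P19", "Q350"),     # Douglas Adams - place of birth - Cambridge
]

def view(graph, properties):
    # Keep only the triples whose property is in the selection.
    return [t for t in graph if t[1] in properties]

print(view(triples, {"P106", "P136"}))  # the occupation-and-genre subgraph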

Last, but not least: no, LLMs will not fix it for you. There are a few technical hiccups at present (read a discussion of LLM-to-SPARQL), but even if/when they are solved: SPARQL itself does not meet all the requirements for the constructor specification language and, hence, neither does an LLM-to-SPARQL app (see Table 6 and compare the WONDER, VSB, and RDF Explorer columns with CoSMo for details).

Finally, CoSMo is a concrete proposal – admittedly, one that we hope will be adopted by the Abstract Wikipedia project – and we look forward to feedback!

References

[ArrietaEtal23] Arrieta, K., Fillottrani, P.R., Keet, C.M. CoSMo: A constructor specification language for Abstract Wikipedia’s content selection process. Technical Report, arXiv:2308.02539. 32p. 1 August 2023.

[FillottraniKeet21] Fillottrani, P.R., Keet, C.M. Evidence-based lean conceptual data modelling languages. Journal of Computer Science and Technology, 2021, 21(2): 93-111.

[Vrandecic21] Vrandečić, D. Building a multilingual Wikipedia. Communications of the ACM, 2021, 64(4), 38-41.

**Improved** — the BFO Classifier

A tacky title, but it has been shown to be true. I had introduced the BFO Classifier, which assists with aligning an ontology to BFO, in a blog post in December 2021. We worked on the write-up and some more evaluation, and presented it last year at the FOUST-VI workshop at JOWO 2022 [1]. It was well-received, yet questions remained about the quality of the decision diagram. Would a different decision diagram yield better results in aligning to the top-level ontology, or make the process even easier (less hard)? What is a good way of designing such a decision diagram for any foundational ontology anyway? And what if the evaluation approach were different?

We set out to answer these questions, which included improving the decision diagram. Replacing the philosophy terminology with better-known words and adding examples was a first strategy. We went for another one as well: to try to incorporate solutions to alignment mistakes. That is, to determine where modellers opt for suboptimal or even wrong alignments, assess why this may be the case, and incorporate into the selection procedure a way to avoid such modelling issues, where feasible.

One of the collaborators, Zubeida Khan, arbitrarily selected 10 BFO-aligned ontologies from BioPortal. We then slogged manually through them, tabulated the alignments, and noted whether each alignment was the domain ontology’s own or actually came courtesy of an ontology that it reused. There were 234 alignments in those 10 ontologies, of which 103 to material_entity. Seventeen of the 34 BFO entities did not have any alignment. A whopping 26 imported (modules of) ontologies were the ones that actually provided the alignment to a BFO entity, rather than the 10 ontologies that were selected.

The lopsidedness of alignments won’t help with creating a better decision diagram, but, perhaps, illustrates that alignment assistance may be needed to make better use of all those BFO entities. What can be gleaned from the alignments upon further analysis? Unfortunately, there were indeed a number of infelicities, which we categorised into seven key issues, as shown in the table below, with illustrative examples for each.

Further details, and suggestions on how to avoid them, are described in the paper entitled A method to improve alignments between domain and foundational ontologies [2], which will be presented at the 13th International Conference on Formal Ontology in Information Systems (FOIS2023), held 17-20 July in Sherbrooke, Canada. Where possible, the avoidance strategies to emphasise certain ontological distinctions were incorporated into the revised decision diagram, which now is as follows:

BFO Decision Diagram version 2 (Source: [2])

Finally, there’s the question of whether this decision diagram is any better than its version 1. Under the assumption that alignment would be easier to do with a tool, we revised the XML file for the decision diagram and rebuilt the BFO Classifier tool (with thanks to Toky Raboanary). Here are two sample screenshots of the interface. The first one is at the end of the sequence, after having answered all the questions in a branch, where the user now can either backtrack if they’re not happy with the alignment or click ‘insert axiom’ to insert the alignment into the ontology.

The second example with Eating is somewhere along the path of answering questions, with the options available shown in the drop-down box.

There’s the occasional glitch in the tool (once missing an example that’s in the diagram and once showing ‘[]’ instead of rendering the class name) and we have a ‘desired features’ list ourselves, but it was not the aim to make a perfect tool at this stage. We first wanted to get the approach right and see if we could make a better decision diagram. The code is freely available for adaptations.

The key question, obviously, then was whether this revised version improves alignments. This blog post’s title gave it away already. We contacted a few modellers who then used both versions or one of the two, and we compared the results to any original alignments and against v1. They included César Bernabé (with Leiden UMC’s BioSemantics Group) and Zola Mahlaza (with UCT), who provided motivated alignments; Zola also volunteered a rigorous tool assessment. Details are described in the paper. Version 2 improved the alignments and was also deemed a confidence-builder when the alignment matched.

It also became clear that if the to-be-aligned entity is not fully understood, then either (1) the decision diagram’s questions help clarify it, so that the entity can be aligned deeper down in the hierarchy (cf. ‘playing it safe’ by aligning higher up and being less precise), as was the case in our previous evaluation [1] as well, or (2) it becomes evident that the to-be-aligned entity really needs further ontological investigation first, to be able to determine what’s really meant by it. Put differently, it assists in uncovering any lingering vagaries.

Conversely, it also points to possible gaps in the foundational ontology. It can be irritating or confusing when the answer options are only “yes” or “none of the above” – why not a “no”?! It may be tempting to blame the tool or the decision diagram but, at times, there’s only a lonely subentity without any sibling entities, so the ‘yes’ will get you to the child node and the ‘none of the above’ will keep you at the parent node. We can’t help that, other than making a user cognizant of a need they may have, so that it may be passed on to the foundational ontology developers to possibly extend their ontology there.
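For those who like to see the mechanics spelled out: a minimal sketch of that traversal logic in Python (with hypothetical names and paraphrased questions, not the tool’s actual implementation), where a ‘yes’ descends to the child node and ‘none of the above’ keeps you at the current node:

from dataclasses import dataclass, field

@dataclass
class Node:
    # One node in the decision diagram: a BFO entity plus one question per child.
    bfo_entity: str
    children: dict = field(default_factory=dict)  # question text -> child Node

def classify(root, answer_yes):
    # Walk down the diagram: a 'yes' to a question descends into its child;
    # 'none of the above' (no 'yes' to any question) stays at the current node.
    node = root
    while node.children:
        chosen = next((child for question, child in node.children.items()
                       if answer_yes(question)), None)
        if chosen is None:
            break
        node = chosen
    return node.bfo_entity

# An illustrative two-level fragment (questions paraphrased, not BFO's wording)
diagram = Node("entity", {
    "Does it happen or unfold in time?": Node("occurrent"),
    "Does it persist through time?": Node("continuant", {
        "Does it have some matter as part?": Node("material entity"),
    }),
})
print(classify(diagram, lambda q: q == "Does it persist through time?"))
# prints: continuant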

Whether this decision diagram will resolve all the alignment struggles mentioned in, among others, [3,4], remains to be seen, but we’re hopeful. If there’s interest among the ISAO2023 participants, I can include a practical exercise/experiment about it during one of the Friday ‘barcamp’ slots. As to the paper’s presentation, while joint first author with Zubeida, I’ll present it at FOIS2023 in Sherbrooke next month. If you have any questions or comments, please feel free to contact either of us, or let’s meet up at FOIS2023!

References

[1] Emeruem, C., Keet, C.M., Khan, Z.C., Wang, S. BFO Classifier: Aligning domain ontologies to BFO. FOUST-VI: 6th Workshop on Foundational Ontology, part of JOWO’22. Prince Sales, T., Hedblom, M. and Tan, H. (Eds). CEUR-WS vol. 3249. 13p. 15-19 August 2022, Jönköping, Sweden.

[2] Bernabé, C.H., Keet, C.M., Khan, Z.C., Mahlaza, Z. A method to improve alignments between domain and foundational ontologies. 13th International Conference on Formal Ontology in Information Systems (FOIS’23). IOS Press, FAIA vol. xxx, xx-xx. 18-20 July, Sherbrooke, Canada. (in print)

[3] Stevens, R., Lord, P., Malone, J., Matentzoglu, N. Measuring expert performance at manually classifying domain entities under upper ontology classes. Journal of Web Semantics, 2019, 57:100469.

[4] Bernabé, C.H., Queralt-Rosinach, N., Souza, V.E.S., da Silva Santos, L.O.B., Jacobsen, A., Mons, B., et al. The Use of Foundational Ontologies in Bioinformatics; 2022. https://doi.org/10.21203/rs.3.rs-1929507/v1