Can a pizza have fruit as topping?

Posted on May 13, 2024 by keet

(everyday ontology #2, informal musings on everyday objects)

If we have a pizza and put, say, pineapple on top of it, is it still a pizza? If you ask an Italian, the answer is a resounding “NO!”; one can’t have fruit on a pizza, but vegetables are permitted. This begs the question: what makes a pizza a pizza? And, to be fair: what is fruit and what is a vegetable?

I have seen peculiar toppings on edibles that were sold as pizza. Kebab meat, shawarma, spare ribs, and whatnot. The Debonairs pizza chain in South Africa sells a “Cram-Decker” pizza where the crust contains two chicken cheese grillers (sausages in the dough of the rim of the pizza) – I did not even bother to ask my Italian colleagues whether that’s fine. My distinct impression is that outside of Italy, a pizza is a thing that consists of a plate-sized, thin, round, yeast-based dough baked in the oven with a topping starting with tomato sauce and then other edibles added to it subject to the inventiveness of the cook and adjusted to the taste of the local population.

*Photo I took in the Spar supermarket nearby, earlier this month*

The key issue is the claim that “a pizza does not have fruit as topping”, or, if you will, “no pizza has fruit as topping”, or “a pizza has toppings that are not fruit” – they’re not logically equivalent statements, but let’s not squabble about that, for now.

I, among other people, have retorted to fellow Italian ontologists that, yes, most pizzas in fact do have fruit as topping: tomato. But somehow the tomato doesn’t count in their view; it’s supposedly a vegetable because in the context of the kitchen tomatoes are used in recipes where one expects vegetables to play their part. Biologically, however, tomato sure is a fruit. Now what? We can explore a number of avenues…

Setting precedents?

The squabbles about tomatoes were recorded already in the Nix v. Hedden case at the Supreme Court of the USA, in 1893. Arguments were put forward to determine whether a tomato is a fruit or vegetable, principally because vegetables were subject to import duties and fruit was exempt. In that court, common parlance usage and collecting taxes prevailed over a ‘does it have seeds’-test.

What is common usage? A fruit surely does not cease to be fruit just when it’s brought into the kitchen, also in Italy; consider, e.g., strudel di mele with apple as a sweet treat and the pears in the pear-walnut-gorgonzola salad of a main meal, presumably a savoury side-dish. Is the humble tomato the exception to the rule? No, but that may become evident only when we unpack the ‘fruit’ vs ‘vegetable’ issue.

Biologically, a fruit is “fleshy or dry ripened ovary of a flowering plant, enclosing the seed or seeds.” Not only those pears are fruit, but also cucumbers, pumpkins, aubergines, courgettes, peppers, green (string) beans, and maize (corn), to name but a few. A pineapple is a composite fruit, based on multiple flowers growing together (artichokes are flowers, as is broccoli, by the way). I have seen aubergines and peppers on Italian pizzas in Italy as well, so the issue extends well beyond the emblematic tomato.

What science and industry make of it

There’s a journal called International Journal of Vegetable Science. They should know. In 2021, the editor Vincent Russo wrote an editorial and he started it deadpan with “There is some confusion associated with the journal title.” An in-between the lines reading suggests he’s fed up having to reject out-of-scope papers because the authors don’t get the point of what a vegetable is. “Botanically a vegetable is anything that is not the reproductive portion of the plant derived from a flower.” he states, but also that weeds and herbs are a bit borderline, for the journal at least, whereas others—including our tomato—“are botanically fruit but have been through misuse of the term, and through court decisions, have come to be considered vegetables.”

Kathy Wolfe of Washington State University wrote up an accessible introductory overview on vegetables, fruits, and nuts and added a list of informal further reading. Spoiler alert: it complicates the issues. The FoodON food ontology went the other way with vegetable, copying over the first sentence on the Wikipedia page: “Vegetables are parts of plants that are consumed by humans or other animals as food.”; consequently, fruits are vegetables. The Plant Ontology takes the scientific stance on fruit and thus also on the tomato; it doesn’t contain the term ‘vegetable’.

There’s also a World Vegetable Centre that states early on that “There is no botanical definition of a vegetable.” They go on providing properties against which to test: is it eaten during the main part of the meal vs a snack or dessert? Is it sweet vs not sweet? And not a root crop (like potato, which are starchy vegetables)? Let’s take that sweetness criterion: lemons are definitely fruits from the culinary viewpoint, yet have a lower sugar content than tomatoes (2.5g vs 2.63g/100g according to the FDA), pumpkin with 2.8g/100g is used in both sweet and savoury dishes, eggplant with 3.5g of sugar is distinctly on the vegetable side of dishes. And so sweetness as measured by sugar content is not a viable option to distinguish a fruit from a vegetable. The world vegetable centre’s final proposal: “Based on the above considerations, we propose to define vegetables as mostly herbaceous annual plants of which some portion is eaten, either cooked or raw, during the principal part of the meal to complement starchy food and other food items.” (boldface in the original) Immediately thereafter, they tear it down themselves for its failings to capture the nuances both due to culture (potatoes are considered vegetables in some Asian cuisine) and the same thing can be considered either a fruit or a vegetable depending on the level of maturity (papayas and mangoes). Classifying by either of these properties would make for an awkward ontology at best.

A linguistic analysis is not going to resolve it either

It’s easy to take a convenient stroll into language if the language is English. A fruit is the fruiting body and the word ‘vegetable’ is similar to the word ‘vegetative’, as in non-reproductive part of the plant. In Dutch, it’s fruit and groente ‘greens’ and in Spanish it’s fruta and verdura ‘greens’, respectively, but not all vegetables are greens and some fruits are green. Etymologically, fruit’s origin is in Latin, from frug- and frui, and vegetable in Indo-European languages is claimed to refer to notions of green or growing. The German Gemüse, however, comes from Mus and ‘meal from (chopped) plants-of-some-use’. In isiZulu, the largest language by first language speakers in South Africa and in a different language family, isithelo translates to ‘fruit’, ‘product’ and ‘results’, with a root –thel- referring to blossom(ing), bringing forth, bearing (fruit), (paying) taxes and the like. Umfino is one translation for vegetable,with the root word the same. My Dent and Nyembezi hardcopy dictionary also offers uhlaza for vegetable, which is back-translated into ‘plant life’, ‘green grass,’ ‘greenness’, and ‘dagga’. Its versatile root –hlaz- has completions also into nouns and verbs in what seems to be quite different concepts (e.g., inhlazo ‘disgraceful event’). If anything, the German and isiZulu notions lean toward aforementioned FoodON and Wikipedia entry.

Should we create two ontologies?

What if we were to resolve it as two ontologies: one that respects the scientific approach with botany, the other a culinary one with the multiple classification issues as a to-be-determined loose end? As if they—biology and culinary—were two different languages and the same things classified differently, alike beer in English is a drink and in Dutch an animal (a bear) and burro being butter or a donkey (more examples here), but then not for strings of characters but objects. Ontologically, this is entirely unsatisfactory. It is not the case that both viewpoints are right or both contexts equally valid—Ontology, the one with the capital O, analytic philosophy, seeks the one nature of the entity.

The kitchen ontology is untenable. It lacks principles of objects exhibiting the same sets of properties as being the same type of thing. Culture, ripeness, sweetness, time and context of eating as a criterion all have their issues.

Changing a definition

Should we abandon the pursuit of defining fruit and vegetable, as it seems like an endless debate? Meh. Moreover, why should unscientific kitchen gut feeling to rule in an ontology? It should not. In analogy: does speaking ‘kitchen French’ trump the diktats of the Académie Française and do all potions trump medicine? No. What would be so bad about educating people about the food they eat, where it comes from, and how it’s grown? Nothing. So why should pre-scientific notions about food, and a dash of the courts, trump botany? It doesn’t matter that what we put in a meal salad or on top of a pizza is a fruit, nor, for that matter, that the pizza dough is actually also made of a fruit (wheat grains are fruits). We’ll continue eating it all the same anyway – with more appreciation.

Perhaps what is at play is merely the fear of pizza ending up like the extremely versatile pancake. That can be easily addressed with a simple tweak to the definition of pizza, alike: a pizza is a savoury dish that consists of a plate-sized, thin, round, yeast-based dough baked in the oven with a topping starting with tomato sauce and then other edibles added to it subject to the inventiveness of the cook and adjusted to the taste of the local population.

One might argue over other edibles as pizza toppings, but as to the question in the title: the answer is thus ‘yes’…

If you have better suggestions, feel free to propose them in the comments! I have no doubt readers will have their opinion on the matter 🙂

*CC0 image from http://www.publicdomainpictures.net*

A review on logics for conceptual data modelling

Posted on May 4, 2024 by keet

Pablo and I thought we could write the review quickly. We probably could have done so for a superficial review, describing the popular logics, formalisation decisions, and reasoning services for conceptual data models. Those sections were the easiest sections to write as well, but reviewing some 30 years of research on only that theme was heading toward a ‘boring’ read. If the lingering draft review could have spoken to us last year, it would have begged to be fed and nurtured… and we listened, or, rather, we decided to put in some extra work.

There’s much more to the endeavour than a first glance would suggest, and so we started digging deeper to add more flavour and content. Clarifying the three main strands on logics for conceptual data modelling, for instance. Spelling out what the key dimensions are where one has to make choices when formalising a conceptual data model, just in case anyone else wants to give it a try, too. Elucidating distinctions between the two approaches to formalising the models, being rule-based and mapping-based, and where and how exactly that affects the whole thing.

A conceptual model describing the characteristics of the two main approaches used for creating logic-based reconstructions of conceptual data models: Mapping-based and rule-based. (See paper for details)

Specifically, along the way in the paper, we try to answer four questions:

Q1: What are the tasks and challenges in that formalisation?
Q2: Which logics are popular for which (sub-)aim?
Q3: What are the known benefits of a logic-based reconstruction in terms of the outcome and in terms of reasoning services that one may use once a CDM is formalised?
Q4: What are some of the outstanding problems in logic-based conceptual data modelling?

Is there anything to do still on this topic, one may wonder, considering that it has been around since the 1990s? Few, if anyone, will care about just another formalisation and you’ll unlikely to get that published no matter how much effort it took you to do. Yet, Question 4 could indeed be answered and it’s far from a ‘no’.

We need more evidence-based research, more tools with more features, and conceptual modelling methodologies that incorporate the automated reasoner. There’s some work to do to integrate better, or at least offer lessons-learnt and have results re-purposed, with closely related areas, such as with ontology-based data access and with ShEx & SHACL with graphs. One could use the logic foundations to explore new applications in other contexts than just modelling and that also need such rigour, such as automated generation and maintenance of conceptual data models, multilingual models and related tasks with controlled natural languages or summarization (text generation from models), test data generation, and query optimization, among others.

More details of all this can be found in the (open access) paper:

Pablo R. Fillottrani and C. Maria Keet. Logics for Conceptual Data Modelling: A Review. In Special Issue on Trends in Graph Data and Knowledge – Part 2. Transactions on Graph Data and Knowledge (TGDK), Volume 2, Issue 1, pp. 4:1-4:30, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024) https://doi.org/10.4230/TGDK.2.1.4

Notes on the ontology of International Standard Book Numbers

Posted on April 1, 2024 by keet

Preamble. Draft notes about experiences and peculiar findings on ISBNs were gathering dust and some people got curious: how could something as mundane as ISBNs, and a popular topic for EER model design exercises, be not straightforward? It was not worth the effort to write scientific paper, but cute enough for a blog post on ontology and modelling in everyday life. (everyday ontology #1)

~~~

A glitch occurred with the ISBN of my memoir at a late stage in the publication process when that one memoir by that one publisher needed two ISBNs, one for the printed version and one for the print-on-demand version. The printed version can be re-printed by popular demand, with the same ISBN, but that’s different from printing more print-on-demand copies with that other ISBN. It made me explore the international standard of book numbers. ISBN basics and a dose of knowledge of database theory and techniques were the key to resolving that glitch in the publication process of the memoir. But questions had surfaced as to what the ISBN really means, questions that needed answers.

A quick search revealed the easy part to grasp, and rule to adhere to: each format – hardcover, paperback, e-book – needs its own ISBN. A database needs to be able to distinguish which of the three has been sold in the shop in order to keep track of the inventory and for online retailers to send you the format you bought. The ‘same’ book published by a different publisher also requires its own ISBN, for each of these formats, because the right amount of money has to be sent to the right publisher. Different editions need their own ISBN as well, as they tend to differ, like an extra preface or postscript. So far so good from the conceptual data modelling and retailer’s perspectives.

Ontologically, one might not be so happy. One might have assumed that an ISBN help to identify a book, but clearly it only provides identification at best, not identity. What is ‘the book’? What makes a book a book is its content, as the thing we refer to when someone says “Yes, I bought the book; the cheaper paperback, not the hard cover though”. In that sense ‘the book’ excludes at least the table of contents and any index, since page numbers will vary across formats, and front matter and cover image and text may differ. Then there’s a meaning of ‘the book’ as the physical manifestation, be it on the bookshelf or stored on disk in e-book format. Let us follow the direction of formats rather than the alley of an information artefact ontology. Curiously, for e-books, each of those formats also need their own ISBN, like the EPUB and PDF and Kindle versions.

What does that say about the identity and identification of the object? The reasoning by the ISBN organisation is as follows:

“if a specific device or software is required to read the e-book or different usage constraints that control user functionality are offered (e.g. copy, print, lend etc.) then each separate version will be a distinct product. Each distinct product that is available must be identified by its own ISBN as it is a separate publication. Thus, a separate publication is normally defined by a combination of product form features or details and usage constraints.” (emphasis added)

So, a book number is a number for a unique product that is a publication; and since a publication need not be a book, a non-book artefact may have a book number. Also, the (PDF file for the) print-on-demand requires a different ISBN, and thus deemed to be a separate product, from the PDF file for the regular batch printing of the same book. It may be the same book that is being printed from the PDF file, but nonetheless the softcover hardcopies are somehow different, counting as two publications instead of one.

For my no taming of the enthusiast book, the only difference was the change in ISBN number (circular issue) and the cut-off points of the width of side of the cover due to different paper type in overseas printers assumed by the international distributor. For my modelling book, there are, at least, the ISBN-10 3031396944, ISBN-13 softcover 978-3-031-39694-6, ISBN-13 e-book 978-3-031-39695-3, a digital object identifier (DOI) 10.1007/978-3-031-39695-3, and an Amazon Standard Identification Number (ASIN) for the Kindle version (B0CDP5KXT7). Declare sameAs in the knowledge graph or LOD cloud at your own peril. Yet, I doubt that the Department of Higher Education’s publication bean counters will count my modelling book as at least two, if not five, publications – with each publication earning a subsidy for the university.

*Section of an image found at https://kitaboo.com/online-ebook-publishing-5-easy-steps/ and copyrighted by “yanlev – Fotolia” or “Federico Caputo”*

The content in each book is exactly the same otherwise. If either of the publishers were to create a PDF that disabled printing, it should obtain a separate ISBN, according to the ISBN description. If there’s software that determines 0, 1, 5, or whichever number of sequential or concurrent lending for 30 minutes, a day, two weeks or whichever amount of time, it would be different ISBNs. Access restrictions by country, be it due to censorship or just so, likewise. The ‘just so’ indeed does exist. My most recent experience was trying to get my hands on Ten Planets by Yuri Herrera: the sci-fi short story collection is not on AppleBooks at all according to the app and Amazon didn’t make the the Kindle version available to people physically in South Africa, yet Rakuten’s Kobo was eager to sell it to me and made me a happy reader. Restrictive and, at times, opaque ‘digital rights management’ supposedly all require separate ISBNs. That’s absurd.

What’s going on? Book-loving people not grasping software or ontology? Or, given that ISBNs have to be bought, is it a money-making exercise to find more ways to collect money from the mostly poor and underpaid writers and struggling publishers? What is the criterion for “distinct”? Really only differences in basic “functionality”? Sure, an ontology-enhanced digital and interactive Inquire Biology is functionally is different from the original printed textbook, but that’s thanks to its text mark-up, context-sensitive questions and answers, semantic browsing and search.

In contrast, a lending constraint, say, is not a functionality or feature of the book intrinsically. If desired, such an accidental feature added to an e-book should be constraints managed by the software rather than requiring new book identifiers. Digital Rights Management (DRM) technology adds a wrapper to the e-book for each variation of access control, including number of users and devices, time, and so on. Each variant is a newly wrapped e-book file and, apparently, needs its own ISBN.

There are countably infinite ways to declare usage constraints. One could create a unique DRM-ed version for each person in the world, multiplied by the number of devices we each can use it on, multiplied by region locations, etc. And that for each book published. About 4 million new books are released each year, including traditional (500K to 1mln), self-published (some 1.7mln), and other forms of publishing. With a more realistic 10-25 ISBNs per book, the numbers will run out well beyond the current global civilization, if they were classless numbers.

https://en.wikipedia.org/wiki/ISBN#/media/File:ISBN_Details.svg (source: Wikipedia)

But that’s not how it works. The 13 digits of ISBN-13 are class-based, so it won’t accommodate nine trillion (9 999 999 999 999) books. A bar code is associated with it, and the first three digits are allocated to the books: 978 and, for the sake of argument, let’s take all of 979 as well. That reduces the amount to 2 times 9 999 999 999, or about 20 billion. Between one and five digits are allocated to the group (language), four to the publisher, at least three for the title, and the last one is for the checksum. So, we obtain at most 2 billion books with ISBN-13. Practically, and assuming no-one will take the ISBN organisation to the task with their “product form features”, it’s probably enough, albeit a trivial exercise to ruin for anyone with enough money to buy ISBNs.

Either way, ontologically, it’s clear what causes the conceptual mess. Different book numbers for intrinsic and extrinsic features of what makes a book a book, fine; different numbers for countably infinite combinations of accidental padded-on usage features for the same book, not. A DRM wrapper can be hacked, cracked, and circumvented and then it violates the numbering scheme. The EPUB format can be converted into PDF and into MS Word and so on, violating the app-specific identification principle. Convertible formats shouldn’t require distinct ISBNs just because different default software applications may be needed to open them. An e-book standard for interoperability would negate the application file format issue, which was precisely the point of the EPUB open standard.

The plethora of external accidental arbitrary constraints/features added to an e-book belong to a different category of features from those intrinsic to the book. They would need a book usage number or a DRM wrapper number for identification, not a book number, for it is an identification for a file on top of the actual e-book, not a new publication of the book itself by any definition of what a book is. In its own way, Amazon’s ASIN for Kindle editions of e-books does just that. But it’s a missed opportunity for the current ISBN standard.

How this affects a typical ‘library loan’ modelling homework exercise or an information artefact ontology is left as an exercise to the reader… Edraw’s examples would need to be updated and the IAO extended, for instance. Alternatively, the ISBN organisation could revise what merits a distinct ISBN.

On comparing models

Posted on February 25, 2024 by keet

Some of the readers of this blog are interested in modelling, and then mainly conceptual data models or ontologies. There are more types of models and modelling languages as well, such as mind maps, biological models, domain-specific languages, and so on. Can you confidently say—and justify!—which one is the best? Would such an answer be so elaborate so as to lean towards the idea of, and support existing calls for, modelling as a specialisation in an IT or computing degree programme, if not deserving to be a separate discipline outright? If so: why? What sets it apart and what are recurring themes across the various types of models and ways of modelling, and their differences? These questions are easy to ask, but far from trivial to answer. I tried anyway, to some extent at least. The latest attempt written in an accessible way—i.e., more like popular science than textbook-like—can be found in Chapter 7 of my recently published book entitled “The what and how of modelling information and knowledge: from mind maps to ontologies”, which was published by Springer (also available through Springer professional, online retailers such as Amazon, and university libraries). Instead of summarising that in this post, I did so in a guest post of Jordi Cabot’s blog, which can be read here: https://modeling-languages.com/on-comparing-modelling-languages/.

Figure 1. Two example diagrams about espresso machines: a mind map and a conceptual data model. If you have no idea about what or how to compare yet: before reading about the comparisons, can you describe differences between these two examples?

Background readings to the “Melokuhle – good things” short story

Posted on January 7, 2024 by keet

==== WARNING: SPOILERS AHEAD ====

If you have not yet read the short story Melokuhle – good things, published recently by East of the Web, you are advised to do so before continuing to read this post, unless you are an academic who really insists on looking at some research first.

========

This post is not about analysing the story about the somewhat culturally aware cooperative care robot, but about some of the papers relating to the theory, technology, and ethical aspects of it. I did intend to practice writing fiction, yet somehow I couldn’t resist weaving some scientific and educational aspects into the story. Occupational hazard, I suppose. (I’ve been teaching computer ethics/Social Issues and Professional Practice since 2016 to some 500-700 first-year students in computing.)

The idea of the story was initially motivated by three research topics that came together into the story it has become. First, computing and AI has ‘scenario banks’ and websites with lists of ethics questions to debate as to what we, as scientists, engineers, system architects, or programmers, should make the machine do, or not. One of my former students picked a few he liked to use in his project, one of which was the ‘Mia the alcoholic’ 1-paragraph scenario described by Millar [Millar16]. In short, it concerns the question: should the care robot [be programmed such that it will] serve more alcoholic beverages to the physically challenged alcoholic user when they ask for it? Melokuhle – good things provides a nontrivial answer that can be fun to discuss.

Perhaps unsurprisingly, not everyone will give the same answer to that question. The probably most popular demonstration of why this may be so is the research conducted with the MIT Moral Machine that broadened the trolley problem to self-driving cars and a range of scenarios, like whether you’d swerve for 5 homeless people and let yourself die or not, or, say, drive into 5 dogs vs 1 toddler if it had to be a binary choice. It turned out that clusters of answers were by found by region and, presumably, culture [Awad18]. Enter the idea of culturally aware robot.

(image from robotise; https://robotise.eu/wp-content/uploads/2018/02/robot-ethics-3.jpg)

But what is a ‘culturally aware’ robot supposed to do differently from ‘just’ a robot? Around the same time, Stefano Borgo gave a stimulating talk in our department about culturally aware robots, based on his paper about the traits that such a culture module should have [BorgoBlanzieri18]. The appealing idea turned out to be fraught with questions and potential problems, and a lively debate ensued during and after the talk. When is a robot culturally aware and when does one encode tropes and societal prejudices in the robot’s actions? Research is ongoing to answer that question.

Enter the idea to make the user configure the robot upfront somehow as a way to tailor the cultural awareness to their liking. Yet, user-configurable settings for every possible scenario is practically unrealistic: no-one is going to spend days answering questions before being able to use the machine. A possible solution is to have the user enter (somehow) their moral theory and then it would draw the logical conclusion for any particular scenario based on the chosen moral theory. For instance, if the user were to be a devout Muslim and had chosen Divine Command Theory, then with the ‘thou shalt not drink(alcohol)’ command in effect, the carebot’s actions for Mia, or Lubanzi in the short story, would be easy to determine: a resounding no—it wouldn’t even have poured him the first bottle of wine. (refer to the SIPP lecture notes [CSDept19] for summaries of 8 other ethical/moral theories.)

To be able to get this to work in an artificial moral agent, we need to be able to represent a bunch of moral theories computationally and then devise a reasoning module that can take the formally represented theory as input and apply it to the case at hand. We’ve worked on the first part and developed a model for representing moral theories [RautenbachKeet20] and illustrated that, yes, a different moral theory can lead to a different conclusion, and why. The reasoning module doesn’t exist as a piece of software; in fact, there’s not even a design for it on paper yet for how to do it. There are sketchy ideas and the occasional rules-based approaches for one theory, but generalising from that is a few steps beyond still. And there’s the task of theory elicitation, which the short story also alludes to; a student I currently supervise is finishing up his Masters in IT dissertation on that.

The natural language interface issues that passed the revue in the story deserve their own post, or two or three. I wrote an IT outreach and discussion paper on some aspects of it and on just requirements in the context of robots in Sub-Saharan Africa [Keet21], and I conduct research on natural language generation for, mainly, Nguni languages. That comment about lacking resources for isiZulu natural language generation that the programmer supposedly snuck into Melokuhle’s code? That was me, unapologetically: we have plenty good ideas and really would like to receive more research funds…

Overall, much research needs to be done still to realise the capabilities that Melokuhle (the carebot) exhibits in the story—if that sort of robot is one you’d like to have, that is. And so, yes, the East of the Web team that published the short story rightly classified it in the Sci-Fi genre.

Lastly, I did embed a few other bits and pieces of computer ethics, like the notion of behaviour modification or so-called nudging of users by computing devices and a whiff of surveillance in the home if the robot were indeed to be permanently in listening mode. If the story were to be used in an educational setting, they could be elaborated on as well. Further, here are a few questions that may be used to direct a discussion, be it in class or a practical or tutorial group discussion:

What does “culturally aware robot” mean for you?
Is it acceptable to program behaviour modification, or at least ‘nudging’, of a user in a computing device?
Should the care robot always, occasionally, or never comply to the request for more alcoholic beverages? Why?
Would your answers be different if Lubanzi were to have been given a back story of being an alcoholic or an athlete or a retiree in their late 70s?

Also, I used the story for the essay assignment last year, which was on the ethics of deploying robot teachers. The students could receive a few marks if they included a paragraph that contained an answer to “does any issue raised in the short story apply to robot teachers as well?”. I did that partially as a way to reduce the chance that students would farm out the whole task to ChatGPT and partially to make them practice with reasoning by analogy.

To eager learners who are about to register at UCT and will enroll in CSC1016S: I won’t be asking this of you in the second semester, as I’m taking a break from the module in 2024 to teach something else and anyhow we don’t repeat assignments in immediate successive years. I do have a few more draft short stories, though.

References

[Awad18] Awad, E., Dsouza, S., Kim, R., Schulz, J., Henrich, J., Shariff, A., Bonnefon, J.-F., Rahwan, I. The Moral Machine experiment. Nature, 2018, 563(7729): 59-64

[BorgoBlanzieri18] Borgo, S., Blanzieri, E. Trait-based Culture and its Organization: Developing a Culture Enabler for Artificial Agents. 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 333-338, doi: 10.1109/IROS.2018.8593369.

[CSDept19] Computer Science Department. Social Issues and Professional Practice in IT & Computing. Lecture Notes. 6 December 2019.

[Keet21] Keet, C.M. Natural Language Generation Requirements for Social Robots in Sub-Saharan Africa. IST-Africa 2021, IST-Africa 2021 Conference Proceedings. IST-Africa Institute and IIMC Ireland. Cunningham, M. and Cunningham, P. (Eds). 10-14 May 2021, online. (discussion paper)

[Millar16] Millar, J. An ethics evaluation tool for automating ethical decision-making in robots and self-driving cars. Applied Artificial Intelligence, 2016, 30(8):787–809.

[RautenbachKeet20] Rautenbach, J.G., Keet, C.M. Toward equipping Artificial Moral Agents with multiple ethical theories. RobOntics: International Workshop on Ontologies for Autonomous Robotics, co-located with BoSK’20, Bolzano. CEUR-WS vol. 2708, 5 (7 pages). Extended version: https://arxiv.org/abs/2003.00935

A few book reviews for 2023

Posted on December 28, 2023 by keet

It’s that time of year again to revisit books read over the year. Although I’ve been writing more than reading this year, publishing a new book, even, I’ve read a few brand new non-fiction books that are worth mentioning:

Corrupted: a study of chronic dysfunction in South African universities by Jonathan Jansen
Truth to power: my three years inside Eskom by André de Ruyter
The insect crisis: the fall of the tiny empires that run the world by Oliver Milman (UK edition of 2022)
That dreaded ‘other’ basket with books that sounded interesting, but…

The first two books are current affairs books that, while clearly focussing on South Africa, bring afore issues that transcend borders. The remainder of this post discusses mainly the first three books.

Corrupted and chronic dysfunction

Distinguished scholar, university executive manager, and thought leader Jonathan Jansen provides a critical analysis of what, where, how, and why things go wrong to various degrees at South African universities. His analysis is based on over 100 interviews and his extensive experience at multiple universities as academic, manager, and fixer at universities under administration. It’s not a simple case of messed up historically disadvantaged universities versus historically privileged ones or centre versus periphery. But, “at the heart of the dysfunction in universities is an intense and sometimes deadly competition for resources especially on campuses located in impoverished communities.” Not only in the sense of limited resources but, importantly, all and sundry want to eat at the trough and have their slice of the cake.

His analysis shows that the academic project may take a back-seat when the university is the key employer in the region, having turned into an combination of unemployment benefit distribution point of sorts and the focal point for corruption to amass more personal wealth. Then there’s a mix of politicised management and limited capacity that add their own share to the dysfunction cocktail. It must also be said, as does Jansen in his book as well, that not all universities are utterly dysfunctional. The easily readable gripping tale is a must-read for anyone who’d like to understand the South African university landscape better.

A, in my opinion, missed opportunity for the book, if any, was to look beyond the borders and make a connection to academia more broadly internationally, because many of the issues are not unique to South African universities. The ‘academic news/gossip weeklies’, such as UWN, CHE, and THE, regularly report on various issues with political meddling (e.g., DeSantis in Florida HE) and corruption at universities in multiple countries across the globe. And of one university I know first-hand that it was originally created to revert the brain drain, open up the city a bit and stimulate its economy and vibe, allow for multilingual education, and provide employment opportunities for locals, not necessarily always in that order of priority (and that was no secret); they weren’t the first to do so and are unlikely to be the last, and they aren’t trailing in the university rankings either.

Truth to Power

*Image source: Wikimedia Commons https://commons.wikimedia.org/wiki/File:Load_shedding_hour_-_Flickr_-_deeps.adhi.jpg*

De Ruyter’s book is full of juicy sordid details of the back story to the rolling blackouts we’ve been trying to put up with. The power cuts used to happen in fits and bursts and, if happening, see-saw between stage 1 (up to 2.5h/day without electricity) and stage 4 (around 6.5-7.5h/day). Then there was 2022 where the outages sky-rocketed 4-5 times as bad as before and hitting stage 6 several times (9-11h/day without electricity). We were shocked. But then, then there’s 2023, where the amount and severity of outages surpassed the 2022 outages in the second week of May already. Why? Electricity is something many people take for granted as simply being there. There’s a very complex system behind it to make it work, however, and uncountable ways to thwart it.

For South Africa, it’s a complicated banquet of causes why there are so many rolling blackouts for so many hours. In no particular order: problems in management, corruption, sabotage, crime, ageing coal-fired power plants and that nuclear one that is also being pushed to an extended lifespan, state government (a.o.: lack of timely action, conflicting interests across ministries), political power plays, attitude, cartels, coal mafia, BBE subversion, ineffective police. De Ruyter’s book delves into those issues, interspersed with his ideas and motivations of how to manage and lead. It was a fascinating read.

The insect crisis

This popular science book by Oliver Milman does not have anything to do with my research whatsoever, but it’s well-written as to make it an easy and gripping holiday reading. It’s popular science writing of the ‘telling the story as it unfolds’ page-turner variety, rather than a dry summary of scientific papers.

Calling the rapidly dwindling amounts of insects merely a ‘crisis’ seems to be a euphemism; for both popular species and unsung heroes, it’s a disaster of epic proportions on all inhabited continents with dwindling sightings and biomass anywhere from a few percent per year to a collapse of 25-97% over a decade or two (or three or four, depending on the species and region).

The book starts off with an apocalyptic prologue on what earth would look like when all insects had gone the route of the dodo. The main part of the book first spends several chapters reporting on a mix of recording the decline of insect populations and extinctions with their main causes—pesticides, other agricultural practices not supportive of insects, and climate change—and then devotes a chapter each on the challenges with bees and butterflies. The last two chapters are about trying to answer the ‘what can we do?’ question. In a meandering way, it proposes a range of options to pursue, from managing land less and using less pesticides, to more green corridors, to eating less meat (since that requires a disproportionate amount of arable land), to exploring what the techno-optimism might be able to offer. And wishing insects will rebound when they’re given back a little space with a variety of plants.

Just in case you don’t like insects: after reading this book, you will, even if for ulterior motives, like wanting to keep eating chocolate and apples and almonds, to continue drinking coffee, and to enjoy watching geckos and birds and butterflies. The entomologists can’t quite proffer images to generate the aahhs and oohs of cute endangered animals for attention, but many have usefulness on their side, which just might convince the public of their importance and act upon it accordingly.

The ‘other’ basket

And then, then there are books you struggle through, or drag everywhere with the intention to read it during a lost moment but just can’t get yourself to do so. One such blocker is Raihani’s “The social instinct: what nature can teach us about working together” (Vintage, 2022). It still looks interesting and I will try my best to finish it, one day, but, so far and some 8 months later since opening it, I made it to page 35 only.

The other one is “Rebel Talent: why it pays to break the rules in work and life” by Francesca Gino. It’s half memoir with TMI, half pop management research. And then it turns out she probably made up some of the research stuff, partly together with academic rock star Dan Ariely (I read his book, “Predictably Irrational”, too, which was a good read). On research about preventing dishonesty. It all blew up publicly from June 2023 onwards: first by Data Colada and in the NYT, then a long read appeared in The Atlantic in July ‘23, and Gino went ahead and sued the accusers. The saga is ongoing.

As usual, there are a few books on the bookshelf for which I haven’t had the time yet. I hope to get to at least one of them this week and I’m open to recommendations 🙂

On my new book about modelling

Posted on December 22, 2023 by keet

It was published last month by Springer: “The what and how of modelling information and knowledge: from mind maps to ontologies”. The book’s three character-limited unique selling points are that it “introduces models and modelling processes to improve analytical skills and precision; describes and compares five modelling approaches: mind maps, models in biology, conceptual data models, ontologies, and ontology; aims at readers looking for a digestible introduction to information modelling and knowledge representation”. The softcover hardcopy and the eBook are available from Springer, Springer professional, many national and international online retailers (e.g., Amazon), as well as university libraries, and hopefully soon in the ‘science’ section of select bookstores.

There’s also a back flap blurb with the book’s motivations and aims, and intended readership. The remainder of this post are informal comments on it.

From my side as author and having read many popular science books on a wide range of topics, I wanted to write a popular science book too, but then about modelling. Modelling for the masses, as it were, or at least something that is comparatively easily readable for professionals who don’t have a computing background and who haven’t had, or had very little, training in modelling, yet who can greatly benefit from doing so. And to some extent also for computing and IT professionals who’d like a refresher on information modelling or a concise introduction to ontologies but don’t want to (re-)open their textbook tomes from college. Modelling doesn’t lend itself well to juicy world-changing discoveries the same way that vaccines and fungi can be themes for page-turners, but a few tales and juicy details do exist.

Then next consideration was about which aspects of modelling to include and what sort of popular science book to aim for. I distinguished four types of popular science books based on my prior readings, ranging from ‘entertaining layperson’ level holiday reading to ‘advanced interested layperson’ level where having at least a Bachelors in that field or a Master’s degree in an adjacent field may be needed to make it through the tiny-font book. I have no experience writing humour, and modelling is a rather dry topic compared to laugh-out-loud musings and investigations into stupidity, drunkenness, or elephants on acid—that entertainment can be found here, here, and here—so that was easily excluded. I’ve already tried out advanced texts tailored to specialists, in the form of an award-winning postgraduate textbook on ontology engineering, and wasn’t in the mood for writing another such book at the time when I was exploring ideas, which was around late 2021 and early 2022. I think this modelling book ended up between the two extremes regarding the amount of content, difficulty, and readability.

And so, I chose the tone of writing to be in so-called ‘casual writing’ style to make it more readable, there are a few anecdotes to enliven the text as is customary for popular science books, and the first three chapters are relatively easy in content compared to later chapters. The difficulty level of the chapters’ contents is turned up a notch each chapter going from Chapters 2 to 6 when we’re moving onwards with the journey passing by the five types of models covered in the book. Each successive chapter solves modelling limitations from the preceding chapter, and so it gets more challenging at least up to Chapter 5 (ontologies). Whether a reader finds Chapter 6 on Ontology (philosophy) even harder, depends on their background, because in other ways it is easier than ontologies because we can set aside certain interfering practicalities.

Chapter 7 mixes easier use cases with theoretically more abstract sections when we’re putting things together, reflect on Chapters 2-6, and look ahead. There’s no avoiding a little challenge. But then, we read non-fiction/science/tech books to learn from it and learning requires some effort.

Aside from the reader learning from reading the book, an author is supposed to gain new insights from writing it. And so did I. Moreover, upfront when planning the book, I tried to make sure I likely would. I mention a few salient points in the preface and I’ll select two for this blog post: the cladograms (Section 3.2.1) and the task-based evaluation (Section 7.1.2.2).

Diagrams/models in biology are sometimes ridiculed as “cartoons” by non-biologists. Cladograms would be the xkcd version of it, visually. I already knew that there are common practices, recurring icons, and rules governing the biological models drawn as diagrams. Digging deeper to find more diagrams with rules governing their notation, cladograms came up. They visualise key aspects of the scientific theory of evolution. Conversely, drawing an evolutionary diagram that doesn’t adhere to those rules then amounts to misunderstanding evolution. I think the case deserves more attention, especially because a bunch of school textbooks have been shown to have errors, and there’s room for improvement designing cladogram drawing software. Maybe clarifying matters and being more precise with such models helps resolve some debates on the topic as well.

The motivation for the task-based evaluation is easy to argue for in theory — actually doing it offered a deeper understanding, and writing the book spurred me to do so. One of my claims in the beginning of the book is that with better modelling—better than mind maps, not better mind maps—one learns more. The task-based evaluation is precisely about that. We take one page from a textbook and try to create a model of it, one for each type of model covered in the book. It demonstrates in a clear and straightforward way — assisted by Bloom’s taxonomy if you so fancy — why developing an ontology is much harder than developing a mind map or a conceptual data model, and in what way designing a conceptual data model of that textbook page is better for learning the content than creating a mind map of it.

There were more joys of writing the book. Like that the running example—dance—was also good for some additional interesting paper reading beyond what I already had read and engaged with in various projects. (There are also other subject domains in the examples and illustrations, such as fermentation, peace, labour law, and stuff, and a separate post will be dedicated to more content of the book.)

To jump the gun on questions like “why didn’t you include my preferred type of model or my language, being [DSL x/KG y/BPM z/etc.]?”: the point I wanted to make with this book was made with these five types of models and this was the shortest coherent story arc with which I could do it. The DSLs/KGs/BPMs/etc are not less worthy, but they would have caused the number of pages to explode without adding to the argument. As consolation, perhaps: knowledge graphs (KGs) are likely to appear in a v2 of my ontology engineering textbook and BPM likely will be linked to the TREND temporal conceptual data modelling language, but that’s future music.

Last, I’ve created a web page for the book, which collates information about the book, such as direct links where to buy it, media coverage and links to recent related blog posts (e.g., this one is a spin-off [with an add-on] of an early draft of section 6.3 and that one of a draft of section 7.3), and has extra supplementary material, including a longer illustration of a conceptual model design procedure using a prospective dance school database as example. Feedback is welcome!

An illustration of an “ERDP” to create an EER diagram: the dance school database

Posted on October 8, 2023 by keet

How to develop a conceptual data model, such as an EER diagram, UML Class Diagram, or ORM model? Besides dropping icons here and there on an empty canvas, a few strategies exist for approaching it systematically or at least in an assisted way, be it for ‘small data’ or for ‘big data’. One of them that I found useful to experiment with when I started out many years ago with the ‘small data’ cases, was the Conceptual Schema Design Procedure (CSDP) for ORM, as summarised in Table 1 below. It is summarised in that whitepaper and its details span a few hundred pages in Terry Halpin’s books [Halpin01], which was further extended in his later works. Extended Entity-Relationship modelling is more popular than Object-Role Modeling, however, and yet there’s no such CSDP for it. The elements don’t have the same name and the list of possible constraints to take into account are not the same in both families of languages either [KeetFillottrani15]. So, I amended it to make it work for EER.

Table 1. CSDP as summarised by Halpin in the white paper about Object-Role Modeling.

Step	Description
1	Transform familiar information examples into elementary facts, and apply quality checks
2	Draw the fact types, and apply a population check
3	Check for entity types that should be combined, and note any arithmetic derivations
4	Add uniqueness constraints, and check arity of fact types
5	Add mandatory role constraints, and check for logical derivations
6	Add value, set comparison and subtyping constraints
7	Add other constraints and perform final checks

Unsurprisingly, yes, it is feasible to rework the CSDP for ORM to also be of use for designing EER diagrams, in an “ERDP”, ER Design Procedure, if you will. A basic first version is described in Chapter 4 of my new book that is currently in print with Springer [Keet23] (and available for pre-order from multiple online retailers already). I padded the CSDP-like procedure of the example a bit on both ends. There’s an optional preceding ‘step 0’ to explore the domain to prepare for a client meeting. Steps 1-7 are summarised in Table 2: listing the sample facts, drawing the core elements, and then adding constraints: cardinality, mandatory/optional participation, value, disjointness and completeness. Step 7 mostly amounts to adding nothing more, since EER has fewer constraints than ORM. Later steps may include quality improvements and various additions that some, but not all, EER variants have.

Table 2. Revised basic CSDP for EER diagrams.

Step	Description
0	Universe of discourse (subject domain) exploration
1	Transform familiar or provided sample examples into elementary facts, and apply quality checks
2	Draw the entity types, relationships, and attributes
3	Check for entity types that should be combined or generalised
4	Add cardinality constraints, and check arity of fact types
5	Add mandatory/optional constraints
6	Add value constraints and subtyping constraints
7	Add any other constraints of the EER variant used and perform final checks

The book’s chapter on conceptual data models also includes an example of the size that fits neatly when taking into account the page numbers and the rest of the content. As bonus material, I made a longer example now available on this page, which is about developing an EER diagram for a database to manage data for a dance school.

Picture of dancing the Ball de pastors del pirineo — Picture of our group dancing the “Ball de pastors del pirineo”.

I did go through a ‘step 0’ to explore the subject domain to explore my knowledge of dance schools, which was facilitated by having been member of several dance schools over the years. The example then goes through the 7-step procedure. All this gets us from devising facts such as

Member with name Lauren takes class #sem1in2023B1.

Class #sem1in2023B1 involves Ballroom genre at level Beginner in semester 1 2023.

in a step-wise fashion with intermediate partial models to the final one, in Information Engineering notation, as shown in the following image:

*Figure 1. The final EER diagram at the end of “step 6” of the procedure.*

The dance school model description also hints at what lies beyond step 7, such as automated reasoning and ontology-driven aspects (not included in this basic version), and the page has a few notes on notations. I used IE notation because I really like the visuals of the crow’s feet for cardinality, but there’s a snag and some textbooks use Chen’s or a ‘Chen-like’ notation. Therefore, I added those variants on the page near the end.

Are the resulting models any better with such a basic procedure than without? I don’t know; it has never been tested. We have around 450 students who will have to learn EER in the first semester of their second year in computer science, so there may be plenty of participants for an experiment to make the conclusions more convincing. If you’re interested in teaming up for the research to find out, feel free to email me.

References

[Halpin01] Halpin, T. Information Modeling and Relational Databases. San Francisco: Morgan Kaufmann Publishers. 2001.

[KeetFillottrani15] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[Keet23] Keet, C.M. The What and How of Modelling Information and Knowledge: From Mind Maps to Ontologies. Springer, in press. ISBN-10: 3031396944; ISBN-13: 978-3031396946.

Social impact issues with LLMs – a brief write-up of my list from the SIGdial’23 panel

Posted on September 17, 2023 by keet

The SIGdial 2023 organisers wanted a panel at the jointly held SIGdial 2023 and INLG 2023 conferences in Prague that took place last week. Svetlana Stoyanchev, as PC Chair in charge of it, proposed “Social impact of LLMs”. It was to follow the keynote talk by Ryan Lowe of OpenAI, the company behind the popular ChatGPT and also Whisper for speech, and he also would participate. I ended up in the panel as well (coming from the NLG angle of the matter), as did Ehud Reiter from the University of Aberdeen (UK) and Malihe Alikhani from Northeastern University (USA), with David Traum from the University of Southern California (USA) as moderator.

There was to be a 3-5 minutes opening statement by each panel member, which I had duly prepared for, but that did not happen. What happened first, was an unassuming 1-liner with name, affiliation, and area of specialisation. It then proceeded with questions the likes of “can you provide your view on how LLMs benefit society?”, “What is more important: factualness or fluency?”, and “which ethical concerns about LLMs are overstated?”.

I didn’t get on the panel for that sort of stuff. I was cajoled into saying ‘yes’ to the invitation because I already had compiled a partial list of social issues with LLMs. I’m teaching a module on “social issues and professional practice” to first-years in computer science at UCT and touched upon it in late July and early August at the start of the semester and I had mentioned some of it at a research ethics workshop at UCT. Note that ‘issues’ can be interesting and provide ideas for new research projects, provided they’re not inherent limitations of the theory, method, technique, tool, or practice.

As preparation for the panel, I tried to structure the list into a taxonomy of sorts to try to maximise information density in the short time I thought I would have. So when the moderator opened the floor for questions from the audience and no-one queued up instantly, I jumped in the gap. It might help to get the audience into action, too, or so I thought. And someone had to state the unpleasantries and challenges. So here’s that taxonomy-like list of social issues I managed to mention (in a nutshell and still incomplete):

1. In creation of LLMs

1.1 Resource usage (sensu climate change issues):

1.1.1 Electricity use for the computations training the LLMs;

1.1.2 Water use, used in data centre cooling where the computation takes place.

1.2 Exacerbating disparities, in that the less well off can’t compete with the rich corporations in The North and end up crowded out and as consumers only (and possibly also some colonialism, as noted by the speech researchers on Maori w.r.t. OpenAI’s Whisper).

1.3 Data (text) collection, notably regarding:

1.3.1 IP/copyright issues of the text ingested to generate the LLM;

1.3.2 The lack of trust (or the angst) on what data went and go into the LLMs (the ‘could be your emails, googledocs, sharepoint files etc.’), that no-one was asked whether they consented to their content being grabbed for that purpose, and when some would have disapproved of inclusion if they could, there’s the powerlessness in that it seems one neither can opt out nor verify if one’s text was excluded if opt-out were to be possible.

1.4 Psychological harm done unto the ‘garbage collectors’, such as the Kenyans in clickfarms, who are the manual labourers hired to remove the harmful content so that the system’s responses are clean and polite.

2. In content of LLMs

2.1 Bias, amplifying the bias in the source text the LLM trained on and that may be undesirable (e.g., gender).

2.2 Cultural imperialism:

2.2.1 Coverage/performance disparities. The LLM has ingested more from one region than another, so its output may not be relevant to the locale (say, to people in the RSA) or culture where it is used but rather output something that is applicable to people in the USA as if that were valid for the whole world;

2.2.2 Language. Whose language does it use in the interaction? On pushing out language varieties and dialects that are less well-represented in the training dataset, reducing diversity in expression, and steering towards homogenization.

3. In use of LLMs

3.1 Work:

3.1.1 It creates more work without getting extra resources for it;at least so far it has created more work for, among others, us lecturers than it purportedly would save (as if we didn’t have enough to do already);

3.1.2 It puts people out of jobs; this is for many a novel computing technique and should be managed but isn’t.

3.2 Information-seeking behaviour affecting democracy. The ‘one answer’ versus equally easy accessible answer options to assess multiple sources as part of information-seeking in democratic discourse, which is problematic due to fabrications (‘hallucinations’) and being fickle in property (content) dropping and an LLM may be amenable to manipulation for use as a propaganda machine.

3.3 Learning avoidance. There’s a difference between using LLMs as time-saver when one has the skill versus skipping learning competencies at school and university, such as writing and summarisation of course material when learning a subject.

3.x [there surely is more but I didn’t even have enough time to elaborate on item 3.3 already.]

The list in my lecture and workshop slides also included issues with misinformation, disinformation, privacy, and the unclear culpability attribution when there are bugs in the code it generates, which I hadn’t gotten around to include due to time constraints.

I can very well imagine the list will change, not only ending up longer, but also that more research may solve some of the issues so they can be removed. For instance, currently, language varieties descend into getting mixed onto one cocktail (they also did when David Traum tried with several Englishes) but it’s an interesting research question how one can (re)train an LLM to detect them in the training corpus and output it correctly, be this for written text or speech. It does not sound like an insurmountable problem to solve. Fear may be addressed with openness and education; policies might address some others.

*Rotating Kafka head/disco ball in the city centre of Prague. (Source: I took it a few days before the INLG’23 conference)*

While I was quickly going through my list, one attendee had walked over to the microphone and so I ended it at item 3.3. The question was about the impact of LLMs on the research community. The panel was called closed soon thereafter and lively comments followed when we all strolled into the conference welcome reception that took place at the same venue. I was pleased to hear those comments. More public debate in the panel session, however, would have been better for everyone compared to relegating it to the reception. Whether the muted response during the panel session was due to it having been a long day already—a great keynote talk by Emmanuel Dupoux, two long-paper sessions with interesting research, a poster session, and Ryan’s keynote—or due to it being recorded or for some other reason, I don’t know. Perhaps it is also up for debate whether it was wise to speak up. But no-one saying anything about some of the challenges with the social impact of LLMs in society was, in my view, not an acceptable option either.

To close off this blog post, I must note that there are more lists on social issues with LLMs and there’s quite some overlap between those resources and the taxonomy-like list described above. Among others: I can suggest you read this or that paper, or, if you’re short on time, have a look here or here that all have more explanatory text and references than this blog post.

CoSMo: a content selection modelling language

Posted on August 20, 2023 by keet

How do you specify declaring content-of-interest or selecting content from a database or RDF triple store with the aim to convert that data into natural language text? Where that content may be either individual objects or classes or both, the relations that hold between them, and even functions to compute stuff from it? To be able to store such a specification and build larger ones from smaller ones? And that by possibly anyone, in their own language if they so wish? They are non-trivial broad requirements for, at least, the so-called “abstract representation” language for Abstract Wikipedia [Vrandecic21], with the data store being Wikidata (an RDF triple store that currently runs on Blazegraph). Examining related work, it turned out that no such language exists, so we set out to develop one.

Fortunately, we had an idea about how to go about designing a modelling language [FillottraniKeet21], but theory is not the same as praxis, and it hadn’t been tested with a real use case just yet, and a different type of modelling at that. One might as well try to kill two birds with one stone, and that is indeed what we did. The outcome is described in an extended technical report [ArrietaEtal23], which is, admittedly, still quite research-oriented, although we tried to explain things more than we normally do in scientific publications. We dubbed our language CoSMo, from Content Selection Modeling language. It makes one component of that magic box called “article generation”, as part of the overall orchestration of Abstract Wikipedia components, clear and precise.

CoSMo's intended place in the high-level outline of Abstract Wikipedia, at step 1 of the NLG pipeline. (It can also be used outside the AW context). — *CoSMo’s intended place in the high-level outline of Abstract Wikipedia. (It can also be used* *outside the AW context).*

The remainder of this blog post contains a quick non-technical digest of that 32-page report.

The preparations: figuring out the requirements

In the Abstract Wikipedia NLG team that focuses on the natural language generation (NLG) specifically, we all knew we needed something between the content of Wikidata and the natural language text as output. The content of that text would be represented abstractly, in a structured fashion, i.e., the so-called “abstract representation” of that, which would be written in self-contained pieces, called “constructors”. So far so good. Then there came had several stakeholder meetings with the Abstract Wikipedia team and other interested Wikipedians in September and October 2022. We discovered there were wildly different assumption about the constructors, what they were for, and what one should be able to model with them.

Taking a step back, we all devised examples, and two proof-of-concepts were experimented with (Nina/Udiron by Mahir Morshed and a Scribunto-based implementation by Ariel Gutman). From there, it was possible to extract requirements for language design. Some of this was written up in a discussion document and put online late November 2022. It did not receive any feedback. More examples and implicit requirements were collected, including more ‘reasoning backwards’ from sample desired output to what data would need to be in Wikidata to be able to get such as sentence out of it, to help inform what a constructor needs to have that would help bridge the two extremes. Various features were elaborated on and discussed in an Abstract Wikipedia open meeting of the NLG group (recorded). It was February 2023 already.

It had become clear that not everything could reasonably be done all at once, nor would one want to, so Pablo Fillottrani, Kutz Arrieta, and I made the decision to take one step at a time: the first step in the natural language generation pipeline is content selection, not also all sorts of linguistic structures that one can find later on in the generation pipeline. From these preliminaries, we selected a list of design principles and features we though made most sense (see section 3.2 of the technical report for details), and moved on to the next step.

Defining a language

With the parameters and requirements in hand, it should have been straightforward to define the content selection language. It wasn’t. Sometimes, a requirement can be met in more than one way and we also wanted a graphical counterpart to any textual version, which also caused a few iterations to make a constructor look pretty and easily readable. We tried to reuse existing icons as much as possible, so that then at least a segment of users may already be familiar with them. For instance, the diamond-shape from UML for composition of smaller constructors into larger ones. The report’s tables 1-3 list the icons for the elements, connectors and constraints, respectively, their respective longform notation when it’s rendered in English, and a usage comment. Here are the elements:

The syntax needed a semantics, which it got (see also Table 4 of the report). Working on the details and example diagrams, we admitted it did need a grammar as well, to make sure that constructors are specified in the way they’re supposed to, and to make the step towards an implementation easier. Those details can be found in Appendix A near the end of the report.

Behind the pretty graphics and longform renderings, it looks uninviting and not intended for human consumption, as is the case for most of Wikidata and Wikifunctions, due to the aim for natural language independence, or at least some form of multilingualism. Here’s an example of the not-for-humans specification of a constructor, where the CSMxxx numbers are elements in CoSMo, and Q and P items exist in Wikidata:

CSM007:C5(                    
   CSM003(P106(r1,r2)),
   r1:CSM002(Q5),
   r2:CSM002(Q18844224),
   CSM003(P136(r3,r4)),
   r3:Q18844224,
   r4:Q24925,
   CSM002(Q5)={Q42})

In a text-based user interface, the identifiers are to be replaced with their respective labels in the chosen natural language. In a graphical user interface, the CSMxxx elements are to be rendered graphically. An example of a diagram that illustrates a case where three constructors are combined is shown next:

The CSMxxx identifiers were mapped to Wikidata items where possible at present, and from there we obtain the human-readable names, which are listed in the technical report’s Table 5 for English, Spanish, and Basque.

Evaluation

There are several strategies to determine whether the language is any good. The first step is the check whether it meets the requirements it is supposed to meet. CoSMo does (see section 4.2). The second step is the ‘can one create a constructor?’ (without too much effort) and, pushing that further: ‘is it expressive enough to represent the examples we started out with?’. The answer to the first question is a resounding Yes. Further to the previous image, here’s one of those in CoSMo longform textual notation where the elements are rendered in English and in Spanish (Appendix C has more examples):

The answer to the second question is trickier. In the very strict sense, the answer is No, which doesn’t sound good on first reading. The first reason why it’s not as bad as it sounds, is that some examples from last year mixed content and natural language features, which was precisely what we needed to disentangle to make constructors manageable, usable, and reusable. The second reason was that some examples were making assumptions about Wikidata content but where it was represented differently or it should be possible to delegate the task to the recently launched Wikifunctions. Here’s a section of the San Francisco example from [Vrandecic21] in CoSMo graphical notation, which contains a function for computing the ranking on-the-fly rather than hard-coding the ranking (the Z item number is made up and such a function is still to be added to Wikifunctions that launched just this month).

Finally, we conducted a feature comparison of CoSMo against similar-looking modelling languages, and, indeed, it is one of a kind.

In closing

Is this all there is and can a system generate natural language text with it already? No, sorry, things don’t go that fast when only limited resources are at one’s disposal. A few other parts of the natural language generation pipeline are available for different NLG approaches and CoSMo fills a gap in the procedure. Not only that, CoSMo can just as well be used outside Abstract Wikipedia and even when no natural language generation is in sight. From a data store design perspective, CoSMo offers an abstract representation, i.e., a modelling language, of a data store view specification, alike for SQL VIEW statements, but then regardless whether that is for a database, an RDF triple store, or even some JSON thing that tries to pass for a database. Or, from a graph algorithm viewpoint: CoSMo makes defining subgraphs (among other things) more accessible to a broader user base.

Last, but not least: no, LLMs will not fix it for you. There are a few technical hick-ups at present (read a discussion of LLM-to-SPARQL), but even if/when they are solved: SPARQL itself does not meet all the requirements for the constructor specification language, hence, an LLM-to-SPARQL app does not either. (see Table 6 and compare the WONDER, VSB and RDF Explorer columns with CoSMo for details.)

Finally, CoSMo is a concrete proposal – admittedly, one that we hope will be adopted by the Abstract Wikipedia project – and we look forward to feedback!

References

[ArrietaEtal23] Arrieta, K., Fillottrani, P.R., Keet, C.M. CoSMo: A constructor specification language for Abstract Wikipedia’s content selection process. Technical Report, Arxiv.org, number 2308.02539. 32p. 1 August 2023.

[FillottraniKeet21] Fillottrani, P.R., Keet, C.M. Evidence-based lean conceptual data modelling languages. Journal of Computer Science and Technology, 2021, 2021, 21(2): 93-111.

[Vrandecic21] Vrandečić, D. Building a multilingual Wikipedia. Communications of the ACM, 2021, 64(4), 38-41.

Keet blog

research and teaching, with some relevance for society

Can a pizza have fruit as topping?

A review on logics for conceptual data modelling

Notes on the ontology of International Standard Book Numbers

On comparing models

Background readings to the “Melokuhle – good things” short story

A few book reviews for 2023

On my new book about modelling

An illustration of an “ERDP” to create an EER diagram: the dance school database

Social impact issues with LLMs – a brief write-up of my list from the SIGdial’23 panel

CoSMo: a content selection modelling language

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this: