# Reblogging 2006: Figuring out requirements for automated reasoning services for formal bio-ontologies

From the “10 years of keetblog – reblogging: 2006”: a preliminary post that led to the OWLED 2007 paper I co-authored with Marco Roos and Scott Marshall, when I was still predominantly into bio-ontologies and biological databases. The paper received quite a few citations, and a good ‘harvest’ from both OWLED’07 and co-located DL’07 participants on how those requirements may be met (described here). The original post: Figuring out requirements for automated reasoning services for formal bio-ontologies, from Dec 27, 2006.

What does the user want? There is a whole sub-discipline on requirements engineering, where researchers look into methodologies how one best can extract the users’ desires for a software application and organize the requirements according to type and priority. But what to do when users – in this case biologists and (mostly non-formal) bio-ontologies developers – neither do know clearly themselves what they want nor what type of automated reasoning is already ‘on offer’. Here, I’m making a start by briefly listing informally some of the desires & usages that I came across in the literature, picked up from conversations and further probing to disambiguate the (for a logician) vague descriptions, or bumped into myself; they are summarised at the end of this blog entry and update (d.d. 5-5-’07) described more comprehensively in [0].

Feel free to add your wishes & demands; it may even be fed back into current research like [1] or be supported already after all. (An alternative approach is describing ‘scenarios’ from which one can try to extract the required reasoning tasks; if you want to, you can add those as well.)

I. A fairly obvious use of automated reasoners such as Racer, Pellet and FaCT++ with ontologies is to let the software find errors (inconsistencies) in the representation of the knowledge or reality. This is particularly useful to ensure no ‘contradictory’ information remains in the ontology when an ontology gets too big for one person to comprehend and multiple people update an ontology. Also, it tends to facilitate learning how to formally represent something. Hence, the usage is to support the ontology development process.

But this is just the beginning: having a formal ontology gives you other nice options, or at least that is the dangling carrot on front of the developer’s nose.

II. One demonstration of the advantages of having a formal ontology, thus not merely a promise, is the classification of protein phosphatases by Wolstencroft et al. [9], where also some modest results were obtained in discovering novel information about those phosphatases that was entailed in the extant information but hitherto unknown. Bandini and Mosca [2] pushed a closely related idea one step further in another direction. To constrain the search space of candidate rubber molecules for tire production, they defined the constraints (properties) all types of molecules for tires must satisfy in the TBox, treated each candidate molecule as an instance in the ABox, and performed model checking on the knowledgebase: each instance inconsistent w.r.t. the TBox was thrown out of the pool of candidate-molecules. Overall, the formal representation with model checking achieved a considerable reduction in resource usage of the system and reduced the amount of costly wet-lab research. Hence, the usages are classification and model checking.[i]

III. Whereas the former includes usage of particular instances for the reasoning scenarios, another on is to stay at the type level and, in particular, relations between the types (classes in the class hierarchy in Protégé). In short, some users want to discover new or missing relations. What type of relation is not always exactly clear, but I assume for now that any non-isA relation would do. For instance, Roos et al. [8] would like to do that for the subject domain of histones; with or without instance-level data. The former, using instance-level data, resembles the reverse engineering option in VisioModeler, which takes a physical database schema and the data stored in the tables and computes the likely entities, relations, and constraints at the level of a conceptual model (in casu, ORM). Mungall [7] wants to “Check for inconsistent or missing relationships” and “Automatically assign relationships for new terms”. How can one find what is not there but ought to be in the ontology? An automated reasoner is not an oracle. I will split up this topic into two aspects. First, one can derive relations among types, meaning that some ontology developer has declared several types, relations, other properties, but not everything. The reasoner then, takes the declared knowledge and can return relations that are logically implied by the formal ontology. From a user perspective, such a derived relation may be perceived as a ‘new’ or ‘missing’ relation – but it did not fall out of the sky because the relation was already entailed in the ontology (or: you did not realize you knew it already). Second, another notion of ‘missing relations’: e.g. there are 17 types of macrophages (types of cell) in the FMA, which must be part of, contained in, or located in something. If you query the FMA through OQAFMA, it gives as answer that the hepatic macrophage is part of the liver [5]. An informed user knows it cannot be the case that the other macrophages are not part of anything. Then, the ontology developer may want to fill this gap – adding the ‘missing’ relations – by further developing those cell-level sections of the ontology. Note that the reasoner will not second-guess you by asking “do you want more things there?”; it uses the Open World Assumption, i.e. that there always may be more than actually represented on the ontology (and absence of some piece of information is not negation of that piece). Thus, the requirements are to have some way of dealing with gaps’ in an ontology, to support computing derived relations entailed in a logical theory, and, third, deriving type-level relations based on instance-level data. The second one is already supported, the first one only with intervention by an informed user, and the third one might, to some extent.

Now three shorter points, either because there is even less material or there is too much to stuff it in this blog entry.

IV. A ‘this would be nice’ suggestion from Christina Hettne, among others, concerns the desire to compare pathways, which, in its simplest form, amounts to checking for sub-graph isomorphisms. More generally, one could – or should be able to – treat an ontology as a scientific theory [6] and compare competing explanations of some natural phenomenon (provided both are represented formally). Thus, we have a requirement for comparison of two ontologies, not with the aim of doing meaning negotiation and/or merging them, but where the discrepancies themselves are the fun part. This indicates that dealing with ‘errors’ that a reasoner spits out could use an upgrade toward user-friendliness.

V. Reasoning with parthood and parthood-like relations in bio-ontologies are on a par with importance of the subsumption relation. Donnelly [3] and Keet [4], among many, would like to use parthood and parthood-like relations for reasoning, covering more than transitivity alone. Generalizing a bit, we have another requirement: reasoning with properties (relations) and hierarchies of relations, focusing first on the part-whole relation. What reasoning services are required exactly, be it for parthood or any other relation, deserves an entry on its own.

VI. And whatnot? For instance, linking up different ontologies that each reside at their own level of granularity, yet have enabled to perform ‘granular cross-ontology queries’, or infer locations of diseases based on combining an anatomy ontology with a disease taxonomy, hence, reasoning over linked ontologies. This needs to be written down in more detail, and may be covered at least partially with point two in item III.

Summarizing, we have to following requirements for automated reasoning services, in random order w.r.t. importance:

• Support in the ontology development process;
• Classification;
• Model checking;
• Finding ‘gaps’ in the content of an ontology;
• Computing derived relations at the type level;
• Deriving type-level relations from instance-level data;
• Comparison of two ontologies ([logical] theories);
• Reasoning with a plethora of parthood and parthood-like relations;
• Using (including finding inconsistencies in) a hierarchy of relations in conjunction with the class hierarchy;
• Reasoning across linked ontologies;

I doubt this is an exhaustive list, and expect to add more requirements & desires soon. They also have to be specified more precisely than explained briefly above and the solutions to meet these requirements need to be elaborated upon as well.

[0] Keet, C.M., Roos, M., Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS.

[1] European FP6 FET Project “Thinking ONtologiES (TONES)”. (UDATE 29-7-2015: URL defunct by now)

[2] Bandini, S., Mosca, A. Mereological knowledge representation for the chemical formulation. 2nd Workshop on Formal Ontologies Meets Industry 2006 (FOMI2006), 14-15 December 2006, Trento, Italy. pp55-69.

[3] Donnelly, M., Bittner, T. and Rosse, C. A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies. Artificial Intelligence in Medicine, 2006, 36(1):1-27.

[4]
Keet, C.M. Part-whole relations in Object-Role Models. 2nd International Workshop on Object-Role Modelling (ORM 2006), Montpellier, France, Nov 2-3, 2006. In: OTM Workshops 2006. Meersman, R., Tari, Z., Herrero., P. et al. (Eds.), LNCS 4278. Berlin: Springer-Verlag, 2006. pp1116-1127.

[5]
Keet, C.M. Granular information retrieval from the Gene Ontology and from the Foundational Model of Anatomy with OQAFMA. KRDB Research Centre Technical Report KRDB06-1, Free University of Bozen-Bolzano, 6 April 2006. 19p.

[6] Keet, C.M.
Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS2005), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics 3615, Springer Verlag, 2005. pp46-62.

[7] Mungall, C.J. Obol: integrating language and meaning in bio-ontologies. Comparative and Functional Genomics, 2004, 5(6-7):509-520. (UPDATE: link rot as well; a freely accessible veriosn is available at: http://berkeleybop.org/~cjm/obol/doc/Mungall_CFG_2004.pdf)

[8] Roos, M., Rauwerda, H., Marshall, M.S., Post, L., Inda, M., Henkel, C., Breit, T. Towards a virtual laboratory for integrative bioinformatics research. CSBio Reader: Extended abstracts of “CS & IT with/for Biology” Seminar Series 2005. Free University of Bozen-Bolzano, 2005. pp18-25.

[9]
Wolstencroft, K., Lord, P., Tabernero, L., Brass, A., Stevens, R. Using ontology reasoning to classify protein phosphatases [abstract]. 8th Annual Bio-Ontologies Meeting; 2005 24 June; Detroit, United States of America.

[i] Observe another aspect regarding model checking where the automated reasoner checks if the theory is satisfiable, or: given your ontology, if there is/can be a combination of instances such that all the declared knowledge in the ontology holds (is true), which is called a model’ (as well, like so many things). That an ontology is satisfiable does not imply it only has models as intended by you, i.e. there is a difference between ‘all models’ and ‘all intended models’. If an ontology is satisfiable it means that it is logically possible that each type can be instantiated without running into inconsistencies; it neither demonstrates that one can indeed find in reality the real-world versions of those represented entities nor if there is one-and-only-one model that actually matches exactly the data you may want to have linked to & checked against the ontology.

Advertisements

# Reblogging 2006: “We are what we repeatedly do…

This is the 10th year of my blog, which started off as a little experiment and ‘seeing where it ends up’. In numbers, there are over 200 posts and I estimate that in September, the blog will clock its 100,000th visitor. I had a look at the list of posts, and I’ll reblog about 2 blog posts from each year, trying to pick one ‘general’ topic and one about my research that will also note some follow-ups that happened after the post. I’ve selected them ignoring the ratings or visits of the posts, as I still haven’t figured out why some posts get lots of hits whilst others don’t; shouldn’t you all want to know about changes in the ingredients of people’s meals or strive to be a nonviolent person, rather than solving a problem on rearranging luggage in an airport carousel or looking into money-making academia.edu or self-indulgence on mapmaking showing all and sundry the countries you visited? Anyway, this is the first installment of it.

From the “10 years of keetblog – reblogging: 2006” (June 11, 2006): A summary on what to do (repeatedly) to become a good researcher.

…excellence, then, is not an act but a habit”, Aristotle has said. Being an excellent researcher then amounts to habitually doing excellent research. A prerequisite of doing excellent research is to do research effectively. Even the famed, and idealized, eureka! moment scientists occasionally (are supposed to) have is based on sound foundations acquired through searching, reading, researching, thinking, testing, and integrating new scientific developments with extant knowledge already accumulated. But how to get there? I don’t know – I’m only studying to become an excellent scientist.
Besides the aforementioned list of activities, I occasionally browse the Internet and procrastinate by reading how to write a thesis, improve the English grammar and word use, plan activities to avoid unemployment, the PhD comic, and more of those type of suggestions that don’t help me with the topic itself (granularity) but these topics are about how to do things.

Serendipity, perhaps, it was that brought me to an essay by Michael Nielsen, entitled “Principles of effective research” [1]. I summarise it here briefly, but it would be better if you read the 12 pages in full.

The first section is about integrating research into the rest of your life. So, unlike narratives and jokes that tell you its normal to not have a life as a PhD student or researcher, this would be the wrong direction to go or stage to be at. Be fit, have fun.

The principles of personal behaviour to achieve effective research are proactivity, vision, and self-discipline. Don’t abdicate responsibility, and be accountable to other people. Vision does not apply to where you think your research field will be in 20 years, but where do you want to be then, what sort of researcher do you want to be, which areas are you interested in (etc)? Have clear for yourself what you want to achieve, why, and how.

Regarding the research itself, self-development and the creative process are important. But focusing on self-development only is not ok, because then one fails to make a contribution to the community which viability and success depends on input from scientists (among others). On the other hand, keeping on organizing workshops, conferences, doing reviewing etc leaves little time for the self-development and creative process of doing research to make scientific contributions. That is, one should strive for a balance of the two.
Self-development includes developing research strengths, your ‘niche’ with a unique combination of abilities to get a comparative advantage. Emerging disciplines, mostly interdisciplinary, are a nice mess to sort out, for instance. Then, read the 10 seminal contributions in the other field as opposed to skimming several hundred mediocre articles that are fillers of the conferences or journals. (This doesn’t sound particularly friendly, but if I take bioinformatics or the ontologies hype as examples, there are quite a lot of articles that don’t add new ideas [but more descriptions of yet more software tools] and interdisciplinary articles are known to be not easy to review, hence more articles with confused content fall through the cracks and make it into archived publications.) A high-quality research environment helps.
Concerning the creative process, this depends on if you’re a problem-solver or a problem-creator, with each requiring specific skills. The former generally receives more attention, because there are so many things unknown and then figuring out how/what/why it works gives sought-for answers, technologies, or methodologies. Problem-creators, on the other hand, generate new work so to speak; by asking interesting questions, reformulating old nigh unsolvable problems in a new way, or showing connections nobody has thought of before. Read Nielsen’s article for details on the suggested skills set for each type.

Wishing you good luck with all this, then, is inappropriate, as luck does not seem to have much to do with becoming an effective researcher. So, go forth, improve your habits, and reap what you sow.

[1] Nielsen, M.A. Principles of effective research. July 27th , 2004. UPDATE 22-7-2015: this is the new URL: http://michaelnielsen.org/blog/principles-of-effective-research/

# Wikipedia + open access = not quite a revolution (not yet at least)

The title of the arxiv blog post sounded so catchy and wishful thinking into a high truthlikeness: “Why Wikipedia + open access = revolution”, summarizing and expanding on arxiv.org/abs/1506.07608 with the title “Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science.” [1], with some quotes:

“The odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to closed access journals,” say Teplitskiy and co.

Open access publishing has changed the way scientists communicate with each other but Teplitskiy and buddies have now shown that its influence is much more significant. “Our research suggests that open access policies have a tremendous impact on the diffusion of science to the broader general public through an intermediary like Wikipedia,” says Teplitskiy and co.

It means that open access publications are dramatically amplifying the way science diffuses through the world and ultimately changing the way we understand the universe around us.

I sooo want to believe. And, honestly, when I search for something and Wikipedia is the first hit and I do click, it does seem to give a decent introductory overview of something I know little about so that I can make a better start for searching the real sources. I never bothered to look up my own areas of specialisation, other than when a co-author mentioned there was (she put?—I can’t recall) a reference to her tool in Wikipedia some time ago. But there’s that nagging comment to the technologyreview blog post saying the same thing, and adding that when s/he looked up his/her own field, s/he

“then realized that in my own field, my main reaction was to want to scream at the cherry picking of sources to promote some minor researcher.”

So, I looked up “ontology engineering” and “Ontologies” that redirected to “Ontology (information science)” (‘information science’, tsk!)… and I kinda screamed. The next sections are, first, about the merits of the arxiv paper (outcome: their conclusions are certainly rather quite exaggerated) and, second, I’ll use that ‘ontology (information science)’ entry to dig a bit deeper as use case, using both the English entry and in several other languages as that’s what the arxiv paper covers as well. I’ll close with some thoughts on what to do about it.

On the arxiv paper’s data and results

There are several limitations to the paper; some of them discussed by its authors, some are not. The arxiv paper does not distinguish between online freely available scientific literature where only the final typesetted version is behind a paywall and official ‘open access’. This is problematic for processing the computer science entries in Wikipedia for trying to validate their hypothesis. In addition, they considered only journals with their open access policy, and journal-level analysis (cf article-level analysis), idem for the problematic ISI impact factor, and only those 21000 listed in Scopus, amounting eventually to the (ISI index-)top 4721 journals of which 335 open access to test Wikipedia content against. The open access list was taken from being listed in the directory of OA journals, ignoring the difference between ‘green’ and ‘gold’ and paywall-access from, say Elsevier. Overall, this already does not bode well for extending the obtained conclusion to computer science entries and, hence, the diffusion of knowledge claim.

The authors admit they may undercount references for the non-English entries, but they have few references anyway (Fig 1 in the arxiv paper), so it’s basically largely an English-Wikipedia analysis after all, i.e., so the conclusion is not really straightforwardly extending to ‘diffusion of knowledge’ for the non-English speaking world.

The statistical model is described on p19 of the pdf, and I don’t quite follow the rationale, with an elusive set of ‘journal characteristics’ and some estimated variables without detail. Maybe some stats person can shed a light on it.

Then the bubble-figure in the technologyreview, which is Fig 8 in the arxiv paper and it is reproduced in the screenshot below, which “shows that across 50 [non-English] Wikipedias, there is an inverse relationship between the effects of accessibility and status on referencing”. Come again? It’s not like the regression line fits well. And why are the language entries—presumably independent of one another—in a relation after all? Notwithstanding, the odds for a Serbian entry to have a reference to an open access journal is some 275% higher than to a paywalled one, vs entries in Turkish that cite higher impact factor journals some 200% more often, according to the arxiv paper. I haven’t found details of that data, though, other than a back-of-the-envelope calculation when glancing over the figure: Serbian has a 1.5 for impact and a 3.75 or so for open access, Turkish 3 and 1.3-ish. Of how many entries and how many citations for those languages? They state that “While the English Wikipedia references ~32,000 articles from top journals, the Slovak Wikipedia references only 108 and Volapuk references 0.”. But Volapuk still ends up with an open access odd ratio of 0.588 and an ln(impact factor) of 2.330 (Appendix A3), which is counted only with the set of top-rated journals only; how is that possible when there are no references to those top journals? The number of counted journal citations is not given for each language, so a ‘statistically significant’ may well actually be over a number that’s too low to do your statistics with. Waray-Waray is a very small dot, and reading from Fig 1, it’s probably not more than those 108 references in the Slovak entries.

All in all, there is some room for improvement on this paper, and, in any case, some toning down of the conclusions, let alone technologyreview’s sensationalist blog title.

Fig 8 from Teplitskiy et al (2015)

Ontology (information science) Wikipedia entry, some issues

Let me not be petty whining that none of my papers are in the references, but take a small example of the myriad of issues.

Take the statement “There are studies on generalized techniques for merging ontologies,[12] but this area of research is still largely theoretical.” Uh? The reference is to an obscure ‘dynamic ontology repair’ project pdf from the University of Edinburgh, retrieved in 2012. We merged DMOP’s domain content with DOLCE in 2011, with tool support (Protégé, to be precise). owl:import was around and working at that time as well. Not to mention the very large body of papers on ontology alignment, reference book by Shvaiko & Euzenat, and the Ontology Alignment Evaluation Initiative.

The list of ontology languages even includes SBVR and IDEF5 (not ontology languages), and, for good measure of scruffiness, a project (TOVE).

The obscure “Gellish” appears literally everywhere: it is an ontology, it is a language, it is an English dictionary (yes, the latter apparently also falls under ‘examples’ of ontologies. not), and it is even the one and only instantiation of a “hybrid ontology” combining a domain and an upper ontology. Yeah, right. Looking it up, Gellish is van Rensen’s PhD thesis of 2005 that has an underwhelming 2 citations according to Google Scholar (10 for the related journal paper), and there’s a follow-up 2nd edition of 2014 by the same author, published with lulu, no citations. That does not belong to an introductory overview of ontologies in computing. Dublin core as an example of an ontology? No (but it is a useful artefact for metadata annotations).

Under “criticisms”: aside from a Werner Ceusters statement from a commentary on someone from his website—since when deserves that to be upgraded to Wikipedia content?!?—there’s also “It’s also not clear how ontology fits with Schema on Read (NoSQL) databases.”. Ontologies with NoSQL? sigh.

“Further readings” would, I expect, have a fine set of core readings to get a more comprehensive overview of the field. While some relevant ones are there (e.g., the “what is an ontology?” paper by Oberle, Guarino, and Staab; “Ontology (Science)” by Smith, Gruber’s paper despite the flawed definition), numerous ones are the result of some authors’ self-promotion, like the one on bootstrapping biomedical ontologies, an ontology for user profiles, IE for disease intelligence—they’re not even close to ‘staple food’ for ontologies—and the 2001 OIL paper and Ontology Development 101 technical report are woefully out-dated. The “References” section is a mishmash of webpages, slides, and a few scientific papers most of which are not from mainstream ontology research venues.

And that’s just a sampling of the issues with the “Ontology (information science)” Wikipedia entry; the ontology engineering entry is worse. No wonder my students—having grown up with treating Wikipedia as gospel—get confused.

Ontologies entries in other languages

That much about the English language version of ‘ontology (information science)’. I happen to speak a few other languages as well, so I also checked most of those for their ‘ontology (information science)’ entry. For future reference as a stock-taking of today’s contents, I’ve pdf-printed them all (zipped). For starters, they all had ontologies at least categorised properly into ‘informatica’. +1.

The entry in Dutch is very short; one can quibble and nit-pick about term usage, and it is disappointing that there’s only one reference (in Dutch, so wouldn’t count in the arxiv analysis), but at least it’s not riddled with mistakes and inappropriate content.

The German one is quite elaborate, and starts off reasonably well, but has some mistakes. Among others, the typical novice pitfall of confusing classes for instances [“Stadt als Instanz des Begriffs topologisches Element der Klasse Punkte”] and the sample ontology—which of itself is a good idea to add to an overview page—has lots of modelling issues, such as datatypes and mixing subclasses with properties (the Maler [painter] with region of origin Flämish [Flemish]). Interestingly, ontology types for the English reader are foundational, domain, and hybrid, whereas the German reader has only lightweight and heavyweight ones. As for the references, there are some oddball ones, but the fair/good ones are in the majority, if incomplete, and perhaps a bit lopsided to Barry Smith material.

The Italian entry is of similar length as the German entry, but, unfortunately, has some copy-and-paste from the English one when it comes to the list of languages and examples, so, a propagation of issues; the ‘example of applications’ does list another project, and there is no ‘criticisms’ section. The text has been written separately instead of being a translation-of-English (idem ditto for the other entries, btw), and thus also consists of some other information. For starters, removing most of the ‘Premesse’ would be helpful (or elaborating on it in a criticism section; starting the topic with information warfare and terrorism? nah). The section after that (‘uso come glossario di base’) is chaotic, reading like a competitor-author per paragraph, and riddled with problematic statements like that all computer programs are based on foundational ontologies (“Tutti i programmi per computer si basano su ontologie fondazionali,”), and that the scope of an ontology is to develop a database (“Lo scopo di un’ontologia computazionale […] [è] di creare una base di dati”). It does mention OntoClean. Italian readers will also be treated on a brief discussion of the debate on one or multiple ontologies (absent from the other entries). It has a quite different set of ‘external links’ compared to the other entries, and there are hardly any references. Al in all, one leaves with a quite distinct impression of ontologies after reading the Italian one cf the Dutch, German, and English ones.

Last, the Spanish entry is about as short as the Dutch one. There’s overlap in content with the Italian entry in the sense of near-literal translation (on the foundational ontology and that Murray-Rust guy on the ‘semantic and ontological war’ due to ‘competition between standards’), and it has a plug for MathWorld (?!).

So, if the entries on topics I’m an expert in are such of such dubious quality (the German entry is, relatively, the best), then what does that imply for the other entries that superficially may seem potentially useful introductory overviews? By the same token, they probably are not. And the ontology topics are not even in an area with as much contention as topics in political sciences, history, etc. Go figure.

Now what?

Is this a bad thing? I already can see a response in the making along the line of “well, it’s crowdsourced and everyone can contribute, we invite you to not just complain, but instead improve the entry; really, you’re welcome to do so”. Maybe I will. But first, two other questions have to be answered. The arxiv paper that got my rant started claimed that open source papers are good, and that they’re reworked in interested-layperson digestible bites in Wikipedia to spread and diffuse knowledge in the world. The idea is nice, but the reality is different. Pretty much all the main papers on ontologies are freely available online even if not published ‘open access’ (computer science praxis, thank you), yet, they are not the ones that appear in Wikipedia. Question 1: Why are those—freely available—main references of ontologies not referenced there already?

A concern of a different type is that several schools in South Africa have petitioned to get free Internet access to search Wikipedia as a source of information for their studies. Their main argument was that books don’t arrive, or arrive late, and there is no library in many schools, which is a common problem. They got the zero-rate Wikipedia from MTN; more info here. (I’ll let you mull over its effects on the quality of education they get from that.) Question 2: Can Wikipedia be made a really authoritative resource with the current set-up so as to live up to what the learners [and interested laypersons] need? If I were to rewrite an update to the Wikipedia pages today, a pesky editor or someone else simply can click to roll it back to the previous version, or slowly but steadily have funny references seeping back in and sentences cut and rephrased. Writing free textbooks, or at least extensive lecture notes, seems a better option, or a ‘synthesis lectures’ booklet endorsed by lots of people researching and using ontologies. What about a ‘this version is endorsed by …’ button for Wikipedia entries?

Any better ideas, or answers to those questions, perhaps? Free diffusion of digested high quality scientific knowledge really does sound very appealing…

References

[1] Teplitskiy, M., Lu, G., Duede, E. Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science. arxiv.org/abs/1506.07608

# From data on conceptual models to optimal (logic) language profiles

There are manifold logic-based reconstructions of the main conceptual data modelling languages in a ‘gazillion’ of logics. The reasons for pursuing this line of work are good. In case you wonder, consider:

• Automated reasoning over a conceptual data model to improve their quality and avoid bugs; e.g., an empty database table due to an inconsistency in the model (unsatisfiable class). Instead of costly debugging, one can catch it upfront.
• Designing and executing queries with the model’s vocabulary cf. putting up with how the data is stored with its typically cryptic table and column names.
• Test data generation in automation of software engineering.
• Use it as ‘background knowledge’ during the query compilation stage (which helps optimizing it, so better performance querying a database).

Most of the research efforts on formalizing the conceptual data modelling languages have gone to capturing as much as possible of the modelling language, and therewith aiming to solve the first use case scenario. Runtime usage of conceptual models, i.e., use case scenarios 2-4 above, is receiving some attention, but it brings with it its own set of problems: which trade-offs are the best? That is, we know we can’t have both the modelling languages in their full glory formalised on some arbitrary (EXPTIME or undecidable) logic and have scalable runtime performance. But which subset to choose? There are papers where (logician) authors state something like ‘you don’t need keys in ER, so we ignore those’ or ‘let’s skip ternaries, as most relationships are binary anyway’ or ‘we sweep those pesky aggregation associations under the carpet’ or ‘hierarchies, disjointness and completeness are certainly important’. Who’s right? Or is neither one of them right?

So, we had all that data of the 101 UML, ER, EER, ORM, and ORM2 models analysed (see previous post and [1]). With that, we could construct evidence-based profiles based on the features that are actually used by modellers, rather than constructing profiles based on gut feeling or on one’s pet logic. We specified a core profile and one for each family of the conceptual data modelling languages under consideration (UML Class Diagrams, ER/EER, and ORM/ORM2). The details of the outcome can be found in our recently accepted paper “Evidence-based Languages for Conceptual Data Modelling Profiles” [1] that has been accepted at the 19th Conference on Advances in Databases and Information Systems (ADBIS’15), that will take place from September 8-12 in Poitiers, France. As with the other recent posts on conceptual data models, also this paper was co-authored with Pablo Fillottrani and is an output of our DST/MINCyT-funded bi-lateral project on the unification of conceptual data modelling languages (project overview).

To jump to the short answer: the core profile can be represented in $\mathcal{ALNI}$ (called $\mathcal{PL}_1$ in [3], with PTIME subsumption), whereas the modelling language-specific profiles do not match any of the very many currently existing Description Logic languages with known computational complexity.

Now how we got into that situation. There are some formalization options to consider first, which can affect the complexity of the logic. Notably, 1) whether to use inverses or qualified number restrictions, and 2) whether to go for DL role components for UML’s association ends/ORM’s roles/ER’s relationship components with a 1:1 mapping, or to ignore that and formalise the associations/fact types/relationships only (and how to handle that choice then). Extending a logic language with inverses tends to be less costly computationally cf. qualified number restrictions, so we chose the former. The latter is more complicated to handle regardless the choice, which is partially due to the fact that they are surface aspects of an underlying difference in ontological commitment as to what relations are—so-called standard view versus positionalist—and how it is represented in the models (see discussion in the paper). For the core profile, the dataset of conceptual models justified binaries + standard view representation. In addition to that, the core profile has classes, attributes, mandatory and arbitrary (unqualified) cardinality, class subsumption, and single identification. That set covers 87.57% of all the entities in the models in the dataset (91.88% of the UML models, 73.29% of the ORM models, and 94.64% of the ER/EER models). Note there’s no disjointness or completeness (there were too few of them to merit inclusion) and no role and relationship subsumption, so there isn’t much one can deduce automatically, which is a bit of a bummer.

The UML profile extends the core only slightly, yet it covers 99.44% of the elements in the UML diagrams of the dataset: add cardinality on attributes, attribute value constraints, subsumption for DL roles (UML associations), and aggregation (they are plain associations since UML v2.4.1). This makes a “$\mathcal{ALNHI}(D)$” DL that, as far as we know, hasn’t been investigated yet. That said, fiddling a bit by opting for unique name assumption and some constraints on cardinalities and role inclusion, it looks like $DL\mbox{-}Lite^{\mathcal{HN}}_{core}$ [4] may suffice, which is NLOGSPACE in subsumption and $AC^0$ in data complexity.

For ER/EER, we need to add to the core the following to make it to 99.06% coverage: composite and multivalued attribute (remodelled), weak entity type with its identification constraint, ternaries, associative entity types, and multi-attribute identification. With some squeezing and remodelling things a bit (see paper), $DL\mbox{-}Lite^{\mathcal{N}}_{core}$ [4] should do (also NLOGSPACE), though $\mathcal{DLR}_{ifd}$ [5] will make the formalisation better to follow (though that DL has too many features and is EXPTIME-complete).

Last, the ORM/ORM2 profile, which is the largest to achieve a high coverage (98.69% of the elements in the models in the data set): the core profile + subsumption on roles (DL role components) and fact types (DL roles), n-aries, disjointness on roles, nested object types, value constraints, disjunctive mandatory, internal and external uniqueness, external identifier (compound reference scheme). There’s really no way to avoid the roles, n-aries, and disjointness. There’s no exactly fitting DL for this cocktail of features, though $\mathcal{DLR}_{ifd}$ and $latex$\mathcal{CFDI}_{nc}^{\forall -} &s=-2\$ [6] approximate it; however, the former has too much constructs and the latter too few. That said, $\mathcal{DLR}_{ifd}$ is computationally not ‘well-behaved’, but with $\mathcal{CFDI}_{nc}^{\forall -}$ we still can capture over 96% of the elements in the ORM models of the dataset and it’s PTIME (yup, tractable) [7].

The discussion section of the paper answers the research questions we posed at the beginning of the investigation and reflects on not only missing features, but also ‘useless’ ones. Perhaps we won’t make a lot of friends discussing ‘useless’ features, especially when some authors investigated exactly those features. Anyway, here it goes. Really, nominal are certainly not needed (and computationally costly to boot). We can only guess why there were so few disjointness and completeness constraints in the data set, and even when they were present, they were in the few models we got from textbooks (see data set for sources of the models); true, there weren’t a lot of class hierarchies, but still. The other thing that was a bit of a disappointment was that the relational properties weren’t used a lot. Looking at the relationships in the models, there were certainly opportunities for transitivity and more irreflexivity declarations. One of our current conjectures is that they have limited implementation support, so maybe modellers don’t see the point of adding such constraints; another could be that an ‘average modeller’ (whatever that means) doesn’t quite understand all the 11 that are available in ORM2.

Overall, while a bit disappointing for the use case scenario of reasoning over conceptual data models for inconsistency management, the results are actually very promising for runtime usage of conceptual data models. Maybe that of itself will generate more interest from industry in doing that analysis step before implementing a database or software application: instead of developing a conceptual data model “just for documentation and dust-gathering”, you’ll have one that also will add more, new, better advanced features to your application.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in print)

[2] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in print)

[3] Donini, F., Lenzerini, M., Nardi, D., Nutt, W. Tractable concept languages. In: Proc. of IJCAI’91. vol. 91, pp. 458-463. 1991.

[4] Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M. The DL-Lite family and relations. Journal of Artificial Intelligence Research, 2009, 36:1-69.

[5] Calvanese, D., De Giacomo, G., Lenzerini, M. Identification constraints and functional dependencies in Description Logics. In: Proc. of IJCAI’01, pp155-160, Morgan Kaufmann.

[6] Toman, D., Weddell, G. On adding inverse features to the Description Logic $\mathcal{CFDI}_{nc}^{\forall}$. In: Proc. of PRICAI 2014, pp587-599.

[7] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in $\mathcal{CFDI}_{nc}^{\forall -}$. 28th International Workshop on Description Logics (DL’15). Calvanese, D., Konev, B. (Eds.), CEUR-WS vol. 1350, pp401-414. 7-10 June 2015, Athens, Greece.

# What’s in a publicly available conceptual data model?

We know the answer. And it’s not great from the viewpoint of usage of fancy language features and automated reasoning, which you may have guessed from an earlier post on a tractable encoding of ORM models (see also [1]). How we got there is summarised in the paper [2] that will be presented at the 34th Conference on Conceptual Modeling (ER’15) 19-22 in October in Sweden. An even shorter digest of it follows in the remainder of this post.

I collected 35 models in UML class diagram notation, 35 in ER or EER, and 35 in ORM or ORM2 from various resources, such as online diagrams and repositories (such as GenMyModel), scientific papers (ER conferences, ORM workshops and the like), and some textbooks; see dataset. Each element in the diagram was classified in terms of the unifying metamodel developed earlier with my co-author Pablo Fillottrani [3,4], so that we could examine and compare the contents across the aforementioned three modelling language ‘families’. This data was analysed (incidence, percentages, averages, etc. of the language features), and augmented with further assessments of ratios, like class:relationship, class:attribute and so on. All that had as scope to falsify the following hypotheses:

A: When more features are available in a language, they are used in the models.

B: Following the “80-20 principle”, about 80% of the entities present in the models are from the set of entities that appear in all three language families.

C: Given the different (initial) purposes of UML class diagrams, (E)ER, and ORM, models in each language still have a different characteristic ‘profile’.

To make a long story really short: Hypothesis A is validated, Hypothesis B is falsified, and Hypothesis C validated. Relaxing the constraints for hypothesis B in the sense of including a known transformation between attribute and value type, then it does make it to 87.5%. What was nice to see was the emerging informal profiles of how a typical diagram looks like in a language family, in that ER and EER and ORM and ORM are much more relationship-oriented than UML Class Diagrams, and that there are more hierarchies in the latter. Intuitively, we already kind of knew that, but it’s always better to have that backed up by data.

There are lots of other noteworthy things, like the large amount of aggregations in UML (yay) and weak entity types in ER and EER, and the—from an automated reasoning viewpoint—‘bummer’ of so few disjointness and completeness assertions in all of the models. More of that can be found in the paper (including a top-5 and a bottom-5 for each family), and you may want attend the ER conference so we can discuss more about it in person.

References

[1] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in CFDInc∀-. 28th International Workshop on Description Logics (DL’15). Calvanese, D., and Konev, B. (Eds.), CEUR-WS vol. 1350, pp401-414. 7-10 June 2015, Athens, Greece.

[2] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (accepted)

[3] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[4] Fillottrani, P.R., Keet, C.M. KF metamodel formalization. Technical Report, Arxiv.org 1412.6545. Dec 19, 2014. 26p.