# An Ontology Engineering textbook

My first textbook “An Introduction to Ontology Engineering” (pdf) is just released as an open textbook. I have revised, updated, and extended my earlier lecture notes on ontology engineering, amounting to about 1/3 more new content cf. its predecessor. Its main aim is to provide an introductory overview of ontology engineering and its secondary aim is to provide hands-on experience in ontology development that illustrate the theory.

The contents and narrative is aimed at advanced undergraduate and postgraduate level in computing (e.g., as a semester-long course), and the book is structured accordingly. After an introductory chapter, there are three blocks:

• Logic foundations for ontologies: languages (FOL, DLs, OWL species) and automated reasoning (principles and the basics of tableau);
• Developing good ontologies with methods and methodologies, the top-down approach with foundational ontologies, and the bottom-up approach to extract as much useful content as possible from legacy material;
• Advanced topics that has a selection of sub-topics: Ontology-Based Data Access, interactions between ontologies and natural languages, and advanced modelling with additional language features (fuzzy and temporal).

Each chapter has several review questions and exercises to explore one or more aspects of the theory, as well as descriptions of two assignments that require using several sub-topics at once. More information is available on the textbook’s page [also here] (including the links to the ontologies used in the exercises), or you can click here for the pdf (7MB).

Feedback is welcome, of course. Also, if you happen to use it in whole or in part for your course, I’d be grateful if you would let me know. Finally, if this textbook will be used half (or even a quarter) as much as the 2009/2010 blogposts have been visited (around 10K unique visitors since posting them), that would mean there are a lot of people learning about ontology engineering and then I’ll have achieved more than I hoped for.

UPDATE: meanwhile, it has been added to several open (text)book repositories, such as OpenUCT and the Open Textbook Archive, and it has been featured on unglue.it in the week of 13-8 (out of its 14K free ebooks).

# Automatically finding the feasible object property

Late last month I wrote about the updated taxonomy of part-whole relations and claimed it wasn’t such a big deal during the modeling process to have that many relations to choose from. Here I’ll back up that claim. Primarily, it is thanks to the ‘Foundational Ontology and Reasoner enhanced axiomatiZAtion’ (FORZA) approach which includes the Guided ENtity reuse and class Expression geneRATOR (GENERATOR) method that was implemented in the OntoPartS-2 tool [1]. The general idea of the GENERATOR method is depicted in the figure below, which outlines two scenarios: one in which the experts perform the authoring of their domain ontology with the help of a foundational ontology, and the other one without a foundational ontology.

I think the pictures are clearer than the following text, but some prefer text, so here goes the explanation attempt. Let’s start with scenario A on the left-hand side of the figure: a modeller has a domain ontology and a foundational ontology and she wants to relate class two domain classes (indicated with C and D) and thus needs to select some object property. The first step is, indeed, selecting C and D (e.g., Human and Heart in an anatomy ontology); this is step (1) in the Figure.

Then (step 2) there are those long red arrows, which indicate that somehow there has to be a way to deal with the alignment of Human and of Heart to the relevant categories in the foundational ontology. This ‘somehow’ can be either of the following three options: (i) the domain ontology was already aligned to the foundational ontology, so that step (2) is executed automatically in the background and the modeler need not to worry, (ii) she manually carries out the alignment (assuming she knows the foundational ontology well enough), or, more likely, (iii) she chooses to be guided by a decision diagram that is specific to the selected foundational ontology. In case of option (ii) or (iii), she can choose to save it permanently or just use it for the duration of the application of the method. Step (3) is an automated process that moves up in the taxonomy to find the possible object properties. Here is where an automated reasoner comes into the equation, which can step-wise retrieve the parent class, en passant relying on taxonomic classification that offers the most up-to-date class hierarchy (i.e., including implicit subsumptions) and therewith avoiding spurious candidates. From a modeller’s viewpoint, one thus only has to select which classes to relate, and, optionally, align the ontology, so that the software will do the rest, as each time it finds a domain and range axiom of a relationship in which the parents of C and D participate, it is marked as a candidate property to be used in the class expression. Finally, the candidate object properties are returned to the user (step 4).

While the figure shows only one foundational ontology, one equally well can use a separate relation ontology, like PW or PWMT, which is just an implementation variant of scenario A: the relation ontology is also traversed upwards and on each iteration, the base ontology class is matched against relational ontology to find relations where the (parent of the) class is defined in a domain and range axiom, also until the top is reached before returning candidate relations.

The second scenario with a domain ontology only is a simplified version of option A, where the alignment step is omitted. In Figure-B above, GENERATOR would return object properties W and R as options to choose from, which, when used, would not generate an inconsistency (in this part of the ontology, at least). Without this guidance, a modeler could, erroneously, select, say, object property S, which, if the branches are disjoint, would result in an inconsistency, and if not declared disjoint, move class C from the left-hand branch to the one in the middle, which may be an undesirable deduction.

For the Heart and Human example, these entities are, in DOLCE terminology, physical objects, so that it will return structural parthood or plain parthood, if the PW ontology is used as well. If, on the other hand, say, Vase and Clay would have been the classes selected from the domain ontology, then a constitution relation would be proposed (be this with DOLCE, PW, or, say, GFO), for Vase is a physical object and Clay an amount of matter. Or with Limpopo and South Africa, a tangential proper parthood would be proposed, because they are both geographic entities.

The approach without the reasoner and without the foundational ontology decision diagram was tested with users, and showed that such a tool (OntoPartS) made the ontology authoring more efficient and accurate [2], and that aligning to DOLCE was the main hurdle for not seeing even more impressive differences. This is addressed with OntoPartS-2, so it ought to work better. What still remains to be done, admittedly, is that larger usability study with the updated version OntoPartS-2. In the meantime: if you use it, please let us know your opinion.

References

[1] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.

[2] Keet, C.M., Fernandez-Reyes, F.C., Morales-Gonzalez, A. Representing mereotopological relations in OWL ontologies with OntoPartS. 9th Extended Semantic Web Conference (ESWC’12), Simperl et al. (eds.), 27-31 May 2012, Heraklion, Crete, Greece. Springer, LNCS 7295, 240-254.

# The TDDonto tool to try out TDD for ontology authoring

Last month I wrote about Test-Driven Development for ontologies, which is described in more detail in the ESWC’16 paper I co-authored with Agnieszka Lawrynowicz [1]. That paper does not describe much about the actual tool implementing the tests, TDDonto, although we have it and used it for the performance evaluation. Some more detail on its design and more experimental results are described in the paper “The TDDonto Tool for Test-Driven Development of DL Knowledge Bases” [2] that has just been published in the proceedings of the 29th International Workshop on Description Logics, which will take place next weekend in Cape Town (22-25 April 2016).

What we couldn’t include there in [2] is multiple screenshots to show how it works, but a blog is a fine medium for that, so I’ll illustrate the tool with some examples in the remainder of the post. It’s an alpha version that works. No usability and HCI evaluations have been done, but at least it’s a Protégé plugin rather than command line :).

First, you need to download the plugin from Agnieszka’s ARISTOTELES project page and place the jar file in the plugins folder of Protégé 5.0. You can then go to the Protégé menu bar, select Windows – Views – Evaluation views – TDDOnto, and place it somewhere on the screen and start using it. For the examples here, I used the African Wildlife Ontology tutorial ontology (AWO v1) from my ontology engineering course.

Make sure to have selected an automated reasoner, and classify your ontology. Now, type a new test in the “New test” field at the top, e.g. carnivore DisjointWith: herbivore, click “Add test”, select the checkbox of the test to execute, and click the “Execute test”: the status will be returned, as shown in the screenshot below. In this case, the “OK” says that the disjointness is already asserted or entailed in the ontology.

Now let’s do a TDD test that is going to fail (you won’t know upfront, of course); e.g., testing whether impalas are herbivores:

The TDD test failed because the subsumption is neither asserted nor entailed in the ontology. One can then click “add to ontology”, which updates the ontology:

Note that the reasoner has to be run again after a change in the ontology.

Lets do two more: testing whether lion is a carnivore and that flower is a plan part. The output of the tests is as follows:

It returns “OK” for the lion, because it is entailed in the ontology: a carnivore is an entity that eats only animals or parts thereof, and lions eat only herbivore and eats some impala (which are animals). The other one, Flower SubClassOf: PlantParts fails as “undefined”, because Flower is not in the ontology.

Ontologies do not have only subsumption and disjointness axioms, so let’s assume that impalas eat leaves and we want check whether that is in the ontology, as well as whether lions eat animals:

The former failed because there are no properties for the impala in the AWO v1, the latter passed, because a lion eats impala, and impala is an animal. Or: the TDDOnto tool indeed behaves as expected.

Currently, only a subset of all the specified tests have been implemented, due to some limitations of existing tools, but we’re working on implementing those as well.

If you have any feedback on TDDOnto, please don’t hesitate to tell us. I hope to be seeing you later in the week at DL’16, where I’ll be presenting the paper on Sunday afternoon (24th) and I also can give a live demo any time during the workshop (or afterwards, if you stay for KR’16).

References

[1] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. 13th Extended Semantic Web Conference (ESWC’16). Springer LNCS. 29 May – 2 June, 2016, Crete, Greece. (in print)

[2] Lawrynowicz, A., Keet, C.M. The TDDonto Tool for Test-Driven Development of DL Knowledge bases. 29th International Workshop on Description Logics (DL’16). April 22-25, Cape Town, South Africa. CEUR WS vol. 1577.

# Reblogging 2012: Fixing flaws in OWL object property expressions

From the “10 years of keetblog – reblogging: 2012”: There are several 2012 papers I (co-)authored that I like and would have liked to reblog—whatever their citation counts may be. Two are on theoretical, methodological, and tooling advances in ontology engineering using foundational ontologies in various ways, in collaboration with Francis Fernandez and Annette Morales following a teaching and research visit to Cuba (ESWC’12 paper on part-whole relations), and a dedicated Honours student who graduated cum laude, Zubeida Khan (EKAW’12 paper on foundational ontology selection). The other one, reblogged here, is of a more fundamental nature—principles of role [object property] hierarchies in ontologies—and ended up winning best paper award at EKAW’12; an extended version has been published in JoDS in 2014. I’m still looking for a student to make a proof-of-concept implementation (in short, thus far: when some are interested, there’s no money, and when there’s money, there’s no interest).

———–

OWL 2 DL is a very expressive language and, thanks to ontology developers’ persistent requests, has many features for declaring complex object property expressions: object sub-properties, (inverse) functional, disjointness, equivalence, cardinality, (ir)reflexivity, (a)symmetry, transitivity, and role chaining. A downside of this is that with the more one can do, the higher is the chance that flaws in the representation are introduced; hence, an unexpected or undesired classification or inconsistency may actually be due to a mistake in the object property box, not a class axiom. While there are nifty automated reasoners and explanation tools that help with the modeling exercise, the standard reasoning services for OWL ontologies assume that the axioms in the ‘object property box’ are correct and according to the ontologist’s intention. This may not be the case. Take, for instance, the following thee examples, where either the assertion is not according to the intention of the modeller, or the consequence may be undesirable.

• Domain and range flaws; asserting hasParent $\sqsubseteq$ hasMother instead of hasMother $\sqsubseteq$ hasParent in accordance with their domain and range restrictions (i.e., a subsetting mistake—a more detailed example can be found in [1]), or declaring a domain or a range to be an intersection of disjoint classes;
• Property characteristics flaws: e.g., the family-tree.owl (when accessed on 12-3-2012) has hasGrandFather $\sqsubseteq$ hasAncestor and Trans(hasAncestor) so that transitivity unintentionally is passed down the property hierarchy, yet hasGrandFather is really intransitive (but that cannot be asserted in OWL);
• Property chain issues; for instance the chain hasPart $\circ$ hasParticipant $\sqsubseteq$ hasParticipant in the pharmacogenomics ontology [2] that forces the classes in class expressions using these properties—in casu, DrugTreatment and DrugGeneInteraction—to be either processes due to the domain of the hasParticipant object property, or they will be inconsistent.

Unfortunately, reasoner output and explanation features in ontology development environments do not point to the actual modelling flaw in the object property box. This is due to that implemented justification and explanation algorithms [3, 4, 5] consider logical deductions only and that class axioms and assertions about instances take precedence over what ‘ought to be’ concerning object property axioms, so that only instances and classes can move about in the taxonomy. This makes sense from a logic viewpoint, but it is not enough from an ontology quality viewpoint, as an object property inclusion axiom—being the property hierarchies, domain and range axioms to type the property, a property’s characteristics (reflexivity etc.), and property chains—may well be wrong, and this should be found as such, and corrections proposed.

So, we have to look at what type of mistakes can be made in object property expressions, how one can get the modeller to choose the ontologically correct options in the object property box so as to achieve a better quality ontology and, in case of flaws, how to guide the modeller to the root defect from the modeller’s viewpoint, and propose corrections. That is: the need to recognise the flaw, explain it, and to suggest revisions.

To this end, two non-standard reasoning services were defined [6], which has been accepted recently at the 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW’12): SubProS and ProChainS. The former is an extension to the RBox Compatibility Service for object subproperties by [1] so that it now also handles the object property characteristics in addition to the subsetting-way of asserting object sub-properties and covers the OWL 2 DL features as a minimum. For the latter, a new ontological reasoning service is defined, which checks whether the chain’s properties are compatible by assessing the domain and range axioms of the participating object properties. Both compatibility services exhaustively check all permutations and therewith pinpoint to the root cause of the problem (if any) in the object property box. In addition, if a test fails, one or more proposals are made how best to revise the identified flaw (depending on the flaw, it may include the option to ignore the warning and accept the deduction). Put differently: SubProS and ProChainS can be considered so-called ontological reasoning services, because the ontology does not necessarily contain logical errors in some of the flaws detected, and these two services thus fall in the category of tools that focus on both logic and additional ontology quality criteria, by aiming toward ontological correctness in addition to just a satisfiable logical theory. (on this topic, see also the works on anti-patterns [7] and OntoClean [8]). Hence, it is different from other works on explanation and pinpointing mistakes that concern logical consequences only [3,4,5], and SubProS and ProChainS also propose revisions for the flaws.

SubProS and ProChainS were evaluated (manually) with several ontologies, including BioTop and the DMOP, which demonstrate that the proposed ontological reasoning services indeed did isolate flaws and could propose useful corrections, which have been incorporated in the latest revisions of the ontologies.

Theoretical details, the definition of the two services, as well as detailed evaluation and explanation going through the steps can be found in the EKAW’12 paper [6], which I’ll present some time between 8 and 12 October in Galway, Ireland. The next phase is to implement an efficient algorithm and make a user-friendly GUI that assists with revising the flaws.

References

[1] Keet, C.M., Artale, A.: Representing and reasoning over a taxonomy of part-whole relations. Applied Ontology 3(1-2) (2008) 91–110

[2] Dumontier, M., Villanueva-Rosales, N.: Modeling life science knowledge with OWL 1.1. In: Fourth International Workshop OWL: Experiences and Directions 2008 (OWLED 2008 DC). (2008) Washington, DC (metro), 1-2 April 2008

[3] Horridge, M., Parsia, B., Sattler, U.: Laconic and precise justifications in OWL. In: Proceedings of the 7th International Semantic Web Conference (ISWC 2008). Volume 5318 of LNCS., Springer (2008)

[4] Parsia, B., Sirin, E., Kalyanpur, A.: Debugging OWL ontologies. In: Proceedings of the World Wide Web Conference (WWW 2005). (2005) May 10-14, 2005, Chiba, Japan.

[5] Kalyanpur, A., Parsia, B., Sirin, E., Grau, B.: Repairing unsatisfiable concepts in OWL ontologies. In: Proceedings of ESWC’06. Springer LNCS (2006)

[6] Keet, C.M. Detecting and Revising Flaws in OWL Object Property Expressions. 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW’12), Oct 8-12, Galway, Ireland. Springer, LNAI, 15p. (in press)

[7] Roussey, C., Corcho, O., Vilches-Blazquez, L.: A catalogue of OWL ontology antipatterns. In: Proceedings of K-CAP’09. (2009) 205–206

[8] Guarino, N., Welty, C.: An overview of OntoClean. In Staab, S., Studer, R., eds.: Handbook on ontologies. Springer Verlag (2004) 151–159

# Reblogging 2010: Rough ontologies from an ontology engineering perspective

From the “10 years of keetblog – reblogging: 2010”: two solid papers on the feasibility of rough ontologies, presented at DL’10 (logic stuff) and EKAW’10 (ontology engineering aspects). Short answer: not really feasible from a computational viewpoint. Notwithstanding, I did give it a try afterwards with ‘rOWL’ (SAICSIT’11 paper) before giving up on it, and other people do try, too (see citations of the DL and EKAW papers).

—-

Somewhere buried in the blogpost about the DL’10 workshop, I mentioned the topic of my paper [1] at the 23rd International Description Logics Workshop (DL’10), which concerned the feasibility of rough DL knowledge bases. That paper was focussed on the theoretical assessment (result: there are serious theoretical hurdles for rough DL KBs) and had a rather short section where experimental results were crammed into the odd page (result: one can squeeze at least something out of the extant languages and tools, but more should be possible in the near future). More recently, my paper [2] submitted to the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW’10) got accepted, which focuses on the ontology engineering side of rough ontologies and therefore has a lot more information on how one can squeeze something out of the extant languages and tools; if that is not enough, there is also supplementary material that people can play with.

Ideally, they ought to go together in on paper to get a good overview at once, but there are page limits for conference papers and anyhow the last word has not been said about rough ontologies. For what it is worth, I have put the two together in the slides for the weekly KRDB Lunch Seminar that I will present tomorrow at, well, lunch hour in the seminar room on the first floor of the POS building.

References

[1] Keet, C. M. On the feasibility of Description Logic knowledge bases with rough concepts and vague instances. Proc. of DL’10, 4-7 May 2010, Waterloo, Canada. pp314-324.

[2] Keet, C. M. Ontology engineering with rough concepts and instances17th International Conference on Knowledge Engineering and Knowledge Management (EKAW’10). 11-15 October 2010, Lisbon, Portugal.  Springer LNCS.

# Reblogging 2006: Figuring out requirements for automated reasoning services for formal bio-ontologies

From the “10 years of keetblog – reblogging: 2006”: a preliminary post that led to the OWLED 2007 paper I co-authored with Marco Roos and Scott Marshall, when I was still predominantly into bio-ontologies and biological databases. The paper received quite a few citations, and a good ‘harvest’ from both OWLED’07 and co-located DL’07 participants on how those requirements may be met (described here). The original post: Figuring out requirements for automated reasoning services for formal bio-ontologies, from Dec 27, 2006.

What does the user want? There is a whole sub-discipline on requirements engineering, where researchers look into methodologies how one best can extract the users’ desires for a software application and organize the requirements according to type and priority. But what to do when users – in this case biologists and (mostly non-formal) bio-ontologies developers – neither do know clearly themselves what they want nor what type of automated reasoning is already ‘on offer’. Here, I’m making a start by briefly listing informally some of the desires & usages that I came across in the literature, picked up from conversations and further probing to disambiguate the (for a logician) vague descriptions, or bumped into myself; they are summarised at the end of this blog entry and update (d.d. 5-5-’07) described more comprehensively in [0].

Feel free to add your wishes & demands; it may even be fed back into current research like [1] or be supported already after all. (An alternative approach is describing ‘scenarios’ from which one can try to extract the required reasoning tasks; if you want to, you can add those as well.)

I. A fairly obvious use of automated reasoners such as Racer, Pellet and FaCT++ with ontologies is to let the software find errors (inconsistencies) in the representation of the knowledge or reality. This is particularly useful to ensure no ‘contradictory’ information remains in the ontology when an ontology gets too big for one person to comprehend and multiple people update an ontology. Also, it tends to facilitate learning how to formally represent something. Hence, the usage is to support the ontology development process.

But this is just the beginning: having a formal ontology gives you other nice options, or at least that is the dangling carrot on front of the developer’s nose.

II. One demonstration of the advantages of having a formal ontology, thus not merely a promise, is the classification of protein phosphatases by Wolstencroft et al. [9], where also some modest results were obtained in discovering novel information about those phosphatases that was entailed in the extant information but hitherto unknown. Bandini and Mosca [2] pushed a closely related idea one step further in another direction. To constrain the search space of candidate rubber molecules for tire production, they defined the constraints (properties) all types of molecules for tires must satisfy in the TBox, treated each candidate molecule as an instance in the ABox, and performed model checking on the knowledgebase: each instance inconsistent w.r.t. the TBox was thrown out of the pool of candidate-molecules. Overall, the formal representation with model checking achieved a considerable reduction in resource usage of the system and reduced the amount of costly wet-lab research. Hence, the usages are classification and model checking.[i]

III. Whereas the former includes usage of particular instances for the reasoning scenarios, another on is to stay at the type level and, in particular, relations between the types (classes in the class hierarchy in Protégé). In short, some users want to discover new or missing relations. What type of relation is not always exactly clear, but I assume for now that any non-isA relation would do. For instance, Roos et al. [8] would like to do that for the subject domain of histones; with or without instance-level data. The former, using instance-level data, resembles the reverse engineering option in VisioModeler, which takes a physical database schema and the data stored in the tables and computes the likely entities, relations, and constraints at the level of a conceptual model (in casu, ORM). Mungall [7] wants to “Check for inconsistent or missing relationships” and “Automatically assign relationships for new terms”. How can one find what is not there but ought to be in the ontology? An automated reasoner is not an oracle. I will split up this topic into two aspects. First, one can derive relations among types, meaning that some ontology developer has declared several types, relations, other properties, but not everything. The reasoner then, takes the declared knowledge and can return relations that are logically implied by the formal ontology. From a user perspective, such a derived relation may be perceived as a ‘new’ or ‘missing’ relation – but it did not fall out of the sky because the relation was already entailed in the ontology (or: you did not realize you knew it already). Second, another notion of ‘missing relations’: e.g. there are 17 types of macrophages (types of cell) in the FMA, which must be part of, contained in, or located in something. If you query the FMA through OQAFMA, it gives as answer that the hepatic macrophage is part of the liver [5]. An informed user knows it cannot be the case that the other macrophages are not part of anything. Then, the ontology developer may want to fill this gap – adding the ‘missing’ relations – by further developing those cell-level sections of the ontology. Note that the reasoner will not second-guess you by asking “do you want more things there?”; it uses the Open World Assumption, i.e. that there always may be more than actually represented on the ontology (and absence of some piece of information is not negation of that piece). Thus, the requirements are to have some way of dealing with gaps’ in an ontology, to support computing derived relations entailed in a logical theory, and, third, deriving type-level relations based on instance-level data. The second one is already supported, the first one only with intervention by an informed user, and the third one might, to some extent.

Now three shorter points, either because there is even less material or there is too much to stuff it in this blog entry.

IV. A ‘this would be nice’ suggestion from Christina Hettne, among others, concerns the desire to compare pathways, which, in its simplest form, amounts to checking for sub-graph isomorphisms. More generally, one could – or should be able to – treat an ontology as a scientific theory [6] and compare competing explanations of some natural phenomenon (provided both are represented formally). Thus, we have a requirement for comparison of two ontologies, not with the aim of doing meaning negotiation and/or merging them, but where the discrepancies themselves are the fun part. This indicates that dealing with ‘errors’ that a reasoner spits out could use an upgrade toward user-friendliness.

V. Reasoning with parthood and parthood-like relations in bio-ontologies are on a par with importance of the subsumption relation. Donnelly [3] and Keet [4], among many, would like to use parthood and parthood-like relations for reasoning, covering more than transitivity alone. Generalizing a bit, we have another requirement: reasoning with properties (relations) and hierarchies of relations, focusing first on the part-whole relation. What reasoning services are required exactly, be it for parthood or any other relation, deserves an entry on its own.

VI. And whatnot? For instance, linking up different ontologies that each reside at their own level of granularity, yet have enabled to perform ‘granular cross-ontology queries’, or infer locations of diseases based on combining an anatomy ontology with a disease taxonomy, hence, reasoning over linked ontologies. This needs to be written down in more detail, and may be covered at least partially with point two in item III.

Summarizing, we have to following requirements for automated reasoning services, in random order w.r.t. importance:

• Support in the ontology development process;
• Classification;
• Model checking;
• Finding ‘gaps’ in the content of an ontology;
• Computing derived relations at the type level;
• Deriving type-level relations from instance-level data;
• Comparison of two ontologies ([logical] theories);
• Reasoning with a plethora of parthood and parthood-like relations;
• Using (including finding inconsistencies in) a hierarchy of relations in conjunction with the class hierarchy;

I doubt this is an exhaustive list, and expect to add more requirements & desires soon. They also have to be specified more precisely than explained briefly above and the solutions to meet these requirements need to be elaborated upon as well.

[0] Keet, C.M., Roos, M., Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS.

[1] European FP6 FET Project “Thinking ONtologiES (TONES)”. (UDATE 29-7-2015: URL defunct by now)

[2] Bandini, S., Mosca, A. Mereological knowledge representation for the chemical formulation. 2nd Workshop on Formal Ontologies Meets Industry 2006 (FOMI2006), 14-15 December 2006, Trento, Italy. pp55-69.

[3] Donnelly, M., Bittner, T. and Rosse, C. A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies. Artificial Intelligence in Medicine, 2006, 36(1):1-27.

[4]
Keet, C.M. Part-whole relations in Object-Role Models. 2nd International Workshop on Object-Role Modelling (ORM 2006), Montpellier, France, Nov 2-3, 2006. In: OTM Workshops 2006. Meersman, R., Tari, Z., Herrero., P. et al. (Eds.), LNCS 4278. Berlin: Springer-Verlag, 2006. pp1116-1127.

[5]
Keet, C.M. Granular information retrieval from the Gene Ontology and from the Foundational Model of Anatomy with OQAFMA. KRDB Research Centre Technical Report KRDB06-1, Free University of Bozen-Bolzano, 6 April 2006. 19p.

[6] Keet, C.M.
Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS2005), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics 3615, Springer Verlag, 2005. pp46-62.

[7] Mungall, C.J. Obol: integrating language and meaning in bio-ontologies. Comparative and Functional Genomics, 2004, 5(6-7):509-520. (UPDATE: link rot as well; a freely accessible veriosn is available at: http://berkeleybop.org/~cjm/obol/doc/Mungall_CFG_2004.pdf)

[8] Roos, M., Rauwerda, H., Marshall, M.S., Post, L., Inda, M., Henkel, C., Breit, T. Towards a virtual laboratory for integrative bioinformatics research. CSBio Reader: Extended abstracts of “CS & IT with/for Biology” Seminar Series 2005. Free University of Bozen-Bolzano, 2005. pp18-25.

[9]
Wolstencroft, K., Lord, P., Tabernero, L., Brass, A., Stevens, R. Using ontology reasoning to classify protein phosphatases [abstract]. 8th Annual Bio-Ontologies Meeting; 2005 24 June; Detroit, United States of America.

[i] Observe another aspect regarding model checking where the automated reasoner checks if the theory is satisfiable, or: given your ontology, if there is/can be a combination of instances such that all the declared knowledge in the ontology holds (is true), which is called a model’ (as well, like so many things). That an ontology is satisfiable does not imply it only has models as intended by you, i.e. there is a difference between ‘all models’ and ‘all intended models’. If an ontology is satisfiable it means that it is logically possible that each type can be instantiated without running into inconsistencies; it neither demonstrates that one can indeed find in reality the real-world versions of those represented entities nor if there is one-and-only-one model that actually matches exactly the data you may want to have linked to & checked against the ontology.

# From data on conceptual models to optimal (logic) language profiles

There are manifold logic-based reconstructions of the main conceptual data modelling languages in a ‘gazillion’ of logics. The reasons for pursuing this line of work are good. In case you wonder, consider:

• Automated reasoning over a conceptual data model to improve their quality and avoid bugs; e.g., an empty database table due to an inconsistency in the model (unsatisfiable class). Instead of costly debugging, one can catch it upfront.
• Designing and executing queries with the model’s vocabulary cf. putting up with how the data is stored with its typically cryptic table and column names.
• Test data generation in automation of software engineering.
• Use it as ‘background knowledge’ during the query compilation stage (which helps optimizing it, so better performance querying a database).

Most of the research efforts on formalizing the conceptual data modelling languages have gone to capturing as much as possible of the modelling language, and therewith aiming to solve the first use case scenario. Runtime usage of conceptual models, i.e., use case scenarios 2-4 above, is receiving some attention, but it brings with it its own set of problems: which trade-offs are the best? That is, we know we can’t have both the modelling languages in their full glory formalised on some arbitrary (EXPTIME or undecidable) logic and have scalable runtime performance. But which subset to choose? There are papers where (logician) authors state something like ‘you don’t need keys in ER, so we ignore those’ or ‘let’s skip ternaries, as most relationships are binary anyway’ or ‘we sweep those pesky aggregation associations under the carpet’ or ‘hierarchies, disjointness and completeness are certainly important’. Who’s right? Or is neither one of them right?

So, we had all that data of the 101 UML, ER, EER, ORM, and ORM2 models analysed (see previous post and [1]). With that, we could construct evidence-based profiles based on the features that are actually used by modellers, rather than constructing profiles based on gut feeling or on one’s pet logic. We specified a core profile and one for each family of the conceptual data modelling languages under consideration (UML Class Diagrams, ER/EER, and ORM/ORM2). The details of the outcome can be found in our recently accepted paper “Evidence-based Languages for Conceptual Data Modelling Profiles” [1] that has been accepted at the 19th Conference on Advances in Databases and Information Systems (ADBIS’15), that will take place from September 8-12 in Poitiers, France. As with the other recent posts on conceptual data models, also this paper was co-authored with Pablo Fillottrani and is an output of our DST/MINCyT-funded bi-lateral project on the unification of conceptual data modelling languages (project overview).

To jump to the short answer: the core profile can be represented in $\mathcal{ALNI}$ (called $\mathcal{PL}_1$ in [3], with PTIME subsumption), whereas the modelling language-specific profiles do not match any of the very many currently existing Description Logic languages with known computational complexity.

Now how we got into that situation. There are some formalization options to consider first, which can affect the complexity of the logic. Notably, 1) whether to use inverses or qualified number restrictions, and 2) whether to go for DL role components for UML’s association ends/ORM’s roles/ER’s relationship components with a 1:1 mapping, or to ignore that and formalise the associations/fact types/relationships only (and how to handle that choice then). Extending a logic language with inverses tends to be less costly computationally cf. qualified number restrictions, so we chose the former. The latter is more complicated to handle regardless the choice, which is partially due to the fact that they are surface aspects of an underlying difference in ontological commitment as to what relations are—so-called standard view versus positionalist—and how it is represented in the models (see discussion in the paper). For the core profile, the dataset of conceptual models justified binaries + standard view representation. In addition to that, the core profile has classes, attributes, mandatory and arbitrary (unqualified) cardinality, class subsumption, and single identification. That set covers 87.57% of all the entities in the models in the dataset (91.88% of the UML models, 73.29% of the ORM models, and 94.64% of the ER/EER models). Note there’s no disjointness or completeness (there were too few of them to merit inclusion) and no role and relationship subsumption, so there isn’t much one can deduce automatically, which is a bit of a bummer.

The UML profile extends the core only slightly, yet it covers 99.44% of the elements in the UML diagrams of the dataset: add cardinality on attributes, attribute value constraints, subsumption for DL roles (UML associations), and aggregation (they are plain associations since UML v2.4.1). This makes a “$\mathcal{ALNHI}(D)$” DL that, as far as we know, hasn’t been investigated yet. That said, fiddling a bit by opting for unique name assumption and some constraints on cardinalities and role inclusion, it looks like $DL\mbox{-}Lite^{\mathcal{HN}}_{core}$ [4] may suffice, which is NLOGSPACE in subsumption and $AC^0$ in data complexity.

For ER/EER, we need to add to the core the following to make it to 99.06% coverage: composite and multivalued attribute (remodelled), weak entity type with its identification constraint, ternaries, associative entity types, and multi-attribute identification. With some squeezing and remodelling things a bit (see paper), $DL\mbox{-}Lite^{\mathcal{N}}_{core}$ [4] should do (also NLOGSPACE), though $\mathcal{DLR}_{ifd}$ [5] will make the formalisation better to follow (though that DL has too many features and is EXPTIME-complete).

Last, the ORM/ORM2 profile, which is the largest to achieve a high coverage (98.69% of the elements in the models in the data set): the core profile + subsumption on roles (DL role components) and fact types (DL roles), n-aries, disjointness on roles, nested object types, value constraints, disjunctive mandatory, internal and external uniqueness, external identifier (compound reference scheme). There’s really no way to avoid the roles, n-aries, and disjointness. There’s no exactly fitting DL for this cocktail of features, though $\mathcal{DLR}_{ifd}$ and $latex$\mathcal{CFDI}_{nc}^{\forall -} &s=-2\$ [6] approximate it; however, the former has too much constructs and the latter too few. That said, $\mathcal{DLR}_{ifd}$ is computationally not ‘well-behaved’, but with $\mathcal{CFDI}_{nc}^{\forall -}$ we still can capture over 96% of the elements in the ORM models of the dataset and it’s PTIME (yup, tractable) [7].

The discussion section of the paper answers the research questions we posed at the beginning of the investigation and reflects on not only missing features, but also ‘useless’ ones. Perhaps we won’t make a lot of friends discussing ‘useless’ features, especially when some authors investigated exactly those features. Anyway, here it goes. Really, nominal are certainly not needed (and computationally costly to boot). We can only guess why there were so few disjointness and completeness constraints in the data set, and even when they were present, they were in the few models we got from textbooks (see data set for sources of the models); true, there weren’t a lot of class hierarchies, but still. The other thing that was a bit of a disappointment was that the relational properties weren’t used a lot. Looking at the relationships in the models, there were certainly opportunities for transitivity and more irreflexivity declarations. One of our current conjectures is that they have limited implementation support, so maybe modellers don’t see the point of adding such constraints; another could be that an ‘average modeller’ (whatever that means) doesn’t quite understand all the 11 that are available in ORM2.

Overall, while a bit disappointing for the use case scenario of reasoning over conceptual data models for inconsistency management, the results are actually very promising for runtime usage of conceptual data models. Maybe that of itself will generate more interest from industry in doing that analysis step before implementing a database or software application: instead of developing a conceptual data model “just for documentation and dust-gathering”, you’ll have one that also will add more, new, better advanced features to your application.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in print)

[2] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in print)

[3] Donini, F., Lenzerini, M., Nardi, D., Nutt, W. Tractable concept languages. In: Proc. of IJCAI’91. vol. 91, pp. 458-463. 1991.

[4] Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M. The DL-Lite family and relations. Journal of Artificial Intelligence Research, 2009, 36:1-69.

[5] Calvanese, D., De Giacomo, G., Lenzerini, M. Identification constraints and functional dependencies in Description Logics. In: Proc. of IJCAI’01, pp155-160, Morgan Kaufmann.

[6] Toman, D., Weddell, G. On adding inverse features to the Description Logic $\mathcal{CFDI}_{nc}^{\forall}$. In: Proc. of PRICAI 2014, pp587-599.

[7] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in $\mathcal{CFDI}_{nc}^{\forall -}$. 28th International Workshop on Description Logics (DL’15). Calvanese, D., Konev, B. (Eds.), CEUR-WS vol. 1350, pp401-414. 7-10 June 2015, Athens, Greece.