What’s in a publicly available conceptual data model?

We know the answer. It is not great news from the viewpoint of fancy language features and automated reasoning, which you may have guessed from an earlier post on a tractable encoding of ORM models (see also [1]). How we got there is summarised in the paper [2] that will be presented at the 34th International Conference on Conceptual Modeling (ER’15), 19-22 October in Stockholm, Sweden. An even shorter digest follows in the remainder of this post.

I collected 35 models in UML class diagram notation, 35 in ER or EER, and 35 in ORM or ORM2 from various sources, including online diagrams and repositories (such as GenMyModel), scientific papers (ER conferences, ORM workshops and the like), and some textbooks; see dataset. Each element in each diagram was classified in terms of the unifying metamodel developed earlier with my co-author Pablo Fillottrani [3,4], so that we could examine and compare the contents across the three modelling language ‘families’. This data was analysed (incidence, percentages, averages, etc. of the language features) and augmented with further assessments of ratios, such as class:relationship and class:attribute. The aim of all this was to try to falsify the following hypotheses:

A: When more features are available in a language, they are used in the models.

B: Following the “80-20 principle”, about 80% of the entities present in the models are from the set of entities that appear in all three language families.

C: Given the different (initial) purposes of UML class diagrams, (E)ER, and ORM, models in each language still have a different characteristic ‘profile’.
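The kind of tallying described above (incidence, percentages, and ratios such as class:relationship) could be sketched in Python roughly as follows. This is only an illustration: the element kinds, the counts, and the function names are made up for the example and are not taken from the dataset or from whatever tooling was used for the paper.

```python
from collections import Counter

# Hypothetical toy data: one (family, element-kind counts) pair per model.
# Kinds and numbers are invented for illustration, not taken from the dataset.
models = [
    ("UML", Counter({"object type": 8, "relationship": 4,
                     "attribute": 20, "subsumption": 3})),
    ("EER", Counter({"object type": 6, "relationship": 6,
                     "attribute": 15, "weak object type": 1})),
    ("ORM", Counter({"object type": 5, "relationship": 10,
                     "value type": 7})),
]


def incidence_by_family(models):
    """Sum the element-kind counts of all models in each language family."""
    totals = {}
    for family, counts in models:
        totals.setdefault(family, Counter()).update(counts)
    return totals


def percentages(counts):
    """Express each element kind as a percentage of all elements counted."""
    total = sum(counts.values())
    return {kind: 100.0 * n / total for kind, n in counts.items()}


def ratio(counts, kind_a, kind_b):
    """Return the kind_a:kind_b ratio, or None if kind_b never occurs."""
    return counts[kind_a] / counts[kind_b] if counts[kind_b] else None


totals = incidence_by_family(models)
for family, counts in totals.items():
    print(family, percentages(counts),
          ratio(counts, "object type", "relationship"))
```

With real data, each family would have 35 such counters rather than one, and averages per model could be computed in the same way.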

To make a long story really short: Hypothesis A is validated, Hypothesis B is falsified, and Hypothesis C is validated. If the constraints for Hypothesis B are relaxed by including a known transformation between attribute and value type, then it does make it to 87.5%. What was nice to see were the emerging informal profiles of what a typical diagram looks like in a language family: ER and EER and ORM and ORM2 are much more relationship-oriented than UML Class Diagrams, and there are more hierarchies in the latter. Intuitively, we already kind of knew that, but it’s always better to have that backed up by data.
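To illustrate how the check for Hypothesis B and its relaxed variant might be computed, here is a small Python sketch. The set of ‘shared’ kinds, the toy counts, and the function name are assumptions made for the example; the resulting numbers are not the paper’s figures.

```python
from collections import Counter


def shared_percentage(counts, shared_kinds, also_shared=frozenset()):
    """Percentage of counted elements whose kind is in the shared set.

    `also_shared` holds extra kinds to accept when the constraint is
    relaxed, e.g. treating attribute and value type as interchangeable.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    accepted = set(shared_kinds) | set(also_shared)
    hits = sum(n for kind, n in counts.items() if kind in accepted)
    return 100.0 * hits / total


# Hypothetical counts pooled over all models (invented numbers).
toy_counts = Counter({"object type": 30, "relationship": 20,
                      "attribute": 25, "value type": 15,
                      "weak object type": 10})

# Assumed for illustration: kinds taken to occur in all three families.
SHARED = {"object type", "relationship"}
# Kinds additionally accepted under the attribute/value type transformation.
RELAXED = {"attribute", "value type"}

strict = shared_percentage(toy_counts, SHARED)
relaxed = shared_percentage(toy_counts, SHARED, RELAXED)
print(strict, relaxed, strict >= 80.0, relaxed >= 80.0)
```

In this toy setting the strict share falls below the 80% threshold and the relaxed one exceeds it, which mirrors the shape of the result reported above without reproducing its values.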

There are lots of other noteworthy things, like the large number of aggregations in UML (yay) and of weak entity types in ER and EER, and, from an automated reasoning viewpoint, the ‘bummer’ of so few disjointness and completeness assertions in all of the models. More of that can be found in the paper (including a top-5 and a bottom-5 for each family), and you may want to attend the ER conference so we can discuss it in person.


[1] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in CFDInc∀-. 28th International Workshop on Description Logics (DL’15). Calvanese, D., and Konev, B. (Eds.), CEUR-WS vol. 1350, pp. 401-414. 7-10 June 2015, Athens, Greece.

[2] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (accepted)

[3] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[4] Fillottrani, P.R., Keet, C.M. KF metamodel formalization. Technical Report, Arxiv.org 1412.6545. Dec 19, 2014. 26p.