Experimentally-motivated non-trivial intermodel links between conceptual models

I am well aware that some people prefer Agile and mash-ups and such to quickly, scuffily, put an app together, but putting a robust, efficient, lasting, application together does require a bit of planning—analysis and design in the software development process. For instance, it helps to formalise one’s business rules or requirements, or at least structure them with, say, SBVR or ORM, so as to check that the rules obtained from the various stakeholders do not contradict themselves cf. running into problems when testing down the line after having implemented it during a testing phase. Or analyse a bit upfront which classes are needed in the front-end application layer cf. perpetual re-coding to fix one’s mistakes (under the banner ‘refactoring’, as if naming the process gives it an air of respectability), and create, say, a UML diagram or two. Or generating a well-designed database based on an EER model.

Each of these three components can be done in isolation, but how to do this for complex system development where the object-oriented application layer hast to interact with the database back-end, all the while ensuring that the business rules are still adhered to? Or you had those components already, but they need to be integrated? One could link the code to tables in the implementation layer, on an ad hoc basis, and figure it out again and again for any combination of languages and systems. Another one is to do that at the conceptual modelling layer irrespective of the implementation language. The latter approach is reusable (cf. reinventing the mapping wheel time and again), and at a level of abstraction that is easier to cope with for more people, and even more so if the system is really large. So, we went after that option for the past few years and have just added another step to realising all this: how to link which elements in the different models for the system.

It is not difficult to imagine a tool where one can have several windows open, each with a model in some conceptual modelling language—many CASE tools already support modelling in different notations anyway. It is conceptually also fairly straightforward when in, say, the UML model there is a class ‘Employee’ and in the ER diagram there’s an ‘Employee’ entity type: it probably will work out to align these classes. Implementing just this is a bit of an arduous engineering task, but doable. In fact, there is such a tool for models represented in the same language, where the links can be subsumption, equivalence, or disjointness between classes or between relationships: ICOM [2]. But we need something like that to work across modelling languages as well, and for attributes, too. In the hand-waiving abstract sense, this may be intuitively trivial, but the gory details of the conceptual and syntax aspects are far from it. For instance, what should a modeller do if one model has ‘Address’ as an attribute and the other model has it represented as a class? Link the two despite being different types of constructs in the respective languages? Or that ever-recurring example of modelling marriage: a class ‘Marriage’ with (at least) two participants, or ‘Marriage’ as a recursive relationship (UML association) of a ‘Person’ class? What to do if a modeller in one model had chosen the former option and another modeller the latter? Can they be linked up somehow nonetheless, or would one have to waste a lot of time redesigning the other model?

Instead of analysing this for each case, we sought to find a generic solution to it; with we being: Zubeida Khan, Pablo Fillottrani, Karina Cenci, and I. The solution we propose will appear soon in the proceedings of the 20th Conference on Advances in DataBases and Information Systems (ADBIS’16) that will be held at the end of this month in Prague.

So, what did we do? First, we tried to narrow down the possible links between elements in the models: in theory, one might want to try to link anything to anything, but we already knew some model elements are incompatible, and we were hoping that others wouldn’t be needed yet other suspected to be needed, so that a particular useful subset could be the focus. To determine that, we analysed a set of ICOM projects created by students at the Universidad Nacionál del Sur (in Bahía Blanca), and we created model integration scenarios based on publicly available conceptual models of several subject domains, such as hospitals, airlines, and so on, including EER diagrams, UML class diagrams, and ORM models. An example of an integration scenario is shown in the figure below: two conceptual models about airline companies, with on the left the ER diagram and on the right the UML diagram.

One of the integration scenarios [1]

One of the integration scenarios [1]

The solid purple links are straightforward 1:1 mappings; e.g., er:Airlines = uml:Airline. Long-dashed dashed lines represent ‘half links’ that are semantically very similar, such as er:Flight.Arr_time ≈ uml:Flight.arrival_time, where the idea of attribute is the same, but ER attributes don’t have a datatype specified whereas UML attributes do. The red short-dashed dashed lines require some transformation: e.g., er:Airplane.Type is an attribute yet uml:Aircraft is a class, and er:Airport.code is an identifier (with its mandatory 1:1 constraint, still no datatype) but uml:Airport.ID is just a simple attribute. Overall, we had 40 models with 33 schema matchings, with 25 links in the ICOM projects and 258 links in the integration scenarios. The detailed aggregates are described in the paper and the dataset is available for reuse (7MB). Unsurprisingly, there were more attribute links than class links (if a class can be linked, then typically also some of its attributes). There were 64 ‘half’ links and 48 transformation links, notably on the slightly compatible attributes, attributes vs. identifiers, attribute<->value type, and attribute<->class.

Armed with these insights from the experiment, a general intermodel link validation approach [3] that uses the unified metamodel [4], and which type of elements occur mostly in conceptual models with their logic-based profiles [5,6], we set out to define those half links and transformation links. While this could have been done with a logic of choice, we opted for a clear step toward implementability by exploiting the ATLAS Transformation Language (ATL) [7] to specify the transformations. As there’s always a source model and a target model in ATL, we constructed the mappings such that both models in question are the ‘source’ and both are mapped into a new, single, ‘target’ model that still adheres to the constraints imposed by the unifying metamodel. A graphical depiction of the idea is shown in the figure below; see paper for details of the mapping rules (they don’t look nice in a blog post).

Informal, graphical rendering of the rule AttributeObject Type output [1]

Informal, graphical rendering of the rule Attribute<->Object Type output [1]

Someone into this matter might think, based on this high-level description, there’s nothing new here. However, there is, and the paper has an extensive related works section. For instance, there’s related work on Distributed Description Logics with bridge rules [8], but they don’t do attributes and the logics used for that doesn’t fit well with the features needed for conceptual modelling, so it cannot be adopted without substantial modifications. Triple Graph Grammars look quite interesting [9] for this sort of task, as does DOL [10], but either would require another year or two to figure it out (feel free to go ahead already). On the practical side, e.g., the Eclipse metamodel of the popular Eclipse Modeling Framework didn’t have enough in the metamodel for what needs to be included, both regarding types of entities and the constraints that would need to be enforced. And so on, such that by a process of elimination, we ended up with ATL.

It would be good to come up with those logic-based linking options and proofs of correctness of the transformation rules presented in the paper, but in the meantime, an architecture design of the new tool was laid out in [11], which is in the stage of implementation as I write this. For now, at least a step has been taken from the three years of mostly theory and some experimentation toward implementation of all that. To be continued J.

 

References

[1] Khan, Z.C., Keet, C.M., Fillottrani, P.R., Cenci, K.M. Experimentally motivated transformations for intermodel links between conceptual models. 20th Conference on Advances in Databases and Information Systems (ADBIS’16). Springer LNCS. August 28-31, Prague, Czech Republic. (in print)

[2] Fillottrani, P.R., Franconi, E., Tessaris, S. The ICOM 3.0 intelligent conceptual modelling tool and methodology. Semantic Web Journal, 2012, 3(3): 293-306.

[3] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer Lecture Notes in Computer Science LNCS vol. 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

[4] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering, 2015, 98:30-53.

[5] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Johannesson, P., Lee, M.L. Liddle, S.W., Opdahl, A.L., Pastor López, O. (Eds.). Springer LNCS vol 9381, 585-593. 19-22 Oct, Stockholm, Sweden.

[6] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Morzy et al. (Eds.). Springer LNCS vol. 9282, 215-229. Poitiers, France, Sept 8-11, 2015.

[7] Jouault, F. Allilaire, F. Bzivin, J. Kurtev, I. ATL: a model transformation tool. Science of Computer Programming, 2008, 72(12):31-39.

[8] Ghidini, C., Serafini, L., Tessaris, S., Complexity of reasoning with expressive ontology mappings. Formal ontology in Information Systems (FOIS’08). IOS Press, FAIA vol. 183, 151-163.

[9] Golas, U., Ehrig, H., Hermann, F. Formal specification of model transformations by triple graph grammars with application conditions. Electronic Communications of the ESSAT, 2011, 39: 26.

[10] Mossakowsi, T., Codescu, M., Lange, C. The distributed ontology, modeling and specification language. Proceedings of the Workshop on Modular Ontologies 2013 (WoMo’13). CEUR-WS vol 1081. Corunna, Spain, September 15, 2013.

[11] Fillottrani, P.R., Keet, C.M. A Design for Coordinated and Logics-mediated Conceptual Modelling. 29th International Workshop on Description Logics (DL’16). Peñaloza, R. and Lenzerini, M. (Eds.). CEUR-WS Vol. 1577. April 22-25, Cape Town, South Africa. (abstract)

Advertisements

Reblogging 2009: Building bias into your database

From the “10 years of keetblog – reblogging: 2009”: The tl;dr of it: bad data management -> bad policy decisions, and how you can embed political preferences and prejudices in a conceptual data model.

While the post has a computing flavor to it especially on the database design and a touch of ontologies, it is surely also of general interest, because it gives some insight into the management of data that is used for policy-making in and for conflict zones. A nicer version of this blog post and the one after that made it into a paper-review article “Dirty wars, databases, and indices” in the Peace & Conflict Review journal (Fall 2009 issue) of the UN-mandated University for Peace in Costa Rica.

Building bias into your database; Jan 7, 2009

 p.s.: while I intended to write a post on attending the ER’15 conferences, the exciting times with the student protests in South Africa put that plan on the backburner for a few more days at least.

—–

For developing bio-ontologies, if one follows Barry Smith and cs., then one is solely concerned with the representation of reality; moreover, it has been noted that ontologies can, or should be, seen as a representation of a scientific theory [1] or at least that they are an important part of doing science [2]. In that case, life is easy, not hard, for we have the established method of scientific inquiry to settle disputes (among others, by doing additional lab experiments to figure out more about reality). Domain- and application ontologies, as well as conceptual data models, for the enterprise universe of discourse require, at times, a consensus-based approach where some parts of the represented information are the outcome of negotiations and agreements among the stakeholders.

Going one step further on the sliding scale: for databases and application software for the humanities, and conflict databases in particular, one makes an ontology or conceptual data model conforming to one’s own (or the funding organisation’s) political convictions and with the desired conclusions in mind. Building data vaults seems to be the intended norm rather than the exception, hence, maintenance and usage and data analysis beyond the developers limited intentions, let alone integration, are a nightmare.

 In this post, I will outline some suggestions for building your own politicized representation—be it an ontology or conceptual data model—for armed conflict data, such as terrorist incidents, civil war, and inter-state war. I will discuss in the next post a few examples of conflict data analysis, both regarding extant databases and the ‘dirty war index’ application built on top of them. A later post may deal with a solution to the problems, but for now, it would already be a great help not to adhere to the tips below.

Tips for biasing the representation

In random order, you could do any of the following to pollute the model and hamper data analysis so as to ensure your data is scientifically unreliable but suitable to serve your political agenda.

1. Have a fairly flat taxonomy of types of parties; in fact, just two subtypes suffice: US and THEM, although one could subtype the latter into ‘they’, ‘with them’, and ‘for them’. The analogue, with ‘we’, ‘with us’, and ‘for us’ is too risky for potential of contagion of responsibility of atrocities and therefore not advisable to include; if you want to record any of it, then it is better to introduce types such as ‘unknown perpetrator’ or ‘not officially claimed event’ or ‘independent actor’.

2. Aggregate creatively. For instance, if some of the funding for your database comes from a building construction or civil engineering company, refine that section of target types, or include new target types only when you feel like it is targeted sufficiently often by the opponent to warrant a whole new tuple or table from then onwards. Likewise, some funding agencies would like to see a more detailed breakdown of types of victims by types of violence, some don’t. Last, be careful with the typology of arms used, in particular when your country is producing them; a category like ‘DIY explosive device’ helps masking the producer.

3. Under-/over-represent geography. Play with granularity (by city/village, region, country, continent) and categorization criteria (state borders, language, former chiefdoms, parishes, and so forth), e.g., include (or not) notions such as ‘occupied territory’ (related to the actors) and `liberated region’ or `autonomous zone’, or that an area may, or may not, be categorized or named differently at the same time. Above all, make the modelling decisions in an inconsistent way, so that no single dimension can be analysed properly.

4. Make an a-temporal model and pretend not to change it, but (a) allow non-traceable object migration so that defecting parties who used to be with US (see point 1) can be safely re-categorised as THEM, and (b) refine the hierarchy over time anyway so as to generate time-inconsistency for target types (see point 2) and geography (see point 3), in order to avoid time series analyses and prevent discovering possible patterns.

5. Have a minimal amount of classes for bibliographic information, lest someone would want to verify the primary/secondary sources that report on numbers of casualties and discovers you only included media reports from the government-censored newspapers (or the proxy-funding agency, or the rebel radio station, or the guerrilla pamphlets).

6. Keep natural language definitions for key concepts in a separate file, if recorded at all. This allows for time-inconsistency in operational definitions as well as ignorance of the data entry clerks so that each one can have his own ideas about where in the database the conflict data should go.

7. Minimize the use of database integrity constraints, hence, minimize representing constraints in the ontology to begin with, hence, use a very simple modelling language so you can blame the language for not representing the subject domain adequately.

I’m not saying all conflict databases use all of these tricks; but some use at least most of them, which ruins credibility of those database of which the analysts actually did try to avoid these pitfalls (assuming there are such databases, that is). Optimism wants me to believe developers did not think of all those issues when designing the database. However, there is a tendency that each conflict researcher compiles his own data set and that each database is built from scratch.

For the current scope, I will set aside the problems with data collection and how to arrive at guesstimated semi-reliable approximations of deaths, severe injuries, rape, torture victims and so forth (see e.g. [3] and appendix B of [4]). Inherent problems with data collection is one thing and difficult to fix, bad modelling and dubious or partial data analysis is a whole different thing and doable to fix. I elaborate on latter claim in the next post.

References

[1] Barry Smith. Ontology (Science). In: C. Eschenbach and M. Gruninger (eds.), Formal Ontology in Information Systems. Proceedings of FOIS 2008. preprint

[2] Keet, C.M. Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS’05), Ludaescher, B, Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics LNBI 3615, Springer Verlag, 2005. pp46-62.

[3] Taback N (2008 ) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[4] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press. 402p.

Fruitful ADBIS’15 in Poitiers

The 19th Conference on Advances in Databases and Information Systems (ADBIS’15) just finished yesterday. It was an enjoyable and well-organised conference in the lovely town of Poitiers, France. Thanks to the general chair, Ladjel Bellatreche, and the participants I had the pleasure to meet up with, listen to, and receive feedback from. The remainder of this post mainly recaps the keynotes and some of the presentations.

 

Keynotes

The conference featured two keynotes, one by Serge Abiteboul and on by Jens Dittrich, both distinguished scientists in databases. Abiteboul presented the multi-year project on Webdamlog that ended up as a ‘personal information management system’, which is a simple term that hides the complexity that happens behind the scenes. (PIMS is informally explained here). It breaks with the paradigm of centralised text (e.g., Facebook) to distributed knowledge. To achieve that, one has to analyse what’s happening and construct the knowledge from that, exchange knowledge, and reason and infer knowledge. This requires distributed reasoning, exchanging facts and rules, and taking care of access control. It is being realised with a datalog-style language but that then also can handle a non-local knowledge base. That is, there’s both solid theory and implementation (going by the presentation; I haven’t had time to check it out).

The main part of the cool keynote talk by Dittrich was on ‘the case for small data management’. From the who-wants-to-be-a-millionaire style popquiz question asking us to guess the typical size of a web database, it appeared to be only in the MBs (which most of us overestimated), and sort of explains why MySQL [that doesn’t scale well] is used rather widely. This results in a mismatch between problem size and tools. Another popquiz question answer: the 100MB RDF can just as well be handled efficiently by python, apparently. Interesting factoids, and one that has/should have as consequence we should be looking perhaps more into ‘small data’. He presented his work on PDbF as an example of that small data management. Very briefly, and based on my scribbles from the talk: its an enhanced pdf where you can access the raw data behind the graphs in the paper as well (it is embedded in it, with OLAP engine for posing the same and other queries), has a html rendering so you can hover over the graphs, and some more visualisation. If there’s software associated with the paper, it can go into the whole thing as well. Overall, that makes the data dynamic, manageable, traceable (from figure back to raw data), and re-analysable. The last part of his talk was on his experiences with the flipped classroom (more here; in German), but that was not nearly as fun as his analysis and criticism of the “big data” hype. I can’t recall exactly his plain English terms for the “four V4”, but the ‘lots of crappy XML data that changes’ remained of it in my memory bank (it was similar to the first 5 minutes of another keynote talk he gave).

 

Sessions

Sure, despite the notes on big data, there were presentations in the sessions that could be categorised under ‘big data’. Among others, Ajantha Dahanayake presented a paper on a proposal for requirements engineering for big data [1]. Big data people tend to assume it is just there already for them to play with. But how did it get there, how to collect good data? The presentation outlined a scenario-based backwards analysis, so that one can reduce unnecessary or garbage data collection. Dahanayake also has a tool for it. Besides the requirements analysis for big data, there’s also querying the data and the desire to optimize it so as to keep having fast responses despite its large size. A solution to that was presented by Reuben Ndindi, whose paper also won the best paper award of the conference [2] (for the Malawians at CS@UCT: yes, the Reuben you know). It was scheduled in the very last session on Friday and my note-taking had grinded to a halt. If my memory serves me well, they make a metric database out of a regular database, compute the distances between the values, and evaluate the query on that, so as to obtain a good approximation of the true answer. There’s both the theoretical foundation and an experimental validation of the approach. In the end, it’s faster.

Data and schema evolution research is alive and well, as were time series and temporal aspects. Due to parallel sessions and my time constraints writing this post, I’ll mention only two on the evolution; one because it was a very good talk, the other because of the results of the experiments. Kai Herrmann presented the CoDEL language for database evolution [3]. A database and the application that uses it change (e.g., adding an attribute, splitting a table), which requires quite lengthy scripts with lots of SQL statements to execute. CoDEL does it with fewer statements, and the language has the good quality of being relationally complete [3]. Lesley Wevers approached the problem from a more practical angle and restricted to online databases. For instance, Wikipedia does make updates to their database schema, but they wouldn’t want to have Wikipedia go offline for that duration. How long does it take for which operation, in which RDBMS, and will it only slow down during the schema update, or block any use of the database entirely? The results obtained with MySQL, PostgreSQL and Oracle are a bit of a mixed bag [4]. It generated a lively debate during the presentation regarding the test set-up, what one would have expected the results to be, and the duration of blocking. There’s some work to do there yet.

The presentation of the paper I co-authored with Pablo Fillottrani [5] (informally described here) was scheduled for that dreaded 9am slot the morning after the social dinner. Notwithstanding, quite a few participants did show up, and they showed interest. The questions and comments had to do with earlier work we used as input (the metamodel), qualifying quality of the conceptual model, and that all too familiar sense of disappointment that so few language features were used widely in publicly available conceptual models (the silver lining of excellent prospects of runtime usage of conceptual models notwithstanding). Why this is so, I don’t know, though I have my guesses.

 

And the other things that make conference useful and fun to go to

In short: Networking, meeting up again with colleagues not seen for a while (ranging from a few months [Robert Wrembel] to some 8 years [Nadeem Iftikhar] and in between [a.o., Martin Rezk, Bernhard Thalheim]), meeting new people, exchanging ideas, and the social events.

2008 was the last time I’d been in France, for EMMSAD’08, where, looking back now, I coincidentally presented a paper also on conceptual modelling languages and logic [6], but one that looked at comprehensive feature coverage and comparing languages rather than unifying. It was good to be back in France, and it was nice to realise my understanding and speaking skills in French aren’t as rusty as I thought they were. The travels from South Africa are rather long, but definitely worthwhile. And it gives me time to write blog posts killing time on the airport.

 

References

(note: most papers don’t show up at Google scholar yet, hence, no links; they are on the Springer website, though)

[1] Noufa Al-Najran and Ajantha Dahanayake. A Requirements Specification Framework for Big Data Collection and Capture. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, .

[2] Boris Cule, Floris Geerts and Reuben Ndindi. Space-bounded query approximation. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 397-414.

[3] Kai Herrmann, Hannes Voigt, Andreas Behrend and Wolfgang Lehner. CoDEL – A Relationally Complete Language for Database Evolution. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 63-76.

[4] Lesley Wevers, Matthijs Hofstra, Menno Tammens, Marieke Huisman and Maurice van Keulen. Analysis of the Blocking Behaviour of Schema Transformations in Relational Database Systems. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 169-183.

[5] Pablo R. Fillottrani and C. Maria Keet. Evidence-based Languages for Conceptual Data Modelling Profiles. ADBIS’15. Morzy et al. (Eds.). Springer LNCS vol. 9282, 215-229.

[6] C. Maria Keet. A formal comparison of conceptual data modeling languages. EMMSAD’08. CEUR-WS Vol-337, 25-39.

The ontology-driven unifying metamodel of UML class diagrams, ER, EER, ORM, and ORM2

Metamodelling of conceptual data modelling languages is nothing new, and one may wonder why one would need yet another one. But you do, if you want to develop complex systems or integrate various legacy sources (which South Africa is going to invest more money in) and automate at least some parts of it. For instance: you want to link up the business rules modelled in ORM, the EER diagram of the database, and the UML class diagram that was developed for the application layer. Are the, say, Student entity types across the models really the same kind of thing? And UML’s attribute StudentID vs. the one in the EER diagram? Or EER’s EmployeesDependent weak entity type with the ORM business rule that states that “each dependent of an employee is identified by EmployeeID an the Dependent’s Name?

Ascertaining the correctness of such inter-model assertions in different languages does not require a comparison and contrast of their differences, but a way to harmonise or unify them. Some such models already exist, but they take subsets of the languages, whereas all those features do appear in actual models [1] (described here informally). Our metamodel, in contrast, aims to capture all constructs of the aforementioned languages and the constraints that hold between them, and generalize in an ontology-driven way so that the integrated metamodel subsumes the structural, static elements of them (i.e., the integrated metamodel has as them as fragments). Besides some updates to the earlier metamodel fragment presented in [2,3], the current version [4,5] also includes the metamodel fragment of their constraints (though omits temporal aspects and derived constraints). The metamodel and its explanation can be found in the paper in An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2 [4] that I co-authored with Pablo Fillottrani, and which was recently accepted in Data & Knowledge Engineering.

Methodologically, the unifying metamodel presented in An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2 [4], is ontological rather than formal (cf. all other known works). On that ‘ontology-driven approach’, here is meant the use of insights from Ontology (philosophy) and ontologies (in computing) to enhance the quality of a conceptual data model and obtain that ‘glue stuff’ to unify the metamodels of the languages. The DKE paper describes all that, such as: on the nature of the UML association/ORM fact type (different wording, same ontological commitment), attributes with and without data types, the plethora of identification constraints (weak entity types, reference modes, etc.), where can one reuse an ‘attribute’ if at all, and more. The main benefit of this approach is being able to cope with the larger amount of elements that are present in those languages, and it shows that, in the details, the overlap in features across the languages is rather small: 4 among the set of 23 types of relationship, role, and entity type are essentially the same across the languages (see figure below), and 6 of the 49 types of constraints. The metamodel is stable for the modelling languages covered. It is represented in UML for ease of communication, but, as mentioned earlier, it also has been formalised in the meantime [5].

Types of elements in the languages; black-shaded: entity is present in all three language families (UML, EER, ORM); darg grey: on two of the three; light grey: in one; while-filled: in none, but we added it to glue things together. (Source: [6])

Types of elements in the languages; black-shaded: entity is present in all three language families (UML, EER, ORM); dark grey: on two of the three; light grey: in one; while-filled: in none, but we added the more general entities to ‘glue’ things together. (Source: [4])

Metamodel fragment with some constraints among some of the entities. (Source [4])

Metamodel fragment with some constraints among some of the entities. (Source [4])

The DKE paper also puts it in a broader context with examples, model analyses using the harmonised terminology, and a use case scenario that demonstrates the usefulness of the metamodel for inter-model assertions.

While the 24-page paper is rather comprehensive, research results wouldn’t live up to it if it didn’t uncover new questions. Some of them have been, and are being, answered in the meantime, such as its use for classifying models and comparing their characteristics [1,6] (blogged about here and here) and a rule-based approach to validating inter-model assertions [7] (informally here). Although the 3-year funded project on the Ontology-driven unification of conceptual data modelling languages—which surely contributed to realising this paper—just finished officially, we’re not done yet, or: more is in the pipeline. To be continued…

 

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in press)

[2] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[3] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[4] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering. 2015. DOI: 10.1016/j.datak.2015.07.004. (in press)

[5] Fillottrani, P.R., Keet, C.M. KF metamodel Formalization. Technical Report, Arxiv.org http://arxiv.org/abs/1412.6545. Dec 19, 2014. 26p.

[6] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in press)

[7] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

FAIR’14 and ‘modelling relationships’ tutorial

After a weekend of ‘loadshedding’ (one of those South African euphemisms) I’m posting a few notes on the Forum on Artificial Intelligence Research 2014 (FAIR’14) that took place from 3-5 Dec 2014 at Stellenbosch University, which was organised by CAIR and co-located with the FASTAR/Espresso Workshop 2014, which, in turn, was co-located with PRASA, AFLaT, and RobMech 2014 in Cape Town. FAIR’14 consisted of a presentation by Sergei Obiedkov of the Higher School of Economics, Russia, a tutorial on modelling relationships in ontologies by me, and a course on computational social choice theory by Ulle Endriss from the ILLC, University of Amsterdam, The Netherlands.

While not quite relevant to my current research except for judgement aggregation at the end (for crowdsourcing), Ulle’s course was one of those events that made me think “[why didn’t/if only] I was exposed to this material before?!”, when I had to make choices as to what to study and specialise in (though, admitted, once knowing about the math with game theory and applying that to peace negotiations in my MA pdf, I still went on in CS with KR&R and ontologies). Ulle’s course combined socially relevant topics, such as the fair allocation of resources and voting systems, with solid, precise, logic- and math-based representations and computation. Besides the engaging content, he’s also good at teaching it. The content and slides are a condensed version of his MSc course on social choice theory and are available online here, which also has links to related reading material.

I tried to condense into 2 hours some aspects of modelling relationships in ontologies. It started with some problems and questions, proceeded to touching upon the nature of relations and some detail of the formal semantics, common relationships (with some detail about mereotopology), and closing with some practical modelling guidance and reasoner performance when modelling it one way or another. It being a tutorial, and not all participants had Protégé installed, I resorted to a peer instruction audience response system to incorporate interactively some questions about modelling some relationships. The slides are available online (though also here the text on the slides only partially reflect what I’ve talked about).

Other than that, there’s always the social component. Despite the weird time-warp that Stellenbosch town constitutes, it was really nice to catch up with former colleagues and to see the progress of postgrads of UKZN, to hear about the future of CAIR, and that it’s a small world even when meeting people new to me. And the food & wine was delicious. The train travel back to Cape Town took a bit longer than the schedule said it ought to be, but I recommend it nevertheless.

A metamodel-driven approach for linking conceptual data models

Interoperability among applications and components of large complex software is still a bit of a nightmare and a with-your-hands-in-the-mud scenario that no-one looks forward to—people look forward to already having linked them up, so they can pose queries across departmental and institutional boundaries, or even across the different data sets within a research unit to advance their data analysis and discover new things.

Sure, ontologies can help with that, but you have to develop one if none is available, and sometimes it’s not even necessary. For instance, you have an ER diagram for the database and a UML model for the business layer. How do you link up those two?

Superficially, this looks easy: an ER entity type matches up with a UML class, and an ER relationship with an UML association. The devil is in the details, however. To name just a few examples: how are you supposed to match a UML qualified association, an ER weak entity type, or an ORM join-subset constraint, to any of the others?

Within the South Africa – Argentina bilateral collaboration project (scope), we set out to solve such things. Although we first planned to ‘simply’ formalize the most common conceptual data modelling languages (ER, UML, and ORM families), we quickly found out we needed not just an ‘inventory’ of terms used in each language matched to one in the other languages, but also under what conditions these entities can be used, hence, we needed a proper metamodel. This we published last year at ER’13 and MEDI’13 [1,2], which I blogged about last year. In the meantime, we not only have finalized the metamodel for the constraints, but also formalized the metamodel, and a journal article describing all this is close to being submitted.

But a metamodel alone doesn’t link up the conceptual data models. To achieve that, we, Pablo Follottrani and I, devised a metamodel-driven approach for conceptual model interoperability, which uses a formalised metamodel with a set of modular rules to mediate the linking and transformation of elements in the conceptual models represented in different languages. This also simplifies the verification of inter-model assertions and model conversion. Its description has recently been accepted as a full paper at the 8th International Web Rule Symposium 2014 (RuleML’14) [3], which I’ll present in Prague on 18 August.

To be able to assert a link between two entities in different models and evaluate automatically (or at least: systematically) whether it is a valid assertion and what it entails, you have to know i) what type of entities they are, ii) whether they are the same, and if not, whether one can be transformed into the other for that particular selection. So, to be able to have those valid inter-model assertions, an approach is needed for transforming one or more elements of a model in one language into another. The general idea of that is shown in the following picture, and explained briefly afterward.

Overview of the approach to transform a model represented in language A to one in language B, illustrated with some sample data from UML to ORM2 (Fig 1 in [3])

Overview of the approach to transform a model represented in language A to one in language B, illustrated with some sample data from UML to ORM2 (Fig 1 in [3])

We have three input items (top of the figure, with the ovals), then a set of algorithms and rules (on the right), and two outputs (bottom, green). The conceptual model is provided by the user, the formalized metamodel is available and a selection of it is included in the RuleML’14 paper [3], and the “vocabulary containing a terminology comparison” was published in ER’13 [1]. Our RuleML paper [3] zooms in on those rules for the algorithms, availing of the formalized metamodel and vocabulary. To give a taste of that (more below): the algorithm has to know that a UML class in the diagram can be mapped 1:1 to a ORM entity type, and that there is some rule or set of rules to transform a UML attribute to an ORM value type.

This is also can be used for the inter-model assertions, albeit in a slightly modified way overall, which is depicted below. Here we use not only the formalised metamodel and the algorithms, but also which entities have 1:1 mappings, which are equivalent but need several steps (called transformations), and which once can only be approximated (requiring user input), and it can be run in both directions from one fragment to the other (one direction is chosen arbitrarily).

Overview for checking the inter-model assertions, and some sample data, checking whether the UML Flower is the same as the ORM Flower (Fig. 2 in [3]).

Overview for checking the inter-model assertions, and some sample data, checking whether the UML Flower is the same as the ORM Flower (Fig. 2 in [3]).

The rules themselves are not directly from one entity in one model to another entity in another, as that would become too messy, isn’t really scalable, and would have lots of repetitions. We use the more efficient way of declaring rules for mapping a conceptual data model entity into its corresponding entity in the metamodel, do any mapping, transformation, or approximation there in the metamodel, and then map it into the matching entity in the other conceptual data model. The rules for the main entities are described in the paper: those for object types, relationships, roles, attributes, and value types, and how one can use those to build up more complex ones for validation of intermodal assertions.

This metamodel-mediated approach to the mappings is one nice advantage of having a metamodel, but one possibly could have gotten away with just having an ‘inventory’ of the entities, not all the extra effort with a full metamodel. But there are benefits to having that metamodel, in particular when actually validating mappings: it can drive the validation of mappings and the generation of model transformations thanks to the constraints declared in the metamodel. How this can work, is illustrated in the following example, showing one example of how the “process mapping assertions using the transformation algorithms” in the centre-part of Fig. 2, above, works out.

Example. Take i) two models, let’s call them Ma and Mb, ii) an inter-model assertion, e.g., a UML association Ra and an ORM fact type Rb, ii) the look-up list with the mappings, transformations, approximations, and the non-mappable elements, and iii) the formalised metamodel. Then the model elements of Ma and Mb are classified in terms of the metamodel, so that the mapping validation process can start. Let us illustrate that for some Ra to Rb (or vv.) mapping of two relationships.

  1. For the vocabulary table, we see that UML association and ORM fact type correspond to Relationship in the metamodel, and enjoy a 1:1 mapping. The ruleset that will be commenced with are R1 from UML to the metamodel and 2R to ORM’s fact type (see rules in the paper).
  2. The R1 and 2R rules refer to Role and Object type in the metamodel. Now things become interesting. The metamodel has represented that each Relationship has at least two Roles, which there are, and each one causes the role-rules to be evaluated, with Ro1 of Ra’s two association ends into the metamodel’s Role and 2Ro to ORM’s roles (‘2Ro’ etc. are the abbreviations of the rules; see paper [3] for details).
  3. The metamodel asserts that Role must participate in the rolePlaying relationship and thus that it has a participating Object type (possibly a subtype thereof) and, optionally, a Cardinality constraint. Luckily, they have 1:1 mappings.
  4. This, in turn causes the rules for classes to be evaluated. From the classes, we see in the metamodel that each Object type must have at least one Identification constraint that involves one or more attributes or value types (which one it is has, been determined by the original classification). This also then has to be mapped using the rules specified.

This whole sequence was set in motion thanks to the mandatory constraints in the metamodel, having gone Relationship to Role to Object type to Single identification (that, in turn, consults Attribute and Datatype for the UML to ORM example here). The ‘chain reaction’ becomes longer with more elaborate participating entities, such as a Nested object type.

Overall, the whole orchestration is no trivial matter, requiring all those inputs, and it won’t be implemented in one codefest on a single rainy Sunday afternoon. Nevertheless, the prospect for semantically good (correct) inter-model assertions across conceptual data modeling languages and automated validation thereof is now certainly a step closer to becoming a reality.

References

[1] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[2] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[3] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

Book chapter on conceptual data modeling for biology published

Just a quick note that my book chapter on “Ontology-driven formal conceptual data modeling for biological data analysis” finally has been published in the Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data (edited by Mourad Elloumi and Albert Y. Zomaya). A summary of the chapter’s contents is described in an earlier blog post from little over two years ago, and I’ve put the preprint online.

The whole book is an impressive 1192 pages consisting of 48 chapters of about 25 pages each, which are grouped into three main sections. The first section, Biological data pre-processing, has four parts: biological data management, biological data modeling (which includes my chapter), biological feature extraction, and biological feature selection. The second section, biological data mining, has six parts: Regression Analysis of Biological Data, Biological Data Clustering, Biological Data Classification, Association Rules Learning from Biological Data, Text Mining and Application to Biological Data, and High-Performance Computing for Biological Data Mining. The third section, biological data post-processing, has only one part: biological knowledge integration and visualization. (check the detailed table of contents). Happy reading!