Systematic design of conceptual modelling languages

What would your ideal modelling language look like if you were to design one yourself? How would you go about defining your own language? The act of creating your own pet language can be viewed as a design process, and processes can be structured. That wasn’t the first thing we wanted to address when my collaborator Pablo Fillottrani and I were trying to design evidence-based conceptual data modelling languages. Yet all those other conceptual modelling languages out there did not sprout from a tree; people designed them, albeit mostly not in a systematic way. We wanted to design ours in a systematic, repeatable, and justified way.

More broadly, modelling is growing up as a field of specialisation, and some even claim that it deserves to be its own discipline [CabotVallecillo22]. Surely someone must have thought of this notion of language design processes before? To a limited extent, yes. There are a few logicians who have thought about procedures and have used a procedure or part thereof. Two notable examples are OWL and DOL: for both, requirements were specified, goals were formulated, and the language was then designed. OWL was also assessed on usage, and ‘lessons learned’ were extracted from that to add one round of improvements, which resulted in OWL 2.

But what would a systematic procedure look like? Ulrich Frank devised a waterfall methodology for domain-specific languages [Frank13], which are a bit different from conceptual data modelling languages. Pablo and I modified that to make it work for designing ontology languages. Its details, with a focus on the additional ‘ontological analysis’ step, are described in our FOIS2020 paper [FillottraniKeet20], and I wrote a blog post about that before. The modified procedure also includes the option to iterate over the steps, it has optional steps, and there is that ontological analysis, where deciding on certain elements entails philosophical choices for one theory or another. We tweaked it further so that it would also work for conceptual data modelling language design, which was published in a journal article on the design of a set of evidence-based conceptual data modelling languages [FillottraniKeet21] in late 2021, but which I hadn’t gotten around to writing a blog post about yet. Let me summarise the steps visually in the figure below.

Overview of a procedure for conceptual modelling and ontology language design (coloured in from [FillottraniKeet21])

For marketing purposes, I probably should come up with an easily pronounceable name for the proposed procedure, like MeCModeL (Methodology for the Creation of Modelling Languages) or something. We’re open to suggestions. Be that as it may, let’s briefly summarise each step in the remainder of this post.

Step 1. Clarification of scope and purpose

We first need to clarify the scope, purpose, expected benefits, and possible long-term perspective, and consider the feasibility given the resources available. For instance, if you were to want to design a new conceptual data modelling language tailored to temporal model-based data access, and to surpass UML class diagrams, it’s unlikely to work. For one, the Object Management Group has more resources both in the short and in the long term to promote and sustain UML. Second, reasoning over temporal constraints is computationally expensive, so it won’t scale to access large amounts of data. We’re halted in our tracks already. Let’s try this again. What about a new temporal UML that has a logic-based reconstruction for precision? Its purpose is to model more of the subject domain more precisely. The expected benefits would be better quality models, because they are more precise, and thus better quality applications. A long-term perspective does not apply, as it’s just a use case scenario here. Regarding feasibility, let’s assume we do have the competencies, people, and funding to develop the language and tool, and to carry out the evaluation.

Step 2. Analysis of general requirements

The “analysis of general requirements” step can be divided into three tasks, carried out in parallel or sequentially: determining the requirements for modelling (and possibly the associated automated reasoning over the models), devising use case scenarios, and assigning priorities to each. An example of a requirement is the ability to represent change in the data and to keep track of it, such as the successive stages in signing computational legal contracts. Devising a list of requirements out of the blue is nontrivial, but there are a few libraries of possible requirements out there that can help with picking and choosing. There is no such library for conceptual modelling languages yet, but we created a preliminary library of features for ontology languages that may be of use.

Use cases can vary widely, depending on the scope, purpose, and requirements of the language aimed for. With respect to the requirements, use cases can be described as the kind of things you want to be able to represent in the prospective language. For instance, that employee Jane, as Product Manager, may change her job in the company to Area Manager, or that she’s repeatedly assigned to a project for a specified duration. The former is an example of object migration and the latter of a ternary relationship or a binary one with an attribute. An end user stakeholder bringing up these examples may not know that, but a language designer needs to recognise the language feature(s) needed for it. Another type of use case may be about how a modeller would interact with the language and the prospective modelling tool.
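
To make those two features concrete, here is a minimal Python sketch of what the Jane examples ask of a language. Everything in it is a made-up illustration: the class names, the years used as timepoints, the project name, and the helper `migrations` are assumptions for this post, not part of any of the languages discussed.

```python
from dataclasses import dataclass

# Hypothetical data: which class Jane is an instance of at each timepoint (year).
role_of_jane = {2021: "ProductManager", 2022: "ProductManager", 2023: "AreaManager"}


def migrations(history):
    """Return (time, old_class, new_class) for each change of class membership."""
    times = sorted(history)
    return [
        (t2, history[t1], history[t2])
        for t1, t2 in zip(times, times[1:])
        if history[t1] != history[t2]
    ]


# Ternary relationship: an assignment relates an employee, a project, and a duration.
# Equivalently: a binary relationship (employee, project) with an attribute (months).
@dataclass(frozen=True)
class Assignment:
    employee: str
    project: str
    months: int


assignments = [
    Assignment("Jane", "ProjectX", 6),
    Assignment("Jane", "ProjectX", 3),  # repeated assignment to the same project
]

print(migrations(role_of_jane))
```

The point of the sketch is only that the first example needs class membership to be indexed by time, whereas the second needs a relationship with three participants; a language lacking either feature cannot represent the corresponding use case.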

Step 3. Analysis of specific requirements and ontological analysis

Here’s where the ontological commitments are made, even if you don’t want to or think you don’t do so. Even before looking at the temporal aspects, the fact that we committed to UML class diagrams already entails that we committed to, among others, the so-called positionalist commitment of relations and a class-based approach (cf. first order predicate logic, where there are just ordered relations of arity ≥ 1), and we adhere to the most common take on representing temporality, where there are 3-dimensional objects and a separate temporal dimension is added whenever the entity needs it (the other option being 4-dimensionalism). Different views affect how time is included in the language. With the ‘add time to a-temporal’ choice, there are still more decisions to take, like whether time is linear and whether it consists of adjacent successive timepoints (chronons) or whether another point can always be squeezed in between (dense time). These really are ontological differences, even if you have chosen ‘intuitively’ hitherto. There are more such ontological decisions, besides these obvious ones on time and relations, which are described in our FOIS2020 paper. In all but one paper about languages, such choices were left implicit, and time will tell whether making them explicit will be picked up for the design of new languages.
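
The difference between chronons and dense time can be illustrated with a small Python sketch. It assumes, purely for illustration, that discrete timepoints are integers and that dense timepoints are rational numbers; the function names are hypothetical.

```python
from fractions import Fraction


def next_chronon(t: int) -> int:
    """Discrete time: every timepoint has an immediate successor."""
    return t + 1


def point_between(t1: Fraction, t2: Fraction) -> Fraction:
    """Dense time: between any two distinct timepoints there is another one."""
    return (t1 + t2) / 2


# Discrete: no integer timepoint lies strictly between t and its successor.
t = 5
gap_is_empty = not any(t < x < next_chronon(t) for x in range(0, 10))

# Dense: we can keep squeezing in points, here twice in a row.
a, b = Fraction(5), Fraction(6)
m = point_between(a, b)
m2 = point_between(a, m)

print(gap_is_empty, m, m2)
```

Notions such as ‘the next state’ are only well-defined under the first choice, which is one reason why the decision affects which temporal constraints the language can sensibly offer.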

The other sub-step of step 3 is very much to the fore when logic plays a role in the language design. Which elements are going to be in the language, what are they going to look like, how scalable does it have to be, and should it extend existing infrastructure or be something entirely separate from it? For our temporal UML, the answers may be that the atemporal elements are those from UML class diagrams, all the temporal stuff with their icons shall be carried over from the TREND conceptual data modelling language [KeetBerman17], and the underlying logic, DLRus, is not even remotely close to being scalable so there is no existing tool infrastructure. Of course, someone else may make other decisions here.

Step 4. Language specification

Now we’re finally getting down to what from the outside may seem to be the only task: defining the language. There are two key ways of doing it: either define the syntax and the semantics, or make a metamodel for your language. The syntax can be informal-ish, like listing the permissible graphical elements and then a BNF grammar for how they can be used. For logics we can do this more precisely, such as stating that UML’s arrow for class subsumption is a ⇒ in our logic-based reconstruction rather than a →, or whichever symbol you wish. Once the syntax is settled, we need to give it meaning, i.e., define the semantics of the language. For instance, that a rectangle means that it’s a class that can have instances and a line between classes denotes a relationship. Or that that fancy arrow means that if C ⇒ D, then all instances of C are also instances of D in all possible worlds (that is, in the interpretation I of C ⇒ D we have that C^I ⊆ D^I). Since logic is not everyone’s preference, metamodelling to define the language may be a way out; sometimes a language can be defined in its own language, sometimes not (e.g., ORM can be [Halpin04]). For our temporal UML example, we can use the conversions from EER to UML class diagrams (see, e.g., our FaCIL paper with the framework, implementation and the theory it uses), and then also reuse the extant logic-based reconstruction in the DLRus Description Logic.
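
The set-inclusion reading of subsumption can be checked mechanically for one given interpretation. The following Python sketch assumes a toy interpretation with made-up class names and instances; `satisfies_subsumption` is a hypothetical helper, not a function of any reasoner.

```python
# Hypothetical toy interpretation: each class name maps to its set of instances.
interpretation = {
    "Manager": {"jane", "sipho"},
    "Employee": {"jane", "sipho", "ana"},
}


def satisfies_subsumption(interp, sub, sup):
    """True if every instance of `sub` is also an instance of `sup` in `interp`."""
    return interp[sub] <= interp[sup]  # <= on sets is the subset test


print(satisfies_subsumption(interpretation, "Manager", "Employee"))
print(satisfies_subsumption(interpretation, "Employee", "Manager"))
```

Note that this only tests a single interpretation. The ‘in all possible worlds’ part of the semantics means that the inclusion must hold in every model of the theory, which is what an automated reasoner establishes and what a check like this cannot.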

Once all that has been sorted, there’s still the glossary and documentation to write so that potential users and tool developers can figure out what you did. There’s neither a minimum nor a maximum page limit for it. The UML standard is over 700 pages long, DOL is 209 pages, and the CL Standard is 70 pages. Others hide their length by rendering it as a web page and toggling figures and examples; the OWL 2 functional style syntax in A4-sized MS Word amounts to 118 pages in 12-point Times New Roman font, whereas the original syntax and semantics of the underlying logic SROIQ [HorrocksEtAl06], including the key algorithms, is just 11 pages or about 20 reformatted in 12-point single-column A4. The documentation may also need to be revised due to potential infelicities that surface in steps 5-7. For our temporal UML, there will be quite a number of pages.

Step 5. Design of notation for modeller

It may be argued that designing the notation is part of the language specification, but, practically, different stakeholders want different things out of it, especially if your language is more like a programming language or a logic rather than diagrammatic. Depending on your intended audience, graphical or textual notations may be preferred. You’ll need to tweak that additional notation and evaluate it with a representative selection of prospective users on whether the models are easy to understand and to create. To the best of my knowledge, that never happened at the bedrock of any of the popular logics, be it first order predicate logic, Description Logics, or OWL, which may well be a reason why there are so many research papers on providing nicer renderings of them, sugar-coating it either diagrammatically, with a controlled natural language, or a different syntax. OWL 2 has 5 different official syntaxes, even. For our hypothetical temporal UML: since we’re transferring TREND, we may as well do so for the graphical notation and the controlled natural language for it.

Step 6. Development of modelling tool

Create a computer-processable format of it, i.e., a serialisation, which assumes 1) you want to have it implemented and a modelling tool for it and 2) it wasn’t already serialised in step 4. If you don’t want an implementation, this step can be skipped. Creating such a serialisation format, however, will help to get it adopted by more people than just yourself (although it’s by no means a guarantee that it will). There are also other reasons why you may want to create a computer-processable version for the new language, such as sending it to an automated reasoner, automatically checking that a model adheres to the language specifications and highlighting syntax errors, or any other application scenario. Our fictitious temporal UML doesn’t have a computer-processable format and neither does TREND to copy it from, but we ought to because we do want a tool for both.
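
As a rough idea of what such a format and check could look like, here is a minimal Python sketch that serialises a tiny model to JSON and verifies it against a toy specification. The element kinds, the field names, and the `syntax_errors` function are all assumptions made up for this illustration; they are not the vocabulary of UML, TREND, or any existing tool.

```python
import json

# Hypothetical element kinds of a toy language.
ALLOWED_KINDS = {"class", "subsumption", "migration"}

model = {
    "elements": [
        {"kind": "class", "name": "Employee"},
        {"kind": "class", "name": "Manager"},
        {"kind": "subsumption", "sub": "Manager", "sup": "Employee"},
    ]
}


def syntax_errors(model):
    """Return messages for elements that violate the toy specification."""
    errors = []
    declared = {e.get("name") for e in model["elements"] if e.get("kind") == "class"}
    for i, element in enumerate(model["elements"]):
        kind = element.get("kind")
        if kind not in ALLOWED_KINDS:
            errors.append(f"element {i}: unknown kind {kind!r}")
        elif kind == "subsumption":
            # Both ends of a subsumption must refer to a declared class.
            for end in ("sub", "sup"):
                if element.get(end) not in declared:
                    errors.append(f"element {i}: undeclared class {element.get(end)!r}")
    return errors


serialised = json.dumps(model, indent=2)  # write the model out as text
restored = json.loads(serialised)         # read it back in
print(syntax_errors(restored))
```

A real serialisation would more likely follow an established exchange format and a schema for it, but even this much is enough for a tool to reject a model that uses elements outside the language.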

Step 7. Evaluation and refinement

Evaluation involves defining and executing test cases to validate and verify the language. Remember those use cases from step 2 and the ontological requirements of step 3? They count as test cases: can they be modelled in the new language, and does it have the selected features? If so, good; if not, you’d better have a good reason why not. If you don’t, then you’ll need to return to step 4 to improve the language. For our temporal UML, we’re all sorted, as both the object and relation migration constraints can be represented, as well as ternaries.
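
Treating use cases as test cases can be sketched in a few lines of Python. The feature names and the mapping from use cases to required features below are hypothetical; in practice they would come out of steps 2 and 3.

```python
# Hypothetical feature list of the language under evaluation.
language_features = {
    "class",
    "binary_relationship",
    "ternary_relationship",
    "object_migration",
    "relation_migration",
}

# Hypothetical use cases, each with the features needed to model it.
use_cases = {
    "Jane moves from Product Manager to Area Manager": {"class", "object_migration"},
    "Jane is assigned to a project for a duration": {"class", "ternary_relationship"},
}


def failing(features, cases):
    """Map each use case that cannot be modelled to the features it is missing."""
    result = {}
    for name, needed in cases.items():
        missing = needed - features
        if missing:
            result[name] = missing
    return result


print(failing(language_features, use_cases))
```

An empty result means that all use cases are covered; any entry in it is either a reason to return to step 4 or something that needs a documented justification.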

Let’s optimistically assume it all went well with your design, and your language passes all those tests. The last task, at least for the first round, is to analyse the effect of usage in practice. Do users use it in the way intended? Are they under-using some language features and discovering they want another, now that they’re deploying it? Are there unexpected user groups with additional requirements that may be strategically savvy to satisfy? If the answers are a resounding ‘no’ to the second and third question in particular, you may rest on your laurels. If the answer is ‘yes’, you may need to cycle through the procedure again to incorporate updates and meet moving goalposts. There’s no shame in that. UML’s version 1.0 was released in 1997 and then came 1.1, 1.3, 1.4, 1.5, 2.0, 2.1, 2.1.1, 2.1.2, 2.2, 2.3, 2.4.1, 2.5, and 2.5.1. The UML 2.6 Revision Task Force faces an issue tracker of around 800 issues, five years after the 2.5.1 official release. They are not all issues with the UML class diagram language, but it does indicate things change. OWL had a first version in 2004 and then a revised one in 2008. ER evolved into EER; ORM into ORM2.

Regardless of whether your pet language is used by anyone other than yourself, it’s fun designing one, even if only because then you don’t have to abide by other people’s decisions on what features a modelling language should have. And if it turns out the same as an existing one, you’ll have a better understanding of why that one is the way it is. What the procedure does not include, but may help marketing your pet language, is how to name it. UML, ER, and ORM are not the liveliest acronyms and not easy to pronounce. Compare that to Mind Maps, which is a fine alliteration at least. OWL, for the web ontology language, is easy to pronounce and it is nifty in that owl-as-animal is associated with knowledge, and OWL is a knowledge representation language, albeit that this explanation is a tad long for explaining a name. Some of the temporal ER languages have good names too, like TimER and TREND. With this last naming consideration, we have pushed it as far as possible in the current language development process.

In closing

The overall process is, perhaps, not an exciting one, but it will get the job done and you’ll be able to justify what you did and why. Such an explanation beats an ‘I just liked it this way’. It may also keep language scope creep in check, or at least help you to become cognizant of it, and you may have the answer ready for a user asking for a feature.

Our evidence-based conceptual data modelling languages introduced in [FillottraniKeet21] have clear design rationales and evidence to back them up. We initially didn’t like them much ourselves, for they are lean languages rather than the very expressive ones that we’d hoped for when we started out with the investigation, but they do have their advantages, such as run-time usage in applications including ontology-based data access, automated verification, query compilation, and, last but not least, seamless interoperability among EER, UML class diagrams and ORM2 [BraunEtAl23].


[BraunEtAl23] Braun, G., Fillottrani, P.R., Keet, C.M. A Framework for Interoperability Between Models with Hybrid Tools, Journal of Intelligent Information Systems, (in print since July 2022).

[CabotVallecillo22] Cabot, Jordi and Vallecillo, Antonio. Modeling should be an independent scientific discipline. Software and Systems Modeling, 2022, 22:2101–2107.

[Frank13] Frank, Ulrich. Domain-specific modeling languages – requirements analysis and design guidelines. In Reinhartz-Berger, I.; Sturm, A.; Clark, T.; Bettin, J., and Cohen, S., editors, Domain Engineering: Product Lines, Conceptual Models, and Languages, pages 133–157. Springer, 2013.

[Halpin04] Halpin, T. A. Advanced Topics in Database Research, volume 3, chapter Comparing Metamodels for ER, ORM and UML Data Models, pages 23–44. Idea Publishing Group, Hershey PA, USA, 2004.

[HorrocksEtAl06] Horrocks, Ian, Kutz, Oliver, and Sattler, Ulrike. The even more irresistible SROIQ. Proceedings of KR-2006, AAAI, pages 457–467, 2006.

[FillottraniKeet20] Fillottrani, P.R., Keet, C.M. An analysis of commitments in ontology language design. 11th International Conference on Formal Ontology in Information Systems 2020 (FOIS’20). Brodaric, B. and Neuhaus, F. (Eds.). IOS Press, FAIA vol. 330, 46-60.

[FillottraniKeet21] Fillottrani, P.R., Keet, C.M. Evidence-based lean conceptual data modelling languages. Journal of Computer Science and Technology, 2021, 21(2): 93-111.

[KeetBerman17] Keet, C.M., Berman, S. Determining the preferred representation of temporal constraints in conceptual models. 36th International Conference on Conceptual Modeling (ER’17). Mayr, H.C., Guizzardi, G., Ma, H. Pastor. O. (Eds.). Springer LNCS vol. 10650, 437-450. 6-9 Nov 2017, Valencia, Spain.