Modelling issues and choices in the development of the Data Mining OPtimization ontology

The Data Mining OPtimization ontology (DMOP) is a sizeable ontology with about 600 classes, over 1000 subclass axioms, more than 100 object properties, 40 object sub-property axioms, and about 10 property chains, and thus uses several SROIQ/OWL 2 DL features. The ontology contains detailed knowledge about data mining tasks, algorithms, hypotheses (mined models or patterns), workflows, and data with its characteristics. Such detailed knowledge is required to meet its high-level aim: to support informed decision-making in the knowledge discovery process. While the ontology can be used as a reference by data miners, its primary purpose (at least, the main motivation for developing it) is the automation of algorithm and model selection, which relies heavily on semantic meta-mining [1]: ontology-based meta-analysis where data mining experiments are conducted, annotated, mined, and analysed, and patterns about data mining performance are extracted from the results. Unlike other data mining ontologies, DMOP helps to propose not just any set of valid workflows, but optimal workflows, thanks to all this detailed knowledge about data mining. (DMOP was developed in the EU FP7 e-lico project and is used in such a system that proposes relatively optimal workflows.)

DMOP’s development was no trivial exercise, however, and several modeling problems popped up that required the use of OWL 2 DL features and started to stretch the recent performance improvements of the automated reasoners. A summary of the ontology and a description and discussion of those issues and their solutions (or: the choices we made for version 5.3 of the ontology) are given in our OWLED’13 paper Modeling issues and choices in the Data Mining OPtimization Ontology [2], which was co-authored with Agnieszka Lawrynowicz (from uni of Poznan, who will present the paper at OWLED’13), Claudia d’Amato (uni of Bari), and Melanie Hilario (uni of Geneva, Axone, and e-lico coordinator).

The main issues we describe in the paper are about meta-modelling and punning, property chains, aligning DMOP to a foundational ontology, and qualities and attributes (and data properties). The meta-modelling topic arose primarily because of the ontological status of Algorithm: is it a class or an instance, and what are the consequences of modeling it either way? Generally, one would consider an algorithm to be an instance, and it can have zero or more implementations that are also instances. In addition, it can take types of inputs (data mining data sets) and outputs (data mining hypotheses), but one cannot assert an axiom that relates an instance to a class other than through instantiation (which is not applicable to an algorithm’s input and output). In the end, we settled for OWL 2’s punning feature, where the same name is used both as a class and as an individual (for details and arguments, refer to the paper).
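To give a rough idea of what punning looks like at the syntax level, here is a minimal sketch that prints OWL 2 functional-syntax axioms in which one name is declared both as a class and as a named individual, so that an algorithm individual can be related to a type of input. All entity names (the `ex:` prefix, `ex:someAlgorithm`, `ex:specifiesInputClass`, `ex:SomeDataSetType`) are hypothetical placeholders for this illustration and are not claimed to be the names used in DMOP.

```python
from typing import List

# Sketch of OWL 2 punning, emitted as functional-syntax strings.
# Every entity name used here is a hypothetical placeholder, not taken from DMOP.


def punned_declarations(name: str) -> List[str]:
    """Declare the same name as a class and as a named individual (the pun)."""
    return [
        f"Declaration(Class({name}))",
        f"Declaration(NamedIndividual({name}))",
    ]


def algorithm_axioms(algorithm: str, algorithm_class: str,
                     io_property: str, data_class: str) -> List[str]:
    """Axioms for an algorithm individual whose input is a *type* of data."""
    axioms = [
        f"Declaration(NamedIndividual({algorithm}))",
        # The algorithm itself is an instance of some algorithm class.
        f"ClassAssertion({algorithm_class} {algorithm})",
    ]
    # The data set type is a class, but is also usable as an individual.
    axioms += punned_declarations(data_class)
    # Thanks to the pun, the class name may appear as the object of a
    # property assertion between individuals.
    axioms.append(
        f"ObjectPropertyAssertion({io_property} {algorithm} {data_class})"
    )
    return axioms


AXIOMS = algorithm_axioms(
    "ex:someAlgorithm", "ex:Algorithm",
    "ex:specifiesInputClass", "ex:SomeDataSetType",
)
for axiom in AXIOMS:
    print(axiom)
```

Note that a reasoner treats the class view and the individual view of a punned name as unrelated entities, which is part of the trade-off discussed in the paper.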

There is a brief section about property chains, their issues, and how those were resolved. A detailed description of how this was done, as well as a generalization of and theoretical foundation for it, is given in my EKAW’12 paper [3] (there’s an informal introduction in an earlier blog post). There were chains that caused undesirable deductions, which are resolved in v5.3 of DMOP using the tests described in [3]. The chains themselves do not exceed the use of three object properties, i.e., two on the left-hand side of the inclusion, yet some nifty desirable inferences can be made now.
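To make the notion of a property chain concrete: an axiom of the form r ∘ s ⊑ t states that whenever x r y and y s z hold, x t z is inferred. The sketch below applies one such chain, with two properties on the left-hand side, to a handful of assertions. The property names (`implements`, `addresses`, `realizes`) and the individuals are hypothetical, and the code only illustrates the inference itself, not the tests from [3].

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def apply_chain(r: Set[Pair], s: Set[Pair]) -> Set[Pair]:
    """Compose two binary relations: (x, z) whenever (x, y) in r and (y, z) in s."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}


# Hypothetical assertions for a chain  implements o addresses -> realizes.
implements = {
    ("ex:operatorA", "ex:algorithmA"),
    ("ex:operatorB", "ex:algorithmB"),
}
addresses = {
    ("ex:algorithmA", "ex:taskA"),
}

# Only operatorA is linked to a task, because algorithmB addresses nothing.
realizes_inferred = apply_chain(implements, addresses)
print(realizes_inferred)
```

An undesirable deduction arises when the inferred pairs clash with what the property on the right-hand side is supposed to relate, for instance through its declared domain or range.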

Linking DMOP to a foundational ontology does introduce several modelling issues besides the linking of DMOP classes and properties to the categories and relationships in the chosen foundational ontology. These include whether to import or to extend the foundational ontology (normally: import); whether the whole foundational ontology should be imported or only a relevant section of it (i.e., the need for module extraction); how to harmonize any expressiveness issues (e.g., the foundational ontology may be too expressive for the purpose of the domain ontology); and what to do with any possible differences in ‘modeling philosophies’ between the two ontologies (e.g., data properties). We ended up importing DOLCE-lite. Linking the data mining classes to DOLCE categories was performed manually, where most of them (like algorithm, software, strategy, task, and optimization problem) were asserted as subclasses of dolce:non-physical-endurant, and their characteristics and parameters as subclasses of dolce:abstract-quality.

A tricky representation issue concerns the ‘attributes’ of entities, such as that each FeatureExtractionAlgorithm has a transformation function that is either linear or non-linear. I’m skipping the arguments here in the blog post (it deserves its own one, and see also the paper), and I jump to the choices we made. Instead of using OWL’s data properties, we went for the ‘foundational ontology way’ of dealing with attributes, where an attribute is not a binary relation between a class and a data type, but an entity itself (subsumed by dolce:quality) that, in turn, is related to a space dolce:region. That is where DOLCE stops, but we needed the data types, so we added a data property hasDataValue from dolce:region to the data type anyType. A section of the ontology is depicted graphically in the next figure.

A section of DMOP with a partial representation of DMOP’s ‘attributes’ (Source: [2]).

For instance, a ModelingAlgorithm has as quality exactly one LearningPolicy (so, LearningPolicy is a subclass of dolce:quality), this LearningPolicy has as quale exactly one abstract region Eager-Lazy, and that Eager-Lazy has as data value at most one anyType data type to record the value of the learning policy of a modeling algorithm. Although this is more cumbersome than with data properties, it makes the ontology much more reusable for a broader set of application scenarios. This comprehensive approach required quite some modeling effort: more than 40 DMOP classes were made subclasses of dolce:abstract-region, Characteristic (with its 94 subclasses) and Parameter (with 42 subclasses) are subclasses of dolce:abstract-quality, and most of them are used in class expressions.
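The chain of entities in this example (algorithm, quality, region, data value) can be mimicked with plain data structures to show the extra indirection compared with a single data property. This is only an illustrative sketch: the Python class names mirror the ontology terms mentioned above, but the field names, the individual `ex:someAlgorithm`, and the value "eager" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EagerLazy:
    """Abstract region; carries at most one data value."""
    has_data_value: Optional[str] = None


@dataclass
class LearningPolicy:
    """Quality; has exactly one quale, which is an Eager-Lazy region."""
    has_quale: EagerLazy


@dataclass
class ModelingAlgorithm:
    """Bearer of the quality; has exactly one LearningPolicy."""
    name: str
    has_quality: LearningPolicy


def learning_policy_value(algorithm: ModelingAlgorithm) -> Optional[str]:
    """Follow algorithm -> quality -> region -> data value."""
    return algorithm.has_quality.has_quale.has_data_value


# Hypothetical algorithm with an assumed learning policy value.
algo = ModelingAlgorithm(
    name="ex:someAlgorithm",
    has_quality=LearningPolicy(has_quale=EagerLazy(has_data_value="eager")),
)
print(learning_policy_value(algo))
```

With a data property, the value would hang directly off the algorithm; here the quality and the region are entities in their own right, which is what allows them to take part in class expressions.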

A few other choices are briefly mentioned in the paper.

Eventually, these and future improvements to DMOP are expected to pay off in the quality of the meta-miner, so that it will compute better workflows.

References

[1] Hilario, M., Nguyen, P., Do, H., Woznica, A., Kalousis, A. Ontology-based meta-mining of knowledge discovery workflows. In: Meta-Learning in Computational Intelligence. Volume 358 of Studies in Computational Intelligence. Springer (2011) 273–315.

[2] Keet, C.M., Lawrynowicz, A., d’Amato, C., Hilario, M. Modeling issues and choices in the Data Mining OPtimization Ontology. 8th Workshop on OWL: Experiences and Directions (OWLED’13), 26-27 May 2013, Montpellier, France. CEUR-WS vol xx (to appear).

[3] Keet, C.M. Detecting and Revising Flaws in OWL Object Property Expressions. Proc. of EKAW’12. Springer LNAI vol 7603, pp. 252-266.
