From data on conceptual models to optimal (logic) language profiles

There are manifold logic-based reconstructions of the main conceptual data modelling languages in a ‘gazillion’ of logics. The reasons for pursuing this line of work are good. In case you wonder, consider:

• Automated reasoning over a conceptual data model to improve their quality and avoid bugs; e.g., an empty database table due to an inconsistency in the model (unsatisfiable class). Instead of costly debugging, one can catch it upfront.
• Designing and executing queries with the model’s vocabulary cf. putting up with how the data is stored with its typically cryptic table and column names.
• Test data generation in automation of software engineering.
• Use it as ‘background knowledge’ during the query compilation stage (which helps optimizing it, so better performance querying a database).

Most of the research efforts on formalizing the conceptual data modelling languages have gone to capturing as much as possible of the modelling language, and therewith aiming to solve the first use case scenario. Runtime usage of conceptual models, i.e., use case scenarios 2-4 above, is receiving some attention, but it brings with it its own set of problems: which trade-offs are the best? That is, we know we can’t have both the modelling languages in their full glory formalised on some arbitrary (EXPTIME or undecidable) logic and have scalable runtime performance. But which subset to choose? There are papers where (logician) authors state something like ‘you don’t need keys in ER, so we ignore those’ or ‘let’s skip ternaries, as most relationships are binary anyway’ or ‘we sweep those pesky aggregation associations under the carpet’ or ‘hierarchies, disjointness and completeness are certainly important’. Who’s right? Or is neither one of them right?

So, we had all that data of the 101 UML, ER, EER, ORM, and ORM2 models analysed (see previous post and [1]). With that, we could construct evidence-based profiles based on the features that are actually used by modellers, rather than constructing profiles based on gut feeling or on one’s pet logic. We specified a core profile and one for each family of the conceptual data modelling languages under consideration (UML Class Diagrams, ER/EER, and ORM/ORM2). The details of the outcome can be found in our recently accepted paper “Evidence-based Languages for Conceptual Data Modelling Profiles” [1] that has been accepted at the 19th Conference on Advances in Databases and Information Systems (ADBIS’15), that will take place from September 8-12 in Poitiers, France. As with the other recent posts on conceptual data models, also this paper was co-authored with Pablo Fillottrani and is an output of our DST/MINCyT-funded bi-lateral project on the unification of conceptual data modelling languages (project overview).

To jump to the short answer: the core profile can be represented in $\mathcal{ALNI}$ (called $\mathcal{PL}_1$ in [3], with PTIME subsumption), whereas the modelling language-specific profiles do not match any of the very many currently existing Description Logic languages with known computational complexity.

Now how we got into that situation. There are some formalization options to consider first, which can affect the complexity of the logic. Notably, 1) whether to use inverses or qualified number restrictions, and 2) whether to go for DL role components for UML’s association ends/ORM’s roles/ER’s relationship components with a 1:1 mapping, or to ignore that and formalise the associations/fact types/relationships only (and how to handle that choice then). Extending a logic language with inverses tends to be less costly computationally cf. qualified number restrictions, so we chose the former. The latter is more complicated to handle regardless the choice, which is partially due to the fact that they are surface aspects of an underlying difference in ontological commitment as to what relations are—so-called standard view versus positionalist—and how it is represented in the models (see discussion in the paper). For the core profile, the dataset of conceptual models justified binaries + standard view representation. In addition to that, the core profile has classes, attributes, mandatory and arbitrary (unqualified) cardinality, class subsumption, and single identification. That set covers 87.57% of all the entities in the models in the dataset (91.88% of the UML models, 73.29% of the ORM models, and 94.64% of the ER/EER models). Note there’s no disjointness or completeness (there were too few of them to merit inclusion) and no role and relationship subsumption, so there isn’t much one can deduce automatically, which is a bit of a bummer.

The UML profile extends the core only slightly, yet it covers 99.44% of the elements in the UML diagrams of the dataset: add cardinality on attributes, attribute value constraints, subsumption for DL roles (UML associations), and aggregation (they are plain associations since UML v2.4.1). This makes a “$\mathcal{ALNHI}(D)$” DL that, as far as we know, hasn’t been investigated yet. That said, fiddling a bit by opting for unique name assumption and some constraints on cardinalities and role inclusion, it looks like $DL\mbox{-}Lite^{\mathcal{HN}}_{core}$ [4] may suffice, which is NLOGSPACE in subsumption and $AC^0$ in data complexity.

For ER/EER, we need to add to the core the following to make it to 99.06% coverage: composite and multivalued attribute (remodelled), weak entity type with its identification constraint, ternaries, associative entity types, and multi-attribute identification. With some squeezing and remodelling things a bit (see paper), $DL\mbox{-}Lite^{\mathcal{N}}_{core}$ [4] should do (also NLOGSPACE), though $\mathcal{DLR}_{ifd}$ [5] will make the formalisation better to follow (though that DL has too many features and is EXPTIME-complete).

Last, the ORM/ORM2 profile, which is the largest to achieve a high coverage (98.69% of the elements in the models in the data set): the core profile + subsumption on roles (DL role components) and fact types (DL roles), n-aries, disjointness on roles, nested object types, value constraints, disjunctive mandatory, internal and external uniqueness, external identifier (compound reference scheme). There’s really no way to avoid the roles, n-aries, and disjointness. There’s no exactly fitting DL for this cocktail of features, though $\mathcal{DLR}_{ifd}$ and $latex$\mathcal{CFDI}_{nc}^{\forall -} &s=-2\$ [6] approximate it; however, the former has too much constructs and the latter too few. That said, $\mathcal{DLR}_{ifd}$ is computationally not ‘well-behaved’, but with $\mathcal{CFDI}_{nc}^{\forall -}$ we still can capture over 96% of the elements in the ORM models of the dataset and it’s PTIME (yup, tractable) [7].

The discussion section of the paper answers the research questions we posed at the beginning of the investigation and reflects on not only missing features, but also ‘useless’ ones. Perhaps we won’t make a lot of friends discussing ‘useless’ features, especially when some authors investigated exactly those features. Anyway, here it goes. Really, nominal are certainly not needed (and computationally costly to boot). We can only guess why there were so few disjointness and completeness constraints in the data set, and even when they were present, they were in the few models we got from textbooks (see data set for sources of the models); true, there weren’t a lot of class hierarchies, but still. The other thing that was a bit of a disappointment was that the relational properties weren’t used a lot. Looking at the relationships in the models, there were certainly opportunities for transitivity and more irreflexivity declarations. One of our current conjectures is that they have limited implementation support, so maybe modellers don’t see the point of adding such constraints; another could be that an ‘average modeller’ (whatever that means) doesn’t quite understand all the 11 that are available in ORM2.

Overall, while a bit disappointing for the use case scenario of reasoning over conceptual data models for inconsistency management, the results are actually very promising for runtime usage of conceptual data models. Maybe that of itself will generate more interest from industry in doing that analysis step before implementing a database or software application: instead of developing a conceptual data model “just for documentation and dust-gathering”, you’ll have one that also will add more, new, better advanced features to your application.

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in print)

[2] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in print)

[3] Donini, F., Lenzerini, M., Nardi, D., Nutt, W. Tractable concept languages. In: Proc. of IJCAI’91. vol. 91, pp. 458-463. 1991.

[4] Artale, A., Calvanese, D., Kontchakov, R., Zakharyaschev, M. The DL-Lite family and relations. Journal of Artificial Intelligence Research, 2009, 36:1-69.

[5] Calvanese, D., De Giacomo, G., Lenzerini, M. Identification constraints and functional dependencies in Description Logics. In: Proc. of IJCAI’01, pp155-160, Morgan Kaufmann.

[6] Toman, D., Weddell, G. On adding inverse features to the Description Logic $\mathcal{CFDI}_{nc}^{\forall}$. In: Proc. of PRICAI 2014, pp587-599.

[7] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in $\mathcal{CFDI}_{nc}^{\forall -}$. 28th International Workshop on Description Logics (DL’15). Calvanese, D., Konev, B. (Eds.), CEUR-WS vol. 1350, pp401-414. 7-10 June 2015, Athens, Greece.

What’s in a publicly available conceptual data model?

We know the answer. And it’s not great from the viewpoint of usage of fancy language features and automated reasoning, which you may have guessed from an earlier post on a tractable encoding of ORM models (see also [1]). How we got there is summarised in the paper [2] that will be presented at the 34th Conference on Conceptual Modeling (ER’15) 19-22 in October in Sweden. An even shorter digest of it follows in the remainder of this post.

I collected 35 models in UML class diagram notation, 35 in ER or EER, and 35 in ORM or ORM2 from various resources, such as online diagrams and repositories (such as GenMyModel), scientific papers (ER conferences, ORM workshops and the like), and some textbooks; see dataset. Each element in the diagram was classified in terms of the unifying metamodel developed earlier with my co-author Pablo Fillottrani [3,4], so that we could examine and compare the contents across the aforementioned three modelling language ‘families’. This data was analysed (incidence, percentages, averages, etc. of the language features), and augmented with further assessments of ratios, like class:relationship, class:attribute and so on. All that had as scope to falsify the following hypotheses:

A: When more features are available in a language, they are used in the models.

B: Following the “80-20 principle”, about 80% of the entities present in the models are from the set of entities that appear in all three language families.

C: Given the different (initial) purposes of UML class diagrams, (E)ER, and ORM, models in each language still have a different characteristic ‘profile’.

To make a long story really short: Hypothesis A is validated, Hypothesis B is falsified, and Hypothesis C validated. Relaxing the constraints for hypothesis B in the sense of including a known transformation between attribute and value type, then it does make it to 87.5%. What was nice to see was the emerging informal profiles of how a typical diagram looks like in a language family, in that ER and EER and ORM and ORM are much more relationship-oriented than UML Class Diagrams, and that there are more hierarchies in the latter. Intuitively, we already kind of knew that, but it’s always better to have that backed up by data.

There are lots of other noteworthy things, like the large amount of aggregations in UML (yay) and weak entity types in ER and EER, and the—from an automated reasoning viewpoint—‘bummer’ of so few disjointness and completeness assertions in all of the models. More of that can be found in the paper (including a top-5 and a bottom-5 for each family), and you may want attend the ER conference so we can discuss more about it in person.

References

[1] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in CFDInc∀-. 28th International Workshop on Description Logics (DL’15). Calvanese, D., and Konev, B. (Eds.), CEUR-WS vol. 1350, pp401-414. 7-10 June 2015, Athens, Greece.

[2] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (accepted)

[3] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[4] Fillottrani, P.R., Keet, C.M. KF metamodel formalization. Technical Report, Arxiv.org 1412.6545. Dec 19, 2014. 26p.

Peer instruction with a computer networks course

Last year, I attempted to use Peer Instruction (PI) in the networks course, hoping that it would correlate with class attendance (it didn’t; see [1] and related post), but it was useful for various reasons nevertheless. So, I tried again this year and, imho, with better questions (the results are ambivalent regarding class attendance—TBC). But what are “better questions”? There’s no such thing as a ‘concept inventory’ for networks, like there is for physics where PI was applied to first [2]. The PI 4 CS website has several sets of PI questions and the ones I checked for theory of computation are really good (discussed here), but there are none for networks.

So I played around from scratch. One issue with last year’s installment was that software-based PI didn’t do the job, because none of the freely available tools have the feature to show figures, or have too many other shortcomings to use effectively in the classroom [1], therewith limiting my options. One of our students developed a funky PI tool this year (that I will write about later); suffice to say, I can do questions with math display and figures now, so that that issue was solved. I’d searched a bit on guidance on designing PI questions, to no avail, until after the course’s installment when preparing for this post, one useful result turned up, being Julie Schell’s blog post over at teach.com (a search engine indexing lag issue, it appears). Mine seem to be mostly fine with respect to those suggestions, so I will give some examples below.

Let me start with a PI question that essentially also returned in the exam, though it was reworded as a non-MCQ question (yes, just to rub it in to those students who did not attend class). A screenshot of the PI tool is shown below, where the graph comes straight from the course slides. The setting is that it is posed right after students try to come to grips with flow control (TCP buffer at the receiver) versus congestion (in the network), and within congestion, the effects of lost packets (that have to be resent, but only 1 arrives) versus delayed packets (packet resent, and both arrive at the receiver). What is the answer?

Screenshot of a PI question in our own software-based PI tool

Clearly, the first distinction is to eliminate TCP buffer at the end versus network congestion issues. The buffer size is handled by TCP such that it will never overflow to lose any packets there (part of the protocol), so it must be congestion that causes the red line not to reach the dotted one, ever. With (know to be) lost packets, only one arrives; with delays, duplicates arrive. Due to having to handle duplicate packets, the receiver receives more packets than strictly needed, so the red line can never reach the maximum. Hence, (d) is the correct answer.

Another one that I like is the ARP one, which basically tests understanding of network layer services versus link layer services as a first pass, and then what ARP is actually doing. The question is as follows.

Consider the following network configuration and addresses. Host A wants to communicate with B. It did so yesterday. Which of the following will NOT happen/is NOT true?

a) Host 111.111.111.112 also receives a frame from A.

b) Router R will put <111.111.111.111, 74-29-9C-E8-FF-55, 20> in its ARP table, if not already there

c) A’s frame has destination addresses 222.222.222.222 and 49-BD-D2-C7-56-2A

d) Router R sends a frame containing addresses 1A-23-F9-CD-06-9B and 49-BD-D2-C7-56-2A, and 111.111.111.111 and 222.222.222.222

Read the explanation in [1] if you want to know why the answer is option (c). Or, take something what seems to be contradictory, but isn’t: statelessness and persistence (on HTTP). One of those think-and-analyse questions is the one on protocol design, notably regarding reliable transfer, and a hint of a gradation in goodness of answer, and therewith inherently some argumentation is needed regardless of what option one chooses.

Consider the following protocol that tries to solve reliable transfer over a channel with bit errors. It has several shortcomings, some worse than others. Which of the following is the worst?

You can choose between:

a) it can’t handle corrupt ACK/NAK packets

b) it gets stuck in the “wait” state when the ACK/NAK is lost

c) it uses superfluous NAKs

d) it is an inefficient stop-and-wait protocol

It requires one to ponder about what it means to have a really a bad protocol design (what is ‘worst’?). Getting stuck in a state forever—‘hanging’/’freezing’, if you will—is really, really bad. But, we’re dealing with a protocol “that tries to solve a channel with bit errors”, only. The scope is not lost packets, but corrupt packets, so it won’t get stuck in the “wait” state because a msg always arrives; hence, it cannot be (b). Then, what is worse of the other three? Superfluous NAKs, just because you can do it with ACKs + sequence numbers? Waiting for a response on your individual message, knowing that there are more efficient protocols that can pipeline packets with cumulative AKCs to reduce waiting and limit the cable being idle (option (b))? That surely is suboptimal, but what does that have to do with ‘worse’ from a viewpoint of reliable transfer as the only requirement? Or, still from a reliable transfer viewpoint, the receiver can’t figure out if the msg is an ACK or a NAK (option (a))? After all, with bit errors, it may well be the case that not only the msg arrives corrupted, but also the ACK/NAK. If the sender doesn’t know which one it is, and doesn’t know what to do with such a jumbled-up ACK-or-NAK packet, it cannot guarantee it has reliably transferred the packet. That’s certainly worse than having a cable idle (which is also a problem, one of performance, but not as bad). This reduces the option to either (a) or (d). A superfluous NAKs is, well, superfluous in the design, because one would have the handle those if they were there, and one could be more elegant in the design if they weren’t there, but it’s not as bad as not dealing with them at all. So, the answer is option (a).

These are just some of the questions that probe a bit deeper than the more commonly known rote learning for networks. It’s not that networks in CS is not devoid of rote learning—there’s quite a bit of it, in fact—but there surely are some things one would do well to try to understand.

If you want to have the full set of questions, don’t hesitate contact me; by then I hope to have them formatted in a more presentable way, with explanations for each question.

P.S.: Note to my dear students: yes, the latter question appeared in a slightly different format in the class test, as did the one on stateless and persistence, as did the congestion control one in the exam (in non-MCQ format). Awww… if only you would have attended lectures and had taken notes… It’s not that I want you to fail, just to show unequivocally some of the myriad of benefits of attending lectures.

References

[1] Keet, C.M. An Experiment with Peer Instruction in Computer Science to Enhance Class Attendance. 23rd Annual Meeting of the Southern African Association for Research in Mathematics, Science, and Technology Education (SAARMSTE’15), Huillet, E. (Ed.), pp319-331. 13-16 January 2015, Maputo, Mozambique.

[2] Crouch, C.H. & Mazur, E. (2001). Peer Instruction: Ten years of experience and results. American Journal of Physics, 69, 970-977.

Exciting ICPC World Finals 2015 in Marrakech

Also this year did we participate in the 39th ACM Intercollegiate Programming Contest World Finals, held in Marrakech, Morocco; the ‘we’ being: the “I Can’t Pronounce Catachtonic” team composed of Yaseen Hamdulay, Robert Spencer, and Sean Wentzel, and me as coach, from the University of Cape Town. We’re the only team from Sub-Saharan Africa, and one of 10 teams in the Africa & Middle East Region, of a total of 128 teams that participated, who were selected from 38160 contestants from 2534 universities of 101 countries on 6 continents that competed in the qualifying regionals.

One year more of studying, practice, and training wiser, our last training session indicated we might be a contender for A&ME regional winner. (For overall winner, we’d need to have and do what the medalist teams do, such as starting training in your early teens and winning IMO and IOI, weekly training sessions, monthly local contests, week-long training camps by previous medal winners, designated labs, competitive programming courses, scholarships and whatnot that other coaches talked about regarding preparations. At this point in time, we don’t have nearly enough such resources.)

The ‘first to solve a problem’ did so in a mere 5 minutes, from opening the envelope with the problems to having submitted the right code! This was Problem A: Amalgamated Artichokes, and the honour went to Peking University, setting a new record. The UCT team did so in 14 minutes, due to being sidetracked with another first. Then it was like: is this the only ‘easy’ problem, and the rest as grueling as last year’s problem set, and will it come down to ‘the team who solves the second problem will win A&ME’? Soon thereafter, the UCT team solved a second problem—but then so did two other A&ME teams, upping the ante that perhaps the 3rd solved problem would be the decider. UCT was still leading—at some point even on 27th position in the overall scoreboard. I got too nervous, and went for lunch, hoping they would have solved a 3rd one upon returning to the scoreboard. And lo an behold, they had, still leading for A&ME, though overall moving down to the 50s-70s in the dynamic scoreboard. To make matters more exciting for spectators, there were 5 PCs with shared screens and video, so one can see one’s team live on webcam, and see what they are coding, every single keystroke. Nifty, imho; nail-biting for some coaches of medal-contenders.

Right before the scoreboard was frozen regarding solved problems (for the last hour of the 5-hour nonstop contest), the American University of Cairo had solved 4 problems, surpassing UCT, but at the cost of a lot of penalty time due to a few wrong submissions, so if the UCT team would solve another problem, and Cairo not, then we’d win A&ME region on time difference. I could see UCT submitted a solution, hoping it was right. Then, sitting in the spectator area, and the Cairo team sitting near that, the scoreboard updated that they had submitted a solution for their 5th problem… and then came the involuntary reaction of its team members, being a mini-cheer. And UCT did not surpass that in the last 30 minutes. Overall, this placed Cairo on 75th place in the final standing, winning the prize for the A&ME region, and UCT just below that, as second in the A&ME region on a very respectable 83, therewith also receiving congrats from other participants, coaches, and interested spectators.

The “I Can’t Pronounce Catachtonic” team from UCT

So, relatively, they did well, having solved an impressive 4 problems, being A, D, I, and J, and all correct on first submission. This placed UCT ahead of other well-known, and arguably better resourced, universities, such as Uni Illinois at Urbana-Champaign, Virginia Tech, IIIT, Uni Western Australia, Cornell, Moscow Aviation, Calgary, and Rice. That said, at the other end of the spectrum, St. Petersburg ITMO broke the record of having solved all contest problems—a first of all the 39 editions of the ICPC world finals—and first to solve problem G. Moscow State Uni came second (11 problems solved out of 13, with first to solve B and H), Uni of Tokyo came third (also 11 problems solved, with first to solve J and K), and the fourth gold medal went to Tsinghua University (10 problems solved, and first to solve C).

If you don’t feel like solving the problems yourself, but still want to know the answer to, among others, cheese slicing, shooting asteroids, tile cutting, and the qanat irrigation system, then have a look at former UCT coach Bruce Merry’s analysis of the problems and directions of the solutions.

All in all, it was a good World Finals. An the food was good, the weather good, the other events too (including a fun camel ride), meeting up with coaches and some contestants met last year, the CLI symposium brought some useful information as well, and Steven and Felix Halim generously gave me a hardcopy of their Competitive Programming 3 book. Sean won the ICPC Quest, so a 1st prize was brought back to Cape Town.

The planning for participating with a strong UCT team next year has commenced; the 2016 finals will be in Thailand.

First tractable encoding of ORM conceptual data models

For (relatively) many years I’ve been focusing on as-expressive-as-possible languages to represent information and knowledge, including the computationally impractical full first order logic, because one would/should want to be as precise as possible and required to represent the subject domain in an ontology and universe of discourse for the application in a conceptual data model. After all, one can always throw out the computationally unpleasant constructs later during the implementation stage, if the ontology or conceptual data model is intended for use at runtime, such as OBDA [1], test data generate for verification [2], and in the query compilation stage in RDBMSs [3]. The resulting slimmed theories/models may be different for different applications, but then at least the set of slimmed theories/models share their common understanding.

So, now I ventured in that area, not because there’s some logic x and conceptual modeling language y has to be forced into it, but it actually appears that many fancy construct/features are not used in publicly available conceptual data models anyway (see data set and xls with some analysis). The timing of the outcome of the analysis of the data set coincided with David Toman’s visit to UCT as part of his sabbatical and Pablo Fillottrani’s visit, who enjoyed the last exchange of our bi-lateral project on the unification of conceptual data modelling languages (project page). To sum up the issue we were looking at: the need for run-time usage of conceptual data models requires a tractable logic-based reconstruction of the conceptual models (i.e., in at most PTIME), which appeared to hardly exist or miss constructs important for conceptual models (regardless whether that was ORM, EER or UML Class Diagrams), or both.

The solution ended up to be a logic-based reconstruction for most of ORM2 using the $\mathcal{CFDI}_{nc}^{\forall -}$ Description Logic, which also happens to be the first tractable encoding of (most of) ORM/ORM2. With this logic, several features important for conceptual models (i.e., occur relatively often) do have their proper encoding in the logic, notably n-aries, complex identification constraints, and n-ary role subsumption. The, admittedly quite tedious, mapping

Low resolution and small version of our DL15 poster summarising the contributions.

captures over 96% of the constructs used in practice in the set of 33 ORM diagrams we analysed (see data set). Further, the results are easily transferable to EER and UML Class diagrams, with an even greater coverage. The results (and comparison with related works) are presented in our recently accepted paper at the 28th International Workshop on Description Logics (DL’15) that will take place form 7 to 11 June in Athens, Greece.

The list of accepted papers of DL’15 is available, listing 21 papers with long presentations, 16 papers with short presentation, and 26 papers with poster presentations. David will present our results in the poster session, as it’s probably of more relevance in the conceptual modelling community (and I’ll be marking exams then), and some other accepted papers cover more new ground, such as casting schema.org as a description logic, temporal query answering in EL, exact learning of ontologies, and more. The proceedings is will be online on CEUR-WS in the upcoming days as volume 1350. I’ve added a mini version of our poster on the right. I tried tikzposter, as they look really cool, but it doesn’t support figures (other than those made in latex), so I resorted to ppt (that doesn’t support math), wondering why these issues haven’t been solved by now.

References

[1] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland. pp 1389-1396.

[2] Toman, D., Weddell, G.E.: Fundamentals of Physical Design and Query Compilation. Synthesis Lectures on Data Management, Morgan & Claypool  Publishers (2011)

[3] Smaragdakis, Y., Csallner, C., Subramanian, R.: Scalable satisfiability checking and test data generation from modeling diagrams. Automation in Software Engineering 16, 73–99 (2009)

[4] Fillottrani, P.R., Keet, C.M., Toman, D. Polynomial encoding of ORM conceptual models in $\mathcal{CFDI}_{nc}^{\forall -}$. 28th International Workshop on Description Logics (DL’15). CEUR-WS vol xx., 7-10 June 2015, Athens, Greece.

Three CS problem-solving strategies exercise sets

The preparations for the ACM ICPC World Finals 2015 in Morocco are in full swing, and so is the training for by with the 125 or so 3-person teams selected of over 30000 contestants who participated in the various selection rounds across the world. While a lot comes down to “practice, practice, practice” all the hard problems you can find, and learning more algorithms and maths, there’s also work to do on refining one’s team strategy and problem solving skills. For the latter, Steven and Felix Halim’s Competitive Programming book comes in handy (as coach at least). Instead of going back-to-back through the book and do the problems listed in the designated problems solving paradigm categories, I turned it around, and selected problems of which the students had to figure out which problem solving paradigm was the right one (Greedy, DP, ad hoc, etc., and more detailed, such as geometry, shortest path, etc.). I’ve made three sets of increasing difficulty:

1. easy’, where most of them can be solved with one problem solving paradigm (also useful for those who are preparing for the Standard Bank IT Challenge ‘heats’ on May 16);
2. medium’, where most of them can be solved with two problem solving paradigms combined;
3. hard’, where most of the problems have appeared in a World Final.

The (direction of the) solutions are on the last page of each pdf, including the UVa Online Judge number, so you can implement and test your solution as well.

I did change some of the problem descriptions for various reasons:

• Localization of the problem description to South Africa: among others, ‘Durban prawns’ (big fat cockroaches are endemic there), ‘nuts for nuts’ (we do have squirrels on campus), ‘shopping for operas’ (Amazon gave up delivering in SA, because so much packets were lost), and some characters got different names.
• Changed story line: ‘charming canines’ (disagreeable storyline in original), and some now have female characters (cf. mostly male or none).
• To make the title an alliteration, like the other titles in the problem set: ‘colliding catamarans’ and ‘cult caps’.

Happy solving :)

Journal paper on Data Mining OPtimization Ontology

I’ve been writing here about bits and pieces of the Data Mining OPtimization ontology (DMOP) before (modeling issues, reasoner performance), but there never was something about the whole setting. I’m happy to say that now there is, for the Semantic Web Journal paper about DMOP is in print now and its in-press version is online, waiting in the queue to be assigned a volume [1]. The ontology itself (v5.4) is freely accessible and downloadable in several formats from the dmo foundry.

The paper can be considered the new so-called ‘reference paper’ of the ontology: it describes the rationale, the non-trivial design choices, content, and its use. The abstract sums it up nicely:

The Data Mining OPtimization Ontology (DMOP) has been developed to support informed decision-making at various choice points of the data mining process. The ontology can be used by data miners and deployed in ontology-driven information systems. The primary purpose for which DMOP has been developed is the automation of algorithm and model selection through semantic meta-mining that makes use of an ontology-based meta-analysis of complete data mining processes in view of extracting patterns associated with mining performance.

To this end, DMOP contains detailed descriptions of data mining tasks (e.g., learning, feature selection), data, algorithms, hypotheses such as mined models or patterns, and workflows. A development methodology was used for DMOP, including items such as competency questions and foundational ontology reuse. Several non-trivial modeling problems were encountered and due to the complexity of the data mining details, the ontology requires the use of the OWL 2 DL profile.

DMOP was successfully evaluated for semantic meta-mining and used in constructing the Intelligent Discovery Assistant, deployed at the popular data mining environment RapidMiner.

As two more teasers to lift the veil a bit, the architecture of various related components is shown in the first figure below, and how it is integrated in the RapidMiner Intellgent Discovery Assistant is shown in the second figure.

(source: [1])

(source: [1])

Unfortunately, the paper is behind Elsevier’s paywall, but we’re free to distribute to individuals. (If only the copyright stuff question from Elsevier would have come some 1.5 month later, this would not have been the case—things have improved and there are addenda and whatnot so that apparently it could have been put in an institutional repository. But, alas, better next time.) More precisely, the ‘we’ are Agniezska Lawrynowicz (shared first author), Claudia d’Amato, Alexandros Kalousis, Phong Nguyen, Raul Palma, Robert Stevens, and Melanie Hilario, and I.

If you use DMOP, experiment with it, or would like to contribute to its further development, please let us know.

References

[1] Keet, C.M., Lawrynowicz, A., d’Amato, C., Kalousis, A., Nguyen, P., Palma, R., Stevens, R., Hilario, M. The Data Mining OPtimization ontology. Web Semantics: Science, Services and Agents on the World Wide Web. in press. http://dx.doi.org/10.1016/j.websem.2015.01.001