Coupon collecting the computing way

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have \frac{total\_no\_of\_stickers}{5}*P , and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one, the second card has a \frac{638}{637} chance of being new, and so on to the last card \frac{638}{1} , wich follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

def panini(n):
     tns = 0
     for i in range(1,n):
          tns = n/i + tns
     return tns

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is an harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant {\sf \gamma} (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is n*H_n , with H_n the harmonic number, and the whole thing equalling also

n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], anong many resources. If that’s not enough, \gamma \approx 0.5772156649 , with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formula on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

total\_no\_of\_stickers = N*\mbox{ln}(N) + \gamma

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for \gamma , i.e., how many decimal digits, and only \gamma or the whole of Eq.2. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8977 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18389 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, mot people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.

A metamodel-driven approach for linking conceptual data models

Interoperability among applications and components of large complex software is still a bit of a nightmare and a with-your-hands-in-the-mud scenario that no-one looks forward to—people look forward to already having linked them up, so they can pose queries across departmental and institutional boundaries, or even across the different data sets within a research unit to advance their data analysis and discover new things.

Sure, ontologies can help with that, but you have to develop one if none is available, and sometimes it’s not even necessary. For instance, you have an ER diagram for the database and a UML model for the business layer. How do you link up those two?

Superficially, this looks easy: an ER entity type matches up with a UML class, and an ER relationship with an UML association. The devil is in the details, however. To name just a few examples: how are you supposed to match a UML qualified association, an ER weak entity type, or an ORM join-subset constraint, to any of the others?

Within the South Africa – Argentina bilateral collaboration project (scope), we set out to solve such things. Although we first planned to ‘simply’ formalize the most common conceptual data modelling languages (ER, UML, and ORM families), we quickly found out we needed not just an ‘inventory’ of terms used in each language matched to one in the other languages, but also under what conditions these entities can be used, hence, we needed a proper metamodel. This we published last year at ER’13 and MEDI’13 [1,2], which I blogged about last year. In the meantime, we not only have finalized the metamodel for the constraints, but also formalized the metamodel, and a journal article describing all this is close to being submitted.

But a metamodel alone doesn’t link up the conceptual data models. To achieve that, we, Pablo Follottrani and I, devised a metamodel-driven approach for conceptual model interoperability, which uses a formalised metamodel with a set of modular rules to mediate the linking and transformation of elements in the conceptual models represented in different languages. This also simplifies the verification of inter-model assertions and model conversion. Its description has recently been accepted as a full paper at the 8th International Web Rule Symposium 2014 (RuleML’14) [3], which I’ll present in Prague on 18 August.

To be able to assert a link between two entities in different models and evaluate automatically (or at least: systematically) whether it is a valid assertion and what it entails, you have to know i) what type of entities they are, ii) whether they are the same, and if not, whether one can be transformed into the other for that particular selection. So, to be able to have those valid inter-model assertions, an approach is needed for transforming one or more elements of a model in one language into another. The general idea of that is shown in the following picture, and explained briefly afterward.

Overview of the approach to transform a model represented in language A to one in language B, illustrated with some sample data from UML to ORM2 (Fig 1 in [3])

Overview of the approach to transform a model represented in language A to one in language B, illustrated with some sample data from UML to ORM2 (Fig 1 in [3])

We have three input items (top of the figure, with the ovals), then a set of algorithms and rules (on the right), and two outputs (bottom, green). The conceptual model is provided by the user, the formalized metamodel is available and a selection of it is included in the RuleML’14 paper [3], and the “vocabulary containing a terminology comparison” was published in ER’13 [1]. Our RuleML paper [3] zooms in on those rules for the algorithms, availing of the formalized metamodel and vocabulary. To give a taste of that (more below): the algorithm has to know that a UML class in the diagram can be mapped 1:1 to a ORM entity type, and that there is some rule or set of rules to transform a UML attribute to an ORM value type.

This is also can be used for the inter-model assertions, albeit in a slightly modified way overall, which is depicted below. Here we use not only the formalised metamodel and the algorithms, but also which entities have 1:1 mappings, which are equivalent but need several steps (called transformations), and which once can only be approximated (requiring user input), and it can be run in both directions from one fragment to the other (one direction is chosen arbitrarily).

Overview for checking the inter-model assertions, and some sample data, checking whether the UML Flower is the same as the ORM Flower (Fig. 2 in [3]).

Overview for checking the inter-model assertions, and some sample data, checking whether the UML Flower is the same as the ORM Flower (Fig. 2 in [3]).

The rules themselves are not directly from one entity in one model to another entity in another, as that would become too messy, isn’t really scalable, and would have lots of repetitions. We use the more efficient way of declaring rules for mapping a conceptual data model entity into its corresponding entity in the metamodel, do any mapping, transformation, or approximation there in the metamodel, and then map it into the matching entity in the other conceptual data model. The rules for the main entities are described in the paper: those for object types, relationships, roles, attributes, and value types, and how one can use those to build up more complex ones for validation of intermodal assertions.

This metamodel-mediated approach to the mappings is one nice advantage of having a metamodel, but one possibly could have gotten away with just having an ‘inventory’ of the entities, not all the extra effort with a full metamodel. But there are benefits to having that metamodel, in particular when actually validating mappings: it can drive the validation of mappings and the generation of model transformations thanks to the constraints declared in the metamodel. How this can work, is illustrated in the following example, showing one example of how the “process mapping assertions using the transformation algorithms” in the centre-part of Fig. 2, above, works out.

Example. Take i) two models, let’s call them Ma and Mb, ii) an inter-model assertion, e.g., a UML association Ra and an ORM fact type Rb, ii) the look-up list with the mappings, transformations, approximations, and the non-mappable elements, and iii) the formalised metamodel. Then the model elements of Ma and Mb are classified in terms of the metamodel, so that the mapping validation process can start. Let us illustrate that for some Ra to Rb (or vv.) mapping of two relationships.

  1. For the vocabulary table, we see that UML association and ORM fact type correspond to Relationship in the metamodel, and enjoy a 1:1 mapping. The ruleset that will be commenced with are R1 from UML to the metamodel and 2R to ORM’s fact type (see rules in the paper).
  2. The R1 and 2R rules refer to Role and Object type in the metamodel. Now things become interesting. The metamodel has represented that each Relationship has at least two Roles, which there are, and each one causes the role-rules to be evaluated, with Ro1 of Ra’s two association ends into the metamodel’s Role and 2Ro to ORM’s roles (‘2Ro’ etc. are the abbreviations of the rules; see paper [3] for details).
  3. The metamodel asserts that Role must participate in the rolePlaying relationship and thus that it has a participating Object type (possibly a subtype thereof) and, optionally, a Cardinality constraint. Luckily, they have 1:1 mappings.
  4. This, in turn causes the rules for classes to be evaluated. From the classes, we see in the metamodel that each Object type must have at least one Identification constraint that involves one or more attributes or value types (which one it is has, been determined by the original classification). This also then has to be mapped using the rules specified.

This whole sequence was set in motion thanks to the mandatory constraints in the metamodel, having gone Relationship to Role to Object type to Single identification (that, in turn, consults Attribute and Datatype for the UML to ORM example here). The ‘chain reaction’ becomes longer with more elaborate participating entities, such as a Nested object type.

Overall, the whole orchestration is no trivial matter, requiring all those inputs, and it won’t be implemented in one codefest on a single rainy Sunday afternoon. Nevertheless, the prospect for semantically good (correct) inter-model assertions across conceptual data modeling languages and automated validation thereof is now certainly a step closer to becoming a reality.

References

[1] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[2] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[3] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

More lively Theory of Computation lectures with peer instruction

Although I don’t teach theory of computation these days, I did encounter something that I’m pretty sure of that would have made the lectures more lively compared to my attempts (blogged about earlier here), and deserves some ‘air time’: peer instruction [1].

My database systems, computer literacy, and networks students already encountered it during some of my lectures, but to briefly recap the general idea (pioneered by Eric Mazur): you insert ‘quiz questions’ (i.e., concept tests) into the lectures, students vote on the multiple choice question, then there is a discussion among the students about the answers, and a re-vote. This sounds perhaps too playful for higher education, but it has been shown to improve course grades by 34-45% [1, 2], thanks to, among others, active engagement, focusing on understanding principles, and student discussion involving ‘teaching each other’.

Even if one thinks it can be suitable for some courses but not others (like tough computer science courses), there’s even peer instruction course material for something so abstract as theory of computation, pioneered by Cynthia Lee and available through peerinstruction4cs. And it works [2].

The peer instruction slides for theory of computation and several other courses can be downloaded after registration. This I did, both out of curiosity and for getting ideas how I might improve the quizzes for the networks course I’m teaching, for which no materials are available yet. Here, I’ll give two examples of such questions for theory of computation to give an indication that it is feasible.

The first one is about tracing through an NFA, and grasping the notion that if any trace through the NFA computation ends in a final state with all input read, it accepts the string. The question is shown in the figure below, which I rendered with the AutoSim automata simulator.

Peer instruction quiz question about NFAs.

Peer instruction quiz question about NFAs.

A and D can be excluded ‘easily’: A has the second trace as [accept] but it rejects, and D has the first trace [reject] but it accepts (which can be checked nicely with the AutoSim file loaded into the tool). The real meat is with options B and C: if one trace accepts and the other rejects, then does the NFA as a whole accept the string or not? Yes, it does; thus, the answer is B.

The other question is further into the course material, on understanding reductions. Let us have ATM and it reduces to MYSTERY_LANGUAGE. Which of the following is true (given the above statement is true):

  1. a) You can reduce to MYSTERY_LANG from ATM.
  2. b) MYSTERY_LANG is undecidable.
  3. c) A decider for MYSTERY_LANG (if it exists) could be used to decide ATM.
  4. d) All of the above

Answering this during the lecture requires more than consulting a ‘cheat sheet’ on reductions. A is a rephrasing of the statement, so that holds (though this option may not be very fair with students whose first language is not English). You will have to remember that ATM is undecidable, and, if visually oriented, the following figure:

Graphical depiction of reduction

Graphical depiction of reduction

 

So, if ATM is undecidable (the “P1” on the left), then so is MYSTERY_LANGUAGE (the “P2” on the right); for the aficionados: there’s a theorem for that in Chapter 9 of [3], being ‘if there is a reduction from P1 to P2, then if P1 is undecidable, then so is P2’. So B holds. Thus, the answer is probably D, but let’s look at C just to be sure of D. Because ATM is ‘in’ MYSTERY_LANGUAGE, then if we can decide MYSTERY_LANGUAGE, then so can we decide ATM—it just happens to be the case we can’t (but if we could, it would have been possible). Thus, indeed, D is the answer.

Sure, there is more to theory of computation than quizzes alone, but feasible it is, and on top of that, student perception of peer instruction is apparently positive [4]. But before making the post too long on seemingly playful matters in education, I’m going to have a look at how that other game—the second semi-final—will unfold to see how interesting the world cup final is going be.

References

[1] Crouch, C.H., Mazur, E. (2001). Peer Instruction: Ten years of experience and results. American Journal of Physics, 69, 970-977.

[2] Bailey Lee, C., Garcia, S., Porter, L. (2013). Can Peer Instruction Be Effective in Upper-Division Computer Science Courses?. ACM Transactions on Computing Education, 13(3): 12-22.

[3] Hopcroft, J.E., Motwani, R. & Ullman, J.D. (2007). Introduction to Automata Theory, Languages, and Computation, 3rd ed., Pearson Education.

[4] Simon, B., Esper, S., Porter, L., Cutts, Q. (2013). Student experience in a student-centered peer instruction classroom. Proceedings of the International Computing Education Research Conference (ICER’13), ACM Conference Proceedings. pp129-136.

ACM ICPC 2014 solution to problem A – baggage

Some of you already know I was on-site coach for the UCT team at the ACM Inter Collegiate Programming Contest World Finals in Ekaterinburg, held from 22 to 26 June, 2014. The problems were unusually hard this year, and solving 4 out of the 10 problems already got teams into the bronze medal range [results]. The technical coach of the UCT team, Bruce Merry, has written an analysis of 6 problems on his blog (D, K, C, E, B, and F–update: now also I and G, J, L, and H), and over at TopCoder, SnapDragon discusses 4 of the 10 problems (A, E, F, G), except that SnapDragon does not provide a solution to A (and I don’t like his hint of brute-force code-and-try) [update 3-7: there’s description of his solution here now]. Googling, I couldn’t find someone else discussing the solution to problem A, so here’s mine, which I solved on the plane from Moscow to Frankfurt (among other activities, and one among the four flights we had to take to arrive back in Cape Town).

The problem

Stripping the “baggage problem” to its essentials: you have a sequence of alternating Bs and As, which has to be reordered to first all the As then all the Bs, and the reordering has to occur moving two letters at the same time, and remain adjacent. For instance, with an n = 4, we have 2n characters, BABABABA, and in the end after sorting, it then will be AAAABBBB, and this has to occur in the minimum amount of moves. N is randomly chosen, and is an integer between 3 and 100. The cells on your tape are numbered from 0 to -2n+1, so with our n = 4, from -7 to 8, and the first B is on position 1.

Update (3-7): This problem appears to be “Tait’s counter puzzle”, which Peter Guthrie Tait, a Scottish mathematical physicist, described in 1884 [see description] (thanks to Davi Duarte who mentioned it in the topcoder thread).

Solution

Informally, there are two parts to the algorithm: move around the As and Bs to pair them as AA and BB, which cost you n/2 moves if n is even and n/2-rounded up moves if n is odd, and then sort those pairs in the remainder to a total of n moves. The first part occurs alternating moving AB from the right to the left, starting at the last AB (position 2n-2) and then every other 4 to its left, and BA from left to right, starting from position 3 and every other 4 positions to the right (i.e., 7, 11, etc.), and then the BBs are ferried to the right from left to right, and the AAs from the right to the left, also alternating. N=3 is an exception.

One can do this with a neat mathematical proof, but I found out with pattern creation and recognition, visualizing the provided sample input and moves for n = 5 and n = 8, which can be done in 5 resp. 8 steps (given in the problem statement as sample output). Here’s a description of that approach.

First, devising one for 3 is trivial, using the following moves (the only option), using the position indicated with the position of the first letter, and highlighting the ones that will be moved next:

Start: . . . . . . BABABA
Move 2 to -1, which gives . . . . ABB . . ABA
Move 5 to 2, which gives . . . . ABBBAA . .
Move 3 to -3, which gives . . AAABBB . . . .

This already suggests that the minimum amount of moves will always be n, because it is n moves also for n = 5 and n = 8. Next, devising one for n = 4, and knowing the moves for n = 3 and n = 5, I tried to work it out for BABABABA, i.e., the first AB, like with n = 3, BABABABA, and BABABABA like with the n = 5 case. The last one is the only one that worked in 4 moves, starting with moving 6 to -1, again. Again, because for all ns so far, the first move is to -1.

Let’s put this to the test with n = 7, with the ones to be moved in bold and the empty cells indicated with dots:

..BABABABABABABA

i.e., move the last AB (thus, position 12) as this worked for n=4 and n=5, then

ABBABABABABAB..A

i.e., move the first BA after position 1 (thus, position 3)

ABBA..BABABABBAA

i.e., move the last AB before a pair (thus, position 8)

ABBAABBAB..ABBAA

i.e., move the first BA after position 1 (thus, position 5). Then sorting the BBs and AAs:

ABBAAB..BBAABBAA
A..AABBBBBAABBAA
AAAAABBBBB..BBAA
AAAAABBBBBBB..AA
AAAAAAABBBBBBB..

This was enough for me to have discovered the pattern of alternating for the matching and alternating for the sorting.

What is ‘nasty’ in the problem description, retrospectively and for the pattern-based approach vs. a neat maths-y proof at least, is that the pattern deviates for n = 3, and the provided sample output for n = 8, although in 8 steps, is an alternative solution, not one that one obtains with the algorithm. With the approach as mentioned above, we have for n = 8 the following (I did that one to double-check):

..BABABABABABABABA
ABBABABABABABAB..A
ABBA..BABABABABBAA
ABBAABBABAB..ABBAA
ABBAABBA..BBAABBAA
A..AABBABBBBAABBAA
AAAAABBABBBB..BBAA
AAAAA..ABBBBBBBBAA
AAAAAAAABBBBBBBB..

This confirmed it works. It may look a bit craft-y, but the patterns show beautifully with larger n, which are shown below for the even and odd case (made just for the blog post, not in solving it); squint your eyes if it’s not immediately clear.

N = 16

..BABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABBAABBAA
ABBAABBAABBABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABBAABBAABBAA
ABBAABBAABBAABBABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BBAABBAABBAABBAA
A..AABBAABBAABBABBBBAABBAABBAABBAA
AAAAABBAABBAABBABBBB..BBAABBAABBAA
AAAAA..AABBAABBABBBBBBBBAABBAABBAA
AAAAAAAAABBAABBABBBBBBBB..BBAABBAA
AAAAAAAAA..AABBABBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBABBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..ABBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAABBBBBBBBBBBBBBBB..

N = 17

..BABABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABABBAABBAA
ABBAABBAABBABABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABABBAABBAABBAA
ABBAABBAABBAABBABABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BABBAABBAABBAABBAA
ABBAABBAABBAABBAABB..BAABBAABBAABBAA
A..AABBAABBAABBAABBBBBAABBAABBAABBAA
AAAAABBAABBAABBAABBBBB..BBAABBAABBAA
AAAAA..AABBAABBAABBBBBBBBBAABBAABBAA
AAAAAAAAABBAABBAABBBBBBBBB..BBAABBAA
AAAAAAAAA..AABBAABBBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBAABBBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..AABBBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBB..

Neat. Writing the code is left as an exercise to the reader.

I’m looking forward to the other problems (except the crane balancing problem C, whose description I did not like, at all), the upcoming regionals and next year’s ICPC World Finals in Morocco (if not winning, then to have at least a great time like we had this time)!

First steps for isiZulu natural language generation

Yes, Google Translate English-isiZulu does exist, but it has many errors (some very funny) and there’s a lot more to Natural Language Generation (NLG) than machine translation, such as natural language-based query interfaces that has some AI behind it, and they are needed, too [1]. Why should one bother with isiZulu? Muendane has his lucid opinions about that [2], and in addition to that, it is the first language of about 23% of the population of South Africa (amounting to some 10 million people), about half can speak it, and it is a Bantu language, which is spoken by nearly 300 million people—what works for isiZulu grammar may well be transferrable to its related languages. Moreover, it being in a different language family than the more well-resourced languages, it can uncover some new problems to solve for NLG, and facilitate access to online information without the hurdle of having to learn English or French first, as is the case now in Sub-Saharan Africa.

The three principal approaches for NLG are canned text, templates, and grammars. While I knew from previous efforts [3] that the template-based approach is very well doable but has its limitations, and knowing some basic isiZulu, I guessed it might not work with the template-based approach but appealing if it would (for a range of reasons), that no single template could be identified so far was the other end of the spectrum. Put differently: we had to make a start with something resembling the foundations of a grammar engine.

Langa Khumalo, with the Linguistics program and director of the University Language Planning and Development Office at the University of KwaZulu-Natal, and I have been trying to come up with isiZulu NLG. We have patterns and algorithms for (‘simple’) universal and existential quantification, subsumption, negation (class disjointness), and conjunction; or: roughly OWL 2 EL and a restricted version of ALC. OWL 2 EL fist neatly with SNOMED CT, and therewith has the potential for interactive healthcare applications with the isiZulu healthcare terminologies that are being developed at UKZN.

The first results on isiZulu NLG are described in [4,5], which was not an act of salami-slicing, but we had more results than that fitted in a single paper. The first paper [4] will appear in the proceedings ofthe 4th workshop on Controlled Natural language (CNL’14), and is about finding those patterns and, for the options available, an attempt at figuring out which one would be best. The second paper [5], which will appear in the 8th International Web Rule Symposium (RuleML’14) conference proceedings, is more about devising the algorithms to make it work and how to actually generate those sentences. Langa and I plan to attend both events, so you can ask us about the details either in Prague (18-20 Aug) or Galway (20-22 Aug) in person. In the meantime, the CRCs of the papers are online (here and here).

Regarding the technical aspects, the main reasons why we cannot get away with devising templates to generate isiZulu controlled natural language is that isiZulu is non-trivial:

  • There is a whole system of noun classes: nouns are grouped in one of the 17 noun classes, each with their own peculiarities, which is illustrated in Figure 1, below;
  • Agglutination, informally: putting lots of bits and pieces together to make a word. A selection of those so-called ‘concords’ is included in Figure 2, below;
  • Phonological conditioned copulatives, meaning that the ‘is a’ depends on the term that comes after it (ng or y); and
  • Complex verb conjugation.
isiZulu noun classes with an example (source: [5]).

isiZulu noun classes with an example (source: [5]).

A selection of isiZulu concords (source: [5])

A selection of isiZulu concords (source: [5])

What does this mean for the verbalization? In English, we use ‘Each…’ or ‘For all…’ for the universal quantifier and it doesn’t matter over which noun it is quantified. In isiZulu, it does. Each noun class has its own ‘each’ and ‘for all’, and it is not acceptable (understandable) to use one for the wrong noun class. For disjointness, like “Cup is not a Glass” ({\sf Cup \sqsubseteq \neg Glass} in DL), in English we have the ‘is not a’ regardless what comes before or after the subsumption+negation, but in isiZulu, the copulative is omitted, the first noun (OWL class, if you will) brings in a so-called negative subject concord, the second noun brings in a pronominal, and they are glued together (e.g., Indebe akuyona Ingilazi, where the second word is composed of aku + yona), and to top it off, each noun class has its own concord and pronomial. A seemingly simple conjunction—just an ‘and’ in English—has to be divided into an and-when-it-is-used-in-an-enumeration and an and-when-it-is-a-connective, and when it is used in an enumeration, it depends on the first letter of the noun that comes after the ‘and’. Existential quantification is even more of a hassle. The table below shows a very brief summary comparing typical patterns in English with those for isiZulu.

A few DL symbols, their typical verbalization options in English, and an indication of possible patterns (source: [4])

A few DL symbols, their typical verbalization options in English, and an indication of possible patterns (source: [4])

We did ask isiZulu speakers which of the possible options they preferred (in a survey, with Limesurvey localized to isiZulu), but there wasn’t an overwhelming consistent agreement among them except for one of the options for existential quantification (the –dwa option), although there was more agreement among the linguists than among the non-linguists, possibly due to dialect influences (results can be found in [4]).

If you don’t feel like reading the two papers, but still would like to have some general overview and examples, you also can check out the slides of the CS colloquium I gave last week. I managed to ‘lure in’ also ICT4D people—and then smack them with a bit of logic and algorithms—but the other option, being talking about the other paper accepted at RuleML, probably would have had to be a ‘cookie colloquium’ to get anyone to attend (more about that paper in another post—it is fascinating, but possibly of less interest to a broader audience). If you want to skip the tedious bits and just get a feel of how one of the algorithms works out: check out the example starting on slide 63, which shows the steps to go from {\sf \forall x (uSolwazi(x) \rightarrow \exists y (ufundisa(x, y) \land Isifundo(y)))} in FOL, or {\sf uSolwazi \sqsubseteq \exists ufundisa.Isifundo} in DL (“Each professor teaches at least one course”, if the vocabulary were in English), to “Bonke oSolwazi bafundisa isifundo esisodwa”.

Clearly, a lot remains to be done.

References

[1] Alberts, R., Fogwill, T., Keet, C.M. Several Required OWL Features for Indigenous Knowledge Management Systems. 7th Workshop on OWL: Experiences and Directions (OWLED’12). 27-28 May, Heraklion, Crete, Greece. CEUR-WS Vol-849. 12p

[2] Muendane, N.M. I am an African. 2006, Soultalk CC.

[3] Jarrar, M., Keet, C.M., Dongilli, P. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[4] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), 20-22 August 2014, Galway, Ireland. Springer LNAI. (in press)

[5] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), August 18-20, 2014, Prague, Czech Republic. Springer LNCS (in press).

CFP OWLED 2014

Here’s a bit of unvarnished promotion for the 1tth OWL: Experience and Directions Workshop (I’m the co-program chair, with Valentina Tamma, and Bijan Parsia is general chair)

—-

CALL FOR PAPERS: 11th OWL: Experiences and Directions Workshop (OWLED)
Riva del Garda, October 17th – 18th, 2014 co-located with ISWC 2014

http://www.w3.org/community/owled/workshop-2014/

Important Dates (All deadlines are Hawaii time)
• Paper submission due: July 30, 2014 EXTENDED TO AUGUST 4, 2014!!!
• Acceptance notifications: September 5, 2014
• Final papers due: September 18, 2014
• OWLED workshop: 17-18 October, 2014

OWLED is now also a Community Group at the W3C. Everyone is invited to participate:

http://www.w3.org/community/owled/

 

The aim of the OWL: Experiences and Directions Workshop (OWLED) is to establish an
international forum for the OWL community, where practitioners in industry and
academia, tool developers and others interested in OWL can describe real and
potential applications, share experience and discuss requirements for language
extensions/modifications. OWL has become the representational model of choice for
supporting interoperability in many industries. This has been made possible thanks
also to the development of numerous OWL reasoning systems that efficiently deal with
both intensional (ontologies) and extensional (data) query answering. In this
edition we aim to bridge the gap with the reasoner evaluation community and welcome
the submission of papers describing challenging ontologies and/or tasks to be
represented in OWL and processed by OWL reasoners. It also welcomes proposals for
improving the OWL 2 standard.

This year, we would like to invite submissions of the following types of papers:

Technical papers:  All submissions must be in English and be no longer than 12 pages
(including references). Papers that exceed this limit will be rejected without
review.  These papers should present research, implementation experience, and
reports on the above and related topics. Space will be reserved for authors to
present their work at the workshop.

Short papers (4-6 pages, including references): These papers should present work
that is in an early stage and/or include publishable (novel) implemented systems
that are of interest to the OWLED community; and (in case of an implemented system),
can be demonstrated at the workshop.

All submissions must be must be in PDF, and must adhere to the Springer LNCS style.
For more details, see Springer’s Author Instructions:

http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0.

Papers can be submitted online using the Easychair Conference system:

https://www.easychair.org/conferences/?conf=owled2014

Papers related to any aspects of OWL and extensions, applications, theory, methods
and tools, are welcome.

Topics of interest include, but are not limited to:

  • Application driven requirements for OWL
  • Applications of OWL, particularly: from industry, or for data integration, for service interoperability, for sophisticated/non-obvious inference, for knowledge discovery
  • and within specific domains such as: law, bio and biomed, eLearning
  • Experience of using OWL: notably, highly expressive ontologies or the OWL 2 Profiles
  • Evaluation of OWL tools e.g. reasoners
  • Benchmarks for OWL tools
  • Performance and scalability issues and improvements
  • Extensions to OWL
  • OWL and Rules
  • Implementation techniques and experience reports
  • Non-standard reasoning service (implementation and requirements for)
  • Explanation
  • Ontology comprehension and verbalisation
  • Multilingual OWL
  • Modelling issues
  • Tools, including editors, visualisation, parsers and syntax checkers
  • Collaborative editing of ontologies
  • Versioning of OWL ontologies
  • Alignment of OWL ontologies
  • Modularity
  • Query answering with OWL
  • SPARQL and OWL
  • Linked Data and OWL

Dabbling into evaluating reasoners with the DMOP ontology

The Data Mining OPtimization ontology (DMOP) is a highly axiomatised ontology that uses almost all features of OWL 2 DL and the domain entities are linked to DOLCE, using all four main ‘branches’ of DOLCE. Some details are described in last year’s OWLED’13 paper [1] and a blog post. We did observe ‘slow’ reasoner performance to classify the ontology, however, like, between 10 and 20 minutes, varying across versions and machines. The Ontology Reasoner Evaluation (ORE’14) workshop (part of the Vienna Summer of Logic) was a nice motivation to have a look at trying to figure out what’s going on, and some initial results are described briefly in the 6 pages-short paper [2], which is co-authored with Claudia d’Amato, Agnieszka Lawrynowicz, and Zubeida Khan.

Those results are definitely what can be called interesting, even though we’re still at the level of dabbling into it from a reasoner user-centric viewpoint, and notably, from a modeller-centric viewpoint. The latter is what made us pose questions like “what effect does using feature x have on performance of the reasoner?”. No one knew, except for the informal feedback back I received at DL 2010 on [3] that reasoning with data types slows down things, and likewise when the cardinalities are high. That’s not an issue with DMOP, though.

So, the first thing we did was determining a baseline on a good laptop—your average modeller doesn’t have a HPC cluster readily at hand—and in an Ontology Development Environment, where the reasoner is typically accessed from. Some 9 minutes to classify the ontology (machine specs and further details in the paper).

The second steps were the analysis of one specific modeling construct (inverses), and what effect DOLCE has on the overall performance.

The reason why we chose representation of inverses is because in OWL 2 DL (cf. OLW DL), one can use the objectInverseOf(OP) to use the inverse of an object property instead of extending the ontology’s vocabulary and using InverseObjectProperties(OPE1 OPE2) to relate the property and its inverse. For instance, to use the inverse the property addresses in an axiom, one used to have to introduce a new property, addressed by, declare it inverse to addresses, and then use that in the axiom, whereas in OWL 2 DL, one can use ObjectInverseOf(addresses) in the axiom (in Protégé, the syntax is inverse(addresses)). That slashed computing the class hierarchy by at least over a third (and about half for the baseline). Why? We don’t know. Other features used in DMOP, such as punning and property chains, were harder to remove and are heavily used, so we didn’t test those.

The other one, removing DOLCE, is a bit tricky. But to give away the end results upfront: that made it 10 times faster! The ‘tricky’ part has to do with the notion of ‘linking to a foundational ontology’ (deserving of its own blog post). For DMOP, we had not imported but merged, and we did not merge everything from DOLCE and its ExtendedDnS, but only what was deemed relevant, being, in numbers, 43 classes, 78 object properties and 593 axioms. To make matters worse—from an evaluation viewpoint, that is—is that we reused heavily three DOLCE object properties, so we kept those three DOLCE properties in the evaluation file, as we suspected it otherwise would have affected the deductions too much and interfere with the DOLCE-or-not question (one also could argue that those three properties can be considered an integral part of DMOP). So, it was not a simple case of ‘remove the import statement and run the reasoner again’, but a ‘remove almost everything with a DOLCE URI manually and then run the reasoner again’.

Because computation was so ‘slow’, we wondered whether maybe cleverly modularizing DMOP could be the way to go, in case someone wants to use only a part of DMOP. We got as far as trying to modularize the ontology, which already was not trivial because DMOP and DOCLE are both highly axiomatised and with few, if any, relatively isolated sections amenable to modularization. Moreover, what it did show, is that such automated modularization (when it was possible) only affects the number of class and number of axioms, not the properties and individuals. So, the generated modules are stuck with properties and individuals that are not used in, or not relevant for, that module. We did not fix that manually. Also, putting back together the modules did not return it to the original version we started out with, missing 225 axioms out of the 4584.

If this wasn’t enough already, the DMOP with/without DOLCE test was performed with several reasoners, out of curiosity, and they gave different output. FaCT++ and MORe had a “Reasoner Died” message. My ontology engineering students know that, according to DOLCE, death is an achievement, but I guess that its reasoners’ developers would deem otherwise. Pellet and TrOWL inferred inconsistent classes; HermiT did not. Pellet’s hiccup had to do with datatypes and should not have occurred (see paper for details). TrOWL fished out a modeling issue from all of those 4584 axioms (see p5 of the paper), of the flavour as described in [4] (thank you), but with the standard semantics of OWL—i.e., not caring at all about the real semantics of object property hierarchies—it should not have derived an inconsistent class.

Overall, it feels like having opened up a can of worms, which is exciting.

References

[1] Keet, C.M., Lawrynowicz, A., d’Amato, C., Hilario, M. Modeling issues and choices in the Data Mining OPtimisation Ontology. 8th Workshop on OWL: Experiences and Directions (OWLED’13), 26-27 May 2013, Montpellier, France. CEUR-WS vol 1080.

[2] Keet, C.M., d’Amato, C., Khan, Z.C., Lawrynowicz, A. Exploring Reasoning with the DMOP Ontology. 3rd Workshop on Ontology Reasoner Evaluation (ORE’14). July 13, 2014, Vienna, Austria. CEUR-WS vol (accepted).

[3] Keet, C.M. On the feasibility of Description Logic knowledge bases with rough concepts and vague instances. 23rd International Workshop on Description Logics (DL’10), 4-7 May 2010, Waterloo, Canada.

[4] Keet, C. M. (2012). Detecting and revising flaws in OWL object property expressions. In Proc. of EKAW’12, volume 7603 of LNAI, pages 252–266. Springer.

Follow

Get every new post delivered to your Inbox.

Join 42 other followers

%d bloggers like this: