## On ‘swapping’ your foundational ontology to increase interoperability

Over the past few years, I’ve been putting some effort into methods and tools and some data collection and analysis that would aid the use of foundational (top-level) ontologies in ontology engineering, such as DOLCE, GFO, and BFO, and some of its relations (mainly part-whole relations). Tools include the Ontology Selection and Explanation Tool to choose the most suitable foundational ontology [1] and OntoPartS [2] and OntoPartS-2 [3] for software-supported modeling of part-whole relations, and experimentally validating using a foundational ontology does make a difference [4]. The latest addition is SUGOI—Software Used to Gain Ontology Interchangeability, initiated by Zubeida Khan’s idea mentioned in her (cum laude) MSc thesis, which I supervised.

In the meantime, SUGOI has been implemented, and we have used it to answer principally two questions:

1. Is it feasible to automatically generate links between ontology Oa and foundational ontology Oy, given Oa is linked to Ox? Say, I have an ontology linked to BFO, then can I swap BFO for DOLCE?
2. If there are issues with the former, what is causing it? Or: in praxis, which entities of Ox are typically used for mappings with domain ontologies that may not be present, or present in an incompatible way, in Oy? Or: if not, then why not?

We tested this with 16 ontologies that are linked to a foundational ontology, and the results have just been accepted at the 19th International Conference on Knowledge Engineering and Knowledge Management (EKAW’14) [5].

Now, I already know that some of you will say (and, in fact, have said!), this is not feasible at all. Arguments on philosophical distinctions are there, yes, but not all of that appears in an OWL file and in the modeller’s view (see also an earlier post and references therein). Put differently: things are not that clear-cut and black-and-white as it initially may seem. We did observe a basic, or raw, ‘swapping success rate’ from 2% for the PID ontology from the GFO it was aligned to, to BFO, to up to a whopping 82% for the IDO ontology from BFO to either DOLCE or GFO (averaging at 36% for the real ontologies we tested with). Now, there.

So, what’s really happening? The success rate actually depends on several factors. Some entities in, say, BFO, while named differently, do have an equivalent in DOLCE or GFO, that may or may not be in a similar place in the ontology (if not, then you still end up with an inconsistency, which we removed as mapping), others do not. Those mappings have been investigated in detail [6], and, indeed, there aren’t many, but surely there are some. Several domain ontologies have alignments to only a few categories in a foundational ontology, others have more. If there aren’t many links, or predominantly to those for which there exists an equivalence assertion, then your ‘swapping success rate’ (called raw interchangeability in the paper) is high. Thus, it is not that it is not feasible at all.

The interface of the online desktop version of SUGOI.

Sounds obvious when one puts it like that. But what about my ontology, you may wonder. Use SUGOI to find out. The log file shows what’s been done in the process, and does compute those raw interchangeability metrics for you. SUGOI is ‘trivial’ to extend to include foundational ontologies other than DOLCE, BFO, and GFO—just the mapping files have to be added, but it doesn’t really change the algorithm.

We also looked at the data, especially for the ones with a low success rate, to figure out what causes it. It appeared that for those that use DOLCE, they probably do so because it has some nice knowledge about attributive properties that are not represented (BFO) or represented in an incompatible way (GFO) elsewhere. Likewise, those ontologies that were linked to BFO or GFO and for which there was a lower interchangeability to DOLCE, had quite a few links to aspects on roles, which aren’t in DOLCE proper, so that was causing a relatively lower success rate there (more details in the paper). We leave it up to the developers of the respective foundational ontologies to decide whether they wan to fill that ‘gap’ in their respective ontology.

We also checked SUGOI’s output against ontologies that had been aligned manually to more than one foundational ontology by the developers. We could find only two that were: BioTop and the Stuff Ontology. Mainly, we found the odd error in alignment and a few ones missed by manual alignment, but with n=2, those results are quite at the level of interesting anecdote (observing that the plural of anecdote is not data).

Whether you want to swap, or offer your ontology aligned to more than one foundational ontology to increase its interoperability with other ontologies, is, clearly, your choice to make. If you decide to do so, you could do that manually, but SUGOI automates that process for you as much as possible. Both Zubeida and I plan to be at EKAW’14, hopefully also with a demo, so that you not only can test it with your ontology (which you can do already on the SUGOI page already), but also gain some further detailed insights into the algorithm, the mapping files it used, and the consequences for your ontology.

References

[1] Khan, Z., Keet, C.M. ONSET: Automated Foundational Ontology Selection and Explanation. 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW’12), A. ten Teije et al. (Eds.). Oct 8-12, Galway, Ireland. Springer, LNAI 7603, 237-251.

[2] Keet, C.M., Fernandez-Reyes, F.C., Morales-Gonzalez, A. Representing mereotopological relations in OWL ontologies with OntoPartS. 9th Extended Semantic Web Conference (ESWC’12), Simperl et al. (eds.), 27-31 May 2012, Heraklion, Crete, Greece. Springer, LNCS 7295, 240-254.

[3] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings, pp569-578. Oct. 27 – Nov. 1, 2013, San Francisco, USA.

[4] Keet, C.M. The use of foundational ontologies in ontology development: an empirical assessment. 8th Extended Semantic Web Conference (ESWC’11), G. Antoniou et al (Eds.), Heraklion, Crete, Greece, 29 May-2 June, 2011. Springer, Lecture Notes in Computer Science LNCS 6643, 321-335.

[5] Khan, Z.C., Keet, C.M. Feasibility of automated foundational ontology interchangeability. 19th International Conference on Knowledge Engineering and Knowledge Management (EKAW’14). 24-28 Nov, 2014, Linkoping, Sweden. Springer LNAI. (accepted)

[6] Khan, Z., Keet, C.M. Addressing issues in foundational ontology mediation. 5th International Conference on Knowledge Engineering and Ontology Development (KEOD’13), Vilamoura, Portugal, 19-22 September. Filipe, J. and Dietz, J. (Eds.), SCITEPRESS, pp5-16.

## Enjoyable and interesting controlled natural languages workshop (CNL’14)

Conferencing in Ireland was a good experience again. Like EKAW 2012, the Fourth Workshop on Controlled Natural Language (CNL’14) was held in the Aula Maxima at the University of Galway, a beautiful ivy-covered building conducive of a stimulating scientific atmosphere and, as any good event, one leaves with plenty of ideas to pursue, and it was a good ambience to meet up again with colleagues as well as meeting new ones, such as Allan Third of the SWAT natural language tools that we use in the ROMULUS foundational ontology library. The remainder of this post is a quick write-up about several of the papers and presentations, written during an otherwise lost moment at Dublin airport.

If you’re not too familiar with CNLs, a useful brief overview to start with is Safwat and Davis’ state of the art [1]. However, some of you might first prefer to read something that is one of the answers to “what would it be good for?”; in that case, I can highly recommend the paper on automatically generating the Swiss avalanche bulleting in 4 languages [2], presented by Kurt Winkler: not only their participants found it very difficult to figure out which ones were manually generated and which ones automatically (55% correct, on average), but also the CNL attendees had trouble with ‘guessing’ it right (yeah, including me). From a technical perspective, it uses a catalogue-based translation system with chunks of text segments. Rather more theoretical were the two papers on the Grammatical Framework. The first one was the invited talk by Aarne Ranta [3] about embedded controlled languages. He provided a brief overview of GF (which started in 1998 at Xerox in Grenoble) up to the current state in the EU project Molto for multilingual machine translation, and different levels of quality of the generated text. Inari Listenmaa presented an extension to the system so that GF will be able to handle compositionality [4].

Interesting to me was the question whether CNLs exist for generating text about temporal events, in part because I’ve another strand of research on temporal conceptual data modelling. Not everyone agreed whether there was anything other than simple stories, but it was hard to find much about it (if you do or know of it, please leave a pointer in the comments). Gordon Pace presented results on verbalizing finite state machines (events with properties), in particular violation traces through the FSM [5]; e.g., when one has a process for logins and failed logins that is violated, the sysadmin needs to know what has happened, and ideally be informed about what essentially went wrong in an intelligible way, and summarized rather than having to pour over endless logs.

On the multilingual front for less common languages, there were two papers for Latvian involving FrameNet for their controlled natural language [6,7], and Langa Khumalo presented our joint paper about isiZulu natural language generation [8] about which I blogged earlier.

Last, but not least—and, more precisely: first—the best paper award. It was awarded to two papers, being to the paper on technical text authoring by Juyeon Kang and Patrick Saint-Dizier [9] and to the paper on style guides as controlled languages, by Karolina Suchowolec [10].

The next CNL workshop will be held in about 2 years time, also most likely co-located with a larger conference (now it was co-located with COLING in Dublin), and some other activities are also in the pipeline, such as a mailing list, wiki etc. so it will be easier for people to stay tuned with the latest developments in CNLs. I’m already looking forward to the next installment of the event.

References

Note: all links are to the CRCs posted on arxiv; the final versions formatted by Springer are on the Springer site (behind a paywall for most people).

[1] Hazem Safwat and Brian Davis. A Brief State of the Art of CNLs for Ontology Authoring. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 190-200. 20-22 Aug, 2014, Galway, Ireland.

[2] Kurt Winkler, Tobias Kuhn and Martin Volk. Evaluating the fully automatic multi-language translation of the Swiss avalanche bulletin. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 44-54. 20-22 Aug, 2014, Galway, Ireland.

[3] Aarne Ranta. Embedded Controlled Languages. (invited paper). Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 1-7. 20-22 Aug, 2014, Galway, Ireland.

[4] Ramona Enache, Inari Listenmaa and Prasanth Kolachina. Handling non-compositionality in multilingual CNLs. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 147-154. 20-22 Aug, 2014, Galway, Ireland.

[5] Gordon Pace and Michael Rosner. Explaining Violation Traces with Finite State Natural Language Generation Models. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 179-189. 20-22 Aug, 2014, Galway, Ireland.

[6] Guntis Barzdins. FrameNet CNL: a Knowledge Representation and Information Extraction Language. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 90-101. 20-22 Aug, 2014, Galway, Ireland.

[7] Dana Dannells and Normunds Gruzitis. Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 155-166. 20-22 Aug, 2014, Galway, Ireland.

[8] C. Maria Keet and Langa Khumalo. Toward verbalizing ontologies in isiZulu. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 78-89. 20-22 Aug, 2014, Galway, Ireland.

[9] Juyeon Kang and Patrick Saint-Dizier. Towards an Error Correction Memory to Enhance Technical Texts Authoring in LELIE. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 55-65. 20-22 Aug, 2014, Galway, Ireland.

[10] Karolina Suchowolec. Are Style Guides Controlled Languages? The Case of Koenig & Bauer AG. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 112-122. 20-22 Aug, 2014, Galway, Ireland.

## Coupon collecting the computing way

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have $\frac{total\_no\_of\_stickers}{5}*P$, and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one, the second card has a $\frac{638}{637}$ chance of being new, and so on to the last card $\frac{638}{1}$, wich follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

$\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}$

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

def panini(n):
     tns = 0
     for i in range(1,n):
          tns = n/i + tns
     return tns

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is an harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant ${\sf \gamma}$ (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is $n*H_n$, with $H_n$ the harmonic number, and the whole thing equalling also

$n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty$

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], anong many resources. If that’s not enough, $\gamma \approx 0.5772156649$, with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formula on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

$total\_no\_of\_stickers = N*\mbox{ln}(N) + \gamma$

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for $\gamma$, i.e., how many decimal digits, and only $\gamma$ or the whole of Eq.2. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8977 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18389 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, mot people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.

## A metamodel-driven approach for linking conceptual data models

Interoperability among applications and components of large complex software is still a bit of a nightmare and a with-your-hands-in-the-mud scenario that no-one looks forward to—people look forward to already having linked them up, so they can pose queries across departmental and institutional boundaries, or even across the different data sets within a research unit to advance their data analysis and discover new things.

Sure, ontologies can help with that, but you have to develop one if none is available, and sometimes it’s not even necessary. For instance, you have an ER diagram for the database and a UML model for the business layer. How do you link up those two?

Superficially, this looks easy: an ER entity type matches up with a UML class, and an ER relationship with an UML association. The devil is in the details, however. To name just a few examples: how are you supposed to match a UML qualified association, an ER weak entity type, or an ORM join-subset constraint, to any of the others?

Within the South Africa – Argentina bilateral collaboration project (scope), we set out to solve such things. Although we first planned to ‘simply’ formalize the most common conceptual data modelling languages (ER, UML, and ORM families), we quickly found out we needed not just an ‘inventory’ of terms used in each language matched to one in the other languages, but also under what conditions these entities can be used, hence, we needed a proper metamodel. This we published last year at ER’13 and MEDI’13 [1,2], which I blogged about last year. In the meantime, we not only have finalized the metamodel for the constraints, but also formalized the metamodel, and a journal article describing all this is close to being submitted.

But a metamodel alone doesn’t link up the conceptual data models. To achieve that, we, Pablo Follottrani and I, devised a metamodel-driven approach for conceptual model interoperability, which uses a formalised metamodel with a set of modular rules to mediate the linking and transformation of elements in the conceptual models represented in different languages. This also simplifies the verification of inter-model assertions and model conversion. Its description has recently been accepted as a full paper at the 8th International Web Rule Symposium 2014 (RuleML’14) [3], which I’ll present in Prague on 18 August.

To be able to assert a link between two entities in different models and evaluate automatically (or at least: systematically) whether it is a valid assertion and what it entails, you have to know i) what type of entities they are, ii) whether they are the same, and if not, whether one can be transformed into the other for that particular selection. So, to be able to have those valid inter-model assertions, an approach is needed for transforming one or more elements of a model in one language into another. The general idea of that is shown in the following picture, and explained briefly afterward.

Overview of the approach to transform a model represented in language A to one in language B, illustrated with some sample data from UML to ORM2 (Fig 1 in [3])

We have three input items (top of the figure, with the ovals), then a set of algorithms and rules (on the right), and two outputs (bottom, green). The conceptual model is provided by the user, the formalized metamodel is available and a selection of it is included in the RuleML’14 paper [3], and the “vocabulary containing a terminology comparison” was published in ER’13 [1]. Our RuleML paper [3] zooms in on those rules for the algorithms, availing of the formalized metamodel and vocabulary. To give a taste of that (more below): the algorithm has to know that a UML class in the diagram can be mapped 1:1 to a ORM entity type, and that there is some rule or set of rules to transform a UML attribute to an ORM value type.

This is also can be used for the inter-model assertions, albeit in a slightly modified way overall, which is depicted below. Here we use not only the formalised metamodel and the algorithms, but also which entities have 1:1 mappings, which are equivalent but need several steps (called transformations), and which once can only be approximated (requiring user input), and it can be run in both directions from one fragment to the other (one direction is chosen arbitrarily).

Overview for checking the inter-model assertions, and some sample data, checking whether the UML Flower is the same as the ORM Flower (Fig. 2 in [3]).

The rules themselves are not directly from one entity in one model to another entity in another, as that would become too messy, isn’t really scalable, and would have lots of repetitions. We use the more efficient way of declaring rules for mapping a conceptual data model entity into its corresponding entity in the metamodel, do any mapping, transformation, or approximation there in the metamodel, and then map it into the matching entity in the other conceptual data model. The rules for the main entities are described in the paper: those for object types, relationships, roles, attributes, and value types, and how one can use those to build up more complex ones for validation of intermodal assertions.

This metamodel-mediated approach to the mappings is one nice advantage of having a metamodel, but one possibly could have gotten away with just having an ‘inventory’ of the entities, not all the extra effort with a full metamodel. But there are benefits to having that metamodel, in particular when actually validating mappings: it can drive the validation of mappings and the generation of model transformations thanks to the constraints declared in the metamodel. How this can work, is illustrated in the following example, showing one example of how the “process mapping assertions using the transformation algorithms” in the centre-part of Fig. 2, above, works out.

Example. Take i) two models, let’s call them Ma and Mb, ii) an inter-model assertion, e.g., a UML association Ra and an ORM fact type Rb, ii) the look-up list with the mappings, transformations, approximations, and the non-mappable elements, and iii) the formalised metamodel. Then the model elements of Ma and Mb are classified in terms of the metamodel, so that the mapping validation process can start. Let us illustrate that for some Ra to Rb (or vv.) mapping of two relationships.

1. For the vocabulary table, we see that UML association and ORM fact type correspond to Relationship in the metamodel, and enjoy a 1:1 mapping. The ruleset that will be commenced with are R1 from UML to the metamodel and 2R to ORM’s fact type (see rules in the paper).
2. The R1 and 2R rules refer to Role and Object type in the metamodel. Now things become interesting. The metamodel has represented that each Relationship has at least two Roles, which there are, and each one causes the role-rules to be evaluated, with Ro1 of Ra’s two association ends into the metamodel’s Role and 2Ro to ORM’s roles (‘2Ro’ etc. are the abbreviations of the rules; see paper [3] for details).
3. The metamodel asserts that Role must participate in the rolePlaying relationship and thus that it has a participating Object type (possibly a subtype thereof) and, optionally, a Cardinality constraint. Luckily, they have 1:1 mappings.
4. This, in turn causes the rules for classes to be evaluated. From the classes, we see in the metamodel that each Object type must have at least one Identification constraint that involves one or more attributes or value types (which one it is has, been determined by the original classification). This also then has to be mapped using the rules specified.

This whole sequence was set in motion thanks to the mandatory constraints in the metamodel, having gone Relationship to Role to Object type to Single identification (that, in turn, consults Attribute and Datatype for the UML to ORM example here). The ‘chain reaction’ becomes longer with more elaborate participating entities, such as a Nested object type.

Overall, the whole orchestration is no trivial matter, requiring all those inputs, and it won’t be implemented in one codefest on a single rainy Sunday afternoon. Nevertheless, the prospect for semantically good (correct) inter-model assertions across conceptual data modeling languages and automated validation thereof is now certainly a step closer to becoming a reality.

References

[1] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[2] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[3] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

## More lively Theory of Computation lectures with peer instruction

Although I don’t teach theory of computation these days, I did encounter something that I’m pretty sure of that would have made the lectures more lively compared to my attempts (blogged about earlier here), and deserves some ‘air time’: peer instruction [1].

My database systems, computer literacy, and networks students already encountered it during some of my lectures, but to briefly recap the general idea (pioneered by Eric Mazur): you insert ‘quiz questions’ (i.e., concept tests) into the lectures, students vote on the multiple choice question, then there is a discussion among the students about the answers, and a re-vote. This sounds perhaps too playful for higher education, but it has been shown to improve course grades by 34-45% [1, 2], thanks to, among others, active engagement, focusing on understanding principles, and student discussion involving ‘teaching each other’.

Even if one thinks it can be suitable for some courses but not others (like tough computer science courses), there’s even peer instruction course material for something so abstract as theory of computation, pioneered by Cynthia Lee and available through peerinstruction4cs. And it works [2].

The peer instruction slides for theory of computation and several other courses can be downloaded after registration. This I did, both out of curiosity and for getting ideas how I might improve the quizzes for the networks course I’m teaching, for which no materials are available yet. Here, I’ll give two examples of such questions for theory of computation to give an indication that it is feasible.

The first one is about tracing through an NFA, and grasping the notion that if any trace through the NFA computation ends in a final state with all input read, it accepts the string. The question is shown in the figure below, which I rendered with the AutoSim automata simulator.

Peer instruction quiz question about NFAs.

A and D can be excluded ‘easily’: A has the second trace as [accept] but it rejects, and D has the first trace [reject] but it accepts (which can be checked nicely with the AutoSim file loaded into the tool). The real meat is with options B and C: if one trace accepts and the other rejects, then does the NFA as a whole accept the string or not? Yes, it does; thus, the answer is B.

The other question is further into the course material, on understanding reductions. Let us have ATM and it reduces to MYSTERY_LANGUAGE. Which of the following is true (given the above statement is true):

1. a) You can reduce to MYSTERY_LANG from ATM.
2. b) MYSTERY_LANG is undecidable.
3. c) A decider for MYSTERY_LANG (if it exists) could be used to decide ATM.
4. d) All of the above

Answering this during the lecture requires more than consulting a ‘cheat sheet’ on reductions. A is a rephrasing of the statement, so that holds (though this option may not be very fair with students whose first language is not English). You will have to remember that ATM is undecidable, and, if visually oriented, the following figure:

Graphical depiction of reduction

So, if ATM is undecidable (the “P1” on the left), then so is MYSTERY_LANGUAGE (the “P2” on the right); for the aficionados: there’s a theorem for that in Chapter 9 of [3], being ‘if there is a reduction from P1 to P2, then if P1 is undecidable, then so is P2’. So B holds. Thus, the answer is probably D, but let’s look at C just to be sure of D. Because ATM is ‘in’ MYSTERY_LANGUAGE, then if we can decide MYSTERY_LANGUAGE, then so can we decide ATM—it just happens to be the case we can’t (but if we could, it would have been possible). Thus, indeed, D is the answer.

Sure, there is more to theory of computation than quizzes alone, but feasible it is, and on top of that, student perception of peer instruction is apparently positive [4]. But before making the post too long on seemingly playful matters in education, I’m going to have a look at how that other game—the second semi-final—will unfold to see how interesting the world cup final is going be.

References

[1] Crouch, C.H., Mazur, E. (2001). Peer Instruction: Ten years of experience and results. American Journal of Physics, 69, 970-977.

[2] Bailey Lee, C., Garcia, S., Porter, L. (2013). Can Peer Instruction Be Effective in Upper-Division Computer Science Courses?. ACM Transactions on Computing Education, 13(3): 12-22.

[3] Hopcroft, J.E., Motwani, R. & Ullman, J.D. (2007). Introduction to Automata Theory, Languages, and Computation, 3rd ed., Pearson Education.

[4] Simon, B., Esper, S., Porter, L., Cutts, Q. (2013). Student experience in a student-centered peer instruction classroom. Proceedings of the International Computing Education Research Conference (ICER’13), ACM Conference Proceedings. pp129-136.

## ACM ICPC 2014 solution to problem A – baggage

Some of you already know I was on-site coach for the UCT team at the ACM Inter Collegiate Programming Contest World Finals in Ekaterinburg, held from 22 to 26 June, 2014. The problems were unusually hard this year, and solving 4 out of the 10 problems already got teams into the bronze medal range [results]. The technical coach of the UCT team, Bruce Merry, has written an analysis of 6 problems on his blog (D, K, C, E, B, and F–update: now also I and G, J, L, and H), and over at TopCoder, SnapDragon discusses 4 of the 10 problems (A, E, F, G), except that SnapDragon does not provide a solution to A (and I don’t like his hint of brute-force code-and-try) [update 3-7: there’s description of his solution here now]. Googling, I couldn’t find someone else discussing the solution to problem A, so here’s mine, which I solved on the plane from Moscow to Frankfurt (among other activities, and one among the four flights we had to take to arrive back in Cape Town).

The problem

Stripping the “baggage problem” to its essentials: you have a sequence of alternating Bs and As, which has to be reordered to first all the As then all the Bs, and the reordering has to occur moving two letters at the same time, and remain adjacent. For instance, with an n = 4, we have 2n characters, BABABABA, and in the end after sorting, it then will be AAAABBBB, and this has to occur in the minimum amount of moves. N is randomly chosen, and is an integer between 3 and 100. The cells on your tape are numbered from 0 to -2n+1, so with our n = 4, from -7 to 8, and the first B is on position 1.

Update (3-7): This problem appears to be “Tait’s counter puzzle”, which Peter Guthrie Tait, a Scottish mathematical physicist, described in 1884 [see description] (thanks to Davi Duarte who mentioned it in the topcoder thread).

Solution

Informally, there are two parts to the algorithm: move around the As and Bs to pair them as AA and BB, which cost you n/2 moves if n is even and n/2-rounded up moves if n is odd, and then sort those pairs in the remainder to a total of n moves. The first part occurs alternating moving AB from the right to the left, starting at the last AB (position 2n-2) and then every other 4 to its left, and BA from left to right, starting from position 3 and every other 4 positions to the right (i.e., 7, 11, etc.), and then the BBs are ferried to the right from left to right, and the AAs from the right to the left, also alternating. N=3 is an exception.

One can do this with a neat mathematical proof, but I found out with pattern creation and recognition, visualizing the provided sample input and moves for n = 5 and n = 8, which can be done in 5 resp. 8 steps (given in the problem statement as sample output). Here’s a description of that approach.

First, devising one for 3 is trivial, using the following moves (the only option), using the position indicated with the position of the first letter, and highlighting the ones that will be moved next:

Start: . . . . . . BABABA
Move 2 to -1, which gives . . . . ABB . . ABA
Move 5 to 2, which gives . . . . ABBBAA . .
Move 3 to -3, which gives . . AAABBB . . . .

This already suggests that the minimum amount of moves will always be n, because it is n moves also for n = 5 and n = 8. Next, devising one for n = 4, and knowing the moves for n = 3 and n = 5, I tried to work it out for BABABABA, i.e., the first AB, like with n = 3, BABABABA, and BABABABA like with the n = 5 case. The last one is the only one that worked in 4 moves, starting with moving 6 to -1, again. Again, because for all ns so far, the first move is to -1.

Let’s put this to the test with n = 7, with the ones to be moved in bold and the empty cells indicated with dots:

..BABABABABABABA

i.e., move the last AB (thus, position 12) as this worked for n=4 and n=5, then

ABBABABABABAB..A

i.e., move the first BA after position 1 (thus, position 3)

ABBA..BABABABBAA

i.e., move the last AB before a pair (thus, position 8)

ABBAABBAB..ABBAA

i.e., move the first BA after position 1 (thus, position 5). Then sorting the BBs and AAs:

ABBAAB..BBAABBAA
A..AABBBBBAABBAA
AAAAABBBBB..BBAA
AAAAABBBBBBB..AA
AAAAAAABBBBBBB..

This was enough for me to have discovered the pattern of alternating for the matching and alternating for the sorting.

What is ‘nasty’ in the problem description, retrospectively and for the pattern-based approach vs. a neat maths-y proof at least, is that the pattern deviates for n = 3, and the provided sample output for n = 8, although in 8 steps, is an alternative solution, not one that one obtains with the algorithm. With the approach as mentioned above, we have for n = 8 the following (I did that one to double-check):

..BABABABABABABABA
ABBABABABABABAB..A
ABBA..BABABABABBAA
ABBAABBABAB..ABBAA
ABBAABBA..BBAABBAA
A..AABBABBBBAABBAA
AAAAABBABBBB..BBAA
AAAAA..ABBBBBBBBAA
AAAAAAAABBBBBBBB..

This confirmed it works. It may look a bit craft-y, but the patterns show beautifully with larger n, which are shown below for the even and odd case (made just for the blog post, not in solving it); squint your eyes if it’s not immediately clear.

N = 16

..BABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABBAABBAA
ABBAABBAABBABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABBAABBAABBAA
ABBAABBAABBAABBABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BBAABBAABBAABBAA
A..AABBAABBAABBABBBBAABBAABBAABBAA
AAAAABBAABBAABBABBBB..BBAABBAABBAA
AAAAA..AABBAABBABBBBBBBBAABBAABBAA
AAAAAAAAABBAABBABBBBBBBB..BBAABBAA
AAAAAAAAA..AABBABBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBABBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..ABBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAABBBBBBBBBBBBBBBB..

N = 17

..BABABABABABABABABABABABABABABABABA
ABBABABABABABABABABABABABABABABAB..A
ABBA..BABABABABABABABABABABABABABBAA
ABBAABBABABABABABABABABABABAB..ABBAA
ABBAABBA..BABABABABABABABABABBAABBAA
ABBAABBAABBABABABABABABAB..ABBAABBAA
ABBAABBAABBA..BABABABABABBAABBAABBAA
ABBAABBAABBAABBABABAB..ABBAABBAABBAA
ABBAABBAABBAABBA..BABBAABBAABBAABBAA
ABBAABBAABBAABBAABB..BAABBAABBAABBAA
A..AABBAABBAABBAABBBBBAABBAABBAABBAA
AAAAABBAABBAABBAABBBBB..BBAABBAABBAA
AAAAA..AABBAABBAABBBBBBBBBAABBAABBAA
AAAAAAAAABBAABBAABBBBBBBBB..BBAABBAA
AAAAAAAAA..AABBAABBBBBBBBBBBBBAABBAA
AAAAAAAAAAAAABBAABBBBBBBBBBBBB..BBAA
AAAAAAAAAAAAA..AABBBBBBBBBBBBBBBBBAA
AAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBB..

Neat. Writing the code is left as an exercise to the reader.

I’m looking forward to the other problems (except the crane balancing problem C, whose description I did not like, at all), the upcoming regionals and next year’s ICPC World Finals in Morocco (if not winning, then to have at least a great time like we had this time)!

## First steps for isiZulu natural language generation

Yes, Google Translate English-isiZulu does exist, but it has many errors (some very funny) and there’s a lot more to Natural Language Generation (NLG) than machine translation, such as natural language-based query interfaces that has some AI behind it, and they are needed, too [1]. Why should one bother with isiZulu? Muendane has his lucid opinions about that [2], and in addition to that, it is the first language of about 23% of the population of South Africa (amounting to some 10 million people), about half can speak it, and it is a Bantu language, which is spoken by nearly 300 million people—what works for isiZulu grammar may well be transferrable to its related languages. Moreover, it being in a different language family than the more well-resourced languages, it can uncover some new problems to solve for NLG, and facilitate access to online information without the hurdle of having to learn English or French first, as is the case now in Sub-Saharan Africa.

The three principal approaches for NLG are canned text, templates, and grammars. While I knew from previous efforts [3] that the template-based approach is very well doable but has its limitations, and knowing some basic isiZulu, I guessed it might not work with the template-based approach but appealing if it would (for a range of reasons), that no single template could be identified so far was the other end of the spectrum. Put differently: we had to make a start with something resembling the foundations of a grammar engine.

Langa Khumalo, with the Linguistics program and director of the University Language Planning and Development Office at the University of KwaZulu-Natal, and I have been trying to come up with isiZulu NLG. We have patterns and algorithms for (‘simple’) universal and existential quantification, subsumption, negation (class disjointness), and conjunction; or: roughly OWL 2 EL and a restricted version of ALC. OWL 2 EL fist neatly with SNOMED CT, and therewith has the potential for interactive healthcare applications with the isiZulu healthcare terminologies that are being developed at UKZN.

The first results on isiZulu NLG are described in [4,5], which was not an act of salami-slicing, but we had more results than that fitted in a single paper. The first paper [4] will appear in the proceedings ofthe 4th workshop on Controlled Natural language (CNL’14), and is about finding those patterns and, for the options available, an attempt at figuring out which one would be best. The second paper [5], which will appear in the 8th International Web Rule Symposium (RuleML’14) conference proceedings, is more about devising the algorithms to make it work and how to actually generate those sentences. Langa and I plan to attend both events, so you can ask us about the details either in Prague (18-20 Aug) or Galway (20-22 Aug) in person. In the meantime, the CRCs of the papers are online (here and here).

Regarding the technical aspects, the main reasons why we cannot get away with devising templates to generate isiZulu controlled natural language is that isiZulu is non-trivial:

• There is a whole system of noun classes: nouns are grouped in one of the 17 noun classes, each with their own peculiarities, which is illustrated in Figure 1, below;
• Agglutination, informally: putting lots of bits and pieces together to make a word. A selection of those so-called ‘concords’ is included in Figure 2, below;
• Phonological conditioned copulatives, meaning that the ‘is a’ depends on the term that comes after it (ng or y); and
• Complex verb conjugation.

isiZulu noun classes with an example (source: [5]).

A selection of isiZulu concords (source: [5])

What does this mean for the verbalization? In English, we use ‘Each…’ or ‘For all…’ for the universal quantifier and it doesn’t matter over which noun it is quantified. In isiZulu, it does. Each noun class has its own ‘each’ and ‘for all’, and it is not acceptable (understandable) to use one for the wrong noun class. For disjointness, like “Cup is not a Glass” (${\sf Cup \sqsubseteq \neg Glass}$ in DL), in English we have the ‘is not a’ regardless what comes before or after the subsumption+negation, but in isiZulu, the copulative is omitted, the first noun (OWL class, if you will) brings in a so-called negative subject concord, the second noun brings in a pronominal, and they are glued together (e.g., Indebe akuyona Ingilazi, where the second word is composed of aku + yona), and to top it off, each noun class has its own concord and pronomial. A seemingly simple conjunction—just an ‘and’ in English—has to be divided into an and-when-it-is-used-in-an-enumeration and an and-when-it-is-a-connective, and when it is used in an enumeration, it depends on the first letter of the noun that comes after the ‘and’. Existential quantification is even more of a hassle. The table below shows a very brief summary comparing typical patterns in English with those for isiZulu.

A few DL symbols, their typical verbalization options in English, and an indication of possible patterns (source: [4])

We did ask isiZulu speakers which of the possible options they preferred (in a survey, with Limesurvey localized to isiZulu), but there wasn’t an overwhelming consistent agreement among them except for one of the options for existential quantification (the –dwa option), although there was more agreement among the linguists than among the non-linguists, possibly due to dialect influences (results can be found in [4]).

If you don’t feel like reading the two papers, but still would like to have some general overview and examples, you also can check out the slides of the CS colloquium I gave last week. I managed to ‘lure in’ also ICT4D people—and then smack them with a bit of logic and algorithms—but the other option, being talking about the other paper accepted at RuleML, probably would have had to be a ‘cookie colloquium’ to get anyone to attend (more about that paper in another post—it is fascinating, but possibly of less interest to a broader audience). If you want to skip the tedious bits and just get a feel of how one of the algorithms works out: check out the example starting on slide 63, which shows the steps to go from ${\sf \forall x (uSolwazi(x) \rightarrow \exists y (ufundisa(x, y) \land Isifundo(y)))}$ in FOL, or ${\sf uSolwazi \sqsubseteq \exists ufundisa.Isifundo}$ in DL (“Each professor teaches at least one course”, if the vocabulary were in English), to “Bonke oSolwazi bafundisa isifundo esisodwa”.

Clearly, a lot remains to be done.

References

[1] Alberts, R., Fogwill, T., Keet, C.M. Several Required OWL Features for Indigenous Knowledge Management Systems. 7th Workshop on OWL: Experiences and Directions (OWLED’12). 27-28 May, Heraklion, Crete, Greece. CEUR-WS Vol-849. 12p

[2] Muendane, N.M. I am an African. 2006, Soultalk CC.

[3] Jarrar, M., Keet, C.M., Dongilli, P. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[4] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), 20-22 August 2014, Galway, Ireland. Springer LNAI. (in press)

[5] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), August 18-20, 2014, Prague, Czech Republic. Springer LNCS (in press).