Enjoyable and interesting controlled natural languages workshop (CNL’14)

Conferencing in Ireland was a good experience again. Like EKAW 2012, the Fourth Workshop on Controlled Natural Language (CNL’14) was held in the Aula Maxima at the University of Galway, a beautiful ivy-covered building conducive of a stimulating scientific atmosphere and, as any good event, one leaves with plenty of ideas to pursue, and it was a good ambience to meet up again with colleagues as well as meeting new ones, such as Allan Third of the SWAT natural language tools that we use in the ROMULUS foundational ontology library. The remainder of this post is a quick write-up about several of the papers and presentations, written during an otherwise lost moment at Dublin airport.

If you’re not too familiar with CNLs, a useful brief overview to start with is Safwat and Davis’ state of the art [1]. However, some of you might first prefer to read something that is one of the answers to “what would it be good for?”; in that case, I can highly recommend the paper on automatically generating the Swiss avalanche bulleting in 4 languages [2], presented by Kurt Winkler: not only their participants found it very difficult to figure out which ones were manually generated and which ones automatically (55% correct, on average), but also the CNL attendees had trouble with ‘guessing’ it right (yeah, including me). From a technical perspective, it uses a catalogue-based translation system with chunks of text segments. Rather more theoretical were the two papers on the Grammatical Framework. The first one was the invited talk by Aarne Ranta [3] about embedded controlled languages. He provided a brief overview of GF (which started in 1998 at Xerox in Grenoble) up to the current state in the EU project Molto for multilingual machine translation, and different levels of quality of the generated text. Inari Listenmaa presented an extension to the system so that GF will be able to handle compositionality [4].

Interesting to me was the question whether CNLs exist for generating text about temporal events, in part because I’ve another strand of research on temporal conceptual data modelling. Not everyone agreed whether there was anything other than simple stories, but it was hard to find much about it (if you do or know of it, please leave a pointer in the comments). Gordon Pace presented results on verbalizing finite state machines (events with properties), in particular violation traces through the FSM [5]; e.g., when one has a process for logins and failed logins that is violated, the sysadmin needs to know what has happened, and ideally be informed about what essentially went wrong in an intelligible way, and summarized rather than having to pour over endless logs.

On the multilingual front for less common languages, there were two papers for Latvian involving FrameNet for their controlled natural language [6,7], and Langa Khumalo presented our joint paper about isiZulu natural language generation [8] about which I blogged earlier.

Last, but not least—and, more precisely: first—the best paper award. It was awarded to two papers, being to the paper on technical text authoring by Juyeon Kang and Patrick Saint-Dizier [9] and to the paper on style guides as controlled languages, by Karolina Suchowolec [10].

The next CNL workshop will be held in about 2 years time, also most likely co-located with a larger conference (now it was co-located with COLING in Dublin), and some other activities are also in the pipeline, such as a mailing list, wiki etc. so it will be easier for people to stay tuned with the latest developments in CNLs. I’m already looking forward to the next installment of the event.

References

Note: all links are to the CRCs posted on arxiv; the final versions formatted by Springer are on the Springer site (behind a paywall for most people).

[1] Hazem Safwat and Brian Davis. A Brief State of the Art of CNLs for Ontology Authoring. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 190-200. 20-22 Aug, 2014, Galway, Ireland.

[2] Kurt Winkler, Tobias Kuhn and Martin Volk. Evaluating the fully automatic multi-language translation of the Swiss avalanche bulletin. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 44-54. 20-22 Aug, 2014, Galway, Ireland.

[3] Aarne Ranta. Embedded Controlled Languages. (invited paper). Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 1-7. 20-22 Aug, 2014, Galway, Ireland.

[4] Ramona Enache, Inari Listenmaa and Prasanth Kolachina. Handling non-compositionality in multilingual CNLs. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 147-154. 20-22 Aug, 2014, Galway, Ireland.

[5] Gordon Pace and Michael Rosner. Explaining Violation Traces with Finite State Natural Language Generation Models. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 179-189. 20-22 Aug, 2014, Galway, Ireland.

[6] Guntis Barzdins. FrameNet CNL: a Knowledge Representation and Information Extraction Language. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 90-101. 20-22 Aug, 2014, Galway, Ireland.

[7] Dana Dannells and Normunds Gruzitis. Controlled Natural Language Generation from a Multilingual FrameNet-based Grammar. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 155-166. 20-22 Aug, 2014, Galway, Ireland.

[8] C. Maria Keet and Langa Khumalo. Toward verbalizing ontologies in isiZulu. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 78-89. 20-22 Aug, 2014, Galway, Ireland.

[9] Juyeon Kang and Patrick Saint-Dizier. Towards an Error Correction Memory to Enhance Technical Texts Authoring in LELIE. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 55-65. 20-22 Aug, 2014, Galway, Ireland.

[10] Karolina Suchowolec. Are Style Guides Controlled Languages? The Case of Koenig & Bauer AG. Fourth Workshop on Controlled Natural Language (CNL’14). Springer LNAI vol 8625, 112-122. 20-22 Aug, 2014, Galway, Ireland.

Coupon collecting the computing way

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have $\frac{total\_no\_of\_stickers}{5}*P$, and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one, the second card has a $\frac{638}{637}$ chance of being new, and so on to the last card $\frac{638}{1}$, wich follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

$\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}$

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

def panini(n):
     tns = 0
     for i in range(1,n):
          tns = n/i + tns
     return tns

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is an harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant ${\sf \gamma}$ (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is $n*H_n$, with $H_n$ the harmonic number, and the whole thing equalling also

$n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty$

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], anong many resources. If that’s not enough, $\gamma \approx 0.5772156649$, with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formula on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

$total\_no\_of\_stickers = N*\mbox{ln}(N) + \gamma$

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for $\gamma$, i.e., how many decimal digits, and only $\gamma$ or the whole of Eq.2. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8977 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18389 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, mot people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.