# This blog is now 10 years old

Writing the title of this post does make me wonder how it happened. That blogs are still being read, WordPress that hosts it is still around, and I’m still in academia writing about research and other topics. Honestly, when I started dabbling in writing blog posts, I didn’t expect to last it this long, nor when it celebrated 5 years that another 5 would be added. Nor did I expect to end up persistently receiving typically over 1000 visitors/month, which is fairly popular given the blog’s topics, and surpass the 100 000th visitor some time last year. Admitted, there are not a lot of comments, but then, nor do I comment a lot on other blogs (uncontrollable digital footprint and all that). So, I sat down and wrote a few reflections, which might be of use to someone thinking of starting a (science-oriented) blog or having a dip in posting.

Some pros

Having a blog is useful for learning to try to simplify one’s own research papers into a roughly presentable ‘sound byte’ that can be read in a few minutes. (I don’t get a lot of click-throughs to my papers though, so it may not help with getting people to actually read your papers.)

It is useful to push oneself to take notes during conferences and read those papers, and therewith also reflect on the conferences and workshops one attended (Which papers were actually interesting? Which ones might be useful for your own research? Did someone present a cool solution to a problem you knew of but hadn’t had [at all/enough] time for to solve but are happy someone did?).

If you don’t know what to write about: present/discuss a paper you found interesting, or disagreed with. This helped me at the start to post and not let the blog fizzle out. Admittedly, I have plenty of accepted papers now so as to space it out to ‘market’ one per month. Nevertheless, giving ‘airtime’ to others has its merits, especially when they are in the area of your research, for it shows you actually do read other papers, critically. The offline thank-yous are just a bonus.

That said, it is nice to get feedback from other researchers and readers on the posts; and it really doesn’t matter whether that is left as a comment on the blog, by email, or in person. While they may be pleased I mentioned their paper in a post, I’m happy to know someone used up some of that extremely precious resource—time—to read what I wrote.

I like to think that my writing has improved over the years. If not that, then at least it now takes less time to write at that very same level.

Some caveats

That much for the good side that I could come up with. What about the negative side? Mainly, it does take up quite a lot of time to write up the posts. Writing one evening, reading and revising it the next day, and all the layouting of the post can take up several hours. Those hours could have been spent differently.

You won’t know upfront which posts will be ‘popular’; some posts that I liked aren’t popular at all, yet some that I thought were minor, are. I haven’t deciphered a correlation between effort put in to write a post and its popularity either. I write what I fancy writing about and then hope for the best. Looking at the statistics of page visits, popular-science posts have many, many more visits than posts about my research.

Even when trying to be polite when writing, reading some of the posts years hence, some things seem to be formulated harsher than intended. There is the danger to piss off someone, and therewith feeding the rumour that blogs may be harmful.

Related to the latter, is that there are some things I wanted, or even craved, to write about, but couldn’t due to the decision to have a non-anonymous blog. That said, even anonymous blogs can be ‘revealed’ (e.g., fsp), and some things just fester better through the grapevine.

You may have trolls. I did have them. This depended on the topic (such as positive posts about Cuba), but may also be immature students or a colleague with an axe (of the xenophobic, racist, and/or sexist type) to grind. Ignore them.

Final remarks

Does doing this blogging actually have any impact on the level of my ‘popularity’ or ‘standing’ (good or bad) in the research community? I don’t have the faintest idea.

Will I go on for another 10 years? I don’t know. For now, I still try to write at least 2 posts per month. I hope you stay with me, but I also know that interests change, so I will not hold abandoning keetblog against you J. In fact, I am grateful you have taken, take, and/or will take the time to read the blog, and I hope you consider it time well spent, not wasted, or that it may have been effective for your structured procrastination.

# Pluralising isiZulu nouns, automatically

Generating the plural of a noun in the singular looked so trivial… one set of ‘boring’ rules from the prefixes listed in the standard table of the noun class system for isiZulu—neatly paired by singular and plural—and you’re done. That is also why we had in the original verbalisation algorithm in the RuleML14 paper just one method [1]. And then came the testing with a set of nouns, where some nouns did not quite stick to that neat table; e.g., indoda ‘man’ is in noun class 9, so ought to take the prefix for noun class 10 as plural, but it is amadoda ‘men’ (noun class 6), or take noun class pair 7 and 8, with prefixes isi- and izi-, for singular and plural, respectively: that generally holds, just not in those cases where the stem commences with a vowel. We bumped into a few of those, requiring us to take a step back and investigate pluralising isiZulu nouns systematically, how well that ‘standard table’ actually works, what can be said about those nouns that don’t quite adhere to it, and whether their underlying causes occur also in other Bantu languages.

We have the answers to those questions now, which are described in the paper entitled “Pluralising Nouns in isiZulu and Related Languages” that recently got accepted at the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), which will take place next week in Konya, Turkey.

Pluralising nouns is nothing novel per se, and the general approach is to take some grammar book and write a bunch of regular expressions or rules, or use a data-oriented approach and find all the rules that way. That doesn’t quite work for isiZulu due to it being an underresourced language, so we ended up designing a set of basic rules manually, and then find other rules through experimentation, i.e., a combined knowledge and data approach.

The knowledge-based part was concerned, first, with the choice whether regular expressions would suffice (syntax-only based approach), designing a few automata by availing of the prefixes in that standard table. This made evident that that was unlikely to work well: some noun classes, e.g., 1 and 3, take the same prefix (um or umu) yet have a different prefix for their plural counterpart (aba for noun class 2 and imi for noun class 4), plus it does not say when it is um and when umu. The latter prefix is used with monosyllabic stems, but we have no way of identifying a monosyllabic stem. Thus, we’d need the semantics (noun class) to go with the regular expression as input, which is unlike any pluralisation algorithm for other languages (that we know of). For the size of the automaton, it doesn’t matter if one checks first the prefix and then the noun’s noun class or vv.

For the experiment, we compiled two set of nouns: one was constructed in English by taking class names for multiple ontologies (set 1), the other was compiled by picking each first-listed noun on the left-hand page of an isiZulu dictionary (set 2). To test for generalizability, we took Runyankore, a language spoken in Uganda, which is in a different subfamily from isiZulu.

So, just how bad is nouns-only, i.e., just some regular expression based on the standard table? The accuracy was about 50%, which is pretty dismal. Just adding a noun’s noun class made the accuracy jump up to about 80-90%! The accuracy of each iteration is shown in the table below.

Accuracy of pluralisation with the incremental versions of the pluraliser (source: [2])

So, what were the other culprits? First, compound nouns, such as indawo yokubhukuda ‘swimming pool’, for it is not only the main noun that has to be pluralised (indawo to izindawo), but the other word still has to be in agreement with the first, so it is not yokubhukuda but zokubhukuda for the plural. We managed to devise 4 additional rules for those cases. Another issue were the mass nouns (amount of matter): they can exist in the singular and stay in the singular (iwayini ‘wine’), or are already in a plural class (amanzi ‘water’). There is no way to recognise those, so this has to be annotated. The plural exceptions (true exceptions) and prefix exceptions (regular ‘exception’) were mentioned earlier in this post, and some new rules were added for that. Adding all that generated 92% accuracy for the second set, and 100% for the first set of nouns. For the remaining errors, we have some ideas for how to resolve it, but linguists first have to check (see paper for details).

The initial results for Runyankore were better than for isiZulu, thanks to fewer recurring prefixes, but, interestingly, Runyankore had similar issues overall, notably with the compound nouns, mass nouns, and true exceptions, and some issues with loan words. Tone, not indicated in the orthography, popped up as well, which also holds for some isiZulu nouns (but they didn’t happen to have been in the test sets).

The data, analysis, and the software (including the source code) are available from the GeNI project page. The isiZulu pluraliser was coded up in Python and the Runyankore one in Java. Note that they are at the proof-of-concept level, not industry-grade tools, but you’re free to take it and make industry-grade level software out of it.

References

[1] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. Proceedings of the 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[2] Byamugisha, J., Keet, C.M., Khumalo, L. Pluralising Nouns in isiZulu and Related Languages. Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing’16), Springer LNCS. April 3-9, 2016, Konya, Turkey. (in print)

# Ontology authoring with a Test-Driven Development approach

Ontology development has its processes and procedures—conducting a domain analysis, the implementation, maintenance, and so on—which have been developed since the mid 1990s. These high-level information systems-like methodologies don’t tell you what and how to add the knowledge to the ontology, however, i.e., the ontology authoring stage is a ‘just do it’, but not how. There are, perhaps surprisingly, few methods for how to do that; notably, FORZA uses domain and range constraints and the reasoner to propose a suitable part-whole relation [1] and advocatus diaboli zooms in on disjointness constraints among classes [2]. In a way, they all use what can be considered as tests on the ontology before adding an axiom. This smells of notions that are well-known in software engineering: unit tests, test-driven development (TDD), and Agile, with the latter two relying on different methodologies cf. the earlier ones (waterfall, iterative, and similar).

Some of those software engineering approaches have been adjusted and adopted for ontology engineering; e.g., the Agile-inspired OntoMaven that uses the standard reasoning services as tests [3], eXtreme Design with ODPs [4] that have been prepared previously, and Vrandecic and Gangemi’s early exploration of possibilities for unit tests [5]. Except for Warrender & Lord’s TDD tests for subsumption [6], they are all test-last approaches (design, author, test), rather than a test-first approach (test, author, test). The test-first approach is called test-driven development in software engineering [7], which has been ported to conceptual modelling recently as well [8]. TDD is a step up from a “add something and lets see what the reasoner says” stance, because one has to think and check upfront first before doing. (Competency questions can help with that, but they don’t say how to add the knowledge.) The question that arises, then, is how such a TDD approach would look like for ontology development. Some of the more detailed questions to be answered are (from [9]):

• Given the TDD procedure for programming in software engineering, then what does that mean for ontologies during ontology authoring?
• TDD uses mock objects for incomplete parts of the code, and mainly for methods; is there a parallel to it in ontology development, or can that aspect of TDD be ignored?
• In what way, and where, (if at all) can this be integrated as a methodological step in existing ontology engineering methodologies?

We—Agnieszka Lawrynowicz from Poznan University of Technology in Poland and I—answer these questions in our paper that was recently accepted at the 13th Extended Semantic Web Conference (ESWC’16), to be held in May 2016 in Crete, Greece: Test-Driven Development of Ontologies [9]. In short: we specified tests for the OWL 2 DL language features and basic types of axioms one can add, implemented it as a Protégé plug-in, and tested it on performance with 67 ontologies (result: great). The tool and test data can be downloaded from Agnieszka’s ARISTOTELES project page.

Now slightly less brief than that one-liner. The tests—like for class subsumption, domain and range, a property chain—can be specified in two principal ways, called TBox tests and ABox tests. The TBox tests rely solely on knowledge represented in the TBox, whereas for the ABox tests, mock individuals are explicitly added for a particular TBox axiom. For instance, the simple existential quantification, as shown below, where the TBox query is denoted in SPARQL-OWL notation.

For the implementation, there is (1) a ‘wrapping’ component that includes creating the mock entities, checking the condition (line 2 in the TBox test example in the figure above, and line 4 in ABox TDD test), returning the TDD test result, and cleaning up afterward; and (2) the interaction with the ontology doing the actual testing, such as querying the ontology and class and instance classification. There are several options to realise component (2) of the TBox TDD tests. The query can be sent to, e.g., OWL-BGP [10] that uses SPARQL-OWL and Hermit to return the answer (line 1), but one also could use just the OWL API and send it to the automated reasoner, among other options.

Because OWL-BGP and the other options didn’t cater for the tests with OWL’s object properties, such as a property chain, so a full implementation would require extending current tools, we decided to first examine performance of the different options for (2) for those tests that could be implemented with current tools so as to get an idea of which approach would be the best to extend, rather than gambling on one, implement all, and go on with user testing. This TDD tool got the unimaginative name TDDonto and can be installed as a Protégé plugin. We tested the performance with 67 TONES ontology repository ontologies. The outcome of that is that the TBox-based SPARQL-OWL approach is faster than the ABox TDD tests (except for class disjointness; see figure below), and the OWL API + reasoner for the TBox TDD tests is again faster in general. These differences are bigger with larger ontologies (see paper for details).

Test computation times per test type (ABox versus TBox-based SPARQL-OWL) and per the kind of the tested axiom (source: [9]).

Finally, can this TDD be simply ‘plugged in’ into one of the existing methodologies? No. As with TDD for software engineering, it has its own sequence of steps. An initial sketch is shown in the figure below. It outlines only the default scenario, where the knowledge to be added wasn’t there already and adding it doesn’t result in conflicts.

Sketch of a possible ontology lifecycle that focuses on TDD, and the steps of the TDD procedure summarised in key terms (source: [9]).

The “select scenario” has to do with what gets fed into the TDD tests, and therewith also where and how TDD could be used. We specified three of them: either (a) the knowledge engineer provides an axiom, (b) a domain expert fills in some template (e.g., the ‘all-some’ one) and that software generates the axiom for the domain ontology (e.g., $Professor \sqsubseteq \exists teaches.Course$), or (c) someone wrote a competency question that is either manually or automatically converted into an axiom. The “refactoring” could include a step for removing a property from a subclass when it is added to its superclass. The “regression testing” considers previous tests and what to do when any conflicts may have arisen (which may need an interaction with step 5).

There is quite a bit of work yet to be done on TDD for ontologies, but at least there is now a first comprehensive basis to work from. Both Agnieszka and I plan to go to ESWC’16, so I hope to see you there. If you want more details or read the tests with a nicer layout than how it is presented in the ESWC16 paper, then have a look at the extended version [11] or contact us, or leave a comment below.

References

[1] Keet, C.M., Khan, M.T., Ghidini, C. Ontology Authoring with FORZA. 22nd ACM International Conference on Information and Knowledge Management (CIKM’13). ACM proceedings. Oct. 27 – Nov. 1, 2013, San Francisco, USA. pp569-578.

[2] Ferre, S. and Rudolph, S. (2012). Advocatus diaboli exploratory enrichment of ontologies with negative constraints. In ten Teije, A. et al., editors, 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW’12), volume 7603 of LNAI, pages 42-56. Springer. Oct 8-12, Galway, Ireland

[3] Paschke, A., Schaefermeier, R. Aspect OntoMaven – aspect-oriented ontology development and configuration with OntoMaven. Tech. Rep. 1507.00212v1, Free University of Berlin (July 2015)

[4] Blomqvist, E., Sepour, A.S., Presutti, V. Ontology testing — methodology and tool. In: Proc. of EKAW’12. LNAI, vol. 7603, pp. 216-226. Springer (2012)

[5] Vrandecic, D., Gangemi, A. Unit tests for ontologies. In: OTM workshops 2006. LNCS, vol. 4278, pp. 1012-1020. Springer (2006)

[6] Warrender, J.D., Lord, P. How, What and Why to test an ontology. Technical Report 1505.04112, Newcastle University (2015), http://arxiv.org/abs/1505.04112

[7] Beck, K.: Test-Driven Development: by example. Addison-Wesley, Boston, MA (2004)

[8] Tort, A., Olive, A., Sancho, M.R. An approach to test-driven development of conceptual schemas. Data & Knowledge Engineering 70, 1088-1111 (2011)

[9] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies. 13th Extended Semantic Web Conference (ESWC’16). Springer LNCS. 29 May – 2 June, 2016, Crete, Greece. (in print)

[10] Kollia, I., Glimm, B., Horrocks, I. SPARQL Query Answering over OWL Ontologies. In: Proc, of ESWC’11. LNCS, vol. 6643, pp. 382-396. Springer (2011)

[11] Keet, C.M., Lawrynowicz, A. Test-Driven Development of Ontologies (extended version). Technical Report, Arxiv.org http://arxiv.org/abs/1512.06211. Dec 19, 2015.

# Reblogging 2014: Coupon collecting the computing way

From the “10 years of keetblog – reblogging: 2014”: The 2014 post closest to ‘general interest’ is about calculating how much you will be ripped off when collecting team player cards to complete a Panini sticker book collection for some world sports championships, without swapping cards with friends and family. It coincided with the Soccer/football Word Cup in Brazil in 2014. The students of the ICPC Southern Africa Regionals training had some fun with it (and so did I when setting the problem). It may be of interest to students who are now preparing for the IT challenge heats (April 16) or the ICPC world finals (in May; we’ll go again with a UCT team [yay!], to Thailand this time).

———–

Coupon collecting is a very Dutch thing to do, though I never made a serious hobby out of it (nevertheless, I still have a great Brio Koekjesboek thanks to that), but I did collect stamps for a while, which was more interesting than cutting slips off of margarine wrappings. What does any of this have to do with computing, or math, for that matter? A lot. I mean, think of it: how much margarine must we have bought just to have enough slips to order the Brio cookie-baking booklet ‘for free’? Same story for the coffee packet wrappings. Post stamp collecting is harder: you’d want the whole series of a given edition. The Italian company Panini made a business out of it, enticing people to collect all stickers of all team members playing in a world cup. And that’s what got me into this post’s topic.

Coaching for the next ACM ICPC, which includes training sessions, made me surf on the web for some interesting problems to solve, so as not to have only previous ICPC regional’s and finals problems to train the students with. Simon Whitehouse has a great blog post on what it would cost to complete the whole Panini sticker book for the 2010 Soccer World Cup in South Africa, without swapping cards with friends, i.e.: how many packets of five stickers would you need to buy to get the whole series of 638 stickers (pictures of soccer players) to put in the sticker book? Answering this question sounded like fun. I reworked a bit the problem description from his post so as to generalize it to finding a way to be able to calculate what it would cost for any world cup—rugby and cricket are important in South Africa, too—and any cost of a packet of stickers (there’s some 6% inflation/year here); read the full problem description (pdf), on the first page.

In solving this, first, there are three variables: N for the number of unique stickers, P for the price of a packet of 5 stickers, and C for the total cost we want to know. To calculate C, we thus have $\frac{total\_no\_of\_stickers}{5}*P$, and we’ll round it up to the nearest integer. The crux is how to get to the total number of stickers.

Whitehouse’s post has a very readable explanation. In short, when you get the first sticker, it is guaranteed to be a new one, the second card has a $\frac{638}{637}$ chance of being new, and so on to the last card $\frac{638}{1}$, wich follows from some basic notions of probabilities, which you can/will/have come across in a statistics intro course. Generalising this to the arbitrary number of N cards, we obtain

$\frac{N}{N} + \frac{N}{N-1} + \frac{N}{N-2} + \ldots + \frac{N}{2} + \frac{N}{1}$

to calculate the total amount of stickers you need to buy to have the N ones complete. This is as much as you really need from a computational viewpoint. Here’s a simple python code snippet that gets the job done:

def panini(n):
     tns = 0
     for i in range(1,n):
          tns = n/i + tns
     return tns

But why keep it simple when one can complicate matters…

This problem is an instance of the Coupon Collector’s Problem (CCP). The above formula is an harmonic series, and with some math on the CCP page, and the Euler-Mascheroni constant ${\sf \gamma}$ (from number theory, with lots and lots of mathematics), one somehow obtains that the above-mentioned series is $n*H_n$, with $H_n$ the harmonic number, and the whole thing equalling also

$n \mbox{log} n + \gamma n + \frac{1}{2} + o(1) \mbox{ as } n \rightarrow \infty$

according to the Wikipedia entry; there is a lot more online about it, e.g., here [course-level] and here [research], anong many resources. If that’s not enough, $\gamma \approx 0.5772156649$, with the decimal digits computed now to over 119 billion decimal digits (it is a major question in mathematics whether it is an irrational number). Somewhere in the whole gamut of formula on Wikipedia and Whitehouse’s clean but unexplained jump (main text and a comment further down on that page), it boils down to, roughly,

$total\_no\_of\_stickers = N*\mbox{ln}(N) + \gamma$

The latter is easy to plug into a spreadsheet to obtain the answer. But lo and behold, what’s computed with the math-approach and natural log depends on what you plug in for $\gamma$, i.e., how many decimal digits, and only $\gamma$ or the whole of Eq.2. The series with the simple algorithm does not have that problem with the approximations. And you don’t have to do all the math. I didn’t exactly record the time it took to create the spreadsheet versus typing up the simple algorithm, but the latter may even have been faster to do.

Besides the observation that the computing way made it simpler to solve the problem with respect to the design, there’s still a remark to be made on computing the total cost. With R10 per packet and the soccer world cup sticker book, you’ll end up paying R8977 to complete the soccer world cup book if you’d do it all by yourself! For many a South African, that’s more than the monthly salary. Completing a 400-sticker world cup for R35/packet is going to cost you R18389 (about €1268 with the current exchange rate). You’d be a lot better off swapping doubles with family and friends rather than buying new packets. Then again, mot people probably won’t calculate how much money they’d be spending on collecting things, so, here’s a basis for a business model for you.

# Reblogging 2013: Release of the (beta version of the) foundational ontology library ROMULUS

From the “10 years of keetblog – reblogging: 2013”: also not easy to choose regarding research. Here, then, the first results of Zubeida Khan’s MSc thesis on the foundational ontology library ROMULUS, being the first post of several papers on the theoretical and methodological aspects of it (KCAP’13 poster, KEOD’13 paper, MEDI’13 paper, book chapter, EKAW’14 paper) and her winning the award for best Masters from the CSIR. The journal paper on ROMULUS has just been published last week in the Journal on Data Semantics, in a special issue edited by Alfredo Cuzzocrea.

—————-

With the increase on ontology development and networked ontologies, both good ontology development and ontology matching for ontology linking and integration are becoming a more pressing issue. Many contributions have been proposed in these areas. One of the ideas to tackle both—supposedly in one fell swoop—is the use of a foundational ontology. A foundational ontology aims to (i) serve as a building block in ontology development by providing the developer with guidance how to model the entities in a domain, and  (ii) serve as a common top-level when integrating different domain ontologies, so that one can identify which entities are equivalent according to their classification in the foundational ontology. Over the years, several foundational ontologies have been developed, such as DOLCE, BFO, GFO, SUMO, and YAMATO, which have been used in domain ontology development. The problem that has arisen now, is how to link domain ontologies that are mapped to different foundational ontologies?

To be able to do this in a structured fashion, the foundational ontologies have to be matched somehow, and ideally have to have some software support for this. As early as 2003, this issue as foreseen already and the idea of a “WonderWeb Foundational Ontologies Library” (WFOL) proposed, so that—in the ideal case—different domain ontologies can to commit to different but systematically related (modules of) foundational ontologies [1]. However, the WFOL remained just an idea because it was not clear how to align those foundational ontologies and, at the time of writing, most foundational ontologies were still under active development, OWL was yet to be standardised, and there was scant stable software infrastructure. Within the Semantic Web setting, the solvability of the implementation issues is within reach yet not realised, but their alignment is still to be carried out systematically (beyond the few partial comparisons in the literature).

We’re trying to solve these theoretical and practical shortcomings through the creation of the first such online library of machine-processable, aligned and merged, foundational ontologies: the Repository of Ontologies for MULtiple USes ROMULUS. This version contains alignments, mappings, and merged ontologies for DOLCE, BFO, and GFO and some modularized versions thereof, as a start. It also has a section on logical inconsistencies; i.e., entities that were aligned manually and/or automatically and seemed to refer to the same thing—e.g., a mathematical set, a temporal region—actually turned out not to be (at least from a logical viewpoint) due to other ‘interfering’ axioms in the ontologies. What one should be doing with those, is a separate issue, but at least it is now clear where the matching problems really are down to the nitty-gritty entity-level.

We performed a small experiment on the evaluation of the mappings (thanks to participants from DERI, Net2 funds, and Aidan Hogan), and we would like to have more feedback on the alignments and mappings. It is one thing that we, or some alignment tool, aligned two entities, another that asserting an equivalence ends up logically consistent (hence mapped) or inconsistent, and yet another what you think of the alignments, especially the ontology engineers. You can participate in the evaluation: you will get a small set of a few alignments at a time, and then you decide whether you agree, partially agree, or disagree with it, are unsure about it, or skip it if you have no clue.

Finally, ROMULUS also has a range of other features, such as ontology selection, a high-level comparison, browsing the ontology through WebProtégé, a verbalization of the axioms, and metadata. It is the first online library of machine-processable, modularised, aligned, and merged foundational ontologies around. A poster/demo paper [2] was accepted at the Seventh International Conference on Knowledge Capture (K-CAP’13), and papers describing details are submitted and in the pipeline. In the meantime, if you have comments and/or suggestions, feel free to contact Zubeida or me.

References

[1] Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A. Ontology library. WonderWeb Deliverable D18 (ver. 1.0, 31-12-2003). (2003) http://wonderweb.semanticweb.org.

[2] Khan, Z., Keet, C.M. Toward semantic interoperability with aligned foundational ontologies in ROMULUS. Seventh International Conference on Knowledge Capture (K-CAP’13), ACM proceedings. 23-26 June 2013, Banff, Canada. (accepted as poster &demo with short paper)

# More results on a CNL for isiZulu

Although it has been a bit quiet here on the controlled natural languages for isiZulu front, lots of new stuff is in the pipeline, and the substantially extended version of our CNL14 and RuleML14 papers [1,2] is in print for publication in the Language Resources and Evaluation journal: Toward a knowledge-to-text controlled natural language of isiZulu [1] (online at LRE as well).

For those who haven’t read the other blog post or the papers on the topic, a brief introduction: for a plethora of reasons, one would want to generate natural language sentences based on some data, information, or knowledge stored on the computer. For instance, to generate automatically weather reports in isiZulu or to browse or query ‘intelligently’ online annotated newspaper text that is guided by an ontology behind-the-scenes in the inner workings of the interface. This means ‘converting’ structured input into structured natural language sentences, which amounts to a Controlled Natural Language (CNL) that is a fragment of the full natural language. For instance, class subsumption in DL (“$\sqsubseteq$“) is verbalised in English as ‘is a/an’. In isiZulu, it is y- or ng- depending on the first character of the name of the superclass. So, in its simplest form, $indlovu \sqsubseteq isilwane$ (that is, $elephant \sqsubseteq animal$ in an ‘English ontology’) would, with the appropriate algorithm, generate the sentence (be verbalized as) indlovu yisilwane (‘elephant is an animal’).

In the CNL14 and RuleML14 papers, we looked into what could be the verbalisation patterns for subsumption, disjointness, conjunction, and simple existential quantification, we evaluated which ones were preferred, and we designed algorithms for them, as none of them could be done with a template. The paper in the LRE journal extends those works with, mainly: a treatment of verbs (OWL object properties) and their conjugation, updated/extended algorithms to deal with that, design considerations for those algorithms, walk-throughs of the algorithms, and an exploratory evaluation to assess the correctness of the algorithm (is the sentence generated [un]grammatical and [un]ambiguous?). There’s also a longer discussion section and more related works.

Conjugation of the verb in isiZulu is not as trivial as in English, where, for verbalizing knowledge represented in ontologies, one simply uses the 3rd person singular (e.g., ‘eats’) or plural (‘eat’) anywhere it appears in an axiom. In isiZulu, it is conjugated based on the noun class of the noun to which it applies. There are 17 noun classes. For instance, umuntu ‘human’ is in noun class 1, and indlovu in noun class 9. Then, when a human eats something, it is umuntu udla whereas with the elephant, it is indlovu idla. Negating it is not simply putting a ‘not’ or ‘does not’ in front of it, as is the case in English (‘does not eat’), but it has its own conjugation (called negative subject concord) again for each noun class, and modifying the final vowel; the human not eating something then becomes umuntu akadli and for the elephant indovu ayidli. This is now precisely captured in the verbalization patterns and algorithms.

Though a bit tedious and not an easy ride compared to a template-based approach, but surely doable to put in an algorithm. Meanwhile, I did implement the algorithms. I’ll readily admit it’s a scruffy Python file and you’ll have to type the function in the interpreter rather than having it already linked to an ontology, but it works, and that’s what counts. (see that flag put in the sand?😉 ) Here’s a screenshot with a few examples, just to show that it does what it should do.

Screenshot showing the working functions for verbalising subsumption, disjointness, universal quantificaiton, existential quantification and its negation, and conjunction.

The code and other files are available from the GeNi project page. The description of the implementation, and the refinements we made along the way in doing so (e.g., filling in that ‘pluralise it’ of the algorithm), is not part of the LRE article, for we were already pushing it beyond the page limit, so I’ll describe that in a later post.

References

[1] Keet, C.M., Khumalo, L. Toward verbalizing logical theories in isiZulu. 4th Workshop on Controlled Natural Language (CNL’14), Davis, B, Kuhn, T, Kaljurand, K. (Eds.). Springer LNAI vol. 8625, 78-89. 20-22 August 2014, Galway, Ireland.

[2] Keet, C.M., Khumalo, L. Basics for a grammar engine to verbalize logical theories in isiZulu. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS vol. 8620, 216-225. August 18-20, 2014, Prague, Czech Republic.

[3] Keet, C.M., Khumalo, L. Toward a knowledge-to-text controlled natural language of isiZulu. Language Resources and Evaluation, 2016: in print. DOI: 10.1007/s10579-016-9340-0

# OBDA/I Example in the Digital Humanities: food in the Roman Empire

A new installment of the Ontology Engineering module is about to start for the computer science honours students who selected it, so, in preparation, I was looking around for new examples of what ontologies and Semantic Web technologies can do for you, and that are at least somewhat concrete. One of those examples has an accompanying paper that is about to be published (can it be more recent than that?), which is on the production and distribution of food in the Roman Empire [1]. Although perhaps not many people here in South Africa might care about what happened in the Mediterranean basin some 2000 years ago, it is a good showcase of what one perhaps also could do here with the historical and archeological information (e.g., an inter-university SA project on digital humanities started off a few months ago, and several academics and students at UCT contribute to the Bleek and Lloyd Archive of |xam (San) cultural heritage, among others). And the paper is (relatively) very readable also to the non-expert.

So, what is it about? Food was stored in pots (more precisely: an amphora) that had engravings on it with text about who, what, where etc. and a lot of that has been investigated, documented, and stored in multiple resources, such as in databases. None of the resources cover all data points, but to advance research and understanding about it and food trading systems in general, it has to be combined somehow and made easily accessible to the domain experts. That is, essentially it is an instance of a data access and integration problem.

There are a couple of principal approaches to address that, usually done by an Extract-Transform-Load of each separate resource into one database or digital library, and then putting a web-based front-end on top of it. There are many shortcomings to that solution, such as having to repeat the ETL procedure upon updates in the source database, a single control point, and the, typically only, canned (i.e., fixed) queries of the interface. A more recent approach, of which the technologies finally are maturing, is Ontology-Based Data Access (OBDA) and Ontology-Based Data Integration (OBDI). I say “finally” here, as I still very well can remember the predecessors we struggled with some 7-8 years ago [2,3] (informally here, here, and here), and “maturing”, as the software has become more stable, has more features, and some of the things we had to do manually back then have been automated now. The general idea of OBDA/I applied to the Roman Empire Food system is shown in the figure below.

OBDA in the EPnet system (Source: [1])

There are the data sources, which are federated (one ‘middle layer’, though still at the implementation level). The federated interface has mapping assertions to elements in the ontology. The user then can use the terms of the ontology (classes and their relations and attributes) to query the data, without having to know about how the data is stored and without having to write page-long SQL queries. For instance, a query “retrieve inscriptions on amphorae found in the city of ‘Mainz” containing the text ‘PNN’” would use just the terms in the ontology, say, Inscription, Amphora, City, found in, and inscribed on, and any value constraint added (like the PNN), and the OBDA/I system takes care of the rest.

Interestingly, the authors of [1]—admitted, three of them are former colleagues from Bolzano—used the same approach to setting up the ontology component as we did for [3]. While we will use the Protégé Ontology Development Environment in the OE module, it is not the best modelling tool to overcome the knowledge acquisition bottleneck. The authors modelled together with the domain experts in the much more intuitive ORM language and tool NORMA, and first represented whatever needed to be represented. This included also reuse of relevant related ontologies and non-ontology material, and modularizing it for better knowledge management and thereby ameliorating cognitive overload. A subset of the resultant ontology was then translated into the Web Ontology Language OWL (more precisely: OWL 2 QL, a tractable profile of OWL 2 DL), which is actually used in the OBDA system. We did that manually back then; now this can be done automatically (yay!).

Skipping here over the OBDI part and considering it done, the main third step in setting up an OBDA system is to link the data to the elements in the ontology. This is done in the mapping layer. This is essentially of the form “TermInTheOntology <- SQLqueryOverTheSource”. Abstracting from the current syntax of the OBDA system and simplifying the query for readability (see the real one in the paper), an example would thus have the following make up to retrieve all Dressel 1 type of amphorae, named Dressel1Amphora in the ontology, in all the data sources of the system:

Dressel1Amphora <-
SELECT ic.id
FROM ic JOIN at ON at.carrier=ic.id
WHERE at.type=’DR1’

Or some such SQL query (typically larger than this one). This takes up a bit of time to do, but has to be done only once, for these mappings are stored in a separate mapping file.

The domain expert, then, when wanting to know about the Dressel1 amphorae in the system, would have to ask only ‘retrieve all Dressel1 amphorae’, rather than creating the SQL query, and thus being oblivious about which tables and columns are involved in obtaining the answer and being oblivious about that some data entry person at some point had mysteriously decided not to use ‘Dressel1’ but his own abbreviation ‘DR1’.

The actual ‘retrieve all Dressel1 amphorae’ is then a SPARQL query over the ontology, e.g.,

SELECT ?x WHERE {?x rdf:Type :Dressel1Amphora.}

which is surely shorter and therefore easier to handle for the domain expert than the SQL one. The OBDA system (-ontop-) takes this query and reasons over the ontology to see if the query can be answered directly by it without consulting the data, or else can be rewritten given the other knowledge in the ontology (it can, see example 5 in the paper). The outcome of that process then consults the relevant mappings. From that, the whole SQL query is constructed, which is sent to the (federated) data source(s), which processes the query as any relational database management system does, and returns the data to the user interface.

It is, perhaps, still unpleasant that domain experts have to put up with another query language, SPARQL, as the paper notes as well. Some efforts have gone into sorting out that ‘last mile’, such as using a (controlled) natural language to pose the query or to reuse that original ORM diagram in some way, but more needs to be done. (We tried the latter in [3]; that proof-of-concept worked with a neutered version of ORM and we have screenshots and videos to prove it, but in working on extensions and improvements, a new student uploaded buggy code onto the production server, so that online source doesn’t work anymore (and we didn’t roll back and reinstalled an older version, with me having moved to South Africa and the original student-developer, Giorgio Stefanoni, away studying for his MSc).

Note to OE students: This is by no means all there is to OBDA/I, but hopefully it has given you a bit of an idea. Read at least sections 1-3 of paper [1], and if you want to do an OBDA mini-project, then read also the rest of the paper and then Chapter 8 of the OE lecture notes, which discusses in a bit more detail the motivations for OBDA and the theory behind it.

References

[1] Calvanese, D., Liuzzo, P., Mosca, A., Remesal, J, Rezk, M., Rull, G. Ontology-Based Data Integration in EPNet: Production and Distribution of Food During the Roman Empire. Engineering Applications of Artificial Intelligence, 2016. To appear.

[2] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED 2008), 26-27 Oct. 2008, Karlsruhe, Germany.

[3] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC 2010), March 22-26 2010, Sierre, Switzerland. ACM Proceedings, pp1389-1396.