Every American is a NamedPizza

Or: verbalizing OWL ontologies still doesn’t really work well.

Ever since we got the multi-lingual verbalization of ORM conceptual data models (restricted FOL theories) working in late 2005 [1] (well: the implementation worked in the DOGMA tool, but the understandability of the output depended on the natural language), I have been following, on and off, the progress on solutions to the problem. It would be really nice if it all had worked by now, because verbalization is a way for non-logician domain experts to validate the knowledge represented in the ontology, and it has been shown to be very useful for domain experts (mainly in enterprise) validating (business) knowledge represented in the ORM conceptual data modeling language. (Check out the NORMA tool for the latest fancy implementation, well ahead of OWL verbalization in English Controlled Natural Language.)

Some of my students worked on it as an elective ‘mini-project’ topic of the ontology engineering courses I have taught [SWT at FUB, UH, UCI, UKZN]. They have tried to implement it for OWL into Italian and Spanish natural language using a template-based approach with some additional mini-grammar-engine to improve the output, or in English as a competitor to the Manchester syntax. All of them invariably run, to a greater or lesser extent, into the problems discussed in [1], especially when it comes to non-English languages, as English is grammatically challenged. Now, I do not intend to offend people who have English as first language, but English does not have features like gendered articles (just ‘the’ instead of ‘el’ and ‘la’, in Spanish), declensions (still ‘the’ instead of ‘der’, ‘des’, ‘dem’, ‘den’ depending on the preposition, in German), conjunction depending on the nouns (just ‘and’ instead of ‘na’, ‘ne’, ‘no’ that is glued onto the second noun depending on the first letter of that noun, in isiZulu), or subclauses where the verb tense changes by virtue of being in a subclause (in Italian). To sort out such basic matters and generate an understandable pseudo-natural language sentence, a considerable number of grammar rules and a dictionary have to be added to a template-based approach to make it work.
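To make concrete why a bare template is not enough, here is a minimal sketch in Python. The tiny lexicon, the function names, and the simplified vowel-fusion table for isiZulu are all assumptions made for illustration; a real system needs a full dictionary and far more grammar rules.

```python
# Hypothetical mini-lexicon: grammatical gender has to be looked up per noun.
SPANISH_GENDER = {"pizza": "f", "salsa": "f", "tomate": "m", "queso": "m"}
SPANISH_ARTICLE = {"m": "el", "f": "la"}


def spanish_definite(noun: str) -> str:
    """Return the noun preceded by its gendered definite article."""
    return f"{SPANISH_ARTICLE[SPANISH_GENDER[noun]]} {noun}"


# Simplified assumption: 'na' fuses with the initial vowel of the second
# noun (a+a -> a, a+i -> e, a+u -> o), which yields the na/ne/no forms.
ZULU_CONJUNCTION = {"a": "na", "i": "ne", "u": "no"}


def zulu_and(first: str, second: str) -> str:
    """Join two isiZulu nouns, gluing the conjunction onto the second one."""
    prefix = ZULU_CONJUNCTION[second[0]]
    return f"{first} {prefix}{second[1:]}"


if __name__ == "__main__":
    print(spanish_definite("tomate"))
    print(spanish_definite("pizza"))
    print(zulu_and("inja", "ikati"))
```

In English both functions collapse to a constant (‘the’ and ‘and’), which is exactly why an English-only template hides the problem.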

But let us limit ourselves to English for the moment. Then it is still not trivial. There is a paper comparing the different OWL verbalizers [2], such as Rabbit (ROO) and ACE, which considers issues like how to map, e.g., an AllValuesFrom to “Each…”, “Every…” etc. This is an orthogonal issue to the multi-lingual aspects, and I don’t know how that affects the user’s understanding of the sentences.

I had another look at ACE, as ACE also has a web-interface that accepts OWL/XML files (i.e., OWL 2). I tried it out with the Pizza tutorial ontology, and it generated many intelligible sentences. However, there were also phrases like (i) “Everything that is hasTopping by a Mushroom is something that is a MozzarellaTopping or that is a MushroomTopping or that is a TomatoTopping.”, the (ii) “Every American is a NamedPizza” mentioned in the title of this post, and then there are things like (iii) “Every DomainConcept that is America or that is England or that is France or that is Germany or that is Italy is a Country”. Example (iii) is not a problem of the verbalizer, but merely an instance of GIGO and the ontology should be corrected.

Examples (i) and (ii) exhibit other problems, though. Regarding (ii), I have noticed that when (novice) ontologists use an ontology development tool, it is not uncommon for them to not name the entity fully, probably because it is easy for a human reader to fill in the rest from the context; in casu, American is not an adjective for people, but relates to pizza. A more precise name could have avoided such issues (AmericanPizza), or a new solution to ‘context’ can be devised. The weird “is hasTopping by” is due, I think, to the lexicalization of OWL’s ObjectPropertyRange in ACE, which takes the object property, assumes that to be in the infinitive and then puts it in the past participle form (see the Web-ACE page, section 4). So, if the Pizza Ontology developers had chosen not hasTopping but, say, the verb ‘top’, ACE would have changed it into ‘is topped by’. In principle the rule makes sense, but it can be thwarted by the names used in the ontology.

Fliedl and co-authors [3] are trying to resolve just such issues. They propose a rigid naming convention to make it easier to verbalize the ontology. I do not think it is a good proposal, because it is ‘blaming’ the ontologists for failing natural language generation (NLG) systems, and syntactic sugar (verbalization) should not be the guiding principle when adding knowledge to the ontology. Besides, it is not that difficult to add another rule or two to cater for variations, which is probably what will be needed in the near future anyway once ontology reuse and partial imports become more commonplace in ontology engineering.
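To illustrate what ‘another rule or two’ could look like, here is a minimal sketch in Python. It is not ACE’s actual implementation: the participle table, the function names, and the hasNoun rule are hypothetical and only mimic the behaviour described for “is hasTopping by”.

```python
import re

# Hypothetical participle lexicon; a real verbalizer would consult a
# morphological dictionary instead of this hand-made table.
PAST_PARTICIPLE = {"top": "topped", "eat": "eaten", "make": "made"}


def naive_passive(prop: str) -> str:
    """Assume the property name is an infinitive: 'is <participle> by'."""
    return f"is {PAST_PARTICIPLE.get(prop, prop)} by"


def passive_with_has_rule(prop: str) -> str:
    """Same rule, plus one extra case for names of the form hasNoun."""
    match = re.fullmatch(r"has([A-Z]\w*)", prop)
    if match:
        noun = match.group(1)
        noun = noun[0].lower() + noun[1:]
        article = "an" if noun[0] in "aeiou" else "a"
        return f"is {article} {noun} of"
    return naive_passive(prop)


if __name__ == "__main__":
    print(naive_passive("top"))
    print(naive_passive("hasTopping"))
    print(passive_with_has_rule("hasTopping"))
```

With the extra rule the awkward phrase becomes “is a topping of”, without requiring the ontology developers to rename anything.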

Power and Third [4] readily admit that verbalizing OWL is “dubious in theory”, but they provide data that it may be “feasible in practice”. The basis of their conclusion lies in the data analysis of about 200 ontologies, which shows that the ‘problematic’ cases seldom arise. For instance, OWL’s SubClassOf takes two class expressions, but in practice it is only used in the format of SubClassOf(C CE) or SubClassOf(C C), idem regarding EquivalentClasses (I think that is probably due to Protégé’s interface), which makes the verbalization easier. They did not actually build a verbalizer, though, but the tables on page 1011 can be of use in deciding what to focus on first; e.g., out of the 633,791 axioms, there were only 12 SubDataPropertyOf assertions, whereas SubClassOf(Class,Class) appeared 297,293 times (46.9% of the total) and SubClassOf(Class,ObjectSomeValuesFrom(ObjectProperty,Class)) 158,519 times (25.0%). Why this distribution is the way it is, is another topic.
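The percentages can be rechecked from the counts quoted above, and the same arithmetic shows how much a verbalizer would cover if it handled only the most frequent patterns. A small sketch in Python, using only the three counts mentioned in this post; the function names are my own.

```python
TOTAL_AXIOMS = 633_791

# Counts as quoted above from the tables in [4].
PATTERN_COUNTS = {
    "SubClassOf(Class,Class)": 297_293,
    "SubClassOf(Class,ObjectSomeValuesFrom(ObjectProperty,Class))": 158_519,
    "SubDataPropertyOf": 12,
}


def share(count: int, total: int = TOTAL_AXIOMS) -> float:
    """Percentage of all axioms that a single pattern accounts for."""
    return 100 * count / total


def coverage(patterns: list[str]) -> float:
    """Combined percentage covered by supporting the given patterns."""
    return sum(share(PATTERN_COUNTS[p]) for p in patterns)


if __name__ == "__main__":
    for name, count in PATTERN_COUNTS.items():
        print(f"{share(count):7.3f}%  {name}")
    top_two = list(PATTERN_COUNTS)[:2]
    print(f"two most frequent patterns together: {coverage(top_two):.1f}%")
```

The two SubClassOf patterns alone account for roughly 71.9% of the axioms, whereas SubDataPropertyOf stays below 0.01%.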

Going back to the multi-lingual dimension, there is a general problem with OWL ontologies, which is, from a theoretical perspective, addressed more elegantly with OBO ontologies. In OBO, each class has an identifier and the name is just a label. So one could, in principle, amend this by adding labels for each natural language; e.g., have a class “PIZZA:12345” in the ontology with associated labels “tomato @en”, “pomodoro @it”, “utamatisi @zulu” and so forth, and when verbalizing it in one of those languages, the system picks the right label, compared to the present cumbersome and error-prone way of developing and maintaining an OWL file for each language. Admittedly, this has its limitations for terms and verbs that do not have a neat 1:1 translation, but a fully lexicalized ontology should be able to solve this (though does not do so yet).
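The identifier-plus-labels idea can be sketched in a few lines of Python. The data structure, the function name, and the fallback behaviour are assumptions for illustration, and the two-letter tag ‘zu’ stands in for the ‘@zulu’ tag written above.

```python
# One identifier per class; the natural language names are only labels.
LABELS = {
    "PIZZA:12345": {"en": "tomato", "it": "pomodoro", "zu": "utamatisi"},
}


def label_for(class_id: str, lang: str, fallback: str = "en") -> str:
    """Pick the label in the requested language.

    Falls back to another language, and finally to the bare identifier,
    when no label is available.
    """
    labels = LABELS.get(class_id, {})
    return labels.get(lang, labels.get(fallback, class_id))


if __name__ == "__main__":
    for lang in ("en", "it", "zu", "es"):
        print(lang, label_for("PIZZA:12345", lang))
```

There is then one ontology to maintain, and adding a language means adding labels rather than copying the whole file. It does not address the terms without a neat 1:1 translation.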

It is very well possible that I have missed some recent paper that addresses these issues. At some point in time, we will probably (have to) develop an isiZulu verbalization system, so anyone who has or knows of references that point to (partial) solutions is most welcome to add them in the comments section of the post.

References

[1] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussel, Belgium. February 2006.

[2] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear, G. Hart. A comparison of three controlled natural languages for OWL 1.1. Proc. of OWLED 2008 DC. Washington, DC, USA, 1-2 April 2008.

[3] Fliedl, G., Kop, C., Vöhringer, J. Guideline based evaluation and verbalization of OWL class and property labels. Data & Knowledge Engineering, 2010, 69: 331-342.

[4] Power, R., Third, A. Expressing OWL axioms by English sentences: dubious in theory, feasible in practice. Coling 2010: Poster Volume, pages 1006–1013, Beijing, August 2010.

Managing your BSc honours Project

As with the previous post on referencing academic work, there is only informal information on another aspect of the (4th year honours) thesis (project) at UKZN: project management. This was not so much of an issue as long as there were just a few students, but the numbers are up and repeating suggestions umpteen times is rather time-consuming. I have prepared a first version of a document on some general aspects of thesis project management, which is somewhat tailored to our computer science honours students (though many aspects can be useful for MSc/PhD thesis management). The latest version of the Thesis Project Management is available online.

It first describes what activities one generally has to do, and then provides guidelines for how those activities can be planned in a schedule and managed such that the chance of completing your project is higher. The former includes the core activities such as exploration, problem definition, proposal, main research/development, and the production of your thesis, and several sub-tasks. The latter addresses aspects like milestones, the ‘onion approach’ to prioritizing your work, Gantt charts, keeping a project log, and how to avoid and deal with delays. Both sections contain several examples to illustrate the (sub-)activities and their planning.

It is an evolving document that eventually will end up in the complementary “research methods” syllabus we intend to develop for next year’s students who will have to do our honours project module (COMP700), so comments and suggestions for improvements are most welcome.

Referencing Works

In my musings about Related works: when do you read ‘too much’? some two months ago, I intentionally gave a few ‘exemplary cases’ of how not to cite material, but I did not mention how, then, one is supposed to cite material properly. As the honours students are working on their thesis proposal and the mini-projects for the Ontologies & Knowledge Bases course I teach (indeed, it is intended to give practice in doing a project before the real thing later in the university year), I have prepared a first version of a document on referencing related works, as there was nothing about that yet for our computer science students. The latest version of ReferencingWorks is available online (suggestions for improvement welcome).

Aside from describing the usual basics about how to reference material in the text, plagiarism, quoting, paraphrasing, referencing, and what you normally can and cannot cite, the major difference with other material on this topic is probably the section on how to manage references. Unlike other material online, in particular the many referencing style guides that tell you ‘style x requires you to put a book title in italics, a journal paper volume number in bold’ etc. etc., I did not waste time and space on such tedious things. One would have to be crazy to read through all those guidelines and adapt one’s references each time another style has to be followed, not to mention manually fiddling with the in-text notation. Scientists and software developers got together, and developed reference management software to do this for you. Put differently: we can get that sorted out automatically.

Basically, you store your references in a fancy database and each time you use a reference, you insert its key in the text. Once ready, the used keys, your text editor, and your selected style get together and produce the right number of references in the right format in the right order, automatically. That is: use BibTeX. To be sure, I mention other referencing software as well, like Mendeley and EndNote, but, I admit, only because it is known that people like to have the impression that they have a choice. So the students can choose whichever way they like, even the hard way by managing the references manually, as long as the references are correctly cited in the text, and are complete and consistently formatted in the references section.
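For those who want to see the principle rather than take it on faith, here is a toy sketch in Python of what such software does with the keys. It is not how BibTeX itself is implemented; the database entries, the \cite{key} notation handled by the regular expression, and the function name are placeholders chosen for illustration, and only a numbered style in order of first appearance is covered.

```python
import re

# Hypothetical reference database: citation key -> formatted entry.
DATABASE = {
    "keyA": "Author A. First placeholder title.",
    "keyB": "Author B. Second placeholder title.",
    "keyC": "Author C. Third placeholder title (never cited).",
}

CITE = re.compile(r"\\cite\{([^}]+)\}")


def render(text: str, database: dict[str, str]) -> tuple[str, list[str]]:
    """Replace each \\cite{key} by a number and build the reference list."""
    order: list[str] = []

    def number(match: re.Match) -> str:
        key = match.group(1)
        if key not in order:
            order.append(key)  # numbered by first appearance in the text
        return f"[{order.index(key) + 1}]"

    body = CITE.sub(number, text)
    references = [f"[{i}] {database[key]}" for i, key in enumerate(order, 1)]
    return body, references


if __name__ == "__main__":
    draft = r"As shown in \cite{keyB}, and later in \cite{keyA} and \cite{keyB}."
    body, references = render(draft, DATABASE)
    print(body)
    print("\n".join(references))
```

Only the cited entries end up in the list, each exactly once, and changing the style means changing one formatting function instead of every reference by hand.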