Or: verbalizing OWL ontologies still doesn’t really work well.
Ever since we got the multi-lingual verbalization of ORM conceptual data models (restricted FOL theories) working in late 2005 [1]—well: the implementation worked in the DOGMA tool, but the understandability of the output depended on the natural language—I have been following on and off the progress on solutions to the problem. It would be really nice if it all had worked by now, because it is a way for non-logician domain experts to validate the knowledge represented in the ontology and verbalization has been shown to be very useful for domain experts (mainly enterprise) validating (business) knowledge represented in the ORM conceptual data modeling language. (Check out the NORMA tool for the latest fancy implementation, well ahead of OWL verbalization in English Controlled Natural Language).
Some of my students worked on it as an elective ‘mini-project’ topic of the ontology engineering courses I have taught [SWT at FUB, UH, UCI, UKZN]. They have tried to implement it for OWL into Italian and Spanish natural language using a template-based approach with some additional mini-grammar-engine to improve the output, or in English as a competitor to the Manchester syntax. All of them invariable run, to a greater or lesser extent, into the problems discussed in [1], especially when it comes to non-English languages, as English is grammatically challenged. Now, I do not intend to offend people who have English as first language, but English does not have features like gendered articles (just ‘the’ instead of ‘el’ and ‘la’, in Spanish), declensions (still ‘the’ instead of ‘der’ ‘des’, ‘dem’, ‘den’ depending on the proposition, in German), conjunction depending on the nouns (just ‘and’ instead of ‘na’, ‘ne’, ‘no’ that is glued onto the second noun depending on the first letter of that noun, in isiZulu), or subclauses where the verb tense changes by virtue of being in a subclause (in Italian). To sort out such basic matters to generate an understandable pseudo-natural language sentence, a considerable amount of grammar rules and a dictionary have to be added to a template-based approach to make it work.
But let us limit ourselves to English for the moment. Then it is still not trivial. There is a paper comparing the different OWL verbalizers [2], such as Rabbit (ROO) and ACE, which considers issues like how to map, e.g., an AllValuesFrom to “Each…”, “Every…” etc. This is an orthogonal issue to the multi-lingual aspects, and I don’t know how that affects the user’s understanding of the sentences.
I had another look at ACE, as ACE also has a web-interface that accepts OWL/XML files (i.e., OWL 2). I tried it out with the Pizza tutorial ontology, and it generated many intelligible sentences. However, there were also phrases like (i) “Everything that is hasTopping by a Mushroom is something that is a MozzarellaTopping or that is a MushroomTopping or that is a TomatoTopping.”, the (ii) “Every American is a NamedPizza” mentioned in the title of this post, and then there are things like (iii) “Every DomainConcept that is America or that is England or that is France or that is Germany or that is Italy is a Country”. Example (iii) is not a problem of the verbalizer, but merely an instance of GIGO and the ontology should be corrected.
Examples (i) and (ii) exhibit other problems, though. Regarding (ii), I have noticed that when (novice) ontologists use an ontology development tool, it is a not uncommon practice to not name the entity fully, probably because it is easy for a human reader to fill in the rest from the context; in casu, American is not an adjective to people, but relates to pizza. A more precise name could have avoided such issues (AmericanPizza), or a new solution to ‘context’ can be devised. The weird “is hasTopping by” is due, I think, to the lexicalization of OWL’s ObjectPropertyRange in ACE, which takes the object property, assumes that to be in the infinitive and then puts it in the past participle form (see the Web-ACE page, section 4). So, if the Pizza Ontology developers had chosen not hasTopping but, say, the verb ‘top’, ACE would have changed it into ‘is topped by’. In idea the rule makes sense, but it can be thwarted by the names used in the ontology.
Fliedl and co-authors [3] are trying to resolve just such issues. They propose a rigid naming convention to make it easier to verbalize the ontology. I do not think it is a good proposal, because it is ‘blaming’ the ontologists for failing natural language generation (NLG) systems, and syntactic sugar (verbalization) should not be the guiding principle when adding knowledge to the ontology. Besides, it is not that difficult to add another rule or two to cater for variations, which is probably what will be needed in the near future anyway once ontology reuse and partial imports become more commonplace in ontology engineering.
Power and Third [4] readily admit that verbalizing OWL is “dubious in theory”, but they provide data that it may be “feasible in practice”. The basis of their conclusion lies in the data analysis of about 200 ontologies, which show that the ‘problematic’ cases seldom arise. For instance, OWL’s SubClassOf takes two class expressions, but in praxis it is only used in the format of SubClassOf(C CE) or SubClassOf(C C), idem regarding EquivalentClasses—I think that is probably due to Protégé’s interface—which makes the verbalization easier. They did not actually build a verbalizer, though, but the tables on page 1011 can be of use what to focus on first; e.g., out of the 633,791 axioms, there were only 12 SubDataPropertyOf assertions, whereas SubClassOf(Class,Class) appeared 297,293 times (46.9% of the total) and SubClassOf(Class,ObjectSomeValuesFrom(ObjectProperty,Class)) 158,519 times (25.0%). Why this distribution is the way it is, is another topic.
Going back to the multi-lingual dimension, there is a general problem with OWL ontologies, which is, from a theoretical perspective, addressed more elegantly with OBO ontologies. In OBO, each class has an identifier and the name is just a label. So one could, in principle, amend this by adding labels for each natural language; e.g., have a class “PIZZA:12345” in the ontology with associated labels “tomato @en”, “pomodoro @it”, “utamatisi @zulu” and so forth, and when verbalizing it in one of those languages, the system picks the right label, compared to the present cumbersome and error-prone way of developing and maintaining an OWL file for each language. Admitted, this has its limitations for terms and verbs that do not have a neat 1:1 translation, but a fully lexicalized ontology should be able to solve this (though does not do so yet).
It is very well possible that I have missed some recent paper that addresses the issues but that I have not come across. At some point in time, we’ll probably will (have to) develop an isiZulu verbalization system, so anyone who has/knows of references that point to (partial) solutions is most welcome to add them in the comments section of the post.
References
[1] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.
[2] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear, G. Hart. A comparison of three controlled natural languages for OWL 1.1. Proc. of OWLED 2008 DC. Washington, DC, USA, 1-2 April 2008.
[3] Fliedl, G., Kop, C., Vöhringer, J. Guideline based evaluation and verbalization of OWL class and property labels. Data & Knowledge Engineering, 2010, 69: 331-342.
[4] Power, R., Third, A. Expressing OWL axioms by English sentences: dubious in theory, feasible in practice. Coling 2010: Poster Volume, pages 1006–1013,
Beijing, August 2010.