Reblogging 2007: AI and cultural heritage workshop at AI*IA’07

From the “10 years of keetblog – reblogging: 2007”: a happy serendipity moment when I stumbled into the AI & Cultural heritage workshop, which had its presentations in Italian. Besides the nice realisation I actually could understand most of it, I learned a lot about applications of AI to something really useful for society, like the robot-guide in a botanical garden, retracing the silk route, virtual Rome in the time of the Romans, and more.

AI and cultural heritage workshop at AI*IA’07, originally posted on Sept 11, 2007. For more recent content on AI & cultural heritage, see e.g., the workshop’s programme of 2014 (also co-located with AI*IA).

——–

I’m reporting live from the Italian conference on artificial intelligence (AI*IA’07) in Rome (well, Villa Mondragone in Frascati, with a view of Rome). My own paper on abstractions is rather distant from near-immediate applicability in daily life, so I’ll leave that be and instead write about an entertaining co-located workshop on applying AI technologies for the benefit of cultural heritage, e.g., to improve tourists’ experience and satisfaction when visiting the many historical sites, museums, and buildings that are all over Italy (and abroad).

I can remember well the handheld guide at the Alhambra back in 2001, which had a story by Mr. Irving at each point of interest, but it was one long story and the same one for every visitor. Current research in AI & cultural heritage looks into how this can be personalized and made more interactive, and several directions are being investigated. They range from the amount of information provided at each point of interest (e.g., for the art buff, the casual American visitor who ‘does’ a city in a day or two, or narratives for children), to location-aware information display (the device detects which point of interest you are closest to), to cataloguing and structuring the vast amount of archeological information, to the software monitoring of Oetzi the Iceman. The remainder of this blog post describes some of the many behind-the-scenes AI technologies that aim to give a tourist the desired amount of relevant information at the right time and right place (see the workshop website for the list of accepted papers). I’ll add more links later; any misunderstandings are mine (the workshop was held in Italian).

First something that relates somewhat to bioinformatics/ecoinformatics: the RoBotanic [1], a robot guide for botanical gardens – not intended to replace a human, but as an add-on that appeals in particular to young visitors and gets them interested in botany and plant taxonomy. The technology is based on the successful ciceRobot that has been tested in the Archeological Museum of Agrigento, but because it has to operate outdoors in a botanical garden (in Palermo), new issues had to be resolved, such as tuff powder, irregular surfaces, lighting, and leaves that interfere with the GPS system (used to make the robot stop at the plants of most interest). Currently, the RoBotanic provides one-way information, but in the near future interaction will be built in so that visitors can ask questions as well (ciceRobot is already interactive). Both the RoBotanic and ciceRobot are customized off-the-shelf robots.

Continuing with the artificial, there were three presentations about virtual reality. VR can be a valuable add-on to visualize lost or severely damaged property, show timelines of rebuilding over old ruins (building a church over a mosque or vice versa was not uncommon), prepare future restorations, and reconstruct the environment in general, all based on the real archeological information (not Hollywood fantasy and screenwriting). The first presentation [2] explained how the virtual reality tour of the Church of Santo Stefano in Bologna was made, using Creator, Vega, and many digital photos that served for the texture-feel in the VR tour. [3] provided technical details and software customization for VR & cultural heritage. The third presentation [4] was, from a scientific point of view, the most interesting, and too full of information to cover it all here. E. Bonini et al. investigated if, and if so how, VR can give added value. Current VR being insufficient for the cultural heritage domain, they look at how one can do an “expansion of reality” to give the user a “sense of space”. MUDing on the via Flaminia Antica in the virtual room in the National Museum in Rome should be possible soon (a CNR-ITABC project has started). Another issue came up during the concluded Appia Antica project for Roman-era landscape VR: the behaviour of, e.g., animals is now pre-coded and quickly becomes boring to the user. So, what these VR developers would like to see (i.e., future work) is technologies for autonomous agents integrated with VR software in order to make the ancient landscape & environment more lively: artificial life in the historical era one wishes, based on – and constrained by – scientific facts so as to be both useful for science and educational & entertaining for interested laymen.

A different strand of research is that of querying & reasoning, ontologies, planning and constraints.
Arbitrarily, I’ll start with the SIRENA project in Naples (the Spanish Quarter) [5], which aims to provide automatic generation of maintenance plans for historical residential buildings in order to make the current manual planning more efficient and cost-effective, and to carry out maintenance before, rather than after, a collapse. Given the UNI 8290 norms for technical descriptions of parts of buildings, they made an ontology, and used FLORA-2, Prolog, and PostgreSQL to compute the plans. Each element has its own interval for maintenance, but I didn’t see much of the partonomy, and don’t know how they deal with the temporal aspects. Another project [6] also has an ontology, in OWL-DL, but it is not used for DL reasoning yet. The overall system design, including the use of Sesame, Jena, and SPARQL, can be read here, and after a server migration their portal for the archeological e-Library will be back online. Another component is the webGIS for pre- and proto-historical sites in Italy, i.e., spatio-temporal stuff, and the hope is to get interesting inferences – novel information – from that (e.g., discover new connections between epochs). A basic online accessible version of the webGIS is already running for the Silk Road.
A third, different, approach and usage of ontologies was presented in [7]. With the aim of digital archive interoperability in mind, D’Andrea et al. took the CIDOC-CRM common reference model for cultural heritage and enriched it with the DOLCE D&S foundational ontology to better describe and subsequently analyse iconographic representations, in this particular work scenes and reliefs from the Meroitic period in Egypt.
With In.Tou.Sys for intelligent tourist systems [8] we move to almost-industry-grade tools to enhance the visitor experience. They developed software for PDAs that one takes around in a city, which then through GPS can provide contextualized information to the tourist, such as about the building you’re walking by, or give suggestions for the best places to visit based on your preferences (e.g., only the baroque era, or churches, etc.). The latter uses a genetic algorithm to compute the preference list, the former a mix of an RDBMS on the server side, an OODBMS on the client (PDA) side, and F-Logic for the knowledge representation. They’re now working on the “admire” system, which has a time component built in to keep track of what the tourist has visited before, so that the PDA-guide can provide comparative information. Also for the city-wide scale and guiding visitors is the STAR project [9]; a bit different from the previous one, it combines the usual tourist information and services – represented in a taxonomy, partonomy, and a set of constraints – with problem solving and a recommender system to make an individualized agenda for each tourist; so you won’t stand in front of a closed museum, you will be alerted of a festival, etc. A different PDA-guide system was developed in the PEACH project for group visits in a museum. It provides limited personalized information, canned Q & A, and visitors can send messages to their friends and tag points of interest they find particularly interesting.

Utterly different from the previous, but probably of interest to the linguistically oriented reader, is philology & digital documents. Or: how to deal with representing multiple versions of a document. Poets and authors write and rewrite, brush up, strike through, etc., and it is the philologist’s task to figure out what constitutes a draft version. Representing the temporality and change of documents (words, order of words, notes about a sentence) is another problem, which [10] attempts to solve by representing it as a PERT/CPM graph structure augmented with labeling of edges, a precise definition of a ‘variant graph’, and a method of compactly storing it (ultimately stored in XML). The test case was a poem by Valerio Magrelli.

The proceedings will be put online soon (I presume) and are also available on CD (contact the workshop organizer Luciana Bordoni); several of the articles are probably available on the authors’ homepages.

[1] A. Chella, I. Macaluso, D. Peri, L. Riano. RoBotanic: a Robot Guide for Botanical Gardens. Early Steps.
[2] G. Adorni. 3D Virtual Reality and the Cultural Heritage.
[3] M.C. Baracca, E. Loreti, S. Migliori, S. Pierattini. Customizing Tools for Virtual Reality Applications in the Cultural Heritage Field.
[4] E. Bonini, P. Pierucci, E. Pietroni. Towards Digital Ecosystems for the Transmission and Communication of Cultural Heritage: an Epistemological Approach to Artificial Life.
[5] A. Calabrese, B. Como, B. Discepolo, L. Ganguzza, L. Licenziato, F. Mele, M. Nicolella, B. Stangherling, A. Sorgente, R. Spizzuoco. Automatic Generation of Maintenance Plans for Historical Residential Buildings.
[6] A. Bonomi, G. Mantegari, G. Vizzari. Semantic Querying for an Archaeological E-library.
[7] A. D’Andrea, G. Ferrandino, A. Gangemi. Shared Iconographical Representations with Ontological Models.
[8] L. Bordoni, A. Gisolfi, A. Trezza. INTOUSYS: a Prototype Personalized Tourism System.
[9] D. Magro. Integrated Promotion of Cultural Heritage Resources.
[10] D. Schmidt, D. Fiormonte. Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts.

The ontology-driven unifying metamodel of UML class diagrams, ER, EER, ORM, and ORM2

Metamodelling of conceptual data modelling languages is nothing new, and one may wonder why one would need yet another one. But you do, if you want to develop complex systems or integrate various legacy sources (which South Africa is going to invest more money in) and automate at least some parts of it. For instance: you want to link up the business rules modelled in ORM, the EER diagram of the database, and the UML class diagram that was developed for the application layer. Are the, say, Student entity types across the models really the same kind of thing? And UML’s attribute StudentID vs. the one in the EER diagram? Or EER’s EmployeesDependent weak entity type with the ORM business rule that states that “each dependent of an employee is identified by EmployeeID and the Dependent’s Name”?

Ascertaining the correctness of such inter-model assertions in different languages does not require a comparison and contrast of their differences, but a way to harmonise or unify them. Some such models already exist, but they take subsets of the languages, whereas all those features do appear in actual models [1] (described here informally). Our metamodel, in contrast, aims to capture all constructs of the aforementioned languages and the constraints that hold between them, and to generalize in an ontology-driven way so that the integrated metamodel subsumes the structural, static elements of them (i.e., the integrated metamodel has them as fragments). Besides some updates to the earlier metamodel fragment presented in [2,3], the current version [4,5] also includes the metamodel fragment of their constraints (though it omits temporal aspects and derived constraints). The metamodel and its explanation can be found in the paper An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2 [4], which I co-authored with Pablo Fillottrani, and which was recently accepted in Data & Knowledge Engineering.

Methodologically, the unifying metamodel presented in the paper [4] is ontological rather than formal (cf. all other known works). By ‘ontology-driven approach’ is meant here the use of insights from Ontology (philosophy) and ontologies (in computing) to enhance the quality of a conceptual data model and obtain that ‘glue stuff’ to unify the metamodels of the languages. The DKE paper describes all that, such as: the nature of the UML association/ORM fact type (different wording, same ontological commitment), attributes with and without data types, the plethora of identification constraints (weak entity types, reference modes, etc.), where one can reuse an ‘attribute’, if at all, and more. The main benefit of this approach is being able to cope with the larger number of elements that are present in those languages, and it shows that, in the details, the overlap in features across the languages is rather small: 4 among the set of 23 types of relationship, role, and entity type are essentially the same across the languages (see figure below), and 6 of the 49 types of constraints. The metamodel is stable for the modelling languages covered. It is represented in UML for ease of communication, but, as mentioned earlier, it has also been formalised in the meantime [5].

Types of elements in the languages; black-shaded: the entity is present in all three language families (UML, EER, ORM); dark grey: in two of the three; light grey: in one; white-filled: in none, but we added the more general entities to ‘glue’ things together. (Source: [4])

Metamodel fragment with some constraints among some of the entities. (Source [4])

The DKE paper also puts it in a broader context with examples, model analyses using the harmonised terminology, and a use case scenario that demonstrates the usefulness of the metamodel for inter-model assertions.

While the 24-page paper is rather comprehensive, research results wouldn’t live up to their name if they didn’t also uncover new questions. Some of them have been, and are being, answered in the meantime, such as its use for classifying models and comparing their characteristics [1,6] (blogged about here and here) and a rule-based approach to validating inter-model assertions [7] (informally here). Although the 3-year funded project on the Ontology-driven unification of conceptual data modelling languages—which surely contributed to realising this paper—has just officially finished, we’re not done yet, or: more is in the pipeline. To be continued…

 

References

[1] Keet, C.M., Fillottrani, P.R. An analysis and characterisation of publicly available conceptual models. 34th International Conference on Conceptual Modeling (ER’15). Springer LNCS. 19-22 Oct, Stockholm, Sweden. (in press)

[2] Keet, C.M., Fillottrani, P.R. Toward an ontology-driven unifying metamodel for UML Class Diagrams, EER, and ORM2. 32nd International Conference on Conceptual Modeling (ER’13). W. Ng, V.C. Storey, and J. Trujillo (Eds.). Springer LNCS 8217, 313-326. 11-13 November, 2013, Hong Kong.

[3] Keet, C.M., Fillottrani, P.R. Structural entities of an ontology-driven unifying metamodel for UML, EER, and ORM2. 3rd International Conference on Model & Data Engineering (MEDI’13). A. Cuzzocrea and S. Maabout (Eds.) September 25-27, 2013, Amantea, Calabria, Italy. Springer LNCS 8216, 188-199.

[4] Keet, C.M., Fillottrani, P.R. An ontology-driven unifying metamodel of UML Class Diagrams, EER, and ORM2. Data & Knowledge Engineering. 2015. DOI: 10.1016/j.datak.2015.07.004. (in press)

[5] Fillottrani, P.R., Keet, C.M. KF metamodel Formalization. Technical Report, Arxiv.org http://arxiv.org/abs/1412.6545. Dec 19, 2014. 26p.

[6] Fillottrani, P.R., Keet, C.M. Evidence-based Languages for Conceptual Data Modelling Profiles. 19th Conference on Advances in Databases and Information Systems (ADBIS’15). Springer LNCS. Poitiers, France, Sept 8-11, 2015. (in press)

[7] Fillottrani, P.R., Keet, C.M. Conceptual Model Interoperability: a Metamodel-driven Approach. 8th International Web Rule Symposium (RuleML’14), A. Bikakis et al. (Eds.). Springer LNCS 8620, 52-66. August 18-20, 2014, Prague, Czech Republic.

Quasi wordles of isiZulu online newspaper articles from this weekend

Every now and then, I get side-tracked from what I was (supposed to be) doing. This time, it was the result of the combination of preparing ICPC training problems, preparing a statistics tutorial for the postgraduate research methods course, and a conversation from last week on an isiZulu corpus with Langa Khumalo from UKZN’s ULPDO (and my co-author on several papers on isiZulu CNLs). To make a long story short, I ended up sourcing some online news articles in isiZulu and writing a little Python script to count the words and the top-k most frequent words of the news articles, to get a feel for the most prevalent topics of the articles.

 

Materials and data

  • 10 Isolezwe articles, listed on the front page on August 8, 2015 (articles were from Aug 6 and 7—no updates in the long weekend);
  • 10 News24 in isiZulu articles, listed on the front page on August 8, 2015 (articles were from Aug 8);
  • 10 News24 in isiZulu articles, listed on the front page on August 9, 2015 (articles were from Aug 9, a Sunday, and Women’s Day in South Africa);
  • a simple basicCorpusStats.py that one can put together just by going through the first part of ThinkPython (in case you’re unfamiliar with Python); a minimal sketch of such a script is given below, after the notes.

Note: Ilanga doesn’t have its articles online, and therefore was not included.

Note 2: for copyright reasons, I probably cannot share the txt files online, but in case you’re interested, just ask me and I’ll email them.
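For the curious, here is a minimal sketch of what such a word-counting script could look like (a guessed reconstruction for illustration, not the actual basicCorpusStats.py; the command-line usage is made up):

```python
# A minimal sketch in the spirit of basicCorpusStats.py (a hypothetical
# reconstruction for illustration, not the original script).
import re
import sys
from collections import Counter

def words(filename):
    """Return a list of lowercased words from a plain-text article."""
    with open(filename, encoding="utf-8") as f:
        return re.findall(r"[a-z']+", f.read().lower())  # crude tokenisation

def top_k(filenames, k=20):
    """Aggregate word counts over the files; return the k most frequent words."""
    counts = Counter()
    for name in filenames:
        counts.update(words(name))
    return counts.most_common(k)

if __name__ == "__main__":
    # e.g.: python basicCorpusStats.py isolezwe1.txt isolezwe2.txt ...
    for word, n in top_k(sys.argv[1:]):
        print(word, n)
```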

 

Some general stats

Isolezwe had, on average, 265 words/article, whereas News24 had about half of that (110 and 134 on Saturday and Sunday, respectively). The top-20 of each is listed at the end of this post (from the raw results of News24, “–” was removed [a bug], as well as udaba and olunye [standard-text noise from the articles]).

Comparing them on the August 8 offering: Isolezwe had people saying this, that, and the other (ukuthi ‘saying/to say’ had the highest frequency, 60), and then the police (amaphoyisa, n=27), whereas News24 had amaphoyisa 27 times as its most frequent word, then abasolwa (‘suspects’) 11 times, which doesn’t even appear in Isolezwe’s top-20 most frequent words (though the stem –solwa appears 9 times). The police are problematic in South Africa—they commit crimes and engage in other dubious behaviour that is under investigation (e.g., Marikana)—more police officers get killed than in many other countries (another one last week), and crime happens. But not on a public holiday, apparently: News24 had only one –phoyisa on Aug 9.

While I had hoped to find a high incidence of women-related words, it being Women’s Day on August 9, no occurrence of –fazi appeared in the News24 mini-corpus of 1353 words from the 10 front-page articles; instead, there was a lot of saying this, that, and the other (ukuthi had the highest frequency, 37), and little on suspects or blaming (–solwa, n=3).

 

On that quasi wordle

While ukuthi is the infinitive, there are a gazillion conjugations and things agglutinated to it, such that it is barely clear even to linguists how it all works, so I did not analyse that further. Amaphoyisa, on the other hand, as a noun (the plural of ‘police’), has fewer variations. In the Isolezwe mini-corpus, –phoyis– (the root of ‘police’) appeared 47 times, including variants like lwamaphoyisa, ngamaphoyisa, and yiphoyisa, i.e., substantially more than the 27 amaphoyisa. If I were to create a wordle, these would be missed unless one uses some stemmer, which doesn’t happen to be available[1], and I didn’t write one (just regex on the txt). By the same token, News24’s mention of the police on August 8 goes up to 28 with –phoyisa, with the blaming and suspects a close second (–solwa, n=27).

The lack of a stemmer also means missing out on all sorts of variations on imali (‘money’, n=11) in the Isolezwe articles, whereas its stem –mali pops up 29 times, due to, among others, kwemali (n=5), mali (n=3), yimali (the y- functioning as a copulative in that sentence, n=1), and ngezimali (n=1). Likewise for person/people (–ntu), for which n=17, distributed among abantu (plural), umuntu (singular), and nabantu (‘and people’), among others.
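In lieu of a stemmer, the ‘just regex on the txt’ approach amounts to something like the following (a toy sketch; the file name is invented, and it naively over-counts any word that merely contains the letter sequence):

```python
# Counting root variants with a plain regex instead of a proper stemmer
# (toy sketch; it over-counts words that merely contain the letter sequence).
import re

def count_root(root, text):
    """Count words containing the given root, e.g., 'phoyis' matches
    amaphoyisa, lwamaphoyisa, ngamaphoyisa, yiphoyisa, ..."""
    return len(re.findall(r"\w*" + root + r"\w*", text))

text = open("isolezwe_aug8.txt", encoding="utf-8").read().lower()
for root in ["phoyis", "mali", "ntu", "solwa"]:
    print(root, count_root(root, text))
```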

Last, the second most frequently used word in News24 on August 9 was njengoba (‘as’, ‘whereas’, ‘since’), primarily due to the first article being on the sports results of the matches played.

So, with all that background knowledge, Isolezwe’s wordle would be, in descending order (and in English for the readers of this blog): say, police, money, people. News24 on August 8: police, suspect/blame, say (two variations, n=9 each). News24 on August 9: say, as/since (and then some other adverbs).

 

In closing

This dabbling resulted in more problems and questions being raised than answered. But, for now, it’s at least still a bit of a peek into the kitchen of news in a language that I don’t master as well as I want to and should. It wasn’t useful for the ICPC problem setting or the stats tutorial, nor is a 5123-word corpus of much use, but it was fun with Python at least, satisfying at least a little of my curiosity, and perhaps it spurs someone to do all this properly/more systematically and on a grander scale. For the isiZulu speakers: it’s surely still up to you to read whichever news outlet you prefer.

 

 

References

[1] Pretorius, L., Bosch, S.E. (2010). Finite-state morphology of the Nguni language cluster: modelling and implementation issues. In Yli-Jyrä, A., Kornai, A., Sakarovitch, J. & Watson, B. (Eds.), Finite-State Methods and Natural Language Processing, 8th International Workshop, FSMNLP 2009. Lecture Notes in Computer Science, Vol. 6062, pp. 123-130.

[2] Spiegler, S., van der Spuy, A., and Flach, P.A. (2010). Ukwabelana – an open-source morphological Zulu corpus. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING’10), pages 1020-1028. Association for Computational Linguistics. Beijing.

 

 

Top-20 words per source and day:

Isolezwe, Aug 8 | News24, Aug 8   | News24, Aug 9
ukuthi 60       | amaphoyisa 19   | ukuthi 37
amaphoyisa 27   | abasolwa 11     | njengoba 13
uthe 17         | uthe 9          | ngemuva 12
ngoba 17        | ukuthi 9        | ngesikhathi 8
lokhu 16        | ubudala 9       | lo 8
kuthiwa 16      | njengoba 9      | uthe 7
kusho 16        | kusho 9         | uma 7
kodwa 12        | lo 8            | futhi 7
imali 11        | oneminyaka 7    | uzakwe 6
uma 10          | ngokuthakatha 7 | usnethemba 6
ngesikhathi 10  | ngokusho 7      | kodwa 6
yakhe 9         | omphakathi 6    | johannesburg 6
ukuba 9         | okhulumela 6    | yakhe 5
njengoba 9      | kanti 6         | united 5
nje 9           | endaweni 6      | ukuba 5
lo 9            | yohlobo 5       | ukomphela 5
khona 9         | ngesikhathi 5   | ufaku 5
abantu 9        | ngemuva 5       | ubudala 5
umphakathi 8    | le 5            | rhythms 5
umnuz 8         | imoto 5         | ngokubika 5

 

 

 

[1] There is some material on that (among others, [1,2] below), though, but it’s mostly theoretical or very proof-of-concept, rather than tools for easy reuse like there are for English, and the example rule in [1] isn’t right (it’s umfana, not umufana; the longer prefix with the extra –u– is used when the stem is one syllable, like –ntu -> umuntu).

An orchestration of ontologies for linguistic knowledge

Starting from multilingual knowledge representation in ontologies, and with an eye on linguistic linked data and controlled natural languages, we had developed a basic ontology for the Bantu noun class system [1] to link with the lemon model [2]. The noun class system is akin to gender in, e.g., German and Italian, but then a bit different. It is based on the semantics of the nouns, and each Bantu language has some 12-23 noun classes. For instance, noun classes 1 and 2 are for singular and plural humans, 9 and 10 for animals (singular and plural, respectively), 11 for inanimates and long thin objects (e.g., a telephone cable), and class 14 has abstract nouns (e.g., beauty). Each class has its own augment or augment+prefix to be added to the stem. None of the other linguistic resources, such as ISOcat or the GOLD ontology, dealt with them, so lemon did not either, but we needed it. The first version of the ontology we introduced in [1] had its limitations, but it mostly did its job. Mostly, but not fully.

Lemon needs that morphology module, and then some for the rules. The ontology did not fully cater for Bantu languages other than Chichewa and isiZulu: having been developed with knowledge of those two only, it was more like a merged conceptual data model, for it was tailored to the two specific languages. Also, it wasn’t aligned to other models or ontologies, thus hampering interoperability and reuse. We didn’t have any competency questions or cool inferences either, because our scope then was just to annotate the names of the classes in an ontology. Hence, it was time for an improvement.

Among others, we don’t want just to annotate, but, given that Bantu languages are under-resourced, see what we can add to derive implicit information, which could help with tagging terms. For instance (see also the sketch after this list):

  • if you know abantu is a plural and in noun class 2 and umuntu is the singular of it, then umuntu is in noun class 1, or
  • when it is declared that inja is in noun class 9, then so is its stem -ja (or vice versa), or
  • language-specific, which singular (plural) noun class goes with which plural (singular) noun class: while the majority neatly has a pair of successive odd and even numbers (1-2, 3-4, 5-6, etc.), this is not always the case; e.g., in isiZulu, noun class 11 does not have noun class 12 as its plural, but noun class 10 (which has its own augment and prefix).
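To give an idea of what the first and third bullet amount to, here is a toy sketch in plain Python with a simplified, partial pairing (the real artefact is an OWL ontology plus language-specific axioms; this merely mimics the intended inference):

```python
# Toy sketch (not the actual NCS ontology, which is in OWL) of isiZulu's
# singular/plural noun class pairings and a derivation over them.
ISIZULU_PLURAL_OF = {1: 2, 3: 4, 5: 6, 7: 8, 9: 10, 11: 10}  # simplified

def singular_classes(plural_class, plural_of=ISIZULU_PLURAL_OF):
    """All singular noun classes whose plural class is the given one."""
    return sorted(sg for sg, pl in plural_of.items() if pl == plural_class)

# abantu is in class 2 and umuntu is its singular, so umuntu is in class 1:
assert singular_classes(2) == [1]
# in isiZulu, class 10 is the plural class of both class 9 and class 11:
assert singular_classes(10) == [9, 11]
```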

Then, besides the interoperability and reuse requirements, we needed to distinguish between language-specific axioms and those that hold across the language family. To solve all that, we developed a framework, reusing the pyramid structure idea from BioTop [3] and the so-called “double articulation principle” of DOGMA [4], where the language-specific axioms are at the level of DOGMA’s conceptual model, for they add specific constraints.

To make a long story short, the framework/orchestration applied to the linguistic knowledge of Bantu noun classes in general, and specific to some language, looks as follows:

framework applied to some linguistics ontologies (source: [5])

More details are described in the recently accepted paper “An orchestration framework for linguistic task ontologies” [5], to be presented at the 9th Metadata and Semantics Research Conference (MTSR’15), to be held from 9 to 11 September in Manchester, UK. My co-author Catherine Chavula will be attending MTSR’15 and present our paper, hoping/assuming that all those last-minute things—like the visa and the money actually being transferred to buy that plane ticket—will be sorted this month. (The odd ‘checks and balances’ that make life harder and more expensive for people outside of a visa-free zone and tied to a funding benefactor are a topic for some other time.)

The set of ontologies (in OWL) is available in NCS1.zip from my ontologies directory. It contains the goldModule—a module extracted from the GOLD ontology for general linguistics knowledge, aligned to the foundational ontology SUMO—the NCS ontology, and three language-specific axiomatizations for the noun classes, being Chichewa, isiXhosa, and isiZulu (more TBA). The same approach can be used for other linguistic features in other language groups or families; e.g., instead of the NCS, one could have knowledge represented about conjugation in the Romance languages (Italian, Spanish, etc.), and then the more precise axiomatization (conceptual data model, if you will) for constraints unique to each language.

 

p.s.: ‘Bantu languages’ is the term used in linguistics, so that’s why it’s used here. Elsewhere, they are also called African languages. They’re not synonymous, however, as the latter also includes other, non-Bantu, languages: it can designate any language spoken in Africa, which may have a wholly different grammar. Hence the distinction linguists make, to avoid misinterpretation.

 

References

[1] Chavula, C., Keet, C.M. Is Lemon Sufficient for Building Multilingual Ontologies for Bantu Languages? 11th OWL: Experiences and Directions Workshop (OWLED’14). Keet, C.M., Tamma, V. (Eds.). Riva del Garda, Italy, Oct 17-18, 2014. CEUR-WS vol. 1265, 61-72.

[2] McCrae, J., Aguado-de Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., Gracia, J., Hollink, L., Montiel-Ponsoda, E., Spohr, D., Wunner, T.: Interchanging lexical resources on the Semantic Web. Language Resources & Evaluation, 2012, 46(4), 701-719

[3] Beißwanger, E., Schulz, S., Stenzhorn, H., Hahn, U.: BioTop: An upper domain ontology for the life sciences: A description of its current structure, contents and interfaces to OBO ontologies. Applied Ontology, 2008, 3(4), 205-212.

[4] Jarrar, M., Meersman, R.: Ontology Engineering – The DOGMA Approach. In: Advances in Web Semantics I, LNCS, vol. 4891, pp. 7-34. Springer (2009)

[5] Chavula, C., Keet, C.M. An Orchestration Framework for Linguistic Task Ontologies. 9th Metadata and Semantics Research Conference (MTSR’15), Springer CCIS. 9-11 September, 2015, Manchester, UK. (in print)

Reblogging 2006: Figuring out requirements for automated reasoning services for formal bio-ontologies

From the “10 years of keetblog – reblogging: 2006”: a preliminary post that led to the OWLED 2007 paper I co-authored with Marco Roos and Scott Marshall, when I was still predominantly into bio-ontologies and biological databases. The paper received quite a few citations, and a good ‘harvest’ from both OWLED’07 and co-located DL’07 participants on how those requirements may be met (described here). The original post: Figuring out requirements for automated reasoning services for formal bio-ontologies, from Dec 27, 2006.

What does the user want? There is a whole sub-discipline on requirements engineering, where researchers look into methodologies for how one can best extract the users’ desires for a software application and organize the requirements according to type and priority. But what to do when users – in this case biologists and (mostly non-formal) bio-ontologies developers – neither know clearly themselves what they want nor what type of automated reasoning is already ‘on offer’? Here, I’m making a start by briefly listing informally some of the desires & usages that I came across in the literature, picked up from conversations and further probing to disambiguate the (for a logician) vague descriptions, or bumped into myself; they are summarised at the end of this blog entry and (update d.d. 5-5-’07) described more comprehensively in [0].

Feel free to add your wishes & demands; it may even be fed back into current research like [1] or be supported already after all. (An alternative approach is describing ‘scenarios’ from which one can try to extract the required reasoning tasks; if you want to, you can add those as well.)

I. A fairly obvious use of automated reasoners such as Racer, Pellet, and FaCT++ with ontologies is to let the software find errors (inconsistencies) in the representation of the knowledge or reality. This is particularly useful to ensure no ‘contradictory’ information remains in the ontology when an ontology gets too big for one person to comprehend and multiple people update it. Also, it tends to facilitate learning how to formally represent something. Hence, the usage is to support the ontology development process.

But this is just the beginning: having a formal ontology gives you other nice options, or at least that is the dangling carrot in front of the developer’s nose.

II. One demonstration of the advantages of having a formal ontology, thus not merely a promise, is the classification of protein phosphatases by Wolstencroft et al. [9], where some modest results were also obtained in discovering novel information about those phosphatases that was entailed in the extant information but hitherto unknown. Bandini and Mosca [2] pushed a closely related idea one step further in another direction. To constrain the search space of candidate rubber molecules for tire production, they defined the constraints (properties) that all types of molecules for tires must satisfy in the TBox, treated each candidate molecule as an instance in the ABox, and performed model checking on the knowledge base: each instance inconsistent w.r.t. the TBox was thrown out of the pool of candidate molecules. Overall, the formal representation with model checking achieved a considerable reduction in the resource usage of the system and reduced the amount of costly wet-lab research. Hence, the usages are classification and model checking.[i]
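As a rough illustration of that TBox/ABox filtering idea—a toy sketch with owlready2 and an invented elasticity constraint, not Bandini and Mosca’s actual system or chemistry—one can assert each candidate as an instance and discard it if the knowledge base becomes inconsistent:

```python
# Toy sketch of TBox-constraints + ABox-instances filtering with owlready2
# (hypothetical property and threshold; not the system of Bandini & Mosca).
from owlready2 import (get_ontology, Thing, DataProperty, FunctionalProperty,
                       ConstrainedDatatype, sync_reasoner,
                       OwlReadyInconsistentOntologyError, destroy_entity)

onto = get_ontology("http://example.org/tires.owl")

with onto:
    class Molecule(Thing): pass
    class hasElasticity(DataProperty, FunctionalProperty):
        domain = [Molecule]
        range  = [float]
    class TireCandidate(Molecule): pass
    # the TBox constraint: a usable tire molecule has elasticity >= 0.6
    TireCandidate.is_a.append(
        hasElasticity.some(ConstrainedDatatype(float, min_inclusive=0.6)))

def acceptable(name, elasticity):
    """Assert a candidate in the ABox; keep it only if the KB stays consistent."""
    with onto:
        m = TireCandidate(name)
        m.hasElasticity = elasticity
    try:
        sync_reasoner(debug=0)   # runs the HermiT reasoner; needs Java installed
        return True
    except OwlReadyInconsistentOntologyError:
        return False
    finally:
        destroy_entity(m)        # remove the candidate before testing the next

print(acceptable("mol1", 0.8))   # True: the constraint is satisfied
print(acceptable("mol2", 0.3))   # False: clash with the functional data property
```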

III. Whereas the former includes the usage of particular instances in the reasoning scenarios, another one is to stay at the type level and, in particular, the relations between the types (classes in the class hierarchy in Protégé). In short, some users want to discover new or missing relations. What type of relation is not always exactly clear, but I assume for now that any non-isA relation would do. For instance, Roos et al. [8] would like to do that for the subject domain of histones, with or without instance-level data. The former, using instance-level data, resembles the reverse engineering option in VisioModeler, which takes a physical database schema and the data stored in the tables and computes the likely entities, relations, and constraints at the level of a conceptual model (in casu, ORM). Mungall [7] wants to “Check for inconsistent or missing relationships” and “Automatically assign relationships for new terms”. How can one find what is not there but ought to be in the ontology? An automated reasoner is not an oracle. I will split this topic into two aspects. First, one can derive relations among types, meaning that some ontology developer has declared several types, relations, and other properties, but not everything. The reasoner then takes the declared knowledge and can return relations that are logically implied by the formal ontology. From a user perspective, such a derived relation may be perceived as a ‘new’ or ‘missing’ relation – but it did not fall out of the sky, because the relation was already entailed in the ontology (or: you did not realize you knew it already). Second, there is another notion of ‘missing relations’: e.g., there are 17 types of macrophages (types of cell) in the FMA, which must be part of, contained in, or located in something. If you query the FMA through OQAFMA, it gives as answer that the hepatic macrophage is part of the liver [5]. An informed user knows it cannot be the case that the other macrophages are not part of anything. Then, the ontology developer may want to fill this gap – adding the ‘missing’ relations – by further developing those cell-level sections of the ontology. Note that the reasoner will not second-guess you by asking “do you want more things there?”; it uses the Open World Assumption, i.e., there may always be more than what is actually represented in the ontology (and the absence of some piece of information is not the negation of that piece). Thus, the requirements are to have some way of dealing with ‘gaps’ in an ontology, to support computing derived relations entailed in a logical theory, and, third, to derive type-level relations based on instance-level data. The second one is already supported, the first one only with intervention by an informed user, and the third one might be, to some extent.

Now three shorter points, either because there is even less material or there is too much to stuff it in this blog entry.

IV. A ‘this would be nice’ suggestion from Christina Hettne, among others, concerns the desire to compare pathways, which, in its simplest form, amounts to checking for sub-graph isomorphisms. More generally, one could – or should be able to – treat an ontology as a scientific theory [6] and compare competing explanations of some natural phenomenon (provided both are represented formally). Thus, we have a requirement for the comparison of two ontologies, not with the aim of doing meaning negotiation and/or merging them, but where the discrepancies themselves are the fun part. This indicates that dealing with the ‘errors’ that a reasoner spits out could use an upgrade toward user-friendliness.

V. Reasoning with parthood and parthood-like relations in bio-ontologies is on a par in importance with the subsumption relation. Donnelly [3] and Keet [4], among many others, would like to use parthood and parthood-like relations for reasoning, covering more than transitivity alone. Generalizing a bit, we have another requirement: reasoning with properties (relations) and hierarchies of relations, focusing first on the part-whole relation. What reasoning services are required exactly, be it for parthood or any other relation, deserves an entry on its own.
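For the transitivity part alone, deriving the implied relations is straightforward (a toy sketch with invented anatomy assertions, merely to illustrate what ‘computing derived relations’ means here):

```python
# A toy illustration (invented assertions, not from the cited papers) of
# deriving implied parthood relations by transitivity alone.
from itertools import product

declared = {
    ("nucleus", "cell"),
    ("cell", "tissue"),
    ("tissue", "organ"),
}

def transitive_closure(pairs):
    """Derive all part-of pairs implied by chaining the declared ones."""
    closure = set(pairs)
    while True:
        derived = {(a, d) for (a, b), (c, d) in product(closure, closure)
                   if b == c} - closure
        if not derived:
            return closure
        closure |= derived

# derives, e.g., ('nucleus', 'organ'), which was never declared explicitly
for part, whole in sorted(transitive_closure(declared)):
    print(part, "is part of", whole)
```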

VI. And whatnot? For instance, linking up different ontologies that each reside at their own level of granularity, yet enabling ‘granular cross-ontology queries’, or inferring the locations of diseases by combining an anatomy ontology with a disease taxonomy – hence, reasoning over linked ontologies. This needs to be written down in more detail, and may be covered at least partially by the second point in item III.

Summarizing, we have to following requirements for automated reasoning services, in random order w.r.t. importance:

  • Support in the ontology development process;
  • Classification;
  • Model checking;
  • Finding ‘gaps’ in the content of an ontology;
  • Computing derived relations at the type level;
  • Deriving type-level relations from instance-level data;
  • Comparison of two ontologies ([logical] theories);
  • Reasoning with a plethora of parthood and parthood-like relations;
  • Using (including finding inconsistencies in) a hierarchy of relations in conjunction with the class hierarchy;
  • Reasoning across linked ontologies;

I doubt this is an exhaustive list, and expect to add more requirements & desires soon. They also have to be specified more precisely than the brief explanations above, and the solutions to meet these requirements need to be elaborated upon as well.



[0] Keet, C.M., Roos, M., Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS.

[1] European FP6 FET Project “Thinking ONtologiES (TONES)”. (UPDATE 29-7-2015: URL defunct by now)

[2] Bandini, S., Mosca, A. Mereological knowledge representation for the chemical formulation. 2nd Workshop on Formal Ontologies Meets Industry 2006 (FOMI2006), 14-15 December 2006, Trento, Italy. pp55-69.

[3] Donnelly, M., Bittner, T. and Rosse, C. A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies. Artificial Intelligence in Medicine, 2006, 36(1):1-27.

[4] Keet, C.M. Part-whole relations in Object-Role Models. 2nd International Workshop on Object-Role Modelling (ORM 2006), Montpellier, France, Nov 2-3, 2006. In: OTM Workshops 2006. Meersman, R., Tari, Z., Herrero, P. et al. (Eds.), LNCS 4278. Berlin: Springer-Verlag, 2006. pp1116-1127.

[5] Keet, C.M. Granular information retrieval from the Gene Ontology and from the Foundational Model of Anatomy with OQAFMA. KRDB Research Centre Technical Report KRDB06-1, Free University of Bozen-Bolzano, 6 April 2006. 19p.

[6] Keet, C.M. Factors affecting ontology development in ecology. Data Integration in the Life Sciences 2005 (DILS2005), Ludaescher, B., Raschid, L. (eds.). San Diego, USA, 20-22 July 2005. Lecture Notes in Bioinformatics 3615, Springer Verlag, 2005. pp46-62.

[7] Mungall, C.J. Obol: integrating language and meaning in bio-ontologies. Comparative and Functional Genomics, 2004, 5(6-7):509-520. (UPDATE: link rot as well; a freely accessible version is available at: http://berkeleybop.org/~cjm/obol/doc/Mungall_CFG_2004.pdf)

[8] Roos, M., Rauwerda, H., Marshall, M.S., Post, L., Inda, M., Henkel, C., Breit, T. Towards a virtual laboratory for integrative bioinformatics research. CSBio Reader: Extended abstracts of “CS & IT with/for Biology” Seminar Series 2005. Free University of Bozen-Bolzano, 2005. pp18-25.

[9] Wolstencroft, K., Lord, P., Tabernero, L., Brass, A., Stevens, R. Using ontology reasoning to classify protein phosphatases [abstract]. 8th Annual Bio-Ontologies Meeting; 2005 24 June; Detroit, United States of America.


[i] Observe another aspect regarding model checking, where the automated reasoner checks if the theory is satisfiable, or: given your ontology, whether there is/can be a combination of instances such that all the declared knowledge in the ontology holds (is true), which is called a ‘model’ (as well, like so many things). That an ontology is satisfiable does not imply it only has models as intended by you, i.e., there is a difference between ‘all models’ and ‘all intended models’. If an ontology is satisfiable, it means that it is logically possible for each type to be instantiated without running into inconsistencies; it neither demonstrates that one can indeed find in reality the real-world versions of those represented entities, nor that there is one-and-only-one model that actually matches exactly the data you may want to have linked to & checked against the ontology.

Reblogging 2006: “We are what we repeatedly do…

This is the 10th year of my blog, which started off as a little experiment and ‘seeing where it ends up’. In numbers, there are over 200 posts and I estimate that in September, the blog will clock its 100,000th visitor. I had a look at the list of posts, and I’ll reblog about 2 blog posts from each year, trying to pick one ‘general’ topic and one about my research that will also note some follow-ups that happened after the post. I’ve selected them ignoring the ratings or visits of the posts, as I still haven’t figured out why some posts get lots of hits whilst others don’t; shouldn’t you all want to know about changes in the ingredients of people’s meals or strive to be a nonviolent person, rather than solving a problem on rearranging luggage in an airport carousel or looking into money-making academia.edu or self-indulgence on mapmaking showing all and sundry the countries you visited? Anyway, this is the first installment of it.

From the “10 years of keetblog – reblogging: 2006” (June 11, 2006): A summary on what to do (repeatedly) to become a good researcher.

“We are what we repeatedly do…

…excellence, then, is not an act but a habit”, Aristotle has said. Being an excellent researcher then amounts to habitually doing excellent research. A prerequisite of doing excellent research is to do research effectively. Even the famed, and idealized, eureka! moment scientists occasionally (are supposed to) have is based on sound foundations acquired through searching, reading, researching, thinking, testing, and integrating new scientific developments with extant knowledge already accumulated. But how to get there? I don’t know – I’m only studying to become an excellent scientist.
Besides the aforementioned list of activities, I occasionally browse the Internet and procrastinate by reading about how to write a thesis, improve English grammar and word use, plan activities to avoid unemployment, the PhD comic, and more of those types of suggestions that don’t help me with the topic itself (granularity), but are about how to do things.

Serendipity, perhaps, it was that brought me to an essay by Michael Nielsen, entitled “Principles of effective research” [1]. I summarise it here briefly, but it would be better if you read the 12 pages in full.

The first section is about integrating research into the rest of your life. So, unlike the narratives and jokes that tell you it’s normal to not have a life as a PhD student or researcher, this would be the wrong direction to go or stage to be at. Be fit, have fun.

The principles of personal behaviour to achieve effective research are proactivity, vision, and self-discipline. Don’t abdicate responsibility, and be accountable to other people. Vision does not apply to where you think your research field will be in 20 years, but to where you want to be then: what sort of researcher do you want to be, which areas are you interested in (etc.)? Have clear for yourself what you want to achieve, why, and how.

Regarding the research itself, self-development and the creative process are important. But focusing on self-development only is not ok, because then one fails to make a contribution to the community, whose viability and success depend on input from scientists (among others). On the other hand, keeping on organizing workshops and conferences, doing reviewing, etc. leaves little time for the self-development and creative process of doing research to make scientific contributions. That is, one should strive for a balance of the two.
Self-development includes developing research strengths, your ‘niche’ with a unique combination of abilities to get a comparative advantage. Emerging disciplines, mostly interdisciplinary ones, are a nice mess to sort out, for instance. Then, read the 10 seminal contributions in the other field as opposed to skimming several hundred mediocre articles that are fillers of the conferences or journals. (This doesn’t sound particularly friendly, but if I take bioinformatics or the ontologies hype as examples, there are quite a lot of articles that don’t add new ideas [but more descriptions of yet more software tools], and interdisciplinary articles are known to be not easy to review, hence more articles with confused content fall through the cracks and make it into archived publications.) A high-quality research environment helps.
Concerning the creative process, much depends on whether you’re a problem-solver or a problem-creator, with each requiring specific skills. The former generally receives more attention, because there are so many things unknown, and figuring out how/what/why something works gives sought-for answers, technologies, or methodologies. Problem-creators, on the other hand, generate new work, so to speak: by asking interesting questions, reformulating old, nigh unsolvable problems in a new way, or showing connections nobody has thought of before. Read Nielsen’s article for details on the suggested skill set for each type.

Wishing you good luck with all this, then, is inappropriate, as luck does not seem to have much to do with becoming an effective researcher. So, go forth, improve your habits, and reap what you sow.

[1] Nielsen, M.A. Principles of effective research. July 27th , 2004. UPDATE 22-7-2015: this is the new URL: http://michaelnielsen.org/blog/principles-of-effective-research/

Wikipedia + open access = not quite a revolution (not yet at least)

The title of the arxiv blog post sounded so catchy, and wishful thinking lent it a high truthlikeness: “Why Wikipedia + open access = revolution”, summarizing and expanding on arxiv.org/abs/1506.07608, titled “Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science” [1], with some quotes:

“The odds that an open access journal is referenced on the English Wikipedia are 47% higher compared to closed access journals,” say Teplitskiy and co.

Open access publishing has changed the way scientists communicate with each other but Teplitskiy and buddies have now shown that its influence is much more significant. “Our research suggests that open access policies have a tremendous impact on the diffusion of science to the broader general public through an intermediary like Wikipedia,” says Teplitskiy and co.

It means that open access publications are dramatically amplifying the way science diffuses through the world and ultimately changing the way we understand the universe around us.

I sooo want to believe. And, honestly, when I search for something and Wikipedia is the first hit and I do click, it does seem to give a decent introductory overview of something I know little about so that I can make a better start for searching the real sources. I never bothered to look up my own areas of specialisation, other than when a co-author mentioned there was (she put?—I can’t recall) a reference to her tool in Wikipedia some time ago. But there’s that nagging comment to the technologyreview blog post saying the same thing, and adding that when s/he looked up his/her own field, s/he

“then realized that in my own field, my main reaction was to want to scream at the cherry picking of sources to promote some minor researcher.”

So, I looked up “ontology engineering” and “Ontologies”, which redirected to “Ontology (information science)” (‘information science’, tsk!)… and I kinda screamed. The next sections are, first, about the merits of the arxiv paper (outcome: their conclusions are rather exaggerated) and, second, a deeper dig using that ‘ontology (information science)’ entry as a use case, covering both the English entry and those in several other languages, as that’s what the arxiv paper covers as well. I’ll close with some thoughts on what to do about it.

 

On the arxiv paper’s data and results

There are several limitations to the paper; some of them are discussed by its authors, some are not. The arxiv paper does not distinguish between online freely available scientific literature, where only the final typeset version is behind a paywall, and official ‘open access’. This is problematic for processing the computer science entries in Wikipedia when trying to validate their hypothesis. In addition, they considered only journals with an official open access policy and a journal-level analysis (cf. article-level analysis), idem for the problematic ISI impact factor, and only the 21000 journals listed in Scopus, amounting eventually to the (ISI-indexed) top 4721 journals, of which 335 are open access, to test Wikipedia content against. The open access list was taken from the listing in the directory of OA journals, ignoring the difference between ‘green’ and ‘gold’, and paywalled access from, say, Elsevier. Overall, this already does not bode well for extending the obtained conclusion to computer science entries and, hence, to the diffusion-of-knowledge claim.

The authors admit they may undercount references for the non-English entries, but those have few references anyway (Fig 1 in the arxiv paper), so it’s basically largely an English-Wikipedia analysis after all, i.e., the conclusion does not really straightforwardly extend to the ‘diffusion of knowledge’ for the non-English-speaking world.

The statistical model is described on p19 of the pdf, and I don’t quite follow the rationale, with an elusive set of ‘journal characteristics’ and some estimated variables without detail. Maybe some stats person can shed light on it.

Then there is the bubble figure in the technologyreview post, which is Fig 8 in the arxiv paper and reproduced in the screenshot below, which “shows that across 50 [non-English] Wikipedias, there is an inverse relationship between the effects of accessibility and status on referencing”. Come again? It’s not as if the regression line fits well. And why would the language entries—presumably independent of one another—be in a relation after all? Notwithstanding that, the odds for a Serbian entry to have a reference to an open access journal are some 275% higher than to a paywalled one, vs. entries in Turkish that cite higher-impact-factor journals some 200% more often, according to the arxiv paper. I haven’t found details of that data, though, other than a back-of-the-envelope calculation when glancing over the figure: Serbian has 1.5 for impact and 3.75 or so for open access; Turkish, 3 and 1.3-ish. Out of how many entries and how many citations for those languages? They state that “While the English Wikipedia references ~32,000 articles from top journals, the Slovak Wikipedia references only 108 and Volapuk references 0.” But Volapuk still ends up with an open access odds ratio of 0.588 and an ln(impact factor) of 2.330 (Appendix A3), which is counted over the set of top-rated journals only; how is that possible when there are no references to those top journals? The number of counted journal citations is not given for each language, so something ‘statistically significant’ may well actually be over a number that is too low to do statistics with. Waray-Waray is a very small dot, and reading from Fig 1, it probably has no more than those 108 references of the Slovak entries.

All in all, there is some room for improvement in this paper and, in any case, some toning down of the conclusions is warranted, let alone of technologyreview’s sensationalist blog title.


Fig 8 from Teplitskiy et al (2015)

Ontology (information science) Wikipedia entry, some issues

Let me not be petty, whining that none of my papers are in the references, but instead take a small sample of the myriad of issues.

Take the statement “There are studies on generalized techniques for merging ontologies,[12] but this area of research is still largely theoretical.” Uh? The reference is to an obscure ‘dynamic ontology repair’ project pdf from the University of Edinburgh, retrieved in 2012. We merged DMOP’s domain content with DOLCE in 2011, with tool support (Protégé, to be precise). owl:imports was around and working at that time as well. Not to mention the very large body of papers on ontology alignment, the reference book by Shvaiko & Euzenat, and the Ontology Alignment Evaluation Initiative.

The list of ontology languages even includes SBVR and IDEF5 (not ontology languages), and, for good measure of scruffiness, a project (TOVE).

The obscure “Gellish” appears literally everywhere: it is an ontology, it is a language, it is an English dictionary (yes, the latter apparently also falls under ‘examples’ of ontologies; not), and it is even the one and only instantiation of a “hybrid ontology” combining a domain and an upper ontology. Yeah, right. Looking it up, Gellish is van Renssen’s PhD thesis of 2005, which has an underwhelming 2 citations according to Google Scholar (10 for the related journal paper), and there’s a follow-up 2nd edition of 2014 by the same author, published with Lulu, with no citations. That does not belong in an introductory overview of ontologies in computing. Dublin Core as an example of an ontology? No (but it is a useful artefact for metadata annotations).

Under “criticisms”: aside from a Werner Ceusters statement from a commentary on someone, taken from his website—since when does that deserve to be upgraded to Wikipedia content?!?—there’s also “It’s also not clear how ontology fits with Schema on Read (NoSQL) databases.” Ontologies with NoSQL? Sigh.

“Further readings” would, I’d expect, have a fine set of core readings to get a more comprehensive overview of the field. While some relevant ones are there (e.g., the “What is an ontology?” paper by Oberle, Guarino, and Staab; “Ontology (Science)” by Smith; Gruber’s paper, despite its flawed definition), numerous ones are the result of some authors’ self-promotion, like the ones on bootstrapping biomedical ontologies, an ontology for user profiles, and IE for disease intelligence—they’re not even close to ‘staple food’ for ontologies—and the 2001 OIL paper and the Ontology Development 101 technical report are woefully outdated. The “References” section is a mishmash of webpages, slides, and a few scientific papers, most of which are not from mainstream ontology research venues.

And that’s just a sampling of the issues with the “Ontology (information science)” Wikipedia entry; the ontology engineering entry is worse. No wonder my students—having grown up treating Wikipedia as gospel—get confused.

 

Ontologies entries in other languages

That much about the English-language version of ‘ontology (information science)’. I happen to speak a few other languages as well, so I also checked most of those for their ‘ontology (information science)’ entry. For future reference, as a stock-taking of today’s contents, I’ve pdf-printed them all (zipped). For starters, they all had ontologies at least categorised properly under ‘informatica’ [computer science]. +1.

The entry in Dutch is very short; one can quibble and nit-pick about term usage, and it is disappointing that there’s only one reference (in Dutch, so it wouldn’t count in the arxiv analysis), but at least it’s not riddled with mistakes and inappropriate content.

The German one is quite elaborate, and starts off reasonably well, but has some mistakes. Among others, the typical novice pitfall of confusing classes with instances [“Stadt als Instanz des Begriffs topologisches Element der Klasse Punkte”, i.e., a city as an instance of the concept of a topological element of the class of points], and the sample ontology—which of itself is a good idea to add to an overview page—has lots of modelling issues, such as datatypes and the mixing of subclasses with properties (the Maler [painter] with region of origin Flämisch [Flemish]). Interestingly, the ontology types for the English reader are foundational, domain, and hybrid, whereas the German reader gets only lightweight and heavyweight ones. As for the references, there are some oddball ones, but the fair/good ones are in the majority, if incomplete, and perhaps a bit lopsided toward Barry Smith material.

The Italian entry is of similar length to the German one but, unfortunately, has some copy-and-paste from the English one when it comes to the list of languages and the examples—so, a propagation of issues; the ‘examples of applications’ does list another project, and there is no ‘criticisms’ section. The text has been written separately rather than being a translation of the English entry (idem ditto for the other entries, btw), and thus also contains some other information. For starters, removing most of the ‘Premesse’ [premises] would be helpful (or elaborating on it in a criticisms section; starting the topic with information warfare and terrorism? nah). The section after that (‘uso come glossario di base’, i.e., use as a basic glossary) is chaotic, reading as if written by a competing author per paragraph, and riddled with problematic statements, like that all computer programs are based on foundational ontologies (“Tutti i programmi per computer si basano su ontologie fondazionali,”) and that the scope of a computational ontology is to create a database (“Lo scopo di un’ontologia computazionale […] [è] di creare una base di dati”). It does mention OntoClean. Italian readers will also be treated to a brief discussion of the debate on one vs. multiple ontologies (absent from the other entries). It has a quite different set of ‘external links’ compared to the other entries, and there are hardly any references. All in all, one leaves with a quite distinct impression of ontologies after reading the Italian entry cf. the Dutch, German, and English ones.

Last, the Spanish entry is about as short as the Dutch one. There’s overlap in content with the Italian entry in the sense of near-literal translation (on the foundational ontology and that Murray-Rust guy on the ‘semantic and ontological war’ due to ‘competition between standards’), and it has a plug for MathWorld (?!).

So, if the entries on topics I’m an expert in are of such dubious quality (the German entry is, relatively, the best), then what does that imply for the other entries that superficially may seem potentially useful introductory overviews? By the same token, they probably are not. And the ontology topics are not even in an area with as much contention as topics in political sciences, history, etc. Go figure.

 

Now what?

Is this a bad thing? I can already see a response in the making along the lines of “well, it’s crowdsourced and everyone can contribute; we invite you to not just complain, but instead improve the entry; really, you’re welcome to do so”. Maybe I will. But first, two other questions have to be answered. The arxiv paper that got my rant started claimed that open access papers are good, and that they’re reworked into interested-layperson-digestible bites in Wikipedia to spread and diffuse knowledge in the world. The idea is nice, but the reality is different. Pretty much all the main papers on ontologies are freely available online even if not published ‘open access’ (computer science praxis, thank you), yet they are not the ones that appear in Wikipedia. Question 1: Why are those—freely available—main references on ontologies not referenced there already?

A concern of a different type is that several schools in South Africa have petitioned to get free Internet access to search Wikipedia as a source of information for their studies. Their main argument was that books don’t arrive, or arrive late, and there is no library in many schools, which is a common problem. They got zero-rated Wikipedia from MTN; more info here. (I’ll let you mull over its effects on the quality of the education they get from that.) Question 2: Can Wikipedia be made a really authoritative resource with the current set-up, so as to live up to what the learners [and interested laypersons] need? If I were to write an update to the Wikipedia pages today, a pesky editor or someone else could simply click to roll it back to the previous version, or slowly but steadily funny references would seep back in and sentences would get cut and rephrased. Writing free textbooks, or at least extensive lecture notes, seems a better option, or a ‘synthesis lectures’ booklet endorsed by lots of people researching and using ontologies. What about a ‘this version is endorsed by …’ button for Wikipedia entries?

Any better ideas, or answers to those questions, perhaps? Free diffusion of digested high quality scientific knowledge really does sound very appealing…

References

[1] Teplitskiy, M., Lu, G., Duede, E. Amplifying the Impact of Open Access: Wikipedia and the Diffusion of Science. arxiv.org/abs/1506.07608