Bottom-up ontology development using bio-diagrams

Development of (bio-)ontologies takes up a lot of resources, especially when conducted manually. This is a well-known hurdle, and various strategies and tools for bottom-up ontology development have been proposed from a computing angle, such as the reverse engineering of databases and, most prominently in the bio-ontologies area, natural language processing (NLP) (e.g., [1,2] and a review by [3]). Both, however, generate a rather crude, noisy, and simple ontology that requires substantial manual intervention to clean up and to add ‘missing’ knowledge. Nevertheless, NLP at least provides a set of terms one can start with, instead of starting with an empty screen and adding everything de novo. There is, however, a way to have your cake and eat it too: exploiting the plentiful diagrams in the life sciences.

Diagrams are very important in biology, and from early on in their education, students are taught to read and draw them. There is even a rule of thumb that one should be able to understand an article by reading the abstract, conclusions, and diagrams alone. Diagrams also summarise the accompanying text, or can even tell more than what is explained in the text. That much from the biology side. They can be useful from the computing angle as well. They are at least semi-structured (compared to natural language), with conventions for depicting lipid bi-layers, DNA, sequences of interactions by means of arrows, and so forth, and over the years more and more drawing applications have been developed. The nice thing (still for computing) is that those tools have an ‘alphabet’, i.e., a legend, of permissible icons and colours and rules for how they may be used in the diagrams. And there are many such diagrams that represent our understanding of biological reality.

Now, imagine that those diagrams can be transferred into an ontology in one fell swoop, and subsequently used for whatever purpose ontologies are being used (such as annotation, consistency checking, and finding implicit knowledge). And because those diagrams are more structured than natural language, we can obtain a richer ontology than with NLP alone—with less effort.

How?

One thing is recognizing that there is much to be gained in improving bottom-up bio-ontology development by availing of such diagrams (already observed in [4]); another thing is how to go about doing this in the most effective way, not for just one diagram tool, but for any of them. This is the problem I aim to tackle in the paper “Bottom-up ontology development reusing semi-structured life sciences diagrams” [5], which was recently accepted for the AFRICON’11 Special Session on Robotics and AI in Africa. This 6-page paper is a very condensed version of its 12-page draft, so not everything could be included. Nevertheless, it does give the basics of the method to formalize bio-diagrams in an ontology and a use case to demonstrate it.

The approach consists of a four-stage process: (i) choosing the appropriate language (OBO, SKOS, OWL, and arbitrary FOL are considered), (ii) inclusion of a foundational ontology (DOLCE, BFO, RO, etc.), (iii) formalizing the icons of the diagram tool’s ‘legend’ (e.g., ‘enzyme’), and (iv) devising an algorithm that mines the actual diagrams and populates the TBox, so that the individual components (e.g., ‘protease’) end up in the right position in the ontology. The main details are described in the paper.
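To give a flavour of stage (iv), here is a minimal sketch in Python of how a formalized legend can drive the population of the class hierarchy; the icon names, class names, and diagram elements are invented for illustration and are not Pathway Studio’s actual legend.

```python
# Stage (iii) output, sketched: formalized legend icons mapped to ontology
# classes. Icon and class names here are hypothetical.
LEGEND = {
    "enzyme": "Enzyme",
    "small_molecule": "SmallMolecule",
    "cell_process": "CellProcess",
}

def populate(diagram_elements, taxonomy):
    """Stage (iv), sketched: add each diagram component as a subclass of
    the class its icon maps to; unmapped icons are skipped (in practice,
    flagged for manual inspection)."""
    for name, icon in diagram_elements:
        parent = LEGEND.get(icon)
        if parent is None:
            continue
        taxonomy.setdefault(parent, []).append(name)
    return taxonomy

taxonomy = populate([("protease", "enzyme"), ("ATP", "small_molecule")], {})
# taxonomy == {"Enzyme": ["protease"], "SmallMolecule": ["ATP"]}
```

The real algorithm in the paper of course has to deal with the diagram tool’s file format and with positioning components under the foundational ontology’s categories, not just a flat icon-to-class lookup.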

Thus, this bottom-up method is not one of merely formalising ‘legacy’ information: it also takes into account subject domain semantics, which can be represented better by using a foundational ontology during the principal transformation of the diagram’s vocabulary. In addition to the more precise, formal representation of the subject domain semantics, the use of a foundational ontology also increases interoperability.

The guidelines are demonstrated with a transformation of the Pathway Studio [6] diagrams into an OWLized (OWL 2 DL) bio-ontology with BFO and RO.

As an aside (from my perspective), it may be of interest to note that such formalized diagrams can then also be deployed as an intermediate representation of the knowledge, which can facilitate understanding and communication between logicians and domain experts. And, for the financially challenged: it can bring the information modelled in such diagrams, which is often locked in expensive hardcopy textbooks and pay-per-view scientific articles, into the open access domain for free use and reuse.

References

[1] Alexopoulou D, Wachter T, Pickersgill L, Eyre C, Schroeder M. Terminologies for text-mining: an experiment in the lipoprotein metabolism domain. BMC Bioinformatics 2008;9(Suppl 4).

[2] Coulet A, Shah NH, Garten Y, Musen M, Altman RB. Using text to build semantic networks for pharmacogenomics. Journal of Biomedical Informatics 2010;43(6):1009-19.

[3] Liu K, Hogan WR, Crowley RS. Natural language processing methods and systems for biomedical ontology learning. Journal of Biomedical Informatics 2011;44(1):163-79.

[4] Keet CM. Factors affecting ontology development in ecology. In: Ludaescher B, Raschid L, editors. Data Integration in the Life Sciences 2005 (DILS2005); vol. 3615 of LNBI. Springer Verlag; 2005, p. 46-62. San Diego, USA, 20-22 July 2005.

[5] Keet CM. Bottom-up ontology development reusing semi-structured life sciences diagrams. AFRICON’11 — Special Session on Robotics and Artificial Intelligence in Africa, Livingstone, Zambia 13-15 September, 2011. IEEE (to appear).

[6] Nikitin A, Egorov S, Daraselia N, Mazo I. Pathway studio—the analysis and navigation of molecular networks. Bioinformatics 2003;19(16):2155-2157.

Questionable search terms for my blog

WordPress provides a range of blog statistics, including which search terms people used to arrive at my blog. Over the years, I have seen sensible, or at least explainable, search terms, and a bunch of funny or plain weird ones. The latter clearly demonstrate the limitations of string-based and statistical methods for web searches, and, to some extent, that Internet users could do with some training in how to search for information.

The top searches of the past 5 years, each used >100 or even >>100 times, are: ontology, keet, aardappeleters, parallel processing operating system, ontologies, and philosophy of computer science. Then there are often-recurring strings that are quite similar but count as different hits, mainly about women’s achievements, failing to recognize one’s incompetence, granularity, and [computer science/ontology] with [medicine/ecology/philosophy/biology]. This is understandable given the topics I have blogged about.

More interesting from a computing perspective are those that are sort of, or even plain, wrong—and the reasons why. The remainder of the post is devoted to a selection of the more curious ones that I collected intermittently over the past 2 months (in italics), with comments added to several of them (in plain text). They are divided into “search engines are not oracles”, “what were they thinking?”, “curious”, “plain wrong”, and “miscellanea”.

Search engines are not oracles!

  • should i be a scientist or an engineer
  • what should be done with the outcome of assessment and how to use the outcome of assessments. The announcement of my ESWC2011 paper comes up, but is unlikely to give the user the answer they were looking for (there aren’t that many people interested in experiments with foundational ontologies).
  • how useful is philosophy in computer science. This post on what philosophers say about computer science turned up when I searched for it, but it does not deal with the usefulness, let alone the degree of usefulness, of philosophy in computer science. The next search string is a bit more sensible, both in general and with respect to the blog post’s content:
  • is computer science a science by different philosophers
  • reasons for wildlife ontology development.  There are posts about the African Wildlife tutorial ontology and the IJMSO paper that has a list of reasons for developing an ontology, but they have not been put together to give you reasons for developing a wildlife ontology.
  • ecology lessons good? The post on ontologies for ecology turns up, not in the least bit answering the question—those authors learned valuable lessons using ontologies in ecology research.
  • do i read too much? and can you read too much. This post is on the first page of results, where I explore whether one can read ‘too much’; it is only slightly more skewed toward ‘answering’ the second search string than the first.

What were they thinking?

  • writers who do not read
  • too much work blog
  • undergraduate computer science research least publishable unit.  Since when do undergrads care about LPUs?
  • useful typology. The typology of bureaucracies turns up in my Google search results; whether it is a useful one remains to be seen.
  • random structure of website. My blog was not on the first 5 pages of Google when I searched (but it is by now known that Google customizes the search results).
  • response to the dirty war. Which dirty war would that be? There are three posts on the response to the dirty war *index* that I have my opinion about (here, here, and here).
  • computational food. Perhaps the user was thinking about computation with data about food? The only one that might fit, sort of, is the post about culinary evolution. There are interesting hits on the first Google page, though, such as about computational models of microwave food processing and computational food engineering.
  • notify me if someone searches for me on google

Curious search terms, but somewhat understandable

  • non violent essay. An essay itself is never violent; there’s a post on the non-violent personality though.
  • incompetence blog. Uhmm… I fancy thinking this is not a blog about incompetence. There is a post about the Dunning-Kruger effect (on being incompetent and unaware of it), though.
  • incompetence not realize
  • methontology ping pong. Googling for it, this post comes up, though it is unlikely to have served the user, because it covers realism-based ontologies and methodologies (such as METHONTOLOGY) and has a blog comment lamenting the “self contained ping pong matches among academics”.

Plain wrong hits

  • anatomical structure of an owl. This is a nice example of the limitations of string-based and statistical approaches compared to semantic searches.
  • salami techniques in information system. I googled it again, and my blog does not appear on the first 5 pages, and there is no post even remotely close to the search term.
  • slinging techniques. It is not on the first 5 pages of Google when I searched, and there is nothing about slinging techniques on any of the blog posts.
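To illustrate why purely string-based retrieval produces the first of these hits, consider this toy Python example (the document snippets and identifiers are made up): a keyword matcher cannot tell the bird from the Web Ontology Language, because both reduce to the same token ‘owl’.

```python
# Two made-up documents: one about OWL the language, one about owl the bird.
docs = {
    "owl2-dl-post": "the OWL 2 DL ontology language and its reasoners",
    "owl-anatomy": "the anatomical structure of the barn owl skeleton",
}

def keyword_hits(query, docs):
    """Return every document sharing at least one token with the query;
    no word senses, just string overlap."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())]

print(keyword_hits("anatomical structure of an owl", docs))
# both documents are returned, although only the second is relevant
```

A semantic search would distinguish the two senses (e.g., via an ontology of birds versus one of knowledge representation languages) and return only the anatomy document.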

Miscellanea

  • ponder ontology. It appears that ponder is an object-oriented language to describe policies; I write about ontologies and do ponder about things, but have not put them together.
  • granular book. I did announce a book about granular computing, but not about books that may be granular.
  • ontologies funny photos. Are there funny photos of/about ontologies?
  • 4. dimension

The problem with listing these odd ones is that the search algorithms will not change in the very near future, and thus, due to this post, even more people will be misdirected to my blog. But perhaps this manually assessed list of odd search terms might, some time, help in improving the algorithms and the summarization of the content the links point to.

A few notes on ESWC2011 in Heraklion

It’s the end of an interesting and enjoyable ESWC’11 conference in Heraklion, Crete. Compared to other conferences, there were many keynote speeches (not all of them that much about the Semantic Web, but interesting nevertheless), and, as usual, there were parallel sessions with (unfortunately) many co-scheduled presentations I would have liked to attend. Here follow a few notes on them (which I might update once I have travelled back to SA, as this is written rather hastily before departure).

Keynotes

Jim Hendler’s talk was entitled “Why the Semantic Web will never work”—with the quotation marks. Quite a few people have uttered that sentence, but, in Hendler’s review of the past 10 years, we have actually achieved more in some areas than initially anticipated and more than pessimists thought was feasible. For instance, “the semantic web will never scale”: it does, according to Hendler, as demonstrated, e.g., by participants in the billion triple challenge and the growing LOD data cloud. Or “folksonomies will win” (as opposed to, at least, structured vocabularies): wrong again, mainly because they do not achieve their goal without “social context” and they lack the crucial aspect of links between entities. However, these achievements are principally in the bottom part of the Semantic Web layer cake, and Hendler claims that the “ontology story is still confused”, although OWL is to a large degree “succeeding as a KR standard”. Key challenges for Hendler include: relating linked data to ontologies, the equivalent of a database calculus for linked data, and the need to provide a means for evaluating reasoning with incomplete and possibly inconsistent data. UPDATE (13-6): Hendler’s slides are on slideshare.

Lars Backstrom, data scientist at Facebook, gave a keynote about analyzing FB data and working toward ranking and filtering news feeds by turning it into a classification problem using a set of properties (localization, relation to the actor, and others). Interestingly, Backstrom emphasized that FB is moving toward more structured data, which makes it easier to manage and analyse with the algorithms they are developing. Whether that is a good thing is a separate discussion, especially regarding privacy issues, which is what Abe Hsuan’s talk was about (clearly, this does not hold only for FB but for the web in general). According to Hsuan, “Privacy cannot exist on a lawless Semantic Web”. It made for several after-talk discussions among the attendees, and the last word on how to deal with all this has not been said yet. In this context, one may want to have a look at episode 3 of The virtual revolution documentary about non-free services on the Web, the TED talk on The filter bubble, or the less recent Database nation book.
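As a very rough sketch of what ‘turning it into a classification problem’ can look like, here is a toy linear scorer over a few binary story features in Python; the feature names and weights are entirely invented and bear no relation to Facebook’s actual model.

```python
# Hypothetical per-feature weights; a real system would learn these from
# click/engagement data rather than hard-code them.
WEIGHTS = {"same_city": 1.5, "close_friend": 2.0, "has_photo": 0.5}

def score(story_features):
    """Linear score over binary features; higher means more feed-worthy."""
    return sum(WEIGHTS[f] for f in story_features if f in WEIGHTS)

def rank_feed(stories):
    """stories: list of (story_id, feature set); return ids best-first."""
    return [sid for sid, feats in
            sorted(stories, key=lambda s: score(s[1]), reverse=True)]

feed = rank_feed([("a", {"has_photo"}), ("b", {"close_friend", "same_city"})])
# feed == ["b", "a"]: the close friend in the same city outranks the photo
```

Filtering then amounts to thresholding the score: stories below a cut-off simply never appear in the feed.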

Andraz Tori, CTO of Zemanta, gave a keynote describing some background of the ‘writing help’ that WordPress has offered recently, and how they try to avoid wrong usage of it and to clean up the data. As you may have guessed, I have not used that feature yet when writing my blog posts (and do not see the need for it from my perspective). Prasad Kantamneni from Yahoo! gave an interactive keynote on HCI applied to the effects of different web interfaces for their search engines, and the consequences for revenue, which was lively and interesting. Seemingly ‘silly little things’, like putting the keyword in boldface in the search results, make a big difference in how a user scans through the results (more efficiently); likewise auto-completion, which in the end makes you read more of the results page.

Last, but most certainly not least, Chris Welty gave the conference dinner keynote, which was entertaining. He described some hurdles they had to overcome in building ‘Watson’, a sophisticated question answering engine that finds answers to trivia/general knowledge questions for the Jeopardy! game and that, in the end, did consistently outperform the national human experts at it. The talk was filled with entertaining mistakes they encountered during the development of Watson, and what it took to fix them. The key message was that one cannot go in a linear fashion from natural language to knowledge management; one has to use an integration of various technologies to make a successful ‘intelligent’ tool.

Sessions and other things

Normally I would have a dense section here on the papers presented in the sessions, but due to the very busy conference schedule and the shortage of freely available papers online before the conference, I did not get around to reading all the papers that I would have liked (and I don’t cite papers I have not read, still roughly following my approach to conference blogging). The one on removing redundancy in ontologies presented by Jens Wissmann [1] was quite interesting, in particular for its creative reuse of the computation of justifications to remove ‘redundant’ axioms, i.e., those that can be derived from other knowledge represented in the ontology anyway. This was computationally costly, so they also developed another algorithm with better performance; details and experimental results can be found in the paper. My own paper [2] on the experiment on the use of foundational ontologies in ontology engineering was well received, and generated quite some interest, such as on the quality of the foundational ontologies themselves and how the results presented could translate to particular domain ontology scenarios. I may add something on epistemic queries, computing generalizations, matching 4K ontologies in one year, and cross-lingual ontology mappings (provided I find the time to do so in the upcoming days).
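To make the idea of redundancy elimination concrete, here is a much-simplified sketch in Python: an axiom is ‘redundant’ if it is entailed by the remaining axioms. This toy version handles only atomic subclass axioms and transitivity, a far cry from the DL entailments and justifications handled in [1], and the class names are made up.

```python
def entails(axioms, goal):
    """Do the (sub, sup) subclass axioms entail the goal axiom via
    transitivity? Simple reachability search over the subclass graph."""
    sub, sup = goal
    frontier, seen = [sub], set()
    while frontier:
        c = frontier.pop()
        if c == sup:
            return True
        if c in seen:
            continue
        seen.add(c)
        frontier += [b for a, b in axioms if a == c]
    return False

def remove_redundant(axioms):
    """Drop each axiom that is derivable from the others."""
    kept = list(axioms)
    for ax in list(axioms):
        rest = [a for a in kept if a != ax]
        if entails(rest, ax):
            kept = rest
    return kept

# Cat ⊑ Mammal and Mammal ⊑ Animal make Cat ⊑ Animal redundant:
axioms = [("Cat", "Mammal"), ("Mammal", "Animal"), ("Cat", "Animal")]
print(remove_redundant(axioms))  # [('Cat', 'Mammal'), ('Mammal', 'Animal')]
```

The expensive part in the real setting is that checking derivability means invoking a DL reasoner per candidate axiom, which is where their faster second algorithm comes in.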

The panel session about e- and open- Government was a bit meager and can be summarized as: Linked Open Data (LOD) is good and catching on well but the integration problems still exist, and we need (at least) structured controlled vocabularies to fix it.

I will close with an announcement that Alexander Garcia-Castro brought to my attention: there will be an “Ontologies come of Age in the Semantic Web” workshop co-located with ISWC’11.

References

[1] Stephan Grimm and Jens Wissmann. Elimination of redundancy in ontologies. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 260-274.

[2] Keet, C.M. The use of foundational ontologies in ontology development: an empirical assessment. In: Proceedings of the 8th Extended Semantic Web Conference (ESWC’11). Heraklion, Crete, Greece, 29 May – 2 June 2011. Springer LNCS 6643, 321-335.