Development of (bio-)ontologies takes up a lot of resources, especially when conducted manually. This is a well-known hurdle to overcome, and various strategies and tools for bottom-up ontology development have been proposed from a computing angle, such as the reverse engineering of databases and, most prominently in the bio-ontologies area, natural language processing (NLP) (e.g. [1,2] and a review by [3]). Both, however, generate a rather crude, noisy, and simple ontology that requires substantial manual intervention to clean up and to add ‘missing’ knowledge. Nevertheless, NLP provides at least a set of terms one can start with instead of starting with an empty screen and adding everything de novo. There is, however, a way to have your cake and eat it too: exploiting the plentiful diagrams in the life sciences.
Diagrams are very important in biology, and from early on in their education students are taught to read and draw them. There is even a rule of thumb that one should be able to understand an article by reading the abstract, conclusions, and diagrams alone. Diagrams also summarise the accompanying text, and can even convey more than what is explained in the text. That much from the biology side. They can be useful from the computing angle as well. They are at least semi-structured (compared to natural language), with conventions about depicting lipid bi-layers, DNA, sequences of interactions by means of arrows, and so forth, and over the years more and more drawing applications have been developed. The nice thing (still for computing) is that those tools have an ‘alphabet’ (a legend) with permissible icons and colours and rules for how they can be used in the diagrams. There are many diagrams that represent our understanding of biological reality.
Now, imagine that those diagrams can be transferred into an ontology in one fell swoop, and subsequently used for whatever purpose ontologies are being used (such as annotation, consistency checking, and finding implicit knowledge). And because those diagrams are more structured than natural language, we can obtain a richer ontology than with NLP alone—with less effort.
How?
Recognizing that there is much to be gained in improving bottom-up bio-ontology development by availing of such diagrams (already observed in [4]) is one thing; how to go about doing this in the most effective way, not for just one diagram tool but for any of them, is another. This is the problem I aim to tackle in the paper “Bottom-up ontology development reusing semi-structured life sciences diagrams” [5], which was recently accepted for the AFRICON’11 Special Session on Robotics and AI in Africa. This 6-page paper is a very condensed version of its 12-page draft, so not everything could be included. Nevertheless, it does give the basics of the method to formalize bio-diagrams in an ontology and a use case to demonstrate it.
The approach consists of a four-stage process: (i) choosing the appropriate language (OBO, SKOS, OWL, and arbitrary FOL are considered), (ii) inclusion of a foundational ontology (DOLCE, BFO, RO, etc.), (iii) formalizing the icons of the diagram tool’s ‘legend’ (e.g., ‘enzyme’), and (iv) devising an algorithm that mines the actual diagrams to populate the TBox, so that the individual components (e.g., ‘protease’) end up in the right position in the ontology. The main details are described in the paper.
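To make stages (iii) and (iv) a little more concrete, here is a minimal Python sketch. It is not the algorithm from the paper: the legend-to-category alignments, the category names (`MaterialEntity`, `Process`), and the helper names (`TBox`, `populate_tbox`) are all assumptions made up for illustration. It only shows the general idea that each legend icon becomes a class under some foundational-ontology category, and each component found in a diagram becomes a subclass of the class for its icon.

```python
from dataclasses import dataclass, field

# Stage (iii), hypothetical: each legend icon is aligned to a category of
# the chosen foundational ontology. These alignments are illustrative only
# and are not taken from the paper.
LEGEND_TO_FOUNDATIONAL = {
    "enzyme": "MaterialEntity",
    "small molecule": "MaterialEntity",
    "cell process": "Process",
}


@dataclass
class TBox:
    # Maps a class name to the set of its direct superclasses.
    subclass_of: dict = field(default_factory=dict)

    def add_subclass(self, sub, sup):
        self.subclass_of.setdefault(sub, set()).add(sup)


def populate_tbox(diagram_nodes, legend=LEGEND_TO_FOUNDATIONAL):
    """Stage (iv), sketched: place each diagram component under its icon's class.

    diagram_nodes is a list of (label, icon) pairs extracted from a diagram.
    Returns the populated TBox and the labels whose icon is not in the legend.
    """
    tbox = TBox()
    # First add the legend icons themselves under their foundational category.
    for icon, category in legend.items():
        tbox.add_subclass(icon, category)
    unplaced = []
    for label, icon in diagram_nodes:
        if icon in legend:
            tbox.add_subclass(label, icon)
        else:
            # Unknown icons are left for manual curation rather than guessed.
            unplaced.append(label)
    return tbox, unplaced


# Toy input: components as they might be read off a diagram.
nodes = [
    ("protease", "enzyme"),
    ("apoptosis", "cell process"),
    ("mystery", "unknown icon"),
]
tbox, unplaced = populate_tbox(nodes)
```

With the toy input, ‘protease’ ends up as a subclass of ‘enzyme’, which in turn sits under the assumed foundational category, and ‘mystery’ is reported as unplaced because its icon is not in the legend.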
Thus, this bottom-up method does not merely formalise ‘legacy’ information; it also takes into account subject domain semantics, which can be represented better by using a foundational ontology during the principal transformation of the diagram’s vocabulary. In addition to the more precise, formal representation of the subject domain semantics, the use of a foundational ontology also increases interoperability.
The guidelines are demonstrated with a transformation of the Pathway Studio [6] diagrams into an OWLized (OWL 2 DL) bio-ontology with BFO and RO.
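As a rough impression of what an OWLized result could look like, the following Python sketch prints axioms in OWL 2 functional-style syntax. The class names, the property name `participatesIn`, and the choice to render a diagram arrow as an existential restriction are all assumptions for illustration; they are not the actual mapping of Pathway Studio to BFO and RO used in the paper.

```python
def subclass_axiom(sub, sup):
    """Render 'sub is a subclass of sup' in OWL 2 functional-style syntax."""
    return f"SubClassOf(:{sub} :{sup})"


def existential_axiom(source, prop, target):
    """Render 'every source is related by prop to some target'."""
    return f"SubClassOf(:{source} ObjectSomeValuesFrom(:{prop} :{target}))"


def edges_to_axioms(edges):
    """Turn (source, property, target) diagram arrows into axioms.

    Mapping an arrow to an existential restriction is a modelling
    assumption of this sketch, not a rule stated in the paper.
    """
    return [existential_axiom(s, p, t) for s, p, t in edges]


# Hypothetical content: one taxonomic placement and one diagram arrow.
taxonomy = [subclass_axiom("Protease", "Enzyme")]
arrows = edges_to_axioms([("Protease", "participatesIn", "Proteolysis")])

for axiom in taxonomy + arrows:
    print(axiom)
```

A real transformation would also need ontology and prefix declarations and would use the IRIs of the chosen foundational and relation ontologies instead of these placeholder names.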
As an aside (from my perspective), it may be of interest to note that such formalized diagrams can then also be deployed as an intermediate representation of the knowledge, which can facilitate understanding and communication between logicians and domain experts. And, for the financially challenged: it can bring the information modelled in such diagrams, which is often locked in expensive hardcopy textbooks and pay-per-view scientific articles, into the open access domain for free use and reuse.
References
[1] Alexopoulou D, Wachter T, Pickersgill L, Eyre C, Schroeder M. Terminologies for text-mining: an experiment in the lipoprotein metabolism domain. BMC Bioinformatics 2008;9(Suppl 4).
[2] Coulet A, Shah NH, Garten Y, Musen M, Altman RB. Using text to build semantic networks for pharmacogenomics. Journal of Biomedical Informatics 2010;43(6):1009-19.
[3] Liu K, Hogan WR, Crowley RS. Natural language processing methods and systems for biomedical ontology learning. Journal of Biomedical Informatics 2011;44(1):163-79.
[4] Keet CM. Factors affecting ontology development in ecology. In: Ludaescher B, Raschid L, editors. Data Integration in the Life Sciences 2005 (DILS2005); vol. 3615 of LNBI. Springer Verlag; 2005, p. 46-62. San Diego, USA, 20-22 July 2005.
[5] Keet CM. Bottom-up ontology development reusing semi-structured life sciences diagrams. AFRICON’11 — Special Session on Robotics and Artificial Intelligence in Africa, Livingstone, Zambia 13-15 September, 2011. IEEE (to appear).
[6] Nikitin A, Egorov S, Daraselia N, Mazo I. Pathway studio—the analysis and navigation of molecular networks. Bioinformatics 2003;19(16):2155-2157.