The DiDOn method to develop bio-ontologies from semi-structured life science diagrams

It is well-known among (bio-)ontology developers that ontology development is a resource-consuming task (see [1] for data backing up this claim). Several approaches and tools exist that speed up the time-consuming effort of bottom-up ontology development, most notably natural language processing and database reverse engineering. They are generic and the technologies have been proposed from a computing angle, and are therefore noisy and/or contain many heuristics to make them fit for bio-ontology development. Yet, the most obvious source from a domain expert perspective is unexplored: the abundant diagrams in the sciences that function as existing/’legacy’ knowledge representation of the subject domain. So, how can one use them to develop domain ontologies?

The new DiDOn procedure—from Diagram to Domain Ontology—can speed up and simplify bio-ontology development by exploiting the knowledge represented in such semi-structured bio-diagrams. It does this by means of extracting explicit and implicit knowledge, preserving most of the subject domain semantics, and making formalisation decisions explicit, so that the process is done in a clear, traceable, and reproducible way.

DiDOn is a detailed, micro-level procedure to formalise those diagrams in a logic of choice; it provides migration paths into OBO, SKOS, OWL and some arbitrary FOL, and guidelines on which axioms have to be added to the bio-ontology, and how. It also uses a foundational ontology so as to obtain more precise and interoperable subject domain semantics than would otherwise have been possible with syntactic transformations alone. (Choosing an appropriate foundational ontology is a separate topic and can be done with, e.g., ONSET.)

The paper describing the rationale and details, Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn [2], has just been accepted at the Journal of Biomedical Informatics. They require a graphical abstract, so here it goes:

DiDOn consists of two principal steps: (1) formalising the ‘icon vocabulary’ of a bio-drawing tool, which then functions as a seed ontology, and (2) populating the seed ontology by processing the actual diagrams. The algorithm in the second step is informed by the formalisation decisions taken in the first step. Such decisions include, among others, the representation language and how to represent the diagram’s n-aries (with n≥2, such as choosing between n-aries as relationship or reified as classes).
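
To make the two steps a bit more tangible, here is a minimal Python sketch. It is not the algorithm from the paper: the icon names, the `SEED` mapping, the `populate` function, the `reify_naries` flag, the ad hoc axiom notation, and the tiny example diagram are all made up for illustration. It only shows how a decision taken in step 1 (here, whether n-aries are reified as classes or kept as relationships) steers the processing in step 2.

```python
# Illustrative sketch only; all names are hypothetical, not DiDOn's actual algorithm.

# Step 1: formalise the tool's icon vocabulary as a seed ontology.
# Each icon type maps to a class name and its parent in the seed taxonomy.
SEED = {
    "protein_icon": ("Protein", "MaterialEntity"),
    "small_molecule_icon": ("SmallMolecule", "MaterialEntity"),
    "binding_arrow": ("Binding", "Process"),
}


def populate(diagram, seed, reify_naries=True):
    """Step 2: turn diagram elements into axioms, guided by step-1 decisions.

    `diagram` is a dict with 'nodes' (label, icon) and 'links' (icon, labels).
    Returns a sorted list of axiom strings in an ad hoc notation.
    """
    axioms = set()
    # The seed taxonomy itself is part of the resulting ontology.
    for cls, parent in seed.values():
        axioms.add(f"{cls} SubClassOf {parent}")
    # Every labelled node becomes a subclass of the class of its icon.
    for label, icon in diagram["nodes"]:
        axioms.add(f"{label} SubClassOf {seed[icon][0]}")
    for icon, participants in diagram["links"]:
        rel = seed[icon][0]
        if reify_naries:
            # n-ary reified as a class, with one participant axiom per argument
            name = rel + "_" + "_".join(participants)
            axioms.add(f"{name} SubClassOf {rel}")
            for p in participants:
                axioms.add(f"{name} SubClassOf hasParticipant some {p}")
        else:
            # n-ary kept as a relationship over its arguments
            axioms.add(f"{rel}({', '.join(participants)})")
    return sorted(axioms)


# A made-up diagram with two nodes and one link between them.
diagram = {
    "nodes": [("Insulin", "protein_icon"), ("InsulinReceptor", "protein_icon")],
    "links": [("binding_arrow", ["Insulin", "InsulinReceptor"])],
}
reified = populate(diagram, SEED, reify_naries=True)
relational = populate(diagram, SEED, reify_naries=False)
```

Running both variants on the same diagram shows that the choice made once, up front, changes every axiom generated for a link, which is why such decisions are better recorded explicitly than left implicit in a script.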

In addition to the presentation of DiDOn, the paper contains a detailed application of it with Pathway Studio as a case study.

The neatly formatted paper is behind a paywall for those with no or limited access to Elsevier’s journals, but the accepted manuscript is openly accessible from my home page.


[1] Simperl, E., Mochol, M., Bürger, T. Achieving maturity: the state of practice in ontology engineering in 2009. International Journal of Computer Science and Applications, 2010, 7(1):45-65.

[2] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics. In print. DOI:

Progress on the EnvO at the Dagstuhl workshop

Over the course of the 4.5 days spent together in the beautiful and pleasant ambience of Schloss Dagstuhl, the fourth Environment Ontology workshop has been productive, and a properly referenceable paper outlining details and decisions will follow. Here I will limit myself to mentioning some of the outcomes and issues that came up.

Group photo of most of the participants at the EnvO Workshop at Dagstuhl

After presentations by all attendees, a long list of discussion themes was drawn up, which we managed to discuss and agree upon to a large extent. The preliminary notes and keywords are jotted down and put on the EnvO wiki dedicated to the workshop.

Focussing first on the content topics, which took up the lion’s share of the workshop’s time, significant advances have been made in two main areas. First, we have sorted out the Food branch in the ontology, which has been moved as Food product under Environmental material and then Anthropogenic environmental material, and the kind and order of differentia have been settled, using food source and processing method as the major axes. Second, the Biome branch will be refined in two directions, regarding (i) the ecosystems at different scales and the removal of the species-centred notion of habitat to reflect better the notion of environment and (ii) work toward inclusion of the aspect of n-dimensional hypervolume of an environment (both the conditions / parameters / variables and the characterization of a particular type of environment using such conditions, analogous to the hypervolumes of an ecological niche so that EnvO can be used better for annotation and analysis of environmental data). Other content-related topics concerned GPS coordinates, hydrographic features, and the commitment to BFO and the RO for top-level categories and relations. You can browse through the preliminary changes in the envo-edit version of the ontology, which is a working version that changes daily (i.e., not an officially released one).

There was some discussion (insufficient, I think) and recurring comments and suggestions on how to represent the knowledge in the ontology and, with that, the ontology language and modelling guidelines. Some favour bare single-inheritance trees for appealing philosophical motivations. The first problematic case, however, was brought forward by David Mark, who had compelling arguments for multiple inheritance with his example of how to represent Wadi, and soon more followed with terms such as Smoked sausage (having as parents the source and processing method) and many more in the food branch. Some others preferred lattices or a more common knowledge representation language; both are ways to handle more neatly the properties/qualities with respect to the usage of properties and the inheritance of properties by sub-universals from their parent. Currently, the EnvO is represented in OBO and modelling the knowledge does not follow the KR approach of declaring properties of some universal (/concept/class) and availing of property inheritance, so that one ends up having to make multiple trees and then adding ‘cross-products’ between the trees. Hence, and using intuitive labels merely for human readability here, Smoked sausage either will have two parents, amounting to (in the end, where the branching started) \forall x (SmokedSausage(x) \equiv AnimalFoodProduct(x) \land ProcessingMethod(x)) (which is ontologically incorrect because a smoked sausage is not a way of processing) or, if done with a ‘cross-product’ and a new relation (hasQuality), then the resulting computation will have something like \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land hasQuality(x,y) \land Smoking(y)) instead of having declared directly in the ontology proper, say, \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land HasProcessingMethod(x,y) \land Smoking(y)).
The latter option has the advantages that it makes it easier to add, say, Fermented smoked sausage or Cooked smoked sausage as a sausage that has the two properties of being [fermented/cooked] and being smoked, and that one can avail of automated reasoners to classify the taxonomy. Either way, the details are being worked on. The ontology language and the choice for one or the other, whichever it may be, ought not to get in the way of developing an ontology, but, generally, it does so, both regarding the underlying commitment that the language adheres to and any implicit or explicit workaround in the modelling stage that to some extent makes up for a language’s limitations.
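
The point about availing of automated reasoners can be illustrated with a toy Python sketch. It is a deliberate simplification, not how a description logic reasoner works and not how EnvO is built: each defined class is reduced to a genus plus a set of (property, filler) pairs, and subsumption is approximated by set inclusion. The class names, the `hasProcessingMethod` property, and the `subsumes` and `classify` helpers are hypothetical labels used only for this example.

```python
# Toy illustration; class and property names are hypothetical labels.
# A defined class is a genus plus a set of (property, filler) restrictions.
DEFINITIONS = {
    "SmokedSausage": ("Sausage", {("hasProcessingMethod", "Smoking")}),
    "FermentedSmokedSausage": (
        "Sausage",
        {("hasProcessingMethod", "Smoking"), ("hasProcessingMethod", "Fermenting")},
    ),
    "CookedSmokedSausage": (
        "Sausage",
        {("hasProcessingMethod", "Smoking"), ("hasProcessingMethod", "Cooking")},
    ),
}


def subsumes(general, specific, defs=DEFINITIONS):
    """True if, under this simplification, every `specific` is also a `general`.

    Simplification: same genus, and the specific class carries at least
    all restrictions of the general one.
    """
    g_genus, g_props = defs[general]
    s_genus, s_props = defs[specific]
    return g_genus == s_genus and g_props <= s_props


def classify(defs=DEFINITIONS):
    """Compute the inferred parents among the defined classes."""
    return {
        c: sorted(p for p in defs if p != c and subsumes(p, c, defs))
        for c in defs
    }


inferred = classify()
```

Here `inferred` places both Fermented smoked sausage and Cooked smoked sausage under Smoked sausage without anyone asserting those links by hand; adding a new processing method means adding one definition rather than editing several trees.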

On a lighter note, we had an excursion to Trier together with the cognitive robotics people (from a parallel seminar at Dagstuhl) on Wednesday afternoon. Starting from the UNESCO’s world heritage monument Porta Nigra and the nearby birthplace of Karl Marx, we had a guided tour through the city centre with its mixture of architectural styles and rich history, which was even more pleasant with the spring-like weather. Afterwards, we went to relax at the wine tasting event at a nearby winery, where the owners provided information about the 6 different Rieslings we tried.

Extension to the Aula Palatina (Constantine's Basilica) in Trier

Section of the Porta Nigra, Trier

Ontologies in ecology: putting the lessons-learned to good use and moving forward

While most of the headlines and attention in bio-ontologies have gone to the Gene Ontology, later also the FMA, and, most recently, the set of ontologies within or close to the OBO Foundry project, activity has been comparatively modest in the area of ontologies for ecology. This is set to change.

Madin et al. [1] published a review article last month in Trends in Ecology and Evolution that is not only about the state of the art of existing ontologies for ecology, but also an ode to the development and use of ontologies. The latter is not framed in a bright-vision-follow-me way, but notes, among others, the problems of

terminological ambiguity [that] slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science

and showing problems and clear examples of what kind of problems ontologies can help to solve.

Recollecting the OWLED’07 industry panel discussion last year, it seemed as if industry was at the point where bio-ontologies were 5-8 years ago and, moreover, about to reinvent the wheel. Not so with ontologies for ecology. Madin et al. have separate information boxes about “building consistent ontologies” explaining the difference between is-a and instance-of, is-a and part-of, and is-a and constitution: those things that early adopters learned the hard way a few years ago are presented as a known basic starting point. Likewise for the info-box on “What is an ontology?” and the straight adoption of OWL and the benefits of automated reasoners. In the overview presented by Madin et al., there are no issues to resolve on trying to be backward compatible with the OBO format; they go straight to the W3C-standardized formal ontology representation languages for the ontologies for ecology. The same holds for box 2 on finding data (which is also a nice scenario for the OBDA Plugin and DIG-Mastro), OntoClean, foundational ontologies and domain ontologies versus other artifacts with terms, linking of ontologies, and a clear table with task-description-requirements (table 1) that invariably asks for good ontologies.

Aside from the analysis of benefits and usages, the concluding remarks section notes that

[t]hus, the adoption of ontologies is hindered both by the familiarity of current practices and the lack of tools to readily migrate to improved practices.

Point taken.

And last, but not least,

Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology, and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms. By clarifying the terms of scientific discourse, and annotating data and analyses with those terms, well defined, community-sanctioned, formal ontologies based on open standards will provide a much-needed foundation upon which to tackle crucial ecological research while taking full advantage of the growing repositories of data on the Internet.

[1] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168. doi:10.1016/j.tree.2007.11.007

A new plant family: the Simulacraceae

May I recommend for the Friday afternoon/weekend reading: an article by Bletter, Reynertson, and Velazquez Runk in the journal Ethnobotany Research & Applications (vol. 5, 2007) on “The taxonomy, ecology, and ethnobotany of the Simulacraceae”, which has about 80 species divided in 17 genera, such as Plasticus, Textileria, and Papyroidia. Moreover,

This family is more than a botanical curiosity. It is a scientific conundrum, as the taxa:

  1. lack genetic material,
  2. appear virtually immortal and
  3. have the ability to form intergeneric crosses with ease, despite the lack of any evident mechanism for cross-fertilization.

In this study, conducted over approximately six years, we elucidate the first full description and review of this fascinating taxon, heretofore named Simulacraceae.

To summarize, also in the words of the authors,

The economics, distribution, ecology, taxonomy, paleoethnobotany, and phakochemistry of this widespread family are herein presented. We have recently made great strides in circumscribing this group, and collections indicate this cosmopolitan family has a varied ecology. … Despite being genomically challenged plants, an initial phylogeny is proposed. In an early attempt to determine the ecological relations of this family, a twenty-meter transect has been inventoried from a Plasticus rain forest in Nyack, New York, yielding 49 new species and the first species-area curve for this family.

The Simulacraceae collections—based on the principal method of “opportunistic sampling”—are deposited in the herbarium of the Foundation for Artificial Knowledge Education. Some of the open problems yet to investigate include simulacrapaleoethnobotany and simulacrapolitical ecology, and from an engineering perspective, the design of a Traditional Simulacraceae Knowledge/Teleological Simulation Knowledge base (dubbed acronym “TSK,TSK”, which would compete well with the yearly naming game for the NAR January database issue).

A short html version of the article is available online in the Jan/Feb issue of AIR, but also the full pdf file (about 6MB) in the Uni of Hawaii database with more information and colourful photos (openly accessible, of course). Enjoy!

On the (un)reasonable effectiveness of mathematics in ecology

An article appeared last week in Ecological Modelling that intends to be thought-provoking; it looks at the effectiveness of mathematics in ecological theory [1], but it can just as well be applied to bioinformatics, computational biology, and bio-ontologies. In short: mathematical models are useful only if they are neither so general as to be trivially true nor so specific as to be applicable to one data set only. But how to go about finding the middle way? Ginzburg et al. fail to clearly answer this question, but there are some pointers worth mentioning. In the words of the authors (bold face my emphasis):

A good theory is focused without being blurred by extraneous detail or overgenerality. Yet ecological theories frequently fail to achieve this desirable middle ground. Here, we review the reasons for the mismatch between what theorists seek to achieve and what they actually accomplish. In doing so, we argue on pragmatic grounds against mathematical literalism as an appropriate constraint to mathematical constructions: such literalism would allow mathematics to constrain biology when the biology ought to be constraining mathematics. We also suggest a method for differentiating theories with the potential to be “unreasonably effective” from those that are simply overgeneral. Simple axiomatic assumptions about an ecological system should lead to theoretical predictions that can then be compared with existing data. If the theory is so general that data cannot be used to test it, the theory must be made more specific.

What then about this pragmatism and mathematical literalism? The pragmatism boils down to “theories never work perfectly” anyway and, well, reality is overtaking us given that “we face an ever-increasing number of ecological crises, social demand will be for crude, imperfect descriptions of ecological phenomena now rather than more detailed, complex understanding later” (as an aside, and to rub it in: the latter is a different argumentation for pragmatism than the ‘I need a program from you today in order to analyse my lab data so that I can submit the article tomorrow and beat the competition’). The former I concur with; the latter, on preferring imperfection over more thought-through theories, is a judgment call and I leave that for what it is.
Mathematical literalism roughly means strict adherence to some limited mathematical model for its mathematical characteristics and limitations. For instance, in several ecological models (and elsewhere) processes are interpreted as strictly instantaneous (the “mechanistic” models), whereas those models that do not are mocked as “phenomenological”. But, so the authors argue, we should not fit nature to match the maths, but use mathematics to describe nature. Now this likely does ring a bell or two with developers of formal (logic-based) bio-ontologies: describe your bio stuff with the constructs that OWL gives you! The question is not, but probably should be: which formal language (i.e., which constructs) do I actually need to describe my subject domain? (Some follow-up questions on the latter are: if you can’t represent it, what exactly is it that you can’t represent? Do you really need it for the task at hand? Can you represent it in another [logical/conceptual modeling] language?)

It is not this black-and-white, however. As Ginzburg et al. mention a couple of times in the article (kicking in an open door), trying to make a mathematical model of the biological theory greatly helps to be more precise about the underlying assumptions and to make those explicit. This, in turn, aids making predictions based on those assumptions and theory, which subsequently should be tested against real data; if you can’t test it against data, then the theory is no good. This is a bit harsh because it may be that for some practical reasons something cannot be tested, but on the other hand, if that is the case, one may want to think twice about the usefulness of the theory.
Last, “The most useful theories emphasize explanation over description and incorporate a “limit myth” (i.e., they describe a pure situation without extraneous factors, as with the assumption in physics that surfaces are frictionless).” While it is true that one seeks explanations, this conveniently brushes over the fact that first one has to have a way to describe things in order to incorporate them in an explanatory theory! If the theory fails, then thanks to a structured approach for the descriptions (say, some formal language or [annotated] mathematical equations) it will be easier to fiddle with the theory and reuse parts of it to come up with a new one. If the theory succeeds, it will be easier to link it up to another properly described and annotated theory to make more complex explanatory models.

Overall, the content of the article is a bit premature and would have benefited from a thorough analysis of the too-general and too-specific theories that goes beyond anecdotal evidence with a couple of examples. Also, the “method for differentiating theories” advertised in the abstract is buried somewhere in the text, so some sort of bullet-pointed checklist for assessing one’s own pet theory on being too general or too specific would have been useful. Despite this, it is good material to feed a confirmation bias for being against too much and too strict adherence to mathematics… as well as against no mathematics.

[1] Lev R. Ginzburg, Christopher X.J. Jensen and Jeffrey V. Yule. (2007). Aiming the “unreasonable effectiveness of mathematics” at ecological theory. Ecological Modelling, 207(2-4): 356-362. doi:10.1016/j.ecolmodel.2007.05.015

Granularity and no emergence in biology

This time a post that bears some distant relation to my thesis topic: granularity. About 1.5 years ago I got concerned that emergence, emergent properties, and emergent behaviour would complicate developing a formal theory of granularity, so I read up on the topic. While writing the overview and analyzing both the philosophical aspects and proposed examples of emergence in biology, I came to the realization that it doesn’t complicate granularity; on the contrary, granularity actually serves as a useful methodology to investigate (hypothesized) emergence, in particular because of the modeling advantages and prospects for structured in silico simulations.

This is very nice for my granularity, but it takes 20-odd pages to support a useful application area of granularity that is not the focus area of applications (wandering off too far from the narrative), which takes up too much space in the thesis. So, I’m phasing it out. The problem is that I don’t know of any outlet where a cocktail of bio, IT, and philosophy would be publishable, because specialists of each discipline wouldn’t be too happy reading too much about the other two fields and may dismiss it for not being detailed enough for their own field, even though the idea of combining granularity & (hypothesized) emergence may have some novelty to it. Interdisciplinarity has its drawbacks.

Things being as they are, I’m putting the pdf online after the printout has been gathering dust for some 1.5 years, for there might just be an interested reader out there. Comments are welcome of course!

Topics covered in the manuscript are:
1 Introduction
2 Renewed claims of emergence in biology
3 Emergence from a philosophical perspective
3.1 Epistemological emergence
3.2 Ontological emergence
3.3 Strong emergence
3.4 Weak emergence
3.4.1 Simulations
3.5 Examples
3.5.1 Example 1: pseudoplasmodium formation by cellular slime moulds
3.5.2 Example 2: horizontal gene transfer with metagenomics
4 Emergence and levels of granularity
4.1 Preliminaries of granularity
4.2 The irreducibility argument
4.3 Non-predictability and non-derivability
4.4 Characterisation of granular level from the viewpoint of emergence
5 Concluding remarks

The abstract of “Granularity as a modelling approach to investigate hypothesized emergence in biology” is as follows.

Abstract. Informal usage of emergence in biological discourse tends towards being of the epistemic type, but not ontological emergence, primarily due to our lack of knowledge about nature and limitations to how to model it. Philosophy adds clarification to better characterise the fuzzy notion of emergence in biology, but paradoxically it is the methodology of conducting scientific experiments that can give decisive answers. A renewed interest in whole-ism in (molecular) biology and simulations of complex systems does not imply emergent properties exist, but illustrates the realisation that things are more difficult and complex than initially anticipated. Usage of (weak- and epistemological) emergence in bioscience is a shorthand for ‘we have a gap in our knowledge about the precise relation(s) between the whole and its parts and possibly missing something about the parts themselves as well’, which amounts to absence of emergence in the philosophical sense. Given that the existence of emergent properties is not undisputed, we need better methodologies to investigate such claims. Granularity serves as one of these approaches to investigate postulated emergent properties. Specification of levels of granularity and their contents can provide a methodological modelling framework to enable structured examination of emergence from both a formal ontological modelling approach and the computational angle, and helps elucidate the required level of granularity to explain away emergence. I discuss some modelling considerations for a granularity framework and its relevance for the testability of emergence in computational implementations such as simulations.

Metagenomics, or: more problems to solve by bioinformaticians!

Nature Reviews Microbiology had their special issue on metagenomics in 2005, and one on the closely related topic of horizontal gene transfer shortly afterward; now it is PLoS Biology’s turn with several articles on advances in studying microbial communities in the ocean as part of their Oceanic Metagenomics collection. Not that, in theory, metagenomics is limited to microbes, but that’s where the research focus is now (e.g. [1][2][3]), because scaling up genomics research isn’t easy or cheap, and think of all the data that needs to be stored, processed, and analysed.

For the non-biologist reader in 3 sentences (or synopsis [4]): metagenomics, or ‘high-throughput molecular ecology’ (also called community genomics, ecogenomics, environmental genomics, or population genomics) combines molecular biology with ecosystems. It reveals community and population-specific metabolisms with the interdependent biological behaviour of organisms in nature that is affected by their micro-climate. Take a handful of soil (ocean water, mud, …) and figure out which microorganisms live there, who’s active (and what are they doing?), who’s dormant, what are the ratios of the population sizes of the different types of microorganisms, what does a microbial community ‘look’ like, etc.?

For the data-enthusiast: all those individual microorganisms need to have their DNA and RNA sequenced, where, of course, the results go into databases. And then the analysis: putting back together the pieces from shotgun sequencing, comparing DNA with DNA, rRNA with rRNA, with each other, how to do the binning and so forth [5]. Naively: more and faster algorithms wouldn’t hurt; how can you visualize a community of microorganisms on your screen, and make simulations of those bacterial communities?

And then, somewhere in this whole endeavor, bio-ontologists should be able to find their place, to help out (and figure out) how to best represent all the new information in a usable and reusable way. Because metagenomics is a hot topic with much research and novel results, ontology maintenance (tracking changes etc) will then likely be more important than the attention it receives in ODEs at present, as well as reasoning over ontologies and massive amounts of data. Ouch. Some work has been and is being done on these topics (e.g. [6] [7]), and more can/will/does/should follow.

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.
[2] Lorenz, P., Eck, J. Metagenomics and industrial applications. Nature Reviews Microbiology, 2005, 3:510-516.
[3] Schleper, C., Jurgens, G., Jonuscheit, M. Genomic studies of uncultivated Archaea. Nature Reviews Microbiology, 2005, 3:479-488.
[4] Gross, L. Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity. PLoS Biology, 2007, 5(3): e85.
[5] Eisen, J.A. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biology, 2007, 5(3): e82.
[6] Klein, M. and Noy, N.F. (2003). A Component-Based Framework for Ontology Evolution. Workshop on Ontologies and Distributed Systems at IJCAI-2003, Acapulco, Mexico.
[7] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R. Linking data to ontologies: The description logic DL-Lite A. Proc. of the 2nd Workshop on OWL: Experiences and Directions (OWLED 2006), 2006.