Toward a framework for resolving conflicts in ontologies (with COVID-19 examples)

Among the many tasks involved in developing an ontology are deciding what part of the subject domain to include, and how. This may involve selecting a foundational ontology, reusing related domain ontologies, and making more detailed ontology authoring decisions about specific axioms and design patterns. A recent example of reuse is that of the Infectious Diseases Ontology for schistosomiasis knowledge [1], but even before reuse, one may have to assess differences among ontologies, as Haendel et al. did for disease ontologies [2]. Put differently, issues may arise even before throwing alignment tools at them, or selecting one with an import statement and hoping for the best. For instance, two relevant domain ontologies may have been aligned to different foundational ontologies; a partOf relation could be declared transitive in one ontology but used in a qualified cardinality constraint in the other (so that one cannot use an OWL 2 DL reasoner anymore once the ontologies are combined); something like Infection may be represented as a class in one ontology but as a property infectedby in another; or the ontologies may differ on the science, such as whether Virus is an organism or an inanimate object.
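
To make the partOf example a bit more tangible, here is a minimal sketch in Python of how such a clash could be spotted once two ontologies are combined. The ToyOntology structure and the property names other than partOf are assumptions made up for illustration; a real check would work on the actual OWL axioms and would also have to consider sub-properties and property chains, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOntology:
    """Hypothetical, heavily simplified stand-in for an OWL ontology."""
    name: str
    # Object properties declared transitive in this ontology.
    transitive_properties: set = field(default_factory=set)
    # Object properties used in (qualified) cardinality restrictions.
    cardinality_properties: set = field(default_factory=set)


def simple_property_conflicts(onto1, onto2):
    """Return the properties that are transitive somewhere and also used in a
    cardinality restriction somewhere, once both ontologies are combined."""
    transitive = onto1.transitive_properties | onto2.transitive_properties
    counted = onto1.cardinality_properties | onto2.cardinality_properties
    return sorted(transitive & counted)


# Each ontology is fine on its own; the problem only shows up in combination.
onto1 = ToyOntology("Onto1", transitive_properties={"partOf"})
onto2 = ToyOntology("Onto2", cardinality_properties={"partOf", "hasSymptom"})

conflicts = simple_property_conflicts(onto1, onto2)
print(conflicts)
```

The point of the sketch is only that neither ontology violates the restriction by itself, which is why this kind of conflict tends to surface late, after the import.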

What to do then?

Upfront, it helps to be cognizant of the different types of conflict that may arise and to understand their causes. Then one would want to be able to find those conflicts automatically and, most importantly, get some assistance in resolving them; if possible, even preventing conflicts from happening in the first place. This is what Rolf Grütter, from the Swiss Federal Research Institute WSL, and I have been working on since he visited UCT last year. The first results have been accepted for the International Conference on Biomedical Ontologies (ICBO) 2020 and are described in a paper entitled “Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring” [3]. A sample scenario of the process is illustrated informally in the following figure.

Summary of a sample scenario of detecting and resolving conflicts, illustrated with an ontology reuse scenario where Onto2 will be imported into Onto1. (source: [3])

The paper first defines and illustrates the notions of meaning negotiation and conflict resolution and summarises their main causes, and then goes into some detail on the various categories of conflicts and ways to resolve them. Detection and resolution are assisted by the notion of a conflict set, which is a data structure that stores the details of a conflict for further processing.
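
To give a rough idea of what a conflict set might look like in code, here is a minimal Python sketch. It is not the definition from the paper: the field names (category, ontologies, axioms, resolution), the class names, and the axiom strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Conflict:
    """One detected conflict (hypothetical fields, not the paper's definition)."""
    category: str                      # e.g. "class vs. property"
    ontologies: Tuple[str, ...]        # names of the ontologies involved
    axioms: Tuple[str, ...]            # the clashing axioms, rendered as text
    resolution: Optional[str] = None   # filled in once a decision is made


@dataclass
class ConflictSet:
    """Collects conflicts so they can be processed further."""
    conflicts: List[Conflict] = field(default_factory=list)

    def add(self, conflict: Conflict) -> None:
        self.conflicts.append(conflict)

    def unresolved(self) -> List[Conflict]:
        return [c for c in self.conflicts if c.resolution is None]


conflict_set = ConflictSet()
conflict_set.add(Conflict(
    category="class vs. property",
    ontologies=("Onto1", "Onto2"),
    axioms=("Infection declared as a class",
            "infectedby declared as an object property"),
))
conflict_set.add(Conflict(
    category="differing science",
    ontologies=("Onto1", "Onto2"),
    axioms=("Virus SubClassOf Organism",          # illustrative rendering
            "Virus SubClassOf InanimateObject"),  # illustrative rendering
    resolution="defer to domain experts",
))
print([c.category for c in conflict_set.unresolved()])
```

Keeping the resolution as part of the record is a design choice of this sketch: it lets the same structure serve both detection (what clashed) and negotiation (what was decided).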

The framework was tested with a use case of an epizootic disease outbreak in the Lemanic Arc in Switzerland in 2006, due to H5N1 (avian influenza): an administrative ontology had to be merged with one about the epidemiology for infected birds and surveillance zones. With that use case in place already well before the spread of SARS-CoV-2 that caused the current pandemic, it was a small step to add a few examples to the paper about COVID-19. This was made possible thanks to recently developed relevant ontologies that were made available, including for COVID-19 specifically. Let’s highlight the examples here, also so that I can write a bit more about them than the terse text in the paper, since there are no page limits for a blog post.

Example 1: OWL profile violations

Medical terminologies tend to veer toward being represented in an ontology language that is at most as expressive as OWL 2 EL: this permits scalability, compatibility with typical OBO Foundry ontologies, as well as fitting with the popular SNOMED CT. As one may expect, there have been efforts in ontology development with content relevant for the current pandemic; e.g., the Coronavirus Infectious Disease Ontology (CIDO) [4]. The CIDO is not in OWL 2 EL, however: it has a class expression with a universal quantifier (ObjectAllValuesFrom) on the right-hand side; specifically (in DL notation): ‘Yale New Haven Hospital SARS-CoV-2 assay’ \sqsubseteq \forall ‘EUA-authorized use at’.‘FDA EUA-authorized organization’ or, in the Protégé interface:

(codes: CIDO_0000020, CIDO_0000024, and CIDO_0000031, respectively). It also imported many ontologies and either used them to cause some profile violations or the violations came with them, such as by having used the union operator (‘or’) in the following axiom for therapeutic vaccine function (VO_0000562):

How did I find that? Most certainly NOT by manually browsing through the more than 70000 axioms of the CIDO (including imports) to find the needle in the haystack. Instead, I burned the proverbial haystack to easily get the needles. In this case, the burning was done with the OWL Classifier, which automatically computes which axioms violate any of the OWL species, and lists them accordingly. Here are two examples, illustrating an OWL 2 EL violation (that aforementioned universal quantification) and an OWL 2 QL violation (a property chain with entities from BFO and RO); you can do likewise for OWL 2 RL violations.

Following the scenario, and assuming that the CIDO has to stay in the OWL 2 EL profile, it is easy to find the conflicting axioms and act accordingly, i.e., remove them. (It also indicates something did not go well with importing the NDF-RT.owl into the cido-base.owl, but that as an aside for this example.)
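
For readers who want to see the idea behind such a profile check, here is a small Python sketch. It is not how the OWL Classifier works internally; it is a toy under stated assumptions: class expressions are nested tuples, only three of the constructors that OWL 2 EL excludes are listed (universal quantification, union, and complement), and the second and third axioms are made-up examples rather than actual CIDO or VO axioms.

```python
# Toy encoding (an assumption of this sketch): a named class is a string and a
# complex expression is a tuple (constructor, argument, ...).
# Partial list of constructors that OWL 2 EL does not allow.
NOT_IN_EL = {
    "all": "ObjectAllValuesFrom",
    "or": "ObjectUnionOf",
    "not": "ObjectComplementOf",
}


def find_violations(expr):
    """Recursively collect the names of non-EL constructors in an expression."""
    if isinstance(expr, str):  # named class or property: nothing to report
        return []
    constructor, *args = expr
    found = [NOT_IN_EL[constructor]] if constructor in NOT_IN_EL else []
    for arg in args:
        found.extend(find_violations(arg))
    return found


axioms = {
    "assay axiom": (
        "subclass",
        "Yale New Haven Hospital SARS-CoV-2 assay",
        ("all", "EUA-authorized use at", "FDA EUA-authorized organization"),
    ),
    # Hypothetical axioms, only there to exercise the check.
    "union example": ("subclass", "A", ("or", "B", ("some", "r", "C"))),
    "EL-friendly example": ("subclass", "A", ("some", "r", ("and", "B", "C"))),
}

# Keep only the axioms that have at least one violation.
report = {name: found
          for name, found in ((n, find_violations(a)) for n, a in axioms.items())
          if found}
print(report)
```

Removing the reported axioms is then the blunt resolution strategy described above; whether that is acceptable depends on how much of the intended meaning those axioms carry.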

Example 2: Modelling issues: same idea, different elements

Let’s take the CIDO again and now also the COviD Ontology for cases and patient information (CODO), which have some overlapping and complementary information, so perhaps could be merged. A not unimportant thing is the test for SARS-CoV-2 and its outcome. CODO has a ‘laboratory test finding’ \equiv {positive, pending, negative}, i.e., the possible outcomes of the test are individuals made into a class using the ObjectOneOf constructor. Consulting CIDO for the test outcomes, it has a class ‘COVID-19 diagnosis’ with three subclasses: Negative, Positive, and Presumptive positive. Aside from the inexact matches of the test status that won’t simplify data integration efforts, this is an example of class vs. instance modeling of what is ontologically the same thing. Resolving this in any merging attempt means that either

  1. the CODO has to change and bump up the test results from individuals to classes, or
  2. the CIDO has to change the subclasses to individuals in the ABox, or
  3. take an ‘outside option’ and represent it in yet a different way where both the CODO and the CIDO have to modify the ontology (e.g., take a conceptual data modeling approach by making the test outcome an attribute with a few possible values).
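
The comparison in this example can be sketched in a few lines of Python. The labels are the ones mentioned above; the function name, the report keys, and the normalisation by lower-casing are assumptions of this sketch, and real label matching would need to be more careful.

```python
def normalise(label):
    """Crude label normalisation: trim whitespace and ignore case."""
    return label.strip().lower()


def class_vs_instance_report(individuals, classes):
    """Compare the individuals of one ontology with the classes of another."""
    ind = {normalise(x) for x in individuals}
    cls = {normalise(x) for x in classes}
    return {
        "same_idea_different_element": sorted(ind & cls),
        "only_as_individual": sorted(ind - cls),
        "only_as_class": sorted(cls - ind),
    }


# CODO: members of the ObjectOneOf for 'laboratory test finding'.
codo_individuals = {"positive", "pending", "negative"}
# CIDO: subclasses of 'COVID-19 diagnosis'.
cido_subclasses = {"Negative", "Positive", "Presumptive positive"}

result = class_vs_instance_report(codo_individuals, cido_subclasses)
for key, labels in result.items():
    print(key, labels)
```

The first entry of the report lists candidates for a class vs. instance conflict, whereas the other two show the inexact matches that any of the three resolution options would still have to deal with.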

The paper provides an attempt to systematize such types of conflict toward a library of common types of conflict, so that it should become easier to find them, and offers steps toward a proper framework to manage all that, which assisted with devising generic approaches to the resolution of conflicts. We have already done more to realize all that (which could not all be squeezed into the 12 pages), but more is still to be done, so stay tuned.

Since COVID-19 is still doing the rounds and the international borders of South Africa are still closed (with a lockdown for some 5 months already), I can’t end the blog post with the usual ‘I hope to see you at ICBO 2020 in Bolzano in September’—well, not in the common sense understanding at least. Hopefully next year then.

 

References

[1] Cisse PA, Camara G, Dembele JM, Lo M. An Ontological Model for the Annotation of Infectious Disease Simulation Models. In: Bassioni G, Kebe CMF, Gueye A, Ndiaye A, editors. Innovations and Interdisciplinary Solutions for Underserved Areas. Springer LNICST, vol. 296, 82–91. 2019.

[2] Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annual Review of Biomedical Data Science, 2018, 1:305–331.

[3] Grütter R, Keet CM. Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring. 11th International Conference on Biomedical Ontologies (ICBO’20), 16-19 Sept 2020, Bolzano, Italy. CEUR-WS (in print).

[4] He Y, Yu H, Ong E, Wang Y, Liu Y, Huffman A, Huang H, Beverley J, Hur J, Yang X, Chen L, Omenn GS, Athey B, Smith B. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scientific Data, 2020, 7:181.

The DiDOn method to develop bio-ontologies from semi-structured life science diagrams

It is well-known among (bio-)ontology developers that ontology development is a resource-consuming task (see [1] for data backing up this claim). Several approaches and tools do exist that speed up the time-consuming efforts of bottom-up ontology development, most notably natural language processing and database reverse engineering. They are generic and the technologies have been proposed from a computing angle, and are therefore noisy and/or contain many heuristics to make them fit for bio-ontology development. Yet, the most obvious one from a domain expert perspective is unexplored: the abundant diagrams in the sciences that function as existing/‘legacy’ knowledge representation of the subject domain. So, how can one use them to develop domain ontologies?

The new DiDOn procedure—from Diagram to Domain Ontology—can speed up and simplify bio-ontology development by exploiting the knowledge represented in such semi-structured bio-diagrams. It does this by means of extracting explicit and implicit knowledge, preserving most of the subject domain semantics, and making formalisation decisions explicit, so that the process is done in a clear, traceable, and reproducible way.

DiDOn is a detailed, micro-level procedure to formalise those diagrams in a logic of choice; it provides migration paths into OBO, SKOS, OWL and some arbitrary FOL, and guidelines on which axioms have to be added to the bio-ontology, and how. It also uses a foundational ontology so as to obtain more precise and interoperable subject domain semantics than would otherwise have been possible with syntactic transformations alone. (Choosing an appropriate foundational ontology is a separate topic and can be done with, e.g., ONSET.)

The paper describing the rationale and details, Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn [2], has just been accepted at the Journal of Biomedical Informatics. They require a graphical abstract, so here it goes:

DiDOn consists of two principal steps: (1) formalising the ‘icon vocabulary’ of a bio-drawing tool, which then functions as a seed ontology, and (2) populating the seed ontology by processing the actual diagrams. The algorithm in the second step is informed by the formalisation decisions taken in the first step. Such decisions include, among others, the representation language and how to represent the diagram’s n-aries (with n≥2, such as choosing between n-aries as relationship or reified as classes).
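
To illustrate the last of those decisions, here is a small Python sketch of the two options for an n-ary. It is not taken from the DiDOn paper or from Pathway Studio: the relationship name Binding, the role names, and the participant names are hypothetical, and the two functions merely show the shape of the result of each choice.

```python
def as_nary_relationship(name, participants):
    """Option 1: keep the n-ary as one relationship over all participants."""
    return (name, tuple(participants.values()))


def as_reified_class(name, participants):
    """Option 2: reify the n-ary as a class, with one binary property per
    role that links the reified class to a participant."""
    return [(name, role, filler) for role, filler in participants.items()]


# A hypothetical ternary taken from some diagram: role name -> participant.
binding = {
    "hasAgent": "Protein",
    "hasTarget": "SmallMolecule",
    "hasLocation": "Nucleus",
}

nary = as_nary_relationship("Binding", binding)
reified = as_reified_class("Binding", binding)
print(nary)
for triple in reified:
    print(triple)
```

The reified form fits languages that only have binary relationships, such as OWL, at the cost of an extra class and several properties per n-ary; that trade-off is the kind of formalisation decision that has to be made explicit in the first step.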

In addition to the presentation of DiDOn, the paper contains a detailed application of it with Pathway Studio as case study.

The neatly formatted paper is behind a paywall for those with no or limited access to Elsevier’s journals, but the accepted manuscript is openly accessible from my home page.

References

[1] Simperl, E., Mochol, M., Bürger, T. Achieving maturity: the state of practice in ontology engineering in 2009. International Journal of Computer Science and Applications, 2010, 7(1):45-65.

[2] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics. In print. DOI: http://dx.doi.org/10.1016/j.jbi.2012.01.004

Progress on the EnvO at the Dagstuhl workshop

Over the course of the 4.5 days packed together in the beautiful and pleasant ambience of Schloss Dagstuhl, the fourth Environment Ontology workshop has been productive, and a properly referenceable paper outlining details and decisions will follow. Here I will limit myself to mentioning some of the outcomes and issues that came up.

Group photo of most of the participants at the EnvO Workshop at Dagstuhl

After presentations by all attendees, a long list of discussion themes was drawn up, which we managed to discuss and agree upon to a large extent. The preliminary notes and keywords are jotted down and put on the EnvO wiki dedicated to the workshop.

Focussing first on the content topics, which took up the lion’s share of the workshop’s time, significant advances have been made in two main areas. First, we have sorted out the Food branch in the ontology, which has been moved as Food product under Environmental material and then Anthropogenic environmental material, and the kind and order of differentia have been settled, using food source and processing method as the major axes. Second, the Biome branch will be refined in two directions, regarding (i) the ecosystems at different scales and the removal of the species-centred notion of habitat to reflect better the notion of environment and (ii) work toward inclusion of the aspect of n-dimensional hypervolume of an environment (both the conditions / parameters / variables and the characterization of a particular type of environment using such conditions, analogous to the hypervolumes of an ecological niche so that EnvO can be used better for annotation and analysis of environmental data). Other content-related topics concerned GPS coordinates, hydrographic features, and the commitment to BFO and the RO for top-level categories and relations. You can browse through the preliminary changes in the envo-edit version of the ontology, which is a working version that changes daily (i.e., not an officially released one).

There was some discussion (insufficient, I think) and recurring comments and suggestions on how to represent the knowledge in the ontology and, with that, on the ontology language and modelling guidelines. Some favour bare single-inheritance trees for appealing philosophical motivations. The first problematic case, however, was brought forward by David Mark, who had compelling arguments for multiple inheritance with his example of how to represent Wadi, and soon more followed with terms such as Smoked sausage (having as parents the source and processing method) and many more in the food branch. Some others preferred lattices or a more common knowledge representation language; both are ways to handle more neatly the properties/qualities with respect to the usage of properties and the property inheritance by sub-universals from their parent. Currently, the EnvO is represented in OBO, and modelling the knowledge does not follow the KR approach of declaring properties of some universal (/concept/class) and availing of property inheritance, so that one ends up having to make multiple trees and then adding ‘cross-products’ between the trees. Hence, and using intuitive labels merely for human readability here, Smoked sausage either will have two parents, amounting to, in the end where the branching started, \forall x (SmokedSausage(x) \equiv AnimalFoodProduct(x) \land ProcessingMethod(x)) (which is ontologically incorrect because a smoked sausage is not a way of processing) or, if done with a ‘cross-product’ and a new relation (hasQuality), then the resulting computation will have something like \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land hasQuality(x,y) \land Smoking(y)) instead of having declared directly in the ontology proper, say, \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land HasProcessingMethod(x,y) \land Smoking(y)).
The latter option has the advantages that it makes it easier to add, say, Fermented smoked sausage or Cooked smoked sausage as a sausage that has the two properties of being [fermented/cooked] and being smoked, and that one can avail of automated reasoners to classify the taxonomy. Either way, the details are being worked on. The ontology language and the choice for one or the other, whichever it may be, ought not to get in the way of developing an ontology, but, generally, it does so, both regarding the underlying commitments that the language adheres to and any implicit or explicit workarounds in the modelling stage that to some extent make up for a language’s limitations.
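
What ‘availing of automated reasoners to classify the taxonomy’ amounts to can be illustrated with a toy in Python. This is not a DL reasoner and not how EnvO is processed: it assumes that every class is defined as a Sausage with a set of required (property, filler) pairs, and it treats one class as subsumed by another when the other's requirements are a subset of its own. The class names for the fermented and cooked variants are made up along the lines of the examples above.

```python
# Each class is a Sausage plus a set of required (property, filler) pairs.
definitions = {
    "Sausage": frozenset(),
    "SmokedSausage": frozenset({("HasProcessingMethod", "Smoking")}),
    "FermentedSmokedSausage": frozenset({
        ("HasProcessingMethod", "Smoking"),
        ("HasProcessingMethod", "Fermenting"),
    }),
    "CookedSmokedSausage": frozenset({
        ("HasProcessingMethod", "Smoking"),
        ("HasProcessingMethod", "Cooking"),
    }),
}


def subsumers(name):
    """All other classes whose requirements are met by the given class."""
    return {other for other, required in definitions.items()
            if other != name and required <= definitions[name]}


def direct_parents(name):
    """Subsumers that have no other subsumer of the class below them.
    Assumes no two classes have identical definitions."""
    candidates = subsumers(name)
    return sorted(c for c in candidates
                  if not any(c in subsumers(d) for d in candidates if d != c))


taxonomy = {name: direct_parents(name) for name in definitions}
for name, parents in taxonomy.items():
    print(name, "->", parents)
```

Nobody asserted that Fermented smoked sausage is a kind of Smoked sausage; it follows from the declared properties, which is the advantage of the latter option over hand-made trees with cross-products.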

On a lighter note, we had an excursion to Trier together with the cognitive robotics people (from a parallel seminar at Dagstuhl) on Wednesday afternoon. Starting from the UNESCO’s world heritage monument Porta Nigra and the nearby birthplace of Karl Marx, we had a guided tour through the city centre with its mixture of architectural styles and rich history, which was even more pleasant with the spring-like weather. Afterwards, we went to relax at the wine tasting event at a nearby winery, where the owners provided information about the 6 different Rieslings we tried.

Extension to the Aula Palatina (Constantine's Basilica) in Trier

Section of the Porta Nigra, Trier

Ontologies in ecology: putting the lessons-learned to good use and moving forward

While most of the headlines and attention in bio-ontologies have gone to the Gene Ontology, later also the FMA, and, most recently, the set of ontologies within or close to the OBO Foundry project, activity has been comparatively more modest in the area of ontologies for ecology. This is set to change.

Madin et al. [1] published a review article last month in Trends in Ecology and Evolution about not only the state of the art on existing ontologies for ecology, but also an ode to the development and use of ontologies. The latter is not framed in a bright-vision-follow-me way, but noting (among others) the problems of

terminological ambiguity [that] slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science

and showing problems and clear examples of what kind of problems ontologies can help to solve.

Recollecting the OWLED’07 industry panel discussion last year, it seemed as if industry was at the point where bio-ontologies were 5-8 years ago and, moreover, about to reinvent the wheel. Not so with ontologies for ecology. Madin et al. have separate information boxes about “building consistent ontologies” explaining the difference between is-a and instance-of, is-a and part-of, and is-a and constitution; those things that early adopters learned the hard way a few years ago are presented as a known basic starting point. Likewise for the info-box on “What is an ontology?” and the straight adoption of OWL and the benefits of automated reasoners. In the overview presented by Madin et al., there are no issues to resolve on trying to be backward compatible with the obo format; instead, they go straight to the W3C standardized formal ontology representation languages for the ontologies for ecology. The same holds for box 2 on finding data (which is also a nice scenario for the OBDA Plugin and DIG-Mastro), OntoClean, foundational ontologies and domain ontologies versus other artifacts with terms, linking of ontologies, and a clear table with task-description-requirements (table 1) that invariably asks for good ontologies.

Aside from the analysis of benefits and usages, the concluding remarks section notes that

[t]hus, the adoption of ontologies is hindered both by the familiarity of current practices and the lack of tools to readily migrate to improved practices.

Point taken.

And last, but not least,

Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology, and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms. By clarifying the terms of scientific discourse, and annotating data and analyses with those terms, well defined, community-sanctioned, formal ontologies based on open standards will provide a much-needed foundation upon which to tackle crucial ecological research while taking full advantage of the growing repositories of data on the Internet.

[1] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168. doi:10.1016/j.tree.2007.11.007

A new plant family: the Simulacraceae

May I recommend for the Friday afternoon/weekend reading: an article by Bletter, Reynertson, and Velazquez Runk in the journal Ethnobotany Research & Applications (vol. 5, 2007) on “The taxonomy, ecology, and ethnobotany of the Simulacraceae”, which has about 80 species divided in 17 genera, such as Plasticus, Textileria, and Papyroidia. Moreover,

This family is more than a botanical curiosity. It is a scientific conundrum, as the taxa:

  1. lack genetic material,
  2. appear virtually immortal and
  3. have the ability to form intergeneric crosses with ease, despite the lack of any evident mechanism for cross-fertilization.

In this study, conducted over approximately six years, we elucidate the first full description and review of this fascinating taxon, heretofore named Simulacraceae.

To summarize, also in the words of the authors,

The economics, distribution, ecology, taxonomy, paleoethnobotany, and phakochemistry of this widespread family are herein presented. We have recently made great strides in circumscribing this group, and collections indicate this cosmopolitan family has a varied ecology. … Despite being genomically challenged plants, an initial phylogeny is proposed. In an early attempt to determine the ecological relations of this family, a twenty-meter transect has been inventoried from a Plasticus rain forest in Nyack, New York, yielding 49 new species and the first species-area curve for this family.

The Simulacraceae collections—based on the principal method of “opportunistic sampling”—are deposited in the herbarium of the Foundation for Artificial Knowledge Education. Some of the open problems yet to investigate include simulacrapaleoethnobotany and simulacrapolitical ecology, and from an engineering perspective, the design of a Traditional Simulacraceae Knowledge/Teleological Simulation Knowledge base (dubbed acronym “TSK,TSK”, which would compete well with the yearly naming game for the NAR January database issue).

A short html version of the article is available online in the Jan/Feb issue of AIR, but also the full pdf file (about 6MB) in the Uni of Hawaii database with more information and colourful photos (openly accessible, of course). Enjoy!

On the (un)reasonable effectiveness of mathematics in ecology

An article appeared last week in Ecological Modelling that intends to be thought-provoking; it looks at the effectiveness of mathematics in ecological theory [1], but it can just as well be applied to bioinformatics, computational biology, and bio-ontologies. In short: mathematical models are useful only if they are not so general as to be trivially true and not so specific as to be applicable to one data set only. But how to go about finding the middle way? Ginzburg et al. fail to clearly answer this question, but there are some pointers worth mentioning. In the words of the authors (bold face my emphasis):

A good theory is focused without being blurred by extraneous detail or overgenerality. Yet ecological theories frequently fail to achieve this desirable middle ground. Here, we review the reasons for the mismatch between what theorists seek to achieve and what they actually accomplish. In doing so, we argue on pragmatic grounds against mathematical literalism as an appropriate constraint to mathematical constructions: such literalism would allow mathematics to constrain biology when the biology ought to be constraining mathematics. We also suggest a method for differentiating theories with the potential to be “unreasonably effective” from those that are simply overgeneral. Simple axiomatic assumptions about an ecological system should lead to theoretical predictions that can then be compared with existing data. If the theory is so general that data cannot be used to test it, the theory must be made more specific.

What then about this pragmatism and mathematical literalism? The pragmatism sums up as a “theories never work perfectly” anyway and, well, reality is surpassing us given that “we face an ever-increasing number of ecological crises, social demand will be for crude, imperfect descriptions of ecological phenomena now rather than more detailed, complex understanding later” (as aside and to rub it in: the latter is a different argumentation for pragmatism than the ‘I need a program from you today in order to analyse my lab data so that I can submit the article tomorrow and beat the competition’). The former I concur with, the latter on preferring imperfection over more thought-through theories is a judgment call and I leave that for what it is.
Mathematical literalism roughly means strict adherence to some limited mathematical model for its mathematical characteristics and limitations. For instance, in several ecological models (and elsewhere) processes are interpreted as strictly instantaneous (the “mechanistic” models), whereas those models that do not are mocked as “phenomenological”. But, so the authors argue, we should not fit nature to match the maths, but use mathematics to describe nature. Now this likely does ring a bell or two with developers of formal (logic-based) bio-ontologies: describe your bio stuff with the constructs that OWL gives you! And not, though it probably should be: which formal language (i.e., which constructs) do I actually need to describe my subject domain? (Some follow-up questions on the latter are: if you can’t represent it, what exactly is it that you can’t represent? Do you really need it for the task at hand? Can you represent it in another [logical/conceptual modeling] language?)

It is not this black-and-white, however. As Ginzburg et al. mention a couple of times in the article (kicking in an open door), trying to make a mathematical model of the biological theory greatly helps to be more precise about the underlying assumptions and to make those explicit. This, in turn, aids making predictions based on those assumptions & theory, which subsequently should be tested against real data; if you can’t test it against data, then the theory is no good. This is a bit harsh because it may be that for some practical reasons something cannot be tested, but on the other hand, if that is the case, one may want to think twice about the usefulness of the theory.
Last, “The most useful theories emphasize explanation over description and incorporate a “limit myth” (i.e., they describe a pure situation without extraneous factors, as with the assumption in physics that surfaces are frictionless).” While it is true that one seeks for explanations, this conveniently brushes over the fact that first one has to have a way to describe things in order to incorporate them in an explanatory theory! If the theory fails, then thanks to a structured approach for the descriptions—say, some formal language or [annotated] mathematical equations—it will be easier to fiddle with the theory and reuse parts of it to come up with a new one. If the theory succeeds, it will be easier to link it up to another properly described and annotated theory to make more complex explanatory models.

Overall, the content of the article is a bit premature and would have benefited from a thorough analysis of the too-general and too-specific theories, rather than anecdotal evidence with a couple of examples. Also, the “method for differentiating theories” advertised in the abstract is buried somewhere in the text, so some sort of bullet-pointed checklist for assessing whether one’s own pet theory is too general or too specific would have been useful. Despite this, it is good material to feed a confirmation bias for being against too much and too strict adherence to mathematics… as well as against no mathematics.

[1] Lev R. Ginzburg, Christopher X.J. Jensen and Jeffrey V. Yule. (2007). Aiming the “unreasonable effectiveness of mathematics” at ecological theory. Ecological Modelling, 207(2-4): 356-362. doi:10.1016/j.ecolmodel.2007.05.015

Granularity and no emergence in biology

This time a post that bears some distant relation to my thesis topic: granularity. About 1.5 years ago I got concerned that emergence, emergent properties, and emergent behaviour would complicate developing a formal theory of granularity, so I read up on the topic. While writing along the overview and analyzing both the philosophical aspects and proposed examples of emergence in biology, I came to the realization that it doesn’t complicate granularity, but on the contrary: that granularity actually serves as a useful methodology to investigate (hypothesized) emergence, in particular because of the modeling advantages and prospects for structured in silico simulations.

This is very nice for my granularity, but it takes some 20 pages to support a useful application area of granularity that is not the focus area of applications (wandering off too far from the narrative), and thus takes up too much space in the thesis. So, I’m phasing it out. The problem is that I don’t know of any outlet where a cocktail of bio, IT, and philosophy would be publishable: specialists of each discipline wouldn’t be too happy reading too much about the other two fields and can dismiss it for not being detailed enough for their own field, even though the idea of combining granularity & (hypothesized) emergence may have some novelty to it. Interdisciplinarity has its drawbacks.

Things being as they are, I’m putting the pdf online after the printout has been gathering dust for some 1.5 years, for there might just be an interested reader out there. Comments are welcome of course!

Topics covered in the manuscript are:
1 Introduction
2 Renewed claims of emergence in biology
3 Emergence from a philosophical perspective
3.1 Epistemological emergence
3.2 Ontological emergence
3.3 Strong emergence
3.4 Weak emergence
3.4.1 Simulations
3.5 Examples
3.5.1 Example 1: pseudoplasmodium formation by cellular slime moulds
3.5.2 Example 2: horizontal gene transfer with metagenomics
4 Emergence and levels of granularity
4.1 Preliminaries of granularity
4.2 The irreducibility argument
4.3 Non-predictability and non-derivability
4.4 Characterisation of granular level from the viewpoint of emergence
5 Concluding remarks

The abstract of “Granularity as a modelling approach to investigate hypothesized emergence in biology” is as follows.

Abstract. Informal usage of emergence in biological discourse tends towards being of the epistemic type, but not ontological emergence, primarily due to our lack of knowledge about nature and limitations to how to model it. Philosophy adds clarification to better characterise the fuzzy notion of emergence in biology, but paradoxically it is the methodology of conducting scientific experiments that can give decisive answers. A renewed interest in whole-ism in (molecular) biology and simulations of complex systems does not imply emergent properties exist, but illustrates the realisation that things are more difficult and complex than initially anticipated. Usage of (weak- and epistemological) emergence in bioscience is a shorthand for ‘we have a gap in our knowledge about the precise relation(s) between the whole and its parts and possibly missing something about the parts themselves as well’, which amounts to absence of emergence in the philosophical sense. Given that the existence of emergent properties is not undisputed, we need better methodologies to investigate such claims. Granularity serves as one of these approaches to investigate postulated emergent properties. Specification of levels of granularity and their contents can provide a methodological modelling framework to enable structured examination of emergence from both a formal ontological modelling approach and the computational angle, and helps elucidating the required level of granularity to explain away emergence. I discuss some modelling considerations for a granularity framework and its relevance for the testability of emergence in computational implementations such as simulations.

Metagenomics, or: more problems to solve by bioinformaticians!

Nature Reviews Microbiology had their special issue on metagenomics in 2005, and the closely related topic of horizontal gene transfer shortly afterward; now it is PLoS Biology’s turn, with several articles on advances in studying microbial communities in the ocean as part of their Oceanic Metagenomics collection. Not that, in theory, metagenomics is limited to microbes, but that’s where the research focus is now (e.g. [1][2][3]), because scaling up genomics research isn’t easy or cheap – and think of all the data that needs to be stored, processed, and analysed.

For the non-biologist reader in 3 sentences (or synopsis [4]): metagenomics, or `high-throughput molecular ecology’ (also called community genomics, ecogenomics, environmental genomics, or population genomics) combines molecular biology with ecosystems. It reveals community and population-specific metabolisms with the interdependent biological behaviour of organisms in nature that is affected by their micro-climate. Take a handful of soil (ocean water, mud, …) and figure out which microorganisms live there, who’s active (and what are they doing?), who’s dormant, what are the ratios of the population sizes of the different types of microorganisms, what does a microbial community ‘look’ like, etc?

For the data-enthusiast: all those individual microorganisms need to have their DNA and RNA sequenced, where, of course, the results go into databases. And then the analysis: putting back together the pieces from shotgun sequencing, comparing DNA with DNA, rRNA with rRNA, with each other, how to do the binning and so forth [5]. Naively: more and faster algorithms wouldn’t hurt; how can you visualize a community of microorganisms on your screen, and make simulations of those bacterial communities?

And then, somewhere in this whole endeavor, bio-ontologists should be able to find their place, to help out (and figure out) how to best represent all the new information in a usable and reusable way. Because metagenomics is a hot topic with much research and novel results, ontology maintenance (tracking changes etc) will then likely be more important than the attention it receives in ODEs at present, as well as reasoning over ontologies and massive amounts of data. Ouch. Some work has been and is being done on these topics (e.g. [6] [7]), and more can/will/does/should follow.

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.
[2] Lorenz, P., Eck, J. Metagenomics and industrial applications. Nature Reviews Microbiology, 2005, 3:510-516.
[3] Schleper, C., Jurgens, G., Jonuscheit, M. Genomic studies of uncultivated Archaea. Nature Reviews Microbiology, 2005, 3:479-488.
[4] Gross, L. Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity. PLoS Biology, 2007, 5(3): e85.
[5] Eisen, J.A. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biology, 2007, 5(3): e82.
[6] Klein, M. and Noy, N.F. (2003). A Component-Based Framework for Ontology Evolution. Workshop on Ontologies and Distributed Systems at IJCAI-2003, Acapulco, Mexico.
[7] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A, Rosati, R. Linking data to ontologies: The description logic DL-lite A. Proc. of the 2nd Workshop on OWL: Experiences and Directions (OWLED 2006), 2006.

Multi-tasking or parallel processing? Operating Systems versus processing in the brain.

To multi-task or to process the tasks in parallel are well-researched topics in computer science, tested and implemented in operating systems. When the early (and multiple successive) Microsoft Windows versions were released, it was as if you could work on different programmes at the same time. But these processes were not exactly handled by the computer at the same time – in fact, the sub-processes were interleaved and had access to the processor in successive turns until the processes were finished. Multi-tasking, but not all tasks at the very same time. Or take the processes in UNIX that run in the background (“&”), with or without an importance level set for carrying out the process – same case. Plug in more processors, be it in one computer or distributed over multiple computers, and the computer(s) can process the tasks truly in parallel. These solutions were invented before it was known how nature does it in the human brain; had that been known, it could have been used to emulate it in the computer. For, as Patch has written about Masaccio, as Da Vinci wrote some two centuries before him, and as is coincidentally also the slogan of the systems biology-focussed Microsoft Research University in Trento, Italy:

“Those who took other inspiration than from nature, master of masters, were labouring in vain”
Quegli che pigliavano per altore altro che la natura, maestra de’ maestri, s’affaticavano invano. (the original sentence from Da Vinci in Trattato della pittura, 1500) 
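
To make the interleaving described above concrete, here is a minimal sketch of round-robin time slicing on a single processor. It is a toy illustration, not how any particular Windows or UNIX scheduler is implemented; the function name round_robin, the task names, and the time units are all made up for the example.

```python
from collections import deque


def round_robin(tasks, time_slice):
    """Interleave tasks on one processor in successive turns.

    tasks: dict mapping a (hypothetical) task name to the time units it needs.
    time_slice: maximum units a task may run before it must yield.
    Returns the execution trace as a list of (task name, units run) tuples.
    """
    queue = deque(tasks.items())
    trace = []
    while queue:
        name, remaining = queue.popleft()
        run = min(time_slice, remaining)
        trace.append((name, run))
        if remaining > run:
            # Not finished yet: back to the end of the queue for another turn.
            queue.append((name, remaining - run))
    return trace


if __name__ == "__main__":
    for name, units in round_robin({"editor": 3, "printer": 5}, time_slice=2):
        print(f"{name} runs for {units} unit(s)")
```

Only one task runs at any moment, yet both make progress, which is the "multi-tasking, but not all tasks at the very same time" situation.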

So, how does the brain cope with processing multiple tasks? And is that anything like currently done in operating systems? Did computer scientists ‘reinvent’ the wheel, or can they learn something from task processing strategies the brain employs? 

This month’s PLoS Biology contains an article about process handling in the brain. Biology, cognitive science, neuroscience, psychology and all that, and none of the 92 references refer to a computer science publication. It is fascinating nevertheless. 

Sigman and Dehaene [1] conducted experiments in which the subjects had to perceive stimuli – numbers and tones – and decide whether the number presented was larger or smaller than 45, and make a similar decision about the frequency of the tones. The order of, and the interval between, the stimuli were random, and they then looked at the speed of, and delays in, the subjects’ responses when performing the tasks. For instance, if there is a short interval between the stimuli but the time to complete the two tasks is the same as when they are performed independently, then serial processing is going on in the brain; if it is shorter, then some parallel processing is going on as well; if it is longer, then dual-task interference and management overhead are to blame.
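
The three-way reading of the response times can be written down as a small decision rule. This is only my sketch of the reasoning in the paragraph above, not the analysis used in the article: I assume that "the same as when performed independently" means the sum of the two single-task times, and the tolerance parameter is a hypothetical margin for measurement noise.

```python
def classify_dual_task(time_task1, time_task2, time_dual, tolerance=0.05):
    """Label a dual-task trial by comparing it with the single-task baseline.

    All times are in seconds. The baseline is assumed to be the sum of the
    two single-task completion times; tolerance is a made-up noise margin.
    """
    baseline = time_task1 + time_task2
    if abs(time_dual - baseline) <= tolerance:
        return "serial"
    if time_dual < baseline:
        return "partly parallel"
    return "interference"


if __name__ == "__main__":
    print(classify_dual_task(0.6, 0.5, 1.1))
    print(classify_dual_task(0.6, 0.5, 0.9))
    print(classify_dual_task(0.6, 0.5, 1.4))
```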

Their main conclusions are that, in addition to a “central bottleneck” (i.e., tasks are, roughly but not exclusively, executed on a first-come-first-served basis), there is an active process of task setting; hence, a “central executive” that manages the whole thing. This central executive has four distinct architectural properties: it collects information from different modules, it cannot proceed in parallel with other processes of the same type, it is sustained for a mere few hundred milliseconds, and it is highly stochastic. An operating system certainly is not stochastic.
Anyway, concerning the brain, when you are faced with having to perform multiple tasks, you are first going to think of planning the best or preferred sequence of executing the tasks, and then carry out the processes. Like the adage ‘think before you act’. This planning-thinking management component for dual-task processing seems to involve three successive central stochastic decision processes: 1) task choice, 2) selection of the first response, and 3) selection of the second response. Also, there is such a thing as task disengaging, whereby the execution plan set for the first task is suppressed in order to go on with the next one. Overall, there is an interaction between bottom-up task processing based on the input-stimuli and top-down decisions from the brain’s management centre that determines what is done first.

Some fun-facts of the results obtained.
– Response times are slower if the subjects have less certainty about which task is presented first and when it is presented. Unpredictability slows down your acts.
– Responding to the second task does not depend on the (perception of the) stimulus for the second task, but waits for the unlocking by the process for responding to the first task.
– There may be two bottlenecks in the brain system: response selection and response initiation. That is, task setting and task disengagement – management overhead takes its time.
– Some caution: processing certain stages of tasks serially or in parallel may be a matter of experience. Try to teach that to a computer. 

With two competing processes that want to get access to the central executive, the winner is not fully determined at the time when the stimulus is presented, but the process that can be performed the quickest has an advantage over the other process that requires more resources. It is like a sad printer queue management system where printing small documents always takes precedence over printing large files from computers that have a shoddy connection to the print server. Well, more precisely, it’s not necessarily the complexity or the total duration of the task that has to be carried out, but the duration of perceptual processing counts too. Many more details can be found in the article and its references.
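
A toy simulation can illustrate what "not fully determined, but the quicker one has an advantage" looks like. To be clear about the assumptions: the Gaussian durations, the function name race_share, and all the numbers are invented for illustration and are not taken from the article.

```python
import random


def race_share(mean_fast, mean_slow, spread, trials=10_000, seed=0):
    """Fraction of races for the central executive won by the quicker process.

    Each trial draws a noisy perceptual-processing time for both processes
    (an assumed Gaussian with the given means and spread, in seconds);
    the process that finishes first gets access.
    """
    rng = random.Random(seed)
    fast_wins = 0
    for _ in range(trials):
        fast = rng.gauss(mean_fast, spread)
        slow = rng.gauss(mean_slow, spread)
        if fast < slow:
            fast_wins += 1
    return fast_wins / trials


if __name__ == "__main__":
    share = race_share(mean_fast=0.3, mean_slow=0.5, spread=0.1)
    print(f"the quicker process wins {share:.0%} of the races")
```

The quicker process wins most races but not all of them, unlike the printer queue above, where the small document always goes first.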

How they would investigate the possibility of, or with certainty exclude, interleaving subtasks (of the ‘remainder’ of the task to perform after stimuli processing) of the two main tasks, I do not know, but that also depends on how broadly one defines the tasks. It would also be nice to know the percentage of time spent on task planning when an increasing number of tasks have to be carried out simultaneously. Can it get ‘stuck’ in the planning process? When does the central management module get overloaded? How large is the effect of learning on task processing, and can one achieve more parallel processing thanks to the learning factor? Which processes and tasks are more amenable to being processed in parallel, and which types can be done only serially?

Either way, the brain’s processes seem to be a little more complex, where more variables are taken into account than the multi-tasking and multi-processing of the operating system. On the other hand, with computers one can plug in more processors for parallel computing, which is a no-no for the brain. But parallel processors still need central process management too.
So, the brain does a bit of both serial and parallel processing. That will be interesting for a computer process management system: maybe finer-grained distinctions of the sub-processes can allow for further optimization of process execution? 

[1] Sigman M, Dehaene S (2006) Dynamics of the Central Bottleneck: Dual-Task and Task Uncertainty. PLoS Biology 4(7): e220.

Computer Science with/for Biology and (bio)medicine

The vibrant and emerging research area of 'doing research and engineering in the subject domain of biology and the applied biosciences' comprises one or more (sub-) disciplines of computer sciences and information technology that can be mixed with any of the (sub-) disciplines in biology, ecology, and applied biosciences (such as medicine and agriculture). Depending on the emphasis, this combination tends to favour one or more of the following terms to indicate the type of activity: Computational Biology, Systems Biology, Bioinformatics, In Silico Biology, Ecoinformatics, (Bio)Medical Informatics, and bio-ontologies, among others. But what exactly is the breadth and depth of these relatively new fields, and what are their characteristic activities? What is, or can be, used from mathematics to advance biology at a faster pace? What type of problems do bioscientists perceive that need to be solved? Is engineering only a supportive discipline for biology? If not, where and how is biology pushing the frontiers of computer science and IT? How did, and does, the combination of computer science & biology lead to landmark achievements – and which ones are considered to be achievements?

Against this background, the KRDB Research Centre of the Faculty of Computer Science at the Free University of Bozen-Bolzano aimed to present and form new expertise and professional profiles that can answer the growing demands of the biosciences and ultimately our societies in the area of using both theoretical and applied aspects of computer science and engineering, thereby contributing to pushing the frontiers in computer science as well as (applied) biology. To this end, it has organized the “CS & IT with/for biology” Seminar Series. The aim of the seminars was to provide a broad spectrum of achievements, opportunities, and challenges on using/combining computer science with/for biology, highlighting diverse foci and approaches traversing biology (sub-) disciplines and applied bioscience and a wide range of computer science approaches. This coverage goes from basic biosciences, such as genetics & cellular processes and larger systems in ecology, and the applied biosciences medicine and agriculture, to CS/IT fields of ontology/ies, logics, natural language processing, database integration, and software development.

A reader [1] was made from the extended abstracts of the invited speakers, offering both a summary of the seminar as well as additional references to give useful pointers to key publications, the most recent research output, and 'hot' topics.

The first chapter in this reader provides a general overview of historical aspects and current characteristics of the rather flexible interpretation that was given to biology & informatics – and the more recent diversification into multiple niche areas. It can aid novices in the field to grasp some of the more, and less, active research activities and 'insiders' to have ample material for discussion. From this introduction, we first take a step back before going into details, by looking at some ethical considerations, as described by Heiner Fangerau. Within a short time span, many new possibilities are (or seem) just around the corner: stem cell research and personalised medicine to name just two; but who benefits, and is a regrouping of the human world population into certain groups with genetic predispositions for particular diseases – technologically not impossible – actually desirable and beneficial for the society at large? Which biases are 'built in' when we do our literature research?

The subsequent chapters go into some detail, both with regard to the technological and computer science aspects and with regard to (applied) biology. In chapter 3 Alberto Policriti introduces mathematical modelling for systems biology, with automata and pi-calculus in particular. These topics are relevant for in silico simulations of cellular processes and the mathematical complexities of the outstanding problems, i.e. modelling biological knowledge requires new solutions from mathematicians. The next chapter by Marco Roos, on the other hand, takes a case-based approach: biologists desire to understand better e.g. Huntington's Disease and histones, and to achieve this, they need a computer infrastructure to enable them to do their research. A regrouping of this requirement with technological support has resulted in the initiation of a virtual laboratory for e-science. Marie-Paule Lefranc has taken a yet different path (in chapter 5), where demands from biology, immunogenetics in this case, are combined with the latest developments in computer science, such that her laboratory belongs not only to the ‘early adopters’ of technology over the past 15 years, but also can use it effectively to discover biologically meaningful new information: bio & info in synergy.

The infamous biological data explosion that has occurred over the past 10 years may be well-known, but the ‘consequently' disconnected software tools and databases are known in considerably less detail. Apart from the obvious data integration issues between databases and linking databases and analysis tools, one first needs to be able to find what is there, and then for the biologist to find what s/he needs. This is a central topic of Sarah Cohen-Boulakia's contribution: what are biologists actually looking for, and how can we, automatically, find the relevant software resources? The issue of finding the right information is addressed from an entirely different angle and context by Werner Ceusters in chapter 7. Advances made in the sub-discipline of natural language understanding can help process electronic health records, annotated with an ontology, to mine that data and discover new patterns in the patient's treatment and history, with the aim of improving biomedicine. Last, with Aldo Gangemi we take a closer look at the usefulness of task and action ontologies for software development in agriculture, with the UN Food and Agriculture Organisation (FAO) among the beneficiaries.

While the topics do not cover all aspects of CS&IT with/for (applied) biology, they can give you some insight into its multifaceted aspects, ranging from applied mathematics and philosophy to software engineering, from core to applied biology, and from enabling information technology to successful combination of bio-info and biology-driven computer science.

[1] CSBio reader: extended abstracts of the 'CS&IT with/for biology' Seminar Series. Free University of Bozen-Bolzano, 2005.