On my new book about modelling

It was published last month by Springer: “The what and how of modelling information and knowledge: from mind maps to ontologies”. The book’s three character-limited unique selling points are that it “introduces models and modelling processes to improve analytical skills and precision; describes and compares five modelling approaches: mind maps, models in biology, conceptual data models, ontologies, and ontology; aims at readers looking for a digestible introduction to information modelling and knowledge representation”. The softcover hardcopy and the eBook are available from Springer, Springer professional, many national and international online retailers (e.g., Amazon), as well as university libraries, and hopefully soon in the ‘science’ section of select bookstores.

There’s also a back flap blurb with the book’s motivations and aims, and intended readership. The remainder of this post is a set of informal comments on it.

From my side as author and having read many popular science books on a wide range of topics, I wanted to write a popular science book too, but then about modelling. Modelling for the masses, as it were, or at least something that is comparatively easily readable for professionals who don’t have a computing background and who haven’t had, or had very little, training in modelling, yet who can greatly benefit from doing so. And to some extent also for computing and IT professionals who’d like a refresher on information modelling or a concise introduction to ontologies but don’t want to (re-)open their textbook tomes from college. Modelling doesn’t lend itself well to juicy world-changing discoveries the same way that vaccines and fungi can be themes for page-turners, but a few tales and juicy details do exist.

The next consideration was which aspects of modelling to include and what sort of popular science book to aim for. I distinguished four types of popular science books based on my prior readings, ranging from ‘entertaining layperson’ level holiday reading to ‘advanced interested layperson’ level, where having at least a Bachelor’s degree in that field or a Master’s degree in an adjacent field may be needed to make it through the tiny-font book. I have no experience writing humour, and modelling is a rather dry topic compared to laugh-out-loud musings and investigations into stupidity, drunkenness, or elephants on acid (that entertainment can be found here, here, and here), so that was easily excluded. I’ve already tried out advanced texts tailored to specialists, in the form of an award-winning postgraduate textbook on ontology engineering, and wasn’t in the mood for writing another such book at the time when I was exploring ideas, which was around late 2021 and early 2022. I think this modelling book ended up between the two extremes regarding the amount of content, difficulty, and readability.

And so, I chose a so-called ‘casual writing’ style to make it more readable; there are a few anecdotes to enliven the text, as is customary for popular science books, and the first three chapters are relatively easy in content compared to later chapters. The difficulty level of the chapters’ contents is turned up a notch each chapter going from Chapters 2 to 6 when we’re moving onwards with the journey passing by the five types of models covered in the book. Each successive chapter solves modelling limitations from the preceding chapter, and so it gets more challenging at least up to Chapter 5 (ontologies). Whether a reader finds Chapter 6 on Ontology (philosophy) even harder depends on their background: in some ways it is easier than ontologies, because we can set aside certain interfering practicalities.

Chapter 7 mixes easier use cases with theoretically more abstract sections as we put things together, reflect on Chapters 2-6, and look ahead. There’s no avoiding a little challenge. But then, we read non-fiction/science/tech books to learn from them, and learning requires some effort.

Aside from the reader learning from reading the book, an author is supposed to gain new insights from writing it. And so did I. Moreover, upfront when planning the book, I tried to make sure I likely would. I mention a few salient points in the preface and I’ll select two for this blog post: the cladograms (Section 3.2.1) and the task-based evaluation (Section 7.1.2.2).

Diagrams/models in biology are sometimes ridiculed as “cartoons” by non-biologists. Cladograms would be the xkcd version of it, visually. I already knew that there are common practices, recurring icons, and rules governing the biological models drawn as diagrams. Digging deeper to find more diagrams with rules governing their notation, cladograms came up. They visualise key aspects of the scientific theory of evolution. Conversely, drawing an evolutionary diagram that doesn’t adhere to those rules then amounts to misunderstanding evolution. I think the case deserves more attention, especially because a bunch of school textbooks have been shown to have errors, and there’s room for improvement designing cladogram drawing software. Maybe clarifying matters and being more precise with such models helps resolve some debates on the topic as well.

The motivation for the task-based evaluation is easy to argue for in theory — actually doing it offered a deeper understanding, and writing the book spurred me to do so. One of my claims in the beginning of the book is that with better modelling—better than mind maps, not better mind maps—one learns more. The task-based evaluation is precisely about that. We take one page from a textbook and try to create a model of it, one for each type of model covered in the book. It demonstrates in a clear and straightforward way — assisted by Bloom’s taxonomy if you so fancy — why developing an ontology is much harder than developing a mind map or a conceptual data model, and in what way designing a conceptual data model of that textbook page is better for learning the content than creating a mind map of it.

There were more joys of writing the book. Like that the running example—dance—was also good for some additional interesting paper reading beyond what I already had read and engaged with in various projects. (There are also other subject domains in the examples and illustrations, such as fermentation, peace, labour law, and stuff, and a separate post will be dedicated to more content of the book.)

To jump the gun on questions like “why didn’t you include my preferred type of model or my language, being [DSL x/KG y/BPM z/etc.]?”: the point I wanted to make with this book was made with these five types of models and this was the shortest coherent story arc with which I could do it. The DSLs/KGs/BPMs/etc are not less worthy, but they would have caused the number of pages to explode without adding to the argument. As consolation, perhaps: knowledge graphs (KGs) are likely to appear in a v2 of my ontology engineering textbook and BPM likely will be linked to the TREND temporal conceptual data modelling language, but that’s future music.

Last, I’ve created a web page for the book, which collates information about the book, such as direct links where to buy it, media coverage and links to recent related blog posts (e.g., this one is a spin-off [with an add-on] of an early draft of section 6.3 and that one of a draft of section 7.3), and has extra supplementary material, including a longer illustration of a conceptual model design procedure using a prospective dance school database as example. Feedback is welcome!

More detail on the ontology of pandemic

When can we declare the covid-19 pandemic to be over? I mulled over that in January this year, when the omicron wave was fizzling out in South Africa, and wrote a blog post as a step toward trying to figure it out; a short general-public article was also published by The Conversation (republished widely, including by The Next Web). That was not the end of it. In parallel – or, more precisely, behind the scenes – that ontological investigation did happen scientifically and in much more detail.

The conclusion is still the same, just with a more detailed analysis, which is now described in the paper entitled Exploring the ontology of pandemic [1], which was accepted at the International Conference on Biomedical Ontology 2022 recently.

First, it includes a proper discussion of how the 9 relevant domain ontologies have pandemic represented in the ontology – the same as epidemic, a sibling thereof, or as a subclass, and why – and what sort of generic top-level entity it is asserted to be, and a few more scientific references by domain experts.

Second, besides the two foundational ontologies whose alignment I discussed in the blog post (DOLCE and BFO), I tried five more foundational ontologies that were selected according to several criteria: BORO, GFO, SUMO, UFO, and YAMATO. That mainly took up a whole lot more time, but it didn’t add substantially to insights into what kind of entity pandemic is. It did, however, make clear that manual alignment is hard, and that it is difficult to get it as precise as it ought to be, and may need to be, for several reasons (elaborated on in the paper).

Third, I dug deeper into the eight characteristics of pandemics according to the review by Morens, Folkers and Fauci (yes, him, from the NIAID) [2] and disentangled what’s really going on with those, besides already having noted that several of them are fuzzy. Some of the characteristics aren’t really a property of pandemic itself, but of closely related entities, such as the disease (see table below). There are so many intertwined entities and relations, in fact, that one could very well develop an ontology of just pandemics, rather than have it only as a single class in an ontology as is now the case. For instance, there has to be a high attack rate, but ‘attack rate’ itself relies on the fact that there is an infectious agent that causes a disease and on the R (reproduction) number, which, in turn, is a complex thing that takes into account factors including susceptibility to infection, social dynamics of a population, and the ability to measure infections.

Finally, there are different ways to represent all the knowledge, or a relevant part thereof, as I also elaborated on in my Bio-Ontologies keynote last month. For instance, the attack rate could be squashed into a single data property if the calculation is done elsewhere and you don’t care how it is calculated, or it can be represented in all its glorious detail for the sake of it or for getting a clearer picture of what goes into computing the R number. For a scientific ontology, the latter is obviously the better choice, but there may be scenarios where the former is more practical.

The conclusion? The analysis cleared up a few things, but with some imprecise and highly complex properties as part of the mix to determine what is (and is not) a pandemic, there will be more than one optimum/finish line for a particular pandemic. To arrive at something more specific than in the paper, the domain experts may need to carry out a bit more research or come up with a consensus on how to precisiate those properties that are currently still vague.

Last, but not least, on attending ICBO’22, which will be held from 25-28 September in Ann Arbor, MI, USA: it runs in hybrid format. At the moment, I’m looking into the logistics of trying to attend in person now that we don’t have the highly anticipated ‘winter wave’ like the one we had last year that thwarted my conference travel planning. While that takes extra time and resources to sort out, there’s the very thick silver lining that it also means we seem to be considerably closer to the real end of this pandemic (of the acute infections at least). According to the draft characterisation of pandemic, one indeed might argue it’s over.

References

[1] Keet, C.M. Exploring the Ontology of Pandemic. 13th International Conference on Biomedical Ontology (ICBO’22). CEUR-WS. Michigan, USA, September 25-28, 2022.

[2] Morens, DM, Folkers, GK, Fauci, AS. What Is a Pandemic? The Journal of Infectious Diseases, 2009, 200(7): 1018-1021.

Toward a framework for resolving conflicts in ontologies (with COVID-19 examples)

Among the many tasks involved in developing an ontology are deciding what part of the subject domain to include, and how. This may involve selecting a foundational ontology, reuse of related domain ontologies, and more detailed decisions for ontology authoring for specific axioms and design patterns. A recent example of reuse is that of the Infectious Diseases Ontology for schistosomiasis knowledge [1], but even before reuse, one may have to assess differences among ontologies, as Haendel et al did for disease ontologies [2]. Put differently, even before throwing alignment tools at them or selecting one with an import statement and hoping for the best, issues may arise. For instance, two relevant domain ontologies may have been aligned to different foundational ontologies, a partOf relation could be set to be transitive in one ontology but also be used in a qualified cardinality constraint in the other (so then one cannot use an OWL 2 DL reasoner anymore when the ontologies are combined), something like Infection may be represented as a class in one ontology but as a property infectedby in another, or the ontologies differ on the science, like whether Virus is an organism or an inanimate object.

What to do then?

Upfront, it helps to be cognizant of the different types of conflict that may arise, and to understand what their causes are. Then one would want to be able to find those automatically. And, most importantly, get some assistance in how to resolve them; if possible, even preventing conflicts from happening in the first place. This is what Rolf Grütter, from the Swiss Federal Research Institute WSL, and I have been working on since he visited UCT last year. The first results have been accepted for the International Conference on Biomedical Ontologies (ICBO) 2020, and are described in a paper entitled “Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring” [3]. A sample scenario of the process is illustrated informally in the following figure.

Summary of a sample scenario of detecting and resolving conflicts, illustrated with an ontology reuse scenario where Onto2 will be imported into Onto1. (source: [3])

The paper first defines and illustrates the notions of meaning negotiation and conflict resolution and summarises their main causes, to then go into some detail of the various categories of conflicts and ways to resolve them. The detection and resolution are assisted by the notion of a conflict set, which is a data structure that stores the details for further processing.
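To give a rough idea of what such a conflict set could look like, here is a minimal sketch in Python. It is not the definition from the paper: the field names (`category`, `sources`, `elements`, `note`) and the category labels are my assumptions for illustration only, filled with two of the conflicts mentioned earlier in this post.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Conflict:
    """One detected conflict; all field names are hypothetical."""
    category: str    # e.g. "class vs. property" (assumed label)
    sources: tuple   # the ontologies involved
    elements: tuple  # the entities or axioms that clash
    note: str = ""   # free-text detail for the person resolving it


@dataclass
class ConflictSet:
    """Stores conflicts so they can be processed further."""
    conflicts: list = field(default_factory=list)

    def add(self, conflict: Conflict) -> None:
        self.conflicts.append(conflict)

    def by_category(self, category: str) -> list:
        return [c for c in self.conflicts if c.category == category]


conflict_set = ConflictSet()
conflict_set.add(Conflict("class vs. property", ("Onto1", "Onto2"),
                          ("Infection", "infectedby")))
conflict_set.add(Conflict("domain science", ("Onto1", "Onto2"),
                          ("Virus",), note="organism or inanimate object?"))

for conflict in conflict_set.by_category("domain science"):
    print(conflict.elements, conflict.note)
```

Grouping by category is just one plausible way to hand the stored details to a resolution step; the paper’s own structure may well differ.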

It was tested with a use case of an epizootic disease outbreak in the Lemanic Arc in Switzerland in 2006, due to H5N1 (avian influenza): an administrative ontology had to be merged with one about the epidemiology for infected birds and surveillance zones. With that use case in place already well before the spread of SARS-CoV-2 that caused the current pandemic, it was a small step to add a few examples to the paper about COVID-19. This was made possible thanks to recently developed relevant ontologies that were made available, including for COVID-19 specifically. Let’s highlight the examples here, also so that I can write a bit more about it than the terse text in the paper, since there are no page limits for a blog post.

Example 1: OWL profile violations

Medical terminologies tend to veer toward being represented in an ontology language that is at most as expressive as OWL 2 EL: this permits scalability, compatibility with typical OBO Foundry ontologies, as well as fitting with the popular SNOMED CT. As one may expect, there have been efforts in ontology development with content relevant for the current pandemic; e.g., the Coronavirus Infectious Disease Ontology (CIDO) [4]. The CIDO is not in OWL 2 EL, however: it has a class expression with a universal quantifier (ObjectAllValuesFrom) on the right-hand side; specifically (in DL notation): ‘Yale New Haven Hospital SARS-CoV-2 assay’ \sqsubseteq \forall ‘EUA-authorized use at’.‘FDA EUA-authorized organization’ or, in the Protégé interface:

(codes: CIDO_0000020, CIDO_0000024, and CIDO_0000031, respectively). It also imported many ontologies and either used them to cause some profile violations or the violations came with them, such as by having used the union operator (‘or’) in the following axiom for therapeutic vaccine function (VO_0000562):

How did I find that? Most certainly NOT by manually browsing through the more than 70000 axioms of the CIDO (including imports) to find the needle in the haystack. Instead, I burned the proverbial haystack to easily get the needles. In this case, the burning was done with the OWL Classifier, which automatically computes which axioms violate any of the OWL species, and lists them accordingly. Here are two examples, illustrating an OWL 2 EL violation (that aforementioned universal quantification) and an OWL 2 QL violation (a property chain with entities from BFO and RO); you can do likewise for OWL 2 RL violations.

Following the scenario with the assumption that the CIDO would have to stay in the OWL 2 EL profile, then it is easy to find the conflicting axioms and act accordingly, i.e., remove them. (It also indicates something did not go well with importing the NDF-RT.owl into the cido-base.owl, but that as an aside for this example.)
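For those who want to see the idea behind such a check, here is a toy sketch in Python. It is not how the OWL Classifier works internally and it uses no OWL library: class expressions are simply nested tuples named after the OWL 2 functional-style constructors, and the list of disallowed constructors is a non-exhaustive subset of what OWL 2 EL excludes. The first axiom is the CIDO one mentioned above; the other two, with their class and property names, are made up.

```python
# Class-expression constructors that OWL 2 EL does not allow (non-exhaustive).
EL_DISALLOWED = {
    "ObjectAllValuesFrom", "ObjectUnionOf", "ObjectComplementOf",
    "ObjectMinCardinality", "ObjectMaxCardinality", "ObjectExactCardinality",
}


def el_violations(expr):
    """Return the disallowed constructor names found anywhere in expr."""
    if not isinstance(expr, tuple):  # a named class, property, or number
        return set()
    constructor, *args = expr
    found = {constructor} & EL_DISALLOWED
    for arg in args:
        found |= el_violations(arg)
    return found


# (subclass, superclass expression) pairs
axioms = [
    ("Yale New Haven Hospital SARS-CoV-2 assay",
     ("ObjectAllValuesFrom", "EUA-authorized use at",
      "FDA EUA-authorized organization")),
    ("HypotheticalClass",  # made-up axiom with a nested union
     ("ObjectSomeValuesFrom", "hypotheticalProperty",
      ("ObjectUnionOf", "A", "B"))),
    ("AnotherHypotheticalClass",  # made-up axiom that stays within EL
     ("ObjectSomeValuesFrom", "hypotheticalProperty", "A")),
]

violating = {sub: el_violations(sup) for sub, sup in axioms if el_violations(sup)}
for sub, constructors in violating.items():
    print(sub, sorted(constructors))
```

A real profile checker also has to look at property axioms, datatypes, and the rest of the profile definition, so treat this only as an illustration of the ‘walk the axioms and flag the constructors’ principle.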

Example 2: Modelling issues: same idea, different elements

Let’s take the CIDO again and now also the COviD Ontology for cases and patient information (CODO), which have some overlapping and complementary information, so perhaps could be merged. A not unimportant thing is the test for SARS-CoV-2 and its outcome. CODO has a ‘laboratory test finding’ \equiv {positive, pending, negative}, i.e., the possible outcomes of the test are individuals made into a class using the ObjectOneOf constructor. Consulting CIDO for the test outcomes, it has a class ‘COVID-19 diagnosis’ with three subclasses: Negative, Positive, and Presumptive positive. Aside from the inexact matches of the test status that won’t simplify data integration efforts, this is an example of class vs. instance modeling of what is ontologically the same thing. Resolving this in any merging attempt means that either

  1. the CODO has to change and bump up the test results from individuals to classes, or
  2. the CIDO has to change the subclasses to individuals in the ABox, or
  3. take an ‘outside option’ and represent it in yet a different way where both the CODO and the CIDO have to modify the ontology (e.g., take a conceptual data modeling approach by making the test outcome an attribute with a few possible values).
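The inexact match between the two sets of outcomes can be made explicit with a few lines of Python. This is only a sketch over the labels mentioned above; the `normalise` and `compare` helpers are hypothetical, and comparing lower-cased labels is a naive stand-in for a real alignment step.

```python
# Outcome labels as described above: individuals in CODO, subclasses in CIDO.
codo_outcomes = {"positive", "pending", "negative"}
cido_outcomes = {"Negative", "Positive", "Presumptive positive"}


def normalise(label):
    """Naive label normalisation: trim and lower-case."""
    return label.strip().lower()


def compare(first, second):
    """Split two label sets into shared labels and those unique to each."""
    a = {normalise(x) for x in first}
    b = {normalise(x) for x in second}
    return {"shared": a & b, "only_first": a - b, "only_second": b - a}


result = compare(codo_outcomes, cido_outcomes)
for key in ("shared", "only_first", "only_second"):
    print(key, sorted(result[key]))
```

Whichever of the three options is chosen, the labels that end up in `only_first` and `only_second` still need a decision by a domain expert on whether and how they relate.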

The paper provides an attempt to systematize such types of conflict toward a library of common types of conflict, so that it should become easier to find them, and offers steps toward a proper framework to manage all that, which assisted with devising generic approaches to resolution of conflicts. We already have done more to realize all that (which could not all be squeezed into the 12 pages), but more is still to be done, so stay tuned.

Since COVID-19 is still doing the rounds and the international borders of South Africa are still closed (with a lockdown for some 5 months already), I can’t end the blog post with the usual ‘I hope to see you at ICBO 2020 in Bolzano in September’—well, not in the common sense understanding at least. Hopefully next year then.

 

References

[1] Cisse PA, Camara G, Dembele JM, Lo M. An Ontological Model for the Annotation of Infectious Disease Simulation Models. In: Bassioni G, Kebe CMF, Gueye A, Ndiaye A, editors. Innovations and Interdisciplinary Solutions for Underserved Areas. Springer LNICST, vol. 296, 82–91. 2019.

[2] Haendel MA, McMurry JA, Relevo R, Mungall CJ, Robinson PN, Chute CG. A Census of Disease Ontologies. Annual Review of Biomedical Data Science, 2018, 1:305–331.

[3] Grütter R, Keet CM. Towards a Framework for Meaning Negotiation and Conflict Resolution in Ontology Authoring. 11th International Conference on Biomedical Ontologies (ICBO’20), 16-19 Sept 2020, Bolzano, Italy. CEUR-WS (in print).

[4] He Y, Yu H, Ong E, Wang Y, Liu Y, Huffman A, Huang H, Beverley J, Hur J, Yang X, Chen L, Omenn GS, Athey B, Smith B. CIDO, a community-based ontology for coronavirus disease knowledge and data integration, sharing, and analysis. Scientific Data, 2020, 7:181.

The DiDOn method to develop bio-ontologies from semi-structured life science diagrams

It is well-known among (bio-)ontology developers that ontology development is a resource-consuming task (see [1] for data backing up this claim). Several approaches and tools do exist that speed up the time-consuming efforts of bottom-up ontology development, most notably natural language processing and database reverse engineering. They are generic and the technologies have been proposed from a computing angle, and are therefore noisy and/or contain many heuristics to make them fit for bio-ontology development. Yet, the most obvious one from a domain expert perspective is unexplored: the abundant diagrams in the sciences that function as existing/‘legacy’ knowledge representation of the subject domain. So, how can one use them to develop domain ontologies?

The new DiDOn procedure—from Diagram to Domain Ontology—can speed up and simplify bio-ontology development by exploiting the knowledge represented in such semi-structured bio-diagrams. It does this by means of extracting explicit and implicit knowledge, preserving most of the subject domain semantics, and making formalisation decisions explicit, so that the process is done in a clear, traceable, and reproducible way.

DiDOn is a detailed, micro-level procedure to formalise those diagrams in a logic of choice; it provides migration paths into OBO, SKOS, OWL and some arbitrary FOL, and guidelines on which axioms have to be added to the bio-ontology, and how. It also uses a foundational ontology so as to obtain more precise and interoperable subject domain semantics than otherwise would have been possible with syntactic transformations alone. (Choosing an appropriate foundational ontology is a separate topic and can be done with, e.g., ONSET.)

The paper describing the rationale and details, Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn [2], has just been accepted at the Journal of Biomedical Informatics. They require a graphical abstract, so here it goes:

DiDOn consists of two principal steps: (1) formalising the ‘icon vocabulary’ of a bio-drawing tool, which then functions as a seed ontology, and (2) populating the seed ontology by processing the actual diagrams. The algorithm in the second step is informed by the formalisation decisions taken in the first step. Such decisions include, among others, the representation language and how to represent the diagram’s n-aries (with n≥2, such as choosing between n-aries as relationship or reified as classes).
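To illustrate just the n-ary decision, here is a minimal Python sketch of the reification option: an n-ary relationship is turned into an instance of a class plus one binary relation per participant. It is not taken from the paper or from Pathway Studio; the function `reify`, the relation `Binding`, and the role and entity names are all made up for the example.

```python
def reify(relation, instance_id, participants):
    """Turn one n-ary relationship into binary triples.

    relation:     name of the class that stands in for the n-ary relation
    instance_id:  identifier for this particular occurrence
    participants: mapping from role name to participating entity
    """
    triples = [(instance_id, "instanceOf", relation)]
    triples.extend((instance_id, role, entity)
                   for role, entity in participants.items())
    return triples


# A hypothetical ternary relationship read off a diagram.
binding = reify("Binding", "binding_1", {
    "hasAgent": "ProteinA",
    "hasTarget": "ProteinB",
    "hasLocation": "Nucleus",
})

for triple in binding:
    print(triple)
```

The alternative, keeping the n-ary as a relationship proper, needs a representation language that supports n-ary predicates, which is why this decision is tied to the choice of language in the first step.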

In addition to the presentation of DiDOn, the paper contains a detailed application of it with Pathway Studio as case study.

The neatly formatted paper is behind a paywall for those with no or limited access to Elsevier’s journals, but the accepted manuscript is openly accessible from my home page.

References

[1] Simperl, E., Mochol, M., Bürger, T. Achieving maturity: the state of practice in ontology engineering in 2009. International Journal of Computer Science and Applications, 2010, 7(1):45-65.

[2] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics. In print. DOI: http://dx.doi.org/10.1016/j.jbi.2012.01.004

Progress on the EnvO at the Dagstuhl workshop

Over the course of the 4.5 days packed together in the beautiful and pleasant ambience of Schloss Dagstuhl, the fourth Environment Ontology workshop has been productive, and a properly referenceable paper outlining details and decisions will follow. Here I will limit myself to mentioning some of the outcomes and issues that came up.

Group photo of most of the participants at the EnvO Workshop at Dagstuhl

After presentations by all attendees, a long list of discussion themes was drawn up, which we managed to discuss and agree upon to a large extent. The preliminary notes and keywords are jotted down and put on the EnvO wiki dedicated to the workshop.

Focussing first on the content topics, which took up the lion’s share of the workshop’s time, significant advances have been made in two main areas. First, we have sorted out the Food branch in the ontology, which has been moved as Food product under Environmental material and then Anthropogenic environmental material, and the kind and order of differentia have been settled, using food source and processing method as the major axes. Second, the Biome branch will be refined in two directions, regarding (i) the ecosystems at different scales and the removal of the species-centred notion of habitat to reflect better the notion of environment and (ii) work toward inclusion of the aspect of n-dimensional hypervolume of an environment (both the conditions / parameters / variables and the characterization of a particular type of environment using such conditions, analogous to the hypervolumes of an ecological niche so that EnvO can be used better for annotation and analysis of environmental data). Other content-related topics concerned GPS coordinates, hydrographic features, and the commitment to BFO and the RO for top-level categories and relations. You can browse through the preliminary changes in the envo-edit version of the ontology, which is a working version that changes daily (i.e., not an officially released one).

There was some discussion (insufficient, I think) and there were recurring comments and suggestions on how to represent the knowledge in the ontology and, with that, on the ontology language and modelling guidelines. Some favour bare single-inheritance trees for appealing philosophical motivations. The first problematic case, however, was brought forward by David Mark, who had compelling arguments for multiple inheritance with his example of how to represent Wadi, and soon more followed with terms such as Smoked sausage (having as parents the source and processing method) and many more in the food branch. Some others preferred lattices or a more common knowledge representation language; both are ways to handle more neatly the properties/qualities with respect to the usage of properties and the property inheritance by sub-universals from their parent. Currently, the EnvO is represented in OBO, and modelling the knowledge does not follow the KR approach of declaring properties of some universal (/concept/class) and availing of property inheritance, so that one ends up having to make multiple trees and then add ‘cross-products’ between the trees. Hence, and using intuitive labels merely for human readability here, Smoked sausage either will have two parents, amounting to (in the end, where the branching started) \forall x (SmokedSausage(x) \equiv AnimalFoodProduct(x) \land ProcessingMethod(x)) (which is ontologically incorrect because a smoked sausage is not a way of processing) or, if done with a ‘cross-product’ and a new relation (hasQuality), then the resulting computation will have something like \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land hasQuality(x,y) \land Smoking(y)) instead of having declared directly in the ontology proper, say, \forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land HasProcessingMethod(x,y) \land Smoking(y)).
The latter option has the advantages that it makes it easier to add, say, Fermented smoked sausage or Cooked smoked sausage as a sausage that has the two properties of being [fermented/cooked] and being smoked, and that one can avail of automated reasoners to classify the taxonomy. Either way, the details are being worked on. The ontology language and the choice for one or the other, whichever it may be, ought not to get in the way of developing an ontology, but, generally, it does so, both regarding the underlying commitments that the language adheres to and any implicit or explicit workarounds in the modelling stage that to some extent make up for a language’s limitations.
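To make the point about classification concrete, here is a toy sketch in Python. It is by no means a reasoner: it only handles definitions of the shape ‘a base class plus a set of processing methods’, in the spirit of the HasProcessingMethod formula above, and it computes subsumption structurally by comparing those sets. The method names Fermenting and Cooking are my assumptions; they are not EnvO terms.

```python
# Defined classes: name -> (base class, required processing methods).
definitions = {
    "Smoked sausage": ("Sausage", frozenset({"Smoking"})),
    "Fermented smoked sausage": ("Sausage", frozenset({"Smoking", "Fermenting"})),
    "Cooked smoked sausage": ("Sausage", frozenset({"Smoking", "Cooking"})),
}


def subsumes(general, specific):
    """True if every instance of `specific` must also be a `general`."""
    g_base, g_methods = definitions[general]
    s_base, s_methods = definitions[specific]
    return g_base == s_base and g_methods <= s_methods


def inferred_parents(name):
    """Defined classes that subsume `name`, computed rather than asserted."""
    return sorted(other for other in definitions
                  if other != name and subsumes(other, name))


for name in definitions:
    print(name, "->", inferred_parents(name))
```

Nobody had to assert that a Fermented smoked sausage is a Smoked sausage: it follows from the declared properties, which is exactly what gets lost when the hierarchy is maintained by hand as multiple trees.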

On a lighter note, we had an excursion to Trier together with the cognitive robotics people (from a parallel seminar at Dagstuhl) on Wednesday afternoon. Starting from the UNESCO’s world heritage monument Porta Nigra and the nearby birthplace of Karl Marx, we had a guided tour through the city centre with its mixture of architectural styles and rich history, which was even more pleasant with the spring-like weather. Afterwards, we went to relax at the wine tasting event at a nearby winery, where the owners provided information about the 6 different Rieslings we tried.

Extension to the Aula Palatina (Constantine's Basilica) in Trier

Section of the Porta Nigra, Trier

Ontologies in ecology: putting the lessons-learned to good use and moving forward

While most of the headlines and attention in bio-ontologies have gone to the Gene Ontology, later also the FMA, and, most recently, the set of ontologies within or close to the OBO Foundry project, activity has been comparatively modest in the area of ontologies for ecology. This is set to change.

Madin et al [1] published a review article last month in Trends in Ecology and Evolution that is not only about the state of the art on existing ontologies for ecology, but also an Ode to the development and use of ontologies. The latter is not framed in a bright-vision-follow-me way, but by noting, among others, the problems of

terminological ambiguity [that] slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science

and showing clear examples of the kinds of problems ontologies can help to solve.

Recollecting the OWLED’07 industry panel discussion last year, it seemed as if industry was at the point where bio-ontologies were 5-8 years ago and, moreover, about to reinvent the wheel. Not so with ontologies for ecology. Madin et al have separate information boxes about “building consistent ontologies” explaining the difference between is-a and instance-of, is-a and part-of, and is-a and constitution: those things that early adopters learned the hard way a few years ago are presented as a known basic starting point. Likewise for the info-box on “What is an ontology?” and the straight adoption of OWL and the benefits of automated reasoners. In the overview presented by Madin et al, there are no issues to resolve on trying to be backward compatible with the obo format; instead, they go straight to the W3C standardized formal ontology representation languages for the ontologies for ecology. The same holds for box 2 on finding data (which is also a nice scenario for the OBDA Plugin and DIG-Mastro), OntoClean, foundational ontologies and domain ontologies versus other artifacts with terms, linking of ontologies, and a clear table with task-description-requirements (table 1) that invariably asks for good ontologies.

Aside from the analysis of benefits and usages, the concluding remarks section notes that

[t]hus, the adoption of ontologies is hindered both by the familiarity of current practices and the lack of tools to readily migrate to improved practices.

Point taken.

And last, but not least,

Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology, and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms. By clarifying the terms of scientific discourse, and annotating data and analyses with those terms, well defined, community-sanctioned, formal ontologies based on open standards will provide a much-needed foundation upon which to tackle crucial ecological research while taking full advantage of the growing repositories of data on the Internet.

[1] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168. doi:10.1016/j.tree.2007.11.007

A new plant family: the Simulacraceae

May I recommend for the Friday afternoon/weekend reading: an article by Bletter, Reynertson, and Velazquez Runk in the journal Ethnobotany Research & Applications (vol. 5, 2007) on “The taxonomy, ecology, and ethnobotany of the Simulacraceae”, which has about 80 species divided into 17 genera, such as Plasticus, Textileria, and Papyroidia. Moreover,

This family is more than a botanical curiosity. It is a scientific conundrum, as the taxa:

  1. lack genetic material,
  2. appear virtually immortal and
  3. have the ability to form intergeneric crosses with ease, despite the lack of any evident mechanism for cross-fertilization.

In this study, conducted over approximately six years, we elucidate the first full description and review of this fascinating taxon, heretofore named Simulacraceae.

To summarize, also in the words of the authors,

The economics, distribution, ecology, taxonomy, paleoethnobotany, and phakochemistry of this widespread family are herein presented. We have recently made great strides in circumscribing this group, and collections indicate this cosmopolitan family has a varied ecology. … Despite being genomically challenged plants, an initial phylogeny is proposed. In an early attempt to determine the ecological relations of this family, a twenty-meter transect has been inventoried from a Plasticus rain forest in Nyack, New York, yielding 49 new species and the first species-area curve for this family.

The Simulacraceae collections—based on the principal method of “opportunistic sampling”—are deposited in the herbarium of the Foundation for Artificial Knowledge Education. Some of the open problems yet to investigate include simulacrapaleoethnobotany and simulacrapolitical ecology, and from an engineering perspective, the design of a Traditional Simulacraceae Knowledge/Teleological Simulation Knowledge base (dubbed acronym “TSK,TSK”, which would compete well with the yearly naming game for the NAR January database issue).

A short html version of the article is available online in the Jan/Feb issue of AIR, but also the full pdf file (about 6MB) in the Uni of Hawaii database with more information and colourful photos (openly accessible, of course). Enjoy!

On the (un)reasonable effectiveness of mathematics in ecology

An article appeared last week in Ecological Modelling that has the intention to be thought-provoking; it looks at the effectiveness of mathematics in ecological theory [1], but it can just as well be applied to bioinformatics, computational biology, and bio-ontologies. In short: mathematical models are useful only if they are neither so general as to be trivially true nor so specific as to be applicable to one data set only. But how to go about finding the middle way? Ginzburg et al. fail to clearly answer this question, but there are some pointers worth mentioning. In the words of the authors (bold face my emphasis):

A good theory is focused without being blurred by extraneous detail or overgenerality. Yet ecological theories frequently fail to achieve this desirable middle ground. Here, we review the reasons for the mismatch between what theorists seek to achieve and what they actually accomplish. In doing so, we argue on pragmatic grounds against mathematical literalism as an appropriate constraint to mathematical constructions: such literalism would allow mathematics to constrain biology when the biology ought to be constraining mathematics. We also suggest a method for differentiating theories with the potential to be “unreasonably effective” from those that are simply overgeneral. Simple axiomatic assumptions about an ecological system should lead to theoretical predictions that can then be compared with existing data. If the theory is so general that data cannot be used to test it, the theory must be made more specific.

What then about this pragmatism and mathematical literalism? The pragmatism sums up as “theories never work perfectly” anyway and, well, reality is surpassing us given that “we face an ever-increasing number of ecological crises, social demand will be for crude, imperfect descriptions of ecological phenomena now rather than more detailed, complex understanding later” (as an aside and to rub it in: the latter is a different argumentation for pragmatism than the ‘I need a program from you today in order to analyse my lab data so that I can submit the article tomorrow and beat the competition’). The former I concur with; the latter, on preferring imperfection over more thought-through theories, is a judgment call and I leave that for what it is.
Mathematical literalism roughly means strict adherence to some limited mathematical model for its mathematical characteristics and limitations. For instance, in several ecological models (and elsewhere) processes are interpreted as strictly instantaneous (the “mechanistic” models), whereas those models that do not are mocked as “phenomenological”. But, so the authors argue, we should not fit nature to match the maths, but use mathematics to describe nature. Now this likely does ring a bell or two with developers of formal (logic-based) bio-ontologies: describe your bio stuff with the constructs that OWL gives you! And not, though it probably should be: which formal language (i.e., which constructs) do I actually need to describe my subject domain? (Some follow-up questions on the latter are: if you can’t represent it, what exactly is it that you can’t represent? Do you really need it for the task at hand? Can you represent it in another [logical/conceptual modeling] language?)

It is not this black-and-white, however. As Ginzburg et al. mention a couple of times in the article (kicking in an open door), trying to make a mathematical model of the biological theory greatly helps to be more precise about the underlying assumptions and to make those explicit. This, in turn, aids making predictions based on those assumptions & theory, which subsequently should be tested against real data; if you can’t test it against data, then the theory is no good. This is a bit harsh because it may be that for some practical reasons something cannot be tested, but on the other hand, if that is the case, one may want to think twice about the usefulness of the theory.
Last, “The most useful theories emphasize explanation over description and incorporate a “limit myth” (i.e., they describe a pure situation without extraneous factors, as with the assumption in physics that surfaces are frictionless).” While it is true that one seeks explanations, this conveniently brushes over the fact that first one has to have a way to describe things in order to incorporate them in an explanatory theory! If the theory fails, then thanks to a structured approach for the descriptions (say, some formal language or [annotated] mathematical equations) it will be easier to fiddle with the theory and reuse parts of it to come up with a new one. If the theory succeeds, it will be easier to link it up to another properly described and annotated theory to make more complex explanatory models.
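
To make the too-general versus too-specific distinction a bit more tangible, here is a toy Python sketch. It is my own illustration, not the method from Ginzburg et al.: a “theory” is reduced to a yes/no check on a data set, and the data sets, the carrying capacity of 100, and the function names are all made-up assumptions.

```python
# Hypothetical population time series (made-up numbers).
OBSERVED = [[10, 20, 40], [5, 50, 95], [30, 60, 90]]
# Conceivable data: the observed sets plus one that could have occurred but did not.
CONCEIVABLE = OBSERVED + [[10, 200, 400]]

def anything_goes(data):
    """'Populations change or stay the same': accepts every data set."""
    return True

def exact_fit(data):
    """Tuned to reproduce one particular data set and nothing else."""
    return list(data) == [10, 20, 40]

def bounded_growth(data):
    """Population never exceeds an assumed carrying capacity of 100."""
    return all(x <= 100 for x in data)

def classify(theory, conceivable=CONCEIVABLE, observed=OBSERVED):
    """Label a theory by which data sets it accepts."""
    if all(theory(d) for d in conceivable):
        return "too general"      # no conceivable data could refute it
    if sum(theory(d) for d in observed) <= 1:
        return "too specific"     # fits at most one observed data set
    return "middle ground"        # refutable, yet covers several data sets

for theory in (anything_goes, exact_fit, bounded_growth):
    print(theory.__name__, "->", classify(theory))
```

Real theories obviously do not reduce to a single boolean check, but the sketch shows why a theory that no data set could contradict tells you nothing, and why one that matches a single data set explains nothing beyond it.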

Overall, the content of the article is a bit premature and would have benefited from a thorough analysis of the too-general and too-specific theories rather than anecdotal evidence with a couple of examples. Also, the “method for differentiating theories” advertised in the abstract is buried somewhere in the text, so some sort of bullet-pointed checklist for assessing one’s own pet theory on too-general/specific would have been useful. Despite this, it is good material to feed a confirmation bias for being against too much and too strict adherence to mathematics… as well as against no mathematics.

[1] Lev R. Ginzburg, Christopher X.J. Jensen and Jeffrey V. Yule. (2007). Aiming the “unreasonable effectiveness of mathematics” at ecological theory. Ecological Modelling, 207(2-4): 356-362. doi:10.1016/j.ecolmodel.2007.05.015

Granularity and no emergence in biology

This time a post that bears some distant relation to my thesis topic: granularity. About 1.5 years ago I got concerned that emergence, emergent properties, and emergent behaviour would complicate developing a formal theory of granularity, so I read up on the topic. While writing along the overview and analyzing both the philosophical aspects and proposed examples of emergence in biology, I came to the realization that it doesn’t complicate granularity, but on the contrary: that granularity actually serves as a useful methodology to investigate (hypothesized) emergence, in particular because of the modeling advantages and prospects for structured in silico simulations.

This is very nice for my granularity, but it takes some 20-odd pages to support a useful application area of granularity that is not the focus area of the applications (it wanders off too far from the narrative), and thus takes up too much space in the thesis. So, I’m phasing it out. The problem is that I don’t know of any outlet where a cocktail of bio, IT, and philosophy would be publishable: specialists of each discipline wouldn’t be too happy reading too much about the other two fields and can dismiss it for not being detailed enough for their own field, even though the idea of combining granularity & (hypothesized) emergence may have some novelty to it. Interdisciplinarity has its drawbacks.

Things being as they are, I’m putting the pdf online after the printout had been gathering dust for some 1.5 years, for there might just be an interested reader out there. Comments are welcome of course!

Topics covered in the manuscript are:
1 Introduction
2 Renewed claims of emergence in biology
3 Emergence from a philosophical perspective
3.1 Epistemological emergence
3.2 Ontological emergence
3.3 Strong emergence
3.4 Weak emergence
3.4.1 Simulations
3.5 Examples
3.5.1 Example 1: pseudoplasmodium formation by cellular slime moulds
3.5.2 Example 2: horizontal gene transfer with metagenomics
4 Emergence and levels of granularity
4.1 Preliminaries of granularity
4.2 The irreducibility argument
4.3 Non-predictability and non-derivability
4.4 Characterisation of granular level from the viewpoint of emergence
5 Concluding remarks

The abstract of “Granularity as a modelling approach to investigate hypothesized emergence in biology” is as follows.

Abstract. Informal usage of emergence in biological discourse tends towards being of the epistemic type, but not ontological emergence, primarily due to our lack of knowledge about nature and limitations to how to model it. Philosophy adds clarification to better characterise the fuzzy notion of emergence in biology, but paradoxically it is the methodology of conducting scientific experiments that can give decisive answers. A renewed interest in whole-ism in (molecular) biology and simulations of complex systems does not imply emergent properties exist, but illustrates the realisation that things are more difficult and complex than initially anticipated. Usage of (weak- and epistemological) emergence in bioscience is a shorthand for ‘we have a gap in our knowledge about the precise relation(s) between the whole and its parts and possibly missing something about the parts themselves as well’, which amounts to absence of emergence in the philosophical sense. Given that the existence of emergent properties is not undisputed, we need better methodologies to investigate such claims. Granularity serves as one of these approaches to investigate postulated emergent properties. Specification of levels of granularity and their contents can provide a methodological modelling framework to enable structured examination of emergence from both a formal ontological modelling approach and the computational angle, and helps elucidating the required level of granularity to explain away emergence. I discuss some modelling considerations for a granularity framework and its relevance for the testability of emergence in computational implementations such as simulations.

Metagenomics, or: more problems to solve by bioinformaticians!

Nature Reviews Microbiology had their special issue on metagenomics in 2005, and the closely related topic of horizontal gene transfer shortly afterward; now it is PLoS Biology’s turn with several articles on advances in studying microbial communities in the ocean as part of their Oceanic Metagenomics collection. Not that, in theory, metagenomics is limited to microbes, but that’s where the research focus is now (e.g. [1][2][3]), because scaling up genomics research isn’t easy or cheap – and think of all the data that needs to be stored, processed, and analysed.

For the non-biologist reader in 3 sentences (or synopsis [4]): metagenomics, or ‘high-throughput molecular ecology’ (also called community genomics, ecogenomics, environmental genomics, or population genomics) combines molecular biology with ecosystems. It reveals community and population-specific metabolisms with the interdependent biological behaviour of organisms in nature that is affected by their micro-climate. Take a handful of soil (ocean water, mud, …) and figure out which microorganisms live there, who’s active (and what are they doing?), who’s dormant, what are the ratios of the population sizes of the different types of microorganisms, what does a microbial community ‘look’ like, etc.?

For the data-enthusiast: all those individual microorganisms need to have their DNA and RNA sequenced, where, of course, the results go into databases. And then the analysis: putting back together the pieces from shotgun sequencing, comparing DNA with DNA, rRNA with rRNA, with each other, how to do the binning and so forth [5]. Naively: more and faster algorithms wouldn’t hurt; how can you visualize a community of microorganisms on your screen, and make simulations of those bacterial communities?
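
To give a flavour of what “binning” means, here is a deliberately naive Python sketch that groups sequence fragments by their GC content, one of the simplest composition signals. This is a toy under my own assumptions (the function names, the ten equal-width bins, and the example reads are made up); actual binning tools use richer signals and statistics than this.

```python
from collections import defaultdict

def gc_count(seq):
    """Number of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")

def bin_by_gc(reads, n_bins=10):
    """Group reads into `n_bins` equal-width GC-content bins.

    Bin 0 holds reads with the lowest GC fraction, bin n_bins - 1 the
    highest. Integer arithmetic avoids floating-point edge cases at the
    bin boundaries. Empty reads are skipped.
    """
    bins = defaultdict(list)
    for read in reads:
        if not read:
            continue
        index = min(gc_count(read) * n_bins // len(read), n_bins - 1)
        bins[index].append(read)
    return dict(bins)

# Made-up example fragments, far shorter than real reads.
reads = ["ATAT", "GCGC", "ATGC", "TTAA"]
print(bin_by_gc(reads))
```

Fragments from the same organism tend to have similar base composition, which is why even such a crude grouping can be a first step toward sorting a mixed sample; it falls apart quickly when organisms with similar GC content share the sample.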

And then, somewhere in this whole endeavor, bio-ontologists should be able to find their place, to help out (and figure out) how to best represent all the new information in a usable and reusable way. Because metagenomics is a hot topic with much research and novel results, ontology maintenance (tracking changes etc) will then likely be more important than the attention it receives in ODEs at present, as well as reasoning over ontologies and massive amounts of data. Ouch. Some work has been and is being done on these topics (e.g. [6] [7]), and more can/will/does/should follow.

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.
[2] Lorenz, P., Eck, J. Metagenomics and industrial applications. Nature Reviews Microbiology, 2005, 3:510-516.
[3] Schleper, C., Jurgens, G., Jonuscheit, M. Genomic studies of uncultivated Archaea. Nature Reviews Microbiology, 2005, 3:479-488.
[4] Gross, L. Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity. PLoS Biology, 2007, 5(3): e85.
[5] Eisen, J.A. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biology, 2007, 5(3): e82.
[6] Klein, M. and Noy, N.F. (2003). A Component-Based Framework for Ontology Evolution. Workshop on Ontologies and Distributed Systems at IJCAI-2003, Acapulco, Mexico.
[7] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R. Linking data to ontologies: The description logic DL-Lite A. Proc. of the 2nd Workshop on OWL: Experiences and Directions (OWLED 2006), 2006.