## Archive for the ‘Bioinformatics’ Category

### The DiDOn method to develop bio-ontologies from semi-structured life science diagrams

It is well-known among (bio-)ontology developers that ontology development is a resource-consuming task (see [1] for data backing up this claim). Several approaches and tools do exist that speed up the time-consuming efforts of bottom-up ontology development, most notably natural language processing and database reverse engineering. These are generic technologies, proposed from a computing angle, and are therefore noisy and/or rely on many heuristics to make them fit for bio-ontology development. Yet the most obvious source from a domain expert's perspective remains unexplored: the abundant diagrams in the sciences that function as existing, 'legacy' knowledge representations of the subject domain. So, how can one use them to develop domain ontologies?

The new DiDOn procedure—from Diagram to Domain Ontology—can speed up and simplify bio-ontology development by exploiting the knowledge represented in such semi-structured bio-diagrams. It does this by means of extracting explicit and implicit knowledge, preserving most of the subject domain semantics, and making formalisation decisions explicit, so that the process is done in a clear, traceable, and reproducible way.

DiDOn is a detailed, micro-level procedure to formalise those diagrams in a logic of choice; it provides migration paths into OBO, SKOS, OWL, and arbitrary FOL, and guidelines on which axioms have to be added to the bio-ontology, and how. It also uses a foundational ontology so as to obtain more precise and interoperable subject domain semantics than would otherwise have been possible with syntactic transformations alone. (Choosing an appropriate foundational ontology is a separate topic and can be done with, e.g., ONSET.)

The paper describing the rationale and details, Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn [2], has just been accepted by the Journal of Biomedical Informatics. They require a graphical abstract, so here it is:

DiDOn consists of two principal steps: (1) formalising the ‘icon vocabulary’ of a bio-drawing tool, which then functions as a seed ontology, and (2) populating the seed ontology by processing the actual diagrams. The algorithm in the second step is informed by the formalisation decisions taken in the first step. Such decisions include, among others, the representation language and how to represent the diagram’s n-aries (with n≥2), such as choosing between keeping an n-ary as a relationship or reifying it as a class.
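To give a flavour of these two steps, here is a toy sketch in Python (the class names, icons, and labels are invented for illustration; they are not DiDOn's actual vocabulary or algorithm):

```python
# Hypothetical sketch of DiDOn's two steps (all names invented for illustration).

# Step 1: formalise the drawing tool's icon vocabulary as a seed ontology.
# Each icon becomes a class; formalisation decisions (e.g., how to handle
# n-aries, which language to target) are recorded explicitly, which keeps
# the process traceable and reproducible.
seed_ontology = {
    "classes": {"Protein": "Icon:circle", "Regulation": "Icon:arrow"},
    "decisions": {"n-aries": "reify as classes", "language": "OWL 2"},
}

# Step 2: populate the seed ontology by processing the actual diagrams.
# Each diagram element is an (icon, label) pair; the icon determines the
# class under which the labelled entity is added.
def populate(seed, diagram_elements):
    population = {cls: [] for cls in seed["classes"]}
    icon_to_class = {icon: cls for cls, icon in seed["classes"].items()}
    for icon, label in diagram_elements:
        cls = icon_to_class.get(icon)
        if cls is not None:
            population[cls].append(label)
    return population

diagram = [("Icon:circle", "p53"), ("Icon:arrow", "p53-regulates-MDM2")]
print(populate(seed_ontology, diagram))
```

The point of the sketch is only the division of labour: the decisions live in the seed ontology from step 1, and step 2 is a mechanical pass over the diagrams guided by them.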

In addition to the presentation of DiDOn, the paper contains a detailed application of it with Pathway Studio as case study.

The neatly formatted paper is behind a paywall for those with no or limited access to Elsevier’s journals, but the accepted manuscript is openly accessible from my home page.

References

[1] Simperl, E., Mochol, M., Bürger, T. Achieving maturity: the state of practice in ontology engineering in 2009. International Journal of Computer Science and Applications, 2010, 7(1):45-65.

[2] Keet, C.M. Transforming semi-structured life science diagrams into meaningful domain ontologies with DiDOn. Journal of Biomedical Informatics. In print. DOI: http://dx.doi.org/10.1016/j.jbi.2012.01.004

### Book chapter on conceptual data modelling for biological data

My invited book chapter, entitled “Ontology-driven formal conceptual data modeling for biological data analysis” [1], recently got accepted for publication in the Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data, edited by Mourad Elloumi and Albert Y. Zomaya, and is scheduled for printing by Wiley early 2012.

All this started off with my BSc(Hons) in IT & Computing thesis back in 2003 and my first paper about the trials and tribulations of conceptual data modelling for bio-databases [2] (which is definitely not well-written, but has some valid points and has been cited a bit). In the meantime, much progress has been made on the topic, and I’ve learned, researched, and published a few things about it, too. So, what is the chapter about?

The main aspect is the ‘conceptual data modelling’ with EER, ORM, and UML Class Diagrams, i.e., concerning implementation-independent representations of the data to be managed for a specific application (hence, not ontologies for application-independence).

The adjective ‘formal’ points out that the conceptual modeling is not just about drawing boxes, roundtangles, and lines with some adornments; there is a formal, logic-based foundation. This is achieved with the formally defined CMcom conceptual data modeling language, which is the greatest common denominator of ORM, EER, and UML Class Diagrams. CMcom has, on the one hand, a mapping to the Description Logic language DLRifd and, on the other hand, mappings to the icons in the diagrammatic languages. The nice aspect of this is that, at least in theory and to some extent in practice as well, one can subject it to automated reasoning to check consistency of the classes and of the whole conceptual data model, and derive implicit constraints (an example), or use it in ontology-based data access (an example and some slides on ‘COMODA’ [COnceptual MOdel-based Data Access], tailored to ORM and the horizontal gene transfer database as example).

Then there is the ‘ontology-driven’ component: Ontology and ontologies can aid conceptual data modeling by providing solutions to recurring modeling problems; an ontology can be used to generate several conceptual data models; and one can integrate (a section of) an ontology into a conceptual data model that is subsequently converted into data in database tables.

Last, but not least, it focuses on ‘biological data analysis’. Someone who is neither a biologist nor a bioinformatician might be inclined to say that this should not matter, but it does. Biological information is not as trivial as the typical database design toy examples like “Student is enrolled in Course”; one has to dig deeper and figure out how to represent, e.g., catalysis, pathway information, or an ecological niche. Moreover, it requires an answer to the question which language features are ‘essential’ for the conceptual data modeling language, and, if a feature is not included yet, how to get it in. Some such important features are n-aries (n>2) and the temporal dimension. The paper includes a proposal for more precisely representing catalysis, informed by ontology (mainly thanks to making the distinction between the role and its bearer), and shows how certain temporal information can be captured, which is illustrated by enhancing the model for SARS viral infection, among other examples.
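To illustrate what the role/bearer distinction buys you for catalysis, here is a minimal, hypothetical sketch (the names are mine, not the chapter's actual model):

```python
# Minimal sketch of the role/bearer distinction for catalysis
# (hypothetical names; not the chapter's actual model).
from dataclasses import dataclass

@dataclass
class Molecule:          # the bearer: an independent entity
    name: str

@dataclass
class CatalystRole:      # the role: existentially dependent on its bearer
    bearer: Molecule
    reaction: str

# The same molecule can bear the catalyst role in one reaction and not in
# another; this flexibility is lost if 'Catalyst' were modelled simply as a
# subclass of Molecule.
hexokinase = Molecule("hexokinase")
role = CatalystRole(bearer=hexokinase, reaction="glucose phosphorylation")
print(role.bearer.name)
```

The design choice is that being a catalyst is something a molecule does in the context of a reaction, not something it is unconditionally.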

The paper is not online yet, but I did put together some slides for the presentation at MAIS’11 reported on earlier, which might serve as a sneak preview of the 25-page book chapter, or you can contact me for the CRC.

References

[1] Keet, C.M. Ontology-driven formal conceptual data modeling for biological data analysis. In: Biological Knowledge Discovery Handbook: Preprocessing, Mining and Postprocessing of Biological Data. Mourad Elloumi and Albert Y. Zomaya (Eds.). Wiley (in print).

[2] Keet, C.M. Biological data and conceptual modelling methods. Journal of Conceptual Modeling, Issue 29, October 2003. http://www.inconcept.com/jcm.

### Progress on the EnvO at the Dagstuhl workshop

Over the course of the 4.5 days packed together in the beautiful and pleasant ambience of Schloss Dagstuhl, the fourth Environment Ontology workshop has been productive, and a properly referenceable paper outlining details and decisions will follow. Here I will limit myself to mentioning some of the outcomes and issues that came up.

Group photo of most of the participants at the EnvO Workshop at Dagstuhl

After presentations by all attendees, a long list of discussion themes was drawn up, which we managed to discuss and agree upon to a large extent. The preliminary notes and keywords are jotted down and put on the EnvO wiki dedicated to the workshop.

Focussing first on the content topics, which took up the lion’s share of the workshop’s time, significant advances have been made in two main areas. First, we have sorted out the Food branch in the ontology, which has been moved as Food product under Environmental material and then Anthropogenic environmental material, and the kind and order of differentia have been settled, using food source and processing method as the major axes. Second, the Biome branch will be refined in two directions, regarding (i) the ecosystems at different scales and the removal of the species-centred notion of habitat to reflect better the notion of environment and (ii) work toward inclusion of the aspect of n-dimensional hypervolume of an environment (both the conditions / parameters / variables and the characterization of a particular type of environment using such conditions, analogous to the hypervolumes of an ecological niche so that EnvO can be used better for annotation and analysis of environmental data). Other content-related topics concerned GPS coordinates, hydrographic features, and the commitment to BFO and the RO for top-level categories and relations. You can browse through the preliminary changes in the envo-edit version of the ontology, which is a working version that changes daily (i.e., not an officially released one).

There was some discussion—insufficient, I think—and recurring comments and suggestions on how to represent the knowledge in the ontology and, with that, on the ontology language and modelling guidelines. Some favour bare single-inheritance trees for appealing philosophical motivations. The first problematic case, however, was brought forward by David Mark, who had compelling arguments for multiple inheritance with his example of how to represent Wadi, and soon more followed with terms such as Smoked sausage (having as parents the source and the processing method) and many more in the food branch. Some others preferred lattices or a more common knowledge representation language—both are ways to handle more neatly the properties/qualities, with respect to the usage of properties and their inheritance by sub-universals from their parents. Currently, the EnvO is represented in OBO, and modelling the knowledge does not follow the KR approach of declaring properties of some universal (/concept/class) and availing of property inheritance, so one ends up having to make multiple trees and then add ‘cross-products’ between them. Hence, and using intuitive labels merely for human readability here, Smoked sausage either will have two parents, amounting to—in the end, where the branching started—$\forall x (SmokedSausage(x) \equiv AnimalFoodProduct(x) \land ProcessingMethod(x))$ (which is ontologically incorrect, because a smoked sausage is not a way of processing), or, if done with a ‘cross-product’ and a new relation ($hasQuality$), the resulting computation will yield something like $\forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land hasQuality(x,y) \land Smoking(y))$, instead of having declared directly in the ontology proper, say, $\forall x \exists y (SmokedSausage(x) \equiv Sausage(x) \land HasProcessingMethod(x,y) \land Smoking(y))$.
The latter option has the advantages that it makes it easier to add, say, Fermented smoked sausage or Cooked smoked sausage as a sausage that has the two properties of being fermented (or cooked) and being smoked, and that one can avail of automated reasoners to classify the taxonomy. Either way, the details are being worked on. The ontology language and the choice for one or the other approach—whichever it may be—ought not to get in the way of developing an ontology but, generally, it does, both regarding the underlying commitments the language adheres to and the implicit or explicit workarounds in the modelling stage that to some extent make up for a language’s limitations.
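To see why the property-based option classifies more easily, consider a toy structural subsumption check (a much-simplified sketch of what an automated reasoner does; the definitions are mine, not EnvO's):

```python
# Toy structural subsumption: each class is defined by a set of
# (property, filler) pairs, and a class subsumes another if its pairs are a
# subset of the other's. (Illustrative only; real reasoners for OWL and
# similar languages are far more involved.)
definitions = {
    "SmokedSausage": {("isA", "Sausage"),
                      ("hasProcessingMethod", "Smoking")},
    "FermentedSmokedSausage": {("isA", "Sausage"),
                               ("hasProcessingMethod", "Smoking"),
                               ("hasProcessingMethod", "Fermenting")},
}

def subsumes(sup, sub):
    # sup subsumes sub iff every defining pair of sup also holds of sub
    return definitions[sup] <= definitions[sub]

print(subsumes("SmokedSausage", "FermentedSmokedSausage"))  # True
print(subsumes("FermentedSmokedSausage", "SmokedSausage"))  # False
```

With the processing methods declared as properties, Fermented smoked sausage falls under Smoked sausage automatically; with bare trees and cross-products, that placement has to be maintained by hand.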

On a lighter note, we had an excursion to Trier together with the cognitive robotics people (from a parallel seminar at Dagstuhl) on Wednesday afternoon. Starting from the UNESCO’s world heritage monument Porta Nigra and the nearby birthplace of Karl Marx, we had a guided tour through the city centre with its mixture of architectural styles and rich history, which was even more pleasant with the spring-like weather. Afterwards, we went to relax at the wine tasting event at a nearby winery, where the owners provided information about the 6 different Rieslings we tried.

Extension to the Aula Palatina (Constantine's Basilica) in Trier

Section of the Porta Nigra, Trier

### 72010 SemWebTech lecture 8: SWT for HCLS background and data integration

After the ontology languages and general aspects of ontology engineering, we now delve into one specific application area: SWT for health care and life sciences. Its frontrunners in bioinformatics were adopters of some of the Semantic Web ideas even before Berners-Lee, Hendler, and Lassila wrote their Scientific American paper in 2001, even though they did not formulate their needs and intentions in the same terminology: they did want shared, controlled vocabularies with the same syntax, to facilitate data integration—or at least interoperability—across Web-accessible databases; a common space for identifiers; a dynamic, changing system; the ability to organize and query incomplete biological knowledge; and, albeit not stated explicitly, it all still needed to be highly scalable [1].

Bioinformaticians and domain experts in genomics had already organized themselves in the Gene Ontology Consortium, which was set up officially in 1998 to realize a solution for these requirements. The results exceeded anyone’s expectations, for a range of reasons. Many tools for the Gene Ontology (GO) and its common KR format, .obo, have been developed, and other research groups adopted the approach to develop controlled vocabularies, either by extending the GO, e.g., with rice traits, or by adding their own subject domain, such as zebrafish anatomy and mouse developmental stages. This proliferation, as well as the OWL development and standardization process that was going on at about the same time, pushed the goal posts further: new expectations were put on the GO and its siblings and on their tools, and the proliferation had become a bit too unwieldy to keep a good overview of what was going on and how those ontologies would be put together. Put differently, some people noticed the inferencing possibilities that can be obtained from moving from OBO to OWL, and others thought that some coordination among all those OBO bio-ontologies would be advantageous, given that post-hoc integration of ontologies of related and overlapping subject domains is not easy. Thus came into being the OBO Foundry to solve such issues, proposing a methodology for coordinated evolution of ontologies to support biomedical data integration [2].

People in related disciplines, such as ecology, have taken on board the experiences of these very early adopters, and instead decided to jump on board after the OWL standardization. They, however, were not only motivated by data(base) integration. Referring to Madin et al.’s paper [3] again, I highlight three points they made: “terminological ambiguity slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science”, i.e., identification of some serious problems they have in ecological research; “Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology”, i.e., what they expect that ontologies will solve for them (disambiguation); and “and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms”, i.e., for what purpose they want to use ontologies and any associated computation (discovery). That is, ontologies not as one of many possible tools in the engineering/infrastructure sense, but as a required part of a method in the scientific investigation that aims to discover new information and knowledge about nature (i.e., in answering the who, what, where, when, and how things are the way they are in nature).

What has all this to do with actual Semantic Web technologies? On the one hand, there are multiple data integration approaches and tools that have been, and are being, tried out by the domain experts, bioinformaticians, and interdisciplinary-minded computer scientists [4], and, on the other hand, there are the W3C Semantic Web standards XML, RDF(S), SPARQL, and OWL. Some use these standards to achieve data integration, some do not. Since this is a Semantic Web course, we shall take a look at two efforts that (try to) do, which came forth from the activities of the W3C’s Health Care and Life Sciences Interest Group. More precisely, we take a closer look at a paper written about 3 years ago [5] that reports on a case study to try to get those Semantic Web technologies to work for them in order to achieve data integration and a range of other things. There is also a more recent paper from the HCLS IG [6], where they aimed at not only linking of data but also querying of distributed data, using a mixture of RDF triple stores and SKOS. Both papers reveal their understanding of the purposes of SWT and, moreover, what their goals are, their experimentation with various technologies to achieve them, and where there is still some work to do. There are notable achievements described in these, and related, papers, but the sought-after “killer app” is yet to be announced.

The lecture will cover a ‘historical’ overview and what more recent ontology-adopters focus on, the very basics of data integration approaches that motivated the development of ontologies, and we shall analyse some technological issues and challenges mentioned in [5] concerning Semantic Web (or not) technologies.

References:

[1] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, May 2000;25(1):25-9.

[2] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L. Whetzel, Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251-1255 (2007).

[3] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. (2008). Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168.

[4] Erhard Rahm. Data Integration in Bioinformatics and Life Sciences. EDBT Summer School, Bolzano, Sep. 2007.

[5] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Elizabeth Wu, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web, BMC Bioinformatics, 8, 2007.

[6] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S10

Note: references 1, 2, and (5 or 6) are mandatory reading, and 3 and 4 are recommended to read.

Lecture notes: lecture 8 – SWLS background and data integration

Course website

### Computational, and other, problems for genomics of emerging infectious diseases

PLoS has published a cross-journal special collection on Genomics of Emerging Infectious Diseases last week. Perhaps unsurprisingly, I had a look at the article about limitations of and challenges for the computational resources by Berglund, Nystedt, and Andersson [1]. It reads quite like the one about computational problems for metagenomics I wrote about earlier (here and here), but they have a somewhat curious request in the closing section of the paper.

Concerning the overall contents of the paper and its similarity with the computational aspects of metagenomics: the computational aspects of complete genome assembly are still not fully sorted out, in particular the need “for better ways to integrate data from diverse sources, including shotgun sequencing, paired-end sequencing, [and more]…” and a quality scoring standard. The other major recurring topic is the hard work of annotation to give meaning to the data. Then there are the requests for better, and 3D, visualizations of the data and for cross-granular analysis of data along the omics trail and at different scales.

Limitations and challenges that are more specific to this subject domain are the classification of and risk assessments for the emergence of novel infectious strains, and risk prediction software for disease outbreaks. In addition, they put higher importance on the request for supporting tools to figure out the evolutionary aspects of the sequences and how the pieces of DNA have recombined, including how and from where they have been horizontally transferred.

In the closing section, the authors reiterate that

To achieve these goals, investments in user-friendly software and improved visualization tools, along with excellent expertise in computational biology, will be of utmost importance.

I fancy the thought that our WONDER system for the HGT-DB, in particular its graphical querying, meets the first requirement—at least as proof of concept that one can construct an arbitrary query graphically using an ontology, and all that in a web browser. Having said that, I am also aware of the authors’ complaint that

Currently, the slow transition from a scientific in-house program to the distribution of a stable and efficient software package is a major bottleneck in scientific knowledge sharing, preventing efficient progress in all areas of computational biology. Efforts to design, share, and improve software must receive increased funding, practical support, and, not the least, scientific impact.

Yes, like most bioinformatics tools and proof-of-concept and prototype software that come from academia, it is not industry-grade software. To get the latter, companies have to do some more ‘shopping around’ and invest in it, i.e., monitor the latest engineering developments that demonstrate working theory presented at conferences, take up the ones they deem interesting, and transform them into stable tools. We—be it here at FUB or at almost any other university—do not have an army of experienced programmers, not only because we do not have the financial resources to pay programmers (cf. researchers, with whom more scientific brownie points can be scored) but, moreover, because a computing department is not a cheap software house. The authors’ demand for more funding for software development to program cuter and more stable software would kill computing and engineering research at the faculty if the extra funding were not real extra funding on top of existing budgets. The reality these days is that many universities face cuts in funding. Go figure where that leaves the authors’ request. The complaint might have been more appropriate and effective had the authors voiced it in an industry journal.

The last part of the quote, receiving increased scientific impact, seems to me a difficult one. Descriptions of proof-of-concept and prototype software to experimentally validate the implementability of a theory can find a home in a scientific publication outlet, but a paper that tells the reader the authors have made a tool more stable is not reporting on research results: it does not bring us any new knowledge, does not answer a research question, does not solve a hitherto unsolved problem, does not confirm or refute a hypothesis. Why should—“must” in the authors’ words—improved, more usable, and more stable software receive scientific impact? Stable and freely available tools have an impact on doing science, and some tasks would be nigh on undoable without them, but this does not imply such tools are part and parcel of the scientific discovery. One does not include in the “scientific impact” the Petri dish vendor, the PCR machine developers, or the Oracle 10g development team either. There are different activities with different scopes, goals, outcomes, and reward mechanisms; and that is fine. Proposing to offer companies some fairly difficult to determine scientific-impact brownie points may not be the most effective way to motivate them to develop tools for science—getting across the possibility to make a profit in the medium to long term and to do something of societal relevance may well be a better motivator.

References

[1] Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease: Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481

### The WONDER system for ontology browsing and graphical query formulation

Did you ever not want to bother knowing how the data is stored in a database, but simply want to know what kind of things are stored in it at, say, the conceptual or ontological layer of knowledge? And did you ever not want to bother writing queries in SQL or SPARQL, but instead have a graphical point-and-click interface with which you can compose a query using that layer of knowledge, with the system automatically generating the SQL/SPARQL query for you, in the correct syntax? And all that not with a downloaded desktop application, but in a Web browser?

Our domain experts in genetics as well as in healthcare informatics, at least, wanted that. We have now designed and implemented such a system [1], which we have enthusiastically named Web ONtology mediateD Extraction of Relational data (WONDER). Moreover, we have a working system for the use case about the 4GB horizontal gene transfer database [2] and its corresponding ‘application ontology’. (pdf)

Subscribers to this blog might remember I mentioned that we were working towards this goal, using Ontology-Based Data Access tools to access a database through an ontology and learning from (and elaborating on) its preliminary case studies [3]. In short, we added a usability extension to the OBDA implementations so that not only savvy Semantic Web engineers can use it, but also—actually, moreover—the domain experts who want to get information from their database(s). By building upon the OBDA framework [4], we can avail of its solid formal foundations; that is, WONDER is not merely a software application: there is a logic-based representation behind both the graphics in the ontology browser and the query pane.

In addition, WONDER is scalable because the ontology language (roughly: OWL 2 QL) is ‘simple’. Yes, we had to drop a few things from the original ORM conceptual model, but they have—at least for our case study—no effect on querying the data. The ‘difficult’ constraints are (and generally: should be anyway) implemented in the database, so there will be no instances violating the constraints we had to drop. Trade-offs, indeed, but now one can use an ontology to access a large database over the Web and retrieve the results quickly.
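To give an idea of what happens behind the query pane, here is a much-simplified sketch of ontology-to-SQL query generation (the mapping, table, and column names are invented for illustration; the actual OBDA mappings and query rewriting are considerably more sophisticated):

```python
# Sketch of OBDA-style query generation: ontology-level selections plus
# constraints are translated into SQL via declared mappings from ontology
# attributes to (table, column) pairs. Names are invented for illustration;
# join conditions are omitted for brevity.
mappings = {
    "Organism.Family": ("organisms", "family"),
    "Gene.Name":       ("genes", "gene_name"),
    "Gene.GCValue":    ("genes", "gc_total"),
}

def to_sql(selected, constraints):
    cols = [f"{t}.{c}" for t, c in (mappings[s] for s in selected)]
    tables = sorted({mappings[s][0]
                     for s in selected + [c[0] for c in constraints]})
    where = " AND ".join(
        f"{mappings[attr][0]}.{mappings[attr][1]} {op} {val}"
        for attr, op, val in constraints)
    sql = f"SELECT {', '.join(cols)} FROM {', '.join(tables)}"
    return sql + (f" WHERE {where}" if where else "")

print(to_sql(["Organism.Family", "Gene.Name"], [("Gene.GCValue", ">", 60)]))
```

The user only ever touches the ontology-level names on the left; the right-hand side of the mappings, and the generated SQL, stay hidden.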

For instance, take the query “For the Firmicutes, retrieve the organisms and their genes that have a GCtotal contents higher than 60”, which is for various reasons not possible through the current web interface of the source database.

Fig.1 shows the ontology pane with three relevant elements selected. (click on the figures to enlarge)

Fig.1. WONDER's ontology pane with three elements selected

Fig.2 shows the constrained adder, where I’m adding that the GCValue has to be > 60.

Fig. 2. WONDER's constraint adder, where I’m adding that the GCValue has to be > 60

Fig.3 shows the query ready for execution: the attributes with a green border are those that will appear in the query answer (I could have selected all, if I wanted to). In the menu bar on the right you can see I have customized the names of the attributes, so that the columns in the results pane will have a query-relevant name in your preferred language (not necessary to do), as well as the automatically generated query.

Fig.3. WONDER's query pane, where the query is ready for execution

Fig.4 shows a section of the results of the first page and Fig.5 of the second page; the “Family” column that has all the Firmicutes (out of about 500 organisms in the database) gives you the whole section of the species tree, because that is how the taxonomy information is stored in the database (refining the database is a separate topic). Alternatively, I could have selected the organism Name from the ontology browser (see Fig.1), de-selected the taxonomic classification in the query pane, and included the Name of the organism in the query answer to have the species name only but not all the taxonomic information; in this case, I wanted to have all that taxonomy information. The genes are the relevant selection (made with the other constraints) out of about the 2 million genes that are stored in the database.

Fig.4. Section of the results, the first page

Fig.5. Section of the results, the second page

There is also a constraint manager for the AND, OR, NOT and nesting. For instance, for the query “Give me the names of the organisms of which the abbreviation starts with a b, but not being a Bacillus, and the prediction and KEGG code of those organisms’ genes that are putatively either horizontally transferred or highly expressed” (Fig.6), we have the constraint manager as shown in Fig.7.

Fig.6. Graphical and textual representation of the second query

Fig.7. Constraint manager for the second query

You can also save and load queries when you’re logged in, and download the results set in any case.
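Conceptually, such AND/OR/NOT nesting amounts to a boolean expression tree over atomic constraints; here is a toy rendering of the second query's tree (my own illustration, not WONDER's internals):

```python
# Minimal boolean constraint tree, as a constraint manager might use
# conceptually: atoms are rendered as-is, AND/OR/NOT nodes recurse.
# (Illustrative only; not WONDER's actual internal representation.)
def render(node):
    op = node[0]
    if op == "atom":
        return node[1]
    if op == "not":
        return f"NOT ({render(node[1])})"
    # "and" / "or" with any number of children
    joined = f" {op.upper()} ".join(render(c) for c in node[1:])
    return f"({joined})"

query = ("and",
         ("atom", "abbreviation LIKE 'b%'"),
         ("not", ("atom", "name LIKE 'Bacillus%'")),
         ("or", ("atom", "prediction = 'HGT'"),
                ("atom", "prediction = 'highly expressed'")))
print(render(query))
```

Rendering the tree bottom-up like this is also what makes arbitrary nesting come for free in the generated query.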

For those who want to play with it: feel free to drop me a line and I will send you the URL. (The reason for not linking the URL here is that the current URL is still for the beta version, whereas the operational one is expected to have a more stable URL soon.)

Last, but not least, the “we” I used in the previous sentences is not some ‘standard writing in plural’, but several people were involved in various ways to realize the WONDER system. In alphabetical order, they are: Diego Calvanese, Marijke Keet, Werner Nutt, Mariano Rodriguez-Muro, and Giorgio Stefanoni, all at FUB. I also want to thank our domain experts of the case study (with whom we’re writing a bio-oriented paper): Santi Garcia-Vallvé (with the Evolutionary Genomics Group, ‘Rovira i Virgilli’ University, Tarragona, Spain) and Mark van Passel (with the Laboratory for Microbiology, Wageningen University and Research Centre, the Netherlands).

References

[1] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland.

[2] Garcia-Vallve, S, Guzman, E., Montero, MA. and Romeu, A. 2003. HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Research 31: 187-189.

[3] R. Alberts, D. Calvanese, G. De Giacomo, A. Gerber, M. Horridge, A. Kaplunova, C. M. Keet, D. Lembo, M. Lenzerini, M. Milicic, R. Moeller, M. Rodríguez-Muro, R. Rosati, U. Sattler, B. Suntisrivaraporn, G. Stefanoni, A.-Y. Turhan, S. Wandelt, M. Wessel. Analysis of Test Results on Usage Scenarios. Deliverable TONES-D27 v1.0, Oct. 10 2008.

[4] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, and Riccardo Rosati. Ontologies and databases: The DL-Lite approach. In Sergio Tessaris and Enrico Franconi, editors, Semantic Technologies for Informations Systems – 5th Int. Reasoning Web Summer School (RW 2009), volume 5689 of Lecture Notes in Computer Science, pages 255-356. Springer, 2009.

### Any semantic search for insects?

The draft of this post started with an example of a creepy insect living in Italy and, well, across the world in those locations where hygiene is not taken too seriously. But I will leave that be, so you can have a good night’s rest. Instead, I will take the example of an insect of which I still do not know what it is—it may still turn out to be a creepy one, but now I do have photos of it and it is living well away outside in Ineke’s poly-tunnel near Limerick, Ireland. The problem is this: neither Ineke, nor Heidi, nor I know what it is, but we really still want to know. How to get the answer, i.e., how to find the species name of the specimen? I’ve tried several strategies: the ones that are practically possible did not do the job, and the one that would does not exist. I’ll go through them in the remainder of the post and close with a few questions on what the most feasible strategy would/should/could be to eventually have a decent entomology [ornithology/nematology/etc.] knowledge base.

Specimen viewed from the top; can anyone ID this specimen?

Basic searches

None of us who were present at teatime in Ineke’s polytunnel where we observed the insect is an entomologist, nor do we have entomologist friends. The famous ‘bug man’ Ruud Kleinpaste is a fellow alumnus of Wageningen University, but we did not study there around the same time and I could not find an email address to bother him with a request to ID a specimen. None of us has an insect handbook either, and even if we had, I, for one, would not want to flick through it when there is a perceived need to find the species of a specimen. Flicking through the insect book (and plant book, etc.) was an entertaining pastime when I was young, like reading the encyclopaedia and playing the dictionary game, but in this day and age, I would want to use the computer to find the answer. This is theoretically feasible, but—as far as I am aware—not yet in practice.

To do image matching, I would need a very large data set and, for each image in it, to know which species it shows, neither of which I have; so the machine-learning strategy will not work. There is an online browseable BugGuide for the US and Canada with lots of pictures that I clicked through for a while, but without finding the right picture. There are entomology databases that let me search by species name (here, here, and here), but not by properties of the insect; KONCHUR has a fancier search mechanism but covers insects in Japan, East Asia and the Pacific only (“orange leg AND black body” did not return any results).

Semantic searches

Now, if there were a proper ontology of insects, and I mean not a bare taxonomic tree but one where the classes have properties and those properties have their ranges defined as well, then it would be a simple exercise of selecting the properties along the lines of

adult insect
AND length 2cm
AND colour black
AND has wingtype transparent
[*AND body shape similar to a wasp*]
AND leg colour orange
AND rear body stingy
AND location at least west Ireland

so that the reasoner (FaCT++, Pellet, and the like) would classify it near-instantly, or, if the ontology were really large, then still within an hour or so (ignoring for a moment the [*AND body shape similar to a wasp*], because that requires a bit more work). It would be even funkier if that ontology were linked to a database of images of insects to cross-check it with the visuals. Even more so if such a database also had information about each insect’s habitat, feeding habits, principal role in the food web, and any diseases it may cause or transmit. Then one would also be able to start the search from another direction, along the lines of “give me all the insects that live in the west of Ireland” as a first step to narrow down the possible answers.
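To illustrate the idea, the kind of property-based selection above can be mimicked with a plain filter over a toy knowledge base. This is a minimal sketch in plain Python, not an actual DL reasoner such as FaCT++ or Pellet, and every species entry, property name, and value in it is hypothetical, made up purely for illustration:

```python
# Toy "knowledge base": species with property values. All entries are
# hypothetical illustrations, not real entomological data.
INSECTS = {
    "Hypothetical wasp-like sp. A": {
        "length_cm": 2.0, "body_colour": "black", "wing_type": "transparent",
        "leg_colour": "orange", "has_sting": True,
        "regions": {"west Ireland", "UK"},
    },
    "Hypothetical beetle sp. B": {
        "length_cm": 1.0, "body_colour": "green", "wing_type": "hardened",
        "leg_colour": "black", "has_sting": False,
        "regions": {"Italy"},
    },
}

def matches(props, length_cm, body_colour, wing_type, leg_colour,
            has_sting, region):
    """Check one species entry against the observed characteristics,
    allowing a small tolerance on the measured length."""
    return (abs(props["length_cm"] - length_cm) <= 0.5
            and props["body_colour"] == body_colour
            and props["wing_type"] == wing_type
            and props["leg_colour"] == leg_colour
            and props["has_sting"] == has_sting
            and region in props["regions"])

# The observed specimen: ~2 cm, black body, transparent wings,
# orange legs, a sting, seen in the west of Ireland.
candidates = [name for name, props in INSECTS.items()
              if matches(props, 2.0, "black", "transparent", "orange",
                         True, "west Ireland")]
print(candidates)  # prints ['Hypothetical wasp-like sp. A']
```

A real ontology-backed version would of course get subsumption reasoning for free—e.g., asking for `adult insect` would also return instances of its subclasses—which is exactly what a hand-rolled filter like this cannot do.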

Aside from the instance classification problem of this particular specimen, the question arises whether it would be up to

a) Google to work on their technology so as to be able to get the answer for me?

b) Entomologists to develop their domain ontology about insects and link it to some database with pictures and additional textual information to have indeed a properly searchable knowledge base?

c) Volunteer labour, like me taking pictures and annotating each one with the physical characteristics, location, time of observation, etc., and categorising it as “bug” or “insect” or “insekt” or “insetto”, to eventually have a grass-roots bugbase (that likely will have some imperfections, with gaps in data fields and sloppy terminology)?

d) Everyone to buy insect books?

e) …?

Shelving option d, I am explicitly looking for a computational option, i.e., a, b, c, or e. I prefer a web-accessible version of option b, which can be done with scalable Semantic Web technologies; one only needs to find the money, time, and people to realise it.

Although I gave the example here with insects, the same story can be made for birds, worms, and so forth. When such searchable knowledge bases exist, it will not only save time for many lay people looking up the information and learning more about the flora and fauna around them, but I can imagine it will also make research a lot easier for interdisciplinary scientists who have to forage into knowledge of insects [/birds/worms/etc] as well as the entomologists [/ornithologists/nematologists/etc.] themselves.

Specimen viewed from the side


### Metagenomics updated and slightly upgraded

The Nature TOC-email arrived yesterday, and they have a whole “insight” section on microbial oceanography! Four years ago, Nature Reviews Microbiology had a special issue with a few papers about it; two years ago, PLoS Biology presented their Oceanic Metagenomics Collection; and now there is the Nature supplement. Why would a computer scientist like me care? Well, my first study was in microbiology, and they have scaled up things a lot in the meantime, thereby making computers indispensable in their research. For those unfamiliar with the topic, you can get an idea about the early computational issues in my previous post and the comments by visitors, but there are new ones that I’ll mention below.

Although the webpage of the supplement says that the editorial is “free access”, it is not (as of 14-5 about 6pm CET, and again today at 9am). None of the five papers—one commentary and four review papers—indicates anything about the computational challenges: “the life of diatoms in the world’s oceans”, “microbial community structure and its functional implications”, “the microbial ocean from genomes to biomes”, and “viruses manipulate the marine environment”. Given that DeLong’s paper of four years ago [1] interested me, I chose his review paper in this collection [2] to see what advances have been made in the meantime (and that article is freely available).

One of the issues mentioned in 2007 was the sequencing and the cleaning up of noisy data in the database, which now seems to be much less of a problem, even largely solved, so the goalposts are moving to issues with the actual data analysis. With my computing glasses on, Box 2 mentions (summarised afterwards):

Statistical approaches for the comparison of metagenomic data sets have only recently been applied, so their development is at an early stage. The size of the data sets, their heterogeneity and a lack of standardization for both metadata and gene descriptive data continue to present significant challenges for comparative analyses … It will be interesting to learn the sensitivity limits of such approaches, along more fine-scale taxonomic, spatial and temporal microbial community gradients, for example in the differences between the microbiomes of human individuals44. As the availability of data sets and comparable metadata fields continues to improve, quantitative statistical metagenomic comparisons are likely to increase in their utility and resolving power. (p202)

Let me summarise that: DeLong asserts they need (i) metadata annotations as a prerequisite for statistical approaches, (ii) to deal with temporal data, and (iii) to deal with spatial data. There is a lot of research and prototyping going on in topics (ii) and (iii), and there are a few commercial, industry-grade plugins, such as the Oracle Cartridge, that do something with spatial data representation. Perhaps that is not enough, or it is not what the users are looking for; if so, maybe they can be a bit more precise about what they want?

Point (i) is quite interesting, because it basically reiterates that ontologies are a means to an end, and it asserts that statistics cannot do the job with number crunching alone but needs structured qualitative information to obtain better results. The latter is quite a challenge—probably technically doable, but there are few people who are well versed in the combination of qualitative and quantitative analysis. Curiously, only MIAME and the MGED website are mentioned for metadata annotation, even though they are limited in scope compared with the subject domain ontologies and ontology-like artefacts (e.g., the GO, BioPAX, KEGG), which are also used for annotation but are not mentioned at all. The former deals with sequencing annotation following the methodological aspects of the investigation, whereas the latter type of annotation can be done with domain ontologies, i.e., annotating data with what kind of things you have found (which genes and their function, which metabolism, what role the organism has in the community, etc.), which is also needed to carry out the desired comparative analyses.
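To make point (i) concrete, here is a minimal sketch of why the statistics need the qualitative annotations: records annotated with domain-ontology terms can be grouped and compared by what was found, not just counted. All sample names, read counts, and term labels below are made up for illustration (they are not real GO identifiers or real data):

```python
# Hedged sketch: metagenomic records carrying structured qualitative
# annotations alongside the quantitative data. Everything here is a
# hypothetical illustration.
from collections import defaultdict

records = [
    {"sample": "ocean_surface_01", "read_count": 1200,
     "annotations": {"function": "photosynthesis", "habitat": "surface water"}},
    {"sample": "ocean_deep_07", "read_count": 800,
     "annotations": {"function": "chemosynthesis", "habitat": "deep water"}},
    {"sample": "ocean_surface_02", "read_count": 950,
     "annotations": {"function": "photosynthesis", "habitat": "surface water"}},
]

# Group read counts by functional annotation: the qualitative metadata
# is what makes the quantitative comparison across samples meaningful.
totals = defaultdict(int)
for rec in records:
    totals[rec["annotations"]["function"]] += rec["read_count"]

print(dict(totals))  # prints {'photosynthesis': 2150, 'chemosynthesis': 800}
```

Strip out the `annotations` field and all that remains is a bag of numbers with nothing biologically meaningful to group or compare them by, which is, in essence, the prerequisite DeLong is pointing at.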

There is also more generic hand-waving that something new is needed for data analysis:

The associated bioinformatic analyses are useful for generating new hypotheses, but other methods are required to test and verify in silico hypotheses and conclusions in the real world. It is a long way from simply describing the naturally occurring microbial ‘parts list’ to understanding the functional properties, multi-scalar responses and interdependencies that connect microbial and abiotic ecosystem processes. New methods will be required to expand our understanding of how the microbial parts list ties in with microbial ecosystem dynamics. (p203)

Point taken. And if that was not enough,

Molecular data sets are often gathered in massively parallel ways, but acquiring equivalently dense physiological and biogeochemical process data54 is not currently as feasible. This ‘impedance mismatch’ (the inability of one system to accommodate input from another system’s output) is one of the larger hurdles that must be overcome in the quest for more realistic integrative analyses that interrelate data sets spanning from genomes to biomes.

I fancy the thought that my work on granularity might be able to contribute to the solution, but that is not yet anywhere close to the user-level software application stage.

At the end of the paper, I am still—as in 2005 and 2007—left with the impression that more data are being generated in the whole metagenomics endeavour than there are computational tools to analyse them, to squeeze out all the information that is ‘locked up’ in the pile of data.

References

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.

[2] DeLong, E.F. The microbial ocean from genomes to biomes. Nature, 2009, 459: 200-206.

### A CS department is not a cheap software house

Computer Science—or if you prefer: Computing Science, Informatics, Computing, Computer Engineering, or similar—has suffered from an image problem for quite some time on several fronts. One of them is that non-informaticians seem to think that a computer science department is there as a ‘science-supporting’ or ‘facilitating’ department. In principle, it is not (although given certain circumstances, that is what it ends up being in some cases).

On the comical side, there are t-shirts with the slogan “no, I will not fix your computer”. One BSc graduate here at UniBz actually used that phrase during her speech at the degree ceremony last December, which generated laughs from the audience, but they probably did not give it a second thought. In addition to the philosophy of CS, about which I wrote earlier, and an unambiguous paper about paradigms in computer science [1], there is another source I recommend people have a look at: the final report of the Task Force on the Core of Computer Science [2], which “presents a new intellectual framework for the discipline of computing and a new basis for computing curricula”. It (i) outlines the paradigms for the discipline, (ii) has a long and a short description of computing, the short one being:

The discipline of computing is the systematic study of algorithmic processes that describe and transform information: their theory, analysis, design, efficiency, implementation, and application. The fundamental question underlying all of computing is, “What can be (efficiently) automated?”

and (iii) presents a matrix with three columns, “Theory”, “Abstraction” and “Design”, as complementary aspects of computing, and nine rows for the principal sub-disciplines, such as HCI, databases and information retrieval, operating systems, and architecture. Its 7-page appendix lists the suggested contents for those 27 cells in the matrix. And no, there is no ‘you should become a software house’ in there, not even close.

So, from where could non-informatics people have gotten the idea of CS as a supporting, facilitating discipline and software house? Perhaps Dijkstra’s [3] famous complaint gives a clue:

So, if I look into my foggy crystal ball at the future of computing science education, I overwhelmingly see the depressing picture of ‘‘Business as usual’’. The universities will continue to lack the courage to teach hard science, they will continue to misguide the students, and each next stage of infantilization of the curriculum will be hailed as educational progress.

That having occurred in multiple CS curricula, perhaps CS curricula developers are to blame, or their bending over backwards to meet outside demands: to produce graduates with short-lived skills that are instantly usable in industry, and to improve the throughput statistics of students graduating from the programme within the nominal time of study. Did anyone do a proper study of that, or are these just commonly held assumptions and excuses?

A wholly different argument that CS departments are not there ‘at your service’ has been put forward by Dieter Fensel and Dieter Wolf [4], which I wrote about before here. They claim that computer science will become the foundation of the sciences because it has information processing and knowledge management (and goal-oriented services) at its core. From their perspective, given that physics, biology and neuroscience deal with specific types/sections of information and knowledge management, they are (or become) branches of computer science. Just think of it for a moment. We would have, say, genetics as applied computer science, but not computer science as a service and facilitator for genetics. No ‘blue-collar bioinformatician’, but a ‘blue-collar wet-lab geneticist’ collecting the raw data. Unsurprisingly, during the workshop where Fensel presented his proposal, the philosophers were not charmed by that view. And I can imagine geneticists will not be pleased with inverting the cores and corresponding roles either: obviously, genetics is a science—but so is computing. Dijkstra [3] gives a view of the future of computer science that is more modest, but only slightly so, than that of Fensel and Wolf:

In the long run I expect computing science to transcend its parent disciplines, mathematics and logic, by effectively realizing a significant part of Leibniz’s Dream of providing symbolic calculation as an alternative to human reasoning. (Please note the difference between “mimicking” and “providing an alternative to”: alternatives are allowed to be better.)


We’re not some underpaid service-oriented software house; if you want industry-grade software and customer service, you’ll have to fork out the money and, well, ask industry to develop the software. Having said that, I’ll admit that CS departments, in general, should improve on respecting and valuing domain experts. For instance, one CS professor thinks of possible collaboration as “their benefit is that we can play with their data” or, phrased differently: we get something out of it, but the users should only put effort and resources into the endeavour and not expect anything in return. I do not think it is realistic to expect that domain experts are, or want to be, that philanthropic in a collaboration—collaborations ought to be mutually beneficial. A ‘use and throw away’ attitude might achieve short-term gains on the CS side, but such a win-lose approach is not sustainable in the long run. Après moi, le déluge, an older and tenured CS prof might think, but (i) the younger generations cannot afford such ‘luxuries’ because their horizons reach farther (time-wise, at least, and possibly also with respect to career aims), and (ii) if producing software is part of the research task, then doing the work up to a working prototype should be part of a researcher’s honesty and science ethics anyway, in particular when papers are published about it.

Maybe CS should knock on the doors of a marketing company and ask for brand positioning services so that CS-offers can be harmonised better with non-informatician demands and expectations.

[1] Amnon H. Eden. Three paradigms of computer science. Minds & Machines, 2007, 17: 135-167.

[2] Peter J. Denning, Douglas E. Comer, David Gries, Michael C. Mulder, Allen Tucker, A. Joe Turner, and Paul R. Young. Computing as a Discipline. Communications of the ACM, 1989, 32(1): 9-23.

[3] Dijkstra, E.W. (1988). On the cruelty of really teaching computing science. Unpublished manuscript EWD 1036.

[4] Fensel, D., Wolf, D. The Scientific Role of Computer Science in the 21st Century. Third International Workshop on Philosophy and Informatics (WSPI06), Saarbrücken, 3-4 May 2006. pp33-46.