A strike against the ‘realism-based approach’ to ontology development

The ontology engineering course starting this Monday at the Knowledge Representation and Reasoning group at Meraka commences with the question What is an ontology? In addition to assessing definitions, it touches upon long-standing disagreements concerning whether ontologies are about representing reality, our conceptualization of entities in reality, or some conceptualization that does not necessarily commit to the existence of reality. The “representation of reality school” is advocated in ontology engineering most prominently by Barry Smith and colleagues with their foundational ontology BFO, the “conceptualization of entities in reality school” by various people and research groups, such as the LOA headed by Nicola Guarino with their DOLCE foundational ontology, whereas the “conceptualization regardless of reality school” can be (but is not necessarily) encountered in organisations developing, e.g., medical ontologies that do not subscribe to evidence-based medicine to decide what goes in the ontology and how (but instead base it on, say, the outcome of power plays between big pharma and health insurance companies).

Due to the limited time and scope of this and previous courses on ontology engineering I taught, I mention[ed] only succinctly that those differences exist (e.g., pp10-11 of the UH slides), and briefly illustrate some aspects of the debate and their possible consequences for practical ontology engineering. This information is largely based on a few papers and the consequences extracted from them, the examples they describe and that I encountered, and the discussions that took place at the various meetings, workshops, conferences, and summer schools in which I participated. But there was no nice, accessible paper that describes the debate—or even part of it—more precisely and is readable also by ontologists who are not philosophers. Until last week, that is. The Applied Ontology journal published a paper by Gary Merrill, entitled Ontological realism: Methodology or misdirection? [1], that critically assesses the ontological realism advocated by Barry Smith and his colleague Werner Ceusters. Considering its relevance to ontology engineering, the article has been made freely available, and in the announcement of the journal issue, its editors-in-chief (Nicola Guarino and Mark Musen) mentioned that Smith and Ceusters are preparing a response to Merrill’s paper, which will be published in a subsequent issue of Applied Ontology. Merrill, in turn, has promised to respond to this rebuttal.

But for now, there are 30 pages of assessment of the merits of, and problems with, the philosophical underpinnings of the “realism-based approach” that is used in particular in the realm of ontology engineering within the OBO Foundry project and its large set of ontologies, BFO, and the Relation Ontology. The abstract gives an idea of the answer to the question in the paper’s title:

… The conclusion reached is that while Smith’s and Ceusters’ criticisms of prior practice in the treatment of ontologies and terminologies in medical informatics are often both perceptive and well founded, and while at least some of their own proposals demonstrate obvious merit and promise, none of this either follows from or requires the brand of realism that they propose.

The paper’s content backs this up with analysis, arguments, examples, and bolder statements than the abstract suggests.
For anyone involved in ontology development and interested in the debate—even if you think you’re tired of it—I recommend reading the paper and at least following how the debate unfolds with the responses and rebuttals.

My opinion? Well, I have one, of course, but this post is an addendum to the general course page of MOWS’10, hence I try to refrain from adding too much bias to the course material.

UPDATE (27-7-2010): On whales and apples, and on ontology and reality: you might enjoy also “Moby Dick: an exercise in ontology”, written by Lorne A. Smith.

References

[1] Gary H. Merrill. Ontological realism: Methodology or misdirection? Applied Ontology, 5 (2010) 79–108.

Failure of the experiment for the SemWebTech course

At the beginning of the SWT course, I had the illusion that we could use the blog as another aspect of the course and, more importantly, that students (and other interested people) would feel free to leave comments and links to pages and other related blogs and blog posts they had encountered. It did not really happen, though, so as an experiment formulated as such, it failed miserably.

But I can scrape together some data demonstrating that it was not all for naught. I have received several offline comments from colleagues who found it useful, from non-SWT-course students who use it as a means of distance education, and from people who kindly pointed me to updates and extensions of various topics—but the plural of anecdote is not data. So here are some figures.

34 students had enrolled in the Moodle, of whom some 10-15 attended class initially, dwindling to 4-8 as midterms of other courses and the holiday season interfered with their study schedules (and perhaps my teaching skills or the topics of the course); 12 students did a mini-project for the lab within the deadline for this exam session, 12 registered for the exam, and 11 showed up to actually sit it. FUB strives for a 1:6 lecturer:student ratio, so with the SWT course (as well as most other MSc courses) we are at the good end of that.

The aggregated data for explicit blog post accesses (i.e., not counting those who read a post through the home page) and slide downloads on 17-2-2010, as sampled while invigilating the SWT exam, are as follows: the average number of visits to an SWT course blog post is 112, with OWL, top-down and bottom-up ontology development, and part-whole relations well above the average, and the average number of slide downloads is 41, with OWL and top-down and bottom-up above average again. At the moment, one can only speculate why.

Clearly, many more people have accessed the pages and the slides than can be accounted for by the students alone, even under the assumption that they accessed each blog post, say, twice and entertained themselves with downloading both the normal slides and the same ones in hand-out format. The content of the slides was not the only material covered during the lectures and the labs, but maybe it has been, is, or will be of use to other people as well. People who are interested in ontology engineering topics more generally, especially regarding course development and course content, will find Ontolog’s Ontology Summit’s current virtual panel sessions on “Creating the ontologists of the future” worthwhile to consult.

Finally, will I go through the trouble of writing blog posts for another course I may have to teach? Probably not.

72010 SemWebTech lecture 12: Social aspects and recap part 2 of the course

You might ask yourself why we should even bother with social aspects in a technologies course. Out there in the field, however, SWT are applied by people with different backgrounds and specialties, and they are relatively new technologies that play out in an inter/multi/transdisciplinary environment, which brings with it some learning curves. If you end up working in this area, it is wise to have some notion of human dynamics in addition to the theoretical and technological details, and of how the two are intertwined. Some of the hurdles that may seem ‘merely’ dynamics of human interaction can very well turn out to be scratching the surface of problems that might be solved with extensions or modifications to the technologies, or even motivate new theoretical research.

Good and Wilkinson’s paper provides a non-technical introduction to Semantic Web topics, such as LSID, RDF, ontologies, and services. They consider what problems these technologies solve (i.e., the sensible reasons to adopt them), and what the hurdles are, both with respect to the extant tools & technologies and to the (humans working for some of the) leading biological data providers, who appear reluctant to take up the technologies. There are obviously people who have taken the approach of “let’s try and see what comes out of the experimentation”, whereas others are more reserved and take the approach of “let’s see what happens, and then maybe we’ll try”. If there are not enough people of the former type, then the latter ones obviously will never try.

Another dimension of the social aspects is described in [2], which is a write-up of Goble’s presentation about the Montagues and the Capulets at the SOFG’04 meeting. It argues that there are, broadly, three different types of people within the SWLS arena (it may just as well apply to another subject domain experimenting with SWT, e.g., public administration): the AI researchers, the philosophers, and the IT-savvy domain experts. They each have their own motivations and goals, which at times clash, but with conversation, respect, understanding, compromise, and collaboration, one can achieve the realisation of theory and ideas in useful applications.

The second part of the lecture will be devoted to a recap of the material of the past 11 lectures (the recap of the first part of the SWT course will be on 19-1).

References

[1] Good BM and Wilkinson MD. The Life Science Semantic Web is Full of Creeps! Briefings in Bioinformatics, 2006 7(3):275-286.

[2] Carole Goble and Chris Wroe. The Montagues and the Capulets. Comparative and Functional Genomics, 5(8):623-632, 2004. doi:10.1002/cfg.442

Note: reference 1 is mandatory reading, 2 is optional.

Lecture notes: none

Course website

72010 SemWebTech lecture 11: BioRDF and Workflows

After considering the background of the combination of ontologies, the Semantic Web, and ‘something bio’, and some challenges and successes in the previous three lectures, we shall take a look at more technologies that are applied in the life sciences and that use SWT to a greater or lesser extent; in particular, RDF and scientific workflows. The former has the flavour of “let’s experiment with the new technologies”, whereas the latter is more akin to “where can we add SWT to the system and make things easier?”.

BioRDF

The problems of data integration were not always solved in a satisfactory manner with the ‘old’ technologies, but perhaps SWT can solve them; or so goes the idea. The past three years have seen several experiments to test whether SWT can live up to that challenge. To see where things are heading, let us recollect the data integration strategies covered in lecture 8, which can be chosen with the extant technologies as well as with the newer ones of the Semantic Web: (i) physical schema mappings with Global As View (GAV), Local As View (LAV), or GLAV; (ii) conceptual model-based data integration; (iii) data federation; (iv) data warehouses; (v) data marts; (vi) services-mediated integration; (vii) peer-to-peer data integration; and (viii) ontology-based data integration, being (i) or (ii) (possibly in conjunction with the others) through an ontology, or linked data by means of an ontology.
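
As a minimal sketch of the first strategy (with invented source contents and relation names), a Global-As-View mapping defines each global relation as a query over the sources, so that a query on the global schema is answered by simply unfolding that definition:

```python
# Minimal Global-As-View (GAV) sketch: the global relation is *defined*
# as a query over the sources, so answering a query on the global
# schema amounts to unfolding this definition. Source contents and
# relation names are invented for illustration.
source_a = [("P53", "protein"), ("BRCA1", "protein")]            # (id, kind)
source_b = [("P53", "homo sapiens"), ("BRCA1", "homo sapiens")]  # (id, species)

def global_protein(sources):
    """GAV mapping: protein(id, species) is defined as a join over
    the two sources, keeping only entries of kind 'protein'."""
    a, b = sources
    kinds = dict(a)
    return [(pid, sp) for pid, sp in b if kinds.get(pid) == "protein"]

answer = global_protein((source_a, source_b))
```

In LAV, by contrast, each source would be described as a view over the global schema, and answering a global query then requires a rewriting step rather than simple unfolding.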

Early experiments focused on RDF-izing ‘legacy’ data, such as RDBMSs, Excel sheets, HTML pages, etc., and making one large triplestore out of it, i.e., an RDF warehouse [1,2], using tools such as D2RQ and Sesame (renamed to OpenRDF) as triple store (other triple stores are, e.g., Virtuoso and AllegroGraph, used by [3]). The Bio2RDF experiment took over 20 freely available data sources and converted them with multiple JSP programs into a total of about 163 million triples in a Sesame triplestore, added a myBio2RDF personalization step, and used extant applications to present the data to the users. The warehousing strategy, however, has some well-known drawbacks even in a non-Semantic Web setting. So, following the earlier gradual development of data integration strategies, the time had come to experiment with data federation, RDF-style [3], where the authors note at the end that perhaps the next step—services—may yield interesting results as well. You may also want to have a look at the winners’ solutions to the yearly Billion Triple Challenge and other Semantic Web challenges (all submissions, each with a paper describing the system and a demo, are filed under the ‘former challenges’ menu).
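
A much-reduced illustration of that RDF-izing step (the base URI and property names are invented, not Bio2RDF’s actual ones): each record of a legacy table becomes a handful of triples in an in-memory ‘warehouse’:

```python
# Sketch of 'RDF-izing' a legacy table: each row becomes a set of
# subject-predicate-object triples. The base URI and property names
# are invented for illustration.
BASE = "http://example.org/bio2rdf/"

legacy_rows = [
    {"id": "GO:0004396", "label": "hexokinase activity",
     "namespace": "molecular_function"},
]

def rdfize(rows):
    """Convert tabular records into a set of triples."""
    triples = set()
    for row in rows:
        subject = BASE + row["id"]
        triples.add((subject, "rdfs:label", row["label"]))
        triples.add((subject, BASE + "namespace", row["namespace"]))
    return triples

store = rdfize(legacy_rows)  # a tiny in-memory 'warehouse' of triples
```

Real converters also have to mint stable URIs, handle missing values, and keep the conversion in sync with the evolving source, which is where much of the actual engineering effort goes.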

One of the problems that SWT and its W3C standards aimed to solve was uniform data representation, which can be done well with RDF. Another was locating an entity and identifying it, which can be done with URIs. An emerging problem now is that a single entity in reality has many “semantically equivalent” URIs [1,3]; e.g., hexokinase had three different URIs: one in the GO, one in UniProt, and one in the BioPathways (to harmonise them, Bio2RDF added its own URI and linked it to the others using owl:sameAs). More general than the URI issue alone is the observation made by the HCLS IG’s Linking Open Drug Data group, which was a well-known hurdle in earlier non-SWT data integration efforts: “A significant challenge … is the strong prevalence of terminology conflicts, synonyms, and homonyms. These problems are not addressed by simply making data sets available on the Web using RDF as common syntax but require deeper semantic integration.” and “For … applications that rely on expressive querying or automated reasoning deeper integration is essential” [4]. In parallel with the request for “more community practices on publishing term and schema mappings” [4], the experimentation with RDF-oriented data integration continues.
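
The effect of linking equivalent URIs with owl:sameAs can be sketched as maintaining an equivalence relation over identifiers, so that a lookup via any member of an equivalence class sees the same assertions (URIs abbreviated, and the merging policy is my own simplification, not how a reasoner actually implements it):

```python
# Sketch: owl:sameAs as an equivalence relation over URIs, kept in a
# union-find structure. A lookup on any member of an equivalence class
# sees the assertions attached to the class. URIs are abbreviated and
# invented for illustration.
parent = {}

def find(x):
    """Return the canonical representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def same_as(x, y):
    """Merge the equivalence classes of x and y (owl:sameAs link)."""
    parent[find(x)] = find(y)

facts = {}  # canonical URI -> set of (property, value) assertions

def assert_fact(uri, prop, value):
    # in this toy version, declare all sameAs links before asserting facts
    facts.setdefault(find(uri), set()).add((prop, value))

def lookup(uri):
    return facts.get(find(uri), set())

# Three URIs for hexokinase, as in the Bio2RDF example
same_as("go:hexokinase", "uniprot:hexokinase")
same_as("uniprot:hexokinase", "biopathways:hexokinase")
assert_fact("go:hexokinase", "rdfs:label", "Hexokinase")
```

Note the simplification flagged in the comment: a production triplestore must also handle sameAs links that arrive after facts do, typically by rewriting or by computing the closure at query time.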

Scientific Workflows

You may have come across Business Process Modelling and workflows in government and industry; scientific workflows are an extension to that (see its background and motivation). In addition to general requirements, such as service composition and reuse of workflow design, scalability, and data provenance, in practice, it turns out that such a scientific workflow system must have the ability to handle multiple databases and a range of analysis tools with corresponding interfaces to a diverse range of computational environments, deal with explicit representation of knowledge at different stages, customization of the interface for each researcher, and auditability and repeatability of the workflow.

To cut a long story short (in the writing here, not in the lecture on 11-1): where can we plug SWT into scientific workflows? One can, for instance, use RDF as common data format for linking and integration and SPARQL for querying that data, OWL ontologies for the representation of the knowledge across the workflow (at least the domain knowledge and the workflow knowledge), rules to orchestrate the service execution, and services (e.g., WSDL, OWL-S) to discover useful scripts that can perform a task in the workflow.

This still leaves the choice of what to do with provenance, which may be considered a component of the broader notion of trust. Recollecting the Semantic Web layer cake from lecture 1, trust sits above the SPARQL, OWL, and RIF pieces. Currently, there is no W3C standard for the trust layer, yet users need it. Scientific workflow systems, such as Kepler and Taverna, invented their own ways of managing it. For instance, Taverna uses experiment-, workflow-, and knowledge-provenance models represented using RDF(S) & OWL, and RDF for the individual provenance graphs of a particular workflow [5,6]. The area of scientific workflows, provenance, and trust is lively with workshops and, e.g., the provenance challenges; at the time of writing this post, it may still be too early to identify an established solution (to, say, have interoperability across workflow systems and their components to weave a web of provenance), be it an SWT one or another.
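
A toy version of such workflow provenance, recording for each data item which service produced it and from which inputs, could look like the following (all identifiers and property names are invented for illustration; they are not any standard provenance vocabulary):

```python
# Sketch: recording provenance triples while a workflow runs, so that
# a result can later be traced back to the services and inputs that
# produced it. Identifiers and property names are invented.
from datetime import datetime, timezone

provenance = []  # (subject, predicate, object) triples

def run_step(service, inputs, output):
    """Pretend to invoke a workflow service, logging provenance triples."""
    provenance.append((output, "ex:wasGeneratedBy", service))
    for item in inputs:
        provenance.append((output, "ex:wasDerivedFrom", item))
    provenance.append((output, "ex:generatedAt",
                       datetime.now(timezone.utc).isoformat()))
    return output

def lineage(item):
    """Trace an item back through all ex:wasDerivedFrom links."""
    direct = [o for s, p, o in provenance
              if s == item and p == "ex:wasDerivedFrom"]
    return direct + [x for d in direct for x in lineage(d)]

hits = run_step("service:blast", ["data:sequence42"], "data:hits42")
alignment = run_step("service:align", [hits], "data:alignment42")
```

Because the log is itself a set of triples, the repeatability and auditability requirements mentioned above reduce to querying this graph, which is precisely why RDF is an attractive substrate for it.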

Probably, there will not be enough time during the lecture to also cover Semantic Web Services. In case you are curious how one can efficiently search for the thousands of web services and their use in working systems (i.e., application-oriented papers, not the theory behind it), you may want to have a look at [7, 8] (the latter is lighter on the bio-component than the former). The W3C activities on web services have standards, working groups, and an interest group.

References

[1] Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System. Journal of Biomedical Informatics, 2008, 41(5):706-16. online interface: bio2RDF

[2] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Elizabeth Wu, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web, BMC Bioinformatics, 8, 2007.

[3] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S10

[4] Anja Jentzsch, Bo Andersson, Oktie Hassanzadeh, Susie Stephens, Christian Bizer. Enabling Tailored Therapeutics with Linked Data. LDOW2009, April 20, 2009, Madrid, Spain.

[5] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat and Peter Li. (2004). Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20 (17): 3045-3055. The Taverna website

[6] Carole Goble et al. Knowledge Discovery for biology with Taverna. In: Semantic Web: Revolutionizing knowledge discovery in the life sciences. 2007, pp355-395.

[7] Michael DiBernardo, Rachel Pottinger, and Mark Wilkinson. (2008). Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework. Journal of Biomedical Informatics, 41(5): 837-847.

[8] Sahoo, S.S., Sheth, A., Hunter, B., and York, W.S. SEMbrowser–semantic biological web services registry. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, Baker, C.J.O., Cheung, H. (eds), Springer: New York, 2007, pp 317-340.

Note: references 1 and (5 or 6) are mandatory reading, (2 or 3) was mandatory for an earlier lecture, and 4, 7, and 8 are optional.

Lecture notes: lecture 11 – BioRDF and scientific workflows

Course website

72010 SemWebTech lecture 10: SWLS and text processing and ontologies

There is a lot to be said about how Ontology, ontologies, and natural language interact from a philosophical perspective up to the point that different commitments lead to different features and, moreover, limitations of a (Semantic Web) application. In this lecture on 22 Dec, however, we shall focus on the interaction of NLP and ontologies within a bio-domain from an engineering perspective.

During the bottom-up ontology development and methodologies lectures, it was already mentioned that natural language processing (NLP) can be useful for ontology development. In addition, NLP can be used as a component in an ontology-driven information system and an NLP application can be enhanced with an ontology. Which approaches and tools suit best depends on the goal (and background) of its developers and prospective users, ontological commitment, and available resources.

Summarising the possibilities for “something natural language text” and ontologies or ontology-like artifacts, we can:

  • Use ontologies to improve NLP: to enhance precision and recall of queries (including enhancing dialogue systems [1]), to sort results of an information retrieval query to the digital library (e.g. GoPubMed [2]), or to navigate literature (which amounts to linked data [3]).
  • Use NLP to develop ontologies (TBox): mainly to search for candidate terms and relations, which is part of the suite of techniques called ‘ontology learning’ [4].
  • Use NLP to populate ontologies (ABox): e.g., document retrieval enhanced by lexicalised ontologies and biomedical text mining [5].
  • Use it for natural language generation (NLG) from a formal language: this can be done using a template-based approach that works quite well for English but much less so for grammatically more structured languages such as Italian [6], or with a full-fledged grammar engine as with the Attempto Controlled English and bi-directional mappings (see for a discussion [7]).
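
The first of these uses, improving recall, can be sketched as expanding a query term with its synonyms and subclasses from the ontology before matching (the mini-ontology and documents below are made up for illustration):

```python
# Sketch: ontology-based query expansion. A query term is expanded with
# its synonyms and subclasses from a tiny, made-up ontology, improving
# recall over plain string matching.
ontology = {
    "hepatitis": {"synonyms": ["liver inflammation"],
                  "subclasses": ["hepatitis A", "hepatitis B"]},
    "hepatitis A": {"synonyms": [], "subclasses": []},
    "hepatitis B": {"synonyms": [], "subclasses": []},
}

documents = [
    "patient presents with hepatitis B and fever",
    "chronic liver inflammation observed",
    "no abnormalities found",
]

def expand(term):
    """Collect the term, its synonyms, and (recursively) its subclasses."""
    entry = ontology.get(term, {"synonyms": [], "subclasses": []})
    terms = {term} | set(entry["synonyms"])
    for sub in entry["subclasses"]:
        terms |= expand(sub)
    return terms

def search(term):
    terms = expand(term)
    return [d for d in documents if any(t in d for t in terms)]
```

With plain matching, the second document would be missed entirely; the synonym from the ontology is what retrieves it. Real systems like GoPubMed of course add ranking, disambiguation, and grounding on top of this basic idea.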

Intuitively, one may be led to think that simply taking generic NLP or NLG tools will do fine for the bio(medical) domain as well. Such applications do indeed use those techniques and tools—Paul Buitelaar’s slides have examples and many references to NLP tools—but, generally, they do not suffice to obtain ‘acceptable’ results. Domain-specific peculiarities are many and wide-ranging: for instance, dealing with the variations of terms (scientific name, variant, common misspellings) and the grounding step (linking a term to an entity in a biological database) on the ontology-NLP preparation and instance classification side [5], characterizing the question correctly in a question answering system [1], and finding ways to deal with the rather long strings that denote a biological entity or concept or universal [4]. Handling some of these peculiarities actually yields better overall results than in generic or other domain-specific uses of NLP tools, but it requires extra manual preparatory work and a basic understanding of the subject domain and its applications.

References

[1] K. Vila, A. Ferrández. Developing an Ontology for Improving Question Answering in the Agricultural Domain. In: Proceedings of MTSR’09. Springer CCIS 46, 245-256.

[2] Heiko Dietze, Dimitra Alexopoulou, Michael R. Alvers, Liliana Barrio-Alvers, Bill Andreopoulos, Andreas Doms, Joerg Hakenberg, Jan Moennich, Conrad Plake, Andreas Reischuck, Loic Royer, Thomas Waechter, Matthias Zschunke, and Michael Schroeder. GoPubMed: Exploring PubMed with Ontological Background Knowledge. In Stephen A. Krawetz, editor, Bioinformatics for Systems Biology. Humana Press, 2008.

[3] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784] (but see also some comments on the paper)

[4] Dimitra Alexopoulou, Thomas Waechter, Laura Pickersgill, Cecilia Eyre, and Michael Schroeder. Terminologies for text-mining: an experiment in the lipoprotein metabolism domain. BMC Bioinformatics, 9(Suppl4):S2, 2008

[5] Witte, R. Kappler, T. And Baker, C.J.O. Ontology design for biomedical text mining. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, Baker, C.J.O., Cheung, H. (eds), Springer: New York, 2007, pp 281-313.

[6] M. Jarrar, C.M. Keet, and P. Dongilli. Multilingual verbalization of ORM conceptual models and axiomatized ontologies. STARLab Technical Report, Vrije Universiteit Brussels, Belgium. February 2006.

[7] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear, G. Hart. A comparison of three controlled natural languages for OWL 1.1. Proc. of OWLED 2008 DC.

Note: references 4 and 5 are mandatory reading, and 1-3 and 6 are optional (recommended for the EMLCT students).

Lecture notes: lecture 10 – Text processing

Course website

72010 SemWebTech lecture 9: Successes and challenges for ontologies in the life sciences

To be able to talk about successes and challenges of SWT for health care and life sciences (or any other subject domain), we first need to establish when something can be deemed a success, when a challenge, and when an outright failure. Such measures can be devised in an absolute sense (compare technology x with an SWT one: does it outperform on measure y?) or a relative one (to whom is technology x deemed successful?). Given these considerations, we shall take a closer look at several attempts: two successes and a few challenges in representation and reasoning. What were the problems and how did they solve them, and what are the remaining problems and can they be resolved, respectively?

As success stories we take the experiments by Wolstencroft and coauthors on classifying protein phosphatases [1] and by Calvanese et al. on graphical, web-based, ontology-based data access applied to horizontal gene transfer data [2]. They each focus on different ontology languages and reasoning services to solve different problems. What they have in common is an interaction between the ontology and the instances (and that each took a considerable amount of work by people with different specialties): the former focuses on classifying instances and the latter on querying instances. In addition, modest results of biological significance have been obtained with the classification of the protein phosphatases, whereas with the ontology-based data analysis we are tantalizingly close.
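
In spirit, and much simplified, the classification side works by checking instances against the necessary-and-sufficient conditions of defined classes; the class definitions below are invented placeholders, not the real protein phosphatase criteria:

```python
# Sketch: classifying instances against defined classes, in the spirit
# of the protein phosphatase experiment. The class definitions are
# invented placeholders, not the actual biological criteria; a real
# setup would use an OWL reasoner, not hand-rolled set inclusion.
defined_classes = {
    # class name -> set of domains an instance must contain
    "TyrosinePhosphatase": {"catalytic", "tyrosine-binding"},
    "DualSpecificityPhosphatase": {"catalytic", "tyrosine-binding",
                                   "serine-binding"},
}

def classify(instance_domains):
    """Return every defined class whose conditions the instance
    satisfies, from most general to most specific."""
    matches = [name for name, cond in defined_classes.items()
               if cond <= set(instance_domains)]
    return sorted(matches, key=lambda n: len(defined_classes[n]))

protein_x = ["catalytic", "tyrosine-binding", "serine-binding"]
```

The point of using OWL rather than such ad-hoc code is, of course, that the conditions live in the ontology itself, the reasoner guarantees logical consistency, and unexpected classifications can surface biologically interesting cases.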

The challenges for SWT in general and for HCLS in particular are quite diverse: some concern the SWT proper, whereas others are considered by its designers—and the W3C’s core standardization activities—to be outside their responsibility, yet still need to be addressed. Currently, for the software aspects, the onus is on software developers and industry to pick up the proof-of-concept and working-prototype tools that have come out of academia and to bring them to the industry-grade quality that widespread adoption of SWT requires. Although this aspect should not be ignored, we shall focus on the language and reasoning limitations during the lecture.

In addition to the language and corresponding reasoning limitations covered in the lectures on OWL, there are language “limitations” discussed and illustrated at length in various papers, the most recent take being [3]; it might well be that the extensions presented in lectures 6 and 7 (parts, time, uncertainty, and vagueness) can ameliorate or perhaps even solve the problem. Some of the issues outlined by Schulz and coauthors are ‘mere’ modelling pitfalls, whereas others are real challenges that can be approximated to a greater or lesser extent. We shall look at several representation issues that go beyond the earlier examples of SNOMED CT’s “brain concussion without loss of consciousness”; e.g., how would you represent in an ontology that in most but not all cases hepatitis has fever as a symptom, how would you formalize the defined concept “Drug abuse prevention”, and (provided you are convinced it should be represented in an ontology) that the world-wide prevalence of diabetes mellitus is 2.8%?

Concerning challenges for automated reasoning, we shall look at two of the nine identified required reasoning scenarios [4]: “model checking (violation)” and “finding gaps in an ontology and discovering new relations”. These reiterate that it is the life scientists’ high-level, goal-driven approach and desire to use OWL ontologies with reasoning services to, ultimately, discover novel information about nature. You might find it of interest to read about the feedback received from the SWT developers upon presenting [4] here: some requirements have been met in the meantime and new useful reasoning services were presented.

References

[1] Wolstencroft, K., Stevens, R., Haarslev, V. Applying OWL reasoning to genomic data. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, Baker, C.J.O., Cheung, H. (eds), Springer: New York, 2007, 225-248.

[2] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland.

[3] Stefan Schulz, Holger Stenzhorn, Martin Boekers and Barry Smith. Strengths and Limitations of Formal Ontologies in the Biomedical Domain. Electronic Journal of Communication, Information and Innovation in Health (Special Issue on Ontologies, Semantic Web and Health), 2009.

[4] Keet, C.M., Roos, M. and Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS Vol-258.

[5] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Elizabeth Wu, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web, BMC Bioinformatics, 8, 2007.

p.s.: the first part of the lecture on 21-12 will be devoted to the remaining part of last week’s lecture; that is, a few discussion questions about [5] that are mentioned in the slides of the previous lecture.

Note: references 1 and 3 are mandatory reading, 2 and 4 recommended to read, and 5 was mandatory for the previous lecture.

Lecture notes: lecture 9 – Successes and challenges for ontologies

Course website

72010 SemWebTech lecture 8: SWT for HCLS background and data integration

After the ontology languages and general aspects of ontology engineering, we now delve into one specific application area: SWT for health care and life sciences. Its frontrunners in bioinformatics adopted some of the Semantic Web ideas even before Berners-Lee, Hendler, and Lassila wrote their Scientific American paper in 2001, even though they did not formulate their needs and intentions in the same terminology: they did want shared, controlled vocabularies with the same syntax, to facilitate data integration—or at least interoperability—across Web-accessible databases, a common space for identifiers, a dynamic, changing system, a way to organize and query incomplete biological knowledge, and, albeit not stated explicitly, it all still needed to be highly scalable [1].

Bioinformaticians and domain experts in genomics had already organized themselves in the Gene Ontology Consortium, which was set up officially in 1998 to realize a solution for these requirements. The results exceeded anyone’s expectations, for a range of reasons. Many tools for the Gene Ontology (GO) and its common KR format, .obo, have been developed, and other research groups adopted the approach to develop controlled vocabularies, either by extending the GO, e.g., with rice traits, or by adding their own subject domain, such as zebrafish anatomy and mouse developmental stages. This proliferation, as well as the OWL development and standardization process going on at about the same time, pushed the goal posts further: new expectations were put on the GO and its siblings and on their tools, and the proliferation had become a bit too unwieldy to keep a good overview of what was going on and how those ontologies would fit together. Put differently, some people noticed the inferencing possibilities to be gained from moving from obo to OWL, and others thought that some coordination among all those obo bio-ontologies would be advantageous, given that post-hoc integration of ontologies of related and overlapping subject domains is not easy. Thus the OBO Foundry came into being to address such issues, proposing a methodology for the coordinated evolution of ontologies to support biomedical data integration [2].

People in related disciplines, such as ecology, have taken up the experiences of these very early adopters, and decided instead to join in only after the OWL standardization. They, however, were not only motivated by data(base) integration. Referring to Madin et al.’s paper [3] again, I highlight three points they made: “terminological ambiguity slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science”, i.e., an identification of some serious problems they have in ecological research; “Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology”, i.e., what they expect ontologies will solve for them (disambiguation); and “and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms”, i.e., for what purpose they want to use ontologies and any associated computation (discovery). That is, ontologies not as a—one of many possible—tool in the engineering/infrastructure sense, but as a required part of a method in the scientific investigation that aims to discover new information and knowledge about nature (i.e., in answering the who, what, where, when, and how things are the way they are in nature).

What has all this to do with actual Semantic Web technologies? On the one hand, there are multiple data integration approaches and tools that have been, and are being, tried out by the domain experts, bioinformaticians, and interdisciplinary-minded computer scientists [4], and, on the other hand, there are the W3C Semantic Web standards XML, RDF(S), SPARQL, and OWL. Some use these standards to achieve data integration, some do not. Since this is a Semantic Web course, we shall take a look at two efforts that (try to) do, which came forth from the activities of the W3C's Health Care and Life Sciences Interest Group. More precisely, we take a closer look at a paper written about 3 years ago [5] that reports on a case study to try to get those Semantic Web technologies to work for them in order to achieve data integration and a range of other things. There is also a more recent paper from the HCLS IG [6], which aimed not only at linking data but also at querying distributed data, using a mixture of RDF triple stores and SKOS. Both papers reveal their understanding of the purposes of SWT, and, moreover, what their goals are, their experimentation with various technologies to achieve them, and where there is still some work to do. There are notable achievements described in these, and related, papers, but the sought-after "killer app" is yet to be announced.
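The core idea behind RDF-based data integration in these efforts is that records from different sources meet on shared identifiers (URIs), so that a join across sources falls out of the naming scheme rather than a hand-written mapping. The sketch below illustrates only that principle in plain Python; the datasets and URIs are invented for illustration, and a real deployment would of course use a triple store and SPARQL rather than dicts.

```python
# Hedged sketch of URI-based integration: two independent sources key
# their records by the same (invented) URIs, so merging is a trivial join.

gene_db = {  # source 1: gene annotations
    "http://example.org/gene/BRCA1": {"symbol": "BRCA1"},
}
pathway_db = {  # source 2: pathway memberships, keyed by the same URIs
    "http://example.org/gene/BRCA1": {"pathway": "DNA repair"},
}

def integrate(*sources):
    """Merge per-URI records from any number of sources."""
    merged = {}
    for source in sources:
        for uri, record in source.items():
            merged.setdefault(uri, {}).update(record)
    return merged

merged = integrate(gene_db, pathway_db)
print(merged["http://example.org/gene/BRCA1"])
# {'symbol': 'BRCA1', 'pathway': 'DNA repair'}
```

The hard part in practice, and a recurring theme in [5] and [6], is precisely getting independent groups to agree on those shared identifiers and on what they denote—which is where ontologies come in.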

The lecture will cover a 'historical' overview and what more recent ontology adopters focus on, as well as the very basics of the data integration approaches that motivated the development of ontologies, and we shall analyse some technological issues and challenges mentioned in [5] concerning Semantic Web (and other) technologies.

References:

[1] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nature Genetics, May 2000;25(1):25-9.

[2] Barry Smith, Michael Ashburner, Cornelius Rosse, Jonathan Bard, William Bug, Werner Ceusters, Louis J. Goldberg, Karen Eilbeck, Amelia Ireland, Christopher J Mungall, The OBI Consortium, Neocles Leontis, Philippe Rocca-Serra, Alan Ruttenberg, Susanna-Assunta Sansone, Richard H Scheuermann, Nigam Shah, Patricia L. Whetzel, Suzanna Lewis. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25, 1251-1255 (2007).

[3] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. (2008). Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168.

[4] Erhard Rahm. Data Integration in Bioinformatics and Life Sciences. EDBT Summer School, Bolzano, Sep. 2007.

[5] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Wu E, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web. BMC Bioinformatics, 8, 2007.

[6] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics 2009, 10(Suppl 10):S10

Note: references 1, 2, and (5 or 6) are mandatory reading; 3 and 4 are recommended reading.

Lecture notes: lecture 8 – SWLS background and data integration

Course website