72010 SemWebTech lecture 11: BioRDF and Workflows

After considering the background of the combination of ontologies, the Semantic Web, and ‘something bio’, and some challenges and successes in the previous three lectures, we shall take a look at more technologies that are applied in the life sciences and that use SWT to a greater or lesser extent. In particular, RDF and scientific workflows will be reviewed. The former has the flavour of “let’s experiment with the new technologies”, whereas the latter is more like “where can we add SWT to the system and make things easier?”.

BioRDF

The problems of data integration were not always solved in a satisfactory manner with the ‘old’ technologies, but perhaps SWT can solve them; or so goes the idea. The past three years have seen several experiments to test whether SWT can live up to that challenge. To see where things are heading, let us recollect the data integration strategies that were reviewed in lecture 8, which can be chosen with the extant technologies as well as with the newer ones of the Semantic Web: (i) Physical schema mappings with Global As View (GAV), Local As View (LAV), or GLAV, (ii) Conceptual model-based data integration, (iii) Data federation, (iv) Data warehouses, (v) Data marts, (vi) Services-mediated integration, (vii) Peer-to-peer data integration, and (viii) Ontology-based data integration, which amounts to (i) or (ii) (possibly in conjunction with the others) through an ontology, or to linking data by means of an ontology.

Early experiments focused on RDF-izing ‘legacy’ data, such as RDBMSs, Excel sheets, HTML pages, etc., and making one large triplestore out of it, i.e., an RDF warehouse [1,2], using tools such as D2RQ and Sesame (renamed to OpenRDF) as the triple store (other triple stores are, e.g., Virtuoso and AllegroGraph, used by [3]). The Bio2RDF experiment took over 20 freely available data sources and converted them with multiple JSP programs into a total of about 163 million triples in a Sesame triplestore, added a myBio2RDF personalization step, and used extant applications to present the data to the users. The warehousing strategy, however, has some well-known drawbacks even in a non-Semantic Web setting. So, following the earlier gradual development of data integration strategies, the time had come to experiment with data federation, RDF-style [3], where the authors note at the end that perhaps the next step (services) may yield interesting results as well. You may also want to have a look at the winners’ solutions to the yearly Billion Triple Challenge and other Semantic Web challenges (all submissions, each with a paper describing the system and a demo, are filed under the ‘former challenges’ menu).
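To make the RDF-izing step a bit more concrete, below is a minimal sketch (in Python with rdflib, not the actual Bio2RDF JSP converters) of turning one row of a relational table into triples; the namespace, the property names, and the in-memory graph standing in for a Sesame-like triplestore are all assumptions for illustration only.

```python
# A minimal sketch of RDF-izing one 'legacy' record (hypothetical URIs and
# property names); a plain rdflib in-memory graph stands in for a triplestore.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/bio/")  # made-up namespace for the example

g = Graph()
g.bind("ex", EX)

# One row from, say, a relational gene table
row = {"gene_id": "GENE0001", "symbol": "HK1", "organism": "Homo sapiens"}

gene = EX[row["gene_id"]]
g.add((gene, RDF.type, EX.Gene))
g.add((gene, EX.symbol, Literal(row["symbol"])))
g.add((gene, EX.organism, Literal(row["organism"])))

# Serialise as Turtle; in a real pipeline this would be loaded into the store
print(g.serialize(format="turtle"))
```

In a real pipeline the mapping would typically be driven by a tool such as D2RQ rather than hand-written, and the resulting triples would be loaded into a dedicated triplestore (Sesame, Virtuoso, AllegroGraph) rather than kept in memory.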

One of the problems that SWT and its W3C standards aimed to solve was uniform data representation, which can be done well with RDF. Another was locating an entity and identifying it, which can be done with URIs. An emerging problem now is that, for a single entity in reality, there are many “semantically equivalent” URIs [1,3]; e.g., Hexokinase had three different URIs, one in GO, one in UniProt, and one in BioPathways (and to harmonise them, Bio2RDF added its own URI and linked it to the others using owl:sameAs). More general than the URI issue alone is the observation made by the HCLS IG’s Linking Open Drug Data group, which was also a well-known hurdle in earlier non-SWT data integration efforts: “A significant challenge … is the strong prevalence of terminology conflicts, synonyms, and homonyms. These problems are not addressed by simply making data sets available on the Web using RDF as common syntax but require deeper semantic integration.” and “For … applications that rely on expressive querying or automated reasoning deeper integration is essential” [4]. In parallel with the request for “more community practices on publishing term and schema mappings” [4], the experimentation with RDF-oriented data integration continues.
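The URI-harmonisation trick can be illustrated in a few lines of rdflib code; the URIs below are made up (they are not the actual GO, UniProt, BioPathways, or Bio2RDF identifiers) and serve only to show the owl:sameAs pattern.

```python
# A sketch of linking several (made-up) URIs for Hexokinase with owl:sameAs,
# roughly in the spirit of Bio2RDF's harmonisation of equivalent identifiers.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()

harmonised = URIRef("http://example.org/bio2rdf-style/hexokinase")  # the 'own' URI
equivalents = [
    URIRef("http://example.org/go/hexokinase"),          # stand-in for the GO URI
    URIRef("http://example.org/uniprot/hexokinase"),      # stand-in for the UniProt URI
    URIRef("http://example.org/biopathways/hexokinase"),  # stand-in for the BioPathways URI
]

for uri in equivalents:
    g.add((harmonised, OWL.sameAs, uri))

# Note: rdflib merely stores these triples; actually treating the four URIs as
# one entity requires an OWL reasoner or query rewriting on top of the store.
for s, p, o in g.triples((harmonised, OWL.sameAs, None)):
    print(s, "owl:sameAs", o)
```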

Scientific Workflows

You may have come across Business Process Modelling and workflows in government and industry; scientific workflows are an extension of that (see its background and motivation). In addition to general requirements, such as service composition, reuse of workflow design, scalability, and data provenance, it turns out in practice that such a scientific workflow system must be able to handle multiple databases and a range of analysis tools (with corresponding interfaces to a diverse range of computational environments), deal with explicit representation of knowledge at different stages, allow customization of the interface for each researcher, and support auditability and repeatability of the workflow.

To cut a long story short (in the writing here, not in the lecture on 11-1): where can we plug SWT into scientific workflows? One can, for instance, use RDF as a common data format for linking and integration and SPARQL for querying that data, OWL ontologies for the representation of the knowledge across the workflow (at least the domain knowledge and the workflow knowledge), rules to orchestrate the service execution, and services (e.g., WSDL, OWL-S) to discover useful scripts that can perform a task in the workflow.
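As a small illustration of the first two options, the sketch below records two hypothetical workflow steps as RDF and queries them with SPARQL to find which step consumes the data another step produces; the wf: vocabulary is invented for the example and is not any particular workflow system’s schema.

```python
# A sketch of RDF as the common data layer of a workflow: two hypothetical
# steps described with an invented wf: vocabulary, queried with SPARQL.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

WF = Namespace("http://example.org/workflow#")  # invented vocabulary

g = Graph()
g.bind("wf", WF)

# Step 1 wraps a BLAST service and produces a set of hits
g.add((WF.step1, RDF.type, WF.Step))
g.add((WF.step1, WF.usesService, Literal("BLAST")))
g.add((WF.step1, WF.produces, WF.hits))

# Step 2 wraps an alignment service and consumes those hits
g.add((WF.step2, RDF.type, WF.Step))
g.add((WF.step2, WF.usesService, Literal("ClustalW")))
g.add((WF.step2, WF.consumes, WF.hits))

# Which step consumes what an earlier step produced?
q = """
PREFIX wf: <http://example.org/workflow#>
SELECT ?producer ?consumer ?data WHERE {
  ?producer wf:produces ?data .
  ?consumer wf:consumes ?data .
}
"""
for r in g.query(q):
    print(r.producer, "->", r.data, "->", r.consumer)
```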

This still leaves the choice of what to do with provenance, which may be considered a component of the broader notion of trust. Recollecting the Semantic Web layer cake from lecture 1, trust sits above the SPARQL, OWL, and RIF pieces. Currently, there is no W3C standard for the trust layer, yet users need it. Scientific workflow systems, such as Kepler and Taverna, invented their own way of managing it. For instance, Taverna uses experiment-, workflow-, and knowledge-provenance models represented in RDF(S) and OWL, and RDF for the individual provenance graphs of a particular workflow [5,6]. The area of scientific workflows, provenance, and trust is lively with workshops and, e.g., the provenance challenges; at the time of writing this post, it may still be too early to identify an established solution (to, say, have interoperability across workflow systems and their components so as to weave a web of provenance), be it an SWT one or another.
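To give a flavour of RDF-based provenance, the following sketch records one made-up workflow run as a small provenance graph and queries it; the p: vocabulary is invented for the example and is not Taverna’s actual provenance model.

```python
# A sketch of recording the provenance of one (made-up) workflow run as RDF;
# the p: vocabulary is invented and is not Taverna's actual provenance model.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

P = Namespace("http://example.org/provenance#")  # invented vocabulary

g = Graph()
g.bind("p", P)

run = P.run42
g.add((run, RDF.type, P.WorkflowRun))
g.add((run, P.executedWorkflow, P.proteinAnnotationWorkflow))
g.add((run, P.startedAt, Literal("2011-01-11T10:00:00", datatype=XSD.dateTime)))
g.add((run, P.usedInput, P.uniprotRecord))
g.add((run, P.generatedOutput, P.annotatedRecord))

# Per-run graphs like this can be queried later, e.g. for auditability:
q = """
PREFIX p: <http://example.org/provenance#>
SELECT ?run WHERE { ?run p:usedInput p:uniprotRecord . }
"""
for r in g.query(q):
    print("run that used the input:", r.run)
```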

Probably there will not be enough time during the lecture to also cover Semantic Web Services. In case you are curious how one can efficiently search among the thousands of web services, and how they are used in working systems (i.e., application-oriented papers, not the theory behind it), you may want to have a look at [7,8] (the latter is lighter on the bio-component than the former). The W3C activities on web services have standards, working groups, and an interest group.

References

[1] Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: Towards a mashup to build bioinformatics knowledge systems. Journal of Biomedical Informatics, 2008, 41(5):706-716. Online interface: Bio2RDF

[2] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Wu E, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web. BMC Bioinformatics, 2007, 8(Suppl 3):S2.

[3] Kei-Hoi Cheung, H Robert Frost, M Scott Marshall, Eric Prud’hommeaux, Matthias Samwald, Jun Zhao, and Adrian Paschke. A journey to Semantic Web query federation in the life sciences. BMC Bioinformatics, 2009, 10(Suppl 10):S10.

[4] Anja Jentzsch, Bo Andersson, Oktie Hassanzadeh, Susie Stephens, Christian Bizer. Enabling Tailored Therapeutics with Linked Data. LDOW2009, April 20, 2009, Madrid, Spain.

[5] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat and Peter Li. (2004). Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20 (17): 3045-3055. The Taverna website

[6] Carole Goble et al. Knowledge Discovery for biology with Taverna. In: Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences. 2007, pp 355-395.

[7] Michael DiBernardo, Rachel Pottinger, and Mark Wilkinson. (2008). Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework. Journal of Biomedical Informatics, 41(5): 837-847.

[8] Sahoo, S.S., Sheth, A., Hunter, B., and York, W.S. SEMbrowser: semantic biological web services registry. In: Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, Baker, C.J.O., Cheung, K.-H. (eds), Springer: New York, 2007, pp 317-340.

Note: references 1 and (5 or 6) are mandatory reading, (2 or 3) was mandatory for an earlier lecture, and 4, 7, and 8 are optional.

Lecture notes: lecture 11 – BioRDF and scientific workflows

Course website
