Working towards WONDER Data

Duncan mentioned, in a comment on his recent SciFoo invitation, his “Google and the Semantic, Satanic, Romantic Web” post, where he describes and summarises an encounter between the pro-Semantic Web Tim Berners-Lee and the ‘anti’-Semantic Web (or should I say realistic?) Peter Norvig, Director of Research at Google. I quote a relevant section here, with some changes in emphasis:

Norvig: People are stupid: […] this is the world, imperfect and messy and we just have to deal with it. These same people can’t be expected to use the Resource Description Framework (RDF) and the Web Ontology Language (OWL), which are much more complicated and considerably less fool-proof. (Perhaps you could call this the dumb-antic web?!)

Berners-Lee: replied that a large part of the semantic web can be populated by taking existing relational databases and mapping them into RDF/OWL. The structured data is already there, it just needs web-izing in a mashup-friendly format. (What I like to call the romantic web: people will publish their data freely on the web this way, especially in e-science for example. This will allow sharing and re-use in unexpected ways.)

While Duncan looks at the openness of data, here I want to put the focus on the claim that you can reuse relational databases just like that and map them into RDF/OWL. Positively described, that is a romantic assumption; negatively described, it is rather naïve and more painful to realise than it sounds. Well, if the database developers had remained faithful to what they learned in college, then it might have worked out, to some extent at least. So let us for a moment ignore the issues of data duplication, violations of integrity constraints, hacks, outdated imports from other databases to fill a boutique database, outdated conceptual data models (if there was one), and what have you. Even then it is not trivial.

To do this ‘easy mapping’, one has to start over with the data analysis and add some new requirements analysis while one is at it. First, some data in the database (DB), mathematically speaking instances, are actually thought of by the users as concepts/universals/classes. For instance, the GO terms in the GO database are assumed to represent universals and are used to annotate instances in other tables of some database; and let us not go into the ontological status of species (stored as instances in the database of the NCBI taxonomy). Second, each tuple is assumed to denote an instance and, by virtue of key definitions, to be unique in that table, but such a tuple has values in each cell of the participating columns, and those values are not the objects that an OWL ABox assumes it is dealing with (this is known as the impedance mismatch). So, once we have divided the data in the DB into instances-but-actually-concepts-that-should-become-OWL-classes and real-instances-that-should-become-OWL-instances, we still need to convert the real instances of the DB to objects in the ontology, where some function has to be used to convert (combinations of) values into objects proper.
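To make that last step a little more concrete, here is a minimal sketch of such a value-to-object conversion (skolemisation); the table and column names are entirely hypothetical, not those of any actual database:

```python
# A minimal sketch of the value-to-object conversion (skolemisation) described
# above. Table and column names are hypothetical, not those of an actual DB.

def gene_object_iri(organism_id: str, locus_tag: str) -> str:
    """Build an object identifier from a combination of key values.

    A tuple in a (hypothetical) Gene table is identified by its key values;
    the ABox needs objects, so a function maps that value combination to an IRI.
    """
    return f"http://example.org/hgt#gene/{organism_id}/{locus_tag}"

# One row of the hypothetical Gene table ...
row = {"organism_id": "NC_000913", "locus_tag": "b0001", "length": 255}

# ... becomes an ABox assertion about an object, not about the raw values:
assertion = ("Gene", gene_object_iri(row["organism_id"], row["locus_tag"]))
print(assertion)  # ('Gene', 'http://example.org/hgt#gene/NC_000913/b0001')
```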

For one experiment we are working on here at FUB, we have the HGT-DB with about 1.7 million genes of about 500 bacteria, and all sorts of data about each one of them (tables with 15-20 columns, some with instance data, some with type-level information like the function of the gene product). Try to load this data into Protégé’s ABox. Obviously, we do not; more about that further below.

What, you may ask, about reusing the physical DB schema and, if present, the conceptual data model (in ER, EER, UML, ORM, …)? A good question that more people have asked; indeed, lots of research has been done in that area, primarily under the banner of reverse engineering and extracting ‘ontologies’ from such schemas, where it was noted that extra ‘ontology enrichment’ steps were necessary (see e.g. Lina Lubyte’s work). A fundamental problem with such reverse engineering is that, even assuming there once was a fully normalised conceptual data model, oftentimes denormalisation steps have been carried out to flatten the database structure and improve performance, which, if simply reverse engineered, ends up in the ‘ontology’ as a class with umpteen attributes (one for each column). This is not nice, and the automated reasoning one can do with it is minimal, if possible at all. Put differently: if we stick to a flat structure that is poor in subject domain semantics, then why bother with the automated reasoning machinery of the Semantic Web?

To mitigate this, one can redo the normalisation steps to try to get some structure back into the conceptual view of the data, or perhaps add a section of another ontology to brighten up the ‘ontology’ into an ontology (or, if you are lucky and there was a conceptual data model, use that one). We did that for the HGT-DB’s conceptual model, manually; an early diagram and its import into Protégé are included in the appendix of [1].
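By way of a toy illustration of the previous two paragraphs (with made-up column names, not the actual HGT-DB schema): naive reverse engineering keeps the flattened structure as one class with an attribute per column, whereas redoing the normalisation separates the entities out again so they can become interrelated classes.

```python
# Toy illustration only: made-up columns, not the actual HGT-DB schema.

# What naive reverse engineering of the denormalised table yields: one class
# with an attribute per column, organism and gene-product data flattened in.
flat_gene = {
    "locus_tag": "b0001",
    "organism_name": "Escherichia coli K-12",   # really about the Organism
    "organism_gc_content": 50.8,                # ditto
    "product_function": "thr operon leader",    # really about the GeneProduct
    "gc_content": 52.3,
    "is_hgt": False,
}

# What redoing the normalisation separates out again: three related entities
# that can become three interrelated classes in the ontology.
organism = {"name": "Escherichia coli K-12", "gc_content": 50.8}
gene_product = {"function": "thr operon leader"}
gene = {
    "locus_tag": "b0001",
    "gc_content": 52.3,
    "is_hgt": False,
    "of_organism": organism,
    "has_product": gene_product,
}
```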

In any case, having more structure in the ‘ontology’ than in the DB, one ends up defining multiple views over the DB, i.e., an external ABox, where a part of a table holds the instances of an OWL class. (How to do this in the OWL ABox itself, I do not know; we have databases that are too large to squeeze into the knowledge base.) In turn, this requires a mechanism to link an OWL class persistently to a SQL or SPARQL query over the DB. (One can argue whether this DB should be the legacy relational database or an RDF-ized version of it; I ignore that debate for now.)
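As a rough sketch, under assumed table and column names, such a persistent link boils down to three ingredients: a class, a query that acts as a view over the source, and the value-to-object function from above. (The actual mapping syntax of OBDA tooling differs, but the ingredients are the same.)

```python
# A sketch, under assumed table/column names, of the kind of persistent link
# between an OWL class and a SQL query over the legacy database.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Mapping:
    owl_class: str                       # class in the ontology
    sql_query: str                       # view over the relational source
    make_object: Callable[[dict], str]   # value combination -> object IRI

hgt_gene_mapping = Mapping(
    owl_class="HorizontallyTransferredGene",
    sql_query="""
        SELECT organism_id, locus_tag
        FROM gene
        WHERE is_hgt = TRUE
    """,
    make_object=lambda r: f"http://example.org/hgt#gene/{r['organism_id']}/{r['locus_tag']}",
)

# Each row the query returns yields one instance of the class:
sample_row = {"organism_id": "NC_000913", "locus_tag": "b0004"}
print(hgt_gene_mapping.owl_class, hgt_gene_mapping.make_object(sample_row))
```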

After doing all that, one has contributed one’s proverbial ‘2 cents’, which cost ‘blood, sweat and tears’ (maybe the latter is just Dutch idiom), to populating the Semantic Web.

But what can one really do with it?

The least one can do is to make querying the database easier, so that users do not have to learn yet another query language. Earlier technologies in that direction were called query-by-diagram and conceptual queries; a newer term for the same idea is Ontology-Based Data Access (OBDA), which uses Semantic Web technologies. Then one can add reasoner-facilitated query completion to guarantee that the user asks something the system can answer (e.g. [2]). Having the reasoner anyway, one might as well use it for more sophisticated queries that are not easily, or not at all, possible with traditional database systems. One of them is using terms in the query for which there is no data in the database, of which several examples were described in a case study [3] (and summarised). For the HGT-DB, these are queries involving the species taxonomy and gene product functions.
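To give a rough idea of why a query with a term that has no data in the database can still return answers, here is a toy sketch with an invented mini-taxonomy (not the actual ontology): the reasoner rewrites the term used in the query into the subsumed classes that do have mappings to the data.

```python
# Toy sketch: why a query term with no data in the database can still return
# answers. The mini-taxonomy and class names are invented for illustration.

# fragment of a species taxonomy asserted in the ontology (child -> parent)
subclass_of = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Gammaproteobacteria",
}

# only the species themselves are mapped to data in the database
mapped_classes = {"Escherichia coli", "Salmonella enterica"}

def classes_answering(query_class):
    """All mapped classes subsumed by the class used in the query."""
    answers = set()
    for cls in set(subclass_of) | mapped_classes:
        node = cls
        while node is not None:
            if node == query_class:
                answers.add(cls)
                break
            node = subclass_of.get(node)
    return answers & mapped_classes

# 'Gammaproteobacteria' never occurs in the database, yet the query can be
# unfolded into SQL over the species that do occur:
print(classes_answering("Gammaproteobacteria"))
# {'Escherichia coli', 'Salmonella enterica'}
```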

Another useful addition with respect to the ‘legacy’ (well, currently operational) HGT-DB is that our domain experts, upon having seen the conceptual view of the database, came up with all sorts of other sample queries they were thinking of, but where the knowledge was not yet explicitly represented in the ontology even though one can retrieve the data from the database. For instance, adjacent or nearby genes that are horizontally transferred, or clusters of such genes that are permitted to have a gap between them consisting of non-coding DNA or of a non-horizontally transferred gene (a sketch of this idea follows below). Put differently, one can do a sophisticated analysis of one’s data and unlock new information from the database by using the ontology-based approach.

In our enthusiasm, we have called the experiment Web ONtology mediateD Extraction of Relational data (WONDER) for Web-based ObdA with the Hgt-db (WOAH!). We have the tools, such as QuOnto for scalable reasoning and the OBDA Plugin for Protégé for managing the mappings between an OWL class, the SQL query over the database, and the transformation function (skolemisation) from values to objects. The last step to make it all Web-usable, from a technical point of view at least, is the Web-based ontology browser and graphical query builder. This interface is well along in the pipeline, with a first working version sent out for review by our domain experts. One of them thought that it looked a bit simplistic; so perhaps we achieved more than we bargained for, in that the AI and engineering behind it did their work well, from a user perspective at least.
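Returning to the adjacent-genes example mentioned above, here is an illustrative, stand-alone sketch of the kind of cluster detection involved; this is not the WONDER implementation, and the gene order, locus tags, and HGT flags are made-up input.

```python
# Illustrative sketch only (not the WONDER implementation) of the 'clusters of
# horizontally transferred genes with a tolerated gap' query the domain
# experts asked for. Gene order, locus tags, and HGT flags are made-up input.

def hgt_clusters(genes, max_gap=1):
    """Group genes (ordered by chromosomal position) into clusters of
    horizontally transferred genes, tolerating up to `max_gap` intervening
    non-HGT (or non-coding) positions."""
    clusters, current, gap = [], [], 0
    for gene in genes:
        if gene["is_hgt"]:
            current.append(gene)
            gap = 0
        elif current:
            gap += 1
            if gap > max_gap:
                if len(current) > 1:
                    clusters.append(current)
                current, gap = [], 0
    if len(current) > 1:
        clusters.append(current)
    return clusters

genes = [
    {"locus_tag": "b0001", "is_hgt": True},
    {"locus_tag": "b0002", "is_hgt": True},
    {"locus_tag": "b0003", "is_hgt": False},  # gap of one is tolerated
    {"locus_tag": "b0004", "is_hgt": True},
    {"locus_tag": "b0005", "is_hgt": False},
    {"locus_tag": "b0006", "is_hgt": False},  # gap too large, cluster closed
]
print([[g["locus_tag"] for g in c] for c in hgt_clusters(genes)])
# [['b0001', 'b0002', 'b0004']]
```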

More automation of all those steps to get it working, however, will be a welcome addition from the engineering side. Until then, Norvig’s down-to-earth comment is closer to reality than Berners-Lee’s vision.

[1] R. Alberts, D. Calvanese, G. De Giacomo, A. Gerber, M. Horridge, A. Kaplunova, C. M. Keet, D. Lembo, M. Lenzerini, M. Milicic, R. Moeller, M. Rodríguez-Muro, R. Rosati, U. Sattler, B. Suntisrivaraporn, G. Stefanoni, A.-Y. Turhan, S. Wandelt, M. Wessel. Analysis of Test Results on Usage Scenarios. Deliverable TONES-D27 v1.0, Oct. 10 2008.

[2] Paolo Dongilli, Enrico Franconi (2006). An Intelligent Query Interface with Natural Language Support. FLAIRS Conference 2006: 658-663.

[3] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED’08). 26-27 Oct. 2008, Karlsruhe, Germany.
