A useful abstract relational model and SQL path queries

Whilst visiting David Toman at the University of Waterloo during my sabbatical earlier this year, one of the topics we looked into was their experiments on whether their SQLP (SQL with path queries, extended from [1]) would be better than plain SQL in terms of the time it takes to understand queries and the correctness in writing them. It turned out (in a user evaluation) that it’s faster with SQLP whilst maintaining accuracy. The really interesting aspect in all this from my perspective, however, was the so-called Abstract Relational Model (ARM), or: the modelling side of things rather than the querying side, as it is the ARM that makes the querying easier. In simple terms, the ARM [1] is like the relational model, but with identifiers, which makes those path queries doable and mostly more succinct, and one can partition the relations into class-relationship-like models (approaching the look-and-feel of a conceptual model) or lump stuff together into relational-model-like models, as preferred. Interestingly, it turns out that the queries remain exactly the same regardless of whether one makes the ARM look more relational-like or ontology-like, which is called “invariance under vertical partitioning” in the paper [2]. Given all these nice things, there’s now also an algorithm to go from the usual relational model to an ARM schema, so that even if one has legacy resources, it’s possible to bump them up to this newer technology with more features and ease of use.
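To give a rough feel for why identifiers make path queries natural, here is a minimal Python sketch. It is not SQLP syntax and not the ARM formalism from [1,2]; the tables, attribute names, and data are all made up for illustration. The only idea it borrows is that every record has an abstract identifier and that an attribute may hold the identifier of another record, so that a dotted path can be followed without spelling out the joins.

```python
# Toy illustration only: table names, attributes, and data are invented.
# Each record has an abstract identifier (here the dict key); an attribute
# holds either a plain value or the identifier of another record.
db = {
    "student": {"s1": {"name": "Ada", "advisor": "p1"}},
    "professor": {"p1": {"name": "Turing", "dept": "d1"}},
    "department": {"d1": {"name": "Computer Science"}},
}

# Which table an identifier-valued attribute points into (hypothetical schema).
refs = {
    ("student", "advisor"): "professor",
    ("professor", "dept"): "department",
}

def follow(table, oid, path):
    """Follow an attribute path such as 'advisor.dept.name' from one record."""
    value = oid
    for attr in path.split("."):
        next_value = db[table][value][attr]
        # Move to the referenced table if this attribute is a reference.
        table = refs.get((table, attr), table)
        value = next_value
    return value

print(follow("student", "s1", "advisor.dept.name"))
```

In a plain relational setting, the same question would need two explicit joins (student to professor to department); with the path, the joins are implied by the references.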

Our paper [2] that describes these details (invariance, RM-to-ARM, the evaluation), entitled “The Utility of the Abstract Relational Model and Attribute Paths in SQL”, is being published as part of the proceedings of the 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’18), which will be held in Nancy, France, in about two weeks.

This sort of Conceptual Model(like)-based Data Access (CoMoDA, if you will) may sound a bit like Ontology-Based Data Access (OBDA). Yes and No. Roughly, yes on the conceptual querying sort of thing (there’s still room for quite some hair splitting there, though); no regarding the ontology-based sort of thing. The ARM doesn’t pretend to be an ontology, but easily has a reconstruction in a Description Logic language [3] (with n-aries! and identifiers!). SQLP is much more expressive than the union of conjunctive queries one can pose in a typical OBDA setting, however, for it is full SQL + those path queries. So, both the theory and technology are different from the typical OBDA setting. Now, don’t think I’m defecting on the research topics (I still have a whole chapter on OBDA in my textbook), but it’s interesting to learn about and play with alternative approaches toward solutions to (at a high level) the same problem of trying to make querying for information easier and faster.

 

References

[1] Borgida, A., Toman, D., Weddell, G.E. On referring expressions in information systems derived from conceptual modelling. Proc. of ER’16. Springer LNCS, vol. 9974, 183-197.

[2] Ma, W., Keet, C.M., Olford, W., Toman, D., Weddell, G. The Utility of the Abstract Relational Model and Attribute Paths in SQL. 21st International Conference on Knowledge Engineering and Knowledge Management (EKAW’18). Springer LNAI. (in print). 12-16 Nov. 2018, Nancy, France.

[3] Jacques, J.S., Toman, D., Weddell, G.E. Object-relational queries over CFDInc knowledge bases: OBDA for the SQL-Literate. Proc. of IJCAI’16. 1258-1264 (2016)


OBDA/I Example in the Digital Humanities: food in the Roman Empire

A new installment of the Ontology Engineering module is about to start for the computer science honours students who selected it, so, in preparation, I was looking around for new examples of what ontologies and Semantic Web technologies can do for you, and that are at least somewhat concrete. One of those examples has an accompanying paper that is about to be published (can it be more recent than that?), which is on the production and distribution of food in the Roman Empire [1]. Although perhaps not many people here in South Africa might care about what happened in the Mediterranean basin some 2000 years ago, it is a good showcase of what one perhaps also could do here with the historical and archeological information (e.g., an inter-university SA project on digital humanities started off a few months ago, and several academics and students at UCT contribute to the Bleek and Lloyd Archive of |xam (San) cultural heritage, among others). And the paper is (relatively) very readable also to the non-expert.

 

So, what is it about? Food was stored in pots (more precisely: amphorae) that had engravings on them with text about who, what, where, etc., and a lot of that has been investigated, documented, and stored in multiple resources, such as databases. None of the resources covers all data points, but to advance research and understanding about it and about food trading systems in general, the data has to be combined somehow and made easily accessible to the domain experts. That is, essentially it is an instance of a data access and integration problem.

The usual approach to address that is an Extract-Transform-Load of each separate resource into one database or digital library, and then putting a web-based front-end on top of it. There are many shortcomings to that solution, such as having to repeat the ETL procedure upon updates in the source database, a single control point, and an interface that typically offers only canned (i.e., fixed) queries. A more recent approach, of which the technologies finally are maturing, is Ontology-Based Data Access (OBDA) and Ontology-Based Data Integration (OBDI). I say “finally” here, as I still very well can remember the predecessors we struggled with some 7-8 years ago [2,3] (and, informally, in earlier posts on this blog), and “maturing”, as the software has become more stable, has more features, and some of the things we had to do manually back then have been automated now. The general idea of OBDA/I applied to the Roman Empire Food system is shown in the figure below.

OBDA in the EPnet system (Source: [1])

There are the data sources, which are federated (one ‘middle layer’, though still at the implementation level). The federated interface has mapping assertions to elements in the ontology. The user then can use the terms of the ontology (classes and their relations and attributes) to query the data, without having to know about how the data is stored and without having to write page-long SQL queries. For instance, a query “retrieve inscriptions on amphorae found in the city of ‘Mainz’ containing the text ‘PNN’” would use just the terms in the ontology, say, Inscription, Amphora, City, found in, and inscribed on, and any value constraint added (like the PNN), and the OBDA/I system takes care of the rest.

Interestingly, the authors of [1]—admitted, three of them are former colleagues from Bolzano—used the same approach to setting up the ontology component as we did for [3]. While we will use the Protégé Ontology Development Environment in the OE module, it is not the best modelling tool to overcome the knowledge acquisition bottleneck. The authors modelled together with the domain experts in the much more intuitive ORM language and tool NORMA, and first represented whatever needed to be represented. This included also reuse of relevant related ontologies and non-ontology material, and modularizing it for better knowledge management and thereby ameliorating cognitive overload. A subset of the resultant ontology was then translated into the Web Ontology Language OWL (more precisely: OWL 2 QL, a tractable profile of OWL 2 DL), which is actually used in the OBDA system. We did that manually back then; now this can be done automatically (yay!).

Skipping here over the OBDI part and considering it done, the third main step in setting up an OBDA system is to link the data to the elements in the ontology. This is done in the mapping layer, where each mapping is essentially of the form “TermInTheOntology <- SQLqueryOverTheSource”. Abstracting from the current syntax of the OBDA system and simplifying the query for readability (see the real one in the paper), an example would thus have the following makeup to retrieve all Dressel 1 type of amphorae, named Dressel1Amphora in the ontology, in all the data sources of the system:

Dressel1Amphora <-
    SELECT ic.id
    FROM ic JOIN at ON at.carrier = ic.id
    WHERE at.type = 'DR1'

Or some such SQL query (typically larger than this one). This takes up a bit of time to do, but has to be done only once, for these mappings are stored in a separate mapping file.
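To make the “term <- SQL query” idea a bit more tangible, here is a small, self-contained Python sketch using SQLite. It is not the mapping syntax of the OBDA system used in [1]; the table and column names follow the simplified example above, and the rows, the `MAPPINGS` dictionary, and the `instances_of` function are invented for illustration.

```python
import sqlite3

# Toy source database: table and column names follow the simplified example,
# the rows are invented. The table name "at" is quoted and aliased to stay
# on the safe side with SQL keywords.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ic (id INTEGER PRIMARY KEY);
    CREATE TABLE "at" (carrier INTEGER, type TEXT);
    INSERT INTO ic (id) VALUES (1), (2), (3);
    INSERT INTO "at" (carrier, type) VALUES (1, 'DR1'), (2, 'DR20'), (3, 'DR1');
""")

# Mapping layer: ontology term -> SQL query over the source. It is kept apart
# from both the ontology and the data, as a separate mapping file would be.
MAPPINGS = {
    "Dressel1Amphora": """
        SELECT ic.id
        FROM ic JOIN "at" AS t ON t.carrier = ic.id
        WHERE t.type = 'DR1'
    """,
}

def instances_of(term):
    """Return the identifiers that the mapping assigns to an ontology term."""
    rows = conn.execute(MAPPINGS[term]).fetchall()
    return sorted(row[0] for row in rows)

print(instances_of("Dressel1Amphora"))  # [1, 3]
```

The caller only ever names the ontology term; the ‘DR1’ abbreviation and the join stay hidden inside the mapping.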

The domain expert, then, when wanting to know about the Dressel1 amphorae in the system, would have to ask only ‘retrieve all Dressel1 amphorae’ rather than create the SQL query, and can thus remain oblivious to which tables and columns are involved in obtaining the answer, and to the fact that some data entry person at some point had mysteriously decided not to use ‘Dressel1’ but his own abbreviation ‘DR1’.

The actual ‘retrieve all Dressel1 amphorae’ is then a SPARQL query over the ontology, e.g.,

SELECT ?x WHERE { ?x rdf:type :Dressel1Amphora . }

which is surely shorter and therefore easier to handle for the domain expert than the SQL one. The OBDA system (-ontop-) takes this query and reasons over the ontology to see if the query can be answered directly by it without consulting the data, or else can be rewritten given the other knowledge in the ontology (it can, see example 5 in the paper). The outcome of that process then consults the relevant mappings. From that, the whole SQL query is constructed, which is sent to the (federated) data source(s), which processes the query as any relational database management system does, and returns the data to the user interface.
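The rewrite-then-unfold pipeline described above can be sketched in a few lines of Python. This is a deliberately naive toy, not how -ontop- is implemented: it handles only atomic class queries and subclass axioms, and the class names other than Dressel1Amphora, the table `amphora_type`, and all data are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE amphora_type (carrier INTEGER, type TEXT);
    INSERT INTO amphora_type VALUES (1, 'DR1'), (2, 'DR20'), (3, 'DR1');
""")

# Hypothetical subclass axioms (child -> parent) and mappings (term -> SQL).
SUBCLASS_OF = {"Dressel1Amphora": "Amphora", "Dressel20Amphora": "Amphora"}
MAPPINGS = {
    "Dressel1Amphora": "SELECT carrier FROM amphora_type WHERE type = 'DR1'",
    "Dressel20Amphora": "SELECT carrier FROM amphora_type WHERE type = 'DR20'",
}

def rewrite(term):
    """Step 1: expand a class into itself plus all its (transitive) subclasses."""
    terms = [term]
    for child, parent in SUBCLASS_OF.items():
        if parent == term:
            terms.extend(rewrite(child))
    return terms

def unfold(terms):
    """Step 2: replace each mapped term by its SQL and combine with UNION."""
    return " UNION ".join(MAPPINGS[t] for t in terms if t in MAPPINGS)

def answer(term):
    """Step 3: send the constructed SQL to the source and collect the answers."""
    return sorted(row[0] for row in conn.execute(unfold(rewrite(term))))

print(answer("Amphora"))
```

Note that Amphora has no mapping of its own here; its answers come entirely from the knowledge in the (toy) ontology that both Dressel types are amphorae.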

 

It is, perhaps, still unpleasant that domain experts have to put up with another query language, SPARQL, as the paper notes as well. Some efforts have gone into sorting out that ‘last mile’, such as using a (controlled) natural language to pose the query or reusing that original ORM diagram in some way, but more needs to be done. (We tried the latter in [3]; that proof-of-concept worked with a neutered version of ORM and we have screenshots and videos to prove it. In working on extensions and improvements, however, a new student uploaded buggy code onto the production server, so that online source doesn’t work anymore, and we didn’t roll back and reinstall an older version, with me having moved to South Africa and the original student-developer, Giorgio Stefanoni, away studying for his MSc.)

 

Note to OE students: This is by no means all there is to OBDA/I, but hopefully it has given you a bit of an idea. Read at least sections 1-3 of paper [1], and if you want to do an OBDA mini-project, then read also the rest of the paper and then Chapter 8 of the OE lecture notes, which discusses in a bit more detail the motivations for OBDA and the theory behind it.

 

References

[1] Calvanese, D., Liuzzo, P., Mosca, A., Remesal, J., Rezk, M., Rull, G. Ontology-Based Data Integration in EPNet: Production and Distribution of Food During the Roman Empire. Engineering Applications of Artificial Intelligence, 2016. To appear.

[2] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED 2008), 26-27 Oct. 2008, Karlsruhe, Germany.

[3] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC 2010), March 22-26 2010, Sierre, Switzerland. ACM Proceedings, pp1389-1396.

Lecture notes for the ontologies and knowledge bases course

The regular reader may recollect earlier posts about the ontology engineering courses I have taught at FUB, UH, UCI, Meraka, and UKZN. Each one had some sort of syllabus or series of blog posts with some introductory notes. I’ve put them together and extended them significantly now for the current installment of the Ontologies and Knowledge Bases Honours module (COMP718) at UKZN, and they are bound and printed into lecture notes for the enrolled students. These lecture notes are now online and I will add accompanying slides on the module’s webpage as we go along in the semester.

Given that the target audience is computer science students in their 4th year (honours), the notes are of an introductory nature. There are essentially three blocks: logic foundations, ontology engineering, and advanced topics. The logic foundations contain a recap of FOL, basics of Description Logics with ALC, all the DL-based OWL species, and some automated reasoning. The ontology engineering block covers top-down and bottom-up ontology development, and methods and methodologies, with top-down ontology development including mainly foundational ontologies and part-whole relations, and bottom-up the various approaches to extract knowledge from ‘legacy’ representations, such as from databases and thesauri. The advanced topics are balanced in two directions: one is toward ontology-based data access applications (i.e., an ontology-driven information system) and the other one has more theory with temporal ontologies.

Each chapter has a section with recommended/required reading and a set of exercises.

Unsurprisingly, the lecture notes have been written under time constraints and therefore the level of relative completeness of sections varies slightly. Suggestions and corrections are welcome!

Automating approximations in DL-Lite ontologies

As the avid keet blog reader or attendee of one of my ontology engineering courses may remember, I politely aired my frustration about having an OWL 2 DL ontology that needs to be ‘slimmed’ to a DL-Lite (roughly: OWL 2 QL) one to make it useable for Ontology-Based Data Access (OBDA), already since the experiment with the ontology/OBDA for disabilities [1]. This is a difficult and time-consuming exercise to do manually, especially when one has to go back and forth between the slimmed and expressive version of the ontology. Back in 2008, the difficulties were due both to a flaky Protégé 4.0-alpha and a mere syntactic approximation. Finally, things have improved and a preliminary semantic approximation is available [2] (and recently presented at AIMSA’10), which was developed by my colleagues at the KRDB Research centre.

Well, ok, only some aspects of the sound and complete approximations are addressed (more precisely: chains of existential role restrictions) and for DL-LiteA only, but they have been implemented already. The implementations are available in three forms: a Java API, a command line application suitable for batch approximations, and as a plug-in for Protégé 4.0. Note though, that the approximation algorithm is exponential, so with a large ontology it might take some time to simplify the expressive ontology. I did not test this myself yet, however, so if you have any comments or suggestions, please contact the authors of [2] directly. More is in the pipeline, and I am looking forward to more of such results—sure, this is with some self-interest: it will ease not only transparent, coordinated ontology management and development of ontology-driven information systems, but also facilitate implementation scenarios for rough ontologies [3].

References

[1] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED’08). 26-27 Oct. 2008, Karlsruhe, Germany.

[2] Elena Botoeva, Diego Calvanese, and Mariano Rodriguez-Muro. Expressive Approximations in DL-Lite Ontologies. Proc. of the 14th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA’10). Sept 8-10, 2010, Varna, Bulgaria.

[3] Keet, C.M. Ontology engineering with rough concepts and instances. 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW’10). 11-15 October 2010, Lisbon, Portugal. Springer LNAI 6317, 507-517.

72010 SemWebTech lecture 9: Successes and challenges for ontologies in the life sciences

To be able to talk about successes and challenges of SWT for health care and life sciences (or any other subject domain), we first need to establish when something can be deemed a success, when it is a challenge, and when it is an outright failure. Such measures can be devised in an absolute sense (compare technology x with an SWT one: does it outperform on measure y?) and in a relative sense (to whom is technology x deemed successful?). Given these considerations, we shall take a closer look at several attempts, being two successes and a few challenges in representation and reasoning. For the successes: what were the problems and how were they solved? For the challenges: what are the problems and can they be resolved?

As success stories we take the experiments by Wolstencroft and coauthors about classifying protein phosphatases [1] and Calvanese et al for graphical, web-based, ontology-based data access applied to horizontal gene transfer data [2]. They each focus on different ontology languages and reasoning services to solve different problems. What they have in common is that there is an interaction between the ontology and instances (and that it was a considerable amount of work by people with different specialties): the former focuses on classifying instances and the latter on querying instances. In addition, modest results of biological significance have been obtained with the classification of the protein phosphatases, whereas with the ontology-based data analysis we are tantalizingly close.

The challenges for SWT in general and for HCLS in particular are quite diverse, of which some concern the SWT proper and others are by its designers—and W3C core activities on standardization—considered outside their responsibility but still need to be done. Currently, for the software aspects, the onus is put on the software developers and industry to pick up on the proof-of-concept and working-prototype tools that have come out of academia and to transform them into the industry-grade quality that a widespread adoption of SWT requires. Although this aspect should not be ignored, we shall focus on the language and reasoning limitations during the lecture.

In addition to the language and corresponding reasoning limitations that were reviewed in the lectures on OWL, there are language “limitations” discussed and illustrated at length in various papers, with the most recent take in [3], where it might well be that the extensions presented in lectures 6 and 7 (parts, time, uncertainty, and vagueness) can ameliorate or perhaps even solve the problem. Some of the issues outlined by Schulz and coauthors are ‘mere’ modelling pitfalls, whereas others are real challenges that can be approximated to a greater or lesser extent. We shall look at several representation issues that go beyond the earlier examples of SNOMED CT’s “brain concussion without loss of consciousness”; e.g., how would you represent in an ontology that in most but not all cases hepatitis has as symptom fever, or how would you formalize the defined concept “Drug abuse prevention”, and (provided you are convinced it should be represented in an ontology) that the world-wide prevalence of diabetes mellitus is 2.8%?

Concerning challenges for automated reasoning, we shall look at two of the nine identified required reasoning scenarios [4], being the “model checking (violation)” and “finding gaps in an ontology and discovering new relations”, thereby reiterating that it is the life scientists’ high-level goal-driven approach and desire to use OWL ontologies with reasoning services to, ultimately, discover novel information about nature. You might find it of interest to read about the feedback received from the SWT developers upon presenting [4] here: some requirements are met in the meantime and new useful reasoning services were presented.

References

[1] Wolstencroft, K., Stevens, R., Haarslev, V. Applying OWL reasoning to genomic data. In: Semantic Web: revolutionizing knowledge discovery in the life sciences, Baker, C.J.O., Cheung, H. (eds), Springer: New York, 2007, 225-248.

[2] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland.

[3] Stefan Schulz, Holger Stenzhorn, Martin Boekers and Barry Smith. Strengths and Limitations of Formal Ontologies in the Biomedical Domain. Electronic Journal of Communication, Information and Innovation in Health (Special Issue on Ontologies, Semantic Web and Health), 2009.

[4] Keet, C.M., Roos, M. and Marshall, M.S. A survey of requirements for automated reasoning services for bio-ontologies in OWL. Third international Workshop OWL: Experiences and Directions (OWLED 2007), 6-7 June 2007, Innsbruck, Austria. CEUR-WS Vol-258.

[5] Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, Doherty D, Forsberg K, Gao Y, Kashyap V, Kinoshita J, Luciano J, Scott Marshall M, Ogbuji C, Rees J, Stephens S, Wong GT, Elizabeth Wu, Zaccagnini D, Hongsermeier T, Neumann E, Herman I, Cheung KH. Advancing translational research with the Semantic Web, BMC Bioinformatics, 8, 2007.

p.s.: the first part of the lecture on 21-12 will be devoted to the remaining part of last week’s lecture; that is, a few discussion questions about [5] that are mentioned in the slides of the previous lecture.

Note: references 1 and 3 are mandatory reading, 2 and 4 recommended to read, and 5 was mandatory for the previous lecture.

Lecture notes: lecture 9 – Successes and challenges for ontologies

Course website

The WONDER system for ontology browsing and graphical query formulation

Did you ever not want to bother knowing how the data is stored in a database, but simply want to know what kind of things are stored in the database at, say, the conceptual or ontological layer of knowledge? And did you ever not want to bother writing queries in SQL or SPARQL, but have a graphical point-and-click interface with which you can compose a query using that layer of knowledge, with the system automatically generating the SQL/SPARQL query for you, in the correct syntax? And all that not with a downloaded desktop application but in a Web browser?

Our domain experts in genetics as well as in healthcare informatics, at least, wanted that. We have designed and implemented it now [1], which we have enthusiastically named Web ONtology mediateD Extraction of Relational data (WONDER). Moreover, we have a working system for the use case about the 4GB horizontal gene transfer database [2] and its corresponding ‘application ontology’. (pdf)

Subscribers to this blog might remember I mentioned that we were working towards this goal, using Ontology-Based Data Access tools to access a database through an ontology and learning from (and elaborating on) its preliminary case studies [3]. In short, we added a usability extension to the OBDA implementations so that not only savvy Semantic Web engineers can use it, but also, and more importantly, the domain experts who want to get information from their database(s). By building upon the OBDA framework [4], we can avail of its solid formal foundations; that is, WONDER is not merely a software application, but there is a logic-based representation behind both the graphics in the ontology browser and the query pane.

In addition, WONDER is scalable because the ontology language (roughly: OWL 2 QL) is ‘simple’. Yes, we had to drop a few things from the original ORM conceptual model, but they have—at least for our case study—no effect on querying the data. The ‘difficult’ constraints are (and generally: should be anyway) implemented in the database, so there will be no instances violating the constraints we had to drop. Trade-offs, indeed, but now one can use an ontology to access a large database over the Web and retrieve the results quickly.

For instance, take the query “For the Firmicutes, retrieve the organisms and their genes that have a GCtotal contents higher than 60”, which is for various reasons not possible through the current web interface of the source database.

Fig.1 shows the ontology pane with three relevant elements selected. (click on the figures to enlarge)

Fig.1. WONDER's ontology pane with three elements selected

Fig.2 shows the constraint adder, where I’m adding that the GCValue has to be > 60.

Fig.2. WONDER's constraint adder, where I’m adding that the GCValue has to be > 60

Fig.3 shows the query ready for execution: the attributes with a green border are those that will appear in the query answer (I could have selected all, if I wanted to). In the menu bar on the right you can see I have customized the names of the attributes, so that the columns in the results pane will have a query-relevant name in your preferred language (not necessary to do), as well as the automatically generated query.

Fig.3. WONDER's query pane, where the query is ready for execution

Fig.4 shows a section of the results of the first page and Fig.5 of the second page; the “Family” column that has all the Firmicutes (out of about 500 organisms in the database) gives you the whole section of the species tree, because that is how the taxonomy information is stored in the database (refining the database is a separate topic). Alternatively, I could have selected the organism Name from the ontology browser (see Fig.1), de-selected the taxonomic classification in the query pane, and included the Name of the organism in the query answer to have the species name only but not all the taxonomic information; in this case, I wanted to have all that taxonomy information. The genes are the relevant selection (made with the other constraints) out of about the 2 million genes that are stored in the database.

Fig.4. Section of the results, the first page

Fig.5. Section of the results, the second page

There is also a constraint manager for the AND, OR, NOT and nesting. For instance, for the query “Give me the names of the organisms of which the abbreviation starts with a b, but not being a Bacillus, and the prediction and KEGG code of those organisms’ genes that are putatively either horizontally transferred or highly expressed” (Fig.6), we have the constraint manager as shown in Fig.7.

Fig.6. Graphical and textual representation of the second query

Fig.7. Constraint manager for the second query
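What a constraint manager with AND, OR, NOT and nesting has to produce in the end is a correctly parenthesised condition. The Python sketch below shows one way that could be done; it is not WONDER’s actual code, and the column names (`abbreviation`, `genus`, `prediction`) and values (‘HGT’, ‘HE’) are hypothetical stand-ins for whatever the HGT-DB schema really uses.

```python
ALLOWED_COMPARATORS = ("=", ">", "<", "LIKE")

def to_sql(node):
    """Render a nested constraint tree as (sql_fragment, parameters)."""
    op = node[0]
    if op == "NOT":
        sql, params = to_sql(node[1])
        return f"NOT ({sql})", params
    if op in ("AND", "OR"):
        rendered = [to_sql(child) for child in node[1:]]
        sql = f" {op} ".join(f"({s})" for s, _ in rendered)
        params = [p for _, ps in rendered for p in ps]
        return sql, params
    # Leaf: (column, comparator, value). Column names are assumed to come
    # from the ontology pane, not from free text typed by the user.
    column, comparator, value = node
    if comparator not in ALLOWED_COMPARATORS:
        raise ValueError(f"unsupported comparator: {comparator}")
    return f"{column} {comparator} ?", [value]

# Roughly the second query: abbreviation starts with a b, not a Bacillus,
# and genes that are either horizontally transferred or highly expressed.
query2 = (
    "AND",
    ("AND", ("abbreviation", "LIKE", "b%"), ("NOT", ("genus", "=", "Bacillus"))),
    ("OR", ("prediction", "=", "HGT"), ("prediction", "=", "HE")),
)
sql, params = to_sql(query2)
print(sql)
print(params)
```

Keeping the values as parameters (the `?` placeholders) rather than pasting them into the string is a design choice that avoids quoting problems when the generated condition is sent to the database.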

You can also save and load queries when you’re logged in, and download the results set in any case.

For those who want to play with it: feel free to drop me a line and I will send you the URL. (The reason for not linking the URL here is that the current URL is still for the beta version, whereas the operational one is expected to have a more stable URL soon.)

Last, but not least, the “we” I used in the previous sentences is not some ‘standard writing in plural’: several people were involved in various ways in realizing the WONDER system. In alphabetical order, they are: Diego Calvanese, Marijke Keet, Werner Nutt, Mariano Rodriguez-Muro, and Giorgio Stefanoni, all at FUB. I also want to thank our domain experts of the case study (with whom we’re writing a bio-oriented paper): Santi Garcia-Vallvé (with the Evolutionary Genomics Group, ‘Rovira i Virgili’ University, Tarragona, Spain) and Mark van Passel (with the Laboratory for Microbiology, Wageningen University and Research Centre, the Netherlands).

References

[1] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland.

[2] Garcia-Vallve, S, Guzman, E., Montero, MA. and Romeu, A. 2003. HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Research 31: 187-189.

[3] R. Alberts, D. Calvanese, G. De Giacomo, A. Gerber, M. Horridge, A. Kaplunova, C. M. Keet, D. Lembo, M. Lenzerini, M. Milicic, R. Moeller, M. Rodríguez-Muro, R. Rosati, U. Sattler, B. Suntisrivaraporn, G. Stefanoni, A.-Y. Turhan, S. Wandelt, M. Wessel. Analysis of Test Results on Usage Scenarios. Deliverable TONES-D27 v1.0, Oct. 10 2008.

[4] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, and Riccardo Rosati. Ontologies and databases: The DL-Lite approach. In Sergio Tessaris and Enrico Franconi, editors, Semantic Technologies for Informations Systems – 5th Int. Reasoning Web Summer School (RW 2009), volume 5689 of Lecture Notes in Computer Science, pages 255-356. Springer, 2009.

Working towards WONDER Data

Duncan mentioned in a comment on his recent SciFoo invitation his “Google and the Semantic, Satanic, Romantic Web” post where he describes and summarises an encounter between the pro-Semantic Web Tim Berners-Lee and the ‘anti’-Semantic Web (or should I say realistic?) Peter Norvig, Director of Research at Google. I quote a relevant section here, with some changes in emphases:

Norvig: People are stupid: […] this is the world, imperfect and messy and we just have to deal with it. These same people can’t be expected to use the Resource Description Framework (RDF) and the Web Ontology Language (OWL), which are much more complicated and considerably less fool-proof. (Perhaps you could call this the dumb-antic web?!)

Berners-Lee: replied that a large part of the semantic web can be populated by taking existing relational databases and mapping them into RDF/OWL. The structured data is already there, it just needs web-izing in a mashup-friendly format. (What I like to call the romantic web: people will publish their data freely on the web this way, especially in e-science for example. This will allow sharing and re-use in unexpected ways.)

While Duncan looks at the openness of data, here I want to put the focus on the emphasised part of Berners-Lee’s reply: that you can reuse relational databases just like that and map them into RDF/OWL. Positively described, that is a romantic assumption; negatively described, it is rather naïve and more painful to realise than it sounds. Well, if the database developers had remained faithful to what they had learned during college, then it might have worked out to some extent at least. So let us for a moment ignore the issues of data duplication, violations of integrity constraints, hacks, outdated imports from other databases to fill a boutique database, outdated conceptual data models (if there was one), and what have you. Then it is still not trivial.

To do this ‘easy mapping’, one has to start over with the data analysis and add some new requirements analysis while one is at it. First, some data in the database (DB)—mathematically instances—actually are thought of by the users as concepts/universals/classes. For instance, the GO terms in the GO database are assumed to be representing universals and used to annotate instances in other tables of some database, and let us not go into the ontological status of species (as instances in the database of the NCBI taxonomy). Second, each tuple is assumed to denote an instance and, by virtue of key definitions, to be unique in that table, but such a tuple has values in each cell of the participating columns; however, those values are not objects that the OWL ABox is assuming to be dealing with (this is known under the term impedance mismatch). So, when we have divided the data in the DB into instances-but-actually-concepts-that-should-become-OWL-classes and real-instances-that-should-become-OWL-instances, we need to convert the real instances of the DB to objects in the ontology, where some function has to be used to convert (combinations of) values into objects proper.
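The “some function has to be used to convert (combinations of) values into objects” step can be illustrated with a small Python sketch. The namespace, the class name, and the key columns below are placeholders chosen for the example, not what we actually use for the HGT-DB; the point is only that the key value(s) of a tuple are turned into one object identifier, and that distinct keys yield distinct identifiers.

```python
from urllib.parse import quote

BASE = "http://example.org/hgt/"  # placeholder namespace, not a real one

def to_object(class_name, *key_values):
    """Build one object identifier (an IRI) from the key value(s) of a tuple."""
    # Percent-encode each value so that spaces or slashes inside a value
    # cannot be confused with the separators of the identifier itself.
    key = "/".join(quote(str(v), safe="") for v in key_values)
    return f"{BASE}{class_name}/{key}"

# Two tuples with a composite key (organism abbreviation, gene number).
rows = [("Bsub", 42), ("Bsub", 43)]
genes = [to_object("Gene", organism, gene_no) for organism, gene_no in rows]
print(genes)
```

The values themselves (the abbreviation, the number) remain plain data values; only the identifier built from the key stands for the object.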

For one experiment we are working on here at FUB, we have the HGT-DB with about 1.7 million genes of about 500 bacteria, and all sorts of data about each one of them (tables with 15-20 columns, some with instance data, some with type-level information like the function of the gene product). Try to load this data into Protégé’s ABox. Obviously, we do not; more about that further below.

What, you may ask, about reusing the physical DB schema and, if present, the conceptual data model (in ER, EER, UML, ORM, …)? A good question that many people have asked: lots of research has been done in that area, primarily under the banner of reverse engineering and extracting ‘ontologies’ from such schemas, where it was noted that extra ‘ontology enrichment’ steps were necessary (see e.g. Lina Lubyte’s work). A fundamental problem with such reverse engineering is that, assuming there was a fully normalised conceptual data model, oftentimes denormalization steps have been carried out to flatten the database structure and improve performance, which, if simply reverse engineered, ends up in the ‘ontology’ as a class with umpteen attributes (one for each column). This is not nice, and the automated reasoning one can do with it is minimal, if any. Put differently: if we stick to a flat structure that is poor in subject domain semantics, then why bother with the automated reasoning machinery of the Semantic Web?

To mitigate this, one can redo the normalization steps to try to get some structure back into the conceptual view of the data or perhaps add a section of another ontology to brighten up the ‘ontology’ into an ontology (or, if you are lucky and there was a conceptual data model, to use that one). We did that for the HGT-DB’s conceptual model, manually; an early diagram and its import into Protégé are included in the appendix of [1].

In any case, having more structure in the ‘ontology’ than in the DB, one ends up defining multiple views in the DB, i.e., an external ABox, where a part of a table has the instances of an OWL class. (How to do this in the OWL ABox, I do not know; we have databases that are too large to squeeze into the knowledge base.) In turn, this requires a mechanism to persistently link an OWL class to a SQL or SPARQL query over the DB. (One can argue whether this DB should be the legacy relational database or an RDF-ized version of it; I ignore that debate for now.)
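To make the idea of linking an OWL class to a query over the DB concrete, here is a minimal sketch using an in-memory SQLite database. The table `gene`, its columns, the class names, the `MAPPINGS` dictionary, and the IRI namespace are all assumptions made for illustration; they do not reflect the actual HGT-DB schema or the mapping format of any OBDA tool.

```python
import sqlite3

# Toy stand-in for the legacy database (hypothetical schema and rows).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, organism TEXT, hgt INTEGER);
INSERT INTO gene VALUES (1, 'E. coli', 1);
INSERT INTO gene VALUES (2, 'E. coli', 0);
INSERT INTO gene VALUES (3, 'B. subtilis', 1);
""")

# Hypothetical mappings: OWL class name -> SQL query that returns the key
# values of the tuples that count as instances of that class.
MAPPINGS = {
    "Gene": "SELECT gene_id FROM gene ORDER BY gene_id",
    "HorizontallyTransferredGene":
        "SELECT gene_id FROM gene WHERE hgt = 1 ORDER BY gene_id",
}


def instances_of(class_name):
    """Evaluate the query mapped to the class and turn each key into an object."""
    rows = conn.execute(MAPPINGS[class_name]).fetchall()
    return [f"http://example.org/hgtdb/Gene/{gene_id}" for (gene_id,) in rows]
```

Here two classes are populated from different parts of the same table, and the data stays in the database: nothing is copied into the knowledge base, and the query is only run when the instances are asked for.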

After doing all that, one has contributed the proverbial ‘2 cents’, which has cost ‘blood, sweat and tears’ (maybe the latter is just a Dutch idiom), to populating the Semantic Web.

But what can one really do with it?

The least one can do is to make querying the database easier so that users do not have to learn yet another query language. Earlier technologies in that direction were called query-by-diagram and conceptual queries, and a newer term for the same idea is Ontology-Based Data Access (OBDA), which uses Semantic Web technologies. Then one can add reasoner-facilitated query completion to guarantee that the user asks something that the system can answer (e.g. [2]). Having the reasoner anyway, one might as well use it for more sophisticated queries that are not easily, or not at all, possible with traditional database systems. One of them is using terms in the query for which there is no data in the database, of which several examples were described in a case study [3] (and summarised). For the HGT-DB, these are queries involving the species taxonomy and gene product functions.
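A much simplified sketch of how a query term without data in the database can still return answers: the term is expanded to its subclasses in the ontology, and only those that occur in the database contribute tuples. The taxonomy fragment, the stored data, and the function names are invented for illustration; a real OBDA system does this through query rewriting with a reasoner rather than with a hand-written loop.

```python
# Hypothetical fragment of a species taxonomy: child -> parent.
SUBCLASS_OF = {
    "Escherichia coli": "Enterobacteriaceae",
    "Salmonella enterica": "Enterobacteriaceae",
    "Enterobacteriaceae": "Proteobacteria",
    "Bacillus subtilis": "Firmicutes",
}

# Hypothetical database content: only organism names are stored with gene ids;
# the higher taxa appear nowhere in the data.
STORED = {"Escherichia coli": [1, 2], "Bacillus subtilis": [3]}


def expand(term):
    """Return the term together with all its descendants in the taxonomy."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result


def genes_of(term):
    """Answer a query over a taxon by querying every taxon it subsumes."""
    return sorted(g for name in expand(term) for g in STORED.get(name, []))
```

With this toy data, asking for the genes of "Proteobacteria" returns the genes stored under "Escherichia coli", even though the string "Proteobacteria" does not occur in the database.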

Another useful addition with respect to the ‘legacy’ (well, currently operational) HGT-DB is that our domain experts, upon having seen the conceptual view of the database, came up with all sorts of other sample queries they were thinking of but where the knowledge was not yet explicitly represented in the ontology even though one can retrieve the data from the database. For instance, adjacent or nearby genes that are horizontally transferred, or clusters of such genes that are permitted to have a gap between them consisting of non-coding DNA or of a non-horizontally transferred gene. Put differently, one can do a sophisticated analysis of one’s data and unlock new information from the database by using the ontology-based approach. In our enthusiasm, we have called the experiment Web ONtology mediateD Extraction of Relational data (WONDER) for Web-based ObdA with the Hgt-db (WOAH!). We have tools such as QuOnto for scalable reasoning and the OBDA Plugin for Protégé for management of the mappings between an OWL class, the SQL query over the database, and the transformation function (skolemization) from values to objects. The last step to make it all Web-usable (from a technical point of view, that is) is the Web-based ontology browser and graphical query builder. This interface is well along in the pipeline, with a first working version sent out for review by our domain experts. One of them thought that it looked a bit simplistic; so perhaps we achieved more than we bargained for, where the AI & engineering behind it did its work well, from a user perspective at least.
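The kind of cluster query the domain experts asked for can be sketched as follows, assuming the genes of one genome are given in chromosomal order with a flag for whether each is horizontally transferred. The function name `hgt_clusters`, the `max_gap` parameter, and the sample flags are hypothetical; the sketch counts only intervening non-transferred genes as a gap and ignores non-coding DNA and actual coordinates.

```python
def hgt_clusters(flags, max_gap=1):
    """Group the positions of horizontally transferred genes into clusters.

    flags[i] is True when the i-th gene along the genome is horizontally
    transferred. Two consecutive members of a cluster may be separated by
    at most max_gap genes that are not horizontally transferred.
    Clusters of a single gene are dropped.
    """
    clusters, current = [], []
    for position, is_hgt in enumerate(flags):
        if not is_hgt:
            continue
        # Number of non-transferred genes since the previous cluster member.
        if current and position - current[-1] - 1 > max_gap:
            clusters.append(current)
            current = []
        current.append(position)
    if current:
        clusters.append(current)
    return [cluster for cluster in clusters if len(cluster) > 1]


# Made-up genome of seven genes: positions 0, 1, 3 and 6 are transferred.
sample = [True, True, False, True, False, False, True]
clusters = hgt_clusters(sample, max_gap=1)
```

Setting `max_gap=0` restricts the result to strictly adjacent transferred genes, which corresponds to the first of the two example queries.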

More automation of all those steps to get it working, however, will be a welcome addition from the engineering side. Until then, Norvig’s down-to-earth comment is closer to reality than Berners-Lee’s vision.

[1] R. Alberts, D. Calvanese, G. De Giacomo, A. Gerber, M. Horridge, A. Kaplunova, C. M. Keet, D. Lembo, M. Lenzerini, M. Milicic, R. Moeller, M. Rodríguez-Muro, R. Rosati, U. Sattler, B. Suntisrivaraporn, G. Stefanoni, A.-Y. Turhan, S. Wandelt, M. Wessel. Analysis of Test Results on Usage Scenarios. Deliverable TONES-D27 v1.0, Oct. 10 2008.

[2] Paolo Dongilli, Enrico Franconi (2006). An Intelligent Query Interface with Natural Language Support. FLAIRS Conference 2006: 658-663.

[3] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED’08). 26-27 Oct. 2008, Karlsruhe, Germany.