Linked data as the future of scientific publishing

The (not really a) ‘pipe dream’ paper “Strategic reading, ontologies, and the future of scientific publishing” came out in August in Science [1], but does not seem to have picked up much blog attention beyond copy-and-paste-the-abstract (except for Dempsey’s take on it), which is curious. Renear and Palmer envision a bright new world of online scientific papers where the text has click-through terms linking to records in other databases and web pages, so that you have your network of linked data; click on a protein name in the paper and automatically browse to UniProt, sort of. And it is even the users who want this; i.e., it is not some covert push by Semantic Web techies. But then, in this case the ‘users’ are in informatics and information/library science (the enabling users?), which does not imply that the end users—say, biochemists, geneticists, etc.—want all that for better management of their literature (or that they want it but do not yet realise that this is what they want).

But let us assume those end users do want linked data for their literature (after all, it was a molecular ecologist who sent me the article—thanks Mark!). Or, to use some ‘fancier’ terms from the article: the zapping scientists want (need?) ontology-supported strategic reading to work efficiently and effectively with the large number of papers and supporting data being published. “Scientists are scanning more and reading less”, so the linked data would (should?) help them in this superficial foraging of data, information, and knowledge to find the useful needle in the haystack—or so goes Renear and Palmer’s vision.

However, from a theoretical and engineering point of view, this can already be done. Not only that, it has been shown that some of it works: there are iHOP and Textpresso, as the authors point out, but also GoPubMed, and SHER with PubMed. This raises the question: are those tools not good enough (and if something is missing, what?), or is it a matter of convincing people and funding agencies? If the latter, then what is the paper doing in Science’s “review” section?

If one reads further on in the paper, some peculiar remarks are made, but not the one I would have expected. That the “natural language prose of scientific articles provides too much valuable nuance and context to be treated only as data” is a known problem that keeps many a (computational) linguist busy for a lifetime. But they also go on to say that “Traditional approaches to evaluating information systems, such as precision, recall, and satisfaction measures, offer limited guidance for further development of strategic reading technologies”, yet no alternative evaluation methods are presented. That “research on information behaviour and the use of ontologies is also needed” may be true from an outsider’s perspective: usages are known among ontologists, although perhaps a review of all the ways of actual usage would be useful. Further down in the same section (p832), the authors claim, out of the blue and without any argumentation, that “the development of ontology languages with additional expressive power is needed”. What additional expressive power is needed for document and data navigation when they talk of the desire to better exploit the “terminological annotations”? The preceding text in the same paragraph mentions only XML, so they do not seem to have a clue about ontology languages, let alone their expressiveness (OWL is mentioned only in passing on p830), and they manage to mention it in the same breath as so-called service-oriented architectures, with a reference to another Science paper. Frankly, I think that papers like this are bound to cause more harm (or, at best, indifference) than good.

One thing I was wondering about, but that is not covered in the paper, is the following: who decides which term resolves to which URI? There are different databases for, say, proteins, and the one that gets selected (by whom?) in the scientific publishing arena will become the de facto standard. One way to ameliorate this is a specific interface so that when a scientist clicks on a term, a drop-down box appears with something like “do you wish to retrieve more information from source x, y, or z?” Nevertheless, one easily ends up with a certain bias and powerful “gatekeepers”, and perhaps with a similar attitude as toward DBLP/PubMed (“if it’s listed in there it counts, if it is not, then it doesn’t”, regardless of the content of the indexed papers, and favouring older, established, and well-connected outlets over newer and/or interdisciplinary ones).
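To make the drop-down idea concrete, here is a minimal sketch in Python of what such term disambiguation could look like behind the scenes: a clicked term maps to several candidate (source, URI) pairs and the reader chooses, rather than the publisher hard-wiring a single database. The term, the source names, and the URIs are merely illustrative placeholders, not a real resolver.

```python
# Sketch of term-to-URI disambiguation: instead of one de facto standard
# database, a clicked term offers several candidate sources to the reader.
# All entries below are illustrative examples, not an authoritative mapping.
CANDIDATE_SOURCES = {
    "p53": [
        ("UniProt", "https://www.uniprot.org/uniprot/P04637"),
        ("PDB", "https://www.rcsb.org/structure/1TUP"),
        ("Wikipedia", "https://en.wikipedia.org/wiki/P53"),
    ],
}

def resolve(term):
    """Return all candidate (source, URI) pairs for a clicked term."""
    return CANDIDATE_SOURCES.get(term.lower(), [])

def render_dropdown(term):
    """Build the 'retrieve more information from x, y, or z?' menu text."""
    options = resolve(term)
    if not options:
        return f"No linked records for '{term}'."
    lines = [f"Retrieve more information on '{term}' from:"]
    lines += [f"  [{i + 1}] {name}: {uri}"
              for i, (name, uri) in enumerate(options)]
    return "\n".join(lines)

print(render_dropdown("p53"))
```

Of course, even in this design someone still curates the candidate list and its ordering, so the “gatekeeper” problem is softened rather than solved.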

Anyway, if Semantic Web researchers need some reference padding in the context of “it is not a push but a pull, hear hear, look here at the Science reference, and I even read Science”, it’ll do the job, even though the content of the paper is less than mediocre for the outlet.

[1] Allen H. Renear and Carole L. Palmer. Strategic reading, ontologies, and the future of scientific publishing. Science, 325(5942): 828-832. DOI: 10.1126/science.1157784