Any semantic search for insects?

The draft of this post started with an example of a creepy insect living in Italy and, well, across the world in those locations where hygiene is not taken too seriously. But I will leave that be, so you can have a good night’s rest. Instead, I will take the example of an insect that I still cannot identify. It may still turn out to be a creepy one, but now I do have photos of it, and it lives well away from us, outside in Ineke’s polytunnel near Limerick, Ireland. The problem is this: neither Ineke, nor Heidi, nor I know what it is, but we really still want to know. How to get the answer, i.e., how to find the species name of the specimen? I’ve tried several strategies: the ones that are practically possible did not do the job, and the one that would do the job does not exist. I’ll go through them in the remainder of the post and close with a few questions on what the most feasible strategy would/should/could be to eventually have a decent entomology [ornithology/nematology/etc.] knowledge base.


Specimen viewed from the top; can anyone ID this specimen?

Basic searches

None of us who were present at teatime in Ineke’s polytunnel, where we observed the insect, is an entomologist, nor do we have entomologist friends. The famous ‘bug man’ Ruud Kleinpaste is a fellow alumnus of Wageningen University, but we did not study there around the same time and I could not find an email address to bother him with a request to ID a specimen. None of us has an insect handbook either, and even if we had, I, for one, would not want to flick through it when there is a perceived need to find the species of a specimen: flicking through the insect book (and plant book, etc.) was an entertaining pastime when I was young, like reading the encyclopaedia and doing the dictionary game, but in this day and age I would want to use the computer to find the answer. This is theoretically feasible but, as far as I am aware, not yet in practice.

To do image matching, I would need a very large data set of images in which each image is labelled with its species name, which I do not have; so the machine learning strategy will not work. There is an online browseable BugGuide for the US and Canada with lots of pictures that I clicked through for a while, but without finding the right picture. There are entomology databases that let me search by species name (here, here, and here), but not by properties of the insect; KONCHUR has a fancier search mechanism but covers insects in Japan, East Asia and the Pacific only (“orange leg AND black body” did not return any results).

Sure, I did a Google search on “image of a black insect with orange legs and stingy back”, hoping that someone else had already uploaded an image of another specimen of the same type of insect, annotated it with the same or very similar terms, and that someone with the right knowledge of insects (i.e., who is not looking around for the answer like me) had taken the next step and added the name of the species as well. With the many search phrases and the PageRank algorithm that Google relies upon in devising the search results [1], something might turn up; however, the actual results were unhelpful: other people with similar requests but without an answer; the body colours swapped, twice (orange bug with black legs); many unrelated insects, one of which does have orange legs (the Ichneumon wasp, Rhyssa persuasoria), but its legs are only partially orange, it has white dots on its body, and its back is not as stingy as our specimen’s (see picture of the Ichneumon wasp); and utterly irrelevant land and sea images. That’s about it for the first page of the Google query answer.

Semantic searches

Now, if there were a proper ontology of insects, and I mean not a bare taxonomic tree but one where the classes have properties and those properties have their ranges defined as well, then it would be a simple exercise of selecting the properties along the lines of

adult insect
  AND length 2cm
  AND colour black
  AND has wingtype transparent
  [*AND body shape similar to a wasp*]
  AND leg colour orange
  AND rear body stingy
  AND location at least west Ireland

so that the reasoner (FaCT++, Pellet, and the like) would classify it near-instantly, or, if the ontology were really large, then still within an hour or so (ignoring for a moment the [*AND body shape similar to a wasp*] because that requires a bit more work). It would be even funkier if that ontology were linked to a database of images of insects to cross-check it with the visuals. Even more so if such a database also had information about the insect’s habitat with feeding habits, its principal role in the food web, and any diseases it may cause or transmit. Then one would also be able to start the search from another direction, along the lines of “give me all the insects that live in the west of Ireland”, as a first step to narrow down the possible answers.
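To make the idea of a property-based search a bit more concrete, here is a minimal sketch in Python. It is plain attribute matching over a toy, in-memory list, not description logic reasoning as FaCT++ or Pellet would do it, and everything in it is an assumption for illustration: the entries `species_A`, `species_B`, `species_C`, their property values, and the function names `matches` and `candidates` are all made up.

```python
# Toy stand-in for an insect knowledge base. All names and values below are
# invented placeholders, not real species data.
KNOWLEDGE_BASE = [
    {"name": "species_A", "stage": "adult", "length_cm": (1.5, 2.5),
     "colour": "black", "wing_type": "transparent", "leg_colour": "orange",
     "rear_body": "stingy", "locations": {"west Ireland"}},
    {"name": "species_B", "stage": "adult", "length_cm": (2.0, 4.0),
     "colour": "black", "wing_type": "transparent", "leg_colour": "black",
     "rear_body": "stingy", "locations": {"Japan"}},
    {"name": "species_C", "stage": "adult", "length_cm": (0.5, 1.0),
     "colour": "orange", "wing_type": "transparent", "leg_colour": "black",
     "rear_body": "rounded", "locations": {"west Ireland", "Japan"}},
]


def matches(entry, query):
    """Return True if the entry satisfies every property in the query."""
    for key, wanted in query.items():
        if key == "length_cm":
            # The entry stores a (min, max) range; the query gives one value.
            low, high = entry["length_cm"]
            if not (low <= wanted <= high):
                return False
        elif key == "location":
            # "at least this location": the entry may occur elsewhere too.
            if wanted not in entry["locations"]:
                return False
        elif entry.get(key) != wanted:
            return False
    return True


def candidates(query, kb=KNOWLEDGE_BASE):
    """Names of all entries in the knowledge base that match the query."""
    return [entry["name"] for entry in kb if matches(entry, query)]


# The observation from the polytunnel, expressed as property/value pairs.
observed = {
    "stage": "adult",
    "length_cm": 2.0,
    "colour": "black",
    "wing_type": "transparent",
    "leg_colour": "orange",
    "rear_body": "stingy",
    "location": "west Ireland",
}

print(candidates(observed))
print(candidates({"location": "west Ireland"}))
```

The second call is the “start from another direction” search: it narrows the candidates by location only. A real ontology would add what this sketch lacks, such as a class hierarchy and defined property ranges, and fuzzy properties like “body shape similar to a wasp” remain out of scope here as well.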

Aside from the instance classification problem of this particular specimen, the question arises whether it would be up to

a) Google to work on their technology so as to be able to get the answer for me?

b) Entomologists to develop their domain ontology about insects and link it to some database with pictures and additional textual information to have indeed a properly searchable knowledge base?

c) Volunteer labour, like me having taken pictures and annotated each one with the physical characteristics, location, time of observation, etc. and categorise it as “bug” or “insect” or “insekt” or “insetto” to eventually have a grass-roots bugbase (that likely will have some imperfections with gaps in data fields and sloppy terminology)?

d) Everyone to buy insect books?

e) …?

Shelving option d, I am explicitly looking for a computational option, i.e., a, b, c, or e. I prefer a web-accessible version of option b, which can be done with scalable Semantic Web technologies; one only needs to find the money, time, and people to realise it.

Although I gave the example here with insects, the same case can be made for birds, worms, and so forth. When such searchable knowledge bases exist, they will not only save time for the many lay people looking up the information and learning more about the flora and fauna around them, but I can imagine they will also make research a lot easier for interdisciplinary scientists who have to forage into knowledge of insects [/birds/worms/etc.] as well as for the entomologists [/ornithologists/nematologists/etc.] themselves.

specimen from the side


[1] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April 2009, 8-12.

Linked data as the future of scientific publishing

The (not really a) ‘pipe dream’ paper about “Strategic reading, ontologies, and the future of scientific publishing” came out in August in Science [1], but does not seem to have picked up a lot of blog attention other than copy-and-paste-the-abstract (except for Dempsey’s take on it), which is curious. Renear and Palmer envision a bright new world with online scientific papers where the text has click-through terms to records in other databases and web pages, so as to have your network of linked data; click on a protein name in the paper and automatically browse to Uniprot, sort of. And it is even the users who want that; i.e., it is not some covert push from Semantic Web techies. But then, in this case the ‘users’ are in informatics & information/library science (the enabling-users?), which does not imply that the end users (say, biochemists, geneticists, etc.) want all that for better management of their literature (or they want that but do not yet realise that that is what they want).

But let us assume those end users want the linked data for their literature (after all, it was a molecular ecologist who sent me the article; thanks, Mark!). Or, to use some ‘fancier’ terms from the article: the zapping scientists want (need?) ontology-supported strategic reading to work efficiently and effectively with the large amounts of papers and supporting data being published. “Scientists are scanning more and reading less”, so then the linked data would (should?) help them in this superficial foraging of data, information, and knowledge to find the useful needle in the haystack, or so goes Renear and Palmer’s vision.

However, from a theoretical and engineering point of view, this can already be done. Not just that, it has been shown that some things work: there are iHOP and Textpresso, as the authors point out, but also GoPubMed, and SHER with PubMed, which raises the question: are those tools not good enough (and if something is missing, what?) or is it about convincing people and funding agencies? If the latter, then what is the paper doing in Science’s “review” section?

If one reads further on in the paper, some peculiar remarks are being made, but not the ones I would have expected. That the “natural language prose of scientific articles provides too much valuable nuance and context to be treated only as data” is a known problem that keeps many a (computational) linguist busy for his/her lifetime. But they go on to say that “Traditional approaches to evaluating information systems, such as precision, recall, and satisfaction measures, offer limited guidance for further development of strategic reading technologies”, yet alternative evaluation methods are not presented. That “research on information behaviour and the use of ontologies is also needed” may be true from an outsider’s perspective: usages are known among ontologists, but perhaps a review of all the ways of actual usage would be useful. Further down in the same section (p832), the authors claim, out of the blue and without any argumentation, that “the development of ontology languages with additional expressive power is needed”. What additional expressive power is needed for document and data navigation when they talk of the desire to exploit better the “terminological annotations”? The preceding text in the same paragraph only mentions XML, so they do not seem to have a clue about ontology languages at all, let alone their expressiveness (OWL is mentioned only in passing on p830), and they manage to mention it in the same breath as so-called service-oriented architectures with a reference to another Science paper. Frankly, I think that papers like this are bound to cause more harm (or, at best, indifference) than good.

One thing I was wondering about, but that is not covered in the paper, is the following: who decides which term goes to which URI? There are different databases for, say, proteins, and the one that is selected (by whom?) in the scientific publishing arena will become the de facto standard. A possibility to ameliorate this is to create a specific interface so that when a scientist clicks on a term, a drop-down box appears with something like “do you wish to retrieve more information from source x, y, or z?” Nevertheless, one easily ends up with a certain bias and powerful “gatekeepers”, and perhaps with a similar attitude as toward DBLP/PubMed (“if it’s listed in there it counts, if it is not, then it doesn’t”, regardless of the content of the indexed papers, and favouring older, established, and well-connected outlets above newer and/or interdisciplinary ones).
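The drop-down idea boils down to mapping one term to several candidate sources instead of a single hard-wired URI. A minimal sketch in Python, under loud assumptions: the source names `source_x`, `source_y`, `source_z` and their `example.org` URL templates are placeholders, not real databases or real link formats, and `link_options` is a hypothetical helper.

```python
from urllib.parse import quote

# Placeholder sources: names and URL templates are invented for illustration.
SOURCES = {
    "source_x": "https://example.org/source-x/entry/{term}",
    "source_y": "https://example.org/source-y/lookup?q={term}",
    "source_z": "https://example.org/source-z/{term}",
}


def link_options(term, sources=SOURCES):
    """Return (source name, URL) pairs to offer the reader for a clicked term."""
    encoded = quote(term, safe="")  # percent-encode spaces, slashes, etc.
    return [(name, template.format(term=encoded))
            for name, template in sorted(sources.items())]


for name, url in link_options("protein kinase"):
    print(f"{name}: {url}")
```

Who gets to fill the `SOURCES` table, and in which order the options are shown, is of course exactly where the gatekeeping question comes back in.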

Anyway, if Semantic Web researchers need some reference padding in the context of “it is not a push but a pull, hear hear, look here at the Science-reference, and I even read Science”, it’ll do the job, even though the content of the paper is less than mediocre for the outlet.

[1] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784]

Visiting DERI in sunny Galway

Believe it or not, the weather is indeed dry and sunny in Galway, for 5 days in a row already; although I did not come to Ireland for the good weather, it is a nice bonus. One of the reasons I am in Ireland is to visit the Digital Enterprise Research Institute (DERI) in Galway, which is the largest Semantic Web-oriented research group in Ireland with about 120 employees.

I am hosted by Paul Buitelaar’s NLP unit, looking into options to improve NLP implementations with ontologies, as DERI puts somewhat more emphasis on validation with applications than Bolzano does. In this context I gave a seminar about representing and reasoning over a taxonomy of part-whole relations, which is based on the Applied Ontology paper with the same title [1], but where the slides focus on the motivations from a linguistics perspective. It led some attendees to believe linguistics was the core motivation for disambiguating different types of part-whole relations. However, correct modelling of part-whole relations probably gets more attention in conceptual data modelling (primarily, UML), and it receives lots of attention in attempts to address the demands put forward by bio(medical) ontologists to (i) have a language with which one can represent all properties of parthood relations in ontology languages, which we still cannot do in OWL, and (ii) distinguish between part-whole notions such as (spatial) containment, structural parthood, and membership of a collective.

An orthogonal dimension to the types of part-whole relations is formed by the notions of essential and immutable parts and wholes, which can be handled by resorting to a temporalisation of relationships [2]. If one wanted to ‘translate’ that to any usage in NLP, then one could deal adequately, with a formal and ontological foundation, with linguistic expressions that have a wider range of verb tenses, like “researcherAbc will become a member of researchGroup123” (a scheduled meronymic part-whole relation) and “heart#123 has been transplanted from patientAbc” (a so-called disabled relation where the heart used to be a structural part of that patient). But this is still future work.
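To give a rough feel for the scheduled and disabled cases, here is a deliberately crude sketch in Python. It is not the formalisation of [2]: it just attaches a start and an optional end time point to a relation and reads off a status relative to “now”. The class `PartWhole`, the function `status`, the integer time points, and the extra label “active” for a currently holding relation are all assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartWhole:
    """A part-whole relation with a (hypothetical) validity interval."""
    part: str
    whole: str
    kind: str                   # e.g. "member_of" or "structural_part_of"
    start: int                  # first time point at which the relation holds
    end: Optional[int] = None   # last time point at which it holds; None = open


def status(rel: PartWhole, now: int) -> str:
    """Crude status of the relation at time point `now`."""
    if now < rel.start:
        return "scheduled"      # will hold in the future
    if rel.end is not None and now > rel.end:
        return "disabled"       # held in the past, no longer does
    return "active"             # assumed label for "holds now"


membership = PartWhole("researcherAbc", "researchGroup123", "member_of", start=10)
heart = PartWhole("heart#123", "patientAbc", "structural_part_of", start=0, end=7)

print(status(membership, now=5))  # "will become a member"
print(status(heart, now=8))       # "has been transplanted from"
```

An NLP component could then, in principle, map verb tenses onto such statuses, but that mapping is precisely the part that still has to be worked out.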

Other topics came up, and are still coming up, as well, such as a possible use case with roles and rules with Axel Polleres, making it a stimulating and enjoyable visit.


[1] Keet, C.M. and Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2): 91-110.

[2] Artale, A., Guarino, N., and Keet, C.M. Formalising temporal constraints on part-whole relations. 11th International Conference on Principles of Knowledge Representation and Reasoning (KR’08). Gerhard Brewka, Jerome Lang (Eds.) AAAI Press, pp 673-683. Sydney, Australia, September 16-19, 2008.