Linked data as the future of scientific publishing

The (not really a) ‘pipe dream’ paper about “Strategic reading, ontologies, and the future of scientific publishing” came out in August in Science [1], but it does not seem to have picked up much blog attention other than copy-and-paste-the-abstract posts (except for Dempsey’s take on it), which is curious. Renear and Palmer envision a bright new world of online scientific papers in which the terms in the text click through to records in other databases and web pages, giving you a network of linked data; click on a protein name in the paper and automatically browse to Uniprot, sort of. And it is even the users who want that; i.e., it is not some covert push from Semantic Web techies. But then, in this case the ‘users’ are in informatics & information/library science (the enabling-users?), which does not imply that the endusers (say, biochemists, geneticists, etc.) want all that for better management of their literature (or they want that but do not yet realise that that is what they want).

But let us assume those endusers want the linked data for their literature (after all, it was a molecular ecologist who sent me the article; thanks Mark!). Or, to use some ‘fancier’ terms from the article: the zapping scientists want (need?) ontology-supported strategic reading to work efficiently and effectively with the large amounts of papers and supporting data being published. “Scientists are scanning more and reading less”, so then the linked data would (should?) help them in this superficial foraging of data, information, and knowledge to find the useful needle in the haystack, or so goes Renear and Palmer’s vision.

However, from a theoretical and engineering point of view, this can already be done. Not just that, it has been shown that some things work: there is iHOP and Textpresso, as the authors point out, but also GoPubMed, and SHER with PubMed, which begs the question: are those tools not good enough (and if something is missing, what?) or is it about convincing people and funding agencies? If the latter, then what does the paper do in Science in the “review” section??

If one reads further on in the paper, some peculiar remarks are being made, but not the ones I would have expected. That the “natural language prose of scientific articles provides too much valuable nuance and context to be treated only as data” is a known problem that keeps many a (computational) linguist busy for his/her lifetime. But they go on to say also that “Traditional approaches to evaluating information systems, such as precision, recall, and satisfaction measures, offer limited guidance for further development of strategic reading technologies”, yet alternative evaluation methods are not presented. That “research on information behaviour and the use of ontologies is also needed” may be true from an outsider’s perspective: usages are known among ontologists but perhaps a review of all the ways of actual usage may be useful. Further down in the same section (p832), the authors claim, out of the blue and without any argumentation, that “the development of ontology languages with additional expressive power is needed”. What additional expressive power is needed for document and data navigation when they talk of the desire to exploit better the “terminological annotations”? The preceding text in the same paragraph only mentions XML, so they do not seem to have a clue about ontology languages, let alone their expressiveness, at all (OWL is mentioned only in passing on p830) and they manage to mention it in the same breath as so-called service-oriented architectures with a reference to another Science paper. Frankly, I think that papers like this are bound to cause more harm (or, at best, indifference) than good.

One thing I was wondering about, but that is not covered in the paper, is the following: who decides which term goes to which URI? There are different databases for, say, proteins, and the one that will be selected (by whom?) in the scientific publishing arena will become the de facto standard. A possibility to ameliorate this is to create a specific interface so that when a scientist clicks on a term, a drop-down box appears with something like “do you wish to retrieve more information from source x, y, or z?” Nevertheless, one easily ends up with a certain bias and powerful “gatekeepers”, and perhaps with a similar attitude as toward DBLP/PubMed (“if it’s listed in there it counts, if it is not, then it doesn’t”, regardless of the content of the indexed papers, and favouring older, established, and well-connected outlets above newer and/or interdisciplinary ones).
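
The drop-down idea can be sketched as a lookup from a term to several candidate sources rather than to a single URI. The following is a minimal sketch in Python; the source names, the `TERM_INDEX` table, and the example.org URIs are all made up for illustration, and nothing here reflects how any publisher actually resolves terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str  # hypothetical database label, e.g. "source x"
    uri: str   # hypothetical record URI


# Hypothetical term index: one term, several candidate sources.
TERM_INDEX = {
    "hemoglobin": [
        Source("source y", "https://example.org/y/hemoglobin"),
        Source("source x", "https://example.org/x/hemoglobin"),
        Source("source z", "https://example.org/z/hemoglobin"),
    ],
}


def dropdown_options(term):
    """Return the menu entries a reader would see after clicking a term."""
    sources = TERM_INDEX.get(term.lower(), [])
    # Sort by name so the order does not depend on how the index was filled.
    return [f"{s.name}: {s.uri}" for s in sorted(sources, key=lambda s: s.name)]


for option in dropdown_options("Hemoglobin"):
    print(option)
```

Whoever fills such an index still decides which sources appear at all, so the gatekeeper question does not go away; it only moves from choosing one URI to choosing the list.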

Anyway, if Semantic Web researchers need some reference padding in the context of “it is not a push but a pull, hear hear, look here at the Science-reference, and I even read Science”, it’ll do the job, even though the content of the paper is less than mediocre for the outlet.

[1] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784]

An analysis of culinary evolution

With summertime being what it is (called komkommertijd—literally: ‘cucumber time’—in Dutch), I stumbled again upon the paper The nonequilibrium nature of culinary evolution [1].

Food is essential, and due to location with its climate and available resources, as well as culture, each region has its own cuisine. There is much talk of homogenization of food dishes in popular press, or at least the threat thereof. One colleague here called “food from the North”, north of the Alps, that is, “barbarian”. But how much diversity in recipes across geographical locations is there? How, if at all, does it vary over time? What is the ingredient replacement pattern and do the replaced ingredients really disappear from the local menu?

Kinouchi and colleagues [1] tried to answer such questions through assessment of the statistics of the recipes’ ingredients, which were taken from 3 complete Brazilian cookbooks (Dona Benta 1946, 1969, and 2004), 40% of the large contemporary French cookbook Larousse (2004), the complete British Penguin Cookery Book (2001), and the Medieval Pleyn Delit.

For instance, the average recipe size of the Dona Benta (1946), as measured by the number of ingredients, is lowest at 6.7, that of the Pleyn Delit is an impressive 9.7, and that of Larousse is the highest with 10.8. However, one has to note that for the Pleyn Delit, there are just 380 recipes with a mere 219 ingredients, whereas the numbers for Larousse are 1200 and 1005, and for the Dona Benta they are 1786 and 491, respectively. When one graphs the frequency of appearance of ingredients in the recipes of the cookbooks, all six cookbooks show very similar rank-frequency plots (power-law behaviour; see Fig. 1 in the paper); that is, for that dimension, there is a cultural invariance, as well as a temporal invariance for the Brazilian cookbook.
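
To make the rank-frequency idea concrete, here is a small Python sketch that counts in how many recipes each ingredient appears and orders the ingredients by that count. The function name and the three toy recipes are invented for illustration and are not taken from any of the cookbooks.

```python
from collections import Counter


def rank_frequency(recipes):
    """Return (rank, ingredient, count) tuples, most frequent first.

    Each recipe is a collection of ingredient names; an ingredient is
    counted at most once per recipe.
    """
    counts = Counter()
    for recipe in recipes:
        counts.update(set(recipe))
    # Sort by descending count, then by name to make ties deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]


# Invented toy data, for illustration only.
toy_recipes = [
    ["potato", "carrot", "onion"],
    ["potato", "kale"],
    ["potato", "onion", "butter"],
]

for rank, name, n in rank_frequency(toy_recipes):
    print(rank, name, n)
```

Plotting the count against the rank on log-log axes is what would produce a plot of the kind the paper shows in its Fig. 1.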

However, the more interesting results come from the statistical and complex network analysis, which gives an idea about culinary evolution. The authors propose a copy-mutate algorithm to model cuisine growth, going from a small set of initial recipes to more diverse ones and using the idea of “cultural replicators” and branching. To make the line fit the data, they need 5 parameters: number of generations (T), number of ingredients per recipe (K), number of ingredients in each recipe to be mutated (L), the number of initial recipes (R0), and the ratio between the sizes of the pool of ingredients and the pool of recipes (M). Models without a fitness parameter did not work, so a fitness value is generated randomly for each ingredient and assumed to stand for the “intrinsic ingredient properties”, such as nutritional value and availability. At each generation, one “mother” recipe was randomly chosen, copied, and one or more of its ingredients replaced with other random ingredients (implementing the mutation rate L) to generate a “daughter” recipe. And so onward. Searching the parameter space, the authors do indeed find values close to the actual ones observed in the cookbooks.
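
The copy-mutate loop as described here can be sketched in a few lines of Python. This is a rough sketch under stated assumptions, not the authors' implementation: the uniform random fitness, the rule that a replacement is accepted only if the new ingredient is fitter, and the way the ingredient pool grows to keep the ratio M are my guesses at the details, and the function name is invented.

```python
import random


def copy_mutate(T, K, L, R0, M, seed=0):
    """Grow a cuisine for T generations; return (recipes, fitness).

    Ingredients are integers; fitness[i] is the fitness of ingredient i.
    """
    rng = random.Random(seed)
    fitness = []

    def new_ingredient():
        # Assumption: fitness is drawn uniformly at random.
        fitness.append(rng.random())

    # Initial pool: enough ingredients to fill a recipe and to respect M.
    while len(fitness) < max(K, M * R0):
        new_ingredient()
    recipes = [rng.sample(range(len(fitness)), K) for _ in range(R0)]

    for _ in range(T):
        # Assumption: the pool grows so that pool size / recipe count >= M.
        while len(fitness) < M * (len(recipes) + 1):
            new_ingredient()
        daughter = list(rng.choice(recipes))  # copy a random mother recipe
        for _ in range(L):                    # attempt L mutations
            pos = rng.randrange(K)
            candidate = rng.randrange(len(fitness))
            # Assumption: accept the swap only if the candidate is fitter.
            if candidate not in daughter and fitness[candidate] > fitness[daughter[pos]]:
                daughter[pos] = candidate
        recipes.append(daughter)
    return recipes, fitness


recipes, fitness = copy_mutate(T=200, K=6, L=2, R0=4, M=3)
print(len(recipes), "recipes from a pool of", len(fitness), "ingredients")
```

Feeding the resulting recipes into an ingredient count would be one way to compare such a toy run with the rank-frequency curves of the real cookbooks.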

Then, on the fitness of the recipes (replaced by hamburgers, pizza, etc.?), Kinouchi and colleagues use the fitness of the kth recipe, defined as $F^{(k)} = \frac{1}{K} \sum_{i=1}^{K} f_i$, and a corresponding total time-dependent cuisine fitness, $F_{total}(R(t)) = \frac{1}{R(t)} \sum_{k=1}^{R(t)} F^{(k)}$. The results are depicted in Fig. 4 in the paper and, in short: “this kind of historical dynamics has a glassy character, where memory of the initial conditions is preserved, suggesting that the idiosyncratic nature of each cuisine will never disappear due to invasion by alien ingredients”. In addition, the copy-mutation model with the selection mechanism is scale-free, so that it is an out-of-equilibrium process, which practically means that “the invasion of new high fitness ingredients and the elimination of initial low fitness ingredients never end”, i.e., some ingredients are very difficult to replace, as if they were “frozen “cultural” accidents”. The latter has some similarity with the ‘founder-effect’ phenomenon in biology.
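
The two fitness definitions are plain averages, which a short Python sketch makes explicit. The ingredient fitness values and the two recipes below are invented for illustration; only the two formulas come from the paper as quoted above.

```python
def recipe_fitness(recipe, f):
    """F^(k): the mean of f_i over the K ingredients of recipe k."""
    return sum(f[i] for i in recipe) / len(recipe)


def cuisine_fitness(recipes, f):
    """F_total: the mean of F^(k) over the R(t) recipes present at time t."""
    return sum(recipe_fitness(r, f) for r in recipes) / len(recipes)


# Invented ingredient fitness values, for illustration only.
f = {"potato": 0.9, "parsnip": 0.4, "onion": 0.6, "carrot": 0.5}
recipes = [["potato", "onion"], ["parsnip", "carrot"]]

print(recipe_fitness(recipes[0], f))  # mean of 0.9 and 0.6
print(cuisine_fitness(recipes, f))    # mean of the two recipe fitnesses
```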

De aardappeleters (potato eaters) by Van Gogh

That much for the maths and experimental data of the paper. Before I turn to some research suggestions on this topic, I will first make an unscientific informal assessment. Van Gogh painted the painting de aardappeleters (‘the potato eaters’) in Nuenen—a village about 15km from where I grew up—back in 1885, to which Thieu Sijbers added a poem to describe such a poor man’s meal. I could not find the full original, but Van Oirschot ([2], p17) has the main parts of it, which I reproduce here first in the original old Brabants dialect and then a translation in English.

En hoekig nao ‘t bidde

‘t krous en dan wordt

aon de sobere maoltijd begonne

recht van ‘t vuur

op de bonkige toffel gezet

worre d’èrpel naw schieluk

mi rappe verkèt

van de hijt fèl nog dampend

de pan outgepikt

nao de monde gebrocht

en gulzig geslikt.

Ze ète, ze schranze

nao ‘n lutske de pan

toe ‘t zwart van de bojum

zo lig as ‘t mer kan

Mi’n mörke vol koffie

van waot’rige sort

zette d’èters nao d’èrpel

de maoltijd dan vort

Ze ète, jao net

mer dan is ‘t ok gezeed

want al wè ze pruuve

is èrremoei en leed.

My translation into English:

And edgy after praying

to the cross, and then

they start with the sober meal,

straight from the fire,

put on the chunky table,

now the potatoes are suddenly

cursed swiftly,

still steaming from the heat,

picked from the pan,

brought to the mouth,

and swallowed greedily.

They eat, they gorge,

and shortly after there is the black

of the bottom of the pan,

as empty as it can be.

With a mug full of coffee,

of the watery type,

do the eaters continue

with the meal after the potatoes.

They eat, yes just about,

but with that, all is said,

because all they taste

is poverty and distress.

The coffee is probably not real coffee but made from roasted sweet chestnuts [2]. The potatoes are an example of the “alien ingredients” mentioned in [1]: before potatoes were introduced in Europe (16th century), the Dutch recipes, at least, used tubers such as pastinaak (parsnip, which is white, and longer and thicker than a carrot) in the place of potatoes; this is known primarily from the documentation about the Siege of Leiden in 1573-1574 during the Eighty Years’ War. Parsnip has not entirely vanished (parsnip beignets are really tasty), but now takes up a minimal place in ‘standard’ Dutch cuisine, so that it may be an example of one of those “frozen “cultural” accidents difficult to be overcome in the out-of equilibrium regime” ([1], p7). A standard Dutch dish is the aardappelen-groente-vlees combination, or: boiled potatoes, boiled vegetables, and a piece of meat fried in butter or fat; alternatively, the potatoes and vegetables are cooked together and mashed together into a hutspot (= potato+carrot+onion) or boerenkoolstamp (= potato+curly kale). Over the years, pasta entered the menu as well, and primarily a combination of Chinese and Indonesian food, but to some extent also Surinamese food, has become a regular dish. Thus, pasta and rice took some space previously occupied by potatoes, but potatoes are in no way being marginalised. My guess is that that is because tubers and grains belong to different food groups and are therefore not easily swappable compared to tuber-tuber replacement, such as parsnip → potato [note 1], or grain-grain replacement, e.g., maize[flour] → wheat[flour]. Simply put: if you grow up on rice or pasta, then you do not easily switch to potatoes, or vice versa.

Perhaps Kinouchi’s copy-mutate algorithm can be rerun taking into account types of ingredients, to see what comes out of it, using some variations like: (1) swap within the same food group; (2) swap across food groups, picking a random ingredient from a different group; and (3) keep the swap within subgroups, such as tuber-carbohydrate-source-staplefood#1 → tuber-carbohydrate-source-staplefood#2 (vs. the more generic ‘carbohydrate source’) and herb#3 → herb#4 (vs. the subsumer ‘condiment’).
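
As a sketch of what such a rerun would need, the mutation step could be restricted by a small ingredient taxonomy. Everything in the following Python fragment is hypothetical: the taxonomy, the mode names, and the helper functions are mine, chosen only to illustrate the three variations, and no fitness-based selection is included.

```python
import random

# Hypothetical ingredient taxonomy: ingredient -> (food group, subgroup).
TAXONOMY = {
    "potato": ("carbohydrate source", "tuber"),
    "parsnip": ("carbohydrate source", "tuber"),
    "rice": ("carbohydrate source", "grain"),
    "pasta": ("carbohydrate source", "grain"),
    "parsley": ("condiment", "herb"),
    "thyme": ("condiment", "herb"),
    "pepper": ("condiment", "spice"),
}


def candidates(ingredient, mode):
    """List the ingredients that may replace `ingredient` under a swap mode."""
    group, subgroup = TAXONOMY[ingredient]
    others = [name for name in TAXONOMY if name != ingredient]
    if mode == "same_group":      # variation (1)
        return [n for n in others if TAXONOMY[n][0] == group]
    if mode == "other_group":     # variation (2)
        return [n for n in others if TAXONOMY[n][0] != group]
    if mode == "same_subgroup":   # variation (3)
        return [n for n in others if TAXONOMY[n] == (group, subgroup)]
    raise ValueError(f"unknown mode: {mode}")


def mutate(recipe, mode, rng):
    """Copy a recipe and swap one randomly chosen ingredient, if possible."""
    daughter = list(recipe)
    pos = rng.randrange(len(daughter))
    options = [n for n in candidates(daughter[pos], mode) if n not in daughter]
    if options:
        daughter[pos] = rng.choice(options)
    return daughter


rng = random.Random(0)
print(mutate(["potato", "parsley"], "same_subgroup", rng))
```

Running the same seed recipes under each mode and comparing the resulting ingredient statistics would be one way to see whether the food-group constraint matters.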

Further, in addition to ingredient substitution-by-import, one also observes recipe import, which faces the task of having to make do with the local ingredients. Chinese excel in this skill: dishes in Chinese restaurants taste different in each country but roughly similar—in Italy, they even split up the meals into primo and secondo piatti. But when substituting original ingredients with the local ingredients that are only approximations of the original ones, how much remains of the recipe so that one still can talk of instantiations of the dishes described by the original recipe and when is it really a new one? What effect do those imported recipes have on local cuisine? Is there experimental data to say that, statistically, one recipe is better “export material” than others are? Are people [from/who visited] some geographic region better at transporting the local recipes and/or their ingredients elsewhere?

It remains to test whether those mutated recipes are still edible. Forced by ‘necessity’, I did ingredient substitution due to recipe import several times (the Italian shops do not have baked beans, no brown sugar, no condensed milk, no ontbijtkoek, no real butter, no pecan nuts, only a few apple varieties, etc…), and for some recipes the substitute was at least as good as the original, but then the substitute approximated the original. I certainly have not dared mashing together cooked pasta+carrot+onion to make an “Italian-style hutspot”, let alone random ingredient substitutions. In case someone has done the latter and it is not only edible but also recommendable, feel free to drop me a line or add the recipe in the comments.

References and notes

1. Kinouchi, O., Diez-Garcia, R.W., Holanda, A.J., Zambianchi, P., and Roque, A.C. The nonequilibrium nature of culinary evolution. ArXiv 0802.4393v1, 29 Feb 2008. Also published in New J. Phys. 10, 073020 (8pp) doi: 10.1088/1367-2630/10/7/073020

2. van Oirschot, A. (ed.). Van water tot wijn, van korsten tot pastijen. Stichting Brabantse Dag, 1979. 124p.

[note 1] There is a difference between root-tubers (such as parsnip) and stem-tubers (such as potato), but functionally they are quite alike, so that for the remainder of the post I will gloss over this minor point. Some basic information can be gleaned from the Wikipedia entry tuber, and more if you search on Pastinaca sativa (parsnip) and Solanum tuberosum (potato), which are not members of the same family; you may also be interested to check other common vegetables and their names in different languages to explore this further.

Computing and the philosophy of technology

Recently I wrote about the philosophy of computer science (PCS), where some people, online and offline, commented that one cannot have computer science but has the science of computing instead, and that it must include experimental validation of the theory. Regarding the latter, related and often-raised remarks are that it is computer engineering instead, which, in turn, is closely tied to information technology. In addition, many a BSc programme trains students so as to equip them with the skills to obtain a job in IT. Coincidentally, there is a new entry in the Stanford Encyclopedia of Philosophy on the Philosophy of Technology (PT) [1], which, going by the title, may well be more encompassing than just PCS and perhaps just as applicable. In my opinion, not quite. There are several relevant aspects of the PT for computing, but if one takes the strand of engineering and IT of computing, then there are some gaps in the overview presented by Franssen, Lokhorst and Van de Poel. The main gap I will address is that it misses the centrality of ‘problem’ in engineering and that it conflates it with notions such as requirements and design.

But to comment on it, let me first mention a few salient points of their sections “historical developments” and “analytic philosophy of technology” to illustrate notions of technology and which strand of PT the SEP entry is about (I will leave the third part, “ethical and social aspects of technology”, for another time).

A few notes on its historical developments
Plato put forward the thesis that “technology learns from or imitates nature”, whereas Aristotle went a step further: “generally, art in some cases completes what nature cannot bring to a finish, and in others imitates nature”. A different theme concerning technology is that “there is a fundamental ontological distinction between natural things and artifacts” (not everybody agrees with that thesis), and if so, what, then, those distinctions are. Aristotle, again, also has a doctrine of the four causes with respect to technology: material, formal, efficient, and final.
As is well known, there was a lull in the intellectual and technological progress in Europe during the Dark Ages, but the Renaissance and the industrial revolution made thinking about technology all the more interesting again, with notable contributions by Francis Bacon, Karl Marx, and Samuel Butler, among others. Curiously, “During the last quarter of the nineteenth century and most of the twentieth century a critical attitude [toward technology] predominated in philosophy. The representatives were, overwhelmingly, schooled in the humanities or the social sciences and had virtually no first-hand knowledge of engineering practice” (emphasis added). This type of PT is referred to as humanities PT, which looks into socio-politico-cultural aspects of technology. Not everyone was, or is, happy with that approach. Since the 1960s, an alternative approach has been gestating and is becoming more important, which can be dubbed the analytic PT that is concerned with technology itself, where technology is considered the practice of engineering. The SEP entry about PT is about this type of PT.

Main topics in Analytic PT
The topics that the authors of the PT overview deem the main ones have to do with the discussion on the relation between science and technology (and thus also on the relation of philosophy of science and of technology), the importance of design for technology, design as decision-making, and the status and characteristics of artifacts. Noteworthy is that their section 2.6 on other topics mentions that “there is at least one additional technology-related topic that ought to be mentioned… namely, Artificial Intelligence and related areas” (emphasis added) and the reader is referred to several other SEP entries. Notwithstanding the latter, I will comment on the content from a CS perspective anyway.

Sections 2.1 and 2.2 contain a useful discourse on the differences between science and engineering and on their respective philosophy-of. It was Ellul in 1964 who observed in technology “the emergent single dominant way of answering all questions concerning human action, comparable to science as the single dominant way of answering all questions concerning human knowledge” (emphasis added). Skolimowski phrased it as “science concerns itself with what is, whereas technology with what is to be.” Simon used the verb tenses are and ought to be, respectively. To drive the message home, they also say that science aims to understand the world but that technology aims to change the world, and is seen as a service to the public. (I leave it to the reader to assess what, then, the great service to the public is of artifacts such as the nuclear bomb, flight simulator games for kids, CCTV, and the like.) In contrast, while Bunge did acknowledge technology was about action, he stressed that it is action heavily underpinned by theory, thereby distinguishing it from the arts and crafts. Applied to software engineering: to say that the development of ontologies, or the conceptual modelling during the conceptual analysis stage of software development, is largely or wholly an art is to say that there is no underlying theoretical foundation for those activities. This is not true. Indeed, there are people who, without being aware of its theoretical foundations, go off and develop an ontology or ontology-like artifact nevertheless and call that activity an art, but that does not make the activity an art per se.
Closing the brief summary of the first part of the PT entry, Bunge mentioned also that theories in technology come into two types: substantive (on knowledge about the objects of action) and operative (on the action itself). A further distinction has been put forward by Jarvie, which concerns the difference between knowing that and knowing how, the latter which Polanyi made a central aspect of technology.

Designs, requirements, and the lack of problems
The principal points of criticism I have are on the content of sections 2.3 and 2.4, which revolve around the messiness in the descriptions of design, goals, and requirements. Franssen et al. first state that “technology is a practice focused on the creation of artifacts and, of increasing importance, artifact-based services. The design process, the structured process leading toward that goal, forms the core of the practice of technology” and that the first step in that process is to translate the needs or wishes of the customer into a list of functional requirements. But ought the first step not to be the identification of the problem? For if we then ultimately solve the problem (going through writing down the requirements, designing the artifact, and building and testing it), it is a ‘service to the public’. Moreover, only if we have identified the problem, formulated it in a problem description, and written a specification of the goal to achieve can we write down the list of requirements that are relevant to achieving the goal and solving the problem, which will not only avoid a ‘solution’ exhibiting the same problem but also meet the goal and be this service to the public. But problem identification and problem solving are not a central issue to technology, according to the PT entry; just designing artifacts based on functional requirements.
Looking back at the science and engineering studies I enjoyed, the former had put an emphasis on formulating and testing the hypotheses—no thesis could do without that—whereas the latter focusses on which problem(s) your research tackles and solves, where theses and research papers have to have a (list of) problem(s) to solve. If a CS paper does not have a description of the problem, one might want to look again what they are really doing, if anything; if it is a mere ‘we did develop a cute software tool’, it is straightforward to criticise and easily can be rejected because it is insufficient regarding the problem specification and the system requirements (the reader may not share the unwritten background setting with the authors, hence think of other problems and requirements that he would expect the tool to solve), and the ‘solution’ to that (the tool) is then no contribution to science/engineering, at least not in the way it is presented. Not even mentioning the problem-identification-and-solving and its distinction with science in the approach of doing research is quite a shortcoming of the PT entry.

In addition, the authors confuse such different topics even more clearly by stating that “The functional requirements that define most design problems” (emphasis added). Design problems? So there are different types of problem, and if so, which? How can requirements define a problem? They don’t. The requirements entail what the artifact is supposed to do and what constraints it has to meet. However, if they define the problem instead, as the authors say, then they are not parameters for the solution to that problem, even though the to-be-designed artifact that meets the requirements is supposed to solve a (perceived) problem. Further, they say that engineers claim that their designs are optimal solutions, but can one really say that a design is the solution, rather than the outcome of the design, i.e., the realisation of the working artifact?

If it is the case, as the authors summarise, that technology aims to change the world, then one needs to know what one is changing, why one is changing that, and how. If something is already ‘perfect’, then there is no need to change it. So, provided it is not ‘perfect’ according to at least one person, then there is something to change to make it (closer to) ‘perfect’; hence, there is a problem with that thing (regardless of whether this thing is an existing artifact or an annoying practice that perhaps could be solved by using technology). Thus, one (well, at least, I) would have expected that ‘problem’ would be a point of inquiry, e.g.: what is it, what does it depend upon, if there is a difference between problem and nuisance, how it relates to requirements and design and human action, the processes of problem identification and problem specification, the drive for problem-solving in engineering and the brownie points one scores with it upon having a solution, and even the not unheard of, but lamentable, practice of ‘reverse engineering the problem’ (i.e.: the ‘we have a new tool, now we think of what it [may/can/does] solve and that can be written in such a way as if it [is/sounds like] a real problem’).

In closing, the original PT entry was published in February this year, and quickly has undergone “substantive revision” meriting a revised release on 22 June ’09, so maybe the area is active and in flux so that a next version could include some considerations about problems.

[1] Franssen, Maarten, Lokhorst, Gert-Jan, van de Poel, Ibo, “Philosophy of Technology”, The Stanford Encyclopedia of Philosophy  (Fall 2009 Edition), Edward N. Zalta (ed.), forthcoming URL =

Metagenomics updated and slightly upgraded

The Nature TOC-email arrived yesterday, and they have a whole “insight” section on microbial oceanography! Four years ago, Nature Reviews Microbiology had a special issue with a few papers about it, two years ago PLoS Biology presented their Oceanic Metagenomics Collection, and now there is the Nature supplement. Why would a computer scientist like me care? Well, my first study was in microbiology, and they have scaled up things a lot in the meantime, thereby making computers indispensable in their research. For those unfamiliar with the topic, you can get an idea about early computational issues in my previous post and comments by visitors, but there are new ones that I’ll mention below.

Although the webpage of the supplement says that the editorial is “free access”, it is not (as of 14-5 about 6pm CET and today 9am). None of the 5 papers—1 commentary and 4 review papers—indicate anything about the computational challenges: “the life of diatoms in the world’s oceans”, “microbial community structure and its functional implications”, “the microbial ocean from genomes to biomes”, and “viruses manipulate the marine environment”. Given that DeLong’s paper of 4 years ago [1] interested me, I chose his review paper in this collection [2] to see what advances have been made in the meantime (and that article is freely available).

One of the issues mentioned in 2007 was the sequencing and cleaning up noisy data in the database, which now seems to be much less of a problem, even largely solved, so the goal posts start moving to issues with the real data analysis. With my computing-glasses on, Box 2 mentions (my emphases underlined and summarised afterwards):

Statistical approaches for the comparison of metagenomic data sets have only recently been applied, so their development is at an early stage. The size of the data sets, their heterogeneity and a lack of standardization for both metadata and gene descriptive data continue to present significant challenges for comparative analyses … It will be interesting to learn the sensitivity limits of such approaches, along more fine-scale taxonomic, spatial and temporal microbial community gradients, for example in the differences between the microbiomes of human individuals. As the availability of data sets and comparable metadata fields continues to improve, quantitative statistical metagenomic comparisons are likely to increase in their utility and resolving power. (p202)

Let me summarise that: DeLong asserts they need to (i) have metadata annotations as a prerequisite for statistical approaches, (ii) deal with temporal data, and (iii) deal with spatial data. There is a lot of research and prototyping going on in topics ii and iii, and there are a few commercial industry-grade plugins, such as the Oracle Cartridge, that do something with the spatial data representation. Perhaps that is not enough or it is not what the users are looking for; if this is the case, maybe they can be a bit more precise on what they want?

Point i is quite interesting, because it basically reiterates that ontologies are a means to an end and asserts that statistics cannot do it with number crunching alone but needs structured qualitative information to obtain better results. The latter is quite a challenge: probably technically doable, but there are few people who are well versed in the combination of qualitative and quantitative analysis. Curiously, only MIAME and the MGED website are mentioned for metadata annotation, even though they are limited in scope with respect to the subject domain ontologies and ontology-like artefacts (e.g., the GO, BioPax, KEGG), which are also used for annotation but not mentioned at all. The former deal with sequencing annotation following the methodological aspects of the investigation, whereas the latter type of annotations can be done with domain ontologies, i.e., annotating data with what kind of things you have found (which genes and their function, which metabolism, what role does the organism have in the community, etc.), which are also needed to carry out the desired comparative analyses.

There is also more generic hand-waving that something new is needed for data analysis:

The associated bioinformatic analyses are useful for generating new hypotheses, but other methods are required to test and verify in silico hypotheses and conclusions in the real world. It is a long way from simply describing the naturally occurring microbial ‘parts list’ to understanding the functional properties, multi-scalar responses and interdependencies that connect microbial and abiotic ecosystem processes. New methods will be required to expand our understanding of how the microbial parts list ties in with microbial ecosystem dynamics. (p203)

Point taken. And if that was not enough,

Molecular data sets are often gathered in massively parallel ways, but acquiring equivalently dense physiological and biogeochemical process data is not currently as feasible. This ‘impedance mismatch’ (the inability of one system to accommodate input from another system’s output) is one of the larger hurdles that must be overcome in the quest for more realistic integrative analyses that interrelate data sets spanning from genomes to biomes.

I fancy the thought that my granularity might be able to contribute to the solution, but that is not yet anywhere close to user-level software application stage.

At the end of the paper, I am still left, as in 2005 and 2007, with the impression that the whole metagenomics endeavour generates more data than there are computational tools to analyse it and squeeze out all the information that is ‘locked up’ in the pile of data.


[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.

[2] DeLong, E.F. The microbial ocean from genomes to biomes. Nature, 2009, 459: 200-206.

Brief review of the Handbook of Knowledge Representation

The new Handbook of Knowledge Representation edited by Frank van Harmelen, Vladimir Lifschitz and Bruce Porter [1] is an important addition to the body of reference and survey literature. The 25 chapters cover the main areas in Knowledge Representation (KR), ranging from basic KR, such as SAT solvers, Description Logics, Constraint Programming, and Belief Revision, to specific core domains of knowledge, such as Spatial and Temporal KR & R, and Nonmonotonic Reasoning, to shorter ‘application’ chapters that touch upon the Semantic Web, Question Answering, Cognitive Robotics, and Automated Planning.

Each chapter roughly follows the approach of charting the motivation and problems the research area attempts to solve, the major developments in the area over the past 25 years, important achievements in the research, and where there is still work to do. In a way, each chapter is a structured ‘annotated bibliography’ (many chapters have about 150-250 references each) that serves as an introduction and a high-level overview. This is useful in several situations. If your specific interest is not covered in a university course but you have a thesis student whom you want to work on that topic, then the appropriate chapter will give the student not only an idea about it but also an entry point as to which principal background literature to read further. If you are a researcher writing a paper and do not want to put a Wikipedia URL in the references (yes, I’ve seen papers where authors had done that), it gives you a proper reference. And if you are, say, well-versed in DL-based reasoners but come across a paper where one based on constraint programming is proposed, you have a quick reference to check what CP is about without ploughing through the handbook on constraint programming. Compared with the other topics, anyone interested in ‘something about time’ will be well served by the four chapters on temporal KR & R, situation calculus, event calculus, and temporal action logics. Clearly, the chapters in the handbook on KR are not substitutes for the corresponding “handbook on [topic-x]” books, but they do provide a good introduction and overview.

Some chapters are denser in providing a detailed overview than others (e.g., qualitative spatial reasoning vs. CP, respectively), however, and some chapters provide a predominantly text-based overview whereas others include formalisms with precise definitions, axioms, and theorems (Qualitative Modelling, Physical Reasoning, and Knowledge Engineering vs. most others, respectively). That most chapters include some logic comes as no surprise to the KR researcher but may be one for the novice or a searching ontology engineer. For the latter group, and logic-sceptics in general, there is a juicy section in chapter 1, “General Methods in Knowledge Representation and Reasoning”, called “Suitability of Logic for Knowledge Representation”, which takes on the principal anti-logicist arguments and rebuts each complaint in about six pages. Another section that can be good for heated debates is Guus Schreiber’s (too) brief comment on the difference between “Ontologies and Data Models” (chapter 25), which could easily fill a few pages instead of the less than half a page now used for arguing there is a distinction between the two.

Although I warmly recommend the handbook as an addition to the library, there are also a few shortcomings. One may have to do with the space limitations (even though the book is already over 1000 pages), whereas the other one might be due to the characteristics of research in KR & R itself (to some extent at least). They overlap with the kind of shortcomings Erik Sandewall has mentioned in his review of the handbook. Several topics that are grouped under KR are not dealt with in the book, or only very minimally (e.g., uncertainty and ontologies, respectively), or are treated in a fragmented, isolated way across chapters where they perhaps should have been consolidated into a separate chapter (i.e., abstraction, but also ontologies). In addition, within the chapters, it may well occur that some subtopics are perceived to be missing from the overview or mentioned too briefly in passing (e.g., mereology and DL-Lite for scalable reasoning), but this also depends on one’s background. On the other hand, the chapters on Qualitative Modelling and Physical Reasoning could have been merged into one chapter.

The other point concerns the lack of elaboration on real-life success stories as significant contributions of a topic, which a KR novice or a specialised researcher venturing into another sub-topic may be looking for. However, the handbook charts the research progress in the respective fields, not the knowledge transfer from KR research output to the engineering areas where the theory is put to the test and implementations are tried out. It is a philosophical debate whether doing science in KR should include testing one’s theories. To give an idea about this discrepancy, part III of the handbook is called “Knowledge Representation in Applications” (emphasis added), which contains a chapter, among five others, on “The Semantic Web: Webizing Knowledge Representation”. From a user perspective, including software engineers and the most active domain expert adopters (in biology and medicine), the Semantic Web is still largely a vision, but not yet a success story of applications: people experiment with implementations, but the fact that there are people willing to give it a try does not imply it is a success from their point of view. Put differently, it says more about the point of view of KR&R that it is already categorised under applications. True, as the editors note, one needs to build upon advances achieved in the base areas surveyed in parts I and II, but is it really ‘merely’ ‘applying’, or does the required linking of the different KR topics in these application areas bring about new questions and perhaps even solutions to the base KR topics? The six chapters in part III differ in the answer to this question. As in any healthy research field, there are many accomplishments, but much remains to be done.

[1] Frank van Harmelen, Vladimir Lifschitz and Bruce Porter (Eds.). Handbook of Knowledge Representation. Elsevier, 2008, 1034p. ISBN-13: 978-0-444-52211-5 / ISBN-10: 0-444-52211-5.

Editorial freedoms?

Or: on editors changing your article without notifying (neither before nor after publication).

The book The changing dynamic of Cuban civil society came out last year right before I went to Cuba, so I had read it as a preparation, and, it being a new book, I thought I might as well write a review for it and see if I could get it published. The journal Latin American Politics and Society (LAPS) was interested, and so it came to be that I submitted the book review last July and got that review accepted. Two days ago I was informed by Wiley-Blackwell, the publisher of LAPS, that I could download the offprint of the already published review: it had appeared in LAPS 50(4):189-192 back in the winter 2008 issue.

The published review is for “subscribers only” (but I’m allowed to email it to you) and to my surprise and disbelief it was not quite the same as the one I had sent off to the LAPS book review editor Alfred Montero. They had made a few changes to style and grammar, which, given that English is not my mother tongue, were probably warranted (although it would have been appropriate to inform me about that beforehand). There are, however, also three significant changes to the content. More precisely: two deletions and one addition.

The first one is at the beginning, where an introduction is given on what constitutes ‘civil society’. Like in the book, some examples are given, as well as the notion of a ‘categorisation’ of organisations. The original text (see pdf) is as follows:

According to this description, hobbies such as bird watching and playing rugby is not part of civil society, but the La Molina against the US base in Vicenza and Greenpeace activism are. In addition, one may want to make a categorization between the different types of collectives that are part of a civil society: people have different drives or ideologies for improving or preventing deterioration of their neighborhood compared to saving the planet by attempting to diminish the causes of climate change.

This has been trimmed down to:

According to this description, hobbies such as bird watching and playing rugby are not part of civil society, but Greenpeace activism is. In addition, one may want to make a categorization between the different types of collectives that are part of a civil society: people have different drives or ideologies for improving or preventing deterioration of their neighborhood, compared to saving the planet by attempting to diminish the causes of climate change.

A careful reader may notice that there is a gap in the logic of the examples: the No Dal Molin activism against the US base is an example of NIMBY-activism (Not In My BackYard), which the second sentence refers to, but the example in the first sentence is now missing. There being no example of this type in the book, I felt the need to give one anyway. Perhaps it would have remained in the final text if I had used the NIMBY-activism against the TAV (high speed train), which is irrelevant to the US. The No Dal Molin activism, however, is a much more illustrative example of the interactions between a local grass-roots civil society organisation and both national and international politics, and how the so-called ‘spheres of influence’ of the actors have taken shape.

The addition is a verb, “to act”. The original:

Christine Ayorinde discusses both the historical reluctance of the, until 1992, atheist state against religious groups—used as counterrevolutionary tool primarily by the U.S. in the early years after the Revolution—and the loosening by the, now constitutionally secular, state …

The new sentence:

Christine Ayorinde discusses both the historical reluctance of the atheist state (until 1992) to act against religious groups—used as counterrevolutionary tool primarily by the United States in the early years after the revolution—and the loosening by the now-constitutionally secular state, …

But it is not the case that the state was reluctant to act against religious groups; it was reluctant about, and hampered, the involvement of foreign religious groups, because such involvement was used, primarily by the US, as a way to foment dissent against the Revolution.

The second deletion actually breaks a claim I make about the chapters in the edited volume and weakens an important observation on the operations of civil society organisations in Cuba, and of foreign NGOs in particular.

The original:

A personal experience perspective is given by Nino Pagliccia from the Canadian Volunteer Work Brigade (Chapter 5). This is a fascinating chapter when taken in conjunction with Alexander Gray’s chapter that analyses personal perspectives and changes in procedures from the field from a range of civil society actors (Chapter 7). … Pagliccia’s, as well as the representative of Havana Ecopolis project’s—Legambiente-funded, which is at the green-left spectrum of the Italian political arena—documented experiences of cooperation in Cuba have the component of shared ideology, whereas other representatives, such as from Save the Children UK, talk about shared objectives instead even when their Cuban collaborators assume shared ideology. Notably, the latter group of foreign NGOs report more difficulties about their experiences in Cuba.

How it appears in the published version:

A personal perspective is given by Nino Pagliccia, from the Canadian Volunteer Work Brigade (chapter 5). This is a fascinating chapter when considered together with Gray’s chapter 7, which analyzes personal perspectives and changes in procedures from a range of civil society actors. … Pagliccia’s documented experiences of cooperation in Cuba have the component of shared ideology, whereas other representatives, such as those from Save the Children UK, talk about shared objectives instead, even when their Cuban collaborators assume the former. Notably, the latter group of foreign NGOs report more difficulties in their experiences in Cuba.

But the reference to Havana Ecopolis comes from Chapter 7. In fact, of the interviewees, its representative was the only one really positive about the experiences in the successful foreign-initiated project/NGO, which made me think back to Pagliccia’s Workers Brigade and solidarity vs. charity. I wondered where the funding of Havana Ecopolis came from, Googled a bit, and arrived at the Legambiente website (project flyer). Needless to say, openly leftist organisations also had positive experiences with collaboration; but in analyzing the effectiveness of foreign NGO involvement, unveiling the politically-veiled topical NGOs is a distinguishing parameter. Moreover, it is an informally well-known one with respect to Cuba’s reluctance to let foreign NGOs into the country. Thus, it explains why the Havana Ecopolis experience stood out compared to the other documented NGO experiences in Cuba. But now, in the revised text, the “Notably, the latter group … more difficulties …” sounds a bit odd and is not backed up at all. They even toned down Pagliccia’s contribution from “A personal experience perspective” to “A personal perspective”: there surely is a difference between being informed by having spent some time in Cuba working side-by-side with the Cubans and just having a perspective on Cuba without having a clue what the country is like; now it reads like ‘yeah, whatever, opinions are like assholes: everybody’s got one…’. Note that when one reads the book, one sensibly can make the link between the data and analysis presented on solidarity vs. charity vs. cooperation and the shared-ideology vs. shared-objectives NGOs (ch5 & 7). What remains is to make a categorisation of foreign NGOs and conduct a quantitative analysis to back up the obvious qualitative one.

I hope that this case is an exception, but maybe it is the modus operandi in the humanities that things get edited out. It certainly is not in computer science, where only the authors can fiddle a bit with the text when a paper is accepted, and even less so in the life sciences where, upon paper acceptance, thou shalt not change a single word.

UPDATE (22-3-2009): the current status of the contact I had with the LAPS editorial office is that the book review editor, Alfred Montero, did not change anything, but that that happened during copyediting by the managing editor, Eleanor Lahn. She has provided me with an explanation why the changes were done, which has a curious argumentation to which I have replied. This reply also contains a request for clarity and consistency in the procedure (now the book editor assumes the copyeditor contacts the author, whereas the copyeditor normally does not do so), but I have not yet received a response on that email.

What philosophers say about computer science

Yes, there is a Philosophy of Computer Science, but it does not appear to be just like the average philosophy of science or philosophy of engineering kind of stuff. Philosophers of computer science are still figuring out what exactly is computer science—such as “the meta-activity that is associated with programming”—but then finally settle for a weak “in the end, computer science is what computer scientists do”. Currently, the philosophy of computer science (PCS) is mostly an amalgamation of analyses of artefacts that are the products of doing computer science and many, many, research questions.

The comprehensive introductory overview of PCS in the Stanford Encyclopedia of Philosophy by Turner and Eden [1] starts with “some central issues” that are phrased into 27 principal questions that are in need of an answer. To name just a few: What kind of things are computer programs? What kind of things are digital objects, and do we need a new ontological category for them? What does it mean for a program to be correct? What is information? Why are there so many programming languages and programming paradigms? The remainder of the SEP entry on PCS is devoted to giving the setting of the discourses in some detail.

The content of the entry focuses primarily on the dual nature of programs, programming languages, semantics, logic, proofs, and closes with legal and ethical issues in PCS; a selection that might be open to criticism by people who feel their pet topic ought to have been covered as well (but I leave it to them to stand up). The final words, however, are devoted to the nagging question whether PCS merely gives a twist to other fields of philosophy or whether there are indeed new issues to resolve.

In any case, it is already noteworthy to read that Turner and Eden do not consider the act of programming itself as being part of computer science; and rightly so. That could be useful to take into account by people who write job openings offering new postdoc/assistant professor positions, many of which require “strong programming skills”. By the philosophers’ take on CS, they should have asked for good programmers but not researchers, because CS researchers are specialised in the meta-activities such as “design, development and investigation of the concepts and methodologies that facilitate and aid the specification, development, implementation and analysis of computational systems” [1] (emphasis added), i.e., not the actual development and implementation of software systems. On the other hand, the weird thing with the PCS entry is their CS-is-what-computer-scientists-do: if computer scientists decide to slack collectively, then PCS itself would be reduced to investigating the non-output of CS, i.e., become obsolete; or, if research funding agencies were to pay mainly or only for ‘integrated projects’ that apply theory to develop working information systems for industry (or biologists, healthcare professionals, etc.), is that then what CS degenerates into? If that were the case, then CS would not merit being called a science.

In this narrative, now would be a good place for a treatise on definitions of computer science to demonstrate the discipline is sane and structured, to simplify communication with non-informaticians, and feed philosophers with the settled basics to dwell upon. But then I would have to go into finer distinctions, made in e.g. the Netherlands, between informatica and informatiekunde (roughly, computer science vs. information engineering), where the former is more theoretical along the lines of the topics in the PCS entry and the latter includes more socio-, psycho-, cognitive, and managerial topics, and into extant definitions by the various organisations, dictionaries and so forth (e.g., here, here, here). I will do a bit more homework on this first, before writing anything about that (and before that, I will go to Geomatica’09 in Cuba and glue a few days holiday to the conference visit; so, within a few days, it will be quiet here until the end of February).

[1] Turner, R. and Eden, A. The Philosophy of Computer Science. In: Stanford Encyclopedia of Philosophy, Zalta, E. (Ed.), published 12 Dec. 2008.

Conflict data analysis issues and the dirty Dirty War Index

In the previous post on building bias into your database, I outlined seven modelling tricks to build your preference into the information system. Here, I will look at some of those databases and a tool/calculation built on top of such conflict databases (the ‘dirty war index’).

Conflict databases

The US terrorism incident database at MIPT suffers from most of the afore-mentioned pitfalls, which drove a recently graduated friend, Dr. Fraser Gray, to desperation asking me if I could analyse the numbers (but, alas, the database has some inconsistencies). I have more, official, detail about the design rationale and limitations of the civil war incident databases developed by Weinstein and by Sutton. In his fascinating book Inside Rebellion [1], Weinstein has described his tailor-made incident database in Appendix B; so, although I’m going to comment on the database, I still highly recommend you read the book.

Weinstein applies organisational theory to rebel organisations in civil war settings and tests his hypotheses experimentally against case studies of Uganda, Mozambique, and Peru. As such, his self-made database was made with the following assumption in mind: “civilians are often the primary and deliberate target of combatants in civil wars… Accordingly, an appropriate indicator of the “incidence” of civil war is the use of violence against noncombatant populations.” Translated to the database focus, it is a people-centred database, not, say, target-centred. Not only deaths are counted, but also a range of violations, including mutilation, abduction, detention, looting, and rape, and victim characteristics with name, age, sex, affiliation and affiliation groups, such as religious leaders, students, occupation of civilian, and traditional authorities (according to Appendix B).

Geography is coded only at a high level; at least, the information provided in chapter 6, which deals with the quantitative data, discusses only (aggregated?) rough regions, such as Mozambique’s “north”, “centre” and “south”, and for Sendero Luminoso-Huallaga no sub-regions at all. To its merit, it has a year-by-year breakdown of the incidents, although one has no access to which types of incidents exactly, even though they are supposed to be in the database. It does not discuss quantitatively the types of arms and the targets; for understanding the dynamics of the conflict, it certainly makes a difference whether, say, targets like water purification plants are blown up or military bases attacked, and whether sophisticated ‘non-conventional arms’ are used or machetes. If we want to know that, it seems we have to redo the data collection process. No statistical analysis is performed, so that for, e.g., the size of the victim groups we get indications of ‘relatively more’, and barely even percentages or ratios to make comparisons across years or across conflicts, although these could have been computed from the stacked-bar charts of the (yet again aggregated) data. The huge number of incidents marked as “unclear” for Peru only has guessed explanations, due to data collection issues (e.g., for 1987 some 500 “unclear” versus about 40 attributed to Sendero Luminoso-Nacional and 30 to the government); try feeding such data into the DWI (see below). The definitions of “civilian” and “non-combatant” are not clear, not even sort of inferable as with Sutton’s database (see below).

Overall, it merely gives a rough idea of some aspects of the examined conflicts, but maybe this already suffices for comparative politics.

UPDATE (21-1-2009): Jeremy Weinstein kindly responded via email. He is aware of the aggregations used in the data analysis, which were intended to serve a descriptive role, and pointed me to an effort of more detailed data collection, finer-grained analysis, and online data (in proprietary Stata format) of the conflict in Sierra Leone, which was published in American Political Science Review. That freely available paper, Handling and manhandling civilians in civil war, also gives an indication of what the reader can expect of the contents in the book, and has a set of 8 hypotheses that are tested against the data (not all of them could be confirmed).

The Dirty War Index

There are people who build tools upon such conflict databases. Garbage In, Garbage Out? I will highlight one of those tools, which received extensive coverage in PLoS Medicine recently [2,3,4]: being able to calculate a “Dirty War Index” for a variety of parameters that follow the pattern of DWI = \frac{nr\_of\_dirty\_cases}{total\_nr\_of\_cases} \times 100 . The cases and their aggregates to nr of cases come from the conflict’s incidents databases. Go figure. It’s not just that, but one could/would/should assume that the examples Hicks and Spagat give in their paper [3] are to illustrate, but not to invalidate, their DWI approach.

Let us take their first example, the DWIs for the actors in the Colombian civil conflict as the measure \frac{nr\_of\_civilians\_killed}{total\_nr\_of\_civilians\_killed + combatants\_killed} \times 100 . The ‘guerrillas’ (presumably FARC) have a DWI of \frac{2498}{5444} \times 100 = 46 , the ‘government forces’ \frac{593}{659} \times 100 = 45 , and the ‘illegal paramilitaries’ (a pleonasm) \frac{6944}{6985} \times 100 = 99 (numbers taken from the simple Colombia conflict database [5]). Hicks and Spagat explain that “Guerrillas rank 2nd in killing absolute numbers of civilians”, as if the government forces deserve a laurel for having the best (closest to 0) DWI (with a mere 1-point margin) and as if paramilitaries are independent of the government, whereas it is the norm, rather than the exception, that governments tend to arrange for a third party to do the dirty work for them (with or without external funding) so as to look comparatively good in the international spotlights. Aggregating by ‘opponents of FARC’, we get a DWI of \frac{593+6944}{659+6985} \times 100 = 98.6 , which is substantially dirtier than FARC’s, a difference that can no longer be explained away by data collection biases [4]; to put it differently, FARC is in this DWI the proverbial ‘lesser of two evils’, or, if you support their cause, then you could say they have good reason to be annoyed with the current violent governance in the country. This also suggests that requiring “recognition in Colombia’s paramilitary demobilization, disarmament, and reintegration process” [3] alone may not be enough to achieve durable peace for Colombians.
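To make the arithmetic easy to re-run, here is a minimal sketch that recomputes the per-actor and aggregated DWIs from the counts quoted above. The function and variable names are hypothetical and chosen only for illustration; the counts are copied from the text, not taken from the CERAC database itself. Note that the government-forces fraction as printed, 593/659, evaluates to roughly 90 rather than 45, so one of those printed figures may have been mistranscribed; the sketch simply computes from the counts as given.

```python
# Hypothetical helper names; counts are the (civilians killed, total killed)
# pairs quoted in the paragraph above.

def dirty_war_index(dirty_cases: int, total_cases: int) -> float:
    """DWI = (number of 'dirty' cases / total number of cases) * 100."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    if not 0 <= dirty_cases <= total_cases:
        raise ValueError("dirty_cases must lie between 0 and total_cases")
    return 100.0 * dirty_cases / total_cases


COUNTS = {
    "guerrillas": (2498, 5444),
    "government forces": (593, 659),
    "paramilitaries": (6944, 6985),
}


def aggregated_dwi(actors: list[str]) -> float:
    """Pool the counts of several actors before computing a single DWI."""
    dirty = sum(COUNTS[actor][0] for actor in actors)
    total = sum(COUNTS[actor][1] for actor in actors)
    return dirty_war_index(dirty, total)


if __name__ == "__main__":
    for actor, (dirty, total) in COUNTS.items():
        print(f"{actor}: {dirty_war_index(dirty, total):.1f}")
    pooled = aggregated_dwi(["government forces", "paramilitaries"])
    print(f"opponents of FARC (pooled): {pooled:.1f}")
```

Pooling the raw counts before dividing, rather than averaging the per-actor indices, is what the ‘opponents of FARC’ aggregate in the text does; which actors get pooled is exactly the modelling choice that changes the picture.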

The other main illustration is the conflict in Northern Ireland by using two complementary DWIs: “aggressive acts (killing civilians) and endangerment to civilians (by not wearing uniforms)”[1]. The ‘British Security Forces’ (BSF) have a “Civilian mortality DWI” of 52, the ‘Irish Republican Paramilitaries’ (IRP) 36, and the ‘Loyalist paramilitaries’ (LP) 86—note the odd naming and aggregations, e.g., are we talking IRA, or lumping the IRA together with the Real-IRA and Continuity-IRA, and all UFF, LVF…? Consulting the extensive source database, it lists 29 groups. In addition, [3]’s “number of civilian + civilian political activist” are, respectively, 190+738+873=1801, but the source’s data has 1797 civ.+ 58 civ.pol.activists = 1855, and then a series of statuses such as “ex-British army”, “ex-IRA” and so forth, who, while being “ex-” are not real civilians according to the database. Much more data for compiling your preferred DWI and preferred details or aggregates can be found here [6].

The “Attacks without uniform DWI” values are given as “approaches 0” (BSF), “approaches 100” (IRP) and “approaches 100” (LP), without actual values to do the calculation with; despite this vagueness, for the IRP they prefer the phrase “extremely high rate” but for the LP it is only “very high rate”. They try a comparatively long explanation for the nastiness of the IRP, but it is plain that the BSF and LP have the dirtiest civilian DWI and that LP killed most civilians, no matter how one wants to explain it away and dress it up with DWIs (maybe not so coincidentally, the authors are affiliated with UK institutions).

I will leave Hicks and Spagat’s “female mortality DWI” of the Arab-Israeli conflict and the “child casualty DWI” of Chechnya for the interested reader to analyse (including the term ‘unexploded ordnance’ that injured or killed children—by exploding).

Although the idea of multiple DWIs can indeed be interesting to give a rough indication, there is the real danger of misuse due to unfair sanitation of data: it can easily stimulate misinterpretation by showing some neat aggregated numbers without having to assess the source data and by brushing over the reality on the ground that a bean-counting person may not be aware of and more readily can set aside in favour of the aggregated numbers.

Hicks and Spagat do have a section on considerations, but that their two main worked-out examples with Colombia and Northern Ireland are problematic already just proves the point about possible dubious use for one’s own political agenda. Perhaps they would say the same of my alternative rendering being politically coloured, but I do not try to give it a veneer of credibility and advantages of DWIs, just that it is simple to turn around and play with the DWIs to suit one’s preferences, whichever they may be.

UPDATE (5-6-’09): a more comprehensive review of Hicks and Spagat’s paper will be published in the autumn 2009 issue of the Peace & Conflict Review.

[1] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press.

[2] Sondorp E (2008) A new tool for measuring the brutality of war. PLoS Med 5(12): e249. doi:10.1371/journal.pmed.0050249

[3] Hicks MH-R, Spagat M (2008) The Dirty War Index: A public health and human rights tool for examining and monitoring armed conflict outcomes. PLoS Med 5(12): e243. doi:10.1371/journal.pmed.0050243.

[4] Taback N (2008) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[5] The numbers originate from CERAC’s Colombia conflict database as reported in [3]; both Hicks and Spagat are research associates of CERAC; database available after registration, which has substantially fewer types of information and less explanation than Sutton’s [6] database.

[6] CAIN Web Service as reported in [3]; database freely available, including data, querying, and design and data collection choices.

[1] The latter DWI is theoretically problematic, because the distinction between actors who use violence and their supporters in the population (be it passively or actively with food, shelter, and logistics) is often not that clear, and off-duty soldiers are not necessarily automatically civilians; but the argument is long. Hicks and Spagat’s table 3 has a longer list than just this item, and I shall not digress further on the topic here.

Ontologies in ecology: putting the lessons-learned to good use and moving forward

While most of the headlines and attention in bio-ontologies has gone to the Gene Ontology, later also the FMA, and, most recently, the set of ontologies within or close to the OBO Foundry project, it has been comparatively more modest in the area of ontologies for ecology. This is set to change.

Madin et al. [1] published a review article last month in Trends in Ecology and Evolution that is not only about the state of the art of existing ontologies for ecology, but also an ode to the development and use of ontologies. The latter is not framed in a bright-vision-follow-me way, but by noting (among others) the problems of

terminological ambiguity [that] slows scientific progress, leads to redundant research efforts, and ultimately impedes advances towards a unified foundation for ecological science

and showing clear examples of the kinds of problems ontologies can help to solve.

Recollecting the OWLED’07 industry panel discussion last year, it seemed as if industry was at the point where bio-ontologies were 5-8 years ago and, moreover, about to reinvent the wheel. Not so with ontologies for ecology. Madin et al. have separate information boxes about “building consistent ontologies” explaining the difference between is-a and instance-of, is-a and part-of, and is-a and constitution. Those things that early adopters learned the hard way a few years ago are presented as a known basic starting point. Likewise for the info-box on “What is an ontology?” and the straight adoption of OWL and the benefits of automated reasoners. In the overview presented by Madin et al., there are no issues to resolve on trying to be backward compatible with the obo format, but they go straight to the W3C standardized formal ontology representation languages for the ontologies for ecology. The same holds for box 2 on finding data (which is also a nice scenario for the OBDA Plugin and DIG-Mastro), OntoClean, foundational ontologies and domain ontologies versus other artifacts with terms, linking of ontologies, and a clear table with task-description-requirements (table 1) that invariably asks for good ontologies.

Aside from the analysis of benefits and usages, the concluding remarks section notes that

[t]hus, the adoption of ontologies is hindered both by the familiarity of current practices and the lack of tools to readily migrate to improved practices.

Point taken.

And last, but not least,

Formal ontologies provide a mechanism to address the drawbacks of terminological ambiguity in ecology, and fill an important gap in the management of ecological data by facilitating powerful data discovery based on rigorously defined, scientifically meaningful terms. By clarifying the terms of scientific discourse, and annotating data and analyses with those terms, well defined, community-sanctioned, formal ontologies based on open standards will provide a much-needed foundation upon which to tackle crucial ecological research while taking full advantage of the growing repositories of data on the Internet.

[1] Joshua S. Madin, Shawn Bowers, Mark P. Schildhauer and Matthew B. Jones. Advancing ecological research with ontologies. Trends in Ecology & Evolution, 23(3): 159-168. doi:10.1016/j.tree.2007.11.007

About insurmountable simplicities

Some readers might think I’m heading towards a write-up about the seemingly insurmountable simplicities of the PhD programme, but I still think that doing a PhD amounts to coping with surmountable difficulties. The ‘insurmountable simplicities’ is part of the title of a popular philosophy book, which has been translated into eight languages and whose full title in English is “Insurmountable simplicities: thirty-nine philosophical conundrums”, by Achille Varzi and Roberto Casati. I just finished reading the 39 short stories and dialogues spread over 129 pages, and I can highly recommend it to anyone. It is written in a way that is easily accessible to the general public, yet the stories cover a wide range of philosophical puzzles that make you both laugh and, moreover, think. Else, if you have no life but occasionally have to socialize and do not know of anything else to talk about than to bore your conversation partner with your thesis topic/work, then any of the 39 stories will do to get a conversation going. I will summarize and comment on some of them below; “Zombie Inc.”, “Partial Amnesia”, “Person transplant” and “My ice cream, your ice cream” are available online for free as appetizers.

The dialogue “Person transplant” has a man walking into a transplant clinic asking for a new brain. As donor, he can make his brain available to anyone interested and it costs him $10k, but as receiver requesting a new brain he can get $10k from the clinic… or so begins the dialogue. Put differently: brain donor versus body receiver; i.e., is your brain you, with a disposable body, or is your body you, and your brain just like any organ that, at least in theory, could be transplanted like your heart, kidneys and so forth? And, by the way, is it really an either-or case? Staying for a bit with dialogues about medicine, there are some complications with the placebo effect, where a customer in a drug store asks for a placebo against his headache. After all, it has been shown that it works, so one might well ask for a little starch pill, which, of course, defeats the purpose. So, how to administer a placebo that is both effective and ethically correct (as the pharmacist cannot give a non-medicine knowing that something better is readily available)?

In “The traveler’s pictionary”, a word may be worth a thousand pictures. Instead of going on holiday with a dictionary, the travel agent offers the traveller to Siberia a pictionary, so that she can point to the pictures instead of messing with Russian vocabulary and grammar. The pictionary has only pictures of things that can be depicted, such as one for ‘buying’ (not uncontroversial) and one for ‘bicycle’, but can things like ‘wisdom’ or ‘inflation’ be drawn, or the negation of doing something? Moreover, and this is where the recurring character “the meddler” chimes in, “a picture is itself something that requires an interpretation. And if a picture requires an interpretation, bringing it to mind can hardly help” (with a nudge to Wittgenstein). A practical example that many a biologist/bioinformatician has come across is the derogatory term “[useless/informal/underspecified] cartoon” that computer scientists and software engineers regularly use for the very clear and explanatory colourful diagrams in biology textbooks; but then, they haven’t gotten the training in how to read such figures…

Prisoner K.J., the director of the penitentiary, the medical officer, and the Smiths are involved in a correspondence about the fact that, due to irreversible amnesia, K.J. can recall neither the crimes he is convicted of nor the date of imprisonment (“Partial amnesia”). Should he be informed about it? He found out and considers himself responsible for the act he cannot remember. But given that he cannot remember it, does that affect his personal identity and, if so, is he then really responsible for those crimes? The most interesting bit comes at the end though, with a note from the state legal office. The story does take for granted that one knows the main principle of putting people in jail as punishment for having committed a crime (deny the right of free movement, reflect on the crime, learn from it so that recidivism does not occur upon release). This obviously does not include revoking the right to vote, nor the effect that the George Jung character in the movie Blow described cynically about his first experience serving time in jail: that he went in with a bachelor’s in marihuana and got out with a PhD in cocaine. These are just a few of the ‘collateral effects’ of prison systems in several countries, but I’ll leave that for another post sometime because it has little to do with philosophy (or has it?).

Last, there is also a section with entertaining logic, such as “Interesting!”, although, of course, not everything can be interesting, for—in the case of the dialogue in the bookstore about intrinsically interesting books—“if all books are interesting, and if being interesting requires some original feature, then relative to the property of being interesting, all books would appear to be uninteresting. Which is to say: boring.” Casati and Varzi’s book is far from boring and contains many other stories covering, among others, causality, paradoxes of time and space, the notion of choice, and chance, which are narrated in settings ranging from birthdays for entering the museum for free, reducing majority voting to one person, playing lotto in reverse, to useless project proposals.

p.s.: Varzi’s publication page here has the links for all the languages into which ‘insurmountable simplicities’ has been translated.