Review of ‘The web was done by amateurs’ by Marco Aiello

Via one of those friend-of-a-friend likes on social media that popped up in my stream, I stumbled upon the recently published book “The web was done by amateurs” (there’s also a related talk) by Marco Aiello, which piqued my interest both concerning the title and the author. I’ve met Aiello once in Trento, when a colleague and he had a departing party, with Aiello leaving for Groningen. He probably doesn’t remember me, nor do I remember much of him—other than his lamentations about Italian academia and going for greener pastures. Turns out he’s done very well for himself academically, and the foray into writing for the general public has been, in my opinion, a fairly successful attempt with this book.

The short book—it easily can be read in a weekend—starts in the first part with historical notes on who did what for the Internet (the infrastructure) and the multiple predecessor proposals and applications of hyperlinking across documents that Tim Berners-Lee (TBL) apparently was blissfully unaware of. It’s surely a more interesting and useful read than the first Google hit, the few factoids from W3C, or Wikipedia one can find online with a simple search—or: it pays off to read books still in this day and age :). The second part is for most readers, perhaps, also still history: the ‘birth’ of the Web and the browser wars in the mid 1990s.

Part III is, in my opinion, the most fun to read: it discusses various extensions to the original design of TBL’s Web that fixes, or at least aims to fix, a shortcoming of the Web’s basics, i.e., they’re presented as “patches” to patch up a too basic—or: rank-amateur—design of the original Web. They are, among others, persistence with cookies to mimic statefulness for Web-based transactions (for, e.g., buying things on the web), trying to get some executable instructions with Java (ActiveX, Flash), and web services (from CORBA, service-oriented computing, to REST and the cloud and such). Interestingly, they all originate in the 1990s in the time of the browser wars.

There are more names in the distant and recent history of the Web that I knew of, so even I picked up a few things here or there. IIRC, they’re all men, though. Surely there would be at least one woman worthy of mention? I probably ought to know, but didn’t, so I searched the Web and easily stumbled upon the Internet Hall of Fame. That list includes Susan Estrada among the pioneers, who founded CERFnet that “grew the network from 25 sites to hundreds of sites.”, and, after that, Anriette Esterhuysen and Nancy Hafkin for the network in Africa, Qiheng Hu for doing this for China, and Ida Holz for the same in Latin America (in ‘global connections’). Web innovators specifically include Anne-Marie Eklund Löwinder for DNS security extensions (DNSSEC, noted on p143 but not by its inventor’s name) and Elizabeth Feinler for the “first query-based network host name and address (WHOIS) server” and “she and her group developed the top-level domain-naming scheme of .com, .edu, .gov, .mil, .org, and .net, which are still in use today”.

One patch to the Web that I really missed in the overview of the early patches, is the “Web 2.0”. I know that, technologically, it is a trivial extension to TBL’s original proposal: the move from static web pages in 1:n communication from content provider to many passive readers, to m:n communication with comment sections (fancy forms), or: instead of the surfer being just a recipient of information by reading one webpage after another and thinking her own thing of it, to be able to respond and interact, i.e., the chatrooms, the article and blog comment features, and, in the 2000s, the likes of MySpace and Facebook. It got so many more people involved in it all.

Continuing with the book’s content, cloud computing and the fog (section 7.9) are from this millennium, as is, what Aiello dubbed, the “Mother of All Patches.”: the Semantic Web. Regarding the latter, early on in the book (pp. vii-viii) there is already an off-hand comment that does not bode well: “Chap. 8 on the Semantic Web is slightly more technical than the rest and can be safely skipped.” (emphasis added). The way Chapter 8 is written, perhaps. Before discussing his main claim there, a few minor quibbles: it’s the Web Ontology Language OWL, not “Ontology Web Language” (p105), and there’s OWL 2 as successor of the OWL of 2004. “RDF is a nifty combination of being a simple modeling language while also functioning as an expressive ontological language” (p104), no: RDF is for representing data, not really for modeling, and most certainly would not be considered an ontology language (one can serialize an ontology in RDF/XML, but that’s different). Class satisfiability example: no, that’s not what it does, or: the simplification does not faithfully capture it; an example with a MammalFish that cannot have any instances (as subclass of both Mammal and Fish that are disjoint), would have been (regardless the real world).

The main claim of Aiello regarding the Semantic Web, however, is that it’s been that time to throw in the towel, because there hasn’t been widespread uptake of Semantic Web technologies on the Web even though it was proposed already around the turn of the millenium. I lean towards that as well and have reduced the time spent on it from my ontology engineering course over the years, but don’t want to throw out the baby with the bathwater just yet, for two reasons. First, scientific results tend to take a long time to trickle down. Second, I am not convinced that the ‘semantic’ part of the Web is the same level of end-user stuff as playing with HTML is. I still have an HTML book from 1997. It has instructions to “design your first page in 10 minutes!”. I cannot recall if it was indeed <10 minutes, but it sure was fast back in 1998-1999 when I made my first pages, as a non-IT interested layperson. I’m not sure if the whole semantics thing can be done even on the proverbial rainy Sunday afternoon, but the dumbed down version with schema.org sort of works. This schema.org brings me to p110 of Aiello’s book, which states that Google can make do with just statistics for optimal search results because of its sheer volume (so bye-bye Semantic Web). But it is not just stats-based: even Google is trying with schema.org and its “knowledge graph”; admitted, it’s extremely lightweight, but it’s more than stats-only. Perhaps the schema.org and knowledge graph sort of thing are to the Semantic Web what TBL’s proposal for the Web was to, say, the fancier HyperCard.

I don’t know if people within the Semantic Web research community would think of its tooling as technologies for the general public. I suspect not. I consider the development and use of ontologies in ontology-driven information systems as part of the ‘back office’ technologies, notwithstanding my occasional attempts to explain to friends and family what sort of things I’m working on.

What I did find curious, is that one of Aiello’s arguments for the Semantic Web’s failure was that “Using ontologies and defining what the meaning of a page is can be much more easily exploited by malicious users” (p110). It can be exploited, for sure, but statistics can go bad, very bad, too, especially on associations of search terms, the creepy amount of data collection on the Web, and bias built into the Machine Learning algorithms. Search engine optimization is just the polite terms for messing with ‘honest’ stats and algorithms. With the Semantic Web, it would a conscious decision to mess around and that’s easily traceable, but with all the stats-based approaches, it sneakishly can creep in whilst trying to keep up the veneer of impartiality, which is harder to detect. If it were a choice between two technology evils, I prefer the honest bastard cf. being stabbed in the back. (That the users of the current Web are opting for the latter does not make it the lesser of two evils.)

As to two possible new patches (not in the book and one can debate whether they are), time will tell whether a few recent calls for “decentralizing” the Web will take hold, or more fine-grained privacy that also entails more fine-grained recording of events (e.g., TBL’s solid project). The app-fication discussion (Section 10.1) was an interesting one—I hardly use mobile apps and so am not really into it—and the lock-in it entails is indeed a cause for concern for the Web and all it offers. Another section in Chapter 10 is IoT, which sounds promising and potentially scary (what would the data-hungry ML algorithms of the Web infer from my fridge contents, and from that, about me??)—for the past 10 years or so. Lastly, the final chapter has the tempting-to-read title “Should a new Web be designed?”, but the answer is not a clear yes or no. Evolve, it will.

Would I have read the book if I weren’t on sabbatical now? Probably still, on an otherwise ‘lost time’ intercontinental trip to a conference. So, overall, besides the occasional gap and one could quibble a bit here and there, the book is a nice read on the whole for any lay-person interested in learning something about the ubiquitous Web, any expert who’s using only a little corner of it, and certainly for the younger generation to get a feel for how the current Web came about and how technologies get shaped in praxis.

OBDA/I Example in the Digital Humanities: food in the Roman Empire

A new installment of the Ontology Engineering module is about to start for the computer science honours students who selected it, so, in preparation, I was looking around for new examples of what ontologies and Semantic Web technologies can do for you, and that are at least somewhat concrete. One of those examples has an accompanying paper that is about to be published (can it be more recent than that?), which is on the production and distribution of food in the Roman Empire [1]. Although perhaps not many people here in South Africa might care about what happened in the Mediterranean basin some 2000 years ago, it is a good showcase of what one perhaps also could do here with the historical and archeological information (e.g., an inter-university SA project on digital humanities started off a few months ago, and several academics and students at UCT contribute to the Bleek and Lloyd Archive of |xam (San) cultural heritage, among others). And the paper is (relatively) very readable also to the non-expert.

 

So, what is it about? Food was stored in pots (more precisely: an amphora) that had engravings on it with text about who, what, where etc. and a lot of that has been investigated, documented, and stored in multiple resources, such as in databases. None of the resources cover all data points, but to advance research and understanding about it and food trading systems in general, it has to be combined somehow and made easily accessible to the domain experts. That is, essentially it is an instance of a data access and integration problem.

There are a couple of principal approaches to address that, usually done by an Extract-Transform-Load of each separate resource into one database or digital library, and then putting a web-based front-end on top of it. There are many shortcomings to that solution, such as having to repeat the ETL procedure upon updates in the source database, a single control point, and the, typically only, canned (i.e., fixed) queries of the interface. A more recent approach, of which the technologies finally are maturing, is Ontology-Based Data Access (OBDA) and Ontology-Based Data Integration (OBDI). I say “finally” here, as I still very well can remember the predecessors we struggled with some 7-8 years ago [2,3] (informally here, here, and here), and “maturing”, as the software has become more stable, has more features, and some of the things we had to do manually back then have been automated now. The general idea of OBDA/I applied to the Roman Empire Food system is shown in the figure below.

OBDA in the EPnet system (Source: [1])

OBDA in the EPnet system (Source: [1])

There are the data sources, which are federated (one ‘middle layer’, though still at the implementation level). The federated interface has mapping assertions to elements in the ontology. The user then can use the terms of the ontology (classes and their relations and attributes) to query the data, without having to know about how the data is stored and without having to write page-long SQL queries. For instance, a query “retrieve inscriptions on amphorae found in the city of ‘Mainz” containing the text ‘PNN’” would use just the terms in the ontology, say, Inscription, Amphora, City, found in, and inscribed on, and any value constraint added (like the PNN), and the OBDA/I system takes care of the rest.

Interestingly, the authors of [1]—admitted, three of them are former colleagues from Bolzano—used the same approach to setting up the ontology component as we did for [3]. While we will use the Protégé Ontology Development Environment in the OE module, it is not the best modelling tool to overcome the knowledge acquisition bottleneck. The authors modelled together with the domain experts in the much more intuitive ORM language and tool NORMA, and first represented whatever needed to be represented. This included also reuse of relevant related ontologies and non-ontology material, and modularizing it for better knowledge management and thereby ameliorating cognitive overload. A subset of the resultant ontology was then translated into the Web Ontology Language OWL (more precisely: OWL 2 QL, a tractable profile of OWL 2 DL), which is actually used in the OBDA system. We did that manually back then; now this can be done automatically (yay!).

Skipping here over the OBDI part and considering it done, the main third step in setting up an OBDA system is to link the data to the elements in the ontology. This is done in the mapping layer. This is essentially of the form “TermInTheOntology <- SQLqueryOverTheSource”. Abstracting from the current syntax of the OBDA system and simplifying the query for readability (see the real one in the paper), an example would thus have the following make up to retrieve all Dressel 1 type of amphorae, named Dressel1Amphora in the ontology, in all the data sources of the system:

Dressel1Amphora <-
    SELECT ic.id
       FROM ic JOIN at ON at.carrier=ic.id
          WHERE at.type=’DR1’

Or some such SQL query (typically larger than this one). This takes up a bit of time to do, but has to be done only once, for these mappings are stored in a separate mapping file.

The domain expert, then, when wanting to know about the Dressel1 amphorae in the system, would have to ask only ‘retrieve all Dressel1 amphorae’, rather than creating the SQL query, and thus being oblivious about which tables and columns are involved in obtaining the answer and being oblivious about that some data entry person at some point had mysteriously decided not to use ‘Dressel1’ but his own abbreviation ‘DR1’.

The actual ‘retrieve all Dressel1 amphorae’ is then a SPARQL query over the ontology, e.g.,

SELECT ?x WHERE {?x rdf:Type :Dressel1Amphora.}

which is surely shorter and therefore easier to handle for the domain expert than the SQL one. The OBDA system (-ontop-) takes this query and reasons over the ontology to see if the query can be answered directly by it without consulting the data, or else can be rewritten given the other knowledge in the ontology (it can, see example 5 in the paper). The outcome of that process then consults the relevant mappings. From that, the whole SQL query is constructed, which is sent to the (federated) data source(s), which processes the query as any relational database management system does, and returns the data to the user interface.

 

It is, perhaps, still unpleasant that domain experts have to put up with another query language, SPARQL, as the paper notes as well. Some efforts have gone into sorting out that ‘last mile’, such as using a (controlled) natural language to pose the query or to reuse that original ORM diagram in some way, but more needs to be done. (We tried the latter in [3]; that proof-of-concept worked with a neutered version of ORM and we have screenshots and videos to prove it, but in working on extensions and improvements, a new student uploaded buggy code onto the production server, so that online source doesn’t work anymore (and we didn’t roll back and reinstalled an older version, with me having moved to South Africa and the original student-developer, Giorgio Stefanoni, away studying for his MSc).

 

Note to OE students: This is by no means all there is to OBDA/I, but hopefully it has given you a bit of an idea. Read at least sections 1-3 of paper [1], and if you want to do an OBDA mini-project, then read also the rest of the paper and then Chapter 8 of the OE lecture notes, which discusses in a bit more detail the motivations for OBDA and the theory behind it.

 

References

[1] Calvanese, D., Liuzzo, P., Mosca, A., Remesal, J, Rezk, M., Rull, G. Ontology-Based Data Integration in EPNet: Production and Distribution of Food During the Roman Empire. Engineering Applications of Artificial Intelligence, 2016. To appear.

[2] Keet, C.M., Alberts, R., Gerber, A., Chimamiwa, G. Enhancing web portals with Ontology-Based Data Access: the case study of South Africa’s Accessibility Portal for people with disabilities. Fifth International Workshop OWL: Experiences and Directions (OWLED 2008), 26-27 Oct. 2008, Karlsruhe, Germany.

[3] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC 2010), March 22-26 2010, Sierre, Switzerland. ACM Proceedings, pp1389-1396.

Linked data as the future of scientific publishing

The (not really a) ‘pipe dream’ paper about “Strategic reading, ontologies, and the future of scientific publishing” came out in August in Science [1], but does not seem to have picked up a lot of blog-attention other than copy-and-paste-the-abstract (except for Dempsey‘s take on it), which is curious. Renear and Palmer envision a bright new world with online scientific papers where the text has click-through terms to records in other databases and web pages to have your network of linked data; click on a protein name in the paper and automatically browse to Uniprot, sort of. And it is even the users who want that; i.e., it is not some covert push from Semantic Web techies. But then, in this case the ‘users’ are in informatics & information/library science (the enabling-users?), which does not imply that the endusers—say, biochemists, geneticists, etc.—want all that for better management of their literature (or they want that but do not yet realise that that is what they want).

But let us assume those endusers want the linked data for their literature (after all, it was a molecular ecologists who sent me the article—thanks Mark!). Or, to use some ‘fancier’ terms from the article: the zapping scientists want (need?) ontology-supported strategic reading to work efficiently and effectively with the large amounts of papers and supporting data being published. “Scientists are scanning more and reading less”, so then the linked data would (should?) help them in this superficial foraging of data, information, and knowledge to find the useful needle in the haystack—or so goes Renear and Palmer’s vision.

However, from a theoretical and engineering point of view, this can already be done. Not just that, it has been shown that some things work: there is iHOP and Textpresso, as the authors point out, but also GoPubMed, and SHER with PubMed, which begs the question: are those tools not good enough (and if something is missing, what?) or is it about convincing people and funding agencies? If the latter, then what does the paper do in Science in the “review” section??

If one reads further on in the paper, some peculiar remarks are being made, but not the one I would have expected. That the “natural language prose of scientific articles provides too much valuable nuance and context to be treated only as data” is a known problem that keeps many a (computational) linguist busy for his/her lifetime. But they go on saying also that “Traditional approaches to evaluating information systems, such as precision, recall, and satisfaction measures, offer limited guidance for further development of strategic reading technologies”, yet alternative evaluation methods are not presented. That “research on information behaviour and the use of ontologies is also needed” may be true form an outsider’s perspective: usages are known among ontologists but perhaps a review of all the ways of actual usage may be useful. Further down in the same section (p832), the authors claim, out of the blue and without any argumentation, that “the development of ontology languages with additional expressive power is needed”. What additional expressive power is needed for document and data navigation when they talk of the desire to exploit better the “terminological annotations”? The preceding text in the same paragraph only mentions XML, so they do not seem to have a clue about ontology languages, let alone their expressiveness, at all (OWL is mentioned only in passing on p830) and they manage to mention it in the same breath as so-called service-oriented architectures with a reference to another Science paper. Frankly, I think that papers like this are bound to cause more harm (or, at best, indifference) than good.

One thing I was wondering about, but that is not covered in the paper, is the following: who decides which term goes to which URI? There are different databases for, say, proteins, and the one who will be selected (by whom?) in the scientific publishing arena will become the de facto standard. A possibility to ameliorate this is to create a specific interface so that when a scientists clicks on a term, a drop-down box appears with something like “do you wish to retrieve more information from source x, y, or z?” Nevertheless, one easily ends up with a certain bias and powerful “gatekeepers”, and perhaps, with a similar attitude as toward DBLP/PubMed (“if it’s listed in there it counts, if it is not, then it doesn’t” regardless the content of the indexed papers, and favouring older established, and well-connected, outlets above newer and/or interdisciplinary ones).

Anyway, if Semantic Web researchers need some reference padding in the context of “it is not a push but a pull, hear hear, look here at the Science-reference, and I even read Science”, it’ll do the job, even though the contents of the paper is less than mediocre for the outlet.

References
[1] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784]

An analysis of culinary evolution

With summertime being what it is (called komkommertijd—literally: ‘cucumber time’—in Dutch), I stumbled again upon the paper The nonequilibrium nature of culinary evolution [1].

Food is essential, and due to location with its climate and available resources, as well as culture, each region has its own cuisine. There is much talk of homogenization of food dishes in popular press, or at least the threat thereof. One colleague here called “food from the North”, north of the Alps, that is, “barbarian”. But how much diversity in recipes across geographical locations is there? How, if at all, does it vary over time? What is the ingredient replacement pattern and do the replaced ingredients really disappear from the local menu?

Kinouchi and colleagues [1] tried to answer such questions through assessment of the statistics of the recipes’ ingredients, which were taken from 3 complete Brazilian cookbooks (Dona Benta 1946, 1969, and 2004), 40% of the large contemporary French cookbook Larousse (2004), the complete British Penguin Cookery Book (2001), and the Medieval Pleyn Delit.

For instance, the average recipe size of the Dona Benta (1946) as measured by ingredients, is lowest at 6.7, that of the Pleyn Delit an impressive 9.7, and of Larousse the highest with 10.8. However, one has to note that for the Pleyn, there are just 380 recipes with a mere 219 ingredients, whereas the numbers for Larousse are 1200 and 1005, and for the Dona they are 1786 and 491, respectively. When one makes a graph the frequency of appearances of ingredients in the recipes in the cookbooks, then all six cookbooks show very similar rank-frequency plots (power-law behaviour; see Fig. 1 in the paper); that is, for that dimension, there is a cultural invariance, as well as a temporal invariance for the Brazilian cookbook.

However, the more interesting results are obtained by the statistical and complex network analysis to obtain an idea about culinary evolution. The authors propose a copy-mutate algorithm to model cuisine growth, going from a small set of initial recipes to more diverse ones and using the idea of “cultural replicators” and branching. To make the line fit the data, they need 5 parameters: number of generations (T), number of ingredients per recipe (K), number of ingredients in each recipe to be mutated (L), the number of initial recipes (R0), and the ratio between the sizes of the pool of ingredients and the pool of recipes (M). Models without a fitness parameter did not work, so one is generated randomly and assumed to stand for the “intrinsic ingredient properties”, such as nutritional value and availability. At each generation, one “mother” recipe was randomly chosen, copied, and one or more of its ingredients replaced with other random ingredient (implementing the mutation rate L) to generate a “daughter” recipe. And so onward. Searching the parameter space, the authors do indeed find values close to the actual ones observed in the cook books.

Then, on the fitness of the recipes (replaced by hamburgers, pizza, etc.?), Kinouchi and colleagues use the fitness of the kth recipe, defined as F^{(k)} = \frac{1}{K} \sum_{i=1}^{K}f_i , and a corresponding total time dependent cuisine fitness, F_{total}(R(t)) = \frac{1}{R(t)} \sum_{k=1}^{R(t)}F^{(k)} . The results are depicted in Fig4 in the paper and, in short: “this kind of historical dynamics has a glassy character, where memory of the initial conditions is preserved, suggesting that the idiosyncratic nature of each cuisine will never disappear due to invasion by alien ingredients”. In addition, the copy-mutation model with the selection mechanism is scale-free, so that it is an out-of-equilibrium process, which practically means that “the invasion of new high fitness ingredients and the elimination of initial low fitness ingredients never end”, i.e., some ingredients are very difficult to being replaced, as if they were “frozen “cultural” accidents”. The latter has some similarity with the ‘founder-effect’ phenomenon in biology.

De aardappeleters (potato eaters) by Van Gogh

De aardappeleters (potato eaters) by Van Gogh

That much for the maths and experimental data of the paper. Before I turn to some research suggestions on this topic, I will first make an unscientific informal assessment. Van Gogh painted the painting de aardappeleters (‘the potato eaters’) in Nuenen—a village about 15km from where I grew up—back in 1885, to which Thieu Sijbers added a poem to describe such a poor man’s meal. I could not find the full original, but Van Oirschot ([2], p17) has the main parts of it, which I reproduce here first in the original old Brabants dialect and then a translation in English.

En hoekig nao ‘t bidde

‘t krous en dan wordt

aon de sobere maoltijd begonne

recht van ‘t vuur

op de bonkige toffel gezet

worre d’èrpel naw schieluk

mi rappe verkèt

van de hijt fèl nog dampend

de pan outgepikt

nao de monde gebrocht

en gulzig geslikt.

Ze ète, ze schranze

nao ‘n lutske de pan

toe ‘t zwart van de bojum

zo lig as ‘t mer kan



Mi’n mörke vol koffie

van waot’rige sort

zette d’èters nao d’èrpel

de maoltijd dan vort



Ze ète, jao net

mer dan is ‘t ok gezeed

want al wè ze pruuve

is èrremoei en leed.

My translation into English:

And edgy after praying

to the cross, and then

they start with the sober meal,

straight from the fire,

put on the chunky table,

now the potatoes are suddenly

cursed swiftly,

still steaming from the heat,

picked from the pan,

brought to the mouth,

and swallowed greedily.

They eat, they gorge,

and shortly after there is the black

of the bottom of the pan,

as empty as it can be.



With a mug full of coffee,

of the watery type,

do the eaters continue

with the meal after the potatoes.



They eat, yes just about,

but with that, all is said,

because all they taste

is poverty and distress.

The coffee is probably not real coffee but made from roasted sweet chestnuts [2]. The potatoes are an example of the “alien ingredients” mentioned in [1]: before potatoes were introduced in Europe (16th century), the Dutch recipes, at least, used tubers such as pastinaak (parsnip, which are white, and longer and thicker than carrot) in the place of potatoes; this is known primarily from the documentation about the Siege of Leiden in 1573-1574 during the 80-years war. Parsnip has not entirely vanished (parsnip beignets are really tasty), but now takes up a minimal place in ‘standard’ Dutch cuisine, so that it may be an example of one of those “frozen “cultural” accidents difficult to be overcome in the out-of equilibrium regime” ([1], p7). A standard Dutch dish is the aardappelen-groente-vlees combination, or: boiled potatoes, boiled vegetables, and a piece of meat baked in butter or fat, or the potatoes and vegetables are cooked together and mashed together into a hutspot (= potato+carrot+onion) or boerenkoolstamp (= potato+curly kale). Over the years, pasta entered the menu as well, and primarily a combination of Chinese and Indonesian, but to some extent also Surinamese, food has become regular dishes. Thus, pasta and rice took some space previously occupied by potatoes, but potatoes are in no way being marginalised. My guess is that that is because tubers and grains belong to different food groups and are therefore not easily swappable compared to tuber-tuber replacement, such as parsnip → potato [note 1], or grain-grain replacement, e.g., maize[flower] → wheat[flower]. Simply put: if you grow up on rice or pasta, then you do not easily switch to potatoes, or vice versa.

Perhaps Kinouchi’s copy-mutate algorithm can be rerun taking into account types of ingredients and then see what comes out of it; and use some variations like (1) swap within same food group, (2) different food group swap, pick random ingredient; and (3) keep the swap within subgroups, such as tuber-carbohydrate-source-staplefood#1 → tuber-carbohydrate-source-staplefood#2 (vs. the more generic ‘carbohydrate source’) and herb#3 → herb#4 (vs. the subsumer ‘condiment’).

Further, in addition to ingredient substitution-by-import, one also observes recipe import, which faces the task of having to make do with the local ingredients. Chinese excel in this skill: dishes in Chinese restaurants taste different in each country but roughly similar—in Italy, they even split up the meals into primo and secondo piatti. But when substituting original ingredients with the local ingredients that are only approximations of the original ones, how much remains of the recipe so that one still can talk of instantiations of the dishes described by the original recipe and when is it really a new one? What effect do those imported recipes have on local cuisine? Is there experimental data to say that, statistically, one recipe is better “export material” than others are? Are people [from/who visited] some geographic region better at transporting the local recipes and/or their ingredients elsewhere?

It remains to test whether those mutated recipes are still edible. Forced by ‘necessity’, I did ingredient substitution due to recipe import several times (the Italian shops do not have baked beans, no brown sugar, no condensed milk, no ontbijtkoek, no real butter, no pecan nuts, only a few apple varieties, etc…), and for some recipes the substitute was at least as good as the original, but then the substitute approximated the original. I certainly have not dared mashing together cooked pasta+carrot+onion to make an “Italian-style hutspot”, let alone random ingredient substitutions. In case someone has done the latter and it is not only edible but also recommendable, feel free to drop me a line or add the recipe in the comments.

References and notes

1. Kinouchi, O., Diez-Garcia, R.W., Holanda, A.J., Zambianchi, P., and Roque, A.C. The nonequilibrium nature of culinary evolution. ArXiv 0802.4393v1, 29 Feb 2008. Also published in New J. Phys. 10, 073020 (8pp) doi: 10.1088/1367-2630/10/7/073020

2. van Oirschot, A. (ed.). Van water tot wijn, van korsten to pastijen. Stichting Brabanste Dag, 1979. 124p.

[note 1] There is a difference between root-tubers (such as parsnip) and stem-tubers (such as potato), but functionally they are quite alike, so that for the remainder of the post I will gloss over this minor point. Some basic information can be glanced from the Wikipedia entry tuber, more if you search on Pastinaca sativa (parsnip) and Solanum tuberosum (potato) who are not member of the same family, and you may be interested to check other common vegetables and their names in different languages to explore this further.

Computing and the philosophy of technology

Recently I wrote about the philosophy of computer science (PCS), where some people, online and offline, commented that one cannot have computer science but have the science of computing instead and that it must have as part experimental validation of the theory. Regarding the latter, related and often-raised remarks are that it is computer engineering instead, which, in turn, it closely tied to information technology. In addition, many a BSc programme trains students so as to equip them with the skills to obtain a job in IT. Coincidentally, there is a new entry in the Stanford Encyclopedia of Philosophy on the Philosophy of Technology (PT) [1], which, going by the title, may well be more encompassing than just PCS and perhaps just as applicable. In my opinion, not quite. There as several relevant aspects of the PT for computing, but if one takes the strand of engineering and IT of computing, then there are some gaps in the overview presented by Franssen, Lokhorst and Van de Poel. The the main gap I will address is that it misses the centrality of ‘problem’ in engineering and that it conflates it with notions such as requirements and design.

But to comment on it, let me first start with mentioning few salient points of their sections “historical developments” and “analytic philosophy of technology” to illustrate notions of technology and which strand of PT the SEP entry is about (I will leave the third part, “ethical and social aspects of technology”, for another time).

A few notes on its historical developments
Plato put forward the thesis that “technology learns from or imitates nature”, whereas Aristotle went a step further: “generally, art in some cases completes what nature cannot bring to a finish, and in others imitates nature”. A different theme concerning technology is that “there is a fundamental ontological distinction between natural things and artifacts” (not everybody agrees with that thesis), and if so, what, then, those distinctions are. Aristotle, again, also has a doctrine of the four causes with respect to technology: material, formal, efficient, and final.
As is well-know, there was a lull in the intellectual and technological progress in Europe during the Dark Ages, but the Renaissance and the industrial revolution made thinking about technology all the more interesting again, with notable contributions by Francis Bacon, Karl Marx, and Samuel Butler, among others. Curiously, “During the last quarter of the nineteenth century and most of the twentieth century a critical attitude [toward technology] predominated in philosophy. The representatives were, overwhelmingly, schooled in the humanities or the social sciences and had virtually no first-hand knowledge of engineering practice” (emphasis added). This type of PT is referred to as humanities PT, which looks into socio-politico-cultural aspects of technology. Not everyone was, or is, happy with that approach. Since the 1960s, an alternative approach has been gestating and is becoming more important, which can be dubbed the analytic PT that is concerned with technology itself, where technology is considered the practice of engineering. The SEP entry about PT is about this type of PT.

Main topics in Analytic PT
The topics that the authors of the PT overview deem the main ones have to do with the discussion on the relation between science and technology (and thus also on the relation of philosophy of science and of technology), the importance of design for technology, design as decision-making, and the status and characteristics of artifacts. Noteworthy is that their section 2.6 on other topics mentions that “there is at least one additional technology-related topic that ought to be mentioned… namely, Artificial Intelligence and related areas” (emphasis added) and the reader is referred to several other SEP entries. Notwithstanding the latter, I will comment on the content from a CS perspective anyway.

Sections 2.1 and 2.2 contain a useful discourse on the differences between science and engineering and on their respective philosophy-of. It was Ellul in 1964 who observed in technology “the emergent single dominant way of answering all questions concerning human action, comparable to science as the single dominant way of answering all questions concerning human knowledge” (emphasis added). Skolimowski phrased it as “science concerns itself with what is, whereas technology with what is to be.” Simon used the verb tenses are and ought to be, respectively. To hit the message home, they also say that science aims to understand the world but that technology aims to change the world, and is seen as a service to the public. (I leave it to the reader to assess what, then, the great service to the public is of artifacts such as the nuclear bomb, flight simulator games for kids, CCTV, and the likes.) In contrast, while Bunge did acknowledge technology was about action, he stressed that it is one heavily underpinned by theory, thereby distinguishing it from the arts and crafts. Practically regarding software engineering, then to say that development of ontologies or the conceptual modelling during the conceptual analysis stage of software development are, largely or wholly, an art, is to say that there is no underlying theoretical foundation for those activities. This is not true. Indeed, there are people who, without being aware of its theoretical foundations, go off and develop an ontology or ontology-like artifact nevertheless and call that activity an art, but that does not make the activity an art per sé.
Closing the brief summary of the first part of the PT entry, Bunge mentioned also that theories in technology come into two types: substantive (on knowledge about the objects of action) and operative (on the action itself). A further distinction has been put forward by Jarvie, which concerns the difference between knowing that and knowing how, the latter which Polanyi made a central aspect of technology.

Designs, requirements, and the lack of problems
The principal points of criticism I have are on the content of sections 2.3 and 2.4, which revolve around the messiness in the descriptions of design, goals, requirements. Franssen et al first state that “technology is a practice focused on the creation of artifacts and, of increasing importance, artifact-based services. The design process, the structured process leading toward that goal, forms the core of the practice of technology” and that the first step in that process is to translate the needs or wishes of the customer into a list of functional requirements. But ought the first step not to be the identification of the problem? For if we then ultimately solve the problem (giong through writing down the requirements, design the artifact, and build and tests it), it is a ‘service to the public’. Moreover, if we have identified the problem, formulated it in a problem description, written a specification of the goal to achieve, only then we can write down the list of requirements that are relevant to achieving the goal and that solves the problem, which will not only avoid a ‘solution’ exhibiting the same problem but also meet the goal and be this service to the public. But problem identification and problem solving is not a central issue to technology, according to the PT entry; just designing artifacts based on functional requirements.
Looking back at the science and engineering studies I enjoyed, the former had put an emphasis on formulating and testing the hypotheses—no thesis could do without that—whereas the latter focusses on which problem(s) your research tackles and solves, where theses and research papers have to have a (list of) problem(s) to solve. If a CS paper does not have a description of the problem, one might want to look again what they are really doing, if anything; if it is a mere ‘we did develop a cute software tool’, it is straightforward to criticise and easily can be rejected because it is insufficient regarding the problem specification and the system requirements (the reader may not share the unwritten background setting with the authors, hence think of other problems and requirements that he would expect the tool to solve), and the ‘solution’ to that (the tool) is then no contribution to science/engineering, at least not in the way it is presented. Not even mentioning the problem-identification-and-solving and its distinction with science in the approach of doing research is quite a shortcoming of the PT entry.

In addition, the authors confuse such different topics even more clearly by stating that “The functional requirements that define most design problems” (emphasis added). Design problems? So there are different types of problem, and if so, which? How can requirements define a problem? They don’t. The requirements entail what the artifact is supposed to do and what constraints it has to meet. However, if it defines the problem instead, as the authors say, then they are not parameters for the solution to that problem, even though the to-be-designed artifact that meets the requirements is supposed to solve a (perceived) problem. Further, they say that engineers claim that their designs are optimal solutions, but can one really say that a design is the solution, rather than the outcome of the design, i.e., the realisation of the working artifact?

If it is the case, as the authors summarise, that technology aims to change the world, then one needs to know what one is changing, why one is changing that, and how. If something is already ‘perfect’, then there is no need to change it. So, provided it is not ‘perfect’ according to at least one person, then there is something to change to make it (closer to) ‘perfect’; hence, there is a problem with that thing (regardless if this thing is an existing artifact or annoying practice that perhaps could be solved by using technology). Thus, one—well, at least, I—would have expected that ‘problem’ would be a point of inquiry, e.g.: what is it, what does it depend upon, if there is a difference between problem and nuisance, how it relates to requirements and design and human action, the processes of problem identification and problem specification, the drive for problem-solving in engineering and the brownie points one scores with it upon having a solution, and even the not unheard of, but lamentable, practice of ‘reverse engineering the problem’ (i.e.: the ‘we have a new tool, now we think of what it [may/can/does] solve and that can be written in such a way as if it [is/sounds like] a real problem’).

In closing, the original PT entry was published in February this year, and quickly has undergone “substantive revision” meriting a revised release on 22 June ’09, so maybe the area is active and in flux so that a next version could include some considerations about problems.

References
[1] Franssen, Maarten, Lokhorst, Gert-Jan, van de Poel, Ibo, “Philosophy of Technology”, The Stanford Encyclopedia of Philosophy  (Fall 2009 Edition), Edward N. Zalta (ed.), forthcoming URL =http://plato.stanford.edu/archives/fall2009/entries/technology/

Metagenomics updated and slightly upgraded

The Nature TOC-email arrived yesterday, and they have a whole “insight” section on microbial oceanography! Four years ago, Nature Reviews Microbiology had a special issue with a few papers about it, two years ago PLoS Biology presented their Oceanic Metagenomics Collection, and now then the Nature supplement. Why would a computer scientist like me care? Well, my first study was in microbiology, and they have scaled up things a lot in the meantime, thereby making computers indispensible in their research. For those unfamiliar with the topic, you can get an idea about early computational issues in my previous post and comments by visitors, but there are new ones that I’ll mention below.

Although the webpage of the supplement says that the editorial is “free access”, it is not (as of 14-5 about 6pm CET and today 9am). None of the 5 papers—1 commentary and 4 review papers—indicate anything about the computational challenges: “the life of diatoms in the world’s oceans”, “microbial community structure and its functional implications”, “the microbial ocean from genomes to biomes”, and “viruses manipulate the marine environment”. Given that DeLong’s paper of 4 years ago [1] interested me, I chose his review paper in this collection [2] to see what advances have been made in the meantime (and that article is freely available).

One of the issues mentioned in 2007 was the sequencing and cleaning up noisy data in the database, which now seems to be much less of a problem, even largely solved, so the goal posts start moving to issues with the real data analysis. With my computing-glasses on, Box 2 mentions (my emphases underlined and summarised afterwards):

Statistical approaches for the comparison of metagenomic data sets have only recently been applied, so their development is at an early stage. The size of the data sets, their heterogeneity and a lack of standardization for both metadata and gene descriptive data continue to present significant challenges for comparative analyses … It will be interesting to learn the sensitivity limits of such approaches, along more fine-scale taxonomic, spatial and temporal microbial community gradients, for example in the differences between the microbiomes of human individuals44. As the availability of data sets and comparable metadata fields continues to improve, quantitative statistical metagenomic comparisons are likely to increase in their utility and resolving power. (p202)

Let me summarise that: DeLong asserts they need (i) metadata annotations as a prerequisite for statistical approaches; (ii) deal with temporal data, and (iii) deal with spatial data. There is a lot of research and prototyping going on in topics ii and iii, and there are few commercial industry-grade plugins, such as the Oracle Cartridge, that do something with the spatial data representation. Perhaps that is not enough or it is not what the users are looking for; if this is the case, maybe they can be a bit more precise on what they want?

Point i is quite interesting, because it basically reiterates that ontologies are a means to an end and asserts that statistics cannot do it with number crunching alone but needs structured qualitative information to obtain better results. The latter is quite a challenge—probably technically doable, but there are few people who are well versed in the combination of qualitative and quantitative analysis. Curiously, only MIAME and the MGED website are mentioned for metadata annotation, even though they are limited in scope with respect to the subject domain ontologies and ontology-like artefacts (e.g., the GO, BioPax, KEGG), which are also used for annotation but not mentioned at all. The former deal with sequencing annotation following the methodological aspects of the investigation, whereas the latter type of annotations can be done with domain ontologies, i.e. annotating data with what kind of things you have found (which genes and their function, which metabolism, what role does the organism have in the community etc.) that are also need to carry out the desired comparative analyses.

There is also more generic hand-waiving that something new is needed for data analysis:

The associated bioinformatic analyses are useful for generating new hypotheses, but other methods are required to test and verify in silico hypotheses and conclusions in the real world. It is a long way from simply describing the naturally occurring microbial ‘parts list’ to understanding the functional properties, multi-scalar responses and interdependencies that connect microbial and abiotic ecosystem processes. New methods will be required to expand our understanding of how the microbial parts list ties in with microbial ecosystem dynamics. (p203)

Point taken. And if that was not enough,

Molecular data sets are often gathered in massively parallel ways, but acquiring equivalently dense physiological and biogeochemical process data54 is not currently as feasible. This ‘impedance mismatch’ (the inability of one system to accommodate input from another system’s output) is one of the larger hurdles that must be overcome in the quest for more realistic integrative analyses that interrelate data sets spanning from genomes to biomes.

I fancy the thought that my granularity might be able to contribute to the solution, but that is not yet anywhere close to user-level software application stage.

At the end of the paper, I am still—as in 2005 and 2007—left with the impression that more data is being generated in the whole metagenomics endeavour than that there are computational tools to analyse them to squeeze out all the information that is ‘locked up’ in the pile of data.

References

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.

[2] DeLong, E.F. The microbial ocean from genomes to biomes. Nature, 2009, 459: 200-206.

Brief review of the Handbook of Knowledge Representation

The new Handbook of Knowledge Representation edited by Frank van Harmelen, Vladimir Lifschitz and Bruce Porter [1] is an important addition to the body of reference and survey literature. The 25 chapters cover the main areas in Knowledge Representation (KR), ranging from basic KR, such as SAT solvers, Description Logics, Constraint Programming, and Belief Revision, to specific core domains of knowledge, such as Spatial and Temporal KR & R, and Nonmonotonic Reasoning, to shorter ‘application’ chapters that touch upon the Semantic Web, Question Answering, Cognitive Robotics, and Automated Planning.

Each chapter roughly follows the approach of charting the motivation and problems the research area attempts to solve, the major developments in the area over the past 25 years, important achievements in the research, and where there is still work to do. In a way, each chapter is a structured ‘annotated bibliography’—many chapters have about 150-250 references each—that serve as an introduction and a high-level overview. This is useful, for instance, if your specific interests are not covered in a university course but have a thesis student and you would want him to work on that topic, then the appropriate chapter will be informative for the student not only to get an idea about it but also to have an entry point as to which further principal background literature to read; or you are a researcher writing a paper and do not want to put a Wikipedia URL in the references (yes, I’ve seen papers where authors had done that) but a proper reference; or you are, say, well-versed in DL-based reasoners, but come across a paper where one based on constraint programming is proposed and you want to have a quick reference to check what CP is about without ploughing through the handbook on constraint programming. Comparatively with the other topics, anyone interested in ‘something about time’ will be satisfied with the four chapters on temporal KR & R, situation calculus, event calculus, and temporal action logics. Clearly, the chapters in the handbook on KR are not substitutes for the corresponding “handbook on [topic-x]” books, but they do provide a good introduction and overview.

Some chapters are denser in providing a detailed overview than others (e.g., qualitative spatial reasoning vs. CP, respectively), however, and yet other chapters provide a predominantly text-based overview whereas others do include formalisms with precise definitions, other axioms, and theorems (Qualitative Modelling, Physical Reasoning, and Knowledge Engineering vs. most others, respectively). That most chapters do include some logic comes as no surprise for the KR researcher but may be for the novice or a searching ontology engineer. For the latter group, and logic-sceptics in general, there is a juicy section in chapter 1, “General Methods in Knowledge Representation and Reasoning”, called “Suitability of Logic for Knowledge Representation” that takes on the principal anti-logicist arguments and the about 6-page long rebuttal of each complaint. Another section that can be good for heated debates is Guus Schreiber’s (too) brief comment on the difference between “Ontologies and Data Models” (chapter 25), which easily can fill a few pages instead of the now less than half a page used for arguing there is a distinction between the two.

Although I warmly recommend the handbook as addition to the library, there are also a few shortcomings. One may have to do with the space limitations (even though the book is already over 1000 pages), whereas the other one might be due to the characteristics of research in KR & R itself (to some extent at least). They overlap with the kind of shortcomings Erik Sandewall has mentioned in his review of the handbook. Several topics that are grouped under KR are not, or very minimally, dealt with in the book (e.g., uncertainty and ontologies, respectively) or in a fragmented, isolated, way across chapters what perhaps should have been consolidated into a separate chapter (i.e., abstraction, but also ontologies). In addition, within the chapters, it may well occur that some subtopics are perceived to be missing from the overview or mentioned too briefly in passing (e.g., mereology and DL-Lite for scalable reasoning), but this also depends on one’s background. On the other hand, the chapters on Qualitative Modelling and Physical Reasoning could have been merged into one chapter.

The other point concerns the lack of elaboration on real life success stories as significant contribution of that topic that a KR novice or a specialised researcher venturing in another sub-topic may be looking for. However, the handbook charts the research progress in the respective fields, not the knowledge transfer from KR research output to the engineering areas where the theory is put to the test and implementations are tried out. It is a philosophical debate if doing science in KR should include testing one’s theories. To give an idea about this discrepancy, part III of the handbook is called “Knowledge Representation in Applications” (emphasis added), which contains a chapter, among five others, on “The Semantic Web: Webizing Knowledge Representation”. From a user perspective, including software engineers and the most active domain expert adopters (in biology and medicine), the Semantic Web is still largely a vision, but not yet a success story of applications—people experiment with implementations, but the fact that there are people willing to give it a try does not imply it is a success from their point of view. Put differently, it says more about the point of view of KR&R that it is already categorised under applications. True, as the editors note, one needs to build upon advances achieved in the base areas surveyed in parts I and II, but is it really ‘merely’ ‘applying’, or does the required linking of the different KR topics in these application areas bring about new questions and perhaps even solutions to the base KR topics? The six chapters in part III differ in the answer to this question—as in any healthy research field: there are many accomplishments, but much remains to be done.

[1] Frank van Harmelen, Vladimir Lifschitz and Bruce Porter (Eds.). Handbook of Knowledge Representation. Elsevier, 2008, 1034p. ISBN-13: 978-0-444-52211-5 / ISBN-10: 0-444-52211-5.

Editorial freedoms?

Or: on editors changing your article without notifying (neither before nor after publication).

The book The changing dynamic of Cuban civil society came out last year right before I went to Cuba, so I had read it as a preparation, and, it being a new book, I thought I might as well write a review for it and see if I could get it published. The journal Latin American Politics and Society (LAPS) was interested, and so it came to be that I submitted the book review last July and got that review accepted. Two days ago I was informed by Wiley-Blackwell, the publisher of LAPS, that I could download the offprint of the already published review: it had appeared in LAPS 50(4):189-192 back in the winter 2008 issue.

The published review is for “subscribers only” (but I’m allowed to email it to you) and to my surprise and disbelief it was not quite the same as the one I had sent off to the LAPS book review editor Alfred Montero. They had made a few changes to style and grammar, which, given that English is not my mother tongue, was probably warranted (although it would have been appropriate if I were informed about that beforehand). There are, however, also three significant changes to the content. More precisely: two deletions and one addition.

The first one is at the beginning, where an introduction is given on what constitutes ‘civil society’. Like in the book, some examples are given, as well as the notion of a ‘categorisation’ of organisations. The original text (see pdf) is as follows:

According to this description, hobbies such as bird watching and playing rugby is not part of civil society, but the La Molina against the US base in Vicenza and Greenpeace activism are. In addition, one may want to make a categorization between the different types of collectives that are part of a civil society: people have different drives or ideologies for improving or preventing deterioration of their neighborhood compared to saving the planet by attempting to diminish the causes of climate change.

This has been trimmed down to:

According to this description, hobbies such as bird watching and playing rugby are not part of civil society, but Greenpeace activism is. In addition, one may want to make a categorization between the different types of collectives that are part of a civil society: people have different drives or ideologies for improving or preventing deterioration of their neighborhood, compared to saving the planet by attempting to diminish the causes of climate change.

A careful reader may notice that there is a gap in the logic of the examples: the No Dal Molin activism against the US base is an example of NIMBY-activism (Not In My BackYard), referred to in the second sentence but the example in the first sentence is missing. There being no example of this type in the book, I felt the need to give one anyway. Perhaps if I would have used the for the US irrelevant NIMBY-activism against the TAV (high speed train) it would have remained in the final text. The activism of Molin, however, is a much more illustrative example of the interactions between a local grass-roots civil society organisation and both national and international politics, and how the so-called ‘spheres of influence’ of the actors have taken shape.

The addition is a verb, “to act”. The original:

Christine Ayorinde discusses both the historical reluctance of the, until 1992, atheist state against religious groups—used as counterrevolutionary tool primarily by the U.S. in the early years after the Revolution—and the loosening by the, now constitutionally secular, state …

The new sentence:

Christine Ayorinde discusses both the historical reluctance of the atheist state (until 1992) to act against religious groups—used as counterrevolutionary tool primarily by the United States in the early years after the revolution—and the loosening by the now-constitutionally secular state, …

But it is not the case that the state was reluctant to act against religious groups; they were reluctant and hampering involvement of foreign religious groups because it was used by primarily the US as a way to foment dissent against the Revolution.

The second deletion actually breaks a claim I make about the chapters in the edited volume and weakens an important observation on the operations of civil society organisations in Cuba, and of foreign NGOs in particular.

The original:

A personal experience perspective is given by Nino Pagliccia from the Canadian Volunteer Work Brigade (Chapter 5). This is a fascinating chapter when taken in conjunction with Alexander Gray’s chapter that analyses personal perspectives and changes in procedures from the field from a range of civil society actors (Chapter 7). … Pagliccia’s, as well as the representative of Havana Ecopolis project’s—Legambiente-funded, which is at the green-left spectrum of the Italian political arena—documented experiences of cooperation in Cuba have the component of shared ideology, whereas other representatives, such as from Save the Children UK, talk about shared objectives instead even when their Cuban collaborators assume shared ideology. Notably, the latter group of foreign NGOs report more difficulties about their experiences in Cuba.

How it appears in the published version:

A personal perspective is given by Nino Pagliccia, from the Canadian Volunteer Work Brigade (chapter 5). This is a fascinating chapter when considered together with Gray’s chapter 7, which analyzes personal perspectives and changes in procedures from a range of civil society actors. … Pagliccia’s documented experiences of cooperation in Cuba have the component of shared ideology, whereas other representatives, such as those from Save the Children UK, talk about shared objectives instead, even when their Cuban collaborators assume the former. Notably, the latter group of foreign NGOs report more difficulties in their experiences in Cuba.

But the reference to Havana Ecopolis comes from Chapter 7. In fact, of the interviewees, he was the only one really positive about the experiences in the successful foreign-initiated project/NGO, which made me think back to Pagliccia’s Workers Brigade and solidarity vs. charity. I wondered where the funding of Havana Ecopolis came from, Googled a bit, and arrived at the Legambiente website (project flyer). Needless to say, also openly leftist organisations had positive experiences on collaboration; but in analyzing effectiveness of foreign NGO involvement, unveiling the politically-veiled topical NGOs is a distinguishing parameter. Moreover, it is an, informally, well-known one with respect to Cuba’s reluctance of letting foreign NGOs into the country. Thus, it explains why the Havana Ecopolis experience stood out compared to the other documented NGO experience in Cuba. But now, in the revised text, the “Notably, the latter group … more difficulties …” sounds a bit odd and not backed up at all. They even toned down Pagliccia’s contribution from “A personal experience perspective” to “A personal perspective”: there surely is a difference between being informed by having spent some time in Cuba working side-by-side with the Cubans and just having a perspective on Cuba without having a clue what the country is like; now it reads like ‘yeah, whatever—opinions are like assholes: everybody’s got one…’. Note that when one reads the book, one sensibly can make the link between the data and analysis presented on solidarity vs. charity vs. cooperation and the shared-ideology vs. shared-objectives NGOs (ch5 & 7). Rests to make a categorisation of foreign NGOs and conduct a quantitative analysis to back up the obvious qualitative one.

I hope that this case is an exception, but maybe it is the modus operandi in the humanities that things get edited out. It certainly is not in computer science, where only the authors can fiddle a bit with the text when a paper is accepted, and even less so in the life sciences where, upon paper acceptance, thou shalt not change a single word.

UPDATE (22-3-2009): the current status of the contact I had with the LAPS editorial office is that the book review editor, Alfred Montero, did not change anything, but that that happened during copyediting by the managing editor, Eleanor Lahn. She has provided me with an explanation why the changes were done, which has a curious argumentation to which I have replied. This reply also contains a request for clarity and consistency in the procedure (now the book editor assumes the copyeditor contacts the author, whereas the copyeditor normally does not do so), but I have not yet received a response on that email.

What philosophers say about computer science

Yes, there is a Philosophy of Computer Science, but it does not appear to be just like the average philosophy of science or philosophy of engineering kind of stuff. Philosophers of computer science are still figuring out what exactly is computer science—such as “the meta-activity that is associated with programming”—but then finally settle for a weak “in the end, computer science is what computer scientists do”. Currently, the philosophy of computer science (PCS) is mostly an amalgamation of analyses of artefacts that are the products of doing computer science and many, many, research questions.


The comprehensive introductory overview of PCS in the Stanford Encyclopedia of Philosophy by Turner and Eden [1] starts with “some central issues” that are phrased into 27 principal questions that are in need of an answer. To name just a few: What kind of things are computer programs? What kind of things are digital objects, and do we need a new ontological category for them? What does it mean for a program to be correct? What is information? Why are there so many programming languages and programming paradigms? The remainder of the SEP entry on PCS is devoted to giving the setting of the discourses in some detail.


The contents in the entry focuses primarily on the dual nature of programs, programming languages, semantics, logic, proofs, and closes with legal and ethical issues in PCS; a selection that might be open to criticism by people who feel their pet topic ought to have been covered as well (but I leave it to them to stand up). The final words, however, are devoted to the nagging thought if PCS merely gives a twist to other fields of philosophy or if there are indeed new issues to resolve.


In any case, it is already noteworthy to read that Turner and Eden do not consider the act of programming itself as being part of computer science; and rightly so. That could be useful to take into account by people who write job openings offering new postdoc/assistant professor positions, many of which require “strong programming skills”. By the philosophers’ take on CS, they should have asked for good programmers but not researchers, because CS researchers are specialised in the meta-activities such as “design, development and investigation of the concepts and methodologies that facilitate and aid the specification, development, implementation and analysis of computational systems” [1] (emphasis added), i.e., not the actual development and implementation of software systems. On the other hand, the weird thing with the PCS entry is their CS-is-what-computer-scientists-do: if computer scientists decide to slack collectively, then PCS itself would degrade to investigate the non-output of CS, i.e., be obsolete, or, if research funding agencies were to pay mainly or only for ‘integrated projects’ that apply theory to develop working information systems for industry (or biologists, healthcare professionals, etc.), then that is what CS degenerates into? If that were the case, then CS would not merit to be called a science.


In this narrative, now would be a good place for a treatise on definitions of computer science to demonstrate the discipline is sane and structured, to simplify communication with non-informaticians, and feed philosophers with the settled basics to dwell upon. But then I would have to go into finer distinctions, made in e.g. the Netherlands, between informatika and informatiekunde (roughly, computer science vs. information engineering)—the former being more theoretical along the lines of the topics in the PCS entry and the latter includes more socio- psycho- cognitive- managerial topics—and extant definitions by the various organisations, dictionaries and so forth (e.g., here, here, here). I will do a bit more homework on this first, before writing anything about that (and before that, I will go to Geomatica’09 in Cuba and glue a few days holiday to the conference visit; so, within a few days, it will be quiet here until the end of February).


[1] Turner, R. and Eden, A. The Philosophy of Computer Science. In: Stanford Encyclopedia of Philosophy, Zalta, E. (Ed.), published 12 Dec. 2008.

Conflict data analysis issues and the dirty Dirty War Index

In the previous post on building bias into your database, I outlined seven modelling tricks to build your preference into the information system. Here, I will look at some of those databases and a tool/calculation built on top of such conflict databases (the ‘dirty war index’).

Conflict databases

The US terrorism incident database at MIPT suffers from most of the afore-mentioned pitfalls, which drove a recently graduated friend, Dr. Fraser Gray, to desperation asking me if I could analyse the numbers (but, alas, the database has some inconsistencies). I have more, official, detail about the design rationale and limitations of the civil war incident databases developed by Weinstein and by Sutton. In his fascinating book Inside Rebellion [1], Weinstein has described his tailor-made incident database in Appendix B; so, although I’m going to comment on the database, I still highly recommend you read the book.

Weinstein applies organsational theory to rebel organisations in civil war settings and tests his hypotheses experimentally against case studies of Uganda, Mozambique, and Peru. As such, his self-made database was made with the following assumption in mind: “civilians are often the primary and deliberate target of combatants in civil wars… Accordingly, an appropriate indicator of the “incidence” of civil war is the use of violence against noncombatant populations.” Translated to the database focus, it is a people-centred database, not, say, target-centred. Not only deaths are counted, but also a range of violations, including mutilation, abduction, detention, looting, and rape, and victim charactersitics with name, age, sex, affilitation and affiliation groups, such as religious leaders, students, occupation of civilian, and traditional authorities (according to Appendix B).

Geography is coded only at a high level—at least, the information provided in chapter 6 that deals with the quantitative data discusses only (aggregated?) rough regions, such as Mozambique’s “north”, “centre” and “south”, but for Sendero Luminoso-Huallaga no sub-regions at all. To its merit, it has a year-by-year breakdown of the incidents, although one has no access to which type of incidents exactlyeven though they are supposed to be in the database. It does not discuss quantitatively the types of arms and the targets; it certainly makes a difference to understand the dynamics of the conflict if, say, targets like water purification plants are blown up or military bases attacked and if sophisticated ‘non-conventional arms’ are used or machetes. If we want to know that, it seems we have to redo the data collection process. No statistical analysis is performed, so that for, e.g., the size of the victim groups we get indications of ‘relatively more’, and barely even percentages or ratios to make cross-comparisons across years or across conflicts but which could have been done based on the stacked-bar charts of the (yet again aggregated) data. The huge amount of incidents marked as “unclear” for Peru only has guessed explanations, due to data collection issues (e.g., for 1987 some 500 “unclear” versus about 40 attributed to Sendero Luminoso-Nacional and 30 government)—try feeding such data into the DWI (see below). The definitions of “civilian” and “non-combatant” are not clear, not even sort of inferable as with Sutton’s database (see below).

Overall, it merely gives a rough idea of some aspects of the examined conflicts, but maybe this already suffices for comparative politics.

UPDATE (21-1-2009): Jeremy Weinstein kindly responded via email, being aware of the aggregations used in the data analysis, because they intended to serve a descriptive role, and pointing me to an effort of more detailed data collection, finer-grained analysis, and online data (in proprietary Strata format) of the conflict in Sierra Leone, which was published in American Political Science Review. That freely available paper, Handing and manhandling civilians in civil war, also gives an indication what the reader can expect of the contents in the book, and has a set of 8 hypotheses that are tested against the data (not all of them could be confirmed).

The Dirty War Index

There are people who build tools upon such conflict databases. Garbage In, Garbage Out? I will highlight one of those tools, which received extensive coverage in PLoS Medicine recently [2,3,4]: being able to calculate a “Dirty War Index” for a variety of parameters that follow the pattern of DWI = \frac{nr\_of\_dirty\_cases}{total\_nr\_of\_cases} \times 100 . The cases and their aggregates to nr of cases come from the conflict’s incidents databases. Go figure. It’s not just that, but one could/would/should assume that the examples Hicks and Spagat give in their paper [3] are to illustrate, but not to invalidate, their DWI approach.

Let us take their first example, the DWIs for the actors in the Colombian civil conflict as the measure \frac{nr\_of\_civilians\_killed}{total\_nr\_of\_ civilians\_killed + combatants\_killed} \times 100 . The ‘guerillas’ (presumably FARC) have a DWI of \frac{2498}{5444} \times 100  = 46, the ‘government forces’ \frac{593}{659} \times 100 = 45 , and the ‘illegal paramilitaries’ (a pleonasm) \frac{6944}{6985} \times 100 = 99 (numbers taken from the simple Colombia conflict database [5]). Hicks and Spagat explain that “Guerrillas rank 2nd in killing absolute numbers of civilians”, as if the government forces deserve a laurel for having the best (closest to 0) DWI—with a mere 1-point margin—and as if paramilitaries are independent of the government whereas it is the norm, rather than the exception, that governments tend to arrange for a third party to do the dirty work for them (with or without external funding) so as to look comparatively good in the international spotlights. Aggregating by ‘opponents of FARC’, we get a DWI of \frac{593+6944}{659+6985} \times 100 = 98.6 , which is substantially more dirty than FARC that cannot be explained away anymore by data collection biases [4]; to put it differently, FARC is in this DWI the proverbial ‘lesser of two evils’, or, if you support their cause then you could say they have good reason to be annoyed with the current violent governance in the country. This also suggest that requiring “recognition in Colombia’s paramilitary demobilization, disarmament, and reintegration process” [3] alone may not be enough to achieve durable peace for Colombians.

The other main illustration is the conflict in Northern Ireland by using two complementary DWIs: “aggressive acts (killing civilians) and endangerment to civilians (by not wearing uniforms)”[1]. The ‘British Security Forces’ (BSF) have a “Civilian mortality DWI” of 52, the ‘Irish Republican Paramilitaries’ (IRP) 36, and the ‘Loyalist paramilitaries’ (LP) 86—note the odd naming and aggregations, e.g., are we talking IRA, or lumping the IRA together with the Real-IRA and Continuity-IRA, and all UFF, LVF…? Consulting the extensive source database, it lists 29 groups. In addition, [3]’s “number of civilian + civilian political activist” are, respectively, 190+738+873=1801, but the source’s data has 1797 civ.+ 58 civ.pol.activists = 1855, and then a series of statuses such as “ex-British army”, “ex-IRA” and so forth, who, while being “ex-” are not real civilians according to the database. Much more data for compiling your preferred DWI and preferred details or aggregates can be found here [6].

The “Attacks without uniform DWI” are “approaches 0” (BSF), “approaches 100” (IRP) and “approaches 100” (LP) without actual values to do the calculation with; nevertheless the vagaries, for the IRP they prefer the adjective “extremely high rate” but for the LP it is only “very high rate”. They try a comparatively long explanation for the nastyness of the IRP, but it is plain that the BSF and LP have the dirtiest civilian DWI and that LP killed most civilians, no matter how one wants to explain it away and dress it up with DWIs (maybe not so coincidentally, the authors are affiliated with UK institutions).

I will leave Hicks and Spagat’s “female mortality DWI” of the Arab-Israeli conflict and the “child casualty DWI” of Chechnya for the interested reader to analyse (including the term ‘unexploded ordnance’ that injured or killed children—by exploding).

Although the idea of multiple DWIs can indeed be interesting to give a rough indication, there is the real danger of misuse due to unfair sanitation of data: it can easily stimulate misinterpretation by showing some neat aggregated numbers without having to assess the source data and by brushing over the reality on the ground that a bean-counting person may not be aware of and more readily can set aside in favour of the aggregated numbers.

Hicks and Spagat do have a section on considerations, but that their two main worked-out examples with Colombia and Northern Ireland are problematic already just proves the point about possible dubious use for one’s own political agenda. Perhaps they would say the same of my alternative rendering being politically coloured, but I do not try to give it a veneer of credibility and advantages of DWIs, just that it is simple to turn around and play with the DWIs to suit one’s preferences, whichever they may be.

UPDATE (5-6-’09): a more comprehensive review of Hicks and Spagat’s paper will be published in the autumn 2009 issue of the Peace & Conflict Review.

[1] Weinstein, Jeremy M. (2007). Inside rebellion—the politics of insurgent violence. Cambridge University Press.

[2] Sondorp E (2008 ) A new tool for measuring the brutality of war. PLoS Med 5(12): e249. doi:10.1371/journal.pmed.0050249

[3] Hicks MH-R, Spagat M (2008 ) The Dirty War Index: A public health and human rights tool for examining and monitoring armed conflict outcomes. PLoS Med 5(12): e243. doi:10.1371/journal.pmed.0050243.

[4] Taback N (2008 ) The Dirty War Index: Statistical issues, feasibility, and interpretation. PLoS Med 5(12): e248. doi:10.1371/journal.pmed.0050248.

[5] The numbers originate from CERAC’s Colombia conflict database as reported in [3]; both Hicks and Spagat are research associates of CERAC; database available after registration, which has substantially less types of information and less explanation than Sutton’s [6] database.

[6] CAIN Web Service as reported in [3]; database freely available, including data, querying, and design and data collection choices.


[1] The latter DWI is theoretically problematic, because the distinction between actors who use violence and their supporters in the population (be it passively or actively with food, shelter, and logistics) is often not that clear, and off-duty soldiers are not necessarily automatically civilians; but the argument is long. Hicks and Spagat’s table 3 has a longer list than just this item, and I shall not digress further on the topic here.