Computational, and other, problems for genomics of emerging infectious diseases

Last week, PLoS published a cross-journal special collection on Genomics of Emerging Infectious Diseases. Perhaps unsurprisingly, I had a look at the article by Berglund, Nystedt, and Andersson about the limitations of and challenges for computational resources [1]. It reads much like the one about computational problems for metagenomics that I wrote about earlier (here and here), but the authors have a somewhat curious request in the closing section of the paper.

Concerning the overall contents of the paper and its similarity with the computational aspects of metagenomics: complete genome assembly is still not fully sorted out, in particular the need “for better ways to integrate data from diverse sources, including shotgun sequencing, paired-end sequencing, [and more]…” and the need for a quality scoring standard. The other major recurring topic is the hard work of annotation to give meaning to the data. Then there are the requests for better, and 3D, visualizations of the data and for cross-granular analysis of data along the omics trail and at different scales.

Limitations and challenges that are more specific to this subject domain are the classification and risk assessments for the emergence of novel infectious strains and risk prediction software for disease outbreaks. In addition, they put a higher importance on the request for supporting tools to figure out the evolutionary aspects of the sequences and how the pieces of DNA have recombined, including how and from where they have been horizontally transferred.

In the closing section, the authors reiterate that

To achieve these goals, investments in user-friendly software and improved visualization tools, along with excellent expertise in computational biology, will be of utmost importance.

I fancy the thought that our WONDER system for the HGT-DB for, in particular, graphical querying meets the first requirement—at least as proof of concept that one can construct an arbitrary query graphically using an ontology and all that in a web browser. Having said that, I am also aware of the authors’ complaint that

Currently, the slow transition from a scientific in-house program to the distribution of a stable and efficient software package is a major bottleneck in scientific knowledge sharing, preventing efficient progress in all areas of computational biology. Efforts to design, share, and improve software must receive increased funding, practical support, and, not the least, scientific impact.

Yes, like most bioinformatics tools and proof-of-concept and prototype software that come from academia, it is not industry-grade software. To get the latter, companies have to do some more ‘shopping’ around and invest in it, i.e., monitor the latest engineering developments presented at conferences that demonstrate working theory, take up the ones they deem interesting, and transform them into stable tools. We, be it here at FUB or at almost any other university, do not have an army of experienced programmers, not only because we do not have the financial resources to pay programmers (cf. researchers, with whom more scientific brownie-points can be scored) but, moreover, because a computing department is not a cheap software house. The authors’ demand for more funding for software development to program cuter and more stable software would kill computing and engineering research at the faculty if the extra funding were not real extra funding on top of existing budgets. The reality these days is that many universities face cuts in funding. Go figure where that leaves the authors’ request. The complaint might have been more appropriate and effective had the authors voiced it in an industry journal.

The last part of the quote, receiving increased scientific impact, seems to me a difficult one. Descriptions of proof-of-concept and prototype software to experimentally validate the implementability of a theory can find a home in a scientific publication outlet, but a paper that tells the reader the authors have made a tool more stable is not reporting on research results: it does not bring us any new knowledge, does not answer a research question, does not solve a hitherto unsolved problem, and does not confirm or refute a hypothesis. Why should (“must” in the authors’ words) improved, more usable and more stable, software receive scientific impact? Stable and freely available tools have an impact on doing science and some tasks would be nigh on undoable without them, but this does not imply such tools are part and parcel of the scientific discovery. One does not include in the “scientific impact” the Petri dish vendor, PCR machine developers, or Oracle 10g development team either. There are different activities with different scopes, goals, outcomes, and reward mechanisms; and that’s fine. Proposing to offer companies some fairly difficult to determine scientific-impact-brownie-points may not be the most effective way to motivate them to develop tools for science; getting across the possibility to make a profit in the medium to long term and to do something of societal relevance may well be a better motivator.

References

[1] Berglund EC, Nystedt B, Andersson SGE (2009) Computational Resources in Infectious Disease: Limitations and Challenges. PLoS Comput Biol 5(10): e1000481. doi:10.1371/journal.pcbi.1000481


The WONDER system for ontology browsing and graphical query formulation

Did you ever not want to bother knowing how the data is stored in a database, but simply want to know what kind of things are stored in the database at, say, the conceptual or ontological layer of knowledge? And did you ever not want to bother writing queries in SQL or SPARQL, but instead have a graphical point-and-click interface with which you can compose a query using that ‘what’ layer of knowledge, so that the system automatically generates the SQL/SPARQL query for you in the correct syntax? And all that not with a downloaded desktop application but in a Web browser?

Our domain experts in genetics as well as in healthcare informatics, at least, wanted that. We have designed and implemented it now [1], which we have enthusiastically named Web ONtology mediateD Extraction of Relational data (WONDER). Moreover, we have a working system for the use case about the 4GB horizontal gene transfer database [2] and its corresponding ‘application ontology’. (pdf)

Subscribers to this blog might remember I mentioned that we were working towards this goal, using Ontology-Based Data Access tools to access a database through an ontology and learning from (and elaborating on) its preliminary case studies [3]. In short, we added a usability extension to the OBDA implementations so that not only savvy Semantic Web engineers can use it but also, and more importantly, the domain experts who want to get information from their database(s). By building upon the OBDA framework [4], we can avail of its solid formal foundations; that is, WONDER is not merely a software application, but there is a logic-based representation behind both the graphics in the ontology browser and the query pane.

In addition, WONDER is scalable because the ontology language (roughly: OWL 2 QL) is ‘simple’. Yes, we had to drop a few things from the original ORM conceptual model, but they have—at least for our case study—no effect on querying the data. The ‘difficult’ constraints are (and generally: should be anyway) implemented in the database, so there will be no instances violating the constraints we had to drop. Trade-offs, indeed, but now one can use an ontology to access a large database over the Web and retrieve the results quickly.

For instance, take the query “For the Firmicutes, retrieve the organisms and their genes that have a GCtotal contents higher than 60”, which is for various reasons not possible through the current web interface of the source database.

Fig.1 shows the ontology pane with three relevant elements selected. (click on the figures to enlarge)


Fig.1. WONDER's ontology pane with three elements selected

Fig.2 shows the constraint adder, where I’m adding that the GCValue has to be > 60.


Fig. 2. WONDER's constraint adder, where I’m adding that the GCValue has to be > 60

Fig.3 shows the query ready for execution: the attributes with a green border are those that will appear in the query answer (I could have selected all, if I wanted to). In the menu bar on the right you can see I have customized the names of the attributes, so that the columns in the results pane will have a query-relevant name in your preferred language (not necessary to do), as well as the automatically generated query.


Fig.3. WONDER's query pane, where the query is ready for execution

Fig.4 shows a section of the first page of the results and Fig.5 of the second page; the “Family” column, which contains all the Firmicutes (out of about 500 organisms in the database), gives you the whole section of the species tree, because that is how the taxonomy information is stored in the database (refining the database is a separate topic). Alternatively, I could have selected the organism Name from the ontology browser (see Fig.1), de-selected the taxonomic classification in the query pane, and included the Name of the organism in the query answer to have the species name only but not all the taxonomic information; in this case, I wanted to have all that taxonomy information. The genes are the relevant selection (made with the other constraints) out of the about 2 million genes that are stored in the database.


Fig.4. Section of the results, the first page


Fig.5. Section of the results, the second page
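To give an idea of what “the system generates the query for you” can mean, here is a minimal sketch in Python that assembles a parameterised SQL query from a selection of attributes and constraints and runs it on an in-memory SQLite database. It is not how WONDER works internally (WONDER goes through the ontology and the OBDA mappings); the table names, column names, rows, and the `build_query` helper are all made up for illustration and do not reflect the actual HGT-DB schema.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only (not the real HGT-DB).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT, family TEXT);
    CREATE TABLE gene (id INTEGER PRIMARY KEY, organism_id INTEGER,
                       name TEXT, gc_total REAL);
""")
conn.executemany("INSERT INTO organism VALUES (?, ?, ?)", [
    (1, "organism A", "Bacteria; Firmicutes; Bacilli"),
    (2, "organism B", "Bacteria; Proteobacteria"),
])
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (10, 1, "gene a1", 62.5),
    (11, 1, "gene a2", 41.0),
    (12, 2, "gene b1", 70.1),
])

ALLOWED_OPERATORS = {">", "<", "=", "LIKE"}


def build_query(selected, constraints):
    """Build a parameterised SQL string from the attributes to return and
    a list of (column, operator, value) constraints joined with AND."""
    where, params = [], []
    for column, operator, value in constraints:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator}")
        where.append(f"{column} {operator} ?")
        params.append(value)
    sql = (
        f"SELECT {', '.join(selected)} FROM organism "
        "JOIN gene ON gene.organism_id = organism.id "
        f"WHERE {' AND '.join(where)}"
    )
    return sql, params


# "For the Firmicutes, retrieve the organisms and their genes
#  that have a GCtotal contents higher than 60"
sql, params = build_query(
    ["organism.family", "gene.name", "gene.gc_total"],
    [("organism.family", "LIKE", "%Firmicutes%"), ("gene.gc_total", ">", 60)],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The point of the sketch is only that the user picks attributes and constraints and never types the query text; values are passed as parameters rather than pasted into the string.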

There is also a constraint manager for the AND, OR, NOT and nesting. For instance, for the query “Give me the names of the organisms of which the abbreviation starts with a b, but not being a Bacillus, and the prediction and KEGG code of those organisms’ genes that are putatively either horizontally transferred or highly expressed” (Fig.6), we have the constraint manager as shown in Fig.7.


Fig.6. Graphical and textual representation of the second query


Fig.7. Constraint manager for the second query
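The nesting of AND, OR, and NOT in the second query can be pictured as a small tree of constraints. The following is a rough sketch of that idea in Python, not WONDER's actual data structure; the class names, the record fields (`abbreviation`, `genus`, `prediction`), and the example records are assumptions invented for the illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Atom:
    label: str                     # human-readable description
    test: Callable[[dict], bool]   # check on one record


@dataclass
class Not:
    child: object


@dataclass
class And:
    children: List[object]


@dataclass
class Or:
    children: List[object]


def evaluate(node, record):
    """Recursively evaluate a constraint tree against one record."""
    if isinstance(node, Atom):
        return node.test(record)
    if isinstance(node, Not):
        return not evaluate(node.child, record)
    if isinstance(node, And):
        return all(evaluate(child, record) for child in node.children)
    if isinstance(node, Or):
        return any(evaluate(child, record) for child in node.children)
    raise TypeError(f"unknown constraint node: {node!r}")


def render(node):
    """Render the tree as text, with brackets showing the nesting."""
    if isinstance(node, Atom):
        return node.label
    if isinstance(node, Not):
        return f"NOT ({render(node.child)})"
    if isinstance(node, (And, Or)):
        joiner = " AND " if isinstance(node, And) else " OR "
        return "(" + joiner.join(render(child) for child in node.children) + ")"
    raise TypeError(f"unknown constraint node: {node!r}")


# Abbreviation starts with a b, but not a Bacillus, and the gene is
# putatively either horizontally transferred (HT) or highly expressed (HE).
query = And([
    Atom("abbreviation starts with b",
         lambda r: r["abbreviation"].lower().startswith("b")),
    Not(Atom("genus = Bacillus", lambda r: r["genus"] == "Bacillus")),
    Or([
        Atom("prediction = HT", lambda r: r["prediction"] == "HT"),
        Atom("prediction = HE", lambda r: r["prediction"] == "HE"),
    ]),
])

# Made-up records, one per (organism, gene) pair.
records = [
    {"abbreviation": "bxxa", "genus": "Bacillus", "prediction": "HT"},
    {"abbreviation": "bxxb", "genus": "GenusX", "prediction": "HE"},
    {"abbreviation": "bxxc", "genus": "GenusX", "prediction": "normal"},
    {"abbreviation": "exxa", "genus": "GenusY", "prediction": "HT"},
]
results = [evaluate(query, record) for record in records]
print(render(query))
print(results)
```

Only the second record passes all three branches: the first is excluded by the NOT, the third by the OR, and the fourth by the abbreviation.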

You can also save and load queries when you’re logged in, and download the results set in any case.

For those who want to play with it: feel free to drop me a line and I will send you the URL. (The reason for not linking the URL here is that the current URL is still for the beta version, whereas the operational one is expected to have a more stable URL soon.)

Last, but not least, the “we” I used in the previous sentences is not some ‘standard writing in plural’, but several people were involved in various ways in realizing the WONDER system. In alphabetical order, they are: Diego Calvanese, Marijke Keet, Werner Nutt, Mariano Rodriguez-Muro, and Giorgio Stefanoni, all at FUB. I also want to thank our domain experts of the case study (with whom we’re writing a bio-oriented paper): Santi Garcia-Vallvé (with the Evolutionary Genomics Group, ‘Rovira i Virgili’ University, Tarragona, Spain) and Mark van Passel (with the Laboratory for Microbiology, Wageningen University and Research Centre, the Netherlands).

References

[1] Calvanese, D., Keet, C.M., Nutt, W., Rodriguez-Muro, M., Stefanoni, G. Web-based Graphical Querying of Databases through an Ontology: the WONDER System. ACM Symposium on Applied Computing (ACM SAC’10), March 22-26 2010, Sierre, Switzerland.

[2] Garcia-Vallve, S, Guzman, E., Montero, MA. and Romeu, A. 2003. HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Research 31: 187-189.

[3] R. Alberts, D. Calvanese, G. De Giacomo, A. Gerber, M. Horridge, A. Kaplunova, C. M. Keet, D. Lembo, M. Lenzerini, M. Milicic, R. Moeller, M. Rodríguez-Muro, R. Rosati, U. Sattler, B. Suntisrivaraporn, G. Stefanoni, A.-Y. Turhan, S. Wandelt, M. Wessel. Analysis of Test Results on Usage Scenarios. Deliverable TONES-D27 v1.0, Oct. 10 2008.

[4] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, and Riccardo Rosati. Ontologies and databases: The DL-Lite approach. In Sergio Tessaris and Enrico Franconi, editors, Semantic Technologies for Information Systems – 5th Int. Reasoning Web Summer School (RW 2009), volume 5689 of Lecture Notes in Computer Science, pages 255-356. Springer, 2009.


New blogpost index and updated list of top posts

To facilitate browsing and not to let the older posts be entirely hidden in the archive, I have added a separate page with a list of all posts of the past 3.5 years. For those of you curious which of those posts have received the most traffic over time, I have updated the vox populi page. I also added a list of my most cited research articles; for the time being, the topics covered by the top posts and the most cited papers have an empty intersection. OK, stretching it a bit, post 2 and papers 4 and 8 are somewhat related, but that is easy with post 2’s topic of ‘computer science with/for biology’, of which the papers are just two examples (conceptual modeling for biology and ontologies in ecology).

One can only speculate why this is so… It may well be that the population of blog-readers is disjoint from the population who consider my scientific contributions of some value. Or perhaps the highly informal posts about my research do not attract the precious time of the researchers (which does not entail any consequences regarding the translocation of the negation in that statement!); however, perhaps scientists should read more blogs and start one of their own. John Baez has posted an interesting draft article about the usefulness of blogs in research, and (update addition on 15-10-’09:) Gowers and Nielsen describe their take on online collaborative research based on the Polymath project experience in the openly accessible Nature opinion article [1]. Maybe we can dish up examples for research in computer science, or the Semantic Web in particular, to demonstrate there are some benefits to it. Does anyone have an example?

References

[1]  Timothy Gowers and Michael Nielsen. (2009). Massively collaborative mathematics. Nature 461, 879-881.


Any semantic search for insects?

The draft of this post started with an example of a creepy insect living in Italy and, well, across the world in those locations where hygiene is not taken too seriously. But I will leave that be, so you can have a good night’s rest. Instead, I will take the example of an insect of which I still do not know what it is (it may still turn out to be a creepy one), but now I do have photos of it and it is living well away outside in Ineke’s polytunnel near Limerick, Ireland. The problem is this: neither Ineke, nor Heidi, nor I know what it is, but we really still want to know. How to get the answer, i.e., how to find the species name of the specimen? I’ve tried several strategies: the ones that are practically possible did not do the job and the one that would do it does not exist. I’ll go through them in the remainder of the post and close with a few questions on what the most feasible strategy would/should/could be to eventually have a decent entomology [ornithology/nematology/etc.] knowledge base.


Specimen viewed from the top; can anyone ID this specimen?

Basic searches

None of us who were present at teatime in Ineke’s polytunnel, where we observed the insect, is an entomologist, nor do we have entomologist friends. The famous ‘bug man’ Ruud Kleinpaste is a fellow alumnus of Wageningen University, but we did not study there around the same time and I could not find an email address to bother him with a request to ID a specimen. None of us has an insect handbook either and even if we had, I, for one, would not want to flick through it when there is a perceived need to find the species of a specimen: flicking through the insect-book (and plant-book, etc.) was an entertaining pastime when I was young, like reading the encyclopaedia and doing the dictionary game, but in this day and age, I want to use the computer to find the answer. This is theoretically feasible but, as far as I am aware, not yet in practice.

To do image matching, I would need a very large data set of images, each labelled with the species name it belongs to, which I do not have; so the machine learning strategy will not work. There is an online browseable BugGuide for the US and Canada with lots of pictures that I clicked through for a while, but without finding the right picture. There are entomology databases that let me search by species name (here, here, and here), but not by properties of the insect; KONCHUR has a fancier search mechanism but covers insects in Japan, East Asia and the Pacific only (“orange leg AND black body” did not return any results).

Sure, I did a Google search on “image of a black insect with orange legs and stingy back”, hoping that someone else had already uploaded an image of another specimen of the same type of insect, annotated it with the same or very similar terms, and that someone had taken the next step to add the name of the species as well, i.e., someone who is not looking around for the answer like me but has the right knowledge of insects. With the many search phrases and the PageRank algorithm that Google relies upon in devising the search results [1], something might turn up; however, the actual results were unhelpful. They included other people with similar requests without an answer, results with the body colours swapped (twice: orange bug with black legs), many unrelated insects, one of which has orange legs (the Ichneumon wasp (Rhyssa persuasoria)) but its legs are only partially orange, it has white dots on its body, and its back is not as stingy as our specimen’s (see picture of the Ichneumon wasp), and utterly irrelevant land and sea images. That’s about it for the first page of the Google query answer.

Semantic searches

Now, if there were a proper ontology of insects, and I mean not a bare taxonomic tree but one where the classes have properties and those properties have their ranges defined as well, then it would be a simple exercise of selecting the properties along the lines of

adult insect
  AND length 2cm
  AND colour black
  AND has wingtype transparent
  [*AND body shape similar to a wasp*]
  AND leg colour orange
  AND rear body stingy
  AND location at least west Ireland

so that the reasoner (FaCT++, Pellet, and the like) would classify it near-instantly or, if the ontology were really large, still within an hour or so (ignoring for a moment the [*AND body shape similar to a wasp*] because that requires a bit more work). It would be even funkier if that ontology were linked to a database of images of insects to cross-check it with the visuals. Even more so if such a database also had information about its habitat with feeding habits, principal role in the food web, and any diseases it may cause or transmit. Then one would also be able to start the search from another direction along the lines of “give me all the insects that live in the west of Ireland” as a first step to narrow down the possible answers.
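To make the property selection above a bit more concrete, here is a minimal sketch in Python of matching an observation against a toy ‘knowledge base’. It is a plain closed-world attribute filter, not a description logic reasoner such as FaCT++ or Pellet, and the species entries, property names, and values are entirely made up for illustration.

```python
# Toy knowledge base: hypothetical species with hypothetical properties.
# A tuple is a (min, max) range, a set lists the admissible values.
KNOWLEDGE_BASE = {
    "species_A": {
        "stage": "adult",
        "length_cm": (1.5, 2.5),
        "colour": "black",
        "wing_type": "transparent",
        "leg_colour": "orange",
        "rear_body": "stingy",
        "location": {"west Ireland", "east Ireland"},
    },
    "species_B": {
        "stage": "adult",
        "length_cm": (0.5, 1.0),
        "colour": "black",
        "wing_type": "transparent",
        "leg_colour": "orange",
        "rear_body": "round",
        "location": {"west Ireland"},
    },
    "species_C": {
        "stage": "adult",
        "length_cm": (1.0, 3.0),
        "colour": "orange",
        "wing_type": "transparent",
        "leg_colour": "black",
        "rear_body": "stingy",
        "location": {"Italy"},
    },
}

# The observation from the polytunnel, using the properties listed above.
observation = {
    "stage": "adult",
    "length_cm": 2.0,
    "colour": "black",
    "wing_type": "transparent",
    "leg_colour": "orange",
    "rear_body": "stingy",
    "location": "west Ireland",
}


def matches(profile, observed):
    """Return True if every observed property is compatible with the profile."""
    for prop, value in observed.items():
        if prop not in profile:
            return False  # closed-world assumption: unknown means no match
        expected = profile[prop]
        if isinstance(expected, tuple):
            low, high = expected
            if not low <= value <= high:
                return False
        elif isinstance(expected, (set, frozenset)):
            if value not in expected:
                return False
        elif expected != value:
            return False
    return True


candidates = sorted(
    name for name, profile in KNOWLEDGE_BASE.items()
    if matches(profile, observation)
)
print(candidates)
```

A reasoner over a real ontology would do more than this (class hierarchies, property ranges, open-world semantics), but the user-facing step is the same: state the properties, get the candidate classes back.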

Aside from the instance classification problem of this particular specimen, the question arises whether it would be up to

a) Google to work on their technology so as to be able to get the answer for me?

b) Entomologists to develop their domain ontology about insects and link it to some database with pictures and additional textual information to have indeed a properly searchable knowledge base?

c) Volunteer labour, like me having taken pictures and annotated each one with the physical characteristics, location, time of observation, etc. and categorise it as “bug” or “insect” or “insekt” or “insetto” to eventually have a grass-roots bugbase (that likely will have some imperfections with gaps in data fields and sloppy terminology)?

d) Everyone to buy insect books?

e) …?

Shelving option d, I am explicitly looking for a computational option, i.e., a, b, c, or e. I prefer a web-accessible version of option b, which can be done with scalable Semantic Web technologies; one only needs to find the money, time, and people to realise it.

Although I gave the example here with insects, the same story can be made for birds, worms, and so forth. When such searchable knowledge bases exist, it will not only save time for many lay people looking up the information and learning more about the flora and fauna around them, but I can imagine it will also make research a lot easier for interdisciplinary scientists who have to forage into knowledge of insects [/birds/worms/etc] as well as the entomologists [/ornithologists/nematologists/etc.] themselves.


specimen from the side

References

[1] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, March/April 2009, 8-12.


Linked data as the future of scientific publishing

The (not really a) ‘pipe dream’ paper about “Strategic reading, ontologies, and the future of scientific publishing” came out in August in Science [1], but does not seem to have picked up a lot of blog-attention other than copy-and-paste-the-abstract (except for Dempsey’s take on it), which is curious. Renear and Palmer envision a bright new world with online scientific papers where the text has click-through terms to records in other databases and web pages to have your network of linked data; click on a protein name in the paper and automatically browse to Uniprot, sort of. And it is even the users who want that; i.e., it is not some covert push from Semantic Web techies. But then, in this case the ‘users’ are in informatics & information/library science (the enabling-users?), which does not imply that the endusers—say, biochemists, geneticists, etc.—want all that for better management of their literature (or they want that but do not yet realise that that is what they want).

But let us assume those endusers want the linked data for their literature (after all, it was a molecular ecologist who sent me the article; thanks Mark!). Or, to use some ‘fancier’ terms from the article: the zapping scientists want (need?) ontology-supported strategic reading to work efficiently and effectively with the large amounts of papers and supporting data being published. “Scientists are scanning more and reading less”, so then the linked data would (should?) help them in this superficial foraging of data, information, and knowledge to find the useful needle in the haystack, or so goes Renear and Palmer’s vision.

However, from a theoretical and engineering point of view, this can already be done. Not just that, it has been shown that some things work: there are iHOP and Textpresso, as the authors point out, but also GoPubMed, and SHER with PubMed, which raises the question: are those tools not good enough (and if something is missing, what?) or is it about convincing people and funding agencies? If the latter, then what does the paper do in Science in the “review” section?

If one reads further on in the paper, some peculiar remarks are being made, but not the ones I would have expected. That the “natural language prose of scientific articles provides too much valuable nuance and context to be treated only as data” is a known problem that keeps many a (computational) linguist busy for his/her lifetime. But they go on to say also that “Traditional approaches to evaluating information systems, such as precision, recall, and satisfaction measures, offer limited guidance for further development of strategic reading technologies”, yet alternative evaluation methods are not presented. That “research on information behaviour and the use of ontologies is also needed” may be true from an outsider’s perspective: usages are known among ontologists but perhaps a review of all the ways of actual usage may be useful. Further down in the same section (p832), the authors claim, out of the blue and without any argumentation, that “the development of ontology languages with additional expressive power is needed”. What additional expressive power is needed for document and data navigation when they talk of the desire to exploit better the “terminological annotations”? The preceding text in the same paragraph only mentions XML, so they do not seem to have a clue about ontology languages, let alone their expressiveness, at all (OWL is mentioned only in passing on p830) and they manage to mention it in the same breath as so-called service-oriented architectures with a reference to another Science paper. Frankly, I think that papers like this are bound to cause more harm (or, at best, indifference) than good.

One thing I was wondering about, but that is not covered in the paper, is the following: who decides which term goes to which URI? There are different databases for, say, proteins, and the one that will be selected (by whom?) in the scientific publishing arena will become the de facto standard. A possibility to ameliorate this is to create a specific interface so that when a scientist clicks on a term, a drop-down box appears with something like “do you wish to retrieve more information from source x, y, or z?” Nevertheless, one easily ends up with a certain bias and powerful “gatekeepers”, and perhaps with a similar attitude as toward DBLP/PubMed (“if it’s listed in there it counts, if it is not, then it doesn’t”, regardless of the content of the indexed papers, and favouring older, established, and well-connected outlets above newer and/or interdisciplinary ones).

Anyway, if Semantic Web researchers need some reference padding in the context of “it is not a push but a pull, hear hear, look here at the Science-reference, and I even read Science”, it’ll do the job, even though the content of the paper is less than mediocre for the outlet.

References
[1] Allen H. Renear and Carole L. Palmer. Strategic Reading, Ontologies, and the Future of Scientific Publishing. Science 325 (5942), 828. [DOI: 10.1126/science.1157784]


Visiting DERI in sunny Galway

Believe it or not, the weather is indeed dry and sunny in Galway, for 5 days in a row already; although I did not come to Ireland for the good weather, it is a nice bonus. One of the reasons I am in Ireland is to visit the Digital Enterprise Research Institute (DERI) in Galway, which is the largest Semantic Web-oriented research group in Ireland with about 120 employees.

I am hosted by Paul Buitelaar’s NLP unit, looking into options to improve NLP implementations with ontologies, as DERI puts somewhat more emphasis on validation with applications than Bolzano does. In this context I gave a seminar about representing and reasoning over a taxonomy of part-whole relations, which is based on the Applied Ontology paper with the same title [1], but where the slides focus on the motivations from a linguistics perspective. It led some attendees to believe linguistics was the core motivation for disambiguating different types of part-whole relations. However, correct modelling of part-whole relations probably gets more attention in conceptual data modelling (primarily, UML), and it receives lots of attention in attempts to address the demands put forward by bio(medical) ontologists to (i) have an ontology language with which one can represent all properties of parthood relations, which we still cannot do in OWL, and (ii) distinguish between part-whole notions such as (spatial) containment, structural parthood, and membership of a collective.

An orthogonal dimension to the types of part-whole relations is that of essential and immutable parts and wholes, which can be handled by resorting to a temporalisation of relationships [2]. If one would want to ‘translate’ that to any usage in NLP, then one can deal adequately, with a formal and ontological foundation, with linguistic expressions that have a wider range of verb tenses, like “researcherAbc will become a member of researchGroup123” (a scheduled meronymic part-whole relation) and “heart#123 has been transplanted from patientAbc” (a so-called disabled relation where the heart used to be a structural part of that patient). But this is still music for future work.

Other topics have come up, and are coming up, as well, such as a possible use case with roles and rules with Axel Polleres, making it a stimulating and enjoyable visit.

References

[1] Keet, C.M. and Artale, A. Representing and Reasoning over a Taxonomy of Part-Whole Relations. Applied Ontology, 2008, 3(1-2): 91-110.

[2] Artale, A., Guarino, N., and Keet, C.M. Formalising temporal constraints on part-whole relations. 11th International Conference on Principles of Knowledge Representation and Reasoning (KR’08). Gerhard Brewka, Jerome Lang (Eds.) AAAI Press, pp 673-683. Sydney, Australia, September 16-19, 2008.


The, surprisingly somewhat belated, report on the Granular Computing Conference GrC’09

Last week I attended the IEEE International Conference on Granular Computing 2009 in Nanchang, China. I had done my preparations to report about it during the conference as well, but, alas, I was not allowed to even read my own blog, let alone post to it; see [footnote 1] below for some observations on non-accessible blogs. This being the situation, here is, then, a slightly belated report.


Tengwang Ge in Nanchang, which was also depicted on the beautiful GrC'09 remembrance plate from Jiangxi University

Keynotes

T.Y. Lin of San José State University, USA, one of the initiators of “Granular Computing” as (in his view) a specialisation area of applied mathematics, spoke about the difference between keyword terms in Web searches and the concepts behind them, and he proposed to solve the issues with category theory. To my question why specifically category theory and not Semantic Web languages and technologies, I unfortunately did not get a clear answer. Xingdong Wu of the University of Vermont talked about mining user patterns, aggregations, and user interest modelling with wildcards, where the use of wildcards provides much flexibility (and improvements) in specifying and mining the user patterns. Last, Xue-wen Chen of the University of Kansas first went through the usual introductory aspects of systems biology, to proceed to the actual topics of gene and protein networks. The technologies used are hidden Markov models, Bayesian networks, and a K-GIDDI divide-and-conquer biclustering algorithm touched up with Gene Ontology terms to, respectively: find 4 new genes (their functions) in D. melanogaster (fruit fly), figure out a gene network (80% of what was known already in the literature, but now done by automation), and obtain 95.42% correct function assignment of H. sapiens genes (verified against the 437 genes already annotated) out of a total of 618 function assignments. Not bad, not bad at all; his 2008/9 papers have the details.

Sessions

My paper on granular perspectives ([1] and summary here) was scheduled right at the start in the first of the parallel sessions, in the “Foundations of granular computing”. It was listened to with interest, and I received positive feedback; whether people will use it, time will tell. Hong Hu brought up the topic of dynamic similarity (e.g. the gradual changes from tadpole to frog) and how to deal with it, for which he proposed to use neural networks [2]. However, dealing with the standard ‘static’ similarity, i.e. comparing two objects at the same time, is already a widely researched area, and unsurprisingly, the last word has not been said yet about dynamic similarity either; in fact, perhaps they were the first ones. The same session had scheduled a paper with a preliminary, set-based, notion for a theory of granularity [3], which already looks ahead to using attributes of the objects (but that was not fully integrated in the theory yet), and gives “granule”, as a lump of objects in a granular level, an explicit place in the theory. Yinliang Zhao’s paper was about a way to granulate program code, in particular that of object-oriented programming languages, with a restriction, thus far, to single-inheritance class hierarchies (the programming version of taxonomies) [4].

In the afternoon, I went to the ‘Japanese session’, where, among others, Toyota presented an application paper for visualising the topics of, and navigation to, the 7000 or so Japanese laws to acquaint Japanese citizens with them (this year Japan changed its judicial system, which now has a citizen-judge, or ‘saiban-in’, system) [5]. Is this practically scalable to the Italian system? It should be, in theory at least, but with its more than 70000 laws, it will require more levels of granularity than the four they have so as to obtain appropriate overviews. In addition, the relations of the (hierarchically ordered) key terms in the texts of the full collection of Japanese laws have the characteristic of a so-called “small world network”, which makes it very suitable for visualisation. My experience with the Italian bureaucracy and its rules for the strangest things gives me the impression that that may not be the case with the Italian laws (and Italian laws can benefit from an automated consistency check, but that is a separate topic), but it is not a trivial exercise to actually verify or refute this hunch. As a tidbit of fun information about the relatedness of the keywords extracted from the Japanese laws: “Nation” scored highest with 1020 links, followed by “Money” with 981 links, which the presenter found curious enough to emphasise.

There were three sessions on rough sets: applications, theory, and computing. The applications session had a paper on using a “knowledge quantity” for the relative importance of the attributes used to compute the rough sets, and on applying that to Chinese text categorisation using the “document frequency thresholding” characteristic [6]: while common terms appear to be important for global performance, rare terms “are the most informative” to be able to discern (make distinguishable) those documents from others and are, from a rough set perspective, therefore influential because they have most effect on the equivalence structure. Li [7], on the other hand, improved on the “extenics” company evaluation method by using rough sets so that the number of company indicators, such as “human capital” and “technological innovation ability”, could be reduced, and hence a company’s evaluation method simplified. On the theory side, there was, among others, a paper on neighborhood systems with respect to rough sets where a new “and” operator is introduced that, as the authors claim, is “different from traditional rough set approximations” [8]. The remainder of the paper to back up this claim is rather dense, but T.Y. Lin summarised his students’ work as showing that the lower and upper approximations in variable precision rough sets (VPRS) are special cases of the interior and closure in topological space. Last, Chen, Li and coauthors sought to dynamically update the upper and lower approximations of a rough set to reflect the changes in the underlying information system over time, and they presented the theory, algorithm, and experimental validation in [9,10].
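For readers unfamiliar with the lower and upper approximations mentioned above, here is a minimal sketch of the textbook (Pawlak) definitions, not the specific methods of [6-10]. The document names and the two term attributes are made up for illustration.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Textbook lower and upper approximation of `target`.

    objects:    dict mapping an object name to a dict of attribute values
    attributes: the attributes used to decide indiscernibility
    target:     set of object names to approximate
    """
    # Objects with identical values on the chosen attributes are
    # indiscernible and end up in the same equivalence class.
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attributes)].add(name)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # class lies entirely inside the target
            lower |= block
        if block & target:    # class overlaps the target
            upper |= block
    return lower, upper

# Hypothetical toy data: 1 means the term occurs in the document.
docs = {
    "d1": {"common": 1, "rare": 1},
    "d2": {"common": 1, "rare": 0},
    "d3": {"common": 1, "rare": 0},
}
wanted = {"d1", "d2"}
coarse = approximations(docs, ["common"], wanted)
fine = approximations(docs, ["common", "rare"], wanted)
```

With only the common term, all three documents are indiscernible, so nothing can be placed in the target with certainty; adding the rare term splits off d1, which is one way to read the remark that rare terms have the most effect on the equivalence structure.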

Start of the walk at Lushan mountain, after the flower garden

The House where Nobel Prize winner (literature, 1938) Pearl Buck lived on the Lushan Mountain

Other

As a social event, besides the conference dinner, we had a trip to the Lushan mountain, which is a UNESCO world heritage site. Although I had to skip the walking sessions, the scenery is really beautiful and the temperature was comfortable. The “many old buildings” we visited turned out to be a missionaries’ outpost of about 100-150 years ago, including a not very protestant church that is (still/again?) used for weddings, and the home of Pearl Sydenstricker Buck, who won the Nobel prize for Literature for her writings about life in China.

Each participant also received a beautiful present from the local organisation, Jiangxi University of Finance and Economics: a black ceramic plate with, in gold-coloured imprint, the name of the university, IEEE GrC 2009, and in the centre the famous building Tengwang Ge.

Travelling to China is a bit of a hassle with the visa, and knowing some Chinese (which I do not) will be useful for getting around and getting things done, but nevertheless I highly recommend visiting the country, be it for a conference or a holiday: the people are friendly and very helpful, the food is delicious, and there are lots of things to see and do.


References

  1. C. Maria Keet. From granulation hierarchy to granular perspective. In: Proceedings of the 5th IEEE international conference on Granular Computing 2009 (GrC’09). 17-19 August, Nanchang, China. IEEE Computer Society, 306-311.
  2. Hong Hu and Zhongzhi Shi. Machine learning as granular computing. In: Proc. of GrC’09. IEEE Computer Society, 229-234.
  3. Hong Li. Granule, Granular Set and Granular System. In: Proc. of GrC’09. IEEE Computer Society, 340-345.
  4. Yinliang Zhao. A step toward code granulation space. In: Proc. of GrC’09. IEEE Computer Society, 799-804.
  5. Tetsuya Toyota and Hajime Nobuhara. Hierarchical structure analysis and visualisation of Japanese law networks based on morphological analysis and granular computing. In: Proc. of GrC’09. IEEE Computer Society, 539-543.
  6. Yan Xu and Wang Bin. Knowledge management based on rough set. In: Proc. of GrC’09. IEEE Computer Society, 654-657.
  7. Yuan-yuan Li and Jun Yun. A comprehensive evaluation method based on extenics and rough set. In: Proc. of GrC’09. IEEE Computer Society, 381-383.
  8. Xibei Yang, Xinzhe Li and Tsau Young Lin. First GrC model — Neighborhood Systems: the most general rough set models. In: Proc. of GrC’09. IEEE Computer Society, 691-695.
  9. Weili Zou, Tianrui Li, Hongmei Chen, Xiaolan Ji. Approaches for incrementally updating approximations based on set-valued information systems while attribute values’ coarsening and refining. In: Proc. of GrC’09. IEEE Computer Society, 824-829.
  10. Hongmei Chen, Tianrui Li, Weibin Liu. Research on the approach of dynamically maintenance of approximations in rough set theory while attribute values coarsening and refining. In: Proc. of GrC’09. IEEE Computer Society, 45-48.

Notes

[footnote 1] Regular readers may recollect that Cuba did not block my blog, and that I have written a post there during the Informatica 2009 conference about the VIP session. This made me curious as to what type of blogs are (not) accessible here in China. Some observations (pages checked on 16 and 17 Aug 2009):

  1. WordPress: I did a random check of a few other wordpress blogs with full names as well as xxx.wordpress.com and .org types, such as Duncan’s and WP’s own blog with tips ‘n tricks, to ascertain if it was just my blog being “timed out”, but all those blogs were “timed out”, too. I could access WP’s startpage.
  2. Blogspot: I tried Ben’s and FSP’s blogs, which had a “connection interrupted” message. The www.blogger.com had a quick “connection interrupted” message, idem www.blogspot.com.
  3. Typepad: the frontpage already “timed out”, idem specific typepad blogs.
  4. Other blogs that do not run through one of those blogging sites but have their own software running, such as those of Michael Nielsen, LogBlog, and Microbeworld are accessible, but not the asmblog of the American Society for Microbiology, such as Small things considered (“timed out”, although the ASM was accessible).
  5. Curiously, when I did a Google search on “blog filters china”, one of the top hits returned was the accessible Harvard blog called “internet and democracy blog“, but the first hit returned by the Google search was a news item at National Public Radio that Microsoft implements blog filters for China, which closes with the line “Microsoft’s blogging filter could be seen as taking American companies’ cooperation with censorship to a new level. Instead of merely blocking what Internet users can read, she says, Microsoft is now limiting what they can write.”.

So, to whoever developed the filtering algorithms: there is room for improvement of your work; unless the owners of the three above-mentioned blogging platforms already do this blocking themselves preemptively, which I hope is not the case. To whoever wants their blog to also reach the Chinese in China: for the time being, install your own blogging software.

An analysis of culinary evolution

With summertime being what it is (called komkommertijd—literally: ‘cucumber time’—in Dutch), I stumbled again upon the paper The nonequilibrium nature of culinary evolution [1].

Food is essential, and due to location with its climate and available resources, as well as culture, each region has its own cuisine. There is much talk of homogenization of food dishes in popular press, or at least the threat thereof. One colleague here called “food from the North”, north of the Alps, that is, “barbarian”. But how much diversity in recipes across geographical locations is there? How, if at all, does it vary over time? What is the ingredient replacement pattern and do the replaced ingredients really disappear from the local menu?

Kinouchi and colleagues [1] tried to answer such questions through assessment of the statistics of the recipes’ ingredients, which were taken from 3 complete Brazilian cookbooks (Dona Benta 1946, 1969, and 2004), 40% of the large contemporary French cookbook Larousse (2004), the complete British Penguin Cookery Book (2001), and the Medieval Pleyn Delit.

For instance, the average recipe size, measured in number of ingredients, is lowest for the Dona Benta (1946) at 6.7, that of the Pleyn Delit is an impressive 9.7, and that of Larousse is the highest at 10.8. However, one has to note that for the Pleyn Delit there are just 380 recipes with a mere 219 ingredients, whereas the numbers for Larousse are 1200 and 1005, and for the Dona Benta they are 1786 and 491, respectively. When one plots the frequency of appearance of the ingredients in the recipes of each cookbook, all six cookbooks show very similar rank-frequency plots (power-law behaviour; see Fig. 1 in the paper); that is, for that dimension, there is a cultural invariance, as well as a temporal invariance for the Brazilian cookbook.
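To make the two statistics concrete, here is a small sketch of how one could compute the average recipe size and the rank-frequency list from a cookbook. The three-recipe "cookbook" is a made-up toy, and counting an ingredient once per recipe in which it appears is my assumption of how the frequencies are defined, not something taken from the paper.

```python
from collections import Counter

def average_recipe_size(recipes):
    """Mean number of distinct ingredients per recipe."""
    return sum(len(set(r)) for r in recipes) / len(recipes)

def rank_frequency(recipes):
    """Return (rank, ingredient, frequency) triples, most frequent first.

    The frequency of an ingredient is the number of recipes that use it;
    ties are broken alphabetically so that the output is deterministic.
    """
    counts = Counter(ing for recipe in recipes for ing in set(recipe))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, ing, n) for rank, (ing, n) in enumerate(ordered, start=1)]

# Hypothetical toy cookbook, for illustration only.
toy_cookbook = [
    {"potato", "carrot", "onion"},
    {"potato", "kale"},
    {"potato", "onion", "butter"},
]
ranking = rank_frequency(toy_cookbook)
```

Plotting frequency against rank on log-log axes is then the usual way to eyeball whether the curve looks like a power law.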

However, the more interesting results are obtained by the statistical and complex network analysis to get an idea about culinary evolution. The authors propose a copy-mutate algorithm to model cuisine growth, going from a small set of initial recipes to more diverse ones and using the idea of “cultural replicators” and branching. To fit the model to the data, they need 5 parameters: the number of generations (T), the number of ingredients per recipe (K), the number of ingredients in each recipe to be mutated (L), the number of initial recipes (R0), and the ratio between the sizes of the pool of ingredients and the pool of recipes (M). Models without a fitness parameter did not work, so a fitness value is generated randomly for each ingredient and assumed to stand for the “intrinsic ingredient properties”, such as nutritional value and availability. At each generation, one “mother” recipe is randomly chosen and copied, and one or more of its ingredients are replaced with another random ingredient (implementing the mutation rate L) to generate a “daughter” recipe. And so onward. Searching the parameter space, the authors do indeed find values close to the actual ones observed in the cookbooks.
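The description above can be turned into a few lines of code. What follows is a rough sketch of a copy-mutate process with the five parameters, not the authors' implementation: the rule that a replacement is only accepted when the new ingredient has a higher fitness, the way the ingredient pool grows to keep the ratio M, and the size of the initial pool are all my assumptions.

```python
import random

def copy_mutate(T, K, L, R0, M, seed=0):
    """Grow a toy cuisine for T generations; returns (recipes, fitness).

    Ingredients are integers; fitness[i] is the random fitness of ingredient i.
    """
    rng = random.Random(seed)
    fitness = []

    def new_ingredient():
        fitness.append(rng.random())  # random "intrinsic" fitness
        return len(fitness) - 1

    # Assumption: the initial pool holds at least K ingredients.
    pool = [new_ingredient() for _ in range(max(K, round(M * R0)))]
    recipes = [rng.sample(pool, K) for _ in range(R0)]

    for _ in range(T):
        daughter = list(rng.choice(recipes))  # copy a random "mother"
        for _ in range(L):                    # L mutation attempts
            slot = rng.randrange(K)
            candidate = rng.choice(pool)
            # Assumed selection rule: only fitter ingredients replace.
            if (candidate not in daughter
                    and fitness[candidate] > fitness[daughter[slot]]):
                daughter[slot] = candidate
        recipes.append(daughter)
        # Assumed pool growth: keep ingredients/recipes at about M.
        while len(pool) < M * len(recipes):
            pool.append(new_ingredient())
    return recipes, fitness

recipes, fitness = copy_mutate(T=50, K=5, L=2, R0=3, M=2.0)
```

The parameter values in the last line are arbitrary; fitting them to a real cookbook is the parameter search the authors describe.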

Then, on the fitness of the recipes (replaced by hamburgers, pizza, etc.?), Kinouchi and colleagues use the fitness of the kth recipe, defined as F^{(k)} = \frac{1}{K} \sum_{i=1}^{K}f_i, and a corresponding total time-dependent cuisine fitness, F_{total}(R(t)) = \frac{1}{R(t)} \sum_{k=1}^{R(t)}F^{(k)}. The results are depicted in Fig. 4 in the paper and, in short: “this kind of historical dynamics has a glassy character, where memory of the initial conditions is preserved, suggesting that the idiosyncratic nature of each cuisine will never disappear due to invasion by alien ingredients”. In addition, the copy-mutation model with the selection mechanism is scale-free, so that it is an out-of-equilibrium process, which practically means that “the invasion of new high fitness ingredients and the elimination of initial low fitness ingredients never end”, i.e., some ingredients are very difficult to replace, as if they were “frozen “cultural” accidents”. The latter has some similarity with the ‘founder-effect’ phenomenon in biology.
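The two fitness definitions are plain averages, which a short sketch makes explicit. The ingredient fitness values below are invented for the example; in the model they are drawn at random.

```python
def recipe_fitness(recipe, f):
    """F^(k): mean fitness f_i over the K ingredients of one recipe."""
    return sum(f[i] for i in recipe) / len(recipe)

def cuisine_fitness(recipes, f):
    """F_total(R(t)): mean of F^(k) over the R(t) recipes present at time t."""
    return sum(recipe_fitness(r, f) for r in recipes) / len(recipes)

# Hypothetical fitness values, for illustration only.
f = {"potato": 0.9, "parsnip": 0.3, "onion": 0.6}
cuisine = [["potato", "onion"], ["parsnip", "onion"]]

per_recipe = [recipe_fitness(r, f) for r in cuisine]  # about 0.75 and 0.45
total = cuisine_fitness(cuisine, f)                   # about 0.6
```

Tracking `cuisine_fitness` after each generation of a copy-mutate run gives the kind of time series that the paper's fitness figure is about.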

De aardappeleters (potato eaters) by Van Gogh

So much for the maths and experimental data of the paper. Before I turn to some research suggestions on this topic, I will first make an unscientific, informal assessment. Van Gogh painted de aardappeleters (‘the potato eaters’) in Nuenen (a village about 15km from where I grew up) back in 1885, to which Thieu Sijbers added a poem to describe such a poor man’s meal. I could not find the full original, but Van Oirschot ([2], p17) has the main parts of it, which I reproduce here first in the original old Brabants dialect and then in an English translation.

En hoekig nao ‘t bidde

‘t krous en dan wordt

aon de sobere maoltijd begonne

recht van ‘t vuur

op de bonkige toffel gezet

worre d’èrpel naw schieluk

mi rappe verkèt

van de hijt fèl nog dampend

de pan outgepikt

nao de monde gebrocht

en gulzig geslikt.

Ze ète, ze schranze

nao ‘n lutske de pan

toe ‘t zwart van de bojum

zo lig as ‘t mer kan



Mi’n mörke vol koffie

van waot’rige sort

zette d’èters nao d’èrpel

de maoltijd dan vort



Ze ète, jao net

mer dan is ‘t ok gezeed

want al wè ze pruuve

is èrremoei en leed.

My translation into English:

And edgy after praying

to the cross, and then

they start with the sober meal,

straight from the fire,

put on the chunky table,

now the potatoes are suddenly

cursed swiftly,

still steaming from the heat,

picked from the pan,

brought to the mouth,

and swallowed greedily.

They eat, they gorge,

and shortly after there is the black

of the bottom of the pan,

as empty as it can be.



With a mug full of coffee,

of the watery type,

do the eaters continue

with the meal after the potatoes.



They eat, yes just about,

but with that, all is said,

because all they taste

is poverty and distress.

The coffee is probably not real coffee but made from roasted sweet chestnuts [2]. The potatoes are an example of the “alien ingredients” mentioned in [1]: before potatoes were introduced in Europe (16th century), the Dutch recipes, at least, used tubers such as pastinaak (parsnip, which is white, and longer and thicker than a carrot) in the place of potatoes; this is known primarily from the documentation about the Siege of Leiden in 1573-1574 during the Eighty Years’ War. Parsnip has not entirely vanished (parsnip beignets are really tasty), but it now takes up a minimal place in ‘standard’ Dutch cuisine, so that it may be an example of one of those “frozen “cultural” accidents difficult to be overcome in the out-of equilibrium regime” ([1], p7). A standard Dutch dish is the aardappelen-groente-vlees combination, or: boiled potatoes, boiled vegetables, and a piece of meat baked in butter or fat; alternatively, the potatoes and vegetables are cooked together and mashed into a hutspot (= potato+carrot+onion) or boerenkoolstamp (= potato+curly kale). Over the years, pasta entered the menu as well, and food that is primarily a combination of Chinese and Indonesian, but to some extent also Surinamese, has become a regular part of the menu. Thus, pasta and rice took some space previously occupied by potatoes, but potatoes are in no way being marginalised. My guess is that this is because tubers and grains belong to different food groups and are therefore not as easily swappable as a tuber-tuber replacement, such as parsnip → potato [note 1], or a grain-grain replacement, e.g., maize [flour] → wheat [flour]. Simply put: if you grow up on rice or pasta, then you do not easily switch to potatoes, or vice versa.

Perhaps Kinouchi’s copy-mutate algorithm can be rerun taking into account the types of ingredients to see what comes out of it, using variations such as: (1) swap only within the same food group; (2) swap across food groups, picking a random ingredient; and (3) keep the swap within subgroups, such as tuber-carbohydrate-source-staplefood#1 → tuber-carbohydrate-source-staplefood#2 (vs. the more generic ‘carbohydrate source’) and herb#3 → herb#4 (vs. the subsumer ‘condiment’).
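As a starting point for such an experiment, the mutation step could be restricted by a food-group lookup along the following lines. This is only a sketch of the idea: the `FOOD_GROUP` table, the mode names, and the rule that a recipe never contains the same ingredient twice are all hypothetical choices of mine.

```python
import random

# Hypothetical toy classification; a real run would need a proper one.
FOOD_GROUP = {
    "potato": "tuber", "parsnip": "tuber",
    "rice": "grain", "pasta": "grain",
    "parsley": "herb", "thyme": "herb",
}

def swap_candidates(ingredient, mode, groups=FOOD_GROUP):
    """Ingredients allowed to replace `ingredient` under the given mode."""
    own = groups[ingredient]
    if mode == "same_group":   # variation (1)
        return sorted(i for i, g in groups.items()
                      if g == own and i != ingredient)
    if mode == "other_group":  # variation (2)
        return sorted(i for i, g in groups.items() if g != own)
    if mode == "any":          # unconstrained baseline
        return sorted(i for i in groups if i != ingredient)
    raise ValueError(f"unknown mode: {mode}")

def mutate(recipe, mode, rng, groups=FOOD_GROUP):
    """Copy `recipe` and replace one random ingredient, if a swap is allowed."""
    daughter = list(recipe)
    slot = rng.randrange(len(daughter))
    options = [c for c in swap_candidates(daughter[slot], mode, groups)
               if c not in daughter]
    if options:
        daughter[slot] = rng.choice(options)
    return daughter

daughter = mutate(["potato", "thyme"], "same_group", random.Random(1))
```

Variation (3) needs no extra code: passing a finer-grained mapping as `groups` (say, one that separates staple tubers from other tubers) together with the `same_group` mode keeps the swap within subgroups.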

Further, in addition to ingredient substitution-by-import, one also observes recipe import, which faces the task of having to make do with the local ingredients. Chinese excel in this skill: dishes in Chinese restaurants taste different in each country but roughly similar—in Italy, they even split up the meals into primo and secondo piatti. But when substituting original ingredients with the local ingredients that are only approximations of the original ones, how much remains of the recipe so that one still can talk of instantiations of the dishes described by the original recipe and when is it really a new one? What effect do those imported recipes have on local cuisine? Is there experimental data to say that, statistically, one recipe is better “export material” than others are? Are people [from/who visited] some geographic region better at transporting the local recipes and/or their ingredients elsewhere?

It remains to test whether those mutated recipes are still edible. Forced by ‘necessity’, I did ingredient substitution due to recipe import several times (the Italian shops do not have baked beans, no brown sugar, no condensed milk, no ontbijtkoek, no real butter, no pecan nuts, only a few apple varieties, etc…), and for some recipes the substitute was at least as good as the original, but then the substitute approximated the original. I certainly have not dared mashing together cooked pasta+carrot+onion to make an “Italian-style hutspot”, let alone random ingredient substitutions. In case someone has done the latter and it is not only edible but also recommendable, feel free to drop me a line or add the recipe in the comments.

References and notes

1. Kinouchi, O., Diez-Garcia, R.W., Holanda, A.J., Zambianchi, P., and Roque, A.C. The nonequilibrium nature of culinary evolution. ArXiv 0802.4393v1, 29 Feb 2008. Also published in New J. Phys. 10, 073020 (8pp) doi: 10.1088/1367-2630/10/7/073020

2. van Oirschot, A. (ed.). Van water tot wijn, van korsten tot pastijen. Stichting Brabantse Dag, 1979. 124p.

[note 1] There is a difference between root-tubers (such as parsnip) and stem-tubers (such as potato), but functionally they are quite alike, so that for the remainder of the post I will gloss over this minor point. Some basic information can be gleaned from the Wikipedia entry tuber, and more if you search on Pastinaca sativa (parsnip) and Solanum tuberosum (potato), which are not members of the same family; you may also be interested to check other common vegetables and their names in different languages to explore this further.

Enhancing granulation hierarchies

While the paper entitled From granulation hierarchy to granular perspective for this year’s IEEE International Conference on Granular Computing (GrC’09) has been accepted for a while [1], it took some effort to get the colourful sticker (visa) glued into my passport that allows me entry into China, where the conference will be held.

The paper is a shorter, and perhaps also more readable, version of Section 3.3 of my thesis, where the considerations and argumentation of the ontological aspects are mostly left out, so that some explanatory text and the definitions, lemmas, and theorems remain. The aim is to augment so-called granulation hierarchies (those things you get when linking up different levels of granularity, or their data at different levels of detail) with several attributes and a way to unambiguously identify such hierarchies, which I then call granular perspectives.

Here’s the abstract:

It is well-known that one can granulate data and information in multiple ways to generate a plethora of granulation hierarchies each with their levels of granularity. It is left implicit what the characteristics of such hierarchies are, and what consequences they have on levels of granularity. We propose a way to represent such additional information of granulation hierarchies by upgrading them to full granular perspectives and to provide a consistent way to uniquely identify, hence, distinguish, such perspectives based on their semantics by using a criterion for granulation and type of granularity used for granulation. In addition, with the chosen premises, definitions, and proven properties, we demonstrate some consequences for characterising levels of granularity within such granular perspectives.

If the 6 pages do not satisfy your appetite for the topic and you want to read more about properties and the criterion for granulation and see more examples, then Section 3.3 of the thesis will be useful. More consequences of granular perspectives on granular levels can be found in Section 3.4 of the thesis.

References

[1] Keet, C.M. From granulation hierarchy to granular perspective. IEEE International Conference on Granular Computing (GrC’09), Nanchang, China, August 17-19, 2009. IEEE Computer Society, pp .
