Rough Sets Theory workshop in Milan

While the weather outside was exceptionally warm, we stayed inside in a comfortable atmosphere in one of the aulas at the University of Milano-Bicocca, which had organised the first Rough Sets Theory workshop, 25-27 May 2009. The emphasis was on theory: there are many applications of rough sets, but “Even though this attention to application is of great importance, it is not excluded that theoretical aspects concerning with foundations of rough sets, both logical and mathematical, must be taken into account.”

As I’m no expert on rough sets (but there is an interesting relationship between rough sets and granularity, which was the topic of my presentation), the different topics covered by the programme were very interesting to me and gave a useful overview of the range of research topics. As it appears, there is plenty of work still to be done on rough sets theory, even though the basic description of rough sets is elegant and simple, and the ambience provided ample opportunity for exchange of ideas and lively discussions.

Topics ranged from using rough sets with ordinal data and substituting the indistinguishability relation with a dominance relation, presented by Salvatore Greco, to discussions of what the essential ‘ingredients’ of rough sets are, to presentations on definitions of rough sets by Mihir Chakraborty and on the differences between Pawlak rough sets and probabilistic rough sets by Yiyu Yao. For instance, on the latter, Pawlak rough sets consider qualitative aspects and have zero tolerance for errors, whereas probabilistic rough sets are about quantitative aspects and acceptance of error; Yao proposed a solution to deal with both, called decision-theoretic rough sets. Organizer Gianpiero Cattaneo also talked about foundational and mathematical aspects of rough sets, but using a binary relations approach (more detailed information can be found here).
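To make the Pawlak versus probabilistic contrast concrete, here is a minimal sketch in Python. The toy patient table, the attribute names, and the threshold `alpha` are all made up for illustration; the probabilistic part only shows the general idea of tolerating some error through a threshold, and is not Yao's decision-theoretic model.

```python
from collections import defaultdict
from fractions import Fraction

def equivalence_classes(table, attributes):
    """Group object ids that agree on every chosen attribute (indiscernibility)."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row[a] for a in attributes)].add(obj)
    return list(groups.values())

def pawlak_approximations(classes, target):
    """Lower: classes entirely inside target. Upper: classes that touch target."""
    lower, upper = set(), set()
    for c in classes:
        if c <= target:
            lower |= c
        if c & target:
            upper |= c
    return lower, upper

def probabilistic_positive_region(classes, target, alpha):
    """Accept a class when at least a share alpha of its members is in target."""
    region = set()
    for c in classes:
        if Fraction(len(c & target), len(c)) >= alpha:
            region |= c
    return region

# Hypothetical data: five patients described by two symptoms.
table = {
    "p1": {"fever": "yes", "cough": "yes"},
    "p2": {"fever": "yes", "cough": "yes"},
    "p3": {"fever": "yes", "cough": "yes"},
    "p4": {"fever": "no", "cough": "yes"},
    "p5": {"fever": "no", "cough": "no"},
}
flu = {"p1", "p2", "p4"}  # the set to approximate (an assumption of this example)

classes = equivalence_classes(table, ["fever", "cough"])
lower, upper = pawlak_approximations(classes, flu)
tolerant = probabilistic_positive_region(classes, flu, Fraction(2, 3))
```

Here the class {p1, p2, p3} is only two-thirds inside `flu`, so the Pawlak lower approximation rejects it (zero tolerance for error), while the threshold version with `alpha` at 2/3 accepts it. Setting `alpha` to 1 gives back the Pawlak lower approximation.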

Fertile ground for discussion and misunderstandings, due to the different backgrounds and assumptions of the attendees, was the notion of incompleteness. Simply put, given some ‘data table’ (which is not necessarily a database table), there may be null values, but what do they represent? Incomplete information? To make a long story short: it depends on the context (the semantics of the structure you use, and the language). Didier Dubois approached it from, among others, a setting of incomplete information in database integration and considered “ill-known attributes” and “ill-known rough sets” as cases of incomplete information about the data. Ill-known attributes are another rendering of the use, at the intensional level, of a value range for an attribute of a class, such that each object in the class’s extension has exactly one value that falls within the defined range of allowed values. Ill-known rough sets are about the ill-observation of attribute values and the lack of discrimination of the set of attributes, and then there is the issue of “potential similarity”. His proposal is about a covering-based generalisation of rough sets.
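One possible reading of this can be sketched in a few lines of Python. The assumptions are mine, not necessarily Dubois's formalisation: an ill-known attribute value is stored as the set of values it might have, a null is the whole domain, and two objects are "potentially similar" when their value sets overlap on every attribute. All names and data below are hypothetical.

```python
COLOURS = {"red", "blue"}  # hypothetical attribute domain

# Each attribute holds the SET of values the object might have.
# A singleton is a known value; the full domain plays the role of a null.
records = {
    "x": {"colour": {"red"}, "size": {"small"}},
    "y": {"colour": set(COLOURS), "size": {"small"}},  # colour is unknown
    "z": {"colour": {"blue"}, "size": {"small"}},
}

def possibly_similar(a, b):
    """True if some choice of actual values makes a and b indiscernible."""
    return all(bool(a[attr] & b[attr]) for attr in a)

def certainly_similar(a, b):
    """True only if both objects have the same known value on every attribute."""
    return all(len(a[attr]) == 1 and a[attr] == b[attr] for attr in a)

# Neighbourhood of each object under potential similarity.
neighbourhoods = {
    o: {p for p in records if possibly_similar(records[o], records[p])}
    for o in records
}
```

In this toy table x is potentially similar to y, and y to z, but x is not potentially similar to z. The relation is therefore not transitive, and the neighbourhoods overlap instead of partitioning the objects, which is one way to see why a covering-based generalisation comes up.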

I have more notes of the presentations and the panel session, but I’ll leave it at that (for now at least). If you want to know more about these and the other programme topics, I’d suggest you attend next year’s workshop, but also related conferences may be of interest (e.g., RSKT, GrC, RSFDGrC) or, if you would like to see a closer link with fuzzy and with ontologies, then you may be interested in attending the WI-IAT’09 workshop on managing vagueness and uncertainty in the Semantic Web (VUSW’09) on 15-9-’09 in Milan.

Metagenomics updated and slightly upgraded

The Nature TOC-email arrived yesterday, and they have a whole “insight” section on microbial oceanography! Four years ago, Nature Reviews Microbiology had a special issue with a few papers about it, two years ago PLoS Biology presented their Oceanic Metagenomics Collection, and now there is the Nature supplement. Why would a computer scientist like me care? Well, my first study was in microbiology, and microbiologists have scaled things up a lot in the meantime, thereby making computers indispensable in their research. For those unfamiliar with the topic, you can get an idea about early computational issues in my previous post and comments by visitors, but there are new ones that I’ll mention below.

Although the webpage of the supplement says that the editorial is “free access”, it is not (as of 14-5 about 6pm CET and today 9am). None of the 5 papers—1 commentary and 4 review papers—indicate anything about the computational challenges: “the life of diatoms in the world’s oceans”, “microbial community structure and its functional implications”, “the microbial ocean from genomes to biomes”, and “viruses manipulate the marine environment”. Given that DeLong’s paper of 4 years ago [1] interested me, I chose his review paper in this collection [2] to see what advances have been made in the meantime (and that article is freely available).

One of the issues mentioned in 2007 was sequencing and the cleaning up of noisy data in the database, which now seems to be much less of a problem, even largely solved, so the goal posts start moving to issues with the real data analysis. With my computing-glasses on, Box 2 mentions (my emphases are summarised afterwards):

Statistical approaches for the comparison of metagenomic data sets have only recently been applied, so their development is at an early stage. The size of the data sets, their heterogeneity and a lack of standardization for both metadata and gene descriptive data continue to present significant challenges for comparative analyses … It will be interesting to learn the sensitivity limits of such approaches, along more fine-scale taxonomic, spatial and temporal microbial community gradients, for example in the differences between the microbiomes of human individuals. As the availability of data sets and comparable metadata fields continues to improve, quantitative statistical metagenomic comparisons are likely to increase in their utility and resolving power. (p202)

Let me summarise that: DeLong asserts they need to (i) have metadata annotations as a prerequisite for statistical approaches, (ii) deal with temporal data, and (iii) deal with spatial data. There is a lot of research and prototyping going on in topics ii and iii, and there are a few commercial industry-grade plugins, such as the Oracle Cartridge, that do something with the spatial data representation. Perhaps that is not enough or it is not what the users are looking for; if this is the case, maybe they can be a bit more precise on what they want?

Point i is quite interesting, because it basically reiterates that ontologies are a means to an end and asserts that statistics cannot do it with number crunching alone but needs structured qualitative information to obtain better results. The latter is quite a challenge: it is probably technically doable, but there are few people who are well versed in the combination of qualitative and quantitative analysis. Curiously, only MIAME and the MGED website are mentioned for metadata annotation, even though they are limited in scope with respect to the subject domain ontologies and ontology-like artefacts (e.g., the GO, BioPax, KEGG), which are also used for annotation but not mentioned at all. The former deal with sequencing annotation following the methodological aspects of the investigation, whereas the latter type of annotations can be done with domain ontologies, i.e. annotating data with what kind of things you have found (which genes and their function, which metabolism, what role the organism has in the community, etc.), which are also needed to carry out the desired comparative analyses.

There is also more generic hand-waving that something new is needed for data analysis:

The associated bioinformatic analyses are useful for generating new hypotheses, but other methods are required to test and verify in silico hypotheses and conclusions in the real world. It is a long way from simply describing the naturally occurring microbial ‘parts list’ to understanding the functional properties, multi-scalar responses and interdependencies that connect microbial and abiotic ecosystem processes. New methods will be required to expand our understanding of how the microbial parts list ties in with microbial ecosystem dynamics. (p203)

Point taken. And if that was not enough,

Molecular data sets are often gathered in massively parallel ways, but acquiring equivalently dense physiological and biogeochemical process data is not currently as feasible. This ‘impedance mismatch’ (the inability of one system to accommodate input from another system’s output) is one of the larger hurdles that must be overcome in the quest for more realistic integrative analyses that interrelate data sets spanning from genomes to biomes.

I fancy the thought that my granularity might be able to contribute to the solution, but that is not yet anywhere close to user-level software application stage.

At the end of the paper, I am still (as in 2005 and 2007) left with the impression that the whole metagenomics endeavour generates more data than there are computational tools to analyse it and squeeze out all the information that is ‘locked up’ in the pile of data.

References

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.

[2] DeLong, E.F. The microbial ocean from genomes to biomes. Nature, 2009, 459: 200-206.

Topically-scaled maps of countries in the world

Well, cartograms computed thanks to an algorithm based on diffusion taken from elementary physics.


Worldmapper's cartograms of internet users in 1990 and 2002

People tend to like graphical representations of data that factor out variations in population density and at the same time show how many cases occur in each region. There are several approaches, and one of the finer ones was discovered and developed by Gastner and Newman [1], who informally introduce the idea as “On a true population cartogram the population is necessarily uniform: once the areas of regions have been scaled to be proportional to their population then, by definition, population density is the same everywhere… Thus, one way to create a cartogram given a particular population density, is to allow population somehow to “flow away” from high-density areas into low-density ones, until the density is equalized everywhere. There is an obvious candidate process that achieves this, the linear diffusion process of elementary physics [12], and this is the basis of our method.” In addition, they add notions of boundaries of regions (see the paper for details). To compute such maps quickly (seconds to a few minutes), they solve the equation in Fourier space. Then, to actually get any interesting results, one has to set the values for the starting density, choose some grain size for the regions, and pick a topic of preference for which there are sufficient and reliable data points.
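The “flow away until the density is equalized” idea can be illustrated with a deliberately simplified sketch. This is not Gastner and Newman's implementation: it uses a plain explicit finite-difference scheme on a one-dimensional strip of cells rather than their Fourier-space solution, and it only evolves the density; the real method also tracks how points on the map are carried along by the flow, which is what actually reshapes the regions. The function name, the `rate` and `steps` parameters, and the starting densities are all invented for the example.

```python
def diffuse(density, rate=0.25, steps=2000):
    """Linear diffusion on a 1D strip of cells with no flow across the two ends.

    Each step moves population from denser cells to their sparser neighbours.
    rate must stay at or below 0.5 for this explicit scheme to remain stable.
    """
    rho = list(density)
    n = len(rho)
    for _ in range(steps):
        new = rho[:]
        for i in range(n):
            # At the ends, mirror the cell itself so nothing leaks out.
            left = rho[i - 1] if i > 0 else rho[i]
            right = rho[i + 1] if i < n - 1 else rho[i]
            new[i] = rho[i] + rate * (left - 2 * rho[i] + right)
        rho = new
    return rho

# Hypothetical starting densities: one crowded cell next to sparse ones.
density = [10.0, 1.0, 1.0, 1.0, 2.0]
equalized = diffuse(density)
mean_density = sum(density) / len(density)
```

After enough steps every cell ends up at (numerically very close to) the mean density, while the total population stays the same, which is the defining property of a population cartogram in the quote above.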

For the myriad of examples freely online at Worldmapper, which were produced using this approach, they took the world and its countries to create almost 600 different maps, each annotated with contextual information. Examples include the proportion of all scientific papers published in 2001 written by authors living in that country, arms exports, and influenza outbreaks between 2000 and 2005. The picture at the start of this post shows screenshots of internet users in 1990 (3 million users) and in 2002 (631 million), which have changed quite remarkably in both absolute numbers and proportions. Most data sources used for the maps are from organisations such as UNDP, WHO, and the World Bank.

The lower numbers in the list of cartograms are about economic indicators, environment, population, disasters, destruction and so forth, whereas in the higher numbers there are all sorts of causes of deaths and religions. See the thematic index for a more targeted exploration or the thumbnails for quick impressions.

The first few maps, e.g. on population, are interactive now, so one can play with zooming in and out. Creating maps at home is apparently possible, too (I haven’t tried that yet); the software is downloadable from Newman’s cart page.

References

[1] Gastner, M.T. and Newman, M.E.J. Diffusion-based method for producing density-equalizing maps. Proc. Natl. Acad. Sci., 2004, 101: 7499-7504.

A CS department is not a cheap software house

Computer Science (or if you prefer: Computing Science, Informatics, Computing, Computer Engineering, or similar) has suffered from an image problem for quite some time on several fronts. One of them is that non-informaticians seem to think that a computer science department is there as a ‘science-supporting’ or ‘facilitating’ department. In principle, it is not (although given certain circumstances, that is what it ends up being in some cases).

On the comical side, there are t-shirts with the slogan “no, I will not fix your computer”. One BSc graduate here at UniBz actually used that phrase during her speech at the degree ceremony last December, which generated laughs from the audience, but they probably did not give it a second thought. In addition to the philosophy of CS, about which I wrote earlier, and an unambiguous paper about paradigms in computer science [1], there is another source I recommend people have a look at. It is the final report of the Task Force on the Core of Computer Science [2], which “presents a new intellectual framework for the discipline of computing and a new basis for computing curricula.” It (i) outlines the paradigms for the discipline, (ii) has a long and a short description of computing, the short one being:

The discipline of computing is the systematic study of algorithmic processes that describe and transform information: their theory, analysis, design, efficiency, implementation, and application. The fundamental question underlying all of computing is, “What can be (efficiently) automated?”

and (iii) presents a matrix with three columns “Theory”, “Abstraction” and “Design” as complementary aspects of computing and 9 rows for the principal sub-disciplines, such as HCI, databases and information retrieval, operating systems, and architecture. Its 7-page appendix lists the suggested contents for those 27 cells in the matrix. And no, there is no ‘you should become a software house’, not even close.

So, from where could non-informatics people have gotten the idea of CS as a supporting, facilitating discipline and software house? Perhaps Dijkstra’s [3] famous complaint gives a clue:

So, if I look into my foggy crystal ball at the future of computing science education, I overwhelmingly see the depressing picture of ‘‘Business as usual’’. The universities will continue to lack the courage to teach hard science, they will continue to misguide the students, and each next stage of infantilization of the curriculum will be hailed as educational progress.

Since that has occurred in multiple CS curricula, perhaps CS curriculum developers are to blame, or rather their bending over to meet outside demands: to produce graduates with short-lived skills that are instantly usable in industry, and to improve the throughput statistics of students graduating from the programme within the nominal time of study. Has anyone done a proper study of that, or are these just commonly held assumptions and excuses?

A wholly different argument that CS departments are not there ‘at your service’ has been put forward by Dieter Fensel and Dieter Wolf [4], which I wrote about before here. They claim that computer science will become the foundation of the sciences because it has information processing and knowledge management (and goal-oriented services) at its core. From their perspective, given that physics, biology and neuroscience deal with specific types or sections of information and knowledge management, these disciplines are (or become) branches of computer science. Just think of it for a moment. We would have, say, genetics as applied computer science, but not computer science as a service and facilitator for genetics. No ‘blue-collar bioinformatician’, but a ‘blue-collar wet-lab geneticist’ collecting the raw data. Unsurprisingly, during the workshop where Fensel presented his proposal, the philosophers were not charmed by that view. And I can imagine geneticists will not be pleased with inverting the cores and corresponding roles either: obviously, genetics is a science, but so is computing. Dijkstra [3] gives an only slightly more modest view of the future of computer science than Fensel and Wolf did:

In the long run I expect computing science to transcend its parent disciplines, mathematics and logic, by effectively realizing a significant part of Leibniz’s Dream of providing symbolic calculation as an alternative to human reasoning. (Please note the difference between “mimicking” and “providing an alternative to”: alternatives are allowed to be better.)


We’re not some underpaid service-oriented software house; if you want industry-grade software and customer service, you’ll have to fork out the money and, well, ask industry to develop the software. Having said that, I’ll admit that CS departments, in general, should improve on respecting and valuing domain experts. For instance, one CS professor thinks of possible collaboration as “their benefit is that we can play with their data” or, phrased differently: we get something out of it, but the user should only put effort and resources into the endeavour and not expect anything in return. I do not think it is realistic to expect that domain experts are, or want to be, that philanthropic in a collaboration; collaborations ought to be mutually beneficial. A ‘use and throw away’ attitude might achieve short-term gains on the CS-side, but such a win-lose approach is not sustainable in the long run. Après moi, le déluge, an older and tenured CS prof might think, but (i) the younger generations cannot afford such ‘luxuries’ for their horizons reach farther (time-wise, at least, and possibly also with respect to career aims), and (ii) if producing software is part of the research task, then doing the work up to a working prototype should be part of a researcher’s honesty and science ethics anyway, in particular when papers are published about it.

Maybe CS should knock on the doors of a marketing company and ask for brand positioning services so that CS-offers can be harmonised better with non-informatician demands and expectations.

[1] Amnon H. Eden. Three paradigms of computer science. Minds & Machines, 2007, 17: 135-167.

[2] Peter J. Denning, Douglas E. Comer, David Gries, Michael C. Mulder, Allen Tucker, A. Joe Turner, and Paul R. Young. Computing as a Discipline. Communications of the ACM, 1989, 32(1): 9-23.

[3] Dijkstra, E.W. (1988). On the cruelty of really teaching computing science. Unpublished manuscript EWD 1036.

[4] Fensel, D., Wolf, D. The Scientific Role of Computer Science in the 21st Century. Third International Workshop on Philosophy and Informatics (WSPI06), Saarbrücken, 3-4 May 2006. pp33-46.