The Nature TOC-email arrived yesterday, and they have a whole “insight” section on microbial oceanography! Four years ago, Nature Reviews Microbiology had a special issue with a few papers about it, two years ago PLoS Biology presented their Oceanic Metagenomics Collection, and now then the Nature supplement. Why would a computer scientist like me care? Well, my first study was in microbiology, and they have scaled up things a lot in the meantime, thereby making computers indispensible in their research. For those unfamiliar with the topic, you can get an idea about early computational issues in my previous post and comments by visitors, but there are new ones that I’ll mention below.
Although the webpage of the supplement says that the editorial is “free access”, it is not (as of 14-5 about 6pm CET and today 9am). None of the 5 papers—1 commentary and 4 review papers—indicate anything about the computational challenges: “the life of diatoms in the world’s oceans”, “microbial community structure and its functional implications”, “the microbial ocean from genomes to biomes”, and “viruses manipulate the marine environment”. Given that DeLong’s paper of 4 years ago  interested me, I chose his review paper in this collection  to see what advances have been made in the meantime (and that article is freely available).
One of the issues mentioned in 2007 was the sequencing and cleaning up noisy data in the database, which now seems to be much less of a problem, even largely solved, so the goal posts start moving to issues with the real data analysis. With my computing-glasses on, Box 2 mentions (my emphases underlined and summarised afterwards):
Statistical approaches for the comparison of metagenomic data sets have only recently been applied, so their development is at an early stage. The size of the data sets, their heterogeneity and a lack of standardization for both metadata and gene descriptive data continue to present significant challenges for comparative analyses … It will be interesting to learn the sensitivity limits of such approaches, along more fine-scale taxonomic, spatial and temporal microbial community gradients, for example in the differences between the microbiomes of human individuals44. As the availability of data sets and comparable metadata fields continues to improve, quantitative statistical metagenomic comparisons are likely to increase in their utility and resolving power. (p202)
Let me summarise that: DeLong asserts they need (i) metadata annotations as a prerequisite for statistical approaches; (ii) deal with temporal data, and (iii) deal with spatial data. There is a lot of research and prototyping going on in topics ii and iii, and there are few commercial industry-grade plugins, such as the Oracle Cartridge, that do something with the spatial data representation. Perhaps that is not enough or it is not what the users are looking for; if this is the case, maybe they can be a bit more precise on what they want?
Point i is quite interesting, because it basically reiterates that ontologies are a means to an end and asserts that statistics cannot do it with number crunching alone but needs structured qualitative information to obtain better results. The latter is quite a challenge—probably technically doable, but there are few people who are well versed in the combination of qualitative and quantitative analysis. Curiously, only MIAME and the MGED website are mentioned for metadata annotation, even though they are limited in scope with respect to the subject domain ontologies and ontology-like artefacts (e.g., the GO, BioPax, KEGG), which are also used for annotation but not mentioned at all. The former deal with sequencing annotation following the methodological aspects of the investigation, whereas the latter type of annotations can be done with domain ontologies, i.e. annotating data with what kind of things you have found (which genes and their function, which metabolism, what role does the organism have in the community etc.) that are also need to carry out the desired comparative analyses.
There is also more generic hand-waiving that something new is needed for data analysis:
The associated bioinformatic analyses are useful for generating new hypotheses, but other methods are required to test and verify in silico hypotheses and conclusions in the real world. It is a long way from simply describing the naturally occurring microbial ‘parts list’ to understanding the functional properties, multi-scalar responses and interdependencies that connect microbial and abiotic ecosystem processes. New methods will be required to expand our understanding of how the microbial parts list ties in with microbial ecosystem dynamics. (p203)
Point taken. And if that was not enough,
Molecular data sets are often gathered in massively parallel ways, but acquiring equivalently dense physiological and biogeochemical process data54 is not currently as feasible. This ‘impedance mismatch’ (the inability of one system to accommodate input from another system’s output) is one of the larger hurdles that must be overcome in the quest for more realistic integrative analyses that interrelate data sets spanning from genomes to biomes.
I fancy the thought that my granularity might be able to contribute to the solution, but that is not yet anywhere close to user-level software application stage.
At the end of the paper, I am still—as in 2005 and 2007—left with the impression that more data is being generated in the whole metagenomics endeavour than that there are computational tools to analyse them to squeeze out all the information that is ‘locked up’ in the pile of data.
 DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.
 DeLong, E.F. The microbial ocean from genomes to biomes. Nature, 2009, 459: 200-206.
It’s interesting but also in a way, very clear that automated statistical analysis requires metadata: otherwise, what is the meaning of the analysis? What guarantee is there that the right variables were correctly instantiated?
I think it wouldn’t be too hard to get a paper together marrying biostatistics with metadata.
“microbiomes of human individuals”
I think one of the most useful things at this point would be to specify/precisely define some of these vaguely defined expressions.
(although not necessarily these two).
I fully agree that statistical analysis needs metadata, but it I found it curious that it still received little attention in DeLong’s paper.
Depending on how one wants to marry biostatistics and metadata, it is, or is not, too hard. Some low-hanging fruits were picked by Zhou et al , which does demonstrate one can gain by combining the two (in casu, the Gene Ontology + data mining), whereas, a.o., Agnieszka Ławrynowicz is doing her PhD on this topic and works on more challenging issues (e.g., rules, logic-based ontologies).
 Y Zhou, JA Young, A Santrosyan, K Chen, SF Yan, and EA Winzeler. In silico gene function prediction using ontology-based pattern identification. Bioinformatics, 21(7):1237–1245, 2005.