Metagenomics, or: more problems to solve by bioinformaticians!

Nature Reviews Microbiology had its special issue on metagenomics in 2005, followed shortly afterward by one on the closely related topic of horizontal gene transfer. Now it is PLoS Biology’s turn, with several articles on advances in studying microbial communities in the ocean as part of their Oceanic Metagenomics collection. Metagenomics is not, in theory, limited to microbes, but that is where the research focus is now (e.g. [1][2][3]), because scaling up genomics research isn’t easy or cheap, and think of all the data that needs to be stored, processed, and analysed.

For the non-biologist reader in 3 sentences (or see the synopsis [4]): metagenomics, or ‘high-throughput molecular ecology’ (also called community genomics, ecogenomics, environmental genomics, or population genomics), combines molecular biology with the study of ecosystems. It reveals community- and population-specific metabolisms, along with the interdependent biological behaviour of organisms in nature, which is affected by their micro-climate. Take a handful of soil (ocean water, mud, …) and figure out which microorganisms live there, who’s active (and what are they doing?), who’s dormant, what the ratios of the population sizes of the different types of microorganisms are, what a microbial community ‘looks’ like, etc.

For the data-enthusiast: all those individual microorganisms need to have their DNA and RNA sequenced, and the results, of course, go into databases. And then the analysis: putting the pieces from shotgun sequencing back together, comparing DNA with DNA and rRNA with rRNA, deciding how to do the binning (grouping sequence fragments by their likely organism of origin), and so forth [5]. Naively: more and faster algorithms wouldn’t hurt; and how can you visualize a community of microorganisms on your screen, and make simulations of those bacterial communities?
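To give a flavour of what binning involves, here is a minimal sketch of composition-based binning: each fragment is summarised by its k-mer frequencies, and fragments with similar profiles are grouped together. This is a toy illustration, not the method of any particular tool or of the articles cited here; the function names, the greedy grouping strategy, the dinucleotide word size, and the 0.8 similarity threshold are all assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Assumption: dinucleotides (k=2) keep this toy example small;
# longer words such as tetranucleotides (k=4) are a common choice in practice.
K = 2
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Return the normalised k-mer frequency vector of a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only count k-mers made of A, C, G, T (ambiguous bases are ignored).
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def greedy_bin(seqs, threshold=0.8):
    """Put each sequence in the first bin whose seed profile is similar
    enough; otherwise open a new bin seeded by that sequence.
    Returns a list of bins, each a list of indices into seqs."""
    bins = []  # list of (seed_profile, member_indices)
    for idx, seq in enumerate(seqs):
        profile = kmer_profile(seq)
        for seed, members in bins:
            if cosine(profile, seed) >= threshold:
                members.append(idx)
                break
        else:
            bins.append((profile, [idx]))
    return [members for _, members in bins]


if __name__ == "__main__":
    fragments = ["ATATATATATATAT", "GCGCGCGCGCGC",
                 "TATATATATATA", "CGCGCGCGCGCG"]
    print(greedy_bin(fragments))  # → [[0, 2], [1, 3]]
```

Real fragments are far noisier than these made-up ones, and the result of a greedy pass depends on the input order, so this only conveys the idea of grouping by sequence composition.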

And then, somewhere in this whole endeavor, bio-ontologists should be able to find their place, to help out (and figure out) how to best represent all the new information in a usable and reusable way. Because metagenomics is a hot topic with much research and novel results, ontology maintenance (tracking changes etc.) will likely become more important than the attention it currently receives in ontology development environments (ODEs) suggests, as will reasoning over ontologies and massive amounts of data. Ouch. Some work has been and is being done on these topics (e.g. [6] [7]), and more can/will/does/should follow.

[1] DeLong, E.F. Microbial community genomics in the ocean. Nature Reviews Microbiology, 2005, 3:459-469.
[2] Lorenz, P., Eck, J. Metagenomics and industrial applications. Nature Reviews Microbiology, 2005, 3:510-516.
[3] Schleper, C., Jurgens, G., Jonuscheit, M. Genomic studies of uncultivated Archaea. Nature Reviews Microbiology, 2005, 3:479-488.
[4] Gross, L. Untapped Bounty: Sampling the Seas to Survey Microbial Biodiversity. PLoS Biology, 2007, 5(3): e85.
[5] Eisen, J.A. Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biology, 2007, 5(3): e82.
[6] Klein, M., Noy, N.F. A Component-Based Framework for Ontology Evolution. Workshop on Ontologies and Distributed Systems at IJCAI-2003, Acapulco, Mexico, 2003.
[7] Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rosati, R. Linking data to ontologies: The description logic DL-Lite A. Proc. of the 2nd Workshop on OWL: Experiences and Directions (OWLED 2006), 2006.


8 responses to “Metagenomics, or: more problems to solve by bioinformaticians!”

  1. Hi, I came across this as well, and was equally excited about all of the questions it raises. It made me want to drop all of this semantic web business and get back into bioinformatics.. One thing that really struck me was their claim to have identified more than double the number of unique protein coding sequences that exist in the world’s primary repositories (UniProt, Genbank). That, coupled with the bit about Genbank removing the Sargasso sequences from their Blast searches because they were coming up too often?? in searches clearly indicates that Genbank is just not ready to handle the data that the world is producing. I’m really curious how their own database system stacks up to the challenge – will it eventually become the primary place people go to characterize their sequences? Will the semantic web offer anything to this project (already rich in metadata BTW) ?

  2. >>was their claim to have identified more than double the number of unique protein coding sequences that exist in the world’s primary repositories

    It is a lot, but on the other hand, there are so many unculturable bacteria that are ‘uncharted territory’ as to who they are and what they do (like those at the deep sea vents), that it would not even surprise me if/when many more new protein coding sequences are found using the high-throughput approach of metagenomics. It is estimated that most bacteria are not culturable in the lab (or some are after all, but we don’t know yet how). That’s the exciting thing with metagenomics: what we now know about, e.g. ‘high prevalence of proteins of type x across domain y’, may be utterly wrong, or may need to be adjusted and rephrased (the latter already happened when I did my thesis on molecular ecology, prefixing sentences with “among the culturable bacteria living in soil of type x…”). On the other hand, there are only so many combinations of amino acids + evolution.

    >>Genbank removing the Sargasso sequences from their Blast searches because they were coming up too often??

    I hadn’t heard about that yet. Surely, there are more reasons to do that than just being often in the query answer?

    >>I’m really curious how their own database system stacks up to the challenge

    Yes. There are several techniques for detecting noisy data, inconsistent data, database repair, doing integrity constraint checking upon updates, and whatnot (maybe I should know more about this…).

    >>Will the semantic web offer anything to this project?

    With my bias geared toward ontologies and your Semantic Web efforts not being a waste of time ;-), I’d like to see some formal ontology/ies of what is being discovered and how that (mis)matches with existing ontologies (or that are in the process of being built). So we could throw in ontology development, maintenance, comparing ontologies of the ‘known’ info and ‘new metagenomics-obtained’ info, linking and integrating those ontologies, link ontologies to the data and check if there are instances for all types in the ontology, find some inconsistencies either within the ontology or in the data or in the connection of the two, do a couple of path queries; in short: a good bit of automated reasoning. And all this data & ontologies would not reside in one place in one database as there’s so much metagenomics data added to the already distributed nature of bio-data, so we could add some peer-to-peer database management and so-called meaning negotiation mediated by ontologies. In addition, we would need to start coupling data & knowledge at the molecular level of granularity with that of micro-environment and (microbial) ecosystems at least at population and community level, or: a “vertical integration” across levels of granularity. And then some. To me, this whole list sounds like it fits quite well under the SW banner.

  3. On the semantic web applicability front..

    While I see the ~potential sensibility of things like distributed, semantically integrated data, the reality of the tools that I’ve been exposed to so far is that they are nowhere near ready to deal with anything even a hundredth the scale of what we are talking about. Case in point that I’m currently suffering with – reasoning with qualified cardinality restrictions using Pellet 1.4… 5 minutes to classify an ontology of about 50 concepts -> out of memory error (with 1 GB allocated) with the addition of 1 owl:Individual.. I’m not picking on Pellet, I’m sure this particular function will be optimized soon, but it is characteristic of essentially all of the technologies I’ve had the pleasure of using that claim to support the semantic web. Right now, taking a project down the semantic web path means – a) dramatically slower development time b) dramatically slower database responses c) many unexpected technical pitfalls not present in anything with more history. All of this must be accepted based only on the vague hope that somehow enough other people will be willing to follow the same path that the advantages in interoperability and integration will ~eventually be seen.

    While the whole idea of a data warehouse like their CAMERA system is intuitively displeasing to me (I like the Web.., I like decentralized control, I like graceful degradation etc..) it seems that it may be the only realistic way to handle meta-genomics scale data.

    Note – the bit about the Sargasso sequences being removed from the standard BLAST searches is from the Venter video associated with the PLoS issue in question.

    People were probably complaining because they wanted data from sequences that were better characterized than the Sargasso seqs (completely automated). It seems a better option in that case would be to allow a limit on the query to manually-curated records or something – like the GO evidence codes.

  4. Uhmm, yes, the scalability thing with the Semantic Web…
    It has been observed and noted before, and all I can say is that the Bolzano & Rome research groups are actively working on it with DL-Lite, which is less expressive than OWL but should scale better, because reasoning in the DL-Lite languages is only polynomial in complexity (see ref 7 in the blog entry). Work on tools for implementation is in the pipeline, as well as a mapping to OWL, so that at some point (yeah, it’s not all there yet…) you can, e.g., specify your ontology in a relatively expressive ontology language, and then, when it is linked to lots of data, get a lite version of the ontology to enhance performance of reasoning (including querying).
    While it may be annoying there is not much history to rely on with implementations of Semantic Web technologies, maybe one can consider it exciting to use “the latest of the latest”, being on the forefront with early adopters and all that??

    From the CAMERA article, I read that it is not a warehouse but a grid, and that decentralized control is a bit of an issue due to the Convention on Biological Diversity. So, if the legal hurdles are dealt with, it should be possible to make it more ‘open’. What I did find odd is that they mention ontologies only in the outreach and training section, instead of being part and parcel of the informatics infrastructure.

