A novel abundance-based algorithm for binning metagenomic sequences using l-tuples

Yu Wei Wu, Yuzhen Ye

Research output: Contribution to journal (Article)

79 Citations (Scopus)

Abstract

Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin applies the Lander-Waterman model to metagenomics, using the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances, as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin and a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
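
The abstract and keywords point to a Poisson model of l-tuple counts fitted with an EM algorithm. The sketch below illustrates that general idea only and is not the authors' implementation: it counts l-tuples across reads and fits a plain mixture of Poisson distributions to the counts, so that each component stands for one abundance level. The function names (`count_ltuples`, `fit_poisson_mixture`), the fixed iteration count, and the initial rates are assumptions made for illustration. The sketch also ignores that observed counts are never zero, and it does not assign reads to bins.

```python
import math
from collections import Counter


def count_ltuples(reads, l):
    """Count every length-l substring (l-tuple) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - l + 1):
            counts[read[i:i + l]] += 1
    return counts


def poisson_log_pmf(n, lam):
    """Log probability of observing count n under Poisson(lam)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def fit_poisson_mixture(values, initial_lambdas, n_iter=100):
    """Fit a mixture of Poissons to l-tuple counts with EM.

    Hypothetical helper: returns (weights, lambdas, responsibilities),
    where responsibilities[i][j] is the posterior probability that
    count i was generated by component j.
    """
    k = len(initial_lambdas)
    lambdas = list(initial_lambdas)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior of each component for each count (log-space).
        resp = []
        for n in values:
            logs = [math.log(weights[j]) + poisson_log_pmf(n, lambdas[j])
                    for j in range(k)]
            top = max(logs)
            exps = [math.exp(x - top) for x in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate mixing weights and Poisson rates.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            weighted_sum = sum(r[j] * n for r, n in zip(resp, values))
            weights[j] = max(mass / len(values), 1e-12)
            lambdas[j] = max(weighted_sum / max(mass, 1e-12), 1e-12)
    return weights, lambdas, resp
```

In use, one would pass `list(count_ltuples(reads, l).values())` as `values` together with one initial rate per expected abundance level. The fitted rates then serve as rough per-bin coverage estimates under the simplifications stated above.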

Original language: English
Pages (from-to): 523-534
Number of pages: 12
Journal: Journal of Computational Biology
Volume: 18
Issue number: 3
DOI: 10.1089/cmb.2010.0245
Publication status: Published - Mar 1 2011
Externally published: Yes

Keywords

  • Binning
  • EM algorithm
  • metagenomics
  • Poisson distribution

ASJC Scopus subject areas

  • Molecular Biology
  • Genetics
  • Computational Mathematics
  • Modelling and Simulation
  • Computational Theory and Mathematics

Cite this

A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. / Wu, Yu Wei; Ye, Yuzhen.

In: Journal of Computational Biology, Vol. 18, No. 3, 01.03.2011, p. 523-534.

@article{0a6ecbb178624d53a2680598233d3213,
title = "A novel abundance-based algorithm for binning metagenomic sequences using l-tuples",
abstract = "Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin applies the Lander-Waterman model to metagenomics, using the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances, as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin and a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.",
keywords = "Binning, EM algorithm, metagenomics, Poisson distribution",
author = "Wu, {Yu Wei} and Ye, Yuzhen",
year = "2011",
month = "3",
day = "1",
doi = "10.1089/cmb.2010.0245",
language = "English",
volume = "18",
pages = "523--534",
journal = "Journal of Computational Biology",
issn = "1066-5277",
publisher = "Mary Ann Liebert Inc.",
number = "3",
}

TY - JOUR

T1 - A novel abundance-based algorithm for binning metagenomic sequences using l-tuples

AU - Wu, Yu Wei

AU - Ye, Yuzhen

PY - 2011/3/1

Y1 - 2011/3/1

AB - Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin applies the Lander-Waterman model to metagenomics, using the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimates of species abundances, as well as their genome sizes, two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequences are very short (e.g., 75 bp) or contain sequencing errors. By combining AbundanceBin and a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.

KW - Binning

KW - EM algorithm

KW - metagenomics

KW - Poisson distribution

UR - http://www.scopus.com/inward/record.url?scp=79952425617&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952425617&partnerID=8YFLogxK

U2 - 10.1089/cmb.2010.0245

DO - 10.1089/cmb.2010.0245

M3 - Article

C2 - 21385052

AN - SCOPUS:79952425617

VL - 18

SP - 523

EP - 534

JO - Journal of Computational Biology

JF - Journal of Computational Biology

SN - 1066-5277

IS - 3

ER -