The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used in pre-studies or two-tiered large-scale studies. However, in contrast to shotgun sequencing approaches, 16S data alone provide no direct view of the functional capabilities of the community.
Tax4Fun is an open-source R package that predicts the functional capabilities
of microbial communities based on 16S datasets.
Tax4Fun can be applied to the output of the SILVAngs web server or of QIIME (Caporaso et al., 2010) run against the SILVA database (Quast et al., 2013).
Further, the Tax4Fun package implements the MoP-Pro approach for whole metagenome shotgun sequencing data (Aßhauer and Meinicke, 2013). MoP-Pro implements a shortcut to estimate the metabolic profile of a metagenome. The taxonomic profile of the metagenome is linked to a set of pre-computed metabolic reference profiles. The combination of the taxonomic abundance estimates, obtained through the fast method Taxy-Pro (Klingenberg et al., 2013), and the metabolic reference profiles, based on the KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2014), achieves an unrivaled speed of the metabolic profiling approach.
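The linking of a taxonomic profile to pre-computed reference profiles can be sketched as an abundance-weighted mixture. This is a minimal illustration, not the package's implementation: the linear mixing, the function name `predict_functional_profile`, and the organism and KEGG Ortholog identifiers are assumptions made for the example.

```python
def predict_functional_profile(taxon_abundance, reference_profiles):
    """Mix per-organism KO profiles, weighted by taxonomic abundance.

    Assumption: the community profile is a linear mixture of the
    reference profiles (hypothetical simplification for illustration).
    """
    mixed = {}
    for organism, weight in taxon_abundance.items():
        profile = reference_profiles.get(organism)
        if profile is None:
            continue  # organisms without a reference profile contribute nothing
        for ko, frequency in profile.items():
            mixed[ko] = mixed.get(ko, 0.0) + weight * frequency
    total = sum(mixed.values())
    if total == 0:
        return {}
    # Renormalize so that the predicted profile sums to 1.
    return {ko: value / total for ko, value in mixed.items()}


# Hypothetical inputs: two organisms and two KEGG Orthologs.
taxon_abundance = {"org_a": 0.75, "org_b": 0.25}
reference_profiles = {
    "org_a": {"K00001": 0.5, "K00002": 0.5},
    "org_b": {"K00002": 1.0},
}
profile = predict_functional_profile(taxon_abundance, reference_profiles)
```

Because the reference profiles are computed once in advance, a prediction of this kind only needs the taxonomic abundance estimates at run time.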
The association matrix was built from a BLASTN analysis in which we extracted the 16S rRNA gene sequences of all prokaryotic KEGG organisms and searched them against the SILVA SSU Ref NR database. For the assignment, we require sufficient sequence similarity, defined by a threshold on the BLAST bitscore (> 1500). If K different KEGG organisms simultaneously show significant hits for a SILVA 16S sequence, each corresponding entry in the association matrix is initialized with 1/K.
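The 1/K initialization can be sketched as follows. The input layout (tuples of KEGG organism, SILVA sequence identifier, and bitscore), the function name, and the example identifiers are assumptions for illustration; only the bitscore threshold of 1500 and the 1/K rule come from the description above.

```python
from collections import defaultdict

BITSCORE_THRESHOLD = 1500  # threshold stated in the text


def build_association_matrix(blast_hits, threshold=BITSCORE_THRESHOLD):
    """Build a sparse SILVA-to-KEGG association matrix.

    blast_hits: iterable of (kegg_organism, silva_id, bitscore) tuples
    (hypothetical layout, not an actual BLAST output format).
    """
    organisms_per_silva = defaultdict(set)
    for kegg_organism, silva_id, bitscore in blast_hits:
        if bitscore > threshold:  # keep only sufficiently similar hits
            organisms_per_silva[silva_id].add(kegg_organism)

    matrix = {}
    for silva_id, organisms in organisms_per_silva.items():
        weight = 1.0 / len(organisms)  # K organisms share the entry equally
        matrix[silva_id] = {org: weight for org in sorted(organisms)}
    return matrix


# Illustrative hits; identifiers and scores are made up.
hits = [
    ("org_a", "silva_1", 2700.0),
    ("org_b", "silva_1", 2650.0),
    ("org_c", "silva_1", 900.0),   # below the threshold, ignored
    ("org_c", "silva_2", 2800.0),
]
association = build_association_matrix(hits)
```

In this example `silva_1` is associated with two organisms at weight 0.5 each, while `silva_2` maps to a single organism with weight 1.0.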
At present, two association matrices are available for Tax4Fun: one based on SILVA SSU Ref NR database release 115 (August 2013) and KEGG database release 64.0 (October 2012), and one based on SILVA SSU Ref NR database release 119 (July 2014) and KEGG database release 64.0 (October 2012).
Organism-specific functional profiles are computed for all bacterial and archaeal genomes in KEGG (Release 64.0) (Kanehisa and Goto, 2000; Kanehisa et al., 2014). The genomes were downloaded and subsequently fragmented into overlapping reads simulating a two-fold coverage of the genomes, as previously described (Klingenberg et al., 2013). To take different read lengths into account, we generated overlapping reads of length 400 bp with 200 bp overlap for long read data and of length 100 bp with 50 bp overlap for short read data.
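The fragmentation scheme can be sketched with a sliding window whose step equals the read length minus the overlap, so that every interior base falls into two reads. The function name and the toy sequence are assumptions for illustration; the handling of trailing bases at the genome end is a simplification and may differ from the original procedure.

```python
def fragment_genome(sequence, read_length, overlap):
    """Cut a sequence into overlapping reads of fixed length.

    With overlap = read_length / 2, interior bases are covered twice,
    which corresponds to a two-fold coverage.
    """
    step = read_length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than read_length")
    reads = []
    # Simplification: trailing bases that do not fill a full read are dropped.
    for start in range(0, len(sequence) - read_length + 1, step):
        reads.append(sequence[start:start + read_length])
    return reads


genome = "ACGT" * 250  # toy sequence of 1000 bp, not a real genome

long_reads = fragment_genome(genome, read_length=400, overlap=200)
short_reads = fragment_genome(genome, read_length=100, overlap=50)
```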
The organism-specific reference profiles are computed with the same method as
implemented in the CoMet-Universe web server or Taxy-Pro (Klingenberg et al., 2013)
but here we use KEGG Orthologs instead of Pfam protein domains.
For the direct computation of metagenomic functional profiles in our comparative evaluation
we utilized the ultrafast protein classification (UProC) tool (Meinicke, 2014) and protein alignment using a DNA aligner (PAUDA) (Huson and Xie, 2014) to rapidly assign
metagenomic reads to the KEGG Orthologs of bacterial and archaeal origin.
The UProC protein classification tool was executed in short read mode for the simulated short read data and in long read mode otherwise.
The PAUDA homology search was performed in --fast mode with default parameters. In the case of multiple matches, only the best hit is considered.
Tax4Fun and PICRUSt were applied to a range of paired metagenome/16S datasets that have also been used in the original PICRUSt study.
For each paired dataset, the Spearman correlation of the directly computed and the 16S-predicted KEGG Ortholog profile was calculated.
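As a minimal sketch of this comparison, the Spearman correlation between two KEGG Ortholog profiles can be computed as the Pearson correlation of their ranks. The profile layout (dictionaries keyed by KO identifier), the treatment of missing KOs as zero, and the example values are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0  # average of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for constant profiles
    return cov / (var_x * var_y) ** 0.5


def spearman(profile_a, profile_b):
    """Spearman correlation over the union of KOs; missing KOs count as 0."""
    kos = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(ko, 0.0) for ko in kos]
    b = [profile_b.get(ko, 0.0) for ko in kos]
    return pearson(average_ranks(a), average_ranks(b))


# Made-up profiles for illustration only.
direct = {"K00001": 0.4, "K00002": 0.3, "K00003": 0.2, "K00004": 0.1}
predicted = {"K00001": 0.35, "K00002": 0.4, "K00003": 0.15, "K00004": 0.1}
rho = spearman(direct, predicted)
```

Here the two profiles disagree only in the order of the two most abundant KOs, which gives a correlation of 0.8.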
|Data set|QIIME + Tax4Fun vs. PICRUSt|SILVAngs + Tax4Fun vs. PICRUSt|QIIME + Tax4Fun vs. PICRUSt|SILVAngs + Tax4Fun vs. PICRUSt|
|Guerrero Negro hypersaline mat|1.95E-003|1.95E-003|1.95E-003|1.95E-003|
The correlation of Tax4Fun is significantly higher for all four datasets according to a nonparametric sign test (p-value < 0.001).
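An exact sign test on paired correlation values can be sketched as below. The function name, the choice to drop ties, and the example differences are assumptions for illustration and do not reproduce the data of the evaluation.

```python
from math import comb


def sign_test_p_value(differences, two_sided=True):
    """Exact sign test on paired differences under a Binomial(n, 0.5) null.

    Zero differences (ties) are dropped, a common convention that is
    assumed here.
    """
    wins = sum(1 for d in differences if d > 0)
    losses = sum(1 for d in differences if d < 0)
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses) if two_sided else wins
    # Probability of observing at least k successes out of n.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail) if two_sided else tail


# Hypothetical case: one method scores higher in all ten paired samples.
differences = [0.05, 0.02, 0.08, 0.01, 0.03, 0.04, 0.06, 0.02, 0.07, 0.01]
p_two_sided = sign_test_p_value(differences)
p_one_sided = sign_test_p_value(differences, two_sided=False)
```

For ten pairs that all favor the same method, the two-sided p-value is 2/1024 (about 1.95E-003) and the one-sided p-value is 1/1024 (about 9.8E-004).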
UProC and PAUDA were used for estimation of metagenomic and organism-specific functional profiles.
The coverage of QIIME and SILVAngs analysis pipelines was assessed in terms of the fraction of reads that could be
classified by QIIME/SILVAngs and the percentage of OTUs that could be mapped to KEGG organisms using Tax4Fun.
For all data sets, the quality values can be gathered from the following tables.
FSU - Fraction of sequences unexplained: the fraction of sequences without KEGG Ortholog hits. The FSU was introduced in Taxy-Pro (Klingenberg et al., 2013).
FTU - Fraction of OTUs that could not be mapped to KEGG organisms
Weighted NSTI: see Langille et al., 2013
Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336.
DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G.L. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072.
Fierer, N., Leff, J.W., Adams, B.J., Nielsen, U.N., Bates, S.T., Lauber, C.L., Owens, S., Gilbert, J.A., Wall, D.H., and Caporaso, J.G. (2012). Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc. Natl. Acad. Sci. U. S. A. 109, 21390–21395.
Harris, J.K., Caporaso, J.G., Walker, J.J., Spear, J.R., Gold, N.J., Robertson, C.E., Hugenholtz, P., Goodrich, J., McDonald, D., Knights, D., et al. (2013). Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME J. 7, 50–60.
Human Microbiome Project Consortium (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214.
Huson, D.H., and Xie, C. (2014). A poor man’s BLASTX--high-throughput metagenomic protein database search using PAUDA. Bioinformatics 30, 38–39.
Kanehisa, M., and Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.
Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2014). Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–205.
Klingenberg, H., Aßhauer, K.P., Lingner, T., and Meinicke, P. (2013). Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics 29, 973–980.
Kunin, V., Raes, J., Harris, J.K., Spear, J.R., Walker, J.J., Ivanova, N., von Mering, C., Bebout, B.M., Pace, N.R., Bork, P., et al. (2008). Millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. Mol. Syst. Biol. 4, 198.
Langille, M.G.I., Zaneveld, J., Caporaso, J.G., McDonald, D., Knights, D., Reyes, J.A., Clemente, J.C., Burkepile, D.E., Vega Thurber, R.L., Knight, R., et al. (2013). Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol. 31, 814–821.
Muegge, B.D., Kuczynski, J., Knights, D., Clemente, J.C., González, A., Fontana, L., Henrissat, B., Knight, R., and Gordon, J.I. (2011). Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science 332, 970–974.
Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., Peplies, J., and Glöckner, F.O. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–596.
Aßhauer, K.P., and Meinicke, P. (2013). On the estimation of metabolic profiles in metagenomics. (Göttingen: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH), pp. 1–13.
Meinicke, P. (2014). UProC: tools for ultra-fast protein domain classification. Bioinformatics.