
Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data

The characterization of phylogenetic and functional diversity is a key element in the analysis of microbial communities. Amplicon-based sequencing of marker genes, such as 16S rRNA, is a powerful tool for assessing and comparing the structure of microbial communities at a high phylogenetic resolution. Because 16S sequencing is more cost-effective than whole metagenome shotgun sequencing, marker gene analysis is frequently used in pre-studies or two-tiered large-scale studies. However, in contrast to shotgun sequencing approaches, restricting the analysis to 16S data loses the view of the functional capabilities of the community.

Tax4Fun is an open-source R package that predicts the functional capabilities of microbial communities based on 16S datasets.
Tax4Fun accepts output from the SILVAngs web server or from QIIME (Caporaso et al., 2010) run against the SILVA database (Quast et al., 2013).

Further, the Tax4Fun package implements the MoP-Pro approach for whole metagenome shotgun sequencing data (Aßhauer and Meinicke, 2013). MoP-Pro implements a shortcut to estimate the metabolic profile of a metagenome: the taxonomic profile of the metagenome is linked to a set of pre-computed metabolic reference profiles. Combining the taxonomic abundance estimates, obtained through the fast method Taxy-Pro (Klingenberg et al., 2013), with the metabolic reference profiles, based on the KEGG database (Kanehisa and Goto, 2000; Kanehisa et al., 2014), gives the metabolic profiling approach unrivaled speed.
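The linking of a taxonomic profile to pre-computed reference profiles can be illustrated with a minimal sketch. It assumes that the combination is a simple abundance-weighted mixture of the reference profiles; the function name, the dictionary layout and the example identifiers are hypothetical and are not taken from the Tax4Fun package.

```python
def predict_functional_profile(taxon_abundances, reference_profiles):
    """Mix reference profiles, weighting each by its relative taxon abundance.

    taxon_abundances:   {organism: abundance estimate}
    reference_profiles: {organism: {KO identifier: relative frequency}}
    Assumption: the predicted profile is a plain weighted sum.
    """
    total = sum(taxon_abundances.values())
    if total <= 0:
        raise ValueError("taxon abundances must sum to a positive value")

    profile = {}
    for organism, abundance in taxon_abundances.items():
        weight = abundance / total  # normalize to relative abundance
        # Raises KeyError if an organism has no reference profile.
        for ko, frequency in reference_profiles[organism].items():
            profile[ko] = profile.get(ko, 0.0) + weight * frequency
    return profile
```

If every reference profile sums to one, the predicted profile sums to one as well, because the weights are normalized before mixing.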

How to cite

K.P. Aßhauer, B. Wemheuer, R. Daniel, P. Meinicke (2015)
Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data
Bioinformatics 31 (17): 2882-2884. doi:10.1093/bioinformatics/btv287.


Precomputation of the association matrix

The association matrix was built from a BLASTN analysis in which we extracted the 16S rRNA gene sequences of all prokaryotic KEGG organisms and searched them against the SILVA SSU Ref NR database. For the assignment, we require sufficient sequence similarity, defined by a threshold on the BLAST bit score (> 1500). If K different KEGG organisms simultaneously show significant hits for a SILVA 16S sequence, each of the corresponding entries in the association matrix is initialized with 1/K.
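The initialization rule can be sketched as follows. The input is assumed to be a list of (KEGG organism, SILVA sequence, bit score) tuples parsed from the BLASTN output; the function name and the nested-dictionary representation of the matrix are hypothetical choices made for illustration only.

```python
from collections import defaultdict

BITSCORE_THRESHOLD = 1500  # threshold stated in the text


def build_association_matrix(hits, threshold=BITSCORE_THRESHOLD):
    """Return {silva_id: {kegg_organism: weight}} with weight = 1/K.

    hits: iterable of (kegg_organism, silva_id, bitscore) tuples
          (assumed input layout, not the original pipeline format).
    """
    organisms_per_silva = defaultdict(set)
    for kegg_organism, silva_id, bitscore in hits:
        if bitscore > threshold:  # keep significant hits only
            organisms_per_silva[silva_id].add(kegg_organism)

    matrix = {}
    for silva_id, organisms in organisms_per_silva.items():
        weight = 1.0 / len(organisms)  # K organisms share the entry equally
        matrix[silva_id] = {org: weight for org in sorted(organisms)}
    return matrix
```

With this layout, the weights for each SILVA sequence sum to one, so a sequence that matches several KEGG organisms distributes its abundance evenly among them.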

At present, two association matrices are available for Tax4Fun: one based on SILVA SSU Ref NR release 115 (August 2013) and one based on SILVA SSU Ref NR release 119 (July 2014), both combined with KEGG release 64.0 (October 2012).

Precomputation of the functional reference profiles

Organism-specific functional profiles are computed for all bacterial and archaeal genomes in KEGG (Release 64.0) (Kanehisa and Goto, 2000; Kanehisa et al., 2014). The genomes were downloaded and subsequently fragmented into overlapping reads simulating a two-fold coverage of the genomes, as previously described (Klingenberg et al., 2013). To take different read lengths into account, we generated overlapping reads of length 400 bp with 200 bp overlap for long read data and of length 100 bp with 50 bp overlap for short read data.
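A minimal sketch of the fragmentation step, assuming a sliding window whose step equals the read length minus the overlap. With an overlap of half the read length, every interior position is covered by two reads. The function name is hypothetical, and dropping a trailing remainder shorter than one read is an assumption, not something stated in the source.

```python
def fragment_genome(sequence, read_length, overlap):
    """Cut a sequence into overlapping reads of fixed length.

    Assumption: a trailing remainder shorter than read_length is dropped.
    """
    step = read_length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than read_length")

    reads = []
    for start in range(0, len(sequence) - read_length + 1, step):
        reads.append(sequence[start:start + read_length])
    return reads


# Settings named in the text (hypothetical wrapper names):
def long_reads(sequence):
    return fragment_genome(sequence, read_length=400, overlap=200)


def short_reads(sequence):
    return fragment_genome(sequence, read_length=100, overlap=50)
```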

The organism-specific reference profiles are computed with the same method as implemented in the CoMet-Universe web server or Taxy-Pro (Klingenberg et al., 2013), but here we use KEGG Orthologs instead of Pfam protein domains. For the direct computation of metagenomic functional profiles in our comparative evaluation, we used the ultrafast protein classification (UProC) tool (Meinicke, 2014) and protein alignment using a DNA aligner (PAUDA) (Huson and Xie, 2014) to rapidly assign metagenomic reads to KEGG Orthologs of bacterial and archaeal origin.
The UProC protein classification tool was executed in short read mode for the simulated short read data and in long read mode otherwise.
The PAUDA homology search was performed in --fast mode with default parameters. In the case of multiple matches, only the best hit is considered.
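The best-hit rule can be sketched as below. The sketch assumes that "best" means the highest alignment score and that matches have already been parsed into (read, KEGG Ortholog, score) tuples; neither the tuple layout nor the function name comes from PAUDA itself.

```python
def best_hit_per_read(matches):
    """Keep only the highest-scoring KEGG Ortholog for each read.

    matches: iterable of (read_id, ko, score) tuples (assumed layout).
    Ties are resolved in favor of the first match seen (assumption).
    """
    best = {}
    for read_id, ko, score in matches:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (ko, score)
    return {read_id: ko for read_id, (ko, _score) in best.items()}
```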


Tax4Fun and PICRUSt (Langille et al., 2013) were applied to a range of paired metagenome/16S datasets that have also been used in the original PICRUSt study.
All publicly accessible amplicon and metagenome sample files from the Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012), mammalian guts (Muegge et al., 2011), soils (Fierer et al., 2012), and the Guerrero Negro hypersaline microbial mat (Harris et al., 2013; Kunin et al., 2008) were downloaded in December 2013.
Due to the credit system of SILVAngs, we restricted the analysis of the HMP data to a subset of 49 samples for all taxonomic and functional profiling approaches.
The 16S profile was estimated using QIIME (Caporaso et al., 2010) and SILVAngs (Quast et al., 2013).
The functional profiles of the microbial communities were predicted using the PICRUSt (Langille et al., 2013) and Tax4Fun approaches.
In total, we were able to process 49 paired HMP samples, 56 mammalian gut samples, 13 paired soil samples, and 10 paired Guerrero Negro microbial mat samples.




For each paired dataset, the Spearman correlation of the directly computed and the 16S-predicted KEGG Ortholog profile was calculated.
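The correlation for one paired sample can be sketched in plain Python. Spearman correlation is the Pearson correlation of the ranks, with tied values receiving their average rank. The function names are illustrative, and both profiles are assumed to be lists of abundances over the same ordered set of KEGG Orthologs.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; fails if either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(metagenome_profile, predicted_profile):
    """Spearman correlation of two equally ordered KO profiles."""
    if len(metagenome_profile) != len(predicted_profile):
        raise ValueError("profiles must cover the same KEGG Orthologs")
    return pearson(average_ranks(metagenome_profile),
                   average_ranks(predicted_profile))
```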

Spearman correlation

Spearman correlations between metagenomic and 16S-predicted functional profiles for comparison of Tax4Fun and PICRUSt on paired datasets from the human microbiome (HMP), mammalian guts, hypersaline microbial mat and soils.
For the human microbiome datasets, the Spearman correlations between metagenomic and 16S-predicted functional profiles were additionally calculated for distinct body sites.
Both UProC and PAUDA were used for estimation of metagenomic and organism-specific functional profiles.

P-values for nonparametric statistical testing

Data set                       QIIME + Tax4Fun   SILVAngs + Tax4Fun   QIIME + Tax4Fun   SILVAngs + Tax4Fun
                               vs. PICRUSt       vs. PICRUSt          vs. PICRUSt       vs. PICRUSt
Mammalian gut                  2.78E-017         2.78E-017            2.78E-017         2.78E-017
Guerrero Negro hypersaline mat 1.95E-003         1.95E-003            1.95E-003         1.95E-003

The correlation of Tax4Fun is significantly higher for all four datasets according to a nonparametric sign test (p-value < 0.001).
UProC and PAUDA were used for estimation of metagenomic and organism-specific functional profiles.
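An exact two-sided sign test on paired correlations can be sketched as follows. Each sample counts as a "win" for the method with the higher Spearman correlation, and ties are assumed to have been discarded beforehand. The function name is hypothetical, and the source does not state whether a one-sided or two-sided test was used.

```python
from math import comb


def sign_test_two_sided(wins, losses):
    """Exact two-sided sign test p-value under a fair-coin null hypothesis.

    wins/losses: number of samples in which method A scored higher/lower
    than method B (tied samples are assumed to be excluded).
    """
    n = wins + losses
    if n == 0:
        raise ValueError("at least one non-tied sample is required")
    k = max(wins, losses)
    # Probability of a result at least as extreme in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If one method wins on all 56 samples, this gives 2^-55, about 2.78E-017, and for 10 out of 10 samples it gives 2^-9, about 1.95E-003. Both coincide with the tabulated values for the mammalian gut and Guerrero Negro datasets, which suggests, but does not confirm, that Tax4Fun achieved the higher correlation on every sample of those datasets.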

Quality survey of prediction methods

The coverage of QIIME and SILVAngs analysis pipelines was assessed in terms of the fraction of reads that could be classified by QIIME/SILVAngs and the percentage of OTUs that could be mapped to KEGG organisms using Tax4Fun.
For all data sets, the quality values can be gathered from the following tables.

FSU - Fraction of sequences unexplained: the fraction of sequences without KEGG Ortholog hits. The FSU was introduced in Taxy-Pro (Klingenberg et al., 2013).

FTU - Fraction of OTUs that could not be mapped to KEGG organisms

Weighted NSTI: see Langille et al., 2013
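As an illustration of the FTU measure, the sketch below computes the fraction of OTUs without a KEGG organism mapping. Whether the published values count OTUs or weight them by read abundance is not stated here, so both variants are offered as assumptions; the function name and input layout are hypothetical.

```python
def fraction_unmapped(otu_counts, mapped_otus, weighted=False):
    """Fraction of OTUs that have no KEGG organism mapping.

    otu_counts:  {otu_id: read count}
    mapped_otus: set of OTU ids that could be mapped to KEGG organisms
    weighted:    if True, weight each OTU by its read count (assumption)
    """
    if not otu_counts:
        raise ValueError("no OTUs given")
    if weighted:
        total = sum(otu_counts.values())
        unmapped = sum(count for otu, count in otu_counts.items()
                       if otu not in mapped_otus)
    else:
        total = len(otu_counts)
        unmapped = sum(1 for otu in otu_counts if otu not in mapped_otus)
    return unmapped / total
```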

Please direct your questions and comments to kathrin@gobics.de.


Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I., et al. (2010). QIIME allows analysis of high-throughput community sequencing data. Nat. Methods 7, 335–336.

DeSantis, T.Z., Hugenholtz, P., Larsen, N., Rojas, M., Brodie, E.L., Keller, K., Huber, T., Dalevi, D., Hu, P., and Andersen, G.L. (2006). Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072.

Fierer, N., Leff, J.W., Adams, B.J., Nielsen, U.N., Bates, S.T., Lauber, C.L., Owens, S., Gilbert, J.A., Wall, D.H., and Caporaso, J.G. (2012). Cross-biome metagenomic analyses of soil microbial communities and their functional attributes. Proc. Natl. Acad. Sci. U. S. A. 109, 21390–21395.

Harris, J.K., Caporaso, J.G., Walker, J.J., Spear, J.R., Gold, N.J., Robertson, C.E., Hugenholtz, P., Goodrich, J., McDonald, D., Knights, D., et al. (2013). Phylogenetic stratigraphy in the Guerrero Negro hypersaline microbial mat. ISME J. 7, 50–60.

Human Microbiome Project Consortium (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214.

Huson, D.H., and Xie, C. (2014). A poor man’s BLASTX: high-throughput metagenomic protein database search using PAUDA. Bioinformatics 30, 38–39.

Kanehisa, M., and Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30.

Kanehisa, M., Goto, S., Sato, Y., Kawashima, M., Furumichi, M., and Tanabe, M. (2014). Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–205.

Klingenberg, H., Aßhauer, K.P., Lingner, T., and Meinicke, P. (2013). Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics 29, 973–980.

Kunin, V., Raes, J., Harris, J.K., Spear, J.R., Walker, J.J., Ivanova, N., von Mering, C., Bebout, B.M., Pace, N.R., Bork, P., et al. (2008). Millimeter-scale genetic gradients and community-level molecular convergence in a hypersaline microbial mat. Mol. Syst. Biol. 4, 198.

Langille, M.G.I., Zaneveld, J., Caporaso, J.G., McDonald, D., Knights, D., Reyes, J.A., Clemente, J.C., Burkepile, D.E., Vega Thurber, R.L., Knight, R., et al. (2013). Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat. Biotechnol. 31, 814–821.

Muegge, B.D., Kuczynski, J., Knights, D., Clemente, J.C., González, A., Fontana, L., Henrissat, B., Knight, R., and Gordon, J.I. (2011). Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans. Science 332, 970–974.

Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., Peplies, J., and Glöckner, F.O. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–596.

Aßhauer, K.P., and Meinicke, P. (2013). On the estimation of metabolic profiles in metagenomics. (Göttingen: Schloss Dagstuhl - Leibniz-Zentrum für Informatik GmbH), pp. 1–13.

Meinicke, P. (2014). UProC: tools for ultra-fast protein domain classification. Bioinformatics.