Efrat Muller edited this page Jan 11, 2022 · 26 revisions

Introduction

Welcome to the Human Fecal Microbiome-Metabolome Data Collection wiki!

This data collection includes curated data from multiple studies where both **metagenomic** and **metabolomic** profiles were obtained from human fecal samples [TODO: add on bottom refs to all datasets included]. It is made publicly available for the benefit of the microbiome research community, to facilitate integrated microbiome-metabolome analysis and cross-study comparisons.

Data from the original studies was obtained from the publications' supplementary materials, from the public data repositories in which the data was deposited, or directly from the authors via mail. We then processed the data in a unified manner, attempting to create comparable datasets. Importantly, all comparisons should be made with caution, as there is substantial heterogeneity between studies in terms of cohort characteristics (ages, geography, medical backgrounds, etc.) as well as study protocols and data generation (sample collection and storage protocols, metagenomics and metabolomics technologies, etc.). All of these factors are expected to introduce variation in both fecal microbiome and fecal metabolome profiles. [TODO: add references from paper]

This wiki describes how the data is organized in this repository, which original studies generated the data, and how the data was processed and unified. It also gives a quick example of how the data could be used for cross-study comparisons. For transparency and reproducibility, all scripts used for manipulating the originally published data are available in the repository as well (and are referred to within the relevant sections of this wiki).

[TODO: add some kind of table of contents here]

Data overview

Data organization

[TODO]

A diagram of the tables provided per dataset

Datasets included

Data was obtained only from studies that met the following criteria:

  • Human cohort
  • Microbiome profiles of stool samples available
  • Metabolome profiles of same stool samples available
  • Basic metadata per sample/subject available

The collection currently includes the following datasets:

Dataset name Study DOI
YACHIDA_CRC_2019
FRANZOSA_IBD_2019
SINHA_CRC_2016
HE_INFANTS_MFGM_2019
iHMP_IBDMDB_2019
JACOBS_IBD_RELATIVES_2016
POYET_BIO_ML_2019
ERAWIJANTARI_GASTRIC_CANCER_2020
KIM_ADENOMAS
MARS_IBS_2020
KANG_AUTISM_2018
KOSTIC_INFANTS_DIABETES_2015
WANDRO_PRETERMS_2018

Additional details per dataset can be found here [TODO: add link to excel]

Data access

[TODO]

Data processing details

Metabolomics processing notes

The metabolomics data obtained from the different studies and included in this collection are diverse. They were generated using different technologies (e.g. NMR, LC-MS, GC-MS), followed either a targeted or an untargeted approach, applied different controls and normalizations, and were shared in varying formats with varying compound identifiers.

In an attempt to consolidate the metabolome profiles in this collection while also maintaining maximum original data, we performed the following processing:

  • Compound identifiers in each mtb table are as provided by the authors (i.e. they are not unified or modified in any way in this table). Occasionally, a few different fields from the original data were concatenated in order to ensure unique compound identifiers. For example, if a dataset contained both an NMR-based metabolic profile and an untargeted LC-MS profile, then the unique compound names were a concatenation of the metabolomics method name and the metabolite identifier within that method (e.g. "NMR_Lactate", "LC-MS_Glycocholic acid", ...). Further details can be found in the dataset-specific scripts found here [add link to load_original_data folder].
  • We created a metabolite-metadata table per dataset (namely mtb.map) where additional details are provided for each metabolite in the mtb table. The mtb.map table includes:
    • Any original information per metabolite as provided by authors;
    • Mappings to KEGG and HMDB identifiers wherever possible. These were either provided by authors, or obtained using the conversion utility from MetaboAnalystR (version 3.2) [TODO: add ref]. Additional mappings were added manually when possible (see load_data scripts [TODO: add link]).
    • We added a "High.Confidence.Annotation" boolean field to mark cases where the identification of the metabolite (specifically, the mapping to KEGG/HMDB) was made with lower confidence. In particular, this field is set to FALSE in any of the following cases:
      • Metabolite had an ambiguous identification in the original table (e.g. "fructose/glucose");
      • Metabolite was marked as lower-confidence annotation in the original table;
      • Metabolite name had a typo;
      • Both a name and a KEGG ID were provided, but the name also matched a different KEGG ID (i.e. a contradiction in KEGG mappings);
      • More than one metabolite was mapped to the same KEGG or HMDB ID.
  • Metabolite values were kept as is (including missing values where present).
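
As an illustration of how the "High.Confidence.Annotation" field could be derived, here is a minimal Python sketch covering only two of the rules listed above: ambiguous names and identifiers shared by several metabolites. It is not the repository's implementation (the actual scripts are written in R). The column names `Compound`, `KEGG` and `HMDB`, the use of "/" as the marker of an ambiguous name, and the identifiers in the example are all assumptions made for this example.

```python
from collections import Counter

def flag_high_confidence(rows):
    """Add a 'High.Confidence.Annotation' boolean to each metabolite row.

    Each row is a dict with a 'Compound' name and optional 'KEGG' / 'HMDB'
    IDs (None when no mapping exists). These column names are hypothetical.
    """
    # Count how many metabolites map to each identifier (missing IDs ignored).
    kegg_counts = Counter(r["KEGG"] for r in rows if r.get("KEGG"))
    hmdb_counts = Counter(r["HMDB"] for r in rows if r.get("HMDB"))

    flagged = []
    for r in rows:
        # Assumption: names such as "fructose/glucose" signal ambiguity.
        ambiguous = "/" in r["Compound"]
        shared_id = (kegg_counts.get(r.get("KEGG"), 0) > 1
                     or hmdb_counts.get(r.get("HMDB"), 0) > 1)
        flagged.append({**r, "High.Confidence.Annotation": not (ambiguous or shared_id)})
    return flagged

# Placeholder identifiers, for illustration only.
example_rows = [
    {"Compound": "NMR_Lactate", "KEGG": "ID_1", "HMDB": None},
    {"Compound": "fructose/glucose", "KEGG": "ID_2", "HMDB": None},
    {"Compound": "LC-MS_Glycocholic acid", "KEGG": "ID_3", "HMDB": None},
    {"Compound": "LC-MS_Glycocholate", "KEGG": "ID_3", "HMDB": None},
]
flags = [r["High.Confidence.Annotation"] for r in flag_high_confidence(example_rows)]
```

The remaining rules (annotations marked as lower confidence by the authors, typos, contradicting KEGG mappings) depend on dataset-specific information and are not covered by this sketch.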

Note: searching for metabolite identifiers by metabolite names may lead to inaccurate/partial mappings [refs: https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-6-2, https://www.mdpi.com/2218-1989/9/2/28]

Microbiome processing notes

This dataset collection includes microbiome data from both whole genome shotgun sequencing (WGSS) and 16S rRNA amplicon sequencing. Details about the processing of each data type are provided below.

WGSS data - MetaPhlAn tables

For studies with shotgun metagenomic data, we obtained MetaPhlAn2 tables.

Note: a newer and improved version of MetaPhlAn is available

The following processing was performed for MetaPhlAn tables:

  • A species-level table was saved as is (species table, no unification or further processing was applied);
  • To create the genera table:
    • When short species names were provided, we mapped them to the corresponding genus name and aggregated rows of the same genus by summing up the abundance values.
    • Mappings to genus names were based on this publicly available file. Further details can be found in the "utils.R" script [TODO: link to utils script].
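
The genus aggregation step described above can be sketched as follows. This is a simplified Python illustration, not the code in "utils.R"; the nested-dictionary table layout, the function name, and the choice to collect unmapped species under "Unclassified" are assumptions of the example.

```python
from collections import defaultdict

def aggregate_to_genus(species_table, species_to_genus):
    """Sum species-level abundances into genus-level rows.

    species_table: {species_name: {sample_id: abundance}}
    species_to_genus: {species_name: genus_name}
    Species without a mapping are collected under 'Unclassified'
    (an assumption of this sketch).
    """
    genus_table = defaultdict(lambda: defaultdict(float))
    for species, abundances in species_table.items():
        genus = species_to_genus.get(species, "Unclassified")
        for sample, value in abundances.items():
            genus_table[genus][sample] += value
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {genus: dict(samples) for genus, samples in genus_table.items()}

species_table = {
    "Bacteroides_fragilis": {"S1": 0.25, "S2": 0.5},
    "Bacteroides_ovatus": {"S1": 0.5, "S2": 0.0},
    "Unknown_sp": {"S1": 0.25, "S2": 0.5},
}
species_to_genus = {
    "Bacteroides_fragilis": "Bacteroides",
    "Bacteroides_ovatus": "Bacteroides",
}
genus_table = aggregate_to_genus(species_table, species_to_genus)
```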

16S rRNA data

16S rRNA gene sequencing raw data was processed using QIIME2 version 2019-1 [TODO: ref] as follows:

  • When raw data was multiplexed, we demultiplexed the data using QIIME2’s demux plugin.
  • We applied DADA2 [TODO: ref] to denoise the data and extract ASVs. Whenever a sufficient high-quality overlap of forward and reverse reads was available, DADA2 was also used for merging paired-end reads.
  • To assign ASVs to taxonomy, we trained a Naive Bayes classifier per dataset using QIIME2’s feature-classifier plugin [TODO: ref]. Classifiers were trained on reads extracted from the SILVA 99-OTU database [TODO: ref], according to the specific 16S rRNA gene hypervariable region used in each dataset.
  • ASV tables were collapsed to genus-level counts.

Further study-specific parameters and details about the 16S data processing of each dataset are provided here [TODO: add link to excel].

Unification at genus level

Differences between bioinformatic tools for taxonomic annotation, WGSS vs. 16S data resolution, as well as differences between reference taxonomy databases caused discrepancies between genera entities across datasets. We unified genera entities by applying the following:

  • Non-bacteria entities were removed;
  • Full genus names were re-formatted to match the following pattern: k__<...>|p__<...>|c__<...>|o__<...>|f__<...>|g__<...>. For example, the string D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;D_3__Actinomycetales;D_4__Actinomycetaceae;D_5__Actinotignum was converted to k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinotignum;
  • Some genera have inconsistent assignments to higher-level phylogeny across different databases. Atopobium, for example, can be assigned either to k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Atopobium or to k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Coriobacteriales|f__Coriobacteriaceae|g__Atopobium. In these cases, we unified the full taxonomy string to a single version;
  • Genera entities that cannot be differentiated by their 16S sequences were aggregated into a single entity, to allow comparisons to WGSS datasets. For example, Escherichia and Shigella were merged into a single Escherichia-Shigella entity;
  • Unclassified species/genera were aggregated into an "Unclassified" entity. If an entity was unclassified at the genus level but assigned to a higher-level clade, we left it as is and marked the unclassified phylogenetic levels with __ (e.g. k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|__);
  • Additional unification details and examples can be found in the [unify_genera.R script](TODO - add link).
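
The taxonomy string re-formatting described in the list above can be illustrated with a short Python sketch. It is not the code in unify_genera.R; the function name is hypothetical, and the input is assumed to be a `;`-separated string with `D_<level>__` prefixes, as in the example given above. Levels missing below the deepest assigned rank are written as `__`.

```python
RANK_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__"]

def to_unified_taxonomy(taxonomy):
    """Convert 'D_0__Bacteria;D_1__...' into 'k__Bacteria|p__...|g__...'."""
    # Drop the 'D_<n>' part of each level and keep only the taxon name.
    names = [
        part.strip().split("__", 1)[1]
        for part in taxonomy.split(";")
        if "__" in part
    ]
    levels = []
    for i, prefix in enumerate(RANK_PREFIXES):
        if i < len(names) and names[i]:
            levels.append(prefix + names[i])
        else:
            # Unclassified level: marked with a bare '__'.
            levels.append("__")
    return "|".join(levels)

full = to_unified_taxonomy(
    "D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;"
    "D_3__Actinomycetales;D_4__Actinomycetaceae;D_5__Actinotignum"
)
partial = to_unified_taxonomy(
    "D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;"
    "D_3__Clostridiales;D_4__Lachnospiraceae"
)
```

Here `full` reproduces the converted Actinotignum string shown above, and `partial` ends with `|__` because no genus was assigned.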

Finally, we normalized genus abundances in all datasets so that abundances sum to 1 in each sample (i.e. they were converted to relative abundances).
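
For reference, the normalization to relative abundances amounts to dividing each value by its sample total. A minimal Python sketch, assuming a nested-dictionary table layout (the layout and the function name are illustrative, and the handling of all-zero samples is an assumption):

```python
def to_relative_abundance(genus_table):
    """Scale a {genus: {sample_id: value}} table so each sample sums to 1."""
    # First pass: total abundance per sample.
    totals = {}
    for abundances in genus_table.values():
        for sample, value in abundances.items():
            totals[sample] = totals.get(sample, 0.0) + value
    # Second pass: divide by the sample total.
    # Assumption: samples with a zero total are left as zeros.
    return {
        genus: {
            sample: (value / totals[sample] if totals[sample] > 0 else 0.0)
            for sample, value in abundances.items()
        }
        for genus, abundances in genus_table.items()
    }

counts = {
    "Bacteroides": {"S1": 30, "S2": 0},
    "Escherichia-Shigella": {"S1": 10, "S2": 0},
}
relative = to_relative_abundance(counts)
```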

Subject metadata files

Subject metadata files (metadata) include all information provided in the original publication (typically in the supplementary information; see [TODO] for dataset-specific details).

We specifically unified the names of the following fields:

| Field name | Description |
| --- | --- |
| Sample | Sample identifier. Corresponds to sample names in feature tables |
| Subject | Subject identifier. Some studies have multiple samples per subject |
| Study.Group | Study group as named in original study (typically one of the groups would be named 'control' or 'healthy' and the other will be named after the studied disease/condition) |
| Age | The subject's age |
| Age.Units | One of: Years, Months, Days |
| Gender | One of: Male, Female, Other |
| BMI | The subject's BMI |

Note: if not provided in original metadata, the field will be missing from the table.

In addition, we added the following three fields to each metadata file:

| Field name | Description |
| --- | --- |
| Dataset | The “Dataset” name is formatted as follows: <First author>_<Short cohort description>_<Year of publication> |
| DOI | Publication DOI |
| Publication.Name | Publication name |

General data usage tips

  • Some of the datasets are from longitudinal studies, meaning that they include multiple samples per subject. Depending on the analysis, you may want to consider dealing with these separately. See [TODO] for further details per dataset, including whether it is longitudinal or not.
  • Cross-study comparisons can be performed using either KEGG or HMDB IDs as the link between metabolites in different datasets. For microbiome comparisons, genera tables can be used (genus names were unified), and if analyzing only shotgun datasets, species names can be used.
  • A simple example of a cross-study comparison using this data collection, looking into the correlation between XX and the XXX bacteria XXX, can be found here [TODO: add link, maybe teaser plot?].
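
The first two tips can be sketched in Python as follows. This is an illustrative example, not part of the repository. It assumes that the metadata and mtb.map tables have been loaded as lists of dictionaries, that mtb.map has columns named `Compound` and `KEGG` (hypothetical names; only `Sample`, `Subject` and `High.Confidence.Annotation` are field names documented above), and it uses placeholder identifiers. Keeping the first listed sample per subject is just one possible way of handling longitudinal data.

```python
def one_sample_per_subject(metadata_rows):
    """Return the first listed sample ID of each subject."""
    first_sample = {}
    for row in metadata_rows:
        first_sample.setdefault(row["Subject"], row["Sample"])
    return list(first_sample.values())

def shared_kegg_ids(mtb_map_a, mtb_map_b, high_confidence_only=True):
    """Return {kegg_id: (compound_in_a, compound_in_b)} for IDs found in both maps."""
    def index_by_kegg(rows):
        index = {}
        for row in rows:
            if not row.get("KEGG"):
                continue  # no KEGG mapping for this metabolite
            if high_confidence_only and not row.get("High.Confidence.Annotation", False):
                continue  # skip lower-confidence mappings
            index.setdefault(row["KEGG"], row["Compound"])
        return index

    a, b = index_by_kegg(mtb_map_a), index_by_kegg(mtb_map_b)
    return {kegg: (a[kegg], b[kegg]) for kegg in sorted(a.keys() & b.keys())}

metadata = [
    {"Sample": "S1", "Subject": "P1"},
    {"Sample": "S2", "Subject": "P1"},
    {"Sample": "S3", "Subject": "P2"},
]
map_a = [
    {"Compound": "NMR_Lactate", "KEGG": "ID_1", "High.Confidence.Annotation": True},
    {"Compound": "fructose/glucose", "KEGG": "ID_2", "High.Confidence.Annotation": False},
]
map_b = [
    {"Compound": "lactate", "KEGG": "ID_1", "High.Confidence.Annotation": True},
    {"Compound": "glucose", "KEGG": "ID_2", "High.Confidence.Annotation": True},
]
samples = one_sample_per_subject(metadata)
shared = shared_kegg_ids(map_a, map_b)
```

Restricting the link to high-confidence annotations is a conservative default; passing `high_confidence_only=False` keeps all mapped metabolites.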

Acknowledgements

We thank all authors of the included studies for making their data publicly available and for responding to our questions.

Citation

If you use the data provided here, we kindly request that you cite both the original papers as well as: ...

References
