Home
Welcome to the Human Fecal Microbiome-Metabolome Data Collection wiki!
This data collection includes curated data from multiple studies where both **metagenomic** and **metabolomic** profiles were obtained from human fecal samples [TODO: add on bottom refs to all datasets included]. It is made publicly available for the benefit of the microbiome research community in order to facilitate integrated microbiome-metabolome analysis and cross-study comparisons.
Data from the original studies was obtained from the publications' supplementary materials, from the public data repositories in which the data was deposited, or from the authors via mail. We then processed the data in a unified manner, attempting to create comparable datasets. Importantly, all comparisons should be made with caution, as there is substantial heterogeneity between studies in terms of cohort characteristics (ages, geography, medical backgrounds, etc.) as well as study protocols and data generation (sample collection and storage protocols, metagenomics and metabolomics technologies, etc.). All of these factors are expected to introduce variation in both fecal microbiome and fecal metabolome profiles. [TODO: add references from paper]
This Wiki describes how the data is organized in this repository, the original studies that generated the data, how the data was processed and unified, and a quick example of how the data could be used for cross-study comparisons. For transparency and reproducibility, all scripts used for manipulating the originally-published data are available in the repository as well (and are referred to within the relevant sections of this Wiki).
[TODO: add some kind of table of contents here]
[TODO]
Data was obtained only from studies that met the following criteria:
- Human cohort
- Microbiome profiles of stool samples available
- Metabolome profiles of same stool samples available
- Basic metadata per sample/subject available
The collection currently includes the following datasets:
Dataset name | Study DOI |
---|---|
YACHIDA_CRC_2019 | |
FRANZOSA_IBD_2019 | |
SINHA_CRC_2016 | |
HE_INFANTS_MFGM_2019 | |
iHMP_IBDMDB_2019 | |
JACOBS_IBD_RELATIVES_2016 | |
POYET_BIO_ML_2019 | |
ERAWIJANTARI_GASTRIC_CANCER_2020 | |
KIM_ADENOMAS | |
MARS_IBS_2020 | |
KANG_AUTISM_2018 | |
KOSTIC_INFANTS_DIABETES_2015 | |
WANDRO_PRETERMS_2018 | |
Additional details per dataset can be found here [TODO: add link to excel]
[TODO]
The metabolomics data obtained from the different studies and included in this collection are diverse. They were generated with different technologies (e.g. NMR, LC-MS, GC-MS), followed either a targeted or an untargeted approach, applied different controls and normalizations, and were shared in varying formats with varying compound identifiers.
In an attempt to consolidate the metabolome profiles in this collection while also maintaining maximum original data, we performed the following processing:
- Compound identifiers in each `mtb` table are as provided by the authors (i.e. they are not unified or modified in any way in this table). Occasionally, a few different fields from the original data were concatenated in order to ensure unique compound identifiers. For example, if a dataset contained both an NMR-based metabolic profile and an untargeted LC-MS profile, then the unique compound names were a concatenation of the metabolomics method name and the metabolite identifier within that method (e.g. "NMR_Lactate", "LC-MS_Glycocholic acid", ...). Further details can be found in the dataset-specific scripts found here [add link to load_original_data folder].
- We created a metabolite-metadata table per dataset (namely `mtb.map`) where additional details are provided for each metabolite in the `mtb` table. The `mtb.map` table includes:
  - Any original information per metabolite as provided by the authors;
  - Mappings to KEGG and HMDB identifiers wherever possible. These were either provided by the authors, or obtained using the conversion utility from MetaboAnalystR (version 3.2) [TODO: add ref]. Additional mappings were added manually when possible (see load_data scripts [TODO: add link]);
  - A "High.Confidence.Annotation" boolean field, which we added to mark cases where the identification of the metabolite (specifically, the mapping to KEGG/HMDB) was made with lower confidence. In particular, this field is set to FALSE in any of the following cases:
    - The metabolite had an ambiguous identification in the original table (e.g. "fructose/glucose");
    - The metabolite was marked as a lower-confidence annotation in the original table;
    - The metabolite name had a typo;
    - Both a name and a KEGG ID were provided, but the name also matched a different KEGG ID (i.e. a contradiction in KEGG mappings);
    - More than one metabolite was mapped to the same KEGG or HMDB ID.
- Metabolite values were kept as is (including missing values where present).
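To make the "High.Confidence.Annotation" rules more concrete, here is a minimal Python sketch of two of them: ambiguous names and several metabolites sharing one KEGG or HMDB ID. It is an illustration only and not the repository's code (the actual processing lives in the dataset-specific R scripts). The column names `Compound`, `KEGG` and `HMDB`, the function name, the toy identifiers, and the use of "/" as a marker of ambiguity are all assumptions made for this example.

```python
from collections import Counter


def flag_high_confidence(rows):
    """Return one boolean per row: True if the annotation looks high-confidence.

    `rows` is a list of dicts with the hypothetical keys
    'Compound', 'KEGG' and 'HMDB' (IDs may be None when unmapped).
    Only two of the documented rules are sketched here.
    """
    kegg_counts = Counter(r["KEGG"] for r in rows if r.get("KEGG"))
    hmdb_counts = Counter(r["HMDB"] for r in rows if r.get("HMDB"))

    flags = []
    for r in rows:
        # Assumption: a "/" in the name signals an ambiguous identification,
        # as in "fructose/glucose".
        ambiguous_name = "/" in r["Compound"]
        # More than one metabolite mapped to the same KEGG or HMDB ID.
        shared_kegg = bool(r.get("KEGG")) and kegg_counts[r["KEGG"]] > 1
        shared_hmdb = bool(r.get("HMDB")) and hmdb_counts[r["HMDB"]] > 1
        flags.append(not (ambiguous_name or shared_kegg or shared_hmdb))
    return flags


# Toy rows with made-up identifiers (not real KEGG/HMDB IDs).
example_rows = [
    {"Compound": "fructose/glucose", "KEGG": "K_TOY_1", "HMDB": None},
    {"Compound": "NMR_Lactate", "KEGG": "K_TOY_2", "HMDB": "H_TOY_2"},
    {"Compound": "LC-MS_Lactate", "KEGG": "K_TOY_2", "HMDB": "H_TOY_2"},
    {"Compound": "LC-MS_Glycocholic acid", "KEGG": "K_TOY_3", "HMDB": "H_TOY_3"},
]
print(flag_high_confidence(example_rows))  # [False, False, False, True]
```

The remaining rules (author-provided confidence marks, typos, and name/ID contradictions) depend on dataset-specific information and are not covered by this sketch.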
Note: searching for metabolite identifiers by metabolite names may lead to inaccurate/partial mappings [refs: https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-6-2, https://www.mdpi.com/2218-1989/9/2/28]
This dataset collection includes microbiome data from both whole genome shotgun sequencing (WGSS) and 16S rRNA amplicon sequencing. Details about the processing of each data type are provided below.
For studies with shotgun metagenomic data, we obtained MetaPhlAn2 tables.
Note: a newer and improved version of MetaPhlAn is available
The following processing was performed for MetaPhlAn tables:
- A species-level table was saved as is (`species` table; no unification or further processing was applied);
- To create the `genera` table:
  - When short species names were provided, we mapped them to the corresponding genus name and aggregated rows of the same genus by summing up the abundance values.
  - Mappings to genus names were based on this publicly available file. Further details can be found in the "utils.R" script [TODO: link to utils script].
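The species-to-genus aggregation described above can be sketched as follows. This is a simplified Python illustration, not the code in "utils.R"; the function name, the nested-dict table layout, and the `species_to_genus` dictionary (a stand-in for the mapping file) are assumptions, and the abundance values are toy numbers.

```python
from collections import defaultdict


def aggregate_to_genus(species_table, species_to_genus):
    """Sum species-level abundances into genus-level rows.

    species_table:    {species_name: {sample_id: abundance}}
    species_to_genus: {species_name: genus_name}  (hypothetical mapping)
    Returns           {genus_name: {sample_id: summed_abundance}}
    """
    genera = defaultdict(lambda: defaultdict(float))
    for species, per_sample in species_table.items():
        genus = species_to_genus[species]
        for sample, value in per_sample.items():
            genera[genus][sample] += value
    # Convert nested defaultdicts to plain dicts for easier downstream use.
    return {genus: dict(samples) for genus, samples in genera.items()}


# Toy example: two Bacteroides species collapse into one genus row.
species_table = {
    "s__Bacteroides_fragilis": {"S1": 10.0, "S2": 0.0},
    "s__Bacteroides_ovatus": {"S1": 5.0, "S2": 2.5},
    "s__Escherichia_coli": {"S1": 1.0, "S2": 4.0},
}
species_to_genus = {
    "s__Bacteroides_fragilis": "g__Bacteroides",
    "s__Bacteroides_ovatus": "g__Bacteroides",
    "s__Escherichia_coli": "g__Escherichia",
}
genera_table = aggregate_to_genus(species_table, species_to_genus)
print(genera_table["g__Bacteroides"])  # {'S1': 15.0, 'S2': 2.5}
```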
16S rRNA gene sequencing raw data was processed using QIIME2 version 2019-1 [TODO: ref] as follows:
- When raw data was multiplexed, we demultiplexed the data using QIIME2’s demux plugin.
- We applied DADA2 [TODO: ref] in order to denoise the data and extract ASVs. Whenever a sufficient high-quality overlap of forward and reverse reads was available, DADA2 was also used for merging paired-end reads.
- To assign ASVs to taxonomy, we trained a Naive Bayes classifier per dataset using QIIME2’s feature-classifier plugin [TODO: ref]. Classifiers were trained on reads extracted from the SILVA 99-OTU database [TODO: ref], according to the specific 16S rRNA gene hypervariable region used in each dataset.
- ASV tables were collapsed to genus-level counts.
Further study-specific parameters and details about the 16S data processing of each dataset are detailed here [TODO: add link to excel].
Differences between bioinformatic tools for taxonomic annotation, WGSS vs. 16S data resolution, as well as differences between reference taxonomy databases caused discrepancies between genera entities across datasets. We unified genera entities by applying the following:
- Non-bacteria entities were removed;
- Full genus names were re-formatted to match the following pattern: `k__<...>|p__<...>|c__<...>|o__<...>|f__<...>|g__<...>`. For example, the string `D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;D_3__Actinomycetales;D_4__Actinomycetaceae;D_5__Actinotignum` was converted to `k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinotignum`;
- Some genera have inconsistent assignments to higher-level phylogeny across different databases. `Atopobium`, for example, can be assigned either to `k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Atopobium` or to `k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Coriobacteriales|f__Coriobacteriaceae|g__Atopobium`. In these cases, we unified the full taxonomy string to a single version;
- Genera entities that cannot be differentiated by their 16S sequences were aggregated into a single entity, to allow comparisons to WGSS datasets. For example, `Escherichia` and `Shigella` were merged into a single `Escherichia-Shigella` entity;
- Unclassified species/genera were aggregated into an "Unclassified" entity. If an entity was unclassified at the genus level but assigned to a higher-level clade, we left it as is and marked the unclassified phylogenetic levels with `__` (e.g. `k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|__`);
- Additional unification details and examples can be found in the [unify_genera.R script](TODO - add link).
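As an illustration of the re-formatting step, the following Python sketch converts a semicolon-delimited `D_0__...;D_1__...` string into the `k__...|p__...|...|g__...` pattern, and writes `__` for levels that are missing. It is a hedged approximation and not the logic of the unify_genera.R script; the function name and the choice to emit `__` for every missing or empty level are assumptions.

```python
# Rank prefixes for kingdom, phylum, class, order, family, genus.
RANK_PREFIXES = ["k", "p", "c", "o", "f", "g"]


def reformat_taxonomy(silva_string):
    """Convert 'D_0__A;D_1__B;...' into 'k__A|p__B|...|g__F'.

    Levels that are absent or empty are written as '__'
    (assumption based on the documented example).
    """
    names = []
    for part in silva_string.split(";"):
        part = part.strip()
        if not part:
            continue
        # Keep only the text after the first '__' (drops the 'D_<n>' prefix).
        _, _, name = part.partition("__")
        names.append(name)

    formatted = []
    for i, prefix in enumerate(RANK_PREFIXES):
        if i < len(names) and names[i]:
            formatted.append(f"{prefix}__{names[i]}")
        else:
            formatted.append("__")
    return "|".join(formatted)


print(reformat_taxonomy(
    "D_0__Bacteria;D_1__Actinobacteria;D_2__Actinobacteria;"
    "D_3__Actinomycetales;D_4__Actinomycetaceae;D_5__Actinotignum"
))
# k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinotignum
```

Resolving inconsistent higher-level assignments (such as the two `Atopobium` lineages) and merging indistinguishable genera would require curated lookup tables, which this sketch does not attempt.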
We finally normalized genus abundances in all datasets so that abundances sum to 1 in each sample (i.e. converted to relative abundances).
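The normalization to relative abundances amounts to dividing each value by its sample total. A minimal Python sketch, assuming a hypothetical nested-dict layout (`{genus: {sample: abundance}}`) and toy values, could look like this; returning 0.0 for samples whose total is zero is an assumption of the sketch, not something stated for the actual pipeline.

```python
def to_relative_abundance(genus_table):
    """Scale abundances so that each sample sums to 1.

    genus_table: {genus_name: {sample_id: abundance}}
    Returns a new table of the same shape.
    """
    # First pass: total abundance per sample across all genera.
    totals = {}
    for per_sample in genus_table.values():
        for sample, value in per_sample.items():
            totals[sample] = totals.get(sample, 0.0) + value

    # Second pass: divide each value by its sample total.
    # Assumption: samples with a zero total are left as 0.0.
    return {
        genus: {
            sample: (value / totals[sample] if totals[sample] > 0 else 0.0)
            for sample, value in per_sample.items()
        }
        for genus, per_sample in genus_table.items()
    }


counts = {
    "g__Bacteroides": {"S1": 30, "S2": 1},
    "g__Escherichia-Shigella": {"S1": 10, "S2": 3},
}
relative = to_relative_abundance(counts)
print(relative["g__Bacteroides"])  # {'S1': 0.75, 'S2': 0.25}
```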
Subject metadata files (`metadata`) include all information provided in the original publication (typically in the supplementary information; see [TODO] for dataset-specific details).
We specifically unified the names of the following fields:
Field name | Description |
---|---|
Sample | Sample identifier. Corresponds to sample names in feature tables |
Subject | Subject identifier. Some studies have multiple samples per subject |
Study.Group | Study group as named in original study (typically one of the groups would be named 'control' or 'healthy' and the other will be named after the studied disease/condition) |
Age | The subject's age |
Age.Units | One of: `Years`, `Months`, `Days` |
Gender | One of: `Male`, `Female`, `Other` |
BMI | The subject's BMI |
Note: if not provided in original metadata, the field will be missing from the table.
In addition, we added the following three fields to each metadata file:
Field name | Description |
---|---|
Dataset | The "Dataset" name is formatted as follows: `<First author>_<Short cohort description>_<Year of publication>` |
DOI | Publication DOI |
Publication.Name | Publication name |
- Some of the datasets are from longitudinal studies, meaning that they include multiple samples per subject. Depending on the analysis, you may want to consider dealing with these separately. See [TODO] for further details per dataset, including whether it is longitudinal or not.
- Cross-study comparisons can be performed using either KEGG or HMDB IDs as the link between metabolites in different datasets. For microbiome comparisons, genera tables can be used (genus names were unified), and if analyzing only shotgun datasets, species names can be used.
- A simple example of a cross-study comparison using this data collection, looking into the correlation between XX and the XXX bacteria XXX, can be found here [TODO: add link, maybe teaser plot?].
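As a rough illustration of linking metabolites across datasets through KEGG IDs, the Python sketch below finds the KEGG IDs shared by two `mtb.map`-like tables, optionally keeping only rows whose "High.Confidence.Annotation" is true. It is not the example referred to above; the function name, the `Compound` and `KEGG` column names, the list-of-dicts layout, and the toy identifiers are assumptions made for the sketch.

```python
def shared_kegg_compounds(map_a, map_b, high_confidence_only=True):
    """Return {kegg_id: (compounds_in_a, compounds_in_b)} for shared KEGG IDs.

    map_a, map_b: lists of dicts with the hypothetical keys 'Compound',
    'KEGG' and 'High.Confidence.Annotation'.
    """
    def index_by_kegg(rows):
        index = {}
        for row in rows:
            if not row.get("KEGG"):
                continue  # metabolite has no KEGG mapping
            if high_confidence_only and not row.get("High.Confidence.Annotation", False):
                continue  # skip lower-confidence annotations
            index.setdefault(row["KEGG"], []).append(row["Compound"])
        return index

    a, b = index_by_kegg(map_a), index_by_kegg(map_b)
    return {kegg: (a[kegg], b[kegg]) for kegg in sorted(a.keys() & b.keys())}


# Toy tables with made-up identifiers (not real KEGG IDs).
dataset_1 = [
    {"Compound": "NMR_Lactate", "KEGG": "K_TOY_2", "High.Confidence.Annotation": True},
    {"Compound": "NMR_Butyrate", "KEGG": "K_TOY_4", "High.Confidence.Annotation": True},
]
dataset_2 = [
    {"Compound": "lactic acid", "KEGG": "K_TOY_2", "High.Confidence.Annotation": True},
    {"Compound": "butyric acid", "KEGG": "K_TOY_4", "High.Confidence.Annotation": False},
]
print(shared_kegg_compounds(dataset_1, dataset_2))
# {'K_TOY_2': (['NMR_Lactate'], ['lactic acid'])}
```

The same pattern applies to HMDB IDs or unified genus names. For longitudinal datasets, samples from the same subject are not independent, so they may need to be handled separately before any cross-study statistics are computed.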
We thank all authors of the included studies for making their data publicly available and for responding to our questions.
If you use the data provided here, we kindly request that you cite both the original papers as well as: ...