-
Notifications
You must be signed in to change notification settings - Fork 191
Description
Hi all,
Has anyone encountered an issue where tax_glom() appears to aggregate beyond the specified taxonomic level?
Problem description
I'm working with relative abundance data constructed from MetaPhlAn output. When I agglomerate by a specific taxonomic level, the resulting column sums—which should be ~1 if properly normalized—are much higher. For example:
ps_phylum <- tax_glom(physeq_obj, taxrank = "Phylum") # relative abundance data
This results in column sums of 7 at the Phylum level, 6 at Class, 5 at Order, etc.—suggesting values may be summed across lower levels rather than aggregated appropriately. However, inspecting each taxonomic level individually confirms that the original data sums to ~1 per sample, as expected for relative abundance.
I'm wondering if this behavior could be due to how I constructed the phyloseq object, especially since:
- I only have relative abundance data from MetaPhlAn (no raw counts)
- There's no tree
- The object includes just the otu_table, tax_table, and sample_data
# Formating relative abundance table
mpa_phyloSeq <- mpa_data / 100 # Convert from percentage to proportion
Step 2: Prepare sample metadata
metadata_metaG_phylo <- metadata_metagenomics %>%
dplyr::select(-Diagnosis, -CD_onset, -Country) %>%
inner_join(Expsm15mo_CaseID, by = "CaseID") %>%
column_to_rownames(var = "CaseTime")
correct_samples <- intersect(rownames(mpa_phyloSeq), rownames(metadata_metaG_phylo))
mpa_phyloSeq_filtered <- mpa_phyloSeq[correct_samples, , drop = FALSE]
metadata_metaG_phylo <- sample_data(metadata_metaG_phylo)
# Build OTU table
mpa_phyloSeq_filtered <- t(mpa_phyloSeq_filtered) # taxa as rows
OTU <- otu_table(mpa_phyloSeq_filtered, taxa_are_rows = TRUE)
Step 4: Construct tax_table from MetaPhlAn strings
otu_df <- as.data.frame(otu_table(physeq_MetaG))
tax_strings <- rownames(otu_df)
tax_split <- str_split(tax_strings, "\\|")
max_levels <- max(sapply(tax_split, length))
tax_split_padded <- lapply(tax_split, function(x) {
length(x) <- max_levels
return(x)
})
tax_matrix <- do.call(rbind, tax_split_padded)
colnames(tax_matrix) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain")[1:max_levels]
tax_matrix <- apply(tax_matrix, 2, str_trim)
tax_tab <- tax_table(tax_matrix)
rownames(tax_tab) <- rownames(otu_df)
Step 5: Assemble final phyloseq object
physeq_obj <- phyloseq(OTU, metadata_metaG_phylo, tax_tab)
Request for help
When I then take physeq_obj and try to tax_glom as follows:
ps_phylum<- tax_glom(physeq_obj, taxrank = "Phylum")
head(ps_phylum@otu_table)
I get:
OTU Table: [6 taxa and 533 samples]
taxa are rows
CASE-1_12 CASE-1_15 CASE-1_18 CASE-1_21
k__Bacteria|p__Actinobacteria 5.3338845 2.5500895 2.1597526 1.8547148
etc...
Which shouldn't be possible as relative abundances are confined to sum to 1. I think somehow, my tax_tab isn't pulling just phyla data, and is including the lower level taxa data?
Any insights or suggestions would be hugely appreciated! Thank you!