Skip to content

Tax_glom over aggregating beyond specified taxonomy #1782

@maxnoblemm

Description

@maxnoblemm

Hi all,

Has anyone encountered an issue where tax_glom() appears to aggregate beyond the specified taxonomic level?

Problem description
I'm working with relative abundance data constructed from MetaPhlAn output. When I agglomerate by a specific taxonomic level, the resulting column sums—which should be ~1 if properly normalized—are much higher. For example:

ps_phylum <- tax_glom(physeq_obj, taxrank = "Phylum")  # relative abundance data

This results in column sums of 7 at the Phylum level, 6 at Class, 5 at Order, etc.—suggesting values may be summed across lower levels rather than aggregated appropriately. However, inspecting each taxonomic level individually confirms that the original data sums to ~1 per sample, as expected for relative abundance.

I'm wondering if this behavior could be due to how I constructed the phyloseq object, especially since:

  • I only have relative abundance data from MetaPhlAn (no raw counts)
  • There's no tree
  • The object includes just the otu_table, tax_table, and sample_data
# Formating relative abundance table

mpa_phyloSeq <- mpa_data / 100  # Convert from percentage to proportion
Step 2: Prepare sample metadata

metadata_metaG_phylo <- metadata_metagenomics %>% 
  dplyr::select(-Diagnosis, -CD_onset, -Country) %>%
  inner_join(Expsm15mo_CaseID, by = "CaseID") %>%
  column_to_rownames(var = "CaseTime")

correct_samples <- intersect(rownames(mpa_phyloSeq), rownames(metadata_metaG_phylo))
mpa_phyloSeq_filtered <- mpa_phyloSeq[correct_samples, , drop = FALSE]
metadata_metaG_phylo <- sample_data(metadata_metaG_phylo)

# Build OTU table

mpa_phyloSeq_filtered <- t(mpa_phyloSeq_filtered)  # taxa as rows
OTU <- otu_table(mpa_phyloSeq_filtered, taxa_are_rows = TRUE)
Step 4: Construct tax_table from MetaPhlAn strings

otu_df <- as.data.frame(otu_table(physeq_MetaG))
tax_strings <- rownames(otu_df)
tax_split <- str_split(tax_strings, "\\|")
max_levels <- max(sapply(tax_split, length))

tax_split_padded <- lapply(tax_split, function(x) {
  length(x) <- max_levels
  return(x)
})

tax_matrix <- do.call(rbind, tax_split_padded)
colnames(tax_matrix) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species", "Strain")[1:max_levels]
tax_matrix <- apply(tax_matrix, 2, str_trim)

tax_tab <- tax_table(tax_matrix)
rownames(tax_tab) <- rownames(otu_df)
Step 5: Assemble final phyloseq object

physeq_obj <- phyloseq(OTU, metadata_metaG_phylo, tax_tab)

Request for help
When I then take physeq_obj and try to tax_glom as follows:

ps_phylum<- tax_glom(physeq_obj, taxrank = "Phylum")
head(ps_phylum@otu_table)

I get:

OTU Table:          [6 taxa and 533 samples]
                     taxa are rows
                                     CASE-1_12 CASE-1_15 CASE-1_18 CASE-1_21 
k__Bacteria|p__Actinobacteria        5.3338845 2.5500895 2.1597526 1.8547148 
etc...

Which shouldn't be possible as relative abundances are confined to sum to 1. I think somehow, my tax_tab isn't pulling just phyla data, and is including the lower level taxa data?

Any insights or suggestions would be hugely appreciated! Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions