[feature request] add .drop argument to group_by, e.g. to include empty seqlevels #95

@jayoung

Hi there,

I only just discovered your package - I'm excited to start being tidy with my GRanges!

Apologies if I missed something, but I think I am requesting an enhancement.

With tibbles, when I group on a factor, I can make sure the summary includes empty groups by passing .drop=FALSE to group_by. For GRanges, I don't see an equivalent way to keep the empty groups. Again, sorry if I missed it - I searched the documentation but didn't find anything.

I've provided code below that I think is a nice small example.

Thanks very much,

Janet Young

Malik lab,
Fred Hutch Cancer Research Center,
Seattle, WA

## here's how I include empty groups when summarizing a tibble
library(tidyverse)
fruit_tbl <- data.frame(fruit=factor( c("apple","apple","orange","pear"), 
                                      levels=c("apple","orange","pear","banana")),
                        weight=c(3,4,5,3)) %>% 
    as_tibble()
# we DO get output for 'banana', the empty group:
fruit_tbl %>% 
    group_by(fruit, .drop=FALSE) %>% 
    summarise(numFruits=n(), 
              mean=mean(weight))
#   A tibble: 4 × 3
#   fruit  numFruits  mean
#   <fct>      <int> <dbl>
# 1 apple          2   3.5
# 2 orange         1   5  
# 3 pear           1   3  
# 4 banana         0 NaN  

But I don't see a way to include empty groups in plyranges. Is that true? Sorry if I missed it. I am using plyranges_1.14.0 (release version). Here's what I tried (after restarting R to make sure tidyverse packages aren't loaded):

library(plyranges)

## make GRanges where not all factor levels are represented (for seqnames, also for regionType)
grng2 <- data.frame(seqnames = sample(c("chr1", "chr2"), 7, replace = TRUE),
                   strand = sample(c("+", "-"), 7, replace = TRUE),
                   gc = runif(7),
                   start = 1:7,
                   width = 10) %>%
    mutate(seqnames=factor(seqnames, levels=c("chr1", "chr2", "chr3"))) %>% 
    mutate(regionType=factor( sample(c("a", "b"), 7, replace = TRUE),
                              levels=c("a", "b", "c"))) %>% 
    as_granges()

## works, but we don't get summaries for the empty levels of seqlevel (chr3) or regionType (c):
grng2 %>% 
    group_by(seqnames) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# DataFrame with 2 rows and 3 columns
#   seqnames numRegions    meanGC
#      <Rle>  <integer> <numeric>
# 1     chr1          6  0.592756
# 2     chr2          1  0.664616


grng2 %>% 
    group_by(regionType) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# DataFrame with 2 rows and 3 columns
#   regionType numRegions    meanGC
#    <factor>  <integer> <numeric>
# 1          a          6  0.646677
# 2          b          1  0.341085

## can't use .drop like I would with a tibble
grng2 %>% 
    group_by(regionType, .drop=FALSE) %>% 
    summarize(numRegions=n(),
              meanGC=mean(gc))
# Error in new_grouping(.data, ...) : Column `.drop` is unknown
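
Until something like .drop is supported, one possible workaround is to leave GRanges for the summary step: convert to a plain data frame with as.data.frame() and call dplyr's group_by directly. This is only a sketch, not a plyranges feature. The object name gr_small and its values are made up for illustration, and the result is a tibble rather than a DataFrame.

```r
library(plyranges)

# Hypothetical small GRanges with unused factor levels ("chr3" and "c")
gr_small <- data.frame(
    seqnames   = factor(c("chr1", "chr1", "chr2"),
                        levels = c("chr1", "chr2", "chr3")),
    start      = 1:3,
    width      = 10,
    gc         = c(0.4, 0.6, 0.5),
    regionType = factor(c("a", "b", "a"), levels = c("a", "b", "c"))
) %>%
    as_granges()

# Convert to a plain data frame so dplyr's own group_by (with .drop) applies;
# dplyr:: is spelled out to avoid dispatching to the plyranges methods
by_type <- as.data.frame(gr_small) %>%
    dplyr::group_by(regionType, .drop = FALSE) %>%
    dplyr::summarise(numRegions = dplyr::n(),
                     meanGC     = mean(gc))

# The same pattern for seqnames; whether chr3 appears depends on the unused
# seqlevel having been kept when the GRanges was built
by_seq <- as.data.frame(gr_small) %>%
    dplyr::group_by(seqnames, .drop = FALSE) %>%
    dplyr::summarise(numRegions = dplyr::n(),
                     meanGC     = mean(gc))

print(by_type)
print(by_seq)
```

The empty regionType level "c" should come back with numRegions of 0 and a meanGC of NaN, matching the tibble behaviour shown earlier. The cost is that the summary is computed outside the GRanges-aware grouping machinery.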

And here's my R session information:

library(sessioninfo)
sessioninfo::session_info()
─ Session info ─────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.2.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Los_Angeles
 date     2022-02-23
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   NA

─ Packages ─────────────────────────────────────────────────────────────────────────────────
 package              * version  date (UTC) lib source
 assertthat             0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
 Biobase                2.54.0   2021-10-26 [1] Bioconductor
 BiocGenerics         * 0.40.0   2021-10-26 [1] Bioconductor
 BiocIO                 1.4.0    2021-10-26 [1] Bioconductor
 BiocParallel           1.28.3   2021-12-09 [1] Bioconductor
 Biostrings             2.62.0   2021-10-26 [1] Bioconductor
 bitops                 1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
 cli                    3.1.1    2022-01-20 [1] CRAN (R 4.1.2)
 crayon                 1.4.2    2021-10-29 [1] CRAN (R 4.1.0)
 DBI                    1.1.2    2021-12-20 [1] CRAN (R 4.1.1)
 DelayedArray           0.20.0   2021-10-26 [1] Bioconductor
 dplyr                  1.0.7    2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis               0.3.2    2021-04-29 [1] CRAN (R 4.1.0)
 fansi                  1.0.2    2022-01-14 [1] CRAN (R 4.1.2)
 generics               0.1.2    2022-01-31 [1] CRAN (R 4.1.2)
 GenomeInfoDb         * 1.30.0   2021-10-26 [1] Bioconductor
 GenomeInfoDbData       1.2.7    2021-11-16 [1] Bioconductor
 GenomicAlignments      1.30.0   2021-10-26 [1] Bioconductor
 GenomicRanges        * 1.46.1   2021-11-18 [1] Bioconductor
 glue                   1.6.1    2022-01-22 [1] CRAN (R 4.1.2)
 IRanges              * 2.28.0   2021-10-26 [1] Bioconductor
 lattice                0.20-45  2021-09-22 [1] CRAN (R 4.1.2)
 lifecycle              1.0.1    2021-09-24 [1] CRAN (R 4.1.0)
 magrittr               2.0.2    2022-01-26 [1] CRAN (R 4.1.2)
 Matrix                 1.4-0    2021-12-08 [1] CRAN (R 4.1.0)
 MatrixGenerics         1.6.0    2021-10-26 [1] Bioconductor
 matrixStats            0.61.0   2021-09-17 [1] CRAN (R 4.1.0)
 pillar                 1.7.0    2022-02-01 [1] CRAN (R 4.1.2)
 pkgconfig              2.0.3    2019-09-22 [1] CRAN (R 4.1.0)
 plyranges            * 1.14.0   2021-10-26 [1] Bioconductor
 purrr                  0.3.4    2020-04-17 [1] CRAN (R 4.1.0)
 R6                     2.5.1    2021-08-19 [1] CRAN (R 4.1.0)
 RCurl                  1.98-1.5 2021-09-17 [1] CRAN (R 4.1.0)
 restfulr               0.0.13   2017-08-06 [1] CRAN (R 4.1.0)
 rjson                  0.2.21   2022-01-09 [1] CRAN (R 4.1.2)
 rlang                  1.0.1    2022-02-03 [1] CRAN (R 4.1.2)
 Rsamtools              2.10.0   2021-10-26 [1] Bioconductor
 rstudioapi             0.13     2020-11-12 [1] CRAN (R 4.1.0)
 rtracklayer            1.54.0   2021-10-26 [1] Bioconductor
 S4Vectors            * 0.32.3   2021-11-21 [1] Bioconductor
 sessioninfo          * 1.2.2    2021-12-06 [1] CRAN (R 4.1.0)
 SummarizedExperiment   1.24.0   2021-10-26 [1] Bioconductor
 tibble                 3.1.6    2021-11-07 [1] CRAN (R 4.1.0)
 tidyselect             1.1.1    2021-04-30 [1] CRAN (R 4.1.0)
 utf8                   1.2.2    2021-07-24 [1] CRAN (R 4.1.0)
 vctrs                  0.3.8    2021-04-29 [1] CRAN (R 4.1.0)
 XML                    3.99-0.8 2021-09-17 [1] CRAN (R 4.1.0)
 XVector                0.34.0   2021-10-26 [1] Bioconductor
 yaml                   2.2.2    2022-01-25 [1] CRAN (R 4.1.2)
 zlibbioc               1.40.0   2021-10-26 [1] Bioconductor

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library
