[FEATURE REQUEST] Join overlap by ranges as well as metadata #107

@shalvichirmade

Feature request: add an argument to the join_overlap_intersect function that restricts overlaps to rows whose metadata values also match.

For some context, here is a hypothetical example:
I have two GRanges objects, one for introns and one for transcripts.

## intron GRanges
intron
# GRanges object with 1 range and 2 metadata columns:
#   seqnames              ranges strand |     type     transcript_id
#      <Rle>           <IRanges>  <Rle> | <factor>       <character>
#          1 100149098-100152384      - |   intron ENST00000370137.6


## transcript GRanges
trans
# GRanges object with 2 ranges and 2 metadata columns:
#   seqnames              ranges strand |   transcript_name   gene_name
#      <Rle>           <IRanges>  <Rle> |       <character> <character>
#          1 100148448-100178256      - | ENST00000370137.6      LRRC39
#          1 100133163-100150496      - | ENST00000370141.8      TRMT13

I want to join these GRanges objects so that I can annotate the intron GRanges with the gene_name metadata.

However, when I use join_overlap_left, the intron range overlaps both rows of trans, so the intron is returned twice:

intron <- join_overlap_left(intron, trans)
intron
# GRanges object with 2 ranges and 4 metadata columns:
#   seqnames              ranges strand |     type     transcript_id   transcript_name   gene_name
#      <Rle>           <IRanges>  <Rle> | <factor>       <character>       <character> <character>
#          1 100149098-100152384      - |   intron ENST00000370137.6 ENST00000370137.6      LRRC39
#          1 100149098-100152384      - |   intron ENST00000370137.6 ENST00000370141.8      TRMT13

The desired output would keep only the match with the trans row where trans$transcript_name == "ENST00000370137.6".

Here, the overlap should be based on the range as well as the metadata columns:

  • intron$transcript_id
  • trans$transcript_name
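
Until such an argument exists, one possible workaround is to join on ranges first and then filter on the metadata columns. The sketch below rebuilds the hypothetical objects from this issue and assumes plyranges and GenomicRanges are installed. It is not a built-in option of the join_overlap_* functions, only an illustration of the requested behaviour.

```r
library(GenomicRanges)
library(plyranges)

# Hypothetical objects reconstructed from the example in this issue
intron <- GRanges(
  seqnames = "1",
  ranges = IRanges(start = 100149098, end = 100152384),
  strand = "-",
  type = factor("intron"),
  transcript_id = "ENST00000370137.6"
)

trans <- GRanges(
  seqnames = "1",
  ranges = IRanges(
    start = c(100148448, 100133163),
    end   = c(100178256, 100150496)
  ),
  strand = "-",
  transcript_name = c("ENST00000370137.6", "ENST00000370141.8"),
  gene_name = c("LRRC39", "TRMT13")
)

# Step 1: join on overlapping ranges (returns one row per overlapping pair).
# Step 2: keep only the pairs whose metadata values also agree.
annotated <- intron %>%
  join_overlap_left(trans) %>%
  filter(transcript_id == transcript_name)

annotated$gene_name
```

Note that the filter step also drops introns with no overlapping transcript at all, because their transcript_name is NA after the left join. A dedicated argument in the join itself could avoid that and would not need to build all the unwanted pairs first.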

R session information

Remember to include your full R session information.

options(width = 120)
sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.3.2 (2023-10-31)
 os       macOS Sonoma 14.3
 system   x86_64, darwin20
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Toronto
 date     2024-02-08
 rstudio  2023.06.1+524 Mountain Hydrangea (desktop)
 pandoc   NA
