Skip to content

[BUG] Inconsistent behaviour of join_nearest_downstream() with one unstranded range set? #106

@eggrandio

Description

@eggrandio

I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().

I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).

I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).

Is this the correct way of doing it ?

The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).

I am working with Arabidopsis data, so as genes I am using these data:

ath_coding_genes <- read_gff("./Arabidopsis_thaliana.TAIR10.57.gff3.gz") %>% 
  filter(type == "gene")

Then I try to find the nearest gene that is downstream of my regions:

my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)

This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:

test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"

> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     score
         <Rle>         <IRanges>  <Rle> |         <character> <numeric>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1         1
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
      seqnames            ranges strand |     gene_id
         <Rle>         <IRanges>  <Rle> | <character>
  [1]        1 19696923-19698482      - |   AT1G52890
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name     gene_id
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Both regions are 120 nt apart:

> reg1 <- my_regions %>% filter(name == test_peak)
> reg2 <- ath_coding_genes %>% filter(gene_id == test_gene)
> distance(reg1, reg2)
[1] 120

If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.

> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |                name     gene_id
         <Rle>         <IRanges>  <Rle> |         <character> <character>
  [1]        1 19698603-19698703      * | Merged-1-19698653-1   AT1G52890
  [2]        1 19698483-19698583      * | Merged-1-19698533-1   AT1G52890
  -------
  seqinfo: 5 sequences from an unspecified genome; no seqlengths

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions