-
Notifications
You must be signed in to change notification settings - Fork 18
Description
I am not sure if this is intented or if I am missunderstanding join_nearest_downstream().
I have a set of regions that are unstranded (they come from ChIPseq data). I am trying to find the genes that have any of these regions upstream of their transcription start site (or overlap with their coding region). These regions are unstranded but I would like to obtain only the nearest gene downstream of them (meaning that if there is a nearer gene, but the region is downstream of that gene, this gene is ignored).
I use join_nearest_downstream() as it should take into account the strandness of the gene regions and I assume it ignores the strandness of the "x" region set as per the documentation: "method will find arbitrary nearest neighbour ranges on x that are upstream of those on y", but it seems this is not the case. I should find at least one gene for each of the regions in the dataset (these genes might be duplicated if they have several regions in their upstream region, but I can deal with that).
Is this the correct way of doing it ?
The issue is that for some of these regions, no downstream gene is found. When I visualize them or manually check if there is a gene close, I can find it. I can also use join_nearest() to find them, but that has some limitations (see below).
I am working with Arabidopsis data, so as genes I am using these data:
ath_coding_genes <- read_gff("./Arabidopsis_thaliana.TAIR10.57.gff3.gz") %>%
filter(type == "gene")
Then I try to find the nearest gene that is downstream of my regions:
my_regions <- dget("original_data.txt") #these regions are unstranded
genes_downstream <- plyranges::join_nearest_downstream(my_regions, ath_coding_genes, distance = TRUE)
This works as expected for many genes, but in some cases it doesnt. I checked some genes that shuold have a peak upstream but they do not show in the "joined" dataset:
test_peak <- "Merged-1-19698653-1"
test_gene <- "AT1G52890"
> my_regions %>% filter(name == test_peak)
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name score
<Rle> <IRanges> <Rle> | <character> <numeric>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 1
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> ath_coding_genes %>% filter(gene_id == test_gene) %>% plyranges::select(gene_id)
GRanges object with 1 range and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
[1] 1 19696923-19698482 - | AT1G52890
-------
seqinfo: 7 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> genes_downstream %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 0 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
Both regions are 120 nt apart:
> reg1 <- my_regions %>% filter(name == test_peak)
> reg2 <- ath_coding_genes %>% filter(gene_id == test_gene)
> distance(reg1, reg2)
[1] 120
If I use join_nearest() instead, they are joined. However, using nearest will also join some genes that have these regions downstream, and I only want the genes that have these regions upstream.
> nearest_genes <- plyranges::join_nearest(my_regions, ath_coding_genes, distance = TRUE)
> nearest_genes %>% filter(name == test_peak) %>% plyranges::select(c(name, gene_id))
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths
> nearest_genes %>% filter(gene_id == test_gene) %>% plyranges::select(c(name, gene_id))
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | name gene_id
<Rle> <IRanges> <Rle> | <character> <character>
[1] 1 19698603-19698703 * | Merged-1-19698653-1 AT1G52890
[2] 1 19698483-19698583 * | Merged-1-19698533-1 AT1G52890
-------
seqinfo: 5 sequences from an unspecified genome; no seqlengths