GDCquery_clinic not working for TCGA projects #639

Open
hnsc-a11y opened this issue Jan 31, 2025 · 21 comments
Comments

@hnsc-a11y

hnsc-a11y commented Jan 31, 2025

I tried to download clinical data from TCGA projects, but it returned an error message (please see below). It happened for several TCGA tumor types.

clinical_brca <- GDCquery_clinic("TCGA-BRCA", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 1098 items to be assigned to 1343 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

clinical_hnsc <- GDCquery_clinic("TCGA-HNSC", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 528 items to be assigned to 768 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

clinical_gbm <- GDCquery_clinic("TCGA-GBM", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 617 items to be assigned to 1175 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

Any chance you could look into this issue? Thanks.

Version: 2.32.0
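
The length mismatch in these errors can be reproduced on toy data. The sketch below assumes (this is not confirmed anywhere in this thread) that the API now returns several rows per case, for example one per diagnosis or follow-up record, so a vector with one submitter ID per case no longer fits the table. It uses a base R data frame rather than data.table, so the error text differs from the one above, and all names and values are made up for illustration.

```r
# Toy illustration only: three cases, one of which has two records.
# Assumption: the real mismatch (e.g. 1098 IDs vs 1343 rows) has the same shape.
case_ids <- c("CASE-A", "CASE-B", "CASE-C")   # one ID per case (hypothetical)
rows_per_case <- c(1L, 2L, 1L)                # records returned for each case

records <- data.frame(record = seq_len(sum(rows_per_case)))  # 4 rows

# Assigning 3 IDs to 4 rows fails with a length-mismatch error.
msg <- tryCatch(
  {
    records$submitter_id <- case_ids
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Repeating each ID once per record makes the lengths agree.
records$submitter_id <- rep(case_ids, times = rows_per_case)
print(records)
```

Whether the package should repeat IDs like this or collapse the extra rows instead is a design decision for the maintainers; the sketch only shows why the lengths disagree.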

tiagochst added a commit that referenced this issue Jan 31, 2025
@SNN0

SNN0 commented Jan 31, 2025

I have the same problem

@tiagochst
Contributor

Hi,

Thank you for the bug report. It seems there were changes in the API/data retrieved that broke the code.
Could you reinstall the devel version from GitHub and try again?

If anyone finds other issues, please let me know. I will need to rewrite the code due to those changes.

remotes::install_github("https://github.com/BioinformaticsFMRP/TCGAbiolinks",ref = "devel")

@SNN0

SNN0 commented Jan 31, 2025

Hi,

The function is working, and I can access the clinical data. For example, I retrieved the clinical data for the BRCA cohort, but there are too many NA values in days_to_last_follow_up. I have worked with this cohort before, and it seems like there weren’t this many NA values. Could there be an error here?

@tiagochst
Contributor

tiagochst commented Jan 31, 2025 via email

@SNN0

SNN0 commented Jan 31, 2025

I really hope this issue gets resolved soon because I don’t want my project to be delayed. Do you know of any other way to access this data? I have written many of my functions based on this output. Also, can we manually access this data from the GDC portal in the same way? @tiagochst

@tiagochst
Contributor

tiagochst commented Jan 31, 2025 via email

@hnsc-a11y
Author

@tiagochst Thank you so much for the prompt response and the effort in resolving the issue!

I think I may need to download the data manually for now, but for some reason I can't see this screenshot image from your previous post. Any chance you can upload it again? Thanks!

This data should be the same one downloaded in this TSV file if you want to
download manually.
[image: Screenshot 2025-01-31 at 5.03.07 PM.png]

@tiagochst
Contributor

@hnsc-a11y
Image

@femiogundare

Hi,

I'm facing the same issue.

@ramyapurkanti

ramyapurkanti commented Feb 3, 2025

Hello all,

I just found the columns paper_Follow.up.days and paper_Days.to.death in the colData of the SummarizedExperiment object returned by GDCprepare (together with many other clinical data attributes).

library(TCGAbiolinks)
library(SummarizedExperiment)  # provides colData()
library(magrittr)              # provides %>%

query_paad_all <- GDCquery(
  project = "TCGA-PAAD",
  data.category = "Transcriptome Profiling",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  data.type = "Gene Expression Quantification",
  sample.type = "Primary Tumor",
  access = "open")
# TCGAbiolinks_dir is the folder the files were previously downloaded to with GDCdownload()
tcga_paad_data <- GDCprepare(query_paad_all, summarizedExperiment = TRUE, directory = TCGAbiolinks_dir)

# Sample-level annotation as a plain data frame
tcga_paad_coldata <- colData(tcga_paad_data) %>% as.data.frame()
tcga_paad_coldata$paper_Follow.up.days
tcga_paad_coldata$paper_Days.to.death

Can someone confirm whether this is equivalent to the clinical data we used to pull separately using GDCquery_clinic? Sadly, I did not save the R data frames I pulled earlier, so I cannot check myself.

Thanks,
Ramya

@tiagochst
Contributor

tiagochst commented Feb 4, 2025 via email

@YeHW

YeHW commented Feb 14, 2025

I just got some answers from GDC. I still need to check which changes need
to be made.

  1. The GDC just deployed our latest data release, in which almost all of
    the TCGA clinical data was indexed in the API instead of being largely
    available in the supplemental files.
  2. I think the days_to_last_followup field has been moved to the
    follow_up node as "days_to_followup". Add "follow_up" to your expand
    fields and you should see it. I think you will need to add an extra step
    to determine which follow_up has the largest value.
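
The extra step from point 2 might look like the sketch below. It assumes the expanded response gives each case a set of follow-up records, each with a days_to_follow_up value; the toy list and the helper name latest_follow_up are made up here and are not part of TCGAbiolinks or the GDC API.

```r
# Hypothetical nested structure: one element per case, each holding the
# days_to_follow_up values of that case's follow-up records.
follow_ups <- list(
  "CASE-A" = c(30, 365, 120),
  "CASE-B" = c(NA, 200),
  "CASE-C" = numeric(0)          # a case with no follow-up records
)

# Largest non-missing value per case; NA when nothing is recorded.
latest_follow_up <- function(days) {
  days <- days[!is.na(days)]
  if (length(days) == 0) NA_real_ else max(days)
}

days_to_last_follow_up <- vapply(follow_ups, latest_follow_up, numeric(1))
print(days_to_last_follow_up)
```

Dropping NA values before calling max() avoids both the warning that max() gives on an empty vector and a single missing record hiding the other values for that case.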

This data should be the same one downloaded in this TSV file if you want to
download manually.
[image: Screenshot 2025-01-31 at 5.03.07 PM.png]


Hi @tiagochst,

I ran into the same issue with the TCGA-DLBC project: all "days_to_last_follow_up" values are NA.
I'm new to TCGAbiolinks. How exactly should I add "follow_up" to the expand fields?
Does this mean I should download the TSV/JSON from the GDC Data Portal and then use the max value of the "days_to_follow_up" column for each case_submitter_id in "follow_up.tsv"?

Thanks!
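
For the manual route asked about here, a sketch of the per-case maximum could look like this. It writes a small made-up follow_up.tsv to a temporary file so that it runs on its own. The column names case_submitter_id and days_to_follow_up come from the question above, while the '-- placeholder for missing values is an assumption about the portal export that should be checked against the real file.

```r
# Made-up stand-in for the portal's follow_up.tsv (not real TCGA data).
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "case_submitter_id\tdays_to_follow_up",
  "CASE-A\t30",
  "CASE-A\t365",
  "CASE-B\t'--",
  "CASE-B\t200",
  "CASE-C\t'--"
), tsv_path)

# Assumption: missing values are written as '-- in the export.
follow_up <- read.delim(tsv_path, na.strings = c("'--", "NA", ""),
                        stringsAsFactors = FALSE)

# Keep rows with a recorded value, then take the maximum per case.
recorded <- follow_up[!is.na(follow_up$days_to_follow_up), ]
last_follow_up <- aggregate(days_to_follow_up ~ case_submitter_id,
                            data = recorded, FUN = max)
names(last_follow_up)[2] <- "days_to_last_follow_up"

# Merge back so cases without any recorded follow-up stay in as NA.
all_cases <- data.frame(case_submitter_id = unique(follow_up$case_submitter_id))
last_follow_up <- merge(all_cases, last_follow_up, all.x = TRUE)
print(last_follow_up)
```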

@ramyapurkanti

An update: I pulled out the follow_ups node details, but follow_ups.days_to_follow_up does not contain a single integer per case; it holds one value for each follow-up event. I could not find a ready-made column for days_to_last_follow_up anywhere (the one within diagnoses is all NA). I haven't yet checked whether the max of all values in the follow_ups.days_to_follow_up column matches the value we get by downloading manually.

In case someone wants to check up, the commands to pull out the clinical data are:

library(GenomicDataCommons)
metadata <- cases() %>%
    GenomicDataCommons::filter( project.project_id == 'TCGA-GBM') %>%
    GenomicDataCommons::select(c(default_fields('cases'), "demographic.vital_status", "demographic.days_to_death", "diagnoses.days_to_last_follow_up", GenomicDataCommons::grep_fields('cases', 'follow'))) %>%
    results_all()

@ptranvan

Hello,

Is there any plan to reintegrate the field days_to_last_follow_up when getting data with:

query <- GDCquery(
  project = id,
  data.category = "Transcriptome Profiling", 
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  access = "open",
  workflow.type = "STAR - Counts"
)

data_full <- GDCprepare(query, summarizedExperiment = TRUE)
colnames(colData(data_full))

Thanks.

@888learn

Have you solved this problem? I ran into the same problem.

@mass-a

mass-a commented Mar 20, 2025

Dear developers,

the issue with empty days_to_last_follow_up seems to persist:

clin <- GDCquery_clinic("TCGA-LUAD", "clinical")
table(clin$days_to_last_follow_up, useNA="ifany")
<NA> 
 585 

Though the majority of patients are alive:

table(clin$vital_status, useNA="ifany")
Alive  Dead  <NA> 
  334   188    63 
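
A quick way to see which clinical columns are affected is to compute the share of missing values per column. The sketch below runs on a made-up data frame standing in for the GDCquery_clinic() result; with real data, clin would take the place of toy_clin.

```r
# Made-up stand-in for the clinical table (not real TCGA data).
toy_clin <- data.frame(
  submitter_id = c("CASE-A", "CASE-B", "CASE-C", "CASE-D"),
  vital_status = c("Alive", "Dead", "Alive", NA),
  days_to_last_follow_up = c(NA_real_, NA_real_, NA_real_, NA_real_),
  days_to_death = c(NA, 412, NA, NA),
  stringsAsFactors = FALSE
)

# Fraction of NA values in each column, worst first.
na_share <- sort(colMeans(is.na(toy_clin)), decreasing = TRUE)
print(na_share)

# Columns that carry no information at all.
all_na_cols <- names(na_share)[na_share == 1]
print(all_na_cols)
```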

I am using the 'dev' version of TCGAbiolinks (2.35.3). Any plans for a fix?

Thanks!
-m

@DijkJel

DijkJel commented Apr 7, 2025

Is there an update on this issue? I have tried to download the TCGA-COAD data today (April 7) with the developer version (version 2.35.3, installed from github) of TCGAbiolinks and the 'days_to_last_follow_up' column is still absent from the clinical data.

remotes::install_github("BioinformaticsFMRP/TCGAbiolinks")
library('TCGAbiolinks')

query = TCGAbiolinks::GDCquery(project = 'TCGA-COAD',
                               data.category = "Transcriptome Profiling",
                               data.type = "Gene Expression Quantification",
                               workflow.type = "STAR - Counts")

TCGAbiolinks::GDCdownload(query, directory = paste0(directory, '/GDCdata'), files.per.chunk = 1)
data = TCGAbiolinks::GDCprepare(query, directory = paste0(directory, '/GDCdata'))

'days_to_last_follow_up' %in% colnames(colData(data))  ##Returns FALSE
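
When the exact column name is uncertain, it can help to list every column whose name mentions follow-up or death rather than testing for a single name. The sketch below works on a made-up vector of column names; with real data it would be colnames(colData(data)). The names in the toy vector are taken from this thread and are not a claim about what any given release contains.

```r
# Made-up column names standing in for colnames(colData(data)).
cols <- c("barcode", "vital_status", "days_to_death",
          "paper_Follow.up.days", "paper_Days.to.death")

# Case-insensitive match on "follow up" or "death";
# [._]? allows both follow_up and Follow.up spellings.
survival_cols <- grep("follow[._]?up|death", cols, ignore.case = TRUE, value = TRUE)
print(survival_cols)

# The single-name test from the snippet above, for comparison.
print("days_to_last_follow_up" %in% cols)
```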

@mass-a

mass-a commented Apr 29, 2025

Thanks @tiagochst for the fix in v2.35.4! Days to follow-up are back.
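
To check whether an installation is at least at that version, the installed version can be compared with 2.35.4, the version named in this comment. This is a minimal sketch; the helper name has_follow_up_fix is made up, and the commented-out line only works where TCGAbiolinks is installed.

```r
# Version reported in this thread as bringing the follow-up days back.
fixed_in <- package_version("2.35.4")

# Hypothetical helper: TRUE when a version string is at or above the fix.
has_follow_up_fix <- function(version) {
  package_version(version) >= fixed_in
}

print(has_follow_up_fix("2.32.0"))  # version from the original report
print(has_follow_up_fix("2.35.3"))  # version where the column was still empty
print(has_follow_up_fix("2.35.4"))

# With the package installed, check the local copy like this:
# has_follow_up_fix(as.character(utils::packageVersion("TCGAbiolinks")))
```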

@Rutwik-Garge

Rutwik-Garge commented May 1, 2025

I am trying to carry out differential gene expression analysis between drug-sensitive and drug-resistant patient samples. To download the clinical data, I ran the following code to update TCGAbiolinks:
remotes::install_github("https://github.com/BioinformaticsFMRP/TCGAbiolinks",ref = "devel")
However, many columns in the clinical data still show NA in RStudio. Could you help me get the drug treatment and drug response data from TCGA? My project is completely blocked by this issue.

@mass-a

mass-a commented May 1, 2025

@Rutwik-Garge try installing from the master branch. It seems the 'devel' branch is behind master by a few commits.

@Rutwik-Garge

@Rutwik-Garge try installing from the master branch. It seems the 'devel' branch is behind master by a few commits.

I am still getting columns with NA in the data.
