GDCquery_clinic not working for TCGA projects #639

hnsc-a11y · 2025-01-31T18:53:45Z

I tried to download clinical data from TCGA projects but it returned an error message (please see below). It happened for several TCGA tumor types.

clinical_brca <- GDCquery_clinic("TCGA-BRCA", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 1098 items to be assigned to 1343 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

clinical_hnsc <- GDCquery_clinic("TCGA-HNSC", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 528 items to be assigned to 768 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

clinical_gbm <- GDCquery_clinic("TCGA-GBM", type = "clinical")
Error in set(x, j = name, value = value) :
Supplied 617 items to be assigned to 1175 items of column 'submitter_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

Any chance you could look into this issue? Thanks.

Version: 2.32.0

SNN0 · 2025-01-31T19:23:07Z

I have the same problem

tiagochst · 2025-01-31T19:24:55Z

Hi,

Thank you for the bug report. It seems there were changes in the API/data retrieved that broke the code.
Could you reinstall the devel version from GitHub and try again?

If anyone finds other issues, please let me know. I will need to rewrite the code due to those changes.

remotes::install_github("https://github.com/BioinformaticsFMRP/TCGAbiolinks",ref = "devel")

SNN0 · 2025-01-31T19:49:22Z

Hi,

The function is working, and I can access the clinical data. For example, I retrieved the clinical data for the BRCA cohort, but there are too many NA values in days_to_last_follow_up. I have worked with this cohort before, and it seems like there weren’t this many NA values. Could there be an error here?

tiagochst · 2025-01-31T19:56:29Z

It seems to be an issue on the API side. I have null for days_to_last_follow_up for all the TCGA-GBM samples. [image: Screenshot 2025-01-31 at 2.54.32 PM.png] I will send an email to GDC regarding this issue.

SNN0 · 2025-01-31T20:40:48Z

I really hope this issue gets resolved soon because I don’t want my project to be delayed. Do you know of any other way to access this data? I have written many of my functions based on this output. Also, can we manually access this data from the GDC portal in the same way? @tiagochst

tiagochst · 2025-01-31T22:04:54Z

I just got some answers from GDC. I still need to check which changes need to be made.

1. The GDC just deployed our latest data release, in which almost all of the TCGA clinical data was indexed in the API instead of being largely available in the supplemental files.
2. I think the days_to_last_followup field has been moved to the follow_up node as "days_to_followup". Add "follow_up" to your expand fields and you should see it. I think you will need to add an extra step to determine which follow_up has the largest value.

This data should be the same one downloaded in this TSV file if you want to download manually. [image: Screenshot 2025-01-31 at 5.03.07 PM.png]
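The "extra step" mentioned above could look roughly like the sketch below. It is not the TCGAbiolinks implementation. The two toy tables, their values, the `submitter_id` join key and the helper `max_or_na` are assumptions made for illustration; only the idea of keeping the largest follow-up value per case comes from the GDC reply.

```r
# Toy follow-up records: one row per follow-up event (values are made up).
follow_up <- data.frame(
  submitter_id = c("CASE-1", "CASE-1", "CASE-2", "CASE-3"),
  days_to_follow_up = c(120, 450, 300, NA),
  stringsAsFactors = FALSE
)

# Toy clinical table with one row per case.
clinical <- data.frame(
  submitter_id = c("CASE-1", "CASE-2", "CASE-3"),
  vital_status = c("Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: largest non-missing value, or NA when nothing is usable.
max_or_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_real_ else max(x)
}

# tapply() returns one value per case, named by submitter_id.
latest <- tapply(follow_up$days_to_follow_up, follow_up$submitter_id, max_or_na)

# Attach the per-case maximum to the clinical table under the old column name.
clinical$days_to_last_follow_up <- as.vector(latest[clinical$submitter_id])
print(clinical)
```

Cases whose follow-up records are all missing keep NA rather than being dropped, so the clinical table retains one row per case.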

hnsc-a11y · 2025-01-31T22:16:30Z

@tiagochst Thank you so much for the prompt response and the effort in resolving the issue!

I think I may need to download the data manually for now, but for some reason I can't see this screenshot image from your previous post. Any chance you can upload it again? Thanks!

This data should be the same one downloaded in this TSV file if you want to
download manually.
[image: Screenshot 2025-01-31 at 5.03.07 PM.png]

tiagochst · 2025-01-31T22:40:28Z

@hnsc-a11y

femiogundare · 2025-02-01T22:01:41Z

Hi,

I'm facing the same issue.

ramyapurkanti · 2025-02-03T16:48:47Z

Hello all,

I just found the columns paper_Follow.up.days and paper_Days.to.death in the colData of the SummarizedExperiment object returned by GDCprepare (together with many other clinical data attributes).

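library(TCGAbiolinks)
library(SummarizedExperiment)  # provides colData()
library(dplyr)                 # provides %>%

# Query open-access STAR count files for primary pancreatic tumors
query_paad_all <- GDCquery(
  project = "TCGA-PAAD",
  data.category = "Transcriptome Profiling",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  data.type = "Gene Expression Quantification",
  sample.type = "Primary Tumor",
  access = "open")

# TCGAbiolinks_dir is assumed to point at the directory holding the downloaded files
tcga_paad_data <- GDCprepare(query_paad_all, summarizedExperiment = TRUE, directory = TCGAbiolinks_dir)

# Sample-level metadata as a plain data frame
tcga_paad_coldata <- colData(tcga_paad_data) %>% as.data.frame()
tcga_paad_coldata$paper_Follow.up.days
tcga_paad_coldata$paper_Days.to.death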

Can someone confirm if this is equivalent to the clinical data we used to pull separately using GDCquery_clinic ? I sadly did not save the Rdataframes I pulled earlier so cannot check myself.

Thanks,
Ramya

tiagochst · 2025-02-04T14:59:23Z

Hi Ramya, Every column with the prefix paper_ in the metadata is pulled from the supplemental files of the articles published by the TCGA analysis groups. If the clinical data continues to be updated, it will differ from what was published years ago.
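To keep the two sources apart, the publication-derived columns can be separated by their prefix. The sketch below uses a made-up `coldata` table; only the `paper_` prefix and the two `paper_` column names come from this thread.

```r
# Toy sample metadata mixing GDC-derived and publication-derived columns.
coldata <- data.frame(
  barcode = c("S1", "S2"),
  vital_status = c("Alive", "Dead"),
  paper_Follow.up.days = c(512, NA),
  paper_Days.to.death = c(NA, 230),
  stringsAsFactors = FALSE
)

# Columns starting with "paper_" come from the published supplemental files.
is_paper <- startsWith(names(coldata), "paper_")
paper_cols <- names(coldata)[is_paper]
gdc_cols <- names(coldata)[!is_paper]

print(paper_cols)
print(gdc_cols)
```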

YeHW · 2025-02-14T15:39:29Z

Hi @tiagochst,

I faced the same issue with the TCGA-DLBC project: all "days_to_last_follow_up" values are NA.
I'm new to TCGAbiolinks. How exactly should I add "follow_up" to the expand fields?
Does this mean I should download the TSV/JSON from the GDC Data Portal, then use the max value of the "days_to_follow_up" column for each case_submitter_id in "follow_up.tsv"?

Thanks!
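For the manual route asked about above, a minimal sketch of the per-case maximum might look like this. The inline text stands in for a downloaded follow_up.tsv. The case IDs and values are invented, and the `'--` placeholder for missing values is an assumption about the export that should be checked against the real file.

```r
# Stand-in for the contents of a manually downloaded follow_up.tsv.
# The "'--" placeholder for missing values is an assumption about the export.
tsv_text <- paste(
  "case_submitter_id\tdays_to_follow_up",
  "TCGA-XX-0001\t100",
  "TCGA-XX-0001\t365",
  "TCGA-XX-0002\t'--",
  "TCGA-XX-0003\t42",
  sep = "\n"
)

# For a real file, replace `text = tsv_text` with the path to the TSV.
follow_up <- read.delim(
  text = tsv_text,
  na.strings = c("'--", "NA", ""),
  quote = "",
  stringsAsFactors = FALSE
)

# Drop rows without a follow-up time, then keep the largest value per case.
observed <- follow_up[!is.na(follow_up$days_to_follow_up), ]
last_follow_up <- aggregate(
  days_to_follow_up ~ case_submitter_id,
  data = observed,
  FUN = max
)
names(last_follow_up)[2] <- "days_to_last_follow_up"
print(last_follow_up)
```

Cases with no recorded follow-up time do not appear in the result, so a left join back onto the full case list is needed if those cases should be kept with NA.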

ramyapurkanti · 2025-02-21T17:03:03Z

An update: I pulled out the follow_ups node details, but follow_ups.days_to_follow_up does not contain a single integer per case; it holds one value for each follow-up event. I could not find a ready-made column for days_to_last_follow_up anywhere (the one within diagnoses is all NA). I haven't yet checked whether the max of all values in the follow_ups.days_to_follow_up column matches the value we get by downloading manually.

In case someone wants to check up, the commands to pull out the clinical data are:

library(GenomicDataCommons)
library(magrittr)  # provides %>%
# select() takes the fields as a single character vector, hence c(...)
metadata <- cases() %>%
    GenomicDataCommons::filter(project.project_id == 'TCGA-GBM') %>%
    GenomicDataCommons::select(c(default_fields('cases'), "demographic.vital_status", "demographic.days_to_death", "diagnoses.days_to_last_follow_up", GenomicDataCommons::grep_fields('cases', 'follow'))) %>%
    results_all()

ptranvan · 2025-03-10T15:26:22Z

Hello,

Is there any plan to reintegrate the field days_to_last_follow_up when getting data with:

query <- GDCquery(
  project = id,
  data.category = "Transcriptome Profiling", 
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  access = "open",
  workflow.type = "STAR - Counts"
)

data_full <- GDCprepare(query, summarizedExperiment = TRUE)
colnames(colData(data_full))

Thanks.

888learn · 2025-03-12T16:51:11Z

Have you solved this problem? I ran into the same one.

mass-a · 2025-03-20T14:17:48Z

Dear developers,

the issue with empty days_to_last_follow_up seems to persist:

clin <- GDCquery_clinic("TCGA-LUAD", "clinical")
table(clin$days_to_last_follow_up, useNA="ifany")

<NA> 
 585

Though the majority of patients are alive:

table(clin$vital_status, useNA="ifany")

Alive  Dead  <NA> 
  334   188    63

I am using the 'dev' version of TCGAbiolinks (2.35.3). Any plans for a fix?

Thanks!
-m
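The reason the empty column matters for patients who are alive: in a typical survival setup their time-to-event is censored at the last follow-up. A rough sketch of that derivation on made-up data follows. The names `os_time` and `os_event` are illustrative and not part of TCGAbiolinks; the three input column names are the ones used in this thread.

```r
# Toy clinical table (values are made up).
clin <- data.frame(
  vital_status = c("Alive", "Dead", "Alive", NA),
  days_to_death = c(NA, 420, NA, NA),
  days_to_last_follow_up = c(900, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Time is days_to_death for deceased patients, last follow-up otherwise.
clin$os_time <- ifelse(
  clin$vital_status %in% "Dead",
  clin$days_to_death,
  clin$days_to_last_follow_up
)

# Event indicator: 1 = death observed, 0 = censored, NA = status unknown.
clin$os_event <- as.integer(clin$vital_status == "Dead")

# Rows that can enter a survival analysis need both a time and a status.
usable <- !is.na(clin$os_time) & !is.na(clin$os_event)
print(clin[usable, ])
```

In this toy table the third patient is alive but has no follow-up time, so the row is unusable. When days_to_last_follow_up is NA for every case, that is the situation for all living patients.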

DijkJel · 2025-04-07T14:41:58Z

Is there an update on this issue? I have tried to download the TCGA-COAD data today (April 7) with the developer version (version 2.35.3, installed from github) of TCGAbiolinks and the 'days_to_last_follow_up' column is still absent from the clinical data.

remotes::install_github("BioinformaticsFMRP/TCGAbiolinks")
library('TCGAbiolinks')
library('SummarizedExperiment')  # provides colData()

query = TCGAbiolinks::GDCquery(project = 'TCGA-COAD',
                               data.category = "Transcriptome Profiling",
                               data.type = "Gene Expression Quantification",
                               workflow.type = "STAR - Counts")

TCGAbiolinks::GDCdownload(query, directory = paste0(directory, '/GDCdata'), files.per.chunk = 1)
data = TCGAbiolinks::GDCprepare(query, directory = paste0(directory, '/GDCdata'))

'days_to_last_follow_up' %in% colnames(colData(data))  ## Returns FALSE

mass-a · 2025-04-29T08:01:36Z

Thanks @tiagochst for the fix in v2.35.4! Days to follow-up are back.
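A quick way to tell whether a given version is at least the one reported as fixed here is to compare version strings. The helper name `has_follow_up_fix` is made up for this sketch, and the 2.35.4 threshold is taken from the comment above.

```r
# Returns TRUE when a version string is at least the release reported as fixed.
has_follow_up_fix <- function(version, fixed_in = "2.35.4") {
  package_version(version) >= package_version(fixed_in)
}

print(has_follow_up_fix("2.32.0"))
print(has_follow_up_fix("2.35.3"))
print(has_follow_up_fix("2.35.4"))

# With the package installed, the same check can be run on the local copy:
# has_follow_up_fix(as.character(utils::packageVersion("TCGAbiolinks")))
```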

Rutwik-Garge · 2025-05-01T09:45:46Z

I am trying to carry out differential gene expression analysis between drug-sensitive and drug-resistant patient samples. To download the clinical data, I ran the following code to update TCGAbiolinks:
remotes::install_github("https://github.com/BioinformaticsFMRP/TCGAbiolinks",ref = "devel")
But I am still getting NA for many columns in the clinical data I see in RStudio. Kindly help me get the drug treatment and drug response data from TCGA. My project is completely held up by this issue.

mass-a · 2025-05-01T11:57:37Z

@Rutwik-Garge try installing from the master branch. It seems the 'devel' branch is behind master by a few commits.

Rutwik-Garge · 2025-05-02T04:55:20Z

@Rutwik-Garge try installing from the master branch. It seems the 'devel' branch is behind master by a few commits.

I am still getting NA in the data columns.

tiagochst added a commit that referenced this issue Jan 31, 2025

Fixes error reported in #639

27190ea

tiagochst added a commit that referenced this issue Apr 7, 2025

Add days to last follow up from the new GDC API model #639

9a37a85

tiagochst added a commit that referenced this issue Apr 7, 2025

some improvements for #639

56a304b
