brca_tcga_pan_can_atlas_2018 is failing with cBioDataPack #68

Open
mjsteinbaugh opened this issue May 3, 2023 · 11 comments

@mjsteinbaugh

mjsteinbaugh commented May 3, 2023

Hi, I'm seeing a parsing error for brca_tcga_pan_can_atlas_2018:

utils::read.table fails too easily on malformed input. Would it be worth switching to readr/vroom or data.table here to harden against malformed files in the tarballs?

> packageVersion("cBioPortalData")
[1] ‘2.12.0’
> brca <- cBioPortalData::cBioDataPack("brca_tcga_pan_can_atlas_2018", ask = FALSE)
Warning: replacing previous import ‘utils::findMatches’ by ‘S4Vectors::findMatches’ when loading ‘AnnotationDbi’
Warning in .service_validate_md5sum(api_reference_url, api_reference_md5sum,  :
  service version differs from validated version
    service url: https://www.cbioportal.org/api/v2/api-docs
    observed md5sum: 008be96361f24a5c8d1cfb7f10ae9c97
    expected md5sum: 07ceb76cc5afcf54a9cf2e1a689b18f7
Calls: <Anonymous> ... initialize -> initialize -> Service -> .service_validate_md5sum
Downloading study file: brca_tcga_pan_can_atlas_2018.tar.gz
  |======================================================================| 100%

Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_armlevel_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_cna_hg19.seg
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_gene_panel_matrix.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_log2_cna.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_methylation_hm27_hm450_merged.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_microbiome.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_all_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_diploid_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem_zscores_ref_normal_samples.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mrna_seq_v2_rsem.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_mutations.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_phosphoprotein_quantification.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_protein_quantification_zscores.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_protein_quantification.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_rppa_zscores.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_rppa.txt
Working on: /var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T//RtmpYkgaMb/233d49947a0b_brca_tcga_pan_can_atlas_2018/brca_tcga_pan_can_atlas_2018/data_sv.txt
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  more columns than column names
Calls: <Anonymous> ... <Anonymous> -> .preprocess_data -> <Anonymous> -> read.table
Backtrace:
    ▆
 1. └─cBioPortalData::cBioDataPack(...)
 2.   └─cBioPortalData::loadStudy(exdir, names.field, cleanup)
 3.     └─cBioPortalData:::.loadExperimentsFromFiles(...)
 4.       └─base::Map(...)
 5.         └─base::mapply(FUN = f, ..., SIMPLIFY = FALSE)
 6.           └─cBioPortalData (local) `<fn>`(y = dots[[1L]][[18L]], x = dots[[2L]][[18L]])
 7.             └─cBioPortalData:::.preprocess_data(...)
 8.               └─utils::read.delim(...)
 9.                 └─utils::read.table(...)
@LiNk-NY
Contributor

LiNk-NY commented May 3, 2023

Hi @mjsteinbaugh
Switching to another reader might only alleviate the symptoms. It would be better to report data errors at https://github.com/cbioportal/cbioportal
I will take a look at the details.
Best,
Marcel

@mjsteinbaugh
Author

Thanks @LiNk-NY, I'll file a bug there too.

@mjsteinbaugh
Author

mjsteinbaugh commented May 3, 2023

Following up, I agree that it's better to fix the upstream source, but it does look like readr handles this file OK.

library(pipette)
con <- "https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt"
## Errors, as expected.
sv_base <- import(
    con = con,
    format = "tsv",
    colnames = TRUE
)
## Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",  : 
##   more columns than column names
## Calls: import ... import -> .local -> do.call -> do.call -> <Anonymous>
## The data.table engine does not error, but it munges the number of
## columns and names (17 columns here), not great.
sv_dt <- import(
    con = con,
    format = "tsv",
    engine = "data.table",
    colnames = TRUE
)
print(dim(sv_dt))
## [1] 5335   17
## The readr/vroom engine seems to parse OK.
sv_readr <- import(
    con = con,
    format = "tsv",
    engine = "readr",
    colnames = TRUE
)
print(dim(sv_readr))
## [1] 5336   13
print(colnames(sv_readr))
##  [1] "Sample_Id"                   "Site1_Hugo_Symbol"          
##  [3] "Site1_Chromosome"            "Site1_Position"             
##  [5] "Site2_Hugo_Symbol"           "Site2_Chromosome"           
##  [7] "Site2_Position"              "Site2_Effect_On_Frame"      
##  [9] "Tumor_Split_Read_Count"      "Tumor_Paired_End_Read_Count"
## [11] "SV_Status"                   "NCBI_Build"                 
## [13] "Event_Info"
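
One way to see where the header and the data rows disagree is to count tab-separated fields per line before parsing. Below is a minimal sketch using base R's `utils::count.fields`. The helper name `find_ragged_rows` and the toy file are made up for illustration and are not part of cBioPortalData or pipette. It assumes the first line is the header and that the file has no blank lines (`count.fields` skips those by default, which would shift the reported line numbers).

```r
## Hypothetical helper (not from any package): report lines whose field
## count differs from that of the first (header) line.
find_ragged_rows <- function(file, sep = "\t") {
    ## Disable quote and comment handling so every line is counted literally.
    n <- utils::count.fields(file, sep = sep, quote = "", comment.char = "")
    expected <- n[[1L]]
    bad <- which(n != expected)
    data.frame(
        line = bad,
        fields = n[bad],
        expected = rep(expected, length(bad))
    )
}

## Toy file: 3 header names, but line 3 has 5 fields.
tmp <- tempfile(fileext = ".txt")
writeLines(
    text = c("a\tb\tc", "1\t2\t3", "4\t5\t6\t7\t8", "9\t10\t11"),
    con = tmp
)
ragged <- find_ragged_rows(tmp)
print(ragged)
```

Run against a local copy of `data_sv.txt`, a check like this should point at the offending lines, which can then be quoted in the upstream bug report.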

@mjsteinbaugh
Author

OK, the issue has been filed with the cBioPortal datahub team here: cBioPortal/datahub#1820

@mjsteinbaugh
Author

mjsteinbaugh commented May 9, 2023

I can confirm that fixing the data_sv.txt file fixes this issue:
cBioPortal/datahub#1820 (comment)

## First, replace the `data_sv.txt` file in the extracted directory.
object <- cBioPortalData::loadStudy("brca_tcga_pan_can_atlas_2018", cleanup = FALSE)
## A MultiAssayExperiment object of 18 listed
##  experiments with user-defined names and respective classes.
##  Containing an ExperimentList class object of length 18:
##  [1] armlevel_cna: SummarizedExperiment with 39 rows and 1084 columns
##  [2] cna_hg19.seg: RaggedExperiment with 210376 rows and 1068 columns
##  [3] cna: SummarizedExperiment with 25128 rows and 1070 columns
##  [4] log2_cna: SummarizedExperiment with 25128 rows and 1070 columns
##  [5] methylation_hm27_hm450_merged: SummarizedExperiment with 22601 rows and 1066 columns
##  [6] microbiome: SummarizedExperiment with 1406 rows and 1070 columns
##  [7] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 20531 rows and 1082 columns
##  [8] mrna_seq_v2_rsem_zscores_ref_diploid_samples: SummarizedExperiment with 20471 rows and 1082 columns
##  [9] mrna_seq_v2_rsem_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 1082 columns
##  [10] mrna_seq_v2_rsem: SummarizedExperiment with 20531 rows and 1082 columns
##  [11] mutations: RaggedExperiment with 130495 rows and 1009 columns
##  [12] phosphoprotein_quantification: SummarizedExperiment with 18806 rows and 105 columns
##  [13] protein_quantification_zscores: SummarizedExperiment with 9733 rows and 105 columns
##  [14] protein_quantification: SummarizedExperiment with 9733 rows and 105 columns
##  [15] rppa_zscores: SummarizedExperiment with 198 rows and 876 columns
##  [16] rppa: SummarizedExperiment with 198 rows and 876 columns
##  [17] mrna_seq_v2_rsem_normal_samples_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
##  [18] mrna_seq_v2_rsem_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
## Functionality:
##  experiments() - obtain the ExperimentList instance
##  colData() - the primary/phenotype DataFrame
##  sampleMap() - the sample coordination DataFrame
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment
##  *Format() - convert into a long or wide DataFrame
##  assays() - convert ExperimentList to a SimpleList of matrices
##  exportClass() - save data to flat files

@mjsteinbaugh
Author

mjsteinbaugh commented May 9, 2023

Seeing parsing issues for:

  • blca_plasmacytoid_mskcc_2016/data_sv.txt
  • brca_tcga_pan_can_atlas_2018/data_sv.txt
  • coadread_tcga_pan_can_atlas_2018/data_sv.txt
  • ov_tcga_pan_can_atlas_2018/data_sv.txt
  • sarc_tcga_pan_can_atlas_2018/data_gene_panel_matrix.txt

@LiNk-NY
Contributor

LiNk-NY commented May 9, 2023

Thanks for putting this together @mjsteinbaugh
We will take a look at the data and file issues at cBioPortal.

@mjsteinbaugh
Author

@LiNk-NY I put together a pretty nifty script that attempts to process all of the datasets at cBioPortal. I'll update the list of failures here once it finishes running.

@mjsteinbaugh
Author

mjsteinbaugh commented May 9, 2023

Draft functions are here for reference:

@LiNk-NY
Contributor

LiNk-NY commented May 9, 2023

@mjsteinbaugh
Have you taken a look at the long tests folder?
https://github.com/waldronlab/cBioPortalData/tree/devel/longtests/testthat

@mjsteinbaugh
Author

OK here's an updated list of datasets with processing issues:

brca_tcga_pan_can_atlas_2018
ccrcc_utokyo_2013
coadread_tcga_pan_can_atlas_2018
gbm_cptac_2021
ihch_msk_2021
ihch_mskcc_2020
luad_mskimpact_2021
mbl_dkfz_2017
mbn_mdacc_2013
mixed_msk_tcga_2021
mixed_selpercatinib_2020
mpnst_mskcc
ov_tcga_pan_can_atlas_2018
pan_origimed_2020
pcpg_tcga_pub
sarc_tcga_pan_can_atlas_2018
stad_tcga_pub
ucec_ccr_msk_2022
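
A list like this can be gathered by looping over study IDs and trapping errors per study, so that one bad file does not stop the whole run. Below is a minimal sketch of that pattern, not the script mentioned above: `collect_failures` and `fake_loader` are hypothetical names, and the study IDs are placeholders. In practice the loader might wrap `cBioPortalData::cBioDataPack(id, ask = FALSE)`, as called at the top of this thread.

```r
## Hypothetical helper: call `loader` on each study ID and keep the error
## message for every ID that fails. Returns a named character vector.
collect_failures <- function(study_ids, loader) {
    messages <- vapply(
        X = study_ids,
        FUN = function(id) {
            tryCatch(
                expr = {
                    loader(id)
                    NA_character_  # success: nothing to report
                },
                error = function(e) conditionMessage(e)
            )
        },
        FUN.VALUE = character(1L),
        USE.NAMES = TRUE
    )
    messages[!is.na(messages)]
}

## Stand-in loader that simulates a failure for one placeholder ID.
fake_loader <- function(id) {
    if (identical(id, "study_bad")) {
        stop("simulated parse error")
    }
    invisible(TRUE)
}

failures <- collect_failures(
    study_ids = c("study_ok", "study_bad"),
    loader = fake_loader
)
print(names(failures))
```

Keeping the error message next to each ID makes it easier to tell malformed-file failures apart from unrelated ones such as download problems.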
