brca_tcga_pan_can_atlas_2018 is failing with cBioDataPack
#68
Comments
Hi @mjsteinbaugh |
Thanks @LiNk-NY I'll file a bug there too |
Following up, I agree that it's better to fix the upstream source, but it does look like readr handles this file OK.
library(pipette)
con <- "https://github.com/cBioPortal/datahub/raw/master/public/brca_tcga_pan_can_atlas_2018/data_sv.txt"
## Errors, as expected.
sv_base <- import(
con = con,
format = "tsv",
colnames = TRUE
)
## Error in (function (file, header = FALSE, sep = "", quote = "\"'", dec = ".", :
## more columns than column names
## Calls: import ... import -> .local -> do.call -> do.call -> <Anonymous>
## Munges the number of columns and names, not great.
sv_dt <- import(
con = con,
format = "tsv",
engine = "data.table",
colnames = TRUE
)
print(dim(sv_dt))
## [1] 5335 17
## The readr/vroom engine seems to parse OK.
sv_readr <- import(
con = con,
format = "tsv",
engine = "readr",
colnames = TRUE
)
print(dim(sv_readr))
## [1] 5336 13
print(colnames(sv_readr))
## [1] "Sample_Id" "Site1_Hugo_Symbol"
## [3] "Site1_Chromosome" "Site1_Position"
## [5] "Site2_Hugo_Symbol" "Site2_Chromosome"
## [7] "Site2_Position" "Site2_Effect_On_Frame"
## [9] "Tumor_Split_Read_Count" "Tumor_Paired_End_Read_Count"
## [11] "SV_Status" "NCBI_Build"
## [13] "Event_Info" |
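One way to see why the engines disagree on dimensions is to count the tab-separated fields on each line and compare against the header. The following is a minimal sketch using `utils::count.fields()` from base R. It runs on a small made-up file, not the real `data_sv.txt`, and the variable names (`n_fields`, `bad_lines`) are only illustrative.

```r
## Minimal sketch: locate rows whose field count differs from the header.
## The file contents below are invented for illustration only.
tmp <- tempfile(fileext = ".txt")
writeLines(
    text = c(
        "Sample_Id\tSite1_Hugo_Symbol\tSV_Status",
        "S1\tTP53\tSOMATIC",
        "S2\tBRCA1\tSOMATIC\textra",
        "S3\tERBB2\tSOMATIC"
    ),
    con = tmp
)
## Count tab-delimited fields per line, with quoting and comments disabled
## so that stray quote or "#" characters do not change the counts.
n_fields <- utils::count.fields(
    file = tmp,
    sep = "\t",
    quote = "",
    comment.char = ""
)
## Line numbers (line 1 is the header) that do not match the header width.
bad_lines <- which(n_fields != n_fields[[1L]])
print(n_fields)
print(bad_lines)
unlink(tmp)
```

Pointing the same call at a local copy of the real file should list the rows that carry more fields than the 13 column names.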
OK, an issue has been filed with the cBioPortal datahub team here: cBioPortal/datahub#1820 |
I can confirm that fixing the `data_sv.txt` file resolves the loading error:
## First, replace the `data_sv.txt` file in the extracted directory.
object <- cBioPortalData::loadStudy("brca_tcga_pan_can_atlas_2018", cleanup = FALSE)
## A MultiAssayExperiment object of 18 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 18:
## [1] armlevel_cna: SummarizedExperiment with 39 rows and 1084 columns
## [2] cna_hg19.seg: RaggedExperiment with 210376 rows and 1068 columns
## [3] cna: SummarizedExperiment with 25128 rows and 1070 columns
## [4] log2_cna: SummarizedExperiment with 25128 rows and 1070 columns
## [5] methylation_hm27_hm450_merged: SummarizedExperiment with 22601 rows and 1066 columns
## [6] microbiome: SummarizedExperiment with 1406 rows and 1070 columns
## [7] mrna_seq_v2_rsem_zscores_ref_all_samples: SummarizedExperiment with 20531 rows and 1082 columns
## [8] mrna_seq_v2_rsem_zscores_ref_diploid_samples: SummarizedExperiment with 20471 rows and 1082 columns
## [9] mrna_seq_v2_rsem_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 1082 columns
## [10] mrna_seq_v2_rsem: SummarizedExperiment with 20531 rows and 1082 columns
## [11] mutations: RaggedExperiment with 130495 rows and 1009 columns
## [12] phosphoprotein_quantification: SummarizedExperiment with 18806 rows and 105 columns
## [13] protein_quantification_zscores: SummarizedExperiment with 9733 rows and 105 columns
## [14] protein_quantification: SummarizedExperiment with 9733 rows and 105 columns
## [15] rppa_zscores: SummarizedExperiment with 198 rows and 876 columns
## [16] rppa: SummarizedExperiment with 198 rows and 876 columns
## [17] mrna_seq_v2_rsem_normal_samples_zscores_ref_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
## [18] mrna_seq_v2_rsem_normal_samples: SummarizedExperiment with 20531 rows and 114 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files |
Seeing parsing issues for:
|
Thanks for putting this together @mjsteinbaugh |
@LiNk-NY I put together a pretty nifty script that attempts to process all of the datasets at cBioPortal. I'll update the list of failures here once it finishes running. |
Draft functions are here for reference: |
@mjsteinbaugh |
OK here's an updated list of datasets with processing issues:
|
Hi, I'm seeing a parsing error for brca_tcga_pan_can_atlas_2018: utils::read.table chokes too easily on malformed files. Is it worth considering switching to readr/vroom or data.table here to harden against malformed files in the tarballs?
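For reference, here is a rough sketch of what "hardening" could look like: try a strict parse first, and fall back to a lenient one only when that fails. To stay dependency-free it uses base R for both steps rather than readr or data.table, so it is a stand-in for the idea and not what cBioPortalData does. `read_tsv_lenient()` and the `extra_*` column names are hypothetical, and the input file is made up.

```r
## Hypothetical helper (not part of cBioPortalData or pipette): strict parse
## first, then a lenient base R fallback that pads ragged rows.
read_tsv_lenient <- function(file) {
    tryCatch(
        expr = utils::read.delim(
            file = file,
            quote = "",
            comment.char = "",
            fill = FALSE,
            check.names = FALSE,
            stringsAsFactors = FALSE
        ),
        error = function(e) {
            message("Strict parse failed: ", conditionMessage(e))
            ## Widest row in the file, counted without quote handling.
            n <- utils::count.fields(
                file = file, sep = "\t", quote = "", comment.char = ""
            )
            header <- strsplit(
                readLines(file, n = 1L), split = "\t", fixed = TRUE
            )[[1L]]
            ## Name any surplus columns so no data is silently dropped.
            extra <- max(n) - length(header)
            col_names <- c(
                header,
                if (extra > 0L) paste0("extra_", seq_len(extra))
            )
            utils::read.table(
                file = file,
                sep = "\t",
                quote = "",
                comment.char = "",
                skip = 1L,
                header = FALSE,
                col.names = col_names,
                fill = TRUE,
                check.names = FALSE,
                stringsAsFactors = FALSE
            )
        }
    )
}

## Usage on an invented file where one row has two surplus fields.
tmp <- tempfile(fileext = ".txt")
writeLines(
    text = c(
        "Sample_Id\tSite1_Hugo_Symbol\tSV_Status",
        "S1\tTP53\tSOMATIC",
        "S2\tBRCA1\tSOMATIC\tx1\tx2",
        "S3\tERBB2\tSOMATIC"
    ),
    con = tmp
)
sv <- read_tsv_lenient(tmp)
print(dim(sv))
print(colnames(sv))
unlink(tmp)
```

Keeping the surplus fields under placeholder names makes the malformed rows easy to spot downstream. A readr or data.table engine could be swapped into the fallback branch if taking on the dependency is acceptable.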