Expand cancer cell line and patient related datasets, e.g. DepMap, CCLE, TCGA #216
Comments
Thanks for the issue! This sounds interesting. Would it make sense to add this as an additional dataset for the drug response prediction task? https://tdcommons.ai/multi_pred_tasks/drugres/ Or are you thinking of it more as an independent data function, as in https://tdcommons.ai/fct_overview/?
Hi @kexinhuang12345, I think the DepMap and CCLE datasets are multi-modal readouts from different assays performed on cancer cells, and they are (or can be) used in many different tasks. So maybe this could be a "Data Processing" entry under "Data Functions"?
Hi @kexinhuang12345 - quick question. Have you ever thought about including tasks related to connecting cancer cell lines to cancer patients? e.g. https://github.com/broadinstitute/celligner
Interesting! What is the relevant machine learning task formulation for it?
As for whether this cancer cell line data should be a data function or a dataset, I thought about it more, and it seems a better fit as datasets, since data functions generally need to apply to multiple tasks and datasets rather than being dataset-specific.
The functions that generate the datasets are definitely useful; we should store them in the data generation repo and reuse them, or even build them into the data loader for more diverse usage.
What are your thoughts on this? Also, you mentioned multiple tasks; can you elaborate on that? Happy to hop on a call to discuss more, just let me know. Thanks!
Hi @kexinhuang12345,
I think there is a wide range of ML tasks possible with the CCLE and DepMap datasets, here are some examples:
Agreed.
I guess I'm not aware of the "data generation repo". Let me know how I can help in this regard.
In general, CCLE stands for Cancer Cell Line "Encyclopedia" so conceptually it is a well-established empirical resource for a diverse set of biological questions. Thus, these datasets are widely used for simple query tasks or more advanced ML tasks in the context of cancer cell biology.
I'll send an email right after this, thank you.
Describe the problem
To enable cancer research, I would like to suggest adding functionality for working with cancer cell line information in TDC. Newer DepMap releases include changes that make them incompatible with some current data collection implementations, e.g. GilbertLabUCSF/CanDI#34, kevinhu/cancer_data#79.
Our previous work, CanDI, is a global cancer data integrator in Python that is used to harmonize and query datasets. Stable data from prior DepMap releases is deposited in Harvard Dataverse. I drafted some scripts to download this older but functional data, and we will update the scripts to make them work with newer DepMap releases.
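To illustrate the kind of release-to-release breakage described above, here is a minimal sketch of one way a loader could tolerate renamed columns. The alias table, the canonical names, and the sample rows are all hypothetical placeholders made up for illustration; they are not a verified DepMap schema and this is not how CanDI or TDC currently handle it.

```python
import csv
import io

# Hypothetical alias table: maps release-specific column names to one
# canonical name. These names are illustrative assumptions, not a
# verified DepMap schema.
COLUMN_ALIASES = {
    "DepMap_ID": "model_id",
    "ModelID": "model_id",
    "stripped_cell_line_name": "cell_line_name",
    "StrippedCellLineName": "cell_line_name",
}


def harmonize_rows(csv_text):
    """Parse CSV text and rename known columns to their canonical names.

    Columns without an entry in COLUMN_ALIASES are passed through unchanged.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
        for row in reader
    ]


# Made-up sample rows standing in for an older and a newer release.
old_release = "DepMap_ID,stripped_cell_line_name\nACH-000001,EXAMPLELINE\n"
new_release = "ModelID,StrippedCellLineName\nACH-000001,EXAMPLELINE\n"

# Both layouts resolve to the same canonical records.
assert harmonize_rows(old_release) == harmonize_rows(new_release)
```

Keeping the aliases in a single table means a new release would, under this assumption, only need a new mapping entry rather than changes to every downstream query.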
TCGA data access can be even harder, although I just saw https://cloud.google.com/life-sciences/docs/resources/public-datasets/tcga
Describe the solution you'd like
A new data collection method will be very beneficial. It would be great to gather structured and harmonized data for cancer cell lines using TDC. You already have a tool for GDSC so a similar approach for CCLE and DepMap will be very useful.
gget is also planning something like this, which could be a synergized effort: pachterlab/gget#121 (cc @lauraluebbert).
Additional context
See these links for CanDI's source code (https://github.com/GilbertLabUCSF/CanDI), docs, or manuscript.
This is an example of my analysis using TDC and CanDI – notebook | blog post | GilbertLabUCSF/Decitabine-treatment#5
Other related issues: #191