-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #7 from jjc2718/abstract
Abstract and title
- Loading branch information
Showing
5 changed files
with
17 additions
and
4 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,14 @@ | ||
## Abstract {.page_break_before} | ||
|
||
In the pursuit of molecular characterization of diverse cancers, collaborative efforts have generated large publicly available datasets, which combine various data types and data sources. | ||
Simultaneously, machine learning has rapidly gravitated toward models with many parameters that can be trained on broad sets of data, and subsequently fine-tuned to a wide variety of tasks. | ||
Computational oncology sits squarely at the intersection between these advances. | ||
However, the structure of most cancer datasets is uniquely heterogeneous, relative to other fields and data types in which large models have proven successful. | ||
In this dissertation, we first study aspects of machine learning model tuning in cancer, showing that the choice of optimizer used to fit models on cancer transcriptomics datasets can have pronounced effects on model selection. | ||
We then explore two aspects of heterogeneity inherent to public cancer datasets that affect machine learning modeling choices. | ||
We first show that most -omics types available in the TCGA Pan-Cancer Atlas can capture information relevant to cancer function, but somewhat less intuitively, when multiple -omics types are combined there is considerable redundancy and model performance does not generally improve. | ||
Next, we study model generalization across biological contexts in cancer transcriptomics and its implications on model selection, finding that cross-validation performance on holdout data is a sufficient selection criterion, and criteria that incorporate model sparsity or simplicity do not tend to improve generalization performance. | ||
Overall, our results show that the particularities of large cancer genomics datasets must be taken into account for applications of machine learning to be successful in this domain. | ||
These findings suggest hurdles to, but also opportunities for, machine learning models integrating pan-cancer and pan-omics data to derive biological and clinical insights. | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters