Merge pull request #7 from jjc2718/abstract

Abstract and title
greenelab · Sep 15, 2023 · def2187
2 parents 2773a82 + 572548e
commit def2187

Showing 5 changed files with 17 additions and 4 deletions.
diff --git a/content/01.abstract.md b/content/01.abstract.md
@@ -1,3 +1,14 @@
 ## Abstract {.page_break_before}
 
+In the pursuit of molecular characterization of diverse cancers, collaborative efforts have generated large publicly available datasets, which combine various data types and data sources.
+Simultaneously, machine learning has rapidly gravitated toward models with many parameters that can be trained on broad sets of data, and subsequently fine-tuned to a wide variety of tasks.
+Computational oncology sits squarely at the intersection between these advances.
+However, the structure of most cancer datasets is uniquely heterogeneous, relative to other fields and data types in which large models have proven successful.
+In this dissertation, we first study aspects of machine learning model tuning in cancer, showing that the choice of optimizer used to fit models on cancer transcriptomics datasets can have pronounced effects on model selection.
+We then explore two aspects of heterogeneity inherent to public cancer datasets that affect machine learning modeling choices.
+We first show that most -omics types available in the TCGA Pan-Cancer Atlas can capture information relevant to cancer function, but somewhat less intuitively, when multiple -omics types are combined there is considerable redundancy and model performance does not generally improve.
+Next, we study model generalization across biological contexts in cancer transcriptomics and its implications for model selection, finding that cross-validation performance on holdout data is a sufficient selection criterion, and criteria that incorporate model sparsity or simplicity do not tend to improve generalization performance.
+Overall, our results show that the particularities of large cancer genomics datasets must be taken into account for applications of machine learning to be successful in this domain.
+These findings suggest hurdles to, but also opportunities for, machine learning models integrating pan-cancer and pan-omics data to derive biological and clinical insights.
+
 
diff --git a/content/10.background_header.md b/content/10.background_header.md
@@ -1,4 +1,4 @@
-## Chapter 1: background
+## Chapter 1: background {.page_break_before}
 
 This chapter was formatted for this dissertation to provide background information and context for the following chapters. The subsection titled "Machine learning modeling strategies for high-dimensional -omics data" was adapted from a review paper previously published in the _Current Opinion in Biotechnology_ journal, as "Incorporating biological structure into machine learning models in biomedicine" (https://doi.org/10.1016/j.copbio.2019.12.021).
 

diff --git a/content/11.introduction.md b/content/11.introduction.md
@@ -2,7 +2,7 @@
 
 Precision oncology, or the selection of cancer treatments based on molecular or cellular features of patients' tumors, has become a fundamental part of the standard of care for some cancers [@doi:10.1093/annonc/mdx707].
 Although each tumor is unique, the successes of precision oncology reinforce the idea that there are commonalities that can be understood and therapeutically targeted.
-Targeted therapies that have been successfully applied across cancer types and patient subsets include _HER2_ (_ERBB2_) inhibitors in breast and stomach cancer [@doi:10.1093/jnci/djp341], BTK inhibitors in various hematological malignancies [@doi:10.1186/s13045-022-01353-w], _EGFR_ inhibitors across a variety of carcinomas [@doi:10.1186/s13045-022-01311-6], and _PARP_ inhibitors for tumors with DNA damage repair defects [@doi:10.1093/annonc/mdz192].
+Targeted therapies that have been successfully applied across cancer types and patient subsets include _HER2_ (_ERBB2_) inhibitors in breast and stomach cancer [@doi:10.1093/jnci/djp341], BTK inhibitors in various hematological malignancies [@doi:10.1186/s13045-022-01353-w], _EGFR_ inhibitors across a variety of carcinomas [@doi:10.1186/s13045-022-01311-6], and _PARP_ inhibitors for tumors with DNA damage repair defects [@doi:10.1093/annonc/mdz192], among others.
 The genes and mutations that drive cancer are often specific to a given cancer type or subtype, but they tend to converge on a few pathways [@doi:10.1016/j.cell.2018.03.035; @doi:10.1016/j.cell.2020.11.045], making more general targeted treatments possible.
 
 The past decade has seen an expansion in the size and diversity of cancer genomics datasets, both publicly available and otherwise.

diff --git a/content/30.header.md b/content/30.header.md
@@ -6,4 +6,6 @@ This chapter has been published in _Genome Biology_ (https://doi.org/10.1186/s13
 JC: conceptualization, methodology, software, visualization, writing - original draft, writing - review and editing
 BCC: methodology, writing - review and editing
 MC: methodology, writing - review and editing
-CSG: conceptualization, funding acquisition, methodology, supervision, writing - review and editing
+CSG: conceptualization, funding acquisition, methodology, supervision, writing - review and editing.
+An initial version of this manuscript was edited based on feedback from anonymous reviewers.
+
diff --git a/content/metadata.yaml b/content/metadata.yaml
@@ -1,5 +1,5 @@
 ---
-title: "Jake Crawford dissertation title"
+title: "Navigating heterogeneity to learn from large-scale cancer data: optimization, redundancy, and generalization"
 date: null # Defaults to date generated, but can specify like '2022-10-31'.
 keywords:
  - gene-expression