Merge pull request #5 from jjc2718/background_models

Chapter 1: integrate ML for structured data section, + short conclusion
greenelab · Sep 6, 2023 · 7e951b8
2 parents 00639a0 + 58acb1f
commit 7e951b8
Showing 6 changed files with 29 additions and 108 deletions.
diff --git a/content/02.introduction.md b/content/02.introduction.md
diff --git a/content/10.background_header.md b/content/10.background_header.md
@@ -1,6 +1,6 @@
 ## Chapter 1: Background
 
-This chapter was formatted for this dissertation to provide background information and context for the following chapters. Some elements of the second subsection on machine learning modeling techniques were previously published in the _Current Opinion in Biotechnology_ journal as "Incorporating biological structure into machine learning models in biomedicine" (https://doi.org/10.1016/j.copbio.2019.12.021).
+This chapter was formatted for this dissertation to provide background information and context for the following chapters. The subsection titled "Machine learning modeling strategies for high-dimensional -omics data" was adapted from a review paper previously published in the _Current Opinion in Biotechnology_ journal as "Incorporating biological structure into machine learning models in biomedicine" (https://doi.org/10.1016/j.copbio.2019.12.021).
 
 **Contributions:**
 For the unpublished parts of this chapter, I was the sole author.

diff --git a/content/11.introduction.md b/content/11.introduction.md
@@ -2,7 +2,7 @@
 
 Precision oncology, or the selection of cancer treatments based on molecular or cellular features of patients' tumors, has become a fundamental part of the standard of care for some cancers [@doi:10.1093/annonc/mdx707].
 Although each tumor is unique, the successes of precision oncology reinforce the idea that there are commonalities that can be understood and therapeutically targeted.
-Targeted therapies that have been successfully applied across cancer types and patient subsets include _HER2_ (_ERBB2_) inhibitors in breast and stomach cancer [@doi:10.1093/jnci/djp341], BTK inhibitors in various hematological malignancies [@doi:10.1186/s13045-022-01353-w], and _EGFR_ inhibitors across a variety of carcinomas [@doi:10.1186/s13045-022-01311-6].
+Targeted therapies that have been successfully applied across cancer types and patient subsets include _HER2_ (_ERBB2_) inhibitors in breast and stomach cancer [@doi:10.1093/jnci/djp341], BTK inhibitors in various hematological malignancies [@doi:10.1186/s13045-022-01353-w], _EGFR_ inhibitors across a variety of carcinomas [@doi:10.1186/s13045-022-01311-6], and _PARP_ inhibitors for tumors with DNA damage repair defects [@doi:10.1093/annonc/mdz192].
 The genes and mutations that drive cancer are often specific to a given cancer type or subtype, but they tend to converge on a few pathways [@doi:10.1016/j.cell.2018.03.035; @doi:10.1016/j.cell.2020.11.045], making more general targeted treatments possible.
 
 The past decade has seen an expansion in the size and diversity of cancer genomics datasets, both publicly available and otherwise.
@@ -20,9 +20,6 @@ Training foundation models on pan-cancer, pan-omics data would be a natural exte
 As a whole, this dissertation explores ways in which the structure of large, public pan-cancer datasets can present unexpected challenges and caveats for machine learning.
 TCGA and CCLE both contain data from various -omics types (feature groups) and samples from diverse cancer types/tissues of origin (sample groups).
 There are additional, less obvious forms of structure in these data such as patient sub-populations and sample collection locations, which we will not address directly in this dissertation but which can affect model training and performance as well.
-
 This chapter, Chapter 1, describes existing work at the intersection of cancer -omics and machine learning, which will provide context for the following chapters.
-In Chapter 2, we show that the choice of optimization method can affect model selection and tuning, for prediction from cancer transcriptomic data.
-Chapter 3 explores the relative information content of -omics types/feature groups in TCGA, showing that gene expression tends to contain the most information relative to cancer driver mutations, but most -omics types can serve as effective, and likely somewhat redundant, readouts.
-In Chapter 4, we test generalization across cancer types in TCGA and across datasets (CCLE to TCGA and vice-versa), showing that smaller models do not tend to generalize better across contexts, and cross-validation performance is in most cases a sufficient model selection criterion.
-Finally, in Chapter 5, we conclude by summarizing the implications of these results and discussing future directions.
+In particular, we focus in turn on applications of ML to cancer -omics data (first section of Chapter 1), and on ML methods that are currently used to take into account structure in -omics data (second section of Chapter 1).
+
diff --git a/content/12.data_review.md b/content/12.data_review.md
@@ -1,6 +1,6 @@
-## Cancer -omics data and applications
+### Cancer -omics data and applications
 
-### Publicly available cancer -omics data resources
+#### Publicly available cancer -omics data resources
 
 A wealth of public cancer genomics and multi-omics human sample resources have been generated in the past decade.
 As mentioned in the introduction, the TCGA Pan-Cancer Atlas [@pancanatlas] contains data spanning 33 cancer types and multiple -omics data types, including mutation, CNV, gene expression, miRNA, DNA methylation, reverse phase protein array (RPPA) proteomics data, and clinical outcome data [@doi:10.1016/j.cell.2018.02.052].
@@ -17,7 +17,7 @@ The GDSC and PRISM drug screening datasets provide cell viability dose-response
 Aside from cell lines, the PDX Encyclopedia is a dataset of patient-derived xenograft (PDX) mouse model data, including more than 1000 models with mutation, CNV, and gene expression data for each [@doi:10.1038/nm.3954].
 The National Cancer Institute's Patient-Derived Models Repository (PDMR) also contains mutation and gene expression profiles for mouse models and patient-derived tumor organoids, or tumoroids [@doi:10.1038/s41467-021-25177-3; @pdmr], although it is still under development.
 
-### Applications of machine learning in cancer genomics
+#### Applications of machine learning in cancer genomics
 
 Historically, one common use of -omics data in cancer has been to define subtypes, or clinically relevant patient subsets that may have similar prognosis or respond similarly to therapy.
 Many studies have sought to distinguish tumor samples from control/normal samples, to identify subtypes of a particular cancer type, or to distinguish samples of a particular cancer type/tissue of origin from samples of other cancer types (e.g. [@doi:10.1186/s12920-020-0677-2; @doi:10.3389/fbioe.2020.00737; @doi:10.1109/TBME.2012.2225622; @doi:10.1186/s13073-023-01176-5]).
@@ -30,14 +30,14 @@ Prediction of drug response from genomic data, often combined with clinical feat
 Given the availability and uniformity of the cell line data in CCLE, and drug response data in GDSC, PRISM and other cell line datasets, many method development efforts have centered on these data sources.
 Examples include prediction of drug response from integrated multi-omics data [@doi:10.1093/bioinformatics/btz318], prediction of drug response using perturbation modeling via CMap as an intermediate step [@doi:10.1093/bioinformatics/btz158], and prediction of drug response via single-cell transcriptomic data [@doi:10.1101/2022.01.11.475728], among many others reviewed in [@doi:10.1093/bib/bbab294; @doi:10.1038/s41467-022-34277-7; @doi:10.1038/s41598-023-39179-2].
 Large datasets of human-derived genomic data with associated drug response annotations are more difficult to find.
-Still, there have been attempts to develop and/or validate models on human data, including for prediction of immunotherapy response which benefits from applications across a wide range of cancer types [@doi:10.1038/s41587-021-01070-8; @doi:10.1016/j.ccell.2023.06.006; @doi:10.1101/2020.09.03.260265].
+Still, there have been attempts to develop and/or validate models on human data, including for prediction of immunotherapy response, which benefits from applications across a wide range of cancer types [@doi:10.1038/s41587-021-01070-8; @doi:10.1016/j.ccell.2023.06.006; @doi:10.1101/2020.09.03.260265].
 Prognosis or patient survival prediction from multi-omics data is another area of modeling that leverages widely available clinical metadata, reviewed in detail in many existing papers [@doi:10.1186/1471-2288-12-102; @doi:10.1093/bib/bbu003; @doi:10.1186/s12885-021-08796-3; @doi:10.1016/j.csbj.2014.11.005].
 
 Much of our work, described later in this thesis, stems from the idea of predicting the mutation status in key driver genes of cancer samples, based on functional readouts such as gene expression [@doi:10.1158/1078-0432.CCR-13-1943; @doi:10.1016/j.celrep.2018.03.046; @doi:10.1186/s13059-020-02021-3; @doi:10.1371/journal.pone.0241514].
 At first consideration, an accurate mutation status classifier may not seem particularly useful, since for a patient sample a clinician could simply sequence the genome, or select genes in the genome, to determine driver mutation status.
 One application of accurate mutation status classifiers, however, is to identify samples with a similar phenotype to those with a driver mutation, but _without_ the mutation being present in DNA sequencing data.
 Observed examples of this phenomenon include the "BRCAness" phenotype in tumors without observed _BRCA1_/_BRCA2_ mutations [@doi:10.1038/nrc.2015.21], and the "Ph-like" leukemia phenotype in the absence of the Philadelphia chromosome fusion [@doi:10.1182/asheducation-2016.1.561], among others.
-Following this line of reasoning, algorithms have been developed to identify mutations that "phenocopy" known cancer drivers [@doi:10.1142/9789811215636_0031; @doi:10.1101/2022.07.28.501874], and to integrate this information into drug response prediction pipelines to define larger and more accurate patient subgroups [@doi:10.1038/s41525-022-00328-7].
+Following this line of reasoning, algorithms have been developed to identify mutations that "phenocopy" known cancer drivers [@doi:10.1142/9789811215636_0031; @doi:10.1101/2022.07.28.501874], and to integrate this information into drug response prediction pipelines to define larger and more accurate patient subgroups [@doi:10.1186/s12859-021-04147-y; @doi:10.1038/s41525-022-00328-7].
 Related machine learning approaches to genomic prediction/phenotype identification include methods for identifying DNA damage repair deficiencies based on genomic data [@doi:10.1038/nm.4292; @doi:10.1038/s43018-022-00474-y] and for identifying synthetic lethal relationships for use in targeted therapy selection [@doi:10.1016/j.cell.2021.03.030].
-Such methods could be useful for defining broader and more representative patient groups than would be possible based solely on somatic mutation status, that may exhibit similar tumor phenotypes or respond to similar therapies.
-For example, in "basket" clinical trials where patients are included across cancer types based on the presence or absence of individual molecular markers [@doi:10.1200/jco.2014.58.2007], including "phenocopies" could improve efficacy for some targeted therapies.
+Such methods could be useful for defining patient groups that are broader and more representative than would be possible based solely on somatic mutation status, pooling together similar tumor phenotypes or identifying patients who respond to similar therapies.
+For example, in "basket" clinical trials where patients are included across cancer types based on the presence or absence of individual molecular markers [@doi:10.1200/jco.2014.58.2007], including "phenocopies" could improve efficacy for some therapies targeted to a particular gene or pathway.