Analysis of held-out cancer type classification results #21
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -465,8 +465,8 @@ | |
], | ||
"source": [ | ||
"vogelstein_results_df = au.compare_results(vogelstein_df, metric='aupr', correction=True,\n", | ||
" correction_method='fdr_bh', correction_alpha=0.001,\n", | ||
" verbose=True)\n", | ||
" correction_method='fdr_bh', correction_alpha=0.001,\n", | ||
" verbose=True)\n", | ||
"vogelstein_results_df.sort_values(by='p_value').head(n=10)" | ||
] | ||
}, | ||
|
@@ -736,6 +736,17 @@ | |
"source": [ | ||
"We have usually used TTN as our negative control (not understood to be a cancer driver, but is a large gene that is frequently mutated as a passenger). So it's a bit weird that it has a fairly low p-value here (would be significant at $\\alpha = 0.05$). We'll have to think about why this is." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# save significance testing results\n", | ||
So is the takeaway from these plots that those genes that were found to be DE in cancer vs normal (y-axis), were also found to be most mutated (x-axis), which we'd expect because these mutations will likely change the expression of these genes?

It might help if I explain the … The question I wanted to answer in this notebook is: for which genes can we train a model to predict mutation from gene expression that outperforms the negative control?

So for each gene, we ran 8 total cross-validation replicates (4 folds x 2 random seeds), for the experimental case and the control (shuffled) case. That then gives us 2 distributions of results (in this case we're using AUPR), and we can compare these using a t-test.

The plot is a bit like a volcano plot from a DE analysis: on the x-axis it shows the AUPR difference between the true labels and the shuffled labels (positive = better model performance for true labels), and on the y-axis it shows the p-value for the t-test comparing the two distributions. So points in the upper right (better performance for true labels, and low p-values) are the genes we're interested in, showing that we can build effective classifiers on this dataset for these genes.

Does that make more sense? It's not quite a standard use of a volcano plot, but you can interpret it in a similar way.

Oh, and as far as the takeaway - in this notebook, there were two for me:
(I'll add this interpretation to the notebook - I wrote this out in some slides I discussed with Casey, but forgot to add it here)
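For readers following the thread, the per-gene comparison described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the AUPR values are made up, `compare_gene` is a hypothetical helper (not the notebook's `au.compare_results`), the test is assumed to be an unpaired two-sample t-test via `scipy.stats.ttest_ind`, and the y-axis is assumed to use a -log10 transform as in a standard volcano plot.

```python
import numpy as np
from scipy import stats


def compare_gene(true_aupr, shuffled_aupr):
    """Hypothetical helper: compare true-label vs. shuffled-label AUPR for one gene.

    Returns (delta_mean, p_value), where delta_mean > 0 means the model trained
    on true labels performed better than the shuffled-label control.
    """
    true_aupr = np.asarray(true_aupr, dtype=float)
    shuffled_aupr = np.asarray(shuffled_aupr, dtype=float)
    delta_mean = true_aupr.mean() - shuffled_aupr.mean()
    # Assumption: unpaired two-sample t-test; the notebook's exact test may differ.
    _, p_value = stats.ttest_ind(true_aupr, shuffled_aupr)
    return delta_mean, p_value


# Made-up AUPR values: 8 replicates each (4 folds x 2 random seeds).
true_scores = [0.71, 0.68, 0.74, 0.70, 0.69, 0.73, 0.72, 0.67]
shuffled_scores = [0.41, 0.39, 0.43, 0.40, 0.38, 0.42, 0.44, 0.37]

delta, p = compare_gene(true_scores, shuffled_scores)

# Volcano-style coordinates: x = AUPR difference, y = -log10(p-value),
# so genes of interest land in the upper right.
x_coord, y_coord = delta, -np.log10(p)
print(f"delta AUPR = {x_coord:.3f}, -log10(p) = {y_coord:.2f}")
```

When many genes are tested, the p-values would also need a multiple-testing correction; the diff above passes `correction_method='fdr_bh'`, which suggests a Benjamini-Hochberg procedure, but that step is not shown here.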
"top50_results_df.to_csv(os.path.join(cfg.results_dir, 'top50_stratified_pvals.tsv'), index=False, sep='\\t')\n", | ||
"vogelstein_results_df.to_csv(os.path.join(cfg.results_dir, 'vogelstein_stratified_pvals.tsv'), index=False, sep='\\t')" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
|
Some comments for this notebook overall. I couldn't add inline comments since these don't appear to be new changes, so I apologize that this might be a little out of scope; you can address them in a separate PR if you'd like.

1. I tend to find it helpful to have a description of the experiment I am performing at the top of the notebook just to orient myself.
2. Just wanted to clarify the term `stratified` here. So you're saying that your training set includes, say, 10 samples of cancer type A, 10 of cancer type B, and 10 of cancer type C (30 samples total, each type making up 1/3). And your test set contains maybe 9 total samples, with 3 samples of cancer type A, 3 of cancer type B, and 3 of cancer type C. So the proportions are the same in the test and training sets?
3. Trying to understand your dfs. So for the first 3 rows of `top50_df`, you have auroc and aupr that tell you how well gene info from the training set (including mutation burden, etc.) predicts the mutation status of TP53 (binary, I assume) on the training/test/validation sets, where the labels used to train the model were shuffled. I assume that means you have the same training dataset with multiple labels for mutation status of gene X, Y, Z. So you'd train your model on its ability to predict mutation of gene X, then train your model on its ability to predict mutation of gene Y, and so on. So you have multiple models here?

These are all good questions! See answers below:

1. Good idea! I'll add this to the top here.
2. Yep, exactly (not always exactly the same proportion between train/test sets, but +/- 1 sample I think - this is implemented in `StratifiedKFold` from scikit-learn).
3. Right - so each row in `top50_df` and `vogelstein_df` is one model, trained on the (binary, mutated or not mutated) mutation status of one gene on one cross-validation fold, either on the true labels or the shuffled labels.

We have mutation information for (almost) every gene in the genome - "top_50" and "vogelstein" are two different ways to select the genes to train models on. If we just train models on every gene in the genome, our statistical power to detect true relationships between mutation and gene expression won't be very good (and also it will take forever), so we want to start with sets of known cancer genes to improve power. In each case, we train one model for each gene/true-shuffled/cross-validation fold combination.

Let me know if that doesn't answer your question.
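
To make the stratification answer concrete, here is a minimal sketch of how `StratifiedKFold` keeps cancer-type proportions roughly equal across folds. The toy labels, the placeholder feature matrix, and the seed values are assumptions for illustration; only the 4 folds x 2 random seeds layout comes from the discussion in this PR.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (assumption): 12 samples each of three cancer types.
cancer_types = np.array(['A'] * 12 + ['B'] * 12 + ['C'] * 12)
X = np.zeros((len(cancer_types), 1))  # placeholder features, unused by the splitter

fold_counts = []
for seed in (0, 1):  # 2 random seeds (values chosen arbitrarily)
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    # Stratify on cancer type so each fold mirrors the overall proportions.
    for train_idx, test_idx in skf.split(X, cancer_types):
        train_counts = Counter(cancer_types[train_idx].tolist())
        test_counts = Counter(cancer_types[test_idx].tolist())
        fold_counts.append((train_counts, test_counts))

# 4 folds x 2 seeds = 8 train/test splits in total.
for train_counts, test_counts in fold_counts:
    print(dict(train_counts), dict(test_counts))
```

With class sizes that divide evenly by the number of folds, as here, every test fold gets the same count per cancer type; with uneven sizes the counts can differ by a sample or so, as noted above.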
This makes sense. Thank you!
As a followup. So you're including those cancer genes instead of all genes. Do you expect most of those genes to have a mutated status = 1 (I guess this'll probably depend on the cancer type)? Would it make sense to include genes that are not mutated in cancers as a control?
Yeah, we definitely expect many of the cancer-related genes to be frequently mutated in at least a few cancer types (and we are filtering out gene/cancer type combos where less than 5% of samples are mutated, like in my last exploratory data analysis notebook).
Yes, this is definitely what we're trying to do, but it's hard to choose control genes well. Ideally we'd include some genes that aren't drivers, but this isn't really documented anywhere (and absence of evidence for a gene being a driver in some cancer type isn't the same as evidence of absence; some drivers are just rarely mutated or haven't been studied in depth).
In the past we've used TTN as an example of a gene that isn't thought to be a driver of any cancer type, but remember that lots of genes are mutated in cancer, even those that aren't actually driving the cancer to form. TTN is a large gene that is frequently mutated as a passenger (just by chance) in many cancers, so its mutation status correlates with mutation burden (and thus with cancer type). So (we think) it turns out that TTN mutation status can actually be predicted to some degree from gene expression, because gene expression -> cancer type -> mutation burden -> probability of TTN being mutated.
I guess we could pick smaller genes to reduce the chances of passenger mutations, but I'd have to think about what the best way is to do this.
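
As a rough illustration of the 5% mutation-frequency filter mentioned earlier in this comment, the sketch below drops gene/cancer type combinations where fewer than 5% of samples are mutated. The DataFrame layout, the column names `cancer_type` and `TP53_mutated`, and the counts are all hypothetical; the actual filtering code in the repository may be organized differently.

```python
import pandas as pd

# Hypothetical toy data: one row per sample, with a binary mutation label for one gene.
samples = pd.DataFrame({
    'cancer_type': ['A'] * 40 + ['B'] * 40,
    # Type A: 10/40 mutated (25%); type B: 1/40 mutated (2.5%).
    'TP53_mutated': [1] * 10 + [0] * 30 + [1] * 1 + [0] * 39,
})

MIN_MUTATED_PROPORTION = 0.05  # threshold taken from the comment above

# Proportion of mutated samples within each cancer type.
mutated_prop = samples.groupby('cancer_type')['TP53_mutated'].mean()

# Keep only cancer types where at least 5% of samples carry the mutation.
kept_cancer_types = mutated_prop[mutated_prop >= MIN_MUTATED_PROPORTION].index.tolist()
filtered = samples[samples['cancer_type'].isin(kept_cancer_types)]

print(mutated_prop.to_dict())
print(kept_cancer_types)
```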