Analysis of held-out cancer type classification results #21

Merged Sep 21, 2020 (14 commits)
22 changes: 19 additions & 3 deletions 04_plot_results.ipynb → 04_plot_stratified_results.ipynb
@@ -465,8 +465,8 @@
],
"source": [
"vogelstein_results_df = au.compare_results(vogelstein_df, metric='aupr', correction=True,\n",
" correction_method='fdr_bh', correction_alpha=0.001,\n",
" verbose=True)\n",
" correction_method='fdr_bh', correction_alpha=0.001,\n",
Reviewer

Some overall comments for this notebook. I couldn't leave them as line comments since these don't appear to be new changes, so apologies that this might be a little out of scope; feel free to address them in a separate PR if you'd like.

  1. I tend to find it helpful to have a description of the experiment I am performing at the top of the notebook just to orient myself.

  2. Just wanted to clarify the term stratified here. So you're saying that your training set includes, say, 10 samples of cancer type A, 10 of cancer type B, and 10 of cancer type C (30 samples total, each type making up 1/3 of the set), and your test set contains maybe 9 total samples with 3 of cancer type A, 3 of cancer type B, and 3 of cancer type C. So the proportions are the same in the test and training sets?

  3. Trying to understand your dataframes. For the first 3 rows of top50_df, you have AUROC and AUPR values that tell you how well gene info from the training set (including mutation burden, etc.) predicts the mutation status of TP53 (binary, I assume) on the training/test/validation sets, in the case where the labels used to train the model were shuffled. I assume that means you have the same training dataset with multiple sets of labels, one for the mutation status of each of gene X, Y, Z. So you'd train a model on its ability to predict mutation of gene X, then train a model on its ability to predict mutation of gene Y, and so on. So you have multiple models here?

Contributor Author

These are all good questions! See answers below:

  1. I tend to find it helpful to have a description of the experiment I am performing at the top of the notebook just to orient myself.

Good idea! I'll add this to the top here.

  2. Just wanted to clarify the term stratified here. So you're saying that your training set includes, say, 10 samples of cancer type A, 10 of cancer type B, and 10 of cancer type C (30 samples total, each type making up 1/3 of the set), and your test set contains maybe 9 total samples with 3 of cancer type A, 3 of cancer type B, and 3 of cancer type C. So the proportions are the same in the test and training sets?

Yep, exactly (not always exactly the same proportions between train and test sets, but within +/- 1 sample, I think; this is implemented in StratifiedKFold from scikit-learn).
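For example (a toy illustration, not the actual pipeline code):

```python
# Toy illustration of stratification: StratifiedKFold keeps the cancer type
# proportions roughly equal between the train and test splits.
import numpy as np
from sklearn.model_selection import StratifiedKFold

cancer_type = np.array(['A'] * 10 + ['B'] * 10 + ['C'] * 10)  # toy labels
X = np.random.randn(30, 5)                                    # toy features

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for train_ix, test_ix in skf.split(X, cancer_type):
    # each test fold gets ~3 samples per cancer type (within +/- 1)
    print(np.unique(cancer_type[test_ix], return_counts=True))
```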

  3. Trying to understand your dataframes. For the first 3 rows of top50_df, you have AUROC and AUPR values that tell you how well gene info from the training set (including mutation burden, etc.) predicts the mutation status of TP53 (binary, I assume) on the training/test/validation sets, in the case where the labels used to train the model were shuffled. I assume that means you have the same training dataset with multiple sets of labels, one for the mutation status of each of gene X, Y, Z. So you'd train a model on its ability to predict mutation of gene X, then train a model on its ability to predict mutation of gene Y, and so on. So you have multiple models here?

Right - so each row in top50_df and vogelstein_df is one model, trained on the (binary, mutated or not mutated) mutation status of one gene on one cross-validation fold, either on the true labels or the shuffled labels.

We have mutation information for (almost) every gene in the genome - "top_50" and "vogelstein" are two different ways to select the genes to train models on. If we just train models on every gene in the genome, our statistical power to detect true relationships between mutation and gene expression won't be very good (and also it will take forever), so we want to start with sets of known cancer genes to improve power. In each case, we train one model for each gene/true-shuffled/cross-validation fold combination.
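Here's a simplified sketch of that setup (illustrative names only, not the real pipeline code):

```python
# One model per gene x (true/shuffled labels) x CV fold combination.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold

def run_experiments(X, labels_by_gene, genes, n_splits=4, seed=42):
    """labels_by_gene maps gene name -> binary mutated/not-mutated labels."""
    rng = np.random.default_rng(seed)
    results = []  # one row per gene / signal-or-shuffled / fold
    for gene in genes:
        y = labels_by_gene[gene]
        for signal in ('signal', 'shuffled'):
            # shuffled labels serve as the negative control
            y_used = rng.permutation(y) if signal == 'shuffled' else y
            skf = StratifiedKFold(n_splits=n_splits, shuffle=True,
                                  random_state=seed)
            for fold, (tr, te) in enumerate(skf.split(X, y_used)):
                model = LogisticRegression(max_iter=1000).fit(X[tr], y_used[tr])
                aupr = average_precision_score(
                    y_used[te], model.predict_proba(X[te])[:, 1])
                results.append({'gene': gene, 'signal': signal,
                                'fold': fold, 'aupr': aupr})
    return results
```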

Let me know if that doesn't answer your question.

Reviewer

This makes sense. Thank you!

As a follow-up: so you're including those cancer genes instead of all genes. Do you expect most of those genes to have a mutated status = 1 (I guess this'll probably depend on the cancer type)? Would it make sense to include genes that are not mutated in cancers as a control?

Contributor Author

Do you expect most of those genes to have a mutated status = 1 (I guess this'll probably depend on the cancer type)?

Yeah, we definitely expect many of the cancer-related genes to be frequently mutated in at least a few cancer types (and we are filtering out gene/cancer type combos where less than 5% of samples are mutated, like in my last exploratory data analysis notebook).
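Roughly what that filter looks like (hypothetical column names; the real filtering lives in the earlier EDA notebook):

```python
import pandas as pd

def filter_gene_cancer_combos(mutation_df, min_mutated_prop=0.05):
    """Keep gene/cancer type combos where >= 5% of samples are mutated.

    mutation_df: one row per (sample, gene) pair, with columns
    ['gene', 'cancer_type', 'mutated'] where 'mutated' is 0/1.
    """
    freqs = (mutation_df
             .groupby(['gene', 'cancer_type'])['mutated']
             .mean()
             .reset_index(name='mutated_prop'))
    # return only the gene/cancer type combos that pass the threshold
    return freqs.loc[freqs['mutated_prop'] >= min_mutated_prop,
                     ['gene', 'cancer_type']]
```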

Would it make sense to include genes that are not mutated in cancers as a control?

Yes, this is definitely what we're trying to do, but it's hard to choose control genes well. Ideally we'd include some genes that aren't drivers, but this isn't really documented anywhere (and absence of evidence for a gene being a driver in some cancer type isn't the same as evidence of absence; some drivers are just rarely mutated or haven't been studied in depth).

In the past we've used TTN as an example of a gene that isn't thought to be a driver of any cancer type, but remember that lots of genes are mutated in cancer, even those that aren't actually driving the cancer to form. TTN is a large gene that is frequently mutated as a passenger (just by chance) in many cancers, so its mutation status correlates with mutation burden (and thus with cancer type). So (we think) it turns out that TTN mutation status can actually be predicted to some degree from gene expression, because gene expression -> cancer type -> mutation burden -> probability of TTN being mutated.

I guess we could pick smaller genes to reduce the chances of passenger mutations, but I'd have to think about the best way to do this.

" verbose=True)\n",
"vogelstein_results_df.sort_values(by='p_value').head(n=10)"
]
},
@@ -665,7 +665,12 @@
"source": [
"The plot above is similar to a volcano plot used in differential expression analysis. The x-axis shows the difference between AUPR in the signal (true labels) case and in the negative control (shuffled labels) case, and the y-axis shows the negative log of the t-test p-value, after FDR adjustment.\n",
"\n",
"Orange points are significant at a cutoff of $\\alpha = 0.001$ after FDR correction."
"Orange points are significant at a cutoff of $\\alpha = 0.001$ after FDR correction.\n",
"\n",
"Our interpretation of these results:\n",
"\n",
"* For the top 50 analysis, we mostly reproduced the results from BioBombe which also used this gene set (some of the less significant hits weren't found in BioBombe, but we should have better statistical power here so it makes sense that we see more results)\n",
"* For the Vogelstein analysis, it was surprising/interesting that we saw lots more significant hits than we did for the top 50 analysis! On some level it's not shocking (if a gene is mutated frequently that doesn't necessarily make it a driver, and conversely drivers aren't always frequently mutated across all samples) but seeing visual confirmation of this was neat."
]
},
{
@@ -736,6 +741,17 @@
"source": [
"We have usually used TTN as our negative control (not understood to be a cancer driver, but is a large gene that is frequently mutated as a passenger). So it's a bit weird that it has a fairly low p-value here (would be significant at $\\alpha = 0.05$). We'll have to think about why this is."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# save significance testing results\n",
Reviewer

So is the takeaway from these plots that the genes found to be DE in cancer vs. normal (y-axis) were also found to be the most mutated (x-axis), which we'd expect because these mutations will likely change the expression of these genes?

Contributor Author (@jjc2718), Sep 21, 2020

It might help if I explain the results_dfs (with the statistical testing results), then maybe that will make it a bit clearer what the plots are showing.

The question I wanted to answer in this notebook is: for which genes can we train a model to predict mutation from gene expression that outperforms the negative control? So for each gene, we ran 8 total cross-validation replicates (4 folds x 2 random seeds), for the experimental case and the control (shuffled) case.

That then gives us 2 distributions of results (in this case we're using AUPR), and we can compare these using a t-test. The plot is a bit like a volcano plot from a DE analysis: on the x-axis it shows the AUPR difference between the true labels and the shuffled labels (positive = better model performance for true labels), and on the y-axis it shows the p-value for the t-test comparing the two distributions. So points in the upper right (better performance for true labels, and low p-values) are the genes we're interested in, showing that we can build effective classifiers on this dataset for these genes.
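A rough sketch of that comparison (not the actual au.compare_results implementation, just the idea behind the plot coordinates):

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def compare_signal_vs_shuffled(results_df, alpha=0.001):
    """results_df columns assumed: ['gene', 'signal', 'aupr'],
    with 8 rows per gene/signal combination (4 folds x 2 seeds)."""
    rows = []
    for gene, gene_df in results_df.groupby('gene'):
        true_aupr = gene_df.loc[gene_df['signal'] == 'signal', 'aupr']
        shuf_aupr = gene_df.loc[gene_df['signal'] == 'shuffled', 'aupr']
        _, p = ttest_ind(true_aupr, shuf_aupr)  # compare the two distributions
        rows.append({'gene': gene,
                     'delta_aupr': true_aupr.mean() - shuf_aupr.mean(),  # x-axis
                     'p_value': p})
    out = pd.DataFrame(rows)
    # Benjamini-Hochberg FDR correction, as with correction_method='fdr_bh'
    _, out['corr_pval'], _, _ = multipletests(out['p_value'], alpha=alpha,
                                              method='fdr_bh')
    out['nlog10_p'] = -np.log10(out['corr_pval'])  # y-axis
    return out
```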

Does that make more sense? It's not quite a standard use of a volcano plot, but you can interpret it in a similar way.

Contributor Author

Oh, and as far as the takeaway goes: in this notebook, there were two for me:

  • For the top 50 analysis, we mostly reproduced the results from BioBombe which also used this gene set (some of the less significant hits weren't found in BioBombe, but we should have better statistical power here so it makes sense that we see more results)
  • For the Vogelstein analysis, it was surprising/interesting that we saw lots more significant hits than we did for the top 50 analysis! On some level it's not shocking (if a gene is mutated frequently that doesn't necessarily make it a driver, and conversely drivers aren't always frequently mutated across all samples) but seeing visual confirmation of this was neat.

(I'll add this interpretation to the notebook - I wrote this out in some slides I discussed with Casey, but forgot to add it here)

"top50_results_df.to_csv(os.path.join(cfg.results_dir, 'top50_stratified_pvals.tsv'), index=False, sep='\\t')\n",
"vogelstein_results_df.to_csv(os.path.join(cfg.results_dir, 'vogelstein_stratified_pvals.tsv'), index=False, sep='\\t')"
]
}
],
"metadata": {