Here, we provide data and scripts to generate all main figures in the paper.
- Data_curation_visualisations.rmd generates the panels of Figure 1, using the standard dataset for Fig 1B, the Pan HLA dataset for Fig 1C-E and the GBM dataset for Fig 1F.
- Re-training each model and testing it by k-fold cross-validation takes a considerable amount of time because of the feature computation component of Repitope. Therefore, for the following benchmarking experiments, the data produced by cross-validation are read into each markdown file and the figures are generated from them. An example of how each model is re-trained and tested in a cross-validation context is shown in 10Fold_CV_Example.rmd. This file runs 10-fold cross-validation for each model in the pathogenic HLA-specific scenario. Repitope is the exception because it takes too long to run: an example script for Repitope is provided, and its results are simply read in.
- PanHLA_Pathogenic_CV.rmd generates Fig 2A and corresponding supplementary confusion matrices / PR-AUC. This uses the data generated after the cross-validation experiment (PanHLA_combinedData.rds). A .txt version of this data is included.
- HLASpecific_Pathogenic_CV.rmd generates Fig 2B and corresponding supplementary confusion matrices / PR-AUC. This uses the data generated after the cross-validation experiment (HLASpecific_combinedData.rds). A .txt version of this data is included.
- Benchmark_GBM_PanHLA.rmd generates Fig 2C and corresponding supplementary confusion matrices / ROC-AUC. This uses the data generated after training the models on the Pan-HLA dataset and testing against the GBM dataset (GBM_PAN_HLA_combinedData.rds). A .txt version of this data is included.
- Benchmark_GBM_A201.rmd generates Fig 2D and corresponding supplementary confusion matrices / ROC-AUC. This uses the data generated after training the models on the HLA-specific dataset and testing against the GBM dataset (HLA_Specific_GBM_combinedData.rds). A .txt version of this data is included.
- Bjerregaard_PanHLA.rmd generates Fig 2E and corresponding supplementary confusion matrices / ROC-AUC. This uses the data generated after training the models on the Pan HLA dataset and testing against the 291 Bjerregaard 9mers dataset (PanHLA_Bjerregaard_combinedData.rds). A .txt version of this data is included.
- Bjerregaard_HLA_Specific.rmd generates Fig 2F and corresponding supplementary confusion matrices / ROC-AUC. This uses the data generated after training the models on the HLA Specific dataset and testing against the 291 Bjerregaard 9mers dataset (HLASpecific_Bjerregaard_combinedData.rds). A .txt version of this data is included.
- Models_unreliable_neoantigens.rmd generates all panels of Figure 3.
- Evaluating_HLA_imbalance.rmd generates all panels of Figure 4.
- Further_data_associated_complexities.rmd generates all panels of Figure 5.
- Standard dataset (Additional file 2) - "Datasets_csv/standardDataset_200903_PB_Inc_ImmunogenicityEvidence.csv"
- Pan HLA dataset (Additional file 3) - "Analysis_Pan_HLA_Pathogenic/PanHLA_FullDataset"
- HLA Specific dataset (Additional file 4) - "Analysis_HLA_specific_Pathogenic/A201_standardData_forAnalysis_BALANCED.rds" and /HLASpecific_FullDataset
- GBM Dataset (Additional file 5) (produced by Margardia Rei and Rui Ma) - "GBM_Benchmark_PanHLA_Train/GBM_Peptides.tsv"
- NetTepi - https://services.healthtech.dtu.dk/service.php?NetTepi-1.0 (note that for re-training, a new .MOD file that we generated is passed to NetTepi's Python script)
- iPred - https://github.com/antigenomics/ipred
- Repitope - https://github.com/masato-ogishi/Repitope
- NetMHCpan 4.0 - https://services.healthtech.dtu.dk/service.php?NetMHCpan-4.1
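
The cross-validation workflow described for 10Fold_CV_Example.rmd (split the data into k folds, re-train on k-1 folds, score the held-out fold, then pool the scores for confusion matrices) can be sketched roughly as follows. This is only an illustration in Python, not the repository's R Markdown code: the function names `stratified_folds`, `cross_validate` and `confusion_matrix`, the stratified fold assignment, the 0.5 score threshold and the toy "model" are all assumptions made for the example and are not taken from the scripts.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign every sample index to one of k folds.

    Stratification by label is an assumption of this sketch; the
    repository's scripts may split the data differently.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k
    return fold_of


def cross_validate(peptides, labels, train_fn, predict_fn, k=10, seed=0):
    """Re-train on k-1 folds, score the held-out fold, pool all scores."""
    fold_of = stratified_folds(labels, k, seed)
    pooled = [None] * len(peptides)
    for fold in range(k):
        train_idx = [i for i, f in enumerate(fold_of) if f != fold]
        test_idx = [i for i, f in enumerate(fold_of) if f == fold]
        model = train_fn([peptides[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores = predict_fn(model, [peptides[i] for i in test_idx])
        for i, score in zip(test_idx, scores):
            pooled[i] = score
    return pooled


def confusion_matrix(labels, scores, threshold=0.5):
    """Count TP/FN/FP/TN, calling a sample positive when score >= threshold."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if label == 1:
            counts["TP" if predicted else "FN"] += 1
        else:
            counts["FP" if predicted else "TN"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy data: placeholder peptide names, alternating labels.
    toy_labels = [1, 0] * 20
    toy_peptides = ["PEP%d" % i for i in range(len(toy_labels))]

    # Toy "model": predicts the training-set fraction of positives.
    def train(X, y):
        return sum(y) / len(y)

    def predict(model, X):
        return [model] * len(X)

    pooled_scores = cross_validate(toy_peptides, toy_labels, train, predict)
    print(confusion_matrix(toy_labels, pooled_scores))
```

Because every sample is scored exactly once, by a model that never saw it during training, the pooled scores can be saved and reloaded later for plotting, which mirrors why the benchmarking markdown files read in precomputed results instead of re-running cross-validation.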