Exome sequencing data management and variant filtering in azoospermic and testicular germ cell tumor patients
This program executes all the steps of the Krausz et al. pipeline for SNV and INDEL filtering. It takes as input the .xlsx or .csv files produced by exome sequencing data analysis and aims to identify the rarest variants.
SNV_INDEL_filter_PV.py [-h] (-s SINGLE | -d DIRECTORY)
[-p PARALLEL] [-a AUX1] [-b AUX2] [-o OUTPUT]
Specify either a single file (-s) or a directory containing multiple files (-d).
OPTIONS:
-h, --help show this help message and exit
-s SINGLE, --single SINGLE Path to a single INDEL or SNV file (debug mode)
-d DIRECTORY, --directory DIRECTORY Path to the directory to process
-p PARALLEL, --parallel PARALLEL Multiprocessing: number of parallel processes (default 2); ignored in single mode
-a AUX1, --aux1 AUX1 Auxiliary sample sheet in CSV format with two columns: SAMPLE (sample name) and PHENOTYPE (diagnosis or ethnicity label).
E.g.
SAMPLE,PHENOTYPE
A731RV,AZO
B732RV,AZO
C733RV,CRTL
-b AUX2, --aux2 AUX2 Auxiliary directory containing the OMIM gene classification (e.g. D/r).
-o OUTPUT, --output OUTPUT Path to output directory. (default ./)
-pikt, --pathogenicIndexThreshold Pathogenic index threshold. Retains SNVs with a pathogenic index (IP) >= the user-defined threshold. If set to 0, this filter is disabled. (default 0.7)
-ihAF, --inHouseAlleleFrequency In-house allele frequency threshold. Retains SNVs and INDELs with an in-house AF <= the user-defined threshold. If set to 1, this filter is disabled. (default 0.01)
-v, --verification Verifies that each sample in the sample sheet has both files (SNV and INDEL) in the input directory
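The semantics of the two thresholds can be illustrated with a minimal Python sketch. It is not taken from SNV_INDEL_filter_PV.py: the function name `keep_variant`, its parameters, and the choice to discard an SNV with no pathogenic index are assumptions made only for illustration.

```python
from typing import Optional


def keep_variant(kind: str,
                 pathogenic_index: Optional[float],
                 inhouse_af: float,
                 pi_threshold: float = 0.7,
                 af_threshold: float = 0.01) -> bool:
    """Return True if a variant survives both threshold filters (hypothetical helper)."""
    # In-house AF filter: applies to SNVs and INDELs; a threshold of 1 disables it.
    if af_threshold < 1 and inhouse_af > af_threshold:
        return False
    # Pathogenic index filter: applies to SNVs only; a threshold of 0 disables it.
    if kind == "SNV" and pi_threshold > 0:
        # Assumption: an SNV without a pathogenic index does not pass the filter.
        if pathogenic_index is None or pathogenic_index < pi_threshold:
            return False
    return True


print(keep_variant("SNV", 0.8, 0.005))                  # True
print(keep_variant("SNV", 0.5, 0.005))                  # False
print(keep_variant("SNV", 0.5, 0.005, pi_threshold=0))  # True (filter disabled)
print(keep_variant("INDEL", None, 0.2, af_threshold=1)) # True (filter disabled)
```

Both comparisons are inclusive, matching the ">=" and "<=" wording of the option descriptions.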
Warning:
Input file names must have the following format:
{SAMPLE}.{SNV|INDEL}.{something_else}.{xlsx|xls|csv|csv.gz|csv.gzip}
E.g.
A731RV.SNV.FINAL.xlsx
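The naming rule and the kind of check performed by -v can be sketched as follows. This is an illustrative approximation, not the script's own code: the regular expression, the helper names `parse_name` and `missing_files`, and the assumption that sample names contain no dots are all hypothetical.

```python
import re

# {SAMPLE}.{SNV|INDEL}.{something_else}.{xlsx|xls|csv|csv.gz|csv.gzip}
# Assumption: the sample name itself contains no "." character.
NAME_RE = re.compile(
    r"^(?P<sample>[^.]+)\.(?P<kind>SNV|INDEL)\.(?P<rest>.+)"
    r"\.(?P<ext>xlsx|xls|csv\.gzip|csv\.gz|csv)$"
)


def parse_name(filename):
    """Return (sample, kind) for a well-formed file name, otherwise None."""
    match = NAME_RE.match(filename)
    return (match.group("sample"), match.group("kind")) if match else None


def missing_files(samples, filenames):
    """Map each sample to the file kinds (SNV, INDEL) not found among filenames."""
    found = {parse_name(name) for name in filenames}
    report = {}
    for sample in samples:
        missing = [kind for kind in ("SNV", "INDEL") if (sample, kind) not in found]
        if missing:
            report[sample] = missing
    return report


files = ["A731RV.SNV.FINAL.xlsx", "A731RV.INDEL.FINAL.xlsx", "B732RV.INDEL.FINAL.csv.gz"]
print(parse_name("A731RV.SNV.FINAL.xlsx"))          # ('A731RV', 'SNV')
print(missing_files(["A731RV", "B732RV"], files))   # {'B732RV': ['SNV']}
```

A name such as A731RV.SNV.xlsx, which lacks the {something_else} part, is rejected by this pattern.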
OUTPUT:
AF_recalculation.csv: File containing the alleles used for AF recalculation;
PHENOA/
    SNV/: filtered SNVs for each sample
    INDEL/: filtered INDELs for each sample
PHENOA_INDEL_filtered.csv: Filtered INDELs for all the samples belonging to the PHENOA phenotype;
PHENOA_SNV_filtered.csv: Filtered SNVs for all the samples belonging to the PHENOA phenotype;
PHENOA_RECESSIVE.csv: Homozygous, putative compound heterozygous and X-linked variants cross-referenced against the OMIM recessive gene list. In the HET column, the value ‘pC-HET’ marks putative compound heterozygous variants;
PHENOA_DOMINANT.csv: Heterozygous variants cross-referenced against the OMIM dominant gene list;
PHENOA_noDOM_noREC.csv: Variants in genes without an OMIM dominant or recessive annotation. In the HET column, the value ‘pC-HET’ marks putative compound heterozygous variants;
PHENOA_genes4GO.txt: Non-redundant list of the dominant and recessive genes contained in PHENOA_RECESSIVE.csv and PHENOA_DOMINANT.csv, useful for Gene Ontology analysis.
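As a rough illustration of what a non-redundant gene list means here, the following Python sketch merges the gene column of two CSV tables into a sorted, duplicate-free list. The column name `GENE`, the helper `genes_for_go`, and the placeholder gene names are assumptions; the actual column headers of PHENOA_RECESSIVE.csv and PHENOA_DOMINANT.csv may differ.

```python
import csv
import io


def genes_for_go(*csv_handles, gene_column="GENE"):
    """Collect a sorted, duplicate-free gene list from open CSV handles."""
    genes = set()
    for handle in csv_handles:
        for row in csv.DictReader(handle):
            gene = (row.get(gene_column) or "").strip()
            if gene:
                genes.add(gene)
    return sorted(genes)


# In-memory stand-ins for the RECESSIVE and DOMINANT tables (placeholder values).
recessive = io.StringIO("GENE,HET\nGENE_C,HOM\nGENE_A,pC-HET\n")
dominant = io.StringIO("GENE,HET\nGENE_B,HET\nGENE_C,HET\n")

print(genes_for_go(recessive, dominant))  # ['GENE_A', 'GENE_B', 'GENE_C']
```

With real files, the handles would come from `open(path, newline="")` instead of `io.StringIO`.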
You can test the program using at least 50 samples.
First, check the sample sheet with the following command:
SNV_INDEL_filter_PV.py \
    -d ./test/esomi_prova \
    -a ./test/esomi_prova_data_sheet.csv \
    -b ./script/OMIM_30_07_22/ \
    -v
Then, if the sample sheet passes the check, you can launch the analysis:
SNV_INDEL_filter_PV.py \
    -d ./test/esomi_prova \
    -p 20 \
    -a ./test/esomi_prova_data_sheet.csv \
    -b ./script/OMIM_30_07_22/ \
    -pikt 0.7 \
    -o ./test/esomi_prova_test
After the test run, inspect the output data for any problems.
Then you can run the script on the full set of samples.