Exome sequencing data management and variant filtering in azoospermic and testicular germ cell tumor patients

This program runs all the steps of the Krausz et al. pipeline for SNV and INDEL filtering. It takes as input the .xlsx or .csv files produced by exome sequencing data analysis and aims to identify the rarest variants.

USAGE

SNV_INDEL_filter_PV.py [-h] (-s SINGLE | -d DIRECTORY) 
                       [-p PARALLEL] [-a AUX1] [-b AUX2] [-o OUTPUT]

Indicate a single file or a directory containing multiple files.

OPTIONS:

  -h, --help   show this help message and exit

  -s SINGLE, --single SINGLE   Path to a single INDEL or SNV file (debug mode)

  -d DIRECTORY, --directory DIRECTORY   Path to the directory to process

  -p PARALLEL, --parallel PARALLEL   Multiprocessing: number of parallel processes (default 2); unused in single mode

  -a AUX1, --aux1 AUX1   Auxiliary data sheet in CSV format, with a SAMPLE column holding the sample name and a PHENOTYPE column holding the diagnosis or ethnicity label.

      E.g.
      SAMPLE,PHENOTYPE
      A731RV,AZO
      B732RV,AZO
      C733RV,CRTL

  -b AUX2, --aux2 AUX2   Auxiliary directory containing the OMIM gene classification (e.g. D/r).

  -o OUTPUT, --output OUTPUT   Path to output directory. (default ./)

  -pikt --pathogenicIndexThreshold   Pathogenic Index threshold. Retains SNVs with a pathogenic index (IP) >= the user-defined threshold. If set to 0, this filter is disabled. (default 0.7)

  -ihAF --inHouseAlleleFrequency   In-house allele frequency threshold. Retains SNVs and INDELs with an in-house AF <= the user-defined threshold. If set to 1, this filter is disabled. (default 0.01)

  -v, --verification   Verifies that each sample in the sample sheet has its files (SNV, INDEL) in the input directory
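The two threshold options above can be read as a per-variant predicate. The following is a minimal sketch of that logic, not the script's actual implementation: the function name `passes_thresholds` and the dictionary keys `"IP"` and `"inHouseAF"` are hypothetical and may not match the real column names.

```python
def passes_thresholds(variant, kind, pi_threshold=0.7, ihaf_threshold=0.01):
    """Return True if a variant survives both threshold filters.

    `variant` is a dict; the keys "IP" and "inHouseAF" are assumed
    names, not necessarily the columns used by SNV_INDEL_filter_PV.py.
    `kind` is "SNV" or "INDEL".
    """
    # The pathogenic index filter applies to SNVs only;
    # a threshold of 0 disables it.
    if kind == "SNV" and pi_threshold != 0 and variant["IP"] < pi_threshold:
        return False
    # The in-house AF filter applies to SNVs and INDELs;
    # a threshold of 1 disables it.
    if ihaf_threshold != 1 and variant["inHouseAF"] > ihaf_threshold:
        return False
    return True
```

With the defaults, an SNV with IP 0.8 and in-house AF 0.005 is retained, while one with IP 0.5 is dropped unless the pathogenic index filter is disabled with a threshold of 0.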

Warning:
file names must have the following format:
  {SAMPLE}.{SNV|INDEL}.{something_else}.{xlsx|xls|csv|csv.gz|csv.gzip}
  E.g.
  A731RV.SNV.FINAL.xlsx
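To check a directory against this naming rule before a run, something like the sketch below may help. It is an illustration only and does not reproduce what `-v` does internally: `parse_name` and `missing_files` are hypothetical helpers, and the pattern assumes that sample names contain no dots.

```python
import re

# {SAMPLE}.{SNV|INDEL}.{something_else}.{xlsx|xls|csv|csv.gz|csv.gzip}
# Assumption: the SAMPLE part contains no "." characters.
FILENAME_RE = re.compile(
    r"^(?P<sample>[^.]+)\.(?P<kind>SNV|INDEL)\..+\."
    r"(?:xlsx|xls|csv|csv\.gz|csv\.gzip)$"
)


def parse_name(filename):
    """Return (sample, kind) for a well-formed name, else None."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return match.group("sample"), match.group("kind")


def missing_files(samples, filenames):
    """Map each sample to the file kinds (SNV, INDEL) it lacks."""
    found = {parse_name(name) for name in filenames}
    report = {}
    for sample in samples:
        missing = [k for k in ("SNV", "INDEL") if (sample, k) not in found]
        if missing:
            report[sample] = missing
    return report
```

For example, `parse_name("A731RV.SNV.FINAL.xlsx")` yields `("A731RV", "SNV")`, while a name that does not follow the format yields `None`.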

OUTPUT description in output directory

AF_recalculation.csv: file containing alleles for AF recalculation
PHENOA/
  /SNV/
   filtered SNV for each sample
  /INDEL/
   filtered INDEL for each sample
PHENOA_INDEL_filtered.csv: Filtered INDELs for all the samples belonging to the PHENOA phenotype;
PHENOA_SNV_filtered.csv: Filtered SNVs for all the samples belonging to the PHENOA phenotype;
PHENOA_RECESSIVE.csv: homozygous, putative compound heterozygous and X-linked variants cross-referenced against the OMIM recessive genes list. The HET column holds the value ‘pC-HET’ for putative compound heterozygous variants;
PHENOA_DOMINANT.csv: heterozygous variants cross-referenced against the OMIM dominant genes list;
PHENOA_noDOM_noREC.csv: variants in genes without an OMIM dominant|recessive annotation. The HET column holds the value ‘pC-HET’ for putative compound heterozygous variants;
PHENOA_genes4GO.txt: non-redundant list of the dominant and recessive genes contained in PHENOA_RECESSIVE.csv and PHENOA_DOMINANT.csv, useful for Gene Ontology analysis;
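As an illustration of how a non-redundant gene list like PHENOA_genes4GO.txt relates to the two CSV files, here is a minimal sketch. It is not the script's own code: the helper `genes_for_go` and the column name `GENE` are assumptions, the rows are made-up placeholders, and the alphabetical ordering is a choice made here for reproducibility.

```python
import csv
import io


def genes_for_go(*csv_streams, gene_column="GENE"):
    """Collect the unique gene names found in several CSV streams.

    `gene_column` is a hypothetical column name; adjust it to the
    header actually present in the output files.
    """
    genes = set()
    for stream in csv_streams:
        for row in csv.DictReader(stream):
            gene = (row.get(gene_column) or "").strip()
            if gene:
                genes.add(gene)
    # Sorting is not required by the README; it only makes the
    # result deterministic.
    return sorted(genes)


# Placeholder rows standing in for PHENOA_RECESSIVE.csv and
# PHENOA_DOMINANT.csv.
recessive = io.StringIO("GENE,HET\nGENE_A,pC-HET\nGENE_B,HOM\n")
dominant = io.StringIO("GENE,HET\nGENE_C,HET\nGENE_A,HET\n")
unique_genes = genes_for_go(recessive, dominant)
```

A gene that appears in both files, such as the placeholder `GENE_A`, is listed only once.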

Rehearsal

You can test the program by using at least 50 samples.

First, check the sample sheet with the following command:

SNV_INDEL_filter_PV.py \
	-d ./test/esomi_prova \
	-a ./test/esomi_prova_data_sheet.csv \
	-b ./script/OMIM_30_07_22/ \
	-v

Then, if the sample sheet passes verification, launch the filtering run:

SNV_INDEL_filter_PV.py \
	-d ./test/esomi_prova \
	-p 20 \
	-a ./test/esomi_prova_data_sheet.csv \
	-b ./script/OMIM_30_07_22/ \
	-pikt 0.7 \
	-o ./test/esomi_prova_test

After the rehearsal, inspect the output for any problems.

Then you can run the script on the entire sample set.