iLASH Analyzer

This repository includes nearly all the scripts we used to evaluate iLASH, our IBD estimation tool. Each file contains brief commentary about the code inside. Here, we go through the files that were critical to our evaluation process.

IBD Compiler (ibd_compiler.py)

This file simulates the IBD tracts. The load_meta and load_haps functions load the HapGen-generated files and build the objects necessary for the simulation (the map and hap objects). Generator functions, such as tract_generator_from_dist and family_generator, simulate IBD segments based on a given distribution. These functions return a dictionary containing the coordinates of non-overlapping IBD segments based on the map file.
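For intuition, a minimal, self-contained sketch of drawing tract lengths from a distribution and placing them without overlap might look like the following; the function and variable names are hypothetical, not the actual ibd_compiler.py API:

import random

def simulate_tracts(map_positions, n_tracts, draw_length):
    # Hypothetical sketch: place up to n_tracts non-overlapping IBD segments on a
    # chromosome whose genetic positions (in cM) are listed in map_positions.
    chrom_start, chrom_end = map_positions[0], map_positions[-1]
    tracts = []
    attempts = 0
    while len(tracts) < n_tracts and attempts < 10 * n_tracts:
        attempts += 1
        length = draw_length()  # e.g. lambda: random.expovariate(1.0 / 5.0)
        if length >= chrom_end - chrom_start:
            continue
        start = random.uniform(chrom_start, chrom_end - length)
        end = start + length
        # accept only segments that do not overlap an already accepted tract
        if all(end <= s or start >= e for (s, e) in tracts):
            tracts.append((start, end))
    return sorted(tracts)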

The make_related function then transforms a hap genotype array based on the simulated tracts passed to it as a dictionary. The save_map and save_hap functions can then be used to generate the PLINK PED/MAP files required for running iLASH and GERMLINE. The IBD dictionary should be saved for later evaluation of the IBD estimation algorithms. IBD loading functions, such as load_ilash_for_power and load_germline_for_power, can then be used to load the results of iLASH, GERMLINE, BEAGLE, and RaPID into a similarly formatted IBD dictionary using a genetic position dictionary (and sometimes a map object, both generated by the load_map_file function).
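Strung together, the workflow described above looks roughly like this; the function names come from ibd_compiler.py, but the argument lists and file names are simplified assumptions rather than the real signatures:

import pickle

# Sketch only: argument lists and file names are illustrative.
map_data = load_map_file('chr1.map')              # map object and genetic positions
haps = load_haps('chr1.haps')                     # HapGen haplotype data

ibd_dict = tract_generator_from_dist(map_data)    # simulate non-overlapping tracts
related_haps = make_related(haps, ibd_dict)       # copy segments between haplotypes

save_map(map_data, 'sim.map')                     # PLINK MAP for iLASH/GERMLINE
save_hap(related_haps, 'sim.ped')                 # PLINK PED for iLASH/GERMLINE

with open('ibd_truth.pkl', 'wb') as handle:       # keep the ground truth for later
    pickle.dump(ibd_dict, handle)

ilash_dict = load_ilash_for_power('ilash.out', map_data)  # reload tool output later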


Accuracy (analyzer.py)

IBD dictionary objects can be passed to the write_concordance function in this file to measure the accuracy of the IBD estimation tools against the original simulated IBD dictionary. The result is a tab-separated file in which each line represents one simulated IBD segment and reports the genetic distance covered by each algorithm for that segment. The arguments of this function are listed below; an example call is sketched after the list:

  1. ref_dict: The simulated IBD dictionary.
  2. dict_list: A list of IBD dictionaries generated for each algorithm.
  3. name_list: A list of names of the algorithms in the same order as the dict_list. These names will be used as column names when generating the output file.
  4. map_data: The map object generated by the read_map function.
  5. hap_count: Total number of simulated haplotypes. Usually twice the sample count.
  6. dict_size: If IBD simulations are done in batches (due to the large size of the simulations, we simulated IBD dictionaries for smaller sample sizes and then concatenated them), the batch size should be passed, since ref_dict is divided into batches.
  7. output_address: The address for the tab separated results file.
  8. rapid_ind: The index for the RaPID output. This option is needed because RaPID does not report the haplotype code like the other tools do.
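As an illustration, a call might look like the following; the keyword style and variable names here are assumptions, so the exact signature should be checked against analyzer.py:

# Hypothetical invocation; variable names are illustrative and the argument
# order/keywords should be verified against analyzer.py.
write_concordance(
    ref_dict=simulated_ibd,               # ground-truth simulated IBD dictionary
    dict_list=[ilash_ibd, germline_ibd],  # one estimated IBD dictionary per tool
    name_list=['iLASH', 'GERMLINE'],      # column names for the output file
    map_data=map_data,                    # map object from read_map
    hap_count=2 * sample_count,           # two haplotypes per sample
    dict_size=batch_size,                 # batch size used during simulation
    output_address='concordance.tsv',     # tab-separated results file
    rapid_ind=rapid_index,                # index of the RaPID dictionary, if any
)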

Measurement and Evaluation (tester.py & timer.py)

The tester.py file contains the classes that streamline our evaluation process for the iLASH paper. The parent class, TesterBase, provides a template for our measurement processes. Measurement processes are iterator objects that step through a range of values (or iterations) for their assigned parameters, providing the data and configuration for each run and collecting the results. They also use ProcessTimer to measure the runtime and memory consumed in each iteration. The subclasses are described below, followed by a sketch of the shared iterator pattern:

  • FPTester: This class runs false-positive tests using a composite-individuals model. For each iteration, it creates a new genotype file by shuffling the genotype data (divided into small windows).
  • GridSearch: This class takes ranges of values for the number of buckets and the number of minhash signatures per bucket and iterates through all combinations of them.
  • LengthTester: This class takes a range of minimum IBD length values and tests the effect of the minimum IBD length on the tool's runtime and memory usage.
  • ThinnerTester: This class randomly removes SNPs from the genotype data to 'thin' it down. As input arguments, it takes the ratio of SNPs to be dropped and the number of repetitions.
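As a rough illustration of the iterator pattern these classes share, here is a minimal sketch; TesterBase is the real class name, but the methods and fields shown are hypothetical and simplified:

class TesterBase:
    """Minimal sketch of the shared template: step through parameter values,
    prepare each run, execute the tool, and collect its measurements.
    Method and attribute names are illustrative, not the actual tester.py API."""

    def __init__(self, values):
        self.values = values
        self.results = []

    def prepare(self, value):
        raise NotImplementedError  # e.g. write a shuffled or thinned genotype file

    def run(self, value):
        raise NotImplementedError  # e.g. launch the tool through ProcessTimer

    def __iter__(self):
        for value in self.values:
            self.prepare(value)
            result = self.run(value)
            self.results.append(result)
            yield result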

ProcessTimer

Implemented in the timer.py file, this class is in charge of running the shell commands for each evaluation run and recording their runtime and memory usage.
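A minimal sketch of this idea using only the Python standard library (the function name and return format are assumptions, not the actual timer.py interface):

import resource
import subprocess
import time

def time_process(command):
    # Sketch of what ProcessTimer does (Unix only): run a shell command,
    # then report wall-clock time and the peak memory of child processes.
    start = time.time()
    subprocess.run(command, shell=True, check=True)
    elapsed = time.time() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is the peak resident set size of finished child processes
    # (reported in kilobytes on Linux).
    return elapsed, usage.ru_maxrss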


add_distance.py

This script adds genetic distances (in cM) to a PLINK MAP file that does not have them. To do so, it requires a "genetic distance file" (raw_addr).
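One common way to fill in missing cM values is linear interpolation against a reference genetic map; the sketch below is illustrative only, and the names (interpolate_cm, ref_bp, ref_cm) are hypothetical rather than taken from add_distance.py:

import bisect

def interpolate_cm(bp_position, ref_bp, ref_cm):
    # Sketch: linearly interpolate the cM position of a SNP at bp_position,
    # given sorted reference physical positions (ref_bp) and the matching
    # genetic positions (ref_cm) from a genetic distance file.
    i = bisect.bisect_left(ref_bp, bp_position)
    if i == 0:
        return ref_cm[0]
    if i == len(ref_bp):
        return ref_cm[-1]
    left_bp, right_bp = ref_bp[i - 1], ref_bp[i]
    left_cm, right_cm = ref_cm[i - 1], ref_cm[i]
    frac = (bp_position - left_bp) / (right_bp - left_bp)
    return left_cm + frac * (right_cm - left_cm)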


Reader (reader.py)

Basic I/O functions such as load_map_data, save_map_data, and load_haps_data are implemented in this file. Functions that load iLASH and GERMLINE results are also available here, although more complete versions exist in the ibd_compiler.py file.
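For reference, a PLINK MAP line holds the chromosome, SNP ID, genetic distance (cM), and base-pair position; a minimal reader might look like the sketch below (the function name is hypothetical, not the actual reader.py API):

def load_map_rows(map_address):
    # Sketch of reading a PLINK MAP file into a list of
    # (chromosome, SNP ID, cM, bp) tuples; illustrative only.
    rows = []
    with open(map_address) as handle:
        for line in handle:
            chrom, snp_id, cm, bp = line.split()
            rows.append((chrom, snp_id, float(cm), int(bp)))
    return rows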


Transform (ped_2_vcf.py & ped2vcfmap.py & hap_to_ped.py)

The ped_2_vcf.py and ped2vcfmap.py scripts transform PED files into VCF files so they can be fed into IBD estimation tools that require VCF input.

The hap_to_ped.py file contains the function we used to convert HAP files generated by HapGen or by phasing software such as SHAPEIT and EAGLE into PED format. The function convert_haps is in charge of this conversion. You can also run the file directly from the command line:

$ python hap_to_ped.py [input HAP file address] [number of samples] [number of SNPs] [output PED file address]
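Conceptually, the conversion is a transpose: HAP files store one SNP per line with two allele columns per sample, while PED files store one sample per line. The sketch below illustrates that idea under simplifying assumptions (no leading metadata columns, alleles written as-is); the real convert_haps handles the actual formats:

def haps_to_ped_rows(hap_address, n_samples, n_snps):
    # Sketch of the HAP-to-PED transpose, assuming allele columns start at the
    # first field of each HAP line; convert_haps handles the real file layout.
    alleles = []  # alleles[i][j] holds sample j's two alleles at SNP i
    with open(hap_address) as handle:
        for _ in range(n_snps):
            fields = handle.readline().split()
            alleles.append([(fields[2 * j], fields[2 * j + 1]) for j in range(n_samples)])
    for j in range(n_samples):
        meta = ['FAM%d' % j, 'IND%d' % j, '0', '0', '0', '-9']  # PED metadata columns
        genotypes = [a for i in range(n_snps) for a in alleles[i][j]]
        yield '\t'.join(meta + genotypes)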


divide_to_chr.py

This script can be used to break down PED/MAP files that contain genotype information for more than one chromosome into multiple files, one per chromosome.
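A minimal sketch of the MAP side of this split (the function name is hypothetical; the matching PED columns would be sliced with the same per-chromosome SNP indices):

from collections import defaultdict

def split_map_by_chromosome(map_address, output_prefix):
    # Sketch: group PLINK MAP lines by their chromosome column and write
    # one MAP file per chromosome. Illustrative only.
    per_chrom = defaultdict(list)
    with open(map_address) as handle:
        for line in handle:
            chrom = line.split()[0]
            per_chrom[chrom].append(line)
    for chrom, lines in per_chrom.items():
        with open('%s.chr%s.map' % (output_prefix, chrom), 'w') as out:
            out.writelines(lines)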


Extracting Populations (extract_aa_cols.py & extract_aa_haps.py)

The first script, extract_aa_cols.py, takes three input arguments. The first is the address of a file containing the primary IDs of a sub-population. The second is the address of a file containing information for the total population, including primary and secondary IDs. The script saves the information pertaining to the sub-population specified in the first argument to the file address provided as the third argument.

The second script, extract_aa_haps.py, extracts the haplotypes of any sub-population from a larger set of haplotype data in a similar manner.
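The core idea of the first script can be sketched as filtering rows by an ID set; the function below is illustrative only and does not reproduce the actual column handling in extract_aa_cols.py:

import sys

def extract_subpopulation(id_address, population_address, output_address):
    # Sketch: keep only the rows of the total-population file whose primary ID
    # appears in the sub-population ID list. Illustrative only.
    with open(id_address) as handle:
        wanted = {line.split()[0] for line in handle if line.strip()}
    with open(population_address) as pop, open(output_address, 'w') as out:
        for line in pop:
            if line.split()[0] in wanted:
                out.write(line)

if __name__ == '__main__':
    extract_subpopulation(sys.argv[1], sys.argv[2], sys.argv[3])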