Loopsim

Requirements

Loopsim was tested with the following environment:

  • Python >=3.8
  • Linux (Ubuntu 20.04 LTS)

Installation

With Pip

pip install loopsim

From GitHub

pip install git+https://github.com/CutaneousBioinf/Loopsim 

From source

git clone https://github.com/CutaneousBioinf/Loopsim
cd Loopsim

# Option 1: pip
pip install .

# Option 2: poetry
poetry install

Input Files Format

The Loopsim pipeline requires three types of files:

  • Type Name: Chromatin loop file

    • Purpose: Store chromatin loop (Hi-C) data

    • Format: Delimiter-separated values file in BEDPE format. Each row represents one chromatin loop.

      Columns are as follows:

      | Start region chromosome | First locus of start region | Last locus of start region | End region chromosome | First locus of end region | Last locus of end region |
      | --- | --- | --- | --- | --- | --- |
    • Example: merged_5K_10K.loop

  • Type Name: Chromosome region file

    • Purpose: Defines the region of each chromosome

    • Format: Delimiter-separated values file in BED format. Each row represents one chromosome region.

      Columns are as follows:

      | Chromosome | Start position | End position |
      | --- | --- | --- |
    • Example: chr_region_hg19

  • Type Name: Genomic regions of interest

    • Purpose: Store genomic regions of interest to be checked against chromatin loop file

    • Format: Delimiter-separated values file. Each row represents a genomic region of interest.

      Columns are as follows:

      | Chromosome | Start locus of region | End locus of region |
      | --- | --- | --- |
    • Example: 95_BCS_psor_loci
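
To make the six-column loop layout described above concrete, here is a minimal Python sketch that parses such rows. It is not part of Loopsim: the `Loop` type, the `read_loops` helper, and the sample rows are hypothetical, and the sketch assumes tab-delimited input with no header line.

```python
import csv
import io
from typing import List, NamedTuple


class Loop(NamedTuple):
    """One chromatin loop: two anchors, each a (chromosome, start, end) triple."""
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int


def read_loops(handle, delimiter: str = "\t") -> List[Loop]:
    """Parse the first six columns of each row into a Loop (hypothetical helper)."""
    loops = []
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 6:
            continue  # skip blank or short rows (assumption: no header line)
        c1, s1, e1, c2, s2, e2 = row[:6]
        loops.append(Loop(c1, int(s1), int(e1), c2, int(s2), int(e2)))
    return loops


# Made-up rows for illustration only; not taken from merged_5K_10K.loop.
sample = io.StringIO(
    "chr1\t10000\t15000\tchr1\t80000\t85000\n"
    "chr2\t5000\t10000\tchr2\t60000\t70000\n"
)
loops = read_loops(sample)
print(loops[0])
```

The three-column chromosome region and regions-of-interest files could be read the same way by keeping only `row[:3]`.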

Basic Usage

Loopsim is broken down into a number of different commands:

| Step | Command | Description |
| --- | --- | --- |
| 1 | validate | Validates the input loop file. Issues warnings about possibly erroneous data and removes some types of erroneous data. |
| 2 | simulate | Produces a distribution of simulated loop files. This can be a very compute-intensive task, depending on the number of simulations you require. I recommend running anything over 30 simulations in multiple batches, possibly as a collection of SLURM jobs. |
| 3 | analyze or batch-analyze | Use batch-analyze to produce summary tables with overlaps for the simulated distribution of loop files. Use analyze to do the same for a single loop file, such as the original. |
| 4 | visualize | Produces visualizations, outputs summary statistics, and performs a statistical test comparing the simulated distribution with the original loop file. |
Figure: Loopsim pipeline diagram
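
The recommendation to split large simulation runs into batches can be sketched as follows. This is an illustration rather than a Loopsim feature: `batch_commands` and the `sims_batch_` directory prefix are hypothetical. Each batch gets its own output directory because the `simulate` help text warns that existing data in the simulation directory may be overwritten. The sketch only builds the command lines (for example, to paste into SLURM job scripts); it does not run them.

```python
import shlex
from typing import List


def batch_commands(total_sims: int, batch_size: int, loop_file: str,
                   region_file: str, out_prefix: str) -> List[List[str]]:
    """Split total_sims into batches and build one `loopsim simulate` call per batch."""
    commands = []
    remaining = total_sims
    batch = 0
    while remaining > 0:
        n = min(batch_size, remaining)
        commands.append([
            "loopsim", "simulate", "--num-sims", str(n),
            loop_file, region_file,
            f"{out_prefix}{batch}/",  # hypothetical per-batch output directory
        ])
        remaining -= n
        batch += 1
    return commands


cmds = batch_commands(100, 30, "loop_valid.loop",
                      "example_data/chr_region_hg19", "sims_batch_")
for cmd in cmds:
    print(shlex.join(cmd))
```

With 100 simulations and a batch size of 30, this yields four commands (30, 30, 30, and 10 simulations). How the per-batch outputs are combined afterwards is not covered here.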

CLI Reference

You can run loopsim --help for a broad overview of each of the commands.

$ loopsim --help
Usage: loopsim [OPTIONS] COMMAND [ARGS]...

  For a more thorough explanation of what every command does, please see the
  documentation.

Options:
  --delimiter TEXT  delimiter for outputted files [default: tab]
  --version         Show the version and exit.
  --help            Show this message and exit.

Commands:
  analyze        Perform analysis on a single loop file
  batch-analyze  Perform analysis on a distribution of loop files
  simulate       Generate a distribution of simulations
  validate       Validate input file and output a validated version
  visualize      Get visualization and stats from distribution of ratios

You can also run loopsim <COMMAND> --help for more detailed help messages on each of the commands.

For example, here is the help message for simulate:

$ loopsim simulate --help
Usage: loopsim simulate [OPTIONS] LOOP_IN_FILE
                                  CHROMOSOME_REGION_FILE
                                  SIMULATION_DATA_DIRECTORY

  Generate a distribution of simulations.

  NOTE: any data in SIMULATION_DATA_DIRECTORY may be overwritten!!

Options:
  --num-sims INTEGER       number of simulations  [default: 1]
  --num-processes INTEGER  number of threads to use
                           [default: round(multiprocessing.cpu_count() / 2)]
  --help                   Show this message and exit.

Tutorial

Below, you can find a guided walkthrough of the Loopsim tool.

To follow along with the guided walkthrough, just download the repository and install Loopsim. The Loopsim repository includes all the necessary example files.

  1. Validation
  2. Simulation
  3. Analysis
  4. Visualization

Validation

$ loopsim validate example_data/merged_5K_10K.loop loop_valid.loop example_data/chr_region_hg19
Input loop file: example_data/merged_5K_10K.loop
Output loop file: loop_valid.loop
Chromosome regions file: example_data/chr_region_hg19
Flagging loop ends that are >= 1.000000e+05
Delimiter for output: ' '
Validating loop data
Validation complete
Validated data outputted to file loop_valid.loop
Files after:
.
└── loop_valid.loop

Simulation

$ loopsim simulate --num-sims 2 loop_valid.loop example_data/chr_region_hg19 sims/
Input loop file: loop_valid.loop
Chromosome regions file: example_data/chr_region_hg19
Number of simulations: 2
Number of processes: 5
Outputting simulation files to directory: sims/
Delimiter for output: ' '
Simulation 0 simulation started
Simulation 1 simulation started
Simulation 0 simulation complete
Simulation 1 simulation complete
Simulation 0 data outputted to file: sims/sim_hi-c_0.loop
Simulation 1 data outputted to file: sims/sim_hi-c_1.loop
Files after:
.
└── sims
    ├── sim_hi-c_0.loop
    └── sim_hi-c_1.loop

Analysis

Batch Analysis

$ loopsim batch-analyze sims/ example_data/95_BCS_psor_loci ratios_out.txt --loop-out-directory loop_out_dir/
Input loop files directory: sims/
Intervals file: example_data/95_BCS_psor_loci
Ratio distribution file: ratios_out.txt
Delimiter for output: ' '
Output loop files directory: loop_out_dir/
Output directory does not exist.
Output directory created!
Finished outputting analyzed files to loop_out_dir/
Finished outputting ratio distribution to ratios_out.txt
Files after:
.
├── ratios_out.txt
└── loop_out_dir
    ├── summary_table_0.loop
    └── summary_table_1.loop

Single-file Analysis

$ loopsim analyze loop_valid.loop loop_analyzed.loop example_data/95_BCS_psor_loci
Input loop file: loop_valid.loop
Output loop file: loop_analyzed.loop
Intervals file: example_data/95_BCS_psor_loci
Delimiter for output: ' '
Outputted analyzed loop file to loop_analyzed.loop
Ratio of overlapping intervals out of the total number of loops was: 0.034299968818210166
Files after:

.
└── loop_analyzed.loop

Note: loop_analyzed.loop is not used again in the pipeline. The ratio printed above is the value passed to visualize through --other.
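
As a rough illustration of what a "ratio of overlapping intervals out of the total number of loops" could mean, here is a minimal sketch. It is not Loopsim's implementation: it assumes a loop counts as overlapping when either of its two anchors intersects any region of interest on the same chromosome, and it treats coordinates as closed intervals. The function names and the data are made up.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]            # (chromosome, start, end)
LoopAnchors = Tuple[Interval, Interval]    # (start anchor, end anchor)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals on the same chromosome share a position."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def overlap_ratio(loops: List[LoopAnchors], regions: List[Interval]) -> float:
    """Fraction of loops with at least one anchor overlapping any region (assumed definition)."""
    if not loops:
        return 0.0
    hits = sum(
        1 for first, second in loops
        if any(overlaps(first, r) or overlaps(second, r) for r in regions)
    )
    return hits / len(loops)


# Made-up data: four loops, one region of interest.
loops = [
    (("chr1", 100, 200), ("chr1", 900, 1000)),
    (("chr1", 300, 400), ("chr1", 1500, 1600)),
    (("chr2", 100, 200), ("chr2", 500, 600)),
    (("chr2", 700, 800), ("chr2", 900, 1000)),
]
regions = [("chr1", 150, 160)]
ratio = overlap_ratio(loops, regions)
print(ratio)
```

Only the first loop has an anchor that touches the region, so one of four loops is counted.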

Visualization

$ loopsim visualize ratios_out.txt dist_plot.jpg --other 0.034299968818210166
Obtaining overlapping ratios from: ratios_out.txt.
Exported plot to dist_plot.jpg

Summary stats:
Distribution mean: 0.0178775595052489
Distribution std: 0.000808458018194828
Distribution min: 0.0173058933582787
Distribution median: 0.0178775595052489
Distribution max: 0.0184492256522191

Calculating p-value based on empirical distribution:
p-value: 0.0

Calculating p-value based on normal distribution:
p-value: 0.0
Plot

Note: The odd-looking distribution plot and $p = 0$ are artifacts of running only $N = 2$ simulations.
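
To see why both p-values come out as 0 here, the following sketch recomputes them from the two simulated ratios (the reported minimum and maximum) and the observed ratio. This is an assumption about the test, not Loopsim's code: it uses a one-sided upper-tail comparison, the sample standard deviation, and the hypothetical helpers `empirical_p_value` and `normal_p_value`.

```python
from statistics import NormalDist, mean, stdev


def empirical_p_value(simulated, observed):
    """Fraction of simulated ratios at least as large as the observed ratio."""
    return sum(1 for r in simulated if r >= observed) / len(simulated)


def normal_p_value(simulated, observed):
    """Upper-tail probability under a normal fit to the simulated ratios."""
    fitted = NormalDist(mean(simulated), stdev(simulated))
    return 1.0 - fitted.cdf(observed)


# With N = 2, the two simulated ratios are the reported min and max.
simulated = [0.0173058933582787, 0.0184492256522191]
observed = 0.034299968818210166  # ratio from the single-file analysis

print("std:", stdev(simulated))
print("empirical p-value:", empirical_p_value(simulated, observed))
print("normal p-value:", normal_p_value(simulated, observed))
```

Neither simulated ratio reaches the observed one, so the empirical p-value is 0 of 2. The observed ratio also lies roughly 20 standard deviations above the mean, so the normal tail probability is too small to represent as anything but 0. A larger number of simulations gives a more informative distribution.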

Figure: Loopsim distribution plot (dist_plot.jpg)