This tool is an in-house Snakemake pipeline from the EI Core-Bioinformatics group for the quality control and pre-processing of Hi-C data for genome assembly. It currently supports the Arima Genomics and Dovetail Genomics Omni-C protocols.
#Snakemake needs to be at least version 7.10.
#This software was tested with 7.12.1, and using the same version is recommended because of major and inconsistent changes between Snakemake releases.
conda create -n snakemake python=3.8 snakemake=7.12.1
conda activate snakemake #If this is your first time using conda, see the conda documentation at the end of the repository.
git clone https://${PERSONAL_TOKEN}
cd eihic
#For ease of installation follow these steps
python -m pip install --upgrade pip #updates pip to latest version
export PYTHONUSERBASE=/path/to/install/to #Sets a custom installation path
python3 -m pip install --user -r requirements.txt .
echo $PATH #make sure the installed binaries are in your PATH
#if not, add them to your PATH.
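The version requirement above (at least 7.10, tested with 7.12.1) can be checked programmatically. The following is a minimal sketch, not part of eihic: the helper names `parse_version` and `meets_minimum` are hypothetical, and the version string is assumed to be plain dotted numbers such as the one printed by `snakemake --version`.

```python
import re

# Minimum version stated in the installation notes above.
MINIMUM_SNAKEMAKE = (7, 10)


def parse_version(text):
    """Turn a version string such as '7.12.1' into a tuple of integers."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"cannot parse a version from {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(text, minimum=MINIMUM_SNAKEMAKE):
    """Return True if the version in `text` is at least `minimum`."""
    return parse_version(text) >= minimum


# Illustrative values only; in practice pass the output of `snakemake --version`.
print(meets_minimum("7.12.1"))
print(meets_minimum("7.9.3"))
```

Comparing integer tuples avoids the pitfall of comparing version strings, where "7.9" would sort after "7.10".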
First source the current version of the tool:
source eihic-0.2.0
If you don't have it, add /ei/software/cb/bin to your PATH variable:
export PATH=${PATH}:/ei/software/cb/bin
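To confirm that the directory really ended up in PATH, a small check like the one below may help. It is only a sketch: `directory_on_path` and `with_directory` are hypothetical helper names, and the example PATH value is made up for illustration.

```python
import os


def directory_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    wanted = os.path.normpath(directory)
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normpath(entry) == wanted for entry in entries)


def with_directory(directory, path_value):
    """Append `directory` to a PATH-style string unless it is already present."""
    if directory_on_path(directory, path_value):
        return path_value
    return path_value + os.pathsep + directory if path_value else directory


# Illustrative PATH value; in practice use os.environ.get("PATH", "").
example_path = os.pathsep.join(["/usr/bin", "/bin"])
print(directory_on_path("/ei/software/cb/bin", example_path))
print(with_directory("/ei/software/cb/bin", example_path))
```

The standard library call `shutil.which("eihic")` is a quicker alternative when you only need to know whether the executable can be found.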
Then check the wrapper for the two main components of eihic:
eihic --help
[~]--% eihic --help
usage: EI HI-C [-h] [-v] {configure,run} ...
EI HI-C Pipeline
positional arguments:
configure see `configure -h`
run see `run -h`
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
The first step of the pipeline is to create a config file to be used by Snakemake on the HPC. All the arguments are listed as optional, but --samples_csv (-s) must be given the path to a sample file, as shown in the reference path printed by the eihic configure --help command.
--% eihic configure --help
usage: EI HI-C configure [-h] [-s SAMPLES_CSV] [-c] [--jira JIRA] [-o OUTPUT] [-f] [-bm2]
optional arguments:
-h, --help show this help message and exit
-s SAMPLES_CSV, --samples_csv SAMPLES_CSV
Provide sample information in tab-separated format. Please refer to the sample
file: /ei/software/cb/eihic/0.2.0/x86_64/lib/python3.9/site-
packages/eihic/etc/run_config.yaml for more information above the csv format. A
template is provided here
packages/eihic/etc/samples.csv. (default: None)
-c, --curation Use this flag if you have your final scaffold and want to run the curation
pipeline. Bare in mind that the sample_csv file requires an extra field with
the long reads file paths as a fifth line. (default: False)
--jira JIRA Provide JIRA id for posting job summary. E.g., PPBFX-611 (default: None)
-o OUTPUT, --output OUTPUT
Provide output directory (default: /ei/.project-
-f, --force-reconfiguration
Force reconfiguration (default: False)
-bm2, --bwa_mem2 Use bwa-mem2 insted of bwa mem. This option use a lot more RAM, use with
precaution. (default: False
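As an illustration of how the options in the help text combine, here is a minimal sketch that assembles an `eihic configure` command line. The function `build_configure_command` is hypothetical and not part of eihic; only flags listed in the help output above are used, and `samples.csv` and `output_dir` are placeholder paths.

```python
import shlex


def build_configure_command(samples_csv, output=None, curation=False,
                            bwa_mem2=False, force=False, jira=None):
    """Assemble an `eihic configure` argument list from the documented flags."""
    command = ["eihic", "configure", "--samples_csv", samples_csv]
    if output is not None:
        command += ["--output", output]
    if curation:
        command.append("--curation")
    if bwa_mem2:
        command.append("--bwa_mem2")
    if force:
        command.append("--force-reconfiguration")
    if jira is not None:
        command += ["--jira", jira]
    return command


# Placeholder paths; the list can be handed to subprocess.run(command, check=True).
command = build_configure_command("samples.csv", output="output_dir", curation=True)
print(shlex.join(command))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.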
Once you have created the output_dir/run_config.yaml file with eihic configure, run the eihic run subcommand. This tool supports the omni-c and arima two-enzyme library prep protocols for Hi-C; you must choose one of them (the default is omni-c).
--% eihic run --help
usage: EI HI-C run [-h] [--library LIBRARY] [-c] [--hpc_config HPC_CONFIG] [--jobs JOBS]
[--latency_wait LATENCY_WAIT] [--no_posting_off] [-v] [-np]
positional arguments:
run_config Provide run configuration YAML. Run 'eihic configure -h' to generate the run
configuration YAML file. (Description template file is here:
optional arguments:
-h, --help show this help message and exit
--library LIBRARY This pipeline supports the following library protocols for hi-c data: arima (2
enzymes protocol), and omni-c. Provide the name (arima or omni-c) as the second
positional argument after the sample file (default: omni-c)
-c, --curation This flag run the steps generate all the required Hi-C contact matrices and
tracks for its manual curation step. (default: False)
--hpc_config HPC_CONFIG
Provide HPC configuration YAML (default:
--jobs JOBS, -j JOBS Use at most N CPU cluster/cloud jobs in parallel (default: 100)
--latency_wait LATENCY_WAIT
Wait given seconds if an output file of a job is not present after the job
finished (default: 120)
--no_posting_off Use this flag if you want to post comments to JIRA tickets (default: True)
-v, --verbose Verbose mode for debugging (default: False)
-np, --dry_run Dry run (default: False
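The sketch below shows one way to assemble an `eihic run` command from the options above. `build_run_command` is a hypothetical helper, not part of eihic. It assumes the library is passed through the `--library` option shown in the usage line (the help text also mentions a positional form, which is not used here), and that the run configuration YAML is given as the positional argument.

```python
# Library protocols named in the help text above.
SUPPORTED_LIBRARIES = ("omni-c", "arima")


def build_run_command(run_config, library="omni-c", curation=False,
                      jobs=None, dry_run=False):
    """Assemble an `eihic run` argument list from the documented flags."""
    if library not in SUPPORTED_LIBRARIES:
        raise ValueError(
            f"unsupported library {library!r}; choose one of {SUPPORTED_LIBRARIES}"
        )
    command = ["eihic", "run", run_config, "--library", library]
    if curation:
        command.append("--curation")
    if jobs is not None:
        command += ["--jobs", str(jobs)]
    if dry_run:
        command.append("--dry_run")
    return command


# A dry run is a cheap way to inspect the planned jobs before submitting them.
print(build_run_command("output_dir/run_config.yaml", library="arima", dry_run=True))
```

Rejecting unknown library names before launching anything catches typos such as "omnic" early.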
arima two enzymes library:
- READ_NAME.hicup.sam.HiCUP_summary_report_UUID_TIMESTAMP.html
omni-c library:
omni-c library curation mode:
- OUTPUT_dir/workflow/samtools/ORGANISM_NAME.sorted.bam
- OUTPUT_dir/workflow/bwa/ORGANISM_NAME_mapped_reads.sort.bam
- OUTPUT_dir/workflow/pretext/ORGANISM_NAME_unique_mapping.pretext
- OUTPUT_dir/workflow/pretext/ORGANISM_NAME_multi_mapping.pretext
- OUTPUT_dir/workflow/cooler/unique_1kb.mcool
- OUTPUT_dir/workflow/cooler/all_1kb.mcool
- OUTPUT_dir/workflow/tracks/gaps_ORGANISM_NAME.bedgraph
- OUTPUT_dir/workflow/tracks/telomeres_ORGANISM_NAME.bedgraph
- OUTPUT_dir/workflow/tracks/coverage_ORGANISM_NAME.bedgraph
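After a curation run it can be handy to verify that all of the files listed above exist. The following is a minimal sketch under two assumptions: OUTPUT_dir is the output directory given at configuration time, and ORGANISM_NAME is the organism name from the sample file. The function names are hypothetical.

```python
from pathlib import Path


def expected_curation_outputs(output_dir, organism_name):
    """List the omni-c curation-mode output paths described above."""
    workflow = Path(output_dir) / "workflow"
    return [
        workflow / "samtools" / f"{organism_name}.sorted.bam",
        workflow / "bwa" / f"{organism_name}_mapped_reads.sort.bam",
        workflow / "pretext" / f"{organism_name}_unique_mapping.pretext",
        workflow / "pretext" / f"{organism_name}_multi_mapping.pretext",
        workflow / "cooler" / "unique_1kb.mcool",
        workflow / "cooler" / "all_1kb.mcool",
        workflow / "tracks" / f"gaps_{organism_name}.bedgraph",
        workflow / "tracks" / f"telomeres_{organism_name}.bedgraph",
        workflow / "tracks" / f"coverage_{organism_name}.bedgraph",
    ]


def missing_outputs(output_dir, organism_name):
    """Return the expected output paths that do not exist on disk."""
    return [
        path
        for path in expected_curation_outputs(output_dir, organism_name)
        if not path.exists()
    ]


# Placeholder directory and organism name, for illustration only.
for path in missing_outputs("output_dir", "my_organism"):
    print(f"missing: {path}")
```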
The sample file is composed of four lines, plus an optional fifth line in curation mode:
- all R1 reads, comma-separated
- all R2 reads, comma-separated
- path to the reference assembly
- organism name (will be used for file naming)
- (if running curation mode -c / --curation) list of HiFi reads in fasta/fasta.gz, comma-separated
sample_1_R1.fastq,sample_2_R1.fastq,(...), sample_n_R1.fastq
sample_1_R2.fastq,sample_2_R2.fastq,(...), sample_n_R2.fastq
reference/genome/your_reference_genome.fasta #path to reference
name_of_organism (internal naming usage)
hifi_1.fasta,hifi_2.fasta,(...),hifi_n.fasta (optional)
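The line-based layout described above can be validated before running `eihic configure`. This is a minimal sketch, not the parser used by eihic: the `SampleSheet` class and function names are hypothetical, the file is assumed to contain no inline annotations, and the requirement that R1 and R2 list the same number of files is an assumption based on the reads being paired.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleSheet:
    r1_reads: List[str]
    r2_reads: List[str]
    reference: str
    organism: str
    hifi_reads: List[str] = field(default_factory=list)


def split_paths(line):
    """Split a comma-separated line into paths, dropping surrounding spaces."""
    return [item.strip() for item in line.split(",") if item.strip()]


def parse_sample_sheet(text, curation=False):
    """Parse the four-line (five in curation mode) sample file layout."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    expected = 5 if curation else 4
    if len(lines) < expected:
        raise ValueError(f"expected {expected} lines, found {len(lines)}")
    r1_reads, r2_reads = split_paths(lines[0]), split_paths(lines[1])
    # Assumption: every R1 file has a matching R2 file.
    if len(r1_reads) != len(r2_reads):
        raise ValueError("R1 and R2 lines list a different number of files")
    hifi_reads = split_paths(lines[4]) if curation else []
    return SampleSheet(r1_reads, r2_reads, lines[2], lines[3], hifi_reads)


# Made-up file names, for illustration only.
example = """sample_1_R1.fastq,sample_2_R1.fastq
sample_1_R2.fastq, sample_2_R2.fastq
reference/genome/assembly.fasta
my_organism
"""
print(parse_sample_sheet(example))
```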
Package management and environment management for Python code.
Workflow management tool based on Python.
HiCUP is the software used for the QC from
The Arima software is installed from a fork of the original repository because some subtle changes to the installation steps were necessary for the HPC installation.
Matrix indexing used to obtain .mcool contact maps for use in HiGlass.
Multi-resolution matrix used for the manual curation of genomes and TAD analysis.
Contact map visualization tool for Hi-C, used to move contigs/scaffolds when curating the genome.
Telomere identification toolkit from DToL.
If you don't know or there is no information for your telomere sequence
Used to obtain coverage data from the genome.
Aligner of choice for the PacBio HiFi reads. The output is in SAM/BAM format, and the map-hifi preset is used.
- use bwa-mem2 to make the analysis faster for small genomes (too resource-demanding for big genomes)
- add other relevant library preps for hi-c
- add telomere to the config
- add repeats track from eirepeat pipeline
- add the fasta to TPF file formatting