
ENCODE-DCC/ENCODE_scatac

 
 


🏭 ENCODE sc/snATAC Automated Processing

Note: This pipeline is currently a work in progress.

This is the automated portion of the ENCODE single-cell/single-nucleus ATAC-Seq pipeline.

Information on the specific analysis steps can be found in the pipeline specification document.

Requirements

  • A Linux-based OS
  • A conda-based Python 3 installation
  • Snakemake v6.6.1+ (full installation)
  • An ENCODE DCC account with access to the necessary datasets
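The Snakemake version requirement above can be checked before running anything. The following is a minimal sketch, not part of the pipeline: the helper name `version_ge` is hypothetical, and it assumes GNU `sort` (for the `-V` version-ordering option) and that `snakemake --version` prints a bare version string.

```shell
# Hypothetical helper: succeeds when version $1 is >= version $2.
# Relies on `sort -V`, a GNU coreutils extension for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="6.6.1"  # minimum version named in the requirements list

if command -v snakemake >/dev/null 2>&1; then
    INSTALLED="$(snakemake --version)"
    if version_ge "$INSTALLED" "$REQUIRED"; then
        echo "Snakemake $INSTALLED satisfies >= $REQUIRED"
    else
        echo "Snakemake $INSTALLED is older than $REQUIRED"
    fi
else
    echo "snakemake not found on PATH; is the conda environment active?"
fi
```

Comparing with `sort -V` rather than plain string comparison matters for versions such as 6.10.0, which sorts after 6.6.1 only under version ordering.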

Additional requirements for cloud execution:

  • Kubectl
  • A cloud provider CLI for Kubernetes cluster creation
  • A cloud provider CLI for remote storage (if different from above)

All other dependencies are handled by the pipeline itself.

Running the Pipeline

Local Execution

  1. Install the requirements listed above.
  2. Download the pipeline:
    git clone https://github.com/kundajelab/ENCODE_scatac
    
  3. Activate the snakemake conda environment:
    conda activate snakemake
    
  4. Configure the pipeline in the /config directory. Detailed information can be found here.
  5. Run the pipeline:
    snakemake -k --use-conda --cores $NCORES 
    
    Here, $NCORES is the number of cores to use.

Note: When run for the first time, the pipeline will take some time to install conda packages.
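As an illustration of step 5, the sketch below picks a core count and prints the command both as a dry run and as a full run. It is an assumption-laden example rather than the authors' workflow: the function name `snakemake_local_cmd` is hypothetical, `nproc` is assumed to be available (GNU coreutils), and `-n` is Snakemake's dry-run flag, which lists the jobs that would run without executing them.

```shell
# Use every available core unless NCORES is already set in the environment.
NCORES="${NCORES:-$(nproc)}"

# Hypothetical helper: prints the step-5 command, appending any extra
# arguments (for example -n for a dry run) unchanged.
snakemake_local_cmd() {
    echo snakemake -k --use-conda --cores "$NCORES" "$@"
}

# Preview the dry-run command first, then the real one.
snakemake_local_cmd -n
snakemake_local_cmd
```

The helper only prints the commands, so nothing is executed until a printed line is copied into the shell from the pipeline directory with the snakemake environment active.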

Cloud Execution with Kubernetes

  1. Install and configure the pipeline as specified above.
  2. Create a cloud cluster. Note that setup specifics may differ depending on the cloud provider. Example setup instructions are available for GCP and for Azure.
  3. Configure remote storage. Instructions for each provider can be found here. For this pipeline, only the environment variables and command line configuration are needed.
  4. Run the pipeline:
    snakemake -k --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX --jobs $NJOBS --envvars $VARS
    
    Here:
    • $REMOTE is the cloud storage provider, and should be one of {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp,iRODS,AzBlob,XRootD}
    • $PREFIX is the target bucket name or subfolder in storage
    • $NJOBS is the maximum number of jobs to be run in parallel
    • $VARS is a list of environment variables for accessing remote storage. The --envvars flag can be omitted if no variables are required.
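To make the placeholders concrete, here is a sketch that fills them in for an S3 bucket and prints the resulting command. Every value is an assumption for illustration: the bucket path `my-bucket/scatac-run` is hypothetical, the two AWS credential variable names are the ones commonly read by AWS client libraries and may not match your setup, and the function name `snakemake_k8s_cmd` is not part of the pipeline.

```shell
# Hypothetical settings; replace them with values for your own storage.
REMOTE="S3"
PREFIX="my-bucket/scatac-run"                   # hypothetical bucket/subfolder
NJOBS=8
VARS="AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY"  # assumed credential variables

# Hypothetical helper: validates $REMOTE against the provider list above
# and prints the step-4 command.
snakemake_k8s_cmd() {
    case "$REMOTE" in
        S3|GS|FTP|SFTP|S3Mocked|gfal|gridftp|iRODS|AzBlob|XRootD) ;;
        *) echo "unsupported remote provider: $REMOTE" >&2; return 1 ;;
    esac
    # $VARS is left unquoted on purpose so each name becomes its own argument;
    # the whole --envvars clause disappears when VARS is empty.
    echo snakemake -k --kubernetes --use-conda \
        --default-remote-provider "$REMOTE" \
        --default-remote-prefix "$PREFIX" \
        --jobs "$NJOBS" \
        ${VARS:+--envvars $VARS} "$@"
}

snakemake_k8s_cmd
```

Setting `VARS=""` drops the `--envvars` flag entirely, which matches the case where no variables are required.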

Additional Execution Modes

This pipeline has been tested locally and on the cloud via Kubernetes. However, Snakemake offers a number of additional execution modes.

Documentation on cluster execution

Documentation on cloud execution

Authors

Austin Wang
Primary developer

Surag Nair
Secondary developer and advisor

Ben Parks
Secondary developer and advisor

Laksshman Sundaram
Advisor

Caleb Lareau
Advisor

William Greenleaf
Supervisor

Anshul Kundaje
Supervisor

