hubmapconsortium/sc-atac-seq-pipeline

Overview

The HuBMAP Consortium sc-atac-seq pipeline analyzes scATAC-seq data sets and is built around ArchR and chromVAR. Source code can be found at https://github.com/hubmapconsortium/sc-atac-seq-pipeline

The pipeline performs quantification using a specified aligner; HuBMAP has standardized on BWA with the GRCh38 reference genome. The pipeline produces a FastQC analysis of the input fastq files. ArchR divides the genome into non-overlapping bins of user-specified size (we use 500 bp) and produces a binary cell-by-bin matrix denoting whether reads in each cell were aligned to each bin.
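To make the binning step concrete, here is a minimal Python sketch of how aligned read positions could be collapsed into a binary cell-by-bin structure. It is not ArchR's implementation: the function name, the `(barcode, chromosome, position)` input layout, and the dictionary-of-sets representation are assumptions chosen for illustration. Only the bin size of 500 comes from the text above.

```python
from collections import defaultdict

BIN_SIZE = 500  # bin width mentioned in the text above

def binary_cell_by_bin(alignments, bin_size=BIN_SIZE):
    """Map aligned reads to the set of bins hit in each cell.

    Hypothetical input layout: an iterable of
    (cell_barcode, chromosome, 0-based position) tuples.
    A bin is identified by (chromosome, bin_index). Storing bins in a set
    makes the result binary: repeated hits in one bin count once.
    """
    matrix = defaultdict(set)
    for barcode, chrom, pos in alignments:
        matrix[barcode].add((chrom, pos // bin_size))
    return dict(matrix)

# Made-up example records, not pipeline output.
alignments = [
    ("AAAC", "chr1", 10),
    ("AAAC", "chr1", 499),   # falls in the same bin as position 10
    ("AAAC", "chr1", 500),   # first position of the next bin
    ("TTTG", "chr2", 1250),
]
cell_bins = binary_cell_by_bin(alignments)
```

Each cell maps to the set of bins with at least one aligned read, which is the information a binary cell-by-bin matrix encodes in sparse form.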

The ArchR secondary analysis pipeline filters bins based on TSS enrichment and fragment number, performs LSI dimensionality reduction, and selects peaks from all available bins. The chromVAR tool performs motif analysis, assigns motifs to transcription factors, and computes differential enrichment of transcription factors across cells in the data set.

Requirements

Running the pipeline requires a CWL workflow execution engine, and we recommend the cwltool reference implementation, which is written in Python. This can be installed in a sufficiently recent Python environment with pip install cwltool, after which the pipeline can be invoked as:

cwltool sc_atac_seq_prep_process_analyze.cwl sc_atac_seq_prep_process_analyze.json

To build the Docker images, run

build_docker_containers

from the sc-atac-seq directory. The build could take up to an hour.

Supplementary Data

The HuBMAP sc-atac-seq pipeline uses the Genome Reference Consortium human genome, build 38 (GRCh38). A BWA-generated set of index files is required for the reference genome. Using an alternate reference or index is not currently supported by the stock sc-atac-seq Docker container; to use one, build an alternate container by modifying the Dockerfile.

Inputs

Required

  • sequence_directory
    A directory for the pipeline to search for fastq or fastq.gz files. The pipeline only works on paired-end reads and expects, for historical reasons, the paired-end read files to be named <some_name>*_R1*.fastq and <some_name>*_R3*.fastq. If a file containing barcodes, <some_name>*_R2*.fastq, is found, the barcodes will be read and added to the read IDs in the paired-end fastq files.

  • input_reference_genome
    A fasta file of the GRCh38 reference genome
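The naming convention for sequence_directory can be checked before launching a run. The sketch below is one possible reading of that convention, not the pipeline's own matching logic: the regular expression, the function names, and the example file names are assumptions for illustration.

```python
import re
from pathlib import Path

# Assumed reading of the convention: "_R1", "_R2" or "_R3" appears somewhere
# before a ".fastq" or ".fastq.gz" extension.
READ_PATTERN = re.compile(
    r"^(?P<prefix>.*)_R(?P<read>[123])(?P<suffix>.*)\.fastq(\.gz)?$"
)

def group_fastq_files(filenames):
    """Group file names into samples keyed by the name with the read tag removed."""
    samples = {}
    for name in filenames:
        match = READ_PATTERN.match(Path(name).name)
        if match is None:
            continue  # not a recognized read file, skip it
        key = match.group("prefix") + match.group("suffix")
        samples.setdefault(key, {})["R" + match.group("read")] = name
    return samples

def check_sample(files):
    """Return True if a barcode file (R2) is present; raise if a mate is missing."""
    if "R1" not in files or "R3" not in files:
        raise ValueError("paired-end input needs both an R1 and an R3 file")
    return "R2" in files

# Hypothetical file names for illustration only.
names = [
    "s1_L001_R1_001.fastq.gz",
    "s1_L001_R2_001.fastq.gz",
    "s1_L001_R3_001.fastq.gz",
    "notes.txt",
]
samples = group_fastq_files(names)
has_barcodes = check_sample(samples["s1_L001_001"])
```

A check like this catches a missing R3 mate early, before the longer alignment steps start.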

Optional

  • reference_genome_index
    A .gz file containing the BWA-generated index of the GRCh38 reference genome, i.e. the ".bwt", ".sa", ".ann", ".pac", and ".amb" files generated by BWA indexing. If this file is provided, the pipeline does not have to generate the index, which saves some time.
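The inputs above are passed to cwltool through a job file such as sc_atac_seq_prep_process_analyze.json. The following sketch builds such a job object in Python. It assumes that sequence_directory is declared as a CWL Directory and that the two genome inputs are declared as CWL File; check the `inputs` section of the workflow for the actual types. The helper name and the paths are placeholders.

```python
import json

def make_job(sequence_directory, input_reference_genome, reference_genome_index=None):
    """Build a CWL job object for the inputs listed above.

    Assumption: the workflow declares sequence_directory as a Directory and
    the genome inputs as File. Verify against the workflow's inputs section.
    """
    job = {
        "sequence_directory": {"class": "Directory", "path": sequence_directory},
        "input_reference_genome": {"class": "File", "path": input_reference_genome},
    }
    # The index is optional, so it is only added when a path is given.
    if reference_genome_index is not None:
        job["reference_genome_index"] = {
            "class": "File",
            "path": reference_genome_index,
        }
    return job

# Placeholder paths for illustration only.
job = make_job("/data/fastq", "/data/GRCh38.fa")
job_text = json.dumps(job, indent=2)
```

Writing `job_text` to a .json file gives a job file that can be passed as the second argument to cwltool.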

Outputs

  • Bins.csv
    A CSV file providing sequence name and bin information

  • cellBarcodes.CSV
    A CSV file with barcode ID and barcode

  • cellByBin_summary.csv
    A CSV file with barcode ID and bin number

  • cellClusterAssignment.csv
    A CSV file with barcode ID and cluster number

  • GenesRanges.csv
    A CSV file providing sequence, gene name and gene location information

  • cellByGene.mtx
    A file with the cell by gene matrix in Matrix Market format

  • cellGenes.csv
    A CSV file with gene ID and gene name

  • peaksAllCells.csv
    A CSV file with sequence name and peak start and end
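For downstream use of cellByGene.mtx, the sketch below shows how a coordinate-format Matrix Market file is laid out by parsing a tiny in-memory example with the Python standard library. The demo content is made up and is not pipeline output, and whether rows correspond to cells or to genes in the real file is not stated here, so verify the orientation against the file's dimensions. In practice `scipy.io.mmread` is the usual way to load such a file into a sparse matrix.

```python
import io

def read_matrix_market(handle):
    """Parse a coordinate-format Matrix Market stream.

    Returns ((rows, cols), entries), where entries maps 0-based (row, col)
    to the stored value. Minimal sketch: handles only the general
    coordinate layout with no blank lines.
    """
    header = handle.readline()
    if not header.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("expected a coordinate-format Matrix Market file")
    is_pattern = "pattern" in header.split()  # pattern files store no values
    line = handle.readline()
    while line.startswith("%"):  # skip comment lines
        line = handle.readline()
    n_rows, n_cols, n_entries = (int(x) for x in line.split())
    entries = {}
    for _ in range(n_entries):
        fields = handle.readline().split()
        row, col = int(fields[0]) - 1, int(fields[1]) - 1  # file is 1-based
        entries[(row, col)] = 1.0 if is_pattern else float(fields[2])
    return (n_rows, n_cols), entries

# Tiny made-up example for illustration only.
demo = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "% tiny made-up example, not pipeline output\n"
    "2 3 2\n"
    "1 1 4\n"
    "2 3 1\n"
)
shape, entries = read_matrix_market(demo)
```

For a real file, replace the `io.StringIO` demo with `open("cellByGene.mtx")`.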