HT-RNAseq - A pipeline for processing high-throughput RNA-seq data

Introduction

TODO: Add a description of the pipeline here.

Test data

As test data, we use a DRUG-seq dataset from the NCBI Sequence Read Archive.

To reduce the test runtime, part of the original data has been subsampled. We used seqtk for this with a seed of 1, e.g.:

seqtk sample -s1 orig/SRR14730302/VH02001614_S8_R1_001.fastq.gz 10000 > 10k/SRR14730302/VH02001614_S8_R1_001.fastq.gz
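For paired-end data, both mates need to be subsampled with the same seed so that the read pairs stay in sync. The sketch below wraps that in a small helper. Note that `subsample_pair` is a hypothetical function name, not part of this repository, and that the `<prefix>_R1_001.fastq.gz` / `<prefix>_R2_001.fastq.gz` naming is assumed from the file listing. Because seqtk writes uncompressed FASTQ to standard output, the sketch pipes the result through gzip so that the content matches the `.gz` extension.

```shell
#!/usr/bin/env bash
# Sketch: subsample both mates of a paired-end run with the same seed.
# subsample_pair is a hypothetical helper, not part of the pipeline.
subsample_pair() {
  local in_dir="$1"     # e.g. orig/SRR14730302
  local out_dir="$2"    # e.g. 10k/SRR14730302
  local prefix="$3"     # e.g. VH02001614_S8
  local n_reads="$4"    # e.g. 10000
  local seed="${5:-1}"  # same seed for R1 and R2 keeps the pairs aligned
  local mate name

  mkdir -p "$out_dir"
  for mate in R1 R2; do
    name="${prefix}_${mate}_001.fastq.gz"
    # seqtk emits plain FASTQ on stdout, so compress it explicitly.
    seqtk sample -s"$seed" "$in_dir/$name" "$n_reads" | gzip > "$out_dir/$name"
  done
}
```

A call such as `subsample_pair orig/SRR14730302 10k/SRR14730302 VH02001614_S8 10000` would then write both subsampled mates under 10k/SRR14730302.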

The data is available at gs://viash-hub-test-data/htrnaseq/v1/:

❯ gcstree -f viash-hub-test-data/htrnaseq/v1/
viash-hub-test-data
└── htrnaseq
    └── v1
        ├── [  48]  2-wells.fasta
        ├── [465.3K]  GSE176150_metadata.csv
        ├── 100k
        │   ├── SRR14730301
        │   │   ├── [8.5M]  VH02001612_S9_R1_001.fastq
        │   │   └── [14.9M]  VH02001612_S9_R2_001.fastq
        │   └── SRR14730302
        │       ├── [8.5M]  VH02001614_S8_R1_001.fastq.gz
        │       └── [14.9M]  VH02001614_S8_R2_001.fastq.gz
        ├── 10k
        │   ├── SRR14730301
        │   │   ├── [845.4K]  VH02001612_S9_R1_001.fastq
        │   │   └── [1.5M]  VH02001612_S9_R2_001.fastq
        │   └── SRR14730302
        │       ├── [845.3K]  VH02001614_S8_R1_001.fastq.gz
        │       └── [1.5M]  VH02001614_S8_R2_001.fastq.gz
        └── orig
            ├── [20.4G]  SRR14730301
            │   └── [20.4G]  SRR14730301
            ├── SRR14730301
            │   ├── [9.1G]  VH02001612_S9_R1_001.fastq.gz
            │   └── [22.0G]  VH02001612_S9_R2_001.fastq.gz
            ├── [16.9G]  SRR14730302
            │   └── [16.9G]  SRR14730302
            ├── SRR14730302
            │   ├── [7.6G]  VH02001614_S8_R1_001.fastq.gz
            │   └── [18.0G]  VH02001614_S8_R2_001.fastq.gz
            ├── [18.0G]  SRR14730303
            │   └── [18.0G]  SRR14730303
            ├── SRR14730303
            │   ├── [8.1G]  VH02001618_S7_R1_001.fastq.gz
            │   └── [19.2G]  VH02001618_S7_R2_001.fastq.gz
            ├── [16.5G]  SRR14730304
            │   └── [16.5G]  SRR14730304
            ├── SRR14730304
            │   ├── [7.5G]  VH02001700_S6_R1_001.fastq.gz
            │   └── [17.8G]  VH02001700_S6_R2_001.fastq.gz
            ├── [19.0G]  SRR14730305
            │   └── [19.0G]  SRR14730305
            ├── SRR14730305
            │   ├── [8.4G]  VH02001702_S5_R1_001.fastq.gz
            │   └── [20.6G]  VH02001702_S5_R2_001.fastq.gz
            ├── [14.6G]  SRR14730306
            │   └── [14.6G]  SRR14730306
            ├── SRR14730306
            │   ├── [6.6G]  VH02001704_S4_R1_001.fastq.gz
            │   └── [16.0G]  VH02001704_S4_R2_001.fastq.gz
            ├── [21.5G]  SRR14730307
            │   └── [21.5G]  SRR14730307
            ├── SRR14730307
            │   ├── [9.6G]  VH02001708_S3_R1_001.fastq.gz
            │   └── [23.2G]  VH02001708_S3_R2_001.fastq.gz
            ├── [20.7G]  SRR14730308
            │   └── [20.7G]  SRR14730308
            ├── SRR14730308
            │   ├── [9.3G]  VH02001710_S2_R1_001.fastq.gz
            │   └── [22.1G]  VH02001710_S2_R2_001.fastq.gz
            ├── [15.8G]  SRR14730309
            │   └── [15.8G]  SRR14730309
            └── SRR14730309
                ├── [7.2G]  VH02001712_S1_R1_001.fastq.gz
                └── [16.9G]  VH02001712_S1_R2_001.fastq.gz

18 directories, 37 files
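If a local copy of one of the subsampled sets is needed, it could be fetched along the following lines. This is only a sketch: it assumes that gsutil is installed and that your account can read the bucket, and `fetch_test_data` and the default `test_data` destination are hypothetical names chosen for illustration. The helper deliberately refuses to fetch `orig`, since the listing above shows those files are tens of gigabytes each.

```shell
#!/usr/bin/env bash
# Sketch: copy one of the subsampled test sets to a local directory.
# fetch_test_data is a hypothetical helper; gsutil access is assumed.
BUCKET_PREFIX="gs://viash-hub-test-data/htrnaseq/v1"

fetch_test_data() {
  local subset="$1"            # 10k or 100k
  local dest="${2:-test_data}" # local target directory (illustrative default)

  case "$subset" in
    10k|100k) ;;
    *) echo "usage: fetch_test_data 10k|100k [dest]" >&2; return 2 ;;
  esac

  mkdir -p "$dest"
  # -m parallelises the transfer, -r copies the directory recursively.
  gsutil -m cp -r "$BUCKET_PREFIX/$subset" "$dest/"
}
```

For example, `fetch_test_data 10k` would place the 10k directory under test_data.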

The orig directory contains the original FASTQ files. The 10k and 100k directories contain subsamples of 10,000 and 100,000 reads, respectively, for the runs that have been subsampled.
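One quick way to sanity-check a subsampled file is to count its reads. The sketch below assumes standard four-line FASTQ records and a gzip that, when given `-f` together with `-c`, passes non-gzip input through unchanged (GNU gzip documents this behaviour), so the same helper works for plain and compressed files. `count_reads` is a hypothetical name used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count reads in a FASTQ file, compressed or not.
# Assumes every record spans exactly four lines.
count_reads() {
  local fastq="$1"
  # gzip -cdf decompresses gzip input and copies plain text as-is.
  gzip -cdf "$fastq" | awk 'END { printf "%d\n", NR / 4 }'
}
```

Running `count_reads` on a file from the 10k directory and on its counterpart in the 100k directory lets you compare the reported counts against the intended subsample sizes.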

The 2-wells.fasta file contains the barcodes for 2 wells.

Test run

The pipeline can be run by creating a params.yaml file like this:

param_list:
  - input_r1: "gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R1_001.fastq"
    input_r2: "gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730301/VH02001612_S9_R2_001.fastq"
    genomeDir: "gs://viash-hub-test-data/htrnaseq/v1/genomeDir/gencode.v41.star.sparse"
    barcodesFasta: "gs://viash-hub-test-data/htrnaseq/v1/2-wells.fasta"
    id: sample_one
  - input_r1: "gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730302/VH02001614_S8_R1_001.fastq.gz"
    input_r2: "gs://viash-hub-test-data/htrnaseq/v1/100k/SRR14730302/VH02001614_S8_R2_001.fastq.gz"
    genomeDir: "gs://viash-hub-test-data/htrnaseq/v1/genomeDir/gencode.v41.star.sparse"
    barcodesFasta: "gs://viash-hub-test-data/htrnaseq/v1/2-wells.fasta"
    id: sample_two
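Since every input in params.yaml is a remote path, a typo only surfaces once the workflow starts staging files. A pre-flight check along the following lines can catch that earlier. It is a sketch under assumptions: gsutil is installed and authenticated, all remote values are double-quoted `gs://` URIs as in the example above, and `check_params_uris` is a hypothetical helper name. It uses `gsutil ls`, which exits with a non-zero status when a URI matches nothing and works for both objects and directory-like prefixes such as genomeDir.

```shell
#!/usr/bin/env bash
# Sketch: verify that every gs:// URI mentioned in a params file exists.
# check_params_uris is a hypothetical helper; gsutil access is assumed.
check_params_uris() {
  local params="$1" missing=0 uri

  # Pull out each distinct gs:// URI (up to the closing double quote).
  for uri in $(grep -o 'gs://[^"]*' "$params" | sort -u); do
    if gsutil ls "$uri" > /dev/null 2>&1; then
      echo "ok       $uri"
    else
      echo "missing  $uri"
      missing=1
    fi
  done

  # Non-zero return value signals that at least one URI was not found.
  return "$missing"
}
```

For example, `check_params_uris params.yaml && echo "all inputs found"` prints one status line per URI and the final message only if none are missing.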

and then:

viash ns build --setup cb
nextflow run . -main-script target/nextflow/workflows/htrnaseq/main.nf \
  -profile docker \
  -c target/nextflow/workflows/htrnaseq/nextflow.config \
  -params-file params.yaml \
  -resume \
  --publish_dir output

Alternatively, run src/workflows/htrnaseq/integration_test.sh.

Special Thanks

Developed in collaboration with Data Intuitive and Open Analytics.

