Add star solo component #62

Merged
merged 7 commits into from
Jul 29, 2024
230 changes: 230 additions & 0 deletions src/star/star_solo/argument_groups_solo.yaml
@@ -0,0 +1,230 @@
argument_groups:
- name: STARsolo (single cell RNA-seq) parameters
arguments:
- name: --soloType
type: string
description: |-
type of single-cell RNA-seq

- CB_UMI_Simple ... (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium.
- CB_UMI_Complex ... multiple Cell Barcodes of varying length, one UMI of fixed length and one adapter sequence of fixed length are allowed in read2 only (e.g. inDrop, ddSeq).
- CB_samTagOut ... output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate]
- SmartSeq ... Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases)
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
rcannood marked this conversation as resolved.
- name: --soloCBtype
type: string
description: |-
cell barcode type

Sequence: cell barcode is a sequence (standard option)
String: cell barcode is an arbitrary string
example: Sequence
- name: --soloCBwhitelist
type: string
Contributor

Should this not be type: file?

Contributor Author

Indeed, good catch. I'll adapt my script accordingly

description: |-
file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file.

- None ... no whitelist: all cell barcodes are allowed
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
rcannood marked this conversation as resolved.
- name: --soloCBstart
type: integer
Contributor

Should we add min and/or max for the arguments of type: integer where applicable?

Contributor Author

Could be done but I don't really want to start parsing this from the help text. I propose hoping that STAR has sufficiently good validation?
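
For reference, a minimal sketch of what a wrapper-side range check could look like if one were added later. The lower bounds in `ASSUMED_MINIMUMS` are hypothetical values picked for illustration, not taken from the STAR help text, and the helper name is invented here; STAR's own validation stays the source of truth.

```python
# Hypothetical lower bounds, chosen for illustration only.
# They are NOT parsed from the STAR help text.
ASSUMED_MINIMUMS = {
    "soloCBstart": 1,
    "soloCBlen": 1,
    "soloUMIstart": 1,
    "soloUMIlen": 1,
}


def check_minimums(par: dict) -> list:
    """Return one error message per argument that is below its assumed minimum.

    `par` maps argument names (without leading dashes) to values;
    arguments that are missing or None are skipped.
    """
    errors = []
    for name, minimum in ASSUMED_MINIMUMS.items():
        value = par.get(name)
        if value is not None and value < minimum:
            errors.append(f"--{name} must be >= {minimum}, got {value}")
    return errors
```

Returning messages instead of raising lets the caller decide whether to fail early or only log a warning and defer to STAR.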

description: cell barcode start base
example: 1
- name: --soloCBlen
type: integer
description: cell barcode length
example: 16
- name: --soloUMIstart
type: integer
description: UMI start base
example: 17
- name: --soloUMIlen
type: integer
description: UMI length
example: 10
- name: --soloBarcodeReadLength
type: integer
description: |-
length of the barcode read

- 1 ... equal to sum of soloCBlen+soloUMIlen
- 0 ... not defined, do not check
Contributor

Everywhere this syntax is used I think choices is warranted?

Contributor Author

@rcannood Jul 18, 2024

Possibly, but I wouldn't want to put my hand in a fire to claim that you can't pass an arbitrary integer to this integer argument.
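
A middle ground between the two positions could be a soft check that only reports values outside the documented list and never rejects them. The sketch below assumes a hypothetical helper name and uses only the values documented for `--soloBarcodeMate` in this diff (0, 1, 2); whether STAR accepts other integers is left open, as noted above.

```python
# Values taken from the --soloBarcodeMate description in this diff.
DOCUMENTED_CHOICES = {
    "soloBarcodeMate": {0, 1, 2},
}


def undocumented_values(par: dict) -> dict:
    """Return {name: value} for arguments set to a value outside the documented list.

    Nothing is rejected here; the caller can log a warning and still
    pass the value on to STAR.
    """
    return {
        name: par[name]
        for name, allowed in DOCUMENTED_CHOICES.items()
        if par.get(name) is not None and par[name] not in allowed
    }
```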

example: 1
- name: --soloBarcodeMate
type: integer
description: |-
identifies which read mate contains the barcode (CB+UMI) sequence

- 0 ... barcode sequence is on a separate read, which should always be the last file listed in --readFilesIn
- 1 ... barcode sequence is a part of mate 1
- 2 ... barcode sequence is a part of mate 2
example: 0
- name: --soloCBposition
type: string
description: |-
position of Cell Barcode(s) on the barcode read.

Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2.
Format for each barcode: startAnchor_startPosition_endAnchor_endPosition
start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end
start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base
Strings for different barcodes are separated by spaces.
Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloCBposition 0_0_2_-1 3_1_3_8
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloUMIposition
type: string
description: |-
position of the UMI on the barcode read, same as soloCBposition

Example: inDrop (Zilionis et al, Nat. Protocols, 2017):
--soloUMIposition 3_9_3_14
- name: --soloAdapterSequence
type: string
description: adapter sequence to anchor barcodes. Only one adapter sequence is
allowed.
- name: --soloAdapterMismatchesNmax
type: integer
description: maximum number of mismatches allowed in adapter sequence.
example: 1
- name: --soloCBmatchWLtype
type: string
description: |-
matching the Cell Barcodes to the WhiteList

- Exact ... only exact matches allowed
- 1MM ... only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match.
- 1MM_multi ... multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches.
Allowed CBs have to have at least one read with exact match. This option matches best with CellRanger 2.2.0
- 1MM_multi_pseudocounts ... same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes.
- 1MM_multi_Nbase_pseudocounts ... same as 1MM_multi_pseudocounts, multimatching to WL is allowed for CBs with N-bases. This option matches best with CellRanger >= 3.0.0
- EditDist_2 ... allow up to edit distance of 3 for each of the barcodes. May include one deletion + one insertion. Only works with --soloType CB_UMI_Complex. Matches to multiple passlist barcodes are not allowed. Similar to ParseBio Split-seq pipeline.
example: 1MM_multi
- name: --soloInputSAMattrBarcodeSeq
type: string
description: |-
when inputting reads from a SAM file (--readsFileType SAM SE/PE), these SAM attributes mark the barcode sequence (in proper order).

For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeSeq CR UR .
This parameter is required when running STARsolo with input from SAM.
multiple: yes
Contributor

Suggested change
- multiple: yes
+ multiple: true

multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloInputSAMattrBarcodeQual
type: string
description: |-
when inputting reads from a SAM file (--readsFileType SAM SE/PE), these SAM attributes mark the barcode qualities (in proper order).

For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeQual CY UY .
If this parameter is '-' (default), the quality 'H' will be assigned to all bases.
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloStrand
type: string
description: |-
strandedness of the solo libraries:

- Unstranded ... no strand information
- Forward ... read strand same as the original RNA molecule
- Reverse ... read strand opposite to the original RNA molecule
example: Forward
- name: --soloFeatures
type: string
description: |-
genomic features for which the UMI counts per Cell Barcode are collected

- Gene ... genes: reads match the gene transcript
- SJ ... splice junctions: reported in SJ.out.tab
- GeneFull ... full gene (pre-mRNA): count all reads overlapping genes' exons and introns
- GeneFull_ExonOverIntron ... full gene (pre-mRNA): count all reads overlapping genes' exons and introns: prioritize 100% overlap with exons
- GeneFull_Ex50pAS ... full gene (pre-mRNA): count all reads overlapping genes' exons and introns: prioritize >50% overlap with exons. Do not count reads with 100% exonic overlap in the antisense direction.
example: Gene
multiple: yes
rcannood marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloMultiMappers
type: string
description: |-
counting method for reads mapping to multiple genes

- Unique ... count only reads that map to unique genes
- Uniform ... uniformly distribute multi-genic UMIs to all genes
- Rescue ... distribute UMIs proportionally to unique+uniform counts (~ first iteration of EM)
- PropUnique ... distribute UMIs proportionally to unique mappers, if present, and uniformly if not.
- EM ... multi-gene UMIs are distributed using Expectation Maximization algorithm
example: Unique
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloUMIdedup
type: string
description: |-
type of UMI deduplication (collapsing) algorithm

- 1MM_All ... all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once).
- 1MM_Directional_UMItools ... follows the "directional" method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017).
- 1MM_Directional ... same as 1MM_Directional_UMItools, but with more stringent criteria for duplicate UMIs
- Exact ... only exactly matching UMIs are collapsed.
- NoDedup ... no deduplication of UMIs, count all reads.
- 1MM_CR ... CellRanger2-4 algorithm for 1MM UMI collapsing.
example: 1MM_All
multiple: yes
Contributor

Suggested change
- multiple: yes
+ multiple: true

multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloUMIfiltering
type: string
description: |-
type of UMI filtering (for reads uniquely mapping to genes)

- - ... basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0).
- MultiGeneUMI ... basic + remove lower-count UMIs that map to more than one gene.
- MultiGeneUMI_All ... basic + remove all UMIs that map to more than one gene.
- MultiGeneUMI_CR ... basic + remove lower-count UMIs that map to more than one gene, matching CellRanger > 3.0.0 .
Only works with --soloUMIdedup 1MM_CR
multiple: yes
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloOutFileNames
type: string
description: |-
file names for STARsolo output:

file_name_prefix gene_names barcode_sequences cell_feature_count_matrix
example:
- Solo.out/
- features.tsv
- barcodes.tsv
- matrix.mtx
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloCellFilter
type: string
description: |-
cell filtering type and parameters

- None ... do not output filtered cells
- TopCells ... only report top cells by UMI count, followed by the exact number of cells
- CellRanger2.2 ... simple filtering of CellRanger 2.2.
Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count
The hardcoded values are from CellRanger: nExpectedCells=3000; maxPercentile=0.99; maxMinRatio=10
- EmptyDrops_CR ... EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
Can be followed by 10 numeric parameters: nExpectedCells maxPercentile maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR simN
The hardcoded values are from CellRanger: 3000 0.99 10 45000 90000 500 0.01 20000 0.01 10000
example:
- CellRanger2.2
- '3000'
- '0.99'
- '10'
multiple: yes
DriesSchaumont marked this conversation as resolved.
multiple_sep: ;
DriesSchaumont marked this conversation as resolved.
- name: --soloOutFormatFeaturesGeneField3
type: string
description: field 3 in the Gene features.tsv file. If "-", then no 3rd field
is output.
example: Gene Expression
multiple: yes
multiple_sep: ;
- name: --soloCellReadStats
type: string
description: |-
Output reads statistics for each CB

- Standard ... standard output
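
The `startAnchor_startPosition_endAnchor_endPosition` format described for `--soloCBposition` and `--soloUMIposition` in the file above can be illustrated with a small parser. This is a sketch for reading the format only, not code from this PR or from STAR; the class and function names are invented here, and the anchor codes are the four listed in the description.

```python
from typing import NamedTuple

# Anchor codes as listed in the --soloCBposition description.
ANCHORS = {0: "read start", 1: "read end", 2: "adapter start", 3: "adapter end"}


class BarcodePosition(NamedTuple):
    start_anchor: int
    start_position: int
    end_anchor: int
    end_position: int


def parse_barcode_position(spec: str) -> BarcodePosition:
    """Parse one position string such as '0_0_2_-1'."""
    fields = spec.split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 underscore-separated fields, got {spec!r}")
    start_anchor, start_position, end_anchor, end_position = (int(f) for f in fields)
    for anchor in (start_anchor, end_anchor):
        if anchor not in ANCHORS:
            raise ValueError(f"unknown anchor {anchor} in {spec!r}")
    return BarcodePosition(start_anchor, start_position, end_anchor, end_position)
```

For the inDrop example in the description, `0_0_2_-1` reads as "from base 0 after the read start up to one base before the adapter start", and `3_1_3_8` as "from base 1 to base 8 after the adapter end".
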
115 changes: 115 additions & 0 deletions src/star/star_solo/config.vsh.yaml
@@ -0,0 +1,115 @@
name: star_solo
namespace: star
description: |
Aligns reads to a reference genome using STAR.
rcannood marked this conversation as resolved.
keywords: [align, fasta, genome]
links:
repository: https://github.com/alexdobin/STAR
documentation: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
references:
doi: 10.1093/bioinformatics/bts635
license: MIT
requirements:
commands: [ STAR, python, ps, zcat, bzcat ]
# manually taking care of the main input and output arguments
argument_groups:
- name: Inputs
arguments:
- type: file
name: --input
alternatives: --readFilesIn
required: true
description: The single-end or paired-end R1 FastQ files to be processed.
example: [ mysample_S1_L001_R1_001.fastq.gz ]
multiple: true
- type: file
name: --input_r2
required: false
description: The paired-end R2 FastQ files to be processed. Only required if --input is a paired-end R1 file.
example: [ mysample_S1_L001_R2_001.fastq.gz ]
multiple: true
- name: Outputs
arguments:
- type: file
name: --aligned_reads
required: true
description: The output file containing the aligned reads.
direction: output
example: aligned_reads.bam
- type: file
name: --reads_per_gene
required: false
description: The output file containing the number of reads per gene.
direction: output
example: reads_per_gene.tsv
- type: file
name: --unmapped
required: false
description: The output file containing the unmapped reads.
direction: output
example: unmapped.fastq
- type: file
name: --unmapped_r2
required: false
description: The output file containing the unmapped R2 reads.
direction: output
example: unmapped_r2.fastq
- type: file
name: --chimeric_junctions
required: false
description: The output file containing the chimeric junctions.
direction: output
example: chimeric_junctions.tsv
- type: file
name: --log
required: false
description: The output file containing the log of the alignment process.
direction: output
example: log.txt
- type: file
name: --splice_junctions
required: false
description: The output file containing the splice junctions.
direction: output
example: splice_junctions.tsv
# other arguments are defined in a separate file
__merge__: [../star_align_reads/argument_groups.yaml, argument_groups_solo.yaml]
resources:
- type: python_script
path: script.py
test_resources:
- type: bash_script
path: test.sh
engines:
- type: docker
image: python:3.12-slim
setup:
- type: apt
packages:
- procps
- gzip
- bzip2
# setup derived from https://github.com/alexdobin/STAR/blob/master/extras/docker/Dockerfile
- type: docker
env:
- STAR_VERSION 2.7.11b
- PACKAGES gcc g++ make wget zlib1g-dev unzip xxd
run: |
apt-get update && \
apt-get install -y --no-install-recommends ${PACKAGES} && \
cd /tmp && \
wget --no-check-certificate https://github.com/alexdobin/STAR/archive/refs/tags/${STAR_VERSION}.zip && \
unzip ${STAR_VERSION}.zip && \
cd STAR-${STAR_VERSION}/source && \
make STARstatic CXXFLAGS_SIMD=-std=c++11 && \
cp STAR /usr/local/bin && \
cd / && \
rm -rf /tmp/STAR-${STAR_VERSION} /tmp/${STAR_VERSION}.zip && \
apt-get --purge autoremove -y ${PACKAGES} && \
apt-get clean
- type: docker
run: |
STAR --version | sed 's#\(.*\)#star: "\1"#' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
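
`script.py` is not part of this commit view, so the following is only a sketch of how a Python wrapper might turn the arguments defined above into a STAR command line. It assumes the Viash convention of a `par` dictionary keyed by argument name, with `multiple: true` arguments arriving as lists; both function names are hypothetical.

```python
from typing import Optional


def build_star_args(par: dict) -> list:
    """Flatten a dict of argument values into STAR command-line tokens.

    Unset (None) arguments are skipped; list values are passed as
    space-separated tokens after the flag, as STAR expects.
    """
    args = []
    for name, value in par.items():
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        args.append(f"--{name}")
        args.extend(str(v) for v in values)
    return args


def read_files_in(input_r1: list, input_r2: Optional[list] = None) -> list:
    """Build --readFilesIn: files of one mate joined by commas, mates as separate tokens."""
    mates = [",".join(input_r1)]
    if input_r2:
        mates.append(",".join(input_r2))
    return ["--readFilesIn", *mates]
```

The resulting token list could be handed to `subprocess.run(["STAR", *tokens], check=True)`, which avoids shell quoting issues for values such as `Gene Expression`.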