Skip to content

Commit

Permalink
Merge branch 'main' into devel
Browse files Browse the repository at this point in the history
  • Loading branch information
sbooeshaghi authored Oct 8, 2024
2 parents 164d9d2 + fcb2dea commit 96d109b
Show file tree
Hide file tree
Showing 2 changed files with 299 additions and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,4 +36,4 @@ Ali Sina Booeshaghi, Xi Chen, Lior Pachter, A machine-readable specification for
- [View example `seqspec` files: `https://www.sina.bio/seqspec-builder/assays.html`](https://www.sina.bio/seqspec-builder/assays.html)
- [Contribute a `seqspec` : `docs/CONTRIBUTING.md`](docs/CONTRIBUTING.md)
- [Watch a YouTube video about `seqspec`](https://youtu.be/NSj6Vpzy8tU)
- [Read the manuscript that describes `seqspec`](https://doi.org/10.1093/bioinformatics/btae168)
- [Read the manuscript that describes `seqspec`](https://doi.org/10.1093/bioinformatics/btae168)
298 changes: 298 additions & 0 deletions docs/TUTORIAL_FROM_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,298 @@
# Writing a seqspec file for a set of FASTQs
A seqspec file describes the library and read structure of a set of genomics data. Understanding both structures is crucial for analyzing sequencing reads. Even if the library construction process is unknown, a seqspec file can still be generated for a set of FASTQ reads.
## Example Dataset
We will generate a seqspec file for the following dataset:

```json
{
"observation_id": "GSM3587010",
"doi": "https://doi.org/10.1084/jem.20191130",
"species": "homo sapiens",
"organ": "colon",
"name": "age 55 years old",
"description": "epithelial cells",
"technology": "10xv2",
"links": [
{
"accession": "GSM3587010",
"filename": "SRR8513796_1.fastq.gz",
"filetype": null,
"filesize": null,
"md5": null,
"urltype": "ftp",
"url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_1.fastq.gz"
},
{
"accession": "GSM3587010",
"filename": "SRR8513796_2.fastq.gz",
"filetype": null,
"filesize": null,
"md5": null,
"urltype": "ftp",
"url": "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_2.fastq.gz"
}
]
}
```



The technology used is 10xv2. The read structure for 10xv2 libraries is:

**R1.fastq.gz** (SRR8513796_1.fastq.gz): 16bp cell barcode followed by a 10bp UMI.
**R2.fastq.gz** (SRR8513796_2.fastq.gz): cDNA, user-defined length.
**Cell barcode onlist**: [Cell barcode list](https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz).

The library was generated using the Illumina Hiseq X Ten platform with 150-bp paired-end reads.

## Download reads
First, download the FASTQ files and check the lengths of the reads:
```bash
# download the two FASTQ files
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_1.fastq.gz
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR851/006/SRR8513796/SRR8513796_2.fastq.gz

# view the first read in R1
$ zcat SRR8513796_1.fastq.gz | head -2
@SRR8513796.1 1/1
NGATTTCGTTGCGTTAATCATCGTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAGAACCACATTATTATTTTGATAGATATAGAAATAGTCAAACCTAATCTACAAAAGTGCAGTATCATGCGGGGNCTTCGCAGCGTAGGTGTT

# count the number of bases in the R1 read
$ echo -n "NGATTTCGTTGCGTTAATCATCGTCCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAGAACCACATTATTATTTTGATAGATATAGAAATAGTCAAACCTAATCTACAAAAGTGCAGTATCATGCGGGGNCTTCGCAGCGTAGGTGTT" | wc -c
150

# view the first read in R2
$ zcat SRR8513796_1.fastq.gz | head -2
@SRR8513796.1 1/2
NGAGTCTCCCTTCACCATTTCCGACGGCATCTATGGCTCAACATTTTTTGTAGCCACAGGCTTCCGCGGACTTCACGTCATTATTGGCTCAACTTTCGTCACTATCTGCTTTATCAGCCAACTAATATTTCACTTTACATACAAACATCA

# count the number of bases in the R2 read
$ echo -n "NGAGTCTCCCTTCACCATTTCCGACGGCATCTATGGCTCAACATTTTTTGTAGCCACAGGCTTCCGCGGACTTCACGTCATTATTGGCTCAACTTTCGTCACTATCTGCTTTATCAGCCAACTAATATTTCACTTTACATACAAACATCA" | wc -c
150
```

## Download template spec
Start with a template seqspec. The 10xv2 template can be found [here](https://github.com/pachterlab/seqspec/blob/main/examples/specs/template/10xv2-template.yml). Save this file as spec.yaml.

```yaml
!Assay
seqspec_version: 0.2.0
assay_id: 10xv2
name: 10xv2
doi: https://doi.org/10.1126/science.aam8999
date: 15 March 2018
description: 10x Genomics v2 single-cell rnaseq
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html
sequence_protocol: Not-specified
sequence_kit: Not-specified
library_protocol: 10xv2 RNA
library_kit: Not-specified
sequence_spec:
- !Read
read_id: R1.fastq.gz
name: Read 1
modality: rna
primer_id: r1_primer
min_len: 26
max_len: 26
strand: pos
- !Read
read_id: R2.fastq.gz
name: Read 2
modality: rna
primer_id: r2_primer
min_len: 150
max_len: 150
strand: neg
library_spec:
- !Region
parent_id: null
region_id: rna
region_type: null
name: null
sequence_type: null
sequence: AAAAAAAAAAAAAAAANNNNNNNNNNNNNNNNNNNNNNNNNNXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXAAAAAAAAAAAAAAAA
min_len: 59
max_len: 208
onlist: null
regions:
- !Region
parent_id: rna
region_id: r1_primer
region_type: r1_primer
name: r1_primer
sequence_type: fixed
sequence: AAAAAAAAAAAAAAAA
min_len: 16
max_len: 16
onlist: null
regions: null
- !Region
parent_id: rna
region_id: barcode
region_type: barcode
name: barcode
sequence_type: onlist
sequence: NNNNNNNNNNNNNNNN
min_len: 16
max_len: 16
onlist: !Onlist
location: remote
filename: https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz
md5: 72aa64fd865bcda142c47d0da8370168
regions: null
- !Region
parent_id: rna
region_id: umi
region_type: umi
name: umi
sequence_type: fixed
sequence: NNNNNNNNNN
min_len: 10
max_len: 10
onlist: null
regions: null
- !Region
parent_id: rna
region_id: cdna
region_type: cdna
name: cdna
sequence_type: fixed
sequence: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
min_len: 1
max_len: 150
onlist: null
regions: null
- !Region
parent_id: rna
region_id: r2_primer
region_type: r2_primer
name: r2_primer
sequence_type: fixed
sequence: AAAAAAAAAAAAAAAA
min_len: 16
max_len: 16
onlist: null
regions: null
```
## Modify `sequence_spec`
A seqspec file requires the user to specify the primer used to generate each of the sequencing reads. Since we may not know the exact primer used by the authors, we specify two generic primers (r1_primer generates R1.fastq.gz and r2_primer generates R2.fastq.gz). Then we fill in the relevant information for the `sequence_spec` using the reads and information we can obtain from the fastq files. Note that sequencing this library on a Hiseq generates R1 reads on the positive strand and R2 reads on the negative strand.

```yaml
sequence_spec:
- !Read
read_id: SRR8513796_1.fastq.gz
name: Read 1
modality: rna
primer_id: r1_primer
min_len: 150
max_len: 150
strand: pos
- !Read
read_id: SRR8513796_1.fastq.gz
name: Read 2
modality: rna
primer_id: r2_primer
min_len: 150
max_len: 150
strand: neg
```

## Modify `library_spec`
Ensure that the `library_spec` matches the 10xv2 structure:
- r1 primer
- barcode (with barcode onlist)
- umi
- cdna
- r2 primer
```yaml
- !Region
parent_id: rna
region_id: r1_primer
region_type: r1_primer
name: r1_primer
sequence_type: fixed
sequence: AAAAAAAAAAAAAAAA
min_len: 16
max_len: 16
onlist: null
regions: null
- !Region
parent_id: rna
region_id: barcode
region_type: barcode
name: barcode
sequence_type: onlist
sequence: NNNNNNNNNNNNNNNN
min_len: 16
max_len: 16
onlist: !Onlist
location: remote
filename: https://github.com/pachterlab/qcbc/raw/main/tests/10xRNAv2/737K-august-2016.txt.gz
md5: 72aa64fd865bcda142c47d0da8370168
regions: null
- !Region
parent_id: rna
region_id: umi
region_type: umi
name: umi
sequence_type: fixed
sequence: NNNNNNNNNN
min_len: 10
max_len: 10
onlist: null
regions: null
- !Region
parent_id: rna
region_id: cdna
region_type: cdna
name: cdna
sequence_type: fixed
sequence: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
min_len: 1
max_len: 150
onlist: null
regions: null
- !Region
parent_id: rna
region_id: r2_primer
region_type: r2_primer
name: r2_primer
sequence_type: fixed
sequence: AAAAAAAAAAAAAAAA
min_len: 16
max_len: 16
onlist: null
regions: null
```

## Modify the metadata
Lastly, we update the metadata for the seqspec file
```yaml
!Assay
seqspec_version: 0.2.0
assay_id: 10xv2
name: 10xv2
doi: https://doi.org/10.1084/jem.20191130
date: 21 November 2019
description: 10x Genomics v2 single-cell rnaseq
modalities:
- rna
lib_struct: https://teichlab.github.io/scg_lib_structs/methods_html/10xChromium3.html
sequence_protocol: Not-specified
sequence_kit: Not-specified
library_protocol: 10xv2 RNA
library_kit: Not-specified
```

## Format, check, and print the spec
Use the seqspec command line utility to format, check, and view the seqspec file:

```bash
$ seqspec format -o spec.yaml spec.yaml
$ seqspec check spec.yaml
$ seqspec print spec.yaml
```

0 comments on commit 96d109b

Please sign in to comment.