bio-steams

Streaming parsers for genomics data

This is an avant-garde runtime-agnostic API that relies on type inference to configure streams of genomic data.

Features

Parsers for popular formats

fastq
fasta
sam/bam
gfa
paf

Zero-copy datastructures

Shared SeqRecord type for Fastq, Fasta, and streams of records that carry sequences:

pub struct SeqRecord<S: for<'a> TryFrom<&'a [u8]> = &'a [u8]> { ... }

Records can be read into custom types like bit-packed sequences or SIMD-backed kmers:

use bio_seq::prelude::*;

todo!()

Lazy parsing configured through generic types

Lazy parsing for record members (CIGAR strings/Phred scores/Sequences)

Webassembly friendly async

This crate implements nightly std::futures traits.

Combinators for streams of data

zipping streams with buffers
mapping quality filter to streams
create a stream of mapped sequences

Examples

Remove non M. TB reads from streaming fastqs, amplicon based SARS-CoV-2 assembly

Stream a pair of fastqs and check some conditions on their name fields

// Open a pair of gzipped fastq files as streams of `Record`s with `Seq<Dna>` sequences

let fq1: Fastq<BufReader<MultiGzDecoder<File>>> = Fastq::new(BufReader::new(
    MultiGzDecoder::new(File::open(&file1).unwrap()),
));

let fq2: Fastq<BufReader<MultiGzDecoder<File>>> = Fastq::new(BufReader::new(
    MultiGzDecoder::new(File::open(&file2).unwrap()),
));

for zipped in fq1.zip(fq2) {
    match zipped {
        (Ok(r1), Ok(r2)) => {
            // check that the last characters of the name strings are 1 and 2
            if r1.fields[r1.fields.len() - 1] != b'1' || r2.fields[r2.fields.len() - 1] != b'2'
            {
                eprintln!("paired records do not end in 1/2");
            }

            // check that the description fields are equal up to the last character
            if r1.fields[..r1.fields.len() - 1] != r2.fields[..r2.fields.len() - 1] {
                eprintln!("reads do not have the same names");
            }
        }
        _ => {
            eprintln!("Parse error in fastq files");
        }
    }
}

To run the fqcheck example program with read files r1.fq.gz and f2.fq.gz:

$ cargo build --example fqcheck --release
$ target/release/examples/fqcheck r1.fq.gz r2.fq.gz

Count amino acid k-mers

// this opens a gzipped data stream and parses it into `Records` with `Seq<Amino>` sequence fields
let faa: Fasta<BufReader<File>, Seq<Amino>> =
    Fasta::new(BufReader::new(File::open(&faa_file).unwrap()));

// we can convert amino acid k-mers directly into usizes and use them to index into a table
let mut histogram = Box::new([0u64; 1 << (K * Amino::BITS as usize)]);

for contig in faa {
    // here "contig" is a fasta record
    for kmer in contig.unwrap().seq.kmers::<K>() {
        histogram[usize::from(kmer)] += 1;
    }
}

To run the aminokmers example program with fasta file proteins.faa:

$ cargo build --example fqcheck --release
$ target/release/examples/aminokmers proteins.faa

TODO

SIMD parallel scanning for record boundaries
Branch prediction hints for record boundary checks
Checking multiple delimiters in parallel
Prefetch hints for sequential access patterns
Buffer preallocation based on typical FASTQ record sizes
Reusing buffer space by sliding windows rather than clearing
SIMD-optimized scanning for record delimiters (@, +, newlines)
Memory alignment of the buffer for SIMD operations
Custom allocator for the buffer optimized for append/clear pattern## Examples

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

bio-steams

Streaming parsers for genomics data

Features

Parsers for popular formats

Zero-copy datastructures

Lazy parsing configured through generic types

Webassembly friendly async

Combinators for streams of data

Examples

Stream a pair of fastqs and check some conditions on their name fields

Count amino acid k-mers

TODO

Files

README.md

Latest commit

History

README.md

File metadata and controls

bio-steams

Streaming parsers for genomics data

Features

Parsers for popular formats

Zero-copy datastructures

Lazy parsing configured through generic types

Webassembly friendly async

Combinators for streams of data

Examples

Stream a pair of fastqs and check some conditions on their name fields

Count amino acid k-mers

TODO