Documentation update
telatin committed Sep 22, 2024
1 parent e25f69f commit c8af22e
Showing 7 changed files with 431 additions and 4 deletions.
26 changes: 22 additions & 4 deletions README.md
@@ -4,11 +4,29 @@

A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.

## Other docs
## Tools

- [`gen` generate datasets](README_GEN.md)
- [`n50` calculate N50](README_N50.md)
- [benchmark](README_BENCHMARK.md)
- [`n50` calculate N50](docs/README_N50.md)

- [`n50_simreads`](docs/README_N50_SIMREADS.md) simulates reads of the desired lengths
- [`n50_binner`](docs/README_N50_BINNER.md) summarizes the read lengths in a FASTQ file, for use with `n50_generate`
- [`n50_generate`](docs/README_N50_GENERATE.md) uses the output of `n50_binner` to generate reads (via `n50_simreads`)

- [`gen`](docs/README_GEN.md), alternative generator
- [benchmark notes](docs/README_BENCHMARK.md)


## General requirements

- C compiler
- zlib, pthread libraries

## Compiling

```bash
make all
make test
```

## Author

Empty file added docs/README_BENCHMARK.md
Empty file.
103 changes: 103 additions & 0 deletions docs/README_GEN.md
@@ -0,0 +1,103 @@
# `gen`: DNA Sequence Generator

## Overview

The DNA Sequence Generator is a C program that creates random DNA sequences in FASTA or FASTQ format, driven by **parameters that control the number and length of the sequences**.
It generates multiple files containing sequences of varying lengths.

This tool is particularly useful for:

- Testing tools to gather statistics from DNA sequences
- Benchmarking bioinformatics pipelines

## Features

- Calculates and includes N50 statistics in the output filenames
- Supports both FASTA and FASTQ output formats
- Deterministic output based on a provided seed for reproducibility
- Output filename format: `N50_TOTSEQS_SUMLEN.{fasta|fastq}`, which makes it easy to test the N50 calculation

## Compilation

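To compile the program, use a C compiler such as GCC. The usage examples below call the binary `gen`, so either change the `-o` name accordingly or adapt the commands: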

```bash
gcc -o dna_generator dna_generator.c -lm
```

## Usage

```bash
./gen <min_seqs> <max_seqs> <min_len> <max_len> <tot_files> <format> <outdir>
```

### Parameters

- `<min_seqs>`: Minimum number of sequences per file
- `<max_seqs>`: Maximum number of sequences per file
- `<min_len>`: Minimum length of each sequence
- `<max_len>`: Maximum length of each sequence
- `<tot_files>`: Total number of files to generate
- `<format>`: Output format (either "fasta" or "fastq")
- `<outdir>`: Directory to store the output files
- `<seed>`: Seed for the random number generator (for reproducibility). Note that the usage line above does not list it, and the example below relies on a static seed.

### Example

```bash
./gen 10 100 1000 10000 5 fasta output_dir
```

This command will generate 5 FASTA files in the `output_dir` directory. Each file will contain between 10 and 100 sequences, with lengths ranging from 1000 to 10000 base pairs. The random number generator will be initialized with a static seed to ensure reproducibility.

## Output

The program generates files with names in the format:

```text
${N50}_${TOTSEQS}_${SUMLEN}.{fasta|fastq}
```

Where:

- `N50` is the N50 statistic of the sequences in the file
- `TOTSEQS` is the total number of sequences in the file
- `SUMLEN` is the sum of all sequence lengths in the file
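
The naming scheme can be illustrated with a minimal Python sketch. It is not the tool's C code: it assumes the common definition of N50 (the length at which sequences of that length or longer cover at least half of the total bases), and the helper names `n50` and `output_name` are hypothetical. How `gen` treats the exact boundary case is not stated here.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def output_name(lengths, fmt="fasta"):
    """Build a file name following the N50_TOTSEQS_SUMLEN pattern."""
    return f"{n50(lengths)}_{len(lengths)}_{sum(lengths)}.{fmt}"


# Five sequences, 1500 bases in total.
print(output_name([100, 200, 300, 400, 500]))
```

Because the expected N50, sequence count and total length are all encoded in the file name, a test script can compare them directly against the values reported by the tool under test.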

### FASTA Format

In FASTA format, each sequence is represented as:

```text
>seq1
ATCGATCGATCG...
>seq2
GCTAGCTAGCTA...
```

### FASTQ Format

In FASTQ format, each sequence is represented as:

```text
@seq1
ATCGATCGATCG...
+
IIIIIIIIIIII...
@seq2
GCTAGCTAGCTA...
+
IIIIIIIIIIII...
```

Note: The quality scores in FASTQ format are dummy values (all 'I') for simplicity.
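
As a rough illustration of the two record layouts, the following Python sketch builds a seeded random sequence and formats it either way. It is only a sketch: the function names `random_dna` and `format_record` are hypothetical, and Python's random generator differs from the one in the C program, so the same seed will not reproduce the sequences that `gen` writes.

```python
import random


def random_dna(length, rng):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def format_record(name, seq, fmt="fasta"):
    """Format one sequence as a FASTA or FASTQ record."""
    if fmt == "fasta":
        return f">{name}\n{seq}\n"
    if fmt == "fastq":
        # Dummy quality string: one 'I' per base, as described above.
        return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"
    raise ValueError(f"unknown format: {fmt}")


# A fixed seed makes the output reproducible between runs.
rng = random.Random(42)
print(format_record("seq1", random_dna(12, rng), "fastq"), end="")
```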

## License

This program is provided under the MIT License. See the source code for full license text.

## Author

Andrea Telatin, 2023
Quadram Institute Bioscience

102 changes: 102 additions & 0 deletions docs/README_N50.md
@@ -0,0 +1,102 @@
# `n50` - Calculate N50

> A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.

## Features

- Supports both FASTA and FASTQ formats
- Optimized for raw FASTQ files (Nanopore, PacBio)
- Handles gzipped input files

## Installation

### Prerequisites

- GCC compiler
- zlib library
- pthread library

### Compiling

To compile the program, use the following command:

```bash
make
```

or compile the binary directly:

```bash
gcc -o n50 src/n50.c -lz -lpthread -O3
```

## Usage

```bash
./n50 [options] [filename]...
```

If no filename is provided, the program reads from standard input.

### Options

- `--fasta` or `-a`: Force FASTA input format
- `--fastq` or `-q`: Force FASTQ input format
- `--header` or `-H`: Print header in the output
- `--n50` or `-n`: Output only the N50 value
- `--version` and `--help`: Display version and help information, respectively

### Examples

1. Process a FASTA file:

```bash
./n50 input.fasta
```

2. Process a gzipped FASTQ file:

```bash
./n50 input.fastq.gz
```

3. Process a file with header output:

```bash
./n50 --header input.fasta
```

4. Output only the N50 value:

```bash
./n50 --n50 input.fasta
```

5. Process input from stdin:

```bash
cat input.fasta | ./n50 --fasta
```

## Output

By default, the program outputs a tab-separated line with the following fields:

1. Format (FASTA or FASTQ)
2. Total sequence length
3. Total number of sequences
4. N50 value

When using the `--header` option, a header line is printed before the results.

When using the `--n50` option, only the N50 value is printed.
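
For downstream scripts, a default output line could be parsed along these lines. This is a minimal Python sketch under the assumption that the line has exactly the four tab-separated fields listed above; the `N50Result` type, the `parse_n50_line` helper and the sample line are made up for illustration. When `--header` is used, the first line would need to be skipped before parsing.

```python
from typing import NamedTuple


class N50Result(NamedTuple):
    fmt: str            # "FASTA" or "FASTQ"
    total_length: int   # sum of all sequence lengths
    total_seqs: int     # number of sequences
    n50: int            # N50 value


def parse_n50_line(line):
    """Parse one tab-separated result line into an N50Result."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    fmt, total_length, total_seqs, n50 = fields
    return N50Result(fmt, int(total_length), int(total_seqs), int(n50))


# Hypothetical sample line, not real program output.
result = parse_n50_line("FASTA\t1500\t5\t400\n")
print(result.n50)
```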

## Performance

The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads.

## Limitations

- The maximum number of threads is currently set to 8. This can be adjusted by modifying the `MAX_THREADS` constant in the source code.
- The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased.
100 changes: 100 additions & 0 deletions docs/README_N50_BINNER.md
@@ -0,0 +1,100 @@
# n50_binner

This program analyzes a FASTQ file and counts the number of reads falling into predefined length bins.

It can be used to generate a simplified summary of the read length distribution in a
FASTQ file, to be then used to
[generate a similar FASTX file](README_N50_GENERATE.md).

## What it does

The program:

1. Reads a (compressed) FASTQ file.
2. Counts the number of reads falling into each of 16 predefined length bins.
3. Outputs a CSV-formatted result showing the number of reads in each bin.

## How to compile

To compile the program, you need a C compiler (such as gcc) and the zlib library installed. Use the following command:

```bash
gcc -o n50_binner fastq_length_counter.c -lz
```

This will create an executable named `n50_binner`.

## How to use

Run the program from the command line, providing the path to a gzip-compressed FASTQ file as an argument:

```bash
./n50_binner path/to/your/file.fastq.gz
```

The program will process the file and output the results to stdout in CSV format.

## Examples

1. Running the program:

```bash
./n50_binner sample.fastq.gz
```

2. Saving the output to a file:

```bash
./n50_binner sample.fastq.gz > length_distribution.csv
```

3. Processing multiple files:

```bash
for file in *.fastq.gz; do
    echo "Processing $file"
    ./n50_binner "$file" > "${file%.fastq.gz}_length_distribution.csv"
done
```
```

## Output format

The output is in CSV format with two columns:
1. Bin: The upper limit of the length bin
2. Number of Reads: The count of reads falling into that bin

Example output:

```text
Bin,Number of Reads
10,0
100,5
1000,1000
2500,5000
...
```
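
A CSV in this shape can be loaded with a few lines of Python. The sketch below assumes the two column names shown in the example output; `read_bins` is a hypothetical helper, and the sample data is limited to the four rows shown above.

```python
import csv
import io


def read_bins(handle):
    """Read the binner CSV into a {bin upper limit: read count} dict."""
    reader = csv.DictReader(handle)
    return {int(row["Bin"]): int(row["Number of Reads"]) for row in reader}


# Only the rows visible in the example output; the real file has more bins.
sample = "Bin,Number of Reads\n10,0\n100,5\n1000,1000\n2500,5000\n"
bins = read_bins(io.StringIO(sample))
print(sum(bins.values()))  # total reads across the listed bins
```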

## Notes

- The program uses 16 predefined bins: 10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000, 75000, 100000, 200000, 300000, 500000, 750000, and 1000000.
- Reads longer than 1,000,000 bases will be counted in the last bin.
- The program assumes the input file is in the standard FASTQ format and is gzip-compressed.
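
The binning rule described in these notes can be sketched in Python as follows. This is not the program's C implementation. It assumes that each bin's upper limit is inclusive (a read of exactly 100 bases lands in the 100 bin), which is not stated above, and the names `bin_for` and `count_bins` are hypothetical.

```python
from bisect import bisect_left

# The 16 predefined upper limits listed above.
BIN_LIMITS = [10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000,
              75000, 100000, 200000, 300000, 500000, 750000, 1000000]


def bin_for(length):
    """Return the upper limit of the bin a read of this length falls into."""
    # bisect_left finds the first limit >= length (assumed inclusive limit);
    # reads longer than the largest limit are clamped to the last bin.
    index = min(bisect_left(BIN_LIMITS, length), len(BIN_LIMITS) - 1)
    return BIN_LIMITS[index]


def count_bins(lengths):
    """Count reads per bin, keeping empty bins so all 16 rows are present."""
    counts = {limit: 0 for limit in BIN_LIMITS}
    for length in lengths:
        counts[bin_for(length)] += 1
    return counts


counts = count_bins([8, 150, 999, 1200, 2_000_000])
print("Bin,Number of Reads")
for limit, n in counts.items():
    print(f"{limit},{n}")
```

Keeping empty bins in the result means every run yields the same 16 rows, which makes outputs from different files directly comparable.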

## Dependencies

- zlib library for reading gzip-compressed files

Remember to install zlib development files before compiling. On Ubuntu or Debian, you can do this with:

```bash
sudo apt-get install zlib1g-dev
```

## License

This program is provided under the MIT License. See the source code for full license text.

## Author

Andrea Telatin, 2023
Quadram Institute Bioscience