Documentation update

quadram-institute-bioscience · Sep 22, 2024 · c8af22e
1 parent e25f69f

Showing 7 changed files with 431 additions and 4 deletions.
diff --git a/README.md b/README.md
@@ -4,11 +4,29 @@
 
 A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.
 
-## Other docs
+## Tools
 
-- [`gen` generate datasets](README_GEN.md)
-- [`n50` calculate N50](README_N50.md)
-- [benchmark](README_BENCHMARK.md)
+- [`n50` calculate N50](docs/README_N50.md)
+
+- [`n50_simreads`](docs/README_N50_SIMREADS.md), to simulate reads of the desired lengths
+- [`n50_binner`](docs/README_N50_BINNER.md), to summarize the read lengths in a FASTQ file for use with `n50_generate`
+- [`n50_generate`](docs/README_N50_GENERATE.md), to generate reads (using `n50_simreads`) from the output of `n50_binner`
+
+- [`gen`](docs/README_GEN.md), alternative generator
+- [benchmark notes](docs/README_BENCHMARK.md)
+
+
+## General requirements
+
+- C compiler
+- zlib, pthread libraries
+
+## Compiling
+
+```bash
+make all
+make test
+```
 
 ## Author
 

diff --git a/docs/README_BENCHMARK.md b/docs/README_BENCHMARK.md
diff --git a/docs/README_GEN.md b/docs/README_GEN.md
@@ -0,0 +1,103 @@
+# `gen`: DNA Sequence Generator
+
+## Overview
+
+The DNA Sequence Generator is a C program that creates random DNA sequences in FASTA or FASTQ format, with **parameters that control their lengths**.
+It generates multiple files containing sequences of varying lengths.
+
+This tool is particularly useful for:
+
+- Testing tools to gather statistics from DNA sequences
+- Benchmarking bioinformatics pipelines
+
+## Features
+
+- Calculates and includes N50 statistics in the output filenames
+- Supports both FASTA and FASTQ output formats
+- Deterministic output based on a provided seed for reproducibility
+- Output filename format: `N50_TOTSEQS_SUMLEN.{fasta|fastq}`, which makes it easy to test N50 calculations against known values
+
+## Compilation
+
+To compile the program, use a C compiler such as GCC:
+
+```bash
+gcc -o gen dna_generator.c -lm
+```
+
+## Usage
+
+```bash
+./gen <min_seqs> <max_seqs> <min_len> <max_len> <tot_files> <format> <outdir> 
+```
+
+### Parameters
+
+- `<min_seqs>`: Minimum number of sequences per file
+- `<max_seqs>`: Maximum number of sequences per file
+- `<min_len>`: Minimum length of each sequence
+- `<max_len>`: Maximum length of each sequence
+- `<tot_files>`: Total number of files to generate
+- `<format>`: Output format (either "fasta" or "fastq")
+- `<outdir>`: Directory to store the output files
+- `<seed>`: Seed for the random number generator (for reproducibility)
+
+### Example
+
+```bash
+./gen 10 100 1000 10000 5 fasta output_dir
+```
+
+This command will generate 5 FASTA files in the `output_dir` directory. Each file will contain between 10 and 100 sequences, with lengths ranging from 1000 to 10000 base pairs. The random number generator will be initialized with a static seed to ensure reproducibility.
+
+## Output
+
+The program generates files with names in the format:
+
+```text
+${N50}_${TOTSEQS}_${SUMLEN}.{fasta|fastq}
+```
+
+Where:
+
+- `N50` is the N50 statistic of the sequences in the file
+- `TOTSEQS` is the total number of sequences in the file
+- `SUMLEN` is the sum of all sequence lengths in the file
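As an illustration of how these filenames can drive a test, the sketch below parses the three values back out of a name. It is not part of `gen`: the `expected_stats` struct and the `parse_gen_filename` function are hypothetical names introduced here, and the code assumes the three fields are plain decimal integers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Expected statistics encoded in a filename (hypothetical helper type). */
typedef struct {
    unsigned long long n50;      /* N50 of the sequences in the file */
    unsigned long long totseqs;  /* total number of sequences */
    unsigned long long sumlen;   /* sum of all sequence lengths */
} expected_stats;

/* Parse "N50_TOTSEQS_SUMLEN.fasta" or ".fastq"; a leading directory is allowed.
 * Returns 1 on success, 0 if the name does not match the pattern. */
int parse_gen_filename(const char *path, expected_stats *out)
{
    assert(path != NULL && out != NULL);

    /* Skip any directory component. */
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;

    char ext[8];
    int consumed = 0;
    int fields = sscanf(base, "%llu_%llu_%llu.%7[a-z]%n",
                        &out->n50, &out->totseqs, &out->sumlen, ext, &consumed);

    /* All four fields must match and nothing may follow the extension. */
    if (fields != 4 || base[consumed] != '\0')
        return 0;

    return strcmp(ext, "fasta") == 0 || strcmp(ext, "fastq") == 0;
}
```

A test harness could then compare the parsed `n50`, `totseqs` and `sumlen` with the values reported by the tool under test.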
+
+### FASTA Format
+
+In FASTA format, each sequence is represented as:
+
+```text
+>seq1
+ATCGATCGATCG...
+>seq2
+GCTAGCTAGCTA...
+```
+
+### FASTQ Format
+
+In FASTQ format, each sequence is represented as:
+
+```text
+@seq1
+ATCGATCGATCG...
++
+IIIIIIIIIIII...
+@seq2
+GCTAGCTAGCTA...
++
+IIIIIIIIIIII...
+```
+
+Note: The quality scores in FASTQ format are dummy values (all 'I') for simplicity.
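In the common Phred+33 encoding, `I` (ASCII 73) corresponds to a quality score of 40. The sketch below shows one way a record in this layout could be written. It is not taken from the `gen` source, and `write_fastq_record` is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write one FASTQ record whose quality line is a run of 'I' characters,
 * one per base. Returns 0 on success, -1 on a write error. */
int write_fastq_record(FILE *out, const char *name, const char *seq)
{
    assert(out != NULL && name != NULL && seq != NULL);

    /* Header, sequence and separator lines. */
    if (fprintf(out, "@%s\n%s\n+\n", name, seq) < 0)
        return -1;

    /* Dummy quality line: same length as the sequence. */
    size_t len = strlen(seq);
    for (size_t i = 0; i < len; i++) {
        if (fputc('I', out) == EOF)
            return -1;
    }
    return fputc('\n', out) == EOF ? -1 : 0;
}
```

Calling `write_fastq_record(stdout, "seq1", "ATCG")` would print a four-line record in the layout shown above.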
+
+## License
+
+This program is provided under the MIT License. See the source code for full license text.
+
+## Author
+
+Andrea Telatin, 2023
+Quadram Institute Bioscience
+
diff --git a/docs/README_N50.md b/docs/README_N50.md
@@ -0,0 +1,102 @@
+# `n50` - Calculate N50
+
+> A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.
+
+## Features
+
+
+- Supports both FASTA and FASTQ formats
+- Optimized for raw FASTQ files (Nanopore, PacBio)
+- Handles gzipped input files
+
+## Installation
+
+### Prerequisites
+
+- GCC compiler
+- zlib library
+- pthread library
+
+### Compiling
+
+To compile the program, use the following command:
+
+```bash
+make
+```
+
+or compile the binary directly:
+
+```bash
+gcc -o n50 src/n50.c -lz -lpthread -O3
+```
+
+## Usage
+
+```bash
+./n50 [options] [filename]...
+```
+
+If no filename is provided, the program reads from standard input.
+
+### Options
+
+- `--fasta` or `-a`: Force FASTA input format
+- `--fastq` or `-q`: Force FASTQ input format
+- `--header` or `-H`: Print header in the output
+- `--n50` or `-n`: Output only the N50 value
+- `--version` and `--help`: Display version and help information, respectively
+
+### Examples
+
+1. Process a FASTA file:
+
+```bash
+./n50 input.fasta
+```
+
+2. Process a gzipped FASTQ file:
+
+```bash
+./n50 input.fastq.gz
+```
+
+3. Process a file with header output:
+
+```bash
+./n50 --header input.fasta
+```
+
+4. Output only the N50 value:
+
+```bash
+./n50 --n50 input.fasta
+```
+
+5. Process input from stdin:
+
+```bash
+cat input.fasta | ./n50 --fasta
+```
+
+## Output
+
+By default, the program outputs a tab-separated line with the following fields:
+
+1. Format (FASTA or FASTQ)
+2. Total sequence length
+3. Total number of sequences
+4. N50 value
+
+When using the `--header` option, a header line is printed before the results.
+
+When using the `--n50` option, only the N50 value is printed.
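For readers unfamiliar with the statistic, the sketch below computes N50 using its usual definition: the length L such that sequences of length L or longer contain at least half of all bases. It is a minimal illustration, not the code in `src/n50.c`, and `compute_n50` and `cmp_desc` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator: order lengths from longest to shortest. */
static int cmp_desc(const void *a, const void *b)
{
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x < y) - (x > y);
}

/* Return the N50 of `count` sequence lengths.
 * Sorts the array in place; returns 0 for an empty input. */
unsigned long long compute_n50(unsigned long long *lengths, size_t count)
{
    assert(lengths != NULL || count == 0);
    if (count == 0)
        return 0;

    unsigned long long total = 0;
    for (size_t i = 0; i < count; i++)
        total += lengths[i];

    qsort(lengths, count, sizeof lengths[0], cmp_desc);

    /* Walk from the longest sequence until half of the bases are covered. */
    unsigned long long running = 0;
    for (size_t i = 0; i < count; i++) {
        running += lengths[i];
        if (2 * running >= total)  /* doubling avoids rounding total / 2 */
            return lengths[i];
    }
    return lengths[count - 1];     /* not reached: the last step always covers the total */
}
```

For the lengths 2, 3, 4, 5, 6, 7, 8, 9 and 10 (54 bases in total), the three longest sequences sum to 27, so the N50 is 8.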
+
+## Performance
+
+The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads.
+
+## Limitations
+
+- The maximum number of threads is currently set to 8. This can be adjusted by modifying the `MAX_THREADS` constant in the source code.
+- The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased.
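One common way to lift a fixed capacity limit is to grow the array on demand. The sketch below shows that general technique. It is not necessarily how `n50` stores lengths. The `length_vec` type, the `length_vec_push` function and the `INITIAL_CAPACITY` macro are hypothetical names, with the starting capacity taken from the figure quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define INITIAL_CAPACITY 1000000  /* starting size quoted in the text above */

/* Growable array of sequence lengths (hypothetical helper type). */
typedef struct {
    unsigned long long *data;
    size_t size;      /* number of lengths stored */
    size_t capacity;  /* number of lengths that fit before growing */
} length_vec;

/* Append one length, doubling the capacity when the array is full.
 * Returns 0 on success, -1 if memory could not be allocated. */
int length_vec_push(length_vec *v, unsigned long long len)
{
    assert(v != NULL);

    if (v->size == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : INITIAL_CAPACITY;
        unsigned long long *tmp = realloc(v->data, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;            /* the old buffer is still valid */
        v->data = tmp;
        v->capacity = new_cap;
    }
    v->data[v->size++] = len;
    return 0;
}
```

A zero-initialized `length_vec` is a valid empty vector, and the caller releases the storage with `free(v.data)`.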
diff --git a/docs/README_N50_BINNER.md b/docs/README_N50_BINNER.md
@@ -0,0 +1,100 @@
+# n50_binner
+
+This program analyzes a FASTQ file and counts the number of reads falling into predefined length bins. 
+
+It can be used to generate a simplified summary of the read length distribution in a
+FASTQ file, which can then be used to
+[generate a similar FASTX file](README_N50_GENERATE.md).
+
+## What it does
+
+The program:
+
+1. Reads a (compressed) FASTQ file.
+2. Counts the number of reads falling into each of 16 predefined length bins.
+3. Outputs a CSV-formatted result showing the number of reads in each bin.
+
+## How to compile
+
+To compile the program, you need a C compiler (such as gcc) and the zlib library installed. Use the following command:
+
+```bash
+gcc -o n50_binner fastq_length_counter.c -lz
+```
+
+This will create an executable named `n50_binner`.
+
+## How to use
+
+Run the program from the command line, providing the path to a gzip-compressed FASTQ file as an argument:
+
+```bash
+./n50_binner path/to/your/file.fastq.gz
+```
+
+The program will process the file and output the results to stdout in CSV format.
+
+## Examples
+
+1. Running the program:
+
+```bash
+./n50_binner sample.fastq.gz
+```
+
+2. Saving the output to a file:
+
+```bash
+./n50_binner sample.fastq.gz > length_distribution.csv
+```
+
+3. Processing multiple files:
+
+```bash
+for file in *.fastq.gz; do
+    echo "Processing $file"
+    ./n50_binner "$file" > "${file%.fastq.gz}_length_distribution.csv"
+done
+```
+
+## Output format
+
+The output is in CSV format with two columns:
+1. Bin: The upper limit of the length bin
+2. Number of Reads: The count of reads falling into that bin
+
+Example output:
+
+```text
+Bin,Number of Reads
+10,0
+100,5
+1000,1000
+2500,5000
+...
+```
+
+## Notes
+
+- The program uses 16 predefined bins: 10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000, 75000, 100000, 200000, 300000, 500000, 750000, and 1000000.
+- Reads longer than 1,000,000 bases will be counted in the last bin.
+- The program assumes the input file is in the standard FASTQ format and is gzip-compressed.
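The binning rule in these notes can be sketched as follows. This is an illustration, not the program's source. `bin_index`, `count_read`, `BIN_LIMITS` and `NUM_BINS` are hypothetical names, and the code assumes each upper limit is inclusive, which the notes do not state.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS 16

/* Upper limits of the 16 bins listed in the notes. */
static const unsigned long BIN_LIMITS[NUM_BINS] = {
    10, 100, 1000, 2500, 5000, 10000, 20000, 35000,
    50000, 75000, 100000, 200000, 300000, 500000, 750000, 1000000
};

/* Index of the first bin whose upper limit is >= read_len (assumed inclusive).
 * Reads longer than the largest limit fall into the last bin. */
size_t bin_index(unsigned long read_len)
{
    for (size_t i = 0; i < NUM_BINS; i++) {
        if (read_len <= BIN_LIMITS[i])
            return i;
    }
    return NUM_BINS - 1;
}

/* Add one read of the given length to the per-bin counters. */
void count_read(unsigned long counts[NUM_BINS], unsigned long read_len)
{
    assert(counts != NULL);
    counts[bin_index(read_len)]++;
}
```

With these assumptions a 1,500-base read is counted under the 2500 bin, and a 2,000,000-base read under the 1000000 bin.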
+
+## Dependencies
+
+- zlib library for reading gzip-compressed files
+
+Remember to install zlib development files before compiling. On Ubuntu or Debian, you can do this with:
+
+```bash
+sudo apt-get install zlib1g-dev
+```
+
+## License
+
+This program is provided under the MIT License. See the source code for full license text.
+
+## Author
+
+Andrea Telatin, 2023
+Quadram Institute Bioscience