From c8af22e38dda8f60f96947595358b5a6a3ea4dea Mon Sep 17 00:00:00 2001 From: "Andrea Telatin, M1" <15690844+telatin@users.noreply.github.com> Date: Sun, 22 Sep 2024 15:06:41 +0100 Subject: [PATCH] Documentation udpate --- README.md | 26 +++++++-- docs/README_BENCHMARK.md | 0 docs/README_GEN.md | 103 +++++++++++++++++++++++++++++++++++ docs/README_N50.md | 102 +++++++++++++++++++++++++++++++++++ docs/README_N50_BINNER.md | 100 ++++++++++++++++++++++++++++++++++ docs/README_N50_GENERATE.md | 104 ++++++++++++++++++++++++++++++++++++ docs/README_N50_SIMREADS.md | 0 7 files changed, 431 insertions(+), 4 deletions(-) create mode 100644 docs/README_BENCHMARK.md create mode 100644 docs/README_GEN.md create mode 100644 docs/README_N50.md create mode 100644 docs/README_N50_BINNER.md create mode 100644 docs/README_N50_GENERATE.md create mode 100644 docs/README_N50_SIMREADS.md diff --git a/README.md b/README.md index 1ee3106..f148a17 100644 --- a/README.md +++ b/README.md @@ -4,11 +4,29 @@ A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files. 
-## Other docs +## Tools -- [`gen` generate datasets](README_GEN.md) -- [`n50` calculate N50](README_N50.md) -- [benchmark](README_BENCHMARK.md) +- [`n50` calculate N50](docs/README_N50.md) + +- [`n50_simreads`](docs/README_N50_SIMREADS.md), to simulate reads of the desired lengths +- [`n50_binner`](docs/README_N50_BINNER.md), to generate a summary of read lengths from a FASTQ file to be used with `n50_generate` +- [`n50_generate`](docs/README_N50_GENERATE.md) uses the output of `n50_binner` to generate reads (using `n50_simreads`) + +- [`gen`](docs/README_GEN.md), alternative generator +- [benchmark notes](docs/README_BENCHMARK.md) + + +## General requirements + +- C compiler +- zlib, pthread libraries + +## Compiling + +```bash +make all +make test +``` ## Author diff --git a/docs/README_BENCHMARK.md b/docs/README_BENCHMARK.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/README_GEN.md b/docs/README_GEN.md new file mode 100644 index 0000000..63d1fcc --- /dev/null +++ b/docs/README_GEN.md @@ -0,0 +1,103 @@ +# `gen`: DNA Sequence Generator + +## Overview + +The DNA Sequence Generator is a C program that creates random DNA sequences in FASTA or FASTQ format, controlled by **parameters that set the range of sequence lengths**. +It generates multiple files containing sequences of varying lengths. 
+ +This tool is particularly useful for: + +- Testing tools that gather statistics from DNA sequences +- Benchmarking bioinformatics pipelines + +## Features + +- Calculates and includes N50 statistics in the output filenames +- Supports both FASTA and FASTQ output formats +- Deterministic output based on a provided seed for reproducibility +- Output filename format: `N50_TOTSEQS_SUMLEN.{fasta|fastq}`, which makes it easy to test the N50 calculation + +## Compilation + +To compile the program, use a C compiler such as GCC: + +```bash +gcc -o dna_generator dna_generator.c -lm +``` + +## Usage + +```bash +./gen +``` + +### Parameters + +- ``: Minimum number of sequences per file +- ``: Maximum number of sequences per file +- ``: Minimum length of each sequence +- ``: Maximum length of each sequence +- ``: Total number of files to generate +- ``: Output format (either "fasta" or "fastq") +- ``: Directory to store the output files +- ``: Seed for the random number generator (for reproducibility) + +### Example + +```bash +gen 10 100 1000 10000 5 fasta output_dir +``` + +This command will generate 5 FASTA files in the `output_dir` directory. Each file will contain between 10 and 100 sequences, with lengths ranging from 1000 to 10000 base pairs. The random number generator will be initialized with a static seed to ensure reproducibility. + +## Output + +The program generates files with names in the format: + +```text +${N50}_${TOTSEQS}_${SUMLEN}.{fasta|fastq} +``` + +Where: + +- `N50` is the N50 statistic of the sequences in the file +- `TOTSEQS` is the total number of sequences in the file +- `SUMLEN` is the sum of all sequence lengths in the file + +### FASTA Format + +In FASTA format, each sequence is represented as: + +```text +>seq1 +ATCGATCGATCG... +>seq2 +GCTAGCTAGCTA... +``` + +### FASTQ Format + +In FASTQ format, each sequence is represented as: + +```text +@seq1 +ATCGATCGATCG... ++ +IIIIIIIIIIII... +@seq2 +GCTAGCTAGCTA... ++ +IIIIIIIIIIII... 
+``` + +Note: The quality scores in FASTQ format are dummy values (all 'I') for simplicity. + +## License + +This program is provided under the MIT License. See the source code for full license text. + +## Author + +Andrea Telatin, 2023 +Quadram Institute Bioscience + diff --git a/docs/README_N50.md b/docs/README_N50.md new file mode 100644 index 0000000..52500b4 --- /dev/null +++ b/docs/README_N50.md @@ -0,0 +1,102 @@ +# `n50` - Calculate N50 + +> A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files. + +## Features + + +- Supports both FASTA and FASTQ formats +- Optimized for raw FASTQ files (Nanopore, PacBio) +- Handles gzipped input files + +## Installation + +### Prerequisites + +- GCC compiler +- zlib library +- pthread library + +### Compiling + +To compile the program, use the following command: + +```bash +make +``` + +or compile the binary directly: + +```bash +gcc -o n50 src/n50.c -lz -lpthread -O3 +``` + +## Usage + +```bash +./n50 [options] [filename]... +``` + +If no filename is provided, the program reads from standard input. + +### Options + +- `--fasta` or `-a`: Force FASTA input format +- `--fastq` or `-q`: Force FASTQ input format +- `--header` or `-H`: Print header in the output +- `--n50` or `-n`: Output only the N50 value +- `--version` and `--help`: Display version and help information, respectively + +### Examples + +1. Process a FASTA file: + +```bash +./n50 input.fasta +``` + +2. Process a gzipped FASTQ file: + +```bash +./n50 input.fastq.gz +``` + +3. Process a file with header output: + +```bash +./n50 --header input.fasta +``` + +4. Output only the N50 value: + +```bash +./n50 --n50 input.fasta +``` + +5. Process input from stdin: + +```bash +cat input.fasta | ./n50 --fasta +``` + +## Output + +By default, the program outputs a tab-separated line with the following fields: + +1. Format (FASTA or FASTQ) +2. Total sequence length +3. Total number of sequences +4. 
N50 value + +When using the `--header` option, a header line is printed before the results. + +When using the `--n50` option, only the N50 value is printed. + +## Performance + +The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads. + +## Limitations + +- The maximum number of threads is currently set to 8. This can be adjusted by modifying the `MAX_THREADS` constant in the source code. +- The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased. diff --git a/docs/README_N50_BINNER.md b/docs/README_N50_BINNER.md new file mode 100644 index 0000000..d40352a --- /dev/null +++ b/docs/README_N50_BINNER.md @@ -0,0 +1,100 @@ +# n50_binner + +This program analyzes a FASTQ file and counts the number of reads falling into predefined length bins. + +It can be used to generate a simplified summary of the read length distribution in a +FASTQ file, to be then used to +[generate a similar FASTX file](README_N50_GENERATE.md). + +## What it does + +The program: + +1. Reads a (compressed) FASTQ file. +2. Counts the number of reads falling into each of 16 predefined length bins. +3. Outputs a CSV-formatted result showing the number of reads in each bin. + +## How to compile + +To compile the program, you need a C compiler (such as gcc) and the zlib library installed. Use the following command: + +```bash +gcc -o fastq_length_counter fastq_length_counter.c -lz +``` + +This will create an executable named `fastq_length_counter`. + +## How to use + +Run the program from the command line, providing the path to a gzip-compressed FASTQ file as an argument: + +```bash +./n50_binner path/to/your/file.fastq.gz +``` + +The program will process the file and output the results to stdout in CSV format. + +## Examples + +1. 
Running the program: + +```bash +./fastq_length_counter sample.fastq.gz +``` + +2. Saving the output to a file: + +```bash +./fastq_length_counter sample.fastq.gz > length_distribution.csv +``` + +3. Processing multiple files: + +```bash +for file in *.fastq.gz; do + echo "Processing $file" + ./fastq_length_counter "$file" > "${file%.fastq.gz}_length_distribution.csv" +done +``` + +## Output format + +The output is in CSV format with two columns: +1. Bin: The upper limit of the length bin +2. Number of Reads: The count of reads falling into that bin + +Example output: + +```text +Bin,Number of Reads +10,0 +100,5 +1000,1000 +2500,5000 +... +``` + +## Notes + +- The program uses 16 predefined bins: 10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000, 75000, 100000, 200000, 300000, 500000, 750000, and 1000000. +- Reads longer than 1,000,000 bases will be counted in the last bin. +- The program assumes the input file is in the standard FASTQ format and is gzip-compressed. + +## Dependencies + +- zlib library for reading gzip-compressed files + +Remember to install zlib development files before compiling. On Ubuntu or Debian, you can do this with: + +```bash +sudo apt-get install zlib1g-dev +``` + +## License + +This program is provided under the MIT License. See the source code for full license text. + +## Author + +Andrea Telatin, 2023 +Quadram Institute Bioscience diff --git a/docs/README_N50_GENERATE.md b/docs/README_N50_GENERATE.md new file mode 100644 index 0000000..c51cf4b --- /dev/null +++ b/docs/README_N50_GENERATE.md @@ -0,0 +1,104 @@ +# n50_generate + +This program processes an input file containing read length distribution data and prepares it for use with [n50_simreads](README_N50_SIMREADS.md). +It generates a string representation of the read length distribution and executes n50_simreads with the prepared data. 
+ +## Features + +- Processes input files with read length distribution data from [n50_binner](README_N50_BINNER.md) +- Runs [n50_simreads](README_N50_SIMREADS.md) to generate reads based on the input data + +## Usage + +```bash +n50_prepare -i INPUTFILE -o OUTDIR [-f FORMAT] [-s PATH] [-v] +``` + +### Options: + +- `-i INPUTFILE` : Path to the input file (required) +- `-o OUTDIR` : Output directory for n50_simreads results (required) +- `-f FORMAT` : Output format (optional, FASTQ by default, FASTA also supported) +- `-s PATH` : Path to n50_simreads executable (optional) +- `-v` : Verbose mode, prints additional information +- `-h` : Display help message + +## Input File Format + +The input file should be a CSV file with the following format: + +```text +length,count +100,1000 +200,500 +300,250 +... +``` + +The first line (header) is skipped during processing. + +## How to Compile + +To compile the program, use a C compiler such as gcc: + +```bash +gcc -o n50_prepare n50_prepare.c +``` + +This will create an executable named `n50_prepare`. + +## Examples + +1. Basic usage: + +```bash +./n50_prepare -i input_distribution.csv -o output_directory +``` + +1. Specifying FASTA output format: + +```bash +./n50_prepare -i input_distribution.csv -o output_directory -f FASTA +``` + +1. Using a custom path for n50_simreads: + +```bash +./n50_prepare -i input_distribution.csv -o output_directory -s /path/to/n50_simreads +``` + +1. Running in verbose mode: + +```bash +./n50_prepare -i input_distribution.csv -o output_directory -v +``` + +## Output + +The n50_simreads output will be saved in the specified output directory. 
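
The input format described above (a `length,count` CSV whose header line is skipped and whose invalid rows are skipped with a warning) can be read along the following lines. This is only a minimal sketch, not the program's actual code: the function names `parse_bin_line` and `summarize_distribution` are hypothetical, and counting only non-empty bins toward the maximum length is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse one "length,count" row.
 * Returns 1 on success, 0 if the row is not two valid integers. */
int parse_bin_line(const char *line, long *length, long *count) {
    char *end;
    errno = 0;
    long len = strtol(line, &end, 10);
    if (end == line || *end != ',' || errno == ERANGE) return 0;
    const char *p = end + 1;
    long cnt = strtol(p, &end, 10);
    if (end == p || errno == ERANGE) return 0;
    /* Allow only trailing whitespace after the count. */
    while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') end++;
    if (*end != '\0' || len <= 0 || cnt < 0) return 0;
    *length = len;
    *count = cnt;
    return 1;
}

/* Hypothetical helper: skip the header, sum the counts, and track the
 * largest length that has at least one read (an assumption of this sketch).
 * Lines longer than the buffer are not handled specially.
 * Returns the number of rows skipped as invalid. */
int summarize_distribution(FILE *in, long long *total_reads, long *max_length) {
    char line[256];
    int skipped = 0;
    assert(in != NULL);
    *total_reads = 0;
    *max_length = 0;
    if (fgets(line, sizeof line, in) == NULL) return 0; /* header, skipped */
    while (fgets(line, sizeof line, in) != NULL) {
        long length, count;
        if (!parse_bin_line(line, &length, &count)) {
            fprintf(stderr, "warning: skipping invalid line: %s", line);
            skipped++;
            continue;
        }
        *total_reads += count;
        if (count > 0 && length > *max_length) *max_length = length;
    }
    return skipped;
}
```

A caller would open the CSV with `fopen`, pass the handle to `summarize_distribution`, and report the two totals. Using `strtol` rather than `sscanf` keeps overflow and trailing garbage detectable.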
+ +## Statistics + +The program calculates and displays: + +- Total number of reads +- Maximum read length + +## Dependencies + +- n50_simreads (should be in the same directory as n50_prepare or specified with the `-s` option) + +## Notes + +- By default, the program looks for n50_simreads in the same directory as n50_prepare; use the `-s` option to supply a different path. +- Make sure you have the necessary permissions to execute `n50_simreads` and write to the output directory. +- Invalid data in the input file will be skipped with a warning message. + +## License + +This program is provided under the MIT License. See the source code for full license text. + +## Author + +Andrea Telatin, 2023 +Quadram Institute Bioscience diff --git a/docs/README_N50_SIMREADS.md b/docs/README_N50_SIMREADS.md new file mode 100644 index 0000000..e69de29
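
For reference, the N50 statistic that `n50` reports and that `gen` encodes in its output filenames can be sketched as below. It uses the common definition: sort the lengths from longest to shortest and return the length at which the running sum first reaches half of the total. This is a sketch under that assumption, not the code in `src/n50.c`; the function name `n50_of_lengths` is hypothetical, and tools differ on edge cases such as ties at exactly 50%.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparator for qsort: orders uint64_t values from largest to smallest. */
static int cmp_desc(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x < y) - (x > y);
}

/* Hypothetical helper: returns the N50 of n sequence lengths, or 0 if n == 0.
 * Note: sorts the caller's array in place (descending). */
uint64_t n50_of_lengths(uint64_t *lengths, size_t n) {
    uint64_t total = 0, running = 0;
    size_t i;
    if (n == 0) return 0;
    assert(lengths != NULL);
    for (i = 0; i < n; i++) total += lengths[i];
    qsort(lengths, n, sizeof *lengths, cmp_desc);
    for (i = 0; i < n; i++) {
        running += lengths[i];
        /* Same as 2 * running >= total, written to avoid overflow. */
        if (running >= total - running) return lengths[i];
    }
    return 0; /* not reached when n > 0 */
}
```

For the lengths 2 through 10 (total 54), the three longest sequences (10, 9, 8) sum to 27, which is half of the total, so the sketch returns 8.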