Showing 7 changed files with 431 additions and 4 deletions.

# `gen`: DNA Sequence Generator

## Overview

The DNA Sequence Generator is a C program that creates random DNA sequences in FASTA or FASTQ format, with user-defined ranges for the number of sequences per file and for sequence length.
It generates multiple files containing sequences of varying lengths.

This tool is particularly useful for:

- Testing tools that gather statistics from DNA sequences
- Benchmarking bioinformatics pipelines

## Features

- Calculates and includes N50 statistics in the output filenames
- Supports both FASTA and FASTQ output formats
- Deterministic output based on a provided seed for reproducibility
- Output filename format: `N50_TOTSEQS_SUMLEN.{fasta|fastq}`, which makes it easy to test N50 calculations

## Compilation

To compile the program, use a C compiler such as GCC. The output name `gen` matches the usage examples in this README:

```bash
gcc -o gen dna_generator.c -lm
```

## Usage

```bash
./gen <min_seqs> <max_seqs> <min_len> <max_len> <tot_files> <format> <outdir>
```

### Parameters

- `<min_seqs>`: Minimum number of sequences per file
- `<max_seqs>`: Maximum number of sequences per file
- `<min_len>`: Minimum length of each sequence
- `<max_len>`: Maximum length of each sequence
- `<tot_files>`: Total number of files to generate
- `<format>`: Output format (either "fasta" or "fastq")
- `<outdir>`: Directory to store the output files
- `<seed>`: Seed for the random number generator (for reproducibility). This argument does not appear in the usage line above, and the example below runs without it.

### Example

```bash
./gen 10 100 1000 10000 5 fasta output_dir
```

This command generates 5 FASTA files in the `output_dir` directory. Each file contains between 10 and 100 sequences, with lengths ranging from 1000 to 10000 base pairs. The random number generator is initialized with a static seed to ensure reproducibility.

## Output

The program generates files with names in the format:

```text
${N50}_${TOTSEQS}_${SUMLEN}.{fasta|fastq}
```

Where:

- `N50` is the N50 statistic of the sequences in the file
- `TOTSEQS` is the total number of sequences in the file
- `SUMLEN` is the sum of all sequence lengths in the file

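
To make the three filename fields concrete, here is a minimal Python sketch of how they could be derived from a list of sequence lengths. It is not taken from the C source: the function names `n50` and `output_name` are hypothetical, and the sketch assumes the common N50 convention (the length at which the cumulative sum of lengths, sorted longest first, reaches at least half of the total). The actual program may handle ties or empty input differently.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list).

    Assumed convention: the first length, walking from longest to shortest,
    at which the running sum covers at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def output_name(lengths, fmt):
    """Build a name following the N50_TOTSEQS_SUMLEN.{fasta|fastq} pattern."""
    return f"{n50(lengths)}_{len(lengths)}_{sum(lengths)}.{fmt}"


if __name__ == "__main__":
    # Hypothetical lengths, chosen only for illustration.
    print(output_name([2, 3, 4, 5, 6, 7, 8, 9, 10], "fasta"))
```

Because the expected values are encoded in the filename, a test harness can recompute the same three numbers from a generated file and compare them with its name.
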
### FASTA Format

In FASTA format, each sequence is represented as:

```text
>seq1
ATCGATCGATCG...
>seq2
GCTAGCTAGCTA...
```

### FASTQ Format

In FASTQ format, each sequence is represented as:

```text
@seq1
ATCGATCGATCG...
+
IIIIIIIIIIII...
@seq2
GCTAGCTAGCTA...
+
IIIIIIIIIIII...
```

Note: The quality scores in FASTQ format are dummy values (all 'I') for simplicity.

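
The record layouts above, together with seeded generation, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a port of the C program: `random_dna`, `format_record`, and `make_records` are hypothetical names, and Python's random number generator differs from C's, so the output will not match `gen` byte for byte even with the same seed.

```python
import random


def random_dna(rng, length):
    """Return a random DNA string of the given length."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def format_record(name, seq, fmt):
    """Format one sequence as a FASTA or FASTQ record."""
    if fmt == "fasta":
        return f">{name}\n{seq}\n"
    if fmt == "fastq":
        # Dummy quality string: one 'I' per base, as described above.
        return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"
    raise ValueError(f"unsupported format: {fmt}")


def make_records(seed, n_seqs, min_len, max_len, fmt):
    """Generate n_seqs records; the same seed always yields the same text."""
    rng = random.Random(seed)
    records = []
    for i in range(1, n_seqs + 1):
        length = rng.randint(min_len, max_len)  # inclusive on both ends
        records.append(format_record(f"seq{i}", random_dna(rng, length), fmt))
    return "".join(records)


if __name__ == "__main__":
    print(make_records(seed=42, n_seqs=2, min_len=8, max_len=12, fmt="fastq"), end="")
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the output reproducible even if other code draws random numbers.
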
## License

This program is provided under the MIT License. See the source code for full license text.

## Author

Andrea Telatin, 2023
Quadram Institute Bioscience

# `n50` - Calculate N50

> A fast and efficient tool for calculating N50 and other sequence statistics from FASTA and FASTQ files.

## Features

- Supports both FASTA and FASTQ formats
- Optimized for raw FASTQ files (Nanopore, PacBio)
- Handles gzipped input files

## Installation

### Prerequisites

- GCC compiler
- zlib library
- pthread library

### Compiling

To compile the program, use the following command:

```bash
make
```

or compile the binary directly:

```bash
gcc -o n50 src/n50.c -lz -lpthread -O3
```

## Usage

```bash
./n50 [options] [filename]...
```

If no filename is provided, the program reads from standard input.

### Options

- `--fasta` or `-a`: Force FASTA input format
- `--fastq` or `-q`: Force FASTQ input format
- `--header` or `-H`: Print header in the output
- `--n50` or `-n`: Output only the N50 value
- `--version` and `--help`: Display version and help information, respectively

### Examples

1. Process a FASTA file:

```bash
./n50 input.fasta
```

2. Process a gzipped FASTQ file:

```bash
./n50 input.fastq.gz
```

3. Process a file with header output:

```bash
./n50 --header input.fasta
```

4. Output only the N50 value:

```bash
./n50 --n50 input.fasta
```

5. Process input from stdin:

```bash
cat input.fasta | ./n50 --fasta
```

## Output

By default, the program outputs a tab-separated line with the following fields:

1. Format (FASTA or FASTQ)
2. Total sequence length
3. Total number of sequences
4. N50 value

When using the `--header` option, a header line is printed before the results.

When using the `--n50` option, only the N50 value is printed.

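
For scripting, the default four-field output can be parsed as in the Python sketch below. The names `N50Result` and `parse_n50_output` are hypothetical. The sketch assumes exactly the four tab-separated fields listed above and treats any line whose numeric fields are not integers as the header printed by `--header`, because the header text is not specified here.

```python
from typing import NamedTuple


class N50Result(NamedTuple):
    fmt: str           # "FASTA" or "FASTQ"
    total_length: int  # total sequence length
    n_seqs: int        # total number of sequences
    n50: int           # N50 value


def parse_n50_output(text):
    """Parse the default tab-separated output into a list of N50Result."""
    results = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # not a result line in the documented layout
        fmt, total_length, n_seqs, n50 = fields
        if not (total_length.isdigit() and n_seqs.isdigit() and n50.isdigit()):
            continue  # assumed to be the header line from --header
        results.append(N50Result(fmt, int(total_length), int(n_seqs), int(n50)))
    return results


if __name__ == "__main__":
    # Made-up values, used only to show the shape of the result.
    print(parse_n50_output("FASTA\t54\t9\t8\n"))
```
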
## Performance

The program uses multi-threading to process large files efficiently. It automatically adjusts the number of threads based on the input size, up to a maximum of 8 threads.

## Limitations

- The maximum number of threads is currently set to 8. This can be adjusted by modifying the `MAX_THREADS` constant in the source code.
- The initial capacity for storing sequence lengths is set to 1,000,000. For extremely large datasets, this value might need to be increased.

# n50_binner

This program analyzes a FASTQ file and counts the number of reads falling into predefined length bins.

It can be used to generate a simplified summary of the read length distribution in a FASTQ file, which can then be used to [generate a similar FASTX file](README_N50_GENERATE.md).

## What it does

The program:

1. Reads a (compressed) FASTQ file.
2. Counts the number of reads falling into each of 16 predefined length bins.
3. Outputs a CSV-formatted result showing the number of reads in each bin.

## How to compile

To compile the program, you need a C compiler (such as gcc) and the zlib library installed. Use the following command:

```bash
gcc -o fastq_length_counter fastq_length_counter.c -lz
```

This will create an executable named `fastq_length_counter`.

## How to use

Run the program from the command line, providing the path to a gzip-compressed FASTQ file as an argument:

```bash
./fastq_length_counter path/to/your/file.fastq.gz
```

The program will process the file and output the results to stdout in CSV format.

## Examples

1. Running the program:

```bash
./fastq_length_counter sample.fastq.gz
```

2. Saving the output to a file:

```bash
./fastq_length_counter sample.fastq.gz > length_distribution.csv
```

3. Processing multiple files:

```bash
for file in *.fastq.gz; do
  echo "Processing $file"
  ./fastq_length_counter "$file" > "${file%.fastq.gz}_length_distribution.csv"
done
```

## Output format

The output is in CSV format with two columns:

1. Bin: The upper limit of the length bin
2. Number of Reads: The count of reads falling into that bin

Example output:

```text
Bin,Number of Reads
10,0
100,5
1000,1000
2500,5000
...
```

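
A CSV in this layout can be loaded with Python's standard `csv` module, as in the short sketch below. The function name `read_bins` is hypothetical, and the sketch assumes the header row is exactly `Bin,Number of Reads` as in the example output.

```python
import csv
import io


def read_bins(csv_text):
    """Map each bin's upper limit to its read count."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row["Bin"]): int(row["Number of Reads"]) for row in reader}


if __name__ == "__main__":
    # First rows of the example output above (the elided rows are left out).
    sample = "Bin,Number of Reads\n10,0\n100,5\n1000,1000\n2500,5000\n"
    bins = read_bins(sample)
    print(bins, sum(bins.values()))
```
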
## Notes

- The program uses 16 predefined bins: 10, 100, 1000, 2500, 5000, 10000, 20000, 35000, 50000, 75000, 100000, 200000, 300000, 500000, 750000, and 1000000.
- Reads longer than 1,000,000 bases will be counted in the last bin.
- The program assumes the input file is in the standard FASTQ format and is gzip-compressed.

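
The binning rule in these notes can be sketched in Python as follows. This is an illustration, not the C implementation: `bin_index`, `count_bins`, and `to_csv` are hypothetical names, and the sketch assumes each bin's upper limit is inclusive (a read of exactly 100 bases lands in the `100` bin), which the notes do not state.

```python
from bisect import bisect_left

# The 16 upper limits listed in the notes above.
BIN_UPPER_LIMITS = [
    10, 100, 1000, 2500, 5000, 10000, 20000, 35000,
    50000, 75000, 100000, 200000, 300000, 500000, 750000, 1000000,
]


def bin_index(length):
    """Return the index of the bin for a read length.

    bisect_left finds the first upper limit >= length (inclusive limits are
    an assumption); longer reads are clamped to the last bin.
    """
    idx = bisect_left(BIN_UPPER_LIMITS, length)
    return min(idx, len(BIN_UPPER_LIMITS) - 1)


def count_bins(lengths):
    """Count how many read lengths fall into each bin."""
    counts = [0] * len(BIN_UPPER_LIMITS)
    for length in lengths:
        counts[bin_index(length)] += 1
    return counts


def to_csv(counts):
    """Render counts in the two-column CSV layout described in this README."""
    rows = ["Bin,Number of Reads"]
    rows += [f"{limit},{n}" for limit, n in zip(BIN_UPPER_LIMITS, counts)]
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical read lengths, including one above 1,000,000 bases.
    print(to_csv(count_bins([5, 50, 50, 1200, 2_000_000])), end="")
```

A binary search keeps the per-read cost small, although with only 16 bins a simple linear scan would work just as well.
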
## Dependencies

- zlib library for reading gzip-compressed files

Remember to install zlib development files before compiling. On Ubuntu or Debian, you can do this with:

```bash
sudo apt-get install zlib1g-dev
```

## License

This program is provided under the MIT License. See the source code for full license text.

## Author

Andrea Telatin, 2023
Quadram Institute Bioscience