
Commit

Merge pull request #110 from hasindu2008/dev
Dev
hasindu2008 authored Apr 26, 2024
2 parents 7d526b4 + 3cc1194 commit b06ff20
Showing 7 changed files with 37 additions and 22 deletions.
4 changes: 2 additions & 2 deletions docs/commands.md
@@ -260,11 +260,11 @@ slow5tools skim [OPTIONS] file.blow5

* `-t, --threads INT`:<br/>
Number of threads [default value: 8].
* `-K, --batchsize`:<br/>
* `-K, --batchsize INT`:<br/>
  The batch size. This is the number of records held in memory at once [default value: 4096]. An increased batch size improves multi-threaded performance at the cost of higher RAM usage.
* `--hdr`:<br/>
print the header only.
* `--hdr`:<br/>
* `--rid`:<br/>
print the list of read ids only.
* `-h`, `--help`:
Prints the help menu.
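
As a quick illustration, a minimal sketch combining the options above (`file.blow5` is a placeholder input file):

```bash
# Print only the header of a BLOW5 file
slow5tools skim --hdr file.blow5

# Print only the list of read IDs, using 16 threads and a larger batch size
slow5tools skim --rid -t 16 -K 8192 file.blow5
```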
12 changes: 6 additions & 6 deletions docs/datasets.md
@@ -11,25 +11,25 @@ The NA12878 R9.4.1 PromethION dataset sequenced for the [SLOW5 paper](https://ww
| <sub>~9M reads complete PromethION dataset</sub> | <sub>[SRR22186402](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR22186402&display=data-access)</sub> | <sub>[na12878_prom_merged.blow5](https://slow5.page.link/na12878_prom_slow5) (`7e1a5900aff10e2cf1b97b8d3c6ecd1e`), [na12878_prom_merged.blow5.idx](https://slow5.page.link/na12878_prom_slow5_idx) (`a78919e8ac8639788942dbc3f1a2451a`) </sub> |


## NA24385 R10.4.1 LSK114 PromethION
## NA24385 R10.4.1 LSK114 PromethION (4 KHz)

An NA24385 R10.4.1 LSK114 dataset sequenced on a PromethION is available on [SRA](https://www.ncbi.nlm.nih.gov/sra/?term=SRS16575602), and the links are given below:

| <sub>Description</sub> | <sub>SRA run Data access</sub> | <sub>Direct download link (md5sum)</sub> |
| <sub>Description</sub> | <sub>SRA/ENA run Data access</sub> | <sub>Direct download link (md5sum)</sub> |
|------------------------------------------------------|------------------------------------------------------------------------------------------------------------|----------------------|
| <sub>~20K reads subsubset (BLOW5 format)</sub> | | <sub>[hg2_prom_lsk114_subsubsample.tar](https://slow5.page.link/hg2_prom_subsub)</sub> <sub>(`4d338e1cffd6dbf562cc55d9fcca040c`)</sub> |
| <sub>~500K reads subset (BLOW5 format)</sub> | <sub>[SRR23215365](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23215365&display=data-access)</sub> | <sub>[hg2_subsample_slow5.tar](https://slow5.page.link/hg2_prom_sub_slow5)</sub> <sub>(`65386e1da1d82b892677ad5614e8d84d`)</sub> |
| <sub>~15M reads complete PromethION dataset (BLOW5 format)</sub> | <sub>[SRR23215366](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23215366&display=data-access)</sub> | <sub> [PGXX22394_reads.blow5](https://slow5.page.link/hg2_prom_slow5) (`3498b595ac7c79a3d2dce47454095610`), [PGXX22394_reads.blow5.idx](https://slow5.page.link/hg2_prom_slow5_idx) (`1e11735c10cf63edc4a7114f010cc472`)</sub>* |
| <sub>~15M reads complete PromethION dataset (BLOW5 format)</sub> | <sub>[SRR23215366](https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR23215366&display=data-access)/[ERR11777845](https://www.ebi.ac.uk/ena/browser/view/ERR11777845)</sub> | <sub> [PGXX22394_reads.blow5](https://slow5.page.link/hg2_prom_slow5) (`3498b595ac7c79a3d2dce47454095610`), [PGXX22394_reads.blow5.idx](https://slow5.page.link/hg2_prom_slow5_idx) (`1e11735c10cf63edc4a7114f010cc472`)</sub>* |

*This dataset is hosted in the [gtgseq AWS bucket](https://aws.amazon.com/marketplace/pp/prodview-rve772jpfevtw) granted by the AWS open data sponsorship programme; the documentation is available in the [gtgseq GitHub repository](https://github.com/GenTechGp/gtgseq).

## NA12878 R10.4.1 LSK114 PromethION
## NA12878 R10.4.1 LSK114 PromethION (4KHz)

An NA12878 R10.4.1 LSK114 dataset sequenced on a PromethION is available at the links below:

| <sub>Description</sub> | <sub>SRA run Data access</sub> | <sub>Direct download link (md5sum)</sub> |
| <sub>Description</sub> | <sub>ENA run Data access</sub> | <sub>Direct download link (md5sum)</sub> |
|------------------------------------------------------|------------------------------------------------------------------------------------------------------------|----------------------|
| <sub>~11M reads complete PromethION dataset (BLOW5 format)</sub> | <sub>-</sub> | <sub> [PGXXHX230142_reads.blow5](https://slow5.page.link/na12878_prom2_slow5) (`24266f6dabb8d679f7f520be6aa22694`), [PGXXHX230142_reads.blow5.idx](https://slow5.page.link/na12878_prom2_slow5_idx) (`a5659f829b9410616391427b2526b853`) </sub>* |
| <sub>~11M reads complete PromethION dataset (BLOW5 format)</sub> | <sub>[ERR11777844](https://www.ebi.ac.uk/ena/browser/view/ERR11777844)</sub> | <sub> [PGXXHX230142_reads.blow5](https://slow5.page.link/na12878_prom2_slow5) (`24266f6dabb8d679f7f520be6aa22694`), [PGXXHX230142_reads.blow5.idx](https://slow5.page.link/na12878_prom2_slow5_idx) (`a5659f829b9410616391427b2526b853`) </sub>* |


*This dataset is hosted in the [gtgseq AWS bucket](https://aws.amazon.com/marketplace/pp/prodview-rve772jpfevtw) granted by the AWS open data sponsorship programme; the documentation is available in the [gtgseq GitHub repository](https://github.com/GenTechGp/gtgseq).
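
As a rough sketch (assuming the short link resolves directly to the file), a downloaded file can be checked against the md5sum given in the table:

```bash
# Download the ~11M read NA12878 PromethION BLOW5 and verify it against the md5sum listed above
wget -O PGXXHX230142_reads.blow5 https://slow5.page.link/na12878_prom2_slow5
echo "24266f6dabb8d679f7f520be6aa22694  PGXXHX230142_reads.blow5" | md5sum -c -
```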
1 change: 1 addition & 0 deletions docs/software.md
@@ -29,6 +29,7 @@

- [sigtk](https://github.com/hasindu2008/sigtk)
- [slow5tools](https://github.com/hasindu2008/slow5tools)
- [slow5curl](https://github.com/BonsonW/slow5curl)

## conversion

12 changes: 10 additions & 2 deletions docs/workflows.md
@@ -86,7 +86,7 @@ For mounting private buckets, put your ACCESS:KEY in ~/.passwd-s3fs (make sure 6
```bash
samtools view reads.bam chrX:147911919-147951125 | cut -f1 | sort -u > rid_list.txt
slow5tools get reads.blow5 --list rid_list.txt -o extracted.blow5
buttery-eel -i reads.blow5 -g /path/to/ont-guppy/bin/ --config dna_r9.4.1_450bps_sup.cfg --device 'cuda:all' -o extracted_sup.fastq #see https://github.com/Psy-Fer/buttery-eel/ for butter-eel options
buttery-eel -i extracted.blow5 -g /path/to/ont-guppy/bin/ --config dna_r9.4.1_450bps_sup.cfg --device 'cuda:all' -o extracted_sup.fastq #see https://github.com/Psy-Fer/buttery-eel/ for buttery-eel options
```

Note: If the read IDs in the BAM file are not the parent IDs (this happens when read splitting is enabled during the initial basecalling step), you can grab the parent read IDs from the FASTQ file as below and use that as the input to slow5tools get.
@@ -123,4 +123,12 @@ done
```

If Guppy has automatically done read splitting, you would see errors from slow5tools that some reads are not found.
In that case we need to locate these “parent read ids”, as explained under [this workflow](#extract-and-re-basecall-reads-mapping-to-a-particular-genomic-region). If anything is unclear, open an issue under slow5tools.
In that case, we need to locate these “parent read ids”, as explained under [this workflow](#extract-and-re-basecall-reads-mapping-to-a-particular-genomic-region). If anything is unclear, open an issue under slow5tools. A quick code snippet for handling “parent read ids” is given below:

```bash
# Loop over barcodes 0-4 (adjust the range to match your run)
for barcode in $(seq 0 4)
do
    # Take every FASTQ header line, pull out the parent_read_id field and keep unique IDs
    awk '{if(NR%4==1) {print $0}}' barcode_${barcode}.fastq | sed -n -e 's/.*parent_read_id=//p' | awk '{print $1}' | sort -u > read_id_${barcode}.txt
    # Extract those records from the BLOW5 file
    cat read_id_${barcode}.txt | slow5tools get reads.blow5 -o ${barcode}_0.blow5
done
```
26 changes: 16 additions & 10 deletions scripts/realtime-f2s/readme.md
@@ -12,7 +12,7 @@ This can be used on your computer where you are doing the sequencing acquisition

## Real run

Assume your sequencing data directory is */data* and you are sequencing an experiment called *my_sequencing_experiment* on to */data/my_sequencing_experiment*. Simply run the following for real-time FAST5 to SLOW5 conversion.
Assume your sequencing data directory is */data* and you are sequencing an experiment called *my_sequencing_experiment* onto */data/my_sequencing_experiment*. Simply run the following for real-time FAST5 to SLOW5 conversion.

```
./realf2s.sh -m /data/my_sequencing_experiment
@@ -22,31 +22,37 @@ This script will monitor the specified directory */data/my_sequencing_experiment

Brief log messages (including any conversion failures) are written to the terminal as well as to */data/my_sequencing_experiment/realtime_f2s.log*. The list of files that were detected by the monitor and for which conversion was attempted will be written to */data/my_sequencing_experiment/realtime_f2s_attempted_list.log*. If any conversion failed, the names of the *FAST5* files will be written to *realtime_f2s_failed_list.log*. In addition, there will be some other debug/trace logs (e.g., *realtime_f2s_monitor_trace.log*).

The monitoring script will terminate if it idles for 6 hours, i.e., no new FAST5 files were created under */data/my_sequencing_experiment/*, the script will terminate assuming that the sequencing run has completed. Just before termination, the script will check for any left over FAST5 and will convert them if present. Also, it will do a brief check on the file count and print some statistics any warnings if any.
The monitoring script will terminate if it idles for 6 hours, i.e., if no new FAST5 files are created under */data/my_sequencing_experiment/* for that period, the script terminates assuming that the sequencing run has been completed. Just before termination, the script will check for any leftover FAST5 files and will convert them if present. It will also do a brief check on the file count and print some statistics and warnings, if any. If you want the script to terminate as soon as the sequencing run in MinKNOW stops, add `export REALF2S_AUTO=1` to your `~/.bashrc` (before running realf2s.sh, and remember to source the .bashrc). Note that this auto-terminate feature relies on the "final_summary*.txt" file created by MinKNOW and will not be effective if ONT changes that.
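
For example, a minimal sketch of enabling this auto-termination (directory path as in the example above):

```
# Enable auto-termination when the MinKNOW run stops; do this before launching realf2s.sh
echo 'export REALF2S_AUTO=1' >> ~/.bashrc
source ~/.bashrc
./realf2s.sh -m /data/my_sequencing_experiment
```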

If you want to resume a conversion that was abruptly terminated half-way, use the `-r` option for resuming as below:
If you want to resume a conversion that was abruptly terminated halfway, use the `-r` option for resuming as below:

```
./realf2s.sh -m /data/my_sequencing_experiment -r
```

### Options

* `-m STR`:
* `-m STR`:
The sequencing experiment directory to be monitored. This is usually where MinKNOW writes data for your experiment e.g., */data/my_sequencing_experiment/* or */var/lib/minknow/data/my_sequencing_experiment/*.
* `-r`:
Resumes a previous live conversion. This option is useful if the real-time conversion abruptly stopped in the middle and you now want to resume the live conversion.
* `-t INT`:
Timeout in seconds [default: 21600]. The script will end if no new FAST5 were written for this specified period of time.
* `-r`:
Resumes a previous live conversion. This option is useful if the real-time conversion abruptly stops in the middle and you now want to resume the live conversion.
* `-t INT`:
Timeout in seconds [default: 21600]. The script will end if no new FAST5 is written for this specified period of time.
* `-p INT`:
Maximum number of parallel conversion processes [default: 1]. This value can be increased to keep up with the sequencing rate as necessary, depending on the numbe of CPU cores available.
Maximum number of parallel conversion processes [default: 1]. This value can be increased to keep up with the sequencing rate as necessary, depending on the number of CPU cores available.
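
For example, a sketch with placeholder values for the options above:

```
# Monitor an experiment directory with a 2-hour idle timeout and 4 parallel conversion processes
./realf2s.sh -m /data/my_sequencing_experiment -t 7200 -p 4

# Resume a previously interrupted live conversion of the same directory
./realf2s.sh -m /data/my_sequencing_experiment -r
```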

### Environment variables

The following optional environment variables will be honoured by the real-time conversion script if they are set.

- REALF2S_AUTO: makes the script terminate as soon as the sequencing run in MinKNOW stops, as explained above.
- SLOW5TOOLS: path to the slow5tools binary.
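
For instance, a sketch of pointing the script at a specific slow5tools binary (the path below is a placeholder):

```
# Point the script at a specific slow5tools binary before launching the monitor
export SLOW5TOOLS=/path/to/slow5tools
./realf2s.sh -m /data/my_sequencing_experiment
```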

## Simulation

Say you have some FAST5 files in a directory at */data2/previous_run*. You can test our real-time conversion script (*realf2s.sh*) by simulating a run from these existing FAST5 files using *monitor/simulator.sh*.

First create a directory to represent our simulated sequencing run, for instance `mkdir /data/my_simulated_run`.
First create a directory to represent our simulated sequencing run, for instance, `mkdir /data/my_simulated_run`.
Now launch the real-time conversion script to monitor this directory for newly created FAST5.

```
2 changes: 1 addition & 1 deletion src/cmd.h
@@ -1,7 +1,7 @@
#ifndef CMD_H
#define CMD_H

#define SLOW5TOOLS_VERSION "1.1.0"
#define SLOW5TOOLS_VERSION "1.1.0-dirty"

#define DEFAULT_NUM_THREADS 8
#define DEFAULT_NUM_PROCESSES 8
