nanoplot (#95)

* nanoplot
* test_data
* reinitiate
* gitignore
* namespace
* Testing NanoPlot in CLI
* NanoPlot complete
* Updated docker engine
* Docker
* Delete taget directory
* Deleted
* Input file
* fastq with more reads
* Delete config.vsh.yaml
* Pull request changes
* Delete var directory
* Config arguments complete
* Update help.txt
* Update config file
* Test files
* runners script
* gitignore default
* Move output
* Delete output directory
* Runners script complete
* Test script
* default output
* test data
* params passed correctly
* outdir
* test script
* input files
* all test files
* test data < 100 KB
* test script update
* Update CHANGELOG.md
* Update CHANGELOG.md
* Test cases in directories
* rm .gz .pickle .feather files
* reduce test input size
* Multiple separator ";" and check there is only one input file

---------

Co-authored-by: jakubmajercik <[email protected]>
Co-authored-by: Emma Rousseau <[email protected]>
viash-hub · Oct 26, 2024 · 6e6b139
1 parent 7fb67a9
commit 6e6b139

Showing 13 changed files with 1,317 additions and 2 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,8 @@
 
 * `rsem/rsem_calculate_expression`: Calculate expression levels (PR #93).
 
+* `nanoplot`: Plotting tool for long read sequencing data and alignments (PR #95).
+
 ## BREAKING CHANGES
 
 * `falco`: Fix a typo in the `--reverse_complement` argument (PR #157).
@@ -189,8 +191,6 @@
     - `bbmap_bbsplit`: Split sequencing reads by mapping them to multiple references simultaneously (PR #138).
 
 
-
-
 ## MINOR CHANGES
 
 * Uniformize component metadata (PR #23).

diff --git a/src/nanoplot/config.vsh.yaml b/src/nanoplot/config.vsh.yaml
@@ -0,0 +1,230 @@
+name: nanoplot
+description: |
+  Run NanoPlot on nanopore-sequenced reads.
+  NanoPlot is a plotting tool for long read sequencing data and alignments.
+keywords: ["fastq", "sequencing summary", "nanopore"]
+links:
+  repository: https://github.com/wdecoster/NanoPlot
+  homepage: http://nanoplot.bioinf.be/
+  documentation: https://github.com/wdecoster/NanoPlot
+references:
+  doi: 10.1093/bioinformatics/btad311
+license: MIT
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --fastq
+        type: file
+        description: Input fastq file(s), separated by ";".
+        example: read.fq
+        direction: input
+        multiple: true
+      - name: --fasta
+        type: file
+        description: Input fasta file(s), separated by ";".
+        example: read.fa
+        direction: input
+        multiple: true
+      - name: --fastq_rich
+        type: file
+        description: |
+          Input fastq file(s) generated by albacore or MinKNOW with additional
+          information concerning channel and time, separated by ";".
+        example: read.fq
+        direction: input
+        multiple: true
+      - name: --fastq_minimal
+        type: file
+        description: |
+          Input fastq file(s) generated by albacore or MinKNOW with
+          additional information concerning channel and time. Minimal data is extracted
+          swiftly without elaborate checks. Separated by ";".
+        example: read.fq
+        direction: input
+        multiple: true
+      - name: --summary
+        type: file
+        description: |
+          Input summary file(s) generated by albacore or guppy, separated by ";".
+        example: read.txt
+        direction: input
+        multiple: true
+      - name: --bam
+        type: file
+        description: Input sorted bam file(s), separated by ";".
+        example: read.bam
+        direction: input
+        multiple: true
+      - name: --ubam
+        type: file
+        description: Input unmapped bam file(s), separated by ";".
+        example: read.ubam
+        direction: input
+        multiple: true
+      - name: --cram
+        type: file
+        description: Input sorted cram file(s), separated by ";".
+        example: read.cram
+        direction: input
+        multiple: true
+      - name: --pickle
+        type: file
+        description: Input pickle file stored earlier, separated by ";".
+        example: read.pkl
+        direction: input
+        multiple: true
+      - name: --feather
+        alternatives: [--arrow]
+        type: file
+        description: Input feather file(s), separated by ";".
+        example: read.arrow
+        direction: input
+        multiple: true
+  - name: Outputs
+    arguments:
+      - name: --outdir
+        alternatives: [-o]
+        type: file
+        direction: output
+        description: Specify directory in which output has to be created.
+        required: true
+  - name: Options
+    arguments:
+      - name: --verbose
+        type: boolean_true
+        description: Write log messages also to terminal.
+      - name: --store
+        type: boolean_true
+        description: Store the extracted data in a pickle file for future plotting.
+      - name: --raw
+        type: boolean_true
+        description: Store the extracted data in tab separated file.
+      - name: --huge
+        type: boolean_true
+        description: Input data is one very large file.
+      - name: --no_static
+        type: boolean_false
+        description: Do not make static (png) plots.
+      - name: --prefix
+        alternatives: [-p]
+        type: string
+        description: Specify an optional prefix to be used for the output files.
+      - name: --tsv_stats
+        type: boolean_true
+        description: Output the stats file as a properly formatted TSV.
+      - name: --only_report
+        type: boolean_true
+        description: Output only the report.
+      - name: --info_in_report
+        type: boolean_true
+        description: Add NanoPlot run info in the report.
+  - name: Filtering or transforming input
+    arguments:
+      - name: --maxlength
+        type: integer
+        description: Drop reads longer than length specified.
+      - name: --minlength
+        type: integer
+        description: Drop reads shorter than length specified.
+      - name: --drop_outliers
+        type: boolean_false
+        description: Drop outlier reads with extremely long length.
+      - name: --downsample
+        type: integer
+        description: Reduce dataset to N reads by random sampling.
+      - name: --loglength
+        type: boolean_true
+        description: Logarithmic scaling of lengths in plots.
+      - name: --percentqual
+        type: boolean_true
+        description: Use qualities as theoretical percent identities.
+      - name: --alength
+        type: boolean_true
+        description: Use aligned read lengths rather than sequenced length (bam mode).
+      - name: --minqual
+        type: integer
+        description: Drop reads with an average quality lower than specified.
+      - name: --runtime_until
+        type: integer
+        description: Only take the N first hours of a run.
+      - name: --readtype
+        type: string
+        description: |
+          Which read type to extract information about from summary.
+          Options are 1D, 2D, 1D2.
+      - name: --barcoded
+        type: boolean_true
+        description: Use if you want to split the summary file by barcode.
+      - name: --no_supplementary
+        type: boolean_false
+        description: Use if you want to remove supplementary alignments.
+  - name: Customizing plots
+    arguments:
+      - name: --color
+        alternatives: [-c]
+        type: string
+        description: Specify a color for the plots, must be a valid matplotlib color.
+      - name: --colormap
+        alternatives: [-cm]
+        type: string
+        description: Specify a valid matplotlib colormap for the heatmap.
+      - name: --format
+        alternatives: [-f]
+        type: string
+        default: png
+        description: |
+          Specify the output format of the plots, in addition to the html files.
+          {png,jpg,jpeg,webp,svg,pdf,eps,json}
+      - name: --plots
+        type: string
+        description: |
+          Specify which bivariate plots have to be made.
+          [{kde,hex,dot} ...]
+      - name: --legacy
+        type: string
+        description: |
+          Specify which bivariate plots have to be made (legacy mode).
+          [{kde,dot,hex} ...]
+      - name: --listcolors
+        type: boolean_true
+        description: List the colors which are available for plotting and exit.
+      - name: --listcolormaps
+        type: boolean_true
+        description: List the colormaps which are available for plotting and exit.
+      - name: --no_N50
+        type: boolean_false
+        description: Hide the N50 mark in the read length histogram.
+      - name: --N50
+        type: boolean_true
+        description: Show the N50 mark in the read length histogram.
+      - name: --title
+        type: string
+        description: Add a title to all plots, requires quoting if using spaces.
+      - name: --font_scale
+        type: double
+        description: Scale the font of the plots by a factor.
+      - name: --dpi
+        type: integer
+        description: Set the dpi for saving images.
+      - name: --hide_stats
+        type: boolean_false
+        description: Do not add Pearson R stats in some bivariate plots.
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/nanoplot:1.43.0--pyhdfd78af_1
+    setup:
+      - type: docker
+        run: |
+          version=$(NanoPlot --version) && \
+          echo "$version" > /var/software_versions.txt
+runners:
+  - type: executable
+  - type: nextflow
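
The config above declares every input as `multiple: true` with values separated by ";", whereas NanoPlot's own usage line lists inputs as space-separated words (`--fastq file [file ...]`). The component's `script.sh` is not shown in this view, so the following is only a minimal sketch of how a wrapper could bridge the two. It assumes Viash hands the argument to the script as a `par_fastq` variable holding a ";"-joined string; the file names and the `output` directory are hypothetical.

```sh
#!/bin/sh
# Hypothetical value, as a ";"-joined multiple-value argument might arrive.
par_fastq="reads1.fastq.gz;reads2.fastq.gz"

# Split on ";" into positional parameters.
# Globbing is disabled during the split so "*" in a file name is kept literal.
set -f
IFS=';'
set -- $par_fastq
unset IFS   # back to default splitting on space, tab and newline
set +f

fastq_count=$#

# Join the files with spaces, as NanoPlot's usage line expects.
# The command is only assembled and printed here, not executed.
nanoplot_cmd="NanoPlot --fastq $* --outdir output"
echo "$nanoplot_cmd"
```

Joining into a single string is enough for display, but it loses the boundaries of file names that contain spaces. A real wrapper would more likely keep the files as separate words (for example `NanoPlot --fastq "$@"`) when invoking the tool.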
diff --git a/src/nanoplot/help.txt b/src/nanoplot/help.txt
@@ -0,0 +1,96 @@
+usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge]
+                [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats]
+                [--only-report] [--info_in_report] [--maxlength N]
+                [--minlength N] [--drop_outliers] [--downsample N]
+                [--loglength] [--percentqual] [--alength] [--minqual N]
+                [--runtime_until N] [--readtype {1D,2D,1D2}] [--barcoded]
+                [--no_supplementary] [-c COLOR] [-cm COLORMAP]
+                [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]]
+                [--plots [{kde,hex,dot} ...]] [--legacy [{kde,dot,hex} ...]]      
+                [--listcolors] [--listcolormaps] [--no-N50] [--N50]
+                [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI]
+                [--hide_stats]
+                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram file [file ...] | --pickle pickle | --feather file [file ...])
+
+CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.
+
+General options:
+  -h, --help            show the help and exit
+  -v, --version         Print version and exit.
+  -t, --threads THREADS
+                        Set the allowed number of threads to be used by the script
+  --verbose             Write log messages also to terminal.
+  --store               Store the extracted data in a pickle file for future plotting.
+  --raw                 Store the extracted data in tab separated file.
+  --huge                Input data is one very large file.
+  -o, --outdir OUTDIR   Specify directory in which output has to be created.      
+  --no_static           Do not make static (png) plots.
+  -p, --prefix PREFIX   Specify an optional prefix to be used for the output files.
+  --tsv_stats           Output the stats file as a properly formatted TSV.        
+  --only-report         Output only the report
+  --info_in_report      Add NanoPlot run info in the report.
+
+Options for filtering or transforming input prior to plotting:
+  --maxlength N         Hide reads longer than length specified.
+  --minlength N         Hide reads shorter than length specified.
+  --drop_outliers       Drop outlier reads with extreme long length.
+  --downsample N        Reduce dataset to N reads by random sampling.
+  --loglength           Additionally show logarithmic scaling of lengths in plots.
+  --percentqual         Use qualities as theoretical percent identities.
+  --alength             Use aligned read lengths rather than sequenced length (bam mode)
+  --minqual N           Drop reads with an average quality lower than specified.  
+  --runtime_until N     Only take the N first hours of a run
+  --readtype {1D,2D,1D2}
+                        Which read type to extract information about from summary. Options are 1D, 2D,
+                        1D2
+  --barcoded            Use if you want to split the summary file by barcode      
+  --no_supplementary    Use if you want to remove supplementary alignments        
+
+Options for customizing the plots created:
+  -c, --color COLOR     Specify a valid matplotlib color for the plots
+  -cm, --colormap COLORMAP
+                        Specify a valid matplotlib colormap for the heatmap       
+  -f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]
+                        Specify the output format of the plots, which are in addition to the html files
+  --plots [{kde,hex,dot} ...]
+                        Specify which bivariate plots have to be made.
+  --legacy [{kde,dot,hex} ...]
+                        Specify which bivariate plots have to be made (legacy mode).
+  --listcolors          List the colors which are available for plotting and exit.
+  --listcolormaps       List the colors which are available for plotting and exit.
+  --no-N50              Hide the N50 mark in the read length histogram
+  --N50                 Show the N50 mark in the read length histogram
+  --title TITLE         Add a title to all plots, requires quoting if using spaces
+  --font_scale FONT_SCALE
+                        Scale the font of the plots by a factor
+  --dpi DPI             Set the dpi for saving images
+  --hide_stats          Not adding Pearson R stats in some bivariate plots        
+
+Input data sources, one of these is required.:
+  --fastq file [file ...]
+                        Data is in one or more default fastq file(s).
+  --fasta file [file ...]
+                        Data is in one or more fasta file(s).
+  --fastq_rich file [file ...]
+                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
+                        with additional information concerning channel and time.  
+  --fastq_minimal file [file ...]
+                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
+                        with additional information concerning channel and time. Is extracted swiftly
+                        without elaborate checks.
+  --summary file [file ...]
+                        Data is in one or more summary file(s) generated by albacore or guppy.
+  --bam file [file ...]
+                        Data is in one or more sorted bam file(s).
+  --ubam file [file ...]
+                        Data is in one or more unmapped bam file(s).
+  --cram file [file ...]
+                        Data is in one or more sorted cram file(s).
+  --pickle pickle       Data is a pickle file stored earlier.
+  --feather, --arrow file [file ...]
+                        Data is in one or more feather file(s).
+
+EXAMPLES:
+    NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed
+    NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot
+    NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000
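
The help text spells two flags with hyphens (`--only-report`, `--no-N50`), while `config.vsh.yaml` declares them with underscores (`--only_report`, `--no_N50`) and marks several flags as `boolean_false`. A wrapper therefore has to translate between the two. The sketch below shows one possible way to do that. It assumes Viash passes flags to the script as `par_*` variables holding the strings "true" or "false", and that a `boolean_false` argument becomes "false" when the flag is given. The example values and the `add_flag` helper are hypothetical, and this is not the commit's `script.sh`.

```sh
#!/bin/sh
# Hypothetical values, as if the user passed --only_report and --no_N50.
par_only_report="true"   # boolean_true: "true" when the flag is given
par_no_N50="false"       # boolean_false: "false" when the flag is given
par_verbose="false"      # boolean_true left at its default

nanoplot_flags=""

# Append a NanoPlot flag when the Viash variable equals the trigger value.
#   $1 = current value, $2 = value meaning "flag was given", $3 = NanoPlot flag
add_flag() {
  if [ "$1" = "$2" ]; then
    nanoplot_flags="${nanoplot_flags:+$nanoplot_flags }$3"
  fi
}

add_flag "$par_only_report" "true"  "--only-report"   # hyphenated in help.txt
add_flag "$par_no_N50"      "false" "--no-N50"        # hyphenated in help.txt
add_flag "$par_verbose"     "true"  "--verbose"

# Only print the resulting command line; NanoPlot itself is not run here.
echo "NanoPlot $nanoplot_flags"
```

Keeping the trigger value explicit per flag makes the `boolean_true` and `boolean_false` cases visible side by side, which is where a silent inversion is easiest to introduce.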