Add agat convert sp gff2bed #114

Open
wants to merge 13 commits into
base: main
from
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -20,8 +20,9 @@
based on a provided sequence IDs or region coordinates file (PR #85).

* `agat`:
- - `agat/agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
- - `agat/agat_convert_bed2gff`: convert bed file to gff format (PR #97).
+ - `agat_convert_sp_gff2gtf`: convert any GTF/GFF file into a proper GTF file (PR #76).
+ - `agat_convert_bed2gff`: convert bed file to gff format (PR #97).
+ - `agat_convert_sp_gff2bed`: convert GTF/GXF file into bed file (PR #114).
- `agat/agat_convert_embl2gff`: convert an EMBL file into GFF format (PR #99).
- `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
- `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).
105 changes: 105 additions & 0 deletions src/agat/agat_convert_sp_gff2bed/config.vsh.yaml
@@ -0,0 +1,105 @@
name: agat_convert_sp_gff2bed
namespace: agat
description: |
The script converts a GTF/GXF file into a BED file. It converts
level2 features from GFF (mRNA, transcripts) into BED features. If the
selected subfeatures of those level2 features (level3; default: exon) exist,
they are reported in the block fields (columns 10-12 in BED). CDS start and
end are reported in columns 7 and 8, respectively.

### Definition of the BED format:

1. **chrom** - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
2. **chromStart** - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
3. **chromEnd** - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.

#### OPTIONAL fields:

4. **name** - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
5. **score** - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
6. **strand** - Defines the strand - either '+' or '-'.
7. **thickStart** - The starting position at which the feature is drawn thickly.
8. **thickEnd** - The ending position at which the feature is drawn thickly.
9. **itemRgb** - An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to "On", this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
10. **blockCount** - The number of blocks (exons) in the BED line.
11. **blockSizes** - A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
12. **blockStarts** - A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.
keywords: [gene annotations, GTF conversion, BED]
links:
homepage: https://github.com/NBISweden/AGAT
documentation: https://agat.readthedocs.io/en/latest/tools/agat_convert_sp_gff2bed.html
issue_tracker: https://github.com/NBISweden/AGAT/issues
repository: https://github.com/NBISweden/AGAT
references:
doi: 10.5281/zenodo.3552717
Review comment by @dorien-er (Contributor), Aug 22, 2024:

Suggested change
- doi: 10.5281/zenodo.3552717
+ doi: https://doi.org/10.5281/zenodo.3552717

license: GPL-3.0
requirements:
- commands: [agat]
authors:
- __merge__: /src/_authors/leila_paquay.yaml
roles: [ author, maintainer ]
argument_groups:
- name: Inputs
arguments:
- name: --gff
alternatives: [-i]
Review comment (Contributor):

Suggested change
- alternatives: [-i]

description: Input GFF3 file that will be read.
type: file
required: true
direction: input
example: input.gff
- name: Outputs
arguments:
- name: --output
alternatives: [--outfile, --out, -o]
description: |
File where the result will be written.
Review comment (Contributor):

Suggested change
- File where the result will be written.
+ Filepath to the output bed file.

type: file
direction: output
required: true
example: output.bed
- name: Arguments
arguments:
- name: --nc
description: |
Behaviour for non-coding features (e.g. records without CDS):

* keep: Default. Records are kept, but no CDS position is reported in the 7th and 8th columns (a period is reported instead).
* filter: Records are removed.
* transcript: Records are kept, and the 7th and 8th columns contain the transcript's start and stop.
type: string
choices: [keep, filter, transcript]
required: false
- name: --sub
description: |
Define the subfeature (level3, e.g. exon, cds, utr, etc.) to report as blocks in the BED output. Default: exon.
type: string
required: false
example: exon
- name: --config
alternatives: [-c]
description: |
AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
type: file
required: false
example: custom_agat_config.yaml
resources:
- type: bash_script
path: script.sh
test_resources:
- type: bash_script
path: test.sh
- type: file
path: test_data
engines:
- type: docker
image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
setup:
- type: docker
run: |
agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
runners:
- type: executable
- type: nextflow
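
A note on the `--nc` choices described in the config above: the three modes only differ for records that have no CDS. The sketch below is a hypothetical illustration of that decision, written from the option description alone. It is not AGAT's code, the function name `thick_columns` is invented here, and coordinates are passed through unchanged (whether AGAT shifts them to 0-based values is not shown in this PR).

```bash
#!/bin/bash
# Hypothetical illustration of the three --nc modes; not AGAT's actual code.
# Usage: thick_columns MODE TX_START TX_END [CDS_START CDS_END]
# Prints the values for BED columns 7 and 8, or nothing if the record is dropped.
thick_columns() {
  local mode="$1" tx_start="$2" tx_end="$3" cds_start="${4:-}" cds_end="${5:-}"
  if [[ -n "$cds_start" && -n "$cds_end" ]]; then
    # Coding record: the CDS bounds are reported whatever the mode
    printf '%s\t%s\n' "$cds_start" "$cds_end"
    return 0
  fi
  case "$mode" in
    keep)       printf '.\t.\n' ;;                          # default: periods
    filter)     ;;                                          # record removed
    transcript) printf '%s\t%s\n' "$tx_start" "$tx_end" ;;  # transcript bounds
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

thick_columns keep 100 900
thick_columns transcript 100 900
thick_columns keep 100 900 150 800
```

A test for the component could use the same three cases (non-coding with `keep`, non-coding with `transcript`, coding record) to cover the option, since the current test only exercises the defaults.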
105 changes: 105 additions & 0 deletions src/agat/agat_convert_sp_gff2bed/help.txt
@@ -0,0 +1,105 @@
```sh
agat_convert_sp_gff2bed.pl --help
```

------------------------------------------------------------------------------
| Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0 |
| https://github.com/NBISweden/AGAT |
| National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se |
------------------------------------------------------------------------------

Name:
agat_convert_sp_gff2bed.pl

Description:
The script aims to convert GTF/GXF file into bed file. It will convert
level2 features from gff (mRNA, transcripts) into bed features. If the
selected level2 subfeatures (defaut: exon) exist, they are reported in
the block fields (9-12th colum in bed). CDS Start and End are reported
in column 7 and 8 accordingly.

Definintion of the bed format: # 1 chrom - The name of the chromosome
(e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671). # 2
chromStart - The starting position of the feature in the chromosome or
scaffold. The first base in a chromosome is numbered 0. # 3 chromEnd -
The ending position of the feature in the chromosome or scaffold. The
chromEnd base is not included in the display of the feature. For
example, the first 100 bases of a chromosome are defined as
chromStart=0, chromEnd=100, and span the bases numbered 0-99. ##########
OPTIONAL fields ########## # 4 name - Defines the name of the BED line.
This label is displayed to the left of the BED line in the Genome
Browser window when the track is open to full display mode or directly
to the left of the item in pack mode. # 5 score - A score between 0 and
1000. If the track line useScore attribute is set to 1 for this
annotation data set, the score value will determine the level of gray in
which this feature is displayed (higher numbers = darker gray). # 6
strand - Defines the strand - either '+' or '-'. # 7 thickStart - The
starting position at which the feature is drawn thickly # 8 thickEnd -
The ending position at which the feature is drawn thickly # 9 itemRgb -
An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb
attribute is set to "On", this RBG value will determine the display
color of the data contained in this BED line. NOTE: It is recommended
that a simple color scheme (eight colors or less) be used with this
attribute to avoid overwhelming the color resources of the Genome
Browser and your Internet browser. # 10 blockCount - The number of
blocks (exons) in the BED line. # 11 blockSizes - A comma-separated list
of the block sizes. The number of items in this list should correspond
to blockCount. # 12 blockStarts - A comma-separated list of block
starts. All of the blockStart positions should be calculated relative to
chromStart. The number of items in this list should correspond to
blockCount.

Usage:
agat_convert_sp_gff2bed.pl --gff file.gff [ -o outfile ]
agat_convert_sp_gff2bed.pl --help

Options:
--gff Input GFF3 file that will be read

--nc STRING - behaviour for non-coding features (e.g. recored wihtout
CDS). [keep,filter,transcript] keep - Default, they are kept but
no CDS position is reported in the 7th and 8th columns (a period
is reported instead). filter - We remove them. transcript - We
keep them but values in 7th and 8th columns will contains
transcript's start and stop.

--sub Define the subfeature (level3, e.g exon,cds,utr,etc...) to
report as blocks in the bed output. Defaut: exon.

--outfile, --out, --output, or -o
File where will be written the result. If no output file is
specified, the output will be written to STDOUT.

-c or --config
String - Input agat config file. By default AGAT takes as input
agat_config.yaml file from the working directory if any,
otherwise it takes the orignal agat_config.yaml shipped with
AGAT. To get the agat_config.yaml locally type: "agat config
--expose". The --config option gives you the possibility to use
your own AGAT config file (located elsewhere or named
differently).

-h or --help
Display this helpful text.

Feedback:
Did you find a bug?:
Do not hesitate to report bugs to help us keep track of the bugs and
their resolution. Please use the GitHub issue tracking system available
at this address:

https://github.com/NBISweden/AGAT/issues

Ensure that the bug was not already reported by searching under Issues.
If you're unable to find an (open) issue addressing the problem, open a new one.
Try as much as possible to include in the issue when relevant:
- a clear description,
- as much relevant information as possible,
- the command used,
- a data sample,
- an explanation of the expected behaviour that is not occurring.

Do you want to contribute?:
You are very welcome, visit this address for the Contributing
guidelines:
https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
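
The BED definition in the help text above implies a coordinate shift when coming from GFF, which is 1-based and inclusive, while BED is 0-based and half-open with block starts relative to chromStart. A minimal sketch of that arithmetic follows. The helper name `gff_exons_to_bed_blocks` is hypothetical, the exons are assumed to be sorted by start, and the trailing commas in the lists are a choice of this sketch that may differ from what AGAT writes.

```bash
#!/bin/bash
# Hypothetical helper, not part of AGAT: derive BED block fields from
# 1-based, inclusive GFF exon coordinates given as "start-end" arguments.
# Prints: chromStart, blockCount, blockSizes, blockStarts (tab-separated).
gff_exons_to_bed_blocks() {
  local chrom_start="" sizes="" starts="" count=0 exon s e
  for exon in "$@"; do
    s="${exon%-*}"
    e="${exon#*-}"
    # The first (leftmost) exon sets chromStart: 1-based start minus one
    if [[ -z "$chrom_start" ]]; then
      chrom_start=$((s - 1))
    fi
    sizes+="$((e - s + 1)),"               # inclusive length of the exon
    starts+="$((s - 1 - chrom_start)),"    # offset relative to chromStart
    count=$((count + 1))
  done
  printf '%s\t%s\t%s\t%s\n' "$chrom_start" "$count" "$sizes" "$starts"
}

gff_exons_to_bed_blocks "101-200" "301-450"
```

For the two exons above, chromStart is 100, the block sizes are 100 and 150, and the block starts are 0 and 200.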
13 changes: 13 additions & 0 deletions src/agat/agat_convert_sp_gff2bed/script.sh
@@ -0,0 +1,13 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

agat_convert_sp_gff2bed.pl \
--gff "$par_gff" \
--output "$par_output" \
${par_nc:+--nc "${par_nc}"} \
${par_sub:+--sub "${par_sub}"} \
${par_config:+--config "${par_config}"}
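
The `${par_x:+--flag "${par_x}"}` form used in script.sh passes an optional flag only when its variable is set and non-empty. A small sketch of that behaviour, using a stand-in function (`show_args`, invented for this example) in place of the real command:

```bash
#!/bin/bash
# Stand-in for the real command: prints each received argument in brackets.
show_args() { printf '[%s]' "$@"; echo; }

par_nc="filter"
par_sub=""   # simulates an optional parameter that was not provided

# --nc is passed with its value; --sub disappears entirely
show_args --gff in.gff ${par_nc:+--nc "${par_nc}"} ${par_sub:+--sub "${par_sub}"}
```

The expansion is left unquoted on purpose so that the flag and its value become two separate arguments, while the inner quotes keep a value together as one argument.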
35 changes: 35 additions & 0 deletions src/agat/agat_convert_sp_gff2bed/test.sh
@@ -0,0 +1,35 @@
#!/bin/bash

set -eo pipefail

## VIASH START
## VIASH END

test_dir="${meta_resources_dir}/test_data"

# create temporary directory and clean up on exit
TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
function clean_up {
[[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
}
trap clean_up EXIT

echo "> Run $meta_name with test data"
"$meta_executable" \
--gff "$test_dir/1.gff" \
--output "$TMPDIR/output.gff"

echo ">> Checking output"
[ ! -f "$TMPDIR/output.gff" ] && echo "Output file output.gff does not exist" && exit 1

echo ">> Check if output is empty"
[ ! -s "$TMPDIR/output.gff" ] && echo "Output file output.gff is empty" && exit 1

echo ">> Check if output matches expected output"
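# Under `set -e` a bare failing `diff` would end the script before any
# `$?` check runs, so the comparison is made inside the `if` condition.
if ! diff "$TMPDIR/output.gff" "$test_dir/agat_convert_sp_gff2bed_1.gff"; then
  echo "Output file output.gff does not match expected output"
  exit 1
fi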

echo "> Test successful"
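
Since test.sh runs under `set -e`, it matters where a command that may fail is placed: on its own line it ends the script immediately, while inside an `if` condition its status is only tested. A minimal sketch of the difference, using `false` as a stand-in for a failing `diff` and child shells so the demonstration itself keeps running:

```bash
#!/bin/bash
# Each snippet runs in a child bash so that `set -e` cannot abort this script.

# Bare failing command: the child exits before the echo is reached.
bare=$(bash -c 'set -e; false; echo "reached"' || true)

# Failing command inside an `if` condition: the branch runs normally.
guarded=$(bash -c 'set -e; if ! false; then echo "mismatch reported"; fi')

echo "bare: '${bare}'"
echo "guarded: '${guarded}'"
```

The first variable stays empty and the second holds the message, which is why a mismatch message placed after a bare failing command is never printed.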