Merge pull request #12 from alliance-genome/KANBAN-507_managed-workflow
KANBAN-507 managed workflow implementation
mluypaert authored Apr 4, 2024
2 parents 75d7140 + 81ab2b3 commit 6996687
Showing 16 changed files with 266 additions and 21 deletions.
82 changes: 75 additions & 7 deletions .github/workflows/PR-validation.yml
@@ -8,20 +8,27 @@ jobs:
pipeline-seq-retrieval-container-image-build:
name: pipeline/seq_retrieval container-image build
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: ./pipeline/seq_retrieval/
steps:
- name: Check out repository code
uses: actions/checkout@v4
with:
fetch-depth: 0
sparse-checkout: |
pipeline/seq_retrieval/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build container image
run: |
make container-image
uses: docker/build-push-action@v5
with:
context: ./pipeline/seq_retrieval/
push: false
tags: agr_pavi/seq_retrieval:latest
outputs: type=docker,dest=/tmp/pavi_seq_retrieval_docker_image.tar
- name: Upload image as artifact (share between jobs)
uses: actions/upload-artifact@v4
with:
name: seq_retrieval_image
path: /tmp/pavi_seq_retrieval_docker_image.tar
pipeline-seq-retrieval-python-typing-check:
name: pipeline/seq_retrieval python typing check
runs-on: ubuntu-22.04
@@ -82,4 +89,65 @@ jobs:
- name: Run unit tests
run: |
make run-unit-tests
#TODO: add integration testing
pipeline-alignment-container-image-build:
name: pipeline/alignment container-image build
runs-on: ubuntu-22.04
steps:
- name: Check out repository code
uses: actions/checkout@v4
with:
fetch-depth: 0
sparse-checkout: |
pipeline/alignment/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Build container image
uses: docker/build-push-action@v5
with:
context: ./pipeline/alignment/
push: false
tags: agr_pavi/alignment:latest
outputs: type=docker,dest=/tmp/pavi_alignment_docker_image.tar
- name: Upload image as artifact (share between jobs)
uses: actions/upload-artifact@v4
with:
name: alignment_image
path: /tmp/pavi_alignment_docker_image.tar
pipeline-workflow-integration-testing:
name: pipeline/workflow integration testing
needs:
- pipeline-seq-retrieval-container-image-build
- pipeline-alignment-container-image-build
runs-on: ubuntu-22.04
defaults:
run:
shell: bash
working-directory: ./pipeline/workflow/
steps:
- name: Check out repository code
uses: actions/checkout@v4
with:
fetch-depth: 0
sparse-checkout: |
pipeline/workflow/
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Download seq_retrieval image artifact (from previous job)
uses: actions/download-artifact@v4
with:
name: seq_retrieval_image
path: /tmp
- name: Download alignment image artifact (from previous job)
uses: actions/download-artifact@v4
with:
name: alignment_image
path: /tmp
- name: Load seq_retrieval Docker image
run: |
docker load --input /tmp/pavi_seq_retrieval_docker_image.tar
- name: Load alignment Docker image
run: |
docker load --input /tmp/pavi_alignment_docker_image.tar
- name: Run integration test
run: |
make run-integration-test
1 change: 1 addition & 0 deletions README.md
@@ -5,6 +5,7 @@ AGR's Proteins Annotations and Variants Inspector
Like most modern software, PAVI relies heavily on third-party tools and libraries for much of its core functionality.
We specifically acknowledge the creators and developers of the following third-party tools and libraries:
* BioPython: [Cock PJ, Antao T, Chang JT, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422-1423. doi:10.1093/bioinformatics/btp163](https://pubmed.ncbi.nlm.nih.gov/19304878/)
* Nextflow: [Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35(4):316-319. doi:10.1038/nbt.3820](https://pubmed.ncbi.nlm.nih.gov/28398311/)
* PySam: https://github.com/pysam-developers/pysam
* Samtools: [Danecek P, Bonfield JK, Liddle J, et al. Twelve years of SAMtools and BCFtools. Gigascience. 2021;10(2):giab008. doi:10.1093/gigascience/giab008](https://pubmed.ncbi.nlm.nih.gov/33590861/)

4 changes: 2 additions & 2 deletions pipeline/alignment/Dockerfile
@@ -1,4 +1,4 @@
FROM biocontainers/clustalo:v1.2.4-2-deb_cv1

ENTRYPOINT [ "clustalo"]
CMD [ "--help" ]
USER root
CMD [ "clustalo", "--help" ]
11 changes: 10 additions & 1 deletion pipeline/alignment/Makefile
@@ -1,2 +1,11 @@
CONTAINER_NAME=agr_pavi/alignment
ADDITIONAL_BUILD_ARGS=

.PHONY: clean

clean:
$(eval ADDITIONAL_BUILD_ARGS := --no-cache)
@:

container-image:
docker build --no-cache -t agr_pavi/alignment .
docker build ${ADDITIONAL_BUILD_ARGS} -t ${CONTAINER_NAME} .
19 changes: 14 additions & 5 deletions pipeline/alignment/README.md
@@ -1,14 +1,23 @@
# Manual invokation and testing instructions
First build the docker image:
This subdirectory contains the alignment component of PAVI.

# Local invocation and testing instructions
To build the docker image:
```bash
make container-image
```

To build a clean docker image (without using caching, for troubleshooting potential caching issues):
```bash
make clean container-image
```

Then run the container to run any alignment.

Use a volume mount (`-v`) as appropriate to enable the container access to the input and output directorie(s).
Use a volume mount (`-v`) as appropriate to give the container access to the input and output directories
on your local system.
Specify the clustalo command-line arguments as appropriate after the `docker run` command, as in the example below:
```bash
docker run -v /abs/path/to/in-out-dir:/mnt/pavi/ --rm pavi/alignment -i /mnt/pavi/input-seqs.fa -outfmt=clustal -o /mnt/pavi/clustal-output.aln
docker run -v /abs/path/to/in-out-dir:/mnt/pavi/ --rm agr_pavi/alignment \
clustalo -i /mnt/pavi/input-seqs.fa --outfmt=clustal --resno -o /mnt/pavi/clustal-output.aln
```
Once the run completed, Clustal-formattted alignment results can then be found locally in `</abs/path/to/in-out-dir>/clustal-output.aln`.
Once the run has completed, the Clustal-formatted alignment results can be found locally in `</abs/path/to/in-out-dir>/clustal-output.aln`.
8 changes: 5 additions & 3 deletions pipeline/seq_retrieval/Dockerfile
@@ -2,12 +2,14 @@ FROM python:3.12-alpine

WORKDIR /usr/src/app

RUN apk add --no-cache build-base zlib-dev bzip2-dev xz-dev
RUN apk update && apk add --no-cache build-base zlib-dev bzip2-dev xz-dev \
bash # Nextflow requirement

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./
RUN chmod a+x main.py
ENV PATH=$PATH:/usr/src/app

ENTRYPOINT [ "python", "main.py"]
CMD [ "--help" ]
CMD [ "main.py", "--help" ]
11 changes: 9 additions & 2 deletions pipeline/seq_retrieval/Makefile
@@ -1,7 +1,14 @@
.PHONY: check-venv-active run-python-type-check
CONTAINER_NAME=agr_pavi/seq_retrieval
ADDITIONAL_BUILD_ARGS=

.PHONY: check-venv-active run-python-type-check clean

clean:
$(eval ADDITIONAL_BUILD_ARGS := --no-cache)
@:

container-image:
docker build --no-cache -t agr_pavi/seq_retrieval .
docker build ${ADDITIONAL_BUILD_ARGS} -t ${CONTAINER_NAME} .

python-dependencies:
pip install -r requirements.txt
20 changes: 20 additions & 0 deletions pipeline/seq_retrieval/README.md
@@ -1,6 +1,13 @@
# PAVI Sequence retrieval
This subdirectory contains all code and configs for the PAVI sequence retrieval component.

## Content table
* [Development](#development)
* [Local dev environment](#local-dev-environment)
* [Code guidelines](#code-guidelines)
* [Building](#building)
* [Usage](#usage)

## Development
### Local dev environment
In order to enable isolated local development that does not interfere with the global system python setup,
@@ -154,3 +161,16 @@ All unit tests are automatically run and enforced as part of the PR validation
and all reported errors must be fixed to enable merging each PR into `main`.
If the `pipeline/seq_retrieval unit tests` status check fails on a PR in github,
click the details link and inspect the failing step output for hints on what to fix.

## Building
To build a clean docker image (for production usage and troubleshooting):
```bash
make clean container-image
```

## Usage
This PAVI component is intended to be called as a container.
To call the container after building:
```bash
docker run agr_pavi/seq_retrieval main.py
```
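
As a hedged illustration of the usage above, the container call can also be assembled programmatically. The sketch below only builds the argument list and does not run Docker. The helper name `build_seq_retrieval_command` is hypothetical; the option names are the ones passed to `main.py` in this commit's `protein-msa.nf`, and `--rm` is borrowed from the alignment README's `docker run` example.

```python
import shlex
from typing import List


def build_seq_retrieval_command(name: str, seq_id: str, seq_strand: str,
                                seq_regions: str, fasta_file_url: str,
                                output_type: str = "protein") -> List[str]:
    """Assemble (but do not run) a `docker run` call for the seq_retrieval image.

    Hypothetical helper for illustration; it is not part of the PAVI code base.
    """
    return [
        "docker", "run", "--rm", "agr_pavi/seq_retrieval", "main.py",
        "--output_type", output_type,
        "--name", name,
        "--seq_id", seq_id,
        "--seq_strand", seq_strand,
        "--fasta_file_url", fasta_file_url,
        "--seq_regions", seq_regions,
    ]


command = build_seq_retrieval_command(
    name="ERV29-S288C",
    seq_id="chrVII",
    seq_strand="-",
    seq_regions='["1061590..1060658"]',
    fasta_file_url="https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000146045.2_R64_genomic.fna.gz",
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it, assuming the image has been built locally first.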
7 changes: 6 additions & 1 deletion pipeline/seq_retrieval/src/main.py
@@ -1,3 +1,4 @@
#!/usr/bin/env python3
"""
Main module serving the CLI for PAVI sequence retrieval.
@@ -105,14 +106,17 @@ def process_seq_regions_param(ctx: click.Context, param: click.Parameter, value:
Use "file://*" for local file or "http(s)://*" for remote files.""")
@click.option("--output_type", type=click.Choice(['transcript', 'protein'], case_sensitive=False), required=True,
help="""The output type to return.""")
@click.option("--name", type=click.STRING, required=True,
help="The sequence name to use in the output fasta header.")
@click.option("--reuse_local_cache", is_flag=True,
help="""When defined and using remote `fasta_file_url`, reuse local files
if file already exists at destination path, rather than re-downloading and overwriting.""")
@click.option("--unmasked", is_flag=True,
help="""When defined, return unmasked sequences (undo soft masking present in reference files).""")
@click.option("--debug", is_flag=True,
help="""Flag to enable debug printing.""")
def main(seq_id: str, seq_strand: str, seq_regions: List[Dict[str, int]], fasta_file_url: str, output_type: str, reuse_local_cache: bool, unmasked: bool, debug: bool) -> None:
def main(seq_id: str, seq_strand: str, seq_regions: List[Dict[str, int]], fasta_file_url: str, output_type: str,
name: str, reuse_local_cache: bool, unmasked: bool, debug: bool) -> None:
"""
Main method for sequence retrieval from JBrowse faidx indexed fasta files. Receives input args from click.
@@ -142,6 +146,7 @@ def main(seq_id: str, seq_strand: str, seq_regions: List[Dict[str, int]], fasta_

logger.debug(f"full region: {fullRegion.seq_id}:{fullRegion.start}-{fullRegion.end}:{fullRegion.strand}")

click.echo('>' + name)
if output_type == 'transcript':
click.echo(seq_concat)
elif output_type == 'protein':
6 changes: 6 additions & 0 deletions pipeline/workflow/.gitignore
@@ -0,0 +1,6 @@
# NextFlow output files
nextflow
.nextflow/
.nextflow.log.*
work/
pipeline-results/
15 changes: 15 additions & 0 deletions pipeline/workflow/Makefile
@@ -0,0 +1,15 @@
RELEASE=23.10.1

.PHONY: build-workflow-local-deps run-integration-test

nextflow:
curl -L https://github.com/nextflow-io/nextflow/releases/download/v${RELEASE}/nextflow-${RELEASE}-all -o nextflow
chmod u+x nextflow

build-workflow-local-deps:
make -C ../seq_retrieval/ container-image
make -C ../alignment/ container-image

run-integration-test: nextflow
./nextflow run -profile test protein-msa.nf
@diff -qs pipeline-results/alignment-output.aln tests/resources/integration-test-results.aln
25 changes: 25 additions & 0 deletions pipeline/workflow/README.md
@@ -0,0 +1,25 @@
This subdirectory contains all code that defines the workflows,
which tie all pipeline components together into a fully functional and scalable pipeline
comprising all data retrieval and computation required for each PAVI alignment.
To that end, Nextflow is used as workflow manager and domain-specific language.

To download nextflow:
```bash
make nextflow
```

To run the protein MSA workflow locally:
1. Build all required components locally:
```bash
make build-workflow-local-deps
```
2. Run the pipeline with appropriate input arguments, as in the example below:
```bash
./nextflow run protein-msa.nf --input_seq_regions '[
{"name": "C54H2.5.1", "seq_id": "X", "seq_strand": "-",
"seq_regions": "[\"5780644..5780722\", \"5780278..5780585\", \"5779920..5780231\", \"5778875..5779453\"]",
"fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000002985.6_WBcel235_genomic.fna.gz"},
{"name": "ERV29-S288C", "seq_id": "chrVII", "seq_strand": "-", "seq_regions": "[\"1061590..1060658\"]",
"fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000146045.2_R64_genomic.fna.gz"}
]'
```
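
The `--input_seq_regions` value above is a JSON array in which each `seq_regions` field is itself a JSON-encoded string, which is why the inner quotes are escaped. A minimal sketch of producing that string without hand-escaping follows; the helper name `build_input_seq_regions` is hypothetical and not part of this repository.

```python
import json
from typing import Any, Dict, List


def build_input_seq_regions(requests: List[Dict[str, Any]]) -> str:
    """Serialise request dicts into the string passed via --input_seq_regions.

    Each request's `seq_regions` list is JSON-encoded into a string first,
    mirroring the nested quoting used in the README example.
    """
    encoded = []
    for request in requests:
        entry = dict(request)  # shallow copy, so the caller's dict is untouched
        entry["seq_regions"] = json.dumps(request["seq_regions"])
        encoded.append(entry)
    return json.dumps(encoded)


requests = [
    {"name": "C54H2.5.1", "seq_id": "X", "seq_strand": "-",
     "seq_regions": ["5780644..5780722", "5780278..5780585",
                     "5779920..5780231", "5778875..5779453"],
     "fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000002985.6_WBcel235_genomic.fna.gz"},
    {"name": "ERV29-S288C", "seq_id": "chrVII", "seq_strand": "-",
     "seq_regions": ["1061590..1060658"],
     "fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000146045.2_R64_genomic.fna.gz"},
]

param = build_input_seq_regions(requests)
print(param)
```

The printed string can then be passed, single-quoted, as the `--input_seq_regions` argument.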
7 changes: 7 additions & 0 deletions pipeline/workflow/nextflow.config
@@ -0,0 +1,7 @@
docker.enabled = true

profiles {
test {
includeConfig 'tests/integration/test.config'
}
}
42 changes: 42 additions & 0 deletions pipeline/workflow/protein-msa.nf
@@ -0,0 +1,42 @@
params.input_seq_regions

process sequence_retrieval {
container 'agr_pavi/seq_retrieval'

input:
val request_map

output:
path "${request_map.name}-protein.fa"

script:
"""
main.py --output_type protein \
--name ${request_map.name} --seq_id ${request_map.seq_id} --seq_strand ${request_map.seq_strand} \
--fasta_file_url ${request_map.fasta_file_url} --seq_regions '${request_map.seq_regions}' \
> ${request_map.name}-protein.fa
"""
}

process alignment {
container 'agr_pavi/alignment'

publishDir "pipeline-results/", mode: 'copy'

input:
path 'alignment-input.fa'

output:
path 'alignment-output.aln'

script:
"""
clustalo -i alignment-input.fa --outfmt=clustal --resno -o alignment-output.aln
"""
}

workflow {
def seq_region_channel = Channel.of(params.input_seq_regions).splitJson()

seq_region_channel | sequence_retrieval | collectFile(name: 'alignment-input.fa', sort: { file -> file.name }) | alignment
}
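
For readers unfamiliar with Nextflow, the `collectFile(name: ..., sort: { file -> file.name })` step in the workflow above gathers the per-sequence FASTA files into one alignment input, ordered by file name. A rough Python analogue is sketched below, purely as an illustration of that ordering; the function name `collect_fasta_files` and the truncated sequences are placeholders, not part of the pipeline.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def collect_fasta_files(fasta_files: Iterable[Path], destination: Path) -> Path:
    """Concatenate FASTA files into `destination`, ordered by file name."""
    ordered = sorted(fasta_files, key=lambda path: path.name)
    with destination.open("w") as merged:
        for fasta_file in ordered:
            merged.write(fasta_file.read_text())
    return destination


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Placeholder records; the sequences are truncated for brevity.
    (tmp_dir / "ERV29-S288C-protein.fa").write_text(">ERV29-S288C\nMSYRG\n")
    (tmp_dir / "C54H2.5.1-protein.fa").write_text(">C54H2.5.1\nMNQFR\n")
    merged = collect_fasta_files(tmp_dir.glob("*-protein.fa"),
                                 tmp_dir / "alignment-input.fa")
    merged_text = merged.read_text()

print(merged_text)
```

Sorting by name keeps the record order independent of which retrieval task finishes first, which is presumably what lets the integration test compare the alignment output against a fixed reference file.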
3 changes: 3 additions & 0 deletions pipeline/workflow/tests/integration/test.config
@@ -0,0 +1,3 @@
params {
input_seq_regions = '[{"name": "C54H2.5.1", "seq_id": "X", "seq_strand": "-", "seq_regions": "[\\"5780644..5780722\\", \\"5780278..5780585\\", \\"5779920..5780231\\", \\"5778875..5779453\\"]", "fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000002985.6_WBcel235_genomic.fna.gz"}, {"name": "ERV29-S288C", "seq_id": "chrVII", "seq_strand": "-", "seq_regions": "[\\"1061590..1060658\\"]", "fasta_file_url": "https://s3.amazonaws.com/agrjbrowse/fasta/GCF_000146045.2_R64_genomic.fna.gz"}]'
}
26 changes: 26 additions & 0 deletions pipeline/workflow/tests/resources/integration-test-results.aln
@@ -0,0 +1,26 @@
CLUSTAL O(1.2.4) multiple sequence alignment


C54H2.5.1 ------------------------MNQFRAPGGQN--EML-----------AKAE-DAAE 22
ERV29-S288C MSYRGPIGNFGGMPMSSSQGPYSGGAQFRSNQNQSTSGILKQWKHSFEKFASRIEGLTDN 60
***: .*. :* :: * : :

C54H2.5.1 DFFRKTRTYLPHIARLCLVSTFLEDGIRMYFQWDDQKQFMQESWSCGWFIATLFVIYNFF 82
ERV29-S288C AVVYKLKPYIPSLSRFFIVATFYEDSFRILSQWSDQIFYLNKWKHYPYFFVVVFLVVVTV 120
.. * : *:* ::*: :*:** **.:*: **.** :::: :*:..:*:: .

C54H2.5.1 GQFIPVLMIMLRKKVLVACGILASIVILQTIAYHILWDLKFLARNIAVGGGLLLLLAETQ 142
ERV29-S288C SMLIGASLLVLRKQTNYATGVLCACVISQALVYGLFTGSSFVLRNFSVIGGLLIAFSDSI 180
. :* . :::***:. * *:*.: ** *::.* :: . .*: **::* ****: ::::

C54H2.5.1 EEKASLFAGVPTMGD-SNKPKSYMLLAGRVLLIFMFMSLMHFEMSFMQVLEIVVGFALIT 201
ERV29-S288C VQNKTTFGMLPELNSKNDKAKGYLLFAGRILIVLMFIAFTFSKSWFTVVLTI-IG---TI 236
:: : *. :* :.. .:* *.*:*:***:*:::**::: . : * ** * :*

C54H2.5.1 LVSIGYKTKLSAIVLVIWLFGLNLWLNAWWTIPSDRFYRDFMKYDFFQTMSVIGGLLLVI 261
ERV29-S288C CFAIGYKTKFASIMLGLILTFYNITLNNYWFYNN--TKRDFLKYEFYQNLSIIGGLLLVT 294
.:******:::*:* : * *: ** :* . ***:**:*:*.:*:*******

C54H2.5.1 AYGPGGVSVDDYKKRW 277
ERV29-S288C NTGAGELSVDEKKKIY 310
* * :***: ** :
