generated from nextstrain/pathogen-repo-guide
0 parents
commit ad0b045
Showing 68 changed files with 2,715 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# Allow Git to decide if file is text or binary | ||
# Always use LF line endings even on Windows. | ||
* text=auto eol=lf |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
name: pre-commit | ||
|
||
on: | ||
- push | ||
|
||
jobs: | ||
  pre-commit: | ||
    runs-on: ubuntu-latest | ||
    steps: | ||
    - uses: actions/checkout@v4 | ||
    - uses: actions/setup-python@v5 | ||
      with: | ||
        python-version: "3.12" | ||
    - uses: pre-commit/[email protected] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,51 @@ | ||
# Files created by workflows that we usually want to keep out of git | ||
auspice/ | ||
builds/ | ||
data/ | ||
results/ | ||
logs/ | ||
benchmarks/ | ||
|
||
# Sensitive environment variables | ||
environment* | ||
env.d/ | ||
|
||
# Snakemake | ||
.snakemake/ | ||
|
||
# For Python # | ||
############## | ||
*.pyc | ||
.tox/ | ||
.cache/ | ||
|
||
# Compiled source # | ||
################### | ||
*.com | ||
*.class | ||
*.dll | ||
*.exe | ||
*.o | ||
*.so | ||
|
||
# OS generated files # | ||
###################### | ||
.DS_Store | ||
.DS_Store? | ||
._* | ||
.Spotlight-V100 | ||
.Trashes | ||
Icon? | ||
ehthumbs.db | ||
Thumbs.db | ||
*~ | ||
|
||
# IDE generated files # | ||
###################### | ||
.vscode/ | ||
|
||
# nohup output | ||
nohup.out | ||
|
||
# cluster logs | ||
slurm-* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
default_language_version: | ||
  python: python3 | ||
exclude: '\.(tsv|fasta|gb)$|^ingest/vendored/' | ||
repos: | ||
  - repo: https://github.com/pre-commit/sync-pre-commit-deps | ||
    rev: v0.0.1 | ||
    hooks: | ||
      - id: sync-pre-commit-deps | ||
  - repo: https://github.com/shellcheck-py/shellcheck-py | ||
    rev: v0.10.0.1 | ||
    hooks: | ||
      - id: shellcheck | ||
  - repo: https://github.com/rhysd/actionlint | ||
    rev: v1.6.27 | ||
    hooks: | ||
      - id: actionlint | ||
        entry: env SHELLCHECK_OPTS='--exclude=SC2027' actionlint | ||
  - repo: https://github.com/pre-commit/pre-commit-hooks | ||
    rev: v4.6.0 | ||
    hooks: | ||
      - id: trailing-whitespace | ||
      - id: check-ast | ||
      - id: check-case-conflict | ||
      - id: check-docstring-first | ||
      - id: check-json | ||
      - id: check-executables-have-shebangs | ||
      - id: check-merge-conflict | ||
      - id: check-shebang-scripts-are-executable | ||
      - id: check-symlinks | ||
      - id: check-toml | ||
      - id: check-yaml | ||
      - id: destroyed-symlinks | ||
      - id: detect-private-key | ||
      - id: end-of-file-fixer | ||
      - id: fix-byte-order-marker | ||
  - repo: https://github.com/astral-sh/ruff-pre-commit | ||
    # Ruff version. | ||
    rev: v0.4.6 | ||
    hooks: | ||
      # Run the linter. | ||
      - id: ruff |
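pre-commit treats `exclude` as a Python regular expression and matches it against each repository-relative path with `re.search`. The sketch below applies the pattern from the config above to a few paths. The paths are hypothetical, chosen only to illustrate which branch of the pattern fires, and `is_excluded` is a helper invented for this example.

```python
import re

# Pattern copied from the `exclude` key in the pre-commit config above.
EXCLUDE = re.compile(r"\.(tsv|fasta|gb)$|^ingest/vendored/")


def is_excluded(path: str) -> bool:
    """Return True if the path matches the exclude pattern (hooks would skip it)."""
    return EXCLUDE.search(path) is not None


# Hypothetical repository-relative paths, for illustration only.
for path in [
    "ingest/defaults/annotations.tsv",  # matches the file-extension branch
    "ingest/vendored/some-script",      # matches the vendored-directory branch
    "ingest/Snakefile",                 # matches neither branch
    "docs/vendored/notes.md",           # "vendored" is not under ingest/ here
]:
    print(f"{path}: {'skipped' if is_excluded(path) else 'checked'}")
```

Because the second branch is anchored with `^`, only the top-level `ingest/vendored/` directory is skipped; a `vendored` directory elsewhere in the tree would still be checked.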
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# CHANGELOG | ||
|
||
We use this CHANGELOG to document breaking changes, new features, bug fixes, | ||
and config value changes that may affect both the usage of the workflows and | ||
the outputs of the workflows. See the [changelog for the ncov | ||
repository](https://github.com/nextstrain/ncov/blob/HEAD/docs/src/reference/change_log.md) | ||
for an example of formatting. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
# Pathogen Repo Guide | ||
|
||
This guide helps you set up a Nextstrain pathogen repository | ||
that holds the files needed to run and maintain a Nextstrain pathogen build. | ||
|
||
Starting from this guide gives you the general repository and workflow | ||
organization expected of a Nextstrain-maintained pathogen. | ||
However, the workflows need customization to support your specific pathogen | ||
and should not be expected to "just work". |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
# Ingest | ||
|
||
This workflow ingests public data from NCBI and outputs curated metadata and | ||
sequences that can be used as input for the phylogenetic workflow. | ||
|
||
If you have another data source or private data that needs to be formatted for | ||
the phylogenetic workflow, then you can use a similar workflow to curate your | ||
own data. | ||
|
||
## Workflow Usage | ||
|
||
The workflow can be run from the top level pathogen repo directory: | ||
``` | ||
nextstrain build ingest | ||
``` | ||
|
||
Alternatively, the workflow can be run from within the ingest directory: | ||
``` | ||
cd ingest | ||
nextstrain build . | ||
``` | ||
|
||
This produces the default outputs of the ingest workflow: | ||
|
||
- metadata = results/metadata.tsv | ||
- sequences = results/sequences.fasta | ||
|
||
### Dumping the full raw metadata from NCBI Datasets | ||
|
||
The workflow has a target for dumping the full raw metadata from NCBI Datasets. | ||
|
||
``` | ||
nextstrain build ingest dump_ncbi_dataset_report | ||
``` | ||
|
||
This will produce the file `ingest/data/ncbi_dataset_report_raw.tsv`, | ||
which you can inspect to determine what fields and data to use if you want to | ||
configure the workflow for your pathogen. | ||
|
||
## Defaults | ||
|
||
The defaults directory contains all of the default configurations for the ingest workflow. | ||
|
||
[defaults/config.yaml](defaults/config.yaml) contains all of the default configuration parameters | ||
used for the ingest workflow. Use Snakemake's `--configfile`/`--config` | ||
options to override these default values. | ||
|
||
## Snakefile and rules | ||
|
||
The rules directory contains separate Snakefiles (`*.smk`) as modules of the core ingest workflow. | ||
The modules of the workflow are in separate files to keep the main ingest [Snakefile](Snakefile) succinct and organized. | ||
|
||
The `workdir` is hardcoded to be the ingest directory so all filepaths for | ||
inputs/outputs should be relative to the ingest directory. | ||
|
||
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes) | ||
in the main Snakefile in the order that they are expected to run. | ||
|
||
### Nextclade | ||
|
||
Nextstrain is pushing to standardize ingest workflows with Nextclade runs to include Nextclade outputs in our publicly | ||
hosted data. However, if a Nextclade dataset does not already exist, creating one requires curated data as input, so we are making | ||
Nextclade steps optional here. | ||
|
||
If Nextclade config values are included, the Nextclade rules will create the final metadata TSV by joining the Nextclade | ||
output with the metadata. If Nextclade configs are not included, we rename the subset metadata TSV to the final metadata TSV. | ||
|
||
To run Nextclade rules, include the `defaults/nextclade_config.yaml` config file with: | ||
|
||
``` | ||
nextstrain build ingest --configfile defaults/nextclade_config.yaml | ||
``` | ||
|
||
> [!TIP] | ||
> If the Nextclade dataset is stable and you always want to run the Nextclade rules as part of ingest, we recommend | ||
> moving the Nextclade-related config parameters from the `defaults/nextclade_config.yaml` file to the default config file | ||
> `defaults/config.yaml`. | ||
|
||
## Build configs | ||
|
||
The build-configs directory contains custom configs and rules that override and/or | ||
extend the default workflow. | ||
|
||
- [nextstrain-automation](build-configs/nextstrain-automation/) - automated internal Nextstrain builds. | ||
|
||
|
||
## Vendored | ||
|
||
This repository uses [`git subrepo`](https://github.com/ingydotnet/git-subrepo) | ||
to manage copies of ingest scripts in [vendored](vendored), from [nextstrain/ingest](https://github.com/nextstrain/ingest). | ||
|
||
See [vendored/README.md](vendored/README.md#vendoring) for instructions on how to update | ||
the vendored scripts. |
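The join described in the Nextclade section of this README can be sketched in plain Python. This is not the workflow's implementation (the rules in `rules/nextclade.smk` are not shown in this commit view). The function name `join_nextclade`, the ID column names `accession` and `seqName`, the `clade` column, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def join_nextclade(metadata_tsv, nextclade_tsv, id_field="accession",
                   nextclade_id_field="seqName", fields=("clade",)):
    """Left-join selected Nextclade columns onto metadata rows by sequence ID."""
    # Index the Nextclade rows by their ID column for constant-time lookup.
    nextclade_rows = {
        row[nextclade_id_field]: row
        for row in csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
    }
    joined = []
    for row in csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"):
        match = nextclade_rows.get(row[id_field], {})
        # Records without a Nextclade result get empty values in the new columns.
        for field in fields:
            row[field] = match.get(field, "")
        joined.append(row)
    return joined


# Made-up sample data (assumed column names, not real records).
metadata = "accession\tcountry\nA1\tUSA\nB2\tKenya\n"
nextclade = "seqName\tclade\nA1\tX\n"

for row in join_nextclade(metadata, nextclade):
    print(row)
```

A left join keeps every metadata record even when Nextclade produced no row for it, which is one reasonable reading of "joining the Nextclade output with the metadata"; the real rules may behave differently.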
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
""" | ||
This is the main ingest Snakefile that orchestrates the full ingest workflow | ||
and defines its default outputs. | ||
""" | ||
# The workflow filepaths are written relative to this Snakefile's base directory | ||
workdir: workflow.current_basedir | ||
|
||
# Use default configuration values. Override with Snakemake's --configfile/--config options. | ||
configfile: "defaults/config.yaml" | ||
|
||
# This is the default rule that Snakemake will run when there are no specified targets. | ||
# The default output of the ingest workflow is usually the curated metadata and sequences. | ||
# Nextstrain-maintained ingest workflows will produce metadata files with the | ||
# standard Nextstrain fields and additional fields that are pathogen specific. | ||
# We recommend using these standard fields in custom ingests as well to minimize | ||
# the customizations you will need for the downstream phylogenetic workflow. | ||
# TODO: Add link to centralized docs on standard Nextstrain metadata fields | ||
rule all: | ||
    input: | ||
        "results/sequences.fasta", | ||
        "results/metadata.tsv", | ||
|
||
|
||
# Note that only PATHOGEN-level customizations should be added to these | ||
# core steps, meaning they are custom rules necessary for all builds of the pathogen. | ||
# If there are build-specific customizations, they should be added with the | ||
# custom_rules imported below to ensure that the core workflow is not complicated | ||
# by build-specific rules. | ||
include: "rules/fetch_from_ncbi.smk" | ||
include: "rules/curate.smk" | ||
|
||
|
||
# We are pushing to standardize ingest workflows with Nextclade runs to include | ||
# Nextclade outputs in our publicly hosted data. However, if a Nextclade dataset | ||
# does not already exist, creating one requires curated data as input, so we are making | ||
# Nextclade steps optional here. | ||
# | ||
# If Nextclade config values are included, the nextclade rules will create the | ||
# final metadata TSV by joining the Nextclade output with the metadata. | ||
# If Nextclade configs are not included, we rename the subset metadata TSV | ||
# to the final metadata TSV. | ||
# To run nextclade.smk rules, include the `defaults/nextclade_config.yaml` | ||
# config file with `nextstrain build ingest --configfile defaults/nextclade_config.yaml`. | ||
if "nextclade" in config: | ||
|
||
    include: "rules/nextclade.smk" | ||
|
||
else: | ||
|
||
    rule create_final_metadata: | ||
        input: | ||
            metadata="data/subset_metadata.tsv" | ||
        output: | ||
            metadata="results/metadata.tsv" | ||
        shell: | ||
            """ | ||
            mv {input.metadata} {output.metadata} | ||
            """ | ||
|
||
# Allow users to import custom rules provided via the config. | ||
# This allows users to run custom rules that can extend or override the workflow. | ||
# A concrete example of using custom rules is the extension of the workflow with | ||
# rules to support the Nextstrain automation that uploads files and sends internal | ||
# Slack notifications. | ||
# For extensions, the user will have to specify the custom rule targets when | ||
# running the workflow. | ||
# For overrides, the custom Snakefile will have to use the `ruleorder` directive | ||
# to allow Snakemake to handle ambiguous rules | ||
# https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#handling-ambiguous-rules | ||
if "custom_rules" in config: | ||
    for rule_file in config["custom_rules"]: | ||
|
||
        include: rule_file |
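The conditional logic in this Snakefile amounts to choosing an ordered list of rule files from the config. The helper below is a hypothetical plain-Python restatement of that decision (`planned_includes` is not part of the workflow), which can help when reasoning about which rules a given config activates.

```python
def planned_includes(config: dict) -> list:
    """Return the rule files the Snakefile would include, in order, for a config."""
    # Core pathogen-level rules are always included.
    includes = ["rules/fetch_from_ncbi.smk", "rules/curate.smk"]
    if "nextclade" in config:
        includes.append("rules/nextclade.smk")
    # Without a "nextclade" key, the Snakefile defines create_final_metadata
    # inline instead, so no extra file is included for that branch.
    includes.extend(config.get("custom_rules", []))
    return includes


# Config with neither Nextclade settings nor custom rules.
print(planned_includes({}))

# Nextclade settings (placeholder empty value) plus the automation rules.
print(planned_includes({
    "nextclade": {},
    "custom_rules": ["build-configs/nextstrain-automation/upload.smk"],
}))
```

Note that the Snakefile tests only for the presence of the `nextclade` key, so even an empty value switches the workflow to the Nextclade rules.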
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# Nextstrain automation | ||
|
||
> [!NOTE] | ||
> External users can ignore this directory! | ||
> This build config/customization is tailored for the internal Nextstrain team | ||
> to extend the core ingest workflow for automated workflows. | ||
|
||
## Update the config | ||
|
||
Update the [config.yaml](config.yaml) for your pathogen: | ||
|
||
1. Edit the `s3_dst` param to add the pathogen repository name. | ||
2. Edit the `files_to_upload` param to a mapping of files you need to upload for your pathogen. | ||
The default includes suggested files for uploading curated data and Nextclade outputs. | ||
|
||
## Run the workflow | ||
|
||
Pass the additional config file with Snakemake's `--configfile` option to | ||
include the custom rules from [upload.smk](upload.smk) in the workflow. | ||
Specify the `upload_all` target to run the additional upload rules. | ||
|
||
The upload rules will require AWS credentials for a user that has permissions | ||
to upload to the Nextstrain data bucket. | ||
|
||
The customized workflow can be run from the top level pathogen repo directory with: | ||
``` | ||
nextstrain build \ | ||
--env AWS_ACCESS_KEY_ID \ | ||
--env AWS_SECRET_ACCESS_KEY \ | ||
ingest \ | ||
upload_all \ | ||
--configfile build-configs/nextstrain-automation/config.yaml | ||
``` | ||
|
||
## Automated GitHub Action workflows | ||
|
||
Additional instructions on how to use this with the shared `pathogen-repo-build` | ||
GitHub Action workflow to come! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
# This configuration file should contain all required configuration parameters | ||
# for the ingest workflow to run with additional Nextstrain automation rules. | ||
|
||
# Custom rules to run as part of the Nextstrain automated workflow | ||
# The paths should be relative to the ingest directory. | ||
custom_rules: | ||
  - build-configs/nextstrain-automation/upload.smk | ||
|
||
# Nextstrain CloudFront domain to ensure that we invalidate CloudFront after the S3 uploads | ||
# This is required as long as we are using the AWS CLI for uploads | ||
cloudfront_domain: "data.nextstrain.org" | ||
|
||
# Nextstrain AWS S3 Bucket with pathogen prefix | ||
# Replace <pathogen> with the pathogen repo name. | ||
s3_dst: "s3://nextstrain-data/files/workflows/<pathogen>" | ||
|
||
# Mapping of files to upload | ||
files_to_upload: | ||
  ncbi.ndjson.zst: data/ncbi.ndjson | ||
  metadata.tsv.zst: results/metadata.tsv | ||
  sequences.fasta.zst: results/sequences.fasta | ||
  alignments.fasta.zst: results/alignment.fasta | ||
  translations.zip: results/translations.zip |
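One way to read the `files_to_upload` mapping is that each key is the remote file name and each value is the local path relative to the ingest directory. The sketch below builds destination URLs under that assumption. How `upload.smk` actually combines `s3_dst` with the keys is not shown in this commit view, and both the `plan_uploads` helper and the `example-pathogen` name are hypothetical.

```python
def plan_uploads(s3_dst: str, files_to_upload: dict) -> list:
    """Pair each local file with its assumed S3 destination URL."""
    prefix = s3_dst.rstrip("/")
    return [
        (local_path, f"{prefix}/{remote_name}")
        for remote_name, local_path in files_to_upload.items()
    ]


config = {
    # "example-pathogen" stands in for the <pathogen> placeholder in the config.
    "s3_dst": "s3://nextstrain-data/files/workflows/example-pathogen",
    "files_to_upload": {
        "metadata.tsv.zst": "results/metadata.tsv",
        "sequences.fasta.zst": "results/sequences.fasta",
    },
}

for local_path, url in plan_uploads(config["s3_dst"], config["files_to_upload"]):
    print(f"{local_path} -> {url}")
```

Under this reading, changing a key renames the uploaded object without touching the local workflow outputs.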