Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from HIVSeqinR to HIVIntact #23

Draft
wants to merge 47 commits into
base: master
Choose a base branch
from
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
55536fb
Add HIVIntact analysis, for #10.
donkirkby Jul 24, 2021
1d54987
Install mafft and blast in GitHub Actions, for #10.
donkirkby Jul 26, 2021
6b77cac
Enable more checks in HIVIntact
Donaim May 19, 2023
28ae510
Specify writable workdir for hivintact
Donaim Jul 14, 2023
9aadb15
Update dependencies in the README file
Donaim Jun 5, 2023
7cfd902
Add HIVIntact to our python dependencies
Donaim Jun 5, 2023
7f4eba1
Generate table_precursor table from hivintact output
Donaim Jun 9, 2023
891b4a8
Remember intact sequences
Donaim Jun 9, 2023
b88c4e4
Update hivintact errors list
Donaim Jun 24, 2023
0bca5be
Improve hivintact errors priority and translation
Donaim Jun 30, 2023
6ad62bd
Bump HIVIntact from cfe-1.2 to cfe-1.3
Donaim Jul 21, 2023
7e15cf0
Generate proviral_landscape for HIVIntact runs too
Donaim Jul 21, 2023
719a14b
Singularity: make HIVIntact the default and the only app
Donaim Jul 21, 2023
a1294df
Switch to csv outputs in HIVIntact
Donaim Jul 21, 2023
5c378e9
Rename "hivseqinr_results_tar" argument to "detailed_results_tar"
Donaim Jul 21, 2023
56d06d9
Bump HIVIntact from 1.3 to 1.4
Donaim Sep 19, 2023
eae7b0f
Do not translate HIVIntact errors to HIVSeqinR ones
Donaim Sep 19, 2023
775f100
Fix table precursor construction
Donaim Sep 20, 2023
120fe49
Add missing codes to the HIVINACT_ERRORS_TABLE
Donaim Sep 20, 2023
5367241
fix the check for when to use HIVSeqinR results
Donaim Mar 7, 2024
95ae330
Various code improvements
Donaim Mar 7, 2024
9d725a1
Switch to CFEIntact
Donaim Jul 8, 2024
1c5eca6
Update Pipfile.lock
Donaim Jul 8, 2024
3a0bf9d
Update cfeintact command
Donaim Jul 8, 2024
2345afa
Bump CFEIntact version
Donaim Jul 8, 2024
bc1d1ea
Use CFEIntact's python API instead of operating system's shell
Donaim Jul 8, 2024
514221c
Do not install cfeintact twice
Donaim Jul 8, 2024
1819951
Fix column name of CFEIntact errors.csv
Donaim Jul 9, 2024
05aac4f
Improve reading of .SAM files
Donaim Jul 9, 2024
49022e0
Small refactoring of utils.py
Donaim Jul 9, 2024
3d53ae8
Add a separate proviral_landscapes.py script
Donaim Jul 9, 2024
c1698b4
Fix defect column value for intact sequences
Donaim Jul 9, 2024
0925e73
Move all landscapes stuff to landscapes.py
Donaim Jul 9, 2024
c1ed309
Bump CFEIntact version
Donaim Jul 12, 2024
eea0151
Bump CFEIntact version
Donaim Jul 16, 2024
6d18b36
Bump CFEIntact version
Donaim Jul 26, 2024
2b9a7b6
Bump CFEIntact version
Donaim Jul 27, 2024
e68bfe5
Update CFEIntact errors table
Donaim Jul 27, 2024
75fcf9c
Bump CFEIntact version
Donaim Jul 28, 2024
731556b
Bump CFEIntact version
Donaim Jul 28, 2024
9454ade
Fix landscape generation based on sample_name
Donaim Jul 29, 2024
87aaffb
Fix ignoring intact sequences in table_precursor and landscapes_plots
Donaim Jul 29, 2024
5fc9340
Bump CFEIntact version
Donaim Jul 29, 2024
28950eb
Bump pandas version
Donaim Jul 30, 2024
6098677
Fix reading of SAM files
Donaim Jul 30, 2024
9f874da
Add missing UnknownNucleotide CFEIntact defect code
Donaim Jul 30, 2024
92c8d39
Add missing SequenceDivergence CFEIntact defect code
Donaim Jul 30, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .github/workflows/python-app.yml
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ jobs:
echo /opt/minimap2-2.17_x64-linux >> $GITHUB_PATH
- name: Install dependencies
run: |
sudo apt install -qq mafft ncbi-blast+
python -m pip install --upgrade pip pipenv
pipenv install --dev
- name: Test with pytest
Expand Down
3 changes: 2 additions & 1 deletion Pipfile
Original file line number Diff line number Diff line change
Expand Up @@ -7,9 +7,10 @@ name = "pypi"
gotoh = {subdirectory = "micall/alignment", ref = "v7.7.0", git = "https://github.com/cfe-lab/MiCall.git"}
numpy = "==1.25.1"
python-levenshtein = "==0.12.0"
pandas = "==2.0.2"
pandas = "==2.2.2"
requests = "==2.31.0"
pyyaml = "*"
cfeintact = {ref = "v1.23.0", git = "https://github.com/cfe-lab/CFEIntact"}

[dev-packages]
pytest = "*"
Expand Down
513 changes: 310 additions & 203 deletions Pipfile.lock

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
1. minimap2 (https://github.com/lh3/minimap2) (must be available via commandline)
2. blast tools (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
3. R and RSCRIPT (https://www.r-project.org/)
donkirkby marked this conversation as resolved.
Show resolved Hide resolved
4. mafft (https://mafft.cbrc.jp/alignment/software/)

### Singularity builds
* Build all singularity images inside of the `simages` folder
Expand Down
15 changes: 6 additions & 9 deletions Singularity
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ Bootstrap: docker
From: ubuntu:22.04

%help
Search proviral consensus sequences for primers, then use HIVSeqinR to
Search proviral consensus sequences for primers, then use HIVIntact to
decide if the genomes are complete.

This Singularity container can be run on Kive: http://cfe-lab.github.io/Kive
Expand All @@ -13,7 +13,7 @@ From: ubuntu:22.04
MAINTAINER BC CfE in HIV/AIDS https://github.com/cfe-lab/
KIVE_INPUTS sample_info_csv contigs_csv conseqs_csv cascade_csv
KIVE_OUTPUTS outcome_summary_csv conseqs_primers_csv contigs_primers_csv \
table_precursor_csv proviral_landscape_csv hivseqinr_results_tar
table_precursor_csv proviral_landscape_csv detailed_results_tar
KIVE_THREADS 1
KIVE_MEMORY 6000

Expand All @@ -28,6 +28,9 @@ From: ubuntu:22.04
fontconfig libbz2-dev liblzma-dev libssl-dev \
libffi-dev libsqlite3-dev

echo ===== Installing MAFFT ===== >/dev/null
apt-get install -y mafft

echo ===== Installing Python ===== >/dev/null
apt-get install -y python3 python3-pip

Expand All @@ -41,17 +44,11 @@ From: ubuntu:22.04
echo ===== Installing minimap2 ===== >/dev/null
apt-get install -y minimap2

echo ===== Installing hivseqinr ===== >/dev/null
apt-get install -y libz-dev libcurl4-openssl-dev libxml2-dev
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y r-base
Rscript /opt/primer_finder/gene_splicer/configure_r.sh
python3 -m gene_splicer.hivseqinr /opt/hivseqinr

# Clean up
apt-get remove -y wget git build-essential

%environment
export LANG=en_US.UTF-8

%runscript
gene_splicer_sample --hivseqinr /opt/hivseqinr "$@"
gene_splicer_sample --hivintact "$@"
222 changes: 222 additions & 0 deletions gene_splicer/landscapes.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,222 @@
import csv
import logging
import os
import re
import typing
from typing import TextIO, Mapping, Dict, Set, List, Iterable, Tuple
import argparse
import sys

import json
import shutil
import subprocess as sp
import glob
from pathlib import Path
from csv import DictWriter, DictReader
from itertools import groupby
from operator import itemgetter

from gene_splicer.utils import (
iterate_hivintact_verdicts_1,
iterate_hivseqinr_verdicts_1,
LEFT_PRIMER_END, RIGHT_PRIMER_START,
)


logger = logging.getLogger(__name__)


def generate_proviral_landscape_csv_1_cont(blastn_reader: csv.DictReader,
landscape_writer: csv.DictWriter,
verdicts: Mapping[str, str],
) -> None:

for row in blastn_reader:
if row['qseqid'] in ['8E5LAV', 'HXB2']:
# skip the positive control rows
continue

ref_start = int(row['sstart'])
ref_end = int(row['send'])
if ref_end <= LEFT_PRIMER_END or ref_start >= RIGHT_PRIMER_START:
# skip unspecific matches of LTR at start and end
continue

qseqid = row['qseqid']
try:
[run_name, sample_name, _, _] = qseqid.split('::')
except ValueError:
[run_name, sample_name] = [None, qseqid]

is_inverted = ''
if ref_end < ref_start:
# automatically recognize inverted regions
new_end = ref_start
ref_start = ref_end
ref_end = new_end
is_inverted = 'yes'

verdict = verdicts[qseqid]
is_defective = verdict != 'Intact'
landscape_entry = {'ref_start': ref_start,
'ref_end': ref_end,
'samp_name': sample_name,
'run_name': run_name,
'is_inverted': is_inverted,
'is_defective': is_defective,
'defect': verdict,
}

landscape_writer.writerow(landscape_entry)


def get_hivintact_verdicts_1_map(details_dir: Path) -> Mapping[str, str]:
ret: Dict[str, str] = {}

for [qseqid, verdict] in iterate_hivintact_verdicts_1(details_dir):
ret[qseqid] = verdict

return ret


def get_hivseqinr_verdicts_1_map(details_dir: Path) -> Mapping[str, str]:
ret: Dict[str, str] = {}

for [qseqid, verdict] in iterate_hivseqinr_verdicts_1(details_dir):
ret[qseqid] = verdict

return ret


def generate_proviral_landscape_csv_1(landscape_writer: csv.DictWriter,
details_dir: Path,
) -> None:
is_hivintact = (details_dir / "holistic.csv").exists()
if is_hivintact:
verdicts = get_hivintact_verdicts_1_map(details_dir)
blastn_path = details_dir / "blast.csv"
else:
verdicts = get_hivseqinr_verdicts_1_map(details_dir)
blastn_path = details_dir / "Results_Intermediate" / "Output_Blastn_HXB2MEGA28_tabdelim.txt"

with blastn_path.open() as blastn_file:
if is_hivintact:
blastn_reader = DictReader(blastn_file)
else:
blastn_columns = ['qseqid',
'qlen',
'sseqid',
'sgi',
'slen',
'qstart',
'qend',
'sstart',
'send',
'evalue',
'bitscore',
'length',
'pident',
'nident',
'btop',
'stitle',
'sstrand']
blastn_reader = DictReader(blastn_file, fieldnames=blastn_columns, delimiter='\t')

return generate_proviral_landscape_csv_1_cont(
blastn_reader,
landscape_writer,
verdicts,
)


def generate_proviral_landscape_csv(outpath: Path, is_hivintact: bool):
proviral_landscape_csv = os.path.join(outpath, 'proviral_landscape.csv')

if is_hivintact:
subpath = 'hivintact*'
else:
subpath = 'hivseqinr*'

landscape_columns = ['samp_name', 'run_name', 'ref_start', 'ref_end', 'defect', 'is_inverted', 'is_defective']
with open(proviral_landscape_csv, 'w') as landscape_file:
landscape_writer = csv.DictWriter(landscape_file, fieldnames=landscape_columns)
landscape_writer.writeheader()

for details_dir in outpath.glob(subpath):
generate_proviral_landscape_csv_1(landscape_writer, details_dir)


class UserError(RuntimeError):
def __init__(self, fmt: str, *fmt_args: object):
self.fmt = fmt
self.fmt_args = fmt_args
self.code = 1


def dir_path(string: str) -> Path:
if os.path.exists(string) and os.path.isdir(string):
return Path(string)
else:
raise UserError("Path %r does not exist or is not a directory.", string)


def cli_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(description="Generate Proviral Landscape CSV.")

parser.add_argument("--details_dir", type=dir_path, required=True,
help="Directory containing details files for verdicts.")

parser.add_argument("--output", type=argparse.FileType("wt"), required=True,
help="Output CSV file for proviral landscape.")

verbosity_group = parser.add_mutually_exclusive_group()
verbosity_group.add_argument('--verbose', action='store_true',
help='Increase output verbosity.')
verbosity_group.add_argument('--no-verbose', action='store_true',
help='Normal output verbosity.', default=True)
verbosity_group.add_argument('--debug', action='store_true',
help='Maximum output verbosity.')
verbosity_group.add_argument('--quiet', action='store_true',
help='Minimize output verbosity.')

return parser


def main(argv: list) -> int:
parser = cli_parser()
args = parser.parse_args(argv)
if args.quiet:
logger.setLevel(logging.ERROR)
elif args.verbose:
logger.setLevel(logging.INFO)
elif args.debug:
logger.setLevel(logging.DEBUG)
else:
logger.setLevel(logging.WARN)

logger.debug("Start.")

fieldnames = ['ref_start', 'ref_end', 'samp_name', 'run_name', 'is_inverted', 'is_defective', 'defect']

landscape_writer = csv.DictWriter(args.output, fieldnames=fieldnames)
landscape_writer.writeheader()
generate_proviral_landscape_csv_1(landscape_writer, args.details_dir)

logger.debug("Done.")
return 0


if __name__ == '__main__':
try:
rc = main(sys.argv[1:])
except BrokenPipeError:
logger.debug("Broken pipe.")
rc = 0
except KeyboardInterrupt:
logger.debug("Interrupted.")
rc = 1
except UserError as e:
logger.fatal(e.fmt, *e.fmt_args)
rc = e.code

sys.exit(rc)
Loading
Loading