demo
Yunxi Liu committed Nov 20, 2023
1 parent 01f556e commit ef87fd0
Showing 19 changed files with 655 additions and 4 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Treangen Lab
Copyright (c) 2021 treangenlab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
27 changes: 25 additions & 2 deletions README.md
@@ -8,13 +8,27 @@ Wastewater monitoring is an important tool that can complement clinical testing
2. enables fast queries of mutation combinations against all publicly available SARS-CoV-2 genomes
3. improves understanding of SARS-CoV-2 intrahost evolution and transmission events at a large scale

It is highly recommended that wastewater samples be processed with [QuaID](https://gitlab.com/treangenlab/quaid), a novel bioinformatics tool for VoC detection based on quasi-unique mutations that is being developed by [Treangenlab](https://gitlab.com/treangenlab).
It is highly recommended that wastewater samples be processed with [QuaID](https://gitlab.com/treangenlab/quaid), a novel bioinformatics tool for VoC detection based on quasi-unique mutations that is being developed by [Treangenlab](https://gitlab.com/treangenlab). The current version of Crykey is v1.0.

## System requirements

Crykey is supported on Linux. The user should provide enough RAM to load the Crykey classification database. A standard database based on publicly available SARS-CoV-2 genomes through January 10, 2023 takes more than 17 GB. This tool (version 1.0) has been tested on Linux (Ubuntu 18.04.5 LTS). No non-standard hardware is required.

## Installation

To install Crykey, simply download the GitHub repo. It is highly recommended that the dependencies be installed in a clean conda environment. To create a new conda environment, please follow the [conda user guide](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html). After installing the required dependencies and downloading the pre-built database, the software is ready to use. Installation should typically take no longer than 30 minutes.

```
git clone [email protected]:treangenlab/crykey.git
cd crykey
./install.sh
```

## 3rd party software requirements

Below is the list of 3rd party software requirements. All required software can be installed via Miniconda after adding `bioconda` to the list of channels. The version specified in parentheses is the version currently tested.

* vdb (2.7)
* vdb (2.7) (database building only)
* samtools (1.7)
* SnpEff

@@ -169,6 +183,15 @@ Based on such information, you could determine which of the co-occurring SNVs qu
* have a sufficient number of supporting reads,
* the occurrence in the database should be low. In other words, the cryptic lineage should be novel or at least rare in the database.

## Demo Run

The following is a demo run with the test data we provide. The expected output is stored in `demo/test_output`. The test run should take less than 10 minutes to finish. Most of the time is spent loading the database, so running multiple samples as a batch significantly increases the efficiency of the tool.
```
python crykey_wastewater.py -i demo/test_metadata.tsv -r demo/SARS-CoV-2-reference.fasta -d [PATH_TO_CRYKEY_DATABASE] -o [PATH_TO_OUTPUT_DIRECTORY]
python crykey_query.py -d [PATH_TO_CRYKEY_DATABASE] -o [PATH_TO_OUTPUT_DIRECTORY]
```


## Manuscript

You can find the manuscript describing QuaID and corresponding results at [doi.org/10.1101/2023.06.16.23291524](https://www.medrxiv.org/content/10.1101/2023.06.16.23291524v1).
6 changes: 5 additions & 1 deletion crykey_wastewater.py
@@ -1,3 +1,6 @@
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pickle
import glob
import re
@@ -251,7 +254,8 @@ def search_valid_mutation_combinations(date, site, sorted_bam_f, vcf_lofreq_f, c
'Total DP': total_dp,
'Combined Freq': comb_freq}

cryptic_df = cryptic_df.append(record_dict, ignore_index=True)
#cryptic_df = cryptic_df.append(record_dict, ignore_index=True)
cryptic_df = pd.concat([cryptic_df, pd.DataFrame([record_dict])], ignore_index=True)
else:
if (not os.path.exists(sorted_bam_f)) and os.path.exists(vcf_lofreq_f):
print(date, site, 'Missing BAM File(s).')
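
The hunk above swaps `DataFrame.append` for `pd.concat`; `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. The sketch below is a minimal, standalone illustration of that replacement and is not taken from `crykey_wastewater.py`. The two records reuse values from `demo/test_output/cryptic_dataframe/1012023_Test.csv`, and `cryptic_df_once` is a hypothetical name for an alternative that builds the frame in one step.

```python
import pandas as pd

# Two records with the same keys the script writes into record_dict
# (values copied from the demo output CSV in this commit).
records = [
    {'Date': '1012023', 'Site': 'Test',
     'Nt Mutations': 'C6402T;G6456A',
     'AA Mutations': 'ORF1a:P2046L;ORF1a:C2064Y',
     'Support DP': 7, 'Total DP': 60, 'Combined Freq': 7 / 60},
    {'Date': '1012023', 'Site': 'Test',
     'Nt Mutations': 'C27131T;T27176C',
     'AA Mutations': 'M:N203N;M:A218A',
     'Support DP': 13, 'Total DP': 289, 'Combined Freq': 13 / 289},
]

# Pattern used in the commit: wrap the new record in a one-row frame
# and concatenate it onto the existing frame.
cryptic_df = pd.DataFrame([records[0]])
record_dict = records[1]
cryptic_df = pd.concat([cryptic_df, pd.DataFrame([record_dict])],
                       ignore_index=True)

# Alternative (an assumption, not what the script does): collect the
# dicts in a list and build the frame once, which avoids repeated copies.
cryptic_df_once = pd.DataFrame(records)

print(cryptic_df[['Nt Mutations', 'Support DP', 'Total DP']])
```

Repeated concatenation copies the frame on every call, so collecting records in a list may be preferable when many rows are expected; for a handful of rows per sample the difference is unlikely to matter.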
430 changes: 430 additions & 0 deletions demo/SARS-CoV-2-reference.fasta

Large diffs are not rendered by default.

Binary file added demo/test_data.bai
Binary file not shown.
Binary file added demo/test_data.bam
Binary file not shown.
Binary file added demo/test_data.bam.bai
Binary file not shown.
51 changes: 51 additions & 0 deletions demo/test_data.vcf
@@ -0,0 +1,51 @@
##fileformat=VCFv4.0
##fileDate=20210806
##source=lofreq call -f SARS-CoV-2-reference.fasta --call-indels -o Variant-calling-LoFreq/HHD0802/76-1.clean.vcf Variant-calling-LoFreq/HHD0802/76-1.clean.indelqual.bam
##reference=SARS-CoV-2-reference.fasta
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">
##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=SB,Number=1,Type=Integer,Description="Phred-scaled strand bias at this position">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Counts for ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=CONSVAR,Number=0,Type=Flag,Description="Indicates that the variant is a consensus variant (as opposed to a low frequency variant).">
##INFO=<ID=HRUN,Number=1,Type=Integer,Description="Homopolymer length to the right of report indel position">
##FILTER=<ID=min_dp_10,Description="Minimum Coverage 10">
##FILTER=<ID=sb_fdr,Description="Strand-Bias Multiple Testing Correction: fdr corr. pvalue > 0.001000">
##FILTER=<ID=min_snvqual_59,Description="Minimum SNV Quality (Phred) 59">
##FILTER=<ID=min_indelqual_38,Description="Minimum Indel Quality (Phred) 38">
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512.2 210 . G T 1607 PASS DP=43;AF=1.000000;SB=0;DP4=0,0,17,26
NC_045512.2 241 . C T 1813 PASS DP=50;AF=0.980000;SB=0;DP4=0,0,23,26
NC_045512.2 320 . C T 114 PASS DP=55;AF=0.163636;SB=20;DP4=21,25,0,9
NC_045512.2 727 . T A 288 PASS DP=27;AF=0.925926;SB=0;DP4=1,0,25,0
NC_045512.2 1133 . A T 61 PASS DP=184;AF=0.027174;SB=0;DP4=95,84,3,2
NC_045512.2 5669 . T C 1172 PASS DP=104;AF=0.423077;SB=0;DP4=37,23,27,17
NC_045512.2 5744 . T C 65 PASS DP=96;AF=0.041667;SB=4;DP4=50,42,1,3
NC_045512.2 5777 . C T 168 PASS DP=96;AF=0.093750;SB=0;DP4=51,36,5,4
NC_045512.2 6402 . C T 5573 PASS DP=157;AF=0.993631;SB=0;DP4=0,0,70,86
NC_045512.2 6456 . G A 231 PASS DP=120;AF=0.108333;SB=1;DP4=47,60,5,8
NC_045512.2 6478 . T C 1041 PASS DP=96;AF=0.406250;SB=1;DP4=26,31,16,23
NC_045512.2 10029 . C T 507 PASS DP=15;AF=1.000000;SB=0;DP4=0,0,7,8
NC_045512.2 12926 . A AC 41 PASS DP=266;AF=0.007519;SB=0;DP4=138,127,1,1;INDEL;HRUN=3
NC_045512.2 16466 . C T 661 PASS DP=19;AF=1.000000;SB=0;DP4=0,0,7,12
NC_045512.2 17122 . G T 7008 PASS DP=1405;AF=0.226335;SB=1;DP4=681,406,203,115
NC_045512.2 17135 . C T 1695 PASS DP=1535;AF=0.071661;SB=0;DP4=870,554,68,42
NC_045512.2 17285 . C T 108 PASS DP=1986;AF=0.008560;SB=8;DP4=1107,861,13,4
NC_045512.2 17518 . CT C 48 PASS DP=82;AF=0.024390;SB=0;DP4=45,35,1,1;INDEL;HRUN=2
NC_045512.2 18636 . G A 89 PASS DP=223;AF=0.035874;SB=12;DP4=138,77,8,0
NC_045512.2 23403 . A G 429 PASS DP=12;AF=1.000000;SB=0;DP4=0,0,5,7
NC_045512.2 24863 . C T 14056 PASS DP=392;AF=0.997449;SB=0;DP4=0,0,182,209
NC_045512.2 25174 . A C 94 PASS DP=139;AF=0.064748;SB=1;DP4=59,71,5,4
NC_045512.2 25469 . C T 6969 PASS DP=193;AF=0.994819;SB=0;DP4=0,1,88,104
NC_045512.2 26767 . T C 643 PASS DP=20;AF=1.000000;SB=0;DP4=0,0,8,12
NC_045512.2 27131 . C T 22273 PASS DP=616;AF=0.995130;SB=0;DP4=1,1,261,352
NC_045512.2 27176 . T C 105 PASS DP=484;AF=0.026860;SB=26;DP4=185,286,0,13
NC_045512.2 28247 . AGATTTC A 12960 PASS DP=402;AF=0.990050;SB=1;DP4=8,15,163,235;INDEL;HRUN=1
NC_045512.2 28270 . TA T 44158 PASS DP=1150;AF=0.988696;SB=2;DP4=7,7,681,456;INDEL;HRUN=4
NC_045512.2 28372 . TG T 40 PASS DP=1443;AF=0.002079;SB=0;DP4=722,905,1,2;INDEL;HRUN=4
NC_045512.2 28432 . C T 2435 PASS DP=949;AF=0.154900;SB=9;DP4=231,569,52,95
NC_045512.2 29029 . T C 112 PASS DP=143;AF=0.055944;SB=12;DP4=84,51,8,0
NC_045512.2 29039 . A T 182 PASS DP=157;AF=0.070064;SB=20;DP4=90,56,11,0
NC_045512.2 29049 . G A 166 PASS DP=171;AF=0.070175;SB=2;DP4=98,61,9,3
NC_045512.2 29711 . G T 113 PASS DP=19;AF=0.210526;SB=0;DP4=8,7,2,2
NC_045512.2 29742 . G T 522 PASS DP=17;AF=0.941176;SB=0;DP4=0,1,6,10
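
As a reading aid for the VCF above, the following sketch splits a LoFreq `INFO` column into its fields and recomputes the alternate-allele fraction from `DP4` and `DP`. The helper names `parse_info` and `alt_fraction` are hypothetical and not part of Crykey. The observation that `AF` matches (alt-forward + alt-reverse) / `DP` holds for the records shown here; it is not claimed as a general guarantee of LoFreq output.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for entry in info.split(';'):
        key, sep, value = entry.partition('=')
        fields[key] = value if sep else True
    return fields


def alt_fraction(info):
    """Alternate-supporting reads (last two DP4 counts) divided by DP."""
    fields = parse_info(info)
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in fields['DP4'].split(','))
    return (alt_fwd + alt_rev) / int(fields['DP'])


# INFO columns copied from two records of demo/test_data.vcf.
info_320 = 'DP=55;AF=0.163636;SB=20;DP4=21,25,0,9'
info_12926 = 'DP=266;AF=0.007519;SB=0;DP4=138,127,1,1;INDEL;HRUN=3'

print(parse_info(info_320)['AF'], round(alt_fraction(info_320), 6))
print(parse_info(info_12926).get('INDEL', False))
```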
2 changes: 2 additions & 0 deletions demo/test_metadata.tsv
@@ -0,0 +1,2 @@
Sample_Collection_Date WWTP Sorted_BAM VCF
01012023 Test demo/test_data.bam demo/test_data.vcf
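
The metadata file has one row per sample with four columns. The sketch below reads it with the standard library, assuming the columns are tab-separated (the diff view renders them as spaces). It also illustrates one possible reason the expected outputs in this commit are named `1012023_Test` rather than `01012023_Test`: if the date column is parsed as an integer, the leading zero is lost. The parsing code is not shown in this diff, so that explanation is an assumption.

```python
import csv
import io

# Tab-separated copy of demo/test_metadata.tsv (tabs assumed).
metadata_text = (
    "Sample_Collection_Date\tWWTP\tSorted_BAM\tVCF\n"
    "01012023\tTest\tdemo/test_data.bam\tdemo/test_data.vcf\n"
)

rows = list(csv.DictReader(io.StringIO(metadata_text), delimiter='\t'))
sample = rows[0]

# Kept as text, the date retains its leading zero.
date_as_text = sample['Sample_Collection_Date']

# Parsed as an integer and converted back, the leading zero is dropped,
# which matches the file names under demo/test_output.
date_as_int = str(int(sample['Sample_Collection_Date']))

print(date_as_text, date_as_int, sample['WWTP'], sample['Sorted_BAM'], sample['VCF'])
```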
Binary file not shown.
Binary file not shown.
4 changes: 4 additions & 0 deletions demo/test_output/cryptic_dataframe/1012023_Test.csv
@@ -0,0 +1,4 @@
Date,Site,Nt Mutations,AA Mutations,Support DP,Total DP,Combined Freq
1012023,Test,C6402T;G6456A,ORF1a:P2046L;ORF1a:C2064Y,7,60,0.11666666666666667
1012023,Test,C27131T;T27176C,M:N203N;M:A218A,13,289,0.04498269896193772
1012023,Test,A29039T;G29049A,N:K256*;N:R259Q,7,97,0.07216494845360824
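
In the expected output above, `Combined Freq` equals `Support DP` divided by `Total DP` in each row (for example, 7 / 60 ≈ 0.1167). The following sketch checks that relationship and splits the co-occurring SNVs of each row; `load_rows` and `check_combined_freq` are hypothetical helper names, not Crykey functions.

```python
import csv
import io
import math

# Contents of demo/test_output/cryptic_dataframe/1012023_Test.csv.
expected_csv = """Date,Site,Nt Mutations,AA Mutations,Support DP,Total DP,Combined Freq
1012023,Test,C6402T;G6456A,ORF1a:P2046L;ORF1a:C2064Y,7,60,0.11666666666666667
1012023,Test,C27131T;T27176C,M:N203N;M:A218A,13,289,0.04498269896193772
1012023,Test,A29039T;G29049A,N:K256*;N:R259Q,7,97,0.07216494845360824
"""


def load_rows(text):
    """Parse the CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def check_combined_freq(row):
    """True if Combined Freq matches Support DP / Total DP for this row."""
    ratio = int(row['Support DP']) / int(row['Total DP'])
    return math.isclose(ratio, float(row['Combined Freq']), rel_tol=1e-9)


rows = load_rows(expected_csv)
for row in rows:
    snvs = row['Nt Mutations'].split(';')  # co-occurring SNVs on the same reads
    print(snvs, check_combined_freq(row))
```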
27 changes: 27 additions & 0 deletions demo/test_output/cryptic_reads/read_ids_1012023_Test.txt
@@ -0,0 +1,27 @@
M04488:8:000000000-G8LL4:1:2103:13396:8558
M04488:8:000000000-G8LL4:1:2102:9412:26067
M04488:8:000000000-G8LL4:1:1102:12015:16214
M04488:8:000000000-G8LL4:1:2104:20712:21847
M04488:8:000000000-G8LL4:1:1103:22919:9482
M04488:8:000000000-G8LL4:1:1104:12896:21967
M04488:8:000000000-G8LL4:1:2104:15016:25238
M04488:8:000000000-G8LL4:1:1102:14601:5730
M04488:8:000000000-G8LL4:1:1102:18604:17868
M04488:8:000000000-G8LL4:1:1102:15226:19521
M04488:8:000000000-G8LL4:1:1102:14848:6489
M04488:8:000000000-G8LL4:1:1102:24859:19429
M04488:8:000000000-G8LL4:1:1102:11148:6398
M04488:8:000000000-G8LL4:1:1102:13910:12657
M04488:8:000000000-G8LL4:1:1102:26239:14246
M04488:8:000000000-G8LL4:1:1102:28000:17740
M04488:8:000000000-G8LL4:1:1102:8701:16578
M04488:8:000000000-G8LL4:1:1102:16810:5541
M04488:8:000000000-G8LL4:1:1102:9024:3703
M04488:8:000000000-G8LL4:1:1102:17949:24318
M04488:8:000000000-G8LL4:1:1103:16919:12562
M04488:8:000000000-G8LL4:1:1104:11272:15582
M04488:8:000000000-G8LL4:1:2103:26012:23845
M04488:8:000000000-G8LL4:1:1101:14763:11854
M04488:8:000000000-G8LL4:1:1103:5630:13033
M04488:8:000000000-G8LL4:1:2104:21013:17941
M04488:8:000000000-G8LL4:1:1103:27288:9477
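
The read IDs above appear to follow the usual Illumina naming scheme of seven colon-separated fields (instrument, run number, flowcell ID, lane, tile, x, y). Assuming that convention, this sketch splits an ID into named fields and counts reads per tile; `parse_read_id` is a hypothetical helper, and the three IDs are copied from the list above.

```python
from collections import Counter


def parse_read_id(read_id):
    """Split an Illumina-style read ID into its seven named fields."""
    instrument, run, flowcell, lane, tile, x, y = read_id.split(':')
    return {
        'instrument': instrument,
        'run': int(run),
        'flowcell': flowcell,
        'lane': int(lane),
        'tile': int(tile),
        'x': int(x),
        'y': int(y),
    }


read_ids = [
    'M04488:8:000000000-G8LL4:1:2103:13396:8558',
    'M04488:8:000000000-G8LL4:1:2102:9412:26067',
    'M04488:8:000000000-G8LL4:1:2103:26012:23845',
]

# Count how many of the listed reads come from each tile.
reads_per_tile = Counter(parse_read_id(r)['tile'] for r in read_ids)
print(reads_per_tile)
```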