Low truvari benchmark scores #3

Open
riyasj327 opened this issue May 16, 2024 · 5 comments
Comments

@riyasj327

riyasj327 commented May 16, 2024

Hi,

I am trying to run truvari on the hapdiff unphased variants VCF (produced by the haplotype resolved HG002 assemblies) against the HG002 benchmarking VCF. I have given the commands used and the links to the public datasets below:

Hapdiff command:
singularity exec --bind $DD_DIR hapdiff_0.9.sif hapdiff.py --reference $DD_DIR/chm13_v2.fa --pat $DD_DIR/hg002v1.0.1.pat.fasta.gz --mat $DD_DIR/hg002v1.0.1.mat.fasta.gz --out-dir $DD_DIR/hapdiff -t 20

Links to the pat and mat assemblies:
hg002v1.0.1.pat.fasta.gz - https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.1.pat.fasta.gz
hg002v1.0.1.mat.fasta.gz - https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/HG002/assemblies/hg002v1.0.1.mat.fasta.gz

Truvari command:
truvari bench -b CHM13v2.0_HG2-T2TQ100-V1.0.vcf.gz -c /projects/rsaju_prj/LongReadAssembly-test/hapdiff/hapdiff/hapdiff_unphased.vcf.gz -o output/

Links to the base dataset:
base dataset - https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.015-20240215/CHM13v2.0_HG2-T2TQ100-V1.0.vcf.gz
comparison dataset - produced by the hapdiff using the command above

Unfortunately, the precision, recall and F1 scores are low (~0.5), when I expected them to be around 0.9. I used the latest HG002 benchmark and the good-quality haplotype-resolved HG002 assemblies available. Please find the summary.json produced by truvari attached to this issue.
summary.json

Any idea what is going on and why the scores are so low? Any insights on this would be really helpful!

Thanks,
Riya

@riyasj327 changed the title from "Low" to "Low truvari benchmark scores" on May 16, 2024
@mikolmogorov
Collaborator

Hello,

Sorry for the late response! I don't see anything wrong with your command lines, so it is hard to tell why the scores are lower. I would look into the FP / FN calls output by truvari and see if any patterns stand out. You can also look into the hapdup assembly alignments in these regions along with the hapdiff VCF calls.
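
One way to look for patterns in the FP / FN calls is to tally them by SV type and size class. The sketch below is only an illustration: it assumes the calls carry `SVTYPE` and `SVLEN` in the INFO column (falling back to the REF/ALT length difference otherwise), and the size bins and the example file name `output/fn.vcf.gz` are assumptions, not something confirmed in this thread.

```python
import gzip
from collections import Counter

# Illustrative size classes (upper bound exclusive); these cutoffs are an assumption.
SIZE_BINS = [(50, "<50"), (300, "50-299"), (1000, "300-999"), (10000, "1000-9999")]


def open_text(path):
    """Open a plain or gzip-compressed VCF as text."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def info_to_dict(info):
    """Split a VCF INFO string into a key -> value dict (flags map to '')."""
    out = {}
    for field in info.split(";"):
        key, _, value = field.partition("=")
        out[key] = value
    return out


def size_bin(length):
    """Return the label of the size class that contains `length`."""
    for upper, label in SIZE_BINS:
        if length < upper:
            return label
    return ">=10000"


def tally_calls(lines):
    """Count VCF records by (SVTYPE, size class)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        cols = line.rstrip("\n").split("\t")
        ref, alt, info = cols[3], cols[4], info_to_dict(cols[7])
        svtype = info.get("SVTYPE", "NA")
        try:
            length = abs(int(info["SVLEN"]))
        except (KeyError, ValueError):
            # No usable SVLEN: fall back to the allele length difference.
            length = abs(len(alt) - len(ref))
        counts[(svtype, size_bin(length))] += 1
    return counts


# Example use (hypothetical path):
# with open_text("output/fn.vcf.gz") as handle:
#     for key, n in tally_calls(handle).most_common():
#         print(key, n)
```

Running the same tally on the FP and FN files and comparing them side by side can show whether the misses cluster in one type or size range, which would point toward representation differences rather than missing calls.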

@oneillkza

Thanks @fenderglass !

Yep when we looked into the FP/FN calls, it looked like a lot of those were due to tandem repeat arrays just being represented slightly differently. (This may in part be due to actual errors in the gold standard, since it's a newer one based off the HPRC T2T assemblies, which hasn't been as well curated as the old hg19 gold standard).

Anyway, the solution seemed to be to slightly relax some of the parameters in Truvari. @riyasj327, could you please post the full Truvari call you ended up using?

(The numbers we are now getting show F-measures around 0.8 for sniffles2, hapdiff and pav, with slight differences in FP/FN tradeoff between the methods).

@mikolmogorov
Collaborator

Good to know - 0.8 seems closer to the expected number, if you are including SDs / pericentromeric regions.

@riyasj327
Author

Thanks @fenderglass and @oneillkza!

Here are the final truvari commands we are using now:

truvari bench -b CHM13v2.0_HG2-T2TQ100-V1.0.vcf.gz -c hapdiff_unphased.vcf.gz -o truvari --pctseq 0.5 --pctsize 0.5 -r 2000 --chunksize 2000 --passonly --includebed CHM13v2.0_HG2-T2TQ100-V1.0_stvar.benchmark.bed

truvari refine --regions truvari/candidate.refine.bed --reference chm13_v2.fa --recount --use-region-coords truvari/
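
To illustrate why loosening `--pctsize` can rescue tandem-repeat calls that differ slightly in representation, here is a simplified model of a size-similarity check: the shorter length divided by the longer one, compared against a threshold. This is a sketch of the general idea only, not Truvari's actual implementation; the function names and the stricter comparison threshold of 0.7 are hypothetical.

```python
def size_similarity(len_a, len_b):
    """Ratio of the shorter SV length to the longer one, in [0, 1]."""
    len_a, len_b = abs(len_a), abs(len_b)
    if max(len_a, len_b) == 0:
        return 1.0  # two zero-length events are treated as identical
    return min(len_a, len_b) / max(len_a, len_b)


def passes_size_filter(len_a, len_b, pctsize):
    """True if the two lengths are similar enough under the threshold."""
    return size_similarity(len_a, len_b) >= pctsize


# A 300 bp repeat expansion in the benchmark vs. a 180 bp call in the comparison set:
base_len, comp_len = 300, 180
print(size_similarity(base_len, comp_len))
print(passes_size_filter(base_len, comp_len, 0.5))  # relaxed threshold, as in the command above
print(passes_size_filter(base_len, comp_len, 0.7))  # stricter, hypothetical threshold
```

Under this model the pair has a similarity of 0.6, so it would match at a threshold of 0.5 but be counted as one FN plus one FP at 0.7.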

The new Truvari scores:
"TP-base": 18421,
"TP-comp": 18421,
"FP": 726,
"FN": 7969,
"precision": 0.9620828328197629,
"recall": 0.6980295566502464,
"f1": 0.8090563717416607,
"base cnt": 26390,
"comp cnt": 19147

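As a sanity check, the summary values above can be recomputed from the raw counts. This minimal sketch assumes the usual definitions (precision = TP-comp / (TP-comp + FP), recall = TP-base / (TP-base + FN), F1 = harmonic mean of the two); the function name `bench_metrics` is made up for illustration.

```python
def bench_metrics(tp_base, tp_comp, fp, fn):
    """Return (precision, recall, f1) from benchmark counts."""
    comp_cnt = tp_comp + fp  # all calls in the comparison set
    base_cnt = tp_base + fn  # all calls in the baseline set
    precision = tp_comp / comp_cnt if comp_cnt else 0.0
    recall = tp_base / base_cnt if base_cnt else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = bench_metrics(tp_base=18421, tp_comp=18421, fp=726, fn=7969)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The counts are consistent with the reported totals (18421 + 726 = 19147 comparison calls, 18421 + 7969 = 26390 baseline calls), and they show that the remaining gap is almost entirely on the recall side: precision is about 0.96 while roughly 30% of baseline calls are still unmatched.
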
@mikolmogorov
Collaborator

Thanks for the info!
