Add SVStratify and GroupedSVCluster tools #8990
Conversation
This is a combination of 2 commits:
- Implement SVStratify
- Interchrom events
- Start unit tests
- Finish unit tests
- Multiple contexts per stratum
- Integration tests; fix empty context bug
- Test duplicate context name
- Handle CPX/CTX, add some tests
- Compiler error
- Prioritize SR and PE evidence types for representative breakpoint strategy
- Add SVStratifiedCluster
- Integration tests for SVStratifiedClustering
- Unused line
- Spooled sorting in cluster tools
- Start fixing JointGCNVSegmentation
- Fix JointGCNVSegmentation integration test
- Rename to GroupedSVCluster
- Fix sorting bug
- Comment
- Use RD_CN for CNV sample overlap; improve testing
- Documentation
- Add comment about requiresOverlapAndProximity
- Allow empty EVIDENCE list
- Add STRAT INFO field
- Representative breakpoint by variant quality
- Handle BOTHSIDES_SUPPORT and HIGH_SR_BACKGROUND in variant collapsing
- Add expected exception
My only big ask is some clarification around "context" because it took me a long time to figure that out. Everything else you can take or leave.
@@ -77,17 +77,17 @@ protected static boolean hasSampleOverlap(final SVCallRecord a, final SVCallReco
        final Set<String> samples = new HashSet<>(SVUtils.hashMapCapacity(genotypesA.size() + genotypesB.size()));
        samples.addAll(genotypesA.getSampleNames());
        samples.addAll(genotypesB.getSampleNames());
        if (samples.isEmpty()) {
            // Empty case considered perfect overlap
For my own edification, why is that?
If it's a sites-only vcf then we don't need to enforce sample overlap.
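A minimal self-contained sketch of that short-circuit, assuming a Jaccard-style carrier overlap (the names and the exact overlap metric are illustrative, not the actual GATK implementation):

```java
import java.util.HashSet;
import java.util.Set;

final class SampleOverlapSketch {
    // Sketch only: if neither record has any samples (sites-only VCF), there is
    // nothing to compare, so sample overlap is treated as satisfied.
    static boolean hasSampleOverlap(final Set<String> carriersA,
                                    final Set<String> carriersB,
                                    final double minOverlapFraction) {
        final Set<String> union = new HashSet<>(carriersA);
        union.addAll(carriersB);
        if (union.isEmpty()) {
            return true;  // empty case considered perfect overlap
        }
        final Set<String> intersection = new HashSet<>(carriersA);
        intersection.retainAll(carriersB);
        return (double) intersection.size() / union.size() >= minOverlapFraction;
    }
}
```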
        if (matchedSampleGenotype != null) {
            return VariantContextGetters.getAttributeAsInt(matchedSampleGenotype, GATKSVVCFConstants.EXPECTED_COPY_NUMBER_FORMAT, 0);
        } else {
            return 0;
I understand we have to choose something as the default, but I feel weird defaulting to homozygous deletion. Why not -1?
Fair point. This code path shouldn't be possible so I changed it to throw an error. For the defaults, I made them -1 as a unique "null" value.
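Roughly, the change described reads like the following sketch, which reuses the identifiers from the diff above and assumes the surrounding file's imports; the exception type and message are assumptions, not the actual commit:

```java
// Sketch: fail loudly instead of silently defaulting to 0 (homozygous deletion),
// and use -1 as the "missing value" default when the attribute is absent.
static int getExpectedCopyNumber(final Genotype matchedSampleGenotype) {
    if (matchedSampleGenotype == null) {
        // This path should be unreachable; previously it returned 0
        throw new IllegalStateException("No matching sample genotype found");
    }
    return VariantContextGetters.getAttributeAsInt(
            matchedSampleGenotype, GATKSVVCFConstants.EXPECTED_COPY_NUMBER_FORMAT, -1);
}
```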
    protected GATKPath outputFile;

    /**
     * Expected format is tab-delimited and contains a header with the first column SAMPLE and remaining columns
Do we have a tool that outputs this file? This is not familiar to me.
No, this is a gatk-sv file designed to replace the ped file for storing allosome ploidies. We might make a gatk tool for this in the future though.
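For illustration only, a guess at what such a tab-delimited table might look like (the sample names, contig columns, and ploidy values below are made up; the real gatk-sv file may differ):

```
SAMPLE	chrX	chrY
sample_001	2	0
sample_002	1	1
```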
     * clustering, but any associated annotation fields will be set to null in the output.
     */
    @Argument(
            doc = "Fast mode. Drops hom-ref and no-call genotype fields and emits them as no-calls.",
I wouldn't call the attributes no-calls, I would call them missing
     * Default genotypes are assigned when they cannot be inferred from the inputs, such as when VCFs with different
     * variants and samples are provided.
     */
    @Argument(fullName = DEFAULT_NO_CALL_LONG_NAME,
By default you get 0/0 with no attributes?
Yes. It turns out this is sufficient for our batch clustering module; we just need the genotypes.
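For example (invented VCF records, not from the test data), a sample absent from the input that contributed a given variant would come out with a bare default genotype of 0/0 and no other FORMAT fields:

```
#CHROM  POS    ID    REF  ALT    QUAL  FILTER  INFO                  FORMAT  sampleA  sampleB
chr1    10000  sv_1  N    <DEL>  .     .       SVTYPE=DEL;SVLEN=500  GT      0/1      0/0
```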
        runCommandLine(args, SVStratify.class.getSimpleName());
    }

    @Test(expectedExceptions = {GATKException.class, IllegalArgumentException.class})
I've been burned this way before when another class throws an IllegalArgumentException. I also made a note of where the Utils arg validation is.
Fixed
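The underlying concern is that a broad expected-exception list can let a test pass when some unrelated argument check throws first. A hedged TestNG sketch of the tighter pattern (the test and class names are illustrative):

```java
import org.broadinstitute.hellbender.exceptions.GATKException;
import org.testng.annotations.Test;

public class ExpectedExceptionSketchTest {
    // Expecting only the specific exception type means the test cannot pass by
    // accident when unrelated validation throws IllegalArgumentException first.
    @Test(expectedExceptions = GATKException.class)
    public void testInvalidInputThrows() {
        throw new GATKException("simulated failure from the code under test");
    }
}
```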
    public void testBwaMeltCohortRedundant() {
        final File outputDir = createTempDir("stratify");
        final String inputVcfPath = getToolTestDataDir() + "bwa_melt.chr22.vcf.gz";
        final String configFile = getToolTestDataDir() + "test_config_redundant.tsv";
What happens if the config conflicts with the cmdline args?
It'll get caught in the `Stratum` class and throw an error.
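A minimal sketch of that style of constructor validation (the class name, fields, and message are illustrative, not the actual code):

```java
import java.util.Set;

final class StratumSketch {
    // Sketch: reject a stratification group that references a track name
    // that was not defined via the command-line track arguments.
    StratumSketch(final String name, final Set<String> trackNames, final Set<String> definedTracks) {
        for (final String track : trackNames) {
            if (!definedTracks.contains(track)) {
                throw new IllegalArgumentException(
                        "Stratification group " + name + " references undefined track: " + track);
            }
        }
    }
}
```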
     * @param svType SV type, may be null
     * @param minSize minimum size in bp (inclusive), may be null
     * @param maxSize maximum size in bp (exclusive), may be null
     * @param contextNames reference context names
This confused me for a while because in the engine the ReferenceContext is like the actual FASTA sequence. Do you think this is obvious enough for users who haven't done GATK development?
Agreed, changed context to track
        expectedOutputSuffixes.put("DUP_lt5kb_RM", 0);
        expectedOutputSuffixes.put("INV_gt1kb", 26);
        expectedOutputSuffixes.put("BND_SD", 77);
        expectedOutputSuffixes.put(SVStratify.DEFAULT_STRATUM, 1196);
I'm going to trust that you verified this with an orthogonal python script or somesuch
Yeah I've been testing against the gatk-sv pipeline results, although those evaluations are qualitative. I trust this will alert us if any unintentional changes happen though.
    public GATKPath configFile;

    @Argument(
            doc = "Context intervals file. Can be specified multiple times.",
I would call this something more like "repeat track intervals file". I'm sure you could generalize beyond repeats, but I found this really vague.
now "Track intervals file" if that works?
Implements two new tools and updates some methods for a revamp of the `CombineBatches` cross-batch integration module in gatk-sv.

- `SVStratify` - tool for splitting out a VCF by variant class. Users pass in a configuration table (see tool documentation for an example, and the rough sketch after this list) specifying one or more stratification groups classified by SVTYPE, SVLEN range, and reference context(s). The latter are specified as a set of interval lists using the `--context-name` and `--context-intervals` arguments. All variants are matched with their respective group, which is annotated in the `STRAT` INFO field. Optionally, the output can be split into multiple VCFs by group, which is useful functionality that currently can't be done efficiently with common commands/toolkits.
- `GroupedSVCluster` - a hybrid tool combining functionality from `SVStratify` with `SVCluster` to perform intra-stratum clustering. This tool is critical for fine-tuned clustering of specific variant types within certain reference contexts. For example, small variants in simple repeats tend to have lower breakpoint accuracy and are typically "reclustered" during call set refinement with looser clustering criteria.
- `SVStratificationEngine` - new class for performing stratification.
- Updates to `CanonicalSVCollapser` that should improve breakpoint accuracy, particularly in larger call sets. Raw evidence support and variant quality are now considered when choosing a representative breakpoint for a group of clustered SVs.
- New `FlagFieldLogic` type for customizing how the `BOTHSIDE_PASS` and `HIGH_SR_BACKGROUND` INFO flags are collapsed during clustering.
- `RD_CN` is now used as a backup if `CN` is not available when determining carrier status for sample overlap.
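As a rough illustration of the configuration table described above, the sketch below uses group names borrowed from the integration tests; the column names, track names, and the -1 "null" placeholders are guesses based on this description, not the documented format:

```
NAME          SVTYPE  MIN_SIZE  MAX_SIZE  TRACKS
DUP_lt5kb_RM  DUP     -1        5000      repeatmasker
INV_gt1kb     INV     1000      -1        -1
BND_SD        BND     -1        -1        segdup
```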