README updated

ngs-ai-org · Nov 18, 2024 · 3fbe3f5 · 3fbe3f5
1 parent 2cb706b
commit 3fbe3f5
Show file tree

Hide file tree

Showing 3 changed files with 202 additions and 45 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name    = "FusionR"
-version = "1.0.1"
+version = "1.0.2"
 authors = ["Emanuel Schmid-Siegert <[email protected]>"]
 edition = "2018"
 description   =  """

diff --git a/README.md b/README.md
@@ -1,35 +1,43 @@
 # Splitty - tool suite for fusion detection
 
 This respository is targeted around the detection of fusion events.
-It contains multiple algorithms for fusion detection as well as analysis of genome integration events.
+It contains algorithms for fusion detection as well as tools to cluster and filter the VCF-based events in a BND-pair fashion.
 
-<img src="splitty.jpg" alt="drawing" width="200"/>
+Splitty will derive the position of integration/fusion based on alignment information, in particular the information of supplementary alignments of reads which are split. Therefore it is important to use callers which support such a reporting like BWA-MEM2 for short reads or minimap2 for long reads.
 
-Splitty will derive the position of integration/fusion based on alignment information. These can sometimes be unprecise due to
+Alignments are not solved and can create ambiguity, especially around break-end positions.
 
-- unprecise alignment towards the boards, especially for spliced elements
-- scoring, where placing in both junctions would result in identical scores
-- situations where placing a clipped element is not resulting in a high enough score (again in spliced alignment even worse)
+This is amplified and reflected in
 
-For this purpose the fuzzy-ness exists which provides in bp the number of unprecise nucleotides to be expected.
-This fuzzyiness at that point is only propagated to the position of question (CIPOS might have been originally already set and will be incremented then) not the pair.
+- alignments adjacent to edges, especially for spliced elements
+- scoring, where multiple targets result in identical scores
+- situations where mapping of supplementary element is not resulting in a high enough score 
 
-Occasionally (and in particular for cassette integration analysis) it can happen that a predicated site is beyond the boundaries of a chromosome/contig.
-Meaning it is < 0 or > contigSize. If that happens, the program will assign the last valid position and take the difference to the predicted one as fuzzyness, too.
-
-**Note**: The checks if positions are within contig boundaries is only enforced in the vcf format, not in the tsv format !
+It is subsequently reflected in the fuzzy-ness of the exact reported position and splitty provides infomration (CIPOS) in bp reporting the expected fuziness.
 
+Occasionally (and in particular for integration analysis) it can happen that a predicated site is due to these ambiguities beyond the boundaries of a chromosome/contig. Meaning it is < 0 or > contigSize. If that happens, the program will assign the last valid position and report the difference to the predicted one as fuzzyness, too.
 
 Both, fragment and paired read fusion detection uses annotation to provide additional information of the involved genes (transcripts).
 Importantly, if a annotation file is present the notion for known fusions is `GeneA--GeneB` with `NA` for unknown elements e.g. `GeneA--NA`.
 If no annotation is present, this though replaced by the chromosome names, e.g. `ChrA--ChrB` or `ChrA--ChrA` and never defaulting to `NA`.
 
-### Splitty fragment 
+Splitty uses the annotation of the underlying transcripts to determine which transcript is the most likely to be source of the resulting fragment. If multiple transcripts are overlapping, priority is given to the one in the same sense of the transcript and then the larger proportion of the transcript being represented in the alignment block. If this results in identical "scores", splitty will report both (or multiple) transcripts as possible annotations of the same VCF event.
+
+On the other hand, if one of the targets results in multi-mappers and cant be resolved properly then splitty will report one VCF entry (pair) per target as they represent different genomic coordinates and we emphasize coordinates rather than gene-annotations as fusion reporting.
+E.g. we might encounter an entry `SA:Z:chr14,42077618,+,3121M279334D415S,4,1486; chr14,35031866,-,1809M65484D1727S,60,0;`
+Which described more than one entry and will result potentially into 2 subsequently proposed gene fusions.
+Normally the entry with the highest score is kept but sometimes we do encounter identical scores, in this case both are propagated if specified.
+
+
+## Splitty fragment 
 
 It's main function is to predict and annotate fusion positions based on a contig alignment against a reference genome. There is not necessarily a limit to the length of an aligned contig but it is unlikely to work well on any given short reads.
 
 
 ```bash
+splitty-fragment 
+classifying cDNA/DNA mappings into gene-fusion events, not for paired-end reads
+
 USAGE:
     splitty fragment [FLAGS] [OPTIONS] --bam <FILE>
 
@@ -44,25 +52,23 @@ FLAGS:
 
 OPTIONS:
     -b, --bam <FILE>              long read/cDNA alignment vs reference
-    -l, --blacklist <FILE>        black-list of queries to ignore (e.g. assembled transcripts identified as mouse
-                                  derived), one ID per line
+    -l, --blacklist <FILE>        black-list of queries to ignore , one ID per line
     -w, --distance <int>          minimum allowed distance for intra-chromosomal fusions [default=10000]
     -g, --gtf <FILE>              Provide annotation of the genome. Use option m to specify selected feature The field
                                   "gene_id" or, if available "gene_name" will be used for reporting Comply either to
                                   CHESS or Gencode standard
     -m, --match <feature-name>    specifies the feature field in a GTF to be used, [defaults: "transcript" ] 
                                   activates automatically option "gtf" too
-    -d, --ratio <float>           ratio of accepted 5' to 3' clipping (and other way around), 0-1 [default: 0.1]
+    -d, --ratio <float>           ratio of accepted 5' to 3' clipping (and other way around), 0.0 will deactivate check
+                                  [default: 0.1]
     -r, --reference <int>         reference in fasta format provided, required for vcf output
     -i, --similarity <float>      minimum seq similarity of matches [default:98.0]
     -t, --threads <int>           number of threads for reading BAM [default: 1]
     -f, --vcf <int>               generate vcf encoded fusion call, requires reference
+
 ```
 
-If there is actually a paired-read data detected the program will exit as this is not compatible. It has been tested to work on assembled transcript fragments, entire transcripts, assembled integration sites and IsoSeq reads. 
-The last currently though showing some limitations in situations where both sides of an alignment are being clipped.
-That situation is very unlikely to happen for any previously described situtation and therefore ignored, whereas this is quite often seen with IsoSeq data. 
-The same kind of limitation could potentially be observed  with HIFI reads carrying integration events.
+If there is actually a paired-read data detected the program will exit as this is not compatible. It has been tested to work on assembled transcript fragments, entire transcripts, assembled integration sites and IsoSeq reads.
 
 `splitty fragment` scans the BAM file and extracts entries which have a `SA` field of supplementary alignment annotated and evaluates whether it classifies as a fusion indication. 
 It checks that the minimum sequence similarity (gap-compressed) and that information are coherent - otherwise it will trash said entry. 
@@ -77,28 +83,35 @@ If that precision score is not equal to 0 one should consider the following posi
 - fusion point on genome from geneA
 - fusion point on genome from geneB
 
-At the end splitty generates a tab-separated table of fusion point annotation and a BND style VCF.
+At the end splitty generates a BND style VCF with paired entries.
 The header of these files contain supplementary information about the version of the program, the date and the used command.
 
 **Note**: currently single entries are permitted and annotated but the information not returned at the very end of the program.
 This might have to be changed in the future.
 
-An important point in this evaluation is that one might encounter more than one entry.
-E.g. we might encounter an entry `SA:Z:chr14,42077618,+,3121M279334D415S,4,1486; chr14,35031866,-,1809M65484D1727S,60,0;`
-Which described more than one entry and will result potentially into 2 subsequently proposed gene fusions.
-Normally the entry with the highest score is kept but sometimes we do encounter identical scores, in this case both are propagated.
-
-**Note**: As birefly mentioned above, the SA-alignments do generate as well cigars which are being used to estimate the position of the gene fusion for each alignment individually. 
+**Note**: As briefly mentioned above, the SA-alignments do generate as well cigars which are being used to estimate the position of the gene fusion for each alignment individually.
 These cigars are though often flawed and approximations and do no contain full alignment information.
 This is supposedly a feature not a [bug](https://github.com/lh3/minimap2/issues/524).
 
-### Paired
+In some cases, the alignment mayb contain towards both extremities clipping which makes the determination of the sense of the fusion less intuitive. Splitty has 2 ways to deal with this:
+
+1. use a ratio of maximum accepted clipping between both sides - default 10%
+2. deactivate and use comparison with 2nd aligned elements
+
+The first would e.g. discard an element for which one side contained 100bp of clipping and the other side >10bp of clipping.
+This can be useful to reduce the number of FP elements, especially for very similar reference-to-sequencing situations.
+The second case ignores though such a factor, and rather would compare than with the SA alignment the similarity. It generates an absolute delta between clipping and 2nd alignment, e.g. 2nd alignment would be 110 bp, than for the 100bp clipping the delta would be 10bp and for a e.g. 15bp clipping on the second clipping it would result in 85bp. Based on that the 1st would be chosen as fusion direction.
+This option might be generally preferable, but has not been extensively tested and compared.
 
+## Paired
 
 This tool is in principle the extension of splitty, meaning it does as well analyze gene fusions from alignments.
 The underlying assumptions and therefore result-organization is a bit different though.
 
 ```bash
+splitty-paired 
+detecting integrations and genefusions from mapped paired-reads
+
 USAGE:
     splitty paired [FLAGS] [OPTIONS] --bam <FILE>
 
@@ -120,25 +133,30 @@ OPTIONS:
                                   increasing might help to pool events in close proximity. This might fuse though
                                   independent events and the lowest possible value might be advisable [default: 5]
     -z, --range <int>             the allowed range in bp within which a split pair is considered support for a fusion
-                                  point. Applies
-                                                  for both sides of the fusion [default: 500]
+                                  point. Applies for both sides of the fusion [default: 500]
     -r, --reference <int>         reference in fasta format provided, required for vcf output
     -i, --similarity <float>      minimum seq similarity of matches [default:98.0]
     -t, --threads <int>           number of threads for reading BAM [default: 1]
     -f, --vcf <int>               generate vcf encoded fusion call, requires reference
+
 ```
 
 
 It does not derive from splice-aware alignment of full- or fragmented-cDNA but from splice-aware alignment of paired-short reads.
-This can be either WGS short-reads for integration or RNAseq reads for gene fusions. The latter must be aligned with a splice-aware aligner such as STAR.
-Each read is analyzed whether it contains indication of a fusion + each read pair is as well evaluated whether the pairs do represent information.
-These results are though then organized on a position base and additional information from other reads is piled up.
-It is though not possible to obtain fusion/integration information without split-reads as these are the base in the 1st step of the program.
+This can be either WGS short-reads for integration or RNAseq reads for gene fusions. The latter must be aligned with a splice-aware aligner such as STAR. Each read is analyzed whether it contains indication of a fusion + each read pair is as well evaluated whether the pairs do represent information. It is though not possible to obtain fusion/integration information without split-reads as these are the base in the 1st step of the program.
 The second step then consists into re-analyzing the entire BAM for discovered points to evaluate whether supporting information can be found.
 
 Precision is the standard deviation from multiple supports compared to the median value (in both directions).
 If not enough support is gathered (because only 1 read found), then it becomes NA instead.
 
+Opposed to long read sequencing, RNAseq is most of the time not stranded and therefore the option for strandness should be ignored.
+Multi-threading is limited to multiple threads for reading and alignment should be reduced before to informative alignment-only.
+
+## Utilities
+
+This repo contains furthermore tools which help to deal properly with BND-pair VCF entries.
+
+
 ### id_fusion_reads
 
 This tool allows to select from an alignment all reads which indicate fusion of transcripts.
@@ -158,7 +176,7 @@ Therefore one needs to specify the transcriptome source to faciliate this.
 For CHESS this is implemented and rather easy as the CHESS ID and transcript ID are easy to collapse
 onto a gene-level annotation. For ENSEMBLE/GENCODE this is more complicated as the identifiers do not easily allow this without more knowledge.
 Meaning we will need a GTF describing for each transcript the gene-level ID as well.
-Therefore GENCODE/ENSEMBLE is currently not yet implemented.
+Currently, GENCODE/ENSEMBLE is not yet implemented.
 
 ```bash
 USAGE:
@@ -178,12 +196,25 @@ OPTIONS:
 
 ```
 
-## Helper scripts
+Multi-threading is fully implemented here and will benefit massively from as many CPUs provided as possibe.
+Eventually, BAM reading IO-capabilities become the bottleneck in our experience.
+
 
 ### vcf_bnd_merge
 
 This tools is really designed to merge paired read and CDNA-fragment derived information.
-It expects 2 VCF files 
+It expects 2 VCF files and will only work IMHO on splitty derived output files.
+This tool combines 2 vcf files describing BND. SVTYPE needs to be BND and ID + MATEID has to be set. Important: if you
+have Fragment and read derived results, please pass them in this order ignoring the "--simple" and "--flag". To simply
+check if entries in FileA were supported in FileB use the "--simple" flag. In this case the "--flag" field will be added
+as INFO field and be set to 0 if no support was found and 1 if support was detected. One can provide further INFO fields
+which will be propagated accordingly in a comma separated list. Currently this requires though elements which have a
+single entrie. E.g. CIPOS which provides 2 numbers typically are not supported.
+
+Scoring: if only entryA/B but not both are present, low confidence events will get the label
+"SPLITTY_FAILED" . If both evidence are combined the filter label is set to "PASS".
+Otherwise no filter field is applied.
+IMPORTANT: currently the strand is not verified of the events!
 
 ```bash
 USAGE:
@@ -196,12 +227,15 @@ FLAGS:
     -V, --version    Prints version information
 
 OPTIONS:
-    -a, --FileA <vcf>      the first file used to compare BND events
-    -b, --FileB <vcf>      the 2nd file used to compare BND events
-    -c, --cipos <int>      the distance within which events are combined [default: 500]
-    -f, --flag <string>    name of the flag value in case of "simple"
-    -o, --out <vcf>        the combined out vcf
-    -t, --threads <int>    vcf reading threads (files need to be huge to make an impact) [default: 2]
+    -a, --FileA <vcf>             the first file used to compare BND events
+    -b, --FileB <vcf>             the 2nd file used to compare BND events
+    -c, --cipos <int>             the distance within which events are combined [default: 500]
+    -f, --flag <string>           name of the flag value in case of "simple"
+    -d, --info-fields <string>    a comma separated list of INFO fields to propagate, fomrat of field must be Integer
+                                  [default: SR]
+    -o, --out <vcf>               the combined out vcf
+    -t, --threads <int>           vcf reading threads (files need to be huge to make an impact) [default: 2]
+
 ```
 
 If these 2 files are provided it follows the following weighting schema:
@@ -243,6 +277,7 @@ Otherwise each VCF BND entry will potentially generate one cluster entry.
 
 ```bash
 
+
 USAGE:
     vcf_sv_cluster [FLAGS] [OPTIONS] --chroms <FILE> --fofn <FILE> --vcf <FILE>
 
@@ -359,3 +394,7 @@ OPTIONS:
     -o, --out <FILE>       outfile name for the VCF
     -a, --vcf <FILE>       the VCF file to subset
 ```
+
+## Dependencies
+
+Splitty needs rust for the compilation and furthermore `cmake` and a recent `libclang-dev` environment.