improving documentation
clemaitre committed Mar 1, 2022
1 parent 10cbc7d commit bb1f38e
Showing 2 changed files with 23 additions and 14 deletions.
29 changes: 16 additions & 13 deletions README.md
@@ -194,27 +194,30 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
4. **MindTheGap Output**
All the output files are prefixed either by a default name: "MindTheGap_Expe-[date:YY:MM:DD-HH:mm]" or by a user defined prefix (option `-out` of MindTheGap).
Both MindTheGap modules generate the graph file if reads were given as input:
* a graph file (`.h5`). This is a binary file, to obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
`MindTheGap find` generates the following output files:
The main result files are output by the Fill module:
* a breakpoint file (`.breakpoints`) in fasta format.
* an **insertion variant file** (`.insertions.vcf`) in vcf format, in the case of insertion variant detection (for insertions >2 bp).
* an **assembly graph file** (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
Additional output files are:
* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).
* a graph file (`.h5`), output by both MindTheGap modules. This is a binary file containing the de Bruijn graph data structure. To obtain information stored in it, you can use the utility program `dbginfo` located in your bin directory or in ext/gatb-core/bin/.
`MindTheGap fill` generates the following output files:
* Files output specifically by `MindTheGap find`:
* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences (for insertions >2 bp) or contig gap-fills that were successfully assembled.
* an insertion variant file (`.insertions.vcf`) in vcf format, in the case of insertion variant detection (for insertions >2 bp).
* a breakpoint file (`.breakpoints`) in fasta format.
* a variant file (`.othervariants.vcf`) in vcf format. It contains SNPs, deletions and very small insertions (1-2 bp).
* an assembly graph file (`.gfa`) in GFA format, in the case of contig gap-filling. It contains the original contigs and the obtained gap-fill sequences (nodes of the graph), together with their overlapping relationships (arcs of the graph).
* Files output specifically by `MindTheGap fill`:
* a sequence file (`.insertions.fasta`) in fasta format. It contains the inserted sequences (for insertions >2 bp) or contig gap-fills that were successfully assembled.
* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/gap-fill.
* a log file (`.info.txt`), a tabular file with some information about the filling process for each breakpoint/gap-fill.
* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, i.e. when the target kmer was not found, the first contig immediately after the source kmer is output.
* with option `-extend`, an additional sequence file (`.extensions.fasta`) in fasta format. It contains sequence extensions for failed insertion or gap-filling assemblies, i.e. when the target kmer was not found, the first contig immediately after the source kmer is output.
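
As a quick sanity check after a run, the file extensions listed above can be probed for a given output prefix. This is only a sketch: `list_mtg_outputs` is a hypothetical helper, not something shipped with MindTheGap. Several of these files are only produced in particular modes (for example `.gfa` for contig gap-filling, `.extensions.fasta` with `-extend`), so a "missing" line is not necessarily an error.

```shell
# Hypothetical helper (not part of MindTheGap): report which of the output
# files described in the README exist for a given prefix.
list_mtg_outputs() {
    prefix=$1
    for ext in h5 breakpoints othervariants.vcf insertions.fasta \
               insertions.vcf gfa info.txt extensions.fasta; do
        if [ -e "$prefix.$ext" ]; then
            echo "found    $prefix.$ext"
        else
            echo "missing  $prefix.$ext"
        fi
    done
}
```

It would be called with the same value that was passed to `-out`, for example `list_mtg_outputs my_prefix`.
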
8 changes: 7 additions & 1 deletion doc/MindTheGap_insertion_caller.md
@@ -40,7 +40,13 @@ MindTheGap is composed of two main modules : breakpoint detection (`find` module
* `-max-rep`: maximal repeat size allowed for fuzzy sites [default '5'].
* `-het-max-occ`: maximal number of occurrences of a (k-1)-mer in the reference genome allowed for heterozygous insertion breakpoints [default '1']. To detect a heterozygous insertion breakpoint, both flanking (k-1)-mers, one on each side of the insertion site, must have strictly fewer than this number of occurrences in the reference genome. This prevents false positive predictions inside repeated regions. Warning: increasing this parameter may lead to numerous false positives (genomic approximate repeats).
* `-branching-filter`: maximal number of branching kmers in a 100-bp window before a heterozygous site [default '15', '-1' means no filter applied]. This filter prevents numerous false positive predictions inside repeated regions. In large and complex genomes, such as human, this parameter can be set to lower values (10 or 5), in order to decrease the running time of the Fill module (but this may result in a loss of recall in repeat-rich regions).
* `-bed`: the path to a bed file defining genomic regions, to limit the find algorithm to particular regions of the genome. This can be useful for exome data.
* `-bed`: the path to a bed file defining genomic regions, to limit the find algorithm to particular regions of the genome. This can be useful for exome data. Important: the bed file must be sorted and its overlapping intervals merged, for example:

```
sort -k1,1 -k2,2n file.bed > file_sorted.bed
bedtools merge -i file_sorted.bed > file_final.bed
```
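
Before passing the file to `-bed`, it may be worth confirming that it really is sorted and merged. The sketch below uses a hypothetical helper, `check_bed`, which is not part of MindTheGap or bedtools. It assumes the default behaviour of `bedtools merge`, which joins book-ended intervals as well as overlapping ones, so each interval is expected to start strictly after the previous one ends.

```shell
# Hypothetical helper, not part of MindTheGap or bedtools.
# Succeeds only if, within each chromosome, every interval starts strictly
# after the previous one ends (sorted, no overlapping or book-ended intervals)
# and each chromosome forms a single contiguous block.
check_bed() {
    awk '
        NF < 3 || /^(#|track|browser)/ { next }   # skip blank and header lines
        $1 != chrom {
            if ($1 in seen) {
                printf "line %d: %s appears in more than one block\n", NR, $1
                bad = 1; exit
            }
            seen[$1] = 1; chrom = $1; prev_end = -1
        }
        ($2 + 0) <= prev_end {
            printf "line %d: interval is unsorted or touches the previous one\n", NR
            bad = 1; exit
        }
        { prev_end = $3 + 0 }
        END { exit bad }
    ' "$1"
}
```

For example, `check_bed file_final.bed && echo "bed file looks sorted and merged"` prints the message only when the check passes.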

5. **Fill module specific options**
