-
Notifications
You must be signed in to change notification settings - Fork 0
/
ChangeLog
85 lines (83 loc) · 5.84 KB
/
ChangeLog
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
2022.07.04
* v1.0.5:
. added warnings/exit when no HTs are found or allv fail QC
. fixed file names when single sample passes HQ
. tatajuba does not exit if single sample is given, but it warns about non-standard input
2021.11.06
* todo:
. finding ht in reference should use edit distance or bwa result (instead of scanning as hopo_counter perhaps?).
currently it uses canon X rev_strand info combination to make sure only same base is considered, but ideally
distance_between_contexts can have a bitwise shift (still faster than NW/Levenshtein for one or two shifts?)
. new version of VCF uses only right context but still buggy suffix removal
. removal of duplicate vcf entries: check if start position same, per sample. However if single vcf for all samples
is built, then we may need hash table for existing entries
. possibility to create our own protein-translating for each HT in isolation or all HTs from given gene. We can also
just check difference in HT lengths per gene (another file, with gene in row per sample in col?). careful to avoid
using bwa (since it contains contexts, which may overlap over HTs).
2021.04.30
* idea:
. Each concat[] has idx of gene (if anotated, or -1 o.w.). This idx points to genome_set structure containing the
sequence itself and its offset on 1D location (i.e. such that concat[i]->location - offset gives location inside
gene sequence)
. alternatively a new struct with gene DNA seq, 1D offset, strand (of coding region) and phase (0,1, or 2) can be
included in genome_set
2021.04.01
* future ideas:
. vcf file with HTs only (then user can https://github.com/rpetit3/vcf-annotator to check a.a. impact)
. github issue 1: flag HT without forw/rev strands; flag HTs with diff length in forw and rev
. github issue 2: a.a. effect of HT (first codon before HT to last codon after, or whole HT)
2021.03.18
* TODO:
. individual files, which can be updated for new reference genomes; these files could have:
- dinucleotide frequencies (mentioned in several publications, to give a probability of having those lengths by chance)
- all paralogs per reference genome (someone could create a roary-like graph to generate homologs)
- ref genome location file could be the same (merged) across samples (WARNING: this could become a huge file,
maybe have one per sample and tatajuba merges them on the spot)
2021.01.05
* TODO (extra/review):
. if SNP is present maybe one of the contexts will be the same; this will indicate potential SNPs
. To postpone these exceptions, we can assume that interesting homopol-tracts must be present in all samples and
have variation in lengths (we can flag those with potential SNPs above i.e. nearby similar tracts)
. function that finds tract in reference may change tract location (e.g. -1) which makes it incompatible with
selected_tracts.csv which is done before: find_best_context_name_for_reference() changes contig_location
2020.11.27
* multi-genome comparison:
. GFF3 reading implemented, and can save FASTA if included (needed by bwa). Safer is to provide both, however.
. prokka returns (proper) GFF3 with several contigs (a.k.a. genomes or chromosomes in GFF3 docs)
. we transform it into a 1D location indexing by flattening contigs into one 'concatenated' sueprcontig (we reserve
the word "concat" to the concatenation of context_histogram_t though)
. this does not mean it works with several GFF3 files, and we assume distinct, non-homologous contigs (this would force
tatajuba to chose one homolog OR the other)
* TODO:
1. splitting new_genomic_context_list() into two functions, one before bwa+gff only for identical contexts and one
after (accounting for similar contexts). Currently it is quadratic on number of unique context+tract elements.
2. when reading fasta for first time bwa functions complain (non thread safe perhaps)
3. allow for comparison of non-canonic descriptors (also or alternatively store protein version)
4. after finding equivalent tract on reference re-check neighbour tracts; also leftmost_hopo_name_from_string()
could scan _all_ tracts and return closest to input
5. when translating to protein, use Miyata79 and Grantham74 a.a. distances doi:10.1126/science.185.4154.862 doi:10.1007/bf01732340
6. a context+tract may span two genes/gff3 features (so we should ask for start and final features)
2020.08.25
* beta version working:
1. imports BWA and gives for each tract a location in reference genome. (BWA creates indices from ref fasta)
2. merge tracts from same genome based on similarity and location
* TODO:
1. needs annotated genome; therefore user gives GFF, we generate the ref fasta that will then be used for bwa
indexing (DONE 2020.11.27)
2. aim is to distinguish tracts by their effects on sample; therefore we want to keep track of (a) coding effect (i.e.
if tract is on coding region or not); and (b) change in length from reference
2020.02.19
* Andrew's hints:
1. if program is slow, we can first map context-mers to a reference and exclude those not in homopolymer tracts;
2. check Illumina limits for homopolymers, it might exclude > 10 or something
3. indeed good idea to exclude unique context-mers (Bloom filter)
4. Always aim at "one-click" software: run without options and get a useful, interpretable result
2020.01.27
* first version: maintain kalign; assume histogram per sample (must be normalised)
* Desiderata:
- use location information (relies on BAM/reference): not possible now, but keep in mind that we'll need to locate these changes on reference genome
- consider canonical form
- single k-mer may represent error (think Bloom filter); does QC remove these reads?
- alternative to location information can be flanking k-mers
- exclude reads beginning or ending with homopolymer (since true size may span another read)
- therefore it can limit to homopolymers between two 4mers;