Bookend v1.2.0: merge update
Feature addition for Bookend to implement bookend merge. This new utility lets you integrate one or more assemblies into a reference annotation, following gene and transcript naming conventions. Reference transcripts with a matching assembly will have their 5' and 3' ends updated, and they will be given evidence attributes that describe how many times they were assembled and in which samples.
Merge behavior:
- Process transcripts in descending order of total genomic length
- Merge assemblies first, applying no filters
  - Combine the attributes of merged transcripts according to --attr_merge: sum or mean of expression values (TPM, cov, S.reads, E.reads)
  - All assemblies classified as 'full_match' to another assembly will be combined into a single transcript model
- Integrate merged assemblies with reference (decreasing length):
  - Determine the class of each merged transcript vs. reference
  - Name the transcript and add it to the reference list
    - 'full_match' transcripts retain the original transcript_id
    - 'exon_match' transcripts with 5' and/or 3' variation are named after the matching transcript_id with an extra suffix '_<count>'
    - Novel isoforms are given the gene_id with a suffix '.i<count>'
    - Novel antisense transcripts receive the '-AS' suffix
    - Intronic transcripts receive the '-IT' suffix
    - Intergenic transcripts are named 'BOOKEND_<count>'
- Apply filters:
  - Isoforms must have been found at least --rep_filter times
  - The sum/max TPM must be at least --tpm_filter
  - Multiply these filters by --high_conf for suspected artifacts (fragments and fusions)
  - The spliced transcript length must be at least --min_len nucleotides
  - The percentage of capped 5' signal must be at least --cap_percent
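
The naming and filtering rules above can be sketched in a few lines of Python. This is an illustration only, not Bookend's actual code: the MergedTranscript record, its field names, and every class label other than 'full_match' and 'exon_match' are hypothetical. It also assumes that the '-AS' and '-IT' suffixes are appended to a gene_id, and that --high_conf multiplies only --rep_filter and --tpm_filter.

```python
from dataclasses import dataclass


@dataclass
class MergedTranscript:
    """Hypothetical record; field names are illustrative, not Bookend's."""
    match_class: str        # 'full_match', 'exon_match', or an assumed label
    gene_id: str            # gene the transcript was assigned to
    ref_transcript_id: str  # matching reference transcript, if any
    samples: int            # number of assemblies containing this isoform
    tpm: float              # sum or max TPM across assemblies
    spliced_length: int     # total exon length in nucleotides
    cap_percent: float      # percentage of capped 5' signal


def name_transcript(t: MergedTranscript, count: int) -> str:
    """Pick a transcript name following the conventions listed above."""
    if t.match_class == "full_match":
        return t.ref_transcript_id
    if t.match_class == "exon_match":
        return f"{t.ref_transcript_id}_{count}"
    if t.match_class == "antisense":      # assumed label; suffix target assumed
        return f"{t.gene_id}-AS"
    if t.match_class == "intronic":       # assumed label; suffix target assumed
        return f"{t.gene_id}-IT"
    if t.match_class == "intergenic":     # assumed label
        return f"BOOKEND_{count}"
    return f"{t.gene_id}.i{count}"        # any other novel isoform


def passes_filters(t: MergedTranscript, rep_filter: float, tpm_filter: float,
                   high_conf: float, min_len: int, cap_percent: float) -> bool:
    """Apply the four thresholds; suspected artifacts face stricter ones."""
    # Assumption: only the replicate and TPM thresholds are scaled.
    scale = high_conf if t.match_class in ("fragment", "fusion") else 1.0
    return (
        t.samples >= rep_filter * scale
        and t.tpm >= tpm_filter * scale
        and t.spliced_length >= min_len
        and t.cap_percent >= cap_percent
    )
```

With illustrative thresholds (not Bookend defaults) of rep_filter=2, tpm_filter=1.0 and high_conf=2.0, an 'exon_match' transcript seen in two samples passes, while a suspected fragment with the same evidence fails because it would need four samples.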
Bugfixes
- Changes to bookend elr --sj_shift in v1.1 allowed malformed exons with zero or negative length.
  - Added --max_intron to utilities elr, assemble, and condense
  - assemble, condense, and elr utilities now check for and discard malformed entries with negative exon lengths
- bookend elr: terminal exons with noncanonical gaps are discarded
- bookend elr: it is now possible to use all three sources of splice junction evidence together (--splice, --reference, --genome)
- Summary log of bookend label no longer counts --discard_untrimmed reads in Total Output
- bookend elr: refactored softclipping decision tree to better identify untrimmed 5' and 3' oligos
- bookend label: now retrieves UMIs from adapters in either forward or reverse orientation
- bookend label: extended the maximum phred score from 40 to 60
- bookend label: the UMI sequence can be composed of IUPAC ambiguity characters other than N
- bookend label: oligomer extensions (e.g. TTTT+) cannot exceed --max_end
- bookend label: mismatches are no longer tolerated in the last 5nt of an oligomer
- bookend label: best trim is now determined by closest sequence match, not by maximum trim length
- bookend classify: now treats single-exon transcripts less than half the length of their matching transcript as a 'fragment'
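
To illustrate the bookend label item about UMIs written with IUPAC ambiguity characters other than N, the sketch below checks whether a read-derived sequence is compatible with such a pattern. The IUPAC table is the standard nucleotide code; the function name matches_iupac and the example patterns are hypothetical and are not taken from Bookend.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_iupac(pattern: str, sequence: str) -> bool:
    """Return True if every base of `sequence` is allowed by `pattern`.

    Hypothetical helper: it compares position by position and requires
    equal lengths; unknown pattern characters never match.
    """
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC.get(code, "")
        for code, base in zip(pattern.upper(), sequence.upper())
    )
```

For example, matches_iupac("VNNB", "ACGT") is True because V permits A and B permits T, whereas matches_iupac("VNNB", "TCGT") is False because V excludes T.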