
Bookend v1.2.0: merge update

@maschon0 released this 19 Sep 16:14
· 9 commits to master since this release

This release adds a new utility, bookend merge, which integrates one or more assemblies into a reference annotation while following the reference's gene and transcript naming conventions. Reference transcripts with a matching assembly will have their 5' and 3' ends updated, and they will be given evidence attributes that describe how many times they were assembled and in which samples.

Merge behavior:

  1. Process transcripts in descending order of total genomic length
  2. Merge assemblies first, applying no filters
    • Combine the attributes of merged transcripts according to --attr_merge: sum or mean of expression values (TPM, cov, S.reads, E.reads)
    • All assemblies classified as 'full_match' to another assembly will be combined into a single transcript model
  3. Integrate merged assemblies with reference (decreasing length):
    • Determine the class of each merged transcript vs. reference
    • Name the transcript and add it to the reference list
      • 'full_match' transcripts retain the original transcript_id
      • 'exon_match' transcripts with 5' and/or 3' variation are named after the matching transcript_id with an extra suffix '_<count>'
      • Novel isoforms are given the gene_id with a suffix '.i<count>'
      • Novel antisense transcripts receive the '-AS' suffix
      • Intronic transcripts receive the '-IT' suffix
      • Intergenic transcripts are named 'BOOKEND_<count>'
  4. Apply filters:
    • Isoforms must have been found at least --rep_filter times
    • The sum/max TPM must be at least --tpm_filter
    • Multiply these filters by --high_conf for suspected artifacts (fragments and fusions)
    • The spliced transcript length must be at least --min_len nucleotides
    • The percentage of capped 5' signal must be at least --cap_percent
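The naming rules in step 3 and the filters in step 4 can be sketched in a few lines of Python. This is an illustration only, not Bookend's actual code: the function names, argument names, and the class labels 'isoform', 'antisense', 'intronic', and 'intergenic' are hypothetical, and the sketch assumes that the '-AS' and '-IT' suffixes are appended to a gene_id and that --high_conf multiplies only the --rep_filter and --tpm_filter thresholds. No default values are implied; every threshold must be passed explicitly.

```python
def name_transcript(match_class, transcript_id=None, gene_id=None, count=1):
    """Return a name for a merged transcript (illustrative sketch only)."""
    # 'full_match' and 'exon_match' come from the release notes;
    # the remaining class labels are hypothetical placeholders.
    if match_class == "full_match":
        return transcript_id                 # keep the reference transcript_id
    if match_class == "exon_match":
        return f"{transcript_id}_{count}"    # 5'/3' variant of a reference transcript
    if match_class == "isoform":
        return f"{gene_id}.i{count}"         # novel isoform of a known gene
    if match_class == "antisense":
        return f"{gene_id}-AS"               # assumption: suffix is added to the gene_id
    if match_class == "intronic":
        return f"{gene_id}-IT"               # assumption: suffix is added to the gene_id
    if match_class == "intergenic":
        return f"BOOKEND_{count}"
    raise ValueError(f"unknown class: {match_class}")


def passes_filters(reps, tpm, spliced_len, capped_percent, *,
                   rep_filter, tpm_filter, high_conf, min_len, cap_percent,
                   suspected_artifact=False):
    """Apply the step 4 filters to one transcript (illustrative sketch only)."""
    # Assumption: suspected fragments and fusions face stricter
    # replicate and TPM thresholds, scaled by high_conf.
    factor = high_conf if suspected_artifact else 1
    return (
        reps >= rep_filter * factor
        and tpm >= tpm_filter * factor
        and spliced_len >= min_len
        and capped_percent >= cap_percent
    )
```

For example, under these assumptions an 'exon_match' to a reference transcript named GENE1.1 with count 2 would be named GENE1.1_2, and a suspected fragment seen in two samples would be rejected when rep_filter is 2 and high_conf is 2.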

Bugfixes

  • Fixed a regression from v1.1 in which changes to bookend elr --sj_shift allowed malformed exons with zero or negative length.
  • Added --max_intron to utilities elr, assemble, and condense
  • assemble, condense and elr utilities now check for and discard malformed entries with negative exon lengths
  • bookend elr: terminal exons with noncanonical gaps are discarded
  • bookend elr: it is now possible to use all three sources of splice junction evidence together (--splice, --reference, --genome)
  • Summary log of bookend label no longer counts --discard_untrimmed reads in Total Output
  • bookend elr: refactored softclipping decision tree to better identify untrimmed 5' and 3' oligos
  • bookend label: now retrieves UMIs from adapters in either forward or reverse orientation
  • bookend label: extended the maximum phred score from 40 to 60
  • bookend label: the UMI sequence can now be composed of IUPAC ambiguity characters other than N
  • bookend label: oligomer extensions (e.g. TTTT+) cannot exceed --max_end
  • bookend label: mismatches are no longer tolerated in the last 5nt of an oligomer
  • bookend label: best trim is now determined by closest sequence match, not by maximum trim length
  • bookend classify: now treats single-exon transcripts less than half the length of their matching transcript as a 'fragment'
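
Two of the checks described above, discarding entries with malformed exons and classifying short single-exon transcripts as fragments, can be sketched as follows. This is a minimal illustration, not Bookend's implementation: the function names are hypothetical, and exons are assumed to be (start, end) pairs with an exclusive end coordinate, so that an exon's length is end minus start.

```python
def has_malformed_exon(exons):
    """True if any exon has zero or negative length.

    Assumption: exons is a list of (start, end) pairs, end exclusive.
    """
    return any(end - start <= 0 for start, end in exons)


def is_fragment(exons, matching_length):
    """True if a single-exon transcript is less than half the length
    of the transcript it matches (hypothetical helper)."""
    if len(exons) != 1:
        return False  # the rule applies to single-exon transcripts only
    start, end = exons[0]
    return (end - start) < matching_length / 2


# Keep only well-formed entries from a small made-up list of transcripts.
transcripts = [
    [(100, 200), (300, 400)],   # well formed
    [(100, 200), (300, 300)],   # second exon has zero length
    [(500, 450)],               # negative length
]
kept = [t for t in transcripts if not has_malformed_exon(t)]
```

In this sketch only the first transcript is kept, and a single 80 nt exon matched to a 200 nt transcript would be classified as a fragment.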