Skip to content

Segmented Sequence Tree

Vladimir Schneider edited this page Dec 11, 2019 · 1 revision

Optimization for segmented sequences was needed which did not store base sequence offset for every character but at the same did not impact random access to characters via charAt(n).

Some known approaches are an Interval tree for overlapping multiple intervals, Segment tree - Wikipedia or Binary search tree. All have $O(log \ n)$ query time.

In practice, search $O(log \ n)$ where n is number of segments can be made to appear to have almost constant time for sequential and intra-segment random access by using per-thread caching of the current segment index and segment data with additional checking of next/previous segment hits when the current segment does not match, before resorting to the binary search for a given index.

Another desired optimization is to have each BasedSequence provide its segment information efficiently to allow construction of a new segmented sequences consisting of subsequences of the segmented sequence. Being able to extract segment information using the offset and length of the subsequence from parent structure is highly desired and will reduce construction time of segmented sequences.

The tree structure is immutable like the sequences it represents so beyond cost the of creation and querying there is no concern for insert/delete/copy operation efficiency.

Segment Binary Tree

With query time of $O(log \ n)$ where n is number of segments in the sequence not requiring storing any information per character, a binary tree of aggregated length of ordered segments in the sequence is ideal for segmented sequences.

An array of aggregated lengths int[] provides the binary tree structure. The middle entry is the root node for the search.

Advantage for compact storage is significant especially for long sequences with few segments. Current implementation used int for every character in the sequence, tripling the native Java character storage. This huge memory footprint makes use of SegmentedSequence for accumulation of text output from HtmlRenderer and Formatter problematic for large files. In tests the limit for Formatter was about 250k of markdown source. Larger and the operation would fail with out of memory exception. Running smaller files would require over 6GB of memory.

Query penalty for sequential access can be mitigated by per thread caching of last used segment information.

If a char index is within cached segment then the segment can be re-used. If it is after the end of the segment, then the next segment can be tried. If it is before start of segment, then previous segment can be tried.

If there is no match then first and last segments should be checked to optimize sequential char access of sequences from start or end of the sequence.

These tests are a quick index into int[] so do not incur a heavy penalty. Additionally, failed tests are not wasted because they serve to narrow the range of segments which must be searched.

If the requested index is before the previous segment then binary search should be done on [0, prevSeg) and if it is greater than the next segment's aggregated length then search is narrowed to (nextSeg, endSeg).

Because all segments are already ordered, the aggregated length array can be generated by a single iteration of the segments, which is already done to build the segmented sequence. Additional variable to aggregate lengths creates no overhead either. Therefore building of the binary tree is effectively free.

Subsequences can use the parent binary tree or a sub-section of it by defining an offset and length of their portion of it, eliminating the overhead of searching extra segments and copying of data. Effectively no penalty to subsequences for re-use of parent search structure.

Construction of segments from parent binary tree by subsequences is another feature which speeds up overall tree based segmented sequence use.

The total storage overhead for segmented sequences is (4 bytes * 2 + per segment overhead) * number of segments. 4 bytes per integer, one being the aggregated length and the other an offset of the segment in the serialized segment byte array.

Out of base characters have additional overhead for character storage and all have access penalty for de-serializing the segment data. This is relatively painless because each segment contains little data needed for class construction. Character data is not de-serialized and character access is done directly from serialized byte data as characters are accessed.

Deserialization is only performed once the segment is positively identified as one to be used. Even when trying to determine if previous or next segment contains an index, binary tree search data is used not the serialized segment data.

Segments for Segmented Sequences

Segments are stored in a byte array and deserialized from consecutive bytes as needed.

Each type of segment is represented by a separate class all extending SeqSeg. In Kotlin these would be a sealed class. SeqSeg are only deserialized from a byte[]. Serialization is performed by SegmentBuilder from a Seg element.

Serialized data is optimized to minimize number of bytes taken by each segment.

  • Lengths and offsets of <16 require no extra information and fit into the first byte identifying the segment.
  • Lengths and offsets are stored in 1, 2, 3 or 4 byte format as needed to fit the data
  • Out of base text will use 1 byte per character if all characters in the segment have code < 256
  • Repeated characters only store the character and repeat count
  • Repeated spaces and \n only need length and if <16 then a single byte encodes all the information.

Numeric values for offsets and length can have only positive values and 0, with length of any segment limited to Integer.MAX_VALUE/4, ie. 0.5 GB which is plenty for the required application.

Segment types:

  • BASE: base segment representing a range of base sequence characters. 1 to 9 bytes, in practice probably hitting an average of 3-4 bytes.
    • ANCHOR: base segment representing a range of 0 span of base sequence characters, 1 to 5 bytes, average most likely 2-3 bytes.
  • TEXT: segment representing characters out of base sequence, general case 1-5 bytes + 2 bytes per character
    • REPEATED_TEXT: segment representing a repeated character span, 3-7 bytes.
      • REPEATED_SPACE: repeated space, 1 to 3 bytes. Most will probably be 1 or 2 bytes. The latter can handle up to 255 spaces.
        • REPEATED_EOL: repeated \n, 1 to 3 bytes. Most will probably be 1 or 2 bytes. The latter can handle up to 255 EOLs.
      • REPEATED_ASCII: repeated character with code < 256, 2-6 bytes, most likely 2 bytes for repeat <16, 3 bytes up to 255 repeats.
    • TEXT_ASCII: segment where all characters code < 256, 1 to 5 bytes plus 1 byte per character.

First byte of serialized segment carries the information about the specifics and serialized format of the rest of the bytes:

  • SEG_TYPE:
    • 0b0000_0000: ANCHOR: startOffset
    • 0b0010_0000: BASE: startOffset, length
    • 0b0100_0000: TEXT: length: char[]
    • 0b0110_0000: REPEATED_TEXT: char, length
    • 0b1000_0000: TEXT_ASCII: length, byte[]
    • 0b1010_0000: REPEATED_ASCII: byte, length, not needed and should be freed up for other uses.
    • 0b1100_0000: REPEATED_SPACE: length
    • 0b1110_0000: REPEATED_EOL: length
  • SEG_BYTES: encodes number of bytes used for segment's numeric values.
    • 0b0001_xxxx: no size/offset bytes. If has startOffset then offset = xxxx, length = 0, if has only length then length = xxxx, not valid if has both length and start offset
    • 0b0000_0000: 1 byte for start
    • 0b0000_0001: 2 bytes for start
    • 0b0000_0010: 3 bytes for start
    • 0b0000_0011: 4 bytes for start
    • 0b0000_0000: 1 byte for length
    • 0b0000_0100: 2 bytes for length
    • 0b0000_1000: 3 bytes for length
    • 0b0000_1100: 4 bytes for length