
LZ match finders


LZ match finders can be classified into two types:

  1. Fast (for greedy/lazy parsing), where we skip over match contents, so it's more latency-bound.
  2. Full (for optimal parsing), where we can (and should) make it throughput-bound, and since they make the most sense with large dictionaries, this means memory bandwidth-bound.

CPU and RAM

To optimize computations, we should use the following (usual) techniques:

  1. ILP and MLP (instruction- and memory-level parallelism). The compiler mostly handles this, except that compilers schedule as if every load hits L1, so we have to prefetch from larger tables ourselves.
  2. Branchless (thus, lots of CMOVs).
  3. SIMD commands and thus SIMD-optimized match search structures.

The Great Memory Wall

The worst aspect of match finding is the large number of random memory accesses it generates. Thus, optimizing for the cache hierarchy of widespread CPUs is very welcome.

Most x86 CPUs in use today provide at least:

  • 16 KB/thread L1$, with 4-5 cycles latency
  • 128 KB/thread L2$, with 12-17 cycles latency
  • 1 MB/thread L3$, with 30-50 cycles latency

Moreover, limited associativity (4..16) reduces effective cache size to 3/4 .. 15/16 of the nominal size.

x86 cache lines are 64 bytes long, but data from RAM are always (?) fetched in aligned 128-byte chunks, and even fetching 256 aligned bytes might be 30% faster (in terms of total RAM bandwidth) than fetching two independent 128-byte chunks. OTOH, writing only 64 bytes is PROBABLY faster than writing 128 bytes (but I have to check), so it may be advantageous to have e.g. a 256-byte hash table row, of which only 64 bytes are updated at a time.

Overall, an ideal full match finder should be limited only by the RAM bandwidth necessary to check all large matches (e.g. >= 7 bytes) at large distances (e.g. > 256 KB). And since a modern (4-16 core) desktop x86 CPU can perform 100-300 million random RAM accesses per second, it should be able to process about 100-300 MB/s in an ideal full match finder, which means ~25 MB/s per core, or 100-200 CPU cycles per byte (at 2.5-5 GHz).

A single core can do only 50-100 Maccess/sec, so it can process only about 50-100 MB/s, which means that a single-threaded match finder has only 50-100 CPU cycles per byte (at 4-5 GHz). For a 4-wide Skylake-class core, this means 200-400 CPU commands to process one dictionary position. And given the branchless character of our ideal MF, this 200-command budget has to include everything we may optionally want to do.

Unpredicted jumps

Each incorrectly predicted jump adds a delay of about 14 CPU cycles, i.e. a wasted opportunity to execute 50-100 commands. The hit may be even larger if the jump depends on a long command chain, but it may be smaller if the jump is preceded by commands depending on long chains (because the jump may then be executed earlier). To get the most out of the RAM/cache bandwidth, such a jump should be preceded and immediately followed by loads/prefetches of useful data.

For lazy/greedy parsing, two unpredicted jumps per recorded match seem almost inevitable, since execution switches between two modes - searching for a match and mere indexing. But we can fool the system by snapshotting the necessary HT rows in the indexing loop:

for (i = 0; i < 256; i++) {
  cur_pos = base_pos + i;
  hash = hash_fn(cur_pos);
  snapshot[i] = HT[hash];        // optionally filtered by cached bytes (see below)
  // prefetch the dictionary at the positions included in snapshot[i]
  HT[hash] = cur_pos;            // update the HT row (here: single cell per hash value)
}

for (i = 0; i < 256; ) {
  cur_pos = base_pos + i;
  // search for a match using snapshot[i]
  if (match_found)  i += match_len;  else i++;
}

Match finder layers

Today, all but the fastest match finders are constructed from multiple layers that handle different match sizes, e.g.

  1. 2-byte MF with a single cell per hash value
  2. 3/4-byte MF with 4 cells/HV
  3. 5+-byte MF with 16 cells/HV

These layers, except for the last one, are almost uniform, parameterized by only two values (see the sketch after this list):

  1. The range of match lengths handled (usually one or two lengths).
  2. Single or multiple cells per HV. The multiple-cell case is better handled by SIMD, and SIMD in turn makes using more cells per HV much more attractive.
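
For illustration, these two parameters might be captured as a small configuration record (a sketch with hypothetical names, not code from an actual MF):

struct MFLayer {
  int min_len, max_len;   // range of match lengths handled by this layer
  int cells_per_hv;       // 1, or a SIMD-friendly 4/8/16 cells per hash value
};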

Simple match finders

We can write pretty generic code for the 3 typical cases (a single cell serving multiple lengths makes no sense), plus separate generic code for the last layer, and then count the number of CPU commands in the entire MF.

Single cell

Let's start with the simplest case:

// Note: one line per CPU command
src_bytes = src_8_bytes & 0xFFFFFF;
crc = CRC32(src_bytes);
hash_cell = crc & MASK;

match_pos = hash_table[hash_cell];
memcpy(&match_8_bytes, dict + match_pos, 8);
match_bytes = match_8_bytes & 0xFFFFFF;

output[0] = 3;
output[1] = match_pos;
if (EQUALLY_LIKELY(src_bytes == match_bytes))  output += 2;  // 3 commands: CMP+ADD+CMOV

hash_table[hash_cell] = cur_pos;

It's 12 commands, of which only 4 are required for the HT update without searching. We should somehow force the compiler to generate CMOV to make the code branchless. The code is easily modified for finding matches of 2..8 bytes (and we can drop the two commands doing & 0xFFFFFF for the 2-byte and 4-byte cases). For shorter matches (2..4 bytes), we can store only the 2 lowest bytes of pos in the hash table, at the price of a few extra commands to reconstruct the full 4 bytes of pos (see the formula below).
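
One way to nudge the compiler toward branchless code (a sketch; whether a given compiler actually emits CMOV still has to be checked) is to express the conditional advance arithmetically:

// The comparison yields 0 or 1, so the pointer advance compiles to branch-free code
// (e.g. SETcc + LEA, or CMOV), with nothing for the branch predictor to miss.
output[0] = 3;
output[1] = match_pos;
output += 2 * (src_bytes == match_bytes);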

Multiple cells for multiple lengths

Let's extend our simplest code with 2 features:

  • multiple cells per single hash value
  • match length computation

// Note: one line per CPU command
src_bytes = src_8_bytes & 0xFFFFFF;
crc = CRC32(src_bytes);
hash_cell = crc & MASK;
prev_len = MIN_MATCHLEN - 1;

for (i = 0; i < NUM_CELLS; i++) {
  tmp = hash_table[hash_cell][i];
  hash_table[hash_cell][i] = match_pos,  match_pos = tmp;  // shift older entries down; cell 0 is overwritten after the loop

  if (SCALAR) {
    memcpy(&match_8_bytes, dict + match_pos, 8);
    xor_8_bytes = src_8_bytes ^ match_8_bytes;
    len_bits = tzcnt(xor_8_bytes);
    len = len_bits / 8;
  } else {
    // Match len up to 16 bytes with SSE2, up to 32 bytes with AVX2.
    // Can be increased up to 64 bytes by combining masks: bits_eq0 + (bits_eq1 << 16) + ...
    memcpy(&match_16_bytes, dict + match_pos, 16);
    bytes_eq = pcmpeqb(src_16_bytes, match_16_bytes);  // byte 0xFF for equal bytes
    bits_eq = pmovmskb(bytes_eq);  // bit 1 for equal bytes
    bits_noteq = ~ bits_eq;  // bit 0 for equal bytes
    len = tzcnt(bits_noteq);
  }

  output[0] = len;
  output[1] = match_pos;
  if (EQUALLY_LIKELY(len > prev_len))  output += 2, prev_len = len;  // 4 commands: CMP+ADD+CMOV+CMOV
}

hash_table[hash_cell][0] = cur_pos;

Caching match finders

We can save a few bytes (or even a few bits) of the data or hash value in an HT cell, which allows us to quickly filter cells for potential matches.

There are two basic approaches:

  1. Split 4 bytes of the cell between match_pos and bits from dict/HV
  2. Use 4 bytes for match_pos and another 1-4 bytes for cached data

In the first case, if we keep only part of the match_pos bits, we have to reconstruct the full value using cur_pos: match_pos_lowbits < (cur_pos & MASK) ? (cur_pos & ~MASK) + match_pos_lowbits : (cur_pos & ~MASK) + match_pos_lowbits - (MASK+1), where match_pos_lowbits is the stored part and MASK covers the stored low bits.
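
As a minimal sketch, this reconstruction can be wrapped into the F(cur_pos, match_pos_lowbits) helper used in the code below (assuming 32-bit positions; here MASK covers the stored low position bits, not the hash-table mask):

static inline uint32_t F(uint32_t cur_pos, uint32_t match_pos_lowbits)
{
  uint32_t base = cur_pos & ~MASK;
  // low bits not below those of cur_pos mean the match belongs to the previous MASK+1 window
  return match_pos_lowbits < (cur_pos & MASK)
           ? base + match_pos_lowbits
           : base + match_pos_lowbits - (MASK + 1);
}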

Also, we can choose between:

  1. Partial caching (it's only a hint and we have to check it against the dictionary)
  2. Full caching (the cached value has enough data to accept the match)

Full caching allows us to avoid fetching the match bytes from the dictionary, but the cached value may belong to an unused or stale entry, so the reported match may be false. We should mark unused entries in the hash table, e.g. by setting match_pos = 0, both during initialization and when shifting the dictionary, where the same match_pos values would otherwise be reused for new dictionary data.

Full caching can be made either:

  1. Trivial - just keep all the necessary bytes, e.g. 2 bytes of match_pos and 2 bytes of data for 2-byte MF.
  2. Smart - use a reversible hash from N to N bytes. Split the hash value between the HT row number and extra bits saved in the cache. E.g. hash = crc32(4 bytes); ht_row = lower 17 bits of hash; cached_value = higher 15 bits of hash;

Aside from the bits necessary to quickly filter potential matches, the cached value may keep extra bytes allowing us to extend the match. E.g. a cell may include 32 bits of match_pos, 8 bits for filtering, and 3 extra bytes allowing us to quickly determine match lengths from N to N+3 bytes. Or, with smart full caching, it may be 24 bits of match_pos, 16 bits to fully check 4-byte matches, and another 3 bytes to extend these matches up to 7 bytes - all without going to the dictionary!
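
A sketch of such a cell for the smart variant (hypothetical layout and field names; we assume a 16-bit HT row index, so the remaining 16 bits of the reversible crc32 fully verify a 4-byte match, and src_next3_bytes stands for source bytes 4..6 at the current position):

// 8-byte cell layout (hypothetical):
//   bits  0..23 - low 24 bits of match_pos
//   bits 24..39 - high 16 bits of crc32(4 source bytes)
//   bits 40..63 - the next 3 dictionary bytes, used to extend the match
cell              = hash_table[hash_cell];
match_pos_lowbits =  cell        & 0xFFFFFF;
cached_crc_bits   = (cell >> 24) & 0xFFFF;
cached_next3      =  cell >> 40;

if (cached_crc_bits == crc >> 16) {                      // 4-byte match confirmed without touching the dictionary
  xor3 = (cached_next3 ^ src_next3_bytes) | 0x1000000;   // sentinel bit stops the count after 3 bytes
  len  = 4 + tzcnt(xor3) / 8;                            // 4..7 bytes, still without going to the dictionary
}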

Overall, match finders should be configured with enough caching to avoid going to the dictionary at all, except for the largest matches (e.g. >= 7 bytes) on the last MF layer.

Single cell

Let's go back to the simplest code and extend it with caching (it may be useful to replace the online fetch from the dictionary with a prefetch based on the HT contents, see the Lazy case):

// Note: one line per CPU command
src_bytes = src_8_bytes & 0xFFFF;
crc = CRC32(src_bytes);
hash_cell = crc & MASK;

hash_value = hash_table[hash_cell];
hash_match_bytes = hash_value & 0xFFFF;

if (src_bytes == hash_match_bytes) {
  match_pos_lowbits = hash_value >> 16;
  match_pos = F(cur_pos, match_pos_lowbits); // reconstruct the full pos from its low bits, see above
  memcpy(&match_8_bytes, dict + match_pos, 8);
  match_bytes = match_8_bytes & 0xFFFF;
  if (src_bytes == match_bytes) {
    output[0] = 2;
    output[1] = match_pos;
    output += 2;
  }
}

hash_value = cur_pos << 16;
hash_value |= src_bytes;
hash_table[hash_cell] = hash_value;

It's 16 commands plus the computation of F, and only 6 commands for the pure HT update. Again, the real code should be made branchless by using CMOV for output += 2. It's easily modified for finding matches of 3..8 bytes at distances up to ~16 MB. The main difference from the 2-byte case is keeping bits of the CRC in the hash table for quick filtering, instead of the data bytes themselves.

Multiple cells for a single length

With a little help from SIMD this case can be implemented almost as easily (and work almost as fast) as the single-cell case. The key point is that we need to handle only one length, so we can bail out on the first successful match. By filtering the hash table on e.g. 8 bits of the hash value, we can ensure that less than 1/256 of potential matches are lost (but can be restored by repeating the search cycle).

// Note: one line per CPU command
src_bytes = src_8_bytes & 0xFFFFFF;
crc = CRC32(src_bytes);
hash_cell = crc & MASK;

simd_hash_value = hash_table[hash_cell];
simd_cached_hash_bits = simd_hash_value & simd_broadcast(0xFF000000);

src_hash_bits = crc & 0xFF000000;
simd_src_hash_bits = simd_broadcast(src_hash_bits);

simd_mask = simd_cmpeq_dwords(simd_src_hash_bits, simd_cached_hash_bits);
mask = simd_mov_mask_dwords(simd_mask);  // The lowest bit 1 corresponds to the first probable match

match_num = NUMBER_OF_THE_LOWEST_BIT_SET[mask];  // a bit longer sequence if the mask has more than 4-8 bits
hash_value = simd_hash_value[match_num];  // SIMD extract command

match_pos_lowbits = hash_value & 0xFFFFFF;
match_pos = F(cur_pos, match_pos_lowbits); // reconstruct the full pos from its low bits, see above
memcpy(&match_8_bytes, dict + match_pos, 8);
match_bytes = match_8_bytes & 0xFFFFFF;

if (src_bytes == match_bytes) {
  output[0] = 3;
  output[1] = match_pos;
  output += 2;
}

hash_value = cur_pos & 0xFFFFFF;
hash_value |= src_hash_bits;
simd_hash_value = move_dword_up(simd_hash_value);
simd_hash_value[0] = hash_value;  // SIMD insert command
hash_table[hash_cell] = simd_hash_value;

Now, it's 23 commands plus F, and only 9 commands for the pure HT update. If we handle a 32-byte hash row (e.g. 8 match candidates * 4 bytes each) using plain SSE, this adds only about 10 commands to this sequence (3-5 commands to the pure HT update). Overall, each code sequence handling a single length requires 20-40 commands, and even 4 of them (handling the lengths 2, 3, 4 and 5 individually) total 100-150 commands. If the last MF layer can be managed in fewer than 100 commands, the entire match finder takes at most 250 commands, which means handling 100 MB/s per core on a 5 GHz Zen4-class CPU.

Prefetching

Prefetching for optimal parsing

And now, all we have to do is to properly order the commands (and prefetch the data in time).

It's easy for a modern compiler to intermix commands from different MF layers to hide dependency chains and even L1 cache delays. But we still have to help it by prefetching from L2+ cache levels, as well as by writing branchless code so that the compiler can intermix commands between MF layers.

We can start by prefetching data for larger layers first and processing them last, e.g.:

  1. Prefetch for 5+-byte layer
  2. Prefetch for 3/4-byte layer
  3. Process 2-byte layer
  4. Process 3/4-byte layer
  5. Process 5+-byte layer

But with 32 KB/core (thus 16 KB/thread) L1$ in some modern CPUs, even an MF for 3-byte matches (which needs to handle distances up to ~4 KB) can't reside entirely in L1$. L2$ latency is usually 12-17 CPU cycles, so 50-100 commands can be executed between the prefetch from L2$ and the actual data usage.

It becomes even worse for L3$ and RAM loads: in the 50-100 ns of typical RAM latency, a CPU can execute about 1000 commands. Note that this doesn't necessarily mean we are memory-bound! It may easily be that 80% of positions have no 5+-byte matches at all, and of those that do, 80% of the fetches are served from L3$ thanks to the typical distribution of match distances.

But if we don't want to lose time waiting for RAM, we have to overlap the handling of adjacent positions. Let's see: the RAM delay is 50-100 ns, with a large dictionary we may need a second RAM access for the page walk, and on top of that we need at least two dependent fetches - the first into the hash table and the second into the dictionary. So it may take up to 400 ns, and without overlapping operations for multiple positions we could process only 2.5 MB/s.

A solution is to split the processing of each position into 3 stages:

  1. Prefetch the hash table.
  2. Find positions of potential matches in the hash table and update it. Prefetch match strings from the dictionary.
  3. Compute match lengths and output the results.

The distance between the stages should be enough to get data even from the slowest RAM, e.g. 200 ns. Let's be brave and target a full MF speed of 50 MB/s per core, i.e. 20 ns per position. This means that we should operate on 20 positions simultaneously, i.e. while prefetching for position N, process the hash table for position N-10 and compute matches for position N-20.
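
A skeleton of such a pipeline might look as follows (a sketch with hypothetical stage names; pipeline warm-up and drain are omitted):

// Stage 1 runs 20 positions ahead of stage 3; data travels between stages
// via small circular buffers holding the last 2*DISTANCE positions.
#define DISTANCE 10
for (pos = start_pos; pos < end_pos; pos++) {
  Stage1_PrefetchHT(pos);                  // prefetch the HT rows for the newest position
  Stage2_FilterUpdateHT(pos - DISTANCE);   // read/update the HT rows, prefetch dict data for the candidates
  Stage3_CheckMatches(pos - 2*DISTANCE);   // compute match lengths and output the results
}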

The hash table data prefetched into L1$ for each position includes one cache line (64 bytes) for each MF layer, plus up to 512 bytes for the last layer (64 cells * 8 bytes). The dictionary data includes on average 1 (?) potential match per position, which I will count as 128 bytes.

We should prefetch HT data for 10 positions and dictionary data for another 10 positions, so together they will occupy a few KB of L1$. This changes the role of L1$ from "the place where the most frequently used data reside" to "transient space for data read from RAM and slower caches".

For larger search lengths and highly compressible data, 12 KB of L1$ may be insufficient. We can further optimize data access by prefetching data into the L2 (or L3?) cache 10-20 positions ahead and then prefetching them into L1$ just 1 or 2 positions before actual usage.
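
With x86 intrinsics, that two-level scheme might look like this (a sketch; ht_row_addr() is a hypothetical helper returning the address of the HT row for a position, and <xmmintrin.h> provides _mm_prefetch):

_mm_prefetch((const char*) ht_row_addr(pos + 16), _MM_HINT_T1);  // pull into L2 well ahead of use
_mm_prefetch((const char*) ht_row_addr(pos + 2),  _MM_HINT_T0);  // pull into L1 just before use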

Prefetching for lazy parsing

A match finder for lazy parsing is latency-bound (if the dictionary is large enough), because we search only at some positions, and the next position to search becomes known only once we have found the best match at the current one.

Lazy parsing still processes the positions inside matches, but only inserts data into the hash tables. This means that hash table cells can be prefetched in the same way as for optimal parsing, though we may need to increase the prefetch distance to account for the faster overall processing.

The real problem, however, is prefetching the dictionary data for the match candidates. The best we can do is to issue it as soon as we have computed the best match for the current position:

  1. next_pos = cur_pos + best_match.len
  2. Extract match candidates for next_pos, next_pos+1, ... until we find at least one
  3. Prefetch the dictionary at these candidate positions
  4. Encode the best_match
  5. Update the hash tables at pos = cur_pos..next_pos-1, while prefetching them for each pos+20
  6. Extract the match candidates for next_pos and next_pos+1 again, because they may have been changed by the updates in step 5
  7. Find a new best_match and go to step 1

We can try to optimize step 6 by reusing the already-found match candidates of the lowest MF layer, while extending the best match from the previous MF layer.

On average, we encode one match per 5-15 positions, so hash table updates are much more frequent than HT searches. This means that an HT for lazy parsing should rather be optimized for fast updates, even at the cost of slower searches.

With an implementation that ideally overlaps computations with prefetching, the time required to process one encoded match (i.e. ~10 positions) will be roughly max(MEM_latency, 10*update_time + 3*search_time), where MEM_latency is determined by wherever the required slice of the dictionary happens to reside (RAM/L3$/...).

The entire pipeline

Optimal parsing

A quick-and-dirty variant:

  • either an 8-cell-per-row MF for 3+4 bytes
  • or a 4-cell 3-byte MF plus a 4-cell 4-byte MF
  • and a 16-64 cell 5+-byte MF

Smaller lengths should be served entirely from L3$. In particular, avoid probing the dict with offsets > 256 KB (either by full caching or by artificially replacing large offsets with small ones).

New, streamlined description

An LZ match finder algorithm is made of these steps:

  • LookupHT: fetch the hash table row for the current pos
  • --- Memory Wall ---
  • FilterHT: filter the HT row for potential matches
  • UpdateHT: add current pos to the HT row
  • LookupDict: fetch dictionary data of the potential matches
  • --- Memory Wall ---
  • CheckDict: compute the potential match lengths and save the "best" matches found

It has the following specifics:

  • The memory wall between some steps. To solve it, we should make a pipeline of 3 stages, interleaving e.g. processing of pos at stage1, pos-10 at stage2, and pos-20 at stage3, if we choose DISTANCE==10 between the stages. Data between stages should be passed via a circular buffer holding data for the last DISTANCE positions. Lookups should be done by prefetchXX CPU commands to avoid ROB overflows. Sometimes, the CheckDict step (and the corresponding memory wall) can be completely avoided by keeping more data in the HT "cache" bytes.
  • The stages run at different paces. The HT steps run once per position, while the Dict steps run once per checked match candidate. This is a lesser problem for LookupDict (just one prefetch command per candidate). However, CheckDict performs ~10 times more work, so it is advantageous to loop over match candidates rather than over dictionary positions.
  • Greedy/lazy parsing searches for matches only at ~10-20% of positions, which calls for a different pipeline organization:
    • Approach 1: combine a loop with full iterations (i.e. all the steps above), searching for the next match, and a loop with update-only iterations (only LookupHT and UpdateHT), skipping over the recorded match contents. It will cost us 2 unpredicted jumps per encoded match. Unfortunately, it allows neither making a separate pipeline stage for CheckDict nor running CheckDict at its own pace, making this approach inefficient for larger dictionaries (RAM latency will be a hard limit) and deeper searches (too much work wasted). It may be fine for L3$-only operation (10-20 ns between the recorded matches), and with artificial restrictions on the search depth (e.g. check only the first 2 "good" candidates).
    • Approach 2: Run the pipeline with the first two stages over all positions. Run another loop over the match candidates at the match positions. Switch between these loops once per ~1000 positions (see the sketch after this list). The drawback of this approach is the extra work of filtering matches at all positions, although later we check matches for only 10-20% of them. We can diminish it by using SIMD and by simplifying the filtering code (moving more work to the next stage). With AVX2, the filtering may cost about ROWSIZE/2 CPU cycles per position (== 16 CPU commands per SIMD word), i.e. ~1 GB/s for an 8-wide HT. Another drawback is the useless memory traffic (again, only 10-20% of the requested data will actually be used), which limits its practical usage to L3$-only compression.
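
A rough sketch of approach 2 (candidates[] and the block handling are hypothetical; LookupHT/FilterHT/UpdateHT/LookupDict/CheckDict are the steps listed above, with CheckDict assumed here to return the best match length):

for (block = 0; block < buf_size; block += 1000) {
  // Loop 1: run the HT stages over every position of the block
  for (pos = block; pos < block+1000; pos++) {
    LookupHT(pos + DISTANCE);                   // prefetch the HT row ahead of time
    candidates[pos - block] = FilterHT(pos);    // cheap (SIMD) filtering, results stored for later
    UpdateHT(pos);
    LookupDict(pos);                            // prefetch dict data for the filtered candidates
  }
  // Loop 2: check candidates only at the ~10-20% of positions actually searched
  for (pos = block; pos < block+1000; ) {
    best_match_len = CheckDict(pos);            // uses candidates[pos - block] saved by loop 1
    pos += (best_match_len > 0 ? best_match_len : 1);
  }
}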

Greedy parsing, approach 1, may look like:

// We suppose that 10 iterations of the small loop are enough to deliver data from L3$,
// and the same is true for 2 iterations of the large loop.
// It seems that the speed will be the same as ZSTD -5 with equal settings.
// To make it faster, we may employ a caching match finder
// with SIMD-accelerated filtering, which returns only one, the best match candidate to check.

// Search for a match
while(pos < buf_end) {  // ~20+11*ROWSIZE CPU commands/pos
  LookupHT(pos+10);     // 16 commands
  LookupDict(pos+2); // lookup Dict for all matches recorded in the HT row (3*ROWSIZE commands)
  UpdateHT(pos);
  CheckDict(pos); // compute match candidate sizes and choose the best one (8*ROWSIZE commands)
  if (match_found)      // 4 commands
    break;
  EncodeLit(buf[pos++]);  // advance to the next position
}

EncodeMatch(pos, match_len, match_offset);  // ~8 commands/match found
match_end = pos + match_len;

// Prefetch HT+dictionary for the next match search (2*3*ROWSIZE commands/match found)
LookupDict(match_end);
LookupDict(match_end+1);

// Skip the current match contents (~16 commands/pos)
while(pos < match_end) {
  LookupHT(pos+10);
  UpdateHT(pos);
  pos++;
}

A better-streamlined version:

// Once we have computed the current match size, we can start processing the next match.
// 1. Find the first pos where any match candidate occurs
next_match_pos = cur_pos + match_len;
for (pos = next_match_pos .. next_match_pos+7)
  match_tags &&= match_cached_bytes(pos);   // collect one tag-comparison result per position

// 2. The first pos where any cached byte matches is probably the position of the next match
next_match_pos += lzcnt(match_tags) / 8;
prefetch(dict at all positions listed in the hash row for next_match_pos);

// 3. Update the HT for all positions in-between
for (pos = cur_pos .. next_match_pos-1) {
  UpdateHT(pos);
  PrefetchHT(pos+16);
}
cur_pos = next_match_pos;

// 4. Find the best match at cur_pos

// 5. We are almost done
if UNLIKELY (best_match_len < 3):
  // a slow loop performing CheckCandidates(pos) + UpdateHT(pos) + PrefetchHT(pos+16)
  // until a match is finally found

This code does just one unpredicted jump per match, nicely overlapped with the only L3$-waiting moment. So, the worst case is L3$ latency (30-50 CPU cycles) overlapped with the HT update loop, followed by 8 match size computations at next_match_pos (~24 cycles), followed by HT probing operations for the next 8 positions (~16 cycles). It's ~80 CPU cycles total per one match, e.g. ~8 positions, i.e. ~10 CPU cycles/byte = 400 MB/s for the whole greedy parser.