LZ optimal parsing
ZSTD encodes each match as three fields - litLen, matchLen, and matchOffset. These fields are encoded independently, each getting its own "price", fixed for an entire compressed block.
If we have the list of match candidates for an entire block, we can compute the matchOffset price of each candidate a priori, getting the benefits of maximal ILP and even employing AVX-512.
For matchLen prices, we should make one precomputed table for the entire compressed block. It may cover e.g. lengths up to 256, with on-the-fly computation for larger lengths.
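As a sketch of that precomputation (scalar Python; the log2-based cost is a toy stand-in for the real entropy-derived prices, and all names are illustrative):

```python
# Toy sketch of the per-block matchLen price table: common lengths are
# precomputed once, rare long lengths are priced on the fly.
import math

MAX_CACHED_LEN = 256

def match_len_cost(match_len):
    """Toy stand-in for the entropy-coded cost of a match length, in bits."""
    return 1 + int(math.log2(match_len))

# One table per compressed block, indexed by matchLen (entries 0..2 unused).
len_price = [0] * (MAX_CACHED_LEN + 1)
for l in range(3, MAX_CACHED_LEN + 1):
    len_price[l] = match_len_cost(l)

def price_of_len(match_len):
    # Table lookup for common lengths, on-the-fly computation for longer ones.
    if match_len <= MAX_CACHED_LEN:
        return len_price[match_len]
    return match_len_cost(match_len)
```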
Now the crucial part - let's compute each matchPrice[pos][len], i.e. the price of matches starting at each position with each possible length! Again, they can be computed independently, maximizing ILP.
Moreover, we can use CMOV to fly through positions, their candidates, and candidate lengths without any unpredicted jumps:
curPos = position[curCandidate];
curOffsetPrice = offsetPrice[curCandidate]; // precomputed above
matchPrice[curPos][len] = price[len] + curOffsetPrice;
len++;
if (EQUIPROBABLE( len > maxLen[curCandidate])) {
    // Switch to the next candidate
    curCandidate++;
    len = startingLen[curCandidate]; // precomputed too - either MIN_MATCHLEN or 1 + prevCandidate.maxLen
}
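A runnable scalar sketch of this loop, with a plain branch standing in for the CMOV (the data layout follows the pseudocode; all names are illustrative):

```python
# Scalar model of the candidate-walking loop: parallel arrays hold each
# candidate's position, precomputed offset price, max length, and starting
# length (MIN_MATCHLEN, or 1 + previous candidate's maxLen at the same pos).
def fill_match_prices(position, offset_price, max_len, starting_len, len_price):
    # matchPrice[pos] maps length -> price; a stand-in for the ragged matrix.
    match_price = {}
    if not position:
        return match_price
    cur = 0
    length = starting_len[0]
    while cur < len(position):
        pos = position[cur]
        match_price.setdefault(pos, {})[length] = len_price[length] + offset_price[cur]
        length += 1
        if length > max_len[cur]:        # the EQUIPROBABLE condition, a CMOV in real code
            cur += 1                     # switch to the next candidate
            if cur < len(position):
                length = starting_len[cur]
    return match_price
```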
Of course, matchPrice shouldn't be a rectangular matrix; instead, we should allocate only the necessary space for each position. We can also process a compressed block in smaller chunks to limit the matchPrice size (and the size of the per-candidate arrays too).
This loop is about 15 instructions, which means processing 1e9 pos+len combinations per second on a 4 GHz Skylake - probably equivalent to ~100 MB/s of input data. It already looks much faster than existing optimal parsers, but it's not yet good enough for the real deal.
So let's dig deeper! Dropping price[len], this loop is actually the classic UNRLE problem - each offsetPrice[curCandidate] is repeated curCandidate.maxLen - prevCandidate.maxLen times. Knowing that, we can put SIMD to work.
Our new shiny loop will usually make one iteration per candidate (rather than one iteration per len):
curLen = maxLen[curCandidate];
mask = ONES[curLen];
newPrices = broadcast( offsetPrice[curCandidate]);
simdPrices = select( mask, simdPrices, newPrices);
*matchPricesPtr = simdPrices; // always write to RAM
// Go either to the next SIMD word or the next match candidate
if (filled entire SIMD word) {
    matchPricesPtr++;
    mask = 0;
} else {
    curCandidate++;
    if (new position)
        matchPricesPtr++;
}
It's again about 15 instructions, but now per candidate/SIMD word. With the usual 1.5 candidates per position, this means > 500 MB/s.
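In scalar form, the UNRLE expansion these loops implement is just the following (illustrative names; each extend call corresponds to one or more SIMD select+store operations in the real code):

```python
# Scalar model of the UNRLE problem: each candidate's offsetPrice is
# replicated over the run of lengths it covers, one output slot per len.
def unrle_offset_prices(offset_price, max_len, starting_len):
    out = []  # matchOffset prices for len = startingLen[0] .. maxLen[-1]
    for cand in range(len(offset_price)):
        run = max_len[cand] - starting_len[cand] + 1
        out.extend([offset_price[cand]] * run)
    return out
```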
And now, once we have collected all the accessories, we can take on the final boss. The optimal parser loop does this (ignoring repeat codes):
for pos = 0 .. bufsize-1:
    price[pos+1] = min( price[pos+1], price[pos] + literalPrice[buf[pos]])
    for matchLen = MIN_MATCHLEN .. maxMatchLen[pos]:
        matchPrice = compute price of match with length matchLen starting at pos
        price[pos+matchLen] = min( price[pos+matchLen], price[pos] + matchPrice)
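A scalar Python reference for this recurrence (toy price functions, illustrative names) clarifies what the vectorized code must reproduce:

```python
# Scalar reference for the optimal-parse recurrence: price[pos] is the
# cheapest encoding of the first pos bytes; each step relaxes it forward
# with one literal or with every admissible match length.
MIN_MATCHLEN = 3
INF = 10**9

def optimal_parse(buf, max_match_len, literal_price, match_price_of):
    # max_match_len[pos]: longest match available at pos (0 if none);
    # match_price_of(pos, match_len): price of that match (len + offset parts).
    n = len(buf)
    price = [INF] * (n + 1)
    price[0] = 0
    for pos in range(n):
        if price[pos] == INF:
            continue
        # Relax with one literal.
        price[pos + 1] = min(price[pos + 1], price[pos] + literal_price[buf[pos]])
        # Relax with every admissible match length.
        for match_len in range(MIN_MATCHLEN, max_match_len[pos] + 1):
            cand = price[pos] + match_price_of(pos, match_len)
            price[pos + match_len] = min(price[pos + match_len], cand)
    return price[n]
```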
Ignoring literals and their prices for a moment, we can implement it as:
basePrices = broadcast( price[pos]);
*pricePtr = min( *pricePtr, basePrices + *matchPtr);
if (no more SIMD words for this pos) {
    pos++;
    pricePtr++;
} else {
    pricePtr += SIMD width;
}
matchPtr += SIMD width;
Now let's bring literals back into the equation. Literals in ZSTD have unusual pricing - the price of a literal run is the sum of the individual literal prices (the bits required to encode each literal in the run), plus the price of encoding the run length - which we should precompute and save into a litRunPrice[runLength] table, just like the matchLen prices (see above).
Encoding a literal thus means extending the literal run that currently ends at pos (and may have zero length) by one more literal, so the price of encoding a literal at position pos should be computed as:
// Length of the literal run finishing the best encoding chosen for pos:
runLen = runLength[pos];
litPrice = literalPrice[buf[pos]] + litRunPriceDiff[runLen];
// using precomputed:
// litRunPriceDiff[runLen] = litRunPrice[runLen+1] - litRunPrice[runLen]
// We can also precompute (and efficiently vectorize with AVX-512):
// litPriceAt[pos] = literalPrice[buf[pos]]
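These precomputed tables can be sketched as follows (the log2-based run-length cost is a toy stand-in for the real entropy-derived prices; all names are illustrative):

```python
# Toy sketch of the litRunPrice / litRunPriceDiff precomputation and the
# per-position literal price it enables.
import math

MAX_RUN = 256

def lit_run_len_cost(run_len):
    """Toy stand-in for the bits needed to encode a literal-run length."""
    return 1 + int(math.log2(run_len + 1))

lit_run_price = [lit_run_len_cost(r) for r in range(MAX_RUN + 2)]

# Marginal cost of growing a run by one literal, as used in litPrice(pos):
lit_run_price_diff = [lit_run_price[r + 1] - lit_run_price[r]
                      for r in range(MAX_RUN + 1)]

def lit_price(lit_price_at, run_length, pos):
    # lit_price_at[pos] = literalPrice[buf[pos]], assumed precomputed
    # (and vectorizable with AVX-512 as noted above).
    return lit_price_at[pos] + lit_run_price_diff[run_length[pos]]
```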
So, once we have computed the final price[pos], the prices at pos+1 and pos+2 can be affected only by literals (since MIN_MATCHLEN == 3).
// Earlier in the series: final price[pos] was established
price[pos+1] = min( price[pos+1], price[pos] + litPrice(pos));
price[pos+2] = min( price[pos+2], price[pos+1] + litPrice(pos+1));
// Now the final price[pos+1] and price[pos+2] were established.
// The real code should also update runLength[pos+1] and runLength[pos+2]
// according to the chosen path to these positions.
// We can also vectorize the litPrice(pos)/litPrice(pos+1) computation.
With all these prerequisites in place, we can take on the final boss. Let's recall it one last time:
for matchLen = MIN_MATCHLEN .. maxMatchLen[pos]:
    // Compute the price of the match with length matchLen starting at pos:
    matchPrice = matchLenPrice[matchLen] + matchOffsetPrice[pos][matchLen]
    price[pos+matchLen] = min( price[pos+matchLen], price[pos] + matchPrice)
This loop is perfectly vectorizable. The match price includes three parts - litRun, matchLen, and offset. Of those, we account for the litRun price separately (see above). matchLenPrice[] is an array indexed by matchLen, the same for all matches. And matchOffsetPrice[pos][] is also an array indexed by matchLen, so the required matchPrice[pos][] array can be computed by a plain SIMD add.
Similarly, the price[pos+matchLen] computation in the next line is also done with SIMD broadcast, add, and min operations (really, min should be replaced by cmp, since we need these flags to choose proper values for other arrays like runLength[]).
So, the branchless main loop body reduces to a broadcast, an add, a compare, and a couple of selects.
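A scalar model of that body (SIMD words as plain Python lists, 9999 as the empty-slot sentinel; every name here is illustrative) might look like:

```python
# Hypothetical scalar model of the branchless SIMD main-loop body: one
# "SIMD word" of match prices is applied at pos, relaxing price[pos+base+i].
SENTINEL = 9999  # slots outside [MIN_MATCHLEN, maxMatchLen] hold this value

def apply_match_word(price, run_length, pos, base, match_word):
    # price[pos+base+i] = min(price[pos+base+i], price[pos] + match_word[i]);
    # the comparison mask also selects the attribute arrays (runLength etc.).
    for i, mp in enumerate(match_word):   # one SIMD add+cmp+blend in real code
        new_price = price[pos] + mp
        dst = pos + base + i
        if new_price < price[dst]:        # simdCmp + select
            price[dst] = new_price
            run_length[dst] = 0           # a match ends the literal run
```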
It would be perfect, were it not for two problems. First, it lacks the literal run price computation. We can sacrifice a little performance by recomputing it at each iteration; in fact, we suffer mainly because we lose the ability to skip positions without matches, rather than from the rare occurrences of match lengths > 10.
Second problem - any x86 seriously suffers from writing to partially overlapped memory areas, and here we do that A LOT. A possible solution is to keep prices e.g. for the next 32 positions in vector registers, shifting data between them, one word per position, and use scalar writes to save into memory only data for the current position.
Another possible solution is to make only aligned stores and loads, but shift data in vector registers to align them with the current position:
1. Compute one SIMD register with price[pos] + matchPrice[i] (3 SIMD operations - broadcast, load, and add)
2. Load the two corresponding price SIMD words from memory
3. Fill two SIMD registers with matchPrice[] shifted to match the loaded SIMD price words, filling unused slots with 9999
4. Compare the pairs of corresponding registers and use the comparison result to conditionally select the best prices and their attributes like runLength[pos]
Taking into account that most of the computation happens in steps 2..4, we should invert the logic and simply drop the unused part of the step 1 computation:
1. Compute two SIMD registers with mPrice[i] = price[pos] + matchPrice[i] (5 SIMD operations - broadcast, two loads, and two additions)
2. Load the corresponding simdPrice[pos] SIMD word from memory
3. Combine the two mPrice[i] registers to match the alignment of simdPrice[pos], filling unused slots with 9999
4. Compare the aligned-combined mPrice[i] with simdPrice[pos] and use the comparison result to conditionally select the best prices and their attributes like runLength[pos]
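Steps 1..4 can be modeled in scalar Python (SIMD words as width-W lists, the shuffle as index arithmetic, 9999 as the unused-slot filler; all names are illustrative):

```python
# Scalar model of the aligned-store scheme: relax one aligned price word
# with match candidates that start at an arbitrary (unaligned) position.
W = 8            # SIMD width in price slots
SENTINEL = 9999  # filler for slots not covered by any candidate

def relax_aligned_word(price_words, word_idx, base_price, match_prices, match_start):
    # The candidate price for absolute position p = match_start + i is
    # base_price + match_prices[i]; all other slots of the word get 9999.
    word_base = word_idx * W
    aligned = []
    for j in range(W):                     # one shuffle/blend in real SIMD code
        i = word_base + j - match_start
        if 0 <= i < len(match_prices):
            aligned.append(base_price + match_prices[i])
        else:
            aligned.append(SENTINEL)
    # Compare-and-select against the aligned word in memory (simdCmp + blend).
    price_words[word_idx] = [min(a, b)
                             for a, b in zip(price_words[word_idx], aligned)]
```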
We can also choose any other N:N, N:N+1, or N:N-1 combination, but 1:1 or 2:1 seems like the optimal choice. This way, we will average 1-2 iterations per position. Several further refinements are possible:
- move "price[pos] + " after mPrice[i] shuffling
- shuffle matchPrice[i] on precomputing, storing it as SIMD words, aligned with simdPrice[pos]
- main loop over matchPrice[] SIMD words, loading curPos and baseLen from parallel arrays
- UNRLE loop storing 1-4 SIMD words per each match candidate, plus 9999 before MIN_MATCH and after maxMatchLen
But this main loop still suffers from store-to-load forwarding delays. While for the literal run price calculation we can preload priceAt[pos] two iterations earlier, saving priceAt[] SIMD word and loading it again in the next iteration makes the iteration speed latency-bound (latency = forwarding + simdMin).
To combat it, we may process 2..4 SIMD words from matchPrice[] per main loop iteration, spreading the delay over larger useful work. To avoid extra work, we should combine matchPrice[] SIMD words into 2..4-packs with known properties.
One approach is to make each pack from SIMD words belonging to the same pos. This way, a 4-pack on 8-wide SIMD allows us to process, in a single iteration, any position having only matches up to 27 bytes, bringing us to the desired ~500 MB/s speed. In this code, we will load 4 consecutive SIMD words from priceAt[], another 4 from matchPrice[], and the corresponding vars like runLengthAt[], combine them, and store the results back into the *At arrays. The latency between iterations is defined by the load-min-store sequence, which is about 8 cycles, resulting in the 500 MB/s speed. It has another nice property - by speeding up the heaviest case, it makes optimal parsing more "fair".
We can try to make a 4-pack from SIMD words of 4 consecutive positions, thus processing up to 4 positions per iteration (with the average pretty close to 4), but we will have to update the literal run prices after the first 2, so essentially it will be 2 iterations of the 2-pack loop.
So, let's start with the basic 2-pack loop, processing 2 consecutive positions and updating the same priceAt[] SIMD word for both of them. This approach has an outstanding property - we never go back, which means we can completely avoid reloading priceAt[]. That means no more forwarding latencies and a few fewer loads. Unfortunately, this applies only to the SIMD department; we still have to reload runLengthAt[] for the literal run pricing and then somehow update priceAt[] with a lower price.
So, we can make a large iteration over a single SIMD word like that:
0. Load SIMD registers with prices and other data for pos..pos+7
1. Update SIMD registers with matches starting at pos-2 and pos-1
2. Update SIMD registers with literals at pos and pos+1 (these prices don't depend on matches starting at pos-2 and pos-1)
3. Update SIMD registers with matches starting at pos and pos+1
...
4. Update SIMD registers with matches starting at pos+4 and pos+5
5. Update SIMD registers with literals at pos+6 and pos+7
6. Save the data from SIMD registers (loaded at step 0) back to memory
This code moves us 8 positions forward with just a few unnecessary loads and stores, and minimizes latencies by relying on register moves rather than store-to-load forwarding. It would easily process 1 GB/s, if it didn't have a tiny drawback - it handles only matches starting in the same SIMD word.
If we have longer matches, this code needs to run twice or more over the same SIMD word, first updating it with candidates starting in earlier SIMD words. On these extra runs, updating prices with litRuns is useless (but doesn't hurt), while the base price for each match candidate word should be loaded from memory (instead of extracted from a SIMD register), so universal code dealing with both cases should do both operations and combine their results with CMOV. For most positions, two runs over the same SIMD word (the first updating with matches starting in earlier SIMD words, the second with matches starting in the current word) should be enough.
Alternatively, we can make a combined iteration, that first applies N SIMD words with matches starting in earlier words (base prices are loaded strictly from memory, litRun prices aren't computed), then 8 SIMD words with matches starting in the current SIMD word. N can be any number, it should be chosen to get the best performance. Or we can go deeper and combine N light updates (only from previous SIMD words, no litRun pricing) with 8 heavy updates (either from current or earlier words, with litRun pricing).
We can combine two of our best ideas - make a huge iteration processing match candidates from 8 positions (pos-2..pos+5), and apply e.g. 4 aligned SIMD words with matches per position. This allows us to process matches up to 27..34 bytes (depending on the match start position). If any match at these positions (pos-2..pos+5) is longer than supported, we take care of the situation with a conditional branch.
This approach will make the main loop very simple and deterministic. Processing of a single position P:
// take care of litRun price...
basePrice = priceAt[P]; // load from memory for P<pos, or from the current SIMD register for P>=pos
simdBasePrice = broadcast( basePrice);
for i = 0..3:
    simdMatchPrices = simdLoad( alignedMatchPricesFor[P] + i);
    simdNewPrice = simdBasePrice + simdMatchPrices;
    mask = simdCmp( pricesAt[i], simdNewPrice);
    pricesAt[i] = select( mask, pricesAt[i], simdNewPrice);
    otherVarsAt[i] = select( mask, ...);
All in all, I think that we should start with the following approach
- one iteration includes 8 positions, with 2 SIMD words processed per position. It will directly handle match lengths up to 11..18 bytes. If any of these positions have longer matches, we make another pass operating in "long" mode (selected by CMOV operations in the generic iteration code). In this mode, we don't need to update litRun-based prices, and basePrice for each position is loaded from memory rather than SIMD register. One pass in long mode processes 16 more lengths for each of 8 positions, and I expect that 80% of blocks (i.e. 8 adjacent positions) will be handled with one pass, 80% of the remaining blocks with two passes, and so on.
There are two ways to handle it - either we keep the same SIMD words in registers, processing any match candidates ending up at these 8 positions, then write the first set of SIMD registers to memory, move data from the second set to the first one, set prices in the second set to 9999, and start to process the next 8 positions. This approach is more efficient (no need to load SIMD sets from memory), but requires sorting the match candidates by the end of the match position (i.e. matchStart+matchLen). We should do the reordering at the preprocessing stage, running a limited-depth bubble sort.
Another way is simpler - we keep data ordered by the match start, but extra main loop passes will write to higher memory addresses. So, at the start of each pass, we have to load SIMD registers from memory, making speed somewhat more dependent on the forwarding latency.