Replies: 4 comments 1 reply
-
High-level - we care a lot about streaming compilation on the Web, and believe the LEB representation improves end-to-end performance in this situation (see below).
Mostly it's to optimise the representation of static indices in situations where the majority of indices take small values, but there may be a long tail of larger values. Less importantly, LEBs also have a standards-body advantage if we expect that an index will initially have a very small range, but that future features may expand this range. Otherwise, we'd have to "bake in" the maximum size of the index during initial feature design, which might be a political pain. I don't have perfect examples of this happening yet to hand, although the multi-memory and memory64 proposals might be imperfect examples.
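For concreteness, here's a minimal sketch of the unsigned LEB128 scheme being discussed, in Rust since that's the questioner's setting (the helper name is mine):

```rust
/// Unsigned LEB128: 7 value bits per byte, with the high bit set on every
/// byte except the last. An index below 128 therefore costs a single byte,
/// while the long tail of larger values grows one byte per 7 bits.
fn encode_uleb128(mut value: u32, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // high bit: more bytes follow
    }
}
```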
I think this would need to be empirically validated - I'd have the opposite intuition about the examples above, especially since LEB representations can also be generically compressed.
Since Wasm is mostly compiled, this would only be optimising compilation time, which was historically already bounded by the speed of the network in a lot of Web streaming compilation cases (I've not checked this in a few years). It could be instructive to consider how an alternative discipline of fixed indices might have created different trade-offs - I'd expect a better experience for in-place interpreters and non-Web/on-disk uses of Wasm, in exchange for some increase in binary size and a correspondingly poorer experience on the Web. I'm not sure how explicitly this trade-off was considered/measured in the first days of Wasm - @titzer might have some additional context.
On the Web, end-to-end execution time is often dominated by start-up time, which includes download time, so binary size should be very important to us (although I think we've been angsting recently about the binary size characteristics of new features like GC). It could be that, empirically, we've ended up making the wrong decision in our index representation - for example if compression algorithms are relatively better for fixed-size indices than I'm expecting, or if in-place/on-disk uses of Wasm are suffering more than I'm expecting, or if Web compilation is no longer bounded by download speed and the processing-speed differential between fixed-size indices and LEBs would make an observable difference. Wearing my other hat as an academic, this could be an interesting research project!
-
There was a lot of discussion about encoding questions at the beginning of Wasm. The size effectiveness of the basic code format definitely mattered to us. For example, there were discussions about using LEB vs other variable-length integer formats, there was discussion about adding specialised get0 instructions just to save single bytes, but we also intentionally imposed a max-length restriction on LEBs so that decoders could unroll their loops.

For quite a while we even wanted a second layer in the binary format that enabled custom macro opcodes and thus the ability to reduce code size more systematically. But that was difficult to design right, and it wasn't clear that it would give sufficient benefit over simple zip compression to justify the added complexity for both producers and consumers. In the end we deferred it, but nobody has bothered to pick it up again since.

As a general observation, it is always easy to cut down on the number of instructions in the future by introducing new, specialised instructions. But if the basic format is inherently wasteful, then that is almost impossible to fix. 16-bit opcodes plus 4-byte indices would easily blow up the size of almost all code by a factor of 2 to 3, and that certainly matters.

Also, Wasm was primarily designed for streaming JIT compilation, where maximally efficient decoding matters less. On the other hand, it is not at all clear that substantially larger code would benefit interpreters either, because of adverse cache effects.
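To illustrate what that max-length restriction buys: the encoding of a u32 is capped at ceil(32/7) = 5 bytes, so a decoder's loop has a small constant bound (a sketch, not code from any real engine):

```rust
/// Decode an unsigned LEB128 u32, returning the value and bytes consumed.
/// Because the encoding of a u32 is capped at 5 bytes, this loop has a
/// constant bound and can be fully unrolled by the compiler.
fn decode_uleb128(bytes: &[u8]) -> Option<(u32, usize)> {
    let mut result: u32 = 0;
    for i in 0..5 {
        let byte = *bytes.get(i)?;
        result |= ((byte & 0x7f) as u32) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((result, i + 1));
        }
    }
    None // over-long encoding: invalid, thanks to the max-length rule
}
```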
-
Thanks @conrad-watt and @rossberg. (Sorry for spelling and grammar mistakes, trying to wrangle a toddler at the same time.)
This also seems like sensible future-proofing, assuming address spaces become 128-bit etc. at some point. Though at the risk of repeating the mistake of so many others ... I dare say it would not happen in my lifetime ... (Now the race is on :D )
Yup, ok, say no more.
I agree. I imagine that at the time the initial specification was published, it wasn't really possible to do much empirical testing, i.e. a chicken-and-egg problem. My intuition is based on three assertions:
To that end, I've prepared two examples:
Now looking at the assembly generated, it seems likely that 3. will not only have the lowest CPU clock count (given it has 2 instructions), but that it will fit in the cache better and be transferred from system memory into the cache faster. I haven't actually run it to verify yet, but I'll try and get some performance tests done in the next few days. (Need to unit-test those examples too.)
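As an illustrative sketch of the kind of comparison involved (my own reconstruction, not the exact examples above): the fixed-width alternative boils down to a plain load, whereas the LEB path needs a decode loop like the one shown earlier.

```rust
use std::convert::TryInto;

/// Fixed-width read of an index: on little-endian hardware this typically
/// compiles down to little more than a load -- roughly the two-instruction
/// case mentioned above.
fn read_fixed_u32(bytes: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes(bytes[offset..offset + 4].try_into().unwrap())
}
```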
I can only talk to my own expectations here. I agree with this for general web browsing, but I wonder if start-up time is really that big of a factor for the types of applications that will be using WebAssembly. I.e. which would you rather have: wait a few more seconds for the download, or a laggy experience for the full duration of usage? This may also not matter if you do full-application compression, since the bundle should be smaller overall than one where just the indexes and values are compressed. I.e. every layer of compression adds entropy, so compressing indexes and values and then the whole module will be physically larger than just compressing the whole module from the start. Again, I'd rather not run my mouth without testing :) Assuming I have time over the next few weeks, I'll try and make a tool that converts wasm to "big-wasm" to see the impact on performance and compression.
I'm not an academic, but I like the topic and could be up for some collab work if you want to publish something.
This may ultimately be the crux of it. I'm admittedly looking to embed wasm into applications and hardware, which favours local performance over streaming performance, i.e. closer to Java and C#. I'll do some more testing to make an evidence-based decision about whether wasm is a sensible solution for my work. (May turn out that I'm overthinking things.)
-
Right. I had assumed that the wasm binary format was the compiled output and designed to be interpreted directly. So essentially the tool I'm writing is the JIT compiler.
-
Hi
I've been writing a WASM VM in Rust for fun. I'm trying to understand why LEB128 is used for encoding integers rather than just storing the integers directly in 4- or 8-byte blocks, as in the sketch below.
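By way of illustration (my own sketch; values and little-endian layout assumed):

```rust
fn main() {
    // The index value 5 in the two schemes:
    let leb128 = [0x05u8];          // LEB128: 1 byte, variable length
    let fixed = 5u32.to_le_bytes(); // fixed:  [0x05, 0, 0, 0], always 4 bytes
    assert_eq!(leb128[0] as u32, u32::from_le_bytes(fixed));
}
```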
I'm not sure if there is something that I am missing, but no amount of googling has provided me any real insight. To me, it seems to be needless complexity for no obvious gain.
I have similar questions around using 8-bit opcodes rather than 16-bit ones that can be tightly packed sequentially. (I.e. allow the compiler to build a jump table, rather than a more complex data structure that takes more CPU cycles to process, specifically for variable-length opcodes.)
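A sketch of the dispatch I have in mind (opcode values and operand widths are made up for illustration):

```rust
/// With fixed-size opcodes and operands, a dense `match` like this can
/// compile to a jump table, and the next pc is a compile-time constant
/// per opcode. Variable-length immediates force the decoder to compute
/// instruction lengths at run time instead.
fn next_pc(code: &[u8], pc: usize) -> usize {
    match code[pc] {
        0x00 => pc + 1, // hypothetical: nop, no operands
        0x01 => pc + 3, // hypothetical: local.get with a fixed 2-byte index
        0x02 => pc + 5, // hypothetical: i32.const with a fixed 4-byte operand
        _ => unimplemented!("remaining opcodes elided from this sketch"),
    }
}
```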
My understanding is that the purpose of WebAssembly is to maximise performance, and if this was a performance choice, I would really like to understand how / why. I've been thinking of writing an optimiser that will convert WASM binary into a hardware-optimised format, but before I do that, I really would like to understand whether I have grossly missed the point or fundamentally misunderstood something. Any feedback?