In this section a table is presented with optimization results in several projections.
Contest test cases display how much gas is used by contest-like methods of test suite, total saved and percentage as compared to the original commit gas use.
Global gas counters were introduced after commit Keep your functions close and vars even closer
to make a measurable
metric of "tradeoff" between contest test cases and all other cases, that are totalled by their corresponding test suite.
Only positive (no fail) tests without getters (since there is no point to optimize getters) are included in the table.
This metric proven to be useful, because an Localize extensions in loop and short-circ simple
commit resulted in very
big jump in savings on the test cases, meanwhile it severly impaired all other cases (GGC increased a lot on it). As a
result, the very next commit Refactored internal message flows, good GGC value
managed to bring the GGC below the initial
total level, with further commits do a stable development
of contest test cases with improving GGC as well.
"Radical" optimizations involving modifying the compiler ``Asm.fif`` code are outlined in bold.
Commit | Contest test cases | Global gas counters | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
.____________________________________________________________. | Ext | Int | Extn | Total | Save | Perc% | Exter | Inter | Exten | Total |
Origin point: INITIAL | 3235 | 4210 | 2760 | 10250 | 0 | 0.00% | 64038 | 71163 | 38866 | 174067 |
Optimized unneccessary cell loads and operations | 3185 | 4014 | 2744 | 9943 | 262 | 2.56% | 65556 | 70764 | 40304 | 176624 |
Removed unneccessary always true check | 3185 | 3823 | 2501 | 9509 | 741 | 7.23% | 65504 | 68993 | 38998 | 173495 |
Unrolled the common internal handler code | 3185 | 3700 | 2373 | 9258 | 992 | 9.68% | 65504 | 67886 | 38204 | 171594 |
Implicitly return from the external handler | 3165 | 3700 | 2373 | 9238 | 1012 | 9.87% | 65264 | 67886 | 38204 | 171354 |
Reaped benefits of separated internal loaders | 3165 | 3700 | 2295 | 9160 | 1090 | 10.6% | 65264 | 67886 | 37736 | 170886 |
Discarded unneccessary slice remains in dispatcher | 3155 | 3690 | 2285 | 9130 | 1120 | 10.9% | 65034 | 67716 | 37646 | 170396 |
Loaded auth_kind optionally using LDUQ instruction | 3155 | 3654 | 2249 | 9058 | 1192 | 11.6% | 65050 | 67408 | 37430 | 169888 |
Is ifnot a joke for you? (emits less instructions) | 3155 | 3654 | 2231 | 9040 | 1210 | 11.8% | 65050 | 67408 | 37322 | 169780 |
Localize extensions in loop and short-circ simple | 3045 | 3644 | 2121 | 8810 | 1440 | 14.0% | 69697 | 71316 | 39314 | 180327 |
Reordering int msg handlers somehow saves 10 gas | 3045 | 3567 | 2188 | 8800 | 1450 | 14.1% | 69697 | 70623 | 39716 | 180036 |
Moving signature check higher saves some gas | 3027 | 3549 | 2188 | 8764 | 1486 | 14.5% | 69481 | 70461 | 39716 | 179658 |
Reordering checks somehow sames some more gas | 3009 | 3531 | 2188 | 8728 | 1522 | 14.8% | 69265 | 70299 | 39716 | 179280 |
Removing end_parse is -gas and +reliability | 2983 | 3505 | 2188 | 8676 | 1574 | 15.6% | 68953 | 70065 | 39716 | 178734 |
Keep your functions close and vars even closer | 2957 | 3505 | 2188 | 8650 | 1600 | 15.6% | 68641 | 70065 | 39716 | 178422 |
< restore extensions var in loop > | 3067 | 3533 | 2288 | 8888 | 65669 | 67568 | 38456 | |||
< move complex logic to inline_ref > | 2957 | 3423 | 2316 | 8696 | 65528 | 67495 | 39148 | |||
< optimize tail loading of extensions > | 2957 | 3423 | 2298 | 8678 | 65528 | 67495 | 39040 | |||
< optimize preference for simple ext ops > | 2957 | 3423 | 2248 | 8628 | 65528 | 67495 | 39324 | |||
Refactored internal message flows, good GGC value | 2957 | 3423 | 2248 | 8628 | 1622 | 15.8% | 65528 | 67495 | 39324 | 172347 |
^^^ This commit finally shows TOTAL GGC less then initial one WHILE providing 15.8% gas save on contest test cases! ^^^ | ||||||||||
Reorganized inlining point for extension message flow | 2957 | 3423 | 2230 | 8610 | 1640 | 16.0% | 65528 | 67495 | 38782 | 171805 |
Do not carry around params not needed (ext opt) | 2957 | 3423 | 2176 | 8556 | 1694 | 16.5% | 65176 | 67275 | 38586 | 171037 |
Optimize argument order to match stack | 2939 | 3405 | 2148 | 8492 | 1758 | 17.2% | 64960 | 67113 | 38346 | 170419 |
Swapping extn and sign order back saves some net gas | 2939 | 3464 | 2063 | 8466 | 1784 | 17.4% | 65004 | 67676 | 37876 | 170556 |
Short-circuit optimization of LDUQ with IFNOTRET | 2939 | 3420 | 2019 | 8378 | 1872 | 18.3% | 64929 | 67205 | 37612 | 169746 |
< short-circuit flags check with asm > | 2939 | 3402 | 2001 | 8342 | ||||||
< short-circuit int msg sign last check with asm > | 2939 | 3376 | 2001 | 8316 | ||||||
< short-circuit ext msg sign last check with asm > | 2913 | 3376 | 2001 | 8290 | ||||||
< short-circuit extension dictionary check with asm > | 2913 | 3376 | 1983 | 8272 | ||||||
Short-circuited some returns with asm | 2913 | 3376 | 1983 | 8272 | 1978 | 19.3% | 64599 | 66791 | 37373 | 168763 |
ASM-optimized simple action cases | 2885 | 3348 | 1981 | 8214 | 2036 | 19.9% | 64470 | 66700 | 37351 | 168521 |
Optimized out more unneeded instructions if may RET | 2885 | 3338 | 1955 | 8178 | 2072 | 20.2% | 64470 | 66610 | 37177 | 168257 |
^^^ Finally got more than 20% optimization in test cases! ^^^ | ||||||||||
Removed another unneccessary DROP with preload uint | 2875 | 3338 | 1955 | 8168 | 2082 | 20.3% | 64350 | 66610 | 37177 | 168137 |
Reordered argument order to optimize stack operations | 2857 | 3320 | 1955 | 8132 | 2118 | 20.7% | 64134 | 66448 | 37137 | 167719 |
Rewritten RETALT to IFNOTJMP - less gas, more reliable | 2836 | 3299 | 1934 | 8069 | 2181 | 21.3% | 64071 | 66406 | 37220 | 167697 |
Another argument stack optimization (psr -> dr call) | 2818 | 3281 | 1934 | 8033 | 2217 | 21.6% | 64017 | 66370 | 37220 | 167607 |
^^^ Finally this solution is better then original in EVERY test suite branch! The last hurdle - Exter GGC is now less!!! ^^^ | ||||||||||
Black magic route optimization (drop some result later) | 2818 | 3281 | 1916 | 8015 | 2235 | 21.8% | 64017 | 66370 | 37130 | 167517 |
Another black magic optimization (drop auth_kind later) | 2818 | 3281 | 1906 | 8005 | 2245 | 21.9% | 64017 | 66370 | 37102 | 167489 |
Moving recv handlers to entrypoint (root code cell) | 2829 | 3123 | 1748 | 7700 | 2550 | 24.9% | 63771 | 64759 | 36154 | 164684 |
Rearranged entrypoint conditions flow, compiler fix | 2826 | 3110 | 1735 | 7671 | 2579 | 25.1% | 63709 | 64629 | 36076 | 164414 |
< some commits not affecting the main test branches > | 60135 | 62241 | 33604 | 155980 | ||||||
Removed unneccessary exploded data parameters | 2790 | 3074 | 1735 | 7599 | 2651 | 25.9% | 59703 | 61917 | 33604 | 155224 |
Use SDBEGINS to enforce prefix in external message | 2718 | 3074 | 1735 | 7527 | 2723 | 26.7% | 58839 | 61917 | 33604 | 154360 |
Use SDBEGINSQ to check internal message prefixes | 2718 | 2984 | 1647 | 7349 | 2901 | 28.3% | 59045 | 61210 | 33044 | 153299 |
Use SDFIRST instead of PLDU to check first bit | 2710 | 2976 | 1639 | 7325 | 2925 | 28.6% | 58949 | 61138 | 33004 | 153091 |
Check bounced flag using slices and trail bits | 2710 | 2966 | 1629 | 7305 | 2945 | 28.7% | 58929 | 61038 | 32944 | 152911 |
Using SDBEGINSQ to check for starting zero | 2707 | 2963 | 1626 | 7296 | 2954 | 28.8% | 58893 | 61011 | 32929 | 152833 |
Optimized instructions order for extension | 2707 | 2963 | 1608 | 7278 | 2972 | 29.0% | 58893 | 61011 | 32821 | 152725 |
Reminder and origin point: INITIAL | 3235 | 4210 | 2760 | 10250 | 0 | 0.00% | 64038 | 71163 | 38866 | 174067 |
N.B. Contest multiplier: 9905/10250 = 0.9663 (approximate) -> place multipliers ~ 0.3221138, 0.0966341, 0.048317
In this section, details of optimization in each commit are pointed out, sometimes with detailed rationale and reasoning, when neccessary (in some relatively controversial optimizations).
The origin point, the state of the contract when the contest was started. It is used as a basis point to measure further improvements.
There is some data that is not needed to be loaded right away, since most likely it won't be used, so that data loading is deferred
until the moment it is actually needed. First of all, that is extensions
dictionary, since loading dict (consequently, a cell)
is a pretty expensive operation.
Also, reading stored_subwallet, public_key, extensions
and writing them back just to increase stored_seqno
is completely
unneccessary, so I took a snapshot of slice immediately after stored_seqno
, and write it as a slice, instead of 3 write operations
when increasing the stored_seqno
.
Instead of extensions
now immutable_tail
is being passed around, and extensions
are extracted from it, when needed.
Adding return to the if condition decreased amount of gas (due to turning IF
into IFJMP
), and, consequently,
second check of opcode is not required, since it is allowed to be only one of two options, one of which was already checked.
Copying the common data load code to separate execution paths in internal message handler somehow saves considerable amount of gas, but, most importantly, allows to optimize the data loading in future (since it is now different code).
Explicity (commit name has logic mistake) returning from the external handler saves some gas due to some TVM optimizations.
Because data loading is now handled separately for signed and extension messages, it is possible to optimize data loading so as not to waste unneccessary gas to load data that is not required for a specific execution path.
More precisely, extensions are now loaded from immutable tail, that allows to streamline stack manipulations that decrease amount of used gas, also, this logic will be even more simplified in future to save even more gas.
Using preload_ref
instead of load_ref
on a varible that is not used anymore saves considerable amount of gas, since
it is not required anymore to do stack manipulations and dropping the unneccessary result.
An LDUQ
TVM instruction was used to construct a try_load_uint32
that attempts to load an uint32
from a slice,
and returns the success indicator alongside with result, that allows to compact checking of availability of bits in slice
and reading the integer itself into one instruction - less branching, instructions, checks and gas.
Using ifnot
instead of if ~...
saves gas, since NOT
instruction is not needed anymore. ifnot
has same price
and bit length as the if
, therefore it is always advised to use ifnot
for negative conditions.
In this commit, there are two different changes. First one is localizing extensions
inside loop, that allowed to save
some gas in case extensions
are not needed to be changed.
The second one is one of the most important optimizations, that opens the door for many more further gas optimizations
in the code. The idea is that if the message is simple, that it, has no extended actions (the first bit is right away 0),
it is possible to immediately do the set_actions
and return
.
While the first idea has a noticeable tradeoff, that will be eliminated in future by optimizations all around the code, the second one does not make other execution paths more pricey, while making the main ones much better in terms of gas.
Moving sign
above extn
one in internal message handler somehow saved 10 gas.
In process_signed_request
moving signature check to the top of the function saves some gas.
In process_signed_request
changing order of parameter checks decreased amount of stack manipulations and saved some gas.
In this commit, end_parse
(and coincidentally now unneeded skip_dict
) was removed from this code. This leads to
increased reliability, less gas usage, and opens road to some more optimizations (like tail preloading).
While decreasing gas usage and opening road to more optimizations is pretty obvious, let's me explain on the increased reliabilty.
The idea behind it is, that usually, end_parse
is used to force structure of user messages. Therefore, mostly, using
it to enforce structure of internal data of the contract is quite excessive, since the contract itself is the one, who
only can write it's own data, and therefore if it cannot be corrupted by the code, then there is no way extra data appears
after the expected end. Therefore, using end_parse
is unneccessary, and just wastes gas.
However, in this contract the user can directly do set_data
using extended actions on the contract. And here is the point
why reliability of the contract is actually increased by removing the end_parse
. It is possible in future, that the
user might accidentally append extra data to the end of the contract. This may happen if the user would like to upgrade the
contract, it will have some more extra data, but for some reason failed or forgot to do the code upgrade action, or it failed
one or another way. In this situation the user will end up with the old contract with the new data. And in this situation,
all the TONs, tokens and NFTs on this wallet will be locked forever!!! just because of that end_parse
. Therefore,
removing the end_parse
also helps against such kind of mistakes, and there are no any kind of implications on removing it.
The only place where it should really be used is checking close-structured (without open ends, like in our case, where the list can be of any length) input user data, in order to make sure, that a specific request can have only one single implementation in order to prevent some playing with signatures, but that is completely not an our case.
This refactoring of external message handler streamlines data flows in it, therefore avoiding unneccessary stack manipulations
and saving some gas as a result. More precisely, the auth_kind
is loaded right away from body
(since it is the last
parameter of the function, it is at the top of the stack at that moment), and data is being loaded later after the check.
This commit, and several other technical commits before it (not described here, since they are technical ones and do not affect the code) lays beginning for calculation and optimizations of GGC (global gas counter). While not being a direct target of the contest, the GGC is important metric, that allows to measure the tradeoff, of how optimizing contest paths inadversely affects all other logic of the code that is not measured. Therefore, keeping an eye on GGC is important for sustained development of contest paths, where optimizing them does not severely impair all other code logic.
This commit, while increasing extension gas usage a little (this problem will be addressed to and solved in later commits), immensely decreases usage of gas in GGC, and brings it down below the GGC in initial commit. Therefore, starting at this point, I can strongly assert, that the optimizations of the main contest paths do not impair the other code paths and logic.
First of all, extensions
variable in complex handling loop was reinstated, because saving exts in cell and popping them
off each time required a lot of gas due to recreation of cell each time.
Next, the complex dispatch request handling logic was moved off to a separate function, that is called with inline_ref
modifier. This allows to save some gas on simple cases, and is actually a very important optimization for future, because
at some point in future, the cell breaking point where TVM Assembler decides to break cell into pieces because a critical
point for further optimization.
The way how extensions are loaded in internal message handler is optimized so as not to load the unneccessary at that moment data.
Simple operations initiated by extensions now do not require to load the unneccessary data from the contract.
Some optimizations were made to tell the compiler to break the cell at exact place by using inline
and inline_ref
accurately.
Getting the data of the contract in place, even accounting for the begin_slice
is more efficient than carrying it around
in many parameters, that forces stack shaping when crossing the function boundary, and constraints on how efficient stack
manipulations may be, therefore all the unneccessary parameters were removed and data is extracted closer to the point
where it is actually needed.
Some parameters were reordered to match how they are ordered in stack, so that to decrease amount of unneccessary stack operations.
In internal message handler sign
and extn
message handlers were swapped back once again, since somehow, after all the
optimizations carried out above, that order is now more efficient in terms of gas.
Instead of pretty complex in terms of instructions and gas FunC construct, a single IFNOTRET
is used to quickly end
execution when there are not enough bits in the slice to obtain the opcode from the internal message.
Following the idea of the previous commit, some more operations now use IF(NOT)RET
instead of conditionals to save more gas.
Flags of internal message (bounced, to be more precise) are now checked by a concise ASM function that does IFRET
to
end the execution in case a bounced message is detected.
The check of second operation can be made shorter by comparing two numbers equality and performing IFNOTRET
in ASM.
The same applies for opcode check in internal message handler.
... and the success?
result of locating the sender in the extensions
dictionary.
An optimized code construct was built to replace the not-so-efficient FunC code for simple function cases. This one uses a specific ordering of result on the stack after executing the neccessary instructions.
An udict_get_or_return
instruction was introduced that instead of returning success?
alongside with the result
returns immediately if the entry is not found in the dictionary.
Also, I have noticed, that public_key
is read from cs
using ~load_uint
, but that cs
is not used anymore
in the code, so saved an unneccessary DROP
by using .preload_uint
instead.
The same optimization for public_key
loading was done in the external message handler in this commit.
Some arguments were reordered to save gas on stack manipulations. Also, another public_key
loading was optimized (the
last one, in the extension handler execution path).
The simple actions handler was rewritten from IFNOT:<{ ... RETALT }>
to IFNOTJMP:<{ ... }>
. This saves some gas
(since implicit returns are cheaper), and makes the code more reliable (since we cannot be 100% sure that RETALT
will
end the execution as expected if the code will be modified in future, therefore using IFNOTJMP
eliminates this uncertainity).
Some another reordering of function arguments was done to eliminate unneccessary stack operations.
An unused result of extension dictionary checking is now carried around inside the called function in order to be dropped
later after the simple actions checker. Surprisingly, this does not impair non-test code paths at all, since the DROP
at the end of simple actions checker is merged with drop of the carried result into 2DROP
, thus having no drawbacks.
Another variable is now called around for delayed drop, this time auth_kind
, which turns 2DROP
into 3 BLKDROP
,
that is still not bad, increases gas efficiency on primary paths, and does not impair it on other ones.
HERE BE DRAGONS. This is a commit that does the optimization to the code by modifying the TVM Assembler, Asm.fif
!!
Therefore a custom func-js-bin
library funcfiftlib
is provided as a patch, that is applied by patch-package
using a postinstall
command that should be automatically executed after npm install
.
The modification is actually an extension of functionality of the Asm.fif
by allowing to specify two special functions,
that are embedded into root code cell (c3) in a specific way. Previously, the only logic present in root code cell was
to select a continuation from dictionary and execute it, that includes external and internal message handlers.
But I thought that is a big waste of gas, since it involves loading of intermediate dictionary cells and the target dictionary
code cell itself, passing of control to that new cell, that is a very substantional waste of gas. Therefore, I decided to
embed the internal and external message handlers inside the root code cell to as much extent, as possible. Right here
the cell breaking point becomes extremly important, since if main code paths get broken in the middle of execution, then
implicit JMPREF
will cause additional gas usage on loading the new code cell and doing the implicit JMPREF
itself,
therefore it was VERY important to track free bits in the root code cell, and to track where the cell is broken into pieces
by the TVM assembler, optimize and move instructions to made all the main contest paths fit into the root code cell.
N.B. Also I later attempted to make an optimized and efficient special entrypoint code that would intelligently handle the cases, but somehow it always turned out to be less efficient then just inline-calling the corresponding message handler function, therefore the entrypoint was not touched for a long time (or maybe was not touched at all later).
As for the code changes, there is now a new super special function entry_point_recv
that is executed right away
in root code cell after it determines that the incoming call is either 0
(recv_internal
) or -1
(recv_external
).
Consequently, those handlers are now inline
, so that they are inlined in entry_point_recv
, and as a result, in a
root code cell.
External and internal message processing conditions order are swapped that result in less gas usage overall. Also, some mistakes in TVM Assembler are fixed and functions were renamed so as not to accidentally compile it using an ordinary compiler.
Some additional improvements to the complex dispatch case were made to decrease the global gas counters. This did not affect the gas usage in the main test cases, but made my optimizations for friendly to the natur... to the other code branches.
Moved data (ds
) variables closer to their actual usage. Therefore it is not required to pass lots of those variables
in the arguments anymore saving some gas on stack reorganizations.
Moreover, this allows to move data variable code inbetween other code in process_signed_request
function, saving even
more code by optimizing order of operations.
I have found out a super useful SDBEGINS(Q)
TVM instruction that allows to verify the prefix of a slice against another
one (in this version of function the prefix is even conveniently embedded into the instruction code itself), and even has
a very convenient behaviour of throwing if prefix does not match (that is very convenient for external message, since
returning from it without accepting message is effectively the same as throwing an exception), and returns the slice without
that prefix is correct, that perfectly matches the previous behaviour.
As such, replacing compare and return with this instruction saves considerable amount of gas with no implications.
The quiet version of aforementioned instruction, SDBEGINSQ
exhibits even more convenient behaviour for multi-case checking
and pipelining: on the top of the stack it puts whether the prefix matched or not, that can be consumed for any kind of condition
checks, and always returns a slice after it. The great behaviour is that if the prefix matched the returned slice is stripped of
it, and if the prefix did not match, the original slice is returned. This allows to use this instruction, branch into processing
code if it matched, or use it again if did not, and keep doing that (something like a switch-case).
Therefore, I have used this instruction to check for opcode prefix in internal message processing.
It is possible to use shorter SDFIRST
instruction to check if first bit of slice is set, that saves some gas.
I have used it in checking whether to use simple action processing code, that saves some gas in each execution branch.
It is more efficient to get a 4-bit slice and check trailing bits with SDCNTTRAIL1
(it will always be non-zero
if last bit (bounced) is non-zero, and it always will be zero if it is zero - a perfect instruction to check the last bit).
Therefore by such approach checking bounced flag bit is much more effective than loading 4-bit number from slice, pushing 1
to stack, and performing the or operation.
Like with internal message prefixes, it is more efficient to use a single SDBEGINSQ
instruction to check that prefix
starts with zero and is a simple action even than preload a single uint1.
Adjusting order of instructions in extension branch allows to save some gas.