bytes for loop performance #1663

Open
awelzel opened this issue Jan 25, 2024 · 1 comment

Comments

@awelzel
Contributor

awelzel commented Jan 25, 2024

In the context of zeek/zeek#3580 it was observed that a for loop over a bytes object to XOR-decode the content takes ~30% of the total runtime in that setting. This ticket is meant as a generic "speed up bytes iteration" request.

The test iterates 500k times over a 134-byte bytes instance in %init. The SPICY_THREADED=1 version creates a short-lived thread in spicy-driver's main to prevent glibc from using its optimized non-atomic shared_ptr reference counting (#1600). This better reflects the performance within an application like Zeek, and it makes performance significantly worse. Python is included as a baseline and executes significantly faster.

Not sure how relevant this is; looping over large bytes values might not be a common use case.

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `echo "x" \| ../build/bin/spicy-driver loop.hlto` | 4.922 ± 0.022 | 4.903 | 4.945 | 1.76 ± 0.01 |
| `echo "x" \| SPICY_THREADED=1 ../build/bin/spicy-driver loop.hlto` | 7.147 ± 0.015 | 7.132 | 7.161 | 2.56 ± 0.01 |
| `python3 loop.py` | 2.791 ± 0.007 | 2.783 | 2.797 | 1.00 |

The non-DWARF flamegraph (I suspect std::shared_ptr is responsible for the frames above the flat bars, as in zeek/zeek#3580):

(flamegraph image: flame)

```spicy
# loop.spicy
module Loop;

global xs: bytes = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";

public type X = unit {
    on %init {
        local i = 0;
        local sum = 0;
        while ( i < 500000 ) {
            for ( c in xs )
                sum += c;
            i++;
        }
        print "sum", sum;
    }
};
```
```python
# loop.py
xs = b"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"


def main():
    i = 0
    s = 0
    while i < 500000:
        for c in xs:
            s += c
        i += 1

    print("sum", s)


if __name__ == "__main__":
    main()
```
@bbannier
Member

I think the more general issue is that iterations over collections of known size can be expensive since we always emit checks at runtime. In general this makes sense: even with a range-based for loop, the underlying collection could be changed under the covers (e.g., imagine if every second iteration emptied the collection).

To be faster we would need a way to detect that the collection itself is not modified. This is non-trivial to do automatically since we'd have to make sure that no called code modifies the collection (e.g., changes element order, or removes elements); this could even happen in called functions, which do not even need to be implemented in Spicy, so we might be unable to inspect their bodies. I think we talked in the past about collection methods like transform or for_each which would either compute a new collection or modify collection elements in place, but the function argument (or, since Spicy currently doesn't have them, a hypothetical closure argument) could still modify the collection itself, so I fear that approach would still not allow faster iteration. It looks like the only way to accomplish this is fairly global data flow analysis.

If such iteration is a bottleneck for you right now, you could implement a custom C++ function which uses unchecked iteration, after making sure you can uphold the invariants required for that.
