Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lua interop #626

Closed
wants to merge 8 commits into from
Closed

lua interop #626

wants to merge 8 commits into from

Conversation

cldellow
Copy link
Contributor

Replaces #604

This PR reduces overhead when calling between Lua and C++:

  • use the global namespace, e.g. Find() vs way:Find()
  • use a specialized class for looking up tags
  • don't use string as arguments when calling Find() or Holds()

Also some non-Lua related improvements to reduce lock contention on machines with many cores

Currently, Tilemaker uses member functions for interop:

```lua
function node_function(node)
    node:Layer(...)
```

This PR changes Tilemaker to use global functions:

```lua
function node_function()
    Layer(...)
```

The chief rationale is performance. Every member function call needs to
push an extra pointer onto the stack when crossing the Lua/C++ boundary.

Kaguya serializes this pointer as a Lua userdata. That means every
call into Lua has to malloc some memory, and every call back from Lua
has to dereference through this pointer.

And there are a lot of calls! For OMT on the GB extract, I counted
~1.4B calls from Lua into C++.

A secondary rationale is that a global function is a bit more honest.
A user might believe that this is currently permissible:

```lua
last_node = nil
function node_function(node)
    if last_node ~= nil
        -- do something with last_node
    end

    -- save the current node for later, for some reason
    last_node = node
```

But in reality, the OSM objects we pass into Lua don't behave quite
like Lua objects. They're backed by OsmLuaProcessing, who will move
on, invalidating whatever the user thinks they've got a reference to.

This PR has a noticeable decrease in reading time for me, measured
on the OMT profile for GB, on a 16-core computer:

Before:
```
real	1m28.230s
user	19m30.281s
sys	0m29.610s
```

After:
```
real	1m21.728s
user	17m27.150s
sys	0m32.668s
```

The tradeoffs:

- anyone with a custom Lua profile will need to update it, although the
  changes are fairly mechanical

- Tilemaker now reserves several functions in the global namespace,
  causing the potential for conflicts
Cherry-picked from
systemed@b322166,
systemed@5c807a9,
systemed@13b3465
and fixed up to work with protozero's data_view structure.

Original commit messages below, the timings will vary but the idea is
the same:

Faster tagmap
=====

Building a std::map for tags is somewhat expensive, especially when
we know that the number of tags is usually quite small.

Instead, use a custom structure that does a crappy-but-fast hash
to put the keys/values in one of 16 buckets, then linear search
the bucket.

For GB, before:
```
real 1m11.507s
user 16m49.604s
sys 0m17.381s
```

After:
```
real	1m9.557s
user	16m28.826s
sys	0m17.937s
```

Saving 2 seconds of wall clock and 20 seconds of user time doesn't
seem like much, but (a) it's not nothing and (b) having the tags
in this format will enable us to thwart some of Lua's defensive
copies in a subsequent commit.

A note about the hash function: hashing each letter of the string
using boost::hash_combine eliminated the time savings.

Faster Find()/Holds()
=====

We (ab?)use kaguya's parameter serialization machinery. Rather than
take a `std::string`, we take a `KnownTagKey` and teach Lua how to
convert a Lua string into a `KnownTagKey`.

This avoids the need to do a defensive copy of the string when coming
from Lua.

It provides a modest boost:

```
real 1m8.859s
user 16m13.292s
sys 0m18.104s
```

Most keys are short enough to fit in the small-string optimization, so
this doesn't help us avoid mallocs. An exception is `addr:housenumber`,
which, at 16 bytes, exceeds g++'s limit of 15 bytes.

It should be possible to also apply a similar trick to the `Attribute(...)`
functions, to avoid defensive copies of strings that we've seen as keys
or values.

avoid malloc for Attribute with long strings
=====

After:

```
real	1m8.124s
user	16m6.620s
sys	0m16.808s
```

Looks like we're solidly into diminishing returns at this point.
On a 48-core machine, I still see lots of lock contention.
AttributeStore:add is one place.

Add a thread-local cache that can be consulted without taking the shared
lock. The intuition here is that there are 1.3B objects, and 40M
attribute sets. Thus, on average, an attribute set is reused 32 times.

However, average is probably misleading -- the distribution is likely not
uniform, e.g. the median attribute set is probably reused 1-2 times, and
some exceptional attribute sets (e.g. `natural=tree` are reused thousands of times).

For GB on a 16-core machine, this avoids 27M of 36M locks.
On a 48-core machine, this phase currently achieves only 400% CPU usage,
I think due to these locks
@cldellow cldellow mentioned this pull request Dec 28, 2023
@cldellow
Copy link
Contributor Author

Merged into v3 branch: e768595

@cldellow cldellow closed this Jan 12, 2024
cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 12, 2024
When I replaced systemed#604 with systemed#626, I botched extracting this part of the
code. I had the trait, which taught kaguya how to serialize
`PossiblyKnownTagValue`, but I missed updating the parameter type
of `Attribute` to actually use it, so it was a no-op.

This PR restores the behaviour of avoiding string copies, but now that
we have protozero's data_view class, we can use that rather than
our own weirdo struct.
cldellow added a commit to cldellow/tilemaker that referenced this pull request Jan 12, 2024
When I replaced systemed#604 with systemed#626, I botched extracting this part of the
code. I had the trait, which taught kaguya how to serialize
`PossiblyKnownTagValue`, but I missed updating the parameter type
of `Attribute` to actually use it, so it was a no-op.

This PR restores the behaviour of avoiding string copies, but now that
we have protozero's data_view class, we can use that rather than
our own weirdo struct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant