
Rework object hash part algorithm #1284

Merged
merged 7 commits into master from rework-object-hash on Jan 15, 2017

Conversation

@svaarala (Owner) commented on Jan 13, 2017

Instead of a prime and a MOD, use a bitmask and 2^N sized hash part.
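
As a rough illustration (hypothetical names, not the actual Duktape code), the point of the 2^N sizing is that probe indices can be computed with a bitwise AND instead of a modulus against a prime-sized table:

#include <stdint.h>

#define HASH_SIZE 16U                /* must be a power of two (2^N) */
#define HASH_MASK (HASH_SIZE - 1U)   /* 0x0f */

static uint32_t probe_index(uint32_t key_hash, uint32_t probe_count) {
    /* Old approach (sketch): (key_hash + probe_count * STEP) % PRIME_SIZE. */
    /* New approach: cheap masking, no division/modulus needed, which is
     * also why the hashprime utility becomes unnecessary.
     */
    return (key_hash + probe_count) & HASH_MASK;
}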

Tasks:

  • Change hash sizing to 2^N, use bitwise mask
  • Move hash parameters to config options
  • Remove hashprime utility, as it is no longer needed
  • Config option changes
  • Default parameters for hash size limit and hash sizing (current hash limit 4 in pull is probably too low)
  • Low memory parameters for hash size limit etc (very low memory targets don't have a hash part so these don't apply)
  • Releases entry

Follow-ups:

  • Reconsider step handling: earlier large step, now +1 (more cache friendly but more clustering)
  • Maybe a good place to add small hash tables (8-bit or 16-bit) which would allow a much smaller load factor at the same memory cost

@svaarala (Owner, Author)

Just considering property lookups, a hash table always pays off, even for very small objects:

test-prop-read-1024.js              : duk.O2.prophash  3.15 duk.O2.master  3.53
test-prop-read-16.js                : duk.O2.prophash  3.14 duk.O2.master  4.08
test-prop-read-256.js               : duk.O2.prophash  3.12 duk.O2.master  4.09
test-prop-read-32.js                : duk.O2.prophash  3.13 duk.O2.master  3.56
test-prop-read-4.js                 : duk.O2.prophash  3.17 duk.O2.master  3.32
test-prop-read-48.js                : duk.O2.prophash  3.12 duk.O2.master  3.53
test-prop-read-64.js                : duk.O2.prophash  3.12 duk.O2.master  3.50
test-prop-read-8.js                 : duk.O2.prophash  3.15 duk.O2.master  3.60

However, this only holds when the same properties are read repeatedly; it ignores the cost of creating and maintaining the hash table over resizes, which matters for a lot of practical code.

I'll run some more performance tests, but a limit of 4 (= create a hash table if an object has 4 or more properties) seems too low. A good default is probably somewhere between 6 and 12; I'll run more tests to see what works best. I'll also make the limit configurable via config options so it can be tweaked more easily than it can now.
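
For illustration, the configurable limit check would look roughly like this (the config define name here is hypothetical, not necessarily what the pull ends up using):

#if !defined(DUK_USE_HOBJECT_HASH_PROP_LIMIT)
#define DUK_USE_HOBJECT_HASH_PROP_LIMIT 8  /* hypothetical default: hash part at >= 8 properties */
#endif

static int want_hash_part(unsigned int num_props) {
    /* Create a hash part only when the object has at least the configured
     * number of properties.
     */
    return num_props >= DUK_USE_HOBJECT_HASH_PROP_LIMIT;
}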

@svaarala (Owner, Author)

Here's a concrete example of code behaving exactly the opposite of the property read tests:

test-object-literal.js              : duk.O2.proplimit4  3.12 duk.O2.proplimit6  3.14 duk.O2.proplimit8  3.10 duk.O2.master  2.86

The test case creates an object literal with 20 properties, with the value immediately thrown away. The object hash table limit is 32 in master, so master avoids the overhead of creating a hash table. With limit values 4, 6, and 8 the result is naturally slower because there's the overhead of creating a (never used) hash table.

Real application code is somewhere between these two extremes: hash tables have a cost to set up, but also benefit accesses if there are more than just a few over time.

@svaarala (Owner, Author) commented on Jan 13, 2017

@fatcerberus I think I asked about this before but I don't remember what came of it -- do you think it would be possible to arrange some sort of headless Minisphere build which could "run through the motions" for some example game? I can run such a thing with a display available (I think that was a blocker before) but it'd probably be best if it didn't spend most of its time drawing stuff.

The reason I ask is that there are currently no useful application benchmarks in the set of automated tests, so I'm trying to figure out what application benchmarks to use to improve that part of commit test coverage (and hopefully lead to good merge decisions :-). Some current tests and ideas are:

  • SunSpider: it's obsolete, and probably not a very accurate application benchmark for modern out-of-browser JavaScript.
  • Google's V8 benchmark is useful, but again doesn't necessarily emphasize actual application behavior very well.
  • Kraken benchmark is useful.
  • Running a large Emscripten-compiled C program would be an interesting benchmark, but it's most likely quite one-sided in what features it stress tests.
  • The TypeScript compiler would also be an interesting case.

Anyway, Minisphere would maybe be a useful test target and would also provide you useful information about builds and their impact on Minisphere.

@fatcerberus (Contributor)

I don't think it's possible to run minisphere headlessly because its first action on startup is to create an Allegro display, which in turn needs to initialize OpenGL. When I was first implementing my Node.js-compatible require() system, I experimented with making the graphics, audio, etc. components be lazy-loaded native modules (like in Node), but scrapped the idea because I figured almost all games will end up having to require them anyway. Instead I ended up designing the core set of bindings to be as low-level as possible and built a set of easy-to-use JS modules on top of that.

For TypeScript in particular: Cell, minisphere's SCons-inspired compiler, also uses Duktape. So that would actually be a bit easier to automate than minisphere itself since it's just a matter of providing a Cellscript and then running cell from the command line.

Anyway, I can look into mocking something up where minisphere runs the Spectacles battle engine as a "smoke test" of sorts for Duktape. Currently battles require player input, but it shouldn't be too difficult to set an AI to control the player characters. I designed my AI framework to be quite flexible in that regard :)

The Specs battle engine should provide pretty decent coverage since it does a lot of different things: damage/healing calculations, calling into C (for the Sphere API), tons of stuff with first-class functions (i.e. the "from" query module), etc.

@svaarala (Owner, Author)

Right, I now have some physical hosts to run automated tests on, so it's no longer a problem if an OpenGL context gets created. But for the test result to make sense, ideally a significant share of the execution time (say 30-50% at least) would be spent in script execution. Sort of a "warp mode".

Also if there's a concept of a "frame time", measuring the frame time over some automated run might give a useful indication. This might be workable even if OpenGL output is enabled.

@fatcerberus (Contributor)

For transpilation, a useful performance test would be to have a Cellscript that looks like this:

const minify    = require('minify');
const transpile = require('transpile');

describe("SpecsMark 2017",
{
	version: 1,
	author: "Fat Cerberus",
	resolution: '320x200',
	main: 'scripts/main.js',
	// etc.
});

var scripts = transpile('tmp/transpiled/', files('src/*.js', true));
minify('@/scripts/', scripts);

install('@/images/', files('images/*.png', true));
// etc.

transpile() would run all scripts through an ES7 -> ES5 transformation using Babel, and minify() would run the output through the Babili minifier.

@fatcerberus (Contributor)

Regarding frame time: minisphere has system.now() which returns the number of frames processed (including skipped) since the game started running. Is that what you mean?

@svaarala (Owner, Author)

I mean more that, when an individual frame is processed (if that concept applies to Minisphere - it usually does for game engines :-), how much time each frame takes, on average or cumulatively. If there's a high resolution time source available, cumulative frame processing time would be a useful measure, provided it can be computed so that graphics operations are excluded.

@fatcerberus (Contributor)

Ah, I see. I actually removed all the "wall clock" timing in minisphere 4.3 in favor of a "frame perfect" API (all durations in the API are specified in frames), since wall-clock timing is more vulnerable to game lag. The engine times its frames internally (so it knows how long it can sleep between frames), but that information is not exposed to game code.

@fatcerberus (Contributor)

By the way, maybe we should open a separate issue to discuss this so we don't spam the object hash pull too much?

@svaarala (Owner, Author)

Sounds good, opened #1288.

@svaarala (Owner, Author) commented on Jan 14, 2017

Google benchmark, maximum score for 5 runs:

  • 1.3.1: 229
  • 1.5.0: 234
  • 2.0.0: 272
  • master: 293
  • hash limit 2: 304
  • hash limit 4: 305
  • hash limit 6: 310
  • hash limit 8: 309
  • hash limit 10: 306
  • hash limit 12: 310
  • hash limit 14: 309
  • hash limit 16: 307
  • hash limit 32: 306

The hash limit doesn't affect the score very strongly (6-12 maybe scores slightly higher, but I can't really be sure). It's quite likely the test doesn't use a lot of large objects, so it doesn't really shed much light on choosing a good hash limit. What's interesting is that regardless of the hash limit, this branch gets better scores than master. I can't think of any other reason than code layout effects (and the code being smaller in general).

@fatcerberus (Contributor)

Intuitively, I'd expect the typical pattern for real-world code to be that small objects with only data properties are likely to be thrown away quickly after reading one or two values from them (compound return values, e.g.), with larger objects more likely to be long-lived and accessed repeatedly--especially if those objects contain any function properties.

@svaarala (Owner, Author) commented on Jan 14, 2017

I was also thinking about the typical objects that occur; some basic categories I could come up with:

  • Small temporary objects whose values are read ~once: argument object literals, compound return values like you said.
  • Small permanent objects, like an object instance for a logger, socket, or similar. They may contain anywhere between 1 and 10 properties, with some of them accessed in roughly every operation. Small inheritance parents are also like this.
  • Large temporary objects, for example a "visited keys" structure for some tree walk. Written and read a lot but thrown away quickly.
  • Large permanent objects like big constant tables (for example, some string-to-number conversion), Math object, prototype objects in general.

There is a lot more relevant nuance beyond these of course. For example, some objects are read heavy, others are write heavy, etc.

I added a note to #1196 that it would be nice if the hash structure was spawned only if the object is actually operated on a lot. There are ways to do that e.g. using a probabilistic check so that no actual count tracking or similar would be needed.
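
As a rough sketch of the probabilistic approach (illustration only, not Duktape code): whenever a property lookup falls back to a linear key scan, the hash part could be created with a small fixed probability, so objects that are actually looked up a lot eventually get a hash part without any per-object counter:

#include <stdlib.h>

static int should_spawn_hash_part(void) {
    /* rand() is used here only for illustration; a real implementation
     * would use whatever cheap PRNG the engine already has.  With a 1/16
     * probability per linear scan, an object scanned ~16 times is likely
     * to get a hash part, while rarely used objects almost never do.
     */
    return (rand() & 0x0f) == 0;
}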

Other useful places where the hash table could be spawned are e.g. when an object is frozen, or when an object is set as a prototype of another object.

So there's a lot of scope to make better hash table decisions. I'll try to stick to the hash algorithm and parameters here :-)

Also, I added a task item for hash tables whose entries are smaller than the full 32 bits. Right now a hash table contains 32-bit entries, but that's pretty wasteful for an object of, say, 200 properties, because the entries could be 8-bit integers instead. So for desktop environments where footprint is not critical, supporting 8-bit, 16-bit, and 32-bit hash tables (or maybe just 8+32 or 16+32) would allow a much smaller load factor (and fewer collisions) for the same memory cost. But I'll probably work on that in a separate pull.
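
For illustration, picking the entry width could look roughly like this (hypothetical helper, not part of this pull; a few values per width are assumed to be reserved for unused/deleted markers):

#include <stddef.h>
#include <stdint.h>

static size_t hash_entry_width(uint32_t num_entries) {
    if (num_entries < 0xfeUL) {
        return sizeof(uint8_t);    /* 8-bit entries for small objects */
    } else if (num_entries < 0xfffeUL) {
        return sizeof(uint16_t);   /* 16-bit entries */
    } else {
        return sizeof(uint32_t);   /* 32-bit entries, as currently */
    }
}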

@svaarala (Owner, Author)

Some work related to this pull is the prototype property cache pull: if that works well, a hash part becomes less critical and could be reserved for genuinely large objects, for which the cache doesn't work well because the cost of a miss (a full key scan) is high.

Another property cache related idea I have is to use a best-effort property slot cache which is sloppier but in some ways easier/cheaper to manage (sketched in code after the list below):

  • Maintain a heap-wide table of property slot indices: duk_uint16_t slotcache[4096] for example.
  • When doing a property entry lookup for an object with no hash part, compute a lookup index as (object_pointer ^ string_hash) % 4096.
  • The value provides a potential property slot index. Validate that index against the current object by comparing the lookup key and the key at that property slot (validating the index against the current property table size of course).
  • If the lookup is valid, we can safely use the property slot because we've validated the object and the key before using the slot.
  • If the lookup is not valid, i.e. the slot doesn't exist or contains a different key, continue normally. This can happen for a variety of reasons, e.g. a collision in the index space, property deletion, or the object property table having been resized or compacted.
  • When the actual lookup has been done, overwrite the slot cache entry so that a repeated lookup will now be valid.
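
A minimal sketch of the slot cache idea in C (the types and names are simplified placeholders, not the actual Duktape structures):

#include <stddef.h>
#include <stdint.h>

#define SLOTCACHE_SIZE 4096U

typedef struct {
    const char *key;   /* interned key, so pointer comparison suffices */
    /* value omitted for brevity */
} prop_entry;

typedef struct {
    prop_entry *entries;
    uint32_t num_entries;
} object;

static uint16_t slotcache[SLOTCACHE_SIZE];  /* heap-wide, tentative slot indices */

static uint32_t cache_index(const object *obj, uint32_t key_hash) {
    /* Equivalent to % 4096 because the cache size is a power of two. */
    return ((uint32_t) (uintptr_t) obj ^ key_hash) & (SLOTCACHE_SIZE - 1U);
}

/* Returns the property slot index, or -1 if the key is not found. */
static int lookup_prop_slot(object *obj, const char *key, uint32_t key_hash) {
    uint32_t ci = cache_index(obj, key_hash);
    uint32_t slot = slotcache[ci];
    uint32_t i;

    /* Validate the tentative slot: it must be in range and hold the key. */
    if (slot < obj->num_entries && obj->entries[slot].key == key) {
        return (int) slot;
    }

    /* Stale or missing cache entry: fall back to the normal linear scan. */
    for (i = 0; i < obj->num_entries; i++) {
        if (obj->entries[i].key == key) {
            slotcache[ci] = (uint16_t) i;  /* repeated lookup will now hit */
            return (int) i;
        }
    }
    return -1;
}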

This should work relatively well for a few reasons:

  • It emphasizes caching of actually looked up object/key pairs. No upfront work is done for maintaining hash tables in advance.
  • There's no GC impact, i.e. no need to explicitly or implicitly invalidate entries due to object changes, because the slot index is just tentative anyway.
  • The slot cache entry is just an integer (16 bits should be enough, and even 8 bits might be enough if objects with >= 256 entries get a hash part), which is much denser than the prototype property cache entry, which has 4 separate fields.

A property/slot cache is still not a replacement for a hash part for large objects: if an object is large, and most of its properties are accessed over and over again, it takes a lot of linear scans to populate the cache as compared to O(1) lookups from the hash. Maintaining the hash table is cheaper than re-populating the cache at least for very large objects.

I'll prototype this in a separate branch. It may be a valid alternative to the (more complicated) property cache pull because the property cache entry is larger, and requires careful invalidation which can be tricky to get right.

@svaarala force-pushed the rework-object-hash branch 2 times, most recently from 95c2f69 to 75e397f on January 14, 2017 23:24
Make the hash algorithm simpler by using a bit mask rather than a modulus for
probing the hash.

Make the hash part load factor lower than before to reduce clustering.  Low
memory environments disable hash part support anyway, so this doesn't impact
them.
@svaarala merged commit 299efb0 into master on Jan 15, 2017
@svaarala deleted the rework-object-hash branch on January 15, 2017 01:21