
Use C code from pgsql-hackers #15

Open · wants to merge 1 commit into main
Conversation


@x4m x4m commented Sep 17, 2023

This code has several advantages:

  1. Implements a counter per the standard draft. The counter protects against collisions better than random numbers when UUIDs are generated at high speed.
  2. Buffers randomness. This greatly improves generation speed. Try `select uuid_generate_v7() from generate_series(1,1E6);`
  3. Avoids pghton(), because it does several unnecessary manipulations.
  4. Avoids initializing the timestamp part with randomness. This helps to save some entropy. Make the world greener :)
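The counter idea in point 1 can be sketched roughly as follows. This is a hypothetical standalone illustration, not the patch code: it uses the 12-bit `rand_a` field of the draft (now RFC 9562) as the counter, with the helper name `uuid_v7_sketch` and its parameters invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the draft's counter method, not the actual patch:
 * a small counter sits right after the 48-bit millisecond timestamp, so
 * UUIDs generated within the same millisecond still sort in generation
 * order. */
static uint64_t prev_ms = 0;
static uint32_t counter = 0; /* lives in the 12-bit rand_a field here */

void uuid_v7_sketch(unsigned char out[16], uint64_t ms, uint32_t rand12,
                    const unsigned char rand_tail[8])
{
    if (ms == prev_ms)
        counter = (counter + 1) & 0x0FFF; /* increment within the same ms */
    else {
        prev_ms = ms;
        counter = rand12 & 0x07FF; /* random start, top bit 0 to delay overflow */
    }
    /* 48-bit big-endian Unix-epoch millisecond timestamp */
    for (int i = 0; i < 6; i++)
        out[i] = (unsigned char)(ms >> (40 - 8 * i));
    /* version 7 in the high nibble, counter in the low 12 bits */
    out[6] = (unsigned char)(0x70 | (counter >> 8));
    out[7] = (unsigned char)(counter & 0xFF);
    /* remaining 64 bits are random, with the variant bits forced to 10 */
    memcpy(out + 8, rand_tail, 8);
    out[8] = (unsigned char)(0x80 | (out[8] & 0x3F));
}
```

Two calls in the same millisecond then differ only in the counter bytes, so a plain byte-wise comparison preserves generation order.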

The code is based on the PG patch https://commitfest.postgresql.org/43/4388/ . I have my own repo, https://github.com/x4m/pg_uuid_next , but I do not have the energy to support it like you do, so I decided to bring some code from there to your implementation.

Have a nice day :)

fboulnois (Owner) commented

@x4m, thanks for your PR! I also read your thread on the pgsql-hackers list.

A few thoughts:

  1. As I outline in my comments on the RFC, I don't really like the alternate pseudorandom constructions for version 7 UUIDs. From a philosophical perspective, one of the goals of this extension is to keep code as simple as possible so it is easy to reason through, and I think the alternate constructions add too much surface area to get things wrong. From a technical perspective, the counter uses up bits in the random data but I am not sure the tradeoff is worthwhile. With 72 bits of randomness you can generate billions of UUIDs every ms without collisions.
  2. This is smart! However, I wonder if the cache should live in the pg_strong_random layer instead? This is what other high quality cryptography libraries seem to do. That way, this would benefit every Postgres instance and application in the world. Talk about making the world greener!
  3. pg_hton64 is ultimately an alias for bswap64, which will get compiled on supported architectures as a single instruction. Are there benchmarks that show that manual bit swapping is faster?
  4. The truth is that I didn't want to fill the UUID at an arbitrary location to avoid buffer overflow type issues, but this is fair to point out.

All in all, I'm unlikely to add 1 or 3. If I benchmark the cache in 2 and it provides a large performance improvement, I may implement something similar. 4 seems reasonable if I can verify and test there aren't any off-by-one errors that would lead to buffer overflows.

Cheers.


x4m commented Sep 18, 2023

  1. The idea behind a counter is to keep serially generated UUIDs sorted. A randomly initialized counter still guards against global collisions, prevents local collisions even better, and keeps data sorted for better ingestion speed.
     Indeed, 128 bits is a lot, enough to solve most uniqueness problems. But using some of these bits for a counter solves the uniqueness problem even better, while also providing better sortability.
  2. Postgres uses OpenSSL, though I did not benchmark the feature with this library. When OpenSSL is absent, a simple `f = open("/dev/urandom", O_RDONLY, 0);` consumes all the time...
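The /dev/urandom cost mentioned in point 2 is exactly what buffering amortizes. A minimal standalone sketch of the idea, assuming a POSIX system; this is not the extension's actual cache, and `buffered_random` and the pool size are invented for the example:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative sketch: instead of one read() from /dev/urandom per UUID,
 * refill a larger pool and hand out slices, amortizing the syscall cost
 * across many generations. */
#define RAND_POOL 1024
static unsigned char pool[RAND_POOL];
static size_t pool_pos = RAND_POOL; /* forces a refill on first use */

int buffered_random(unsigned char *dst, size_t n)
{
    if (pool_pos + n > RAND_POOL) {
        int fd = open("/dev/urandom", O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t got = read(fd, pool, RAND_POOL);
        close(fd);
        if (got != RAND_POOL)
            return -1;
        pool_pos = 0;
    }
    memcpy(dst, pool + pool_pos, n);
    pool_pos += n;
    return 0;
}
```

With a 1 KiB pool and 16 bytes consumed per UUID, one syscall serves 64 generations instead of one.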

fboulnois mentioned this pull request on Sep 19, 2023
fboulnois (Owner) commented

I'm still undecided on implementing something similar to 2. After several benchmarks, the performance gain is about 10% on a Postgres install with OpenSSL. This is pretty good, but I still feel that the functionality probably shouldn't live in this extension. I'm going to keep considering it, however.

sergeyprokhorenko commented

@fboulnois, could you please publish Andrey Borodin's (@x4m) extension? Many people have been waiting for his patch, but unfortunately the patch was recently suspended because of the feature freeze for PostgreSQL 17, just before the final approval of the new RFC by its authors.

fboulnois (Owner) commented

> @fboulnois, could you please publish Andrey Borodin's (@x4m) extension? Many people have been waiting for his patch, but unfortunately the patch was recently suspended because of the feature freeze for PostgreSQL 17, just before the final approval of the new RFC by its authors.

At this point, the only substantive difference between x4m's patch and my extension is the alternate pseudorandom construction, which I've already discussed in #15 (comment). Cheers!


x4m commented Apr 16, 2024

I think @sergeyprokhorenko meant exactly this: let's add a counter or microseconds.
The thing is, if you do

```sql
select i, gen_uuid_v7() id from generate_series(1,1000) i
```

the order of i will differ from the order of id.
This is not super critical, but fixing it actually makes life better in many scenarios.


sergeyprokhorenko commented Apr 16, 2024

I meant adding a counter 18 bits long, initialized every millisecond with a random number. The most significant bit of the counter is initialized to zero to prevent the counter from overflowing when the most significant bits of the random number are ones.
All advanced implementations use a counter.

This promotes sortability, additional monotonicity, and better database performance. It can also make it easier to find bugs by knowing the latest version of a record.
The name of the function with a counter could be the common uuidv7().
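That initialization rule can be sketched as follows. The helper names (`counter_init`, `counter_overflowed`) and masks are invented for the illustration; only the rule itself comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed rule: an 18-bit counter whose most significant
 * bit is forced to zero at initialization, so at least 2^17 increments
 * always fit within one millisecond before the field can overflow. */
#define COUNTER_BITS 18
#define COUNTER_MASK ((1u << COUNTER_BITS) - 1)            /* 0x3FFFF */
#define COUNTER_INIT_MASK ((1u << (COUNTER_BITS - 1)) - 1) /* 0x1FFFF */

uint32_t counter_init(uint32_t random18)
{
    /* keep 17 random bits, zero the top bit of the 18-bit field */
    return random18 & COUNTER_INIT_MASK;
}

int counter_overflowed(uint32_t c)
{
    return c > COUNTER_MASK;
}
```

Even with the worst-case random start of 0x1FFFF, another 2^17 - 1 increments still fit in the 18-bit field, which is where the guaranteed headroom comes from.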

fboulnois (Owner) commented

Different projects use different counter lengths. Why use 18? I see other projects use 14, 16, 24, 40, 42, and 48 bits for the counter size.


x4m commented Jun 9, 2024

Just a sane number: 125 MHz without overflow, a tradeoff between max frequency and predictability. The detailed explanation was in -hackers, AFAIR.


sergeyprokhorenko commented Jun 10, 2024

> Different projects use different counter lengths. Why use 18? I see other projects use 14, 16, 24, 40, 42, and 48 bits for the counter size.

Too short a counter makes counter overflow possible under peak loads. This can be mitigated by using the timestamp as the counter in such a situation, but that is intuitively undesirable. On the other hand, a short counter theoretically allows faster UUID comparison if UUIDs are compared over only the timestamp + counter segment (this is not the case now) rather than over the entire UUID length.

With 18 bits, counter overflow is guaranteed to be impossible if the high-order bit of the counter is initialized to zero every millisecond and the remaining counter bits are initialized to a random number (at least 2^17 increments then fit within each millisecond).

Too long a counter reduces the random segment of the UUID, and therefore worsens the unguessability of the UUID against brute-force guessing. But a long counter theoretically allows fewer resources to be spent on random number generation.

RFC 9562 recommends that the counter SHOULD be at least 12 bits but no longer than 42 bits.


sergeyprokhorenko commented Jun 10, 2024

UUIDv7 generation function in ClickHouse DBMS

https://clickhouse.com/docs/en/sql-reference/functions/uuid-functions#generateUUIDv7

ClickHouse/ClickHouse#62852

For the sake of monotonicity, when several microservices access the UUIDv7 generator in parallel, a single UUIDv7 generator is used per server.


sergeyprokhorenko commented Jun 11, 2024

> Just a sane number: 125MHz without overflow, tradeoff between max frequency and predictability. Detailed explanation was in hackers AFAIR.

See https://www.postgresql.org/message-id/84D20D0F-B5FF-41CD-9F48-E282CE9FEC1D%40yandex-team.ru

and https://www.postgresql.org/message-id/F91948DD-500A-4A22-ABB9-5F4C59C28851%40yandex-team.ru

The standard does not specify how big the counter should be. PFA a patch with a 16-bit counter. Maybe it is worth doing an 18-bit counter - it would save us one byte of PRNG data. Currently we take only 2 bits out of the whole random byte.


trim21 commented Oct 24, 2024

I'm new to pgsql native extensions.

Should sequence_counter and previous_timestamp be protected by a mutex?
Or does pg guarantee that uuid_generate_v7 won't be called in parallel?

```sql
AS 'MODULE_PATHNAME', 'uuid_generate_v7'
VOLATILE STRICT LANGUAGE C PARALLEL SAFE;
```

It's PARALLEL SAFE here, so I guess we need some kind of mutex?
