Feature Request: Snowflake ID generation #17389

Open
DeathBorn opened this issue Dec 13, 2024 · 4 comments

@DeathBorn
Contributor

Feature Description

Request

We would like to add a Snowflake ID generation feature to Vitess.
Here is a working example on the v11 code base.
It can integrate nicely, just like Vitess Sequences.

Snowflake ID

  • 64-bit BIGINT constructed from:
    • Sign, 1 bit - always 0
    • Timestamp, 41 bits - in milliseconds since chosen epoch
    • Machine ID, 10 bits - 1024 machines possible
    • Sequence, 12 bits - 4096 values (0-4095); local counter on each machine
  • Requires Snowflake generator servers and ZooKeeper for machine ID coordination - extra parts
  • Good enough for ~69 years - depends on chosen epoch
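
As a rough illustration of the layout above, here is a minimal Go sketch of how the fields could be packed into one int64. It assumes the classic 41-bit timestamp width (1 + 41 + 10 + 12 = 64 bits); the function and constant names are made up for this example and are not taken from the v11 prototype.

```go
package main

import "fmt"

// Field widths, assuming the classic 1 + 41 + 10 + 12 = 64 bit layout.
const (
	timestampBits = 41
	machineBits   = 10
	sequenceBits  = 12

	machineShift   = sequenceBits
	timestampShift = sequenceBits + machineBits

	maxTimestamp = 1<<timestampBits - 1
	maxMachineID = 1<<machineBits - 1  // 1023
	maxSequence  = 1<<sequenceBits - 1 // 4095
)

// composeID (hypothetical helper) packs the three fields into a positive int64.
// msSinceEpoch is the number of milliseconds since the chosen epoch.
func composeID(msSinceEpoch, machineID, sequence int64) (int64, error) {
	if msSinceEpoch < 0 || msSinceEpoch > maxTimestamp {
		return 0, fmt.Errorf("timestamp %d out of range", msSinceEpoch)
	}
	if machineID < 0 || machineID > maxMachineID {
		return 0, fmt.Errorf("machine ID %d out of range", machineID)
	}
	if sequence < 0 || sequence > maxSequence {
		return 0, fmt.Errorf("sequence %d out of range", sequence)
	}
	// The sign bit stays 0 because the timestamp never exceeds 41 bits.
	return msSinceEpoch<<timestampShift | machineID<<machineShift | sequence, nil
}

func main() {
	id, err := composeID(1, 3, 7)
	fmt.Println(id, err)

	// Lifetime of the timestamp field, in years of milliseconds.
	years := float64(maxTimestamp+1) / (1000 * 60 * 60 * 24 * 365.25)
	fmt.Printf("timestamp field lasts about %.1f years\n", years)
}
```

The lifetime printed at the end is where the "~69 years" figure comes from: it is simply 2^41 milliseconds counted from whatever epoch is chosen.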

How to implement

  • Use Sequences code path and the same primitive
  • Store the Snowflake configuration (machine_id + chosen epoch) in a table, just like Sequences do.
    • The table can be sharded across up to 1024 shards
    • It can reload its configuration at a configured interval
    • On first load, it auto-initialises its machine_id using a cross-shard transaction. That is acceptable because the table is practically never written: the Snowflake algorithm does not need to store the last generated ID. This covers the Reshard operation too.
    • VTTablets will be responsible for advancing the sequence and timestamp, and returning the Snowflake ID
    • VTGates
      • send the query select next N values from snow_table to a random shard
      • receive timestamp + sequence + chosen epoch back
      • generate the N required IDs
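
To make the tablet/gate split above more concrete, here is a small Go sketch under several assumptions: the tablet keeps only an in-memory counter, hands back a reservation (timestamp, first sequence value, machine ID), and the gate expands that into N IDs. `tabletState`, `reservation`, `reserve` and `expand` are hypothetical names for illustration, not Vitess code, and the epoch value is just an example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	sequenceBits = 12
	machineBits  = 10
	seqPerMilli  = 1 << sequenceBits // 4096 sequence values per millisecond
)

// reservation is the (hypothetical) answer a tablet returns to the gate.
type reservation struct {
	Millis    int64 // milliseconds since the chosen epoch
	FirstSeq  int64 // first sequence value of the reserved range
	MachineID int64
}

// tabletState is a hypothetical in-memory counter; nothing is persisted.
type tabletState struct {
	mu         sync.Mutex
	machineID  int64
	epochMilli int64        // chosen epoch as Unix milliseconds
	lastMillis int64        // last timestamp handed out
	nextSeq    int64        // next free sequence value within lastMillis
	now        func() int64 // clock returning Unix milliseconds
}

// reserve advances timestamp and sequence and returns a range of n values.
func (t *tabletState) reserve(n int64) (reservation, error) {
	if n < 1 || n > seqPerMilli {
		return reservation{}, errors.New("n must be between 1 and 4096")
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if ms := t.now() - t.epochMilli; ms > t.lastMillis {
		t.lastMillis, t.nextSeq = ms, 0 // a new millisecond starts a fresh counter
	}
	if t.nextSeq+n > seqPerMilli {
		// Not enough room left in this millisecond: move to the next logical one.
		t.lastMillis++
		t.nextSeq = 0
	}
	r := reservation{Millis: t.lastMillis, FirstSeq: t.nextSeq, MachineID: t.machineID}
	t.nextSeq += n
	return r, nil
}

// expand is the gate-side step: turn one reservation into n IDs.
func expand(r reservation, n int64) []int64 {
	ids := make([]int64, 0, n)
	for i := int64(0); i < n; i++ {
		id := r.Millis<<(machineBits+sequenceBits) | r.MachineID<<sequenceBits | (r.FirstSeq + i)
		ids = append(ids, id)
	}
	return ids
}

func main() {
	ts := &tabletState{
		machineID:  7,
		epochMilli: 1704067200000, // example epoch: 2024-01-01 00:00:00 UTC
		now:        func() int64 { return time.Now().UnixMilli() },
	}
	r, err := ts.reserve(3)
	if err != nil {
		panic(err)
	}
	fmt.Println(expand(r, 3))
}
```

In this sketch the tablet lets its logical timestamp run ahead of the wall clock when a millisecond is exhausted instead of sleeping; a real implementation might prefer to wait for the clock to catch up.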

I would like some comments on this before working on a contribution. Can I use the same Sequence primitive but return three values instead of one (timestamp + sequence + chosen epoch)? This would make the chosen epoch configurable.

Use Case(s)

Benefits:

  • Generated values are Time-Sortable - good for indexes
  • Secure - very hard to guess next value, so good for business
  • Nice feature - contains created_at timestamp in value!
  • Using a public_id with UUID/NanoID/ULID requires complex app code changes and complex communication between services, and takes at least 2x more space - BIGINT is better here
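
The created_at benefit can be illustrated with a short Go sketch that shifts the timestamp back out of an ID. It assumes the layout described earlier (10 machine bits and 12 sequence bits below the timestamp); `createdAt` and the epoch value are illustrative choices, not part of the proposal.

```go
package main

import (
	"fmt"
	"time"
)

// timestampShift assumes 10 machine bits + 12 sequence bits below the timestamp.
const timestampShift = 22

// createdAt (hypothetical helper) recovers the creation time embedded in an ID.
// epochMilli is the chosen epoch expressed as Unix milliseconds.
func createdAt(id, epochMilli int64) time.Time {
	return time.UnixMilli(id>>timestampShift + epochMilli).UTC()
}

func main() {
	const epochMilli = 1704067200000 // example epoch: 2024-01-01 00:00:00 UTC

	// An ID generated 1500 ms after the epoch on machine 7 with sequence 42.
	id := int64(1500)<<timestampShift | 7<<12 | 42
	fmt.Println(createdAt(id, epochMilli))
}
```

Note that decoding only works if the reader knows the chosen epoch, which is one reason to keep it in configuration rather than hard-coded.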
DeathBorn added the Needs Triage label Dec 13, 2024
@harshit-gangal
Member

We can definitely extend sequences to support other types of sequence generators.
A plugin-based approach would be a nice way to extend it.

I do not understand the need for sharding sequences.
I think we can continue to use the existing sequence to return the first ID value and generate the rest of the ID part at VTGate.
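
One way to picture the plugin-based approach mentioned here is a small registry of named generators. The Go sketch below is purely hypothetical: `IDGenerator`, `Register` and `New` do not exist in Vitess, and the only plugin shown is a plain in-memory counter standing in for today's sequence behaviour.

```go
package main

import (
	"fmt"
	"sync"
)

// IDGenerator is a hypothetical plugin interface; it is not a Vitess type.
type IDGenerator interface {
	// Next returns n freshly generated IDs.
	Next(n int) ([]int64, error)
}

var (
	registryMu sync.RWMutex
	registry   = map[string]func() IDGenerator{}
)

// Register makes a generator factory available under a name.
func Register(name string, factory func() IDGenerator) {
	registryMu.Lock()
	defer registryMu.Unlock()
	registry[name] = factory
}

// New builds the generator registered under name.
func New(name string) (IDGenerator, error) {
	registryMu.RLock()
	defer registryMu.RUnlock()
	factory, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown ID generator %q", name)
	}
	return factory(), nil
}

// counter is a plain incrementing generator used as the example plugin.
type counter struct {
	mu   sync.Mutex
	next int64
}

func (c *counter) Next(n int) ([]int64, error) {
	if n < 1 {
		return nil, fmt.Errorf("n must be positive, got %d", n)
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	ids := make([]int64, n)
	for i := range ids {
		ids[i] = c.next
		c.next++
	}
	return ids, nil
}

func init() {
	Register("sequence", func() IDGenerator { return &counter{next: 1} })
}

func main() {
	gen, err := New("sequence")
	if err != nil {
		panic(err)
	}
	ids, _ := gen.Next(3)
	fmt.Println(ids)
}
```

A Snowflake generator would register itself under another name in the same way, so callers would only ever depend on the interface.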

@DeathBorn
Contributor Author

> We can definitely extend sequences to support other types of sequence generators.
> A plugin-based approach would be a nice way to extend it.
>
> I do not understand the need for sharding sequences.
> I think we can continue to use the existing sequence to return the first ID value and generate the rest of the ID part at VTGate.

Well, sharding is for:

  • the machine ID part of the Snowflake algorithm
  • almost no need for an extra keyspace
  • multiple masters can service the requests - less impact on each master; partial failures are less painful; cell-aware routing could maybe be used as well
  • from a code perspective, only a single place needs to check whether there is more than one shard

GuptaManan100 added Type: Feature and removed Needs Triage labels Dec 16, 2024
@mattlord
Contributor

mattlord commented Dec 16, 2024

@DeathBorn I think on main (which would be the relevant branch for a new feature) you would want to look at the code using the vtgate engine.Generate type:

// Generate represents the instruction to generate
// a value from a sequence.
Generate struct {
	Keyspace *vindexes.Keyspace
	Query    string
	// Values are the supplied values for the column, which
	// will be stored as a list within the expression. New
	// values will be generated based on how many were not
	// supplied (NULL).
	Values evalengine.Expr
	// Insert using Select, offset for auto increment column
	Offset int
}

It also sounds like you might want to use SnowFlake as a new vindex type as well so that it provides a keyspace_id value (64 bits) and we effectively route each row to a random shard (with that shard generating the snowflake ID / keyspace ID)? And when doing so you no longer need a "global" keyspace to house a sequence table as the keyspace ID value (not really a global sequence as each machine ID will have its own sequence value) is generated using the target shard primary's machine ID (which also has to be stored and managed somewhere like the topo server). Is that correct? If so, IMO we should talk about this as a new vindex type rather than a new sequence implementation as:

  1. Vitess Sequences today are a feature meant to provide a MySQL compatible auto_increment feature for sharded tables
  2. These SnowFlake IDs would not be sequential, even for an unsharded keyspace if the machine ID is unique to each machine rather than shards (since the primary is not static). Instead these are more like MySQL GTID sets in that each shard/host has its own identifier and sequence.

Am I missing or misunderstanding things?

@DeathBorn
Contributor Author

Yeah, Generate is one of the places to work on, but the NextVal query plan in VTTablet would need to change too.
Certainly, a vindex could be created too, since the machine_id can be extracted from the ID, but we still need to keep the state of the last generated Snowflake ID somewhere, and from my perspective VTTablets are the best place for that. And then there is the chosen epoch configuration too.

> Vitess Sequences today are a feature meant to provide a MySQL compatible auto_increment feature for sharded tables

Yeah, if sharded, Snowflake is "mostly" sequential. It still provides very similar MySQL index performance benefits because it is almost fully incremental. If strict ordering is a requirement, then only an unsharded Snowflake table can be used.

> These SnowFlake IDs would not be sequential, even for an unsharded keyspace if the machine ID is unique to each machine rather than shards (since the primary is not static). Instead these are more like MySQL GTID sets in that each shard/host has its own identifier and sequence.

Since the millisecond timestamp occupies the leftmost bits, it is practically impossible for IDs to be out of order when the Snowflake table is unsharded. I think a reparent can't really complete within a single millisecond.
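
The ordering argument can be checked with a few lines of Go. This sketch assumes the 22-bit shift used by the layout in the issue description; `pack` is a made-up helper and the timestamps are arbitrary.

```go
package main

import "fmt"

// pack (hypothetical helper) builds an ID from milliseconds since the epoch,
// machine ID and sequence, assuming 10 machine bits and 12 sequence bits.
func pack(ms, machine, seq int64) int64 {
	return ms<<22 | machine<<12 | seq
}

func main() {
	// Unsharded case with a reparent between two milliseconds: the old primary
	// used the largest machine ID and sequence, the new one the smallest.
	before := pack(1000, 1023, 4095)
	after := pack(1001, 0, 0)
	fmt.Println(after > before) // true: the timestamp bits dominate

	// Sharded case within the same millisecond: machine 2 generates first,
	// machine 1 generates later, yet the later ID compares smaller.
	first := pack(1000, 2, 0)
	later := pack(1000, 1, 5)
	fmt.Println(later > first) // false: only "mostly" sequential
}
```

The first comparison only holds if the new primary's clock is not behind the old primary's; with clock skew across a reparent, a later ID could still carry an earlier timestamp.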

If we could "pluginize" Vitess sequence/auto_increment + NextVal feature, I believe multiple Vitess users would be thrilled - a way to bring any ID generation algorithm.
