Feature Request: Snowflake ID generation #17389

DeathBorn · 2024-12-13T08:15:31Z

Feature Description

Request

We would like to add a Snowflake ID generation feature to Vitess.
Here is a working example on the v11 code base.
It can integrate nicely, just like Vitess Sequences.

Snowflake ID

A 64-bit BIGINT constructed from:
- Sign, 1 bit - always 0
- Timestamp, 41 bits - in milliseconds since a chosen epoch
- Machine ID, 10 bits - 1024 machines possible
- Sequence, 12 bits - 4096 values (0 to 4095); a local counter on each machine

Requires Snowflake generator servers and ZooKeeper for machine ID coordination - extra moving parts.
Good enough for ~69 years (2^41 ms), counted from the chosen epoch.
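To make the bit layout concrete, here is a minimal Go sketch of packing and unpacking such an ID. It assumes the conventional 41-bit timestamp, so that the fields total 64 bits and the lifetime matches the ~69 years mentioned above. The names Compose and Decompose are hypothetical and are not part of Vitess.

```go
package main

import "fmt"

// Field widths. The sign bit is implicit: keeping the timestamp
// below 2^41 guarantees the top bit of the int64 stays 0.
const (
	timestampBits = 41
	machineBits   = 10
	sequenceBits  = 12

	maxTimestamp = int64(1)<<timestampBits - 1
	maxMachineID = int64(1)<<machineBits - 1 // 1023
	maxSequence  = int64(1)<<sequenceBits - 1 // 4095
)

// Compose packs the three fields into one non-negative int64.
// msSinceEpoch is milliseconds elapsed since the chosen epoch.
func Compose(msSinceEpoch, machineID, sequence int64) (int64, error) {
	if msSinceEpoch < 0 || msSinceEpoch > maxTimestamp {
		return 0, fmt.Errorf("timestamp %d out of range", msSinceEpoch)
	}
	if machineID < 0 || machineID > maxMachineID {
		return 0, fmt.Errorf("machine ID %d out of range", machineID)
	}
	if sequence < 0 || sequence > maxSequence {
		return 0, fmt.Errorf("sequence %d out of range", sequence)
	}
	return msSinceEpoch<<(machineBits+sequenceBits) | machineID<<sequenceBits | sequence, nil
}

// Decompose splits an ID produced by Compose back into its fields.
func Decompose(id int64) (msSinceEpoch, machineID, sequence int64) {
	sequence = id & maxSequence
	machineID = (id >> sequenceBits) & maxMachineID
	msSinceEpoch = id >> (machineBits + sequenceBits)
	return
}
```

Adding the chosen epoch back to the decoded msSinceEpoch recovers the creation time of the row, which is where the created_at benefit comes from.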

How to implement

Use the Sequences code path and the same primitive.
Store the Snowflake configuration (machine_id + chosen epoch) in a table, just like Sequences do.
- It can be sharded up to 1024 shards.
- It can reload its configuration at a configured interval.
- During the first load, it will auto-initialise its machine_id using a cross-shard transaction. This is acceptable because the table is practically never written: the Snowflake algorithm does not need to store the last generated ID. This also covers the Reshard operation.
- VTTablets will be responsible for advancing the sequence and timestamp, and returning the Snowflake ID.
- VTGates
  - a random shard will receive the query select next N values from snow_table
  - the VTGate will receive timestamp + sequence + chosen epoch
  - the VTGate will generate the N required IDs
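The VTGate side of this flow could look roughly like the Go sketch below. Everything in it is an assumption made for illustration: the Reservation type, the ExpandIDs function, the idea that the machine ID travels with the response, and the rule that a block spilling past sequence 4095 continues in the next millisecond (which presumes the tablet has already advanced its own timestamp and sequence past the reserved range). None of these names exist in Vitess.

```go
package main

import "fmt"

const (
	timestampBits = 41
	machineBits   = 10
	sequenceBits  = 12

	maxTimestamp = int64(1)<<timestampBits - 1
	maxMachineID = int64(1)<<machineBits - 1
	maxSequence  = int64(1)<<sequenceBits - 1
)

// Reservation is a hypothetical shape for what a tablet might return
// for "select next N values from snow_table".
type Reservation struct {
	TimestampMs int64 // Unix milliseconds at the start of the block
	Sequence    int64 // first reserved sequence value in that millisecond
	EpochMs     int64 // chosen epoch, in Unix milliseconds
	MachineID   int64 // assumption: the shard's machine_id is also known
}

// ExpandIDs turns one reservation into n consecutive Snowflake IDs.
func ExpandIDs(r Reservation, n int) ([]int64, error) {
	if n < 0 {
		return nil, fmt.Errorf("negative count %d", n)
	}
	if r.MachineID < 0 || r.MachineID > maxMachineID {
		return nil, fmt.Errorf("machine ID %d out of range", r.MachineID)
	}
	if r.Sequence < 0 || r.Sequence > maxSequence {
		return nil, fmt.Errorf("sequence %d out of range", r.Sequence)
	}
	ms := r.TimestampMs - r.EpochMs
	if ms < 0 {
		return nil, fmt.Errorf("timestamp is before the chosen epoch")
	}
	seq := r.Sequence
	ids := make([]int64, 0, n)
	for i := 0; i < n; i++ {
		if seq > maxSequence {
			// Sequence exhausted for this millisecond: continue in the next one.
			seq = 0
			ms++
		}
		if ms > maxTimestamp {
			return nil, fmt.Errorf("timestamp overflows %d bits", timestampBits)
		}
		ids = append(ids, ms<<(machineBits+sequenceBits)|r.MachineID<<sequenceBits|seq)
		seq++
	}
	return ids, nil
}
```

Returning the raw components rather than a finished ID is what would let the chosen epoch stay configurable per table, at the cost of the gate needing to know the bit layout.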

I would like some comments on this before working on a contribution. Can I use the same Sequence primitive but return three values instead of one (timestamp + sequence + chosen epoch)? This would make the chosen epoch configurable.

Use Case(s)

Benefits:

- Generated values are time-sortable, which is good for indexes.
- Secure: it is very hard to guess the next value, which is good for business.
- Nice feature: the value contains the created_at timestamp!
- Using a public_id with UUID/NanoID/ULID requires complex app code changes and complex communication between services, and takes at least 2x more space. BIGINT is better here.


harshit-gangal · 2024-12-13T09:55:19Z

We can definitely extend sequences to support other types of sequence generators.
A plugin-based approach would be a nice way to extend it.

I do not understand the need for sharding sequences.
I think we can continue to use the existing sequence to return the first ID value and generate the rest of the ID part at VTGate.

DeathBorn · 2024-12-13T10:43:28Z

> We can definitely extend sequences to support other types of sequence generators.
> A plugin-based approach would be a nice way to extend it.
>
> I do not understand the need for sharding sequences.
> I think we can continue to use the existing sequence to return the first ID value and generate the rest of the ID part at VTGate.

Well, sharding is for:

- the machine ID part of the Snowflake algorithm
- almost no need for an extra keyspace
- multiple masters can service the requests: less impact on any one master, partial failures are less painful, and cell-aware routing could possibly be used
- from a code perspective, only a single place checks whether it has more than one shard

mattlord · 2024-12-16T15:51:52Z

@DeathBorn I think on main (which would be the relevant branch for a new feature) you would want to look at the code using the vtgate engine.Generate type:

vitess/go/vt/vtgate/engine/insert_common.go

Lines 87 to 99 in 998433c

    
    // Generate represents the instruction to generate
    // a value from a sequence.
    Generate struct {
    	Keyspace *vindexes.Keyspace
    	Query    string
    	// Values are the supplied values for the column, which
    	// will be stored as a list within the expression. New
    	// values will be generated based on how many were not
    	// supplied (NULL).
    	Values evalengine.Expr
    	// Insert using Select, offset for auto increment column
    	Offset int
    }

It also sounds like you might want to use Snowflake as a new vindex type as well, so that it provides a keyspace_id value (64 bits) and we effectively route each row to a random shard (with that shard generating the Snowflake ID / keyspace ID)? When doing so, you no longer need a "global" keyspace to house a sequence table, as the keyspace ID value (not really a global sequence, since each machine ID will have its own sequence value) is generated using the target shard primary's machine ID (which also has to be stored and managed somewhere, such as the topo server). Is that correct? If so, IMO we should talk about this as a new vindex type rather than a new sequence implementation, because:

- Vitess Sequences today are a feature meant to provide a MySQL-compatible auto_increment feature for sharded tables.
- These Snowflake IDs would not be sequential, even for an unsharded keyspace, if the machine ID is unique to each machine rather than to each shard (since the primary is not static). Instead, these are more like MySQL GTID sets in that each shard/host has its own identifier and sequence.

Am I missing or misunderstanding things?

DeathBorn · 2024-12-22T11:49:03Z

Yeah, Generate is one of the places to work on, but the NextVal query plan should be changed in VTTablets too.
Certainly, a vindex could be created too, since the machine_id can be extracted, but we still need to store the state of the last generated Snowflake ID somewhere, and from my perspective VTTablets are the best place to store that. And then we have the chosen epoch configuration too.

> Vitess Sequences today are a feature meant to provide a MySQL-compatible auto_increment feature for sharded tables.

Yeah, if sharded, Snowflake is "mostly" sequential. Still, it provides very similar MySQL index performance benefits because it is almost fully incremental. If strict order is a requirement, then only an unsharded Snowflake table can be used.

> These Snowflake IDs would not be sequential, even for an unsharded keyspace, if the machine ID is unique to each machine rather than to each shard (since the primary is not static). Instead, these are more like MySQL GTID sets in that each shard/host has its own identifier and sequence.

Since the millisecond timestamp occupies the leftmost bits, it is practically impossible for IDs to be out of order if the Snowflake table is unsharded. I think reparents can't really happen within the same millisecond.
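A small Go sketch of why the ordering behaves this way, assuming the 41/10/12-bit layout discussed in this issue (Parts and SortedByID are hypothetical names used only for illustration). Sorting by the packed value orders IDs by millisecond first; only IDs created within the same millisecond fall back to machine ID and then sequence, which is the "mostly" sequential case for a sharded table.

```go
package main

import "sort"

const (
	machineBits  = 10
	sequenceBits = 12
)

// Parts holds the three fields of one Snowflake ID.
type Parts struct {
	Ms        int64 // milliseconds since the chosen epoch
	MachineID int64
	Seq       int64
}

// pack combines the fields with the timestamp in the most significant bits.
func pack(p Parts) int64 {
	return p.Ms<<(machineBits+sequenceBits) | p.MachineID<<sequenceBits | p.Seq
}

// SortedByID returns a copy of in, ordered by packed ID value.
func SortedByID(in []Parts) []Parts {
	out := append([]Parts(nil), in...)
	sort.Slice(out, func(i, j int) bool { return pack(out[i]) < pack(out[j]) })
	return out
}
```

With a single machine ID (the unsharded case and no reparent inside one millisecond), the fallback to machine ID never comes into play, so numeric order and generation order coincide.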

If we could "pluginize" the Vitess sequence/auto_increment + NextVal feature, I believe multiple Vitess users would be thrilled: it would be a way to bring in any ID generation algorithm.

DeathBorn added the Needs Triage label Dec 13, 2024

GuptaManan100 added Type: Feature and removed Needs Triage labels Dec 16, 2024
