feat: cyclotron #24228

Merged
bretthoerner merged 47 commits into master from oliver_cyclotron_lib on Aug 21, 2024

Conversation

oliverb123
Contributor

Problem

We want our delivery system (aka "the CDP" or "destinations" or probably other names too, I find it hard to keep up) to deliver (insert very large number here) of events per second. This is the start of a PG-based, sharded job queue system that's intended to let us do that (while managing QoS on a per-user or per-endpoint basis, and doing other fancy tricks kafka makes difficult). The underlying implementation is written in rust, but it's designed to be easy to expose bindings to other languages, so projects or teams favouring TS/JS (e.g. the hog folks) or python (I haven't been in the company long enough to call out anyone specific) can still interact with our "delivery engine" (queue work, manage work for a given team or function or endpoint or queue or whatever, or even ship a worker to consume jobs from the queue, if needed).
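
To make the intended shape concrete, here's a rough sketch of the producer side (struct, table and column names here are illustrative guesses, not the final cyclotron-core API; assumes sqlx with the postgres, uuid and chrono features):

// Illustrative only - names are guesses pieced together from the snippets in this PR.
use chrono::{DateTime, Utc};
use uuid::Uuid;

pub struct JobInit {
    pub team_id: i32,
    pub queue_name: String,         // e.g. "fetch", or a hog queue
    pub priority: i16,              // per-team / per-function QoS knob
    pub scheduled: DateTime<Utc>,   // earliest time a worker may dequeue this
    pub function_id: Option<Uuid>,
    pub parameters: Option<String>, // opaque JSON blob, owned by the worker type
}

// A producer (in Rust, or via a TS/python binding wrapping this) just inserts a row;
// the sharding layer would eventually decide which PG instance the row lands on.
pub async fn create_job(pool: &sqlx::PgPool, init: JobInit) -> Result<Uuid, sqlx::Error> {
    let id = Uuid::new_v4();
    sqlx::query(
        "INSERT INTO cyclotron_jobs
            (id, team_id, queue_name, priority, scheduled, function_id, parameters, state, created)
         VALUES ($1, $2, $3, $4, $5, $6, $7, 'available', NOW())",
    )
    .bind(id)
    .bind(init.team_id)
    .bind(init.queue_name)
    .bind(init.priority)
    .bind(init.scheduled)
    .bind(init.function_id)
    .bind(init.parameters)
    .execute(pool)
    .await?;
    Ok(id)
}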

Right now only the PG part is in place here, the sharding is mostly signs in the ground labelled "draw the rest of the F'ing owl", and the management commands I first ship with will be executed directly on the DB (or DBs) by the issuing manager instance, even though the intention is for them to be pushed onto a kafka queue and for each shard to individually manage its control tables. Also I need to do a bunch of query optimisation (although one of the nice things here is that, because the DB is sharded, we can afford to be a bit inefficient with our DB ops, since if we need more throughput we just spin up more shards), and so on and so on and so on. It's v0, it's going to be a little rough around the edges.

You might notice this looks a bit like rustyhook. You'd be correct, we stole a fair amount of both inspiration and literal SQL from it while sketching this out. We didn't try to simply extend rustyhook because the queue implementation there made a series of design choices that make it unsuitable for 1) being embedded in other languages 2) having more than 1 type of worker operate atop it and 3) supporting the kind of work-management features we expect the delivery system to need. These were good decisions - rustyhook was designed to only use its own queue internally for retries, needed to ship, and was /definitely/ never meant to be embedded in other languages - but our delivery solution is maturing and growing in complexity, and its needs have outgrown rustyhook. Onwards and upwards.

@bretthoerner bretthoerner left a comment

Looks like a good starting point, nice work.

I have to admit after seeing the code it's not super obvious what Rust is doing for us. 😓

But I guess being able to write workers like fetch in it will be nice.

@@ -0,0 +1 @@
Ripped from rusty-hook, since it'll be used across more or less all cyclotron stuff, as well as rustyhook
Contributor

total nit, but this seems more like a PR comment than a README

-- TODO - I go back and forth on whether this should just be an open text field,
-- rather than an enum - that makes it faster to add new kinds of workers to the
-- system (since you don't have to bump library versions for anything consuming the
-- cyclotron-core crate), but having a defined set of workers means you can spin up
Contributor

You'd only have to bump it for producers/consumers of that queue anyway, right? Seems fine.
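
For illustration, the tradeoff being discussed looks roughly like this on the Rust side (the enum name is taken from the code later in this PR, but the variants and string mapping are assumptions):

// Sketch of the "defined set of workers" option. Adding a new worker kind means adding a
// variant here and cutting a new release of cyclotron-core (plus an ALTER TYPE on the PG
// enum), whereas an open text column would only need the producer and consumer of the new
// queue to agree on a string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitingOn {
    Fetch,
    Hog,
}

impl WaitingOn {
    pub fn as_str(&self) -> &'static str {
        match self {
            WaitingOn::Fetch => "fetch",
            WaitingOn::Hog => "hog",
        }
    }
}

impl std::str::FromStr for WaitingOn {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "fetch" => Ok(WaitingOn::Fetch),
            "hog" => Ok(WaitingOn::Hog),
            other => Err(format!("unknown worker kind: {}", other)),
        }
    }
}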

created TIMESTAMPTZ NOT NULL,
-- Queue bookkeeping - invisible to the worker
lock_id UUID, -- This is set when a job is in a running state, and is required to update the job.
last_heartbeat TIMESTAMPTZ, -- This is updated by the worker to indicate that the job is making forward progress even without transitions (and should not be reaped)
Contributor

Just to note, as discussed on the call, if we think many jobs will run long enough to require a heartbeat (beyond the initial dequeue) I think we'd save a lot by heartbeating a single session (unique per worker process).
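
A sketch of that per-process approach, purely illustrative (the table and column names are taken from the schema snippet above, everything else is assumed): one background task refreshes last_heartbeat for every job the process currently holds, instead of each job heartbeating its own row.

use std::sync::{Arc, Mutex};
use std::time::Duration;
use uuid::Uuid;

// One heartbeat task per worker process: refresh last_heartbeat for every job this
// process currently holds, rather than issuing an update per job.
async fn heartbeat_loop(pool: sqlx::PgPool, held_jobs: Arc<Mutex<Vec<Uuid>>>) {
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;
        let ids: Vec<Uuid> = held_jobs.lock().unwrap().clone();
        if ids.is_empty() {
            continue;
        }
        let res = sqlx::query("UPDATE cyclotron_jobs SET last_heartbeat = NOW() WHERE id = ANY($1)")
            .bind(ids)
            .execute(&pool)
            .await;
        if let Err(e) = res {
            tracing::warn!("heartbeat failed: {}", e);
        }
    }
}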

// rather than doing that, it could just put the job in a "dead letter" state, and no worker or janitor process
// will touch it... maybe the table moving isn't needed? but either way, being able to debug jobs that cause workers
// to stall would be good (and, thinking about it, moving it to a new table means we don't have to clear the lock,
// so have a potential way to trace back to the last worker that died holding the job)
Contributor

Or the entire row could be written outside of the DB, even just to logs in the short term.

Anyway, this is a good idea. Poison pills really hurt us in the txn-based Rusty-Hook, and this will be a nice safety net.

let oldest_valid_heartbeat = Utc::now() - timeout;
// NOTE - we don't check the lock_id here, because it probably doesn't matter (the lock_id should be set if the
// job state is "running"), but perhaps we should only delete jobs with a set lock_id, and report an error
// if we find a job with a state of "running" and no lock_id. Also, we delete jobs whose last_heartbeat is
Contributor

Yeah, it could be nice for the janitor to check some of our invariants and report anything suspicious. Doesn't have to be now though.
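
For illustration, that janitor pass could look something like this (a sketch only; table and column names are assumed from the snippets in this PR, and the queries are not the actual implementation):

use chrono::{Duration, Utc};

// Janitor sweep: reap jobs whose heartbeat has gone stale, and separately report
// rows that violate the "running implies locked" invariant discussed above.
async fn janitor_sweep(pool: &sqlx::PgPool, timeout: Duration) -> Result<(), sqlx::Error> {
    let oldest_valid_heartbeat = Utc::now() - timeout;

    let reaped = sqlx::query(
        "DELETE FROM cyclotron_jobs WHERE state = 'running' AND last_heartbeat < $1",
    )
    .bind(oldest_valid_heartbeat)
    .execute(pool)
    .await?
    .rows_affected();

    // Invariant check: a running job should always hold a lock_id.
    let suspicious: i64 = sqlx::query_scalar(
        "SELECT COUNT(*) FROM cyclotron_jobs WHERE state = 'running' AND lock_id IS NULL",
    )
    .fetch_one(pool)
    .await?;

    if suspicious > 0 {
        tracing::error!("{} running jobs have no lock_id", suspicious);
    }
    tracing::info!("janitor reaped {} stale jobs", reaped);
    Ok(())
}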

// All dequeued job IDs that haven't been flushed yet. The idea is this lets us
// manage, on the rust side of any API boundary, the "pending" update of any given
// job, such that a user can progressively build up a full update, and then flush it,
// rather than having to track the update state on their side and submit it all at once
Contributor

Interesting, I'll have to read more code but it's surprising to me that a worker would even have lots of little updates to do for a job, rather than a single simple update at the end.

Contributor Author

Most won't, but e.g. in the fetch impl I have a pattern of queuing up the set of updates that would put a job into the dead letter queue, then doing some serde or other "this should never fail, and if it does that's a coding error in the fetch worker" work. If any of that causes an error I just flush the update (sending the job to the DLQ); otherwise I queue up the updates to send it back to an available state (or whatever) and flush.

It looks something like this (ripped from in-progress code, but you maybe get the idea):

// Complete the job, either because we got a good response, or because the job's retries
// have been exceeded.
pub async fn complete_job(
    worker: &Worker,
    job: &Job,
    return_worker: WaitingOn,
    return_queue: Option<String>,
    on_finish: OnFinish,
    result: FetchResult,
) -> Result<(), WorkerError> {
    // Pre-set the DLQ state up front: if any of the serde below fails, we can just
    // flush and bail, and the job lands in the dead letter queue.
    worker.set_state(job.id, JobState::Available)?;
    worker.set_queue(job.id, DEAD_LETTER_QUEUE)?;

    let is_completed = result.is_completed();

    let result = match serde_json::to_string(&result) {
        Ok(r) => r,
        Err(e) => {
            // Leave behind a hint for debugging
            worker.set_metadata(job.id, Some(format!("Failed to serialise result: {}", e)))?;
            worker.flush_job(job.id).await?;
            return Err(WorkerError::SerdeError(e));
        }
    };

    worker.set_queue(
        job.id,
        &return_queue.unwrap_or_else(|| job.queue_name.clone()),
    )?;
    worker.set_waiting_on(job.id, return_worker)?;

    match (is_completed, on_finish) {
        (true, _) | (false, OnFinish::Return) => {
            worker.set_state(job.id, JobState::Available)?;
        }
        (false, OnFinish::Complete) => {
            worker.set_state(job.id, JobState::Failed)?;
        }
    }

    worker.set_parameters(job.id, Some(result))?;
    worker.set_metadata(job.id, None)?; // We're finished with the job, so clear our internal state
    worker.flush_job(job.id).await?;

    Ok(())
}

@bretthoerner bretthoerner Aug 13, 2024

Just my opinion, but following what state everything is in here seems like more of a burden than just having a happy path that sets things at the end, throwing an error otherwise, and having a wrapper function that catches errors and sets the state properly in that case.

Being N lines into a function and having to remember "OK, we set it to the DLQ at the start of the fn, so if X happens then Y will happen" only gets worse over time, in my experience.
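
For illustration, the wrapper shape being suggested might look something like this (a sketch only, reusing the types and setters from the snippet above; run_happy_path is a hypothetical stand-in for the serde/response handling):

// Sketch of the suggested alternative: the happy path just returns an error on
// failure, and a single wrapper owns the "send it to the DLQ" bookkeeping.
pub async fn complete_job_wrapped(
    worker: &Worker,
    job: &Job,
    result: FetchResult,
) -> Result<(), WorkerError> {
    match run_happy_path(worker, job, &result).await {
        Ok(()) => Ok(()),
        Err(e) => {
            // All failure bookkeeping lives here, instead of being pre-set at the
            // top of the happy path.
            worker.set_state(job.id, JobState::Available)?;
            worker.set_queue(job.id, DEAD_LETTER_QUEUE)?;
            worker.set_metadata(job.id, Some(format!("job failed: {}", e)))?;
            worker.flush_job(job.id).await?;
            Err(e)
        }
    }
}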

/// job lock). We're more strict here (flushes can only happen once, you must
/// flush some non-running state) to try and enforce a good interaction
/// pattern with the queue. I might return to this and loosen this constraint in the
/// future, if there's a motivating case for needing to flush partial job updates.
Contributor

Kind of like above, I'm still confused why our whole update API isn't just a "I'm done with the job, here is its new state" type call.

// The general interface for calling our functions takes a JSON serialized stirng,
// because neon has no nice serde support for function arguments (and generally.
// rippping objects from the v8 runtime piece by piece is slower than just passing
// a since chunk of bytes). These are convenience functions for converting between
Contributor

Suggested change
// a since chunk of bytes). These are convenience functions for converting between
// a single chunk of bytes). These are convenience functions for converting between

(This kind of trails off, too.)
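
To make the pattern concrete, a binding along these lines (a sketch assuming neon 1.x; createJob, create_job and the JobInit fields are stand-ins, not the actual cyclotron-node exports) takes a single JSON string, deserializes it once on the Rust side, and returns a JSON string:

use neon::prelude::*;

// Stand-in for whatever the binding actually wraps.
#[derive(serde::Deserialize)]
struct JobInit {
    queue_name: String,
    parameters: Option<String>,
}

// JS calls this with one JSON string rather than a structured object, so we cross the
// v8 boundary once instead of pulling the object apart field by field.
fn create_job(mut cx: FunctionContext) -> JsResult<JsString> {
    let arg = cx.argument::<JsString>(0)?.value(&mut cx);
    let init: JobInit = serde_json::from_str(&arg)
        .or_else(|e| cx.throw_error(format!("invalid job init: {}", e)))?;
    // ... hand `init` to the core queue here ...
    let reply = serde_json::json!({
        "queued": true,
        "queue": init.queue_name,
        "has_parameters": init.parameters.is_some(),
    })
    .to_string();
    Ok(cx.string(reply))
}

#[neon::main]
fn main(mut cx: ModuleContext) -> NeonResult<()> {
    cx.export_function("createJob", create_job)?;
    Ok(())
}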

github-actions bot commented Aug 14, 2024

Size Change: 0 B

Total Size: 1.08 MB

frontend/dist/toolbar.js: 1.08 MB (unchanged)

@oliverb123 oliverb123 mentioned this pull request Aug 16, 2024
@bretthoerner bretthoerner merged commit 9734a40 into master Aug 21, 2024
87 checks passed
@bretthoerner bretthoerner deleted the oliver_cyclotron_lib branch August 21, 2024 18:24
bretthoerner added a commit that referenced this pull request Aug 21, 2024
bretthoerner added a commit that referenced this pull request Aug 21, 2024
This reverts commit 9734a40.