-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
source-mysql: support for replica-only capture #1859
Comments
The implementation of MySQL watermark-less captures is going well and appears to be working. One question which recently came up was whether our consistent rate of watermark writes was serving any other important purposes which might be broken if we outright remove them. In some databases this is definitely true. For instance, in Postgres if the database we're capturing from is entirely idle but it's hosted on a server with other databases which are active, our replication slot LSN will never advance (because there's no new changes for us to confirm that we've read up to) but the Postgres WAL is shared across all databases so it will keep growing until the server either runs out of disk or invalidates the replication slot. So if/when we decide to add read-only replica support for Postgres we will have to think about providing users with some mechanism to keep issuing some sort of heartbeat, and/or warning them and providing setup instructions for generating a heartbeat outside of our connector. But that's not the case in MySQL. While it's always possible that users could have come to rely on our watermark writes for other non-CDC related reasons it's unlikely, so there is only one "useful side effect" that I am aware of, and that's when a user wants to capture changes to MyISAM tables. The issue with MyISAM tables is that the MyISAM storage engine doesn't support transactions, and what that means in concrete terms is that changes to a MyISAM table write out change event(s) into the binlog but there is never a transaction commit event following those. Currently the MySQL CDC connector relies on transaction commit events to tell when it's okay to emit a Flow checkpoint, and so if a database is solely using MyISAM tables for their data our guidance in the past was to make sure the watermarks table was created using InnoDB so that watermark writes would cause there to be at least one commit event in the binlog every minute. After going back and forth for a bit internally, we've concluded that the best solution is to implement support for MyISAM tables so that they actually work correctly. Every change to a MyISAM table is, practically speaking, an independent transaction which commits immediately even when issued in the context of an InnoDB transaction. So in theory, all we should need to do is modify our discovery/metadata logic to keep track of whether a table is using a non-transactional storage engine and then if so we should treat each change to such a table as being followed by an implicit commit event. |
The main PR (#1878) addressing this issue is now out for review. Once that's released, we will need to:
|
PR #1878 has been live for a few hours and nothing has exploded so far, which means that MySQL read replica captures are now something that works in production. I'll get the aforementioned documentation updates out soon and then next week I'll finish the job of removing the deprecated config properties. |
For some of the SQL databases which we have CDC connectors for, it ought to be possible to support capturing from read-only replicas. The major obstacle to this is that we currently rely on watermark writes as our primary mechanism for establishing a causal relationship between backfill queries and the replication stream so that our incremental backfills will be exactly correct, and for obvious reasons issuing writes into the watermarks table can't be done from a read-only replica.
In MySQL (and likely Postgres and SQL Server as well, but they're outside of the initial scope of this work) it should be possible to outright replace the watermarking logic with an equivalent implementation in which watermark writes are replaced by querying the current endpoint of the binlog, and observing a watermark change is replaced by observing a transaction commit event in the binlog whose position is >= that position. Since the endpoint query will be occurring on the same primary database connection as backfill queries and will be issued after a backfill query completes, it should provide the same causality guarantees that watermark writes currently do.
A short summary of the planned sequence of changes:
The text was updated successfully, but these errors were encountered: