[FLINK-33545][Connectors/Kafka] KafkaSink implementation can cause dataloss during broker issue when not using EXACTLY_ONCE if there's any batching #70

Open

hhktseng wants to merge 1 commit into main

Conversation

hhktseng

What is the purpose of the change

To address the fact that the current KafkaProducer flow has exactly the same behavior for DeliveryGuarantee.NONE and DeliveryGuarantee.AT_LEAST_ONCE.

It is based on the understanding that there is a very rare race condition between the existing flush performed on the producer via prepareSnapshotPreBarrier and the actual checkpoint completion on commit: data can still arrive via processElement after the pre-barrier flush. If the KafkaProducer is retrying a batched record that has not yet thrown any error, a job failure (caused by the broker) will cause that batched data to never be delivered, and since the checkpoint was successful, this data will be lost.

This PR addresses the issue by giving AT_LEAST_ONCE an opportunity to flush again during commit when needed, ensuring that at the end of the checkpoint cycle the producer definitely has no data left in its buffer.

Please comment on or verify the above understanding.

Brief change log

  • Add a hasRecordsInBuffer flag to FlinkKafkaInternalProducer, updated when send/flush/close are called
  • Add a transactional flag to KafkaCommittable to track whether a committable is transactional or not
  • Add a new constructor to KafkaCommittable for unit-test backward compatibility
  • Have prepareCommit() also return a list of committables for DeliveryGuarantee.AT_LEAST_ONCE
  • Have KafkaCommitter check the new transactional flag on KafkaCommittable before performing commitTransaction(), preserving the original EXACTLY_ONCE pathway (a sketch of this flow follows this list)
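As a rough illustration only, here is a minimal, self-contained sketch of the intended control flow. The class and method names below are simplified stand-ins, not the actual KafkaWriter/KafkaCommitter/KafkaCommittable code:

// Illustrative sketch only; hypothetical names, not the actual connector classes.
import java.util.Collection;
import java.util.Collections;

enum DeliveryGuarantee { NONE, AT_LEAST_ONCE, EXACTLY_ONCE }

class CommittableSketch {
    final boolean transactional; // mirrors the new flag on KafkaCommittable
    CommittableSketch(boolean transactional) { this.transactional = transactional; }
}

class WriterSketch {
    private final DeliveryGuarantee guarantee;
    private boolean hasRecordsInBuffer; // set on send(), cleared on flush()/close()

    WriterSketch(DeliveryGuarantee guarantee) { this.guarantee = guarantee; }

    Collection<CommittableSketch> prepareCommit() {
        switch (guarantee) {
            case EXACTLY_ONCE:
                // unchanged path: hand a transactional committable to the committer
                return Collections.singletonList(new CommittableSketch(true));
            case AT_LEAST_ONCE:
                // new path: emit a non-transactional committable so the committer
                // can flush records buffered after the pre-barrier flush
                return hasRecordsInBuffer
                        ? Collections.singletonList(new CommittableSketch(false))
                        : Collections.emptyList();
            default:
                return Collections.emptyList();
        }
    }
}

class CommitterSketch {
    void commit(CommittableSketch committable) {
        if (committable.transactional) {
            // original EXACTLY_ONCE pathway: producer.commitTransaction()
        } else {
            // AT_LEAST_ONCE pathway from this PR: producer.flush()
        }
    }
}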

Verifying this change

This change adds tests and can be verified as follows:

  • Added a second flush test for this particular special case (where producer.send was invoked after a flush)
  • Manually verified that the job runs correctly with this change on an existing cluster

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable


boring-cyborg bot commented Nov 30, 2023

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

@MartijnVisser
Contributor

@hhktseng Can you rebase your PR?

@hhktseng
Author

> @hhktseng Can you rebase your PR?

@MartijnVisser can you point me to which commit to rebase onto?

thanks

@MartijnVisser
Contributor

> @MartijnVisser can you point me to which commit to rebase onto?

@hhktseng On the latest changes from main please

@hhktseng
Author

Created a patch and applied it after syncing to the latest commit, then replaced the forked branch with the latest sync + patch.

+ ", closed="
+ closed
+ '}';
}

public class TrackingCallback implements Callback {
Contributor

nit: private


@Override
public void onCompletion(final RecordMetadata recordMetadata, final Exception e) {
    pendingRecords.decrementAndGet();
Contributor

Do we want to decrement after the callback is completed? What's the best approach semantically?

Contributor

Since it already happened, we should probably decrement as soon as possible?
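For reference, here is a minimal standalone sketch of the decrement-first ordering discussed above. In the PR, TrackingCallback is an inner class of FlinkKafkaInternalProducer, so the constructor and fields below are only illustrative:

import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.RecordMetadata;

class TrackingCallbackSketch implements Callback {
    private final AtomicLong pendingRecords;
    private final Callback actualCallback; // may be null if the caller passed no callback

    TrackingCallbackSketch(AtomicLong pendingRecords, Callback actualCallback) {
        this.pendingRecords = pendingRecords;
        this.actualCallback = actualCallback;
    }

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
        // Decrement first: the broker has already acknowledged (or failed) the record,
        // so the pending count should drop even if the wrapped callback throws.
        pendingRecords.decrementAndGet();
        if (actualCallback != null) {
            actualCallback.onCompletion(metadata, exception);
        }
    }
}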

final long pendingRecordsCount = pendingRecords.get();
if (pendingRecordsCount != 0) {
    throw new IllegalStateException(
            "Pending record count must be zero at this point: " + pendingRecordsCount);
Contributor

nit: how about a message like this:

n pending records after flush. There must be no pending records left.

Contributor

I'd improve the error message as follows:

Some records have not been fully persisted in Kafka. As a precaution, Flink will restart to resume from previous checkpoint. Please report this issue with logs on https://issues.apache.org/jira/browse/FLINK-33545.

Author

I think having a reference to the reported issue might allow us to better track the potential problem.
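A minimal sketch of how the post-flush check with the suggested wording could look, assuming a pendingRecords counter maintained by the tracking callback. The class extends the plain KafkaProducer only so the example compiles on its own; the real change lives in FlinkKafkaInternalProducer:

import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.producer.KafkaProducer;

class FlushCheckSketch extends KafkaProducer<byte[], byte[]> {
    // decremented by the tracking callback as records are acknowledged
    private final AtomicLong pendingRecords = new AtomicLong();

    FlushCheckSketch(Properties config) {
        super(config);
    }

    @Override
    public void flush() {
        super.flush();
        final long pendingRecordsCount = pendingRecords.get();
        if (pendingRecordsCount != 0) {
            throw new IllegalStateException(
                    pendingRecordsCount
                            + " records have not been fully persisted in Kafka."
                            + " As a precaution, Flink will restart to resume from the"
                            + " previous checkpoint. Please report this issue with logs on"
                            + " https://issues.apache.org/jira/browse/FLINK-33545.");
        }
    }
}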

@mas-chen
Contributor

mas-chen commented Apr 8, 2024

@hhktseng were you able to test that this change mitigates your original issue? Is there a way to repro in the tests?

@tweise
Contributor

tweise commented Jul 4, 2024

@hhktseng thanks for working on this. Can you please address the review comments?

@AHeise
Contributor

AHeise commented Jul 8, 2024

Please check my comment here. https://issues.apache.org/jira/browse/FLINK-33545?focusedCommentId=17863737&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17863737
If everyone is sure that the current fix is addressing the actual issue, please go ahead.

Contributor

@AHeise AHeise left a comment

Thank you very much for your contribution. Changes look mostly good to me. I have added two small suggestions.

I'm also wondering if we should add a property around this new behavior, so folks can turn off the check if it has unintended side-effects (performance degradation, or interference with DeliveryGuarantee.NONE).

@Override
public Future<RecordMetadata> send(ProducerRecord<K, V> record, Callback callback) {
    if (inTransaction) {
        hasRecordsInTransaction = true;
    }
-   return super.send(record, callback);
+   pendingRecords.incrementAndGet();
+   return super.send(record, new TrackingCallback(callback));
Contributor

This change creates a new callback with every send. Since the callback being passed in our codebase is mostly constant, we should add a simple cache like new LRUMap(3);. The number is kind of arbitrary and 1 should work already. The most important part is that it shouldn't grow boundless or we get the next memory leak if I overlooked a dynamic usage ;).

Author

Just want to check: are we proposing putting instances of TrackingCallback into the rotating cache? Wouldn't that cause a previous callback that might not have been invoked yet to be ignored?

Contributor

I can't quite follow. I was proposing to use

return super.send(record, callbackCache.computeIfAbsent(callback, TrackingCallback::new));

So we have 3 cases:

  • New callback: wrap it in a TrackingCallback and cache it.
  • Existing callback (common case): retrieve the existing TrackingCallback and use it.
  • Existing TrackingCallback is evicted from the cache if the cache is full.

In all cases, both the TrackingCallback and the original callback will be invoked. The only difference from the code without the cache is that we avoid creating extra TrackingCallback instances around the same original callback.
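To make the proposal concrete, here is a minimal sketch of the bounded cache, reusing the TrackingCallbackSketch wrapper from the earlier sketch. LRUMap is the Commons Collections class mentioned above, and the size of 3 is the (arbitrary) number suggested; the class and method names are illustrative, not the PR's code:

import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.commons.collections4.map.LRUMap;
import org.apache.kafka.clients.producer.Callback;

class CallbackCacheSketch {
    private final AtomicLong pendingRecords = new AtomicLong();
    // Bounded so the map cannot grow without limit even if many distinct callbacks are passed.
    private final Map<Callback, Callback> callbackCache = new LRUMap<>(3);

    Callback wrap(Callback callback) {
        if (callback == null) {
            // null callbacks cannot be used as map keys; wrap them without caching
            return new TrackingCallbackSketch(pendingRecords, null);
        }
        // Reuse the existing wrapper for a previously seen callback, otherwise create and cache one.
        // Either way, both the wrapper and the original callback run on completion.
        return callbackCache.computeIfAbsent(
                callback, cb -> new TrackingCallbackSketch(pendingRecords, cb));
    }
}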

@AHeise AHeise self-assigned this Sep 13, 2024