
[Bug]: Java BundleProcessorCache evicts all processors for a bundle descriptor id after 1 minute of idleness #29797

Open
1 of 16 tasks
scwhittle opened this issue Dec 18, 2023 · 4 comments · May be fixed by #33175

@scwhittle
Contributor

What happened?

A timeout of 1 minute is specified here:
https://github.com/apache/beam/blob/master/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/control/ProcessBundleHandler.java#L923

The key of this cache is the bundle descriptor id (fused stage) and the value is a list of cached BundleProcessors. This means that if a stage is not processed for a minute, all of its bundle processors are destroyed. For long-lived streaming pipelines this is wasteful: stages are processed in parallel, and a stage that was processed before will likely be processed again. Creating a BundleProcessor involves constructing the user DoFns and running their Setup methods, so it is non-trivial.
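To make the failure mode concrete, here is a minimal, self-contained sketch of the behavior described above. It is not Beam's actual code (the real cache lives in ProcessBundleHandler and uses a cache library's expire-after-access policy); the class and method names are illustrative, and `String` stands in for `BundleProcessor`. The point it demonstrates: the idle timer is per descriptor id, so when an entry expires, every cached processor for that stage is discarded at once.

```java
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the current behavior: the cache maps a bundle
// descriptor id to a queue of processors, and an idle entry expires as a
// whole, dropping every cached processor for that stage together.
// (Not thread-safe around lastAccessMillis; this is a sketch, not the impl.)
public class DescriptorCache {
  static final class Entry {
    final Queue<String> processors = new ArrayDeque<>(); // stand-in for BundleProcessor
    long lastAccessMillis;
  }

  private final Map<String, Entry> cache = new ConcurrentHashMap<>();
  private final long timeoutMillis;

  public DescriptorCache(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  /** Returns a cached processor for this descriptor, or null on a miss. */
  public String take(String descriptorId, long nowMillis) {
    Entry e = cache.computeIfAbsent(descriptorId, id -> new Entry());
    e.lastAccessMillis = nowMillis;
    return e.processors.poll();
  }

  /** Returns a processor to the cache after a successful bundle. */
  public void release(String descriptorId, String processor, long nowMillis) {
    Entry e = cache.computeIfAbsent(descriptorId, id -> new Entry());
    e.lastAccessMillis = nowMillis;
    e.processors.offer(processor);
  }

  /** Mimics expire-after-access: drops the whole entry for an idle descriptor. */
  public void evictIdle(long nowMillis) {
    cache.entrySet()
        .removeIf(en -> nowMillis - en.getValue().lastAccessMillis >= timeoutMillis);
  }

  public int cachedCount(String descriptorId) {
    Entry e = cache.get(descriptorId);
    return e == null ? 0 : e.processors.size();
  }
}
```

With a 60s timeout, one idle sweep past the deadline empties the stage's entire pool, so the next bundle pays full construction and Setup cost again.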

Some improvements:

  • increase the timeout for streaming pipelines by default, or make it runner-configurable
  • don't throw away the entire list of cached BundleProcessors on expiration; instead, perhaps remove just the last processor.
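The second suggestion could look something like the sketch below: each idle sweep trims at most one processor from the cold end of the deque, so the pool shrinks gradually toward steady-state demand instead of being emptied all at once. Names are illustrative, not Beam's.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of gradual trimming: take from the hot (most recently
// returned) end, trim from the cold end, and discard only one processor per
// idle interval instead of the whole list.
public class GradualPool {
  private final Deque<String> idleProcessors = new ArrayDeque<>();

  /** Returns a processor to the pool after a successful bundle. */
  public void release(String processor) {
    idleProcessors.addLast(processor);
  }

  /** Takes the most recently used processor, or null if the pool is empty. */
  public String take() {
    return idleProcessors.pollLast();
  }

  /** Called once per idle interval: discards only the coldest processor. */
  public void trimOne() {
    idleProcessors.pollFirst();
  }

  public int size() {
    return idleProcessors.size();
  }
}
```

A pool of N processors would then take N idle intervals to drain completely, rather than one, while a stage that resumes within that window keeps most of its warm processors.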

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@gurjarutkarsh

Could introduce a way to track how frequently each BundleProcessor is used, and only remove those that haven't been utilized for a longer period, rather than applying a strict one-minute rule.

@kennknowles
Member

What's the comparable lifetime in other contexts, e.g. Dataflow legacy worker streaming? Is there any eviction at all?

@scwhittle
Contributor Author

The Dataflow streaming runner never times these out, so that is certainly a valid option in my opinion.

Queue is here:

Stored in map here:

private final ConcurrentMap<String, ComputationState> computationMap = new ConcurrentHashMap<>();

Neither of which have a timeout.

We add all successfully processed DoFns back to the queue:
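The pattern described in this comment, a per-computation map with no eviction, where processors are taken for a bundle and offered back only after success, can be sketched roughly as follows. This is a simplified stand-in, not the Dataflow worker's actual code: the class and method names are hypothetical, and `String` again stands in for the executable bundle state.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Supplier;

// Hypothetical sketch of the no-timeout pattern: a ConcurrentMap of
// per-computation processor queues with no eviction at all. A long-lived
// streaming worker therefore never re-runs DoFn Setup for a warm stage.
public class NoTimeoutPool {
  private final Map<String, Queue<String>> byComputation = new ConcurrentHashMap<>();

  /** Reuses a cached processor, or builds one (running Setup) only on a miss. */
  public String acquire(String computationId, Supplier<String> factory) {
    String cached = byComputation
        .computeIfAbsent(computationId, id -> new ConcurrentLinkedQueue<>())
        .poll();
    return cached != null ? cached : factory.get();
  }

  /** Returns the processor to its computation's queue after a successful bundle. */
  public void releaseAfterSuccess(String computationId, String processor) {
    byComputation
        .computeIfAbsent(computationId, id -> new ConcurrentLinkedQueue<>())
        .offer(processor);
  }
}
```

Note that a failed bundle simply never calls `releaseAfterSuccess`, so broken processors fall out of the pool naturally; the pool itself is bounded by the finite set of stages in the graph.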

@kennknowles
Member

Yeah, I mean for today we do know that there's a finite number of them, since the graph cannot change dynamically. I don't have historical context, nor have I looked at the commit history, so I don't know if there is a specific motivation here. Lacking that, I would also favor copying something that is known good, aka let us just not time them out. Or we could at least set it to, say, an hour. I suppose you can close up and free bounded sources that have completed, if that is not done some other way.
