Is your feature request related to a problem? Please describe.
Context
Our team recently encountered a significant issue when attempting to start a large number of activities (~600) asynchronously within a single Temporal workflow. The SDK grouped the activity scheduling requests together, leading to a gRPC error (ResourceExhausted) because the total message size (15MB) exceeded the gRPC-imposed limit of 4MB.
The workflow then freezes: the SDK keeps retrying the RespondWorkflowTaskCompleted API call, which fails every time because the message is still too large. Currently, the SDK does not provide any built-in mechanism to handle this limitation automatically, nor does it throw a runtime error that would help developers detect and manage the issue proactively.
Underlying Issue
The Temporal SDK batches multiple activity scheduling commands into a single gRPC request.
When the cumulative size of these commands exceeds 4MB, gRPC rejects the request with a ResourceExhausted error.
The SDK does not provide an intuitive way to manage this situation, leading to potential workflow freezes and silent failures.
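To illustrate how quickly the cumulative size adds up, here is a minimal sketch that estimates the total argument size for a batch of activities. It does not use the Temporal SDK: the names `estimateArgsSize` and `estimateBatchSize` are hypothetical, JSON-encoded UTF-8 bytes are only an approximation of what the SDK actually serializes, and the workload (600 activities with a ~25 KB argument each) is an assumed example chosen to land near the 15MB we observed.

```typescript
// Assumed gRPC message size limit, as cited in this issue.
const GRPC_LIMIT_BYTES = 4 * 1024 * 1024;

const encoder = new TextEncoder();

// Approximate the serialized size of one activity's arguments as UTF-8 JSON bytes.
// This ignores per-command overhead (headers, metadata), so it underestimates.
function estimateArgsSize(args: unknown[]): number {
  return encoder.encode(JSON.stringify(args)).length;
}

// Sum the estimated sizes of all activities scheduled in the same workflow task.
function estimateBatchSize(allArgs: unknown[][]): number {
  return allArgs.reduce((total, args) => total + estimateArgsSize(args), 0);
}

// Hypothetical workload: 600 activities, each with a ~25 KB string argument.
const workload: unknown[][] = Array.from({ length: 600 }, () => ['x'.repeat(25 * 1024)]);
const totalBytes = estimateBatchSize(workload);

console.log(`estimated ${totalBytes} bytes, over limit: ${totalBytes > GRPC_LIMIT_BYTES}`);
```

Even with modest per-activity payloads, scheduling hundreds of activities in one workflow task can exceed the limit several times over.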
Describe the solution you'd like
Feature Request
To improve the developer experience and robustness of the Temporal SDK, we request the following enhancements:
Automatic Request Splitting:
Implement a mechanism in the SDK to detect when the cumulative size of a gRPC request is approaching the 4MB limit.
Automatically split the activity scheduling commands into smaller batches that stay within the size limit, sending them in multiple gRPC requests if necessary.
Runtime Error for Exceeding Limits:
If automatic splitting is not feasible, the SDK should throw a descriptive runtime error when a request exceeds the gRPC message size limit.
This error should include suggestions or documentation references on how to adjust the code to stay within the limit, such as reducing the number of activities started simultaneously or compressing payloads.
Improved Logging and Monitoring:
Enhance the logging around gRPC message size limits to make it easier for developers to diagnose issues related to request size.
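The runtime error and logging requests above could look roughly like the following sketch. Everything here is hypothetical and not part of the Temporal SDK: `RequestTooLargeError`, `checkRequestSize`, the 4MB default, and the 80% warning threshold are all assumptions made for illustration.

```typescript
// Assumed limit and warning threshold; illustrative values, not SDK settings.
const REQUEST_LIMIT_BYTES = 4 * 1024 * 1024;
const WARN_THRESHOLD_RATIO = 0.8;

// Hypothetical error type carrying the measured size and a remediation hint.
class RequestTooLargeError extends Error {
  readonly sizeBytes: number;
  readonly limitBytes: number;

  constructor(sizeBytes: number, limitBytes: number) {
    super(
      `Request size ${sizeBytes} bytes exceeds the ${limitBytes}-byte limit. ` +
        'Consider starting fewer activities at once or compressing payloads.',
    );
    this.name = 'RequestTooLargeError';
    this.sizeBytes = sizeBytes;
    this.limitBytes = limitBytes;
  }
}

type SizeCheckResult = 'ok' | 'warned';

// Throws when the request is over the limit, logs when it is close to it.
function checkRequestSize(
  sizeBytes: number,
  limitBytes: number = REQUEST_LIMIT_BYTES,
  log: (message: string) => void = console.warn,
): SizeCheckResult {
  if (sizeBytes > limitBytes) {
    throw new RequestTooLargeError(sizeBytes, limitBytes);
  }
  if (sizeBytes >= limitBytes * WARN_THRESHOLD_RATIO) {
    const percent = Math.round((sizeBytes / limitBytes) * 100);
    log(`Request size ${sizeBytes} bytes is at ${percent}% of the ${limitBytes}-byte limit.`);
    return 'warned';
  }
  return 'ok';
}
```

A check like this would fail fast with an actionable message instead of letting the request be retried indefinitely, and the warning gives developers a signal before the limit is actually hit.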
Additional Considerations
Custom Data Converters: Although developers can use custom data converters to compress payloads, this should be a secondary approach. The SDK should still provide safeguards against exceeding gRPC limits.
Documentation: Update the SDK documentation to clearly explain the gRPC size limitation, how it interacts with Temporal workflows, and best practices for avoiding issues.
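For reference, the compression workaround mentioned under Custom Data Converters amounts to something like the sketch below. It only shows the gzip round trip using Node's built-in `zlib` module; `compressArgs` and `decompressArgs` are hypothetical helper names, and wiring them into a Temporal data converter is not shown. How much this helps depends entirely on how compressible the payloads are.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Serialize arguments to JSON and gzip the resulting UTF-8 bytes.
function compressArgs(args: unknown[]): Uint8Array {
  return gzipSync(new TextEncoder().encode(JSON.stringify(args)));
}

// Reverse of compressArgs: gunzip, decode as UTF-8, and parse the JSON.
function decompressArgs(data: Uint8Array): unknown[] {
  return JSON.parse(new TextDecoder().decode(gunzipSync(data))) as unknown[];
}

// Hypothetical, highly repetitive payload; real data may compress far less.
const sampleArgs: unknown[] = [{ id: 1, body: 'lorem ipsum '.repeat(2000) }];
const rawSize = new TextEncoder().encode(JSON.stringify(sampleArgs)).length;
const compressed = compressArgs(sampleArgs);

console.log(`raw: ${rawSize} bytes, compressed: ${compressed.length} bytes`);
```

Compression lowers the odds of hitting the limit but does not remove it, which is why we see it as a secondary approach rather than a safeguard.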
Impact
Implementing these changes would significantly reduce the likelihood of silent workflow failures and improve the overall resilience of systems built using the Temporal SDK. Developers would have better tools to manage gRPC limitations, leading to more reliable and maintainable code.
A rough, incomplete solution that we are piloting looks something like this:
    import type { ActivityFunction } from '@temporalio/common';
    import { DefaultPayloadConverter } from '@temporalio/common';

    // 3MB, leaving a buffer for any unknown overhead associated with the activities
    const MAX_BATCH_PAYLOAD_SIZE = 3 * 1024 * 1024;

    const defaultPayloadConverter: DefaultPayloadConverter = new DefaultPayloadConverter();

    // Size in bytes of the converted payload for one activity's arguments.
    function getPayloadSize<PayloadType extends Parameters<ActivityFunction>>(args: PayloadType): number {
      const payload = defaultPayloadConverter.toPayload(args);
      if (payload.data instanceof Uint8Array) {
        return payload.data.length;
      }
      return 0;
    }

    // Greedily group arguments so that each group's total payload size stays within maxSize.
    function groupIntoMaxSizePayloads<PayloadType extends Parameters<ActivityFunction>>(
      activityArgs: PayloadType[],
      maxSize: number = MAX_BATCH_PAYLOAD_SIZE,
    ): PayloadType[][] {
      const payloadGroups: PayloadType[][] = [];
      let currentGroup: PayloadType[] = [];
      let currentGroupSize = 0;

      for (const args of activityArgs) {
        const argsSize = getPayloadSize(args);
        if (argsSize > maxSize) {
          // A single payload that is too big can never fit in any batch.
          throw new Error(`Activity payload of ${argsSize} bytes exceeds the maximum batch size of ${maxSize} bytes`);
        }

        // If the current group size plus the size of the next args is greater than the max size,
        // push the current group and start a new one
        if (currentGroupSize + argsSize > maxSize) {
          payloadGroups.push(currentGroup);
          currentGroup = [];
          currentGroupSize = 0;
        }

        currentGroup.push(args);
        currentGroupSize += argsSize;
      }

      // Push the last group if it's not empty
      if (currentGroup.length > 0) {
        payloadGroups.push(currentGroup);
      }

      return payloadGroups;
    }

    export async function safeStartActivities<Activity extends ActivityFunction, Arguments>(
      activity: Activity,
      args: Arguments[],
    ): Promise<PromiseSettledResult<Awaited<ReturnType<Activity>>>[]> {
      const maxRequestSizeBatches: Parameters<Activity>[][] = groupIntoMaxSizePayloads(
        args.map((arg: Arguments) => arg as Parameters<Activity>),
      );

      const results: PromiseSettledResult<Awaited<ReturnType<Activity>>>[] = [];

      // Start one batch at a time and wait for it to settle before starting the next.
      for (const maxRequestSizeBatch of maxRequestSizeBatches) {
        // Note: each entry is passed to the activity as a single argument.
        const promises: Promise<ReturnType<Activity>>[] = maxRequestSizeBatch.map((args: Parameters<Activity>) =>
          activity(args),
        );
        const awaited = await Promise.allSettled(promises);
        results.push(...awaited);
      }

      return results;
    }
Additional context
Cloud Support Thread: https://temporalio.slack.com/archives/C046BRWDV2R/p1723743294696369