- An application that communicates with elements running in the cloud has to be sensitive to the transient faults that can occur in this environment.
- Examples of such faults: momentary loss of network connectivity to components and services, temporary unavailability of a service, and timeouts that occur when a service is busy.
- These faults are typically self-correcting: if the action is repeated after a suitable delay, it is likely to succeed.
- E.g. ConnectionClosed, TimeOut, RequestCanceled
- Strategies
- Cancel: If the fault indicates the failure isn't transient, or the operation is unlikely to succeed if repeated (e.g. invalid credentials), report the exception and cancel the operation.
- Retry: If the specific fault reported is unusual or rare (e.g. a network packet becoming corrupted), retry immediately, since the same failure is unlikely to repeat.
- Retry after delay: If the fault is caused by a common busy or connectivity failure, wait a short period of time before retrying.
- For more common transient failures, the period between retries should be chosen to spread requests from multiple instances of the application as evenly as possible.
- This reduces the chance of a busy service staying overloaded.
- If many instances keep hitting an overloaded service with retries, the service takes longer to recover.
- If the request still fails, wait and make another attempt. If necessary, increase the delay between retry attempts until a maximum number of attempts is reached.
- The delay can be increased incrementally or exponentially, depending on the type of failure and the probability that it'll be corrected during this time.
- Many SDKs implement retry policies with configurable parameters: maximum number of retries, amount of time between retries, ….
- An application should log the details of faults & failing operations.
- Scaling out the service can lower the frequency of faults caused by it being overloaded.
- E.g. partition the database & spread the load across multiple servers.
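The incremental and exponential delay strategies noted above could be computed roughly as follows. This is a minimal sketch, not taken from any SDK: the class `BackoffSketch`, its method names, and the doubling factor are hypothetical, and the random jitter is only one assumed way to spread retries from multiple application instances.

```csharp
using System;

// Hypothetical helper for illustration; names and factors are assumptions.
public static class BackoffSketch
{
    // Incremental: the delay grows by a fixed step per attempt (attempt is 1-based).
    public static TimeSpan Incremental(int attempt, TimeSpan step, TimeSpan max)
    {
        TimeSpan delay = TimeSpan.FromTicks(step.Ticks * attempt);
        return delay < max ? delay : max;
    }

    // Exponential: the delay doubles on each attempt, capped at max.
    public static TimeSpan Exponential(int attempt, TimeSpan baseDelay, TimeSpan max)
    {
        double factor = Math.Pow(2, attempt - 1);
        double ticks = Math.Min(baseDelay.Ticks * factor, max.Ticks);
        return TimeSpan.FromTicks((long)ticks);
    }

    // Jitter (assumed technique): scale the delay by a random fraction in
    // [0.5, 1.0) so that instances do not all retry at the same moment.
    public static TimeSpan WithJitter(TimeSpan delay, Random rng)
    {
        double fraction = 0.5 + rng.NextDouble() / 2;
        return TimeSpan.FromTicks((long)(delay.Ticks * fraction));
    }
}
```

With a 1 second base delay and a 30 second cap, the exponential variant waits 1 s, 2 s, 4 s, and so on until it hits the cap, while the incremental variant with a 2 second step waits 2 s, 4 s, 6 s.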
- In code
- Try catch for the exception
- Log the exception
- throw if the retry count has reached the maximum
- Otherwise set a delay (Delay = TimeSpan.FromSeconds(5)) and wait for it (Task.Delay) before retrying
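The steps above might look like this in C#. It is a rough sketch under stated assumptions: `RetrySketch.ExecuteAsync`, the `isTransient` callback, and logging to `Console.Error` are hypothetical choices made for illustration; only `TimeSpan.FromSeconds` and `Task.Delay` come from the notes.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not an SDK API.
public static class RetrySketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,          // the call that may fail transiently
        Func<Exception, bool> isTransient, // caller decides which faults to retry
        int maxRetries,
        TimeSpan delay)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex)
            {
                // Log the details of the fault and the failing attempt.
                Console.Error.WriteLine($"Attempt {attempt + 1} failed: {ex.Message}");

                // Give up when the retry count is at its maximum
                // or the fault is not considered transient.
                if (attempt >= maxRetries || !isTransient(ex))
                {
                    throw;
                }
            }

            // Wait before the next attempt.
            await Task.Delay(delay);
        }
    }
}
```

A caller could pass `TimeSpan.FromSeconds(5)` as `delay` and a predicate such as `ex => ex is TimeoutException` as `isTransient`. A fixed delay keeps the sketch close to the notes. An increasing delay would replace the `delay` argument with a value computed per attempt.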