Runtime Failure Tolerance

eile edited this page Dec 1, 2010 · 2 revisions

Author: Stefan Eilemann

State: Design

Overview

The goal of this feature is to extend Equalizer to handle the failure of resources at runtime. The most common cause is the failure of a node process, either due to a hardware or programming failure.

All blocking operations in Equalizer will throw an eq::base::timeoutError exception when they time out. The client and server libraries will catch the appropriate exceptions and handle them as needed. The timeout value is a global parameter.

Blocking operations

The following subsections list the blocking operations in various parts of the Equalizer code and the necessary changes to handle timeout situations.

eq::base

The RequestHandler::waitRequest and Monitor::wait methods throw a timeout exception. This is a prerequisite for many of the higher-level operations listed below. Locks will keep blocking indefinitely. A TimedLock should be used for operations which may time out, as opposed to the traditional lock usage for mutual exclusion.

The request handler already uses a timed lock for blocking operations. A new value EQ_TIMEOUT_DEFAULT will be the new default value for timeout parameters. The default timeout value is a global parameter. The request is deregistered upon timeout.
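The behaviour described above can be sketched as follows. This is a minimal illustration, not the Equalizer implementation: the `sketch` namespace, `TimeoutError` (a stand-in for eq::base::timeoutError), `TIMEOUT_DEFAULT_MS` (a stand-in for EQ_TIMEOUT_DEFAULT), `hasRequest` and all signatures are assumptions, and a `std::condition_variable` is used in place of the timed lock.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <stdexcept>

namespace sketch
{
// Hypothetical stand-in for eq::base::timeoutError.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

// Hypothetical global default, playing the role of EQ_TIMEOUT_DEFAULT.
const uint32_t TIMEOUT_DEFAULT_MS = 100;

class RequestHandler
{
public:
    RequestHandler() : _nextID( 0 ) {}

    uint32_t registerRequest()
    {
        std::lock_guard< std::mutex > lock( _mutex );
        _requests[ _nextID ] = Request();
        return _nextID++;
    }

    // Called by the thread answering the request. Late answers for
    // requests which already timed out are silently dropped.
    void serveRequest( const uint32_t id, const int result )
    {
        std::lock_guard< std::mutex > lock( _mutex );
        auto i = _requests.find( id );
        if( i == _requests.end( ))
            return;
        i->second.result = result;
        i->second.served = true;
        _cond.notify_all();
    }

    // Blocks until the request is served or the timeout expires. The
    // request is deregistered in both cases.
    int waitRequest( const uint32_t id,
                     const uint32_t timeoutMS = TIMEOUT_DEFAULT_MS )
    {
        std::unique_lock< std::mutex > lock( _mutex );
        const bool served = _cond.wait_for(
            lock, std::chrono::milliseconds( timeoutMS ),
            [ this, id ]{ return _requests[ id ].served; } );
        const int result = _requests[ id ].result;
        _requests.erase( id );
        if( !served )
            throw TimeoutError();
        return result;
    }

    bool hasRequest( const uint32_t id ) const
    {
        std::lock_guard< std::mutex > lock( _mutex );
        return _requests.count( id ) > 0;
    }

private:
    struct Request
    {
        Request() : served( false ), result( 0 ) {}
        bool served;
        int result;
    };

    mutable std::mutex _mutex;
    std::condition_variable _cond;
    std::map< uint32_t, Request > _requests;
    uint32_t _nextID;
};
}
```

Deregistering on timeout means that a reply arriving after the timeout finds no matching request, so the code serving requests has to tolerate unknown request identifiers.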

The monitor wait operations shall use pthread_cond_timedwait with the given timeout. Attn: the timeout is an absolute time (cf. TimedLock).
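Since pthread_cond_timedwait expects an absolute deadline, a relative timeout has to be added to the current time first. The helper below is a minimal sketch of that conversion; `makeDeadline` and the millisecond unit are assumptions for illustration and not part of the Equalizer API.

```cpp
#include <cassert>
#include <cstdint>
#include <ctime>

namespace sketch
{
// Returns the absolute deadline 'now + timeoutMS', keeping tv_nsec in
// the range [0, 1e9) as required for a valid timespec.
inline timespec makeDeadline( const timespec& now, const uint32_t timeoutMS )
{
    assert( now.tv_nsec >= 0 && now.tv_nsec < 1000000000L );

    timespec deadline = now;
    deadline.tv_sec  += timeoutMS / 1000;
    deadline.tv_nsec += long( timeoutMS % 1000 ) * 1000000L;
    if( deadline.tv_nsec >= 1000000000L ) // carry into the seconds field
    {
        ++deadline.tv_sec;
        deadline.tv_nsec -= 1000000000L;
    }
    return deadline;
}
}
```

A caller would typically obtain `now` from `clock_gettime( CLOCK_REALTIME, &now )`, pass the result to pthread_cond_timedwait, and treat a return value of ETIMEDOUT as the condition for throwing the timeout exception.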

Collage (was eq::net)

On most timeout errors, the corresponding Node has to be disconnected. The individual operations mentioned below qualify how this disconnect is performed.

Incomplete receives

All blocking reads (readSync) on connections use a timeout and throw a timeout exception. The timeout applies to an atomic read, that is, a full packet read may take longer than the timeout, as long as reading each individual chunk happens within the timeout.
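The per-chunk semantics can be illustrated with a simulated connection. In this sketch `FakeConnection`, `readPacket` and `TimeoutError` are hypothetical names, and time is simulated by scripted delays rather than measured; only the name readSync is taken from the text above.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace sketch
{
// Hypothetical stand-in for eq::base::timeoutError.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

// Hypothetical connection whose chunks arrive after scripted delays.
struct FakeConnection
{
    FakeConnection() : next( 0 ) {}

    // Atomic read of one chunk. Throws if the chunk does not arrive
    // within the timeout, otherwise returns the simulated time spent.
    uint32_t readSync( const uint32_t timeoutMS )
    {
        const uint32_t delay = chunkDelaysMS.at( next++ );
        if( delay > timeoutMS )
            throw TimeoutError();
        return delay;
    }

    std::vector< uint32_t > chunkDelaysMS;
    size_t next;
};

// Reads a packet consisting of nChunks chunks and returns the simulated
// total read time. The timeout is applied to each chunk separately.
inline uint32_t readPacket( FakeConnection& connection, const size_t nChunks,
                            const uint32_t timeoutMS )
{
    uint32_t total = 0;
    for( size_t i = 0; i < nChunks; ++i )
        total += connection.readSync( timeoutMS );
    return total;
}
}
```

With a timeout of 100 ms, three chunks arriving after 80 ms each are read successfully although the packet takes 240 ms in total, whereas a single chunk taking 150 ms causes a timeout exception.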

When Node::handleData detects an incomplete receive or catches a timeout exception, it drops the packet and disconnects the node. The timeout is not rethrown, since the ReceiverThread is an internal thread. The same applies to disconnect events from the ConnectionSet.

All send operations

All Connection::write operations use a timeout and throw a timeout exception or return false on a closed connection (to be checked what is more feasible). All sends have to pass through the appropriate node to be able to handle the timeout exception in the proper place, i.e., no direct call to Connection::send is made. Node::send catches the timeout exception and disconnects the node.

The timeout is rethrown and caught by all internal threads, i.e., the CommandThread. Note that the ReceiverThread should never send data. Methods writing data should catch the exception and

Barrier

The enter function catches the timeout exception from its wait operation, disconnects the master node and rethrows the exception. The barrier code has to handle late enter requests.

RSP implementation

Currently the RSP implementation closes the multicast group when a timeout during send operations is exceeded. The new implementation shall simply disconnect the peers from which acknowledgements are missing. The exit of the node will be announced on the multicast group for a faster disconnect on other nodes.

Nodes which do not receive any packets for a repeated number of NAcks will disconnect themselves.
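Selecting the peers to disconnect amounts to a set difference between the known group members and the peers which acknowledged in time. The following sketch assumes hypothetical names (`PeerID`, `findUnresponsive`) and does not reflect the data structures of the actual RSP implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

namespace sketch
{
typedef uint16_t PeerID; // hypothetical peer identifier

// Returns the peers from which no acknowledgement was received. Only
// these peers are disconnected; the multicast group itself stays open.
inline std::vector< PeerID > findUnresponsive(
    const std::set< PeerID >& peers, const std::set< PeerID >& acked )
{
    std::vector< PeerID > missing;
    std::set_difference( peers.begin(), peers.end(),
                         acked.begin(), acked.end(),
                         std::back_inserter( missing ));
    return missing;
}
}
```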

ConnectionSet threads (Win32)

When using more than 63 connections in a Windows cluster, the ConnectionSet spawns worker threads, each operating on a subset of 63 connections. When signalling data to the main thread, they wait for the main thread to process the event. Since this wait does not block the application, they will retry the operation indefinitely upon timeout and consume the exception.
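The retry behaviour of the worker threads can be sketched as a loop which consumes the timeout exception. `FakeMainThread`, `waitProcessed` and `signalAndWait` are hypothetical names used only to make the example self-contained.

```cpp
#include <cstdint>
#include <stdexcept>

namespace sketch
{
// Hypothetical stand-in for eq::base::timeoutError.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

// Hypothetical main thread which fails to process the event in time
// for the first 'failures' attempts.
struct FakeMainThread
{
    unsigned failures;

    void waitProcessed( const uint32_t /*timeoutMS*/ )
    {
        if( failures > 0 )
        {
            --failures;
            throw TimeoutError();
        }
    }
};

// Waits until the main thread has processed the event and returns the
// number of attempts needed. Timeouts are consumed and retried, since
// this wait does not block the application.
inline unsigned signalAndWait( FakeMainThread& mainThread,
                               const uint32_t timeoutMS )
{
    unsigned attempts = 0;
    while( true )
    {
        ++attempts;
        try
        {
            mainThread.waitProcessed( timeoutMS );
            return attempts;
        }
        catch( const TimeoutError& )
        {
            // consumed on purpose: retry the wait
        }
    }
}
}
```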

DataIStream::waitReady

The session waits during mapping for the initial data. The exception is passed through and handled by mapObject.

Node::connect

Rollback connection in progress. Rethrow exception.

Node::disconnect

Local operation, can't fail. Remote node will detect closed connection.

Node::acquireSendToken

Catch the exception, disconnect the node and rethrow it.
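The catch, disconnect and rethrow pattern recurs in several of the operations above. A minimal sketch, assuming a hypothetical `Node` class with a `guarded` helper and a hypothetical `TimeoutError` type (none of which are Collage API), might look like this:

```cpp
#include <functional>
#include <stdexcept>

namespace sketch
{
// Hypothetical stand-in for eq::base::timeoutError.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

class Node
{
public:
    Node() : _connected( true ) {}

    bool isConnected() const { return _connected; }
    void disconnect() { _connected = false; } // local, cannot fail

    // Runs a blocking operation involving this node. On timeout the
    // node is disconnected and the exception is passed on to the caller.
    void guarded( const std::function< void() >& operation )
    {
        try
        {
            operation();
        }
        catch( const TimeoutError& )
        {
            disconnect();
            throw; // rethrow the original exception
        }
    }

private:
    bool _connected;
};
}
```

Using a bare `throw;` preserves the original exception object, so callers further up the stack still see the timeout and can clean up their own state.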

Object Mapping

Object Mapping Sequence
A number of operations are blocking during `mapObject`, i.e., object master query and connect as well as the map request.

Write timeout-aware master query when refactoring. Catch the timeout exception from the waitRequest on map, handle it by detaching the object if needed, and adapt the command handlers to handle commands for timeout/cancelled map requests.

Object::sync, commit

Commit is a local operation and can't fail.

eq

The thread main loops will catch all otherwise uncaught exceptions and output a warning for each. This allows the application to catch and act on the exception by overriding the appropriate task methods, while providing a sensible default behaviour.
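The default behaviour of a thread main loop can be sketched as follows. `Task` and `runTasks` are hypothetical names, and the sketch only catches exceptions derived from `std::exception`; whether eq::base::timeoutError derives from it is an assumption.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <vector>

namespace sketch
{
typedef std::function< void() > Task; // hypothetical task method

// Executes all tasks in order. An exception escaping a task is logged
// as a warning and does not terminate the loop. Returns the number of
// tasks which failed this way.
inline size_t runTasks( const std::vector< Task >& tasks, std::ostream& log )
{
    size_t nFailed = 0;
    for( const Task& task : tasks )
    {
        try
        {
            task();
        }
        catch( const std::exception& e )
        {
            log << "Warning: uncaught exception in task: " << e.what()
                << std::endl;
            ++nFailed;
        }
    }
    return nFailed;
}
}
```

A task which handles the timeout itself never reaches the catch block of the loop, which is how an application can replace the default warning with its own recovery.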

Equalizer should not need to find and disconnect dead nodes, since Collage shall perform all necessary actions.

One main issue is timeout accumulation. Multiple blocking operations served by a single node each have a separate timeout, which causes the application to block multiple times before the rendering will be interactive again.

Input frames

The FrameData ready Lock will be replaced by a TimedLock. Both the monitor used for waiting on a group of frames and the lock used to wait on a single frame will let the timeout exception through to the caller. The compositor will catch these exceptions, ignore the failed images and rethrow the exception.
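One possible reading of the compositor behaviour is sketched below: frames which never become ready are skipped, the remaining frames are still assembled, and the first timeout is rethrown at the end. The ordering of these steps is an assumption, as are the names `Frame`, `waitReady`, `assembleFrames` and `TimeoutError`.

```cpp
#include <exception>
#include <stdexcept>
#include <vector>

namespace sketch
{
// Hypothetical stand-in for eq::base::timeoutError.
struct TimeoutError : std::runtime_error
{
    TimeoutError() : std::runtime_error( "timeout" ) {}
};

// Hypothetical input frame; 'arrives' scripts whether it becomes ready.
struct Frame
{
    int id;
    bool arrives;
};

// Hypothetical wait: throws if the frame does not become ready in time.
inline void waitReady( const Frame& frame )
{
    if( !frame.arrives )
        throw TimeoutError();
}

// Appends the ids of all frames which became ready to 'assembled'.
// Failed frames are ignored; the first timeout is rethrown afterwards.
inline void assembleFrames( const std::vector< Frame >& frames,
                            std::vector< int >& assembled )
{
    std::exception_ptr failure;
    for( const Frame& frame : frames )
    {
        try
        {
            waitReady( frame );
            assembled.push_back( frame.id );
        }
        catch( const TimeoutError& )
        {
            if( !failure ) // remember only the first failure
                failure = std::current_exception();
        }
    }
    if( failure )
        std::rethrow_exception( failure );
}
}
```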

Swapbarrier

Ignore the exception and let it be caught by the thread main loop.

Node, pipe state and frame synchronization

These are local operations which do not fail.

Blocking application methods

Server::chooseConfig, releaseConfig
Config::init, exit, update, finishFrame

These functions catch the timeout exception of their waitRequest, clean up pending data and rethrow the exception.

eq::server

The server::Node needs to check the state of its co::Node at the beginning of each operation. If the network node has failed, the server::Node is set to the failed state. All operations should already take the appropriate actions based on the init reliability feature.

appNode failure

Failure of the application node has to cause a Config::exit and release of the configuration for further application runs.

API

TBD

Issues:

1. Timeout accumulation

Multiple operations served by the same failed node will have to time out independently. In the worst typical use case this multiplication is `2 * numGPUS * latency` (one input frame and one swap barrier per GPU, latency render frames queued).

Neither the swap barrier nor the input frames know which node will provide the data necessary to finish the operation.
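As a worked example of the accumulation, the following sketch computes the worst-case number of timeouts (2 * numGPUS * latency) and the resulting blocking time. It assumes that the operations time out one after the other and that each waits for the full timeout; the function names and the millisecond unit are hypothetical.

```cpp
#include <cstdint>

namespace sketch
{
// Worst typical case: one input frame and one swap barrier per GPU,
// for each of the 'latency' render frames queued.
constexpr uint32_t worstCaseTimeouts( const uint32_t numGPUs,
                                      const uint32_t latency )
{
    return 2 * numGPUs * latency;
}

// Total blocking time if every operation waits for the full timeout.
constexpr uint32_t worstCaseBlockMS( const uint32_t numGPUs,
                                     const uint32_t latency,
                                     const uint32_t timeoutMS )
{
    return worstCaseTimeouts( numGPUs, latency ) * timeoutMS;
}
}
```

For example, four GPUs with a latency of two frames give 16 independent timeouts. With a timeout of five seconds, the application could block for up to 80 seconds before rendering becomes interactive again.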

2. Hardware swapbarrier

Hardware swapbarrier timeout support is not part of this feature. Node failures in a HW sync group may cause deadlocks (to be tested).

Example deadlocks

...
4  in eq::Pipe::waitFrameFinished (this=0x2805c00, frameNumber=520) at client/pipe.cpp:453
5  in eq::Node::_finishFrame (this=0x36023a0, frameNumber=520) at client/node.cpp:232
6  in eq::Node::_cmdFrameFinish (this=0x36023a0, command=@0x3624800) at client/node.cpp:559
..
13 in eq::fabric::Client::processCommand (this=0x2801000) at fabric/client.cpp:100
14 in eq::Config::finishFrame (this=0x2006000) at client/config.cpp:286
..
16 in main (argc=5, argv=0xbffff11c) at /Users/eile/Software/eq-git/src/examples/eqPly/main.cpp:90

..
3  in eq::base::Monitor<unsigned int>::waitGE (this=0xb0490934, value=@0xb0490950) at monitor.h:348
4  in eq::Compositor::assembleFramesUnsorted (frames=@0x186ab1c, channel=0x186a800, accum=0x3206b80) at client/compositor.cpp:423
5  in eq::Compositor::assembleFrames (frames=@0x186ab1c, channel=0x186a800, accum=0x3206b80) at client/compositor.cpp:217
6  in eqPly::Channel::frameAssemble (this=0x186a800, frameID=521) at /Users/eile/Software/eq-git/src/examples/eqPly/channel.cpp:228
..
13 in eq::Pipe::PipeThread::run (this=0x3203df0) at pipe.h:428