layout | title | sched-activation |
---|---|---|
course | Active failure testing (Wednesday, Week 11, March 26) | class="active" |
Source: {{site.data.bibliography.tseitlin2013.title}}.
"Resilience is an attribute of a system that enables it to deal with failure in a way that does not cause the entire system to fail."---Tseitlin, p. 42.
- If a movie recommendation service fails, give users a generic list of titles
- "A complex system is constantly undergoing varying degrees of failure" (p. 42)
- Redundancy and fault tolerance
- Regularly induce failure
Simply listing possible failures helps you understand your system:
- How far outside the service performance SLA is a "failure"?
- How many retries before you consider a service "failed"?
- What if only part of a service is working (say, read but not write)?
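The three questions above can be turned into explicit thresholds. The following is a minimal sketch, not something taken from the source: the `Probe` record, the `classify` function, and the default values (a 200 ms SLA, a 2x slack factor, three retries) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    DEGRADED = "degraded"   # only part of the service works (say, read but not write)
    FAILED = "failed"


@dataclass
class Probe:
    """One observed call to a service (hypothetical record format)."""
    succeeded: bool
    latency_ms: float
    operation: str  # for example "read" or "write"


def classify(probes, sla_ms=200.0, slack=2.0, max_retries=3):
    """Classify a service from its most recent probes.

    A probe is bad if it errored or took longer than slack * sla_ms.
    An operation counts as failed once its last max_retries probes are all bad.
    """
    by_op = {}
    for p in probes:
        by_op.setdefault(p.operation, []).append(p)

    failed_ops = set()
    for op, ps in by_op.items():
        recent = ps[-max_retries:]
        bad = [(not p.succeeded) or p.latency_ms > slack * sla_ms for p in recent]
        if len(recent) == max_retries and all(bad):
            failed_ops.add(op)

    if not failed_ops:
        return Health.OK
    if failed_ops == set(by_op):
        return Health.FAILED
    return Health.DEGRADED
```

Writing the thresholds down as parameters forces a decision on each question: how far past the SLA is too far, how many attempts are enough, and whether a partial outage is its own state.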
- Automatically (today)
- Manually (next class)
"Monkeys"---scripts that deliberately cause key services to fail
Open-source versions exist for Chaos Monkey, Conformity Monkey, and Janitor Monkey
Chaos Monkey: randomly terminates live, customer-facing instances
Ensures that services do not rely on:
- On-instance state
- Instance affinity (has to run on specific instance)
- Persistent connections
Services can set probability or opt out
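The selection step (random termination, with a per-service probability and an opt-out) can be sketched as follows. This is an illustration only, not Netflix's implementation: `ServiceGroup` and `pick_victims` are hypothetical names, and the code only chooses instances. The actual termination call to the cloud provider is deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass
class ServiceGroup:
    """Hypothetical description of one service's instances and chaos settings."""
    name: str
    instances: list
    probability: float = 1.0   # chance that one instance is picked in a run
    opted_out: bool = False    # service has opted out of random termination


def pick_victims(groups, rng=None):
    """Return (service name, instance id) pairs selected for termination in this run."""
    rng = rng or random.Random()
    victims = []
    for g in groups:
        if g.opted_out or not g.instances:
            continue
        # rng.random() is in [0.0, 1.0), so probability 1.0 always selects
        # and probability 0.0 never does.
        if rng.random() < g.probability:
            victims.append((g.name, rng.choice(g.instances)))
    return victims
```

Passing in a seeded `random.Random` makes a run reproducible, which helps when a termination has to be correlated with monitoring data afterwards.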
Chaos Gorilla: causes Netflix services for an entire Amazon Availability Zone to fail
- Partitioned mode (both sides continue)
- Total failure (failed zone terminated)
"Causes massive damage" to Netflix's services
- As of 2012, only run manually
- Increasingly aggressive with every run
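Before inducing a zone failure for real, one can check on paper which services would survive it. The sketch below is such a dry run under stated assumptions: `placements` is a hypothetical list of `(service, zone)` pairs, one per running instance, and the zone names in the usage example are placeholders.

```python
from collections import Counter


def surviving_capacity(placements, failed_zone):
    """Count instances per service that remain after one zone is lost.

    placements: list of (service, zone) pairs, one per running instance.
    """
    # Start every service at zero so that services wiped out entirely still appear.
    remaining = Counter({service: 0 for service, _ in placements})
    for service, zone in placements:
        if zone != failed_zone:
            remaining[service] += 1
    return dict(remaining)


def services_at_risk(placements, failed_zone, min_instances=1):
    """Services that would drop below min_instances if failed_zone went down."""
    left = surviving_capacity(placements, failed_zone)
    return sorted(s for s, n in left.items() if n < min_instances)
```

For example, with `[("api", "zone-a"), ("api", "zone-b"), ("recs", "zone-a")]`, losing `zone-a` leaves `recs` with no instances, while losing `zone-b` puts no service at risk.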
Chaos Kong: takes down Netflix services for an entire Amazon Region (multiple Availability Zones)
A resilient system cannot be limited to one Zone
Chaos Kong still under development
Latency Monkey: introduces delays in client-server communication
Service is still there, just slow
Useful for testing resilience of new services
- Increase latency of services they depend upon
- Leave latency unchanged for all other clients of that service
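The idea of slowing a dependency for one client while leaving every other client untouched can be sketched with a wrapper around a request handler. Everything here is hypothetical: `with_injected_latency`, the `(client_id, request)` handler signature, and the injectable `sleep` parameter are assumptions made for the example.

```python
import time


def with_injected_latency(handler, delay_s, affected_clients, sleep=time.sleep):
    """Wrap handler so that only requests from affected_clients are delayed.

    handler: callable taking (client_id, request) and returning a response.
    sleep:   injectable for tests; defaults to time.sleep.
    """
    def wrapped(client_id, request):
        if client_id in affected_clients:
            sleep(delay_s)   # the service still answers, just slowly
        return handler(client_id, request)
    return wrapped
```

A new service under test would be listed in `affected_clients`, so it sees its dependency slow down while existing clients of that dependency see normal latency.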
Conformity Monkey: a style checker ("lint") for instances
Janitor Monkey: locates resources that should be deleted
Notifies owner
Owner has three days to countermand the deletion
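The locate, notify, and wait cycle can be sketched as a single pass over known resources. This is a minimal illustration, assuming a hypothetical `Resource` record plus caller-supplied `is_unused` and `notify` callables. Only the three-day grace period comes from the text above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

GRACE_PERIOD = timedelta(days=3)   # owner's window to countermand the deletion


@dataclass
class Resource:
    resource_id: str
    owner: str
    marked_at: Optional[datetime] = None   # when the owner was notified
    keep: bool = False                      # owner countermanded the deletion


def janitor_pass(resources, is_unused, notify, now):
    """Run one pass; return ids of resources whose grace period has expired."""
    to_delete = []
    for r in resources:
        if r.keep or not is_unused(r):
            r.marked_at = None             # clear any earlier mark
            continue
        if r.marked_at is None:
            r.marked_at = now              # first sighting: mark and notify once
            notify(r.owner, r.resource_id)
        elif now - r.marked_at >= GRACE_PERIOD:
            to_delete.append(r.resource_id)
    return to_delete
```

The function returns ids rather than deleting anything itself, so the destructive step stays separate and can be reviewed or logged by the caller.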
Structured introduction of monkeys
- Run in test environment
- Run live with select volunteer services
- Run live with services opting in
- Run live with services opting out
Constantly monitor the health of your system
When users are impacted by a real event, turn off the monkeys
Record all changes to the system
Developers operate the services they create
Learn from failures
Blameless culture
Read {{site.data.bibliography.krishnan2012.title}}.
Google's Disaster Recovery Testing event (DiRT) is a complementary approach to Netflix's Simian Army. Where the Simian Army induces failures of key services automatically, DiRT induces them manually.