A system is fault-tolerant if it continues working even if something is wrong
Work like this is never finished, it's always in-progress
- Hardware can fail (relatively uncommon)
- Software can fail (common)
- Fault-tolerance cannot be achieved using a single computer – it may fail
- We have to use several computers
- Concurrency
- Parallel programming
- Distributed programming
- Physics – time to propagate messages
- Engineering
- Message passing is inevitable – the basis of Object-Oriented Programming
- Programming languages should make this
easydoable
We are never going to eliminate failure, systems will be inconsistent. We have to deal with it.
- How individual computers work is the smaller problem. How the system works as a whole is important.
- How the computers are interconnected and the protocols used between computers is the significant problem.
- We want the same way to program large and small scale systems.
Why do we have to program differently for local programs to communicate (memory) versus distributed systems (message bus)?
In Object-Oriented Programming, Alan Kay said that the big idea behind Object-Oriented Programming was "messaging"
- Derived from Smalltalk and Prolog (influenced by ideas from CSP)
- Unifies ideas on concurrent and functional programming
- Follows laws of physics (asynchronous messaging)
- Designed for programming fault-tolerant systems
Invented Erlang to solve a problem, not the language to find a problem to be solved with it.
Building fault-tolerant software boils down to detecting errors and doing something when errors are detected.
We can't fix our own errors in the same way if I'm having a heart attack, I'll need someone to help me out
- Errors that can be detected at compile time
- Errors that can be detected at run-time
- Errors that can be inferred
- Reproducible errors
- Non-reproducible errors
- Find methods to prove Software correct at compile-time
- Assume software is incorrect and will fail at run-time then do something about it at run-time
Proving the self-consistency of small programs will not help. Testing a system will tell us that is self-consistent, not that it is correct.
- Some small things can be proven to be self-consistent.
- Large assemblies of small things are possible to prove correct.
1980 - Erlang model of computation rejected. Shared memory systems ruled the world. 1985 - Ericsson started working on "a replacement of PLEX", started thinking about errors. "errors must be corrected somewhere else", "shared memory is evil", "pure message passing" 1986 - Erlang, unification of OO with FP 1998 - Several products in Erlang, Erlang gets banned and moved to Open Source 2004 - Erlang model of computation widely adopted in many different languages.
- Highly reliable (nuclear power plant control, air-traffic), satellite (very expensive if they fail)
- Reliable (driverless cars) (moderately expensive if they fail. Kills people if they fail)
- Reliable (annoys people if they fail), banks, telephone
- Dodgy (cross if they fail), Internet, HBO, Netflix
- Crap (very cross if they fail), free apps
Different technologies are used to build and validate the systems.
How can we make software that works reasonably well even if there are errors in the software? "Making reliable distributed systems in the presence of software errors" book by Joe Armstrong
- Concurrency
- Error encapsulation
- Fault detection
- Fault identification
- Code upgrade
- Stable storage
- Detect all errors
- If you can't do what you want to do try to do something simpler (relax requirements)
- Handle errors "remotely" (detect errors and ensure that the system is put into a safe state defined by an invariant)
- Identify the "Error Kernel" (the part that must be correct)
The node can be on different machine. Akka is "Erlang supervision for Java and Scala"
- Ericsson smart phone data setup
- CouchDB (CERN)
- Cisco (netconf)
- Spine2 (NHS - uk - riak (basho) replaces Oracle)
- RabbitMQ
- An undesirable property of a program
- Something that crashes a program
- A deviation between desired and observed behaviour
- The program (run-time) finds the error
- Errors
- Arithmetic errors: divide by zero, overflow, underflow, ...
- Array bounds violated
- System routine called with nonsense arguments
- Null pointer
- Switch option not provisioned
- An incorrect value is observed
- What can we do
- Ignore it (no)
- Try to fix it (no)
- Crash immediately (yes)
- Don't make matters worse
- Assume somebody else will fix the problem (someone up in the supervision tree, this is like rebooting the system)
- What should the programmer do when they don't know what to do?
- Ignore it (no)
- Log it (yes)
- Try to fix it (possibly, but don't make maters worse)
- Crash immediately (yes)
- In sequential languages with single threads, crashing is not widely
- Errors
- The programmer finds the error
- The compiler finds the error
A sequential program when crashes it turns everything down (single thread).
When in a parallel program something crashes, the rest should keep working. There are linked process, if a process dies, these other processes are going to tell the process has died; these processes are going to get error messages.
- Fault-tolerance is impossible with one computer
- Scalable is impossible with one computer (yes, you can scale things vertically up to a limit; up to the capacity of the computer)
- Security is very difficult with one computer (if the system gets compromised, everything gets compromised)
- I want one way to program not two ways. One for local systems, the other for distributed systems (rules out shared memory)
- The world is concurrent
- Arithmetic errors
- Silent and deadly errors, errors where the program does not crash but delivers an incorrect result
- Noisy errors, errors which cause the program to crash
- Very difficult to get right
- The same answer in a single and double precision does not mean the answer is right
- If it matters, you must prove every line containing arithmetic is correct
- Real arithmetic is not associative
- Unexpected inputs
- Wrong values
- Programs does not crash, but the values computed are incorrect or inaccurate
- How do we know if a program/value is incorrect if we do not have a specification
- Many programs have no specifications or specs that are so imprecise to be useless
- The specification might be incorrect and the tests and the program
- Wrong assumptions about the environment
- Sequencing errors
- Concurrency errors
- Breaking laws of maths or physics
The programmer does not know what to do: CRASH
I call this "let it crash" Somebody else will fix the error Needs concurrency and links
What do you do when you receive an error?
- Maintain an invariant
- Try to do something simpler
What's a message?
- A program is a black box
- There are thousands of programming languages
- What language used is irrelevant
- The only important thing is what happens at the interface, their behaviour
- Two systems are the same if they obey observational equivalence
We shouldn't care about what happen inside these boxes, we just take care about the boundaries.
- Interaction between components involves message passing
- There are very few ways to describe messages (JSON, XML)
- There are very very few formal ways to describe the valid sequences of messages (= protocols) between components
- We need a way to describe protocols. Protocols are contracts. Contracts assign blame. We need an architecture that describes what we see on the wire: the contract checker
- How do we describe contracts?
- We need a way to describe protocols. Protocols are contracts. Contracts assign blame. We need an architecture that describes what we see on the wire: the contract checker