Developing a Service-Level Agreement (SLA)
Thursday, Feb. 20: The answer key.
Due: Wednesday, February 19, 2014
Submit to CourSys: A PDF file giving your answers to each question. Show how they were derived.
Percentage of course grade: 4%
This assignment asks you to work through the issues discussed in [{{ site.data.bibliography.dean2013.title }}]({{ site.data.bibliography.dean2013.url }}). We will use the structure of Assignment 2 as an example application.
The latency of a request in Assignment 2 is made up of the latencies of its component parts: the server (which saves the original image in S3 and queues up resizes on the workers), the SQS queue (the time it takes to deliver messages from the server to the workers), and the workers (which do one or more resizes and save them in S3). Assume the following characteristics for each component:
- the latency of the server is a constant 200 ms,
- the latency of the queue (the time it takes for a message to travel from the server to a worker) in ms is a logarithmic function of the number of workers, `ql = 50 + 25 * log10 w`, where `w` is the number of workers, and
- the latencies of the workers according to the following tables:
`w = 2` workers
Percentile | Time (ms) |
---|---|
50.0 | 125 |
99.9 | 175 |
Example: If two workers are creating thumbnails and every request requires each worker to make a thumbnail, 50% of the time all thumbnails for one request will be completed within 125 ms after the workers have received them and virtually all requests will have all their thumbnails ready 175 ms after the workers have received them.
`w = 1000` workers
Percentile | Time (ms) |
---|---|
50.0 | 150 |
90.0 | 325 |
99.0 | 650 |
99.9 | 1050 |
Assume the latency for a complete request is the sum of the server, queue, and worker latencies. This is the time between the receipt of a request from a user task and all the thumbnails being stored in S3 and available to be read.
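The additive model above can be sketched in a few lines of Python. This is only an illustration of the stated assumptions, not the answer key: the function and constant names are hypothetical, and the worker tables are copied from this page. Because the server and queue latencies are constants for a given `w`, adding them to a worker percentile gives the same percentile of the total.

```python
import math

SERVER_LATENCY_MS = 200.0  # constant server latency given in the assignment

# Worker latency tables from the assignment: {workers: {percentile: ms}}
WORKER_LATENCY_MS = {
    2: {50.0: 125, 99.9: 175},
    1000: {50.0: 150, 90.0: 325, 99.0: 650, 99.9: 1050},
}


def queue_latency_ms(workers: int) -> float:
    """Queue latency ql = 50 + 25 * log10(w), in ms."""
    return 50.0 + 25.0 * math.log10(workers)


def request_latency_ms(workers: int, percentile: float) -> float:
    """Server + queue + tabulated worker latency at the given percentile."""
    worker_ms = WORKER_LATENCY_MS[workers][percentile]
    return SERVER_LATENCY_MS + queue_latency_ms(workers) + worker_ms


# Median request latency for each tabulated worker count.
for w in (2, 1000):
    print(w, round(request_latency_ms(w, 50.0), 1))
```

The same function can be called with any percentile that appears in the table for the chosen number of workers.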
Using the above formulas, calculate the 99.9th percentile latency for requests when there are two workers.
Using the same assumptions as above, but now assuming 1000 workers and that every request will require creation of 1000 thumbnails, one for each worker, calculate the 99.9th percentile latency.
Added Mon, Feb 17: For this question, continue the assumption that you have 1000 workers, as done in Question 2.
Now assume that you have revised your project to use a hedged request algorithm. At the 99th percentile time, for every worker that has still not replied, you start a second worker with the same request. Assume you have a pool of idle workers from which to assign the duplicate request. Assume that a duplicated request has the latency distribution of 2 workers. (This is unrealistically simple but it makes the computation easier.)
What is the 99.9th percentile of this latency distribution?
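One way to start reasoning about the hedged case is to bound how much work is finished shortly after the duplicates are launched. The sketch below is a rough illustration under simplifying assumptions that are not stated in the assignment: it looks at worker latency only, it assumes the duplicate starts exactly at the 99th percentile time, and it ignores the chance that the original worker finishes first. The names are hypothetical, and the result is a bound to reason from, not the answer key.

```python
# Worker-latency percentiles taken from the assignment tables (ms).
HEDGE_TRIGGER_MS = 650            # 99th percentile for w = 1000
FRACTION_DONE_AT_TRIGGER = 0.99   # by definition of the 99th percentile
DUPLICATE_P999_MS = 175           # 99.9th percentile for w = 2
DUPLICATE_P999_FRACTION = 0.999


def fraction_done_lower_bound() -> float:
    """Lower bound on the fraction finished by trigger time + duplicate p99.9."""
    still_pending = 1.0 - FRACTION_DONE_AT_TRIGGER
    return FRACTION_DONE_AT_TRIGGER + still_pending * DUPLICATE_P999_FRACTION


# Worker time by which at least that fraction has completed.
deadline_ms = HEDGE_TRIGGER_MS + DUPLICATE_P999_MS
print(deadline_ms, fraction_done_lower_bound())
```

If the bound exceeds 0.999, the 99.9th percentile of worker latency under hedging is no later than `deadline_ms`; the server and queue latencies still have to be added to get a request latency.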
Added Mon, Feb 17: This question does not include any of the assumptions from Part 1. Answer it only using the assumptions given in this part. Each EC2 instance can only do one request at a time and every request requires only a single EC2 instance.
Assume that your latency computations make you comfortable setting an SLA of 99% at 1400 ms for all your requests. You have reserved four EC2 instances that will be used exclusively for this service (in other words, single-tenant). These instances are identical to the one used for the latency computation.
What is your SLA for throughput?
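A minimal sketch of one way to turn the latency SLA into a throughput figure follows. It assumes, beyond what this part states, that all instances are kept busy and that each request occupies its instance for the full SLA latency, which makes the result a conservative estimate. The function name is hypothetical.

```python
def throughput_per_second(instances: int, latency_ms: float) -> float:
    """Requests per second when each instance serves one request at a time."""
    requests_per_instance_per_second = 1000.0 / latency_ms
    return instances * requests_per_instance_per_second


# Four reserved instances, 1400 ms per request at the SLA percentile.
print(round(throughput_per_second(4, 1400.0), 2))
```

The throughput obtained this way carries the same percentile qualifier as the latency it was derived from.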
Assume that Amazon offers an availability of "four nines" (99.99%). In addition, your operations staff tell you that they will likely need to have the system down for one working day (eight hours) per year.
How many "nines" is your SLA for annual availability?
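The availability arithmetic can be sketched as below. The sketch assumes a 365-day year, that the planned downtime counts fully against availability, and that it is independent of Amazon's outages so the two availabilities multiply; none of these is stated in the assignment. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def combined_availability(provider: float, planned_downtime_hours: float) -> float:
    """Provider availability times the availability left after planned downtime."""
    own = 1.0 - planned_downtime_hours / HOURS_PER_YEAR
    return provider * own


def nines(availability: float) -> int:
    """Number of leading nines, e.g. 0.9995 has three."""
    # A small epsilon guards against floating-point error at exact boundaries.
    return math.floor(-math.log10(1.0 - availability) + 1e-9)


a = combined_availability(0.9999, 8.0)
print(f"{a:.6f}", nines(a))
```

Multiplying availabilities treats the system as a chain in which every part must be up; a different assumption about overlapping downtime would change the figure slightly.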