-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathacs-dissertation.tex
1840 lines (1689 loc) · 102 KB
/
acs-dissertation.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
%%
%% ACS project dissertation template.
%%
%% Currently designed for printing two-sided, but if you prefer to
%% print single-sided just remove ",twoside,openright" from the
%% \documentclass[] line below.
%%
%%
%% SMH, May 2010.
\documentclass[a4paper,12pt,twoside,openright]{report}
%%
%% EDIT THE BELOW TO CUSTOMIZE
%%
\def\authorname{Zubair Nabi\xspace}
\def\authorcollege{Robinson College\xspace}
\def\authoremail{[email protected]}
\def\dissertationtitle{Mission Control: Enabling Efficient and Heterogeneous Data Transfers
in Data Intensive Computing}
\def\wordcount{13956}
\usepackage{epsfig,graphicx,parskip,setspace,tabularx,xspace,epstopdf,url,caption,subcaption,float,listings,color}
\graphicspath{{./Figures/}}
\definecolor{javared}{rgb}{0.6,0,0} % for strings
\definecolor{javagreen}{rgb}{0.25,0.5,0.35} % comments
\definecolor{javapurple}{rgb}{0.5,0,0.35} % keywords
\definecolor{javadocblue}{rgb}{0.25,0.35,0.75} % javadoc
%% START OF DOCUMENT
\begin{document}
%% FRONTMATTER (TITLE PAGE, DECLARATION, ABSTRACT, ETC)
\pagestyle{empty}
\singlespacing
\input{titlepage}
\onehalfspacing
\input{declaration}
\onehalfspacing
\input{acknowledgements}
\singlespacing
\input{abstract}
\pagenumbering{roman}
\setcounter{page}{0}
\pagestyle{plain}
\tableofcontents
\listoffigures
\listoftables
\onehalfspacing
%% START OF MAIN TEXT
\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}
Over the course of the last decade, proliferation in web applications (Facebook,
Twitter etc.), and scientific computing (sensor networks, high energy physics,
etc.) has resulted in an unparalleled surge in data on the exabyte scale.
Consequently, the need to store and analyze this data has engendered an
ecosystem of distributed file systems~\cite{Ghemawat:2003:GFS}, structured and
unstructured data stores~\cite{Chang:2006:BDS,DeCandia:2007:DAH}, and
data-intensive computing
frameworks~\cite{Dean:2004:MSD,Isard:2007:DDD,Murray:2011:CUE}. These
``shared-nothing'' frameworks are mostly run in massive data centers with tens
of thousands of servers and network elements. These data centers are globally
distributed, for diversity and redundancy, and run a wide-variety of loads;
including user-facing applications, such as email; custom enterprise
applications, such as authentication services; and large-scale batch-processing
applications, such as web indexing~\cite{Benson:2010:NTC}. The goal of each data
center is to provide a multi-tenant\footnote{Multi-tenancy applies to both
public clouds, such as Amazon EC2 and Microsoft Azure, where the resources are
shared by paying customers and to private clouds, maintained by Google,
Facebook, etc., where different departments share the same infrastructure.}
environment for energy-, power-, space-, and cost-efficient computation, and
storage at scale~\cite{Katz:2009:TTB}.
Typically, data centers are built using commodity, off-the-shelf servers and
switches to find a sweet spot between performance and cost, and to leverage
economies of scale~\cite{Barroso:2003:WSP}. Servers in this model store or
process a \emph{chunk} of data, with redundancy ensured via replication without
any cluster/data center wide memory or storage area network (SAN). Due to their
commodity nature, servers and network elements are failure-prone and a failure
in say, a \emph{core switch} can bring down the entire network. Therefore,
storage and processing frameworks use a range of methods to deal with this,
including replication, re-execution, and speculative execution. On the network
side, end-hosts running TCP/IP are connected through cheap switches and wired
links, although wireless~\cite{Halperin:2011:ADC} and
optical~\cite{Wang:2010:CPO,Farrington:2010:HHE} links have recently been
explored. Most data centers are provisioned with 1Gbps and 10Gbps switches with
L2 Ethernet switched fabric interconnects as opposed to fabrics from the high
performance computing community such as InfiniBand. Likewise, due to the
prohibitive cost of high-end switches and routers, as opposed to commodity
off-the-shelf hardware, bandwidth is a constrained resource in the data center.
To ensure optimum utilization, subject to organizational goals, oversubscription
factors\footnote{Oversubscription factor is defined as ``the ratio of worst-case
achievable aggregate bandwidth among the end hosts to the total bisection
bandwidth of a particular communication topology"~\cite{Al-Fares:2008:SCD}.} in
real data centers vary significantly:
Some are as low as 2:1 while some can get as high as
147:1~\cite{Benson:2010:NTC}. In the cluster hierarchy, bandwidth increases from
the edge of the network towards the core (off-rack oversubscription). As a
result of off-rack oversubscription, applications are designed to keep
cross-rack communication to a minimum~\cite{Dean:2004:MSD}.
Concurrently, the emergence of cloud computing has resulted in multi-tenant data
centers in which user applications have wide-varying requirements and
communication patterns. Applications range from low-latency query jobs to
bandwidth-hungry batch processing jobs~\cite{Alizadeh:2010:DCT}. In addition,
applications exhibit one-to-many, many-to-one, and many-to-many communication
patterns. In order to ensure better fault-tolerance, scalability, and end-to-end
bandwidth in data center networks, several new
topologies~\cite{Al-Fares:2008:SCD,Guo:2008:DSF,Guo:2009:BHP,Greenberg:2009:VSF}
have been proposed to replace the status quo: a hierarchical 2/3-level tree
which connects end-hosts through \emph{top-of-rack} (or \emph{aggregation}), and
\emph{core} switches. In the same vein,
L2~\cite{Mudigonda:2010:SCD,Vattikonda:2012:PTD} and
L3/L4~\cite{Alizadeh:2010:DCT,Vasudevan:2009:SEF,Raiciu:2010:DCN,Wilson:2011:BNL,Wu:2010:IIC}
protocols, topologies and design
frameworks~\cite{Singla:2011:JND,Al-Fares:2008:SCD,Guo:2008:DSF,Guo:2009:BHP,Greenberg:2009:VSF,Mudigonda:2011:TFC,Chen:2010:GAA},
and control and virtualization
planes~\cite{NiranjanMysore:2009:PSF,Mudigonda:2011:NSM,Guo:2010:SDC,Ballani:2011:TPD,Shieh:2011:SDC,Rodrigues:2011:GSB,Al-Fares:2010:HDF}
for data center communication and resource allocation have also experienced
innovation, focusing on some combination of high bandwidth, fairness,
fault-tolerance, and scalability.
In this thesis, we explore a new point in the design space by enabling diversity
at the transport layer. Specifically, we propose a decoupling between data
transfer policy and mechanism in data intensive computing. To this end, we
present the design, implementation, and evaluation of \emph{Mission Control}, an
abstraction framework that resides at the application layer and selects the
transport semantics for a particular transfer. Based on runtime parameters,
underlying topology, and user-supplied constraints, it selects both the
transport protocol and the schedule for a particular data transfer.
Data intensive computing frameworks such as MapReduce~\cite{Dean:2004:MSD} and
CIEL~\cite{Murray:2011:CUE} delegate data transfers to Mission Control by
specifying a set of files and network destinations. In addition, frameworks can
also expose a set of constraints, such as the use of encryption. Mission Control
uses global information to choose an optimum policy for each data transfer. We
have implemented Mission Control for the CIEL~\cite{Murray:2011:CUE} framework.
Our main contributions are the following:
\begin{enumerate}
\item Highlighting the shortcomings of TCP in the data center and
proposing that the solution is to have diversity at the transport layer.
\item The design and implementation of Mission Control, an abstraction layer
that chooses data transfer semantics for a particular transfer.
\item An evaluation of different transport protocols to show their pros and
cons. The evaluation suggests that there is no one-size-fits-all protocol as
each protocol explores a different point in the design and optimization space.
\end{enumerate}
Our evaluation shows that data transfer indeed accounts for a large chunk of the
job completion time and can potentially become a performance bottle-neck. The
evaluation also shows that increasing the number of concurrent data fetch flows
at hosts can improve the transfer throughput. In addition, multi-path TCP
enables the framework to leverage multiple paths in the network. Moreover, data
center TCP improves job completion time by 20\% on average by keeping the TCP
receive queue length in check but is unable to match TCP's per flow fairness.
Furthermore, TCPcrypt can transparently encrypt data with an acceptable
overhead. Most importantly, the evaluation corroborates our thesis that there is
no single solution which is applicable across the board and to counter this, it
imperative to have diversity at the transport layer.
\section{The Case Against TCP in the Data Center}
During the last 3 decades the Internet protocol stack has experienced
considerable innovation. To keep up with the diverse range of ever-emerging
applications, a number of physical, data link, and application layer protocols
have been developed and have stood the test of time. In contrast, IP, TCP, and
UDP have been entrenched at the network and transport layer, leading to an
hourglass shaped stack~\cite{Akhshabi:2011:ELP}. This ossification of the
Internet protocol stack at the transport and network layer has also been
mimicked by data center networks. TCP/IP has become the de-facto ``narrow-waist"
of the data center network. On the one hand, this has standardized direct
communication to/from and within the data center without the need for any
middleware, but on the other hand, it has also spawned a new set of problems
unique to the data center environment. In the data center, the role of IP at the
network layer is minimal, as it is only employed for naming and addressing. The
only obstacle is address assignment which needs to be done manually to encode
location/topology information within the address\footnote{DHCP can be used for
address assignment if only physical IDs, such as MAC addresses, are used
throughout. Advanced topologies, such as BCube and VL2, operate on logical IDs,
which requires an additional physical-to-logical address encoding at the DHCP
server.}. Although manual assignment is tedious and error-prone, it can be
automated by leveraging the well-defined structure of data
centers~\cite{Chen:2010:GAA}. In contrast, TCP faces a number of pathological
challenges due to a significantly different bandwidth-delay product, round-trip
time (RTT), and retransmission timeout (RTO)~\cite{Chen:2009:UTI} than would be
experienced in a wide area network (WAN). For example, due to the low RTT, the
congestion window for each flow is very small. As a result, flow recovery
through TCP fast retransmit is impossible, leading to poor net
throughput~\cite{Kandula:2009:NDC}. To exacerbate problems, data-intensive
computing frameworks (DICF) are bandwidth hungry and require all data to be
\emph{materialized} before the next stage can commence (For example,
\emph{shuffle} between Map and Reduce stages). In fact, data transfer between
stages can account for more than 50\% of the job completion
time~\cite{Chowdhury:2011:MDT}.
Figure~\ref{chap:introduction:fig:sort_fraction} shows the CDF of a MapReduce
sort application for different sized datasets (between 2 GB and 20 GB). From the
graph, it is visible that shuffle can account for up to 60\% of total execution
time for an I/O intensive application such as sort. Similar to the problem of
compute, memory, and disk I/O ``stragglers''~\cite{Zaharia:2008:IMP}, an
imbalance in network efficiency can lead to the job completion time being
dictated by the task with the longest shuffle time.
\begin{figure}[h!]
\centering
\includegraphics[width=0.8\textwidth]{sort_fraction.eps}
\caption{Shuffle time as a fraction of execution time for Sort}
\label{chap:introduction:fig:sort_fraction}
\end{figure}
It is important to highlight that some of these problems, stem at least in part
from the difference in traffic characteristics and scale between data centers
and other networks. In addition, unlike wide area network applications, there is
a tight coupling between an application's usage of network, compute, and storage
resources. In production data centers, due to the widely-varying mix of
applications, congestion in the network can last from 10s to 100s of
seconds~\cite{Kandula:2009:NDC}. Moreover, during high utilization periods, the
task failure rate of batch-processing systems increases, especially during the
\emph{reduce} phase of MapReduce-style applications, by a median value of 1.1x,
due to read failures over the network. In the worst case, this can halt the
entire job if subsequent tasks are dependent on the failed ones. Furthermore,
link utilization is highest in the core of the network, but most losses occur
towards the edge~\cite{Benson:2010:NTC}. Additionally, in some cases
over-provisioning the network is overkill and good performance can be achieved
by adding a few extra links at hotspot points~\cite{Kandula:2009:FTD}. Finally,
in commodity switches the buffer pool is shared by all interfaces. As a result,
if long flows hog the memory, queues can build up for the short flows.
In reaction, TCP reduces the window size by half. This leads to a large mismatch
between the input rate and the capacity of the link, leading to buffer
underflows and loss of throughput~\cite{Alizadeh:2010:DCT}. Overall, traffic
engineering in the data center is a complex task due to the wide variability in
flow duration, flow size, flow arrival time, and server
participation~\cite{Kandula:2009:NDC}: Standard TCP without any global knowledge
is suboptimal~\cite{Benson:2010:CFT}.
Another major shortcoming of TCP, which is manifest in the data center, is TCP
throughput collapse, or
\emph{incast}~\cite{Chen:2009:UTI,Vasudevan:2009:SEF,Wu:2010:IIC,Alizadeh:2010:DCT}.
This can cause overall application throughput to decrease by up to
90\%~\cite{Vasudevan:2009:SEF}. Typically, to handle incast, application-level
remedies are added which include varying the size of the packets so that a large
number of packets can fit the switch memory, and adding application-level jitter
to decorrelate the packet inter-arrival time. Both of these solutions are
suboptimal, as they either increase the number of packets, or add unnecessary
delay~\cite{Alizadeh:2010:DCT}. Finally, most applications operate in a
virtualized environment where a single host runs tens of virtual machines (VM).
This sharing of CPU by multiple VMs increases the latency experienced by each
VM, which can be orders of magnitude higher than the RTT between hosts inside a
data center~\cite{Gamage:2011:OFI,Kangarlou:2010:VIT}. This significantly slows
down the progress of TCP connections between different applications. So much so
that there are periods where TCP throughput can go down to
zero~\cite{Wang:2010:IVN}.
All of these problems have even prompted large-scale deployments to abandon TCP
altogether. For instance, Facebook~\cite{Facebook} now uses a custom UDP
transport. Some researchers have gone to the extent of suggesting that fixing
TCP is tantamount to ``saving the world''~\cite{Lyon:2008:TII}\footnote{Heat and
CO$_{2}$ emissions, and radioactive waste can be accounted to TCP's
inefficiency.}. For completeness, it is worth noting that TCP was designed as a
general transport mechanism for a wide area network (WAN)~\cite{Clark:1988:DPD}
and has experienced near universal adoption in environments as diverse as
satellites~\cite{Henderson:1999:TPF}. This is primarily due to TCP's ability to
provide reliability, congestion and flow control, and in-order packet delivery.
Moreover, its maturity makes it a ``kitchen sink'' solution for
developers~\cite{Vasudevan:2009:SEF}. Naturally, it has become the transport of
choice for the data center. But it is clear from the discussion above that TCP
is sub-optimal in a data center environment -- a fact agreed upon universally by
researchers. As a result, in recent years, a number of novel versions of TCP
have been
proposed~\cite{Alizadeh:2010:DCT,Wu:2010:IIC,Wilson:2011:BNL,Vasudevan:2009:SEF,Chen:2009:UTI}.
Unfortunately, most solutions are tied either to the underlying topology or the
structure and traffic mix of the data center: There is no one-size-fits-all
solution.
\section{Augmenting the Narrow-waist of the Data Center}
It should be clear from the discussion above that the data center communication
optimization space is wide due to the interplay of data center structure,
traffic patterns, and application requirements. Traffic engineering that
performs optimal routing in the data center requires global state, multi-path
routing, and the assumption of short-term predictability for
adaptation~\cite{Benson:2010:CFT,Benson:2011:MFG}. Further, data intensive
computing frameworks (DICF) work at the level of a \emph{transfer}, which is not
covered by existing solutions which operate at the flow or packet
level~\cite{Chowdhury:2011:MDT}. Put differently, stages between DICF are
dependent on bulk \emph{transfers} of data. These requirements suggest that
there needs to be a decoupling between data transfer policy and mechanism.
Therefore, in light of these requirements, this thesis proposes application and
network characteristics-centric data transfers. The same strategy when applied
at the network layer improves both application-level and network-level
performance~\cite{Abu-Libdeh:2010:SRF}. In addition to improving performance,
transport layer characteristics can also be chosen to improve energy
efficiency~\cite{Heller:2010:ESE} or security~\cite{bittau:the}. Adding power
constraints can greatly reduce the consumption footprint of today's data centers
where the network consumes 10-20\% of the total power~\cite{Greenberg:2008:CCR}.
Likewise, application traffic can transparently and
efficiently~\cite{bittau:the} be encrypted to ensure security in a multi-tenant
environment.
The rest of this document is structured as follows. \S\ref{chapter:background}
gives an overview of the cloud computing paradigm, data intensive computing
frameworks, and application models in the cloud. Related work is summarized in
\S\ref{chapter:relatedWork}. \S\ref{chapter:designImplementation} presents the
design and implementation of Mission Control. An evaluation of the network
patterns of existing systems and of Mission Control is given in
\S\ref{chapter:evaluation}. Finally, \S\ref{chapter:conclusion} concludes and
points to future work.
\chapter{Background}\label{chapter:background}
This section first gives an overview of cloud computing
(\S\ref{chapter:background:section:cloudComputing}), followed by an introduction
to data intensive computing frameworks
(\S\ref{chapter:background:section:dataIntensive}), and finally, a discussion on
consistency, replication, scalability, and reliability in the context of cloud
computing (\S\ref{chapter:background:section:consistency}).
\section{Cloud Computing}\label{chapter:background:section:cloudComputing}
Cloud computing is an all encompassing term for applications (services from the
point of view of the average user), software systems, and the hardware in
massive data centers~\cite{Armbrust:2009:ATC}. \emph{Public clouds} are used by
companies to provide utility computing in which software (SaaS), infrastructure
(IaaS), and platform (PaaS) can be a service. Similarly, organizations and
companies also maintain \emph{Private clouds} which are used internally for data
processing and storage. In the cloud computing model, users can buy \emph{X} as
a service (XaaS), where X can be software, infrastructure, and platform and
elastically increase or decrease required resources based on dynamic
requirements. This allows the user to focus on application design without having
to worry about system or network administration. The ascent of cloud computing
has been fuelled by the rise of ``Big Data'' and in turn, data intensive
computing. The Cloud has become the ideal platform to store and analyze massive
datasets generated by applications as diverse as satellite imagery and web
indexing. Batch processing systems, such as MapReduce~\cite{Dean:2004:MSD}, are
in turn used to compute user queries over these datasets. Finally, users can
leverage the ``cost associativity''\footnote{Using a large number of computers
for a short time costs the same amount as using a small number for a long time.}
of the Cloud and the parallelism in the application to speed up the job
completion time. Data intensive computing frameworks are the subject of the next
section.
\section{Data Intensive
Computing}\label{chapter:background:section:dataIntensive}
Google's MapReduce~\cite{Dean:2004:MSD} and its open source counterpart, Hadoop
~\cite{hadoop}, are at the forefront of data intensive computing. MapReduce is a
programming model and execution engine for performing large-scale computation on
a cluster of commodity machines. It abstracts work division, communication, and
fault tolerance beneath a two function API. The user only needs to specify one
\emph{map} and one \emph{reduce} function, while the framework takes care of the
rest. The architecture consists of a master node and a set of worker nodes.
The master node schedules shards of the input dataset -- stored on a distributed
filesystem -- on the worker nodes. The entire execution is two-phased: first the
map function is applied to all chunks of input data by the worker nodes,
followed by the application of the reduce function to the output of the mappers.
The framework operates on key-value pairs. The intermediate key-value pairs,
obtained after the map phase, are \emph{shuffled} to a set of reduce workers
based on hash partitioning. The framework also has several optimizations such
as: (1) Locality: input data is assigned on the basis of locality, (2)
Fault-tolerance: failed tasks are simply rescheduled, and (3) Speculative
re-execution: slow executing tasks are speculatively re-executed on other nodes
to improve the job completion time.
One major shortcoming of MapReduce is that it is strictly a two-phase
architecture, as a result of which, it cannot easily express multi-phase
SQL-like queries. Implementing these queries is awkward in the MapReduce
paradigm as they constitute multiple MapReduce jobs, requiring external
operators to merge the output~\cite{Yang:2007:MSR}. Additionally, the framework
is also handicapped by its single input, single output restriction. These
shortcomings are addressed by Microsoft's Dryad system~\cite{Isard:2007:DDD}.
Dryad is a general purpose framework that supports the processing of algorithms
with directed cyclic graph (DAG) flows. Like MapReduce, the framework abstracts
work scheduling, data partitioning, fault tolerance, and communication beneath a
data-flow execution engine. To this end, the developer only needs to (a)
construct a DAG of processing using provided primitives, and (b) provide a
binary executable for nodes in the DAG. Each vertex in the graph represents the
execution of some operation over its input data, while the edges between the
vertices represent the communication patterns. The framework automatically
schedules each vertex to a machine/core and sets up the communication channels.
It is important to highlight that unlike MapReduce, Dryad vertices can have
multiple inputs and outputs depending on the programming semantics. Finally,
developers can easily use high-level languages to restrict the programming
interface for a particular domain.
While MapReduce and Dryad are efficient at processing a rich set of applications
with sequential flow, they fall short of supporting iterative and recursive
applications~\cite{Bu:2010:HEI,Zaharia:2010:SCC}. In addition, these
applications, such as \emph{k}-means clustering, PageRank, etc., require a more
expressive language to define their flow. These problems are addressed by
CIEL~\cite{Murray:2011:CUE} which includes two components: (1) A scripting
language, \emph{Skywriting}~\cite{Murray:2010:SCS}, and (2) A distributed
execution engine. Skywriting is a Turing-complete language that can be used to
express data-dependent control flow. It provides primitives to loop over
expressions or call them recursively. In addition, scripting primitives can be
used to spawn and execute tasks, and de-reference data (reminiscent of pointers
in C/C++). Further, it provides library functions to express the flow of other
frameworks such as MapReduce, as CIEL subsumes both MapReduce and Dryad.
Architecturally, CIEL is similar to MapReduce and Dryad, in that it also has a
single master and several worker nodes, with the former in charge of computation
state, and the latter in charge of the actual computation.
CIEL relies on three primitives -- objects, references, and tasks -- for
defining dynamic task graphs. Specifically, objects are data sequences that are
used for input and are generated as output. Objects which have not been fully
computed yet, can be used as references. Further, tasks are the unit of
computation and scheduling. Tasks use programmatic code, or binaries to execute
over all their dependencies. Moreover, tasks can either compute all of their
output objects, or alternatively, spawn other tasks to do so. Tasks can be
executed in either \emph{eager}, or \emph{lazy} fashion, with the latter being
the default strategy. Finally, CIEL uses a number of optimizations to improve
performance such as deterministic naming, and streaming.
Figure~\ref{chap:intro:fig:frameworks} highlights the major differences between
the three frameworks.
\begin{figure}[h!]
\begin{subfigure}[b]{0.4\textwidth}
\centering
\includegraphics[width=\textwidth]{MapReduce.eps}
\caption{MapReduce has
two stages with single input and single output.}
\label{fig:mapreduce}
\end{subfigure}%
\hspace{0.1pt}
\begin{subfigure}[b]{0.5\textwidth}
\centering
\includegraphics[width=\textwidth]{Dryad.eps}
\caption{Dryad execution can
consist of multiple stages with multiple input and multiple output.}
\label{fig:dryad}
\end{subfigure}
\vspace{0.1pt}
\begin{subfigure}[b]{0.8\textwidth}
\centering
\includegraphics[width=\textwidth]{CIEL.eps}
\caption{CIEL
execution, which subsumes MapReduce and Dryad, can consist of multiple
stages with multiple input and multiple output and tasks can dynamically
spawn other tasks as well.}
\label{fig:ciel}
\end{subfigure}
\caption{Overview of data intensive computing frameworks.}
\label{chap:intro:fig:frameworks}
\end{figure}
There are also high-level
languages~\cite{Olston:2008:PLN,Pike:2005:IDP,Murray:2010:SCS,Yu:2008:DSG} which
allow users to write applications which are transparently translated into a
data-flow plan, and then scheduled on top of the underlying framework. In
addition, there is a rich body of work which aims to reduce the turn-around time
of these systems~\cite{Zaharia:2010:DSS,Isard:2009:QFS,Zaharia:2008:IMP}. While
these proposals have been successful in improving scheduling and fairness, the
network utilization of these batch-processing systems has largely been ignored.
As mentioned earlier, these frameworks operate over large amounts of data inside
environments which are error-prone. Therefore, they have to make trade-offs
between consistency, replication, scalability, and reliability. These issues are
discussed in the next section.
\section{Consistency, Replication, Scalability, and Reliability in the
Cloud}\label{chapter:background:section:consistency}
It is a well-known principle that it is hard to build distributed systems which
are consistent, available, reliable, and scalable. In fact, according to
Brewer's CAP theorem, it is impossible to build a distributed system which has
consistency (all nodes see the same data), availability (the service is always
accessible), and partition tolerance (the services can tolerate the loss of
nodes/data) at the same time~\cite{Brewer:2000:TRD,Gilbert:2002:BCF}. Developers
have to sacrifice one of the three. Due to the failure-prone, commodity
infrastructure used to build distributed systems today, partition tolerance is a
given. Therefore, systems have to make a choice between consistency and
availability~\cite{Vogels:2009:EC}. Most cloud applications today, such as
Amazon's Dynamo~\cite{DeCandia:2007:DAH} focus on availability, resulting in
weak consistency or \emph{eventual consistency}. This basically available, soft
state, eventually consistent (BASE)~\cite{Pritchett:2008:BAA} model enables
applications to function even in the face of partial failure. To this end,
applications operate on local (possibly stale) replicas, while the underlying
system asynchronously applies updates to replicas.
One way to implement replication updates is through \emph{virtual
synchrony}~\cite{Birman:1987:EVS}. In this model, processes are organized into
dynamic \emph{process groups}, and messages are exchanged between groups, not
individual processes, with ordering guarantees. From the perspective of the
application, all changes to the shared data appear to be synchronous. As process
groups can have names, they can be treated as topics in a publish-subscribe
system~\cite{Birman:2010:AHO}. In addition, processes within the group are
informed if group membership changes to ensure a consistent view of group
membership across all members. Whenever a new member joins a process group, the
current state is checkpointed and sent to the newly joined member to initialize
its group replica in one atomic operation. Processes can also send multicast
events to groups with ordering guarantees.
These issues are important for a distributed system such as Mission Control, as
performance can be dependent on the durability and consistency of control
traffic. As mentioned earlier, the traffic matrix in a data center can change in
a matter of seconds, therefore, Mission Control needs to be able to respond in a
timely fashion.
\chapter{Related Work}\label{chapter:relatedWork}
This section presents relevant related work on data center networking in general
and also specific to data intensive computing. Specifically, we summarize recent
work in novel data center topologies, L2-, L3-, and L4-layer protocols,
schedulers, resource allocators, traffic reduction, and rate controllers.
\section{Topologies}
Data intensive computing frameworks such as MapReduce require full-bisection
bandwidth during the data shuffle phase. Unfortunately, a regular 2/3-layer tree
topology is unable to cater to this need as links up the tree hierarchy are
shared and hence, become a bottleneck. To remedy this, a number of novel
topologies have recently been proposed.
Al-Fares et al.~\cite{Al-Fares:2008:SCD} present a data center architecture
using a fat-tree topology which aims to support full-bisection bandwidth. A
\emph{k}-aray fat-tree enables each host and switch to have redundant links. IP
addresses are assigned in the network based on the location of the element in
the topology. To achieve even load balancing, a two-level routing table is
employed. Further, a central route control statically generates all routing
tables and loads them into switches. In addition, the architecture also enables
flow classification and scheduling to minimize overlap between flows. Finally, a
failure broadcast protocol is used to enable switches to route around failures.
VL2~\cite{Greenberg:2009:VSF} also uses a fat-tree topology but requires no
changes to the underlying L2 fabric. It uses two different addressing schemes to
simplify virtual machine migration. Interfaces and switches use
location-specific addresses while applications use application-specific
addresses. The mapping between these two schemes is maintained by a replicated
directory service. To spread traffic over the multiple links, VL2 uses both
valiant load balancing (VLB) and equal-cost multi-path routing (ECMP).
In contrast to fat-tree and VL2, DCell~\cite{Guo:2008:DSF} is a recursive,
server-centric architecture. Each server is connected to other servers directly
and through a small number of switches. DCells form a fully-connected graph and
can be assembled together to form high-level DCells. Servers in DCells are
identified by their position in the structure. Further, the architecture uses a
fault-tolerant single-path routing algorithm that exploits the recursive
structure of the topology. A related architecture, BCube~\cite{Guo:2009:BHP}, is
similar in structure but targets modular data
centers~\cite{Vishwanath:2009:MDC}. Switches in BCube only connect servers,
while servers themselves relay their traffic. Servers are addressed using an
array based on their position, as a result neighbouring servers only differ by a
single digit in the array. This information is used by source servers to encode
routing information in the packet headers, which are in turn, used by servers
for packet forwarding.
Another radical architecture is CamCube, which proposes a 3D torus topology in
which each server is directly connected to a small set of other servers,
inspired by structured overlays~\cite{Abu-Libdeh:2010:SRF}. This scheme places
each server into a 3D wrapped coordinate space and its position (\emph{x,y,z})
in the space represents its address. As a result, the physical and virtual
topology of the network is the same and applications can infer their location
through an API which exposes coordinate space information and allows one-hop
Ethernet packets. Thus, applications can implement their own routing strategies.
One problem that plagues these topologies is of incremental expansion. As a
result of their rigid, well-defined structure, incremental expansion can only be
done if the structure is preserved. Jellyfish~\cite{Singla:2011:JND} addresses
this problem by constructing a random graph topology. Random graphs achieve low
diameter and high bisection bandwidth while allowing incremental expansion with
any number of nodes. At the same time, Jellyfish can enable full-bisection
bandwidth with a lower switch count than a fat-tree.
Finally, these topologies have also spawned a number of systems to improve their
management and construction. Alias~\cite{Walraed-Sullivan:2011} and
DAC~\cite{Chen:2010:GAA} can be used to automatically assign addresses to these
topologies. Perseus~\cite{Mudigonda:2011:TFC} is a framework that enables
designers to choose an optimum data center design based on bandwidth, cost,
application requirements, etc. The framework currently supports fat-tree and
HyperX topologies. Specifically, user requirements and chosen topologies are
given as input to Perseus which then generates a design with a candidate
topology, optimized cabling, cost, and a visualization of the structure.
Although, these topologies expose full bisection bandwidth, in practice due to
multi-tenancy~\cite{Costa:2012:CEI}, and TCP's inefficiency in (a) moving away
from congested links, and (b) exploiting multiple links~\cite{Raiciu:2011:IDP},
applications do not experience the full benefit. In addition, these topologies
also increase the overall cost and wiring complexity. In data centers which
already have these topologies, Mission Control can be used to exploit their
multiple links.
\section{Ethernet Fabrics}
Ethernet also faces a number of scalability challenges in a data center
environment.
SPAIN~\cite{Mudigonda:2010:SCD} is an Ethernet fabric that implements forwarding
along multiple paths connected by commodity switches and arbitrary topologies.
Specifically, it overcomes the single spanning restriction of the Spanning Tree
Protocol (STP) by making use of multiple spanning trees which are calculated
offline. In topologies with multiple paths, each spanning tree is organized into
a separate VLAN to exploit path redundancy. In addition, it also minimizes the
overhead of packet flooding and host broadcasts.
NetLord~\cite{Mudigonda:2011:NSM} is a related system which enables flexibility
in choosing VM addresses. To this end, it implements IP encapsulation for custom
L2 packets and uses SPAIN as the underlying fabric. This enables NetLord to
scale to a large number of tenants and VMs. It also uses a custom push-based ARP
protocol to keep location information consistent. Similarly,
Portland~\cite{NiranjanMysore:2009:PSF} is a layer 2 routing and forwarding
fabric that uses topological information to assign end-hosts hierarchical
pseudo-MAC (PMAC) addresses. It employs a centralized directory service to map
PMAC onto physical MAC addresses. All routing and forwarding takes place using
PMAC prefixes while the end-hosts only see MAC addresses. In addition, the
centralized directory service eliminates the need for broadcast based ARP and
DHCP. Finally, end-hosts are modified to spread load over the available paths.
Vattikonda et al. explore a new point in the design space by dispensing with TCP
altogether and replacing it with a time-division multiple access (TDMA) MAC
layer~\cite{Vattikonda:2012:PTD}. To this end, they employ centralized
scheduling to give communicating end-hosts exclusive access to links. This
exclusive assignment of link bandwidth and switch buffer space automatically
mitigates in-network queuing and congestion. All of these proposals work at
layer 2 and are as such, complementary to our work.
\section{Global Schedulers and Resource Allocators}
To enable applications to make use of multiple paths in the network, a number of
schedulers have also been proposed. One such scheduler is
Hedera~\cite{Al-Fares:2010:HDF} which uses flow information from Openflow
switches to compute non-conflicting paths for flows, which are in turn pushed
back to the switches. This enables the network to maximize bisection bandwidth.
It provides a number of algorithms for choosing optimum paths. Similarly,
DevoFlow~\cite{Curtis:2011:DSF} is a modification of the Openflow architecture
that tries to handle most flows in the data plane to achieve scalability. Hence,
Hedera-like schedulers only need to deal with \emph{interesting} flows.
MicroTE~\cite{Benson:2011:MFG} is a traffic engineering system that adapts to
variation in network traffic at the granularity of a second. It consists of a
central controller that creates a global view of traffic demands and coordinates
scheduling using Openflow. To this end, it differentiates between predictable
traffic and non-predictable traffic. For the former, it computes routes based on
some global objective to achieve optimum routing. The leftover routes are then
used to stripe unpredictable traffic using weighted equal-cost multi-path
routing (ECMP).
A recent system, Orchestra~\cite{Chowdhury:2011:MDT}, works at the application
layer to enable intra- and inter-transfer control of two communication patterns,
which lie at the heart of batch processing systems such as MapReduce:
\emph{shuffles}, and \emph{broadcasts}. For the former, it uses weighted flow
assignment (WFA) enabled by an appropriate number of multiple TCP connections
and TCP's AIMD fair sharing while for the latter it leverages a BitTorrent-like
algorithm. Architecturally, it is a centralized application layer controller for
both intra- and inter-transfer communication. The former is managed by one or
more Transfer Controllers (TCs) while the latter is managed by an Inter-transfer
Controller (ITC). The TC continuously monitors the transfer and updates the set
of sources associated with each destination. Specifically, the TC manages the
transfer at the granularity of flows, by choosing how many concurrent flows to
open from each node, which destinations to open them to, and when to move each
chunk of data. While the results from Orchestra are promising, it is still
suboptimal due to its reliance on TCP. In addition, frameworks like Dryad
support TCP pipes, and shared memory FIFO for communication in addition to
transfer of files over the network, while Orchestra only considers the last of
these.
In contrast to all of these systems, Mission Control uses the diversity in
transport protocols to achieve optimum routing and congestion control. At the
same time, the architecture of Mission Control is inspired by the architecture
of Orchestra.
\section{Transport Protocols}
In this section we enumerate transport protocols which address some of
TCP's shortcomings.
Deadline-Driven Delivery (D$^3$)~\cite{Wilson:2011:BNL} is a data
center-specific version of TCP that targets applications with distributed
workflow and latency targets. These applications associate a deadline with each
network flow and the flow is only useful if the deadline is met\footnote{TCP on
the other hand tries to achieve global fairness and maximize throughput while
remaining agnostic to any flow-specific deadline.}. To this end, applications
expose flow deadline and size information which is exploited by end hosts to
request rates from routers along the data path. Routers only maintain aggregate
counters and thus, do not maintain any per-flow state. As a result, they are
able to allocate sending rates to flows in order for as many deadlines to be met
as possible. On the downside, D$^3$ requires changes to applications, end-hosts,
and network elements. For data centers with customizable network elements,
Mission Control can use D$^3$ for applications that expose explicit deadline
information.
Data Center TCP (DCTCP)~\cite{Alizadeh:2010:DCT} uses Explicit Congestion
Notifications (ECN) from switches to perform active queue management-based
congestion control. Specifically, it uses the congestion experienced flag in
packets---set by switches whenever the buffer occupancy exceeds a small
threshold---to reduce the size of the window based on a fraction of the marked
packets. This enables DCTCP to react quickly to queue buildup and avoid buffer
pressure. DCTCP can only work in a network with ECN-compatible switches.
Therefore, Mission Control enables DCTCP if it detects that switches in the
network support ECN. A spin-off architecture is HULL~\cite{Alizadeh:2012:LIM},
high-bandwidth ultra-low latency, that aims to make a trade-off between
bandwidth and latency for low-latency applications in a shared environment. To
this end, it reduces network buffering by replacing queue occupancy-based
congestion control with link utilization. It makes use of \emph{phantom queues}
which give the illusion of draining a link at less than its actual rate.
End-hosts employing DCTCP experience reduced transmission rates based on ECN
from these phantom queues. This enables the overall network to trade bandwidth
for latency.
Structured Stream Transport (SST)~\cite{Ford:2007:SSN} finds the middle-ground
between stream and datagram transmission. It implements a hierarchical and
hereditary stream structure which allows applications to create substreams from
existing streams without incurring startup and tear-down delay. Substreams
operate independently with their own data transfer and flow control. To ensure
inter-stream fairness, substreams share congestion control. Moreover,
``ephemeral streams'' can be used to enable datagram transmission. Finally,
applications can prioritize their streams on the fly depending on runtime
requirements. Mission Control can use SST to enable MapReduce-like systems to
prioritize their data transfers, by opening multiple substreams, and to send
lightweight control traffic using ephemeral streams.
Raiciu et al.\cite{Raiciu:2010:DCN} argue that current data center architectures
use random load balancing (RLB) to balance load across multiple paths. RLB is
unable to achieve full bisectional bandwidth because certain flows always
traverse the same path. This creates hotspots in the topology, where certain
links are oversubscribed and some are under/un-subscribed. Global flow
controllers fail in this regard too as they are unable to scale with the number
of flows. To remedy this, they propose the use of multi-path TCP (MPTCP) to
establish multiple subflows over different paths between a pair of end-hosts.
The key point is that these subflows operate under a single TCP connection. They
argue that this ensures fairness as the fraction of the total congestion window
for each flow is determined by its speed, i.e. the faster a flow is, the steeper
the slope of its additive increase. This will also move traffic away from the
most congested paths. MPTCP is supposed by Mission Control in multi-path
networks.
\subsection{Incast}
A number of proposals have also tried to tackle incast.
Vasudevan et al.~\cite{Vasudevan:2009:SEF} propose that we increase the
granularity of kernel timers to enable microsecond granularity RTT estimation,
which is in turn used to set the RTO. They show that reducing the RTO can
increase the goodput of simultaneous connections and hence negate incast.
Furthermore, making retransmissions random can also increase the goodput by
stopping multiple flows from timing out, backing off, and retransmitting
simultaneously. The takeaway lesson is that to avoid incast in low-latency data
center networks, RTOs should be on the same scale as network latency. Chen et
al.~\cite{Chen:2009:UTI} complement the analysis of the previous work by
defining an analytical model for incast. This model shows that goodput is
affected by both the minimum RTO timer value and the inter-packet wait time.
Unlike these two solutions which focus on post hoc recovery,
ICTCP~\cite{Wu:2010:IIC} aims to avoid packet losses before incast
manifestation. To this end, ICTCP implements a window-based congestion control
algorithm at the receiver side. Specifically, the available bandwidth at the
receiver is used as a trigger to perform congestion control. The frequency of
this control changes dynamically based on the queueing delay and is larger than
one RTT. Connection throughput is only throttled to avoid incast, not to
decrease application goodput.
\section{Data Center Traffic Analysis}
Kandula et al.~\cite{Kandula:2009:NDC} present an analysis of the traffic of a
1500 server data center. They use lightweight measurements at end-hosts in
conjunction with application level data to note the cause and effect of network
incidents. The target applications have a work-seeks-bandwidth and
scatter-gather pattern. Their analysis shows that most traffic is exchanged
within a rack and on average, a server communicates with two servers within its
rack and four outside the rack. With respect to congestion, their study shows
that highly utilized links are prevalent and congestion can last from 10s of
seconds to 100s. In addition, most periods of congestion are short lived.
Further, during periods of congestion there is an increase in the number of
worker failures. Finally, traffic patterns change both frequently and quickly
over time. Following in the footsteps of Kandula et al., Benson et
al.~\cite{Benson:2009:UDC}, present SNMP data from 19 data centers. Their
analysis shows that traffic load is high in the core and decreases towards the
edge. At the same time, link loses are higher near the edge and decrease towards
the core. In addition, a small fraction of the links account for most losses.
This observation shows that traffic can be routed along alternate routes to
avoid congestion. Finally, traffic is bursty in nature and packet inter-arrival
times during spike periods follow a lognormal distribution. In a follow-up
paper, Benson et al.~\cite{Benson:2010:NTC} augment their previous work by
studying different classes of data centers. These include, university campus,
private enterprise, and cloud data centers. In this work, they examine the
variation in link utilization and the magnitude of losses as well as the nature
of hotspots. Their measurements show that for cloud data centers, 80\% of the
traffic stays within the rack and the number of highly utilized core links never
exceeds 25\% of the total links.
We use the insight provided by these studies to keep the design of Mission
Control generic enough to cater to the needs of these widely-varying traffic
patterns.
\section{Reducing Network Traffic}
Another point in the design space is to reduce the amount of traffic that is
transferred across the network by MapReduce-like
systems~\cite{Costa:2012:CEI,Yu:2009:DAD}. This can be done either across
different levels in the cluster topology~\cite{Yu:2009:DAD}, or made a function
of the network itself as in Camdoop~\cite{Costa:2012:CEI}. On the downside,
performing rack-level aggregation can saturate the ingress link to the
aggregating server~\cite{Yu:2009:DAD}. This \emph{distributed combiner} approach
is most effective for MapReduce functions which are commutative and
associative~\cite{Dean:2004:MSD}. In the same vein, SUDO~\cite{Zhang:2012:ODS}
leverages the functional and data-partition properties of user-defined functions
in MapReduce-like systems to enable various optimizations. This enables the
framework to avoid unnecessary data-shuffling steps and reduce disk and network
I/O. While these frameworks reduce the amount of data transferred across the
network, they still rely on standard TCP, inheriting its flaws.
\section{Wireless and Optical Networks}
Another line of work has argued that providing full-bisection links throughout
the network is overkill as only a few hotspots exist in actual data
centers~\cite{{Halperin:2011:ADC,Kandula:2009:FTD}}. The high cost and technical
challenges of newer topologies can be avoided by provisioning the network for
the average case and avoiding hotspots by adding additional links on the fly.
The key challenge is to choose the placement and duration of \emph{flyways}.
In these proposals, flyways are constructed using Multi-Gigabit wireless links,
controlled by a centralized scheduler that decides the temporal and spatial
placement of wireless links based on observed traffic. On major shortcoming of
this approach is that in cases where the traffic has high churn, the controller
would fail to respond quickly.
Likewise, c-Through~\cite{Wang:2010:CPO} and Helios~\cite{Farrington:2010:HHE}
consider using hybrid packet and circuit switched networks in order to use
optical links to address hotspots. c-Through connects racks using optical
circuit-switched links based on traffic demand. Network isolation is ensured by
using a separate VLAN for the optical and electrical rack links. End-hosts are
used for traffic statistics collection while a central manager, connected to the
optical switch, is used to manage configuration. In contrast,
Helios~\cite{Farrington:2010:HHE} targets modular data
centers~\cite{Vishwanath:2009:MDC}. It uses flow counters in switches to
calculate the traffic matrix and in turn, uses that information to work out a
topology to maximize throughput. Finally, OSA~\cite{Chen:2012:OSA} is a
clean-slate all-optical network which can alter its topology and link capacities
based on the traffic demand.
While we do not consider wireless and optical links in this work, Mission
Control is flexible enough to support both. For example, Mission Control
maintains a global view of the network, therefore it can choose flyways or
optical paths for hotspots, if required.
\section{Virtualization and Rate Limiting}
To take advantage of economies of scale and achieve optimum resource
utilization, most data centers are virtualized. As a result, resources are
shared amongst the tenants, which can lead to contention and other undesired
behaviour.
Due to VM consolidation, the RTT of TCP connections to VMs increases
significantly, causing the latency of connections to go up. vSnoop is a system
that allows the driver domain of a host to send an acknowledgement on behalf of
the VMs~\cite{Kangarlou:2010:VIT}. It snoops on all VM traffic and maintains
per-flow state for each TCP connection which it then uses to enable early
acknowledgement of packets. The same problem afflicts the transmit path as well,
as TCP acknowledgements can get delayed, causing the TCP congestion window to
grow more slowly. To mitigate this, vFlood is another extension that offloads
congestion control to the driver domain, which allows the buffering of a high
number of packets~\cite{Gamage:2011:OFI}. Regardless of the transport protocol,
Mission Control can delegate congestion control to the driver domain in
virtualized environments.
Another related body of work deals with ensuring bandwidth fairness among
virtual machines and tenants in the data center. Seawall~\cite{Shieh:2010:SPI}
uses hypervisor-based rate limiting in conjunction with IP layer feedback
signals from the network to regulate egress traffic. Standard rate control
algorithms, such as TCP, TCP friendly rate control (TFRC), or quantized
congestion notification (QCN) are used to implement rate limiting. On the
downside, as Seawall is VM-centric, it does not divide the link bandwidth among
tenants using the link, but among the total number of VMs sending traffic
through that link. Therefore, tenants with a large number of VMs unfairly get a
large share in bandwidth. Similarly, Gatekeeper~\cite{Rodrigues:2011:GSB} works
at the virtualization layer of end-hosts to ensure bandwidth fairness. It
achieves scalability by using a point-to-point protocol and maintains minimum
data center-wide control state. Maximum and minimum rate is set for both ingress
and egress traffic at the vNIC level and scheduled using a weighted fair
scheduler. In addition, ingress and egress traffic is monitored and if limits
are exceeded, a feedback message with an explicit rate is sent to all VMs
involved in the traffic. Likewise, SecondNet~\cite{Guo:2010:SDC} uses a
\emph{virtual data center} (VDC) abstraction for resource allocation. It allows
the assignment of a set of VMs with computation, storage, and bandwidth
guarantees. Architecturally, a central manager is in charge of all resources
while virtual-to-physical mappings and network state is controlled by the source
hypervisors. While SecondNet can guarantee end-to-end bandwidth, it can only be
used if the bandwidth requirements are known a priori. Choosing an exact
bandwidth is non-trivial for most applications as bandwidth is a function of
their communication pattern. Oktupus~\cite{Ballani:2011:TPD} is a resource
allocator that uses ``virtual networks'' as an abstraction to expose to users.
This allows the cloud provider to decouple the virtual network and the
underlying physical network. As a result, there is a trade-off between
application goals for performance and provider goals for optimum utilization of
resources. For all-to-all communication, such as MapReduce traffic, a
\emph{virtual cluster} is employed which provides the illusion of all virtual
machines being connected to the same non-oversubscribed virtual switch. For
applications with a local traffic pattern, a \emph{virtual oversubscribed
cluster} is employed which emulates an oversubscribed two-tier cluster.
Specifically, Oktupus uses physical network topology, machine usage, etc., to
allocate physical machines to users. In addition, explicit bandwidth
rate-limiting at the end hosts is employed to control bandwidth.
In this work we do not explicitly tackle VM rate limiting, but Mission Control
as an architecture can be used to rate limit ingress and egress links at the
application layer by controlling the transport schedule.
\section{Energy Efficiency and Security}
In addition to performance and fairness, energy efficiency and security are of
great important in the data center as well.
ElasticTree is a system that dynamically alters the power consumption of
networking elements inside a data center based on the current
traffic~\cite{Heller:2010:ESE}. It consists of an optimizer, routing control,
and power control. Based on the topology, traffic matrix, and other inputs, the
optimizer computes an optimum power network subnet which is in turn fed to the
power control and the routing control. The former toggles the power state of
network elements, and the latter chooses paths for each flow. The optimizer
provides a number of strategies which make a trade-off between scalability and
optimality. ElasticTree can be used in conjunction with Mission Control to
choose a suitable underlying transport protocol. For example, the performance of
TCP is adversely affected if there is packet reordering due to splitting a
single flow across multiple links by ElasticTree.
TCPcrypt is a backwards compatible enhancement to TCP that aims to efficiently
and transparently provide encrypted communication to
applications~\cite{bittau:the}. To this end, it uses a custom key exchange
protocol that leverages the TCP options field. Like SSL, to reduce the cost of
connection setup for short-lived flows, it enables cryptographic state from one
TCP connection to bootstrap subsequent ones. Even for legacy applications, it
protects against passive eavesdropping. Additionally, TCPcrypt protects the
integrity of TCP sessions, and defends against attacks that change their state
or affect their performance.
Finally, applications can also be made aware of the presence of TCPcrypt to
negate redundant encryption. We use TCPcrypt as a secure transport protocol
option in Mission Control.
In summary, novel data center topologies in practice do no achieve
full-bisection bandwidth due to their reliance on TCP. In fact, in some cases
full-bisection bandwidth is overkill and the needs of most applications can be
satisfied by using a few additional links. Extensions to Ethernet can address
its scalability problems in the data center but the transport is still at the
mercy of TCP. In addition, global schedulers can take advantage of multiple
paths to improve performance but they either use TCP or ECMP which are by nature
suboptimal in the data center. Moreover, novel transport protocols only work in
specific environments for a limited number of traffic patterns. Furthermore,
aggressive use of \emph{combiners} can only reduce traffic to a certain extent.
Finally, rate limiters expect users to know the bandwidth requirements of their
applications beforehand. This is impractical as the prediction of network
traffic is non-trivial and application requirements dynamically change over the
course of their execution. Mission Control addresses the shortcomings of all of
these systems by enabling diversity in both transport semantics and scheduling.
\chapter{Design and Implementation}\label{chapter:designImplementation}
In this section, we present the design and implementation of Mission Control.
We first enumerate a set of goals for Mission Control:
\begin{enumerate}
\item Mission Control should work out of the box. Therefore, it should require
no changes to network infrastructure.
\item There is no ``one-size-fits-all'' transport for the data center.
Accordingly, Mission Control should support a range of diverse transports.
\item Data intensive computing frameworks, such as MapReduce require minimal
intervention from the user. In fact, these systems have found traction because
they can run unmodified user binaries. Mission Control should not modify the
programming model of these frameworks in any way.
\item The architecture of MapReduce-like systems consists of loose coupling
between the master and workers. As a result, these systems achieve
fault-tolerance, freedom from side-effects, and simple coordination. Mission
Control should also work on the same principles.
\item Data intensive computing frameworks these days scale to thousands of
worker nodes. In addition, their architecture enables them to seamlessly,
scale-up and scale-down. Therefore, Mission Control should not limit the
scalability of these systems.
\item Like mentioned earlier, bandwidth is a constrained resource in the data
center. Mission Control should only add minimal control traffic to the mix. To
this end, it should make use of multicast whenever possible.
\end{enumerate}
\section{Design}
In light of the goals presented in the previous section, we now describe the
design of Mission Control.
Architecturally, Mission Control consists of three key components: (1) Mission
Controller (MC), (2) Flight Controller (FC), and (3) Throttle Controller (TC).
These components are described in detail below.
\begin{figure}[h!]
\centering
\includegraphics[width=0.8\textwidth]{MissionControl.eps}
\caption{Mission Control architecture}
\label{chap:design:fig:missioncontrol}
\end{figure}
\subsection{Mission Controller}
CIEL is capable of running multiple jobs at the same time. Mission Controller
resides alongside the CIEL master and coordinates transfer scheduling across
these jobs. To this end, it uses \emph{weighted fair sharing} to assign each
transfer a weight. In practice, this allows the sharing of each link in the
network proportional to these weights. In addition, weighted fair sharing
subsumes several policies including FIFO and priority~\cite{Chowdhury:2011:MDT}.
The values of these weights are subject to organizational goals and can be
calibrated on the basis of user profiles or job characteristics. For example,
jobs launched by \emph{Alice} can be given twice the priority of those launched
by \emph{Bob}. In the same vein, low latency jobs can be given higher priority
than batch-processing jobs. Similarly, to enable FIFO, weights can be assigned
proportional to job arrival time. By default we use performance-centric network
allocation to maximize parallelism~\cite{Kumar:2012:ACF}. For shuffle transfers
allocation is made on the basis of the number of target machines while per-flow
allocation is used for broadcast transfers. Concretely,
\begin{equation}
share_{shuffle} = N
\end{equation}
\begin{equation}
share_{broadcast} = N^2
\end{equation}
Where $N$ is the number of machines that the job uses. To distinguish between
shuffle and broadcast transfers, Mission Controller walks the task graph to
observe the relationships between tasks and their inputs and outputs (described
in the implementation section). In addition, transport protocols are selected on
the basis of job characteristics and the environment. For example, multi-path