Internet Research Task Force (IRTF)                             J. Nobre
Request for Comments: 8316           University of Vale do Rio dos Sinos
Category: Informational                                     L. Granville
ISSN: 2070-1721                  Federal University of Rio Grande do Sul
                                                                A. Clemm
                                                                  Huawei
                                                      A. Gonzalez Prieto
                                                                  VMware
                                                           February 2018
Autonomic Networking Use Case for Distributed Detection of
Service Level Agreement (SLA) Violations
Abstract
This document describes an experimental use case that employs
autonomic networking for the monitoring of Service Level Agreements
(SLAs). The use case is for detecting violations of SLAs in a
distributed fashion. It strives to optimize and dynamically adapt
the autonomic deployment of active measurement probes in a way that
maximizes the likelihood of detecting service-level violations with a
given resource budget to perform active measurements. This
optimization and adaptation should be done without any outside
guidance or intervention.
This document is a product of the IRTF Network Management Research
Group (NMRG). It is published for informational purposes.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This document is a product of the Internet Research Task Force
(IRTF). The IRTF publishes the results of Internet-related research
and development activities. These results might not be suitable for
deployment. This RFC represents the consensus of the Network
Management Research Group of the Internet Research Task Force (IRTF).
Documents approved for publication by the IRSG are not candidates for
any level of Internet Standard; see Section 2 of RFC 7841.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc8316.
Nobre, et al. Informational [Page 1]
RFC 8316 AN Use Case Detection of SLA Violations February 2018
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Table of Contents
1. Introduction
2. Definitions and Acronyms
3. Current Approaches
4. Use Case Description
5. A Distributed Autonomic Solution
6. Intended User Experience
7. Implementation Considerations
7.1. Device-Based Self-Knowledge and Decisions
7.2. Interaction with Other Devices
8. Comparison with Current Solutions
9. Related IETF Work
10. IANA Considerations
11. Security Considerations
12. Informative References
Acknowledgements
Authors' Addresses
1. Introduction
The Internet has been growing dramatically in terms of size,
capacity, and accessibility in recent years. Communication
requirements of distributed services and applications running on top
of the Internet have become increasingly demanding. Some examples
are real-time interactive video or financial trading. Providing such
services involves stringent requirements in terms of acceptable
latency, loss, and jitter.
Performance requirements lead to the articulation of Service Level
Objectives (SLOs) that must be met. Those SLOs are part of Service
Level Agreements (SLAs) that define a contract between the provider
and the consumer of a service. SLOs, in effect, constitute a
service-level guarantee that the consumer of the service can expect
to receive (and often has to pay for). Likewise, the provider of a
service needs to ensure that the service-level guarantee and
associated SLOs are met. Some examples of clauses that relate to
SLOs can be found in [RFC7297].
Violations of SLOs can be associated with significant financial loss,
which can be divided into two categories. First, there is the loss
that can be incurred by the user of a service when the agreed service
levels are not provided. For example, a financial brokerage might
suffer losses on its stock orders when it is unable to execute stock
transactions in a timely manner. An electronic retailer may lose
customers when its online presence is perceived by customers as
sluggish. An online gaming provider may not be able to provide fair
access to online players, resulting in frustrated players who are
lost as customers. In each case, the failure of a service provider
to meet promised service-level guarantees can have a substantial
financial impact on users of the service. Second, there is the loss
that is incurred by the provider of a service who is unable to meet
promised SLOs. Those losses can take several forms, such as
penalties for violating the service level agreement and even loss of
future revenue due to reduced customer satisfaction (which, in many
cases, is more serious). Hence, SLOs are a key concern for the
service provider. In order to ensure that SLOs are not being
violated, service levels need to be continuously monitored at the
network infrastructure layer in order to know, for example, when
mitigating actions need to be taken. To that end, service-level
measurements must take place.
Network measurements can be performed using active or passive
measurement techniques. In passive measurements, production traffic
is observed, and no monitoring traffic is created by the measurement
process itself. That is, network conditions are checked in a
non-intrusive way. In the context of IP Flow Information Export
(IPFIX), several documents were produced that define how to export
data associated with flow records, i.e., data that is collected as
part of passive measurement mechanisms, generally applied against
flows of production traffic (e.g., [RFC7011]). In addition, it is
possible to collect real data traffic (not just summarized flow
records) with time-stamped packets, possibly sampled (e.g., per
[RFC5474]), as a means of measuring and inferring service levels.
Active measurements, on the other hand, are more intrusive to the
network in the sense that they involve injecting synthetic test
traffic into the network to measure network service levels, as
opposed to simply observing production traffic. The IP Performance
Metrics (IPPM) Working Group produced documents that describe active
measurement mechanisms such as the One-Way Active Measurement
Protocol (OWAMP) [RFC4656], the Two-Way Active Measurement Protocol
(TWAMP) [RFC5357], and the Cisco Service-Level Assurance Protocol
[RFC6812]. In addition, there are some mechanisms that do not
cleanly fit into either active or passive categories, such as
Performance and Diagnostic Metrics (PDM) Destination Option
techniques [RFC8250].
Active measurement mechanisms offer a high level of control over what
and how to measure. They do not require inspecting production
traffic. Because of this, active measurements usually offer better
accuracy and privacy than passive measurement mechanisms. Traffic
encryption and regulations that limit the amount of payload
inspection that can occur are non-issues. Furthermore, active
measurement mechanisms are able to detect end-to-end network
performance problems in a fine-grained way (e.g., simulating the
traffic that must be handled considering specific SLOs). As a
result, active measurements are often preferred over passive
measurement for SLA monitoring. Measurement probes must be hosted in
network devices and measurement sessions must be activated to compute
the current network metrics (for example, metrics such as the ones
described in [RFC4148], although note that [RFC4148] was obsoleted by
[RFC6248]). This activation should be dynamic in order to follow
changes in network conditions, such as those related to routes being
added or new customer demands.
While offering many advantages, active measurements are expensive in
terms of network resource consumption. Active measurements generally
involve measurement probes that generate synthetic test traffic that
is directed at a responder. The responder needs to timestamp test
traffic it receives and reflect it back to the originating
measurement probe. The measurement probe subsequently processes the
returned packets along with time-stamping information in order to
compute service levels. Accordingly, active measurements consume
substantial CPU cycles as well as memory of network devices to
generate and process test traffic. In addition, synthetic traffic
increases network load. Thus, active measurements compete for
resources with other functions, including routing and switching.
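The timestamp processing described above can be sketched as follows.
This is a minimal illustration only, assuming that four timestamps are
available for each reflected test packet (probe send, responder
receive, responder send, and probe receive). The names ProbeSample,
round_trip_delay, and summarize are hypothetical and are not taken
from any of the measurement protocols cited in this document.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ProbeSample:
    """Timestamps (in seconds) for one reflected test packet (illustrative)."""
    t1_probe_tx: float      # probe sends the test packet
    t2_responder_rx: float  # responder receives it
    t3_responder_tx: float  # responder reflects it
    t4_probe_rx: float      # probe receives the reflection


def round_trip_delay(s: ProbeSample) -> float:
    """Round-trip delay with the responder's processing time removed."""
    return (s.t4_probe_rx - s.t1_probe_tx) - (s.t3_responder_tx - s.t2_responder_rx)


def summarize(samples, sent):
    """Return (mean round-trip delay or None, loss ratio) for one session."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    delays = [round_trip_delay(s) for s in samples]
    loss_ratio = 1.0 - len(samples) / sent
    return (mean(delays) if delays else None), loss_ratio
```

Because the responder's processing time is computed as a difference of
two responder timestamps, this round-trip calculation does not require
the probe and responder clocks to be synchronized.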
The resources required and traffic generated by the active
measurement sessions are, in a large part, a function of the number
of measured network destinations. (In addition, the amount of
traffic generated for each measurement plays a role that, in turn,
influences the accuracy of the measurement.) When more destinations
are measured, a greater number of resources are consumed and more
traffic is needed to perform the measurements. Thus, to have better
monitoring coverage, it is necessary to deploy more sessions, which
consequently increases consumed resources. Otherwise, enabling the
observation of just a small subset of all network flows can lead to
insufficient coverage.
Furthermore, while some end-to-end service levels can be determined
by adding up the service levels observed across different path
segments, the same is not true for all service levels. For example,
the end-to-end delay or packet loss from a node A to a node C routed
via a node B can often be computed simply by adding delays (or loss)
from A to B and from B to C. This allows the decomposition of a
large set of end-to-end measurements into a much smaller set of
segment measurements. However, end-to-end jitter and mean opinion
scores cannot be decomposed as easily and, for higher accuracy, must
be measured end-to-end.
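As a worked illustration of the decomposition above, the following
sketch composes segment measurements into an end-to-end estimate.
Delay is summed directly. For loss, the sketch assumes (this
assumption is not stated in this document) that loss on each segment
is independent, in which case the end-to-end loss ratio is one minus
the product of the per-segment delivery ratios; for small loss ratios,
this is close to the simple sum. The function names are hypothetical.

```python
from math import prod


def end_to_end_delay(segment_delays):
    """One-way delay over a path, taken as the sum of its segment delays."""
    return sum(segment_delays)


def end_to_end_loss(segment_losses):
    """Loss ratio over a path, ASSUMING independent loss on each segment."""
    return 1.0 - prod(1.0 - p for p in segment_losses)


# A -> B -> C with 12 ms and 20 ms delay, and 1% and 2% loss.
delay_ac = end_to_end_delay([0.012, 0.020])
loss_ac = end_to_end_loss([0.01, 0.02])
```

With these inputs, the composed loss ratio is 0.0298, slightly below
the simple sum of 0.03.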
Hence, the decision about how to place measurement probes becomes an
important management activity. The goal is to obtain the maximum
benefits of service-level monitoring with a limited amount of
measurement overhead. Specifically, the goal is to maximize the
number of service-level violations that are detected with a limited
number of resources.
The use case and the solution approach described in this document
address an important practical issue. They are intended to provide a
basis for further experimentation to lead to solutions for wider
deployment. This document represents the consensus of the IRTF's
Network Management Research Group (NMRG). It was discussed
extensively and received three separate in-depth reviews.
2. Definitions and Acronyms
Active Measurements: Techniques to measure service levels that
involve generating and observing synthetic test traffic
Passive Measurements: Techniques used to measure service levels based
on observation of production traffic
Autonomic Network: A network containing exclusively autonomic nodes,
requiring no configuration, and deriving all required information
through self-knowledge, discovery, or intent.
Autonomic Service Agent (ASA): An agent implemented on an autonomic
node that implements an autonomic function, either in part (in the
case of a distributed function, as in the context of this
document) or whole
Measurement Session: A communications association between a probe and
a responder used to send and reflect synthetic test traffic for
active measurements
Probe: The source of synthetic test traffic in an active measurement
Responder: The destination for synthetic test traffic in an active
measurement
SLA: Service Level Agreement
SLO: Service Level Objective
P2P: Peer-to-Peer
(Note: The definitions for "Autonomic Network" and "Autonomic Service
Agent" are borrowed from [RFC7575].)
3. Current Approaches
In current deployments of active measurement solutions, the best
practice for distributing the available measurement sessions across
the network relies entirely on the human administrator's expertise to
infer the best locations at which to activate such sessions. This is
done through several steps. First, it is
necessary to collect traffic information in order to grasp the
traffic matrix. Then, the administrator uses this information to
infer the best destinations for measurement sessions. After that,
the administrator activates sessions on the chosen subset of
destinations, taking the available resources into account. This
practice, however, does not scale well because it is still labor
intensive and error-prone for the administrator to determine which
sessions should be activated given the set of critical flows that
needs to be measured. Even worse, this practice completely fails in
networks where the most critical flows change rapidly, resulting in
dynamic changes to what would be the most important destinations.
For example, this can be the case in modern cloud environments. This
is because fast reactions are necessary to reconfigure the sessions,
and administrators are just not quick enough in computing and
activating the new set of required sessions every time the network
traffic pattern changes. Finally, the current practice for active
measurements usually covers only a fraction of the network flows that
should be observed, which invariably leads to the damaging
consequence of undetected SLA violations.
4. Use Case Description
The use case involves a service-level provider that needs to monitor
the network to detect service-level violations using active service-
level measurements and wants to be able to do so with minimal human
intervention. The goal is to conduct the measurements in an
effective manner to maximize the percentage of detected service-level
violations. The service-level provider has a bounded resource budget
with regard to measurements that can be performed, specifically the
number of measurements that can be conducted concurrently from any
one network device and possibly the total amount of measurement
traffic on the network. However, while at any one point in time the
number of measurements conducted is limited, it is possible for a
device to change which destinations to measure over time. This can
be exploited to achieve a balance of eventually covering all possible
destinations using a reasonable amount of "sampling" where
measurement coverage of a destination cannot be continuous. The
solution needs to be dynamic and able to cope with network conditions
that may change over time. The solution should also be embeddable
inside network devices that control the deployment of active
measurement mechanisms.
The goal is to conduct the measurements in a smart manner that
ensures that the network is broadly covered and that the likelihood
of detecting service-level violations is maximized. In order to
maximize that likelihood, it is reasonable to focus measurement
resources on destinations that are more likely to incur a violation,
while spending fewer resources on destinations that are more likely
to be in compliance. In order to do this, there are various aspects
that can be exploited, including past measurements (destinations
close to a service-level threshold requiring more focus than
destinations farther from it), complementation with passive
measurements such as flow data (to identify network destinations that
are currently popular and critical), and observations from other
parts of the network. In addition, measurements can be coordinated
among different network devices to avoid hitting the same destination
at the same time and to share results that may be useful in future
probe placement.
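One way to focus measurement resources on destinations close to a
service-level threshold can be sketched as follows. This is only an
illustration under stated assumptions: the metric is of the "lower is
better" kind (such as delay or loss), the most recent measurement per
destination is available, and risk is scored as the ratio of the
measured value to its threshold. The scoring rule and the names
violation_risk and select_targets are hypothetical; this document does
not prescribe a particular algorithm.

```python
def violation_risk(measured, threshold):
    """Ratio of the last measured value to its SLO threshold.

    Values near 1.0 are close to the threshold; values >= 1.0 are in
    violation. Assumes a 'lower is better' metric such as delay.
    """
    return measured / threshold


def select_targets(last_results, thresholds, budget):
    """Pick up to `budget` destinations, highest violation risk first."""
    ranked = sorted(
        last_results,
        key=lambda dest: violation_risk(last_results[dest], thresholds[dest]),
        reverse=True,
    )
    return ranked[:budget]


# Delay in ms against a 100 ms objective, with a budget of two sessions.
targets = select_targets(
    {"A": 80.0, "B": 40.0, "C": 95.0},
    {"A": 100.0, "B": 100.0, "C": 100.0},
    budget=2,
)
# targets == ["C", "A"]
```

A purely risk-driven selection such as this one would never revisit
destination B; a practical strategy would combine it with some form of
sampling so that broad coverage is maintained.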
Clearly, static solutions will have severe limitations. At the same
time, human administrators cannot be in the loop for continuous
dynamic reconfigurations of measurement probes. Thus, an automated
solution, or ideally an autonomic solution, is needed so that network
measurements are automatically orchestrated and dynamically
reconfigured from within the network. This can be accomplished using
an autonomic solution that is distributed, using ASAs that are
implemented on nodes in the network.
5. A Distributed Autonomic Solution
The use of Autonomic Networking (AN) [RFC7575] can help such
detection through an efficient activation of measurement sessions.
Such an approach, along with a detailed assessment confirming its
viability, is described in [P2PBNM-Nobre-2012]. The problem to be
solved by AN in the present use case is how to steer the process of
measurement session activation by a complete solution that sets all
necessary parameters for this activation to operate efficiently,
reliably, and securely, with no required human intervention other
than setting overall policy.
When a node first comes online, it has no information about which
measurements are more critical than others. In the absence of
information about past measurements and information from measurement
peers, it may start with an initial set of measurement sessions,
possibly randomly seeding a set of starter measurements and perhaps
taking a round-robin approach for subsequent measurement rounds.
However, as measurements are collected, a node will gain an
increasing amount of information that it can utilize to refine its
strategy of selecting measurement targets going forward. For one, it
may take note of which targets returned measurement results very
close to service-level thresholds; these targets may require closer
scrutiny compared to others. Second, it may utilize observations
that are made by its measurement peers in order to conclude which
measurement targets may be more critical than others and to ensure
that proper overall measurement coverage is obtained (so that not
every node incidentally measures the same targets, while other
targets are not measured at all).
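The bootstrap behavior described above (a random starter order
followed by round-robin rounds) can be sketched as follows. This is a
minimal illustration, assuming a fixed per-round session budget; the
function name starter_rounds and its parameters are hypothetical.

```python
import random
from itertools import cycle, islice


def starter_rounds(destinations, budget, rounds, seed=None):
    """Yield `rounds` target lists: shuffle once, then rotate round-robin.

    `budget` is the number of sessions a node can run per round.
    """
    order = list(destinations)
    random.Random(seed).shuffle(order)  # random seeding of the starter order
    rotation = cycle(order)             # endless round-robin over that order
    for _ in range(rounds):
        yield list(islice(rotation, budget))


# Five destinations, two sessions per round: every destination has been
# measured at least once after three rounds.
schedule = list(starter_rounds(["a", "b", "c", "d", "e"], budget=2, rounds=3, seed=1))
```

Once measurement results and peer observations become available, a
node would be expected to replace this schedule with a selection that
takes those results into account.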
We advocate for embedding P2P technology in network devices in order
to use autonomic control loops to make decisions about measurement
sessions.
Specifically, we advocate for network devices to implement an
autonomic function that monitors service levels for violations of
SLOs and that determines which measurement sessions to set up at any
given point in time based on current and past observations of the
node and of other peer nodes.
When these functions are performed locally and autonomically on the
device itself, the choice of measurements to conduct can be modified quickly based
on local observations while taking local resource availability into
account. This allows a solution to be more robust and react more
dynamically to rapidly changing service levels than a solution that
has to rely on central coordination. However, in order to optimize
decisions about which measurements to conduct, a node will need to
communicate with other nodes. This allows a node to take into
account other nodes' observations in addition to its own in its
decisions.
For example, remote destinations whose observed service levels are on
the verge of violating stated objectives may require closer
monitoring than remote destinations that are comfortably within a
range of tolerance. A distributed autonomic solution also allows
nodes to coordinate their probing decisions to collectively achieve
the best possible measurement coverage. Because the number of
resources available for monitoring, exchanging measurement data, and
coordinating with other nodes is limited, a node may be interested in
identifying other nodes whose observations are similar to and
correlated with its own. This helps a node prioritize and decide
which other nodes to coordinate and exchange data with. All of this
requires the use of a P2P overlay.
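As an illustration of how a node might identify peers whose
observations are correlated with its own, the sketch below ranks peers
by the Pearson correlation between measurement series. The use of
Pearson correlation is an assumption made for this example and is not
necessarily the method used in [P2PBNM-Nobre-2012]; the sketch also
assumes that all series cover the same destinations in the same order.
The names pearson and rank_peers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_peers(own, peer_observations, max_peers):
    """Return up to `max_peers` peer names, best-correlated with `own` first."""
    scores = {name: pearson(own, obs) for name, obs in peer_observations.items()}
    return sorted(scores, key=scores.get, reverse=True)[:max_peers]


own = [10.0, 20.0, 30.0, 40.0]
peers = {
    "p1": [11.0, 19.0, 31.0, 42.0],  # tracks the local observations closely
    "p2": [40.0, 30.0, 20.0, 10.0],  # moves in the opposite direction
    "p3": [5.0, 5.0, 5.0, 5.0],      # constant, so it carries no signal
}
best = rank_peers(own, peers, max_peers=2)
# best == ["p1", "p3"]
```

Limiting the result to max_peers reflects the limited resources
available for exchanging measurement data with other nodes.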
A P2P overlay is essential for several reasons:
o It makes it possible for nodes (or more specifically, the ASAs
that are deployed on those nodes) in the network to autonomically
set up measurement sessions without having to rely on a central
management system or controller to perform configuration
operations associated with configuring measurement probes and
responders.
o It facilitates the exchange of data between different nodes to
share measurement results so that each node can refine its
measurement strategy based not just on its own observations, but
also on observations from its peers.
o It allows nodes to coordinate their measurements to obtain the
best possible test coverage and avoid measurements that have a
very low likelihood of detecting service-level violations.
The provisioning of the P2P overlay should be transparent for the
network administrator. An Autonomic Control Plane such as defined in
[ACP] provides an ideal candidate for the P2P overlay to run on.
An autonomic solution for the distributed detection of SLA violations
provides several benefits. First, it provides efficiency; this
solution should optimize the resource consumption and avoid resource
starvation on the network devices. A device that is "self-aware" of
its available resources will be able to adjust measurement activities
rapidly as needed, without requiring a separate control loop
involving resource monitoring by an external system. Second, placing
logic about where to conduct measurements into the node enables rapid
control loops that allow devices to react instantly to observations
and adjust their measurement strategy. For example, a device could
decide to adjust the amount of synthetic test traffic being sent
during the measurement itself depending on results observed so far on
this and other concurrent measurement sessions. As a result, the
solution could decrease the time necessary to detect SLA violations.
Adaptivity features of an autonomic loop could capture the network
dynamics faster than a human administrator or even a central
controller. Finally, the solution could help to reduce the workload
of human administrators.
In practice, these factors combine to maximize the likelihood of SLA
violations being detected while operating within a given resource
budget. They allow a continuous measurement strategy to be conducted
that takes into account past measurement results, observations of
other measures such as link utilization or flow data, measurement
results shared between network devices, and future measurement
activities coordinated among nodes. Combined, this can result in
efficient measurement decisions that strike a balance between
offering broad network coverage and homing in on service-level "hot
spots".
6. Intended User Experience
The autonomic solution should not require any human intervention in
the distributed detection of SLA violations. By virtue of the
solution being autonomic, human users will not have to plan which
measurements to conduct in a network, which is often a very labor-
intensive task that requires detailed analysis of traffic matrices
and network topologies and is not prone to easy dynamic adjustment.
Likewise, they will not have to configure measurement probes and
responders.
There are some ways in which a human administrator may still interact
with the solution. First, the human administrator will, of course,
be notified and obtain reports about service-level violations that
are observed. Second, a human administrator may set policies
regarding how closely to monitor the network for service-level
violations and how many resources to spend. For example, an
administrator may set a resource budget that is assigned to network
devices for measurement operations. With that given budget, the
number of SLO violations that are detected will be maximized.
Alternatively, an administrator may set a target for the percentage
of SLO violations that must be detected, i.e., a target for the ratio
between the number of detected SLO violations and the number of total
SLO violations that are actually occurring (some of which might go
undetected). In that case, the solution will aim to minimize the
resources spent (i.e., the amount of test traffic and number of
measurement sessions) that are required to achieve that target.
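The detection ratio used as a policy target above can be written out
as follows. This is a minimal sketch with hypothetical function names.
Note that the total number of occurring violations is not directly
observable in a live network (some violations go undetected), so the
sketch assumes that the total is known from an experiment with ground
truth or is otherwise estimated.

```python
def detection_ratio(detected, total_occurring):
    """Fraction of occurring SLO violations that were detected.

    Returns 1.0 when no violations occurred, since none were missed.
    """
    if total_occurring == 0:
        return 1.0
    return detected / total_occurring


def meets_target(detected, total_occurring, target_ratio):
    """True if the achieved detection ratio reaches the administrator's target."""
    return detection_ratio(detected, total_occurring) >= target_ratio


# 45 of 50 occurring violations detected, against a 95% target.
achieved = detection_ratio(45, 50)      # 0.9
ok = meets_target(45, 50, 0.95)         # False
```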
7. Implementation Considerations
The active measurement model assumes that a typical infrastructure
will have multiple network segments, multiple Autonomous Systems
(ASes), and a reasonably large number of routers. It also considers
that multiple SLOs can be in place at a given time. Since
interoperability in a heterogeneous network is a goal, features found
on different active measurement mechanisms (e.g., OWAMP, TWAMP, and
Cisco Service Level Assurance Protocol) and device programmability
interfaces (such as Juniper's Junos API or Cisco's Embedded Event
Manager) could be used for the implementation. The autonomic
solution should include and/or reference specific algorithms,
protocols, metrics, and technologies for the implementation of
distributed detection of SLA violations as a whole.
Finally, it should be noted that there are multiple deployment
scenarios, including deployment scenarios that involve physical
devices hosting autonomic functions or virtualized infrastructure
hosting the same. Co-deployment in conjunction with Virtual Network
Functions (VNFs) is a possibility for further study.
7.1. Device-Based Self-Knowledge and Decisions
Each device has self-knowledge about the local SLA monitoring. This
could be in the form of historical measurement data and SLOs.
Besides that, the devices would have algorithms that could decide
which probes should be activated at a given time. The choice of
which algorithm is better suited to a specific situation would also
be autonomic.
7.2. Interaction with Other Devices
Network devices should share information about service-level
measurement results. This information can speed up the detection of
SLA violations and increase the number of detected SLA violations.
For example, if one device detects that a remote destination is in
danger of violating an SLO, other devices may conduct additional
measurements to the same destination or other destinations in its
proximity. For any given network device, the exchange of data may be
more important with some devices (for example, devices in the same
network neighborhood or devices that are "correlated" by some other
means) than with others. Defining the network devices that exchange
measurement data (i.e., management peers) creates a new topology.
Different approaches could be used to define this topology (e.g.,
correlated peers [P2PBNM-Nobre-2012]). To bootstrap peer selection,
each device should use its known neighbors (e.g., FIB and RIB tables)
as initial seeds to identify possible peers. It should be noted that
a solution will benefit if topology information and network discovery
functions are provided by the underlying autonomic framework. A
solution will need to be able to discover measurement peers as well
as measurement targets, specifically measurement targets that support
active measurement responders and that will be able to respond to
measurement requests and reflect measurement traffic as needed.
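The peer bootstrap and the reaction to shared observations described
in this section can be sketched as follows. This is an illustration
under stated assumptions: the list of known neighbors has already been
extracted from local forwarding or routing state, a caller-supplied
predicate indicates whether a neighbor can exchange measurement data,
and priorities are plain numeric scores. All names and the fixed
priority boost are hypothetical.

```python
def bootstrap_peers(neighbors, supports_exchange, max_peers):
    """Seed the peer set from known neighbors able to exchange measurement data."""
    return [n for n in neighbors if supports_exchange(n)][:max_peers]


def merge_peer_warnings(local_priority, peer_warnings, boost=1.0):
    """Raise the priority of destinations that peers report as close to an SLO.

    Destinations not yet tracked locally are added with the boost as
    their initial priority. The input mapping is left unmodified.
    """
    merged = dict(local_priority)
    for dest in peer_warnings:
        merged[dest] = merged.get(dest, 0.0) + boost
    return merged


peers = bootstrap_peers(["r1", "r2", "r3"], lambda n: n != "r2", max_peers=2)
priorities = merge_peer_warnings({"A": 0.2, "B": 0.9}, ["A", "C"])
# peers == ["r1", "r3"]
# priorities == {"A": 1.2, "B": 0.9, "C": 1.0}
```

A fuller solution would later replace the seeded neighbors with peers
chosen by some correlation criterion and would also verify that each
measurement target hosts an active measurement responder.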
8. Comparison with Current Solutions
There is no standardized solution for distributed autonomic detection
of SLA violations. Current solutions are restricted to ad hoc
scripts running on a per-node basis to automate some administrator
actions. There are some proposals for passive probe activation
(e.g., DECON [DECON] and CSAMP [CSAMP]), but these do not focus on
autonomic features.
9. Related IETF Work
This section discusses related IETF work and is provided for
reference. This section is not exhaustive; rather, it provides an
overview of the various initiatives and how they relate to autonomic
distributed detection of SLA violations.
1. LMAP: The Large-Scale Measurement of Broadband Performance
Working Group standardizes the LMAP measurement system for
performance management of broadband access devices. The
autonomic solution could be relevant to LMAP because it deploys
measurement probes and could be used for screening for SLA
violations. Besides that, a solution to decrease the workload of
human administrators in service providers is probably highly
desirable.
2. IPFIX: The IP Flow Information Export Working Group (now
concluded) aimed to standardize the export of IP flow information
(i.e., netflows). IPFIX uses measurement probes (i.e., metering
exporters) to gather flow
data. In this context, the autonomic solution for the activation
of active measurement probes could possibly be extended to also
address passive measurement probes. Besides that, flow
information could be used in making decisions regarding probe
activation.
3. ALTO: The Application-Layer Traffic Optimization Working Group
aims to provide topological information at a higher abstraction
layer, which can be based upon network policy, and with
application-relevant service functions located in it. Their work
could be leveraged to define the topology for network devices
that exchange measurement data.
10. IANA Considerations
This document has no IANA actions.
11. Security Considerations
The security of this solution hinges on the security of the network
underlay, i.e., the Autonomic Control Plane. If the Autonomic
Control Plane were to be compromised, an attacker could undermine the
effectiveness of measurement coordination by reporting fraudulent
measurement results to peers. This would cause measurement probes to
be deployed in an ineffective manner that would increase the
likelihood that violations of SLOs go undetected.
Likewise, the security of the solution hinges on the security of the
deployment mechanism for autonomic functions (in this case, the
autonomic function that conducts the service-level measurements). If
an attacker were able to hijack an autonomic function, it could try
to exhaust or exceed the resources that should be spent on autonomic
measurements in order to deplete network resources, including network
bandwidth due to higher-than-necessary volumes of synthetic test
traffic generated by measurement probes. Again, it could also lead
to reporting of misleading results; among other things, this could
result in non-optimal selection of measurement targets and, in turn,
an increase in the likelihood that service-level violations go
undetected.
12. Informative References
[ACP] Eckert, T., Ed., Behringer, M., Ed., and S. Bjarnason, "An
Autonomic Control Plane (ACP)", Work in Progress,
draft-ietf-anima-autonomic-control-plane-13, December
2017.
[CSAMP] Sekar, V., Reiter, M., Willinger, W., Zhang, H., Kompella,
R., and D. Andersen, "CSAMP: A System for Network-Wide
Flow Monitoring", NSDI USENIX Symposium Networked Systems
Design and Implementation, April 2008.
[DECON] di Pietro, A., Huici, F., Costantini, D., and S.
Niccolini, "DECON: Decentralized Coordination for Large-
Scale Flow Monitoring", IEEE INFOCOM Workshops,
DOI 10.1109/INFCOMW.2010.5466642, March 2010.
[P2PBNM-Nobre-2012]
Nobre, J., Granville, L., Clemm, A., and A. Gonzalez
Prieto, "Decentralized Detection of SLA Violations Using
P2P Technology", 8th International Conference on Network
and Service Management (CNSM), 2012,
<http://ieeexplore.ieee.org/xpls/
abs_all.jsp?arnumber=6379997>.
[RFC4148] Stephan, E., "IP Performance Metrics (IPPM) Metrics
Registry", BCP 108, RFC 4148, DOI 10.17487/RFC4148, August
2005, <https://www.rfc-editor.org/info/rfc4148>.
[RFC4656] Shalunov, S., Teitelbaum, B., Karp, A., Boote, J., and M.
Zekauskas, "A One-way Active Measurement Protocol
(OWAMP)", RFC 4656, DOI 10.17487/RFC4656, September 2006,
<https://www.rfc-editor.org/info/rfc4656>.
[RFC5357] Hedayat, K., Krzanowski, R., Morton, A., Yum, K., and J.
Babiarz, "A Two-Way Active Measurement Protocol (TWAMP)",
RFC 5357, DOI 10.17487/RFC5357, October 2008,
<https://www.rfc-editor.org/info/rfc5357>.
[RFC5474] Duffield, N., Ed., Chiou, D., Claise, B., Greenberg, A.,
Grossglauser, M., and J. Rexford, "A Framework for Packet
Selection and Reporting", RFC 5474, DOI 10.17487/RFC5474,
March 2009, <https://www.rfc-editor.org/info/rfc5474>.
[RFC6248] Morton, A., "RFC 4148 and the IP Performance Metrics
(IPPM) Registry of Metrics Are Obsolete", RFC 6248,
DOI 10.17487/RFC6248, April 2011,
<https://www.rfc-editor.org/info/rfc6248>.
[RFC6812] Chiba, M., Clemm, A., Medley, S., Salowey, J., Thombare,
S., and E. Yedavalli, "Cisco Service-Level Assurance
Protocol", RFC 6812, DOI 10.17487/RFC6812, January 2013,
<https://www.rfc-editor.org/info/rfc6812>.
[RFC7011] Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
"Specification of the IP Flow Information Export (IPFIX)
Protocol for the Exchange of Flow Information", STD 77,
RFC 7011, DOI 10.17487/RFC7011, September 2013,
<https://www.rfc-editor.org/info/rfc7011>.
[RFC7297] Boucadair, M., Jacquenet, C., and N. Wang, "IP
Connectivity Provisioning Profile (CPP)", RFC 7297,
DOI 10.17487/RFC7297, July 2014,
<https://www.rfc-editor.org/info/rfc7297>.
[RFC7575] Behringer, M., Pritikin, M., Bjarnason, S., Clemm, A.,
Carpenter, B., Jiang, S., and L. Ciavaglia, "Autonomic
Networking: Definitions and Design Goals", RFC 7575,
DOI 10.17487/RFC7575, June 2015,
<https://www.rfc-editor.org/info/rfc7575>.
[RFC8250] Elkins, N., Hamilton, R., and M. Ackermann, "IPv6
Performance and Diagnostic Metrics (PDM) Destination
Option", RFC 8250, DOI 10.17487/RFC8250, September 2017,
<https://www.rfc-editor.org/info/rfc8250>.
Acknowledgements
We wish to acknowledge the helpful contributions, comments, and
suggestions that were received from Mohamed Boucadair, Brian
Carpenter, Hanlin Fang, Bruno Klauser, Diego Lopez, Vincent Roca, and
Eric Voit. In addition, we thank Diego Lopez, Vincent Roca, and
Brian Carpenter for their detailed reviews.
Authors' Addresses
Jeferson Campos Nobre
University of Vale do Rio dos Sinos
Porto Alegre
Brazil
Email: [email protected]
Lisandro Zambenedetti Granville
Federal University of Rio Grande do Sul
Porto Alegre
Brazil
Email: [email protected]
Alexander Clemm
Huawei USA - Futurewei Technologies Inc.
Santa Clara, California
United States of America
Email: [email protected], [email protected]
Alberto Gonzalez Prieto
VMware
Palo Alto, California
United States of America
Email: [email protected]