Internet Engineering Task Force R. Szarecki, Ed.
Internet-Draft K. Vairavakkalai
Intended status: Informational N. Venkataraman
Expires: March 7, 2020 Juniper Networks Inc.
M. Venkatesan
Comcast
September 4, 2019
Use of Abstract NH in Scale-Out peering architecture
draft-szarecki-grow-abstract-nh-scaleout-peering-01
Abstract
Many large-scale service provider networks use some form of scale-out
architecture at peering sites. In such an architecture, each
participating Autonomous System (AS) deploys multiple independent
Autonomous System Border Routers (ASBRs) for peering, and Equal Cost
Multi-Path (ECMP) load balancing is used between them. There are
numerous benefits to this architecture, including but not limited to
N+1 redundancy and the ability to flexibly increase capacity as
needed. A cost of this architecture is an increase in the amount of
state in both the control and data planes. This has negative
consequences for network convergence time and scale.
In this document we describe how to mitigate these negative
consequences through configuration of the routing protocols, both BGP
and IGP, to utilize what we term the "Abstract Next-Hop" (ANH). Use
of ANH allows us to both reduce the number of BGP paths in the
control plane and enable rapid path invalidation (hence, network
convergence and traffic restoration). We require no new protocol
features to achieve these benefits.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
Szarecki, et al. Expires March 7, 2020 [Page 1]
Internet-Draft Abstract NH in scale-out peering September 2019
This Internet-Draft will expire on March 7, 2020.
Copyright Notice
Copyright (c) 2019 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
1. Introduction
1.1. Scale-Out peering
1.1.1. Low latency
1.1.2. All equal cost paths utilization
1.1.3. Summary
1.2. Common BGP Deployment Configurations
1.2.1. IBGP with Next-Hop Unchanged
1.2.1.1. Example
1.2.2. IBGP with Next-Hop-Self
2. The BGP Abstract Next-Hop
3. Use of Abstract Next-Hop in scale-out peering design
3.1. Egress ASBR-Peer AS Abstract Next Hop (AP-ANH)
3.2. The Site-Peer AS Abstract Next Hop (SP-ANH)
3.3. Assignment of Abstract Next Hops
3.3.1. Native IP Networks
3.3.2. MPLS
3.3.2.1. Identical BGP address space and paths received on
all ASBRs
3.3.2.2. Different address space sets or paths received on
different ASBRs
3.3.3. SPRING
3.3.3.1. Identical BGP address space and path received on
all ASBRs
3.3.3.2. Different address space sets or paths received on
different ASBRs
3.4. Localization of AP-ANH
4. Worked Examples
4.1. Failure of a proper subset of EBGP sessions with a given
peer AS on a single ASBR
4.2. Failure of a proper subset of EBGP sessions with a given
peer AS on each ASBR of a given site
4.3. Failure of all EBGP sessions with a given peer AS on
single ASBR; Failure of a single ASBR
4.4. All EBGP sessions with a given peer AS on all ASBRs
5. Acknowledgements
6. IANA Considerations
7. Security Considerations
8. Informative References
Authors' Addresses
1. Introduction
Common to all large Internet networks are the requirements for large
aggregate bandwidth and low latency. As network sizes and traffic
volumes have increased, it has become common to use scale-out
architectures to satisfy these requirements. Use of these techniques
within individual networks is well-known. Here, we explore a scale-
out architecture for interconnecting different Autonomous Systems
(ASes).
Below, we show an example topology. Content is hosted within AS 2,
and consumers connect via the various ISP Metro ASes.
+---------------+ +----------------+ +---------------+
| | | +-------+ |
| +------+ +-------+ AS 30 |
| +------+ | | ISP Metro |
| +------+ | /----+ |
| | | | //----+ |
| AS 2 | | AS 1 |// +---------------+
| Content | | ISP BackBone X/
| provider +------+ X\
| +------+ |\\ +---------------+
| | | | \\----+ |
| | | | \----+ AS 31 |
| +------+ | | ISP Metro |
| +------+ +-------+ |
| +------+ +-------+ |
+---------------+ +----------------+ +---------------+
Figure 1
ASes 1 and 2 are connected at multiple, geographically diverse,
sites. Geographic diversity is required for reasons including
resiliency, minimization of latency, and minimization of cost
associated with long-distance data transmission.
1.1. Scale-Out peering
The same trends that have driven the use of scale-out architectures
within ASes drive interest in using them at peering sites. In such
an architecture, each AS at the peering site deploys multiple
independent Autonomous System Border Routers (ASBRs). Benefits that
can be realized include N+1 redundancy and the ability to flexibly
increase capacity as needed. The ASBRs are often connected to the
rest of their AS in a leaf-spine topology through core routers, and
augmented with a per-site pair of BGP route reflectors (RRs). See
for example SITE1 in Figure 2, below.
The fundamental requirements in this architecture are:
a. Keep traffic on a path that has low latency.
b. Utilize all peering links that offer low latency.
c. In the event of failure, minimize the time needed to restore
service.
1.1.1. Low latency
BGP, the Border Gateway Protocol, does not directly carry delay
information. We make the general assumption in this document that
paths selected by the BGP best path algorithm [RFC4271] will provide
lower latency than those not selected. This assumption is not
guaranteed to be true, but lacking special arrangements between
peering ASes, it is what the protocol is able to provide.
1.1.2. All equal cost paths utilization
In order to use all links between peering ASes that provide the same
BGP path costs to the destination prefix, at a minimum BGP speakers
need to be enabled for multi-path operation. Additionally, all AS
ingress BGP speakers need to know at least all equal and best paths
to the destination via multiple ASBRs. If a full IBGP mesh is used,
this happens naturally. However, IBGP full meshes are uncommon in
large networks and are even more impractical in scale-out
architectures due to the high total number of ASBRs.
The well-known techniques to deal with full-mesh scale challenges -
Route Reflection [RFC4456] and Confederations [RFC5065] - hide
redundant paths, as they advertise only a single selected path to
their clients. While this helps keep path and session scale
manageable, it makes BGP multipath unusable. We overcome this by
using BGP ADD-PATH [RFC7911] between the RR and its clients (or among
sub-ASes).
1.1.3. Summary
In summary, for a scale-out peering architecture:
o BGP multipath needs to be enabled on all IBGP sessions inside the
AS.
o BGP multipath needs to be enabled on all EBGP sessions of each
ASBR.
o BGP ADD-PATH needs to be enabled on all IBGP sessions.
* RRs need to be able to send multiple paths per prefix. The
upper limit depends on:
+ The maximum number of ASBRs per site (say N).
+ Possibly also on the maximum number of EBGP sessions held by
a single ASBR with single peer AS (say M), depending on BGP
next-hop attribute (BGP-NH) configuration.
* RR clients/ASBRs may need to be able to send multiple paths per
prefix if BGP-NH configuration is "next hop unchanged". The
upper limit depends on the maximum number of EBGP sessions held
by a single ASBR with single peer AS (say M).
For further consideration the following network diagram will be used
for reference:
+------------------------------------------------------------------+
| AS 1 +--------------------+|
| +----------------------------------+ |+------+ SITE3 o--o ||
| | SITE1 +-------- Cost 10 -+------+|CR_3.1|--+ o-|RR| ||
| | o------o | | |+------+ | |Ro--o ||
| | O-|RR_1.1| | | |+------+ | o--o ||
| | |Ro------O | +--- Cost 10 --+|CR_3.K| |+-------+||
| | O------O +------+ | | |+---+--+ ||BR_3.N"|||
| | |CR_1.1|-------+- Cost 10 -+ | | |+-------+||
| | +------+ | | | +----+-----+---------+|
| | / / \ +------+ | | Cost 15 Cost 15 |
| | / / \ |CR_1.K|--Cost+ | +----+-----+---------+|
| | / | \ +------+ 10 | | |+---+--+ | SITE2 ||
| | / | \ / | \ | | +--+|CR_2.K| | o--o ||
| | / | \--X-\ / \ | | |+------+ | |RR|-o ||
| | / /--+--------/ X | | | |+------+ | o--oR| ||
| | / / | /-------/ \ \ | +----+|CR_2.1|<-+ o--o ||
| | / / | / \ \ | |+------+ ||
| | +------+ +------+ +------+ | |+------+ +-------+||
| | |BR_1.1| |BR_1.2|- - -|BR_1.N| | ||BR_2.1| |BR_2.N'|||
| | +X----X+ +-X---X+ +-X---X+ | |+-+--+-+ +-+---+-+||
| +---X----X----X---X--------X-----X-+ +--+--+-------+---+--+|
+-------X----X----X---X-------+------X----------+--+-------+---+---+
\ \ | \ | \----\ | | | |
BR_1.1 \ \ | \-----+----------\ \ | | \ |
^ \-\ \-+-----------+-------\ \ \ \ \ \ \
X BR_1.2 \ | | \ \ \ \ \ \ \
X ^ \ | / \ \ | \ \ | |
X X BR_1.N \ \ /------/ \ | | \ \ | |
X X ^ \ \ / \ | | \ \ | |
X X X | | | ^ ^ ^ | | | \ \ | |
X X X | | | | | | | | | \ \ | |
+---------+ +----+-+-+---+-+-+------------+-+-+--------X--X--+--+--+
| | | | | | | | | | | | \ \ | | |
| | | +-+-+-++ ++-+-+-+ +------+ +------+ | |
| | | |PR_2.1| |PR_2.2|- - - |PR_2.M| |PR_2.P+--+ |
| | | +------+ +------+ +------+ +--+---+.T| |
| | | +------+ |
| AS 3 | | AS 2 |
+---------+ +------------------------------------------------------+
|==================================================================|
|CR - Core Router |
|BR - ASBR and/or Customer Edge in AS1 |
|PR - ASBR in peering ASes |
|==================================================================|
Figure 2
1.2. Common BGP Deployment Configurations
1.2.1. IBGP with Next-Hop Unchanged
In one standard BGP configuration, an ASBR, when it advertises an
externally learned prefix into IBGP, does not modify the BGP-NH. So,
the BGP-NH is set to the IP address of an interface on the external
peering router. The strength of this technique is the shorter time
needed to restore connectivity with all equal cost multi-path (ECMP)
in-use and on low latency paths. The drawback is extremely high BGP
Routing Information Base (RIB) scale - proportional to the number of
inter-AS links.
1.2.1.1. Example
Let's assume that in the network of Figure 2, all PR_2.x of AS2
advertise the same set of prefixes on all sessions to AS1.
If BR_1.1-BR_1.N and BR_2.1-BR_2.N' each advertise only one path per
prefix to their respective RRs, then as the result of ADD-PATH among
RRs, BRs and CRs, at site 3 the BRs and CRs will learn N+N' paths per
prefix learned from AS2. This is sufficient to equally distribute
load among all N ASBRs on site 1 (note the IGP cost between site 2
and site 3).
However, when interfaces over which all BR_1.1-BR_1.N learned their
best path become unavailable (say interfaces to PR_2.1 in all cases,
as a result of the failure of PR_2.1), the route to the BGP-NH -
that is, the IP address of the PR_2.1 interface - is removed from the
IGP. BGP speakers at other sites (BR_3.x) will react by temporarily
directing traffic to site 2 (BR_2.1-BR_2.N'). This switchover may
happen in sub-second time, in a prefix-scale-independent manner,
thanks to techniques commonly known as BGP PIC Edge
[I-D.ietf-rtgwg-bgp-pic]. As a result, traffic is on a path other
than the lowest cost path, as the connection from site 1 to AS2 is
not entirely broken (links to PR_2.2-PR_2.M are operational).
Subsequently, all BR_1.x will update their RRs with a new best path
(say for PR_2.2) for each prefix (for example, 100,000 of them),
triggering global convergence. Such a convergence, for a large
number of prefixes, may take many minutes.
In the above example, BRs, RRs, and possibly CRs keep N+N' paths per
prefix (N from site 1, and N' from site 2). Provided N=N'=4, this
makes 8 paths per prefix.
The solution for sub-optimal routing right after the failure would be
to enable each BR to advertise multiple paths to its RRs, and for
them in turn to propagate it to all other RRs and hence BRs. So,
each of BR_1.x at site 1 will advertise M paths (from PR_2.1-PR_2.M),
RR_1.x will have N*M ECMP best paths and advertise them to other sites
(site 3). As a result, BGP speakers at other sites (BR_3.x at site 3)
are provided with N*M paths per prefix from site 1 and N'*M' from
site 2. Therefore to achieve optimal routing immediately after
failure, a considerably higher scale of BGP paths needs to be
handled. If M=N=N'=M'=4 then for each prefix we have 16 best paths
and 16 non-best, a total of 32. If AS2 advertises 100,000 prefixes,
this becomes 3.2M paths.
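The path counts above can be reproduced with a short Python sketch.
The function name and its parameters are illustrative only and do not
come from any BGP implementation. The sketch assumes, as the example
does, that every ASBR at a site holds the same number of EBGP sessions
with the peer AS.

```python
def paths_per_prefix(sites, multipath_from_asbr):
    """Count BGP paths per prefix seen by a BGP speaker at a remote site.

    sites: list of (asbr_count, sessions_per_asbr) tuples, one per
        peering site, e.g. (N, M) for site 1 and (N', M') for site 2.
    multipath_from_asbr: False if each ASBR advertises a single path per
        prefix to its RR, True if it advertises one path per EBGP session.
    """
    total = 0
    for asbr_count, sessions_per_asbr in sites:
        per_asbr = sessions_per_asbr if multipath_from_asbr else 1
        total += asbr_count * per_asbr
    return total


# Values used in the example: N = N' = 4 and M = M' = 4.
sites = [(4, 4), (4, 4)]
single = paths_per_prefix(sites, multipath_from_asbr=False)  # N + N'
multi = paths_per_prefix(sites, multipath_from_asbr=True)    # N*M + N'*M'
prefixes = 100_000
total_paths = multi * prefixes

print(single, multi, total_paths)
```

With these inputs the sketch yields 8 paths per prefix in the single-path
case, 32 in the multi-path case, and 3,200,000 paths in total for
100,000 prefixes, matching the figures in the text.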
Although this solution provides a means of fast, prefix-scale-
independent traffic switchover, it does so only if an ASBR external
interface goes down, which triggers an IGP event. If an EBGP
session fails but the underlying interface remains up
(misconfiguration, software defect, etc.), recovery still requires
per-prefix withdrawal/update that could take many minutes at high
scale.
1.2.2. IBGP with Next-Hop-Self
The other common technique is to modify BGP-NH to "self" (a local IP
address, typically a loopback) when the BR advertises an externally
learned path into IBGP. This technique reduces the
number of paths per prefix while keeping optimal forwarding - least
cost and ECMP - in the failure case discussed above (e.g. PR_2.1 node
failure). Actually, because IP addresses of BGP-NH as seen by other
BGP speakers do not change in response to external failure events,
and are resolvable by the IGP, there is no need to reprogram the
Forwarding Information Base (FIB) at all. Unfortunately, other
failures, such as loss of all connectivity between a single BR (say
BR_1.1) and a peer AS (all PRs in AS2), would not be handled quickly.
As the BGP-NH advertised by BR_1.1 is not changed and is reachable by
the IGP, BGP speakers in AS1 (BRs, CRs) will keep BR_1.1 as a feasible
exit point until they receive BGP withdrawals on a prefix-by-prefix
basis. This is a global convergence process that at high scale can
take minutes, during which time packets may be discarded or loop.
2. The BGP Abstract Next-Hop
The Abstract Next Hop (ANH) concept presented below does not require
any changes to the BGP protocol itself. It is an architectural
solution to network configuration that uses existing protocols'
capabilities while achieving higher scale and faster routing
convergence when scale-out peering sites exist.
When a BGP speaker advertises a path to its IBGP peer, it modifies
the Protocol Next-Hop to be the ANH value. The ANH is just an IP
address that identifies the BGP session or a set of BGP sessions.
The set of BGP sessions is defined by the operator in local
configuration, according to network design needs. For example, an
ANH might identify:
o a set of BGP sessions with the same peer AS and handled by a given
single ASBR
o a set of BGP sessions with the same peer AS and handled by one or
more ASBRs at a given site
o a set of BGP sessions with any upstream provider AS
o a set of BGP sessions with a given peer device and handled by one
or more ASBRs of the local AS
A host route to the ANH is installed in the relevant RIB and
redistributed into the IGP. BGP maintains the ANH host route based
on the state of the associated group of BGP sessions:
o As soon as all BGP sessions in the set go down, the ANH route is
removed.
o When at least one BGP session in the set comes up, the ANH
route is created only after initial route convergence is complete
for the peer (End-of-RIB (EoR) [RFC4724] is received).
Taken together, these procedures ensure that as soon as the final
session in the set goes down, ingress routers will see the associated
ANH withdrawn from the IGP. Since the ANH is used to resolve the
associated BGP next hops, the ingress routers are triggered to
converge to send traffic to their alternate (new best) route. They
also ensure that as soon as one session in the set comes up and is
synchronized (that is, the EoR is received), ingress routers will see
the ANH advertised in the IGP and will be able to reconverge to use
routes that are associated with that next hop.
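The host-route maintenance rules above can be sketched as a small
state tracker. This is a minimal illustration, not an implementation
of any router's behavior: the class name, method names, session
identifiers and the 192.0.2.1 address are all hypothetical. The sketch
assumes the stricter "if, and only if" reading used in Section 3, in
which the ANH route is present exactly when at least one session in
the set is established and has delivered its EoR.

```python
class AnhTracker:
    """Tracks whether the host route for one ANH should be in the IGP."""

    def __init__(self, anh_address, session_ids):
        self.anh_address = anh_address
        # Per-session state: "down", "up" (established, EoR pending),
        # or "synced" (established and EoR received).
        self.state = {sid: "down" for sid in session_ids}

    def session_established(self, sid):
        self.state[sid] = "up"

    def eor_received(self, sid):
        # EoR only matters for a session that is currently established.
        if self.state[sid] == "up":
            self.state[sid] = "synced"

    def session_down(self, sid):
        self.state[sid] = "down"

    def route_present(self):
        # Assumption: present iff at least one session is synchronized.
        return any(s == "synced" for s in self.state.values())


# Hypothetical ANH covering two EBGP sessions with the same peer AS.
tracker = AnhTracker("192.0.2.1", ["session-1", "session-2"])
tracker.session_established("session-1")
before_eor = tracker.route_present()   # session up, EoR still pending
tracker.eor_received("session-1")
after_eor = tracker.route_present()    # initial convergence complete
tracker.session_down("session-1")
after_last_down = tracker.route_present()

print(before_eor, after_eor, after_last_down)
```

The route is withheld until the first EoR arrives and is removed again
once no synchronized session remains, which is the signal ingress
routers use to move traffic to their alternate route.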
The ANH can be any IP address that the router is eligible to
advertise according to the local network's IP address management
scheme. More details are given in Section 3.3.
3. Use of Abstract Next-Hop in scale-out peering design
In traditional configurations as described in Section 1.2 the meaning
of the BGP-NH is either:
o An egress interface in the case of next-hop-unchanged
configuration, or
o An egress ASBR in the case of next-hop-self configuration.
The meaning of Abstract Next Hop is more context-dependent. This
document describes network configurations when the BGP-NH identifies:
a. An (egress ASBR, peer AS) pair. The ANH should be advertised
into the IGP if, and only if, the given egress ASBR has at least
one EBGP session in the ESTABLISHED state with the given peer AS,
and the EoR marker has been received on that session. We call
this the ASBR-Peer AS Abstract Next Hop (AP-ANH).
b. An (egress site in local AS, peer AS) pair, where a "site" may
include multiple ASBRs. The ANH should be advertised into the
IGP if, and only if, at least one ASBR of the given site has at
least one EBGP session in the ESTABLISHED state with the given
peer AS, and the EoR marker has been received on this session.
We call this the Site-Peer AS Abstract Next Hop (SP-ANH).
Note that reachability of the ANH address in the IGP depends on EBGP
session state and not inter-AS interface state, although of course,
interface state may impact session state. How the IP route to the
ANH address is instantiated on an ASBR and inserted into the IGP on
a particular device is a matter of local implementation.
3.1. Egress ASBR-Peer AS Abstract Next Hop (AP-ANH)
The AP-ANH is unique to an ASBR and its peer AS. For example, in the
network of Figure 2, BR_1.1 would have two AP-ANH assigned - one for
its peering with AS2 and the other for AS3. Similarly, BR_1.2 would
have two AP-ANH, one per peer AS, with values different from the AP-
ANH of BR_1.1, and so on. All AP-ANH are exported into the IGP by
their ASBRs. Each ASBR advertises only one path per prefix to its
RR, with the BGP-NH set to the appropriate AP-ANH. The RR will
propagate it through the entire AS by means of IBGP ADD-PATH. In
consequence, the number of paths learned per prefix is equal to the
number of ASBRs servicing a given peer AS. In the network of
Figure 2, for AS2 prefixes, this would be N+N' (from site 1 + from
site 2) paths per prefix. This puts the scale requirements of this
solution on par with Next-Hop-Self (Section 1.2.2). However,
thanks to the properties of ANH, more failures are covered by prefix-
independent techniques, as withdrawal of the ANH from the IGP makes
the BGP-NH unresolvable.
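Why withdrawing an ANH from the IGP is prefix-independent can be
illustrated with a toy model in which every prefix refers to shared
next-hop objects, in the spirit of the BGP PIC techniques cited
earlier. This is only a sketch under stated assumptions: the class and
function names, the prefix labels and the 192.0.2.x addresses are
hypothetical, and real forwarding structures are considerably more
involved.

```python
class SharedNextHop:
    """One forwarding object per BGP-NH, shared by all prefixes using it."""

    def __init__(self, address):
        self.address = address
        # Mirrors the presence of the ANH host route in the IGP.
        self.resolvable = True


def active_exits(paths):
    """Return the addresses of the next hops in `paths` that still resolve."""
    return [nh.address for nh in paths if nh.resolvable]


# Hypothetical AP-ANH addresses for (BR_1.1, AS2) and (BR_1.2, AS2).
anh_br11 = SharedNextHop("192.0.2.11")
anh_br12 = SharedNextHop("192.0.2.12")

# Every prefix learned from AS2 references the same two objects.
rib = {f"prefix-{i}": [anh_br11, anh_br12] for i in range(1000)}

# BR_1.1 loses all EBGP sessions with AS2 and withdraws its AP-ANH from
# the IGP. One state change invalidates that exit for every prefix.
anh_br11.resolvable = False

first = active_exits(rib["prefix-0"])
last = active_exits(rib["prefix-999"])
print(first, last)
```

No per-prefix BGP withdrawal is needed for traffic to stop using the
failed exit; the BGP updates that follow only clean up control-plane
state.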
Provided that all ASBRs in a given site (site1 in Figure 2) receive
the same routing information from their peer AS (AS2), in non-faulty
conditions, one could consider setting the ANH value on all ASBRs the
same. However, failure(s) can create situations when multiple ASBRs
will have a session in ESTABLISHED state with a given peer AS, but
some prefixes would be learned from EBGP only on a subset of these
ASBRs. To prevent problems from arising in this situation, the per-
ASBR AP-ANH needs to be advertised into the IGP and ASBRs need to set
it as the BGP-NH when advertising routes to the site's Route
Reflectors. However, for IBGP path advertisement being propagated
beyond the site (into the RR mesh), the BGP-NH may be replaced by
another ANH value, the Site-Peer AS ANH.
3.2. The Site-Peer AS Abstract Next Hop (SP-ANH)
The AP-ANH works on an ASBR level. From a given local AS
perspective, the number of ANH is proportional to the number of pairs
of ASBRs and ASes each of them peers with. With hundreds of peer
ASes, tens of sites and ~10 ASBRs per site, the number of AP-ANH may
scale into the thousands. At the same time, it may not be necessary
or even desirable for every BGP speaker in the network to have
visibility to every path down to individual egress ASBR granularity.
With symmetrical multiplane backbone and/or leaf-spine designs, it is
sufficient that BGP speakers on other sites have information that a
given site (site1 in Figure 2) has at least one ASBR with an
ESTABLISHED session to the peer AS (AS2). For example, in the
network of Figure 2, even if BR_3.1 has only one path with its BGP-NH
equal to the ANH of BR_1.1, BR_3.1 resolves the BGP-NH in the IGP and
spreads traffic among all CRs on site 3. Thus, traffic will be
delivered to CR_1.x at site 1. As long as CR_1.x has visibility to all
paths, traffic will be distributed equally to all site 1 ASBRs.
At the same time, when multiple paths are available on BGP speakers,
every change is propagated, with consequent transmission and
processing costs on all BGP speakers across the network. This will
be true even if the route change doesn't impact the forwarding plane.
For example, in the network of Figure 2, even if BR_3.1 has N paths
with BGP-NHs set to the ANHs of BR_1.1 through BR_1.N, BR_3.1 will
resolve those BGP-NHs in the IGP and spread traffic among all CRs of
site 3. When one of the egress ASBRs (say BR_1.2) loses its
connectivity to the peer AS, the affected BGP routes (those with BGP-
NH equal to the AP-ANH of BR_1.2) are withdrawn from all BGP speakers
(e.g. BR_3.1) of the network. All BGP speakers perform path
selection and possibly update their forwarding data structures.
Since the actual forwarding paths do not change, all this work
represents unnecessary churn.
To avoid the above drawbacks, the RR of a given site (site1 in
Figure 2), when re-advertising a BGP path learned from its ASBR
client, modifies the BGP-NH to another abstract value - the Site-Peer
AS Abstract NH (SP-ANH). This value is unique per (site, peer AS)
pair, and is shared by all RRs of a given site. With this
modification, it is sufficient that inter-site IBGP sessions carry
only one path per prefix (no ADD-PATH needed). Consequently, BGP RIB
scale is reduced significantly. This frees up memory, reduces the
amount of data RRs need to exchange, and mitigates churn. The BGP
speakers in other sites of AS 1 need to resolve the SP-ANH in order to
build their local FIBs. Therefore, the SP-ANH has to be present in the
IGP - some router(s) in the local site (RR, ASBR or CR) need to
inject it into the IGP. The choice of which role is responsible for
SP-ANH injection is discussed below; in any case, the SP-ANH should
be reachable in the IGP if, and only if, at least one AP-ANH (for the
same peer AS and an ASBR belonging to the given site) is
reachable. Figure 3 illustrates routing information flow in a
network such as that of Figure 2:
+------------------------------------------------
| +----->IBGP to SITE2
| AS 1 | +--->IBGP to SITE3
/=============================\ | |
|a.a.a.a/a |----------------->| | SP-ANH
| as-path "^2 .*" | | | (SITE1&AS2)
| BGP-NH SP-ANH(SITE1&AS2)| | | IP/32 into IGP
\=============================/ | | ^
| | | |
| +-------------------------+-+------------+---+
/==============================\ o------o o-+-+--o |
|ADD-PATH | |RR_1.2| |RR_1.1| SITE1 |
|a.a.a.a/a | o------O o----X-O |
| as-path "^2 .*" | ^ ^ \ |
| BGP-NH AP-ANH(BR_1.1&AS2)| / / \ |
|a.a.a.a/a |--------------X-X---->| |
| as-path "^2 .*" | / | | |
| BGP-NH AP-ANH(BR_1.2&AS2)| / | | |
\==============================/ / | | |
/==============================\ / | \ |
|a.a.a.a/a | | | \ |
| as-path "^2 .*" |--------->/ | v |
| BGP-NH AP-ANH(BR_1.1&AS2)| / | +------+ |
\==============================/ / | |CR_1.1+--+ |
/==============================\ / / +--+---+.1+-+ |
|a.a.a.a/a |------X------->/ +-+----+X| |
| as-path "^2 .*" | / / +------+ |
| BGP-NH AP-ANH(BR_1.2&AS2)| +------+ +------+ +------+ |
\==============================/ |BR_1.1| |BR_1.2|- - -|BR_1.N| |
| | +------+ +------+ +------+ |
| | ^ ^ |
| | \ \ |
| +-------------X--X---------------------------+
/======================\--------------X--X---------------------------
|a.a.a.a/a | \ \
| as-path "^2 .*" |--------------->\ \---------\
\======================/ \ \
/======================\ \ \
|a.a.a.a/a |-------------------X----------->\
| as-path "^2 .*" |----------------+ +-X------------X-----------
\======================/ | | +X-----+ +--X---+ +
| AS 3 | | |PR_2.1| |PR_2.2|- - -|
| | | +------+ +------+ +
| | | AS 2
+-------------------+ +----a.a.a.a/a network-----
Figure 3
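The two SP-ANH rules described in this section, the reachability
condition and the next-hop rewrite performed by the site's RR, can be
sketched as follows. This is an illustration only: the function names
and the dictionary layout are hypothetical, the next-hop strings simply
reuse the labels of Figure 3, and taking the first path stands in for
the RR's normal best-path selection.

```python
def sp_anh_reachable(site_asbrs, peer_as, reachable_ap_anh):
    """SP-ANH for (site, peer_as) is reachable iff some AP-ANH of the site is.

    reachable_ap_anh: set of (asbr, peer_as) pairs whose AP-ANH host
        route is currently present in the IGP.
    """
    return any((asbr, peer_as) in reachable_ap_anh for asbr in site_asbrs)


def readvertise_beyond_site(site_paths, sp_anh):
    """Collapse the per-ASBR paths for one prefix into one inter-site path.

    site_paths: list of dicts with "prefix" and "next_hop" (an AP-ANH).
    Returns a single path with the next hop rewritten to the SP-ANH,
    or None if the site has no path for the prefix.
    """
    if not site_paths:
        return None
    best = site_paths[0]  # placeholder for best-path selection
    return {**best, "next_hop": sp_anh}


site1_asbrs = ["BR_1.1", "BR_1.2"]
reachable = {("BR_1.1", 2), ("BR_1.2", 2)}
paths = [
    {"prefix": "a.a.a.a/a", "next_hop": "AP-ANH(BR_1.1&AS2)"},
    {"prefix": "a.a.a.a/a", "next_hop": "AP-ANH(BR_1.2&AS2)"},
]

inter_site = readvertise_beyond_site(paths, "SP-ANH(SITE1&AS2)")

reachable.discard(("BR_1.2", 2))
one_asbr_left = sp_anh_reachable(site1_asbrs, 2, reachable)
reachable.discard(("BR_1.1", 2))
no_asbr_left = sp_anh_reachable(site1_asbrs, 2, reachable)

print(inter_site["next_hop"], one_asbr_left, no_asbr_left)
```

Losing BR_1.2 alone leaves the SP-ANH reachable, so other sites see no
change; only when the last AP-ANH of the site disappears is the SP-ANH
withdrawn.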
3.3. Assignment of Abstract Next Hops
In the following subsections we provide more details of how abstract
next hops can be injected in several different common network
architectures.
3.3.1. Native IP Networks
In this network every router, including core routers, has full BGP
routing information and forwards each packet based on destination IP
lookup. Provided that all routers at an egress site receive multiple
paths with BGP-NH set to AP-ANH (and not SP-ANH), it is a matter of
the operator's decision which node - RR, ASBR or CR - will inject the
SP-ANH route into the IGP. One may argue that injection of SP-ANH by
ASBRs may be simpler, as it will be done by the same procedure and
policy as injection of AP-ANH. Others may prefer injection at RR, as
it limits the number of configuration touch-points.
3.3.2. MPLS
3.3.2.1. Identical BGP address space and paths received on all ASBRs
In the MPLS network, since traffic is carried over LSP tunnels, the
SP-ANH needs to be injected into the IGP by a node that has the
ability to perform an IP lookup. This eliminates the RR, and
possibly CRs (in "BGP-free core" architectures). Instead, all ASBRs
are used to insert SP-ANH addresses into the IGP. In case of LDP-
based networks, this is sufficient. The CR will create an ECMP
forwarding structure for labels of SP-ANH FEC coming from other
sites. In RSVP-TE based networks, ECMP needs to happen on the
ingress LSR and therefore, every BGP speaker needs to establish an
LSP to every ASBR, and the SP-ANH address needs to be part of the FEC
for its respective LSP. If SP-ANH is used as an RSVP (signaling)
destination, some other means (such as affinity groups) needs to be
used to ensure the desired 1:1 LSP to egress ASBR mapping.
3.3.2.2. Different address space sets or paths received on different
ASBRs
In the case when the set of prefixes received from a given peer AS by
one ASBR is different from the set received by another one, a
combination of SP-ANH and MPLS-based load balancing on a CR may lead
to a situation where an IP packet will be directed to an ASBR that
lacks external routing information and hence can't forward traffic
directly out of the AS. Similarly, if path attributes for a given
prefix received by one ASBR are different from those received by
another, again packets can be directed to the "wrong" ASBR. In this
case the ASBR would use the IBGP route it learned from another ASBR
of the same site (via RR, with AP-ANH) and forward traffic over an
LSP to the "correct" ASBR. This extra hop constitutes a sub-optimal
traffic path through the network.
For example, in the network of Figure 2, assume that prefix P2 is
advertised by AS2 to BR1.2-BR1.N but not to BR1.1. BR3.1 has a BGP
best route to P2 with its BGP-NH set to the SP-ANH of (site1, AS2).
It resolves this next hop by ECMP over N MPLS LSPs terminating on
BR1.1-BR1.N. As a result, some packets are forwarded by BR3.1 over
an LSP via CR1.x that terminates on BR1.1. BR1.1 has no external
route to P2, but it has (N-1) IBGP routes to P2 with BGP-NHs equal to
the AP-ANHs of BR1.2-BR1.N. Therefore BR1.1 performs an IP lookup
and forwards these packets over LSPs via CR1.x that terminate on
BR1.2-BR1.N. The traffic is U-turned on BR1.1 and traverses the CRs
at site 1 twice.
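The U-turn described above can be illustrated with a small model.
This is only a sketch under stated assumptions: the site has N = 4
ASBRs, the ingress splits traffic equally across its ECMP members,
and the function and variable names are hypothetical, not part of
this proposal.

```python
from typing import Dict, List


def asbr_hop_sequences(
    asbrs: List[str], has_external_route: Dict[str, bool]
) -> Dict[str, List[List[str]]]:
    """Map each ASBR picked by the ingress ECMP to the ASBR hop
    sequences a packet may follow before leaving the AS."""
    exits = [b for b in asbrs if has_external_route[b]]
    sequences: Dict[str, List[List[str]]] = {}
    for first in asbrs:
        if has_external_route[first]:
            # EBGP-learned route present: forward straight out of the AS.
            sequences[first] = [[first]]
        else:
            # No external route: follow the IBGP routes (AP-ANH next hops)
            # to the ASBRs that have one, so the packet is U-turned.
            sequences[first] = [[first, second] for second in exits]
    return sequences


def u_turn_share(asbrs: List[str], has_external_route: Dict[str, bool]) -> float:
    """Share of ingress ECMP members that lead to a U-turn (equal split assumed)."""
    return sum(not has_external_route[b] for b in asbrs) / len(asbrs)


# Hypothetical site with N = 4 ASBRs; P2 is not advertised to BR1.1.
ASBRS = ["BR1.1", "BR1.2", "BR1.3", "BR1.4"]
HAS_P2 = {"BR1.1": False, "BR1.2": True, "BR1.3": True, "BR1.4": True}

if __name__ == "__main__":
    for first, seqs in asbr_hop_sequences(ASBRS, HAS_P2).items():
        print(first, seqs)
    print("U-turn share:", u_turn_share(ASBRS, HAS_P2))
```

In this model one of the four ECMP members selected by BR3.1 lands on
BR1.1, and every packet on that member crosses the site 1 CRs a
second time.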
Such asymmetry may be considered acceptable by the provider, as long
as it's a transient condition. However, in the general case such a
situation could be persistent, as the result of intentional
configuration on the peer AS's ASBRs. Therefore the better solution
would be to insert the SP-ANH into the IGP on CRs. In this case, CRs
need to perform forwarding based on destination IP lookup. Therefore
CRs would have to be able to learn and handle large IP routing and
forwarding tables - at least all prefixes learned from peer ASes by
the local ASBRs.
3.3.3. SPRING
3.3.3.1. Identical BGP address space and path received on all ASBRs
For SPRING based networks, we can take advantage of the unique
capability of Anycast-SID [RFC8402]. The ASBRs of a single site
allocate an Anycast-SID for each SP-ANH address. This SID can be
used as the only SID by an ingress BGP speaker or, if a TE routed
path is desired, depending on TE constraints, the TE controller can
provision a SPRING path with the Anycast-SID at the end, instructing
the CR to perform load balancing among connected ASBRs.
3.3.3.2. Different address space sets or paths received on different
ASBRs
Similarly to a classic MPLS environment, such a situation may lead to
suboptimal routing (redirecting from one ASBR to another), or may
require the CR (instead of ASBR) to insert the SP-ANH into the IGP
and generate a PREFIX-SID (or Anycast-SID if there is more than one
CR) for it.
3.4. Localization of AP-ANH
The architecture described above reduces the number of BGP paths
exchanged between sites of the local AS by means of the SP-ANH. Paths
with the BGP next hop set to an AP-ANH are visible only to routers in
the same site as the ASBRs advertising them. However, as routes to
AP-ANHs are inserted into the IGP, in the general case they could be
visible to all nodes in the local AS, contributing to the IGP's LSDB
scale. Further optimization is possible by limiting the reachability
of an AP-ANH to the site where that AP-ANH is originated. This could
be achieved in multiple ways, for example by running an additional
IGP instance internal to each site, or by running L1 IS-IS among all
nodes of a single site and making the core routers L1/L2 systems.
The benefit would be a reduction of the network-wide LSDB size, and
hence faster IGP convergence and lower resource requirements.
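The size of this reduction can be estimated with simple arithmetic.
The sketch below assumes a uniform network (same number of ASBRs and
peer ASes at every site), one AP-ANH per (ASBR, peer AS) pair, one
SP-ANH per (site, peer AS) pair, and SP-ANH routes that stay visible
network-wide. The function name and the sample numbers are
hypothetical.

```python
def anh_routes_seen_by_node(
    sites: int, asbrs_per_site: int, peer_ases: int, localized: bool
) -> int:
    """Count the ANH host routes visible to one node inside a site."""
    # One SP-ANH per (site, peer AS); assumed visible everywhere.
    sp_anh = sites * peer_ases
    # One AP-ANH per (ASBR, peer AS) within a single site.
    ap_anh_per_site = asbrs_per_site * peer_ases
    # With localization only the local site's AP-ANHs are visible.
    ap_anh_visible = ap_anh_per_site if localized else sites * ap_anh_per_site
    return sp_anh + ap_anh_visible


if __name__ == "__main__":
    # Hypothetical network: 10 sites, 8 ASBRs per site, 20 peer ASes per site.
    print(anh_routes_seen_by_node(10, 8, 20, localized=False))
    print(anh_routes_seen_by_node(10, 8, 20, localized=True))
```

Under these assumptions the AP-ANH contribution seen by a node no
longer grows with the number of sites once AP-ANHs are localized.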
Additionally, localization of AP-ANHs allows the IP addresses of
AP-ANHs to be reused between sites. Although such a practice is
controversial, it may be beneficial in certain provisioning
automation and ZTP scenarios.
4. Worked Examples
Below we illustrate the operation of the proposal by working through
its operation in the context of several different types of failures.
Here, we assume that each ASBR in a given site of the local AS (site
1 of AS1 in Figure 2), that has an EBGP session with the given peer
AS (AS2 in Figure 2), receives from its peer routers (PR2.x) routes
to exactly the same address space on each session.
4.1. Failure of a proper subset of EBGP sessions with a given peer AS
on a single ASBR
o The impacted ASBR keeps advertising the AP-ANH into the IGP, as at
least one session to the peer AS remains in the ESTABLISHED state.
o The impacted ASBR may send UPDATEs to RRs, however the BGP-NH
remains the same and equal to the pre-failure AP-ANH.
o The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in
other sites, however the BGP-NH remains the same as its pre-
failure value: AP-ANH and SP-ANH respectively.
o As the BGP-NH does not change, there are no changes in forwarding data
structures (FIB) on any BGP speaker across the network, except
possibly the ASBR that holds the impacted session.
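The rule that drives this case can be written down in a few lines.
This is a minimal sketch, not an implementation of any router
software: the SessionState enum and the function name are
hypothetical, and only two session states are modelled.

```python
from enum import Enum
from typing import Iterable


class SessionState(Enum):
    """Simplified EBGP session states (hypothetical, two states only)."""
    IDLE = "Idle"
    ESTABLISHED = "Established"


def advertises_ap_anh(sessions_to_peer_as: Iterable[SessionState]) -> bool:
    """An ASBR keeps its AP-ANH for a peer AS in the IGP as long as at
    least one EBGP session to that peer AS is ESTABLISHED."""
    return any(s is SessionState.ESTABLISHED for s in sessions_to_peer_as)


if __name__ == "__main__":
    # A proper subset of sessions failed: the AP-ANH stays in the IGP.
    print(advertises_ap_anh([SessionState.ESTABLISHED, SessionState.IDLE]))
    # All sessions to the peer AS failed: the AP-ANH is withdrawn.
    print(advertises_ap_anh([SessionState.IDLE, SessionState.IDLE]))
```

Because the AP-ANH stays in the IGP while any session survives, the
BGP-NH seen by other speakers is unaffected by a partial failure.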
4.2. Failure of a proper subset of EBGP sessions with a given peer AS
on each ASBR of a given site
o The impacted ASBRs keep advertising the AP-ANH into the IGP, as at
least one session to the peer AS remains in the ESTABLISHED state
on each ASBR.
o The impacted ASBRs may send UPDATEs to RRs, however the BGP-NH
remains the same and equal to the pre-failure AP-ANH.
o The RRs may send UPDATEs to their clients (CRs, BRs) and to RRs in
other sites, however the BGP-NH remains the same and equal to its
pre-failure value: AP-ANH and SP-ANH respectively.
o As the BGP-NH does not change, there are no changes in forwarding data
structures (FIB) on any BGP speaker across the network, except
possibly the ASBRs that hold the impacted sessions.
4.3. Failure of all EBGP sessions with a given peer AS on single ASBR;
Failure of a single ASBR
o The impacted ASBR stops advertising the AP-ANH into the IGP, as it
has lost all sessions with the given peer AS.
o The SP-ANH is kept reachable in the IGP.
o All other BGP speakers at the impacted site invalidate all paths
with BGP-NH equal to the AP-ANH. This may trigger prefix-
independent FIB data-structure patching/temporary fixing for sub-
second traffic restoration.
o The impacted ASBR sends WITHDRAWs to its RRs.
o Each RR:
* Sends WITHDRAWs to its clients at the local site (CRs, BRs) for
paths from the impacted ASBR. As these sessions support ADD-
PATH, paths from other ASBRs will remain. Other BGP speakers
at this site have to modify their FIBs.
* May send UPDATEs to RRs in other sites, however the BGP-NH
remains the same, equal to the pre-failure SP-ANH. As the BGP-
NH does not change, there are no changes in forwarding data
structures (FIB) on any BGP speaker across the network, except
those at the impacted site.
o Routing churn is mitigated in many cases to a single peering site,
and does not propagate across the network. FIB changes are
limited to a single peering site, and do not propagate across the
network.
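The containment of FIB changes to one site can be sketched as next-hop
filtering against IGP reachability. The example below is illustrative
only: the next-hop labels, the prefix placeholder and the function
name are hypothetical, and a real implementation would perform this
step in a prefix-independent way rather than per prefix.

```python
from typing import Dict, List, Set


def usable_next_hops(
    paths: Dict[str, List[str]], igp_reachable: Set[str]
) -> Dict[str, List[str]]:
    """For every prefix keep only the BGP next hops still reachable in the IGP."""
    return {
        prefix: [nh for nh in next_hops if nh in igp_reachable]
        for prefix, next_hops in paths.items()
    }


# A speaker at site 1 holds ADD-PATH paths with AP-ANH next hops.
LOCAL_PATHS = {"a.a.a.a/a": ["AP-ANH(BR1.1,AS2)", "AP-ANH(BR1.2,AS2)"]}
# A speaker at another site holds a single path with the SP-ANH next hop.
REMOTE_PATHS = {"a.a.a.a/a": ["SP-ANH(site1,AS2)"]}

# IGP state after BR1.1 lost all sessions to AS2: its AP-ANH is gone,
# while the SP-ANH and the AP-ANH of BR1.2 remain reachable.
IGP_AFTER = {"AP-ANH(BR1.2,AS2)", "SP-ANH(site1,AS2)"}

if __name__ == "__main__":
    print(usable_next_hops(LOCAL_PATHS, IGP_AFTER))
    print(usable_next_hops(REMOTE_PATHS, IGP_AFTER))
```

Only the local speaker's next-hop set shrinks; the remote speaker's
set is identical before and after the failure, so its FIB is left
untouched.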
4.4. Failure of all EBGP sessions with a given peer AS on all ASBRs
o Each ASBR stops advertising its AP-ANH into the IGP, as it has
lost all sessions with the given peer AS.
o The SP-ANH is no longer reachable in the IGP, as none of the AP-ANHs
are reachable.
o All other BGP speakers across the network invalidate all paths
with a BGP-NH equal to the removed AP-ANH or SP-ANH. This may
trigger prefix-independent FIB data-structure patching/temporary
fixing for sub-second traffic restoration.
o Each impacted ASBR sends WITHDRAWs to its RRs.
o The RRs send WITHDRAWs to their clients at the local site (CRs,
BRs) and RRs in other sites for paths from the impacted ASBRs. As
these sessions support ADD-PATH, paths from ASBRs at other sites
will remain. The BGP speakers across the network may need to
modify their FIBs.
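The four failure cases of this section differ mainly in how far the
resulting FIB changes spread. The sketch below classifies a failure
by that scope. It is a simplification under stated assumptions: only
one site and one peer AS are modelled, only failures (not recoveries)
are considered, at least one AP-ANH is advertised before the failure,
and the scope labels and function names are hypothetical.

```python
from typing import Dict, List, Set


def ap_anh_advertisers(sessions: Dict[str, List[bool]]) -> Set[str]:
    """ASBRs with at least one ESTABLISHED (True) session to the peer AS."""
    return {asbr for asbr, states in sessions.items() if any(states)}


def fib_change_scope(before: Dict[str, List[bool]], after: Dict[str, List[bool]]) -> str:
    """Classify how far FIB changes spread after a failure."""
    up_before = ap_anh_advertisers(before)
    up_after = ap_anh_advertisers(after)
    if up_after == up_before:
        # Sections 4.1 and 4.2: every AP-ANH stays in the IGP.
        return "asbr-local"
    if up_after:
        # Section 4.3: some AP-ANH is withdrawn, the SP-ANH survives.
        return "site-local"
    # Section 4.4: no AP-ANH left, so the SP-ANH is unreachable as well.
    return "network-wide"


BEFORE = {"BR1.1": [True, True], "BR1.2": [True, True]}

if __name__ == "__main__":
    print(fib_change_scope(BEFORE, {"BR1.1": [True, False], "BR1.2": [True, True]}))
    print(fib_change_scope(BEFORE, {"BR1.1": [False, False], "BR1.2": [True, True]}))
    print(fib_change_scope(BEFORE, {"BR1.1": [False, False], "BR1.2": [False, False]}))
```

Only the last case, where every ASBR of the site loses all sessions
to the peer AS, forces BGP speakers at other sites to react.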
5. Acknowledgements
Valuable comments and suggestions on the solution covered by this
document were provided by John Scudder and Ron Bonica. Special thanks
to John Scudder, who also helped with editorial changes.
6. IANA Considerations
This memo includes no request to IANA.
7. Security Considerations
Since this is a deployment architecture and not a protocol
modification, it doesn't introduce any new issues to the BGP protocol
itself. General BGP security considerations are discussed in
[RFC4271] and [RFC4272], BGP deployment best practices are documented
in [RFC7454], and nothing in this proposal impedes their use. Many
of the practices recommended in that document are self-evidently
still applicable, for example the use of cryptographic session
protection methods such as TCP MD5 [RFC2385] or the TCP