MPLS Backbone for fun using FRR, MPLS, LDP, VXLAN

Intro

Some bad-weather weekend tinkering… building an MPLS/LDP backbone for fun using open-source software.

The idea was to improve my knowledge of software/open-source routing stacks. I decided to try to set up an IP/MPLS backbone similar to what I run in day-to-day operations, using open-source components. And while at it, I also wanted to expand my NetBox and Ansible skills.

This topic actually wouldn’t be worth a post if it weren’t for the struggles… and learnings.

Requirements:

  • IP/MPLS (L3VPN) infrastructure
  • Full dual-stack transport
  • L2VPN/EVPN/pseudo-wire nice-to-have, but not required
  • All open-source, no hardware requirements
  • WAN encryption
  • As close as possible to what we would deploy in the SP world
  • Compatibility with major vendor implementations (Cisco, Juniper)
  • Ansible for deployment
  • NetBox as source of truth

Software options explored:

  • BIRD (already experienced with, not preferred, no MPLS/LDP)
  • ExaBGP (already experienced with, preferred for DDoS mitigation, BGP only)
  • Juniper vMX (discontinued, not open-source)
  • Juniper vJunos Evolved (lab only, not open-source)
  • Cisco CSR 1000v (discontinued, not open-source)
  • Cisco Catalyst 8000v (not open-source)
  • FRR (MPLS/LDP, OSPFv2/3, BGP, Cisco-like CLI)
  • Cumulus Linux (based on FRR, targeting whitebox switches)
  • (VyOS discovered later on, not evaluated)

I went with FRR because it seemed to offer the most “SP-world-like” deployment and ticked most of the boxes.

Running topology

Brief topology overview

  • 4 cloud VMs connected in a ring topology (pe101 - pe201 - pe202 - pe102 - pe101) acting as P/PE routers
  • 2 cloud VMs as clients, directly connected to pe101 and pe201, terminating in a VRF
  • VMs as cheap as possible (because… this is private fun)
  • All VMs IPv6-only (because cheap now also means paying extra for legacy IP addresses…)
  • Rocky Linux (9)
  • StrongSwan for IPSec/WAN encryption
  • GREoIPSec for multicast support (IGP)
  • IPv4-unnumbered GRE interfaces re-using the loopback address
  • OSPF as IGP
  • MPLS/LDP for transport (or better use VXLAN?)
  • BGP for L3VPN
  • No L2VPN/EVPN/PW support with LDP as underlay.

WAN Encryption

I went with StrongSwan/swanctl, which was up and running quickly:

swanctl --list-sas
plugin 'sqlite': failed to load - sqlite_plugin_create not found and no plugin file available
pe202: #23, ESTABLISHED, IKEv2, e1929d3d28102145_i 09536718fd451ade_r*
  local  '<removed>' @ <removed>[500]
  remote '<removed>' @ <removed>[500]
  CHACHA20_POLY1305/PRF_HMAC_SHA2_256/CURVE_25519
  established 7753s ago, rekeying in 5727s
  pe202: #62, reqid 2, INSTALLED, TUNNEL, ESP:CHACHA20_POLY1305/CURVE_25519
    installed 2294s ago, rekeying in 1095s, expires in 1666s
    in  c72e22ae, 1648108 bytes, 13987 packets,     0s ago
    out cd571cca, 1643347 bytes, 13950 packets,     0s ago
    local  <removed>/128[gre]
    remote <removed>/128[gre]
pe101: #22, ESTABLISHED, IKEv2, 85110301ad5e1ac9_i* 083cf037cb5a0fbf_r
  local  '<removed>' @ <removed>[500]
  remote '<removed>' @ <removed>[500]
  CHACHA20_POLY1305/PRF_HMAC_SHA2_256/CURVE_25519
  established 11861s ago, rekeying in 1966s
  pe101: #63, reqid 1, INSTALLED, TUNNEL, ESP:CHACHA20_POLY1305/CURVE_25519
    installed 2070s ago, rekeying in 1234s, expires in 1890s
    in  cb77481b, 1454279 bytes, 12237 packets,     0s ago
    out c04e62c8, 1450745 bytes, 12201 packets,     0s ago
    local  <removed>/128[gre]
    remote <removed>/128[gre]
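
For reference, a trimmed sketch of what the swanctl.conf connection towards one neighbour could look like (shown from pe101 towards pe202; outer addresses, identities and the PSK are placeholders, the proposals match the CHACHA20_POLY1305/CURVE_25519 combination above, and the [gre] traffic selectors restrict the SA to the GRE traffic between the endpoints):

connections {
    pe202 {
        version = 2
        local_addrs  = <outer-ipv6-pe101>
        remote_addrs = <outer-ipv6-pe202>
        proposals = chacha20poly1305-prfsha256-x25519
        dpd_delay = 30s
        local {
            auth = psk
            id = pe101
        }
        remote {
            auth = psk
            id = pe202
        }
        children {
            pe202 {
                # only protect the GRE traffic between the two endpoints
                local_ts  = dynamic[gre]
                remote_ts = dynamic[gre]
                esp_proposals = chacha20poly1305-x25519
                start_action = start
                dpd_action = restart
            }
        }
    }
}

secrets {
    ike-pe202 {
        id = pe202
        secret = "<placeholder-psk>"
    }
}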

Managed using NetworkManager:

root@pe101:/etc/frr$ nmcli con show
NAME                 UUID                                  TYPE       DEVICE
eth0                 3b0a6709-8aa1-4f4d-b24f-a1288bedf39d  ethernet   eth0
loopback0            c83ca996-d092-4629-b479-b1e1957f9a70  dummy      loopback0
gre_pe102  caff121d-57da-4e91-8aa5-d0315c817cd4  ip-tunnel  pe102
gre_pe201  0399428a-811e-4617-aaa3-bc4675e2dc29  ip-tunnel  pe201
root@pe102:/etc/frr$ nmcli
[...]
loopback0: connected to loopback0
        "loopback0"
        dummy, 52:BF:1C:DE:46:70, sw, mtu 1500
        inet4 10.10.0.4/32
        inet6 fe80::2abe:ec63:73bb:9c97/64
        inet6 fd53:ba85:6bf9:10::4/128
        route6 fd53:ba85:6bf9:10::4/128 metric 550
        route6 fe80::/64 metric 1024

pe101: connected to gre_pe101
        "pe101"
        iptunnel (ip6gre), 2A:07:6D:40:00:01:00:03:00:00:00:00:00:00:F1:02, sw, mtu 1374
        inet4 10.10.0.4/32
        inet6 fe80::3642:b8d8:8090:c5ad/64
        route6 fe80::/64 metric 1024

pe202: connected to gre_pe202
        "pe202"
        iptunnel (ip6gre), 2A:07:6D:40:00:01:00:03:00:00:00:00:00:00:F1:02, sw, mtu 1374
        inet4 10.10.0.4/32
        inet6 fe80::27c6:66a4:b09f:5803/64
        route6 fe80::/64 metric 1024
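
For completeness, this is roughly how those connections could be created with nmcli (shown for pe102; the outer IPv6 addresses are placeholders, and the /32 on the tunnel re-uses the loopback address in unnumbered style):

# dummy interface carrying the router loopback addresses
nmcli con add type dummy ifname loopback0 con-name loopback0 \
    ipv4.method manual ipv4.addresses 10.10.0.4/32 \
    ipv6.method manual ipv6.addresses fd53:ba85:6bf9:10::4/128

# ip6gre tunnel towards pe101, borrowing the loopback /32
nmcli con add type ip-tunnel con-name gre_pe101 ifname pe101 \
    ip-tunnel.mode ip6gre \
    ip-tunnel.local <outer-ipv6-pe102> ip-tunnel.remote <outer-ipv6-pe101> \
    ip-tunnel.mtu 1374 \
    ipv4.method manual ipv4.addresses 10.10.0.4/32 \
    ipv6.method link-local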

FRR up and running rather quickly

=> All loopbacks reachable, BGP up and running

pe102# sh int brief
Interface       Status  VRF             Addresses
---------       ------  ---             ---------
eth0            up      default         + <removed>/64
ip6gre0         down    default
ip6tnl0         down    default
lo              up      default
loopback0       up      default         10.10.0.4/32
                                        + fd53:ba85:6bf9:10::4/128
pe101 up      default         10.10.0.4/32
pe202 up      default         10.10.0.4/32
pe101# sh ip ospf nei

Neighbor ID     Pri State           Up Time         Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.10.0.4         1 Full/-          44.193s           37.162s 10.10.0.4       pe102:10.10.0.2            0     0     0
10.10.0.1         1 Full/-          39.079s           32.307s 10.10.0.1       pe201:10.10.0.2            0     0     0

pe101# sh ipv6 ospf nei
Neighbor ID     Pri    DeadTime    State/IfState         Duration I/F[State]
10.10.0.4         1    00:00:35     Full/PointToPoint    00:00:46 pe102[PointToPoint]
10.10.0.1         1    00:00:30     Full/PointToPoint    00:00:41 pe201[PointToPoint]
pe101# sh ip route 10.10.0.1
Routing entry for 10.10.0.1/32
  Known via "ospf", distance 110, metric 20, best
  Last update 00:00:48 ago
  * 10.10.0.1, via pe201 onlink, label implicit-null, weight 1

pe101# sh ipv6 route fd53:ba85:6bf9:10::1
Routing entry for fd53:ba85:6bf9:10::1/128
  Known via "ospf6", distance 110, metric 20, best
  Last update 00:01:05 ago
  * fe80::d583:1450:8c1b:4b70, via pe201, label implicit-null, weight 1


pe201# sh ip ospf nei

Neighbor ID     Pri State           Up Time         Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.10.0.2         1 Full/-          39.066s           35.817s 10.10.0.2       pe101:10.10.0.1            0     0     0
10.10.0.3         1 Full/-          47.691s           37.254s 10.10.0.3       pe202:10.10.0.1            0     0     0

pe201# sh ipv6 ospf nei
Neighbor ID     Pri    DeadTime    State/IfState         Duration I/F[State]
10.10.0.2         1    00:00:33     Full/PointToPoint    00:00:41 pe101[PointToPoint]
10.10.0.3         1    00:00:35     Full/PointToPoint    00:00:49 pe202[PointToPoint]
pe201# sh ip route 10.10.0.1
Routing entry for 10.10.0.1/32
  Known via "ospf", distance 110, metric 10
  Last update 00:01:06 ago
    0.0.0.0, via loopback0 onlink, weight 1

Routing entry for 10.10.0.1/32
  Known via "connected", distance 0, metric 0
  Last update 00:01:06 ago
  * directly connected, pe101

Routing entry for 10.10.0.1/32
  Known via "connected", distance 0, metric 0
  Last update 00:01:06 ago
  * directly connected, pe202

Routing entry for 10.10.0.1/32
  Known via "connected", distance 0, metric 0, best
  Last update 00:01:06 ago
  * directly connected, loopback0

pe201# sh ipv6 route fd53:ba85:6bf9:10::1
Routing entry for fd53:ba85:6bf9:10::1/128
  Known via "ospf6", distance 110, metric 10
  Last update 00:01:23 ago
    directly connected, loopback0, weight 1

Routing entry for fd53:ba85:6bf9:10::1/128
  Known via "connected", distance 0, metric 0, best
  Last update 00:01:23 ago
  * directly connected, loopback0
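
For context, a trimmed sketch of the per-PE FRR configuration (shown for pe101; the AS number, VRF name and route targets are made up for illustration, and ospfd, ospf6d, ldpd and bgpd have to be enabled in /etc/frr/daemons). On top of that, the Linux kernel needs the mpls_router/mpls_iptunnel modules plus the net.mpls.platform_labels and per-interface net.mpls.conf.<ifname>.input=1 sysctls, which FRR does not set by itself.

interface loopback0
 ip ospf area 0
 ipv6 ospf6 area 0
!
interface pe102
 ip ospf area 0
 ip ospf network point-to-point
 ipv6 ospf6 area 0
 ipv6 ospf6 network point-to-point
!
interface pe201
 ip ospf area 0
 ip ospf network point-to-point
 ipv6 ospf6 area 0
 ipv6 ospf6 network point-to-point
!
router ospf
 ospf router-id 10.10.0.2
!
router ospf6
 ospf6 router-id 10.10.0.2
!
mpls ldp
 router-id 10.10.0.2
 address-family ipv4
  discovery transport-address 10.10.0.2
  interface pe102
  !
  interface pe201
  !
 exit-address-family
!
router bgp 65000
 bgp router-id 10.10.0.2
 neighbor 10.10.0.1 remote-as 65000
 neighbor 10.10.0.1 update-source loopback0
 address-family ipv4 vpn
  neighbor 10.10.0.1 activate
 exit-address-family
 address-family ipv6 vpn
  neighbor 10.10.0.1 activate
 exit-address-family
!
router bgp 65000 vrf CUST1
 ! IPv6 unicast in the VRF is configured analogously
 address-family ipv4 unicast
  redistribute connected
  label vpn export auto
  rd vpn export 65000:1
  rt vpn both 65000:1
  export vpn
  import vpn
 exit-address-family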

The fun

Some failover testing

Since this is all IGP+LDP, failover should be easy, right? Right?

Two test scenarios: power off a VM, and set one tunnel interface down. I started with the latter by disabling the tunnel between pe101 and pe201.

Sadly, in both cases some traffic was blackholed:

pe101# conf t
pe101(config)# int pe201
pe101(config-if)# shut
pe101(config-if)# end

OSPF does what it’s supposed to do:

pe101# sh ip ospf nei

Neighbor ID     Pri State           Up Time         Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.10.0.4         1 Full/-          4m45s             36.324s 10.10.0.4       pe102:10.10.0.2            0     0     0

pe101# sh ipv6 ospf nei
Neighbor ID     Pri    DeadTime    State/IfState         Duration I/F[State]
10.10.0.4         1    00:00:33     Full/PointToPoint    00:04:47 pe102[PointToPoint]
pe101# sh ip route 10.10.0.1
Routing entry for 10.10.0.1/32
  Known via "ospf", distance 110, metric 40, best
  Last update 00:01:17 ago
  * 10.10.0.4, via pe102 onlink, label 20, weight 1

pe101# sh ipv6 route fd53:ba85:6bf9:10::1
Routing entry for fd53:ba85:6bf9:10::1/128
  Known via "ospf6", distance 110, metric 40, best
  Last update 00:01:34 ago
  * fe80::3642:b8d8:8090:c5ad, via pe102, label 21, weight 1

Yet IPv4 ping does not work, while IPv6 does:

pe101# ping 10.10.0.1
PING 10.10.0.1 (10.10.0.1) 56(84) bytes of data.
^C
--- 10.10.0.1 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

pe101# ping fd53:ba85:6bf9:10::1
PING fd53:ba85:6bf9:10::1(fd53:ba85:6bf9:10::1) 56 data bytes
64 bytes from fd53:ba85:6bf9:10::1: icmp_seq=1 ttl=62 time=13.5 ms
^C
--- fd53:ba85:6bf9:10::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 13.527/13.527/13.527/0.000 ms

Let’s check the other end of the tunnel. OSPF is doing fine there, as expected:

pe201# sh ip ospf nei

Neighbor ID     Pri State           Up Time         Dead Time Address         Interface                        RXmtL RqstL DBsmL
10.10.0.3         1 Full/-          28.526s           34.313s 10.10.0.3       pe202:10.10.0.1            0     0     0

pe201# sh ipv6 ospf nei
Neighbor ID     Pri    DeadTime    State/IfState         Duration I/F[State]
10.10.0.3         1    00:00:31     Full/PointToPoint    00:00:31 pe202[PointToPoint]

Okay, let’s drop out of FRR/vtysh and debug further. The FIB of the “router” having issues looks good: traffic is encapsulated in MPLS and routed via the remaining, longer path.

root@pe101:/etc/frr$ ip route get 10.10.0.1
10.10.0.1  encap mpls  20 via 10.10.0.4 dev pe102 src 10.10.0.2 uid 0
    cache
root@pe101:/etc/frr$ ip route get 10.10.0.2
local 10.10.0.2 dev lo table local src 10.10.0.2 uid 0
    cache <local>
root@pe201:/etc/frr$ ip route get 10.10.0.1
local 10.10.0.1 dev lo table local src 10.10.0.1 uid 0
    cache <local>
root@pe201:/etc/frr$ ip route get 10.10.0.2
10.10.0.2  encap mpls  20 via 10.10.0.3 dev pe202 src 10.10.0.1 uid 0
    cache

However, interestingly, the tunnel interface on one side of the GRE tunnel remained up despite the other end being shut down.

root@pe101:/etc/frr$ ip link show | grep pe
30: pe201@eth0: <POINTOPOINT,NOARP> mtu 1374 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/gre6 <removed> peer <removed> permaddr a2fa:6f7d:eec0::
31: pe102@eth0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1374 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/gre6 <removed> peer <removed> permaddr 8af0:74cb:f4d0::
root@pe201:/etc/frr$ ip link show | grep pe
30: pe101@eth0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1374 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/gre6 <removed> peer <removed> permaddr a2fa:6f7d:eec0::
31: pe202@eth0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1374 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/gre6 <removed> peer <removed> permaddr 8af0:74cb:f4d0::

Turns out the GRE keepalive feature I was expecting by default (coming from the Cisco world) is not a defined standard and is not supported by Linux. Okay. Strange, but we run OSPF (+BFD) over the link, so this should not matter?!

Turns out it does matter when running unnumbered tunnel interfaces, because FRR/Zebra had issues installing the new route into the FIB:

pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=122, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=123, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=124, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=125, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=126, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=127, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=128, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=129, pid=3741216903
pe101 zebra[42975]: [HSYZM-HV7HF] Extended Error: Nexthop device is not up
pe101 zebra[42975]: [WVJCK-PPMGD][EC 4043309093] netlink-dp (NS 0) error: Network is down, type=RTM_NEWNEXTHOP(104), seq=130, pid=3741216903
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (559[if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (563[if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (575[fe80::d583:1450:8c1b:4b70 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (582[10.10.0.1 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (590[10.10.0.1 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (592[10.10.0.1 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (593[fe80::d583:1450:8c1b:4b70 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (595[fe80::d583:1450:8c1b:4b70 if 30 vrfid 0]) into the kernel
pe101 zebra[42975]: [X5XE1-RS0SW][EC 4043309074] Failed to install Nexthop (598[10.10.0.1 if 30 vrfid 0]) into the kernel

A quick test showed that migrating away from unnumbered tunnels (loopback address re-used as the IPv4 address) to a /31 assignment per tunnel fixed the issue immediately.

… essentially making my Ansible deployment harder. But that’s something for the next rainy weekend.
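
A rough sketch of that change, again via nmcli (the 10.10.1.0/31 transfer net is made up for illustration):

# pe101 side of the pe101<->pe102 tunnel
nmcli con mod gre_pe102 ipv4.addresses 10.10.1.0/31
nmcli con up gre_pe102

# pe102 side
nmcli con mod gre_pe101 ipv4.addresses 10.10.1.1/31
nmcli con up gre_pe101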

Some quick throughput testing

Remember, this is MPLS/LDPoGREoIPSecoIPv6 on cheap virtual machines running in a small provider’s shared cloud, with a latency of at least 15 ms between them. The VMs are supposed to have 10GigE uplink capacity.

My initial tests topped out well below 800 Mbps using iperf over TCP. The CPU wasn’t maxed out (encryption) at all, which was a bit strange.
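
The measurement itself was nothing fancy, roughly along these lines (iperf3 shown, targeting the far-end loopback):

# on the far end
iperf3 -s

# on the near end: single TCP stream, then a few parallel streams for comparison
iperf3 -c fd53:ba85:6bf9:10::1 -t 30
iperf3 -c fd53:ba85:6bf9:10::1 -t 30 -P 4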

After some chatting with one of the providers: they run their links quite full and rely on load-balancing across multiple links, counting on customers not having (baby) elephant flows.

Sooo… first learning: no GRE. GRE does not have any fields for the provider to load-balance on. No ports. Only source/destination IP addresses, and therefore exactly one flow from the provider’s point of view. Crap. But what if we replace GRE with VXLAN, essentially running VXLAN point-to-point? VXLAN is UDP and has port numbers. Maybe that will work? Okay, let’s try MPLS/LDPoVXLANoIPSecoIPv6. :-)
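
A sketch of what such a point-to-point VXLAN “link” could look like with iproute2 (VNI, transfer net and outer addresses are made up). The appealing property is that the Linux vxlan driver derives the outer UDP source port from a hash of the inner packet, so the provider would finally have something to load-balance on:

# on pe101, towards pe201 (MTU needs to account for the extra outer IPv6 + UDP + VXLAN overhead)
ip link add vxlan_pe201 type vxlan id 101201 \
    local <outer-ipv6-pe101> remote <outer-ipv6-pe201> \
    dstport 4789 ttl 64
ip addr add 10.10.1.2/31 dev vxlan_pe201
ip link set vxlan_pe201 up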

Still no better. Maybe we should have tcpdumped this first?

Frame 35: 170 bytes on wire (1360 bits), 170 bytes captured (1360 bits)
Ethernet II, Src: 9e:94:7a:75:4a:23 (9e:94:7a:75:4a:23), Dst: AristaNetwor_a5:0d:ec (44:4c:a8:a5:0d:ec)
Internet Protocol Version 6, Src: <removed>, Dst: <removed>
Encapsulating Security Payload
    ESP SPI: 0xcca64263 (3433448035)
    ESP Sequence: 213

Mhm. IPSec was running in tunnel mode, which encapsulates the entire VXLAN header. Let’s try transport mode:

Frame 7: 190 bytes on wire (1520 bits), 190 bytes captured (1520 bits)
Ethernet II, Src: 9e:94:7a:75:4a:23 (9e:94:7a:75:4a:23), Dst: AristaNetwor_a5:0d:ec (44:4c:a8:a5:0d:ec)
Internet Protocol Version 6, Src: <removed>, Dst: <removed>
    0110 .... = Version: 6
    .... 0000 0000 .... .... .... .... .... = Traffic Class: 0x00 (DSCP: CS0, ECN: Not-ECT)
    .... 0000 0000 0000 0000 0000 = Flow Label: 0x00000
    Payload Length: 136
    Next Header: Encap Security Payload (50)
    Hop Limit: 64
    Source Address: <removed>
    Destination Address: <removed>
Encapsulating Security Payload
    ESP SPI: 0xcca64263 (3433448035)
    ESP Sequence: 684

No better. Still nothing to load-balance on. Maybe try UDP-encapsulated ESP in transport mode?

Frame 20: 198 bytes on wire (1584 bits), 198 bytes captured (1584 bits)
Ethernet II, Src: 9e:94:7a:75:4a:23 (9e:94:7a:75:4a:23), Dst: AristaNetwor_a5:0d:ec (44:4c:a8:a5:0d:ec)
Internet Protocol Version 6, Src: <removed>, Dst: <removed>
User Datagram Protocol, Src Port: 4500, Dst Port: 4500
    Source Port: 4500
    Destination Port: 4500
    Length: 144
    Checksum: 0xb009 [correct]
    [Checksum Status: Good]
    [Stream index: 0]
    [Timestamps]
    UDP payload (136 bytes)
UDP Encapsulation of IPsec Packets
Encapsulating Security Payload
    ESP SPI: 0xc66a41dd (3328852445)
    ESP Sequence: 130

Doesn’t help either. Well… let’s at least stick with non-UDP-encapsulated transport mode, since it has less MTU overhead.
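
For reference, the two knobs toggled during these tests live in swanctl.conf, roughly like this (same connection layout as in the sketch further up; everything not shown stays unchanged):

connections {
    pe202 {
        # forces UDP encapsulation of ESP on port 4500 (the last test above)
        encap = yes
        children {
            pe202 {
                # transport mode instead of the default tunnel mode
                mode = transport
            }
        }
    }
}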

Conclusion

If we can ignore the remaining MTU issues (and the lack of L2VPN support over MPLS): is it worth deploying this? Maybe… with better VMs and better connectivity.

Also, it seems there is some work going on regarding this very issue: UDP-encapsulated ESP for ECMP by D. Acharya and H. Holbrook from Arista Networks.