VPP

About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.

I’ve deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet, built on hardware (silicon) forwarding at line rate and with high availability. You can read all about my Centec MPLS shenanigans in [this article].

Ever since the release of the Linux Control Plane [ref] plugin in VPP, folks have asked “What about MPLS?” – I have never really felt the need to go down this rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling are just as performant, and a little bit less of an ‘art’ to get right. For example, the Centec switches I deployed perform VxLAN, GENEVE and GRE all at line rate in silicon. And in an earlier article, I showed that the performance of VPP in these tunneling protocols is actually pretty good. Take a look at my [VPP L2 article] for context.

You might ask yourself: Then why bother? To which I would respond: if you have to ask that question, clearly you don’t know me :) This article will form a deep dive into MPLS as implemented by VPP. In a later set of articles, I’ll partner with the incomparable @vifino who is adding MPLS support to the Linux Controlplane plugin. After that, I do expect VPP to be able to act as a fully fledged provider- and provider-edge MPLS router.

Lab Setup

A while ago I created a [VPP Lab] which is pretty slick – I use it all the time. Most of the time I find myself messing around on the hypervisor and adding namespaces with interfaces in them, to pair up with the VPP interfaces. And I tcpdump a lot! It’s time for me to make an upgrade to the Lab – take a look at this picture:

Lab Setup

There’s quite a bit to unpack here, but it will be useful to know this layout as I’ll be referring to the components here throughout the rest of the article. Each lab now has seven virtual machines:

  1. vppX-Y are Debian Testing machines running a reasonably fresh VPP - they are daisy-chained, with the first one attaching to the headend called lab.ipng.ch using its Gi10/0/0 interface, and onwards to its eastbound neighbor vpp0-1 using its Gi10/0/1 interface.
  2. hostX-Y are two Debian machines whose four network cards (enp16s0f0-3) each connect to one VPP instance: to its Gi10/0/2 interface (for host0-0) or its Gi10/0/3 interface (for host0-1). This way, I can test all sorts of topologies with one router, two routers, or multiple routers.
  3. tapX-0 is a special virtual machine which receives a copy of every packet on the underlying Open vSwitch network fabric.

NOTE: X is the 0-based lab number, and Y stands for the 0-based logical machine number, so vpp1-3 is the fourth VPP virtual machine on the second lab.

Detour 1: Open vSwitch

To explain this tap a little bit - let me first talk about the underlay. All seven of these machines (each with their four network cards) are bound by the hypervisor into an Open vSwitch bridge called vpplan. Then, I use two features to build this topology:

Firstly, each pair of interfaces will be added as an access port into individual VLANs. For example, vpp0-0.Gi10/0/1 connects with vpp0-1.Gi10/0/0 in VLAN 20 (annotated in orange), and vpp0-0.Gi10/0/2 connects to host0-0.enp16s0f0 in VLAN 30 (annotated in purple). You can see that the East-West traffic over the VPP backbone is in the 20s, the host0-0 traffic northbound is in the 30s, and the host0-1 traffic southbound is in the 40s. Finally, the whole Open vSwitch fabric is connected to lab.ipng.ch using VLAN 10 and a physical network card on the hypervisor (annotated in green). The lab.ipng.ch machine then has internet connectivity.

BR=vpplan
for p in $(ovs-vsctl list-ifaces $BR); do
  ovs-vsctl set port $p vlan_mode=access
done

# Uplink (Green)
ovs-vsctl set port uplink tag=10    ## eno1.200 on the Hypervisor
ovs-vsctl set port vpp0-0-0 tag=10

# Backbone (Orange)
ovs-vsctl set port vpp0-0-1 tag=20
ovs-vsctl set port vpp0-1-0 tag=20
...

# Northbound (Purple)
ovs-vsctl set port vpp0-0-2 tag=30
ovs-vsctl set port host0-0-0 tag=30
...

# Southbound (Red)
...
ovs-vsctl set port vpp0-3-3 tag=43
ovs-vsctl set port host0-1-3 tag=43

NOTE: The KVM interface names are of the form vppX-Y-Z, where X is the lab number (0 in this case – IPng does have multiple labs, so I can run experiments and lab environments independently and isolated), Y is the machine number, and Z is the interface number on the machine (from [0..3]).

Detour 2: Mirroring Traffic

Secondly, now that I have created a 29-port switch with 12 VLANs, I decide to create an OVS mirror port, which can be used to make a copy of traffic going in or out of (a list of) ports. This is a super powerful feature, and it looks like this:

BR=vpplan
MIRROR=mirror-rx
ovs-vsctl set port tap0-0-0 vlan_mode=access

ovs-vsctl list mirror $MIRROR >/dev/null 2>&1 && \
  ovs-vsctl --id=@m get mirror $MIRROR -- remove bridge $BR mirrors @m

ovs-vsctl --id=@m create mirror name=$MIRROR \
  -- --id=@p get port tap0-0-0 \
  -- add bridge $BR mirrors @m \
  -- set mirror $MIRROR output-port=@p \
  -- set mirror $MIRROR select_dst_port=[] \
  -- set mirror $MIRROR select_src_port=[]

for iface in $(ovs-vsctl list-ports $BR); do
  [[ $iface == tap* ]] && continue
  ovs-vsctl add mirror $MIRROR select_dst_port $(ovs-vsctl get port $iface _uuid)
done

The first call sets up the OVS switchport called tap0-0-0 (which is enp16s0f0 on the machine tap0-0) as an access port. To keep this script idempotent, the second command checks whether the mirror already exists and, if so, deletes it. Then, I (re)create a mirror port with a given name (mirror-rx), add it to the bridge, make the mirror’s output port become tap0-0-0, and finally clear the selected source and destination ports (these determine where the traffic is mirrored from). At this point, I have an empty mirror. To give it something useful to do, I loop over all of the ports in the vpplan bridge and add each of them to the mirror as a destination port (here I have to specify the uuid of the interface, not its name). I add all interfaces except those of the tap0-0 machine itself, to avoid loops.

In the end, I create two of these: one called mirror-rx which is forwarded to tap0-0-0 (enp16s0f0), and the other called mirror-tx which is forwarded to tap0-0-1 (enp16s0f1). I can use tcpdump on either of these ports to see all the traffic entering any port on any machine, or leaving any port on any machine, respectively.
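
For completeness, here’s a minimal sketch of how I build that second mirror. It follows the exact same pattern as the script above, except that it selects source ports (frames emitted by the machines) and outputs to tap0-0-1; the idempotent delete is omitted for brevity:

BR=vpplan
MIRROR=mirror-tx
ovs-vsctl set port tap0-0-1 vlan_mode=access

# Create the TX mirror, with tap0-0-1 as its output port.
ovs-vsctl --id=@m create mirror name=$MIRROR \
  -- --id=@p get port tap0-0-1 \
  -- add bridge $BR mirrors @m \
  -- set mirror $MIRROR output-port=@p

# Select traffic sourced from every port, except the tap machine itself.
for iface in $(ovs-vsctl list-ports $BR); do
  [[ $iface == tap* ]] && continue
  ovs-vsctl add mirror $MIRROR select_src_port $(ovs-vsctl get port $iface _uuid)
done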

Preparing the LAB

I wrote a little bit about the automation I use to maintain a few reproducible lab environments in a [previous article], so I’ll only show the commands themselves here, not the underlying systems. When the LAB boots up, it comes with a basic Linux CP configuration that uses OSPF and OSPFv3 running in Bird2, to connect the vpp0-0 through vpp0-3 machines together (each router’s Gi10/0/0 port connects to the next router’s Gi10/0/1 port). LAB0 is in use by @vifino at the moment, so I’ll take the next one, running on its own hypervisor, called LAB1.

Each machine has an IPv4 and IPv6 loopback, so the LAB will come up with basic connectivity:

pim@lab:~/src/lab$ LAB=1 ./create
pim@lab:~/src/lab$ LAB=1 ./command pristine
pim@lab:~/src/lab$ LAB=1 ./command start && sleep 150
pim@lab:~/src/lab$ traceroute6 vpp1-3.lab.ipng.ch
traceroute to vpp1-3.lab.ipng.ch (2001:678:d78:211::3), 30 hops max, 24 byte packets
 1  e0.vpp1-0.lab.ipng.ch (2001:678:d78:211::fffe)  2.0363 ms  2.0123 ms  2.0138 ms
 2  e0.vpp1-1.lab.ipng.ch (2001:678:d78:211::1:11)  3.0969 ms  3.1261 ms  3.3413 ms
 3  e0.vpp1-2.lab.ipng.ch (2001:678:d78:211::2:12)  6.4845 ms  6.3981 ms  6.5409 ms
 4  vpp1-3.lab.ipng.ch (2001:678:d78:211::3)  7.4610 ms  7.5698 ms  7.6413 ms

MPLS For Dummies

.. like me! MPLS stands for [Multi Protocol Label Switching]. Rather than looking at the IPv4 or IPv6 header in the packet and making the routing decision based on the destination address, MPLS takes the whole packet and encapsulates it into a new datagram that carries a 20-bit number (called the label), three bits to classify the traffic, one S-bit to signal that this is the last label in a stack of labels, and finally 8 bits of TTL.

In total, 32 bits are prepended to the whole IP packet, or Ethernet frame, or any other type of inner datagram. This is why it’s called Multi Protocol. The S-bit allows routers to know if the following data is the inner payload (S=1), or if the following 32 bits are another MPLS label (S=0). This way, routers can push more than one label onto a label stack.
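
To make that bit layout concrete, here’s a toy shell function (purely illustrative, not part of any tooling) that packs the four fields into the 32-bit MPLS shim header:

# Pack label (20 bits), traffic class (3 bits), S (1 bit) and TTL (8 bits)
# into one 32-bit MPLS shim header, with the label in the most significant bits.
mpls_shim() {
  local label=$1 exp=$2 s=$3 ttl=$4
  printf '0x%08x\n' $(( (label << 12) | (exp << 9) | (s << 8) | ttl ))
}

mpls_shim 100 0 1 64   # prints 0x00064140: label 100, S=1 (end of stack), TTL 64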

Forwarding decisions are made based on the contents of this MPLS label, without the need to examine the packet itself. Two significant benefits become obvious:

  1. The inner data payload (ie. an IPv6 packet or an Ethernet frame) doesn’t have to be rewritten: no new checksum created, no TTL decremented. Any datagram can be stuffed into an MPLS packet; the routing (or rather, packet switching) happens entirely using only the MPLS headers.

  2. Importantly, no source or destination IP addresses have to be looked up in a possibly very large (~1M entry) FIB to figure out the next hop. Rather than traversing a [Radix Trie] or other datastructure to find the next-hop, a flat [Hash Table] keyed by literal integer MPLS labels can be consulted (see the toy sketch below). This greatly simplifies the computational complexity in transit.
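
As a toy model (emphatically not VPP’s actual datastructure), a P-Router’s forwarding state boils down to a flat map, and forwarding is a single constant-time lookup:

# Toy model: an MPLS FIB is a flat label -> action map; no longest-prefix
# matching is involved, just one hash lookup on the 20-bit label value.
declare -A mpls_fib=(
  [100]="swap 100, via 192.168.11.8 Gi10/0/0"
  [103]="swap 103, via 192.168.11.9 Gi10/0/1"
)

label=100
echo "${mpls_fib[$label]:-drop (label not configured)}"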

P-Router: The simplest form of an MPLS router is a so-called Label-Switch-Router (LSR), which is synonymous with Provider-Router (P-Router). This is the router that sits in the core of the network, and its only purpose is to receive MPLS packets, look up what to do with them based on the label value, and then forward the packet onto the next router. Sometimes the router can (and will) rewrite the label, in an operation called a SWAP, but it can also leave the label as it was (in other words, the input label value can be the same as the outgoing label value). The logic kind of goes like MPLS In-Label => { MPLS Out-Label, Out-Interface, Out-NextHop }. It’s this behavior that explains the name Label Switching.

If you were to imagine plotting a path through the lab network from, say, vpp1-0 on the one side, through vpp1-1 and vpp1-2, and finally onwards to vpp1-3, each router would be receiving MPLS packets on one interface, and emitting them on their way to the next router on another interface. That path of switching operations on the labels of those MPLS packets thus forms a so-called Label-Switched-Path (LSP). These LSPs are fundamental building blocks of MPLS networks, as I’ll demonstrate later.

PE-Router: Some routers have a less boring job to do - those that sit at the edge of an MPLS network, accept customer traffic and do something useful with it. These are called Label-Edge-Router (LER) which is often colloquially called a Provider-Edge-Router (PE-Router). These routers receive normal packets (ethernet or IP or otherwise), and perform the encapsulation by adding MPLS labels to them upon receipt (ingress, called PUSH), or removing the encapsulation (called POP) and finding the inner payload, continuing to handle them as per normal. The logic for these can be much more complicated, but you can imagine it goes something like MPLS In-Label => { Operation } where the operation may be “take the resulting datagram, assume it is an IPv4 packet, so look it up in the IPv4 routing table” or “take the resulting datagram, assume it is an ethernet frame, and emit it on a specific interface”, and really any number of other “operations”.

The cool thing about MPLS is that the types of operations are vendor-extensible. If two routers A and B agree on what label 1234 means to them, they can simply make it the inner label in a stack, say {100,1234}, where the outer one (the 100 label that all the P-Routers see) carries the semantic meaning of “switch this packet onto the destination PE-router”. That PE-router can pop the outer label to reveal the 1234-label, which it looks up in its own table to decide what to do next with the MPLS payload, in any way it chooses - the P-Routers don’t have to understand the meaning of label 1234; they don’t have to use or inspect it at all!
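
In VPP, imposing such a stack on ingress would amount to listing two out-labels – something like the hypothetical sketch below, run from the PE-router’s Linux shell via vppctl, where 1234 is the agreed-upon service label and 100 the transport label (I’m assuming outermost-first ordering here; real service configurations are the topic of the follow-up articles):

# Hypothetical: impose a two-label stack {100,1234} on ingress traffic.
vppctl ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 \
    out-labels 100 1234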

Step 0: End Host setup

Lab Setup

For this lab, I’m going to boot up instance LAB1 with no changes (for posterity, using image vpp-proto-disk0@20230403-release). As an aside, IPng Networks has several of these lab environments, and while @vifino is doing some development testing on LAB0, I simply switch to LAB1 to let him work in peace.

With the MPLS concepts introduced, let me start by configuring host1-0 and host1-1, giving them an IPv4 loopback address, and a transit network to their routers vpp1-3 and vpp1-0 respectively:

root@host1-1:~# ip link set enp16s0f0 up mtu 1500
root@host1-1:~# ip addr add 192.0.2.2/31 dev enp16s0f0
root@host1-1:~# ip addr add 10.0.1.1/32 dev lo
root@host1-1:~# ip ro add 10.0.1.0/32 via 192.0.2.3

root@host1-0:~# ip link set enp16s0f3 up mtu 1500
root@host1-0:~# ip addr add 192.0.2.0/31 dev enp16s0f3
root@host1-0:~# ip addr add 10.0.1.0/32 dev lo
root@host1-0:~# ip ro add 10.0.1.1/32 via 192.0.2.1
root@host1-0:~# ping -I 10.0.1.0 10.0.1.1

At this point, I don’t expect to see much, as I haven’t configured VPP yet. But host1-0 will start ARPing for 192.0.2.1 on enp16s0f3, which is connected to vpp1-3.e2. Let me take a look at the Open vSwitch mirror to confirm that:

root@tap1-0:~# tcpdump -vni enp16s0f0 vlan 33 
12:41:27.174052 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:28.333901 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:29.517415 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:41:30.645418 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28

Alright! I’m going to leave the ping running in the background, and I’ll trace packets through the network using the Open vSwitch mirror, as well as take a look at what VPP is doing with the packets using its packet tracer.

Step 1: PE Ingress

vpp1-3# set interface state GigabitEthernet10/0/2 up
vpp1-3# set interface ip address GigabitEthernet10/0/2 192.0.2.1/31
vpp1-3# mpls table add 0
vpp1-3# set interface mpls GigabitEthernet10/0/1 enable
vpp1-3# set interface mpls GigabitEthernet10/0/0 enable
vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100
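
Before looking at the wire, I can sanity-check the FIB from the router’s Linux shell via vppctl (just the commands here as a sketch; output omitted):

# Show the IPv4 FIB entry that carries the out-label, and the new MPLS FIB.
vppctl show ip fib 10.0.1.1/32
vppctl show mpls fib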

Now the ARP resolution succeeds, and I can see that host1-0 starts sending ICMP packets towards the loopback that I have configured on host1-1, of course using the newly learned L2 adjacency for 192.0.2.1 at 52:54:00:13:10:02 (which is vpp1-3.e2). But take a look at what the VPP router does next: due to the ip route add ... command, I’ve told it to reach 10.0.1.1 via a nexthop of vpp1-2.e1, and on the way out it will PUSH a single MPLS label 100,S=1 and forward the packet on its Gi10/0/0 interface:

root@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls
12:45:56.551896 52:54:00:20:10:03 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33
  p 0, ethertype ARP (0x0806), Request who-has 192.0.2.1 tell 192.0.2.0, length 28
12:45:56.553311 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 46: vlan 33
  p 0, ethertype ARP (0x0806), Reply 192.0.2.1 is-at 52:54:00:13:10:02, length 28

12:45:56.620924 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64
12:45:56.621473 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64

My MPLS journey on VPP has officially begun! The first exchange in the tcpdump (packets 1 and 2) is the ARP resolution of 192.0.2.1 by host1-0, after which it knows where to send the ICMP echo (packet 3, on VLAN33), which is then sent out by vpp1-3 as MPLS to vpp1-2 (packet 4, on VLAN22).

Let me show you what such a packet looks like from the point of view of VPP. It has a packet tracing function, which shows how any individual packet traverses the graph of nodes through the router. It’s a lot of information, but for a VPP operator, let alone a developer, it’s a really important skill to learn – so off I go, capturing and tracing a handful of packets:

vpp1-3# trace add dpdk-input 10
vpp1-3# show trace 
------------------- Start of thread 0 vpp_main -------------------
Packet 1

20:15:00:496109: dpdk-input
  GigabitEthernet10/0/2 rx queue 0
  buffer 0x4c44df: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
  PKT MBUF: port 2, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x2ed13840
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496167: ethernet-input
  frame: flags 0x1, hw-if-index 3, sw-if-index 3
  IP4: 52:54:00:20:10:03 -> 52:54:00:13:10:02

20:15:00:496201: ip4-input
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496225: ip4-lookup
  fib 0 dpo-idx 1 flow hash: 0x00000000
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN
    fragment id 0x2706, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x3bd6 id 8399

20:15:00:496256: ip4-mpls-label-imposition-pipe
    mpls-header:[100:64:0:eos]

20:15:00:496258: mpls-output
  adj-idx 25 : mpls via 192.168.11.10 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001210015254001310008847 flow hash: 0x00000000

20:15:00:496260: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x0018000d
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64

20:15:00:496262: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4c44df: current data -4, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
                   l2-hdr-offset 0 l3-hdr-offset 14 
  PKT MBUF: port 2, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 124, phys_addr 0x2ed13840
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64

This packet has gone through a total of eight nodes, and the local timestamps are the uptime of VPP when the packets were received. I’ll try to explain them in turn:

  1. dpdk-input: The packet is initially received from Gi10/0/2 receive queue 0. It was an ethernet packet from 52:54:00:20:10:03 (host1-0.enp16s0f3) to 52:54:00:13:10:02 (vpp1-3.e2). Some more information is gleaned here, notably that it was an ethernet frame, an L3 IPv4 and L4 ICMP packet.
  2. ethernet-input: Since it was an ethernet frame, it was passed into this node. Here VPP concludes that this is an IPv4 packet, because the ethertype is 0x0800.
  3. ip4-input: We know it’s an IPv4 packet, and the layer4 information shows this is an ICMP echo packet from 10.0.1.0 to 10.0.1.1 (configured on host1-1.lo). VPP now needs to figure out where to route this packet.
  4. ip4-lookup: VPP takes a look at its FIB for 10.0.1.1 - note the information I specified above (the ip route add ... on vpp1-3) - the next-hop here is 192.168.11.10 on Gi10/0/0 but VPP also sees that I’m intending to add an MPLS out-label of 100.
  5. ip4-mpls-label-imposition-pipe: An MPLS packet header is prepended in front of the IPv4 packet, which will have only one label (100); seeing it’s the only label, VPP sets the S-bit (end-of-stack) to 1, and the MPLS TTL initializes at 64.
  6. mpls-output: Now that the IPv4 packet is wrapped into an MPLS packet, VPP uses the rest of the FIB entry (notably the next-hop 192.168.11.10 and the output interface Gi10/0/0) to find where this thing is supposed to go.
  7. Gi10/0/0-output: VPP now prepares the packet to be sent out on Gi10/0/0 as an MPLS ethernet type. It uses the L2FIB adjacency table to figure out that we’ll be sending it from our MAC address 52:54:00:13:10:00 (vpp1-3.e0) to the next hop at 52:54:00:12:10:01 (vpp1-2.e1).
  8. Gi10/0/0-tx: VPP hands the fully formed packet with all necessary information back to DPDK to marshall it on the wire.

Can you believe this router can do such a thing at a rate of 18-20 million packets per second, scaling linearly with each added CPU thread? I learn something new every time I look at a packet trace – I simply love this dataplane implementation!

Step 2: P-routers

In Step 1 I’ve shown that vpp1-3 did send the MPLS packet to vpp1-2, but I haven’t configured anything there yet, and because I didn’t enable MPLS, each of these beautiful packets is brutally sent to the bit-bucket (also called dpo-drop):

vpp1-2# show err
   Count                  Node                              Reason               Severity 
       132             mpls-input              MPLS input packets decapsulated     info   
       132             mpls-input                      MPLS not enabled            error  

The purpose of a P-router is to switch labels along the Label-Switched-Path. So let’s manually create the following to tell this vpp1-2 router what to do when it receives an MPLS frame with label 100:

vpp1-2# mpls table add 0
vpp1-2# set interface mpls GigabitEthernet10/0/0 enable
vpp1-2# set interface mpls GigabitEthernet10/0/1 enable
vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100

Remember, above I explained that the P-Router has a simple job? It really does! All I’m doing here is telling VPP that if it receives an MPLS packet with label 100 on any MPLS-enabled interface (notably Gi10/0/1, from which it is currently receiving MPLS packets from vpp1-3), it should SWAP in out-label 100 (the same value, so effectively unchanged) and send the packet out on Gi10/0/0 to neighbor 192.168.11.8.
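
If I had wanted the label to actually change on this hop, the FIB entry would look nearly identical. A hypothetical sketch (which I’m not applying in this lab, as I keep label 100 end to end):

# Hypothetical SWAP to a different value: in-label 100 becomes out-label 200.
# The downstream vpp1-1 would then need an entry for label 200 instead.
vppctl mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 \
    out-labels 200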

If I’ve done a good job, I should be able to see this packet traversing the P-Router in a packet trace:

20:42:51:151144: dpdk-input
  GigabitEthernet10/0/1 rx queue 0
  buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01
  label 100 exp 0, s 1, ttl 64

20:42:51:151161: ethernet-input
  frame: flags 0x1, hw-if-index 2, sw-if-index 2
  MPLS: 52:54:00:13:10:00 -> 52:54:00:12:10:01

20:42:51:151171: mpls-input
  MPLS: next mpls-lookup[1]  label 100 ttl 64 exp 0

20:42:51:151174: mpls-lookup
  MPLS: next [6], lookup fib index 0, LB index 74 hash 0 label 100 eos 1

20:42:51:151177: mpls-label-imposition-pipe
    mpls-header:[100:63:0:eos]

20:42:51:151179: mpls-output
  adj-idx 28 : mpls via 192.168.11.8 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001110015254001210008847 flow hash: 0x00000000

20:42:51:151181: GigabitEthernet10/0/0-output
  GigabitEthernet10/0/0 flags 0x0018000d
  MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
  label 100 exp 0, s 1, ttl 63

20:42:51:151184: GigabitEthernet10/0/0-tx
  GigabitEthernet10/0/0 tx queue 0
  buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
                   l2-hdr-offset 0 l3-hdr-offset 14 
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:12:10:00 -> 52:54:00:11:10:01
  label 100 exp 0, s 1, ttl 63

In order, the following nodes are traversed:

  1. dpdk-input: received the frame from the network interface Gi10/0/1
  2. ethernet-input: the frame was an ethernet frame, and VPP determines based on the ethertype (0x8847) that it is an MPLS frame
  3. mpls-input: inspects the MPLS labelstack and sees the outermost label (the only one on this frame) with a value of 100.
  4. mpls-lookup: looks up in the MPLS FIB what to do with packets which are End-Of-Stack or EOS (ie. with the S-bit set to 1) and are labeled 100. At this point VPP could make a different choice if there is 1 label (as in this case) versus a stack of multiple labels (Not-End-of-Stack or NEOS, ie. with the S-bit set to 0).
  5. mpls-label-imposition-pipe: reads from the FIB that the outer label needs to be SWAPped to a new out-label (also with value 100). Because it’s the same label, this is a no-op. However, since this router is forwarding the MPLS packet, it will decrement the MPLS TTL to 63.
  6. mpls-output: VPP then uses the rest of the FIB information to determine the L3 nexthop is 192.168.11.8 on Gi10/0/0.
  7. Gi10/0/0-output: uses the L2FIB adjacency table to determine that the L2 nexthop is MAC address 52:54:00:11:10:01 (vpp1-1.e1). If there is no L2 adjacency, this would be a good time for VPP to send an ARP request to resolve the IP-to-MAC and store it in the L2FIB.
  8. Gi10/0/0-tx: hands off the frame to DPDK for marshalling on the wire.

If you counted with me, you’ll see that this flow in the P-Router also has eight nodes. However, while the IPv4 FIB can and will be north of one million entries in a longest-prefix match radix trie (which is computationally expensive), the MPLS FIB contains far fewer entries, organized as a literal key lookup in a hash table. And compared to IPv4 routing, the transported packet itself is untouched: only the MPLS TTL is decremented, and since the MPLS header carries no checksum, there is nothing to recalculate either (IPv4 forwarding has to do both). MPLS switching is much cheaper than IPv4 routing!

Now that our packets are switched from vpp1-2 to vpp1-1 (which is also a P-Router), I’ll just rinse and repeat there, using the L3 adjacency pointing at vpp1-0.e1 (192.168.11.6 on Gi10/0/0):

vpp1-1# mpls table add 0
vpp1-1# set interface mpls GigabitEthernet10/0/0 enable
vpp1-1# set interface mpls GigabitEthernet10/0/1 enable
vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100

Did I do this correctly? One way to check is by taking a look at which packets are seen on the Open vSwitch mirror ports:

root@tap1-0:~# tcpdump -eni enp16s0f0 
13:42:47.724107 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.724769 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.725038 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

13:42:47.726155 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64

Nice!! I confirm that the ICMP packet first travels over VLAN 33 (from host1-0 to vpp1-3), and then the MPLS packets travel from vpp1-3, through vpp1-2, through vpp1-1 and towards vpp1-0 over VLAN 22, 21 and 20 respectively.

Step 3: PE Egress

Seeing as I haven’t done anything with vpp1-0 yet, the MPLS packets all get dropped there for now. But not for much longer, as I’m ready to tell vpp1-0 what to do with those packets:

vpp1-0# mpls table add 0
vpp1-0# set interface mpls GigabitEthernet10/0/0 enable
vpp1-0# set interface mpls GigabitEthernet10/0/1 enable
vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0
vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2

The difference between the P-Routers in Step 2 and this PE-Router is the operation provided in the MPLS FIB. When an MPLS packet with label value 100 is received, instead of forwarding it out on another interface (which is what a P-Router would do), I tell VPP here to unwrap the MPLS label, expect to find an IPv4 packet underneath, and route that by looking up an IPv4 next hop in the (IPv4) FIB table 0.
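
As an aside, the eos keyword in that command matters: it matches only packets where label 100 sits at the bottom of the stack. For deeper label stacks, a non-eos entry could pop the outer label and continue with another MPLS lookup – a hypothetical sketch, assuming the mpls-lookup-in-table path type works analogously to the ip4-lookup-in-table above:

# Hypothetical non-eos disposition: pop label 100 when it is NOT end-of-stack,
# then look up the next label in MPLS FIB table 0.
vppctl mpls local-label add 100 non-eos via mpls-lookup-in-table 0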

All that’s left for me to do is add a regular static route for 10.0.1.1/32 via 192.0.2.2 (which is the address on interface host1-1.enp16s0f0). If my thinking cap is still working, I should now see packets emitted from vpp1-0 on Gi10/0/3:

vpp1-0# trace add dpdk-input 10
vpp1-0# show trace              

21:34:39:370589: dpdk-input
  GigabitEthernet10/0/1 rx queue 0
  buffer 0x4c4a34: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
  PKT MBUF: port 1, nb_segs 1, pkt_len 102
    buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x2ff28d80
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01
  label 100 exp 0, s 1, ttl 62

21:34:39:370672: ethernet-input
  frame: flags 0x1, hw-if-index 2, sw-if-index 2
  MPLS: 52:54:00:11:10:00 -> 52:54:00:10:10:01

21:34:39:370702: mpls-input
  MPLS: next mpls-lookup[1]  label 100 ttl 62 exp 0

21:34:39:370704: mpls-lookup
  MPLS: next [6], lookup fib index 0, LB index 83 hash 0 label 100 eos 1

21:34:39:370706: ip4-mpls-label-disposition-pipe
  rpf-id:-1 ip4, pipe

21:34:39:370708: lookup-ip4-dst
     fib-index:0 addr:10.0.1.1 load-balance:82

21:34:39:370710: ip4-rewrite
  tx_sw_if_index 4 dpo-idx 32 : ipv4 via 192.0.2.2 GigabitEthernet10/0/3: mtu:9000 next:9 flags:[] 5254002110005254001010030800 flow hash: 0x00000000
  00000000: 5254002110005254001010030800450000543dec40003e01e8bc0a0001000a00
  00000020: 01010800173d231c01a0fce65864000000009ce80b00000000001011

21:34:39:370735: GigabitEthernet10/0/3-output
  GigabitEthernet10/0/3 flags 0x0418000d
  IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
    fragment id 0x3dec, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x173d id 8988

21:34:39:370739: GigabitEthernet10/0/3-tx
  GigabitEthernet10/0/3 tx queue 0
  buffer 0x4c4a34: current data 4, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
                   ext-hdr-valid 
                   l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1
  PKT MBUF: port 1, nb_segs 1, pkt_len 98
    buf_len 2176, data_len 98, ol_flags 0x0, data_off 132, phys_addr 0x2ff28d80
    packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
    rss 0x0 fdir.hi 0x0 fdir.lo 0x0
  IP4: 52:54:00:10:10:03 -> 52:54:00:21:10:00
  ICMP: 10.0.1.0 -> 10.0.1.1
    tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN
    fragment id 0x3dec, flags DONT_FRAGMENT
  ICMP echo_request checksum 0x173d id 8988

Alright, another one of those huge blobs of information about a single packet traversing the VPP dataplane, but it’s the last one for this article, I promise! In order:

  1. dpdk-input: DPDK reads the frame arriving from vpp1-1 on Gi10/0/1 and determines that this is an ethernet frame
  2. ethernet-input: Based on the ethertype 0x8847, it knows that this ethernet frame is an MPLS packet
  3. mpls-input: The MPLS labelstack has one label, value 100, with (obviously) the EndOfStack S-bit set to 1; I can also see the (MPLS) TTL is 62 here, because it has traversed three routers (vpp1-3 TTL=64, vpp1-2 TTL=63, and vpp1-1 TTL=62)
  4. mpls-lookup: The lookup of local label 100 informs VPP that it should switch to IPv4 processing and handle the packet as such
  5. ip4-mpls-label-disposition-pipe: The MPLS label is removed, revealing an IPv4 packet as the inner payload of the MPLS datagram
  6. lookup-ip4-dst: VPP can now do a regular IPv4 forwarding table lookup for 10.0.1.1, which informs it that it should forward the packet via 192.0.2.2, directly connected to Gi10/0/3.
  7. ip4-rewrite: The IPv4 TTL is decremented and the IP header checksum recomputed
  8. Gi10/0/3-output: VPP now can look up the L2FIB adjacency belonging to 192.0.2.2 on Gi10/0/3, which informs it that 52:54:00:21:10:00 is the ethernet nexthop
  9. Gi10/0/3-tx: The packet is now handed off to DPDK to marshall on the wire, destined to host1-1.enp16s0f0

That means I should be able to see it on host1-1, right? If you, too, are dying to know, check this out:

root@host1-1:~# tcpdump -ni enp16s0f0 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:25:53.776486 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1249, length 64
14:25:53.776522 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1249, length 64
14:25:54.799829 IP 10.0.1.0 > 10.0.1.1: ICMP echo request, id 8988, seq 1250, length 64
14:25:54.799866 IP 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 8988, seq 1250, length 64

“Jiggle jiggle, wiggle wiggle!”, as I do a premature congratulatory dance on the chair in my lab! I created a label-switched-path using VPP as MPLS provider-edge and provider routers, to move this ICMP echo packet all the way from host1-0 to host1-1, but there’s absolutely nothing yet that allows the resulting ICMP echo-reply to go back from host1-1 to host1-0, because LSPs are unidirectional. The final step for me to do is create an LSP back in the other direction:

vpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103
vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103
vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103
vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0
vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0

And with that, the ping I started at the beginning of this article shoots to life:

root@host1-0:~# ping -I 10.0.1.0 10.0.1.1
PING 10.0.1.1 (10.0.1.1) from 10.0.1.0 : 56(84) bytes of data.
64 bytes from 10.0.1.1: icmp_seq=7644 ttl=62 time=6.28 ms
64 bytes from 10.0.1.1: icmp_seq=7645 ttl=62 time=7.45 ms
64 bytes from 10.0.1.1: icmp_seq=7646 ttl=62 time=7.01 ms
64 bytes from 10.0.1.1: icmp_seq=7647 ttl=62 time=5.76 ms
64 bytes from 10.0.1.1: icmp_seq=7648 ttl=62 time=5.88 ms
64 bytes from 10.0.1.1: icmp_seq=7649 ttl=62 time=9.23 ms

I will leave you with this packet dump from the Open vSwitch mirror, showing the entire flow of one ICMP packet through the network:

root@tap1-0:~# tcpdump -c 10 -eni enp16s0f0 
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:41:07.526861 52:54:00:20:10:03 > 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.528103 52:54:00:13:10:00 > 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.529342 52:54:00:12:10:00 > 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.530421 52:54:00:11:10:00 > 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62)
    10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64
14:41:07.531160 52:54:00:10:10:03 > 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40
  p 0, ethertype IPv4 (0x0800), 10.0.1.0 > 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64

14:41:07.531455 52:54:00:21:10:00 > 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40
  p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532245 52:54:00:10:10:01 > 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.532732 52:54:00:11:10:01 > 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.533923 52:54:00:12:10:01 > 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22
  p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62)
    10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
14:41:07.535040 52:54:00:13:10:02 > 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33
  p 0, ethertype IPv4 (0x0800), 10.0.1.1 > 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64
10 packets captured
10 packets received by filter

You can see all of the attributes I demonstrated in this article in one go: ingress ICMP packet on VLAN 33, encapsulation with label 100, S=1 and ttl decrementing as the MPLS packet traverses eastwards through the string of VPP routers on VLANs 22, 21 and 20, ultimately being sent out on VLAN 40. There, the ICMP echo request packet is responded to, and we can trace the ICMP response as it makes its way back westwards through the MPLS network using label 103, ultimately re-appearing on VLAN 33.

There you have it. This is a fun story on Multi Protocol Label Switching (MPLS), bringing a packet from a Label-Edge-Router (LER) through several Label-Switch-Routers (LSRs) over a statically configured Label-Switched-Path (LSP). I feel like I can now more confidently use these terms without sounding silly.

What’s next

The first mission is accomplished. I’ve taken a good look at IPv4 forwarding in the VPP dataplane as MPLS packets, thereby en- and decapsulating the traffic using PE-Routers and forwarding the traffic using intermediary P-Routers. MPLS switching is cheaper than IPv4/IPv6 routing, but it also opens up a bunch of possibilities for advanced service offerings, such as my coveted Martini Tunnels which transport ethernet frames point-to-point over an MPLS backbone. That will be the topic of an upcoming article, in which I’ll join forces with @vifino, who is adding Linux Controlplane functionality to program the MPLS FIB using Netlink – such that things like ‘ip’ and ‘FRR’ can discover and share label information using a Label Distribution Protocol (LDP).