Facility Overview
A Merge-based testbed is a testbed facility built using the Merge technology stack and automation systems. A Merge testbed is made up of a distributed set of modular components that communicate over well-defined interfaces and come together in a layered architecture.
The command layer is responsible for authenticating and authorizing requests from Merge portals and delegating them to the appropriate drivers in the testbed automation system.
The automation layer is responsible for building and executing task graphs from materialization requests. The automation layer knows very little about how to actually accomplish materialization tasks; it relies on individual components from the Merge technology stack to accomplish this.
The testbed infrastructure layer is comprised of a set of narrowly focused subsystems that are responsible for carrying out materialization tasks. This includes things like node imaging systems and DHCP/DNS servers.
The fabric layer is comprised of the switches, routers and network appliances that collectively interconnect the testbed.
The resource layer is comprised of the user-allocatable resources that underpin materializations. This includes physical devices and virtual devices running inside hypervisors.
Hardware
The diagram below shows a typical Merge testbed facility.
The commander server is connected to both the upstream network and the testbed network. In this setup there is a pair of "Cogs" nodes that host the testbed automation layer and parts of the testbed infrastructure layer. We refer to these machines as Cogs after the name of the Merge automation system. The emulator servers provide network emulation services. The storage servers host the testbed databases and mass storage systems.
Merge testbeds are typically composed of three physical networks.
- A management network for direct hardware access that interconnects all the infrastructure level servers described above, as well as the management ports of switches.
- An experiment infrastructure network that interconnects all testbed nodes, with access ports on cogs and storage servers. This network is commonly referred to as the infranet.
- A data network that interconnects all testbed nodes, with access ports on emulation servers.
How the software elements that comprise a Merge testbed are deployed across a hardware substrate is at the discretion of the testbed operator. In this document, we'll organize things around the logical components and give guidance on appropriate deployment strategies as we go. For those seeking a quick point of reference, a typical deployment of the Merge software components onto the hardware substrate depicted above is as follows.
| server | components |
|---|---|
| commander | commander |
| cogs | driver, rex, infrapods, wgd, beluga, gobble |
| emulator | gobble, moa |
| storage | gobble, rally, sled |
| switch | canopy |
Commander
The Merge commander is responsible for authenticating requests from a Merge portal and delegating them to the appropriate drivers in the testbed facility. When the commander starts, it exposes two endpoints, one to the portal and one to drivers within the facility. When drivers come online, they contact the commander and register all the resources for which they want commands delegated to them. The commander builds a lookup table where resource IDs are the keys and the values are lists of drivers.
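To make the delegation model concrete, here is a minimal Go sketch of the kind of lookup the commander performs. The type and function names, and the example endpoint and node IDs, are illustrative assumptions, not the actual MergeTB implementation.

```go
package main

import "fmt"

// Registry maps resource IDs to the drivers that have registered to
// handle commands for those resources.
type Registry struct {
	drivers map[string][]string // resource ID -> driver endpoints
}

func NewRegistry() *Registry {
	return &Registry{drivers: make(map[string][]string)}
}

// Register is called when a driver comes online and announces the
// resources it wants commands delegated for.
func (r *Registry) Register(driver string, resources []string) {
	for _, res := range resources {
		r.drivers[res] = append(r.drivers[res], driver)
	}
}

// Delegate looks up which drivers should receive a command for a resource.
func (r *Registry) Delegate(resource string) []string {
	return r.drivers[resource]
}

func main() {
	reg := NewRegistry()
	reg.Register("cogs0.example.net:6001", []string{"node37", "node38"})
	fmt.Println(reg.Delegate("node37")) // [cogs0.example.net:6001]
}
```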
Materialization requests that come in from the portal are of two types.
- Materialization notifications notify a facility that a materialization is about to be set up (`NotifyIncoming`), or that it is OK to discard a materialization (`NotifyTeardown`).
- Materialization requests are messages composed of materialization fragments. Each fragment has a resource ID to which it refers and an operation to perform over that resource. For example, a materialization fragment may contain the resource ID of a node and a request to recycle that node to a clean state. These requests typically come in batches of fragments for efficiency.
Installation
The Merge Commander is installed from the Merge package server, using the `mergetb-commander` package.
Once installed, the commander runs as a systemd service.
Configuration
There is no explicit configuration file for the commander, but it does have several flags that modify its behavior.
| flag | default | purpose |
|---|---|---|
| listen | 0.0.0.0 | address to listen on |
| port | 6000 | port to listen on |
| cert | /etc/merge/cmdr.pem | client access certificate Merge portals must use to access the materialization interface |
| key | /etc/merge/cmdr-key.pem | private key corresponding to the client certificate |
Driver
The Merge Driver is responsible for taking delegated materialization requests from a commander and transforming them into a task graph in preparation for execution. This task graph is a directed acyclic graph that captures the dependencies between materialization tasks so they can be maximally parallelized for execution.
Here is a sample execution graph in tabular form.
In addition to having a basic DAG structure, tasks are also organized into stages and actions. Stages are executed in serial, and actions are executed in parallel. So in a way stages act like execution barriers.
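The barrier behavior of stages can be summarized in a few lines of Go. This is purely illustrative of the stage/action semantics, not the actual Cogs task structures.

```go
package main

import (
	"fmt"
	"sync"
)

// Action is a single unit of work, e.g. "launch a container".
type Action func() error

// Stage groups actions that may run in parallel.
type Stage struct {
	Actions []Action
}

// Task is an ordered list of stages; each stage acts as a barrier.
type Task struct {
	Name   string
	Stages []Stage
}

// Run executes stages serially and the actions within each stage in parallel.
func (t *Task) Run() {
	for i, stage := range t.Stages {
		var wg sync.WaitGroup
		for _, a := range stage.Actions {
			wg.Add(1)
			go func(a Action) {
				defer wg.Done()
				if err := a(); err != nil {
					fmt.Printf("%s stage %d: %v\n", t.Name, i, err)
				}
			}(a)
		}
		wg.Wait() // barrier: the next stage starts only when all actions finish
	}
}

func main() {
	t := Task{Name: "infrapod-setup", Stages: []Stage{
		{Actions: []Action{func() error { fmt.Println("create enclave"); return nil }}},
		{Actions: []Action{
			func() error { fmt.Println("launch containers"); return nil },
			func() error { fmt.Println("create service vtep"); return nil },
		}},
	}}
	t.Run()
}
```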
To get a sense for how the above all comes together logically, consider the graphical representation of the table above in the diagram below. Here the blue rectangles are tasks, each column within a rectangle is a stage, and each rounded rectangle is an action.
Here we see that this materialization is broken up into three tasks for
- Setting up the infrapod and its network enclave.
- Setting up the experiment network.
- Setting up the nodes.
Within the infrapod setup there is a single action in the first stage that creates the network enclave for the infrapod. Once that action is complete, the next stage is executed. In this stage, four containers are launched and a service VXLAN tunnel endpoint (VTEP) is set up. All of these actions are done in parallel. Next, DHCP/DNS networks are created that provide names and addresses on the experiment's infranet. Members are then added to the DHCP/DNS networks. The network emulation (moa) control container is initialized and a global status for the materialization is updated, indicating that the infrapod has been set up.
Once the infrapod is set up, the two tasks that depend on it, network setup and node setup, are executed. The network setup happens across two actions in serial across two stages, and the node setup happens in parallel across all nodes in the materialization. In this case we have a tiny two node experiment.
Tasks are stored in the facility etcd database discussed later in this document.
Installation
The Merge Driver is installed from the Merge package server, using the `mergetb-driver` package.
Once the driver is installed, it runs as a systemd service.
Configuration
The driver configuration file is located at `/etc/cogs/driver.yml`. Here is an example configuration.
Rex
The Driver is only responsible for creating the task graph structure. It is not responsible for managing task execution. That job falls to a component called Rex. Rex continuously watches the etcd database for new tasks or changes in existing task state that may make other tasks eligible for execution.
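As a rough sketch of that watch loop, the Go etcd client can watch a key prefix and react to task writes. The endpoint below assumes the local cogs etcd proxy described later in this document, and the `/tasks/` prefix and handler are illustrative assumptions rather than the actual Rex key layout.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2399"}, // local cogs etcd proxy
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch every key under a hypothetical task prefix. Each write is a
	// task creation or a state change that may unblock dependent tasks.
	for resp := range cli.Watch(context.Background(), "/tasks/", clientv3.WithPrefix()) {
		for _, ev := range resp.Events {
			log.Printf("task event %s on %s", ev.Type, ev.Kv.Key)
			// A real executor would re-evaluate the task graph here and
			// launch any actions whose dependencies are now satisfied.
		}
	}
}
```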
Installation
The Merge Rex execution agent is installed from the Merge package server, using the `mergetb-rex` package.
Once rex is installed, it runs as a systemd service.
Configuration
The rex configuration file is located at `/etc/cogs/driver.yml`. Here is an example configuration.
Cog
The `cog` command line tool is the primary tool used to operate a testbed facility.
Monitoring Task Progress
Use this command to see what tasks are currently pending on the testbed
Viewing Materialization Tasks
Use this command to view the tasks of a specific materialization. Use the `--all` flag to show tasks that have been completed in addition to pending tasks.
note
The `mzid` of a materialization is denoted by `<realization>.<experiment>.<project>`
Clearing Task Errors
When a task encounters an error and needs to be retried, either because the operations team has fixed some piece of infrastructure or the error appears to be transient, use this command to clear the task error so Rex will try the task again. Rex ignores all tasks that have an error condition.
Manually Completing Tasks
Sometimes a task is in a hopeless state and the ops team just needs to consider it complete and move on so other dependent tasks and stages can be executed.
Masking Actions
Sometimes an action needs to be masked so that it won't be executed when Rex reaches that point in the execution graph. This is what masking is for. When a task is masked, Rex ignores it. Masking also allows execution to continue beyond that point in the graph, as masked tasks are discarded as a dependency consideration.
Manual Dematerialization
Sometimes an experiment must be manually dematerialized. The most common reason for this is a de-synchronization between a Merge portal and the testbed facility.
This will dematerialize the experiment and clear out all associated data records from the Cogs database.
Listing Materializations
To get a list of all active materializations
Showing Materialization Info
This is probably the second most useful command for managing testbed facilities.
Taking the highlights section by section, this command shows
- the overall status of the materialization as `active`
- the service address of the infrapod as `172.31.0.127`
- all the containers in the infrapod are up
- there are two nodes with the `debian:10` image that have active DHCP leases with 2 hours and 3 minutes remaining
- what switches each node interface is connected to on both the infranet and xp net, as well as VLAN/VXLAN information
- a mapping of experiment links from node to node, including all the intermediate switch hops and the VLAN/VXLAN information associated with them
- EVPN advertisement information for cross-fabric or emulated links
note
We say that a link is cross fabric when it transits across at least one routed VXLAN hop.
Getting Node State
The above image shows some of the internals for node status and sled state. Rex is the driver of state. During a materialization, a node will move from the `Clean` state to `Setup` to `Ready`. Rex manages the state machine transitions and prevents nodes from getting into bad states, always driving from the current state to the target state. The `Clean` and `Ready` states can be easily mapped to sled states when the node is in daemon mode (`Clean`) or is in a materialization (`Ready`). Sled states will be discussed in later sections.
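A minimal sketch of that reconciliation idea in Go follows. The state names come from the text above; the transition logic is an illustrative assumption, not Rex's code.

```go
package main

import "fmt"

type NodeState int

const (
	Clean NodeState = iota
	Setup
	Ready
)

func (s NodeState) String() string {
	return [...]string{"Clean", "Setup", "Ready"}[s]
}

// step returns the next state on the path from current toward target,
// mirroring the idea of always driving the current state to the target state.
func step(current, target NodeState) NodeState {
	switch {
	case current == target:
		return current
	case current < target:
		return current + 1 // Clean -> Setup -> Ready
	default:
		return current - 1 // tearing back down toward Clean
	}
}

func main() {
	cur, target := Clean, Ready
	for cur != target {
		next := step(cur, target)
		fmt.Printf("%s -> %s\n", cur, next)
		cur = next
	}
}
```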
You can use `cog list nodes` to detect issues with nodes where the current state does not equal the target state and the node is not materializing or dematerializing.
Infrapods
Infrapods are a collection of containers that provide per-experiment infrastructure. Each infrapod contains the following containers at a minimum
- Foundry: node configuration daemon
- Nex: DHCP/DNS server
- Etcd: Database for infrapod local storage
Optionally infrapods may contain
- Moactld: for dynamic network emulation modification.
- Simctld: for physics simulation control.
depending on the composition of the experiment.
Technically, an infrapod is a network namespace that contains the above collection of containers. This is similar to the Kubernetes Pod abstraction. Infrapods are created during the materialization initialization phase for every experiment. In addition to the set of containers within a network namespace, there is also a set of interfaces created within the network namespace for
- communication with testbed resources on the infranet
- communication with testbed automation systems on the management network
The diagram above shows the network plumbing for two simple infrapods. On the 'top half' of the network, the `ceth1` interface of the infrapod connects to a service bridge that testbed automation tools like Rex use to communicate with containers inside the pod. The address assigned to the `ceth1` interface is unique across all infrapods in a testbed facility. Each `ceth1` interface has a corresponding `svcX` interface in the host network namespace (commonly referred to as the init namespace). Here `X` is the integer ID of the materialization, assigned by the Cogs. The `ceth1` and `svcX` interfaces are virtual Ethernet pair devices, or simply veth devices. Veth devices are commonly used to plumb communications in and out of network namespaces. They are very simple: any Ethernet frame that ingresses on one peer is forwarded to the other.
On the 'bottom half' of the infrapods is the infranet. Here we have a `ceth0` and `ifrX` Ethernet pair. In this case the address assigned to `ceth0` is not unique across materializations. In fact, it is almost always `172.30.0.1`, as this is the default infranet gateway for experiments. This overlap is possible due to the network namespaces and the fact that each `ifrX` interface is enslaved to a dedicated bridge `mzbrX`.
On the other side of the `mzbrX` bridge is a VXLAN tunnel endpoint (VTEP). This interface provides connectivity to the infranet of the experiment, encapsulated inside a VXLAN tunnel. The routing and forwarding configuration of this device is managed by GoBGP and Gobble. More details on how the infranet is constructed and managed can be found in the infranet section.
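The host-side plumbing described above can be approximated with the Go netlink library. The following is a simplified sketch for a hypothetical materialization with integer ID 100; it omits the namespace moves and address assignment that the Cogs perform, and is not the actual Cogs code.

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Bridge that ties the infrapod's veth to the infranet VTEP.
	if err := netlink.LinkAdd(&netlink.Bridge{
		LinkAttrs: netlink.LinkAttrs{Name: "mzbr100"},
	}); err != nil {
		log.Fatal(err)
	}
	br, err := netlink.LinkByName("mzbr100")
	if err != nil {
		log.Fatal(err)
	}

	// veth pair: ifr100 stays in the host namespace, ceth0 would be moved
	// into the infrapod's network namespace by the Cogs.
	if err := netlink.LinkAdd(&netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: "ifr100"},
		PeerName:  "ceth0",
	}); err != nil {
		log.Fatal(err)
	}

	// VXLAN tunnel endpoint carrying the experiment infranet (VNI 100 is
	// illustrative). Its forwarding entries are managed by GoBGP/Gobble.
	if err := netlink.LinkAdd(&netlink.Vxlan{
		LinkAttrs: netlink.LinkAttrs{Name: "vtep100"},
		VxlanId:   100,
		Port:      4789,
		Learning:  false,
	}); err != nil {
		log.Fatal(err)
	}

	// Enslave the host-side veth and the VTEP to the materialization bridge.
	for _, name := range []string{"ifr100", "vtep100"} {
		link, err := netlink.LinkByName(name)
		if err != nil {
			log.Fatal(err)
		}
		if err := netlink.LinkSetMaster(link, br); err != nil {
			log.Fatal(err)
		}
		if err := netlink.LinkSetUp(link); err != nil {
			log.Fatal(err)
		}
	}
	if err := netlink.LinkSetUp(br); err != nil {
		log.Fatal(err)
	}
}
```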
Nex
The Nex infrapod container provides
DHCP/DNS services for the resources in an experiment. There are a few networks
defined for each experiment. There is a command line client called nex
that is
installed by default on the Cogs hosts. This client can be used to inspect the
state of DHCP/DNS networks operating inside infrapods, and even modify them in a
pinch.
In order to talk to a Nex instance running inside an infrapod, the service address of the pod must be used. This can be seen using the `cog show mz` command.
The address displayed above can be used to look at the Nex DHCP/DNS networks
These networks can be further inspected using the `nex` tool
and their members inspected
The above are the infranet address leases of the nodes in this 2 node experiment.
The above are static names and addresses associated with infrastructure services provided on the experiment infranet, hence the name static.
All of this information is populated by the Cogs when a materialization is first created. The materialization fragments that come from the Merge portal are turned by the driver into Cog tasks that contain the Nex configuration information extracted from those fragments. This data is loaded into the Nex database using the service address of the infrapod and the Nex gRPC management interface.
tip
If you are just interested in seeing the node lease information for a materialization, you do not need to go through this song and dance every time; `cog show mz` does this for you automatically and will show you the address and lease time of every node in the experiment.
Foundry
The Foundry infrapod container provides node configuration services for resources in an experiment. The system images that the testbed stamps onto resources at materialization time contain a daemon that runs at boot time. This is the Foundry client (foundryc). When the Foundry client starts, it reaches out to the Foundry server (foundryd) at the DNS name `foundry`, which is resolved by the Nex DNS server described in the previous section, requesting how it should configure the node.
Foundryd responds to foundryc requests with information including, at a minimum:
- How network interfaces should be set up.
- User account information including SSH credentials.
- How routing should be set up.
- What the hostname of the node should be.
Foundryd is made aware of this information by the Cogs. When a materialization is created, the materialization fragments that come from the Merge portal are turned by the driver into Cog tasks that contain the Foundry configuration information extracted from those fragments. This data is loaded into the Foundry database using the service address of the infrapod and the foundryd gRPC management interface.
Foundry information can be inspected using the Foundry command line tool that is installed as a part of the Cogs software.
This shows the full foundry configuration for both of the nodes in this two node experiment.
Sled API
Sled is one of the core components for experiment imaging. There are 3 major components to sled: client, daemon, and api. The sled client (sledc) is responsible for putting an image on a device, the sled daemon (sledd) is responsible for managing communications between the client and the api, and the sled api (sledapi) is responsible for managing etcd storage. The sled controller (sledctl) is an auxiliary tool for users to manage etcd.
Sledapi runs as a container in the infrapod for the harbor materialization. Its placement in the harbor materialization is necessary for it to communicate with the sled daemon and clients, which live in the harbor materialization until they can be moved into an experimenter's materialization.
At the core of the sledapi is the concept of a sled command (sledcmd). The sledcmd is the playbook for what actions a client should take when it runs sled. The 4 actions a client can take are:
- wipe: the wipe action indicates that the client should wipe (zero) the disk(s). This is an expensive operation, and should be done on teardown if used rather than on boot.
- write: the write action writes out a kernel, initramfs, and disk on the client. The kernel and initramfs are artifacts of the disk image, and are generated by the images repository. The kernel and initramfs are written to tmpfs, and cannot exceed the size of memory. The disk image is written to the target block device.
- kexec: the kexec action does a kernel execute from the currently running client (sledc on u-root) into the target kernel that was provided in the write action step.
- daemon: the daemon action tells the client to go into daemon mode and to wait until a future actionable command is given.
The sledcmd is stored in etcd keyed by MAC address, as shown below using the sledctl command. The `-S` option below is the server IP address of sledapi on the harbor materialization network.
The sledcmd has multiple fields:
- mac: mac address of the command.
- id: unique identifier of the command, so that we can query on the status. This id will change with every update to the command set, as each update creates a new command.
- time: the last time the command was updated.
- wipe: the wipe command contents.
- write: the write command contents.
- kexec: the kexec command contents. The kexec status will never show complete, since it is impossible to update the kexec field upon completion: the client will have already executed the new kernel and operating system.
The sledcmd information is passed to sledapi through the Cogs. On a dematerialization request, the Cogs insert a default image write action and a daemon action into the sledcmd. This tells sledc to remain in daemon mode until it receives a new command. It also reduces the boot time if the experimenter uses the default image. On materialization, the Cogs insert new fields, mainly a kexec action, and a write action if the default image is not used.
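A rough Go sketch of the shape of such a command record is shown below. The top-level field names follow the list above; the nested write type, the kernel/initrd names, and the target device are illustrative assumptions, not the actual sled schema. The image name reuses the ubuntu-1804-disk example that appears later in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Write describes the artifacts written to the node: a kernel and initramfs
// placed in tmpfs and a disk image written to the target block device.
type Write struct {
	Image  string `json:"image"`
	Kernel string `json:"kernel"`
	Initrd string `json:"initrd"`
	Device string `json:"device"`
}

// SledCmd is the playbook a client fetches, keyed by MAC address in etcd.
type SledCmd struct {
	Mac   string    `json:"mac"`
	ID    string    `json:"id"`   // changes with every update to the command set
	Time  time.Time `json:"time"` // last time the command was updated
	Wipe  bool      `json:"wipe,omitempty"`
	Write *Write    `json:"write,omitempty"`
	Kexec bool      `json:"kexec,omitempty"`
}

func main() {
	cmd := SledCmd{
		Mac:  "04:70:00:01:70:1a", // hypothetical node MAC
		ID:   "3f2c",
		Time: time.Now(),
		Write: &Write{
			Image:  "ubuntu-1804-disk",
			Kernel: "ubuntu-1804-kernel",
			Initrd: "ubuntu-1804-initrd",
			Device: "/dev/sda",
		},
		Kexec: true,
	}
	b, _ := json.MarshalIndent(cmd, "", "  ")
	fmt.Println(string(b))
}
```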
Moactld
The Moactld infrapod provides a means for the end-user moacmd
(which runs in
a XDC) to control the network emulation provided by
Moa. Configuration is
automatically done by the Cogs
via the API. It's primary function is to limit
access to the emulation that is part of the materialization to which a XDC is
attached.
Etcd
Etcd is the primary database used by a Merge testbed facility. There are several systems that use the same underlying etcd cluster. Namespaced etcd proxies are used to isolate these systems from each other.
The proxies are local to the services that access them. So, for example, in a deployment where Rex and the driver run on one server and the commander runs on another, both servers would run a copy of the cogs grpc proxy service. The etcd proxies run as systemd services. You can see them through the basic systemd service management commands.
These proxies are all configured to point at the etcd cluster running on the storage nodes inside the testbed facility. There are always an odd number of etcd servers in a cluster. Typically, etcd clusters are of size 1 or 3.
The proxy service configuration for the cogs proxy is located at `/etc/systemd/system/cogs-proxy.service`. Its definition is useful for pointing out a few things about the way in which etcd is deployed in a Merge testbed facility
- certificates are held in `/etc/etcd`
- Merge at times can use messages that exceed the default etcd threshold, so limits may need tweaking depending on the expected size of experiments.
- Proxy services listen locally on `2399` for the cogs. Consult the other services for their ports.
- Proxies act as a TLS boundary, e.g. the actual etcd servers require client authentication, but since the proxies only listen locally they do not. This simplifies service and operations staff interaction models with the etcd system.
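From a service's point of view, talking to its local proxy looks like a plain, unauthenticated etcd connection. Below is a minimal Go sketch, assuming the cogs proxy on `127.0.0.1:2399`; the key prefix and message-size values are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// The proxy is the TLS boundary: the client itself needs no certificates
	// because the proxy only listens on localhost.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2399"},
		DialTimeout: 5 * time.Second,
		// Merge messages can exceed the default gRPC limits, so a client may
		// need to raise them to match the proxy/server configuration.
		MaxCallSendMsgSize: 64 * 1024 * 1024,
		MaxCallRecvMsgSize: 64 * 1024 * 1024,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, "/", clientv3.WithPrefix(), clientv3.WithKeysOnly())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Println(string(kv.Key))
	}
}
```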
Sled
Sled is one of the core components for experiment imaging. There are 3 major components to sled: client, daemon, and api. The sled client (sledc) is responsible for putting an image on a device, the sled daemon (sledd) is responsible for managing communications between the client and the api, and the sled api (sledapi) is responsible for managing etcd storage. The sled controller (sledctl) is an auxiliary tool for users to manage etcd.
Where each sled component runs:
- sledc runs on the experiment device.
- sledd container runs on the storage server.
- sledd-nginx container runs on the storage server (hosts images).
- sledapi container runs in the harbor infrapod.
Discussion of sledapi in the harbor infrapod can be found earlier in the document under Sled API. This section will mainly cover the sled protocol and management from an operational perspective.
Sled begins with a node booting. The node is configured with PXE boot as the primary boot option and disk booting as the next option, followed by EFI or another recovery boot option in case of failures. The secondary boot option, using a disk, means that when a node is in a materialization it will not boot from PXE, but will instead use the disk image that was written during the initial sled process. This provides users with the expected outcome of rebooting with the same disk.
Returning to the primary boot source, PXE: when the node initially PXE boots, the DHCP service, which is handled by Nex, maintains an entry for the PXE server and file. The node chain boots into this image, which is actually an iPXE image (handled by the TFTP container in the harbor network) that contains pointers to the sled-server and to the sledc kernel and initramfs. `sled-server` here is the DNS name, also stored in Nex, that points to the sled image mount server.
When the node boots, it boots into a u-root client and runs an initial DHCP request to retrieve an IP address. From there, it runs the sled client (sledc), which requests the sled command (covered in the sled api section). The client uses every interface to request the sled command, and selects the first interface that can resolve the `sled-server` name. The request is then sent to `sled-server`, where the sled daemon (sledd) is running. Sledd then requests from sledapi the command associated with the MAC address. Sledd is also responsible for handling the image copying process from the `sled-server` host back to the client (over http).
From the Cogs perspective, an MzFragment will come down to the driver with NodeSetup requests. For each node in the NodeSetup, a sled entry is created based on the MAC address. The sled code managed by rex will check to see if the node is in daemon mode (ready for new commands); if so, it updates the sled command and sends it to the client. If not, it will create a full sled command that includes a write request as well, then restart the node. The rex sled code will then wait until the node is reachable before returning. If the node does not become reachable within the timeout period, the task is failed.
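The 'wait until reachable' step amounts to polling the node until a deadline passes. Here is a simplified Go sketch; the probe is a plain TCP dial, an assumption standing in for the actual sled reachability check, and the address and timeout values are illustrative.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// waitReachable polls addr until a connection succeeds or the timeout
// expires, mirroring the way a node-setup task fails if the node never
// comes back after imaging.
func waitReachable(addr string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("node %s not reachable within %s", addr, timeout)
}

func main() {
	if err := waitReachable("172.30.0.10:22", 10*time.Minute); err != nil {
		fmt.Println("task failed:", err)
	}
}
```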
Wgd
TODO
@geoff
Rally
Rally is the mass storage provider for experiment nodes. Rally is responsible for creating, managing, and removing users' experimental storage. Unlike mergefs, which lives on the portal, rally is used for ephemeral storage, which lives for the lifetime of the site, project, or experiment. Rally can run on one or more of the storage servers. Rally is made up of a single service, rallyd, and the controller tool rallyctl for accessing and modifying rally data.
Rally uses a default configuration file:
Rally depends on 2 other services: etcd and ceph. Additionally, most of the rally configuration file defines ceph attributes. The root and users fields dictate the cephfs path, based off the root rally mount (/mnt/rally/rally/users).
There are two types of supported storage methods at the moment: site and experiment storage. Site storage is created through the mergetb cli and creates an AssetFragment, while experiment storage is created by experiment xir and generates an MzFragment.
Block storage is currently implemented, but automatic mounting of block storage has not been enabled in the foundry configuration.
The last asset, `newAsset`, was created in the experiment xir definition: when the portal looked up each asset and could not find `newAsset`, it added it to the experiment definition to have the site create an asset with the lifetime of the experiment.
Rally is responsible for experiment storage; for now that is only ceph. For this next bit we will dive more in-depth into ceph. Ceph is installed by ceph-deploy. The main components of ceph are:
- Monitor (mon): manages the ceph cluster, creates quorum with other monitors. Monitors are the brains of the ceph cluster.
- Metadata Server (mds): manages the data distribution of ceph filesystem.
- Rados Gateway (rgw): interface for librados.
- Object Store Device (osd): storage devices to hold ceph data.
- Ceph Manager (mgr): manages telemetrics of ceph (prometheus, alerts, etc).
Data placement over the OSDs is managed by CRUSH, an algorithm that attempts to reduce erasures by data encoding and data placement beyond failure domains. Data is stored in pools based on the number of OSDs; the placement group value distributes the pools across OSDs.
In order for experimenters to mount the ceph filesystem, they need to know where the monitor is located and the path to their data. Rally does this by maintaining a permissions map on each rally user, which maps 1-to-1 with a ceph user. The permissions are tied into the ceph secret, which prevents other users from accessing the data. An administrator can access this data through the rallyctl tool.
During the materialization of nodes, the Cogs make gRPC calls to rally to get this information and place it into the foundry data structure, so that when the node boots, the data is placed into the node's /etc/fstab to automatically mount the filesystem.
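A sketch of the kind of fstab entry this produces, generated here with a small Go helper. The monitor address, path, user, and mount options are illustrative assumptions about a typical cephfs mount, not Rally's exact output.

```go
package main

import "fmt"

// cephFstabEntry builds an /etc/fstab line for mounting a cephfs path.
// The secret would normally come from the per-user ceph keyring that Rally
// manages rather than being placed directly in fstab.
func cephFstabEntry(monitor, path, mountpoint, user, secretFile string) string {
	return fmt.Sprintf("%s:%s %s ceph name=%s,secretfile=%s,_netdev 0 0",
		monitor, path, mountpoint, user, secretFile)
}

func main() {
	fmt.Println(cephFstabEntry(
		"172.30.0.5:6789",       // assumed monitor address
		"/rally/users/alice",    // assumed per-user cephfs path
		"/mnt/rally",            // assumed mountpoint on the node
		"alice",
		"/etc/ceph/alice.secret",
	))
}
```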
Moa
Moa is a daemon that configures the network emulation, currently implemented using [Fastclick](https://github.com/mergetb/fastclick). Moa is configured by the Cogs via its API, and stores the mapping from user-specific tags in the network model to the network links. It receives commands via `Moactld`, and utilizes Fastclick's control socket interface to dynamically alter the network link characteristics as requested.
The `moactl` CLI application can be used to inspect the emulations. It can be installed via `apt install moactl` on the emulation servers. It has facilities for listing emulations, showing the details of a specific emulation, and starting or stopping the `fastclick` processes that perform the actual emulation.
Beluga
Beluga is a power control daemon. It has the capability to control the power state of many different types of devices through a pluggable device framework. Beluga exposes a generic power control interface through which users of the Beluga API control devices. The Beluga power control daemon has a configuration file that maps device names to their underlying power control model. When Beluga receives a power control command, it looks up the power control protocol for the device the command targets and translates the Beluga command into that device's power control protocol.
An example Beluga configuration looks like this.
Here we see the configuration space partitioned into two categories, the device parameters and the power controller parameters. The device parameters make devices controllable through the Beluga API and link each device to a set of control parameters. The control parameters tell Beluga how to interact with the power controller for the specified device.
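The pluggable lookup can be pictured with a short Go sketch. The interface, controller types, and device names are illustrative assumptions about the pattern, not Beluga's actual API or configuration schema.

```go
package main

import "fmt"

// PowerController abstracts a specific power control protocol.
type PowerController interface {
	On(device string) error
	Off(device string) error
}

// ipmiController and pduController stand in for concrete protocol drivers.
type ipmiController struct {
	host string
}

type pduController struct {
	host   string
	outlet int
}

func (c ipmiController) On(d string) error  { fmt.Printf("ipmi %s: power on %s\n", c.host, d); return nil }
func (c ipmiController) Off(d string) error { fmt.Printf("ipmi %s: power off %s\n", c.host, d); return nil }
func (c pduController) On(d string) error   { fmt.Printf("pdu %s outlet %d: on (%s)\n", c.host, c.outlet, d); return nil }
func (c pduController) Off(d string) error  { fmt.Printf("pdu %s outlet %d: off (%s)\n", c.host, c.outlet, d); return nil }

func main() {
	// Device parameters: map each device name to its power controller.
	devices := map[string]PowerController{
		"node1": ipmiController{host: "node1-bmc"},
		"node2": pduController{host: "pdu0", outlet: 7},
	}

	// A generic power command is translated into the protocol for the device.
	if ctl, ok := devices["node2"]; ok {
		ctl.Off("node2")
	}
}
```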
Beluga runs as a systemd service
and is available through apt packages
GoBGP
GoBGP is a border gateway protocol (BGP) daemon that runs on all testbed servers that provide services directly to testbed nodes over the infranet or xpnet. GoBGP implements the BGP protocol and provides a gRPC interface for programmatic control. The way these networks are put together through protocols like BGP will be explained in more detail in the infranet and xpnet sections. Here we will simply go over the mechanics of GoBGP operations.
GoBGP runs as a systemd service
You'll find GoBGP running on
- Infrapod host servers
- Sled imaging servers
- Moa emulation servers
GoBGP also comes with a command line tool.
Peers
The neighbor command shows us information about who we are directly peered with.
This tells us that the daemon on this node has been up for 31 days, and has received 1904 routes from peers.
Underlay Routes
The global routing information base shows us what underlay addresses are routable from our current location. The underlay network provides the foundation over which a set of overlay networks may be created.
You should be able to ping any of these addresses if Gobble is running and has provisioned the corresponding routes.
Overlay Routes
The global routing information base for Ethernet virtual private network (EVPN) shows us what overlay addresses are routable from our current location. The overlay network provides an isolated communication substrate for materialization networks that do not interact with other experiment networks or testbed infrastructure networks.
On an active testbed there will be a lot of EVPN overlay routes, thousands or tens of thousands depending on the size of the testbed facility. EVPN routes come in two flavors
- Type-2 (macadv): These routes define layer-2 segments by providing reachability information for MAC addresses across the overlay within the context of a virtual network identifier (VNI). A VNI is an integer value that defines the isolation domain of the route. This is what provides separation between segments. So if the VNI on one advertisement is 100 and 101 on another, the destinations will not be able to communicate.
- Type-3 (multicast): These routes define layer-3 reachability information in terms of underlay addresses. It's a way of saying that for a given VNI, there are targets of interest at the specified underlay address. This serves the function of routing broadcast, multicast and unknown (BUM) packets. When a packet is a BUM packet, it will be broadcast to all Type-3 destinations within the VNI context.
Installation
GoBGP is installed by installing Gobble.
Configuration
The following is an example of a GoBGP configuration located at /etc/gobgp.yml
Gobble
GoBGP speaks a variety of BGP dialects and provides a gRPC API and command line client to inspect and manipulate the routing and forwarding space for a variety of BGP-based protocols. What it does not do is create any routing or forwarding tables on the machine it is running on. That is what Gobble does. Gobble uses the GoBGP gRPC API to inspect the reachable EVPN overlay routes of the node it is running on and updates the underlying Linux routing and forwarding tables accordingly.
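A trimmed Go sketch of that polling pattern against GoBGP's gRPC API is shown below, assuming the daemon listens on its default gRPC port (50051). This is not Gobble itself, just an illustration of reading the EVPN RIB programmatically; a Gobble-like daemon would then translate the results into Linux routes and forwarding entries.

```go
package main

import (
	"context"
	"io"
	"log"

	api "github.com/osrg/gobgp/v3/api"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Connect to the local GoBGP daemon's gRPC API.
	conn, err := grpc.Dial("127.0.0.1:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := api.NewGobgpApiClient(conn)

	// List the EVPN overlay routes in the global RIB.
	stream, err := client.ListPath(context.Background(), &api.ListPathRequest{
		TableType: api.TableType_GLOBAL,
		Family:    &api.Family{Afi: api.Family_AFI_L2VPN, Safi: api.Family_SAFI_EVPN},
	})
	if err != nil {
		log.Fatal(err)
	}
	for {
		resp, err := stream.Recv()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("evpn destination: %s (%d paths)",
			resp.Destination.Prefix, len(resp.Destination.Paths))
	}
}
```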
Gobble runs as a systemd service
Installation
Configuration
The gobble configuration file is located at /etc/gobble.yml
Canopy
Canopy is a framework for creating virtual networks. There are 3 main parts
- A client that lets you efficiently manage virtual networks across a switching mesh.
- A daemon that runs on switches and servers that implements virtual network synthesis commands received from clients.
- An API and library for programmatically managing virtual networks.
Inspecting Virtual Networks
Using the Canopy client you can inspect the state of any switch or server in the testbed that is running the Canopy daemon. Platforms in a Merge testbed that run the Canopy daemon are:
- All switches
- Infrapod servers
- Storage servers
- Network emulation servers
- Physics simulation servers.
The network state of a host can be inspected with canopy as follows.
This output shows us the physical interfaces, VXLAN tunnel endpoints, bridges and their state. This display does not show it, but the interfaces are also colored green if they are up and red if they are down.
Managing Virtual Networks
The Canopy client also lets you modify virtual network state. For example, for the above server `isp0`, if we wanted to add the VLAN tags 47 and 99 to interfaces `if0`, `if1` and `if3`, we could do the following.
This is the general structure of most commands: a command stanza, a set of parameter values, a set of ports to apply those values to, and a set of hosts to apply the commands to. In the above example we only had one host target; however, we can extend the command to multiple switches.
Internally the components use the Canopy API to create virtual networks for experiments in an automated way.
The Canopy Daemon
In order for the Canopy client CLI or library to be able to setup networks on devices, the Canopy daemon must be running on those devices. The Canopy daemon runs as a systemd service
There is no configuration for Canopy; it just needs to be installed and run. It uses the Linux netlink socket to get all the information it needs and to implement the command requests it receives.
At the time of writing canopy configurations are not persistent. So if the host is rebooted, the configuration will be lost, and the host will need to be re-configured.
Installation
Client
Server
Node Booting, Imaging, Initialization, & Debugging
When a node is not in use on the testbed, it is in a special virtual network called the harbor. When a node is selected for materialization the following things happen.
Booting
When a node first boots on the testbed, its firmware must be configured to PXE boot. This is so the testbed can take control of the node. All nodes that are not in use are on a special virtual network called the harbor. There is a DHCP server running on the harbor that provides BOOTP protocol information to PXE clients sufficient to load a bootstrapping PXE or UEFI image from a TFTP server on the harbor.
Both the DHCP server and the TFTP server run as containers in the harbor's infrapod.
note
The name of the harbor is main.harbor.<testbed-name>
You can see the BOOTP protocol options by inspecting the networks of the dhcp server.
This shows that we have two distinct harbor networks. One main network that serves EFI firmware on boot, and one bios network that serves legacy BIOS firmware on boot. The static network is for providing DNS resolution for service addresses such as foundry.
We can inspect the DHCP/BOOTP configuration of the main harbor network as follows.
This shows us that the EFI image PXE clients will be directed to is called `pxe.efi`.
To see exactly which nodes will be picked up on this network, we need to look at the network membership.
This list shows that membership for this network is based on fixed MAC address mappings. Other mapping mechanisms are available. On testbeds that only have a single type of firmware, a Nex network that picks up all nodes may be defined. Nex also allows for MAC ranges to be specified, so devices can be targeted by OUI.
The TFTP server container is pre-loaded with 4 basic bootloaders by the testbed software. These bootloaders are built from the MergeTB fork of the iPXE firmware.
- `pxe.efi`: EFI bootloader that will attempt to use all interfaces it can find.
- `pxe.bios`: BIOS bootloader that will attempt to use all interfaces it can find.
- `snponly.efi`: EFI bootloader that will only use the network interface it was chain loaded from.
- `snponly.bios`: BIOS bootloader that will only use the network interface it was chain loaded from.
The sole task of this bootloader, in whatever form it comes in, is to load the Sled imaging kernel. The Sled imaging kernel is a lightweight Linux OS that loads purely into the memory of the machine and performs OS imaging operations.
Imaging
When Sled boots on the node, it attempts to contact the sled server running in the harbor, asking for instructions. At this point, if the node is undergoing a materialization, the sled server will send back a message telling the sled imaging client what image to pull down. The sled client will download the specified image and write it to the disk specified by the sled server.
The Sled client also downloads the kernel for the specified OS separately, and then kexecs directly into that kernel. This saves a reboot cycle - which can be a very costly operation, especially on server grade hardware.
Sled is also responsible for providing sufficient kernel parameters for the node to boot. These parameters are a part of the testbed model, and the testbed automation systems that configure the nodes for booting provide this information to Sled from the testbed model. This includes
Root file system location
Merge OS images use the GPT partitioning scheme. By default, Merge images identify the boot partition by partition UUID (`PARTUUID`). Note that this is a partition-level label and not a filesystem-level label.
Serial console settings
Serial console settings are necessary to ensure that once the node boots, its output is available via the serial console. This is a critical lifeline when the network connection on a node goes sideways. An example is as follows
If you have hardware that sometimes has issues booting, using the kernel `earlyprintk` option can be very helpful
Bootstrap network interface
When a node boots, it needs to know how to bootstrap the network. This is provided to the node configuration system by a kernel parameter called `infranet`.
Experiment Images
Experiment images are created using the images repo. Currently images are generated using Packer. Packer takes in an image definition in the form of a JSON template and outputs an image as defined by the template. In the case of Merge and sled, these images are output as raw images, which sled will then write directly to the block device.
There are 3 outputs of the packer build process that merge requires:
- Disk image
- Kernel
- Initramfs
The disk image is generated from the template itself. The kernel and initramfs are created by scripts in the image repo that extract the kernel and initramfs. In most operating systems, there are multiple kernels and initramfs images, so one script will create a symlink with a `sled-` prefix that maps to the currently used version. As the prefix denotes, this is used by sled during the kernel execution (kexec) into the operating system.
Currently experiment images are segmented by the operating system and the firmware (bios, efi). This is important because, as discussed above for the root file system partition, it is the image that is responsible for creating that filesystem, and the firmware and bootloaders also need to know about the partitions that are created. For example, in an EFI-based system we need to identify the EFI partition.
This enables the images to survive reboot. While the initial boot is done from a kexec, additional boots are done through standard grub bootloading.
Experimental images are rarely updated, but when they are, they generally fall into 3 categories of updates:
- Operating System update (upgrade version)
- Operating System packaging/tools (add prerequisite package)
- Mergetb packaging (foundryc, retty)
Updating Testbed images
There are three types of images used during the mergetb imaging process:
- PXE Image
- Sled Image
- Experiment Image
We've previously discussed each of these images in sections above. To provide a brief recap, pxe images are used to chainload into the sled image, the sled image boots into the experiment image, and the experiment image is used by experimenters. We will dive into a few more specifics for each topic to address how to update a testbed image.
Updating PXE Images
The pxe images live on the infrapod host and are mounted by the tftp container. The excerpt below shows the specifications for the harbor as was shown above, particularly the mounting source and destination. The tftp container is hard coded to mount from `/srv`, but the source location is left to the facility administrator.
When updating a pxe image, the location to place the updated images depends on the source value. It is common to need to spin out a new pxe image due to new physical hardware and the specifications of the console settings or additional linux kernel parameters necessary for sled to work.
After building the pxe image, the destination is based off the firmware (bios or efi) and how the nex networks were configured. Recall that in the booting section we discussed how nex plays a role in how nodes PXE boot. So the names used should match those in the nex entries which are being mounted by the tftp container.
Updating Sled Images
Sled images mainly consist of the u-root binary with sledc built on top. Sled is the imaging protocol, and will likely get updated for a variety of reasons: u-root version, kernel packages, sled protocol change, or boot process.
As noted in the iPXE section above, the destination of the sled image is `http://sled-server/pxe`. So this again depends on nex DNS resolution and the webserver running on that host. In most testbeds, this webserver will be the `slednginx` container. The code that specifies the destination for where images live is here.
So for slednginx, the destination is hardcoded as `/var/img` on the `sled-server` host for experiment images, and `/var/img/pxe` for these sledc images. The directory `pxe` here is a reference to the fact that these images are loaded by PXE, not that they are iPXE images as described in the section immediately above.
Updating Experiment Images
Testbed experiment images are serviced by sled. The location of those images is also based on the webserver servicing http requests (default `/var/img`). However, the sled protocol actually relies on the value given by the sled command:
Within the write command, the name of each entity (image, kernel, initrd) is both a filename and a path. So sledc will generate an http GET request for `http://sled-server/ubuntu-1804-disk`. So how the root directory of the webserver is hosted, together with what the image name is determined to be in etcd for sled, determines the location and name of the experiment image.
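Conceptually, the fetch-and-write step is an HTTP GET streamed onto the block device. Below is a simplified Go sketch, reusing the ubuntu-1804-disk name from above and assuming /dev/sda as the target; it is an illustration of the idea, not the sledc implementation.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// The image name doubles as the path component of the URL.
	resp, err := http.Get("http://sled-server/ubuntu-1804-disk")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Stream the raw image straight onto the target block device.
	dev, err := os.OpenFile("/dev/sda", os.O_WRONLY, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer dev.Close()

	n, err := io.Copy(dev, resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("wrote %d bytes", n)
}
```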
Recommendations for updating images
There are several recommendations to make when updating any of the above images on the testbed.
- Prior to deployment on the physical testbed, attempt to verify that the image works under similar conditions in a staged virtual environment.
- Disable materialization to the site via the Merge API prior to copying the images over. The testbed is a dynamic system, during an image update a materialization may cause an incomplete image to be copied.
- Update sled images for sled daemons. Sled clients in daemon mode act as a cache of images: when a node dematerializes, the cogs place the node into daemon mode and download the default images. When images are copied over to the new location, if the default image or the sled client itself has changed, all nodes in daemon mode are still running the old cached version. For testbed images, one can update the sled daemon entries with the latest version (a write to the etcd key will cause an update to occur). For sled client images, one can either reboot the node or use the Sledc Upgrade command to kexec into the new sledc image.
Configuration
At this point the user specified OS is booting. All MergeTB compatible images run a node configuration daemon on startup.
Debugging
This section covers a few more of the specifics with regard to how sled interacts with the cogs, where problems tend to arise, and how to solve them.
The most common issue that comes up is when the cogs displays `failed to daemon node`, as shown below.
This task is the most common failure because it is primarily responsible for imaging a testbed node and has the largest vector for errors. The first step to take in debugging these nodes is to grab a console session to the node, delete the task error, and watch what happens.
The most common failures are:
- PXE DHCP Failure
  - This leaves the node to follow the boot order, which with an unsuccessful PXE will result in booting the last operating system installed to the disk.
- DNS Resolution Failure or Sledd/Networking
  - If `sled-server` fails to resolve, or `sledd` is unable to complete the transaction, the node will end up in `u-root` with a `>` elvish shell. On `sledc` nodes the logs can be found at `/tmp/sled.log` and the database at `/tmp/sled.db`. You can read these files to get more details.
- Latency
  - It is common when there are hundreds to thousands of nodes rebooting that a node may not download the images in time. In this case you will also find the node in `u-root`.
If the console is not working, sledctl has a built-in command line interface which can be used to interface with nodes that have made it to `sledc`.
`sledctl ping` is a daemon command for checking if a sled node is listening. This is used directly by the cogs to check if a node is already configured. `sledctl run` allows for remote command execution on the daemon node.
Infranets
The infranet is a flat network that interconnects every node in an experiment. There is one infranet per experiment. The infranet serves the following basic functions.
- Provides access to experiment resources through XDC wireguard connections.
- Allows every experiment node to reach every other node for the purposes of experiment automation.
- Provides external network access to nodes, most commonly the Internet.
- Provides mass storage access to experiment nodes.
- Provides DHCP/DNS to experiment nodes.
- Hosts a materialization-specific node configuration server.
- Hosts API endpoints for emulated network control and physics simulation control.
When an experiment is materialized the following things happen to construct the infranet for that experiment.
Infrapod Network Enclave
The first step in establishing an infranet is creating a network enclave for the infrapod. As a reminder, an infrapod is a set of containers that share a common network namespace and collectively provide the basic infrastructure for an individual experiment materialization.
The network enclave for an infrapod contains the network elements that allow it to communicate with the nodes in an experiment. This includes
Virtual Ethernet Pair
A pair of virtual Ethernet devices straddles the network namespace that encapsulates the containers of the infrapod and the default (sometimes called init) namespace of the host. The veth device internal to the namespace is always named `ceth0`. The veth device in the default namespace is named `ifrX`, where `X` is the integer ID of the materialization.
Materialization Bridge
The materialization bridge, called `mzbrX` where `X` is the integer ID of the materialization, connects the external veth device to a VTEP device. This bridges the infrapod's namespaced network elements with the infranet of the materialization through the VTEP.
Vtep
The VTEP, called `vtepX` where `X` is the integer ID of the materialization, provides access to the experiment infranet that spans the testbed switching mesh through a VXLAN tunnel and a set of routes and forwarding entries through that tunnel.
Infranet Virtual Network
The infranet exists as a virtual network that spans the switching mesh of a testbed facility. Depending on the design of the testbed, this may be a pure VXLAN virtual network or may be a combination VXLAN/VLAN virtual network. In the example that follows we present a combined VXLAN/VLAN implementation.
Consider the infranet depicted in the following diagram.
Here, the nodes shaded in blue are the members of the materialization. For Merge testbed facilities, we typically refer to switches in 3 categories.
- Leaf switches: Provide only VLAN based access to testbed virtual networks.
- Fabric switches: Provide VXLAN (and possibly VLAN) based access to testbed virtual networks.
- Spine switches: Interconnect fabric switches and are pure underlay, e.g. they transit VXLAN encapsulated traffic but do not take part in encap/decap themselves or actively participate in VXLAN control plane protocols such as EVPN.
In this example the infrapod has a VTEP whose parent interface is the physical interface of the infrapod server that connects directly to a spine switch. The nodes that the infrapod must communicate with are below leaf switches. The way this all comes together is the following.
1. The testbed automation system creates VLAN access ports for each of the nodes on the infranet.
2. A VLAN trunk is created on the uplink of each leaf switch and the corresponding downlink of each fabric switch.
3. A VTEP is created on each fabric switch, attached to the bridge of that switch, and given the same VLAN access tag as the fabric downlink in (2). This funnels all traffic from the node into this VTEP.
4. The switch is configured to advertise the virtual network identifier (VNI) of every VTEP that is created on it, so all peer routers are aware of its existence.
5. The GoBGP router running on the infrapod host sees the advertisements from the fabric switches and saves them to its local routing information base (RIB).
6. The Gobble daemon running on the infrapod host sees the new advertisements in the RIB based on periodic polling (once a second by default) and adds corresponding routing and forwarding entries to the server it is running on, so that the corresponding nodes are reachable through its local `vtepX` interface.
7. The testbed automation system creates an EVPN advertisement for the internal infrapod interface `ceth0`. All of the fabric nodes in the testbed see this advertisement and create the corresponding routes through the VTEPs that were created on the VNI specified in the advertisement.
8. At this point bidirectional communication has been established between the nodes, and between the nodes and the infrapod.
Inspecting the infranet from an infrapod server
We can see many elements of an infranet from an infrapod server. First let's take a look at a materialization's metadata through the cog tool
This shows us that the VXLAN VNI associated with the infranet for this materialization is 100. We also see that the virtual network index of this experiment is 100. It's common for the VNI and vindex for an experiment to be the same number.
tip
Because the infranet is a flat network, there is always just a single VNI for each infranet
If we look at the network of the infrapod server hosting this infrapod we can see the following.
Likewise we can peer into the network namespace for this materialization
The EVPN/BGP state can be inspected through the GoBGP tool. To see the underlay network.
This shows us the other routers that can be reached from this node. The node that has the next hop of `0.0.0.0` is the node we are on. The other routers in the network are the fabric switches in the switching mesh, storage servers, and other infrapod servers. The IPs associated with each of these routers are what we refer to as tunnel IPs, because these are the tunnel entry points for VXLAN networks. The VXLAN networks are laid over the top of these tunnel entry points, and for that reason the tunnel-ip network is called the underlay network and the VXLAN network is called the overlay network.
We can also inspect the state of the overlay network. For any non-trivially sized testbed, the overlay network can have a very large number of entries. Here we focus our attention on the entries involving VNI 100 belonging to the materialization we are looking at.
Here we see two types of routes
- multicast: These routes determine how 3 classes of traffic - broadcast, multicast, and unknown, collectively known as BUM - are forwarded. All egress traffic that falls within the BUM class is sent to all routers for which the originating router has a multicast advertisement on the VNI in question.
- macadv: These routes determine how traffic belonging to specific MAC addresses is forwarded. If switches are set to learn MACs as they cross VTEP boundaries, these MACs will be advertised to reduce unknown traffic that must go through multicast routes. The testbed automation systems will also pre-seed known MACs at their fabric entry points to prune the initial BUM tree.
We can see how these BGP/EVPN advertisements manifest as actual routes by inspecting the routing and forwarding state of the infrapod server.
Here we see the VXLAN forwarding entries for `vtep100` on this host. The first entry corresponds to the macadv entry from GoBGP above. The second entry corresponds to the multicast entry from GoBGP above. The `00:00:00:00:00:00` entry is a special forwarding entry that says send all BUM traffic to the following destination. The remaining two entries are plumbing for the `vtep100` itself onto and off of the `mzbr100` bridge.
Note that we do not see an explicit forwarding entry for the `ee:f8:c9:77:d9:e0` macadv above. That is because this is the MAC address of the internal veth on the infrapod we are looking at, so there need not be an external forwarding entry.
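Forwarding entries like these can be programmed from Go via netlink as bridge fdb appends on the VTEP, which is the kind of operation a Gobble-like daemon performs. Below is a minimal sketch for the all-zeros BUM entry; the remote tunnel IP and the `vtep100` device name are assumptions for illustration.

```go
package main

import (
	"log"
	"net"
	"syscall"

	"github.com/vishvananda/netlink"
)

func main() {
	vtep, err := netlink.LinkByName("vtep100")
	if err != nil {
		log.Fatal(err)
	}

	// The all-zeros MAC is the default entry: flood BUM traffic for this VNI
	// to the remote VTEP at the given underlay (tunnel) address.
	zero, _ := net.ParseMAC("00:00:00:00:00:00")
	entry := &netlink.Neigh{
		LinkIndex:    vtep.Attrs().Index,
		Family:       syscall.AF_BRIDGE,
		State:        netlink.NUD_PERMANENT,
		Flags:        netlink.NTF_SELF,
		HardwareAddr: zero,
		IP:           net.ParseIP("10.99.0.5"), // assumed fabric switch tunnel IP
	}
	if err := netlink.NeighAppend(entry); err != nil {
		log.Fatal(err)
	}
}
```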
All of the overlay traffic is handled at the forwarding layer. The underlay traffic is handled at the routing layer. We can see routes to the underlay addresses from earlier as follows.
By default, testbed managed routes on infrapod servers, storage servers, emulation servers and simulation servers are kept in table 47 to keep them separate from the management routing table of the server.
All of the above routing entries and forwarding entries are maintained by a daemon called `gobble` that runs as a systemd service. Gobble periodically polls GoBGP for the underlay and overlay state of the network and adds/removes forwarding and routing entries to the kernel as necessary.
note
Gobble will only add underlay routes for BGP peers with active EVPN routes; otherwise there is no purpose in talking to the peer and an underlay route is not added - even though the peer may have advertised its tunnel endpoint over BGP.
Inspecting the infranet from a fabric switch
The state of the infranet can also be inspected from fabric switches. The switches in a Merge testbed facility run Cumulus Linux, which uses Free Range Routing (FRR) as the routing protocol suite.
To inspect the underlay network from a fabric switch
Likewise we can inspect the EVPN status for VNI 100
As well as the BGP forwarding entries