Facility Overview

A Merge-based testbed is a testbed facility built using the Merge technology stack and automation systems. A Merge testbed is made up of a distributed set of modular components that communicate over well defined interfaces and come together in a layered architecture.

  • The command layer is responsible for authenticating and authorizing requests from Merge portals and delegating them to the appropriate drivers in the testbed automation system.

  • The automation layer is responsible for building and executing task graphs from materialization requests. The automation layer knows very little about how to actually accomplish materialization tasks; it relies on individual components from the Merge technology stack to accomplish this.

  • The testbed infrastructure layer is comprised of a set of narrowly focused subsystems that are responsible for carrying out materialization tasks. This includes things like node imaging systems and DHCP/DNS servers.

  • The fabric layer is comprised of the switches, routers and network appliances that collectively interconnect the testbed.

  • The resource layer is comprised of the user-allocatable resources that underpin materializations. This includes physical devices and virtual devices running inside hypervisors.


The diagram below shows a typical Merge testbed facility.

The commander server is connected to both the upstream network and the testbed network. In this setup there are a pair of "Cogs" nodes that host the testbed automation layer and parts of the testbed infrastructure layer. We refer to these machines as Cogs because Cogs is the name of the Merge automation system. The emulator servers provide network emulation services. The storage servers host the testbed databases and mass storage systems.

Merge testbeds are typically composed of three physical networks.

  • A management network for direct hardware access that interconnects all the infrastructure level servers described above, as well as the management ports of switches.
  • An experiment infrastructure network that interconnects all testbed nodes, with access ports on cogs and storage servers. This network is commonly referred to as the infranet.
  • A data network that interconnects all testbed nodes, with access ports on emulation servers.

The choice of how the software elements that comprise a Merge testbed are deployed across a hardware substrate is at the discretion of the testbed operator. In this document, we'll organize things around the logical components and give guidance on appropriate deployment strategies as we go. As a quick point of reference, a typical deployment of the Merge software components onto the hardware substrate depicted above is as follows.

cogs     driver, rex, infrapods, wgd, beluga, gobble
emulator gobble, moa
storage  gobble, rally, sled


The Merge commander is responsible for authenticating requests from a Merge portal and delegating them to the appropriate drivers in the testbed facility. When the commander starts, it exposes two endpoints: one to the portal and one to drivers within the facility. When drivers come online, they contact the commander and register all the resources for which they want commands delegated to them. The commander builds a lookup table where resource IDs are the keys and the values are lists of drivers.

Materialization messages that come in from the portal come in two types.

  • Materialization notifications notify a facility that a materialization is about to be set up (NotifyIncoming), or that it is OK to discard a materialization (NotifyTeardown).
  • Materialization requests are messages composed of materialization fragments. Each fragment has a resource id to which it refers and an operation to perform over that resource id. For example, a materialization fragment may contain the resource ID of a node, and a request to recycle that node to a clean state. These requests typically come in batches of fragments for efficiency.
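The registration table and batch delegation described above can be sketched as follows. This is an illustrative sketch with hypothetical names, not the actual commander code:

```python
# Sketch: a commander-style lookup table mapping resource IDs to the
# drivers that registered for them, used to fan out a batch of
# materialization fragments. All names here are hypothetical.
from collections import defaultdict

class Commander:
    def __init__(self):
        # resource ID -> list of drivers that registered for it
        self.table = defaultdict(list)

    def register(self, driver, resource_ids):
        # called when a driver comes online and registers its resources
        for rid in resource_ids:
            self.table[rid].append(driver)

    def delegate(self, fragments):
        # group a batch of fragments by the driver(s) responsible for
        # each fragment's resource ID
        batches = defaultdict(list)
        for frag in fragments:
            for driver in self.table[frag["resource_id"]]:
                batches[driver].append(frag)
        return batches

cmdr = Commander()
cmdr.register("driver-a", ["node1", "node2"])
cmdr.register("driver-b", ["switch1"])

out = cmdr.delegate([
    {"resource_id": "node1", "op": "recycle"},
    {"resource_id": "switch1", "op": "set-links"},
])
# driver-a receives the node fragment, driver-b the switch fragment
```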


The Merge Commander is installed from the Merge package server, using the mergetb-commander package.

sudo apt install mergetb-commander

Once installed, the commander runs as a systemd service.

sudo service commander status
● commander.service - MergeTB Commander
Loaded: loaded (/lib/systemd/system/commander.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-04-24 06:24:13 PDT; 1 months 8 days ago
Docs: https://gitlab.com/mergetb/site
Main PID: 46581 (commander)
Tasks: 62 (limit: 19660)
Memory: 74.4M
CGroup: /system.slice/commander.service
└─46581 /usr/bin/commander


There is no explicit configuration file for the commander, but it does have some flags that can modify its behavior.

flag    default                  description
listen  0.0.0.0                  address to listen on
port    6000                     port to listen on
cert    /etc/merge/cmdr.pem      client access certificate Merge portals must use to access the materialization interface
key     /etc/merge/cmdr-key.pem  private key corresponding to the client certificate


The Merge Driver is responsible for taking delegated materialization requests from a commander and transforming them into a task graph in preparation for execution. This task graph is a directed acyclic graph that captures the dependencies between materialization tasks so they can be maximally parallelized for execution.

Here is a sample execution graph in tabular form.

time mzid taskid stageid actionid kind instance action deps complete error masked
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz FyUNNFCs7 NbaVVNg8u plumbing Mq0DDEmGN create-enclave [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz mTjOZm1oB KUMAAKxpG container Mq0DDEmGN launch [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz mTjOZm1oB @7TMM@i9m container Mq0DDEmGN launch [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz mTjOZm1oB 5iq995b6Q Canopy Mq0DDEmGN SetServiceVtep [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz mTjOZm1oB 2KeTL3mOQ container Mq0DDEmGN launch [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz mTjOZm1oB 0imRZ0pyf container Mq0DDEmGN launch [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz ZY8qqZf0F tVPz6yTkI Nex Mq0DDEmGN CreateNetwork [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz ZY8qqZf0F _s4ZZ_8RQ Nex Mq0DDEmGN CreateNetwork [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz I1d@@MxeK xaz@bxqFo Nex Mq0DDEmGN AddMembers [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz I1d@@MxeK U46NNlxws Nex Mq0DDEmGN AddMembers [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz 3on5P3_KR hVARRhCdn MoaControl Mq0DDEmGN InitMoaControl [] true false
22 Apr 20 10:47:18.29 PDT r1.simple.laforge CWF6IJrMz M7lkkHhfE CgERRC_sS bookkeeping Mq0DDEmGN UpdateStatus [] true false
22 Apr 20 10:47:19.84 PDT r1.simple.laforge EWVvvMikS ECmngS1u0 vaG_TXP2E Canopy Mq0DDEmGN SetLinks [CWF6IJrMz] true false
22 Apr 20 10:47:19.84 PDT r1.simple.laforge OrKLLWe25 l11oGh6vd YBoO5bSnZ NodeSetup Mq0DDEmGN m610 (b) [CWF6IJrMz] true false
22 Apr 20 10:47:19.84 PDT r1.simple.laforge OrKLLWe25 l11oGh6vd D7VprtE6I NodeSetup Mq0DDEmGN m588 (a) [CWF6IJrMz] true false
22 Apr 20 10:47:19.84 PDT r1.simple.laforge EWVvvMikS ECmngS1u0 CJVYYAGc_ Moa Mq0DDEmGN CreateMoaLink [CWF6IJrMz] true false

In addition to having a basic DAG structure, tasks are also organized into stages and actions. Stages are executed serially, and the actions within a stage are executed in parallel, so stages act as execution barriers.
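The stage/action model can be sketched as follows. This is a conceptual illustration of the execution semantics, not the Rex implementation:

```python
# Sketch: stages execute serially, the actions within a stage execute
# in parallel, and each stage completes before the next begins (a
# barrier). Hypothetical code illustrating the model described above.
from concurrent.futures import ThreadPoolExecutor

def run_task(stages):
    results = []
    for stage in stages:                    # stages execute serially
        with ThreadPoolExecutor() as pool:  # actions execute in parallel
            # the with-block waits for every action in the stage,
            # so each stage acts as an execution barrier
            results.append(list(pool.map(lambda act: act(), stage)))
    return results

# a miniature task resembling the infrapod setup stages
task = [
    [lambda: "create-enclave"],                    # stage 1: one action
    [lambda: "launch", lambda: "SetServiceVtep"],  # stage 2: parallel actions
]
print(run_task(task))
```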

To get a sense for how the above all comes together logically, consider the graphical representation of the table above in the diagram below. Here the blue rectangles are tasks, each column within a rectangle is a stage, and each rounded rectangle is an action.

Here we see that this materialization is broken up into three tasks:

  • Setting up the infrapod and its network enclave.
  • Setting up the experiment network.
  • Setting up the nodes.

Within the infrapod setup there is a single action in the first stage that creates the network enclave for the infrapod. Once that action is complete the next stage is executed. In this stage 4 containers are launched and a service VXLAN tunnel endpoint (VTEP) is set up. All of these actions are done in parallel. Next, DHCP/DNS networks are created that provide names and addresses on the experiment's infranet. Members are then added to the DHCP/DNS networks. The network emulation (moa) control container is initialized, and a global status for the materialization is updated indicating that the infrapod has been set up.

Once the infrapod is set up, the two tasks that depend on it, network setup and node setup, are executed. The network setup happens across two actions in serial across two stages, and the node setup happens in parallel across all nodes in the materialization. In this case we have a tiny two-node experiment.

Tasks are stored in the facility etcd database discussed later in this document.


The Merge Driver is installed from the Merge package server, using the mergetb-driver package.

sudo apt install mergetb-driver

Once the driver is installed, it runs as a systemd service.

sudo service driver status
● driver.service - MergeTB Driver
Loaded: loaded (/lib/systemd/system/driver.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2020-05-29 12:55:58 PDT; 3 days ago
Docs: https://gitlab.com/mergetb/tech/cogs
Main PID: 105230 (driver)
Tasks: 65 (limit: 19660)
Memory: 175.8M
CGroup: /system.slice/driver.service
└─105230 /usr/bin/driver


The driver configuration file is located at /etc/cogs/driver.yml. Here is an example configuration.

# where to reach the etcd database
address: localhost
port: 2399
# where to reach the commander
address: site0
port: 6000
# connection information to provide the commander
address: site1
port: 10000
# default images to use when not specified by the user
# default testbed node images
default: debian:10
# default infrapod container images
moactl: docker.io/mergetb/moactld:0.2.0
foundry: docker.io/mergetb/foundry:v0.1.16
etcd: docker.io/mergetb/etcd:v0.1.0
nex: docker.io/mergetb/nex:v0.5.5


The Driver is only responsible for creating the task graph structure. It is not responsible for managing task execution. That job falls to a component called Rex. Rex continuously watches the etcd database for new tasks or changes in existing task state that may make other tasks eligible for execution.
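The eligibility rule Rex applies when watching the task store can be sketched as follows. Field names follow the tabular task listing shown earlier; the code itself is hypothetical:

```python
# Sketch: a task may execute once all of its dependencies are complete
# (or masked), it has not already completed, it is not masked, and it
# carries no error. Illustrative only, not the Rex implementation.
def eligible(task, tasks_by_id):
    if task["complete"] or task["error"] or task["masked"]:
        return False
    # masked dependencies are discarded as a dependency consideration
    return all(
        tasks_by_id[d]["complete"] or tasks_by_id[d]["masked"]
        for d in task["deps"]
    )

tasks = {
    "A": {"complete": True,  "error": False, "masked": False, "deps": []},
    "B": {"complete": False, "error": False, "masked": False, "deps": ["A"]},
    "C": {"complete": False, "error": True,  "masked": False, "deps": ["A"]},
}
assert eligible(tasks["B"], tasks)      # dependency A is complete
assert not eligible(tasks["C"], tasks)  # errored tasks are ignored
```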


The Merge Rex execution agent is installed from the Merge package server, using the mergetb-rex package.

sudo apt install mergetb-rex

Once rex is installed, it runs as a systemd service.

sudo service rex status
● rex.service - MergeTB Rex
Loaded: loaded (/lib/systemd/system/rex.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-06-01 18:20:11 PDT; 10min ago
Docs: https://gitlab.com/mergetb/tech/cogs
Main PID: 367 (rex)
Tasks: 14 (limit: 19660)
Memory: 13.0M
CGroup: /system.slice/rex.service
└─367 /usr/bin/rex


The rex configuration file is located at /etc/cogs/driver.yml. Here is an example configuration.

# where to reach the etcd database
address: localhost
port: 2399
# where to reach the beluga power control daemon
address: site0
port: 5402
# how many seconds to use as a lease TTL
taskLease: 10
# network configuration parameters
# the interface that connects to the testbed infranet
vtepIfx: enp113s0f0
# the mtu to use for physical interfaces
mtu: 9216
# the mtu to use for vteps
vtepMtu: 9166
# the BGP autonomous system number of the cogs server rex runs on
bgpAS: 64799
# the underlay tunnel IP used for VTEP tunnel communications
# the address info used to talk to networks outside the testbed (e.g. the Internet)
# tuning parameters for the sled imaging system
# number of seconds to wait for an individual node to be imaged
mattimeout: 300
# number of seconds to wait for an individual node to be wiped
# to the default image
demattimeout: 300
# connection timeout when connecting to sled imaging daemons
conntimeout: 10


The cog command line tool is the primary tool used to operate a testbed facility.

Monitoring Task Progress

Use this command to see what tasks are currently pending on the testbed

cog list tasks

Viewing Materialization Tasks

Use this command to view the tasks of a specific materialization. Use the --all flag to show tasks that have been completed in addition to pending tasks.

cog list mztasks <mzid> --all

The mzid of a materialization is denoted by

  • <realization>.<experiment>.<project>
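For example, the mzid r1.simple.laforge used throughout this document unpacks with a simple split, most-specific component first:

```python
# Unpack an mzid into its <realization>.<experiment>.<project> parts.
realization, experiment, project = "r1.simple.laforge".split(".")
```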

Clearing Task Errors

When a task encounters an error and needs to be retried, either because the operations team has fixed some piece of infrastructure or because the error appears to be transient, use this command to clear the task error so Rex will try the task again. Rex ignores all tasks that have an error condition.

cog delete task-errors <mzid> <taskid>

Manually Completing Tasks

Sometimes a task is in a hopeless state and the ops team just needs to consider it complete and move on so other dependent tasks and stages can be executed.

cog complete <mzid> <taskid>

Masking Actions

Sometimes an action needs to be masked so it won't be executed when Rex gets to that point in the execution graph. This is what masking is for. When an action is masked, Rex will ignore it. Masking also allows execution to continue beyond that point in the graph, as masked actions are discarded as a dependency consideration.

cog mask <mzid> <taskid> <actionid>

Manual Dematerialization

Sometimes an experiment must be manually dematerialized. The most common reason for this is a de-synchronization between a Merge portal and the testbed facility.

cog delete mz <mzid>

This will dematerialize the experiment and clear out all associated data records from the Cogs database.

Listing Materializations

To get a list of all active materializations

cog list mz

Showing Materialization Info

This is probably the second most useful command for managing testbed facilities.

sudo cog show mz <mzid>
status: active
117 117 27
etcd up
foundry up
moactld up
nex up
m588 a debian:10 00:08:a2:0d:df:8e (2:3:54) il14.swp13 27
m610 b debian:10 00:08:a2:0d:dd:ae (2:3:37) il14.swp35 27
m588 a eth1 [] 00:08:a2:0d:df:8f 1500 xl16.swp13 431
m610 b eth1 [] 00:08:a2:0d:dd:af 1500 xl16.swp35 430
m610~m588 b~a
520 430 [m610:eth1] [b]
521 431 [m588:eth1] [a]
xl16 swp13 431
xl16 swp35 430
xl16 uplink [430 431]
xf2 xl16 [430 431]
emu0 vtep520 0/520
emu0 vtep521 0/521
xf2 vtep520 430/520
xf2 vtep521 431/521
emu0 xf2 64805 520
emu0 xf2 64805 521

Taking the highlights section by section, this command shows

  • the overall status of the materialization as active
  • the service address of the infrapod
  • all the containers in the infrapod are up
  • There are two nodes with the debian:10 image that have active DHCP leases with 2 hours and 3 minutes remaining.
  • What switches each node interface is connected to on both the infranet and xp net as well as VLAN/VXLAN information.
  • A mapping of experiment links from node to node including all the intermediate switch hops and the VLAN/VXLAN information associated with them.
  • EVPN advertisement information for cross-fabric or emulated links.

We say that a link is cross fabric when it transits across at least one routed VXLAN hop.

Getting Node State

The above image shows some of the internals of node status and sled state. Rex is the driver of state. During a materialization, a node moves from the Clean state to Setup to Ready. Rex manages the state machine transitions and prevents nodes from getting into bad states, always driving from the current state toward the target state. The Clean and Ready states map directly to sled states: a node in daemon mode is Clean, and a node in a materialization is Ready. Sled states are discussed in later sections.

cog list nodes
node mzid current state target state
n87 harbor.testbed.io Clean Clean
n88 some.mat.coolproj Ready Ready

You can use cog list nodes to detect issues with nodes whose current state != target state and that are not materializing or dematerializing.
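The Clean → Setup → Ready progression can be sketched as follows. This is hypothetical code illustrating the "drive from current state to target state" behavior, not the Rex implementation:

```python
# Sketch: Rex-style state driving, one transition at a time, from the
# current state toward the target state. Illustrative only.
ORDER = ["Clean", "Setup", "Ready"]

def next_state(current, target):
    i, j = ORDER.index(current), ORDER.index(target)
    if i < j:
        return ORDER[i + 1]   # step toward materialization
    if i > j:
        return ORDER[i - 1]   # step toward dematerialization
    return current            # already at the target state

assert next_state("Clean", "Ready") == "Setup"
assert next_state("Setup", "Ready") == "Ready"
assert next_state("Ready", "Clean") == "Setup"
```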


Infrapods are a collection of containers that provide per-experiment infrastructure. Each infrapod contains the following containers at a minimum

  • Foundry: node configuration daemon
  • Nex: DHCP/DNS server
  • Etcd: Database for infrapod local storage

Optionally infrapods may contain

  • Moactld: for dynamic network emulation modification.
  • Simctld: for physics simulation control.

depending on the composition of the experiment.

Technically, an infrapod is a network namespace that contains the above collection of containers. This is similar to the Kubernetes Pod abstraction. Infrapods are created during the materialization initialization phase for every experiment. In addition to the set of containers within the network namespace, there are also a set of interfaces created within the network namespace for

  • communication with testbed resources on the infranet
  • communication with testbed automation systems on the management network

The diagram above shows the network plumbing for two simple infrapods. On the 'top half' of the network, the ceth1 interface of the infrapod connects to a service bridge that testbed automation tools like Rex use to communicate with containers inside the pod. The address assigned to the ceth1 interface is unique across all infrapods in a testbed facility. Each ceth1 interface has a corresponding svcX interface in the host network namespace (commonly referred to as the init namespace). Here X is assigned the integer id of the materialization, created by the Cogs. The ceth1 and svcX interfaces are virtual Ethernet pair devices, or simply veth devices. Veth devices are commonly used to plumb communications in and out of network namespaces. They are very simple: any Ethernet frame that ingresses on one peer is forwarded to the other.

On the 'bottom half' of the infrapods is the infranet. Here we have a ceth0 and ifrX Ethernet pair. In this case the address assigned to ceth0 is not unique across materializations; in fact it is almost always the same, as this is the default infranet gateway for experiments. This overlapping is possible due to the network namespaces and the fact that each ifrX interface is enslaved to a dedicated bridge mzbrX.

On the other side of the mzbrX bridge is a VXLAN tunnel endpoint (VTEP). This interface provides connectivity to the infranet of the experiment, encapsulated inside a VXLAN tunnel. The routing and forwarding configuration of this device is managed by GoBGP and Gobble. More details on how the infranet is constructed and managed are in the infranet section.


The Nex infrapod container provides DHCP/DNS services for the resources in an experiment. There are a few networks defined for each experiment. There is a command line client called nex that is installed by default on the Cogs hosts. This client can be used to inspect the state of DHCP/DNS networks operating inside infrapods, and even modify them in a pinch.

In order to talk to a Nex instance running inside an infrapod, the service address of the pod must be used. This can be seen using the cog show mz command.

cog show mz r1.simple.laforge
status: active
vindex vni vid svcaddr
117 117 27

The address displayed above can be used to look at the Nex DHCP/DNS networks

nex -s get networks

These networks can be further inspected using the nex tool

nex -s get network r1.simple.laforge
name: r1.simple.laforge
domain: r1.simple.laforge
lease_duration: 14400s

and their members inspected

nex -s get members r1.simple.laforge
mac name ip4
00:08:a2:0d:dd:ae b.r1.simple.laforge (3:45:7)
00:08:a2:0d:df:8e a.r1.simple.laforge (3:45:20)

The above are the infranet address leases of the nodes in this 2 node experiment.

nex -s get members static.r1.simple.laforge
mac name ip4
00:00:00:00:00:01 foundry.r1.simple.laforge
00:00:00:00:00:02 moactl.r1.simple.laforge

The above are static names and addresses associated with infrastructure services provided on the experiment infranet, hence the name static.

All of this information is populated by the Cogs when a materialization is first created. The driver turns the materialization fragments from the Merge portal into Cog tasks containing Nex configuration information extracted from those fragments. This data is loaded into the Nex database using the service address of the infrapod and the Nex gRPC management interface.


If you are just interested in the node lease information for a materialization, you do not need to go through this song and dance every time; cog show mz does this for you automatically and will show you the address and lease time of every node in the experiment.


The Foundry infrapod container provides node configuration services for resources in an experiment. The system images that the testbed stamps onto resources at materialization time contain a daemon that runs at boot time. This is the Foundry client (foundryc). When the Foundry client starts, it reaches out to the Foundry server (foundryd) at the DNS name foundry which is resolved by the Nex DNS server pod described in the previous section, requesting how it should configure the node.

Foundryd responds to foundryc requests with information including, at a minimum:

  • How network interfaces should be set up.
  • User account information including SSH credentials.
  • How routing should be set up.
  • What the hostname of the node should be.
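Putting these together, a foundryd response for one node might look roughly like the following YAML. The field names here are guesses based on the foundry list output shown below, not the exact foundryd schema:

```yaml
# Illustrative sketch only -- field names are assumptions,
# not the actual foundryd wire format.
hostname: a
users:
  - name: laforge
    groups:
      - sudo
      - laforge
interfaces:
  - name: eth1
    mtu: 1500
expand_rootfs: true
```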

Foundryd is made aware of this information by the Cogs. When a materialization is created, the driver turns the materialization fragments from the Merge portal into Cog tasks containing Foundry configuration information extracted from those fragments. This data is loaded into the Foundry database using the service address of the infrapod and the foundryd gRPC management interface.

Foundry information can be inspected using the Foundry command line tool that is installed as a part of the Cogs software.

foundry --endpoint --cert /etc/foundry/manage.pem list
hostname: b
- name: laforge
- sudo
- laforge
- [..snip..]
- name: eth1
mtu: 1500
expand_rootfs: true
hostname: a
- name: laforge
- sudo
- laforge
- [..snip..]
- name: eth1
mtu: 1500

This shows the full foundry configuration for both of the nodes in this two node experiment.

Sled API

Sled is one of the core components for experiment imaging. There are 3 major components to sled: client, daemon, and api. The sled client (sledc) is responsible for putting an image on a device, the sled daemon (sledd) is responsible for managing communications between the client and the api, and the sled api (sledapi) is responsible for managing etcd storage. The sled controller (sledctl) is an auxiliary tool for users to manage etcd.

Sledapi is created in the infrapod for the harbor materialization. Its placement in the harbor materialization is necessary for communication with the sled daemon and clients, which live in the harbor materialization until they can be moved into an experimenter's materialization.

At the core of the sledapi is the concept of a sled command (sledcmd). The sledcmd is the playbook for what actions a client should take when it runs sled. The 4 actions a client can take are:

  • wipe: the wipe action indicates that the client should wipe (zero) the disk(s). This is an expensive operation, and should be done on teardown if used rather than on boot.
  • write: the write action writes out a kernel, initramfs, and disk on the client. The kernel and initramfs are artifacts of the disk image, and are generated by the images repository. The kernel and initramfs are written to tmpfs, and cannot exceed the size of memory. The disk image is written to the target block device.
  • kexec: the kexec action does a kernel execute from the currently running client (sledc on u-root) into the target kernel that was provided in the write action step.
  • daemon: the daemon action tells the client to go into daemon mode and to wait until a future actionable command is given.
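The ordering among these actions can be sketched as follows. This is hypothetical code illustrating how a client might act on a sledcmd, not the sledc implementation:

```python
# Sketch: act on a sledcmd in order -- wipe (if present), then write,
# then either kexec into the written kernel or drop into daemon mode.
# Illustrative only; field names mirror the sledcmd fields above.
def run_sledcmd(cmd, dev):
    log = []
    if cmd.get("wipe"):
        log.append(f"zeroing {dev}")
    if cmd.get("write"):
        image = cmd["write"]["image"]
        log.append(f"writing kernel+initramfs to tmpfs, {image} to {dev}")
    if cmd.get("kexec"):
        log.append("kexec into written kernel")
    elif cmd.get("daemon"):
        log.append("entering daemon mode, waiting for commands")
    return log

steps = run_sledcmd({"write": {"image": "debian:10"}, "kexec": True},
                    "/dev/sda")
```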

The sledcmd is stored in etcd, keyed by MAC address, as shown below using the sledctl command. The -S option below is the server IP address of sledapi in the harbor materialization network.

sledctl -S list macs
mac: 00:08:a2:0c:0c:77
sledctl -S get command 00:08:a2:0c:0c:77
mac: 00:08:a2:0c:0c:77
id: 5931ed7c
time: Tue, 11 Aug 2020 14:54:19 UTC
wipe: nil
write: nil
kexec: in progress
time: Tue, 11 Aug 2020 14:54:19 UTC
append: root=PARTUUID=a0000000-0000-0000-0000-00000000000a
rootfstype=ext4 console=tty1 console=ttyS0,115200n8 console=tty1
console=ttyS0,115200n8 earlyprintk=ttyS0 rw net.ifnames=0
biosdevname=0 8250.nr_uarts=1 infranet=eth0
sledctl -S get status 5931ed7c
taskID: 5931ed7c
wipe: not started
write: not started
kexec: in progress

The sledcmd has multiple fields:

  • mac: mac address of the command.
  • id: unique identifier of the command, so that we can query on the status. This id will change with every update to the command set, as each update creates a new command.
  • time: the last time the command was updated.
  • wipe: the wipe command contents.
  • write: the write command contents.
  • kexec: the kexec command contents. The kexec status will never be complete, as it is impossible to update the kexec field upon completion: the client will have already executed the new kernel and operating system.

The sledcmd information is passed to sledapi through the Cogs. On a dematerialization request, the Cogs inserts a default-image write action and a daemon action into the sledcmd. This tells sledc to remain in daemon mode until it receives a new command. It also reduces boot time if the experimenter uses the default image. On materialization, the Cogs inserts new fields, mainly a kexec action, plus a write action if the default image is not used.
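The command construction just described can be sketched as follows. This is illustrative code under assumed field names, not the Cogs implementation:

```python
# Sketch: on dematerialization, insert a default-image write plus a
# daemon action; on materialization, insert a kexec, adding a write
# only when the requested image is not the default. Hypothetical code.
DEFAULT_IMAGE = "debian:10"

def demat_cmd():
    # wipe back to the default image, then wait in daemon mode
    return {"write": {"image": DEFAULT_IMAGE}, "daemon": True}

def mat_cmd(image):
    # the node already holds the default image from dematerialization,
    # so a write is only needed for non-default images
    cmd = {"kexec": True}
    if image != DEFAULT_IMAGE:
        cmd["write"] = {"image": image}
    return cmd
```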


The Moactld infrapod container provides a means for the end-user moacmd tool (which runs in an XDC) to control the network emulation provided by Moa. Configuration is done automatically by the Cogs via the API. Its primary function is to limit an XDC's access to the emulation that is part of the materialization to which the XDC is attached.


Etcd is the primary database used by a Merge testbed facility. Several systems share the same underlying etcd cluster, and namespaced etcd proxies are used to isolate these systems from each other.

The proxies are local to the services that access them. For example, in a deployment where rex and the driver run on one server and the commander runs on another, both servers would run a copy of the cogs gRPC proxy service. The etcd proxies run as systemd services, so you can see them through the basic systemd service management commands.

sudo service cogs-proxy status
sudo service nex-proxy status
sudo service sled-proxy status

These proxies are all configured to point at the etcd cluster running on the storage nodes inside the testbed facility. There are always an odd number of etcd servers in a cluster. Typically, etcd clusters are of size 1 or 3.

The proxy service configuration for the cogs proxy is located at /etc/systemd/system/cogs-proxy.service. Its definition is useful for pointing out a few things about the way etcd is deployed in a Merge testbed facility:

ExecStart=/usr/bin/etcd grpc-proxy start \
--endpoints=https://stor0:2379,https://stor1:2379,https://stor2:2379 \
--listen-addr= \
--namespace=/cogs/ \
--cert=/etc/etcd/db.pem \
--key=/etc/etcd/db-key.pem \
--cacert=/etc/etcd/ca.pem \
--max-send-bytes=536870912 \
  • certificates are held in /etc/etcd
  • Merge at times can use messages that exceed the default etcd threshold, so limits may need tweaking depending on the expected size of experiments.
  • Proxy services listen locally on 2399 for the cogs. Consult the other services for their ports.
  • Proxies act as a TLS boundary, e.g. the actual etcd servers require client authentication, but since the proxies only listen locally they do not. This simplifies service and operations staff interaction models with the etcd system.


Sled is one of the core components for experiment imaging. As described under Sled API earlier in this document, sled has three major components: the client (sledc), which puts an image on a device; the daemon (sledd), which manages communications between the client and the api; and the api (sledapi), which manages etcd storage. The sled controller (sledctl) is an auxiliary tool for users to manage etcd.

Where each sled component runs:

  • sledc runs on the experiment device.
  • sledd container runs on the storage server.
  • sledd-nginx container runs on the storage server (hosts images).
  • sledapi container runs in the harbor infrapod.

Discussion of sledapi in the harbor infrapod can be found earlier in the document under Sled API. This section will mainly cover the sled protocol and management from an operational perspective.

Sled begins with a node booting. The node is configured with PXE boot as the primary boot option and disk booting as the next option, followed by EFI or another recovery boot option in case of failures. Because disk is the secondary boot option, a node in a materialization will not boot from PXE, but will instead use the disk image that was written during the initial sled process. This provides users with the expected outcome of rebooting into the same disk.

Returning to the primary boot path using PXE: when the node initially PXE boots, the DHCP service, which is handled by nex, maintains an entry for the PXE server and file. The node chain boots into this image, which is actually an iPXE image (handled by the tftp container in the harbor network) that contains pointers to sled-server and to the sledc kernel and initramfs. sled-server here is the DNS name, also stored in nex, that points to the sled image mount server.

When the node boots, it boots into a u-root client and runs an initial DHCP request to retrieve an IP address. From there, it runs the sled client (sledc), which requests the sled command (covered in the Sled API section). The client uses every interface to request the sled command, and selects the first interface that can resolve the sled-server name. The request is then sent to sled-server, where the sled daemon (sledd) is running. Sledd then requests from sledapi the command associated with the MAC address. Sledd is also responsible for handling the image copy from the sled-server host back to the client (over HTTP).

From the Cogs perspective, an MzFragment will come down to the driver with NodeSetup requests. For each node in the NodeSetup, a sled entry is created based on the MAC address. The sled code managed by rex checks whether the node is in daemon mode (ready for new commands); if so, it updates the sled command and sends it to the client. If not, it creates a full sled command that includes a write request as well, then restarts the node. The rex sled code then waits until the node is reachable before returning. If the node does not become reachable within a timeout, the task is failed.





Rally is the mass storage provider for experiment nodes. Rally is responsible for creating, managing, and removing users' experimental storage. Unlike mergefs, which lives on the portal, rally provides ephemeral storage that lasts for the lifetime of the site, project, or experiment. Rally can run on one or more of the storage servers. Rally is made up of a single service, rallyd, and the controller tool rallyctl for accessing and modifying rally data.

Rally uses a default configuration file:

# etcd connection (2379 is the default etcd client port)
address: localhost
port: 2379
# Provide TLS settings as follows
#Cacert: /etc/cogs/ca.pem
#Cert: /etc/cogs/etcd.pem
#Key: /etc/cogs/etcd-key.pem
timeout: 10

address: localhost
port: 9950
timeout: 10

# ceph connection (6789 is the default ceph monitor port)
address: localhost
port: 6789
timeout: 10
# cephfs settings
name: rallyfs
datapool: rallyfs-data
metadatapool: rallyfs-meta
quota: 1GB
root: rally
users: users
owners: owners
# block storage pool settings
pool: rados
quota: 1GB
pgnum: 1024
mount: "/mnt/rally"

Rally depends on two other services: etcd and ceph. Most of the Rally configuration file defines ceph attributes. The root and users fields dictate the cephfs path, based off the root Rally mount (/mnt/rally/rally/users).

Two storage methods are currently supported: site and experiment storage. Site storage is created through the mergetb CLI and creates an AssetFragment, while experiment storage is defined in experiment XIR and generates an MzFragment.

Block storage is implemented, but auto-mounting of block storage has not yet been enabled in the foundry configuration.

mergetb new asset filesystem spineleaf test 10GB
mergetb new asset filesystem spineleaf test2 10GB
mergetb list assets
newAsset CEPHFS 10.00gb spineleaf.mergetb.test EXPERIMENT
test CEPHFS 10.00gb spineleaf.mergetb.test SITE
test2 CEPHFS 10.00gb spineleaf.mergetb.test SITE

The asset newAsset was created from the experiment XIR definition: when the portal looked up each asset and could not find newAsset, it added it to the experiment definition so the site would create an asset with the lifetime of the experiment.

Rally is responsible for experiment storage; for now that means only ceph. Here we dive more in-depth into ceph. Ceph is installed by ceph-deploy. The main components of ceph are:

  • Monitor (mon): manages the ceph cluster and forms a quorum with other monitors. Monitors are the brains of the ceph cluster.
  • Metadata Server (mds): manages the metadata of the ceph filesystem.
  • Rados Gateway (rgw): object storage interface built on librados.
  • Object Store Device (osd): storage devices that hold ceph data.
  • Ceph Manager (mgr): manages ceph telemetry (prometheus, alerts, etc.).

Data placement over the OSDs is managed by CRUSH, an algorithm that deterministically distributes data across failure domains to limit the impact of device loss. Data is stored in pools; each pool's placement group (pg) count determines how its objects are distributed across the OSDs.
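The pool/placement-group idea can be illustrated with a toy placement function. This is a simplifying sketch: real Ceph uses rjenkins hashing plus the CRUSH map, not crc32 and round-robin, and the OSD names here are invented.

```python
import zlib

# Toy illustration of placement groups; NOT Ceph's actual algorithm.
def place_object(obj_name, pg_num, osds, replicas=2):
    """Hash an object name to a placement group, then map that PG onto
    `replicas` distinct OSDs."""
    pg = zlib.crc32(obj_name.encode()) % pg_num
    start = pg % len(osds)
    return pg, [osds[(start + i) % len(osds)] for i in range(replicas)]

pg, targets = place_object("rallyfs-data.obj1", 1024,
                           ["osd.0", "osd.1", "osd.2"])
```

The key property is that placement is a pure function of the object name and cluster map, so any client can compute where data lives without a central lookup.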

In order for experimenters to mount the ceph filesystem, they need to know where the monitor is located and the path to their data. Rally does this by maintaining a permissions map on each Rally user, which maps 1-to-1 with a ceph user. The permissions are tied to the ceph secret, which prevents other users from accessing the data. An administrator can access this data through the rallyctl tool.

rallyctl rally show user lincolnthurlow.test
Details for lincolnthurlow.test:
Secret: AQCBFDNfNwLhKhAAEkD3JVu6kNXEdvv1I2mFcA==

During the materialization of nodes, the Cogs make gRPC calls to rally to get this information and place it into the foundry data structure so when the node boots, the data is placed into the node's /etc/fstab to automatically mount the filesystem.
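The fstab entry the node ends up with can be sketched as below, using the kernel cephfs mount syntax. The monitor address, path, and mountpoint are illustrative values, not the real foundry data structure.

```python
# Sketch of building a cephfs /etc/fstab line; values are illustrative.

def cephfs_fstab_line(mon_addr, path, mountpoint, user, secret):
    # Kernel cephfs mount syntax: <mon>:<port>:<path> <mountpoint> ceph <opts>
    opts = f"name={user},secret={secret},_netdev"
    return f"{mon_addr}:6789:{path} {mountpoint} ceph {opts} 0 0"

line = cephfs_fstab_line("10.0.0.5", "/rally/users/lincolnthurlow.test",
                         "/mnt/storage", "lincolnthurlow.test",
                         "AQCBFDNfNwLhKhAAEkD3JVu6kNXEdvv1I2mFcA==")
```

Note how the three pieces Rally tracks (monitor location, per-user path, per-user secret) map directly onto the mount source and options.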


Moa is a daemon that configures network emulation, currently implemented using [Fastclick](https://github.com/mergetb/fastclick). Moa is configured by the Cogs via its API and stores the mapping from user-specified tags in the network model to the network links. It receives commands via moactld and uses Fastclick's control socket interface to dynamically alter network link characteristics as requested.

The moactl CLI application can be used to inspect the emulations. It can be installed via apt install moactl on the emulation servers. It has facilities for listing emulations, showing the details of a specific emulation, and starting or stopping the fastclick processes that perform the actual emulation.


Beluga is a power control daemon. It can control the power state of many different types of devices through a pluggable device framework. Beluga exposes a generic power control interface through which users of the Beluga API control devices. The Beluga daemon has a configuration file that maps device names to their underlying power control model. When Beluga receives a power control command, it looks up the power control protocol for the target device and translates the Beluga command into that protocol.

An example Beluga configuration looks like this.

# address to listen on
# location of etcd storage server
port: 2374
# devices under power control of this daemon
# ipmi controlled device
controller: ipmi
# minnow controlled device
controller: minnow
# APC controlled device
controller: apc
# details for the controllers
# ipmi connection and authentication parameters
a0: { IP: "", Username: "ADMIN", Password: "ADMIN" }
# apc host and outlet parameters
pdu: pdu0
outlets: [4, 16]
# minnow chassis and device index parameters
chassis: mc0
index: 0

Here we see the configuration space partitioned into two categories, the device parameters and the power controller parameters. The device parameters make devices controllable through the Beluga API and link each device to a set of control parameters. The control parameters tell Beluga how to interact with the power controller for the specified device.
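The lookup-and-translate flow can be sketched as a small dispatch table. The device names, parameter shapes, and the concrete command translations here are hypothetical (though the ipmitool flags shown are its real syntax); Beluga's actual plugin framework is not structured this way.

```python
# Hypothetical sketch of Beluga-style controller dispatch; illustrative only.

CONFIG = {
    "devices": {"n0": "ipmi", "pdu-dev": "apc"},
    "params": {
        "n0": {"IP": "10.0.0.9", "Username": "ADMIN", "Password": "ADMIN"},
        "pdu-dev": {"pdu": "pdu0", "outlet": 4},
    },
}

def power_command(device, action):
    """Translate a generic power action into a controller-specific command."""
    controller = CONFIG["devices"][device]
    p = CONFIG["params"][device]
    if controller == "ipmi":
        return ["ipmitool", "-H", p["IP"], "-U", p["Username"],
                "-P", p["Password"], "chassis", "power", action]
    if controller == "apc":
        # e.g. drive a PDU outlet; the tool name here is made up
        return ["apc-ctl", p["pdu"], str(p["outlet"]), action]
    raise ValueError(f"no controller configured for {device}")
```

The API caller only ever sees the generic (device, action) interface; the per-protocol detail lives entirely behind the dispatch.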

Beluga runs as a systemd service

sudo service belugad status

and is available through apt packages

sudo apt install belugactl belugad


GoBGP is a border gateway protocol (BGP) daemon that runs on all testbed servers that provide services directly to testbed nodes over the infranet or xpnet. GoBGP implements the BGP protocol and provides a gRPC interface for programmatic control. The way these networks are put together through protocols like BGP will be explained in more detail in the infranet and xpnet sections. Here we will simply go over the mechanics of GoBGP operations.

GoBGP runs as a systemd service

sudo service gobgpd status

You'll find GoBGP running on

  • Infrapod host servers
  • Sled imaging servers
  • Moa emulation servers

GoBGP also comes with a command line tool.


The neighbor command shows us information about who we are directly peered with.

gobgp neigh
Peer AS Up/Down State |#Received Accepted
enp113s0f0 64701 31d 06:45:13 Establ | 1904 1447

This tells us that the peering session on this node has been up for 31 days, and has received 1904 routes from this peer.

Underlay Routes

The global routing information base shows us what underlay addresses are routable from our current location. The underlay network provides the foundation over which a set of overlay networks may be created.

gobgp global rib
Network Next Hop AS_PATH Age Attrs
*> fe80::526b:4bff:fe8e:8bfa 64701 31d 06:47:01 [{Origin: i} {Med: 0}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64703 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64704 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64705 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64706 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64706 64713 14d 19:22:23 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64705 64712 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64704 64711 31d 06:47:01 [{Origin: i}]
*> fe80::526b:4bff:fe8e:8bfa 64701 64702 64703 64710 31d 06:47:01 [{Origin: i}]
*> 1d 01:46:16 [{Origin: i}]

You should be able to ping any of these addresses if Gobble is running and has provisioned the corresponding routes.

Overlay Routes

The global routing information base for Ethernet virtual private networks (EVPN) shows us what overlay addresses are routable from our current location. The overlay network provides an isolated communication substrate for materialization networks that do not interact with other experiment networks or testbed infrastructure networks.

gobgp global rib -a evpn
Network Labels Next Hop AS_PATH Age Attrs
*> [type:macadv][rd:][etag:0][mac:00:08:a2:0d:de:64][ip:<nil>] [2] 64701 64702 64705 31d 06:52:47 [{Origin: i} {Extcomms: [64705:2], [VXLAN]} [ESI: single-homed]]
*> [type:macadv][rd:][etag:0][mac:be:6e:a1:ec:58:7d][ip:<nil>] [118] 26d 13:52:10 [{Origin: i} {Extcomms: [VXLAN], [64799:118]} [ESI: single-homed]]
*> [type:macadv][rd:][etag:0][mac:e2:b9:9a:0a:6a:65][ip:<nil>] [119] 68d 05:43:13 [{Origin: i} {Extcomms: [VXLAN], [64799:119]} [ESI: single-homed]]
*> [type:macadv][rd:][etag:0][mac:00:08:a2:0d:ca:8e][ip:<nil>] [2] 64701 64702 64703 1d 06:27:41 [{Origin: i} {Extcomms: [64703:2], [VXLAN]} [ESI: single-homed]]
*> [type:macadv][rd:][etag:0][mac:00:08:a2:0d:c9:c2][ip:<nil>] [100] 64701 64702 64703 00:34:37 [{Origin: i} {Extcomms: [64703:100], [VXLAN]} [ESI: single-homed]]
*> [type:macadv][rd:][etag:0][mac:00:08:a2:0d:dc:ea][ip:<nil>] [118] 64701 64702 64705 00:09:20 [{Origin: i} {Extcomms: [64705:118], [VXLAN]} [ESI: single-homed]]
*> [type:multicast][rd:][etag:0][ip:] 25d 06:08:02 [{Origin: i} {Extcomms: [VXLAN], [64706:127]}]
*> [type:multicast][rd:][etag:0][ip:] 64701 64702 64705 04:59:21 [{Origin: i} {Extcomms: [64705:105], [VXLAN]} {Pmsi: type: ingress-repl, label: 105, tunnel-id:}]
*> [type:multicast][rd:][etag:0][ip:] 64701 64702 64704 19d 04:10:30 [{Origin: i} {Extcomms: [64704:128], [VXLAN]} {Pmsi: type: ingress-repl, label: 128, tunnel-id:}]
*> [type:multicast][rd:][etag:0][ip:] 64701 64702 64705 64712 31d 06:52:47 [{Origin: i} {Extcomms: [64705:2], [VXLAN]}]

On an active testbed there will be a lot of EVPN overlay routes, thousands or tens of thousands, depending on the size of the testbed facility. EVPN routes come in two flavors:

  • Type-2 (macadv): These routes define layer-2 segments by providing reachability information for MAC addresses across the overlay within the context of a virtual network identifier (VNI). A VNI is an integer value that defines the isolation domain of the route. This is what provides separation between segments. So if the VNI on one advertisement is 100 and 101 on another, the destinations will not be able to communicate.
  • Type-3 (multicast): These routes define layer-3 reachability information in terms of underlay addresses. It's a way of saying that for a given VNI, there are targets of interest at the specified underlay address. This serves the function of routing broadcast, unknown-unicast, and multicast (BUM) packets. When a packet is a BUM packet, it is replicated to all Type-3 destinations within the VNI context.
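The forwarding behavior implied by the two route types can be modeled in a few lines. The tables and addresses below are invented for illustration; a real VTEP keeps this state in the kernel FDB, not a dict.

```python
# Toy model of Type-2 unicast lookup vs Type-3 BUM flooding; illustrative only.

type2 = {(2, "00:08:a2:0d:de:64"): "fe80::a"}          # (vni, mac) -> next hop
type3 = {2: ["fe80::a", "fe80::b"], 100: ["fe80::c"]}  # vni -> flood list

def next_hops(vni, dst_mac):
    hop = type2.get((vni, dst_mac))
    if hop is not None:
        return [hop]              # known unicast: a single VXLAN tunnel
    return type3.get(vni, [])     # BUM traffic: replicate to all Type-3 peers
```

Note that a MAC known in VNI 2 is unreachable from VNI 100, which is exactly the isolation the VNI provides.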


GoBGP is installed by installing Gobble.


The following is an example of a GoBGP configuration located at /etc/gobgp.yml

# bgp autonomous system number
as: 64799
# the id of this router
# the neighbors i am expecting to reach
- config:
# interface the neighbor is reachable on
neighbor-interface: enp113s0f0
# the autonomous system number of the neighbor
peer-as: 64701
# BGP dialects we expect the neighbor to speak
- config:
afi-safi-name: ipv4-unicast
- config:
afi-safi-name: ipv6-unicast
- config:
afi-safi-name: l2vpn-evpn
# neighbor connection retry logic
connect-retry: 5
hold-time: 9
keepalive-interval: 3


GoBGP speaks a variety of BGP dialects and provides a gRPC API and command line client to inspect and manipulate the routing and forwarding space for a variety of BGP-based protocols. What it does not do is create any routing or forwarding tables on the machine it is running on. That is Gobble's job. Gobble uses the GoBGP gRPC API to inspect the reachable EVPN overlay routes of the node it is running on and updates the underlying Linux routing and forwarding tables accordingly.
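Gobble's poll-and-reconcile behavior can be sketched as a single quantum of work. The RIB fetch is stubbed with a callable and the kernel table is a dict; the real Gobble talks to GoBGP over gRPC and programs routes via netlink.

```python
# Sketch of one Gobble reconcile pass; stubs stand in for gRPC and netlink.

def sync_routes(fetch_rib, kernel_table):
    """One polling quantum: install routes that appeared, drop ones that vanished."""
    desired = fetch_rib()
    for route in desired - set(kernel_table):
        kernel_table[route] = "installed"
    for route in set(kernel_table) - desired:
        del kernel_table[route]
    return kernel_table

table = sync_routes(lambda: {"vni2/aa:bb", "vni2/cc:dd"},
                    {"vni9/stale": "installed"})
```

Running this every `quantum` milliseconds (per the config below) keeps the Linux tables converged with whatever the EVPN control plane currently advertises.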

Gobble runs as a systemd service

sudo service gobble status


sudo apt install gobble


The gobble configuration file is located at /etc/gobble.yml

# how many milliseconds to wait between GoBGP polls
quantum: 1000
# port we can reach GoBGP on
gobgpd_port: 50051
# What address to use for the pseudo-peer; if you do not know what you are
# doing, don't change this
# The subnet of the BGP underlay network
# Enable trace level logging - VERY noisy, will fill your disks with logs in a
# hot minute
trace: false
# The routing table id to use
# - 254 for the default table
table: 254
# Network interface to set up peer forwarding and routing for
peer_ifx: enp113s0f0


Canopy is a framework for creating virtual networks. There are 3 main parts:

  • A client that lets you efficiently manage virtual networks across a switching mesh.
  • A daemon that runs on switches and servers that implements virtual network synthesis commands received by clients.
  • An API and library for programmatically managing virtual networks.

Inspecting Virtual Networks

Using the Canopy client you can inspect the state of any switch or server in the testbed that is running the Canopy daemon. Platforms in a Merge testbed that run the Canopy daemon are:

  • All switches
  • Infrapod servers
  • Storage servers
  • Network emulation servers
  • Physics simulation servers

The network state of a host can be inspected with canopy as follows.

canopy list isp0
name mtu bridge access tagged untagged
eth0 1500 0 [] []
if0 9216 0 [] []
if1 9216 0 [] []
if2 9216 0 [] []
if3 9216 0 [] []
swp1 9216 if0 0 [] []
swp10 9216 0 [] []
swp11 9216 if3 0 [] []
swp12 9216 if3 0 [] []
swp13 9216 if3 0 [] []
swp14 9216 if2 0 [] []
swp15 9216 if2 0 [] []
swp16 9216 if2 0 [] []
swp2 9216 if0 0 [] []
swp3 9216 if0 0 [] []
swp4 9216 if1 0 [] []
swp5 9216 if1 0 [] []
swp6 9216 if1 0 [] []
swp7s0 9216 bridge 13 [] [13]
swp7s1 9216 bridge 2 [] [2]
swp7s2 9216 bridge 2 [] [2]
swp7s3 9216 bridge 30 [] [30]
swp8 9216 0 [] []
swp9 9216 0 [] []
name vni mtu access device bridge learning tunnel-ip access tagged untagged
vtep103 103 9166 13 bridge Off 13 [] [13]
vtep120 120 9166 30 bridge Off 30 [] [30]
vtep136 136 9166 46 bridge Off 46 [] [46]
vtep2 2 9166 2 bridge Off 2 [] [2]

This output shows us the physical interfaces, VXLAN tunnel endpoints, bridges and their state. This display does not show it, but the interfaces are also colored green if they are up and red if they are down.

Managing Virtual Networks

The Canopy client also lets you modify virtual network state. For example, for the above server isp0, if we wanted to add the VLAN tags 47 and 99 to interfaces if0, if1 and if3, we could do the following.

canopy set port tagged 47,99 if0,if1,if3 isp0

This is the general structure of most commands: a command stanza, a set of parameter values, a set of ports to apply those values to, and a set of hosts to apply the command to. In the above example we had only one host target, but we can extend the command to multiple hosts.

canopy set port tagged 47,99 if0,if1,if3 isp0,isp1
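The command structure above decomposes cleanly into values, ports, and hosts. The sketch below parses the example invocation; the function name and the returned dict shape are invented for illustration, not Canopy's internals.

```python
# Sketch of parsing a `canopy set port tagged ...` invocation; names invented.

def parse_set_port_tagged(argv):
    """Parse `set port tagged <vlans> <ports> <hosts>` into its parts."""
    if argv[:3] != ["set", "port", "tagged"]:
        raise ValueError("unsupported command")
    vlans, ports, hosts = (a.split(",") for a in argv[3:6])
    return {"vlans": [int(v) for v in vlans], "ports": ports, "hosts": hosts}

req = parse_set_port_tagged(
    ["set", "port", "tagged", "47,99", "if0,if1,if3", "isp0,isp1"])
```

The cross product of ports and hosts is what actually gets configured: here, 3 ports on each of 2 hosts receive both tags.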

Internally the components use the Canopy API to create virtual networks for experiments in an automated way.

The Canopy Daemon

In order for the Canopy client CLI or library to be able to setup networks on devices, the Canopy daemon must be running on those devices. The Canopy daemon runs as a systemd service

sudo service canopy status

There is no configuration for Canopy; it just needs to be installed and run. It uses the Linux netlink socket to get all the information it needs and to implement the command requests it receives.

At the time of writing, canopy configurations are not persistent: if the host is rebooted, the configuration is lost and the host will need to be re-configured.



sudo apt install canopy-client


sudo apt install canopy-server

Node Booting, Imaging, Initialization, & Debugging

When a node is not in use on the testbed, it sits in a special virtual network called the harbor.

When a node first boots on the testbed, its firmware must be configured to PXE boot. This is so the testbed can take control of the node. There is a DHCP server running on the harbor that provides BOOTP protocol information to PXE clients sufficient to load a bootstrapping PXE or UEFI image from a TFTP server on the harbor.

Both the DHCP server and the TFTP server run as containers in the harbor's infrapod.


The name of the harbor is main.harbor.<testbed-name>

$ sudo ctr -n main.harbor.spineleaf c ls
etcd docker.io/mergetb/etcd:latest io.containerd.runtime.v1.linux
foundry docker.io/mergetb/foundry:v0.1.12 io.containerd.runtime.v1.linux
nex docker.io/mergetb/nex:v0.5.5 io.containerd.runtime.v1.linux
sledapi docker.io/mergetb/sledapi:v0.8.0 io.containerd.runtime.v1.linux
tftp docker.io/mergetb/tftp:v0.1.4 io.containerd.runtime.v1.linux

You can see the BOOTP protocol options by inspecting the networks of the dhcp server.

$ nex -s get networks

This shows that we have two distinct harbor networks. One main network that serves EFI firmware on boot, and one bios network that serves legacy BIOS firmware on boot. The static network is for providing DNS resolution for service addresses such as foundry.

We can inspect the DHCP/BOOTP configuration of the main harbor network as follows.

$ nex -s get network main.harbor.spineleaf
name: main.harbor.spineleaf
domain: main.harbor.spineleaf
lease_duration: 14400s
67 pxe.efi

This shows us that the EFI image PXE clients will be directed to is called pxe.efi.

To see exactly which nodes will be picked up on this network, we need to look at the network membership.

$ nex -s get members main.harbor.spineleaf
mac name ip4 client
04:70:00:01:70:11 n6.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:12 n7.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:1a n0.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:1b n1.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:1c n2.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:1d n3.main.harbor.spineleaf (3:59:44)
04:70:00:01:70:1e n4.main.harbor.spineleaf (3:59:59)
04:70:00:01:70:1f n5.main.harbor.spineleaf (3:59:59)

This list shows that membership for this network is based on fixed MAC address mappings. Other mapping mechanisms are available. On testbeds that have only a single type of firmware, a Nex network that picks up all nodes may be defined. Nex also allows MAC ranges to be specified, so devices can be targeted by OUI.
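The membership mechanisms just described can be sketched as a first-match rule table. The rule encoding below is hypothetical, not Nex's actual data model; the MACs come from the listings above.

```python
# Sketch of first-match network membership; rule encoding is invented.

NETWORKS = {
    "main.harbor.spineleaf": {"kind": "macs",
                              "macs": {"04:70:00:01:70:1a", "04:70:00:01:70:1b"}},
    "bios.harbor.spineleaf": {"kind": "oui", "prefix": "00:08:a2"},
}

def match_network(mac):
    """Return the first network whose membership rule matches the MAC."""
    mac = mac.lower()
    for name, rule in NETWORKS.items():
        if rule["kind"] == "macs" and mac in rule["macs"]:
            return name
        if rule["kind"] == "oui" and mac.startswith(rule["prefix"]):
            return name
    return None
```

A fixed-MAC rule pins specific nodes to a network, while an OUI prefix rule captures an entire hardware vendor's devices at once.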

The TFTP server container is pre-loaded with 4 basic bootloaders by the testbed software. These bootloaders are built from the MergeTB fork of the iPXE firmware

  • pxe.efi: EFI bootloader that will attempt to use all interfaces it can find.
  • pxe.bios: BIOS bootloader that will attempt to use all interfaces it can find.
  • snponly.efi: EFI bootloader that will only use the network interface it was chain loaded from.
  • snponly.bios: BIOS bootloader that will only use the network interface it was chain loaded from.

The sole task of this bootloader, in whatever form it comes, is to load the Sled imaging kernel. The Sled imaging kernel is a lightweight Linux OS that loads purely into the memory of the machine and performs OS imaging operations.


When Sled boots on the node, it attempts to contact the sled server running in the harbor, asking for instructions. At this point, if the node is undergoing a materialization, the sled server sends back a message telling the sled imaging client what image to pull down. The sled client downloads the specified image and writes it to the disk specified by the sled server.

The Sled client also downloads the kernel for the specified OS separately, and then kexecs directly into that kernel. This saves a reboot cycle - which can be a very costly operation, especially on server grade hardware.

Sled is also responsible for providing sufficient kernel parameters for the node to boot. These parameters are a part of the testbed model, and the testbed automation systems that configure the nodes for booting provide this information to Sled from the testbed model. This includes:

Root file system location

Merge OS images use the GPT partitioning scheme. By default, Merge images identify the root filesystem partition by its partition UUID (PARTUUID). Note that this is a partition-level label and not a filesystem-level label.


Serial console settings

Serial console settings are necessary to ensure that once the node boots, its output is available via the serial console. This is a critical lifeline when the network connection on a node goes sideways. An example is as follows


If you have hardware that sometimes has issues booting, the kernel earlyprintk option can be very helpful.


Bootstrap network interface

When a node boots, it needs to know how to bootstrap the network. This is provided to the node configuration system by a kernel parameter called infranet.


Experiment Images

Experiment images are created using the images repo. Currently images are generated using Packer. Packer takes an image definition in the form of a JSON template and outputs an image as defined by the template. In the case of Merge and Sled, these images are output as raw images, which Sled writes directly to the block device.

There are 3 outputs of the packer build process that merge requires:

  • Disk image
  • Kernel
  • Initramfs

The disk image is generated from the template itself. The kernel and initramfs are extracted by scripts in the images repo. Most operating systems ship multiple kernels and initramfs images, so one script creates a symlink with a sled- prefix that maps to the currently used version. As the prefix denotes, this is used by Sled during the kernel execution (kexec) into the operating system.

Currently experiment images are segmented by operating system and firmware (bios, efi). This matters because, as discussed above for the root filesystem partition, it is the image that is responsible for creating that filesystem, and the firmware and bootloaders also need to know about the partitions that are created. For example, in an EFI-based system we need to identify the EFI partition.

PARTUUID=10000000-0000-0000-0000-000000000001 /boot/efi vfat defaults 0 2
PARTUUID=a0000000-0000-0000-0000-00000000000a / ext4 defaults 0 1

This enables the images to survive reboot. While the initial boot is done from a kexec, additional boots are done through standard grub bootloading.

Experimental images are rarely updated, but when they are they generally fall into 3 categories of updates:

  • Operating System update (upgrade version)
  • Operating System packaging/tools (add prerequisite package)
  • Mergetb packaging (foundryc, retty)

Updating Testbed images

There are three types of images used during the mergetb imaging process:

  • PXE Image
  • Sled Image
  • Experiment Image

We've previously discussed each of these images in the sections above. To recap briefly: PXE images chainload into the sled image, the sled image boots into the experiment image, and the experiment image is what experimenters use. Below we cover a few more specifics for each, to address how to update a testbed image.

Updating PXE Images

The PXE images live on the infrapod host and are mounted into the tftp container. The excerpt below shows the specification for the harbor as shown above, particularly the mount source and destination. The tftp container is hard-coded to mount at /srv, but the source location is left to the facility administrator.

# ipxe
- kind: container
mzid: main.harbor.spineleaf
action: launch
name: tftp
namespace: main.harbor.spineleaf
vindex: 2
image: "docker.io/mergetb/tftp:v0.1.4"
DOMAIN: main.harbor.spineleaf
- name: ipxe-location
source: /srv/pxe
destination: /srv
options: [rbind, ro]
type: bind

When updating a PXE image, where to place the updated image depends on the source value. It is common to need to build a new PXE image for new physical hardware, to specify console settings, or to add Linux kernel parameters necessary for sled to work.

kernel http://sled-server/pxe/sledc-kernel root=/dev/ram0 console=tty0 console=ttyS0,115200n8 console=tty1 console=ttyS1,115200n8 earlyprintk=ttyS0 initrd=sledc-dhcp-initramfs
initrd http://sled-server/pxe/sledc-dhcp-initramfs

After building the PXE image, the destination depends on the firmware (bios or efi) and on how the Nex networks were configured. Recall from the booting section that Nex plays a role in how nodes PXE boot. So the names used should match those in the Nex entries being mounted by the tftp container.

Updating Sled Images

Sled images mainly consist of the u-root binary with sledc built on top. Sled is the imaging protocol, and will likely be updated for a variety of reasons: a u-root version bump, kernel packages, a sled protocol change, or a boot process change.

As noted in the iPXE section above, the sled images are served from http://sled-server/pxe. This again depends on Nex DNS resolution and the webserver running on that host. In most testbeds this webserver is the slednginx container. The code that specifies where images live is here. For slednginx, the destination is hard-coded as /var/img on the sled-server host for experiment images, and /var/img/pxe for these sledc images. The pxe directory here denotes that these images are loaded by PXE, not that they are iPXE images as described in the section immediately above.

Updating Experiment Images

Testbed experiment images are served by sled. The location of those images is also based on the webserver servicing HTTP requests (default /var/img). However, the sled protocol actually relies on the value given by the sled command:

root@cmdr:/home/rvn$ sledctl get command 00:08:a2:0d:e2:92
mac: 00:08:a2:0d:e2:92
id: ec58ff91
time: Sun, 07 Jun 2020 23:27:40 UTC
wipe: nil
write: complete
time: Sun, 07 Jun 2020 23:28:45 UTC
image: ubuntu-1804-disk
kernel: ubuntu-1804-kernel
initrd: ubuntu-1804-initramfs
device: sda

Within the write command, the name of each entity (image, kernel, initrd) is both a filename and a path. So sledc will generate an HTTP GET request for http://sled-server/ubuntu-1804-disk. The root directory of the webserver and the image name stored in etcd for sled together determine the location and name of the experiment image.
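The name-to-URL mapping can be made concrete with the values from the sledctl output above. The helper name is invented for illustration; sledc's actual request code differs.

```python
# Sketch of how write-command entity names become HTTP GET targets.

def image_urls(server, write_cmd):
    """Each entity name is used directly as the path under the webserver root."""
    return {k: f"http://{server}/{write_cmd[k]}"
            for k in ("image", "kernel", "initrd")}

urls = image_urls("sled-server", {"image": "ubuntu-1804-disk",
                                  "kernel": "ubuntu-1804-kernel",
                                  "initrd": "ubuntu-1804-initramfs"})
```

So renaming an image on disk without updating the etcd entry (or vice versa) breaks the fetch, which is why the two must be kept in sync.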

Recommendations for updating images

There are several recommendations for updating any of the above images on the testbed.

  • Prior to deployment on the physical testbed, attempt to verify that the image works under similar conditions in a staged virtual environment.
  • Disable materialization to the site via the Merge API prior to copying the images over. The testbed is a dynamic system; during an image update, a materialization could pick up a partially copied image.
  • Update sled images for sled daemons. Sled clients in daemon mode act as a cache of images: when a node dematerializes, the cogs place the node into daemon mode and download the default images. When images are copied over to the new location, if the default image or the sled client itself has changed, all nodes in daemon mode are still running the old cached version. For testbed images, update the sled daemon entries with the latest version (a write to the etcd key will cause an update to occur). For sled client images, either reboot the node or use the Sledc Upgrade command to kexec into the new sledc image.


At this point the user specified OS is booting. All MergeTB compatible images run a node configuration daemon on startup.


This section covers a few more specifics of how sled interacts with the cogs, where problems tend to arise, and how to solve them.

The most common issue that comes up is when the cogs display failed to daemon node, as shown below.

time mzid taskid stageid actionid kind instance action deps complete error masked
19 Nov 20 16:23:55.99 PST **** ****** ******* ******** NodeRecycle ******* n73 (n62) [eOhxeLthS rmZmIw7px] false failed to daemon node false

This task is the most common failure because it is primarily responsible for imaging a testbed node and has the largest surface for errors. The first step in debugging these nodes is to grab a console session to the node, delete the task error, and watch what happens.

The most common failures are:

  • PXE DHCP Failure
    • This leaves the node to follow its boot order; with an unsuccessful PXE boot, the node ends up in the last operating system installed on the disk.
  • DNS Resolution Failure or Sledd/Networking
    • If sled-server fails to resolve, or sledd is unable to complete the transaction, the node ends up in u-root with a > elvish shell. On sledc nodes the logs can be found at /tmp/sled.log and the database at /tmp/sled.db. You can read these files for more details.
  • Latency
    • When hundreds to thousands of nodes reboot at once, it is common for a node to fail to download its images in time. In this case you will also find the node in u-root.

If the console is not working, sledctl has a built-in command line interface that can be used to interact with nodes that have made it to sledc.

sledctl ping 00:08:a2:0d:e8:96
00:08:a2:0d:e8:96: alive
sledctl run 00:08:a2:0d:e8:96 "cat /tmp/sled.log"
00:08:a2:0d:e8:96: output: time="2020-11-19T23:13:46Z" level=info msg="testing: lo"
time="2020-11-19T23:13:46Z" level=info msg="testing: eth0"
time="2020-11-19T23:13:47Z" level=info msg="selected: eth0"
time="2020-11-19T23:13:47Z" level=info msg="Attempting: 00:08:a2:0d:e8:96"
sledctl run 00:08:a2:0d:e8:96 'kexec --cmdline \"root=PARTUUID=a0000000-0000-0000-0000-00000000000a rootfstype=ext4 console=tty1 console=ttyS0,115200n8 console=tty1 console=ttyS0,115200n8 earlyprintk=ttyS0 rw net.ifnames=0 biosdevname=0 8250.nr_uarts=1 infranet=eth0\" -i /tmp/initrd -l /tmp/kernel'

sledctl ping is a daemon command for checking if a sled node is listening. This is used directly by the cogs to check if a node is already configured. sledctl run allows for remote command execution on the daemon node.


The infranet is a flat network that interconnects every node in an experiment. There is one infranet per experiment. The infranet serves the following basic functions.

  • Provides access to experiment resources through XDC wireguard connections.
  • Allows every experiment node to reach every other node for the purposes of experiment automation.
  • Provides external network access to nodes, most commonly the Internet.
  • Provides mass storage access to experiment nodes.
  • Provides DHCP/DNS to experiment nodes.
  • Hosts a materialization-specific node configuration server.
  • Hosts API endpoints for emulated network control and physics simulation control.

When an experiment is materialized the following things happen to construct the infranet for that experiment.

Infrapod Network Enclave

The first step in establishing an infranet is creating a network enclave for the infrapod. As a reminder, an infrapod is a set of containers sharing a common network namespace that collectively provide the basic infrastructure for an individual experiment materialization.

The network enclave for an infrapod contains the network elements that allow it to communicate with the nodes in an experiment. This includes:

Virtual Ethernet Pair

A pair of virtual ethernet devices straddles the network namespace that encapsulates the containers of the infrapod and the default (sometimes called init) namespace of the host. The veth device internal to the namespace is always named ceth0. The veth device in the default namespace is named ifrX, where X is the integer ID of the materialization.

Materialization Bridge

The materialization bridge, called mzbrX where X is the integer ID of the materialization, connects the external veth device to a VTEP device. This bridges the infrapod's namespaced network elements with the infranet of the materialization through the VTEP.


VTEP

The VTEP, called vtepX where X is the integer ID of the materialization, provides access to the experiment infranet that spans the testbed switching mesh through a VXLAN tunnel and a set of routes and forwarding entries through that tunnel.
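Continuing the sketch for a hypothetical materialization with ID 100, the bridge and VTEP might be plumbed as follows. The VNI, the underlay interface eth1, and the device names are assumptions; the automation system performs the equivalent steps.

```shell
# Illustrative only (requires root): bridge the external veth end
# to a VXLAN VTEP for VNI 100.
ip link add mzbr100 type bridge
ip link add vtep100 type vxlan id 100 dstport 4789 dev eth1 nolearning
ip link set ifr100 master mzbr100    # external veth end onto the bridge
ip link set vtep100 master mzbr100   # VTEP onto the bridge
ip link set mzbr100 up
ip link set vtep100 up
```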

Infranet Virtual Network

The infranet exists as a virtual network that spans the switching mesh of a testbed facility. Depending on the design of the testbed, this may be a pure VXLAN virtual network or may be a combination VXLAN/VLAN virtual network. In the example that follows we present a combined VXLAN/VLAN implementation.

Consider the infranet depicted in the following diagram.

Here, the nodes shaded in blue are the members of the materialization. For Merge testbed facilities, we typically refer to switches in three categories.

  • Leaf switches: Provide only VLAN based access to testbed virtual networks.
  • Fabric switches: Provide VXLAN (and possibly VLAN) based access to testbed virtual networks.
  • Spine switches: Interconnect fabric switches and are pure underlay, i.e., they transit VXLAN-encapsulated traffic but do not take part in encap/decap themselves or actively participate in VXLAN control-plane protocols such as EVPN.

In this example the infrapod has a VTEP whose parent interface is the physical interface of the infrapod server that connects directly to a spine switch. The nodes that the infrapod must communicate with are below leaf switches. The way this all comes together is the following.

  1. The testbed automation system creates VLAN access ports for each of the nodes on the infranet.
  2. A VLAN trunk is created on the uplink of each leaf switch and the corresponding downlink of each fabric switch.
  3. A VTEP is created on each fabric switch, attached to the bridge of that switch, and given the same VLAN access tag as the fabric downlink in (2). This funnels all traffic from the node into this VTEP.
  4. The switch is configured to advertise the virtual network identifier (VNI) of every VTEP that is created on it, so all peer routers are aware of its existence.
  5. The GoBGP router running on the infrapod host sees the advertisements from the fabric switches and saves them to its local routing information base (RIB).
  6. The Gobble daemon running on the infrapod host sees the new advertisements in the RIB through periodic polling (once a second by default) and adds corresponding routing and forwarding entries to the server it is running on, so that the corresponding nodes are reachable through its local vtepX interface.
  7. The testbed automation system creates an EVPN advertisement for the internal infrapod interface ceth0. All of the fabric nodes in the testbed see this advertisement and create the corresponding routes through the VTEPs that were created on the VNI specified in the advertisement.
  8. At this point bidirectional communication has been established between the nodes, and between the nodes and the infrapod.
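The switch-side portion of the steps above (1-3) can be sketched with Linux bridge commands as they might appear across a leaf and a fabric switch running Cumulus Linux. The port names, VLAN 10, VNI 100, and the tunnel-ip variable are illustrative assumptions, not the exact commands the automation issues.

```shell
# Illustrative only (requires root on the switch).
bridge vlan add dev swp1 vid 10 pvid untagged      # (1) node-facing VLAN access port
bridge vlan add dev swp3 vid 10                    # (2) trunk the VLAN toward the fabric
ip link add vtep100 type vxlan id 100 local "$TUNNEL_IP" dstport 4789 nolearning
ip link set vtep100 master bridge                  # (3) attach the VTEP to the switch bridge
bridge vlan add dev vtep100 vid 10 pvid untagged   # (3) same VLAN access tag as the downlink
```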

Inspecting the infranet from an infrapod server

We can see many elements of an infranet from an infrapod server. First, let's take a look at a materialization's metadata through the cog tool:

sudo cog show mz one.rtr.ry
100 100 10
n4 c debian:10 04:70:00:01:70:1e (3:41:49) ileaf1.swp1 10

This shows us that the VXLAN VNI associated with the infranet for this materialization is 100. We also see that the virtual network index of this experiment is 100. It's common for the VNI and vindex for an experiment to be the same number.


Because the infranet is a flat network, there is always just a single VNI for each infranet.

If we look at the network of the infrapod server hosting this infrapod we can see the following.

ip -br -c addr | grep 100
mzbr100 UP fe80::bcd6:70ff:fe79:ce29/64
ifr100@if2 UP fe80::3cf1:93ff:feb3:bd8b/64
svc100@if3 UP fe80::8002:a3ff:fe02:89bb/64
vtep100 UNKNOWN fe80::aca3:79ff:fe8a:829f/64

Likewise, we can peer into the network namespace for this materialization:

rvn@cmdr:~$ sudo ip netns exec one.rtr.ry ip -br -c addr
lo UNKNOWN ::1/128
ceth0@if94 UP fe80::ecf8:c9ff:fe77:d9e0/64
ceth1@if95 UP fe80::64e7:d4ff:fe52:3188/64

The EVPN/BGP state can be inspected through the GoBGP tool. To see the underlay network:

gobgp global rib
Network Next Hop AS_PATH Age Attrs
*> fe80::5054:ff:fe58:307 64704 64701 20:00:51 [{Origin: i}]
*> 02:29:43 [{Origin: i}]
*> fe80::5054:ff:fe58:307 64704 64701 64720 20:00:51 [{Origin: i}]
*> fe80::5054:ff:fe58:307 64704 64701 64721 20:00:51 [{Origin: i}]
*> fe80::5054:ff:fe58:307 64704 64701 64722 20:00:51 [{Origin: i}]
*> fe80::5054:ff:fe58:307 64704 64701 64723 20:00:51 [{Origin: i}]
*> fe80::5054:ff:fe58:307 64704 20:00:51 [{Origin: i} {Med: 0}]

This shows us the other routers that can be reached from this node. The entry with no next hop is the node we are on. The other routers in the network are the fabric switches in the switching mesh, storage servers, and other infrapod servers. The IPs associated with each of these routers are what we refer to as tunnel-ips, because they are the tunnel entry points for VXLAN networks. The VXLAN networks are laid over the top of these tunnel entry points; for that reason, the tunnel-ip network is called the underlay network and the VXLAN network is called the overlay network.

We can also inspect the state of the overlay network. For any non-trivially sized testbed, the overlay network can have a very large number of entries. Here we focus our attention on the entries involving VNI 100, which belongs to the materialization we are looking at.

gobgp global rib -a evpn | grep -E '(Network|100)'
Network Labels Next Hop AS_PATH Age Attrs
*> [type:multicast][rd:][etag:0][ip:] 02:34:57 [{Origin: i} {Extcomms: [VXLAN], [64705:100]}]
*> [type:macadv][rd:][etag:0][mac:04:70:00:01:70:1e][ip:<nil>] [100] 64704 64701 00:52:33 [{Origin: i} {Extcomms: [64701:100], [VXLAN]} [ESI: single-homed]]
*> [type:multicast][rd:][etag:0][ip:] 64704 64701 02:34:33 [{Origin: i} {Extcomms: [64701:100], [VXLAN]} {Pmsi: type: ingress-repl, label: 16777213, tunnel-id:}]
*> [type:macadv][rd:][etag:0][mac:ee:f8:c9:77:d9:e0][ip:<nil>] [100] 02:34:57 [{Origin: i} {Extcomms: [VXLAN], [64705:100]} [ESI: single-homed]]

Here we see two types of routes:

  • multicast: These routes determine how three classes of traffic (broadcast, unknown unicast, and multicast, collectively known as BUM) are forwarded. All egress traffic that falls within the BUM class is sent to every router for which the originating router has a multicast advertisement on the VNI in question.
  • macadv: These routes determine how traffic belonging to specific MAC addresses is forwarded. If switches are set to learn MACs as they cross VTEP boundaries, these MACs will be advertised to reduce the unknown traffic that must go through multicast routes. The testbed automation system will also pre-seed known MACs at their fabric entry points to prune the initial BUM tree.
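Pre-seeding a known MAC amounts to installing a static VXLAN forwarding entry. As a sketch, the node MAC from the cog output earlier could be pointed at the tunnel endpoint of the fabric switch it sits behind; the tunnel-ip variable here is a placeholder, not a value from the source.

```shell
# Illustrative only (requires root): static forwarding entry sending
# traffic for this MAC through vtep100 to a specific tunnel endpoint.
bridge fdb append 04:70:00:01:70:1e dev vtep100 dst "$FABRIC_TUNNEL_IP" self permanent
```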

We can see how these BGP/EVPN advertisements manifest as actual routes by inspecting the routing and forwarding state of the infrapod server.

sudo bridge fdb | grep vtep100
04:70:00:01:70:1e dev vtep100 dst self permanent
00:00:00:00:00:00 dev vtep100 dst self permanent
ae:a3:79:8a:82:9f dev vtep100 vlan 1 master mzbr100 permanent
ae:a3:79:8a:82:9f dev vtep100 master mzbr100 permanent

Here we see the VXLAN forwarding entries for vtep100 on this host. The first entry corresponds to the macadv entry from GoBGP above. The second entry corresponds to the multicast entry from GoBGP above; the all-zero MAC 00:00:00:00:00:00 is a special forwarding entry that says to send all BUM traffic to the listed destination. The remaining two entries are plumbing that connects vtep100 itself onto and off of the mzbr100 bridge.

ip -br -c link | grep ae:a3

Note that we do not see an explicit forwarding entry for the ee:f8:c9:77:d9:e0 macadv above. That is because this is the MAC address of the internal veth on the infrapod we are looking at, so there need not be an external forwarding entry.

sudo ip netns exec one.rtr.ry ip -br -c link | grep ee:f8
ceth0@if94 UP ee:f8:c9:77:d9:e0 <BROADCAST,MULTICAST,UP,LOWER_UP>

All of the overlay traffic is handled at the forwarding layer. The underlay traffic is handled at the routing layer. We can see routes to the underlay addresses from earlier as follows.

ip route show table 47
via dev eth1 proto bgp onlink
via dev eth1 proto bgp onlink
via dev eth1 proto bgp onlink
via dev eth1 proto bgp onlink
via dev eth1 proto bgp onlink

By default, testbed-managed routes on infrapod servers, storage servers, emulation servers, and simulation servers are kept in table 47 to keep them separate from the management routing table of the server.

All of the above routing entries and forwarding entries are maintained by a daemon called gobble that runs as a systemd service. Gobble periodically polls GoBGP for the underlay and overlay state of the network and adds/removes forwarding and routing entries to the kernel as necessary.
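The reconciliation loop can be sketched roughly as follows. This is illustrative only: the real gobble daemon talks to GoBGP through its API rather than the CLI, and the commented commands stand in for the per-entry work of applying diffs.

```shell
# Rough sketch of one gobble polling cycle (not the real daemon).
while true; do
    # Pull the current overlay (EVPN) and underlay state from GoBGP.
    gobgp global rib -a evpn -j > /tmp/overlay.json
    gobgp global rib -j > /tmp/underlay.json

    # Diff against kernel state; for each change apply something like:
    #   bridge fdb replace <mac> dev vtep100 dst <tunnel-ip> self permanent
    #   ip route replace <tunnel-ip> table 47 dev eth1 proto bgp onlink

    sleep 1   # default polling interval: once a second
done
```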


Gobble will only add underlay routes for BGP peers with active EVPN routes; otherwise there is no purpose in talking to the peer and an underlay route is not added, even though the peer may have advertised its tunnel endpoint over BGP.

Inspecting the infranet from a fabric switch

The state of the infranet can also be inspected from fabric switches. The switches in a Merge testbed facility run Cumulus Linux, which uses FRRouting (FRR) as its routing protocol suite.

To inspect the underlay network from a fabric switch:

net show bgp
show bgp ipv4 unicast
BGP table version is 9, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 0 32768 i
*> swp3 0 64704 64705 i
*> swp4 0 64720 i
*> swp5 0 64721 i
*> swp6 0 64722 i
*> swp7 0 64723 i
*> swp3 0 0 64704 i
Displayed 7 routes and 7 total paths

Likewise, we can inspect the EVPN status for VNI 100:

net show bgp evpn route vni 100
BGP table version is 10, local router ID is
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-2 prefix: [2]:[ESI]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[ESI]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
*> [2]:[0]:[0]:[48]:[04:70:00:01:70:1e] 32768 i
*> [2]:[0]:[0]:[48]:[ee:f8:c9:77:d9:e0] 0 64704 64705 i
*> [2]:[0]:[0]:[48]:[fe:54:00:14:d8:26] 32768 i
*> [3]:[0]:[32]:[] 32768 i
*> [3]:[0]:[32]:[] 0 64704 64705 i
Displayed 5 prefixes (5 paths)

As well as the corresponding bridge forwarding entries:

bridge fdb | grep vtep100
ea:d6:c4:df:57:c8 dev vtep100 master bridge permanent
be:d6:70:79:ce:29 dev vtep100 vlan 10 master bridge
ee:f8:c9:77:d9:e0 dev vtep100 vlan 10 offload master bridge
ae:a3:79:8a:82:9f dev vtep100 vlan 10 master bridge
00:00:00:00:00:00 dev vtep100 dst self permanent
ee:f8:c9:77:d9:e0 dev vtep100 dst self offload