Portal Overview
The Merge portal is a Kubernetes-based experimentation hub for testbed users. The portal provides the following services to users:
- Implements the Merge API, which allows users to:
  - Manage projects and experiments.
  - Manage experiment revisions.
  - Compile and analyze experiments.
  - Visualize experiments.
  - Realize experiments.
  - Materialize experiments.
- Launch and access experiment development containers (XDCs).
- Attach to experiment infrastructure networks from XDCs.
- Store data in home and project volumes that are independent of individual experiment lifetimes.
The Merge portal implements these features in a layered architecture which is depicted below.
The ingress layer is the front door of the portal. It is responsible for routing traffic from external sources to the correct internal destinations within Kubernetes. Inside Kubernetes, the Merge core services, including the Merge API, are hosted as pods. These pods collectively implement all the services described above. Beneath the pods are a connectivity layer that allows users to access their experiments on remote testbeds, and a storage layer that provides an interface between pods and a Ceph-based persistent data storage cluster.
Each of these layers is described in more detail below.
Hardware
There are many different hardware arrangements that can comprise a Merge portal. A typical hardware deployment includes:
- 1 or more proxy servers that implement the Merge ingress layer.
- 1 or more Kubernetes master servers.
- 1 or more Kubernetes worker servers.
- 1 or more Ceph storage servers.
These servers can be physical or virtual. Virtual deployments allow for smaller portals to be built from a single server if needed.
There are several logical networks that interconnect the various portal systems. Typically these networks are partitioned into two physical networks:
- management network: allows for administrator access to nodes for maintenance, log aggregation and alerting.
- data network: provides connectivity between the core systems and services that implement the portal's functionality.
note
10 Gbps or greater networks are recommended for the data network, as this network interconnects the storage servers and carries remote mounts from storage servers to XDC worker nodes.
A typical portal deployment is shown in the diagram below, with a pair of switches interconnecting every node onto the management and data networks.
Ingress
The portal ingress systems are designed to route traffic between external clients and the services provided by the portal. This routing is implemented by a set of proxy servers. Each proxy server:
- Runs an nftables based NAT that funnels external traffic aimed at specific API endpoints onto the correct internal network entry point.
- Runs an HAProxy proxy instance that load balances traffic across replicated internal services.
There are 4 public access endpoints exposed by the portal.
- merge-api:443 is the TLS endpoint for the Merge API.
- xdc:2202 is the SSH endpoint for XDC access.
- xdc:443 is the TLS endpoint for XDC web interface access.
- launch:443 is the TLS endpoint for accessing the Merge 'Launch' web interface.
On a live Merge system, each of these endpoints is assigned a public IP address for user access. It is the job of the ingress layer to route traffic on these addresses to the corresponding internal subsystems.
NAT
The NAT rules that route traffic in and out of the internal portal networks are actually quite simple. Consider an example where the following IP address mappings apply.
| endpoint  | IP        |
|-----------|-----------|
| merge-api | 2.3.4.10  |
| xdc       | 2.3.4.100 |
| launch    | 2.3.4.200 |
The corresponding nft rules then provide NAT into and out of the portal's internal networks.
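A minimal sketch of what such rules can look like is shown below. The mapping of each endpoint to a specific local anchor address (127.0.1.2, 127.0.1.3, 127.0.1.4, described in the HAProxy section) is an assumption for illustration; the actual rules are generated by portal-deploy and may differ.

```
table ip nat {
  chain prerouting {
    type nat hook prerouting priority dstnat;
    # merge-api: 2.3.4.10:443 -> local anchor 127.0.1.2:443
    ip daddr 2.3.4.10 tcp dport 443 dnat to 127.0.1.2:443
    # xdc: SSH (2202) and TLS (443) on 2.3.4.100 -> local anchor 127.0.1.3
    ip daddr 2.3.4.100 tcp dport 2202 dnat to 127.0.1.3:2202
    ip daddr 2.3.4.100 tcp dport 443 dnat to 127.0.1.3:443
    # launch: 2.3.4.200:443 -> local anchor 127.0.1.4:443
    ip daddr 2.3.4.200 tcp dport 443 dnat to 127.0.1.4:443
  }
  chain postrouting {
    type nat hook postrouting priority srcnat;
    # rewrite return traffic so replies leave through the proxy
    masquerade
  }
}
```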
These rules are automatically set up by the portal-deploy playbook. However, the goal here is to describe how these rules work in the context of the larger Merge portal systems.
So external traffic is now getting NAT'd to various localhost addresses. This is where HAProxy takes over.
HAProxy
In the previous section, we described how external traffic directed at Merge services is routed into and out of a local address space. Each one of the local targets used by the nft NAT (127.0.1.[2,3,4]) is used as a front-end address by HAProxy. We refer to these local addresses as anchors.
The primary function HAProxy serves is distributing traffic from anchors to their corresponding services in the Merge Kubernetes cluster. Consider the following HAProxy configuration fragment.
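A minimal sketch of such a fragment is shown below; the frontend and backend names are chosen for illustration, and the actual configuration is generated by portal-deploy.

```
frontend merge-api
    bind 127.0.1.2:443
    mode tcp
    default_backend merge-api-servers

backend merge-api-servers
    mode tcp
    balance roundrobin
    server mapi0 mapi0:443 check
    server mapi1 mapi1:443 check
    server mapi2 mapi2:443 check
```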
Here we see that HAProxy is binding to one of the anchors that nft is NAT'ing traffic to, 127.0.1.2, on port 443. On the back side, HAProxy is distributing this traffic to 3 backend servers, mapi[0,1,2]. These host names are bound to addresses that belong to Kubernetes external services that connect to the pods that implement the Merge API. The host name bindings are placed in /etc/hosts by portal-deploy. By default the mapping is the following.
| name  | address     |
|-------|-------------|
| mapi0 | 10.100.0.10 |
| mapi1 | 10.100.0.11 |
| mapi2 | 10.100.0.12 |
However, these addresses are configurable, so the address mappings may vary between installations.
All of the other endpoints follow the same basic scheme: nft NAT onto local addresses, and HAProxy for distribution onto Kubernetes service addresses.
Kubernetes
The bulk of the Merge portal lives inside Kubernetes. These systems can be broadly partitioned into two mostly independent stacks:
- The Merge API, Policy Layer and Core Services
- XDCs, SSH Jump Bastions, HTTP Proxies
Each of these systems occupies a namespace in k8s. To get a sense of what's running, you can use the kubectl get pods command in each namespace on one of the k8s master servers in your deployment.
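For example, from one of the master servers (output is omitted here, as pod names vary between deployments):

```
kubectl get pods           # Merge API, policy layer and core services (default namespace)
kubectl get pods -n xdc    # XDCs, SSH jump bastions and web proxies (xdc namespace)
```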
The Merge API and associated core services run in the default namespace.
By default each service is replicated 3 times.
The XDC systems run in the xdc namespace. Here you'll see containers that users have created to work with their experiments, as well as the jumpc and web-proxy infrastructure containers.
Merge API
The Merge OpenAPI 2 specification is implemented inside k8s by a pair of deployment and service objects. The API pods are exposed to the HAProxy instance in the ingress layer through external services.
We can see these services as follows.
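For example (the exact service names vary by installation):

```
kubectl get services    # look for the EXTERNAL-IP column on the API services
```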
The EXTERNAL-IP addresses should match the HAProxy backend targets described in the ingress section. Services are a k8s abstraction that binds addresses to a set of pods. This binding is done through selectors. We can see the selectors using kubectl describe.
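For example (the service name below is a hypothetical placeholder; use the API service name from your deployment):

```
kubectl describe service <api-service> | grep Selector
```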
The Selector: app=portal,component=api tag means that traffic to this service will route to any pod that carries the labels app=portal and component=api. Taking a look at the API pod deployment, we see that these selectors have been applied to the API pods in the Pod Template section below.
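A sketch of the relevant fragment of such a deployment is shown below; everything other than the two labels called out above is illustrative.

```
# illustrative fragment of an API deployment spec
spec:
  selector:
    matchLabels:
      app: portal
      component: api
  template:            # the Pod Template section
    metadata:
      labels:
        app: portal
        component: api
```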
Policy Layer
Every single call to the Merge API goes through the Policy layer. This is an authorization layer that happens after client authentication. The policy layer determines whether the caller is authorized to make the requested API call based on a declarative policy specification that is put into place by the portal administrator. The following is a concrete example of policy that governs experiments.
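Shown below is a sketch of what such a policy can look like; the schema and field names here are assumptions, only the read rules discussed below are sketched, and the authoritative rule set is linked later in this section.

```
# illustrative only; not the actual Merge policy schema
experiment:
  public:
    read: [any]                    # any user may read experiment data
  protected:
    read: [member]                 # only project members may read
  private:
    read: [maintainer, creator]    # only maintainers and creators may read
```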
This policy states that an experiment is subject to one of three potential policy sets.
- public
- protected
- private
Each policy set lays out a set of rules that govern how users can interact with experiments. For example, in a public project any user can read the data associated with an experiment, such as the topology definition. In a protected project only project members can read experiment data, and in a private project only maintainers and creators can see an experiment's data. A complete policy rule set is defined here.
The API pod expects to find the policy definition at /etc/merge/policy.yml. The policy file is typically provisioned at this location by an administrator through a k8s configmap. The portal-deploy playbook sets up the merge-config configmap automatically and references it from the API container. Operations engineers may reference this playbook for guidance on provisioning policy updates through configmaps.
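A minimal sketch of provisioning a policy update by hand is shown below; portal-deploy normally manages this configmap, and the local file name used here is assumed.

```
# regenerate the merge-config configmap from a local policy.yml and apply it
kubectl create configmap merge-config --from-file=policy.yml \
  --dry-run=client -o yaml | kubectl apply -f -
```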
Core Services
Once an API call has cleared the policy layer, the API service delegates the call to one of the Merge core service pods. These core service pods include:
- alloc: manages resource allocations.
- commission: handles adding/removing resources and resource activation state.
- materialize: automates materializations of experiments across testbed facilities.
- model: provides model compilation, static analysis and reticulation services.
- realize: computes embeddings of user experiments onto underlying resource networks.
- workspace: provides experiment and project management services.
By default, each of these core services has 3 replicas. The pod names follow the usual k8s deployment naming scheme and can be listed with kubectl get pods. The logs for each service can be obtained with kubectl logs. For example:
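(The pod name below is hypothetical; substitute a name reported by kubectl get pods in your deployment.)

```
kubectl get pods                          # list core service pods in the default namespace
kubectl logs realize-6d5f9c7b8d-xk2lq     # logs from one of the realize pods
```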
Each Merge core service has a corresponding k8s service. This allows the API to talk to services by name, and in some cases for the services to talk to each other by name.
Container Images
By default the containers in the portal pull from the latest sources in the MergeTB Quay repository. The URI scheme for these containers is the following.
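The shape shown below is an assumption based on the repository named above; the exact organization, image names, and tags may differ per installation.

```
quay.io/mergetb/<service>:latest    # e.g. a hypothetical realize or workspace image
```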
A quick way to see what images you are running is the following.
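One way to list the images in use (a generic kubectl pattern, not specific to Merge):

```
kubectl get pods -o jsonpath='{..image}' | tr ' ' '\n' | sort -u
```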
You can change the image a deployment points to by using kubectl edit. For example, to change the realization image:
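(The deployment name below is assumed; use the name reported by kubectl get deployments.)

```
kubectl edit deployment realize
```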
This will drop you into an editor with a YAML file where you can search for image: and find a reference to the current image that you can modify. Upon saving the file and closing the editor, k8s will pick up the changes, bring up a new replica set that points to the specified image, and take down the old replica set.
Etcd Database
The Merge Portal uses etcd as its sole database. Etcd is hosted in a k8s deployment with an associated k8s service, just like the Merge core services.
One significant difference with the etcd deployment is that the underlying storage used for the database is provided by a Ceph mount, through a combination of a persistent volume and a persistent volume claim.
These k8s storage elements can be seen through kubectl as follows, first for the volume and then for the claim.
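For example:

```
kubectl get pv     # the persistent volume backing the etcd database
kubectl get pvc    # the persistent volume claim used by the etcd deployment
```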
These volumes are mounted into the etcd container through the etcd deployment definition and can be viewed using kubectl describe.
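(The deployment name below is assumed to be etcd; substitute the name from your deployment.)

```
kubectl describe deployment etcd
```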
Launch Web Interface
TODO
@glawler
XDCs
Experiment development containers (XDCs) provide a gateway to experiments for users. These are containers that are launched on-demand through the Merge API and are made accessible through:
- SSH jump bastions
- HTTPS web proxies
XDCs and the infrastructure to support them are in the xdc namespace. The one exception is the xdc-operator pod, which is in the default namespace. This pod is responsible for launching XDCs using the k8s API and responds to requests from the Merge API to spawn or terminate XDCs.
Each XDC has several persistent volume mount points.
- 1 for the home directory of each user
- 1 for the project directory of the parent project
These can be seen with kubectl describe.
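For example (the deployment lives in the xdc namespace and follows the <xdc>-<exp>-<proj> naming scheme described below; the name here is a placeholder):

```
kubectl describe deployment -n xdc <xdc>-<exp>-<proj>
```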
Here we see that there are 7 home directories and a project directory mounted. Each of these is mounted from the mergefs Ceph file system. The lifetime of the data in these directories is independent of the lifetime of any particular XDC; they exist for the lifetime of the corresponding user account or project.
There are some other relevant details to note in the deployment description above.
- The selector is used in conjunction with a service so that this pod is reachable from the jumpc bastion and the HTTPS web proxy.
- XDCs by default come with a Jupyter notebook interface, and the XDC operator sets up environment variables to enable remote access through the HTTPS web proxy.
- passwd and group files are mounted into the XDC from mergefs. These files are managed by the portal and are merged into the XDC's native passwd and group files at instantiation time. They ensure that the proper user and group permissions are set up for each member of the project the XDC belongs to.
Looking at the k8s service associated with this deployment, we see the selector tags that make traffic to the pod routable.
We also see that there are two endpoints exposed:
- 22 for SSH
- 443 for TLS
note
The naming scheme for XDC services is <xdc_name>-<exp_name>-<proj_name>; this is the naming scheme users employ to access their XDCs. At the time of writing, dots are not allowed in k8s service names, so this was the next best thing.
Jumpc
The jumpc deployment is a set of bastion pods and associated k8s services that provide external SSH connectivity to XDCs. The jumpc services expose external addresses that act as targets for the ingress layer.
note
The Merge portal plumbing uses port 2202 for XDC/SSH access, as opposed to the usual 22.
A closer look at the jumpc service shows that the translation between port 2202 and port 22 takes place within this service.
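A minimal sketch of how that translation typically appears in a k8s service spec; the field names beyond the two ports are illustrative.

```
# illustrative fragment of a jumpc-style service spec
ports:
  - name: ssh
    port: 2202        # port exposed by the service
    targetPort: 22    # port the sshd in the pod actually listens on
    protocol: TCP
```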
The jumpc pods contain a mount of all the home directories of every user that is on the portal. This is a read-only mount. This mount is necessary for SSH authentication to work. When users upload their SSH pubkeys to the portal, they are placed in their home directories thereby providing access to XDCs through jumpc.
XDC Web Proxy
The XDC web proxy performs a similar function as the SSH bastion. It provides HTTPS proxying services so users can access the Jupyter web interface of their XDCs.
The xdc-web-proxy service exposes a set of external IP addresses as targets for the ingress layer.
Externally, XDC web interfaces are made accessible through the following addressing scheme:
https://xdc.<facility>.mergetb.io/<project>/<experiment>/<xdc>
The XDC web proxy inspects this URL pattern, translates the path into the <xdc>-<experiment>-<project> form, and routes the request to the corresponding XDC through its k8s service. XDC access is guarded by a web token that is only available to members of the project the XDC belongs to. So even though the HTTPS interface may be guessed by others and is generally exposed, authorization to actually use the XDC is managed by protected tokens.
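For example, with hypothetical project, experiment, and XDC names:

```
https://xdc.example.mergetb.io/proj0/exp0/xdc0
    -> routed to the k8s service xdc0-exp0-proj0 in the xdc namespace
```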
Taints and Tolerations
The Merge portal uses k8s taints and tolerations to control where pods are scheduled. The general rules are:
- core services and general infrastructure get scheduled to service_worker servers
- XDCs get scheduled to xdc_worker servers
Taints are applied to nodes, and any pod that is to be scheduled to a node must have tolerations that match the node's taints. Node taints can be viewed using kubectl.
Here we show the taints on a service_worker node.
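For example (the node name is a placeholder, and the taint key itself is deployment-specific):

```
kubectl describe node <service-worker-node> | grep -A2 Taints
```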
A pod that gets scheduled to this node must have a corresponding toleration, as shown below.
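A minimal sketch of such a toleration in a pod spec; the key and effect here are assumptions and must match whatever taint your service_worker nodes actually carry.

```
# illustrative toleration fragment for a pod spec
tolerations:
  - key: "service_worker"    # hypothetical taint key
    operator: "Exists"
    effect: "NoSchedule"
```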
Wireguard
TODO
@glawler
Container Storage Interface
The k8s Container Storage Interface (CSI) is the touch point between Kubernetes and the Ceph cluster in the Merge portal. The machinery that makes this interface work is in the csi namespace. The particular CSI implementation used by the Merge portal is ceph-csi. It is deployed through a very standard set of ceph-csi templates.
- pods
- services
- daemon sets
- deployments
- replica sets
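All of these can be listed at once with:

```
kubectl get all -n csi
```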
The part of ceph-csi that is specific to a particular Merge portal is how the connection to the underlying Ceph cluster is made. This information is communicated through the ceph-csi-config ConfigMap.
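A sketch of what this ConfigMap typically contains, following the standard ceph-csi config format; the cluster ID and monitor addresses below are placeholders.

```
apiVersion: v1
kind: ConfigMap
metadata:
  name: ceph-csi-config
  namespace: csi
data:
  config.json: |-
    [
      {
        "clusterID": "<ceph-cluster-fsid>",
        "monitors": ["10.0.0.1:6789", "10.0.0.2:6789", "10.0.0.3:6789"]
      }
    ]
```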
This configuration specifies the addresses of the Ceph monitors and provides the basis for communication between k8s/ceph-csi and the Ceph cluster.
Ceph
The Ceph cluster provides persistent storage for the pods that run on Kubernetes. By convention these nodes are named st<N>, where N is the integer index of the Ceph node. In the default Merge/Ceph deployment for the portal, each Ceph node runs:
- A Ceph monitor.
- A Ceph manager.
- An object storage daemon (OSD) for each configured disk.
- A metadata server (MDS) for the mergefs file system.
Basic Monitoring
The basic health status of a Ceph cluster comes from the ceph health command.
note
In order to use the ceph command, the user invoking the command must be a member of the ceph group or be root.
The overall status of the cluster can be viewed using ceph status.
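For example:

```
ceph health    # brief summary: HEALTH_OK, HEALTH_WARN, or HEALTH_ERR
ceph status    # full cluster status (monitors, OSDs, data usage, I/O)
```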
Monitoring Services
There is also a Ceph web dashboard. Where the dashboard is hosted is decided by Ceph. You can see where it is hosted using the ceph mgr command.
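For example:

```
ceph mgr services    # lists the URLs of enabled manager modules, including the dashboard
```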
The dashboard provides a good deal of useful information about the health and performance of the Ceph cluster. The Merge/Ceph setup also enables the Prometheus monitoring endpoint by default, so operators can hook up Prometheus servers to get Ceph metrics.
Crash Reports
If a Ceph service crashes, crash reports from Ceph may be available.
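These can be inspected with the ceph crash commands; the crash ID below is a placeholder.

```
ceph crash ls            # list new and archived crash reports
ceph crash info <id>     # details for a specific crash report
```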
System Configuration
Each Ceph node has a configuration file located at /etc/ceph/ceph.conf that will look similar to this:
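A minimal sketch of such a file; the fsid, monitor addresses, and networks are placeholders and will differ per installation.

```
[global]
fsid = <ceph-cluster-fsid>
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
public_network = 10.0.0.0/24
cluster_network = 10.0.0.0/24
```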
Systemd Daemons
The Ceph services run as systemd daemons. The mapping from Ceph elements to systemd service names is as follows.
| Ceph Element          | Systemd Service  |
|-----------------------|------------------|
| monitor               | ceph-mon         |
| manager               | ceph-mgr@<node>  |
| metadata server       | ceph-mds@<node>  |
| object storage daemon | ceph-osd@<index> |
The status of any of these services can be checked with systemctl, and the logs for each service are available through journalctl.
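For example, for the OSD with index 0:

```
systemctl status ceph-osd@0     # daemon status (unit names per the table above)
journalctl -u ceph-osd@0 -e     # recent logs for the same daemon
```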