Local Soft-RoCE development workflow with kind
RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are multiple RoCE versions; in particular, RoCE v2 runs on top of either UDP/IPv4 or UDP/IPv6, and destination port 4791 has been reserved for it.
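On the wire, RoCE v2 traffic is therefore just UDP datagrams with destination port 4791, which gives a simple way to confirm that RDMA traffic is actually flowing. A minimal sketch (the interface name eth0 is an assumption):

  # RoCE v2 frames are plain UDP datagrams with destination port 4791
  sudo tcpdump -i eth0 -n 'udp dst port 4791'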
Network-intensive applications like networked storage, cluster computing and Artificial Intelligence workloads need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces are lower latency, lower CPU load and higher bandwidth.
Soft-RoCE, also called RXE, is a software implementation of RDMA over Ethernet. It lets us use RoCE on hosts without RoCE host channel adapters (HCAs).
We will be using ib_write_bw as the network benchmarking tool.
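Before starting, it is worth checking that the running kernel actually ships the rdma_rxe module; on stock Ubuntu kernels it is packaged in linux-modules-extra (the Azure kernel used by hosted GitHub runners is an exception, as we will see later). A quick check, with the Ubuntu package name as an assumption:

  # Verify the Soft-RoCE module is available for the running kernel
  modinfo rdma_rxe
  # On Ubuntu, install the extra modules package if modinfo reports the module is missing
  sudo apt-get install -y linux-modules-extra-$(uname -r)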
Testing with Docker/Podman
- Load the software RoCE module:

  sudo modprobe rdma_rxe

- Create RXE on the active interface:

  sudo rdma link add rxe0 type rxe netdev eth0

- Verify configuration:

  rdma link show
  link rxe0/1 state ACTIVE physical_state LINK_UP netdev eth0
- Check GID setup (this is the tricky part):

  cat /sys/class/infiniband/rxe0/ports/1/gids/0
  fe80:0000:0000:0000:c805:5eff:fe37:e715
  cat /sys/class/infiniband/rxe0/ports/1/gids/1
  0000:0000:0000:0000:0000:ffff:c0a8:0117

  GID 0 is an IPv6 link-local address while GID 1 is an IPv4-mapped IPv6 address:
  - GID 0: fe80:0000:0000:0000:c805:5eff:fe37:e715 (IPv6 link-local address)
  - GID 1: 0000:0000:0000:0000:0000:ffff:c0a8:0117 (IPv4-mapped IPv6 address, i.e. 192.168.1.23; see the decoding aside right after this list)
- Create client/server containers with RDMA access:

  docker run -d --name server-container --privileged \
    --device=/dev/infiniband/uverbs0 \
    --volume=/sys/class/infiniband:/sys/class/infiniband:ro \
    quay.io/cloud-bulldozer/k8s-netperf:latest \
    sleep 3600

  docker run -d --name client-container --privileged \
    --device=/dev/infiniband/uverbs0 \
    --volume=/sys/class/infiniband:/sys/class/infiniband:ro \
    quay.io/cloud-bulldozer/k8s-netperf:latest \
    sleep 3600
- Get the server container IP address:

  docker inspect server-container | grep IPAddress | tail -1
            "IPAddress": "172.17.0.2",
- Run the test, pointing ib_write_bw at GID index 1 (the IPv4-mapped GID from the previous step) with -x 1:

  docker exec server-container ib_write_bw -d rxe0 -x 1 -F
  docker exec client-container ib_write_bw -d rxe0 -x 1 -F 172.17.0.2
  ---------------------------------------------------------------------------------------
                      RDMA_Write BW Test
   Dual-port       : OFF
   Device          : rxe0
   Number of qps   : 1
   Transport type  : IB
   Connection type : RC
   Using SRQ       : OFF
   PCIe relax order: ON
   Lock-free       : OFF
   ibv_wr* API     : OFF
   Using DDP       : OFF
   TX depth        : 128
   CQ Moderation   : 1
   CQE Poll Batch  : 16
   Mtu             : 1024[B]
   Link type       : Ethernet
   GID index       : 1
   Max inline data : 0[B]
   rdma_cm QPs     : OFF
   Data ex. method : Ethernet
  ---------------------------------------------------------------------------------------
   local address: LID 0000 QPN 0x001f PSN 0xeb83da RKey 0x00108c VAddr 0x007fa70d049000
   GID: 254:128:00:00:00:00:00:00:65:153:129:16:120:165:110:231
   remote address: LID 0000 QPN 0x0020 PSN 0xeb83da RKey 0x00115c VAddr 0x007f3a0742d000
   GID: 254:128:00:00:00:00:00:00:65:153:129:16:120:165:110:231
  ---------------------------------------------------------------------------------------
   #bytes     #iterations    BW peak[MiB/sec]    BW average[MiB/sec]   MsgRate[Mpps]
   65536      5000           1032.41             1024.75               0.016396
  ---------------------------------------------------------------------------------------

  We obtained an average bandwidth of ~1024 MiB/sec.
- Cleanup. Remove the containers and the RXE device:

  docker rm -f server-container client-container
  sudo rdma link delete rxe0
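As an aside about the GID check above: the last 32 bits of an IPv4-mapped GID are the interface's IPv4 address, so you can decode it by hand. A quick sketch using the GID 1 value shown earlier:

  # The last four groups of two hex digits (c0 a8 01 17) are the IPv4 octets
  printf '%d.%d.%d.%d\n' 0xc0 0xa8 0x01 0x17
  # prints 192.168.1.23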
Testing with Kind
We will be using k8s-netperf as the Kubernetes network benchmarking tool; k8s-netperf supports ib_write_bw as one of its backends.
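k8s-netperf is a Go program; if you do not already have a binary, the same build steps used in the GitHub Actions workflows later in this post also work locally (a recent Go toolchain is assumed):

  # Build the k8s-netperf binary from source
  git clone --depth 1 https://github.com/josecastillolema/k8s-netperf.git
  cd k8s-netperf
  go build -o k8s-netperf cmd/k8s-netperf/k8s-netperf.go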
- Load the software RoCE module:

  sudo modprobe rdma_rxe

- Create RXE on the active interface:

  sudo rdma link add rxe0 type rxe netdev eth0

- Verify configuration:

  rdma link show
  link rxe0/1 state ACTIVE physical_state LINK_UP netdev eth0
- Let's create an RDMA-enabled kind cluster config, kind-config-rdma.yaml:

  kind: Cluster
  apiVersion: kind.x-k8s.io/v1alpha4
  nodes:
  - role: control-plane
    extraMounts:
    - hostPath: /dev/infiniband
      containerPath: /dev/infiniband
    - hostPath: /sys/class/infiniband
      containerPath: /sys/class/infiniband
      readOnly: true
    - hostPath: /sys/class/net
      containerPath: /sys/class/net
      readOnly: true
  - role: worker
    extraMounts:
    - hostPath: /dev/infiniband
      containerPath: /dev/infiniband
    - hostPath: /sys/class/infiniband
      containerPath: /sys/class/infiniband
      readOnly: true
    - hostPath: /sys/class/net
      containerPath: /sys/class/net
      readOnly: true
  - role: worker
    extraMounts:
    - hostPath: /dev/infiniband
      containerPath: /dev/infiniband
    - hostPath: /sys/class/infiniband
      containerPath: /sys/class/infiniband
      readOnly: true
    - hostPath: /sys/class/net
      containerPath: /sys/class/net
      readOnly: true
- Create the kind cluster (works perfectly fine with podman as well, just export KIND_EXPERIMENTAL_PROVIDER=podman; if the nodes end up without RDMA devices, see the troubleshooting note after this list):

  kind create cluster --config kind-config-rdma.yaml
  Creating cluster "kind" ...
   ✓ Ensuring node image (kindest/node:v1.32.2)
   ✓ Preparing nodes
   ✓ Writing configuration
   ✓ Starting control-plane
   ✓ Installing CNI
   ✓ Installing StorageClass
   ✓ Joining worker nodes
- Label the nodes:

  kubectl label node kind-worker node-role.kubernetes.io/worker=""
  kubectl label node kind-worker2 node-role.kubernetes.io/worker=""
- Create a k8s-netperf config file for RoCEv2 traffic, config.yaml:

  ---
  tests:
    - UDPStream:
        parallelism: 1
        profile: "UDP_STREAM"
        duration: 5
        samples: 1
        messagesize: 1024
- Run the test:

  k8s-netperf --config config.yaml --hostNet --privileged --ib-write-bw rxe0:1
  INFO[2025-12-28 11:29:38] Starting k8s-netperf (roce2@e2988034e0f9dd4e2a59f131f6ae7866b12fd6de)
  INFO[2025-12-28 11:29:38] Reading config.yaml file.
  INFO[2025-12-28 11:29:38] Reading config.yaml file - using ConfigV2 Method.
  WARN[2025-12-28 11:29:38] Prometheus is not available
  INFO[2025-12-28 11:29:38] Creating namespace: netperf
  INFO[2025-12-28 11:29:38] Creating service account: netperf
  WARN[2025-12-28 11:29:38] No zone label
  WARN[2025-12-28 11:29:38] Single node per zone and/or no zone labels
  INFO[2025-12-28 11:29:38] Starting Deployment for: client-host in namespace: netperf
  INFO[2025-12-28 11:29:38] Checking for client-host Pods to become ready...
  INFO[2025-12-28 11:29:56] Looking for pods with label role=host-client
  INFO[2025-12-28 11:29:56] Starting Deployment for: server-host in namespace: netperf
  INFO[2025-12-28 11:29:56] Checking for server-host Pods to become ready...
  INFO[2025-12-28 11:30:11] Looking for pods with label role=host-server
  INFO[2025-12-28 11:30:16] Running ib_write_bw UDP_STREAM (service false) for 5s
  WARN[2025-12-28 11:30:22] Not able to collect OpenShift specific node info

  | RESULT TYPE    | DRIVER      | SCENARIO   | PARALLELISM | HOST NETWORK | SERVICE | EXTERNAL SERVER | UDN INFO | BRIDGE INFO | MESSAGE SIZE | BURST | SAME NODE | DURATION | SAMPLES | AVG VALUE         | 95% CONFIDENCE INTERVAL  |
  | Stream Results | ib_write_bw | UDP_STREAM | 1           | true         | false   | false           |          |             | 1024         | 0     | false     | 5        | 1       | 935.110000 (Mb/s) | 0.000000-0.000000 (Mb/s) |

  | TYPE             | DRIVER      | SCENARIO   | PARALLELISM | HOST NETWORK | SERVICE | EXTERNAL SERVER | UDN INFO | BRIDGE INFO | MESSAGE SIZE | BURST | SAME NODE | DURATION | SAMPLES | AVG VALUE |
  | UDP Loss Percent | ib_write_bw | UDP_STREAM | 1           | true         | false   | false           |          |             | 1024         | 0     | false     | 5        | 1       | 0.000000  |

  INFO[2025-12-28 11:30:22] Cleaning resources created by k8s-netperf
  INFO[2025-12-28 11:30:22] Waiting for netperf Namespace to be deleted...

  The average bandwidth was ~935 MiB/sec.
- Cleanup. Remove the kind cluster and the RXE device:

  kind delete cluster
  sudo rdma link delete rxe0
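Troubleshooting note: if k8s-netperf reports that it cannot find an RDMA device, the usual culprit is the extraMounts from the kind config not reaching the node containers. Since kind nodes are just containers on the host, a quick check before tearing the cluster down (node names assume the default kind naming) is:

  # Verify the RDMA device files and sysfs entries are visible inside a kind node
  docker exec kind-worker ls /dev/infiniband
  docker exec kind-worker ls /sys/class/infiniband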
Single node cluster
- You can also create a one-node all-in-one kind cluster:

  kind: Cluster
  apiVersion: kind.x-k8s.io/v1alpha4
  nodes:
  - role: control-plane
    extraMounts:
    - hostPath: /dev/infiniband
      containerPath: /dev/infiniband
    - hostPath: /sys/class/infiniband
      containerPath: /sys/class/infiniband
      readOnly: true
    - hostPath: /sys/class/net
      containerPath: /sys/class/net
      readOnly: true
- Label the one node:

  kubectl label node kind-control-plane node-role.kubernetes.io/worker=""
- Don't forget the --local flag when invoking k8s-netperf:

  k8s-netperf --config config.yaml --hostNet --privileged --ib-write-bw rxe0:1 --local
GitHub actions
First attempt (not working)
For the first attempt we tried using the Ubuntu Azure GitHub runner OS. The Ubuntu Azure kernel lacks the soft-RoCE (rdma_rxe) module but ships the soft-iWARP (siw) driver instead:
name: RoCE
on:
  push:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Setup Soft-RoCE
        run: |
          sudo apt-get install -y linux-modules-extra-$(uname -r)
          sudo modprobe siw
          sudo rdma link add siw0 type siw netdev eth0
          rdma link show
      - uses: engineerd/setup-kind@v0.6.2
        with:
          config: testing/kind-config-rdma.yaml
          version: "v0.27.0"
      - name: Labeling
        run: |
          kubectl label node kind-control-plane node-role.kubernetes.io/worker=""
      - name: Runs k8s-netperf
        run: |
          git clone --depth 1 https://github.com/josecastillolema/k8s-netperf.git
          cd k8s-netperf
          go build -o k8s-netperf cmd/k8s-netperf/k8s-netperf.go
          ./k8s-netperf --config config.yaml --hostNet --privileged --ib-write-bw siw0:0 --local
Unfortunately, this elegant and quick workflow does not work: it fails at the QP RTS transition. The following kernel ring buffer entry points to something related to running on a Docker veth interface:
eth0: renamed from vethf90b642
We also tried running ib_write_bw with the -R flag, which uses RDMA CM (Connection Manager) instead of the manual, IB-style QP setup, but even that did not work.
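For reference, this is roughly what the RDMA CM variant looks like when driving ib_write_bw by hand; with -R the queue pairs are connected through librdmacm, so no explicit GID index is passed (the flags other than -R mirror the earlier tests, and siw0 is the soft-iWARP device created in the workflow):

  # Server side: establish the connection through RDMA CM instead of the IB-style exchange
  ib_write_bw -d siw0 -R -F
  # Client side: connect to the server's IP address through RDMA CM
  ib_write_bw -d siw0 -R -F <server-ip>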
Final attempt
Since we were unable to make soft-iWARP work, we opted to run the workload on a QEMU Fedora virtual machine with soft-RoCE. The workflow takes approximately 30 minutes:
- It spins up a Fedora 43 virtual machine with 2 vCPUs and 7 GB of memory
- Resizes the image to 10 GB
- Through cloud-init:
  - Installs kernel-modules-extra to load the rdma_rxe driver
  - Makes some sysctl adjustments to prevent the "too many open files" error
  - Installs kind and kubectl
  - Deploys a single node kind cluster using the podman provider
  - Runs the k8s-netperf workload
name: RoCE
on:
  pull_request:
    branches: [ "*" ]
    paths-ignore:
      - '**.md'
      - '**.sh'
jobs:
  qemu:
    runs-on: ubuntu-latest
    timeout-minutes: 45
    steps:
      - name: Checkout k8s-netperf
        uses: actions/checkout@v4
      - name: Build k8s-netperf
        run: |
          make build
      - name: Install dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            qemu-system-x86 \
            qemu-utils \
            genisoimage
      - name: Boot Fedora Cloud Image with QEMU and install Podman + RDMA + Kind
        run: |
          cd /tmp
          FEDORA_CLOUD_URL="https://download.fedoraproject.org/pub/fedora/linux/releases/43/Cloud/x86_64/images"
          FEDORA_IMAGE="Fedora-Cloud-Base-Generic-43-1.6.x86_64.qcow2"
          wget -q "$FEDORA_CLOUD_URL/$FEDORA_IMAGE" -O fedora-cloud.qcow2
          # Resize the qcow2 image to 10GB (default 5GB is too small for packages + Kind)
          qemu-img resize fedora-cloud.qcow2 10G
          mkdir -p /tmp/cloud-init
          # Copy Kind config from repository
          cp $GITHUB_WORKSPACE/testing/kind-config-rdma.yaml /tmp/cloud-init/kind-config.yaml
          # Copy k8s-netperf binary and config
          echo "Copying k8s-netperf binary and config..."
          cp $GITHUB_WORKSPACE/bin/amd64/k8s-netperf /tmp/cloud-init/
          cp $GITHUB_WORKSPACE/examples/roce.yml /tmp/cloud-init/
          # Create user-data for cloud-init
          printf '%s\n' \
            '#cloud-config' \
            'users:' \
            '  - default' \
            '  - name: fedora' \
            '    sudo: ALL=(ALL) NOPASSWD:ALL' \
            '    shell: /bin/bash' \
            '' \
            'chpasswd:' \
            '  list: |' \
            '    fedora:fedora' \
            '  expire: false' \
            '' \
            'packages:' \
            '  - podman' \
            '  - rdma-core' \
            '' \
            'runcmd:' \
            '  - set -ex' \
            '  - trap "echo CLOUD_INIT_FAILED; sync; poweroff" ERR' \
            '  - dnf install -y kernel-modules-extra-$(uname -r)' \
            '  - mkdir -p /mnt/cdrom' \
            '  - mount /dev/sr0 /mnt/cdrom || mount /dev/sr1 /mnt/cdrom || mount /dev/sdb /mnt/cdrom || { echo "Failed to mount cloud-init ISO"; echo "FAST_FAIL_MOUNT_ERROR"; poweroff; exit 1; }' \
            '  - cp /mnt/cdrom/kind-config.yaml /tmp/' \
            '  - cp /mnt/cdrom/k8s-netperf /usr/local/bin/' \
            '  - chmod +x /usr/local/bin/k8s-netperf' \
            '  - cp /mnt/cdrom/roce.yml /tmp/' \
            '  - umount /mnt/cdrom' \
            '  - modprobe ib_core || { echo "Failed to load ib_core module"; exit 1; }' \
            '  - modprobe ib_uverbs || { echo "Failed to load ib_uverbs module"; exit 1; }' \
            '  - modprobe rdma_rxe || { echo "Failed to load rdma_rxe module"; exit 1; }' \
            '  - rdma link add rxe0 type rxe netdev ens3 || { echo "Failed to create RXE device"; exit 1; }' \
            '  - rdma link show' \
            '  - sysctl -w kernel.keys.maxkeys=5000' \
            '  - sysctl -w fs.inotify.max_user_watches=2099999999' \
            '  - sysctl -w fs.inotify.max_user_instances=2099999999' \
            '  - sysctl -w fs.inotify.max_queued_events=2099999999' \
            '  - curl -Lo /usr/local/bin/kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64' \
            '  - chmod +x /usr/local/bin/kind' \
            '  - KIND_EXPERIMENTAL_PROVIDER=podman kind create cluster --config=/tmp/kind-config.yaml' \
            '  - kind get clusters' \
            '  - curl -Lo /usr/local/bin/kubectl "https://dl.k8s.io/release/v1.31.0/bin/linux/amd64/kubectl"' \
            '  - chmod +x /usr/local/bin/kubectl' \
            '  - kubectl label node kind-control-plane node-role.kubernetes.io/worker=""' \
            '  - k8s-netperf --config /tmp/roce.yml --privileged --ib-write-bw rxe0:1 --hostNet --local' \
            '  - echo "=== RDMA + Podman + Kind + k8s-netperf completed successfully! ==="' \
            '  - sync' \
            '  - poweroff' \
            > /tmp/cloud-init/user-data
          # Create meta-data using printf
          printf '%s\n' \
            'instance-id: rdma-test-vm' \
            'local-hostname: rdma-test' \
            > /tmp/cloud-init/meta-data
          # Create cloud-init ISO
          genisoimage -output /tmp/cloud-init.iso -volid cidata -joliet -rock /tmp/cloud-init/user-data /tmp/cloud-init/meta-data /tmp/cloud-init/kind-config.yaml /tmp/cloud-init/k8s-netperf /tmp/cloud-init/roce.yml
          # Boot QEMU with cloud image in background
          touch /tmp/vm-output.log
          qemu-system-x86_64 \
            -machine accel=tcg \
            -cpu max \
            -m 7168 \
            -smp 2 \
            -drive file=/tmp/fedora-cloud.qcow2,format=qcow2 \
            -drive file=/tmp/cloud-init.iso,format=raw \
            -netdev user,id=net0,dns=8.8.8.8 \
            -device e1000,netdev=net0 \
            -nographic \
            -serial mon:stdio > /tmp/vm-output.log 2>&1 &
          QEMU_PID=$!
          echo "QEMU started with PID $QEMU_PID"
          # Stream logs in real-time
          tail -f /tmp/vm-output.log &
          TAIL_PID=$!
          # Cleanup
          kill $TAIL_PID 2>/dev/null || true
          kill $QEMU_PID 2>/dev/null || true
          sleep 2
          VM_OUTPUT=$(cat /tmp/vm-output.log)
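Note that the snippet above jumps straight from starting tail to the cleanup; in the actual step something has to block until the VM powers itself off or a timeout expires before QEMU is killed and the log is inspected. A minimal sketch of that wait, keyed on the markers the cloud-init script echoes (the 40-minute timeout is an assumption):

  # Poll the serial log until the VM reports success or failure, or the timeout expires
  for i in $(seq 1 240); do
    if grep -q "completed successfully" /tmp/vm-output.log; then
      echo "VM run finished successfully"
      break
    fi
    if grep -q -e "CLOUD_INIT_FAILED" -e "FAST_FAIL_MOUNT_ERROR" /tmp/vm-output.log; then
      echo "VM run failed"
      break
    fi
    sleep 10
  done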