In this article, we show you how to connect to NVMe flash storage over TCP.
Oracle Linux UEK5 (Unbreakable Enterprise Kernel) introduced NVMe over Fabrics, which allows NVMe storage commands to be carried over an InfiniBand or Ethernet fabric using RDMA (Remote Direct Memory Access). UEK5 U1 extended NVMe over Fabrics to include Fibre Channel storage networks. Now UEK6 introduces NVMe over TCP, which extends NVMe over Fabrics to standard Ethernet networks without the need to purchase RDMA-capable network hardware.
What is NVMe-TCP?
The NVMe multi-queue model implements an Admin Submission Queue and Completion Queue, as well as up to 64K I/O (Input/Output) Submission and Completion Queues, on each NVMe controller. For a PCIe-attached NVMe controller, these queues reside in host memory and are shared by the host CPUs and the NVMe controller. I/O is submitted to an NVMe device when the device driver writes a command to an I/O submission queue and then writes to a doorbell register to notify the device. When the command has completed, the device writes to an I/O completion queue and generates an interrupt to notify the device driver.
NVMe over Fabrics extends this design so that the submission and completion queues in host memory are mapped to queue pairs on a remote controller: each host-based queue pair is associated with a controller-based queue pair. NVMe over Fabrics defines Command and Response Capsules, which the queues use to communicate across the fabric, as well as Data Capsules. NVMe-TCP defines how these capsules are encapsulated within a TCP PDU (Protocol Data Unit). Each host-based queue pair and its associated controller-based queue pair maps onto its own TCP connection, which can be assigned to a separate CPU core.
NVMe-TCP brings several advantages:
- The ubiquitous nature of TCP. TCP is one of the most widely used network transports, already deployed in most data centers around the world.
- It works with existing network infrastructure. There is no need to replace existing Ethernet routers, switches, or NICs, which simplifies the maintenance of network infrastructure.
- Unlike RDMA-based transports, TCP is fully routable and well suited to larger deployments and longer distances while maintaining high performance and low latency.
- TCP is actively maintained and developed by a large community.
There are also trade-offs:
- TCP can increase CPU usage, because operations such as checksum calculation are performed by the CPU as part of the TCP stack.
- Although TCP provides high performance and low latency, latency is higher than with RDMA implementations and can affect some applications, partly because of the additional data copies TCP requires.
Setting Up the NVMe-TCP Target
UEK6 ships with NVMe-TCP enabled by default, but to try it out with an upstream kernel you will need to build the kernel with the following configuration parameters:
- CONFIG_NVME_TCP=m
- CONFIG_NVME_TARGET_TCP=m
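Whether a given kernel was already built with these options can be checked without recompiling. A minimal sketch, assuming the common convention of a config file at /boot/config-$(uname -r) (some distributions expose the same data at /proc/config.gz instead):

```shell
# Check whether the running kernel was built with NVMe-TCP support.
# The config path below is a convention; adjust for your distribution.
CONFIG_FILE="/boot/config-$(uname -r)"
if [ -r "$CONFIG_FILE" ]; then
    grep -E 'CONFIG_NVME(_TARGET)?_TCP=' "$CONFIG_FILE" \
        || echo "NVMe-TCP options not set in $CONFIG_FILE"
else
    echo "kernel config not found at $CONFIG_FILE"
fi
```

Options set to `m` are built as loadable modules, which is why the modprobe commands below are needed.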
Load the required kernel modules on the target host:
$ sudo modprobe nvme_tcp
$ sudo modprobe nvmet
$ sudo modprobe nvmet-tcp
Create the target subsystem in configfs, allow any host to connect, and point a namespace at the NVMe device:
$ sudo mkdir /sys/kernel/config/nvmet/subsystems/nvmet-test
$ cd /sys/kernel/config/nvmet/subsystems/nvmet-test
$ echo 1 |sudo tee -a attr_allow_any_host > /dev/null
$ sudo mkdir namespaces/1
$ cd namespaces/1/
$ echo -n /dev/nvme0n1 |sudo tee -a device_path > /dev/null
$ echo 1 |sudo tee -a enable > /dev/null
If you do not have access to an NVMe device on the target host, you can use a null block device instead.
$ sudo modprobe null_blk nr_devices=1
$ sudo ls /dev/nullb0
$ echo -n /dev/nullb0 |sudo tee -a device_path > /dev/null
$ echo 1 |sudo tee -a enable > /dev/null
Create a port, configure the address and transport, and link the subsystem to it:
$ sudo mkdir /sys/kernel/config/nvmet/ports/1
$ cd /sys/kernel/config/nvmet/ports/1
$ echo 10.147.27.85 |sudo tee -a addr_traddr > /dev/null
$ echo tcp |sudo tee -a addr_trtype > /dev/null
$ echo 4420 |sudo tee -a addr_trsvcid > /dev/null
$ echo ipv4 |sudo tee -a addr_adrfam > /dev/null
$ sudo ln -s /sys/kernel/config/nvmet/subsystems/nvmet-test/ /sys/kernel/config/nvmet/ports/1/subsystems/nvmet-test
You should now see a confirmation message in dmesg:
$ dmesg |grep "nvmet_tcp"
[24457.458325] nvmet_tcp: enabling port 1 (10.147.27.85:4420)
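The listening socket can also be confirmed directly. A small sketch using ss; the port number 4420 matches the addr_trsvcid configured above:

```shell
# Verify that the nvmet target is listening on the configured TCP port.
# 4420 is the value written to addr_trsvcid above; adjust if you chose another.
PORT=4420
if ss -ltn 2>/dev/null | grep -q ":$PORT "; then
    echo "nvmet target listening on port $PORT"
else
    echo "no listener on port $PORT (target not configured on this host)"
fi
```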
Setting Up the Client
On the client host, load the NVMe modules and discover the exported subsystem:
$ sudo modprobe nvme
$ sudo modprobe nvme-tcp
$ sudo nvme discover -t tcp -a 10.147.27.85 -s 4420
Discovery Log Number of Records 1, Generation counter 3
=====Discovery Log Entry 0======
subtype: nvme subsystem
treq: not specified, sq flow control disable supported
Connect to the target and confirm that a new NVMe device appears:
$ sudo nvme connect -t tcp -n nvmet-test -a 10.147.27.85 -s 4420
$ sudo nvme list
Node          SN                Model  Namespace  Usage              Format       FW Rev
------------  ----------------  -----  ---------  -----------------  -----------  ------
/dev/nvme0n1  610d2342db36e701  Linux  1          2.20 GB / 2.20 GB  512 B + 0 B
You now have a remote NVMe block device exported via NVMe over Fabrics using TCP. You can read from and write to it like any other locally attached high-performance block device.
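Because each host/controller queue pair maps onto its own TCP connection, the number of established connections to the target roughly tracks the negotiated I/O queue count. A sketch of checking this from the client, again assuming the default port 4420:

```shell
# Count established TCP connections to the NVMe-TCP target port.
# Each queue pair uses its own connection, so this roughly tracks the
# I/O queue count (plus the admin queue) after "nvme connect".
PORT=4420
CONNS=$(ss -tn 2>/dev/null | grep -c ":$PORT" || true)
echo "established NVMe-TCP connections: $CONNS"
```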
Performance
To compare NVMe-RDMA and NVMe-TCP, a pair of Oracle X7-2 hosts was used, each with a Mellanox ConnectX-5 adapter and running Oracle Linux (OL8.2) with UEK6 (v5.4.0-1944). A pair of 40Gb ConnectX-5 ports was configured with RoCEv2 (RDMA) and performance tests were run; the ports were then reconfigured to use TCP and the tests were rerun. The FIO utility was used to measure I/O operations per second (IOPS) and latency.
When testing IOPS, a single-threaded 8k read test at a queue depth of 32 showed RDMA performing significantly better than TCP, but as threads were added, making better use of the NVMe queue model, TCP IOPS performance increased. At 32 threads, TCP IOPS performance matched that of RDMA.
Latency was measured with a single-threaded 8k read test at a queue depth of 1. TCP latency was 30% higher than RDMA's; most of the difference is due to the buffer copies that TCP requires.
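The two measurements described above can be reproduced with a fio job file along these lines. This is a sketch: the job names, the libaio engine, and the random-read pattern are assumptions, and filename= must point at your own NVMe-TCP block device:

```ini
; Sketch of the IOPS and latency tests described above.
[global]
; the NVMe-TCP block device on the client (adjust for your system)
filename=/dev/nvme0n1
rw=randread
bs=8k
direct=1
ioengine=libaio
runtime=60
time_based=1

; single job, queue depth 32; raise numjobs toward 32
; to reproduce the thread-scaling result
[iops-test]
iodepth=32
numjobs=1

; single job, queue depth 1, for the latency measurement
[latency-test]
stonewall
iodepth=1
numjobs=1
```

Run it with something like `fio nvme-tcp-compare.fio` (the file name is arbitrary), once with the ports configured for RoCEv2 and once for TCP.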
Although NVMe-TCP is still a young technology, the same cannot be said of TCP itself, and given TCP's dominance in the data center there is little doubt that NVMe-TCP will become a major player in the data center SAN space. We can expect many third-party NVMe-TCP products to be released this year, from Ethernet adapters optimized for NVMe-TCP to NVMe-TCP SAN products.