Revolutionizing Data Transfer: Advantages of Remote Direct Memory Access

Overview: This article explores remote direct memory access (RDMA), detailing its benefits, its protocols, and its impact on data transfer efficiency in high-performance computing environments, where it bypasses the traditional network stack.
In traditional networking, data is sent over a Transmission Control Protocol/Internet Protocol (TCP/IP) network, and the sender's network stack copies it through kernel buffers more than once before it is passed on to the receiver. The receiver then performs additional memory copies and protocol processing, which results in higher latency.
Technologies such as AI, deep learning, wide area networks, and Big Data have grown rapidly over the past few years. Network link speeds have grown from 40 Gbps to 100 Gbps, 400 Gbps, and even higher. Conventional data transfer methods are inadequate for multinode, multi-GPU scenarios that require low latency.
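To make the effect of these link speeds concrete, the short calculation below estimates how long a single packet occupies the link at each speed. It is a back-of-the-envelope sketch: the 1500-byte packet size is an assumption, and Ethernet framing overhead is ignored. The result approximates the time budget the host has per packet if it wants to keep the link full.

```python
def packet_time_ns(link_gbps: float, packet_bytes: int = 1500) -> float:
    """Time to serialize one packet onto the link, in nanoseconds.

    1 Gbps equals 1 bit per nanosecond, so bits / Gbps gives nanoseconds.
    The 1500-byte default is an assumed packet size, not a measured value.
    """
    bits = packet_bytes * 8
    return bits / link_gbps


if __name__ == "__main__":
    for gbps in (40, 100, 400):
        print(f"{gbps} Gbps: {packet_time_ns(gbps):.0f} ns per 1500-byte packet")
```

At 400 Gbps the budget shrinks to tens of nanoseconds per packet, which leaves little room for per-packet kernel processing and memory copies.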
A data center network with high speed, low latency, and stable data transfer is needed to support high-performance computing. To better handle very fast networks (above 10 Gbps), techniques such as network card offloading and remote direct memory access (RDMA) have been created. RDMA eliminates much of the overhead of traditional data transfer methods by moving data directly between the memory of the source and destination devices.
What is remote direct memory access?
Remote direct memory access is a network interconnect technology that allows data to be transferred directly from the memory of one computer or device to another without involving the processor or operating system of either device. It offers low-latency, high-throughput communication.
It enables direct data transmission between two network nodes without the kernel involvement that conventional TCP/IP requires, as shown in Fig. 1. As a result, RDMA simplifies data transfer compared with conventional techniques that depend on the CPU and software protocol stacks.

Fig. 1. Difference in data transmission between RDMA and TCP/IP. Source: MDPI
How RDMA Works
The RDMA network interface card (NIC) directly handles all transmission-related operations with user-space virtual memory. The data transfer is done without using the operating system's kernel or requiring any further data movement or copying.
The transmission system involves a client-server structure where the client initiates data operation requests, and the server responds. This system supports both two-sided and one-sided operations, each serving different purposes in data transmission, as shown in Fig. 2.
- Two-sided operations: SEND/RECEIVE operations require participation from both the client and the server. They are typically used for transmitting short control messages.
- One-sided operations: RDMA READ and RDMA WRITE operations allow the client to read from or write directly to the server's memory without involving the server's CPU, which is efficient for large data transfers.

Fig. 2. Illustration of the RDMA transmission system. Source: MDPI
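The difference between two-sided and one-sided operations can be sketched with a toy, in-process model. This is not a real RDMA API: the names `MemoryRegion`, `ToyNic`, `rdma_write`, `rdma_read`, `post_recv`, and `send` are hypothetical and chosen only for illustration. The sketch assumes that the server registers a buffer in advance and shares an access key with the client, after which one-sided operations touch that buffer without any server application code running.

```python
import secrets


class MemoryRegion:
    """A registered buffer that a remote peer may access with the right key."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.rkey = secrets.randbits(32)  # hypothetical remote access key


class ToyNic:
    """Stands in for the server-side RDMA NIC (illustrative only)."""

    def __init__(self):
        self._regions = {}     # rkey -> MemoryRegion
        self.recv_queue = []   # buffers the server posted for incoming SENDs
        self.completions = []  # messages delivered by two-sided SENDs

    def register(self, region: MemoryRegion) -> None:
        self._regions[region.rkey] = region

    def _lookup(self, rkey: int, offset: int, length: int) -> MemoryRegion:
        region = self._regions.get(rkey)
        if region is None:
            raise PermissionError("unknown remote key")
        if offset < 0 or offset + length > len(region.buf):
            raise IndexError("access outside the registered region")
        return region

    # One-sided: the client names the remote memory; the server app is idle.
    def rdma_write(self, rkey: int, offset: int, data: bytes) -> None:
        region = self._lookup(rkey, offset, len(data))
        region.buf[offset:offset + len(data)] = data

    def rdma_read(self, rkey: int, offset: int, length: int) -> bytes:
        region = self._lookup(rkey, offset, length)
        return bytes(region.buf[offset:offset + length])

    # Two-sided: the server must post a receive buffer before a SEND arrives.
    def post_recv(self, size: int) -> None:
        self.recv_queue.append(bytearray(size))

    def send(self, data: bytes) -> None:
        if not self.recv_queue:
            raise RuntimeError("receiver not ready: no posted receive buffer")
        buf = self.recv_queue.pop(0)
        if len(data) > len(buf):
            raise ValueError("message larger than the posted receive buffer")
        buf[:len(data)] = data
        self.completions.append(bytes(buf[:len(data)]))


if __name__ == "__main__":
    nic = ToyNic()
    region = MemoryRegion(64)
    nic.register(region)

    nic.rdma_write(region.rkey, 0, b"bulk data")   # one-sided write
    print(nic.rdma_read(region.rkey, 0, 9))        # one-sided read

    nic.post_recv(32)                              # server prepares a buffer
    nic.send(b"ctrl: done")                        # two-sided control message
    print(nic.completions)
```

In this model the SEND fails unless the server has posted a receive buffer first, while the READ and WRITE need only the key and an in-bounds offset, which mirrors why one-sided operations suit large transfers.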
RDMA Technologies
RDMA is commonly implemented using technologies such as:
- InfiniBand
- RoCE (RDMA over Converged Ethernet)
- iWARP (Internet Wide Area RDMA Protocol)
InfiniBand
InfiniBand represents the original and most comprehensive RDMA implementation. It excels in high-performance computing, offering low latency and high throughput.
InfiniBand defines its own layered architecture with its own physical and link layers, so an InfiniBand network is incompatible with existing Ethernet devices. For example, a data center that moves from Ethernet to InfiniBand in response to a performance slowdown would need to purchase a full range of InfiniBand equipment, including NICs, cables, switches, and routers, at considerable expense.
RoCE (RDMA over Converged Ethernet)
RoCE v2, which stands for RDMA over Converged Ethernet version 2, is now the most popular RDMA standard. It operates on top of the Ethernet link layer and the IP network layer, so it can provide RDMA capabilities over existing Ethernet infrastructure that is configured for lossless operation.
This reduces the need for a separate RDMA fabric, making it more cost-effective and easier to adopt. RDMA wasn't widely used in Ethernet-based data centers until RoCE came along. RoCE allowed data centers to operate with low latency and high throughput. RoCE v2 encapsulates the RDMA transport protocol in UDP at the transport layer, which makes its traffic routable across IP networks while keeping latency low.
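The UDP encapsulation can be illustrated by packing the headers of a single RDMA WRITE packet by hand. This is a layout sketch only, not a wire-valid packet: the Ethernet and IP headers are omitted, the UDP checksum is left at zero, and the trailing 4-byte invariant CRC is a zeroed placeholder that is not computed. The field layout assumes the InfiniBand Base Transport Header (12 bytes) and RDMA Extended Transport Header (16 bytes), UDP destination port 4791, and opcode 0x0A for a reliable-connection "RDMA WRITE Only" packet. The function names, the source port, and the queue pair, key, and address values are hypothetical.

```python
import struct

ROCE_V2_UDP_PORT = 4791    # UDP destination port used by RoCE v2
RC_RDMA_WRITE_ONLY = 0x0A  # assumed opcode: reliable connection, RDMA WRITE Only


def build_bth(opcode: int, dest_qp: int, psn: int,
              pad_count: int = 0, pkey: int = 0xFFFF,
              ack_req: bool = False) -> bytes:
    """Pack the 12-byte Base Transport Header (BTH)."""
    flags = (pad_count & 0x3) << 4           # SE=0, MigReq=0, PadCnt, TVer=0
    qp_word = dest_qp & 0xFFFFFF             # 8 reserved bits + 24-bit QP number
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def build_reth(remote_addr: int, rkey: int, dma_length: int) -> bytes:
    """Pack the 16-byte RDMA Extended Transport Header (RETH)."""
    return struct.pack("!QII", remote_addr, rkey, dma_length)


def build_roce_v2_udp_datagram(payload: bytes, dest_qp: int, psn: int,
                               remote_addr: int, rkey: int,
                               src_port: int = 49152) -> bytes:
    """Return UDP header + BTH + RETH + padded payload + placeholder ICRC."""
    pad = (-len(payload)) % 4                # payload is padded to 4 bytes
    bth = build_bth(RC_RDMA_WRITE_ONLY, dest_qp, psn,
                    pad_count=pad, ack_req=True)
    reth = build_reth(remote_addr, rkey, len(payload))
    icrc = b"\x00" * 4                       # placeholder, NOT a real ICRC
    body = bth + reth + payload + b"\x00" * pad + icrc
    udp = struct.pack("!HHHH", src_port, ROCE_V2_UDP_PORT, 8 + len(body), 0)
    return udp + body


if __name__ == "__main__":
    datagram = build_roce_v2_udp_datagram(
        b"hello", dest_qp=0x12, psn=1, remote_addr=0x1000, rkey=0xABCD)
    print(len(datagram), "bytes;", "UDP dst port",
          struct.unpack("!H", datagram[2:4])[0])
```

Because the RDMA headers ride inside an ordinary UDP datagram, standard IP routers can forward the packet without understanding RDMA.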
iWARP (Internet Wide Area RDMA Protocol)
iWARP enables RDMA operations over standard TCP/IP networks, making it compatible with existing Ethernet infrastructure. While iWARP provides flexibility and scalability, its performance may be lower than that of InfiniBand or RoCE due to the overhead of TCP.
Advantages
- By bypassing the CPU, RDMA enables a "zero-copy" mechanism where data can be transferred directly between application memory spaces without intermediate copying to kernel buffers, reducing latency.
- Since specialized RDMA hardware handles data transfers, CPU cycles are freed up. This improves system performance and leaves more processing capacity for other tasks.
- By utilizing the full capability of the underlying network architecture, RDMA enables high-bandwidth data transfer. It is very advantageous in high-performance computing environments.
- RDMA maximizes network utilization by reducing the quantity of unnecessary data movement and the number of network messages necessary for data transfer.
- RDMA can scale efficiently to handle large-scale systems and distributed environments, allowing for seamless integration into cluster and data center architectures.
- It supports various network layouts, from simple two-node setups to complex multi-cluster environments, while maintaining the same high-performance characteristics.
- It efficiently handles parallel operations; as more nodes are added to the system, the aggregate bandwidth can grow close to linearly without adding significant computational overhead.
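The "zero-copy" idea in the first point can be illustrated inside a single Python process. This is only an analogy, not RDMA itself: slicing a `bytearray` produces an independent copy, much like copying into an intermediate buffer, whereas a `memoryview` slice refers to the original memory. The buffer size and variable names are arbitrary.

```python
# A 1 MB application buffer (size chosen arbitrarily for illustration).
payload = bytearray(b"x" * 1_000_000)

copied = payload[:4096]            # slicing a bytearray allocates and copies
view = memoryview(payload)[:4096]  # a memoryview slice shares the same memory

# Change the original buffer and observe which object sees the change.
payload[0] = ord("y")

print(chr(copied[0]))  # the copy still holds the old byte
print(chr(view[0]))    # the view reflects the new byte without any copy
```

RDMA applies the same principle across machines: the NIC reads from and writes to the application's own buffers instead of staging data in kernel buffers.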
Challenges
Current AI and machine learning applications also present networking challenges that traditional RDMA protocols may not address. The Ultra Ethernet Consortium (UEC) aims to address these limitations by developing new standards and technologies for high-performance AI and machine learning workloads.
Summarizing the Key Points
- Traditional TCP/IP protocols involve multiple data copies and kernel interventions, leading to higher latency and making RDMA a primary choice for applications that demand speed.
- RDMA allows direct data transfer between computers' memory, bypassing the CPU and OS, significantly reducing latency and improving throughput in high-performance environments.
- Emerging networking challenges in AI and machine learning are prompting the development of new RDMA standards and protocols, enhancing capabilities for these workloads.
References
Ma, J., Guo, Z., Pan, Y., Zhang, M., Zhao, Z., Sun, Z., & Chang, Y. (2024). ORNIC: A High-Performance RDMA NIC with Out-of-Order Packet Direct Write Method for Multipath Transmission. Electronics, 14(1), 88. https://doi.org/10.3390/electronics14010088
Sun, Z., Guo, Z., Ma, J., & Pan, Y. (2024). A High-Performance FPGA-Based RoCE v2 RDMA Packet Parser and Generator. Electronics, 13(20), 4107. https://doi.org/10.3390/electronics13204107
He, Q., Gao, P., Zhang, F., Bian, G., Zhang, W., & Li, Z. (2023). Design and optimization of a distributed file system based on RDMA. Applied Sciences, 13(15), 8670. https://doi.org/10.3390/app13158670