Mellanox Ofed 4.4 User Manual

Persistent memory technologies, such as Intel® Optane™ DC persistent memory, come with several challenges. Remote access is among the most difficult because no ready-to-use technology supports remote persistent memory (RPMEM). Remote direct memory access (RDMA), the most common way to access remote memory, does not address data durability.

This paper proposes solutions for accessing RPMEM based on traditional RDMA. These solutions have been implemented in the Persistent Memory Development Kit (PMDK) librpmem library. Before you read this part, which describes how to set up and configure RDMA-capable NICs (RNICs) and how to use the RDMA-capable network in a replication process, we suggest reading Part 2: Remote Persistent Memory 101. The other parts in this series include:

  • Part 1, “Understanding Remote Persistent Memory,” describes the theoretical realm of remote persistent memory.
  • Part 4, “Persistent Memory Development Kit-Based PMEM Replication,” describes how to create and configure an application that can replicate persistent memory over an RDMA-capable network.

Prepare to Shine – RNIC Setup

RNICs are recommended for use with rpmem because RDMA allows writing data directly to persistent memory, which gives significant performance gains.

This section gives entry-level knowledge regarding the setup and the configuration of RNICs from three manufacturers: Mellanox*, Chelsio*, and Intel. RNIC configuration is a complex topic and describing it in detail is outside the scope of this paper. For details regarding the setup of RNICs from each manufacturer, see the manuals mentioned in the respective sections below.

The setup described here consists of two separate machines, each equipped with its own RNIC and connected to each other. The configuration steps described below have to be performed on both of them: on the initiator as well as on the target.

Mellanox*

Mellanox RNICs support three RDMA-compatible protocols: InfiniBand*, RoCE v1, and RoCE v2. Which protocol is best for a specific application depends on the technical constraints of the network. Mellanox provides exhaustive documentation describing the required configuration steps for each of them. For details, see Mellanox OFED for Linux* User Manual Rev 4.5 or later.

To check if the machine is equipped with a Mellanox RNIC:

To start working with Mellanox RNICs, first choose the RDMA protocol and perform the following steps:

  1. Install Mellanox software (see “Installing Mellanox Software” later in this section).
  2. Configure port types required for the chosen RDMA protocol (see “Configuring Port Types” later in this section).
  3. Configure the chosen RDMA protocol:
    1. RoCE v2 (see “RoCE v2” later in this section); or
    2. InfiniBand (see “InfiniBand*” later in this section); or
    3. RoCE v1 (see “RoCE v1” later in this section).

Installing Mellanox Software

Download the software package from Mellanox. Choose the relevant package depending on the machine’s operating system.

Reboot the machine.

Configuring Port Types

Mellanox RNIC ports can be individually configured to work as InfiniBand or Ethernet ports. Each of the RDMA protocols requires a specific port type so, prior to configuration of the chosen RDMA protocol, it is necessary to configure the appropriate port type.

Before querying and changing the port type, start the mst (Mellanox Software Tools) service:

Querying port types:

The LINK_TYPE has to match the chosen RDMA protocol. There are two possible LINK_TYPE values:

  • ETH(2) - Ethernet (required by RoCE v2)
  • IB(1) - InfiniBand (required by InfiniBand and RoCE v1)
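Checking the queried port types by eye gets error-prone with several ports. The following is a minimal sketch, not part of any Mellanox tool, that extracts LINK_TYPE values from query output. It assumes the output contains one `LINK_TYPE_P<port>` line per port; the sample text is illustrative and was not captured from a real device.

```python
import re

# Illustrative text only: shaped like the LINK_TYPE lines of a port-type
# query, not captured from a real device.
SAMPLE_QUERY_OUTPUT = """\
         LINK_TYPE_P1                        ETH(2)
         LINK_TYPE_P2                        IB(1)
"""

# Matches e.g. "LINK_TYPE_P1   ETH(2)" and captures port, name, and value.
LINK_TYPE_LINE = re.compile(r"LINK_TYPE_P(\d+)\s+(\w+)\((\d+)\)")


def parse_link_types(query_output):
    """Return {port_number: (name, numeric_value)} for each LINK_TYPE line."""
    ports = {}
    for match in LINK_TYPE_LINE.finditer(query_output):
        port, name, value = match.groups()
        ports[int(port)] = (name, int(value))
    return ports


def ports_not_matching(query_output, required_name):
    """List ports whose LINK_TYPE differs from the required one (e.g. "ETH")."""
    return sorted(
        port
        for port, (name, _value) in parse_link_types(query_output).items()
        if name != required_name
    )


print(parse_link_types(SAMPLE_QUERY_OUTPUT))
print(ports_not_matching(SAMPLE_QUERY_OUTPUT, "ETH"))
```

Any port reported by `ports_not_matching` still needs its type changed before the chosen protocol can be configured.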

Setting port type (for example, to Ethernet):

After changing the port type, a machine reboot is required.

If the machine is equipped with more than one Mellanox RNIC, it will also have more than one /dev/mst/mt*_pciconf0 device in the system. In this case, the wildcard in the path has to be replaced with the appropriate vendor_part_id, which can be found in the ibv_devinfo(1) command output.
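With several RNICs installed, each device node has to be addressed explicitly. The sketch below lists the matching device nodes and prints, without running it, one port-type command per device. It assumes the `mlxconfig` tool from the Mellanox Firmware Tools and its per-port `LINK_TYPE_P<port>` parameter; the fallback device name `mt4115_pciconf0` is only an example.

```python
import glob
import os
import shlex

# Numeric values as listed in this section.
LINK_TYPE_VALUES = {"IB": 1, "ETH": 2}


def list_mst_devices(dev_dir="/dev/mst"):
    """Return the mt*_pciconf0 device nodes found in dev_dir, sorted."""
    return sorted(glob.glob(os.path.join(dev_dir, "mt*_pciconf0")))


def set_link_type_cmd(device, port, link_type):
    """Build (not run) the mlxconfig command that sets one port's LINK_TYPE."""
    value = LINK_TYPE_VALUES[link_type]
    return ["mlxconfig", "-d", device, "set", f"LINK_TYPE_P{port}={value}"]


# Fall back to an example device name when no mst devices are present.
devices = list_mst_devices() or ["/dev/mst/mt4115_pciconf0"]
for device in devices:
    print(shlex.join(set_link_type_cmd(device, port=1, link_type="ETH")))
```

Printing the commands first makes it easy to confirm that each one targets the intended adapter before running it with root privileges.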

RoCE v2

RoCE v2 requires Priority Flow Control (PFC) to be configured and outbound (egress) packets to be tagged in order to function reliably. The example below creates a separate virtual LAN interface (VLAN) for which all outbound packets are tagged with the chosen PFC priority.
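One way to express the egress tagging is an iproute2 VLAN interface with an egress QoS map. This is a substitute technique chosen for illustration, not necessarily the commands the authors used. The sketch only builds and prints the command; the interface name `eth0`, VLAN ID 100, and priority 3 are hypothetical values, and PFC itself still has to be enabled on the adapter separately.

```python
import shlex


def vlan_with_egress_map_cmd(ifname, vlan_id, pfc_priority):
    """Build the `ip link` command creating a VLAN interface whose egress map
    sends kernel packet priorities 0-7 to the chosen PFC priority."""
    if not 0 <= pfc_priority <= 7:
        raise ValueError("PFC priority must be in the range 0-7")
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be in the range 1-4094")
    # One FROM:TO pair per kernel priority, all mapped to the same PFC priority.
    qos_map = [f"{skb_prio}:{pfc_priority}" for skb_prio in range(8)]
    return [
        "ip", "link", "add", "link", ifname,
        "name", f"{ifname}.{vlan_id}",
        "type", "vlan", "id", str(vlan_id),
        "egress-qos-map", *qos_map,
    ]


# Hypothetical values, for illustration only.
print(shlex.join(vlan_with_egress_map_cmd("eth0", 100, 3)))
```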

How to choose the appropriate PFC priority is outside the scope of this paper. For details, see Mellanox OFED for Linux User Manual Rev 4.5 or later.

For RoCE v2, several link speeds are available. It is important to choose the right one to obtain the desired performance. The link speed is negotiated between the machines on both ends of the link, so the link speed has to be set to the same value on both machines.

InfiniBand defines several fixed-size MTUs; for example: 1024, 2048, or 4096 bytes. However, the MTU set with ifconfig must also leave room for the RoCE transport headers, so the acceptable values are 1500, 2200, and 4200. This value has to be the same on both machines.
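The relation between the two sets of values can be captured in a small lookup. The sketch below pairs the InfiniBand MTUs with the interface MTUs in the order both lists are given above; that pairing is an assumption rather than something the text states outright. It also checks that both machines agree.

```python
# Pairs formed by matching the two lists in this section in order.
# The pairing itself is an assumption.
IB_TO_INTERFACE_MTU = {1024: 1500, 2048: 2200, 4096: 4200}


def interface_mtu_for(ib_mtu):
    """Return the interface MTU to configure for a given InfiniBand MTU."""
    try:
        return IB_TO_INTERFACE_MTU[ib_mtu]
    except KeyError:
        raise ValueError(f"unsupported InfiniBand MTU: {ib_mtu}") from None


def mtus_consistent(initiator_mtu, target_mtu):
    """True when both machines use the same, acceptable interface MTU."""
    return (
        initiator_mtu == target_mtu
        and initiator_mtu in IB_TO_INTERFACE_MTU.values()
    )


print(interface_mtu_for(4096))
print(mtus_consistent(4200, 4200))
print(mtus_consistent(4200, 1500))
```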

Libfabric uses librdmacm for communication management. RDMA_CM has to be configured to use RoCE v2:

InfiniBand*

The subnet manager (opensm) must be running for each InfiniBand subnet. Since the initiator and the target are in the same subnet, opensm has to run only on one of them.

Further configuration has to be performed on both nodes:

Opensm has to be started after both IB network interfaces in the subnet are powered on.

RoCE v1

RoCE v1 and RoCE v2 run over Converged Ethernet so they impose the same requirements on the Ethernet configuration (PFC and egress mapping) in order to function reliably.

Because it is an Ethernet link layer protocol, RoCE v1 allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol so it can be routed between broadcast domains.

PFC setup has to be performed on both nodes:

Since RoCE v1 runs over InfiniBand, it also requires one opensm instance for each InfiniBand subnet. It can run either on the initiator or on the target:

After opensm starts, the network interfaces on both machines should be up and ready for further configuration:

How to choose the appropriate PFC priority is outside the scope of this paper. For details, see Mellanox OFED for Linux User Manual Rev 4.5 or later.

Libfabric uses librdmacm for communication management. RDMA_CM has to be configured to use RoCE v1:

Chelsio*

Chelsio RNICs support RDMA by implementing the iWARP protocol. Chelsio provides documentation describing all required configuration steps. For details, see the User Guide for Chelsio Unified Wire v3.10.0.0 (or later) for Linux.

Installing Chelsio Software

Download the software package from Chelsio. Choose the relevant package depending on the machine’s operating system.

Configuring

Associate the Chelsio network controller ports to the network interfaces.

Configure the desired link speed and an IP address for the interface.

Intel

The RNIC configuration steps for Intel® Omni-Path Fabric software are described in the Intel Omni-Path Fabric Software documentation, Rev. 11.0 or later. The latest documentation and end user publications are available at Fabric Software Publications.

Installing Intel® Omni-Path Fabric Software

Go to the download center and find Intel Omni-Path Fabric software. Choose the relevant package of Intel® Omni-Path Fabric suite (OPA-IFS), depending on the machine’s operating system.

Reboot the machine.

Configuring

Associate the Intel® Omni-Path Host Fabric Interface adapter ports to the network interfaces.

Configure the desired MTU and an IP address for the interface.

Checking Connectivity

This paper assumes the IP addresses in Figure 1 are assigned to the network interfaces.


Figure 1. Two RDMA machines network

Make the human-friendly name target resolve to the target's IP address:
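Name resolution of this kind is commonly done with an entry in the hosts file. The sketch below works on hosts-file text in memory and adds the entry only if the name is not already present. The address 192.0.2.2 is a documentation placeholder, not the address from Figure 1, and writing the result to /etc/hosts is deliberately left to the reader.

```python
def has_host(hosts_text, name):
    """True if any non-comment line of hosts_text already defines name."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        if name in fields[1:]:                  # fields[0] is the address
            return True
    return False


def with_host(hosts_text, ip, name):
    """Return hosts_text with an "ip name" entry appended if name is missing."""
    if has_host(hosts_text, name):
        return hosts_text
    if hosts_text and not hosts_text.endswith("\n"):
        hosts_text += "\n"
    return hosts_text + f"{ip} {name}\n"


# Placeholder address; substitute the target's real IP address.
print(with_host("127.0.0.1 localhost\n", "192.0.2.2", "target"), end="")
```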

To check connectivity, use the regular ping(8) command:

Note: A firewall configuration on the target node may prevent pinging from the initiator. Firewall configuration is outside the scope of this paper.

It is also possible to check RDMA connectivity using rping(1), which establishes an RDMA connection between the two nodes. In the following example it also performs RDMA transfers. First, start the rping server on the target and then the rping client on the initiator.
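The server and client invocations differ only in one flag, which the following sketch makes explicit. It builds and prints the two rping command lines without running them, using the `-s`/`-c` (server/client), `-a` (address), `-v` (verbose), and `-C` (iteration count) options. The count of 10 and the address 192.0.2.2 are placeholders, not values taken from this paper.

```python
import shlex

ROLE_FLAGS = {"server": "-s", "client": "-c"}


def rping_cmd(role, address, count=10):
    """Build the rping command for the given role ("server" or "client")."""
    if role not in ROLE_FLAGS:
        raise ValueError(f"unknown role: {role!r}")
    return ["rping", ROLE_FLAGS[role], "-a", address, "-v", "-C", str(count)]


# Placeholder address; both sides use the target's address.
print("on the target:   ", shlex.join(rping_cmd("server", "192.0.2.2")))
print("on the initiator:", shlex.join(rping_cmd("client", "192.0.2.2")))
```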

When the rping client finishes the transfers, the server will display additional messages:

If rping finished its connection test successfully, the RDMA connection is ready for use.

Shift to RDMA-Capable Network

Having configured replication between two machines (described in Replicating to Another Machine), shifting to RDMA-capable NICs (RNICs) is straightforward. This section highlights the differences between using RNICs and regular NICs.


Figure 2. Rpmem software stack with two machines equipped with RDMA.

Configuring Platforms

The rpmem replication between two machines may be subject to limits established via limits.conf(5). Limits are modified on the initiator and on the target separately. For details, see Single Machine Setup.
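The limits in effect for the current process can be inspected before starting replication. The sketch below reads the locked-memory limit, on the assumption that memlock is among the relevant limits (RDMA memory registration typically depends on it); which limits rpmem actually needs is covered in Single Machine Setup. It has to be run on each machine separately.

```python
import resource


def describe(limit):
    """Render an rlimit value as text."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process as text."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe(soft), describe(hard)


soft, hard = memlock_limits()
print(f"memlock soft limit: {soft}")
print(f"memlock hard limit: {hard}")
```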

Network Configuration

RDMA network configuration is described in Prepare to Shine – RNIC Setup.

Rpmem Installation

Just as with the Ethernet NICs, rpmem requires some specific software components installed on the initiator and on the target machine. No changes here.

SSH Configuration

SSH configuration is exactly the same, independent of whether NICs or RNICs are used.

If the following steps were executed for other network interfaces, there is no need to repeat the first step. The second and third steps, however, have to be performed each time a new machine, account, or IP address is used.

Rpmemd Configuration

The rpmemd configuration does not depend on the type of network interface, so it stays the same:

Librpmem Configuration

The key difference between using NICs and RNICs is in the librpmem configuration. The Ethernet network interfaces do not support the verbs fabric provider, so they require enabling the sockets provider. This is not the case with the RNICs so the librpmem library does not require any additional configuration.

If the initiator machine was previously used for replication using NICs, it may have the sockets provider support enabled. This is not an issue because, if both the verbs provider and the sockets provider are available, librpmem will always pick the verbs provider.
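The selection rule described above can be modeled in a few lines. This is an illustrative model of the described behavior only, not librpmem's actual code; the function name and the `sockets_enabled` parameter are hypothetical.

```python
def pick_provider(available, sockets_enabled=False):
    """Model of the rule above: verbs wins whenever it is available;
    sockets is used only when it has been explicitly enabled."""
    if "verbs" in available:
        return "verbs"
    if "sockets" in available and sockets_enabled:
        return "sockets"
    return None


# RNIC machine that still has sockets enabled from an earlier NIC setup.
print(pick_provider({"verbs", "sockets"}, sockets_enabled=True))
# Regular NIC: sockets has to be enabled explicitly.
print(pick_provider({"sockets"}, sockets_enabled=True))
print(pick_provider({"sockets"}))
```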

Creating a Memory Pool with a Replica

After updating the configuration, using rpmem replication is no different.

Inspecting the Results

The expected results are the same:

The replica can also be inspected with hexdump to verify that it has the expected contents:
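Where hexdump is not at hand, a similar view can be produced with a few lines of Python. The sketch below formats bytes as offset, hex values, and printable characters, loosely following the layout of a canonical hex dump. The sample bytes are made up; in practice the data would be read from the replica's pool file on the target.

```python
def hexdump_lines(data, width=16):
    """Format data as lines of: offset, hex bytes, printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "xx xx ... xx"
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  |{text}|")
    return lines


# Made-up sample bytes, for illustration only.
for line in hexdump_lines(b"Hello\x00replica"):
    print(line)
```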

Proceed to Part 4, “Persistent Memory Development Kit-Based PMEM Replication,” which describes how to create and configure an application that can replicate persistent memory over an RDMA-capable network.

Other Articles in This Series

Part 1, “Understanding Remote Persistent Memory,” describes the theoretical realm of remote persistent memory.

Part 2, “Remote Persistent Memory 101,” depicts examples of setups and practical uses of RPMEM.
