25 Persistent Memory #
This chapter contains additional information about using SUSE Linux Enterprise Server with non-volatile main memory, also known as Persistent Memory, comprising one or more NVDIMMs.
25.1 Introduction #
Persistent memory is a new type of computer storage, combining speeds approaching those of normal dynamic RAM (DRAM) along with RAM's byte-by-byte addressability, plus the permanence of solid-state disks (SSDs).
Like conventional RAM, it is installed directly into motherboard memory slots. As such, it is supplied in the same physical form factor as RAM—as DIMMs. These are known as NVDIMMs: non-volatile dual inline memory modules.
Unlike RAM, though, persistent memory is also similar to flash-based SSDs in several ways. Both are based on forms of solid-state memory circuitry, and both provide non-volatile storage: their contents are retained when the system is powered off or restarted. For both media, writing data is slower than reading it, and both support a limited number of rewrite cycles. Finally, also like SSDs, sector-level access to persistent memory is possible if that is more suitable for a particular application.
Different models use different forms of electronic storage medium, such as Intel 3D XPoint, or a combination of NAND-flash and DRAM. New forms of non-volatile RAM are also in development. This means that different vendors and models of NVDIMM offer different performance and durability characteristics.
Because the storage technologies involved are in an early stage of development, different vendors' hardware may impose different limitations. Thus, the following statements are generalizations.
Persistent memory is up to ten times slower than DRAM, but around a thousand times faster than flash storage. It can be rewritten on a byte-by-byte basis rather than flash memory's whole-sector erase-and-rewrite process. Finally, while rewrite cycles are limited, most forms of persistent memory can handle millions of rewrites, compared to the thousands of cycles of flash storage.
This has two important consequences:
It is not possible with current technology to run a system with only persistent memory and thus achieve completely non-volatile main memory. You must use a mixture of both conventional RAM and NVDIMMs. The operating system and applications will execute in conventional RAM, with the NVDIMMs providing very fast supplementary storage.
The performance characteristics of different vendors' persistent memory mean that it may be necessary for programmers to be aware of the hardware specifications of the NVDIMMs in a particular server, including how many NVDIMMs there are and in which memory slots they are fitted. This will obviously impact hypervisor use, migration of software between different host machines, and so on.
This new storage subsystem is defined in version 6 of the ACPI standard.
However, libnvdimm supports pre-standard NVDIMMs, and they can be used in the same way.
25.2 Terms #
- Region
A region is a block of persistent memory that can be divided up into one or more namespaces. You cannot access the persistent memory of a region without first allocating it to a namespace.
- Namespace
A single contiguously-addressed range of non-volatile storage, comparable to NVM Express SSD namespaces, or to SCSI Logical Units (LUNs). Namespaces appear in the server's /dev directory as separate block devices. Depending on the method of access required, namespaces can either amalgamate storage from multiple NVDIMMs into larger volumes, or allow it to be partitioned into smaller volumes.
- Mode
Each namespace has a mode that defines which NVDIMM features are enabled for that namespace. Sibling namespaces of the same parent region will always have the same type, but might be configured to have different modes. Namespace modes include:
- raw
A memory disk. Does not support DAX. Compatible with other operating systems.
- sector
For legacy file systems which do not checksum metadata. Suitable for small boot volumes. Compatible with other operating systems.
- fsdax
File system-DAX mode. Default if no other mode is specified. Creates a block device (/dev/pmemX [.Y]) which supports DAX for ext4 or XFS.
- devdax
Device-DAX mode. Creates a single-character device file (/dev/daxX.Y). Does not require file system creation.
- Type
Each namespace and region has a type that defines the way in which the persistent memory associated with that namespace or region can be accessed. A namespace always has the same type as its parent region. There are two different types: Persistent Memory and Block Mode.
- Persistent Memory (PMEM)
PMEM storage offers byte-level access, just like RAM. This enables Direct Access (DAX), meaning that accessing the memory bypasses the kernel's page cache and goes direct to the medium. Additionally, using PMEM, a single namespace can include multiple interleaved NVDIMMs, allowing them all to be accessed as a single device.
- Block Mode (BLK)
BLK access is in sectors, usually of 512 bytes, through a defined access window, the aperture. This behavior is more like a traditional disk drive. This also means that both reads and writes are cached by the kernel. With BLK access, each NVDIMM is accessed as a separate namespace.
Some devices support both PMEM and BLK modes. Additionally, some allow the storage to be split into separate namespaces, so that some can be accessed using PMEM and some using BLK.
Apart from devdax namespaces, all other types must be formatted with a file system such as ext2, ext4 or XFS, just as with a conventional drive.
- Direct Access (DAX)
DAX allows persistent memory to be directly mapped into a process's address space, for example using the mmap system call. This is suitable for directly accessing large amounts of PMEM without using any additional RAM, for registering blocks of PMEM for RDMA, or for directly assigning it to virtual machines.
- DIMM Physical Address (DPA)
A memory address as an offset into a single DIMM's memory; that is, starting from zero as the lowest addressable byte on that DIMM.
- Label
Metadata stored on the NVDIMM, such as namespace definitions. This can be accessed using DSMs.
- Device-specific method (DSM)
ACPI method to access the firmware on an NVDIMM.
25.3 Use Cases #
25.3.1 PMEM with DAX #
It is important to note that this form of memory access is not transactional. In the event of a power outage or other system failure, data may not be completely written into storage. PMEM storage is only suitable if the application can handle the situation of partially-written data.
25.3.1.1 Applications That Benefit from Large Amounts of Byte-Addressable Storage #
If the server will host an application that can directly use large amounts
of fast storage on a byte-by-byte basis, the programmer can use the mmap
system call to place blocks of persistent memory directly into the
application's address space, without using any additional system RAM.
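This byte-by-byte style of access can be sketched in Python with the mmap module. The example below uses an ordinary temporary file so that it runs anywhere; on an fsdax file system mounted with the dax option, the same map, store and flush operations act directly on persistent memory, with no page cache in between.

```python
import mmap
import os
import tempfile

# Sketch of byte-addressable access through a memory mapping. A temporary
# file stands in for a file on a DAX-mounted file system; the calls are
# the same in both cases.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 4096)          # size the backing file to one page
    with mmap.mmap(fd, 4096) as mm:
        mm[0:5] = b"hello"          # byte-level store, no sector rewrite
        mm.flush()                  # msync: ensure data reaches the medium
        data = bytes(mm[0:5])       # byte-level load
finally:
    os.close(fd)
    os.remove(path)

print(data)  # b'hello'
```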
25.3.1.2 Avoiding Use of the Kernel Page Cache #
You may wish to conserve RAM for your applications rather than spending it on the kernel's page cache. For instance, non-volatile memory could be dedicated to holding virtual machine (VM) images. As these would not be cached, this would reduce the cache usage on the host, allowing more VMs per host.
25.3.2 PMEM with BTT #
This is useful when you want to use the persistent memory on a set of NVDIMMs as a disk-like pool of very fast storage.
To applications, such devices just appear as very fast SSDs and can be used like any other storage device. For example, LVM can be layered on top of the non-volatile storage and will work as normal.
The advantage of BTT is that sector write atomicity is guaranteed, so even sophisticated applications that depend on data integrity will keep working. Media error reporting works through standard error-reporting channels.
25.3.3 BLK storage #
Although BLK storage is more robust against single-device failure, it requires additional management, as each NVDIMM appears as a separate device. Thus, PMEM with BTT is generally preferred.
BLK storage is deprecated and is not supported in later versions of SUSE Linux Enterprise Server.
25.4 Tools for Managing Persistent Memory #
To manage persistent memory, it is necessary to install the
ndctl
package. This also installs the
libndctl
package, which provides a set of user-space
libraries to configure NVDIMMs.
These tools work via the libnvdimm
library, which
supports three types of NVDIMMs:
PMEM
BLK
Simultaneous PMEM and BLK
The ndctl
utility has a helpful set of
man
pages, accessible with the command:
ndctl help subcommand
To see a list of available subcommands, use:
ndctl --list-cmds
The available subcommands include:
- version
Displays the current version of the NVDIMM support tools.
- enable-namespace
Makes the specified namespace available for use.
- disable-namespace
Prevents the specified namespace from being used.
- create-namespace
Creates a new namespace from the specified storage devices.
- destroy-namespace
Removes the specified namespace.
- enable-region
Makes the specified region available for use.
- disable-region
Prevents the specified region from being used.
- zero-labels
Erases the metadata from a device.
- read-labels
Retrieves the metadata of the specified device.
- list
Displays available devices.
- help
Displays information about using the tool.
25.5 Setting Up Persistent Memory #
25.5.1 Viewing Available NVDIMM Storage #
The ndctl
list
command can be used to
list all available NVDIMMs in a system.
In the following example, the system has three NVDIMMs which are in a single, triple-channel interleaved set.
root #
ndctl list --dimms
[
  {
    "dev":"nmem2",
    "id":"8089-00-0000-12325476"
  },
  {
    "dev":"nmem1",
    "id":"8089-00-0000-11325476"
  },
  {
    "dev":"nmem0",
    "id":"8089-00-0000-10325476"
  }
]
With a different parameter, ndctl
list
will also list the available regions.
Regions may not appear in numerical order.
Note that although there are only three NVDIMMs, they appear as four regions.
root #
ndctl list --regions
[
  {
    "dev":"region1",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region3",
    "size":202937204736,
    "available_size":202937204736,
    "type":"pmem",
    "iset_id":5903239628671731251
  },
  {
    "dev":"region0",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region2",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  }
]
The space is available in two different forms: either as three separate 64 GB regions of type BLK, or as one combined 189 GB region of type PMEM which presents all the space on the three interleaved NVDIMMs as a single volume.
Note that the displayed value for available_size
is the
same as that for size
. This means that none of the space
has been allocated yet.
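The same check can be scripted. The following Python sketch parses a sample of the ndctl list --regions output shown above (in a real script, the JSON would be captured from running ndctl itself) and lists the regions whose space is entirely unallocated:

```python
import json

# Sample output in the form returned by `ndctl list --regions` on the
# example system above; a real script would capture this by running ndctl.
regions_json = '''
[
  { "dev":"region1", "size":68182605824, "available_size":68182605824, "type":"blk" },
  { "dev":"region3", "size":202937204736, "available_size":202937204736, "type":"pmem",
    "iset_id":5903239628671731251 },
  { "dev":"region0", "size":68182605824, "available_size":68182605824, "type":"blk" },
  { "dev":"region2", "size":68182605824, "available_size":68182605824, "type":"blk" }
]
'''

regions = json.loads(regions_json)
# A region whose available_size equals its size has no namespaces yet.
unallocated = sorted(r["dev"] for r in regions
                     if r["available_size"] == r["size"])
print(unallocated)  # ['region0', 'region1', 'region2', 'region3']
```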
25.5.2 Configuring the Storage as a Single PMEM Namespace with DAX #
For the first example, we will configure our three NVDIMMs into a single PMEM namespace with Direct Access (DAX).
The first step is to create a new namespace.
root #
ndctl create-namespace --type=pmem --mode=fsdax --map=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":199764213760,
  "uuid":"dc8ebb84-c564-4248-9e8d-e18543c39b69",
  "blockdev":"pmem3"
}
This creates a block device /dev/pmem3, which supports DAX. The 3 in the device name is inherited from the parent region number, in this case region3.
The --map=memory option sets aside part of the PMEM storage space on the NVDIMMs so that it can be used to allocate internal kernel data structures called struct pages. This allows the new PMEM namespace to be used with features such as O_DIRECT I/O and RDMA.
The reservation of some persistent memory for kernel data structures is why the resulting PMEM namespace has a smaller capacity than the parent PMEM region.
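The size of this reservation can be estimated. Assuming roughly 64 bytes of struct page per 4 KiB page (an assumption for illustration; the exact structure size is kernel-dependent), the numbers from this example work out as follows:

```python
# Estimate the --map=memory reservation for the example region above.
# The 64-bytes-per-page figure is an assumption, not a guarantee.
region_size = 202937204736      # parent PMEM region, in bytes
namespace_size = 199764213760   # resulting fsdax namespace, in bytes

pages = region_size // 4096
estimated_overhead = pages * 64                   # ~64 B of struct page per page
actual_overhead = region_size - namespace_size    # what was actually reserved

print(estimated_overhead)  # 3170893824
print(actual_overhead)     # 3172990976, about 1.6 % of the region
```

The two figures agree to within a fraction of a percent, which supports the rule of thumb but should not be relied on for exact capacity planning.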
Next, we verify that the new block device is available to the operating system:
root #
fdisk -l /dev/pmem3
Disk /dev/pmem3: 186 GiB, 199764213760 bytes, 390164480 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Before it can be used, like any other drive, it must be formatted. In this example, we format it with XFS:
root #
mkfs.xfs /dev/pmem3
meta-data=/dev/pmem3        isize=256    agcount=4, agsize=12192640 blks
         =                  sectsz=4096  attr=2, projid32bit=1
         =                  crc=0        finobt=0, sparse=0
data     =                  bsize=4096   blocks=48770560, imaxpct=25
         =                  sunit=0      swidth=0 blks
naming   =version 2         bsize=4096   ascii-ci=0 ftype=1
log      =internal log      bsize=4096   blocks=23813, version=2
         =                  sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none              extsz=4096   blocks=0, rtextents=0
Next, we can mount the new drive onto a directory:
root #
mount -o dax /dev/pmem3 /mnt/pmem3
Then we can verify that we now have a DAX-capable device:
root #
mount | grep dax
/dev/pmem3 on /mnt/pmem3 type xfs (rw,relatime,attr2,dax,inode64,noquota)
The result is that we now have a PMEM namespace formatted with the XFS file system and mounted with DAX.
Any mmap()
calls to files in that file system will
return virtual addresses that directly map to the persistent memory on our
NVDIMMs, completely bypassing the page cache.
Any fsync
or msync
calls on files in
that file system will still ensure that modified data has been fully
written to the NVDIMMs. These calls flush the processor cache lines
associated with any pages that have been modified in userspace via
mmap
mappings.
25.5.2.1 Removing a Namespace #
Before creating any other type of volume that uses the same storage, we must unmount and then remove this PMEM volume.
First, unmount it:
root #
umount /mnt/pmem3
Then disable the namespace:
root #
ndctl disable-namespace namespace3.0
disabled 1 namespace
Then delete it:
root #
ndctl destroy-namespace namespace3.0
destroyed 1 namespace
25.5.3 Creating a PMEM Namespace with BTT #
In the next example, we create a PMEM namespace that uses BTT.
root #
ndctl create-namespace --type=pmem --mode=sector
{
  "dev":"namespace3.0",
  "mode":"sector",
  "uuid":"51ab652d-7f20-44ea-b51d-5670454f8b9b",
  "sector_size":4096,
  "blockdev":"pmem3s"
}
Next, verify that the new device is present:
root #
fdisk -l /dev/pmem3s
Disk /dev/pmem3s: 188.8 GiB, 202738135040 bytes, 49496615 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Like the DAX-capable PMEM namespace we previously configured, this BTT-capable PMEM namespace consumes all the available storage on the NVDIMMs.
The trailing s in the device name (/dev/pmem3s) stands for sector and can be used to easily distinguish PMEM and BLK namespaces that are configured to use the BTT.
The volume can be formatted and mounted as in the previous example.
The PMEM namespace shown here cannot use DAX. Instead it uses the BTT to provide sector write atomicity. On each sector write through the PMEM block driver, the BTT will allocate a new sector to receive the new data. The BTT atomically updates its internal mapping structures after the new data is fully written so the newly written data will be available to applications. If the power fails at any point during this process, the write will be completely lost and the application will have access to its old data, still intact. This prevents the condition known as "torn sectors".
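The allocate-write-remap sequence described above can be illustrated with a small Python model. This is only a conceptual sketch (the ToyBTT class and its names are invented for illustration, and the kernel's real BTT is considerably more involved), but it shows why a reader never sees a half-written sector:

```python
# Illustrative model of how the BTT provides sector write atomicity.
# Not the kernel implementation: new data is written to a free physical
# sector first, and only then is the logical-to-physical map updated.

class ToyBTT:
    def __init__(self, n_logical, n_physical):
        self.data = [b""] * n_physical
        self.map = list(range(n_logical))       # logical -> physical sector
        self.free = list(range(n_logical, n_physical))

    def write_sector(self, logical, payload):
        target = self.free.pop(0)     # 1. allocate a fresh physical sector
        self.data[target] = payload   # 2. fully write the new data there
        old = self.map[logical]
        self.map[logical] = target    # 3. atomic map update: data now visible
        self.free.append(old)         # 4. recycle the old sector

    def read_sector(self, logical):
        return self.data[self.map[logical]]

btt = ToyBTT(n_logical=4, n_physical=5)
btt.write_sector(0, b"old")
btt.write_sector(0, b"new")
print(btt.read_sector(0))  # b'new' -- a crash before step 3 would leave b'old'
```

A power failure before the map update (step 3) leaves the old sector mapped and intact; a failure after it leaves the new sector fully written, so no "torn" state is ever visible.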
This BTT-enabled PMEM namespace can be formatted and used with a file system
just like any other standard block device. It cannot be used with DAX.
However, mmap
mappings for files on this block device
will use the page cache.
In both these examples, space from all the NVDIMMs is combined into a single volume. Just as with a non-redundant disk array, this means that if any individual NVDIMM suffers an error, the contents of the entire volume could be lost. The more NVDIMMs are included in the volume, the higher the chance of such an error.
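A back-of-the-envelope calculation makes this concrete. With a purely hypothetical per-module failure probability p, and assuming modules fail independently, the probability of losing a volume striped across n NVDIMMs is 1 - (1 - p)^n:

```python
# Illustrative only: p is a hypothetical per-module failure probability,
# not a figure for any real NVDIMM product.
p = 0.01
for n in (1, 2, 3):
    p_volume_loss = 1 - (1 - p) ** n   # the volume is lost if any module fails
    print(f"{n} NVDIMM(s): {p_volume_loss:.4f}")
```

The loss probability grows almost linearly with the number of interleaved modules, which is the trade-off against the larger, faster combined volume.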
25.5.3.1 Removing the PMEM Volume #
As in the previous example, before re-allocating the space, we must first remove the volume and the namespace:
root #
ndctl disable-namespace namespace3.0
disabled 1 namespace
root #
ndctl destroy-namespace namespace3.0
destroyed 1 namespace
25.5.4 Creating BLK Namespaces #
In this example, we will create three separate BLK devices: one per NVDIMM.
One advantage of this approach is that if any individual NVDIMM fails, the other volumes will be unaffected.
The commands must be repeated for each namespace.
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "uuid":"fed466bd-90f6-460b-ac81-ad1f08716602",
  "sector_size":4096,
  "blockdev":"ndblk1.0s"
}
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace0.0",
  "mode":"sector",
  "uuid":"12a29b6f-b951-4d08-8dbc-8dea1a2bb32d",
  "sector_size":4096,
  "blockdev":"ndblk0.0s"
}
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace2.0",
  "mode":"sector",
  "uuid":"7c84dab5-cc08-452a-b18d-53e430bf8833",
  "sector_size":4096,
  "blockdev":"ndblk2.0s"
}
Next, we can verify that the new devices exist:
root #
fdisk -l /dev/ndblk*
Disk /dev/ndblk0.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk1.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk2.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
The block devices generated for BLK namespaces are named
/dev/ndblkX.Y
where X is the parent region number and
Y is a unique namespace number within that
region. So, /dev/ndblk2.0s
is child namespace number 0
of region 2.
As in the previous example, the trailing s means that this namespace is configured to use the BTT, that is, for sector-based access. Because they are accessed via a block window, programs cannot use DAX, but accesses will be cached.
As ever, these devices must all be formatted and mounted before they can be used.
25.6 Troubleshooting #
Persistent memory is more durable than SSD storage, but it can wear out. If an NVDIMM fails, it is necessary to isolate the individual module that developed a fault so that the remaining data can be recovered and the hardware replaced. Three pieces of information must be found:
Which NVDIMM module has failed: the physical location of the defective module.
Which namespace (/dev/pmemX) now contains bad blocks.
What other namespaces or regions also use that physical module.
After the faulty module has been determined along with whatever namespaces and regions use it, then the data in other, unaffected namespaces can be backed up, the server shut down and the NVDIMM replaced.
25.6.1 Locating a Failed Module #
A set of NVDIMMs are located in DIMM slots on the motherboard of the server.
The platform (the combination of server hardware and firmware) allocates the
memory on these NVDIMMs to one or more regions, such as region0
.
Then within those regions, the operating system defines namespaces, for
example /dev/pmem1
or /dev/dax0
.
As an example, consider a system with three regions. One is a PMEM region, composed of part of the space from three NVDIMMs, interleaved. The remaining space on two of the NVDIMMs has been configured as two additional BLK regions. In each of these, a namespace has been created.
In this example, part of region0 has been damaged or become defective.
You must:
Identify which NVDIMM module(s) contain the affected region.
This is particularly important if the region is interleaved across more than one NVDIMM.
Back up the contents of any other namespaces on the affected NVDIMM.
In this example, you must back up the contents of /dev/pmem2s.
Identify the relationship between the namespaces and the physical position of the NVDIMM (in which motherboard memory slot it is located).
The server must be shut down, its cover removed, and the defective modules found, removed and replaced.
25.6.2 Testing Persistent Memory #
For testing, the nfit_test
kernel module is required.
The testing procedure is described in detail on the GitHub page for the
ndctl
command, in steps 1-4 of the section
Unit test
. See Section 25.7, “For More Information” at
the end of this chapter.
Execute the ndctl command with the parameters list -RM. This shows the list of bad blocks.
tux >
sudo ndctl list -RM
:
{
  "dev":"region5",
  "size":33554432,
  "available_size":33554432,
  "type":"pmem",
  "iset_id":4676476994879183020,
  "badblock_count":8,
  "badblocks":[
    {
      "offset":32768,
      "length":8,
      "dimms":[
        "nmem1"
      ]
    }
  ]
},
:
The specific NVDIMM is identified in the "dimms" field (here, nmem1).
Execute the ndctl command with the parameters list -Du. This shows the handle of the DIMM.
tux >
sudo ndctl list -Du
{
  "dev":"nmem1",
  "id":"cdab-0a-07e0-feffffff",
  "handle":"0x1",
  "phys_id":"0x1"
},
:
:
The "handle" field is the handle of the NVDIMM.
Execute the ndctl command with the parameters list -R -d DIMM name.
tux >
sudo ndctl list -R -d nmem1
[
  {
    "dev":"region5",
    "size":33554432,
    "available_size":33554432,
    "type":"pmem",
    "iset_id":4676476994879183020,
    "badblock_count":8
  },
:
:
25.7 For More Information #
More about this topic can be found in the following list:
Contains instructions for configuring NVDIMM systems, information about testing, and links to specifications related to NVDIMM enabling. This site is developing as NVDIMM support in Linux is developing.
Information about configuring, using and programming systems with non-volatile memory under Linux and other operating systems. Covers the NVM Library (NVML), which aims to provide useful APIs for programming with persistent memory in userspace.
LIBNVDIMM: Non-Volatile Devices
Aimed at kernel developers, this is part of the Documentation folder in the current Linux kernel tree. It talks about the different kernel modules involved in NVDIMM enablement, lays out some technical details of the kernel implementation, and talks about the
sysfs
interface to the kernel that is used by the ndctl tool.
Utility library for managing the libnvdimm subsystem in the Linux kernel. Also contains userspace libraries, as well as unit tests and documentation.