25 Persistent Memory #
This chapter contains additional information about using SUSE Linux Enterprise Server with non-volatile main memory, also known as Persistent Memory, comprising one or more NVDIMMs.
25.1 Introduction #
Persistent memory is a new type of computer storage, combining speeds approaching those of normal dynamic RAM (DRAM) along with RAM's byte-by-byte addressability, plus the permanence of solid-state disks (SSDs).
Like conventional RAM, it is installed directly into motherboard memory slots. As such, it is supplied in the same physical form factor as RAM—as DIMMs. These are known as NVDIMMs: non-volatile dual inline memory modules.
Unlike RAM, though, persistent memory is also similar to flash-based SSDs in several ways. Both are based on forms of solid-state memory circuitry, and both provide non-volatile storage: their contents are retained when the system is powered off or restarted. For both media, writing data is slower than reading it, and both support a limited number of rewrite cycles. Finally, also like SSDs, sector-level access to persistent memory is possible if that is more suitable for a particular application.
Different models use different forms of electronic storage medium, such as Intel 3D XPoint, or a combination of NAND-flash and DRAM. New forms of non-volatile RAM are also in development. This means that different vendors and models of NVDIMM offer different performance and durability characteristics.
Because the storage technologies involved are in an early stage of development, different vendors' hardware may impose different limitations. Thus, the following statements are generalizations.
Persistent memory is up to ten times slower than DRAM, but around a thousand times faster than flash storage. It can be rewritten on a byte-by-byte basis rather than flash memory's whole-sector erase-and-rewrite process. Finally, while rewrite cycles are limited, most forms of persistent memory can handle millions of rewrites, compared to the thousands of cycles of flash storage.
This has two important consequences:
It is not possible with current technology to run a system with only persistent memory and thus achieve completely non-volatile main memory. You must use a mixture of both conventional RAM and NVDIMMs. The operating system and applications will execute in conventional RAM, with the NVDIMMs providing very fast supplementary storage.
The performance characteristics of different vendors' persistent memory mean that it may be necessary for programmers to be aware of the hardware specifications of the NVDIMMs in a particular server, including how many NVDIMMs there are and in which memory slots they are fitted. This will obviously impact hypervisor use, migration of software between different host machines, and so on.
This new storage subsystem is defined in version 6 of the ACPI standard.
However, libnvdimm supports pre-standard NVDIMMs, and they can be used in the same way.
25.2 Terms #
- Region
A region is a block of persistent memory that can be divided up into one or more namespaces. You cannot access the persistent memory of a region without first allocating it to a namespace.
- Namespace
A single contiguously-addressed range of non-volatile storage, comparable to NVM Express SSD namespaces, or to SCSI Logical Units (LUNs). Namespaces appear in the server's /dev directory as separate block devices. Depending on the method of access required, namespaces can either amalgamate storage from multiple NVDIMMs into larger volumes, or allow it to be partitioned into smaller volumes.
- Mode
Each namespace has a mode that defines which NVDIMM features are enabled for that namespace. Sibling namespaces of the same parent region will always have the same type, but might be configured to have different modes. Namespace modes include:
- raw
A memory disk. Does not support DAX. Compatible with other operating systems.
- sector
For legacy file systems which do not checksum metadata. Suitable for small boot volumes. Compatible with other operating systems.
- fsdax
File system-DAX mode. Default if no other mode is specified. Creates a block device (/dev/pmemX [.Y]) which supports DAX for ext4 or XFS.
- devdax
Device-DAX mode. Creates a single-character device file (/dev/daxX.Y). Does not require file system creation.
- Type
Each namespace and region has a type that defines the way in which the persistent memory associated with that namespace or region can be accessed. A namespace always has the same type as its parent region. There are two different types: Persistent Memory and Block Mode.
- Persistent Memory (PMEM)
PMEM storage offers byte-level access, just like RAM. This enables Direct Access (DAX), meaning that accessing the memory bypasses the kernel's page cache and goes direct to the medium. Additionally, using PMEM, a single namespace can include multiple interleaved NVDIMMs, allowing them all to be accessed as a single device.
- Block Mode (BLK)
BLK access is in sectors, usually of 512 bytes, through a defined access window, the aperture. This behavior is more like a traditional disk drive. This also means that both reads and writes are cached by the kernel. With BLK access, each NVDIMM is accessed as a separate namespace.
Some devices support both PMEM and BLK modes. Additionally, some allow the storage to be split into separate namespaces, so that some can be accessed using PMEM and some using BLK.
Apart from devdax namespaces, all other types must be formatted with a file system such as ext2, ext4 or XFS, just as with a conventional drive.
- Direct Access (DAX)
DAX allows persistent memory to be directly mapped into a process's address space, for example using the mmap system call. This is suitable for directly accessing large amounts of PMEM without using any additional RAM, for registering blocks of PMEM for RDMA, or for directly assigning it to virtual machines.
- DIMM Physical Address (DPA)
A memory address as an offset into a single DIMM's memory; that is, starting from zero as the lowest addressable byte on that DIMM.
- Label
Metadata stored on the NVDIMM, such as namespace definitions. This can be accessed using DSMs.
- Device-specific method (DSM)
ACPI method to access the firmware on an NVDIMM.
25.3 Use Cases #
25.3.1 PMEM with DAX #
It is important to note that this form of memory access is not transactional. In the event of a power outage or other system failure, data may not be completely written into storage. PMEM storage is only suitable if the application can handle the situation of partially-written data.
25.3.1.1 Applications That Benefit from Large Amounts of Byte-Addressable Storage #
If the server will host an application that can directly use large amounts
of fast storage on a byte-by-byte basis, the programmer can use the mmap
system call to place blocks of persistent memory directly into the
application's address space, without using any additional system RAM.
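This byte-by-byte style of access can be sketched in Python with the mmap module. The example below uses an ordinary temporary file so that it runs anywhere; on an fsdax file system mounted with the dax option, the same map, store and flush operations act directly on persistent memory, with no page cache in between.

```python
import mmap
import os
import tempfile

# Sketch of byte-addressable access through a memory mapping. A temporary
# file stands in for a file on a DAX-mounted file system; the calls are
# the same in both cases.
fd, path = tempfile.mkstemp()
try:
    os.ftruncate(fd, 4096)          # size the backing file to one page
    with mmap.mmap(fd, 4096) as mm:
        mm[0:5] = b"hello"          # byte-level store, no sector rewrite
        mm.flush()                  # msync: ensure data reaches the medium
        data = bytes(mm[0:5])       # byte-level load
finally:
    os.close(fd)
    os.remove(path)

print(data)  # b'hello'
```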
25.3.1.2 Avoiding Use of the Kernel Page Cache #
You may wish to conserve RAM for your applications rather than spending it on the kernel's page cache. For instance, non-volatile memory could be dedicated to holding virtual machine (VM) images. As these would not be cached, this would reduce the cache usage on the host, allowing more VMs per host.
25.3.2 PMEM with BTT #
This is useful when you want to use the persistent memory on a set of NVDIMMs as a disk-like pool of very fast storage.
To applications, such devices just appear as very fast SSDs and can be used like any other storage device. For example, LVM can be layered on top of the non-volatile storage and will work as normal.
The advantage of BTT is that sector write atomicity is guaranteed, so even sophisticated applications that depend on data integrity will keep working. Media error reporting works through standard error-reporting channels.
25.3.3 BLK storage #
Although BLK storage is more robust against single-device failure, it requires additional management, as each NVDIMM appears as a separate device. Thus, PMEM with BTT is generally preferred.
BLK storage is deprecated and is not supported in later versions of SUSE Linux Enterprise Server.
25.4 Tools for Managing Persistent Memory #
To manage persistent memory, it is necessary to install the
ndctl
package. This also installs the
libndctl
package, which provides a set of user-space
libraries to configure NVDIMMs.
These tools work via the libnvdimm
library, which
supports three types of NVDIMMs:
PMEM
BLK
Simultaneous PMEM and BLK
The ndctl
utility has a helpful set of
man
pages, accessible with the command:
ndctl help subcommand
To see a list of available subcommands, use:
ndctl --list-cmds
The available subcommands include:
- version
Displays the current version of the NVDIMM support tools.
- enable-namespace
Makes the specified namespace available for use.
- disable-namespace
Prevents the specified namespace from being used.
- create-namespace
Creates a new namespace from the specified storage devices.
- destroy-namespace
Removes the specified namespace.
- enable-region
Makes the specified region available for use.
- disable-region
Prevents the specified region from being used.
- zero-labels
Erases the metadata from a device.
- read-labels
Retrieves the metadata of the specified device.
- list
Displays available devices.
- help
Displays information about using the tool.
25.5 Setting Up Persistent Memory #
25.5.1 Viewing Available NVDIMM Storage #
The ndctl
list
command can be used to
list all available NVDIMMs in a system.
In the following example, the system has three NVDIMMs which are in a single, triple-channel interleaved set.
root #
ndctl list --dimms
[
  {
    "dev":"nmem2",
    "id":"8089-00-0000-12325476"
  },
  {
    "dev":"nmem1",
    "id":"8089-00-0000-11325476"
  },
  {
    "dev":"nmem0",
    "id":"8089-00-0000-10325476"
  }
]
With a different parameter, ndctl
list
will also list the available regions.
Regions may not appear in numerical order.
Note that although there are only three NVDIMMs, they appear as four regions.
root #
ndctl list --regions
[
  {
    "dev":"region1",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region3",
    "size":202937204736,
    "available_size":202937204736,
    "type":"pmem",
    "iset_id":5903239628671731251
  },
  {
    "dev":"region0",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  },
  {
    "dev":"region2",
    "size":68182605824,
    "available_size":68182605824,
    "type":"blk"
  }
]
The space is available in two different forms: either as three separate 64 GB regions of type BLK, or as one combined 189 GB region of type PMEM which presents all the space on the three interleaved NVDIMMs as a single volume.
Note that the displayed value for available_size
is the
same as that for size
. This means that none of the space
has been allocated yet.
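The same check can be scripted. The following Python sketch parses a sample of the ndctl list --regions output shown above (in a real script, the JSON would be captured from running ndctl itself) and lists the regions whose space is entirely unallocated:

```python
import json

# Sample output in the form returned by `ndctl list --regions` on the
# example system above; a real script would capture this by running ndctl.
regions_json = '''
[
  { "dev":"region1", "size":68182605824, "available_size":68182605824, "type":"blk" },
  { "dev":"region3", "size":202937204736, "available_size":202937204736, "type":"pmem",
    "iset_id":5903239628671731251 },
  { "dev":"region0", "size":68182605824, "available_size":68182605824, "type":"blk" },
  { "dev":"region2", "size":68182605824, "available_size":68182605824, "type":"blk" }
]
'''

regions = json.loads(regions_json)
# A region whose available_size equals its size has no namespaces yet.
unallocated = sorted(r["dev"] for r in regions
                     if r["available_size"] == r["size"])
print(unallocated)  # ['region0', 'region1', 'region2', 'region3']
```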
25.5.2 Configuring the Storage as a Single PMEM Namespace with DAX #
For the first example, we will configure our three NVDIMMs into a single PMEM namespace with Direct Access (DAX).
The first step is to create a new namespace.
root #
ndctl create-namespace --type=pmem --mode=fsdax --map=memory
{
  "dev":"namespace3.0",
  "mode":"memory",
  "size":199764213760,
  "uuid":"dc8ebb84-c564-4248-9e8d-e18543c39b69",
  "blockdev":"pmem3"
}
This creates a block device /dev/pmem3, which supports DAX. The 3 in the device name is inherited from the parent region number, in this case region3.
The --map=memory option sets aside part of the PMEM storage space on the NVDIMMs so that it can be used to allocate internal kernel data structures called struct pages. This allows the new PMEM namespace to be used with features such as O_DIRECT I/O and RDMA.
The reservation of some persistent memory for kernel data structures is why the resulting PMEM namespace has a smaller capacity than the parent PMEM region.
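The size of this reservation can be estimated. Assuming roughly 64 bytes of struct page per 4 KiB page (an assumption for illustration; the exact structure size is kernel-dependent), the numbers from this example work out as follows:

```python
# Estimate the --map=memory reservation for the example region above.
# The 64-bytes-per-page figure is an assumption, not a guarantee.
region_size = 202937204736      # parent PMEM region, in bytes
namespace_size = 199764213760   # resulting fsdax namespace, in bytes

pages = region_size // 4096
estimated_overhead = pages * 64                   # ~64 B of struct page per page
actual_overhead = region_size - namespace_size    # what was actually reserved

print(estimated_overhead)  # 3170893824
print(actual_overhead)     # 3172990976, about 1.6 % of the region
```

The two figures agree to within a fraction of a percent, which supports the rule of thumb but should not be relied on for exact capacity planning.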
Next, we verify that the new block device is available to the operating system:
root #
fdisk -l /dev/pmem3
Disk /dev/pmem3: 186 GiB, 199764213760 bytes, 390164480 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Before it can be used, like any other drive, it must be formatted. In this example, we format it with XFS:
root #
mkfs.xfs /dev/pmem3
meta-data=/dev/pmem3        isize=256    agcount=4, agsize=12192640 blks
         =                  sectsz=4096  attr=2, projid32bit=1
         =                  crc=0        finobt=0, sparse=0
data     =                  bsize=4096   blocks=48770560, imaxpct=25
         =                  sunit=0      swidth=0 blks
naming   =version 2         bsize=4096   ascii-ci=0 ftype=1
log      =internal log      bsize=4096   blocks=23813, version=2
         =                  sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none              extsz=4096   blocks=0, rtextents=0
Next, we can mount the new drive onto a directory:
root #
mount -o dax /dev/pmem3 /mnt/pmem3
Then we can verify that we now have a DAX-capable device:
root #
mount | grep dax
/dev/pmem3 on /mnt/pmem3 type xfs (rw,relatime,attr2,dax,inode64,noquota)
The result is that we now have a PMEM namespace formatted with the XFS file system and mounted with DAX.
Any mmap()
calls to files in that file system will
return virtual addresses that directly map to the persistent memory on our
NVDIMMs, completely bypassing the page cache.
Any fsync
or msync
calls on files in
that file system will still ensure that modified data has been fully
written to the NVDIMMs. These calls flush the processor cache lines
associated with any pages that have been modified in userspace via
mmap
mappings.
25.5.2.1 Removing a Namespace #
Before creating any other type of volume that uses the same storage, we must unmount and then remove this PMEM volume.
First, unmount it:
root #
umount /mnt/pmem3
Then disable the namespace:
root #
ndctl disable-namespace namespace3.0
disabled 1 namespace
Then delete it:
root #
ndctl destroy-namespace namespace3.0
destroyed 1 namespace
25.5.3 Creating a PMEM Namespace with BTT #
In the next example, we create a PMEM namespace that uses BTT.
root #
ndctl create-namespace --type=pmem --mode=sector
{
  "dev":"namespace3.0",
  "mode":"sector",
  "uuid":"51ab652d-7f20-44ea-b51d-5670454f8b9b",
  "sector_size":4096,
  "blockdev":"pmem3s"
}
Next, verify that the new device is present:
root #
fdisk -l /dev/pmem3s
Disk /dev/pmem3s: 188.8 GiB, 202738135040 bytes, 49496615 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Like the DAX-capable PMEM namespace we previously configured, this BTT-capable PMEM namespace consumes all the available storage on the NVDIMMs.
The trailing s in the device name (/dev/pmem3s) stands for sector and can be used to easily distinguish PMEM and BLK namespaces that are configured to use the BTT.
The volume can be formatted and mounted as in the previous example.
The PMEM namespace shown here cannot use DAX. Instead it uses the BTT to provide sector write atomicity. On each sector write through the PMEM block driver, the BTT will allocate a new sector to receive the new data. The BTT atomically updates its internal mapping structures after the new data is fully written so the newly written data will be available to applications. If the power fails at any point during this process, the write will be completely lost and the application will have access to its old data, still intact. This prevents the condition known as "torn sectors".
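The allocate-write-remap sequence described above can be illustrated with a small Python model. This is only a conceptual sketch (the ToyBTT class and its names are invented for illustration, and the kernel's real BTT is considerably more involved), but it shows why a reader never sees a half-written sector:

```python
# Illustrative model of how the BTT provides sector write atomicity.
# Not the kernel implementation: new data is written to a free physical
# sector first, and only then is the logical-to-physical map updated.

class ToyBTT:
    def __init__(self, n_logical, n_physical):
        self.data = [b""] * n_physical
        self.map = list(range(n_logical))       # logical -> physical sector
        self.free = list(range(n_logical, n_physical))

    def write_sector(self, logical, payload):
        target = self.free.pop(0)     # 1. allocate a fresh physical sector
        self.data[target] = payload   # 2. fully write the new data there
        old = self.map[logical]
        self.map[logical] = target    # 3. atomic map update: data now visible
        self.free.append(old)         # 4. recycle the old sector

    def read_sector(self, logical):
        return self.data[self.map[logical]]

btt = ToyBTT(n_logical=4, n_physical=5)
btt.write_sector(0, b"old")
btt.write_sector(0, b"new")
print(btt.read_sector(0))  # b'new' -- a crash before step 3 would leave b'old'
```

A power failure before the map update (step 3) leaves the old sector mapped and intact; a failure after it leaves the new sector fully written, so no "torn" state is ever visible.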
This BTT-enabled PMEM namespace can be formatted and used with a file system
just like any other standard block device. It cannot be used with DAX.
However, mmap
mappings for files on this block device
will use the page cache.
In both these examples, space from all the NVDIMMs is combined into a single volume. Just as with a non-redundant disk array, this means that if any individual NVDIMM suffers an error, the contents of the entire volume could be lost. The more NVDIMMs are included in the volume, the higher the chance of such an error.
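A back-of-the-envelope calculation makes this concrete. With a purely hypothetical per-module failure probability p, and assuming modules fail independently, the probability of losing a volume striped across n NVDIMMs is 1 - (1 - p)^n:

```python
# Illustrative only: p is a hypothetical per-module failure probability,
# not a figure for any real NVDIMM product.
p = 0.01
for n in (1, 2, 3):
    p_volume_loss = 1 - (1 - p) ** n   # the volume is lost if any module fails
    print(f"{n} NVDIMM(s): {p_volume_loss:.4f}")
```

The loss probability grows almost linearly with the number of interleaved modules, which is the trade-off against the larger, faster combined volume.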
25.5.3.1 Removing the PMEM Volume #
As in the previous example, before re-allocating the space, we must first remove the volume and the namespace:
root #
ndctl disable-namespace namespace3.0
disabled 1 namespace
root #
ndctl destroy-namespace namespace3.0
destroyed 1 namespace
25.5.4 Creating BLK Namespaces #
In this example, we will create three separate BLK devices: one per NVDIMM.
One advantage of this approach is that if any individual NVDIMM fails, the other volumes will be unaffected.
The commands must be repeated for each namespace.
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace1.0",
  "mode":"sector",
  "uuid":"fed466bd-90f6-460b-ac81-ad1f08716602",
  "sector_size":4096,
  "blockdev":"ndblk1.0s"
}
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace0.0",
  "mode":"sector",
  "uuid":"12a29b6f-b951-4d08-8dbc-8dea1a2bb32d",
  "sector_size":4096,
  "blockdev":"ndblk0.0s"
}
root #
ndctl create-namespace --type=blk --mode=sector
{
  "dev":"namespace2.0",
  "mode":"sector",
  "uuid":"7c84dab5-cc08-452a-b18d-53e430bf8833",
  "sector_size":4096,
  "blockdev":"ndblk2.0s"
}
Next, we can verify that the new devices exist:
root #
fdisk -l /dev/ndblk*
Disk /dev/ndblk0.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk1.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk /dev/ndblk2.0s: 63.4 GiB, 68115001344 bytes, 16629639 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
The block devices generated for BLK namespaces are named
/dev/ndblkX.Y
where X is the parent region number and
Y is a unique namespace number within that
region. So, /dev/ndblk2.0s
is child namespace number 0
of region 2.
As in the previous example, the trailing s means that this namespace is configured to use the BTT, that is, for sector-based access. Because they are accessed via a block window, programs cannot use DAX, but accesses will be cached.
As ever, these devices must all be formatted and mounted before they can be used.
25.6 Troubleshooting #
Persistent memory is more durable than SSD storage, but it can wear out. If an NVDIMM fails, it is necessary to isolate the individual module that developed a fault so that the remaining data can be recovered and the hardware replaced. Three pieces of information must be found:
Which NVDIMM module has failed: the physical location of the defective module.
Which namespace (/dev/pmemX) now contains bad blocks.
What other namespaces or regions also use that physical module.
After the faulty module has been determined along with whatever namespaces and regions use it, then the data in other, unaffected namespaces can be backed up, the server shut down and the NVDIMM replaced.
25.6.1 Locating a Failed Module #
A set of NVDIMMs are located in DIMM slots on the motherboard of the server.
The platform (the combination of server hardware and firmware) allocates the
memory on these NVDIMMs to one or more regions, such as region0
.
Then within those regions, the operating system defines namespaces, for
example /dev/pmem1
or /dev/dax0
.
As an example, consider a system with three regions. One is a PMEM region, composed of part of the space from three NVDIMMs, interleaved. The remaining space on two of the NVDIMMs has been configured as two additional BLK regions. In each of these, a namespace has been created.
In this example, part of region0 has been damaged or become defective.
You must:
Identify which NVDIMM module(s) contain the affected region.
This is particularly important if the region is interleaved across more than one NVDIMM.
Back up the contents of any other namespaces on the affected NVDIMM.
In this example, you must back up the contents of /dev/pmem2s.
Identify the relationship between the namespaces and the physical position of the NVDIMM (in which motherboard memory slot it is located).
The server must be shut down, its cover removed, and the defective modules found, removed and replaced.
25.6.2 Testing Persistent Memory #
For testing, the nfit_test
kernel module is required.
The testing procedure is described in detail on the GitHub page for the
ndctl
command, in steps 1-4 of the section
Unit test
. See Section 25.7, “For More Information” at
the end of this chapter.
Execute the ndctl command with the parameters list -RM. This shows the list of bad blocks.
tux >
sudo ndctl list -RM
:
{
  "dev":"region5",
  "size":33554432,
  "available_size":33554432,
  "type":"pmem",
  "iset_id":4676476994879183020,
  "badblock_count":8,
  "badblocks":[
    {
      "offset":32768,
      "length":8,
      "dimms":[
        "nmem1"
      ]
    }
  ]
},
:
The specific NVDIMM is identified in the "dimms" field (here, nmem1).
Execute the ndctl command with the parameters list -Du. This shows the handle of the DIMM.
tux >
sudo ndctl list -Du
{
  "dev":"nmem1",
  "id":"cdab-0a-07e0-feffffff",
  "handle":"0x1",
  "phys_id":"0x1"
},
:
:
The "handle" field is the handle of the NVDIMM.
Execute the ndctl command with the parameters list -R -d DIMM name.
tux >
sudo ndctl list -R -d nmem1
[
  {
    "dev":"region5",
    "size":33554432,
    "available_size":33554432,
    "type":"pmem",
    "iset_id":4676476994879183020,
    "badblock_count":8
  },
:
:
25.7 For More Information #
More about this topic can be found in the following list:
Contains instructions for configuring NVDIMM systems, information about testing, and links to specifications related to NVDIMM enabling. This site is developing as NVDIMM support in Linux is developing.
Information about configuring, using and programming systems with non-volatile memory under Linux and other operating systems. Covers the NVM Library (NVML), which aims to provide useful APIs for programming with persistent memory in userspace.
LIBNVDIMM: Non-Volatile Devices
Aimed at kernel developers, this is part of the Documentation folder in the current Linux kernel tree. It talks about the different kernel modules involved in NVDIMM enablement, lays out some technical details of the kernel implementation, and talks about the
sysfs
interface to the kernel that is used by the ndctl tool.
Utility library for managing the libnvdimm subsystem in the Linux kernel. Also contains userspace libraries, as well as unit tests and documentation.