Linux – Clarifying NVMe APST Problems


I have experienced an issue nearly identical to one described in the askubuntu community.

Like the user who posted that issue, I have a Kingston NVMe disk, and as with that user,
my issue was resolved by adding the following kernel option in the GRUB menu: nvme_core.default_ps_max_latency_us=0.

The user's stated resolution begins as follows:

The problem was of a SSD features, the Autonomous Power State Transitions(APST) was causing the freezes. To mitigate it, until they will release the fix, include the line nvme_core.default_ps_max_latency_us=0
in the GRUB_CMDLINE_LINUX_DEFAULT options.

Although helpful, this comment leaves several questions open, including the following:

  1. What and where is the specific flaw causing the problem?
  2. What does the workaround change to prevent the presentation of the flaw?
  3. What functionality or other desired effect is lost due to such a workaround?
  4. And especially, what needs to be fixed (the kernel, the storage-media firmware, the system firmware, i.e. UEFI/BIOS, or some other component) to provide a proper resolution?

Any comments that help resolve all or part of this confusion are welcome.

Best Answer

The code comment in drivers/nvme/host/core.c in the Linux kernel source seems to explain it best:

static int nvme_configure_apst(struct nvme_ctrl *ctrl)
{
    /*
     * APST (Autonomous Power State Transition) lets us program a
     * table of power state transitions that the controller will
     * perform automatically.  We configure it with a simple
     * heuristic: we are willing to spend at most 2% of the time
     * transitioning between power states.  Therefore, when running
     * in any given state, we will enter the next lower-power
     * non-operational state after waiting 50 * (enlat + exlat)
     * microseconds, as long as that state's exit latency is under
     * the requested maximum latency.
     *
     * We will not autonomously enter any non-operational state for
     * which the total latency exceeds ps_max_latency_us.  Users
     * can set ps_max_latency_us to zero to turn off APST.
     */

So, APST is a feature that allows the NVMe controller (within the NVMe SSD) to switch between power management states autonomously, following configurable rules. The NVMe controller specifies how many microseconds it needs to enter and exit each power-save state; the kernel uses this information to configure the state transition rules within the NVMe controller.
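The heuristic described in that comment can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the kernel's actual code: the struct and function names are hypothetical, and the handling of a zero limit simply follows the comment's statement that zero turns APST off.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified description of one non-operational power state. */
struct ps_desc {
    uint32_t enlat_us; /* entry latency reported by the controller */
    uint32_t exlat_us; /* exit latency reported by the controller */
};

/*
 * Idle time to wait before entering the state: 50 * (enlat + exlat),
 * so that at most about 2% of the time is spent transitioning.
 */
uint64_t apst_idle_wait_us(const struct ps_desc *ps)
{
    return 50ULL * ((uint64_t)ps->enlat_us + ps->exlat_us);
}

/*
 * A state may be entered autonomously only if its total latency does
 * not exceed the limit. A limit of 0 is treated as "APST off"
 * (assumption based on the kernel comment, not on the kernel code).
 */
bool apst_state_allowed(const struct ps_desc *ps, uint64_t max_latency_us)
{
    if (max_latency_us == 0)
        return false;
    return (uint64_t)ps->enlat_us + ps->exlat_us <= max_latency_us;
}
```

For example, a state with an entry latency of 2000 microseconds and an exit latency of 3000 microseconds would be entered after 250000 microseconds (0.25 s) of idle time, provided the latency limit is at least 5000 microseconds.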

  1. What and where is the specific flaw causing the problem?

It looks like this particular Kingston NVMe SSD is either way too optimistic in its wake-up time estimates, or fails to wake up at all (without fully resetting the controller) after entering a deep enough power saving state. When given the permission to use APST, it apparently goes into some power saving state and then fails to return to operational state within the specified time, which makes the kernel unhappy.

  2. What does the workaround change to prevent the presentation of the flaw?

It tells the kernel that the maximum allowed latency for waking up from APST power management states is exactly 0 microseconds. No power-saving state can meet that limit, so the APST feature is disabled.
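To check whether the workaround is actually in effect on a running system, one option is to read the parameter back. The sketch below assumes the value is exposed at /sys/module/nvme_core/parameters/default_ps_max_latency_us, which is where module parameters normally appear; the path and the helper names are assumptions, not something stated in the original answer.

```c
#include <stdio.h>

/* Assumed sysfs location of nvme_core.default_ps_max_latency_us. */
#define APST_PARAM_PATH \
    "/sys/module/nvme_core/parameters/default_ps_max_latency_us"

/*
 * Read an unsigned integer from a one-line text file such as a sysfs
 * module parameter. Returns 0 on success, -1 on failure.
 */
int read_ulong_file(const char *path, unsigned long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%lu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Returns 1 if the limit is 0 (APST disabled), 0 if not, -1 on error. */
int apst_disabled(const char *path)
{
    unsigned long limit;

    if (read_ulong_file(path, &limit) != 0)
        return -1;
    return limit == 0;
}
```

Calling apst_disabled(APST_PARAM_PATH) after rebooting with the kernel option should return 1 if the option was picked up.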

  3. What functionality or other desired effect is lost due to such a workaround?

If the NVMe controller's autonomous power management feature cannot be used, the controller will only be allowed to enter power-saving states when specifically requested by the kernel. This means the power savings most likely won't be as great as with APST in use.

  4. And especially, what needs to be fixed (the kernel, the storage-media firmware, the system firmware, i.e. UEFI/BIOS, or some other component) for users to experience a proper resolution?

The optimal fix would be for Kingston to provide an NVMe disk firmware update that either makes the APST power management work correctly, or at minimum, makes the drive not promise something it cannot deliver, i.e. not announce APST modes with overly optimistic transition times, and/or not announce at all any APST modes that will cause the controller to fail if used.

If it turns out the problem can be avoided by e.g. programming APST to avoid the deepest power-saving state completely, it might be possible to create a more specific kernel-level workaround. Many device drivers in the Linux kernel have "quirk tables" specifying workarounds for specific hardware models. In the case of NVMe, you can find one in drivers/nvme/host/pci.c in the Linux kernel source:

static const struct pci_device_id nvme_id_table[] = {
    { PCI_VDEVICE(INTEL, 0x0953),   /* Intel 750/P3500/P3600/P3700 */
        .driver_data = NVME_QUIRK_STRIPE_SIZE |
                NVME_QUIRK_DEALLOCATE_ZEROES, },
    { PCI_VDEVICE(INTEL, 0x0a53),   /* Intel P3520 */
        .driver_data = NVME_QUIRK_STRIPE_SIZE |
                NVME_QUIRK_DEALLOCATE_ZEROES, },
    { PCI_VDEVICE(INTEL, 0x0a54),   /* Intel P4500/P4600 */
        .driver_data = NVME_QUIRK_STRIPE_SIZE |
                NVME_QUIRK_DEALLOCATE_ZEROES, },
    { PCI_VDEVICE(INTEL, 0x0a55),   /* Dell Express Flash P4600 */
        .driver_data = NVME_QUIRK_STRIPE_SIZE |
                NVME_QUIRK_DEALLOCATE_ZEROES, },
    { PCI_VDEVICE(INTEL, 0xf1a5),   /* Intel 600P/P3100 */
        .driver_data = NVME_QUIRK_NO_DEEPEST_PS |
                NVME_QUIRK_MEDIUM_PRIO_SQ |
                NVME_QUIRK_NO_TEMP_THRESH_CHANGE |
                NVME_QUIRK_DISABLE_WRITE_ZEROES, },
[...]

Here the various NVME_QUIRK_ settings trigger various pieces of workaround code within the driver.

Note that there already exists a quirk setting named NVME_QUIRK_NO_DEEPEST_PS, which prevents state transitions to the deepest power management state. If the APST problem of your Kingston NVMe turns out to have the same workaround as already implemented for Intel 600P/P3100 and ADATA SX8200PNP, then all it would take is writing a new quirk table entry like this (replace the placeholders in <angle brackets> with appropriate values, which you can get with lspci -nn):

    { PCI_DEVICE(<PCI vendor ID>, <PCI product ID of the SSD>),   /* <specify make/model of SSD here> */
        .driver_data = NVME_QUIRK_NO_DEEPEST_PS, },

and recompiling the kernel with this modification.
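As a rough illustration of what such a table entry does, here is a minimal user-space sketch of a quirk lookup. It is not the kernel's implementation: the struct, the flag value and both function names are hypothetical stand-ins, and the assumption that the quirk lowers the deepest usable state by one is a simplification of the behaviour described above. Only the Intel vendor ID 0x8086 and the device ID 0xf1a5 are taken from real data (the latter from the excerpt).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag standing in for NVME_QUIRK_NO_DEEPEST_PS. */
#define QUIRK_NO_DEEPEST_PS (1UL << 0)

/* Hypothetical, simplified stand-in for struct pci_device_id. */
struct quirk_entry {
    uint16_t vendor;
    uint16_t device;
    unsigned long quirks;
};

static const struct quirk_entry quirk_table[] = {
    { 0x8086, 0xf1a5, QUIRK_NO_DEEPEST_PS }, /* Intel 600P/P3100 */
};

/* Return the quirk flags for a device, or 0 if it has no table entry. */
unsigned long lookup_quirks(uint16_t vendor, uint16_t device)
{
    size_t i;

    for (i = 0; i < sizeof(quirk_table) / sizeof(quirk_table[0]); i++) {
        if (quirk_table[i].vendor == vendor &&
            quirk_table[i].device == device)
            return quirk_table[i].quirks;
    }
    return 0;
}

/*
 * Deepest power state index that APST may use. npss is the index of
 * the deepest state the controller reports; with the quirk set, the
 * deepest state is skipped.
 */
int deepest_allowed_state(int npss, unsigned long quirks)
{
    if ((quirks & QUIRK_NO_DEEPEST_PS) && npss > 0)
        return npss - 1;
    return npss;
}
```

A device without a table entry gets no quirks and keeps all of its power states; a listed device loses only its deepest one, which is a much narrower restriction than disabling APST altogether.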

Obviously, someone who actually has this exact SSD model is needed to test this. If you happen to be familiar with C programming basics and how to compile custom kernels, this could be your chance to get your name onto the long list of Linux kernel contributors! If you are interested, you should probably read kernelnewbies.org for more details.

Kernel programming is not always deeply intricate: there are a lot of simple parts that just need a person with the right kind of hardware and some basic programming knowledge. I've submitted a few minor patches just like this.

If setting the NVME_QUIRK_NO_DEEPEST_PS turns out not to fix the problem, then implementing a new quirk might be needed. That could be more complicated, and might require some experimentation or ideally information from Kingston to find out what exactly needs to be done to avoid this problem, and perhaps discussion with the Linux NVMe driver maintainer on the best way to implement it.
