Ubuntu – WD (Sandisk) NVMe M.2 stick not quite working

Tags: 18.04, 18.10, nvme

To be clear, I expected trouble. The computer is an old HP Z820 (certainly no BIOS support for NVMe) with the latest 2018 BIOS update. The stick is a new(-ish?) Western Digital (Sandisk) model:

WD Black 500GB NVMe SSD – M.2 2280 – WDS500G2X0C

Mounted on a PCIe 3.0 x4 card:

Mailiya M.2 PCIe to PCIe 3.0 x4 Adapter

I am not trying to boot from NVMe, just to use it for storage. Linux does see the drive (via lsblk and lspci) and can read from it … but not write.

This is Ubuntu 18.04.2 LTS with the kernel version:

Linux brutus 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

(Also tested on 18.10.)

Pulled the Linux sources for this version, and for the current 5.0 Linux (from torvalds/linux on GitHub). There are substantial differences in drivers/nvme between the Ubuntu LTS kernel and current, with updates as recent(!) as yesterday (2019.03.16, per "cd drivers/nvme ; git log").

Like I said at the start, expecting trouble. 🙂

Should mention I am slightly familiar with Linux device drivers, having written one of moderate complexity.

Tried compiling the current Linux 5.0 sources, and "rmmod nvme ; insmod nvme" – which did not work (no surprise). Tried copying the 5.0 nvme driver into the 4.15 tree and compiling – which did not work (also no surprise, but hey, got to try).

The next exercise would be to boot the current Linux 5.0 kernel. But I might as well put this out in public, in case someone else is further along.

Reads seem to work, but slower than expected:

# hdparm -t --direct /dev/nvme0n1 

/dev/nvme0n1:
 Timing O_DIRECT disk reads: 4840 MB in  3.00 seconds = 1612.83 MB/sec

# dd bs=1M count=8192 if=/dev/nvme0n1 of=/dev/null
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB, 8.0 GiB) copied, 4.57285 s, 1.9 GB/s

Writes fail badly:

# dd bs=1M count=2 if=/dev/zero of=/dev/nvme0n1 
(hangs)
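
Incidentally, a write probe along these lines is less painful than a bare dd, since a stuck request comes back as an error instead of a hung shell (same device node assumed; this still overwrites the start of the disk):

```shell
# Sketch of a less hang-prone write probe.
# WARNING: overwrites the first 4 KiB of /dev/nvme0n1.
# oflag=direct bypasses the page cache, and timeout(1) turns a stuck
# request into exit status 124 instead of a shell that hangs forever.
timeout 10 dd bs=4k count=1 oflag=direct if=/dev/zero of=/dev/nvme0n1 \
  && echo "write ok" \
  || echo "write failed or timed out"
```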

From journalctl:

Mar 17 18:49:23 brutus kernel: nvme nvme0: async event result 00010300
Mar 17 18:49:23 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 0
Mar 17 18:49:23 brutus kernel: buffer_io_error: 118 callbacks suppressed
Mar 17 18:49:23 brutus kernel: Buffer I/O error on dev nvme0n1, logical block 0, lost async page write
[snip]
Mar 17 18:49:23 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 1024
Mar 17 18:49:23 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 3072

Poked around a bit with the "nvme" command line tool, but only guessing:

# nvme list -o json
{
  "Devices" : [
    {
      "DevicePath" : "/dev/nvme0n1",
      "Firmware" : "101140WD",
      "Index" : 0,
      "ModelNumber" : "WDS500G2X0C-00L350",
      "ProductName" : "Unknown Device",
      "SerialNumber" : "184570802442",
      "UsedBytes" : 500107862016,
      "MaximiumLBA" : 976773168,
      "PhysicalSize" : 500107862016,
      "SectorSize" : 512
    }
  ]
}

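Since the eventual culprit (see the answer below) involves power management, nvme-cli can also show what the controller advertises there. A couple of inspection commands (device paths assumed; note id-ctrl wants the controller node /dev/nvme0, not the namespace):

```shell
# Show the APST capability flag and the advertised power states
# ("ps 0" .. "ps N" lines, with entry/exit latencies):
sudo nvme id-ctrl /dev/nvme0 | grep -Ei 'apsta|^ps '

# Dump the current Autonomous Power State Transition table
# (feature 0x0c), human-readable:
sudo nvme get-feature -f 0x0c -H /dev/nvme0
```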
FYI – lspci output:

03:00.0 Non-Volatile memory controller: Sandisk Corp Device 5002 (prog-if 02 [NVM Express])
        Subsystem: Sandisk Corp Device 5002
        Physical Slot: 1
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 37
        NUMA node: 0
        Region 0: Memory at de500000 (64-bit, non-prefetchable) [size=16K]
        Region 4: Memory at de504000 (64-bit, non-prefetchable) [size=256]
        Capabilities: [80] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [b0] MSI-X: Enable+ Count=65 Masked-
                Vector table: BAR=0 offset=00002000
                PBA: BAR=4 offset=00000000
        Capabilities: [c0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 1024 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L0s <256ns, L1 <8us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [1b8 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [300 v1] #19
        Capabilities: [900 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
                          PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Kernel driver in use: nvme
        Kernel modules: nvme

Heh. Credit where due. 🙂

preston@brutus:~/sources/linux/drivers/nvme$ git log . | grep -i 'wdc.com\|@sandisk' | sed -e 's/^.*: //' | sort -uf
Adam Manzanares <adam.manzanares@wdc.com>
Bart Van Assche <bart.vanassche@sandisk.com>
Bart Van Assche <bart.vanassche@wdc.com>
Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Jeff Lien <jeff.lien@wdc.com>

Also tested with the current (2019.03.17) Linux kernel:

root@brutus:~# uname -a
Linux brutus 5.1.0-rc1 #1 SMP Mon Mar 18 01:03:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

root@brutus:~# pvcreate /dev/nvme0n1 
  /dev/nvme0n1: write failed after 0 of 4096 at 4096: Input/output error
  Failed to wipe new metadata area at the start of the /dev/nvme0n1
  Failed to add metadata area for new physical volume /dev/nvme0n1
  Failed to setup physical volume "/dev/nvme0n1".

From the journal:

Mar 18 02:05:10 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 8 flags 8801
Mar 18 02:09:06 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 8 flags 8801
Mar 18 02:09:36 brutus kernel: print_req_error: I/O error, dev nvme0n1, sector 8 flags 8801

So … not working in any version of Linux (yet), it seems.

Best Answer

I don't know whether you're still having these issues, but I'll at least post this in case others run into it.

I have this same drive and use it as my primary drive running 18.04. I've used the Windows firmware utility and haven't seen any updates to this point. I also tested the live environment for 19.04, which shows the same freeze-ups/failure to install I experienced with 18.04 and 18.10, so the issue seems to still be open.

The problem appears to be that the drive becomes unstable when it goes into low-power states, so the fix is to disable the low-power modes via a kernel boot parameter. I did this a few months back and have had zero problems on 18.04 since. This method should work on the newer versions (18.10/19.04) as well, but it's a shame that it hasn't been fixed yet.

In the GRUB boot menu, press e to edit the startup parameters. Add nvme_core.default_ps_max_latency_us=5500 at the end of the line containing quiet splash, then press Ctrl-X to boot; the installer should then detect the disk in the partitioning step.

After finishing the installation, hold Shift while powering on to enter GRUB again, add the same kernel parameter nvme_core.default_ps_max_latency_us=5500, and press Ctrl-X to boot. Ubuntu should now boot successfully. To make the setting permanent, edit /etc/default/grub, add nvme_core.default_ps_max_latency_us=5500 there as well, and run sudo update-grub; every subsequent boot will then include the parameter automatically, with no manual editing.
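
The /etc/default/grub step can be scripted along these lines (a sketch; assumes Ubuntu's stock GRUB_CMDLINE_LINUX_DEFAULT line):

```shell
# Append the APST latency cap to the default kernel command line
# (sed keeps a .bak backup), then regenerate grub.cfg:
sudo sed -i.bak \
  's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 nvme_core.default_ps_max_latency_us=5500"/' \
  /etc/default/grub
sudo update-grub

# After the next reboot, confirm the running kernel picked it up:
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
```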

https://community.wd.com/t/linux-support-for-wd-black-nvme-2018/225446/9
