PCIe Bus Error – Causes and Solutions for ‘AER / Bad TLP’

I'm seeing error messages like these below:

Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: AER: Multiple 
Corrected error received: id=0018 Nov 15 15:49:52 x99 kernel: pcieport
0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, 
id=0018(Receiver ID) Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: 
device [8086:6f08] error status/mask=00000040/00002000 Nov 15 15:49:52 
x99 kernel: pcieport 0000:00:03.0: [ 6] Bad TLP

These will cause degraded performance even though they have (so far) been corrected. Obviously, this issue needs to be resolved. However, I cannot find much about it on the Internet. (Maybe I'm looking in the wrong places.) I found only a few links which I will post below.

Does anyone know more about these errors?

Is it the motherboard, the Samsung 950 Pro, or the GPU (or some combination of these)?

The hardware is: Asus X99 Deluxe II Samsung 950 Pro NVMe in the M2. slot on the mb (which shares PCIe port 3). Nothing else is plugged into PCIe port 3. A GeForce GTX 1070 in PCIe slot 1 Core i7 6850K CPU

A couple of the links I found mentions the same hardware (X99 Deluxe II mb & Samsung950 Pro). I'm running Arch Linux.

I do not find the string "8086:6f08" in journalctl or anywhere else I have thought to search so far.

odd error message with nvme ssd (Bad TLP) : linuxquestions https://www.reddit.com/r/linuxquestions/comments/4walnu/odd_error_message_with_nvme_ssd_bad_tlp/

PCIe: Is your card silently struggling with TLP retransmits? http://billauer.co.il/blog/2011/07/pcie-tlp-dllp-retransmit-data-link-layer-error/

GTX 1080 Throwing Bad TLP PCIe Bus Errors – GeForce Forums https://forums.geforce.com/default/topic/957456/gtx-1080-throwing-bad-tlp-pcie-bus-errors/

drivers – PCIe error in dmesg log – Ask Ubuntu https://askubuntu.com/questions/643952/pcie-error-in-dmesg-log

780Ti X99 hard lock – PCIE errors – NVIDIA Developer Forums
https://devtalk.nvidia.com/default/topic/779994/linux/780ti-x99-hard-lock-pcie-errors/

Best Answer

I can give at least a few details, even though I cannot fully explain what happens.

As described for example here, the CPU communicates with the PCIe bus controller by transaction layer packets (TLPs). The hardware detects when there are faulty ones, and the Linux kernel reports that as messages.

The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space, which is available in Linux since kernel 2.6. Very roughly, all PCI devices have an area that describe this device (which you see with lspci -vv), and the originally method to access this area involves going through I/O ports, while PCIe allows this space to be mapped to memory for simpler access.

That means in this particular case, something goes wrong when the PCIe controller uses this method to access the configuraton space of a particular device. It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.

By using pci=nommconf, the configuration space of all devices will be accessed in the original way, and changing the access methods works around this problem. So if you want, it's both resolving and suppressing it.