Debian – How to fix mpt2sas “port enable: FAILED with timeout (timeout=300s)”

debianscsi

Since having set the drives attached to an LSI SAS 2008 controller to "Power up in standby", these drives are no longer being detected. Power up in standby is also known as PUIS, POIS or "ATA6 power up in standby mode".

In the Super Micro X8SI6-F BIOS "Load onboard SAS Option Rom" is set to "Disabled'.

Loading of kernel module mpt2sas during boot is disabled using: echo 'blacklist mpt2sas' >> /etc/modprobe.d/mpt2sas.conf; depmod; update-initramfs -u -k $(uname -r)

modprobe mpt2sas is done in /etc/rc.local

PUIS on the drives was set using "/sbin/hdparm -s1 --yes-i-know-what-i-am-doing /dev/sdX"

# tail /var/log/messages
Dec 19 21:07:21 debian kernel: [   14.503509] mpt2sas0: Scatter Gather Elements per IO(128)
Dec 19 21:07:22 debian kernel: [   14.735785] mpt2sas0: LSISAS2008: FWVersion(14.00.01.00), ChipRevision(0x03), BiosVersion(07.27.00.00)
Dec 19 21:07:22 debian kernel: [   14.735878] mpt2sas0: Protocol=(Initiator), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
Dec 19 21:07:22 debian kernel: [   14.736748] mpt2sas0: sending port enable !!
Dec 19 21:07:22 debian kernel: [   15.294663] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
Dec 19 21:07:22 debian kernel: [   15.294759] e1000e 0000:03:00.0: eth0: 10/100 speed: disabling TSO
Dec 19 21:07:22 debian kernel: [   15.296146] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Dec 19 21:07:23 debian kernel: [   16.257786] mpt2sas0: host_add: handle(0x0001), sas_addr(0x5003048007abbc00), phys(8)
Dec 19 21:12:22 debian kernel: [  314.234004] mpt2sas0: port enable: FAILED with timeout (timeout=300s)
Dec 19 21:12:54 debian kernel: [  346.439736] mpt2sas0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x50014380182cf0e6), phys(37)

# tail -n40 /var/log/syslog
Dec 19 21:41:11 debian kernel: [  240.376096] INFO: task modprobe:1341 blocked for more than 120 seconds.
Dec 19 21:41:11 debian kernel: [  240.376171] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 19 21:41:11 debian kernel: [  240.376263] modprobe      D 0000000000000000     0  1341   1287 0x00000000
Dec 19 21:41:11 debian kernel: [  240.376414]  ffff88023f06b880 0000000000000082 0000000000000000 000000000000bfc5
Dec 19 21:41:11 debian kernel: [  240.376656]  0000000000000096 ffffffff8104e54b 000000000000f9e0 ffff88023d681fd8
Dec 19 21:41:11 debian kernel: [  240.376895]  0000000000015780 0000000000015780 ffff88023bc80000 ffff88023bc802f8
Dec 19 21:41:11 debian kernel: [  240.377134] Call Trace:
Dec 19 21:41:11 debian kernel: [  240.377204]  [<ffffffff8104e54b>] ? release_console_sem+0x17e/0x1af
Dec 19 21:41:11 debian kernel: [  240.377278]  [<ffffffff8105aeba>] ? __mod_timer+0x141/0x153
Dec 19 21:41:11 debian kernel: [  240.377350]  [<ffffffff812fbec4>] ? schedule_timeout+0xa5/0xdd
Dec 19 21:41:11 debian kernel: [  240.377422]  [<ffffffff8105aa34>] ? process_timeout+0x0/0x5
Dec 19 21:41:11 debian kernel: [  240.377492]  [<ffffffff812fbd04>] ? wait_for_common+0xde/0x15b
Dec 19 21:41:11 debian kernel: [  240.377566]  [<ffffffff8104a461>] ? default_wake_function+0x0/0x9
Dec 19 21:41:11 debian kernel: [  240.377647]  [<ffffffffa021d6e1>] ? _base_make_ioc_operational+0x929/0xa6f [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.377743]  [<ffffffffa021fa85>] ? mpt2sas_base_attach+0xb73/0xc61 [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.377817]  [<ffffffff810412ee>] ? enqueue_task_fair+0x3e/0x82
Dec 19 21:41:11 debian kernel: [  240.377889]  [<ffffffff8103a311>] ? enqueue_task+0x5f/0x68
Dec 19 21:41:11 debian kernel: [  240.377956]  [<ffffffff8103a403>] ? activate_task+0x22/0x28
Dec 19 21:41:11 debian kernel: [  240.378037]  [<ffffffffa0222e21>] ? _scsih_probe+0x32c/0x501 [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.378115]  [<ffffffff811a2d46>] ? local_pci_probe+0x12/0x16
Dec 19 21:41:11 debian kernel: [  240.378188]  [<ffffffff811a3996>] ? pci_device_probe+0xc0/0xe9
Dec 19 21:41:11 debian kernel: [  240.378263]  [<ffffffff81221520>] ? driver_probe_device+0xa3/0x14b
Dec 19 21:41:11 debian kernel: [  240.378333]  [<ffffffff81221617>] ? __driver_attach+0x4f/0x6f
Dec 19 21:41:11 debian kernel: [  240.378404]  [<ffffffff812215c8>] ? __driver_attach+0x0/0x6f
Dec 19 21:41:11 debian kernel: [  240.378477]  [<ffffffff81220def>] ? bus_for_each_dev+0x43/0x74
Dec 19 21:41:11 debian kernel: [  240.378549]  [<ffffffff812207af>] ? bus_add_driver+0xaf/0x1f8
Dec 19 21:41:11 debian kernel: [  240.378621]  [<ffffffff812218cf>] ? driver_register+0xa7/0x111
Dec 19 21:41:11 debian kernel: [  240.378698]  [<ffffffffa015f000>] ? _scsih_init+0x0/0x112 [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.378772]  [<ffffffff811a3bdc>] ? __pci_register_driver+0x50/0xb8
Dec 19 21:41:11 debian kernel: [  240.378849]  [<ffffffffa015f000>] ? _scsih_init+0x0/0x112 [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.378928]  [<ffffffffa015f0fc>] ? _scsih_init+0xfc/0x112 [mpt2sas]
Dec 19 21:41:11 debian kernel: [  240.379002]  [<ffffffff8100a065>] ? do_one_initcall+0x64/0x174
Dec 19 21:41:11 debian kernel: [  240.379072]  [<ffffffff8107ab54>] ? sys_init_module+0xc5/0x21a
Dec 19 21:41:11 debian kernel: [  240.379144]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Dec 19 21:42:18 debian kernel: [  307.313037] mpt2sas0: _base_send_port_enable: timeout
Dec 19 21:42:18 debian kernel: [  307.313106] mpt2sas0: port enable: FAILED
Dec 19 21:42:18 debian kernel: [  307.313171] mpt2sas0: sending diag reset !!
Dec 19 21:42:19 debian kernel: [  308.430890] mpt2sas0: diag reset: SUCCESS
Dec 19 21:42:19 debian kernel: [  308.431001] mpt2sas 0000:01:00.0: PCI INT A disabled
Dec 19 21:42:19 debian kernel: [  308.431102] mpt2sas0: failure at /build/buildd-linux-2.6_2.6.32-46-amd64-_ApuPc/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/scsi/mpt2sas/mpt2sas_scsih.c:6021/_scsih_probe()!

Regression

Tested with Debian Linux kernels:

  1. Linux debian 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012
    x86_64
  2. BPO.3 (Linux debian 3.2.0-0.bpo.3-amd64 #1 SMP Thu Aug 23
    07:41:30 UTC 2012 x86_64)
  3. BPO.4 (Linux debian 3.2.0-0.bpo.4-amd64 #1 SMP Debian 3.2.32-1~bpo60+1 x86_64)

Tested with SAS BIOS enabled = no change.

After the timeout, bpo.3 and bpo4. do freeze. Even the PgUp/PgDown keys don't work at the console.

modinfo /lib/modules/2.6.32-5-amd64/kernel/drivers/scsi/mpt2sas/mpt2sas.ko | grep ^version:
version:        02.100.03.00

Update #1: Tested with LSI driver version 15.00.00.00 on BIOS and firmware upgraded adapter:

mpt2sas0: LSISAS2008: FWVersion(15.00.00.00), ChipRevision(0x03), BiosVersion(07.29.00.00)
mpt2sas0: port enable: FAILED with timeout (timeout=300s)

And the system freezes 3492 seconds after boot with mpt2sas driver version 15. Freezes are solved by flashing to IT firmware.

Update #2: Some more detailed SMP reporting

# smp_rep_phy_err_log /dev/bsg/expander-0\:0 -vvv
    Report phy error log request: 40 11 06 02 00 00 00 00 00 00 00 00 00 00 00 00 
Report phy error log response:
  Expander change count: 303
  phy identifier: 0
  invalid dword count: 18518
  running disparity error count: 18492
  loss of dword synchronization count: 2
  phy reset problem count: 0

# smp_rep_phy_err_log /dev/bsg/expander-0\:1 -vvv
    Report phy error log request: 40 11 06 02 00 00 00 00 00 00 00 00 00 00 00 00 
Report phy error log response:
  Expander change count: 715
  phy identifier: 0
  invalid dword count: 36103
  running disparity error count: 35004
  loss of dword synchronization count: 4
  phy reset problem count: 0

# smp_rep_phy_sata --phy=5 /dev/bsg/expander-0\:0 -vvv
    Report phy SATA request: 40 12 10 02 00 00 00 00 00 05 00 00 00 00 00 00 
Report phy SATA response:
  expander change count: 303
  phy identifier: 5
  STP I_T nexus loss occurred: 0
  affiliations supported: 1
  affiliation valid: 1
  STP SAS address: 0x50014380182cf0c5
  register device to host FIS:
    34 00 50 01 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 
  affiliated STP initiator SAS address: 0x5003048007abbc00
  STP I_T nexus loss SAS address: 0x0
  affiliation context: 0
  current affiliation contexts: 1
  maximum affiliation contexts: 1

# smp_rep_exp_route_tbl /dev/bsg/expander-0\:0 -vvv
    Report expander route table request: 
      40 22 ff 06 00 00 00 00  00 3e 00 00 00 00 00 00 
      00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00 
Report expander route table response header:
  expander change count: 303
  expander route table change count: 1
  self configuring: 0
  zone configuring: 0
  configuring: 0
  zone enabled: 0
  expander route table descriptor length: 4 dwords
  number of expander route table descriptors: 0
  first routed SAS address index: 0
  last routed SAS address index: 0
  starting phy id: 0

Update #3: Verbose logging enabled in mpt2sas.ko and scsi logging increased using dev.scsi.logging_level = 0x180000F1 in /etc/sysctl.conf. Resulting /var/log/messages output:

debian kernel: [    0.927392] setting logging_level(0x00080000)
debian kernel: [    1.591808] mpt2sas0: sending port enable !!
debian kernel: [    3.113480] mpt2sas0: host_add: handle(0x0001), sas_addr(0x5003048007abbc00), phys(8)
debian kernel: [    3.124224] mpt2sas0: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x50014380182cf0e6), phys(37)
debian kernel: [    3.137436] mpt2sas0: detecting: handle(0x000a), sas_address(0x50014380182cf0c0), phy(0)
debian kernel: [    3.137520] mpt2sas0: REPORT_LUNS: handle(0x000a), retries(0)
debian kernel: [    8.127417] mf:
debian kernel: [    8.127417]   0000000a 00000000 00000000 3a580000 00600000 00000018 00000000 000007f8 
debian kernel: [    8.127842]   00000000 0000000c 00000000 00000000 00000000 00000000 00000000 02000000 
debian kernel: [    8.128261]   000000a0 00000000 0000f807 00000000 00000000 00000000 00000000 00000000 
debian kernel: [    8.128679]   d30007f8 3c3a7000 00000002 00000000 
debian kernel: [    8.128980] mpt2sas0: issue target reset: handle(0x000a)
debian kernel: [    8.352363] mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
debian kernel: [    8.352500] mpt2sas0: target reset completed: handle(0x000a)
debian kernel: [    8.352563] mpt2sas0: issue retry: handle (0x000a)
debian kernel: [   11.347175] mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
debian kernel: [   11.347276] mpt2sas0: TEST_UNIT_READY: handle(0x000a), lun(0)
debian kernel: [   14.591621] mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
debian kernel: [   14.591720] mpt2sas0: SATA Initialization Timeout,sending a retry
debian kernel: [   14.591785] mpt2sas0: TEST_UNIT_READY: handle(0x000a), lun(0)
debian kernel: [   17.586480] mpt2sas0: log_info(0x31111000): originator(PL), code(0x11), sub_code(0x1000)
debian kernel: [   17.586836] mpt2sas0: detecting: handle(0x000b), sas_address(0x50014380182cf0c1), phy(1)

Update #4: No timeouts and correct drive initialization when PUIS = disabled.

The BIOSes of other adapters, like the HP Smart Array P410/256MB controller (462862-B21) and Highpoint Rocket 2720SGL, are also not able to detect any PUIS/POIS Hitachi drives. The Highpoint controller BIOS says that it is spinning up Group #1, but is still not able to detect any drives behind the expander.

How to detect drives in POIS/PUIS mode behind a HP SAS expander using the LSISAS2008 controller?

Best Answer

Does it work with the PUIS disabled?

The fact that you get a port enable failed may indicate an HBA problem but the fact that you got the target resets on the drives (log_info 0x31111000) shows you got the port enabled in some cases but the drives failed to respond. This may be due to the PUIS condition, they simply did not initialize.

Normally when you have disks behind a SAS expander the expander will take on the task to wake them up. You can try to get the status of the expander with the sg_ses utility to see what it thinks.

The errors you are showing in SMP may indicate a cable problem somewhere. You do not have errors on the disks but you do have them on the cable from the expander to the initiator. You should check that option too.

The log_info 0x31111000 decodes into:

./lsi_decode_loginfo.py 0x31111000
Value       31111000h
Type:       30000000h   SAS 
Origin:     01000000h   PL 
Code:       00110000h   PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:   00001000h   PL_LOGINFO_SUB_CODE_DSCVRY_SATA_INIT_TIMEOUT 
Related Question