Uninterruptible Backup Process on Solaris

Tags: backup, kill, solaris, zfs

=========== System Details ===========

OS: Solaris 10, update 11
CPU_ARCH: SPARC (sparcv9)
HW: Sun Fire V490 (Yeahhhh baby old school)
KERNEL_REV: 150400-40
Program: bpbkar32 (Symantec's Netbackup)
TL;DR: Can't kill processes even with kill -9 because of a SUSPENDED zpool, most likely caused by not having two good paths to the SAN.

Issue:

We have a bunch (16) of un-killable processes on the system. The backup team notified us that they couldn't kill these jobs from the NB Master server, nor generate new backups, so we hopped on, attempted a ./bp.kill_all, and received:

bash-3.2# ./bp.kill_all

Looking for NetBackup processes that need to be terminated.
Killing bpbkar processes…

The following processes are still active
root 20346 1 0 02:02:33 ? 0:00 bpbkar32 -r 2678400 -ru root -dt 1047868 -to 0 -bpstart_time 1481767648 -clnt n
root 18689 1 0 Dec 09 ? 0:00 bpbkar32 -r 8035200 -ru root -dt 0 -to 0 -bpstart_time 1481325879 -clnt nerp323
root 12618 1 0 Dec 07 ? 0:00 bpbkar32 -r 2678400 -ru root -dt 357484 -to 0 -bpstart_time 1481077264 -clnt ne
root 29693 1 0 Dec 09 ? 0:00 bpbkar32 -r 2678400 -ru root -dt 529430 -to 0 -bpstart_time 1481249210 -clnt ne
root 10168 1 0 Dec 09 ? 0:00 bpbkar32 -r 2678400 -ru root -dt 530349 -to 0 -bpstart_time 1481250129 -clnt ne
root 1950 1 0 Dec 14 ? 0:00 bpbkar32 -r 2678400 -ru root -dt 962300 -to 0 -bpstart_time 1481682080 -clnt ne
Do you want this script to attempt to kill them? [y,n] (y) y
Killing remaining processes…
Waiting for processes to terminate…
Waiting for processes to terminate…
Waiting for processes to terminate…
Waiting for processes to terminate…
Waiting for processes to terminate…
There are processes still running.

… truncated output for readability.

That led us to attempt to kill those processes with extreme prejudice via kill -9, also to no avail.
I've looked at How to kill a task that cannot be killed (uninterruptable?) and What if 'kill -9' does not work? as well as searched on "Solaris uninterruptable process" with partial results. Reboot seems to be the common theme and one that looks to be our "bang-head-against-desk-here" solution as well.

That being said, I'd like to:
– Validate my logic and reasoning about what the root cause is
– See if there's a better way to determine where the process is stopped / what syscall it's attempting to execute
– Resolve the I/O hang without a reboot if at all possible, and then clear out those processes that can't be killed
Pretty much just a root cause analysis and some sort of "In the future, don't do switch work while backups are running, or when you don't have two working paths" mitigation.

Here's what I got/what I'm thinking:
1) Popping into the /proc/1950/ directory and looking at status. No dice making sense of that output, even with strings; it just spews random characters.
A thing of note is that 'cwd' shows a link to nothing, and attempting to resolve it via ls -alL /proc/1950/cwd hangs the terminal and also creates (drumroll) another uninterruptible process.
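
Something that might get a bit further, purely as a sketch: the files under /proc on Solaris are binary C structs (see proc(4)), which is why cat and strings just spew garbage; the p-tools are the intended readers, and pflags in particular only reads the status structs without needing to take control of the process.

bash-3.2# pflags 1950

If an LWP is parked in the kernel, pflags should report it as ASLEEP together with the name of the syscall it is sleeping in, which is exactly the "what syscall is it attempting" question above.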

2) Running pstack 1950 generates some output, but nothing I can't already get from ps -eaf and nothing I can make much sense of. It's all zeroes, which looks bad; we don't see addresses or a syscall like I do with a working pid.

bash-3.2# pstack 1950

1950: bpbkar32 -r 2678400 -ru root -dt 962300 -to 0 -bpstart_time 1481682080
0000000000000000 ???????? (0, 0, 0, 0, 0, 0)
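
Since the user-level stack comes back empty, the kernel stack is the more interesting one anyway, and it can be read from the live kernel with mdb without touching the wedged process at all. A sketch, assuming you're comfortable running mdb -k read-only on a production box (the 0t prefix just marks the pid as decimal):

bash-3.2# echo "0t1950::pid2proc | ::walk thread | ::findstack -v" | mdb -k

A stack that bottoms out in something like cv_wait()/zio_wait() inside the zfs module would confirm the threads are parked in the kernel waiting on pool I/O, which is precisely the state SIGKILL cannot interrupt.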

3) Running truss hangs if attempted on the running process; pfiles gets nowhere either, generating the error "pfiles: cannot control process 1950". Interesting, but expected.

4) Running strace just tells me "tracer already exists". (Worth noting that strace(1M) on Solaris traces STREAMS messages, not syscalls; truss is the syscall tracer here, so this error only means some other STREAMS tracer is attached and says nothing about pid 1950.)

5) Running pwdx to print the cwd returns:
bash-3.2# pwdx 1950

1950: /bucket

This is interesting as our df does include it…
df -h /bucket

Filesystem size used avail capacity Mounted on
bucket 1.9T 31K 1.9T 1% /bucket

… but attempting to cd into /bucket and do an ls produces the same hanging effect.
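
That hang-everything behavior is what a suspended pool is designed to do: the ZFS failmode property, which defaults to wait, makes every I/O against the pool block in the kernel until the pool is resumed (or the box reboots), and that is also why kill -9 never lands. Worth confirming, as a sketch, and assuming zpool get still answers on a suspended pool since it only reads pool configuration:

bash-3.2# zpool get failmode bucket

For comparison, failmode=continue would have returned EIO to new writes instead of blocking, and failmode=panic would have panicked the host outright; with wait, everything simply queues up behind the suspended pool.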

bash-3.2# zpool list

NAME SIZE ALLOC FREE CAP HEALTH ALTROOT
bucket 1.94T 308K 1.94T 0% SUSPENDED –
rpool 136G 58.0G 78.0G 42% ONLINE –

bash-3.2# umount /bucket

cannot open 'bucket': pool I/O is currently suspended

bash-3.2# zpool export bucket

cannot unmount '/bucket': Device busy
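
For completeness, forced variants of both commands exist, though with the pool suspended and processes wedged inside it I'd expect them to block or fail the same way; this is a sketch of what's available rather than a recommendation:

bash-3.2# umount -f /bucket
bash-3.2# zpool export -f bucket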

bash-3.2# zpool status -x

pool: bucket
state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://www.sun.com/msg/ZFS-8000-HC
scan: none requested
config:
NAME STATE READ WRITE CKSUM
bucket SUSPENDED 0 0 0 experienced I/O failures
c3t50060E80102B1F5Ad78 FAULTED 2 0 0 too many errors
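
One more low-risk probe that could be tried here: see whether the OS can read from that LUN at all, independent of ZFS. A sketch using the device name from the zpool status output above, with s0 as an illustrative slice (adjust to whatever the label actually has):

bash-3.2# dd if=/dev/rdsk/c3t50060E80102B1F5Ad78s0 of=/dev/null bs=512 count=1

If that errors out immediately, the path is still dead and there is nothing for zpool clear to work with; if it returns data, the device is reachable again and clearing the pool has a chance; if it hangs, you've learned the same thing the slow way.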

Sooo… I'm sensing that we are dead in the water. Really, when that "switch work" was happening there were NOT two active/healthy paths to the SAN, so we ended up pulling the rug out from under the vdev. It just so happened that a backup was running there when it died, but any process, like my ls, would have had the same behavior.

Anyone have any last saving thoughts of "run this unknown command that will save you a reboot"???

Best Answer

As suggested by Jeff, zpool clear should help resolve the issue if the path(s) have returned. Since it sounds like it didn't, the server probably cannot see the LUN(s).

A zpool clear -F -n bucket will also tell you if the pool could be recovered by discarding the last set of transactions (the -F option).
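
Roughly the order I'd try things in once the device is reachable again, sketched below; the -F/-n recovery options should exist on a Solaris 10 update 11 pool, but confirm against your zpool(1M):

bash-3.2# zpool clear bucket
bash-3.2# zpool clear -F -n bucket
bash-3.2# zpool clear -F bucket

The plain clear just retries the suspended I/O; -F -n is a dry run that reports whether discarding the last few transaction groups would bring the pool back, without actually doing it; -F on its own performs that rewind.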

You mentioned switch work, so you may want to check what work was done, and whether any of the changes removed one or more of the paths. Have you looked at your luxadm display /dev/rdsk/c<____>s2 output? Or tried reconfiguring the paths with cfgadm? Or sending a forcelip event down a path?
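
For reference, the sort of commands I have in mind there, sketched with the device name from your zpool status and a placeholder for the HBA port path:

bash-3.2# luxadm -e port
bash-3.2# luxadm display /dev/rdsk/c3t50060E80102B1F5Ad78s2
bash-3.2# cfgadm -al -o show_FCP_dev
bash-3.2# luxadm -e forcelip /devices/<path-to-HBA-port>:devctl

luxadm -e port shows whether each HBA port still thinks it is CONNECTED, cfgadm -al shows the attachment points and whether the LUNs are still configured, and forcelip forces the link to reinitialize, so anything else alive on that path will notice it.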

The full output of a zpool status bucket might also be useful to determine the type of pool (mirror, concat, stripe, ...). I'm assuming it's not a mirror, based on the issue.

I realize it's easy for me to say since I'm not in the mix, but don't panic quite yet: the data should all still be present on the array, assuming the array itself isn't the issue. You may end up having to reimport with some of the transactions rolled back, though.

Best of luck!
