Linux – “Last resort” Linux terminal command to reboot (over ssh) in case of a kernel bug

I have a small server running Ubuntu 10.04, which I administer from another computer via ssh, and I tried to use NFS on it to share files. That mostly works, until one of the clients unmounts and I want to shut down nfs-kernel-server on the server. The stop itself appears to succeed:

$ sudo service nfs-kernel-server stop
 * Stopping NFS kernel daemon                                                     [ OK ] 
 * Unexporting directories for NFS kernel daemon...                               [ OK ]

… I do get something like this in the log:

Feb  5 11:50:17 user init: statd main process (3806) killed by KILL signal
Feb  5 11:50:17 user init: statd main process ended, respawning
Feb  5 11:50:17 user init: idmapd main process (3808) killed by KILL signal
Feb  5 11:50:17 user init: idmapd main process ended, respawning
Feb  5 11:50:17 user statd-pre-start: local-filesystems started
Feb  5 11:50:17 user sm-notify[3815]: Already notifying clients; Exiting!
Feb  5 11:50:17 user rpc.statd[3830]: Version 1.1.6 Starting
Feb  5 11:50:17 user rpc.statd[3830]: Flags:

… meaning that some NFS-related processes ignored the stop request and respawned. If at this point I try to run sudo service nfs-kernel-server start (again via ssh), that command freezes, and in /var/log/syslog I get this:

Feb  5 11:43:55 user mountd[2045]: authenticated mount request from 192.168.0.2:1005 for /media/disk (/media/disk)
Feb  5 11:45:19 user mountd[2045]: Caught signal 15, un-registering and exiting.
Feb  5 11:45:19 user kernel: [27428.148368] nfsd: last server has exited, flushing export cache
Feb  5 11:45:19 user kernel: [27428.148431] BUG: Dentry d0bc8b28{i=1f6,n=} still in use (1) [unmount of vfat sdd8]
Feb  5 11:45:19 user kernel: [27428.148473] ------------[ cut here ]------------
Feb  5 11:45:19 user kernel: [27428.148481] kernel BUG at /build/buildd/linux-2.6.32/fs/dcache.c:670!
Feb  5 11:45:19 user kernel: [27428.148491] invalid opcode: 0000 [#1] SMP 
Feb  5 11:45:19 user kernel: [27428.148501] last sysfs file: /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq
...
Feb  5 11:45:19 user kernel: [27428.148807] Call Trace:
Feb  5 11:45:19 user kernel: [27428.148824]  [<c024c780>] ? vfs_quota_off+0x0/0x20
Feb  5 11:45:19 user kernel: [27428.148838]  [<c021d4fc>] ? shrink_dcache_for_umount+0x3c/0x50
Feb  5 11:45:19 user kernel: [27428.148852]  [<c020d090>] ? generic_shutdown_super+0x20/0xe0
...
Feb  5 11:45:19 user kernel: [27428.149511] EIP: [<c021d4a9>] shrink_dcache_for_umount_subtree+0x249/0x260 SS:ESP 0068:ccc6de6c
Feb  5 11:45:19 user kernel: [27428.149631] ---[ end trace 6198103bb62887ac ]---
Feb  5 11:49:53 user init: idmapd main process (838) killed by TERM signal
Feb  5 11:49:53 user init: idmapd main process ended, respawning
Feb  5 11:49:53 user rpc.statd[769]: Caught signal 15, un-registering and exiting.
Feb  5 11:49:53 user init: statd main process ended, respawning
Feb  5 11:49:53 user statd-pre-start: local-filesystems started
Feb  5 11:49:53 user sm-notify[3790]: Already notifying clients; Exiting!
Feb  5 11:49:53 user rpc.statd[3806]: Version 1.1.6 Starting
Feb  5 11:49:53 user rpc.statd[3806]: Flags: 
...

Now, the thing is this – after this bug happens, the server's ssh server is (for some reason) usually still "live", so I can log in via ssh again, and try to close processes (and realize it is impossible to kill /usr/sbin/rpc.nfsd 8, which is the one hanging).

BUT – if at this point, I try to issue a reboot via sudo shutdown -r now && exit from ssh, then this server PC will start the reboot process – but will not complete it; it will drop to a terminal, dump some error messages, and stay there :(

The problem is – the server PC is in a really difficult to access location, and having to go there to do Alt+SysRq + REISUB to properly reboot (if the kernel reacts to that key combo; else it's hard powerdown) is really difficult.

So my question is: is there some "hardcore reboot" command in Linux that will more or less "guarantee" that the PC will reboot (and not just hang/freeze), even if it has encountered a kernel bug, and which I could issue via ssh? Something that would be the equivalent of a hard powerdown (i.e. turning off the power by e.g. holding the power button for 10+ seconds) and hard powerup?

Best Answer

To ensure that the system will reboot no matter what, I always do this sequence:

# echo s > /proc/sysrq-trigger
# echo u > /proc/sysrq-trigger
# echo s > /proc/sysrq-trigger
# echo b > /proc/sysrq-trigger

This asks the kernel to do the following, in order:

1. s: emergency sync of the block devices
2. u: remount all filesystems read-only
3. s: sync again
4. b: force an immediate reboot, without any further syncing or unmounting; you can also use o for poweroff.

See the kernel's Magic SysRq documentation for an explanation of this feature.
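
When the ssh login is an ordinary user with sudo rights rather than a root shell, the same sequence can be wrapped in a small function. The following is a minimal sketch, not a tested recovery tool: the function name sysrq_sequence and the SUDO, SYSRQ_TRIGGER and SYSRQ_DELAY variables are made up for illustration, and the two-second pause is an assumption meant to give the sync and remount requests time to finish before the next key is sent.

```shell
#!/bin/sh
# Sketch only: SUDO, SYSRQ_TRIGGER and SYSRQ_DELAY are hypothetical overrides,
# not kernel or distribution settings.

# Write each given SysRq key to the trigger file, one key per write.
sysrq_sequence() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    for key in "$@"; do
        # tee does the write with root privileges; with a plain
        # "sudo echo s > file" the unprivileged shell would open the file.
        printf '%s\n' "$key" | ${SUDO-sudo} tee "$trigger" > /dev/null || return 1
        # Assumed pause so the sync or remount can finish before the next key.
        sleep "${SYSRQ_DELAY:-2}"
    done
}

# Usage (the machine reboots immediately after the final key):
#   sysrq_sequence s u s b
```

Pointing SYSRQ_TRIGGER at a scratch file and setting SUDO to an empty string allows a dry run of the function without touching the real trigger.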
