What?! No Reset Button! – No hands remote fix of a jammed server…

The scene is set late at night: a failing hard disk has forced the filesystem on a very remote server to go read-only. The server’s RAM fills and the system steadily jams up. There are two ssh terminals open. Commands such as “ls”, “umount -l”, “mount” and “reboot” all return the fateful response “system IO error”… Killing one of the ssh sessions to try to free up some resources gives no improvement… There is now just the one ssh terminal.

What to do?…

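Anything that has to be loaded from the failing disk is presumably out of reach, but a shell builtin runs inside the bash process that is already in memory. So, with some Linux kernel magic and by the magic of the bash builtin command “echo”: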

echo b > /proc/sysrq-trigger
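
For the curious, here is a slightly gentler variant, sketched under the assumption that the kernel has magic SysRq support and that you are root. It asks for an emergency sync (“s”) and a read-only remount (“u”) before the reboot (“b”). This is not what was run that night. The function names and the DRY_RUN and SYSRQ_TRIGGER variables are made up for illustration, and only shell builtins are used so that nothing needs to be read from the dying disk.

```shell
#!/bin/sh
# Sketch only: a gentler SysRq reboot sequence using nothing but builtins.
# DRY_RUN and SYSRQ_TRIGGER are hypothetical knobs, not part of any tool.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
DRY_RUN="${DRY_RUN:-1}"   # default to printing, so sourcing this is harmless

sysrq() {
    # $1 is a single SysRq command letter, e.g. s, u or b
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write '$1' to $SYSRQ_TRIGGER"
    else
        echo "$1" > "$SYSRQ_TRIGGER"
    fi
}

emergency_reboot() {
    sysrq s   # request an emergency sync of dirty buffers
    sysrq u   # request a read-only remount of all filesystems
    sysrq b   # reboot immediately, with no further syncing or unmounting
}
```

With DRY_RUN left at 1, calling emergency_reboot only prints the three steps. The sync request is asynchronous, so on a machine that can still run “sleep” it would be sensible to pause for a moment between the steps.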

With that, there was a long pause as the system rebooted, then ssh reconnected OK. The filesystem was still read-only, but at least now the memory wasn’t jammed up.

The next few bits of Linux magic were to create a new directory ram_dir in the tmpfs-mounted /var/run, “cp -a /etc /var/run/ram_dir”, and bind mount the copied etc over the read-only /etc to give a read-writable /etc. After that, it was a case of NFS-mounting some disk space from other servers to mount up the rest of the system, and restarting the various services. Voila! Rapid resurrection 🙂
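
Roughly, the /etc trick above looks like the sketch below. It assumes root and a tmpfs-backed /var/run. The DRY_RUN switch, the run helper and the function names are hypothetical additions for illustration, and the NFS server, export and mount point are left as arguments because the real ones are not recorded here.

```shell
#!/bin/sh
# Sketch only: give a read-only system a writable /etc backed by RAM.
# DRY_RUN, RAM_DIR, run, etc_to_ram and nfs_borrow are illustrative names.
DRY_RUN="${DRY_RUN:-1}"                   # default to printing the commands
RAM_DIR="${RAM_DIR:-/var/run/ram_dir}"    # directory on the tmpfs /var/run

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

etc_to_ram() {
    run mkdir -p "$RAM_DIR" &&                # scratch directory in RAM
    run cp -a /etc "$RAM_DIR" &&              # copy lands in $RAM_DIR/etc
    run mount --bind "$RAM_DIR/etc" /etc      # writable copy hides the old /etc
}

nfs_borrow() {
    # $1 = server:/export (hypothetical), $2 = an existing mount point
    run mount -t nfs "$1" "$2"
}
```

Running etc_to_ram with DRY_RUN=1 just prints the three commands. Anything written to the bind-mounted /etc lives only in RAM and is gone at the next reboot, which is fine for a stopgap but not for a permanent fix.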

And… Must make a note to move to using at least 3 disks in a RAID for the remote machines! 😐
