We still have some older hardware running for non-critical and lightweight operations. Since upgrading these old boxes to Debian Squeeze weird things started to happen. After some time, the system clock stops working and that results in processes to crash and new ones unable to start. Of course, nothing works properly on any computer without a working system clock. Here you see how that looks like:
First thought was to replace the CMOS battery, but that didn’t help and it occurred on multiple machines around the same time. Rebooting solves the problem for a while but 24-36 hours later the system clock stops again. Sometimes the problem stays away much longer. Also, some servers with the same hardware do not seem to have the problem. It seems hardware related. But is that really true?
Since the server crashes it is kinda hard to figure out what exactly happens. Logs aren’t properly written anymore and system processes are unreliable. Sometimes it is possible to access the server via SSH, sometimes it isn’t.
Today this happened again with one of our servers. Fortunately I was able to SSH into the machine and I found this warning in the logs:
Mar 14 13:00:05 server kernel: [97383.660485] Clocksource tsc unstable (delta = 4686838547 ns)
TSC is the Time Stamp Counter. Processors have dynamically changed clock speed (ofc to save power). The TSC is supposed to tick at the CPU rate so on frequency change, this ought to happen. The kernel will automatically switch to something else. This is true for modern hardware, this old CPU is set to constant_tsc (when using cat /proc/cpuinfo it’s one of the flags). It is thus supposed to run at the same frequency at all times.
The system might have multiple available clock sources. You can list the available clock sources in a system like this:
On my system this returns:
tsc acpi_pm jiffies
To see which one is actually in use, issue:
On my system this returns:
So this tells me I’m using ‘tsc’ as a Clock Source, and my system has two more options. Since I’m having trouble with ‘tsc’, let’s change the Clock Source from ‘tsc’ to ‘acpi_pm’.
echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
Verify the new setting like this:
It should now return ‘acpi_pm”.
This can be done at run time, when you reboot the setting is lost. It is a great way to test the setting. To make it permanent, add this line to the boot parameters in /boot/grub/menu.lst:
I’ve just changed this on the server and I’m really curious whether this will be the solution to prevent the hangs! At least the time is still ticking.. 😉
If anyone has more suggestions or info, please let me know! I’ll keep you posted.
Update: Just reached 4 days of uptime!
Update 2: Today it’s 4 month’s after I wrote this blog. The machine is still up & running and has reached 106 days of uptime!
Update 3: After 7 months (222 days of uptime) I finally retired the machine since we migrated to our CloudStack cloud. The fix described above really works 🙂