Linux: How to diagnose / isolate what’s causing “random” hangs and spontaneous reboots

Tags: crash, linux, ubuntu, ubuntu-9.10

(originally posted on serverfault)

So, rather than guessing just what the cause is (though my money's on the nvidia drivers), where do I start looking to pin down some facts?

I've been through /var/log on several occasions but there's a LOT of stuff in there and I can't (yet) spot the important bits.
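One way to cut the volume down is to filter the logs for messages that tend to accompany kernel or driver trouble. The following is a minimal sketch, not a definitive tool: the pattern list is an assumption chosen for illustration (kernel oops/panic markers, soft lockups, machine check events, and the `NVRM` prefix the proprietary Nvidia driver uses in kernel messages), and the function names are hypothetical.

```python
import re

# Assumed, non-exhaustive list of strings that often show up around crashes.
# Matching is case-sensitive so that "debug:" does not trip the "BUG:" pattern.
SUSPECT = re.compile(
    r"Kernel panic|Oops|\bBUG:|Call Trace|soft lockup|[Mm]achine [Cc]heck|"
    r"NVRM|segfault|I/O error|Out of memory"
)


def suspicious_lines(lines):
    """Return (line_number, text) pairs for every line matching SUSPECT."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if SUSPECT.search(line)
    ]


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as handle:
        return suspicious_lines(handle)
```

Calling `scan_file("/var/log/kern.log")` (or the same on `/var/log/syslog` or `/var/log/Xorg.0.log`, run with enough privileges to read them) returns the matching lines with their line numbers. After a hard lockup the most interesting entries are usually the last ones written before the next boot, although a sudden freeze may leave nothing in the log at all.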

Background: The Short Version

I moved from WinXP to Ubuntu Karmic just after it became available.

Since then I have had a series of seemingly random crashes that manifest as either:

  • a spontaneous reboot
  • a complete lockup with my USB keyboard and mouse becoming unresponsive (right down to the LEDs all turning off). I am also typically unable to ssh to the box when this happens.

I've done plenty of searching and Nvidia seems to be the prime suspect but I have no idea where to start looking to work out just what the real cause is.

A serverfault user suggested checking the RAM with Memtest86+. No errors found. Monitoring video card temperature has also been suggested, which I'm looking into now.

Other than that, any suggestions?

Background: The Long Version

At times, I can go an entire week without a crash then have 5 in 2 days.

Motivated by the desire to eliminate possible suspects, I've made a few changes over time to no avail:

  • Originally I used KVM for virtualization; I now use VirtualBox OSE
  • I had NFS running in the kernel but now use Samba
  • I was using Compiz but have since turned that off
  • I've rolled from 64-bit Karmic to 32-bit (for other reasons as well)
  • I've tried Ubuntu, Kubuntu and Xubuntu. Same trouble each time (although of late it seems to be more frequent in Gnome than in XFCE).
  • I rolled the Nvidia driver from version 185 back to version 96 (NVIDIA Linux x86 Kernel Module 96.43.13 Thu Jun 25 18:42:21 PDT 2009). This seems to have reduced the frequency of crashes.

In terms of what's running at the time, this can vary. The following are common but were not necessarily running for every crash:

  • Firefox 3.5
  • VirtualBox OSE with 1 or 2 Windows XP VMs
  • Skype
  • Rhythmbox or Exaile

My hardware is 2 – 3 years old:

  • Core 2 Duo 6300
  • 4GB RAM
  • some breed of Intel motherboard of that vintage
  • an Asus dual-head video card with Nvidia GeForce 7300 GS chipset
  • 2 x SATA HDDs
  • dual monitors (hence I rely on the proprietary nvidia drivers)

I've been keeping current with my system updates.

Hopefully the data above might prompt someone to suggest a specific type of log or config that would be worth investigating.

Update 1

Just had a crash in which the speakers went nuts. Did some googling and it seems PulseAudio has had a few issues in the past. Not sure yet if this is relevant, but PulseAudio will have been running every time I had a crash.

Update 2

Following @CarlF's link to the Debian Sysadmin Guide has led me to the magic SysRq key, which I shall try at the next crash. Not that this will give me many clues as to the cause, but at least I'll hopefully be able to shut down gracefully.
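Before relying on the magic SysRq key it is worth checking which functions the kernel actually permits, since distributions often enable only a subset. The sketch below decodes the value in `/proc/sys/kernel/sysrq`, assuming the bitmask meanings given in the kernel's SysRq documentation (0 disables everything, 1 enables everything, larger values are a bitmask). The function names are hypothetical.

```python
# Bit meanings as described in the kernel's SysRq documentation.
SYSRQ_BITS = {
    2: "control of console logging level",
    4: "control of keyboard (SAK, unraw)",
    8: "debugging dumps of processes",
    16: "sync command",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}


def describe_sysrq(value):
    """Translate a /proc/sys/kernel/sysrq value into the enabled functions."""
    if value == 0:
        return ["disabled"]
    if value == 1:
        return ["all functions enabled"]
    return [name for bit, name in sorted(SYSRQ_BITS.items()) if value & bit]


def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

For example, `describe_sysrq(read_sysrq())` on a box where the file contains 176 reports sync, remount read-only and reboot/poweroff, which is enough for a graceful shutdown but not for dumping task state.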

Update 3

lm-sensors reports my GPU running at nearly 70C / 158F – interesting. If I had to guess I'd say this is an important clue.
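To test the thermal theory, it helps to have a temperature log that survives a hard lockup, so the last reading before each crash can be compared. This is a rough sketch under stated assumptions: it parses text in the usual `sensors` layout (for example `temp1:  +70.0°C  (high = +95.0°C)`), the labels depend entirely on the hardware and driver, and the function names and log path are hypothetical.

```python
import os
import re
import subprocess
import time

# Matches "label:  +70.0°C ..." and captures only the first reading on the
# line; fan speeds and voltages are skipped because they do not end in C.
TEMP_LINE = re.compile(
    r"^(?P<label>[^:\n]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C",
    re.MULTILINE,
)


def parse_temps(text):
    """Return {label: degrees Celsius} from sensors-style output."""
    return {
        match.group("label").strip(): float(match.group("temp"))
        for match in TEMP_LINE.finditer(text)
    }


def read_sensors():
    """Run the lm-sensors `sensors` command and return its output."""
    return subprocess.run(
        ["sensors"], capture_output=True, text=True, check=True
    ).stdout


def append_reading(path, temps, now=None):
    """Append one timestamped line and force it to disk."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = stamp + " " + " ".join(
        f"{label}={value:.1f}C" for label, value in sorted(temps.items())
    )
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # so the last reading survives a hard freeze
    return line
```

Calling `append_reading("temps.log", parse_temps(read_sensors()))` from a loop or a cron job every minute or so builds the log. If the final entries before each reboot sit consistently near the top of the range, that supports the overheating explanation; if crashes also happen at low temperatures, it points elsewhere.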

Update 4

Hit the insides of the system with an airduster shortly after my last update – net result: only one crash since then. I'm gonna call this a thermal problem.

Best Answer

There's good advice from the Debian Administrator's Guide here: http://www.debian-administration.org/articles/492