Wednesday, January 11, 2012

Random Boot Failures, Linux Kernel Updates, and Hardware Troubles

Earlier this week, Debian unstable offered me a kernel upgrade, and, after some cautious data backup, I installed it. Soon after, I attempted to reboot, and, sadly, it was back to kernel panic, can't find grub, can't find kernel, etc., etc. This time, using the Debian testing installation disk in rescue mode was no help. I re-installed grub, updated the configuration file (update-grub) and got the same errors. I shut down the computer and went to my laptop to search for help, and decided to try some alternative rescue disks: Super Grub2 and Rescatux, and, if that didn't suit, perhaps a Linux Mint live CD.

I burned the CDs, inserted my first choice, Super Grub2, and waited for it to spin. It never did (it turned out to be a bad disk). As I sat there, scratching my head, the computer loaded the Grub menu page, selected the first kernel, and proceeded to boot successfully.

So, perhaps, the problem is not a configuration problem, but a hardware problem. I get a variety of intermittent error messages concerning ata1 (and I was getting a bunch of them yesterday), so once again I checked the hard drive with the smartctl command-line utility. As before, both the long and short self-tests reported no problems. fsck also reported no problems.
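
For anyone retracing these checks, here is a rough sketch of the smartctl self-tests and a filter for the ata1 messages. The device name /dev/sda and the helper names (ata_messages, smart_selftest, smart_results) are my own assumptions for illustration; substitute your own drive.

```shell
# Sketch only: /dev/sda is an assumed device name, not necessarily yours.
DISK="${DISK:-/dev/sda}"

# Print only the kernel log lines that mention one ATA port (default: ata1).
# Reads log text on stdin, e.g.:  dmesg | ata_messages ata1
# The pattern matches "ata1:" and "ata1.00:" but not "ata10:".
ata_messages() {
    grep -E "(^|[^[:alnum:]])${1:-ata1}(\.[0-9]+)?:"
}

# Start a SMART self-test, "short" or "long" (default: short).
# Needs root and the smartmontools package; the test runs in the background.
smart_selftest() {
    smartctl -t "${1:-short}" "$DISK"
}

# Show the self-test log once the test has had time to finish.
smart_results() {
    smartctl -l selftest "$DISK"
}
```

As root, something like `smart_selftest long` followed later by `smart_results` covers the drive, and `dmesg | ata_messages ata1` shows whether the complaints are still piling up.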

The search terms "boot failure linux hardware" returned a few tales of woe similar to mine, such as "Boot failure randomly. Can someone help?" Here's someone who has intermittent boot failures, and suspects a hardware problem. One reader comments:

one of the causes for this can be faulty memory. That can also explain why you sometimes have a problem: if that bad memory is not touched everything works like a charm. If it does get touch unexpected results can be expected. In general: irregular behaviour during boot always start thinking memory 1st. So run memtest from the live cd or from grub if you can boot into it and see if 1 of your memory modules is in bad shape and if so replace it.

The original poster eventually reports:

found a fix but not an explanation therefore help still needed here ;-)

The Fix--Disconnect the cable and take off the battery for a few seconds.

Conjectures--This fix lead me to conclude that maybe something is kept into memory of some hardware.

And here's someone who has the opposite problem from mine: "[amd64] Reproducible cold boot failure (reboot succeeds)." His computer fails to boot if it's been turned off for several hours, but will reboot, or boot successfully if it's been off for a short time. Eventually, he reports that he's found a memory issue:

In fact, one of the 4GB DIMMs in the system returns bogus data
(0x10000000 or 0x04000000 instead of 0) for some 40 to 50 seconds
after power-on. Once warmed up, memtest86+ runs for days without a
single extra data error (I wanted to have an estimate for the defect
having led to damaged data in disk files).

Running memtester showed no errors, although I started getting reams of complaints about ata1.00 again. I haven't yet run the more comprehensive memtest, which has to run at start-up; I'll get to that in the next few days. Last night, aptitude update informed me that there was a kernel update. I went ahead and installed it, then shut down the computer. This morning, I powered up, and got a missing grub message, restarted, and the computer booted successfully. So far, today, I'm not getting any complaints about ata1.
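
A memtester run along these lines might look like the sketch below. The helper names and the "three quarters of free memory" rule are my own assumptions, not a memtester recommendation; memtester itself just takes a size and an optional number of passes.

```shell
# Sketch only: testing three quarters of MemFree is an arbitrary assumption.

# Read /proc/meminfo-style text on stdin and print a test size in MB.
memtest_size_mb() {
    awk '/^MemFree:/ { print int($2 * 3 / 4 / 1024) }'
}

# Run memtester over that much memory for a given number of passes (default 1).
# Best run as root so memtester can lock the pages it is testing.
run_memtester() {
    size=$(memtest_size_mb < /proc/meminfo)
    memtester "${size}M" "${1:-1}"
}
```

Calling `run_memtester 3` as root would make three passes. Because memtester runs in user space, it can only test memory the running system isn't using, which is why a boot-time tester is the more thorough check.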

Here are some resources I've found either helpful or informative this week:

Hardware Checking Resources for My Particular Problem
  • Test Memory Using memtester in Linux:
    memtester is a user-space utility for testing the memory subsystem in a computer to determine if it is faulty. It does a good job of finding intermittent faults and non-deterministic faults. It has many tests to help catch borderline memory. memtester should compile and run on any 32- or 64-bit Unix or Unix-like system.
  • How should I run fsck on a Linux file system? fsck (not a mis-typed profanity, it's short for "file system check") checks and repairs *nix file systems. You can't (or shouldn't) run it on a mounted file system. As root:
    Switch to single-user mode and unmount the file system. For example, to change run level and unmount the /home file system that is mounted on /dev/sda2:
    # init 1
    # umount /home
    Run fsck:
    # fsck /dev/sda2
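
Since fsck shouldn't be run on a mounted file system, a small guard can check first. This is only a sketch: is_mounted and safe_fsck are hypothetical helper names, and the check assumes the device shows up under the same name in /proc/mounts (a root file system listed as /dev/root or by UUID would slip past it).

```shell
# Sketch only: is_mounted and safe_fsck are made-up helper names.

# Succeed if device $1 is listed as a mount source in /proc/mounts-style
# text read from stdin.
is_mounted() {
    awk -v dev="$1" '$1 == dev { found = 1 } END { exit !found }'
}

# Refuse to run fsck on a device that is currently mounted.
safe_fsck() {
    if is_mounted "$1" < /proc/mounts; then
        echo "$1 is mounted; unmount it first" >&2
        return 1
    fi
    fsck "$1"
}
```

With /home still mounted, `safe_fsck /dev/sda2` prints the warning and stops; after `umount /home` it hands the device to fsck.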

Helpful Resources for Boot Failure
  • GRUB 2 bootloader - Full tutorial:
    My goal is to provide people running any flavor of UNIX-like operating systems or multi-booting their computers and using GRUB as their bootloader with a simple, no-nonsense, step-by-step, proven and working tutorial that should allow them to quickly, easily and painlessly control the boot sequence of their systems.
  • MondoRescue HOWTO Utilization and Configuration of Mondo and Mindi under Linux
    This document describes the use of mondo and mindi tools to realize disaster recovery backup of your systems. It provides information on installation, backup and restore modes, hardware and software requirements, and answers to some frequently asked questions. The goals are to offer a general view of the functions and their best usages. Mondo Rescue is a Disaster Recovery Solution which allows you to effortlessly backup and interactively restore Linux, Windows and other supported file system partitions to/from CD/DVD-+R/RW media, tape, NFS, ... and Mindi Linux provides the bootable emergency restore media which Mondo uses at boot-time.
  • Troubleshooting Boot Failures section of "Configuring a Linux Kernel" offers some advice on how to manage failures in a home-rolled kernel. It offers this advice:
    A boot failure can either be due to: kernel configuration issue, a system configuration issue, or a hardware malfunction. It's pretty easy to guess which one is the hardest to detect but easiest to resolve (hint: it's the hardware malfunction one). The other ones, well, they require a bit preparation in order to easily troubleshoot and, eventually, solve.
  • How to solve boot problems with Ubuntu after kernel upgrade: Here's a clear explanation of what happens as Linux boots, and a description of some of the things that can go wrong. It didn't help me fix my problems, but I do understand the messages I get as the process fails.
  • From the same Web site, a description of Grub2. It's a little old, but this is where I found a link to Super Grub2 and Rescatux, boot problem fixes meant to be easier and quicker than using the rescue mode on the system installer.

    Update: SuperGrub2 returns this message: "not a known file system" and fails to find an OS. Not really helpful for my problem, which, to be fair, probably has little to do with Grub.
