Bugzilla – Bug 223773
grub fails boot after update
Last modified: 2013-02-17 08:35:31 UTC
Problem: After upgrading to 10.2 RC1 (from 10.1) I found my box unbootable. GRUB printed "GRUB Loading stage1.5" and hung.

Executing "grub-install /dev/hda" from a rescue environment (using chroot into /) made GRUB come somewhat further (it printed "GRUB Loading stage1.5", "GRUB loading, please wait...") and then rebooted.

So I cleared most of the MBR using

  dd if=/dev/zero of=/dev/hda bs=512 count=1
  dd if=/dev/zero of=/dev/hda bs=512 count=53 skip=1

and reran grub-install. The reboot problem persisted. The grub version is 0.97-39.

Temporary workaround: copying the "stage2" file from (10.1's) grub-0.97-14 (or -15, don't know) to /boot/grub/stage2 and rerunning grub-install solved the problem.

10.2rc1# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /boot/grub/stage2
66797a774c25457f3b5e7c7f0920db9f  /usr/lib/grub/stage2

10.1box# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /boot/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /usr/lib/grub/stage2
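For reference, a sketch of that workaround as commands. The rpm filename and extraction path are illustrative, not taken from this report; the reporter copied the file from an existing 10.1 installation instead.

  # extract the old stage2 from the 10.1 package (filename illustrative)
  rpm2cpio grub-0.97-14.i586.rpm | cpio -idmv './usr/lib/grub/stage2'
  # put it in place of the new stage2 and reinstall the boot loader
  cp usr/lib/grub/stage2 /boot/grub/stage2
  grub-install /dev/hda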
0.97-14 for sure.
Please provide the output of:
  fdisk -l /dev/hda
What filesystems are you using? Please attach /etc/grub.conf.
16:24 ichi:~ # fdisk -l
Disk /dev/hda: 40.0 GB, 40020664320 bytes
255 heads, 63 sectors/track, 4865 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End      Blocks   Id  System
/dev/hda1             1        66      530113+  82  Linux swap / Solaris
/dev/hda2            67      1354    10345860   83  Linux
/dev/hda3   *      1355      4094    22009050    7  HPFS/NTFS
/dev/hda4          4095      4865     6193057+   5  Extended
/dev/hda5          4095      4865     6193026    c  W95 FAT32 (LBA)

Disk /dev/hdc: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot    Start       End      Blocks   Id  System
/dev/hdc1             1        33      265041   82  Linux swap / Solaris
/dev/hdc2            34     30515   244846665    5  Extended
/dev/hdc5            34      4898    39078081   83  Linux
/dev/hdc6          4899     12807    63529011   83  Linux
/dev/hdc7         12808     25000    97940241   83  Linux
/dev/hdc8         25001     30515    44299206   83  Linux

All xfs.

16:25 ichi:~ # cat /boot/grub/menu.lst
default 4
timeout 4

title Kernel-2.6.18.2-jen37c-default
    root (hd0,1)
    kernel /boot/vmlinuz-2.6.18.2-jen37c-default root=LABEL=root rootflags=usrquota,grpquota
    initrd /boot/initrd-2.6.18.2-jen37c-default

title Windows XP
    root (hd0,2)
    chainloader +1

title Memtest
    kernel (hd0,1)/boot/memtest86.bin

I really don't know what GRUB is up to. On a fresh 10.2 RC1 install inside a virtual machine (vmware) grub-0.97-39 works without a hitch.
Where is /etc/grub.conf ? Please paste here.
11:19 ichi:~ > cat /etc/grub.conf
root (hd0,1)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 d (hd0) /boot/grub/stage2 0x8000 (hd0,1)/boot/grub/menu.lst
quit
So what happens if you do "grub --batch < /etc/grub.conf" ? It doesn't mention any stage1.5, so grub, as configured by yast2, might work.
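For reference, a minimal sketch of that batch run, assuming you are chrooted into the installed system and /etc/grub.conf still contains the commands quoted above:

  # replay the recorded install commands with the grub shell in batch mode;
  # unlike grub-install, this uses exactly the stage files named in grub.conf
  grub --batch < /etc/grub.conf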
(0) Current status: grub-0.97-39 is installed, but 0.97-14's /boot/grub/stage2 is in place. This harddisk has been cloned to a virtual machine for further testing and reproducing.

(1) Installing grub-0.97-39 (using --force because it is already installed) successfully makes the machine hang again. I presume this is because grub-install is _not_ run in the %post section.

(2) Rerunning "grub-install /dev/hda" from the chroot of '/' (/dev/hda2) within the rescue CD produces a virtual machine kernel stack fault (i.e. a reboot).

Answer to comment #6: That does not work. The only message it ever prints on boot is "GRUB Loading stage2.." (yes, two dots) (where's my stage1.5 loader gone?), and it also generates a VM kernel stack fault (reboot).
After (0) I have set a snapshot in VMware, so I can go back there easily. An interesting observation follows:

# rpm -Uhv grub-0.97-39.i586.rpm --force
...
# grub-install /dev/hda
The file /boot/grub/stage2 not read correctly.

This is likely to be disk caching (the reason we can't do grub-install in %post?), so I went back and did it again:

# Uhv 39...
# sync
# echo 3 >/proc/sys/vm/drop_caches
# sync
# grub-install /dev/hda

(success - but still generates a VM fault)
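Spelled out, the flush sequence above is roughly the following (a sketch; the drop_caches knob assumes a 2.6.16+ kernel):

  rpm -Uhv --force grub-0.97-39.i586.rpm   # reinstall the package
  sync                                     # write out dirty pages first
  echo 3 > /proc/sys/vm/drop_caches        # drop page cache, dentries and inodes (only clean pages are dropped, hence the sync)
  sync
  grub-install /dev/hda                    # the hope in this thread: stage2 is now re-read from disk, not a stale cache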
Another observation:

# xfs_bmap /boot/grub/stag* /boot/grub/xfs_stage1_5
/boot/grub/stage1:
        0: [0..7]: 1714824..1714831
/boot/grub/stage2:
        0: [0..199]: 1748800..1748999
/boot/grub/xfs_stage1_5:
        0: [0..23]: 1714864..1714887

# Uhv 39 ...

# xfs_bmap again
/boot/grub/stage1:
        0: [0..7]: 1447456..1447463
/boot/grub/stage2:
        0: [0..207]: 1447520..1447727
/boot/grub/stage2.old:
        0: [0..199]: 1748800..1748999
/boot/grub/xfs_stage1_5:
        0: [0..23]: 1447496..1447519

Which could explain part of the hang "(1)". grub-0.97-40 has the same issue. Is there something in compilation that changed between -14 and -39? Do you still have releases between those two?
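In other words, the package update rewrites the stage files and XFS places them in new extents, so any block list embedded at install time now points at the old locations. A sketch of how one could check this around an update (assuming xfsprogs is installed):

  # capture extent maps before and after the update; if the start blocks
  # change, the block list recorded by the previous grub install is stale
  xfs_bmap /boot/grub/stage1 /boot/grub/stage2 /boot/grub/xfs_stage1_5 > /tmp/bmap.before
  rpm -Uhv --force grub-0.97-39.i586.rpm
  xfs_bmap /boot/grub/stage1 /boot/grub/stage2 /boot/grub/xfs_stage1_5 > /tmp/bmap.after
  diff -u /tmp/bmap.before /tmp/bmap.after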
Strangely enough,

  cat /usr/lib/grub/stage2 >/boot/grub/stage2
  (wait a bit until the fs syncs)
  grub-install /dev/hda

worked out today, with /usr/lib/grub/stage2 being the one from grub-0.97-40. No idea what's going on, but once it works on the real system, I'll just close this bug down.
Can you tell me what happens here?

# rpm -q grub
grub-0.97-39
# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
8e9c..3cb4  /boot/grub/stage2   (the very old version)
6679..db9f  /usr/lib/grub/stage2

On upgrade, I had hoped everything would be set straight, but:

# rpm -Uhv grub-0.97-40.i586.rpm
...
# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
d4d9..f354  /boot/grub/stage2
6679..db9f  /usr/lib/grub/stage2

What gives? Why does /boot/grub/stage2 not have the same MD5 even though it is copied over from /usr/lib/grub/stage2? When I manually mv & dd like in the %post script, the md5 is correct.
Yet another comment, including a solution.

Snapshot state: works.

Case 1: Upgrading grub installs a stage2-only loader, which hangs at the next boot.

Case 2: Upgrading grub and running grub-install afterwards to get a stage1.5+stage2 loader causes a VM machine fault at the next boot.

Case 3: Upgrading grub, *replacing* /boot/grub/stage2 with /usr/lib/grub/stage2 so that they both have *the same md5*, and then running grub-install makes the machine boot successfully.

==================================================

Answer to comment #11: grub --batch </etc/grub.conf pokes into /boot/grub/stage2. I suppose GRUB does not get it right and the boot hangs/faults. I suppose the "sync" is not long enough. Also try removing the 2>/dev/null >&1 in the `grub --batch` line.
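Case 3 as commands, for reference; a sketch based on the steps in this and earlier comments (the package version is illustrative):

  rpm -Uhv --force grub-0.97-39.i586.rpm
  # make /boot/grub/stage2 byte-identical to the packaged copy
  cat /usr/lib/grub/stage2 > /boot/grub/stage2
  sync                      # give the filesystem a moment to settle
  grub-install /dev/hda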
Note: *DON'T* use grub-install unless you know exactly what you're doing. grub-install guesses many things on the fly. Hence we record the commands used to install grub in /etc/grub.conf, for reliable reproducibility. "grub --batch < /etc/grub.conf" is the preferred method to install / update the boot loader.

Note 2: /boot/grub/stage2 will be modified in the process, as you found already. That's why we work on a copy; the original is in /usr/lib/grub. (The dd is a workaround for a reiserfs quirk.)

VMware has problems of its own.
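A sketch of that preferred procedure, under the assumption (from this thread) that the packaged stage files live in /usr/lib/grub and the working copies in /boot/grub:

  # refresh the working copy; the install step below modifies it in place
  cp /usr/lib/grub/stage2 /boot/grub/stage2
  # replay the recorded commands instead of letting grub-install guess devices
  grub --batch < /etc/grub.conf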
So the remaining problem described in comment #12, Case 1, is what we have in Bug #144773: the device nodes aren't there during the update.

*** This bug has been marked as a duplicate of bug 144773 ***
I can't access 144773. The device nodes _are_ there (as in /dev/hda), however. And there is still case 3, which works best.
"You are not authorized to access bug #144773." Please open that one.
This still doesn't work for me on 10.2 final. Booting the installed system from the CD, starting yast within it and reinstalling grub from there didn't work, nor did running grub-install directly. stage2 in /boot didn't match stage2 in /usr/lib/grub either:

leon@Dellin:/boot/grub> md5sum stage2
76c99ae1f95e9c76d5adf291922e8b42  stage2
leon@Dellin:/boot/grub> md5sum /usr/lib/grub/stage2
66797a774c25457f3b5e7c7f0920db9f  /usr/lib/grub/stage2

Copying the latter stage2 to /boot/grub worked around the problem.
Fixing the update problem is a work in progress. This is the second try to get you access to that bug.

*** This bug has been marked as a duplicate of bug 144773 ***
That still does not work.
144773 does not seem relevant (obsolete devs.rpm??).

That said, this bug has bitten me again when upgrading a (cleanly installed) 10.3 Alpha1 to the newest factory from yesterday (Alpha2plus). Running grub-install on today's factory rescue ISO runs through using the new scheme (see below), but the system remains unbootable. (Chrooted into /dev/sda2, /dev is available.)

rescue:/ # rpm -q grub
grub-0.97-48
rescue:/ # grub-install /dev/sda
[ Minimal BASH-like ... ]
grub> setup --stage2=/boot/grub/stage2 (hd0) (hd0,1)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/xfs_stage1_5" exists... yes
 Running "embed /boot/grub/xfs_stage1_5 (hd0)"... 18 sectors are embedded. succeeded
 Running "install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+18 p (hd0,1)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit
rescue:/ #
As per comment #6, I tried "grub --batch" again, but:

# cat /etc/SuSE-release
openSUSE 10.3 (i586) Alpha2plus
VERSION = 10.3
# cat /etc/grub.conf
setup --stage2=/boot/grub/stage2 (hd0) (hd0,1)
# grub --batch </etc/grub.conf
(see comment #20)

So both ways of doing it (grub --batch, grub-install) do it with stage1.5.
And it happened again when going from Alpha3 (0.97-50) to grub-0.97-52.
I've got a testcase in a VMware 6 machine; grab it at http://jengelh.hopto.org/sk0.tar.bz2 (155 MB). Installing /var/lib/smart/packages/grub-0.97-52.i586.rpm using smart or rpm will trigger the bug on the next boot.
Testcase file (sk0.tar.bz2) updated. Fixed: root could not log in.
Happened again with the update from 0.97-62 to 0.97-64. Why?
Any update, Torsten?
*** Bug 291038 has been marked as a duplicate of this bug. ***
Jan, even if you do not understand it, this _is_ a symptom of missing device nodes, as discussed in Bug #144773. This should be worked around in yast now. (This never was a grub bug.)
Should work in 10.3 Beta3.
I have read http://en.opensuse.org/Software_Management/Upgrade/Devs_Rpm, which is linked from 144773, but it does not help. I am *NOT* using "rpm --root" or any other chrooting magic, so how could a device node be missing?
Thank you for fixing, whatever it was.
Again, as stated most recently in comment #29, this never was a grub issue. Thanks go to the yast team for fixing this. Also from me :-)
I do not use yast, and it is not even installed. I am just trying to figure out what was changed, so if you can point me to the commit, that'd be easiest.
obviously
Happened again when going from 10.3 to 11.0. grub did not change a bit, and yet it fails. This fucking thing sucks!
Created attachment 225153 [details]
/etc/grub.conf

Thankfully, the grub.rpm update left me with a /boot/grub/stage2.old which, when put back into /boot/grub/stage2, makes the boot procedure work again. Since the 'stage2' file seems to come from /usr/lib/grub/stage2 and is altered before being copied to /boot/grub, I presume this very 'morphing' process is broken.
I agree with comment #37: XFS really does suck, especially when it comes to booting Linux on a PC. Fortunately we do not support it any more for new installations, an ext2 /boot partition is highly recommended. The problem is that with XFS, sync(2) returns, but the data isn't synced. The first time yast calls grub install, grub does not find the new stage1.5, because it is not on the disk yet, despite a successful sync; thus it modifies stage2 to do the job. On the second invocation, stage1.5 is found and installed, but stage2 already is modified. So once again this isn't a grub bug, but an XFS bug with FS semantics.
*** Bug 331685 has been marked as a duplicate of this bug. ***
> Fortunately we do not support it any more for new installations, an ext2
> /boot partition is highly recommended.

/boot on XFS, or XFS in general?
Well, I can only talk about /boot residing on XFS. When the new package is available, please test. It waits about 10 secs for the FS to settle; this is noticeable during package update. rpm -U is sufficient for testing.
Does grub even use xfs_bmap?
Of course not. http://oss.sgi.com/archives/xfs/2008-07/msg00013.html
"You are not authorized to access bug #144773." -- I'm still being blocked when going to look at this. - > I agree with comment #37: XFS really does suck, especially when it comes to > booting Linux on a PC. Fortunately we do not support it any more for new > installations, an ext2 /boot partition is highly recommended. --- I've been using XFS on root, with certainty, since 9.0 (I remmber getting being bitten by a SuSE9.2 bug where the xfs driver on the installation disk was fault and couldn't read old partitions. I am 'fairly' sure that I've been using it since 7.3. I've never had any problems except when a disk-cable was going bad. Hardly XFS's fault. But I also use lilo. Grub doesn't have accessible documentation. lilo seems more reliable than grub for most purposes -- and grub, while it looks cool -- seems awfully complex for what it does. lilo has been good for dual and triple boot systems (Linux, Win98+Win2k), and dual mode booting on that Win98 partition (native & under VMWare). I've even used the BIOS re-ordering in lilo to allow booting from sda -- as well as the dynamically adjusting hidden/non-hidden partitions that I needed at one point for Win98 (this isn't recent, I'll mind you...). With a simple adjustment in lilo.conf, I could boot off a cloned removable hard disk -- and change the its boot params while it was still a 2ndary hard disk. Then on bootup, hit the BIOS, one-time boot switch (F12?) and boot from the removable instead of the fixed -- and the system would come up off the 2ndary, while calling it "hda" (the old fixed drive became hdb). It was trivial to figure out and it just worked! The same was not true for grub. It always "knew" better than to write to the 2ndary drive without referring to it by BIOS id 0x8[1-3], trying to configure the 2ndary drive to be "0x80" would have grub do its installation on the 1st disk -- not what I wanted. After multiple attempts, I went back and reinstalled lilo -- no problems. I think Grub was taking a higher-level view of the hard disks -- and wouldn't be so easily 'fooled' by a simple text-label change in its config files. I can see that being a benefit if the order changes and you *don't* want the boot order to change -- but in the opposite case, where I wanted the old behavior -- it did everything to protect me from what I needed to do. Maybe grub is expecting more file-driver functionality to be present when the OS isn't fully "awake"? I find it annoying to go from a working lilo+xfs, and now am told that because grub can't deal w/xfs, xfs isn't supported for booting from anymore. Why not continue to support lilo+xfs? It's just that grub -- is a much more complex beast and demands more from the drivers (and users) -- which is fine if one doesn't need to understand or tweak what's going on, whereas lilo being fairly primitive seems to have fewer failure points. Also, I get the _feeling_ (perhaps wrong), but that lilo is still used alot in the kernel development group. That might mean it gets tested more and might be good, at the very least, as a fall-back when grub gets too demanding...
#42: Where is the update?
#42: the workaround only went into the grub package in factory-11.1, but 11.0's package was not updated. Not good :-/
I'd like to comment at this point, as this situation highlights a problem in the bug-report-fix-release system. Under the current system, bug reports are marked "closed" when some patch has gone into a future revision, not when the reporter has actually received or tested a fix, or when the bug fix actually gets released to "CD duplication" (or release engineering?).

There should be more states in the bug system to allow for these intermediate stages:

1) the fix has gone into the code tree, and
1a) maybe been verified by the reporting customer;
2) a test case is created to duplicate the problem, and the 'release' (not the code) is tested to verify that the fix is in and works;
3) the product with the 'fix' in it is 'released', with the final image signed off by 'testing'.

(Yeah, a lot more picky, but it can help prevent things falling into cracks...)
It happens to me on ext3 as well, with openSUSE 11.0. I will verify with SLED11 and see whether it also happens there. The md5sum of /boot/grub/stage2 is different from that of /usr/lib/grub/stage2. This should be the main cause.
This works for me now. (I thought I had 'grub' locked in zypper, but it seems I had not, and the 11.0->11.1 upgrade installed it. I did not notice, since it booted fine after that.)
Except inside VMware. Oh how I hate this craploader.
The best bet so far seems to be to run

  mount /boot -o remount,sync

before updating any grub files in /boot. Any comments on whether this is feasible? (Grub is not updated that often.)
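As a sketch of how that could look in practice, assuming /boot is a separate filesystem normally mounted async (the package version is illustrative):

  mount /boot -o remount,sync        # make writes to /boot synchronous
  rpm -Uhv grub-0.97-XX.i586.rpm     # update the package (version illustrative)
  grub --batch < /etc/grub.conf      # reinstall the boot loader from the recorded commands
  mount /boot -o remount,async       # restore the usual mount behaviour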
The problem is that a 'sync' means the data is written to disk. XFS does this just fine, BUT it may write metadata into the journal FIRST and only later copy the final information into place. XFS guarantees the file is on disk, which includes being in the journal, but it doesn't guarantee that the data is in its final position. This can be the case with ANY journaling or dynamically optimizing file system.

Grub isn't going through the file-system calls to get to the data; apparently it's using direct I/O on a mounted file system. Anyone with common sense would know this is just plain dangerous. A simple 'sync' won't solve the problem. The only way to guarantee everything is finalized is to unmount the file system. Then everything *should* sync under normal circumstances. If something goes wrong during unmount, XFS could be prevented from completing its replay of the journal, as happens during a crash. When the file system is remounted, XFS plays out the unfinished portion of the journal. I don't know the timing of fs availability vs. the journal being played out, but I was under the _impression_ that the journal is played out before the fs is made ready for use. So if it is possible, I'd unmount the boot partition and then remount it.

However, someone else in the thread at http://oss.sgi.com/archives/xfs/2008-07/msg00031.html made the comment "the GRUB shell directly to write it. grub-install doesn't work reliable." Does the GRUB shell work through the file system, where maybe grub-install does not? If that's the case, then maybe using a shell script to feed the GRUB shell commands might be another possible workaround.

It's unlikely GRUB would work on NTFS either, since while file data is written to disk, the MFT stays resident in memory and locked until the OS goes down. To rely on the disk being in a static state while mounted is bad programming and it should be fixed. It's a 1970s mentality with regard to disks. Nowadays you want to touch the disk as little as possible, and disks are getting more dynamic as volume managers and shadow copies show different views of disks (even when synced) than what may exist on disk.

I had this same problem with grub: it didn't use what the file system said, it used values on disk that were wrong. It had nothing to do with XFS, but with the fact that at the high level I had changed the disk labels and was booting from a different partition. So what was mounted on /boot (label=Boot) WASN'T what grub was writing to; it was writing to and updating my old boot partition. Meanwhile, yast2 was happily writing options to /boot, which was the new partition that was really mounted, while grub ignored what was going on at the high level. All of that has nothing to do with any particular file system, but again has to do with grub not using the file system but using its own block commands.

I've never had a problem with grub interacting badly with my XFS-based boot partition, but I have a *dedicated* partition for /boot (it's not on root). So there's very little I/O happening on /boot other than when I copy a new kernel to it. After that, it takes very little time for XFS to dump its buffers and for the disk to be 'idle' again.

The safest thing to do would be to fix grub to use the file system. Then I wouldn't have run into my bug (I forget the bug # but I did log it), which _wasn't_ file-system related, but entirely related to grub not using the high-level view of the system and mounted file systems.
(While Yast and I were blissfully modifying boot params on the new /boot partition, which had no effect on grub whatsoever.) I was *lucky* in that I didn't immediately scrub the old partition, so my system still booted; it's just that no changes on the 'live' file system were being used -- grub was now booting from an inactive partition. That would never happen if grub's code were written correctly.

If you can't fix grub, then the only way to be safe is to unmount the file system and remount it. (That, or check whether the grub shell really uses file system calls for its housekeeping; then a script might be a solution.)

Good luck? :-) -l
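A minimal sketch of that unmount/remount approach, under the assumption that /boot is a separate partition that nothing else is using at the time:

  umount /boot                    # forces XFS to flush everything to its final on-disk location
  mount /boot                     # remount; /etc/fstab supplies the device and options
  grub --batch < /etc/grub.conf   # reinstall using the recorded commands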
Well, the reason why boot loaders still work like that is that the 31744 bytes between the MBR and the first partition (when the chosen geometry is 255/63; it may be less with others!) is pretty small already, so small that it could not possibly hold all fs drivers. Of course the easiest thing would be if the firmware (BIOS) did the file handling. OpenBoot seems to do that, though only for UFS. SILO still writes itself to byte position 1024, going the "old way" of binding to fs blocks.

Since that is not going to go away anytime soon, it should be catered for somehow. Have the yast installer always create a /boot partition, preferably with a filesystem that writes "rather immediately" into place, i.e. ext2/3, and show appropriate warning dialogs in the installer if one decides not to have a separate /boot, etc.
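For the record, the 31744-byte figure follows from the track size: with a 63-sectors-per-track geometry, the embedding area is track 0 minus the MBR sector.

  # (63 - 1) sectors * 512 bytes = 31744 bytes available for embedding stage1.5
  echo $(( (63 - 1) * 512 ))    # prints 31744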
Your safest bet would be to go with FAT32 if the user chooses grub. That way there's no mistaking grub for a loader that can handle a modern file system. The better alternative would be to use lilo, which doesn't have this problem. What would you rather have, grub+FAT32, or lilo and a pure XFS (or ZFS, or ext3/4, or whatever)? I'd prefer to choose my file system and use the boot loader that works with it than have my choice of file system dictated by a boot loader.

The only reason grub has these problems is that it wants to provide features that used to be (maybe still are) in a boot PROM on higher-end systems. But that's not safe anymore on high-end file systems unless the file system code is in the PROM, which is practical when you are talking about one vendor who uses one or two file systems, not the plethora Linux has.

Who is pushing for going with grub at SuSE when it has problems like this and lilo does not? Grub doesn't buy you anything; it causes more problems than it's worth. It does NOT buy one name independence like I was led to believe, as that's simply hidden in the boot-load ram disk -- and that, IMO, is BAD! I've been bitten by that more than once, not knowing exactly what grub was doing and what it was relying on. One time, I moved the disks to a different system, and grub was unhappy because the controller IDs had changed, so the disks' GUID paths had changed (they were still sda, sdb, etc., but it generated unique numbers by HW). Quite the opposite of device-order independence! ;->

If one was really committed to this working, then grub could dynamically load drivers from a disk area marked immutable and non-movable. But that's a lot of effort for a feature that won't be needed when people move to EFI, 'any day now...' *cough* (It's on one of my new systems, but when I installed SuSE, SuSE didn't pick up on the option being there (and I didn't know it was there till later), so it got a standard PC boot. Not sure what the benefit is supposed to be besides not having to use grub... I guess then it's ELILO?)

I think the decision to abandon lilo was premature. If it works with the advanced file systems and grub doesn't, that should be reason enough to bring back support. But someone seems to have a real passion for grubs. ;^)
This discussion is going nowhere. BTW, a few "bugs" in grub turned out to be compiler-related. If you have actual problems with grub on 11.3 that formally qualify as bugs, feel free to report them, but stick to the facts.