Bug 223773 - grub fails boot after update
Status: VERIFIED INVALID
Duplicates: 291038, 331685
Classification: openSUSE
Product: openSUSE 11.0
Component: Bootloader
Version: Final
Hardware: x86 Linux
Priority: P5 - None
Severity: Enhancement (5 votes)
Target Milestone: openSUSE 11.1
Assigned To: Torsten Duwe
Reported: 2006-11-26 11:33 UTC by Jan Engelhardt
Modified: 2013-02-17 08:35 UTC
CC: 9 users

Found By: Beta-Customer


Attachments
/etc/grub.conf (52 bytes, text/plain)
2008-06-30 15:38 UTC, Jan Engelhardt

Description Jan Engelhardt 2006-11-26 11:33:30 UTC
Problem:
After upgrading to 10.2 RC1 (from 10.1) I found my box unbootable. GRUB printed

  GRUB Loading stage1.5

and hung. Executing "grub-install /dev/hda" from a rescue environment (chrooted into /) got GRUB somewhat further; it now printed

  GRUB Loading stage 1.5
  GRUB loading, please wait...

and then rebooted. So I cleared the MBR and the sectors following it using

  dd if=/dev/zero of=/dev/hda bs=512 count=1
  dd if=/dev/zero of=/dev/hda bs=512 count=53 seek=1

and reran grub-install. The reboot problem persisted. The grub version is 0.97-39.

Temporary workaround:
Copying the "stage2" file from (10.1's) grub-0.97-14 (or 15, I don't know) to /boot/grub/stage2 and rerunning grub-install solved the problem.

10.2rc1# md5sum /boot/grub/stage2  /usr/lib/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /boot/grub/stage2
66797a774c25457f3b5e7c7f0920db9f  /usr/lib/grub/stage2

10.1box# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /boot/grub/stage2
8e9c95dd8dd6d2402ea6fd506bb93cb4  /usr/lib/grub/stage2
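
For reference, the workaround amounts to roughly this (a sketch; the exact package filename is an assumption, and cpio extracts into the current directory):

  rpm2cpio grub-0.97-14.i586.rpm | cpio -id ./usr/lib/grub/stage2
  cp usr/lib/grub/stage2 /boot/grub/stage2
  grub-install /dev/hda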
Comment 1 Jan Engelhardt 2006-11-26 11:34:20 UTC
0.97-14 for sure.
Comment 2 Andreas Jaeger 2006-11-26 15:22:40 UTC
Please provide the output of:
fdisk -l /dev/hda

What filesystems are you using?

Please attach /etc/grub.conf
Comment 3 Jan Engelhardt 2006-11-26 15:25:23 UTC
16:24 ichi:~ # fdisk -l

Disk /dev/hda: 40.0 GB, 40020664320 bytes
255 heads, 63 sectors/track, 4865 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1          66      530113+  82  Linux swap / Solaris
/dev/hda2              67        1354    10345860   83  Linux
/dev/hda3   *        1355        4094    22009050    7  HPFS/NTFS
/dev/hda4            4095        4865     6193057+   5  Extended
/dev/hda5            4095        4865     6193026    c  W95 FAT32 (LBA)

Disk /dev/hdc: 251.0 GB, 251000193024 bytes
255 heads, 63 sectors/track, 30515 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/hdc1               1          33      265041   82  Linux swap / Solaris
/dev/hdc2              34       30515   244846665    5  Extended
/dev/hdc5              34        4898    39078081   83  Linux
/dev/hdc6            4899       12807    63529011   83  Linux
/dev/hdc7           12808       25000    97940241   83  Linux
/dev/hdc8           25001       30515    44299206   83  Linux

All xfs.

16:25 ichi:~ # cat /boot/grub/menu.lst 
default 4
timeout 4

title Kernel-2.6.18.2-jen37c-default
    root (hd0,1)
    kernel /boot/vmlinuz-2.6.18.2-jen37c-default root=LABEL=root rootflags=usrquota,grpquota
    initrd /boot/initrd-2.6.18.2-jen37c-default

title Windows XP
    root (hd0,2)
    chainloader +1

title Memtest
    kernel (hd0,1)/boot/memtest86.bin

I really don't know what GRUB is up to. On a fresh 10.2 RC1 install inside a virtual machine (vmware) grub-0.97-39 works without a hitch.
Comment 4 Torsten Duwe 2006-11-27 10:12:33 UTC
Where is /etc/grub.conf ? Please paste here.
Comment 5 Jan Engelhardt 2006-11-27 10:21:10 UTC
11:19 ichi:~ > cat /etc/grub.conf 
root (hd0,1)
install --stage2=/boot/grub/stage2 /boot/grub/stage1 d (hd0) /boot/grub/stage2 0x8000 (hd0,1)/boot/grub/menu.lst
quit
Comment 6 Torsten Duwe 2006-11-27 12:04:18 UTC
So what happens if you do "grub --batch < /etc/grub.conf" ?
It doesn't mention any stage1.5, so grub, as configured by yast2, might work.
Comment 7 Jan Engelhardt 2006-11-27 21:04:45 UTC
(0) Current status: grub-0.97-39 is installed, but 0.97-14's /boot/grub/stage2 is in place. This hard disk has been cloned to a virtual machine for further testing and reproduction.

(1) Installing grub-0.97-39 (using --force because it is already installed) reliably makes the machine hang again.
I presume this is because grub-install is _not_ run in the %post section.

(2) Rerunning "grub-install /dev/hda" from the chroot of '/' (/dev/hda2) within the rescue CD produces a virtual-machine kernel stack fault (i.e., a reboot).

Re comment #6:
That does not work either. The only message it ever prints on boot is "GRUB Loading stage2.." (yes, two dots; where has my stage1.5 loader gone?), and it also generates a VM kernel stack fault (reboot).
Comment 8 Jan Engelhardt 2006-11-27 21:18:26 UTC
After (0) I took a snapshot in VMware, so I can go back there easily. An interesting observation follows:

# rpm -Uhv grub-0.97-39.i586.rpm --force
...
# grub-install /dev/hda
The file /boot/grub/stage2 not read correctly.

This is likely disk caching (the reason we can't run grub-install in %post?), so I went back to the snapshot and tried again:

# Uhv 39...
# sync
# echo 3 >/proc/sys/vm/drop_caches
# sync
# grub-install /dev/hda
(success - but still generates a VM fault)
Comment 9 Jan Engelhardt 2006-11-27 21:19:51 UTC
Other observation:
# xfs_bmap /boot/grub/stag* /boot/grub/xfs_stage1_5
/boot/grub/stage1:
    0: [0..7]: 1714824..1714831
/boot/grub/stage2:
    0: [0..199]: 1748800..1748999
/boot/grub/xfs_stage1_5:
    0: [0..23]: 1714864..1714887
# Uhv 39
...
# xfs_bmap again
/boot/grub/stage1:
    0: [0..7]: 1447456..1447463
/boot/grub/stage2:
    0: [0..207]: 1447520..1447727
/boot/grub/stage2.old:
    0: [0..199]: 1748800..1748999
/boot/grub/xfs_stage1_5:
    0: [0..23]: 1447496..1447519

This could explain part of the hang in (1). grub-0.97-40 has the same issue. Did something change in compilation between -14 and -39? Do you still have the releases between those two?
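
To make the relocation visible, one could record the extents before the update and diff them afterwards (a sketch built from the commands above; the package filename is assumed):

  xfs_bmap /boot/grub/stage* /boot/grub/xfs_stage1_5 > /tmp/extents.before
  rpm -Uhv --force grub-0.97-39.i586.rpm
  xfs_bmap /boot/grub/stage* /boot/grub/xfs_stage1_5 > /tmp/extents.after
  # any sector lists a previous grub install recorded now point at the old blocks
  diff /tmp/extents.before /tmp/extents.after || echo "grub files were relocated"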
Comment 10 Jan Engelhardt 2006-11-29 21:54:42 UTC
Strangely enough,

  cat /usr/lib/grub/stage2 >/boot/grub/stage2
  (wait a bit until the fs syncs)
  grub-install /dev/hda

worked today, with /usr/lib/grub/stage2 being the one from grub-0.97-40. No idea what's going on, but once it works on the real system, I'll just close this bug.
Comment 11 Jan Engelhardt 2006-12-02 09:41:07 UTC
Can you tell me what happens?

# rpm -q grub
grub-0.97-39
# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
8e9c..3cb4  /boot/grub/stage2 (the old version)
6679..db9f  /usr/lib/grub/stage2

Now, on upgrade, I had hoped everything would be set straight, but:
# rpm -Uhv grub-0.97-40.i586.rpm
...
# md5sum /boot/grub/stage2 /usr/lib/grub/stage2
d4d9..f354  /boot/grub/stage2
6679..db9f  /usr/lib/grub/stage2

Why does /boot/grub/stage2 not have the same MD5 even though it is copied over from /usr/lib/grub/stage2? When I manually mv and dd as in the %post script, the MD5 is correct. What gives?
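
A quick way to see where the two copies diverge (for instance, whether only the leading part was patched) is cmp, which reports the first differing byte:

  cmp /boot/grub/stage2 /usr/lib/grub/stage2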
Comment 12 Jan Engelhardt 2006-12-02 10:27:33 UTC
Yet another comment, including a solution.

Snapshot state:
Works.

Case 1:
Upgrading grub installs a stage2-only loader, which hangs at next boot.

Case 2:
Upgrading grub and running grub-install afterwards to get a stage1.5+stage2 loader causes a VM fault at next boot.

Case 3:
Upgrading grub, *replacing* /boot/grub/stage2 with /usr/lib/grub/stage2 so that both have *the same MD5*, and then running grub-install makes the machine boot successfully.
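
In shell terms, Case 3 is roughly (a sketch; the package filename is assumed, paths as elsewhere in this report):

  rpm -Uhv --force grub-0.97-40.i586.rpm
  cp /usr/lib/grub/stage2 /boot/grub/stage2   # make both copies identical again
  sync
  grub-install /dev/hda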

==================================================

Answer to comment #11:
  grub --batch </etc/grub.conf
pokes into /boot/grub/stage2. I suppose GRUB does not get it right and the boot hangs/faults. I suppose the "sync" does not wait long enough. Also try removing the 2>/dev/null >&1 from the `grub --batch` line.
Comment 13 Torsten Duwe 2006-12-07 13:12:25 UTC
Note: *DON'T* use grub-install unless you know exactly what you're doing. grub-install guesses many things on the fly. Hence we record the commands to install grub in /etc/grub.conf, for reliable reproducibility.
"grub --batch < /etc/grub.conf" is the preferred method to install or update the boot loader.

Note 2: /boot/grub/stage2 will be modified in the process, as you have already found. That's why we work on a copy; the original is in /usr/lib/grub.

(the dd is a workaround for a reiserfs quirk)

VMware has problems of its own.
Comment 14 Torsten Duwe 2006-12-07 13:35:51 UTC
So the remaining problem, described in comment #12 as Case 1, is what we have in Bug #144773: the device nodes aren't there during the update.

*** This bug has been marked as a duplicate of bug 144773 ***
Comment 15 Jan Engelhardt 2006-12-07 13:41:53 UTC
Can't access 144773.

The device nodes (e.g. /dev/hda) _are_ there, however.

And there's still Case 3, which works best.
Comment 16 Jan Engelhardt 2006-12-20 20:40:18 UTC
"You are not authorized to access bug #144773."
Please open that one.
Comment 17 Leon Freitag 2006-12-20 21:07:49 UTC
 
Still doesn't work for me on 10.2 final. Booting the installed system from the CD, starting yast within it, and reinstalling grub from there didn't work; nor did running grub-install directly.

Stage2 in /boot didn't match stage2 in /usr/lib/grub either.
leon@Dellin:/boot/grub> md5sum stage2
76c99ae1f95e9c76d5adf291922e8b42  stage2
leon@Dellin:/boot/grub> md5sum /usr/lib/grub/stage2
66797a774c25457f3b5e7c7f0920db9f  /usr/lib/grub/stage2

Copying the latter stage2 to /boot/grub worked around the problem.
Comment 18 Torsten Duwe 2007-02-02 13:44:59 UTC
Fixing the update problem is a work in progress.
Second try to get you access to that bug.

*** This bug has been marked as a duplicate of bug 144773 ***
Comment 19 Leon Freitag 2007-02-02 20:01:39 UTC
That still does not work.
Comment 20 Jan Engelhardt 2007-03-23 19:51:22 UTC
144773 does not seem relevant (obsolete devs.rpm??)

That said, this bug has bitten me again when upgrading a (cleanly installed) 10.3 Alpha1 to the newest factory from yesterday (Alpha2plus). Running grub-install from today's factory rescue ISO completes using the new scheme (see below), but the system remains unbootable.

(chrooted into /dev/sda2, /dev is available)
rescue:/# rpm -q grub
grub-0.97-48
rescue:/# grub-install /dev/sda

 [ Minimal BASH-like ... ]
grub> setup --stage2=/boot/grub/stage2 (hd0) (hd0,1)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/xfs_stage1_5" exists... yes
 Running "embed /boot/grub/xfs_stage1_5 (hd0)"...  18 sectors are embedded.
succeeded
 Running "install --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+18 p (hd0,1)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit
rescue:/#
Comment 21 Jan Engelhardt 2007-03-23 22:03:55 UTC
As per comment #6, I tried "grub --batch" again, but:

#cat /etc/SuSE-release
openSUSE 10.3 (i586) Alpha2plus
VERSION = 10.3
#cat /etc/grub.conf
setup --stage2=/boot/grub/stage2 (hd0) (hd0,1)
#grub --batch </etc/grub.conf
(see comment #20)

So both ways of doing it (grub --batch, grub-install) now install with stage1.5.
Comment 22 Jan Engelhardt 2007-04-25 12:13:07 UTC
And it happened again when going from Alpha3 (0.97-50) to grub-0.97-52.
Comment 23 Jan Engelhardt 2007-04-25 14:25:01 UTC
I've got a testcase in a VMware 6 machine; grab it at http://jengelh.hopto.org/sk0.tar.bz2 (155 MB). Installing /var/lib/smart/packages/grub-0.97-52.i586.rpm using smart or rpm will trigger the bug on the next boot.
Comment 24 Jan Engelhardt 2007-05-03 09:18:41 UTC
Testcase file (sk0.tar.bz2) updated. Fixed: root could not log in.
Comment 25 Jan Engelhardt 2007-06-27 20:38:30 UTC
Happened again with the update from 0.97-62 to 0.97-64. Why?
Comment 27 Stephan Kulow 2007-09-06 18:40:09 UTC
Any update, Torsten?
Comment 28 Torsten Duwe 2007-09-14 11:50:48 UTC
*** Bug 291038 has been marked as a duplicate of this bug. ***
Comment 29 Torsten Duwe 2007-09-14 11:53:23 UTC
Jan, even if you do not understand it, this _is_ a symptom of missing device nodes, as discussed in Bug #144773. This should now be worked around in yast.
(This never was a grub bug.)
Comment 30 Torsten Duwe 2007-09-14 11:54:49 UTC
Should work in 10.3 Beta3.
Comment 31 Jan Engelhardt 2007-09-14 12:11:40 UTC
I have read http://en.opensuse.org/Software_Management/Upgrade/Devs_Rpm, which is linked from 144773, but that does not help. I am *NOT* using "rpm --root" or any other chrooting magic, so how could a device node be missing?
Comment 32 Jan Engelhardt 2007-10-30 14:58:35 UTC
Thank you for fixing, whatever it was.
Comment 33 Jan Engelhardt 2007-10-30 14:58:59 UTC
Thank you for fixing, whatever it was.
Comment 34 Torsten Duwe 2007-10-30 16:58:44 UTC
Again, as stated last in comment #29, this never was a grub issue. Thanks go to the yast team for fixing this. Also from me :-)
Comment 35 Jan Engelhardt 2007-10-30 17:15:13 UTC
I do not use yast, and it is not even installed.
I am just trying to figure out what was changed, so if you can point me to the commit, that'd be easiest.
Comment 36 Jan Engelhardt 2008-06-30 15:02:00 UTC
obviously
Comment 37 Jan Engelhardt 2008-06-30 15:03:23 UTC
Happened again when going from 10.3 to 11.0. grub did not change a bit, and yet it fails. This fucking thing sucks! 
Comment 38 Jan Engelhardt 2008-06-30 15:38:57 UTC
Created attachment 225153 [details]
/etc/grub.conf

Thankfully, the grub.rpm update left me with a /boot/grub/stage2.old which, when copied back to /boot/grub/stage2, makes the boot procedure work again.

Since the 'stage2' file seems to come from /usr/lib/grub/stage2 and is altered before being copied to /boot/grub, I presume this very process of 'morphing' is what is broken.
Comment 39 Torsten Duwe 2008-07-01 14:44:49 UTC
I agree with comment #37: XFS really does suck, especially when it comes to booting Linux on a PC. Fortunately we no longer support it for new installations; an ext2 /boot partition is highly recommended.

The problem is that with XFS, sync(2) returns, but the data isn't synced.
The first time yast calls the grub install, grub does not find the new stage1.5, because it is not on the disk yet despite a successful sync; thus it modifies stage2 to do the job. On the second invocation, stage1.5 is found and installed, but stage2 has already been modified.

So once again this isn't a grub bug, but an XFS issue with FS semantics.
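
One known way to force XFS to push file data into its final on-disk location before grub reads the raw blocks is a freeze/thaw cycle (a sketch; xfs_freeze ships with xfsprogs, and this assumes /boot is a separate XFS mount):

  xfs_freeze -f /boot    # flush dirty data and the log, quiesce the fs
  xfs_freeze -u /boot    # thaw; the grub files should now be in place
  grub --batch < /etc/grub.conf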
Comment 40 Torsten Duwe 2008-07-01 14:46:31 UTC
*** Bug 331685 has been marked as a duplicate of this bug. ***
Comment 41 Jan Engelhardt 2008-07-01 15:00:20 UTC
>Fortunately we do not support it any more for new installations, an ext2 /boot partition is highly recommended.

/boot on XFS, or XFS in general?
Comment 42 Torsten Duwe 2008-07-01 15:26:17 UTC
Well, I can only talk about /boot residing on XFS.

When the new package is available, please test. It waits about 10 seconds for the FS to settle; this is noticeable during the package update. rpm -U is sufficient for testing.
Comment 43 Jan Engelhardt 2008-07-01 15:36:53 UTC
Does grub even use xfs_bmap?
Comment 44 Jan Engelhardt 2008-07-01 20:34:17 UTC
Of course not. http://oss.sgi.com/archives/xfs/2008-07/msg00013.html
Comment 45 L. A. Walsh 2008-07-02 23:27:56 UTC
"You are not authorized to access bug #144773." 
     -- I'm still being blocked when going to look at this.
-

> I agree with comment #37: XFS really does suck, especially when it comes to
> booting Linux on a PC. Fortunately we do not support it any more for new
> installations, an ext2 /boot partition is highly recommended.
---
    I've been using XFS on root, with certainty, since 9.0 (I remember being bitten by a SuSE 9.2 bug where the xfs driver on the installation disk was faulty and couldn't read old partitions). I am 'fairly' sure that I've been using it since 7.3. I've never had any problems except when a disk cable was going bad -- hardly XFS's fault. But I also use lilo.

    Grub doesn't have accessible documentation.  

    lilo seems more reliable than grub for most purposes -- and grub, while it looks cool, seems awfully complex for what it does. lilo has been good for dual- and triple-boot systems (Linux, Win98+Win2k), and for dual-mode booting on that Win98 partition (native & under VMware). I've even used the BIOS re-ordering in lilo to allow booting from sda -- as well as the dynamically adjusted hidden/non-hidden partitions that I needed at one point for Win98 (this isn't recent, mind you...).

    With a simple adjustment in lilo.conf, I could boot off a cloned removable hard disk -- and change its boot params while it was still a secondary hard disk. Then on bootup, hit the BIOS one-time boot switch (F12?) and boot from the removable instead of the fixed disk -- and the system would come up off the secondary, calling it "hda" (the old fixed drive became hdb). It was trivial to figure out, and it just worked!

    The same was not true for grub. It always "knew" better than to write to the secondary drive without referring to it by BIOS id 0x8[1-3]; trying to configure the secondary drive as "0x80" would have grub do its installation on the first disk -- not what I wanted. After multiple attempts, I went back and reinstalled lilo -- no problems.

    I think Grub was taking a higher-level view of the hard disks -- and wouldn't be so easily 'fooled' by a simple text-label change in its config files. I can see that being a benefit if the order changes and you *don't* want the boot order to change -- but in the opposite case, where I wanted the old behavior, it did everything to protect me from what I needed to do. Maybe grub is expecting more file-driver functionality to be present when the OS isn't fully "awake"?

    I find it annoying to go from a working lilo+xfs setup to being told that, because grub can't deal with xfs, xfs isn't supported for booting anymore. Why not continue to support lilo+xfs? grub is a much more complex beast and demands more from the drivers (and users) -- which is fine if one doesn't need to understand or tweak what's going on, whereas lilo, being fairly primitive, seems to have fewer failure points.

    Also, I get the _feeling_ (perhaps wrongly) that lilo is still used a lot in the kernel development group. That might mean it gets tested more, and it might be good, at the very least, as a fallback when grub gets too demanding...


Comment 46 Jan Engelhardt 2008-11-20 21:58:04 UTC
#42: Where is the update?
Comment 47 Jan Engelhardt 2008-11-20 22:00:10 UTC
#42: the workaround only went into the grub package in factory-11.1, but 11.0's package was not updated. Not good :-/
Comment 48 L. A. Walsh 2008-11-21 00:53:29 UTC
I'd like to comment at this point -- as this situation "highlights" a problem in the bug-report-fix-release system.

Under the current system, bug reports are marked "closed" when some patch has gone into a future revision -- not when the reporter has actually received or tested a fix, or when the bug fix actually gets released to "CD duplication" (or release engineering?).

There should be more states in the bug system to allow for these intermediate states:

1) the fix has gone into the code tree, and
1a) has perhaps been verified by the reporting customer;
2) a test case is created to duplicate the problem, and the
   'release' (not the code) is tested to verify that the fix is in and works;
3) the product with the 'fix' in it is 'released', with the final image
   signed off by 'testing'...


(yeah... a lot more picky, but it can help prevent things from falling through the cracks...)
Comment 49 Alex Lau 2008-11-25 17:09:52 UTC
It happened to me on ext3 too, with openSUSE 11.0. I will verify with SLED 11 and see if it also happens there.

The md5sum of /boot/grub/stage2 is different from that of /usr/lib/grub/stage2. This should be the main cause.

Comment 50 Jan Engelhardt 2009-01-07 05:55:30 UTC
This works for me now. (I thought I had 'grub' locked in zypper, but it seems I had not, and the 11.0->11.1 upgrade installed it. I did not notice, since it booted fine after that.)
Comment 51 Jan Engelhardt 2009-01-07 06:00:58 UTC
Except inside VMware. Oh how I hate this craploader.
Comment 52 Jan Engelhardt 2009-11-23 23:46:24 UTC
The best bet so far seems to be to run
  mount /boot -o remount,sync
before updating any grub files in /boot.
Comments on whether this is feasible? (grub is not updated that often.)
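
Spelled out, that would look roughly like this (a sketch; it assumes /boot is a separate mount and that grub is reinstalled via /etc/grub.conf as elsewhere in this report):

  mount -o remount,sync /boot
  cp /usr/lib/grub/stage2 /boot/grub/stage2
  grub --batch < /etc/grub.conf
  mount -o remount,async /boot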
Comment 53 L. A. Walsh 2009-11-24 01:44:14 UTC
The problem is that a 'sync' means the data is written to disk. XFS does this just fine, BUT it may write metadata into the journal FIRST and copy the final information into place later. XFS guarantees the file is on disk -- which includes being in the journal -- but it doesn't guarantee that the data is in its final position. This can be the case with ANY journaling or dynamically optimizing file system.

Grub isn't going through the file-system calls to get to the data; apparently it's using "DIRECT I/O" on a mounted file system. Anyone with common sense would know this is just plain dangerous.

A simple 'sync' won't solve the problem. The only way to guarantee everything is finalized is to unmount the file system. Then everything *should* sync under normal circumstances. IF something goes wrong during unmount, XFS could be prevented from completing its replay of the journal -- as happens during a crash. When the file system is remounted, XFS replays the unfinished portion of the journal. I don't know the timing of fs availability vs. the journal being replayed, but I was under the _impression_ that the journal is replayed before the fs is made ready for use.

So if it is possible, I'd unmount the boot partition and then remount it.
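
That is, roughly (a sketch, assuming /boot is a separate partition, listed in /etc/fstab, and nothing holds it open):

  umount /boot    # writes back all dirty data and the journal
  mount /boot
  grub --batch < /etc/grub.conf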

However, someone else in the thread at http://oss.sgi.com/archives/xfs/2008-07/msg00031.html made the comment: "[use] the GRUB shell directly to write it. grub-install doesn't work reliabl[y]." Does the GRUB shell work through the file system, where maybe grub-install does not?

If that's the case, then maybe using a shell script to feed the GRUB shell commands would be another possible workaround.
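
Such a script could be as small as this (a sketch; the setup line is the one from comment #21):

  printf '%s\n' \
      'setup --stage2=/boot/grub/stage2 (hd0) (hd0,1)' \
      'quit' | grub --batch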

It's unlikely GRUB would work on NTFS either, since while file data is written to disk, the MFT stays resident in memory and locked until the OS goes down. Relying on the disk being in a static state while mounted is bad programming, and it should be fixed. It's a 1970s mentality with regard to disks. Nowadays you want to touch the disk as little as possible, and disks are getting more dynamic, as volume managers and shadow copies show different views of disks (even when synced) than what may exist on disk.

I had this same problem with grub -- it didn't use what the file system said; it used values on disk that were wrong. It had nothing to do with XFS, but rather with the fact that at the high level I had changed the disk labels and was booting from a different partition. So what was mounted at /boot (label=Boot) WASN'T what grub was writing to -- it was writing to and updating my old boot partition. Meanwhile, yast2 was happily writing options to /boot -- the new partition that was really mounted -- while grub ignored what was going on at the high level.

All of that has nothing to do with any particular file system, but again has to do with grub not using the file system, but its own block commands.

I've never had a problem with grub interacting badly with my XFS-based boot partition, but I have a *dedicated* partition for /boot (it's not on root). So there's very little I/O happening on /boot other than when I copy a new kernel to it. After that, it takes very little time for XFS to dump its buffers and for the disk to be 'idle' again.

The safest thing to do would be to fix grub to use the file system. Then I wouldn't have run into my bug (I forget the bug# but did log it), which _wasn't_ file-system related, but entirely related to grub not using the 'high-level view' of the system and mounted file systems (while Yast and I were blissfully modifying boot params on the new /boot partition, which had no effect on grub whatsoever). I was *lucky* in that I didn't immediately scrub the old partition, so my system still booted; it's just that no changes on the 'live' file system were being used -- grub was now booting from an inactive partition. That would never happen if the code were written correctly.

IF you can't fix grub, then the only way to be safe is to unmount the file system and remount it (that, or check whether grub run from the shell really uses file-system calls to do its housekeeping; then a script like the one sketched above might be a solution).

good luck? :-)
-l
Comment 54 Jan Engelhardt 2009-11-24 09:06:21 UTC
Well, the reason bootloaders still work like that is that the 31744 bytes between the MBR and the 1st partition are pretty small already -- with the usual 255/63 geometry the first partition starts at sector 63, leaving sectors 1-62, i.e. 62 * 512 = 31744 bytes (it may be less with other geometries!) -- far too small to stuff all fs drivers in there.

Of course, the easiest thing would be if the firmware (BIOS) did the file handling. OpenBoot seems to do that, though only for UFS. SILO still writes itself to byte position 1024, going the "old way" of binding to fs blocks. Since that is not going away anytime soon, it should be catered for somehow: have the yast installer always create a /boot partition, preferably with a filesystem that writes "rather immediately" into place, i.e. ext2/3, and show appropriate warning dialogs in the installer if one decides not to have a separate /boot, etc.
Comment 55 L. A. Walsh 2009-11-24 14:51:17 UTC
Your safest bet would be to go with FAT32 if the user chooses Grub. That way, there is no mistaking grub for a loader that can handle a modern file system. The better alternative would be to use lilo, which doesn't have this problem. What would you rather have: Grub+FAT32, or lilo and a pure XFS (or ZFS, or ext3/4, or whatever)?

I'd prefer to choose my file system and use the boot loader that works with it than have my choice of file system dictated by a bootloader. The only reason grub has these problems is that it wants to provide features that used to be (and maybe still are) in a boot PROM on higher-end systems. But that's not safe anymore on high-end file systems unless the file-system code is in the PROM -- which is practical when you are talking about one vendor who uses 1-2 file systems, not the plethora Linux has.

Who is pushing for grub at SuSE when it has problems like this and lilo does not? Grub doesn't buy you anything -- it causes more problems than it's worth. It does NOT buy one name independence like I was led to believe, as that's simply hidden in the boot-time ram disk -- and that, IMO, is BAD!

I've gotten bitten by that more than once, not knowing exactly what grub was doing and what it was relying on. One time I moved the disks to a different system, and grub was unhappy because the controller IDs had changed, so the disks' GUID paths had changed (they were still sda, sdb, etc., but it generated unique numbers from the HW). Quite the opposite of device-order independence! ;->

If one were really committed to making this work, grub could dynamically load drivers from a disk area marked immutable and non-movable. But that's a lot of effort for a feature that won't be needed once people move to EFI, 'any day now...' *cough*. (It's on one of my new systems, but when I installed SuSE, SuSE didn't pick up on the option being there -- and I didn't know about it until later -- so it got a standard PC boot. Not sure what the benefit is supposed to be besides not having to use grub... I guess then it's ELILO?)

I think the decision to abandon lilo was premature.  If it works with the advanced file systems and Grub doesn't, that should be good enough reason to bring back support.  But someone seems to have a real passion for grubs.
;^)
Comment 56 Torsten Duwe 2010-02-11 13:45:04 UTC
This discussion is going nowhere. BTW, a few "bugs" in grub turned out to be compiler-related. If you have actual problems with grub on 11.3 that formally qualify as bugs, feel free to report them, but stick to the facts.