Bug 568319 - DM lockup causes filesystem failure
DM lockup causes filesystem failure
Status: RESOLVED FIXED
: 567912 (view as bug list)
Classification: openSUSE
Product: openSUSE 11.2
Classification: openSUSE
Component: Kernel
Final
i686 openSUSE 11.2
: P5 - None : Major (vote)
: ---
Assigned To: Nikanth Karthikesan
E-mail List
maint:released:11.2:32013
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-01-05 14:18 UTC by Carlos Robinson
Modified: 2010-04-08 10:44 UTC (History)
2 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
message log showing kernel dumps and such (36.16 KB, application/x-gzip)
2010-01-29 22:18 UTC, Carlos Robinson
Details
boot log (14.82 KB, application/x-gzip)
2010-01-29 22:18 UTC, Carlos Robinson
Details
requested lsof output (23.51 KB, application/x-gzip)
2010-02-13 23:40 UTC, Carlos Robinson
Details
message log showing kernel dumps and such (36.96 KB, application/x-gzip)
2010-02-13 23:46 UTC, Carlos Robinson
Details
xfs - non-blocking file size updates during io completion backport to openSUSE11.2 (4.91 KB, patch)
2010-02-17 11:45 UTC, Nikanth Karthikesan
Details | Diff
message log showing kernel dumps and such, xfs success, reiserfs crash. (21.07 KB, application/x-gzip)
2010-03-01 21:42 UTC, Carlos Robinson
Details
boot log of previous session (68.45 KB, application/octet-stream)
2010-03-01 21:43 UTC, Carlos Robinson
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Carlos Robinson 2010-01-05 14:18:37 UTC
I reported Bug 567912 upstream, to SGI:

http://oss.sgi.com/bugzilla/show_bug.cgi?id=860

They say it is not an XFS problem, but a DM lockup, a DM bug.


Will you please have a look at it? This problem was reported on 2007 and not solved yet.


Older info: Bug 345039

Guess: it is caused by a suse/novell addition to the kernel: Per Comment 43 there kernel vanilla did not have this problem.
Comment 1 Carlos Robinson 2010-01-15 20:49:48 UTC
As suggested there, mounting 'nobarrier' avoids the issue in 11.2 (not in 11.0)
Comment 2 Nikanth Karthikesan 2010-01-25 11:37:01 UTC
Does your setup have loop devices for files or disk/partition?
If file-backed loop device, is it over XFS?

Also earlier vanilla kernel didn't support barriers in loop devices. Barrier support for loop driver was added to mainline kernel in ~2.6.29. So can you test with vanilla kernel in 11.2?
Comment 3 Carlos Robinson 2010-01-26 21:24:19 UTC
(In reply to comment #2)
> Does your setup have loop devices for files or disk/partition?
> If file-backed loop device, is it over XFS?

Yes and yes.


I wrote the setup on a previous report, but I'll copy it again here.

/etc/crypttab:

crmm_dvd_f1x    /mnt/TMPBig/imgs/crypta_f1_dvd.mm.xfs   none  noauto


/etc/fstab:

/dev/mapper/crmm_dvd_f1x   /mnt/crypta.mm_dvd1.x    xfs  \
     noatime,noauto,nofail,nobarrier    1 5

This means that they are loop mounted.

And /mnt/TMPBig is xfs, too:

bombadillo:~ # mount | grep /mnt/TMPBig
/dev/sda9 on /mnt/TMPBig type xfs (rw,nosuid,nodev,_netdev,noatime,nodiratime)


> Also earlier vanilla kernel didn't support barriers in loop devices. Barrier
> support for loop driver was added to mainline kernel in ~2.6.29. 

(remember that this problem appeared first on oS 10.3)

> So can you
> test with vanilla kernel in 11.2?

Both vanilla and default are 2.6.31.8, so shouldn't default kernel have that support included? Ok, I'll try. Perhaps this weekend, earlier if I have time.
Comment 4 Carlos Robinson 2010-01-29 22:18:14 UTC
Created attachment 339770 [details]
message log showing kernel dumps and such

Tested:  It crashes the same.

I include full messages log of this session (I issued "echo t > /proc/sysrq-trigger") and boot log.
Comment 5 Carlos Robinson 2010-01-29 22:18:43 UTC
Created attachment 339771 [details]
boot log
Comment 6 Nikanth Karthikesan 2010-02-01 06:48:05 UTC
I am guessing that now even vanilla would have this deadlock as barrier support for loop device is upstream now. Barrier support for loop device is nothing more than fsyncs on the backing file.

I was trying to reproduce this locally. This looks more like a dm issue. Thanks for the logs. I will look at them closely.
Comment 7 Nikanth Karthikesan 2010-02-02 08:41:05 UTC
I tried reproducing it locally, but didnt hit it.

Would it be possible for you to provide a dump when this happens? 

http://en.opensuse.org/Kdump
Comment 8 Nikanth Karthikesan 2010-02-03 09:30:40 UTC
I guess this is mostly an xfs over loop on xfs bug. Just that when the loop device supports barrier(issues fsync), xfs deadlocks.

pdflush: writing dirty pages of the encrypted loop files. It holds an exclusive lock of the sb inodes of this encrypted disk.

loop0,loop1,loop2,loop3: all are flushing the encrypted disk. Have issued an fsync on the underlying fs and waiting for them to finish. But it never finishes!

kdmflush: all are waiting for loop threads to complete I/O.

xfsdata/3 is waiting for excl lock of some inode.

mld_hash holds shared/read lock of some inode.

par2 holds exclusive/write locks on some inodes.

There seems to be a deadlock for the inode lock.

If you can upload a dump, it would be easy to find out exact cause. It is quiet difficult to deduce which xfs call we are seeing, whether the xfs calls over the encrypted disk or the underlying disk.
Comment 9 Carlos Robinson 2010-02-04 01:13:22 UTC
A kernel dump? I'm investigating how I could do that. For one thing, I only have an ADSL internet connection, upload speed is 300 bits per second. The memory in that machine is 8 GiB, I don't know how to send all that :-? I hope a kernel dump is not a full memory dump, I recogn haven't fully understood that link you posted. I need a second read or more clues.

On the other hand, the kernel does not crash entirely: it just that all operations reading or writing on one of the xfs filesystems lock, the rest is ok. Meaning, I can write to some of the other partitions. Hopefully that means I would not need a second loaded kernel?

It is 2 AM here, I should be sleeping (I wake up at 6). If you can clarify a bit more what I need to do, I'll attempt it on the weekend.
Comment 10 Nikanth Karthikesan 2010-02-04 04:32:31 UTC
Yes kernel dump would be huge and let us not pursue that route now.

If you can provide more info on the files being opened by mld_hash and par2, that could help my understanding a bit. Expecially whether they are working on files in the underlying xfs mount or the exncrypted xfs mount. Just output of commands when you hit this deadlock,

mount
dmsetup table
lsof

Also whether you can continue to read/write to the underlying xfs partition? Say create a new file on underlying xfs/encrypted xfs after the deadlock?

Thanks for taking the effort to report this issue and providing good amount of info.
Comment 11 Carlos Robinson 2010-02-04 19:52:29 UTC
(In reply to comment #9)
> A kernel dump? I'm investigating how I could do that. For one thing, I only
> have an ADSL internet connection, upload speed is 300 bits per second.

Wow! Not 300 bits/s, but 300 kbit/s. I guess I'm not to be trusted too much when I'm tired O:-)


(In reply to comment #10)
> Yes kernel dump would be huge and let us not pursue that route now.

If after all it is needed, I could burn it to a dvd and mail it.

> If you can provide more info on the files being opened by mld_hash and par2,
> that could help my understanding a bit. 

Video, avi files. Par2 is very cpu intensive, mld_hash not so. Previously (oS 10.3, 11.0, 11.1) I had this problem with simply copying files, any files. Now it is harder to reproduce, but I'll see if it does by simply copying files. That would be easier to track.

> Expecially whether they are working on
> files in the underlying xfs mount or the exncrypted xfs mount. Just output of
> commands when you hit this deadlock,

The working should be on the encrypted xfs filesystem. They reside on files (of exactly dvd size), loop mounted, on an underlaying xfs filesystem. I will get the output of the commands this weekend, right now the system is busy and I can't risk crashing it at this moment. I can also provide the exact procedure I used to create these filesystems (at the moment, it is in a mail in Spanish).

> mount
> dmsetup table
> lsof

This weekend, time permitting :-)


> Also whether you can continue to read/write to the underlying xfs partition?
> Say create a new file on underlying xfs/encrypted xfs after the deadlock?

As far as I remember, no, it locks. I can't even umount them, I have to use:

umount -l mountpoint &

or it locks that xterm or console.

 
> Thanks for taking the effort to report this issue and providing good amount of
> info.

And thank you for trying to solve it, I appreciate it.
Comment 12 Carlos Robinson 2010-02-07 23:54:02 UTC
Mmm. Just to comment that I have had a bad and busy weekend, and couldn't do any testing on this problem. I had a destroyed grub, couldn't restore it, had to reinstall (whole system), destroyed another grub of another partition, had to recover that one... long tale. I'm tired, I got a headache, got re-thickened fingers, I'm not in shape for testing this weekend. And now it is already Monday early hours. Off to bed if headache permits, up early.

Next weekend, I hope. I'm really interested in solving this one.
Comment 13 Carlos Robinson 2010-02-13 23:39:41 UTC
Ok, done. I got a lock simply by copying files, with this script:

#!/bin/bash

echo One
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.one.iso
echo Two
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.two.iso
echo Three
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.three.iso
echo Four
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.four.iso
Echo Five
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.five.iso
echo Six
cp /mnt/Ext/Huge/copia_nimrodel/Grande/cer/distros/openSUSE-11.2-GNOME-LiveCD-i686.iso openSUSE-11.2-GNOME-LiveCD-i686.six.iso

l
df -h


I started four xterms, and on each one I run the above script copying the iso image of 11.2 CD to one fo the four crypto loop device (I have four of DVD size). Ie, four different copy processes, one on each loop.

The output went like this wen it halted:

----------- xterm 1 reads ---------------------------------
@Elessar:/mnt/crypta.mm_dvd1.x> test_copy
One
Two
Three
----------- xterm 2 reads ---------------------------------
@Elessar:/mnt/crypta.mm_dvd2.x> test_copy
One
Two
Three
----------- xterm 3 reads ---------------------------------
@Elessar:/mnt/crypta.mm_dvd3.x> test_copy
One
Two
----------- xterm 4 reads ---------------------------------
@Elessar:/mnt/crypta.mm_dvd4.x> test_copy
One
Two
Three
-----------------------------------------------------------


Elessar:~ # mount
/dev/sdd9 on / type reiserfs (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sdd3 on /boot type ext2 (rw,acl,user_xattr)
/dev/sdd8 on /data/vmware type xfs (rw)
/dev/sdd12 on /data/storage_a type xfs (rw)
/dev/sdf14 on /data/storage_c type xfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
/dev/mapper/crmm_dvd_f1x on /mnt/crypta.mm_dvd1.x type xfs (rw,_netdev,noatime)
/dev/mapper/crmm_dvd_f2x on /mnt/crypta.mm_dvd2.x type xfs (rw,_netdev,noatime)
/dev/mapper/crmm_dvd_f3x on /mnt/crypta.mm_dvd3.x type xfs (rw,_netdev,noatime)
/dev/mapper/crmm_dvd_f4x on /mnt/crypta.mm_dvd4.x type xfs (rw,_netdev,noatime)
/dev/sdb5 on /mnt/Ext/Huge type xfs (rw,nosuid,nodev,_netdev,noatime,nodiratime)
Elessar:~ # 
Elessar:~ # dmsetup table
crmm_dvd_f3x: 0 9177656 crypt aes-cbc-essiv:sha256
0000000000000000000000000000000000000000000000000000000000000000 0 7:2 2056
crmm_dvd_f1x: 0 9177656 crypt aes-cbc-essiv:sha256
0000000000000000000000000000000000000000000000000000000000000000 0 7:0 2056
crmm_dvd_f4x: 0 9177656 crypt aes-cbc-essiv:sha256
0000000000000000000000000000000000000000000000000000000000000000 0 7:3 2056
crmm_dvd_f2x: 0 9177656 crypt aes-cbc-essiv:sha256
0000000000000000000000000000000000000000000000000000000000000000 0 7:1 2056


output of lsof will follow.
Comment 14 Carlos Robinson 2010-02-13 23:40:41 UTC
Created attachment 342356 [details]
requested lsof output
Comment 15 Carlos Robinson 2010-02-13 23:46:14 UTC
Created attachment 342357 [details]
message log showing kernel dumps and such

There are two crashes (crash, reboot, retry, crash, ie lock) logged here. The output includes different sysrequest dumps while I tried. I suppose you can determine what output is what in the file, I'm not familiar with these tools.

Ah! I forgot to mention that this time I'm using the x64 version of oS 11.2


Do you want I log here the procedure I used to create the loop devices?
Comment 16 Nikanth Karthikesan 2010-02-16 12:02:24 UTC
This more and more looks like an xfs issue.

The problem could be that, when xfsdatad is waiting for lock on the loop mounted(nested inside another xfs) xfs file-system and pdflush holds an xfs_ilock in the nested xfs and waits for the log to be flushed.
But pdflush's wait will not complete as xfsdatad is per-cpu and not per-mounted file-system. For pdflush to recover, xfsdatad has to work on the items queued for the underlying xfs filesystem, instead of working/waiting on the lock for the nested file-system. This creates a deadlock.

If this is the case, this could be resolved, either by creating xfsdatad per mounted file-system, or by imposing some kind of ordering/priority such that the underlying filesystem's work is given higher priority

I would also comment on the http://oss.sgi.com/bugzilla/show_bug.cgi?id=860 as well to get some feedback
Comment 17 Nikanth Karthikesan 2010-02-17 11:42:59 UTC
See

http://oss.sgi.com/bugzilla/show_bug.cgi?id=860#c12
http://oss.sgi.com/bugzilla/show_bug.cgi?id=860#c13

-------------------------------------------------

Info copied from the above links:

Comment #12 From  Dave Chinner   2010-02-16 17:16:10 CST   (-) [reply] -------

Nikanth,

Nice analysis. I think you're on the right track, but it looks to me that the
problem is more complex than you've outlined even though the solution is likely
to be the same.

Log IO (which are the barrier IOs in question) in XFS is completed through the
xfslogd, not the xfsdatad, hence the pdflush threads that are blocked in log
forces are not blocked behind the xfsdatad that is waiting on an IO lock.
Because of these two independent IO completion paths in XFS this lock interplay
is not usually a problem - the xfsdatad blocking on an inode lock does not hold
up log IO completion. Hence the barrier IO will complete, the log force
completes, the inode is unlocked when the transaction completes and then the
data IO can complete.

The problem is that the DM loop barrier implementation has to wait for data IO
to complete as well as the barrier. i.e. instead of there being two separate
completion channels that can block independently, DM loop barriers require both
data IO and log (barrier) IO from the underlying filesystem to complete through
the one channel (dm_flush) before being split back into two again. IOWs, XFS
can't block data IO completion on the lower filesystem while the upper
filesystem waits for log IO to complete because dm_flush() needs the data IO to
complete as well. Without barriers, this dependency between the upper and lower
filesystems does not exist, hence the problem not existing previously.

IOWs, this isn't so much a bug but a reflection on the fact that a new feature
(DM loop barriers) has introduced an implicit IO completion order dependency
that never existed before.

As you suggested this could be fixed by adding per-filesystem xfsdatad threads,
but that is not an option because XFS is used on very large systems (e.g. 2048p
machines) and they often have tens of XFS filesystems mounted. i.e. it is not
feasible to have XFS create tens of thousands of threads on such machines.
Prioritising upper vs lower filesystem IO completion is not really practical,
either, because each filesystem has no context of what dependencies it might
have on other filesystems.

However, we can avoid blocking the xfsdatad on inode locks by using try-lock
semantics and requeuing the IO completion if we'd block on the inode lock. This
should avoid completion order dependent deadlocks like this one. I'll attach a 
patch in a few minutes after some sanity testing.

------- Comment #13 From Dave Chinner 2010-02-16 18:21:23 CST (-) [reply] -------

Created an attachment (id=288) [details]
non-blocking file size updates during io completion

This patch should fix the problem being seen with barrier flushes. Can you
please try it?
Comment 18 Nikanth Karthikesan 2010-02-17 11:45:58 UTC
Created attachment 342869 [details]
xfs - non-blocking file size updates during io completion backport to openSUSE11.2

Backport of Dave chinners patch http://oss.sgi.com/bugzilla/attachment.cgi?id=288
Also contains mainline commit 5ec4fabb02fcb5b4a4154a27e4299af5aa0f87ac
http://www.suse.com/~knikanth/bnc568319-17FEB2010/xfs-end-io-cleanup
to make the backport apply easily.

A openSUSE 11.2 kernel built with these 2 patches can be downloaded from
http://www.suse.com/~knikanth/bnc568319-17FEB2010/
Comment 19 Nikanth Karthikesan 2010-02-17 11:47:26 UTC
Carlos,

Please test with the kernel available at
http://www.suse.com/~knikanth/bnc568319-17FEB2010/ 
to see whether the deadlock is still reproducible.

If not, we can ask Dave Chinner to push this patch to mainline.
Comment 20 Carlos Robinson 2010-02-22 19:43:11 UTC
(I'm waiting for a time slot in which to test this - couldn't be this weekend, sorry.)
Comment 21 Carlos Robinson 2010-02-27 21:02:33 UTC
Good and bad news.

The bad news: the test destroyed completely the system, I can no longer boot it, new or old kernel. The fstab file dissapeared, boot stops with a message about tmpfs on /dev. I had to fsck --rebuild tree and lost many things.

It first crashed on reboot after the test. It did not reboot, halted on "running /etc/init.d/halt.local", blinking the [A] led on keyboard.

I'm running the x86 test partition now, that was the x64 that got destroyed. I will have to reinstall it from scratch, and I don't have time. Hopefully I can keep the home at least.

It will take some convincing to ask me to test anything again.


The good news: I had four xterms running copying, at the same time, the dvd iso image of 11.2 6 times on each (the script of a previous comment), and the sequence run 30 times, without a hitch.

So yes, that part is OK.

I have the message log with kernel dump saved if you want it.
Comment 22 Nikanth Karthikesan 2010-03-01 04:05:01 UTC
Ah.. I am very very sorry. I hope you have not lost any unrecoverable data. Thanks a lot for testing this. At least we have identified the cause of the deadlock. May be the fix or my back-porting has some bug, that lead to the crash. We should report this to sgi and work for a proper fix to xfs. Please attach the log file.

Once again sorry for the trouble.
Comment 23 Carlos Robinson 2010-03-01 21:42:21 UTC
Created attachment 345769 [details]
message log showing kernel dumps and such, xfs success, reiserfs crash.

(In reply to comment #22)
> Ah.. I am very very sorry. I hope you have not lost any unrecoverable data.
> Thanks a lot for testing this. At least we have identified the cause of the
> deadlock. May be the fix or my back-porting has some bug, that lead to the
> crash. We should report this to sgi and work for a proper fix to xfs. Please
> attach the log file.
> 
> Once again sorry for the trouble.

I haven't lost valuable data, because that was one of my 3 test partitions (the final setup is not installed yet). A day lost reinstalling and reconfiguring, yes. As I was testing things, I hadn't made a backup. 

I attach the log. The last line is not logged, my halt.local file has this code:

MESSAGE=`uptime`
# syslog may not be working
DATE=`date  --rfc-3339=seconds`
echo "$DATE - Halting the system now - uptime:  $MESSAGE" >> /var/log/messages


and that line is missing from the log.
Comment 24 Carlos Robinson 2010-03-01 21:43:26 UTC
Created attachment 345771 [details]
boot log of previous session
Comment 25 Nikanth Karthikesan 2010-03-05 08:30:54 UTC
The fix worked. And it has been queued for upstream inclusion.
See http://oss.sgi.com/archives/xfs/2010-02/msg00280.html
http://oss.sgi.com/cgi-bin/mesg.cgi?a=xfs&i=1266384989-28928-1-git-send-email-david%40fromorbit.com

Final version about to be included in mainline is here:
http://oss.sgi.com/archives/xfs/2010-02/msg00340.html
Comment 26 Nikanth Karthikesan 2010-03-05 09:13:46 UTC
commit 77d7a0c2eeb285c9069e15396703d0cb9690ac50 in the xfs-dev tree has been queued for .33

I have committed the patch to openSUSE-11.2 and should be available in the next maintenance update.
Comment 29 Carlos Robinson 2010-03-07 13:02:26 UTC
(In reply to comment #26)
> commit 77d7a0c2eeb285c9069e15396703d0cb9690ac50 in the xfs-dev tree has been
> queued for .33
> 
> I have committed the patch to openSUSE-11.2 and should be available in the next
> maintenance update.

Thankyou!

However, before testing that one, I will have to create a full backup...
Comment 30 Swamp Workflow Management 2010-03-22 13:52:36 UTC
Update released for: kernel-debug, kernel-debug-base, kernel-debug-base-debuginfo, kernel-debug-debuginfo, kernel-debug-debugsource, kernel-debug-devel, kernel-debug-devel-debuginfo, kernel-default, kernel-default-base, kernel-default-base-debuginfo, kernel-default-debuginfo, kernel-default-debugsource, kernel-default-devel, kernel-default-devel-debuginfo, kernel-desktop, kernel-desktop-base, kernel-desktop-base-debuginfo, kernel-desktop-debuginfo, kernel-desktop-debugsource, kernel-desktop-devel, kernel-desktop-devel-debuginfo, kernel-pae, kernel-pae-base, kernel-pae-base-debuginfo, kernel-pae-debuginfo, kernel-pae-debugsource, kernel-pae-devel, kernel-pae-devel-debuginfo, kernel-source, kernel-source-vanilla, kernel-syms, kernel-trace, kernel-trace-base, kernel-trace-base-debuginfo, kernel-trace-debuginfo, kernel-trace-debugsource, kernel-trace-devel, kernel-trace-devel-debuginfo, kernel-vanilla, kernel-vanilla-base, kernel-vanilla-base-debuginfo, kernel-vanilla-debuginfo, kernel-vanilla-debugsource, kernel-vanilla-devel, kernel-vanilla-devel-debuginfo, kernel-xen, kernel-xen-base, kernel-xen-base-debuginfo, kernel-xen-debuginfo, kernel-xen-debugsource, kernel-xen-devel, kernel-xen-devel-debuginfo, preload-kmp-default, preload-kmp-desktop
Products:
openSUSE 11.2 (debug, i586, x86_64)
Comment 31 Nikanth Karthikesan 2010-04-08 10:44:54 UTC
*** Bug 567912 has been marked as a duplicate of this bug. ***