remarkable-linux/fs
Dave Chinner 7d2ffe80aa xfs: fix split buffer vector log recovery support
A long time ago in a galaxy far away....

.. the was a commit made to fix some ilinux specific "fragmented
buffer" log recovery problem:

http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603

That problem occurred when a contiguous dirty region of a buffer was
split across across two pages of an unmapped buffer. It's been a
long time since that has been done in XFS, and the changes to log
the entire inode buffers for CRC enabled filesystems has
re-introduced that corner case.

And, of course, it turns out that the above commit didn't actually
fix anything - it just ensured that log recovery is guaranteed to
fail when this situation occurs. And now for the gory details.

xfstest xfs/085 is failing with this assert:

XFS (vdb): bad number of regions (0) in inode log format
XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583

Largely undocumented factoid #1: Log recovery depends on all log
buffer format items starting with this format:

struct foo_log_format {
	__uint16_t	type;
	__uint16_t	size;
	....

As recoery uses the size field and assumptions about 32 bit
alignment in decoding format items.  So don't pay much attention to
the fact log recovery thinks that it decoding an inode log format
item - it just uses them to determine what the size of the item is.

But why would it see a log format item with a zero size? Well,
luckily enough xfs_logprint uses the same code and gives the same
error, so with a bit of gdb magic, it turns out that it isn't a log
format that is being decoded. What logprint tells us is this:

Oper (130): tid: a0375e1a  len: 28  clientid: TRANS  flags: none
BUF:  #regs: 2   start blkno: 144 (0x90)  len: 16  bmap size: 2  flags: 0x4000
Oper (131): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
BUF DATA
----------------------------------------------------------------------------
Oper (132): tid: a0375e1a  len: 4096  clientid: TRANS  flags: none
xfs_logprint: unknown log operation type (4e49)
**********************************************************************
* ERROR: data block=2                                                 *
**********************************************************************

That we've got a buffer format item (oper 130) that has two regions;
the format item itself and one dirty region. The subsequent region
after the buffer format item and it's data is them what we are
tripping over, and the first bytes of it at an inode magic number.
Not a log opheader like there is supposed to be.

That means there's a problem with the buffer format item. It's dirty
data region is 4096 bytes, and it contains - you guessed it -
initialised inodes. But inode buffers are 8k, not 4k, and we log
them in their entirety. So something is wrong here. The buffer
format item contains:

(gdb) p /x *(struct xfs_buf_log_format *)in_f
$22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000,
       blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2,
       blf_data_map = {0xffffffff, 0xffffffff, .... }}

Two regions, and a signle dirty contiguous region of 64 bits.  64 *
128 = 8k, so this should be followed by a single 8k region of data.
And the blf_flags tell us that the type of buffer is a
XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have
the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation
buffer. So, it should be followed by 8k of inode data.

But we know that the next region has a header of:

(gdb) p /x *ohead
$25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69,
       oh_flags = 0x0, oh_res2 = 0x0}

and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not
long enough to hold all the logged data. There must be another
region. There is - there's a following opheader for another 4k of
data that contains the other half of the inode cluster data - the
one we assert fail on because it's not a log format header.

So why is the second part of the data not being accounted to the
correct buffer log format structure? It took a little more work with
gdb to work out that the buffer log format structure was both
expecting it to be there but hadn't accounted for it. It was at that
point I went to the kernel code, as clearly this wasn't a bug in
xfs_logprint and the kernel was writing bad stuff to the log.

First port of call was the buffer item formatting code, and the
discontiguous memory/contiguous dirty region handling code
immediately stood out. I've wondered for a long time why the code
had this comment in it:

                        vecp->i_addr = xfs_buf_offset(bp, buffer_offset);
                        vecp->i_len = nbits * XFS_BLF_CHUNK;
                        vecp->i_type = XLOG_REG_TYPE_BCHUNK;
/*
 * You would think we need to bump the nvecs here too, but we do not
 * this number is used by recovery, and it gets confused by the boundary
 * split here
 *                      nvecs++;
 */
                        vecp++;

And it didn't account for the extra vector pointer. The case being
handled here is that a contiguous dirty region lies across a
boundary that cannot be memcpy()d across, and so has to be split
into two separate operations for xlog_write() to perform.

What this code assumes is that what is written to the log is two
consecutive blocks of data that are accounted in the buf log format
item as the same contiguous dirty region and so will get decoded as
such by the log recovery code.

The thing is, xlog_write() knows nothing about this, and so just
does it's normal thing of adding an opheader for each vector. That
means the 8k region gets written to the log as two separate regions
of 4k each, but because nvecs has not been incremented, the buf log
format item accounts for only one of them.

Hence when we come to log recovery, we process the first 4k region
and then expect to come across a new item that starts with a log
format structure of some kind that tells us whenteh next data is
going to be. Instead, we hit raw buffer data and things go bad real
quick.

So, the commit from 2002 that commented out nvecs++ is just plain
wrong. It breaks log recovery completely, and it would seem the only
reason this hasn't been since then is that we don't log large
contigous regions of multi-page unmapped buffers very often. Never
would be a closer estimate, at least until the CRC code came along....

So, lets fix that by restoring the nvecs accounting for the extra
region when we hit this case.....

.... and there's the problemin log recovery it is apparently working
around:

XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135

Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty
regions being broken up into multiple regions by the log formatting
code. That's an easy fix, though - if the number of contiguous dirty
bits exceeds the length of the region being copied out of the log,
only account for the number of dirty bits that region covers, and
then loop again and copy more from the next region. It's a 2 line
fix.

Now xfstests xfs/085 passes, we have one less piece of mystery
code, and one more important piece of knowledge about how to
structure new log format items..

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 709da6a61a)
2013-05-30 17:18:01 -05:00
..
9p aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
adfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
affs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
afs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
autofs4 autofs - remove autofs dentry mount check 2013-05-06 13:06:59 -07:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-04-30 09:36:50 -07:00
bfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs 2013-05-09 13:07:40 -07:00
cachefiles lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
ceph aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
cifs cifs: small variable name cleanup 2013-05-04 22:18:10 -05:00
coda lift sb_start_write() out of ->write() 2013-04-09 14:12:56 -04:00
configfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
cramfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
debugfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
devpts fs: Limit sys_mount to only request filesystem modules (Part 2). 2013-03-07 01:08:55 -08:00
dlm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2013-05-01 14:08:52 -07:00
ecryptfs Improve performance when AES-NI (and most likely other crypto accelerators) is 2013-05-10 09:20:01 -07:00
efivarfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
efs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
exofs block: Add bio_for_each_segment_all() 2013-03-23 14:26:28 -07:00
exportfs hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
ext2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ext3 Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
ext4 Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
f2fs f2fs updates for v3.10 2013-05-08 15:11:48 -07:00
fat aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
freevxfs fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
fscache fs/fscache/stats.c: fix memory leak 2013-04-29 15:54:27 -07:00
fuse Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
gfs2 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
hfs Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
hfsplus aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
hostfs hostfs: use kmalloc instead of kzalloc 2013-05-04 15:48:45 -04:00
hpfs hpfs: move setting hpfs-private i_dirty to ->write_end() 2013-04-09 14:12:55 -04:00
hppfs hppfs: get rid of ->fsync() 2013-04-29 15:41:42 -04:00
hugetlbfs hugetlbfs: fix mmap failure in unaligned size request 2013-05-07 18:38:27 -07:00
isofs fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
jbd Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-05-03 09:56:25 -07:00
jbd2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
jffs2 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
jfs Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
lockd LOCKD: Ensure that nlmclnt_block resets block->b_status after a server reboot 2013-04-21 18:08:42 -04:00
logfs block: Remove bi_idx references 2013-03-23 14:15:31 -07:00
minix fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
ncpfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
nfs More NFS client bugfixes for 3.10 2013-05-09 10:24:54 -07:00
nfs_common nfs_common: Update the translation between nfsv3 acls linux posix acls 2013-02-13 06:15:14 -08:00
nfsd Merge branch 'for-3.10' of git://linux-nfs.org/~bfields/linux 2013-05-10 09:28:55 -07:00
nilfs2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
nls
notify unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE 2013-05-09 13:46:38 -04:00
ntfs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ocfs2 aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
omfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
openpromfs fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
proc Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux 2013-05-07 08:42:20 -07:00
pstore Couple of pstore cleanups 2013-05-09 16:42:10 -07:00
qnx4 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
qnx6 fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
quota quota: add missing use of dq_data_lock in __dquot_initialize 2013-03-11 22:05:56 +01:00
ramfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-02-26 20:16:07 -08:00
reiserfs Merge branch 'akpm' (incoming from Andrew) 2013-05-07 20:49:51 -07:00
romfs romfs: fix nommu map length to keep inside filesystem 2013-04-29 09:17:57 +10:00
squashfs fs: Limit sys_mount to only request filesystem modules. (Part 3) 2013-03-11 07:09:48 -07:00
sysfs sysfs: check if one entry has been removed before freeing 2013-04-05 15:35:52 -07:00
sysv fs: Readd the fs module aliases. 2013-03-12 18:55:21 -07:00
ubifs aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
udf aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
ufs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-04-30 09:36:50 -07:00
xfs xfs: fix split buffer vector log recovery support 2013-05-30 17:18:01 -05:00
aio.c aio: kill ki_retry 2013-05-07 19:46:02 -07:00
anon_inodes.c get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero 2013-02-26 02:46:11 -05:00
attr.c
bad_inode.c
binfmt_aout.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
binfmt_elf.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_elf_fdpic.c Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc 2013-05-02 10:16:16 -07:00
binfmt_em86.c
binfmt_flat.c new helper: read_code() 2013-04-29 15:40:23 -04:00
binfmt_misc.c binfmt_misc: reuse string_unescape_inplace() 2013-04-30 17:04:03 -07:00
binfmt_script.c exec: do not leave bprm->interp on stack 2012-12-20 17:40:19 -08:00
binfmt_som.c
bio-integrity.c bio-integrity: Add explicit field for owner of bip_buf 2013-03-23 14:26:34 -07:00
bio.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
block_dev.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
buffer.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
char_dev.c
compat.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
compat_binfmt_elf.c
compat_ioctl.c Removed unused typedef to avoid "unused local typedef" warnings. 2013-05-04 15:03:05 -04:00
coredump.c do_coredump(): don't wait for thaw if coredump has already been interrupted 2013-05-04 14:45:54 -04:00
coredump.h
dcache.c vfs: use list_move instead of list_del/list_add 2013-05-04 15:43:02 -04:00
dcookies.c consolidate compat lookup_dcookie() 2013-03-03 23:00:23 -05:00
direct-io.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
drop_caches.c
eventfd.c
eventpoll.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal 2013-05-01 07:21:43 -07:00
exec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
fcntl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
fhandle.c Merge branch 'for-3.8' of git://linux-nfs.org/~bfields/linux 2012-12-20 14:04:11 -08:00
file.c don't bother with deferred freeing of fdtables 2013-05-01 17:31:42 -04:00
file_table.c cache the value of file_inode() in struct file 2013-03-01 19:48:30 -05:00
filesystems.c fs: Limit sys_mount to only request filesystem modules. 2013-03-03 19:36:31 -08:00
fs-writeback.c Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block 2013-05-08 10:13:35 -07:00
fs_struct.c constify path_get/path_put and fs_struct.c stuff 2013-03-01 23:51:07 -05:00
generic_acl.c
inode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
internal.h pipe: fold file_operations instances in one 2013-04-09 14:12:58 -04:00
ioctl.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
ioprio.c
Kconfig efivarfs: Move to fs/efivarfs 2013-04-17 13:25:09 +01:00
Kconfig.binfmt fs: make binfmt support for #! scripts modular and removable 2013-04-30 17:04:04 -07:00
libfs.c vfs: drop vmtruncate 2012-12-20 18:46:29 -05:00
locks.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
mbcache.c
mount.h get rid of full-hash scan on detaching vfsmounts 2013-04-09 14:12:52 -04:00
mpage.c
namei.c Merge git://git.infradead.org/users/eparis/audit 2013-05-11 14:29:11 -07:00
namespace.c create_mnt_ns: unidiomatic use of list_add() 2013-05-04 15:18:53 -04:00
no-block.c
open.c make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect 2013-03-03 22:58:33 -05:00
pipe.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
pnode.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
pnode.h Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
posix_acl.c
proc_namespace.c
read_write.c aio: don't include aio.h in sched.h 2013-05-07 20:16:25 -07:00
readdir.c new helper: file_inode(file) 2013-02-22 23:31:31 -05:00
select.c sched/rt: Move rt specific bits into new header file 2013-02-07 20:51:08 +01:00
seq_file.c new helper: single_open_size() 2013-04-09 14:13:29 -04:00
signalfd.c switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE 2013-03-03 22:58:46 -05:00
splice.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-05-01 17:51:54 -07:00
stack.c
stat.c switch vfs_getattr() to struct path 2013-02-26 02:46:08 -05:00
statfs.c vfs: fix user_statfs to retry once on ESTALE errors 2012-12-20 18:50:07 -05:00
super.c hlist: drop the node parameter from iterators 2013-02-27 19:10:24 -08:00
sync.c teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long 2013-03-03 22:46:22 -05:00
timerfd.c compat: restore timerfd settime and gettime compat syscalls 2013-03-02 09:35:13 -05:00
utimes.c vfs: allow utimensat() calls to retry once on an ESTALE error 2012-12-20 18:50:08 -05:00
xattr.c vfs: make lremovexattr retry once on ESTALE error 2012-12-20 18:50:11 -05:00
xattr_acl.c