Commit graph

32109 commits

Author SHA1 Message Date
Al Viro 568f8f5ec5 [readdir] convert hpfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:51 +04:00
Al Viro cd62cdae0b reiserfs: switch reiserfs_readdir_dentry to inode
... and clean the callers up a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:51 +04:00
Al Viro 99ce4169a9 reiserfs: is_privroot_deh() needs only directory inode, actually
... and that - only to get the superblock.  Privroot is a directory
and we don't allow hardlinks to those...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:50 +04:00
Al Viro 4acf381e1b [readdir] convert reiserfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:49 +04:00
Al Viro 956ce2083c [readdir] convert ntfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:48 +04:00
Al Viro bfee7169c0 [readdir] convert isofs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:47 +04:00
Al Viro 0312fa7ccd [readdir] convert jffs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:47 +04:00
Al Viro 6f7f231e7b [readdir] convert f2fs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:46 +04:00
Al Viro 8f29843a51 [readdir] convert 9p
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:45 +04:00
Al Viro 0edf977d2a [readdir] convert affs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:44 +04:00
Al Viro 2638ffbac9 [readdir] convert adfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:44 +04:00
Al Viro 46d0733801 [readdir] convert logfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:43 +04:00
Al Viro 070a0ebf42 [readdir] convert jfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:42 +04:00
Al Viro 77acfa29e1 [readdir] convert ceph
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:41 +04:00
Al Viro 23db862060 [readdir] convert nfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:40 +04:00
Al Viro 725bebb278 [readdir] convert ext4
and trim the living hell out bogosities in inline dir case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:40 +04:00
Al Viro 4deb398a1b [readdir] convert qnx6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:39 +04:00
Al Viro 663f4deca7 [readdir] convert qnx4
... and use strnlen() instead of strlen() - it's done on untrusted data,
after all.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:38 +04:00
Al Viro 9fd4d05949 [readdir] convert omfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:37 +04:00
Al Viro 1616abe841 [readdir] convert nilfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:36 +04:00
Al Viro d55fea8ddb [readdir] convert sysfs
get rid of the kludges in sysfs_readdir()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:36 +04:00
Al Viro d81a8ef598 [readdir] convert gfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:35 +04:00
Al Viro 75811d4fda [readdir] convert exofs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:34 +04:00
Al Viro 81b9f66e6b [readdir] convert bfs
... and get rid of that ridiculous mutex in bfs_readdir()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:33 +04:00
Al Viro f0c3b5093a [readdir] convert procfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
Al Viro 68c6147113 [readdir] convert openpromfs
what the hell is op_mutex for, BTW?

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
Al Viro 7aa123a0dc [readdir] convert efs
* sanity checks belong before risky operation, not after it
* don't quit as soon as we'd found an entry

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:31 +04:00
Al Viro 52018855e6 [readdir] convert configfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:30 +04:00
Al Viro 3903b38ce7 [readdir] convert romfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:29 +04:00
Al Viro 5f6039ce69 [readdir] convert squashfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:28 +04:00
Al Viro 01122e0688 [readdir] convert ubifs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:25 +04:00
Al Viro 5add2ee198 [readdir] convert udf
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:50 +04:00
Al Viro 5ded75ec4c [readdir] convert ext3
new helper: dir_relax(inode).  Call when you are in location that will
_not_ be invalidated by directory modifications (block boundary, in case
of ext*).  Returns whether the directory has survived (dropping i_mutex
allows rmdir to kill the sucker; if it returns false to us, ->iterate()
is obviously done)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:49 +04:00
Al Viro 5f99f4e79a [readdir] switch dcache_readdir() users to ->iterate()
new helpers - dir_emit_dot(file, ctx, dentry), dir_emit_dotdot(file, ctx),
dir_emit_dots(file, ctx).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:48 +04:00
Al Viro 80886298c0 [readdir] simple local unixlike: switch to ->iterate()
ext2, ufs, minix, sysv

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:47 +04:00
Al Viro bb6f619b3a [readdir] introduce ->iterate(), ctx->pos, dir_emit()
New method - ->iterate(file, ctx).  That's the replacement for ->readdir();
it takes callback from ctx->actor, uses ctx->pos instead of file->f_pos and
calls dir_emit(ctx, ...) instead of filldir(data, ...).  It does *not*
update file->f_pos (or look at it, for that matter); iterate_dir() does the
update.

Note that dir_emit() takes the offset from ctx->pos (and eventually
filldir_t will lose that argument).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:47 +04:00
Al Viro 5c0ba4e076 [readdir] introduce iterate_dir() and dir_context
iterate_dir(): new helper, replacing vfs_readdir().

struct dir_context: contains the readdir callback (and will get more stuff
in it), embedded into whatever data that callback wants to deal with;
eventually, we'll be passing it to ->readdir() replacement instead of
(data,filldir) pair.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:46 +04:00
Al Viro e06aeb5716 compat.c: LOOP_CLR_FD is taken care of in loop.c itself...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:46:44 +04:00
Artem Bityutskiy 605c912bb8 UBIFS: fix a horrid bug
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no
mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are
in the middle of 'ubifs_readdir()'.

This means that 'file->private_data' can be freed while 'ubifs_readdir()' uses
it, and this is a very bad bug: not only 'ubifs_readdir()' can return garbage,
but this may corrupt memory and lead to all kinds of problems like crashes an
security holes.

This patch fixes the problem by using the 'file->f_version' field, which
'->llseek()' always unconditionally sets to zero. We set it to 1 in
'ubifs_readdir()' and whenever we detect that it became 0, we know there was a
seek and it is time to clear the state saved in 'file->private_data'.

I tested this patch by writing a user-space program which runds readdir and
seek in parallell. I could easily crash the kernel without these patches, but
could not crash it with these patches.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:45:37 +04:00
Artem Bityutskiy 33f1a63ae8 UBIFS: prepare to fix a horrid bug
Al Viro pointed me to the fact that '->readdir()' and '->llseek()' have no
mutual exclusion, which means the 'ubifs_dir_llseek()' can be run while we are
in the middle of 'ubifs_readdir()'.

First of all, this means that 'file->private_data' can be freed while
'ubifs_readdir()' uses it.  But this particular patch does not fix the problem.
This patch is only a preparation, and the fix will follow next.

In this patch we make 'ubifs_readdir()' stop using 'file->f_pos' directly,
because 'file->f_pos' can be changed by '->llseek()' at any point. This may
lead 'ubifs_readdir()' to returning inconsistent data: directory entry names
may correspond to incorrect file positions.

So here we introduce a local variable 'pos', read 'file->f_pose' once at very
the beginning, and then stick to 'pos'. The result of this is that when
'ubifs_dir_llseek()' changes 'file->f_pos' while we are in the middle of
'ubifs_readdir()', the latter "wins".

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:45:37 +04:00
Bob Peterson a01aedfe21 GFS2: Reserve journal space for quota change in do_grow
If a GFS2 file system is mounted with quotas and a file is grown
in such a way that its free blocks for the allocation are represented
in a secondary bitmap, GFS2 ran out of blocks in the transaction.
That resulted in "fatal: assertion "tr->tr_num_buf <= tr->tr_blocks".
This patch reserves extra blocks for the quota change so the
transaction has enough space.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-27 18:16:27 +01:00
Bart Van Assche cfa805f6f1 dlm: Avoid LVB truncation
For lockspaces with an LVB length above 64 bytes, avoid truncating
the LVB while exchanging it with another node in the cluster.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-26 11:38:02 -05:00
Stephane Eranian 2976b10f05 perf: Disable monitoring on setuid processes for regular users
There was a a bug in setup_new_exec(), whereby
the test to disabled perf monitoring was not
correct because the new credentials for the
process were not yet committed and therefore
the get_dumpable() test was never firing.

The patch fixes the problem by moving the
perf_event test until after the credentials
are committed.

Signed-off-by: Stephane Eranian <eranian@google.com>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-26 11:40:18 +02:00
Linus Torvalds 5dbc746960 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse bugfix from Miklos Szeredi:
 "This fixes a race between fallocate() and truncate()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: hold i_mutex in fuse_file_fallocate()
2013-06-25 09:06:04 -10:00
David Teigland 696b3d8460 dlm: log an error for unmanaged lockspaces
Log an error message if the dlm user daemon exits
before all the lockspaces have been removed.

Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-25 12:53:20 -05:00
Zhao Hongjiang ad917e7f82 dlm: config: using strlcpy instead of strncpy
for NUL terminated string, need alway set '\0' in the end.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-25 12:53:06 -05:00
Greg Kroah-Hartman b5aef682e0 Merge 3.10-rc7 into driver-core-next
We want the firmware merge fixes, and other bits, in here now.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-24 15:14:43 -07:00
Randy Dunlap acdb37c361 fs: fix new splice.c kernel-doc warning
Fix new kernel-doc warning in fs/splice.c:

  Warning(fs/splice.c:1298): No description found for parameter 'opos'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-06-23 16:19:56 -10:00
Al Viro 7995bd2871 splice: don't pass the address of ->f_pos to methods
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-20 19:02:45 +04:00
Abhijith Das 6a98c333ed GFS2: Fix fstrim boundary conditions
This patch correctly distinguishes two boundary conditions:

1. When the given range is entire within the unaccounted space between
   two rgrps, and
2. The range begins beyond the end of the filesystem

Also fix the unit of the returned value r.len (total trimming) to be in bytes 
instead of the (incorrect) 512 byte blocks

With this patch, GFS2 passes multiple iterations of all the relevant xfstests
(251, 260, 288) with different fs block sizes.

Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 21:41:26 +01:00
Benjamin Marzinski 2b12eea656 GFS2: fix warning message
This patch fixes a warning message introduced in the recent
"GFS2: aggressively issue revokes in gfs2_log_flush" patch.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 21:29:19 +01:00
Wei Yongjun 06452eb053 dlm: remove duplicated include from lowcomms.c
Remove duplicated include.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-19 09:52:09 -05:00
David Howells dcfae32f89 FS-Cache: Don't use spin_is_locked() in assertions
Under certain circumstances, spin_is_locked() is hardwired to 0 - even when the
code would normally be in a locked section where it should return 1.  This
means it cannot be used for an assertion that checks that a spinlock is locked.

Remove such usages from FS-Cache.

The following oops might otherwise be observed:

FS-Cache: Assertion failed
BUG: failure at fs/fscache/operation.c:270/fscache_start_operations()!
Kernel panic - not syncing: BUG!
CPU: 0 PID: 10 Comm: kworker/u2:1 Not tainted 3.10.0-rc1-00133-ge7ebb75 #2
Workqueue: fscache_operation fscache_op_work_func [fscache]
7f091c48 603c8947 7f090000 7f9b1361 7f25f080 00000001 7f26d440 7f091c90
60299eb8 7f091d90 602951c5 7f26d440 3000000008 7f091da0 7f091cc0 7f091cd0
00000007 00000007 00000006 7f091ae0 00000010 0000010e 7f9af330 7f091ae0
Call Trace:
7f091c88: [<60299eb8>] dump_stack+0x17/0x19
7f091c98: [<602951c5>] panic+0xf4/0x1e9
7f091d38: [<6002b10e>] set_signals+0x1e/0x40
7f091d58: [<6005b89e>] __wake_up+0x4e/0x70
7f091d98: [<7f9aa003>] fscache_start_operations+0x43/0x50 [fscache]
7f091da8: [<7f9aa1e3>] fscache_op_complete+0x1d3/0x220 [fscache]
7f091db8: [<60082985>] unlock_page+0x55/0x60
7f091de8: [<7fb25bb0>] cachefiles_read_copier+0x250/0x330 [cachefiles]
7f091e58: [<7f9ab03c>] fscache_op_work_func+0xac/0x120 [fscache]
7f091e88: [<6004d5b0>] process_one_work+0x250/0x3a0
7f091ef8: [<6004edc7>] worker_thread+0x177/0x2a0
7f091f38: [<6004ec50>] worker_thread+0x0/0x2a0
7f091f58: [<60054418>] kthread+0xd8/0xe0
7f091f68: [<6005bb27>] finish_task_switch.isra.64+0x37/0xa0
7f091fd8: [<600185cf>] new_thread_handler+0x8f/0xb0

Reported-by: Milosz Tanski <milosz@adfin.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
2013-06-19 14:16:47 +01:00
David Howells 1bb4b7f98f FS-Cache: The retrieval remaining-pages counter needs to be atomic_t
struct fscache_retrieval contains a count of the number of pages that still
need some processing (n_pages).  This is decremented as the pages are
processed.

However, this needs to be atomic as fscache_retrieval_complete() (I think) just
occasionally may be called from cachefiles_read_backing_file() and
cachefiles_read_copier() simultaneously.

This happens when an fscache_read_or_alloc_pages() request containing a lot of
pages (say a couple of hundred) is being processed.  The read on each backing
page is dispatched individually because we need to insert a monitor into the
waitqueue to catch when the read completes.  However, under low-memory
conditions, we might be forced to wait in the allocator - and this gives the
I/O on the backing page a chance to complete first.

When the I/O completes, fscache_enqueue_retrieval() chucks the retrieval onto
the workqueue without waiting for the operation to finish the initial I/O
dispatch (we want to release any pages we can as soon as we can), thus both can
end up running simultaneously and potentially attempting to partially complete
the retrieval simultaneously (ENOMEM may occur, backing pages may already be in
the page cache).

This was demonstrated by parallelling the non-atomic counter with an atomic
counter and printing both of them when the assertion fails.  At this point, the
atomic counter has reached zero, but the non-atomic counter has not.

To fix this, make the counter an atomic_t.

This results in the following bug appearing

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:421!

or

	FS-Cache: Assertion failed
	3 == 5 is false
	------------[ cut here ]------------
	kernel BUG at fs/fscache/operation.c:414!

With a backtrace like the following:

RIP: 0010:[<ffffffffa0211b1d>] fscache_put_operation+0x1ad/0x240 [fscache]
Call Trace:
 [<ffffffffa0213185>] fscache_retrieval_work+0x55/0x270 [fscache]
 [<ffffffffa0213130>] ? fscache_retrieval_work+0x0/0x270 [fscache]
 [<ffffffff81090b10>] worker_thread+0x170/0x2a0
 [<ffffffff81096d10>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810909a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff81096966>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968d0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-and-tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Haicheng Li 2144bc78d4 cachefiles: remove unused macro list_to_page()
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 1362729b16 FS-Cache: Simplify cookie retention for fscache_objects, fixing oops
Simplify the way fscache cache objects retain their cookie.  The way I
implemented the cookie storage handling made synchronisation a pain (ie. the
object state machine can't rely on the cookie actually still being there).

Instead of the the object being detached from the cookie and the cookie being
freed in __fscache_relinquish_cookie(), we defer both operations:

 (*) The detachment of the object from the list in the cookie now takes place
     in fscache_drop_object() and is thus governed by the object state machine
     (fscache_detach_from_cookie() has been removed).

 (*) The release of the cookie is now in fscache_object_destroy() - which is
     called by the cache backend just before it frees the object.

This means that the fscache_cookie struct is now available to the cache all the
way through from ->alloc_object() to ->drop_object() and ->put_object() -
meaning that it's no longer necessary to take object->lock to guarantee access.

However, __fscache_relinquish_cookie() doesn't wait for the object to go all
the way through to destruction before letting the netfs proceed.  That would
massively slow down the netfs.  Since __fscache_relinquish_cookie() leaves the
cookie around, in must therefore break all attachments to the netfs - which
includes ->def, ->netfs_data and any outstanding page read/writes.

To handle this, struct fscache_cookie now has an n_active counter:

 (1) This starts off initialised to 1.

 (2) Any time the cache needs to get at the netfs data, it calls
     fscache_use_cookie() to increment it - if it is not zero.  If it was zero,
     then access is not permitted.

 (3) When the cache has finished with the data, it calls fscache_unuse_cookie()
     to decrement it.  This does a wake-up on it if it reaches 0.

 (4) __fscache_relinquish_cookie() decrements n_active and then waits for it to
     reach 0.  The initialisation to 1 in step (1) ensures that we only get
     wake ups when we're trying to get rid of the cookie.

This leaves __fscache_relinquish_cookie() a lot simpler.


***
This fixes a problem in the current code whereby if fscache_invalidate() is
followed sufficiently quickly by fscache_relinquish_cookie() then it is
possible for __fscache_relinquish_cookie() to have detached the cookie from the
object and cleared the pointer before a thread is dispatched to process the
invalidation state in the object state machine.

Since the pending write clearance was deferred to the invalidation state to
make it asynchronous, we need to either wait in relinquishment for the stores
tree to be cleared in the invalidation state or we need to handle the clearance
in relinquishment.

Further, if the relinquishment code does clear the tree, then the invalidation
state need to make the clearance contingent on still having the cookie to hand
(since that's where the tree is rooted) and we have to prevent the cookie from
disappearing for the duration.

This can lead to an oops like the following:

BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
...
RIP: 0010:[<ffffffff8151023e>] _spin_lock+0xe/0x30
...
CR2: 000000000000000c ...
...
Process kslowd002 (...)
....
Call Trace:
 [<ffffffffa01c3278>] fscache_invalidate_writes+0x38/0xd0 [fscache]
 [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
 [<ffffffff8105e759>] ? find_busiest_queue+0x69/0x150
 [<ffffffff8110ddd4>] ? slow_work_enqueue+0x104/0x180
 [<ffffffffa01c1303>] fscache_object_slow_work_execute+0x5e3/0x9d0 [fscache]
 [<ffffffff81096b67>] ? bit_waitqueue+0x17/0xd0
 [<ffffffff8110e233>] slow_work_execute+0x233/0x310
 [<ffffffff8110e515>] slow_work_thread+0x205/0x360
 [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8110e310>] ? slow_work_thread+0x0/0x360
 [<ffffffff81096936>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810968a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

The parameter to fscache_invalidate_writes() was object->cookie which is NULL.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells caaef6900b FS-Cache: Fix object state machine to have separate work and wait states
Fix object state machine to have separate work and wait states as that makes
it easier to envision.

There are now three kinds of state:

 (1) Work state.  This is an execution state.  No event processing is performed
     by a work state.  The function attached to a work state returns a pointer
     indicating the next state to which the OSM should transition.  Returning
     NO_TRANSIT repeats the current state, but goes back to the scheduler
     first.

 (2) Wait state.  This is an event processing state.  No execution is
     performed by a wait state.  Wait states are just tables of "if event X
     occurs, clear it and transition to state Y".  The dispatcher returns to
     the scheduler if none of the events in which the wait state has an
     interest are currently pending.

 (3) Out-of-band state.  This is a special work state.  Transitions to normal
     states can be overridden when an unexpected event occurs (eg. I/O error).
     Instead the dispatcher disables and clears the OOB event and transits to
     the specified work state.  This then acts as an ordinary work state,
     though object->state points to the overridden destination.  Returning
     NO_TRANSIT resumes the overridden transition.

In addition, the states have names in their definitions, so there's no need for
tables of state names.  Further, the EV_REQUEUE event is no longer necessary as
that is automatic for work states.

Since the states are now separate structs rather than values in an enum, it's
not possible to use comparisons other than (non-)equality between them, so use
some object->flags to indicate what phase an object is in.

The EV_RELEASE, EV_RETIRE and EV_WITHDRAW events have been squished into one
(EV_KILL).  An object flag now carries the information about retirement.

Similarly, the RELEASING, RECYCLING and WITHDRAWING states have been merged
into an KILL_OBJECT state and additional states have been added for handling
waiting dependent objects (JUMPSTART_DEPS and KILL_DEPENDENTS).

A state has also been added for synchronising with parent object initialisation
(WAIT_FOR_PARENT) and another for initiating look up (PARENT_READY).

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 493f7bc114 FS-Cache: Wrap checks on object state
Wrap checks on object state (mostly outside of fs/fscache/object.c) with
inline functions so that the mechanism can be replaced.

Some of the state checks within object.c are left as-is as they will be
replaced.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 610be24ee4 FS-Cache: Uninline fscache_object_init()
Uninline fscache_object_init() so as not to expose some of the FS-Cache
internals to the cache backend.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
David Howells 0c59a95d90 FS-Cache: Don't sleep in page release if __GFP_FS is not set
Don't sleep in __fscache_maybe_release_page() if __GFP_FS is not set.  This
goes some way towards mitigating fscache deadlocking against ext4 by way of
the allocator, eg:

INFO: task flush-8:0:24427 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
flush-8:0       D ffff88003e2b9fd8     0 24427      2 0x00000000
 ffff88003e2b9138 0000000000000046 ffff880012e3a040 ffff88003e2b9fd8
 0000000000011c80 ffff88003e2b9fd8 ffffffff81a10400 ffff880012e3a040
 0000000000000002 ffff880012e3a040 ffff88003e2b9098 ffffffff8106dcf5
Call Trace:
 [<ffffffff8106dcf5>] ? __lock_is_held+0x31/0x53
 [<ffffffff81219b61>] ? radix_tree_lookup_element+0xf4/0x12a
 [<ffffffff81454bed>] schedule+0x60/0x62
 [<ffffffffa01d349c>] __fscache_wait_on_page_write+0x8b/0xa5 [fscache]
 [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
 [<ffffffffa01d393a>] __fscache_maybe_release_page+0x30c/0x324 [fscache]
 [<ffffffffa01d369a>] ? __fscache_maybe_release_page+0x6c/0x324 [fscache]
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffffa01fd7b2>] nfs_fscache_release_page+0x68/0x94 [nfs]
 [<ffffffffa01ef73e>] nfs_release_page+0x7e/0x86 [nfs]
 [<ffffffff810aa553>] try_to_release_page+0x32/0x3b
 [<ffffffff810b6c70>] shrink_page_list+0x535/0x71a
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff810b7352>] shrink_inactive_list+0x20a/0x2dd
 [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
 [<ffffffff810b7a65>] shrink_lruvec+0x34c/0x3eb
 [<ffffffff810b7bd3>] do_try_to_free_pages+0xcf/0x355
 [<ffffffff810b7fc8>] try_to_free_pages+0x9a/0xa1
 [<ffffffff810b08d2>] __alloc_pages_nodemask+0x494/0x6f7
 [<ffffffff810d9a07>] kmem_getpages+0x58/0x155
 [<ffffffff810dc002>] fallback_alloc+0x120/0x1f3
 [<ffffffff8106db23>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff810dbed3>] ____cache_alloc_node+0x177/0x186
 [<ffffffff81162a6c>] ? ext4_init_io_end+0x1c/0x37
 [<ffffffff810dc403>] kmem_cache_alloc+0xf1/0x176
 [<ffffffff810b17ac>] ? test_set_page_writeback+0x101/0x113
 [<ffffffff81162a6c>] ext4_init_io_end+0x1c/0x37
 [<ffffffff81162ce4>] ext4_bio_write_page+0x20f/0x3af
 [<ffffffff8115cc02>] mpage_da_submit_io+0x26e/0x2f6
 [<ffffffff811088e5>] ? __find_get_block_slow+0x38/0x133
 [<ffffffff81161348>] mpage_da_map_and_submit+0x3a7/0x3bd
 [<ffffffff81161a60>] ext4_da_writepages+0x30d/0x426
 [<ffffffff810b3359>] do_writepages+0x1c/0x2a
 [<ffffffff81102f4d>] __writeback_single_inode+0x3e/0xe5
 [<ffffffff81103995>] writeback_sb_inodes+0x1bd/0x2f4
 [<ffffffff81103b3b>] __writeback_inodes_wb+0x6f/0xb4
 [<ffffffff81103c81>] wb_writeback+0x101/0x195
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff811043aa>] ? wb_do_writeback+0xaa/0x173
 [<ffffffff8110434a>] wb_do_writeback+0x4a/0x173
 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81038554>] ? del_timer+0x4b/0x5b
 [<ffffffff811044e0>] bdi_writeback_thread+0x6d/0x147
 [<ffffffff81104473>] ? wb_do_writeback+0x173/0x173
 [<ffffffff81048fbc>] kthread+0xd0/0xd8
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
2 locks held by flush-8:0/24427:
 #0:  (&type->s_umount_key#41){.+.+..}, at: [<ffffffff810e3b73>] grab_super_passive+0x4c/0x76
 #1:  (jbd2_handle){+.+...}, at: [<ffffffff81190d81>] start_this_handle+0x475/0x4ea


The problem here is that another thread, which is attempting to write the
to-be-stored NFS page to the on-ext4 cache file is waiting for the journal
lock, eg:

INFO: task kworker/u:2:24437 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/u:2     D ffff880039589768     0 24437      2 0x00000000
 ffff8800395896d8 0000000000000046 ffff8800283bf040 ffff880039589fd8
 0000000000011c80 ffff880039589fd8 ffff880039f0b040 ffff8800283bf040
 0000000000000006 ffff8800283bf6b8 ffff880039589658 ffffffff81071a13
Call Trace:
 [<ffffffff81071a13>] ? mark_held_locks+0xbe/0xea
 [<ffffffff81455e73>] ? _raw_spin_unlock_irqrestore+0x3a/0x50
 [<ffffffff81071b53>] ? trace_hardirqs_on_caller+0x114/0x170
 [<ffffffff81071bbc>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff81454bed>] schedule+0x60/0x62
 [<ffffffff81190c23>] start_this_handle+0x317/0x4ea
 [<ffffffff810498a8>] ? __init_waitqueue_head+0x4d/0x4d
 [<ffffffff81190fcc>] jbd2__journal_start+0xb3/0x12e
 [<ffffffff81176606>] __ext4_journal_start_sb+0xb2/0xc6
 [<ffffffff8115f137>] ext4_da_write_begin+0x109/0x233
 [<ffffffff810a964d>] generic_file_buffered_write+0x11a/0x264
 [<ffffffff811032cf>] ? __mark_inode_dirty+0x2d/0x1ee
 [<ffffffff810ab1ab>] __generic_file_aio_write+0x2a5/0x2d5
 [<ffffffff810ab24a>] generic_file_aio_write+0x6f/0xd0
 [<ffffffff81159a2c>] ext4_file_write+0x38c/0x3c4
 [<ffffffff810e0915>] do_sync_write+0x91/0xd1
 [<ffffffffa00a17f0>] cachefiles_write_page+0x26f/0x310 [cachefiles]
 [<ffffffffa01d470b>] fscache_write_op+0x21e/0x37a [fscache]
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffffa01d2479>] fscache_op_work_func+0x78/0xd7 [fscache]
 [<ffffffff8104455a>] process_one_work+0x232/0x3a8
 [<ffffffff810444ff>] ? process_one_work+0x1d7/0x3a8
 [<ffffffff81044ee0>] worker_thread+0x214/0x303
 [<ffffffff81044ccc>] ? manage_workers+0x245/0x245
 [<ffffffff81048fbc>] kthread+0xd0/0xd8
 [<ffffffff81455eb2>] ? _raw_spin_unlock_irq+0x29/0x3e
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
 [<ffffffff81456aac>] ret_from_fork+0x7c/0xb0
 [<ffffffff81048eec>] ? __init_kthread_worker+0x55/0x55
4 locks held by kworker/u:2/24437:
 #0:  (fscache_operation){.+.+.+}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
 #1:  ((&op->work)){+.+.+.}, at: [<ffffffff810444ff>] process_one_work+0x1d7/0x3a8
 #2:  (sb_writers#14){.+.+.+}, at: [<ffffffff810ab22c>] generic_file_aio_write+0x51/0xd0
 #3:  (&sb->s_type->i_mutex_key#19){+.+.+.}, at: [<ffffffff810ab236>] generic_file_aio_write+0x5b/0x

fscache already tries to cancel pending stores, but it can't cancel a write
for which I/O is already in progress.

An alternative would be to accept writing garbage to the cache under extreme
circumstances and to kill the afflicted cache object if we have to do this.
However, we really need to know how strapped the allocator is before deciding
to do that.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
J. Bruce Fields 6bd5e82b09 CacheFiles: name i_mutex lock class explicitly
Just some cleanup.

(And note the caller of this function may, for example, call vfs_unlink
on a child, so the "1" (I_MUTEX_PARENT) really was what was intended
here.)

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Sebastian Andrzej Siewior ee8be57bc3 fs/fscache: remove spin_lock() from the condition in while()
The spinlock() within the condition in while() will cause a compile error
if it is not a function. This is not a problem on mainline but it does not
look pretty and there is no reason to do it that way.
That patch writes it a little differently and avoids the double condition.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-By: Milosz Tanski <milosz@adfin.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2013-06-19 14:16:47 +01:00
Benjamin Marzinski 5d054964f5 GFS2: aggressively issue revokes in gfs2_log_flush
This patch looks at all the outstanding blocks in all the transactions
on the log, and moves the completed ones to the ail2 list.  Then it
issues revokes for these blocks.  This will hopefully speed things up
in situations where there is a lot of contention for glocks, especially
if they are acquired serially.

revoke_lo_before_commit will issue at most one log block's full of these
preemptive revokes. The amount of reserved log space that
gfs2_log_reserve() ignores has been incremented to allow for this extra
block.

This patch also consolidates the common revoke instructions into one
function, gfs2_add_revoke().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-19 09:41:59 +01:00
Greg Kroah-Hartman bb07b00be7 Merge 3.10-rc6 into driver-core-next
We want these fixes here too.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-06-17 16:57:20 -07:00
Maxim Patlasov 14c14414d1 fuse: hold i_mutex in fuse_file_fallocate()
Changing size of a file on server and local update (fuse_write_update_size)
should be always protected by inode->i_mutex. Otherwise a race like this is
possible:

1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE).
fuse_file_fallocate() sends FUSE_FALLOCATE request to the server.
2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr()
sends shrinking FUSE_SETATTR request to the server and updates local i_size
by i_size_write(inode, outarg.attr.size).
3. Process 'A' resumes execution of fuse_file_fallocate() and calls
fuse_write_update_size(inode, offset + length). But 'offset + length' was
obsoleted by ftruncate from previous step.

Changed in v2 (thanks Brian and Anand for suggestions):
 - made relation between mutex_lock() and fuse_set_nowrite(inode) more
   explicit and clear.
 - updated patch description to use ftruncate(2) in example

Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2013-06-18 01:39:03 +02:00
Jon Ernst 03b40e3496 ext4: delete unused variables
This patch removed several unused variables.

Signed-off-by: Jon Ernst <jonernst07@gmx.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-17 08:56:26 -04:00
Namjae Jeon 696c018c77 f2fs: add remount_fs callback support
Add the f2fs_remount function call which will be used
during the filesystem remounting. This function
will help us to change the mount options specific to
f2fs.

Also modify the f2fs background_gc mount option, which
will allow the user to dynamically trun on/off the
garbage collection in f2fs based on the background_gc
value. If background_gc=on, Garbage collection will
be turned off & if background_gc=off, Garbage collection
will be truned on.

By default the garbage collection is on in f2fs.

Change Log:
v2: Incorporated the review comments by Gu Zheng.
    Removing the restore part for VFS flags
    Updating comments with proper flag conditions
    Display GC background option as ON/OFF
    Revised conditions to stop GC in case of remount

v1: Initial changes for adding remount_fs callback
support.

Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change /** with /* for the coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-17 21:17:55 +09:00
Linus Torvalds d0ff934881 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
 "Several fixes + obvious cleanup (you've missed a couple of open-coded
  can_lookup() back then)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  snd_pcm_link(): fix a leak...
  use can_lookup() instead of direct checks of ->i_op->lookup
  move exit_task_namespaces() outside of exit_notify()
  fput: task_work_add() can fail if the caller has passed exit_task_work()
  ncpfs: fix rmdir returns Device or resource busy
2013-06-14 19:18:56 -10:00
Linus Torvalds d58c6ff0b7 xfs: fixes for 3.10-rc6
- Remove noisy warnings about experimental support which spams the logs
 - Add padding to align directory and attr structures correctly
 - Set block number on child buffer on a root btree split
 - Disable verifiers during log recovery for non-CRC filesystems
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJRu4gPAAoJENaLyazVq6ZO0GwP/j7i8hEl6hoFZZJ2WX7niFCP
 t0r218J9JZDCLSk7+rY26gmxOzifRHAIt5TRwwqSCbNnZbuQZsqFUpvDMSMY3XOj
 4qnUlO6diRLonN5ixrOb5YMTQJ8YHG7cB4jvxBDAqPqEfNpRyqikxstcH6KBmtSU
 duqhuQMdmHAjMUqfpdt5ewueOCmw6jI79ZqvMnEfSHW7YS7G4SrKYa71HkfRR6CD
 +K/FqEoDO/9psbsFlrkQ4Uvqngp8c9c0wQULxreN0BSdRbVqHfrS6eAWGhT3K2HW
 7ZGxEiTcwR5XCtDQjhw7vbZQEMeMcl6yZ6J7e+jJc53maySOOrqCaYyyrhzZFw4H
 Xh52pcVJtGuGVBHDxpfhI5e7KI4DjEugQK9AaONy02bhhTh3r3CKu5pprDyenyHr
 9s/DG8u/gJX8tm8DSBlIXv2iCvY4mTeesYkMaLHgC8uLXmItkRBoUaj1NQvnsTqo
 EF1xVVqh3aiueD4+cvu3+x4J4dTFmYQ++Oi3Zt1YpjBBb/h3n3KFUfizhRIp9r43
 R4UO5W3b6s4q/1oC+bO6Qlxfny9vcyz+UrkcLpbuo+cRTC3bKi85v2Gaaw69bcB1
 1SZCFRuVvDvzffX6Nir699Dj/uU4GETvDw/+y/igcKcETx6L4AgQPV9y/izJq5zr
 zLhC+OSCDvuOGaOmRvco
 =bijX
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Ben Myers:
 - Remove noisy warnings about experimental support which spams the logs
 - Add padding to align directory and attr structures correctly
 - Set block number on child buffer on a root btree split
 - Disable verifiers during log recovery for non-CRC filesystems

* tag 'for-linus-v3.10-rc6' of git://oss.sgi.com/xfs/xfs:
  xfs: don't shutdown log recovery on validation errors
  xfs: ensure btree root split sets blkno correctly
  xfs: fix implicit padding in directory and attr CRC formats
  xfs: don't emit v5 superblock warnings on write
2013-06-14 19:16:31 -10:00
Al Viro 0525290119 use can_lookup() instead of direct checks of ->i_op->lookup
a couple of places got missed back when Linus has introduced that one...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:41:45 +04:00
Oleg Nesterov e7b2c40692 fput: task_work_add() can fail if the caller has passed exit_task_work()
fput() assumes that it can't be called after exit_task_work() but
this is not true, for example free_ipc_ns()->shm_destroy() can do
this. In this case fput() silently leaks the file.

Change it to fallback to delayed_fput_work if task_work_add() fails.
The patch looks complicated but it is not, it changes the code from

	if (PF_KTHREAD) {
		schedule_work(...);
		return;
	}
	task_work_add(...)

to
	if (!PF_KTHREAD) {
		if (!task_work_add(...))
			return;
		/* fallback */
	}
	schedule_work(...);

As for shm_destroy() in particular, we could make another fix but I
think this change makes sense anyway. There could be another similar
user, it is not safe to assume that task_work_add() can't fail.

Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:39:08 +04:00
Dave Chinner d302cf1d31 xfs: don't shutdown log recovery on validation errors
Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, the go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.

Instead of causing a shutdown and unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems If log recovery successfully completes, this
will be undone and the object will be consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.

For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 9222a9cf86)
2013-06-14 15:59:45 -05:00
Dave Chinner 088c9f67c3 xfs: ensure btree root split sets blkno correctly
For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a different code path on root splits than the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e. the code is not exercised at all, and it's the only generic
btree infrastructure operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.

Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit ade1335afe)
2013-06-14 15:59:31 -05:00
Dave Chinner 5170711df7 xfs: fix implicit padding in directory and attr CRC formats
Michael L. Semon has been testing CRC patches on a 32 bit system and
been seeing assert failures in the directory code from xfs/080.
Thanks to Michael's heroic efforts with printk debugging, we found
that the problem was that the last free space being left in the
directory structure was too small to fit a unused tag structure and
it was being corrupted and attempting to log a region out of bounds.
Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the
last 4 bytes of a directory buffer) and #2 shows the corrupt values
being calculated from the length of the dup entry which overflowed
the size of the buffer.

It turns out that the problem was not in the logging code, nor in
the freespace handling code. It is an initial condition bug that
only shows up on 32 bit systems. When a new buffer is initialised,
where's the freespace that is set up:

[  172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[  172.316346] #9 calling xfs_dir2_data_log_unused()
[  172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[  172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into
the buffer. Once I saw that, I pretty much knew that the bug was
going to be caused by this.

Essentially, all direct entries are rounded to 8 bytes in length,
and all entries start with an 8 byte alignment. This means that we
can decode inplace as variables are naturally aligned. With the
directory data supposedly starting on a 8 byte boundary, and all
entries padded to 8 bytes, the minimum freespace in a directory
block is supposed to be 8 bytes, which is large enough to fit a
unused data entry structure (6 bytes in size). The fact we only have
4 bytes of free space indicates a directory data block alignment
problem.

And what do you know - there's an implicit hole in the directory
data block header for the CRC format, which means the header is 60
byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs
padding. And while looking at the structures, I found the same
problem in the attr leaf header. Fix them both.

Note that this only affects 32 bit systems with CRCs enabled.
Everything else is just fine. Note that CRC enabled filesystems created
before this fix on such systems will not be readable with this fix
applied.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Debugged-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 8a1fd2950e)
2013-06-14 15:59:16 -05:00
Dave Chinner 47ad2fcba9 xfs: don't emit v5 superblock warnings on write
We write the superblock every 30s or so which results in the
verifier being called. Right now that results in this output
every 30s:

XFS (vda): Version 5 superblock detected. This kernel has EXPERIMENTAL support enabled!
Use of these features in this kernel is at your own risk!

And spamming the logs.

We don't need to check for whether we support v5 superblocks or
whether there are feature bits we don't support set as these are
only relevant when we first mount the filesytem. i.e. on superblock
read. Hence for the write verification we can just skip all the
checks (and hence verbose output) altogether.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 34510185ab)
2013-06-14 15:58:47 -05:00
Mike Christie 86e92ad299 dlm: disable nagle for SCTP
For TCP we disable Nagle and I cannot think of why it would be needed
for SCTP. When disabled it seems to improve dlm_lock operations like it
does for TCP.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie 5d6898714f dlm: retry failed SCTP sends
Currently if a SCTP send fails, we lose the data we were trying
to send because the writequeue_entry is released when we do the send.
When this happens other nodes will then hang waiting for a reply.

This adds support for SCTP to retry the send operation.

I also removed the retry limit for SCTP use, because we want
to make sure we try every path during init time and for longer
failures we want to continually retry in case paths come back up
while trying other paths. We will do this until userspace tells us
to stop.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie 98e1b60ecc dlm: try other IPs when sctp init assoc fails
Currently, if we cannot create a association to the first IP addr
that is added to DLM, the SCTP init assoc code will just retry
the same IP. This patch adds a simple failover schemes where we
will try one of the addresses that was passed into DLM.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie b390ca38d2 dlm: clear correct bit during sctp init failure handling
We should be testing and cleaing the init pending bit because later
when sctp_init_assoc is recalled it will be checking that it is not set
and set the bit.

We do not want to touch CF_CONNECT_PENDING here because we will queue
swork and process_send_sockets will then call the connect_action function.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:11 -05:00
Mike Christie e1631d0c48 dlm: set sctp assoc id during setup
sctp_assoc was not getting set so later lookups failed.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:10 -05:00
Mike Christie efad7e6b1a dlm: clear correct init bit during sctp setup
We were clearing the base con's init pending flags, but the
con for the node was the one with the pending bit set.

Signed-off-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: David Teigland <teigland@redhat.com>
2013-06-14 13:07:10 -05:00
Bob Peterson 512cbf02fd GFS2: fix regression in dir_double_exhash
Recent commit e8830d8 introduced a bug in function dir_double_exhash;
it was failing to set h in the fall-back case. This patch corrects it.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-14 13:27:17 +01:00
Steven Whitehouse 6d4ade986f GFS2: Add atomic_open support
I've restricted atomic_open to only operate on regular files, although
I still don't understand why atomic_open should not be possible also for
directories on GFS2. That can always be added in later though, if it
makes sense.

The ->atomic_open function can be passed negative dentries, which
in most cases means either ENOENT (->lookup) or a call to d_instantiate
(->create). In the GFS2 case though, we need to actually perform the
look up, since we do not know whether there has been a new inode created
on another node. The look up calls d_splice_alias which then tries to
rehash the dentry - so the solution here is to simply check for that
in d_splice_alias. The same issue is likely to affect any other cluster
filesystem implementing ->atomic_open

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields fieldses org>
Cc: Jeff Layton <jlayton@redhat.com>
2013-06-14 11:17:15 +01:00
Linus Torvalds a2648ebb7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of crash fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: stop all workers before cleaning up roots
  Btrfs: fix use-after-free bug during umount
  Btrfs: init relocate extent_io_tree with a mapping
  btrfs: Drop inode if inode root is NULL
  Btrfs: don't delete fs_roots until after we cleanup the transaction
2013-06-13 22:34:14 -07:00
Jaegeuk Kim 354a3399dc f2fs: recover wrong pino after checkpoint during fsync
If a file is linked, f2fs loose its parent inode number so that fsync calls
for the linked file should do checkpoint all the time.
But, if we can recover its parent inode number after the checkpoint, we can
adjust roll-forward mechanism for the further fsync calls, which is able to
improve the fsync performance significatly.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:45 +09:00
Haicheng Li b25958b6ec f2fs: optimize do_write_data_page()
Since "need_inplace_update() == true" is a very rare case, using unlikely()
to give compiler a chance to optimize the code.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:44 +09:00
Haicheng Li 8d8451af68 f2fs: make locate_dirty_segment() as static
It's used only locally and could be static.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:44 +09:00
Haicheng Li e79efe3b69 f2fs: remove unnecessary parameter "offset" from __add_sum_entry()
We can get the value directly from pointer "curseg".

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Jaegeuk Kim b3783873cc f2fs: avoid freqeunt write_inode calls
If update_inode is called, we don't need to do write_inode.
So, let's use a *dirty* flag for each inode.

Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:43 +09:00
Namjae Jeon d7cc950b4c f2fs: optimise the truncate_data_blocks_range() range
The function truncate_data_blocks_range() decrements the valid
block count of inode via dec_valid_block_count(). Since this
function updates the i_blocks field of inode, we can update this
field once we have calculated total the number of blocks
to be freed.

Therefore we can decrement valid blocks outside of the for loop.

	if (nr_free) {
+		dec_valid_block_count(sbi, dn->inode, nr_free);
 		set_page_dirty(dn->node_page);
 		sync_inode_page(dn);
 	}

'nr_free' tells the total number of blocks freed. So, we can
just directly pass this value to dec_valid_block_count() and update
the i_blocks.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:42 +09:00
Namjae Jeon 6a3e8ef0de f2fs: use the F2FS specific flags in f2fs_ioctl()
In f2fs_ioctl() function, it is using generic flags.
Since F2FS specific flags are defined. So lets use
those flags.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
2013-06-14 09:04:26 +09:00
Jie Liu 72dac95d44 ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
Return the FIEMAP_EXTENT_UNKNOWN flag as well except the
FIEMAP_EXTENT_DELALLOC because the data location of an
delayed allocation extent is unknown.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
2013-06-12 23:13:59 -04:00
Paul Gortmaker 75497d0607 jbd2: remove debug dependency on debug_fs and update Kconfig help text
Commit b6e96d0067 ("jbd2: use module parameters instead of debugfs
for jbd_debug") removed any need for a dependency on DEBUG_FS.  It
also moved the /sys variables out from underneath the typical debugfs
mount point.  Delete the dependency and update the /sys path to where
the debug settings are currently.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:07:51 -04:00
Paul Gortmaker 169f1a2a87 jbd2: use a single printk for jbd_debug()
Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):

[  290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[  290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[  290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[  290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[  290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079

i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved.  The output should have been:

(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104

It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Suggested-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 23:04:04 -04:00
Paul Gortmaker cfc7bc896f jbd2: fix duplicate debug label for phase 2
Currently we see this output:

  $git grep phase fs/jbd2
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 1\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 2\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 3\n");
  fs/jbd2/commit.c:       jbd_debug(3, "JBD2: commit phase 4\n");
  [...]

There is clearly a duplicate label for phase 2, and they are
both active (i.e. not in #if ... #else block).  Rename them to
be "2a" and "2b" so the debug output is unambiguous.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:56:35 -04:00
Paul Gortmaker 0ef54180e0 jbd2: drop checkpoint mutex when waiting in __jbd2_log_wait_for_space()
While trying to debug an an issue under extreme I/O loading
on preempt-rt kernels, the following backtrace was observed
via SysRQ output:

rm              D ffff8802203afbc0  4600  4878   4748 0x00000000
 ffff8802217bfb78 0000000000000082 ffff88021fc2bb80 ffff88021fc2bb80
 ffff88021fc2bb80 ffff8802217bffd8 ffff8802217bffd8 ffff8802217bffd8
 ffff88021f1d4c80 ffff88021fc2bb80 ffff8802217bfb88 ffff88022437b000
Call Trace:
 [<ffffffff8172dc34>] schedule+0x24/0x70
 [<ffffffff81225b5d>] jbd2_log_wait_commit+0xbd/0x140
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81223635>] jbd2_log_do_checkpoint+0xf5/0x520
 [<ffffffff81223b09>] __jbd2_log_wait_for_space+0xa9/0x1f0
 [<ffffffff8121dc40>] start_this_handle.isra.10+0x2e0/0x530
 [<ffffffff81060390>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff8121e0a3>] jbd2__journal_start+0xc3/0x110
 [<ffffffff811de7ce>] ? ext4_rmdir+0x6e/0x230
 [<ffffffff8121e0fe>] jbd2_journal_start+0xe/0x10
 [<ffffffff811f308b>] ext4_journal_start_sb+0x5b/0x160
 [<ffffffff811de7ce>] ext4_rmdir+0x6e/0x230
 [<ffffffff811435c5>] vfs_rmdir+0xd5/0x140
 [<ffffffff8114370f>] do_rmdir+0xdf/0x120
 [<ffffffff8105c6b4>] ? task_work_run+0x44/0x80
 [<ffffffff81002889>] ? do_notify_resume+0x89/0x100
 [<ffffffff817361ae>] ? int_signal+0x12/0x17
 [<ffffffff81145d85>] sys_unlinkat+0x25/0x40
 [<ffffffff81735f22>] system_call_fastpath+0x16/0x1b

What is interesting here, is that we call log_wait_commit, from
within wait_for_space, but we are still holding the checkpoint_mutex
as it surrounds mostly the whole of wait_for_space.  And then, as we
are waiting, journal_commit_transaction can run, and if the JBD2_FLUSHED
bit is set, then we will also try to take the same checkpoint_mutex.

It seems that we need to drop the checkpoint_mutex while sitting in
jbd2_log_wait_commit, if we want to guarantee that progress can be made
by jbd2_journal_commit_transaction().  There does not seem to be
anything preempt-rt specific about this, other then perhaps increasing
the odds of it happening.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:47:35 -04:00
Paul Gortmaker 3ca841c106 jbd2: relocate assert after state lock in journal_commit_transaction()
The state lock is taken after we are doing an assert on the state
value, not before.  So we might in fact be doing an assert on a
transient value.  Ensure the state check is within the scope of
the state lock being taken.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:46:35 -04:00
Dmitry Monakhov 4418e14112 ext4: Fix fsync error handling after filesystem abort
If filesystem was aborted after inode's write back is complete
but before its metadata was updated we may return success
results in data loss.
In order to handle fs abort correctly we have to check
fs state once we discover that it is in MS_RDONLY state

Test case: http://patchwork.ozlabs.org/patch/244297

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:38:04 -04:00
Dmitry Monakhov 06a407f13d ext4: fix data integrity for ext4_sync_fs
Inode's data or non journaled quota may be written w/o jounral so we
_must_ send a barrier at the end of ext4_sync_fs. But it can be
skipped if journal commit will do it for us.

Also fix data integrity for nojournal mode.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:25:07 -04:00
Dmitry Monakhov 9ff8644624 jbd2: optimize jbd2_journal_force_commit
Current implementation of jbd2_journal_force_commit() is suboptimal because
result in empty and useless commits. But callers just want to force and wait
any unfinished commits. We already have jbd2_journal_force_commit_nested()
which does exactly what we want, except we are guaranteed that we do not hold
journal transaction open.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-06-12 22:24:07 -04:00