Commit graph

3418 commits

Author SHA1 Message Date
Jie Liu 8fe657760d xfs: remove redundant stat assignment in xfs_bulkstat_one_int
From: Jie Liu <jeff.liu@oracle.com>

Remove the redundant BULKSTAT_RV_NOTHING assignment in case of call
xfs_iget() failed at xfs_bulkstat_one_int().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-24 11:33:28 +10:00
Brian Foster 80d6d69821 xfs: add log attributes for log lsn and grant head data
Create log attributes to export the current runtime state of the log to
sysfs. Note that the filesystem should be frozen for consistency across
attributes.

The following per-mount attributes are created: log_head_lsn,
log_tail_lsn, reserve_grant_head and write_grant_head. These represent
the physical log head, tail and reserve and write grant heads
respectively. Attribute values are exported in the following format:

	"cycle:[block,byte]"

... where cycle represents the log cycle and [block,bytes] represents
either the basic block or byte offset of the log, depending on the
attribute.  Log sequence number (LSN) values are encoded in basic blocks
and grant heads are encoded in bytes. All values are in decimal format.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:48 +10:00
Brian Foster baff4e44b9 xfs: add xlog sysfs kobject and attribute handlers
Embed a kobject into the xfs log data structure (xlog). This creates a
'log' subdirectory for every XFS mount instance in sysfs. The lifecycle
of the log kobject is tied to the lifecycle of the log.

Also define a set of generic attribute handlers associated with the log
kobject in preparation for the addition of attributes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:29 +10:00
Brian Foster a31b1d3d89 xfs: add xfs_mount sysfs kobject
Embed a base kobject into xfs_mount. This creates a kobject associated
with each XFS mount and a subdirectory in sysfs with the name of the
filesystem. The subdirectory lifecycle matches that of the mount. Also
add the new xfs_sysfs.[c,h] source files with some XFS sysfs
infrastructure to facilitate attribute creation.

Note that there are currently no attributes exported as part of the
xfs_mount kobject. It exists solely to serve as a per-mount container
for child objects.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 08:07:01 +10:00
Brian Foster 3d8712265c xfs: add a sysfs kset
Create a sysfs kset to contain all sub-objects associated with the XFS
module. The kset is created and removed on module initialization and
removal respectively. The kset uses fs_obj as a parent. This leads to
the creation of a /sys/fs/xfs directory when the kset exists.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:41:37 +10:00
Brian Foster a70a4fa528 xfs: fix a couple error sequence jumps in xfs_mountfs()
xfs_mountfs() has a couple failure conditions that do not jump to the
correct labels. Specifically:

- xfs_initialize_perag_data() failure does not deallocate the log even
  though it occurs after log initialization
- xfs_mount_reset_sbqflags() failure returns the error directly rather
  than jump to the error sequence

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:41:25 +10:00
Dave Chinner 7f8a058f6d Merge branch 'xfs-libxfs-restructure' into for-next 2014-07-15 07:37:18 +10:00
Dave Chinner 03e01349c6 xfs: null unused quota inodes when quota is on
When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broken this rule for non-project quota inode enabled
filesystem, as the code now refuses to write the group quota inode
if neither group or project quotas are enabled. This regression was
introduced by commit d892d58 ("xfs: Start using pquotaino from the
superblock").

In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.

Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
0102629 ("xfs: Initialize all quota inodes to be NULLFSINO") with is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:

[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount

By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:28:41 +10:00
Dave Chinner cf11da9c5d xfs: refine the allocation stack switch
The allocation stack switch at xfs_bmapi_allocate() has served it's
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows.  Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distro's won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performacne overhead caused by switching stacks. The worse case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.

The amount of stack being used is much less than I've previously be
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:24 +10:00
Dave Chinner aa182e64f1 Revert "xfs: block allocation work needs to be kswapd aware"
This reverts commit 1f6d64829d.

This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocation could dip into kswapd specific memory reserves and
avoid being throttled. This cause a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-07-15 07:08:10 +10:00
Dave Chinner 2451337dd0 xfs: global error sign conversion
Convert all the errors the core XFs code to negative error signs
like the rest of the kernel and remove all the sign conversion we
do in the interface layers.

Errors for conversion (and comparison) found via searches like:

$ git grep " E" fs/xfs
$ git grep "return E" fs/xfs
$ git grep " E[A-Z].*;$" fs/xfs

Negation points found via searches like:

$ git grep "= -[a-z,A-Z]" fs/xfs
$ git grep "return -[a-z,A-D,F-Z]" fs/xfs
$ git grep " -[a-z].*;" fs/xfs

[ with some bits I missed from Brian Foster ]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:58:08 +10:00
Dave Chinner 30f712c9dd libxfs: move source files
Move all the source files that are shared with userspace into
libxfs/. This is done as one big chunk simpy to get it done
quickly

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:53 +10:00
Dave Chinner 84be0ffc90 libxfs: move header files
Move all the header files that are shared with userspace into
libxfs. This is done as one big chunk simpy to get it done quickly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:36 +10:00
Dave Chinner 69116a1317 xfs: create libxfs infrastructure
To minimise the differences between kernel and userspace code,
split the kernel code into the same structure as the userspace code.
That is, the gneric core functionality of XFS is moved to a libxfs/
directory and treat it as a layering barrier in the XFS code.

This patch introduces the libxfs directory, the build infrastructure
and an initial source and header file to build. The libxfs directory
will contain the header files that are needed to build libxfs - most
of userspace does not care about the location of these header files
as they are accessed indirectly. Hence keeping them inside libxfs
makes it easy to track the changes and script the sync process as
the directory structure will be identical.

To allow this changeover to occur in the kernel code, there are some
temporary infrastructure in the makefiles to grab the header
filesystem from both locations. Once all the files are moved,
modifications will be made in the source code that will make the
need for these include directives go away.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-25 14:57:22 +10:00
Eric Sandeen b474c7ae43 xfs: Nuke XFS_ERROR macro
XFS_ERROR was designed long ago to trap return values, but it's not
runtime configurable, it's not consistently used, and we can do
similar error trapping with ftrace scripts and triggers from
userspace.

Just nuke XFS_ERROR and associated bits.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:04:54 +10:00
Eric Sandeen d99831ff39 xfs: return is not a function
return is not a function.  "return(EIO);" is silly;
"return (EIO);" moreso.  return is not a function.
Nuke the pointless parens.

[dchinner: catch a couple of extra cases in xfs_attr_list.c,
xfs_acl.c and xfs_linux.h.]

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-22 15:03:54 +10:00
Linus Torvalds 16b9057804 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "This the bunch that sat in -next + lock_parent() fix.  This is the
  minimal set; there's more pending stuff.

  In particular, I really hope to get acct.c fixes merged this cycle -
  we need that to deal sanely with delayed-mntput stuff.  In the next
  pile, hopefully - that series is fairly short and localized
  (kernel/acct.c, fs/super.c and fs/namespace.c).  In this pile: more
  iov_iter work.  Most of prereqs for ->splice_write with sane locking
  order are there and Kent's dio rewrite would also fit nicely on top of
  this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
  lock_parent: don't step on stale ->d_parent of all-but-freed one
  kill generic_file_splice_write()
  ceph: switch to iter_file_splice_write()
  shmem: switch to iter_file_splice_write()
  nfs: switch to iter_splice_write_file()
  fs/splice.c: remove unneeded exports
  ocfs2: switch to iter_file_splice_write()
  ->splice_write() via ->write_iter()
  bio_vec-backed iov_iter
  optimize copy_page_{to,from}_iter()
  bury generic_file_aio_{read,write}
  lustre: get rid of messing with iovecs
  ceph: switch to ->write_iter()
  ceph_sync_direct_write: stop poking into iov_iter guts
  ceph_sync_read: stop poking into iov_iter guts
  new helper: copy_page_from_iter()
  fuse: switch to ->write_iter()
  btrfs: switch to ->write_iter()
  ocfs2: switch to ->write_iter()
  xfs: switch to ->write_iter()
  ...
2014-06-12 10:30:18 -07:00
Al Viro 9c1d5284c7 Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus
Backmerge of dcache.c changes from mainline.  It's that, or complete
rebase...

Conflicts:
	fs/splice.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:28:09 -04:00
Al Viro 8d0207652c ->splice_write() via ->write_iter()
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter().  A bunch of simple cases coverted to that...

[AV: fixed the braino spotted by Cyrill]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-06-12 00:18:51 -04:00
Linus Torvalds 412dd3a6da xfs: update for 3.16-rc1
This update contains:
 o cleanup removing unused function args
 o rework of the filestreams allocator to use dentry cache parent lookups
 o new on-disk free inode btree and optimised inode allocator
 o various bug fixes
 o rework of internal attribute API
 o cleanup of superblock feature bit support to remove historic cruft
 o more fixes and minor cleanups
 o added a new directory/attribute geometry abstraction
 o yet more fixes and minor cleanups.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTl+YfAAoJEK3oKUf0dfod/T8QALmvR28JZTL3vtlD5rppXp9+
 DXOMrwgJ9V+GOI39tpUgw1u/C5DuaFPRPtmCjnb9Do4DJMrHj+zD8ADvoVd6asa0
 FHH4TuulQqOJVu67SZ3ng15yjyy+wPfymQZIQPQY/IwVMUUEpWnSFnKha1GAsL8Y
 RY/WNU50wMu4wxi0++ENooHJC2EoxXpzB80cHddN81zFEFZobw0cm5Aa5xBZEZ4i
 P+GpEuUpWHKvVaWRLuIMgVC0NuOt5KtLfS97ong+tRgWCw//QVl28Rxhrj1ZHsF3
 VAskVsSFVIIPHP7qKjQyCGk71iqBfrfAgRqqJHFZgSmtSzyK3hVvJlRRDdCT5hi6
 00aHg9vz9815I7zrQwyMuy872N3DTislOxJZGD7PKgLpgfeHs4qk+cQ1xCi2gdFn
 xnh2p4mLolZHzanUsoxYpSh7f7o+NT3xgET3yS63uuO/I57o74JJDfRDjWNX6I9F
 LLtIGb1cwVFUYbXcHGfP1wxQ1BS6rYYYwKpSJqqwJXApL9MqoxH2B8Hoo0BaG43/
 3UlNi+yljvhBNiJnx2pAIdU+WaIL1ZQj9XzuU1sFSa8lnFNb2x+wkgjHzJ0Hdotm
 zZqirCo1jyyNkyTwGfwJwGzNgZemQDMQ7cr2MYzG1mhFMLEZZJeFmWVzfuzJ3yoR
 jke/Hy/qiWVK0en43MdR
 =qnz2
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.16-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs updates from Dave Chinner:
 "This update contains:
   - cleanup removing unused function args
   - rework of the filestreams allocator to use dentry cache parent
     lookups
   - new on-disk free inode btree and optimised inode allocator
   - various bug fixes
   - rework of internal attribute API
   - cleanup of superblock feature bit support to remove historic cruft
   - more fixes and minor cleanups
   - added a new directory/attribute geometry abstraction
   - yet more fixes and minor cleanups"

* tag 'xfs-for-linus-3.16-rc1' of git://oss.sgi.com/xfs/xfs: (86 commits)
  xfs: fix xfs_da_args sparse warning in xfs_readdir
  xfs: Fix rounding in xfs_alloc_fix_len()
  xfs: tone down writepage/releasepage WARN_ONs
  xfs: small cleanup in xfs_lowbit64()
  xfs: kill xfs_buf_geterror()
  xfs: xfs_readsb needs to check for magic numbers
  xfs: block allocation work needs to be kswapd aware
  xfs: remove redundant geometry information from xfs_da_state
  xfs: replace attr LBSIZE with xfs_da_geometry
  xfs: pass xfs_da_args to xfs_attr_leaf_newentsize
  xfs: use xfs_da_geometry for block size in attr code
  xfs: remove mp->m_dir_geo from directory logging
  xfs: reduce direct usage of mp->m_dir_geo
  xfs: move node entry counts to xfs_da_geometry
  xfs: convert dir/attr btree threshold to xfs_da_geometry
  xfs: convert m_dirblksize to xfs_da_geometry
  xfs: convert m_dirblkfsbs to xfs_da_geometry
  xfs: convert directory segment limits to xfs_da_geometry
  xfs: convert directory db conversion to xfs_da_geometry
  xfs: convert directory dablk conversion to xfs_da_geometry
  ...
2014-06-11 09:03:47 -07:00
Andy Lutomirski 23adbe12ef fs,userns: Change inode_capable to capable_wrt_inode_uidgid
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-10 13:57:22 -07:00
Dave Chinner 7691283d05 Merge branch 'xfs-misc-fixes-3-for-3.16' into for-next 2014-06-10 07:32:56 +10:00
Dave Chinner 8612c7e594 Merge branch 'xfs-da-geom' into for-next 2014-06-10 07:32:41 +10:00
Dave Chinner 35f46c5f04 xfs: fix xfs_da_args sparse warning in xfs_readdir
The kbuild test robot reported:

>> fs/xfs/xfs_dir2_readdir.c:672:41: sparse: Using plain integer as NULL pointer

Fix it.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-10 07:30:36 +10:00
Jan Kara 30265117ee xfs: Fix rounding in xfs_alloc_fix_len()
Rounding in xfs_alloc_fix_len() is wrong. As the comment states, the
result should be a number of a form (k*prod+mod) however due to sign
mistake the result is different. As a result allocations on raid arrays
could be misaligned in some cases.

This also seems to fix occasional assertion failure:
	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0)
in xfs_alloc_ag_vextent_size().

Also add an assertion that the result of xfs_alloc_fix_len() is of
expected form.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:06:37 +10:00
Christoph Hellwig 448011e2ab xfs: tone down writepage/releasepage WARN_ONs
I recently ran into the issue fixed by

  "xfs: kill buffers over failed write ranges properly"

which spams the log with lots of backtraces.  Make debugging any
issues like that easier by using WARN_ON_ONCE in the writeback code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:05:15 +10:00
Dan Carpenter 72208ee060 xfs: small cleanup in xfs_lowbit64()
There are two checkpatch.pl complaints here because of the bad
indenting and because of the assignment inside the condition.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:04:42 +10:00
Dave Chinner 36de95567f xfs: kill xfs_buf_geterror()
Most of the callers are just calling ASSERT(!xfs_buf_geterror())
which means they are checking for bp->b_error == 0. If bp is null in
this case, we will assert fail, and hence it's no different in
result to oopsing because of a null bp. In some cases, errors have
already been checked for or the function returning the buffer can't
return a buffer with an error, so it's just a redundant assert.
Either way, the assert can either be removed.

The other two non-assert callers can just test for a buffer and
error properly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:02:12 +10:00
Dave Chinner 556b8883cf xfs: xfs_readsb needs to check for magic numbers
Commit daba542 ("xfs: skip verification on initial "guess"
superblock read") dropped the use of a verifier for the initial
superblock read so we can probe the sector size of the filesystem
stored in the superblock. It, however, now fails to validate that
what was read initially is actually an XFS superblock and hence will
fail the sector size check and return ENOSYS.

This causes probe-based mounts to fail because it expects XFS to
return EINVAL when it doesn't recognise the superblock format.

cc: <stable@vger.kernel.org>
Reported-by: Plamen Petrov <plamen.sisi@gmail.com>
Tested-by: Plamen Petrov <plamen.sisi@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 16:00:43 +10:00
Dave Chinner 1f6d64829d xfs: block allocation work needs to be kswapd aware
Upon memory pressure, kswapd calls xfs_vm_writepage() from
shrink_page_list(). This can result in delayed allocation occurring
and that gets deferred to the the allocation workqueue.

The allocation then runs outside kswapd context, which means if it
needs memory (and it does to demand page metadata from disk) it can
block in shrink_inactive_list() waiting for IO congestion. These
blocking waits are normally avoiding in kswapd context, so under
memory pressure writeback from kswapd can be arbitrarily delayed by
memory reclaim.

To avoid this, pass the kswapd context to the allocation being done
by the workqueue, so that memory reclaim understands correctly that
the work is being done for kswapd and therefore it is not blocked
and does not delay memory reclaim.

To avoid issues with int->char conversion of flag fields (as noticed
in v1 of this patch) convert the flag fields in the struct
xfs_bmalloca to bool types. pahole indicates these variables are
still single byte variables, so no extra space is consumed by this
change.

cc: <stable@vger.kernel.org>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:59:59 +10:00
Dave Chinner b2a21e7a6b xfs: remove redundant geometry information from xfs_da_state
It's carried in state->args->geo, so there's no need to duplicate it
and use more stack space than necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:22:04 +10:00
Dave Chinner c2c4c477e0 xfs: replace attr LBSIZE with xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:21:45 +10:00
Dave Chinner c59f0ad23a xfs: pass xfs_da_args to xfs_attr_leaf_newentsize
As it's only ever called from contexts where the xfs_da_args is
present and contains all the information needed inside the args
structure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:21:27 +10:00
Dave Chinner 33a6039007 xfs: use xfs_da_geometry for block size in attr code
Rather than using the superblock value obtained through the
xfs_mount.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:21:10 +10:00
Dave Chinner bc85178a76 xfs: remove mp->m_dir_geo from directory logging
We don't pass the xfs_da_args or the geometry all the way down to
the directory buffer logging code, hence we have to use
mp->m_dir_geo here. Fix this to use the geometry passed via the
xfs_da_args, and convert all the directory logging functions for
consistency.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:20:54 +10:00
Dave Chinner 53f82db003 xfs: reduce direct usage of mp->m_dir_geo
There are many places in the directory code were we don't pass the
args into and so have to extract the geometry direct from the mount
structure. Push the args or the geometry into these leaf functions
so that we don't need to grab it from the struct xfs_mount.

This, in turn, brings use to the point where directory geometry is
no longer a property of the struct xfs_mount; it is not a global
property anymore, and hence we can start to consider per-directory
configuration of physical geometries.

Start by converting the xfs_dir_isblock/leaf code - pass in the
xfs_da_args and convert the readdir code to use xfs_da_args like
the rest of the directory code to pass information around.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:20:32 +10:00
Dave Chinner 7ab610f9e0 xfs: move node entry counts to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:20:02 +10:00
Dave Chinner ed358c0058 xfs: convert dir/attr btree threshold to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:18:10 +10:00
Dave Chinner 8f66193c89 xfs: convert m_dirblksize to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:15:59 +10:00
Dave Chinner d6cf13051f xfs: convert m_dirblkfsbs to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:14:11 +10:00
Dave Chinner 7dda6e8644 xfs: convert directory segment limits to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:11:18 +10:00
Dave Chinner 30028030b1 xfs: convert directory db conversion to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:08:18 +10:00
Dave Chinner 2998ab1d45 xfs: convert directory dablk conversion to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:07:53 +10:00
Dave Chinner 9b3b5522d3 xfs: convert dir byte/off conversion to xfs_da_geometry
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:06:53 +10:00
Dave Chinner 8c44a28561 xfs: kill XFS_DIR2...FIRSTDB macros
They are just simple wrappers around xfs_dir2_byte_to_db(), and
we've already removed one usage earlier in the patch set. Kill
the rest before we start removing the xfs_mount from conversion
functions.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:04:41 +10:00
Dave Chinner 892e3f342f xfs: move directory block translatiosn to xfs_dir2_priv.h
Because they aren't actually part of the on-disk format, and so
shouldn't be in xfs_da_format.h.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:04:05 +10:00
Dave Chinner 0650b55497 xfs: introduce directory geometry structure
The directory code has a dependency on the struct xfs_mount to
supply the directory block geometry. Block size, block log size,
and other parameters are pre-caclulated in the struct xfs_mount or
access directly from the superblock embedded in the struct
xfs_mount.

Extract all of this geometry information out of the struct xfs_mount
and superblock and place it into a new struct xfs_da_geometry
defined by the directory code. Allocate and initialise it at mount
time, and attach it to the struct xfs_mount so it canbe passed back
into the directory code appropriately rather than using the struct
xfs_mount.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-06-06 15:01:58 +10:00
Dave Chinner b70f14e1ff Merge branch 'xfs-feature-bit-cleanup' into for-next
Conflicts:
	fs/xfs/xfs_inode.c
2014-05-20 08:57:02 +10:00
Dave Chinner 0d907a3bb4 Merge branch 'xfs-misc-fixes-2-for-3.16' into for-next
Conflicts:
	fs/xfs/xfs_ialloc.c
2014-05-20 08:56:00 +10:00
Roger Willcocks 376c2f3a5f xfs: fix compile error when libxfs header used in C++ code
xfs_ialloc.h:102: error: expected ',' or '...' before 'delete'

Simple parameter rename, no changes to behaviour.

Signed-off-by: Roger Willcocks <roger@filmlight.ltd.uk>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:52:21 +10:00
Jie Liu 8695d27ec3 xfs: fix infinite loop at xfs_vm_writepage on 32bit system
Write to a file with an offset greater than 16TB on 32-bit system and
then trigger page write-back via sync(1) will cause task hang.

# block_size=4096
# offset=$(((2**32 - 1) * $block_size))
# xfs_io -f -c "pwrite $offset $block_size" /storage/test_file
# sync

INFO: task sync:2590 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
sync            D c1064a28     0  2590   2097 0x00000000
.....
Call Trace:
[<c1064a28>] ? ttwu_do_wakeup+0x18/0x130
[<c1066d0e>] ? try_to_wake_up+0x1ce/0x220
[<c1066dbf>] ? wake_up_process+0x1f/0x40
[<c104fc2e>] ? wake_up_worker+0x1e/0x30
[<c15b6083>] schedule+0x23/0x60
[<c15b3c2d>] schedule_timeout+0x18d/0x1f0
[<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
[<c10515f1>] ? __queue_delayed_work+0x91/0x150
[<c12a12ef>] ? do_raw_spin_lock+0x3f/0x100
[<c12a143e>] ? do_raw_spin_unlock+0x4e/0x90
[<c15b5b5d>] wait_for_completion+0x7d/0xc0
[<c1066d60>] ? try_to_wake_up+0x220/0x220
[<c116a4d2>] sync_inodes_sb+0x92/0x180
[<c116fb05>] sync_inodes_one_sb+0x15/0x20
[<c114a8f8>] iterate_supers+0xb8/0xc0
[<c116faf0>] ? fdatawrite_one_bdev+0x20/0x20
[<c116fc21>] sys_sync+0x31/0x80
[<c15be18d>] sysenter_do_call+0x12/0x28

This issue can be triggered via xfstests/generic/308.

The reason is that the end_index is unsigned long with maximum value
'2^32-1=4294967295' on 32-bit platform, and the given offset cause it
wrapped to 0, so that the following codes will repeat again and again
until the task schedule time out:

end_index = offset >> PAGE_CACHE_SHIFT;
last_index = (offset - 1) >> PAGE_CACHE_SHIFT;
if (page->index >= end_index) {
	unsigned offset_into_page = offset & (PAGE_CACHE_SIZE - 1);
        /*
         * Just skip the page if it is fully outside i_size, e.g. due
         * to a truncate operation that is in progress.
         */
        if (page->index >= end_index + 1 || offset_into_page == 0) {
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		unlock_page(page);
		return 0;
	}

In order to check if a page is fully outsids i_size or not, we can fix
the code logic as below:
	if (page->index > end_index ||
	    (page->index == end_index && offset_into_page == 0))

Secondly, there still has another similar issue when calculating the
end offset for mapping the filesystem blocks to the file blocks for
delalloc.  With the same tests to above, run unmount(8) will cause
kernel panic if CONFIG_XFS_DEBUG is enabled:

XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) || \
	ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 964

kernel BUG at fs/xfs/xfs_message.c:108!
invalid opcode: 0000 [#1] SMP
task: edddc100 ti: ec6ee000 task.ti: ec6ee000
EIP: 0060:[<f83d87cb>] EFLAGS: 00010296 CPU: 1
EIP is at assfail+0x2b/0x30 [xfs]
..............
Call Trace:
[<f83d9cd4>] xfs_fs_destroy_inode+0x74/0x120 [xfs]
[<c115ddf1>] destroy_inode+0x31/0x50
[<c115deff>] evict+0xef/0x170
[<c115dfb2>] dispose_list+0x32/0x40
[<c115ea3a>] evict_inodes+0xca/0xe0
[<c1149706>] generic_shutdown_super+0x46/0xd0
[<c11497b9>] kill_block_super+0x29/0x70
[<c1149a14>] deactivate_locked_super+0x44/0x70
[<c114a427>] deactivate_super+0x47/0x60
[<c1161c3d>] mntput_no_expire+0xcd/0x120
[<c1162ae8>] SyS_umount+0xa8/0x370
[<c1162dce>] SyS_oldumount+0x1e/0x20
[<c15be18d>] sysenter_do_call+0x12/0x28

That because the end_offset is evaluated to 0 which is the same reason
to above, hence the mapping and covertion for dealloc file blocks to
file system blocks did not happened.

This patch just fixed both issues.

Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:24:26 +10:00
Dave Chinner 7c166350b1 xfs: remove redundant checks from xfs_da_read_buf
All of the verification checks of magic numbers are now done by
verifiers, so ther eis no need to check them again once the buffer
has been successfully read. If the magic number is bad, it won't
even get to that code to verify it so it really serves no purpose at
all anymore. Remove it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:23:06 +10:00
Dave Chinner 110dc24ad2 xfs: log vector rounding leaks log space
The addition of direct formatting of log items into the CIL
linear buffer added alignment restrictions that the start of each
vector needed to be 64 bit aligned. Hence padding was added in
xlog_finish_iovec() to round up the vector length to ensure the next
vector started with the correct alignment.

This adds a small number of bytes to the size of
the linear buffer that is otherwise unused. The issue is that we
then use the linear buffer size to determine the log space used by
the log item, and this includes the unused space. Hence when we
account for space used by the log item, it's more than is actually
written into the iclogs, and hence we slowly leak this space.

This results on log hangs when reserving space, with threads getting
stuck with these stack traces:

Call Trace:
[<ffffffff81d15989>] schedule+0x29/0x70
[<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0
[<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140
[<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220
[<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310
.....

The 4 bytes is significant. Brain Foster did all the hard work in
tracking down a reproducable leak to inode chunk allocation (it went
away with the ikeep mount option). His rough numbers were that
creating 50,000 inodes leaked 11 log blocks. This turns out to be
roughly 800 inode chunks or 1600 inode cluster buffers. That
works out at roughly 4 bytes per cluster buffer logged, and at that
I started looking for a 4 byte leak in the buffer logging code.

What I found was that a struct xfs_buf_log_format structure for an
inode cluster buffer is 28 bytes in length. This gets rounded up to
32 bytes, but the vector length remains 28 bytes. Hence the CIL
ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
for that vector rather than 28 bytes which are written into the log.

The fix for this problem is to separately track the bytes used by
the log vectors in the item and use that instead of the buffer
length when accounting for the log space that will be used by the
formatted log item.

Again, thanks to Brian Foster for doing all the hard work and long
hours to isolate this leak and make finding the bug relatively
simple.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:18:09 +10:00
Namjae Jeon ce576f1c56 xfs: remove XFS_TRANS_RESERVE in collapse range
There is no need to dip into reserve pool. Reserve pool is used for much
more important things. And xfs_trans_reserve will never return ENOSPC
because punch hole is already done. If we get ENOSPC, collapse range
will be simply failed.

Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 08:15:57 +10:00
Dave Chinner ab3e57b53f xfs: remove shared supberlock feature checking
We reject any filesystem that is mounted with this feature bit set,
so we don't need to check for it anywhere else. Remove the function
for checking if the feature bit is set and any code that uses it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 07:47:05 +10:00
Dave Chinner 5d074a4f80 xfs: don't need dirv2 checks anymore
If the the V2 directory feature bit is not set in the superblock
feature mask the filesystem will fail the good version check.
Hence we don't need any other version checking on the dir2 feature
bit in the code as the filesystem will not mount without it set.
Remove the checking code.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 07:46:55 +10:00
Dave Chinner 263997a684 xfs: turn NLINK feature on by default
mkfs has turned on the XFS_SB_VERSION_NLINKBIT feature bit by
default since November 2007. It's about time we simply made the
kernel code turn it on by default and so always convert v1 inodes to
v2 inodes when reading them in from disk or allocating them. This
This removes needless version checks and modification when bumping
link counts on inodes, and will take code out of a few common code
paths.

   text    data     bss     dec     hex filename
 783251  100867     616  884734   d7ffe fs/xfs/xfs.o.orig
 782664  100867     616  884147   d7db3 fs/xfs/xfs.o.patched

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 07:46:40 +10:00
Dave Chinner 32bf1deae1 xfs: keep sb_bad_features2 the same a sb_features2
Whenever we update sb_features2, we need to update sb_bad_features2
so that they remain identical on disk. This prevents future mounts
or userspace utilities from getting confused over which features the
filesystem supports.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 07:41:43 +10:00
Dave Chinner f68a373525 xfs: make superblock version checks reflect reality
We only support filesystems that have v2 directory support, and than
means all the checking and handling of superblock versions prior to
this support being added is completely unnecessary overhead.

Strip out all the version 1-3 support, sanitise the good version
checking to reflect the supported versions, update all the feature
supported functions and clean up all the support bit definitions to
reflect the fact that we no longer care about Irix bootloader flag
regions for v4 feature bits. Also, convert the return values to
boolean types and remove typedefs from function declarations to
clean up calling conventions, too.

Because the feature bit checking is all inline code, this relatively
small cleanup has a noticable impact on code size:

   text    data     bss     dec     hex filename
 785195  100867     616  886678   d8796 fs/xfs/xfs.o.orig
 783595  100867     616  885078   d8156 fs/xfs/xfs.o.patched

i.e. it reduces it by 1600 bytes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-20 07:41:16 +10:00
Dave Chinner 2d6dcc6d7e Merge branch 'xfs-attr-cleanup' into for-next
Conflicts:
	fs/xfs/xfs_attr.c
2014-05-15 09:39:28 +10:00
Dave Chinner ff14ee42a0 Merge branch 'xfs-misc-fixes-1-for-3.16' into for-next 2014-05-15 09:38:15 +10:00
Dave Chinner b76769294b Merge branch 'xfs-free-inode-btree' into for-next 2014-05-15 09:37:44 +10:00
Dave Chinner 232c2f5c65 Merge branch 'xfs-filestreams-lookup' into for-next 2014-05-15 09:36:59 +10:00
Dave Chinner fdd3a2ae2e Merge branch 'xfs-unused-args-cleanup' into for-next 2014-05-15 09:36:35 +10:00
Dave Chinner ee4eec478b xfs: list_lru_init returns a negative error
And we don't invert it properly when initialising the dquot lru
list.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:23:24 +10:00
Dave Chinner bc147822d5 xfs: negate xfs_icsb_init_counters error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:23:07 +10:00
Dave Chinner 45687642e4 xfs: negate mount workqueue init error value
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:53 +10:00
Dave Chinner 6670232b48 xfs: fix wrong err sign on xfs_set_acl()
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:37 +10:00
Dave Chinner a5a14de22e xfs: fix wrong errno from xfs_initxattrs
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:21 +10:00
Dave Chinner 65149e3fab xfs: correct error sign on COLLAPSE_RANGE errors
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:22:07 +10:00
Dave Chinner b38a134b22 xfs: xfs_commit_metadata returns wrong errno
Invert it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:52 +10:00
Dave Chinner 8ff1e6705a xfs: fix incorrect error sign in xfs_file_aio_read
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:37 +10:00
Dave Chinner 43ec1460a2 xfs: xfs_dir_fsync() returns positive errno
And it should be negative.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-15 09:21:11 +10:00
Christoph Hellwig 6c888af0b4 xfs: pass struct da_args to xfs_attr_calc_size
And remove a very confused comment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:40:19 +10:00
Christoph Hellwig 67fd718f30 xfs: simplify attr name setup
Replace xfs_attr_name_to_xname with a new xfs_attr_args_init helper that
sets up the basic da_args structure without using a temporary xfs_name
structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:43 +10:00
Christoph Hellwig 1bc426a76b xfs: fold xfs_attr_remove_int into xfs_attr_remove
Also remove a useless ilock roundtrip for the first attr fork check, it's
racy anyway and we redo it later under the ilock before we start the removal.

Plus various minor style fixes to the new xfs_attr_remove.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:33 +10:00
Christoph Hellwig b87d022c27 xfs: fold xfs_attr_get_int into xfs_attr_get
This allows doing an unlocked check if an attr for is present at all and
slightly reduce the lock hold time if we actually do an attr get.

Plus various minor style fixes to the new xfs_attr_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:24 +10:00
Christoph Hellwig c5b4ac39a4 xfs: fold xfs_attr_set_int into xfs_attr_set
Plus various minor style fixes to the new xfs_attr_set.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-13 16:34:14 +10:00
Linus Torvalds afcf0a2d92 Fixes for 3.15-rc5:
- fix a remote attribute size calculation bug that leads to a
   transaction overrun
 - add default ACLs to O_TMPFILE files
 - Remove the EXPERIMENTAL tag from filesystems with metadata CRC
   support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTa+0vAAoJEK3oKUf0dfodQGkQAI4IzVYr4K02Rj7UcbZaGlUM
 R9OAATojaQ8/hPDshzrhyLK7/KvF2tJvH9PhP6zj3bko3fmhofJ5/mB6LYIsQtt+
 3rNd/jij9Icmq4+ouZPRDl00nJdnCZjJcfYys6N/tXwLNIvwKP04vjB4QoC1rxVv
 j6L85yUkMpPohA0Wbf+PKrTVJDrtTOe+YpczciYgGKHr0YF27Bdy6iYSU3KvTvd+
 wuqXvGAc9ARZDsrVHt8t6eh9OKRRk1RAV5vdwGwucBrVlnxGaspvia/85JyU3Kv0
 F2EQ3fWcGQs5ydQjpvSZlEIttDqBDn/LiuncNctXIUvHpr+MQ73XMVrNLoNY1m6d
 wQqXFQXT4e/vzJTXyQz/jYgzGl5t9Lvf/1Z5lFHliqhaBm1aNMhdjfCZhEpehoaQ
 09JSVj8ZKLHZt3yRgwkZdOmM0bl4thJmY1Wf5O2EPMrk3NE3nZKiNG+W2U/sSFti
 i12M4uVgInmeHoDIWFNL9kXp3fs+gr6HF5BNQOulm0ywzG3U1ozWGyKsnRmpPFQr
 995voVKZKDP410wzp98UKpjXalmonYuTFLNUDEEjr2UKUWq6fRpvDdSeBSRirGxP
 kdwfpgCZHDJlZEY7d4lv4Pv6L84KgYYHQpmbaFcPEAmJmlMZ4web1KqHl8TDy1hT
 Z+STYvTImpXV9sP5TZYT
 =79c6
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs

Pull xfs fixes from Dave Chinner:
 "The main fix is adding support for default ACLs on O_TMPFILE opened
  inodes to bring XFS into line with other filesystems.  Metadata CRCs
  are now also considered well enough tested to be fully supported, so
  we're removing the shouty warnings issued at mount time for
  filesystems with that format.  And there's transaction block
  reservation overrun fix.

  Summary:
   - fix a remote attribute size calculation bug that leads to a
     transaction overrun
   - add default ACLs to O_TMPFILE files
   - Remove the EXPERIMENTAL tag from filesystems with metadata CRC
     support"

* tag 'xfs-for-linus-3.15-rc5' of git://oss.sgi.com/xfs/xfs:
  xfs: remote attribute overwrite causes transaction overrun
  xfs: initialize default acls for ->tmpfile()
  xfs: fully support v5 format filesystems
2014-05-08 19:20:45 -07:00
Dave Chinner 8cfcc3e565 xfs: fix directory readahead offset off-by-one
Directory readahead can throw loud scary but harmless warnings
when multiblock directories are in use a specific pattern of
discontiguous blocks are found in the directory. That is, if a hole
follows a discontiguous block, it will throw a warning like:

XFS (dm-1): xfs_da_do_buf: bno 637 dir: inode 34363923462
XFS (dm-1): [00] br_startoff 637 br_startblock 1917954575 br_blockcount 1 br_state 0
XFS (dm-1): [01] br_startoff 638 br_startblock -2 br_blockcount 1 br_state 0

And dump a stack trace.

This is because the readahead offset increment loop does a double
increment of the block index - it does an increment for the loop
iteration as well as increase the loop counter by the number of
blocks in the extent. As a result, the readahead offset does not get
incremented correctly for discontiguous blocks and hence can ask for
readahead of a directory block from an offset part way through a
directory block.  If that directory block is followed by a hole, it
will trigger a mapping warning like the above.

The bad readahead will be ignored, though, because the main
directory block read loop uses the correct mapping offsets rather
than the readahead offset and so will ignore the bad readahead
altogether.

Fix the warning by ensuring that the readahead offset is correctly
incremented.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:52 +10:00
Dave Chinner ac983517ec xfs: don't sleep in xlog_cil_force_lsn on shutdown
Reports of a shutdown hang when fsyncing a directory have surfaced,
such as this:

[ 3663.394472] Call Trace:
[ 3663.397199]  [<ffffffff815f1889>] schedule+0x29/0x70
[ 3663.402743]  [<ffffffffa01feda5>] xlog_cil_force_lsn+0x185/0x1a0 [xfs]
[ 3663.416249]  [<ffffffffa01fd3af>] _xfs_log_force_lsn+0x6f/0x2f0 [xfs]
[ 3663.429271]  [<ffffffffa01a339d>] xfs_dir_fsync+0x7d/0xe0 [xfs]
[ 3663.435873]  [<ffffffff811df8c5>] do_fsync+0x65/0xa0
[ 3663.441408]  [<ffffffff811dfbc0>] SyS_fsync+0x10/0x20
[ 3663.447043]  [<ffffffff815fc7d9>] system_call_fastpath+0x16/0x1b

If we trigger a shutdown in xlog_cil_push() from xlog_write(), we
will never wake waiters on the current push sequence number, so
anything waiting in xlog_cil_force_lsn() for that push sequence
number to come up will not get woken and hence stall the shutdown.

Fix this by ensuring we call wake_up_all(&cil->xc_commit_wait) in
the push abort handling, in the log shutdown code when waking all
waiters, and adding a shutdown check in the sequence completion wait
loops to ensure they abort when a wakeup due to a shutdown occurs.

Reported-by: Boris Ranto <branto@redhat.com>
Reported-by: Eric Sandeen <esandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:50 +10:00
Dave Chinner 49abc3a8f8 xfs: truncate_setsize should be outside transactions
truncate_setsize() removes pages from the page cache, and hence
requires page locks to be held. It is not valid to lock a page cache
page inside a transaction context as we can hold page locks when we
we reserve space for a transaction. If we do, then we expose an ABBA
deadlock between log space reservation and page locks.

That is, both the write path and writeback lock a page, then start a
transaction for block allocation, which means they can block waiting
for a log reservation with the page lock held. If we hold a log
reservation and then do something that locks a page (e.g.
truncate_setsize in xfs_setattr_size) then that page lock can block
on the page locked and waiting for a log reservation. If the
transaction that is waiting for the page lock is the only active
transaction in the system that can free log space via a commit,
then writeback will never make progress and so log space will never
free up.

This issue with xfs_setattr_size() was introduced back in 2010 by
commit fa9b227 ("xfs: new truncate sequence") which moved the page
cache truncate from outside the transaction context (what was
xfs_itruncate_data()) to inside the transaction context as a call to
truncate_setsize().

The reason truncate_setsize() was located where in this place was
that we can't shouldn't change the file size until after we are in
the transaction context and the operation will either succeed or
shut down the filesystem on failure. However, block_truncate_page()
already modifies the file contents before we enter the transaction
context, so we can't really fulfill this guarantee in any way. Hence
we may as well ensure that on success or failure, the in-memory
inode and data is truncated away and that the application cleans up
the mess appropriately.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-07 08:05:45 +10:00
Al Viro bf97f3bc0c xfs: switch to ->write_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:39:40 -04:00
Al Viro b4f5d2c6d1 xfs: switch to ->read_iter()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:37:56 -04:00
Al Viro b318891929 xfs: trim the argument lists of xfs_file_{dio,buffered}_aio_write()
pos is redundant (it's iocb->ki_pos), and iov/nr_segs/count are taken
care of by lifting iov_iter into the caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:55 -04:00
Al Viro 0c949334a9 iov_iter_truncate()
Now It Can Be Done(tm) - we don't need to do iov_shorten() in
generic_file_direct_write() anymore, now that all ->direct_IO()
instances are converted to proper iov_iter methods and honour
iter->count and iter->iov_offset properly.

Get rid of count/ocount arguments of generic_file_direct_write(),
while we are at it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:54 -04:00
Al Viro 71d8e532b1 start adding the tag to iov_iter
For now, just use the same thing we pass to ->direct_IO() - it's all
iovec-based at the moment.  Pass it explicitly to iov_iter_init() and
account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO()
uses.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:49 -04:00
Al Viro 31b140398c switch {__,}blockdev_direct_IO() to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:46 -04:00
Al Viro a6cbcd4a4a get rid of pointless iov_length() in ->direct_IO()
all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:45 -04:00
Al Viro d8d3d94b80 pass iov_iter to ->direct_IO()
unmodified, for now

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:44 -04:00
Al Viro cb66a7a1f1 kill generic_segment_checks()
all callers of ->aio_read() and ->aio_write() have iov/nr_segs already
checked - generic_segment_checks() done after that is just an odd way
to spell iov_length().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:43 -04:00
Al Viro f8579f8673 generic_file_direct_write(): switch to iov_iter
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-05-06 17:32:42 -04:00
Dave Chinner 8275cdd0e7 xfs: remote attribute overwrite causes transaction overrun
Commit e461fcb ("xfs: remote attribute lookups require the value
length") passes the remote attribute length in the xfs_da_args
structure on lookup so that CRC calculations and validity checking
can be performed correctly by related code. This, unfortunately has
the side effect of changing the args->valuelen parameter in cases
where it shouldn't.

That is, when we replace a remote attribute, the incoming
replacement stores the value and length in args->value and
args->valuelen, but then the lookup which finds the existing remote
attribute overwrites args->valuelen with the length of the remote
attribute being replaced. Hence when we go to create the new
attribute, we create it of the size of the existing remote
attribute, not the size it is supposed to be. When the new attribute
is much smaller than the old attribute, this results in a
transaction overrun and an ASSERT() failure on a debug kernel:

XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, file: fs/xfs/xfs_trans.c, line: 331

Fix this by keeping the remote attribute value length separate to
the attribute value length in the xfs_da_args structure. The enables
us to pass the length of the remote attribute to be removed without
overwriting the new attribute's length.

Also, ensure that when we save remote block contexts for a later
rename we zero the original state variables so that we don't confuse
the state of the attribute to be removes with the state of the new
attribute that we just added. [Spotted by Brain Foster.]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06 07:37:31 +10:00
Brian Foster d540e43b0a xfs: initialize default acls for ->tmpfile()
The current tmpfile handler does not initialize default ACLs. Doing so
within xfs_vn_tmpfile() makes it roughly equivalent to xfs_vn_mknod(),
which is already used as a common create handler.

xfs_vn_mknod() does not currently have a mechanism to determine whether
to link the file into the namespace. Therefore, further abstract
xfs_vn_mknod() into a new xfs_generic_create() handler with a tmpfile
parameter. This new handler calls xfs_create_tmpfile() and d_tmpfile()
on the dentry when called via ->tmpfile().

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-06 07:34:28 +10:00
From: Tuomas Tynkkynen b28fd7b5fe xfs: Fix wrong error codes being returned
xfs_{compat_,}attrmulti_by_handle could return an errno with incorrect
sign in some cases. While at it, make sure ENOMEM is returned instead of
E2BIG if kmalloc fails.

Signed-off-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:30:20 +10:00
Dave Chinner 3c35337576 xfs: remove dquot hints
group and project quota hints are currently stored on the user
dquot. If we are attaching quotas to the inode, then the group and
project dquots are stored as hints on the user dquot to save having
to look them up again later.

The thing is, the hints are not used for that inode for the rest of
the life of the inode - the dquots are attached directly to the
inode itself - so the only time the hints are used is when an inode
first has dquots attached.

When the hints on the user dquot don't match the dquots being
attache dto the inode, they are then removed and replaced with the
new hints. If a user is concurrently modifying files in different
group and/or project contexts, then this leads to thrashing of the
hints attached to user dquot.

If user quotas are not enabled, then hints are never even used.

So, if the hints are used to avoid the cost of the lookup, is the
cost of the lookup significant enough to justify the hint
infrstructure? Maybe it was once, when there was a global quota
manager shared between all XFS filesystems and was hash table based.

However, lookups are now much simpler, requiring only a single lock and
radix tree lookup local to the filesystem and no hash or LRU
manipulations to be made. Hence the cost of lookup is much lower
than when hints were implemented. Turns out that benchmarks show
that, too, with thir being no differnce in performance when doing
file creation workloads as a single user with user, group and
project quotas enabled - the hints do not make the code go any
faster. In fact, removing the hints shows a 2-3% reduction in the
time it takes to create 50 million inodes....

So, let's just get rid of the hints and the complexity around them.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:30:15 +10:00
Eric Sandeen f58522c5a4 xfs: bulletfproof xfs_qm_scall_trunc_qfiles()
Coverity noticed that if we sent junk into
xfs_qm_scall_trunc_qfiles(), we could get back an
uninitialized error value.  So sanitize the flags we
will accept, and initialize error anyway for good measure.

(This bug may have been introduced via c61a9e39).

Should resolve Coverity CID 1163872.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:27:06 +10:00
Eric Sandeen 9da93f9b7c xfs: fix Q_XQUOTARM ioctl
The Q_XQUOTARM quotactl was not working properly, because
we weren't passing around proper flags.  The xfs_fs_set_xstate()
ioctl handler used the same flags for Q_XQUOTAON/OFF as
well as for Q_XQUOTARM, but Q_XQUOTAON/OFF look for
XFS_UQUOTA_ACCT, XFS_UQUOTA_ENFD, XFS_GQUOTA_ACCT etc,
i.e. quota type + state, while Q_XQUOTARM looks only for
the type of quota, i.e. XFS_DQ_USER, XFS_DQ_GROUP etc.

Unfortunately these flag spaces overlap a bit, so we
got semi-random results for Q_XQUOTARM; i.e. the value
for XFS_DQ_USER == XFS_UQUOTA_ACCT, etc.  yeargh.

Add a new quotactl op vector specifically for the QUOTARM
operation, since it operates with a different flag space.

This has been broken more or less forever, AFAICT.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 17:25:50 +10:00
Dave Chinner c99d609a16 xfs: fully support v5 format filesystems
We have had this code in the kernel for over a year now and have
shaken all the known issues out of the code over the past few
releases. It's now time to remove the experimental warnings during
mount and fully support the new filesystem format in production
systems.

Remove the experimental warning, and add a version number to the
initial "mounting filesystem" message to tell use what type of
filesystem is being mounted. Also, remove the temporary inode
cluster size output at mount time now we know that this code works
fine.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-05-05 16:18:37 +10:00
Brian Foster 53801fd97a xfs: enable the finobt feature on v5 superblocks
Add the finobt feature bit to the list of known features. As of
this point, the kernel code knows how to mount and manage both
finobt and non-finobt formatted filesystems.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:42 +10:00
Brian Foster 0c153c1e43 xfs: report finobt status in fs geometry
Define the XFS_FSOP_GEOM_FLAGS_FINOBT fs geometry flag and set the
associated bit if the filesystem supports the free inode btree.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:41 +10:00
Brian Foster a3fa516dd8 xfs: add finobt support to growfs
Add finobt support to growfs. Initialize the agi root/level fields
and the root finobt block.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:01:39 +10:00
Brian Foster 3efa4ffd58 xfs: update the finobt on inode free
An inode free operation can have several effects on the finobt. If
all inodes have been freed and the chunk deallocated, we remove the
finobt record. If the inode chunk was previously full, we must
insert a new record based on the existing inobt record. Otherwise,
we modify the record in place.

Create the xfs_difree_finobt() function to identify the potential
scenarios and update the finobt appropriately.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster 2b64ee5cdc xfs: refactor xfs_difree() inobt bits into xfs_difree_inobt() helper
Refactor xfs_difree() in preparation for the finobt. xfs_difree()
performs the validity checks against the ag and reads the agi
header. The work of physically updating the inode allocation btree
is pushed down into the new xfs_difree_inobt() helper.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster 6dd8638e4e xfs: use and update the finobt on inode allocation
Replace xfs_dialloc_ag() with an implementation that looks for a
record in the finobt. The finobt only tracks records with at least
one free inode. This eliminates the need for the intra-ag scan in
the original algorithm. Once the inode is allocated, update the
finobt appropriately (possibly removing the record) as well as the
inobt.

Move the original xfs_dialloc_ag() algorithm to
xfs_dialloc_ag_inobt() and fall back as such if finobt support is
not enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster 0aa0a756ec xfs: insert newly allocated inode chunks into the finobt
A newly allocated inode chunk, by definition, has at least one
free inode, so a record is always inserted into the finobt.

Create the xfs_inobt_insert() helper from existing code to insert
a record in an inobt based on the provided BTNUM. Update
xfs_ialloc_ag_alloc() to invoke the helper for the existing
XFS_BTNUM_INO tree and XFS_BTNUM_FINO tree, if enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:53 +10:00
Brian Foster 9d43b180af xfs: update inode allocation/free transaction reservations for finobt
Create the xfs_calc_finobt_res() helper to calculate the finobt log
reservation for inode allocation and free. Update
XFS_IALLOC_SPACE_RES() to reserve blocks for the additional finobt
insertion on inode allocation. Create XFS_IFREE_SPACE_RES() to
reserve blocks for the potential finobt record insertion on inode
free (i.e., if an inode chunk was previously fully allocated).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster aafc3c2465 xfs: support the XFS_BTNUM_FINOBT free inode btree type
Define the AGI fields for the finobt root/level and add magic
numbers. Update the btree code to add support for the new
XFS_BTNUM_FINOBT inode btree.

The finobt root block is reserved immediately following the inobt
root block in the AG. Update XFS_PREALLOC_BLOCKS() to determine the
starting AG data block based on whether finobt support is enabled.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster 8e2c84df20 xfs: reserve v5 superblock read-only compat. feature bit for finobt
Reserve a v5 read-only compatibility feature bit for the finobt and
create the xfs_sb_version_hasfinobt() helper to determine whether
an fs has the feature enabled.

The finobt does not change existing on-disk structures, but must
remain consistent with the ialloc btree. Modifications from older
kernels would violate that constrant. Therefore, we restrict older
kernels to read-only mounts of finobt-enabled filesystems.

Note that this does not yet enable the ability to rw mount a finobt
fs (by setting the feature bit in the XFS_SB_FEAT_RO_COMPAT_ALL
mask).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:52 +10:00
Brian Foster 57bd3dbe40 xfs: refactor xfs_ialloc_btree.c to support multiple inobt numbers
The introduction of the free inode btree (finobt) requires that
xfs_ialloc_btree.c handle multiple trees. Refactor xfs_ialloc_btree.c
so the caller specifies the btree type on cursor initialization to
prepare for addition of the finobt.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-24 16:00:50 +10:00
Christoph Hellwig b94acd4786 xfs: add filestream allocator tracepoints
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:52 +10:00
Christoph Hellwig 3b8d90766a xfs: remove xfs_filestream_associate
There is no good reason to create a filestream when a directory entry
is created.  Delay it until the first allocation happens to simply
the code and reduce the amount of mru cache lookups we do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:52 +10:00
Christoph Hellwig 1919adda07 xfs: don't create a slab cache for filestream items
We only have very few of these around, and allocation isn't that
much of a hot path.  Remove the slab cache to simplify the code,
and to not waste any resources for the usual case of not having
any inodes that use the filestream allocator.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig 2cd2ef6a30 xfs: rewrite the filestream allocator using the dentry cache
In Linux we will always be able to find a parent inode for file that are
undergoing I/O.  Use this to simply the file stream allocator by only
keeping track of parent inodes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-23 07:11:51 +10:00
Christoph Hellwig f37211c336 xfs: remove XFS_IFILESTREAM
We never test the flag except in xfs_inode_is_filestream, but that
function already tests the on-disk flag or filesystem wide flags,
and is used to decide if we want to set XFS_IFILESTREAM in the
first place.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig 22328d712d xfs: embedd mru_elem into parent structure
There is no need to do a separate allocation for each mru element, just
embedd the structure into the parent one in the user.  Besides saving
a memory allocation and the infrastructure required for it this also
simplifies the API.

While we do major surgery on xfs_mru_cache.c also de-typedef it and
make struct mru_cache private to the implementation file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:51 +10:00
Christoph Hellwig ce695c6551 xfs: handle duplicate entries in xfs_mru_cache_insert
The radix tree code can detect and reject duplicate keys at insert
time.  Make xfs_mru_cache_insert handle this case so that future
changes to the filestream allocator can take advantage of this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-23 07:11:50 +10:00
Christoph Hellwig c977eb1065 xfs: split xfs_bmap_btalloc_nullfb
Split xfs_bmap_btalloc_nullfb into one function for filestream allocations
and one for everything else that share a few helpers.  This dramatically
simplifies the control flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2014-04-23 07:11:41 +10:00
Linus Torvalds 9ac0367501 These are regression and bug fixes for ext4.
We had a number of new features in ext4 during this merge window
 (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
 there were many more regression and bug fixes this time around.  It
 didn't help that xfstests hadn't been fully updated to fully stress
 test COLLAPSE_RANGE until after -rc1.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTVIEUAAoJENNvdpvBGATwnKkQANlzQv6BhgzCa0b5Iu0SkHeD
 OuLAtPFYE5OVEK22oWT0H76gBi71RHLboHwThd+ZfEeEPvyfs42wY0J/PV/R9dHx
 kwhU+MaDDzugfVj3gg29DpYNLQkL/evq0vlNbrRk5je877c2I8JbXV/aAoTVFZoH
 NGOsagwBqWCsgL5nSOk/nEZSRX2AzSCkgmOVxylLzFoyTUkX3vZx8G8XtS1zRgbH
 b1yOWIK1Ifj7tmBZ4HLpNiK6/NpHAHeHRFiaCQxY0hkLjUeMyVNJfZzXS/Fzp8DP
 p1/nm5z9PaFj4nyBC1Wvh9Z6Lj0zQ0ap73LV+w4fHM1SZub3XY+hvyXj/8qMNaSc
 lLIGwa2AZFpurbKKn6MZTi5CubVLZs6PZKzDgYURnEcJCgeMujMOxbKekcL5sP9E
 Gb6Hh9I/f08HagCRox5O0W7f0/TBY5bFryx5kQQZUtpcRmnY3m7cohSkn6WriwTZ
 zYApOZMZkFX5spSeYsfyi8K8wHij/5mXvm7qeqQ0Rj4Ehycd+7jwltOCVXAYN29+
 zSKaBaxH2+V7zuGHSxjDFbOOlPotTFNzGmFh08DPTF4Vgnc9uMlLo0Oz8ADFDcT2
 JZ4pAFTEREnHOATNl5bAEi8wNrU/Ln9IGhlYCYI9X5BQXjf9oPXcYwQT/lKCb07s
 ks8ujfry1R/gjQGuv+LH
 =gi42
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "These are regression and bug fixes for ext4.

  We had a number of new features in ext4 during this merge window
  (ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
  there were many more regression and bug fixes this time around.  It
  didn't help that xfstests hadn't been fully updated to fully stress
  test COLLAPSE_RANGE until after -rc1"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
  ext4: disable COLLAPSE_RANGE for bigalloc
  ext4: fix COLLAPSE_RANGE failure with 1KB block size
  ext4: use EINVAL if not a regular file in ext4_collapse_range()
  ext4: enforce we are operating on a regular file in ext4_zero_range()
  ext4: fix extent merging in ext4_ext_shift_path_extents()
  ext4: discard preallocations after removing space
  ext4: no need to truncate pagecache twice in collapse range
  ext4: fix removing status extents in ext4_collapse_range()
  ext4: use filemap_write_and_wait_range() correctly in collapse range
  ext4: use truncate_pagecache() in collapse range
  ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
  ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
  ext4: always check ext4_ext_find_extent result
  ext4: fix error handling in ext4_ext_shift_extents
  ext4: silence sparse check warning for function ext4_trim_extent
  ext4: COLLAPSE_RANGE only works on extent-based files
  ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
  ext4: use i_size_read in ext4_unaligned_aio()
  fs: disallow all fallocate operation on active swapfile
  fs: move falloc collapse range check into the filesystem methods
  ...
2014-04-20 20:43:47 -07:00
Brian Foster 330033d697 xfs: fix tmpfile/selinux deadlock and initialize security
xfstests generic/004 reproduces an ilock deadlock using the tmpfile
interface when selinux is enabled. This occurs because
xfs_create_tmpfile() takes the ilock and then calls d_tmpfile(). The
latter eventually calls into xfs_xattr_get() which attempts to get the
lock again. E.g.:

xfs_io          D ffffffff81c134c0  4096  3561   3560 0x00000080
ffff8801176a1a68 0000000000000046 ffff8800b401b540 ffff8801176a1fd8
00000000001d5800 00000000001d5800 ffff8800b401b540 ffff8800b401b540
ffff8800b73a6bd0 fffffffeffffffff ffff8800b73a6bd8 ffff8800b5ddb480
Call Trace:
[<ffffffff8177f969>] schedule+0x29/0x70
[<ffffffff81783a65>] rwsem_down_read_failed+0xc5/0x120
[<ffffffffa05aa97f>] ? xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffff813b3434>] call_rwsem_down_read_failed+0x14/0x30
[<ffffffff810ed179>] ? down_read_nested+0x89/0xa0
[<ffffffffa05aa7f2>] ? xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa7f2>] xfs_ilock+0x122/0x250 [xfs]
[<ffffffffa05aa97f>] xfs_ilock_attr_map_shared+0x1f/0x50 [xfs]
[<ffffffffa05701d0>] xfs_attr_get+0x90/0xe0 [xfs]
[<ffffffffa0565e07>] xfs_xattr_get+0x37/0x50 [xfs]
[<ffffffff8124842f>] generic_getxattr+0x4f/0x70
[<ffffffff8133fd9e>] inode_doinit_with_dentry+0x1ae/0x650
[<ffffffff81340e0c>] selinux_d_instantiate+0x1c/0x20
[<ffffffff813351bb>] security_d_instantiate+0x1b/0x30
[<ffffffff81237db0>] d_instantiate+0x50/0x70
[<ffffffff81237e85>] d_tmpfile+0xb5/0xc0
[<ffffffffa05add02>] xfs_create_tmpfile+0x362/0x410 [xfs]
[<ffffffffa0559ac8>] xfs_vn_tmpfile+0x18/0x20 [xfs]
[<ffffffff81230388>] path_openat+0x228/0x6a0
[<ffffffff810230f9>] ? sched_clock+0x9/0x10
[<ffffffff8105a427>] ? kvm_clock_read+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8123101a>] do_filp_open+0x3a/0x90
[<ffffffff817845e7>] ? _raw_spin_unlock+0x27/0x40
[<ffffffff8124054f>] ? __alloc_fd+0xaf/0x1f0
[<ffffffff8121e3ce>] do_sys_open+0x12e/0x210
[<ffffffff8121e4ce>] SyS_open+0x1e/0x20
[<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b

xfs_vn_tmpfile() also fails to initialize security on the newly created
inode.

Pull the d_tmpfile() call up into xfs_vn_tmpfile() after the transaction
has been committed and the inode unlocked. Also, initialize security on
the inode based on the parent directory provided via the tmpfile call.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:30 +10:00
Eric Sandeen 8d6c121018 xfs: fix buffer use after free on IO error
When testing exhaustion of dm snapshots, the following appeared
with CONFIG_DEBUG_OBJECTS_FREE enabled:

ODEBUG: free active (active state 0) object type: work_struct hint: xfs_buf_iodone_work+0x0/0x1d0 [xfs]

indicating that we'd freed a buffer which still had a pending reference,
down this path:

[  190.867975]  [<ffffffff8133e6fb>] debug_check_no_obj_freed+0x22b/0x270
[  190.880820]  [<ffffffff811da1d0>] kmem_cache_free+0xd0/0x370
[  190.892615]  [<ffffffffa02c5924>] xfs_buf_free+0xe4/0x210 [xfs]
[  190.905629]  [<ffffffffa02c6167>] xfs_buf_rele+0xe7/0x270 [xfs]
[  190.911770]  [<ffffffffa034c826>] xfs_trans_read_buf_map+0x7b6/0xac0 [xfs]

At issue is the fact that if IO fails in xfs_buf_iorequest,
we'll queue completion unconditionally, and then call
xfs_buf_rele; but if IO failed, there are no IOs remaining,
and xfs_buf_rele will free the bp while work is still queued.

Fix this by not scheduling completion if the buffer has
an error on it; run it immediately.  The rest is only comment
changes.

Thanks to dchinner for spotting the root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:28 +10:00
Dave Chinner 07d5035a28 xfs: wrong error sign conversion during failed DIO writes
We negate the error value being returned from a generic function
incorrectly. The code path that it is running in returned negative
errors, so there is no need to negate it to get the correct error
signs here.

This was uncovered by generic/019.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:27 +10:00
Dave Chinner 9c23eccc1e xfs: unmount does not wait for shutdown during unmount
And interesting situation can occur if a log IO error occurs during
the unmount of a filesystem. The cases reported have the same
signature - the update of the superblock counters fails due to a log
write IO error:

XFS (dm-16): xfs_do_force_shutdown(0x2) called from line 1170 of file fs/xfs/xfs_log.c.  Return address = 0xffffffffa08a44a1
XFS (dm-16): Log I/O Error Detected.  Shutting down filesystem
XFS (dm-16): Unable to update superblock counters. Freespace may not be correct on next mount.
XFS (dm-16): xfs_log_force: error 5 returned.
XFS (¿-¿¿¿): Please umount the filesystem and rectify the problem(s)

It can be seen that the last line of output contains a corrupt
device name - this is because the log and xfs_mount structures have
already been freed by the time this message is printed. A kernel
oops closely follows.

The issue is that the shutdown is occurring in a separate IO
completion thread to the unmount. Once the shutdown processing has
started and all the iclogs are marked with XLOG_STATE_IOERROR, the
log shutdown code wakes anyone waiting on a log force so they can
process the shutdown error. This wakes up the unmount code that
is doing a synchronous transaction to update the superblock
counters.

The unmount path now sees all the iclogs are marked with
XLOG_STATE_IOERROR and so never waits on them again, knowing that if
it does, there will not be a wakeup trigger for it and we will hang
the unmount if we do. Hence the unmount runs through all the
remaining code and frees all the filesystem structures while the
xlog_iodone() is still processing the shutdown. When the log
shutdown processing completes, xfs_do_force_shutdown() emits the
"Please umount the filesystem and rectify the problem(s)" message,
and xlog_iodone() then aborts all the objects attached to the iclog.
An iclog that has already been freed....

The real issue here is that there is no serialisation point between
the log IO and the unmount. We have serialisations points for log
writes, log forces, reservations, etc, but we don't actually have
any code that wakes for log IO to fully complete. We do that for all
other types of object, so why not iclogbufs?

Well, it turns out that we can easily do this. We've got xfs_buf
handles, and that's what everyone else uses for IO serialisation.
i.e. bp->b_sema. So, lets hold iclogbufs locked over IO, and only
release the lock in xlog_iodone() when we are finished with the
buffer. That way before we tear down the iclog, we can lock and
unlock the buffer to ensure IO completion has finished completely
before we tear it down.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Bob Mastors <bob.mastors@solidfire.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:26 +10:00
Dave Chinner d39a2ced0f xfs: collapse range is delalloc challenged
FSX has been detecting data corruption after to collapse range
calls. The key observation is that the offset of the last extent in
the file was not being shifted, and hence when the file size was
adjusted it was truncating away data because the extents handled
been correctly shifted.

Tracing indicated that before the collapse, the extent list looked
like:

....
ino 0x5788 state  idx 6 offset 26 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 39 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0

and after the shift of 2 blocks:

ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 32 flag 0

Note that the last extent did not change offset. After the changing
of the file size:

ino 0x5788 state  idx 6 offset 24 block 195904 count 10 flag 0
ino 0x5788 state  idx 7 offset 37 block 195917 count 35 flag 0
ino 0x5788 state  idx 8 offset 86 block 195964 count 30 flag 0

You can see that the last extent had it's length truncated,
indicating that we've lost data.

The reason for this is that the xfs_bmap_shift_extents() loop uses
XFS_IFORK_NEXTENTS() to determine how many extents are in the inode.
This, unfortunately, doesn't take into account delayed allocation
extents - it's a count of physically allocated extents - and hence
when the file being collapsed has a delalloc extent like this one
does prior to the range being collapsed:

....
ino 0x5788 state  idx 4 offset 11 block 4503599627239429 count 1 flag 0
....

it gets the count wrong and terminates the shift loop early.

Fix it by using the in-memory extent array size that includes
delayed allocation extents to determine the number of extents on the
inode.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:25 +10:00
Dave Chinner 0e1f789d0d xfs: don't map ranges that span EOF for direct IO
Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. If is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it as buffer-new() so that the direct IO code does not zero
the tail of the block at the new EOF. This is a long standing bug
that has been around for many, many years.

Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes he direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.

Fix this by detecting a mapping that spans EOF and truncate it down
to EOF. This results in the the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF which XFS treats correctly
by setting buffer_new() on it. This makes direct Io behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.

Again, thanks to Al Viro for finding what I couldn't.

[ dchinner: Fix for __divdi3 build error:

	Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
	Signed-off-by: Mark Tinguely <tinguely@sgi.com>
	Reviewed-by: Eric Sandeen <sandeen@redhat.com>
]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-17 08:15:19 +10:00
Christoph Hellwig 8b90a33f47 xfs: don't try to use the filestream allocator for metadata allocations
xfs_bmap_btalloc_nullfb has two entirely different control flows when
using the filestream allocator vs the regular one, but it get the
conditionals wrong and ends up mixing the two for metadata allocations.
Fix this by adding a missing userdata check and slight refactoring.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:37:42 +10:00
Eric Sandeen 5e06d14894 xfs: remove unused calculation in xfs_dir2_sf_addname()
The "add_entsize" calculated here is never used.
"incr_isize" accounts for the inode expansion of the
old entries + parent + new entry all by itself.

Once we've removed add_entsize there, it's just a pointless
intermediate variable elsewhere, so remove it.
For that matter, old_isize is gratuitous too, so nuke that.

And add a few comments so the magic "+1's" and "+2's" make
a bit more sense.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:07:23 +10:00
Eric Sandeen e5e98bc64d xfs: remove pointless pointer increment in xfs_dir2_block_compact()
xfs_dir2_block_compact() is passed a pointer to *blp, and
advances it locally - but nobody uses the pointer (locally)
after that.

This behavior came about as part of prior refactoring,

20f7e9f xfs: factor dir2 block read operations

and looking at the code as it was before, it seems quite clear
that this change introduced a bug; the pre-refactoring code
expects blp to be modified after compaction.
And indeed it did; see this commit which fixed it:

37f1356 xfs: recalculate leaf entry pointer after compacting a dir2 block

So the bug was introduced & resolved in the 3.8 cycle.

Whoops.  Well, it's fixed now, and mystery solved; just remove
the now-pointless local increment of the blp pointer.

(I guess we should have run clang earlier!)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:46 +10:00
Eric Sandeen bbe4c66869 xfs: remove unused trans pointer arg from xlog_recover_unmount_trans()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:25 +10:00
Eric Sandeen e4a1e29cb0 xfs: remove unused ail pointer arg from xfs_trans_ail_cursor_done()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:06:05 +10:00
Eric Sandeen bda65ef8a8 xfs: remove unused xfs_mount arg from xfs_symlink_hdr_ok()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:05:43 +10:00
Eric Sandeen fd9fdba6c3 xfs: remove unused bp arg from xfs_iflush_fork()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:04:46 +10:00
Eric Sandeen e009400870 xfs: remove unused pag ptr arg from iterator execute functions
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:04:19 +10:00
Eric Sandeen 6f8950cd73 xfs: remove unused length arg from alloc_block ops
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:03:53 +10:00
Eric Sandeen 6ea94bb5b3 xfs: remove unused mp arg from xfs_calc_dquots_per_chunk()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:03:34 +10:00
Eric Sandeen 2599405374 xfs: remove unused mp arg from xfs_dir2 dataptr/byte functions
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:02:30 +10:00
Eric Sandeen 9df2dd0b0d xfs: remove unused tp arg from xfs_da_reada_buf & callers
This one hits a few functions as we unravel the unused arg
up through the callers.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:59 +10:00
Eric Sandeen 72b0636bb7 xfs: remove unused bip arg from xfs_buf_item_log_segment()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:34 +10:00
Eric Sandeen 87937bf8ca xfs: remove unused flags arg from _xfs_buf_get_pages()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:20 +10:00
Eric Sandeen 34dcefd717 xfs: remove unused args from xfs_alloc_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:01:00 +10:00
Eric Sandeen a96c41519a xfs: remove unused blocksize arg from xfs_setsize_buftarg()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 19:00:29 +10:00
Eric Sandeen 0d7409b142 xfs: remove unused level arg from xfs_btree_read_buf_block()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:59:56 +10:00
Eric Sandeen 6a9edd3d54 xfs: remove unused mp arg from xfs_bmap_forkoff_reset()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:59:26 +10:00
Eric Sandeen 152d93b732 xfs: remove unused mp arg from xfs_bmdr_maxrecs()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:51 +10:00
Eric Sandeen 6d0081a3da xfs: remove unused mp arg from xfs_attr3_rmt_hdr_ok()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:29 +10:00
Eric Sandeen 7fb2cd4d32 xfs: remove unused tp arg from xfs_bmap_last_offset() and callers
remove unused transaction pointer from various
callchains leading to xfs_bmap_last_offset().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:58:05 +10:00
Dave Chinner 897b73b6a2 xfs: zeroing space needs to punch delalloc blocks
When we are zeroing space andit is covered by a delalloc range, we
need to punch the delalloc range out before we truncate the page
cache. Failing to do so leaves and inconsistency between the page
cache and the extent tree, which we later trip over when doing
direct IO over the same range.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:15:11 +10:00
Dave Chinner aad3f3755e xfs: xfs_vm_write_end truncates too much on failure
Similar to the write_begin problem, xfs-vm_write_end will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:14:11 +10:00
Dave Chinner 72ab70a19b xfs: write failure beyond EOF truncates too much data
If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.

Fix it by only truncating away the exact range we are updating state
for in this write_begin call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:13:29 +10:00
Dave Chinner 4ab9ed578e xfs: kill buffers over failed write ranges properly
When a write fails, if we don't clear the delalloc flags from the
buffers over the failed range, they can persist beyond EOF and cause
problems. writeback will see the pages in the page cache, see they
are dirty and continually retry the write, assuming that the page
beyond EOF is just racing with a truncate. The page will eventually
be released due to some other operation (e.g. direct IO), and it
will not pass through invalidation because it is dirty. Hence it
will be released with buffer_delay set on it, and trigger warnings
in xfs_vm_releasepage() and assert fail in xfs_file_aio_write_direct
because invalidation failed and we didn't write the corect amount.

This causes failures on block size < page size filesystems in fsx
and fsstress workloads run by xfstests.

Fix it by completely trashing any state on the buffer that could be
used to imply that it contains valid data when the delalloc range
over the buffer is punched out during the failed write handling.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-14 18:11:58 +10:00
Linus Torvalds 5166701b36 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The first vfs pile, with deep apologies for being very late in this
  window.

  Assorted cleanups and fixes, plus a large preparatory part of iov_iter
  work.  There's a lot more of that, but it'll probably go into the next
  merge window - it *does* shape up nicely, removes a lot of
  boilerplate, gets rid of locking inconsistencie between aio_write and
  splice_write and I hope to get Kent's direct-io rewrite merged into
  the same queue, but some of the stuff after this point is having
  (mostly trivial) conflicts with the things already merged into
  mainline and with some I want more testing.

  This one passes LTP and xfstests without regressions, in addition to
  usual beating.  BTW, readahead02 in ltp syscalls testsuite has started
  giving failures since "mm/readahead.c: fix readahead failure for
  memoryless NUMA nodes and limit readahead pages" - might be a false
  positive, might be a real regression..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  missing bits of "splice: fix racy pipe->buffers uses"
  cifs: fix the race in cifs_writev()
  ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
  kill generic_file_buffered_write()
  ocfs2_file_aio_write(): switch to generic_perform_write()
  ceph_aio_write(): switch to generic_perform_write()
  xfs_file_buffered_aio_write(): switch to generic_perform_write()
  export generic_perform_write(), start getting rid of generic_file_buffer_write()
  generic_file_direct_write(): get rid of ppos argument
  btrfs_file_aio_write(): get rid of ppos
  kill the 5th argument of generic_file_buffered_write()
  kill the 4th argument of __generic_file_aio_write()
  lustre: don't open-code kernel_recvmsg()
  ocfs2: don't open-code kernel_recvmsg()
  drbd: don't open-code kernel_recvmsg()
  constify blk_rq_map_user_iov() and friends
  lustre: switch to kernel_sendmsg()
  ocfs2: don't open-code kernel_sendmsg()
  take iov_iter stuff to mm/iov_iter.c
  process_vm_access: tidy up a bit
  ...
2014-04-12 14:49:50 -07:00
Lukas Czerner 23fffa925e fs: move falloc collapse range check into the filesystem methods
Currently in do_fallocate in collapse range case we're checking
whether offset + len is not bigger than i_size.  However there is
nothing which would prevent i_size from changing so the check is
pointless.  It should be done in the file system itself and the file
system needs to make sure that i_size is not going to change.  The
i_size check for the other fallocate modes are also done in the
filesystems.

As it is now we can easily crash the kernel by having two processes
doing truncate and fallocate collapse range at the same time.  This
can be reproduced on ext4 and it is theoretically possible on xfs even
though I was not able to trigger it with this simple test.

This commit removes the check from do_fallocate and adds it to the
file system.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Acked-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2014-04-12 09:56:41 -04:00
Kirill A. Shutemov f1820361f8 mm: implement ->map_pages for page cache
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07 16:35:53 -07:00
Linus Torvalds d15e03104e xfs: update for 3.15-rc1
The main changes in the XFS tree for 3.15-rc1 are:
 
         - O_TMPFILE support
         - allowing AIO+DIO writes beyond EOF
         - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
           implementation
         - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
           implementation
         - IO verifier cleanup and rework
         - stack usage reduction changes
         - vm_map_ram NOIO context fixes to remove lockdep warings
         - various bug fixes and cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJTPykMAAoJEK3oKUf0dfod/KoP/jKQwzQPdtT8EtAu5vENh9AO
 55zwCDXXFjCNIGIFPkrUBQbbARVAqhLZn3vuLUUhqtRRELdgJy/yFKZ37MPd8bhU
 dKetivEB192Jcd6Sn74vsOsNLm1u9mJqbQ1aothz0TiOrkkWFZlz4Otu36MZRHN3
 9WgZXWSxr6I/hYHGyCorJWZ5ISs0XD3vR5dYXYeZChbTpTxlCT4X/YgUtW4WH/Tq
 y4gG0fKfwr9KK07/LXuQgUuZGU8vwVuNNsXPhqh+FZ39SLD2Ea83h46Hzf/+vVNI
 kCIyYN1y40uBWczmwAptVEnUwhpGK8PzNrhKwTZICDtuct9sikf7c+o0aEE9lcqo
 8IBt0Dy4l7BQVFSZOjYo5Jw5a8jAbkh47zru31HxogEVqafdz80iWB12JagOOnXM
 v/McvDvZMyfgGckih32FM4G7ElvTYgGai5/3dLhfMuhc4/DdwcBOF1yHmFmnjhWO
 QRsQxLdefUtP3MfMYKaJHM6v2wE1S2l0owgp+HdPluNiOUmH/fqFq1WpHxqqeRPz
 nuHF8oYlxaZP5WAarz6Yf1/twIeZJ1rTD8np8uocvMqQJzMYJgrQyH+xJqjJaITR
 iveQcEoRB8D7/fXMGDdcjZYE2fG4l4JE2kuh97k5NZw76e3v2YXSGh0kd9WqR1uN
 t07joLRQKR2pJuSmuD5E
 =uSkJ
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs

Pull xfs update from Dave Chinner:
 "There are a couple of new fallocate features in this request - it was
  decided that it was easiest to push them through the XFS tree using
  topic branches and have the ext4 support be based on those branches.
  Hence you may see some overlap with the ext4 tree merge depending on
  how they including those topic branches into their tree.  Other than
  that, there is O_TMPFILE support, some cleanups and bug fixes.

  The main changes in the XFS tree for 3.15-rc1 are:

   - O_TMPFILE support
   - allowing AIO+DIO writes beyond EOF
   - FALLOC_FL_COLLAPSE_RANGE support for fallocate syscall and XFS
     implementation
   - FALLOC_FL_ZERO_RANGE support for fallocate syscall and XFS
     implementation
   - IO verifier cleanup and rework
   - stack usage reduction changes
   - vm_map_ram NOIO context fixes to remove lockdep warings
   - various bug fixes and cleanups"

* tag 'xfs-for-linus-3.15-rc1' of git://oss.sgi.com/xfs/xfs: (34 commits)
  xfs: fix directory hash ordering bug
  xfs: extra semi-colon breaks a condition
  xfs: Add support for FALLOC_FL_ZERO_RANGE
  fs: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  xfs: inode log reservations are still too small
  xfs: xfs_check_page_type buffer checks need help
  xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
  xfs: use NOIO contexts for vm_map_ram
  xfs: don't leak EFSBADCRC to userspace
  xfs: fix directory inode iolock lockdep false positive
  xfs: allocate xfs_da_args to reduce stack footprint
  xfs: always do log forces via the workqueue
  xfs: modify verifiers to differentiate CRC from other errors
  xfs: print useful caller information in xfs_error_report
  xfs: add xfs_verifier_error()
  xfs: add helper for updating checksums on xfs_bufs
  xfs: add helper for verifying checksums on xfs_bufs
  xfs: Use defines for CRC offsets in all cases
  xfs: skip pointless CRC updates after verifier failures
  xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
  ...
2014-04-04 15:50:08 -07:00
Linus Torvalds 24e7ea3bea Major changes for 3.14 include support for the newly added ZERO_RANGE
and COLLAPSE_RANGE fallocate operations, and scalability improvements
 in the jbd2 layer and in xattr handling when the extended attributes
 spill over into an external block.
 
 Other than that, the usual clean ups and minor bug fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y
 Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A
 ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI
 kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd
 bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15
 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs
 rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa
 ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/
 +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA
 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr
 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU
 DrPKlXwIgva7zq5/S0Vr
 =4s1Z
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Major changes for 3.14 include support for the newly added ZERO_RANGE
  and COLLAPSE_RANGE fallocate operations, and scalability improvements
  in the jbd2 layer and in xattr handling when the extended attributes
  spill over into an external block.

  Other than that, the usual clean ups and minor bug fixes"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits)
  ext4: fix premature freeing of partial clusters split across leaf blocks
  ext4: remove unneeded test of ret variable
  ext4: fix comment typo
  ext4: make ext4_block_zero_page_range static
  ext4: atomically set inode->i_flags in ext4_set_inode_flags()
  ext4: optimize Hurd tests when reading/writing inodes
  ext4: kill i_version support for Hurd-castrated file systems
  ext4: each filesystem creates and uses its own mb_cache
  fs/mbcache.c: doucple the locking of local from global data
  fs/mbcache.c: change block and index hash chain to hlist_bl_node
  ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate
  ext4: refactor ext4_fallocate code
  ext4: Update inode i_size after the preallocation
  ext4: fix partial cluster handling for bigalloc file systems
  ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents
  ext4: only call sync_filesystm() when remounting read-only
  fs: push sync_filesystem() down to the file system's remount_fs()
  jbd2: improve error messages for inconsistent journal heads
  jbd2: minimize region locked by j_list_lock in jbd2_journal_forget()
  jbd2: minimize region locked by j_list_lock in journal_get_create_access()
  ...
2014-04-04 15:39:39 -07:00
Johannes Weiner 91b0abe36a mm + fs: store shadow entries in page cache
Reclaim will be leaving shadow entries in the page cache radix tree upon
evicting the real page.  As those pages are found from the LRU, an
iput() can lead to the inode being freed concurrently.  At this point,
reclaim must no longer install shadow pages because the inode freeing
code needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code sets
under the tree lock before doing the final truncate.  Reclaim will check
for this flag before installing shadow pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Metin Doslu <metin@citusdata.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Ozgun Erdogan <ozgun@citusdata.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ryan Mallon <rmallon@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03 16:21:01 -07:00
Dave Chinner a6cf33bc56 Merge branch 'xfs-bug-fixes-for-3.15-3' into for-next 2014-04-04 08:07:35 +11:00
Mark Tinguely c88547a811 xfs: fix directory hash ordering bug
Commit f5ea1100 ("xfs: add CRCs to dir2/da node blocks") introduced
in 3.10 incorrectly converted the btree hash index array pointer in
xfs_da3_fixhashpath(). It resulted in the the current hash always
being compared against the first entry in the btree rather than the
current block index into the btree block's hash entry array. As a
result, it was comparing the wrong hashes, and so could misorder the
entries in the btree.

For most cases, this doesn't cause any problems as it requires hash
collisions to expose the ordering problem. However, when there are
hash collisions within a directory there is a very good probability
that the entries will be ordered incorrectly and that actually
matters when duplicate hashes are placed into or removed from the
btree block hash entry array.

This bug results in an on-disk directory corruption and that results
in directory verifier functions throwing corruption warnings into
the logs. While no data or directory entries are lost, access to
them may be compromised, and attempts to remove entries from a
directory that has suffered from this corruption may result in a
filesystem shutdown.  xfs_repair will fix the directory hash
ordering without data loss occuring.

[dchinner: wrote useful a commit message]

cc: <stable@vger.kernel.org>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 07:10:49 +11:00
Dan Carpenter 805eeb8e04 xfs: extra semi-colon breaks a condition
There were some extra semi-colons here which mean that we return true
unintentionally.

Fixes: a49935f200 ('xfs: xfs_check_page_type buffer checks need help')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-04-04 06:56:30 +11:00
Al Viro 0a64bc2c04 xfs_file_buffered_aio_write(): switch to generic_perform_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:36 -04:00
Al Viro 5cb6c6c7eb generic_file_direct_write(): get rid of ppos argument
always equal to &iocb->ki_pos.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:35 -04:00
Al Viro fcacafd269 kill the 5th argument of generic_file_buffered_write()
same story - it's &iocb->ki_pos in all cases

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:34 -04:00
Al Viro 5d826c847b new helper: readlink_copy()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-04-01 23:19:15 -04:00
Theodore Ts'o 02b9984d64 fs: push sync_filesystem() down to the file system's remount_fs()
Previously, the no-op "mount -o mount /dev/xxx" operation when the
file system is already mounted read-write causes an implied,
unconditional syncfs().  This seems pretty stupid, and it's certainly
documented or guaraunteed to do this, nor is it particularly useful,
except in the case where the file system was mounted rw and is getting
remounted read-only.

However, it's possible that there might be some file systems that are
actually depending on this behavior.  In most file systems, it's
probably fine to only call sync_filesystem() when transitioning from
read-write to read-only, and there are some file systems where this is
not needed at all (for example, for a pseudo-filesystem or something
like romfs).

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Cc: Jan Kara <jack@suse.cz>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Anders Larsen <al@alarsen.net>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Cc: Petr Vandrovec <petr@vandrovec.name>
Cc: xfs@oss.sgi.com
Cc: linux-btrfs@vger.kernel.org
Cc: linux-cifs@vger.kernel.org
Cc: samba-technical@lists.samba.org
Cc: codalist@coda.cs.cmu.edu
Cc: linux-ext4@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: fuse-devel@lists.sourceforge.net
Cc: cluster-devel@redhat.com
Cc: linux-mtd@lists.infradead.org
Cc: jfs-discussion@lists.sourceforge.net
Cc: linux-nfs@vger.kernel.org
Cc: linux-nilfs@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: ocfs2-devel@oss.oracle.com
Cc: reiserfs-devel@vger.kernel.org
2014-03-13 10:14:33 -04:00
Dave Chinner fe986f9d88 Merge branch 'xfs-O_TMPFILE-support' into for-next
Conflicts:
	fs/xfs/xfs_trans_resv.c
	- fix for XFS_INODE_CLUSTER_SIZE macro removal
2014-03-13 19:14:43 +11:00
Dave Chinner 5f44e4c185 Merge branch 'xfs-bug-fixes-for-3.15-2' into for-next 2014-03-13 19:13:05 +11:00
Dave Chinner 49ae4b97d7 Merge branch 'xfs-verifier-cleanup' into for-next 2014-03-13 19:12:33 +11:00
Dave Chinner 730357a5cb Merge branch 'xfs-stack-fixes' into for-next 2014-03-13 19:12:13 +11:00
Dave Chinner b6db0551fd Merge branch 'xfs-collapse-range' into for-next 2014-03-13 19:11:06 +11:00
Lukas Czerner 376ba31314 xfs: Add support for FALLOC_FL_ZERO_RANGE
Introduce new FALLOC_FL_ZERO_RANGE flag for fallocate. This has the same
functionality as xfs ioctl XFS_IOC_ZERO_RANGE.

We can also preallocate blocks past EOF in the same was as with
fallocate. Flag FALLOC_FL_KEEP_SIZE will cause the inode size to remain
the same even if we preallocate blocks past EOF.

It uses the same code to zero range as it is used by the
XFS_IOC_ZERO_RANGE ioctl.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-13 19:07:58 +11:00
Dave Chinner fe4c224aa1 xfs: inode log reservations are still too small
Back in commit 23956703 ("xfs: inode log reservations are too
small"), the reservation size was increased to take into account the
difference in size between the in-memory BMBT block headers and the
on-disk BMDR headers. This solved a transaction overrun when logging
the inode size.

Recently, however, we've seen a number of these same overruns on
kernels with the above fix in it. All of them have been by 4 bytes,
so we must still not be accounting for something correctly.

Through inspection it turns out the above commit didn't take into
account everything it should have. That is, it only accounts for a
single log op_hdr structure, when it can actually require up to four
op_hdrs - one for each region (log iovec) that is formatted. These
regions are the inode log format header, the inode core, and the two
forks that can be held in the literal area of the inode.

This means we are not accounting for 36 bytes of log space that the
transaction can use, and hence when we get inodes in certain formats
with particular fragmentation patterns we can overrun the
transaction. Fix this by adding the correct accounting for log
op_headers in the transaction.

Tested-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07 16:19:14 +11:00
Dave Chinner a49935f200 xfs: xfs_check_page_type buffer checks need help
xfs_aops_discard_page() was introduced in the following commit:

  xfs: truncate delalloc extents when IO fails in writeback

... to clean up left over delalloc ranges after I/O failure in
->writepage(). generic/224 tests for this scenario and occasionally
reproduces panics on sub-4k blocksize filesystems.

The cause of this is failure to clean up the delalloc range on a
page where the first buffer does not match one of the expected
states of xfs_check_page_type(). If a buffer is not unwritten,
delayed or dirty&mapped, xfs_check_page_type() stops and
immediately returns 0.

The stress test of generic/224 creates a scenario where the first
several buffers of a page with delayed buffers are mapped & uptodate
and some subsequent buffer is delayed. If the ->writepage() happens
to fail for this page, xfs_aops_discard_page() incorrectly skips
the entire page.

This then causes later failures either when direct IO maps the range
and finds the stale delayed buffer, or we evict the inode and find
that the inode still has a delayed block reservation accounted to
it.

We can easily fix this xfs_aops_discard_page() failure by making
xfs_check_page_type() check all buffers, but this breaks
xfs_convert_page() more than it is already broken. Indeed,
xfs_convert_page() wants xfs_check_page_type() to tell it if the
first buffers on the pages are of a type that can be aggregated into
the contiguous IO that is already being built.

xfs_convert_page() should not be writing random buffers out of a
page, but the current behaviour will cause it to do so if there are
buffers that don't match the current specification on the page.
Hence for xfs_convert_page() we need to:

	a) return "not ok" if the first buffer on the page does not
	match the specification provided to we don't write anything;
	and
	b) abort it's buffer-add-to-io loop the moment we come
	across a buffer that does not match the specification.

Hence we need to fix both xfs_check_page_type() and
xfs_convert_page() to work correctly with pages that have mixed
buffer types, whilst allowing xfs_aops_discard_page() to scan all
buffers on the page for a type match.

Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07 16:19:14 +11:00
Brian Foster e480a72397 xfs: avoid AGI/AGF deadlock scenario for inode chunk allocation
The inode chunk allocation path can lead to deadlock conditions if
a transaction is dirtied with an AGF (to fix up the freelist) for
an AG that cannot satisfy the actual allocation request. This code
path is written to try and avoid this scenario, but it can be
reproduced by running xfstests generic/270 in a loop on a 512b fs.

An example situation is:
- process A attempts an inode allocation on AG 3, modifies
  the freelist, fails the allocation and ultimately moves on to
  AG 0 with the AG 3 AGF held
- process B is doing a free space operation (i.e., truncate) and
  acquires the AG 0 AGF, waits on the AG 3 AGF
- process A acquires the AG 0 AGI, waits on the AG 0 AGF (deadlock)

The problem here is that process A acquired the AG 3 AGF while
moving on to AG 0 (and releasing the AG 3 AGI with the AG 3 AGF
held). xfs_dialloc() makes one pass through each of the AGs when
attempting to allocate an inode chunk. The expectation is a clean
transaction if a particular AG cannot satisfy the allocation
request. xfs_ialloc_ag_alloc() is written to support this through
use of the minalignslop allocation args field.

When using the agi->agi_newino optimization, we attempt an exact
bno allocation request based on the location of the previously
allocated chunk. minalignslop is set to inform the allocator that
we will require alignment on this chunk, and thus to not allow the
request for this AG if the extra space is not available. Suppose
that the AG in question has just enough space for this request, but
not at the requested bno. xfs_alloc_fix_freelist() will proceed as
normal as it determines the request should succeed, and thus it is
allowed to modify the agf. xfs_alloc_ag_vextent() ultimately fails
because the requested bno is not available. In response, the caller
moves on to a NEAR_BNO allocation request for the same AG. The
alignment is set, but the minalignslop field is never reset. This
increases the overall requirement of the request from the first
attempt. If this delta is the difference between allocation success
and failure for the AG, xfs_alloc_fix_freelist() rejects this
request outright the second time around and causes the allocation
request to unnecessarily fail for this AG.

To address this situation, reset the minalignslop field immediately
after use and prevent it from leaking into subsequent requests.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07 16:19:14 +11:00
Dave Chinner ae687e58b3 xfs: use NOIO contexts for vm_map_ram
When we map pages in the buffer cache, we can do so in GFP_NOFS
contexts. However, the vmap interfaces do not provide any method of
communicating this information to memory reclaim, and hence we get
lockdep complaining about it regularly and occassionally see hangs
that may be vmap related reclaim deadlocks. We can also see these
same problems from anywhere where we use vmalloc for a large buffer
(e.g. attribute code) inside a transaction context.

A typical lockdep report shows up as a reclaim state warning like so:

[14046.101458] =================================
[14046.102850] [ INFO: inconsistent lock state ]
[14046.102850] 3.14.0-rc4+ #2 Not tainted
[14046.102850] ---------------------------------
[14046.102850] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[14046.102850] kswapd0/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
[14046.102850]  (&xfs_dir_ilock_class){++++?+}, at: [<791a04bb>] xfs_ilock+0xff/0x16a
[14046.102850] {RECLAIM_FS-ON-W} state was registered at:
[14046.102850]   [<7904cdb1>] mark_held_locks+0x81/0xe7
[14046.102850]   [<7904d390>] lockdep_trace_alloc+0x5c/0xb4
[14046.102850]   [<790c2c28>] kmem_cache_alloc_trace+0x2b/0x11e
[14046.102850]   [<790ba7f4>] vm_map_ram+0x119/0x3e6
[14046.102850]   [<7914e124>] _xfs_buf_map_pages+0x5b/0xcf
[14046.102850]   [<7914ed74>] xfs_buf_get_map+0x67/0x13f
[14046.102850]   [<7917506f>] xfs_attr_rmtval_set+0x396/0x4d5
[14046.102850]   [<7916e8bb>] xfs_attr_leaf_addname+0x18f/0x37d
[14046.102850]   [<7916ed9e>] xfs_attr_set_int+0x2f5/0x3e8
[14046.102850]   [<7916eefc>] xfs_attr_set+0x6b/0x74
[14046.102850]   [<79168355>] xfs_xattr_set+0x61/0x81
[14046.102850]   [<790e5b10>] generic_setxattr+0x59/0x68
[14046.102850]   [<790e4c06>] __vfs_setxattr_noperm+0x58/0xce
[14046.102850]   [<790e4d0a>] vfs_setxattr+0x8e/0x92
[14046.102850]   [<790e4ddd>] setxattr+0xcf/0x159
[14046.102850]   [<790e5423>] SyS_lsetxattr+0x88/0xbb
[14046.102850]   [<79268438>] sysenter_do_call+0x12/0x36

Now, we can't completely remove these traces - mainly because
vm_map_ram() will do GFP_KERNEL allocation and that generates the
above warning before we get into the reclaim code, but we can turn
them all into false positive warnings.

To do that, use the method that DM and other IO context code uses to
avoid this problem: there is a process flag to tell memory reclaim
not to do IO that we can set appropriately. That prevents GFP_KERNEL
context reclaim being done from deep inside the vmalloc code in
places we can't directly pass a GFP_NOFS context to. That interface
has a pair of wrapper functions: memalloc_noio_save() and
memalloc_noio_restore().

Adding them around vm_map_ram and the vzalloc call in
kmem_alloc_large() will prevent deadlocks and most lockdep reports
for this issue. Also, convert the vzalloc() call in
kmem_alloc_large() to use __vmalloc() so that we can pass the
correct gfp context to the data page allocation routine inside
__vmalloc() so that it is clear that GFP_NOFS context is important
to this vmalloc call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07 16:19:14 +11:00
Dave Chinner ac75a1f7a4 xfs: don't leak EFSBADCRC to userspace
While the verifier routines may return EFSBADCRC when a buffer has
a bad CRC, we need to translate that to EFSCORRUPTED so that the
higher layers treat the error appropriately and we return a
consistent error to userspace. This fixes a xfs/005 regression.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-03-07 16:19:14 +11:00
Linus Torvalds 8d7531825c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull filesystem fixes from Jan Kara:
 "Notification, writeback, udf, quota fixes

  The notification patches are (with one exception) a fallout of my
  fsnotify rework which went into -rc1 (I've extented LTP to cover these
  cornercases to avoid similar breakage in future).

  The UDF patch is a nasty data corruption Al has recently reported,
  the revert of the writeback patch is due to possibility of violating
  sync(2) guarantees, and a quota bug can lead to corruption of quota
  files in ocfs2"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: Allocate overflow events with proper type
  fanotify: Handle overflow in case of permission events
  fsnotify: Fix detection whether overflow event is queued
  Revert "writeback: do not sync data dirtied after sync start"
  quota: Fix race between dqput() and dquot_scan_active()
  udf: Fix data corruption on file type conversion
  inotify: Fix reporting of cookies for inotify events
2014-02-27 10:37:22 -08:00
Dave Chinner 93a8614e3a xfs: fix directory inode iolock lockdep false positive
The change to add the IO lock to protect the directory extent map
during readdir operations has cause lockdep to have a heart attack
as it now sees a different locking order on inodes w.r.t. the
mmap_sem because readdir has a different ordering to write().

Add a new lockdep class for directory inodes to avoid this false
positive.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 16:51:39 +11:00
Dave Chinner a1358aa3d3 xfs: allocate xfs_da_args to reduce stack footprint
The struct xfs_da_args used to pass directory/attribute operation
information to the lower layers is 128 bytes in size and is
allocated on the stack. Dynamically allocate them to reduce the
stack footprint of directory operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 16:51:26 +11:00
Dave Chinner f876e44603 xfs: always do log forces via the workqueue
Log forces can occur deep in the call chain when we have relatively
little stack free. Log forces can also happen at close to the call
chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
places where we really don't want to add more stack overhead.

This stack overhead occurs because log forces do foreground CIL
pushes (xlog_cil_push_foreground()) rather than waking the
background push wq and waiting for the for the push to complete.
This foreground push was done to avoid confusing the CFQ Io
scheduler when fsync()s were issued, as it has trouble dealing with
dependent IOs being issued from different process contexts.

Avoiding blowing the stack is much more critical than performance
optimisations for CFQ, especially as we've been recommending against
the use of CFQ for XFS since 3.2 kernels were release because of
it's problems with multi-threaded IO workloads.

Hence convert xlog_cil_push_foreground() to move the push work
to the CIL workqueue. We already do the waiting for the push to
complete in xlog_cil_force_lsn(), so there's nothing else we need to
modify to make this work.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 16:40:42 +11:00
Eric Sandeen ce5028cfe3 xfs: modify verifiers to differentiate CRC from other errors
Modify all read & write verifiers to differentiate
between CRC errors and other inconsistencies.

This sets the appropriate error number on bp->b_error,
and then calls xfs_verifier_error() if something went
wrong.  That function will issue the appropriate message
to the user.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:23:10 +11:00
Eric Sandeen db9355c296 xfs: print useful caller information in xfs_error_report
xfs_error_report used to just print the hex address of the caller;
%pF will give us something more human-readable.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:21:37 +11:00
Eric Sandeen ca23f8fdd6 xfs: add xfs_verifier_error()
We want to distinguish between corruption, CRC errors,
etc.  In addition, the full stack trace on verifier errors
seems less than helpful; it looks more like an oops than
corruption.

Create a new function to specifically alert the user to
verifier errors, which can differentiate between
EFSCORRUPTED and CRC mismatches.  It doesn't dump stack
unless the xfs error level is turned up high.

Define a new error message (EFSBADCRC) to clearly identify
CRC errors.  (Defined to EBADMSG, bad message)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:21:07 +11:00
Eric Sandeen f1dbcd7e38 xfs: add helper for updating checksums on xfs_bufs
Many/most callers of xfs_update_cksum() pass bp->b_addr and
BBTOB(bp->b_length) as the first 2 args.  Add a helper
which can just accept the bp and the crc offset, and work
it out on its own, for brevity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:18:23 +11:00
Eric Sandeen 5158217058 xfs: add helper for verifying checksums on xfs_bufs
Many/most callers of xfs_verify_cksum() pass bp->b_addr and
BBTOB(bp->b_length) as the first 2 args.  Add a helper
which can just accept the bp and the crc offset, and work
it out on its own, for brevity.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:17:27 +11:00
Eric Sandeen 533b81c875 xfs: Use defines for CRC offsets in all cases
Some calls to crc functions used useful #defines,
others used awkward offsetof() constructs.

Switch them all to #define to make things a bit cleaner.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:15:27 +11:00
Eric Sandeen e0d2c23a25 xfs: skip pointless CRC updates after verifier failures
Most write verifiers don't update CRCs after the verifier
has failed and the buffer has been marked in error.  These
two didn't, but should.

Add returns to the verifier failure block, since the buffer
won't be written anyway.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-27 15:14:31 +11:00
Namjae Jeon e1d8fb88a6 xfs: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
This patch implements fallocate's FALLOC_FL_COLLAPSE_RANGE for XFS.

The semantics of this flag are following:
1) It collapses the range lying between offset and length by removing any data
   blocks which are present in this range and than updates all the logical
   offsets of extents beyond "offset + len" to nullify the hole created by
   removing blocks. In short, it does not leave a hole.
2) It should be used exclusively. No other fallocate flag in combination.
3) Offset and length supplied to fallocate should be fs block size aligned
   in case of xfs and ext4.
4) Collaspe range does not work beyond i_size.

Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-24 10:58:19 +11:00
Linus Torvalds 645ceee885 Merge branch 'xfs-fixes-for-3.14-rc4' of git://oss.sgi.com/xfs/xfs
Pull xfs fixes from Dave Chinner:
 "This is the first pull request I've had to do for you, so I'm still
  sorting things out.  The reason I'm sending this and not Ben should be
  obvious from the first commit below - SGI has stepped down from the
  XFS maintainership role.  As such, I'd like to take another
  opportunity to thank them for their many years of effort maintaining
  XFS and supporting the XFS community that they developed from the
  ground up.

  So I haven't had time to work things like signed tags into my
  workflows yet, so this is just a repo branch I'm asking you to pull
  from.  And yes, I named the branch -rc4 because I wanted the fixes in
  rc4, not because the branch was for merging into -rc3.  Probably not
  right, either.

  Anyway, I should have everything sorted out by the time the next merge
  window comes around.  If there's anything that you don't like in the
  pull req, feel free to flame me unmercifully.

  The changes are fixes for recent regressions and important thinkos in
  verification code:

        - a log vector buffer alignment issue on ia32
        - timestamps on truncate got mangled
        - primary superblock CRC validation fixes and error message
          sanitisation"

* 'xfs-fixes-for-3.14-rc4' of git://oss.sgi.com/xfs/xfs:
  xfs: limit superblock corruption errors to actual corruption
  xfs: skip verification on initial "guess" superblock read
  MAINTAINERS: SGI no longer maintaining XFS
  xfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb
  xfs: ensure correct log item buffer alignment
  xfs: ensure correct timestamp updates from truncate
2014-02-22 08:26:01 -08:00
Jan Kara 0dc83bd30b Revert "writeback: do not sync data dirtied after sync start"
This reverts commit c4a391b53a. Dave
Chinner <david@fromorbit.com> has reported the commit may cause some
inodes to be left out from sync(2). This is because we can call
redirty_tail() for some inode (which sets i_dirtied_when to current time)
after sync(2) has started or similarly requeue_inode() can set
i_dirtied_when to current time if writeback had to skip some pages. The
real problem is in the functions clobbering i_dirtied_when but fixing
that isn't trivial so revert is a safer choice for now.

CC: stable@vger.kernel.org # >= 3.13
Signed-off-by: Jan Kara <jack@suse.cz>
2014-02-22 02:02:28 +01:00
Dave Chinner 027f185e66 Merge remote-tracking branch 'xfs-async-aio-extend' into for-next 2014-02-20 15:16:39 +11:00
Dave Chinner b678573e29 Merge branch 'xfs-fixes-for-3.15' into for-next 2014-02-20 15:16:09 +11:00
Eric Sandeen 5ef11eb070 xfs: limit superblock corruption errors to actual corruption
Today, if

xfs_sb_read_verify
  xfs_sb_verify
    xfs_mount_validate_sb

detects superblock corruption, it'll be extremely noisy, dumping
2 stacks, 2 hexdumps, etc.

This is because we call XFS_CORRUPTION_ERROR in xfs_mount_validate_sb
as well as in xfs_sb_read_verify.

Also, *any* errors in xfs_mount_validate_sb which are not corruption
per se; things like too-big-blocksize, bad version, bad magic, v1 dirs,
rw-incompat etc - things which do not return EFSCORRUPTED - will
still do the whole XFS_CORRUPTION_ERROR spew when xfs_sb_read_verify
sees any error at all.  And it suggests to the user that they
should run xfs_repair, even if the root cause of the mount failure
is a simple incompatibility.

I'll submit that the probably-not-corrupted errors don't warrant
this much noise, so this patch removes the warning for anything
other than EFSCORRUPTED returns, and replaces the lower-level
XFS_CORRUPTION_ERROR with an xfs_notice().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-19 15:39:35 +11:00
Eric Sandeen daba5427da xfs: skip verification on initial "guess" superblock read
When xfs_readsb() does the very first read of the superblock,
it makes a guess at the length of the buffer, based on the
sector size of the underlying storage.  This may or may
not match the filesystem sector size in sb_sectsize, so
we can't i.e. do a CRC check on it; it might be too short.

In fact, mounting a filesystem with sb_sectsize larger
than the device sector size will cause a mount failure
if CRCs are enabled, because we are checksumming a length
which exceeds the buffer passed to it.

So always read twice; the first time we read with NULL
buffer ops to skip verification; then set the proper
read length, hook up the proper verifier, and give it
another go.

Once we are sure that we've got the right buffer length,
we can also use bp->b_length in the xfs_sb_read_verify,
rather than the less-trusted on-disk sectorsize for
secondary superblocks.  Before this we ran the risk of
passing junk to the crc32c routines, which didn't always
handle extreme values.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-19 15:39:16 +11:00
Eric Sandeen 7a01e707a3 xfs: xfs_sb_read_verify() doesn't flag bad crcs on primary sb
My earlier commit 10e6e65 deserves a layer or two of brown paper
bags.  The logic in that commit means that a CRC failure on the
primary superblock will *never* result in an error return.

Hopefully this fixes it, so that we always return the error
if it's a primary superblock, otherwise only if the filesystem
has CRCs enabled.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
2014-02-19 15:33:05 +11:00
Dave Chinner 3895e51f6d xfs: ensure correct log item buffer alignment
On 32 bit platforms, the log item vector headers are not 64 bit
aligned or sized. hence if we don't take care to align them
correctly or pad the buffer appropriately for 8 byte alignment, we
can end up with alignment issues when accessing the user buffer
directly as a structure.

To solve this, simply pad the buffer headers to 64 bit offset so
that the data section is always 8 byte aligned.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Tested-by: Michael L. Semon <mlsemon35@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-10 10:37:18 +11:00
Christoph Hellwig fe60a8a091 xfs: ensure correct timestamp updates from truncate
The VFS doesn't set the proper ATTR_CTIME and ATTR_MTIME values for
truncate, so filesystems have to manually add them.  The
introduction of xfs_setattr_time accidentally broke this special
case an caused a regression in generic/313.  Fix this by removing
the local mask variable in xfs_setattr_size so that we only have a
single place to keep the attribute information.

cc: <stable@vger.kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-10 10:35:22 +11:00
Christoph Hellwig 9862f62fab xfs: allow appending aio writes
XFS can easily support appending aio writes by ensuring we always allocate
blocks as unwritten extents when performing direct I/O writes and only
converting them to written extents at I/O completion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-10 10:28:04 +11:00
Christoph Hellwig d531d91d69 xfs: always use unwritten extents for direct I/O writes
To allow aio writes beyond i_size we need to create unwritten extents for
newly allocated blocks, similar to how we already do inside i_size.

Instead of adding another special case we now use unwritten extents
unconditionally.  This also marks the end of directly allocation data
extents in all of XFS - we now always use either delalloc or unwritten
extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-10 10:27:43 +11:00
Al Viro d311d79de3 fix O_SYNC|O_APPEND syncing the wrong range on write()
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
	pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
	pos_before_write .. pos_before_write + written - 1
instead.  Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().

All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().

The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-02-09 15:18:09 -05:00
Jie Liu 492185ef1d xfs: remove XFS_TRANS_DEBUG dead code
Remove the leftover XFS_TRANS_DEBUG dead code following the previous
cleaning up of it in commits ec47eb6b0b.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-02-07 15:26:11 +11:00