1
0
Fork 0
Commit Graph

1551 Commits (aa42240ab2544a8bcb2efb400193826f57f3175e)

Author SHA1 Message Date
Steven Whitehouse d69a3c6561 GFS2: Move log buffer lists into transaction
Over time, we hope to be able to improve the concurrency available
in the log code. This is one small step towards that, by moving
the buffer lists from the super block, and into the transaction
structure, so that each transaction builds its own buffer lists.

At transaction commit time, the buffer lists are merged into
the currently accumulating transaction. That transaction then
is passed into the before and after commit functions at journal
flush time. Thus there should be no change in overall behaviour
yet.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-24 16:54:54 +00:00
Steven Whitehouse 654a6d2f96 GFS2: Reduce struct gfs2_trans in size
A couple of "int" fields were being used as boolean values
so we can make them bitfields of one bit, and put them in
what might otherwise be a hole in the structure with 64
bit alignment.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-21 11:52:00 +00:00
David Teigland ad781971d9 GFS2: add missing newline
Log message is missing newline.

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-17 10:00:58 +00:00
Rashika Kheria c2b0b30edd GFS2: Mark functions as static in gfs2/rgrp.c
Mark functions as static in gfs2/rgrp.c because they are not used
outside this file.

This eliminates the following warning in gfs2/rgrp.c:
fs/gfs2/rgrp.c:1092:5: warning: no previous prototype for ‘gfs2_rgrp_bh_get’ [-Wmissing-prototypes]
fs/gfs2/rgrp.c:1157:5: warning: no previous prototype for ‘update_rgrp_lvb’ [-Wmissing-prototypes]

Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-10 12:29:16 +00:00
Steven Whitehouse 44aaada9d1 GFS2: Add meta readahead field in directory entries
The intent of this new field in the directory entry is to
allow a subsequent lookup to know how many blocks, which
are contiguous with the inode, contain metadata which relates
to the inode. This will then allow the issuing of a single
read to read these blocks, rather than reading the inode
first, and then issuing a second read for the metadata.

This only works under some fairly strict conditions, since
we do not have back pointers from inodes to directory entries
we must ensure that the blocks referenced in this way will
always belong to the inode.

This rules out being able to use this system for indirect
blocks, as these can change as a result of truncate/rewrite.

So the idea here is to restrict this to xattr blocks only
for the time being. For most inodes, that means only a
single block. Also, when using ACLs and/or SELinux or
other LSMs, these will be added at inode creation time
so that they will be contiguous with the inode on disk and
also will almost always be needed when we read the inode in
for permissions checks.

Once an xattr block for an inode is allocated, it will never
change until the inode is deallocated.

This patch adds the new field, a further patch will add the
readahead in due course.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-07 11:23:22 +00:00
Bob Peterson a0846a534c GFS2: Lock i_mutex and use a local gfs2_holder for fallocate
This patch causes GFS2 to lock the i_mutex during fallocate. It
also switches from using a dinode's inode glock to using a local
holder like the other GFS2 i_operations.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-06 15:49:58 +00:00
Steven Whitehouse 774016b2d4 GFS2: journal data writepages update
GFS2 has carried what is more or less a copy of the
write_cache_pages() for some time. It seems that this
copy has slipped behind the core code over time. This
patch brings it back uptodate, and in addition adds the
tracepoint which would otherwise be missing.

We could go further, and eliminate some or all of the
code duplication here. The issue is that if we do that,
then the function we need to split out from the existing
write_cache_pages(), which will look a lot like
gfs2_jdata_write_pagevec(), would land up putting quite a
lot of extra variables on the stack. I know that has been
a problem in the past in the writeback code path, which
is why I've hesitated to do it here.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-06 15:47:47 +00:00
Steven Whitehouse b2c8b3ea87 GFS2: Allocate block for xattr at inode alloc time, if required
This is another step towards improving the allocation of xattr
blocks at inode allocation time. Here we take advantage of
Christoph's recent work on ACLs to allocate a block for the
xattrs early if we know that we will be adding ACLs to the
inode later on. The advantage of that is that it is much
more likely that we'll get a contiguous run of two blocks
where the first is the inode and the second is the xattr block.

We still have to fall back to the original system in case we
don't get the requested two contiguous blocks, or in case the
ACLs are too large to fit into the block.

Future patches will move more of the ACL setting code further
up the gfs2_inode_create() function. Also, I'd like to be
able to do the same thing with the xattrs from LSMs in
due course, too. That way we should be able to slowly reduce
the number of independent transactions, at least in the
most common cases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-04 15:45:11 +00:00
Steven Whitehouse 885bceca7f GFS2: Plug on AIL flush
When we do a flush of the AIL list, we are writing out what is
likely to be a lot of small I/Os, which are possibly in an order
which is not ideal performance-wise. Since this is done by calling
filemap_fdatatwrite for each individual inode's address space there
is no overall plugging going on.

In addition to that, we do not always wait for AIL i/o when we flush
it, so that it is possible for things to get left behind on the queue.
By adding explicit plugging here, we reduce the chances of this
being an issues. A quick test using the AIL flush tracepoint shows a
small, but measurable improvement.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-02-03 09:57:29 +00:00
Linus Torvalds f568849eda Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block
Pull core block IO changes from Jens Axboe:
 "The major piece in here is the immutable bio_ve series from Kent, the
  rest is fairly minor.  It was supposed to go in last round, but
  various issues pushed it to this release instead.  The pull request
  contains:

   - Various smaller blk-mq fixes from different folks.  Nothing major
     here, just minor fixes and cleanups.

   - Fix for a memory leak in the error path in the block ioctl code
     from Christian Engelmayer.

   - Header export fix from CaiZhiyong.

   - Finally the immutable biovec changes from Kent Overstreet.  This
     enables some nice future work on making arbitrarily sized bios
     possible, and splitting more efficient.  Related fixes to immutable
     bio_vecs:

        - dm-cache immutable fixup from Mike Snitzer.
        - btrfs immutable fixup from Muthu Kumar.

  - bio-integrity fix from Nic Bellinger, which is also going to stable"

* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
  xtensa: fixup simdisk driver to work with immutable bio_vecs
  block/blk-mq-cpu.c: use hotcpu_notifier()
  blk-mq: for_each_* macro correctness
  block: Fix memory leak in rw_copy_check_uvector() handling
  bio-integrity: Fix bio_integrity_verify segment start bug
  block: remove unrelated header files and export symbol
  blk-mq: uses page->list incorrectly
  blk-mq: use __smp_call_function_single directly
  btrfs: fix missing increment of bi_remaining
  Revert "block: Warn and free bio if bi_end_io is not set"
  block: Warn and free bio if bi_end_io is not set
  blk-mq: fix initializing request's start time
  block: blk-mq: don't export blk_mq_free_queue()
  block: blk-mq: make blk_sync_queue support mq
  block: blk-mq: support draining mq queue
  dm cache: increment bi_remaining when bi_end_io is restored
  block: fixup for generic bio chaining
  block: Really silence spurious compiler warnings
  block: Silence spurious compiler warnings
  block: Kill bio_pair_split()
  ...
2014-01-30 11:19:05 -08:00
Linus Torvalds bf3d846b78 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted stuff; the biggest pile here is Christoph's ACL series.  Plus
  assorted cleanups and fixes all over the place...

  There will be another pile later this week"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
  __dentry_path() fixes
  vfs: Remove second variable named error in __dentry_path
  vfs: Is mounted should be testing mnt_ns for NULL or error.
  Fix race when checking i_size on direct i/o read
  hfsplus: remove can_set_xattr
  nfsd: use get_acl and ->set_acl
  fs: remove generic_acl
  nfs: use generic posix ACL infrastructure for v3 Posix ACLs
  gfs2: use generic posix ACL infrastructure
  jfs: use generic posix ACL infrastructure
  xfs: use generic posix ACL infrastructure
  reiserfs: use generic posix ACL infrastructure
  ocfs2: use generic posix ACL infrastructure
  jffs2: use generic posix ACL infrastructure
  hfsplus: use generic posix ACL infrastructure
  f2fs: use generic posix ACL infrastructure
  ext2/3/4: use generic posix ACL infrastructure
  btrfs: use generic posix ACL infrastructure
  fs: make posix_acl_create more useful
  fs: make posix_acl_chmod more useful
  ...
2014-01-28 08:38:04 -08:00
Christoph Hellwig e01580bf9e gfs2: use generic posix ACL infrastructure
This contains some major refactoring for the create path so that
inodes are created with the right mode to start with instead of
fixing it up later.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:22 -05:00
Christoph Hellwig 37bc15392a fs: make posix_acl_create more useful
Rename the current posix_acl_created to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
Christoph Hellwig 5bf3258fd2 fs: make posix_acl_chmod more useful
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-01-25 23:58:18 -05:00
J. Bruce Fields d57b9c9a99 GFS2: revert "GFS2: d_splice_alias() can't return error"
0d0d110720 asserts that "d_splice_alias()
can't return error unless it was given an IS_ERR(inode)".

That was true of the implementation of d_splice_alias, but this is
really a problem with d_splice_alias: at a minimum it should be able to
return -ELOOP in the case where inserting the given dentry would cause a
directory loop.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-18 09:50:53 +00:00
Bob Peterson 8b127d0494 GFS2: Small cleanup
This is a small cleanup to function gfs2_rgrp_go_lock so that it
uses rgd instead of its more complicated twin.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-16 14:22:12 +00:00
Steven Whitehouse ac3beb6a5d GFS2: Don't use ENOBUFS when ENOMEM is the correct error code
Al Viro has tactfully pointed out that we are using the incorrect
error code in some cases. This patch fixes that, and also removes
the (unused) return value for glock dumping.

>        * gfs2_iget() - ENOBUFS instead of ENOMEM.  ENOBUFS is
> "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
> we don't support STREAMS it's probably fair game, but... what the hell?

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
2014-01-16 10:31:13 +00:00
Steven Whitehouse 1e3d36206b GFS2: Fix kbuild test robot reported warning
Well I don't get the same warning locally as the kbuild
robot, but I guess this should fix the problem, anyway.
Here is the warning:

head:   2d9e72303d
commit: ee2411a8db [19/20] GFS2: Clean up quota slot allocation
config: make ARCH=powerpc allmodconfig

All error/warnings:

   fs/gfs2/quota.c: In function 'gfs2_quota_init':
>> fs/gfs2/quota.c:1246:3: error: implicit declaration of function '__vmalloc' [-Werror=implicit-function-declaration]
      sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
      ^
>> fs/gfs2/quota.c:1246:24: warning: assignment makes pointer from integer without a cast [enabled by default]
      sdp->sd_quota_bitmap = __vmalloc(bm_size, GFP_NOFS, PAGE_KERNEL);
                           ^
   fs/gfs2/quota.c: In function 'gfs2_quota_cleanup':
>> fs/gfs2/quota.c:1361:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
       vfree(sdp->sd_quota_bitmap);

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-15 12:57:25 +00:00
Steven Whitehouse 2d9e72303d GFS2: Move quota bitmap operations under their own lock
Gradually, the global qd_lock is being used for less and less.
After this patch it will only be used for the per super block
list whose purpose is to allow syncing of changes back to the
master quota file from the local quota changes file. Fixing
up that process to make it more efficient will be the subject
of a later patch, however this patch removes another barrier
to doing that.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2014-01-14 19:29:06 +00:00
Steven Whitehouse ee2411a8db GFS2: Clean up quota slot allocation
Quota slot allocation has historically used a vector of pages
and a set of homegrown find/test/set/clear bit functions. Since
the size of the bitmap is likely to be based on the default
qc file size, thats a couple of pages at most. So we ought
to be able to allocate that as a single chunk, with a vmalloc
fallback, just in case of memory fragmentation.

We are then able to use the kernel's own find/test/set/clear
bit functions, rather than rolling our own.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2014-01-14 19:28:49 +00:00
Steven Whitehouse 8ad151c2ac GFS2: Only run logd and quota when mounted read/write
While investigating a rather strange bit of code in the quota
clean up function, I spotted that the reason for its existence
was that when remounting read only, we were not stopping the
quotad thread, and thus it was possible for it to still have
a reference to some of the quotas in that case.

This patch moves the logd and quota thread start and stop into
the make_fs_rw/ro functions, so that we now stop those threads
when mounted read only.

This means that quotad will always be stopped before we call
the quota clean up function, and we can thus dispose of the
(rather hackish) code that waits for it to give up its
reference on the quotas.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2014-01-14 19:28:25 +00:00
Steven Whitehouse c754fbbb1b GFS2: Use RCU/hlist_bl based hash for quotas
Prior to this patch, GFS2 kept all the quotas for each
super block in a single linked list. This is rather slow
when there are large numbers of quotas.

This patch introduces a hlist_bl based hash table, similar
to the one used for glocks. The initial look up of the quota
is now lockless in the case where it is already cached,
although we still have to take the per quota spinlock in
order to bump the ref count. Either way though, this is a
big improvement on what was there before.

The qd_lock and the per super block list is preserved, for
the time being. However it is intended that since this is no
longer used for its original role, it should be possible to
shrink the number of items on that list in due course and
remove the requirement to take qd_lock in qd_get.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2014-01-14 19:27:56 +00:00
Steven Whitehouse 086352f1aa GFS2: No need to invalidate pages for a dio read
We recently fixed the writeback of pages prior to performing
direct i/o, however the initial fix was perhaps a bit heavy
handed. There is no need to invalidate pages if the direct i/o
is only a read, since they will be identical to what has been
flushed to disk anyway.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-14 19:20:49 +00:00
Steven Whitehouse 39849d6946 GFS2: Add initialization for address space in super block
Spotted by Andy Price. This should fix the odd messages from
lockdep caused by 70d4ee94b3

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Price <anprice@redhat.com>
2014-01-09 16:34:04 +00:00
Steven Whitehouse 01bcb0dedb GFS2: Add hints to directory leaf blocks
This patch adds four new fields to directory leaf blocks.
The intent is not to use them in the kernel itself, although
perhaps we may be able to use them as hints at some later date,
but instead to provide more information for debug/fsck use.

One new field adds a pointer to the inode to which the leaf
belongs. This can be useful if the pointer to the leaf block
has become corrupt, as it will allow us to know which inode
this block should be associated with. This field is set when
the leaf is created and never changed over its lifetime.

The second field is a "distance from the hash table" field.
The meaning is as follows:
 0  = An old leaf in which this value has not been set
 1  = This leaf is pointed to directly from the hash table
 2+ = This leaf is part of a chain, pointed to by another leaf
      block, the value gives the position in the chain.

The third and fourth fields combine to give a time stamp of
the most recent directory insertion or deletion from this
leaf block. The time stamp is not updated when a new leaf
block is chained from the current one. The code is currently
written such that the timestamp on the dir inode will match
that of the leaf block for the most recent insertion/deletion.

For backwards compatibility, any of these new fields which is
zero should be considered to be "unknown".

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-08 12:14:57 +00:00
Steven Whitehouse 22b5a6c0c0 GFS2: For exhash conversion, only one block is needed
For most cases, only a single new block is needed when we reach
the point of converting from stuffed to exhash directory. The
exception being when the file name is so long that it will not
fit within the new leaf block.

So this patch adds a simple test for that situation so that we
do not need to request the full reservation size in this case.

Potentially we could calculate more accurately the value to use
in other cases too, but that is much more complicated to do and
it is doubtful that the benefit would outweigh the extra cost
in code complexity.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-08 11:05:29 +00:00
Bob Peterson 62e96cf819 GFS2: Increase i_writecount during gfs2_setattr_chown
This patch calls get_write_access in function gfs2_setattr_chown,
which merely increases inode->i_writecount for the duration of the
function. That will ensure that any file closes won't delete the
inode's multi-block reservation while the function is running.
It also ensures that a multi-block reservation exists when needed
for quota change operations during the chown.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-07 09:43:51 +00:00
Steven Whitehouse 2b47dad866 GFS2: Remember directory insert point
When we look to see if there is enough space to add a dir
entry without allocation, we have then been repeating the
same search later when we do the actual insertion. This
patch caches the details of the location in the gfs2_diradd
structure, so that we do not have to repeat the search.

This will provide a performance improvement which will be
greater as the size of the directory increases.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-06 12:49:43 +00:00
Steven Whitehouse 534cf9ca55 GFS2: Consolidate transaction blocks calculation for dir add
There are three cases where we need to calculate the number of
blocks to reserve in a transaction involving linking an inode
into a directory. The one in rename is a bit more complicated,
but the basis of it is the same as for link and create. So it
makes sense to move this calculation into a single function
rather than repeating it three times.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-06 12:03:05 +00:00
Steven Whitehouse 3c1c0ae1db GFS2: Add directory addition info structure
The intent is that this structure will hold the information
required when adding entries to a directory (linking). To
start with, it will contain only the number of blocks which
are required to link the new entry into the directory. The
current calculation returns either 0 or the maximim number of
blocks that can ever be requested by such a transaction.

The intent is that in a later patch, we can update the dir
code to calculate this value more accurately. In addition
further patches will also add further fields to the new
structure to increase its utility.

In addition this patch fixes a bug where the link used during
inode creation was adding requesting too many blocks in
some cases. This is harmless unless the fs is close to being
full.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-06 11:28:41 +00:00
Steven Whitehouse 70d4ee94b3 GFS2: Use only a single address space for rgrps
Prior to this patch, GFS2 had one address space for each rgrp,
stored in the glock. This patch changes them to use a single
address space in the super block. This therefore saves
(sizeof(struct address_space) * nr_of_rgrps) bytes of memory
and for large filesystems, that can be significant.

It would be nice to be able to do something similar and merge
the inode metadata address space into the same global
address space. However, that is rather more complicated as the
on-disk location doesn't have a 1:1 mapping with the inodes in
general. So while it could be done, it will be a more complicated
operation as it requires changing a lot more code paths.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 10:01:50 +00:00
Steven Whitehouse 7005c3e4ae GFS2: Use range based functions for rgrp sync/invalidation
Each rgrp header is represented as a single extent on disk, so we
can calculate the position within the address space, since we are
using address spaces mapped 1:1 to the disk. This means that it
is possible to use the range based versions of filemap_fdatawrite/wait
and for invalidating the page cache.

Our eventual intent is to then be able to merge the address spaces
used for rgrps into a single address space, rather than to have
one for each glock, saving memory and reducing complexity.

Since during umount, the rgrp structures are disposed of before
the glocks, we need to store the extent information in the glock
so that is is available for a final invalidation. This patch uses
a field which is otherwise unused in rgrp glocks to do that, so
that we do not have to expand the size of a glock.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 10:00:31 +00:00
Steven Whitehouse 7de41d36ff GFS2: Remove test which is always true
Since gfs2_inplace_reserve() is always called with a valid
alloc parms structure, there is no need to test for this
within the function itself - and in any case, after we've
all ready dereferenced it anyway.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:59:30 +00:00
Steven Whitehouse 7aed98fb1d GFS2: Remove gfs2_quota_change_host structure
There is only one place this is used, when reading in the quota
changes at mount time. It is not really required and much
simpler to just convert the fields from the on-disk structure
as required.

There should be no functional change as a result of this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:59:05 +00:00
Steven Whitehouse e4f2920625 GFS2: Clean up releasepage
For historical reasons, we drop and retake the log lock in ->releasepage()
however, since there is no reason why we cannot hold the log lock over
the whole function, this allows some simplification. In particular,
pinning a buffer is only ever done under the log lock, so it is possible
here to remove the test for pinned buffers in the second loop, since it
is impossible for that to happen (it is also tested in the first loop).

As a result, two tests made later in the second loop become constants
and can also be reduced to the only possible branch. So the net result
is to remove various bits of unreachable code and make this more
readable.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:58:41 +00:00
Bob Peterson 5ea5050cec GFS2: Implement a "rgrp has no extents longer than X" scheme
With the preceding patch, we started accepting block reservations
smaller than the ideal size, which requires a lot more parsing of the
bitmaps. To reduce the amount of bitmap searching, this patch
implements a scheme whereby each rgrp keeps track of the point
at this multi-block reservations will fail.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:58:08 +00:00
Bob Peterson 1330edbeaf GFS2: Drop inadequate rgrps from the reservation tree
This is just basically a resend of a patch I posted earlier.
It didn't change from its original, except in diff offsets, etc:

This patch fixes a bug in the GFS2 block allocation code. The problem
starts if a process already has a multi-block reservation, but for
some reason, another process disqualifies it from further allocations.
For example, the other process might set on the GFS2_RDF_ERROR bit.
The process holding the reservation jumps to label skip_rgrp, but
that label comes after the code that removes the reservation from the
tree. Therefore, the no longer usable reservation is not removed from
the rgrp's reservations tree; it's lost. Eventually, the lost reservation
causes the count of reserved blocks to get off, and eventually that
causes a BUG_ON(rs->rs_rbm.rgd->rd_reserved < rs->rs_free) to trigger.
This patch moves the call to after label skip_rgrp so that the
disqualified reservation is properly removed from the tree, thus keeping
the rgrp rd_reserved count sane.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:57:31 +00:00
Bob Peterson 5ce13431dd GFS2: If requested is too large, use the largest extent in the rgrp
Here is a second try at a patch I posted earlier, which also implements
suggestions Steve made:

Before this patch, GFS2 would keep searching through all the rgrps
until it found one that had a chunk of free blocks big enough to
satisfy the size hint, which is based on the file write size,
regardless of whether the chunk was big enough to perform the write.
However, when doing big writes there may not be a large enough
chunk of free blocks in any rgrp, due to file system fragmentation.
The largest chunk may be big enough to satisfy the write request,
but it may not meet the ideal reservation size from the "size hint".
The writes would slow to a crawl because every write would search
every rgrp, then finally give up and default to a single-block write.
In my case, performance would drop from 425MB/s to 18KB/s, or 24000
times slower.

This patch basically makes it so that if we can't find a contiguous
chunk of blocks big enough to satisfy the sizehint, we'll use the
largest chunk of blocks we found that will still contain the write.
It does so by keeping track of the largest run of blocks within the
rgrp.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-03 09:57:02 +00:00
Tetsuo Handa 0b3a2c9968 GFS2: Fix unsafe dereference in dump_holder()
GLOCK_BUG_ON() might call this function without RCU read lock. Make sure that
RCU read lock is held when using task_struct returned from pid_task().

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2014-01-02 12:18:04 +00:00
Steven Whitehouse 582d2f7aed GFS2: Wait for async DIO in glock state changes
We need to wait for any outstanding DIO to complete in a couple
of situations. Firstly, in case we are changing out of deferred
mode (in inode_go_sync) where GLF_DIRTY will not be set. That
call could be prefixed with a test for gl_state == LM_ST_DEFERRED
but it doesn't seem worth it bearing in mind that the test for
outstanding DIO is very quick anyway, in the usual case that there
is none.

The second case is in inode_go_lock which will catch the cases
where we have a cached EX lock, but where we grant deferred locks
against it so that there is no glock state transistion. We only
need to wait if the state is not deferred, since DIO is valid
anyway in that state.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20 10:42:08 +00:00
Steven Whitehouse dfd11184d8 GFS2: Fix incorrect invalidation for DIO/buffered I/O
In patch 209806aba9 we allowed
local deferred locks to be granted against a cached exclusive
lock. That opened up a corner case which this patch now
fixes.

The solution to the problem is to check whether we have cached
pages each time we do direct I/O and if so to unmap, flush
and invalidate those pages. Since the glock state machine
normally does that for us, mostly the code will be a no-op.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-20 10:41:21 +00:00
Bob Peterson 502be2a32f GFS2: Fix slab memory leak in gfs2_bufdata
This patch fixes a slab memory leak that sometimes can occur
for files with a very short lifespan. The problem occurs when
a dinode is deleted before it has gotten to the journal properly.
In the leak scenario, the bd object is pinned for journal
committment (queued to the metadata buffers queue: sd_log_le_buf)
but is subsequently unpinned and dequeued before it finds its way
to the ail or the revoke queue. In this rare circumstance, the bd
object needs to be freed from slab memory, or it is forgotten.
We have to be very careful how we do it, though, because
multiple processes can call gfs2_remove_from_journal. In order to
avoid double-frees, only the process that does the unpinning is
allowed to free the bd.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:40 +00:00
Bob Peterson 9290a9a7c0 GFS2: Fix use-after-free race when calling gfs2_remove_from_ail
Function gfs2_remove_from_ail drops the reference on the bh via
brelse. This patch fixes a race condition whereby bh is deferenced
after the brelse when setting bd->bd_blkno = bh->b_blocknr;
Under certain rare circumstances, bh might be gone or reused,
and bd->bd_blkno is set to whatever that memory happens to be,
which is often 0. Later, in gfs2_trans_add_unrevoke, that bd fails
the test "bd->bd_blkno >= blkno" which causes it to never be freed.
The end result is that the bd is never freed from the bufdata cache,
which results in this error:
slab error in kmem_cache_destroy(): cache `gfs2_bufdata': Can't free all objects

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:23 +00:00
Steven Whitehouse dfe5b9ad83 GFS2: don't hold s_umount over blkdev_put
This is a GFS2 version of Tejun's patch:
4f331f01b9
vfs: don't hold s_umount over close_bdev_exclusive() call

In this case its blkdev_put itself that is the issue and this
patch uses the same solution of dropping and retaking s_umount.

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-12-13 21:42:03 +00:00
Kent Overstreet 4f024f3797 block: Abstract out bvec iterator
Immutable biovecs are going to require an explicit iterator. To
implement immutable bvecs, a later patch is going to add a bi_bvec_done
member to this struct; for now, this patch effectively just renames
things.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Ed L. Cashin" <ecashin@coraid.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Yehuda Sadeh <yehuda@inktank.com>
Cc: Sage Weil <sage@inktank.com>
Cc: Alex Elder <elder@inktank.com>
Cc: ceph-devel@vger.kernel.org
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: dm-devel@redhat.com
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: linux390@de.ibm.com
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Benny Halevy <bhalevy@tonian.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: xfs@oss.sgi.com
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchand@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Peng Tao <tao.peng@emc.com>
Cc: Andy Adamson <andros@netapp.com>
Cc: fanchaoting <fanchaoting@cn.fujitsu.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Pankaj Kumar <pankaj.km@samsung.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>6
2013-11-23 22:33:47 -08:00
Steven Whitehouse ea0341e071 GFS2: Fix ref count bug relating to atomic_open
In the case that atomic_open calls finish_no_open() with
the dentry that was supplied to gfs2_atomic_open() an
extra reference count is required. This patch fixes that
issue preventing a bug trap triggering at umount time.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-11-21 18:47:57 +00:00
Michal Nazarewicz e3c4269d13 GFS2: fix potential NULL pointer dereference
Commit [e66cf1610: GFS2: Use lockref for glocks] replaced call:
    atomic_read(&gi->gl->gl_ref) == 0
with:
    __lockref_is_dead(&gl->gl_lockref)
therefore changing how gl is accessed, from gi->gl to plan gl.
However, gl can be a NULL pointer, and so gi->gl needs to be
used instead (which is guaranteed not to be NULL because fo
the while loop checking that condition).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-11-21 09:55:45 +00:00
Al Viro 951b4bd553 gfs2: endianness misannotations
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-15 22:04:16 -05:00
Linus Torvalds 9bc9ccd7db Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "All kinds of stuff this time around; some more notable parts:

   - RCU'd vfsmounts handling
   - new primitives for coredump handling
   - files_lock is gone
   - Bruce's delegations handling series
   - exportfs fixes

  plus misc stuff all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits)
  ecryptfs: ->f_op is never NULL
  locks: break delegations on any attribute modification
  locks: break delegations on link
  locks: break delegations on rename
  locks: helper functions for delegation breaking
  locks: break delegations on unlink
  namei: minor vfs_unlink cleanup
  locks: implement delegations
  locks: introduce new FL_DELEG lock flag
  vfs: take i_mutex on renamed file
  vfs: rename I_MUTEX_QUOTA now that it's not used for quotas
  vfs: don't use PARENT/CHILD lock classes for non-directories
  vfs: pull ext4's double-i_mutex-locking into common code
  exportfs: fix quadratic behavior in filehandle lookup
  exportfs: better variable name
  exportfs: move most of reconnect_path to helper function
  exportfs: eliminate unused "noprogress" counter
  exportfs: stop retrying once we race with rename/remove
  exportfs: clear DISCONNECTED on all parents sooner
  exportfs: more detailed comment for path_reconnect
  ...
2013-11-13 15:34:18 +09:00
Linus Torvalds 8b5baa460b The main feature of interest this time is quota updates. There are
some clean ups and some patches to use the new generic lru list
 code. There is still plenty of scope for some further changes in
 due course - faster lookups of quota structures is very much
 on the todo list. Also, a start has been made towards the more tricky
 issue of using the generic lru code with glocks, but that will
 have to be completed in a subsequent merge window.
 
 The other, more minor feature, is that there have been a number of
 performance patches which relate to block allocation. In particular
 they will improve performance when the disk is nearly full.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSeQMSAAoJEMrg3m4a/8jS4/EP/AtkfsT+GATPmK2R3Yoy0Hrb
 4KucaloOtlUmSsVwTpzYGYZaqJo2D5BndJWw9jekPJOS4aB5CbE1ZYCMIKyuhhr4
 Y70kjfGlwK5hRSItPJ5gHnWkiTZzR65wBLj/+EBAFm2gF3UsJ4DJvLNvd8DP9SJC
 3IfYfqV6cPa7aPDhmEbdq5h0X5iSI+Ee/X2Z3a6fe7rErR1cD4iAEFEyPHa0aHgt
 TkrS32DodOn/J26PvUFq5MUb+El+Ul6EpeB3CC8UN0+pvucAKCMVy8+sPROTbViz
 mMRyWxHHHPDEEulFPWJlXW/tOAhHMHTPGbnWu4bH+iDudzOHif7E0tWklPR9bJAY
 4/1Fa4ILIxV02kdGBHxO74Vv/ir4gyLzzXCPbXOpxu5jMw3VdN9dp0/Uck+rmsya
 rC5Q/8vm4AUO7YYHUBEEaT7Nqp8HRRWGwv11Wdyqf3RQyC5jYHNEXYJkdMsODEae
 p+Ju/O6MfLw68IrG38RaGT4/tCBPonggsCVxwqNxqyDnjtNEpO/o3VjJMJ/3j2b5
 CCRx+9JYENT8EsdpIFWasfABy66xbKPTE9RiMUbk1e73julXLfzIMI3/Ol/Bj7rQ
 YLs5XYrKcz3QfYgMvNS3nMbI3w3nJrCnzdV7STps8nyaOa1oQndGxe9b7tDb+Fb8
 /acuYuQclbvsAkzvH4jc
 =qwXe
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull gfs2 updates from Steven Whitehouse:
 "The main feature of interest this time is quota updates.  There are
  some clean ups and some patches to use the new generic lru list code.

  There is still plenty of scope for some further changes in due course -
  faster lookups of quota structures is very much on the todo list.
  Also, a start has been made towards the more tricky issue of using the
  generic lru code with glocks, but that will have to be completed in a
  subsequent merge window.

  The other, more minor feature, is that there have been a number of
  performance patches which relate to block allocation.  In particular
  they will improve performance when the disk is nearly full"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Use generic list_lru for quota
  GFS2: Rename quota qd_lru_lock qd_lock
  GFS2: Use reflink for quota data cache
  GFS2: Use lockref for glocks
  GFS2: Protect quota sync generation
  GFS2: Inline qd_trylock into gfs2_quota_unlock
  GFS2: Make two similar quota code fragments into a function
  GFS2: Remove obsolete quota tunable
  GFS2: Move gfs2_icbit_munge into quota.c
  GFS2: Speed up starting point selection for block allocation
  GFS2: Add allocation parameters structure
  GFS2: Clean up reservation removal
  GFS2: fix dentry leaks
  GFS2: new function gfs2_rbm_incr
  GFS2: Introduce rbm field bii
  GFS2: Do not reset flags on active reservations
  GFS2: introduce bi_blocks for optimization
  GFS2: optimize rbm_from_block wrt bi_start
  GFS2: d_splice_alias() can't return error
2013-11-11 07:11:00 +09:00
Steven Whitehouse 2147dbfd05 GFS2: Use generic list_lru for quota
By using the generic list_lru code, we can now separate the
per sb quota list locking from the lru locking. The lru
lock is made into the inner-most lock.

As a result of this new lock order, we may occasionally see
items on the per-sb quota list which are "dead" so that the
two places where we traverse that list are updated to take
account of that.

As a result of this patch, the gfs2 quota shrinker is now
NUMA zone aware, and we are also laying the foundations for
further improvments in due course.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
Cc: Dave Chinner <dchinner@redhat.com>
2013-11-04 11:17:49 +00:00
Steven Whitehouse 7d80823e1d GFS2: Rename quota qd_lru_lock qd_lock
This is a straight forward rename which is in preparation for
introducing the generic list_lru infrastructure in the
following patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
2013-11-04 11:17:36 +00:00
Steven Whitehouse 9b9f039d57 GFS2: Use reflink for quota data cache
This patch adds reflink support to the quota data cache. It
looks a bit strange because we still don't have a sensible
split in the lookup by id and the lru list. That is coming in
later patches though.

The intent here is just to swap the current ref count for
reflinks in all cases with as little as possible other change.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Abhijith Das <adas@redhat.com>
Tested-by: Abhijith Das <adas@redhat.com>
2013-11-04 11:17:07 +00:00
Al Viro 87dc800be2 new helper: kfree_put_link()
duplicated to hell and back...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24 23:34:49 -04:00
Steven Whitehouse e66cf16109 GFS2: Use lockref for glocks
Currently glocks have an atomic reference count and also a spinlock
which covers various internal fields, such as the state. This intent of
this patch is to replace the spinlock and the atomic reference count
with a lockref structure. This contains a spinlock which we can continue
to use as before, and a reference counter which is used in conjuction
with the spinlock to replace the previous atomic counter.

As a result of this there are some new rules for reference counting on
glocks. We need to distinguish between reference count changes under
gl_spin (which are now just increment or decrement of the new counter,
provided the count cannot hit zero) and those which are outside of
gl_spin, but which now take gl_spin internally.

The conversion is relatively straight forward. There is probably some
further clean up which can be done, but the priority at this stage is to
make the change in as simple a manner as possible.

A consequence of this change is that the reference count is being
decoupled from the lru list processing. This should allow future
adoption of the lru_list code with glocks in due course.

The reason for using the "dead" state and not just relying on 0 being
the "invalid state" is so that in due course 0 ref counts can be
allowable. The intent is to eventually be able to remove the ref count
changes which are currently hidden away in state_change().

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-10-15 15:18:08 +01:00
Steven Whitehouse e46c772dba GFS2: Protect quota sync generation
Now that gfs2_quota_sync can be potentially called from multiple
threads, we should protect this bit of code, and the sync generation
number in particular in order to ensure that there are no races
when syncing quotas.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2013-10-04 12:29:34 +01:00
Steven Whitehouse aabd7c72f5 GFS2: Inline qd_trylock into gfs2_quota_unlock
The function qd_trylock was not a trylock despite its name and
can be inlined into gfs2_quota_unlock in order to make the
code a bit clearer. There should be no functional change as a
result of this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2013-10-04 11:39:21 +01:00
Steven Whitehouse 1bf59bf6de GFS2: Make two similar quota code fragments into a function
There should be no functional change bar the removal of a
test of the MS_READONLY flag which would never be reachable.
This merges the common code from qd_fish and qd_trylock into
a single function and calls it from both those places.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2013-10-04 11:14:46 +01:00
Steven Whitehouse bef292a72d GFS2: Remove obsolete quota tunable
There is no need for a paramater which relates to the internals
of quota to be exposed to users. The only possible use would be
to turn it up so large that the memory allocation fails. So lets
remove it and set it to a sensible value which ensures that we
don't ask for multipage allocations.

Currently the size of struct gfs2_holder means that the caluclated
value is identical to the previous default value, so there should
be no functional change.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
2013-10-04 09:49:29 +01:00
Steven Whitehouse 26e43a15d4 GFS2: Move gfs2_icbit_munge into quota.c
This function is only called twice, and both callers are
quota related, so lets move this function into quota.c and
make it static.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-10-02 14:47:02 +01:00
Steven Whitehouse 9e07f2cb3d GFS2: Speed up starting point selection for block allocation
When setting the starting point for block allocation, there were calls
to both gfs2_rbm_to_block() and gfs2_rbm_from_block() in the common case
of there being an active reservation. The gfs2_rbm_from_block() function
can be quite slow, and since the two conversions were effectively a
no-op, it makes sense to avoid them entirely in this case.

There is no functional change here, but the code should be a bit more
efficient after this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-10-02 14:42:45 +01:00
Steven Whitehouse 7b9cff4671 GFS2: Add allocation parameters structure
This patch adds a structure to contain allocation parameters with
the intention of future expansion of this structure. The idea is
that we should be able to add more information about the allocation
in the future in order to allow the allocator to make a better job
of placing the requests on-disk.

There is no functional difference from applying this patch.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-10-02 11:13:25 +01:00
Steven Whitehouse af5c269799 GFS2: Clean up reservation removal
The reservation for an inode should be cleared when it is truncated so
that we can start again at a different offset for future allocations.
We could try and do better than that, by resetting the search based on
where the truncation started from, but this is only a first step.

In addition, there are three callers of gfs2_rs_delete() but only one
of those should really be testing the value of i_writecount. While
we get away with that in the other cases currently, I think it would
be better if we made that test specific to the one case which
requires it.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-27 12:49:33 +01:00
Miklos Szeredi 5ca1db41ec GFS2: fix dentry leaks
We need to dput() the result of d_splice_alias(), unless it is passed to
finish_no_open().

Edited by Steven Whitehouse in order to make it apply to the current
GFS2 git tree, and taking account of a prerequisite patch which hasn't
been applied.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: stable@vger.kernel.org
2013-09-23 13:30:57 +01:00
Bob Peterson 149ed7f51e GFS2: new function gfs2_rbm_incr
Since the previous patch eliminated bi in favor of bii, this follow-on
patch needed to be adjusted accordingly. Here is the revised version.

This patch adds a new function, gfs2_rbm_incr, which increments
an rbm structure. This is more efficient than calling gfs2_rbm_to_block,
incrementing, then calling gfs2_rbm_from_block.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-18 10:40:38 +01:00
Bob Peterson e579ed4f44 GFS2: Introduce rbm field bii
This is a respin of the original patch. As Steve pointed out, the
introduction of field bii makes it easy to eliminate bi itself.
This revised patch does just that, replacing bi with bii.

This patch adds a new field to the rbm structure, called bii,
which is an index into the array of bitmaps for an rgrp.
This replaces *bi which was a pointer to the bitmap.
This is being done for further optimizations.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-18 10:39:53 +01:00
Bob Peterson b870890519 GFS2: Do not reset flags on active reservations
When we used try locks for rgrps on block allocations, it was important
to clear the flags field so that we used a blocking hold on the glock.
Now that we're not doing try locks, clearing flags is unnecessary, and
a waste of time. In fact, it's probably doing the wrong thing because
it clears the GL_SKIP bit that was set for the lvb tracking purposes.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-17 10:19:29 +01:00
Bob Peterson 7e230f5774 GFS2: introduce bi_blocks for optimization
This patch introduces a new field in the bitmap structure called
bi_blocks. Its purpose is to save us from constantly multiplying
bi_len by the constant GFS2_NBBY. It also paves the way for more
optimization in a future patch.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-17 10:15:13 +01:00
Bob Peterson 6aa7640f30 GFS2: optimize rbm_from_block wrt bi_start
In function gfs2_rbm_from_block, it starts by checking if the block
falls within the first bitmap. It does so by checking if the rbm's
offset is less than (rbm->bi->bi_start + rbm->bi->bi_len) * GFS2_NBBY.
However, the first bitmap will always have bi_start==0. Therefore
this is an unnecessary calculation in a function that gets called
billions of times. This patch removes the reference to bi_start.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-17 10:14:39 +01:00
Miklos Szeredi 0d0d110720 GFS2: d_splice_alias() can't return error
unless it was given an IS_ERR(inode), which isn't the case here.  So clean
up the unnecessary error handling in gfs2_create_inode().

This paves the way for real fixes (hence the stable Cc).

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: stable@vger.kernel.org
2013-09-17 10:04:07 +01:00
Miklos Szeredi c5bf8fef52 gfs2: set FILE_CREATED
In gfs2_create_inode() set FILE_CREATED in *opened.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Linus Torvalds ac4de9543a Merge branch 'akpm' (patches from Andrew Morton)
Merge more patches from Andrew Morton:
 "The rest of MM.  Plus one misc cleanup"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits)
  mm/Kconfig: add MMU dependency for MIGRATION.
  kernel: replace strict_strto*() with kstrto*()
  mm, thp: count thp_fault_fallback anytime thp fault fails
  thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()
  thp: do_huge_pmd_anonymous_page() cleanup
  thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
  mm: cleanup add_to_page_cache_locked()
  thp: account anon transparent huge pages into NR_ANON_PAGES
  truncate: drop 'oldsize' truncate_pagecache() parameter
  mm: make lru_add_drain_all() selective
  memcg: document cgroup dirty/writeback memory statistics
  memcg: add per cgroup writeback pages accounting
  memcg: check for proper lock held in mem_cgroup_update_page_stat
  memcg: remove MEMCG_NR_FILE_MAPPED
  memcg: reduce function dereference
  memcg: avoid overflow caused by PAGE_ALIGN
  memcg: rename RESOURCE_MAX to RES_COUNTER_MAX
  memcg: correct RESOURCE_MAX to ULLONG_MAX
  mm: memcg: do not trap chargers with full callstack on OOM
  mm: memcg: rework and document OOM waiting and wakeup
  ...
2013-09-12 15:44:27 -07:00
Kirill A. Shutemov 7caef26767 truncate: drop 'oldsize' truncate_pagecache() parameter
truncate_pagecache() doesn't care about old size since commit
cedabed49b ("vfs: Fix vmtruncate() regression").  Let's drop it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12 15:38:02 -07:00
Dave Chinner 1ab6c4997e fs: convert fs shrinkers to new scan/count API
Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time.  For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e.  all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
[assorted fixes folded in]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:31 -04:00
Glauber Costa 55f841ce93 super: fix calculation of shrinkable objects for small numbers
The sysctl knob sysctl_vfs_cache_pressure is used to determine which
percentage of the shrinkable objects in our cache we should actively try
to shrink.

It works great in situations in which we have many objects (at least more
than 100), because the aproximation errors will be negligible.  But if
this is not the case, specially when total_objects < 100, we may end up
concluding that we have no objects at all (total / 100 = 0, if total <
100).

This is certainly not the biggest killer in the world, but may matter in
very low kernel memory situations.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 18:56:29 -04:00
Linus Torvalds 6c337ad6cc This is possibly the smallest ever set of GFS2 patches for a merge
window. Also, most of them are bug fixes this time. Two of my
 three patches (moving gfs2_sync_meta and merging the two writepage
 implementations) are clean ups with the third (taking the glock ref
 in examine_bucket) being a fix for a difficult to hit race condition.
 
 The removal of an unused memory barrier is a clean up from Bob Peterson,
 and the "spectator" relates to a rarely used mount option. Ben
 Marzinski's patch fixes a corner case where the incorrect inode
 flags were being set, resulting in incorrect behaviour on fsync.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJSLYWjAAoJEMrg3m4a/8jSHBUQAIBXyrLypCJNJNpYreRlg4Te
 YZOqXMxqrfsWD5jkfWvxV/vZyVu6FEJRJRRwKoO8foZhnVIy2HiGdpFvwk4NMzGs
 Ah5cCaySgDhFIKXNq1CnhVZNpRU8N+Lf+8U2MwkpPUrT+KnZlDdG8COuHbWJ3/+t
 prqHy0N1oa8hgj3P0oDf3kyQKiB2MQVhORTBVOWaas1mzw8vsKRjvO7k5g0VPcfV
 VnIYzvQ033V6pW6ymWGcbz6us7hXeFJmXo4grAdLIclK6QDt1zPLVKlvk7X/Me52
 PTxO5AP/Nw6AlJABTNy8UZ3uJO3QaiqhcIz+OzIXlYpsICqma30qkHHrWzOh0Q5F
 HrDqh03A2zuqigamVmsJr2+tbdLWxz72D6N3eFlIBLAw2JxILosCPcGc4UNzG/7g
 mFr6+LDnZXzFpjT6OBdepjVH06ZqfcV+nr3KhvmKQT+tkzTizArbt8JutOdKjahH
 3+pXx/bA0B7LRVYymK5/xrfbjnQp629GG3QnGiI23gtDZ6eEzm663bQz6MTgjR88
 H0nigLi8d2KmM8UnrZqq5nI+LirFD/IZ5DO47Vv6w3NCZW/ZpNOodlWkd1igvhHR
 ugTvKqyw2O9hImM8jPoRWbqL01RGTKTMMn/3cCCO8Uww7DpI4GLb48a39buYh4+2
 kijIxCtntfQAOiR6MCvA
 =rsQv
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull GFS2 updates from Steven Whitehouse:
 "This is possibly the smallest ever set of GFS2 patches for a merge
  window.  Also, most of them are bug fixes this time.

  Two of my three patches (moving gfs2_sync_meta and merging the two
  writepage implementations) are clean ups with the third (taking the
  glock ref in examine_bucket) being a fix for a difficult to hit race
  condition.

  The removal of an unused memory barrier is a clean up from Bob
  Peterson, and the "spectator" relates to a rarely used mount option.
  Ben Marzinski's patch fixes a corner case where the incorrect inode
  flags were being set, resulting in incorrect behaviour on fsync"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: dirty inode correctly in gfs2_write_end
  GFS2: Don't flag consistency error if first mounter is a spectator
  GFS2: Remove unnecessary memory barrier
  GFS2: Merge ordered and writeback writepage
  GFS2: Take glock reference in examine_bucket()
  GFS2: Move gfs2_sync_meta to lops.c
2013-09-09 09:16:51 -07:00
Linus Torvalds dc0755cdb1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 2 (of many) from Al Viro:
 "Mostly Miklos' series this time"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  constify dcache.c inlined helpers where possible
  fuse: drop dentry on failed revalidate
  fuse: clean up return in fuse_dentry_revalidate()
  fuse: use d_materialise_unique()
  sysfs: use check_submounts_and_drop()
  nfs: use check_submounts_and_drop()
  gfs2: use check_submounts_and_drop()
  afs: use check_submounts_and_drop()
  vfs: check unlinked ancestors before mount
  vfs: check submounts and drop atomically
  vfs: add d_walk()
  vfs: restructure d_genocide()
2013-09-07 14:36:57 -07:00
Linus Torvalds 2e515bf096 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree from Jiri Kosina:
 "The usual trivial updates all over the tree -- mostly typo fixes and
  documentation updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits)
  doc: Documentation/cputopology.txt fix typo
  treewide: Convert retrun typos to return
  Fix comment typo for init_cma_reserved_pageblock
  Documentation/trace: Correcting and extending tracepoint documentation
  mm/hotplug: fix a typo in Documentation/memory-hotplug.txt
  power: Documentation: Update s2ram link
  doc: fix a typo in Documentation/00-INDEX
  Documentation/printk-formats.txt: No casts needed for u64/s64
  doc: Fix typo "is is" in Documentations
  treewide: Fix printks with 0x%#
  zram: doc fixes
  Documentation/kmemcheck: update kmemcheck documentation
  doc: documentation/hwspinlock.txt fix typo
  PM / Hibernate: add section for resume options
  doc: filesystems : Fix typo in Documentations/filesystems
  scsi/megaraid fixed several typos in comments
  ppc: init_32: Fix error typo "CONFIG_START_KERNEL"
  treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
  page_isolation: Fix a comment typo in test_pages_isolated()
  doc: fix a typo about irq affinity
  ...
2013-09-06 09:36:28 -07:00
Miklos Szeredi 1191a2bdf0 gfs2: use check_submounts_and_drop()
Do have_submounts(), shrink_dcache_parent() and d_drop() atomically.

check_submounts_and_drop() can deal with negative dentries and
non-directories as well.

Non-directories can also be mounted on.  And just like directories we don't
want these to disappear with invalidation.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-05 16:23:51 -04:00
Benjamin Marzinski 0c9018097f GFS2: dirty inode correctly in gfs2_write_end
GFS2 was only setting I_DIRTY_DATASYNC on files that it wrote to, when
it actually increased the file size.  If gfs2_fsync was called without
I_DIRTY_DATASYNC set, it didn't flush the incore data to the log before
returning, so any metadata or journaled data changes were not getting
fsynced. This meant that writes to the middle of files were not always
getting fsynced properly.

This patch makes gfs2 set I_DIRTY_DATASYNC whenever metadata has been
updated during a write. It also make gfs2_sync flush the incore log
if I_DIRTY_PAGES is set, and the file is using data journalling. This
will make sure that all incore logged data gets written to disk before
returning from a fsync.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-05 09:04:24 +01:00
Bob Peterson 1d12d175ea GFS2: Don't flag consistency error if first mounter is a spectator
This patch checks for the first mounter being a specator. If so, it
makes sure all the journals are clean. If there's a dirty journal,
the mount fails.

Testing results:

# insmod gfs2.ko
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
mount: permission denied
# dmesg | tail -2
[ 3390.655996] GFS2: fsid=MUSKETEER:home: Now mounting FS...
[ 3390.841336] GFS2: fsid=MUSKETEER:home.s: jid=0: Journal is dirty, so the first mounter must not be a spectator.
# mount -tgfs2 /dev/sasdrives/scratch /mnt/gfs2
# umount /mnt/gfs2
# mount -tgfs2 -o spectator /dev/sasdrives/scratch /mnt/gfs2
# ls /mnt/gfs2|wc -l
352
# umount /mnt/gfs2

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-05 09:03:57 +01:00
Bob Peterson 068213f7d3 GFS2: Remove unnecessary memory barrier
Function test_and_clear_bit implies a memory barrier, so subsequent
memory barriers are unnecessary.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-09-04 15:58:21 +01:00
Steven Whitehouse 9d35814355 GFS2: Merge ordered and writeback writepage
The writepages function was recently merged between writeback
and ordered mode. This completes the change by doing the same
with writepage. The remaining differences in writepage were
left over from some earlier time and not actually doing anything
useful.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-27 21:22:07 +01:00
Joe Perches 8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Steven Whitehouse 7286b31eab GFS2: Take glock reference in examine_bucket()
We need to check the glock ref counter in a race free way
in order to ensure that the gfs2_glock_hold() call will
succeed. The easiest way to do that is to simply take the
reference count early in the common code of examine_bucket,
skipping any glocks with zero ref count.

That means that the examiner functions all need to put their
reference on the glock once they've performed their function.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
2013-08-20 09:35:09 +01:00
Steven Whitehouse 7c0ef28a2c GFS2: Move gfs2_sync_meta to lops.c
Since gfs2_sync_meta() is only called from a single file, lets move
it to lops.c where it is used, and mark it static. At the same
time, we can clean up the meta_io.h header too.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 17:26:32 +01:00
Steven Whitehouse 7bd9ee58a4 GFS2: Check for glock already held in gfs2_getxattr
Since the introduction of atomic_open, gfs2_getxattr can be
called with the glock already held, so we need to allow for
this.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
2013-08-19 09:33:57 +01:00
Dan Carpenter dfc4616dde GFS2: alloc_workqueue() doesn't return an ERR_PTR
alloc_workqueue() returns a NULL on error, it doesn't return an ERR_PTR.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:33:43 +01:00
Benjamin Marzinski 1bc333f4cf GFS2: don't overrun reserved revokes
When run during fsync, a gfs2_log_flush could happen between the
time when gfs2_ail_flush checked the number of blocks to revoke,
and when it actually started the transaction to do those revokes.
This occassionally caused it to need more revokes than it reserved,
causing gfs2 to crash.

Instead of just reserving enough revokes to handle the blocks that
currently need them, this patch makes gfs2_ail_flush reserve the
maximum number of revokes it can, without increasing the total number
of reserved log blocks. This patch also passes the number of reserved
revokes to __gfs2_ail_flush() so that it doesn't go over its limit
and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
will still cause a BUG() necessary revokes are skipped.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:33:16 +01:00
Tejun Heo d08fa65a81 GFS2: WQ_NON_REENTRANT is meaningless and going away
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away.  Remove its usages.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
2013-08-19 09:33:01 +01:00
Steven Whitehouse 2523d47a79 GFS2: Fix typo in gfs2_create_inode()
PTR_RET should be PTR_ERR

Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-08-19 09:32:29 +01:00
Linus Torvalds 790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Linus Torvalds fc76a258d4 Driver core patches for 3.11-rc1
Here's the big driver core merge for 3.11-rc1
 
 Lots of little things, and larger firmware subsystem updates, all
 described in the shortlog.  Nice thing here is that we finally get rid
 of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
 been always on for a number of kernel releases, now it's just removed.)
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iEYEABECAAYFAlHRsGMACgkQMUfUDdst+ylIIACfW8lLxOPVK+iYG699TWEBAkp0
 LFEAnjlpAMJ1JnoZCuWDZObNCev93zGB
 =020+
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here's the big driver core merge for 3.11-rc1

  Lots of little things, and larger firmware subsystem updates, all
  described in the shortlog.  Nice thing here is that we finally get rid
  of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rohtwell (it had
  been always on for a number of kernel releases, now it's just
  removed)"

* tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
  driver core: device.h: fix doc compilation warnings
  firmware loader: fix another compile warning with PM_SLEEP unset
  build some drivers only when compile-testing
  firmware loader: fix compile warning with PM_SLEEP set
  kobject: sanitize argument for format string
  sysfs_notify is only possible on file attributes
  firmware loader: simplify holding module for request_firmware
  firmware loader: don't export cache_firmware and uncache_firmware
  drivers/base: Use attribute groups to create sysfs memory files
  firmware loader: fix compile warning
  firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER
  Documentation: Updated broken link in HOWTO
  Finally eradicate CONFIG_HOTPLUG
  driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend
  driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware
  Documentation: Tidy up some drivers/base/core.c kerneldoc content.
  platform_device: use a macro instead of platform_driver_register
  firmware: move EXPORT_SYMBOL annotations
  firmware: Avoid deadlock of usermodehelper lock at shutdown
  dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly
  ...
2013-07-02 11:44:19 -07:00
Linus Torvalds c4eb1b0730 Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw
Pull GFS2 updates from Steven Whitehouse:
 "There are a few bug fixes for various, mostly very minor corner cases,
  plus some interesting new features.

  The new features include atomic_open whose main benefit will be the
  reduction in locking overhead in case of combined lookup/create and
  open operations, sorting the log buffer lists by block number to
  improve the efficiency of AIL writeback, and aggressively issuing
  revokes in gfs2_log_flush to reduce overhead when dropping glocks."

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: Reserve journal space for quota change in do_grow
  GFS2: Fix fstrim boundary conditions
  GFS2: fix warning message
  GFS2: aggressively issue revokes in gfs2_log_flush
  GFS2: fix regression in dir_double_exhash
  GFS2: Add atomic_open support
  GFS2: Only do one directory search on create
  GFS2: fix error propagation in init_threads()
  GFS2: Remove no-op wrapper function
  GFS2: Cocci spatch "ptr_ret.spatch"
  GFS2: Eliminate gfs2_rg_lops
  GFS2: Sort buffer lists by inplace block number
2013-07-02 09:41:18 -07:00
Linus Torvalds 9e239bb939 Lots of bug fixes, cleanups and optimizations. In the bug fixes
category, of note is a fix for on-line resizing file systems where the
 block size is smaller than the page size (i.e., file systems 1k blocks
 on x86, or more interestingly file systems with 4k blocks on Power or
 ia64 systems.)
 
 In the cleanup category, the ext4's punch hole implementation was
 significantly improved by Lukas Czerner, and now supports bigalloc
 file systems.  In addition, Jan Kara significantly cleaned up the
 write submission code path.  We also improved error checking and added
 a few sanity checks.
 
 In the optimizations category, two major optimizations deserve
 mention.  The first is that ext4_writepages() is now used for
 nodelalloc and ext3 compatibility mode.  This allows writes to be
 submitted much more efficiently as a single bio request, instead of
 being sent as individual 4k writes into the block layer (which then
 relied on the elevator code to coalesce the requests in the block
 queue).  Secondly, the extent cache shrink mechanism, which was
 introduce in 3.9, no longer has a scalability bottleneck caused by the
 i_es_lru spinlock.  Other optimizations include some changes to reduce
 CPU usage and to avoid issuing empty commits unnecessarily.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG
 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV
 yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv
 NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg
 rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT
 ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2
 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/
 zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k
 U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD
 vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX
 +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7
 XrddRLGlk4Hf+2o7/D7B
 =SwaI
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduce in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Jeff Layton 1c8c601a8c locks: protect most of the file_lock handling with i_lock
Having a global lock that protects all of this code is a clear
scalability problem. Instead of doing that, move most of the code to be
protected by the i_lock instead. The exceptions are the global lists
that the ->fl_link sits on, and the ->fl_block list.

->fl_link is what connects these structures to the
global lists, so we must ensure that we hold those locks when iterating
over or updating these lists.

Furthermore, sound deadlock detection requires that we hold the
blocked_list state steady while checking for loops. We also must ensure
that the search and update to the list are atomic.

For the checking and insertion side of the blocked_list, push the
acquisition of the global lock into __posix_lock_file and ensure that
checking and update of the  blocked_list is done without dropping the
lock in between.

On the removal side, when waking up blocked lock waiters, take the
global lock before walking the blocked list and dequeue the waiters from
the global list prior to removal from the fl_block list.

With this, deadlock detection should be race free while we minimize
excessive file_lock_lock thrashing.

Finally, in order to avoid a lock inversion problem when handling
/proc/locks output we must ensure that manipulations of the fl_block
list are also protected by the file_lock_lock.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:42 +04:00
Linus Torvalds da53be12bb Don't pass inode to ->d_hash() and ->d_compare()
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:36 +04:00
Al Viro ac6614b764 [readdir] constify ->actor
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:05 +04:00
Al Viro d81a8ef598 [readdir] convert gfs2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:35 +04:00
Bob Peterson a01aedfe21 GFS2: Reserve journal space for quota change in do_grow
If a GFS2 file system is mounted with quotas and a file is grown
in such a way that its free blocks for the allocation are represented
in a secondary bitmap, GFS2 ran out of blocks in the transaction.
That resulted in "fatal: assertion "tr->tr_num_buf <= tr->tr_blocks".
This patch reserves extra blocks for the quota change so the
transaction has enough space.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-06-27 18:16:27 +01:00