Commit graph

36783 commits

Author SHA1 Message Date
Trond Myklebust 2d91e8953c nfsd: Always use lookup_clientid() in nfsd4_process_open1
In later patches, we'll be moving the stateowner table into the
nfs4_client, and by doing this we ensure that we have a cached
nfs4_client pointer.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:06 -04:00
Trond Myklebust 13d6f66b08 nfsd: Convert nfsd4_process_open1() to work with lookup_clientid()
...and have alloc_init_open_stateowner just use the cstate->clp pointer
instead of passing in a clp separately. This allows us to use the
cached nfs4_client pointer in the cstate instead of having to look it
up again.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:05 -04:00
Jeff Layton 4b24ca7d30 nfsd: Allow struct nfsd4_compound_state to cache the nfs4_client
We want to use the nfsd4_compound_state to cache the nfs4_client in
order to optimise away extra lookups of the clid.

In the v4.0 case, we use this to ensure that we only have to look up the
client at most once per compound for each call into lookup_clientid. For
v4.1+ we set the pointer in the cstate during SEQUENCE processing so we
should never need to do a search for it.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:04 -04:00
Jeff Layton 62814d6a9b nfsd: add a nfserrno mapping for -E2BIG to nfserr_fbig
I saw this pop up with some pynfs testing:

    [  123.609992] nfsd: non-standard errno: -7

...and -7 is -E2BIG. I think what happened is that XFS returned -E2BIG
due to some xattr operations with the ACL10 pynfs TEST (I guess it has
limited xattr size?).

Add a better mapping for that error since it's possible that we'll need
it. How about we convert it to NFSERR_FBIG? As Bruce points out, they
both have "BIG" in the name so it must be good.

Also, turn the printk in this function into a WARN() so that we can get
a bit more information about situations that don't have proper mappings.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:03 -04:00
Jeff Layton 722b620d18 nfsd: properly convert return from commit_metadata to __be32
Commit 2a7420c03e504 (nfsd: Ensure that nfsd_create_setattr commits
files to stable storage), added a couple of calls to commit_metadata,
but doesn't convert their return codes to __be32 in the appropriate
places.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:02 -04:00
Trond Myklebust 2dd6e458c3 nfsd: Cleanup - Let nfsd4_lookup_stateid() take a cstate argument
The cstate already holds information about the session, and hence
the client id, so it makes more sense to pass that information
rather than the current practice of passing a 'minor version' number.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:01 -04:00
Trond Myklebust d4e19e7027 nfsd: Don't get a session reference without a client reference
If the client were to disappear from underneath us while we're holding
a session reference, things would be bad. This cleanup helps ensure
that it cannot, which will be a possibility when the client_mutex is
removed.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:55:00 -04:00
Jeff Layton fd44907c2d nfsd: clean up nfsd4_release_lockowner
Now that we know that we won't have several lockowners with the same,
owner->data, we can simplify nfsd4_release_lockowner and get rid of
the lo_list in the process.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:59 -04:00
Trond Myklebust b3c32bcd9c nfsd: NFSv4 lock-owners are not associated to a specific file
Just like open-owners, lock-owners are associated with a name, a clientid
and, in the case of minor version 0, a sequence id. There is no association
to a file.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:58 -04:00
Jeff Layton c53530da4d nfsd: Allow lockowners to hold several stateids
A lockowner can have more than one lock stateid. For instance, if a
process has more than one file open and has locks on both, then the same
lockowner has more than one stateid associated with it. Change it so
that this reality is better reflected by the objects that nfsd uses.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-09 20:54:57 -04:00
Trond Myklebust 3c87b9b7c0 nfsd: lock owners are not per open stateid
In the NFSv4 spec, lock stateids are per-file objects. Lockowners are not.
This patch replaces the current list of lock owners in the open stateids
with a list of lock stateids.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:37 -04:00
Trond Myklebust acf9295b1c nfsd: clean up nfsd4_close_open_stateid
Minor cleanup that should introduce no behavioral changes.

Currently this function just unhashes the stateid and leaves the caller
to do the work of the CLOSE processing.

Change nfsd4_close_open_stateid so that it handles doing all of the work
of closing a stateid. Move the handling of the unhashed stateid into it
instead of doing that work in nfsd4_close. This will help isolate some
coming changes to stateid handling from nfsd4_close.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:36 -04:00
Jeff Layton db24b3b4b2 nfsd: declare v4.1+ openowners confirmed on creation
There's no need to confirm an openowner in v4.1 and above, so we can
go ahead and set NFS4_OO_CONFIRMED when we create openowners in
those versions. This will also be necessary when we remove the
client_mutex, as it'll be possible for two concurrent opens to race
in versions >4.0.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:35 -04:00
Trond Myklebust b607664ee7 nfsd: Cleanup nfs4svc_encode_compoundres
Move the slot return, put session etc into a helper in fs/nfsd/nfs4state.c
instead of open coding in nfs4svc_encode_compoundres.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:34 -04:00
Trond Myklebust e17f99b728 nfsd: nfs4_preprocess_seqid_op should only set *stpp on success
Not technically a bugfix, since nothing tries to use the return pointer
if this function doesn't return success, but it could be a problem
with some coming changes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:33 -04:00
Jeff Layton 5b8db00bae nfsd: add a new /proc/fs/nfsd/max_connections file
Currently, the maximum number of connections that nfsd will allow
is based on the number of threads spawned. While this is fine for a
default, there really isn't a clear relationship between the two.

The number of threads corresponds to the number of concurrent requests
that we want to allow the server to process at any given time. The
connection limit corresponds to the maximum number of clients that we
want to allow the server to handle. These are two entirely different
quantities.

Break the dependency on increasing threads in order to allow for more
connections, by adding a new per-net parameter that can be set to a
non-zero value. The default is still to base it on the number of threads,
so there should be no behavior change for anyone who doesn't use it.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:32 -04:00
Trond Myklebust 0f3a24b43b nfsd: Ensure that nfsd_create_setattr commits files to stable storage
Since nfsd_create_setattr strips the mode from the struct iattr, it
is quite possible that it will optimise away the call to nfsd_setattr
altogether.
If this is the case, then we never call commit_metadata() on the
newly created file.

Also ensure that both nfsd_setattr() and nfsd_create_setattr() fail
when the call to commit_metadata fails.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:31 -04:00
Kinglong Mee 1e444f5bc0 NFSD: Remove iattr parameter from nfsd_symlink()
Commit db2e747b14 (vfs: remove mode parameter from vfs_symlink())
have remove mode parameter from vfs_symlink.
So that, iattr isn't needed by nfsd_symlink now, just remove it.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:31 -04:00
Trond Myklebust 950e0118d0 nfsd: Protect addition to the file_hashtbl
Current code depends on the client_mutex to guarantee a single struct
nfs4_file per inode in the file_hashtbl and make addition atomic with
respect to lookup.  Rely instead on the state_Lock, to make it easier to
stop taking the client_mutex here later.

To prevent an i_lock/state_lock inversion, change nfsd4_init_file to
use ihold instead if igrab. That's also more efficient anyway as we
definitely hold a reference to the inode at that point.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:30 -04:00
Christoph Hellwig 7e6a72e5f1 nfsd: fix file access refcount leak when nfsd4_truncate fails
nfsd4_process_open2 will currently will get access to the file, and then
call nfsd4_truncate to (possibly) truncate it. If that operation fails
though, then the access references will never be released as the
nfs4_ol_stateid is never initialized.

Fix by moving the nfsd4_truncate call into nfs4_get_vfs_file, ensuring
that the refcounts are properly put if the truncate fails.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:29 -04:00
Kinglong Mee 1055414fe1 NFSD: Avoid warning message when compile at i686 arch
fs/nfsd/nfs4xdr.c: In function 'nfsd4_encode_readv':
>> fs/nfsd/nfs4xdr.c:3137:148: warning: comparison of distinct pointer types lacks a cast [enabled by default]
thislen = min(len, ((void *)xdr->end - (void *)xdr->p));

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:28 -04:00
J. Bruce Fields d5e2338324 nfsd4: replace defer_free by svcxdr_tmpalloc
Avoid an extra allocation for the tmpbuf struct itself, and stop
ignoring some allocation failures.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:27 -04:00
J. Bruce Fields bcaab953b1 nfsd4: remove nfs4_acl_new
This is a not-that-useful kmalloc wrapper.  And I'd like one of the
callers to actually use something other than kmalloc.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:27 -04:00
J. Bruce Fields 29c353b3fe nfsd4: define svcxdr_dupstr to share some common code
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:26 -04:00
J. Bruce Fields ce043ac826 nfsd4: remove unused defer_free argument
28e05dd845 "knfsd: nfsd4: represent nfsv4 acl with array instead of
linked list" removed the last user that wanted a custom free function.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:25 -04:00
J. Bruce Fields 7fb84306f5 nfsd4: rename cr_linkname->cr_data
The name of a link is currently stored in cr_name and cr_namelen, and
the content in cr_linkname and cr_linklen.  That's confusing.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:24 -04:00
J. Bruce Fields 52ee04330f nfsd: let nfsd_symlink assume null-terminated data
Currently nfsd_symlink has a weird hack to serve callers who don't
null-terminate symlink data: it looks ahead at the next byte to see if
it's zero, and copies it to a new buffer to null-terminate if not.

That means callers don't have to null-terminate, but they *do* have to
ensure that the byte following the end of the data is theirs to read.

That's a bit subtle, and the NFSv4 code actually got this wrong.

So let's just throw out that code and let callers pass null-terminated
strings; we've already fixed them to do that.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:23 -04:00
J. Bruce Fields 0aeae33f5d nfsd: make NFSv2 null terminate symlink data
It's simple enough for NFSv2 to null-terminate the symlink data.

A bit weird (it depends on knowing that we've already read the following
byte, which is either padding or part of the mode), but no worse than
the conditional kstrdup it otherwise relies on in nfsd_symlink().

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:23 -04:00
J. Bruce Fields b829e9197a nfsd: fix rare symlink decoding bug
An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.

Reported-by: Jeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-07-08 17:14:22 -04:00
Jeff Layton d4c8e34fe8 nfsd: properly handle embedded newlines in fault_injection input
Currently rpc_pton() fails to handle the case where you echo an address
into the file, as it barfs on the newline. Ensure that we NULL out the
first occurrence of any newline.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:38 -04:00
Jeff Layton f7ce5d2842 nfsd: fix return of nfs4_acl_write_who
AFAICT, the only way to hit this error is to pass this function a bogus
"who" value. In that case, we probably don't want to return -1 as that
could get sent back to the client. Turn this into nfserr_serverfault,
which is a more appropriate error for a server bug like this.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:38 -04:00
Jeff Layton 94ec938b61 nfsd: add appropriate __force directives to filehandle generation code
The filehandle structs all use host-endian values, but will sometimes
stuff big-endian values into those fields. This is OK since these
values are opaque to the client, but it confuses sparse. Add __force to
make it clear that we are doing this intentionally.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton e2afc81919 nfsd: nfsd_splice_read and nfsd_readv should return __be32
The callers expect a __be32 return and the functions they call return
__be32, so having these return int is just wrong. Also, nfsd_finish_read
can be made static.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton b3d8d1284a nfsd: clean up sparse endianness warnings in nfscache.c
We currently hash the XID to determine a hash bucket to use for the
reply cache entry, which is fed into hash_32 without byte-swapping it.
Add __force to make sparse happy, and add some comments to explain
why.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Jeff Layton f419992c1f nfsd: add __force to opaque verifier field casts
sparse complains that we're stuffing non-byte-swapped values into
__be32's here. Since they're supposed to be opaque, it doesn't matter
much. Just add __force to make sparse happy.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:37 -04:00
Kinglong Mee bf18f163e8 NFSD: Using exp_get for export getting
Don't using cache_get besides export.h, using exp_get for export.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee 0da22a919d NFSD: Using path_get when assigning path for export
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee f15a5cf912 SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_ok
rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool.

Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Kinglong Mee 3c7aa15d20 NFSD: Using min/max/min_t/max_t for calculate
Signed-off-by: Kinglong Mee <kinglongmee@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2014-06-23 11:31:36 -04:00
Linus Torvalds 2dfded8210 File locking related bugfixes for v3.16 (pile #2)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTpiwZAAoJEAAOaEEZVoIVplsP/383a9q3eXonbsi+Ea8CGbRl
 tdjjVhM1OY4NZYFAoulILDt3HqPTC6MBnqKlHz+BuMziwd/1+3w8S4E7IEwm/KtM
 ghNYX8ct5Bf1nc5QEdDmwf4PX48QbRwTuT1uIcXaJ+KtxTzI9qN7mnjRN91TUtq4
 WRGvOl0AsGCVq8YqxjztgD3TYbu7AG/72Em+DE9f81PTArAPTo2ySc3gxPuJJAsg
 G1x46Gx46sfqFX2FY4SPsXen+J/67Og67y6eBawxnT2Bp6ZGDuW+jyPRmkhf0yth
 pWAtkUi3XmEe6kk6GHiICsS0Yn0RG4jbz39+Ja+X7jibQVJ8Iz6b+Optw9RNQwYt
 jDWHKFS2AaL/CDejHYOQ1shHcozpRojtIbDLIZ9vTNTQ2r5cdaBvkXMmQzdoktmN
 wQtQ9AzBl8fHOFOQeCAwd/ZfCLIotvLoLds3K/CSqmpsyK2+9IyriQLKKZ2xm6Iu
 8+UUspGQcNVwcMP6YWtI6G+u58/mVanmK6dtpiyXrncZLAfU4H7ETL2IPu8jJTbv
 kTFCOJtXQzNZa5Xqur1hIewOG9/RlvAZAnii0Ghc3nXTWCCeNeI8re3jw4g9KdRv
 33t4sYfJld8LQ1NSMqIDyAs+fvytmGurYt+uhVpb58G/4CqBLNpmmIGIQ6LFLMfs
 75FQnbAezrD0H/JyAHUk
 =AJD3
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux

Pull file locking fixes from Jeff Layton:
 "File locking related bugfixes

  Nothing too earth-shattering here.  A fix for a potential regression
  due to a patch in pile #1, and the addition of a memory barrier to
  prevent a race condition between break_deleg and generic_add_lease"

* tag 'locks-v3.16-2' of git://git.samba.org/jlayton/linux:
  locks: set fl_owner for leases back to current->files
  locks: add missing memory barrier in break_deleg
2014-06-21 16:40:30 -10:00
Linus Torvalds e13d100beb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This fixes some lockups in btrfs reported with rc1.  It probably has
  some performance impact because it is backing off our spinning locks
  more often and switching to a blocking lock.  I'll be able to nail
  that down next week, but for now I want to get the lockups taken care
  of.

  Otherwise some more stack reduction and assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix wrong error handle when the device is missing or is not writeable
  Btrfs: fix deadlock when mounting a degraded fs
  Btrfs: use bio_endio_nodec instead of open code
  Btrfs: fix NULL pointer crash when running balance and scrub concurrently
  btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
  Btrfs: fix broken free space cache after the system crashed
  Btrfs: make free space cache write out functions more readable
  Btrfs: remove unused wait queue in struct extent_buffer
  Btrfs: fix deadlocks with trylock on tree nodes
2014-06-21 14:21:43 -10:00
Linus Torvalds 147f1404db Merge branch 'for-3.16' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "Fixes for a new regression from the xdr encoding rewrite, and a
  delegation problem we've had for a while (made somewhat more annoying
  by the vfs delegation support added in 3.13)"

* 'for-3.16' of git://linux-nfs.org/~bfields/linux:
  NFSD: fix bug for readdir of pseudofs
  NFSD: Don't hand out delegations for 30 seconds after recalling them.
2014-06-21 14:20:38 -10:00
Miao Xie 8408c716d7 Btrfs: fix wrong error handle when the device is missing or is not writeable
The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie c55f139640 Btrfs: fix deadlock when mounting a degraded fs
The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
 # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
 # echo 1 > /sys/block/`basename <dev0>`/device/delete
 # mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:56 -07:00
Miao Xie e990f16763 Btrfs: use bio_endio_nodec instead of open code
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Wang Shilong 298a8f9cf1 Btrfs: fix NULL pointer crash when running balance and scrub concurrently
While running balance, scrub, fsstress concurrently we hit the
following kernel crash:

[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395]  RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078

This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.

Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.

Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:55 -07:00
Qu Wenruo ced96edc48 btrfs: Skip scrubbing removed chunks to avoid -ENOENT.
When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie e570fd27f2 Btrfs: fix broken free space cache after the system crashed
When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Miao Xie 5349d6c3ff Btrfs: make free space cache write out functions more readable
This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:54 -07:00
Filipe Manana 46fefe41b5 Btrfs: remove unused wait queue in struct extent_buffer
The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-06-19 14:20:28 -07:00