1
0
Fork 0
Commit Graph

3301 Commits (b349acc76b7f65400b85abd09a5379ddd6fa5a97)

Author SHA1 Message Date
fanchaoting 4376c94618 pnfs-block: removing DM device maybe cause oops when call dev_remove
when pnfs block using device mapper,if umounting later,it maybe
cause oops. we apply "1 + sizeof(bl_umount_request)" memory for
msg->data, the memory maybe overflow when we do "memcpy(&dataptr
[sizeof(bl_msg)], &bl_umount_request, sizeof(bl_umount_request))",
because the size of bl_msg is more than 1 byte.

Signed-off-by: fanchaoting<fanchaoting@cn.fujitsu.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-03-21 10:11:06 -04:00
Trond Myklebust cf4ab538f1 NFSv4: Fix the string length returned by the idmapper
Functions like nfs_map_uid_to_name() and nfs_map_gid_to_group() are
expected to return a string without any terminating NUL character.
Regression introduced by commit 57e62324e4
(NFS: Store the legacy idmapper result in the keyring).

Reported-by: Dave Chiluk <dave.chiluk@canonical.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org [>=3.4]
2013-03-20 16:45:16 -04:00
Eric W. Biederman fa7614ddd6 fs: Readd the fs module aliases.
I had assumed that the only use of module aliases for filesystems
prior to "fs: Limit sys_mount to only request filesystem modules."
was in request_module.  It turns out I was wrong.  At least mkinitcpio
in Arch linux uses these aliases.

So readd the preexising aliases, to keep from breaking userspace.

Userspace eventually will have to follow and use the same aliases the
kernel does.  So at some point we may be delete these aliases without
problems.  However that day is not today.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-12 18:55:21 -07:00
Eric W. Biederman 7f78e03513 fs: Limit sys_mount to only request filesystem modules.
Modify the request_module to prefix the file system type with "fs-"
and add aliases to all of the filesystems that can be built as modules
to match.

A common practice is to build all of the kernel code and leave code
that is not commonly needed as modules, with the result that many
users are exposed to any bug anywhere in the kernel.

Looking for filesystems with a fs- prefix limits the pool of possible
modules that can be loaded by mount to just filesystems trivially
making things safer with no real cost.

Using aliases means user space can control the policy of which
filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf
with blacklist and alias directives.  Allowing simple, safe,
well understood work-arounds to known problematic software.

This also addresses a rare but unfortunate problem where the filesystem
name is not the same as it's module name and module auto-loading
would not work.  While writing this patch I saw a handful of such
cases.  The most significant being autofs that lives in the module
autofs4.

This is relevant to user namespaces because we can reach the request
module in get_fs_type() without having any special permissions, and
people get uncomfortable when a user specified string (in this case
the filesystem type) goes all of the way to request_module.

After having looked at this issue I don't think there is any
particular reason to perform any filtering or permission checks beyond
making it clear in the module request that we want a filesystem
module.  The common pattern in the kernel is to call request_module()
without regards to the users permissions.  In general all a filesystem
module does once loaded is call register_filesystem() and go to sleep.
Which means there is not much attack surface exposed by loading a
filesytem module unless the filesystem is mounted.  In a user
namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT,
which most filesystems do not set today.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Kees Cook <keescook@chromium.org>
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-03-03 19:36:31 -08:00
Linus Torvalds 8d05b3771d NFS client bugfixes for Linux 3.9
- Don't allow NFS silly-renamed files to be deleted
 - Don't start the retransmission timer when out of socket space
 - Fix a couple of pnfs-related Oopses.
 - Fix one more NFSv4 state recovery deadlock
 - Don't loop forever when LAYOUTGET returns NFS4ERR_LAYOUTTRYLATER
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJRMpMhAAoJEGcL54qWCgDy4BMP/0Zl7Ei7x9bJSb1C1lpPSo5p
 Lr9XoHLYqhPcAwKUXQfgM5IkC69bE62bD5esmdDqkgZYqnmGE0E4LG6MsbsMmvzk
 yug5WOxmjOFee7Bdpd8B86Z0qsa4l2TkQu2h9G3zE36P2rPKQaNzpteIjhis5UEQ
 EfNyLoBdFcuUSh4ztMVZOzbAyDcbNfsyl03XVmlv+Qn/o0l42Zjth0qwOP60bjuM
 zJF1CkHi5NLbXEhmOev9mA6UYz6zWRbiA/Yu92pomtXVDtOtzWpUniBIcf/S1ZH/
 V8Gj6bWdHHyFCa2PjhY1/QdLBOPRPdxpAAJk+q48AKmzyiOU6g3lIHBp5ai1WZNI
 1C+SYxABE/EJgq9SoQYGqq6SUiolrFulqnFHXF0jHF+ifdjoHjSRmpGQAoyoZ0k1
 aSl+Ojqx7QHibJd8GZBavWc3upRDzhHDRRB3tkQCENi+hryBZxEyeS2Z54NmBRUN
 tsOuyac6rtknZdD8Do4DMt9uc9u1DWicaiZbLfkP1VL1Angh6NKSA7qbmH6giLBS
 9Y+DPcIk5e34uKQ21WTxFydGD+SMg0EMnOmfr6EYXWEHBhKNYVR+cHyH0mAF6RzX
 enU2g0H2m+3vUQqajPUP0DV/eLGtdsvWvMjiskc3KX90CWfHmV2C8GFSxjV2OkT1
 vG1KFrICO6DR2943Udit
 =FMtb
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "We've just concluded another Connectathon interoperability testing
  week, and so here are the fixes for the bugs that were discovered:

   - Don't allow NFS silly-renamed files to be deleted
   - Don't start the retransmission timer when out of socket space
   - Fix a couple of pnfs-related Oopses.
   - Fix one more NFSv4 state recovery deadlock
   - Don't loop forever when LAYOUTGET returns NFS4ERR_LAYOUTTRYLATER"

* tag 'nfs-for-3.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: One line comment fix
  NFSv4.1: LAYOUTGET EDELAY loops timeout to the MDS
  SUNRPC: add call to get configured timeout
  PNFS: set the default DS timeout to 60 seconds
  NFSv4: Fix another open/open_recovery deadlock
  nfs: don't allow nfs_find_actor to match inodes of the wrong type
  NFSv4.1: Hold reference to layout hdr in layoutget
  pnfs: fix resend_to_mds for directio
  SUNRPC: Don't start the retransmission timer when out of socket space
  NFS: Don't allow NFS silly-renamed files to be deleted, no signal
2013-03-02 16:46:07 -08:00
Linus Torvalds b6669737d3 Merge branch 'for-3.9' of git://linux-nfs.org/~bfields/linux
Pull nfsd changes from J Bruce Fields:
 "Miscellaneous bugfixes, plus:

   - An overhaul of the DRC cache by Jeff Layton.  The main effect is
     just to make it larger.  This decreases the chances of intermittent
     errors especially in the UDP case.  But we'll need to watch for any
     reports of performance regressions.

   - Containerized nfsd: with some limitations, we now support
     per-container nfs-service, thanks to extensive work from Stanislav
     Kinsbursky over the last year."

Some notes about conflicts, since there were *two* non-data semantic
conflicts here:

 - idr_remove_all() had been added by a memory leak fix, but has since
   become deprecated since idr_destroy() does it for us now.

 - xs_local_connect() had been added by this branch to make AF_LOCAL
   connections be synchronous, but in the meantime Trond had changed the
   calling convention in order to avoid a RCU dereference.

There were a couple of more obvious actual source-level conflicts due to
the hlist traversal changes and one just due to code changes next to
each other, but those were trivial.

* 'for-3.9' of git://linux-nfs.org/~bfields/linux: (49 commits)
  SUNRPC: make AF_LOCAL connect synchronous
  nfsd: fix compiler warning about ambiguous types in nfsd_cache_csum
  svcrpc: fix rpc server shutdown races
  svcrpc: make svc_age_temp_xprts enqueue under sv_lock
  lockd: nlmclnt_reclaim(): avoid stack overflow
  nfsd: enable NFSv4 state in containers
  nfsd: disable usermode helper client tracker in container
  nfsd: use proper net while reading "exports" file
  nfsd: containerize NFSd filesystem
  nfsd: fix comments on nfsd_cache_lookup
  SUNRPC: move cache_detail->cache_request callback call to cache_read()
  SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() function
  SUNRPC: rework cache upcall logic
  SUNRPC: introduce cache_detail->cache_request callback
  NFS: simplify and clean cache library
  NFS: use SUNRPC cache creation and destruction helper for DNS cache
  nfsd4: free_stid can be static
  nfsd: keep a checksum of the first 256 bytes of request
  sunrpc: trim off trailing checksum before returning decrypted or integrity authenticated buffer
  sunrpc: fix comment in struct xdr_buf definition
  ...
2013-02-28 18:02:55 -08:00
Weston Andros Adamson 3000512137 NFSv4.1: LAYOUTGET EDELAY loops timeout to the MDS
The client will currently try LAYOUTGETs forever if a server is returning
NFS4ERR_LAYOUTTRYLATER or NFS4ERR_RECALLCONFLICT - even if the client no
longer needs the layout (ie process killed, unmounted).

This patch uses the DS timeout value (module parameter 'dataserver_timeo'
via rpc layer) to set an upper limit of how long the client tries LATOUTGETs
in this situation.  Once the timeout is reached, IO is redirected to the MDS.

This also changes how the client checks if a layout is on the clp list
to avoid a double list_add.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-28 17:41:35 -08:00
Weston Andros Adamson eb97cbb459 PNFS: set the default DS timeout to 60 seconds
The client should have 60 second default timeouts for DS operations, not 6
seconds.

NFS4_DEF_DS_TIMEO is used as "timeout in tenths of a second" in
nfs_init_timeout_values (and is not used anywhere else).
This matches up with the description of the module param dataserver_timeo.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-28 17:35:00 -08:00
Trond Myklebust 7aa262b522 NFSv4: Fix another open/open_recovery deadlock
If we don't release the open seqid before we wait for state recovery,
then we may end up deadlocking the state recovery thread.
This patch addresses a new deadlock that was introduced by
commit c21443c2c7 (NFSv4: Fix a reboot
recovery race when opening a file)

Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-28 16:19:59 -08:00
Sasha Levin b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Tejun Heo d687031265 nfs4client: convert to idr_alloc()
Convert to the much saner new idr interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:20 -08:00
Tejun Heo d97bec801d nfs: idr_destroy() no longer needs idr_remove_all()
idr_destroy() can destroy idr by itself and idr_remove_all() is being
deprecated.  Drop reference to idr_remove_all().  Note that the code
wasn't completely correct before because idr_remove() on all entries
doesn't necessarily release all idr_layers which could lead to memory
leak.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:14 -08:00
Jeff Layton f6488c9ba5 nfs: don't allow nfs_find_actor to match inodes of the wrong type
Benny Halevy reported the following oops when testing RHEL6:

<7>nfs_update_inode: inode 892950 mode changed, 0040755 to 0100644
<1>BUG: unable to handle kernel NULL pointer dereference at (null)
<1>IP: [<ffffffffa02a52c5>] nfs_closedir+0x15/0x30 [nfs]
<4>PGD 81448a067 PUD 831632067 PMD 0
<4>Oops: 0000 [#1] SMP
<4>last sysfs file: /sys/kernel/mm/redhat_transparent_hugepage/enabled
<4>CPU 6
<4>Modules linked in: fuse bonding 8021q garp ebtable_nat ebtables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi softdog bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_round_robin dm_multipath objlayoutdriver2(U) nfs(U) lockd fscache auth_rpcgss nfs_acl sunrpc vhost_net macvtap macvlan tun kvm_intel kvm be2net igb dca ptp pps_core microcode serio_raw sg iTCO_wdt iTCO_vendor_support i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ahci dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 6332, comm: dd Not tainted 2.6.32-358.el6.x86_64 #1 HP ProLiant DL170e G6  /ProLiant DL170e G6
<4>RIP: 0010:[<ffffffffa02a52c5>]  [<ffffffffa02a52c5>] nfs_closedir+0x15/0x30 [nfs]
<4>RSP: 0018:ffff88081458bb98  EFLAGS: 00010292
<4>RAX: ffffffffa02a52b0 RBX: 0000000000000000 RCX: 0000000000000003
<4>RDX: ffffffffa02e45a0 RSI: ffff88081440b300 RDI: ffff88082d5f5760
<4>RBP: ffff88081458bba8 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 0000000000000772 R11: 0000000000400004 R12: 0000000040000008
<4>R13: ffff88082d5f5760 R14: ffff88082d6e8800 R15: ffff88082f12d780
<4>FS:  00007f728f37e700(0000) GS:ffff8800456c0000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 0000000000000000 CR3: 0000000831279000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process dd (pid: 6332, threadinfo ffff88081458a000, task ffff88082fa0e040)
<4>Stack:
<4> 0000000040000008 ffff88081440b300 ffff88081458bbf8 ffffffff81182745
<4><d> ffff88082d5f5760 ffff88082d6e8800 ffff88081458bbf8 ffffffffffffffea
<4><d> ffff88082f12d780 ffff88082d6e8800 ffffffffa02a50a0 ffff88082d5f5760
<4>Call Trace:
<4> [<ffffffff81182745>] __fput+0xf5/0x210
<4> [<ffffffffa02a50a0>] ? do_open+0x0/0x20 [nfs]
<4> [<ffffffff81182885>] fput+0x25/0x30
<4> [<ffffffff8117e23e>] __dentry_open+0x27e/0x360
<4> [<ffffffff811c397a>] ? inotify_d_instantiate+0x2a/0x60
<4> [<ffffffff8117e4b9>] lookup_instantiate_filp+0x69/0x90
<4> [<ffffffffa02a6679>] nfs_intent_set_file+0x59/0x90 [nfs]
<4> [<ffffffffa02a686b>] nfs_atomic_lookup+0x1bb/0x310 [nfs]
<4> [<ffffffff8118e0c2>] __lookup_hash+0x102/0x160
<4> [<ffffffff81225052>] ? selinux_inode_permission+0x72/0xb0
<4> [<ffffffff8118e76a>] lookup_hash+0x3a/0x50
<4> [<ffffffff81192a4b>] do_filp_open+0x2eb/0xdd0
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff8119f562>] ? alloc_fd+0x92/0x160
<4> [<ffffffff8117de79>] do_sys_open+0x69/0x140
<4> [<ffffffff811811f6>] ? sys_lseek+0x66/0x80
<4> [<ffffffff8117df90>] sys_open+0x20/0x30
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 65 48 8b 04 25 c8 cb 00 00 83 a8 44 e0 ff ff 01 5b 41 5c c9 c3 90 55 48 89 e5 53 48 83 ec 08 0f 1f 44 00 00 48 8b 9e a0 00 00 00 <48> 8b 3b e8 13 0c f7 ff 48 89 df e8 ab 3d ec e0 48 83 c4 08 31
<1>RIP  [<ffffffffa02a52c5>] nfs_closedir+0x15/0x30 [nfs]
<4> RSP <ffff88081458bb98>
<4>CR2: 0000000000000000

I think this is ultimately due to a bug on the server. The client had
previously found a directory dentry. It then later tried to do an atomic
open on a new (regular file) dentry. The attributes it got back had the
same filehandle as the previously found directory inode. It then tried
to put the filp because it failed the aops tests for O_DIRECT opens, and
oopsed here because the ctx was still NULL.

Obviously the root cause here is a server issue, but we can take steps
to mitigate this on the client. When nfs_fhget is called, we always know
what type of inode it is. In the event that there's a broken or
malicious server on the other end of the wire, the client can end up
crashing because the wrong ops are set on it.

Have nfs_find_actor check that the inode type is correct after checking
the fileid. The fileid check should rarely ever match, so it should only
rarely ever get to this check. In the case where we have a broken
server, we may see two different inodes with the same i_ino, but the
client should be able to cope with them without crashing.

This should fix the oops reported here:

    https://bugzilla.redhat.com/show_bug.cgi?id=913660

Reported-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-27 17:28:20 -08:00
Linus Torvalds d895cb1af1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile (part one) from Al Viro:
 "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
  locking violations, etc.

  The most visible changes here are death of FS_REVAL_DOT (replaced with
  "has ->d_weak_revalidate()") and a new helper getting from struct file
  to inode.  Some bits of preparation to xattr method interface changes.

  Misc patches by various people sent this cycle *and* ocfs2 fixes from
  several cycles ago that should've been upstream right then.

  PS: the next vfs pile will be xattr stuff."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
  saner proc_get_inode() calling conventions
  proc: avoid extra pde_put() in proc_fill_super()
  fs: change return values from -EACCES to -EPERM
  fs/exec.c: make bprm_mm_init() static
  ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
  ocfs2: fix possible use-after-free with AIO
  ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
  get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
  target: writev() on single-element vector is pointless
  export kernel_write(), convert open-coded instances
  fs: encode_fh: return FILEID_INVALID if invalid fid_type
  kill f_vfsmnt
  vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
  nfsd: handle vfs_getattr errors in acl protocol
  switch vfs_getattr() to struct path
  default SET_PERSONALITY() in linux/elf.h
  ceph: prepopulate inodes only when request is aborted
  d_hash_and_lookup(): export, switch open-coded instances
  9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
  9p: split dropping the acls from v9fs_set_create_acl()
  ...
2013-02-26 20:16:07 -08:00
Jeff Layton ecf3d1f1aa vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
The following set of operations on a NFS client and server will cause

    server# mkdir a
    client# cd a
    server# mv a a.bak
    client# sleep 30  # (or whatever the dir attrcache timeout is)
    client# stat .
    stat: cannot stat `.': Stale NFS file handle

Obviously, we should not be getting an ESTALE error back there since the
inode still exists on the server. The problem is that the lookup code
will call d_revalidate on the dentry that "." refers to, because NFS has
FS_REVAL_DOT set.

nfs_lookup_revalidate will see that the parent directory has changed and
will try to reverify the dentry by redoing a LOOKUP. That of course
fails, so the lookup code returns ESTALE.

The problem here is that d_revalidate is really a bad fit for this case.
What we really want to know at this point is whether the inode is still
good or not, but we don't really care what name it goes by or whether
the dcache is still valid.

Add a new d_op->d_weak_revalidate operation and have complete_walk call
that instead of d_revalidate. The intent there is to allow for a
"weaker" d_revalidate that just checks to see whether the inode is still
good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
special casing.

[AV: changed method name, added note in porting, fixed confusion re
having it possibly called from RCU mode (it won't be)]

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:09 -05:00
Weston Andros Adamson a47970ff78 NFSv4.1: Hold reference to layout hdr in layoutget
This fixes an oops where a LAYOUTGET is in still in the rpciod queue,
but the requesting processes has been killed.  Without this, killing
the process does the final pnfs_put_layout_hdr() and sets NFS_I(inode)->layout
to NULL while the LAYOUTGET rpc task still references it.

Example oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: [<ffffffffa01bd586>] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
PGD 7365b067 PUD 7365d067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: nfs_layout_nfsv41_files nfsv4 auth_rpcgss nfs lockd sunrpc ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle ip6table_filter ip6_tables ppdev e1000 i2c_piix4 i2c_core shpchp parport_pc parport crc32c_intel aesni_intel xts aes_x86_64 lrw gf128mul ablk_helper cryptd mptspi scsi_transport_spi mptscsih mptbase floppy autofs4
CPU 0
Pid: 27, comm: kworker/0:1 Not tainted 3.8.0-dros_cthon2013+ #4 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa01bd586>]  [<ffffffffa01bd586>] pnfs_choose_layoutget_stateid+0x37/0xef [nfsv4]
RSP: 0018:ffff88007b0c1c88  EFLAGS: 00010246
RAX: ffff88006ed36678 RBX: 0000000000000000 RCX: 0000000ea877e3bc
RDX: ffff88007a729da8 RSI: 0000000000000000 RDI: ffff88007a72b958
RBP: ffff88007b0c1ca8 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007a72b958
R13: ffff88007a729da8 R14: 0000000000000000 R15: ffffffffa011077e
FS:  0000000000000000(0000) GS:ffff88007f600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000080 CR3: 00000000735f8000 CR4: 00000000001407f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:1 (pid: 27, threadinfo ffff88007b0c0000, task ffff88007c2fa0c0)
Stack:
 ffff88006fc05388 ffff88007a72b908 ffff88007b240900 ffff88006fc05388
 ffff88007b0c1cd8 ffffffffa01a2170 ffff88007b240900 ffff88007b240900
 ffff88007b240970 ffffffffa011077e ffff88007b0c1ce8 ffffffffa0110791
Call Trace:
 [<ffffffffa01a2170>] nfs4_layoutget_prepare+0x7b/0x92 [nfsv4]
 [<ffffffffa011077e>] ? __rpc_atrun+0x15/0x15 [sunrpc]
 [<ffffffffa0110791>] rpc_prepare_task+0x13/0x15 [sunrpc]

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-25 18:32:59 -08:00
Linus Torvalds 94f2f14234 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace and namespace infrastructure changes from Eric W Biederman:
 "This set of changes starts with a few small enhnacements to the user
  namespace.  reboot support, allowing more arbitrary mappings, and
  support for mounting devpts, ramfs, tmpfs, and mqueuefs as just the
  user namespace root.

  I do my best to document that if you care about limiting your
  unprivileged users that when you have the user namespace support
  enabled you will need to enable memory control groups.

  There is a minor bug fix to prevent overflowing the stack if someone
  creates way too many user namespaces.

  The bulk of the changes are a continuation of the kuid/kgid push down
  work through the filesystems.  These changes make using uids and gids
  typesafe which ensures that these filesystems are safe to use when
  multiple user namespaces are in use.  The filesystems converted for
  3.9 are ceph, 9p, afs, ocfs2, gfs2, ncpfs, nfs, nfsd, and cifs.  The
  changes for these filesystems were a little more involved so I split
  the changes into smaller hopefully obviously correct changes.

  XFS is the only filesystem that remains.  I was hoping I could get
  that in this release so that user namespace support would be enabled
  with an allyesconfig or an allmodconfig but it looks like the xfs
  changes need another couple of days before it they are ready."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (93 commits)
  cifs: Enable building with user namespaces enabled.
  cifs: Convert struct cifs_ses to use a kuid_t and a kgid_t
  cifs: Convert struct cifs_sb_info to use kuids and kgids
  cifs: Modify struct smb_vol to use kuids and kgids
  cifs: Convert struct cifsFileInfo to use a kuid
  cifs: Convert struct cifs_fattr to use kuid and kgids
  cifs: Convert struct tcon_link to use a kuid.
  cifs: Modify struct cifs_unix_set_info_args to hold a kuid_t and a kgid_t
  cifs: Convert from a kuid before printing current_fsuid
  cifs: Use kuids and kgids SID to uid/gid mapping
  cifs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring_alloc
  cifs: Use BUILD_BUG_ON to validate uids and gids are the same size
  cifs: Override unmappable incoming uids and gids
  nfsd: Enable building with user namespaces enabled.
  nfsd: Properly compare and initialize kuids and kgids
  nfsd: Store ex_anon_uid and ex_anon_gid as kuids and kgids
  nfsd: Modify nfsd4_cb_sec to use kuids and kgids
  nfsd: Handle kuids and kgids in the nfs4acl to posix_acl conversion
  nfsd: Convert nfsxdr to use kuids and kgids
  nfsd: Convert nfs3xdr to use kuids and kgids
  ...
2013-02-25 16:00:49 -08:00
Benny Halevy 78f33277f9 pnfs: fix resend_to_mds for directio
Pass the directio request on pageio_init to clean up the API.

Percolate pg_dreq from original nfs_pageio_descriptor to the
pnfs_{read,write}_done_resend_to_mds and use it on respective
call to nfs_pageio_init_{read,write} on the newly created
nfs_pageio_descriptor.

Reproduced by command:
 mount -o vers=4.1 server:/ /mnt
 dd bs=128k count=8 if=/dev/zero of=/mnt/dd.out oflag=direct

BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa021a3a8>] atomic_inc+0x4/0x9 [nfs]
PGD 34786067 PUD 34794067 PMD 0
Oops: 0002 [#1] SMP
Modules linked in: nfs_layout_nfsv41_files nfsv4 nfs nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc btrfs zlib_deflate libcrc32c ipv6 autofs4
CPU 1
Pid: 259, comm: kworker/1:2 Not tainted 3.8.0-rc6 #2 Bochs Bochs
RIP: 0010:[<ffffffffa021a3a8>]  [<ffffffffa021a3a8>] atomic_inc+0x4/0x9 [nfs]
RSP: 0018:ffff880038f8fa68  EFLAGS: 00010206
RAX: ffffffffa021a6a9 RBX: ffff880038f8fb48 RCX: 00000000000a0000
RDX: ffffffffa021e616 RSI: ffff8800385e9a40 RDI: 0000000000000028
RBP: ffff880038f8fa68 R08: ffffffff81ad6720 R09: ffff8800385e9510
R10: ffffffffa0228450 R11: ffff880038e87418 R12: ffff8800385e9a40
R13: ffff8800385e9a70 R14: ffff880038f8fb38 R15: ffffffffa0148878
FS:  0000000000000000(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 0000000034789000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/1:2 (pid: 259, threadinfo ffff880038f8e000, task ffff880038302480)
Stack:
 ffff880038f8fa78 ffffffffa021a6bf ffff880038f8fa88 ffffffffa021bb82
 ffff880038f8fae8 ffffffffa021f454 ffff880038f8fae8 ffffffff8109689d
 ffff880038f8fab8 ffffffff00000006 0000000000000000 ffff880038f8fb48
Call Trace:
 [<ffffffffa021a6bf>] nfs_direct_pgio_init+0x16/0x18 [nfs]
 [<ffffffffa021bb82>] nfs_pgheader_init+0x6a/0x6c [nfs]
 [<ffffffffa021f454>] nfs_generic_pg_writepages+0x51/0xf8 [nfs]
 [<ffffffff8109689d>] ? mark_held_locks+0x71/0x99
 [<ffffffffa0148878>] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
 [<ffffffffa021bc25>] nfs_pageio_doio+0x1a/0x43 [nfs]
 [<ffffffffa021be7c>] nfs_pageio_complete+0x16/0x2c [nfs]
 [<ffffffffa02608be>] pnfs_write_done_resend_to_mds+0x95/0xc5 [nfsv4]
 [<ffffffffa0148878>] ? rpc_release_resources_task+0x37/0x37 [sunrpc]
 [<ffffffffa028e27f>] filelayout_reset_write+0x8c/0x99 [nfs_layout_nfsv41_files]
 [<ffffffffa028e5f9>] filelayout_write_done_cb+0x4d/0xc1 [nfs_layout_nfsv41_files]
 [<ffffffffa024587a>] nfs4_write_done+0x36/0x49 [nfsv4]
 [<ffffffffa021f996>] nfs_writeback_done+0x53/0x1cc [nfs]
 [<ffffffffa021fb1d>] nfs_writeback_done_common+0xe/0x10 [nfs]
 [<ffffffffa028e03d>] filelayout_write_call_done+0x28/0x2a [nfs_layout_nfsv41_files]
 [<ffffffffa01488a1>] rpc_exit_task+0x29/0x87 [sunrpc]
 [<ffffffffa014a0c9>] __rpc_execute+0x11d/0x3cc [sunrpc]
 [<ffffffff810969dc>] ? trace_hardirqs_on_caller+0x117/0x173
 [<ffffffffa014a39f>] rpc_async_schedule+0x27/0x32 [sunrpc]
 [<ffffffffa014a378>] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
 [<ffffffff8105f8c1>] process_one_work+0x226/0x422
 [<ffffffff8105f7f4>] ? process_one_work+0x159/0x422
 [<ffffffff81094757>] ? lock_acquired+0x210/0x249
 [<ffffffffa014a378>] ? __rpc_execute+0x3cc/0x3cc [sunrpc]
 [<ffffffff810600d8>] worker_thread+0x126/0x1c4
 [<ffffffff8105ffb2>] ? manage_workers+0x240/0x240
 [<ffffffff81064ef8>] kthread+0xb1/0xb9
 [<ffffffff81064e47>] ? __kthread_parkme+0x65/0x65
 [<ffffffff815206ec>] ret_from_fork+0x7c/0xb0
 [<ffffffff81064e47>] ? __kthread_parkme+0x65/0x65
Code: 00 83 38 02 74 12 48 81 4b 50 00 00 01 00 c7 83 60 07 00 00 01 00 00 00 48 89 df e8 55 fe ff ff 5b 41 5c 5d c3 66 90 55 48 89 e5 <f0> ff 07 5d c3 55 48 89 e5 f0 ff 0f 0f 94 c0 84 c0 0f 95 c0 0f
RIP  [<ffffffffa021a3a8>] atomic_inc+0x4/0x9 [nfs]
 RSP <ffff880038f8fa68>
CR2: 0000000000000028

Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@kernel.org [>= 3.6]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-24 10:07:36 -05:00
Al Viro 496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Trond Myklebust 5a7a613a47 NFS: Don't allow NFS silly-renamed files to be deleted, no signal
Commit 73ca100 broke the code that prevents the client from deleting
a silly renamed dentry.  This affected "delete on last close"
semantics as after that commit, nothing prevented removal of
silly-renamed files.  As a result, a process holding a file open
could easily get an ESTALE on the file in a directory where some
other process issued 'rm -rf some_dir_containing_the_file' twice.
Before the commit, any attempt at unlinking silly renamed files would
fail inside may_delete() with -EBUSY because of the
DCACHE_NFSFS_RENAMED flag.  The following testcase demonstrates
the problem:
  tail -f /nfsmnt/dir/file &
  rm -rf /nfsmnt/dir
  rm -rf /nfsmnt/dir
  # second removal does not fail, 'tail' process receives ESTALE

The problem with the above commit is that it unhashes the old and
new dentries from the lookup path, even in the normal case when
a signal is not encountered and it would have been safe to call
d_move.  Unfortunately the old dentry has the special
DCACHE_NFSFS_RENAMED flag set on it.  Unhashing has the
side-effect that future lookups call d_alloc(), allocating a new
dentry without the special flag for any silly-renamed files.  As a
result, subsequent calls to unlink silly renamed files do not fail
but allow the removal to go through.  This will result in ESTALE
errors for any other process doing operations on the file.

To fix this, go back to using d_move on success.
For the signal case, it's unclear what we may safely do beyond d_drop.

Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org
2013-02-22 14:55:34 -05:00
fanchaoting 5a12cca697 umount oops when remove blocklayoutdriver first
now pnfs client uses block layout, maybe we can remove
blocklayoutdriver first. if we umount later,
it can cause oops in unset_pnfs_layoutdriver.
because nfss->pnfs_curr_ld->clear_layoutdriver is invalid.

reproduce it:
 modprobe  blocklayoutdriver
 mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/
 rmmod blocklayoutdriver
 umount /mnt

then you can see following

CPU 0
Pid: 17023, comm: umount.nfs4 Tainted: GF          O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffffa04cfe6d>]  [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP: 0018:ffff8800022d9e48  EFLAGS: 00010286
RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001
RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800
RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400
R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550
FS:  00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0)
Stack:
ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce
ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7
ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400
Call Trace:
[<ffffffffa04cd0ce>] nfs4_destroy_server+0x1e/0x30 [nfsv4]
[<ffffffffa04755a7>] nfs_free_server+0xb7/0x150 [nfs]
[<ffffffffa047d4d5>] nfs_kill_super+0x35/0x40 [nfs]
[<ffffffff81178d35>] deactivate_locked_super+0x45/0x70
[<ffffffff8117986a>] deactivate_super+0x4a/0x70
[<ffffffff81193ee2>] mntput_no_expire+0xd2/0x130
[<ffffffff81194d62>] sys_umount+0x72/0xe0
[<ffffffff8154af59>] system_call_fastpath+0x16/0x1b
Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 <48> 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2
RIP  [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4]
RSP <ffff8800022d9e48>
CR2: ffffffffa04a1b38
---[ end trace 29f75aaedda058bf ]---

Signed-off-by: fanchaoting<fanchaoting@cn.fujitsu.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-17 15:40:15 -05:00
Tim Gardner 96aa1549af nfs: remove kfree() redundant null checks
smatch analysis:

fs/nfs/getroot.c:130 nfs_get_root() info: redundant null
 check on name calling kfree()

fs/nfs/unlink.c:272 nfs_async_unlink() info: redundant null
 check on devname_garbage calling kfree()

Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-17 15:27:21 -05:00
Weston Andros Adamson 085b7a45c6 NFSv4.1: Don't decode skipped layoutgets
layoutget's prepare hook can call rpc_exit with status = NFS4_OK (0).
Because of this, nfs4_proc_layoutget can't depend on a 0 status to mean
that the RPC was successfully sent, received and parsed.

To fix this, use the result's len member to see if parsing took place.

This fixes the following OOPS -- calling xdr_init_decode() with a buffer length
0 doesn't set the stream's 'p' member and ends up using uninitialized memory
in filelayout_decode_layout.

BUG: unable to handle kernel paging request at 0000000000008050
IP: [<ffffffff81282e78>] memcpy+0x18/0x120
PGD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:11.0/0000:02:01.0/irq
CPU 1
Modules linked in: nfs_layout_nfsv41_files nfs lockd fscache auth_rpcgss nfs_acl autofs4 sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 dm_mirror dm_region_hash dm_log dm_mod ppdev parport_pc parport snd_ens1371 snd_rawmidi snd_ac97_codec ac97_bus snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc e1000 microcode vmware_balloon i2c_piix4 i2c_core sg shpchp ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix mptspi mptscsih mptbase scsi_transport_spi [last unloaded: speedstep_lib]

Pid: 1665, comm: flush-0:22 Not tainted 2.6.32-356-test-2 #2 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
RIP: 0010:[<ffffffff81282e78>]  [<ffffffff81282e78>] memcpy+0x18/0x120
RSP: 0018:ffff88003dfab588  EFLAGS: 00010206
RAX: ffff88003dc42000 RBX: ffff88003dfab610 RCX: 0000000000000009
RDX: 000000003f807ff0 RSI: 0000000000008050 RDI: ffff88003dc42000
RBP: ffff88003dfab5b0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000080 R12: 0000000000000024
R13: ffff88003dc42000 R14: ffff88003f808030 R15: ffff88003dfab6a0
FS:  0000000000000000(0000) GS:ffff880003420000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000008050 CR3: 000000003bc92000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process flush-0:22 (pid: 1665, threadinfo ffff88003dfaa000, task ffff880037f77540)
Stack:
ffffffffa0398ac1 ffff8800397c5940 ffff88003dfab610 ffff88003dfab6a0
<d> ffff88003dfab5d0 ffff88003dfab680 ffffffffa01c150b ffffea0000d82e70
<d> 000000508116713b 0000000000000000 0000000000000000 0000000000000000
Call Trace:
[<ffffffffa0398ac1>] ? xdr_inline_decode+0xb1/0x120 [sunrpc]
[<ffffffffa01c150b>] filelayout_decode_layout+0xeb/0x350 [nfs_layout_nfsv41_files]
[<ffffffffa01c17fc>] filelayout_alloc_lseg+0x8c/0x3c0 [nfs_layout_nfsv41_files]
[<ffffffff8150e6ce>] ? __wait_on_bit+0x7e/0x90

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-17 15:24:16 -05:00
Stanislav Kinsbursky 21cd1254d3 SUNRPC: remove "cache_request" argument in sunrpc_cache_pipe_upcall() function
Passing this pointer is redundant since it's stored on cache_detail structure,
which is also passed to sunrpc_cache_pipe_upcall () function.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-15 10:43:47 -05:00
Stanislav Kinsbursky 73fb847a44 SUNRPC: introduce cache_detail->cache_request callback
This callback will allow to simplify upcalls in further patches in this
series.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-15 10:43:45 -05:00
Stanislav Kinsbursky 462b8f6bf1 NFS: simplify and clean cache library
This is a cleanup patch.
Such helpers like nfs_cache_init() and nfs_cache_destroy() are redundant,
because they are just a wrappers around sunrpc_init_cache_detail() and
sunrpc_destroy_cache_detail() respectively.
So let's remove them completely and move corresponding logic to
nfs_cache_register_net() and nfs_cache_unregister_net() respectively (since
they are called together anyway).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-15 10:43:36 -05:00
Stanislav Kinsbursky 483479c26a NFS: use SUNRPC cache creation and destruction helper for DNS cache
This cache was the first containerized and doesn't use net-aware cache
creation and destruction helpers.
This is a cleanup patch which just makes code looks clearer and reduce amount
of lines of code.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-15 10:43:12 -05:00
Trond Myklebust fd9a8d7160 NFSv4.1: Fix bulk recall and destroy of layouts
The current code in pnfs_destroy_all_layouts() assumes that removing
the layout from the server->layouts list is sufficient to make it
invisible to other processes. This ignores the fact that most
users access the layout through the nfs_inode->layout...
There is further breakage due to lack of reference counting of the
layouts, meaning that the whole thing Oopses at the drop of a hat.

The code in initiate_bulk_draining() is almost correct, and can be
used as a model for pnfs_destroy_all_layouts(), so move that
code to pnfs.c, and refactor the code to allow us to choose between
a single filesystem bulk recall, and a recall of all layouts.
Also note that initiate_bulk_draining() currently calls iput() while
holding locks. Fix that too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-14 13:22:50 -05:00
Eric W. Biederman 9ff593c473 nfs: kuid and kgid conversions for nfs/inode.c
- Use uid_eq and gid_eq when comparing kuids and kgids.
- Use make_kuid(&init_user_ns, -2) and make_kgid(&init_user_ns, -2) as
  the initial uid and gid on nfs inodes, instead of using the typeunsafe
  value of -2.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:33 -08:00
Eric W. Biederman e5782076e7 nfs: Convert nfs4xdr to use kuids and kgids
When reading uids and gids off the wire convert them to
kuids and kgids.

When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.

When printing kuids and kgids convert them to values in
the initial user namespace then use normal printf formats.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:32 -08:00
Eric W. Biederman 57a38dae2a nfs: Convert nfs3xdr to use kuids and kgids
When reading uids and gids off the wire convert them to
kuids and kgids.

When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.

Add an additional failure mode incoming for uids or gids
that are invalid.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:31 -08:00
Eric W. Biederman cfa0898d4f nfs: Convert nfs2xdr to use kuids and kgids
When reading uids and gids off the wire convert them to
kuids and kgids.

When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.

Add an additional failure mode for incoming uid or
gids that are invalid.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:30 -08:00
Eric W. Biederman 9f309c86cf nfs: Convert idmap to use kuids and kgids
Convert nfs_map_name_to_uid to return a kuid_t value.
Convert nfs_map_name_to_gid to return a kgid_t value.
Convert nfs_map_uid_to_name to take a kuid_t paramater.
Convert nfs_map_gid_to_name to take a kgid_t paramater.

Tweak nfs_fattr_map_owner_to_name to use a kuid_t intermediate value.
Tweak nfs_fattr_map_group_to_name to use a kgid_t intermediate value.

Which makes these functions properly handle kuids and kgids, including
erroring of the generated kuid or kgid is invalid.

Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:29 -08:00
Eric W. Biederman 4e963d4f3e nfs: Pass GLOBAL_ROOT_UID and GLOBAL_ROOT_GID to keyring alloc
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2013-02-13 06:15:27 -08:00
Trond Myklebust c8da19b986 NFSv4.1: Fix an ABBA locking issue with session and state serialisation
Ensure that if nfs_wait_on_sequence() causes our rpc task to wait for
an NFSv4 state serialisation lock, then we also drop the session slot.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-02-11 19:04:25 -05:00
Trond Myklebust c21443c2c7 NFSv4: Fix a reboot recovery race when opening a file
If the server reboots after it has replied to our OPEN, but before we
call nfs4_opendata_to_nfs4_state(), then the reboot recovery thread
will not see a stateid for this open, and so will fail to recover it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:14 -05:00
Trond Myklebust 65b62a29f7 NFSv4: Ensure delegation recall and byte range lock removal don't conflict
Add a mutex to the struct nfs4_state_owner to ensure that delegation
recall doesn't conflict with byte range lock removal.

Note that we nest the new mutex _outside_ the state manager reclaim
protection (nfsi->rwsem) in order to avoid deadlocks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust 37380e4264 NFSv4: Fix up the return values of nfs4_open_delegation_recall
Adjust the return values so that they return EAGAIN to the caller in
cases where we might want to retry the delegation recall after
the state recovery has run.
Note that we can't wait and retry in this routine, because the caller
may be the state manager thread.

If delegation recall fails due to a session or reboot related issue,
also ensure that we mark the stateid as delegated so that
nfs_delegation_claim_opens can find it again later.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:13 -05:00
Trond Myklebust d25be546a8 NFSv4.1: Don't lose locks when a server reboots during delegation return
If the server reboots while we are converting a delegation into
OPEN/LOCK stateids as part of a delegation return, the current code
will simply exit with an error. This causes us to lose both
delegation state and locking state (i.e. locking atomicity).

Deal with this by exposing the delegation stateid during delegation
return, so that we can recover the delegation, and then resume
open/lock recovery.

Note that not having to hold the nfs_inode->rwsem across the
calls to nfs_delegation_claim_opens() also fixes a deadlock against
the NFSv4.1 reboot recovery code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust 9a99af494b NFSv4.1: Prevent deadlocks between state recovery and file locking
We currently have a deadlock in which the state recovery thread
ends up blocking due to one of the locks which it is trying to
recover holding the nfs_inode->rwsem.
The situation is as follows: the state recovery thread is
scheduled in order to recover from a reboot. It immediately
drains the session, forcing all ordinary NFSv4.1 calls to
nfs41_setup_sequence() to be put to sleep.  This includes the
file locking process that holds the nfs_inode->rwsem.
When the thread gets to nfs4_reclaim_locks(), it tries to
grab a write lock on nfs_inode->rwsem, and boom...

Fix is to have the lock drop the nfs_inode->rwsem while it is
doing RPC calls. We use a sequence lock in order to signal to
the locking process whether or not a state recovery thread has
run on that inode, in which case it should retry the lock.

Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:12 -05:00
Trond Myklebust c137afabe3 NFSv4: Allow the state manager to mark an open_owner as being recovered
This patch adds a seqcount_t lock for use by the state manager to
signal that an open owner has been recovered. This mechanism will be
used by the delegation, open and byte range lock code in order to
figure out if they need to replay requests due to collisions with
lock recovery.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-11 15:33:11 -05:00
Jeff Layton 5976687a2b sunrpc: move address copy/cmp/convert routines and prototypes from clnt.h to addr.h
These routines are used by server and client code, so having them in a
separate header would be best.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-02-05 09:41:14 -05:00
Trond Myklebust 322b2b9032 Revert "NFS: add nfs_sb_deactive_async to avoid deadlock"
This reverts commit 324d003b0c.

The deadlock turned out to be caused by a workqueue limitation that has
now been worked around in the RPC code (see comment in rpc_free_task).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-02-01 10:13:48 -05:00
Trond Myklebust c489ee290b NFSv4.1: Handle NFS4ERR_DELAY when resetting the NFSv4.1 session
NFS4ERR_DELAY is a legal reply when we call DESTROY_SESSION. It
usually means that the server is busy handling an unfinished RPC
request. Just sleep for a second and then retry.
We also need to be able to handle the NFS4ERR_BACK_CHAN_BUSY return
value. If the NFS server has outstanding callbacks, we just want to
similarly sleep & retry.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-30 17:45:15 -05:00
Trond Myklebust ab22541782 NFS: Don't silently fail setattr() requests on mountpoints
Ensure that any setattr and getattr requests for junctions and/or
mountpoints are sent to the server. Ever since commit
0ec26fd069 (vfs: automount should ignore LOOKUP_FOLLOW), we have
silently dropped any setattr requests to a server-side mountpoint.
For referrals, we have silently dropped both getattr and setattr
requests.

This patch restores the original behaviour for setattr on mountpoints,
and tries to do the same for referrals, provided that we have a
filehandle...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-30 17:41:04 -05:00
Trond Myklebust 65436ec0c8 NFSv4.1: Ensure that nfs41_walk_client_list() does start lease recovery
We do need to start the lease recovery thread prior to waiting for the
client initialisation to complete in NFSv4.1.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:41 -05:00
Trond Myklebust 202c312dba NFSv4: Fix NFSv4 trunking discovery
If walking the list in nfs4[01]_walk_client_list fails, then the most
likely explanation is that the server dropped the clientid before we
actually managed to confirm it. As long as our nfs_client is the very
last one in the list to be tested, the caller can be assured that this
is the case when the final return value is NFS4ERR_STALE_CLIENTID.

Reported-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org [>=3.7]
Tested-by: Ben Greear <greearb@candelatech.com>
2013-01-27 15:51:28 -05:00
Trond Myklebust 4ae19c2dd7 NFSv4: Fix NFSv4 reference counting for trunked sessions
The reference counting in nfs4_init_client assumes wongly that it
is safe for nfs4_discover_server_trunking() to return a pointer to a
nfs_client prior to bumping the reference count.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: stable@vger.kernel.org [>=3.7]
2013-01-27 15:51:15 -05:00
Trond Myklebust dee972b967 NFS: Fix error reporting in nfs_xdev_mount
Currently, nfs_xdev_mount converts all errors from clone_server() to
ENOMEM, which can then leak to userspace (for instance to 'mount'). Fix that.
Also ensure that if nfs_fs_mount_common() returns an error, we
don't dprintk(0)...

The regression originated in commit 3d176e3fe4
(NFS: Use nfs_fs_mount_common() for xdev mounts)

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-27 15:51:15 -05:00
Nickolai Zeldovich ecf0eb9edb nfs: avoid dereferencing null pointer in initiate_bulk_draining
Fix an inverted null pointer check in initiate_bulk_draining().

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-05 14:26:51 -05:00
Trond Myklebust 6db6dd7d3f NFS: Ensure that we free the rpc_task after read and write cleanups are done
This patch ensures that we free the rpc_task after the cleanup callbacks
are done in order to avoid a deadlock problem that can be triggered if
the callback needs to wait for another workqueue item to complete.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Weston Andros Adamson <dros@netapp.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Bruce Fields <bfields@fieldses.org>
Cc: stable@vger.kernel.org [>= 3.5]
2013-01-04 12:59:10 -05:00
Xi Wang e25fbe380c nfs: fix null checking in nfs_get_option_str()
The following null pointer check is broken.

	*option = match_strdup(args);
	return !option;

The pointer `option' must be non-null, and thus `!option' is always false.
Use `!*option' instead.

The bug was introduced in commit c5cb09b6f8 ("Cleanup: Factor out some
cut-and-paste code.").

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-01-04 10:54:43 -05:00
Yanchuan Nian 39e88fcfb1 pnfs: Increase the refcount when LAYOUTGET fails the first time
The layout will be set unusable if LAYOUTGET fails. Is it reasonable to
increase the refcount iff LAYOUTGET fails the first time?

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.7]
2013-01-04 10:50:42 -05:00
Weston Andros Adamson f8d9a897d4 NFS: Fix access to suid/sgid executables
nfs_open_permission_mask() should only check MAY_EXEC for files that
are opened with __FMODE_EXEC.

Also fix NFSv4 access-in-open path in a similar way -- openflags must be
used because fmode will not always have FMODE_EXEC set.

This patch fixes https://bugzilla.kernel.org/show_bug.cgi?id=49101

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-01-03 17:06:27 -05:00
Trond Myklebust c4271c6e37 NFS: Kill fscache warnings when mounting without -ofsc
The fscache code will currently bleat a "non-unique superblock keys"
warning even if the user is mounting without the 'fsc' option.

There should be no reason to even initialise the superblock cache cookie
unless we're planning on using fscache for something, so ensure that we
check for the NFS_OPTION_FSCACHE flag before calling into the fscache
code.

Reported-by: Paweł Sikora <pawel.sikora@agmk.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: David Howells <dhowells@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:32:09 -08:00
David Howells c129c29347 NFS: Provide stub nfs_fscache_wait_on_invalidate() for when CONFIG_NFS_FSCACHE=n
Provide a stub nfs_fscache_wait_on_invalidate() function for when
CONFIG_NFS_FSCACHE=n lest the following error appear:

  fs/nfs/inode.c: In function 'nfs_invalidate_mapping':
  fs/nfs/inode.c:887:2: error: implicit declaration of function 'nfs_fscache_wait_on_invalidate' [-Werror=implicit-function-declaration]
  cc1: some warnings being treated as errors

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-21 08:06:48 -08:00
David Howells a4ff146881 NFS4: Open files for fscaching
nfs4_file_open() should open files for fscaching.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:19:42 +00:00
David Howells 8c209ce721 NFS: nfs_migrate_page() does not wait for FS-Cache to finish with a page
nfs_migrate_page() does not wait for FS-Cache to finish with a page, probably
leading to the following bad-page-state:

 BUG: Bad page state in process python-bin  pfn:17d39b
 page:ffffea00053649e8 flags:004000000000100c count:0 mapcount:0 mapping:(null)
index:38686 (Tainted: G    B      ---------------- )
 Pid: 31053, comm: python-bin Tainted: G    B      ----------------
2.6.32-71.24.1.el6.x86_64 #1
 Call Trace:
 [<ffffffff8111bfe7>] bad_page+0x107/0x160
 [<ffffffff8111ee69>] free_hot_cold_page+0x1c9/0x220
 [<ffffffff8111ef19>] __pagevec_free+0x59/0xb0
 [<ffffffff8104b988>] ? flush_tlb_others_ipi+0x128/0x130
 [<ffffffff8112230c>] release_pages+0x21c/0x250
 [<ffffffff8115b92a>] ? remove_migration_pte+0x28a/0x2b0
 [<ffffffff8115f3f8>] ? mem_cgroup_get_reclaim_stat_from_page+0x18/0x70
 [<ffffffff81122687>] ____pagevec_lru_add+0x167/0x180
 [<ffffffff811226f8>] __lru_cache_add+0x58/0x70
 [<ffffffff81122731>] lru_cache_add_lru+0x21/0x40
 [<ffffffff81123f49>] putback_lru_page+0x69/0x100
 [<ffffffff8115c0bd>] migrate_pages+0x13d/0x5d0
 [<ffffffff81122687>] ? ____pagevec_lru_add+0x167/0x180
 [<ffffffff81152ab0>] ? compaction_alloc+0x0/0x370
 [<ffffffff8115255c>] compact_zone+0x4cc/0x600
 [<ffffffff8111cfac>] ? get_page_from_freelist+0x15c/0x820
 [<ffffffff810672f4>] ? check_preempt_wakeup+0x1c4/0x3c0
 [<ffffffff8115290e>] compact_zone_order+0x7e/0xb0
 [<ffffffff81152a49>] try_to_compact_pages+0x109/0x170
 [<ffffffff8111e94d>] __alloc_pages_nodemask+0x5ed/0x850
 [<ffffffff814c9136>] ? thread_return+0x4e/0x778
 [<ffffffff81150d43>] alloc_pages_vma+0x93/0x150
 [<ffffffff81167ea5>] do_huge_pmd_anonymous_page+0x135/0x340
 [<ffffffff814cb6f6>] ? rwsem_down_read_failed+0x26/0x30
 [<ffffffff81136755>] handle_mm_fault+0x245/0x2b0
 [<ffffffff814ce383>] do_page_fault+0x123/0x3a0
 [<ffffffff814cbdf5>] page_fault+0x25/0x30

nfs_migrate_page() calls nfs_fscache_release_page() which doesn't actually wait
- even if __GFP_WAIT is set.  The reason that doesn't wait is that
fscache_maybe_release_page() might deadlock the allocator as the work threads
writing to the cache may all end up sleeping on memory allocation.

However, I wonder if that is actually a problem.  There are a number of things
I can do to deal with this:

 (1) Make nfs_migrate_page() wait.

 (2) Make fscache_maybe_release_page() honour the __GFP_WAIT flag.

 (3) Set a timeout around the wait.

 (4) Make nfs_migrate_page() return an error if the page is still busy.

For the moment, I'll select (2) and (4).

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
2012-12-20 22:12:03 +00:00
David Howells de242c0b8b NFS: Use FS-Cache invalidation
Use the new FS-Cache invalidation facility from NFS to deal with foreign
changes being detected on the server rather than attempting to retire the old
cookie and get a new one.

The problem with the old method was that NFS did not wait for all outstanding
storage and retrieval ops on the cache to complete.  There was no automatic
wait between the calls to ->readpages() and calls to invalidate_inode_pages2()
as the latter can only wait on locked pages that have been added to the
pagecache (which they haven't yet on entry to ->readpages()).

This was leading to oopses like the one below when an outstanding read got cut
off from its cookie by a premature release.

BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
IP: [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
PGD 15889067 PUD 15890067 PMD 0
Oops: 0000 [#1] SMP
CPU 0
Modules linked in: cachefiles nfs fscache auth_rpcgss nfs_acl lockd sunrpc

Pid: 4544, comm: tar Not tainted 3.1.0-rc4-fsdevel+ #1064                  /DG965RY
RIP: 0010:[<ffffffffa0075118>]  [<ffffffffa0075118>] __fscache_read_or_alloc_pages+0x1dd/0x315 [fscache]
RSP: 0018:ffff8800158799e8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8800070d41e0 RCX: ffff8800083dc1b0
RDX: 0000000000000000 RSI: ffff880015879960 RDI: ffff88003e627b90
RBP: ffff880015879a28 R08: 0000000000000002 R09: 0000000000000002
R10: 0000000000000001 R11: ffff880015879950 R12: ffff880015879aa4
R13: 0000000000000000 R14: ffff8800083dc158 R15: ffff880015879be8
FS:  00007f671e9d87c0(0000) GS:ffff88003bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000a8 CR3: 000000001587f000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process tar (pid: 4544, threadinfo ffff880015878000, task ffff880015875040)
Stack:
 ffffffffa00b1759 ffff8800070dc158 ffff8800000213da ffff88002a286508
 ffff880015879aa4 ffff880015879be8 0000000000000001 ffff88002a2866e8
 ffff880015879a88 ffffffffa00b20be 00000000000200da ffff880015875040
Call Trace:
 [<ffffffffa00b1759>] ? nfs_fscache_wait_bit+0xd/0xd [nfs]
 [<ffffffffa00b20be>] __nfs_readpages_from_fscache+0x7e/0x13f [nfs]
 [<ffffffff81095fe7>] ? __alloc_pages_nodemask+0x156/0x662
 [<ffffffffa0098763>] nfs_readpages+0xee/0x187 [nfs]
 [<ffffffff81098a5e>] __do_page_cache_readahead+0x1be/0x267
 [<ffffffff81098942>] ? __do_page_cache_readahead+0xa2/0x267
 [<ffffffff81098d7b>] ra_submit+0x1c/0x20
 [<ffffffff8109900a>] ondemand_readahead+0x28b/0x29a
 [<ffffffff810990ce>] page_cache_sync_readahead+0x38/0x3a
 [<ffffffff81091d8a>] generic_file_aio_read+0x2ab/0x67e
 [<ffffffffa008cfbe>] nfs_file_read+0xa4/0xc9 [nfs]
 [<ffffffff810c22c4>] do_sync_read+0xba/0xfa
 [<ffffffff810a62c9>] ? might_fault+0x4e/0x9e
 [<ffffffff81177a47>] ? security_file_permission+0x7b/0x84
 [<ffffffff810c25dd>] ? rw_verify_area+0xab/0xc8
 [<ffffffff810c29a4>] vfs_read+0xaa/0x13a
 [<ffffffff810c2a79>] sys_read+0x45/0x6c
 [<ffffffff813ac37b>] system_call_fastpath+0x16/0x1b

Reported-by: Mark Moseley <moseleymark@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2012-12-20 22:06:33 +00:00
Linus Torvalds 2d4dce0070 NFS client updates for Linux 3.8
Features include:
 
 - Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client code
   Remove altogether where possible, and replace with WARN_ON_ONCE and
   appropriate error returns where not.
 - NFSv4.1 client adds session dynamic slot table management. There is
   matching server side code that has been submitted to Bruce for
   consideration. Together, this code allows the server to dynamically
   manage the amount of memory it allocates to the duplicate request
   cache for each client. It will constantly resize those caches to
   reserve more memory for clients that are hot while shrinking caches
   for those that are quiescent.
 
 In addition, there are assorted bugfixes for the generic NFS write code,
 fixes to deal with the drop_nlink() warnings, and yet another fix for
 NFSv4 getacl.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQz8VNAAoJEGcL54qWCgDy7iYQAKbr7AAZOcZPoJigzakZ7nMi
 UKYulGbFais2Llwzw1e+U5RzmorTSbvl7/m8eS7pDf3auYw/t4xtXjKSGZUNxaE1
 q2hNKgVwodMbScYdkZXvKKNckS93oPDttrmEyzjKanqey+1E3HSklvOvikN0ihte
 B/G1OtA7Qpcr92bPrLK+PjDqarCBUI4g42dYbZOBrZnXKTRtzUqsuKPu7WjpPiof
 SHE5b1Emt7oUxgcijWGcvYCQ8voZdeSCnSksH3DgvORlutwdhUD3Yg8KyEfFZdyc
 6C59ozXRLiHkV3c+jMhJzDkQXR9bYHrnK3tlq4G8v1NdJxRktQliZeqecRvip/Wz
 rAxfE6fnPDEvKsCpZb3+5yTAt+aZwzEhRg1fFC9qfGOp+oRa+CWw5kJCyIFHwJu6
 4LOlubQAf6rnIsja1L8D0FdeqHUa1+wy61On5kgVYS5JGtoBsQHpa1zTwdOxPmsR
 2XTMYGNCEabvpKpO9+5xQbUzkFExPTesw47ygXiUuDT/snaarpV3/f05SSCaWZkX
 R8QsGEOXTIh8/S+UxARGpc7H6xi1PdBM5nBziHVzjEdHgZRF4wGFaJe2CirMjSJO
 Df5GEd5Z/8VCGWs+1w7HD5EaQ2n0wbt5daCE80Y2jRBr7NMYnY+ciF8/GktLpHsn
 Zq1bXGOdr3UZ92LXuzL9
 =G3N9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Features include:

   - Full audit of BUG_ON asserts in the NFS, SUNRPC and lockd client
     code.  Remove altogether where possible, and replace with
     WARN_ON_ONCE and appropriate error returns where not.
   - NFSv4.1 client adds session dynamic slot table management.  There
     is matching server side code that has been submitted to Bruce for
     consideration.

     Together, this code allows the server to dynamically manage the
     amount of memory it allocates to the duplicate request cache for
     each client.  It will constantly resize those caches to reserve
     more memory for clients that are hot while shrinking caches for
     those that are quiescent.

  In addition, there are assorted bugfixes for the generic NFS write
  code, fixes to deal with the drop_nlink() warnings, and yet another
  fix for NFSv4 getacl."

* tag 'nfs-for-3.8-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (106 commits)
  SUNRPC: continue run over clients list on PipeFS event instead of break
  NFS: Don't use SetPageError in the NFS writeback code
  SUNRPC: variable 'svsk' is unused in function bc_send_request
  SUNRPC: Handle ECONNREFUSED in xs_local_setup_socket
  NFSv4.1: Deal effectively with interrupted RPC calls.
  NFSv4.1: Move the RPC timestamp out of the slot.
  NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
  NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
  NFS: Fix calls to drop_nlink()
  NFS: Ensure that we always drop inodes that have been marked as stale
  nfs: Remove unused list nfs4_clientid_list
  nfs: Remove duplicate function declaration in internal.h
  NFS: avoid NULL dereference in nfs_destroy_server
  SUNRPC handle EKEYEXPIRED in call_refreshresult
  SUNRPC set gss gc_expiry to full lifetime
  nfs: fix page dirtying in NFS DIO read codepath
  nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
  NFSv4.1: Be conservative about the client highest slotid
  NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
  nfs: don't extend writes to cover entire page if pagecache is invalid
  ...
2012-12-18 09:36:34 -08:00
Andrew Morton 965c8e59cf lseek: the "whence" argument is called "whence"
But the kernel decided to call it "origin" instead.  Fix most of the
sites.

Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-17 17:15:12 -08:00
Linus Torvalds 2a74dbb9a8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "A quiet cycle for the security subsystem with just a few maintenance
  updates."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  Smack: create a sysfs mount point for smackfs
  Smack: use select not depends in Kconfig
  Yama: remove locking from delete path
  Yama: add RCU to drop read locking
  drivers/char/tpm: remove tasklet and cleanup
  KEYS: Use keyring_alloc() to create special keyrings
  KEYS: Reduce initial permissions on keys
  KEYS: Make the session and process keyrings per-thread
  seccomp: Make syscall skipping and nr changes more consistent
  key: Fix resource leak
  keys: Fix unreachable code
  KEYS: Add payload preparsing opportunity prior to key instantiate or update
2012-12-16 15:40:50 -08:00
Trond Myklebust ada8e20d04 NFS: Don't use SetPageError in the NFS writeback code
The writeback code is already capable of passing errors back to user space
by means of the open_context->error. In the case of ENOSPC, Neil Brown
is reporting seeing 2 errors being returned.

Neil writes:

"e.g. if /mnt2/ if an nfs mounted filesystem that has no space then

strace dd if=/dev/zero conv=fsync >> /mnt2/afile count=1

reported Input/output error and the relevant parts of the strace output are:

write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512) = 512
fsync(1)                                = -1 EIO (Input/output error)
close(1)                                = -1 ENOSPC (No space left on device)"

Neil then shows that the duplication of error messages appears to be due to
the use of the PageError() mechanism, which causes filemap_fdatawait_range
to return the extra EIO. The regression was introduced by
commit 7b281ee026 (NFS: fsync() must exit
with an error if page writeback failed).

Fix this by removing the call to SetPageError(), and just relying on
open_context->error reporting the ENOSPC back to fsync().

Reported-by: Neil Brown <neilb@suse.de>
Tested-by: Neil Brown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [3.6+]
2012-12-15 17:12:14 -05:00
Trond Myklebust ac20d163fc NFSv4.1: Deal effectively with interrupted RPC calls.
If an RPC call is interrupted, assume that the server hasn't processed
the RPC call so that the next time we use the slot, we know that if we
get a NFS4ERR_SEQ_MISORDERED or NFS4ERR_SEQ_FALSE_RETRY, we just have
to bump the sequence number.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-15 15:39:59 -05:00
Trond Myklebust 8e63b6a8ad NFSv4.1: Move the RPC timestamp out of the slot.
Shave a few bytes off the slot table size by moving the RPC timestamp
into the sequence results.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-15 15:21:52 -05:00
Trond Myklebust e879444084 NFSv4.1: Try to deal with NFS4ERR_SEQ_MISORDERED.
If the server returns NFS4ERR_SEQ_MISORDERED, it could be a sign
that the slot was retired at some point. Retry the attempt after
reinitialising the slot sequence number to 1.

Also add a handler for NFS4ERR_SEQ_FALSE_RETRY. Just bump the slot
sequence number and retry...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-15 14:49:09 -05:00
Trond Myklebust 65a0c14954 NFS: nfs_lookup_revalidate should not trust an inode with i_nlink == 0
If the inode has no links, then we should force a new lookup.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-14 17:51:40 -05:00
Trond Myklebust 1f018458b3 NFS: Fix calls to drop_nlink()
It is almost always wrong for NFS to call drop_nlink() after removing a
file. What we really want is to mark the inode's attributes for
revalidation, and we want to ensure that the VFS drops it if we're
reasonably sure that this is the final unlink().
Do the former using the usual cache validity flags, and the latter
by testing if inode->i_nlink == 1, and clearing it in that case.

This also fixes the following warning reported by Neil Brown and
Jeff Layton (among others).

[634155.004438] WARNING:
at /home/abuild/rpmbuild/BUILD/kernel-desktop-3.5.0/lin [634155.004442]
Hardware name: Latitude E6510 [634155.004577]  crc_itu_t crc32c_intel
snd_hwdep snd_pcm snd_timer snd soundcor [634155.004609] Pid: 13402, comm:
bash Tainted: G        W    3.5.0-36-desktop # [634155.004611] Call Trace:
[634155.004630]  [<ffffffff8100444a>] dump_trace+0xaa/0x2b0
[634155.004641]  [<ffffffff815a23dc>] dump_stack+0x69/0x6f
[634155.004653]  [<ffffffff81041a0b>] warn_slowpath_common+0x7b/0xc0
[634155.004662]  [<ffffffff811832e4>] drop_nlink+0x34/0x40
[634155.004687]  [<ffffffffa05bb6c3>] nfs_dentry_iput+0x33/0x70 [nfs]
[634155.004714]  [<ffffffff8118049e>] dput+0x12e/0x230
[634155.004726]  [<ffffffff8116b230>] __fput+0x170/0x230
[634155.004735]  [<ffffffff81167c0f>] filp_close+0x5f/0x90
[634155.004743]  [<ffffffff81167cd7>] sys_close+0x97/0x100
[634155.004754]  [<ffffffff815c3b39>] system_call_fastpath+0x16/0x1b
[634155.004767]  [<00007f2a73a0d110>] 0x7f2a73a0d10f

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [3.3+]
2012-12-14 17:45:11 -05:00
Trond Myklebust eed9935745 NFS: Ensure that we always drop inodes that have been marked as stale
There is no need to cache stale inodes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-14 14:36:36 -05:00
Yanchuan Nian 48d7a57693 nfs: Remove unused list nfs4_clientid_list
This list was designed to store struct nfs4_client in the client side.
But nfs4_client was obsolete and has been removed from the source code.
So remove the unused list.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-13 10:40:09 -05:00
Yanchuan Nian aaea7d2f78 nfs: Remove duplicate function declaration in internal.h
Remove duplicate function declaration in internal.h

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
[Trond: Added nfs_pageio_init_read, which suffered from the same problem]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-13 10:38:54 -05:00
NeilBrown f259613a1e NFS: avoid NULL dereference in nfs_destroy_server
In rare circumstances, nfs_clone_server() of a v2 or v3 server can get
an error between setting server->destory (to nfs_destroy_server), and
calling nfs_start_lockd (which will set server->nlm_host).

If this happens, nfs_clone_server will call nfs_free_server which
will call nfs_destroy_server and thence nlmclnt_done(NULL).  This
causes the NULL to be dereferenced.

So add a guard to only call nlmclnt_done() if ->nlm_host is not NULL.

The other guards there are irrelevant as nlm_host can only be non-NULL
if one of these flags are set - so remove those tests.  (Thanks to Trond
for this suggestion).

This is suitable for any stable kernel since 2.6.25.

Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 23:55:56 -05:00
Andy Adamson eb96d5c97b SUNRPC handle EKEYEXPIRED in call_refreshresult
Currently, when an RPCSEC_GSS context has expired or is non-existent
and the users (Kerberos) credentials have also expired or are non-existent,
the client receives the -EKEYEXPIRED error and tries to refresh the context
forever.  If an application is performing I/O, or other work against the share,
the application hangs, and the user is not prompted to refresh/establish their
credentials. This can result in a denial of service for other users.

Users are expected to manage their Kerberos credential lifetimes to mitigate
this issue.

Move the -EKEYEXPIRED handling into the RPC layer. Try tk_cred_retry number
of times to refresh the gss_context, and then return -EACCES to the application.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 15:36:02 -05:00
Jeff Layton be7e985804 nfs: fix page dirtying in NFS DIO read codepath
The NFS DIO code will dirty pages that catch read responses in order to
handle the case where someone is doing DIO reads into an mmapped buffer.
The existing code doesn't really do the right thing though since it
doesn't take into account the case where we might be attempting to read
past the EOF.

Fix the logic in that code to only dirty pages that ended up receiving
data from the read. Note too that it really doesn't matter if
NFS_IOHDR_ERROR is set or not. All that matters is if the page was
altered by the read.

Cc: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 12:56:19 -05:00
Jeff Layton 67fad106a2 nfs: don't zero out the rest of the page if we hit the EOF on a DIO READ
Eryu provided a test program that would segfault when attempting to read
past the EOF on file that was opened O_DIRECT. The buffer given to the
read() call was on the stack, and when he attempted to read past it it
would scribble over the rest of the stack page.

If we hit the end of the file on a DIO READ request, then we don't want
to zero out the rest of the buffer. These aren't pagecache pages after
all, and there's no guarantee that the buffers that were passed in
represent entire pages.

Cc: <stable@vger.kernel.org> # v3.5+
Cc: Fred Isaman <iisaman@netapp.com>
Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-12 12:56:09 -05:00
Trond Myklebust b0ef9647a0 NFSv4.1: Be conservative about the client highest slotid
If the server sends us a target that looks like an outlier, but
is lower than the existing target, then respect it anyway.
However defer actually updating the generation counter until
we get a target that doesn't look like an outlier.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11 12:29:10 -05:00
Trond Myklebust 8556307374 NFSv4.1: Handle NFS4ERR_BADSLOT errors correctly
Most (all) NFS4ERR_BADSLOT errors are due to the client failing to
respect the server's sr_highest_slotid limit. This mainly happens
due to reordered RPC requests.
The way to handle it is simply to drop the slot that we're using,
and retry using the new highest_slotid limits.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11 10:31:12 -05:00
Trond Myklebust 7ce0171d4f Merge branch 'bugfixes' into nfs-for-next 2012-12-11 09:16:26 -05:00
Jeff Layton 81d9bce530 nfs: don't extend writes to cover entire page if pagecache is invalid
Jian reported that the following sequence would leave "testfile" with
corrupt data:

    # mount localhost:/export /mnt/nfs/ -o vers=3
    # echo abc > /mnt/nfs/testfile; echo def >> /export/testfile; echo ghi >> /mnt/nfs/testfile
    # cat -v /export/testfile
    abc
    ^@^@^@^@ghi

While there's no locking involved here, the operations are serialized,
so CTO should prevent corruption.

The first write to the file is fine and writes 4 bytes. The file is then
extended on the server. When it's reopened a GETATTR is issued and the
size change is noticed. This causes NFS_INO_INVALID_DATA to be set on
the file. Because the file is opened for write only,
nfs_want_read_modify_write() returns 0 to nfs_write_begin().
nfs_updatepage then calls nfs_write_pageuptodate() to see if it should
extend the nfs_page to cover the whole page. NFS_INO_INVALID_DATA is
still set on the file at that point, but that flag is ignored and
nfs_pageuptodate erroneously extends the write to cover the whole page,
with the write done on the server side filled in with zeroes.

This patch just has that function check for NFS_INO_INVALID_DATA in
addition to NFS_INO_REVAL_PAGECACHE. This fixes the bug, but looking
over the code, I wonder if we might have a similar bug in
nfs_revalidate_size(). The difference between those two flags is very
subtle, so it seems like we ought to be checking for
NFS_INO_INVALID_DATA in most of the places that we look for
NFS_INO_REVAL_PAGECACHE.

I believe this is regression introduced by commit 8d197a568. The code
did check for NFS_INO_INVALID_DATA prior to that patch.

Original bug report is here:

    https://bugzilla.redhat.com/show_bug.cgi?id=885743

Cc: <stable@vger.kernel.org> # 3.5+
Reported-by: Jian Li <jiali@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11 09:14:51 -05:00
Sven Wegener 7d3e91a89b NFSv4: Check for buffer length in __nfs4_get_acl_uncached
Commit 1f1ea6c "NFSv4: Fix buffer overflow checking in
__nfs4_get_acl_uncached" accidently dropped the checking for too small
result buffer length.

If someone uses getxattr on "system.nfs4_acl" on an NFSv4 mount
supporting ACLs, the ACL has not been cached and the buffer suplied is
too short, we still copy the complete ACL, resulting in kernel and user
space memory corruption.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable@kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-11 09:14:50 -05:00
Trond Myklebust 1fa8064429 NFSv4.1: Try to eliminate outliers when updating target_highest_slotid
Look for sudden changes in the first and second derivatives in order
to eliminate outlier changes to target_highest_slotid (which are
due to out-of-order RPC replies).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:53 +01:00
Trond Myklebust b75ad4cda5 NFSv4.1: Ensure smooth handover of slots from one task to the next waiting
Currently, we see a lot of bouncing for the value of highest_used_slotid
due to the fact that slots are getting freed, instead of getting instantly
transmitted to the next waiting task.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:52 +01:00
Trond Myklebust 1e1093c7fd NFSv4.1: Don't mess with task priorities in nfs41_setup_sequence
We want to preserve the rpc_task priority for things like writebacks,
that may have differing levels of urgency.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:51 +01:00
Bryan Schumaker 104287cd4e NFS: Remove _nfs_call_sync_session
All it does is pass its arguments through to another function.  Let's
cut out the middleman...

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:51 +01:00
Trond Myklebust 8fe72bac8d NFSv4: Clean up handling of privileged operations
Privileged rpc calls are those that are run by the state recovery thread,
in cases where we're trying to recover the system after a server reboot
or a network partition. In those cases, we want to fence off all other
rpc calls (see nfs4_begin_drain_session()) so that they don't end up
using stateids or clientids that are in the process of being recovered.

Prior to this patch, we had to set up special callback functions in
order to declare an rpc call as being privileged.
By adding a new field to the sequence arguments, this patch simplifies
things considerably, and allows us to declare the rpc call as privileged
before it is run.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:50 +01:00
Trond Myklebust 275e7e20aa NFSv4.1: Remove the 'FIFO' behaviour for nfs41_setup_sequence
It is more important to preserve the task priority behaviour, which ensures
that things like reclaim writes take precedence over background and kupdate
writes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:50 +01:00
Trond Myklebust 7b939a3f44 NFSv4.1: Clean up nfs41_setup_sequence
Move all the sleep-and-exit cases into a single section of code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:49 +01:00
Trond Myklebust fd0c09537a NFSv4: Simplify the NFSv4/v4.1 synchronous call switch
We shouldn't need to pass the 'cache_reply' parameter if we
initialise the sequence_args/sequence_res in the caller.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:49 +01:00
Trond Myklebust d9afbd1b08 NFSv4.1: Simplify the sequence setup
Nobody calls nfs4_setup_sequence or nfs41_setup_sequence without
also calling rpc_call_start() on success. This commit therefore
folds the rpc_call_start call into nfs41_setup_sequence().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:48 +01:00
Trond Myklebust 6ba7db3420 NFSv4.1: Use nfs41_setup_sequence where appropriate
There is no point in using nfs4_setup_sequence or nfs4_sequence_done
in pure NFSv4.1 functions. We already know that those have sessions...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:48 +01:00
Trond Myklebust c10e449827 NFSv4.1: Ping server when our session table limits are too high
If the server requests a lower target_highest_slotid, then ensure
that we ping it with at least one RPC call containing an
appropriate SEQUENCE op. This ensures that the server won't need to
send a recall callback in order to shrink the slot table.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:47 +01:00
Trond Myklebust 0ca3f4825a NFSv4.1: Set the maximum slot table size to 1024 slots
This means that we end up statically allocating 128 bytes for the
bitmap on each slot table.
For a server that supports 1MB write and read I/O sizes this means
that we can completely fill the maximum 1GB TCP send/receive
windows.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:47 +01:00
Trond Myklebust 76e697ba7e NFSv4.1: Move slot table and session struct definitions to nfs4session.h
Clean up. Gather NFSv4.1 slot definitions in fs/nfs/nfs4session.h.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:46 +01:00
Trond Myklebust 73e39aaa83 NFSv4.1: Cleanup move session slot management to fs/nfs/nfs4session.c
NFSv4.1 session management is getting complex enough to deserve
a separate file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:45 +01:00
Trond Myklebust 3302127967 NFSv4: Move nfs4_wait_clnt_recover and nfs4_client_recover_expired_lease
nfs4_wait_clnt_recover and nfs4_client_recover_expired_lease are both
generic state related functions. As such, they belong in nfs4state.c,
and not nfs4proc.c

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:45 +01:00
Trond Myklebust 5d63360dd8 NFSv4.1: Clean up session draining
Coalesce nfs4_check_drain_bc_complete and nfs4_check_drain_fc_complete
into a single function that can be called when the slot table is known
to be empty, then change nfs4_callback_free_slot() and nfs4_free_slot()
to use it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:44 +01:00
Trond Myklebust 69d206b5b3 NFSv4.1: If slot allocation fails due to OOM, retry more quickly
If the NFSv4.1 session slot allocation fails due to an ENOMEM condition,
then set the task->tk_timeout to 1/4 second to ensure that we do retry
the slot allocation more quickly.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:44 +01:00
Trond Myklebust ac0748359a NFSv4.1: CB_RECALL_SLOT must schedule a sequence op after updating targets
RFC5661 requires us to make sure that the server knows we've updated
our slot table size by sending at least one SEQUENCE op containing the
new 'highest_slotid' value.
We can do so using the 'CHECK_LEASE' functionality of the state
manager.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:43 +01:00
Trond Myklebust afa296103e NFSv4.1: Remove the state manager code to resize the slot table
The state manager no longer needs any special machinery to stop the
session flow and resize the slot table. It is all done on the fly by
the SEQUENCE op code now.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:43 +01:00
Trond Myklebust 87dda67e73 NFSv4.1: Allow SEQUENCE to resize the slot table on the fly
Instead of an array of slots, use a singly linked list of slots that
can be dynamically appended to or shrunk.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:42 +01:00
Trond Myklebust 97e548a93d NFSv4.1: Support dynamic resizing of the session slot table
Allow the server to control the size of the session slot table
by adjusting the value of sr_target_max_slots in the reply to the
SEQUENCE operation.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:42 +01:00
Trond Myklebust 1b285ff16a NFSv4.1: Allow the server to recall all but one slot
If the server wants to leave us with only one slot, or it wants
to "shrink" our slot table to something larger than we have now,
then so be it.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:42 +01:00
Trond Myklebust d5fb4ce33e NFSv4.1: Don't confuse target_highest_slotid and max_slots in cb_recall_slot
Don't confuse the table size and the target_highest_slotid...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:41 +01:00
Trond Myklebust ce008c4bb9 NFSv4.1: Fix nfs4_callback_recallslot to work with dynamic slot allocation
Ensure that the NFSv4.1 CB_RECALL_SLOT callback updates the slot table
target max slotid safely.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:37 +01:00
Trond Myklebust da0507b7c9 NFSv4.1: Reset the sequence number for slots that have been deallocated
When the server tells us that it is dynamically resizing the session
replay cache, we should reset the sequence number for those slots
that have been deallocated.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:30:17 +01:00
Trond Myklebust 464ee9f966 NFSv4.1: Ensure that the client tracks the server target_highest_slotid
Dynamic slot allocation in NFSv4.1 depends on the client being able to
track the server's target value for the highest slotid in the
slot table.  See the reference in Section 2.10.6.1 of RFC5661.

To avoid ordering problems in the case where 2 SEQUENCE replies contain
conflicting updates to this target value, we also introduce a generation
counter, to track whether or not an RPC containing a SEQUENCE operation
was launched before or after the last update.

Also rename the nfs4_slot_table target_max_slots field to
'target_highest_slotid' to avoid confusion with a slot
table size or number of slots.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-12-06 00:29:47 +01:00
Al Viro c44600c9d1 nfs_lookup_revalidate(): fix a leak
We are leaking fattr and fhandle if we decide that dentry is not to
be invalidated, after all (e.g. happens to be a mountpoint).  Just
free both before that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:04:36 -05:00
Al Viro 696199f8cc don't do blind d_drop() in nfs_prime_dcache()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:00:51 -05:00
Trond Myklebust f4af6e2abc NFSv4.1: Clean up nfs4_free_slot
Change the argument to take the pointer to the slot, instead of
just the slotid.

We know that the new value of highest_used_slot must be less than
the current value. No need to scan the whole table.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-26 17:49:53 -05:00
Trond Myklebust 2dc03b7f00 NFSv4.1: Simplify slot allocation
Clean up the NFSv4.1 slot allocation by replacing nfs_find_slot() with
a function nfs_alloc_slot() that returns a pointer to the nfs4_slot
instead of an offset into the slot table.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-26 17:49:52 -05:00
Trond Myklebust 2b2fa71723 NFSv4.1: Simplify struct nfs4_sequence_args too
Replace the session pointer + slotid with a pointer to the
allocated slot.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-26 17:49:52 -05:00
Trond Myklebust df2fabffba NFSv4.1: Label each entry in the session slot tables with its slot number
Instead of doing slot table pointer gymnastics every time we want to
know which slot we're using.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-26 17:49:51 -05:00
Trond Myklebust e3725ec015 NFSv4.1: Shrink struct nfs4_sequence_res by moving the session pointer
Move the session pointer into the slot table, then have struct nfs4_slot
point to that slot table.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-26 17:49:04 -05:00
Yanchuan Nian 4c10021008 nfs: Fix wrong slab cache in nfs_commit_mempool
The slab cache in nfs_commit_mempool is wrong, and I think it is just a slip.
I tested it on a x86-32 machine, the size of nfs_write_header is 544, and
the size of nfs_commit_data is 408, so it works fine. It is also true that
sizeof(struct nfs_write_header) > sizeof(struct nfs_commit_data) on other
platforms in my opinoin. Just fix it.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-25 11:59:33 -05:00
Jim Rees d751f748b3 NFS: Reduce stack use in encode_exchange_id()
encode_exchange_id() uses more stack space than necessary, giving a compile
time warning. Reduce the size of the static buffer for implementation name.

Signed-off-by: Jim Rees <rees@umich.edu>
Reviewed-by: "Adamson, Dros" <Weston.Adamson@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 22:59:29 -05:00
Trond Myklebust fe20d7d5ee NFSv4: Fix a compile time warning when #undef CONFIG_NFS_V4_1
The function nfs4_get_machine_cred_locked is used by NFSv4.0 routines
too.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 22:59:28 -05:00
Trond Myklebust 933602e368 NFSv4.1: Shrink struct nfs4_sequence_res by moving sr_renewal_time
Store the renewal time inside the session slot instead.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:53 -05:00
Trond Myklebust 9216106a84 NFSv4.1: clean up nfs4_recall_slot to use nfs4_alloc_slots
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:53 -05:00
Trond Myklebust 2d473d378e NFSv4.1: nfs4_alloc_slots doesn't need zeroing
All that memory is going to be initialised to non-zero by
nfs4_add_and_init_slots anyway.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:52 -05:00
Trond Myklebust 43095d3972 NFSv4.1: We must bump the clientid sequence number after CREATE_SESSION
We must always bump the clientid sequence number after a successful
call to CREATE_SESSION on the server. The result of
nfs4_verify_channel_attrs() is irrelevant to that requirement.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:52 -05:00
Trond Myklebust 688a9024e2 NFSv4.1: Adjust CREATE_SESSION arguments when mounting a new filesystem
If we're mounting a new filesystem, ensure that the session has negotiated
large enough request and reply sizes to match the wsize and rsize mount
arguments.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:51 -05:00
Trond Myklebust ae72ae6760 NFSv4.1: Don't confuse CREATE_SESSION arguments and results
Don't store the target request and response sizes in the same
variables used to store the server's replies to those targets.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:29:51 -05:00
Trond Myklebust 5df904aeb0 NFSv4.1: Handle session reset and bind_conn_to_session before lease check
We can't send a SEQUENCE op unless the session is OK, so it is pointless
to handle the CHECK_LEASE state before we've dealt with SESSION_RESET
and BIND_CONN_TO_SESSION.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-21 09:28:42 -05:00
Bryan Schumaker 6bdb5f213c NFS: Add sequence_priviliged_ops for nfs4_proc_sequence()
If I mount an NFS v4.1 server to a single client multiple times and then
run xfstests over each mountpoint I usually get the client into a state
where recovery deadlocks.  The server informs the client of a
cb_path_down sequence error, the client then does a
bind_connection_to_session and checks the status of the lease.

I found that bind_connection_to_session sets the NFS4_SESSION_DRAINING
flag on the client, but this flag is never unset before
nfs4_check_lease() reaches nfs4_proc_sequence().  This causes the client
to deadlock, halting all NFS activity to the server.  nfs4_proc_sequence()
is only called by the state manager, so I can change it to run in privileged
mode to bypass the NFS4_SESSION_DRAINING check and avoid the deadlock.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-11-20 23:34:54 -05:00
Trond Myklebust 28d79ea33f NFS: Remove the BUG_ON() in the mount code
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:39 -05:00
Trond Myklebust f48407ddd4 NFS: Remove BUG_ON()s in the fs/nfs/inode.c
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:39 -05:00
Trond Myklebust 4ea8fed593 NFSv4: Get rid of unnecessary BUG_ON()s
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:39 -05:00
Trond Myklebust deed85e760 NFS: Remove BUG_ON() calls from the generic writeback code
...and ensure that we set the return value for nfs_page_async_flush()
to zero! (Reported-by: Dros Adamson)

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:39 -05:00
Trond Myklebust bc5a89b337 NFSv4.1: Remove assertion BUG_ON()s from the files and generic layout code
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:39 -05:00
Trond Myklebust eba24e1fe5 NFSv4.1: Remove unused function last_byte_offset
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:38 -05:00
Trond Myklebust d3edcf9614 NFSv4: Remove the BUG_ON() from nfs4_get_lease_time_prepare()...
An EAGAIN return value would be unexpected, but there is no reason to
BUG...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:38 -05:00
Trond Myklebust 7fc388460e NFS: Remove asserts from the NFS XDR code
Convert the ones that are not trivial to check into WARN_ON_ONCE().
Remove checks for things such as NFS2_MAXPATHLEN, which are trivially
done by the caller.

Add a comment to the case of nfs3_xdr_enc_setacl3args. What is being
done there is just wrong...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:38 -05:00
Trond Myklebust 1fea73a865 NFS: Get rid of unnecessary asserts
If the nfs_client fails to initialise correctly, then it will
return an error condition.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-04 14:43:38 -05:00
Weston Andros Adamson 998f40b550 NFS4: nfs4_opendata_access should return errno
Return errno - not an NFS4ERR_. This worked because NFS4ERR_ACCESS == EACCES.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-02 18:51:54 -04:00
Trond Myklebust f9b1ef5f06 NFSv4: Initialise the NFSv4.1 slot table highest_used_slotid correctly
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-11-01 12:02:03 -04:00
Weston Andros Adamson 324d003b0c NFS: add nfs_sb_deactive_async to avoid deadlock
Use nfs_sb_deactive_async instead of nfs_sb_deactive when in a workqueue
context.  This avoids a deadlock where rpc_shutdown_client loops forever
in a workqueue kworker context, trying to kill all RPC tasks associated with
the client, while one or more of these tasks have already been assigned to the
same kworker (and will never run rpc_exit_task).

This approach is needed because RPC tasks that have already been assigned
to a kworker by queue_work cannot be canceled, as explained in the comment
for workqueue.c:insert_wq_barrier.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
[Trond: add module_get/put.]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-31 16:26:26 -04:00
Ben Hutchings 97a5486826 nfs: Show original device name verbatim in /proc/*/mount{s,info}
Since commit c7f404b ('vfs: new superblock methods to override
/proc/*/mount{s,info}'), nfs_path() is used to generate the mounted
device name reported back to userland.

nfs_path() always generates a trailing slash when the given dentry is
the root of an NFS mount, but userland may expect the original device
name to be returned verbatim (as it used to be).  Make this
canonicalisation optional and change the callers accordingly.

[jrnieder@gmail.com: use flag instead of bool argument]
Reported-and-tested-by: Chris Hiestand <chiestand@salk.edu>
Reference: http://bugs.debian.org/669314
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: <stable@vger.kernel.org> # v2.6.39+
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-31 16:26:26 -04:00
Scott Mayhew acce94e68a nfsv3: Make v3 mounts fail with ETIMEDOUTs instead EIO on mountd timeouts
In very busy v3 environment, rpc.mountd can respond to the NULL
procedure but not the MNT procedure in a timely manner causing
the MNT procedure to time out. The problem is the mount system
call returns EIO which causes the mount to fail, instead of
ETIMEDOUT, which would cause the mount to be retried.

This patch sets the RPC_TASK_SOFT|RPC_TASK_TIMEOUT flags to
the rpc_call_sync() call in nfs_mount() which causes
ETIMEDOUT to be returned on timed out connections.

Signed-off-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-31 16:26:25 -04:00
Yanchuan Nian 7175fe9015 nfs: Check whether a layout pointer is NULL before free it
The new layout pointer in pnfs_find_alloc_layout() may be NULL because of
out of memory. we must do some check work, otherwise pnfs_free_layout_hdr()
will go wrong because it can not deal with a NULL pointer.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-31 16:26:25 -04:00
NeilBrown 8d96b10639 NFS: fix bug in legacy DNS resolver.
The DNS resolver's use of the sunrpc cache involves a 'ttl' number
(relative) rather that a timeout (absolute).  This confused me when
I wrote
  commit c5b29f885a
     "sunrpc: use seconds since boot in expiry cache"

and I managed to break it.  The effect is that any TTL is interpreted
as 0, and nothing useful gets into the cache.

This patch removes the use of get_expiry() - which really expects an
expiry time - and uses get_uint() instead, treating the int correctly
as a ttl.

This fixes a regression that has been present since 2.6.37, causing
certain NFS accesses in certain environments to incorrectly fail.

Reported-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-31 16:25:59 -04:00
Trond Myklebust 2b1bc308f4 NFSv4: nfs4_locku_done must release the sequence id
If the state recovery machinery is triggered by the call to
nfs4_async_handle_error() then we can deadlock.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-31 15:10:04 -04:00
Trond Myklebust 2240a9e2d0 NFSv4.1: We must release the sequence id when we fail to get a session slot
If we do not release the sequence id in cases where we fail to get a
session slot, then we can deadlock if we hit a recovery scenario.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-31 15:08:18 -04:00
Bryan Schumaker 399f11c3d8 NFS: Wait for session recovery to finish before returning
Currently, we will schedule session recovery and then return to the
caller of nfs4_handle_exception.  This works for most cases, but causes
a hang on the following test case:

	Client				Server
	------				------
	Open file over NFS v4.1
	Write to file
					Expire client
	Try to lock file

The server will return NFS4ERR_BADSESSION, prompting the client to
schedule recovery.  However, the client will continue placing lock
attempts and the open recovery never seems to be scheduled.  The
simplest solution is to wait for session recovery to run before retrying
the lock.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-10-31 13:13:28 -04:00
Trond Myklebust e9b7e91745 NFSv4: Fix the return value for nfs_callback_start_svc
returning PTR_ERR(cb_info->task) just after we have set it to
NULL looks like a typo...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
2012-10-16 13:14:42 -04:00
Trond Myklebust 2e928e4878 NFSv4.1: Declare osd_pri_2_pnfs_err(), objio_init_read/write to be static
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-16 12:38:00 -04:00
Trond Myklebust d6aa6a81d4 NFSv4: fs/nfs/nfs4getroot.c needs to include "internal.h"
Fix a warning about "no previous prototype for ‘nfs4_get_rootfh’"

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-16 12:37:59 -04:00
Trond Myklebust 1813badd98 NFSv4.1: Use kcalloc() to allocate zeroed arrays instead of kzalloc()
Don't circumvent the array size checks.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-15 10:49:43 -04:00
Trond Myklebust d527e5c15d NFSv4.1: Do not call pnfs_return_layout() from an rpciod context
Move the call to pnfs_return_layout() to the read and write rpc_release()
callbacks, so that it gets called from nfsiod, which is a more appropriate
context.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-15 10:49:43 -04:00
Trond Myklebust 8fcdc31b3d NFSv4.1: Kill nfs4_ds_disconnect()
There is nothing to prevent another thread from dereferencing ds->ds_clp
during or after the call to nfs4_ds_disconnect(), and Oopsing due to the
resulting NULL pointer.

Instead, we should just rely on filelayout_mark_devid_invalid() to keep
us out of trouble by avoiding that deviceid.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-15 10:49:42 -04:00
Linus Torvalds bd81ccea85 Merge branch 'for-3.7' of git://linux-nfs.org/~bfields/linux
Pull nfsd update from J Bruce Fields:
 "Another relatively quiet cycle.  There was some progress on my
  remaining 4.1 todo's, but a couple of them were just of the form
  "check that we do X correctly", so didn't have much affect on the
  code.

  Other than that, a bunch of cleanup and some bugfixes (including an
  annoying NFSv4.0 state leak and a busy-loop in the server that could
  cause it to peg the CPU without making progress)."

* 'for-3.7' of git://linux-nfs.org/~bfields/linux: (46 commits)
  UAPI: (Scripted) Disintegrate include/linux/sunrpc
  UAPI: (Scripted) Disintegrate include/linux/nfsd
  nfsd4: don't allow reclaims of expired clients
  nfsd4: remove redundant callback probe
  nfsd4: expire old client earlier
  nfsd4: separate session allocation and initialization
  nfsd4: clean up session allocation
  nfsd4: minor free_session cleanup
  nfsd4: new_conn_from_crses should only allocate
  nfsd4: separate connection allocation and initialization
  nfsd4: reject bad forechannel attrs earlier
  nfsd4: enforce per-client sessions/no-sessions distinction
  nfsd4: set cl_minorversion at create time
  nfsd4: don't pin clientids to pseudoflavors
  nfsd4: fix bind_conn_to_session xdr comment
  nfsd4: cast readlink() bug argument
  NFSD: pass null terminated buf to kstrtouint()
  nfsd: remove duplicate init in nfsd4_cb_recall
  nfsd4: eliminate redundant nfs4_free_stateid
  fs/nfsd/nfs4idmap.c: adjust inconsistent IS_ERR and PTR_ERR
  ...
2012-10-13 10:53:54 +09:00
J. Bruce Fields a9ca4043d0 Merge Trond's bugfixes
Merge branch 'bugfixes' of git://linux-nfs.org/~trondmy/nfs-2.6 into
for-3.7-incoming.  Mainly needed for Bryan's "SUNRPC: Set alloc_slot for
backchannel tcp ops", without which the 4.1 server oopses.
2012-10-11 12:41:05 -04:00
Linus Torvalds df632d3ce7 NFS client updates for Linux 3.7
Features include:
 
 - Remove CONFIG_EXPERIMENTAL dependency from NFSv4.1
   Aside from the issues discussed at the LKS, distros are shipping
   NFSv4.1 with all the trimmings.
 - Fix fdatasync()/fsync() for the corner case of a server reboot.
 - NFSv4 OPEN access fix: finally distinguish correctly between
   open-for-read and open-for-execute permissions in all situations.
 - Ensure that the TCP socket is closed when we're in CLOSE_WAIT
 - More idmapper bugfixes
 - Lots of pNFS bugfixes and cleanups to remove unnecessary state and
   make the code easier to read.
 - In cases where a pNFS read or write fails, allow the client to
   resume trying layoutgets after two minutes of read/write-through-mds.
 - More net namespace fixes to the NFSv4 callback code.
 - More net namespace fixes to the NFSv3 locking code.
 - More NFSv4 migration preparatory patches.
   Including patches to detect network trunking in both NFSv4 and NFSv4.1
 - pNFS block updates to optimise LAYOUTGET calls.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQdMvBAAoJEGcL54qWCgDyV84P/0XvcEXj6kdMv9EiWfRczo7r
 iAwAIhiEmG1agtZa6v+Gso2MYRQbkGyJi0LKIwzGqNUi0BLQGQCoV93kB0ITVpiN
 g7poDTnPyoItW1oJCtC48/Mx0G5C1yrHSwFAJrXmtzDF1mwd/BIQReafYp6x+/TU
 Mvwm7au3Y2ySRBEDmY4zyBERHXGt//JmsZ9Ays6jewQg5ZOyjDQKoeHVYaaeJoF0
 A0tQGcBSNdySagI5dt4SlkuO7AClhzVHlilep2dsBu/TLS0F2pEdHXvM2W0koZmM
 uazaIpzd2F7TfokTYExgsyKsqpkzpDf1kebN4Y1+Ioi7Yy30dQrX6lNaUNcOmOJQ
 xx694HDHV90KdRBVSFhOIHMTBRcls68hBcWib3MXWHTKX6HVgnFMwhwxGH0MRezf
 3rmXoqn+CO1j5WeQmA3BqdVbHSZHi913TKEwE/qoW4pmOFhv5I2flXWQS/Rwvdng
 2xDCe6TlvhMS92IpyvNEIicXLRSm+DUAmoAfSqqlifZIAEM5R29e/wCAsmVprO3B
 LPHyUoIMO6SZ1PL6Rk20+6qQfvCK7U/ChULsUL/zb7R88Pc3sFE2BeAvZVATsvH3
 +FJWTz43fwUBoMhPsn8xSBLn/fq6az5C19syz6Fpu3DZ4X0EwyVWifiFk6HgcxZD
 J8ajEl+dNZeFE8rkwykX
 =uBk7
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Features include:

   - Remove CONFIG_EXPERIMENTAL dependency from NFSv4.1
     Aside from the issues discussed at the LKS, distros are shipping
     NFSv4.1 with all the trimmings.
   - Fix fdatasync()/fsync() for the corner case of a server reboot.
   - NFSv4 OPEN access fix: finally distinguish correctly between
     open-for-read and open-for-execute permissions in all situations.
   - Ensure that the TCP socket is closed when we're in CLOSE_WAIT
   - More idmapper bugfixes
   - Lots of pNFS bugfixes and cleanups to remove unnecessary state and
     make the code easier to read.
   - In cases where a pNFS read or write fails, allow the client to
     resume trying layoutgets after two minutes of read/write-
     through-mds.
   - More net namespace fixes to the NFSv4 callback code.
   - More net namespace fixes to the NFSv3 locking code.
   - More NFSv4 migration preparatory patches.
     Including patches to detect network trunking in both NFSv4 and
     NFSv4.1
   - pNFS block updates to optimise LAYOUTGET calls."

* tag 'nfs-for-3.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (113 commits)
  pnfsblock: cleanup nfs4_blkdev_get
  NFS41: send real read size in layoutget
  NFS41: send real write size in layoutget
  NFS: track direct IO left bytes
  NFSv4.1: Cleanup ugliness in pnfs_layoutgets_blocked()
  NFSv4.1: Ensure that the layout sequence id stays 'close' to the current
  NFSv4.1: Deal with seqid wraparound in the pNFS return-on-close code
  NFSv4 set open access operation call flag in nfs4_init_opendata_res
  NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTAL
  NFSv4 reduce attribute requests for open reclaim
  NFSv4: nfs4_open_done first must check that GETATTR decoded a file type
  NFSv4.1: Deal with wraparound when updating the layout "barrier" seqid
  NFSv4.1: Deal with wraparound issues when updating the layout stateid
  NFSv4.1: Always set the layout stateid if this is the first layoutget
  NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layout
  NFSv4: don't put ACCESS in OPEN compound if O_EXCL
  NFSv4: don't check MAY_WRITE access bit in OPEN
  NFS: Set key construction data for the legacy upcall
  NFSv4.1: don't do two EXCHANGE_IDs on mount
  NFS: nfs41_walk_client_list(): re-lock before iterating
  ...
2012-10-10 23:52:35 +09:00
J. Bruce Fields f474af7051 UAPI Disintegration 2012-10-09
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIVAwUAUHPmWxOxKuMESys7AQKN4w//XDwALfbf0MXIw+gwyRiUtJe9mGexvI6X
 1R4FWU9a3ImzEZP4cWnmPGT2wmC/x007DcIvx8cyvbdlSuqtR2i/DC+HbWabiLRn
 nJS7Eer1BJvLv5dn6NmXMEz7yB4Z46+frcmBs3WQeR0sqBMDm+rjQzCqECznO8Jc
 VtCbox+VR2DuWcM++YECTblYEH3Z+doDXUN2eBaD8L9x3klPbPXD7OcRyOnry8w+
 ynmUTKKyH4+hpxDakYrObPIg+vFCxb4QRck1mlgA4wbvb3eqjhM0oOCYJ8GvmILA
 vdFYztWCjkiuOl5djtXBlsClX8SAMOBYlRed+R1GvjNCSR+WCWrFJJ2F8qoQ1w87
 9ts2/8qrozS8luTB475SkT2uLdJkIUKX89Oh+dWeE8YkbPnRPj5lNAdtNY5QSyDq
 VaRpIo+YfmZygyvHJQlAXBuZ0mvzcPzArfcPgSVTD3B7xTEGVu/45V7SnQX5os/V
 v39ySPXMdGOIdvK51gw7OtZl64uqrEKu39PyYDX/GUADflp/CHD0J7PJrQePbsH9
 AQolVZDIxTfKqYQnUdL8+C8Zc24RowEzz3c2+aO89MSzwGqev3q8sXRVbW/Iqryg
 p+V3nHe+ipKcga5tOBlPr9KDtDd7j3xN2yaIwf5/QyO1OHBpjAZP1gjSVDcUcwpi
 svYy4kPn3PA=
 =etoL
 -----END PGP SIGNATURE-----

nfs: disintegrate UAPI for nfs

This is to complete part of the Userspace API (UAPI) disintegration for which
the preparatory patches were pulled recently.  After these patches, userspace
headers will be segregated into:

        include/uapi/linux/.../foo.h

for the userspace interface stuff, and:

        include/linux/.../foo.h

for the strictly kernel internal stuff.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-10-09 18:35:22 -04:00
Konstantin Khlebnikov 0b173bc4da mm: kill vma flag VM_CAN_NONLINEAR
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().

Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.

Now device drivers can implement this method and obtain nonlinear vma support.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-09 16:22:17 +09:00
Peng Tao af283885b7 pnfsblock: cleanup nfs4_blkdev_get
It is not needed at all and it is messing with return values...

Reported-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08 19:32:40 -04:00
Peng Tao 1fd937bd75 NFS41: send real read size in layoutget
For buffer read, use offst-to-isize.

For direct read, use dreq->bytes_left.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08 19:32:34 -04:00
Peng Tao 6296556f0b NFS41: send real write size in layoutget
For buffer write, block layout client scan inode mapping to find
next hole and use offset-to-hole as layoutget length. Object
layout client uses offset-to-isize as layoutget length.

For direct write, both block layout and object layout use dreq->bytes_left.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08 19:32:22 -04:00
Peng Tao 35754bc00e NFS: track direct IO left bytes
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-08 19:31:55 -04:00
Trond Myklebust 19c54abab7 NFSv4.1: Cleanup ugliness in pnfs_layoutgets_blocked()
Split it into two functions, one which checks if layoutgets are blocked,
and one which checks if the layout stateid has expired.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-05 16:56:58 -07:00
Trond Myklebust 22aaf71495 NFSv4.1: Ensure that the layout sequence id stays 'close' to the current
Clamp the layout barrier sequence id to the current sequence id
minus the maximum number of outstanding layoutget requests.

Also ensure that we correctly initialise lo->plh_barrier if there are
no layout segments associated to this layout header.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-04 16:57:48 -07:00
Trond Myklebust 0f35ad6f68 NFSv4.1: Deal with seqid wraparound in the pNFS return-on-close code
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-04 16:28:17 -07:00
Andy Adamson 5f65753033 NFSv4 set open access operation call flag in nfs4_init_opendata_res
nfs4_open_recover_helper zeros the nfs4_opendata result structures, removing
the result access_request information which leads to an XDR decode error.

Move the setting of the result access_request field to nfs4_init_opendata_res
which sets all the other required nfs4_opendata result fields and is shared
between the open and recover open paths.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-03 17:10:28 -07:00
Trond Myklebust 8544a9dc18 NFSv4.1: Remove the dependency on CONFIG_EXPERIMENTAL
CONFIG_EXPERIMENTAL is deprecated and, regardless of that, this code
is being enabled in most newer distributions.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-03 10:54:50 -07:00
Linus Torvalds aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Kirill A. Shutemov 8c0a853770 fs: push rcu_barrier() from deactivate_locked_super() to filesystems
There's no reason to call rcu_barrier() on every
deactivate_locked_super().  We only need to make sure that all delayed rcu
free inodes are flushed before we destroy related cache.

Removing rcu_barrier() from deactivate_locked_super() affects some fast
paths.  E.g.  on my machine exit_group() of a last process in IPC
namespace takes 0.07538s.  rcu_barrier() takes 0.05188s of that time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-02 21:35:55 -04:00
Andy Adamson e23008ec81 NFSv4 reduce attribute requests for open reclaim
We currently make no distinction in attribute requests between normal OPENs
and OPEN with CLAIM_PREVIOUS.  This offers more possibility of failures in
the GETATTR response which foils OPEN reclaim attempts.

Reduce the requested attributes to the bare minimum needed to update the
reclaim open stateid and split nfs4_opendata_to_nfs4_state processing
accordingly.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 18:12:25 -07:00
Trond Myklebust 807d66d802 NFSv4: nfs4_open_done first must check that GETATTR decoded a file type
...before it can check the validity of that file type.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 17:09:00 -07:00
Trond Myklebust 25a1a6211d NFSv4.1: Deal with wraparound when updating the layout "barrier" seqid
...and fix a bug in pnfs_set_layout_stateid.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 17:04:33 -07:00
Trond Myklebust 5a65503f3d NFSv4.1: Deal with wraparound issues when updating the layout stateid
...and add a helper function.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 16:47:14 -07:00
Trond Myklebust 038d649376 NFSv4.1: Always set the layout stateid if this is the first layoutget
If the list of layout segments is empty, we must unconditionally set
the layout stateid.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 16:38:41 -07:00
Trond Myklebust 251ec410c4 NFSv4.1: Fix another refcount issue in pnfs_find_alloc_layout
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 15:41:05 -07:00
Weston Andros Adamson ae2bb03236 NFSv4: don't put ACCESS in OPEN compound if O_EXCL
Don't put an ACCESS op in OPEN compound if O_EXCL, because ACCESS
will return permission denied for all bits until close.

Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 14:56:19 -07:00
Weston Andros Adamson bbd3a8eee8 NFSv4: don't check MAY_WRITE access bit in OPEN
Don't check MAY_WRITE as a newly created file may not have write mode bits,
but POSIX allows the creating process to write regardless.
This is ok because NFSv4 OPEN ops handle write permissions correctly -
the ACCESS in the OPEN compound is to differentiate READ v EXEC permissions.

Fixes a regression due to commit 6168f62c (NFSv4: Add ACCESS operation to
OPEN compound)

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 14:55:41 -07:00
Bryan Schumaker ddfc4e1712 NFS: Set key construction data for the legacy upcall
This prevents a null pointer dereference when
nfs_idmap_complete_pipe_upcall_locked() calls complete_request_key().

Fixes a regression caused by commit 0cac12023 (NFSv4: Ensure that
idmap_pipe_downcall sanity-checks the downcall data).

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 13:04:09 -07:00
Weston Andros Adamson fd4835708f NFSv4.1: don't do two EXCHANGE_IDs on mount
Since the addition of NFSv4 server trunking detection the mount context
calls nfs4_proc_exchange_id then schedules the state manager, which also
calls nfs4_proc_exchange_id. Setting the NFS4CLNT_LEASE_CONFIRM bit
makes the state manager skip the unneeded EXCHANGE_ID and continue on
with session creation.

Reported-by: Jorge Mora <mora@netapp.com>
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 11:35:47 -07:00
David Howells f8aa23a55f KEYS: Use keyring_alloc() to create special keyrings
Use keyring_alloc() to create special keyrings now that it has a permissions
parameter rather than using key_alloc() + key_instantiate_and_link().

Also document and export keyring_alloc() so that modules can use it too.

Signed-off-by: David Howells <dhowells@redhat.com>
2012-10-02 19:24:56 +01:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Linus Torvalds 033d9959ed Merge branch 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue changes from Tejun Heo:
 "This is workqueue updates for v3.7-rc1.  A lot of activities this
  round including considerable API and behavior cleanups.

   * delayed_work combines a timer and a work item.  The handling of the
     timer part has always been a bit clunky leading to confusing
     cancelation API with weird corner-case behaviors.  delayed_work is
     updated to use new IRQ safe timer and cancelation now works as
     expected.

   * Another deficiency of delayed_work was lack of the counterpart of
     mod_timer() which led to cancel+queue combinations or open-coded
     timer+work usages.  mod_delayed_work[_on]() are added.

     These two delayed_work changes make delayed_work provide interface
     and behave like timer which is executed with process context.

   * A work item could be executed concurrently on multiple CPUs, which
     is rather unintuitive and made flush_work() behavior confusing and
     half-broken under certain circumstances.  This problem doesn't
     exist for non-reentrant workqueues.  While non-reentrancy check
     isn't free, the overhead is incurred only when a work item bounces
     across different CPUs and even in simulated pathological scenario
     the overhead isn't too high.

     All workqueues are made non-reentrant.  This removes the
     distinction between flush_[delayed_]work() and
     flush_[delayed_]_work_sync().  The former is now as strong as the
     latter and the specified work item is guaranteed to have finished
     execution of any previous queueing on return.

   * In addition to the various bug fixes, Lai redid and simplified CPU
     hotplug handling significantly.

   * Joonsoo introduced system_highpri_wq and used it during CPU
     hotplug.

  There are two merge commits - one to pull in IRQ safe timer from
  tip/timers/core and the other to pull in CPU hotplug fixes from
  wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

Fixed a number of trivial conflicts, but the more interesting conflicts
were silent ones where the deprecated interfaces had been used by new
code in the merge window, and thus didn't cause any real data conflicts.

Tejun pointed out a few of them, I fixed a couple more.

* 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
  workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
  workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
  workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
  workqueue: remove @delayed from cwq_dec_nr_in_flight()
  workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
  workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
  workqueue: use __cpuinit instead of __devinit for cpu callbacks
  workqueue: rename manager_mutex to assoc_mutex
  workqueue: WORKER_REBIND is no longer necessary for idle rebinding
  workqueue: WORKER_REBIND is no longer necessary for busy rebinding
  workqueue: reimplement idle worker rebinding
  workqueue: deprecate __cancel_delayed_work()
  workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
  workqueue: use mod_delayed_work() instead of __cancel + queue
  workqueue: use irqsafe timer for delayed_work
  workqueue: clean up delayed_work initializers and add missing one
  workqueue: make deferrable delayed_work initializer names consistent
  workqueue: cosmetic whitespace updates for macro definitions
  workqueue: deprecate system_nrt[_freezable]_wq
  workqueue: deprecate flush[_delayed]_work_sync()
  ...
2012-10-02 09:54:49 -07:00
Chuck Lever c2ccc084eb NFS: nfs41_walk_client_list(): re-lock before iterating
Sparse identified an execution path in nfs41_walk_client_list()
where the nfs_client_lock is not re-acquired before taking the next
loop iteration.

fs/nfs/nfs4client.c:437:9: sparse: context imbalance in
 'nfs41_walk_client_list' - different lock contexts for basic block

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 09:25:02 -07:00
Trond Myklebust ee314c2a35 NFSv4.1: Handle BAD_STATEID and EXPIRED errors in layoutget
If the layoutget call returns a stateid error, we want to invalidate the
layout stateid, and/or recover the open stateid.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:34:29 -07:00
Trond Myklebust 6f018efac1 NFSv4.1: bl_pg_init_write should be static
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:34:29 -07:00
Stanislav Kinsbursky 4e437e95ae nfs: include internal.h in getroot.h
Sparse warning:
fs/nfs/nfs4getroot.c:11:5: warning: symbol 'nfs4_get_rootfh' was not declared.
Should it be static?

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:17:04 -07:00
Stanislav Kinsbursky 22e2430961 nfs: include nfs4_fh.h in nfs4sysctl.c
Sparse warnings:
fs/nfs/nfs4sysctl.c:56:5: warning: symbol 'nfs4_register_sysctl' was not
declared. Should it be static?
fs/nfs/nfs4sysctl.c:64:6: warning: symbol 'nfs4_unregister_sysctl' was not
declared. Should it be static?

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:17:03 -07:00
Stanislav Kinsbursky 3dd4f8ef7b nfs: declare nfs_xdev_mount as static
Sparse warning:
fs/nfs/super.c:2517:15: warning: symbol 'nfs_xdev_mount' was not declared.
Should it be static?

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:17:03 -07:00
Stanislav Kinsbursky 0b37d20ca2 nfs: declare nfs_callback_tcp_port in header
Sparse warning:
fs/nfs/super.c:2638:16: sparse: symbol 'nfs_callback_tcpport' was not
declared. Should it be static?

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:17:02 -07:00
Stanislav Kinsbursky ca57ccc48f nfs: include NFSv4 header in netns.h
Build error:
fs/nfs/netns.h:27:15: error: 'NFS4_MAX_MINOR_VERSION' undeclared here (not in
a function)

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-02 08:17:02 -07:00
Andy Adamson 47b803c8d2 NFSv4.0 reclaim reboot state when re-establishing clientid
We should reclaim reboot state when the clientid is stale.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 18:12:41 -07:00
Trond Myklebust f9d640f3a4 NFSv4: nfs4_match_clientids is only used by NFSv4.1
Fix another compiler warning.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 16:58:48 -07:00
Trond Myklebust 758201e2c9 NFSv4: Fix the minor version callback channel startup
The current spaghetti code confuses some versions of gcc (and just
looks ugly as hell)! Clean up...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 16:58:39 -07:00
Trond Myklebust 9f62387d6e NFSv4: Fix up a merge conflict between migration and container changes
nfs_callback_tcpport is now per-net_namespace.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 16:17:31 -07:00
Trond Myklebust 2afdfa5a84 Merge branch 'bugfixes' into nfs-for-next 2012-10-01 15:42:14 -07:00
Daniel Walter 7297cb682a nfs: replace strict_strto* with kstrto*
[nfs] replace strict_str* with kstr* variants

 * replace string conversions with newer kstr* functions

Signed-off-by: Daniel Walter <sahne@0x90.at>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:40:05 -07:00
Yanchuan Nian ee34e13620 NFS: Remove unnecessary semicolons (fs/nfs/client.c)
There are some unnecessary semicolons in function find_nfs_version. Just remove them.

Signed-off-by: Yanchuan Nian <ycnian@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:40:03 -07:00
Wei Yongjun 4e266229db pnfsblock: use list_move_tail instead of list_del/list_add_tail
Using list_move_tail() instead of list_del() + list_add_tail().

spatch with a semantic match is used to found this problem.
(http://coccinelle.lip6.fr/)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:39:05 -07:00
Peng Tao 96c9eae638 pnfsblock: fix non-aligned DIO write
For DIO writes, if it is not blocksize aligned, we need to do
internal serialization. It may slow down writers anyway. So we
just bail them out and resend to MDS.

Cc: stable <stable@vger.kernel.org> [since v3.4]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:38:35 -07:00
Peng Tao f742dc4a32 pnfsblock: fix non-aligned DIO read
For DIO read, if it is not sector aligned, we should reject it
and resend via MDS. Otherwise there might be data corruption.
Also teach bl_read_pagelist to handle partial page reads for DIO.

Cc: stable <stable@vger.kernel.org> [since v3.4]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:38:29 -07:00
Peng Tao fe6e1e8d9f pnfsblock: fix partial page buffer wirte
If applications use flock to protect its write range, generic NFS
will not do read-modify-write cycle at page cache level. Therefore
LD should know how to handle non-sector aligned writes. Otherwise
there will be data corruption.

Cc: stable <stable@vger.kernel.org>
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:38:24 -07:00
Peng Tao 5d0e3a004f Revert "pnfsblock: bail out partial page IO"
This reverts commit 159e0561e3, in favor
of a more complete fix to the alignment issue.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:38:16 -07:00
Peng Tao dc182549d4 NFS41: fix error of setting blocklayoutdriver
After commit e38eb650 (NFS: set_pnfs_layoutdriver() from
nfs4_proc_fsinfo()), set_pnfs_layoutdriver() is called inside
nfs4_proc_fsinfo(), but pnfs_blksize is not set. It causes setting
blocklayoutdriver failure and pnfsblock mount failure.

Cc: stable <stable@vger.kernel.org> [since v3.5]
Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:37:39 -07:00
Peng Tao 7acdb02681 NFSv41: fix DIO write_io calculation
pnfs_within_mdsthreshold() is called inside pg_init. We need to set
read_io/write_io before that. Otherwise we fail pnfs_within_mdsthreshold()
and IO goes to MDS.
A simple test case:
dd if=foo of=/mnt/pnfs/bar bs=10M count=1 oflag=direct

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:37:34 -07:00
Chuck Lever 6f2ea7f2a3 NFS: Add nfs4_unique_id boot parameter
An optional boot parameter is introduced to allow client
administrators to specify a string that the Linux NFS client can
insert into its nfs_client_id4 id string, to make it both more
globally unique, and to ensure that it doesn't change even if the
client's nodename changes.

If this boot parameter is not specified, the client's nodename is
used, as before.

Client installation procedures can create a unique string (typically,
a UUID) which remains unchanged during the lifetime of that client
instance.  This works just like creating a UUID for the label of the
system's root and boot volumes.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:33:33 -07:00
Chuck Lever 05f4c350ee NFS: Discover NFSv4 server trunking when mounting
"Server trunking" is a fancy named for a multi-homed NFS server.
Trunking might occur if a client sends NFS requests for a single
workload to multiple network interfaces on the same server.  There
are some implications for NFSv4 state management that make it useful
for a client to know if a single NFSv4 server instance is
multi-homed.  (Note this is only a consideration for NFSv4, not for
legacy versions of NFS, which are stateless).

If a client cares about server trunking, no NFSv4 operations can
proceed until that client determines who it is talking to.  Thus
server IP trunking discovery must be done when the client first
encounters an unfamiliar server IP address.

The nfs_get_client() function walks the nfs_client_list and matches
on server IP address.  The outcome of that walk tells us immediately
if we have an unfamiliar server IP address.  It invokes
nfs_init_client() in this case.  Thus, nfs4_init_client() is a good
spot to perform trunking discovery.

Discovery requires a client to establish a fresh client ID, so our
client will now send SETCLIENTID or EXCHANGE_ID as the first NFS
operation after a successful ping, rather than waiting for an
application to perform an operation that requires NFSv4 state.

The exact process for detecting trunking is different for NFSv4.0 and
NFSv4.1, so a minorversion-specific init_client callout method is
introduced.

CLID_INUSE recovery is important for the trunking discovery process.
CLID_INUSE is a sign the server recognizes the client's nfs_client_id4
id string, but the client is using the wrong principal this time for
the SETCLIENTID operation.  The SETCLIENTID must be retried with a
series of different principals until one works, and then the rest of
trunking discovery can proceed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:33:33 -07:00
Chuck Lever e984a55a74 NFS: Use the same nfs_client_id4 for every server
Currently, when identifying itself to NFS servers, the Linux NFS
client uses a unique nfs_client_id4.id string for each server IP
address it talks with.  For example, when client A talks to server X,
the client identifies itself using a string like "AX".  The
requirements for these strings are specified in detail by RFC 3530
(and bis).

This form of client identification presents a problem for Transparent
State Migration.  When client A's state on server X is migrated to
server Y, it continues to be associated with string "AX."  But,
according to the rules of client string construction above, client
A will present string "AY" when communicating with server Y.

Server Y thus has no way to know that client A should be associated
with the state migrated from server X.  "AX" is all but abandoned,
interfering with establishing fresh state for client A on server Y.

To support transparent state migration, then, NFSv4.0 clients must
instead use the same nfs_client_id4.id string to identify themselves
to every NFS server; something like "A".

Now a client identifies itself as "A" to server X.  When a file
system on server X transitions to server Y, and client A identifies
itself as "A" to server Y, Y will know immediately that the state
associated with "A," whether it is native or migrated, is owned by
the client, and can merge both into a single lease.

As a pre-requisite to adding support for NFSv4 migration to the Linux
NFS client, this patch changes the way Linux identifies itself to NFS
servers via the SETCLIENTID (NFSv4 minor version 0) and EXCHANGE_ID
(NFSv4 minor version 1) operations.

In addition to removing the server's IP address from nfs_client_id4,
the Linux NFS client will also no longer use its own source IP address
as part of the nfs_client_id4 string.  On multi-homed clients, the
value of this address depends on the address family and network
routing used to contact the server, thus it can be different for each
server.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:33:33 -07:00
Chuck Lever 896526174c NFS: Introduce "migration" mount option
Currently, the Linux client uses a unique nfs_client_id4.id string
when identifying itself to distinct NFS servers.

To support transparent state migration, the Linux client will have to
use the same nfs_client_id4 string for all servers it communicates
with (also known as the "uniform client string" approach).  Otherwise
NFS servers can not recognize that open and lock state need to be
merged after a file system transition.

Unfortunately, there are some NFSv4.0 servers currently in the field
that do not tolerate the uniform client string approach.

Thus, by default, our NFSv4.0 mounts will continue to use the current
approach, and we introduce a mount option that switches them to use
the uniform model.  Client administrators must identify which servers
can be mounted with this option.  Eventually most NFSv4.0 servers will
be able to handle the uniform approach, and we can change the default.

The first mount of a server controls the behavior for all subsequent
mounts for the lifetime of that set of mounts of that server.  After
the last mount of that server is gone, the client erases the data
structure that tracks the lease.  A subsequent lease may then honor
a different "migration" setting.

This patch adds only the infrastructure for parsing the new mount
option.  Support for uniform client strings is added in a subsequent
patch.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:33:33 -07:00
Chuck Lever ba9b584c1d SUNRPC: Introduce rpc_clone_client_set_auth()
An ULP is supposed to be able to replace a GSS rpc_auth object with
another GSS rpc_auth object using rpcauth_create().  However,
rpcauth_create() in 3.5 reliably fails with -EEXIST in this case.
This is because when gss_create() attempts to create the upcall pipes,
sometimes they are already there.  For example if a pipe FS mount
event occurs, or a previous GSS flavor was in use for this rpc_clnt.

It turns out that's not the only problem here.  While working on a
fix for the above problem, we noticed that replacing an rpc_clnt's
rpc_auth is not safe, since dereferencing the cl_auth field is not
protected in any way.

So we're deprecating the ability of rpcauth_create() to switch an
rpc_clnt's security flavor during normal operation.  Instead, let's
add a fresh API that clones an rpc_clnt and gives the clone a new
flavor before it's used.

This makes immediate use of the new __rpc_clone_client() helper.

This can be used in a similar fashion to rpcauth_create() when a
client is hunting for the correct security flavor.  Instead of
replacing an rpc_clnt's security flavor in a loop, the ULP replaces
the whole rpc_clnt.

To fix the -EEXIST problem, any ULP logic that relies on replacing
an rpc_clnt's rpc_auth with rpcauth_create() must be changed to use
this API instead.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:33:33 -07:00
Chuck Lever ffe5a83005 NFS: Slow down state manager after an unhandled error
If the state manager thread is not actually able to fully recover from
some situation, it wakes up waiters, who kick off a new state manager
thread.  Quite often the fresh invocation of the state manager is just
as successful.

This results in a livelock as the client dumps thousands of NFS
requests a second on the network in a vain attempt to recover.  Not
very friendly.

To mitigate this situation, add a delay in the state manager after
an unhandled error, so that the client sends just a few requests
every second in this case.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:31:51 -07:00
Chuck Lever 8cb7f74eee NFS: nfs_parsed_mount_options can use unsigned int
fs/nfs/super.c: In function ‘nfs_compare_remount_data’:
fs/nfs/super.c:2042:18: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2043:18: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2044:20: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2046:21: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2047:21: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2048:21: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2049:21: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]
fs/nfs/super.c:2050:18: warning: comparison between signed and
    unsigned integer expressions [-Wsign-compare]

Seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:31:41 -07:00
Stanislav Kinsbursky 1dc42e04b7 NFS: add debug messages to callback down function
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:26:06 -07:00
Stanislav Kinsbursky b3d19c5172 NFS: callback per-net usage counting introduced
This patch also introduces refcount-aware nfs_callback_down_net() wrapper for
svc_shutdown_net().

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:57 -07:00
Stanislav Kinsbursky 29dcc16a8e NFS: make nfs_callback_tcpport6 per network context
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:51 -07:00
Stanislav Kinsbursky bbe0a3aa4e NFS: make nfs_callback_tcpport per network context
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:47 -07:00
Stanislav Kinsbursky 23c20ecd44 NFS: callback up - users counting cleanup
Usage coutner now increased only is the service was started sccessfully.
Even if service is running already, then goto is not required anymore, because
service creation and start will be skipped.
With this patch code looks clearer.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:38 -07:00
Stanislav Kinsbursky 8e24614443 NFS: callback service start function introduced
This is just a code move, which from my POW makes code looks better.
I.e. now on start we have 3 different stages:
1) Service creation.
2) Service per-net data allocation.
3) Service start.

Patch also renames goto label "out_err:" into "err_start:" to reflect new
changes.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:35 -07:00
Stanislav Kinsbursky 691c457ae6 NFS: callback up - transport backchannel cleanup
No need to assign transports backchannel server explicitly in
nfs41_callback_up() -  there is nfs_callback_bc_serv() function for this.
By using it, nfs4_callback_up() and nfs41_callback_up() can be called without
transport argument.

Note: service have to be passed to nfs_callback_bc_serv() instead of callback,
since callback link can be uninitialized.

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:29 -07:00
Stanislav Kinsbursky c946556b87 NFS: move per-net callback thread initialization to nfs_callback_up_net()
v4:
1) Callback transport creation routine selection by version simlified.

This new function in now called before nfs_minorversion_callback_svc_setup()).

Also few small changes:
1) current network namespace in nfs_callback_up() was replaced by transport net.
2) svc_shutdown_net() was moved prior to callback usage counter decrement
(because in case of per-net data allocation faulure svc_shutdown_net() have to
be skipped).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:24 -07:00
Stanislav Kinsbursky dd018428dc NFS: callback service creation function introduced
This function creates service if it's not exist, or increase usage counter of
the existent, and returns pointer to it.
Usage counter will be droppepd by svc_destroy() later in nfs_callback_up().

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:25:11 -07:00
Stanislav Kinsbursky c8ceb4124b NFS: pass net to nfs_callback_down()
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:24:51 -07:00
Weston Andros Adamson 6168f62cbd NFSv4: Add ACCESS operation to OPEN compound
The OPEN operation has no way to differentiate an open for read and an
open for execution - both look like read to the server. This allowed
users to read files that didn't have READ access but did have EXEC access,
which is obviously wrong.

This patch adds an ACCESS call to the OPEN compound to handle the
difference between OPENs for reading and execution.  Since we're going
through the trouble of calling ACCESS, we check all possible access bits
and cache the results hopefully avoiding an ACCESS call in the future.

Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:20:11 -07:00
Bryan Schumaker 57a51048da NFS: Use kzalloc() instead of kmalloc() in the idmapper
This will allocate memory that has already been zeroed, allowing us to
remove the memset later on.

Signed-off-by: Bryan Schumaker <bjchuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:18:44 -07:00
Bryan Schumaker 6938867edb NFS: Remove bad delegations during open recovery
I put the client into an open recovery loop by:
	Client: Open file
		read half
	Server: Expire client (echo 0 > /sys/kernel/debug/nfsd/forget_clients)
	Client: Drop vm cache (echo 3 > /proc/sys/vm/drop_caches)
		finish reading file

This causes a loop because the client never updates the nfs4_state after
discovering that the delegation is invalid.  This means it will keep
trying to read using the bad delegation rather than attempting to re-open
the file.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
CC: stable@vger.kernel.org [3.4+]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:17:25 -07:00
Bryan Schumaker fcb6d9c6b7 NFS: Always use the open stateid when checking for expired opens
If we are reading through a delegation, and the delegation is OK then
state->stateid will still point to a delegation stateid and not an open
stateid.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-10-01 15:17:17 -07:00
Linus Torvalds 99dbb1632f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull the trivial tree from Jiri Kosina:
 "Tiny usual fixes all over the place"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  doc: fix old config name of kprobetrace
  fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc
  btrfs: fix the commment for the action flags in delayed-ref.h
  btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID
  vfs: fix kerneldoc for generic_fh_to_parent()
  treewide: fix comment/printk/variable typos
  ipr: fix small coding style issues
  doc: fix broken utf8 encoding
  nfs: comment fix
  platform/x86: fix asus_laptop.wled_type module parameter
  mfd: printk/comment fixes
  doc: getdelays.c: remember to close() socket on error in create_nl_socket()
  doc: aliasing-test: close fd on write error
  mmc: fix comment typos
  dma: fix comments
  spi: fix comment/printk typos in spi
  Coccinelle: fix typo in memdup_user.cocci
  tmiofb: missing NULL pointer checks
  tools: perf: Fix typo in tools/perf
  tools/testing: fix comment / output typos
  ...
2012-10-01 09:06:36 -07:00
Trond Myklebust 849b286fd0 NFSv4.1: nfs4_proc_layoutreturn must always drop the plh_block_lgets count
Currently it does not do so if the RPC call failed to start. Fix is to
move the decrement of plh_block_lgets into nfs4_layoutreturn_release.

Also remove a redundant test of task->tk_status in nfs4_layoutreturn_done:
if lrp->res.lrs_present is set, then obviously the RPC call succeeded.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:18 -04:00
Trond Myklebust 65857d5768 NFSv4.1: _pnfs_return_layout() shouldn't invalidate the layout on failure
Failure of the layoutreturn allocation fails is not a good reason to
mark the pnfs_layout_hdr as having failed a layoutget or i/o. Just
exit cleanly.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:18 -04:00
Trond Myklebust e5929f3cff NFSv4.1: Remove the NFS_LAYOUT_RETURNED state
It serves no purpose that the test for whether or not we have valid
layout segments doesn't already serve.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:17 -04:00
Trond Myklebust 173f77e9c5 NFSv4.1: Clear NFS_LAYOUT_BULK_RECALL when the layout segments are freed
Once all the affected layout segments have been freed up, clear the
NFS_LAYOUT_BULK_RECALL flag so that we can reuse the pnfs_layout_hdr

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:17 -04:00
Trond Myklebust 8006bfba36 NFSv4.1: Get rid of the NFS_LAYOUT_DESTROYED state
We already have a mechanism for blocking LAYOUTGET by means of the
plh_block_lgets counter. The only "service" that NFS_LAYOUT_DESTROYED
provides at this point is to block layoutget once the layout segment
list is empty, which basically means that you have to wait until
the pnfs_layout_hdr is destroyed before you can do pNFS on that file
again.

This patch enables the reuse of the pnfs_layout_hdr if the layout
segment list is empty.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:16 -04:00
Trond Myklebust 579342785f NFSv4.1: Remove unused 'default allocation' for pnfs_alloc_layout_hdr()
...and ditto for pnfs_free_layout_hdr()

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:16 -04:00
Trond Myklebust a9136d4914 NFSv4.1: Get rid of pNFS spin lock debugging asserts...
These are all in static declared functions that are called only once.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:16 -04:00
Trond Myklebust 8f0d27dc5d NFSv4.1: Balance pnfs_layout_hdr refcount in pnfs_layout_(insert|remove)_lseg
Ensure that the reference count for pnfs_layout_hdr reverts to the
original value after a call to pnfs_layout_remove_lseg().

Note that the caller is expected to hold a reference to the struct
pnfs_layout_hdr.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:15 -04:00
Trond Myklebust 905ca191cf NFSv4.1: Clean up pnfs_put_lseg()
There is no longer a need to use pnfs_free_lseg_list(). Just call
pnfs_free_lseg() directly.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:15 -04:00
Trond Myklebust 9c6263819f NFSv4.1: Clean up the removal of pnfs_layout_hdr from the server list
Move the code into pnfs_free_layout_hdr(), and add checks to
get_layout_by_fh_locked to ensure that they don't reference a layout
that is being freed.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:14 -04:00
Trond Myklebust 6622c3ea05 NFSv4.1: Free the pnfs_layout_hdr outside the inode->i_lock
None of the existing pNFS layout drivers seem to require the inode
to be locked while they free the layout header.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:14 -04:00
Trond Myklebust 01d39ce82b NFSv4.1: Remove redundant reference to the pnfs_layout_hdr
Each layout segment already holds a reference to the pnfs_layout_hdr,
so there is no need to hold an extra reference that is released once
the last layout segment is freed.

Ensure that pnfs_find_alloc_layout() always returns a reference
to the pnfs_layout_hdr, which will be matched by the final call to
pnfs_put_layout_hdr() in pnfs_update_layout().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:13 -04:00
Trond Myklebust 57036a3776 NFSv4.1: Rename the pnfs_put_lseg_common to pnfs_layout_remove_lseg
The latter name is more descriptive of the actual function.
Also rename pnfs_insert_layout to pnfs_layout_insert_lseg.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:13 -04:00
Trond Myklebust bb346f6397 NFSv4.1: reset the inode MDS threshold counters on layout destruction
Instead of resetting the inode MDS threshold counters when we mark
the layout for destruction, do it as part of freeing the layout.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:12 -04:00
Trond Myklebust 965938b83b NFSv4.1: Get rid of pNFS layout state "NFS_LAYOUT_INVALID"
In all cases where we set NFS_LAYOUT_INVALID, we also set NFS_LAYOUT_DESTROYED.
Furthermore, in all cases where we test for NFS_LAYOUT_INVALID, we should
also be testing for NFS_LAYOUT_DESTROYED, since the latter means that
we hold no valid layout segments.
Ergo the two are redundant.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:12 -04:00
Trond Myklebust 1f7977c136 NFSv4.1: Simplify the pNFS return-on-close code
Confine it to the nfs4_do_close() code.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:12 -04:00
Trond Myklebust 7fdab069b7 NFSv4.1: Fix a race in the pNFS return-on-close code
If we sleep after dropping the inode->i_lock, then we are no longer
atomic with respect to the rpc_wake_up() call in pnfs_layout_remove_lseg().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:11 -04:00
Trond Myklebust 115ce575cb NFSv4.1: pnfs_layout_io_set_failed must clear invalid lsegs
If pnfs_layout_io_test_failed() authorises a retry of the failed layoutgets,
we should clear the existing layout segments so that we start afresh. Do
this in pnfs_layout_io_set_failed().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:11 -04:00
Trond Myklebust 3e62121493 NFSv4.1: Don't drop the pnfs_layout_hdr after a layoutget failure
We want to cache the pnfs_layout_hdr after a layoutget or i/o
failure so that pnfs_update_layout() can find it and know when
it is time to retry.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:10 -04:00
Trond Myklebust 830ffb5657 NFSv4.1: Fix a reference leak in pnfs_update_layout
If we exit after the call to pnfs_find_alloc_layout(), we have to ensure
that we put the struct pnfs_layout_hdr.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:10 -04:00
Trond Myklebust 1dfed2737d NFSv4.1: pNFS data servers may be temporarily offline
In cases where the pNFS data server is just temporarily out of service,
we want to mark it as such, and then try again later. Typically that will
be in cases of network connection errors etc.
This patch allows us to mark the devices as being "unavailable" for such
transient errors, and will make them available for retries after a
2 minute timeout period.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:09 -04:00
Trond Myklebust 25c7533357 NFSv4.1: Retry pNFS after a 2 minute timeout
If we had to fall back to read/write through MDS, then assume that we should
retry pNFS after a suitable timeout period.
The following patch sets a timeout of 2 minutes.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:09 -04:00
Trond Myklebust b9e028fd89 NFSv4.1: Add helpers for setting/reading the I/O fail bit
...and make them local to the pnfs.c file.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:09 -04:00
Trond Myklebust f86bbcf85d NFSv4.1: Replace dprintk() in pnfs_update_layout with something less buggy
Dereferencing nfsi->layout in order to read plh_flags without holding
a spin lock is bug prone. Furthermore, the dprintk() tells you nothing
about whether or not the call succeeded.
Replace it with something that tells you about whether or not a valid
layout segment was returned for the inode in question.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:08 -04:00
Trond Myklebust 78e4e05c64 NFSv4.1: Replace get_device_info() with filelayout_get_device_info()
Fix the namespace pollution issue.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:08 -04:00
Trond Myklebust 9369a431bc NFSv4.1: Cleanup; add "pnfs_" prefix to put_lseg() and get_lseg()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:07 -04:00
Trond Myklebust 70c3bd2bdf NFSv4.1: Cleanup; add "pnfs_" prefix to get_layout_hdr() and put_layout_hdr()
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:07 -04:00
Trond Myklebust 49a85061b0 NFSv4.1: Cleanup add a "pnfs_" prefix to mark_matching_lsegs_invalid
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:06 -04:00
Trond Myklebust a0b0a6e39b NFS: Clean up the pNFS layoutget interface
Ensure that we do return errors from nfs4_proc_layoutget() and that we
don't mark the layout as having failed if the error was due to a
signal or resource problem on the client side.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:06 -04:00
Trond Myklebust dcfc4f2546 NFS: Write the entire file if a server reboot occurs during fsync()
This is to ensure that we don't clear the NFS_CONTEXT_RESEND_WRITES
flag while there are still writes that haven't been resent.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:05 -04:00
Trond Myklebust 05990d1bf2 NFS: Fix fdatasync/fsync() when confronted with a server reboot
If the server reboots before it can commit the unstable writes to disk,
then nfs_commit_release_pages() will detect this when it compares the
verifier returned by COMMIT to the one returned by WRITE. When this
happens, the client needs to resend those writes in order to guarantee
that they make it to stable storage.

This patch adds a signalling mechanism to notify fsync() that it
needs to retry all writes before it can exit.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:05 -04:00
Trond Myklebust 795a88c968 NFSv4: Convert the nfs4_lock_state->ls_flags to a bit field
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:04 -04:00
Trond Myklebust 2a369153c8 NFS: Clean up helper function nfs4_select_rw_stateid()
We want to be able to pass on the information that the page was not
dirtied under a lock. Instead of adding a flag parameter, do this
by passing a pointer to a 'struct nfs_lock_owner' that may be NULL.

Also reuse this structure in struct nfs_lock_context to carry the
fl_owner_t and pid_t.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:04 -04:00
Trond Myklebust b3c54de6f8 NFS: Convert nfs_get_lock_context to return an ERR_PTR on failure
We want to be able to distinguish between allocation failures, and
the case where the lock context is not needed (because there are no
locks).

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 16:03:03 -04:00
Trond Myklebust 0cac120233 NFSv4: Ensure that idmap_pipe_downcall sanity-checks the downcall data
Use the idmapper upcall data to verify that the legacy idmapper daemon
is indeed responding to an upcall that we sent.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
2012-09-28 13:43:34 -04:00
Trond Myklebust e9ab41b620 NFSv4: Clean up the legacy idmapper upcall
Replace the BUG_ON(idmap->idmap_key_cons != NULL) with a
WARN_ON_ONCE(). Then get rid of the ACCESS_ONCE(idmap->idmap_key_cons).

Then add helper functions for starting, finishing and aborting the
legacy upcall.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Bryan Schumaker <bjschuma@netapp.com>
2012-09-28 13:42:41 -04:00
Trond Myklebust 0e24d849c4 NFSv4: Remove BUG_ON() and ACCESS_ONCE() calls in the idmapper
The use of ACCESS_ONCE() is wrong, since the various routines that set/clear
idmap->idmap_key_cons should be strictly ordered w.r.t. each other, and
the idmap->idmap_mutex ensures that only one thread at a time may be in
an upcall situation.

Also replace the BUG_ON()s with WARN_ON_ONCE() where appropriate.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-28 12:27:11 -04:00
Trond Myklebust 13fe4ba1b6 NFSv4.1: decode_getdeviceinfo should check xdr_read_pages() return value
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-26 12:43:10 -04:00
NeilBrown 62d98c9354 NFS4: avoid underflow when converting error to pointer.
In nfs4_create_sec_client, 'flavor' can hold a negative error
code (returned from nfs4_negotiate_security), even though it
is an 'enum' and hence unsigned.

The code is careful to cast it to an (int) before testing if it
is negative, however it doesn't cast to an (int) before calling
ERR_PTR.

On a machine where "void*" is larger than "int", this results in
the unsigned equivalent of -1 (e.g. 0xffffffff) being converted
to a pointer.  Subsequent code determines that this is not
negative, and so  dereferences it with predictable results.

So: cast 'flavor' to a (signed) int before passing to ERR_PTR.

cc: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-25 10:38:54 -04:00
Wei Yongjun e8d920c58d NFS: fix the return value check by using IS_ERR
In case of error, the function rpcauth_create() returns ERR_PTR()
and never returns NULL pointer. The NULL test in the return value
check should be replaced with IS_ERR().

dpatch engine is used to auto generated this patch.
(https://github.com/weiyj/dpatch)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-25 10:36:37 -04:00
Eric W. Biederman 5f3a4a28ec userns: Pass a userns parameter into posix_acl_to_xattr and posix_acl_from_xattr
- Pass the user namespace the uid and gid values in the xattr are stored
   in into posix_acl_from_xattr.

 - Pass the user namespace kuid and kgid values should be converted into
   when storing uid and gid values in an xattr in posix_acl_to_xattr.

- Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to
  pass in &init_user_ns.

In the short term this change is not strictly needed but it makes the
code clearer.  In the longer term this change is necessary to be able to
mount filesystems outside of the initial user namespace that natively
store posix acls in the linux xattr format.

Cc: Theodore Tso <tytso@mit.edu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-09-18 01:01:35 -07:00
Trond Myklebust 7b281ee026 NFS: fsync() must exit with an error if page writeback failed
We need to ensure that if the call to filemap_write_and_wait_range()
fails, then we report that error back to the application.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-11 15:38:32 -04:00
Weston Andros Adamson 01913b49cf NFS: return error from decode_getfh in decode open
If decode_getfh failed, nfs4_xdr_dec_open would return 0 since the last
decode_* call must have succeeded.

Cc: stable@vger.kernel.org
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-06 16:01:33 -04:00
Trond Myklebust 1f1ea6c2d9 NFSv4: Fix buffer overflow checking in __nfs4_get_acl_uncached
Pass the checks made by decode_getacl back to __nfs4_get_acl_uncached
so that it knows if the acl has been truncated.

The current overflow checking is broken, resulting in Oopses on
user-triggered nfs4_getfacl calls, and is opaque to the point
where several attempts at fixing it have failed.
This patch tries to clean up the code in addition to fixing the
Oopses by ensuring that the overflow checks are performed in
a single place (decode_getacl). If the overflow check failed,
we will still be able to report the acl length, but at least
we will no longer attempt to cache the acl or copy the
truncated contents to user space.

Reported-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Sachin Prabhu <sprabhu@redhat.com>
2012-09-06 11:11:53 -04:00
Trond Myklebust 21f498c2f7 NFSv4: Fix range checking in __nfs4_get_acl_uncached and __nfs4_proc_set_acl
Ensure that the user supplied buffer size doesn't cause us to overflow
the 'pages' array.

Also fix up some confusion between the use of PAGE_SIZE and
PAGE_CACHE_SIZE when calculating buffer sizes. We're not using
the page cache for anything here.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-09-04 14:52:43 -04:00
Trond Myklebust 872ece86ea NFS: Fix a problem with the legacy binary mount code
Apparently, am-utils is still using the legacy binary mountdata interface,
and is having trouble parsing /proc/mounts due to the 'port=' field being
incorrectly set.

The following patch should fix up the regression.

Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-09-04 14:52:43 -04:00
Trond Myklebust c3f52af3e0 NFS: Fix the initialisation of the readdir 'cookieverf' array
When the NFS_COOKIEVERF helper macro was converted into a static
inline function in commit 99fadcd764 (nfs: convert NFS_*(inode)
helpers to static inline), we broke the initialisation of the
readdir cookies, since that depended on doing a memset with an
argument of 'sizeof(NFS_COOKIEVERF(inode))' which therefore
changed from sizeof(be32 cookieverf[2]) to sizeof(be32 *).

At this point, NFS_COOKIEVERF seems to be more of an obfuscation
than a helper, so the best thing would be to just get rid of it.

Also see: https://bugzilla.kernel.org/show_bug.cgi?id=46881

Reported-by: Andi Kleen <andi@firstfloor.org>
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-09-04 14:52:42 -04:00
Peter Meerwald 1856b225ca nfs: comment fix
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-09-01 10:09:44 -07:00
J. Bruce Fields 5b444cc9a4 svcrpc: remove handling of unknown errors from svc_recv
svc_recv() returns only -EINTR or -EAGAIN.  If we really want to worry
about the case where it has a bug that causes it to return something
else, we could stick a WARN() in svc_recv.  But it's silly to require
every caller to have all this boilerplate to handle that case.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-08-21 17:42:00 -04:00
Trond Myklebust 0866004304 NFSv3: Ensure that do_proc_get_root() reports errors correctly
If the rpc call to NFS3PROC_FSINFO fails, then we need to report that
error so that the mount fails. Otherwise we can end up with a
superblock with completely unusable values for block sizes, maxfilesize,
etc.

Reported-by: Yuanming Chen <hikvision_linux@163.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-20 12:52:42 -04:00
Trond Myklebust 7653f6ff4e NFSv4: Ensure that nfs4_alloc_client cleans up on error.
Any pointer that was allocated through nfs_alloc_client() needs to be
freed via a call to nfs_free_client().

Reported-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-20 12:12:29 -04:00
Bryan Schumaker 12dfd08055 NFS: return -ENOKEY when the upcall fails to map the name
This allows the normal error-paths to handle the error, rather than
making a special call to complete_request_key() just for this instance.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Tested-by: William Dauchy <wdauchy@gmail.com>
Cc: stable@vger.kernel.org [>= 3.4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 17:20:06 -04:00
Bryan Schumaker c5066945b7 NFS: Clear key construction data if the idmap upcall fails
idmap_pipe_downcall already clears this field if the upcall succeeds,
but if it fails (rpc.idmapd isn't running) the field will still be set
on the next call triggering a BUG_ON().  This patch tries to handle all
possible ways that the upcall could fail and clear the idmap key data
for each one.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Tested-by: William Dauchy <wdauchy@gmail.com>
Cc: stable@vger.kernel.org [>= 3.4]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 17:20:02 -04:00
Trond Myklebust cff298c721 NFSv4: Don't use private xdr_stream fields in decode_getacl
Instead of using the private field xdr->p from struct xdr_stream,
use the public xdr_stream_pos().

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 16:15:51 -04:00
Trond Myklebust b291f1b1c8 NFSv4: Fix the acl cache size calculation
Currently, we do not take into account the size of the 16 byte
struct nfs4_cached_acl header, when deciding whether or not we should
cache the acl data.  Consequently, we will end up allocating an
8k buffer in order to fit a maximum size 4k acl.

This patch adjusts the calculation so that we limit the cache size
to 4k for the acl header+data.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 16:15:50 -04:00
Trond Myklebust 519d3959e3 NFSv4: Fix pointer arithmetic in decode_getacl
Resetting the cursor xdr->p to a previous value is not a safe
practice: if the xdr_stream has crossed out of the initial iovec,
then a bunch of other fields would need to be reset too.

Fix this issue by using xdr_enter_page() so that the buffer gets
page aligned at the bitmap _before_ we decode it.

Also fix the confusion of the ACL length with the page buffer length
by not adding the base offset to the ACL length...

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-08-16 16:15:50 -04:00
bjschuma@gmail.com 425e776d93 NFS: Alias the nfs module to nfs4
This allows distros to remove the line from their modprobe
configuration.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 16:15:49 -04:00
bjschuma@gmail.com 1ae811ee27 NFS: Fix a regression when loading the NFS v4 module
Some systems have a modprobe.d/nfs.conf file that sets an nfs4 alias
pointing to nfs.ko, rather than nfs4.ko.  This can prevent the v4 module
from loading on mount, since the kernel sees that something named "nfs4"
has already been loaded.  To work around this, I've renamed the modules
to "nfsv2.ko" "nfsv3.ko" and "nfsv4.ko".

I also had to move the nfs4_fs_type back to nfs.ko to ensure that `mount
-t nfs4` still works.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-16 16:15:49 -04:00
Tejun Heo 41f63c5359 workqueue: use mod_delayed_work() instead of cancel + queue
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().

Most conversions are straight-forward.  Ones worth mentioning are,

* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
  use mod_delayed_work() and cancel loop in
  edac_mc_reset_delay_period() is dropped.

* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
  watchdog is active or not.  @fan_watchdog_active and related code
  dropped.

* drivers/power/charger-manager.c: Seemingly a lot of
  delayed_work_pending() abuse going on here.
  [delayed_]work_pending() are unsynchronized and racy when used like
  this.  I converted one instance in fullbatt_handler().  Please
  conver the rest so that it invokes workqueue APIs for the intended
  target state rather than trying to game work item pending state
  transitions.  e.g. if timer should be modified - call
  mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().

* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
  simplified.  Note that round_jiffies() calls in this function are
  meaningless.  round_jiffies() work on absolute jiffies not delta
  delay used by delayed_work.

v2: Tomi pointed out that __cancel_delayed_work() users can't be
    safely converted to mod_delayed_work().  They could be calling it
    from irq context and if that happens while delayed_work_timer_fn()
    is running, it could deadlock.  __cancel_delayed_work() users are
    dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Anton Vorontsov <cbouatmailru@gmail.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Roland Dreier <roland@kernel.org>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
2012-08-13 16:27:37 -07:00
Trond Myklebust 47fbf7976e NFSv4.1: Remove a bogus BUG_ON() in nfs4_layoutreturn_done
Ever since commit 0a57cdac3f (NFSv4.1 send layoutreturn to fence
disconnected data server) we've been sending layoutreturn calls
while there is potentially still outstanding I/O to the data
servers. The reason we do this is to avoid races between replayed
writes to the MDS and the original writes to the DS.

When this happens, the BUG_ON() in nfs4_layoutreturn_done can
be triggered because it assumes that we would never call
layoutreturn without knowing that all I/O to the DS is
finished. The fix is to remove the BUG_ON() now that the
assumptions behind the test are obsolete.

Reported-by: Boaz Harrosh <bharrosh@panasas.com>
Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>=3.5]
2012-08-08 16:03:13 -04:00
Boaz Harrosh 7de6e28417 pnfs-obj: Better IO pattern in case of unaligned offset
Depending on layout and ARCH, ORE has some limits on max IO sizes
which is communicated on (what else) ore_layout->max_io_length,
which is always stripe aligned.
This was considered as the pg_test boundary for splitting and starting
a new IO.

But in the case of a long IO where the start offset is not aligned
what would happen is that both end of IO[N] and start of IO[N+1]
would be unaligned, causing each IO boundary parity unit to be
calculated and written twice.

So what we do in this patch is split the very start of an unaligned
IO, up to a stripe boundary, and then next IO's can continue fully
aligned til the end.

We might be sacrificing the case where the full unaligned IO would
fit within a single max_io_length, but the sacrifice is well worth
the elimination of double calculation and parity units IO.
Actually the sacrificing is marginal and is almost unmeasurable.

TODO:
	If we know the total expected linear segment that will
	be received, at pg_init, we could use that information
	in many places:
	1. blocks-layout get_layout write segment size
	2. Better mds-threshold
	3. In above situation for a better clean split

	I will do this in future submission.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02 17:42:51 -04:00
Peng Tao f616638409 NFS41: add pg_layout_private to nfs_pageio_descriptor
To allow layout driver to pass private information around
pg_init/pg_doio.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02 17:41:18 -04:00
Idan Kedar 21d1f58aed pnfs: nfs4_proc_layoutget returns void
since the only user of nfs4_proc_layoutget is send_layoutget, which
ignores its return value, there is no reason to return any value.

Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02 17:39:06 -04:00
Idan Kedar 8554116e17 pnfs: defer release of pages in layoutget
we have encountered a bug whereby reading a lot of files (copying
fedora's /bin) from a pNFS mount and hitting Ctrl+C in the middle caused
a general protection fault in xdr_shrink_bufhead. this function is
called when decoding the response from LAYOUTGET. the decoding is done
by a worker thread, and the caller of LAYOUTGET waits for the worker
thread to complete.

hitting Ctrl+C caused the synchronous wait to end and the next thing the
caller does is to free the pages, so when the worker thread calls
xdr_shrink_bufhead, the pages are gone. therefore, the cleanup of these
pages has been moved to nfs4_layoutget_release.

Signed-off-by: Idan Kedar <idank@tonian.com>
Signed-off-by: Benny Halevy <bhalevy@tonian.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02 17:38:54 -04:00
Jeff Layton 3dd4765fce nfs: tear down caches in nfs_init_writepagecache when allocation fails
...and ensure that we tear down the nfs_commit_data cache too when
unloading the module.

Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-08-02 17:36:07 -04:00
Linus Torvalds ac694dbdbc Merge branch 'akpm' (Andrew's patch-bomb)
Merge Andrew's second set of patches:
 - MM
 - a few random fixes
 - a couple of RTC leftovers

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (120 commits)
  rtc/rtc-88pm80x: remove unneed devm_kfree
  rtc/rtc-88pm80x: assign ret only when rtc_register_driver fails
  mm: hugetlbfs: close race during teardown of hugetlbfs shared page tables
  tmpfs: distribute interleave better across nodes
  mm: remove redundant initialization
  mm: warn if pg_data_t isn't initialized with zero
  mips: zero out pg_data_t when it's allocated
  memcg: gix memory accounting scalability in shrink_page_list
  mm/sparse: remove index_init_lock
  mm/sparse: more checks on mem_section number
  mm/sparse: optimize sparse_index_alloc
  memcg: add mem_cgroup_from_css() helper
  memcg: further prevent OOM with too many dirty pages
  memcg: prevent OOM with too many dirty pages
  mm: mmu_notifier: fix freed page still mapped in secondary MMU
  mm: memcg: only check anon swapin page charges for swap cache
  mm: memcg: only check swap cache pages for repeated charging
  mm: memcg: split swapin charge function into private and public part
  mm: memcg: remove needless !mm fixup to init_mm when charging
  mm: memcg: remove unneeded shmem charge type
  ...
2012-07-31 19:25:39 -07:00
Linus Torvalds 6dbb35b0a7 NFS client updates for Linux 3.6
Features include:
 - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
   separate modules.
 - Fix Oopses in the NFSv4 idmapper
 - Fix a deadlock whereby rpciod tries to allocate a new socket and
   ends up recursing into the NFS code due to memory reclaim.
 - Increase the number of permitted callback connections.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQGF+vAAoJEGcL54qWCgDy6ncP/jKVl/Yjz6i1b/8QtC0EmYFb
 XNwbjZicNupVM98XPLm+sfdWxvoGiHSMbwG8t/hx5Z6CteM18adeGNO1KqP9xQyg
 scPvFmqj5VJNHsHztcHIFHFYdpsuJqRKcW3TA2l8AkNOm6uBcnXArRosYN6LrXik
 hI/jWcyv8wZuooSQrLf463JK37t/s6LuUQo6jKP4582sbANHvPdLkHFCYg3yt1Sk
 2IIo2CLLs2cJUswGJnsHT76Jxfvh21NdGtvydjVjpr4H0/LY5GykBc23AAL9TWj6
 KlegDO902hLzEE93xazF59hQZrPuwBi3quEMJ3X0mjdNQWb4G96s/t2hmwpTjWyc
 mjpRsuaGlCXDxcrI5dnU52hqKa0Ju1ipbLpghxSVfilRy998xvGp5Yc6ADU+pYnt
 uP09lamCyjFEa+gDSqGt7lv4dFmdPJjWZdkXLPv0ah5sXVMYAT8NRLUfiE1+hLLu
 zKaM84PHSEvykFeJ2WvjDTgF3L55y3x6L3LN4UkKmfO5nqQf7csPcquU1NfHh25S
 jjKGq2SKGoYG1l6FwK3hemup5cnFUEX78E2QeLrmaHg92fJDGrz587i6MPc5jrSe
 sOSX/CpuCJd4VhodIo8T00me0GZSHEU7RP2KEc518ZvrPmz13avXeLeG9nYhjuxM
 cV9Ex8zh2Y4aiUXCy3uA
 =KSix
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull second wave of NFS client updates from Trond Myklebust:

 - Patches from Bryan to allow splitting of the NFSv2/v3/v4 code into
   separate modules.

 - Fix Oopses in the NFSv4 idmapper

 - Fix a deadlock whereby rpciod tries to allocate a new socket and ends
   up recursing into the NFS code due to memory reclaim.

 - Increase the number of permitted callback connections.

* tag 'nfs-for-3.6-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: explicitly reject LOCK_MAND flock() requests
  nfs: increase number of permitted callback connections.
  SUNRPC: return negative value in case rpcbind client creation error
  NFS: Convert v4 into a module
  NFS: Convert v3 into a module
  NFS: Convert v2 into a module
  NFS: Keep module parameters in the generic NFS client
  NFS: Split out remaining NFS v4 inode functions
  NFS: Pass super operations and xattr handlers in the nfs_subversion
  NFS: Only initialize the ACL client in the v3 case
  NFS: Create a try_mount rpc op
  NFS: Remove the NFS v4 xdev mount function
  NFS: Add version registering framework
  NFS: Fix a number of bugs in the idmapper
  nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
  sunrpc: clarify comments on rpc_make_runnable
  pnfsblock: bail out partial page IO
2012-07-31 18:45:44 -07:00
Mel Gorman 192e501b04 nfs: prevent page allocator recursions with swap over NFS.
GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO,
just not of any filesystem data.

The problem is that previously NOFS was correct because that avoids
recursion into the NFS code.  With swap-over-NFS, it is no longer correct
as swap IO can lead to this recursion.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:48 -07:00
Mel Gorman a564b8f039 nfs: enable swap on NFS
Implement the new swapfile a_ops for NFS and hook up ->direct_IO.  This
will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under
PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol
->connect() method.

PF_MEMALLOC should allow the allocation of struct socket and related
objects and the early (re)setting of SOCK_MEMALLOC should allow us to
receive the packets required for the TCP connection buildup.

[jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases]
[dfeng@redhat.com: Fix handling of multiple swap files]
[a.p.zijlstra@chello.nl: Original patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:48 -07:00
Mel Gorman 29418aa4bd nfs: disable data cache revalidation for swapfiles
The VM does not like PG_private set on PG_swapcache pages.  As suggested
by Trond in http://lkml.org/lkml/2006/8/25/348, this patch disables NFS
data cache revalidation on swap files.  as it does not make sense to have
other clients change the file while it is being used as swap.  This avoids
setting PG_private on swap pages, since there ought to be no further races
with invalidate_inode_pages2() to deal with.

Since we cannot set PG_private we cannot use page->private which is
already used by PG_swapcache pages to store the nfs_page.  Thus augment
the new nfs_page_find_request logic.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:47 -07:00
Mel Gorman d56b4ddf77 nfs: teach the NFS client how to treat PG_swapcache pages
Replace all relevant occurences of page->index and page->mapping in the
NFS client with the new page_file_index() and page_file_mapping()
functions.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: Neil Brown <neilb@suse.de>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31 18:42:47 -07:00
Linus Torvalds 08843b79fb Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux
Pull nfsd changes from J. Bruce Fields:
 "This has been an unusually quiet cycle--mostly bugfixes and cleanup.
  The one large piece is Stanislav's work to containerize the server's
  grace period--but that in itself is just one more step in a
  not-yet-complete project to allow fully containerized nfs service.

  There are a number of outstanding delegation, container, v4 state, and
  gss patches that aren't quite ready yet; 3.7 may be wilder."

* 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (35 commits)
  NFSd: make boot_time variable per network namespace
  NFSd: make grace end flag per network namespace
  Lockd: move grace period management from lockd() to per-net functions
  LockD: pass actual network namespace to grace period management functions
  LockD: manage grace list per network namespace
  SUNRPC: service request network namespace helper introduced
  NFSd: make nfsd4_manager allocated per network namespace context.
  LockD: make lockd manager allocated per network namespace
  LockD: manage grace period per network namespace
  Lockd: add more debug to host shutdown functions
  Lockd: host complaining function introduced
  LockD: manage used host count per networks namespace
  LockD: manage garbage collection timeout per networks namespace
  LockD: make garbage collector network namespace aware.
  LockD: mark host per network namespace on garbage collect
  nfsd4: fix missing fault_inject.h include
  locks: move lease-specific code out of locks_delete_lock
  locks: prevent side-effects of locks_release_private before file_lock is initialized
  NFSd: set nfsd_serv to NULL after service destruction
  NFSd: introduce nfsd_destroy() helper
  ...
2012-07-31 14:42:28 -07:00
Jeff Layton ad0fcd4eb6 nfs: explicitly reject LOCK_MAND flock() requests
We have no mechanism to emulate LOCK_MAND locks on NFSv4, so explicitly
return -EINVAL if someone requests it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-31 14:42:20 -04:00
NeilBrown b042414feb nfs: increase number of permitted callback connections.
By default a sunrpc service is limited to (N+3)*20 connections
where N is the number of threads.  This is 80 when N==1.
If this number is exceeded a warning is printed suggesting that
the number of threads be increased.  However with services which
run a single thread, this is impossible.

For such services there is a ->sv_maxconn setting that can be
used to forcibly increase the limit, and silence the message.
This is used by lockd.

The nfs client uses a sunrpc service to handle callbacks and
it too is single-threaded, so to avoid the useless messages,
and to allow a reasonable number of concurrent connections,
we need to set ->sv_maxconn.  1024 seems like a good number.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-31 12:33:29 -04:00
Linus Torvalds 1fad1e9a74 NFS client updates for Linux 3.6
Features include:
 - More preparatory patches for modularising NFSv2/v3/v4.
   Split out the various NFSv2/v3/v4-specific code into separate
   files
 - More preparation for the NFSv4 migration code
 - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold parameters
 - pNFS fast failover when the data servers are down
 - Various cleanups and debugging patches
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQFwYRAAoJEGcL54qWCgDy6YAQAIZuUPFCQbuzWev4L/XA0uhg
 0xjr9YBmg57UOs8e7FhS0azJip+D/kmF2Rx5gCMX0QO/IwaEVoms4BS8zjFOr3yL
 cPIyY5ErF+WJ25f3Mz8k0wOHTzrOvMdq4/7TQf5hztsrGYFid78sSb3O9ZO9bgb4
 QxqhgA0Z4S2vIqeObJAF9BT5UOT/cXfVKV0iTd52ZQVA+ZhqJ2SDlNFsoxnVOweV
 qzKOi0FZCPsjH+Z4a/LLdieHG5AYRPhOScBdG7gTFAdkysI8DDtSNCmJ68dMHYRw
 OHtyt/X4k9TImfwauV3Eww1sw3NkDswX+dv3GOSDTlMvVBrm0WXWj2zoHZbOB+0Q
 ETI9Zpolvd9pADtyfu9c+kCYIR+NdB4lGKd15iZfdEy/CZ0CtbLAbn5Szo2YcwNR
 iVjswR0jsbklD+mZjlMeZoG4JX2d9ZRqLs20POz7QQBmdIQ8jb9/SwVj9zGvX0w0
 NyVUHTFlGaNwkpo7oeRoWi1xjZWE6OzfoncV/46DOJWb08OKZ4fR4ugMYJWGi7Xx
 I4nQrIiHtInqL+sF5Y3wlZ48dufLuIhAcHh7klxgud4GRj7t8j+a99pTrRmpHvvC
 4K2Yu69VYcIA7N2RsSLohIllGPFmMwpdar7yERdD0So8/fz/zf7wn0dFFYY11+bL
 YNVVGhvNxccMU5EU9YR4
 =Lc59
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Features include:
   - More preparatory patches for modularising NFSv2/v3/v4.  Split out
     the various NFSv2/v3/v4-specific code into separate files
   - More preparation for the NFSv4 migration code
   - Ensure that OPEN(O_CREATE) observes the pNFS mds threshold
     parameters
   - pNFS fast failover when the data servers are down
   - Various cleanups and debugging patches"

* tag 'nfs-for-3.6-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (67 commits)
  nfs: fix fl_type tests in NFSv4 code
  NFS: fix pnfs regression with directio writes
  NFS: fix pnfs regression with directio reads
  sunrpc: clnt: Add missing braces
  nfs: fix stub return type warnings
  NFS: exit_nfs_v4() shouldn't be an __exit function
  SUNRPC: Add a missing spin_unlock to gss_mech_list_pseudoflavors
  NFS: Split out NFS v4 client functions
  NFS: Split out the NFS v4 filesystem types
  NFS: Create a single nfs_clone_super() function
  NFS: Split out NFS v4 server creating code
  NFS: Initialize the NFS v4 client from init_nfs_v4()
  NFS: Move the v4 getroot code to nfs4getroot.c
  NFS: Split out NFS v4 file operations
  NFS: Initialize v4 sysctls from nfs_init_v4()
  NFS: Create an init_nfs_v4() function
  NFS: Split out NFS v4 inode operations
  NFS: Split out NFS v3 inode operations
  NFS: Split out NFS v2 inode operations
  NFS: Clean up nfs4_proc_setclientid() and friends
  ...
2012-07-30 19:16:57 -07:00
Bryan Schumaker 89d77c8fa8 NFS: Convert v4 into a module
This patch exports symbols needed by the v4 module.  In addition, I also
switch over to using IS_ENABLED() to check if CONFIG_NFS_V4 or
CONFIG_NFS_V4_MODULE are set.

The module (nfs4.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v4.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:52 -04:00
Bryan Schumaker 1c606fb74c NFS: Convert v3 into a module
This patch exports symbols and moves over the final structures needed by
the v3 module.  In addition, I also switch over to using IS_ENABLED() to
check if CONFIG_NFS_V3 or CONFIG_NFS_V3_MODULE are set.

The module (nfs3.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v3.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:46 -04:00
Bryan Schumaker ddda8e0aa8 NFS: Convert v2 into a module
The module (nfs2.ko) will be created in the same directory as nfs.ko and
will be automatically loaded the first time you try to mount over NFS v2.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:41 -04:00
Bryan Schumaker fac1e8e4ef NFS: Keep module parameters in the generic NFS client
Otherwise we break backwards compatibility when v4 becomes a modules.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:31 -04:00
Bryan Schumaker 19d87ca362 NFS: Split out remaining NFS v4 inode functions
Somehow I missed this in my previous patch series, but these functions
are only needed by the v4 code and should be moved to a v4-only file.  I
wasn't exactly sure where I should put these functions, so I moved them
into nfs4super.c where I could make them static.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:20 -04:00
Bryan Schumaker 6a74490dca NFS: Pass super operations and xattr handlers in the nfs_subversion
I can set all variables in the nfs_fill_super() function, allowing me to
remove the nfs4_fill_super() function.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:06:05 -04:00
Bryan Schumaker 1179acc6a3 NFS: Only initialize the ACL client in the v3 case
v2 and v4 don't use it, so I create two new nfs_rpc_ops functions to
initialize the ACL client only when we are using v3.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:05:54 -04:00
Bryan Schumaker ff9099f266 NFS: Create a try_mount rpc op
I'm already looking up the nfs subversion in nfs_fs_mount(), so I have
easy access to rpc_ops that used to be difficult to reach.  This allows
me to set up a different mount path for NFS v2/3 and NFS v4.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:04:53 -04:00
Bryan Schumaker e8f25e6d6d NFS: Remove the NFS v4 xdev mount function
I can now share this code with the v2 and v3 code by using the NFS
subversion structure.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:04:45 -04:00
Bryan Schumaker ab7017a3a0 NFS: Add version registering framework
This patch adds in the code to track multiple versions of the NFS
protocol.  I created default structures for v2, v3 and v4 so that each
version can continue to work while I convert them into kernel modules.
I also removed the const parameter from the rpc_version array so that I
can change it at runtime.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 19:04:17 -04:00
David Howells a427b9ec4e NFS: Fix a number of bugs in the idmapper
Fix a number of bugs in the NFS idmapper code:

 (1) Only registered key types can be passed to the core keys code, so
     register the legacy idmapper key type.

     This is a requirement because the unregister function cleans up keys
     belonging to that key type so that there aren't dangling pointers to the
     module left behind - including the key->type pointer.

 (2) Rename the legacy key type.  You can't have two key types with the same
     name, and (1) would otherwise require that.

 (3) complete_request_key() must be called in the error path of
     nfs_idmap_legacy_upcall().

 (4) There is one idmap struct for each nfs_client struct.  This means that
     idmap->idmap_key_cons is shared without the use of a lock.  This is a
     problem because key_instantiate_and_link() - as called indirectly by
     idmap_pipe_downcall() - releases anyone waiting for the key to be
     instantiated.

     What happens is that idmap_pipe_downcall() running in the rpc.idmapd
     thread, releases the NFS filesystem in whatever thread that is running in
     to continue.  This may then make another idmapper call, overwriting
     idmap_key_cons before idmap_pipe_downcall() gets the chance to call
     complete_request_key().

     I *think* that reading idmap_key_cons only once, before
     key_instantiate_and_link() is called, and then caching the result in a
     variable is sufficient.

Bug (4) is the cause of:

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<          (null)>]           (null)
PGD 0
Oops: 0010 [#1] SMP
CPU 1
Modules linked in: ppdev parport_pc lp parport ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack nfs fscache xt_CHECKSUM auth_rpcgss iptable_mangle nfs_acl bridge stp llc lockd be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_realtek snd_usb_audio snd_hda_intel snd_hda_codec snd_seq snd_pcm snd_hwdep snd_usbmidi_lib snd_rawmidi snd_timer uvcvideo videobuf2_core videodev media videobuf2_vmalloc snd_seq_device videobuf2_memops e1000e vhost_net iTCO_wdt joydev coretemp snd soundcore macvtap macvlan i2c_i801 snd_page_alloc tun iTCO_vendor_support microcode kvm_intel kvm sunrpc hid_logitech_dj usb_storage i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Pid: 1229, comm: rpc.idmapd Not tainted 3.4.2-1.fc16.x86_64 #1 Gateway DX4710-UB801A/G33M05G1
RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
RSP: 0018:ffff8801a3645d40  EFLAGS: 00010246
RAX: ffff880077707e30 RBX: ffff880077707f50 RCX: ffff8801a18ccd80
RDX: 0000000000000006 RSI: ffff8801a3645e75 RDI: ffff880077707f50
RBP: ffff8801a3645d88 R08: ffff8801a430f9c0 R09: ffff8801a3645db0
R10: 000000000000000a R11: 0000000000000246 R12: ffff8801a18ccd80
R13: ffff8801a3645e75 R14: ffff8801a430f9c0 R15: 0000000000000006
FS:  00007fb6fb51a700(0000) GS:ffff8801afc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001a49b0000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rpc.idmapd (pid: 1229, threadinfo ffff8801a3644000, task ffff8801a3bf9710)
Stack:
 ffffffff81260878 ffff8801a3645db0 ffff8801a3645db0 ffff880077707a90
 ffff880077707f50 ffff8801a18ccd80 0000000000000006 ffff8801a3645e75
 ffff8801a430f9c0 ffff8801a3645dd8 ffffffff81260983 ffff8801a3645de8
Call Trace:
 [<ffffffff81260878>] ? __key_instantiate_and_link+0x58/0x100
 [<ffffffff81260983>] key_instantiate_and_link+0x63/0xa0
 [<ffffffffa057062b>] idmap_pipe_downcall+0x1cb/0x1e0 [nfs]
 [<ffffffffa0107f57>] rpc_pipe_write+0x67/0x90 [sunrpc]
 [<ffffffff8117f833>] vfs_write+0xb3/0x180
 [<ffffffff8117fb5a>] sys_write+0x4a/0x90
 [<ffffffff81600329>] system_call_fastpath+0x16/0x1b
Code:  Bad RIP value.
RIP  [<          (null)>]           (null)
 RSP <ffff8801a3645d40>
CR2: 0000000000000000

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Steve Dickson <steved@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org [>= 3.4]
2012-07-30 18:57:39 -04:00
Jeff Layton 5cf02d09b5 nfs: skip commit in releasepage if we're freeing memory for fs-related reasons
We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2012-07-30 18:55:59 -04:00
Peng Tao 159e0561e3 pnfsblock: bail out partial page IO
Current block layout driver read/write code assumes page
aligned IO in many places. Add a checker to validate the assumption.
Otherwise there would be data corruption like when application does
open(O_WRONLY) and page unaliged write.

Signed-off-by: Peng Tao <tao.peng@emc.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 18:52:06 -04:00
Jeff Layton f44106e217 nfs: fix fl_type tests in NFSv4 code
fl_type is not a bitmap.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 18:09:13 -04:00
Fred Isaman c95908e4c5 NFS: fix pnfs regression with directio writes
Commit 57208fa7e5 "NFS: Create an write_pageio_init() function"
did not modify the calls in direct.c, preventing direct io from
using pnfs.  This reintroduces that capability.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 18:09:06 -04:00
Fred Isaman 59948db3be NFS: fix pnfs regression with directio reads
Commit 1abb50886a "NFS: Create an read_pageio_init() function"
did not modify the call in direct.c, preventing direct io from
using pnfs.  This reintroduces that capability.

Signed-off-by: Fred Isaman <iisaman@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 18:08:53 -04:00
Randy Dunlap 0add3e8567 nfs: fix stub return type warnings
Fix numerous repeated warnings by making the stub function
void instead of non-void:

fs/nfs/nfs4_fs.h: In function 'nfs4_unregister_sysctl':
fs/nfs/nfs4_fs.h:385:1: warning: no return statement in function returning non-void

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-30 17:30:24 -04:00
Stanislav Kinsbursky 9695c7057f SUNRPC: service request network namespace helper introduced
This is a cleanup patch - makes code looks simplier.
It replaces widely used rqstp->rq_xprt->xpt_net by introduced SVC_NET(rqstp).

Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2012-07-27 16:49:21 -04:00
Linus Torvalds a66d2c8f7e Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull the big VFS changes from Al Viro:
 "This one is *big* and changes quite a few things around VFS.  What's in there:

   - the first of two really major architecture changes - death to open
     intents.

     The former is finally there; it was very long in making, but with
     Miklos getting through really hard and messy final push in
     fs/namei.c, we finally have it.  Unlike his variant, this one
     doesn't introduce struct opendata; what we have instead is
     ->atomic_open() taking preallocated struct file * and passing
     everything via its fields.

     Instead of returning struct file *, it returns -E...  on error, 0
     on success and 1 in "deal with it yourself" case (e.g.  symlink
     found on server, etc.).

     See comments before fs/namei.c:atomic_open().  That made a lot of
     goodies finally possible and quite a few are in that pile:
     ->lookup(), ->d_revalidate() and ->create() do not get struct
     nameidata * anymore; ->lookup() and ->d_revalidate() get lookup
     flags instead, ->create() gets "do we want it exclusive" flag.

     With the introduction of new helper (kern_path_locked()) we are rid
     of all struct nameidata instances outside of fs/namei.c; it's still
     visible in namei.h, but not for long.  Come the next cycle,
     declaration will move either to fs/internal.h or to fs/namei.c
     itself.  [me, miklos, hch]

   - The second major change: behaviour of final fput().  Now we have
     __fput() done without any locks held by caller *and* not from deep
     in call stack.

     That obviously lifts a lot of constraints on the locking in there.
     Moreover, it's legal now to call fput() from atomic contexts (which
     has immediately simplified life for aio.c).  We also don't need
     anti-recursion logics in __scm_destroy() anymore.

     There is a price, though - the damn thing has become partially
     asynchronous.  For fput() from normal process we are guaranteed
     that pending __fput() will be done before the caller returns to
     userland, exits or gets stopped for ptrace.

     For kernel threads and atomic contexts it's done via
     schedule_work(), so theoretically we might need a way to make sure
     it's finished; so far only one such place had been found, but there
     might be more.

     There's flush_delayed_fput() (do all pending __fput()) and there's
     __fput_sync() (fput() analog doing __fput() immediately).  I hope
     we won't need them often; see warnings in fs/file_table.c for
     details.  [me, based on task_work series from Oleg merged last
     cycle]

   - sync series from Jan

   - large part of "death to sync_supers()" work from Artem; the only
     bits missing here are exofs and ext4 ones.  As far as I understand,
     those are going via the exofs and ext4 trees resp.; once they are
     in, we can put ->write_super() to the rest, along with the thread
     calling it.

   - preparatory bits from unionmount series (from dhowells).

   - assorted cleanups and fixes all over the place, as usual.

  This is not the last pile for this cycle; there's at least jlayton's
  ESTALE work and fsfreeze series (the latter - in dire need of fixes,
  so I'm not sure it'll make the cut this cycle).  I'll probably throw
  symlink/hardlink restrictions stuff from Kees into the next pile, too.
  Plus there's a lot of misc patches I hadn't thrown into that one -
  it's large enough as it is..."

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits)
  ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file()
  btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file()
  switch dentry_open() to struct path, make it grab references itself
  spufs: shift dget/mntget towards dentry_open()
  zoran: don't bother with struct file * in zoran_map
  ecryptfs: don't reinvent the wheels, please - use struct completion
  don't expose I_NEW inodes via dentry->d_inode
  tidy up namei.c a bit
  unobfuscate follow_up() a bit
  ext3: pass custom EOF to generic_file_llseek_size()
  ext4: use core vfs llseek code for dir seeks
  vfs: allow custom EOF in generic_file_llseek code
  vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes
  vfs: Remove unnecessary flushing of block devices
  vfs: Make sys_sync writeout also block device inodes
  vfs: Create function for iterating over block devices
  vfs: Reorder operations during sys_sync
  quota: Move quota syncing to ->sync_fs method
  quota: Split dquot_quota_sync() to writeback and cache flushing part
  vfs: Move noop_backing_dev_info check from sync into writeback
  ...
2012-07-23 12:27:27 -07:00
Linus Torvalds ce9f8d6b39 Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd
Pull pnfs/ore fixes from Boaz Harrosh:
 "These are catastrophic fixes to the pnfs objects-layout that were just
  discovered.  They are also destined for @stable.

  I have found these and worked on them at around RC1 time but
  unfortunately went to the hospital for kidney stones and had a very
  slow recovery.  I refrained from sending them as is, before proper
  testing, and surly I have found a bug just yesterday.

  So now they are all well tested, and have my sign-off.  Other then
  fixing the problem at hand, and assuming there are no bugs at the new
  code, there is low risk to any surrounding code.  And in anyway they
  affect only these paths that are now broken.  That is RAID5 in pnfs
  objects-layout code.  It does also affect exofs (which was not broken)
  but I have tested exofs and it is lower priority then objects-layout
  because no one is using exofs, but objects-layout has lots of users."

* 'for-linus' of git://git.open-osd.org/linux-open-osd:
  pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
  pnfs-obj: don't leak objio_state if ore_write/read fails
  ore: Unlock r4w pages in exact reverse order of locking
  ore: Remove support of partial IO request (NFS crash)
  ore: Fix NFS crash by supporting any unaligned RAID IO
2012-07-20 11:43:53 -07:00
Boaz Harrosh c999ff6802 pnfs-obj: Fix __r4w_get_page when offset is beyond i_size
It is very common for the end of the file to be unaligned on
stripe size. But since we know it's beyond file's end then
the XOR should be preformed with all zeros.

Old code used to just read zeros out of the OSD devices, which is a great
waist. But what scares me more about this situation is that, we now have
pages attached to the file's mapping that are beyond i_size. I don't
like the kind of bugs this calls for.

Fix both birds, by returning a global zero_page, if offset is beyond
i_size.

TODO:
	Change the API to ->__r4w_get_page() so a NULL can be
	returned without being considered as error, since XOR API
	treats NULL entries as zero_pages.

[Bug since 3.2. Should apply the same way to all Kernels since]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:31 +03:00
Boaz Harrosh 9909d45a85 pnfs-obj: don't leak objio_state if ore_write/read fails
[Bug since 3.2 Kernel]
CC: Stable Tree <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2012-07-20 11:50:30 +03:00
Bryan Schumaker bb6e071f84 NFS: exit_nfs_v4() shouldn't be an __exit function
... yet.  Right now, init_nfs() is calling this function if an error is
encountered when loading the nfs module.  An __exit function can't be
called from one declared as __init.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 17:02:57 -04:00
Bryan Schumaker ec409897e7 NFS: Split out NFS v4 client functions
These functions are only needed by NFS v4, so they can be moved into a
v4 specific file.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:56 -04:00
Bryan Schumaker fbdefd6442 NFS: Split out the NFS v4 filesystem types
This allows me to move the v4 mounting and unmounting functions out of
the generic client and into a file that is only compiled when CONFIG_NFS_V4
is enabled.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:55 -04:00
Bryan Schumaker 3cadf4b864 NFS: Create a single nfs_clone_super() function
v2 and v3 shared a function for this, but v4 implemented something only
slightly different.  Might as well share code whenever possible...

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:54 -04:00
Bryan Schumaker fcf10398f6 NFS: Split out NFS v4 server creating code
These functions are specific to NFS v4 and can be moved to nfs4client.c
to keep them out of the generic client.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:53 -04:00
Bryan Schumaker 428360d77c NFS: Initialize the NFS v4 client from init_nfs_v4()
And split these functions out of the generic client into a v4 specific
file.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:52 -04:00
Bryan Schumaker a38a9eac75 NFS: Move the v4 getroot code to nfs4getroot.c
Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:51 -04:00
Bryan Schumaker ce4ef7c0a8 NFS: Split out NFS v4 file operations
This patch moves the NFS v4 file functions into a new file that is only
compiled when CONFIG_NFS_V4 is enabled.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:50 -04:00
Bryan Schumaker 466bfe7f4a NFS: Initialize v4 sysctls from nfs_init_v4()
And split them out of the generic client into their own file.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:18 -04:00
Bryan Schumaker 129d1977ed NFS: Create an init_nfs_v4() function
I want to initialize all of NFS v4 in a single function that will
eventually be used as the v4 module init function.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:13 -04:00
Bryan Schumaker 73a79706d7 NFS: Split out NFS v4 inode operations
The NFS v4 file inode operations are already already in nfs4proc.c, so
this patch just needs to move the directory operations to the same file.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:05 -04:00
Bryan Schumaker ab96291ea1 NFS: Split out NFS v3 inode operations
This patch moves the NFS v3 file and directory inode functions into
files that are only compiled whet CONFIG_NFS_V3 is enabled.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:33:03 -04:00
Bryan Schumaker 597d92891b NFS: Split out NFS v2 inode operations
This patch moves the NFS v2 file and directory inode functions into
files that are only compiled whet CONFIG_NFS_V2 is enabled.

Signed-off-by: Bryan Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-17 13:32:55 -04:00
Chuck Lever 6bbb4ae8ff NFS: Clean up nfs4_proc_setclientid() and friends
Add documenting comments and appropriate debugging messages.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 15:12:16 -04:00
Chuck Lever de73483122 NFS: Treat NFS4ERR_CLID_INUSE as a fatal error
For NFSv4 minor version 0, currently the cl_id_uniquifier allows the
Linux client to generate a unique nfs_client_id4 string whenever a
server replies with NFS4ERR_CLID_INUSE.

This implementation seems to be based on a flawed reading of RFC
3530.  NFS4ERR_CLID_INUSE actually means that the client has presented
this nfs_client_id4 string with a different principal at some time in
the past, and that lease is still in use on the server.

For a Linux client this might be rather difficult to achieve: the
authentication flavor is named right in the nfs_client_id4.id
string.  If we change flavors, we change strings automatically.

So, practically speaking, NFS4ERR_CLID_INUSE means there is some other
client using our string.  There is not much that can be done to
recover automatically.  Let's make it a permanent error.

Remove the recovery logic in nfs4_proc_setclientid(), and remove the
cl_id_uniquifier field from the nfs_client data structure.  And,
remove the authentication flavor from the nfs_client_id4 string.

Keeping the authentication flavor in the nfs_client_id4.id string
means that we could have a separate lease for each authentication
flavor used by mounts on the client.  But we want just one lease for
all the mounts on this client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 15:12:16 -04:00
Chuck Lever 46a87b8a7b NFS: When state recovery fails, waiting tasks should exit
NFSv4 state recovery is not always successful.  Failure is signalled
by setting the nfs_client.cl_cons_state to a negative (errno) value,
then waking waiters.

Currently this can happen only during mount processing.  I'm about to
add an explicit case where state recovery failure during normal
operation should force all NFS requests waiting on that state recovery
to exit.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 15:12:15 -04:00
Chuck Lever 6a1a1e34dc SUNRPC: Add rpcauth_list_flavors()
The gss_mech_list_pseudoflavors() function provides a list of
currently registered GSS pseudoflavors.  This list does not include
any non-GSS flavors that have been registered with the RPC client.
nfs4_find_root_sec() currently adds these extra flavors by hand.

Instead, nfs4_find_root_sec() should be looking at the set of flavors
that have been explicitly registered via rpcauth_register().  And,
other areas of code will soon need the same kind of list that
contains all flavors the kernel currently knows about (see below).

Rather than cloning the open-coded logic in nfs4_find_root_sec() to
those new places, introduce a generic RPC function that generates a
full list of registered auth flavors and pseudoflavors.

A new rpc_authops method is added that lists a flavor's
pseudoflavors, if it has any.  I encountered an interesting module
loader loop when I tried to get the RPC client to invoke
gss_mech_list_pseudoflavors() by name.

This patch is a pre-requisite for server trunking discovery, and a
pre-requisite for fixing up the in-kernel mount client to do better
automatic security flavor selection.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 15:12:15 -04:00
Chuck Lever 56d08fef23 NFS: nfs_getaclargs.acl_len is a size_t
Squelch compiler warnings:

fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’:
fs/nfs/nfs4proc.c:3811:14: warning: comparison between signed and
	unsigned integer expressions [-Wsign-compare]
fs/nfs/nfs4proc.c:3818:15: warning: comparison between signed and
	unsigned integer expressions [-Wsign-compare]

Introduced by commit bf118a34 "NFSv4: include bitmap in nfsv4 get
acl data", Dec 7, 2011.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:53:43 -04:00
Chuck Lever 38527b153a NFS: Clean up TEST_STATEID and FREE_STATEID error reporting
As a finishing touch, add appropriate documenting comments and some
debugging printk's.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:53:34 -04:00
Chuck Lever 3e60ffdd36 NFS: Clean up nfs41_check_expired_stateid()
Clean up: Instead of open-coded flag manipulation, use test_bit() and
clear_bit() just like all other accessors of the state->flag field.
This also eliminates several unnecessary implicit integer type
conversions.

To make it absolutely clear what is going on, a number of comments
are introduced.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:49:40 -04:00
Chuck Lever eb64cf964d NFS: State reclaim clears OPEN and LOCK state
The "state->flags & flags" test in nfs41_check_expired_stateid()
allows the state manager to squelch a TEST_STATEID operation when
it is known for sure that a state ID is no longer valid.  If the
lease was purged, for example, the client already knows that state
ID is now defunct.

But open recovery is still needed for that inode.

To force a call to nfs4_open_expired(), change the default return
value for nfs41_check_expired_stateid() to force open recovery, and
the default return value for nfs41_check_locks() to force lock
recovery, if the requested flags are clear.  Fix suggested by Bryan
Schumaker.

Also, the presence of a delegation state ID must not prevent normal
open recovery.  The delegation state ID must be cleared if it was
revoked, but once cleared I don't think it's presence or absence has
any bearing on whether open recovery is still needed.  So the logic
is adjusted to ignore the TEST_STATEID result for the delegation
state ID.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:48:53 -04:00
Chuck Lever 89af273958 NFS: Don't free a state ID the server does not recognize
The result of a TEST_STATEID operation can indicate a few different
things:

  o If NFS_OK is returned, then the client can continue using the
    state ID under test, and skip recovery.

  o RFC 5661 says that if the state ID was revoked, then the client
    must perform an explicit FREE_STATEID before trying to re-open.

  o If the server doesn't recognize the state ID at all, then no
    FREE_STATEID is needed, and the client can immediately continue
    with open recovery.

Let's err on the side of caution: if the server clearly tells us the
state ID is unknown, we skip the FREE_STATEID.  For any other error,
we issue a FREE_STATEID.  Sometimes that FREE_STATEID will be
unnecessary, but leaving unused state IDs on the server needlessly
ties up resources.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:48:10 -04:00
Chuck Lever 377e507d15 NFS: Fix up TEST_STATEID and FREE_STATEID return code handling
The TEST_STATEID and FREE_STATEID operations can return
-NFS4ERR_BAD_STATEID, -NFS4ERR_OLD_STATEID, or -NFS4ERR_DEADSESSION.

nfs41_{test,free}_stateid() should not pass these errors to
nfs4_handle_exception() during state recovery, since that will
recursively kick off state recovery again, resulting in a deadlock.

In particular, when the TEST_STATEID operation returns NFS4_OK,
res.status can contain one of these errors.  _nfs41_test_stateid()
replaces NFS4_OK with the value in res.status, which is then returned
to callers.

But res.status is not passed through nfs4_stat_to_errno(), and thus is
a positive NFS4ERR value.  Currently callers are only interested in
!NFS4_OK, and nfs4_handle_exception() ignores positive values.

Thus the res.status values are currently ignored by
nfs4_handle_exception() and won't cause the deadlock above.  Thanks to
this missing negative, it is only when these operations fail (which
is very rare) that a deadlock can occur.

Bryan agrees the original intent was to return res.status as a
negative NFS4ERR value to callers of nfs41_test_stateid().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:47:52 -04:00
Andy Adamson 293b3b065c NFSv4.1 do not send LAYOUTRETURN on emtpy plh_segs list
mark_matching_lsegs_invalid() resets the mds_threshold counters and can
dereference the layout hdr on an initial empty plh_segs list. It returns 0 both
in the case of an initial empty list and in a non-emtpy list that was cleared
by calls to mark_lseg_invalid.

Don't send a LAYOUTRETURN if the list was initially empty.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:39:00 -04:00
Andy Adamson 366d50521c NFSv4.1 mark layout when already returned
When the file layout driver is fencing a DS, _pnfs_return_layout can be
called mulitple times per inode due to in-flight i/o referencing lsegs on it's
plh_segs list.

Remember that LAYOUTRETURN has been called, and do not call it again.
Allow LAYOUTRETURNs after a subsequent LAYOUTGET.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:37:25 -04:00
Andy Adamson baf6c2a44a NFSv4.1 don't send LAYOUTCOMMIT if data resent through MDS
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:37:00 -04:00
Andy Adamson 82c7c7a5a9 NFSv4.1 return the LAYOUT for each file with failed DS connection I/O
First mark the deviceid invalid to prevent any future use. Then fence all
files involved in I/O to a DS with a connection error by sending a
LAYOUTRETURN.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-07-16 14:36:52 -04:00
Trond Myklebust 8626e4a426 Merge commit '9249e17fe094d853d1ef7475dd559a2cc7e23d42' into nfs-for-3.6
Resolve conflicts with the VFS atomic open and sget changes.

Conflicts:
	fs/nfs/nfs4proc.c
2012-07-16 12:01:42 -04:00
David Howells 9249e17fe0 VFS: Pass mount flags to sget()
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:34 +04:00
Al Viro ebfc3b49a7 don't pass nameidata to ->create()
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:47 +04:00
Al Viro 00cd8dd3bf stop passing nameidata to ->lookup()
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:32 +04:00