Commit graph

469020 commits

Author SHA1 Message Date
NeilBrown 353db79662 NFS: avoid waiting at all in nfs_release_page when congested.
If nfs_release_page() is called on a sequence of pages which are all
in the same file which is blocked on COMMIT, each page could
contribute a 1 second delay which could be come excessive.  I have
seen delays of as much as 208 seconds.

To keep the delay to one second, mark the bdi as write-congested
if the commit didn't finished.  Once it does finish, the
write-congested flag will be cleared by nfs_commit_release_pages().

With this, the longest total delay in try_to_free_pages that I have
seen is under 3 seconds.  With no waiting in nfs_release_page at all
I have seen delays of nearly 1.5 seconds.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:38 -04:00
NeilBrown 9590544694 NFS: avoid deadlocks with loop-back mounted NFS filesystems.
Support for loop-back mounted NFS filesystems is useful when NFS is
used to access shared storage in a high-availability cluster.

If the node running the NFS server fails, some other node can mount the
filesystem and start providing NFS service.  If that node already had
the filesystem NFS mounted, it will now have it loop-back mounted.

nfsd can suffer a deadlock when allocating memory and entering direct
reclaim.
While direct reclaim does not write to the NFS filesystem it can send
and wait for a COMMIT through nfs_release_page().

This patch modifies nfs_release_page() to wait a limited time for the
commit to complete - one second.  If the commit doesn't complete
in this time, nfs_release_page() will fail.  This means it might now
fail in some cases where it wouldn't before.  These cases are only
when 'gfp' includes '__GFP_WAIT'.

nfs_release_page() is only called by try_to_release_page(), and that
can only be called on an NFS page with required 'gfp' flags from
 - page_cache_pipe_buf_steal() in splice.c
 - shrink_page_list() in vmscan.c
 - invalidate_inode_pages2_range() in truncate.c

The first two handle failure quite safely.  The last is only called
after ->launder_page() has been called, and that will have waited
for the commit to finish already.

So aborting if the commit takes longer than 1 second is perfectly safe.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:28 -04:00
NeilBrown a4796e37c1 MM: export page_wakeup functions
This will allow NFS to wait for PG_private to be cleared and,
particularly, to send a wake-up when it is.

Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:25:09 -04:00
NeilBrown cbbce82209 SCHED: add some "wait..on_bit...timeout()" interfaces.
In commit c1221321b7
   sched: Allow wait_on_bit_action() functions to support a timeout

I suggested that a "wait_on_bit_timeout()" interface would not meet my
need.  This isn't true - I was just over-engineering.

Including a 'private' field in wait_bit_key instead of a focused
"timeout" field was just premature generalization.  If some other
use is ever found, it can be generalized or added later.

So this patch renames "private" to "timeout" with a meaning "stop
waiting when "jiffies" reaches or passes "timeout",
and adds two of the many possible wait..bit..timeout() interfaces:

wait_on_page_bit_killable_timeout(), which is the one I want to use,
and out_of_line_wait_on_bit_timeout() which is a reasonably general
example.  Others can be added as needed.

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-25 08:23:57 -04:00
NeilBrown e87b4c7a7a NFS: don't use STABLE writes during writeback.
commit b31268ac79
  FS: Use stable writes when not doing a bulk flush

was a bit heavy handed.
The particular problem that lead to this patch was that
small writes to an O_SYNC file we being written as UNSTABLE writes
followed by a commit.
This is appropriate for large writes (which require multiple NFS
requests) but for small writes (single NFS request), using
NFS_FILE_SYNC is more efficient.

So that patch causes the code to select between the two methods
depending on how many nfs requests get generated.

Unfortunately this ends up applying to non O_SYNC writes as well.
In particular if you memory-map a file and update random pages, then
when they are eventually written out by writeback they will go as
NFS_FILE_SYNC.  This is inefficient and slows down the application.

So: only set FLUSH_COND_STABLE when wbc->sync_mode is WB_SYNC_ALL.
With this patch:
 O_SYNC writes are NFS_FILE_SYNC for single requests, and NFS_UNSTABLE
    followed by COMMIT for multiple requests
 Writing immediately before close of fsync follow the same pattern.
 Non-O_SYNC writes without an fsync of close eventually get flushed
 out as UNSTABLE and a commit follows eventually as appropriate.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:23:02 -04:00
NeilBrown 8478eaa16e NFSv4: use exponential retry on NFS4ERR_DELAY for async requests.
Currently asynchronous NFSv4 request will be retried with
exponential timeout (from 1/10 to 15 seconds), but async
requests will always use a 15second retry.

Some "async" requests are really synchronous though.  The
async mechanism is used to allow the request to continue if
the requesting process is killed.
In those cases, an exponential retry is appropriate.

For example, if two different clients both open a file and
get a READ delegation, and one client then unlinks the file
(while still holding an open file descriptor), that unlink
will used the "silly-rename" handling which is async.
The first rename will result in NFS4ERR_DELAY while the
delegation is reclaimed from the other client.  The rename
will not be retried for 15 seconds, causing an unlink to take
15 seconds rather than 100msec.

This patch only added exponential timeout for async unlink and
async rename.  Other async calls, such as 'close' are sometimes
waited for so they might benefit from exponential timeout too.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:22:47 -04:00
Jason Baron 3dedbb5ca1 rpc: Add -EPERM processing for xs_udp_send_request()
If an iptables drop rule is added for an nfs server, the client can end up in
a softlockup. Because of the way that xs_sendpages() is structured, the -EPERM
is ignored since the prior bits of the packet may have been successfully queued
and thus xs_sendpages() returns a non-zero value. Then, xs_udp_send_request()
thinks that because some bits were queued it should return -EAGAIN. We then try
the request again and again, resulting in cpu spinning. Reproducer:

1) open a file on the nfs server '/nfs/foo' (mounted using udp)
2) iptables -A OUTPUT -d <nfs server ip> -j DROP
3) write to /nfs/foo
4) close /nfs/foo
5) iptables -D OUTPUT -d <nfs server ip> -j DROP

The softlockup occurs in step 4 above.

The previous patch, allows xs_sendpages() to return both a sent count and
any error values that may have occurred. Thus, if we get an -EPERM, return
that to the higher level code.

With this patch in place we can successfully abort the above sequence and
avoid the softlockup.

I also tried the above test case on an nfs mount on tcp and although the system
does not softlockup, I still ended up with the 'hung_task' firing after 120
seconds, due to the i/o being stuck. The tcp case appears a bit harder to fix,
since -EPERM appears to get ignored much lower down in the stack and does not
propogate up to xs_sendpages(). This case is not quite as insidious as the
softlockup and it is not addressed here.

Reported-by: Yigong Lou <ylou@akamai.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:13:46 -04:00
Jason Baron f279cd008f rpc: return sent and err from xs_sendpages()
If an error is returned after the first bits of a packet have already been
successfully queued, xs_sendpages() will return a positive 'int' value
indicating success. Callers seem to treat this as -EAGAIN.

However, there are cases where its not a question of waiting for the write
queue to drain. For example, when there is an iptables rule dropping packets
to the destination, the lower level code can return -EPERM only after parts
of the packet have been successfully queued. In this case, we can end up
continuously retrying resulting in a kernel softlockup.

This patch is intended to make no changes in behavior but is in preparation for
subsequent patches that can make decisions based on both on the number of bytes
sent by xs_sendpages() and any errors that may have be returned.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:13:37 -04:00
Benjamin Coddington 173b3afcee lockd: Try to reconnect if statd has moved
If rpc.statd is restarted, upcalls to monitor hosts can fail with
ECONNREFUSED.  In that case force a lookup of statd's new port and retry the
upcall.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:08:43 -04:00
Benjamin Coddington a743419f42 SUNRPC: Don't wake tasks during connection abort
When aborting a connection to preserve source ports, don't wake the task in
xs_error_report.  This allows tasks with RPC_TASK_SOFTCONN to succeed if the
connection needs to be re-established since it preserves the task's status
instead of setting it to the status of the aborting kernel_connect().

This may also avoid a potential conflict on the socket's lock.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:06:56 -04:00
Olga Kornievskaia 8faaa6d5d4 Fixing lease renewal
Commit c9fdeb28 removed a 'continue' after checking if the lease needs
to be renewed. However, if client hasn't moved, the code falls down to
starting reboot recovery erroneously (ie., sends open reclaim and gets
back stale_clientid error) before recovering from getting stale_clientid
on the renew operation.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: c9fdeb280b (NFS: Add basic migration support to state manager thread)
Cc: stable@vger.kernel.org # 3.13+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-24 23:03:15 -04:00
Trond Myklebust 5466112f09 pnfs/blocklayout: Fix a 64-bit division/remainder issue in bl_map_stripe
kbuild test robot reports:

   fs/built-in.o: In function `bl_map_stripe':
   >> :(.text+0x965b4): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x965cc): undefined reference to `__aeabi_uldivmod'
   >> :(.text+0x96604): undefined reference to `__aeabi_uldivmod'

Fixes: 5c83746a0c (pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing)
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-21 14:20:20 -04:00
Stephen Rothwell b262b35c2c pnfs/blocklayout: include vmalloc.h for __vmalloc
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-15 19:33:28 -04:00
Peng Tao 88ac815cdb nfs41: change PNFS_LAYOUTRET_ON_SETATTR to only return on truncation to smaller size
Both blocks layout and objects layout want to use it to avoid CB_LAYOUTRECALL
but that should only happen if client is doing truncation to a smaller size.
For other cases, we let server decide if it wants to recall client's layouts.
Change PNFS_LAYOUTRET_ON_SETATTR to follow the logic and not to send
layoutreturn unnecessarily.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 14:03:20 -04:00
Anna Schumaker cb8c20fa53 NFS: Move NFS v3 acl functions to nfs3_fs.h
This code is internal to the v3 module, so other parts of the client
shouldn't have any knowledge of it.

nfs3_getxattr(), nfs3_setxattr(), and nfs3_removexattr() no longer exist
anywhere so I remove the declarations while I'm here.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:50:26 -04:00
Anna Schumaker f08460dc23 NFS: Remove v3 not compiled check from validate_mount_data()
This check is already performed by the module loading code - if the
module can't be found then -EPROTONOSUPPORT will be returned.  Let's
handle v3 this way, too.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:50:20 -04:00
Anna Schumaker 00a36a1090 NFS: Move v3 declarations out of internal.h
I am generally against the "one big header file" approach, and
everything in the client includes this file.  Let's move all the NFS v3
declarations into a v3-only header file.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:49:40 -04:00
Anna Schumaker f418c64b71 NFS: Unconditionally enable commit code
The goal is to create a generic NFS module with code that does not
depend on what versions of NFS are enabled.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:49:31 -04:00
Trond Myklebust 164ae58c3c pNFS/blocklayout: Remove a couple of unused variables
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:34:54 -04:00
Christoph Hellwig 84c9dee3ad pnfs: enable CB_NOTIFY_DEVICEID support
This code has been around for a while, but never was enabled, although
it is in a working shape.

Note that we implement NOTIFY_DEVICEID4_CHANGE identical to
NOTIFY_DEVICEID4_DELETE.  Given that in either case we can't do anything
but preventing further lookups of a given device ID there isn't much difference
in semantics for the two.  For the delete case the server MUST ensure that
there are no outstanding layouts, while for the change case it doesn't, but
that has little relevance to the client.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig 5c83746a0c pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing
This patches moves parsing of the GETDEVICEINFO XDR to kernel space, as well
as the management of complex devices.  The reason for that is we might have
multiple outstanding complex devices after a NOTIFY_DEVICEID4_CHANGE, which
device mapper or md can't handle as they claim devices exclusively.

But as is turns out simple striping / concatenation is fairly trivial to
implement anyway, so we make our life simpler by reducing the reliance
on blkmapd.  For now we still use blkmapd by feeding it synthetic SIMPLE
device XDR to translate device signatures to device numbers, but in the
long runs I have plans to eliminate it entirely.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig 871760ce97 pnfs/blocklayout: move all rpc_pipefs related code into a single file
Create a file to house all the rpc_pipefs boilerplate code instead of
sprinkling it over a few files.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:50 -04:00
Christoph Hellwig ca0fe1dfa5 pnfs/blocklayout: refactor extent processing
Factor out a helper for all per-extent work, and merge the now trivial
functions for lseg allocation and parsing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:49 -04:00
Christoph Hellwig 9cc4754117 pnfs/blocklayout: move extent processing to blocklayout.c
This isn't device(id) related, so move it into the main file.  Simple move
for now, the next commit will clean it up a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:33:49 -04:00
Christoph Hellwig 34dc93c2fc pnfs/blocklayout: allocate separate pages for the layoutcommit payload
Instead of overflowing the XDR send buffer with our extent list allocate
pages and pre-encode the layoutupdate payload into them.  We optimistically
allocate a single page use alloc_page and only switch to vmalloc when we
have more extents outstanding.  Currently there is only a single testcase
(xfstests generic/113) which can reproduce large enough extent lists for
this to occur.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:22:45 -04:00
Christoph Hellwig d4b18c3e00 pnfs: remove GETDEVICELIST implementation
The current GETDEVICELIST implementation is buggy in that it doesn't handle
cursors correctly, and in that it returns an error if the server returns
NFSERR_NOTSUPP.  Given that there is no actual need for GETDEVICELIST,
it has various issues and might get removed for NFSv4.2 stop using it in
the blocklayout driver, and thus the Linux NFS client as whole.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:54 -04:00
Christoph Hellwig fd41b4748b pnfs/objlayout: fix endianess annotation in objio_alloc_deviceid_node
The kbuild test robot complained about a new sparse warning in
objio_alloc_deviceid_node, but it turns out that this was just a moved
reference to an existing variable.  Fix it to have the right big endian
annotated type.

Note that there are some other endianess issues in this file that I didn't
bother to sort out as they involve global headers.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:43 -04:00
Christoph Hellwig 3e3f6b4e26 pnfs/blocklayout: remove some debugging
The kbuild test robot complained that we got the printk format wrong.
Let's just kill these printks instead of fixing them as there is not
point after the initial tree algorithm debugging.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-12 13:20:35 -04:00
Jeff Layton 8d11620e1e nfs: add __acquires and __releases annotations to seqfile start/stop routines
To make sparse happy...

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Jeff Layton dad2b015bb nfs: fix RCU cl_xprt handling in nfs_swap_activate/deactivate
sparse says:

fs/nfs/file.c:543:60: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:543:60:    expected struct rpc_xprt *xprt
fs/nfs/file.c:543:60:    got struct rpc_xprt [noderef] <asn:4>*cl_xprt
fs/nfs/file.c:548:53: warning: incorrect type in argument 1 (different address spaces)
fs/nfs/file.c:548:53:    expected struct rpc_xprt *xprt
fs/nfs/file.c:548:53:    got struct rpc_xprt [noderef] <asn:4>*cl_xprt

cl_xprt is RCU-managed, so we need to take care to dereference and use
it while holding the RCU read lock.

Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 08a899d5d9 nfs: setattr can only change regular file sizes
The VFS never calls setattr with ATTR_SIZE on anything but regular
files.  Remove the if check and turn it into an assert similar to
what some other file systems do.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 20d655d619 pnfs/blocklayout: use the device id cache
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 30ff0603ca pnfs: add a nfs4_get_deviceid helper
This will be used by the block layout driver when splitting extents.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 9dd2fcd32f pnfs: add a common GETDEVICELIST implementation
At a simple helper to issue a GETDEVICELIST operation and pre-load
the device id cache based on the result.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:04 -07:00
Christoph Hellwig 661373b13d pnfs: factor GETDEVICEINFO implementations
Add support to the common pNFS core to issue GETDEVICEINFO calls on
a device ID cache miss.  The code is taken from the well debugged
file layout implementation and calls out to the layoutdriver through
a new alloc_deviceid_node method.  The calling conventions for
nfs4_find_get_deviceid are changed so that all information needed to
send a GETDEVICEINFO request is passed to the common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 848746bd24 pnfs/blocklayout: return layouts on setattr
This speads up truncate-heavy workloads like fsx by multiple orders of
magnitude.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 71d5b76302 pnfs/blocklayout: implement the return_range method
This allows removing extents from the extent tree especially on truncate
operations, and thus fixing reads from truncated and re-extended that
previously returned stale data.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8067253c8c pnfs/blocklayout: rewrite extent tracking
Currently the block layout driver tracks extents in three separate
data structures:

 - the two list of pnfs_block_extent structures returned by the server
 - the list of sectors that were in invalid state but have been written to
 - a list of pnfs_block_short_extent structures for LAYOUTCOMMIT

All of these share the property that they are not only highly inefficient
data structures, but also that operations on them are even more inefficient
than nessecary.

In addition there are various implementation defects like:

 - using an int to track sectors, causing corruption for large offsets
 - incorrect normalization of page or block granularity ranges
 - insufficient error handling
 - incorrect synchronization as extents can be modified while they are in
   use

This patch replace all three data with a single unified rbtree structure
tracking all extents, as well as their in-memory state, although we still
need to instance for read-only and read-write extent due to the arcane
client side COW feature in the block layouts spec.

To fix the problem of extent possibly being modified while in use we make
sure to return a copy of the extent for use in the write path - the
extent can only be invalidated by a layout recall or return which has
to wait until the I/O operations finished due to refcounts on the layout
segment.

The new extent tree work similar to the schemes used by block based
filesystems like XFS or ext4.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 8c792ea940 pnfs/blocklayout: don't set pages uptodate
The core nfs code handles setting pages uptodate on reads, no need to mess
with the pageflags outselves.  Also remove a debug function to dump page
flags.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 3a6fd1f004 pnfs/blocklayout: remove read-modify-write handling in bl_write_pagelist
Use the new PNFS_READ_WHOLE_PAGE flag to offload read-modify-write
handling to core nfs code, and remove a huge chunk of deadlock prone
mess from the block layout writeback path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig c88953d87f pnfs: add return_range method
If a layout driver keeps per-inode state outside of the layout segments it
needs to be notified of any layout returns or recalls on an inode, and not
just about the freeing of layout segments.  Add a method to acomplish this,
which will allow the block layout driver to handle the case of truncated
and re-expanded files properly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:03 -07:00
Christoph Hellwig 612aa983a0 pnfs: add flag to force read-modify-write in ->write_begin
Like all block based filesystems, the pNFS block layout driver can't read
or write at a byte granularity and thus has to perform read-modify-write
cycles on writes smaller than this granularity.

Add a flag so that the core NFS code always reads a whole page when
starting a smaller write, so that we can do it in the place where the VFS
expects it instead of doing in very deadlock prone way in the writeback
handler.

Note that in theory we could do less than page size reads here for disks
that have a smaller sector size which are served by a server with a smaller
pnfs block size.  But so far that doesn't seem like a worthwhile
optimization.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 7c5d187581 pnfs: force a layout commit when encountering busy segments during recall
Expedite layout recall processing by forcing a layout commit when
we see busy segments.  Without it the layout recall might have to wait
until the VM decided to start writeback for the file, which can introduce
long delays.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Trond Myklebust 3a3908c8b0 NFS: Fix a compile warning when !(CONFIG_NFS_V3 || CONFIG_NFS_V4)
gcc reports:

linux/fs/nfs/write.c: In function ‘nfs_page_find_head_request_locked.isra.17’:
linux/fs/nfs/write.c:121:64: warning: ‘cinfo.mds’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  list_for_each_entry_safe(freq, t, &cinfo.mds->list, wb_list) {
                                                                  ^
linux/fs/nfs/write.c:110:25: note: ‘cinfo.mds’ was declared here
  struct nfs_commit_info cinfo;

Reported-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 921b81a8cd pnfs/blocklayout: correctly decrement extent length
When we do non-page sized reads we can underflow the extent_length variable
and read incorrect data.  Fix the extent_length calculation and change to
defensive <= checks for the extent length in the read and write path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig be98fd0ac3 pnfs/blocklayout: plug block queues
Make sure the block queue is plugged when performing pNFS blocklayout I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 72c5e59f63 pnfs/blocklayout: improve GETDEVICEINFO error reporting
Tell userspace what stage of GETDEVICEINFO failed so that there is a chance
to debug it, especially with the userspace daemon clusterf***k in the block
layout driver.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig e3aaf7f2b8 pnfs/blocklayout: reject pnfs blocksize larger than page size
The Linux VM subsystem can't support block sizes larger than page size
for block based filesystems very well.  While this can be hacked around
to some extent for simple filesystems the read-modify-write cycles
required for pnfs block invalid extents are extremly deadlock prone
when operating on multiple pages.  Reject this case early on instead
of pretending to support it (badly).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:02 -07:00
Christoph Hellwig 5f919c9f10 pnfs: allow splicing pre-encoded pages into the layoutcommit args
Currently there is no XDR buffer space allocated for the per-layout driver
layoutcommit payload, which leads to server buffer overflows in the
blocklayout driver even under simple workloads.  As we can't do per-layout
sizes for XDR operations we'll have to splice a previously encoded list
of pages into the XDR stream, similar to how we handle ACL buffers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00
Christoph Hellwig 47abadefad pnfs: avoid using stale stateids after layoutreturn
After we issued a layoutreturn operations the may free the layout stateid
and will thus cause bad stateid error when the client uses it again.

We currently try to avoid this case by chosing the open stateid if not
lsegs are present for this inode.  But various places can hold refererence
on lsegs and thus cause the list not to be empty shortly after a layout
return.  Add an explicit flag to mark the current layout stateid invalid
and force usage of the openstateid after we did a full file layoutreturn.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-09-10 12:47:01 -07:00