52029 Commits (2afe738fc070bf681227c0c9d95b9cd0c4782b0f)

Author SHA1 Message Date
Geert Uytterhoeven c877154d30 ubifs: Fix uninitialized variable in search_dh_cookie()
fs/ubifs/tnc.c: In function ‘search_dh_cookie’:
fs/ubifs/tnc.c:1893: warning: ‘err’ is used uninitialized in this function

Indeed, err is always used uninitialized.

According to an original review comment from Hyunchul, acknowledged by
Richard, err should be initialized to -ENOENT to avoid the first call to
tnc_next().  But we can achieve the same by reordering the code.

Fixes: 781f675e2d ("ubifs: Fix unlink code wrt. double hash lookups")
Reported-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-01-17 19:28:53 +01:00
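
To illustrate the reordering described in c877154d30 above, here is a minimal userspace sketch. It is not the fs/ubifs/tnc.c code: demo_next() and demo_search() are hypothetical stand-ins, and the array of cookies is an assumption made for the example. The point is only that checking the current entry first and advancing at the bottom of the loop means err is always assigned before it is read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical cursor step: advance *pos, or return -ENOENT at the end. */
static int demo_next(size_t len, size_t *pos)
{
	if (*pos + 1 >= len)
		return -ENOENT;
	(*pos)++;
	return 0;
}

/*
 * Search from *pos for 'wanted'. On success, *pos is left at the match
 * and 0 is returned. The current entry is examined first and the cursor
 * is advanced last, so 'err' is never read before it has been written.
 */
static int demo_search(const int *cookies, size_t len, size_t *pos, int wanted)
{
	int err;

	assert(len > 0 && *pos < len);
	for (;;) {
		if (cookies[*pos] == wanted)
			return 0;
		err = demo_next(len, pos);
		if (err)
			return err;
	}
}
```

With this ordering no initial value such as -ENOENT is needed to skip the first advance.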
Eric W. Biederman faf1f22b61 signal: Ensure generic siginfos the kernel sends have all bits initialized
Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send uninitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-12 14:21:07 -06:00
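
The idea behind faf1f22b61 above can be sketched in userspace C. This is an illustration under stated assumptions, not kernel code: struct demo_info is a hypothetical stand-in for siginfo, and demo_clear_info() only mirrors the spirit of clear_siginfo(). Zeroing the whole structure first means that bytes belonging to a larger union member are defined even when only a smaller member is filled in.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for siginfo: a header plus a union whose members
 * differ in size, so filling one member leaves other bytes untouched. */
struct demo_info {
	int signo;
	int code;
	union {
		struct { int pid; int uid; } kill;            /* small member */
		struct { void *addr; long extra[3]; } fault;  /* larger member */
	} u;
};

/* Zero every byte of the structure, including any padding. */
static void demo_clear_info(struct demo_info *info)
{
	memset(info, 0, sizeof(*info));
}

/* Fill in a "kill"-style record; only the small union member is written. */
static void demo_fill_kill(struct demo_info *info, int signo, int pid, int uid)
{
	demo_clear_info(info);
	info->signo = signo;
	info->code = 0;
	info->u.kill.pid = pid;
	info->u.kill.uid = uid;
}

/* Return 1 if every byte from offset 'from' to the end is zero. */
static int demo_tail_is_zero(const struct demo_info *info, size_t from)
{
	const unsigned char *p = (const unsigned char *)info;
	size_t i;

	assert(from <= sizeof(*info));
	for (i = from; i < sizeof(*info); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

Once every byte is known to be initialized, a whole-structure copy no longer has to skip fields one by one.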
Linus Torvalds 75d4276e83 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:

 - untangle sys_close() abuses in xt_bpf

 - deal with register_shrinker() failures in sget()

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix "netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1'"
  sget(): handle failures of register_shrinker()
  mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.
2018-01-06 17:13:21 -08:00
Wang Long bbbc3c1cfa writeback: update comment in inode_io_list_move_locked
The @head can be wb->b_dirty_time, so update the comment.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei c16a8ac3c0 btrfs: avoid accessing bvec table directly for a cloned bio
Commit 17347cec15f919901c90 ("Btrfs: change how we iterate bios in endio")
mentioned that for dio the submitted bio may be fast cloned. We can't
access the bvec table directly for a cloned bio, so use
bio_get_first_bvec() to retrieve the 1st bvec.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Acked: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei a0b60d725e btrfs: avoid access to .bi_vcnt directly
BTRFS uses bio->bi_vcnt to figure out page numbers. This approach is no
longer valid once we start enabling multipage bvecs.

Use bio_nr_pages() to do that instead.

Cc: Chris Mason <clm@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei c45a8f2def fs: convert to bio_last_bvec_all()
This patch converts 3 users to bio_last_bvec_all(), so that we can go
ahead and convert to multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Ming Lei 263663cd3c block: convert to bio_first_bvec_all & bio_first_page_all
This patch converts to bio_first_bvec_all() & bio_first_page_all() for
retrieving the 1st bvec/page, and prepares for supporting multipage bvec.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-06 09:18:00 -07:00
Linus Torvalds 89876f275e for-4.15-rc7-tag

Merge tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We have two more fixes for 4.15, both aimed for stable.

  The leak fix is obvious, the second patch fixes a bug revealed by the
  refcount API, when it behaves differently than previous atomic_t and
  reports refs going from 0 to 1 in one case"

* tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
  btrfs: Fix flush bio leak
2018-01-05 13:02:46 -08:00
Linus Torvalds 12e971b652 Changes since last update:
- Fix resource cleanup of failed quota initialization
 - Fix integer overflow problems wrt s_maxbytes

Merge tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS fixes from Darrick Wong:
 "I have just a few fixes for bugs and resource cleanup problems this
  week:

   - Fix resource cleanup of failed quota initialization

   - Fix integer overflow problems wrt s_maxbytes"

* tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix s_maxbytes overflow problems
  xfs: quota: check result of register_shrinker()
  xfs: quota: fix missed destroy of qi_tree_lock
2018-01-05 12:59:32 -08:00
Andrea Arcangeli 0cbb4b4f4c userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails
The previous fix in commit 384632e67e ("userfaultfd: non-cooperative:
fix fork use after free") corrected the refcounting in case of
UFFD_EVENT_FORK failure for the fork userfault paths.

That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
were set to point to the aborted new uffd ctx earlier in
dup_userfaultfd.

Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-04 16:45:09 -08:00
Linus Torvalds 50d0f78f5c Merge branch 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs/fscache fixes from David Howells:

 - Fix the default return of fscache_maybe_release_page() when a cache
   isn't in use - it prevents a filesystem from releasing pages. This
   can cause a system to OOM.

 - Fix a potential uninitialised variable in AFS.

 - Fix AFS unlink's handling of the nlink count. It needs to use the
   nlink manipulation functions so that inode structs of deleted inodes
   actually get scheduled for destruction.

 - Fix error handling in afs_write_end() so that the page gets unlocked
   and put if we can't fill the unwritten portion.

* 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix missing error handling in afs_write_end()
  afs: Fix unlink
  afs: Potential uninitialized variable in afs_extract_data()
  fscache: Fix the default for fscache_maybe_release_page()
2018-01-03 10:58:56 -08:00
Kees Cook e816c201ae exec: Weaken dumpability for secureexec
This is a logical revert of commit e37fdb785a ("exec: Use secureexec
for setting dumpability")

This weakens dumpability back to checking only for uid/gid changes in
current (which is useless), but userspace depends on dumpability not
being tied to secureexec.

  https://bugzilla.redhat.com/show_bug.cgi?id=1528633

Reported-by: Tom Horsley <horsley1953@gmail.com>
Fixes: e37fdb785a ("exec: Use secureexec for setting dumpability")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-03 10:13:36 -08:00
Ingo Molnar 475c5ee193 Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

- Updates to use cond_resched() instead of cond_resched_rcu_qs()
  where feasible (currently everywhere except in kernel/rcu and
  in kernel/torture.c).  Also a couple of fixes to avoid sending
  IPIs to offline CPUs.

- Updates to simplify RCU's dyntick-idle handling.

- Updates to remove almost all uses of smp_read_barrier_depends()
  and read_barrier_depends().

- Miscellaneous fixes.

- Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-01-03 14:14:18 +01:00
Darrick J. Wong b4d8ad7fd3 xfs: fix s_maxbytes overflow problems
Fix integer overflow problems that occur when offset + count is large
enough to overflow.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-01-02 10:16:32 -08:00
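
The class of bug fixed by b4d8ad7fd3 above can be shown with a small sketch. This is not the XFS code: demo_range_fits() and its parameters are hypothetical, and int64_t is assumed as the offset type. Comparing count against the remaining space avoids computing offset + count, which is the sum that can overflow.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if the byte range [offset, offset + count) fits within
 * max_bytes, 0 otherwise. No intermediate sum is formed, so nothing
 * can overflow even when offset and count are close to INT64_MAX.
 */
static int demo_range_fits(int64_t offset, int64_t count, int64_t max_bytes)
{
	assert(max_bytes >= 0);
	if (offset < 0 || count < 0)
		return 0;
	if (offset > max_bytes)
		return 0;
	/* max_bytes - offset is non-negative here and cannot overflow. */
	return count <= max_bytes - offset;
}
```

A naive test such as offset + count > max_bytes would wrap for large inputs and could accept an out-of-range request.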
Aliaksei Karaliou 3a3882ff26 xfs: quota: check result of register_shrinker()
xfs_qm_init_quotainfo() does not check the result of register_shrinker(),
which was recently tagged as __must_check; sparse reports this.

Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
[darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-02 10:16:32 -08:00
Aliaksei Karaliou 2196881566 xfs: quota: fix missed destroy of qi_tree_lock
xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock,
although it does destroy quotainfo->qi_quotaofflock.

Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-01-02 10:16:32 -08:00
Chris Mason ec35e48b28 btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
refcounts have a generic implementation and an asm optimized one.  The
generic version has extra debugging to make sure that once a refcount
goes to zero, refcount_inc won't increase it.

The btrfs delayed inode code wasn't expecting this, and we're tripping
over the warnings when the generic refcounts are used.  We ended up with
this race:

Process A                                         Process B
                                                  btrfs_get_delayed_node()
                                                  spin_lock(root->inode_lock)
                                                  radix_tree_lookup()
__btrfs_release_delayed_node()
refcount_dec_and_test(&delayed_node->refs)
our refcount is now zero
                                                  refcount_add(2) <---
                                                  warning here, refcount
                                                  unchanged

spin_lock(root->inode_lock)
radix_tree_delete()

With the generic refcounts, we actually warn again when process B above
tries to release his refcount because refcount_add() turned into a
no-op.

We saw this in production on older kernels without the asm optimized
refcounts.

The fix used here is to use refcount_inc_not_zero() to detect when the
object is in the middle of being freed and return NULL.  This is almost
always the right answer anyway, since we usually end up pitching the
delayed_node if it didn't have fresh data in it.

This also changes __btrfs_release_delayed_node() to remove the extra
check for zero refcounts before radix tree deletion.
btrfs_get_delayed_node() was the only path that was allowing refcounts
to go from zero to one.

Fixes: 6de5f18e7b ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
CC: <stable@vger.kernel.org> # 4.12+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:14 +01:00
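
The "increment unless zero" pattern that ec35e48b28 above relies on can be sketched with C11 atomics. This is a userspace illustration, not the kernel's refcount_t implementation: struct demo_ref and both helpers are hypothetical names. A lookup that races with the final release sees zero and backs off instead of reviving an object that is being freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object header. */
struct demo_ref {
	atomic_int refs;
};

/* Take a reference only if the object is still live (count > 0). */
static bool demo_ref_inc_not_zero(struct demo_ref *r)
{
	int old = atomic_load(&r->refs);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return true;
	}
	return false;	/* count already hit zero: caller treats it as not found */
}

/* Drop a reference; returns true when this was the last one. */
static bool demo_ref_dec_and_test(struct demo_ref *r)
{
	int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

A caller in the position of Process B would return NULL when demo_ref_inc_not_zero() fails, which matches the behaviour the commit message describes.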
Nikolay Borisov beed9263f4 btrfs: Fix flush bio leak
Commit e0ae999414 ("btrfs: preallocate device flush bio") reworked
the way the flush bio is allocated and used. Concretely it allocates
the bio in __alloc_device and then re-uses it multiple times with a
very simple endio routine that just calls complete() without consuming
a reference. Allocated bios by default come with a ref count of 1,
which is then consumed by the endio routine (or not, in which case they
should be bio_put by the caller). The way the implementation works now
is that the flush bio has a refcount of 2 and we only ever bio_put it
once, leaving it to hang indefinitely. Fix this by removing the extra
bio_get in __alloc_device.

Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-02 18:00:13 +01:00
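
The leak described in beed9263f4 above comes down to unbalanced get/put calls. The following is a toy model under stated assumptions, not the block layer API: struct demo_bio and its helpers are hypothetical, and "freed" is just a flag. An object that starts with one reference, gains an extra one, and is put only once never reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical object with a plain (non-atomic) reference count. */
struct demo_bio {
	int refs;
	bool freed;
};

/* A freshly allocated object starts with one reference. */
static void demo_bio_init(struct demo_bio *b)
{
	b->refs = 1;
	b->freed = false;
}

/* Take an additional reference. */
static void demo_bio_get(struct demo_bio *b)
{
	assert(!b->freed);
	b->refs++;
}

/* Drop a reference; the object is released when the count reaches zero. */
static void demo_bio_put(struct demo_bio *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		b->freed = true;
}
```

Removing the extra get, as the commit does, makes the single put the one that releases the object.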
David Howells afae457d87 afs: Fix missing error handling in afs_write_end()
afs_write_end() is missing page unlock and put if afs_fill_page() fails.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
David Howells 440fbc3a8a afs: Fix unlink
Repeating creation and deletion of a file on an afs mount will run the box
out of memory, e.g.:

	dd if=/dev/zero of=/afs/scratch/m0 bs=$((1024*1024)) count=512
	rm /afs/scratch/m0

The problem seems to be that it's not properly decrementing the nlink count
so that the inode can be scrapped.

Note that this doesn't fix local creation followed by remote deletion.
That's harder to handle and will require a separate patch as we're not told
that the file has been deleted - only that the directory has changed.

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
Dan Carpenter 7888da9583 afs: Potential uninitialized variable in afs_extract_data()
Smatch warns that:

    fs/afs/rxrpc.c:922 afs_extract_data()
    error: uninitialized symbol 'remote_abort'.

Smatch is right that "remote_abort" might be uninitialized when we pass
it to afs_set_call_complete().  I don't know if that function uses the
uninitialized variable.  Anyway, the comment for rxrpc_kernel_recv_data(),
says that "*_abort should also be initialised to 0." and this patch does
that.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-01-02 10:02:19 +00:00
Jeff Layton 7a11ac289c ntfs: remove i_version handling
NTFS keeps track of the i_version counter here, seemingly for no reason.
It does not set the SB_I_VERSION flag so it'll never be incremented on
write, and it doesn't increment it internally for metadata operations.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2018-01-01 10:09:33 -05:00
Linus Torvalds fca0e39b2b Changes since last update:
- Fix a locking problem during xattr block conversion that could lead the
   log checkpointing thread to try to write an incomplete buffer to disk,
   which leads to a corruption shutdown
 - Fix a null pointer dereference when removing delayed allocation extents
 - Remove post-eof speculative allocations when reflinking a block past
   current inode size so that we don't just leave them there and assert on
   inode reclaim
 - Relax an assert which didn't accurately reflect the way locking works
   and would trigger under heavy io load
 - Avoid infinite loop when cancelling copy on write extents after a
   writeback failure
 - Try to avoid copy on write transaction reservation overflows when
   remapping after a successful write
 - Fix various problems with the copy-on-write reservation automatic
   garbage collection not being cleaned up properly during a ro remount
 - Fix problems with rmap log items being processed in the wrong order,
   leading to corruption shutdowns
 - Fix problems with EFI recovery wherein the "remove any rmapping if
   present" mechanism wasn't actually doing anything, which would lead
   to corruption problems later when the extent is reallocated, leading
   to multiple rmaps for the same extent

Merge tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are some XFS fixes for 4.15-rc5. Apologies for the unusually
  large number of patches this late, but I wanted to make sure the
  corruption fixes were really ready to go.

  Changes since last update:

   - Fix a locking problem during xattr block conversion that could lead
     the log checkpointing thread to try to write an incomplete buffer
     to disk, which leads to a corruption shutdown

   - Fix a null pointer dereference when removing delayed allocation
     extents

   - Remove post-eof speculative allocations when reflinking a block
     past current inode size so that we don't just leave them there and
     assert on inode reclaim

   - Relax an assert which didn't accurately reflect the way locking
     works and would trigger under heavy io load

   - Avoid infinite loop when cancelling copy on write extents after a
     writeback failure

   - Try to avoid copy on write transaction reservation overflows when
     remapping after a successful write

   - Fix various problems with the copy-on-write reservation automatic
     garbage collection not being cleaned up properly during a ro
     remount

   - Fix problems with rmap log items being processed in the wrong
     order, leading to corruption shutdowns

   - Fix problems with EFI recovery wherein the "remove any rmapping if
     present" mechanism wasn't actually doing anything, which would lead
     to corruption problems later when the extent is reallocated,
     leading to multiple rmaps for the same extent"

* tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: only skip rmap owner checks for unknown-owner rmap removal
  xfs: always honor OWN_UNKNOWN rmap removal requests
  xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
  xfs: set cowblocks tag for direct cow writes too
  xfs: remove leftover CoW reservations when remounting ro
  xfs: don't be so eager to clear the cowblocks tag on truncate
  xfs: track cowblocks separately in i_flags
  xfs: allow CoW remap transactions to use reserve blocks
  xfs: avoid infinite loop when cancelling CoW blocks after writeback failure
  xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping
  xfs: remove dest file's post-eof preallocations before reflinking
  xfs: move xfs_iext_insert tracepoint to report useful information
  xfs: account for null transactions in bunmapi
  xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
  xfs: add the ability to join a held buffer to a defer_ops
2017-12-22 12:27:27 -08:00
Darrick J. Wong 68c58e9b9a xfs: only skip rmap owner checks for unknown-owner rmap removal
For rmap removal, refactor the rmap owner checks into a separate
function, then skip the checks if we are performing an unknown-owner
removal.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong 33df3a9cf9 xfs: always honor OWN_UNKNOWN rmap removal requests
Calling xfs_rmap_free with an unknown owner is supposed to remove any
rmaps covering that range regardless of owner.  This is used by the EFI
recovery code to say "we're freeing this, it mustn't be owned by
anything anymore", but for whatever reason xfs_free_ag_extent filters
them out.

Therefore, remove the filter and make xfs_rmap_unmap actually treat it
as a wildcard owner -- free anything that's already there, and if
there's no owner at all then that's fine too.

There are two existing callers of bmap_add_free that take care of the rmap
deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap
cleanup; convert these to use OWN_NULL (via helpers), and now we really
require that an RUI (if any) gets added to the defer ops before any EFI.

Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests,
growfs will have to consult directly with the rmap to ensure that there
aren't any rmaps in the grown region.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong 0525e952dc xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
Under the deferred rmap operation scheme, there's a certain order in
which the rmap deferred ops have to be queued to maintain integrity
during log replay.  For alloc/map operations that order is cui -> rui;
for free/unmap operations that order is cui -> rui -> efi.  However, the
initial refcount code got the ordering wrong in the free side of things
because it queued a refcount free op and an EFI, and the refcount free op
queued an rmap free op, resulting in the order cui -> efi -> rui.

If we fail before the efd finishes, the efi recovery will try to do a
wildcard rmap removal and the subsequent rui will fail to find the rmap
and blow up.  This didn't ever happen due to other screw-ups in handling
unknown owner rmap removals, but those other screw-ups broke recovery in
other ways, so fix the ordering to follow the intended rules.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:48:38 -08:00
Darrick J. Wong 86d692bfad xfs: set cowblocks tag for direct cow writes too
If a user performs a direct CoW write, we end up loading the CoW fork
with preallocated extents.  Therefore, we must set the cowblocks tag so
that they can be cleared out if we run low on space.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:37 -08:00
Darrick J. Wong 10ddf64e42 xfs: remove leftover CoW reservations when remounting ro
When we're remounting the filesystem readonly, remove all CoW
preallocations prior to going ro.  If the fs goes down after the ro
remount, we never clean up the staging extents, which means xfs_check
will trip over them on a subsequent run.  Practically speaking, the next
mount will clean them up too, so this is unlikely to be seen.  Since we
shut down the cowblocks cleaner on remount-ro, we also have to make sure
we start it back up if/when we remount-rw.

Found by adding clonerange to fsstress and running xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:32 -08:00
Darrick J. Wong 363e59baa4 xfs: don't be so eager to clear the cowblocks tag on truncate
Currently, xfs_itruncate_extents clears the cowblocks tag if i_cnextents
is zero.  This is wrong, since i_cnextents only tracks real extents in
the CoW fork, which means that we could have some delayed CoW
reservations still in there that will now never get cleaned.

Fix a further bug where we /don't/ clear the reflink iflag if there are
any attribute blocks -- really, it's only safe to clear the reflink flag
if there are no data fork extents and no cow fork extents.

Found by adding clonerange to fsstress in xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-21 08:47:28 -08:00
Darrick J. Wong 91aae6be41 xfs: track cowblocks separately in i_flags
The EOFBLOCKS/COWBLOCKS tags are totally separate things, so track them
with separate i_flags.  Right now we're abusing IEOFBLOCKS for both,
which is totally bogus because we won't tag the inode with COWBLOCKS if
IEOFBLOCKS was set by a previous tagging of the inode with EOFBLOCKS.
Found by wiring up clonerange to fsstress in xfs/017.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-20 17:11:48 -08:00
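
The problem described in 91aae6be41 above, one flag guarding two independent tags, can be reproduced in a few lines. This is a simplified model, not the XFS inode code: the DEMO_* constants, struct demo_inode and both helpers are hypothetical. With a shared flag the second tag is silently dropped; with one flag per tag both are recorded.

```c
#include <assert.h>

/* Hypothetical per-inode "already tagged" flags. */
#define DEMO_IEOFBLOCKS	(1u << 0)
#define DEMO_ICOWBLOCKS	(1u << 1)

/* Hypothetical tags recorded in some external index. */
#define DEMO_TAG_EOF	(1u << 0)
#define DEMO_TAG_COW	(1u << 1)

struct demo_inode {
	unsigned int flags;	/* which tags we believe are already set */
	unsigned int tags;	/* which tags are actually recorded */
};

/* Buggy variant: a single flag short-circuits both kinds of tagging. */
static void demo_tag_shared(struct demo_inode *ip, unsigned int tag)
{
	if (ip->flags & DEMO_IEOFBLOCKS)
		return;		/* looks "already tagged", so the new tag is lost */
	ip->flags |= DEMO_IEOFBLOCKS;
	ip->tags |= tag;
}

/* Fixed variant: each tag is tracked by its own flag. */
static void demo_tag_separate(struct demo_inode *ip, unsigned int tag)
{
	unsigned int flag;

	assert(tag == DEMO_TAG_EOF || tag == DEMO_TAG_COW);
	flag = (tag == DEMO_TAG_COW) ? DEMO_ICOWBLOCKS : DEMO_IEOFBLOCKS;
	if (ip->flags & flag)
		return;
	ip->flags |= flag;
	ip->tags |= tag;
}
```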
Al Viro 9ee332d99e sget(): handle failures of register_shrinker()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-18 15:05:07 -05:00
Kees Cook 779f4e1c6c Revert "exec: avoid RLIMIT_STACK races with prlimit()"
This reverts commit 04e35f4495.

SELinux runs with secureexec for all non-"noatsecure" domain transitions,
which means lots of processes end up hitting the stack hard-limit change
that was introduced in order to fix a race with prlimit(). That race fix
will need to be redesigned.

Reported-by: Laura Abbott <labbott@redhat.com>
Reported-by: Tomáš Trnka <trnka@scm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-17 14:26:25 -08:00
Arnd Bergmann b9f5fb1800 cramfs: fix MTD dependency
With CONFIG_MTD=m and CONFIG_CRAMFS=y, we now get a link failure:

  fs/cramfs/inode.o: In function `cramfs_mount': inode.c:(.text+0x220): undefined reference to `mount_mtd'
  fs/cramfs/inode.o: In function `cramfs_mtd_fill_super':
  inode.c:(.text+0x6d8): undefined reference to `mtd_point'
  inode.c:(.text+0xae4): undefined reference to `mtd_unpoint'

This adds a more specific Kconfig dependency to avoid the broken
configuration.

Alternatively we could make CRAMFS itself depend on "MTD || !MTD" with a
similar result.

Fixes: 99c18ce580 ("cramfs: direct memory access support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-17 12:20:58 -08:00
Linus Torvalds 73d080d374 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "The alloc_super() one is a regression in this merge window, lazytime
  thing is older..."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: Handle lazytime in do_mount()
  alloc_super(): do ->s_umount initialization earlier
2017-12-17 12:18:35 -08:00
Linus Torvalds 1c6b942d7d Fix a regression which caused us to fail to interpret symlinks in very
ancient ext3 file system images.  Also fix two xfstests failures, one
 of which could cause an OOPS, plus an additional bug fix caught by fuzz
 testing.

Merge tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix a regression which caused us to fail to interpret symlinks in very
  ancient ext3 file system images.

  Also fix two xfstests failures, one of which could cause an OOPS, plus
  an additional bug fix caught by fuzz testing"

* tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix crash when a directory's i_size is too small
  ext4: add missing error check in __ext4_new_inode()
  ext4: fix fdatasync(2) after fallocate(2) operation
  ext4: support fast symlinks from ext3 file systems
2017-12-17 12:14:33 -08:00
Linus Torvalds d025fbf1a2 NFS client fixes for Linux 4.15-rc4
Stable bugfixes:
 - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
        commit in the case that there were no commit requests.
 - SUNRPC: Fix a race in the receive code path
 
 Other fixes:
 - NFS: Fix a deadlock in nfs client initialization
 - xprtrdma: Fix a performance regression for small IOs

Merge tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "This has two stable bugfixes, one to fix a BUG_ON() when
  nfs_commit_inode() is called with no outstanding commit requests and
  another to fix a race in the SUNRPC receive codepath.

  Additionally, there are also fixes for an NFS client deadlock and an
  xprtrdma performance regression.

  Summary:

  Stable bugfixes:
   - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
     commit in the case that there were no commit requests.
   - SUNRPC: Fix a race in the receive code path

  Other fixes:
   - NFS: Fix a deadlock in nfs client initialization
   - xprtrdma: Fix a performance regression for small IOs"

* tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Fix a race in the receive code path
  nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
  xprtrdma: Spread reply processing over more CPUs
  nfs: fix a deadlock in nfs client initialization
2017-12-16 13:12:53 -08:00
Linus Torvalds f6f3732162 Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths"
This reverts commits 5c9d2d5c26, c7da82b894, and e7fe7b5cae.

We'll probably need to revisit this, but basically we should not
complicate the get_user_pages_fast() case, and checking the actual page
table protection key bits will require more care anyway, since the
protection keys depend on the exact state of the VM in question.

Particularly when doing a "remote" page lookup (i.e. in somebody else's VM,
not your own), you need to be much more careful than this was.  Dave
Hansen says:

 "So, the underlying bug here is that we now do a get_user_pages_remote()
  and then go ahead and do the p*_access_permitted() checks against the
  current PKRU. This was introduced recently with the addition of the
  new p??_access_permitted() calls.

  We have checks in the VMA path for the "remote" gups and we avoid
  consulting PKRU for them. This got missed in the pkeys selftests
  because I did a ptrace read, but not a *write*. I also didn't
  explicitly test it against something where a COW needed to be done"

It's also not entirely clear that it makes sense to check the protection
key bits at this level at all.  But one possible eventual solution is to
make the get_user_pages_fast() case just abort if it sees protection key
bits set, which makes us fall back to the regular get_user_pages() case,
which then has a vma and can do the check there if we want to.

We'll see.

Somewhat related to this all: what we _do_ want to do some day is to
check the PAGE_USER bit - it should obviously always be set for user
pages, but it would be a good check to have back.  Because we have no
generic way to test for it, we lost it as part of moving over from the
architecture-specific x86 GUP implementation to the generic one in
commit e585513b76 ("x86/mm/gup: Switch GUP to the generic
get_user_page_fast() implementation").

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-15 18:53:22 -08:00
Linus Torvalds dd3d66b838 CephFS inode trimming fix from Zheng, marked for stable.

Merge tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client

Pull ceph fix from Ilya Dryomov:
 "CephFS inode trimming fix from Zheng, marked for stable"

* tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client:
  ceph: drop negative child dentries before try pruning inode's alias
2017-12-15 12:48:27 -08:00
Linus Torvalds 227701e0e7 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:

 - fix incomplete syncing of filesystem

 - fix regression in readdir on ovl over 9p

 - only follow redirects when needed

 - misc fixes and cleanups

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix overlay: warning prefix
  ovl: Use PTR_ERR_OR_ZERO()
  ovl: Sync upper dirty data when syncing overlayfs
  ovl: update ctx->pos on impure dir iteration
  ovl: Pass ovl_get_nlink() parameters in right order
  ovl: don't follow redirects if redirect_dir=off
2017-12-15 12:46:48 -08:00
Scott Mayhew dc4fd9ab01 nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
If there were no commit requests, then nfs_commit_inode() should not
wait on the commit or mark the inode dirty, otherwise the following
BUG_ON can be triggered:

[ 1917.130762] kernel BUG at fs/inode.c:578!
[ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
[ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
[ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G               ------------ T 3.10.0-768.el7.ppc64 #1
[ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000
[ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80
[ 1917.130816] REGS: c00000004cea3720 TRAP: 0700   Tainted: G               ------------ T  (3.10.0-768.el7.ppc64)
[ 1917.130820] MSR: 8000000100029032 <SF,EE,ME,IR,DR,RI>  CR: 22002822  XER: 20000000
[ 1917.130828] CFAR: c00000000011f594 SOFTE: 1
GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
[ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350
[ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350
[ 1917.130873] Call Trace:
[ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable)
[ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270
[ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0
[ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
[ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
[ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180
[ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150
[ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100
[ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
[ 1917.130919] Instruction dump:
[ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
[ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 <0b0a0000> 892d02a4 2f890000 40de0134

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Cc: stable@vger.kernel.org # 4.5+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:50 -05:00
Scott Mayhew c156618e15 nfs: fix a deadlock in nfs client initialization
The following deadlock can occur between a process waiting for a client
to initialize while walking the client list during nfsv4 server trunking
detection and another process waiting for the nfs_clid_init_mutex so it
can initialize that client:

Process 1                               Process 2
---------                               ---------
spin_lock(&nn->nfs_client_lock);
list_add_tail(&CLIENTA->cl_share_link,
        &nn->nfs_client_list);
spin_unlock(&nn->nfs_client_lock);
                                        spin_lock(&nn->nfs_client_lock);
                                        list_add_tail(&CLIENTB->cl_share_link,
                                                &nn->nfs_client_list);
                                        spin_unlock(&nn->nfs_client_lock);
                                        mutex_lock(&nfs_clid_init_mutex);
                                        nfs41_walk_client_list(clp, result, cred);
                                        nfs_wait_client_init_complete(CLIENTA);
(waiting for nfs_clid_init_mutex)

Make sure nfs_match_client() only evaluates clients that have completed
initialization in order to prevent that deadlock.

This patch also fixes v4.0 trunking behavior by not marking the client
NFS_CS_READY until the clientid has been confirmed.

Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-12-15 14:31:49 -05:00
Linus Torvalds 18d40eae7f Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "17 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  arch: define weak abort()
  mm, oom_reaper: fix memory corruption
  kernel: make groups_sort calling a responsibility group_info allocators
  mm/frame_vector.c: release a semaphore in 'get_vaddr_frames()'
  tools/slabinfo-gnuplot: force to use bash shell
  kcov: fix comparison callback signature
  mm/slab.c: do not hash pointers when debugging slab
  mm/page_alloc.c: avoid excessive IRQ disabled times in free_unref_page_list()
  mm/memory.c: mark wp_huge_pmd() inline to prevent build failure
  scripts/faddr2line: fix CROSS_COMPILE unset error
  Documentation/vm/zswap.txt: update with same-value filled page feature
  exec: avoid gcc-8 warning for get_task_comm
  autofs: fix careless error in recent commit
  string.h: workaround for increased stack usage
  mm/kmemleak.c: make cond_resched() rate-limiting more efficient
  lib/rbtree,drm/mm: add rbtree_replace_node_cached()
  include/linux/idr.h: add #include <linux/bug.h>
2017-12-14 16:35:20 -08:00
Thiago Rafael Becker bdcf0a423e kernel: make groups_sort calling a responsibility group_info allocators
In testing, we found that nfsd threads may call set_groups in parallel
for the same entry cached in auth.unix.gid, racing in the call of
groups_sort, corrupting the groups for that entry and leading to
permission denials for the client.

This patch:
 - Make groups_sort globally visible.
 - Move the call to groups_sort to the modifiers of group_info
 - Remove the call to groups_sort from set_groups
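
The idea of sorting in the code that builds the list, rather than in the code that installs it, can be sketched in userspace C. This is only an illustration: `struct group_list`, `group_list_create()` and `group_list_contains()` are hypothetical names, not the kernel's group_info API. The list is sorted once while it is still private to its creator, so later users only read it.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's group_info. */
struct group_list {
	size_t ngroups;
	unsigned int gid[];	/* flexible array, sorted before sharing */
};

static int gid_cmp(const void *a, const void *b)
{
	unsigned int x = *(const unsigned int *)a;
	unsigned int y = *(const unsigned int *)b;

	return (x > y) - (x < y);
}

/* Allocator side: copy, then sort exactly once while still private. */
static struct group_list *group_list_create(const unsigned int *gids, size_t n)
{
	struct group_list *gl = malloc(sizeof(*gl) + n * sizeof(gl->gid[0]));

	if (!gl)
		return NULL;
	gl->ngroups = n;
	if (n)
		memcpy(gl->gid, gids, n * sizeof(gl->gid[0]));
	qsort(gl->gid, n, sizeof(gl->gid[0]), gid_cmp);
	return gl;
}

/* Consumer side: read-only lookup, never re-sorts the shared data. */
static int group_list_contains(const struct group_list *gl, unsigned int gid)
{
	return bsearch(&gid, gl->gid, gl->ngroups, sizeof(gl->gid[0]),
		       gid_cmp) != NULL;
}
```

Because no consumer writes to the array after creation, two threads sharing one cached entry cannot race on the sort.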

Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com
Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: NeilBrown <neilb@suse.com>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:49 -08:00
Arnd Bergmann 3756f6401c exec: avoid gcc-8 warning for get_task_comm
gcc-8 warns about using strncpy() with the source size as the limit:

  fs/exec.c:1223:32: error: argument to 'sizeof' in 'strncpy' call is the same expression as the source; did you mean to use the size of the destination? [-Werror=sizeof-pointer-memaccess]

This is indeed slightly suspicious, as it protects us from source
arguments without NUL-termination, but does not guarantee that the
destination is terminated.

This keeps the strncpy() to ensure we have a properly padded target
buffer, but ensures that we use the correct length, by passing the
actual length of the destination buffer as well as adding a build-time
check to ensure it is exactly TASK_COMM_LEN.

There are only 23 callsites, all of which I reviewed to ensure this is
currently the case.  We could get away with doing only the check or
passing the right length, but it doesn't hurt to do both.
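
A minimal userspace sketch of "do both" follows. It is not the kernel code: `struct task`, `copy_task_comm()` and `get_comm()` are hypothetical names, and the value 16 for TASK_COMM_LEN is an assumption. The macro checks the destination size at build time and then passes that same size to strncpy().

```c
#include <string.h>

#define TASK_COMM_LEN 16	/* assumed value for this sketch */

/* Hypothetical stand-in for the task structure holding the name. */
struct task {
	char comm[TASK_COMM_LEN];
};

/* Copy using the destination length; strncpy() zero-pads the remainder. */
static char *copy_task_comm(char *buf, size_t buf_size, const struct task *tsk)
{
	strncpy(buf, tsk->comm, buf_size);
	return buf;
}

/*
 * Build-time check that buf is a real array of exactly TASK_COMM_LEN
 * bytes, then use that array's own size as the copy limit.
 */
#define get_comm(buf, tsk) do {						\
	_Static_assert(sizeof(buf) == TASK_COMM_LEN,			\
		       "buf must be char[TASK_COMM_LEN]");		\
	copy_task_comm((buf), sizeof(buf), (tsk));			\
} while (0)
```

With source and destination the same size, a NUL-terminated source implies a NUL-terminated destination; passing a pointer or a differently sized array fails to compile instead of silently truncating.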

Link: http://lkml.kernel.org/r/20171205151724.1764896-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Aleksa Sarai <asarai@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:48 -08:00
NeilBrown 302ec300ef autofs: fix careless error in recent commit
Commit ecc0c469f2 ("autofs: don't fail mount for transient error") was
meant to replace an 'if' with a 'switch', but instead added the 'switch'
leaving the 'if' in place.
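
One plausible shape of such a mistake is sketched below, assuming the leftover statement was the original 'if' sitting in front of the new 'switch'. All names here (`fake_write()`, `notify_fixed()`, `notify_buggy()`, the `outcome` values) are hypothetical, and the choice of -ENOMEM as the transient error is an assumption made for illustration.

```c
#include <errno.h>

static int write_calls;	/* how many times the fake write ran */

/* Hypothetical stand-in for the packet write: returns 0 or a -errno. */
static int fake_write(int result)
{
	write_calls++;
	return result;
}

enum outcome { DELIVERED, FAIL_THIS_WAIT, GO_CATATONIC };

/* Intended shape: one write, one switch on its result. */
static enum outcome notify_fixed(int result)
{
	switch (fake_write(result)) {
	case 0:
		return DELIVERED;
	case -ENOMEM:
		return FAIL_THIS_WAIT;	/* transient: fail only this request */
	default:
		return GO_CATATONIC;	/* anything else is treated as fatal */
	}
}

/* Buggy shape: the leftover 'if' performs its own write first. */
static enum outcome notify_buggy(int result)
{
	if (fake_write(result))
		switch (fake_write(result)) {
		case 0:
			return DELIVERED;
		case -ENOMEM:
			return FAIL_THIS_WAIT;
		default:
			return GO_CATATONIC;
		}
	return DELIVERED;
}
```

In the buggy variant a failing write is attempted twice, and the switch only runs when the first attempt has already failed.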

Link: http://lkml.kernel.org/r/87zi6wstmw.fsf@notabene.neil.brown.name
Fixes: ecc0c469f2 ("autofs: don't fail mount for transient error")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: NeilBrown <neilb@suse.com>
Cc: Ian Kent <raven@themaw.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-14 16:00:48 -08:00
Linus Torvalds d455df0bcc Small SMB3 fixes for stable and 4.15rc

Merge tag '4.15-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Small SMB3 fixes for stable and 4.15rc"

* tag '4.15-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: don't log STATUS_NOT_FOUND errors for DFS
  cifs: fix NULL deref in SMB2_read
2017-12-14 11:51:21 -08:00
Darrick J. Wong a192de265b xfs: allow CoW remap transactions to use reserve blocks
Since we as yet have no way of holding on to the indlen blocks that are
reserved as part of CoW fork delalloc reservations, let the CoW remap
transaction dip into the reserves so that we avoid failing writes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:11 -08:00
Darrick J. Wong 9d40fba8b2 xfs: avoid infinite loop when cancelling CoW blocks after writeback failure
When we're cancelling a cow range, we don't always delete each extent
that we iterate, so we have to move icur backwards in the list to avoid
an infinite loop.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:11 -08:00
Darrick J. Wong 73353f486c xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping
We don't hold the ilock through the entire sequence of xfs_writepage_map
-> xfs_map_cow -> xfs_reflink_find_cow_mapping.  This means that we can
race with another thread that is trying to clear the inode reflink flag,
with the result that the flag is set for the xfs_map_cow check but
cleared before we get to the assert in find_cow_mapping.  When this
happens, we blow the assert even though everything is fine.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:11 -08:00
Darrick J. Wong 5c989a0ee0 xfs: remove dest file's post-eof preallocations before reflinking
If we try to reflink into a file with post-eof preallocations at an
offset well past the preallocations, we increase i_size as one would
expect.  However, those allocations do not have page cache backing them,
so they won't get cleaned out on their own.  This leads to asserts in
the collapse/insert range code and xfs_destroy_inode when they encounter
delalloc extents they weren't expecting to find.

Since there are plenty of other places where we dump those post-eof
blocks, do the same to the reflink destination file before we start
remapping extents.  This was found by adding clonerange support to
fsstress and running it in write-only mode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:11 -08:00
Darrick J. Wong c54854a437 xfs: move xfs_iext_insert tracepoint to report useful information
Move the tracepoint in xfs_iext_insert to after the point where we've
inserted the extent because otherwise we report stale extent data in
the ftrace output.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:11 -08:00
Darrick J. Wong 8c57b88637 xfs: account for null transactions in bunmapi
In e1a4e37cc7 ("xfs: try to avoid blowing out the transaction
reservation when bunmaping a shared extent"), we try to constrain the
amount of real extents we unmap from the data fork in a given call so
that we don't blow out transaction reservations.

However, not all bunmapi operations require a transaction -- if we're
only removing a delalloc extent, no transaction is needed, so we have to
code against that.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:20:10 -08:00
Darrick J. Wong 6e643cd094 xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
The new attribute leaf buffer is not held locked across the transaction
roll between the shortform->leaf modification and the addition of the
new entry.  As a result, the attribute buffer modification being made is
not atomic from an operational perspective.  Hence the AIL push can grab
it in the transient state of "just created" after the initial
transaction is rolled, because the buffer has been released.  This leads
to xfs_attr3_leaf_verify() asserting that hdr.count is zero, treating
this as in-memory corruption, and shutting down the filesystem.

Darrick ported the original patch to 4.15 and reworked it to use the
xfs_defer_bjoin helper and hold/join the buffer correctly across the
second transaction roll.

Signed-off-by: Alex Lyakas <alex@zadarastorage.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:18:12 -08:00
Darrick J. Wong b7b2846fe2 xfs: add the ability to join a held buffer to a defer_ops
In certain cases, defer_ops callers will lock a buffer and want to hold
the lock across transaction rolls.  Similar to ijoined inodes, we want
to dirty & join the buffer with each transaction roll in defer_finish so
that afterwards the caller still owns the buffer lock and we haven't
inadvertently pinned the log.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-14 09:17:35 -08:00
Amir Goldstein da2e6b7eed ovl: fix overlay: warning prefix
Conform two stray warning messages to the standard overlayfs: prefix.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-14 11:14:52 +01:00
Linus Torvalds 7c5cac1bc7 Changes since last update:
 - Clean up duplicate includes
 - Remove ancient 'no-alloc' crap code that occasionally caused hard fs
   shutdowns due to lack of proper space reservations
 - Fix regression in FIEMAP behavior when reporting xattr extents

Merge tag 'xfs-4.15-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are a few more bug fixes & cleanups for 4.15-rc4:

   - clean up duplicate includes

   - remove ancient 'no-alloc' crap code that occasionally caused hard
     fs shutdowns due to lack of proper space reservations

   - fix regression in FIEMAP behavior when reporting xattr extents"

* tag 'xfs-4.15-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: make iomap_begin functions trim iomaps consistently
  xfs: remove "no-allocation" reservations for file creations
  fs: xfs: remove duplicate includes
2017-12-13 20:15:49 -08:00
Chandan Rajendra 9d5afec6b8 ext4: fix crash when a directory's i_size is too small
On a ppc64 machine, when mounting a fuzzed ext2 image (generated by
fsfuzzer) the following call trace is seen,

VFS: brelse: Trying to free free buffer
WARNING: CPU: 1 PID: 6913 at /root/repos/linux/fs/buffer.c:1165 .__brelse.part.6+0x24/0x40
.__brelse.part.6+0x20/0x40 (unreliable)
.ext4_find_entry+0x384/0x4f0
.ext4_lookup+0x84/0x250
.lookup_slow+0xdc/0x230
.walk_component+0x268/0x400
.path_lookupat+0xec/0x2d0
.filename_lookup+0x9c/0x1d0
.vfs_statx+0x98/0x140
.SyS_newfstatat+0x48/0x80
system_call+0x58/0x6c

This happens because the directory that ext4_find_entry() looks up has
inode->i_size that is less than the block size of the filesystem. This
causes 'nblocks' to have a value of zero. ext4_bread_batch() ends up not
reading any of the directory file's blocks. This leaves the entries in
the bh_use[] array holding garbage data. buffer_uptodate() on bh_use[0]
can then return zero, upon which brelse() is invoked.

This commit fixes the bug by returning -ENOENT when the directory file
has no associated blocks.
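
The arithmetic behind the failure, and the early return that avoids it, can be sketched as follows. This is a userspace illustration under stated assumptions: `dir_nblocks()` and `dir_search_precheck()` are hypothetical helpers, and the block count is assumed to be the directory size shifted right by the block-size bits.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: whole blocks spanned by a directory of i_size
 * bytes, for a block size of (1 << blocksize_bits).
 */
static uint64_t dir_nblocks(uint64_t i_size, unsigned int blocksize_bits)
{
	return i_size >> blocksize_bits;
}

/* Return 0 if a search may proceed, -ENOENT if there is nothing to read. */
static int dir_search_precheck(uint64_t i_size, unsigned int blocksize_bits)
{
	if (dir_nblocks(i_size, blocksize_bits) == 0)
		return -ENOENT;	/* no blocks: never touch the buffer array */
	return 0;
}
```

For example, an i_size of 512 with 4096-byte blocks (12 bits) yields zero blocks, so the check returns -ENOENT before any uninitialized buffer pointer can be inspected.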

Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
2017-12-11 15:00:57 -05:00
Paul E. McKenney 1dfa55e019 Merge branches 'cond_resched.2017.12.04a', 'dyntick.2017.11.28a', 'fixes.2017.12.11a', 'srbd.2017.12.05a' and 'torture.2017.12.11a' into HEAD
cond_resched.2017.12.04a: Convert cond_resched_rcu_qs() to cond_resched()
dyntick.2017.11.28a: Make RCU dynticks handle interrupts from NMI
fixes.2017.12.11a: Miscellaneous fixes
srbd.2017.12.05a: Remove now-redundant smp_read_barrier_depends()
torture.2017.12.11a: Torture-testing update
2017-12-11 09:21:58 -08:00
Vasyl Gomonovych 7879cb43f9 ovl: Use PTR_ERR_OR_ZERO()
Fix ptr_ret.cocci warnings:
fs/overlayfs/overlayfs.h:179:11-17: WARNING: PTR_ERR_OR_ZERO can be used

Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR

Generated by: scripts/coccinelle/api/ptr_ret.cocci
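
For readers unfamiliar with the idiom, the equivalence can be sketched in userspace C. The lowercase helpers below are re-implementations written for this illustration, assuming the usual convention that the top 4095 pointer values encode negative errno codes; they are not the kernel's own definitions.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* assumed size of the error-pointer range */

static inline void *err_ptr(intptr_t error)
{
	return (void *)error;
}

static inline intptr_t ptr_err(const void *ptr)
{
	return (intptr_t)ptr;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Long form: if (IS_ERR(...)) + PTR_ERR */
static intptr_t check_long_form(const void *ptr)
{
	if (is_err(ptr))
		return ptr_err(ptr);
	return 0;
}

/* Short form: what a PTR_ERR_OR_ZERO-style helper collapses it to. */
static intptr_t ptr_err_or_zero(const void *ptr)
{
	return is_err(ptr) ? ptr_err(ptr) : 0;
}
```

Both forms return the encoded errno for an error pointer and 0 for a valid pointer, so the replacement is purely cosmetic.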

Signed-off-by: Vasyl Gomonovych <gomonovych@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-11 11:28:11 +01:00
Chengguang Xu e8d4bfe3a7 ovl: Sync upper dirty data when syncing overlayfs
When executing filesystem sync or umount on overlayfs, dirty data on
the upper filesystem does not get synced as expected.
This patch fixes the sync filesystem method to keep data consistent
for overlayfs.

Signed-off-by: Chengguang Xu <cgxu@mykernel.net>
Fixes: e593b2bf51 ("ovl: properly implement sync_filesystem()")
Cc: <stable@vger.kernel.org> #4.11
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-11 11:28:11 +01:00
Amir Goldstein b02a16e641 ovl: update ctx->pos on impure dir iteration
This fixes a regression with readdir of an impure dir in overlayfs
that is shared with a VM via 9p fs.

Reported-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Fixes: 4edb83bb10 ("ovl: constant d_ino for non-merge dirs")
Cc: <stable@vger.kernel.org> #4.14
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Tested-by: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-11 11:28:11 +01:00
Vivek Goyal 08d8f8a5b0 ovl: Pass ovl_get_nlink() parameters in right order
Right now we seem to be passing index as "lowerdentry" and origin.dentry
as "upperdentry". IIUC, we should pass these parameters in reversed order
and this looks like a bug.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> #v4.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-11 11:28:10 +01:00
Miklos Szeredi 438c84c2f0 ovl: don't follow redirects if redirect_dir=off
Overlayfs is following redirects even when redirects are disabled. If this
is unintentional (probably the majority of cases) then this can be a
problem.  E.g. upper layer comes from untrusted USB drive, and attacker
crafts a redirect to enable read access to otherwise unreadable
directories.

If "redirect_dir=off", then turn off following as well as creation of
redirects.  If "redirect_dir=follow", then turn on following, but turn off
creation of redirects (which is what "redirect_dir=off" does now).
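
The option semantics described above can be summarized as a small lookup. This is a sketch only: `struct redirect_policy` and `parse_redirect_dir()` are hypothetical names, and the "on" case (both following and creation enabled) is an assumption, since the text only spells out "off" and "follow".

```c
#include <stdbool.h>
#include <string.h>

struct redirect_policy {
	bool follow;	/* honour redirects found in the upper layer */
	bool create;	/* write new redirects */
};

/* Returns 0 on success, -1 for a value this sketch does not model. */
static int parse_redirect_dir(const char *value, struct redirect_policy *p)
{
	if (strcmp(value, "on") == 0) {
		p->follow = true;	/* assumed: "on" enables both */
		p->create = true;
	} else if (strcmp(value, "follow") == 0) {
		p->follow = true;	/* follow existing, create none */
		p->create = false;
	} else if (strcmp(value, "off") == 0) {
		p->follow = false;	/* new behaviour: ignore redirects too */
		p->create = false;
	} else {
		return -1;
	}
	return 0;
}
```

Under the old behaviour, "off" would have produced the same result as "follow" does here, which is why a crafted redirect on an untrusted upper layer was still honoured.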

This is a backward incompatible change, so make it dependent on a config
option.

Reported-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-12-11 11:28:10 +01:00
Theodore Ts'o 996fc4477a ext4: add missing error check in __ext4_new_inode()
It's possible for ext4_get_acl() to return an ERR_PTR.  So we need to
add a check for this case in __ext4_new_inode().  Otherwise, on an
error we can end up oopsing the kernel.

This was getting triggered by xfstests generic/388, a test that
exercises the shutdown code path.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-12-10 23:44:11 -05:00
Jeff Layton 98087c05b9 hpfs: don't bother with the i_version counter or f_version
HPFS does not set SB_I_VERSION and does not use the i_version counter
internally.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Mikulas Patocka <mikulas@twibright.com>
Reviewed-by: Mikulas Patocka <mikulas@twibright.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-10 12:58:18 -08:00
Linus Torvalds 51090c5d6d for-4.15-rc3-tag

Merge tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "This contains a few fixes (error handling, quota leak, FUA vs
  nobarrier mount option).

  There's one worth mentioning separately - a fix for an off-by-one that
  leads to overwriting the first byte of an adjacent page with 0, out of
  bounds of the memory allocated by an ioctl. This is under a privileged
  part of the ioctl and can be triggered in some subvolume layouts"

* tag 'for-4.15-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
  Btrfs: disable FUA if mounted with nobarrier
  btrfs: fix missing error return in btrfs_drop_snapshot
  btrfs: handle errors while updating refcounts in update_ref_for_cow
  btrfs: Fix quota reservation leak on preallocated files
2017-12-10 08:30:04 -08:00
Markus Trippelsdorf d7ee946942 VFS: Handle lazytime in do_mount()
Since commit e462ec50cb ("VFS: Differentiate mount flags (MS_*) from
internal superblock flags") the lazytime mount option doesn't get passed
on anymore.

Fix the issue by handling the option in do_mount().

Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-09 20:16:33 -05:00
Darrick J. Wong b7e0b6ff54 xfs: make iomap_begin functions trim iomaps consistently
Historically, the XFS iomap_begin function only returned mappings for
exactly the range queried, i.e. it doesn't do XFS_BMAPI_ENTIRE lookups.
The current vfs iomap consumers are only set up to deal with trimmed
mappings.  xfs_xattr_iomap_begin does BMAPI_ENTIRE lookups, which is
inconsistent with the current iomap usage.  Remove the flag so that both
iomap_begin functions behave the same way.

FWIW this also fixes a behavioral regression in xattr FIEMAP that was
introduced in 4.8 wherein attr fork extents are no longer trimmed like
they used to be.
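
What "trimmed" means here can be sketched as clamping an extent to the queried byte range. The code below is a generic userspace illustration, not the XFS implementation: `struct mapping` and `trim_mapping()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical byte-range mapping: [offset, offset + length). */
struct mapping {
	uint64_t offset;
	uint64_t length;
};

/*
 * Clamp 'ext' to the queried range [offset, offset + length).
 * Returns a zero-length mapping if the two do not overlap.
 */
static struct mapping trim_mapping(struct mapping ext, uint64_t offset,
				   uint64_t length)
{
	uint64_t ext_end = ext.offset + ext.length;
	uint64_t end = offset + length;
	uint64_t start = ext.offset > offset ? ext.offset : offset;
	uint64_t stop = ext_end < end ? ext_end : end;
	struct mapping out = { offset, 0 };

	if (start < stop) {
		out.offset = start;
		out.length = stop - start;
	}
	return out;
}
```

An "entire" lookup would instead hand back `ext` unchanged, which is the behaviour consumers that expect trimmed mappings are not prepared for.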

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-12-08 17:51:05 -08:00
Christoph Hellwig f59cf5c299 xfs: remove "no-allocation" reservations for file creations
If we create a new file we will need an inode, and usually some metadata
in the parent directory.  Aiming for everything to go well despite the
lack of a reservation leads to dirty transactions cancelled under a heavy
create/delete load.  This patch removes those nospace transactions, which
will lead to slightly earlier ENOSPC on some workloads, but will instead
prevent file system shutdowns due to cancelling dirty transactions for
others.

A customer observed assertion failures and shutdowns due to
cancellation of dirty transactions during heavy NFS workloads, as shown
below:

2017-05-30 21:17:06 kernel: WARNING: [ 2670.728125] XFS: Assertion failed: error != -ENOSPC, file: fs/xfs/xfs_inode.c, line: 1262

2017-05-30 21:17:06 kernel: WARNING: [ 2670.728222] Call Trace:
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728246]  [<ffffffff81795daf>] dump_stack+0x63/0x81
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728262]  [<ffffffff810a1a5a>] warn_slowpath_common+0x8a/0xc0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728264]  [<ffffffff810a1b8a>] warn_slowpath_null+0x1a/0x20
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728285]  [<ffffffffa01bf403>] asswarn+0x33/0x40 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728308]  [<ffffffffa01bb07e>] xfs_create+0x7be/0x7d0 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728329]  [<ffffffffa01b6ffb>] xfs_generic_create+0x1fb/0x2e0 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728348]  [<ffffffffa01b7114>] xfs_vn_mknod+0x14/0x20 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728366]  [<ffffffffa01b7153>] xfs_vn_create+0x13/0x20 [xfs]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728380]  [<ffffffff81231de5>] vfs_create+0xd5/0x140
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728390]  [<ffffffffa045ddb9>] do_nfsd_create+0x499/0x610 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728396]  [<ffffffffa0465fa5>] nfsd3_proc_create+0x135/0x210 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728401]  [<ffffffffa04561e3>] nfsd_dispatch+0xc3/0x210 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728416]  [<ffffffffa03bfa43>] svc_process_common+0x453/0x6f0 [sunrpc]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728423]  [<ffffffffa03bfdf3>] svc_process+0x113/0x1f0 [sunrpc]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728427]  [<ffffffffa0455bcf>] nfsd+0x10f/0x180 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728432]  [<ffffffffa0455ac0>] ? nfsd_destroy+0x80/0x80 [nfsd]
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728438]  [<ffffffff810c0d58>] kthread+0xd8/0xf0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728441]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728451]  [<ffffffff8179d962>] ret_from_fork+0x42/0x70
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728453]  [<ffffffff810c0c80>] ? kthread_create_on_node+0x1b0/0x1b0
2017-05-30 21:17:06 kernel: WARNING: [ 2670.728454] ---[ end trace f9822c842fec81d4 ]---

2017-05-30 21:17:06 kernel: ALERT: [ 2670.728477] XFS (sdb): Internal error xfs_trans_cancel at line 983 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x4ee/0x7d0 [xfs]

2017-05-30 21:17:06 kernel: ALERT: [ 2670.728684] XFS (sdb): Corruption of in-memory data detected. Shutting down filesystem
2017-05-30 21:17:06 kernel: ALERT: [ 2670.728685] XFS (sdb): Please umount the filesystem and rectify the problem(s)

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-12-08 17:51:05 -08:00
Pravin Shedge eaf0ec303b fs: xfs: remove duplicate includes
These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge <pravin.shedge4linux@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-12-08 17:51:05 -08:00
Yan, Zheng 040d786032 ceph: drop negative child dentries before try pruning inode's alias
Negative child dentry holds reference on inode's alias, it makes
d_prune_aliases() do nothing.

Cc: stable@vger.kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-12-08 11:07:12 +01:00
Linus Torvalds ba3edf1f77 proc: show si_ptr in /proc/<pid>/timers without hashing
It's a user pointer, and while the permissions of the file are pretty
questionable (should it really be readable to everybody), hashing the
pointer isn't going to be the solution.

We should take a closer look at more of the /proc/<pid> file permissions
in general.  Sure, we do want many of them to often be readable (for
'ps' and friends), but I think we should probably do a few conversions
from S_IRUGO to S_IRUSR.

Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-12-06 18:23:27 -08:00
Nikolay Borisov c8bcbfbd23 btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs stores the arguments on the stack, but the kernel makes its
own copy of the ioctl buffer, so the off-by-one overwrite does not affect
userspace, although the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.
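
The indexing issue described above can be sketched in user space. This is a minimal illustration, not the btrfs code: LOOKUP_PATH_MAX stands in for BTRFS_INO_LOOKUP_PATH_MAX (the value 4080 is taken from the message), and last_valid_index() and terminate_path() are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for BTRFS_INO_LOOKUP_PATH_MAX; 4080 is the size quoted above. */
#define LOOKUP_PATH_MAX 4080

/* Hypothetical helper: index of the last writable byte in a buffer of `size` bytes. */
static size_t last_valid_index(size_t size)
{
	assert(size > 0);
	return size - 1;
}

/* Hypothetical helper: place the terminating NUL on the last in-bounds byte. */
static void terminate_path(char *name, size_t size)
{
	/* Using name[size] here would be the off-by-one write. */
	name[last_valid_index(size)] = '\0';
}
```

With a 4080-byte array the last legal index is 4079, so anything that starts writing at index 4080 touches memory that belongs to someone else.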

Fixes: ac8e9819d7 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ added implications ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:35:15 +01:00
Omar Sandoval 1b9e619c5b Btrfs: disable FUA if mounted with nobarrier
I was seeing disk flushes still happening when I mounted a Btrfs
filesystem with nobarrier for testing. This is because we use FUA to
write out the first super block, and on devices without FUA support, the
block layer translates FUA to a flush. Even on devices supporting true
FUA, using FUA when we asked for no barriers is surprising.

Fixes: 387125fc72 ("Btrfs: fix barrier flushes")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:34:45 +01:00
Jeff Mahoney e19182c0ff btrfs: fix missing error return in btrfs_drop_snapshot
If btrfs_del_root fails in btrfs_drop_snapshot, we'll pick up the
error but then return 0 anyway due to mixing err and ret.
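
A minimal sketch of the err/ret mix-up, assuming a hypothetical failing_step() in place of btrfs_del_root() and heavily simplified control flow; it only illustrates the shape of the bug and is not the actual function:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a step that can fail, such as btrfs_del_root(). */
static int failing_step(int fail)
{
	return fail ? -EIO : 0;
}

/* Buggy shape: the failure lands in `ret`, but the function returns `err`. */
static int drop_buggy(int fail)
{
	int err = 0;
	int ret;

	ret = failing_step(fail);
	if (ret)
		goto out;	/* `err` is still 0 here, so the error is lost */
out:
	return err;
}

/* Fixed shape: copy the failure into the variable that is actually returned. */
static int drop_fixed(int fail)
{
	int err = 0;
	int ret;

	ret = failing_step(fail);
	if (ret) {
		err = ret;
		goto out;
	}
out:
	return err;
}
```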

Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: <stable@vger.kernel.org> # v3.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:30:29 +01:00
Jeff Mahoney 692826b273 btrfs: handle errors while updating refcounts in update_ref_for_cow
Since commit fb235dc06f (btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans) the assumption that
btrfs_add_delayed_{data,tree}_ref can only return 0 or -ENOMEM has
been false.  The qgroup operations call into btrfs_search_slot
and friends and can now return the full spectrum of error codes.

Fortunately, the fix here is easy since update_ref_for_cow failing
is already handled so we just need to bail early with the error
code.

Fixes: fb235dc06f (btrfs: qgroup: Move half of the qgroup accounting ...)
Cc: <stable@vger.kernel.org> # v4.11+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:30:03 +01:00
Justin Maggard b430b77512 btrfs: Fix quota reservation leak on preallocated files
Commit c6887cd111 ("Btrfs: don't do nocow check unless we have to")
changed the behavior of __btrfs_buffered_write() so that it first tries
to get a data space reservation, and then skips the relatively expensive
nocow check if the reservation succeeded.

If we have quotas enabled, the data space reservation also includes a
quota reservation.  But in the rewrite case, the space has already been
accounted for in qgroups.  So btrfs_check_data_free_space() increases
the quota reservation, but it never gets decreased when the data
actually gets written and overwrites the pre-existing data.  So we're
left with both the qgroup and qgroup reservation accounting for the same
space.

This commit adds the missing btrfs_qgroup_free_data() call in the case
of BTRFS_ORDERED_PREALLOC extents.

Fixes: c6887cd111 ("Btrfs: don't do nocow check unless we have to")
Signed-off-by: Justin Maggard <jmaggard@netgear.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-12-07 00:28:12 +01:00
Aurelien Aptel 5702591fc6 CIFS: don't log STATUS_NOT_FOUND errors for DFS
cifs.ko makes DFS queries regardless of the type of the server and
non-DFS servers are common. This often results in superfluous logging of
non-critical errors.

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2017-12-06 12:48:01 -06:00
Ronnie Sahlberg a821df3f1a cifs: fix NULL deref in SMB2_read
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-12-06 12:46:13 -06:00
Al Viro ca0168e8a7 alloc_super(): do ->s_umount initialization earlier
... so that failure exits could count on it having been
done.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-12-05 09:32:25 -05:00
Paul E. McKenney 7088efa913 fs/dcache: Use release-acquire for name/length update
The code in __d_alloc() carefully orders filling in the NUL character
of the name (and the length, hash, and the name itself) with assigning
of the name itself.  However, prepend_name() does not order the accesses
to the ->name and ->len fields, other than on TSO systems.  This commit
therefore replaces prepend_name()'s READ_ONCE() of ->name with an
smp_load_acquire(), which orders against the subsequent READ_ONCE() of
->len.  Because READ_ONCE() now incorporates smp_read_barrier_depends(),
prepend_name()'s smp_read_barrier_depends() is removed.  Finally,
to save a line, the smp_wmb()/store pair in __d_alloc() is replaced
by smp_store_release().
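
The release/acquire pairing can be sketched with C11 atomics as a user-space analogy. Mapping smp_store_release() and smp_load_acquire() onto memory_order_release and memory_order_acquire is an assumption made for illustration; struct entry, publish_name() and read_name_len() are hypothetical and much simpler than the dcache code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a dentry: a published name pointer. */
struct entry {
	_Atomic(const char *) name;
};

/* Writer: finish the buffer (including the NUL), then publish with release. */
static void publish_name(struct entry *e, char *buf, const char *src, size_t len)
{
	memcpy(buf, src, len);
	buf[len] = '\0';
	/* Everything written above is visible to a reader that acquires `name`. */
	atomic_store_explicit(&e->name, buf, memory_order_release);
}

/* Reader: the acquire load orders all later reads after the pointer load. */
static size_t read_name_len(struct entry *e)
{
	const char *p = atomic_load_explicit(&e->name, memory_order_acquire);

	return p ? strlen(p) : 0;
}
```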

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
2017-12-04 10:52:52 -08:00
Paul E. McKenney 388a4c8806 fs: Eliminate cond_resched_rcu_qs() in favor of cond_resched()
Now that cond_resched() also provides RCU quiescent states when
needed, it can be used in place of cond_resched_rcu_qs().  This
commit therefore makes this change.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: <linux-fsdevel@vger.kernel.org>
2017-12-04 10:28:59 -08:00
Eryu Guan c894aa9757 ext4: fix fdatasync(2) after fallocate(2) operation
Currently, if fallocate(2) with KEEP_SIZE is followed by fdatasync(2)
and then a crash, we see a wrong allocated block count (stat -c %b):
the blocks allocated beyond EOF are all lost. fstests generic/468
exposes this bug.

Commit 67a7d5f561 ("ext4: fix fdatasync(2) after extent
manipulation operations") fixed all the other extent manipulation
operation paths such as hole punch, zero range, collapse range etc.,
but forgot the fallocate case.

So similarly, fix it by recording the correct journal tid in ext4
inode in fallocate(2) path, so that ext4_sync_file() will wait for
the right tid to be committed on fdatasync(2).

This addresses the test failure in xfstests test generic/468.

Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2017-12-03 22:52:51 -05:00
Andi Kleen fc82228a5e ext4: support fast symlinks from ext3 file systems
407cd7fb83 (ext4: change fast symlink test to not rely on i_blocks)
broke ~10-year-old ext3 file systems created by 2.6.17. Any ELF
executable fails because the /lib/ld-linux.so.2 fast symlink
cannot be read anymore.

The patch assumed fast symlinks were created in a specific way,
but that's not true on these really old file systems.

The new behavior is apparently needed only with the large EA inode
feature.

Revert to the old behavior if the large EA inode feature is not set.

This makes my old VM boot again.

Fixes: 407cd7fb83 (ext4: change fast symlink test to not rely on i_blocks)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2017-12-03 20:38:01 -05:00
Linus Torvalds 2db767d988 NFS client fixes for Linux 4.15-rc2
Bugfixes:
 - NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
 - SUNRPC: Allow connect to return EHOSTUNREACH
 - SUNRPC: Handle ENETDOWN errors

Merge tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These patches fix a problem with compiling using an old version of
  gcc, and also fix up error handling in the SUNRPC layer.

   - NFSv4: Ensure gcc 4.4.4 can compile initialiser for
     "invalid_stateid"

   - SUNRPC: Allow connect to return EHOSTUNREACH

   - SUNRPC: Handle ENETDOWN errors"

* tag 'nfs-for-4.15-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  SUNRPC: Handle ENETDOWN errors
  SUNRPC: Allow connect to return EHOSTUNREACH
  NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
2017-12-01 20:04:20 -05:00
Linus Torvalds 788c1da05b Changes since last update:
- Fix memory leaks that appeared after removing ifork inline data buffer
 - Recover deferred rmap update log items in correct order
 - Fix memory leaks when buffer construction fails
 - Fix memory leaks when bmbt is corrupt
 - Fix some uninitialized variables and math problems in the quota scrubber
 - Add some omitted attribution tags on the log replay commit
 - Fix some UBSAN complaints about integer overflows with large sparse files
 - Implement an effective inode mode check in online fsck
 - Fix log's inability to retry quota item writeout due to transient errors

Merge tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Here are some bug fixes for 4.15-rc2.

   - fix memory leaks that appeared after removing ifork inline data
     buffer

   - recover deferred rmap update log items in correct order

   - fix memory leaks when buffer construction fails

   - fix memory leaks when bmbt is corrupt

   - fix some uninitialized variables and math problems in the quota
     scrubber

   - add some omitted attribution tags on the log replay commit

   - fix some UBSAN complaints about integer overflows with large sparse
     files

   - implement an effective inode mode check in online fsck

   - fix log's inability to retry quota item writeout due to transient
     errors"

* tag 'xfs-4.15-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: Properly retry failed dquot items in case of error during buffer writeback
  xfs: scrub inode mode properly
  xfs: remove unused parameter from xfs_writepage_map
  xfs: ubsan fixes
  xfs: calculate correct offset in xfs_scrub_quota_item
  xfs: fix uninitialized variable in xfs_scrub_quota
  xfs: fix leaks on corruption errors in xfs_bmap.c
  xfs: fortify xfs_alloc_buftarg error handling
  xfs: log recovery should replay deferred ops in order
  xfs: always free inline data before resetting inode fork during ifree
2017-12-01 20:00:19 -05:00
David Howells f8de483e74 afs: Properly reset afs_vnode (inode) fields
When an AFS inode is allocated by afs_alloc_inode(), the allocated
afs_vnode struct isn't necessarily reset from the last time it was used as
an inode because the slab constructor is only invoked once when the memory
is obtained from the page allocator.

This means that information can leak from one inode to the next because
we're not calling kmem_cache_zalloc().  Some of the information isn't
reset, in particular the permit cache pointer.

Bring the clearances up to date.

Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
2017-12-01 11:51:24 +00:00
David Howells 1bcab12521 afs: Fix permit refcounting
Fix four refcount bugs in afs_cache_permit():

 (1) When checking the result of the kzalloc(), we can't just return, but
     must put 'permits'.

 (2) We shouldn't put permits immediately after hashing a new permit as we
     need to keep the pointer stable so that we can check to see if
     vnode->permit_cache has changed before we decide whether to assign to
     it.

 (3) 'permits' is being put twice.

 (4) We need to put either the replacement or the thing replaced after the
     assignment to vnode->permit_cache.

Without this, lots of the following are seen:

  Kernel BUG at ffffffffa039857b [verbose debug info unavailable]
  ------------[ cut here ]------------
  Kernel BUG at ffffffffa039858a [verbose debug info unavailable]
  ------------[ cut here ]------------

The addresses are in the .text..refcount section of the kafs.ko module.
Following the relocation records for the __ex_table section shows one to be
due to the decrement in afs_put_permits() and the other to be key_get() in
afs_cache_permit().

Occasionally, the following is seen:

  refcount_t overflow at afs_cache_permit+0x57d/0x5c0 [kafs] in cc1[562], uid/euid: 0/0
  WARNING: CPU: 0 PID: 562 at kernel/panic.c:657 refcount_error_report+0x9c/0xac
  ...

Reported-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Marc Dionne <marc.dionne@auristor.com>
2017-12-01 11:40:43 +00:00
Eric W. Biederman 116ceac974 autofs4: Modify autofs_wait to use current_uid() and current_gid()
The code used to do that and then I mucked with it and never quite put
the code back.  Today the code references current_cred()->uid and
current_cred()->gid which is equivalent but more wordy, and not
idiomatic.

Fixes: 93faccbbfa ("fs: Better permission checking for submounts")
Fixes: 069d5ac9ae ("autofs:  Fix automounts by using current_real_cred()->uid")
Acked-by:  Ian Kent <raven@themaw.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-11-30 17:47:52 -06:00
Eric W. Biederman bbc3e47101 userns: Don't fail follow_automount based on s_user_ns
When vfs_submount was added, the test to limit automounts from
filesystems with s_user_ns != &init_user_ns was accidentally left
in follow_automount.  The test was never about any security concerns
and was always about how do we implement this for filesystems whose
s_user_ns != &init_user_ns.

At the moment this check makes no difference as there are no
filesystems that both set FS_USERNS_MOUNT and implement d_automount.

Remove this check now while I am thinking about it so there will not
be odd booby traps for someone who does want to make this combination
work.

vfs_submount still needs improvements to allow this combination to work,
and vfs_submount contains a check that presents a warning.

The autofs4 filesystem could be modified to set FS_USERNS_MOUNT and it would
need not work on this code path, as userspace performs the mounts.

Fixes: 93faccbbfa ("fs: Better permission checking for submounts")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Acked-by:  Ian Kent <raven@themaw.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-11-30 17:47:20 -06:00
Linus Torvalds 9c41180be4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota & reiserfs changes from Jan Kara:

 - two error checking improvements for quota

 - remove bogus i_version increase for reiserfs

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: Check for register_shrinker() failure.
  quota: propagate error from __dquot_initialize
  reiserfs: remove unneeded i_version bump
2017-11-30 18:38:47 -05:00
Carlos Maiolino 373b0589dc xfs: Properly retry failed dquot items in case of error during buffer writeback
Now that the inode item writeback error handling is fixed, it's time to fix the
same problem in the dquot code.

Although there were no reports of users hitting this bug in dquot code (at least
none I've seen), the bug is there and I was already planning to fix it when the
correct approach to fix the inodes part was decided.

This patch aims to fix the same problem in dquot code, regarding failed buffers
being unable to be resubmitted once they are flush locked.

Tested with the test case recently sent to the fstests list by Hou Tao.

Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-30 08:47:40 -08:00
Darrick J. Wong 3b42d38575 xfs: scrub inode mode properly
Since we've used up all the bits in i_mode, the existing mode check
doesn't actually do anything useful.  However, we've not used all the
bit values in the format portion of i_mode, so we /do/ need to test
that for bad values.
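
A minimal sketch of checking the format portion of a mode for bad values. The SK_IF* constants are defined locally for illustration and mirror the usual Linux file-type encodings; mode_format_is_valid() is a hypothetical name, not the scrubber's function.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the usual Linux file-type encodings (illustrative names). */
#define SK_IFMT   0170000u	/* mask for the format bits */
#define SK_IFSOCK 0140000u
#define SK_IFLNK  0120000u
#define SK_IFREG  0100000u
#define SK_IFBLK  0060000u
#define SK_IFDIR  0040000u
#define SK_IFCHR  0020000u
#define SK_IFIFO  0010000u

/* Returns true if the format bits of `mode` name a known file type. */
static bool mode_format_is_valid(unsigned int mode)
{
	switch (mode & SK_IFMT) {
	case SK_IFSOCK:
	case SK_IFLNK:
	case SK_IFREG:
	case SK_IFBLK:
	case SK_IFDIR:
	case SK_IFCHR:
	case SK_IFIFO:
		return true;
	default:
		/* The format field has values that no file type uses. */
		return false;
	}
}
```

The format field is four bits wide but only seven of its values are assigned here, which is why it can still hold a bad value even when every bit of the mode is in use.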

Fixes: 80e4e1268 ("xfs: scrub inodes")
Fixes-coverity-id: 1423992
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-11-30 08:43:52 -08:00
Darrick J. Wong 2d5f4b5beb xfs: remove unused parameter from xfs_writepage_map
The first thing that xfs_writepage_map does is clobber the offset
parameter.  Since we never use the passed-in value, turn the parameter
into a local variable.  This gets rid of an UBSAN warning in generic/466.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-11-30 08:43:52 -08:00
Darrick J. Wong 22a6c83777 xfs: ubsan fixes
Fix some complaints from the UBSAN about signed integer addition overflows.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-11-30 08:43:52 -08:00
Linus Torvalds a0908a1b7d Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "28 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (28 commits)
  fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate()
  mm/hugetlb: fix NULL-pointer dereference on 5-level paging machine
  autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"
  autofs: revert "autofs: take more care to not update last_used on path walk"
  fs/fat/inode.c: fix sb_rdonly() change
  mm, memcg: fix mem_cgroup_swapout() for THPs
  mm: migrate: fix an incorrect call of prep_transhuge_page()
  kmemleak: add scheduling point to kmemleak_scan()
  scripts/bloat-o-meter: don't fail with division by 0
  fs/mbcache.c: make count_objects() more robust
  Revert "mm/page-writeback.c: print a warning if the vm dirtiness settings are illogical"
  mm/madvise.c: fix madvise() infinite loop under special circumstances
  exec: avoid RLIMIT_STACK races with prlimit()
  IB/core: disable memory registration of filesystem-dax vmas
  v4l2: disable filesystem-dax mapping support
  mm: fail get_vaddr_frames() for filesystem-dax mappings
  mm: introduce get_user_pages_longterm
  device-dax: implement ->split() to catch invalid munmap attempts
  mm, hugetlbfs: introduce ->split() to vm_operations_struct
  scripts/faddr2line: extend usage on generic arch
  ...
2017-11-29 19:12:44 -08:00
Nadav Amit 72639e6df4 fs/hugetlbfs/inode.c: change put_page/unlock_page order in hugetlbfs_fallocate()
hugetlbfs_fallocate() currently performs put_page() before unlock_page().
This ordering opens a small time window, from the time the page is added
to the page cache until it is unlocked, in which the page might be
removed from the page cache by another core.  If the page is removed
during this time window, it might cause memory corruption, as the
wrong page will be unlocked.
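
The ordering concern can be sketched with a toy reference-counted object in user space. struct sketch_page and the sketch_* helpers are hypothetical and carry none of the real page-cache semantics; the point is only that the unlock has to happen while the caller still holds its reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, much-simplified page: a lock bit and a reference count. */
struct sketch_page {
	bool locked;
	int refcount;
};

/* Allocate a page that starts out locked with a single reference. */
static struct sketch_page *page_alloc_locked(void)
{
	struct sketch_page *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->locked = true;
	p->refcount = 1;
	return p;
}

static void sketch_unlock_page(struct sketch_page *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Drops one reference; returns true if it was the last one and the page was freed. */
static bool sketch_put_page(struct sketch_page *p)
{
	if (--p->refcount > 0)
		return false;
	free(p);
	return true;
}

/* Correct order: unlock while our reference is still held, then drop it. */
static bool finish_page(struct sketch_page *p)
{
	sketch_unlock_page(p);
	return sketch_put_page(p);
}
```

Swapping the two calls in finish_page() would touch the object after the reference that kept it alive has been dropped.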

It is arguable whether this scenario can happen in a real system, and
there are several mitigating factors.  The issue was found by code
inspection (actually grep), and not by actually triggering the flow.
Yet, since putting the page before unlocking is incorrect it should be
fixed, if only to prevent future breakage or someone copy-pasting this
code.

Mike said:
 "I am of the opinion that this does not need to be sent to stable.
  Although the ordering in current code is incorrect, there is no way
  for this to be a problem with current locking. In addition, I verified
  that the perhaps bigger issue with sys_fadvise64(POSIX_FADV_DONTNEED)
  for hugetlbfs and other filesystems is addressed in 3a77d21480 ("mm:
  fadvise: avoid fadvise for fs without backing device")"

Link: http://lkml.kernel.org/r/20170826191124.51642-1-namit@vmware.com
Fixes: 70c3547e36 ("hugetlbfs: add hugetlbfs_fallocate()")
Signed-off-by: Nadav Amit <namit@vmware.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
Ian Kent 5d38f049ce autofs: revert "autofs: fix AT_NO_AUTOMOUNT not being honored"
Commit 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
allowed the fstatat(2) system call to properly honor the AT_NO_AUTOMOUNT
flag but introduced a semantic change.

In order to honor AT_NO_AUTOMOUNT a semantic change was made to the
negative dentry case for stat family system calls in follow_automount().

This changed the unconditional triggering of an automount in this case
to no longer be done and an error returned instead.

This has caused more problems than I expected so reverting the change is
needed.

In a discussion with Neil Brown it was concluded that the automount(8)
daemon can implement this change without kernel modifications.  So that
will be done instead and the autofs module documentation updated with a
description of the problem and what needs to be done by module users for
this specific case.

Link: http://lkml.kernel.org/r/151174730120.6162.3848002191530283984.stgit@pluto.themaw.net
Fixes: 42f4614821 ("autofs: fix AT_NO_AUTOMOUNT not being honored")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Neil Brown <neilb@suse.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
Ian Kent 43694d4bf8 autofs: revert "autofs: take more care to not update last_used on path walk"
While commit 092a53452b ("autofs: take more care to not update
last_used on path walk") helped (partially) resolve a problem where
automounts were not expiring due to aggressive accesses from user space
it has a side effect for very large environments.

This change helps with the expire problem by making the expire more
aggressive but, for very large environments, that means more mount
requests from clients.  When there are a lot of clients that can mean
fairly significant server load increases.

It turns out I put the last_used in this position to solve this very
problem and failed to update my own thinking of the autofs expire
policy.  So the patch being reverted introduces a regression which
should be fixed.

Link: http://lkml.kernel.org/r/151174729420.6162.1832622523537052460.stgit@pluto.themaw.net
Fixes: 092a53452b ("autofs: take more care to not update last_used on path walk")
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: NeilBrown <neilb@suse.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: <stable@vger.kernel.org>	[4.11+]
Cc: Colin Walters <walters@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
OGAWA Hirofumi b6e8e12c0a fs/fat/inode.c: fix sb_rdonly() change
Commit bc98a42c1f ("VFS: Convert sb->s_flags & MS_RDONLY to
sb_rdonly(sb)") converted fat_remount():new_rdonly from a bool to an
int.

However fat_remount() depends upon the compiler's conversion of a
non-zero integer into boolean `true'.

Fix it by switching `new_rdonly' back into a bool.
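
A minimal sketch of why the variable's type matters when a masked flag is compared against a boolean. SKETCH_RDONLY is a hypothetical bit deliberately chosen to be something other than 1 so the difference is visible; the helper names are also illustrative, and none of this is the fat_remount() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit, not the real MS_RDONLY value. */
#define SKETCH_RDONLY 0x10

/* Stored in an int, the masked value keeps its magnitude (0 or 0x10). */
static int rdonly_as_int(unsigned long flags)
{
	return flags & SKETCH_RDONLY;
}

/* Stored in a bool, any non-zero value is converted to true (1). */
static bool rdonly_as_bool(unsigned long flags)
{
	return flags & SKETCH_RDONLY;
}

/* "Did the read-only state change?" using the int-typed value. */
static bool changed_int(unsigned long flags, bool cur)
{
	return rdonly_as_int(flags) != cur;
}

/* The same question using the bool-typed value. */
static bool changed_bool(unsigned long flags, bool cur)
{
	return rdonly_as_bool(flags) != cur;
}
```

With the int version, a set flag compares as 16 against 1 and reports a change that did not happen; the bool version normalizes the value first.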

Link: http://lkml.kernel.org/r/87mv3d5x51.fsf@mail.parknet.co.jp
Fixes: bc98a42c1f ("VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb)")
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Joe Perches <joe@perches.com>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
Jiang Biao d5dabd6339 fs/mbcache.c: make count_objects() more robust
When running ltp stress test for 7*24 hours, vmscan occasionally emits
the following warning continuously:

  mb_cache_scan+0x0/0x3f0 negative objects to delete
  nr=-9232265467809300450
  ...

Tracing shows that freeable (the value mb_cache_count() returns) is -1,
which causes the continuous accumulation and overflow of total_scan.

This patch makes sure that mb_cache_count() cannot return a negative
value, which makes the mbcache shrinker more robust.
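
A minimal sketch of the clamping idea, assuming a hypothetical struct sketch_cache whose signed counter can transiently drop below zero. How the real counter ends up negative is not stated above, and this is not the mbcache code.

```c
#include <assert.h>

/* Hypothetical, simplified cache with a signed entry counter. */
struct sketch_cache {
	long entry_count;
};

/*
 * A count callback reports an unsigned number of freeable objects.
 * Clamp so a transient negative count is never seen as a huge value.
 */
static unsigned long count_objects(const struct sketch_cache *c)
{
	if (c->entry_count < 0)
		return 0;
	return (unsigned long)c->entry_count;
}
```

Without the clamp, -1 converted to unsigned long becomes the largest representable value, which is consistent with the runaway total_scan described above.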

Link: http://lkml.kernel.org/r/1511753419-52328-1-git-send-email-jiang.biao2@zte.com.cn
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <zhong.weidong@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:43 -08:00
Kees Cook 04e35f4495 exec: avoid RLIMIT_STACK races with prlimit()
While the defense-in-depth RLIMIT_STACK limit on setuid processes was
protected against races from other threads calling setrlimit(), I missed
protecting it against races from external processes calling prlimit().
This adds locking around the change and makes sure that rlim_max is set
too.

Link: http://lkml.kernel.org/r/20171127193457.GA11348@beast
Fixes: 64701dee41 ("exec: Use sane stack rlimit under secureexec")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: Serge Hallyn <serge@hallyn.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:42 -08:00
Dan Williams c7da82b894 mm: replace pmd_write with pmd_access_permitted in fault + gup paths
The 'access_permitted' helper is used in the gup-fast path and goes
beyond the simple _PAGE_RW check to also:

 - validate that the mapping is writable from a protection keys
   standpoint

 - validate that the pte has _PAGE_USER set since all fault paths where
   pmd_write is used must be referencing user memory.

Link: http://lkml.kernel.org/r/151043111049.2842.15241454964150083466.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-29 18:40:42 -08:00
Linus Torvalds b915176102 Highlights:
- Fixes from Trond for some races in the NFSv4 state code.
 	- Fix from Naofumi Honda for a typo in the blocked lock
 	  notification code.
 	- Fixes from Vasily Averin for some problems starting and
 	  stopping lockd especially in network namespaces.

Merge tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "I screwed up my merge window pull request; I only sent half of what I
  meant to.

  There were no new features, just bugfixes of various importance and
  some very minor cleanup, so I think it's all still appropriate for
  -rc2.

  Highlights:

   - Fixes from Trond for some races in the NFSv4 state code.

   - Fix from Naofumi Honda for a typo in the blocked lock notification
     code

   - Fixes from Vasily Averin for some problems starting and stopping
     lockd especially in network namespaces"

* tag 'nfsd-4.15-1' of git://linux-nfs.org/~bfields/linux: (23 commits)
  lockd: fix "list_add double add" caused by legacy signal interface
  nlm_shutdown_hosts_net() cleanup
  race of nfsd inetaddr notifiers vs nn->nfsd_serv change
  race of lockd inetaddr notifiers vs nlmsvc_rqst change
  SUNRPC: make cache_detail structures const
  NFSD: make cache_detail structures const
  sunrpc: make the function arg as const
  nfsd: check for use of the closed special stateid
  nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
  lockd: lost rollback of set_grace_period() in lockd_down_net()
  lockd: added cleanup checks in exit_net hook
  grace: replace BUG_ON by WARN_ONCE in exit_net hook
  nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class
  lockd: remove net pointer from messages
  nfsd: remove net pointer from debug messages
  nfsd: Fix races with check_stateid_generation()
  nfsd: Ensure we check stateid validity in the seqid operation checks
  nfsd: Fix race in lock stateid creation
  nfsd4: move find_lock_stateid
  nfsd: Ensure we don't recognise lock stateids after freeing them
  ...
2017-11-29 14:49:26 -08:00
Linus Torvalds 26cd94744e for-4.15-rc2-tag

Merge tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "We've collected some fixes in since the pre-merge window freeze.

  There's technically only one regression fix for 4.15, but the rest
  seem important and are candidates for stable.

   - fix missing flush bio puts in error cases (is serious, but rarely
     happens)

   - fix reporting stat::st_blocks for buffered append writes

   - fix space cache invalidation

   - fix out of bound memory access when setting zlib level

   - fix potential memory corruption when fsync fails in the middle

   - fix crash in integrity checker

   - incremental send fix, path mixup for certain unlink/rename
     combination

   - pass flags to writeback so compressed writes can be throttled
     properly

   - error handling fixes"

* tag 'for-4.15-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: incremental send, fix wrong unlink path after renaming file
  btrfs: tree-checker: Fix false panic for sanity test
  Btrfs: fix list_add corruption and soft lockups in fsync
  btrfs: Fix wild memory access in compression level parser
  btrfs: fix deadlock when writing out space cache
  btrfs: clear space cache inode generation always
  Btrfs: fix reported number of inode blocks after buffered append writes
  Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
  Btrfs: bail out gracefully rather than BUG_ON
  btrfs: dev_alloc_list is not protected by RCU, use normal list_del
  btrfs: add missing device::flush_bio puts
  btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
  Btrfs: add write_flags for compression bio
2017-11-29 14:26:50 -08:00
Trond Myklebust 445f288d70 NFSv4: Ensure gcc 4.4.4 can compile initialiser for "invalid_stateid"
gcc 4.4.4 is too old to have full C11 anonymous union support, so
the current initialiser fails to compile.

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
(compile-)Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-29 13:46:32 -05:00
Tetsuo Handa 88bc0ede8d quota: Check for register_shrinker() failure.
register_shrinker() might return -ENOMEM error since Linux 3.12.
Call panic() as with other failure checks in this function if
register_shrinker() failed.

Fixes: 1d3d4437ea ("vmscan: per-node deferred work")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jan Kara <jack@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-29 16:46:48 +01:00
Eric Sandeen 712d361d59 xfs: calculate correct offset in xfs_scrub_quota_item
It's only used for tracepoints so it's relatively harmless,
but the offset is calculated incorrectly in xfs_scrub_quota_item.

qi_dqperchunk is the nr. of dquots per "chunk" which we have
conveniently *cough* defined to always be 1 FSB.  Therefore
block_offset * qi_dqperchunk == first id in that chunk,
and so offset = id / qi_dqperchunk

id * dqperchunk is ... meaningless.
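
The arithmetic can be checked with a small Python sketch. The helper names and the value of 30 dquots per chunk are assumptions made for illustration; only the two formulas come from the message above.

```python
# Illustrative arithmetic only; the constant below is a made-up value.
DQPERCHUNK = 30  # hypothetical number of dquots per chunk (1 FSB)


def first_id_in_chunk(block_offset, dqperchunk=DQPERCHUNK):
    """block_offset * qi_dqperchunk == first id in that chunk."""
    return block_offset * dqperchunk


def chunk_offset(dquot_id, dqperchunk=DQPERCHUNK):
    """Inverse mapping: offset = id / qi_dqperchunk (integer division)."""
    return dquot_id // dqperchunk
```

Every id inside a chunk maps back to the same block offset, whereas multiplying the id by the per-chunk count yields a number that is neither an id nor an offset.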

Fixes-coverity-id: 1423965
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-28 08:57:11 -08:00
Eric Sandeen eda6bc27cc xfs: fix uninitialized variable in xfs_scrub_quota
On the first pass through the while(1) loop, we get to
xfs_scrub_should_terminate() which can test the uninitialized
error variable.

Fixes-coverity-id: 1423737
Fixes: c2fc338c ("xfs: scrub quota information")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-28 08:57:11 -08:00
Eric Sandeen d41c6172bd xfs: fix leaks on corruption errors in xfs_bmap.c
Use _GOTO instead of _RETURN so we can free the allocated
cursor on error.

Fixes: bf80628 ("xfs: remove xfs_bmse_shift_one")
Fixes-coverity-id: 1423813, 1423676
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-28 08:57:11 -08:00
Michal Hocko d210a9874b xfs: fortify xfs_alloc_buftarg error handling
percpu_counter_init failure path doesn't clean up &btp->bt_lru list.
Call list_lru_destroy in that error path. Similarly register_shrinker
error path is not handled.

While these error paths are unlikely to trigger, it is not impossible;
the latter in particular might fail on large NUMA machines.  Let's handle
the failure to make the code more robust.

Noticed-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-28 08:57:11 -08:00
Filipe Manana ea37d5998b Btrfs: incremental send, fix wrong unlink path after renaming file
Under some circumstances, an incremental send operation can issue wrong
paths for unlink commands related to files that have multiple hard links
and some (or all) of those links were renamed between the parent and send
snapshots. Consider the following example:

Parent snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- b/                                         (ino 259)
 |     |     |---- c/                                   (ino 260)
 |     |     |---- f2                                   (ino 261)
 |     |
 |     |---- f2l1                                       (ino 261)
 |
 |---- d/                                               (ino 262)
       |---- f1l1_2                                     (ino 258)
       |---- f2l2                                       (ino 261)
       |---- f1_2                                       (ino 258)

Send snapshot

 .                                                      (ino 256)
 |---- a/                                               (ino 257)
 |     |---- f2l1/                                      (ino 263)
 |             |---- b2/                                (ino 259)
 |                   |---- c/                           (ino 260)
 |                   |     |---- d3                     (ino 262)
 |                   |           |---- f1l1_2           (ino 258)
 |                   |           |---- f2l2_2           (ino 261)
 |                   |           |---- f1_2             (ino 258)
 |                   |
 |                   |---- f2                           (ino 261)
 |                   |---- f1l2                         (ino 258)
 |
 |---- d                                                (ino 261)

When computing the incremental send stream the following steps happen:

1) When processing inode 261, a rename operation is issued that renames
   inode 262, which currently has a path of "d", to an orphan name of
   "o262-7-0". This is done because in the send snapshot, inode 261 has
   one of its hard links with a path of "d" as well.

2) Two link operations are issued that create the new hard links for
   inode 261, whose names are "d" and "f2l2_2", at paths "/" and
   "o262-7-0/" respectively.

3) Still while processing inode 261, unlink operations are issued to
   remove the old hard links of inode 261, with names "f2l1" and "f2l2",
   at paths "a/" and "d/". However path "d/" does not correspond anymore
   to the directory inode 262 but corresponds instead to a hard link of
   inode 261 (link command issued in the previous step). This makes the
   receiver fail with a ENOTDIR error when attempting the unlink
   operation.

The problem happens because before sending the unlink operation, we failed
to detect that inode 262 was one of the ancestors of inode 261 in the parent
snapshot, and therefore we didn't recompute the path for inode 262 before
issuing the unlink operation for the link named "f2l2" of inode 262. The
detection failed because the function "is_ancestor()" only follows the
first hard link it finds for an inode instead of all of its hard links
(as it was originally created for being used with directories only, for
which only one hard link exists). So fix this by making "is_ancestor()"
follow all hard links of the input inode.
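
The difference between following one hard link and following all of them can be sketched with a toy Python model of the parent snapshot above. The dictionary layout, the link ordering, and both function names are assumptions made for illustration; this is not the btrfs send code.

```python
# Toy model of the parent snapshot: inode -> parent directory of each link.
# The order of the links is an assumption; btrfs uses its own key order.
ROOT = 256
PARENT_LINKS = {
    257: [256],            # a/
    259: [257],            # a/b/
    260: [259],            # a/b/c/
    262: [256],            # d/
    261: [259, 257, 262],  # a/b/f2, a/f2l1, d/f2l2
}


def is_ancestor_first_link(ancestor, ino, links=PARENT_LINKS):
    """Old behaviour: walk up through the first link only."""
    while ino != ROOT:
        ino = links[ino][0]
        if ino == ancestor:
            return True
    return False


def is_ancestor_all_links(ancestor, ino, links=PARENT_LINKS):
    """Fixed behaviour: explore the parents reached through every link."""
    stack, seen = [ino], set()
    while stack:
        cur = stack.pop()
        for parent in links.get(cur, []):
            if parent == ancestor:
                return True
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return False
```

With this data the first-link walk from inode 261 visits 259, 257 and 256 and never sees 262, while the all-links walk finds it through the "d/f2l2" link.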

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28 17:15:30 +01:00
Chao Yu 1a6152d36d quota: propagate error from __dquot_initialize
Commit 6184fc0b8d ("quota: Propagate error from ->acquire_dquot()")
propagated errors from __dquot_initialize() to its callers, but it did not
handle such errors in add_dquot_ref(). As a result, if initialization fails
for some inodes while quota accounting information is being set up, the
error is silently ignored and accounting proceeds for the remaining inodes.

Report the error to the user instead, so that once quota has been turned
on successfully, disk usage is known to be accounted for all inodes.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-28 16:08:08 +01:00
Qu Wenruo 69fc6cbbac btrfs: tree-checker: Fix false panic for sanity test
[BUG]
If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
instantly cause kernel panic like:

------
...
assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
...
Call Trace:
 btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
 setup_items_for_insert+0x385/0x650 [btrfs]
 __btrfs_drop_extents+0x129a/0x1870 [btrfs]
...
-----

[Cause]
Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.

However, quite a few btrfs_mark_buffer_dirty() callers(*) initialize only
the item pointers and leave the item data uninitialized.

The tree-checker then flags the uninitialized data as an error, causing
the panic above.

*: These callers include, but are not limited to:
setup_items_for_insert()
btrfs_split_item()
btrfs_expand_item()

[Fix]
Add a new parameter @check_item_data to btrfs_check_leaf().
With @check_item_data set to false, the item data check is skipped and
btrfs_check_leaf() falls back to its old behavior.

So we can still get early warning if we screw up item pointers, and
avoid false panic.

Cc: Filipe Manana <fdmanana@gmail.com>
Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-28 14:59:09 +01:00
Linus Torvalds 8f5abe842e proc: don't report kernel addresses in /proc/<pid>/stack
This just changes the file to report them as zero, although maybe even
that could be removed.  I checked, and at least procps doesn't actually
seem to parse the 'stack' file at all.

And since the file doesn't necessarily even exist (it requires
CONFIG_STACKTRACE), possibly other tools don't really use it either.

That said, in case somebody parses it with tools, just having that zero
there should keep such tools happy.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 16:45:56 -08:00
Vasily Averin 81833de1a4 lockd: fix "list_add double add" caused by legacy signal interface
restart_grace() uses hardcoded init_net.
This can cause a "list_add double add" in the following scenario:

1) nfsd and lockd were started in several net namespaces
2) nfsd in init_net was stopped (lockd was not stopped because
 it has users in other net namespaces)
3) lockd got a signal, called restart_grace() -> set_grace_period()
 and enabled lock_manager in the hardcoded init_net.
4) nfsd in init_net is started again,
 its lockd_up() calls set_grace_period() and tries to add
 lock_manager into init_net a second time.

Jeff Layton suggested:
"Make it safe to call locks_start_grace multiple times on the same
lock_manager. If it's already on the global grace_list, then don't try
to add it again.  (But we don't intentionally add twice, so for now we
WARN about that case.)

With this change, we also need to ensure that the nfsd4 lock manager
initializes the list before we call locks_start_grace. While we're at
it, move the rest of the nfsd_net initialization into
nfs4_state_create_net. I see no reason to have it spread over two
functions like it is today."

The suggested patch was updated to generate a warning in the described situation.
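
The idea of making the add idempotent while still warning can be sketched in Python. The class, the list representation, and the return values are assumptions made for illustration; only the function name locks_start_grace comes from the message above, and this is not the kernel implementation.

```python
import warnings


class LockManager:
    """Hypothetical stand-in for a lock manager with an initialised list head."""

    def __init__(self, name):
        self.name = name


def locks_start_grace(grace_list, lm):
    """Add lm to grace_list once; warn instead of adding it a second time."""
    if any(entry is lm for entry in grace_list):
        warnings.warn("lock manager %s already on grace list" % lm.name)
        return False
    grace_list.append(lm)
    return True
```

A second call with the same manager leaves the list unchanged and only emits the warning, which is the behaviour the scenario in steps 1) to 4) needs.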

Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Vasily Averin 9e137ed5ab nlm_shutdown_hosts_net() cleanup
nlm_complain_hosts() walks through nlm_server_hosts hlist, which should
be protected by nlm_host_mutex.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Vasily Averin 2317dc557a race of nfsd inetaddr notifiers vs nn->nfsd_serv change
nfsd_inet[6]addr_event uses nn->nfsd_serv without taking nfsd_mutex,
which can be changed during execution of notifiers and crash the host.

Moreover if notifiers were enabled in one net namespace they are enabled
in all other net namespaces, from creation until destruction.

This patch allows notifiers to access nn->nfsd_serv only after the
pointer is correctly initialized and delays cleanup until notifiers are
no longer in use.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Tested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Vasily Averin 6b18dd1c03 race of lockd inetaddr notifiers vs nlmsvc_rqst change
lockd_inet[6]addr_event use nlmsvc_rqst without taking nlmsvc_mutex, so
nlmsvc_rqst can change while the notifiers are executing and crash the host.

This patch allows access to nlmsvc_rqst only once it has been correctly
initialized, and delays its cleanup until the notifiers are no longer in use.

Note that nlmsvc_rqst can be temporarily set to ERR_PTR, so the "if
(nlmsvc_rqst)" check in notifiers is insufficient on its own.
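
Why a plain non-NULL test is not enough can be shown with a Python model of the kernel's error-pointer encoding. It assumes 64-bit pointers and models pointers as plain integers; the helper `usable()` and the sample address are hypothetical, and this is not kernel code.

```python
# User-space model of ERR_PTR()/IS_ERR(); assumes 64-bit pointers.
MAX_ERRNO = 4095
WORD = 1 << 64
ENOMEM = 12


def ERR_PTR(err):
    """Encode a negative errno as a pointer value near the top of memory."""
    return err % WORD


def IS_ERR(ptr):
    """True if ptr lies in the last MAX_ERRNO values of the address space."""
    return ptr >= WORD - MAX_ERRNO


def usable(ptr):
    """Hypothetical combined check: neither NULL nor an encoded error."""
    return ptr != 0 and not IS_ERR(ptr)
```

An encoded error is a non-zero value, so it passes a bare truth test while still not being a valid pointer.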

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Tested-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Bhumika Goyal ae2e408ec2 NFSD: make cache_detail structures const
Make these const as they are only getting passed to the function
cache_create_net having the argument as const.

Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Andrew Elble ae254dac72 nfsd: check for use of the closed special stateid
Prevent the use of the closed (invalid) special stateid by clients.

Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Naofumi Honda 64ebe12494 nfsd: fix panic in posix_unblock_lock called from nfs4_laundromat
From kernel 4.9, my two nfsv4 servers sometimes suffer from
    "panic: unable to handle kernel page request"
in posix_unblock_lock() called from nfs4_laundromat().

These panics disappear if we revert the commit "nfsd: add a LRU list
for blocked locks".

The cause appears to be a typo in nfs4_laundromat(), which is also
present in nfs4_state_shutdown_net().

Cc: stable@vger.kernel.org
Fixes: 7919d0a27f "nfsd: add a LRU list for blocked locks"
Cc: jlayton@redhat.com
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Vasily Averin 3a2b19d1ee lockd: lost rollback of set_grace_period() in lockd_down_net()
Commit efda760fe9 ("lockd: fix lockd shutdown race") is incorrect:
it removes lockd_manager and disarms grace_period_end for init_net only.

If nfsd was started from another net namespace, lockd_up_net() calls
set_grace_period(), which adds lockd_manager to the per-netns list
and queues the grace_period_end delayed work.

These actions should be reverted in lockd_down_net().
Otherwise they can lead to a double list_add after nfsd is restarted in the
netns, and to a use-after-free if the non-disarmed delayed work is executed
after the netns is destroyed.

Fixes: efda760fe9 ("lockd: fix lockd shutdown race")
Cc: stable@vger.kernel.org
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:11 -05:00
Vasily Averin a3152f1440 lockd: added cleanup checks in exit_net hook
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Vasily Averin b872285751 grace: replace BUG_ON by WARN_ONCE in exit_net hook
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Andrew Elble 4f34bd0540 nfsd: fix locking validator warning on nfs4_ol_stateid->st_mutex class
The use of the st_mutex has been confusing the validator. Use the
proper nested notation so as to not produce warnings.

Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Vasily Averin e919b07652 lockd: remove net pointer from messages
Publishing of net pointer is not safe,
use net->ns.inum as net ID in debug messages

[  171.757678] lockd_up_net: per-net data created; net=f00001e7
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)
[  300.653313] lockd: nuking all hosts in net f00001e7...
[  300.653641] lockd: host garbage collection for net f00001e7
[  300.653968] lockd: nlmsvc_mark_resources for net f00001e7
[  300.711483] lockd_down_net: per-net data destroyed; net=f00001e7
[  300.711847] lockd: nuking all hosts in net 0...
[  300.711847] lockd: host garbage collection for net 0
[  300.711848] lockd: nlmsvc_mark_resources for net 0

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Vasily Averin ba589528d6 nfsd: remove net pointer from debug messages
Publishing of net pointer is not safe,
replace it in debug messages by net->ns.inum

[  119.989161] nfsd: initializing export module (net: f00001e7).
[  171.767188] NFSD: starting 90-second grace period (net f00001e7)
[  322.185240] nfsd: shutting down export module (net: f00001e7).
[  322.186062] nfsd: export shutdown complete (net: f00001e7).

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust 03da3169c6 nfsd: Fix races with check_stateid_generation()
The various functions that call check_stateid_generation() in order
to compare a client-supplied stateid with the nfs4_stid state, usually
need to atomically check for closed state. Those that perform the
check after locking the st_mutex using nfsd4_lock_ol_stateid()
should now be OK, but we do want to fix up the others.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust 9271d7e509 nfsd: Ensure we check stateid validity in the seqid operation checks
After taking the stateid st_mutex, we want to know that the stateid
still represents valid state before performing any non-idempotent
actions.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust beeca19cf1 nfsd: Fix race in lock stateid creation
If we're looking up a new lock state, and the creation fails, then
we want to unhash it, just like we do for OPEN. However in order
to do so, we need to ensure that no other LOCK requests can grab the
mutex until we have unhashed it (and marked it as closed).

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust fd1fd685b3 nfsd4: move find_lock_stateid
Trivial cleanup to simplify following patch.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust 659aefb68e nfsd: Ensure we don't recognise lock stateids after freeing them
In order to deal with lookup races, nfsd4_free_lock_stateid() needs
to be able to signal to other stateful functions that the lock stateid
is no longer valid. Right now, nfsd_lock() will check whether or not an
existing stateid is still hashed, but only in the "new lock" path.

To ensure the stateid invalidation is also recognised by the "existing lock"
path, and also by a second call to nfsd4_free_lock_stateid() itself, we can
change the type to NFS4_CLOSED_STID under the stp->st_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust fb500a7cfe nfsd: CLOSE SHOULD return the invalid special stateid for NFSv4.x (x>0)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust d8a1a00055 nfsd: Fix another OPEN stateid race
If nfsd4_process_open2() is initialising a new stateid, and yet the
call to nfs4_get_vfs_file() fails for some reason, then we must
declare the stateid closed, and unhash it before dropping the mutex.

Right now, we unhash the stateid after dropping the mutex, and without
changing the stateid type, meaning that another OPEN could theoretically
look it up and attempt to use it.

Reported-by: Andrew W Elble <aweits@rit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Trond Myklebust 15ca08d329 nfsd: Fix stateid races between OPEN and CLOSE
Open file stateids can linger on the nfs4_file list of stateids even
after they have been closed. In order to avoid reusing such a
stateid, and confusing the client, we need to recheck the
nfs4_stid's type after taking the mutex.
Otherwise, we risk reusing an old stateid that was already closed,
which will confuse clients that expect new stateids to conform to
RFC7530 Sections 9.1.4.2 and 16.2.5 or RFC5661 Sections 8.2.2 and 18.2.4.
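
The "recheck the type after taking the mutex" pattern can be sketched in Python with a threading lock. The class, the constant values, and the function names are assumptions made for illustration; this is not the nfsd code.

```python
import threading

# Arbitrary illustrative values, not taken from the kernel headers.
NFS4_OPEN_STID = "open"
NFS4_CLOSED_STID = "closed"


class OlStateid:
    """Hypothetical stand-in for an open stateid with its st_mutex."""

    def __init__(self):
        self.sc_type = NFS4_OPEN_STID
        self.st_mutex = threading.Lock()


def close_stateid(stp):
    """Mark the stateid closed while holding its mutex."""
    with stp.st_mutex:
        stp.sc_type = NFS4_CLOSED_STID


def lock_ol_stateid(stp):
    """Take st_mutex, then recheck the type.

    Returns True with the mutex held if the stateid is still valid,
    or False with the mutex released if it was closed in the meantime.
    """
    stp.st_mutex.acquire()
    if stp.sc_type == NFS4_CLOSED_STID:
        stp.st_mutex.release()
        return False
    return True
```

A caller that found the stateid on a list before it was closed learns about the close only through this recheck, and can then go on to allocate a fresh stateid instead of reusing the stale one.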

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-11-27 16:45:10 -05:00
Linus Torvalds 1751e8a6cb Rename superblock flags (MS_xyz -> SB_xyz)
This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done
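
For readers who want to see the effect of the generated sed program on a single line, here is a small Python sketch of the same substitution. The function name is an assumption; the symbol list is copied from the script above, and like the sed expressions it does plain substring replacement without word boundaries.

```python
# Same symbol list as SYMS in the script above.
SYMS = """RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK
DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT
POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT
I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN
ACTIVE NOUSER""".split()


def rename_flags(text):
    """Apply s/MS_<sym>/SB_<sym>/g for every symbol, in list order."""
    for sym in SYMS:
        text = text.replace("MS_" + sym, "SB_" + sym)
    return text
```

Constants that are not in the list, such as MS_MGC_VAL, are left untouched.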

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-27 13:05:09 -08:00
Darrick J. Wong 509955823c xfs: log recovery should replay deferred ops in order
As part of testing log recovery with dm_log_writes, Amir Goldstein
discovered an error in the deferred ops recovery that led to corruption
of the filesystem metadata if a reflink+rmap filesystem happened to shut
down midway through a CoW remap:

"This is what happens [after failed log recovery]:

"Phase 1 - find and verify superblock...
"Phase 2 - using internal log
"        - zero log...
"        - scan filesystem freespace and inode maps...
"        - found root inode chunk
"Phase 3 - for each AG...
"        - scan (but don't clear) agi unlinked lists...
"        - process known inodes and perform inode discovery...
"        - agno = 0
"data fork in regular inode 134 claims CoW block 376
"correcting nextents for inode 134
"bad data fork in inode 134
"would have cleared inode 134"

Hou Tao dissected the log contents of exactly such a crash:

"According to the implementation of xfs_defer_finish(), these ops should
be completed in the following sequence:

"Have been done:
"(1) CUI: Oper (160)
"(2) BUI: Oper (161)
"(3) CUD: Oper (194), for CUI Oper (160)
"(4) RUI A: Oper (197), free rmap [0x155, 2, -9]

"Should be done:
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI A
"(8) RUD: for RUI B

"Actually be done by xlog_recover_process_intents()
"(5) BUD: for BUI Oper (161)
"(6) RUI B: add rmap [0x155, 2, 137]
"(7) RUD: for RUI B
"(8) RUD: for RUI A

"So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
from the log record in post_mount.log (generated after umount) and the trace
print, the new rmap entry [0x155, 2, 137] is added first, then the rmap
entry [0x155, 2, -9] is freed."

When reconstructing the internal log state from the log items found on
disk, it's required that deferred ops replay in exactly the same order
that they would have had the filesystem not gone down.  However,
replaying unfinished deferred ops can create /more/ deferred ops.  These
new deferred ops are finished in the wrong order.  This causes fs
corruption and replay crashes, so let's create a single defer_ops to
handle the subsequent ops created during replay, then use one single
transaction at the end of log recovery to ensure that everything is
replayed in the same order as they're supposed to be.
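
The ordering problem can be reproduced with a small Python model. The two functions, the operation labels, and the `spawns` mapping are assumptions made for illustration, loosely following Hou Tao's example (finishing the BUI creates "RUI B"); this is not the XFS recovery code.

```python
from collections import deque


def replay_eager(recovered, spawns):
    """Buggy order: ops spawned by an intent finish before the next intent."""
    order = []
    for intent in recovered:
        pending = deque([intent])
        while pending:
            op = pending.popleft()
            order.append(op)
            pending.extend(spawns.get(op, ()))
    return order


def replay_deferred(recovered, spawns):
    """Fixed order: spawned ops go to one shared queue, drained at the end."""
    order = []
    shared = deque()
    for intent in recovered:
        order.append(intent)
        shared.extend(spawns.get(intent, ()))
    while shared:
        op = shared.popleft()
        order.append(op)
        shared.extend(spawns.get(op, ()))
    return order


RECOVERED = ["BUI", "RUI A"]   # intents found in the log, in log order
SPAWNS = {"BUI": ["RUI B"]}    # finishing the BUI creates a new deferred op
```

In this model the eager variant finishes "RUI B" before "RUI A", which corresponds to the wrong sequence in the analysis above, while the shared queue finishes "RUI A" first.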

Reported-by: Amir Goldstein <amir73il@gmail.com>
Analyzed-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-27 09:34:08 -08:00
Darrick J. Wong 98c4f78dcd xfs: always free inline data before resetting inode fork during ifree
In xfs_ifree, we reset the data/attr forks to extents format without
bothering to free any inline data buffer that might still be around
after all the blocks have been truncated off the file.  Prior to commit
43518812d2 ("xfs: remove support for inlining data/extents into the
inode fork") nobody noticed because the leftover inline data after
truncation was small enough to fit inside the inline buffer inside the
fork itself.

However, now that we've removed the inline buffer, we /always/ have to
free the inline data buffer or else we leak it like crazy.  This leak
was found by turning on kmemleak for generic/001 or generic/388.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-11-27 09:33:25 -08:00
Liu Bo ebb70442cd Btrfs: fix list_add corruption and soft lockups in fsync
Xfstests btrfs/146 revealed this corruption,

[   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
[   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
[   58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
[   58.153518] ------------[ cut here ]------------
[   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
...
[   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
...
[   58.161956] Call Trace:
[   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
[   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
[   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
[   58.164393]  vfs_fsync_range+0x5f/0xd0
[   58.164898]  do_fsync+0x5a/0x90
[   58.165170]  SyS_fsync+0x10/0x20
[   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
...

It turns out that we could record btrfs_log_ctx:io_err in
log_one_extents() when IO fails, but make log_one_extents() return '0'
instead of -EIO, so the IO error is not acknowledged by the callers,
i.e. btrfs_log_inode_parent(), which would otherwise remove
btrfs_log_ctx:list from list head 'root->log_ctxs'.  Since btrfs_log_ctx
is allocated from stack memory, it'd get freed with an object still
alive on the list.  Then a future list_add will throw the above warning.

This returns the correct error in the above case.
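
The failure mode can be sketched in userspace as follows.  This is an
illustration under assumptions, not the btrfs code: the list helpers,
"struct log_ctx", log_one_extent() and fsync_sketch() are all
simplified, hypothetical stand-ins.  The point is that the worker must
return the error it recorded, so the caller unlinks the stack-allocated
context before its frame goes away.

```c
#include <errno.h>
#include <stddef.h>

/* Tiny circular doubly linked list, modelled on the kernel's list_head. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Hypothetical per-fsync context that lives on the caller's stack. */
struct log_ctx {
	int io_err;
	struct list_head list;
};

/* Hypothetical worker: record the error AND return it. */
static int log_one_extent(struct log_ctx *ctx, int io_failed)
{
	if (io_failed) {
		ctx->io_err = -EIO;
		return -EIO;	/* returning 0 here would hide the failure */
	}
	return 0;
}

static int fsync_sketch(struct list_head *log_ctxs, int io_failed)
{
	struct log_ctx ctx = { 0 };
	int ret;

	list_init(&ctx.list);
	list_add(&ctx.list, log_ctxs);

	ret = log_one_extent(&ctx, io_failed);
	if (ret) {
		/* Error seen: unlink before this stack frame disappears. */
		list_del_init(&ctx.list);
		return ret;
	}

	/* Success: a later sync step would consume and unlink the context. */
	list_del_init(&ctx.list);
	return 0;
}
```

If log_one_extent() returned 0 on failure, the caller would skip its
error path and rely on a sync step that never runs for this context,
leaving a dangling stack pointer on the list.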

Jeff also reported this while testing against his fsync error
patch set[1].

[1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
"btrfs list corruption and soft lockups while testing writeback error handling"

Fixes: 8407f55326 ("Btrfs: fix data corruption after fast fsync and writeback error")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 17:41:19 +01:00
Jeff Layton 9f97df50c5 reiserfs: remove unneeded i_version bump
The i_version field in reiserfs is not initialized and is only ever
updated here. Nothing ever views it, so just remove it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-11-27 17:31:07 +01:00
Qu Wenruo eae8d82529 btrfs: Fix wild memory access in compression level parser
[BUG]
Kernel panic when mounting with "-o compress" mount option.
KASAN will report like:
------
==================================================================
BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
Read of size 1 at addr d86735fce994f800 by task mount/662
...
Call Trace:
 dump_stack+0xe3/0x175
 kasan_report+0x163/0x370
 __asan_load1+0x47/0x50
 strncmp+0x31/0xc0
 btrfs_compress_str2level+0x20/0x70 [btrfs]
 btrfs_parse_options+0xff4/0x1870 [btrfs]
 open_ctree+0x2679/0x49f0 [btrfs]
 btrfs_mount+0x1b7f/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 vfs_kern_mount+0x13/0x20
 btrfs_mount+0x31e/0x1d30 [btrfs]
 mount_fs+0x49/0x190
 vfs_kern_mount.part.29+0xba/0x280
 do_mount+0xaad/0x1a00
 SyS_mount+0x98/0xe0
 entry_SYSCALL_64_fastpath+0x1f/0xbe
------

[Cause]
For the 'compress' and 'compress_force' options, the token doesn't
expect any parameter, so args[0] contains uninitialized data.
Accessing args[0] causes the wild memory access above.

[Fix]
For Opt_compress and Opt_compress_force, set compression level to
the default.
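
A rough userspace sketch of the idea, assuming a hypothetical
parse_compress_level() helper and a simplified "zlib:N" syntax (this is
not the btrfs option parser): an option without an argument gets the
default level up front, and the argument is only inspected for the
form that actually carries one.

```c
#include <string.h>

/* Assumed default for illustration only. */
enum { LEVEL_DEFAULT = 0 };

/*
 * Hypothetical parser: returns the compression level for a mount
 * option string, or -1 if it is not a compress option.
 */
static int parse_compress_level(const char *opt)
{
	const char *arg;

	/* Bare "compress": no argument exists, so never look for one. */
	if (strcmp(opt, "compress") == 0)
		return LEVEL_DEFAULT;

	if (strncmp(opt, "compress=", 9) != 0)
		return -1;

	arg = opt + 9;
	/* Accept "zlib:N" with a single-digit level; anything else
	 * falls back to the default level. */
	if (strncmp(arg, "zlib:", 5) == 0 &&
	    arg[5] >= '1' && arg[5] <= '9' && arg[6] == '\0')
		return arg[5] - '0';

	return LEVEL_DEFAULT;
}
```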

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ set the default in advance ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 17:01:11 +01:00
Josef Bacik b77000ed55 btrfs: fix deadlock when writing out space cache
If we fail to prepare our pages for whatever reason (out of memory in
our case) we need to make sure to drop the block_group->data_rwsem,
otherwise hilarity ensues.
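
The pattern behind this fix can be sketched in userspace as below.  The
lock is mocked with a flag, and prepare_pages() and write_out_cache()
are hypothetical stand-ins rather than the btrfs functions; the sketch
only shows the error path jumping to the existing unlock code.

```c
#include <errno.h>

/* Mock block group; the flag stands in for a held rwsem. */
struct block_group { int data_rwsem_held; };

static void down_write(struct block_group *bg) { bg->data_rwsem_held = 1; }
static void up_write(struct block_group *bg)   { bg->data_rwsem_held = 0; }

/* Hypothetical page preparation step that may run out of memory. */
static int prepare_pages(int fail) { return fail ? -ENOMEM : 0; }

static int write_out_cache(struct block_group *bg, int fail_prepare)
{
	int ret;

	down_write(bg);

	ret = prepare_pages(fail_prepare);
	if (ret)
		goto out_unlock;	/* do not return with the lock held */

	/* ... the cache would be written out here ... */

out_unlock:
	up_write(bg);
	return ret;
}
```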

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add label and use existing unlocking code ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-27 15:50:07 +01:00
Linus Torvalds 844056fd74 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:

 - The final conversion of timer wheel timers to timer_setup().

   A few manual conversions and a large coccinelle assisted sweep and
   the removal of the old initialization mechanisms and the related
   code.

 - Remove the now unused VSYSCALL update code

 - Fix permissions of /proc/timer_list. I still need to get rid of that
   file completely

 - Rename a misnamed clocksource function and remove a stale declaration

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  m68k/macboing: Fix missed timer callback assignment
  treewide: Remove TIMER_FUNC_TYPE and TIMER_DATA_TYPE casts
  timer: Remove redundant __setup_timer*() macros
  timer: Pass function down to initialization routines
  timer: Remove unused data arguments from macros
  timer: Switch callback prototype to take struct timer_list * argument
  timer: Pass timer_list pointer to callbacks unconditionally
  Coccinelle: Remove setup_timer.cocci
  timer: Remove setup_*timer() interface
  timer: Remove init_timer() interface
  treewide: setup_timer() -> timer_setup() (2 field)
  treewide: setup_timer() -> timer_setup()
  treewide: init_timer() -> setup_timer()
  treewide: Switch DEFINE_TIMER callbacks to struct timer_list *
  s390: cmm: Convert timers to use timer_setup()
  lightnvm: Convert timers to use timer_setup()
  drivers/net: cris: Convert timers to use timer_setup()
  drm/vc4: Convert timers to use timer_setup()
  block/laptop_mode: Convert timers to use timer_setup()
  net/atm/mpc: Avoid open-coded assignment of timer callback function
  ...
2017-11-25 08:37:16 -10:00
Linus Torvalds f61ec2c97c AFS fixes

Merge tag 'afs-fixes-20171124' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS fixes from David Howells:

 - Make AFS file locking work again.

 - Don't write to a page that's being written out, but wait for it to
   complete.

 - Do d_drop() and d_add() in the right places.

 - Put keys on error paths.

 - Remove some redundant code.

* tag 'afs-fixes-20171124' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: remove redundant assignment of dvnode to itself
  afs: cell: Remove unnecessary code in afs_lookup_cell
  afs: Fix signal handling in some file ops
  afs: Fix some dentry handling in dir ops and missing key_puts
  afs: Make afs_write_begin() avoid writing to a page that's being stored
  afs: Fix file locking
2017-11-25 07:58:25 -10:00
Colin Ian King 43dd388b21 afs: remove redundant assignment of dvnode to itself
The assignment of dvnode to itself is redundant and can be removed.
Cleans up warning detected by cppcheck:

fs/afs/dir.c:975: (warning) Redundant assignment of 'dvnode' to itself.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24 13:55:46 +00:00
Gustavo A. R. Silva 6832795164 afs: cell: Remove unnecessary code in afs_lookup_cell
Due to recent changes this piece of code is no longer needed.

Addresses-Coverity-ID: 1462033
Link: https://lkml.kernel.org/r/4923.1510957307@warthog.procyon.org.uk
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24 13:55:45 +00:00
David Howells 4433b69141 afs: Fix signal handling in some file ops
afs_mkdir(), afs_create(), afs_link() and afs_symlink() all need to drop
the target dentry if a signal causes the operation to be killed immediately
before we try to contact the server.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24 13:55:35 +00:00
David Howells bc1527dcb4 afs: Fix some dentry handling in dir ops and missing key_puts
Fix some of dentry handling in AFS directory ops:

 (1) Do d_drop() on the new_dentry before assigning a new inode to it in
     afs_vnode_new_inode().  It's fine to do this before calling afs_iget()
     because the operation has taken place on the server.

 (2) Replace d_instantiate()/d_rehash() with d_add().

 (3) Don't d_drop() the new_dentry in afs_rename() on error.

Also fix afs_link() and afs_rename() to call key_put() on all error paths
where the key is taken.

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24 10:56:51 +00:00
David Howells 5a039c3227 afs: Make afs_write_begin() avoid writing to a page that's being stored
Make afs_write_begin() wait for a page that's marked PG_writeback because:

 (1) We need to avoid interference with the data being stored so that the
     data on the server ends up in a defined state.

 (2) page->private is used to track the window of dirty data within a page,
     but it's also used by the storage code to track what's being written,
     being cleared by the completion notification.  Ownership can't be
     relinquished by the storage code until completion because if a store
     fails, the data must be remarked dirty.

Tracing shows something like the following (edited):

 x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-125
    kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store+ 0-125
 x86_64-linux-gn-15940 [1] afs_page_dirty: vn=ffff8800bef33800 9c75 begin 0-2052
    kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 clear 0-2052
    kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 store 0-0
    kworker/u8:3-114   [2] afs_page_dirty: vn=ffff8800bef33800 9c75 WARN 0-0

The clear (completion) corresponding to the store+ (store continuation from
a previous page) happens between the second begin (afs_write_begin) and the
store corresponding to that.  This results in the second store not seeing
any data to write back, leading to the following warning:

WARNING: CPU: 2 PID: 114 at ../fs/afs/write.c:403 afs_write_back_from_locked_page+0x19d/0x76c [kafs]
Modules linked in: kafs(E)
CPU: 2 PID: 114 Comm: kworker/u8:3 Tainted: G            E   4.14.0-fscache+ #242
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Workqueue: writeback wb_workfn (flush-afs-2)
task: ffff8800cad72600 task.stack: ffff8800cad44000
RIP: 0010:afs_write_back_from_locked_page+0x19d/0x76c [kafs]
RSP: 0018:ffff8800cad47aa0 EFLAGS: 00010246
RAX: 0000000000000001 RBX: ffff8800bef33a20 RCX: 0000000000000000
RDX: 000000000000000f RSI: ffffffff81c5d0e0 RDI: ffff8800cad72e78
RBP: ffff8800d31ea1e8 R08: ffff8800c1358000 R09: ffff8800ca00e400
R10: ffff8800cad47a38 R11: ffff8800c5d9e400 R12: 0000000000000000
R13: ffffea0002d9df00 R14: ffffffffa0023c1c R15: 0000000000007fdf
FS:  0000000000000000(0000) GS:ffff8800ca700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f85ac6c4000 CR3: 0000000001c10001 CR4: 00000000001606e0
Call Trace:
 ? clear_page_dirty_for_io+0x23a/0x267
 afs_writepages_region+0x1be/0x286 [kafs]
 afs_writepages+0x60/0x127 [kafs]
 do_writepages+0x36/0x70
 __writeback_single_inode+0x12f/0x635
 writeback_sb_inodes+0x2cc/0x452
 __writeback_inodes_wb+0x68/0x9f
 wb_writeback+0x208/0x470
 ? wb_workfn+0x22b/0x565
 wb_workfn+0x22b/0x565
 ? worker_thread+0x230/0x2ac
 process_one_work+0x2cc/0x517
 ? worker_thread+0x230/0x2ac
 worker_thread+0x1d4/0x2ac
 ? rescuer_thread+0x29b/0x29b
 kthread+0x15d/0x165
 ? kthread_create_on_node+0x3f/0x3f
 ? call_usermodehelper_exec_async+0x118/0x11f
 ret_from_fork+0x24/0x30

Signed-off-by: David Howells <dhowells@redhat.com>
2017-11-24 10:56:51 +00:00
Linus Torvalds 3f3211e755 Changes since last update:
- Fix a memory leak in the new in-core extent map.
 - Refactor the xfs_dev_t conversions for easier xfsprogs porting

Merge tag 'xfs-4.15-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix a memory leak in the new in-core extent map

 - Refactor the xfs_dev_t conversions for easier xfsprogs porting

* tag 'xfs-4.15-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: abstract out dev_t conversions
  xfs: fix memory leak in xfs_iext_free_last_leaf
2017-11-22 20:42:42 -10:00
Linus Torvalds 275327851e Merge branch 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull mode_t whack-a-mole from Al Viro:
 "For all internal uses we want umode_t, which is arch-independent;
  mode_t (or __kernel_mode_t, for that matter) is wrong outside of
  userland ABI.

  Unfortunately, that crap keeps coming back and needs to be put down
  from time to time..."

* 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  mode_t whack-a-mole: task_dump_owner()
2017-11-22 20:20:02 -10:00
Linus Torvalds d18bee424b Merge branch '9p-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 9p filesystem fixes from Al Viro:
 "Several 9p fixes"

* '9p-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: Fix missing commas in mount options
  net/9p: Switch to wait_event_killable()
  fs/9p: Compare qid.path in v9fs_test_inode
2017-11-22 20:17:54 -10:00
Kees Cook e99e88a9d2 treewide: setup_timer() -> timer_setup()
This converts all remaining cases of the old setup_timer() API into using
timer_setup(), where the callback argument is the structure already
holding the struct timer_list. These should have no behavioral changes,
since they just change which pointer is passed into the callback with
the same available pointers after conversion. It handles the following
examples, in addition to some other variations.

Casting from unsigned long:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, ptr);

and forced object casts:

    void my_callback(struct something *ptr)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, (unsigned long)ptr);

become:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);

Direct function assignments:

    void my_callback(unsigned long data)
    {
        struct something *ptr = (struct something *)data;
    ...
    }
    ...
    ptr->my_timer.function = my_callback;

have a temporary cast added, along with converting the args:

    void my_callback(struct timer_list *t)
    {
        struct something *ptr = from_timer(ptr, t, my_timer);
    ...
    }
    ...
    ptr->my_timer.function = (TIMER_FUNC_TYPE)my_callback;

And finally, callbacks without a data assignment:

    void my_callback(unsigned long data)
    {
    ...
    }
    ...
    setup_timer(&ptr->my_timer, my_callback, 0);

have their argument renamed to verify they're unused during conversion:

    void my_callback(struct timer_list *unused)
    {
    ...
    }
    ...
    timer_setup(&ptr->my_timer, my_callback, 0);
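
For readers unfamiliar with the converted form, here is a userspace
mock of the pattern.  It is a sketch under assumptions: struct
timer_list, timer_setup() and from_timer() below are simplified
stand-ins written for this example (using the GNU __typeof__
extension), not the kernel definitions, and fire_once() is a
hypothetical helper that plays the role of the timer core.

```c
#include <stddef.h>

/* Userspace stand-in: only the callback pointer matters here. */
struct timer_list {
	void (*function)(struct timer_list *t);
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mock of from_timer(): "var" is used only for its type. */
#define from_timer(var, timer_ptr, field) \
	container_of(timer_ptr, __typeof__(*(var)), field)

/* Mock of timer_setup(); the flags argument is ignored in this sketch. */
static void timer_setup(struct timer_list *t,
			void (*fn)(struct timer_list *), unsigned int flags)
{
	(void)flags;
	t->function = fn;
}

/* Hypothetical driver structure that embeds its timer. */
struct something {
	int fired;
	struct timer_list my_timer;
};

static void my_callback(struct timer_list *t)
{
	/* No data argument: the owner is derived from the timer itself. */
	struct something *ptr = from_timer(ptr, t, my_timer);

	ptr->fired++;
}

/* Set up the timer and invoke the callback once, as a timer core would. */
static int fire_once(struct something *s)
{
	timer_setup(&s->my_timer, my_callback, 0);
	s->my_timer.function(&s->my_timer);
	return s->fired;
}
```

Because the callback receives the timer pointer, no unsigned long cast
is needed and the compiler can check the callback's prototype.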

The conversion is done with the following Coccinelle script:

spatch --very-quiet --all-includes --include-headers \
	-I ./arch/x86/include -I ./arch/x86/include/generated \
	-I ./include -I ./arch/x86/include/uapi \
	-I ./arch/x86/include/generated/uapi -I ./include/uapi \
	-I ./include/generated/uapi --include ./include/linux/kconfig.h \
	--dir . \
	--cocci-file ~/src/data/timer_setup.cocci

@fix_address_of@
expression e;
@@

 setup_timer(
-&(e)
+&e
 , ...)

// Update any raw setup_timer() usages that have a NULL callback, but
// would otherwise match change_timer_function_usage, since the latter
// will update all function assignments done in the face of a NULL
// function initialization in setup_timer().
@change_timer_function_usage_NULL@
expression _E;
identifier _timer;
type _cast_data;
@@

(
-setup_timer(&_E->_timer, NULL, _E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E->_timer, NULL, (_cast_data)_E);
+timer_setup(&_E->_timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, &_E);
+timer_setup(&_E._timer, NULL, 0);
|
-setup_timer(&_E._timer, NULL, (_cast_data)&_E);
+timer_setup(&_E._timer, NULL, 0);
)

@change_timer_function_usage@
expression _E;
identifier _timer;
struct timer_list _stl;
identifier _callback;
type _cast_func, _cast_data;
@@

(
-setup_timer(&_E->_timer, _callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, &_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, _E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, &_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)_E);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, (_cast_func)&_callback, (_cast_data)&_E);
+timer_setup(&_E._timer, _callback, 0);
|
 _E->_timer@_stl.function = _callback;
|
 _E->_timer@_stl.function = &_callback;
|
 _E->_timer@_stl.function = (_cast_func)_callback;
|
 _E->_timer@_stl.function = (_cast_func)&_callback;
|
 _E._timer@_stl.function = _callback;
|
 _E._timer@_stl.function = &_callback;
|
 _E._timer@_stl.function = (_cast_func)_callback;
|
 _E._timer@_stl.function = (_cast_func)&_callback;
)

// callback(unsigned long arg)
@change_callback_handle_cast
 depends on change_timer_function_usage@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
identifier _handle;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
(
	... when != _origarg
	_handletype *_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(_handletype *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
|
	... when != _origarg
	_handletype *_handle;
	... when != _handle
	_handle =
-(void *)_origarg;
+from_timer(_handle, t, _timer);
	... when != _origarg
)
 }

// callback(unsigned long arg) without existing variable
@change_callback_handle_cast_no_arg
 depends on change_timer_function_usage &&
                     !change_callback_handle_cast@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _origtype;
identifier _origarg;
type _handletype;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *t
 )
 {
+	_handletype *_origarg = from_timer(_origarg, t, _timer);
+
	... when != _origarg
-	(_handletype *)_origarg
+	_origarg
	... when != _origarg
 }

// Avoid already converted callbacks.
@match_callback_converted
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
	    !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier t;
@@

 void _callback(struct timer_list *t)
 { ... }

// callback(struct something *handle)
@change_callback_handle_arg
 depends on change_timer_function_usage &&
	    !match_callback_converted &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
@@

 void _callback(
-_handletype *_handle
+struct timer_list *t
 )
 {
+	_handletype *_handle = from_timer(_handle, t, _timer);
	...
 }

// If change_callback_handle_arg ran on an empty function, remove
// the added handler.
@unchange_callback_handle_arg
 depends on change_timer_function_usage &&
	    change_callback_handle_arg@
identifier change_timer_function_usage._callback;
identifier change_timer_function_usage._timer;
type _handletype;
identifier _handle;
identifier t;
@@

 void _callback(struct timer_list *t)
 {
-	_handletype *_handle = from_timer(_handle, t, _timer);
 }

// We only want to refactor the setup_timer() data argument if we've found
// the matching callback. This undoes changes in change_timer_function_usage.
@unchange_timer_function_usage
 depends on change_timer_function_usage &&
            !change_callback_handle_cast &&
            !change_callback_handle_cast_no_arg &&
	    !change_callback_handle_arg@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type change_timer_function_usage._cast_data;
@@

(
-timer_setup(&_E->_timer, _callback, 0);
+setup_timer(&_E->_timer, _callback, (_cast_data)_E);
|
-timer_setup(&_E._timer, _callback, 0);
+setup_timer(&_E._timer, _callback, (_cast_data)&_E);
)

// If we fixed a callback from a .function assignment, fix the
// assignment cast now.
@change_timer_function_assignment
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression change_timer_function_usage._E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_func;
typedef TIMER_FUNC_TYPE;
@@

(
 _E->_timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E->_timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)_callback
+(TIMER_FUNC_TYPE)_callback
 ;
|
 _E._timer.function =
-(_cast_func)&_callback
+(TIMER_FUNC_TYPE)_callback
 ;
)

// Sometimes timer functions are called directly. Replace matched args.
@change_timer_function_calls
 depends on change_timer_function_usage &&
            (change_callback_handle_cast ||
             change_callback_handle_cast_no_arg ||
             change_callback_handle_arg)@
expression _E;
identifier change_timer_function_usage._timer;
identifier change_timer_function_usage._callback;
type _cast_data;
@@

 _callback(
(
-(_cast_data)_E
+&_E->_timer
|
-(_cast_data)&_E
+&_E._timer
|
-_E
+&_E->_timer
)
 )

// If a timer has been configured without a data argument, it can be
// converted without regard to the callback argument, since it is unused.
@match_timer_function_unused_data@
expression _E;
identifier _timer;
identifier _callback;
@@

(
-setup_timer(&_E->_timer, _callback, 0);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0L);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E->_timer, _callback, 0UL);
+timer_setup(&_E->_timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0L);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_E._timer, _callback, 0UL);
+timer_setup(&_E._timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0L);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(&_timer, _callback, 0UL);
+timer_setup(&_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0L);
+timer_setup(_timer, _callback, 0);
|
-setup_timer(_timer, _callback, 0UL);
+timer_setup(_timer, _callback, 0);
)

@change_callback_unused_data
 depends on match_timer_function_unused_data@
identifier match_timer_function_unused_data._callback;
type _origtype;
identifier _origarg;
@@

 void _callback(
-_origtype _origarg
+struct timer_list *unused
 )
 {
	... when != _origarg
 }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:07 -08:00
Kees Cook 24ed960abf treewide: Switch DEFINE_TIMER callbacks to struct timer_list *
This changes all DEFINE_TIMER() callbacks to use a struct timer_list
pointer instead of unsigned long. Since the data argument has already been
removed, none of these callbacks are using their argument currently, so
this renames the argument to "unused".

Done using the following semantic patch:

@match_define_timer@
declarer name DEFINE_TIMER;
identifier _timer, _callback;
@@

 DEFINE_TIMER(_timer, _callback);

@change_callback depends on match_define_timer@
identifier match_define_timer._callback;
type _origtype;
identifier _origarg;
@@

 void
-_callback(_origtype _origarg)
+_callback(struct timer_list *unused)
 { ... }

Signed-off-by: Kees Cook <keescook@chromium.org>
2017-11-21 15:57:05 -08:00
Linus Torvalds b620fd2df2 3 Cleanups:
 - remove initialization of i_version - Jeff Layton
 - use ARRAY_SIZE - Jérémy Lefaure
 - call op_release sooner when creating inodes - Martin Brandenburg

1 Patch:
 - stop setting atime on inode dirty - Martin Brandenburg

Merge tag 'for-linus-4.15-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux

Pull orangefs updates from Mike Marshall:
 "Fix:

   - stop setting atime on inode dirty (Martin Brandenburg)

  Cleanups:

   - remove initialization of i_version (Jeff Layton)

   - use ARRAY_SIZE (Jérémy Lefaure)

   - call op_release sooner when creating inodes (Martin Brandenburg)"

* tag 'for-linus-4.15-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
  orangefs: call op_release sooner when creating inodes
  orangefs: stop setting atime on inode dirty
  orangefs: use ARRAY_SIZE
  orangefs: remove initialization of i_version
2017-11-21 05:40:48 -10:00
Linus Torvalds adb072d3cd We have a set of file locking improvements from Zheng, rbd rw/ro
state handling code cleanup from myself and some assorted CephFS fixes
 from Jeff.
 
 rbd now defaults to single-major=Y, lifting the limit of ~240 rbd
 images per host for everyone.

Merge tag 'ceph-for-4.15-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "We have a set of file locking improvements from Zheng, rbd rw/ro state
  handling code cleanup from myself and some assorted CephFS fixes from
  Jeff.

  rbd now defaults to single-major=Y, lifting the limit of ~240 rbd
  images per host for everyone"

* tag 'ceph-for-4.15-rc1' of git://github.com/ceph/ceph-client:
  rbd: default to single-major device number scheme
  libceph: don't WARN() if user tries to add invalid key
  rbd: set discard_alignment to zero
  ceph: silence sparse endianness warning in encode_caps_cb
  ceph: remove the bump of i_version
  ceph: present consistent fsid, regardless of arch endianness
  ceph: clean up spinlocking and list handling around cleanup_cap_releases()
  rbd: get rid of rbd_mapping::read_only
  rbd: fix and simplify rbd_ioctl_set_ro()
  ceph: remove unused and redundant variable dropping
  ceph: mark expected switch fall-throughs
  ceph: -EINVAL on decoding failure in ceph_mdsc_handle_fsmap()
  ceph: disable cached readdir after dropping positive dentry
  ceph: fix bool initialization/comparison
  ceph: handle 'session get evicted while there are file locks'
  ceph: optimize flock encoding during reconnect
  ceph: make lock_to_ceph_filelock() static
  ceph: keep auth cap when inode has flocks or posix locks
2017-11-21 05:38:32 -10:00
Christoph Hellwig 274e0a1f47 xfs: abstract out dev_t conversions
And move them to xfs_linux.h so that xfsprogs can stub them out more
easily.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-21 01:44:53 -08:00
Shu Wang 6818caa4cd xfs: fix memory leak in xfs_iext_free_last_leaf
Found the issue with kmemleak:
unreferenced object 0xffff8800674611c0 (size 16):
    xfs_iext_insert+0x82a/0xa90 [xfs]
    xfs_bmap_add_extent_hole_delay+0x1e5/0x5b0 [xfs]
    xfs_bmapi_reserve_delalloc+0x483/0x530 [xfs]
    xfs_file_iomap_begin+0xac8/0xd40 [xfs]
    iomap_apply+0xb8/0x1b0
    iomap_file_buffered_write+0xac/0xe0
    xfs_file_buffered_aio_write+0x198/0x420 [xfs]
    xfs_file_write_iter+0x23f/0x2a0 [xfs]
    __vfs_write+0x23e/0x340
    vfs_write+0xe9/0x240
    SyS_write+0xa1/0x120
    do_syscall_64+0xda/0x260

Signed-off-by: Shu Wang <shuwang@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-11-21 01:44:53 -08:00
Josef Bacik 8e138e0d92 btrfs: clear space cache inode generation always
We discovered a box that had double allocations, and suspected the space
cache may be to blame.  While auditing the write out path I noticed that
if we've already set up the space cache we will just carry on.  This
means that if we hit any error after cache_save_setup but before we
actually write the cache out, we won't reset the inode generation, so
whatever was already written will be considered correct even though it
is stale.  Fix this by _always_ resetting the generation on the block
group inode, so that we only ever have a valid or an invalid cache.

With this patch I was no longer able to reproduce cache corruption with
dm-log-writes and my bpf error injection tool.

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-20 20:43:39 +01:00
Linus Torvalds 4dd3c2e5a4 Lots of good bugfixes, including:
Merge tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Lots of good bugfixes, including:

   -  fix a number of races in the NFSv4+ state code

   -  fix some shutdown crashes in multiple-network-namespace cases

   -  relax our 4.1 session limits; if you've an artificially low limit
      to the number of 4.1 clients that can mount simultaneously, try
      upgrading"

* tag 'nfsd-4.15' of git://linux-nfs.org/~bfields/linux: (22 commits)
  SUNRPC: Improve ordering of transport processing
  nfsd: deal with revoked delegations appropriately
  svcrdma: Enqueue after setting XPT_CLOSE in completion handlers
  nfsd: use nfs->ns.inum as net ID
  rpc: remove some BUG()s
  svcrdma: Preserve CB send buffer across retransmits
  nfds: avoid gettimeofday for nfssvc_boot time
  fs, nfsd: convert nfs4_file.fi_ref from atomic_t to refcount_t
  fs, nfsd: convert nfs4_cntl_odstate.co_odcount from atomic_t to refcount_t
  fs, nfsd: convert nfs4_stid.sc_count from atomic_t to refcount_t
  lockd: double unregister of inetaddr notifiers
  nfsd4: catch some false session retries
  nfsd4: fix cached replies to solo SEQUENCE compounds
  sunrcp: make function _svc_create_xprt static
  SUNRPC: Fix tracepoint storage issues with svc_recv and svc_rqst_status
  nfsd: use ARRAY_SIZE
  nfsd: give out fewer session slots as limit approaches
  nfsd: increase DRC cache limit
  nfsd: remove unnecessary nofilehandle checks
  nfs_common: convert int to bool
  ...
2017-11-18 11:22:04 -08:00
Linus Torvalds fa7f578076 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - a bit more MM

 - procfs updates

 - dynamic-debug fixes

 - lib/ updates

 - checkpatch

 - epoll

 - nilfs2

 - signals

 - rapidio

 - PID management cleanup and optimization

 - kcov updates

 - sysvipc updates

 - quite a few misc things all over the place

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits)
  EXPERT Kconfig menu: fix broken EXPERT menu
  include/asm-generic/topology.h: remove unused parent_node() macro
  arch/tile/include/asm/topology.h: remove unused parent_node() macro
  arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro
  arch/sh/include/asm/topology.h: remove unused parent_node() macro
  arch/ia64/include/asm/topology.h: remove unused parent_node() macro
  drivers/pcmcia/sa1111_badge4.c: avoid unused function warning
  mm: add infrastructure for get_user_pages_fast() benchmarking
  sysvipc: make get_maxid O(1) again
  sysvipc: properly name ipc_addid() limit parameter
  sysvipc: duplicate lock comments wrt ipc_addid()
  sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE
  initramfs: use time64_t timestamps
  drivers/watchdog: make use of devm_register_reboot_notifier()
  kernel/reboot.c: add devm_register_reboot_notifier()
  kcov: update documentation
  Makefile: support flag -fsanitizer-coverage=trace-cmp
  kcov: support comparison operands collection
  kcov: remove pointless current != NULL check
  kernel/panic.c: add TAINT_AUX
  ...
2017-11-17 16:56:17 -08:00
Gargi Sharma 95846ecf9d pid: replace pid bitmap implementation with IDR API
Patch series "Replacing PID bitmap implementation with IDR API", v4.

This series replaces kernel bitmap implementation of PID allocation with
IDR API.  These patches are written to simplify the kernel by replacing
custom code with calls to generic code.

The following are the stats for pid and pid_namespace object files
before and after the replacement.  There is a noteworthy change between
the IDR and bitmap implementation.

Before
   text       data        bss        dec        hex    filename
   8447       3894         64      12405       3075    kernel/pid.o
After
   text       data        bss        dec        hex    filename
   3397        304          0       3701        e75    kernel/pid.o

Before
   text       data        bss        dec        hex    filename
   5692       1842        192       7726       1e2e    kernel/pid_namespace.o
After
   text       data        bss        dec        hex    filename
   2854        216         16       3086        c0e    kernel/pid_namespace.o

The following are the stats for ps, pstree and calling readdir on /proc
for 10,000 processes.

ps:
        With IDR API    With bitmap
real    0m1.479s        0m2.319s
user    0m0.070s        0m0.060s
sys     0m0.289s        0m0.516s

pstree:
        With IDR API    With bitmap
real    0m1.024s        0m1.794s
user    0m0.348s        0m0.612s
sys     0m0.184s        0m0.264s

proc:
        With IDR API    With bitmap
real    0m0.059s        0m0.074s
user    0m0.000s        0m0.004s
sys     0m0.016s        0m0.016s

This patch (of 2):

Replace the current bitmap implementation for Process ID allocation.
Functions that are no longer required, for example, free_pidmap(),
alloc_pidmap(), etc.  are removed.  The rest of the functions are
modified to use the IDR API.  The change was made to make the PID
allocation less complex by replacing custom code with calls to generic
API.

[gs051095@gmail.com: v6]
  Link: http://lkml.kernel.org/r/1507760379-21662-2-git-send-email-gs051095@gmail.com
[avagin@openvz.org: restore the old behaviour of the ns_last_pid sysctl]
  Link: http://lkml.kernel.org/r/20171106183144.16368-1-avagin@openvz.org
Link: http://lkml.kernel.org/r/1507583624-22146-2-git-send-email-gs051095@gmail.com
Signed-off-by: Gargi Sharma <gs051095@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Colin Ian King eecd7f4f5b fat: remove redundant assignment of 0 to slots
The variable slots is being assigned a value of zero that is never read,
slots is being updated again a few lines later.  Remove this redundant
assignment.

Cleans clang warning: Value stored to 'slots' is never read

Link: http://lkml.kernel.org/r/20171017140258.22536-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Christos Gkekas 15ec37185e hfs/hfsplus: clean up unused variables in bnode.c
Delete variables 'tree' and 'sb', which are set but never used.

Link: http://lkml.kernel.org/r/1507977146-15875-1-git-send-email-chris.gekas@gmail.com
Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Jeff Layton 577753cc57 nilfs2: remove inode->i_version initialization
It's never used in nilfs2.

Link: http://lkml.kernel.org/r/1510064486-1728-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Ryusuke Konishi 3147db8938 nilfs2: use octal for unreadable permission macro
Replace S_IRWXUGO with 0777 because symbolic permissions are considered
harmful:

 https://lwn.net/Articles/696229/

Link: http://lkml.kernel.org/r/1509367935-3086-5-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Ryusuke Konishi 4d685f930a nilfs2: align block comments of nilfs_sufile_truncate_range() at *
Fix the following checkpatch warning:

 WARNING: Block comments should align the * on each line
 #633: FILE: sufile.c:633:
 +/**
 +  * nilfs_sufile_truncate_range - truncate range of segment array

Link: http://lkml.kernel.org/r/1509367935-3086-4-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Elena Reshetova d4f0284a59 fs, nilfs: convert nilfs_root.count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference counters
with the following properties:

 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided refcount_t
type and API that prevents accidental counter overflows and underflows.
This is important since overflows and underflows can lead to a
use-after-free situation and be exploitable.

The variable nilfs_root.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
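
The properties listed above can be illustrated with a small userspace
sketch.  This is not the kernel's refcount_t implementation; every
sketch_* name is hypothetical, and saturating at UINT_MAX is an
assumption made here to show the "no overflow, no underflow" idea using
C11 atomics.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a reference counter; not the kernel type. */
struct sketch_refcount {
	atomic_uint refs;
};

static void sketch_refcount_set(struct sketch_refcount *r, unsigned int n)
{
	atomic_store(&r->refs, n);
}

/* Take a reference unless the counter already hit zero. */
static bool sketch_refcount_inc_not_zero(struct sketch_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* object is already being freed */
		if (old == UINT_MAX)
			return true;	/* saturated: stay pinned, never wrap */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));
	return true;
}

/* Drop a reference; returns true when the caller must free the object. */
static bool sketch_refcount_dec_and_test(struct sketch_refcount *r)
{
	unsigned int old = atomic_load(&r->refs);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;	/* refuse to underflow or unsaturate */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old - 1));
	return old == 1;
}
```

With a plain atomic counter, an extra decrement at zero would wrap to a
huge value and an extra increment at the maximum would wrap to zero; in
this sketch both are refused instead.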

Link: http://lkml.kernel.org/r/1509367935-3086-3-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Andreas Rohner 31ccb1f7ba nilfs2: fix race condition that causes file system corruption
There is a race condition between nilfs_dirty_inode() and
nilfs_set_file_dirty().

When a file is opened, nilfs_dirty_inode() is called to update the
access timestamp in the inode.  It calls __nilfs_mark_inode_dirty() in a
separate transaction.  __nilfs_mark_inode_dirty() caches the ifile
buffer_head in the i_bh field of the inode info structure and marks it
as dirty.

After some data was written to the file in another transaction, the
function nilfs_set_file_dirty() is called, which adds the inode to the
ns_dirty_files list.

Then the segment construction calls nilfs_segctor_collect_dirty_files(),
which goes through the ns_dirty_files list and checks the i_bh field.
If there is a cached buffer_head in i_bh it is not marked as dirty
again.

Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
transactions, it is possible that a segment construction that writes out
the ifile occurs in-between the two.  If this happens the inode is not
on the ns_dirty_files list, but its ifile block is still marked as dirty
and written out.

In the next segment construction, the data for the file is written out
and nilfs_bmap_propagate() updates the b-tree.  Eventually the bmap root
is written into the i_bh block, which is not dirty, because it was
written out in another segment construction.

As a result the bmap update can be lost, which leads to file system
corruption.  Either the virtual block address points to an unallocated
DAT block, or the DAT entry will be reused for something different.

The error can remain undetected for a long time.  A typical error
message would be one of the "bad btree" errors or a warning that a DAT
entry could not be found.

This bug can be reproduced reliably by a simple benchmark that creates
and overwrites millions of 4k files.

Link: http://lkml.kernel.org/r/1509367935-3086-2-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andreas Rohner <andreas.rohner@gmx.net>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Kees Cook 7554e9c4cf fs/nilfs2: convert timers to use timer_setup()
In preparation for unconditionally passing the struct timer_list pointer
to all timer callbacks, switch to using the new timer_setup() and
from_timer() to pass the timer pointer explicitly.  This requires adding
a pointer to hold the timer's target task, as the lifetime of sc_task
doesn't appear to match the timer's task.
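
As a rough userspace illustration of the pattern described above: the
callback receives a pointer to the timer itself and recovers the
enclosing structure from it.  All sketch_* names are hypothetical; the
real kernel helpers are timer_setup() and from_timer(), which are not
reproduced here.

```c
#include <stddef.h>

/* Hypothetical timer type: the callback gets the timer pointer. */
struct sketch_timer {
	void (*function)(struct sketch_timer *t);
};

/* Hypothetical owner structure that embeds the timer. */
struct sketch_segctor {
	int wakeups;
	struct sketch_timer timer;
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void sketch_timer_fn(struct sketch_timer *t)
{
	struct sketch_segctor *sci =
		sketch_container_of(t, struct sketch_segctor, timer);

	sci->wakeups++;
}

static void sketch_timer_setup(struct sketch_timer *t,
			       void (*fn)(struct sketch_timer *))
{
	t->function = fn;
}

/* Simulates the timer expiring. */
static void sketch_timer_fire(struct sketch_timer *t)
{
	t->function(t);
}
```

No separate opaque data argument is needed: everything the callback
uses is reachable from the structure that embeds the timer.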

Link: http://lkml.kernel.org/r/20171016235900.GA102729@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Joe Lawrence 7a8d181949 pipe: add proc_dopipe_max_size() to safely assign pipe_max_size
pipe_max_size is assigned directly via procfs sysctl:

  static struct ctl_table fs_table[] = {
          ...
          {
                  .procname       = "pipe-max-size",
                  .data           = &pipe_max_size,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &pipe_proc_fn,
                  .extra1         = &pipe_min_size,
          },
          ...

  int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                   size_t *lenp, loff_t *ppos)
  {
          ...
          ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);
          ...

and then rounded in place a few statements later:

          ...
          pipe_max_size = round_pipe_size(pipe_max_size);
          ...

This leaves a window of time between initial assignment and rounding
that may be visible to other threads.  (For example, one thread sets a
non-rounded value to pipe_max_size while another reads its value.)

Similar reads of pipe_max_size are potentially racy:

  pipe.c :: alloc_pipe_info()
  pipe.c :: pipe_set_size()

Add a new proc_dopipe_max_size() that consolidates reading the new value
from the user buffer, verifying bounds, and calling round_pipe_size()
with a single assignment to pipe_max_size.

Link: http://lkml.kernel.org/r/1507658689-11669-4-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:03 -08:00
Joe Lawrence d3f14c4858 pipe: avoid round_pipe_size() nr_pages overflow on 32-bit
round_pipe_size() contains a right-bit-shift expression which may
overflow, which would cause undefined results in a subsequent
roundup_pow_of_two() call.

  static inline unsigned int round_pipe_size(unsigned int size)
  {
          unsigned long nr_pages;

          nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
          return roundup_pow_of_two(nr_pages) << PAGE_SHIFT;
  }

PAGE_SIZE is defined as (1UL << PAGE_SHIFT), so:
  - 4 bytes wide on 32-bit (0 to 0xffffffff)
  - 8 bytes wide on 64-bit (0 to 0xffffffffffffffff)

That means that in 32-bit round_pipe_size(), nr_pages may overflow to 0:

  size=0x00000000    nr_pages=0x0
  size=0x00000001    nr_pages=0x1
  size=0xfffff000    nr_pages=0xfffff
  size=0xfffff001    nr_pages=0x0         << !
  size=0xffffffff    nr_pages=0x0         << !

This is bad because roundup_pow_of_two(n) is undefined when n == 0!

64-bit is not a problem as the unsigned int size is 4 bytes wide
(similar to 32-bit) and the larger, 8 byte wide unsigned long, is
sufficient to handle the largest value of the bit shift expression:

  size=0xffffffff    nr_pages=0x100000

Modify round_pipe_size() to return 0 if n == 0 and update its callers to
handle it accordingly.
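
The 32-bit table above can be reproduced in userspace with fixed-width
types.  This is only a sketch under stated assumptions: 4 KiB pages, a
32-bit unsigned int, and hypothetical sketch_* helpers standing in for
the kernel's PAGE_SIZE, roundup_pow_of_two() and round_pipe_size().

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the table above. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT32_C(1) << SKETCH_PAGE_SHIFT)

/* Models the 32-bit arithmetic: the addition wraps before the shift. */
static uint32_t sketch_nr_pages_32bit(uint32_t size)
{
	return (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Hypothetical stand-in for roundup_pow_of_two().  Like the original it
 * must not be called with n == 0; it also assumes n <= 0x80000000.
 */
static uint32_t sketch_roundup_pow_of_two(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Returns 0 instead of passing 0 to the rounding helper. */
static uint32_t sketch_round_pipe_size(uint32_t size)
{
	uint32_t nr_pages = sketch_nr_pages_32bit(size);

	if (nr_pages == 0)
		return 0;
	return sketch_roundup_pow_of_two(nr_pages) << SKETCH_PAGE_SHIFT;
}
```

A caller can then treat a return value of 0 as "size out of range"
rather than acting on an undefined rounding result.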

Link: http://lkml.kernel.org/r/1507658689-11669-3-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
Joe Lawrence 98159d977f pipe: match pipe_max_size data type with procfs
Patch series "A few round_pipe_size() and pipe-max-size fixups", v3.

While backporting Michael's "pipe: fix limit handling" patchset to a
distro-kernel, Mikulas noticed that current upstream pipe limit handling
contains a few problems:

  1 - procfs signed wrap: echo'ing a large number into
      /proc/sys/fs/pipe-max-size and then cat'ing it back out shows a
      negative value.

  2 - round_pipe_size() nr_pages overflow on 32bit:  this would
      subsequently try roundup_pow_of_two(0), which is undefined.

  3 - visible non-rounded pipe-max-size value: there is no mutual
      exclusion or protection between the time pipe_max_size is assigned
      a raw value from proc_dointvec_minmax() and when it is rounded.

  4 - unsigned long -> unsigned int conversion makes for potential odd
      return errors from do_proc_douintvec_minmax_conv() and
      do_proc_dopipe_max_size_conv().

This version underwent the same testing as v1:
https://marc.info/?l=linux-kernel&m=150643571406022&w=2

This patch (of 4):

pipe_max_size is defined as an unsigned int:

  unsigned int pipe_max_size = 1048576;

but its procfs/sysctl representation is an integer:

  static struct ctl_table fs_table[] = {
          ...
          {
                  .procname       = "pipe-max-size",
                  .data           = &pipe_max_size,
                  .maxlen         = sizeof(int),
                  .mode           = 0644,
                  .proc_handler   = &pipe_proc_fn,
                  .extra1         = &pipe_min_size,
          },
          ...

that is signed:

  int pipe_proc_fn(struct ctl_table *table, int write, void __user *buf,
                   size_t *lenp, loff_t *ppos)
  {
          ...
          ret = proc_dointvec_minmax(table, write, buf, lenp, ppos);

This leads to signed results via procfs for large values of pipe_max_size:

  % echo 2147483647 >/proc/sys/fs/pipe-max-size
  % cat /proc/sys/fs/pipe-max-size
  -2147483648

Use unsigned operations on this variable to avoid such negative values.
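
The negative value in the transcript above is the same 32-bit pattern
read back as a signed integer.  A minimal userspace sketch, with a
hypothetical helper name, that formats one value both ways:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats the same 32-bit pattern as signed and as unsigned decimal. */
static void sketch_format_both(uint32_t value, char *as_signed,
			       char *as_unsigned, size_t n)
{
	int32_t reinterpreted;

	/* Copy the bits unchanged instead of converting the value. */
	memcpy(&reinterpreted, &value, sizeof(reinterpreted));
	snprintf(as_signed, n, "%" PRId32, reinterpreted);
	snprintf(as_unsigned, n, "%" PRIu32, value);
}
```

For value 0x80000000 the signed rendering is -2147483648, as in the
transcript, while the unsigned rendering is 2147483648.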

Link: http://lkml.kernel.org/r/1507658689-11669-2-git-send-email-joe.lawrence@redhat.com
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
NeilBrown ecc0c469f2 autofs: don't fail mount for transient error
Currently if the autofs kernel module gets an error when writing to the
pipe which links to the daemon, then it marks the whole mountpoint as
catatonic, and it will stop working.

It is possible that the error is transient.  This can happen if the
daemon is slow and more than 16 requests queue up.  If a subsequent
process tries to queue a request, and is then signalled, the write to
the pipe will return -ERESTARTSYS and autofs will take that as total
failure.

So change the code to assess -ERESTARTSYS and -ENOMEM as transient
failures which only abort the current request, not the whole mountpoint.
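
The policy described above amounts to a three-way classification of
the pipe write result.  The sketch below is illustrative only: the enum
and function names are hypothetical, and ERESTARTSYS is defined locally
because it is a kernel-internal code (512) that userspace <errno.h>
does not provide.

```c
#include <errno.h>

/* Kernel-internal error code; defined here only for the sketch. */
#define SKETCH_ERESTARTSYS 512

enum sketch_outcome {
	SKETCH_WRITE_OK,	/* request reached the daemon */
	SKETCH_ABORT_REQUEST,	/* transient: fail this request only */
	SKETCH_MARK_CATATONIC,	/* permanent: give up on the mountpoint */
};

/* err follows the kernel convention: 0 on success, -errno on failure. */
static enum sketch_outcome sketch_classify_pipe_write(int err)
{
	switch (err) {
	case 0:
		return SKETCH_WRITE_OK;
	case -SKETCH_ERESTARTSYS:
	case -ENOMEM:
		return SKETCH_ABORT_REQUEST;
	default:
		return SKETCH_MARK_CATATONIC;
	}
}
```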

It isn't a crash or a data corruption, but having autofs mountpoints
suddenly stop working is rather inconvenient.

Ian said:

: And given the problems with a half dozen (or so) user space applications
: consuming large amounts of CPU under heavy mount and umount activity this
: could happen more easily than we expect.

Link: http://lkml.kernel.org/r/87y3norvgp.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
Jason Baron 37b5e5212a epoll: remove ep_call_nested() from ep_eventpoll_poll()
ep_eventpoll_poll(), the .poll routine for an epoll fd, uses
ep_call_nested() to prevent excessively deep epoll nesting and to
prevent circular paths.

However, we are already preventing these conditions during
EPOLL_CTL_ADD.  In terms of too deep epoll chains, we do in fact allow
deep nesting of the epoll fds themselves (deeper than EP_MAX_NESTS),
however we don't allow more than EP_MAX_NESTS when an epoll file
descriptor is actually connected to a wakeup source.  Thus, we do not
require the use of ep_call_nested(), since ep_eventpoll_poll(), which is
called via ep_scan_ready_list() only continues nesting if there are
events available.

Since ep_call_nested() is implemented using a global lock, applications
that make use of nested epoll can see large performance improvements
with this change.

Davidlohr said:

: Improvements are quite obscene actually, such as for the following
: epoll_wait() benchmark with 2 level nesting on a 80 core IvyBridge:
:
: ncpus  vanilla     dirty     delta
: 1      2447092     3028315   +23.75%
: 4      231265      2986954   +1191.57%
: 8      121631      2898796   +2283.27%
: 16     59749       2902056   +4757.07%
: 32     26837       2326314   +8568.30%
: 64     12926       1341281   +10276.61%
:
: (http://linux-scalability.org/epoll/epoll-test.c)

Link: http://lkml.kernel.org/r/1509430214-5599-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Salman Qazi <sqazi@google.com>
Cc: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
Jason Baron 57a173bdf5 epoll: avoid calling ep_call_nested() from ep_poll_safewake()
ep_poll_safewake() is used to wakeup potentially nested epoll file
descriptors.  The function uses ep_call_nested() to prevent entering the
same wake up queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS).  However, this is not necessary
since we are already preventing these conditions during EPOLL_CTL_ADD.
This saves extra function calls, and avoids taking a global lock during
the ep_call_nested() calls.

I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
case, since ep_call_nested() keeps track of the nesting level, and this
is required by the call to spin_lock_irqsave_nested().  It would be nice
to remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC
case as well, however it's not clear how to simply pass the nesting level
through multiple wake_up() levels without more surgery.  In any case, I
don't think CONFIG_DEBUG_LOCK_ALLOC is generally used for production.
This patch also apparently fixes a workload at Google that Salman Qazi
reported by completely removing the poll_safewake_ncalls->lock from
wakeup paths.

Link: http://lkml.kernel.org/r/1507920533-8812-1-git-send-email-jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Salman Qazi <sqazi@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
Shakeel Butt 2ae928a944 epoll: account epitem and eppoll_entry to kmemcg
A userspace application can directly trigger the allocations from
eventpoll_epi and eventpoll_pwq slabs.  A buggy or malicious application
can consume a significant amount of system memory by triggering such
allocations.  Indeed, we have seen a case in production where a buggy
application was leaking the epoll references and causing a burst of
eventpoll_epi and eventpoll_pwq slab allocations.  This patch opts in to
charging the eventpoll_epi and eventpoll_pwq slabs.

There is a per-user limit (~4% of total memory if no highmem) on these
caches.  I think it is too generous, particularly in the scenario where
jobs of multiple users are running on the system and the administrator
is reducing cost by overcommitting the memory.  This is unaccounted
kernel memory and will not be considered by the oom-killer.  I think by
accounting it to kmemcg, for systems with kmem accounting enabled, we
can provide better isolation between jobs of different users.

Link: http://lkml.kernel.org/r/20171003021519.23907-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:02 -08:00
Alexey Dobriyan 0746a0bc6e proc: use do-while in name_to_int()
Gcc doesn't know that "len" is guaranteed to be >=1 by dcache and
generates a standard while-loop prologue that duplicates the loop
condition.
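
A userspace sketch of the loop shape in question, assuming the caller
guarantees len >= 1.  The function name and signature are hypothetical
(the kernel function takes a different argument), and returning ~0U on
failure is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Parses a decimal name of known length; returns ~0U on failure. */
static unsigned int sketch_name_to_int(const char *name, size_t len)
{
	unsigned int n = 0;

	/* The do-while below runs at least once, so len must be >= 1. */
	assert(len >= 1);

	/* Reject leading zeros such as "01"; a lone "0" is fine. */
	if (len > 1 && *name == '0')
		return ~0U;
	do {
		/* Non-digits become values above 9 after the subtraction. */
		unsigned int c = (unsigned char)*name++ - '0';

		if (c > 9)
			return ~0U;
		if (n >= (~0U - 9) / 10)
			return ~0U;	/* next step could overflow */
		n = n * 10 + c;
	} while (--len > 0);
	return n;
}
```

Because the body is entered unconditionally, the compiler does not need
to emit a separate copy of the loop condition ahead of the loop.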

	add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
	function                                     old     new   delta
	name_to_int                                  104      77     -27

Link: http://lkml.kernel.org/r/20170912195213.GB17730@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:00 -08:00
Alexey Dobriyan 3ee2a19908 proc: uninline name_to_int()
Save ~360 bytes.

	add/remove: 1/0 grow/shrink: 0/4 up/down: 104/-463 (-359)
	function                                     old     new   delta
	name_to_int                                    -     104    +104
	proc_pid_lookup                              217     126     -91
	proc_lookupfd_common                         212     121     -91
	proc_task_lookup                             289     194     -95
	__proc_create                                588     402    -186

Link: http://lkml.kernel.org/r/20170912194850.GA17730@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:00 -08:00
Roman Gushchin c643401218 proc, coredump: add CoreDumping flag to /proc/pid/status
Right now there is no convenient way to check whether a process is
currently being coredumped.

It might be necessary to recognize such state to prevent killing the
process and getting a broken coredump.  Writing a large core might take
significant time, and the process is unresponsive during it, so it might
be killed by timeout, if another process is monitoring and
killing/restarting hanging tasks.

We're getting a significant number of corrupted coredump files on
machines in our fleet, just because processes are being killed by
timeout in the middle of the core writing process.

We do have a process health check, and some agent is responsible for
restarting processes which are not responding to health check requests.
Writing a large coredump to the disk can easily exceed the reasonable
timeout (especially on an overloaded machine).

This flag will allow the agent to distinguish processes which are being
coredumped, extend the timeout for them, and let them produce a full
coredump file.

To provide an ability to detect if a process is in the state of being
coredumped, we can expose a boolean CoreDumping flag in
/proc/pid/status.

Example:
$ cat core.sh
  #!/bin/sh

  echo "|/usr/bin/sleep 10" > /proc/sys/kernel/core_pattern
  sleep 1000 &
  PID=$!

  cat /proc/$PID/status | grep CoreDumping
  kill -ABRT $PID
  sleep 1
  cat /proc/$PID/status | grep CoreDumping

$ ./core.sh
  CoreDumping:	0
  CoreDumping:	1
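
A monitoring agent written in C could extract the flag from the status
text along these lines.  This is a sketch with a hypothetical function
name; it assumes the file contents have already been read into a
NUL-terminated buffer and that the line has the "CoreDumping:<tab>N"
form shown in the output above.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 or 0 from the "CoreDumping:" line, or -1 if it is absent. */
static int sketch_parse_core_dumping(const char *status_text)
{
	static const char key[] = "CoreDumping:";
	const char *line = status_text;

	while (line && *line) {
		/* Each iteration starts at the beginning of a line. */
		if (strncmp(line, key, sizeof(key) - 1) == 0)
			return atoi(line + sizeof(key) - 1) != 0;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A return value of -1 lets the agent fall back to its normal timeout on
kernels whose status file does not contain the field.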

[guro@fb.com: document CoreDumping flag in /proc/<pid>/status]
  Link: http://lkml.kernel.org/r/20170928135357.GA8470@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20170920230634.31572-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17 16:10:00 -08:00
Linus Torvalds e75080f185 Two power management fixes for v4.15-rc1
Merge tag 'pm-fixes-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull two power management fixes from Rafael Wysocki:
 "This is the change making /proc/cpuinfo on x86 report current CPU
  frequency in "cpu MHz" again in all cases and an additional one
  dealing with an overzealous check in one of the helper routines in the
  runtime PM framework"

* tag 'pm-fixes-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM / runtime: Drop children check from __pm_runtime_set_status()
  x86 / CPU: Always show current CPU frequency in /proc/cpuinfo
2017-11-17 14:49:25 -08:00
Linus Torvalds c3e9c04b89 NFS client updates for Linux 4.15
Stable bugfixes:
 - Revalidate "." and ".." correctly on open
 - Avoid RCU usage in tracepoints
 - Fix ugly referral attributes
 - Fix a typo in nomigration mount option
 - Revert "NFS: Move the flock open mode check into nfs_flock()"
 
 Features:
 - Implement a stronger send queue accounting system for NFS over RDMA
 - Switch some atomics to the new refcount_t type
 
 Other bugfixes and cleanups:
 - Clean up access mode bits
 - Remove special-case revalidations in nfs_opendir()
 - Improve invalidating NFS over RDMA memory for async operations that time out
 - Handle NFS over RDMA replies with a workqueue
 - Handle NFS over RDMA sends with a workqueue
 - Fix up replaying interrupted requests
 - Remove dead NFS over RDMA definitions
 - Update NFS over RDMA copyright information
 - Be more consistent with bool initialization and comparisons
 - Mark expected switch fall throughs
 - Various sunrpc tracepoint cleanups
 - Fix various OPEN races
 - Fix a typo in nfs_rename()
 - Use common error handling code in nfs_lock_and_join_request()
 - Check that some structures are properly cleaned up during net_exit()
 - Remove net pointer from dprintk()s

Merge tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Revalidate "." and ".." correctly on open
   - Avoid RCU usage in tracepoints
   - Fix ugly referral attributes
   - Fix a typo in nomigration mount option
   - Revert "NFS: Move the flock open mode check into nfs_flock()"

  Features:
   - Implement a stronger send queue accounting system for NFS over RDMA
   - Switch some atomics to the new refcount_t type

  Other bugfixes and cleanups:
   - Clean up access mode bits
   - Remove special-case revalidations in nfs_opendir()
   - Improve invalidating NFS over RDMA memory for async operations that
     time out
   - Handle NFS over RDMA replies with a workqueue
   - Handle NFS over RDMA sends with a workqueue
   - Fix up replaying interrupted requests
   - Remove dead NFS over RDMA definitions
   - Update NFS over RDMA copyright information
   - Be more consistent with bool initialization and comparisons
   - Mark expected switch fall throughs
   - Various sunrpc tracepoint cleanups
   - Fix various OPEN races
   - Fix a typo in nfs_rename()
   - Use common error handling code in nfs_lock_and_join_request()
   - Check that some structures are properly cleaned up during
     net_exit()
   - Remove net pointer from dprintk()s"

* tag 'nfs-for-4.15-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (62 commits)
  NFS: Revert "NFS: Move the flock open mode check into nfs_flock()"
  NFS: Fix typo in nomigration mount option
  nfs: Fix ugly referral attributes
  NFS: super: mark expected switch fall-throughs
  sunrpc: remove net pointer from messages
  nfs: remove net pointer from messages
  sunrpc: exit_net cleanup check added
  nfs client: exit_net cleanup check added
  nfs/write: Use common error handling code in nfs_lock_and_join_requests()
  NFSv4: Replace closed stateids with the "invalid special stateid"
  NFSv4: nfs_set_open_stateid must not trigger state recovery for closed state
  NFSv4: Check the open stateid when searching for expired state
  NFSv4: Clean up nfs4_delegreturn_done
  NFSv4: cleanup nfs4_close_done
  NFSv4: Retry NFS4ERR_OLD_STATEID errors in layoutreturn
  pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-close
  NFSv4: Don't try to CLOSE if the stateid 'other' field has changed
  NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.
  NFS: Fix a typo in nfs_rename()
  NFSv4: Fix open create exclusive when the server reboots
  ...
2017-11-17 14:18:00 -08:00
Linus Torvalds e0bcb42e60 * Miscellaneous code cleanups and refactoring
* Fix a possible use after free bug when unloading the module

Merge tag 'ecryptfs-4.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs updates from Tyler Hicks:

 - miscellaneous code cleanups and refactoring

 - fix a possible use after free bug when unloading the module

* tag 'ecryptfs-4.15-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: constify attribute_group structures.
  ecryptfs: remove unnecessary i_version bump
  ecryptfs: use ARRAY_SIZE
  ecryptfs: Adjust four checks for null pointers
  ecryptfs: Return an error code only as a constant in ecryptfs_add_global_auth_tok()
  ecryptfs: Delete 21 error messages for a failed memory allocation
  eCryptfs: use after free in ecryptfs_release_messaging()
  ecryptfs: remove private bin2hex implementation
  ecryptfs: add missing \n to end of various error messages
2017-11-17 14:16:21 -08:00
Linus Torvalds b6b220b0c7 Changes since last update:
- Fix a forgotten rcu read unlock
 - Fix some inconsistent integer type usage.

Merge tag 'xfs-4.15-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "A couple more patches to fix a locking bug and some inconsistent type
  usage in some of the new code:

   - Fix a forgotten rcu read unlock

   - Fix some inconsistent integer type usage"

* tag 'xfs-4.15-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix type usage
  xfs: fix forgotten rcu read unlock when skipping inode reclaim
2017-11-17 14:14:13 -08:00
Benjamin Coddington fcfa447062 NFS: Revert "NFS: Move the flock open mode check into nfs_flock()"
Commit e12937279c "NFS: Move the flock open mode check into nfs_flock()"
changed NFSv3 behavior for flock() such that the open mode must match the
lock type; however, that requirement shouldn't be enforced for flock().
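
The flock() behaviour being restored can be checked from userspace. The following Python sketch is an illustration and not part of the patch: the helper name `can_flock_exclusive_readonly` is hypothetical, and it relies on the documented flock(2) rule that a shared or exclusive lock can be placed on a file regardless of the mode in which it was opened.

```python
import fcntl
import os
import tempfile


def can_flock_exclusive_readonly(path: str) -> bool:
    """Try to take an exclusive flock() on a descriptor opened read-only."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # LOCK_NB makes the call fail instead of blocking if another
        # process already holds a conflicting lock.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        return False
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Demonstration on a local temporary file; pass a path on an NFS
    # mount instead to exercise the client behaviour discussed here.
    with tempfile.NamedTemporaryFile() as tmp:
        print(can_flock_exclusive_readonly(tmp.name))
```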

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:52 -05:00
Joshua Watt f02fee227e NFS: Fix typo in nomigration mount option
The option was incorrectly masking off all other options.

Signed-off-by: Joshua Watt <JPEWhacker@gmail.com>
Cc: stable@vger.kernel.org #3.7
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:52 -05:00
Chuck Lever c05cefcc72 nfs: Fix ugly referral attributes
Before traversing a referral and performing a mount, the mounted-on
directory looks strange:

dr-xr-xr-x. 2 4294967294 4294967294 0 Dec 31  1969 dir.0

nfs4_get_referral is wiping out any cached attributes with what was
returned via GETATTR(fs_locations), but the bit mask for that
operation does not request any file attributes.

Retrieve owner and timestamp information so that the memcpy in
nfs4_get_referral fills in more attributes.

Changes since v1:
- Don't request attributes that the client unconditionally replaces
- Request only MOUNTED_ON_FILEID or FILEID attribute, not both
- encode_fs_locations() doesn't use the third bitmask word

Fixes: 6b97fd3da1 ("NFSv4: Follow a referral")
Suggested-by: Pradeep Thomas <pradeepthomas@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:52 -05:00
Gustavo A. R. Silva fd53dde839 NFS: super: mark expected switch fall-throughs
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 703509
Addresses-Coverity-ID: 703510
Addresses-Coverity-ID: 703511
Addresses-Coverity-ID: 703512
Addresses-Coverity-ID: 703513
Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:51 -05:00
Vasily Averin e4949e4b3d nfs: remove net pointer from messages
Publishing of net pointer is not safe,
use net->ns.inum instead

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:51 -05:00
Vasily Averin b0b5352d9a nfs client: exit_net cleanup check added
Be sure that the nfs_client_list and nfs_volume_list lists initialized
in the net_init hook were returned to their initial state in the net_exit hook.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:50 -05:00
Markus Elfring 0671d8f108 nfs/write: Use common error handling code in nfs_lock_and_join_requests()
Add a jump target so that a bit of exception handling can be better reused
at the end of this function.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:50 -05:00
Trond Myklebust fcd8843c40 NFSv4: Replace closed stateids with the "invalid special stateid"
When decoding a CLOSE, replace the stateid returned by the server
with the "invalid special stateid" described in RFC5661, Section 8.2.3.

In nfs_set_open_stateid_locked, ignore stateids from closed state.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:49 -05:00
Trond Myklebust e1fff5df6e NFSv4: nfs_set_open_stateid must not trigger state recovery for closed state
In nfs_set_open_stateid_locked, we must ignore stateids from closed state.

Reported-by: Andrew W Elble <aweits@rit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:49 -05:00
Trond Myklebust 46280d9d3d NFSv4: Check the open stateid when searching for expired state
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:49 -05:00
Trond Myklebust 140087fdf6 NFSv4: Clean up nfs4_delegreturn_done
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:48 -05:00
Trond Myklebust 91b30d2e7f NFSv4: cleanup nfs4_close_done
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:48 -05:00
Trond Myklebust ff90514ebf NFSv4: Retry NFS4ERR_OLD_STATEID errors in layoutreturn
If our layoutreturn returns an NFS4ERR_OLD_STATEID, then try to
update the stateid and retry.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:48 -05:00
Trond Myklebust 7380020e77 pNFS: Retry NFS4ERR_OLD_STATEID errors in layoutreturn-on-close
If our layoutreturn on close operation returns an NFS4ERR_OLD_STATEID,
then try to update the stateid and retry. We know that there should
be no further LAYOUTGET requests being launched.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:47 -05:00
Trond Myklebust c82bac6f4b NFSv4: Don't try to CLOSE if the stateid 'other' field has changed
If the stateid is no longer recognised on the server, either due to a
restart, or due to a competing CLOSE call, then we do not have to
retry. Any open contexts that triggered a reopen of the file, will
also act as triggers for any CLOSE for the updated stateids.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:47 -05:00
Trond Myklebust 12f275cdd1 NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.
If we're racing with an OPEN, then retry the operation instead of
declaring it a success.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
[Andrew W Elble: Fix a typo in nfs4_refresh_open_stateid]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:47 -05:00
Trond Myklebust d803224c84 NFS: Fix a typo in nfs_rename()
On successful rename, the "old_dentry" is retained and is attached to
the "new_dir", so we need to call nfs_set_verifier() accordingly.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:46 -05:00
Trond Myklebust 8fd1ab747d NFSv4: Fix open create exclusive when the server reboots
If a server that does not implement NFSv4.1 persistent session
semantics reboots while we are performing an exclusive create,
then the return value of NFS4ERR_DELAY when we replay the open
during the grace period causes us to lose the verifier.
When the grace period expires, and we present a new verifier,
the server will then correctly reply NFS4ERR_EXIST.

This commit ensures that we always present the same verifier when
replaying the OPEN.

Reported-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:46 -05:00
Trond Myklebust ad9e02dc02 NFSv4: Add a tracepoint to document open stateid updates
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:45 -05:00
Trond Myklebust c9399f21c2 NFSv4: Fix OPEN / CLOSE race
Ben Coddington has noted the following race between OPEN and CLOSE
on a single client.

Process 1		Process 2		Server
=========		=========		======

1)  OPEN file
2)			OPEN file
3)						Process OPEN (1) seqid=1
4)						Process OPEN (2) seqid=2
5)						Reply OPEN (2)
6)			Receive reply (2)
7)			new stateid, seqid=2

8)			CLOSE file, using
			stateid w/ seqid=2
9)						Reply OPEN (1)
10)						Process CLOSE (8)
11)						Reply CLOSE (8)
12)						Forget stateid
						file closed

13)			Receive reply (7)
14)			Forget stateid
			file closed.

15) Receive reply (1).
16) New stateid seqid=1
    is really the same
    stateid that was
    closed.

IOW: the reply to the first OPEN is delayed. Since "Process 2" does
not wait before closing the file, and it does not cache the closed
stateid, then when the delayed reply is finally received, it is treated
as setting up a new stateid by the client.

The fix is to ensure that the client processes the OPEN and CLOSE calls
in the same order in which the server processed them.

This commit ensures that we examine the seqid of the stateid
returned by OPEN. If it is a new stateid, we assume the seqid
must be equal to the value 1, and that each state transition
increments the seqid value by 1 (See RFC7530, Section 9.1.4.2,
and RFC5661, Section 8.2.2).

If the tracker sees that an OPEN returns with a seqid that is greater
than the cached seqid + 1, then it bumps a flag to ensure that the
caller waits for the RPCs carrying the missing seqids to complete.

Note that there can still be pathologies where the server crashes before
it can even send us the missing seqids. Since the OPEN call is still
holding a slot when it waits here, that could cause the recovery to
stall forever. To avoid that, we time out after a 5 second wait.
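
The ordering rule above can be sketched as follows. This is an illustrative Python model, not the kernel code: `classify_open_seqid` and its return labels are hypothetical names, seqid wraparound is ignored, and the "stale" branch is an assumption added for completeness. The only rules taken from the text are that a new stateid starts at seqid 1 and that each state transition increments the seqid by 1.

```python
from typing import Optional


def classify_open_seqid(cached: Optional[int], received: int) -> str:
    """Decide how to treat the seqid carried by an OPEN reply.

    cached   -- seqid of the open stateid the client currently holds,
                or None if it holds none for this file
    received -- seqid of the stateid returned by this OPEN
    """
    # With no cached state, seqid 1 is the only in-order value; treat
    # the missing state as seqid 0 so the same arithmetic applies.
    expected = (cached or 0) + 1
    if received == expected:
        return "apply"   # next in sequence: update the cached stateid
    if received > expected:
        return "wait"    # replies carrying earlier seqids are still in flight
    return "stale"       # assumption: never move the cached seqid backwards
```

In the race shown in the table, Process 2 receives seqid=2 while nothing is cached, so the model returns "wait"; in the real fix that wait is bounded by the 5 second timeout described above.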

Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:45 -05:00
Thomas Meyer 6089dd0d73 NFS: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:43 -05:00
Anna Schumaker 3944369db7 NFS: Avoid RCU usage in tracepoints
There isn't an obvious way to acquire and release the RCU lock during a
tracepoint, so we can't use the rpc_peeraddr2str() function here.
Instead, rely on the client's cl_hostname, which should have similar
enough information without needing an rcu_dereference().

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Cc: stable@vger.kernel.org # v3.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 16:43:43 -05:00
Linus Torvalds b04a23421b Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:

 - Report constant st_ino values across copy-up even if underlying
   layers are on different filesystems, but using different st_dev
   values for each layer.

   Ideally we'd report the same st_dev across the overlay, and it's
   possible to do for filesystems that use only 32bits for st_ino by
   unifying the inum space. It would be nice if it wasn't a choice of 32
   or 64, rather filesystems could report their current maximum (that
   could change on resize, so it wouldn't be set in stone).

 - miscellaneous fixes and a cleanup of ovl_fill_super() that was long
   overdue.

 - created a path_put_init() helper that clears out the pointers after
   putting the ref.

   I think this could be useful elsewhere, so added it to <linux/path.h>

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (30 commits)
  ovl: remove unneeded arg from ovl_verify_origin()
  ovl: Put upperdentry if ovl_check_origin() fails
  ovl: rename ufs to ofs
  ovl: clean up getting lower layers
  ovl: clean up workdir creation
  ovl: clean up getting upper layer
  ovl: move ovl_get_workdir() and ovl_get_lower_layers()
  ovl: reduce the number of arguments for ovl_workdir_create()
  ovl: change order of setup in ovl_fill_super()
  ovl: factor out ovl_free_fs() helper
  ovl: grab reference to workbasedir early
  ovl: split out ovl_get_indexdir() from ovl_fill_super()
  ovl: split out ovl_get_lower_layers() from ovl_fill_super()
  ovl: split out ovl_get_workdir() from ovl_fill_super()
  ovl: split out ovl_get_upper() from ovl_fill_super()
  ovl: split out ovl_get_lowerstack() from ovl_fill_super()
  ovl: split out ovl_get_workpath() from ovl_fill_super()
  ovl: split out ovl_get_upperpath() from ovl_fill_super()
  ovl: use path_put_init() in error paths for ovl_fill_super()
  vfs: add path_put_init()
  ...
2017-11-17 13:36:59 -08:00
Linus Torvalds 5a3e0b196b File locking related changes for v4.15

Merge tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull file locking update from Jeff Layton:
 "A couple of fixes for a patch that went into v4.14, and the bug report
  just came in a few days ago. It passes my (minimal) testing, and has
  been in linux-next for a few days now.

  I also would like to get my address changed in MAINTAINERS to clear
  that hurdle"

* tag 'locks-v4.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
  fcntl: don't leak fd reference when fixup_compat_flock fails
  MAINTAINERS: s/jlayton@poochiereds.net/jlayton@kernel.org/
2017-11-17 13:21:58 -08:00
Linus Torvalds cbda1b270f Merge branch 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull cramfs updates from Al Viro:
 "Nicolas Pitre's cramfs work"

* 'work.cramfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  cramfs: rehabilitate it
  cramfs: add mmap support
  cramfs: implement uncompressed and arbitrary data block positioning
  cramfs: direct memory access support
2017-11-17 13:20:41 -08:00
Linus Torvalds ca5b857cb0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted stuff, really no common topic here"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: grab the lock instead of blocking in __fd_install during resizing
  vfs: stop clearing close on exec when closing a fd
  include/linux/fs.h: fix comment about struct address_space
  fs: make fiemap work from compat_ioctl
  coda: fix 'kernel memory exposure attempt' in fsync
  pstore: remove unneeded unlikely()
  vfs: remove unneeded unlikely()
  stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case
  make vfs_ustat() static
  do_handle_open() should be static
  elf_fdpic: fix unused variable warning
  fold destroy_super() into __put_super()
  new helper: destroy_unused_super()
  fix address space warnings in ipc/
  acct.h: get rid of detritus
2017-11-17 12:54:01 -08:00
Linus Torvalds 16382e17c0 Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:

 - bio_{map,copy}_user_iov() series; those are cleanups - fixes from the
   same pile went into mainline (and stable) in late September.

 - fs/iomap.c iov_iter-related fixes

 - new primitive - iov_iter_for_each_range(), which applies a function
   to kernel-mapped segments of an iov_iter.

   Usable for kvec and bvec ones, the latter does kmap()/kunmap() around
   the callback. _Not_ usable for iovec- or pipe-backed iov_iter; the
   latter is not hard to fix if the need ever appears, the former is by
   design.

   Another related primitive will have to wait for the next cycle - it
   passes page + offset + size instead of pointer + size, and that one
   will be usable for everything _except_ kvec. Unfortunately, that one
   didn't get exposure in -next yet, so...

 - a bit more lustre iov_iter work, including a use case for
   iov_iter_for_each_range() (checksum calculation)

 - vhost/scsi leak fix in failure exit

 - misc cleanups and detritectomy...

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (21 commits)
  iomap_dio_actor(): fix iov_iter bugs
  switch ksocknal_lib_recv_...() to use of iov_iter_for_each_range()
  lustre: switch struct ksock_conn to iov_iter
  vhost/scsi: switch to iov_iter_get_pages()
  fix a page leak in vhost_scsi_iov_to_sgl() error recovery
  new primitive: iov_iter_for_each_range()
  lnet_return_rx_credits_locked: don't abuse list_entry
  xen: don't open-code iov_iter_kvec()
  orangefs: remove detritus from struct orangefs_kiocb_s
  kill iov_shorten()
  bio_alloc_map_data(): do bmd->iter setup right there
  bio_copy_user_iov(): saner bio size calculation
  bio_map_user_iov(): get rid of copying iov_iter
  bio_copy_from_iter(): get rid of copying iov_iter
  move more stuff down into bio_copy_user_iov()
  blk_rq_map_user_iov(): move iov_iter_advance() down
  bio_map_user_iov(): get rid of the iov_for_each()
  bio_map_user_iov(): move alignment check into the main loop
  don't rely upon subsequent bio_add_pc_page() calls failing
  ... and with iov_iter_get_pages_alloc() it becomes even simpler
  ...
2017-11-17 12:08:18 -08:00
Linus Torvalds 93f30c73ec Merge branch 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull compat and uaccess updates from Al Viro:

 - {get,put}_compat_sigset() series

 - assorted compat ioctl stuff

 - more set_fs() elimination

 - a few more timespec64 conversions

 - several removals of pointless access_ok() in places where it was
   followed only by non-__ variants of primitives

* 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
  coredump: call do_unlinkat directly instead of sys_unlink
  fs: expose do_unlinkat for built-in callers
  ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
  ipmi: get rid of pointless access_ok()
  pi433: sanitize ioctl
  cxlflash: get rid of pointless access_ok()
  mtdchar: get rid of pointless access_ok()
  r128: switch compat ioctls to drm_ioctl_kernel()
  selection: get rid of field-by-field copyin
  VT_RESIZEX: get rid of field-by-field copyin
  i2c compat ioctls: move to ->compat_ioctl()
  sched_rr_get_interval(): move compat to native, get rid of set_fs()
  mips: switch to {get,put}_compat_sigset()
  sparc: switch to {get,put}_compat_sigset()
  s390: switch to {get,put}_compat_sigset()
  ppc: switch to {get,put}_compat_sigset()
  parisc: switch to {get,put}_compat_sigset()
  get_compat_sigset()
  get rid of {get,put}_compat_itimerspec()
  io_getevents: Use timespec64 to represent timeouts
  ...
2017-11-17 11:54:55 -08:00
Elena Reshetova 212bf41d88 fs, nfs: convert nfs_client.cl_count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs_client.cl_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.
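
The properties listed above can be modelled in a few lines. This is a single-threaded Python sketch for illustration only: the class `RefCount` and its method names are hypothetical and merely mirror the operations named in the list, and raising an exception is an assumed stand-in for however the kernel's refcount_t API reports misuse. The real API is atomic, which this model is not.

```python
class RefCount:
    """Toy reference counter: starts at 1, frees at 0, never revives."""

    def __init__(self) -> None:
        self._count = 1  # a new object starts with one reference

    def inc_not_zero(self) -> bool:
        """Take a reference unless the object is already dead."""
        if self._count == 0:
            return False
        self._count += 1
        return True

    def inc(self) -> None:
        """Take a reference; incrementing from zero is a bug."""
        if self._count == 0:
            raise RuntimeError("increment on zero: possible use-after-free")
        self._count += 1

    def dec_and_test(self) -> bool:
        """Drop a reference; return True when the caller must free."""
        if self._count == 0:
            raise RuntimeError("underflow: reference dropped twice")
        self._count -= 1
        return self._count == 0
```

A plain integer counter would silently accept both misuse cases, which is the accidental overflow and underflow hazard the conversion is meant to close.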

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:48:01 -05:00
Elena Reshetova 2f62b5aa48 fs, nfs: convert nfs_lock_context.count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs_lock_context.count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:48:01 -05:00
Elena Reshetova 194bc1f481 fs, nfs: convert nfs4_lock_state.ls_count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situation and be exploitable.

The variable nfs4_lock_state.ls_count is used as pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:48:00 -05:00
Elena Reshetova 0896cade12 fs, nfs: convert nfs_cache_defer_req.count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situations and be exploitable.

The variable nfs_cache_defer_req.count is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:48:00 -05:00
Elena Reshetova 81a090b997 fs, nfs: convert nfs4_ff_layout_mirror.ref from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situations and be exploitable.

The variable nfs4_ff_layout_mirror.ref is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:48:00 -05:00
Elena Reshetova 2b28a7bee4 fs, nfs: convert pnfs_layout_hdr.plh_refcount from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situations and be exploitable.

The variable pnfs_layout_hdr.plh_refcount is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:59 -05:00
Elena Reshetova eba6dd6917 fs, nfs: convert pnfs_layout_segment.pls_refcount from atomic_t to refcount_t
The refcount_t type and its corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This avoids accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:59 -05:00
Elena Reshetova a2a5dea7b6 fs, nfs: convert nfs4_pnfs_ds.ds_count from atomic_t to refcount_t
atomic_t variables are currently used to implement reference
counters with the following properties:
 - counter is initialized to 1 using atomic_set()
 - a resource is freed upon counter reaching zero
 - once counter reaches zero, its further
   increments aren't allowed
 - counter schema uses basic atomic operations
   (set, inc, inc_not_zero, dec_and_test, etc.)

Such atomic variables should be converted to a newly provided
refcount_t type and API that prevents accidental counter overflows
and underflows. This is important since overflows and underflows
can lead to use-after-free situations and be exploitable.

The variable nfs4_pnfs_ds.ds_count is used as a pure reference counter.
Convert it to refcount_t and fix up the operations.

Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: David Windsor <dwindsor@gmail.com>
Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:59 -05:00
Trond Myklebust 3be0f80b5f NFSv4.1: Fix up replays of interrupted requests
If the previous request on a slot was interrupted before it was
processed by the server, then our slot sequence number may be out of whack,
and so we try the next operation using the old sequence number.

The problem with this is that not all servers check to see that the
client is replaying the same operations as previously when they decide
to go to the replay cache, and so instead of the expected error of
NFS4ERR_SEQ_FALSE_RETRY, we get a replay of the old reply, which could
(if the operations match up) be mistaken by the client for a new reply.

To fix this, we attempt to send a COMPOUND containing only the SEQUENCE op
in order to resync our slot sequence number.

Cc: Olga Kornievskaia <olga.kornievskaia@gmail.com>
[olga.kornievskaia@gmail.com: fix an Oops]
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-11-17 13:47:58 -05:00
Linus Torvalds a3841f94c7 libnvdimm for 4.15
* Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
  'userspace flush' of persistent memory updates via filesystem-dax
   mappings. It arranges for any filesystem metadata updates that may be
   required to satisfy a write fault to also be flushed ("on disk") before
   the kernel returns to userspace from the fault handler. Effectively
   every write-fault that dirties metadata completes an fsync() before
   returning from the fault handler. The new MAP_SHARED_VALIDATE mapping
   type guarantees that the MAP_SYNC flag is validated as supported by the
   filesystem's ->mmap() file operation.
 
 * Add support for the standard ACPI 6.2 label access methods that
   replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods. This
   enables interoperability with environments that only implement the
   standardized methods.
 
 * Add support for the ACPI 6.2 NVDIMM media error injection methods.
 
 * Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for latch
   last shutdown status, firmware update, SMART error injection, and
   SMART alarm threshold control.
 
 * Cleanup physical address information disclosures to be root-only.
 
 * Fix revalidation of the DIMM "locked label area" status to support
   dynamic unlock of the label area.
 
 * Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
   (system-physical-address) command and error injection commands.
 
 Acknowledgements that came after the commits were pushed to -next:
 
 957ac8c421 dax: fix PMD faults on zero-length files
 Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
 
 a39e596baa xfs: support for synchronous DAX faults
 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
 
 7b565c9f96 xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
 Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Merge tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm and dax updates from Dan Williams:
 "Save for a few late fixes, all of these commits have shipped in -next
  releases since before the merge window opened, and 0day has given a
  build success notification.

  The ext4 touches came from Jan, and the xfs touches have Darrick's
  reviewed-by. An xfstest for the MAP_SYNC feature has been through
  a few rounds of review and is on track to be merged.

   - Introduce MAP_SYNC and MAP_SHARED_VALIDATE, a mechanism to enable
     'userspace flush' of persistent memory updates via filesystem-dax
     mappings. It arranges for any filesystem metadata updates that may
     be required to satisfy a write fault to also be flushed ("on disk")
     before the kernel returns to userspace from the fault handler.
     Effectively every write-fault that dirties metadata completes an
     fsync() before returning from the fault handler. The new
     MAP_SHARED_VALIDATE mapping type guarantees that the MAP_SYNC flag
     is validated as supported by the filesystem's ->mmap() file
     operation.

   - Add support for the standard ACPI 6.2 label access methods that
     replace the NVDIMM_FAMILY_INTEL (vendor specific) label methods.
     This enables interoperability with environments that only implement
     the standardized methods.

   - Add support for the ACPI 6.2 NVDIMM media error injection methods.

   - Add support for the NVDIMM_FAMILY_INTEL v1.6 DIMM commands for
     latch last shutdown status, firmware update, SMART error injection,
     and SMART alarm threshold control.

   - Cleanup physical address information disclosures to be root-only.

   - Fix revalidation of the DIMM "locked label area" status to support
     dynamic unlock of the label area.

   - Expand unit test infrastructure to mock the ACPI 6.2 Translate SPA
     (system-physical-address) command and error injection commands.

  Acknowledgements that came after the commits were pushed to -next:

   - 957ac8c421 ("dax: fix PMD faults on zero-length files"):
       Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

   - a39e596baa ("xfs: support for synchronous DAX faults") and
     7b565c9f96 ("xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()")
        Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>"
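
From userspace, the MAP_SYNC item above might be used roughly as
follows. This is a sketch under stated assumptions: map_sync_or_shared()
is a hypothetical helper, and the fallback flag values are the
asm-generic ones, supplied only for toolchains whose headers predate
4.15 (some architectures define different values).

```c
#include <stddef.h>
#include <sys/mman.h>

/* Fallback definitions (asm-generic values) for older libc headers. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

/*
 * Hypothetical helper: map len bytes of fd, preferring MAP_SYNC.
 * On success, *synced reports whether MAP_SYNC semantics apply.
 */
static void *map_sync_or_shared(int fd, size_t len, int *synced)
{
	/*
	 * MAP_SHARED_VALIDATE makes the kernel reject flags it cannot
	 * honour, so success here means MAP_SYNC really is in effect.
	 */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*synced = 1;
		return p;
	}

	/*
	 * Expected failures are EOPNOTSUPP (file is not on a DAX
	 * filesystem) and EINVAL (kernel predates MAP_SHARED_VALIDATE).
	 * Any other error will simply recur on the plain mapping below.
	 */
	*synced = 0;
	return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

When *synced is 0 the caller still needs fsync() or msync() to make
writes durable; only a successful MAP_SYNC mapping allows relying on CPU
cache flushes alone.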

* tag 'libnvdimm-for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (49 commits)
  acpi, nfit: add 'Enable Latch System Shutdown Status' command support
  dax: fix general protection fault in dax_alloc_inode
  dax: fix PMD faults on zero-length files
  dax: stop requiring a live device for dax_flush()
  brd: remove dax support
  dax: quiet bdev_dax_supported()
  fs, dax: unify IOMAP_F_DIRTY read vs write handling policy in the dax core
  tools/testing/nvdimm: unit test clear-error commands
  acpi, nfit: validate commands against the device type
  tools/testing/nvdimm: stricter bounds checking for error injection commands
  xfs: support for synchronous DAX faults
  xfs: Implement xfs_filemap_pfn_mkwrite() using __xfs_filemap_fault()
  ext4: Support for synchronous DAX faults
  ext4: Simplify error handling in ext4_dax_huge_fault()
  dax: Implement dax_finish_sync_fault()
  dax, iomap: Add support for synchronous faults
  mm: Define MAP_SYNC and VM_SYNC flags
  dax: Allow tuning whether dax_insert_mapping_entry() dirties entry
  dax: Allow dax_iomap_fault() to return pfn
  dax: Fix comment describing dax_iomap_fault()
  ...
2017-11-17 09:51:57 -08:00
David Howells 0fafdc9f88 afs: Fix file locking
Fix the AFS file locking whereby the use of the big kernel lock (which
could be slept with) was replaced by a spinlock (which couldn't).  The
problem is that the AFS code was doing stuff inside the critical section
that might call schedule(), so this is a broken transformation.

Fix this by the following means:

 (1) Use a state machine with a proper state that can only be changed under
     the spinlock rather than using a collection of bit flags.

 (2) Cache the key used for the lock and the lock type in the afs_vnode
     struct so that the manager work function doesn't have to refer to a
     file_lock struct that's been dequeued.  This makes signal handling
     safer.

 (4) Move the unlock from afs_do_unlk() to afs_fl_release_private() which
     means that unlock is achieved in other circumstances too.

 (5) Unlock the file on the server before taking the next conflicting lock.

Also change:

 (1) Check the permits on a file before actually trying the lock.

 (2) fsync the file before effecting an explicit unlock operation.  We
     don't fsync if the lock is erased otherwise as we might not be in a
     context where we can actually do that.

Further fixes:

 (1) Fixed-fileserver address rotation is made to work.  It's only used by
     the locking functions, so couldn't be tested before.
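
Point (1) of the first list can be sketched in userspace as below. This
is an illustration only, not the afs code: the state names and the
sketch_* helpers are hypothetical, and a C11 atomic_flag stands in for
the kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock states; the real afs_vnode states may differ. */
enum sketch_lock_state {
	SKETCH_LOCK_NONE,
	SKETCH_LOCK_SETTING,
	SKETCH_LOCK_GRANTED,
	SKETCH_LOCK_UNLOCKING,
};

struct sketch_vnode {
	atomic_flag guard;		/* toy spinlock */
	enum sketch_lock_state state;	/* only changed under guard */
};

static void sketch_guard_lock(struct sketch_vnode *v)
{
	while (atomic_flag_test_and_set_explicit(&v->guard,
						 memory_order_acquire))
		;	/* spin; nothing in the critical section may sleep */
}

static void sketch_guard_unlock(struct sketch_vnode *v)
{
	atomic_flag_clear_explicit(&v->guard, memory_order_release);
}

/*
 * Move from one state to another, failing if someone else got there
 * first. The critical section only compares and assigns, so it never
 * needs to sleep.
 */
static bool sketch_transition(struct sketch_vnode *v,
			      enum sketch_lock_state from,
			      enum sketch_lock_state to)
{
	bool ok;

	sketch_guard_lock(v);
	ok = (v->state == from);
	if (ok)
		v->state = to;
	sketch_guard_unlock(v);
	return ok;
}
```

Work that may sleep, such as an RPC to the file server, is then done
between two transitions (for example NONE -> SETTING, RPC, SETTING ->
GRANTED) with the guard released, which is the property the broken
spinlock conversion lacked.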

Fixes: 72f98e7255 ("locks: turn lock_flocks into a spinlock")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: jlayton@redhat.com
2017-11-17 10:06:13 +00:00
Linus Torvalds 441692aafc Merge branch 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM updates from Russell King:

 - add support for ELF fdpic binaries on both MMU and noMMU platforms

 - linker script cleanups

 - support for compressed .data section for XIP images

 - discard memblock arrays when possible

 - various cleanups

 - atomic DMA pool updates

 - better diagnostics of missing/corrupt device tree

 - export information to allow userspace kexec tool to place images more
   intelligently, so that the device tree isn't overwritten by the
   booting kernel

 - make early_printk more efficient on semihosted systems

 - noMMU cleanups

 - SA1111 PCMCIA update in preparation for further cleanups

* 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (38 commits)
  ARM: 8719/1: NOMMU: work around maybe-uninitialized warning
  ARM: 8717/2: debug printch/printascii: translate '\n' to "\r\n" not "\n\r"
  ARM: 8713/1: NOMMU: Support MPU in XIP configuration
  ARM: 8712/1: NOMMU: Use more MPU regions to cover memory
  ARM: 8711/1: V7M: Add support for MPU to M-class
  ARM: 8710/1: Kconfig: Kill CONFIG_VECTORS_BASE
  ARM: 8709/1: NOMMU: Disallow MPU for XIP
  ARM: 8708/1: NOMMU: Rework MPU to be mostly done in C
  ARM: 8707/1: NOMMU: Update MPU accessors to use cp15 helpers
  ARM: 8706/1: NOMMU: Move out MPU setup in separate module
  ARM: 8702/1: head-common.S: Clear lr before jumping to start_kernel()
  ARM: 8705/1: early_printk: use printascii() rather than printch()
  ARM: 8703/1: debug.S: move hexbuf to a writable section
  ARM: add additional table to compressed kernel
  ARM: decompressor: fix BSS size calculation
  pcmcia: sa1111: remove special sa1111 mmio accessors
  pcmcia: sa1111: use sa1111_get_irq() to obtain IRQ resources
  ARM: better diagnostics with missing/corrupt dtb
  ARM: 8699/1: dma-mapping: Remove init_dma_coherent_pool_size()
  ARM: 8698/1: dma-mapping: Mark atomic_pool as __ro_after_init
  ..
2017-11-16 12:50:35 -08:00
Linus Torvalds a02cd4229e f2fs-for-4.15-rc1
In this round, we introduce sysfile-based quota support which is required
 for Android by default. In addition, we allow users to reserve some blocks
 at runtime to mitigate performance drops in low free space.
 
 Enhancement
 - assign proper data segments according to write_hints given by user
 - issue cache_flush on dirty devices only among multiple devices
 - exploit cp_error flag and add more faults to enhance fault injection test
 - conduct more readaheads during f2fs_readdir
 - add a range for discard commands
 
 Bug fix
 - fix zero stat->st_blocks when inline_data is set
 - drop crypto key and free stale memory pointer while evict_inode is failing
 - fix some corner cases in free space and segment management
 - fix wrong last_disk_size
 
 This series includes lots of clean-ups and code enhancement in terms of xattr
 operations, discard/flush command control. In addition, it adds versatile
 debugfs entries to monitor f2fs status.

Merge tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we introduce sysfile-based quota support which is
  required for Android by default. In addition, we allow users to
  reserve some blocks at runtime to mitigate performance drops in low
  free space.

  Enhancements:
   - assign proper data segments according to write_hints given by user
   - issue cache_flush on dirty devices only among multiple devices
   - exploit cp_error flag and add more faults to enhance fault
     injection test
   - conduct more readaheads during f2fs_readdir
   - add a range for discard commands

  Bug fixes:
   - fix zero stat->st_blocks when inline_data is set
   - drop crypto key and free stale memory pointer while evict_inode is
     failing
   - fix some corner cases in free space and segment management
   - fix wrong last_disk_size

  This series includes lots of clean-ups and code enhancement in terms
  of xattr operations, discard/flush command control. In addition, it
  adds versatile debugfs entries to monitor f2fs status"

* tag 'f2fs-for-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (75 commits)
  f2fs: deny accessing encryption policy if encryption is off
  f2fs: inject fault in inc_valid_node_count
  f2fs: fix to clear FI_NO_PREALLOC
  f2fs: expose quota information in debugfs
  f2fs: separate nat entry mem alloc from nat_tree_lock
  f2fs: validate before set/clear free nat bitmap
  f2fs: avoid opened loop codes in __add_ino_entry
  f2fs: apply write hints to select the type of segments for buffered write
  f2fs: introduce scan_curseg_cache for cleanup
  f2fs: optimize the way of traversing free_nid_bitmap
  f2fs: keep scanning until enough free nids are acquired
  f2fs: trace checkpoint reason in fsync()
  f2fs: keep isize once block is reserved cross EOF
  f2fs: avoid race in between GC and block exchange
  f2fs: save a multiplication for last_nid calculation
  f2fs: fix summary info corruption
  f2fs: remove dead code in update_meta_page
  f2fs: remove unneeded semicolon
  f2fs: don't bother with inode->i_version
  f2fs: check curseg space before foreground GC
  ...
2017-11-16 12:10:21 -08:00
Darrick J. Wong 2015a63dce xfs: fix type usage
Be consistent about using uint32_t/uint8_t instead of u32/u8.  This is
more so that we don't have to maintain /those/ types in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2017-11-16 12:06:45 -08:00
Darrick J. Wong 962cc1ad6c xfs: fix forgotten rcu read unlock when skipping inode reclaim
In commit f2e9ad21 ("xfs: check for race with xfs_reclaim_inode"), we
skip an inode if we're racing with freeing the inode via
xfs_reclaim_inode, but we forgot to release the rcu read lock when
dumping the inode, with the result that we exit to userspace with a lock
held.  Don't do that; generic/320 with a 1k block size fails this
very occasionally.

================================================
WARNING: lock held when returning to user space!
4.14.0-rc6-djwong #4 Tainted: G        W
------------------------------------------------
rm/30466 is leaving the kernel with locks still held!
1 lock held by rm/30466:
 #0:  (rcu_read_lock){....}, at: [<ffffffffa01364d3>] xfs_ifree_cluster.isra.17+0x2c3/0x6f0 [xfs]
------------[ cut here ]------------
WARNING: CPU: 1 PID: 30466 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x71/0x700
Modules linked in: deadline_iosched dm_snapshot dm_bufio ext4 mbcache jbd2 dm_flakey xfs libcrc32c dax_pmem device_dax nd_pmem sch_fq_codel af_packet [last unloaded: scsi_debug]
CPU: 1 PID: 30466 Comm: rm Tainted: G        W       4.14.0-rc6-djwong #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1djwong0 04/01/2014
task: ffff880037680000 task.stack: ffffc90001064000
RIP: 0010:rcu_note_context_switch+0x71/0x700
RSP: 0000:ffffc90001067e50 EFLAGS: 00010002
RAX: 0000000000000001 RBX: ffff880037680000 RCX: ffff88003e73d200
RDX: 0000000000000002 RSI: ffffffff819e53e9 RDI: ffffffff819f4375
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff880062c900d0
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880037680000
R13: 0000000000000000 R14: ffffc90001067eb8 R15: ffff880037680690
FS:  00007fa3b8ce8700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f69bf77c000 CR3: 000000002450a000 CR4: 00000000000006e0
Call Trace:
 __schedule+0xb8/0xb10
 schedule+0x40/0x90
 exit_to_usermode_loop+0x6b/0xa0
 prepare_exit_to_usermode+0x7a/0x90
 retint_user+0x8/0x20
RIP: 0033:0x7fa3b87fda87
RSP: 002b:00007ffe41206568 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff02
RAX: 0000000000000000 RBX: 00000000010e88c0 RCX: 00007fa3b87fda87
RDX: 0000000000000000 RSI: 00000000010e89c8 RDI: 0000000000000005
RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000000000
R10: 000000000000015e R11: 0000000000000246 R12: 00000000010c8060
R13: 00007ffe41206690 R14: 0000000000000000 R15: 0000000000000000
---[ end trace e88f83bf0cfbd07d ]---
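
The class of bug behind the splat above can be reproduced in miniature.
This is a userspace sketch, not the xfs code: the mock_* names and
scan_cluster_*() functions are hypothetical, and a nesting counter
stands in for rcu_read_lock()/rcu_read_unlock().

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock read-side lock: a nesting counter instead of real RCU. */
static int mock_rcu_depth;

static void mock_rcu_read_lock(void)   { mock_rcu_depth++; }
static void mock_rcu_read_unlock(void) { mock_rcu_depth--; }

struct mock_inode {
	bool being_reclaimed;	/* racing with reclaim: must be skipped */
	bool flushed;
};

/* Buggy shape: the skip path leaves the loop body with the lock held. */
static void scan_cluster_buggy(struct mock_inode *inodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		mock_rcu_read_lock();
		if (inodes[i].being_reclaimed)
			continue;	/* BUG: no unlock before skipping */
		inodes[i].flushed = true;
		mock_rcu_read_unlock();
	}
}

/* Fixed shape: every exit from the loop body drops the lock. */
static void scan_cluster_fixed(struct mock_inode *inodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		mock_rcu_read_lock();
		if (inodes[i].being_reclaimed) {
			mock_rcu_read_unlock();
			continue;
		}
		inodes[i].flushed = true;
		mock_rcu_read_unlock();
	}
}
```

After the buggy variant runs over a cluster containing one inode under
reclaim, mock_rcu_depth is left at 1, the analogue of "1 lock held" in
the warning; after the fixed variant it returns to 0.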

Fixes: f2e9ad212d
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Omar Sandoval <osandov@fb.com>
2017-11-16 12:06:45 -08:00
Linus Torvalds 487e2c9f44 AFS development

Merge tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "kAFS filesystem driver overhaul.

  The major points of the overhaul are:

   (1) Preliminary groundwork is laid for supporting network-namespacing
       of kAFS. The remainder of the namespacing work requires some way
       to pass namespace information to submounts triggered by an
       automount. This requires something like the mount overhaul that's
       in progress.

   (2) sockaddr_rxrpc is used in preference to in_addr for holding
       addresses internally, and support is added for talking to the
       YFS VL server. With this, kAFS can do everything over IPv6 as
       well as IPv4 if it's talking to servers that support it.

   (3) Callback handling is overhauled to be generally passive rather
       than active. 'Callbacks' are promises by the server to tell us
       about data and metadata changes. Callbacks are now checked when
       we next touch an inode rather than actively going and looking for
       it where possible.

   (4) File access permit caching is overhauled to store the caching
       information per-inode rather than per-directory, shared over
       subordinate files. Whilst older AFS servers only allow ACLs on
       directories (shared to the files in that directory), newer AFS
       servers break that restriction.

       To improve memory usage and to make it easier to do mass-key
       removal, permit combinations are cached and shared.

   (5) Cell database management is overhauled to allow lighter locks to
       be used and to make cell records autonomous state machines that
       look after getting their own DNS records and cleaning themselves
       up, in particular preventing races in acquiring and relinquishing
       the fscache token for the cell.

   (6) Volume caching is overhauled. The afs_vlocation record is got rid
       of to simplify things and the superblock is now keyed on the cell
       and the numeric volume ID only. The volume record is tied to a
       superblock and normal superblock management is used to mediate
       the lifetime of the volume fscache token.

   (7) File server record caching is overhauled to make server records
       independent of cells and volumes. A server can be in multiple
       cells (in such a case, the administrator must make sure that the
       VL services for all cells correctly reflect the volumes shared
       between those cells).

       Server records are now indexed using the UUID of the server
       rather than the address since a server can have multiple
       addresses.

   (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
       similar), VOFFLINE and VNOVOL indications and to handle rotation
       both of servers and addresses of those servers. The rotation will
       also wait and retry if the server says it is busy.

   (9) Data writeback is overhauled. Each inode no longer stores a list
       of modified sections tagged with the key that authorised it in
       favour of noting the modified region of a page in page->private
       and storing a list of keys that made modifications in the inode.

       This simplifies things and allows other keys to be used to
       actually write to the server if a key that made a modification
       becomes useless.

  (10) Writable mmap() is implemented. This allows a kernel to be built
       entirely on AFS.

  Note that pre-AFS-3.4 servers are no longer supported, though this can
  be added back if necessary (AFS-3.4 was released in 1998)"

* tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
  afs: Protect call->state changes against signals
  afs: Trace page dirty/clean
  afs: Implement shared-writeable mmap
  afs: Get rid of the afs_writeback record
  afs: Introduce a file-private data record
  afs: Use a dynamic port if 7001 is in use
  afs: Fix directory read/modify race
  afs: Trace the sending of pages
  afs: Trace the initiation and completion of client calls
  afs: Fix documentation on # vs % prefix in mount source specification
  afs: Fix total-length calculation for multiple-page send
  afs: Only progress call state at end of Tx phase from rxrpc callback
  afs: Make use of the YFS service upgrade to fully support IPv6
  afs: Overhaul volume and server record caching and fileserver rotation
  afs: Move server rotation code into its own file
  afs: Add an address list concept
  afs: Overhaul cell database management
  afs: Overhaul permit caching
  afs: Overhaul the callback handling
  afs: Rename struct afs_call server member to cm_server
  ...
2017-11-16 11:41:22 -08:00
Linus Torvalds b9743042b3 Driver core patches for 4.15-rc1
Here is the set of driver core / debugfs patches for 4.15-rc1.
 
 Not many here; most are debugfs fixes to resolve some long-reported
 problems with files going away while userspace still holds references
 to them.  There are also some SPDX cleanups for the debugfs code, as
 well as a few other minor driver core changes for issues reported by
 people.
 
 All of these have been in linux-next for a week or more with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here is the set of driver core / debugfs patches for 4.15-rc1.

  Not many here; most are debugfs fixes to resolve some long-reported
  problems with files going away while userspace still holds references
  to them. There are also some SPDX cleanups for the debugfs code, as
  well as a few other minor driver core changes for issues reported by
  people.

  All of these have been in linux-next for a week or more with no
  reported issues"

* tag 'driver-core-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver core: Fix device link deferred probe
  debugfs: Remove redundant license text
  debugfs: add SPDX identifiers to all debugfs files
  debugfs: defer debugfs_fsdata allocation to first usage
  debugfs: call debugfs_real_fops() only after debugfs_file_get()
  debugfs: purge obsolete SRCU based removal protection
  IB/hfi1: convert to debugfs_file_get() and -put()
  debugfs: convert to debugfs_file_get() and -put()
  debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
  debugfs: implement per-file removal protection
  debugfs: add support for more elaborate ->d_fsdata
  driver core: Move device_links_purge() after bus_remove_device()
  arch_topology: Fix section miss match warning due to free_raw_capacity()
  driver-core: pr_err() strings should end with newlines
2017-11-16 08:55:30 -08:00
Linus Torvalds 7c225c69f8 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2 updates

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (131 commits)
  memory hotplug: fix comments when adding section
  mm: make alloc_node_mem_map a void call if we don't have CONFIG_FLAT_NODE_MEM_MAP
  mm: simplify nodemask printing
  mm,oom_reaper: remove pointless kthread_run() error check
  mm/page_ext.c: check if page_ext is not prepared
  writeback: remove unused function parameter
  mm: do not rely on preempt_count in print_vma_addr
  mm, sparse: do not swamp log with huge vmemmap allocation failures
  mm/hmm: remove redundant variable align_end
  mm/list_lru.c: mark expected switch fall-through
  mm/shmem.c: mark expected switch fall-through
  mm/page_alloc.c: broken deferred calculation
  mm: don't warn about allocations which stall for too long
  fs: fuse: account fuse_inode slab memory as reclaimable
  mm, page_alloc: fix potential false positive in __zone_watermark_ok
  mm: mlock: remove lru_add_drain_all()
  mm, sysctl: make NUMA stats configurable
  shmem: convert shmem_init_inodecache() to void
  Unify migrate_pages and move_pages access checks
  mm, pagevec: rename pagevec drained field
  ...
2017-11-15 19:42:40 -08:00
Johannes Weiner df206988e0 fs: fuse: account fuse_inode slab memory as reclaimable
Fuse inodes are currently included in the unreclaimable slab counts -
SUnreclaim in /proc/meminfo, slab_unreclaimable in /proc/vmstat and the
per-cgroup memory.stat.  But they are reclaimable just like other
filesystems' inodes, and /proc/sys/vm/drop_caches frees them easily.

Mark the slab cache reclaimable.

Link: http://lkml.kernel.org/r/20171102202727.12539-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:07 -08:00
Mel Gorman 453f85d43f mm: remove __GFP_COLD
As the page free path makes no distinction between cache hot and cold
pages, there is no real useful ordering of pages in the free list that
allocation requests can take advantage of.  Judging from the users of
__GFP_COLD, it is likely that a number of them are the result of copying
other sites instead of actually measuring the impact.  Remove the
__GFP_COLD parameter which simplifies a number of paths in the page
allocator.

This is potentially controversial but bear in mind that the size of the
per-cpu pagelists versus modern cache sizes means that the whole per-cpu
list can often fit in the L3 cache.  Hence, there is only a potential
benefit for microbenchmarks that alloc/free pages in a tight loop.  It's
even worse when THP is taken into account which has little or no chance
of getting a cache-hot page as the per-cpu list is bypassed and the
zeroing of multiple pages will thrash the cache anyway.

The truncate microbenchmarks are not shown as this patch affects the
allocation path and not the free path.  A page fault microbenchmark was
tested but it showed no significant difference, which is not surprising
given that the __GFP_COLD branches are a minuscule percentage of the
fault path.

Link: http://lkml.kernel.org/r/20171018075952.10627-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman c6f92f9fbe mm: remove cold parameter for release_pages
All callers of release_pages claim the pages being released are cache
hot.  As no one cares about the hotness of pages being released to the
allocator, just ditch the parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman 8667982014 mm, pagevec: remove cold parameter for pagevecs
Every pagevec_init user claims the pages being released are hot even in
cases where it is unlikely the pages are hot.  As no one cares about the
hotness of pages being released to the allocator, just ditch the
parameter.

No performance impact is expected as the overhead is marginal.  The
parameter is removed simply because it is a bit stupid to have a useless
parameter copied everywhere.

Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mel Gorman c7df8ad291 mm, truncate: do not check mapping for every page being truncated
During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.

This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called.  The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                             oneirq-v1r1        pickhelper-v1r1
Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)

As you'd expect, the gain is marginal but it can be detected.  The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.

radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field.  The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.

Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:06 -08:00
Mike Rapoport 00bb31fa44 userfaultfd: use mmgrab instead of open-coded increment of mm_count
Link: http://lkml.kernel.org/r/1508132478-7738-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:05 -08:00
Levin, Alexander (Sasha Levin) 4950276672 kmemcheck: remove annotations
Patch series "kmemcheck: kill kmemcheck", v2.

As discussed at LSF/MM, kill kmemcheck.

KASan is a replacement that is able to work without the limitation of
kmemcheck (single CPU, slow).  KASan is already upstream.

We are also not aware of any users of kmemcheck (or users who don't
consider KASan as a suitable replacement).

The only objection was that since KASAN wasn't supported by all GCC
versions provided by distros at that time we should hold off for 2
years, and try again.

Now that 2 years have passed, and all distros provide gcc that supports
KASAN, kill kmemcheck again for the very same reasons.

This patch (of 4):

Remove kmemcheck annotations, and calls to kmemcheck from the kernel.

[alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs]
  Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com
Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Tim Hansen <devtimhansen@gmail.com>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Shakeel Butt f3f7c09355 fs, mm: account filp cache to kmemcg
The allocations from filp cache can be directly triggered by userspace
applications.  A buggy application can consume a significant amount of
unaccounted system memory.  Although we have not noticed such buggy
applications in production, upon close inspection we found that a lot
of machines spend a very significant amount of memory on these
caches.

One way to limit allocations from the filp cache is to set a system-level
limit on the maximum number of open files.  However, this limit is shared
between different users on the system and one user can hog this
resource.  To cater for that, we can charge filp to kmemcg, set the
system-level limit very high, and let the memory limit of each user limit
the number of files they can open, indirectly limiting their allocations
from the filp cache.

One side effect of this change is that it will allow _sysctl() to return
ENOMEM and the man page of _sysctl() does not specify that.  However, the
man page also discourages using _sysctl() at all.

Link: http://lkml.kernel.org/r/20171011190359.34926-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov af5b0f6a09 mm: consolidate page table accounting
Currently, we account page tables separately for each page table level,
but that's redundant -- we only make use of total memory allocated to
page tables for oom_badness calculation.  We also provide the
information to userspace, but it has dubious value there too.

This patch switches page table accounting to a single counter.

mm->pgtables_bytes is now used to account all page table levels.  We use
bytes, because page table size for different levels of page table tree
may be different.
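
A minimal user-space sketch of the single-counter idea, not the kernel
implementation: every name prefixed toy_ is hypothetical, and the per-level
table sizes are made-up values chosen only to show why bytes are a better
unit than a table count.

```c
#include <assert.h>

/* Page table levels tracked by the toy model. */
enum toy_pt_level { TOY_PTE, TOY_PMD, TOY_PUD, TOY_PT_LEVELS };

/*
 * Hypothetical table sizes in bytes. Real sizes are architecture specific;
 * the PTE table here is deliberately smaller than a 4096-byte page.
 */
static const unsigned long toy_pt_bytes[TOY_PT_LEVELS] = { 2048, 4096, 16384 };

struct toy_mm {
    unsigned long pgtables_bytes;   /* one counter covering every level */
};

/* Account one newly allocated table of the given level. */
static void toy_mm_inc_pgtable(struct toy_mm *mm, enum toy_pt_level level)
{
    mm->pgtables_bytes += toy_pt_bytes[level];
}

/* Account one freed table; the counter must never underflow. */
static void toy_mm_dec_pgtable(struct toy_mm *mm, enum toy_pt_level level)
{
    assert(mm->pgtables_bytes >= toy_pt_bytes[level]);
    mm->pgtables_bytes -= toy_pt_bytes[level];
}

/* Whole pages consumed by page tables, as a badness heuristic could use. */
static unsigned long toy_mm_pgtable_pages(const struct toy_mm *mm,
                                          unsigned long page_size)
{
    return mm->pgtables_bytes / page_size;
}
```

With these assumed sizes, three tables (one per level) add up to 22528
bytes, which a count of tables could not express without knowing each
level's size.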

The change has a user-visible effect: VmPMD and VmPUD are no longer
reported in /proc/[pid]/status.  Not sure if anybody uses them.  (As an
alternative, we can always report 0 kB for them.)

OOM-killer report is also slightly changed: we now report pgtables_bytes
instead of nr_ptes, nr_pmd, nr_puds.

Apart from reducing the number of counters per-mm, the benefit is that we
now calculate oom_badness() more correctly for machines which have
different sizes of page tables depending on level or where page tables
are less than a page in size.

The only downside can be debuggability because we do not know which page
table level could leak.  But I do not remember many bugs that would be
caught by separate counters so I wouldn't lose sleep over this.

[akpm@linux-foundation.org: fix mm/huge_memory.c]
Link: http://lkml.kernel.org/r/20171006100651.44742-2-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
[kirill.shutemov@linux.intel.com: fix build]
  Link: http://lkml.kernel.org/r/20171016150113.ikfxy3e7zzfvsr4w@black.fi.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov c4812909f5 mm: introduce wrappers to access mm->nr_ptes
Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
and nr_pud.

The patch also makes nr_ptes accounting dependent on CONFIG_MMU.  Page
table accounting doesn't make sense if you don't have page tables.

It's preparation for consolidation of page-table counters in mm_struct.

Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Kirill A. Shutemov b4e98d9ac7 mm: account pud page tables
On a machine with 5-level paging support a process can allocate a
significant amount of memory and stay unnoticed by oom-killer and memory
cgroup.  The trick is to allocate a lot of PUD page tables.  We don't
account PUD page tables, only PMD and PTE.

We already addressed the same issue for PMD page tables, see commit
dc6c9a35b6 ("mm: account pmd page tables to the process").
Introduction of 5-level paging brings the same issue for PUD page
tables.

The patch expands accounting to PUD level.

[kirill.shutemov@linux.intel.com: s/pmd_t/pud_t/]
  Link: http://lkml.kernel.org/r/20171004074305.x35eh5u7ybbt5kar@black.fi.intel.com
[heiko.carstens@de.ibm.com: s390/mm: fix pud table accounting]
  Link: http://lkml.kernel.org/r/20171103090551.18231-1-heiko.carstens@de.ibm.com
Link: http://lkml.kernel.org/r/20171002080427.3320-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 9c19a9cb16 cifs: use find_get_pages_range_tag()
wdata_alloc_and_fillpages() needlessly iterates calls to
find_get_pages_tag().  Also it wants only pages from given range.  Make
it use find_get_pages_range_tag().

Link: http://lkml.kernel.org/r/20171009151359.31984-17-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Suggested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara aef6e415ee afs: use find_get_pages_range_tag()
Use find_get_pages_range_tag() in afs_writepages_region() as we are
interested only in pages from given range.  Remove unnecessary code
after this conversion.

Link: http://lkml.kernel.org/r/20171009151359.31984-16-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 67fd707f46 mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()
All users of pagevec_lookup_tag() and pagevec_lookup_range_tag() now pass
PAGEVEC_SIZE as a desired number of pages.  Just drop the argument.

Link: http://lkml.kernel.org/r/20171009151359.31984-15-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 4be90299a1 ceph: use pagevec_lookup_range_nr_tag()
Use new function for looking up pages since nr_pages argument from
pagevec_lookup_range_tag() is going away.

Link: http://lkml.kernel.org/r/20171009151359.31984-14-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 40f9c51326 nilfs2: use pagevec_lookup_range_tag()
We want only pages from given range in nilfs_lookup_dirty_data_buffers().
Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and
remove unnecessary code.
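
The difference between an open-ended tagged lookup and a range-limited
one can be sketched in user space as follows.  This is a toy model under
stated assumptions, not the pagevec API: the sorted array stands in for
the mapping's radix tree, the dirty flag stands in for a tag, and all
toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    unsigned long index;    /* offset of the page in the mapping */
    bool dirty;             /* stands in for a radix tree tag */
};

/*
 * Collect up to max dirty pages with *start <= index <= end from an array
 * sorted by index. On success *start is advanced past the last page
 * returned, so the caller can simply loop until 0 is returned.
 */
static size_t toy_lookup_range_tag(const struct toy_page *pages, size_t nr,
                                   unsigned long *start, unsigned long end,
                                   const struct toy_page **out, size_t max)
{
    size_t found = 0;

    assert(pages && start && out);
    for (size_t i = 0; i < nr && found < max; i++) {
        if (pages[i].index < *start)
            continue;
        if (pages[i].index > end)
            break;          /* sorted input: nothing further can match */
        if (pages[i].dirty)
            out[found++] = &pages[i];
    }
    if (found)
        *start = out[found - 1]->index + 1;
    return found;
}
```

Because the end of the range is enforced inside the helper, a caller no
longer needs its own "index > end" checks after each lookup, which is the
kind of code these conversions remove.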

Link: http://lkml.kernel.org/r/20171009151359.31984-10-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara d2bc5b3c67 gfs2: use pagevec_lookup_range_tag()
We want only pages from given range in gfs2_write_cache_jdata().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 8faab64229 f2fs: use find_get_pages_tag() for looking up single page
__get_first_dirty_index() wants to look up only the first dirty page
after given index.  There's no point in using pagevec_lookup_tag() for
that.  Just use find_get_pages_tag() directly.

Link: http://lkml.kernel.org/r/20171009151359.31984-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:04 -08:00
Jan Kara 028a63a6e3 f2fs: simplify page iteration loops
In several places we want to iterate over all tagged pages in a mapping.
However the code was apparently copied from places that iterate only
over a limited range and thus it checks for index <= end, optimizes the
case where we are coming close to range end which is all pointless when
end == ULONG_MAX.  So just remove this dead code.

[akpm@linux-foundation.org: fix warnings]
Link: http://lkml.kernel.org/r/20171009151359.31984-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jan Kara 69c4f35d25 f2fs: use pagevec_lookup_range_tag()
We want only pages from given range in f2fs_write_cache_pages().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jan Kara dc7f3e868a ext4: use pagevec_lookup_range_tag()
We want only pages from given range in ext4_writepages().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jan Kara 0ed75fc8d2 ceph: use pagevec_lookup_range_tag()
We want only pages from given range in ceph_writepages_start().  Use
pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove
unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jan Kara 4006f437f9 btrfs: use pagevec_lookup_range_tag()
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages().  Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

Link: http://lkml.kernel.org/r/20171009151359.31984-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Sterba <dsterba@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Jérôme Glisse 0f10851ea4 mm/mmu_notifier: avoid double notification when it is useless
This patch only affects users of mmu_notifier->invalidate_range callback
which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
and it is an optimization for those users.  Everyone else is unaffected
by it.

When clearing a pte/pmd we are given a choice to notify the event under
the page table lock (notify version of *_clear_flush helpers do call the
mmu_notifier_invalidate_range).  But that notification is not necessary
in all cases.

This patch removes almost all cases where it is useless to have a call
to mmu_notifier_invalidate_range before
mmu_notifier_invalidate_range_end.  It also adds documentation in all
those cases explaining why.

Below is a more in-depth analysis of why this is fine:

For a secondary TLB (non-CPU TLB) such as an IOMMU TLB or a device TLB (when
a device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only 2 cases
when you need to notify that secondary TLB while holding the page table
lock when clearing a pte/pmd:

  A) the page backing the address is freed before mmu_notifier_invalidate_range_end
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing
to a page that might now be used by something completely different.

Case B is more subtle. For correctness it requires the following sequence
to happen:
  - take page table lock
  - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11
for the device.

Consider the following scenario (the device uses a feature similar to ATS/
PASID):

Take two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
assume they are write protected for COW (other variants of case B apply too).

[Time N] -----------------------------------------------------------------
CPU-thread-0  {try to write to addrA}
CPU-thread-1  {try to write to addrB}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA and populate device TLB}
DEV-thread-2  {read addrB and populate device TLB}
[Time N+1] ---------------------------------------------------------------
CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+2] ---------------------------------------------------------------
CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+3] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {write to addrA which is a write to new page}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+4] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {preempted}
CPU-thread-2  {}
CPU-thread-3  {write to addrB which is a write to new page}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+5] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {}
DEV-thread-2  {}
[Time N+6] ---------------------------------------------------------------
CPU-thread-0  {preempted}
CPU-thread-1  {}
CPU-thread-2  {}
CPU-thread-3  {}
DEV-thread-0  {read addrA from old page}
DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA.  This breaks
total memory ordering for the device.
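
The stale read in this scenario can be reproduced with a toy single-threaded
model.  It is only a sketch under loud assumptions: toy_pte, toy_dev_tlb
and the helper functions are hypothetical stand-ins, each "page" is a
single int, and no real locking or mmu_notifier call is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_pte { const int *page; };        /* CPU page table entry */
struct toy_dev_tlb {                        /* secondary (device) TLB entry */
    const int *cached;
    bool valid;
};

/* Device read: use the cached translation if valid, else walk and cache. */
static int toy_dev_read(struct toy_dev_tlb *tlb, const struct toy_pte *pte)
{
    if (!tlb->valid) {
        assert(pte->page != NULL);
        tlb->cached = pte->page;
        tlb->valid = true;
    }
    return *tlb->cached;
}

/* Stands in for an invalidation reaching the secondary TLB. */
static void toy_invalidate(struct toy_dev_tlb *tlb)
{
    tlb->valid = false;
}

/*
 * COW-style update: clear the entry, optionally notify while the (imagined)
 * page table lock is held, then point the entry at the new page.
 */
static void toy_cow_update(struct toy_pte *pte, struct toy_dev_tlb *tlb,
                           const int *new_page, bool notify_under_lock)
{
    pte->page = NULL;
    if (notify_under_lock)
        toy_invalidate(tlb);
    pte->page = new_page;
    /* The range_end notification may run much later, or after preemption. */
}
```

If both addresses are updated with notify_under_lock set to false and only
addrB's range_end invalidation has run, a device read of addrA still
returns the old page while a read of addrB returns the new one, which is
the ordering violation described above.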

When changing a pte to write protect it, or to point to a new write-protected
page with the same content (KSM), it is fine to delay the invalidate_range
callback to mmu_notifier_invalidate_range_end() outside the page table lock.
This is true even if the thread doing the page table update is preempted
right after releasing the page table lock but before calling
mmu_notifier_invalidate_range_end().

Thanks to Andrea for thinking of a problematic scenario for COW.

[jglisse@redhat.com: v2]
  Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Anshuman Khandual 007ab7b49a fs/hugetlbfs/inode.c: remove redundant -EINVAL return from hugetlbfs_setattr()
There is no need to have a local return code set with -EINVAL when both
the conditions following it return error codes appropriately.  Just
remove the redundant one.

Link: http://lkml.kernel.org/r/20170929145444.17611-1-khandual@linux.vnet.ibm.com
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:03 -08:00
Alexey Dobriyan d50112edde slab, slub, slob: add slab_flags_t
Add sparse-checked slab_flags_t for struct kmem_cache::flags (SLAB_POISON,
etc).

SLAB is bloated temporarily by switching to "unsigned long", but only
temporarily.
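
For readers unfamiliar with sparse-checked types, the pattern looks
roughly like the sketch below.  The bitwise and force attributes are only
understood by sparse, which defines __CHECKER__; the toy_ names, the
underlying unsigned int type and the flag values are assumptions made for
illustration, not the kernel's definitions.

```c
#include <assert.h>

#ifdef __CHECKER__
#define toy_bitwise __attribute__((bitwise))
#define toy_force   __attribute__((force))
#else
#define toy_bitwise
#define toy_force
#endif

/* A distinct flag type: sparse warns when plain integers are mixed in. */
typedef unsigned int toy_bitwise toy_slab_flags_t;

/* Illustrative flag values; the force cast marks them as deliberate. */
#define TOY_SLAB_POISON        ((toy_force toy_slab_flags_t)0x0800U)
#define TOY_SLAB_HWCACHE_ALIGN ((toy_force toy_slab_flags_t)0x2000U)
#define TOY_SLAB_KNOWN         (TOY_SLAB_POISON | TOY_SLAB_HWCACHE_ALIGN)

struct toy_kmem_cache {
    toy_slab_flags_t flags;
};

/* Store the flags, rejecting any bit this toy model does not define. */
static void toy_cache_init(struct toy_kmem_cache *c, toy_slab_flags_t flags)
{
    assert((flags & ~TOY_SLAB_KNOWN) == 0);
    c->flags = flags;
}

/* Nonzero when poisoning was requested for this cache. */
static int toy_cache_poisoned(const struct toy_kmem_cache *c)
{
    return (c->flags & TOY_SLAB_POISON) != 0;
}
```

Under a regular compiler the macros expand to nothing and the type is a
plain integer, so the check costs nothing at run time; only a sparse run
would flag something like passing a bare 0x0800 to toy_cache_init().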

Link: http://lkml.kernel.org/r/20171021100225.GA22428@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
Guozhonghua 47ee9d89f0 ocfs2: remove unneeded goto in ocfs2_reserve_cluster_bitmap_bits()
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4F3CDE3A9@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
Changwei Ge 3db409fa24 ocfs2/dlm: get mle inuse only when it is initialized
When dlm_add_migration_mle returns -EEXIST, the mle passed in will not
have been initialized, so we can't use its associated dlm object.  We
also don't need this mle for the migration that is already in progress,
since oldmle has taken this role.

Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED7AA61@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
alex chen 853bc26a7e ocfs2: subsystem.su_mutex is required while accessing the item->ci_parent
The subsystem.su_mutex is required while accessing the item->ci_parent,
otherwise a NULL pointer dereference of item->ci_parent will be
triggered in the following situation:

add node                     delete node
sys_write
 vfs_write
  configfs_write_file
   o2nm_node_store
    o2nm_node_local_write
                             do_rmdir
                              vfs_rmdir
                               configfs_rmdir
                                mutex_lock(&subsys->su_mutex);
                                unlink_obj
                                 item->ci_group = NULL;
                                 item->ci_parent = NULL;
     to_o2nm_cluster_from_node
      node->nd_item.ci_parent->ci_parent
      BUG due to NULL pointer dereference of nd_item.ci_parent

Moreover, the o2nm_cluster also should be protected by the
subsystem.su_mutex.
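
The locking rule can be sketched with a POSIX mutex in user space.  This
is a simplified model, not the ocfs2 or configfs code: toy_item,
toy_su_mutex and both functions are hypothetical, and returning -1 for an
already unlinked node is an assumption of the sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct toy_item {
    struct toy_item *parent;    /* cleared when the item is unlinked */
    int value;
};

/* One lock protecting every parent pointer, like a subsystem mutex. */
static pthread_mutex_t toy_su_mutex = PTHREAD_MUTEX_INITIALIZER;

/* The "delete node" side: detach the item while holding the mutex. */
static void toy_unlink(struct toy_item *item)
{
    pthread_mutex_lock(&toy_su_mutex);
    item->parent = NULL;
    pthread_mutex_unlock(&toy_su_mutex);
}

/*
 * The "add node" side: walk node -> group -> cluster and update the cluster
 * entirely under the mutex, so a concurrent unlink cannot slip in between
 * the NULL check and the dereference.
 */
static int toy_node_store(struct toy_item *node, int value)
{
    int ret = -1;

    assert(node != NULL);
    pthread_mutex_lock(&toy_su_mutex);
    if (node->parent && node->parent->parent) {
        node->parent->parent->value = value;
        ret = 0;
    }
    pthread_mutex_unlock(&toy_su_mutex);
    return ret;
}
```

The cluster is also modified while the mutex is held, mirroring the remark
that the o2nm_cluster needs the same protection.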

[alex.chen@huawei.com: v2]
  Link: http://lkml.kernel.org/r/59EEAA69.9080703@huawei.com
Link: http://lkml.kernel.org/r/59E9B36A.10700@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
alex chen 3e4c56d41e ocfs2: ip_alloc_sem should be taken in ocfs2_get_block()
ip_alloc_sem should be taken in ocfs2_get_block() when reading a file in
DIRECT mode to prevent concurrent access to the extent tree with
ocfs2_dio_end_io_write(), which may trigger a BUG_ON in the following
situation:

read file 'A'                                  end_io of writing file 'A'
vfs_read
 __vfs_read
  ocfs2_file_read_iter
   generic_file_read_iter
    ocfs2_direct_IO
     __blockdev_direct_IO
      do_blockdev_direct_IO
       do_direct_IO
        get_more_blocks
         ocfs2_get_block
          ocfs2_extent_map_get_blocks
           ocfs2_get_clusters
            ocfs2_get_clusters_nocache()
             ocfs2_search_extent_list
              return the index of record which
              contains the v_cluster, that is
              v_cluster > rec[i]->e_cpos.
                                                ocfs2_dio_end_io
                                                 ocfs2_dio_end_io_write
                                                  down_write(&oi->ip_alloc_sem);
                                                  ocfs2_mark_extent_written
                                                   ocfs2_change_extent_flag
                                                    ocfs2_split_extent
                                                     ...
                                                 --> modify the rec[i]->e_cpos, resulting
                                                     in v_cluster < rec[i]->e_cpos.
             BUG_ON(v_cluster < le32_to_cpu(rec->e_cpos))

[alex.chen@huawei.com: v3]
  Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
Link: http://lkml.kernel.org/r/59EF3614.6050008@huawei.com
Fixes: c15471f795 ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Reviewed-by: Gang He <ghe@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
alex chen 28f5a8a7c0 ocfs2: should wait dio before inode lock in ocfs2_setattr()
We should wait for dio requests to finish before taking the inode lock
in ocfs2_setattr(); otherwise the following deadlock will happen:

process 1                  process 2                    process 3
truncate file 'A'          end_io of writing file 'A'   receiving the bast messages
ocfs2_setattr
 ocfs2_inode_lock_tracker
  ocfs2_inode_lock_full
 inode_dio_wait
  __inode_dio_wait
  -->waiting for all dio
  requests to finish
                                                        dlm_proxy_ast_handler
                                                         dlm_do_local_bast
                                                          ocfs2_blocking_ast
                                                           ocfs2_generic_handle_bast
                                                            set OCFS2_LOCK_BLOCKED flag
                        dio_end_io
                         dio_bio_end_aio
                          dio_complete
                           ocfs2_dio_end_io
                            ocfs2_dio_end_io_write
                             ocfs2_inode_lock
                              __ocfs2_cluster_lock
                               ocfs2_wait_for_mask
                               -->waiting for OCFS2_LOCK_BLOCKED
                               flag to be cleared, that is waiting
                               for 'process 1' unlocking the inode lock
                           inode_dio_end
                           -->this would decrement i_dio_count, but it is
                           never called, so a deadlock occurs.

Link: http://lkml.kernel.org/r/59F81636.70508@huawei.com
Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
piaojun 67b1b8d14a ocfs2: clean up some unused function declarations
Link: http://lkml.kernel.org/r/59C5D7D6.9050106@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
Changwei Ge 1c01967116 ocfs2: fix cluster hang after a node dies
When a node dies, the other live nodes have to choose a new master for
each existing lock resource mastered by the dead node.

In the ocfs2/dlm implementation, this is done by the function
dlm_move_lockres_to_recovery_list, which marks those lock resources as
DLM_LOCK_RES_RECOVERING and manages them via a list from which DLM
changes the lock resource's master later.

So without invoking dlm_move_lockres_to_recovery_list, no master will be
chosen after dlm recovery completes, since no lock resource can be
found through the ::resource list.

What's worse, if DLM_LOCK_RES_RECOVERING is not set on lock resources
mastered by a dead node, synchronization among nodes breaks.

So invoke dlm_move_lockres_to_recovery_list again.

Fixes: ee8f7fcbe6 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")
Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED6E0F9@H3CMLB14-EX.srv.huawei-3com.com
Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
Reported-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Tested-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
piaojun 98d6c09ec2 ocfs2: cleanup unused func declaration and assignment
Link: http://lkml.kernel.org/r/59E064BB.8000005@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
piaojun 23e0813a08 ocfs2: no need flush workqueue before destroying it
destroy_workqueue() will do flushing work for us.

Link: http://lkml.kernel.org/r/59E06476.3090502@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
Guozhonghua a60874f858 ocfs2: remove unused declaration ocfs2_publish_get_mount_state()
Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4D0743232@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-15 18:21:01 -08:00
Linus Torvalds 1be2172e96 Modules updates for v4.15
Summary of modules changes for the 4.15 merge window:
 
 - Treewide module_param_call() cleanup, fix up set/get function
   prototype mismatches, from Kees Cook
 
 - Minor code cleanups
 
 Signed-off-by: Jessica Yu <jeyu@kernel.org>

Merge tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux

Pull module updates from Jessica Yu:
 "Summary of modules changes for the 4.15 merge window:

   - treewide module_param_call() cleanup, fix up set/get function
     prototype mismatches, from Kees Cook

   - minor code cleanups"

* tag 'modules-for-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: Do not paper over type mismatches in module_param_call()
  treewide: Fix function prototypes for module_param_call()
  module: Prepare to convert all module_param_call() prototypes
  kernel/module: Delete an error message for a failed memory allocation in add_module_usage()
2017-11-15 13:46:33 -08:00
Linus Torvalds 5bbcc0f595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Maintain the TCP retransmit queue using an rbtree, with 1GB
      windows at 100Gb this really has become necessary. From Eric
      Dumazet.

   2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.

   3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
      Lunn.

   4) Add meter action support to openvswitch, from Andy Zhou.

   5) Add a data meta pointer for BPF accessible packets, from Daniel
      Borkmann.

   6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.

   7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.

   8) More work to move the RTNL mutex down, from Florian Westphal.

   9) Add 'bpftool' utility, to help with bpf program introspection.
      From Jakub Kicinski.

  10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
      Dangaard Brouer.

  11) Support 'blocks' of transformations in the packet scheduler which
      can span multiple network devices, from Jiri Pirko.

  12) TC flower offload support in cxgb4, from Kumar Sanghvi.

  13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
      Leitner.

  14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.

  15) Add RED qdisc offloadability, and use it in mlxsw driver. From
      Nogah Frankel.

  16) eBPF based device controller for cgroup v2, from Roman Gushchin.

  17) Add some fundamental tracepoints for TCP, from Song Liu.

  18) Remove garbage collection from ipv6 route layer, this is a
      significant accomplishment. From Wei Wang.

  19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
2017-11-15 11:56:19 -08:00
Linus Torvalds c9b012e5f4 arm64 updates for 4.15
Plenty of acronym soup here:
 
 - Initial support for the Scalable Vector Extension (SVE)
 - Improved handling for SError interrupts (required to handle RAS events)
 - Enable GCC support for 128-bit integer types
 - Remove kernel text addresses from backtraces and register dumps
 - Use of WFE to implement long delay()s
 - ACPI IORT updates from Lorenzo Pieralisi
 - Perf PMU driver for the Statistical Profiling Extension (SPE)
 - Perf PMU driver for Hisilicon's system PMUs
 - Misc cleanups and non-critical fixes

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "The big highlight is support for the Scalable Vector Extension (SVE)
  which required extensive ABI work to ensure we don't break existing
  applications by blowing away their signal stack with the rather large
  new vector context (<= 2 kbit per vector register). There's further
  work to be done optimising things like exception return, but the ABI
  is solid now.

  Much of the line count comes from some new PMU drivers we have, but
  they're pretty self-contained and I suspect we'll have more of them in
  future.

  Plenty of acronym soup here:

   - initial support for the Scalable Vector Extension (SVE)

   - improved handling for SError interrupts (required to handle RAS
     events)

   - enable GCC support for 128-bit integer types

   - remove kernel text addresses from backtraces and register dumps

   - use of WFE to implement long delay()s

   - ACPI IORT updates from Lorenzo Pieralisi

   - perf PMU driver for the Statistical Profiling Extension (SPE)

   - perf PMU driver for Hisilicon's system PMUs

   - misc cleanups and non-critical fixes"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (97 commits)
  arm64: Make ARMV8_DEPRECATED depend on SYSCTL
  arm64: Implement __lshrti3 library function
  arm64: support __int128 on gcc 5+
  arm64/sve: Add documentation
  arm64/sve: Detect SVE and activate runtime support
  arm64/sve: KVM: Hide SVE from CPU features exposed to guests
  arm64/sve: KVM: Treat guest SVE use as undefined instruction execution
  arm64/sve: KVM: Prevent guests from using SVE
  arm64/sve: Add sysctl to set the default vector length for new processes
  arm64/sve: Add prctl controls for userspace vector length management
  arm64/sve: ptrace and ELF coredump support
  arm64/sve: Preserve SVE registers around EFI runtime service calls
  arm64/sve: Preserve SVE registers around kernel-mode NEON use
  arm64/sve: Probe SVE capabilities and usable vector lengths
  arm64: cpufeature: Move sys_caps_initialised declarations
  arm64/sve: Backend logic for setting the vector length
  arm64/sve: Signal handling support
  arm64/sve: Support vector length resetting for new processes
  arm64/sve: Core task context handling
  arm64/sve: Low-level CPU setup
  ...
2017-11-15 10:56:56 -08:00
Rafael J. Wysocki 7d5905dc14 x86 / CPU: Always show current CPU frequency in /proc/cpuinfo
After commit 890da9cf09 (Revert "x86: do not use cpufreq_quick_get()
for /proc/cpuinfo "cpu MHz"") the "cpu MHz" number in /proc/cpuinfo
on x86 can be either the nominal CPU frequency (which is constant)
or the frequency most recently requested by a scaling governor in
cpufreq, depending on the cpufreq configuration.  That is somewhat
inconsistent and is different from what it was before 4.13, so in
order to restore the previous behavior, make it report the current
CPU frequency like the scaling_cur_freq sysfs file in cpufreq.

To that end, modify the /proc/cpuinfo implementation on x86 to use
aperfmperf_snapshot_khz() to snapshot the APERF and MPERF feedback
registers, if available, and use their values to compute the CPU
frequency to be reported as "cpu MHz".

However, do that carefully enough to avoid accumulating delays that
lead to unacceptable access times for /proc/cpuinfo on systems with
many CPUs.  Run aperfmperf_snapshot_khz() once on all CPUs
asynchronously at the /proc/cpuinfo open time, add a single delay
upfront (if necessary) at that point and simply compute the current
frequency while running show_cpuinfo() for each individual CPU.

Also, to avoid slowing down /proc/cpuinfo accesses too much, reduce
the default delay between consecutive APERF and MPERF reads to 10 ms,
which should be sufficient to get large enough numbers for the
frequency computation in all cases.

Fixes: 890da9cf09 (Revert "x86: do not use cpufreq_quick_get() for /proc/cpuinfo "cpu MHz"")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
2017-11-15 19:46:50 +01:00
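The frequency computation described in 7d5905dc14 can be sketched roughly as follows. This is a userspace illustration, not the kernel code: `demo_sample`, `demo_cur_khz` and the `base_khz` parameter are hypothetical names, and the sketch assumes APERF counts at the actual clock rate while MPERF counts at a fixed reference rate.

```c
#include <stdint.h>

/* Hypothetical snapshot of the two feedback counters. */
struct demo_sample {
	uint64_t aperf;	/* assumed to tick at the actual CPU frequency */
	uint64_t mperf;	/* assumed to tick at a fixed reference frequency */
};

/*
 * Estimate the current frequency in kHz from two snapshots taken a short
 * time apart. base_khz is the (assumed) frequency MPERF ticks at.
 * Returns 0 when no reference ticks elapsed, so the caller can fall back
 * to another source.
 */
static uint64_t demo_cur_khz(uint64_t base_khz,
			     struct demo_sample before,
			     struct demo_sample after)
{
	uint64_t aperf_delta = after.aperf - before.aperf;
	uint64_t mperf_delta = after.mperf - before.mperf;

	if (mperf_delta == 0)
		return 0;

	/* Scale the reference frequency by the ratio of the two deltas. */
	return base_khz * aperf_delta / mperf_delta;
}
```

With a 2,000,000 kHz reference and deltas of 30,000,000 (APERF) and 20,000,000 (MPERF), the estimate is 3,000,000 kHz. A short sampling window such as the 10 ms mentioned above keeps the intermediate product well within 64 bits.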
Linus Torvalds 9682b3dea2 Merge branch 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual rocket-science from trivial tree for 4.15"

* 'for-linus' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  MAINTAINERS: relinquish kconfig
  MAINTAINERS: Update my email address
  treewide: Fix typos in Kconfig
  kfifo: Fix comments
  init/Kconfig: Fix module signing document location
  misc: ibmasm: Return error on error path
  HID: logitech-hidpp: fix mistake in printk, "feeback" -> "feedback"
  MAINTAINERS: Correct path to uDraw PS3 driver
  tracing: Fix doc mistakes in trace sample
  tracing: Kconfig text fixes for CONFIG_HWLAT_TRACER
  MIPS: Alchemy: Remove reverted CONFIG_NETLINK_MMAP from db1xxx_defconfig
  mm/huge_memory.c: fixup grammar in comment
  lib/xz: Add fall-through comments to a switch statement
2017-11-15 10:14:11 -08:00
Chao Yu ead710b7d8 f2fs: deny accessing encryption policy if encryption is off
This patch adds the missing feature check in the encryption ioctl interface.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-15 08:30:19 -08:00
Filipe Manana e3b8a48585 Btrfs: fix reported number of inode blocks after buffered append writes
The patch from commit a7e3b975a0 ("Btrfs: fix reported number of inode
blocks") introduced a regression where if we do a buffered write starting
at position equal to or greater than the file's size and then stat(2) the
file before writeback is triggered, the number of used blocks does not
change (unless there's a prealloc/unwritten extent). Example:

  $ xfs_io -f -c "pwrite -S 0xab 0 64K" foobar
  $ du -h foobar
  0	foobar
  $ sync
  $ du -h foobar
  64K	foobar

The first version of that patch did not have this regression; the second
version, which was the one committed, was made only to address a
performance regression detected by the Intel test robots using fs_mark.

This fixes the regression by setting the new delalloc bit in the range, and
doing it at btrfs_dirty_pages() while setting the regular delalloc bit as
well, so that we set both bits at once and avoid navigating the inode's io
tree twice. Doing it at btrfs_dirty_pages() is also the most meaningful
place, as we should set the new delalloc bit whenever we set the delalloc
bit, which happens only if we copied bytes into the pages at
__btrfs_buffered_write().

This was making some of LTP's du tests fail, which can be quickly run
using a command line like the following:

  $ ./runltp -q -p -l /ltp.log -f commands -s du -d /mnt

Fixes: a7e3b975a0 ("Btrfs: fix reported number of inode blocks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:46 +01:00
Filipe Manana f48bf66b66 Btrfs: move definition of the function btrfs_find_new_delalloc_bytes
Move the definition of the function btrfs_find_new_delalloc_bytes() closer
to the function btrfs_dirty_pages(), because in a future commit it will be
used exclusively by btrfs_dirty_pages(). This just moves the function's
definition, with no functional changes at all.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 17:27:44 +01:00
Liu Bo 56a0e706fc Btrfs: bail out gracefully rather than BUG_ON
If a file's DIR_ITEM key is invalid (due to memory errors) and gets
written to disk, a future lookup_path can end up with a kernel panic due
to BUG_ON().

This gets rid of the BUG_ON(); instead, output the corrupted key and
return ENOENT if it is invalid.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Guillaume Bouchard <bouchard@mercs-eng.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:47:01 +01:00
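The change in 56a0e706fc follows a common pattern: report the bad on-disk value and return an error instead of asserting. A minimal userspace sketch of that pattern is shown below. The names `demo_key`, `demo_check_location` and the two key-type constants are hypothetical, and which key types count as valid here is an assumption for illustration only.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical key types accepted by the lookup in this sketch. */
enum {
	DEMO_INODE_ITEM_KEY = 1,
	DEMO_ROOT_ITEM_KEY = 132,
};

/* Hypothetical on-disk key as read from a directory item. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Validate the key found in a directory item. Instead of crashing on an
 * unexpected type, log the corrupted key and return -ENOENT so that the
 * caller can fail the lookup gracefully.
 */
static int demo_check_location(const struct demo_key *location)
{
	switch (location->type) {
	case DEMO_INODE_ITEM_KEY:
	case DEMO_ROOT_ITEM_KEY:
		return 0;
	default:
		fprintf(stderr, "invalid dir item key (%llu %u %llu)\n",
			(unsigned long long)location->objectid,
			(unsigned int)location->type,
			(unsigned long long)location->offset);
		return -ENOENT;
	}
}
```

A caller would propagate the negative return value upward, so a corrupted entry looks like a missing file rather than taking the whole system down.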
David Sterba 619c47f3d4 btrfs: dev_alloc_list is not protected by RCU, use normal list_del
The dev_alloc_list list could be protected by various mutexes,
depending on the context. The list tracks devices that can take part in
allocating new chunks, so the closest mutex is chunk_mutex. Adding a new
device from inside the ADD_DEV ioctl will need device_list_mutex and
registering a new device from the ioctl needs uuid_mutex.

All mutexes naturally guarantee exclusivity against the same context.
The device ownership can move between the contexts and the exclusivity
is guaranteed by other means, e.g. during the mount with the uuid_mutex.

There's no RCU involved for dev_alloc_list.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:46:12 +01:00
David Sterba 3065ae5b85 btrfs: add missing device::flush_bio puts
This fixes potential bio leaks in several error paths. Unfortunately
the device structure freeing is opencoded in many places and I missed
them when introducing the flush_bio.

Most of the time, devices get freed through call_rcu(..., free_device),
so at least it's not that easy to hit the leak, but it's still
possible through the path that frees stale devices.

Fixes: e0ae999414 ("btrfs: preallocate device flush bio")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:45:26 +01:00
Nikolay Borisov 5e9f2ad5b2 btrfs: Fix transaction abort during failure in btrfs_rm_dev_item
btrfs_rm_dev_item calls several functions under an active transaction,
however it fails to abort it if an error happens. Fix this by adding
explicit btrfs_abort_transaction/btrfs_end_transaction calls.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:44 +01:00
Liu Bo f82b735936 Btrfs: add write_flags for compression bio
The compression code path has only flagged bios with REQ_OP_WRITE no matter
where the bios come from, but a bio could be a sync write if fsync starts
this writeback, or a normal writeback write if the wb kthread starts a
periodic writeback.

This breaks the rule that sync writes and writeback writes need to be
differentiated from each other, because from the POV of the block layer,
all bios need to be recognized by these flags in order to do some
management, e.g. throttling.

This passes writeback_control to the compression write path so that it can
send bios with the proper flags to the block layer.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-15 14:44:31 +01:00
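The idea in f82b735936, deriving the bio flags from the writeback control rather than hard-coding a plain write, can be illustrated with the sketch below. Everything here is a stand-in: `demo_wbc`, `demo_write_flags` and the `DEMO_*` constants are hypothetical names with made-up values, not the kernel's definitions.

```c
/* Hypothetical writeback modes: periodic writeback vs. fsync-style sync. */
enum demo_sync_mode {
	DEMO_WB_SYNC_NONE,
	DEMO_WB_SYNC_ALL,
};

/* Hypothetical flag bits; the values are chosen only for this sketch. */
#define DEMO_REQ_OP_WRITE	0x1u
#define DEMO_REQ_SYNC		0x800u

/* Hypothetical, trimmed-down writeback control. */
struct demo_wbc {
	enum demo_sync_mode sync_mode;
};

/*
 * Compute the flags for a write bio. A sync writeback (for example one
 * started by fsync) is marked with the sync bit so that lower layers can
 * tell it apart from background writeback.
 */
static unsigned int demo_write_flags(const struct demo_wbc *wbc)
{
	unsigned int flags = DEMO_REQ_OP_WRITE;

	if (wbc->sync_mode == DEMO_WB_SYNC_ALL)
		flags |= DEMO_REQ_SYNC;

	return flags;
}
```

Passing the writeback control down to the code that builds the bio keeps this decision in one place instead of guessing the origin of the write later.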
Jeff Layton 4d2dc2cc76 fcntl: don't cap l_start and l_end values for F_GETLK64 in compat syscall
Currently, we're capping the values too low in the F_GETLK64 case. The
fields in that structure are 64-bit values, so we shouldn't need to do
any sort of fixup there.

However, make sure we check that assumption at build time in the future
by ensuring that the sizes we're copying will fit.

With this, we no longer need COMPAT_LOFF_T_MAX either, so remove it.

Fixes: 94073ad77f (fs/locks: don't mess with the address limit in compat_fcntl64)
Reported-by: Vitaly Lipatov <lav@etersoft.ru>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Howells <dhowells@redhat.com>
2017-11-15 08:08:36 -05:00
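The build-time check mentioned in 4d2dc2cc76 can be approximated in plain C11 with `_Static_assert`. The sketch below is only an illustration under stated assumptions: the two structures, the `FIELD_SIZE` macro and `demo_put_flock64` are hypothetical stand-ins, not the kernel's types or helpers.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel-side lock description. */
struct demo_kernel_flock {
	int16_t l_type;
	int16_t l_whence;
	int64_t l_start;
	int64_t l_len;
	int32_t l_pid;
};

/* Hypothetical stand-in for the 64-bit compat structure seen by userspace. */
struct demo_compat_flock64 {
	int16_t l_type;
	int16_t l_whence;
	int64_t l_start;
	int64_t l_len;
	int32_t l_pid;
};

/* Size of one member without needing an object of the type. */
#define FIELD_SIZE(type, member) sizeof(((type *)0)->member)

/* Fail the build if a source field could not fit in its destination. */
_Static_assert(FIELD_SIZE(struct demo_kernel_flock, l_start) <=
	       FIELD_SIZE(struct demo_compat_flock64, l_start),
	       "l_start does not fit in the compat structure");
_Static_assert(FIELD_SIZE(struct demo_kernel_flock, l_len) <=
	       FIELD_SIZE(struct demo_compat_flock64, l_len),
	       "l_len does not fit in the compat structure");

/* Plain copy: no capping is needed once the sizes are known to fit. */
static void demo_put_flock64(struct demo_compat_flock64 *dst,
			     const struct demo_kernel_flock *src)
{
	dst->l_type = src->l_type;
	dst->l_whence = src->l_whence;
	dst->l_start = src->l_start;
	dst->l_len = src->l_len;
	dst->l_pid = src->l_pid;
}
```

If someone later narrows a destination field, compilation stops at the assertion instead of values being silently truncated at run time.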
Jeff Layton 9280a601e6 fcntl: don't leak fd reference when fixup_compat_flock fails
Currently we just return err here, but we need to put the fd reference
first.

Fixes: 94073ad77f (fs/locks: don't mess with the address limit in compat_fcntl64)
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-11-15 08:08:36 -05:00
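The leak fixed by 9280a601e6 is an instance of an early return that skips the release of a reference. A minimal userspace sketch of the corrected shape follows. `demo_file`, `demo_fget`, `demo_fput`, `demo_fixup` and `demo_getlk` are hypothetical names with a toy reference counter, and the error code is chosen arbitrarily for the illustration.

```c
#include <errno.h>

/* Hypothetical file object with a toy reference counter. */
struct demo_file {
	int refcount;
};

static struct demo_file *demo_fget(struct demo_file *f)
{
	f->refcount++;
	return f;
}

static void demo_fput(struct demo_file *f)
{
	f->refcount--;
}

/* Hypothetical fixup step that can fail. */
static int demo_fixup(int should_fail)
{
	return should_fail ? -EOVERFLOW : 0;
}

static int demo_getlk(struct demo_file *file, int fixup_fails)
{
	struct demo_file *f = demo_fget(file);
	int err;

	err = demo_fixup(fixup_fails);
	if (err)
		goto out_put;	/* a bare "return err;" here would leak f */

	/* ... on success, the result would be copied to the caller here ... */

out_put:
	demo_fput(f);
	return err;
}
```

Routing every exit through a single label keeps the get and put paired, whichever step fails.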
Jeff Moyer 957ac8c421 dax: fix PMD faults on zero-length files
PMD faults on a zero length file on a file system mounted with -o dax
will not generate SIGBUS as expected.

	fd = open(...O_TRUNC);
	addr = mmap(NULL, 2*1024*1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	*addr = 'a';
        <expect SIGBUS>

The problem is this code in dax_iomap_pmd_fault:

	max_pgoff = (i_size_read(inode) - 1) >> PAGE_SHIFT;

If the inode size is zero, we end up with a max_pgoff that is way larger
than 0.  :)  Fix it by using DIV_ROUND_UP, as is done elsewhere in the
kernel.

I tested this with some simple test code that ensured that SIGBUS was
received where expected.

Cc: <stable@vger.kernel.org>
Fixes: 642261ac99 ("dax: add struct iomap based DAX PMD support")
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-11-14 20:16:55 -08:00
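The arithmetic behind 957ac8c421 can be reproduced in a few lines. This is a userspace sketch, assuming a 4 KiB page and an unsigned 64-bit file size. `max_pgoff_old` and `max_pgoff_new` are hypothetical helper names, and `DIV_ROUND_UP` is redefined locally rather than taken from kernel headers.

```c
#include <stdint.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(UINT64_C(1) << PAGE_SHIFT)
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Old computation: index of the last page of the file. For an empty file
 * the subtraction wraps around, producing an enormous page offset.
 */
static uint64_t max_pgoff_old(uint64_t i_size)
{
	return (i_size - 1) >> PAGE_SHIFT;
}

/*
 * New computation: number of pages needed to hold i_size bytes, which is
 * 0 for an empty file.
 */
static uint64_t max_pgoff_new(uint64_t i_size)
{
	return DIV_ROUND_UP(i_size, PAGE_SIZE);
}
```

Note that the two helpers do not mean the same thing: the old one yields the last valid page index, the new one a page count. A bounds check built on the new value would therefore presumably reject `pgoff >= max_pgoff` rather than `pgoff > max_pgoff`; that detail is an inference, not something the message above spells out.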
Linus Torvalds e2c5923c34 Merge branch 'for-4.15/block' of git://git.kernel.dk/linux-block
Pull core block layer updates from Jens Axboe:
 "This is the main pull request for block storage for 4.15-rc1.

  Nothing out of the ordinary in here, and no API changes or anything
  like that. Just various new features for drivers, core changes, etc.
  In particular, this pull request contains:

   - A patch series from Bart, closing the hole on blk/scsi-mq queue
     quiescing.

   - A series from Christoph, building towards hidden gendisks (for
     multipath) and ability to move bio chains around.

   - NVMe
        - Support for native multipath for NVMe (Christoph).
        - Userspace notifications for AENs (Keith).
        - Command side-effects support (Keith).
        - SGL support (Chaitanya Kulkarni)
        - FC fixes and improvements (James Smart)
        - Lots of fixes and tweaks (Various)

   - bcache
        - New maintainer (Michael Lyle)
        - Writeback control improvements (Michael)
        - Various fixes (Coly, Elena, Eric, Liang, et al)

   - lightnvm updates, mostly centered around the pblk interface
     (Javier, Hans, and Rakesh).

   - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

   - Writeback series that fix the much discussed hundreds of millions
     of sync-all units. This goes all the way, as discussed previously
     (me).

   - Fix for missing wakeup on writeback timer adjustments (Yafang
     Shao).

   - Fix laptop mode on blk-mq (me).

   - {mq,name} tuple lookup for IO schedulers, allowing us to have
     alias names. This means you can use 'deadline' on both !mq and on
     mq (where it's called mq-deadline). (me).

   - blktrace race fix, oopsing on sg load (me).

   - blk-mq optimizations (me).

   - Obscure waitqueue race fix for kyber (Omar).

   - NBD fixes (Josef).

   - Disable writeback throttling by default on bfq, like we do on cfq
     (Luca Miccio).

   - Series from Ming that enable us to treat flush requests on blk-mq
     like any other request. This is a really nice cleanup.

   - Series from Ming that improves merging on blk-mq with schedulers,
     getting us closer to flipping the switch on scsi-mq again.

   - BFQ updates (Paolo).

   - blk-mq atomic flags memory ordering fixes (Peter Z).

   - Loop cgroup support (Shaohua).

   - Lots of minor fixes from lots of different folks, both for core and
     driver code"

* 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
  nvme: fix visibility of "uuid" ns attribute
  blk-mq: fixup some comment typos and lengths
  ide: ide-atapi: fix compile error with defining macro DEBUG
  blk-mq: improve tag waiting setup for non-shared tags
  brd: remove unused brd_mutex
  blk-mq: only run the hardware queue if IO is pending
  block: avoid null pointer dereference on null disk
  fs: guard_bio_eod() needs to consider partitions
  xtensa/simdisk: fix compile error
  nvme: expose subsys attribute to sysfs
  nvme: create 'slaves' and 'holders' entries for hidden controllers
  block: create 'slaves' and 'holders' entries for hidden gendisks
  nvme: also expose the namespace identification sysfs files for mpath nodes
  nvme: implement multipath access to nvme subsystems
  nvme: track shared namespaces
  nvme: introduce a nvme_ns_ids structure
  nvme: track subsystems
  block, nvme: Introduce blk_mq_req_flags_t
  block, scsi: Make SCSI quiesce and resume work reliably
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  ...
2017-11-14 15:32:19 -08:00
Linus Torvalds abc36be236 A couple of configfs cleanups:
 - proper use of the bool type (Thomas Meyer)
 - constification of struct config_item_type (Bhumika Goyal)

Merge tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:
 "A couple of configfs cleanups:

   - proper use of the bool type (Thomas Meyer)

   - constification of struct config_item_type (Bhumika Goyal)"

* tag 'configfs-for-4.15' of git://git.infradead.org/users/hch/configfs:
  RDMA/cma: make config_item_type const
  stm class: make config_item_type const
  ACPI: configfs: make config_item_type const
  nvmet: make config_item_type const
  usb: gadget: configfs: make config_item_type const
  PCI: endpoint: make config_item_type const
  iio: make function argument and some structures const
  usb: gadget: make config_item_type structures const
  dlm: make config_item_type const
  netconsole: make config_item_type const
  nullb: make config_item_type const
  ocfs2/cluster: make config_item_type const
  target: make config_item_type const
  configfs: make ci_type field, some pointers and function arguments const
  configfs: make config_item_type const
  configfs: Fix bool initialization/comparison
2017-11-14 14:44:04 -08:00
Linus Torvalds f14fc0ccee Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, ext2, isofs and udf fixes from Jan Kara:

 - two small quota error handling fixes

 - two isofs fixes for architectures with signed char

 - several udf block number overflow and signedness fixes

 - ext2 rework of mount option handling to avoid GFP_KERNEL allocation
   with spinlock held

 - ... it also contains a patch to implement auditing of responses to
   fanotify permission events. That should have been in the fanotify
   pull request but I mistakenly merged that patch into a wrong branch
   and noticed only now at which point I don't think it's worth rebasing
   and redoing.

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  quota: be aware of error from dquot_initialize
  quota: fix potential infinite loop
  isofs: use unsigned char types consistently
  isofs: fix timestamps beyond 2027
  udf: Fix some sign-conversion warnings
  udf: Fix signed/unsigned format specifiers
  udf: Fix 64-bit sign extension issues affecting blocks > 0x7FFFFFFF
  udf: Remove some outdate references from documentation
  udf: Avoid overflow when session starts at large offset
  ext2: Fix possible sleep in atomic during mount option parsing
  ext2: Parse mount options into a dedicated structure
  audit: Record fanotify access control decisions
2017-11-14 14:13:11 -08:00
Linus Torvalds 23281c8034 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:

 - fixes of use-after-free issues when handling fanotify permission
   events from Miklos

 - refcount_t conversions from Elena

 - fixes of ENOMEM handling in dnotify and fsnotify from me

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: convert fsnotify_mark.refcnt from atomic_t to refcount_t
  fanotify: clean up CONFIG_FANOTIFY_ACCESS_PERMISSIONS ifdefs
  fsnotify: clean up fsnotify()
  fanotify: fix fsnotify_prepare_user_wait() failure
  fsnotify: fix pinning group in fsnotify_prepare_user_wait()
  fsnotify: pin both inode and vfsmount mark
  fsnotify: clean up fsnotify_prepare/finish_user_wait()
  fsnotify: convert fsnotify_group.refcnt from atomic_t to refcount_t
  fsnotify: Protect bail out path of fsnotify_add_mark_locked() properly
  dnotify: Handle errors from fsnotify_add_mark_locked() in fcntl_dirnotify()
2017-11-14 14:08:20 -08:00
Linus Torvalds f0b60bfa95 dlm for 4.15
This set focuses, as usual, on fixes to the comms layer.
 New testing of the dlm with ocfs2 uncovered a number of
 bugs in the TCP connection handling during recovery,
 starting, and stopping.

Merge tag 'dlm-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set focuses, as usual, on fixes to the comms layer.

  New testing of the dlm with ocfs2 uncovered a number of bugs in the
  TCP connection handling during recovery, starting, and stopping"

* tag 'dlm-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  dlm: remove dlm_send_rcom_lookup_dump
  dlm: recheck kthread_should_stop() before schedule()
  DLM: fix NULL pointer dereference in send_to_sock()
  DLM: fix to reschedule rwork
  DLM: fix to use sk_callback_lock correctly
  DLM: fix overflow dlm_cb_seq
  DLM: fix memory leak in tcp_accept_from_sock()
  DLM: fix conversion deadlock when DLM_LKF_NODLCKWT flag is set
  DLM: use CF_CLOSE flag to stop dlm_send correctly
  DLM: Reanimate CF_WRITE_PENDING flag
  DLM: fix race condition between dlm_recoverd_stop and dlm_recoverd
  DLM: close othercon at send/receive error
  DLM: retry rcom when dlm_wait_function is timed out.
  DLM: fix to use sock_mutex correctly in xxx_accept_from_sock
  DLM: fix race condition between dlm_send and dlm_recv
  DLM: fix double list_del()
  DLM: fix remove save_cb argument from add_sock()
  DLM: Fix saving of NULL callbacks
  DLM: Eliminate CF_WRITE_PENDING flag
  DLM: Eliminate CF_CONNECT_PENDING flag
2017-11-14 14:06:51 -08:00
Linus Torvalds 29309a4eb8 We've got a total of 17 GFS2 patches for this merge window. The
patches are basically in three categories: (1) patches related to
 broken xfstest cases, (2) patches related to improving iomap and
 starting to use it in GFS2, and (3) general typos and clarifications.
 
 Please note that one of the iomap patches extends beyond GFS2 and
 affects other file systems, but it was publicly reviewed by a
 variety of file system people in the community.
 
 1. Andreas has a patch that simply renames variable 'bsize' to 'factor'
    to clarify the logic related to gfs2_block_map.
 2. He also has a patch to correctly set ctime in the setflags ioctl,
    which fixes broken xfstests test 277.
 3. He also fixed broken xfstest 258, due to an atime initialization
    problem.
 4. He also fixed broken xfstest 307, in which GFS2 was not setting
    ctime when setting acls.
 5. He has a patch to switch general iomap code from blkno to disk
    offset for a variety of file systems.
 6. He has a patch to add a new IOMAP_F_DATA_INLINE flag for iomap
    to indicate blocks that have data mixed with metadata.
 7. I contributed a patch to make inode height info part of the
    'metapath' data structure to facilitate using iomap in GFS2.
 8. I have a patch to start using iomap inside GFS2 and switch GFS2's
    block_map functions to use iomap under the covers.
 9. I have a patch to switch GFS2's fiemap implementation from using
    block_map to using iomap under the covers.
 10. Andreas has a patch to implement SEEK_HOLE and SEEK_DATA via
     iomap in GFS2.
 11. I have a patch related to journaled data pages not being properly
     synced to media when writing inodes. This was caught with xfstests.
 12. I have a patch to fix another failing xfstest case in which
     switching a file from ordered_write to journaled data via set_flags
     caused a deadlock.
 13. Andreas has a patch to fix failing xfstest case 066, which was
     due to not properly syncing dirty inodes when changing extended
     attributes.
 14. Andreas fixed a minor typo in a comment.
 15. Andreas contributed a patch to partially fix xfstest 424, which
     involved GET_FLAGS and SET_FLAGS ioctl. This is also a cleanup
     and simplification of the translation of flags from fs flags to
     gfs2 flags.
 16. He also added support for STATX_ATTR_ in statx, which fixed broken
     xfstest 424.
 17. He also contributed a fix for failing xfstest 093 which fixes a
     recursive glock problem with gfs2_xattr_get and _set.

Merge tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 updates from Bob Peterson:
 "We've got a total of 17 GFS2 patches for this merge window. The
  patches are basically in three categories: (1) patches related to
  broken xfstest cases, (2) patches related to improving iomap and
  starting to use it in GFS2, and (3) general typos and clarifications.

  Please note that one of the iomap patches extends beyond GFS2 and
  affects other file systems, but it was publicly reviewed by a
  variety of file system people in the community.

  From Andreas Gruenbacher:

   - rename variable 'bsize' to 'factor' to clarify the logic related to
     gfs2_block_map.

   - correctly set ctime in the setflags ioctl, which fixes broken
     xfstests test 277.

   - fix broken xfstest 258, due to an atime initialization problem.

   - fix broken xfstest 307, in which GFS2 was not setting ctime when
     setting acls.

   - switch general iomap code from blkno to disk offset for a variety
     of file systems.

   - add a new IOMAP_F_DATA_INLINE flag for iomap to indicate blocks
     that have data mixed with metadata.

   - implement SEEK_HOLE and SEEK_DATA via iomap in GFS2.

   - fix failing xfstest case 066, which was due to not properly syncing
     dirty inodes when changing extended attributes.

   - fix a minor typo in a comment.

   - partially fix xfstest 424, which involved GET_FLAGS and SET_FLAGS
     ioctl. This is also a cleanup and simplification of the translation
     of flags from fs flags to gfs2 flags.

   - add support for STATX_ATTR_ in statx, which fixed broken xfstest
     424.

   - fix for failing xfstest 093 which fixes a recursive glock problem
     with gfs2_xattr_get and _set

  From me:

   - make inode height info part of the 'metapath' data structure to
     facilitate using iomap in GFS2.

   - start using iomap inside GFS2 and switch GFS2's block_map functions
     to use iomap under the covers.

   - switch GFS2's fiemap implementation from using block_map to using
     iomap under the covers.

   - fix journaled data pages not being properly synced to media when
     writing inodes. This was caught with xfstests.

   - fix another failing xfstest case in which switching a file from
     ordered_write to journaled data via set_flags caused a deadlock"

* tag 'gfs2-4.15.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Allow gfs2_xattr_set to be called with the glock held
  gfs2: Add support for statx inode flags
  gfs2: Fix and clean up {GET,SET}FLAGS ioctl
  gfs2: Fix a harmless typo
  gfs2: Fix xattr fsync
  GFS2: Take inode off order_write list when setting jdata flag
  GFS2: flush the log and all pages for jdata as we do for WB_SYNC_ALL
  gfs2: Implement SEEK_HOLE / SEEK_DATA via iomap
  GFS2: Switch fiemap implementation to use iomap
  GFS2: Implement iomap for block_map
  GFS2: Make height info part of metapath
  gfs2: Always update inode ctime in set_acl
  gfs2: Support negative atimes
  gfs2: Update ctime in setflags ioctl
  gfs2: Clarify gfs2_block_map
2017-11-14 13:55:51 -08:00
Linus Torvalds ac446dcc83 A couple small fixes for jfs

Merge tag 'jfs-4.15' of git://github.com/kleikamp/linux-shaggy

Pull jfs updates from David Kleikamp:
 "A couple small fixes for jfs"

* tag 'jfs-4.15' of git://github.com/kleikamp/linux-shaggy:
  jfs: Add missing NULL pointer check in __get_metapage
  jfs: remove increment of i_version counter
2017-11-14 13:53:18 -08:00
Linus Torvalds 5cea7647e6 Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "There are some new user features and the usual load of invisible
  enhancements or cleanups.

  New features:

   - extend mount options to specify zlib compression level, -o
     compress=zlib:9

   - v2 of ioctl "extent to inode mapping", addressing a usecase where
     we want to retrieve more but inaccurate results and do the
     postprocessing in userspace, aiding defragmentation or
     deduplication tools

   - populate compression heuristics logic, do data sampling and try to
     guess compressibility by: looking for repeated patterns, counting
     unique byte values and distribution, calculating Shannon entropy;
     this will need more benchmarking and possibly fine tuning, but the
     base should be good enough

   - enable indexing for btrfs as lower filesystem in overlayfs

   - speedup page cache readahead during send on large files

  Internal enhancements:

   - more sanity checks of b-tree items when reading them from disk

   - more EINVAL/EUCLEAN fixups, missing BLK_STS_* conversion, other
     errno or error handling fixes

   - remove some homegrown IO-related logic, that's been obsoleted by
     core block layer changes (batching, plug/unplug, own counters)

   - add ref-verify, optional debugging feature to verify extent
     reference accounting

   - simplify code handling outstanding extents, make it more clear
     where and how the accounting is done

   - make delalloc reservations per-inode, simplify the code and make
     the logic more straightforward

   - extensive cleanup of delayed refs code

  Notable fixes:

   - fix send ioctl on 32bit with 64bit kernel"

* 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (102 commits)
  btrfs: Fix bug for misused dev_t when lookup in dev state hash table.
  Btrfs: heuristic: add Shannon entropy calculation
  Btrfs: heuristic: add byte core set calculation
  Btrfs: heuristic: add byte set calculation
  Btrfs: heuristic: add detection of repeated data patterns
  Btrfs: heuristic: implement sampling logic
  Btrfs: heuristic: add bucket and sample counters and other defines
  Btrfs: compression: separate heuristic/compression workspaces
  btrfs: move btrfs_truncate_block out of trans handle
  btrfs: don't call btrfs_start_delalloc_roots in flushoncommit
  btrfs: track refs in a rb_tree instead of a list
  btrfs: add a comp_refs() helper
  btrfs: switch args for comp_*_refs
  btrfs: make the delalloc block rsv per inode
  btrfs: add tracepoints for outstanding extents mods
  Btrfs: rework outstanding_extents
  btrfs: increase output size for LOGICAL_INO_V2 ioctl
  btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
  btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
  btrfs: send: remove unused code
  ...
2017-11-14 13:35:29 -08:00
Linus Torvalds 808eb24e0e New in this version:
- Refactor the incore extent map manipulations to use a cursor instead of
   directly modifying extent data.
 - Refactor the incore extent map cursor to use an in-memory btree instead
   of a single high-order allocation.  This eliminates a major source of
   complaints about insufficient memory when opening a heavily fragmented
   file into a system whose memory is also heavily fragmented.
 - Fix a longstanding bug where deleting a file with a complex extended
   attribute btree incorrectly handled memory pointers, which could lead
   to memory corruption.
 - Improve metadata validation to eliminate crashing problems found while
   fuzzing xfs.
 - Move the error injection tag definitions into libxfs to be shared with
   userspace components.
 - Fix some log recovery bugs where we'd underflow log block position
   vector and incorrectly fail log recovery.
 - Drain the buffer lru after log recovery to force recovered buffers back
   through the verifiers after mount.  On a v4 filesystem the log never
   attaches verifiers during log replay (v5 does), so we could end up with
   buffers marked verified but without having ever been verified.
 - Fix various other bugs.
 - Introduce the first part of a new online fsck tool.  The new fsck tool
   will be able to iterate every piece of metadata in the filesystem to
   look for obvious errors and corruptions.  In the next release cycle
   the checking will be extended to cross-reference with the other fs
   metadata, so this feature should only be used by the developers in the
   meantime.

Merge tag 'xfs-4.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "xfs: great scads of new stuff for 4.15.

  This merge cycle, we're making some substantive changes to XFS. The
  in-core extent mappings have been refactored to use proper iterators
  and a btree to handle heavily fragmented files without needing
  high-order memory allocations; some important log recovery bug fixes;
  and the first part of the online fsck functionality.

  (The online fsck feature is disabled by default and more pieces of it
  will be coming in future release cycles.)

  This giant pile of patches has been run through a full xfstests run
  over the weekend and through a quick xfstests run against this
  morning's master, with no major failures reported.

  New in this version:

   - Refactor the incore extent map manipulations to use a cursor
     instead of directly modifying extent data.

   - Refactor the incore extent map cursor to use an in-memory btree
     instead of a single high-order allocation. This eliminates a major
     source of complaints about insufficient memory when opening a
     heavily fragmented file into a system whose memory is also heavily
     fragmented.

   - Fix a longstanding bug where deleting a file with a complex
     extended attribute btree incorrectly handled memory pointers, which
     could lead to memory corruption.

   - Improve metadata validation to eliminate crashing problems found
     while fuzzing xfs.

   - Move the error injection tag definitions into libxfs to be shared
     with userspace components.

   - Fix some log recovery bugs where we'd underflow log block position
     vector and incorrectly fail log recovery.

   - Drain the buffer lru after log recovery to force recovered buffers
     back through the verifiers after mount. On a v4 filesystem the log
     never attaches verifiers during log replay (v5 does), so we could
     end up with buffers marked verified but without having ever been
     verified.

   - Fix various other bugs.

   - Introduce the first part of a new online fsck tool. The new fsck
     tool will be able to iterate every piece of metadata in the
     filesystem to look for obvious errors and corruptions. In the next
     release cycle the checking will be extended to cross-reference with
     the other fs metadata, so this feature should only be used by the
     developers in the meantime"

* tag 'xfs-4.15-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (131 commits)
  xfs: on failed mount, force-reclaim inodes after unmounting quota controls
  xfs: check the uniqueness of the AGFL entries
  xfs: remove u_int* type usage
  xfs: handle zero entries case in xfs_iext_rebalance_leaf
  xfs: add comments documenting the rebalance algorithm
  xfs: trivial indentation fixup for xfs_iext_remove_node
  xfs: remove a superflous assignment in xfs_iext_remove_node
  xfs: add some comments to xfs_iext_insert/xfs_iext_insert_node
  xfs: fix number of records handling in xfs_iext_split_leaf
  fs/xfs: Remove NULL check before kmem_cache_destroy
  xfs: only check da node header padding on v5 filesystems
  xfs: fix btree scrub deref check
  xfs: fix uninitialized return values in scrub code
  xfs: pass inode number to xfs_scrub_ino_set_{preen,warning}
  xfs: refactor the directory data block bestfree checks
  xfs: mark xlog_verify_dest_ptr STATIC
  xfs: mark xlog_recover_check_summary STATIC
  xfs: mark xfs_btree_check_lblock and xfs_btree_check_ptr static
  xfs: remove unreachable error injection code in xfs_qm_dqget
  xfs: remove unused debug counts for xfs_lock_inodes
  ...
2017-11-14 13:15:12 -08:00
Linus Torvalds ae9a8c4bdc Add support for online resizing of file systems with bigalloc. Fix
two data corruption bugs involving DAX, as well as a corruption bug
 after a crash during a racing fallocate and delayed allocation.
 Finally, a number of cleanups and optimizations.

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:

 - Add support for online resizing of file systems with bigalloc

 - Fix two data corruption bugs involving DAX, as well as a corruption
   bug after a crash during a racing fallocate and delayed allocation.

 - Finally, a number of cleanups and optimizations.

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: improve smp scalability for inode generation
  ext4: add support for online resizing with bigalloc
  ext4: mention noload when recovering on read-only device
  Documentation: fix little inconsistencies
  ext4: convert timers to use timer_setup()
  jbd2: convert timers to use timer_setup()
  ext4: remove duplicate extended attributes defs
  ext4: add ext4_should_use_dax()
  ext4: add sanity check for encryption + DAX
  ext4: prevent data corruption with journaling + DAX
  ext4: prevent data corruption with inline data + DAX
  ext4: fix interaction between i_size, fallocate, and delalloc after a crash
  ext4: retry allocations conservatively
  ext4: Switch to iomap for SEEK_HOLE / SEEK_DATA
  ext4: Add iomap support for inline data
  iomap: Add IOMAP_F_DATA_INLINE flag
  iomap: Switch from blkno to disk offset
2017-11-14 12:59:42 -08:00
Linus Torvalds 32190f0afb fscrypt: lots of cleanups, mostly courtesy of Eric Biggers

Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Lots of cleanups, mostly courtesy of Eric Biggers"

* tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: lock mutex before checking for bounce page pool
  fscrypt: add a documentation file for filesystem-level encryption
  ext4: switch to fscrypt_prepare_setattr()
  ext4: switch to fscrypt_prepare_lookup()
  ext4: switch to fscrypt_prepare_rename()
  ext4: switch to fscrypt_prepare_link()
  ext4: switch to fscrypt_file_open()
  fscrypt: new helper function - fscrypt_prepare_setattr()
  fscrypt: new helper function - fscrypt_prepare_lookup()
  fscrypt: new helper function - fscrypt_prepare_rename()
  fscrypt: new helper function - fscrypt_prepare_link()
  fscrypt: new helper function - fscrypt_file_open()
  fscrypt: new helper function - fscrypt_require_key()
  fscrypt: remove unneeded empty fscrypt_operations structs
  fscrypt: remove ->is_encrypted()
  fscrypt: switch from ->is_encrypted() to IS_ENCRYPTED()
  fs, fscrypt: add an S_ENCRYPTED inode flag
  fscrypt: clean up include file mess
2017-11-14 11:35:15 -08:00
Linus Torvalds 37dc79565c Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "Here is the crypto update for 4.15:

  API:

   - Disambiguate EBUSY when queueing crypto request by adding ENOSPC.
     This change touches code outside the crypto API.
   - Reset settings when empty string is written to rng_current.

  Algorithms:

   - Add OSCCA SM3 secure hash.

  Drivers:

   - Remove old mv_cesa driver (replaced by marvell/cesa).
   - Enable rfc3686/ecb/cfb/ofb AES in crypto4xx.
   - Add ccm/gcm AES in crypto4xx.
   - Add support for BCM7278 in iproc-rng200.
   - Add hash support on Exynos in s5p-sss.
   - Fix fallback-induced error in vmx.
   - Fix output IV in atmel-aes.
   - Fix empty GCM hash in mediatek.

  Others:

   - Fix DoS potential in lib/mpi.
   - Fix potential out-of-order issues with padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (162 commits)
  lib/mpi: call cond_resched() from mpi_powm() loop
  crypto: stm32/hash - Fix return issue on update
  crypto: dh - Remove pointless checks for NULL 'p' and 'g'
  crypto: qat - Clean up error handling in qat_dh_set_secret()
  crypto: dh - Don't permit 'key' or 'g' size longer than 'p'
  crypto: dh - Don't permit 'p' to be 0
  crypto: dh - Fix double free of ctx->p
  hwrng: iproc-rng200 - Add support for BCM7278
  dt-bindings: rng: Document BCM7278 RNG200 compatible
  crypto: chcr - Replace _manual_ swap with swap macro
  crypto: marvell - Add a NULL entry at the end of mv_cesa_plat_id_table[]
  hwrng: virtio - Virtio RNG devices need to be re-registered after suspend/resume
  crypto: atmel - remove empty functions
  crypto: ecdh - remove empty exit()
  MAINTAINERS: update maintainer for qat
  crypto: caam - remove unused param of ctx_map_to_sec4_sg()
  crypto: caam - remove unneeded edesc zeroization
  crypto: atmel-aes - Reset the controller before each use
  crypto: atmel-aes - properly set IV after {en,de}crypt
  hwrng: core - Reset user selected rng by writing "" to rng_current
  ...
2017-11-14 10:52:09 -08:00
Jan Kara 838bee9e75 Merge udf, isofs, quota, ext2 changes for 4.15-rc1.
2017-11-14 11:09:53 +01:00
Linus Torvalds fb0255fb29 TTY/Serial patches for 4.15-rc1
Here is the big tty/serial driver pull request for 4.15-rc1.
 
 Lots of serial driver updates in here, some small vt cleanups, and a
 raft of SPDX and license boilerplate cleanups, messing up the diffstat a
 bit.
 
 Nothing major, with no really functional changes except better hardware
 support for some platforms.
 
 All of these have been in linux-next for a while with no reported
 issues.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Merge tag 'tty-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial updates from Greg KH:
 "Here is the big tty/serial driver pull request for 4.15-rc1.

  Lots of serial driver updates in here, some small vt cleanups, and a
  raft of SPDX and license boilerplate cleanups, messing up the diffstat
  a bit.

  Nothing major, with no really functional changes except better hardware
  support for some platforms.

  All of these have been in linux-next for a while with no reported
  issues"

* tag 'tty-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (110 commits)
  tty: ehv_bytechan: fix spelling mistake
  tty: serial: meson: allow baud-rates lower than 9600
  serial: 8250_fintek: Fix crash with baud rate B0
  serial: 8250_fintek: Disable delays for ports != 0
  serial: 8250_fintek: Return -EINVAL on invalid configuration
  tty: Remove redundant license text
  tty: serdev: Remove redundant license text
  tty: hvc: Remove redundant license text
  tty: serial: Remove redundant license text
  tty: add SPDX identifiers to all remaining files in drivers/tty/
  tty: serial: jsm: remove redundant pointer ts
  tty: serial: jsm: add space before the open parenthesis '('
  tty: serial: jsm: fix coding style
  tty: serial: jsm: delete space between function name and '('
  tty: serial: jsm: add blank line after declarations
  tty: serial: jsm: change the type of local variable
  tty: serial: imx: remove dead code imx_dma_rxint
  tty: serial: imx: disable ageing timer interrupt if dma in use
  serial: 8250: fix potential deadlock in rs485-mode
  serial: m32r_sio: Drop redundant .data assignment
  ...
2017-11-13 21:05:31 -08:00
Chao Yu 812c60564c f2fs: inject fault in inc_valid_node_count
This patch adds missing fault injection in inc_valid_node_count.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 20:21:35 -08:00
Chao Yu 28cfafb738 f2fs: fix to clear FI_NO_PREALLOC
We need to clear the FI_NO_PREALLOC flag in the error path of
f2fs_file_write_iter; otherwise we will lose the chance to preallocate
blocks all at once in a later write().

Fixes: dc91de78e5 ("f2fs: do not preallocate blocks which has wrong buffer")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-11-13 20:21:22 -08:00