1
0
Fork 0
Commit Graph

60810 Commits (0362f326d86c645b5e96b7dbc3ee515986ed019d)

Author SHA1 Message Date
Linus Torvalds 875fef493f Two fixes for the buffered reads and O_DIRECT writes serialization
patch that went into -rc1 and a fixup for a bogus warning on older
 gcc versions.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3O2RMTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi+k6CACf0hUTyWcJaiH3WAmkpKOnZVG//Ghv
 +hDWskib0gSilW+mx8Cjsndb5rXVuE4MhZ9P1VD1MMhhfVlfTUspCPG6cIQ3B3gd
 jEVLHDALaMc/tpKwa6EbxvxQRAL5D/2Umh8aK1kVMX2U9R6KKfMiRVToHVPewSkS
 eM3HJuV0kUonnD6glHyie1iwI9iFkDgt+eTJR1hpiFx26y6TwVCH5RNNvZGr0Tcf
 KMLgwAHxIowx0SxblvbJTMf1iIoPJNiUuZyBoo0Hli9hI7S/b4XfARAkD9eud02d
 4shv7D9Js0V1tKDsfV/c8UpnOgBTGEY4AEXIN3Mm6Gk6q9pMnobWt+l8
 =kIHR
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two fixes for the buffered reads and O_DIRECT writes serialization
  patch that went into -rc1 and a fixup for a bogus warning on older gcc
  versions"

* tag 'ceph-for-5.4-rc8' of git://github.com/ceph/ceph-client:
  rbd: silence bogus uninitialized warning in rbd_object_map_update_finish()
  ceph: increment/decrement dio counter on async requests
  ceph: take the inode lock before acquiring cap refs
2019-11-15 10:30:24 -08:00
David Howells a28f239e29 afs: Fix race in commit bulk status fetch
When a lookup is done, the afs filesystem will perform a bulk status-fetch
operation on the requested vnode (file) plus the next 49 other vnodes from
the directory list (in AFS, directory contents are downloaded as blobs and
parsed locally).  When the results are received, it will speculatively
populate the inode cache from the extra data.

However, if the lookup races with another lookup on the same directory, but
for a different file - one that's in the 49 extra fetches, then if the bulk
status-fetch operation finishes first, it will try and update the inode
from the other lookup.

If this other inode is still in the throes of being created, however, this
will cause an assertion failure in afs_apply_status():

	BUG_ON(test_bit(AFS_VNODE_UNSET, &vnode->flags));

on or about fs/afs/inode.c:175 because it expects data to be there already
that it can compare to.

Fix this by skipping the update if the inode is being created as the
creator will presumably set up the inode with the same information.

Fixes: 39db9815da ("afs: Fix application of the results of a inline bulk status fetch")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-15 10:28:02 -08:00
Linus Torvalds b4c0800e42 Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs fixes from Al Viro:
 "Assorted fixes all over the place; some of that is -stable fodder,
  some regressions from the last window"

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
  ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
  ecryptfs: fix unlink and rmdir in face of underlying fs modifications
  audit_get_nd(): don't unlock parent too early
  exportfs_decode_fh(): negative pinned may become positive without the parent locked
  cgroup: don't put ERR_PTR() into fc->root
  autofs: fix a leak in autofs_expire_indirect()
  aio: Fix io_pgetevents() struct __compat_aio_sigset layout
  fs/namespace.c: fix use-after-free of mount in mnt_warn_timestamp_expiry()
2019-11-15 08:44:08 -08:00
Jeff Layton 6a81749ebe ceph: increment/decrement dio counter on async requests
Ceph can in some cases issue an async DIO request, in which case we can
end up calling ceph_end_io_direct before the I/O is actually complete.
That may allow buffered operations to proceed while DIO requests are
still in flight.

Fix this by incrementing the i_dio_count when issuing an async DIO
request, and decrement it when tearing down the aio_req.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Jeff Layton a81bc3102b ceph: take the inode lock before acquiring cap refs
Most of the time, we (or the vfs layer) takes the inode_lock and then
acquires caps, but ceph_read_iter does the opposite, and that can lead
to a deadlock.

When there are multiple clients treading over the same data, we can end
up in a situation where a reader takes caps and then tries to acquire
the inode_lock. Another task holds the inode_lock and issues a request
to the MDS which needs to revoke the caps, but that can't happen until
the inode_lock is unwedged.

Fix this by having ceph_read_iter take the inode_lock earlier, before
attempting to acquire caps.

Fixes: 321fe13c93 ("ceph: add buffered/direct exclusionary locking for reads and writes")
Link: https://tracker.ceph.com/issues/36348
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-14 18:44:51 +01:00
Linus Torvalds afd7a71872 for-5.4-rc7-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3MUkoACgkQxWXV+ddt
 WDvAEQ//fEHZ51NbIMwJNltqF4mr6Oao0M5u0wejEgiXzmR9E1IuHUgVK+KDQmSu
 wZl/y+RTlQC0TiURyStaVFBreXEqiuB79my9u4iDeNv/4UJQB42qpmYB4EuviDgB
 mxb9bFpWTLkO6Oc+vMrGF3BOmVsQKlq2nOua25g8VFtApQ6uiEfbwBOslCcC8kQB
 ZpNBl6x74xz/VWNWZnRStBfwYjRitKNDVU6dyIyRuLj8cktqfGBxGtx7/w0wDiZT
 kPR1bNtdpy3Ndke6H/0G6plRWi9kENqcN43hvrz54IKh2l+Jd2/as51j4Qq2tJU9
 KaAnJzRaSePxc2m0SqtgZTvc2BYSOg7dqaCyHxBB0CUBdTdJdz2TVZ2KM9MiLns4
 1haHBLo4l8o8zeYZpW05ac6OXKY4f8qsjWPEGshn4FDbq0TrHQzYxAF3c0X3hPag
 SnuvilgYUuYal+n87qinePg/ZmVrrBXPRycpQnn7FxqezJbf/2WUEojUVQnreU05
 mdp8mulxQxyFhgEvO7K1uDtlP8bqW69IO9M//6IWzGNKTDK2SRI08ULplghqgyna
 8SG0+y9w26r8UIWDhuvPbdfUMSG3kEH8yLFK84AFDMVJJxOnfznE3sC8sGOiP5q9
 OUkl8l7bhDkyAdWZY57gGUobebdPfnLxRV9A+LZQ2El1kSOEK18=
 =xzXs
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fix from David Sterba:
 "A fix for an older bug that has started to show up during testing
  (because of an updated test for rename exchange).

  It's an in-memory corruption caused by local variable leaking out of
  the function scope"

* tag 'for-5.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix log context list corruption after rename exchange operation
2019-11-13 12:06:10 -08:00
Filipe Manana e6c617102c Btrfs: fix log context list corruption after rename exchange operation
During rename exchange we might have successfully log the new name in the
source root's log tree, in which case we leave our log context (allocated
on stack) in the root's list of log contextes. However we might fail to
log the new name in the destination root, in which case we fallback to
a transaction commit later and never sync the log of the source root,
which causes the source root log context to remain in the list of log
contextes. This later causes invalid memory accesses because the context
was allocated on stack and after rename exchange finishes the stack gets
reused and overwritten for other purposes.

The kernel's linked list corruption detector (CONFIG_DEBUG_LIST=y) can
detect this and report something like the following:

  [  691.489929] ------------[ cut here ]------------
  [  691.489947] list_add corruption. prev->next should be next (ffff88819c944530), but was ffff8881c23f7be4. (prev=ffff8881c23f7a38).
  [  691.489967] WARNING: CPU: 2 PID: 28933 at lib/list_debug.c:28 __list_add_valid+0x95/0xe0
  (...)
  [  691.489998] CPU: 2 PID: 28933 Comm: fsstress Not tainted 5.4.0-rc6-btrfs-next-62 #1
  [  691.490001] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [  691.490003] RIP: 0010:__list_add_valid+0x95/0xe0
  (...)
  [  691.490007] RSP: 0018:ffff8881f0b3faf8 EFLAGS: 00010282
  [  691.490010] RAX: 0000000000000000 RBX: ffff88819c944530 RCX: 0000000000000000
  [  691.490011] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffa2c497e0
  [  691.490013] RBP: ffff8881f0b3fe68 R08: ffffed103eaa4115 R09: ffffed103eaa4114
  [  691.490015] R10: ffff88819c944000 R11: ffffed103eaa4115 R12: 7fffffffffffffff
  [  691.490016] R13: ffff8881b4035610 R14: ffff8881e7b84728 R15: 1ffff1103e167f7b
  [  691.490019] FS:  00007f4b25ea2e80(0000) GS:ffff8881f5500000(0000) knlGS:0000000000000000
  [  691.490021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  691.490022] CR2: 00007fffbb2d4eec CR3: 00000001f2a4a004 CR4: 00000000003606e0
  [  691.490025] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  691.490027] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  691.490029] Call Trace:
  [  691.490058]  btrfs_log_inode_parent+0x667/0x2730 [btrfs]
  [  691.490083]  ? join_transaction+0x24a/0xce0 [btrfs]
  [  691.490107]  ? btrfs_end_log_trans+0x80/0x80 [btrfs]
  [  691.490111]  ? dget_parent+0xb8/0x460
  [  691.490116]  ? lock_downgrade+0x6b0/0x6b0
  [  691.490121]  ? rwlock_bug.part.0+0x90/0x90
  [  691.490127]  ? do_raw_spin_unlock+0x142/0x220
  [  691.490151]  btrfs_log_dentry_safe+0x65/0x90 [btrfs]
  [  691.490172]  btrfs_sync_file+0x9f1/0xc00 [btrfs]
  [  691.490195]  ? btrfs_file_write_iter+0x1800/0x1800 [btrfs]
  [  691.490198]  ? rcu_read_lock_any_held.part.11+0x20/0x20
  [  691.490204]  ? __do_sys_newstat+0x88/0xd0
  [  691.490207]  ? cp_new_stat+0x5d0/0x5d0
  [  691.490218]  ? do_fsync+0x38/0x60
  [  691.490220]  do_fsync+0x38/0x60
  [  691.490224]  __x64_sys_fdatasync+0x32/0x40
  [  691.490228]  do_syscall_64+0x9f/0x540
  [  691.490233]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [  691.490235] RIP: 0033:0x7f4b253ad5f0
  (...)
  [  691.490239] RSP: 002b:00007fffbb2d6078 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
  [  691.490242] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f4b253ad5f0
  [  691.490244] RDX: 00007fffbb2d5fe0 RSI: 00007fffbb2d5fe0 RDI: 0000000000000003
  [  691.490245] RBP: 000000000000000d R08: 0000000000000001 R09: 00007fffbb2d608c
  [  691.490247] R10: 00000000000002e8 R11: 0000000000000246 R12: 00000000000001f4
  [  691.490248] R13: 0000000051eb851f R14: 00007fffbb2d6120 R15: 00005635a498bda0

This started happening recently when running some test cases from fstests
like btrfs/004 for example, because support for rename exchange was added
last week to fsstress from fstests.

So fix this by deleting the log context for the source root from the list
if we have logged the new name in the source root.

Reported-by: Su Yue <Damenly_Su@gmx.com>
Fixes: d4682ba03e ("Btrfs: sync log after logging new name")
CC: stable@vger.kernel.org # 4.19+
Tested-by: Su Yue <Damenly_Su@gmx.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-11 19:46:02 +01:00
Linus Torvalds a5871fcba4 configfs regression fix for 5.4-rc
- fix a regression from this merge window in the configfs
    symlink handling (Honggang Li)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl3IG00LHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPJbBAAw57uaM2mKcUe1syqwsaliqYVGqZ7X+4sMdtwNPAI
 W01aolnmmukcWInlZBqiHFu3+iAVYJND9NvP7ZA+p6/meJfWiIrgWMEVS4X6HFpg
 qsBCP1qh8BvLxCzHC+i9tkgPMpicYjImE/pza99ZtrMHTRC7ao5I3GKbWBIdEpDv
 5NZDbt2g79Y1YfqGRo0xcY0etwpau+iN4rXivjj4qMO1o+Vt4rWlGZehhD2/r2M0
 NGeR3JpYe1MdxvyECMoDI+aWuOiDFrJ/ZfWfTuCskqwTyZ1BtKElBnqS6VFn7yxL
 XqPNwe6Sr9fz7RRZXQ1iH8O/SYct25xVyc6JQlAvcpp0emAUsj2bNAQ3Hj4W3Krw
 h1D5HKvte+CjIqZEnFlqI6GeHawdbSpJoE1VOECS1rXpYhvs5V+2e5RyT2gmj1Pp
 X58Q0xF5Ver2sF0NkhAMTxXL77L8cpRDVljveBXgUh/MzTdPcq1h/RsKXwwVBD23
 Vsmg4qKlBAWJLK1TJ4wvvmmp0N/XzcaNWm7cKCxJDBlzY8LpcVTmUJfGty96mHqb
 cRZLMNcWbtPFSkHfzkYeuWWnhhtgXBmEcTawOd3y0s+jd8wjhRr2wuhL/MSCicJG
 t0A+4/5p4s8H94OAiq9tF8yOAQpvzmrrunW+lyyOQchK4kBBM8rD+cMIRziL+u/2
 B84=
 =kgn+
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs

Pull configfs regression fix from Christoph Hellwig:
 "Fix a regression from this merge window in the configfs symlink
  handling (Honggang Li)"

* tag 'configfs-for-5.4-2' of git://git.infradead.org/users/hch/configfs:
  configfs: calculate the depth of parent item
2019-11-10 12:59:34 -08:00
Linus Torvalds 79a64063a8 small fix for an smb3 reconnect bug
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl3HowUACgkQiiy9cAdy
 T1GBFwwAsUHGIXIeV5gWNuy7XhTYOkH9OsSEOJIT+N/3d/L2QrIcfQ0x4n/RauC9
 S654xqdsPljti231e7LgeV0fZlayFg99YeNGBkorGk/UBxv5pYzVqqUyyMnUhunI
 /TCC4SPzhcKyLjrei/pZ2lTkZoZe34WfAq36E1p5d758g44ypLSlIcMUUznmhOKE
 B4ATOI5wV5zqHUh4pZSkLWMQ/D33XapQZDFJweol6YEaPCz2NjQfsL/MRJ/pg3nL
 Vh7v1jdHiPMbMhqV9pqsmkBSOci+WMQTqwz4yXiKJgKRsOjrcp+64Tpl+yMDfNfk
 1AsW7A2oHxbWkmSwDN14lNMPqGftREJww9pf40BaYGzTSKUvyI7VZNULT9kLCtSQ
 0xvVhrAMtHZSFeDqIf0ndcPS5WXWlHrTRDx25fgFHid3ZkDh5tLcNgODRZMih/xy
 KvdHlKm2y36jdvDxXMoNAna7iwgKEOtYYk/X9oq82icI+6S0p46HEcShx1rqgDXL
 KAWDdnsb
 =8yoq
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "Small fix for an smb3 reconnect bug (also marked for stable)"

* tag '5.4-rc7-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  SMB3: Fix persistent handles reconnect
2019-11-10 11:43:18 -08:00
Al Viro 762c69685f ecryptfs_lookup_interpose(): lower_dentry->d_parent is not stable either
We need to get the underlying dentry of parent; sure, absent the races
it is the parent of underlying dentry, but there's nothing to prevent
losing a timeslice to preemtion in the middle of evaluation of
lower_dentry->d_parent->d_inode, having another process move lower_dentry
around and have its (ex)parent not pinned anymore and freed on memory
pressure.  Then we regain CPU and try to fetch ->d_inode from memory
that is freed by that point.

dentry->d_parent *is* stable here - it's an argument of ->lookup() and
we are guaranteed that it won't be moved anywhere until we feed it
to d_add/d_splice_alias.  So we safely go that way to get to its
underlying dentry.

Cc: stable@vger.kernel.org # since 2009 or so
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:45 -05:00
Al Viro e72b9dd6a5 ecryptfs_lookup_interpose(): lower_dentry->d_inode is not stable
lower_dentry can't go from positive to negative (we have it pinned),
but it *can* go from negative to positive.  So fetching ->d_inode
into a local variable, doing a blocking allocation, checking that
now ->d_inode is non-NULL and feeding the value we'd fetched
earlier to a function that won't accept NULL is not a good idea.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro bcf0d9d4b7 ecryptfs: fix unlink and rmdir in face of underlying fs modifications
A problem similar to the one caught in commit 74dd7c97ea ("ecryptfs_rename():
verify that lower dentries are still OK after lock_rename()") exists for
unlink/rmdir as well.

Instead of playing with dget_parent() of underlying dentry of victim
and hoping it's the same as underlying dentry of our directory,
do the following:
        * find the underlying dentry of victim
        * find the underlying directory of victim's parent (stable
since the victim is ecryptfs dentry and inode of its parent is
held exclusive by the caller).
        * lock the inode of dentry underlying the victim's parent
        * check that underlying dentry of victim is still hashed and
has the right parent - it can be moved, but it can't be moved to/from
the directory we are holding exclusive.  So while ->d_parent itself
might not be stable, the result of comparison is.

If the check passes, everything is fine - underlying directory is locked,
underlying victim is still a child of that directory and we can go ahead
and feed them to vfs_unlink().  As in the current mainline we need to
pin the underlying dentry of victim, so that it wouldn't go negative under
us, but that's the only temporary reference that needs to be grabbed there.
Underlying dentry of parent won't go away (it's pinned by the parent,
which is held by caller), so there's no need to grab it.

The same problem (with the same solution) exists for rmdir.  Moreover,
rename gets simpler and more robust with the same "don't bother with
dget_parent()" approach.

Fixes: 74dd7c97ea "ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()"
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:57:44 -05:00
Al Viro a2ece08888 exportfs_decode_fh(): negative pinned may become positive without the parent locked
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-11-10 11:56:05 -05:00
Linus Torvalds 00aff68362 for-5.4-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl3GwvEACgkQxWXV+ddt
 WDvsfQ//UQ/TND1roMW3EmN+PLTRTUQS75Hb737r5Q66j0j94gFnBxB03R+tU8lP
 bH5XVpateb1wDmmdscqVQ1WM9O82bQdDNiYeLDr86+kzpLgy61rZHswZfNlDtl5M
 wDwyaxsrd7HndDeUZnIuaakYI9MXz9WIaNXkt0o8hSHctt0N18y23DSBFTVSh/4q
 T3cn4odkoBKtQ4mIn2OSmQvkl69nWRBVpjPJZIvsNszKjOo9aZTuGHrOWUV5RPiE
 Ho9UBkr+IjEDo8OH88vXAsHeYFIoYhEeUltjLHyF6Agwk1/Ajwp5sxXSubbfHUMQ
 l1YdmrTZf+l1Dxdj0sCdyK1npcgGI5IuZmIICpNUEAny9AbTtpSE3GNwtnIHAEAr
 cpki+1Z3lfaVSwNMYUz9Esbsb72C+f08WJHGHBMaOhjrBIwQUeWeYzTx6N7uDwNg
 GjaDRxjqFV2o7373isQaz7WOTatUMNtL1xvhkeceONI9NBXQRjW5rq5zECmr1ix7
 lSTKDzn7yAd6eGBW0T6iYdRl4Bta/zxPiDEF5KDNPugvv23yx17LAOdS4qAiHzbx
 oglra/kz9D4xmKVqH7hpFaaHrzL38G4mz6aMdwBu0M7dkn9AwsXiXWSDb0n7jnK0
 o96gkUBZjUSFBseUFD2s34MikFtrDynEJzPq96JHuWLguaByMcc=
 =CZxY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few regressions and fixes for stable.

  Regressions:

   - fix a race leading to metadata space leak after task received a
     signal

   - un-deprecate 2 ioctls, marked as deprecated by mistake

  Fixes:

   - fix limit check for number of devices during chunk allocation

   - fix a race due to double evaluation of i_size_read inside max()
     macro, can cause a crash

   - remove wrong device id check in tree-checker"

* tag 'for-5.4-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
  btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
  Btrfs: fix race leading to metadata space leak after task received signal
  btrfs: tree-checker: Fix wrong check on max devid
  btrfs: Consider system chunk array size for new SYSTEM chunks
2019-11-09 08:51:37 -08:00
Linus Torvalds 5cb8418cb5 for-linus-2019-11-08
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl3F+DIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpkVmD/9FIa092Q6obga0RqA16GlbI85tgtNyRFZU
 neIO/g9R3G/uBTGbiUeXHXHDz9CqUXIRYX7pmI2u0b07iGLRz8oUsOsgyVEfCMen
 VitqwkJJAZ9j9OifyKpLYCZX9ulVDWX5hEz/vm2cNWDkjCbOpXvRuQmkXEzp7RNM
 F7K25PpGLvJHfqC90q9FXXxNDlB2i1M/rh5I7eUqhb6rHmzfJGKCd+H80t+REoB1
 iXAygPj86agQKLOUKZtJjXUjq9Ol/0FD+OKY+eP3EfVv/FJvIeWWYe78WplxJpRD
 BYb9dhLMCSo619WVVy4hNYCPjSfPKVT2cO5QJmVRpgOI1urFuTNgNoIiw06AgvkZ
 09vrlrJZ5A7eFEppuAFQC4WRYKWCQCQfq8wxt2iGUivXgHfskjJ7qJz1Mh5Nlxsr
 JGm1hSVw9UCBzjqC75K2CR+vVt4T8ovEaizPFvzVj6lQ0lRTmSchisxbXzTdevOn
 kFvBOntBdpeSq/CwZ0x5PDP4AbRsgH3ny47LCJoEpFZlxlOLuEB/dAx564dn6ZgZ
 rRz6mKU7rlO7brrkW+DbRZ0XiOn5qzN6FrcSGX1DBzc4+hMus/5PPjv8Gk+Wo1PU
 388mu6N6DObSjw1ij1AqkwyzudmbrziKM4isJY4I72I0YGSq9cuG5VXgM2GqbGlU
 XXpzsu8pSw==
 =8B/z
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Two NVMe device removal crash fixes, and a compat fixup for for an
   ioctl that was introduced in this release (Anton, Charles, Max - via
   Keith)

 - Missing error path mutex unlock for drbd (Dan)

 - cgroup writeback fixup on dead memcg (Tejun)

 - blkcg online stats print fix (Tejun)

* tag 'for-linus-2019-11-08' of git://git.kernel.dk/linux-block:
  cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
  block: drbd: remove a stray unlock in __drbd_send_protocol()
  blkcg: make blkcg_print_stat() print stats only for online blkgs
  nvme: change nvme_passthru_cmd64 to explicitly mark rsvd
  nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths
  nvme-rdma: fix a segmentation fault during module unload
2019-11-08 18:15:55 -08:00
Tejun Heo 65de03e251 cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead
cgroup writeback tries to refresh the associated wb immediately if the
current wb is dead.  This is to avoid keeping issuing IOs on the stale
wb after memcg - blkcg association has changed (ie. when blkcg got
disabled / enabled higher up in the hierarchy).

Unfortunately, the logic gets triggered spuriously on inodes which are
associated with dead cgroups.  When the logic is triggered on dead
cgroups, the attempt fails only after doing quite a bit of work
allocating and initializing a new wb.

While c3aab9a0bd ("mm/filemap.c: don't initiate writeback if mapping
has no dirty pages") alleviated the issue significantly as it now only
triggers when the inode has dirty pages.  However, the condition can
still be triggered before the inode is switched to a different cgroup
and the logic simply doesn't make sense.

Skip the immediate switching if the associated memcg is dying.

This is a simplified version of the following two patches:

 * https://lore.kernel.org/linux-mm/20190513183053.GA73423@dennisz-mbp/
 * http://lkml.kernel.org/r/156355839560.2063.5265687291430814589.stgit@buzz

Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Fixes: e8a7abf5a5 ("writeback: disassociate inodes from dying bdi_writebacks")
Acked-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-08 13:37:24 -07:00
Linus Torvalds 0689acfad3 Some late-breaking dentry handling fixes from Al and Jeff, a patch to
further restrict copy_file_range() to avoid potential data corruption
 from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.  Everything but
 the fscache fix is marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl3FjtQTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyjfCACbNiRHpTWfuZ1iHXXCpo0tAixOT6i0
 9ZlE8yiF6U6iH7uv/MuIeJ+Qeep9+6wwgJYEJKcF0e+Raiob7womDO+yeDx6RYrC
 bqq6OLjJl8VbfeQWvJTmisGTPpvOsOKxxXIKG6vx1zJaJ69yENilFZ0biis14iTu
 RRDvtSntc96OhZ3n8IXie2+f9purYeIKhiCZqx9WzYZOk5sb3zGyhqD+Wh5kxkEh
 QzwwZ/G5RYvMMhL0o13uMZrjqozGTI9Qm09u7+EikIXFFtt+szEdgFd4pQPhsC1D
 VmpdrvV4ml8JnbkENiZ7tlHCcICQJfgMWNcLVgHvD6808CJ0UP5Cq/jP
 =34Iz
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Some late-breaking dentry handling fixes from Al and Jeff, a patch to
  further restrict copy_file_range() to avoid potential data corruption
  from Luis and a fix for !CONFIG_CEPH_FSCACHE kernels.

  Everything but the fscache fix is marked for stable"

* tag 'ceph-for-5.4-rc7' of git://github.com/ceph/ceph-client:
  ceph: return -EINVAL if given fsc mount option on kernel w/o support
  ceph: don't allow copy_file_range when stripe_count != 1
  ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
  ceph: add missing check in d_revalidate snapdir handling
  ceph: fix RCU case handling in ceph_d_revalidate()
  ceph: fix use-after-free in __ceph_remove_cap()
2019-11-08 12:31:27 -08:00
Jeff Layton ff29fde84d ceph: return -EINVAL if given fsc mount option on kernel w/o support
If someone requests fscache on the mount, and the kernel doesn't
support it, it should fail the mount.

[ Drop ceph prefix -- it's provided by pr_err. ]

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-07 18:03:23 +01:00
Pavel Shilovsky d243af7ab9 SMB3: Fix persistent handles reconnect
When the client hits a network reconnect, it re-opens every open
file with a create context to reconnect a persistent handle. All
create context types should be 8-bytes aligned but the padding
was missed for that one. As a result, some servers don't allow
us to reconnect handles and return an error. The problem occurs
when the problematic context is not at the end of the create
request packet. Fix this by adding a proper padding at the end
of the reconnect persistent handle context.

Cc: Stable <stable@vger.kernel.org> # 4.19.x
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-11-06 21:32:18 -06:00
Honggang Li e2f238f7d5 configfs: calculate the depth of parent item
When create symbolic link, create_link should calculate the depth
of the parent item. However, both the first and second parameters
of configfs_get_target_path had been set to the target. Broken
symbolic link created.

$ targetcli ls /
o- / ............................................................. [...]
  o- backstores .................................................. [...]
  | o- block ...................................... [Storage Objects: 0]
  | o- fileio ..................................... [Storage Objects: 2]
  | | o- vdev0 .......... [/dev/ramdisk1 (16.0MiB) write-thru activated]
  | | | o- alua ....................................... [ALUA Groups: 1]
  | | |   o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | | o- vdev1 .......... [/dev/ramdisk2 (16.0MiB) write-thru activated]
  | |   o- alua ....................................... [ALUA Groups: 1]
  | |     o- default_tg_pt_gp ........... [ALUA state: Active/optimized]
  | o- pscsi ...................................... [Storage Objects: 0]
  | o- ramdisk .................................... [Storage Objects: 0]
  o- iscsi ................................................ [Targets: 0]
  o- loopback ............................................. [Targets: 0]
  o- srpt ................................................. [Targets: 2]
  | o- ib.e89a8f91cb3200000000000000000000 ............... [no-gen-acls]
  | | o- acls ................................................ [ACLs: 2]
  | | | o- ib.e89a8f91cb3200000000000000000000 ........ [Mapped LUNs: 2]
  | | | | o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | | | o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | | o- ib.e89a8f91cb3300000000000000000000 ........ [Mapped LUNs: 2]
  | | |   o- mapped_lun0 ............................. [BROKEN LUN LINK]
  | | |   o- mapped_lun1 ............................. [BROKEN LUN LINK]
  | | o- luns ................................................ [LUNs: 2]
  | |   o- lun0 ...... [fileio/vdev0 (/dev/ramdisk1) (default_tg_pt_gp)]
  | |   o- lun1 ...... [fileio/vdev1 (/dev/ramdisk2) (default_tg_pt_gp)]
  | o- ib.e89a8f91cb3300000000000000000000 ............... [no-gen-acls]
  |   o- acls ................................................ [ACLs: 0]
  |   o- luns ................................................ [LUNs: 0]
  o- vhost ................................................ [Targets: 0]

Fixes: e9c03af21c ("configfs: calculate the symlink target only once")
Signed-off-by: Honggang Li <honli@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-11-06 18:36:01 +01:00
Shuning Zhang e74540b285 ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()
When the extent tree is modified, it should be protected by inode
cluster lock and ip_alloc_sem.

The extent tree is accessed and modified in the
ocfs2_prepare_inode_for_write, but isn't protected by ip_alloc_sem.

The following is a case.  The function ocfs2_fiemap is accessing the
extent tree, which is modified at the same time.

  kernel BUG at fs/ocfs2/extent_map.c:475!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: tun ocfs2 ocfs2_nodemanager configfs ocfs2_stackglue [...]
  CPU: 16 PID: 14047 Comm: o2info Not tainted 4.1.12-124.23.1.el6uek.x86_64 #2
  Hardware name: Oracle Corporation ORACLE SERVER X7-2L/ASM, MB MECH, X7-2L, BIOS 42040600 10/19/2018
  task: ffff88019487e200 ti: ffff88003daa4000 task.ti: ffff88003daa4000
  RIP: ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  Call Trace:
    ocfs2_fiemap+0x1e3/0x430 [ocfs2]
    do_vfs_ioctl+0x155/0x510
    SyS_ioctl+0x81/0xa0
    system_call_fastpath+0x18/0xd8
  Code: 18 48 c7 c6 60 7f 65 a0 31 c0 bb e2 ff ff ff 48 8b 4a 40 48 8b 7a 28 48 c7 c2 78 2d 66 a0 e8 38 4f 05 00 e9 28 fe ff ff 0f 1f 00 <0f> 0b 66 0f 1f 44 00 00 bb 86 ff ff ff e9 13 fe ff ff 66 0f 1f
  RIP  ocfs2_get_clusters_nocache.isra.11+0x390/0x550 [ocfs2]
  ---[ end trace c8aa0c8180e869dc ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

This issue can be reproduced every week in a production environment.

This issue is related to the usage mode.  If others use ocfs2 in this
mode, the kernel will panic frequently.

[akpm@linux-foundation.org: coding style fixes]
[Fix new warning due to unused function by removing said function - Linus ]
Link: http://lkml.kernel.org/r/1568772175-2906-2-git-send-email-sunny.s.zhang@oracle.com
Signed-off-by: Shuning Zhang <sunny.s.zhang@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Gang He <ghe@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-11-06 08:47:08 -08:00
Luis Henriques a3a0819388 ceph: don't allow copy_file_range when stripe_count != 1
copy_file_range tries to use the OSD 'copy-from' operation, which simply
performs a full object copy.  Unfortunately, the implementation of this
system call assumes that stripe_count is always set to 1 and doesn't take
into account that the data may be striped across an object set.  If the
file layout has stripe_count different from 1, then the destination file
data will be corrupted.

For example:

Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and
stripe_size of 2 MiB; the first half of the file will be filled with 'A's
and the second half will be filled with 'B's:

               0      4M     8M       Obj1     Obj2
               +------+------+       +----+   +----+
        file:  | AAAA | BBBB |       | AA |   | AA |
               +------+------+       |----|   |----|
                                     | BB |   | BB |
                                     +----+   +----+

If we copy_file_range this file into a new file (which needs to have the
same file layout!), then it will start by copying the object starting at
file offset 0 (Obj1).  And then it will copy the object starting at file
offset 4M -- which is Obj1 again.

Unfortunately, the solution for this is to not allow remote object copies
to be performed when the file layout stripe_count is not 1 and simply
fallback to the default (VFS) copy_file_range implementation.

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05 15:42:58 +01:00
Jeff Layton 5bb5e6ee6f ceph: don't try to handle hashed dentries in non-O_CREAT atomic_open
If ceph_atomic_open is handed a !d_in_lookup dentry, then that means
that it already passed d_revalidate so we *know* that it's negative (or
at least was very recently). Just return -ENOENT in that case.

This also addresses a subtle bug in dentry handling. Non-O_CREAT opens
call atomic_open with the parent's i_rwsem shared, but calling
d_splice_alias on a hashed dentry requires the exclusive lock.

If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT
open, and another client were to race in and create the file before we
issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on
the dentry with the new inode with insufficient locks.

Cc: stable@vger.kernel.org
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-11-05 15:42:44 +01:00
David Sterba a5009d3a31 btrfs: un-deprecate ioctls START_SYNC and WAIT_SYNC
The two ioctls START_SYNC and WAIT_SYNC were mistakenly marked as
deprecated and scheduled for removal but we actualy do use them for
'btrfs subvolume delete -C/-c'. The deprecated thing in ebc87351e5
should have been just the async flag for subvolume creation.

The deprecation has been added in this development cycle, remove it
until it's time.

Fixes: ebc87351e5 ("btrfs: Deprecate BTRFS_SUBVOL_CREATE_ASYNC flag")
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-04 21:42:01 +01:00
Josef Bacik d98da49977 btrfs: save i_size to avoid double evaluation of i_size_read in compress_file_range
We hit a regression while rolling out 5.2 internally where we were
hitting the following panic

  kernel BUG at mm/page-writeback.c:2659!
  RIP: 0010:clear_page_dirty_for_io+0xe6/0x1f0
  Call Trace:
   __process_pages_contig+0x25a/0x350
   ? extent_clear_unlock_delalloc+0x43/0x70
   submit_compressed_extents+0x359/0x4d0
   normal_work_helper+0x15a/0x330
   process_one_work+0x1f5/0x3f0
   worker_thread+0x2d/0x3d0
   ? rescuer_thread+0x340/0x340
   kthread+0x111/0x130
   ? kthread_create_on_node+0x60/0x60
   ret_from_fork+0x1f/0x30

This is happening because the page is not locked when doing
clear_page_dirty_for_io.  Looking at the core dump it was because our
async_extent had a ram_size of 24576 but our async_chunk range only
spanned 20480, so we had a whole extra page in our ram_size for our
async_extent.

This happened because we try not to compress pages outside of our
i_size, however a cleanup patch changed us to do

actual_end = min_t(u64, i_size_read(inode), end + 1);

which is problematic because i_size_read() can evaluate to different
values in between checking and assigning.  So either an expanding
truncate or a fallocate could increase our i_size while we're doing
writeout and actual_end would end up being past the range we have
locked.

I confirmed this was what was happening by installing a debug kernel
that had

  actual_end = min_t(u64, i_size_read(inode), end + 1);
  if (actual_end > end + 1) {
	  printk(KERN_ERR "KABOOM\n");
	  actual_end = end + 1;
  }

and installing it onto 500 boxes of the tier that had been seeing the
problem regularly.  Last night I got my debug message and no panic,
confirming what I expected.

[ dsterba: the assembly confirms a tiny race window:

    mov    0x20(%rsp),%rax
    cmp    %rax,0x48(%r15)           # read
    movl   $0x0,0x18(%rsp)
    mov    %rax,%r12
    mov    %r14,%rax
    cmovbe 0x48(%r15),%r12           # eval

  Where r15 is inode and 0x48 is offset of i_size.

  The original fix was to revert 62b3762271 that would do an
  intermediate assignment and this would also avoid the doulble
  evaluation but is not future-proof, should the compiler merge the
  stores and call i_size_read anyway.

  There's a patch adding READ_ONCE to i_size_read but that's not being
  applied at the moment and we need to fix the bug. Instead, emulate
  READ_ONCE by two barrier()s that's what effectively happens. The
  assembly confirms single evaluation:

    mov    0x48(%rbp),%rax          # read once
    mov    0x20(%rsp),%rcx
    mov    $0x20,%edx
    cmp    %rax,%rcx
    cmovbe %rcx,%rax
    mov    %rax,(%rsp)
    mov    %rax,%rcx
    mov    %r14,%rax

  Where 0x48(%rbp) is inode->i_size stored to %eax.
]

Fixes: 62b3762271 ("btrfs: Remove isize local variable in compress_file_range")
CC: stable@vger.kernel.org # v5.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ changelog updated ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-04 21:41:49 +01:00
Linus Torvalds 56cfd2507d a small smb3 memleak fix
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCAAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl290UsACgkQiiy9cAdy
 T1GVTAv+Mga+91Nw8Nte0Ix3ynuitDsqjtj6jIJs2FHoOI8cO1RhplU16elxS1OQ
 y3AekBU/go2aWWraTPtGiZZReIPm0gyku11lK8zox3zEE9buFFR0dHvZgxll2gG8
 IHJNgn76avvs+gI4XLeITzpwcv8Xt+z9VN1A0vujDSfSg3TeMEIyr6ofnFSgo9jx
 2SRmCAMcgBameUlZWkc4fdz66GLguXhnYAZ7paX1mMLPuEsEmvHquU691+sKqDej
 Q2GarzDR3JVusNIiuJtlwJlUprKAQuGuF0h6B9raZ0saoyR3MFr2bUkxNqDMPj4T
 9BTeRItnPWcxh+q7bfvJi9LiHTP2tevoXZhqafd17hYRj3noXyw0FRLsKmYDccW2
 Q4+PjOiv/Qyxg8g6l/Bw87VowYrzvVPxfcFMt8fC+tijX9XhdbzF/kSwD83jy/Vm
 u14Eps2UEdaO7qiNZDRNSk1DyFePwCUq55ZMx27MbYfqu8RHXV5NvSJw/P7WQEF7
 rAB7Cvy6
 =Oh9N
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fix from Steve French:
 "A small smb3 memleak fix"

* tag '5.4-rc6-smb3-fix' of git://git.samba.org/sfrench/cifs-2.6:
  fix memory leak in large read decrypt offload
2019-11-02 14:34:00 -07:00
Linus Torvalds 372bf6c1c8 NFS Client Bugfixes for Linux 5.4-rc6
Stable bugfixes:
 - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
 
 Other fixes:
 - The TCP back channel mustn't disappear while requests are outstanding
 - The RDMA back channel mustn't disappear while requests are outstanding
 - Destroy the back channel when we destroy the host transport
 - Don't allow a cached open with a revoked delegation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl28mZ4ACgkQ18tUv7Cl
 QOvAAxAA5h15NJ8OTOpr/kyWQAHEYWIEWQyTNYT8Gi/TI/YCfauET6kxAKg7ETk+
 Bx6+A7Q5OlKSqFZXf2UMMIHPnvH7Ar06uIof6DLjWxA9N/q57SNW0YszQCUhvWjp
 4Ry7d9A2yNrCizEk/vTmY5zfFIo2S88AvwkQ6gLCEyfIn5REWQyQnS4SS3vL7MuS
 E/iWBsXFV2C9/yON+zzZnC0ewMIPbtWA3/CsVI2zDKLsHKvBYOx6n7TbfTtMtc/p
 mCdgLzdEEZlN0KOCzzMp7ziME8y1qUGX6eo17Jrr0ZBZIM7o5d7sYMA88Apu6dMg
 MQM5iEEAf//kmHLWzQ6yg8XZEZrXFMOiGKYu0kWt4pVlqi78gemEfE6T8dClsB5m
 P7jqn/4ntDlJPWPciImsVEzDOHlzlZBIXwZs1JjufeO/hIB570cs80PjeDQP0id8
 Cx3yR7Hr1ldKzwSwpWrtIaqEPljJJw0jhxQ7YYlvfbA1Ogd4ek4QU4RuvyW11CRr
 d/5Oaa7xvjGs63klu8HpZG4NHFby3PPmZT6t4xUcH75MR14S15YCGtaUvPJ7ORck
 C6tZJAkVtJqSoAjsGuBtTc5qiOdo+hMRghutW4Pd6UpnPYJKwQUtTqp3MbfM4+bA
 S19tGG6qn+7cl7bW4NJpNy8HfbC7BdS+m7maHpQfGIdMsxKVFoI=
 =ttcU
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "This contains two delegation fixes (with the RCU lock leak fix marked
  for stable), and three patches to fix destroying the the sunrpc back
  channel.

  Stable bugfixes:

   - Fix an RCU lock leak in nfs4_refresh_delegation_stateid()

  Other fixes:

   - The TCP back channel mustn't disappear while requests are
     outstanding

   - The RDMA back channel mustn't disappear while requests are
     outstanding

   - Destroy the back channel when we destroy the host transport

   - Don't allow a cached open with a revoked delegation"

* tag 'nfs-for-5.4-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
  NFSv4: Don't allow a cached open with a revoked delegation
  SUNRPC: Destroy the back channel when we destroy the host transport
  SUNRPC: The RDMA back channel mustn't disappear while requests are outstanding
  SUNRPC: The TCP back channel mustn't disappear while requests are outstanding
2019-11-01 17:37:44 -07:00
Linus Torvalds 0821de2896 for-linus-20191101
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl28lRQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjtbEADbrMXdsdPV2CApxSiZIaWO1mR78yy/btxp
 cHsJ+avaPGxhNukSsose2KWm656SriH/OfQspqtvzDpslbu40V41+vSqSknqGRPr
 8jW5efZIAy6dq0FjbtBnmIV6PhC5d4F/nAEQbsnVRn8RSr3OwQcm8/smpSFA8urI
 oHVU8jiyLsQiSbDvjf2KPhPYhWBHO0W5SyGo29HY8pSzQpsMzGkQ6TcHL4EzwPZP
 WPtGglr14v8rMyhNMxUdHZ9eHCMq7uufFPuyJXzesE/qyM+H8p2pwwxyfflHGZil
 w2vxLJRu8d4UIkHEkNbC0bydXJ+eCtRMBZON1ZGdrZwQ58L9AbBPBZmxKb0LkmHb
 4tc/yQm/0kSUUXwFtDoUoIBFjjy36Pl5BsLt4n5fofsl04myhm5CLqZ8oWxyU0vO
 sCinJwk1+eQO/tbQVDfven+MroNlYVPCnXhIe/12/wEba3EJ7Ab4X5p0lJoJ1oY7
 9dQyY6+BaHd4wV9p0domOP5y7dJnXM9k46EF0/5YoNjoqaH5MWPMq355VH2xNjdw
 5HzRcZfvOAlXASrnXuQAAQAdR2b+s/iFZaNKA7bTZxjNPvYE0zySCMeQeNXmfVKe
 CrDuwViWukwIzETDZHYqMWJxOV4nyOeL3jTo7rQp5A5TEWwBiJKQ4aGBif2eqc+L
 Mk41ziQGuQ==
 =+rar
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-20191101' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Two small nvme fixes, one is a fabrics connection fix, the other one
   a cleanup made possible by that fix (Anton, via Keith)

 - Fix requeue handling in umb ubd (Anton)

 - Fix spin_lock_irq() nesting in blk-iocost (Dan)

 - Three small io_uring fixes:
     - Install io_uring fd after done with ctx (me)
     - Clear ->result before every poll issue (me)
     - Fix leak of shadow request on error (Pavel)

* tag 'for-linus-20191101' of git://git.kernel.dk/linux-block:
  iocost: don't nest spin_lock_irq in ioc_weight_write()
  io_uring: ensure we clear io_kiocb->result before each issue
  um-ubd: Entrust re-queue to the upper layers
  nvme-multipath: remove unused groups_only mode in ana log
  nvme-multipath: fix possible io hang after ctrl reconnect
  io_uring: don't touch ctx in setup after ring fd install
  io_uring: Fix leaked shadow_req
2019-11-01 17:33:12 -07:00
Trond Myklebust 79cc55422c NFS: Fix an RCU lock leak in nfs4_refresh_delegation_stateid()
A typo in nfs4_refresh_delegation_stateid() means we're leaking an
RCU lock, and always returning a value of 'false'. As the function
description states, we were always supposed to return 'true' if a
matching delegation was found.

Fixes: 12f275cdd1 ("NFSv4: Retry CLOSE and DELEGRETURN on NFS4ERR_OLD_STATEID.")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-11-01 11:03:56 -04:00
Trond Myklebust be3df3dd4c NFSv4: Don't allow a cached open with a revoked delegation
If the delegation is marked as being revoked, we must not use it
for cached opens.

Fixes: 869f9dfa4d ("NFSv4: Fix races between nfs_remove_bad_delegation() and delegation return")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-11-01 10:59:26 -04:00
Jens Axboe 6873e0bd6a io_uring: ensure we clear io_kiocb->result before each issue
We use io_kiocb->result == -EAGAIN as a way to know if we need to
re-submit a polled request, as -EAGAIN reporting happens out-of-line
for IO submission failures. This field is cleared when we originally
allocate the request, but it isn't reset when we retry the submission
from async context. This can cause issues where we think something
needs a re-issue, but we're really just reading stale data.

Reset ->result whenever we re-prep a request for polled submission.

Cc: stable@vger.kernel.org
Fixes: 9e645e1105 ("io_uring: add support for sqe links")
Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-30 14:45:22 -06:00
Linus Torvalds b66b449872 Fix remounting (broken in -rc1).
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJduXHFAAoJENW/n+sDE2U6FKkP/j9xxzkjHAQwCaim5DIBNJfo
 1GDpDxrN6Voa4hC82p/8fUwFGedo72Ex/TyyVCOnJ9/iNRSugpIdEAWM+NCNs/Sm
 UHbNkKPxCB0rRiO20tvHZsbhytbM2zLEKuDuNwaI5BN+6RvW6AaAeMMSAv245tMh
 0iu5U8l5B/oPRx+EcirSia8Ep+MiMp8+6MrLoQLVYnrs9wsfTrD8LltGPLqLhbPV
 JPkJaeIdjdM5Ii34z5nE8yyAEYr7pjfL7CNim1NY/cM7gRsX4jTdd2xGMecIaUdV
 TIBWn+NzqIguqLdPHf7xDinwNxF6/F2ppQpwW3zSS0VpQpJI1oApi0Xw8ESiylZ5
 uMkR/+x9XpeSjvyvKonk0D0WlnfLfDy/frSxhw8emsJ1o8TSqL2TgHRqVIfU8+ZD
 fR9zGw8tXjmWHZkC1gkyPx6xWiMAVBrTsgHlRIRWukXtKCf/qlCgoq0HwSfFINkt
 z3AYfpHcAZ/zVpYTfy7x3X4jeM71aToTEQuoJPNk3AQ+b5uKH9fD48XYyx1EQlZl
 jwvldyLcPfU2YVe2KTfsglEGY3XqN4cn3B6Ysm7kX81uh7dJ4oeKX1w0yC47/umS
 1cNbx/4h7R+6fTZGhduOH1znDfw92iel8vz6npOe508mq+isj49kGRPJ9wurAMce
 lcSV5NZgDRpikbIq6fZw
 =hVf6
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.4-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:
 "Fix remounting (broken in -rc1)."

* tag 'gfs2-v5.4-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix initialisation of args for remount
2019-10-30 14:05:40 +01:00
Andrew Price d5798141fd gfs2: Fix initialisation of args for remount
When gfs2 was converted to use fs_context, the initialisation of the
mount args structure to the currently active args was lost with the
removal of gfs2_remount_fs(), so the checks of the new args on remount
became checks against the default values instead of the current ones.
This caused unexpected remount behaviour and test failures (xfstests
generic/294, generic/306 and generic/452).

Reinstate the args initialisation, this time in gfs2_init_fs_context()
and conditional upon fc->purpose, as that's the only time we get control
before the mount args are parsed in the remount process.

Fixes: 1f52aa08d1 ("gfs2: Convert gfs2 to fs_context")
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2019-10-30 12:16:53 +01:00
Al Viro 1f08529c84 ceph: add missing check in d_revalidate snapdir handling
We should not play with dcache without parent locked...

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:55 +01:00
Al Viro aa8dd81673 ceph: fix RCU case handling in ceph_d_revalidate()
For RCU case ->d_revalidate() is called with rcu_read_lock() and
without pinning the dentry passed to it.  Which means that it
can't rely upon ->d_inode remaining stable; that's the reason
for d_inode_rcu(), actually.

Make sure we don't reload ->d_inode there.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:54 +01:00
Luis Henriques ea60ed6fcf ceph: fix use-after-free in __ceph_remove_cap()
KASAN reports a use-after-free when running xfstest generic/531, with the
following trace:

[  293.903362]  kasan_report+0xe/0x20
[  293.903365]  rb_erase+0x1f/0x790
[  293.903370]  __ceph_remove_cap+0x201/0x370
[  293.903375]  __ceph_remove_caps+0x4b/0x70
[  293.903380]  ceph_evict_inode+0x4e/0x360
[  293.903386]  evict+0x169/0x290
[  293.903390]  __dentry_kill+0x16f/0x250
[  293.903394]  dput+0x1c6/0x440
[  293.903398]  __fput+0x184/0x330
[  293.903404]  task_work_run+0xb9/0xe0
[  293.903410]  exit_to_usermode_loop+0xd3/0xe0
[  293.903413]  do_syscall_64+0x1a0/0x1c0
[  293.903417]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

This happens because __ceph_remove_cap() may queue a cap release
(__ceph_queue_cap_release) which can be scheduled before that cap is
removed from the inode list with

	rb_erase(&cap->ci_node, &ci->i_caps);

And, when this finally happens, the use-after-free will occur.

This can be fixed by removing the cap from the inode list before being
removed from the session list, and thus eliminating the risk of an UAF.

Cc: stable@vger.kernel.org
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-10-29 22:29:51 +01:00
Linus Torvalds 23fdb198ae fuse fixes for 5.4-rc6
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXbgx9wAKCRDh3BK/laaZ
 PIjMAQCTrWPf6pLHSoq+Pll7b8swu1m2mW97fj5k8coED1/DSQEA4222PkIdMmhu
 qyHnoX0fTtYXg6NnHbDVWsNL4uG8YAM=
 =RE9K
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:
 "Mostly virtiofs fixes, but also fixes a regression and couple of
  longstanding data/metadata writeback ordering issues"

* tag 'fuse-fixes-5.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: redundant get_fuse_inode() calls in fuse_writepages_fill()
  fuse: Add changelog entries for protocols 7.1 - 7.8
  fuse: truncate pending writes on O_TRUNC
  fuse: flush dirty data/metadata before non-truncate setattr
  virtiofs: Remove set but not used variable 'fc'
  virtiofs: Retry request submission from worker context
  virtiofs: Count pending forgets as in_flight forgets
  virtiofs: Set FR_SENT flag only after request has been sent
  virtiofs: No need to check fpq->connected state
  virtiofs: Do not end request in submission context
  fuse: don't advise readdirplus for negative lookup
  fuse: don't dereference req->args on finished request
  virtio-fs: don't show mount options
  virtio-fs: Change module name to virtiofs.ko
2019-10-29 17:43:33 +01:00
Jens Axboe 044c1ab399 io_uring: don't touch ctx in setup after ring fd install
syzkaller reported an issue where it looks like a malicious app can
trigger a use-after-free of reading the ctx ->sq_array and ->rings
value right after having installed the ring fd in the process file
table.

Defer ring fd installation until after we're done reading those
values.

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Reported-by: syzbot+6f03d895a6cd0d06187f@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-28 09:15:33 -06:00
Pavel Begunkov 7b20238d28 io_uring: Fix leaked shadow_req
io_queue_link_head() owns shadow_req after taking it as an argument.
By not freeing it in case of an error, it can leak the request along
with taken ctx->refs.

Reviewed-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-27 21:29:18 -06:00
Steve French a08d897bc0 fix memory leak in large read decrypt offload
Spotted by Ronnie.

Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2019-10-27 14:36:11 -05:00
Linus Torvalds c9a2e4a829 Seven cifs/smb3 fixes, including three for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGyBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl208asACgkQiiy9cAdy
 T1EVJAv41lwLL01FQtWsoMB8yn46q7iboc0eqVJ4CEnZj25Fe3VgokKm1+2N6/15
 g8tQeLnE8kzjQyAxEgI260PHRF6Dg0RSL7zwxnJOFpspWLb9+U4UEryCDao/v4Ao
 LYr88F6XeYDGojyuGKokT/jlsxWfdPwaYhpOPabjO76f6Jv7cNfIa9EKzbEAtq3e
 a9to2AVnJ02FLGTP3QM1ThBtCu3JFFMI07u2Rp8ANiJYtM5Mg2piOkO1PQ3hGLIM
 /3+1E7B12KgG7D3GA7ZuP1HLjyo+A91BhZVTzlBDq1VHTd8ccJDJklg28oXoMrMh
 dgW5XRcS/s60LsRzagpNMC0cPlAdBAd9N9uZaoXmQhNAg0I7CJ24H3tglwKejoUR
 m/duI1C89Fbo1kgjwfAztxhfvair2SYLu8DGijvYH/C5syNUR7V99uy1th+nmOgl
 vpBM9CbkM6mtSxFHtgcFuv4rwIf61EQaGoXiL/e3RLH4zc/HaUpBo2rDh0LPmwFB
 koXIw6E=
 =fTOk
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Seven cifs/smb3 fixes, including three for stable"

* tag '5.4-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix cifsInodeInfo lock_sem deadlock when reconnect occurs
  CIFS: Fix use after free of file info structures
  CIFS: Fix retry mid list corruption on reconnects
  cifs: Fix missed free operations
  CIFS: avoid using MID 0xFFFF
  cifs: clarify comment about timestamp granularity for old servers
  cifs: Handle -EINPROGRESS only when noblockcnt is set
2019-10-27 06:41:52 -04:00
Linus Torvalds acf913b7fb for-linus-2019-10-26
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl20ST0QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprl3D/wJo392Nd3fDa1+t9XpV3M5T/HKcC7yau/G
 98r41DF8TzjlwWFpsbolbBibeUkwpkZr32rAyEN7UScw6d5I+qlGF/ATzZijWGvd
 a714VhMq3fs8siGLDtNxdclYlSXzF5wnhbniPdF1QJsEHK7wAbqXV6/xf5FubHke
 SjYX9HOt0PGfkdDWmdMLRlaaedUZG5RPWAZV4dtg8LWMcL9yu9FZtxbyYyhFXf94
 Wd3MxvDp4/aeudO+CLkI73H0LOl5f0iuIOAOTEDE39X0zQA3ezwYiBh8wRtrBh1Y
 DJus7LI0G56DdUOengyptFkR2qvjmQpJr6mFKzVlp9bRJhwuvzcJ1MWj23oDDFOx
 xW1m74rQcP2COZ2BIvzufKBrBJn8Gq5BXDjQAHgQMNn0FQbEnBnCQ+vA3V4NkzMs
 EuUgk2P/LAwBpldkhFUg78UZUKCxiV72cTihLZJHN2FW5cf652Hf9jaA5Vnrzye5
 gJvXmjWojDmEu77lP51UdKSk4Jhhf+Mr/FM3tNoCyteXpceSIrDE8gOMwhkj3oM4
 eXVgHKSyxrD1npcE2/C7wwOMP9V0FBCFb5ulDbuvRyhZRQeSXtugbzU5Pn9VqgiK
 e7uhXA7AqrKS0bdHdXXlxs0of1gvq42zYcD8A9KFJhUxU07yUKXmIwXO4hApAMsk
 VLxsoPdIMA==
 =PhEL
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block

Pull block and io_uring fixes from Jens Axboe:
 "A bit bigger than usual at this point in time, mostly due to some good
  bug hunting work by Pavel that resulted in three io_uring fixes from
  him and two from me. Anyway, this pull request contains:

   - Revert of the submit-and-wait optimization for io_uring, it can't
     always be done safely. It depends on commands always making
     progress on their own, which isn't necessarily the case outside of
     strict file IO. (me)

   - Series of two patches from me and three from Pavel, fixing issues
     with shared data and sequencing for io_uring.

   - Lastly, two timeout sequence fixes for io_uring (zhangyi)

   - Two nbd patches fixing races (Josef)

   - libahci regulator_get_optional() fix (Mark)"

* tag 'for-linus-2019-10-26' of git://git.kernel.dk/linux-block:
  nbd: verify socket is supported during setup
  ata: libahci_platform: Fix regulator_get_optional() misuse
  nbd: handle racing with error'ed out commands
  nbd: protect cmd->status with cmd->lock
  io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
  io_uring: used cached copies of sq->dropped and cq->overflow
  io_uring: Fix race for sqes with userspace
  io_uring: Fix broken links with offloading
  io_uring: Fix corrupted user_data
  io_uring: correct timeout req sequence when inserting a new entry
  io_uring : correct timeout req sequence when waiting timeout
  io_uring: revert "io_uring: optimize submit_and_wait API"
2019-10-26 14:59:51 -04:00
Linus Torvalds 485fc4b69c dax fix 5.4-rc5
- Fix a performance regression that followed from a fix to the
   conversion of the fsdax implementation to the xarray. v5.3 users
   report that they stop seeing huge page mappings on an application +
   filesystem layout that was seeing huge pages previously on v5.2.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJds5qfAAoJEB7SkWpmfYgCD6gP/1PuAbdRnTt/pEUB5GLtOR21
 hszHP5vFiWaiDN4yUw7IknYmOWlPVPoHg8SdSwKLM75EGj9JCfQGUcUTZIREB15h
 6MnAavVr3u697ZGfnqg8y2Nhg6hNKbqzRnzCEACAXe+O2GnH2TuJRauyS6Zr33St
 dXiJ+XQbTGoUbR0+CEDIdv0DnkUFaaE/m1+O5DbApcP56Gt/4E1PvlIPJjMog72/
 WKVk4/Cv9+gc7JZMjR+GjjXJafiNaXT1btPGzF6zeHe9XrAnkxvfjm9BSh+qeCxj
 19jOKKFs+bY5O21jczEIvgE4cpPQEytYhZijPwMXFt2o3betOH4ACMDxVMuNJKag
 Q3yd8QJRqtP5xALAxeby7HoD6gdVLLcH0qjTdMt+LHvSde5OH3l8qOfWjK5CoIdx
 uiUC+ugg8FDZLTsliI5qPaDCSYhswoyLlvIF3AIqllB7Y1lUbihw5h+gXr9aLGmu
 1SUA6ipPw1gUW5ikYlKTu5Lzxa2VN8vHBv03W1mjuj2nOggRWp8cK8CmwKpnUuOp
 9xD1WB0dl23MRNRZOKQvkrg8kOL2vuowwJYQ6Kl3Tc2vWvqYfNRQW6dSwgx3gF38
 rQ0ZQddet4EMATrxbwMalU5hVDV2w5LAUz0+bNyovb7S0oZ20wLHm5o5iImppV1U
 +K2aRqwaPi/ytcwRF+iF
 =LioO
 -----END PGP SIGNATURE-----

Merge tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax fix from Dan Williams:
 "Fix a performance regression that followed from a fix to the
  conversion of the fsdax implementation to the xarray. v5.3 users
  report that they stop seeing huge page mappings on an application +
  filesystem layout that was seeing huge pages previously on v5.2"

* tag 'dax-fix-5.4-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  fs/dax: Fix pmd vs pte conflict detection
2019-10-26 06:26:04 -04:00
Filipe Manana 0cab7acc4a Btrfs: fix race leading to metadata space leak after task received signal
When a task that is allocating metadata needs to wait for the async
reclaim job to process its ticket and gets a signal (because it was killed
for example) before doing the wait, the task ends up erroring out but
with space reserved for its ticket, which never gets released, resulting
in a metadata space leak (more specifically a leak in the bytes_may_use
counter of the metadata space_info object).

Here's the sequence of steps leading to the space leak:

1) A task tries to create a file for example, so it ends up trying to
   start a transaction at btrfs_create();

2) The filesystem is currently in a state where there is not enough
   metadata free space to satisfy the transaction's needs. So at
   space-info.c:__reserve_metadata_bytes() we create a ticket and
   add it to the list of tickets of the space info object. Also,
   because the metadata async reclaim job is not running, we queue
   a job ro run metadata reclaim;

3) In the meanwhile the task receives a signal (like SIGTERM from
   a kill command for example);

4) After queing the async reclaim job, at __reserve_metadata_bytes(),
   we unlock the metadata space info and call handle_reserve_ticket();

5) That last function calls wait_reserve_ticket(), which acquires the
   lock from the metadata space info. Then in the first iteration of
   its while loop, it calls prepare_to_wait_event(), which returns
   -ERESTARTSYS because the task has a pending signal. As a result,
   we set the error field of the ticket to -EINTR and exit the while
   loop without deleting the ticket from the list of tickets (in the
   space info object). After exiting the loop we unlock the space info;

6) The async reclaim job is able to release enough metadata, acquires
   the metadata space info's lock and then reserves space for the ticket,
   since the ticket is still in the list of (non-priority) tickets. The
   space reservation happens at btrfs_try_granting_tickets(), called from
   maybe_fail_all_tickets(). This increments the bytes_may_use counter
   from the metadata space info object, sets the ticket's bytes field to
   zero (meaning success, that space was reserved) and removes it from
   the list of tickets;

7) wait_reserve_ticket() returns, with the error field of the ticket
   set to -EINTR. Then handle_reserve_ticket() just propagates that error
   to the caller. Because an error was returned, the caller does not
   release the reserved space, since the expectation is that any error
   means no space was reserved.

Fix this by removing the ticket from the list, while holding the space
info lock, at wait_reserve_ticket() when prepare_to_wait_event() returns
an error.

Also add some comments and an assertion to guarantee we never end up with
a ticket that has an error set and a bytes counter field set to zero, to
more easily detect regressions in the future.

This issue could be triggered sporadically by some test cases from fstests
such as generic/269 for example, which tries to fill a filesystem and then
kills fsstress processes running in the background.

When this issue happens, we get a warning in syslog/dmesg when unmounting
the filesystem, like the following:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 13240 at fs/btrfs/block-group.c:3186 btrfs_free_block_groups+0x314/0x470 [btrfs]
  (...)
  CPU: 0 PID: 13240 Comm: umount Tainted: G        W    L    5.3.0-rc8-btrfs-next-48+ #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  RIP: 0010:btrfs_free_block_groups+0x314/0x470 [btrfs]
  (...)
  RSP: 0018:ffff9910c14cfdb8 EFLAGS: 00010286
  RAX: 0000000000000024 RBX: ffff89cd8a4d55f0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff89cdf6a178a8 RDI: ffff89cdf6a178a8
  RBP: ffff9910c14cfde8 R08: 0000000000000000 R09: 0000000000000001
  R10: ffff89cd4d618040 R11: 0000000000000000 R12: ffff89cd8a4d5508
  R13: ffff89cde7c4a600 R14: dead000000000122 R15: dead000000000100
  FS:  00007f42754432c0(0000) GS:ffff89cdf6a00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fd25a47f730 CR3: 000000021f8d6006 CR4: 00000000003606f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   close_ctree+0x1ad/0x390 [btrfs]
   generic_shutdown_super+0x6c/0x110
   kill_anon_super+0xe/0x30
   btrfs_kill_super+0x12/0xa0 [btrfs]
   deactivate_locked_super+0x3a/0x70
   cleanup_mnt+0xb4/0x160
   task_work_run+0x7e/0xc0
   exit_to_usermode_loop+0xfa/0x100
   do_syscall_64+0x1cb/0x220
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f4274d2cb37
  (...)
  RSP: 002b:00007ffcff701d38 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 0000557ebde2f060 RCX: 00007f4274d2cb37
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000557ebde2f240
  RBP: 0000557ebde2f240 R08: 0000557ebde2f270 R09: 0000000000000015
  R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f427522ee64
  R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffcff701fc0
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
  softirqs last  enabled at (0): [<ffffffffb12b561e>] copy_process+0x75e/0x1fd0
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace bcf4b235461b26f6 ]---
  BTRFS info (device sdb): space_info 4 has 19116032 free, is full
  BTRFS info (device sdb): space_info total=33554432, used=14176256, pinned=0, reserved=0, may_use=196608, readonly=65536
  BTRFS info (device sdb): global_block_rsv: size 0 reserved 0
  BTRFS info (device sdb): trans_block_rsv: size 0 reserved 0
  BTRFS info (device sdb): chunk_block_rsv: size 0 reserved 0
  BTRFS info (device sdb): delayed_block_rsv: size 0 reserved 0
  BTRFS info (device sdb): delayed_refs_rsv: size 0 reserved 0

Fixes: 374bf9c5cd ("btrfs: unify error handling for ticket flushing")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 19:11:34 +02:00
Qu Wenruo 8bb177d18f btrfs: tree-checker: Fix wrong check on max devid
[BUG]
The following script will cause false alert on devid check.
  #!/bin/bash

  dev1=/dev/test/test
  dev2=/dev/test/scratch1
  mnt=/mnt/btrfs

  umount $dev1 &> /dev/null
  umount $dev2 &> /dev/null
  umount $mnt &> /dev/null

  mkfs.btrfs -f $dev1

  mount $dev1 $mnt

  _fail()
  {
          echo "!!! FAILED !!!"
          exit 1
  }

  for ((i = 0; i < 4096; i++)); do
          btrfs dev add -f $dev2 $mnt || _fail
          btrfs dev del $dev1 $mnt || _fail
          dev_tmp=$dev1
          dev1=$dev2
          dev2=$dev_tmp
  done

[CAUSE]
Tree-checker uses BTRFS_MAX_DEVS() and BTRFS_MAX_DEVS_SYS_CHUNK() as
upper limit for devid.  But we can have devid holes just like above
script.

So the check for devid is incorrect and could cause false alert.

[FIX]
Just remove the whole devid check.  We don't have any hard requirement
for devid assignment.

Furthermore, even devid could get corrupted by a bitflip, we still have
dev extents verification at mount time, so corrupted data won't sneak
in.

This fixes fstests btrfs/194.

Reported-by: Anand Jain <anand.jain@oracle.com>
Fixes: ab4ba2e133 ("btrfs: tree-checker: Verify dev item")
CC: stable@vger.kernel.org # 5.2+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 19:11:34 +02:00
Qu Wenruo c17add7a1c btrfs: Consider system chunk array size for new SYSTEM chunks
For SYSTEM chunks, despite the regular chunk item size limit, there is
another limit due to system chunk array size.

The extra limit was removed in a refactoring, so add it back.

Fixes: e3ecdb3fde ("btrfs: factor out devs_max setting in __btrfs_alloc_chunk")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-25 19:11:34 +02:00
Jens Axboe 2b2ed9750f io_uring: fix bad inflight accounting for SETUP_IOPOLL|SETUP_SQTHREAD
We currently assume that submissions from the sqthread are successful,
and if IO polling is enabled, we use that value for knowing how many
completions to look for. But if we overflowed the CQ ring or some
requests simply got errored and already completed, they won't be
available for polling.

For the case of IO polling and SQTHREAD usage, look at the pending
poll list. If it ever hits empty then we know that we don't have
anymore pollable requests inflight. For that case, simply reset
the inflight count to zero.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:53 -06:00
Jens Axboe 498ccd9eda io_uring: used cached copies of sq->dropped and cq->overflow
We currently use the ring values directly, but that can lead to issues
if the application is malicious and changes these values on our behalf.
Created in-kernel cached versions of them, and just overwrite the user
side when we update them. This is similar to how we treat the sq/cq
ring tail/head updates.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 10:58:45 -06:00
Pavel Begunkov 935d1e4590 io_uring: Fix race for sqes with userspace
io_ring_submit() finalises with
1. io_commit_sqring(), which releases sqes to the userspace
2. Then calls to io_queue_link_head(), accessing released head's sqe

Reorder them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:02:01 -06:00
Pavel Begunkov fb5ccc9878 io_uring: Fix broken links with offloading
io_sq_thread() processes sqes by 8 without considering links. As a
result, links will be randomely subdivided.

The easiest way to fix it is to call io_get_sqring() inside
io_submit_sqes() as do io_ring_submit().

Downsides:
1. This removes optimisation of not grabbing mm_struct for fixed files
2. It submitting all sqes in one go, without finer-grained sheduling
with cq processing.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-25 09:01:59 -06:00