1
0
Fork 0
Commit Graph

60716 Commits (a2ae8c0551050cbf3232fbb3d963a9bbe7b57268)

Author SHA1 Message Date
Aliasgar Surti d5cc14d9f9 xfs: removed unused error variable from xchk_refcountbt_rec
Removed unused error variable. Instead of using error variable,
returned the value directly as it wasn't updated.

Signed-off-by: Aliasgar Surti <aliasgar.surti500@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-06 15:39:05 -07:00
Eric Sandeen 6374ca0397 xfs: remove unused flags arg from xfs_get_aghdr_buf()
The flags arg is always passed as zero, so remove it.

(xfs_buf_get_uncached takes flags to support XBF_NO_IOACCT for
the sb, but that should never be relevant for xfs_get_aghdr_buf)

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-06 15:39:05 -07:00
Max Reitz e093c4be76 xfs: Fix tail rounding in xfs_alloc_file_space()
To ensure that all blocks touched by the range [offset, offset + count)
are allocated, we need to calculate the block count from the difference
of the range end (rounded up) and the range start (rounded down).

Before this patch, we just round up the byte count, which may lead to
unaligned ranges not being fully allocated:

$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..7]: 1396264..1396271
        1: [8..15]: hole

There should not be a hole there.  Instead, the first two blocks should
be fully allocated.

With this patch applied, the result is something like this:

$ touch test_file
$ block_size=$(stat -fc '%S' test_file)
$ fallocate -o $((block_size / 2)) -l $block_size test_file
$ xfs_bmap test_file
test_file:
        0: [0..15]: 11024..11039

Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-06 15:39:05 -07:00
Linus Torvalds b212921b13 elf: don't use MAP_FIXED_NOREPLACE for elf executable mappings
In commit 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map") we
changed elf to use MAP_FIXED_NOREPLACE instead of MAP_FIXED for the
executable mappings.

Then, people reported that it broke some binaries that had overlapping
segments from the same file, and commit ad55eac74f ("elf: enforce
MAP_FIXED on overlaying elf segments") re-instated MAP_FIXED for some
overlaying elf segment cases.  But only some - despite the summary line
of that commit, it only did it when it also does a temporary brk vma for
one obvious overlapping case.

Now Russell King reports another overlapping case with old 32-bit x86
binaries, which doesn't trigger that limited case.  End result: we had
better just drop MAP_FIXED_NOREPLACE entirely, and go back to MAP_FIXED.

Yes, it's a sign of old binaries generated with old tool-chains, but we
do pride ourselves on not breaking existing setups.

This still leaves MAP_FIXED_NOREPLACE in place for the load_elf_interp()
and the old load_elf_library() use-cases, because nobody has reported
breakage for those. Yet.

Note that in all the cases seen so far, the overlapping elf sections
seem to be just re-mapping of the same executable with different section
attributes.  We could possibly introduce a new MAP_FIXED_NOFILECHANGE
flag or similar, which acts like NOREPLACE, but allows just remapping
the same executable file using different protection flags.

It's not clear that would make a huge difference to anything, but if
people really hate that "elf remaps over previous maps" behavior, maybe
at least a more limited form of remapping would alleviate some concerns.

Alternatively, we should take a look at our elf_map() logic to see if we
end up not mapping things properly the first time.

In the meantime, this is the minimal "don't do that then" patch while
people hopefully think about it more.

Reported-by: Russell King <linux@armlinux.org.uk>
Fixes: 4ed2863951 ("fs, elf: drop MAP_FIXED usage from elf_map")
Fixes: ad55eac74f ("elf: enforce  MAP_FIXED on overlaying elf segments")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-06 13:53:27 -07:00
Linus Torvalds 4f11918ab9 Merge branch 'readdir' (readdir speedup and sanity checking)
This makes getdents() and getdents64() do sanity checking on the
pathname that it gives to user space.  And to mitigate the performance
impact of that, it first cleans up the way it does the user copying, so
that the code avoids doing the SMAP/PAN updates between each part of the
dirent structure write.

I really wanted to do this during the merge window, but didn't have
time.  The conversion of filldir to unsafe_put_user() is something I've
had around for years now in a private branch, but the extra pathname
checking finally made me clean it up to the point where it is mergable.

It's worth noting that the filename validity checking really should be a
bit smarter: it would be much better to delay the error reporting until
the end of the readdir, so that non-corrupted filenames are still
returned.  But that involves bigger changes, so let's see if anybody
actually hits the corrupt directory entry case before worrying about it
further.

* branch 'readdir':
  Make filldir[64]() verify the directory entry filename is valid
  Convert filldir[64]() from __put_user() to unsafe_put_user()
2019-10-05 12:03:27 -07:00
Linus Torvalds 8a23eb804c Make filldir[64]() verify the directory entry filename is valid
This has been discussed several times, and now filesystem people are
talking about doing it individually at the filesystem layer, so head
that off at the pass and just do it in getdents{64}().

This is partially based on a patch by Jann Horn, but checks for NUL
bytes as well, and somewhat simplified.

There's also commentary about how it might be better if invalid names
due to filesystem corruption don't cause an immediate failure, but only
an error at the end of the readdir(), so that people can still see the
filenames that are ok.

There's also been discussion about just how much POSIX strictly speaking
requires this since it's about filesystem corruption.  It's really more
"protect user space from bad behavior" as pointed out by Jann.  But
since Eric Biederman looked up the POSIX wording, here it is for context:

 "From readdir:

   The readdir() function shall return a pointer to a structure
   representing the directory entry at the current position in the
   directory stream specified by the argument dirp, and position the
   directory stream at the next entry. It shall return a null pointer
   upon reaching the end of the directory stream. The structure dirent
   defined in the <dirent.h> header describes a directory entry.

  From definitions:

   3.129 Directory Entry (or Link)

   An object that associates a filename with a file. Several directory
   entries can associate names with the same file.

  ...

   3.169 Filename

   A name consisting of 1 to {NAME_MAX} bytes used to name a file. The
   characters composing the name may be selected from the set of all
   character values excluding the slash character and the null byte. The
   filenames dot and dot-dot have special meaning. A filename is
   sometimes referred to as a 'pathname component'."

Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.

Also note that if this ends up being noticeable as a performance
regression, we can fix that to do a much more optimized model that
checks for both NUL and '/' at the same time one word at a time.

We haven't really tended to optimize 'memchr()', and it only checks for
one pattern at a time anyway, and we really _should_ check for NUL too
(but see the comment about "soft errors" in the code about why it
currently only checks for '/')

See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name
lookup code looks for pathname terminating characters in parallel.

Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-05 12:00:36 -07:00
Linus Torvalds 9f79b78ef7 Convert filldir[64]() from __put_user() to unsafe_put_user()
We really should avoid the "__{get,put}_user()" functions entirely,
because they can easily be mis-used and the original intent of being
used for simple direct user accesses no longer holds in a post-SMAP/PAN
world.

Manually optimizing away the user access range check makes no sense any
more, when the range check is generally much cheaper than the "enable
user accesses" code that the __{get,put}_user() functions still need.

So instead of __put_user(), use the unsafe_put_user() interface with
user_access_{begin,end}() that really does generate better code these
days, and which is generally a nicer interface.  Under some loads, the
multiple user writes that filldir() does are actually quite noticeable.

This also makes the dirent name copy use unsafe_put_user() with a couple
of macros.  We do not want to make function calls with SMAP/PAN
disabled, and the code this generates is quite good when the
architecture uses "asm goto" for unsafe_put_user() like x86 does.

Note that this doesn't bother with the legacy cases.  Nobody should use
them anyway, so performance doesn't really matter there.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-05 12:00:36 -07:00
Linus Torvalds c4bd70e8c9 for-linus-2019-10-03
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2WrkYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpjf1D/9wy2L/yAXA/KLQkwDZ2kn3tMtzMrsv6bOJ
 4UTdlLWKH2ihGg69iAE5f19iQSeNmqqXxBz0VvIBpFR7TqcRbGe0d9iVtoOsDhpZ
 c1OvQ2Ey35Xx1T2w6uMh0llgeWY2J/gJkY64unUxwUBUZwNOPA8ZjxqeXcMQmyAt
 sYmXpNLCT/f4YTYOYDgzoh5960TsCB/H+m/bLEVRvr0MaonvlaBUKRcysQikDFCQ
 yobzDmlSJqGKqlFJ2fnRVSkJC0BmBE5p8Ric9HHiUOT8BO31079IHUGbkbSh/csH
 0yPipNaYNMv+Hr0t9pgfcNbAt2weMK5HFgtpQwv8Frl4xjvBSWDS5fQesCVDjkZt
 +ROeOvQtjfeKtLy5PCu6BJwYpu8iYG9eGF8zxBQ4FBHM3tghcVhqssaNbfrVOW+u
 YXYbAuLMkLwKlmJ+6WBiVIMefyF59ue3+UJGECiCrj/BrgxUyw8HcGKwpKEAZSok
 VFGDukL0Y3flnoO/gyOf0GFaD5Uovr1sx82DCz05B/XEMfkqFMJRGkbyZBarJL69
 9QrnyGpF4rwtfg+usR1PmJ+9/oY/ypSk8N9MAIkoK9e1YIexxvBiXAf0k8AxuDyC
 uPuOiQgKcqUr3aF+ivao8dQB9NiK1bJGc4pqBPPN4ZYRSjMSfBT/cms4IeUyj0K6
 sokcB1p+CQ==
 =vKVl
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block

Pull block fixes from Jens Axboe:

 - Mandate timespec64 for the io_uring timeout ABI (Arnd)

 - Set of NVMe changes via Sagi:
     - controller removal race fix from Balbir
     - quirk additions from Gabriel and Jian-Hong
     - nvme-pci power state save fix from Mario
     - Add 64bit user commands (for 64bit registers) from Marta
     - nvme-rdma/nvme-tcp fixes from Max, Mark and Me
     - Minor cleanups and nits from James, Dan and John

 - Two s390 dasd fixes (Jan, Stefan)

 - Have loop change block size in DIO mode (Martijn)

 - paride pg header ifdef guard (Masahiro)

 - Two blk-mq queue scheduler tweaks, fixing an ordering issue on zoned
   devices and suboptimal performance on others (Ming)

* tag 'for-linus-2019-10-03' of git://git.kernel.dk/linux-block: (22 commits)
  block: sed-opal: fix sparse warning: convert __be64 data
  block: sed-opal: fix sparse warning: obsolete array init.
  block: pg: add header include guard
  Revert "s390/dasd: Add discard support for ESE volumes"
  s390/dasd: Fix error handling during online processing
  io_uring: use __kernel_timespec in timeout ABI
  loop: change queue block size to match when using DIO
  blk-mq: apply normal plugging for HDD
  blk-mq: honor IO scheduler for multiqueue devices
  nvme-rdma: fix possible use-after-free in connect timeout
  nvme: Move ctrl sqsize to generic space
  nvme: Add ctrl attributes for queue_count and sqsize
  nvme: allow 64-bit results in passthru commands
  nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T
  nvmet-tcp: remove superflous check on request sgl
  Added QUIRKs for ADATA XPG SX8200 Pro 512GB
  nvme-rdma: Fix max_hw_sectors calculation
  nvme: fix an error code in nvme_init_subsystem()
  nvme-pci: Save PCI state before putting drive into deepest state
  nvme-tcp: fix wrong stop condition in io_work
  ...
2019-10-04 09:56:51 -07:00
Pavel Begunkov bf7ec93c64 io_uring: fix reversed nonblock flag for link submission
io_queue_link_head() accepts @force_nonblock flag, but io_ring_submit()
passes something opposite.

Fixes: c576666863 ("io_uring: optimize submit_and_wait API")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-04 08:31:15 -06:00
Eric Sandeen cc3a7bfe62 vfs: Fix EOVERFLOW testing in put_compat_statfs64
Today, put_compat_statfs64() disallows nearly any field value over
2^32 if f_bsize is only 32 bits, but that makes no sense.
compat_statfs64 is there for the explicit purpose of providing 64-bit
fields for f_files, f_ffree, etc.  And f_bsize is always only 32 bits.

As a result, 32-bit userspace gets -EOVERFLOW for i.e.  large file
counts even with -D_FILE_OFFSET_BITS=64 set.

In reality, only f_bsize and f_frsize can legitimately overflow
(fields like f_type and f_namelen should never be large), so test
only those fields.

This bug was discussed at length some time ago, and this is the proposal
Al suggested at https://lkml.org/lkml/2018/8/6/640.  It seemed to get
dropped amid the discussion of other related changes, but this
part seems obviously correct on its own, so I've picked it up and
sent it, for expediency.

Fixes: 64d2ab32ef ("vfs: fix put_compat_statfs64() does not handle errors")
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-10-03 14:21:35 -07:00
Josef Bacik c5f4987e86 btrfs: fix uninitialized ret in ref-verify
Coverity caught a case where we could return with a uninitialized value
in ret in process_leaf.  This is actually pretty likely because we could
very easily run into a block group item key and have a garbage value in
ret and think there was an errror.  Fix this by initializing ret to 0.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: fd708b81d9 ("Btrfs: add a extent ref verify tool")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-03 15:00:56 +02:00
ZhangXiaoxu 33ea5aaa87 nfs: Fix nfsi->nrequests count error on nfs_inode_remove_request
When xfstests testing, there are some WARNING as below:

WARNING: CPU: 0 PID: 6235 at fs/nfs/inode.c:122 nfs_clear_inode+0x9c/0xd8
Modules linked in:
CPU: 0 PID: 6235 Comm: umount.nfs
Hardware name: linux,dummy-virt (DT)
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : nfs_clear_inode+0x9c/0xd8
lr : nfs_evict_inode+0x60/0x78
sp : fffffc000f68fc00
x29: fffffc000f68fc00 x28: fffffe00c53155c0
x27: fffffe00c5315000 x26: fffffc0009a63748
x25: fffffc000f68fd18 x24: fffffc000bfaaf40
x23: fffffc000936d3c0 x22: fffffe00c4ff5e20
x21: fffffc000bfaaf40 x20: fffffe00c4ff5d10
x19: fffffc000c056000 x18: 000000000000003c
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000040 x14: 0000000000000228
x13: fffffc000c3a2000 x12: 0000000000000045
x11: 0000000000000000 x10: 0000000000000000
x9 : 0000000000000000 x8 : 0000000000000000
x7 : 0000000000000000 x6 : fffffc00084b027c
x5 : fffffc0009a64000 x4 : fffffe00c0e77400
x3 : fffffc000c0563a8 x2 : fffffffffffffffb
x1 : 000000000000764e x0 : 0000000000000001
Call trace:
 nfs_clear_inode+0x9c/0xd8
 nfs_evict_inode+0x60/0x78
 evict+0x108/0x380
 dispose_list+0x70/0xa0
 evict_inodes+0x194/0x210
 generic_shutdown_super+0xb0/0x220
 nfs_kill_super+0x40/0x88
 deactivate_locked_super+0xb4/0x120
 deactivate_super+0x144/0x160
 cleanup_mnt+0x98/0x148
 __cleanup_mnt+0x38/0x50
 task_work_run+0x114/0x160
 do_notify_resume+0x2f8/0x308
 work_pending+0x8/0x14

The nrequest should be increased/decreased only if PG_INODE_REF flag
was setted.

But in the nfs_inode_remove_request function, it maybe decrease when
no PG_INODE_REF flag, this maybe lead nrequests count error.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-10-02 08:52:17 -04:00
Josef Bacik 11a19a9087 btrfs: allocate new inode in NOFS context
A user reported a lockdep splat

 ======================================================
 WARNING: possible circular locking dependency detected
 5.2.11-gentoo #2 Not tainted
 ------------------------------------------------------
 kswapd0/711 is trying to acquire lock:
 000000007777a663 (sb_internal){.+.+}, at: start_transaction+0x3a8/0x500

but task is already holding lock:
 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}:
 kmem_cache_alloc+0x1f/0x1c0
 btrfs_alloc_inode+0x1f/0x260
 alloc_inode+0x16/0xa0
 new_inode+0xe/0xb0
 btrfs_new_inode+0x70/0x610
 btrfs_symlink+0xd0/0x420
 vfs_symlink+0x9c/0x100
 do_symlinkat+0x66/0xe0
 do_syscall_64+0x55/0x1c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

-> #0 (sb_internal){.+.+}:
 __sb_start_write+0xf6/0x150
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 __list_lru_walk_one.isra.4+0x5c/0x100
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 kthread+0x147/0x160
 ret_from_fork+0x24/0x30

other info that might help us debug this:

 Possible unsafe locking scenario:

 CPU0 CPU1
 ---- ----
 lock(fs_reclaim);
 lock(sb_internal);
 lock(fs_reclaim);
 lock(sb_internal);
*** DEADLOCK ***

 3 locks held by kswapd0/711:
 #0: 000000000ba86300 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x0/0x30
 #1: 000000004a5100f8 (shrinker_rwsem){++++}, at: shrink_node+0x9a/0x380
 #2: 00000000f956fa46 (&type->s_umount_key#30){++++}, at: super_cache_scan+0x35/0x1d0

stack backtrace:
 CPU: 7 PID: 711 Comm: kswapd0 Not tainted 5.2.11-gentoo #2
 Hardware name: Dell Inc. Precision Tower 3620/0MWYPT, BIOS 2.4.2 09/29/2017
 Call Trace:
 dump_stack+0x85/0xc7
 print_circular_bug.cold.40+0x1d9/0x235
 __lock_acquire+0x18b1/0x1f00
 lock_acquire+0xa6/0x170
 ? start_transaction+0x3a8/0x500
 __sb_start_write+0xf6/0x150
 ? start_transaction+0x3a8/0x500
 start_transaction+0x3a8/0x500
 btrfs_commit_inode_delayed_inode+0x59/0x110
 btrfs_evict_inode+0x19e/0x4c0
 ? var_wake_function+0x20/0x20
 evict+0xbc/0x1f0
 inode_lru_isolate+0x113/0x190
 ? discard_new_inode+0xc0/0xc0
 __list_lru_walk_one.isra.4+0x5c/0x100
 ? discard_new_inode+0xc0/0xc0
 list_lru_walk_one+0x32/0x50
 prune_icache_sb+0x36/0x80
 super_cache_scan+0x14a/0x1d0
 do_shrink_slab+0x131/0x320
 shrink_node+0xf7/0x380
 balance_pgdat+0x2d5/0x640
 kswapd+0x2ba/0x5e0
 ? __wake_up_common_lock+0x90/0x90
 kthread+0x147/0x160
 ? balance_pgdat+0x640/0x640
 ? __kthread_create_on_node+0x160/0x160
 ret_from_fork+0x24/0x30

This is because btrfs_new_inode() calls new_inode() under the
transaction.  We could probably move the new_inode() outside of this but
for now just wrap it in memalloc_nofs_save().

Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Fixes: 712e36c5f2 ("btrfs: use GFP_KERNEL in btrfs_alloc_inode")
CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01 20:12:27 +02:00
Zygo Blaxell 7a54789074 btrfs: fix balance convert to single on 32-bit host CPUs
Currently, the command:

	btrfs balance start -dconvert=single,soft .

on a Raspberry Pi produces the following kernel message:

	BTRFS error (device mmcblk0p2): balance: invalid convert data profile single

This fails because we use is_power_of_2(unsigned long) to validate
the new data profile, the constant for 'single' profile uses bit 48,
and there are only 32 bits in a long on ARM.

Fix by open-coding the check using u64 variables.

Tested by completing the original balance command on several Raspberry
Pis.

Fixes: 818255feec ("btrfs: use common helper instead of open coding a bit test")
CC: stable@vger.kernel.org # 4.20+
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01 19:37:29 +02:00
Josef Bacik 4203e96894 btrfs: fix incorrect updating of log root tree
We've historically had reports of being unable to mount file systems
because the tree log root couldn't be read.  Usually this is the "parent
transid failure", but could be any of the related errors, including
"fsid mismatch" or "bad tree block", depending on which block got
allocated.

The modification of the individual log root items are serialized on the
per-log root root_mutex.  This means that any modification to the
per-subvol log root_item is completely protected.

However we update the root item in the log root tree outside of the log
root tree log_mutex.  We do this in order to allow multiple subvolumes
to be updated in each log transaction.

This is problematic however because when we are writing the log root
tree out we update the super block with the _current_ log root node
information.  Since these two operations happen independently of each
other, you can end up updating the log root tree in between writing out
the dirty blocks and setting the super block to point at the current
root.

This means we'll point at the new root node that hasn't been written
out, instead of the one we should be pointing at.  Thus whatever garbage
or old block we end up pointing at complains when we mount the file
system later and try to replay the log.

Fix this by copying the log's root item into a local root item copy.
Then once we're safely under the log_root_tree->log_mutex we update the
root item in the log_root_tree.  This way we do not modify the
log_root_tree while we're committing it, fixing the problem.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01 18:41:02 +02:00
Filipe Manana c67d970f0e Btrfs: fix memory leak due to concurrent append writes with fiemap
When we have a buffered write that starts at an offset greater than or
equals to the file's size happening concurrently with a full ranged
fiemap, we can end up leaking an extent state structure.

Suppose we have a file with a size of 1Mb, and before the buffered write
and fiemap are performed, it has a single extent state in its io tree
representing the range from 0 to 1Mb, with the EXTENT_DELALLOC bit set.

The following sequence diagram shows how the memory leak happens if a
fiemap a buffered write, starting at offset 1Mb and with a length of
4Kb, are performed concurrently.

          CPU 1                                                  CPU 2

  extent_fiemap()
    --> it's a full ranged fiemap
        range from 0 to LLONG_MAX - 1
        (9223372036854775807)

    --> locks range in the inode's
        io tree
      --> after this we have 2 extent
          states in the io tree:
          --> 1 for range [0, 1Mb[ with
              the bits EXTENT_LOCKED and
              EXTENT_DELALLOC_BITS set
          --> 1 for the range
              [1Mb, LLONG_MAX[ with
              the EXTENT_LOCKED bit set

                                                  --> start buffered write at offset
                                                      1Mb with a length of 4Kb

                                                  btrfs_file_write_iter()

                                                    btrfs_buffered_write()
                                                      --> cached_state is NULL

                                                      lock_and_cleanup_extent_if_need()
                                                        --> returns 0 and does not lock
                                                            range because it starts
                                                            at current i_size / eof

                                                      --> cached_state remains NULL

                                                      btrfs_dirty_pages()
                                                        btrfs_set_extent_delalloc()
                                                          (...)
                                                          __set_extent_bit()

                                                            --> splits extent state for range
                                                                [1Mb, LLONG_MAX[ and now we
                                                                have 2 extent states:

                                                                --> one for the range
                                                                    [1Mb, 1Mb + 4Kb[ with
                                                                    EXTENT_LOCKED set
                                                                --> another one for the range
                                                                    [1Mb + 4Kb, LLONG_MAX[ with
                                                                    EXTENT_LOCKED set as well

                                                            --> sets EXTENT_DELALLOC on the
                                                                extent state for the range
                                                                [1Mb, 1Mb + 4Kb[
                                                            --> caches extent state
                                                                [1Mb, 1Mb + 4Kb[ into
                                                                @cached_state because it has
                                                                the bit EXTENT_LOCKED set

                                                    --> btrfs_buffered_write() ends up
                                                        with a non-NULL cached_state and
                                                        never calls anything to release its
                                                        reference on it, resulting in a
                                                        memory leak

Fix this by calling free_extent_state() on cached_state if the range was
not locked by lock_and_cleanup_extent_if_need().

The same issue can happen if anything else other than fiemap locks a range
that covers eof and beyond.

This could be triggered, sporadically, by test case generic/561 from the
fstests suite, which makes duperemove run concurrently with fsstress, and
duperemove does plenty of calls to fiemap. When CONFIG_BTRFS_DEBUG is set
the leak is reported in dmesg/syslog when removing the btrfs module with
a message like the following:

  [77100.039461] BTRFS: state leak: start 6574080 end 6582271 state 16402 in tree 0 refs 1

Otherwise (CONFIG_BTRFS_DEBUG not set) detectable with kmemleak.

CC: stable@vger.kernel.org # 4.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-01 18:40:58 +02:00
Arnd Bergmann bdf2007311 io_uring: use __kernel_timespec in timeout ABI
All system calls use struct __kernel_timespec instead of the old struct
timespec, but this one was just added with the old-style ABI. Change it
now to enforce the use of __kernel_timespec, avoiding ABI confusion and
the need for compat handlers on 32-bit architectures.

Any user space caller will have to use __kernel_timespec now, but this
is unambiguous and works for any C library regardless of the time_t
definition. A nicer way to specify the timeout would have been a less
ambiguous 64-bit nanosecond value, but I suppose it's too late now to
change that as this would impact both 32-bit and 64-bit users.

Fixes: 5262f56798 ("io_uring: IORING_OP_TIMEOUT support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-01 09:53:29 -06:00
Gao Xiang dc76ea8c10 erofs: fix mis-inplace determination related with noio chain
Fix a recent cleanup patch. noio (bypass) chain is
handled asynchronously against submit chain, therefore
inplace I/O or pagevec cannot be applied to such pages.
Add detailed comment for this as well.

Fixes: 97e86a858b ("staging: erofs: tidy up decompression frontend")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20190922100434.229340-1-gaoxiang25@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2019-10-01 04:54:45 +08:00
Gao Xiang 55252ab72b erofs: fix erofs_get_meta_page locking due to a cleanup
After doing more drop_caches stress test on
our products, I found the mistake introduced by
a very recent cleanup [1].

The current rule is that "erofs_get_meta_page"
should be returned with page locked (although
it's mostly unnecessary for read-only fs after
pages are PG_uptodate), but a fix should be
done for this.

[1] https://lore.kernel.org/r/20190904020912.63925-26-gaoxiang25@huawei.com
Fixes: 618f40ea02 ("erofs: use read_cache_page_gfp for erofs_get_meta_page")
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Link: https://lore.kernel.org/r/20190921184355.149928-1-gaoxiang25@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2019-10-01 04:54:33 +08:00
Wei Yongjun 517d6b9c6f erofs: fix return value check in erofs_read_superblock()
In case of error, the function read_mapping_page() returns
ERR_PTR() not NULL. The NULL test in the return value check
should be replaced with IS_ERR().

Fixes: fe7c242357 ("erofs: use read_mapping_page instead of sb_bread")
Reviewed-by: Gao Xiang <gaoxiang25@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Link: https://lore.kernel.org/r/20190918083033.47780-1-weiyongjun1@huawei.com
Signed-off-by: Gao Xiang <gaoxiang25@huawei.com>
2019-10-01 04:53:50 +08:00
Linus Torvalds bb48a59135 for-5.4-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl2SDbMACgkQxWXV+ddt
 WDsUhw/9HcRsT6SlrwA2R5leHxCR5UMwT2Zmbxpfft37ANF0SC1UINHBfnmquM97
 xX6fdRSR9RUjF9DrdLPfLBnJDQ/MnHl1ruIVBFhJm6cJ9TJwf9E0TiJBQt+08JWg
 vy5hZBWvsPWWRBJ94XPMe4LtakK/isW4Cz5W9AdrC2Siqw69j6eZzms2AnIjyBjA
 BoKg4se2Ay2rMxLZWXIOj9374PU+N1cnRnqgh77ZxLku5WdCzrDfB5safE7UmoTG
 /MWJuuIgzOk0iQpQORRtEZDS1dNe5KT9m4xXkUbrZbQROwqnXrT1SVIsuqNAvlPk
 uaymR1W8nshepzpMlSxVydLv/mKWZNUGnDxOJ23ooow8Yd7ndppXEtFuGwCYqIFc
 xQqxuTLREvJ9+jpSv11bmDpk/ULRqpV+2PjUqGaWlGwFArJ+qFRLVGYx31eXmDPj
 t2mrPOcXGzY0pKtIpbkuUGleY/jeI+BNsvD4+QPs+jnp0nmfvH0/Rmp7grGqx2FI
 rQM8Gn4a5i3nuEDWLp8nN2wcKC3ePwy96Vp2tqfsl6TVTPx4EFzGLkWogHR2yiqI
 0LAj8YWFmWuChSv71wYOjX79CppjcbNwOakSwtDjV30jkwoh2f/0D3OwOpua2xe8
 75KQMaSB0kesGZz7ZkL1kMqA5m5w7MGZom6XZoBJ+bq2HPLB2jo=
 =2UM7
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A bunch of fixes that accumulated in recent weeks, mostly material for
  stable.

  Summary:

   - fix for regression from 5.3 that prevents to use balance convert
     with single profile

   - qgroup fixes: rescan race, accounting leak with multiple writers,
     potential leak after io failure recovery

   - fix for use after free in relocation (reported by KASAN)

   - other error handling fixups"

* tag 'for-5.4-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
  btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
  btrfs: Fix a regression which we can't convert to SINGLE profile
  btrfs: relocation: fix use-after-free on dead relocation roots
  Btrfs: fix race setting up and completing qgroup rescan workers
  Btrfs: fix missing error return if writeback for extent buffer never started
  btrfs: adjust dirty_metadata_bytes after writeback failure of extent buffer
  Btrfs: fix selftests failure due to uninitialized i_mode in test inodes
2019-09-30 10:25:24 -07:00
Linus Torvalds 1eb80d6ffb Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "A couple of misc patches"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  afs dynroot: switch to simple_dir_operations
  fs/handle.c - fix up kerneldoc
2019-09-29 19:42:07 -07:00
Linus Torvalds 7edee5229c 9 smb3 patches including an important patch for debugging traces with wireshark, and 3 patches for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl2Pzl0ACgkQiiy9cAdy
 T1F7aAv9EUA2vEdV+3tyKX17yGm8GBVygANsdMlGqqmRhauO0+KJnrsTR19qh9na
 oe0r6EwaS6/JwDtM/Tt0YyjyRS7GDyfT4cNNFVmrJ0fnQV11FJR0X83uzdm3HydH
 eOyKNG22TwOeFJ3kWqdvSI0AtfbmIcVoOlUAAKsAsv2ksrJIW7Q1BIgQeD8estUV
 j8VjPEIc1c/69UU/H5ktrRHMeT5PO61SV8xGM47WnYkntlFDe1E83xWGoxo996Pc
 KdGSrB1edWXK6kSlX3yQWnoo8QxcUm8IjgsudqcnOrhnro9s/cDU5ZU1RlXNQeB8
 LMtYwNA7jEu9p3TIibxOCph4gofUWNV25GbEJWOY03NxWReTvgLsMbsreul+XNv9
 fow5mvCG94SaE8xDjTvzYRBTeYoXv0WjlTTJjqAlVshirQXk7a2dEBVBipkn0Ea7
 0845c3NtR20pDGQs3vVzdStDT2MwkNUl1hN4vE1Zl0p2ClOS+eFVq9MgIEddSLi2
 Z0oJsmfg
 =o1/m
 -----END PGP SIGNATURE-----

Merge tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull more cifs updates from Steve French:
 "Fixes from the recent SMB3 Test events and Storage Developer
  Conference (held the last two weeks).

  Here are nine smb3 patches including an important patch for debugging
  traces with wireshark, with three patches marked for stable.

  Additional fixes from last week to better handle some newly discovered
  reparse points, and a fix the create/mkdir path for setting the mode
  more atomically (in SMB3 Create security descriptor context), and one
  for path name processing are still being tested so are not included
  here"

* tag '5.4-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  CIFS: Fix oplock handling for SMB 2.1+ protocols
  smb3: missing ACL related flags
  smb3: pass mode bits into create calls
  smb3: Add missing reparse tags
  CIFS: fix max ea value size
  fs/cifs/sess.c: Remove set but not used variable 'capabilities'
  fs/cifs/smb2pdu.c: Make SMB2_notify_init static
  smb3: fix leak in "open on server" perf counter
  smb3: allow decryption keys to be dumped by admin for debugging
2019-09-29 19:37:32 -07:00
Linus Torvalds 3f2dc2798b Merge branch 'entropy'
Merge active entropy generation updates.

This is admittedly partly "for discussion".  We need to have a way
forward for the boot time deadlocks where user space ends up waiting for
more entropy, but no entropy is forthcoming because the system is
entirely idle just waiting for something to happen.

While this was triggered by what is arguably a user space bug with
GDM/gnome-session asking for secure randomness during early boot, when
they didn't even need any such truly secure thing, the issue ends up
being that our "getrandom()" interface is prone to that kind of
confusion, because people don't think very hard about whether they want
to block for sufficient amounts of entropy.

The approach here-in is to decide to not just passively wait for entropy
to happen, but to start actively collecting it if it is missing.  This
is not necessarily always possible, but if the architecture has a CPU
cycle counter, there is a fair amount of noise in the exact timings of
reasonably complex loads.

We may end up tweaking the load and the entropy estimates, but this
should be at least a reasonable starting point.

As part of this, we also revert the revert of the ext4 IO pattern
improvement that ended up triggering the reported lack of external
entropy.

* getrandom() active entropy waiting:
  Revert "Revert "ext4: make __ext4_get_inode_loc plug""
  random: try to actively add entropy rather than passively wait for it
2019-09-29 19:25:39 -07:00
Linus Torvalds 02f03c4206 Revert "Revert "ext4: make __ext4_get_inode_loc plug""
This reverts commit 72dbcf7215.

Instead of waiting forever for entropy that may just not happen, we now
try to actively generate entropy when required, and are thus hopefully
avoiding the problem that caused the nice ext4 IO pattern fix to be
reverted.

So revert the revert.

Cc: Ahmed S. Darwish <darwish.07@gmail.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Alexander E. Patrakov <patrakov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-29 17:59:23 -07:00
Linus Torvalds 9c5efe9ae7 Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier related fixes and cleanups, which fixes
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function
2019-09-28 12:39:07 -07:00
Linus Torvalds aefcf2f4b5 Merge branch 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull kernel lockdown mode from James Morris:
 "This is the latest iteration of the kernel lockdown patchset, from
  Matthew Garrett, David Howells and others.

  From the original description:

    This patchset introduces an optional kernel lockdown feature,
    intended to strengthen the boundary between UID 0 and the kernel.
    When enabled, various pieces of kernel functionality are restricted.
    Applications that rely on low-level access to either hardware or the
    kernel may cease working as a result - therefore this should not be
    enabled without appropriate evaluation beforehand.

    The majority of mainstream distributions have been carrying variants
    of this patchset for many years now, so there's value in providing a
    doesn't meet every distribution requirement, but gets us much closer
    to not requiring external patches.

  There are two major changes since this was last proposed for mainline:

   - Separating lockdown from EFI secure boot. Background discussion is
     covered here: https://lwn.net/Articles/751061/

   -  Implementation as an LSM, with a default stackable lockdown LSM
      module. This allows the lockdown feature to be policy-driven,
      rather than encoding an implicit policy within the mechanism.

  The new locked_down LSM hook is provided to allow LSMs to make a
  policy decision around whether kernel functionality that would allow
  tampering with or examining the runtime state of the kernel should be
  permitted.

  The included lockdown LSM provides an implementation with a simple
  policy intended for general purpose use. This policy provides a coarse
  level of granularity, controllable via the kernel command line:

    lockdown={integrity|confidentiality}

  Enable the kernel lockdown feature. If set to integrity, kernel features
  that allow userland to modify the running kernel are disabled. If set to
  confidentiality, kernel features that allow userland to extract
  confidential information from the kernel are also disabled.

  This may also be controlled via /sys/kernel/security/lockdown and
  overriden by kernel configuration.

  New or existing LSMs may implement finer-grained controls of the
  lockdown features. Refer to the lockdown_reason documentation in
  include/linux/security.h for details.

  The lockdown feature has had signficant design feedback and review
  across many subsystems. This code has been in linux-next for some
  weeks, with a few fixes applied along the way.

  Stephen Rothwell noted that commit 9d1f8be5cf ("bpf: Restrict bpf
  when kernel lockdown is in confidentiality mode") is missing a
  Signed-off-by from its author. Matthew responded that he is providing
  this under category (c) of the DCO"

* 'next-lockdown' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (31 commits)
  kexec: Fix file verification on S390
  security: constify some arrays in lockdown LSM
  lockdown: Print current->comm in restriction messages
  efi: Restrict efivar_ssdt_load when the kernel is locked down
  tracefs: Restrict tracefs when the kernel is locked down
  debugfs: Restrict debugfs when the kernel is locked down
  kexec: Allow kexec_file() with appropriate IMA policy when locked down
  lockdown: Lock down perf when in confidentiality mode
  bpf: Restrict bpf when kernel lockdown is in confidentiality mode
  lockdown: Lock down tracing and perf kprobes when in confidentiality mode
  lockdown: Lock down /proc/kcore
  x86/mmiotrace: Lock down the testmmiotrace module
  lockdown: Lock down module params that specify hardware parameters (eg. ioport)
  lockdown: Lock down TIOCSSERIAL
  lockdown: Prohibit PCMCIA CIS storage when the kernel is locked down
  acpi: Disable ACPI table override if the kernel is locked down
  acpi: Ignore acpi_rsdp kernel param when the kernel has been locked down
  ACPI: Limit access to custom_method when the kernel is locked down
  x86/msr: Restrict MSR access when the kernel is locked down
  x86: Lock down IO port access when the kernel is locked down
  ...
2019-09-28 08:14:15 -07:00
Linus Torvalds 298fb76a55 Highlights:
- add a new knfsd file cache, so that we don't have to open and
 	  close on each (NFSv2/v3) READ or WRITE.  This can speed up
 	  read and write in some cases.  It also replaces our readahead
 	  cache.
 	- Prevent silent data loss on write errors, by treating write
 	  errors like server reboots for the purposes of write caching,
 	  thus forcing clients to resend their writes.
 	- Tweak the code that allocates sessions to be more forgiving,
 	  so that NFSv4.1 mounts are less likely to hang when a server
 	  already has a lot of clients.
 	- Eliminate an arbitrary limit on NFSv4 ACL sizes; they should
 	  now be limited only by the backend filesystem and the
 	  maximum RPC size.
 	- Allow the server to enforce use of the correct kerberos
 	  credentials when a client reclaims state after a reboot.
 
 And some miscellaneous smaller bugfixes and cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl2OoFcVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+dRoP/3OW1NxPjpjbCQWZL0M+O3AYJJla
 W8E+uoZKMosFEe/ymokMD0Vn5s47jPaMCifMjHZa2GygW8zHN9X2v0HURx/lob+o
 /rJXwMn78N/8kdbfDz2FvaCPeT0IuNzRIFBV8/sSXofqwCBwvPo+cl0QGrd4/xLp
 X35qlupx62TRk+kbdRjvv8kpS5SJ7BvR+FSA1WubNYWw2hpdEsr2OCFdGq2Wvthy
 DK6AfGBXfJGsOE+HGCSj6ejRV6i0UOJ17P8gRSsx+YT0DOe5E7ROjt+qvvRwk489
 wmR8Vjuqr1e40eGAUq3xuLfk5F5NgycY4ekVxk/cTVFNwWcz2DfdjXQUlyPAbrSD
 SqIyxN1qdKT24gtr7AHOXUWJzBYPWDgObCVBXUGzyL81RiDdhf38HRNjL2TcSDld
 tzCjQ0wbPw+iT74v6qQRY05oS+h3JOtDjU4pxsBnxVtNn4WhGJtaLfWW8o1C1QwU
 bc4aX3TlYhDmzU7n7Zjt4rFXGJfyokM+o6tPao1Z60Pmsv1gOk4KQlzLtW/jPHx4
 ZwYTwVQUKRDBfC62nmgqDyGI3/Qu11FuIxL2KXUCgkwDxNWN4YkwYjOGw9Lb5qKM
 wFpxq6CDNZB/IWLEu8Yg85kDPPUJMoI657lZb7Osr/MfBpU0YljcMOIzLBy8uV1u
 9COUbPaQipiWGu/0
 =diBo
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Add a new knfsd file cache, so that we don't have to open and close
     on each (NFSv2/v3) READ or WRITE. This can speed up read and write
     in some cases. It also replaces our readahead cache.

   - Prevent silent data loss on write errors, by treating write errors
     like server reboots for the purposes of write caching, thus forcing
     clients to resend their writes.

   - Tweak the code that allocates sessions to be more forgiving, so
     that NFSv4.1 mounts are less likely to hang when a server already
     has a lot of clients.

   - Eliminate an arbitrary limit on NFSv4 ACL sizes; they should now be
     limited only by the backend filesystem and the maximum RPC size.

   - Allow the server to enforce use of the correct kerberos credentials
     when a client reclaims state after a reboot.

  And some miscellaneous smaller bugfixes and cleanup"

* tag 'nfsd-5.4' of git://linux-nfs.org/~bfields/linux: (34 commits)
  sunrpc: clean up indentation issue
  nfsd: fix nfs read eof detection
  nfsd: Make nfsd_reset_boot_verifier_locked static
  nfsd: degraded slot-count more gracefully as allocation nears exhaustion.
  nfsd: handle drc over-allocation gracefully.
  nfsd: add support for upcall version 2
  nfsd: add a "GetVersion" upcall for nfsdcld
  nfsd: Reset the boot verifier on all write I/O errors
  nfsd: Don't garbage collect files that might contain write errors
  nfsd: Support the server resetting the boot verifier
  nfsd: nfsd_file cache entries should be per net namespace
  nfsd: eliminate an unnecessary acl size limit
  Deprecate nfsd fault injection
  nfsd: remove duplicated include from filecache.c
  nfsd: Fix the documentation for svcxdr_tmpalloc()
  nfsd: Fix up some unused variable warnings
  nfsd: close cached files prior to a REMOVE or RENAME that would replace target
  nfsd: rip out the raparms cache
  nfsd: have nfsd_test_lock use the nfsd_file cache
  nfsd: hook up nfs4_preprocess_stateid_op to the nfsd_file cache
  ...
2019-09-27 17:00:27 -07:00
Linus Torvalds 8f744bdee4 add virtio-fs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXYx2zAAKCRDh3BK/laaZ
 PFpHAQD2G+F8a9e41jFTJg5YpNKMD8/Pl4T6v9chIO9qPXF2IAEAji0P1JterRfv
 ixiBhv54hSwYbk527nxNWE9tP5gAHAQ=
 =WCHy
 -----END PGP SIGNATURE-----

Merge tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse virtio-fs support from Miklos Szeredi:
 "Virtio-fs allows exporting directory trees on the host and mounting
  them in guest(s).

  This isn't actually a new filesystem, but a glue layer between the
  fuse filesystem and a virtio based back-end.

  It's similar in functionality to the existing virtio-9p solution, but
  significantly faster in benchmarks and has better POSIX compliance.
  Further permformance improvements can be achieved by sharing the page
  cache between host and guest, allowing for faster I/O and reduced
  memory use.

  Kata Containers have been including the out-of-tree virtio-fs (with
  the shared page cache patches as well) since version 1.7 as an
  experimental feature. They have been active in development and plan to
  switch from virtio-9p to virtio-fs as their default solution. There
  has been interest from other sources as well.

  The userspace infrastructure is slated to be merged into qemu once the
  kernel part hits mainline.

  This was developed by Vivek Goyal, Dave Gilbert and Stefan Hajnoczi"

* tag 'virtio-fs-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  virtio-fs: add virtiofs filesystem
  virtio-fs: add Documentation/filesystems/virtiofs.rst
  fuse: reserve values for mapping protocol
2019-09-27 15:54:24 -07:00
Linus Torvalds 9977b1a714 9p pull request for inclusion in 5.4
Small fixes all around:
  - avoid overlayfs copy-up for PRIVATE mmaps
  - KUMSAN uninitialized warning for transport error
  - one syzbot memory leak fix in 9p cache
  - internal API cleanup for v9fs_fill_super
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl2OG0MACgkQq06b7GqY
 5nB21Q/+P4kD+wBg4d1b74Mf+oK2p2cSvCnYHcwMgkRsuyFgu7JXLSuRy2DLlAa4
 F+92x0jvit+yUy8bFqlEsSGGmE/dZC3WfpBOVChZRT/9ZLcxNasowtGDaySrX1e1
 tmx5QfYsNL840Cv8Vfos4QJjze1oE6RmSCUUO4KyC250PO/jxEgWGVPv1S8LfTUb
 KyhkcKlJ+TCqEON9nZ+CrQ4rndHNc7LxtBjVR0M03UYW55kYAhMLSj/AWqKA5I0q
 tARlTSgHHRdj4EJvWvLw7INg7BuvTuuyIzCIo/RfpUvijTDyLXFQI3qgtXosKLSq
 wuOacq8ykDRv8nvpLJ98daCw0pQOF2Qw40niFfl8Y7AxnaXggMI+UsVX/BnigVWX
 E6kaJm6d/YERee/f7aBsJkjKbwRqO7tMbNDrbK2/AzAaZo92dnB5nsEJnvKtCdjl
 aylBujCJq/DPp/SOf0Rl8mr24GdYVIk04hvqpxr4X6a/sacO3EUMh8GkGD7muY63
 axbYildLQHyChEn0TNdicFFrVie3A984eiK8sht+visd7c5QRm9iu1lsYtXdVGuz
 0OvSRtJuTDCTs6s1DhhVnez9luzH4VZJAj0RYJjcF1M4tzy0sKWUSSUf3RUbyxmY
 bocklihTlI5UzZfCP/wChXJ26+QrmcL7j0PrKA0NJy7ooFyT5eI=
 =VBrx
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.4' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "Some of the usual small fixes and cleanup.

  Small fixes all around:
   - avoid overlayfs copy-up for PRIVATE mmaps
   - KUMSAN uninitialized warning for transport error
   - one syzbot memory leak fix in 9p cache
   - internal API cleanup for v9fs_fill_super"

* tag '9p-for-5.4' of git://github.com/martinetd/linux:
  9p/vfs_super.c: Remove unused parameter data in v9fs_fill_super
  9p/cache.c: Fix memory leak in v9fs_cache_session_get_cookie
  9p: Transport error uninitialized
  9p: avoid attaching writeback_fid on mmap with type PRIVATE
2019-09-27 15:10:34 -07:00
Linus Torvalds 738f531d87 for-5.4/io_uring-2019-09-27
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2OIu4QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsedD/4h54330vuq66DsGqzFLonLwFl5YHC5NeJX
 aV38j7pAUPHvr9CSnr3d2VwTk/ThBtPI50I9/d9SXh8n1oAAA5C/+nPf1XknME47
 giKHr3eb0FNLOySt/Ry284gla8mO0GZM83zUbDMnF0N+tfwAtFvbvgCpsPFK9vdL
 xNzMLsq6++pj7p9c6IXd+zv0nmJzDikZQz6PtU1KYlbfnU7hh/cP3CrIHIPtGbyk
 c/7tfbKTB9UnbW5guGakZt3BzNViJpK28SKn+S6AlLiEOYpC+55dBbaZQIy5qxHv
 CZsx0GJIw0Ya0Lw3UEFp/74krLHq2610jmx/va8P7MZCjZAR675G3mjxKUnC/+SY
 mEdLo6vghMNAIqMBWNu59CFQOPnqa8sqRii0q6cRWXSqKiFr1FLN8mstb3Ghh9K0
 kGVA/gw6ESWB/e/X6I+pD6pTm6O6BPWEqBzGAWSvavQQIP9YpIzf5j+k3JsRu03/
 IIzR6gW9k9u4k0rFlOJKbp1+AO5sK3VtJFR8JGELiRwwgjD91w50gjPYak5OGM37
 Mi7OHCxqtwFGTkSvT6RM6om6onBsizrveszkrPUO01bWYIHHbtu6ofLyQlfnEtpv
 qbGZtLW6KYj9VsIKZNDfg99Ff79IAOiAZDbXWAu/JKyg/gu1Y9uOiVkNFPJGPNHV
 8ourcldMGg==
 =DEYH
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block

Pull more io_uring updates from Jens Axboe:
 "Just two things in here:

   - Improvement to the io_uring CQ ring wakeup for batched IO (me)

   - Fix wrong comparison in poll handling (yangerkun)

  I realize the first one is a little late in the game, but it felt
  pointless to hold it off until the next release. Went through various
  testing and reviews with Pavel and peterz"

* tag 'for-5.4/io_uring-2019-09-27' of git://git.kernel.dk/linux-block:
  io_uring: make CQ ring wakeups be more efficient
  io_uring: compare cached_cq_tail with cq.head in_io_uring_poll
2019-09-27 12:08:24 -07:00
Qu Wenruo d4e204948f btrfs: qgroup: Fix reserved data space leak if we have multiple reserve calls
[BUG]
The following script can cause btrfs qgroup data space leak:

  mkfs.btrfs -f $dev
  mount $dev -o nospace_cache $mnt

  btrfs subv create $mnt/subv
  btrfs quota en $mnt
  btrfs quota rescan -w $mnt
  btrfs qgroup limit 128m $mnt/subv

  for (( i = 0; i < 3; i++)); do
          # Create 3 64M holes for latter fallocate to fail
          truncate -s 192m $mnt/subv/file
          xfs_io -c "pwrite 64m 4k" $mnt/subv/file > /dev/null
          xfs_io -c "pwrite 128m 4k" $mnt/subv/file > /dev/null
          sync

          # it's supposed to fail, and each failure will leak at least 64M
          # data space
          xfs_io -f -c "falloc 0 192m" $mnt/subv/file &> /dev/null
          rm $mnt/subv/file
          sync
  done

  # Shouldn't fail after we removed the file
  xfs_io -f -c "falloc 0 64m" $mnt/subv/file

[CAUSE]
Btrfs qgroup data reserve code allow multiple reservations to happen on
a single extent_changeset:
E.g:
	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_1M);
	btrfs_qgroup_reserve_data(inode, &data_reserved, SZ_1M, SZ_2M);
	btrfs_qgroup_reserve_data(inode, &data_reserved, 0, SZ_4M);

Btrfs qgroup code has its internal tracking to make sure we don't
double-reserve in above example.

The only pattern utilizing this feature is in the main while loop of
btrfs_fallocate() function.

However btrfs_qgroup_reserve_data()'s error handling has a bug in that
on error it clears all ranges in the io_tree with EXTENT_QGROUP_RESERVED
flag but doesn't free previously reserved bytes.

This bug has a two fold effect:
- Clearing EXTENT_QGROUP_RESERVED ranges
  This is the correct behavior, but it prevents
  btrfs_qgroup_check_reserved_leak() to catch the leakage as the
  detector is purely EXTENT_QGROUP_RESERVED flag based.

- Leak the previously reserved data bytes.

The bug manifests when N calls to btrfs_qgroup_reserve_data are made and
the last one fails, leaking space reserved in the previous ones.

[FIX]
Also free previously reserved data bytes when btrfs_qgroup_reserve_data
fails.

Fixes: 5247255370 ("btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-27 15:24:34 +02:00
Qu Wenruo bab32fc069 btrfs: qgroup: Fix the wrong target io_tree when freeing reserved data space
[BUG]
Under the following case with qgroup enabled, if some error happened
after we have reserved delalloc space, then in error handling path, we
could cause qgroup data space leakage:

From btrfs_truncate_block() in inode.c:

	ret = btrfs_delalloc_reserve_space(inode, &data_reserved,
					   block_start, blocksize);
	if (ret)
		goto out;

 again:
	page = find_or_create_page(mapping, index, mask);
	if (!page) {
		btrfs_delalloc_release_space(inode, data_reserved,
					     block_start, blocksize, true);
		btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, true);
		ret = -ENOMEM;
		goto out;
	}

[CAUSE]
In the above case, btrfs_delalloc_reserve_space() will call
btrfs_qgroup_reserve_data() and mark the io_tree range with
EXTENT_QGROUP_RESERVED flag.

In the error handling path, we have the following call stack:
btrfs_delalloc_release_space()
|- btrfs_free_reserved_data_space()
   |- btrsf_qgroup_free_data()
      |- __btrfs_qgroup_release_data(reserved=@reserved, free=1)
         |- qgroup_free_reserved_data(reserved=@reserved)
            |- clear_record_extent_bits();
            |- freed += changeset.bytes_changed;

However due to a completion bug, qgroup_free_reserved_data() will clear
EXTENT_QGROUP_RESERVED flag in BTRFS_I(inode)->io_failure_tree, other
than the correct BTRFS_I(inode)->io_tree.
Since io_failure_tree is never marked with that flag,
btrfs_qgroup_free_data() will not free any data reserved space at all,
causing a leakage.

This type of error handling can only be triggered by errors outside of
qgroup code. So EDQUOT error from qgroup can't trigger it.

[FIX]
Fix the wrong target io_tree.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: bc42bda223 ("btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-27 15:24:28 +02:00
Pavel Shilovsky a016e2794f CIFS: Fix oplock handling for SMB 2.1+ protocols
There may be situations when a server negotiates SMB 2.1
protocol version or higher but responds to a CREATE request
with an oplock rather than a lease.

Currently the client doesn't handle such a case correctly:
when another CREATE comes in the server sends an oplock
break to the initial CREATE and the client doesn't send
an ack back due to a wrong caching level being set (READ
instead of RWH). Missing an oplock break ack makes the
server wait until the break times out which dramatically
increases the latency of the second CREATE.

Fix this by properly detecting oplocks when using SMB 2.1
protocol version and higher.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2019-09-26 16:42:44 -05:00
Steve French ff3ee62a55 smb3: missing ACL related flags
Various SMB3 ACL related flags (for security descriptor and
ACEs for example) were missing and some fields are different
in SMB3 and CIFS. Update cifsacl.h definitions based on
current MS-DTYP specification.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2019-09-26 16:37:43 -05:00
Linus Torvalds 972a2bf7df NFS Client Updates for Linux 5.3
Stable bugfixes:
 - Dequeue the request from the receive queue while we're re-encoding # v4.20+
 - Fix buffer handling of GSS MIC without slack # 5.1
 
 Features:
 - Increase xprtrdma maximum transport header and slot table sizes
 - Add support for nfs4_call_sync() calls using a custom rpc_task_struct
 - Optimize the default readahead size
 - Enable pNFS filelayout LAYOUTGET on OPEN
 
 Other bugfixes and cleanups:
 - Fix possible null-pointer dereferences and memory leaks
 - Various NFS over RDMA cleanups
 - Various NFS over RDMA comment updates
 - Don't receive TCP data into a reset request buffer
 - Don't try to parse incomplete RPC messages
 - Fix congestion window race with disconnect
 - Clean up pNFS return-on-close error handling
 - Fixes for NFS4ERR_OLD_STATEID handling
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl2NC04ACgkQ18tUv7Cl
 QOs4Tg//bAlGs+dIKixAmeMKmTd6I34laUnuyV/12yPQDgo6bryLrTngfe2BYvmG
 2l+8H7yHfR4/gQE4vhR0c15xFgu6pvjBGR0/nNRaXienIPXO4xsQkcaxVA7SFRY2
 HjffZwyoBfjyRps0jL+2sTsKbRtSkf9Dn+BONRgesg51jK1jyWkXqXpmgi4uMO4i
 ojpTrW81dwo7Yhv08U2A/Q1ifMJ8F9dVYuL5sm+fEbVI/Nxoz766qyB8rs8+b4Xj
 3gkfyh/Y1zoMmu6c+r2Q67rhj9WYbDKpa6HH9yX1zM/RLTiU7czMX+kjuQuOHWxY
 YiEk73NjJ48WJEep3odess1q/6WiAXX7UiJM1SnDFgAa9NZMdfhqMm6XduNO1m60
 sy0i8AdxdQciWYexOXMsBuDUCzlcoj4WYs1QGpY3uqO1MznQS/QUfu65fx8CzaT5
 snm6ki5ivqXH/js/0Z4MX2n/sd1PGJ5ynMkekxJ8G3gw+GC/oeSeGNawfedifLKK
 OdzyDdeiel5Me1p4I28j1WYVLHvtFmEWEU9oytdG0D/rjC/pgYgW/NYvAao8lQ4Z
 06wdcyAM66ViAPrbYeE7Bx4jy8zYRkiw6Y3kIbLgrlMugu3BhIW5Mi3BsgL4f4am
 KsqkzUqPZMCOVwDuUILSuPp4uHaR+JTJttywiLniTL6reF5kTiA=
 =4Ey6
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Dequeue the request from the receive queue while we're re-encoding
     # v4.20+
   - Fix buffer handling of GSS MIC without slack # 5.1

  Features:
   - Increase xprtrdma maximum transport header and slot table sizes
   - Add support for nfs4_call_sync() calls using a custom
     rpc_task_struct
   - Optimize the default readahead size
   - Enable pNFS filelayout LAYOUTGET on OPEN

  Other bugfixes and cleanups:
   - Fix possible null-pointer dereferences and memory leaks
   - Various NFS over RDMA cleanups
   - Various NFS over RDMA comment updates
   - Don't receive TCP data into a reset request buffer
   - Don't try to parse incomplete RPC messages
   - Fix congestion window race with disconnect
   - Clean up pNFS return-on-close error handling
   - Fixes for NFS4ERR_OLD_STATEID handling"

* tag 'nfs-for-5.4-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (53 commits)
  pNFS/filelayout: enable LAYOUTGET on OPEN
  NFS: Optimise the default readahead size
  NFSv4: Handle NFS4ERR_OLD_STATEID in LOCKU
  NFSv4: Handle NFS4ERR_OLD_STATEID in CLOSE/OPEN_DOWNGRADE
  NFSv4: Fix OPEN_DOWNGRADE error handling
  pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid
  NFSv4: Add a helper to increment stateid seqids
  NFSv4: Handle RPC level errors in LAYOUTRETURN
  NFSv4: Handle NFS4ERR_DELAY correctly in return-on-close
  NFSv4: Clean up pNFS return-on-close error handling
  pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors
  NFS: remove unused check for negative dentry
  NFSv3: use nfs_add_or_obtain() to create and reference inodes
  NFS: Refactor nfs_instantiate() for dentry referencing callers
  SUNRPC: Fix congestion window race with disconnect
  SUNRPC: Don't try to parse incomplete RPC messages
  SUNRPC: Rename xdr_buf_read_netobj to xdr_buf_read_mic
  SUNRPC: Fix buffer handling of GSS MIC without slack
  SUNRPC: RPC level errors should always set task->tk_rpc_status
  SUNRPC: Don't receive TCP data into a request buffer that has been reset
  ...
2019-09-26 12:20:14 -07:00
Kees Cook 7be3cb019d binfmt_elf: Do not move brk for INTERP-less ET_EXEC
When brk was moved for binaries without an interpreter, it should have
been limited to ET_DYN only. In other words, the special case was an
ET_DYN that lacks an INTERP, not just an executable that lacks INTERP.
The bug manifested for giant static executables, where the brk would end
up in the middle of the text area on 32-bit architectures.

Reported-and-tested-by: Richard Kojedzinszky <richard@kojedz.in>
Fixes: bbdc6076d2 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 11:38:55 -07:00
Linus Torvalds 2268419e4c Changes since last update:
- Minor code cleanups.
 - Fix a superblock logging error.
 - Ensure that collapse range converts the data fork to extents format
   when necessary.
 - Revert the ALLOC_USERDATA cleanup because it caused subtle
   behavior regressions.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2KR6oACgkQ+H93GTRK
 tOvtVA//Q6iqCTYnvbIHgNMMP6WvD/qL/G1bTm+ql7KJEyd99PC6EATY3IW2DP2Q
 gwj1VDid362eF65XiLoe/4zjhj1SmrpX92B9jDBt3n47sHLnDIjtT+54jo0n51xm
 MW+79qPC1luT16pdgurkJo7jGrR5Zj5xcUlNj8qVfH9fD9ZAUu2+OzTD+iWn8qvu
 L4geuMG2WKinxlfP6v4y5Fn7X9PiWM14N2N2p+N6ITuhdxyKch/EohI/P0syUcTJ
 Ad1za0PFFI+BtI4F9kWWBgoGe0oCqCeU6xCGkK5HJ9XTPmJrlU6WKk5lQJs6T92z
 OEAQfWkb/Yc7YbSP+CATH1vBIYQZzvhVXkHd0q1JBqhLOPeBnLhO/gVefjSDBgIq
 KlYiGlCn3Rme28KNUX+o9/JAsOVpwnxL1GIUAu4v0V2GQQIx7G0WCykiMgRjl0sm
 iCsarAev1eHPVsRWArdR1fFrOQ206tvMTz4zSREYVFMsHn0Zq3c9j3R78azMaUA6
 tbAfPlp7p0M90u3uC4FrZHyPPu6VHLNuE82zAwi8oLD1kCO9wxluF5gRg1UVOiB5
 Rp4Y/6BeO75039aZka7+5iQzzhiudezsXK138YGBx40nW+2nW2ZLWIc3Kkm2mGl3
 aJ0xsMusAv2q/lVKyg7280EjlkPYcHLTkaZ9+hCkmd3A9jcb/l8=
 =D/XD
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "There are a couple of bug fixes and some small code cleanups that came
  in recently:

   - Minor code cleanups

   - Fix a superblock logging error

   - Ensure that collapse range converts the data fork to extents format
     when necessary

   - Revert the ALLOC_USERDATA cleanup because it caused subtle behavior
     regressions"

* tag 'xfs-5.4-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: avoid unused to_mp() function warning
  xfs: log proper length of superblock
  xfs: revert 1baa2800e6 ("xfs: remove the unused XFS_ALLOC_USERDATA flag")
  xfs: removed unneeded variable
  xfs: convert inode to extent format after extent merge due to shift
2019-09-26 11:36:20 -07:00
Linus Torvalds dadedd8563 Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull jffs2 fix from Al Viro:
 "braino fix for mount API conversion for jffs2"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  jffs2: Fix mounting under new mount API
2019-09-26 11:33:30 -07:00
Linus Torvalds cbafe18c71 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - almost all of the rest of -mm

 - various other subsystems

Subsystems affected by this patch series:
  memcg, misc, core-kernel, lib, checkpatch, reiserfs, fat, fork,
  cpumask, kexec, uaccess, kconfig, kgdb, bug, ipc, lzo, kasan, madvise,
  cleanups, pagemap

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (77 commits)
  arch/sparc/include/asm/pgtable_64.h: fix build
  mm: treewide: clarify pgtable_page_{ctor,dtor}() naming
  ntfs: remove (un)?likely() from IS_ERR() conditions
  IB/hfi1: remove unlikely() from IS_ERR*() condition
  xfs: remove unlikely() from WARN_ON() condition
  wimax/i2400m: remove unlikely() from WARN*() condition
  fs: remove unlikely() from WARN_ON() condition
  xen/events: remove unlikely() from WARN() condition
  checkpatch: check for nested (un)?likely() calls
  hexagon: drop empty and unused free_initrd_mem
  mm: factor out common parts between MADV_COLD and MADV_PAGEOUT
  mm: introduce MADV_PAGEOUT
  mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
  mm: introduce MADV_COLD
  mm: untag user pointers in mmap/munmap/mremap/brk
  vfio/type1: untag user pointers in vaddr_get_pfn
  tee/shm: untag user pointers in tee_shm_register
  media/v4l2-core: untag user pointers in videobuf_dma_contig_user_get
  drm/radeon: untag user pointers in radeon_gem_userptr_ioctl
  drm/amdgpu: untag user pointers
  ...
2019-09-26 10:29:42 -07:00
Denis Efremov cc22c800e1 ntfs: remove (un)?likely() from IS_ERR() conditions
"likely(!IS_ERR(x))" is excessive. IS_ERR() already uses
unlikely() internally.

Link: http://lkml.kernel.org/r/20190829165025.15750-11-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:44 -07:00
Denis Efremov 14ed868807 xfs: remove unlikely() from WARN_ON() condition
"unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
internally.

Link: http://lkml.kernel.org/r/20190829165025.15750-7-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:30 -07:00
Denis Efremov 7159d54418 fs: remove unlikely() from WARN_ON() condition
"unlikely(WARN_ON(x))" is excessive. WARN_ON() already uses unlikely()
internally.

Link: http://lkml.kernel.org/r/20190829165025.15750-5-efremov@linux.com
Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-26 10:10:30 -07:00
David Howells a3bc18a48e jffs2: Fix mounting under new mount API
The mounting of jffs2 is broken due to the changes from the new mount API
because it specifies a "source" operation, but then doesn't actually
process it.  But because it specified it, it doesn't return -ENOPARAM and
the caller doesn't process it either and the source gets lost.

Fix this by simply removing the source parameter from jffs2 and letting the
VFS deal with it in the default manner.

To test it, enable CONFIG_MTD_MTDRAM and allow the default size and erase
block size parameters, then try and mount the /dev/mtdblock<N> file that
that creates as jffs2.  No need to initialise it.

Fixes: ec10a24f10 ("vfs: Convert jffs2 to use the new mount API")
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: David Woodhouse <dwmw2@infradead.org>
cc: Richard Weinberger <richard@nod.at>
cc: linux-mtd@lists.infradead.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-09-26 10:26:55 -04:00
Jens Axboe bda521624e io_uring: make CQ ring wakeups be more efficient
For batched IO, it's not uncommon for waiters to ask for more than 1
IO to complete before being woken up. This is a problem with
wait_event() since tasks will get woken for every IO that completes,
re-check condition, then go back to sleep. For batch counts on the
order of what you do for high IOPS, that can result in 10s of extra
wakeups for the waiting task.

Add a private wake function that checks for the wake up count criteria
being met before calling autoremove_wake_function(). Pavel reports that
one test case he has runs 40% faster with proper batching of wakeups.

Reported-by: Pavel Begunkov <asml.silence@gmail.com>
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-09-26 03:55:40 -06:00
Steve French c3ca78e217 smb3: pass mode bits into create calls
We need to populate an ACL (security descriptor open context)
on file and directory correct.  This patch passes in the
mode.  Followon patch will build the open context and the
security descriptor (from the mode) that goes in the open
context.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2019-09-26 02:06:42 -05:00
Andrey Konovalov 7d0325749a userfaultfd: untag user pointers
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

userfaultfd code use provided user pointers for vma lookups, which can
only by done with untagged pointers.

Untag user pointers in validate_range().

Link: http://lkml.kernel.org/r/cdc59ddd7011012ca2e689bc88c3b65b1ea7e413.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:41 -07:00
Andrey Konovalov ed8a66b832 fs/namespace: untag user pointers in copy_mount_options
This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

In copy_mount_options a user address is being subtracted from TASK_SIZE.
If the address is lower than TASK_SIZE, the size is calculated to not
allow the exact_copy_from_user() call to cross TASK_SIZE boundary.
However if the address is tagged, then the size will be calculated
incorrectly.

Untag the address before subtracting.

Link: http://lkml.kernel.org/r/1de225e4a54204bfd7f25dac2635e31aa4aa1d90.1563904656.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Auger <eric.auger@redhat.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jens Wiklander <jens.wiklander@linaro.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:41 -07:00
Markus Elfring aadc4e01db fat: delete an unnecessary check before brelse()
brelse() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/cfff3b81-fb5d-af26-7b5e-724266509045@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Jason Yan b25bab1722 fs/reiserfs/do_balan.c: remove set but not used variable
Fix the following gcc warning:

fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:629:6: warning: variable ret set but not used
[-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20190827032932.46622-2-yanaijie@huawei.com
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Jason Yan 3e9fd5a48c fs/reiserfs/journal.c: remove set but not used variable
Fix the following gcc warning:

fs/reiserfs/journal.c: In function flush_used_journal_lists:
fs/reiserfs/journal.c:1791:6: warning: variable ret set but not used
[-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20190827032932.46622-1-yanaijie@huawei.com
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Cc: zhengbin <zhengbin13@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin da5184c2ab fs/reiserfs/do_balan.c: remove set but not used variables
fs/reiserfs/do_balan.c: In function balance_leaf_when_delete:
fs/reiserfs/do_balan.c:245:20: warning: variable ih set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_left:
fs/reiserfs/do_balan.c:301:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_insert_right:
fs/reiserfs/do_balan.c:649:7: warning: variable version set but not used [-Wunused-but-set-variable]
fs/reiserfs/do_balan.c: In function balance_leaf_new_nodes_insert:
fs/reiserfs/do_balan.c:953:7: warning: variable version set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-8-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin 4fadcd1c14 fs/reiserfs/fix_node.c: remove set but not used variables
fs/reiserfs/fix_node.c: In function get_num_ver:
fs/reiserfs/fix_node.c:379:6: warning: variable cur_free set but not used [-Wunused-but-set-variable]
fs/reiserfs/fix_node.c: In function dc_check_balance_internal:
fs/reiserfs/fix_node.c:1737:6: warning: variable maxsize set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-7-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin 73fbff5eea fs/reiserfs/prints.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/prints.c: In function check_internal_block_head:
fs/reiserfs/prints.c:749:21: warning: variable blkh set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-6-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin 4a70aebb12 fs/reiserfs/objectid.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/objectid.c: In function reiserfs_convert_objectid_map_v1:
fs/reiserfs/objectid.c:186:25: warning: variable new_objectid_map set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-5-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin d4a1a857e3 fs/reiserfs/lbalance.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/lbalance.c: In function leaf_paste_entries:
fs/reiserfs/lbalance.c:1325:9: warning: variable old_entry_num set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-4-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin 66985cb9ee fs/reiserfs/stree.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/stree.c: In function search_by_key:
fs/reiserfs/stree.c:596:6: warning: variable right_neighbor_of_leaf_node set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-3-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
zhengbin 6e9ca45f77 fs/reiserfs/journal.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/reiserfs/journal.c: In function flush_older_commits:
fs/reiserfs/journal.c:894:15: warning: variable first_trans_id set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function flush_journal_list:
fs/reiserfs/journal.c:1354:38: warning: variable last set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function do_journal_release:
fs/reiserfs/journal.c:1916:6: warning: variable flushed set but not used [-Wunused-but-set-variable]
fs/reiserfs/journal.c: In function do_journal_end:
fs/reiserfs/journal.c:3993:6: warning: variable old_start set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566379929-118398-2-git-send-email-zhengbin13@huawei.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:40 -07:00
Jia-Ju Bai d256085be1 fs: reiserfs: remove unnecessary check of bh in remove_from_transaction()
On lines 3430-3434, bh has been assured to be non-null:
    cn = get_journal_hash_dev(sb, journal->j_hash_table, blocknr);
    if (!cn || !cn->bh) {
        return ret;
    }
    bh = cn->bh;

Thus, the check of bh on line 3447 is unnecessary and can be removed.
Thank Andrew Morton for good advice.

Link: http://lkml.kernel.org/r/20190727084019.11307-1-baijiaju1990@gmail.com
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Cc: Bharath Vedartham <linux.bhar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-25 17:51:39 -07:00
Linus Torvalds f41def3971 The highlights are:
- automatic recovery of a blacklisted filesystem session (Zheng Yan).
   This is disabled by default and can be enabled by mounting with the
   new "recover_session=clean" option.
 
 - serialize buffered reads and O_DIRECT writes (Jeff Layton).  Care is
   taken to avoid serializing O_DIRECT reads and writes with each other,
   this is based on the exclusion scheme from NFS.
 
 - handle large osdmaps better in the face of fragmented memory (myself)
 
 - don't limit what security.* xattrs can be get or set (Jeff Layton).
   We were overly restrictive here, unnecessarily preventing things like
   file capability sets stored in security.capability from working.
 
 - allow copy_file_range() within the same inode and across different
   filesystems within the same cluster (Luis Henriques)
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl2LoD8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzixRYB/9H5S4fif8Pn9eiXkIJiT9UZR/o7S1k
 ikfQNPeDxlBLKnoZXpDp2HqCu1/YuCcJ0zpZzPGGrKECZb7r+NaayxhmEXAZ+Vsg
 YwsO3eNHBbb58pe9T4oiHp19sflwcTOeNwg8wlvmvrgfBupFz2pU8Xm72EdFyoYm
 tP0QNTOCAuQK3pJcgozaptAO1TzBL3LomyVM0YzAKcumgMg47zALpaSLWJLGtDLM
 5+5WLvcVfBGLVv60h4B62ldS39eBxqTsFodcRMUaqAsnhLK70HVfKlwR3GgtZggr
 PDqbsuIfw/O3b65U2XDKZt1P9dyG3OE/ucueduXUxJPYNGmooEE+PpE+
 =DRVP
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The highlights are:

   - automatic recovery of a blacklisted filesystem session (Zheng Yan).
     This is disabled by default and can be enabled by mounting with the
     new "recover_session=clean" option.

   - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is
     taken to avoid serializing O_DIRECT reads and writes with each
     other, this is based on the exclusion scheme from NFS.

   - handle large osdmaps better in the face of fragmented memory
     (myself)

   - don't limit what security.* xattrs can be get or set (Jeff Layton).
     We were overly restrictive here, unnecessarily preventing things
     like file capability sets stored in security.capability from
     working.

   - allow copy_file_range() within the same inode and across different
     filesystems within the same cluster (Luis Henriques)"

* tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits)
  ceph: call ceph_mdsc_destroy from destroy_fs_client
  libceph: use ceph_kvmalloc() for osdmap arrays
  libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()
  ceph: allow object copies across different filesystems in the same cluster
  ceph: include ceph_debug.h in cache.c
  ceph: move static keyword to the front of declarations
  rbd: pull rbd_img_request_create() dout out into the callers
  ceph: reconnect connection if session hang in opening state
  libceph: drop unused con parameter of calc_target()
  ceph: use release_pages() directly
  rbd: fix response length parameter for encoded strings
  ceph: allow arbitrary security.* xattrs
  ceph: only set CEPH_I_SEC_INITED if we got a MAC label
  ceph: turn ceph_security_invalidate_secctx into static inline
  ceph: add buffered/direct exclusionary locking for reads and writes
  libceph: handle OSD op ceph_pagelist_append() errors
  ceph: don't return a value from void function
  ceph: don't freeze during write page faults
  ceph: update the mtime when truncating up
  ceph: fix indentation in __get_snap_name()
  ...
2019-09-25 10:21:13 -07:00
Linus Torvalds 7b1373dd6e fuse update for 5.4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXYod6wAKCRDh3BK/laaZ
 PG3fAP9WXuvUeYh3X7ThQPa2D33VCIMJRd6t+1TVSSc/H8P3dAD/ehN5HIWjnmzz
 iZFc3zDtO9UCJUe23IZomblxOQbu6Qk=
 =I0S2
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse updates from Miklos Szeredi:

 - Continue separating the transport (user/kernel communication) and the
   filesystem layers of fuse. Getting rid of most layering violations
   will allow for easier cleanup and optimization later on.

 - Prepare for the addition of the virtio-fs filesystem. The actual
   filesystem will be introduced by a separate pull request.

 - Convert to new mount API.

 - Various fixes, optimizations and cleanups.

* tag 'fuse-update-5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (55 commits)
  fuse: Make fuse_args_to_req static
  fuse: fix memleak in cuse_channel_open
  fuse: fix beyond-end-of-page access in fuse_parse_cache()
  fuse: unexport fuse_put_request
  fuse: kmemcg account fs data
  fuse: on 64-bit store time in d_fsdata directly
  fuse: fix missing unlock_page in fuse_writepage()
  fuse: reserve byteswapped init opcodes
  fuse: allow skipping control interface and forced unmount
  fuse: dissociate DESTROY from fuseblk
  fuse: delete dentry if timeout is zero
  fuse: separate fuse device allocation and installation in fuse_conn
  fuse: add fuse_iqueue_ops callbacks
  fuse: extract fuse_fill_super_common()
  fuse: export fuse_dequeue_forget() function
  fuse: export fuse_get_unique()
  fuse: export fuse_send_init_request()
  fuse: export fuse_len_args()
  fuse: export fuse_end_request()
  fuse: fix request limit
  ...
2019-09-25 09:55:59 -07:00
Linus Torvalds 4ef5b13a29 New code for 5.4:
- Report both io errors and short io results to the directio endio
   handler.
 - Allow directio callers to pass an ops structure to iomap_dio_rw.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl2EQBwACgkQ+H93GTRK
 tOvZ0w//R00Dya3pZmO+HQX/Kovo/AOKCct/gPaLvRf4DVvWZ3ZH5lPtDUSKCHBV
 Pw+potTuabwtp//mZrQ84ZMoQYOmxBmxchquJ2L49qWT/mZ1BDqBA2sxQOCehffO
 Dlv715eOPVhkVHOEAHv+BsatxlFDkuKgGcwUqxE4LNHWGrvmjm6rBWPCnxQMTwX9
 BhGo/0WG6D0leWFSMoUIpcD4Oj302/O43RX4lJsyl0Jd5pKJD/RFtaKqwIeE+pj9
 36MnoQYtT5e9BTm1/zzHRVpGgLDDWTY+IClgZNlZprU/Em+TtOGMWitWzob3wWcU
 QHj5bJ9NiWBT6QJ192i0+I0thSTwW/peOAJrkBOW3YhBkH3UIKSzaPiwCBoSEUKS
 fuIOJkRHFW7HTjKSw6wUCJJ6ShD+rNPOoaJ9Ivzz7x2HGEqb1al3EIumu6oDJgyq
 BdnLEecN/JSxPnakpSGAnvlCjZiYbJ95E0JfapzMy7HYpMXfbzZZ4JO9Ld0hQflR
 sMrGwL82tqE9ish1yRr+2nNEY5PqVJqV0XXU3dZz3HVknNuAvqrqXDXgIZT142zS
 s9vGAW+2XphyUKHH9pImQKw142n2xiZK+Pd2AuiO4I06E0EkAvN/n9CN7oJ4Cr4F
 z9rQKNXNx8OSjFtryYe6+IzQ0qQtl2OP7o88K91jYArKP17D8ds=
 =zkFp
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull iomap updates from Darrick Wong:
 "After last week's failed pull request attempt, I scuttled everything
  in the branch except for the directio endio api changes, which were
  trivial. Everything else will simply have to wait for the next cycle.

  Summary:

   - Report both io errors and short io results to the directio endio
     handler.

   - Allow directio callers to pass an ops structure to iomap_dio_rw"

* tag 'iomap-5.4-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: move the iomap_dio_rw ->end_io callback into a structure
  iomap: split size and error for iomap_dio_rw ->end_io
2019-09-25 09:01:43 -07:00
Mathieu Desnoyers 227a4aadc7 sched/membarrier: Fix p->mm->membarrier_state racy load
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-25 17:42:30 +02:00
Qu Wenruo fab2735955 btrfs: Fix a regression which we can't convert to SINGLE profile
[BUG]
With v5.3 kernel, we can't convert to SINGLE profile:

  # btrfs balance start -f -dconvert=single $mnt
  ERROR: error during balancing '/mnt/btrfs': Invalid argument
  # dmesg -t | tail
  validate_convert_profile: data profile=0x1000000000000 allowed=0x20 is_valid=1 final=0x1000000000000 ret=1
  BTRFS error (device dm-3): balance: invalid convert data profile single

[CAUSE]
With the extra debug output added, it shows that the @allowed bit is
lacking the special in-memory only SINGLE profile bit.

Thus we fail at that (profile & ~allowed) check.

This regression is caused by commit 081db89b13 ("btrfs: use raid_attr
to get allowed profiles for balance conversion") and the fact that we
don't use any bit to indicate SINGLE profile on-disk, but uses special
in-memory only bit to help distinguish different profiles.

[FIX]
Add that BTRFS_AVAIL_ALLOC_BIT_SINGLE to @allowed, so the code should be
the same as it was and fix the regression.

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: 081db89b13 ("btrfs: use raid_attr to get allowed profiles for balance conversion")
CC: stable@vger.kernel.org # 5.3+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-25 16:00:37 +02:00
Qu Wenruo 1fac4a5437 btrfs: relocation: fix use-after-free on dead relocation roots
[BUG]
One user reported a reproducible KASAN report about use-after-free:

  BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
  BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
  ==================================================================
  BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
  Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579

  CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P           OE     5.2.11-arch1-1-kasan #4
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
  Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
  Call Trace:
   dump_stack+0x7b/0xba
   print_address_description+0x6c/0x22e
   ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   __kasan_report.cold+0x1b/0x3b
   ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   kasan_report+0x12/0x17
   __asan_report_store8_noabort+0x17/0x20
   btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
   record_root_in_trans+0x2a0/0x370 [btrfs]
   btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
   start_transaction+0x1ab/0xe90 [btrfs]
   btrfs_join_transaction+0x1d/0x20 [btrfs]
   btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
   ? lock_repin_lock+0x400/0x400
   ? __kmem_cache_shutdown.cold+0x140/0x1ad
   ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
   finish_ordered_fn+0x15/0x20 [btrfs]
   normal_work_helper+0x1bd/0xca0 [btrfs]
   ? process_one_work+0x819/0x1720
   ? kasan_check_read+0x11/0x20
   btrfs_endio_write_helper+0x12/0x20 [btrfs]
   process_one_work+0x8c9/0x1720
   ? pwq_dec_nr_in_flight+0x2f0/0x2f0
   ? worker_thread+0x1d9/0x1030
   worker_thread+0x98/0x1030
   kthread+0x2bb/0x3b0
   ? process_one_work+0x1720/0x1720
   ? kthread_park+0x120/0x120
   ret_from_fork+0x35/0x40

  Allocated by task 369692:
   __kasan_kmalloc.part.0+0x44/0xc0
   __kasan_kmalloc.constprop.0+0xba/0xc0
   kasan_kmalloc+0x9/0x10
   kmem_cache_alloc_trace+0x138/0x260
   btrfs_read_tree_root+0x92/0x360 [btrfs]
   btrfs_read_fs_root+0x10/0xb0 [btrfs]
   create_reloc_root+0x47d/0xa10 [btrfs]
   btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
   record_root_in_trans+0x2a0/0x370 [btrfs]
   btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
   start_transaction+0x1ab/0xe90 [btrfs]
   btrfs_start_transaction+0x1e/0x20 [btrfs]
   __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
   btrfs_prealloc_file_range+0x13/0x20 [btrfs]
   prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
   relocate_file_extent_cluster+0x193/0xc30 [btrfs]
   relocate_data_extent+0x1f8/0x490 [btrfs]
   relocate_block_group+0x600/0x1060 [btrfs]
   btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
   btrfs_relocate_chunk+0x9e/0x180 [btrfs]
   btrfs_balance+0x14e4/0x2fc0 [btrfs]
   btrfs_ioctl_balance+0x47f/0x640 [btrfs]
   btrfs_ioctl+0x119d/0x8380 [btrfs]
   do_vfs_ioctl+0x9f5/0x1060
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x73/0xb0
   do_syscall_64+0xa5/0x370
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Freed by task 369692:
   __kasan_slab_free+0x14f/0x210
   kasan_slab_free+0xe/0x10
   kfree+0xd8/0x270
   btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
   clean_dirty_subvols+0x227/0x340 [btrfs]
   relocate_block_group+0x972/0x1060 [btrfs]
   btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
   btrfs_relocate_chunk+0x9e/0x180 [btrfs]
   btrfs_balance+0x14e4/0x2fc0 [btrfs]
   btrfs_ioctl_balance+0x47f/0x640 [btrfs]
   btrfs_ioctl+0x119d/0x8380 [btrfs]
   do_vfs_ioctl+0x9f5/0x1060
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x73/0xb0
   do_syscall_64+0xa5/0x370
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  The buggy address belongs to the object at ffff88856f671100
   which belongs to the cache kmalloc-4k of size 4096
  The buggy address is located 1552 bytes inside of
   4096-byte region [ffff88856f671100, ffff88856f672100)
  The buggy address belongs to the page:
  page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
  flags: 0x2ffff0000010200(slab|head)
  raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
  raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
  page dumped because: kasan: bad access detected

  Memory state around the buggy address:
   ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                           ^
   ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================
  BTRFS info (device sdi1): 1 enospc errors during balance
  BTRFS info (device sdi1): balance: ended with status: -28

[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)

That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.

The following race could cause the use-after-free problem:

                CPU1              |                CPU2
--------------------------------------------------------------------------
                                  | relocate_block_group()
                                  | |- unset_reloc_control(rc)
                                  | |- btrfs_commit_transaction()
btrfs_finish_ordered_io()         | |- clean_dirty_subvols()
|- btrfs_join_transaction()       |    |
   |- record_root_in_trans()      |    |
      |- btrfs_init_reloc_root()  |    |
         |- if (root->reloc_root) |    |
         |                        |    |- root->reloc_root = NULL
         |                        |    |- btrfs_drop_snapshot(reloc_root);
         |- reloc_root->last_trans|
                 = trans->transid |
	    ^^^^^^^^^^^^^^^^^^^^^^
            Use after free

[FIX]
Fix it by the following modifications:

- Test if the root has dead reloc tree before accessing root->reloc_root
  If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
  create or update root->reloc_tree

- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
  reloc tree
  To co-operate with above modification, so as long as
  BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
  reloc tree at record_root_in_trans().

Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-25 15:23:42 +02:00
Steve French 131ea1ed33 smb3: Add missing reparse tags
Additional reparse tags were described for WSL and file sync.
Add missing defines for these tags. Some will be useful for
POSIX extensions (as discussed at Storage Developer Conference).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2019-09-24 23:31:32 -05:00
Linus Torvalds b6cb84b4fc for-5.4/io_uring-2019-09-24
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl2J9DEQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpqiqD/9D9etX/wXJ54bNKXpQqba3tUQy8e5SLnd4
 q1ZzlJhvnmfNz6GqdH/1c+5Q2WBTLEb8PLJ4Z8d+1I8eWxpaCOzz362a7NA7x4l2
 d1a8MV8gQy0moJKX7kTL9Pn44d2HofxyhgmHFKkvm/j7P4HIKbAuckzlcGoOS9TS
 gV/bnroZ8eSrAfmswRRrxaWIoCe0iB+4x+nvoQjdVz8WXHcpBD8Y7hi5Ulid3Q/7
 vU4WJubwKrP1dczk1L64CnYN4QfP2k4pyyxQwEBujueNuAxPKTRD3GXQc58c63Bp
 GElE7FGxMUT4JhaY7U8G2pUSdAb2WiyMc9vd4Au7IkX0TEdQeRSHWkocnt2O9OPp
 nk7XqPNK5MSskE4T9fjsJkE4KiOHhslixxc5qY6X9g5GuPw03ITM4VBwvU8aYxbf
 p++jcWff10ovllck3uKvMpzU1IS20B4mSYYTyj990diocceEN6nsJsYv9wW4lMpk
 lBL8b/Q3TaoMG58C5BiZvlf6JQqdKpN5UJhBmyrZAv5ozvgBAcSeLK41V6oxzGzo
 y/3qcJeuIE6Exdm4DqWBHp7L82FcednO/EDqz3WnGx9v67kUz5VFqP7lIGNmXKM3
 2PaBMWv7BU/Fx087KOsBOI5XWm3aVw+yTkHT2aitVQW0su+UDvb/BvFL0sqbBe3O
 qm60F3TWkQ==
 =/w5G
 -----END PGP SIGNATURE-----

Merge tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block

Pull more io_uring updates from Jens Axboe:
 "A collection of later fixes and additions, that weren't quite ready
  for pushing out with the initial pull request.

  This contains:

   - Fix potential use-after-free of shadow requests (Jackie)

   - Fix potential OOM crash in request allocation (Jackie)

   - kmalloc+memcpy -> kmemdup cleanup (Jackie)

   - Fix poll crash regression (me)

   - Fix SQ thread not being nice and giving up CPU for !PREEMPT (me)

   - Add support for timeouts, making it easier to do epoll_wait()
     conversions, for instance (me)

   - Ensure io_uring works without f_ops->read_iter() and
     f_ops->write_iter() (me)"

* tag 'for-5.4/io_uring-2019-09-24' of git://git.kernel.dk/linux-block:
  io_uring: correctly handle non ->{read,write}_iter() file_operations
  io_uring: IORING_OP_TIMEOUT support
  io_uring: use cond_resched() in sqthread
  io_uring: fix potential crash issue due to io_get_req failure
  io_uring: ensure poll commands clear ->sqe
  io_uring: fix use-after-free of shadow_req
  io_uring: use kmemdup instead of kmalloc and memcpy
2019-09-24 16:40:21 -07:00
Linus Torvalds 9c9fa97a8e Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few hot fixes

 - ocfs2 updates

 - almost all of -mm (slab-generic, slab, slub, kmemleak, kasan,
   cleanups, debug, pagecache, memcg, gup, pagemap, memory-hotplug,
   sparsemem, vmalloc, initialization, z3fold, compaction, mempolicy,
   oom-kill, hugetlb, migration, thp, mmap, madvise, shmem, zswap,
   zsmalloc)

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  mm/zsmalloc.c: fix a -Wunused-function warning
  zswap: do not map same object twice
  zswap: use movable memory if zpool support allocate movable memory
  zpool: add malloc_support_movable to zpool_driver
  shmem: fix obsolete comment in shmem_getpage_gfp()
  mm/madvise: reduce code duplication in error handling paths
  mm: mmap: increase sockets maximum memory size pgoff for 32bits
  mm/mmap.c: refine find_vma_prev() with rb_last()
  riscv: make mmap allocation top-down by default
  mips: use generic mmap top-down layout and brk randomization
  mips: replace arch specific way to determine 32bit task with generic version
  mips: adjust brk randomization offset to fit generic version
  mips: use STACK_TOP when computing mmap base address
  mips: properly account for stack randomization and stack guard gap
  arm: use generic mmap top-down layout and brk randomization
  arm: use STACK_TOP when computing mmap base address
  arm: properly account for stack randomization and stack guard gap
  arm64, mm: make randomization selected by generic topdown mmap layout
  arm64, mm: move generic mmap layout functions to mm
  arm64: consider stack randomization for mmap base only when necessary
  ...
2019-09-24 16:10:23 -07:00
Alexandre Ghiti 649775be63 mm, fs: move randomize_stack_top from fs to mm
Patch series "Provide generic top-down mmap layout functions", v6.

This series introduces generic functions to make top-down mmap layout
easily accessible to architectures, in particular riscv which was the
initial goal of this series.  The generic implementation was taken from
arm64 and used successively by arm, mips and finally riscv.

Note that in addition the series fixes 2 issues:

- stack randomization was taken into account even if not necessary.

- [1] fixed an issue with mmap base which did not take into account
  randomization but did not report it to arm and mips, so by moving arm64
  into a generic library, this problem is now fixed for both
  architectures.

This work is an effort to factorize architecture functions to avoid code
duplication and oversights as in [1].

[1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html

This patch (of 14):

This preparatory commit moves this function so that further introduction
of generic topdown mmap layout is contained only in mm/util.c.

Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr
Signed-off-by: Alexandre Ghiti <alex@ghiti.fr>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Burton <paul.burton@mips.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu 09d91cda0e mm,thp: avoid writes to file with THP in pagecache
In previous patch, an application could put part of its text section in
THP via madvise().  These THPs will be protected from writes when the
application is still running (TXTBSY).  However, after the application
exits, the file is available for writes.

This patch avoids writes to file THP by dropping page cache for the file
when the file is open for write.  A new counter nr_thps is added to struct
address_space.  In do_dentry_open(), if the file is open for write and
nr_thps is non-zero, we drop page cache for the whole file.

Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Song Liu 60fbf0ab5d mm,thp: stats for file backed THP
In preparation for non-shmem THP, this patch adds a few stats and exposes
them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and
/proc/<pid>/task/<tid>/smaps.

This patch is mostly a rewrite of Kirill A.  Shutemov's earlier version:
https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/

Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com
Signed-off-by: Song Liu <songliubraving@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:11 -07:00
Nicholas Piggin 13224794cb mm: remove quicklist page table caches
Patch series "mm: remove quicklist page table caches".

A while ago Nicholas proposed to remove quicklist page table caches [1].

I've rebased his patch on the curren upstream and switched ia64 and sh to
use generic versions of PTE allocation.

[1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

This patch (of 3):

Remove page table allocator "quicklists".  These have been around for a
long time, but have not got much traction in the last decade and are only
used on ia64 and sh architectures.

The numbers in the initial commit look interesting but probably don't
apply anymore.  If anybody wants to resurrect this it's in the git
history, but it's unhelpful to have this code and divergent allocator
behaviour for minor archs.

Also it might be better to instead make more general improvements to page
allocator if this is still so slow.

Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:09 -07:00
Matthew Wilcox (Oracle) d8c6546b1a mm: introduce compound_nr()
Replace 1 << compound_order(page) with compound_nr(page).  Minor
improvements in readability.

Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Matthew Wilcox (Oracle) a50b854e07 mm: introduce page_size()
Patch series "Make working with compound pages easier", v2.

These three patches add three helpers and convert the appropriate
places to use them.

This patch (of 3):

It's unnecessarily hard to find out the size of a potentially huge page.
Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).

Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:08 -07:00
Colin Ian King 1c3ce5417b ocfs2: fix spelling mistake "ambigous" -> "ambiguous"
There is a spelling mistake in a mlog_bug_on_msg message. Fix it.

Link: http://lkml.kernel.org/r/831bdff4-064e-038b-f45d-c4d265cbff1e@linux.alibaba.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Changwei Ge d7283b39db ocfs2: checkpoint appending truncate log transaction before flushing
Appending truncate log(TA) and and flushing truncate log(TF) are two
separated transactions.  They can be both committed but not checkpointed.
If crash occurs then, both transaction will be replayed with several
already released to global bitmap clusters.  Then truncate log will be
replayed resulting in cluster double free.

To reproduce this issue, just crash the host while punching hole to files.

Signed-off-by: Changwei Ge <gechangwei@live.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Changwei Ge 0a3775e4f8 ocfs2: wait for recovering done after direct unlock request
There is a scenario causing ocfs2 umount hang when multiple hosts are
rebooting at the same time.

NODE1                           NODE2               NODE3
send unlock requset to NODE2
                                dies
                                                    become recovery master
                                                    recover NODE2
find NODE2 dead
mark resource RECOVERING
directly remove lock from grant list
calculate usage but RECOVERING marked
**miss the window of purging
clear RECOVERING

To reproduce this issue, crash a host and then umount ocfs2
from another node.

To solve this, just let unlock progress wait for recovery done.

Link: http://lkml.kernel.org/r/1550124866-20367-1-git-send-email-gechangwei@live.cn
Signed-off-by: Changwei Ge <gechangwei@live.cn>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Markus Elfring a89bd89fae ocfs2: delete unnecessary checks before brelse()
brelse() tests whether its argument is NULL and then returns immediately.
Thus the tests around the shown calls are not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/55cde320-394b-f985-56ce-1a2abea782aa@web.de
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
zhengbin 77461ba1d1 fs/ocfs2/dir.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/dir.c: In function ocfs2_dx_dir_transfer_leaf:
fs/ocfs2/dir.c:3653:42: warning: variable new_list set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566522588-63786-4-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Changwei Ge <chge@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
zhengbin 236dcc2ae4 fs/ocfs2/file.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/file.c: In function ocfs2_prepare_inode_for_write:
fs/ocfs2/file.c:2143:9: warning: variable end set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566522588-63786-3-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Changwei Ge <chge@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
zhengbin 225dcadf8e fs/ocfs2/namei.c: remove set but not used variables
Fixes gcc '-Wunused-but-set-variable' warning:

fs/ocfs2/namei.c: In function ocfs2_create_inode_in_orphan:
fs/ocfs2/namei.c:2503:23: warning: variable di set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/1566522588-63786-2-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Changwei Ge <chge@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Guozhonghua bf5a526479 ocfs2: remove unused ocfs2_orphan_scan_exit() declaration
ocfs2_orphan_scan_exit() is declared but not implemented.  Also perform a
minor cleanup in ocfs2_link_credits()

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4014FC208AC@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Guozhonghua 3dd21cdbef ocfs2: remove unused ocfs2_calc_tree_trunc_credits()
ocfs2_calc_tree_trunc_credits() is not called anywhere.

Link: http://lkml.kernel.org/r/71604351584F6A4EBAE558C676F37CA4014FC2050F@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: guozhonghua <guozhonghua@h3c.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Greg Kroah-Hartman 5e7a3ed9f1 ocfs2: further debugfs cleanups
There is no need to check return value of debugfs_create functions, but
the last sweep through ocfs missed a number of places where this was
happening.  There is also no need to save the individual dentries for the
debugfs files, as everything is can just be removed at once when the
directory is removed.

By getting rid of the file dentries for the debugfs entries, a bit of
local memory can be saved as well.

[colin.king@canonical.com: ensure ret is set to zero before returning]
  Link: http://lkml.kernel.org/r/20190807121929.28918-1-colin.king@canonical.com
Link: http://lkml.kernel.org/r/20190731132119.GA12603@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jia Guo <guojia12@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Joseph Qi 963abb9aeb jbd2: remove jbd2_journal_inode_add_[write|wait]
Since ext4/ocfs2 are using jbd2_inode dirty range scoping APIs now,
jbd2_journal_inode_add_[write|wait] are not used any more, remove them.

Link: http://lkml.kernel.org/r/1562977611-8412-2-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Acked-by: Changwei Ge <chge@linux.alibaba.com>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
Joseph Qi bbd0f32721 ocfs2: use jbd2_inode dirty range scoping
6ba0e7dc64 ("jbd2: introduce jbd2_inode dirty range scoping") allow us
scoping each of the inode dirty ranges associated with a given
transaction, and ext4 already does this way.

Now let's also use the newly introduced jbd2_inode dirty range scoping to
prevent us from waiting forever when trying to complete a journal
transaction in ocfs2.

Link: http://lkml.kernel.org/r/1562977611-8412-1-git-send-email-joseph.qi@linux.alibaba.com
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ross Zwisler <zwisler@google.com>
Reviewed-by: Changwei Ge <chge@linux.alibaba.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:07 -07:00
OGAWA Hirofumi 07bfa4415a fat: work around race with userspace's read via blockdev while mounting
If userspace reads the buffer via blockdev while mounting,
sb_getblk()+modify can race with buffer read via blockdev.

For example,

            FS                               userspace
    bh = sb_getblk()
    modify bh->b_data
                                  read
				    ll_rw_block(bh)
				      fill bh->b_data by on-disk data
				      /* lost modified data by FS */
				      set_buffer_uptodate(bh)
    set_buffer_uptodate(bh)

Userspace should not use the blockdev while mounting though, the udev
seems to be already doing this.  Although I think the udev should try to
avoid this, workaround the race by small overhead.

Link: http://lkml.kernel.org/r/87pnk7l3sw.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-24 15:54:06 -07:00
Olga Kornievskaia a8fd0feeca pNFS/filelayout: enable LAYOUTGET on OPEN
Add the flag to the filelayout driver to add LAYOUTGET to
the OPEN compound.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-24 16:28:38 -04:00
Trond Myklebust c128e57551 NFS: Optimise the default readahead size
In the years since the max readahead size was fixed in NFS, a number of
things have happened:
- Users can now set the value directly using /sys/class/bdi
- NFS max supported block sizes have increased by several orders of
  magnitude from 64K to 1MB.
- Disk access latencies are orders of magnitude faster due to SSD + NVME.

In particular note that if the server is advertising 1MB as the optimal
read size, as that will set the readahead size to 15MB.
Let's therefore adjust down, and try to default to VM_READAHEAD_PAGES.
However let's inform the VM about our preferred block size so that it
can choose to round up in cases where that makes sense.

Reported-by: Alkis Georgopoulos <alkisg@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-09-24 15:58:20 -04:00
Linus Torvalds 0b36c9eed2 Merge branch 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more mount API conversions from Al Viro:
 "Assorted conversions of options parsing to new API.

  gfs2 is probably the most serious one here; the rest is trivial stuff.

  Other things in what used to be #work.mount are going to wait for the
  next cycle (and preferably go via git trees of the filesystems
  involved)"

* 'work.mount3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gfs2: Convert gfs2 to fs_context
  vfs: Convert spufs to use the new mount API
  vfs: Convert hypfs to use the new mount API
  hypfs: Fix error number left in struct pointer member
  vfs: Convert functionfs to use the new mount API
  vfs: Convert bpf to use the new mount API
2019-09-24 12:33:34 -07:00
Austin Kim 88d32d3983 xfs: avoid unused to_mp() function warning
to_mp() was first introduced with the following commit:
'commit 801cc4e17a ("xfs: debug mode forced buffered write failure")'

But the user of to_mp() was removed by below commit:
'commit f8c47250ba ("xfs: convert drop_writes to use the errortag
mechanism")'

So kernel build with clang throws below warning message:

   fs/xfs/xfs_sysfs.c:72:1: warning: unused function 'to_mp' [-Wunused-function]
   to_mp(struct kobject *kobject)

Hence to_mp() might be removed safely to get rid of warning message.

Signed-off-by: Austin Kim <austindh.kim@gmail.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-24 09:40:19 -07:00
Eric Sandeen 6f4ff81a46 xfs: log proper length of superblock
xfs_trans_log_buf takes first byte, last byte as args.  In this
case, it should be from 0 to sizeof() - 1.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-09-24 08:00:36 -07:00
Filipe Manana 13fc1d271a Btrfs: fix race setting up and completing qgroup rescan workers
There is a race between setting up a qgroup rescan worker and completing
a qgroup rescan worker that can lead to callers of the qgroup rescan wait
ioctl to either not wait for the rescan worker to complete or to hang
forever due to missing wake ups. The following diagram shows a sequence
of steps that illustrates the race.

        CPU 1                                                         CPU 2                                  CPU 3

 btrfs_ioctl_quota_rescan()
  btrfs_qgroup_rescan()
   qgroup_rescan_init()
    mutex_lock(&fs_info->qgroup_rescan_lock)
    spin_lock(&fs_info->qgroup_lock)

    fs_info->qgroup_flags |=
      BTRFS_QGROUP_STATUS_FLAG_RESCAN

    init_completion(
      &fs_info->qgroup_rescan_completion)

    fs_info->qgroup_rescan_running = true

    mutex_unlock(&fs_info->qgroup_rescan_lock)
    spin_unlock(&fs_info->qgroup_lock)

    btrfs_init_work()
     --> starts the worker

                                                        btrfs_qgroup_rescan_worker()
                                                         mutex_lock(&fs_info->qgroup_rescan_lock)

                                                         fs_info->qgroup_flags &=
                                                           ~BTRFS_QGROUP_STATUS_FLAG_RESCAN

                                                         mutex_unlock(&fs_info->qgroup_rescan_lock)

                                                         starts transaction, updates qgroup status
                                                         item, etc

                                                                                                           btrfs_ioctl_quota_rescan()
                                                                                                            btrfs_qgroup_rescan()
                                                                                                             qgroup_rescan_init()
                                                                                                              mutex_lock(&fs_info->qgroup_rescan_lock)
                                                                                                              spin_lock(&fs_info->qgroup_lock)

                                                                                                              fs_info->qgroup_flags |=
                                                                                                                BTRFS_QGROUP_STATUS_FLAG_RESCAN

                                                                                                              init_completion(
                                                                                                                &fs_info->qgroup_rescan_completion)

                                                                                                              fs_info->qgroup_rescan_running = true

                                                                                                              mutex_unlock(&fs_info->qgroup_rescan_lock)
                                                                                                              spin_unlock(&fs_info->qgroup_lock)

                                                                                                              btrfs_init_work()
                                                                                                               --> starts another worker

                                                         mutex_lock(&fs_info->qgroup_rescan_lock)

                                                         fs_info->qgroup_rescan_running = false

                                                         mutex_unlock(&fs_info->qgroup_rescan_lock)

							 complete_all(&fs_info->qgroup_rescan_completion)

Before the rescan worker started by the task at CPU 3 completes, if
another task calls btrfs_ioctl_quota_rescan(), it will get -EINPROGRESS
because the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN is set at
fs_info->qgroup_flags, which is expected and correct behaviour.

However if other task calls btrfs_ioctl_quota_rescan_wait() before the
rescan worker started by the task at CPU 3 completes, it will return
immediately without waiting for the new rescan worker to complete,
because fs_info->qgroup_rescan_running is set to false by CPU 2.

This race is making test case btrfs/171 (from fstests) to fail often:

  btrfs/171 9s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad)
      --- tests/btrfs/171.out     2018-09-16 21:30:48.505104287 +0100
      +++ /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad      2019-09-19 02:01:36.938486039 +0100
      @@ -1,2 +1,3 @@
       QA output created by 171
      +ERROR: quota rescan failed: Operation now in progress
       Silence is golden
      ...
      (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/171.out /home/fdmanana/git/hub/xfstests/results//btrfs/171.out.bad'  to see the entire diff)

That is because the test calls the btrfs-progs commands "qgroup quota
rescan -w", "qgroup assign" and "qgroup remove" in a sequence that makes
calls to the rescan start ioctl fail with -EINPROGRESS (note the "btrfs"
commands 'qgroup assign' and 'qgroup remove' often call the rescan start
ioctl after calling the qgroup assign ioctl,
btrfs_ioctl_qgroup_assign()), since previous waits didn't actually wait
for a rescan worker to complete.

Another problem the race can cause is missing wake ups for waiters,
since the call to complete_all() happens outside a critical section and
after clearing the flag BTRFS_QGROUP_STATUS_FLAG_RESCAN. In the sequence
diagram above, if we have a waiter for the first rescan task (executed
by CPU 2), then fs_info->qgroup_rescan_completion.wait is not empty, and
if after the rescan worker clears BTRFS_QGROUP_STATUS_FLAG_RESCAN and
before it calls complete_all() against
fs_info->qgroup_rescan_completion, the task at CPU 3 calls
init_completion() against fs_info->qgroup_rescan_completion which
re-initilizes its wait queue to an empty queue, therefore causing the
rescan worker at CPU 2 to call complete_all() against an empty queue,
never waking up the task waiting for that rescan worker.

Fix this by clearing BTRFS_QGROUP_STATUS_FLAG_RESCAN and setting
fs_info->qgroup_rescan_running to false in the same critical section,
delimited by the mutex fs_info->qgroup_rescan_lock, as well as doing the
call to complete_all() in that same critical section. This gives the
protection needed to avoid rescan wait ioctl callers not waiting for a
running rescan worker and the lost wake ups problem, since setting that
rescan flag and boolean as well as initializing the wait queue is done
already in a critical section delimited by that mutex (at
qgroup_rescan_init()).

Fixes: 57254b6ebc ("Btrfs: add ioctl to wait for qgroup rescan completion")
Fixes: d2c609b834 ("btrfs: properly track when rescan worker is running")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-24 16:38:53 +02:00
YueHaibing 5addcd5dbd fuse: Make fuse_args_to_req static
Fix sparse warning:

fs/fuse/dev.c:468:6: warning: symbol 'fuse_args_to_req' was not declared. Should it be static?

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 68583165f9 ("fuse: add pages to fuse_args")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:02 +02:00
zhengbin 9ad09b1976 fuse: fix memleak in cuse_channel_open
If cuse_send_init fails, need to fuse_conn_put cc->fc.

cuse_channel_open->fuse_conn_init->refcount_set(&fc->count, 1)
                 ->fuse_dev_alloc->fuse_conn_get
                 ->fuse_dev_free->fuse_conn_put

Fixes: cc080e9e9b ("fuse: introduce per-instance fuse_dev structure")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00
Tejun Heo e5854b1cdf fuse: fix beyond-end-of-page access in fuse_parse_cache()
With DEBUG_PAGEALLOC on, the following triggers.

  BUG: unable to handle page fault for address: ffff88859367c000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 3001067 P4D 3001067 PUD 406d3a8067 PMD 406d30c067 PTE 800ffffa6c983060
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
  CPU: 38 PID: 3110657 Comm: python2.7
  RIP: 0010:fuse_readdir+0x88f/0xe7a [fuse]
  Code: 49 8b 4d 08 49 39 4e 60 0f 84 44 04 00 00 48 8b 43 08 43 8d 1c 3c 4d 01 7e 68 49 89 dc 48 03 5c 24 38 49 89 46 60 8b 44 24 30 <8b> 4b 10 44 29 e0 48 89 ca 48 83 c1 1f 48 83 e1 f8 83 f8 17 49 89
  RSP: 0018:ffffc90035edbde0 EFLAGS: 00010286
  RAX: 0000000000001000 RBX: ffff88859367bff0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff88859367bfed RDI: 0000000000920907
  RBP: ffffc90035edbe90 R08: 000000000000014b R09: 0000000000000004
  R10: ffff88859367b000 R11: 0000000000000000 R12: 0000000000000ff0
  R13: ffffc90035edbee0 R14: ffff889fb8546180 R15: 0000000000000020
  FS:  00007f80b5f4a740(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffff88859367c000 CR3: 0000001c170c2001 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   iterate_dir+0x122/0x180
   __x64_sys_getdents+0xa6/0x140
   do_syscall_64+0x42/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

It's in fuse_parse_cache().  %rbx (ffff88859367bff0) is fuse_dirent
pointer - addr + offset.  FUSE_DIRENT_SIZE() is trying to dereference
namelen off of it but that derefs into the next page which is disabled
by pagealloc debug causing a PF.

This is caused by dirent->namelen being accessed before ensuring that
there's enough bytes in the page for the dirent.  Fix it by pushing
down reclen calculation.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5d7bc7e868 ("fuse: allow using readdir cache")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00
Arnd Bergmann 0ed4059302 fuse: unexport fuse_put_request
This function has been made static, which now causes a compile-time
warning:

WARNING: "fuse_put_request" [vmlinux] is a static EXPORT_SYMBOL_GPL

Remove the unneeded export.

Fixes: 66abc3599c ("fuse: unexport request ops")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00
Khazhismel Kumykov dc69e98c24 fuse: kmemcg account fs data
account per-file, dentry, and inode data

blockdev/superblock and temporary per-request data was left alone, as
this usually isn't accounted

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00
Khazhismel Kumykov 30c6a23d34 fuse: on 64-bit store time in d_fsdata directly
Implements the optimization noted in commit f75fdf22b0 ("fuse: don't
use ->d_time"), as the additional memory can be significant.  (In
particular, on SLAB configurations this 8-byte alloc becomes 32 bytes).
Per-dentry, this can consume significant memory.

Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00
Vasily Averin d5880c7a86 fuse: fix missing unlock_page in fuse_writepage()
unlock_page() was missing in case of an already in-flight write against the
same page.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Fixes: ff17be0864 ("fuse: writepage: skip already in flight")
Cc: <stable@vger.kernel.org> # v3.13
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-24 15:28:01 +02:00