1
0
Fork 0
Commit Graph

61395 Commits (db9b14ae4b6a7675e65faf8098555ce5e075051b)

Author SHA1 Message Date
Josef Bacik e633add66d btrfs: fix lockdep splat from btrfs_dump_space_info
[ Upstream commit ab0db043c3 ]

When running with -o enospc_debug you can get the following splat if one
of the dump_space_info's trip

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc5+ #20 Tainted: G           OE
  ------------------------------------------------------
  dd/563090 is trying to acquire lock:
  ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]

  but task is already holding lock:
  ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&cache->lock){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
	 find_free_extent+0x7ef/0x13b0 [btrfs]
	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
	 __btrfs_cow_block+0x122/0x530 [btrfs]
	 btrfs_cow_block+0x106/0x210 [btrfs]
	 commit_cowonly_roots+0x55/0x300 [btrfs]
	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20 [btrfs]
	 deactivate_locked_super+0x36/0x70
	 cleanup_mnt+0x104/0x160
	 task_work_run+0x5f/0x90
	 __prepare_exit_to_usermode+0x1bd/0x1c0
	 do_syscall_64+0x5e/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&space_info->lock){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
	 clear_state_bit+0x81/0x1a0 [btrfs]
	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
	 clear_extent_bit+0x15/0x20 [btrfs]
	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
	 truncate_cleanup_page+0x47/0xe0
	 truncate_inode_pages_range+0x238/0x840
	 truncate_pagecache+0x44/0x60
	 btrfs_setattr+0x202/0x5e0 [btrfs]
	 notify_change+0x33b/0x490
	 do_truncate+0x76/0xd0
	 path_openat+0x687/0xa10
	 do_filp_open+0x91/0x100
	 do_sys_openat2+0x215/0x2d0
	 do_sys_open+0x44/0x80
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&tree->lock#2){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 find_first_extent_bit+0x32/0x150 [btrfs]
	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
	 commit_cowonly_roots+0x245/0x300 [btrfs]
	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
	 close_ctree+0xf9/0x2f5 [btrfs]
	 generic_shutdown_super+0x6c/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20 [btrfs]
	 deactivate_locked_super+0x36/0x70
	 cleanup_mnt+0x104/0x160
	 task_work_run+0x5f/0x90
	 __prepare_exit_to_usermode+0x1bd/0x1c0
	 do_syscall_64+0x5e/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
	 __lock_acquire+0x1240/0x2460
	 lock_acquire+0xab/0x360
	 _raw_spin_lock+0x25/0x30
	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
	 btrfs_reserve_extent+0x176/0x180 [btrfs]
	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
	 cache_save_setup+0x28d/0x3b0 [btrfs]
	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
	 new_sync_write+0x11e/0x1b0
	 vfs_write+0x1c9/0x200
	 ksys_write+0x68/0xe0
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &ctl->tree_lock --> &space_info->lock --> &cache->lock

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&cache->lock);
				 lock(&space_info->lock);
				 lock(&cache->lock);
    lock(&ctl->tree_lock);

   *** DEADLOCK ***

  6 locks held by dd/563090:
   #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
   #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
   #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
   #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
   #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
   #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]

  stack backtrace:
  CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
  Call Trace:
   dump_stack+0x96/0xd0
   check_noncircular+0x162/0x180
   __lock_acquire+0x1240/0x2460
   ? wake_up_klogd.part.0+0x30/0x40
   lock_acquire+0xab/0x360
   ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   _raw_spin_lock+0x25/0x30
   ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   btrfs_dump_space_info+0xf4/0x120 [btrfs]
   btrfs_reserve_extent+0x176/0x180 [btrfs]
   __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
   ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
   cache_save_setup+0x28d/0x3b0 [btrfs]
   btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
   btrfs_commit_transaction+0xcc/0xac0 [btrfs]
   ? start_transaction+0xe0/0x5b0 [btrfs]
   btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
   btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
   btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
   ? ktime_get_coarse_real_ts64+0xa8/0xd0
   ? trace_hardirqs_on+0x1c/0xe0
   btrfs_file_write_iter+0x3cf/0x610 [btrfs]
   new_sync_write+0x11e/0x1b0
   vfs_write+0x1c9/0x200
   ksys_write+0x68/0xe0
   do_syscall_64+0x52/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is because we're holding the block_group->lock while trying to dump
the free space cache.  However we don't need this lock, we just need it
to read the values for the printk, so move the free space cache dumping
outside of the block group lock.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19 08:16:01 +02:00
Paul E. McKenney 6402b23182 fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls
[ Upstream commit 9f47eb5461 ]

Very large I/Os can cause the following RCU CPU stall warning:

RIP: 0010:rb_prev+0x8/0x50
Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c =
89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48
RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0
RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630
RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238
R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001
R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0
 __lookup_extent_mapping+0xa0/0x110
 try_release_extent_mapping+0xdc/0x220
 btrfs_releasepage+0x45/0x70
 shrink_page_list+0xa39/0xb30
 shrink_inactive_list+0x18f/0x3b0
 shrink_lruvec+0x38e/0x6b0
 shrink_node+0x14d/0x690
 do_try_to_free_pages+0xc6/0x3e0
 try_to_free_mem_cgroup_pages+0xe6/0x1e0
 reclaim_high.constprop.73+0x87/0xc0
 mem_cgroup_handle_over_high+0x66/0x150
 exit_to_usermode_loop+0x82/0xd0
 do_syscall_64+0xd4/0x100
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

On a PREEMPT=n kernel, the try_release_extent_mapping() function's
"while" loop might run for a very long time on a large I/O.  This commit
therefore adds a cond_resched() to this loop, providing RCU any needed
quiescent states.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19 08:16:00 +02:00
Dmitry Vyukov 0b1799662a io_uring: fix sq array offset calculation
[ Upstream commit b36200f543 ]

rings_size() sets sq_offset to the total size of the rings (the returned
value which is used for memory allocation). This is wrong: sq array should
be located within the rings, not after them. Set sq_offset to where it
should be.

Fixes: 75b28affdd ("io_uring: allocate the two rings together")
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Hristo Venev <hristo@venev.name>
Cc: io-uring@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19 08:15:57 +02:00
Liu Yong a02df82a59 fs/io_uring.c: Fix uninitialized variable is referenced in io_submit_sqe
the commit <a4d61e66ee4a> ("<io_uring: prevent re-read of sqe->opcode>")
caused another vulnerability. After io_get_req(), the sqe_submit struct
in req is not initialized, but the following code defaults that
req->submit.opcode is available.

Signed-off-by: Liu Yong <pkfxxxing@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-08-19 08:15:53 +02:00
Chuck Lever 512570b178 nfsd: Fix NFSv4 READ on RDMA when using readv
commit 412055398b upstream.

svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().

This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.

Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.

That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.

Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.

Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: Timo Rothenpieler <timo@rothenpieler.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-11 15:33:42 +02:00
Frank van der Linden 11e64146dc xattr: break delegations in {set,remove}xattr
commit 08b5d5014a upstream.

set/removexattr on an exported filesystem should break NFS delegations.
This is true in general, but also for the upcoming support for
RFC 8726 (NFSv4 extended attribute support). Make sure that they do.

Additionally, they need to grow a _locked variant, since callers might
call this with i_rwsem held (like the NFS server code).

Cc: stable@vger.kernel.org # v4.9+
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-11 15:33:39 +02:00
Guoyu Huang e8053c6833 io_uring: Fix use-after-free in io_sq_wq_submit_work()
when ctx->sqo_mm is zero, io_sq_wq_submit_work() frees 'req'
without deleting it from 'task_list'. After that, 'req' is
accessed in io_ring_ctx_wait_and_kill() which lead to
a use-after-free.

Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-11 15:33:33 +02:00
Jens Axboe a4d61e66ee io_uring: prevent re-read of sqe->opcode
Liu reports that he can trigger a NULL pointer dereference with
IORING_OP_SENDMSG, by changing the sqe->opcode after we've validated
that the previous opcode didn't need a file and didn't assign one.

Ensure we validate and read the opcode only once.

Reported-by: Liu Yong <pkfxxxing@gmail.com>
Tested-by: Liu Yong <pkfxxxing@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-11 15:33:32 +02:00
Jiang Ying c776104353 ext4: fix direct I/O read error
This patch is used to fix ext4 direct I/O read error when
the read size is not aligned with block size.

Then, I will use a test to explain the error.

(1) Make a file that is not aligned with block size:
	$dd if=/dev/zero of=./test.jar bs=1000 count=3

(2) I wrote a source file named "direct_io_read_file.c" as following:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/file.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <string.h>
	#define BUF_SIZE 1024

	int main()
	{
		int fd;
		int ret;

		unsigned char *buf;
		ret = posix_memalign((void **)&buf, 512, BUF_SIZE);
		if (ret) {
			perror("posix_memalign failed");
			exit(1);
		}
		fd = open("./test.jar", O_RDONLY | O_DIRECT, 0755);
		if (fd < 0){
			perror("open ./test.jar failed");
			exit(1);
		}

		do {
			ret = read(fd, buf, BUF_SIZE);
			printf("ret=%d\n",ret);
			if (ret < 0) {
				perror("write test.jar failed");
			}
		} while (ret > 0);

		free(buf);
		close(fd);
	}

(3) Compile the source file:
	$gcc direct_io_read_file.c -D_GNU_SOURCE

(4) Run the test program:
	$./a.out

	The result is as following:
	ret=1024
	ret=1024
	ret=952
	ret=-1
	write test.jar failed: Invalid argument.

I have tested this program on XFS filesystem, XFS does not have
this problem, because XFS use iomap_dio_rw() to do direct I/O
read. And the comparing between read offset and file size is done
in iomap_dio_rw(), the code is as following:

	if (pos < size) {
		retval = filemap_write_and_wait_range(mapping, pos,
				pos + iov_length(iov, nr_segs) - 1);

		if (!retval) {
			retval = mapping->a_ops->direct_IO(READ, iocb,
						iov, pos, nr_segs);
		}
		...
	}

...only when "pos < size", direct I/O can be done, or 0 will be return.

I have tested the fix patch on Ext4, it is up to the mustard of
EINVAL in man2(read) as following:
	#include <unistd.h>
	ssize_t read(int fd, void *buf, size_t count);

	EINVAL
		fd is attached to an object which is unsuitable for reading;
		or the file was opened with the O_DIRECT flag, and either the
		address specified in buf, the value specified in count, or the
		current file offset is not suitably aligned.

So I think this patch can be applied to fix ext4 direct I/O error.

However Ext4 introduces direct I/O read using iomap infrastructure
on kernel 5.5, the patch is commit <b1b4705d54ab>
("ext4: introduce direct I/O read using iomap infrastructure"),
then Ext4 will be the same as XFS, they all use iomap_dio_rw() to do direct
I/O read. So this problem does not exist on kernel 5.5 for Ext4.

>From above description, we can see this problem exists on all the kernel
versions between kernel 3.14 and kernel 5.4. It will cause the Applications
to fail to read. For example, when the search service downloads a new full
index file, the search engine is loading the previous index file and is
processing the search request, it can not use buffer io that may squeeze
the previous index file in use from pagecache, so the serch service must
use direct I/O read.

Please apply this patch on these kernel versions, or please use the method
on kernel 5.5 to fix this problem.

Fixes: 9fe55eea7e ("Fix race when checking i_size on direct i/o read")
Reviewed-by: Jan Kara <jack@suse.cz>
Co-developed-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Wang Long <wanglong19@meituan.com>
Signed-off-by: Jiang Ying <jiangying8582@126.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-08-07 09:34:02 +02:00
Steve French 3e84602475 Revert "cifs: Fix the target file was deleted when rename failed."
commit 0e6705182d upstream.

This reverts commit 9ffad9263b.

Upon additional testing with older servers, it was found that
the original commit introduced a regression when using the old SMB1
dialect and rsyncing over an existing file.

The patch will need to be respun to address this, likely including
a larger refactoring of the SMB1 and SMB3 rename code paths to make
it less confusing and also to address some additional rename error
cases that SMB3 may be able to workaround.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: Patrick Fernie <patrick.fernie@gmail.com>
CC: Stable <stable@vger.kernel.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Acked-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:41 +02:00
J. Bruce Fields 9f2c2928b9 nfsd4: fix NULL dereference in nfsd/clients display code
[ Upstream commit 9affa43581 ]

We hold the cl_lock here, and that's enough to keep stateid's from going
away, but it's not enough to prevent the files they point to from going
away.  Take fi_lock and a reference and check for NULL, as we do in
other code.

Reported-by: NeilBrown <neilb@suse.de>
Fixes: 78599c42ae ("nfsd4: add file to display list of client's opens")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-29 10:18:34 +02:00
Robbie Ko 38a66f3cda btrfs: fix page leaks after failure to lock page for delalloc
commit 5909ca110b upstream.

When locking pages for delalloc, we check if it's dirty and mapping still
matches. If it does not match, we need to return -EAGAIN and release all
pages. Only the current page was put though, iterate over all the
remaining pages too.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:30 +02:00
Boris Burkov b04805a7e8 btrfs: fix mount failure caused by race with umount
commit 48cfa61b58 upstream.

It is possible to cause a btrfs mount to fail by racing it with a slow
umount. The crux of the sequence is generic_shutdown_super not yet
calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
If that occurs, btrfs_open_devices will decide the opened counter is
non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
0. From here, mount will call sget which will result in grab_super
trying to take the super block umount semaphore. That semaphore will be
held by the slow umount, so mount will block. Before up-ing the
semaphore, umount will delete the super block, resulting in mount's sget
reliably allocating a new one, which causes the mount path to dutifully
fill it out, and increment total_rw_bytes a second time, which causes
the mount to fail, as we see double the expected bytes.

Here is the sequence laid out in greater detail:

CPU0                                                    CPU1
down_write sb->s_umount
btrfs_kill_super
  kill_anon_super(sb)
    generic_shutdown_super(sb);
      shrink_dcache_for_umount(sb);
      sync_filesystem(sb);
      evict_inodes(sb); // SLOW

                                              btrfs_mount_root
                                                btrfs_scan_one_device
                                                fs_devices = device->fs_devices
                                                fs_info->fs_devices = fs_devices
                                                // fs_devices-opened makes this a no-op
                                                btrfs_open_devices(fs_devices, mode, fs_type)
                                                s = sget(fs_type, test, set, flags, fs_info);
                                                  find sb in s_instances
                                                  grab_super(sb);
                                                    down_write(&s->s_umount); // blocks

      sop->put_super(sb)
        // sb->fs_devices->opened == 2; no-op
      spin_lock(&sb_lock);
      hlist_del_init(&sb->s_instances);
      spin_unlock(&sb_lock);
      up_write(&sb->s_umount);
                                                    return 0;
                                                  retry lookup
                                                  don't find sb in s_instances (deleted by CPU0)
                                                  s = alloc_super
                                                  return s;
                                                btrfs_fill_super(s, fs_devices, data)
                                                  open_ctree // fs_devices total_rw_bytes improperly set!
                                                    btrfs_read_chunk_tree
                                                      read_one_dev // increment total_rw_bytes again!!
                                                      super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
before the calls to read_one_dev, while holding the sb umount semaphore
and the uuid mutex.

To reproduce, it is sufficient to dirty a decent number of inodes, then
quickly umount and mount.

  for i in $(seq 0 500)
  do
    dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
  done
  umount /mnt/foo&
  mount /mnt/foo

does the trick for me.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:30 +02:00
Filipe Manana e333df0e4a btrfs: fix double free on ulist after backref resolution failure
commit 580c079b57 upstream.

At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
argument to point to it. However if later we fail due to an error returned
by find_parent_nodes(), we free that ulist but leave a dangling pointer in
the **roots argument. Upon receiving the error, a caller of this function
can attempt to free the same ulist again, resulting in an invalid memory
access.

One such scenario is during qgroup accounting:

btrfs_qgroup_account_extents()

 --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
     pointer) to btrfs_find_all_roots()

   --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
       passing &new_roots to it

     --> allocates ulist and assigns its address to **roots (which
         points to new_roots from btrfs_qgroup_account_extents())

     --> find_parent_nodes() returns an error, so we free the ulist
         and leave **roots pointing to it after returning

 --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
     an error and jumps to the label 'cleanup', which just tries to
     free again the same ulist

Stack trace example:

 ------------[ cut here ]------------
 BTRFS: tree first key check failed
 WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
 Modules linked in: dm_snapshot dm_thin_pool (...)
 CPU: 1 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
 Code: 28 5b 5d (...)
 RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
 RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
 RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
 R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
 FS:  00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  read_block_for_search+0xf6/0x350 [btrfs]
  btrfs_next_old_leaf+0x242/0x650 [btrfs]
  resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
  find_parent_nodes+0x4ea/0x12c0 [btrfs]
  btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
  btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
  btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
  btrfs_sync_file+0x3d4/0x4d0 [btrfs]
  do_fsync+0x38/0x70
  __x64_sys_fdatasync+0x13/0x20
  do_syscall_64+0x5c/0xe0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fc47e2d72e3
 Code: Bad RIP value.
 RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
 RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
 RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
 R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
 irq event stamp: 0
 hardirqs last  enabled at (0): [<0000000000000000>] 0x0
 hardirqs last disabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
 softirqs last  enabled at (0): [<ffffffffb8eb5e85>] copy_process+0x755/0x1eb0
 softirqs last disabled at (0): [<0000000000000000>] 0x0
 ---[ end trace 8639237550317b48 ]---
 BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
 general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
 CPU: 2 PID: 1763215 Comm: fsstress Tainted: G        W         5.8.0-rc3-btrfs-next-64 #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
 RIP: 0010:ulist_release+0x14/0x60 [btrfs]
 Code: c7 07 00 (...)
 RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
 RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
 RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
 R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
 FS:  00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  ulist_free+0x13/0x20 [btrfs]
  btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
  btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
  btrfs_sync_file+0x3d4/0x4d0 [btrfs]
  do_fsync+0x38/0x70
  __x64_sys_fdatasync+0x13/0x20
  do_syscall_64+0x5c/0xe0
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fc47e2d72e3
 Code: Bad RIP value.
 RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
 RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
 RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
 R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
 R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
 Modules linked in: dm_snapshot dm_thin_pool (...)
 ---[ end trace 8639237550317b49 ]---
 RIP: 0010:ulist_release+0x14/0x60 [btrfs]
 Code: c7 07 00 (...)
 RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
 RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
 RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
 RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
 R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
 FS:  00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
it frees the ulist.

Fixes: 8da6d5815c ("Btrfs: added btrfs_find_all_roots()")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:30 +02:00
Qu Wenruo ee08663380 btrfs: reloc: clear DEAD_RELOC_TREE bit for orphan roots to prevent runaway balance
commit 1dae7e0e58 upstream.

[BUG]
There are several reported runaway balance, that balance is flooding the
log with "found X extents" where the X never changes.

[CAUSE]
Commit d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after
merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
indicate that one subvolume has finished its tree blocks swap with its
reloc tree.

However if balance is canceled or hits ENOSPC halfway, we didn't clear
the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit hanging forever
until unmount.

Any subvolume root with that bit, would cause backref cache to skip this
tree block, as it has finished its tree block swap.  This would cause
all tree blocks of that root be ignored by balance, leading to runaway
balance.

[FIX]
Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
the original subvolume of orphan reloc root.

Add an umount check for the stale bit still set.

Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Manually solve the conflicts due to no btrfs root refs rework]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:29 +02:00
Qu Wenruo 044ca91027 btrfs: reloc: fix reloc root leak and NULL pointer dereference
commit 51415b6c1b upstream.

[BUG]
When balance is canceled, there is a pretty high chance that unmounting
the fs can lead to lead the NULL pointer dereference:

  BTRFS warning (device dm-3): page private not zero on page 223158272
  ...
  BTRFS warning (device dm-3): page private not zero on page 223162368
  BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
  BUG: kernel NULL pointer dereference, address: 0000000000000168
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 2 PID: 5793 Comm: umount Tainted: G           O      5.7.0-rc5-custom+ #53
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:__lock_acquire+0x5dc/0x24c0
  Call Trace:
   lock_acquire+0xab/0x390
   _raw_spin_lock+0x39/0x80
   btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
   release_extent_buffer+0xb2/0x170 [btrfs]
   free_extent_buffer+0x66/0xb0 [btrfs]
   btrfs_put_root+0x8e/0x130 [btrfs]
   btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
   btrfs_free_fs_info+0xe5/0x120 [btrfs]
   btrfs_kill_super+0x1f/0x30 [btrfs]
   deactivate_locked_super+0x3b/0x80
   deactivate_super+0x3e/0x50
   cleanup_mnt+0x109/0x160
   __cleanup_mnt+0x12/0x20
   task_work_run+0x67/0xa0
   exit_to_usermode_loop+0xc5/0xd0
   syscall_return_slowpath+0x205/0x360
   do_syscall_64+0x6e/0xb0
   entry_SYSCALL_64_after_hwframe+0x49/0xb3
  RIP: 0033:0x7fd028ef740b

[CAUSE]
When balance is canceled, all reloc roots are marked as orphan, and
orphan reloc roots are going to be cleaned up.

However for orphan reloc roots and merged reloc roots, their lifespan
are quite different:

	Merged reloc roots	|	Orphan reloc roots by cancel
--------------------------------------------------------------------
create_reloc_root()		| create_reloc_root()
|- refs == 1			| |- refs == 1
				|
btrfs_grab_root(reloc_root);	| btrfs_grab_root(reloc_root);
|- refs == 2			| |- refs == 2
				|
root->reloc_root = reloc_root;	| root->reloc_root = reloc_root;
		>>> No difference so far <<<
				|
prepare_to_merge()		| prepare_to_merge()
|- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR)
				|
merge_reloc_roots()		| merge_reloc_roots()
|- merge_reloc_root()		| |- Doing nothing to put reloc root
   |- insert_dirty_subvol()	| |- refs == 2
      |- __del_reloc_root()	|
         |- btrfs_put_root()	|
            |- refs == 1	|
		>>> Now orphan reloc roots still have refs 2 <<<
				|
clean_dirty_subvols()		| clean_dirty_subvols()
|- btrfs_drop_snapshot()	| |- btrfS_drop_snapshot()
   |- reloc_root get freed	|    |- reloc_root still has refs 2
				|	related ebs get freed, but
				|	reloc_root still recorded in
				|	allocated_roots
btrfs_check_leaked_roots()	| btrfs_check_leaked_roots()
|- No leaked roots		| |- Leaked reloc_roots detected
				| |- btrfs_put_root()
				|    |- free_extent_buffer(root->node);
				|       |- eb already freed, caused NULL
				|	   pointer dereference

[FIX]
The fix is to clear fs_root->reloc_root and put it at
merge_reloc_roots() time, so that we won't leak reloc roots.

Fixes: d2311e6985 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[Manually solve the conflicts due to no btrfs root refs rework]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:29 +02:00
Olga Kornievskaia cb12257070 SUNRPC reverting d03727b248 ("NFSv4 fix CLOSE not waiting for direct IO compeletion")
commit 65caafd0d2 upstream.

Reverting commit d03727b248 "NFSv4 fix CLOSE not waiting for
direct IO compeletion". This patch made it so that fput() by calling
inode_dio_done() in nfs_file_release() would wait uninterruptably
for any outstanding directIO to the file (but that wait on IO should
be killable).

The problem the patch was also trying to address was REMOVE returning
ERR_ACCESS because the file is still opened, is supposed to be resolved
by server returning ERR_FILE_OPEN and not ERR_ACCESS.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-29 10:18:29 +02:00
Miklos Szeredi 8676732c33 fuse: fix weird page warning
commit a5005c3cda upstream.

When PageWaiters was added, updating this check was missed.

Reported-by: Nikolaus Rath <Nikolaus@rath.org>
Reported-by: Hugh Dickins <hughd@google.com>
Fixes: 6290602709 ("mm: add PageWaiters indicating tasks are waiting for a page bit")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: André Almeida <andrealmeid@collabora.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-29 10:18:28 +02:00
Chirantan Ekbote 4dd2ad6867 fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS
commit 31070f6cce upstream.

The ioctl encoding for this parameter is a long but the documentation says
it should be an int and the kernel drivers expect it to be an int.  If the
fuse driver treats this as a long it might end up scribbling over the stack
of a userspace process that only allocated enough space for an int.

This was previously discussed in [1] and a patch for fuse was proposed in
[2].  From what I can tell the patch in [2] was nacked in favor of adding
new, "fixed" ioctls and using those from userspace.  However there is still
no "fixed" version of these ioctls and the fact is that it's sometimes
infeasible to change all userspace to use the new one.

Handling the ioctls specially in the fuse driver seems like the most
pragmatic way for fuse servers to support them without causing crashes in
userspace applications that call them.

[1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
[2]: https://sourceforge.net/p/fuse/mailman/message/31771759/

Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Fixes: 59efec7b90 ("fuse: implement ioctl support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:12 +02:00
Miklos Szeredi e8f32a9f5a fuse: use ->reconfigure() instead of ->remount_fs()
commit 0189a2d367 upstream.

s_op->remount_fs() is only called from legacy_reconfigure(), which is not
used after being converted to the new API.

Convert to using ->reconfigure().  This restores the previous behavior of
syncing the filesystem and rejecting MS_MANDLOCK on remount.

Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:12 +02:00
Miklos Szeredi f96ce4be46 fuse: ignore 'data' argument of mount(..., MS_REMOUNT)
commit e8b20a474c upstream.

The command

  mount -o remount -o unknownoption /mnt/fuse

succeeds on kernel versions prior to v5.4 and fails on kernel version at or
after.  This is because fuse_parse_param() rejects any unrecognised options
in case of FS_CONTEXT_FOR_RECONFIGURE, just as for FS_CONTEXT_FOR_MOUNT.

This causes a regression in case the fuse filesystem is in fstab, since
remount sends all options found there to the kernel; even ones that are
meant for the initial mount and are consumed by the userspace fuse server.

Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
the conversion to the new API.

Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Fixes: c30da2e981 ("fuse: convert to use the new mount API")
Cc: <stable@vger.kernel.org> # v5.4
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:12 +02:00
Amir Goldstein 09b696bd21 ovl: fix unneeded call to ovl_change_flags()
commit 81a33c1ee9 upstream.

The check if user has changed the overlay file was wrong, causing unneeded
call to ovl_change_flags() including taking f_lock on every file access.

Fixes: d989903058 ("ovl: do not generate duplicate fsnotify events for "fake" path")
Cc: <stable@vger.kernel.org> # v4.19+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:12 +02:00
Amir Goldstein 93f75b0f0d ovl: relax WARN_ON() when decoding lower directory file handle
commit 124c2de2c0 upstream.

Decoding a lower directory file handle to overlay path with cold
inode/dentry cache may go as follows:

1. Decode real lower file handle to lower dir path
2. Check if lower dir is indexed (was copied up)
3. If indexed, get the upper dir path from index
4. Lookup upper dir path in overlay
5. If overlay path found, verify that overlay lower is the lower dir
   from step 1

On failure to verify step 5 above, user will get an ESTALE error and a
WARN_ON will be printed.

A mismatch in step 5 could be a result of lower directory that was renamed
while overlay was offline, after that lower directory has been copied up
and indexed.

This is a scripted reproducer based on xfstest overlay/052:

  # Create lower subdir
  create_dirs
  create_test_files $lower/lowertestdir/subdir
  mount_dirs
  # Copy up lower dir and encode lower subdir file handle
  touch $SCRATCH_MNT/lowertestdir
  test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
  # Rename lower dir offline
  unmount_dirs
  mv $lower/lowertestdir $lower/lowertestdir.new/
  mount_dirs
  # Attempt to decode lower subdir file handle
  test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle

Since this WARN_ON() can be triggered by user we need to relax it.

Fixes: 4b91c30a5a ("ovl: lookup connected ancestor of dir in inode cache")
Cc: <stable@vger.kernel.org> # v4.16+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:12 +02:00
youngjun 6270654c7d ovl: inode reference leak in ovl_is_inuse true case.
commit 24f14009b8 upstream.

When "ovl_is_inuse" true case, trap inode reference not put.  plus adding
the comment explaining sequence of ovl_is_inuse after ovl_setup_trap.

Fixes: 0be0bfd2de ("ovl: fix regression caused by overlapping layers detection")
Cc: <stable@vger.kernel.org> # v4.19+
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: youngjun <her0gyugyu@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:11 +02:00
Amir Goldstein 4996065307 ovl: fix regression with re-formatted lower squashfs
commit a888db3101 upstream.

Commit 9df085f3c9 ("ovl: relax requirement for non null uuid of lower
fs") relaxed the requirement for non null uuid with single lower layer to
allow enabling index and nfs_export features with single lower squashfs.

Fabian reported a regression in a setup when overlay re-uses an existing
upper layer and re-formats the lower squashfs image.  Because squashfs
has no uuid, the origin xattr in upper layer are decoded from the new
lower layer where they may resolve to a wrong origin file and user may
get an ESTALE or EIO error on lookup.

To avoid the reported regression while still allowing the new features
with single lower squashfs, do not allow decoding origin with lower null
uuid unless user opted-in to one of the new features that require
following the lower inode of non-dir upper (index, xino, metacopy).

Reported-by: Fabian <godi.beat@gmx.net>
Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
Fixes: 9df085f3c9 ("ovl: relax requirement for non null uuid of lower fs")
Cc: stable@vger.kernel.org # v4.20+
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-22 09:33:11 +02:00
Vasily Averin 408ef501b8 fuse: don't ignore errors from fuse_writepages_fill()
[ Upstream commit 7779b047a5 ]

fuse_writepages() ignores some errors taken from fuse_writepages_fill() I
believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
either guarantee that all data was successfully saved or return error.

Fixes: 26d614df1d ("fuse: Implement writepages callback")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22 09:33:03 +02:00
Anna Schumaker 9b810684b1 NFS: Fix interrupted slots by sending a solo SEQUENCE operation
[ Upstream commit 913fadc5b1 ]

We used to do this before 3453d5708b, but this was changed to better
handle the NFS4ERR_SEQ_MISORDERED error code. This commit fixed the slot
re-use case when the server doesn't receive the interrupted operation,
but if the server does receive the operation then it could still end up
replying to the client with mis-matched operations from the reply cache.

We can fix this by sending a SEQUENCE to the server while recovering from
a SEQ_MISORDERED error when we detect that we are in an interrupted slot
situation.

Fixes: 3453d5708b (NFSv4.1: Avoid false retries when RPC calls are interrupted)
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22 09:33:03 +02:00
Bob Peterson 27874115b0 gfs2: read-only mounts should grab the sd_freeze_gl glock
[ Upstream commit b780cc615b ]

Before this patch, only read-write mounts would grab the freeze
glock in read-only mode, as part of gfs2_make_fs_rw. So the freeze
glock was never initialized. That meant requests to freeze, which
request the glock in EX, were granted without any state transition.
That meant you could mount a gfs2 file system, which is currently
frozen on a different cluster node, in read-only mode.

This patch makes read-only mounts lock the freeze glock in SH mode,
which will block for file systems that are frozen on another node.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22 09:32:53 +02:00
Ronnie Sahlberg 91e81d2262 cifs: prevent truncation from long to int in wait_for_free_credits
[ Upstream commit 19e888678b ]

The wait_event_... defines evaluate to long so we should not assign it an int as this may truncate
the value.

Reported-by: Marshall Midden <marshallmidden@gmail.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-22 09:32:52 +02:00
Josef Bacik 026f830e0b btrfs: fix double put of block group with nocow
commit 230ed39743 upstream.

While debugging a patch that I wrote I was hitting use-after-free panics
when accessing block groups on unmount.  This turned out to be because
in the nocow case if we bail out of doing the nocow for whatever reason
we need to call btrfs_dec_nocow_writers() if we called the inc.  This
puts our block group, but a few error cases does

if (nocow) {
    btrfs_dec_nocow_writers();
    goto error;
}

unfortunately, error is

error:
	if (nocow)
		btrfs_dec_nocow_writers();

so we get a double put on our block group.  Fix this by dropping the
error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
error label now.

Fixes: 762bf09893 ("btrfs: improve error handling in run_delalloc_nocow")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-16 08:16:45 +02:00
Boris Burkov 808b2b3ea8 btrfs: fix fatal extent_buffer readahead vs releasepage race
commit 6bf9cd2eed upstream.

Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.

This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.

Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.

The following represents an example execution demonstrating the race:

            CPU0                                                         CPU1                                           CPU2
reada_for_search                                            reada_for_search
  readahead_tree_block                                        readahead_tree_block
    find_create_tree_block                                      find_create_tree_block
      alloc_extent_buffer                                         alloc_extent_buffer
                                                                  find_extent_buffer // not found
                                                                  allocates eb
                                                                  lock pages
                                                                  associate pages to eb
                                                                  insert eb into radix tree
                                                                  set TREE_REF, refs == 2
                                                                  unlock pages
                                                              read_extent_buffer_pages // WAIT_NONE
                                                                not uptodate (brand new eb)
                                                                                                            lock_page
                                                                if !trylock_page
                                                                  goto unlock_exit // not an error
                                                              free_extent_buffer
                                                                release_extent_buffer
                                                                  atomic_dec_and_test refs to 1
        find_extent_buffer // found
                                                                                                            try_release_extent_buffer
                                                                                                              take refs_lock
                                                                                                              reads refs == 1; no io
          atomic_inc_not_zero refs to 2
          mark_buffer_accessed
            check_buffer_tree_ref
              // not STALE, won't take refs_lock
              refs == 2; TREE_REF set // no action
    read_extent_buffer_pages // WAIT_NONE
                                                                                                              clear TREE_REF
                                                                                                              release_extent_buffer
                                                                                                                atomic_dec_and_test refs to 1
                                                                                                                unlock_page
      still not uptodate (CPU1 read failed on trylock_page)
      locks pages
      set io_pages > 0
      submit io
      return
    free_extent_buffer
      release_extent_buffer
        dec refs to 0
        delete from radix tree
        btrfs_release_extent_buffer_pages
          BUG_ON(io_pages > 0)!!!

We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.

To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.

Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS:  00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103]  release_extent_buffer+0x39/0x90
[1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645]  btrfs_search_slot+0x260/0x9b0
[1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427]  btrfs_get_extent+0x15f/0x830
[1417839.787665]  ? submit_extent_page+0xc4/0x1c0
[1417839.797474]  ? __do_readpage+0x299/0x7a0
[1417839.806515]  __do_readpage+0x33b/0x7a0
[1417839.815171]  ? btrfs_releasepage+0x70/0x70
[1417839.824597]  extent_readpages+0x28f/0x400
[1417839.833836]  read_pages+0x6a/0x1c0
[1417839.841729]  ? startup_64+0x2/0x30
[1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590]  filemap_fault+0x6c7/0x990
[1417839.869252]  ? xas_load+0x8/0x80
[1417839.876756]  ? xas_find+0x150/0x190
[1417839.884839]  ? filemap_map_pages+0x295/0x3b0
[1417839.894652]  __do_fault+0x32/0x110
[1417839.902540]  __handle_mm_fault+0xacd/0x1000
[1417839.912156]  handle_mm_fault+0xaa/0x1c0
[1417839.921004]  __do_page_fault+0x242/0x4b0
[1417839.930044]  ? page_fault+0x8/0x30
[1417839.937933]  page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-16 08:16:45 +02:00
Zhang Xiaoxu 15fa5dfaa4 cifs: update ctime and mtime during truncate
[ Upstream commit 5618303d85 ]

As the man description of the truncate, if the size changed,
then the st_ctime and st_mtime fields should be updated. But
in cifs, we doesn't do it.

It lead the xfstests generic/313 failed.

So, add the ATTR_MTIME|ATTR_CTIME flags on attrs when change
the file size

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-16 08:16:35 +02:00
Zhang Xiaoxu 71a20b798d cifs: Fix the target file was deleted when rename failed.
commit 9ffad9263b upstream.

When xfstest generic/035, we found the target file was deleted
if the rename return -EACESS.

In cifs_rename2, we unlink the positive target dentry if rename
failed with EACESS or EEXIST, even if the target dentry is positived
before rename. Then the existing file was deleted.

We should just delete the target file which created during the
rename.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
Paul Aurich 49dae9bed7 SMB3: Honor 'handletimeout' flag for multiuser mounts
commit 6b356f6cf9 upstream.

Fixes: ca567eb2b3 ("SMB3: Allow persistent handle timeout to be configurable on mount")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
Paul Aurich 7ab27439fe SMB3: Honor lease disabling for multiuser mounts
commit ad35f169db upstream.

Fixes: 3e7a02d478 ("smb3: allow disabling requesting leases")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
Paul Aurich 0d5824aea7 SMB3: Honor persistent/resilient handle flags for multiuser mounts
commit 00dfbc2f9c upstream.

Without this:

- persistent handles will only be enabled for per-user tcons if the
  server advertises the 'Continuous Availabity' capability
- resilient handles would never be enabled for per-user tcons

Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
Paul Aurich d56787683c SMB3: Honor 'seal' flag for multiuser mounts
commit cc15461c73 upstream.

Ensure multiuser SMB3 mounts use encryption for all users' tcons if the
mount options are configured to require encryption. Without this, only
the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user
tcons would only be encrypted if the server was configured to require
encryption.

Signed-off-by: Paul Aurich <paul@darkrain42.org>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
J. Bruce Fields fe05e114d0 nfsd: apply umask on fs without ACL support
commit 22cf8419f1 upstream.

The server is failing to apply the umask when creating new objects on
filesystems without ACL support.

To reproduce this, you need to use NFSv4.2 and a client and server
recent enough to support umask, and you need to export a filesystem that
lacks ACL support (for example, ext4 with the "noacl" mount option).

Filesystems with ACL support are expected to take care of the umask
themselves (usually by calling posix_acl_create).

For filesystems without ACL support, this is up to the caller of
vfs_create(), vfs_mknod(), or vfs_mkdir().

Reported-by: Elliott Mitchell <ehem+debian@m5p.com>
Reported-by: Salvatore Bonaccorso <carnil@debian.org>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Fixes: 47057abde5 ("nfsd: add support for the umask attribute")
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-07-09 09:37:55 +02:00
Paul Aurich 7d3f489e61 SMB3: Honor 'posix' flag for multiuser mounts
[ Upstream commit 5391b8e1b7 ]

The flag from the primary tcon needs to be copied into the volume info
so that cifs_get_tcon will try to enable extensions on the per-user
tcon. At that point, since posix extensions must have already been
enabled on the superblock, don't try to needlessly adjust the mount
flags.

Fixes: ce558b0e17 ("smb3: Add posix create context for smb3.11 posix mounts")
Fixes: b326614ea2 ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions")
Signed-off-by: Paul Aurich <paul@darkrain42.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09 09:37:54 +02:00
J. Bruce Fields c84138b3c1 nfsd: fix nfsdfs inode reference count leak
[ Upstream commit bf2654017e ]

I don't understand this code well, but  I'm seeing a warning about a
still-referenced inode on unmount, and every other similar filesystem
does a dput() here.

Fixes: e8a79fb14f ("nfsd: add nfsd/clients directory")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09 09:37:53 +02:00
J. Bruce Fields 2571e17356 nfsd4: fix nfsdfs reference count loop
[ Upstream commit 681370f4b0 ]

We don't drop the reference on the nfsdfs filesystem with
mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called
until the nfsd module's unloaded, and we can't unload the module as long
as there's a reference on nfsdfs.  So this prevents module unloading.

Fixes: 2c830dd720 ("nfsd: persist nfsd filesystem across mounts")
Reported-and-Tested-by:  Luo Xiaogang <lxgrxd@163.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09 09:37:53 +02:00
Jens Axboe 1c4404efcf io_uring: make sure async workqueue is canceled on exit
Track async work items that we queue, so we can safely cancel them
if the ring is closed or the process exits. Newer kernels handle
this automatically with io-wq, but the old workqueue based setup needs
a bit of special help to get there.

There's no upstream variant of this, as that would require backporting
all the io-wq changes from 5.5 and on. Hence I made a one-off that
ensures that we don't leak memory if we have async work items that
need active cancelation (like socket IO).

Reported-by: Agarwal, Anchal <anchalag@amazon.com>
Tested-by: Agarwal, Anchal <anchalag@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-07-09 09:37:49 +02:00
Zheng Bin ffd40b7962 xfs: add agf freeblocks verify in xfs_agf_verify
[ Upstream commit d0c7feaf87 ]

We recently used fuzz(hydra) to test XFS and automatically generate
tmp.img(XFS v5 format, but some metadata is wrong)

xfs_repair information(just one AG):
agf_freeblks 0, counted 3224 in ag 0
agf_longest 536874136, counted 3224 in ag 0
sb_fdblocks 613, counted 3228

Test as follows:
mount tmp.img tmpdir
cp file1M tmpdir
sync

In 4.19-stable, sync will stuck, the reason is:
xfs_mountfs
  xfs_check_summary_counts
    if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
       XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
       !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
	return 0;  -->just return, incore sb_fdblocks still be 613
    xfs_initialize_perag_data

cp file1M tmpdir -->ok(write file to pagecache)
sync -->stuck(write pagecache to disk)
xfs_map_blocks
  xfs_iomap_write_allocate
    while (count_fsb != 0) {
      nimaps = 0;
      while (nimaps == 0) { --> endless loop
         nimaps = 1;
         xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
xfs_bmapi_write
  xfs_bmap_alloc
    xfs_bmap_btalloc
      xfs_alloc_vextent
        xfs_alloc_fix_freelist
          xfs_alloc_space_available -->fail(agf_freeblks is 0)

In linux-next, sync not stuck, cause commit c2b3164320 ("xfs:
use the latest extent at writeback delalloc conversion time") remove
the above while, dmesg is as follows:
[   55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.

Users do not know why this page is discard, the better soultion is:
1. Like xfs_repair, make sure sb_fdblocks is equal to counted
(xfs_initialize_perag_data did this, who is not called at this mount)
2. Add agf verify, if fail, will tell users to repair

This patch use the second soultion.

Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Signed-off-by: Ren Xudong <renxudong1@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-06-30 15:37:13 -04:00
Olga Kornievskaia 4d35ca872a NFSv4 fix CLOSE not waiting for direct IO compeletion
commit d03727b248 upstream.

Figuring out the root case for the REMOVE/CLOSE race and
suggesting the solution was done by Neil Brown.

Currently what happens is that direct IO calls hold a reference
on the open context which is decremented as an asynchronous task
in the nfs_direct_complete(). Before reference is decremented,
control is returned to the application which is free to close the
file. When close is being processed, it decrements its reference
on the open_context but since directIO still holds one, it doesn't
sent a close on the wire. It returns control to the application
which is free to do other operations. For instance, it can delete a
file. Direct IO is finally releasing its reference and triggering
an asynchronous close. Which races with the REMOVE. On the server,
REMOVE can be processed before the CLOSE, failing the REMOVE with
EACCES as the file is still opened.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Suggested-by: Neil Brown <neilb@suse.com>
CC: stable@vger.kernel.org
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:12 -04:00
Trond Myklebust 02917bef8f pNFS/flexfiles: Fix list corruption if the mirror count changes
commit 8b04013737 upstream.

If the mirror count changes in the new layout we pick up inside
ff_layout_pg_init_write(), then we can end up adding the
request to the wrong mirror and corrupting the mirror->pg_list.

Fixes: d600ad1f2b ("NFS41: pop some layoutget errors to application")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:12 -04:00
Junxiao Bi ff6aff13a8 ocfs2: fix panic on nfs server over ocfs2
commit e5a15e17a7 upstream.

The following kernel panic was captured when running nfs server over
ocfs2, at that time ocfs2_test_inode_bit() was checking whether one
inode locating at "blkno" 5 was valid, that is ocfs2 root inode, its
"suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocted from
//global_inode_alloc, but here it wrongly assumed that it was got from per
slot inode alloctor which would cause array overflow and trigger kernel
panic.

  BUG: unable to handle kernel paging request at 0000000000001088
  IP: [<ffffffff816f6898>] _raw_spin_lock+0x18/0xf0
  PGD 1e06ba067 PUD 1e9e7d067 PMD 0
  Oops: 0002 [#1] SMP
  CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2
  Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018
  RIP: _raw_spin_lock+0x18/0xf0
  RSP: e02b:ffff88005ae97908  EFLAGS: 00010206
  RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000
  RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088
  RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00
  R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088
  R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff
  FS:  0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000
  CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660
  Call Trace:
    igrab+0x1e/0x60
    ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2]
    ocfs2_test_inode_bit+0x328/0xa00 [ocfs2]
    ocfs2_get_parent+0xba/0x3e0 [ocfs2]
    reconnect_path+0xb5/0x300
    exportfs_decode_fh+0xf6/0x2b0
    fh_verify+0x350/0x660 [nfsd]
    nfsd4_putfh+0x4d/0x60 [nfsd]
    nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
    nfsd_dispatch+0xe0/0x290 [nfsd]
    svc_process_common+0x412/0x6a0 [sunrpc]
    svc_process+0x123/0x210 [sunrpc]
    nfsd+0xff/0x170 [nfsd]
    kthread+0xcb/0xf0
    ret_from_fork+0x61/0x90
  Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 <f0> 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
  RIP   _raw_spin_lock+0x18/0xf0
  CR2: 0000000000001088
  ---[ end trace 7264463cd1aac8f9 ]---
  Kernel panic - not syncing: Fatal exception

Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:09 -04:00
Junxiao Bi a8d82ebaee ocfs2: fix value of OCFS2_INVALID_SLOT
commit 9277f8334f upstream.

In the ocfs2 disk layout, slot number is 16 bits, but in ocfs2
implementation, slot number is 32 bits.  Usually this will not cause any
issue, because slot number is converted from u16 to u32, but
OCFS2_INVALID_SLOT was defined as -1, when an invalid slot number from
disk was obtained, its value was (u16)-1, and it was converted to u32.
Then the following checking in get_local_system_inode will be always
skipped:

 static struct inode **get_local_system_inode(struct ocfs2_super *osb,
                                               int type,
                                               u32 slot)
 {
 	BUG_ON(slot == OCFS2_INVALID_SLOT);
	...
 }

Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:09 -04:00
Junxiao Bi 4685df862c ocfs2: load global_inode_alloc
commit 7569d3c754 upstream.

Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, that will
make it load during mount.  It can be used to test whether some
global/system inodes are valid.  One use case is that nfsd will test
whether root inode is valid.

Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:09 -04:00
Junxiao Bi 7fa716a594 ocfs2: avoid inode removal while nfsd is accessing it
commit 4cd9973f9f upstream.

Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.

This is a series of patches to fix issues on nfsd over ocfs2.  patch 1
is to avoid inode removed while nfsd access it patch 2 & 3 is to fix a
panic issue.

This patch (of 4):

When nfsd is getting file dentry using handle or parent dentry of some
dentry, one cluster lock is used to avoid inode removed from other node,
but it still could be removed from local node, so use a rw lock to avoid
this.

Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:09 -04:00
Filipe Manana a79c3a99ac btrfs: fix failure of RWF_NOWAIT write into prealloc extent beyond eof
commit 4b1946284d upstream.

If we attempt to write to prealloc extent located after eof using a
RWF_NOWAIT write, we always fail with -EAGAIN.

We do actually check if we have an allocated extent for the write at
the start of btrfs_file_write_iter() through a call to check_can_nocow(),
but later when we go into the actual direct IO write path we simply
return -EAGAIN if the write starts at or beyond EOF.

Trivial to reproduce:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ touch /mnt/foo
  $ chattr +C /mnt/foo

  $ xfs_io -d -c "pwrite -S 0xab 0 64K" /mnt/foo
  wrote 65536/65536 bytes at offset 0
  64 KiB, 16 ops; 0.0004 sec (135.575 MiB/sec and 34707.1584 ops/sec)

  $ xfs_io -c "falloc -k 64K 1M" /mnt/foo

  $ xfs_io -d -c "pwrite -N -V 1 -S 0xfe -b 64K 64K 64K" /mnt/foo
  pwrite: Resource temporarily unavailable

On xfs and ext4 the write succeeds, as expected.

Fix this by removing the wrong check at btrfs_direct_IO().

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-06-30 15:37:08 -04:00