1
0
Fork 0
Commit Graph

7731 Commits (d75f773c86a2b8b7278e2c33343b46a4024bc002)

Author SHA1 Message Date
Sakari Ailus d75f773c86 treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively
%pF and %pf are functionally equivalent to %pS and %ps conversion
specifiers. The former are deprecated, therefore switch the current users
to use the preferred variant.

The changes have been produced by the following command:

	git grep -l '%p[fF]' | grep -v '^\(tools\|Documentation\)/' | \
	while read i; do perl -i -pe 's/%pf/%ps/g; s/%pF/%pS/g;' $i; done

And verifying the result.

Link: http://lkml.kernel.org/r/20190325193229.23390-1-sakari.ailus@linux.intel.com
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: sparclinux@vger.kernel.org
Cc: linux-um@lists.infradead.org
Cc: xen-devel@lists.xenproject.org
Cc: linux-acpi@vger.kernel.org
Cc: linux-pm@vger.kernel.org
Cc: drbd-dev@lists.linbit.com
Cc: linux-block@vger.kernel.org
Cc: linux-mmc@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Cc: linux-pci@vger.kernel.org
Cc: linux-scsi@vger.kernel.org
Cc: linux-btrfs@vger.kernel.org
Cc: linux-f2fs-devel@lists.sourceforge.net
Cc: linux-mm@kvack.org
Cc: ceph-devel@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Acked-by: David Sterba <dsterba@suse.com> (for btrfs)
Acked-by: Mike Rapoport <rppt@linux.ibm.com> (for mm/memblock.c)
Acked-by: Bjorn Helgaas <bhelgaas@google.com> (for drivers/pci)
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-04-09 14:19:06 +02:00
Linus Torvalds 80201fe175 for-5.1/block-20190302
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlx63XIQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpp2vEACfrrQsap7R+Av28mmXpmXi2FPa3g5Tev1t
 yYjK2qHvhlMZjPTYw3hCmbYdDDczlF7PEgSE2x2DjdcsYapb8Fy1lZ2X16c7ztBR
 HD/t9b5AVSQsczZzKgv3RqsNtTnjzS5V0A8XH8FAP2QRgiwDMwSN6G0FP0JBLbE/
 ZgxQrH1Iy1F33Wz4hI3Z7dEghKPZrH1IlegkZCEu47q9SlWS76qUetSy2GEtchOl
 3Lgu54mQZyVdI5/QZf9DyMDLF6dIz3tYU2qhuo01AHjGRCC72v86p8sIiXcUr94Q
 8pbegJhJ/g8KBol9Qhv3+pWG/QUAZwi/ZwasTkK+MJ4klRXfOrznxPubW1z6t9Vn
 QRo39Po5SqqP0QWAscDxCFjESIQlWlKa+LZurJL7DJDCUGrSgzTpnVwFqKwc5zTP
 HJa5MT2tEeL2TfUYRYCfh0ZV0elINdHA1y1klDBh38drh4EWr2gW8xdseGYXqRjh
 fLgEpoF7VQ8kTvxKN+E4jZXkcZmoLmefp0ZyAbblS6IawpPVC7kXM9Fdn2OU8f2c
 fjVjvSiqxfeN6dnpfeLDRbbN9894HwgP/LPropJOQ7KmjCorQq5zMDkAvoh3tElq
 qwluRqdBJpWT/F05KweY+XVW8OawIycmUWqt6JrVNoIDAK31auHQv47kR0VA4OvE
 DRVVhYpocw==
 =VBaU
 -----END PGP SIGNATURE-----

Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block

Pull block layer updates from Jens Axboe:
 "Not a huge amount of changes in this round, the biggest one is that we
  finally have Mings multi-page bvec support merged. Apart from that,
  this pull request contains:

   - Small series that avoids quiescing the queue for sysfs changes that
     match what we currently have (Aleksei)

   - Series of bcache fixes (via Coly)

   - Series of lightnvm fixes (via Mathias)

   - NVMe pull request from Christoph. Nothing major, just SPDX/license
     cleanups, RR mp policy (Hannes), and little fixes (Bart,
     Chaitanya).

   - BFQ series (Paolo)

   - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection
     for the fast path (Jianchao)

   - fops->iopoll() added for async IO polling, this is a feature that
     the upcoming io_uring interface will use (Christoph, me)

   - Partition scan loop fixes (Dongli)

   - mtip32xx conversion from managed resource API (Christoph)

   - cdrom registration race fix (Guenter)

   - MD pull from Song, two minor fixes.

   - Various documentation fixes (Marcos)

   - Multi-page bvec feature. This brings a lot of nice improvements
     with it, like more efficient splitting, larger IOs can be supported
     without growing the bvec table size, and so on. (Ming)

   - Various little fixes to core and drivers"

* tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits)
  block: fix updating bio's front segment size
  block: Replace function name in string with __func__
  nbd: propagate genlmsg_reply return code
  floppy: remove set but not used variable 'q'
  null_blk: fix checking for REQ_FUA
  block: fix NULL pointer dereference in register_disk
  fs: fix guard_bio_eod to check for real EOD errors
  blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map
  block: optimize bvec iteration in bvec_iter_advance
  block: introduce mp_bvec_for_each_page() for iterating over page
  block: optimize blk_bio_segment_split for single-page bvec
  block: optimize __blk_segment_map_sg() for single-page bvec
  block: introduce bvec_nth_page()
  iomap: wire up the iopoll method
  block: add bio_set_polled() helper
  block: wire up block device iopoll method
  fs: add an iopoll method to struct file_operations
  loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part()
  loop: do not print warn message if partition scan is successful
  block: bounce: make sure that bvec table is updated
  ...
2019-03-08 14:12:17 -08:00
Linus Torvalds b5dd0c658c Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - some of the rest of MM

 - various misc things

 - dynamic-debug updates

 - checkpatch

 - some epoll speedups

 - autofs

 - rapidio

 - lib/, lib/lzo/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (83 commits)
  samples/mic/mpssd/mpssd.h: remove duplicate header
  kernel/fork.c: remove duplicated include
  include/linux/relay.h: fix percpu annotation in struct rchan
  arch/nios2/mm/fault.c: remove duplicate include
  unicore32: stop printing the virtual memory layout
  MAINTAINERS: fix GTA02 entry and mark as orphan
  mm: create the new vm_fault_t type
  arm, s390, unicore32: remove oneliner wrappers for memblock_alloc()
  arch: simplify several early memory allocations
  openrisc: simplify pte_alloc_one_kernel()
  sh: prefer memblock APIs returning virtual address
  microblaze: prefer memblock API returning virtual address
  powerpc: prefer memblock APIs returning virtual address
  lib/lzo: separate lzo-rle from lzo
  lib/lzo: implement run-length encoding
  lib/lzo: fast 8-byte copy on arm64
  lib/lzo: 64-bit CTZ on arm64
  lib/lzo: tidy-up ifdefs
  ipc/sem.c: replace kvmalloc/memset with kvzalloc and use struct_size
  ipc: annotate implicit fall through
  ...
2019-03-07 19:25:37 -08:00
Rasmus Villemoes afe1a715e8 btrfs: implement btrfs_debug* in terms of helper macro
First, the btrfs_debug macros open-code (one possible definition of)
DYNAMIC_DEBUG_BRANCH, so they don't benefit from the CONFIG_JUMP_LABEL
optimization.

Second, a planned change of struct _ddebug (to reduce its size on 64 bit
machines) requires that all descriptors in a translation unit use
distinct identifiers.

Using the new _dynamic_func_call_no_desc helper macro from
dynamic_debug.h takes care of both of these.  No functional change.

Link: http://lkml.kernel.org/r/20190212214150.4807-12-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Jason Baron <jbaron@akamai.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-07 18:32:00 -08:00
YueHaibing f65e25e343 btrfs: Remove unnecessary casts in btrfs_read_root_item
There is a messy cast here:
	min_t(int, len, (int)sizeof(*item)));

min_t() should normally cast to unsigned.  It's not possible for "len"
to be negative, but if it were then we definitely wouldn't want to pass
negatives to read_extent_buffer().  Also there is an extra cast.

This patch shouldn't affect runtime, it's just a clean up.

Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:19:23 +01:00
Filipe Manana 253002f2e3 Btrfs: remove assertion when searching for a key in a node/leaf
At ctree.c:key_search(), the assertion that verifies the first key on a
child extent buffer corresponds to the key at a specific slot in the
parent has a disadvantage: we effectively hit a BUG_ON() which requires
rebooting the machine later. It also does not tell any information about
which extent buffer is affected, from which root, the expected and found
keys, etc.

However as of commit 581c176041 ("btrfs: Validate child tree block's
level and first key"), that assertion is not needed since at the time we
read an extent buffer from disk we validate that its first key matches the
key, at the respective slot, in the parent extent buffer. Therefore just
remove the assertion at key_search().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:19:23 +01:00
Filipe Manana cbca7d59fe Btrfs: add missing error handling after doing leaf/node binary search
The function map_private_extent_buffer() can return an -EINVAL error, and
it is called by generic_bin_search() which will return back the error. The
btrfs_bin_search() function in turn calls generic_bin_search() and the
key_search() function calls btrfs_bin_search(), so both can return the
-EINVAL error coming from the map_private_extent_buffer() function. Some
callers of these functions were ignoring that these functions can return
an error, so fix them to deal with error return values.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:19:23 +01:00
Dan Carpenter 669e859b5e btrfs: drop the lock on error in btrfs_dev_replace_cancel
We should drop the lock on this error path.  This has been found by a
static tool.

The lock needs to be released, it's there to protect access to the
dev_replace members and is not supposed to be left locked. The value of
state that's being switched would need to be artifically changed to an
invalid value so the default: branch is taken.

Fixes: d189dd70e2 ("btrfs: fix use-after-free due to race between replace start and cancel")
CC: stable@vger.kernel.org # 5.0+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:41 +01:00
Johannes Thumshirn 349ae63f40 btrfs: ensure that a DUP or RAID1 block group has exactly two stripes
We recently had a customer issue with a corrupted filesystem. When
trying to mount this image btrfs panicked with a division by zero in
calc_stripe_length().

The corrupt chunk had a 'num_stripes' value of 1. calc_stripe_length()
takes this value and divides it by the number of copies the RAID profile
is expected to have to calculate the amount of data stripes. As a DUP
profile is expected to have 2 copies this division resulted in 1/2 = 0.
Later then the 'data_stripes' variable is used as a divisor in the
stripe length calculation which results in a division by 0 and thus a
kernel panic.

When encountering a filesystem with a DUP block group and a
'num_stripes' value unequal to 2, refuse mounting as the image is
corrupted and will lead to unexpected behaviour.

Code inspection showed a RAID1 block group has the same issues.

Fixes: e06cd3dd7c ("Btrfs: add validadtion checks for chunk loading")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:41 +01:00
Dan Robertson e49be14b8d btrfs: init csum_list before possible free
The scrub_ctx csum_list member must be initialized before scrub_free_ctx
is called. If the csum_list is not initialized beforehand, the
list_empty call in scrub_free_csums will result in a null deref if the
allocation fails in the for loop.

Fixes: a2de733c78 ("btrfs: scrub")
CC: stable@vger.kernel.org # 3.0+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dan Robertson <dan@dlrobertson.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:41 +01:00
Filipe Manana 57a50e2506 Btrfs: remove no longer needed range length checks for deduplication
Comparing the content of the pages in the range to deduplicate is now
done in generic_remap_checks called by the generic helper
generic_remap_file_range_prep(), which takes care of ensuring we do not
compare/deduplicate undefined data beyond a file's EOF (range from EOF
to the next block boundary). So remove these checks which are now
redundant.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:40 +01:00
Filipe Manana a3baaf0d78 Btrfs: fix fsync after succession of renames and unlink/rmdir
After a succession of renames operations of different files and unlinking
one of them, if we fsync one of the renamed files we can end up with a
log that will either fail to replay at mount time or result in a filesystem
that is in an inconsistent state. One example scenario:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/fname1
  $ touch /mnt/testdir/fname2

  $ sync

  $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
  $ rm -f /mnt/testdir/fname2
  $ ln /mnt/testdir/fname3 /mnt/testdir/fname2

  $ touch /mnt/testdir/fname1
  $ xfs_io -c "fsync" /mnt/testdir/fname1

  <power failure>

  $ mount /dev/sdb /mnt
  $ umount /mnt
  $ btrfs check /dev/sdb
  [1/7] checking root items
  [2/7] checking extents
  [3/7] checking free space cache
  [4/7] checking fs roots
  root 5 inode 259 errors 2, no orphan item
  ERROR: errors found in fs roots
  Opening filesystem to check...
  Checking filesystem on /dev/sdc
  UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d
  found 393216 bytes used, error(s) found
  total csum bytes: 0
  total tree bytes: 131072
  total fs tree bytes: 32768
  total extent tree bytes: 16384
  btree space waste bytes: 122986
  file data blocks allocated: 262144
   referenced 262144

On a kernel without the first patch in this series, titled
"[PATCH] Btrfs: fix fsync after succession of renames of different files",
we get instead an error when mounting the filesystem due to failure of
replaying the log:

  $ mount /dev/sdb /mnt
  mount: mount /dev/sdb on /mnt failed: File exists

Fix this by logging the parent directory of an inode whenever we find an
inode that no longer exists (was unlinked in the current transaction),
during the procedure which finds inodes that have old names that collide
with new names of other inodes.

A test case for fstests follows soon.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:40 +01:00
Filipe Manana 6b5fc433a7 Btrfs: fix fsync after succession of renames of different files
After a succession of rename operations of different files and fsyncing
one of them, such that each file gets a new name that corresponds to an
old name of another file, we can end up with a log that will cause a
failure when attempted to replay at mount time (an EEXIST error).
We currently have correct behaviour when such succession of renames
involves only two files, but if there are more files involved, we end up
not logging all the inodes that are needed, therefore resulting in a
failure when attempting to replay the log.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/fname1
  $ touch /mnt/testdir/fname2

  $ sync

  $ mv /mnt/testdir/fname1 /mnt/testdir/fname3
  $ mv /mnt/testdir/fname2 /mnt/testdir/fname4
  $ ln /mnt/testdir/fname3 /mnt/testdir/fname2

  $ touch /mnt/testdir/fname1
  $ xfs_io -c "fsync" /mnt/testdir/fname1

  <power failure>

  $ mount /dev/sdb /mnt
  mount: mount /dev/sdb on /mnt failed: File exists

So fix this by checking all inode dependencies when logging an inode. That
is, if one logged inode A has a new name that matches the old name of some
other inode B, check if inode B has a new name that matches the old name
of some other inode C, and so on. This fix is implemented not by doing any
recursive function calls but by using an iterative method using a linked
list that is used in a first-in-first-out fashion.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:40 +01:00
Josef Bacik 38e3eebff6 btrfs: honor path->skip_locking in backref code
Qgroups will do the old roots lookup at delayed ref time, which could be
while walking down the extent root while running a delayed ref.  This
should be fine, except we specifically lock eb's in the backref walking
code irrespective of path->skip_locking, which deadlocks the system.
Fix up the backref code to honor path->skip_locking, nobody will be
modifying the commit_root when we're searching so it's completely safe
to do.

This happens since fb235dc06f ("btrfs: qgroup: Move half of the qgroup
accounting time out of commit trans"), kernel may lockup with quota
enabled.

There is one backref trace triggered by snapshot dropping along with
write operation in the source subvolume.  The example can be reliably
reproduced:

  btrfs-cleaner   D    0  4062      2 0x80000000
  Call Trace:
   schedule+0x32/0x90
   btrfs_tree_read_lock+0x93/0x130 [btrfs]
   find_parent_nodes+0x29b/0x1170 [btrfs]
   btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
   btrfs_find_all_roots+0x57/0x70 [btrfs]
   btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
   btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
   btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
   do_walk_down+0x541/0x5e3 [btrfs]
   walk_down_tree+0xab/0xe7 [btrfs]
   btrfs_drop_snapshot+0x356/0x71a [btrfs]
   btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
   cleaner_kthread+0x12b/0x160 [btrfs]
   kthread+0x112/0x130
   ret_from_fork+0x27/0x50

When dropping snapshots with qgroup enabled, we will trigger backref
walk.

However such backref walk at that timing is pretty dangerous, as if one
of the parent nodes get WRITE locked by other thread, we could cause a
dead lock.

For example:

           FS 260     FS 261 (Dropped)
            node A        node B
           /      \      /      \
       node C      node D      node E
      /   \         /  \        /     \
  leaf F|leaf G|leaf H|leaf I|leaf J|leaf K

The lock sequence would be:

      Thread A (cleaner)             |       Thread B (other writer)
-----------------------------------------------------------------------
write_lock(B)                        |
write_lock(D)                        |
^^^ called by walk_down_tree()       |
                                     |       write_lock(A)
                                     |       write_lock(D) << Stall
read_lock(H) << for backref walk     |
read_lock(D) << lock owner is        |
                the same thread A    |
                so read lock is OK   |
read_lock(A) << Stall                |

So thread A hold write lock D, and needs read lock A to unlock.
While thread B holds write lock A, while needs lock D to unlock.

This will cause a deadlock.

This is not only limited to snapshot dropping case.  As the backref
walk, even only happens on commit trees, is breaking the normal top-down
locking order, makes it deadlock prone.

Fixes: fb235dc06f ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
CC: stable@vger.kernel.org # 4.14+
Reported-and-tested-by: David Sterba <dsterba@suse.com>
Reported-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
[ rebase to latest branch and fix lock assert bug in btrfs/007 ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy logs and deadlock analysis from Qu's patch ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:39 +01:00
Qu Wenruo f5fef45936 btrfs: qgroup: Make qgroup async transaction commit more aggressive
[BUG]
Btrfs qgroup will still hit EDQUOT under the following case:

  $ dev=/dev/test/test
  $ mnt=/mnt/btrfs
  $ umount $mnt &> /dev/null
  $ umount $dev &> /dev/null

  $ mkfs.btrfs -f $dev
  $ mount $dev $mnt -o nospace_cache

  $ btrfs subv create $mnt/subv
  $ btrfs quota enable $mnt
  $ btrfs quota rescan -w $mnt
  $ btrfs qgroup limit -e 1G $mnt/subv

  $ fallocate -l 900M $mnt/subv/padding
  $ sync

  $ rm $mnt/subv/padding

  # Hit EDQUOT
  $ xfs_io -f -c "pwrite 0 512M" $mnt/subv/real_file

[CAUSE]
Since commit a514d63882 ("btrfs: qgroup: Commit transaction in advance
to reduce early EDQUOT"), btrfs is not forced to commit transaction to
reclaim more quota space.

Instead, we just check pertrans metadata reservation against some
threshold and try to do asynchronously transaction commit.

However in above case, the pertrans metadata reservation is pretty small
thus it will never trigger asynchronous transaction commit.

[FIX]
Instead of only accounting pertrans metadata reservation, we calculate
how much free space we have, and if there isn't much free space left,
commit transaction asynchronously to try to free some space.

This may slow down the fs when we have less than 32M free qgroup space,
but should reduce a lot of false EDQUOT, so the cost should be
acceptable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:39 +01:00
Qu Wenruo 1418bae1c2 btrfs: qgroup: Move reserved data accounting from btrfs_delayed_ref_head to btrfs_qgroup_extent_record
[BUG]
Btrfs/139 will fail with a high probability if the testing machine (VM)
has only 2G RAM.

Resulting the final write success while it should fail due to EDQUOT,
and the fs will have quota exceeding the limit by 16K.

The simplified reproducer will be: (needs a 2G ram VM)

  $ mkfs.btrfs -f $dev
  $ mount $dev $mnt

  $ btrfs subv create $mnt/subv
  $ btrfs quota enable $mnt
  $ btrfs quota rescan -w $mnt
  $ btrfs qgroup limit -e 1G $mnt/subv

  $ for i in $(seq -w  1 8); do
  	xfs_io -f -c "pwrite 0 128M" $mnt/subv/file_$i > /dev/null
  	echo "file $i written" > /dev/kmsg
    done
  $ sync
  $ btrfs qgroup show -pcre --raw $mnt

The last pwrite will not trigger EDQUOT and final 'qgroup show' will
show something like:

  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5             16384        16384         none         none ---     ---
  0/256      1073758208   1073758208         none   1073741824 ---     ---

And 1073758208 is larger than
  > 1073741824.

[CAUSE]
It's a bug in btrfs qgroup data reserved space management.

For quota limit, we must ensure that:
  reserved (data + metadata) + rfer/excl <= limit

Since rfer/excl is only updated at transaction commmit time, reserved
space needs to be taken special care.

One important part of reserved space is data, and for a new data extent
written to disk, we still need to take the reserved space until
rfer/excl numbers get updated.

Originally when an ordered extent finishes, we migrate the reserved
qgroup data space from extent_io tree to delayed ref head of the data
extent, expecting delayed ref will only be cleaned up at commit
transaction time.

However for small RAM machine, due to memory pressure dirty pages can be
flushed back to disk without committing a transaction.

The related events will be something like:

  file 1 written
  btrfs_finish_ordered_io: ino=258 ordered offset=0 len=54947840
  btrfs_finish_ordered_io: ino=258 ordered offset=54947840 len=5636096
  btrfs_finish_ordered_io: ino=258 ordered offset=61153280 len=57344
  btrfs_finish_ordered_io: ino=258 ordered offset=61210624 len=8192
  btrfs_finish_ordered_io: ino=258 ordered offset=60583936 len=569344
  cleanup_ref_head: num_bytes=54947840
  cleanup_ref_head: num_bytes=5636096
  cleanup_ref_head: num_bytes=569344
  cleanup_ref_head: num_bytes=57344
  cleanup_ref_head: num_bytes=8192
  ^^^^^^^^^^^^^^^^ This will free qgroup data reserved space
  file 2 written
  ...
  file 8 written
  cleanup_ref_head: num_bytes=8192
  ...
  btrfs_commit_transaction  <<< the only transaction committed during
				the test

When file 2 is written, we have already freed 128M reserved qgroup data
space for ino 258. Thus later write won't trigger EDQUOT.

This allows us to write more data beyond qgroup limit.

In my 2G ram VM, it could reach about 1.2G before hitting EDQUOT.

[FIX]
By moving reserved qgroup data space from btrfs_delayed_ref_head to
btrfs_qgroup_extent_record, we can ensure that reserved qgroup data
space won't be freed half way before commit transaction, thus fix the
problem.

Fixes: f64d5ca868 ("btrfs: delayed_ref: Add new function to record reserved space into delayed ref")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:39 +01:00
David Sterba 0ea8207626 btrfs: scrub: remove unused nocow worker pointer
The member btrfs_fs_info::scrub_nocow_workers is unused since the nocow
optimization was removed from scrub in 9bebe665c3 ("btrfs: scrub:
Remove unused copy_nocow_pages and its callchain").

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:38 +01:00
David Sterba c835294274 btrfs: scrub: add assertions for worker pointers
The scrub worker pointers are not NULL iff the scrub is running, so
reset them back once the last reference is dropped. Add assertions to
the initial phase of scrub to verify that.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:38 +01:00
Anand Jain ff09c4ca59 btrfs: scrub: convert scrub_workers_refcnt to refcount_t
Use the refcount_t for fs_info::scrub_workers_refcnt instead of int so
we get the extra checks. All reference changes are still done under
scrub_lock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:38 +01:00
Anand Jain eb4318e59a btrfs: scrub: add scrub_lock lockdep check in scrub_workers_get
scrub_workers_refcnt is protected by scrub_lock, add lockdep_assert_held()
in scrub_workers_get().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:37 +01:00
Anand Jain 1cec3f2716 btrfs: scrub: fix circular locking dependency warning
This fixes a longstanding lockdep warning triggered by
fstests/btrfs/011.

Circular locking dependency check reports warning[1], that's because the
btrfs_scrub_dev() calls the stack #0 below with, the fs_info::scrub_lock
held. The test case leading to this warning:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /btrfs
  $ btrfs scrub start -B /btrfs

In fact we have fs_info::scrub_workers_refcnt to track if the init and destroy
of the scrub workers are needed. So once we have incremented and decremented
the fs_info::scrub_workers_refcnt value in the thread, its ok to drop the
scrub_lock, and then actually do the btrfs_destroy_workqueue() part. So this
patch drops the scrub_lock before calling btrfs_destroy_workqueue().

  [359.258534] ======================================================
  [359.260305] WARNING: possible circular locking dependency detected
  [359.261938] 5.0.0-rc6-default #461 Not tainted
  [359.263135] ------------------------------------------------------
  [359.264672] btrfs/20975 is trying to acquire lock:
  [359.265927] 00000000d4d32bea ((wq_completion)"%s-%s""btrfs", name){+.+.}, at: flush_workqueue+0x87/0x540
  [359.268416]
  [359.268416] but task is already holding lock:
  [359.270061] 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
  [359.272418]
  [359.272418] which lock already depends on the new lock.
  [359.272418]
  [359.274692]
  [359.274692] the existing dependency chain (in reverse order) is:
  [359.276671]
  [359.276671] -> #3 (&fs_info->scrub_lock){+.+.}:
  [359.278187]        __mutex_lock+0x86/0x9c0
  [359.279086]        btrfs_scrub_pause+0x31/0x100 [btrfs]
  [359.280421]        btrfs_commit_transaction+0x1e4/0x9e0 [btrfs]
  [359.281931]        close_ctree+0x30b/0x350 [btrfs]
  [359.283208]        generic_shutdown_super+0x64/0x100
  [359.284516]        kill_anon_super+0x14/0x30
  [359.285658]        btrfs_kill_super+0x12/0xa0 [btrfs]
  [359.286964]        deactivate_locked_super+0x29/0x60
  [359.288242]        cleanup_mnt+0x3b/0x70
  [359.289310]        task_work_run+0x98/0xc0
  [359.290428]        exit_to_usermode_loop+0x83/0x90
  [359.291445]        do_syscall_64+0x15b/0x180
  [359.292598]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [359.294011]
  [359.294011] -> #2 (sb_internal#2){.+.+}:
  [359.295432]        __sb_start_write+0x113/0x1d0
  [359.296394]        start_transaction+0x369/0x500 [btrfs]
  [359.297471]        btrfs_finish_ordered_io+0x2aa/0x7c0 [btrfs]
  [359.298629]        normal_work_helper+0xcd/0x530 [btrfs]
  [359.299698]        process_one_work+0x246/0x610
  [359.300898]        worker_thread+0x3c/0x390
  [359.302020]        kthread+0x116/0x130
  [359.303053]        ret_from_fork+0x24/0x30
  [359.304152]
  [359.304152] -> #1 ((work_completion)(&work->normal_work)){+.+.}:
  [359.306100]        process_one_work+0x21f/0x610
  [359.307302]        worker_thread+0x3c/0x390
  [359.308465]        kthread+0x116/0x130
  [359.309357]        ret_from_fork+0x24/0x30
  [359.310229]
  [359.310229] -> #0 ((wq_completion)"%s-%s""btrfs", name){+.+.}:
  [359.311812]        lock_acquire+0x90/0x180
  [359.312929]        flush_workqueue+0xaa/0x540
  [359.313845]        drain_workqueue+0xa1/0x180
  [359.314761]        destroy_workqueue+0x17/0x240
  [359.315754]        btrfs_destroy_workqueue+0x57/0x200 [btrfs]
  [359.317245]        scrub_workers_put+0x2c/0x60 [btrfs]
  [359.318585]        btrfs_scrub_dev+0x336/0x590 [btrfs]
  [359.319944]        btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
  [359.321622]        btrfs_ioctl+0x28a4/0x2e40 [btrfs]
  [359.322908]        do_vfs_ioctl+0xa2/0x6d0
  [359.324021]        ksys_ioctl+0x3a/0x70
  [359.325066]        __x64_sys_ioctl+0x16/0x20
  [359.326236]        do_syscall_64+0x54/0x180
  [359.327379]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [359.328772]
  [359.328772] other info that might help us debug this:
  [359.328772]
  [359.330990] Chain exists of:
  [359.330990]   (wq_completion)"%s-%s""btrfs", name --> sb_internal#2 --> &fs_info->scrub_lock
  [359.330990]
  [359.334376]  Possible unsafe locking scenario:
  [359.334376]
  [359.336020]        CPU0                    CPU1
  [359.337070]        ----                    ----
  [359.337821]   lock(&fs_info->scrub_lock);
  [359.338506]                                lock(sb_internal#2);
  [359.339506]                                lock(&fs_info->scrub_lock);
  [359.341461]   lock((wq_completion)"%s-%s""btrfs", name);
  [359.342437]
  [359.342437]  *** DEADLOCK ***
  [359.342437]
  [359.343745] 1 lock held by btrfs/20975:
  [359.344788]  #0: 0000000053ea26a6 (&fs_info->scrub_lock){+.+.}, at: btrfs_scrub_dev+0x322/0x590 [btrfs]
  [359.346778]
  [359.346778] stack backtrace:
  [359.347897] CPU: 0 PID: 20975 Comm: btrfs Not tainted 5.0.0-rc6-default #461
  [359.348983] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
  [359.350501] Call Trace:
  [359.350931]  dump_stack+0x67/0x90
  [359.351676]  print_circular_bug.isra.37.cold.56+0x15c/0x195
  [359.353569]  check_prev_add.constprop.44+0x4f9/0x750
  [359.354849]  ? check_prev_add.constprop.44+0x286/0x750
  [359.356505]  __lock_acquire+0xb84/0xf10
  [359.357505]  lock_acquire+0x90/0x180
  [359.358271]  ? flush_workqueue+0x87/0x540
  [359.359098]  flush_workqueue+0xaa/0x540
  [359.359912]  ? flush_workqueue+0x87/0x540
  [359.360740]  ? drain_workqueue+0x1e/0x180
  [359.361565]  ? drain_workqueue+0xa1/0x180
  [359.362391]  drain_workqueue+0xa1/0x180
  [359.363193]  destroy_workqueue+0x17/0x240
  [359.364539]  btrfs_destroy_workqueue+0x57/0x200 [btrfs]
  [359.365673]  scrub_workers_put+0x2c/0x60 [btrfs]
  [359.366618]  btrfs_scrub_dev+0x336/0x590 [btrfs]
  [359.367594]  ? start_transaction+0xa1/0x500 [btrfs]
  [359.368679]  btrfs_dev_replace_by_ioctl.cold.19+0x179/0x1bb [btrfs]
  [359.369545]  btrfs_ioctl+0x28a4/0x2e40 [btrfs]
  [359.370186]  ? __lock_acquire+0x263/0xf10
  [359.370777]  ? kvm_clock_read+0x14/0x30
  [359.371392]  ? kvm_sched_clock_read+0x5/0x10
  [359.372248]  ? sched_clock+0x5/0x10
  [359.372786]  ? sched_clock_cpu+0xc/0xc0
  [359.373662]  ? do_vfs_ioctl+0xa2/0x6d0
  [359.374552]  do_vfs_ioctl+0xa2/0x6d0
  [359.375378]  ? do_sigaction+0xff/0x250
  [359.376233]  ksys_ioctl+0x3a/0x70
  [359.376954]  __x64_sys_ioctl+0x16/0x20
  [359.377772]  do_syscall_64+0x54/0x180
  [359.378841]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [359.380422] RIP: 0033:0x7f5429296a97

Backporting to older kernels: scrub_nocow_workers must be freed the same
way as the others.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ update changelog ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:37 +01:00
Anand Jain 7faad6e25c btrfs: fix comment its device list mutex not volume lock
We have killed volume mutex (commit: dccdb07bc9
btrfs: kill btrfs_fs_info::volume_mutex). This a trival one seems to have
escaped.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:37 +01:00
Qu Wenruo bb58eb9e16 btrfs: extent_io: Kill the forward declaration of flush_write_bio
There is no need to forward declare flush_write_bio(), as it only
depends on submit_one_bio().  Both of them are pretty small, just move
them to kill the forward declaration.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:37 +01:00
Nikolay Borisov 352646c7bf btrfs: Fix grossly misleading argument names in extent io search
The variables and function parameters of __etree_search which pertain to
prev/next are grossly misnamed. Namely, prev_ret holds the next state
and not the previous. Similarly, next_ret actually holds the previous
extent state relating to the offset we are interested in. Fix this by
renaming the variables as well as switching the arguments order. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:36 +01:00
Nikolay Borisov ba8f5206a4 btrfs: Remove EXTENT_FIRST_DELALLOC bit
With the refactoring introduced in 8b62f87bad ("Btrfs: reworki
outstanding_extents") this flag became unused. Remove it and renumber
the following flags accordingly. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:36 +01:00
Nikolay Borisov 9a0ec83d57 btrfs: use WARN_ON in a canonical form btrfs_remove_block_group
There is no point in using a construct like 'if (!condition)
WARN_ON(1)'. Use WARN_ON(!condition) directly. No functional changes.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:36 +01:00
Josef Bacik 260e77025f btrfs: reserve extra space during evict
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:35 +01:00
Josef Bacik 8a1bbe1d5c btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic.  I added a bunch of
new states and it lead to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked.  So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine.  Then as we add new things that are safe we can just add them
to this list.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:35 +01:00
Josef Bacik 5df1136363 btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.

However we may not actually need that much once writeout is complete,
because of the over-reservation for the worst case.

So instead try to make our reservation, and if we couldn't make it
re-calculate our new reservation size and try again.  If our reservation
size doesn't change between tries then we know we are actually out of
space and can error. Flushing that could have been running in parallel
did not make any space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ rename to calc_refill_bytes, update comment and changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:35 +01:00
Josef Bacik f91587e415 btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.

However, this is generally not the case anymore.  So fix this logic to
instead see if we had a ticket that we were able to give some
reservation to, and if we were continue the flushing loop again.

Likewise we make the tickets use the space_info_add_old_bytes() method
of returning what reservation they did receive in hopes that it could
satisfy reservations down the line.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:34 +01:00
Josef Bacik 450114fc0d btrfs: don't use global reserve for chunk allocation
We've done this forever because of the voodoo around knowing how much
space we have.  However, we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily.  Instead use the actual used amount when
determining if we need to allocate a chunk or not.

This has a side effect for mixed block group fs'es where we are no
longer allocating enough chunks for the data/metadata requirements.  To
deal with this add a ALLOC_CHUNK_FORCE step to the flushing state
machine.  This will only get used if we've already made a full loop
through the flushing machinery and tried committing the transaction.

If we have then we can try and force a chunk allocation since we likely
need it to make progress.  This resolves issues I was seeing with
the mixed bg tests in xfstests without the new flushing state.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ merged with patch "add ALLOC_CHUNK_FORCE to the flushing code" ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:34 +01:00
Josef Bacik b78e5616af btrfs: dump block_rsv details when dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:34 +01:00
Josef Bacik d89dbefb8c btrfs: check if there are free block groups for commit
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However, if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:33 +01:00
Dennis Zhou 3f93aef535 btrfs: add zstd compression level support
Zstd compression requires different amounts of memory for each level of
compression. The prior patches implemented indirection to allow for each
compression type to manage their workspaces independently. This patch
uses this indirection to implement compression level support for zstd.

To manage the additional memory require, each compression level has its
own queue of workspaces. A global LRU is used to help with reclaim.
Reclaim is done via a timer which provides a mechanism to decrease
memory utilization by keeping only workspaces around that are sized
appropriately. Forward progress is guaranteed by a preallocated max
workspace hidden from the LRU.

When getting a workspace, it uses a bitmap to identify the levels that
are populated and scans up. If it finds a workspace that is greater than
it, it uses it, but does not update the last_used time and the
corresponding place in the LRU. If we hit memory pressure, we sleep on
the max level workspace. We continue to rescan in case we can use a
smaller workspace, but eventually should be able to obtain the max level
workspace or allocate one again should memory pressure subside.

The memory requirement for decompression is the same as level 1, and
therefore can use any of available workspace.

The number of workspaces is bound by an upper limit of the workqueue's
limit which currently is 2 (percpu limit). The reclaim timer is used to
free inactive/improperly sized workspaces and is set to 307s to avoid
colliding with transaction commit (every 30s).

Repeating the experiment from v2 [1], the Silesia corpus was copied to a
btrfs filesystem 10 times and then read back after dropping the caches.
The btrfs filesystem was on an SSD.

Level   Ratio   Compression (MB/s)  Decompression (MB/s)  Memory (KB)
1       2.658        438.47                910.51            780
2       2.744        364.86                886.55           1004
3       2.801        336.33                828.41           1260
4       2.858        286.71                886.55           1260
5       2.916        212.77                556.84           1388
6       2.363        119.82                990.85           1516
7       3.000        154.06                849.30           1516
8       3.011        159.54                875.03           1772
9       3.025        100.51                940.15           1772
10      3.033        118.97                616.26           1772
11      3.036         94.19                802.11           1772
12      3.037         73.45                931.49           1772
13      3.041         55.17                835.26           2284
14      3.087         44.70                716.78           2547
15      3.126         37.30                878.84           2547

[1] https://lore.kernel.org/linux-btrfs/20181031181108.289340-1-terrelln@fb.com/

Cc: Nick Terrell <terrelln@fb.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:33 +01:00
Dennis Zhou d3c6ab752c btrfs: make zstd memory requirements monotonic
It is possible based on the level configurations that a higher level
workspace uses less memory than a lower level workspace. In order to
reuse workspaces, this must be made a monotonic relationship. This
precomputes the required memory for each level and enforces the
monotonicity between level and memory required. This is also done
in upstream zstd in [1].

[1] a68b76afef

Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:33 +01:00
Dennis Zhou e0dc87afcd btrfs: zstd use the passed through level instead of default
Zstd currently only supports the default level of compression. This
patch switches to using the level passed in for btrfs zstd
configuration.

Zstd workspaces now keep track of the requested level as this can differ
from the size of the workspace.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:33 +01:00
Dennis Zhou d0ab62ce2d btrfs: change set_level() to bound the level passed in
Currently, the only user of set_level() is zlib which sets an internal
workspace parameter. As level is now plumbed into get_workspace(), this
can be handled there rather than separately.

This repurposes set_level() to bound the level passed in so it can be
used when setting the mounts compression level and as well as verifying
the level before getting a workspace. The other benefit is this divides
the meaning of compress(0) and get_workspace(0). The former means we
want to use the default compression level of the compression type. The
latter means we can use any workspace available.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 7bf4994304 btrfs: plumb level through the compression interface
Zlib compression supports multiple levels, but doesn't require changing
in how a workspace itself is created and managed. Zstd introduces a
different memory requirement such that higher levels of compression
require more memory.

This requires changes in how the alloc()/get() methods work for zstd.
This pach plumbs compression level through the interface as a parameter
in preparation for zstd compression levels.  This gives the compression
types opportunity to create/manage based on the compression level.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 92ee553036 btrfs: move to function pointers for get/put workspaces
The previous patch added generic helpers for get_workspace() and
put_workspace(). Now, we can migrate ownership of the workspace_manager
to be in the compression type code as the compression code itself
doesn't care beyond being able to get a workspace. The init/cleanup and
get/put methods are abstracted so each compression algorithm can decide
how they want to manage their workspaces.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:32 +01:00
Dennis Zhou 929f4baf93 btrfs: add compression interface in (get/put)_workspace
There are two levels of workspace management. First, alloc()/free()
which are responsible for actually creating and destroy workspaces.
Second, at a higher level, get()/put() which is the compression code
asking for a workspace from a workspace_manager.

The compression code shouldn't really care how it gets a workspace, but
that it got a workspace. This adds get_workspace() and put_workspace()
to be the higher level interface which is responsible for indexing into
the appropriate compression type. It also introduces
btrfs_put_workspace() and btrfs_get_workspace() to be the generic
implementations of the higher interface.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou 1666edabc8 btrfs: add helper methods for workspace manager init and cleanup
Workspace manager init and cleanup code is open coded inside a for loop
over the compression types. This forces each compression type to rely on
the same workspace manager implementation. This patch creates helper
methods that will be the generic implementation for btrfs workspace
management.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou 10b94a51ca btrfs: unify compression ops with workspace_manager
Make the workspace_manager own the interface operations rather than
managing index-paired arrays for the workspace_manager and compression
operations.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou ca4ac360af btrfs: manage heuristic workspace as index 0
While the heuristic workspaces aren't really compression workspaces,
they use the same interface for managing them. So rather than branching,
let's just handle them once again as the index 0 compression type.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:31 +01:00
Dennis Zhou acce85de12 btrfs: rename workspaces_list to workspace_manager
This is in preparation for zstd compression levels. As each level will
require different size of workspace, workspaces_list is no longer a
really fitting name.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:30 +01:00
Dennis Zhou 1972708a89 btrfs: add helpers for compression type and level
It is very easy to miss places that rely on a certain bitshifting for
decoding the type_level overloading. Add helpers to do this instead.

Cc: Omar Sandoval <osandov@osandov.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:30 +01:00
Anand Jain 228a73abde btrfs: introduce new ioctl to unregister a btrfs device
Support for a new command that can be used eg. as a command

  $ btrfs device scan --forget [dev]'
(the final name may change though)

to undo the effects of 'btrfs device scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty and is next to the
SCAN ioctl.

The new ioctl BTRFS_IOC_FORGET_DEV works only on the control device
(/dev/btrfs-control) to unregister one or all devices, devices that are
not mounted.

The argument is struct btrfs_ioctl_vol_args, ::name specifies the device
path. To unregister all device, the path is an empty string.

Again, the devices are removed only if they aren't part of a mounte
filesystem.

This new ioctl provides:

- release of unwanted btrfs_fs_devices and btrfs_devices structures
  from memory if the device is not going to be mounted

- ability to mount filesystem in degraded mode, when one devices is
  corrupted like in split brain raid1

- running test cases which would require reloading the kernel module
  but this is not possible eg. due to mounted filesystem or built-in

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:30 +01:00
Josef Bacik 034f784d7c btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.

The waiting is killable as it could be indirectly called from user
operations like fallocate or zero-range. Such call sites should handle
the error but otherwise it's not necessary. Eg. flush_space just needs
to attempt to make space by waiting on iputs.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add killable comment and changelog parts ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:29 +01:00
Qu Wenruo 3ece54e504 btrfs: Output ENOSPC debug info in inc_block_group_ro
Since inc_block_group_ro() would return -ENOSPC, outputting debug info
for enospc_debug mount option would be helpful to debug some balance
false ENOSPC report.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:29 +01:00
Qu Wenruo c8f72b98b6 btrfs: qgroup: Remove duplicated trace points for qgroup_rsv_add/release
Inside qgroup_rsv_add/release(), we have trace events
trace_qgroup_update_reserve() to catch reserved space update.

However we still have two manual trace_qgroup_update_reserve() calls
just outside these functions.  Remove these duplicated calls.

Fixes: 64ee4e751a ("btrfs: qgroup: Update trace events to use new separate rsv types")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:28 +01:00
Anders Roxell 2eec5f0042 btrfs: let the assertion expression compile in all configs
A compiler warning (in a patch in development) pointed to a variable
that was used only inside and ASSERT:

  u64 root_objectid = root->root_key.objectid;
  ASSERT(root_objectid == ...);

  fs/btrfs/relocation.c: In function ‘insert_dirty_subv’:
  fs/btrfs/relocation.c:2138:6: warning: unused variable ‘root_objectid’ [-Wunused-variable]
    u64 root_objectid = root->root_key.objectid;
	^~~~~~~~~~~~~

When CONFIG_BRTFS_ASSERT isn't enabled, variable root_objectid isn't used.

Rework the assertion helper by adding a runtime check instead of the
'#ifdef CONFIG_BTRFS_ASSERT #else ...", so the compiler sees the
condition being passed into an inline function after preprocessing.

Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25 14:13:28 +01:00