1
0
Fork 0
Commit Graph

267 Commits (74d46992e0d9dee7f1f376de0d56d31614c8a17a)

Author SHA1 Message Date
Christoph Hellwig 74d46992e0 block: replace bi_bdev with a gendisk pointer and partitions index
This way we don't need a block_device structure to submit I/O.  The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open.  Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device.  But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-23 12:49:55 -06:00
Linus Torvalds 8c27cb3566 Merge branch 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
 "The core updates improve error handling (mostly related to bios), with
  the usual incremental work on the GFP_NOFS (mis)use removal,
  refactoring or cleanups. Except the two top patches, all have been in
  for-next for an extensive amount of time.

  User visible changes:

   - statx support

   - quota override tunable

   - improved compression thresholds

   - obsoleted mount option alloc_start

  Core updates:

   - bio-related updates:
       - faster bio cloning
       - no allocation failures
       - preallocated flush bios

   - more kvzalloc use, memalloc_nofs protections, GFP_NOFS updates

   - prep work for btree_inode removal

   - dir-item validation

   - qgoup fixes and updates

   - cleanups:
       - removed unused struct members, unused code, refactoring
       - argument refactoring (fs_info/root, caller -> callee sink)
       - SEARCH_TREE ioctl docs"

* 'for-4.13-part1' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (115 commits)
  btrfs: Remove false alert when fiemap range is smaller than on-disk extent
  btrfs: Don't clear SGID when inheriting ACLs
  btrfs: fix integer overflow in calc_reclaim_items_nr
  btrfs: scrub: fix target device intialization while setting up scrub context
  btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges
  btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
  btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
  btrfs: qgroup: Return actually freed bytes for qgroup release or free data
  btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function
  btrfs: qgroup: Add quick exit for non-fs extents
  Btrfs: rework delayed ref total_bytes_pinned accounting
  Btrfs: return old and new total ref mods when adding delayed refs
  Btrfs: always account pinned bytes when dropping a tree block ref
  Btrfs: update total_bytes_pinned when pinning down extents
  Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT()
  Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64
  btrfs: fix validation of XATTR_ITEM dir items
  btrfs: Verify dir_item in iterate_object_props
  btrfs: Check name_len before in btrfs_del_root_ref
  btrfs: Check name_len before reading btrfs_get_name
  ...
2017-07-05 16:41:23 -07:00
Chris Mason 6374e57ad8 btrfs: fix integer overflow in calc_reclaim_items_nr
Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
v4.12-rc6.  This was because commit 70e7af244 made it possible for
calc_reclaim_items_nr() to return a negative number.  It's not really a
bug in that commit, it just didn't go far enough down the stack to find
all the possible 64->32 bit overflows.

This switches calc_reclaim_items_nr() to return a u64 and changes everyone
that uses the results of that math to u64 as well.

Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 70e7af2 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
David Sterba ded56184a5 btrfs: scrub: fix target device intialization while setting up scrub context
The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
helper but wrongly sets up the target device. Incidentally there's a
local variable with the same name as a parameter in the previous
function, so this got caught during runtime as crash in test btrfs/027.

Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-29 20:17:02 +02:00
David Sterba c5e4c3d750 btrfs: sink gfp parameter to btrfs_io_bio_alloc
We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we
change it back from GFP_KERNEL in scrub. I'd rather save a few stack
bytes from not passing the gfp flags in the remaining, more imporatant,
contexts and the bio allocating API now looks more consistent.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:04 +02:00
David Sterba e4f5690386 btrfs: btrfs_io_bio_alloc never fails, skip error handling
Update direct callers of btrfs_io_bio_alloc that do error handling, that
we can now remove.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba de2491fdef btrfs: scrub: add memalloc_nofs protection around init_ipath
init_ipath is called from a safe ioctl context and from scrub when
printing an error.  The protection is added for three reasons:

* init_data_container calls vmalloc and this does not work as expected
  in the GFP_NOFS context, so this silently does GFP_KERNEL and might
  deadlock in some cases
* keep the context constraint of GFP_NOFS, used by scrub
* we want to use GFP_KERNEL unconditionally inside init_ipath or its
  callees

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:02 +02:00
David Sterba 3fb99303c6 btrfs: scrub: embed scrub_wr_ctx into scrub context
The structure scrub_wr_ctx is not used anywhere just the scrub context,
we can move the members there. The tgtdev is renamed so it's more clear
that it belongs to the "wr" part.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba 25cc1226c1 btrfs: scrub: use fs_info::sectorsize and drop it from scrub context
As we now have the node/block sizes in fs_info, we can use them and can
drop the local copies.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:01 +02:00
David Sterba 4e2814ef04 btrfs: scrub: simplify cleanup of wr_ctx in scrub_free_ctx
We don't need to take the mutex and zero out wr_cur_bio, as this is
called after the scrub finished.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba e241ddeb9c btrfs: scrub: inline helper scrub_free_wr_ctx
The helper scrub_free_wr_ctx is used only once and fits into
scrub_free_ctx as it continues sctx shutdown, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
David Sterba 8fcdac3f20 btrfs: scrub: inline helper scrub_setup_wr_ctx
The helper scrub_setup_wr_ctx is used only once and fits into
scrub_setup_ctx as it continues intialization, no need to keep it
separate.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:26:00 +02:00
Josef Bacik 6ec656bc0f btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Christoph Hellwig 4e4cbee93d block: switch bios to blk_status_t
Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep arround at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2017-06-09 09:27:32 -06:00
Qu Wenruo 28d70e237d btrfs: scrub: Fix RAID56 recovery race condition
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected. Sometimes even unrecoverable error will be reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error number should be 16 (assuming using x86_64).
Larger csum error number can be reported in a 1/3 chance.
And unrecoverable error can also be reported in a 1/10 chance.

The root cause of the problem is RAID5/6 recover code has race
condition, due to the fact that full scrub is initiated per device.

While for other mirror based profiles, each mirror is independent with
each other, so race won't cause any big problem.

For example:
        Corrupted       |       Correct          |      Correct        |
|   Scrub dev3 (D1)     |    Scrub dev2 (D2)     |    Scrub dev1(P)    |
------------------------------------------------------------------------
Read out D1             |Read out D2             |Read full stripe     |
Check csum              |Check csum              |Check parity         |
Csum mismatch           |Csum match, continue    |Parity mismatch      |
handle_errored_block    |                        |handle_errored_block |
 Read out full stripe   |                        | Read out full stripe|
 D1 csum error(err++)   |                        | D1 csum error(err++)|
 Recover D1             |                        | Recover D1          |

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and race can happen.

On even worse case, for example D1's recovery code is re-writing
D1/D2/P, and P's recovery code is just reading out full stripe, then we
can cause unrecoverable error.

This patch will use previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery.
So no extra csum error nor unrecoverable error.

Reported-by: Goffredo Baroncelli <kreijack@libero.it>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Qu Wenruo 0966a7b130 btrfs: scrub: Introduce full stripe lock for RAID56
Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't do proper protection, it can easily cause race condition.

Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6.
Which store a rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe to avoid race.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:27 +02:00
Liu Bo 42c61ab676 Btrfs: switch to div64_u64 if with a u64 divisor
This is fixing code pieces where we use div_u64 when passing a u64 divisor.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo 972d721939 Btrfs: update scrub_parity to use u64 stripe_len
Commit 3d8da67817 ("Btrfs: fix divide error upon chunk's stripe_len")
changed stripe_len in struct map_lookup to u64, but didn't update
stripe_len in struct scrub_parity.

This updates the type and switches to div64_u64_rem to match u64 divisor.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba 619a974292 btrfs: use clear_page where appropriate
There's a helper to clear whole page, with a arch-specific optimized
code. The replaced cases do not seem to be in performace critical code,
but we still might get some percent gain.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo e501bfe323 btrfs: Prevent scrub recheck from racing with dev replace
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without protection of bio_counter.

This can lead to use-after-free if racing with dev replace cancel.

Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decreasing the bio_counter when corresponding recover is finished.

Cc: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo ae6529c35b btrfs: Wait for in-flight bios before freeing target device for raid56
When raid56 dev-replace is cancelled by running scrub, we will free
target device without waiting for in-flight bios, causing the following
NULL pointer deference or general protection failure.

 BUG: unable to handle kernel NULL pointer dereference at 00000000000005e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O    4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: ffff88002875b4c0 task.stack: ffffc90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: ffffc90001337bb8

In btrfs_dev_replace_finishing(), we will call
btrfs_rm_dev_replace_blocked() to wait bios before destroying the target
device when scrub is finished normally.

However when dev-replace is aborted, either due to error or cancelled by
scrub, we didn't wait for bios, this can lead to use-after-free if there
are bios holding the target device.

Furthermore, for raid56 scrub, at least 2 places are calling
btrfs_map_sblock() without protection of bio_counter, leading to the
problem.

This patch fixes the problem:
1) Wait for bio_counter before freeing target device when canceling
   replace
2) When calling btrfs_map_sblock() for raid56, use bio_counter to
   protect the call.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Qu Wenruo 9a33944bdf btrfs: scrub: Don't append on-disk pages for raid56 scrub
In the following situation, scrub will calculate wrong parity to
overwrite the correct one:

RAID5 full stripe:

Before
|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   |     0xcdcd     |     0x0000    |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

After scrubbing dev3 only:

|     Dev 1      |     Dev  2     |     Dev 3     |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  |     0xcdcd     | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
|     0xcdcd     |     0xcdcd     |     0x0000    |
...
|     0xcdcd     |     0xcdcd     |     0x0000    |
--------------------------------------------------- 64K

The reason is that after raid56 read rebuild rbio->stripe_pages are all
correctly recovered (0xcd for data stripes).

However when we check and repair parity in
scrub_parity_check_and_repair(), we will append pages in sparity->spages
list to rbio->bio_pages[], which contains old on-disk data.

And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first, if rbio->bio_pages[] not found, then fallback
to rbio->stripe_pages[].

The patch fix it by not appending pages from sparity->spages.
So finish_parity_scrub() will use rbio->stripe_pages[] which is correct.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
David Sterba 825ad4c964 btrfs: drop redundant parameters from btrfs_map_sblock
All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Liu Bo 1bcd7aa17f Btrfs: set scrub page's io_error if failing to submit io
Scrub repairs data by the unit called scrub_block, which may contain
several pages.  Scrub always tries to look up a good copy of a whole
block, but if there's no such copy, it tries to do repair page by page.

If we don't set page's io_error when checking this bad copy, in the last
step, we may skip this page when repairing bad copy from good copy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:26 +02:00
Elena Reshetova 99f4cdb16f btrfs: convert scrub_ctx.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova 78a764504d btrfs: convert scrub_parity.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova 186debd6ed btrfs: convert scrub_block.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Elena Reshetova 6f615018b3 btrfs: convert scrub_recover.refs from atomic_t to refcount_t
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-18 14:07:24 +02:00
Nikolay Borisov 1c8c9c5216 btrfs: Make check_extent_to_block take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:11 +01:00
Nikolay Borisov 9d4f7f8ad6 btrfs: make repair_io_failure take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:09 +01:00
Nikolay Borisov a776c6fa1f btrfs: Make btrfs_lookup_ordered_range take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-28 11:30:08 +01:00
Jeff Mahoney 5e00f1939f btrfs: convert btrfs_inc_block_group_ro to accept fs_info
btrfs_inc_block_group_ro is either passed the extent root or the dev
root, but it doesn't do anything with the dev tree.  Let's convert
to passing an fs_info and using the extent root.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:56 +01:00
David Sterba e5987e1319 btrfs: remove unused parameters from scrub_setup_wr_ctx
Never used.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-17 12:03:54 +01:00
Linus Torvalds 087a76d390 Merge branch 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "Jeff Mahoney and Dave Sterba have a really nice set of cleanups in
  here, and Christoph pitched in corrections/improvements to make btrfs
  use proper helpers for bio walking instead of doing it by hand.

  There are some key fixes as well, including some long standing bugs
  that took forever to track down in btrfs_drop_extents and during
  balance"

* 'for-linus-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (77 commits)
  btrfs: limit async_work allocation and worker func duration
  Revert "Btrfs: adjust len of writes if following a preallocated extent"
  Btrfs: don't WARN() in btrfs_transaction_abort() for IO errors
  btrfs: opencode chunk locking, remove helpers
  btrfs: remove root parameter from transaction commit/end routines
  btrfs: split btrfs_wait_marked_extents into normal and tree log functions
  btrfs: take an fs_info directly when the root is not used otherwise
  btrfs: simplify btrfs_wait_cache_io prototype
  btrfs: convert extent-tree tracepoints to use fs_info
  btrfs: root->fs_info cleanup, access fs_info->delayed_root directly
  btrfs: root->fs_info cleanup, add fs_info convenience variables
  btrfs: root->fs_info cleanup, update_block_group{,flags}
  btrfs: root->fs_info cleanup, lock/unlock_chunks
  btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size
  btrfs: pull node/sector/stripe sizes out of root and into fs_info
  btrfs: root->fs_info cleanup, io_ctl_init
  btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
  btrfs: struct reada_control.root -> reada_control.fs_info
  btrfs: struct btrfsic_state->root should be an fs_info
  btrfs: alloc_reserved_file_extent trace point should use extent_root
  ...
2016-12-16 10:53:01 -08:00
Jeff Mahoney 3a45bb207e btrfs: remove root parameter from transaction commit/end routines
Now we only use the root parameter to print the root objectid in
a tracepoint.  We can use the root parameter from the transaction
handle for that.  It's also used to join the transaction with
async commits, so we remove the comment that it's just for checking.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:07:00 +01:00
Jeff Mahoney 2ff7e61e0d btrfs: take an fs_info directly when the root is not used otherwise
There are loads of functions in btrfs that accept a root parameter
but only use it to obtain an fs_info pointer.  Let's convert those to
just accept an fs_info pointer directly.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney 0b246afa62 btrfs: root->fs_info cleanup, add fs_info convenience variables
In routines where someptr->fs_info is referenced multiple times, we
introduce a convenience variable.  This makes the code considerably
more readable.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:59 +01:00
Jeff Mahoney da17066c40 btrfs: pull node/sector/stripe sizes out of root and into fs_info
We track the node sizes per-root, but they never vary from the values
in the superblock.  This patch messes with the 80-column style a bit,
but subsequent patches to factor out root->fs_info into a convenience
variable fix it up again.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Jeff Mahoney fb456252d3 btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-12-06 16:06:58 +01:00
Christoph Hellwig cf8cddd38b btrfs: don't abuse REQ_OP_* flags for btrfs_map_block
btrfs_map_block supports different types of mappings, which to a large
extent resemble block layer operations.  But they don't always do, and
currently btrfs dangerously overlays it's own flag over the block layer
flags.  This is just asking for a conflict, so introduce a different
map flags enum inside of btrfs instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-11-29 14:10:38 +01:00
Christoph Hellwig 70fd76140a block,fs: use REQ_* flags directly
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly.  Where applicable this also drops usage of the
bio_set_op_attrs wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-11-01 09:43:26 -06:00
Jeff Mahoney 5d163e0e68 btrfs: unsplit printed strings
CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-09-26 18:08:44 +02:00
Linus Torvalds d58b0d980f Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull more btrfs updates from Chris Mason:
 "This is part two of my btrfs pull, which is some cleanups and a batch
  of fixes.

  Most of the code here is from Jeff Mahoney, making the pointers we
  pass around internally more consistent and less confusing overall.  I
  noticed a small problem right before I sent this out yesterday, so I
  fixed it up and re-tested overnight"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits)
  Btrfs: fix __MAX_CSUM_ITEMS
  btrfs: btrfs_abort_transaction, drop root parameter
  btrfs: add btrfs_trans_handle->fs_info pointer
  btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction
  btrfs: convert nodesize macros to static inlines
  btrfs: introduce BTRFS_MAX_ITEM_SIZE
  btrfs: cleanup, remove prototype for btrfs_find_root_ref
  btrfs: copy_to_sk drop unused root parameter
  btrfs: simpilify btrfs_subvol_inherit_props
  btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root
  btrfs: tests, require fs_info for root
  btrfs: tests, move initialization into tests/
  btrfs: btrfs_test_opt and friends should take a btrfs_fs_info
  btrfs: prefix fsid to all trace events
  btrfs: plumb fs_info into btrfs_work
  btrfs: remove obsolete part of comment in statfs
  btrfs: hide test-only member under ifdef
  btrfs: Ratelimit "no csum found" info message
  btrfs: Add ratelimit to btrfs printing
  Btrfs: fix unexpected balance crash due to BUG_ON
  ...
2016-08-04 19:56:16 -04:00
Jeff Mahoney cb001095ca btrfs: plumb fs_info into btrfs_work
In order to provide an fsid for trace events, we'll need a btrfs_fs_info
pointer.  The most lightweight way to do that for btrfs_work structures
is to associate it with the __btrfs_workqueue structure.  Each queued
btrfs_work structure has a workqueue associated with it, so that's
a natural fit.  It's a privately defined structures, so we add accessors
to retrieve the fs_info pointer.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:53:15 +02:00
Chandan Rajendra 751bebbe0a Btrfs: subpage-blocksize: Rate limit scrub error message
btrfs/073 invokes scrub ioctl in a tight loop. In subpage-blocksize
scenario this results in a lot of "scrub: size assumption sectorsize !=
PAGE_SIZE " messages being printed on the console. To reduce the number
of such messages this commit uses btrfs_err_rl() instead of
btrfs_err().

Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26 13:52:25 +02:00
Mike Christie 37226b2111 btrfs: use bio op accessors
This should be the easier cases to convert btrfs to
bio_set_op_attrs/bio_op.
They are mostly just cut and replace type of changes.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Filipe Manana 1a1a8b732c Btrfs: fix race setting block group back to RW mode during device replace
After it finishes processing a device extent, the device replace code sets
back the block group to RW mode and then after that it sets the left cursor
to match the logical end address of the block group, so that future writes
into extents belonging to the block group go both the source (old) and
target (new) devices. However from the moment we turn the block group
back to RW mode we have a short time window, that lasts until we update
the left cursor's value, where extents can be allocated from the block
group and written to, in which case they will not be copied/written to
the target (new) device. Fix this by updating the left cursor's value
before turning the block group back to RW mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:24 +01:00
Filipe Manana 81e87a736c Btrfs: fix unprotected assignment of the left cursor for device replace
We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:23 +01:00
Filipe Manana f0e9b7d640 Btrfs: fix race setting block group readonly during device replace
When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode some concurrent task might have already
allocated an extent from it or decided it could perform a nocow write
into one of its extents, which can make the device replace process to
miss copying an extent since it uses the extent tree's commit root to
search for extents and only once it finishes searching for all extents
belonging to the block group it does set the left cursor to the logical
end address of the block group - this is a problem if the respective
ordered extents finish while we are searching for extents using the
extent tree's commit root and no transaction commit happens while we
are iterating the tree, since it's the delayed references created by the
ordered extents (when they complete) that insert the extent items into
the extent tree (using the non-commit root of course).
Example:

          CPU 1                                            CPU 2

 btrfs_dev_replace_start()
   btrfs_scrub_dev()
     scrub_enumerate_chunks()
       --> finds device extent belonging
           to block group X

                               <transaction N starts>

                                                      starts buffered write
                                                      against some inode

                                                      writepages is run against
                                                      that inode forcing dellaloc
                                                      to run

                                                      btrfs_writepages()
                                                        extent_writepages()
                                                          extent_write_cache_pages()
                                                            __extent_writepage()
                                                              writepage_delalloc()
                                                                run_delalloc_range()
                                                                  cow_file_range()
                                                                    btrfs_reserve_extent()
                                                                      --> allocates an extent
                                                                          from block group X
                                                                          (which is not yet
                                                                           in RO mode)
                                                                    btrfs_add_ordered_extent()
                                                                      --> creates ordered extent Y
                                                        flush_epd_write_bio()
                                                          --> bio against the extent from
                                                              block group X is submitted

       btrfs_inc_block_group_ro(bg X)
         --> sets block group X to readonly

       scrub_chunk(bg X)
         scrub_stripe(device extent from srcdev)
           --> keeps searching for extent items
               belonging to the block group using
               the extent tree's commit root
           --> it never blocks due to
               fs_info->scrub_pause_req as no
               one tries to commit transaction N
           --> copies all extents found from the
               source device into the target device
           --> finishes search loop

                                                        bio completes

                                                        ordered extent Y completes
                                                        and creates delayed data
                                                        reference which will add an
                                                        extent item to the extent
                                                        tree when run (typically
                                                        at transaction commit time)

                                                          --> so the task doing the
                                                              scrub/device replace
                                                              at CPU 1 misses this
                                                              and does not copy this
                                                              extent into the new/target
                                                              device

       btrfs_dec_block_group_ro(bg X)
         --> turns block group X back to RW mode

       dev_replace->cursor_left is set to the
       logical end offset of block group X

So fix this by waiting for all cow and nocow writes after setting a block
group to readonly mode.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:21 +01:00