Commit graph

6545 commits

Author SHA1 Message Date
Nikolay Borisov eb7b9d6a46 btrfs: send: remove unused code
This code was first introduced in 31db9f7c23 ("Btrfs: introduce
BTRFS_IOC_SEND for btrfs send/receive") and it was not functional, then
it got slightly refactored in e938c8ad54 ("Btrfs: code cleanups for
send/receive"), alas it was still dead. So let's remove it for good!

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
Anand Jain 6dd38f81f9 btrfs: remove BUG_ON in btrfs_rm_dev_replace_free_srcdev()
That was only an extra check to tackle a few bugs around this area, now
its safe to remove it.  Replace it by an ASSERT.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
Adam Borowski fa4d885a48 btrfs: allow setting zlib compression level via :9
This is bikeshedding, but it seems people are drastically more likely to
understand "zlib:9" as compression level rather than an algorithm
version compared to "zlib9".

Based on feedback on the mailinglist, the ":9" will be the only accepted
syntax. The level must be a single digit. Unrecognized format will
result to the default, for forward compatibility in a similar way the
compression algorithm specifier was relaxed in commit
a7164fa4e0 ("btrfs: prepare for extensions in compression
options").

Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Reviewed-by: David Sterba <dsterba@suse.com>
[ tighten the accepted format ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:34 +01:00
David Sterba f51d2b5912 btrfs: allow to set compression level for zlib
Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compess=zlib                 # default
$ mount -o compess=zlib0                # same
$ mount -o compess=zlib9                # level 9, slower sync, less data
$ mount -o compess=zlib1                # level 1, faster sync, more data
$ mount -o remount,compress=zlib3	# level set by remount

The compress-force works the same as compress'.  The level is visible in
the same format in /proc/mounts. Level set via file property does not
work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-01 20:45:29 +01:00
Nikolay Borisov d4417e2255 btrfs: Replace opencoded sizes with their symbolic constants
Currently btrfs' code uses a mix of opencoded sizes and defines from sizes.h.
Let's unifiy the code base to always use the symbolic constants. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Gu JinXiang 859a58a207 btrfs: Use bd_dev to generate index when dev_state_hashtable add items.
Fix missing change from commit f8f84b2dfd
("btrfs: index check-integrity state hash by a dev_t").

Function btrfsic_dev_state_hashtable_lookup uses dev_t to generate hashval
when look in up a btrfsic_dev_state in hash table. So when we add a
btrfsic_dev_state into the hash table, it should also use dev_t.

Reproducer of this bug:
Use MOUNT_OPTIONS="-o check_int" when running xfstest, device can not be
mounted successfully. So xfstest can not run.

Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain 102ed2c5ff btrfs: fix false EIO for missing device
When one of the device is missing, bbio_error() takes care of setting
the error status. And if its only IO that is pending in that stripe, it
fails to check the status of the other IO at %bbio_error before setting
the error %bi_status for the %orig_bio. Fix this by checking if
%bbio->error has exceeded the %bbio->max_errors.

Reproducer as below fdatasync error is seen intermittently.

 mount -o degraded /dev/sdc /btrfs
 dd status=none if=/dev/zero of=$(mktemp /btrfs/XXX) bs=4096 count=1 conv=fdatasync

 dd: fdatasync failed for ‘/btrfs/LSe’: Input/output error

 The reason for the intermittences of the problem is because
 the following conditions have to be met, which depends on timing:
 In btrfs_map_bio()
  - the RAID1 the missing device has to be at %dev_nr = 1
 In bbio_error()
  . before bbio_error() is called the bio of the not-missing
    device at %dev_nr = 0 must be completed so that the below
    condition is true
     if (atomic_dec_and_test(&bbio->stripes_pending)) {

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain de48373454 btrfs: use need_full_stripe() in __btrfs_map_block()
A cleanup patch, use need_full_stripe() to replace the open code.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Goldwyn Rodrigues 79f015f216 btrfs: cleanup extent locking sequence
Code cleanup for better understanding:
Variable needs_unlock to be called extent_locked to show state as
opposed to action. Changed the type to int, to reduce code in the
critical path.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Anand Jain 2dbe0c7718 btrfs: use BLK_STS defines where needed
At few places we could use BLK_STS_OK and BLK_STS_NOSUPP.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Satoru Taekeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ dropped first hunk btrfs_endio_direct_read ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Josef Bacik bf2681cb94 btrfs: add assertions for releasing trans handle reservations
These are useful for debugging problems where we mess with
trans->block_rsv to make sure we're not screwing something up.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:01 +01:00
Josef Bacik 3b60d436a1 btrfs: remove type argument from comp_tree_refs
We can get this from the ref we've passed in.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik d278850eff btrfs: remove delayed_ref_node from ref_head
This is just excessive information in the ref_head, and makes the code
complicated.  It is a relic from when we had the heads and the refs in
the same tree, which is no longer the case.  With this removal I've
cleaned up a bunch of the cruft around this old assumption as well.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik c1103f7a5d btrfs: move all ref head cleanup to the helper function
We do a couple different cleanup operations on the ref head.  We adjust
counters, we'll free any reserved space if we didn't end up using the
ref, and we clear the pending csum bytes.  Move all these disparate
things into cleanup_ref_head and clean up the logic in
__btrfs_run_delayed_refs so that it handles the !ref case a lot cleaner,
as well as making run_one_delayed_ref() only deal with real refs and not
the ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 1ce7a5ec44 btrfs: move ref_mod modification into the if (ref) logic
We only use this logic if our ref isn't a ref_head, so move it up into
the if (ref) case since we know that this is a normal ref and not a
delayed ref head.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 194ab0bc21 btrfs: breakout empty head cleanup to a helper
Move this code out to a helper function to further simplivy
__btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik b00e62507e btrfs: move extent_op cleanup to a helper
Move the extent_op cleanup for an empty head ref to a helper function to
help simplify __btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 2eadaa22c1 btrfs: add a helper to return a head ref
Simplify the error handling in __btrfs_run_delayed_refs by breaking out
the code used to return a head back to the delayed_refs tree for
processing into a helper function.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 7c777430e8 Btrfs: only check delayed ref usage in should_end_transaction
We were only doing btrfs_check_space_for_delayed_refs() if the metadata
space was full, ie we couldn't allocate chunks.  This assumes we'll be
able to allocate chunks during transaction commit, but since nothing
does a LIMIT flush during the transaction commit this won't actually
happen unless we happen to run shy of actual space.  We already take
into account a full fs in btrfs_check_space_for_delayed_refs() so just
kill this extra check to make sure we're ending the transaction when we
need to.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik fd708b81d9 Btrfs: add a extent ref verify tool
We were having corruption issues that were tied back to problems with
the extent tree.  In order to track them down I built this tool to try
and find the culprit, which was pretty successful.  If you compile with
this tool on it will live verify every ref update that the fs makes and
make sure it is consistent and valid.  I've run this through with
xfstests and haven't gotten any false positives.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update error messages, add fixup from Dan Carpenter to handle errors
  of read_tree_block ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 84f7d8e624 btrfs: pass root to various extent ref mod functions
We need the actual root for the ref verifier tool to work, so change
these functions to pass the root around instead.  This will be used in
a subsequent patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik fb592373cd btrfs: add ref-verify mount option
This adds the infrastructure for turning ref verify on and off for a
mount, to be used by a later patch.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ enhnance btrfs_print_mod_info to print if ref-verify is compiled in ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba 6273b7f8ed btrfs: get rid of sector_t and use u64 offset in submit_extent_page
The use of sector_t in the callchain of submit_extent_page is not
necessary.  Switch to u64 and rename the variable and use byte units
instead of 512b, ie.  dropping the >> 9 shifts and avoiding the
con(tro)versions of sector_t.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba 6c5a4e2c12 btrfs: rename page offset parameter in submit_extent_page
We're going to remove sector_t and will use 'offset', so this patch
frees the name.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
David Sterba 6aa21263e3 btrfs: scrub: get rid of sector_t
The use of sector_t is not necessry, it's just for a warning.  Switch to
u64 and rename the variable and use byte units instead of 512b, ie.
dropping the >> 9 shifts.  The messages are adjusted as well.

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:28:00 +01:00
Josef Bacik 2351f431f7 btrfs: fix send ioctl on 32bit with 64bit kernel
We pass in a pointer in our send arg struct, this means the struct size
doesn't match with 32bit user space and 64bit kernel space.  Fix this by
adding a compat mode and doing the appropriate conversion.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ move structure to the beginning, next to receive 32bit compat ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain 2b902dfc89 btrfs: fix use of error or warning for missing device
When device is missing without the -o degraded option then its an error
so report it as an error instead of a warning.  And when -o degraded
option is provided, log the missing device as warning.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch error to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain 5a2b8e601c btrfs: declare btrfs_report_missing_device() static
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain 45dbdbc9f6 btrfs: fix EIO misuse to report missing degraded option
EIO is only for the IO failure to the device, avoid it. Use ENOENT as
that's the closest error code describing what happened.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Anand Jain adfb69af7d btrfs: add_missing_dev() should return the actual error
add_missing_dev() can return device pointer so that IS_ERR/PTR_ERR can
be used to check for the actual error that occurred in the function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
[ minor error message adjustment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Christos Gkekas 9e882d6d05 btrfs: Clean up unused variables in free-space-tree.c
Remove variables 'start' and 'end', which are set but never used.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Arnd Bergmann 709a95c3eb btrfs: tree-checker: use %zu format string for size_t
We now get a harmless compile-time on 32-bit architectures:

fs/btrfs/tree-checker.c: In function 'check_extent_data_item':
fs/btrfs/tree-checker.c:189:70: error: format '%lu' expects argument of type 'long unsigned int', but argument 6 has type 'unsigned int' [-Werror=format=]

This changes the format string to use %zu instead of %lu for size_t.

Fixes: c1f6520bf360 ("btrfs: tree-checker: Enhance output for check_extent_data_item")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo 736cd52e0c Btrfs: remove nr_async_submits and async_submit_draining
Now that we have the combo of flushing twice, which can make sure IO
have started since the second flush will wait for page lock which
won't be unlocked unless setting page writeback and queuing ordered
extents, we don't need %async_submit_draining, %async_delalloc_pages
and %nr_async_submits to tell whether the IO has actually started.

Moreover, all the flushers in use are followed by functions that wait
for ordered extents to complete, so %nr_async_submits, which tracks
whether bio's async submit has made progress, doesn't really make
sense.

However, %async_delalloc_pages is still required by shrink_delalloc()
as that function doesn't flush twice in the normal case (just issues a
writeback with WB_REASON_FS_FREE_SPACE).

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo 80e03a2c51 Btrfs: do not make defrag wait on async_delalloc_pages
By setting compression for a defrag task, the task will start IO at
the end of defrag.

After the combo of filemap_flush(), we've already made sure that
dirty pages have made progress via async compress thread because the
second filemap_flush() will wait for page lock, which won't be
unlocked until those pages have been marked as writeback and ordered
extents have been queued.

And this is for per-inode defrag, it's not helpful to wait on a global
%async_delalloc_pages and %nr_async_submits from fs_info.

Although waiting on %nr_async_submits means that all bios are
submitted down to per-device schedule IO lists, it doesn't wait for
their completions, thus users still need to do fsync/sync to make sure
the data is on disk.  While with this change, it makes sure that pages
are marked with writeback bits and will be submitted asynchronously
shortly, therefore, the behavior of defrag option '-c' remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Liu Bo f851689b5a Btrfs: remove nr_async_bios
This was intended to congest higher layers to not send bios, but as

1) the congested bit has been taken by writeback

Async bios come from buffered writes and DIO writes.

For DIO writes, we want to submit them ASAP, while for buffered writes,
writeback uses balance_dirty_pages() to throttle how much dirty pages we
can have.

2) and no one is waiting for %nr_async_bios down to zero,

Historically, it was introduced along with changes which let
checksumming workload spread accross different cpus.  And at that time,
pdflush was used instead of per-bdi flushing, perhaps pdflush did not
have the necessary information for writeback to do throttling.

We can safely remove them now.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
[ additional explanation from mails, removed unused variable 'limit' ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo 8806d7185b btrfs: tree-checker: Enhance output for check_extent_data_item
Output the invalid member name and its bad value, along with its
expected value range or alignment.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo d508c5f07c btrfs: tree-checker: Enhance output for check_csum_item
Output the bad value and expected good value (or its alignment).

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ unindent long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo 478d01b3fc btrfs: tree-checker: Enhance output for btrfs_check_leaf
Enhance the output to print:
1) the eason
2) the ad value, if reason is not sufficient
3) good value (range)

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording, unidented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo bba4f29896 btrfs: tree-checker: Enhance btrfs_check_node output
Use inline function to replace macro since we don't need
stringification.
(Macro still exists until all callers get updated)

And add more info about the error, and replace EIO with EUCLEAN.

For nr_items error, report if it's too large or too small, and output
the valid value range.

For node block pointer, added a new alignment checker.

For key order, also output the next key to make the problem more
obvious.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments, unindented long strings ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:59 +01:00
Qu Wenruo 557ea5dd00 btrfs: Move leaf and node validation checker to tree-checker.c
It's no doubt the comprehensive tree block checker will become larger,
so moving them into their own files is quite reasonable.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
[ wording adjustments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Timofey Titovets 1170862d78 Btrfs: compress_file_range remove dead variable num_bytes
Remove dead assigment of num_bytes.

Also as num_bytes only used in the will_compress block as copy of
total_in just replace that with total_in and drop num_bytes entirely.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Rakesh Pandit a7e3c5f2f7 btrfs: use appropriate replacements for __sb_{start,end}_write calls
Commit a53f4f8e9c ("btrfs: Don't call btrfs_start_transaction() on
frozen fs to avoid deadlock.") started using internal calls and we
replace them with more suitable ones.

Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Hans van Kranenburg a969f4cc13 btrfs: prefix sysfs attribute struct names
Currently struct names for sysfs are generated only based on the
attribute names. This means that attribute names cannot be reused in
multiple places throughout the complete btrfs sysfs hierarchy.

E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
result in the same struct name btrfs_attr_total_bytes. A workaround for
this case was made in the past by ad hoc creating an extra macro
wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
name.

Instead of polluting sysfs.h with such kind of extra macro definitions,
and only doing so when there are collisions, use a prefix which gets
inserted in the struct name, so we keep everything nicely grouped
together by default.

Current collections of attributes are:
* (the toplevel, empty prefix)
* allocation
* space_info
* raid
* features

Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Thomas Meyer 897ca8194c btrfs: Fix bool initialization/comparison
Bool initializations should use true and false. Bool tests don't need
comparisons.

Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Nikolay Borisov efd38150af btrfs: Refactor transaction handling in received subvolume ioctl
If btrfs_transaction_commit fails it will proceed to call
cleanup_transaction, which in turn already does btrfs_abort_transaction.
So let's remove the unnecessary code duplication. Also let's be explicit
about handling failure of btrfs_uuid_tree_add by calling
btrfs_end_transaction.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Nikolay Borisov 9417ebc8a6 btrfs: Explicitly handle btrfs_update_root failure
btrfs_udpate_root can fail and it aborts the transaction, the correct
way to handle an aborted transaction is to explicitly end with
btrfs_end_transaction.  Even now the code is correct since
btrfs_commit_transaction would handle an aborted transaction but this is
more of an implementation detail. So let's be explicit in handling
failure in btrfs_update_root.

Furthermore btrfs_commit_transaction can also fail and by ignoring it's
return value we could have left the in-memory copy of the root item in
an inconsistent state. So capture the error value which allows us to
correctly revert the RO/RW flags in case of commit failure.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain 7132a26259 btrfs: error out if btrfs_attach_transaction() fails
btrfs_init_new_device() calls btrfs_attach_transaction() to
commit sys chunks, and it should error out if it fails.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain d31c32f674 btrfs: fix BUG_ON in btrfs_init_new_device()
Instead of BUG_ON return error to the caller. And handle the fail
condition by calling the abort transaction and going through the
error path.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Anand Jain 0af2c4bf5a btrfs: undo writable superblocke when sprouting fails
When new device is being added to seed FS, seed FS is marked writable,
but when we fail to bring in the new device, we missed to undo the
writable part. This patch fixes it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo 4b865cab96 btrfs: Add checker for EXTENT_CSUM
EXTENT_CSUM checker is a relatively easy one, only needs to check:

1) Objectid
   Fixed to BTRFS_EXTENT_CSUM_OBJECTID

2) Key offset alignment
   Must be aligned to sectorsize

3) Item size alignedment
   Must be aligned to csum size

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo 40c3c40947 btrfs: Add sanity check for EXTENT_DATA when reading out leaf
Add extra checks for item with EXTENT_DATA type.  This checks the
following thing:

0) Key offset
   All key offsets must be aligned to sectorsize.
   Inline extent must have 0 for key offset.

1) Item size
   Uncompressed inline file extent size must match item size.
   (Compressed inline file extent has no information about its on-disk size.)
   Regular/preallocated file extent size must be a fixed value.

2) Every member of regular file extent item
   Including alignment for bytenr and offset, possible value for
   compression/encryption/type.

3) Type/compression/encode must be one of the valid values.

This should be the most comprehensive and strict check in the context
of btrfs_item for EXTENT_DATA.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ switch to BTRFS_FILE_EXTENT_TYPES, similar to what
  BTRFS_COMPRESS_TYPES does ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo 7f43d4affb btrfs: Check if item pointer overlaps with the item itself
Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.

Normally only the last item may be the victim, but adding such check is
never a bad idea anyway.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:58 +01:00
Qu Wenruo c3267bbaa9 btrfs: Refactor check_leaf function for later expansion
Current check_leaf() function does a good job checking key order and
item offset/size.

However it only checks from slot 0 to the last but one slot, this is
good but makes later expansion hard.

So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses a key with all 0 as initial key, so all
valid keys should be larger than that.

And for item size/offset checks, it compares current item end with
previous item offset.
For slot 0, use leaf end as a special case.

This makes later item/key offset checks and item size checks easier to
be implemented.

Also, makes check_leaf() to return -EUCLEAN other than -EIO to indicate
error.

Signed-off-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Timofey Titovets 6018ba0a0e Btrfs: cleanup 'start' subtraction from try uncompressed inline extent
Was added in:
  c8b978188c
  "Btrfs: Add zlib compression support"
Survive to near time (from 08.10.2008).

Because 'start' checked for zero before branch, so it's safe to remove
that subtraction.

Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Josef Bacik 996478ca9c btrfs: change how we decide to commit transactions during flushing
Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs.  This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction.  From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.

From Nikolai, IOW: when we go to the final stage of deciding whether to
do trans commit, instead of passing all the reservations from all
tickets we just pass the reservation for the current ticket. Otherwise,
in case all reservations exceed pinned space, then we don't commit
transaction and fail prematurely. Before we passed num_bytes from
flush_space, where num_bytes was the sum of all pending reserverations,
but now all we do is take the first ticket and commit the trans if we
can satisfy that.

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Tested-by: Nikolay Borisov <nborisov@suse.com>
[ added Nikolai's comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Kuanling Huang eef16ba269 Btrfs: send, apply asynchronous page cache readahead to enhance page read
By analyzing the perf on btrfs send, we found it take large amount of
cpu time on page_cache_sync_readahead. This effort can be reduced after
switching to asynchronous one. Overall performance gain on HDD and SSD
were 9 and 15 percent if simply send a large file.

Signed-off-by: Kuanling Huang <peterh@synology.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo 785884fc31 Btrfs: fix memory leak in raid56
The local bio_list may have pending bios when doing cleanup, it can
end up with memory leak if they don't get freed.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Colin Ian King 315d8e98aa btrfs: make array types static const, reduces object code size
Don't populate the read-only array types on the stack, instead make
it static const.  Makes the object code smaller by nearly 60 bytes:

Before:
   text	   data	    bss	    dec	    hex	filename
  90536	   6552	     64	  97152	  17b80	fs/btrfs/ioctl.o

After:
   text	   data	    bss	    dec	    hex	filename
  90414	   6616	     64	  97094	  17b46	fs/btrfs/ioctl.o

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Allen Pais 3afb0c5014 btrfs: return -ENOMEM on allocation failure in btrfsic
Forward the correct return value -ENOMEM from btrfsic_dev_state_alloc()
too.

Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ adjust changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo 6939f66724 Btrfs: fix confusing worker helper info in stacktrace
We've seen the following backtrace stack in ftrace or dmesg log,

  kworker/u16:10-4244  [000] 241942.480955: function:             btrfs_put_ordered_extent
  kworker/u16:10-4244  [000] 241942.480956: kernel_stack:         <stack trace>
=> finish_ordered_fn (ffffffffa0384475)
=> btrfs_scrubparity_helper (ffffffffa03ca577)        <-----"incorrect"
=> btrfs_freespace_write_helper (ffffffffa03ca98e)    <-----"correct"
=> process_one_work (ffffffff81117b2f)
=> worker_thread (ffffffff81118c2a)
=> kthread (ffffffff81121de0)
=> ret_from_fork (ffffffff81d7087a)

btrfs_freespace_write_helper is actually calling normal_worker_helper
instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
incorrect function address while unwinding the stack,
btrfs_scrubparity_helper really shouldn't be shown up.

It's caused by compiler doing inline for our helper function, adding a
noinline tag can fix that.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use noinline_for_stack ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:57 +01:00
Liu Bo 18fdc67900 Btrfs: remove bio_flags which indicates a meta block of log-tree
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.

We can remove this bio_flags because in order to write dirty blocks,
log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers, therefore, every write goes in a
synchronous way, so does checksuming.

Please also note that, bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, ie.

1) while log tree is flushing its dirty blocks via
   btrfs_write_marked_extents(), in which %sync_writers is increased
   by one.

2) in the meantime, some writeback operations may happen upon btrfs's
   metadata inode, so these writes go synchronously, too.

However, AFAICS, the overhead is not a big one while the win is that
we unify the two places that needs synchronous way and remove a
special hack/flag.

This removes the bio_flags related stuff for writing log-tree.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Liu Bo 6300463b14 Btrfs: make plug in writing meta blocks really work
We have started plug in btrfs_write_and_wait_marked_extents() but the
generated IOs actually go to device's schedule IO list where the work
is doing in another task, thus the started plug doesn't make any
sense.

And since we wait for IOs immediately after writing meta blocks, it's
the same case as writing log tree, doing sync submit can merge more
IOs.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Satoru Takeuchi d8953d69bc btrfs: convert all mount option checking code to use btrfs_test_opt
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Colin Ian King 3993b112da btrfs: avoid null pointer dereference on fs_info when calling btrfs_crit
There are checks on fs_info in __btrfs_panic to avoid dereferencing a
null fs_info, however, there is a call to btrfs_crit that may also
dereference a null fs_info. Fix this by adding a check to see if fs_info
is null and only print the s_id if fs_info is non-null.

Detected by CoverityScan CID#401973 ("Dereference after null check")

Fixes: efe120a067 ("Btrfs: convert printk to btrfs_ and fix BTRFS prefix")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Christos Gkekas fa0d0888bd btrfs: Clean up dead code in root-tree
The value of variable 'can_recover' is never used after being set, thus
it should be removed, as it was never used since the first commit
68a7342c51 ("Btrfs: cleanup orphaned root orphan item").

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Christophe JAILLET 9ca2e97fa3 btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
If 'btrfs_alloc_path()' fails, we must free the resources already
allocated, as done in the other error handling paths in this function.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov c434d21c64 btrfs: Remove redundant argument of __link_block_group
__link_block_group is called from only 2 places and at each call site the
space_info being passed is the same as the space info assigned to the passed
cache struct. Let's remove the redundant argument and make the function
reference the space_info from the passed block_group_cache. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ renamed to link_block_group ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov 1efb72a3c3 btrfs: Rework error handling of add_extent_mapping in __btrfs_alloc_chunk
Currently the code executes add_extent_mapping and if it is successful
it links the new mapping, it then proceeds to unlock the extent mapping
tree and check for failure and handle them. Instead, rework the code to
only perform a single check if add_extent_mapping has failed and handle
it, otherwise the code continues in a linear fashion. No functional
changes

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov 8c70c9f81e btrfs: Remove unused parameter from check_direct_IO
Introduced by 5a5f79b570 ("Btrfs: allow unaligned DIO") and never
used. The buffered fallback from unaligned DIO works as expected.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov ee8c494f88 btrfs: Remove unused arguments from btrfs_changed_cb_t
btrfs_changed_cb_t represents the signature of the callback being passed
to btrfs_compare_trees. Currently there is only one such callback,
namely changed_cb in send.c. This function doesn't really uses the first
2 parameters, i.e. the roots. Since there are not other functions
implementing the btrfs_changed_cb_t let's remove the unused parameters
from the prototype and implementation.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:56 +01:00
Nikolay Borisov a0357511f2 btrfs: Remove unused parameters from various functions
iterate_dir_item:found_key - introduced in 31db9f7c23 ("Btrfs:
  introduce BTRFS_IOC_SEND for btrfs send/receive"), yet never used.

record_ref:num - ditto

This is a first pass with the low-hanging fruit. There are still quite a
few unsued parameters in some function which have to abide by a callback
interface.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Nikolay Borisov 8ca199501e btrfs: Remove unused variable
Src was initially part of 31ff1cd25d ("Btrfs: Copy into the log tree in
big batches"), however 16e7549f04 ("Btrfs: incompatible format change
to remove hole extents") changed parameters passed to copy_items which
made the src variable redundant.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 9b4a9b283d Btrfs: do not async submit for nodatasum inodes
While we submit direct writes, if the inode is flagged with nodatasum,
there's no benefit to submit asynchronously, because

a) we don't have to calculate checksum across processors,

b) and direct IO has started a plug, but async submit makes us queue
IO on each device's scheduled IO list instead of DIO's plug list, so
that IOs get much less merges in general.

Lets use sync submit for nodatasum inodes.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 9cd3a7eb85 Btrfs: search parity device wisely
After mapping block with BTRFS_MAP_WRITE, parities have been sorted to
the end position, so this search can start from the first parity
stripe.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ copied changelog as a comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Anand Jain ee87cf5ed9 btrfs: copy fsid to super_block s_uuid
We didn't copy fsid to struct super_block.s_uuid so Overlay disables
index feature with btrfs as the lower FS.

kernel: overlayfs: fs on '/lower' does not support file handles, falling back to index=off.

Fix this by publishing the fsid through struct super_block.s_uuid.

[ dsterba: I think that setting s_uuid is the last missing bit. Overlay
  needs the file handle encoding support from the lower filesystem, which
  is supported. Filling the whole filesystem id is correct, the subvolume
  id is encoded in the file handle buffer from inside btrfs_encode_fh. ]

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Omar Sandoval 718dc5fade Btrfs: fix __user casting in ioctl.c
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Omar Sandoval c9162bdfd6 Btrfs: make some volumes.c functions static
These aren't used outside of volumes.c.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Nikolay Borisov f78541ddb1 btrfs: Remove redundant forward declarations
Some static functions are needlessly forward declared. Let's remove those
declarations since they add no value.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 49e83f5735 Btrfs: protect conditions within root->log_mutex while waiting
Both wait_for_commit() and wait_for_writer() are checking the
condition out of the mutex lock.

This refactors code a bit to be lock safe.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 45bac0f3d2 Btrfs: use wait_event instead of a single function
Since TASK_UNINTERRUPTIBLE has been used here, wait_event() can do the
same job.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 69cc7151ee Btrfs: move finish_wait out of the loop
If we're still going to wait after schedule(), we don't have to do
finish_wait() to remove our %wait_queue_entry since prepare_to_wait()
won't add the same %wait_queue_entry twice.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Liu Bo 219d33b26a Btrfs: remove batch plug in run_scheduled_IO
Block layer has a limit on plug, ie. BLK_MAX_REQUEST_COUNT == 16, so
we don't gain benefits by batching 64 bios here.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-30 12:27:55 +01:00
Matthew Garrett 357fdad075 Convert fs/*/* to SB_I_VERSION
[AV: in addition to the fix in previous commit]

Signed-off-by: Matthew Garrett <mjg59@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-10-18 18:51:27 -04:00
Linus Torvalds bf2db0b9f5 Merge branch 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "Two more fixes for bugs introduced in 4.13.

  The sector_t problem with 32bit architecture and !LBDAF config seems
  serious but the number of affected deployments is hopefully low.

  The clashing status bits could lead to a confusing in-memory state of
  the whole-filesystem operations if used with the quota override sysfs
  knob"

* 'for-4.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Btrfs: fix overlap of fs_info::flags values
  btrfs: avoid overflow when sector_t is 32 bit
2017-10-06 09:03:08 -07:00
Tsutomu Itoh 69ad59767d Btrfs: fix overlap of fs_info::flags values
Because the values of BTRFS_FS_EXCL_OP and BTRFS_FS_QUOTA_OVERRIDE overlap,
we should change the value.

First, BTRFS_FS_EXCL_OP was set to 14.

  commit 171938e528 ("btrfs: track exclusive filesystem operation in flags")

Next, the value of BTRFS_FS_QUOTA_OVERRIDE was set to 14.

  commit f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")

As a result, the value 14 overlapped, by accident.
This problem is solved by defining the value of BTRFS_FS_EXCL_OP as 16,
the flags are internal.

Fixes: f29efe2921 ("btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minimize the change, update only BTRFS_FS_EXCL_OP ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:44:18 +02:00
Goffredo Baroncelli 2d8ce70a08 btrfs: avoid overflow when sector_t is 32 bit
Jean-Denis Girard noticed commit c821e7f3 "pass bytes to
btrfs_bio_alloc" (https://patchwork.kernel.org/patch/9763081/)
introduces a regression on 32 bit machines.
When CONFIG_LBDAF is _not_ defined (CONFIG_LBDAF == Support for large
(2TB+) block devices and files) sector_t is 32 bit on 32bit machines.

In the function submit_extent_page, 'sector' (which is sector_t type) is
multiplied by 512 to convert it from sectors to bytes, leading to an
overflow when the disk is bigger than 4GB (!).

I added a cast to u64 to avoid overflow.

Fixes: c821e7f3 ("btrfs: pass bytes to btrfs_bio_alloc")
CC: stable@vger.kernel.org # 4.13+
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Jean-Denis Girard <jd.girard@sysnux.pf>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-10-04 16:22:56 +02:00
Linus Torvalds 5ba88cd6e9 Merge branch 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
 "We've collected a bunch of isolated fixes, for crashes, user-visible
  behaviour or missing bits from other subsystem cleanups from the past.

  The overall number is not small but I was not able to make it
  significantly smaller. Most of the patches are supposed to go to
  stable"

* 'for-4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: log csums for all modified extents
  Btrfs: fix unexpected result when dio reading corrupted blocks
  btrfs: Report error on removing qgroup if del_qgroup_item fails
  Btrfs: skip checksum when reading compressed data if some IO have failed
  Btrfs: fix kernel oops while reading compressed data
  Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
  Btrfs: do not backup tree roots when fsync
  btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
  btrfs: propagate error to btrfs_cmp_data_prepare caller
  btrfs: prevent to set invalid default subvolid
  Btrfs: send: fix error number for unknown inode types
  btrfs: fix NULL pointer dereference from free_reloc_roots()
  btrfs: finish ordered extent cleaning if no progress is found
  btrfs: clear ordered flag on cleaning up ordered extents
  Btrfs: fix incorrect {node,sector}size endianness from BTRFS_IOC_FS_INFO
  Btrfs: do not reset bio->bi_ops while writing bio
  Btrfs: use the new helper wbc_to_write_flags
2017-09-29 12:57:35 -07:00
Josef Bacik 8c6c592831 btrfs: log csums for all modified extents
Amir reported a bug discovered by his cleaned up version of my
dm-log-writes xfstests where we were missing csums at certain replay
points.  This is because fsx was doing an msync(), which essentially
fsync()'s a specific range of a file.  We will log all modified extents,
but only search for the checksums in the range we are being asked to
sync.  We cannot simply log the extents in the range we're being asked
because we are logging the inode item as it is currently, which if it
has had a i_size update before the msync means we will miss extents when
replaying.  We could possibly get around this by marking the inode with
the transaction that extended the i_size to see if we have this case,
but this would be racy and we'd have to lock the whole range of the
inode to make sure we didn't have an ordered extent outside of our range
that was in the middle of completing.

Fix this simply by keeping track of the modified extents range and
logging the csums for the entire range of extents that we are logging.
This makes the xfstest pass.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:16 +02:00
Liu Bo 99c4e3b96c Btrfs: fix unexpected result when dio reading corrupted blocks
commit 4246a0b63b ("block: add a bi_error field to struct bio")
changed the logic of how dio read endio reports errors.

For single stripe dio read, %bio->bi_status reflects the error before
verifying checksum, and now we're updating it when data block matches
with its checksum, while in the mismatching case, %bio->bi_status is
not updated to relfect that.

When some blocks in a file have been corrupted on disk, reading such a
file ends up with

1) checksum errors are reported in kernel log
2) read(2) returns successfully with some content being 0x01.

In order to fix it, we need to report its checksum mismatch error to
the upper layer (dio layer in this case) as well.

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Goffredo Baroncelli <kreijack@inwind.it>
Tested-by: Goffredo Baroncelli <kreijack@inwind.it>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:07 +02:00
Sargun Dhillon 36b96fdc6b btrfs: Report error on removing qgroup if del_qgroup_item fails
Previously, we were calling del_qgroup_item, and ignoring the return code
resulting in a potential to have divergent in-memory state without an
error. Perhaps, it makes sense to handle this error code, and put the
filesystem into a read only, or similar state.

This patch only adds reporting of the error if the error is fatal,
(any error other than qgroup not found).

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:54:01 +02:00
Liu Bo e6311f240c Btrfs: skip checksum when reading compressed data if some IO have failed
Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.

In fact, if some IO have failed during reading a compressed data
extent , there's no way the checksum could match, therefore, we can
skip that in order to return error quickly to the upper layer.

Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:26 +02:00
Liu Bo cf1167d5c1 Btrfs: fix kernel oops while reading compressed data
The kernel oops happens at

kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]

It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.

With this, btrfs's read-repair can work properly on reading compressed
data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reported-by: Paul Jones <paul@pauljones.id.au>
Tested-by: Paul Jones <paul@pauljones.id.au>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:23 +02:00
Liu Bo bd7d63c2ce Btrfs: use btrfs_op instead of bio_op in __btrfs_map_block
This seems to be a leftover of commit cf8cddd38b ("btrfs: don't
abuse REQ_OP_* flags for btrfs_map_block").

It should use btrfs_op() helper to provide one of 'enum btrfs_map_op'
types.

Fixes: cf8cddd38b ("btrfs: don't abuse REQ_OP_* flags for btrfs_map_block")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:17 +02:00
Liu Bo fed3b38114 Btrfs: do not backup tree roots when fsync
It doesn't make sense to backup tree roots when doing fsync, since
during fsync those tree roots have not been consistent on disk.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:53:04 +02:00
Misono, Tomohiro c2faff790c btrfs: remove BTRFS_FS_QUOTA_DISABLING flag
Currently, "btrfs quota enable" would fail after "btrfs quota disable" on
the first time with syslog output "qgroup_rescan_init failed with -22", but
it would succeed on the second time.

When "quota disable" is called, BTRFS_FS_QUOTA_DISABLING flag bit will be
set in fs_info->flags in btrfs_quota_disable(), but it will not be droppd
in btrfs_run_qgroups() (which is called in btrfs_commit_transaction())
because quota_root has already been freed. If "quota enable" is called
after that, both BTRFS_FS_QUOTA_DISABLING and BTRFS_FS_QUOTA_ENABLED flag
would be dropped in the btrfs_run_qgroups() since quota_root is not NULL.
This leads to the failure of "quota enable" on the first time.

BTRFS_FS_QUOTA_DISABLING flag is not used outside of "quota disable"
context and is equivalent to whether quota_root is NULL or not.
btrfs_run_qgroups() checks whether quota_root is NULL or not in the first
place.

So, let's remove BTRFS_FS_QUOTA_DISABLING flag.

Signed-off-by: Tomohiro Misono <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:57 +02:00
Naohiro Aota 78ad4ce014 btrfs: propagate error to btrfs_cmp_data_prepare caller
btrfs_cmp_data_prepare() (almost) always returns 0 i.e. ignoring errors
from gather_extent_pages(). While the pages are freed by
btrfs_cmp_data_free(), cmp->num_pages still has > 0. Then,
btrfs_extent_same() try to access the already freed pages causing faults
(or violates PageLocked assertion).

This patch just return the error as is so that the caller stop the process.

Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Fixes: f441460202 ("btrfs: fix deadlock with extent-same and readpage")
Cc: <stable@vger.kernel.org> # 4.2
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:31 +02:00
satoru takeuchi 6d6d282932 btrfs: prevent to set invalid default subvolid
`btrfs sub set-default` succeeds to set an ID which isn't corresponding to any
fs/file tree. If such the bad ID is set to a filesystem, we can't mount this
filesystem without specifying `subvol` or `subvolid` mount options.

Fixes: 6ef5ed0d38 ("Btrfs: add ioctl and incompat flag to set the default mount subvol")
Cc: <stable@vger.kernel.org>
Signed-off-by: Satoru Takeuchi <satoru.takeuchi@gmail.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:25 +02:00
Tsutomu Itoh ca6842bf01 Btrfs: send: fix error number for unknown inode types
ENOTSUPP should not be returned to the user program.
 (cf. include/linux/errno.h)
Therefore, EOPNOTSUPP is used instead of ENOTSUPP.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:52:06 +02:00
Naohiro Aota bb166d7207 btrfs: fix NULL pointer dereference from free_reloc_roots()
__del_reloc_root should be called before freeing up reloc_root->node.
If not, calling __del_reloc_root() dereference reloc_root->node, causing
the system BUG.

Fixes: 6bdf131fac ("Btrfs: don't leak reloc root nodes on error")
Cc: <stable@vger.kernel.org> # 4.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:51:49 +02:00
Naohiro Aota 67c003f90f btrfs: finish ordered extent cleaning if no progress is found
__endio_write_update_ordered() repeats the search until it reaches the end
of the specified range. This works well with direct IO path, because before
the function is called, it's ensured that there are ordered extents filling
whole the range. It's not the case, however, when it's called from
run_delalloc_range(): it is possible to have error in the midle of the loop
in e.g. run_delalloc_nocow(), so that there exisits the range not covered
by any ordered extents. By cleaning such "uncomplete" range,
__endio_write_update_ordered() stucks at offset where there're no ordered
extents.

Since the ordered extents are created from head to tail, we can stop the
search if there are no offset progress.

Fixes: 524272607e ("btrfs: Handle delalloc error correctly to avoid ordered extent hang")
Cc: <stable@vger.kernel.org> # 4.12
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <quwenruo.btrfs@gmx.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-26 14:49:06 +02:00