Commit graph

922 commits

Author SHA1 Message Date
Chris Mason cdfb080e18 Btrfs: fix use after free when close_ctree frees the orphan_rsv
Near the end of close_ctree, we're calling btrfs_free_block_rsv
to free up the orphan rsv.  The problem is this call updates the
space_info, which has already been freed.

This adds a new __ function that directly calls kfree instead of trying
to update the space infos.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:29 -07:00
Chris Mason 1bbc621ef2 Btrfs: allow block group cache writeout outside critical section in commit
We loop through all of the dirty block groups during commit and write
the free space cache.  In order to make sure the cache is currect, we do
this while no other writers are allowed in the commit.

If a large number of block groups are dirty, this can introduce long
stalls during the final stages of the commit, which can block new procs
trying to change the filesystem.

This commit changes the block group cache writeout to take appropriate
locks and allow it to run earlier in the commit.  We'll still have to
redo some of the block groups, but it means we can get most of the work
out of the way without blocking the entire FS.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:22 -07:00
Chris Mason c9dc4c6578 Btrfs: two stage dirty block group writeout
Block group cache writeout is currently waiting on the pages for each
block group cache before moving on to writing the next one.  This commit
switches things around to send down all the caches and then wait on them
in batches.

The end result is much faster, since we're keeping the disk pipeline
full.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:07:11 -07:00
Josef Bacik 365c531377 Btrfs: don't commit the transaction in the async space flushing
We're triggering a huge number of commits from
btrfs_async_reclaim_metadata_space.  These aren't really requried,
because everyone calling the async reclaim code is going to end up
triggering a commit on their own.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:54 -07:00
Josef Bacik cb723e4919 Btrfs: reserve space for block groups
This changes our delayed refs calculations to include the space needed
to write back dirty block groups.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:48 -07:00
Chris Mason 28f75a0e6c Btrfs: refill block reserves during truncate
When truncate starts, it allocates some space in the block reserves so
that we'll have enough to update metadata along the way.

For very large files, we can easily go through all of that space as we
loop through the extents.  This changes truncate to refill the space
reservation as it progresses through the file.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:06:34 -07:00
Josef Bacik 1262133b8d Btrfs: account for crcs in delayed ref processing
As we delete large extents, we end up doing huge amounts of COW in order
to delete the corresponding crcs.  This adds accounting so that we keep
track of that space and flushing of delayed refs so that we don't build
up too much delayed crc work.

This helps limit the delayed work that must be done at commit time and
tries to avoid ENOSPC aborts because the crcs eat all the global
reserves.

Signed-off-by: Chris Mason <clm@fb.com>
2015-04-10 14:04:47 -07:00
Chris Mason dd82525956 Btrfs: free and unlock our path before btrfs_free_and_pin_reserved_extent()
The error handling path for alloc_reserved_tree_block is calling
btrfs_free_and_pin_reserved_extent with a spinning tree lock held.  This
might sleep as we allocate extent_state objects:

 BUG: sleeping function called from invalid context at mm/slub.c:1268
 in_atomic(): 1, irqs_disabled(): 0, pid: 11093, name: kworker/u4:7
 5 locks held by kworker/u4:7/11093:
  #0:  ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #1:  ((&work->normal_work)){+.+.+.}, at: [<ffffffff81091d51>] process_one_work+0x151/0x520
  #2:  (sb_internal){++++.+}, at: [<ffffffffa003a70e>] start_transaction+0x43e/0x590 [btrfs]
  #3:  (&head_ref->mutex){+.+...}, at: [<ffffffffa0089f8c>] btrfs_delayed_ref_lock+0x4c/0x240 [btrfs]
  #4:  (btrfs-extent-00){++++..}, at: [<ffffffffa007697b>] btrfs_clear_lock_blocking_rw+0x9b/0x150 [btrfs]
 CPU: 0 PID: 11093 Comm: kworker/u4:7 Tainted: G        W 4.0.0-rc6-default+ #246
 Hardware name: Intel Corporation Santa Rosa platform/Matanzas, BIOS TSRSCRB1.86C.0047.B00.0610170821 10/17/06
 Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
  00000000000004f4 ffff88006dd17848 ffffffff81ab0e3b ffff88006dd17848
  ffff88007a944760 ffff88006dd17868 ffffffff8109d516 ffff88006dd17898
  0000000000000000 ffff88006dd17898 ffffffff8109d5b2 ffffffff81aba2bb
 Call Trace:
  [<ffffffff81ab0e3b>] dump_stack+0x4f/0x6c
  [<ffffffff8109d516>] ___might_sleep+0xf6/0x140
  [<ffffffff8109d5b2>] __might_sleep+0x52/0x90
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffff81196363>] kmem_cache_alloc+0x163/0x1b0
  [<ffffffffa0056f31>] ? alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa0056f20>] ? alloc_extent_state+0x20/0x150 [btrfs]
  [<ffffffffa0056f31>] alloc_extent_state+0x31/0x150 [btrfs]
  [<ffffffffa005805b>] __set_extent_bit+0x37b/0x5d0 [btrfs]
  [<ffffffff81aba2bb>] ? ftrace_call+0x5/0x34
  [<ffffffffa005888d>] ? set_extent_bit+0xd/0x30 [btrfs]
  [<ffffffffa00588a3>] set_extent_bit+0x23/0x30 [btrfs]
  [<ffffffffa0058e80>] set_extent_dirty+0x20/0x30 [btrfs]
  [<ffffffffa00195ba>] pin_down_extent+0xaa/0x170 [btrfs]
  [<ffffffffa001d8ef>] __btrfs_free_reserved_extent+0xcf/0x160 [btrfs]
  [<ffffffffa0023856>] btrfs_free_and_pin_reserved_extent+0x16/0x20 [btrfs]
  [<ffffffffa002482a>] __btrfs_run_delayed_refs+0xfca/0x1290 [btrfs]
  [<ffffffffa0026eae>] btrfs_run_delayed_refs+0x6e/0x2e0 [btrfs]
  [<ffffffffa0027378>] delayed_ref_async_start+0x48/0xb0 [btrfs]
  [<ffffffffa006c883>] normal_work_helper+0x83/0x350 [btrfs]
  [<ffffffffa006cd79>] ? btrfs_extent_refs_helper+0x9/0x20 [btrfs]
  [<ffffffffa006cd82>] btrfs_extent_refs_helper+0x12/0x20 [btrfs]
  [<ffffffff81091dcb>] process_one_work+0x1cb/0x520
  [<ffffffff81091d51>] ? process_one_work+0x151/0x520
  [<ffffffff811c7abf>] ? seq_read+0x3f/0x400
  [<ffffffff8109260b>] worker_thread+0x5b/0x4e0
  [<ffffffff81097be2>] ? __kthread_parkme+0x12/0xa0
  [<ffffffff810925b0>] ? rescuer_thread+0x450/0x450
  [<ffffffff81098686>] kthread+0xf6/0x120
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
  [<ffffffff81ab8088>] ret_from_fork+0x58/0x90
  [<ffffffff81098590>] ? flush_kthread_worker+0x1b0/0x1b0
 ------------[ cut here ]------------

This changes things to free the path first, which will also unlock the
extent buffer.

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Dave Sterba <dsterba@suse.cz>
Tested-by: Dave Sterba <dsterba@suse.cz>
2015-04-01 08:36:05 -07:00
Filipe Manana dcc82f4783 Btrfs: fix log tree corruption when fs mounted with -o discard
While committing a transaction we free the log roots before we write the
new super block. Freeing the log roots implies marking the disk location
of every node/leaf (metadata extent) as pinned before the new super block
is written. This is to prevent the disk location of log metadata extents
from being reused before the new super block is written, otherwise we
would have a corrupted log tree if before the new super block is written
a crash/reboot happens and the location of any log tree metadata extent
ended up being reused and rewritten.

Even though we pinned the log tree's metadata extents, we were issuing a
discard against them if the fs was mounted with the -o discard option,
resulting in corruption of the log tree if a crash/reboot happened before
writing the new super block - the next time the fs was mounted, during
the log replay process we would find nodes/leafs of the log btree with
a content full of zeroes, causing the process to fail and require the
use of the tool btrfs-zero-log to wipeout the log tree (and all data
previously fsynced becoming lost forever).

Fix this by not doing a discard when pinning an extent. The discard will
be done later when it's safe (after the new super block is committed) at
extent-tree.c:btrfs_finish_extent_commit().

Fixes: e688b7252f (Btrfs: fix extent pinning bugs in the tree log)
CC: <stable@vger.kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26 17:56:24 -07:00
Chris Mason fc4c3c872f Merge branch 'cleanups-post-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1
Signed-off-by: Chris Mason <clm@fb.com>

Conflicts:
	fs/btrfs/disk-io.c
2015-03-25 10:52:48 -07:00
Chris Mason 9deed229fa Merge branch 'cleanups-for-4.1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.1 2015-03-25 10:43:16 -07:00
Linus Torvalds 521d474631 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Most of these are fixing extent reservation accounting, or corners
  with tree writeback during commit.

  Josef's set does add a test, which isn't strictly a fix, but it'll
  keep us from making this same mistake again"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: fix outstanding_extents accounting in DIO
  Btrfs: add sanity test for outstanding_extents accounting
  Btrfs: just free dummy extent buffers
  Btrfs: account merges/splits properly
  Btrfs: prepare block group cache before writing
  Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list)
  Btrfs: account for the correct number of extents for delalloc reservations
  Btrfs: fix merge delalloc logic
  Btrfs: fix comp_oper to get right order
  Btrfs: catch transaction abortion after waiting for it
  btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-21 10:53:37 -07:00
Josef Bacik 6a3891c551 Btrfs: add sanity test for outstanding_extents accounting
I introduced a regression wrt outstanding_extents accounting.  These are tricky
areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
at any time.  So add sanity tests to cover the various conditions that are
tricky in order to make sure we don't introduce regressions in the future.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 16:36:31 -04:00
Josef Bacik dcdf7f6ddb Btrfs: prepare block group cache before writing
Writing the block group cache will modify the extent tree quite a bit because it
truncates the old space cache and pre-allocates new stuff.  To try and cut down
on the churn lets do the setup dance first, then later on hopefully we can avoid
looping with newly dirtied roots.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-03-17 10:56:55 -04:00
Josef Bacik 6a41dd0922 Btrfs: account for the correct number of extents for delalloc reservations
Direct IO can easily pass in an buffer that is greater than
BTRFS_MAX_EXTENT_SIZE, so take this into account when reserving extents in the
delalloc reservation code.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-13 13:46:59 -07:00
Linus Torvalds 84399bb075 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Outside of misc fixes, Filipe has a few fsync corners and we're
  pulling in one more of Josef's fixes from production use here"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs:__add_inode_ref: out of bounds memory read when looking for extended ref.
  Btrfs: fix data loss in the fast fsync path
  Btrfs: remove extra run_delayed_refs in update_cowonly_root
  Btrfs: incremental send, don't rename a directory too soon
  btrfs: fix lost return value due to variable shadowing
  Btrfs: do not ignore errors from btrfs_lookup_xattr in do_setxattr
  Btrfs: fix off-by-one logic error in btrfs_realloc_node
  Btrfs: add missing inode update when punching hole
  Btrfs: abort the transaction if we fail to update the free space cache inode
  Btrfs: fix fsync race leading to ordered extent memory leaks
2015-03-06 13:52:54 -08:00
David Sterba 47c5713f47 btrfs: replace remaining do_div calls with div_u64 variants
Switch to div_u64_rem that does type checking and has more obvious
semantics than do_div.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:01 +01:00
David Sterba b8b93addde btrfs: cleanup 64bit/32bit divs, provably bounded values
The divisor is derived from nodesize or PAGE_SIZE, fits into 32bit type.
Get rid of a few more do_div instances.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:24:00 +01:00
David Sterba f8c269d722 btrfs: cleanup 64bit/32bit divs, compile time constants
Switch to div_u64 if the divisor is a numeric constant or sum of
sizeof()s. We can remove a few instances of do_div that has the hidden
semtantics of changing the 1st argument.

Small power-of-two divisors are converted to bitshifts, large values are
kept intact for clarity.

Signed-off-by: David Sterba <dsterba@suse.cz>
2015-03-03 17:23:57 +01:00
Josef Bacik 0c0ef4bc84 Btrfs: abort the transaction if we fail to update the free space cache inode
Our gluster boxes were hitting a problem where they'd run out of space when
updating the block group cache and therefore wouldn't be able to update the free
space inode.  This is a problem because this is how we invalidate the cache and
protect ourselves from errors further down the stack, so if this fails we have
to abort the transaction so we make sure we don't end up with stale free space
cache.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-02 14:04:44 -08:00
Linus Torvalds 2b9fb532d4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "This pull is mostly cleanups and fixes:

   - The raid5/6 cleanups from Zhao Lei fixup some long standing warts
     in the code and add improvements on top of the scrubbing support
     from 3.19.

   - Josef has round one of our ENOSPC fixes coming from large btrfs
     clusters here at FB.

   - Dave Sterba continues a long series of cleanups (thanks Dave), and
     Filipe continues hammering on corner cases in fsync and others

  This all was held up a little trying to track down a use-after-free in
  btrfs raid5/6.  It's not clear yet if this is just made easier to
  trigger with this pull or if its a new bug from the raid5/6 cleanups.
  Dave Sterba is the only one to trigger it so far, but he has a
  consistent way to reproduce, so we'll get it nailed shortly"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits)
  Btrfs: don't remove extents and xattrs when logging new names
  Btrfs: fix fsync data loss after adding hard link to inode
  Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
  Btrfs: account for large extents with enospc
  Btrfs: don't set and clear delalloc for O_DIRECT writes
  Btrfs: only adjust outstanding_extents when we do a short write
  btrfs: Fix out-of-space bug
  Btrfs: scrub, fix sleep in atomic context
  Btrfs: fix scheduler warning when syncing log
  Btrfs: Remove unnecessary placeholder in btrfs_err_code
  btrfs: cleanup init for list in free-space-cache
  btrfs: delete chunk allocation attemp when setting block group ro
  btrfs: clear bio reference after submit_one_bio()
  Btrfs: fix scrub race leading to use-after-free
  Btrfs: add missing cleanup on sysfs init failure
  Btrfs: fix race between transaction commit and empty block group removal
  btrfs: add more checks to btrfs_read_sys_array
  btrfs: cleanup, rename a few variables in btrfs_read_sys_array
  btrfs: add checks for sys_chunk_array sizes
  btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize
  ...
2015-02-19 14:36:00 -08:00
Zhao Lei b9fd47cde5 btrfs: cleanup: remove no-used alloc_chunk in btrfs_check_data_free_space()
int alloc_chunk is never used in this function, remove it.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:44 +01:00
Daniel Dressler 01d58472a8 Btrfs: disk-io: replace root args iff only fs_info used
This is the 3rd independent patch of a larger project to cleanup btrfs's
internal usage of btrfs_root. Many functions take btrfs_root only to
grab the fs_info struct.

By requiring a root these functions cause programmer overhead. That
these functions can accept any valid root is not obvious until
inspection.

This patch reduces the specificity of such functions to accept the
fs_info directly.

These patches can be applied independently and thus are not being
submitted as a patch series. There should be about 26 patches by the
project's completion. Each patch will cleanup between 1 and 34 functions
apiece.  Each patch covers a single file's functions.

This patch affects the following function(s):
  1) csum_tree_block
  2) csum_dirty_buffer
  3) check_tree_block_fsid
  4) btrfs_find_tree_block
  5) clean_tree_block

Signed-off-by: Daniel Dressler <danieru.dressler@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.cz>
2015-02-16 18:48:43 +01:00
Forrest Liu 3d84be7991 Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group
Removing large amount of block group in a transaction may encounters
BUG_ON() in btrfs_orphan_add(). That is because btrfs_orphan_reserve_metadata()
will grab metadata reservation from transaction handle, and
btrfs_delete_unused_bgs() didn't reserve metadata for trnasaction handle when
delete unused block group.

The problem can be reproduce by following script

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 1000g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    for j in `seq 1 1 1000`; do
        fallocate -l 1g $mntpath/$j
    done
    # wait cleaner thread remove unused block group
    sleep 300

The call trace that results from the BUG_ON() is:

[  613.093084] ------------[ cut here ]------------
[  613.097928] kernel BUG at fs/btrfs/inode.c:3142!
[  613.105855] invalid opcode: 0000 [#1] SMP
[  613.112702] Modules linked in: coretemp(E) crc32_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) snd_ens1371(E) snd_ac97_codec(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E) ppdev(E) ac97_bus(E) ablk_helper(E) gameport(E) cryptd(E) snd_rawmidi(E) snd_seq_device(E) snd_pcm(E) vmw_balloon(E) snd_timer(E) snd(E) soundcore(E) serio_raw(E) vmwgfx(E) ttm(E) drm_kms_helper(E) drm(E) vmw_vmci(E) parport_pc(E) shpchp(E) i2c_piix4(E) mac_hid(E) lp(E) parport(E) btrfs(E) xor(E) raid6_pq(E) hid_generic(E) usbhid(E) hid(E) psmouse(E) ahci(E) libahci(E) e1000(E) mptspi(E) mptscsih(E) mptbase(E) floppy(E) vmw_pvscsi(E) vmxnet3(E)
[  613.144196] CPU: 0 PID: 1480 Comm: btrfs-cleaner Tainted: G            E  3.19.0-rc7-custom #2
[  613.148501] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  613.152694] task: ffff880035cdb1a0 ti: ffff880039cf4000 task.ti: ffff880039cf4000
[  613.154969] RIP: 0010:[<ffffffffa01441c2>]  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.157780] RSP: 0018:ffff880039cf7c48  EFLAGS: 00010286
[  613.159560] RAX: 00000000ffffffe4 RBX: ffff88003bd981a0 RCX: ffff88003c9e4000
[  613.161904] RDX: 0000000000002244 RSI: 0000000000040000 RDI: ffff88003c9e4138
[  613.164264] RBP: ffff880039cf7c88 R08: 000060ffc0000850 R09: 0000000000000000
[  613.166507] R10: ffff88003bc4b7a0 R11: ffffea0000eb6740 R12: ffff88003c9c0000
[  613.168681] R13: ffff88003c102160 R14: ffff88003c9c0458 R15: 0000000000000001
[  613.170932] FS:  0000000000000000(0000) GS:ffff88003f600000(0000) knlGS:0000000000000000
[  613.173316] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  613.175227] CR2: 00007f6343537000 CR3: 0000000036329000 CR4: 00000000000407f0
[  613.177554] Stack:
[  613.178712]  ffff880039cf7c88 ffffffffa0182a54 ffff88003c9e4b04 ffff88003c9c7800
[  613.181297]  ffff88003bc4b7a0 ffff88003bd981a0 ffff88003c8db200 ffff88003c2fcc60
[  613.183782]  ffff880039cf7d18 ffffffffa012da97 ffff88003bc4b7a4 ffff88003bc4b7a0
[  613.186171] Call Trace:
[  613.187493]  [<ffffffffa0182a54>] ? lookup_free_space_inode+0x44/0x100 [btrfs]
[  613.189801]  [<ffffffffa012da97>] btrfs_remove_block_group+0x137/0x740 [btrfs]
[  613.192126]  [<ffffffffa0166912>] btrfs_remove_chunk+0x672/0x780 [btrfs]
[  613.194267]  [<ffffffffa012e2ff>] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs]
[  613.196567]  [<ffffffffa0135e4c>] cleaner_kthread+0x12c/0x190 [btrfs]
[  613.198687]  [<ffffffffa0135d20>] ? check_leaf+0x350/0x350 [btrfs]
[  613.200758]  [<ffffffff8108f232>] kthread+0xd2/0xf0
[  613.202616]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.204738]  [<ffffffff8175dabc>] ret_from_fork+0x7c/0xb0
[  613.206652]  [<ffffffff8108f160>] ? kthread_create_on_node+0x180/0x180
[  613.208741] Code: ff ff 0f 1f 80 00 00 00 00 89 45 c8 3e 80 63 80 fd 48 89 df e8 d0 23 fe ff 8b 45 c8 e9 14 ff ff ff b8 f4 ff ff ff e9 12 ff ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
[  613.216562] RIP  [<ffffffffa01441c2>] btrfs_orphan_add+0x1d2/0x1e0 [btrfs]
[  613.218828]  RSP <ffff880039cf7c48>
[  613.220382] ---[ end trace 71073106deb8a457 ]---

This patch replace btrfs_join_transaction() with btrfs_start_transaction() in
btrfs_delete_unused_bgs() to revent BUG_ON() in btrfs_orphan_add()

Signed-off-by: Forrest Liu <forrestl@synology.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:49 -08:00
Josef Bacik dcab6a3b2a Btrfs: account for large extents with enospc
On our gluster boxes we stream large tar balls of backups onto our fses.  With
160gb of ram this means we get really large contiguous ranges of dirty data, but
the way our ENOSPC stuff works is that as long as it's contiguous we only hold
metadata reservation for one extent.  The problem is we limit our extents to
128mb, so we'll end up with at least 800 extents so our enospc accounting is
quite a bit lower than what we need.  To keep track of this make sure we
increase outstanding_extents for every multiple of the max extent size so we can
be sure to have enough reserved metadata space.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-14 08:22:48 -08:00
Shaohua Li 2f0810880f btrfs: delete chunk allocation attemp when setting block group ro
Below test will fail currently:
      mkfs.ext4 -F /dev/sda
      btrfs-convert /dev/sda
      mount /dev/sda /mnt
      btrfs device add -f /dev/sdb /mnt
      btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt

The reason is there are some block groups with usage 0, but the whole
disk hasn't free space to allocate new chunk, so we even can't set such
block group readonly. This patch deletes the chunk allocation when
setting block group ro. For META, we already have reserve. But for
SYSTEM, we don't have, so the check_system_chunk is still required.

Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:25:20 -08:00
Filipe Manana d4b450cd4b Btrfs: fix race between transaction commit and empty block group removal
Committing a transaction can race with automatic removal of empty block
groups (cleaner kthread), leading to a BUG_ON() in the transaction
commit code while running btrfs_finish_extent_commit(). The following
sequence diagram shows how it can happen:

           CPU 1                                       CPU 2

btrfs_commit_transaction()
  fs_info->running_transaction = NULL
  btrfs_finish_extent_commit()
    find_first_extent_bit()
      -> found range for block group X
         in fs_info->freed_extents[]

                                               btrfs_delete_unused_bgs()
                                                 -> found block group X

                                                 Removed block group X's range
                                                 from fs_info->freed_extents[]

                                                 btrfs_remove_chunk()
                                                    btrfs_remove_block_group(bg X)

    unpin_extent_range(bg X range)
       btrfs_lookup_block_group(bg X)
          -> returns NULL
            -> BUG_ON()

The trace that results from the BUG_ON() is:

[48665.187808] ------------[ cut here ]------------
[48665.188032] kernel BUG at fs/btrfs/extent-tree.c:5675!
[48665.188032] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[48665.188032] Modules linked in: dm_flakey dm_mod crc32c_generic btrfs xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop parport_pc evdev microcode
[48665.197388] CPU: 2 PID: 31211 Comm: kworker/u32:16 Tainted: G        W      3.19.0-rc5-btrfs-next-4+ #1
[48665.197388] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[48665.197388] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[48665.197388] task: ffff880222011810 ti: ffff8801b56a4000 task.ti: ffff8801b56a4000
[48665.197388] RIP: 0010:[<ffffffffa0350d05>]  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388] RSP: 0018:ffff8801b56a7b88  EFLAGS: 00010246
[48665.197388] RAX: 0000000000000000 RBX: ffff8802143a6000 RCX: ffff8802220120c8
[48665.197388] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8800a3c140b0
[48665.197388] RBP: ffff8801b56a7bd8 R08: 0000000000000003 R09: 0000000000000000
[48665.197388] R10: 0000000000000000 R11: 000000000000bbac R12: 0000000012e8e000
[48665.197388] R13: ffff8800a3c14000 R14: 0000000000000000 R15: 0000000000000000
[48665.197388] FS:  0000000000000000(0000) GS:ffff88023ec40000(0000) knlGS:0000000000000000
[48665.197388] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[48665.197388] CR2: 00007f065e42f270 CR3: 0000000206f70000 CR4: 00000000000006e0
[48665.197388] Stack:
[48665.197388]  ffff8801b56a7bd8 0000000012ea0000 01ff8800a3c14138 0000000012e9ffff
[48665.197388]  ffff880141df3dd8 ffff8802143a6000 ffff8800a3c14138 ffff880141df3df0
[48665.197388]  ffff880141df3dd8 0000000000000000 ffff8801b56a7c08 ffffffffa0354227
[48665.197388] Call Trace:
[48665.197388]  [<ffffffffa0354227>] btrfs_finish_extent_commit+0xb0/0xd9 [btrfs]
[48665.197388]  [<ffffffffa0366b4b>] btrfs_commit_transaction+0x791/0x92c [btrfs]
[48665.197388]  [<ffffffffa0352432>] flush_space+0x43d/0x452 [btrfs]
[48665.197388]  [<ffffffff814295c3>] ? _raw_spin_unlock+0x28/0x33
[48665.197388]  [<ffffffffa035255f>] btrfs_async_reclaim_metadata_space+0x118/0x164 [btrfs]
[48665.197388]  [<ffffffff81059917>] ? process_one_work+0x14b/0x3ab
[48665.197388]  [<ffffffff810599ac>] process_one_work+0x1e0/0x3ab
[48665.197388]  [<ffffffff81079fa9>] ? trace_hardirqs_off+0xd/0xf
[48665.197388]  [<ffffffff8105a55b>] worker_thread+0x210/0x2d0
[48665.197388]  [<ffffffff8105a34b>] ? rescuer_thread+0x2c3/0x2c3
[48665.197388]  [<ffffffff8105e5c0>] kthread+0xef/0xf7
[48665.197388]  [<ffffffff81429682>] ? _raw_spin_unlock_irq+0x2d/0x39
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388]  [<ffffffff81429dec>] ret_from_fork+0x7c/0xb0
[48665.197388]  [<ffffffff8105e4d1>] ? __kthread_parkme+0xad/0xad
[48665.197388] Code: 85 f6 74 14 49 8b 06 49 03 46 09 49 39 c4 72 1d 4c 89 f7 e8 83 ec ff ff 4c 89 e6 4c 89 ef e8 1e f1 ff ff 48 85 c0 49 89 c6 75 02 <0f> 0b 49 8b 1e 49 03 5e 09 48 8b
[48665.197388] RIP  [<ffffffffa0350d05>] unpin_extent_range+0x6a/0x1ba [btrfs]
[48665.197388]  RSP <ffff8801b56a7b88>
[48665.272246] ---[ end trace b9c6ab9957521376 ]---

Fix this by ensuring that unpining the block group's range in
btrfs_finish_extent_commit() is done in a synchronized fashion
with removing the block group's range from freed_extents[]
in btrfs_delete_unused_bgs()

This race got introduced with the change:

    Btrfs: remove empty block groups automatically
    commit 47ab2a6c68

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-02 19:24:48 -08:00
Liu Bo 26455d3318 Btrfs: cleanup unused run_most
"run_most" is not used anymore.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:22:16 -08:00
Zhao Lei 6e9606d2a2 Btrfs: add ref_count and free function for btrfs_bio
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag
   to keep btrfs_bio's memory in raid56 recovery implement.
2: free function for bbio will make code clean and flexible, plus
   forced data type checking in compile.

Changelog v1->v2:
 Rename following by David Sterba's suggestion:
 put_btrfs_bio() -> btrfs_put_bio()
 get_btrfs_bio() -> btrfs_get_bio()
 bbio->ref_count -> bbio->refs

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:06:48 -08:00
Filipe Manana 6219872dc6 Btrfs: lookup for block group only if needed when freeing a tree block
Very often our extent buffer's header generation doesn't match the current
transaction's id or it is also referenced by other trees (snapshots), so
we don't need the corresponding block group cache object. Therefore only
search for it if we are going to use it, so we avoid an unnecessary search
in the block groups rbtree (and acquiring and releasing its spinlock).

Freeing a tree block is performed when COWing or deleting a node/leaf,
which implies we are holding the node/leaf's parent node lock, therefore
reducing the amount of time spent when freeing a tree block helps reducing
the amount of time we are holding the parent node's lock.

For example, for a run of xfstests/generic/083, the block group cache
object was needed only 682 times for a total of 226691 calls to free
a tree block.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 18:02:04 -08:00
Chris Mason 57bbddd7fb Merge branch 'cleanup/blocksize-diet-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus 2015-01-21 17:49:35 -08:00
Josef Bacik ce93ec548c Btrfs: track dirty block groups on their own list
Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set.  This function
can get called several times during a commit.  So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time.  Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups.  This allows us to clean up how we write the
free space cache out so it is much cleaner.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21 17:36:52 -08:00
Filipe Manana 75c68e9fbb Btrfs: fix race deleting block group from space_info->ro_bgs list
When removing a block group we were deleting it from its space_info's
ro_bgs list without the correct protection - the space info's spinlock.
Fix this by doing the list delete while holding the spinlock of the
corresponding space info, which is the correct lock for any operation
on that list.

This issue was introduced in the 3.19 kernel by the following change:

    Btrfs: move read only block groups onto their own list V2
    commit 633c0aad4c

I ran into a kernel crash while a task was running statfs, which iterates
the space_info->ro_bgs list while holding the space info's spinlock,
and another task was deleting it from the same list, without holding that
spinlock, as part of the block group remove operation (while running the
function btrfs_remove_block_group). This happened often when running the
stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19 13:05:45 -08:00
Josef Bacik df95e7f0d9 Btrfs: abort transaction if we don't find the block group
We shouldn't BUG_ON() if there is corruption.  I hit this while testing my block
group patch and the abort worked properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-01-02 14:47:55 -05:00
David Sterba a83fffb75d btrfs: sink blocksize parameter to btrfs_find_create_tree_block
Finally it's clear that the requested blocksize is always equal to
nodesize, with one exception, the superblock.

Superblock has fixed size regardless of the metadata block size, but
uses the same helpers to initialize sys array/chunk tree and to work
with the chunk items. So it pretends to be an extent_buffer for a
moment, btrfs_read_sys_array is full of special cases, we're adding one
more.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:21 +01:00
David Sterba fe864576de btrfs: sink blocksize parameter to btrfs_init_new_buffer
Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:20 +01:00
David Sterba d3e46fea1b btrfs: sink blocksize parameter to readahead_tree_block
All callers pass nodesize.

Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12 18:07:18 +01:00
Filipe Manana 1edb647bb9 Btrfs: remove non-sense btrfs_error_discard_extent() function
It doesn't do anything special, it just calls btrfs_discard_extent(),
so just remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:32 -08:00
Filipe Manana 678886bdc6 Btrfs: fix fs corruption on transaction abort if device supports discard
When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

  [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

   _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
   *** fsck.btrfs output ***
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   Check tree block failed, want=94945280, have=0
   read block failed check_tree_block
   Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using frace what I observed was the following sequence of steps happened:

   1) transaction N started, fs_info->pinned_extents pointed to
      fs_info->freed_extents[0];

   2) node/eb 94945280 is created;

   3) eb is persisted to disk;

   4) transaction N commit starts, fs_info->pinned_extents now points to
      fs_info->freed_extents[1], and transaction N completes;

   5) transaction N + 1 starts;

   6) eb is COWed, and btrfs_free_tree_block() called for this eb;

   7) eb range (94945280 to 94945280 + 16Kb) is added to
      fs_info->pinned_extents (fs_info->freed_extents[1]);

   8) Something goes wrong in transaction N + 1, like hitting ENOSPC
      for example, and the transaction is aborted, turning the fs into
      readonly mode. The stack trace I got for example:

      [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
      [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
      [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
      [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
      [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
      [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
      [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
      [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
      (...)
      [112065.264493] ---[ end trace dd7903a975a31a08 ]---
      [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
      [112065.264997] BTRFS info (device sdc): forced readonly

   9) The clear kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
      fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
      turn calls btrfs_destroy_pinned_extent();

   10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
       marked as dirty in fs_info->freed_extents[], and for each one
       it calls discard, if the fs was mounted with "-o discard", and
       adds the range to the free space cache of the respective block
       group;

   11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
       sees the free space entries and performs a discard;

   12) After an umount and mount (or fsck), our eb's location on disk was full
       of zeroes, and it should have been untouched, because it was marked as
       dirty in the fs_info->pinned_extents tree, and therefore used by the
       trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and write to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b02).

Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:31 -08:00
Filipe Manana 01eacb2779 Btrfs: always clear a block group node when removing it from the tree
Always clear a block group's rbnode after removing it from the rbtree to
ensure that any tasks that might be holding a reference on the block group
don't end up accessing stale rbnode left and right child pointers through
next_block_group().

This is a leftover from the change titled:
"Btrfs: fix invalid block group rbtree access after bg is removed"

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-10 12:22:29 -08:00
Josef Bacik cb83b7b816 Btrfs: make get_caching_control unconditionally return the ctl
This was written when we didn't do a caching control for the fast free space
cache loading.  However we started doing that a long time ago, and there is
still a small window of time that we could be caching the block group the fast
way, so if there is a caching_ctl at all on the block group just return it, the
callers all wait properly for what they want.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana 8dbcd10f69 Btrfs: fix unprotected deletion from pending_chunks list
On block group remove if the corresponding extent map was on the
transaction->pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensure that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana 495e64f4fe Btrfs: fix fs mapping extent map leak
On chunk allocation error (label "error_del_extent"), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G        W    L 3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]  0000000000000000 ffff8800ba867eb8 ffffffff813e7a13 ffff8800a2e11040
[20770.106132]  ffff8800ba867ed0 ffffffff81105d0c 0000000000000000 ffff8800ba867ee0
[20770.106134]  ffffffffa035d65e ffff8800ba867ef0 ffffffffa03b0654 ffff8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
[20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
"Btrfs: fix race between fs trimming and block group remove/allocation"

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:10 -08:00
Filipe Manana c92f6be34c Btrfs: make btrfs_abort_transaction consider existence of new block groups
If the transaction handle doesn't have used blocks but has created new block
groups make sure we turn the fs into readonly mode too. This is because the
new block groups didn't get all their metadata persisted into the chunk and
device trees, and therefore if a subsequent transaction starts, allocates
space from the new block groups, writes data or metadata into that space,
commits successfully and then after we unmount and mount the filesystem
again, the same space can be allocated again for a new block group,
resulting in file data or metadata corruption.

Example where we don't abort the transaction when we fail to finish the
chunk allocation (add items to the chunk and device trees) and later a
future transaction where the block group is removed fails because it can't
find the chunk item in the chunk tree:

[25230.404300] WARNING: CPU: 0 PID: 7721 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x50/0xfc [btrfs]()
[25230.404301] BTRFS: Transaction aborted (error -28)
[25230.404302] Modules linked in: btrfs dm_flakey nls_utf8 fuse xor raid6_pq ntfs vfat msdos fat xfs crc32c_generic libcrc32c ext3 jbd ext2 dm_mod nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse i2c_piix4 i2ccore parport_pc parport processor button pcspkr serio_raw thermal_sys evdev microcode ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy e1000 ata_piix libata virtio_pci virtio_ring scsi_mod virtio [last unloaded: btrfs]
[25230.404325] CPU: 0 PID: 7721 Comm: xfs_io Not tainted 3.17.0-rc5-btrfs-next-1+ #1
[25230.404326] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[25230.404328]  0000000000000000 ffff88004581bb08 ffffffff813e7a13 ffff88004581bb50
[25230.404330]  ffff88004581bb40 ffffffff810423aa ffffffffa049386a 00000000ffffffe4
[25230.404332]  ffffffffa05214c0 000000000000240c ffff88010fc8f800 ffff88004581bba8
[25230.404334] Call Trace:
[25230.404338]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
[25230.404342]  [<ffffffff810423aa>] warn_slowpath_common+0x7f/0x98
[25230.404351]  [<ffffffffa049386a>] ? __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404353]  [<ffffffff8104240b>] warn_slowpath_fmt+0x48/0x50
[25230.404362]  [<ffffffffa049386a>] __btrfs_abort_transaction+0x50/0xfc [btrfs]
[25230.404374]  [<ffffffffa04a8c43>] btrfs_create_pending_block_groups+0x10c/0x135 [btrfs]
[25230.404387]  [<ffffffffa04b77fd>] __btrfs_end_transaction+0x7e/0x2de [btrfs]
[25230.404398]  [<ffffffffa04b7a6d>] btrfs_end_transaction+0x10/0x12 [btrfs]
[25230.404408]  [<ffffffffa04a3d64>] btrfs_check_data_free_space+0x111/0x1f0 [btrfs]
[25230.404421]  [<ffffffffa04c53bd>] __btrfs_buffered_write+0x160/0x48d [btrfs]
[25230.404425]  [<ffffffff811a9268>] ? cap_inode_need_killpriv+0x2d/0x37
[25230.404429]  [<ffffffff810f6501>] ? get_page+0x1a/0x2b
[25230.404441]  [<ffffffffa04c7c95>] btrfs_file_write_iter+0x321/0x42f [btrfs]
[25230.404443]  [<ffffffff8110f5d9>] ? handle_mm_fault+0x7f3/0x846
[25230.404446]  [<ffffffff813e98c5>] ? mutex_unlock+0x16/0x18
[25230.404449]  [<ffffffff81138d68>] new_sync_write+0x7c/0xa0
[25230.404450]  [<ffffffff81139401>] vfs_write+0xb0/0x112
[25230.404452]  [<ffffffff81139c9d>] SyS_pwrite64+0x66/0x84
[25230.404454]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
[25230.404455] ---[ end trace 5aa5684fdf47ab38 ]---
[25230.404458] BTRFS warning (device sdc): btrfs_create_pending_block_groups:9228: Aborting unused transaction(No space left).
[25288.084814] BTRFS: error (device sdc) in btrfs_free_chunk:2509: errno=-2 No such entry (Failed lookup while freeing chunk.)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana 04216820fe Btrfs: fix race between fs trimming and block group remove/allocation
Our fs trim operation, which is completely transactionless (doesn't start
or joins an existing transaction) consists of visiting all block groups
and then for each one to iterate its free space entries and perform a
discard operation against the space range represented by the free space
entries. However before performing a discard, the corresponding free space
entry is removed from the free space rbtree, and when the discard completes
it is added back to the free space rbtree.

If a block group remove operation happens while the discard is ongoing (or
before it starts and after a free space entry is hidden), we end up not
waiting for the discard to complete, remove the extent map that maps
logical address to physical addresses and the corresponding chunk metadata
from the the chunk and device trees. After that and before the discard
completes, the current running transaction can finish and a new one start,
allowing for new block groups that map to the same physical addresses to
be allocated and written to.

So fix this by keeping the extent map in memory until the discard completes
so that the same physical addresses aren't reused before it completes.

If the physical locations that are under a discard operation end up being
used for a new metadata block group for example, and dirty metadata extents
are written before the discard finishes (the VM might call writepages() of
our btree inode's i_mapping for example, or an fsync log commit happens) we
end up overwriting metadata with zeroes, which leads to errors from fsck
like the following:

        checking extents
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        owner ref check failed [833912832 16384]
        Errors found in extent allocation tree or chunk allocation
        checking free space cache
        checking fs roots
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        Check tree block failed, want=833912832, have=0
        read block failed check_tree_block
        root 5 root dir 256 error
        root 5 inode 260 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 262 errors 2001, no inode item, link count wrong
                unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
        root 5 inode 263 errors 2001, no inode item, link count wrong
        (...)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:35:09 -08:00
Filipe Manana ae0ab003f2 Btrfs: fix freeing used extents after removing empty block group
There's a race between adding a block group to the list of the unused
block groups and removing an unused block group (cleaner kthread) that
leads to freeing extents that are in use or a crash during transaction
commmit. Basically the cleaner kthread, when executing
btrfs_delete_unused_bgs(), might catch the newly added block group to
the list fs_info->unused_bgs and clear the range representing the whole
group from fs_info->freed_extents[] before the task that added the block
group to the list (running update_block_group()) marked the last freed
extent as dirty in fs_info->freed_extents (pinned_extents).

That is:

     CPU 1                                CPU 2

                                  btrfs_delete_unused_bgs()
update_block_group()
   add block group to
   fs_info->unused_bgs
                                    got block group from the list
                                    clear_extent_bits for the whole
                                    block group range in freed_extents[]
   set_extent_dirty for the
   range covering the freed
   extent in freed_extents[]
   (fs_info->pinned_extents)

                                  block group deleted, and a new block
                                  group with the same logical address is
                                  created

                                  reserve space from the new block group
                                  for new data or metadata - the reserved
                                  space overlaps the range specified by
                                  CPU 1 for set_extent_dirty()

                                  commit transaction
                                    find all ranges marked as dirty in
                                    fs_info->pinned_extents, clear them
                                    and add them to the free space cache

Alternatively, if CPU 2 doesn't create a new block group with the same
logical address, we get a crash/BUG_ON at transaction commit when unpining
extent ranges because we can't find a block group for the range marked as
dirty by CPU 1. Sample trace:

[ 2163.426462] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2163.426640] Modules linked in: btrfs xor raid6_pq dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio crc32c_generic libcrc32c dm_mod nfsd auth_rpc
gss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse parport_pc parport i2c_piix4 processor thermal_sys i2ccore evdev button pcspkr microcode serio_raw ext4 crc16 jbd2 mbcache
 sg sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_common ata_generic virtio_scsi floppy ata_piix libata e1000 scsi_mod virtio_pci virtio_ring virtio
[ 2163.428209] CPU: 0 PID: 11858 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[ 2163.428519] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[ 2163.428875] task: ffff88009f2c0650 ti: ffff8801356bc000 task.ti: ffff8801356bc000
[ 2163.429157] RIP: 0010:[<ffffffffa037728e>]  [<ffffffffa037728e>] unpin_extent_range.isra.58+0x62/0x192 [btrfs]
[ 2163.429562] RSP: 0018:ffff8801356bfda8  EFLAGS: 00010246
[ 2163.429802] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2163.429990] RDX: 0000000041bfffff RSI: 0000000001c00000 RDI: ffff880024307080
[ 2163.430042] RBP: ffff8801356bfde8 R08: 0000000000000068 R09: ffff88003734f118
[ 2163.430042] R10: ffff8801356bfcb8 R11: fffffffffffffb69 R12: ffff8800243070d0
[ 2163.430042] R13: 0000000083c04000 R14: ffff8800751b0f00 R15: ffff880024307000
[ 2163.430042] FS:  0000000000000000(0000) GS:ffff88013f400000(0000) knlGS:0000000000000000
[ 2163.430042] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2163.430042] CR2: 00007ff10eb43fc0 CR3: 0000000004cb8000 CR4: 00000000000006f0
[ 2163.430042] Stack:
[ 2163.430042]  ffff8800243070d0 0000000083c08000 0000000083c07fff ffff88012d6bc800
[ 2163.430042]  ffff8800243070d0 ffff8800751b0f18 ffff8800751b0f00 0000000000000000
[ 2163.430042]  ffff8801356bfe18 ffffffffa037a481 0000000083c04000 0000000083c07fff
[ 2163.430042] Call Trace:
[ 2163.430042]  [<ffffffffa037a481>] btrfs_finish_extent_commit+0xac/0xbf [btrfs]
[ 2163.430042]  [<ffffffffa038c06d>] btrfs_commit_transaction+0x6ee/0x882 [btrfs]
[ 2163.430042]  [<ffffffffa03881f1>] transaction_kthread+0xf2/0x1a4 [btrfs]
[ 2163.430042]  [<ffffffffa03880ff>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[ 2163.430042]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[ 2163.430042]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[ 2163.430042]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

So fix this by making update_block_group() first set the range as dirty
in pinned_extents before adding the block group to the unused_bgs list.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana 4f69cb987e Btrfs: fix crash caused by block group removal
If we remove a block group (because it became empty), we might have left
a caching_ctl structure in fs_info->caching_block_groups that points to
the block group and is accessed at transaction commit time. This results
in accessing an invalid or incorrect block group. This issue became visible
after Josef's patch "Btrfs: remove empty block groups automatically".

So if the block group is removed make sure we don't leave a dangling
caching_ctl in caching_block_groups.

Sample crash trace:

[58380.439449] BUG: unable to handle kernel paging request at ffff8801446eaeb8
[58380.439707] IP: [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.440879] PGD 1acb067 PUD 23f5ff067 PMD 23f5db067 PTE 80000001446ea060
[58380.441220] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[58380.441486] Modules linked in: btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc loop psmouse processor i2c_piix4 parport_pc parport pcspkr serio_raw evdev i2ccore thermal_sys microcode button ext4 crc16 jbd2 mbcache sr_mod cdrom ata_generic sg sd_mod crc_t10dif crct10dif_generic crct10dif_common virtio_scsi floppy ata_piix e1000 libata virtio_pci scsi_mod virtio_ring virtio [last unloaded: btrfs]
[58380.443238] CPU: 3 PID: 25728 Comm: btrfs-transacti Tainted: G        W      3.17.0-rc5-btrfs-next-1+ #1
[58380.443238] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[58380.443238] task: ffff88013ac82090 ti: ffff88013896c000 task.ti: ffff88013896c000
[58380.443238] RIP: 0010:[<ffffffffa03f6d05>]  [<ffffffffa03f6d05>] block_group_cache_done.isra.21+0xc/0x1c [btrfs]
[58380.443238] RSP: 0018:ffff88013896fdd8  EFLAGS: 00010283
[58380.443238] RAX: ffff880222cae850 RBX: ffff880119ba74c0 RCX: 0000000000000000
[58380.443238] RDX: 0000000000000000 RSI: ffff880185e16800 RDI: ffff8801446eaeb8
[58380.443238] RBP: ffff88013896fdd8 R08: ffff8801a9ca9fa8 R09: ffff88013896fc60
[58380.443238] R10: ffff88013896fd28 R11: 0000000000000000 R12: ffff880222cae000
[58380.443238] R13: ffff880222cae850 R14: ffff880222cae6b0 R15: ffff8801446eae00
[58380.443238] FS:  0000000000000000(0000) GS:ffff88023ed80000(0000) knlGS:0000000000000000
[58380.443238] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[58380.443238] CR2: ffff8801446eaeb8 CR3: 0000000001811000 CR4: 00000000000006e0
[58380.443238] Stack:
[58380.443238]  ffff88013896fe18 ffffffffa03fe2d5 ffff880222cae850 ffff880185e16800
[58380.443238]  ffff88000dc41c20 0000000000000000 ffff8801a9ca9f00 0000000000000000
[58380.443238]  ffff88013896fe80 ffffffffa040fbcf ffff88018b0dcdb0 ffff88013ac82090
[58380.443238] Call Trace:
[58380.443238]  [<ffffffffa03fe2d5>] btrfs_prepare_extent_commit+0x5a/0xd7 [btrfs]
[58380.443238]  [<ffffffffa040fbcf>] btrfs_commit_transaction+0x45c/0x882 [btrfs]
[58380.443238]  [<ffffffffa040c058>] transaction_kthread+0xf2/0x1a4 [btrfs]
[58380.443238]  [<ffffffffa040bf66>] ? btrfs_cleanup_transaction+0x3d8/0x3d8 [btrfs]
[58380.443238]  [<ffffffff8105966b>] kthread+0xb7/0xbf
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67
[58380.443238]  [<ffffffff813ebeac>] ret_from_fork+0x7c/0xb0
[58380.443238]  [<ffffffff810595b4>] ? __kthread_parkme+0x67/0x67

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana 292cbd51ec Btrfs: fix invalid block group rbtree access after bg is removed
If we grab a block group, for example in btrfs_trim_fs(), we will be holding
a reference on it but the block group can be removed after we got it (via
btrfs_remove_block_group), which means it will no longer be part of the
rbtree.

However, btrfs_remove_block_group() was only calling rb_erase() which leaves
the block group's rb_node left and right child pointers with the same content
they had before calling rb_erase. This was dangerous because a call to
next_block_group() would access the node's left and right child pointers (via
rb_next), which can be no longer valid.

Fix this by clearing a block group's node after removing it from the tree,
and have next_block_group() do a tree search to get the next block group
instead of using rb_next() if our block group was removed.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-12-02 18:19:17 -08:00
Filipe Manana 9ea24bbe17 Btrfs: fix snapshot inconsistency after a file write followed by truncate
If right after starting the snapshot creation ioctl we perform a write against a
file followed by a truncate, with both operations increasing the file's size, we
can get a snapshot tree that reflects a state of the source subvolume's tree where
the file truncation happened but the write operation didn't. This leaves a gap
between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

    $ mkfs.btrfs -f /dev/vdd
    $ mount /dev/vdd /mnt
    $ xfs_io -f \
          -c "pwrite -S 0xaa -b 32K 0 32K" \
          -c "fsync" \
          -c "pwrite -S 0xbb -b 32770 16K 32770" \
          -c "truncate 90123" \
          /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often
can get the following inode items in the snapshot's btree:

        item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
        item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                inode ref index 282 namelen 10 name: foobar
        item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                extent data disk byte 1104855040 nr 32768
                extent data offset 0 nr 32768 ram 32768
                extent compression 0
        item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                extent data disk byte 0 nr 0
                extent data offset 0 nr 40960 ram 40960
                extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
for which there's no file extent item covering it. This is because the file write
and file truncate operations happened both right after the snapshot creation ioctl
called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
ordered extent that matches the write and, in btrfs_setsize(), we were able to call
btrfs_cont_expand() before being able to commit the current transaction in the
snapshot creation ioctl. So this made it possibe to insert the hole file extent
item in the source subvolume (which represents the region added by the truncate)
right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

    "root 331 inode 257 errors 100, file extent discount"

>From a user perspective, the expectation when a snapshot is created while those
file operations are being performed is that the snapshot will have a file that
either:

1) is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But never capture a state where only the first write and the truncation were
captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00
Filipe Manana 758eb51e71 Btrfs: fix freeing used extent after removing empty block group
Due to ignoring errors returned by clear_extent_bits (at the moment only
-ENOMEM is possible), we can end up freeing an extent that is actually in
use (i.e. return the extent to the free space cache).

The sequence of steps that lead to this:

1) Cleaner thread starts execution and calls btrfs_delete_unused_bgs(), with
   the goal of freeing empty block groups;

2) btrfs_delete_unused_bgs() finds an empty block group, joins the current
   transaction (or starts a new one if none is running) and attempts to
   clear the EXTENT_DIRTY bit for the block group's range from freed_extents[0]
   and freed_extents[1] (of which one corresponds to fs_info->pinned_extents);

3) Clearing the EXTENT_DIRTY bit (via clear_extent_bits()) fails with
   -ENOMEM, but such error is ignored and btrfs_delete_unused_bgs() proceeds
   to delete the block group and the respective chunk, while pinned_extents
   remains with that bit set for the whole (or a part of the) range covered
   by the block group;

4) Later while the transaction is still running, the chunk ends up being reused
   for a new block group (maybe for different purpose, data or metadata), and
   extents belonging to the new block group are allocated for file data or btree
   nodes/leafs;

5) The current transaction is committed, meaning that we unpinned one or more
   extents from the new block group (through btrfs_finish_extent_commit() and
   unpin_extent_range()) which are now being used for new file data or new
   metadata (through btrfs_finish_extent_commit() and unpin_extent_range()).
   And unpinning means we returned the extents to the free space cache of the
   new block group, which implies those extents can be used for future allocations
   while they're still in use.

Alternatively, we can hit a BUG_ON() when doing a lookup for a block group's cache
object in unpin_extent_range() if a new block group didn't end up being allocated for
the same chunk (step 4 above).

Fix this by not freeing the block group and chunk if we fail to clear the dirty bit.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-11-25 07:41:23 -08:00