remarkable-linux/fs
Filipe Manana 22ab04e814 Btrfs: fix race between device replace and chunk allocation
While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure that
once we finish processing a device extent, any future writes to extents
from the corresponding block group will get into both the source and
target devices. This left cursor is also used for resuming the device
replace operation at mount time.

However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start dellaloc, wait for all ordered extents
and then commit the current transaction, we might have got new block
groups allocated that are now using a device extent that has an offset
greater then or equals to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.

For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:

  [ extent bg A ] [ hole, unallocated space ] [extent bg B ]
  3Gb             4Gb                         5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

        CPU 1                                            CPU 2

                      <we are at transaction N>

  scrub_enumerate_chunks()
    --> searches the device tree for
        extents belonging to the source
        device using the device tree's
        commit root
    --> 1st iteration finds extent belonging to
        block group A

        --> sets block group A to RO mode
            (btrfs_inc_block_group_ro)

        --> sets cursor left to found_key.offset
            which is 3Gb

        --> scrub_chunk() starts
            copies all allocated extents from
            block group's A stripe at source
            device into target device

                                                           btrfs_alloc_chunk()
                                                             --> allocates device extent
                                                                 in the range [4Gb, 5Gb[
                                                                 from the source device for
                                                                 a new block group C

                                                           extent allocated from block
                                                           group C for a direct IO,
                                                           buffered write or btree node/leaf

                                                           extent is written to, perhaps
                                                           in response to a writepages()
                                                           call from the VM or directly
                                                           through direct IO

                                                           the write is made only against
                                                           the source device and not against
                                                           the target device because the
                                                           extent's offset is in the interval
                                                           [4Gb, 5Gb[ which is larger then
                                                           the value of cursor_left (3Gb)

        --> scrub_chunks() finishes

        --> updates left cursor from 3Gb to
            4Gb

        --> btrfs_dec_block_group_ro() sets
            block group A back to RW mode

                             <we are still at transaction N>

    --> 2nd iteration finds extent belonging to
        block group B - it did not find the new
        extent in the range [4Gb, 5Gb[ for block
        group C because we are using the device
        tree's commit root or even because the
        block group's items are not all yet
        inserted in the respective btrees, that is,
        the block group is still attached to some
        transaction handle's new_bgs list and
        btrfs_create_pending_block_groups() was
        not called yet against that transaction
        handle, so the device extent items were
        not yet inserted into the devices tree

                             <we are still at transaction N>

        --> so we end not copying anything from the newly
            allocated device extent from the source device
            to the target device

So fix this by making __btrfs_map_block() always redirect writes to the
target device as well, independently of the left cursor's value. With
this change the left cursor is now used only for the purpose of tracking
progress and allow a mount operation to resume a device replace.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2016-05-30 12:58:26 +01:00
..
9p mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
adfs fs/adfs/adfs.h: tidy up comments 2016-01-20 17:09:18 -08:00
affs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
afs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
autofs4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-03-19 18:52:29 -07:00
befs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
bfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
btrfs Btrfs: fix race between device replace and chunk allocation 2016-05-30 12:58:26 +01:00
cachefiles mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
ceph libceph: make authorizer destruction independent of ceph_auth_client 2016-04-25 20:54:13 +02:00
cifs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
coda Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2016-01-23 12:24:56 -08:00
configfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
cramfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
crypto ext4/fscrypto: avoid RCU lookup in d_revalidate 2016-04-12 20:01:35 -07:00
debugfs debugfs: Make automount point inodes permanently empty 2016-04-12 15:01:53 -07:00
devpts devpts: more pty driver interface cleanups 2016-04-26 15:47:32 -07:00
dlm mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
ecryptfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
efivarfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
efs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
exofs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
exportfs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ext2 mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
ext4 ext4/fscrypto: avoid RCU lookup in d_revalidate 2016-04-12 20:01:35 -07:00
f2fs fscrypto: don't let data integrity writebacks fail with ENOMEM 2016-04-12 10:25:30 -07:00
fat fat: add config option to set UTF-8 mount option by default 2016-03-22 15:36:02 -07:00
freevxfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
fscache mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
fuse fuse: Fix return value from fuse_get_user_pages() 2016-04-25 13:01:04 +02:00
gfs2 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
hfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
hfsplus mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
hostfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
hpfs hpfs: don't truncate the file when delete fails 2016-02-27 19:15:51 -05:00
hugetlbfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
isofs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
jbd2 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
jffs2 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
jfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
kernfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
lockd lockd: constify nlmsvc_binding structure 2016-01-07 10:10:50 -05:00
logfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
minix mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
ncpfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
nfs These changes contains a fix for overlayfs interacting with some 2016-04-07 17:22:20 -07:00
nfs_common
nfsd Various bugfixes, a RDMA update from Chuck Lever, and support for a new 2016-03-24 19:50:32 -07:00
nilfs2 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
nls
notify fsnotify: turn fsnotify reaper thread into a workqueue job 2016-02-18 16:23:24 -08:00
ntfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
ocfs2 ocfs2/dlm: return zero if deref_done message is successfully handled 2016-04-28 19:34:04 -07:00
omfs
openpromfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
orangefs Orangefs: cleanups and a strncpy vulnerability fix. 2016-04-09 10:33:58 -07:00
overlayfs fs: add file_dentry() 2016-03-26 16:14:37 -04:00
proc proc: prevent accessing /proc/<PID>/environ until it's ready 2016-05-05 17:38:53 -07:00
pstore mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
qnx4 kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
qnx6 mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
quota quota: Handle Q_GETNEXTQUOTA when quota is disabled 2016-03-29 17:20:10 +02:00
ramfs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
reiserfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
romfs kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
squashfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
sysfs
sysv mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
tracefs wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
ubifs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
udf udf: Fix conversion of 'dstring' fields to UTF8 2016-04-25 15:18:50 +02:00
ufs mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
xfs mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
aio.c
anon_inodes.c
attr.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
bad_inode.c fs/bad_inode.c: is_bad_inode can be boolean 2015-12-06 21:17:14 -05:00
binfmt_aout.c
binfmt_elf.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
binfmt_elf_fdpic.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
binfmt_script.c
block_dev.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
buffer.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
char_dev.c
compat.c saner calling conventions for copy_mount_options() 2016-01-04 10:28:32 -05:00
compat_binfmt_elf.c
compat_ioctl.c Merge 4.5-rc4 into char-misc-next 2016-02-14 14:25:59 -08:00
coredump.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-03-22 15:36:02 -07:00
dax.c mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
dcache.c fs: add file_dentry() 2016-03-26 16:14:37 -04:00
dcookies.c
direct-io.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
drop_caches.c
eventfd.c eventfd: document lockless access in eventfd_poll 2016-03-22 15:36:02 -07:00
eventpoll.c timer: convert timer_slack_ns from unsigned long to u64 2016-03-17 15:09:34 -07:00
exec.c Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2016-03-20 19:08:56 -07:00
fcntl.c fcntl: allow to set O_DIRECT flag on pipe 2016-01-09 02:55:37 -05:00
fhandle.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-03-22 15:36:02 -07:00
file.c kmemcg: account certain kmem allocations to memcg 2016-01-14 16:00:49 -08:00
file_table.c
filesystems.c find_filesystem(): simplify comparison 2016-01-19 12:02:23 -05:00
fs-writeback.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
fs_pin.c
fs_struct.c
inode.c writeback: initialize inode members that track writeback history 2016-02-16 14:57:21 -07:00
internal.h Merge branch 'for-linus' into work.misc 2016-01-08 21:20:11 -05:00
ioctl.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
Kconfig Merge tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux 2016-03-26 12:59:04 -07:00
Kconfig.binfmt
libfs.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
locks.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
Makefile Merge tag 'ofs-pull-tag-1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux 2016-03-26 12:59:04 -07:00
mbcache.c mbcache: add reusable flag to cache entries 2016-02-22 22:44:04 -05:00
mount.h
mpage.c mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage 2016-04-04 10:41:08 -07:00
namei.c fix the braino in "namei: massage lookup_slow() to be usable by lookup_one_len_unlocked()" 2016-03-31 00:23:05 -04:00
namespace.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
no-block.c
nsfs.c
open.c fs/coredump: prevent fsuid=0 dumps into user-controlled directories 2016-03-22 15:36:02 -07:00
pipe.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
pnode.c propogate_mnt: Handle the first propogated copy being a slave 2016-05-05 09:54:45 -05:00
pnode.h
posix_acl.c xattr handlers: Simplify list operation 2015-12-13 19:46:12 -05:00
proc_namespace.c vfs: show_vfsstat: do not ignore errors from show_devname method 2016-03-16 13:09:08 -04:00
read_write.c Merge branches 'work.lookups', 'work.misc' and 'work.preadv2' into for-next 2016-03-18 16:07:38 -04:00
readdir.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
select.c timer: convert timer_slack_ns from unsigned long to u64 2016-03-17 15:09:34 -07:00
seq_file.c Make file credentials available to the seqfile interfaces 2016-04-14 12:56:09 -07:00
signalfd.c
splice.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
stack.c
stat.c fs/stat.c: drop the last new_valid_dev check 2016-01-16 11:17:23 -08:00
statfs.c
super.c writeback: flush inode cgroup wb switches instead of pinning super_block 2016-03-03 14:42:50 -07:00
sync.c mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros 2016-04-04 10:41:08 -07:00
timerfd.c timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper 2016-01-17 11:13:55 +01:00
userfaultfd.c userfaultfd: don't block on the last VM updates at exit time 2016-03-02 09:03:18 -08:00
utimes.c wrappers for ->i_mutex access 2016-01-22 18:04:28 -05:00
xattr.c xattr handlers: plug a lock leak in simple_xattr_list 2016-02-20 00:15:51 -05:00