alistair23-linux/fs
Filipe Manana e2127cf008 Btrfs: make defrag not fragment files when using prealloc extents
When using prealloc extents, a file defragment operation may actually
fragment the file and increase the amount of data space used by the file.
This change fixes that behaviour.

Example:

$ mkfs.btrfs -f /dev/sdb3
$ mount /dev/sdb3 /mnt
$ cd /mnt
$ xfs_io -f -c 'falloc 0 1048576' foobar && sync
$ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar
$ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar
$ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync

Before defragmenting the file:

$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.25MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 0 nr 4096
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 4096 nr 102400 ram 1048576
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 106496 nr 90112
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 196608 nr 106496 ram 1048576
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 897024 nr 106496 ram 1048576
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 1003520 nr 45056
(...)

Now defragmenting the file results in more data space used than before:

$ btrfs filesystem defragment -f foobar && sync
$ btrfs filesystem df /mnt
Data, single: total=8.00MiB, used=1.55MiB
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00

And the corresponding file extent items are now no longer perfectly sequential
as before, and we're now needlessly using more space from data block groups:

$ btrfs-debug-tree /dev/sdb3
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 0 nr 4096 ram 1048576
		extent compression 0
	item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53
		extent data disk byte 13893632 nr 102400
		extent data offset 0 nr 102400 ram 102400
		extent compression 0
	item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 106496 nr 90112 ram 1048576
		extent compression 0
	item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53
		extent data disk byte 13996032 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53
		prealloc data disk byte 12845056 nr 1048576
		prealloc data offset 303104 nr 593920
	item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53
		extent data disk byte 14102528 nr 106496
		extent data offset 0 nr 106496 ram 106496
		extent compression 0
	item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53
		extent data disk byte 12845056 nr 1048576
		extent data offset 1003520 nr 45056 ram 1048576
		extent compression 0
(...)

With this change, the above example will no longer cause allocation of new data
space nor change the sequentiality of the file extents, that is, defragment will
be effectless, leaving all extent items pointing to the extent starting at disk
byte 12845056.

In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of
400Mb each, initially consisting of a single prealloc extent of 400Mb, having
random writes happening at a low rate, lead to a total of over ~17Gb of data
space used, not far from eventually reaching an ENOSPC state.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-03-10 15:17:17 -04:00
..
9p consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
adfs adfs: delayed freeing of sbi 2013-10-24 23:43:27 -04:00
affs remove obsolete references to powertweak 2013-11-27 20:34:32 -08:00
afs Merge branch 'fscache' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs into linux-next 2013-10-28 19:36:46 -04:00
autofs4 autofs4: make freeing sbi rcu-delayed 2013-10-24 23:43:27 -04:00
befs befs: split symlink iops in two - for short and long symlinks resp. 2013-10-24 23:34:50 -04:00
bfs
btrfs Btrfs: make defrag not fragment files when using prealloc extents 2014-03-10 15:17:17 -04:00
cachefiles Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
ceph ceph: Avoid data inconsistency due to d-cache aliasing in readpage() 2013-12-13 09:11:38 -08:00
cifs cifs: set FILE_CREATED 2013-12-27 15:14:45 -06:00
coda coda_revalidate_inode(): switch to passing inode... 2013-11-09 00:16:21 -05:00
configfs configfs: fix race between dentry put and lookup 2013-11-21 16:42:27 -08:00
cramfs cramfs: mark as obsolete 2013-11-13 12:09:12 +09:00
debugfs debugfs: use list_next_entry() in debugfs_remove_recursive() 2013-11-13 12:09:24 +09:00
devpts devpts: plug the memory leak in kill_sb 2013-11-13 12:09:36 +09:00
dlm genetlink: only pass array to genl_register_family_with_ops() 2013-11-19 16:39:05 -05:00
ecryptfs Quiet static checkers by removing unneeded conditionals 2013-11-22 10:58:14 -08:00
efivarfs consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
efs
exofs
exportfs exportfs: fix quadratic behavior in filehandle lookup 2013-11-09 00:16:38 -05:00
ext2 ext2: Fix oops in ext2_get_block() called from ext2_quota_write() 2013-12-04 12:26:51 +01:00
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-11-13 15:25:47 +09:00
ext4 ext4: fix bigalloc regression 2014-01-06 14:00:23 -05:00
f2fs f2fs: issue more large discard command 2013-11-11 09:36:32 +09:00
fat fat: rcu-delay unloading nls and freeing sbi 2013-10-24 23:43:28 -04:00
freevxfs
fscache Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block 2013-11-14 12:08:14 +09:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
gfs2 GFS2: Fix unsafe dereference in dump_holder() 2014-01-02 12:18:04 +00:00
hfs fs/hfs/btree.h: remove duplicate defines 2013-11-13 12:09:32 +09:00
hfsplus block: submit_bio_wait() conversions 2013-11-24 16:33:41 -07:00
hostfs consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
hpfs locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
hppfs
hugetlbfs
isofs isofs: don't pass dentry to isofs_hash{i,}_common() 2013-10-24 23:34:59 -04:00
jbd jbd: Revert "jbd: remove dependency on __GFP_NOFAIL" 2013-10-31 20:37:15 +01:00
jbd2 jbd2: rename obsoleted msg JBD->JBD2 2013-12-08 21:14:59 -05:00
jffs2 jffs2: do not support the MLC nand 2013-10-27 16:27:07 -07:00
jfs Just a patch to fix an oops in an error path. 2013-10-22 09:01:11 +01:00
lockd
logfs block: submit_bio_wait() conversions 2013-11-24 16:33:41 -07:00
minix fs/minix: Drop dependency on H8300 2013-09-16 18:20:25 -07:00
ncpfs ncpfs: rcu-delay unload_nls() and freeing ncp_server 2013-10-24 23:43:28 -04:00
nfs NFS client bugfixes 2013-12-05 13:05:48 -08:00
nfs_common
nfsd nfsd: when reusing an existing repcache entry, unhash it first 2013-12-10 20:34:44 -05:00
nilfs2 nilfs2: fix segctor bug that causes file system corruption 2014-01-15 14:19:42 +07:00
nls
notify
ntfs iget/iget5: don't bother with ->i_lock until we find a match 2013-11-09 00:16:31 -05:00
ocfs2 tree-wide: use reinit_completion instead of INIT_COMPLETION 2013-11-15 09:32:21 +09:00
omfs
openpromfs
proc procfs: also fix proc_reg_get_unmapped_area() for !MMU case 2013-12-12 18:19:26 -08:00
pstore pstore: Don't allow high traffic options on fragile devices 2013-12-20 13:12:01 -08:00
qnx4 qnx4: i_sb is never NULL 2013-11-09 00:16:32 -05:00
qnx6
quota genetlink: make multicast groups const, prevent abuse 2013-11-19 16:39:06 -05:00
ramfs
reiserfs reiserfs: fix race with flush_used_journal_lists and flush_journal_list 2013-09-24 11:24:21 +02:00
romfs
squashfs Squashfs: fix failure to unlock pages on decompress error 2013-11-24 01:02:50 +00:00
sysfs sysfs: give different locking key to regular and bin files 2013-12-07 21:22:00 -08:00
sysv sysv: Add forgotten superblock lock init for v7 fs 2013-09-29 22:02:02 -04:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
udf udf: fix for pathetic mount times in case of invalid file system 2013-10-18 22:39:07 +02:00
ufs
xfs xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() 2014-01-10 12:39:38 -06:00
aio.c Merge git://git.kvack.org/~bcrl/aio-next 2013-12-22 11:03:49 -08:00
anon_inodes.c ... and kill anon_inode_getfile_private() 2013-11-09 00:16:28 -05:00
attr.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
bad_inode.c
binfmt_aout.c dump_skip(): dump_seek() replacement taking coredump_params 2013-11-09 00:16:26 -05:00
binfmt_elf.c elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo) 2013-11-09 00:16:30 -05:00
binfmt_elf_fdpic.c elf{,_fdpic} coredump: get rid of pointless if (siginfo->si_signo) 2013-11-09 00:16:30 -05:00
binfmt_em86.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c Merge branch 'for-3.12/core' of git://git.kernel.dk/linux-block 2013-09-22 15:00:11 -07:00
bio.c bio: fix argument of __bio_add_page() for max_sectors > 0xffff 2013-11-18 12:31:27 -07:00
block_dev.c a trivial writeback fix 2013-09-13 23:06:40 -04:00
buffer.c fs: buffer: move allocation failure loop into the allocator 2013-10-16 21:35:53 -07:00
char_dev.c Merge branch 'for-3.13/core' of git://git.kernel.dk/linux-block 2013-11-14 12:08:14 +09:00
compat.c
compat_binfmt_elf.c
compat_ioctl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
coredump.c dump_emit(): use __kernel_write(), not vfs_write() 2013-11-15 22:04:09 -05:00
coredump.h
dcache.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-01-17 17:29:36 -08:00
dcookies.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c epoll: do not take the nested ep->mtx on EPOLL_CTL_DEL 2014-01-02 14:40:30 -08:00
exec.c Merge git://git.infradead.org/users/eparis/audit 2013-11-21 19:18:14 -08:00
fcntl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
fhandle.c
file.c
file_table.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
filesystems.c
fs-writeback.c writeback: Fix data corruption on NFS 2013-12-14 04:21:26 +08:00
fs_struct.c seqcount: Add lockdep functionality to seqcount/seqlock structures 2013-11-06 12:40:26 +01:00
generic_acl.c
inode.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
internal.h get rid of s_files and files_lock 2013-11-09 00:16:20 -05:00
ioctl.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
ioprio.c
Kconfig
Kconfig.binfmt
libfs.c consolidate simple ->d_delete() instances 2013-11-15 22:04:17 -05:00
locks.c locks: missing unlock on error in generic_add_lease() 2013-11-13 07:30:53 -05:00
Makefile
mbcache.c
mount.h RCU'd vfsmounts 2013-11-09 00:16:19 -05:00
mpage.c
namei.c dcache: allow word-at-a-time name hashing with big-endian CPUs 2013-12-12 10:39:01 -08:00
namespace.c vfs: Fix a regression in mounting proc 2013-11-26 20:54:52 -08:00
no-block.c
open.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
pipe.c vfs: fix subtle use-after-free of pipe_inode_info 2013-12-02 09:44:51 -08:00
pnode.c split __lookup_mnt() in two functions 2013-10-24 23:35:00 -04:00
pnode.h
posix_acl.c
proc_namespace.c don't bother with vfsmount_lock in mounts_poll() 2013-10-24 23:34:59 -04:00
read_write.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
readdir.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
select.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-11-13 15:34:18 +09:00
seq_file.c seq_file: always clear m->count when we free m->buf 2013-11-18 19:07:53 -08:00
signalfd.c
splice.c file->f_op is never NULL... 2013-10-24 23:34:54 -04:00
stack.c
stat.c vfs: split out vfs_getattr_nosec 2013-11-09 00:16:31 -05:00
statfs.c vfs: allow O_PATH file descriptors for fstatfs() 2013-10-12 13:12:31 -07:00
super.c get rid of s_files and files_lock 2013-11-09 00:16:20 -05:00
sync.c Merge branch 'akpm' (patches from Andrew Morton) 2013-11-13 15:45:43 +09:00
timerfd.c
utimes.c locks: break delegations on any attribute modification 2013-11-09 00:16:44 -05:00
xattr.c
xattr_acl.c