Commit graph

615 commits

Author SHA1 Message Date
Josef Bacik b4aff1f874 Btrfs: load the key from the dir item in readdir into a fake dentry
In btrfs we have 2 indexes for inodes.  One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir.  However if you use ls,
it usually stat's each file it gets from readdir.  This is where the second
index comes in, which is based on a hash of the name of the file.  So then the
lookup has to lookup this index, and then lookup the inode.  The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance.  Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata.  Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-01 01:31:42 -04:00
Linus Torvalds 22712200e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
  Btrfs: use the commit_root for reading free_space_inode crcs
  Btrfs: reduce extent_state lock contention for metadata
  Btrfs: remove lockdep magic from btrfs_next_leaf
  Btrfs: make a lockdep class for each root
  Btrfs: switch the btrfs tree locks to reader/writer
  Btrfs: fix deadlock when throttling transactions
  Btrfs: stop using highmem for extent_buffers
  Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
  Btrfs: tag pages for writeback in sync
  Btrfs: fix enospc problems with delalloc
  Btrfs: don't flush delalloc arbitrarily
  Btrfs: use find_or_create_page instead of grab_cache_page
  Btrfs: use a worker thread to do caching
  Btrfs: fix how we merge extent states and deal with cached states
  Btrfs: use the normal checksumming infrastructure for free space cache
  Btrfs: serialize flushers in reserve_metadata_bytes
  Btrfs: do transaction space reservation before joining the transaction
  Btrfs: try to only do one btrfs_search_slot in do_setxattr
2011-07-27 16:43:52 -07:00
Chris Mason ff95acb673 Merge branch 'integration' into for-linus 2011-07-27 16:18:13 -04:00
Chris Mason 2cf8572dac Btrfs: use the commit_root for reading free_space_inode crcs
Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.

This commit fixes things by using the commit_root to read the crcs.  This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:48 -04:00
Chris Mason a65917156e Btrfs: stop using highmem for extent_buffers
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.

The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.

This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:45 -04:00
Josef Bacik 9e0baf60de Btrfs: fix enospc problems with delalloc
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea.  Consider this where we have 1
outstanding extent and 1 reserved extent

Reserver				Releaser
					atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
					atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
					free reserved space for 1 extent

Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:44 -04:00
Josef Bacik a94733d0bc Btrfs: use find_or_create_page instead of grab_cache_page
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE.  So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-07-27 12:46:43 -04:00
Al Viro 569254b0cc btrfs: S_ISREG(mode) is not mode & S_IFREG...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-26 13:05:05 -04:00
Christoph Hellwig 4e34e719e4 fs: take the ACL checks to common code
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss.  This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-25 14:30:23 -04:00
Al Viro 10d9f309d8 get rid of useless dget_parent() in btrfs rename() and link()
->d_parent is locked and stable there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:48:00 -04:00
Al Viro a9049376ee make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)
... and simplify the living hell out of callers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:26 -04:00
Al Viro 0ee5dc676a btrfs: kill magical embedded struct superblock
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:44:20 -04:00
Al Viro 10556cb21a ->permission() sanitizing: don't pass flags to ->permission()
not used by the instances anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:24 -04:00
Al Viro 2830ba7f34 ->permission() sanitizing: don't pass flags to generic_permission()
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:22 -04:00
Al Viro 178ea73521 kill check_acl callback of generic_permission()
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 01:43:16 -04:00
Mark Fasheh 1748f843a0 btrfs: Don't BUG_ON alloc_path errors in btrfs_read_locked_inode
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Mark Fasheh 0eb0e19cde btrfs: Don't BUG_ON alloc_path errors in btrfs_truncate_inode_items
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Mark Fasheh d8926bb3ba btrfs: don't BUG_ON btrfs_alloc_path() errors
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:

 - Callers of the function catch errors from it already so bubbling the
   error up will be handled.
 - Callers of the function might BUG_ON any nonzero return code in which
   case there is no behavior changed (but we still got to remove a BUG_ON)

The following functions were updated:

btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:44 -07:00
Linus Torvalds 1acc9309eb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  btrfs: fix oops when doing space balance
  Btrfs: don't panic if we get an error while balancing V2
  btrfs: add missing options displayed in mount output
2011-07-08 23:25:45 -07:00
Miao Xie 149e2d76b4 btrfs: fix oops when doing space balance
We need to make sure the data relocation inode doesn't go through
the delayed metadata updates, otherwise we get an oops during balance:

kernel BUG at fs/btrfs/relocation.c:4303!
[SNIP]
Call Trace:
 [<ffffffffa03143fd>] ? update_ref_for_cow+0x22d/0x330 [btrfs]
 [<ffffffffa0314951>] __btrfs_cow_block+0x451/0x5e0 [btrfs]
 [<ffffffffa031355d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
 [<ffffffffa0314beb>] btrfs_cow_block+0x10b/0x240 [btrfs]
 [<ffffffffa031acae>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
 [<ffffffffa032d8af>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
 [<ffffffff8147bf0e>] ? mutex_lock+0x1e/0x50
 [<ffffffffa0380cf1>] btrfs_update_delayed_inode+0x71/0x160 [btrfs]
 [<ffffffffa037ff27>] ? __btrfs_release_delayed_node+0x67/0x190 [btrfs]
 [<ffffffffa0381cf8>] btrfs_run_delayed_items+0xe8/0x120 [btrfs]
 [<ffffffffa03365e0>] btrfs_commit_transaction+0x250/0x850 [btrfs]
 [<ffffffff810f91d9>] ? find_get_pages+0x39/0x130
 [<ffffffffa0336cd5>] ? join_transaction+0x25/0x250 [btrfs]
 [<ffffffff81081de0>] ? wake_up_bit+0x40/0x40
 [<ffffffffa03785fa>] prepare_to_relocate+0xda/0xf0 [btrfs]
 [<ffffffffa037f2bb>] relocate_block_group+0x4b/0x620 [btrfs]
 [<ffffffffa0334cf5>] ? btrfs_clean_old_snapshots+0x35/0x150 [btrfs]
 [<ffffffffa037fa43>] btrfs_relocate_block_group+0x1b3/0x2e0 [btrfs]
 [<ffffffffa0368ec0>] ? btrfs_tree_unlock+0x50/0x50 [btrfs]
 [<ffffffffa035e39b>] btrfs_relocate_chunk+0x8b/0x670 [btrfs]
 [<ffffffffa031303d>] ? btrfs_set_path_blocking+0x3d/0x50 [btrfs]
 [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
 [<ffffffffa031bea1>] ? btrfs_previous_item+0xb1/0x150 [btrfs]
 [<ffffffffa03577d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
 [<ffffffffa035f5aa>] btrfs_balance+0x21a/0x2b0 [btrfs]
 [<ffffffffa0368898>] btrfs_ioctl+0x798/0xd20 [btrfs]
 [<ffffffff8111e358>] ? handle_mm_fault+0x148/0x270
 [<ffffffff814809e8>] ? do_page_fault+0x1d8/0x4b0
 [<ffffffff81160d6a>] do_vfs_ioctl+0x9a/0x540
 [<ffffffff811612b1>] sys_ioctl+0xa1/0xb0
 [<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP  [<ffffffffa037c1cc>] btrfs_reloc_cow_block+0x22c/0x270 [btrfs]

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-06 18:51:53 -04:00
Linus Torvalds af4087e0e6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  btrfs: fix inconsonant inode information
  Btrfs: make sure to update total_bitmaps when freeing cache V3
  Btrfs: fix type mismatch in find_free_extent()
  Btrfs: make sure to record the transid in new inodes
2011-06-27 13:32:14 -07:00
Miao Xie 2f7e33d432 btrfs: fix inconsonant inode information
When iputting the inode, We may leave the delayed nodes if they have some
delayed items that have not been dealt with. So when the inode is read again,
we must look up the relative delayed node, and use the information in it to
initialize the inode. Or we will get inconsonant inode information, it may
cause that the same directory index number is allocated again, and hit the
following oops:

[ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
[ 5447.569766] ------------[ cut here ]------------
[ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
[SNIP]
[ 5447.790721] Call Trace:
[ 5447.793191]  [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
[ 5447.800156]  [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
[ 5447.806517]  [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
[ 5447.812876]  [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
[ 5447.818961]  [<ffffffff8111f840>] vfs_create+0x72/0x92
[ 5447.824090]  [<ffffffff8111fa8c>] do_last+0x22c/0x40b
[ 5447.829133]  [<ffffffff8112076a>] path_openat+0xc0/0x2ef
[ 5447.834438]  [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
[ 5447.841216]  [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
[ 5447.847846]  [<ffffffff81121a79>] do_filp_open+0x3d/0x87
[ 5447.853156]  [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
[ 5447.859072]  [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
[ 5447.864636]  [<ffffffff8111f179>] ? do_getname+0x14b/0x173
[ 5447.870112]  [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
[ 5447.875682]  [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
[ 5447.880882]  [<ffffffff81112d39>] do_sys_open+0x69/0xae
[ 5447.886153]  [<ffffffff81112db1>] sys_open+0x20/0x22
[ 5447.891114]  [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b

Fix it by reusing the old delayed node.

Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-27 11:34:27 -04:00
Chris Mason 1973f0faeb Btrfs: make sure to record the transid in new inodes
When we create a new inode, we aren't filling in the
field that records the transaction that last changed this
inode.

If we then go to fsync that inode, it will be skipped because the field
isn't filled in.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-24 13:13:29 -04:00
Linus Torvalds 90a800de0a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: avoid delayed metadata items during commits
  btrfs: fix uninitialized return value
  btrfs: fix wrong reservation when doing delayed inode operations
  btrfs: Remove unused sysfs code
  btrfs: fix dereference of ERR_PTR value
  Btrfs: fix relocation races
  Btrfs: set no_trans_join after trying to expand the transaction
  Btrfs: protect the pending_snapshots list with trans_lock
  Btrfs: fix path leakage on subvol deletion
  Btrfs: drop the delalloc_bytes check in shrink_delalloc
  Btrfs: check the return value from set_anon_super
2011-06-20 08:58:53 -07:00
Josef Bacik 71d7aed014 Btrfs: fix path leakage on subvol deletion
The delayed ref patch accidently removed the btrfs_free_path in
btrfs_unlink_subvol, this puts it back and means we don't leak a path.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-06-15 13:24:45 -04:00
Linus Torvalds 3c25fa740e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: use join_transaction in btrfs_evict_inode()
  Btrfs - use %pU to print fsid
  Btrfs: fix extent state leak on failed nodatasum reads
  btrfs: fix unlocked access of delalloc_inodes
  Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
  btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
  Btrfs: clear current->journal_info on async transaction commit
  Btrfs: make sure to recheck for bitmaps in clusters
  btrfs: remove unneeded includes from scrub.c
  btrfs: reinitialize scrub workers
  btrfs: scrub: errors in tree enumeration
  Btrfs: don't map extent buffer if path->skip_locking is set
  Btrfs: unlock the trans lock properly
  Btrfs: don't map extent buffer if path->skip_locking is set
  Btrfs: fix duplicate checking logic
  Btrfs: fix the allocator loop logic
  Btrfs: fix bitmap regression
  Btrfs: don't commit the transaction if we dont have enough pinned bytes
  Btrfs: noinline the cluster searching functions
  Btrfs: cache bitmaps when searching for a cluster
2011-06-12 11:06:36 -07:00
Li Zefan 30b4caf5d7 Btrfs: use join_transaction in btrfs_evict_inode()
The WARN_ON() in start_transaction() was triggered while balancing.

The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-11 08:31:55 -04:00
Jan Schmidt 08d2f347e8 Btrfs: fix extent state leak on failed nodatasum reads
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about

	BUG extent_state: Objects remaining on kmem_cache_close()

on rmmod.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-10 19:00:53 -04:00
Linus Torvalds e6ece70732 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
  btrfs: fix uninitialized variable warning
  btrfs: add helper for fs_info->closing
  Btrfs: add mount -o inode_cache
  btrfs: scrub: add explicit plugging
  btrfs: use btrfs_ino to access inode number
  Btrfs: don't save the inode cache if we are deleting this root
  btrfs: false BUG_ON when degraded
  Btrfs: don't save the inode cache in non-FS roots
  Btrfs: make sure we don't overflow the free space cache crc page
  Btrfs: fix uninit variable in the delayed inode code
  btrfs: scrub: don't reuse bios and pages
  Btrfs: leave spinning on lookup and map the leaf
  Btrfs: check for duplicate entries in the free space cache
  Btrfs: don't try to allocate from a block group that doesn't have enough space
  Btrfs: don't always do readahead
  Btrfs: try not to sleep as much when doing slow caching
  Btrfs: kill BTRFS_I(inode)->block_group
  Btrfs: don't look at the extent buffer level 3 times in a row
  Btrfs: map the node block when looking for readahead targets
  Btrfs: set range_start to the right start in count_range_bits
  ...
2011-06-05 06:17:23 +09:00
David Sterba 7841cb2898 btrfs: add helper for fs_info->closing
wrap checking of filesystem 'closing' flag and fix a few missing memory
barriers.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-06-04 08:11:22 -04:00
Linus Torvalds 36947a7682 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
  Cache xattr security drop check for write v2
  fs: block_page_mkwrite should wait for writeback to finish
  mm: Wait for writeback when grabbing pages to begin a write
  configfs: remove unnecessary dentry_unhash on rmdir, dir rename
  fat: remove unnecessary dentry_unhash on rmdir, dir rename
  hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
  minix: remove unnecessary dentry_unhash on rmdir, dir rename
  fuse: remove unnecessary dentry_unhash on rmdir, dir rename
  coda: remove unnecessary dentry_unhash on rmdir, dir rename
  afs: remove unnecessary dentry_unhash on rmdir, dir rename
  affs: remove unnecessary dentry_unhash on rmdir, dir rename
  9p: remove unnecessary dentry_unhash on rmdir, dir rename
  ncpfs: fix rename over directory with dangling references
  ncpfs: document dentry_unhash usage
  ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
  hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
  hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
  hfs: remove unnecessary dentry_unhash on rmdir, dir rename
  omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
  udf: remove unnecessary dentry_unhash from rmdir, dir rename
  ...
2011-05-28 13:03:41 -07:00
Chris Mason ff5714cca9 Merge branch 'for-chris' of
git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus

Conflicts:
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-28 07:00:39 -04:00
Christoph Hellwig aa38572954 fs: pass exact type of data dirties to ->dirty_inode
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.

This is just the prototype change with no user for it yet.  I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.

Also remove incorrect comments that ->dirty_inode can't block.  That
has been changed a long time ago, and many implementations rely on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-27 07:04:40 -04:00
Chris Mason 4cb5300bc8 Btrfs: add mount -o auto_defrag
This will detect small random writes into files and
queue the up for an auto defrag process.  It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-26 17:52:15 -04:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Julia Lawall b083916638 fs/btrfs: Add missing btrfs_free_path
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@

x = btrfs_alloc_path@p1(...)
...  when != btrfs_free_path(x,...)
     when != if (...) { ... btrfs_free_path(x,...) ...}
     when != x = ra
if(...) { ... when != x = rb
     when forall
     when != btrfs_free_path(x,...)
 \(return <+...x...+>; \| return@p2...; \) }

@script:python@
p1 << r.p1;
p2 << r.p2;
@@

cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:41 -04:00
Tsutomu Itoh 1cd307990d Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Sergei Trofimovich 27160b6b5a btrfs: fix typo 'testeing' -> 'testing'
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:32 -04:00
Josef Bacik d90c732122 Btrfs: leave spinning on lookup and map the leaf
On lookup we only want to read the inode item, so leave the path spinning.  Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:17 -04:00
Josef Bacik 026fd31782 Btrfs: don't always do readahead
Our readahead is sort of sloppy, and really isn't always needed.  For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat.  Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes.  This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:14 -04:00
Josef Bacik d82a6f1d7e Btrfs: kill BTRFS_I(inode)->block_group
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull.  In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable.  So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:12 -04:00
Josef Bacik fcb80c2aff Btrfs: fix how we do space reservation for truncate
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up.  This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen.  This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen.  So in order
to fix this we need to split all of the reservation stuf up.  So with this patch
we have

1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.

2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.

3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.

Hopefully this will make the ceph problem go away.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:03:08 -04:00
Josef Bacik 7a7eaa40a3 Btrfs: take away the num_items argument from btrfs_join_transaction
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :).  So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Josef Bacik 74b2107543 Btrfs: make sure to use the delalloc reserve when filling delalloc
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-05-23 13:00:56 -04:00
Chris Mason 712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Chris Mason 945d8962ce Merge branch 'cleanups' of git://repo.or.cz/linux-2.6/btrfs-unstable into inode_numbers
Conflicts:
	fs/btrfs/extent-tree.c
	fs/btrfs/free-space-cache.c
	fs/btrfs/inode.c
	fs/btrfs/tree-log.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:33:42 -04:00
Chris Mason dcc6d07322 Merge branch 'delayed_inode' into inode_numbers
Conflicts:
	fs/btrfs/inode.c
	fs/btrfs/ioctl.c
	fs/btrfs/transaction.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 07:07:01 -04:00
Miao Xie 16cdcec736 btrfs: implement delayed inode items operation
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
  root's radix tree, and letting btrfs inodes go.

Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
  Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
  Itaru Kitayama.

Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
  inode in time.

Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
  balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason

Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
  which is created for every directory and file, and used to manage the
  delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.

Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.

If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.

Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
  manage the delayed nodes which are created for every file/directory.
  One is used to manage all the delayed nodes that have delayed items. And the
  other is used to manage the delayed nodes which is waiting to be dealt with
  by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
  index which is going to be inserted into b+ tree, and the other is used to
  manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
  to deal with the works of the delayed directory name index items insertion
  and deletion and the delayed inode update.
  When the delayed items is beyond the lower limit, we create works for some
  delayed nodes and insert them into the work queue of the worker, and then
  go back.
  When the delayed items is beyond the upper bound, we create works for all
  the delayed nodes that haven't been dealt with, and insert them into the work
  queue of the worker, and then wait for that the untreated items is below some
  threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
  information into the delayed inserting rb-tree.
  And then we check the number of the delayed items and do delayed items
  balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
  in the inserting rb-tree at first. If we look it up, just drop it. If not,
  add the key of it into the delayed deleting rb-tree.
  Similar to the delayed inserting rb-tree, we also check the number of the
  delayed items and do delayed items balance.
  (The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
  inode into the delayed node. the worker will flush it into the b+ tree after
  dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
  delayed node, By this way, we can cache more delayed items and merge more
  inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
  and the delayed inode update.

I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.

Before applying this patch:
Create files:
        Total files: 50000
        Total time: 1.096108
        Average time: 0.000022
Delete files:
        Total files: 50000
        Total time: 1.510403
        Average time: 0.000030

After applying this patch:
Create files:
        Total files: 50000
        Total time: 0.932899
        Average time: 0.000019
Delete files:
        Total files: 50000
        Total time: 1.215732
        Average time: 0.000024

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3

Many thanks for Kitayama-san's help!

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:30:56 -04:00
Chris Mason 0965537308 Merge branch 'ino-alloc' of git://repo.or.cz/linux-btrfs-devel into inode_numbers
Conflicts:
	fs/btrfs/free-space-cache.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-21 09:27:38 -04:00
David Sterba 7a36ddec10 btrfs: use printk_ratelimited instead of printk_ratelimit
As per printk_ratelimit comment, it should not be used.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-12 18:08:38 +02:00
Arne Jansen a2de733c78 btrfs: scrub
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.

This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.

Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-05-12 14:45:20 +02:00
David Sterba 182608c829 btrfs: remove old unused commented out code
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:10 +02:00
David Sterba f2a97a9dbd btrfs: remove all unused functions
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.

  text    data     bss     dec     hex filename
402081    7464     200  409745   64091 btrfs.ko.base
398620    7144     200  405964   631cc btrfs.ko.remove-all

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-06 12:34:03 +02:00
David Sterba b3b4aa74b5 btrfs: drop unused parameter from btrfs_release_path
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:22 +02:00
David Sterba 172ddd60a6 btrfs: drop gfp parameter from alloc_extent_map
pass GFP_NOFS directly to kmem_cache_alloc

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba a8067e022a btrfs: drop unused parameter from extent_map_tree_init
the GFP flags are not stored anywhere and all allocations are done via
alloc_extent_map(GFP_NOFS).

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba f993c883ad btrfs: drop unused argument from extent_io_tree_init
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:21 +02:00
David Sterba c704005d88 btrfs: unify checking of IS_ERR and null
use IS_ERR_OR_NULL when possible, done by this coccinelle script:

@ match @
identifier id;
@@
(
- BUG_ON(IS_ERR(id) || !id);
+ BUG_ON(IS_ERR_OR_NULL(id));
|
- IS_ERR(id) || !id
+ IS_ERR_OR_NULL(id)
|
- !id || IS_ERR(id)
+ IS_ERR_OR_NULL(id)
)

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:20 +02:00
David Sterba 306e16ce13 btrfs: rename variables clashing with global function names
reported by gcc -Wshadow:
page_index, page_offset, new_inode, dev_name

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-05-02 13:57:19 +02:00
Linus Torvalds 019793b755 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: cleanup error handling in inode.c
  Btrfs: put the right bio if we have an error
  Btrfs: free bitmaps properly when evicting the cache
  Btrfs: Free free_space item properly in btrfs_trim_block_group()
  btrfs: add missing spin_unlock to a rare exit path
  Btrfs: check return value of kmalloc()
  btrfs: fix wrong allocating flag when reading page
  Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
2011-04-26 08:26:58 -07:00
Tsutomu Itoh 7cf96da3ec Btrfs: cleanup error handling in inode.c
The error processing of several places is changed like setting the
error number only at the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:53 -04:00
Josef Bacik 64728bbbf8 Btrfs: put the right bio if we have an error
In btrfs_submit_direct_hook if the first btrfs_map_block fails we need to put
the orig_bio, not bio.

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Tsutomu Itoh 8d413713ca Btrfs: check return value of kmalloc()
The check on the return value of kmalloc() is added to some places.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-25 19:43:52 -04:00
Li Zefan 82d5902d9c Btrfs: Support reading/writing on disk free ino cache
This is similar to block group caching.

We dedicate a special inode in fs tree to save free ino cache.

At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.

To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:11 +08:00
Li Zefan 33345d0152 Btrfs: Always use 64bit inode number
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.

So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.

There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.

Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:09 +08:00
Li Zefan 581bb05094 Btrfs: Cache free inode numbers in memory
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.

This fixes it, and it works similarly to how we cache free space in block
cgroups.

We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.

Because we are searching the commit root, we have to carefully handle the
cross-transaction case.

The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-25 16:46:04 +08:00
Linus Torvalds adff377bb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
  Btrfs: fix free space cache leak
  Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
  Btrfs end_bio_extent_readpage should look for locked bits
  Btrfs: don't force chunk allocation in find_free_extent
  Btrfs: Check validity before setting an acl
  Btrfs: Fix incorrect inode nlink in btrfs_link()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
  Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
  Btrfs: make uncache_state unconditional
  btrfs: using cached extent_state in set/unlock combinations
  Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
  Btrfs: fix subvolume mount by name problem when default mount subvolume is set
  fix user annotation in ioctl.c
  Btrfs: check for duplicate iov_base's when doing dio reads
  btrfs: properly handle overlapping areas in memmove_extent_buffer
  Btrfs: fix memory leaks in btrfs_new_inode()
  Btrfs: check for duplicate iov_base's when doing dio reads
  Btrfs: reuse the extent_map we found when calling btrfs_get_extent
  Btrfs: do not use async submit for small DIO io's
  Btrfs: don't split dio bios if we don't have to
  ...
2011-04-18 12:24:05 -07:00
Miao Xie 3153495d8e Btrfs: Fix incorrect inode nlink in btrfs_link()
Link count of the inode is not decreased if btrfs_set_inode_index()
fails.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Singed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:32 +08:00
Li Zefan b9e03af0bc Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.

This also simplifies how we walk the btree path.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-04-13 14:25:31 +08:00
Chris Mason 874d0d2633 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work into for-linus 2011-04-11 20:46:03 -04:00
Arne Jansen 507903b818 btrfs: using cached extent_state in set/unlock combinations
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:45:36 -04:00
Josef Bacik a1b75f7d96 Btrfs: check for duplicate iov_base's when doing dio reads
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets.  This is what Windows does under qemu.  The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine.  So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Yoshinori Sano 8fb27640d0 Btrfs: fix memory leaks in btrfs_new_inode()
This patch fixes memory leaks in btrfs_new_inode().

Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-11 20:25:06 -04:00
Josef Bacik 93a54bc4c2 Btrfs: check for duplicate iov_base's when doing dio reads
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets.  This is what Windows does under qemu.  The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine.  So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:43 -04:00
Josef Bacik 16d299ac74 Btrfs: reuse the extent_map we found when calling btrfs_get_extent
In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the
range that we are looking for.  If we don't find an extent, btrfs_get_extent
will insert a extent_map for that area and mark it as a hole.  So it does the
job of allocating a new extent map and inserting it into the io tree.  But if
we're creating a new extent we free it up and redo all of that work.  So instead
pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
disk space and set it up properly and bypass the freeing/allocating of a new
extent map and the expensive operation of inserting the thing into the io_tree.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:41 -04:00
Josef Bacik 1ae3993825 Btrfs: do not use async submit for small DIO io's
When looking at our DIO performance Chris said that for small IO's doing the
async submit stuff tends to be more overhead than it's worth.  With this on top
of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
IO's.  Basically if we don't have to split the bio for the map length it's small
enough to be directly submitted, otherwise go back to the async submit.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:39 -04:00
Josef Bacik 02f57c7aed Btrfs: don't split dio bios if we don't have to
We have been unconditionally allocating a new bio and re-adding all pages from
our original bio to the new bio.  This is needed if our original bio is larger
than our stripe size, but if it is smaller than the stripe size then there is no
need to do this.  So check the map length and if we are under that then go ahead
and submit the original bio.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:37 -04:00
Josef Bacik 1ef30be142 Btrfs: do not call btrfs_update_inode in endio if nothing changed
In the DIO code we often don't update the i_disk_size because the i_size isn't
updated until after the DIO is completed, so basically we are allocating a path,
doing a search, and updating the inode item for no reason since nothing changed.
btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:36 -04:00
Josef Bacik 12ddb96cb6 Btrfs: map the inode item when doing fill_inode_item
Instead of calling kmap_atomic for every thing we set in the inode item, map the
entire inode item at the start and unmap it at the end.  This makes a sequential
dd of 400mb O_DIRECT something like 1% faster.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-04-08 13:00:34 -04:00
Linus Torvalds 42933bac11 Merge branch 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6
* 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
  Fix common misspellings
2011-04-07 11:14:49 -07:00
Linus Torvalds 884b8267d5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: don't warn in btrfs_add_orphan
  Btrfs: fix free space cache when there are pinned extents and clusters V2
  Btrfs: Fix uninitialized root flags for subvolumes
  btrfs: clear __GFP_FS flag in the space cache inode
  Btrfs: fix memory leak in start_transaction()
  Btrfs: fix memory leak in btrfs_ioctl_start_sync()
  Btrfs: fix subvol_sem leak in btrfs_rename()
  Btrfs: Fix oops for defrag with compression turned on
  Btrfs: fix /proc/mounts info.
  Btrfs: fix compiler warning in file.c
2011-04-05 12:29:25 -07:00
Josef Bacik c9ddec74aa Btrfs: don't warn in btrfs_add_orphan
When I moved the orphan adding to btrfs_truncate I missed the fact that during
orphan cleanup we just add the orphan items to the orphan list without going
through btrfs_orphan_add, which results in lots of warnings on mount if you have
any orphan items that need to be truncated.  Just remove this warning since it's
ok, this will allow all of the normal space accounting take place.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:20:24 -04:00
Miao Xie adae52b94e btrfs: clear __GFP_FS flag in the space cache inode
the object id of the space cache inode's key is allocated from the relative
root, just like the regular file. So we can't identify space cache inode by
checking the object id of the inode's key, and we have to clear __GFP_FS flag
at the time we look up the space cache inode.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:19:43 -04:00
Johann Lombardi b44c59a80d Btrfs: fix subvol_sem leak in btrfs_rename()
btrfs_rename() does not release the subvol_sem if the transaction failed to start.

Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:19:42 -04:00
Li Zefan fe3f566cd1 Btrfs: Fix oops for defrag with compression turned on
When we defrag a file, whose size can be fit into an inline extent,
with compression enabled, the compress type is set to be
fs_info->compress_type, which is 0 if the btrfs filesystem is mounted
without compress option. This leads to oops.

Reported-by: Daniel Blueman <daniel.blueman@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-04-05 01:19:42 -04:00
Lucas De Marchi 25985edced Fix common misspellings
Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-31 11:26:23 -03:00
Linus Torvalds 212a17ab87 Merge branch 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
  Btrfs: fix __btrfs_map_block on 32 bit machines
  btrfs: fix possible deadlock by clearing __GFP_FS flag
  btrfs: check link counter overflow in link(2)
  btrfs: don't mess with i_nlink of unlocked inode in rename()
  Btrfs: check return value of btrfs_alloc_path()
  Btrfs: fix OOPS of empty filesystem after balance
  Btrfs: fix memory leak of empty filesystem after balance
  Btrfs: fix return value of setflags ioctl
  Btrfs: fix uncheck memory allocations
  btrfs: make inode ref log recovery faster
  Btrfs: add btrfs_trim_fs() to handle FITRIM
  Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
  Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
  Btrfs: make update_reserved_bytes() public
  btrfs: return EXDEV when linking from different subvolumes
  Btrfs: Per file/directory controls for COW and compression
  Btrfs: add datacow flag in inode flag
  btrfs: use GFP_NOFS instead of GFP_KERNEL
  Btrfs: check return value of read_tree_block()
  btrfs: properly access unaligned checksum buffer
  ...

Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
2011-03-28 15:31:05 -07:00
Miao Xie 1561deda68 btrfs: fix possible deadlock by clearing __GFP_FS flag
Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause
deadlock.
  Task1
  open()
    ...
    btrfs_search_slot()
      ...
      btrfs_cow_block()
	...
	alloc_page()
	  wait for reclaiming
					shrink_slab()
					  ...
					  shrink_icache_memory()
					    ...
					    btrfs_evict_inode()
					      ...
					      btrfs_search_slot()

If the path is locked by task1, the deadlock happens.

So the btree's page cache is different with the file's page cache, it can not
allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in
GFP_HIGHUSER_MOVABLE flag.

Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:58 -04:00
Al Viro c055e99eea btrfs: check link counter overflow in link(2)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:56 -04:00
Al Viro 92986796d8 btrfs: don't mess with i_nlink of unlocked inode in rename()
old_inode is not locked; it's not safe to play with its link
count.  Instead of bumping it and calling btrfs_unlink_inode(),
add a variant of the latter that does not do btrfs_drop_nlink()/
btrfs_update_inode(), call it instead of btrfs_inc_nlink()/
btrfs_unlink_inode() and do btrfs_update_inode() ourselves.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:55 -04:00
Tsutomu Itoh c2db1073fd Btrfs: check return value of btrfs_alloc_path()
Adding the check on the return value of btrfs_alloc_path() to several places.
And, some of callers are modified by this change.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:54 -04:00
Yoshinori Sano dac97e516c Btrfs: fix uncheck memory allocations
To make Btrfs code more robust, several return value checks where memory
allocation can fail are introduced. I use BUG_ON where I don't know how
to handle the error properly, which increases the number of using the
notorious BUG_ON, though.

Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:49 -04:00
Mark Fasheh 3ab3564f01 btrfs: return EXDEV when linking from different subvolumes
btrfs_link returns EPERM if a cross-subvolume link is attempted.

However, in this case I believe EXDEV to be the more appropriate value.
>From the link(2) man page:

EXDEV  oldpath and newpath are not on the same mounted file system.  (Linux
       permits a file system to be mounted at multiple points, but link()
       does not work across different mount points, even if the same file
       system is mounted on both.)

This matters because an application may have different behaviors based on
return codes.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:42 -04:00
Liu Bo 75e7cb7fe0 Btrfs: Per file/directory controls for COW and compression
Data compression and data cow are controlled across the entire FS by mount
options right now.  ioctls are needed to set this on a per file or per
directory basis.  This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.

According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super.  However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.

After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.

NOTE:
 - The compression type is selected by such rules:
   If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
   Otherwise, we'll use the default compress type (zlib today).

v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
  will be screwed by inheritance from parent directory.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:41 -04:00
liubo 1abe9b8a13 Btrfs: add initial tracepoint support for btrfs
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
              dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
              dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
 btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
 btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
   flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
   flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
   flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
   flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
 btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)

Here is what I have added:

1) ordere_extent:
        btrfs_ordered_extent_add
        btrfs_ordered_extent_remove
        btrfs_ordered_extent_start
        btrfs_ordered_extent_put

These provide critical information to understand how ordered_extents are
updated.

2) extent_map:
        btrfs_get_extent

extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.

3) writepage:
        __extent_writepage
        btrfs_writepage_end_io_hook

Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.

4) inode:
        btrfs_inode_new
        btrfs_inode_request
        btrfs_inode_evict

These can show where and when a inode is created, when a inode is evicted.

5) sync:
        btrfs_sync_file
        btrfs_sync_fs

These show sync arguments.

6) transaction:
        btrfs_transaction_commit

In transaction based filesystem, it will be useful to know the generation and
who does commit.

7) back reference and cow:
	btrfs_delayed_tree_ref
	btrfs_delayed_data_ref
	btrfs_delayed_ref_head
	btrfs_cow_block

Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.

8) chunk:
	btrfs_chunk_alloc
	btrfs_chunk_free

Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.

9) reserved_extent:
	btrfs_reserved_extent_alloc
	btrfs_reserved_extent_free

These can show how btrfs uses its space.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-03-28 05:37:33 -04:00
Josef Bacik c0da7aa1a2 Btrfs: mark the bio with an error if we have a failure in dio
I noticed that dio_end_io calls the appropriate endio function with an error,
but the endio functions don't actually do anything with that error, they assume
that if there was an error then the bio will not be uptodate.  So if we had
checksum failures we would never pass back EIO.  So if there is an error in our
endio functions make sure to clear the uptodate flag on the bio.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-25 19:08:19 -04:00
Josef Bacik 98bc3149fa Btrfs: don't allocate dip->csums when doing writes
When doing direct writes we store the checksums in the ordered sum stuff in the
ordered extent for writing them when the write completes, so we don't even use
the dip->csums array.  So if we're writing, don't bother allocating dip->csums
since we won't use it anyway.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-25 19:08:18 -04:00
Linus Torvalds 6c51038900 Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
  Documentation/iostats.txt: bit-size reference etc.
  cfq-iosched: removing unnecessary think time checking
  cfq-iosched: Don't clear queue stats when preempt.
  blk-throttle: Reset group slice when limits are changed
  blk-cgroup: Only give unaccounted_time under debug
  cfq-iosched: Don't set active queue in preempt
  block: fix non-atomic access to genhd inflight structures
  block: attempt to merge with existing requests on plug flush
  block: NULL dereference on error path in __blkdev_get()
  cfq-iosched: Don't update group weights when on service tree
  fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
  block: Require subsystems to explicitly allocate bio_set integrity mempool
  jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
  fs: make fsync_buffers_list() plug
  mm: make generic_writepages() use plugging
  blk-cgroup: Add unaccounted time to timeslice_used.
  block: fixup plugging stubs for !CONFIG_BLOCK
  block: remove obsolete comments for blkdev_issue_zeroout.
  blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
  ...

Fix up conflicts in fs/{aio.c,super.c}
2011-03-24 10:16:26 -07:00
Josef Bacik 22a94d44bd Btrfs: add checks to verify dir items are correct
We need to make sure the dir items we get are valid dir items.  So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:41 -04:00
Josef Bacik 695a0d0da0 Btrfs: add a comment explaining what btrfs_cont_expand does
Everytime I have to deal with btrfs_cont_expand I stare at it for 20 minutes
trying to remember what exactly it does and why the hell we need it.  So add a
comment to save future-Josef some time.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-03-17 14:21:33 -04:00