1
0
Fork 0
Commit Graph

229 Commits (f4c697e6406da5dd445eda8d923c53e1138793dd)

Author SHA1 Message Date
Chris Mason 56bec294de Btrfs: do extent allocation and reference count updates in the background
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem.  For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.

If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.

These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot.  btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.

This commit adds an rbtree of pending reference count updates and extent
allocations.  The tree is ordered by byte number of the extent and byte number
of the parent for the back reference.  The tree allows us to:

1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.

#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.

These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction.  Throttling is
implemented to help keep the queue of backrefs at a reasonable size.

Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.

Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason 9fa8cfe706 Btrfs: don't preallocate metadata blocks during btrfs_search_slot
In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.

A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-24 16:14:25 -04:00
Chris Mason b9447ef80b Btrfs: fix spinlock assertions on UP systems
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places.  But, it wasn't correct for UP compiled
kernels.

This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-03-09 11:45:38 -04:00
Chris Mason 4008c04a07 Btrfs: make a lockdep class for the extent buffer locks
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block.  But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.

The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.

btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks.  Otherwise, lockdep
gets upset about bad lock orderin.

The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:09:45 -05:00
Jeff Mahoney e00f730865 Btrfs: remove btrfs_init_path
btrfs_init_path was initially used when the path objects were on the
stack.  Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.

This patch removes it, and just uses kmem_cache_zalloc to zero out the object.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 14:11:25 -05:00
Jeff Mahoney 7951f3cefb Btrfs: balance_level checks !child after access
The BUG_ON() is in the wrong spot.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-12 10:06:15 -05:00
Chris Mason 284b066af4 Btrfs: don't use spin_is_contended
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot.  The idea was to avoid
expensive searches in the tree unless the lock was actually contended.

But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.

In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-09 16:22:03 -05:00
Chris Mason 7b78c170dc Btrfs: Only prep for btree deletion balances when nodes are mostly empty
Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted.  This balance prep happens from the root of the tree down
so we can drop our locks as we go.

reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required.  This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:12:46 -05:00
Chris Mason 12f4daccfc Btrfs: fix btrfs_unlock_up_safe to walk the entire path
btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.

Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:42 -05:00
Chris Mason 4d081c41a4 Btrfs: change btrfs_del_leaf to drop locks earlier
btrfs_del_leaf does two things.  First it removes the pointer in the
parent, and then it frees the block that has the leaf.  It has the
parent node locked for both operations.

But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:31:28 -05:00
Chris Mason b4ce94de9b Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.

So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.

This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.

We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.

The basic idea is:

btrfs_tree_lock() returns with the spin lock held

btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock.  The buffer is
still considered locked by all of the btrfs code.

If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.

Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time.  So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.

btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.

btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.

ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:25:08 -05:00
Chris Mason c487685d7c Btrfs: hash_lock is no longer needed
Before metadata is written to disk, it is updated to reflect that writeout
has begun.  Once this update is done, the block must be cow'd before it
can be modified again.

This update was originally synchronized by using a per-fs spinlock.  Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.

So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 09:24:25 -05:00
Chris Mason a717531942 Btrfs: do less aggressive btree readahead
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too.  It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.

This patch changes things to just look for blocks within 64k of the
target block.  It will trigger less IO and make for lower latencies on
the read size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-22 09:23:10 -05:00
Chris Mason d397712bcc Btrfs: Fix checkpatch.pl warnings
There were many, most are fixed now.  struct-funcs.c generates some warnings
but these are bogus.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-01-05 21:25:51 -05:00
Yan Zheng 87b29b208c Btrfs: properly check free space for tree balancing
btrfs_insert_empty_items takes the space needed by the btrfs_item
structure into account when calculating the required free space.

So the tree balancing code shouldn't add sizeof(struct btrfs_item)
to the size when checking the free space. This patch removes these
superfluous additions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-12-17 10:21:48 -05:00
Chris Mason 42dc7babdc Btrfs: Fix compressed writes on truncated pages
The compression code was using isize to limit the amount of data it
sent through zlib.  But, it wasn't properly limiting the looping to
just the pages inside i_size.  The end result was trying to compress
too many pages, including those that had not been setup and properly locked
down.  This made the compression code oops while trying find_get_page on a
page that didn't exist.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-15 11:44:56 -05:00
Chris Mason 459931eca5 Btrfs: Delete csum items when freeing extents
This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.

The trick is doing it without racing because a single csum item may
hold csums for more than one extent.  Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.

A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock.  This is used to
remove csum bytes from the middle of an item.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-10 09:10:46 -05:00
Chris Mason 934d375bac Btrfs: Use map_private_extent_buffer during generic_bin_search
It is possible that generic_bin_search will be called on a tree block
that has not been locked.  This happens because cache_block_block skips
locking on the tree blocks.

Since the tree block isn't locked, we aren't allowed to change
the extent_buffer->map_token field.  Using map_private_extent_buffer
avoids any changes to the internal extent buffer fields.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-12-08 16:43:10 -05:00
Christoph Hellwig b2950863c6 Btrfs: make things static and include the right headers
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2008-12-02 09:54:17 -05:00
Liu Hui b4eec2ca11 Btrfs: Some fixes for batching extent insert.
In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batching
inserts. If so, it should just break from loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpectable place.

There are also some trivial fixs in this patch including one comment
typo error and deleting two redundant lines.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-18 11:30:10 -05:00
Yan Zheng 2b82032c34 Btrfs: Seed device support
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.

The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.

This patch does the following:

1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.

2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.

3) make btrfs_find_block_group not return NULL when all
block groups are read-only.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-11-17 21:11:30 -05:00
Josef Bacik f3465ca44e Btrfs: batch extent inserts/updates/deletions on the extent root
While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions.  This patch aims to try
and reduce that problem.  This happens two ways

1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion.  This saves us from doing unnecessary work almost every
time finish_current_insert runs.

2) Batch insertion/updates/deletions.  Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf.  On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert.  In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update.  And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.

This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk.  There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-11-12 14:19:50 -05:00
Chris Mason 6f3577bdc7 Btrfs: Improve metadata read latencies
This fixes latency problems on metadata reads by making sure they
don't go through the async submit queue, and by tuning down the amount
of readahead done during btree searches.

Also, the btrfs bdi congestion function is tuned to ignore the
number of pending async bios and checksums pending.  There is additional
code that throttles new async bios now and the congestion function
doesn't need to worry about it anymore.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-11-13 09:59:36 -05:00
Josef Bacik 2517920135 Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.

There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.

The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same.  Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.

Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one.  I have tested this heavily and it does
not appear to break anything.  This has to be applied on top of my
find_free_extent redo patch.

I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out.  Thank you,

Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 14:49:05 -04:00
Yan Zheng f82d02d9d8 Btrfs: Improve space balancing code
This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:

Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.

Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.

This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-29 14:49:05 -04:00
Yan Zheng 3bb1a1bc42 Btrfs: Remove offset field from struct btrfs_extent_ref
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.

This patch also makes the back reference system check the objectid
when extents are in deleting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
2008-10-09 11:46:24 -04:00
Chris Mason 323ac95bce Btrfs: don't read leaf blocks containing only checksums during truncate
Checksum items take up a significant portion of the metadata for large files.
It is possible to avoid reading them during truncates by checking the keys in
the higher level nodes.

If a given leaf is followed by another leaf where the lowest key is a checksum
item from the same file, we know we can safely delete the leaf without
reading it.

For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
the file with a cold cache.  It is read bound during the run.

With this change, Btrfs is able to delete the file in 0.5s

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-01 19:05:46 -04:00
Chris Mason d352ac6814 Btrfs: add and improve comments
This improves the comments at the top of many functions.  It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.

extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-29 15:18:18 -04:00
Zheng Yan 1a40e23b95 Btrfs: update space balancing code
This patch updates the space balancing code to utilize the new
backref format.  Before, btrfs-vol -b would break any COW links
on data blocks or metadata.  This was slow and caused the amount
of space used to explode if a large number of snapshots were present.

The new code can keeps the sharing of all data extents and
most of the tree blocks.

To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.

To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).

To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:09:34 -04:00
Zheng Yan 5b21f2ed3f Btrfs: extent_map and data=ordered fixes for space balancing
* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated.  This allows us to do accounting for them properly.

* The balancing code relocates data extents indepdent of the underlying
inode.  The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).

* Don't take the drop_mutex in the create_subvol ioctl.  It isn't
required.

* Fix walking of the ordered extent list to avoid races with sys_unlink

* Change the lock ordering rules.  Transaction start goes outside
the drop_mutex.  This allows btrfs_commit_transaction to directly
drop the relocation trees.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-26 10:05:38 -04:00
Zheng Yan 31840ae1a6 Btrfs: Full back reference support
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Josef Bacik 0f9dd46cda Btrfs: free space accounting redo
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size.  The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing.  If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible.  When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.

2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset.  also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.

3) cleaned up the allocation code to make it a little easier to read and a
little less complicated.  Basically there are 3 steps, first look from our
provided hint.  If we couldn't find from that given hint, start back at our
original search start and look for space from there.  If that fails try to
allocate space if we can and start looking again.  If not we're screwed and need
to start over again.

4) small fixes.  there were some issues in volumes.c where we wouldn't allocate
the rest of the disk.  fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space.  Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space.  Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups.  The alloc_hint has fixed
this slight degredation and made things semi-normal.

There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space.  This only happens with metadata
allocations, and only when we are almost full.  So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall.  I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason f25956cc58 Fix leaf overflow check in btrfs_insert_empty_items
It was incorrectly adding an extra sizeof(struct btrfs_item) and causing
false positives (oops)

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Christoph Hellwig b214107eda Btrfs: trivial sparse fixes
Fix a bunch of trivial sparse complaints.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Christoph Hellwig ad3d81ba8f Btrfs: missing endianess conversion in insert_new_root
Add two missing endianess conversions in this function, found by sparse.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason e02119d5a7 Btrfs: Add a write ahead tree log to optimize synchronous operations
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree.  There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.

After a crash, items are copied out of the log tree and back into the
subvolume.  See tree-log.c for all the details.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:07 -04:00
Chris Mason 65b51a009e btrfs_search_slot: reduce lock contention by cowing in two stages
A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.

The first part needs locks in the extent allocation tree, and may need to
do IO.  This changeset splits that into a separate function that can be
called without any tree locks held.

btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block.  This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:06 -04:00
Yan bcc63abbf3 Btrfs: implement memory reclaim for leaf reference cache
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.

The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan Zheng 31153d8128 Btrfs: Add a leaf reference cache
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.

This creates a cache so that IO isn't required.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Yan 9652480bf4 Fix path slots selection in btrfs_search_forward
We should decrease the found slot by one as btrfs_search_slot does
when bin_search return 1 and node level > 0.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Josef Bacik 7b12876623 Btrfs: Create orphan inode records to prevent lost files after a crash
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:05 -04:00
Chris Mason 0bd40a7184 btrfs_next_leaf: do readahead when skip_locking is turned on
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason 7d9eb12c87 Btrfs: Add locking around volume management (device add/remove/balance)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason f9efa9c784 Btrfs: Reduce contention on the root node
This calls unlock_up sooner in btrfs_search_slot in order to decrease the
amount of work done with the higher level tree locks held.

Also, it changes btrfs_tree_lock to spin for a big against the page lock
before scheduling.  This makes a big difference in context switch rate under
highly contended workloads.

Longer term, a better locking structure is needed than the page lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason 3f157a2fd2 Btrfs: Online btree defragmentation fixes
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.

This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems.  The auto-defrag needs to be done differently.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:04 -04:00
Chris Mason e7a84565bc Btrfs: Add btree locking to the tree defragmentation code
The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason a74a4b97b6 Btrfs: Replace the transaction work queue with kthreads
This creates one kthread for commits and one kthread for
deleting old snapshots.  All the work queues are removed.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 333db94cdd Btrfs: Fix snapshot deletion to release the alloc_mutex much more often.
This lowers the impact of snapshot deletion on the rest of the FS.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 5cd57b2cbb Btrfs: Add a skip_locking parameter to struct path, and make various funcs honor it
Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree.  But, those locks might already be held in other places, leading
to deadlocks.

Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups.  A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 168fd7d271 Fix btrfs_next_leaf to check for new items after dropping locks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 594a24eb0e Fix btrfs_del_ordered_inode to allow forcing the drop during unlinks
This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 051e1b9f74 Drop locks in btrfs_search_slot when reading a tree block.
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree.  This drops
locks held by a path when doing reads during a tree search.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason a213501153 Btrfs: Replace the big fs_mutex with a collection of other locks
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 925baeddc5 Btrfs: Start btree concurrency work.
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes.  This means allocation location is still not very
fine grained.

The main FS btree is protected by locks on each block in the btree.  Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.

The end result of a search is now a path where only the lowest level
is locked.  Releasing or freeing the path drops any locks held.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 0ef3e66b67 Btrfs: Allocator fix variety pack
* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
  next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
  twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
  try to allocate from devices that don't exist

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason 1259ab75c6 Btrfs: Handle write errors on raid1 and raid10
When duplicate copies exist, writes are allowed to fail to one of those
copies.  This changeset includes a few changes that allow the FS to
continue even when some IOs fail.

It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason ca7a79ad8d Btrfs: Pass down the expected generation number when reading tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:03 -04:00
Chris Mason bce4eae986 Btrfs: Fix balance_level to free the middle block if there is room in the left one
balance level starts by trying to empty the middle block, and then
pushes from the right to the middle.  This might empty the right block
and leave a small number of pointers in the middle.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 971a1f6648 Btrfs: Don't empty the middle buffer in push_nodes_for_insert
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason c448acf0a0 Btrfs: Fix split_node to require more empty slots in the node as well
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 1514794e42 Btrfs: Make sure nodes have enough room for a double split
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:02 -04:00
Chris Mason 699122f559 Btrfs: Don't wait on tree block writeback before freeing them anymore
This isn't required anymore because we don't reallocate blocks that
have already been written in this transaction.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason e17cade25f Btrfs: Add chunk uuids and update multi-device back references
Block headers now store the chunk tree uuid

Chunk items records the device uuid for each stripes

Device extent items record better back refs to the chunk tree

Block groups record better back refs to the chunk tree

The chunk tree format has also changed.  The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk.  Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.

This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 85d824c4a4 Btrfs: Disable extra debugging checks on tree blocks
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason f188591e98 Btrfs: Retry metadata reads in the face of checksum failures
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason ce9adaa5a7 Btrfs: Do metadata checksums for reads via a workqueue
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.

But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.

The new code validates checksums when the page is read.  It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason cea9e4452e Change btrfs_map_block to return a structure with mappings for all stripes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 0ef8b2428a Btrfs: Properly dirty buffers in the split corner cases
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 0999df54f8 Btrfs: Verify checksums on tree blocks found without read_tree_block
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.

But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified.  This patch
makes sure all buffers go through checksum verification before they
are used.

This is safer, and it prevents modification of buffers before they go
through the csum code.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 63b10fc487 Reorder the flags field in struct btrfs_header and record a flag on writeout
This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:01 -04:00
Chris Mason 0b86a832a1 Btrfs: Add support for multiple devices per filesystem
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Yan 2f375ab9c5 Call btrfs_cow_block while lowering tree level.
When freeing root block of a tree,  btrfs_free_extent' parameter
'ref_generation' is from root block itseft.  When freeing non-root
block,  'ref_generation' is from its parent. so when converting a
non-root block to root block, we must guarantee its generation is
equal to its parent's generation.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 5a01a2e3a9 Btrfs: Copy correct tree when inserting into slot 0
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 9c58309d6c Btrfs: Add inode item and backref in one insert, reducing cpu usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason 85e21bac16 Btrfs: During deletes and truncate, remove many items at once from the tree
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:04:00 -04:00
Chris Mason dc17ff8f11 Btrfs: Add data=ordered support
This forces file data extents down the disk along with the metadata that
references them.  The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:59 -04:00
Chris Mason 98ed51747b Btrfs: Force inlining off in a few places to save stack usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 8f662a76c6 Btrfs: Add readahead to the online shrinker, and a mount -o alloc_start= for testing
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 01f4665805 Btrfs: Less aggressive readahead on deletes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 4aec2b5232 kmalloc a few large stack objects in the btrfs_ioctl path
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason be20aa9dba Btrfs: Add mount option to turn off data cow
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.

In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 7bb86316c3 Btrfs: Add back pointers from extents to the btree or file referencing them
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 74493f7a59 Btrfs: Implement generation numbers in block pointers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Yan eef1c494a2 Btrfs: Properly update right_nritems in push_leaf_left
The codes that fixup the right leaf and the codes that dirty the
extnet buffer use the variable 'right_nritems' ,  both of them expect
'right_nritems' is the number of items in right leaf after the push.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:58 -04:00
Chris Mason 34a3821873 Btrfs: Change push_leaf_{leaf,right} to empty the src leave during item deletion
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 081e95736d Btrfs: Make defrag check nodes against the progress key to prevent repeating work
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 179e29e488 Btrfs: Fix a number of inline extent problems that Yan Zheng reported.
The fixes do a number of things:

1) Most btrfs_drop_extent callers will try to leave the inline extents in
place.  It can truncate bytes off the beginning of the inline extent if
required.

2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.

3) btrfs_truncate_in_transaction truncates inline extents

4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 5708b95916 Btrfs: Tune the automatic defrag code
1) Forced defrag wasn't working properly (btrfsctl -d) because some
cache only checks were incorrect.

2) Defrag only the leaves unless in forced defrag mode.

3) Don't use complex logic to figure out if a leaf is needs defrag

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason cc0c553847 Btrfs: Fix split_leaf to detect when it is extending an item
When making room for a new item, it is ok to create an empty leaf, but
when making room to extend an item, split_leaf needs to make sure it
keeps the item we're extending in the path and make sure we don't end up
with an empty leaf.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 5ee78ac70f Btrfs: Fix split_leaf to avoid incorrect double splits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 3685f79165 Btrfs: CPU usage optimizations in push and the extent_map code
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Jens Axboe ae2f5411c4 btrfs: 32-bit type problems
An assorted set of casts to get rid of the warnings on 32-bit archs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason 7936ca3883 Btrfs: Default to 8k max packed tails
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason a6b6e75e09 Btrfs: Defrag only leaves, and only when the parent node has a single objectid
This allows us to defrag huge directories, but skip the expensive defrag
case in more common usage, where it does not help as much.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:57 -04:00
Chris Mason cf786e79e3 Btrfs: Defrag: only walk into nodes with the defrag bit set
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 0f1ebbd159 Btrfs: Large block related defrag optimizations
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 0f82731fc5 Breakout BTRFS_SETGET_FUNCS into a separate C file, the inlines were too big.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 810191ff30 Btrfs: extent_map optimizations to cut down on CPU usage
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 3326d1b07c Btrfs: Allow tails larger than one page
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00
Chris Mason 4dc119046d Btrfs: Add an extent buffer LRU to reduce radix tree hits
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-09-25 11:03:56 -04:00