1
0
Fork 0
Commit Graph

413 Commits (b96de000bc8bc9688b3a2abea4332bd57648a49f)

Author SHA1 Message Date
Ilya Dryomov bf023ecfca Btrfs: eliminate a use-after-free in btrfs_balance()
Commit 5af3e8cc introduced a use-after-free at volumes.c:3139: bctl is freed
above in __cancel_balance() in all cases except for balance pause.  Fix this
by moving the offending check a couple statements above, the meaning of the
check is preserved.

Reported-by: Chris Mason <chris.mason@fusionio.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:53 -05:00
Eric Sandeen 063d006fa0 btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk
WARN_ON isn't enough, we need to stop the loop if for any reason
we would overrun the devices_info array.

I tried to track down the connection between the length of
the alloc_devices list and the rw_devices counter but
it wasn't immediately obvious, so be defensive about it.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:26 -05:00
Josef Bacik 0f5d42b287 Btrfs: remove extent mapping if we fail to add chunk
I got a double free error when unmounting a file system that failed to add a
chunk during its operation.  This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks.  So to fix
this just remove the extent_map when we error out so we don't run into this
problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:11 -05:00
Josef Bacik 0448748849 Btrfs: fix chunk allocation error handling
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists.  This will
cause BUG_ON()'s in the bio submission path.  This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit.  Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:10 -05:00
Miao Xie de98ced9e7 Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits
There is no lock to protect
  fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.

Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 12:59:08 -05:00
Miao Xie e6ec716f0d Btrfs: make raid attr array more readable
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:37:19 -05:00
David Sterba 6f60cbd3ae btrfs: access superblock via pagecache in scan_one_device
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages.  It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.

Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too.  When this
happens, we drop some of the important blocks written by mkfs.

This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread.  This way we don't have to care
about the blocksize of the device.

This also drops the invalidate_bdev() call.  It wasn't very polite to
invalidate during the scan either.  mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-15 16:57:47 -05:00
Chris Mason 24f8ebe918 Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus 2013-02-05 19:24:44 -05:00
Chris Mason 0e4e026366 Merge branch 'for-linus' into raid56-experimental
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:04:03 -05:00
Chris Mason 1f0905ec15 Btrfs: remove conflicting check for minimum number of devices in raid56
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-05 10:01:42 -05:00
David Woodhouse 53b381b3ab Btrfs: RAID5 and RAID6
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.

Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio.  This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.

But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle.  This will be addressed in a
later commit.

Scrub is unable to repair crc errors on raid5/6 chunks.

Discard does not work on raid5/6 (yet)

The stripe size is fixed at 64KiB per disk.  This will be tunable
in a later commit.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 14:24:23 -05:00
Eric Sandeen 3c91160808 btrfs: don't try to notify udev about missing devices
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-02-01 11:47:37 -05:00
Miao Xie c9f01bfe0c Btrfs: fix wrong max device number for single profile
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.

Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-01-24 12:51:26 -05:00
Ilya Dryomov a105bb88f4 Btrfs: fix a regression in balance usage filter
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input.  Bring it back.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-01-21 20:40:27 -05:00
Ilya Dryomov ed0fb78fb6 Btrfs: bring back balance pause/resume logic
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge).  Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away.  Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.

Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2013-01-20 16:21:17 +02:00
Lukas Czerner cc975eb460 btrfs: get the device in write mode when deleting it
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
2013-01-14 13:52:31 -05:00
Liu Bo 31e502298d Btrfs: put raid properties into global table
Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:28 -05:00
Josef Bacik 70c8a91ce2 Btrfs: log changed inodes based on the extent map tree
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree.  So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:24 -05:00
Lukas Czerner b8b8ff590f btrfs: Notify udev when removing device
Currently udev does not know about the device being removed from the
file system. This may result in the situation where we're unable to
mount the file system by UUID or by LABEL because the by-uuid and
by-label links may still point to the device which is no longer part of
the btrfs file system and hence does not have any btrfs super block.

It can be easily reproduced by the following:

mkfs.btrfs -L bugfs /dev/loop[0-6]
mount /dev/loop0 /mnt/test
btrfs device delete /dev/loop0 /mnt/test
umount /mnt/test

mount LABEL=bugfs /mnt/test <---- this fails

then see:

ls -l /dev/disk/by-label/bugfs

which will still point to the /dev/loop0

We did not noticed this before because libblkid would send the udev
event for us when it notice that the link does not fit the reality,
however it does not do that anymore and completely relies on udev
information.

Fix this by sending the KOBJ_CHANGE event to the bdev kobject after
successful device removal.

Note that this does not affect device addition, because we will open the
device prior the addition from userspace and udev will notice that and
reread the device afterwards.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:21 -05:00
Stefan Behrens f9c83748de Btrfs: fix a build warning for an unused label
This issue was detected by the "0-DAY kernel build testing".

fs/btrfs/volumes.c: In function 'btrfs_rm_device':
fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label]

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-16 20:46:13 -05:00
Stefan Behrens ad6d620e2a Btrfs: allow repair code to include target disk when searching mirrors
Make the target disk of a running device replace operation
available for reading. This is only used as a last ressort for
the defect repair procedure. And it is dependent on the location
of the data block to read, because during an ongoing device
replace operation, the target drive is only partially filled
with the filesystem data.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:45 -05:00
Stefan Behrens 30d9861ff9 Btrfs: optionally avoid reads from device replace source drive
It is desirable to be able to configure the device replace
procedure to avoid reading the source drive (the one to be
copied) whenever possible. This is useful when the number of
read errors on this disk is high, because it would delay the
copy procedure alot. Therefore there is an option to avoid
reading from the source disk unless the repair procedure
really needs to access it. The regular read req asks for
mapping the block with mirror_num == 0, in this case the
source disk is avoided whenever possible. The repair code
selects the mirror_num explicitly (mirror_num != 0), this
case is not changed by this commit.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:44 -05:00
Stefan Behrens 472262f35a Btrfs: changes to live filesystem are also written to replacement disk
During a running dev replace operation, all write requests to
the live filesystem are duplicated to also write to the target
drive. Therefore btrfs_map_block() is changed to duplicate
stripes that are written to the source disk of a device replace
procedure to be written to the target disk as well.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:43 -05:00
Stefan Behrens 29a8d9a0bc Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block()
Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.

The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:43 -05:00
Stefan Behrens 8dabb7420f Btrfs: change core code of btrfs to support the device replace operations
This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:42 -05:00
Stefan Behrens e93c89c1aa Btrfs: add new sources for device replace code
This adds a new file to the sources together with the header file
and the changes to ioctl.h and ctree.h that are required by the
new C source file. Additionally, 4 new functions are added to
volume.c that deal with device creation and destruction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:41 -05:00
Stefan Behrens 618919236b Btrfs: handle errors from btrfs_map_bio() everywhere
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:40 -05:00
Stefan Behrens 63a212abc2 Btrfs: disallow some operations on the device replace target device
This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:39 -05:00
Stefan Behrens 5ac00addc7 Btrfs: disallow mutually exclusive admin operations from user mode
Btrfs admin operations that are manually started from user mode
and that cannot be executed at the same time return -EINPROGRESS.
A common way to enter and leave this locked section is introduced
since it used to be specific to the balance operation.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:38 -05:00
Stefan Behrens aa1b8cd409 Btrfs: pass fs_info instead of root
A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:36 -05:00
Stefan Behrens a8a6dab779 Btrfs: add btrfs_scratch_superblock() function
This new function is used by the device replace procedure in
a later patch.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:35 -05:00
Stefan Behrens 3ec706c831 Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:34 -05:00
Stefan Behrens 5d9640517d Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree
This is required for the device replace procedure in a later step.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:34 -05:00
Stefan Behrens 7ba15b7d21 Btrfs: add two more find_device() methods
The new function btrfs_find_device_missing_or_by_path() will be
used for the device replace procedure. This function itself calls
the second new function btrfs_find_device_by_path().
Unfortunately, it is not possible to currently make the rest of the
code use these functions as well, since all functions that look
similar at first view are all a little bit different in what they
are doing. But in the future, new code could benefit from these
two new functions, and currently, device replace uses them.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:33 -05:00
Stefan Behrens beaf8ab3af Btrfs: move some common code into a subfunction
Some code to open block devices, to read the superblock and to
handle errors was repeated multiple times in 3 places, and the
following patch makes use of it as well. This code is now moved
into a subfunction.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:33 -05:00
Liu Bo d25628bdd6 Btrfs: protect devices list with its mutex
Since we've kill the bigger one volume_mutex, we need to add devices
list mutex back.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:28 -05:00
Stefan Behrens d03f918ab9 Btrfs: Don't trust the superblock label and simply printk("%s") it
Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the
superblock and make Btrfs printk("%s") crash while holding the
uuid_mutex since nobody forces a limit on the string. Since the
uuid_mutex is significant, the system would be unusable
afterwards.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:26 -05:00
Julia Lawall 31b1a2bd75 fs/btrfs: use WARN
Use WARN rather than printk followed by WARN_ON(1), for conciseness.

A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
expression list es;
@@

-printk(
+WARN(1,
  es);
-WARN_ON(1);
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:23 -05:00
Masanari Iida d142324873 Btrfs: Fix typo in fs/btrfs
Correct spelling typo in btrfs.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:17 -05:00
jeff.liu 0253f40ef9 Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev()
Remove an invalid size check up from btrfs_shrink_dev().

The new size should not larger than the device->total_bytes as it was
already verified before coming to here(i.e. new_size < old_size).

Remove invalid check up for btrfs_shrink_dev().

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-12 17:15:16 -05:00
Josef Bacik de1ee92ac3 Btrfs: recheck bio against block device when we map the bio
Alex reported a problem where we were writing between chunks on a rbd
device.  The thing is we do bio_add_page using logical offsets, but the
physical offset may be different.  So when we map the bio now check to see
if the bio is still ok with the physical offset, and if it is not split the
bio up and redo the bio_add_page with the physical sector.  This fixes the
problem for Alex and doesn't affect performance in the normal case.  Thanks,

Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:32 -05:00
Miao Xie 3fed40cc97 Btrfs: cleanup duplicated division functions
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-12-11 13:31:30 -05:00
Miao Xie 671415b7db Btrfs: fix deadlock caused by the nested chunk allocation
Steps to reproduce:
 # mkfs.btrfs -m raid1 <disk1> <disk2>
 # btrfstune -S 1 <disk1>
 # mount <disk1> <mnt>
 # btrfs device add <disk3> <disk4> <mnt>
 # mount -o remount,rw <mnt>
 # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
 Deadlock happened.

It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.

By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.

So, I fix this problem by commiting the transaction at the end of the first
device insertion.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-10-25 15:47:00 -04:00
Stefan Behrens 5af3e8cce8 Btrfs: make filesystem read-only when submitting barrier fails
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.

In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.

The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-10-09 09:20:19 -04:00
Daniel J Blueman 489406626c btrfs: fix message printing
Fix various messages to include newline and module prefix.

Signed-off-by: Daniel J Blueman <daniel@quora.org>
2012-10-09 09:19:57 -04:00
David Sterba 005d6427ac btrfs: move transaction aborts to the point of failure
Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-10-08 20:09:02 -04:00
Linus Torvalds 318e151019 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I've split out the big send/receive update from my last pull request
  and now have just the fixes in my for-linus branch.  The send/recv
  branch will wander over to linux-next shortly though.

  The largest patches in this pull are Josef's patches to fix DIO
  locking problems and his patch to fix a crash during balance.  They
  are both well tested.

  The rest are smaller fixes that we've had queued.  The last rc came
  out while I was hacking new and exciting ways to recover from a
  misplaced rm -rf on my dev box, so these missed rc3."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: fix that repair code is spuriously executed for transid failures
  Btrfs: fix ordered extent leak when failing to start a transaction
  Btrfs: fix a dio write regression
  Btrfs: fix deadlock with freeze and sync V2
  Btrfs: revert checksum error statistic which can cause a BUG()
  Btrfs: remove superblock writing after fatal error
  Btrfs: allow delayed refs to be merged
  Btrfs: fix enospc problems when deleting a subvol
  Btrfs: fix wrong mtime and ctime when creating snapshots
  Btrfs: fix race in run_clustered_refs
  Btrfs: don't run __tree_mod_log_free_eb on leaves
  Btrfs: increase the size of the free space cache
  Btrfs: barrier before waitqueue_active
  Btrfs: fix deadlock in wait_for_more_refs
  btrfs: fix second lock in btrfs_delete_delayed_items()
  Btrfs: don't allocate a seperate csums array for direct reads
  Btrfs: do not strdup non existent strings
  Btrfs: do not use missing devices when showing devname
  Btrfs: fix that error value is changed by mistake
  Btrfs: lock extents as we map them in DIO
  ...
2012-08-29 11:36:22 -07:00
Stefan Behrens 5ee0844d64 Btrfs: revert checksum error statistic which can cause a BUG()
Commit 442a4f6308 added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.

Cc: stable <stable@vger.kernel.org> # 3.5
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-08-28 16:53:39 -04:00
Josef Bacik 66657b318e Btrfs: barrier before waitqueue_active
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups.  So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:33 -04:00
Josef Bacik 99f5944b84 Btrfs: do not strdup non existent strings
When we close devices we add back empty devices for some reason that escapes
me.  In the case of a missing dev we don't allocate an rcu_string for it's
name, so check to see if the device has a name and if it doesn't don't
bother strdup()'ing it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-28 16:53:29 -04:00
Artem Bityutskiy 34eaadaf22 btrfs: nuke write_super from comments
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from btrfs.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-04 12:15:35 +04:00
Stefan Behrens a98cdb85b9 Btrfs: suppress printk() if all device I/O stats are zero
Code is added to suppress the I/O stats printing at mount time if all
statistic values are zero.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Stefan Behrens 5021976d8d Btrfs: remove unwanted printk() for btrfs device I/O stats
People complained about the annoying kernel log message
"btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
everytime a filesystem is mounted for the first time after running
mkfs. Since the distribution of the btrfs-progs is not synchronized
to the kernel version, mkfs like it is now will be used also in the
future. Then this message is not useful to find errors, it is just
annoying. This commit removes the printk().

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-07-23 16:28:07 -04:00
Josef Bacik 02db0844be Btrfs: add DEVICE_READY ioctl
This will be used in conjunction with btrfs device ready <dev>.  This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently.  This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-23 16:27:42 -04:00
David Sterba b27f7c0c15 btrfs: join DEV_STATS ioctls to one
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html

to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.

CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
2012-07-23 15:41:40 -04:00
Ilya Dryomov 2b6ba629b5 Btrfs: resume balance on rw (re)mounts properly
This introduces btrfs_resume_balance_async(), which, given that
restriper state was recovered earlier by btrfs_recover_balance(),
resumes balance in btrfs-balance kthread.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-02 15:39:17 -04:00
Ilya Dryomov 68310a5e42 Btrfs: restore restriper state on all mounts
Fix a bug that triggered asserts in btrfs_balance() in both normal and
resume modes -- restriper state was not properly restored on read-only
mounts.  This factors out resuming code from btrfs_restore_balance(),
which is now also called earlier in the mount sequence to avoid the
problem of some early writes getting the old profile.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-07-02 15:39:16 -04:00
Stefan Behrens 597a60fade Btrfs: don't count I/O statistic read errors for missing devices
It is normal behaviour of the low level btrfs function btrfs_map_bio()
to complete a bio with -EIO if the device is missing, instead of just
preventing the bio creation in an earlier step.
This used to cause I/O statistic read error increments and annoying
printk_ratelimited messages. This commit fixes the issue.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Carey Underwood <cwillu@cwillu.com>
2012-07-02 15:36:23 -04:00
Josef Bacik 606686eeac Btrfs: use rcu to protect device->name
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory.  Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock().  This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch.  Thanks,

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
2012-06-14 21:29:16 -04:00
Stefan Behrens 733f4fbbc1 Btrfs: read device stats on mount, write modified ones during commit
The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-05-30 10:23:41 -04:00
Stefan Behrens c11d2c236c Btrfs: add ioctl to get and reset the device stats
An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-05-30 10:23:40 -04:00
Stefan Behrens 442a4f6308 Btrfs: add device counters for detected IO and checksum errors
The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2012-05-30 10:23:39 -04:00
Liu Bo f8c5d0b443 Btrfs: fix wrong error returned by adding a device
Reproduce:
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs -o ro
$ btrfs dev add /dev/sdb8 /mnt/btrfs
ERROR: error adding the device '/dev/sdb8' - Invalid argument

Since we mount with readonly options, and /dev/sdb7 is not a seeding one,
a readonly notification is preferred.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
2012-05-30 10:23:34 -04:00
Jan Schmidt 3e74317ad7 Btrfs: fix repair code for RAID10
btrfs_map_block sets mirror_num, so that the repair code knows eventually
which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
Before this fix mirror_num was incorrectly related to our stripe index.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-27 13:55:13 -04:00
Julia Lawall 48d282326b fs/btrfs/volumes.c: add missing free_fs_devices
Free fs_devices as done in the error-handling code just below.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
2012-04-18 19:22:28 +02:00
Ilya Dryomov 37db63a400 Btrfs: fix max chunk size check in chunk allocator
Fix a bug, where in case we need to adjust stripe_size so that the
length of the resulting chunk is less than or equal to max_chunk_size,
DUP chunks turn out to be only half as big as they could be.

Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-04-18 19:22:25 +02:00
Liu Bo b89203f74b Btrfs: fix eof while discarding extents
We miscalculate the length of extents we're discarding, and it leads to
an eof of device.

Reported-by: Daniel Blueman <daniel@quora.org>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-04-12 16:03:56 -04:00
Chris Mason 3c4bb26b21 Btrfs: flush out and clean up any block device pages during mount
Btrfs puts the filesystem metadata into its own address space, and
somehow the block device address space isn't getting onto disk properly
before a mount.  The end result is that a loop of mkfs and mounting the
filesystem will sometimes find stale or incorrect data.

This commit should fix it by sprinkling fdatawrites and invalidate_bdev
calls around.  This is a short term measure to make sure it is fixed.
The block devices really should be flushed and cleaned up higher in the
stack.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:33:58 -04:00
Chris Mason 1c691b330a Merge branch 'for-chris' of git://github.com/idryomov/btrfs-unstable into for-linus 2012-03-28 20:32:46 -04:00
Chris Mason 1d4284bd6e Merge branch 'error-handling' into for-linus
Conflicts:
	fs/btrfs/ctree.c
	fs/btrfs/disk-io.c
	fs/btrfs/extent-tree.c
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/inode.c
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-28 20:31:37 -04:00
Ilya Dryomov 213e64da90 Btrfs: fix infinite loop in btrfs_shrink_device()
If relocate of block group 0 fails with ENOSPC we end up infinitely
looping because key.offset -= 1 statement in that case brings us back to
where we started.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:18 +03:00
Ilya Dryomov e4837f8f3b Btrfs: allow dup for data chunks in mixed mode
Generally we don't allow dup for data, but mixed chunks are special and
people seem to think this has its use cases.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:17 +03:00
Ilya Dryomov 6728b198de Btrfs: validate target profiles only if we are going to use them
Do not run sanity checks on all target profiles unless they all will be
used.  This came up because alloc_profile_is_valid() is now more strict
than it used to be.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:17 +03:00
Ilya Dryomov 0c460c0d70 Btrfs: move alloc_profile_is_valid() to volumes.c
Header file is not a good place to define functions.  This also moves a
call to alloc_profile_is_valid() down the stack and removes a redundant
check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it
into account.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:17 +03:00
Ilya Dryomov e8920a640b Btrfs: make profile_is_valid() check more strict
"0" is a valid value for an on-disk chunk profile, but it is not a valid
extended profile.  (We have a separate bit for single chunks in extended
case)

Also rename it to alloc_profile_is_valid() for clarity.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:17 +03:00
Ilya Dryomov 899c81eac8 Btrfs: add wrappers for working with alloc profiles
Add functions to abstract the conversion between chunk and extended
allocation profile formats and switch everybody to use them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-03-27 17:09:16 +03:00
Chris Mason 727011e07c Btrfs: allow metadata blocks larger than the page size
A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling.  This fixes the code to properly support
large metadata blocks again.

Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.

This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-03-26 16:50:37 -04:00
Jeff Mahoney 79787eaab4 btrfs: replace many BUG_ONs with proper error handling
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
 progress but aims to handle most errors other than internal logic
 errors and ENOMEM more gracefully.

 This iteration prevents most crashes but can run into lockups with
 the page lock on occasion when the timing "works out."

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 11:52:54 +01:00
Mark Fasheh 3acd395317 btrfs: Remove BUG_ON from __finish_chunk_alloc()
btrfs_alloc_chunk() unconditionally BUGs on any error returned from
__finish_chunk_alloc() so there's no need for two BUG_ON lines. Remove the
one from __finish_chunk_alloc().

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:39 +01:00
Mark Fasheh 1dd4602fa7 btrfs: Remove BUG_ON from __btrfs_alloc_chunk()
We BUG_ON() error from add_extent_mapping(), but that error looks pretty
easy to bubble back up - as far as I can tell there have not been any
permanent modifications to fs state at that point.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:39 +01:00
Mark Fasheh 2cdcecbc15 btrfs: Don't BUG_ON insert errors in btrfs_alloc_dev_extent()
The only caller of btrfs_alloc_dev_extent() is __btrfs_alloc_chunk() which
already bugs on any error returned. We can remove the BUG_ON's in
btrfs_alloc_dev_extent() then since __btrfs_alloc_chunk() will "catch" them
anyway.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:38 +01:00
Mark Fasheh 4ed1d16e94 btrfs: Don't BUG_ON errors in __finish_chunk_alloc()
All callers of __finish_chunk_alloc() BUG_ON() return value, so it's trivial
for us to always bubble up any errors caught in __finish_chunk_alloc() to be
caught there.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
2012-03-22 01:45:37 +01:00
Jeff Mahoney 143bede527 btrfs: return void in functions without error conditions
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-22 01:45:34 +01:00
Linus Torvalds 855a85f704 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Quoth Chris:
 "This is later than I wanted because I got backed up running through
  btrfs bugs from the Oracle QA teams.  But they are all bug fixes that
  we've queued and tested since rc1.

  Nothing in particular stands out, this just reflects bug fixing and QA
  done in parallel by all the btrfs developers.  The most user visible
  of these is:

    Btrfs: clear the extent uptodate bits during parent transid failures

  Because that helps deal with out of date drives (say an iscsi disk
  that has gone away and come back).  The old code wasn't always
  properly retrying the other mirror for this type of failure."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
  Btrfs: fix compiler warnings on 32 bit systems
  Btrfs: increase the global block reserve estimates
  Btrfs: clear the extent uptodate bits during parent transid failures
  Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
  Btrfs: make sure we update latest_bdev
  Btrfs: improve error handling for btrfs_insert_dir_item callers
  Btrfs: be less strict on finding next node in clear_extent_bit
  Btrfs: fix a bug on overcommit stuff
  Btrfs: kick out redundant stuff in convert_extent_bit
  Btrfs: skip states when they does not contain bits to clear
  Btrfs: check return value of lookup_extent_mapping() correctly
  Btrfs: fix deadlock on page lock when doing auto-defragment
  Btrfs: fix return value check of extent_io_ops
  btrfs: honor umask when creating subvol root
  btrfs: silence warning in raid array setup
  btrfs: fix structs where bitfields and spinlock/atomic share 8B word
  btrfs: delalloc for page dirtied out-of-band in fixup worker
  Btrfs: fix memory leak in load_free_space_cache()
  btrfs: don't check DUP chunks twice
  Btrfs: fix trim 0 bytes after a device delete
  ...
2012-02-24 09:02:53 -08:00
Chris Mason a6b0d5c8db Btrfs: make sure we update latest_bdev
When we are setting up the mount, we close all the
devices that were not actually part of the metadata we found.

But, we don't make sure that one of those devices wasn't
fs_devices->latest_bdev, which means we can do a use after free
on the one we closed.

This updates latest_bdev as it goes.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-02-23 10:43:45 -05:00
Tsutomu Itoh 285190d99f Btrfs: check return value of lookup_extent_mapping() correctly
This patch corrects error checking of lookup_extent_mapping().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
2012-02-16 17:23:17 +01:00
David Sterba 8a33442694 btrfs: silence warning in raid array setup
Raid array setup code creates an extent buffer in an usual way. When the
PAGE_CACHE_SIZE is > super block size, the extent pages are not marked
up-to-date, which triggers a WARN_ON in the following
write_extent_buffer call. Add an explicit up-to-date call to silence the
warning.

Signed-off-by: David Sterba <dsterba@suse.cz>
2012-02-15 16:40:25 +01:00
Linus Torvalds d65773b22b Merge branch 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
* 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  btrfs: take allocation of ->tree_root into open_ctree()
  btrfs: let ->s_fs_info point to fs_info, not root...
  btrfs: consolidate failure exits in btrfs_mount() a bit
  btrfs: make free_fs_info() call ->kill_sb() unconditional
  btrfs: merge free_fs_info() calls on fill_super failures
  btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
  btrfs: make open_ctree() return int
  btrfs: sanitizing ->fs_info, part 5
  btrfs: sanitizing ->fs_info, part 4
  btrfs: sanitizing ->fs_info, part 3
  btrfs: sanitizing ->fs_info, part 2
  btrfs: sanitizing ->fs_info, part 1
  btrfs: fix a deadlock in btrfs_scan_one_device()
  btrfs: fix mount/umount race
  btrfs: get ->kill_sb() of its own
  btrfs: preparation to fixing mount/umount race
2012-01-17 15:52:51 -08:00
Chris Mason 96bdc7dc61 Btrfs: use larger system chunks
system chunks by default are very small.  This makes them slightly
larger and also fixes the conditional checks to make sure we don't
allocate a billion of them at once.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:38:24 -05:00
Chris Mason c126dea771 Merge branch 'integrity-check-patch-v2' of git://btrfs.giantdisaster.de/git/btrfs into integration
Conflicts:
	fs/btrfs/ctree.h
	fs/btrfs/super.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:27:58 -05:00
Chris Mason d756bd2d93 Merge branch 'for-chris' of git://repo.or.cz/linux-btrfs-devel into integration
Conflicts:
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-16 15:26:17 -05:00
Chris Mason 27263e2832 Merge branch 'restriper' of git://github.com/idryomov/btrfs-unstable into integration 2012-01-16 15:26:02 -05:00
Ilya Dryomov 19a39dce3b Btrfs: add balance progress reporting
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov a7e99c691a Btrfs: allow for canceling restriper
Implement an ioctl for canceling restriper.  Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit.  Balance item is deleted and no memory
about the interrupted balance is kept.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov 837d5b6e46 Btrfs: allow for pausing restriper
Implement an ioctl for pausing restriper.  This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc.  If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.

Add a hook to close_ctree() to pause restriper and free its data
structures on unmount.  (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:49 +02:00
Ilya Dryomov 9555c6c180 Btrfs: add skip_balance mount option
Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it.  The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov 596410151e Btrfs: recover balance on mount
On mount, if balance item is found, resume balance in a separate
kernel thread.

Try to be smart to continue roughly where previous balance (or convert)
was interrupted.  For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov 0940ebf6b9 Btrfs: save balance parameters to disk
Introduce a new btree objectid for storing balance item.  The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.

The key for the new item is as follows:

	[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]

Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov cfa4c961cc Btrfs: soft profile changing mode (aka soft convert)
When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to.  This is useful if e.g. half of the FS was converted
earlier.

The soft mode switch is (like every other filter) per-type.  This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov e4d8ec0f65 Btrfs: implement online profile changing
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized.  Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce.  If target profile is not yet available it goes
back to a plain reduce.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov ea67176ae8 Btrfs: virtual address space subset filter
Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov 94e60d5a5c Btrfs: devid subset filter
Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.

This filter only works when devid filter is turned on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:48 +02:00
Ilya Dryomov 409d404b46 Btrfs: devid filter
Relocate chunks which have at least one stripe located on a device with
devid X.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov 5ce5b3c091 Btrfs: usage filter
Select chunks that are less than X percent full.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov ed25e9b26f Btrfs: profiles filter
Select chunks based on a given profile mask.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov f43ffb60fd Btrfs: add basic infrastructure for selective balancing
This allows to have a separate set of filters for each chunk type
(data,meta,sys).  The code however is generic and switch on chunk type
is only done once.

This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov c9e9f97bdf Btrfs: add basic restriper infrastructure
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc.  The semantics of the old balancing
ioctl are fully preserved.

Explicitly disallow any volume operations when balance is in progress.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov 52ba692972 Btrfs: introduce masks for chunk type and profile
Chunk's type and profile are encoded in u64 flags field.  Introduce
masks to easily access them.  Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Ilya Dryomov 6fef8df1dc Btrfs: get rid of *_alloc_profile fields
{data,metadata,system}_alloc_profile fields have been unused for a long
time now.  Get rid of them.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2012-01-16 22:04:47 +02:00
Li Zefan b367e47fb3 Btrfs: fix possible deadlock when opening a seed device
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:

    open_ctree()
        lock(chunk_mutex);
        read_chunk_tree();
            read_one_dev();
                open_seed_devices();
                    lock(uuid_mutex);

and then we hit a lockdep splat.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:54 +08:00
Li Zefan ec9ef7a13b Btrfs: simplfy calculation of stripe length for discard operation
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().

However the calculation is a bit complex for raid0 and raid10, so I
reimplement it based on a fact that:

        dev1          dev2           dev3    (raid0)
        -----------------------------------
        s0 s3 s6      s1 s4 s7       s2 s5

Each device has (total_stripes / nr_dev) stripes, or plus one.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:46 +08:00
Li Zefan de11cc12df Btrfs: don't pre-allocate btrfs bio
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.

Also we don't have to calcuate the stripe number twice.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:44 +08:00
Li Zefan 125ccb0ae6 Btrfs: don't pass a trans handle unnecessarily in volumes.c
Some functions never use the transaction handle passed to them.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2012-01-11 10:26:42 +08:00
Al Viro 10f6327b5d btrfs: fix a deadlock in btrfs_scan_one_device()
pathname resolution under a global mutex, taken on some paths in ->mount()
is a Bad Idea(tm) - think what happens if said pathname resolution triggers
automount of some btrfs instance and walks into attempt to grab the same
mutex.  Deadlock - we are waiting for daemon to finish walking the path,
daemon is waiting for us to release the mutex...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-08 19:33:24 -05:00
Chris Mason 1100373f8a Btrfs: use bigger metadata chunks on bigger filesystems
The 256MB chunk is a little small on a huge FS.  This scales up the
chunk size.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-01-06 15:47:38 -05:00
Stefan Behrens 21adbd5cbb Btrfs: integrate integrity check module into btrfs
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.

Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
2011-12-21 19:14:17 +01:00
Chris Mason d85c8a6f1b Btrfs: unplug every once and a while
The btrfs io submission threads can build up massive plug lists.  This
keeps things more reasonable so we don't hand over huge dumps of IO at
once.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-15 15:38:41 -05:00
Chris Mason 5dbc8fca8e Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror
btrfs_end_bio checks the number of errors on a bio against the max
number of errors allowed before sending any EIOs up to the higher
levels.

If we got enough copies of the bio done for a given raid level, it is
supposed to clear the bio error flag and return success.

We have pointers to the original bio sent down by the higher layers and
pointers to any cloned bios we made for raid purposes.  If the original
bio happens to be the one that got an io error, but not the last one to
finish, it might not have the BIO_UPTODATE bit set.

Then, when the last bio does finish, we'll call bio_end_io on the
original bio.  It won't have the uptodate bit set and we'll end up
sending EIO to the higher layers.

We already had a check for this, it just was conditional on getting the
IO error on the very last bio.  Make the check unconditional so we eat
the EIOs properly.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-09 11:07:37 -05:00
Li Zefan a5d1633361 Btrfs: check if the to-be-added device is writable
If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:

[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-12-08 08:55:46 -05:00
Miao Xie 924cd8fbe4 Btrfs: fix nocow when deleting the item
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-10 20:45:04 -05:00
Chris Mason 806468f8bf Merge git://git.jan-o-sch.net/btrfs-unstable into integration
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/extent_io.c
	fs/btrfs/extent_io.h
	fs/btrfs/scrub.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:07:10 -05:00
Chris Mason 531f4b1ae5 Merge branch 'for-chris' of git://github.com/sensille/linux into integration
Conflicts:
	fs/btrfs/ctree.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-11-06 03:05:08 -05:00
David Sterba 6c41761fc6 btrfs: separate superblock items out of fs_info
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)

Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.

Signed-off-by: David Sterba <dsterba@suse.cz>
2011-11-06 03:04:01 -05:00
Ilya Dryomov 20bcd64934 Btrfs: close all bdevs on mount failure
Fix a bug introduced by 20b45077.  We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively.  This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2011-10-20 18:20:57 +02:00
Josef Bacik 2bf64758fd Btrfs: allow us to overcommit our enospc reservations
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases.  There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve.  However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit.  So this
patch adds chunk free space accounting so we always know how much unallocated
space we have.  Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit.  In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much.  Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space.  This makes this fio test

[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test

go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system.  This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2011-10-19 15:12:50 -04:00
Arne Jansen 90519d66ab btrfs: state information for readahead
Add state information for readahead to btrfs_fs_info and btrfs_device

Changes v2:
 - don't wait in radix_trees
 - add own set of workers for readahead

Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
2011-10-02 08:48:30 +02:00
Jan Schmidt 2774b2ca3d btrfs: Put mirror_num in bi_bdev
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Jan Schmidt a1d3c4786a btrfs: btrfs_multi_bio replaced with btrfs_bio
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
2011-09-29 13:38:42 +02:00
Miao Xie 0e58885961 Btrfs: fix uninitialized sync_pending
sync_pending is uninitialized before it be used, fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:31 -04:00
liubo 38c01b9605 Btrfs: fix a bug of balance on full multi-disk partitions
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.

Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:

device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:15 -04:00
Josef Bacik d5e2003c2b Btrfs: detect wether a device supports discard
We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent.  This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay.  So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it.  And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-16 21:09:15 -04:00
Chris Mason 2ab1ba68ae Btrfs: force unplugs when switching from high to regular priority bios
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.

Synchronous writes go to the high priority thread while async writes
go to regular list.  This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-08-05 13:48:18 -04:00
Chris Mason b43b31bdf2 Merge branch 'alloc_path' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/btrfs-error-handling into for-linus 2011-08-01 14:27:34 -04:00
Chris Mason 85d4e46111 Btrfs: make a lockdep class for each root
This patch was originally from Tejun Heo.  lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time.  The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.

This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-27 12:46:46 -04:00
Mark Fasheh 92b8e897f6 btrfs: Don't BUG_ON alloc_path errors in find_next_chunk
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.

do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-25 14:34:54 -07:00
Mark Fasheh 17e9f796bd btrfs: Don't BUG_ON alloc_path errors in btrfs_balance()
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).

Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2011-07-14 14:14:45 -07:00
Josef Bacik 508794eb5e Btrfs: don't panic if we get an error while balancing V2
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up.  This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass.  To fix we just exit.  Thanks,

Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-07-06 18:46:43 -04:00
Ilya Dryomov 22b63a2971 Btrfs - use %pU to print fsid
Get rid of FIXME comment.  Uuids from dmesg are now the same as uuids
given by btrfs-progs.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-10 19:02:04 -04:00
Arne Jansen 5f3f302a6f btrfs: false BUG_ON when degraded
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.

Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-06-04 08:03:44 -04:00
Chris Mason d6c0cb379c Merge branch 'cleanups_and_fixes' into inode_numbers
Conflicts:
	fs/btrfs/tree-log.c
	fs/btrfs/volumes.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 14:37:47 -04:00
Xiao Guangrong 1f78160ce1 Btrfs: using rcu lock in the reader side of devices list
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong 4622470565 Btrfs: drop unnecessary device lock
Drop device_list_mutex for the reader side  on clone_fs_devices and
btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device
list is not updated

btrfs_close_extra_devices is the initialized path, we can not add or remove
device at this time, so we can simply drop the mutex safely, like other
initialized function does(add_missing_dev, __find_device, __btrfs_open_devices
...).

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong 0c1daee085 Btrfs: fix the race between remove dev and alloc chunk
On remove device path, it updates device->dev_alloc_list but does not hold
chunk lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:43 -04:00
Xiao Guangrong c9513edb00 Btrfs: fix the race between reading and updating devices
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices

On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:42 -04:00
Xiao Guangrong 4f6c9328c6 Btrfs: fix bh leak on __btrfs_open_devices path
'bh' is forgot to release if no error is detected

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:42 -04:00
Tsutomu Itoh 65a246c5ff Btrfs: return error code to caller when btrfs_del_item fails
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Tsutomu Itoh b0b802d7e3 Btrfs: return error code to caller when btrfs_previous_item fails
The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns the error.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 13:24:39 -04:00
Chris Mason 712673339a Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Conflicts:
	fs/btrfs/Makefile
	fs/btrfs/ctree.h
	fs/btrfs/volumes.h

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-23 06:30:52 -04:00
Chris Mason aa2dfb372a Merge branch 'allocator' of git://git.kernel.org/pub/scm/linux/kernel/git/arne/btrfs-unstable-arne into inode_numbers
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2011-05-22 12:36:34 -04:00
Arne Jansen 73c5de0051 btrfs: quasi-round-robin for chunk allocation
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
2011-05-13 15:36:14 +02:00