Commit graph

389483 commits

Author SHA1 Message Date
Ilya Dryomov 964fb15acf Btrfs: eliminate races in worker stopping code
The current implementation of worker threads in Btrfs has races in
worker stopping code, which cause all kinds of panics and lockups when
running btrfs/011 xfstest in a loop.  The problem is that
btrfs_stop_workers is unsynchronized with respect to check_idle_worker,
check_busy_worker and __btrfs_start_workers.

E.g., check_idle_worker race flow:

       btrfs_stop_workers():            check_idle_worker(aworker):
- grabs the lock
- splices the idle list into the
  working list
- removes the first worker from the
  working list
- releases the lock to wait for
  its kthread's completion
                                  - grabs the lock
                                  - if aworker is on the working list,
                                    moves aworker from the working list
                                    to the idle list
                                  - releases the lock
- grabs the lock
- puts the worker
- removes the second worker from the
  working list
                              ......
        btrfs_stop_workers returns, aworker is on the idle list
                 FS is umounted, memory is freed
                              ......
              aworker is waken up, fireworks ensue

With this applied, I wasn't able to trigger the problem in 48 hours,
whereas previously I could reliably reproduce at least one of these
races within an hour.

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-10-04 16:02:13 -04:00
Liu Bo 385fe0bede Btrfs: fix crash of compressed writes
The crash[1] is found by xfstests/generic/208 with "-o compress",
it's not reproduced everytime, but it does panic.

The bug is quite interesting, it's actually introduced by a recent commit
(573aecafca,
Btrfs: actually limit the size of delalloc range).

Btrfs implements delay allocation, so during writeback, we
(1) get a page A and lock it
(2) search the state tree for delalloc bytes and lock all pages within the range
(3) process the delalloc range, including find disk space and create
    ordered extent and so on.
(4) submit the page A.

It runs well in normal cases, but if we're in a racy case, eg.
buffered compressed writes and aio-dio writes,
sometimes we may fail to lock all pages in the 'delalloc' range,
in which case, we need to fall back to search the state tree again with
a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset).

The mentioned commit has a side effect, that is, in the fallback case,
we can find delalloc bytes before the index of the page we already have locked,
so we're in the case of (delalloc_end <= *start) and return with (found > 0).

This ends with not locking delalloc pages but making ->writepage still
process them, and the crash happens.

This fixes it by just thinking that we find nothing and returning to caller
as the caller knows how to deal with it properly.

[1]:
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2170!
[...]
CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G           O 3.11.0+ #8
[...]
RIP: 0010:[<ffffffff810f5093>]  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[...]
[ 4934.248731] Stack:
[ 4934.248731]  ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a
[ 4934.248731]  0000000000000000 0000000000000000 0000000000000fff 0000000000000620
[ 4934.248731]  ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0
[ 4934.248731] Call Trace:
[ 4934.248731]  [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs]
[ 4934.248731]  [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs]
[ 4934.248731]  [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b
[ 4934.248731]  [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs]
[ 4934.248731]  [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs]
[ 4934.248731]  [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
[ 4934.248731]  [<ffffffff810608f5>] kthread+0x8d/0x95
[ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731]  [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0
[ 4934.248731]  [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43
[ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44
[ 4934.248731] RIP  [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83
[ 4934.248731]  RSP <ffff8801869f9c48>
[ 4934.280307] ---[ end trace 36f06d3f8750236a ]---

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-10-04 16:02:11 -04:00
Josef Bacik 60e7cd3a4b Btrfs: fix transid verify errors when recovering log tree
If we crash with a log, remount and recover that log, and then crash before we
can commit another transaction we will get transid verify errors on the next
mount.  This is because we were not zero'ing out the log when we committed the
transaction after recovery.  This is ok as long as we commit another transaction
at some point in the future, but if you abort or something else goes wrong you
can end up in this weird state because the recovery stuff says that the tree log
should have a generation+1 of the super generation, which won't be the case of
the transaction that was started for recovery.  Fix this by removing the check
and _always_ zero out the log portion of the super when we commit a transaction.
This fixes the transid verify issues I was seeing with my force errors tests.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-10-04 16:02:09 -04:00
Josef Bacik 94aebfb2e7 Btrfs: create the uuid tree on remount rw
Users have been complaining of the uuid tree stuff warning that there is no uuid
root when trying to do snapshot operations.  This is because if you mount -o ro
we will not create the uuid tree.  But then if you mount -o rw,remount we will
still not create it and then any subsequent snapshot/subvol operations you try
to do will fail gloriously.  Fix this by creating the uuid_root on remount rw if
it was not already there.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:50:43 -04:00
Mark Fasheh cbf8b8ca3e btrfs: change extent-same to copy entire argument struct
btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data
back to it's argument struct. Unfortunately, not all architectures provide
__put_user_unaligned(), so compiles break on them if btrfs is selected.

Instead, just copy the whole struct in / out at the start and end of
operations, respectively.

Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:31 -04:00
Guangyu Sun 93fd63c2f0 Btrfs: dir_inode_operations should use btrfs_update_time also
Commit 2bc5565286 (Btrfs: don't update atime on
RO subvolumes) ensures that the access time of an inode is not updated when
the inode lives in a read-only subvolume.
However, if a directory on a read-only subvolume is accessed, the atime is
updated. This results in a write operation to a read-only subvolume. I
believe that access times should never be updated on read-only subvolumes.

To reproduce:

 # mkfs.btrfs -f /dev/dm-3
 (...)
 # mount /dev/dm-3 /mnt
 # btrfs subvol create /mnt/sub
 	Create subvolume '/mnt/sub'
 # mkdir /mnt/sub/dir
 # echo "abc" > /mnt/sub/dir/file
 # btrfs subvol snapshot -r /mnt/sub /mnt/rosnap
 	Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap'
 # stat /mnt/rosnap/dir
 	File: `/mnt/rosnap/dir'
 	Size: 8         Blocks: 0          IO Block: 4096   directory
 Device: 16h/22d    Inode: 257         Links: 1
 Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
 	Access: 2013-09-11 07:21:49.389157126 -0400
 	Modify: 2013-09-11 07:22:02.330156079 -0400
 	Change: 2013-09-11 07:22:02.330156079 -0400
 # ls /mnt/rosnap/dir
 	file
 # stat /mnt/rosnap/dir
 	File: `/mnt/rosnap/dir'
 	Size: 8         Blocks: 0          IO Block: 4096   directory
 Device: 16h/22d    Inode: 257         Links: 1
 Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
 	Access: 2013-09-11 07:22:56.797151670 -0400
 	Modify: 2013-09-11 07:22:02.330156079 -0400
 	Change: 2013-09-11 07:22:02.330156079 -0400

Reported-by: Koen De Wit <koen.de.wit@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:30 -04:00
Frank Holton 5138cccf34 btrfs: Add btrfs: prefix to kernel log output
The kernel log entries for device label %s and device fsid %pU
are missing the btrfs: prefix. Add those here.

Signed-off-by: Frank Holton <fholton@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:30 -04:00
David Sterba 6ef3de9c92 btrfs: refuse to remount read-write after abort
It's still possible to flip the filesystem into RW mode after it's
remounted RO due to an abort. There are lots of places that check for
the superblock error bit and will not write data, but we should not let
the filesystem appear read-write.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:30 -04:00
chandan 1cecf579d1 Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0
This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default
subvolume by passing a subvolume id of 0.

Signed-off-by: chandan <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:29 -04:00
Filipe David Borba Manana a0634be562 Btrfs: don't leak transaction in btrfs_sync_file()
In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns
a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would
return without ending/freeing the transaction.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:29 -04:00
Stefan Behrens a724b43690 Btrfs: add the missing mutex unlock in write_all_supers()
The BUG() was replaced by btrfs_error() and return -EIO with the
patch "get rid of one BUG() in write_all_supers()", but the missing
mutex_unlock() was overlooked.

The 0-DAY kernel build service from Intel reported the missing
unlock which was found by the coccinelle tool:

    fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:28 -04:00
Josef Bacik f4ab9ea706 Btrfs: iput inode on allocation failure
We don't do the iput when we fail to allocate our delayed delalloc work in
__start_delalloc_inodes, fix this.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:28 -04:00
Josef Bacik 363e4d354e Btrfs: remove space_info->reservation_progress
This isn't used for anything anymore, just remove it.

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:27 -04:00
Josef Bacik f0de181c9b Btrfs: kill delay_iput arg to the wait_ordered functions
This is a left over of how we used to wait for ordered extents, which was to
grab the inode and then run filemap flush on it.  However if we have an ordered
extent then we already are holding a ref on the inode, and we just use
btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on
the inode to start work on the ordered extent.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:27 -04:00
Josef Bacik c4fbb4300a Btrfs: fix worst case calculator for space usage
Forever ago I made the worst case calculator say that we could potentially split
into 3 blocks for every level on the way down, which isn't right.  If we split
we're only going to get two new blocks, the one we originally cow'ed and the new
one we're going to split.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:27 -04:00
Josef Bacik 14575aef42 Revert "Btrfs: rework the overcommit logic to be based on the total size"
This reverts commit 70afa3998c.  It is causing
performance issues and wasn't actually correct.  There were problems with the
way we flushed delalloc and that was the real cause of the early enospc.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:26 -04:00
Josef Bacik 652f25a292 Btrfs: improve replacing nocow extents
Various people have hit a deadlock when running btrfs/011.  This is because when
replacing nocow extents we will take the i_mutex to make sure nobody messes with
the file while we are replacing the extent.  The problem is we are already
holding a transaction open, which is a locking inversion, so instead we need to
save these inodes we find and then process them outside of the transaction.

Further we can't just lock the inode and assume we are good to go.  We need to
lock the extent range and then read back the extent cache for the inode to make
sure the extent really still points at the physical block we want.  If it
doesn't we don't have to copy it.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:26 -04:00
Josef Bacik d555438b6e Btrfs: drop dir i_size when adding new names on replay
So if we have dir_index items in the log that means we also have the inode item
as well, which means that the inode's i_size is correct.  However when we
process dir_index'es we call btrfs_add_link() which will increase the
directory's i_size for the new entry.  To fix this we need to just set the dir
items i_size to 0, and then as we find dir_index items we adjust the i_size.
btrfs_add_link() will do it for new entries, and if the entry already exists we
can just add the name_len to the i_size ourselves.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:25 -04:00
Josef Bacik dd8e721773 Btrfs: replay dir_index items before other items
A user reported a bug where his log would not replay because he was getting
-EEXIST back.  This was because he had a file moved into a directory that was
logged.  What happens is the file had a lower inode number, and so it is
processed first when replaying the log, and so we add the inode ref in for the
directory it was moved to.  But then we process the directories DIR_INDEX item
and try to add the inode ref for that inode and it fails because we already
added it when we replayed the inode.  To solve this problem we need to just
process any DIR_INDEX items we have in the log first so this all is taken care
of, and then we can replay the rest of the items.  With this patch my reproducer
can remount the file system properly instead of erroring out.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:25 -04:00
Josef Bacik a5874ce6ce Btrfs: check roots last log commit when checking if an inode has been logged
Liu introduced a local copy of the last log commit for an inode to make sure we
actually log an inode even if a log commit has already taken place.  In order to
make sure we didn't relog the same inode multiple times he set this local copy
to the current trans when we log the inode, because usually we log the inode and
then sync the log.  The exception to this is during rename, we will relog an
inode if the name changed and it is already in the log.  The problem with this
is then we go to sync the inode, and our check to see if the inode has already
been logged is tripped and we don't sync the log.  To fix this we need to _also_
check against the roots last log commit, because it could be less than what is
in our local copy of the log commit.  This fixes a bug where we rename a file
into a directory and then fsync the directory and then on remount the directory
is no longer there.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:24 -04:00
Josef Bacik de2b530bfb Btrfs: actually log directory we are fsync()'ing
If you just create a directory and then fsync that directory and then pull the
power plug you will come back up and the directory will not be there.  That is
because we won't actually create directories if we've logged files inside of
them since they will be created on replay, but in this check we will set our
logged_trans of our current directory if it happens to be a directory, making us
think it doesn't need to be logged.  Fix the logic to only do this to parent
directories.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:24 -04:00
Josef Bacik 573aecafca Btrfs: actually limit the size of delalloc range
So forever we have had this thing to limit the amount of delalloc pages we'll
setup to be written out to 128mb.  This is because we have to lock all the pages
in this range, so anything above this gets a bit unweildly, and also without a
limit we'll happily allocate gigantic chunks of disk space.  Turns out our check
for this wasn't quite right, we wouldn't actually limit the chunk we wanted to
write out, we'd just stop looking for more space after we went over the limit.
So if you do a giant 20gb dd on my box with lots of ram I could get 2gig
extents.  This is fine normally, except when you go to relocate these extents
and we can't find enough space to relocate these moster extents, since we have
to be able to allocate exactly the same sized extent to move it around.  So fix
this by actually enforcing the limit.  With this patch I'm no longer seeing
giant 1.5gb extents.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:24 -04:00
Miao Xie a482039889 Btrfs: allocate the free space by the existed max extent size when ENOSPC
By the current code, if the requested size is very large, and all the extents
in the free space cache are small, we will waste lots of the cpu time to cut
the requested size in half and search the cache again and again until it gets
down to the size the allocator can return. In fact, we can know the max extent
size in the cache after the first search, so we needn't cut the size in half
repeatedly, and just use the max extent size directly. This way can save
lots of cpu time and make the performance grow up when there are only fragments
in the free space cache.

According to my test, if there are only 4KB free space extents in the fs,
and the total size of those extents are 256MB, we can reduce the execute
time of the following test from 5.4s to 1.4s.
  dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync

Changelog v2 -> v3:
- fix the problem that we skip the block group with the space which is
  less than we need.

Changelog v1 -> v2:
- address the problem that we return a wrong start position when searching
  the free space in a bitmap.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 11:05:23 -04:00
David Sterba 13fd8da98f btrfs: add lockdep and tracing annotations for uuid tree
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:56 -04:00
Stefan Behrens 79556c3d88 btrfs: show compiled-in config features at module load time
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.

(This commit message is a copy of David Sterba's commit message when
he introduced btrfs_print_info()).

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:56 -04:00
Filipe David Borba Manana cef2193729 Btrfs: more efficient inode tree replace operation
Instead of removing the current inode from the red black tree
and then add the new one, just use the red black tree replace
operation, which is more efficient.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:55 -04:00
Ilya Dryomov 55e50e458e Btrfs: do not add replace target to the alloc_list
If replace was suspended by the umount, replace target device is added
to the fs_devices->alloc_list during a later mount.  This is obviously
wrong.  ->is_tgtdev_for_dev_replace is supposed to guard against that,
but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized
*after* everything is opened and fs_devices lists are populated.  Fix
this by checking the devid instead: for replace targets it's always
equal to BTRFS_DEV_REPLACE_DEVID.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:55 -04:00
Josef Bacik 83d4cfd4da Btrfs: fixup error handling in btrfs_reloc_cow
If we failed to actually allocate the correct size of the extent to relocate we
will end up in an infinite loop because we won't return an error, we'll just
move on to the next extent.  So fix this up by returning an error, and then fix
all the callers to return an error up the stack rather than BUG_ON()'ing.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-21 10:58:54 -04:00
Chris Mason 07f0e62e7f Linux 3.11
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSJPkeAAoJEHm+PkMAQRiGVWMH/jo5f01Ra7G4/CYS59K+AlBQ
 /oWL3W81r5MORlsMxwUwGtJ3sZ7UulKwiDrluWeOkz2+/9SmoHoUfkpbByq1bSIV
 y0eqhmjtkHQZz5radJIHeyz1gJIICBIgAM0l45j8SpK4n9EXRcjLSZjdjAkPzxZp
 qZpfxKhVSTu79m96bud7F+HrboHDQEyhD9zqdSi4xPQNnOmTc7K3tvui9AB3rMbV
 ablM3C+LqBYjZx+pKS/rOdfATxZvtU392HU53XTALt6VD1e8alMmhmpe0I9Zxvjv
 scsB6hfRkevfe7VaK3aVoDnQnLKd61yxs+/XdzTtkWPbVGp+kiuFUdDv/5y2r1g=
 =7Xf6
 -----END PGP SIGNATURE-----

Merge tag 'v3.11' into for-linus

Linux 3.11
2013-09-21 10:44:55 -04:00
Linus Torvalds 6e4664525b Linux 3.11 2013-09-02 13:46:10 -07:00
Linus Torvalds 248d296d6d SCSI fixes on 20130831
This is a bug fix for the pm80xx driver.  It turns out that when the new
 hardware support was added in 3.10 the IO command size was kept at the old
 hard coded value.  This means that the driver attaches to some new cards and
 then simply hangs the system.
 
 Signed-off-by: James Bottomley <JBottomley@Parallels.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQEcBAABAgAGBQJSIh8wAAoJEDeqqVYsXL0MCOIH/3Ii/4xKN7BK/G7UYVj7QuIu
 lxmshuc6FUJJkg4fZiV3oHQgkYiUoOOYTVWg+rEKycE1XZS8b3E5BVTlM2+NHezo
 OcjFmctDb5HrElbBL7BrsJwNwSeSL+ATZEqPuOoXQ+CIJ9pkFwm3u1ernDLsM0bB
 PuDRn1duAbyUscHNqYsInpg2a21F1cuoLIzz/ziHgXtjRre30An2wZjmNVwDKeaY
 UhnCvjUy37LFFWL3mLVaS0fhkCS484uKRyloX0FJdLgtfzGvOFGF01f02gmcziti
 o0+PqIhV2wPvGpiNea761JN5opxc/IhhhPapR0kaj9Qig79TP9wjEZ8ynnQvvG4=
 =i73i
 -----END PGP SIGNATURE-----

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fix from James Bottomley:
 "This is a bug fix for the pm80xx driver.  It turns out that when the
  new hardware support was added in 3.10 the IO command size was kept at
  the old hard coded value.  This means that the driver attaches to some
  new cards and then simply hangs the system"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  [SCSI] pm80xx: fix Adaptec 71605H hang
2013-09-02 10:43:13 -07:00
Linus Torvalds e09a1fa9be Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 boot fix from Peter Anvin:
 "A single very small boot fix for very large memory systems (> 0.5T)"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM
2013-09-02 09:55:14 -07:00
Linus Torvalds ac0bc7899a Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dma fix from Vinod Koul:
 "A fix for resolving TI_EDMA driver's build error in allmodconfig to
  have filter function built in""

* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
  dma/Kconfig: TI_EDMA needs to be boolean
2013-09-02 09:54:06 -07:00
Filipe David Borba Manana d7396f0735 Btrfs: optimize key searches in btrfs_search_slot
When the binary search returns 0 (exact match), the target key
will necessarily be at slot 0 of all nodes below the current one,
so in this case the binary search is not needed because it will
always return 0, and we waste time doing it, holding node locks
for longer than necessary, etc.

Below follow histograms with the times spent on the current approach of
doing a binary search when the previous binary search returned 0, and
times for the new approach, which directly picks the first item/child
node in the leaf/node.

Current approach:

Count: 6682
Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429
Percentiles:  90th: 124.000; 95th: 145.000; 99th: 206.000
  35.000 -   61.080:  1235 ################
  61.080 -  106.053:  4207 #####################################################
 106.053 -  183.606:  1122 ##############
 183.606 -  317.341:   111 #
 317.341 -  547.959:     6 |
 547.959 - 8370.000:     1 |

Approach proposed by this patch:

Count: 6682
Range:  6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev:  7.160
Percentiles:  90th: 23.000; 95th: 27.000; 99th: 40.000
   6.000 -    8.418:    58 #
   8.418 -   11.670:  1149 #########################
  11.670 -   16.046:  2418 #####################################################
  16.046 -   21.934:  2098 ##############################################
  21.934 -   29.854:   744 ################
  29.854 -   40.511:   154 ###
  40.511 -   54.848:    41 #
  54.848 -   74.136:     5 |
  74.136 -  100.087:     9 |
 100.087 -  135.000:     6 |

These samples were captured during a run of the btrfs tests 001, 002 and
004 in the xfstests, with a leaf/node size of 4Kb.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:42 -04:00
Josef Bacik 45d5fd14d2 Btrfs: don't use an async starter for most of our workers
We only need an async starter if we can't make a GFP_NOFS allocation in our
current path.  This is the case for the endio stuff since it happens in IRQ
context, but things like the caching thread workers and the delalloc flushers we
can easily make this allocation and start threads right away.  Also change the
worker count for the caching thread pool.  Traditionally we limited this to 2
since we took read locks while caching, but nowadays we do this lockless so
there's no reason to limit the number of caching threads.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:41 -04:00
Josef Bacik 7f4f6e0a3f Btrfs: only update disk_i_size as we remove extents
This fixes a problem where if we fail a truncate we will leave the i_size set
where we wanted to truncate to instead of where we were able to truncate to.
Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it
removes extents, that way it will always be consistent with where its extents
are.  Then if the truncate fails at all we can update the in-ram i_size with
what we have on disk and delete the orphan item.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:40 -04:00
Filipe David Borba Manana f45388f387 Btrfs: fix deadlock in uuid scan kthread
If there's an ongoing transaction when the uuid scan kthread attempts
to create one, the kthread will block, waiting for that transaction to
finish while it's keeping locks on the tree root, and in turn the existing
transaction is waiting for those locks to be free.

The stack trace reported by the kernel follows.

[36700.671601] INFO: task btrfs-uuid:15480 blocked for more than 120 seconds.
[36700.671602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[36700.671602] btrfs-uuid      D 0000000000000000     0 15480      2 0x00000000
[36700.671604]  ffff880710bd5b88 0000000000000046 ffff8803d36ba850 0000000000030000
[36700.671605]  ffff8806d76dc530 ffff880710bd5fd8 ffff880710bd5fd8 ffff880710bd5fd8
[36700.671607]  ffff8808098ac530 ffff8806d76dc530 ffff880710bd5b98 ffff8805e4508e40
[36700.671608] Call Trace:
[36700.671610]  [<ffffffff816f36b9>] schedule+0x29/0x70
[36700.671620]  [<ffffffffa05a3bdf>] wait_current_trans.isra.33+0xbf/0x120 [btrfs]
[36700.671623]  [<ffffffff81066760>] ? add_wait_queue+0x60/0x60
[36700.671629]  [<ffffffffa05a5b06>] start_transaction+0x3d6/0x530 [btrfs]
[36700.671636]  [<ffffffffa05bb1f4>] ? btrfs_get_token_32+0x64/0xf0 [btrfs]
[36700.671642]  [<ffffffffa05a5fbb>] btrfs_start_transaction+0x1b/0x20 [btrfs]
[36700.671649]  [<ffffffffa05c8a81>] btrfs_uuid_scan_kthread+0x211/0x3d0 [btrfs]
[36700.671655]  [<ffffffffa05c8870>] ? __btrfs_open_devices+0x2a0/0x2a0 [btrfs]
[36700.671657]  [<ffffffff81065fa0>] kthread+0xc0/0xd0
[36700.671659]  [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0
[36700.671661]  [<ffffffff816fcd1c>] ret_from_fork+0x7c/0xb0
[36700.671662]  [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0
[36700.671663] INFO: task btrfs:15481 blocked for more than 120 seconds.
[36700.671664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[36700.671665] btrfs           D 0000000000000000     0 15481  15212 0x00000004
[36700.671666]  ffff880248cbf4c8 0000000000000086 ffff8803d36ba700 ffff8801dbd5c280
[36700.671668]  ffff880807815c40 ffff880248cbffd8 ffff880248cbffd8 ffff880248cbffd8
[36700.671669]  ffff8805e86a0000 ffff880807815c40 ffff880248cbf4d8 ffff8801dbd5c280
[36700.671670] Call Trace:
[36700.671672]  [<ffffffff816f36b9>] schedule+0x29/0x70
[36700.671679]  [<ffffffffa05d9b0d>] btrfs_tree_lock+0x6d/0x230 [btrfs]
[36700.671680]  [<ffffffff81066760>] ? add_wait_queue+0x60/0x60
[36700.671685]  [<ffffffffa0582829>] btrfs_search_slot+0x999/0xb00 [btrfs]
[36700.671691]  [<ffffffffa05bd9de>] ? btrfs_lookup_first_ordered_extent+0x5e/0xb0 [btrfs]
[36700.671698]  [<ffffffffa05e3e54>] __btrfs_write_out_cache+0x8c4/0xa80 [btrfs]
[36700.671704]  [<ffffffffa05e4362>] btrfs_write_out_cache+0xb2/0xf0 [btrfs]
[36700.671710]  [<ffffffffa05c4441>] ? free_extent_buffer+0x61/0xc0 [btrfs]
[36700.671716]  [<ffffffffa0594c82>] btrfs_write_dirty_block_groups+0x562/0x650 [btrfs]
[36700.671723]  [<ffffffffa0610092>] commit_cowonly_roots+0x171/0x24b [btrfs]
[36700.671729]  [<ffffffffa05a4dde>] btrfs_commit_transaction+0x4fe/0xa10 [btrfs]
[36700.671735]  [<ffffffffa0610af3>] create_subvol+0x5c0/0x636 [btrfs]
[36700.671742]  [<ffffffffa05d49ff>] btrfs_mksubvol.isra.60+0x33f/0x3f0 [btrfs]
[36700.671747]  [<ffffffffa05d4bf2>] btrfs_ioctl_snap_create_transid+0x142/0x190 [btrfs]
[36700.671752]  [<ffffffffa05d4c6c>] ? btrfs_ioctl_snap_create+0x2c/0x80 [btrfs]
[36700.671757]  [<ffffffffa05d4c9e>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs]
[36700.671759]  [<ffffffff8113a764>] ? handle_pte_fault+0x84/0x920
[36700.671764]  [<ffffffffa05d87eb>] btrfs_ioctl+0xf0b/0x1d00 [btrfs]
[36700.671766]  [<ffffffff8113c120>] ? handle_mm_fault+0x210/0x310
[36700.671768]  [<ffffffff816f83a4>] ? __do_page_fault+0x284/0x4e0
[36700.671770]  [<ffffffff81180aa6>] do_vfs_ioctl+0x96/0x550
[36700.671772]  [<ffffffff81170fe3>] ? __sb_end_write+0x33/0x70
[36700.671774]  [<ffffffff81180ff1>] SyS_ioctl+0x91/0xb0
[36700.671775]  [<ffffffff816fcdc2>] system_call_fastpath+0x16/0x1b

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:39 -04:00
Ilya Dryomov 795a332139 Btrfs: stop refusing the relocation of chunk 0
AFAICT chunk 0 is no longer special, and so it should be restriped just
like every other chunk.  One reason for this change is us refusing the
relocation can lead to filesystems that can only be mounted ro, and
never rw -- see the bugzilla [1] for details.  The other reason is that
device removal code is already doing this: it will happily relocate
chunk 0 is part of shrinking the device.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=60594

Reported-by: Xavier Bassery <xavier@bartica.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:38 -04:00
Filipe David Borba Manana d8f980391f Btrfs: fix memory leak of uuid_root in free_fs_info
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:37 -04:00
Andy Shevchenko ed84885d1e btrfs: reuse kbasename helper
To get name of the file from a pathname let's use kbasename() helper. It allows
to simplify code a bit.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:36 -04:00
Anand Jain e57138b3e9 btrfs: return btrfs error code for dev excl ops err
now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
as defined in btrfs.h for the dev excl operation error in
the FS, which means with this kernel would stop logging
(almost an user error) into the /var/log/messages

v2: accepts Josef' comment

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:35 -04:00
Josef Bacik 77cef2ec54 Btrfs: allow partial ordered extent completion
We currently have this problem where you can truncate pages that have not yet
been written for an ordered extent.  We do this because the truncate will be
coming behind to clean us up anyway so what's the harm right?  Well if truncate
fails for whatever reason we leave an orphan item around for the file to be
cleaned up later.  But if the user goes and truncates up the file and tries to
read from the area that had been discarded previously they will get a csum error
because we never actually wrote that data out.

This patch fixes this by allowing us to either discard the ordered extent
completely, by which I mean we just free up the space we had allocated and not
add the file extent, or adjust the length of the file extent we write.  We do
this by setting the length we truncated down to in the ordered extent, and then
we set the file extent length and ram bytes to this length.  The total disk
space stays unchanged since we may be compressed and we can't just chop off the
disk space, but at least this way the file extent only points to the valid data.
Then when the file extent is free'd the extent and csums will be freed normally.

This patch is needed for the next series which will give us more graceful
recovery of failed truncates.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:34 -04:00
Josef Bacik b12d6869f6 Btrfs: convert all bug_ons in free-space-cache.c
All of these are logic checks to make sure we're not breaking anything, so
convert them over to ASSERT().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:33 -04:00
Josef Bacik 2e17c7c65e Btrfs: add support for asserts
One of the complaints we get a lot is how many BUG_ON()'s we have.  So to help
with this I'm introducing a kconfig option to enable/disable a new ASSERT()
mechanism much like what XFS does.  This will allow us developers to still get
our nice panics but allow users/distros to compile them out.  With this we can
go through and convert any BUG_ON()'s that we have to catch actual programming
mistakes to the new ASSERT() and then fix everybody else to return errors.  This
will also allow developers to leave sanity checks in their new code to make sure
we don't trip over problems while testing stuff and vetting new features.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:32 -04:00
Josef Bacik 726551ebc7 Btrfs: adjust the fs_devices->missing count on unmount
I noticed that if I tried to mount a file system with -o degraded after having
done it once already we would fail to mount.  This is because the
fs_devices->missing count was getting bumped everytime we mounted, but not
getting reset whenever we unmounted.  To fix this we just drop the missing count
as we're closing devices to make sure this doesn't happen.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:31 -04:00
Stefan Behrens 23fa76b0ba Btrf: cleanup: don't check for root_refs == 0 twice
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in three more places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:29 -04:00
Stefan Behrens 4847547172 Btrfs: fix for patch "cleanup: don't check the same thing twice"
Mitch Harder noticed that the patch 3c64a1a mentioned in the subject
line was causing a kernel BUG() on snapshot deletion.

The patch was wrong. It did not handle cached roots correctly. The
check for root_refs == 0 was removed everywhere where
btrfs_read_fs_root_no_name() had been used to retrieve the root,
because this check was already dealt with in
btrfs_read_fs_root_no_name(). But in the case when the root was
found in the cache, there was no such check.

This patch adds the missing check in the case where the root is
found in the cache.

Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:29 -04:00
Stefan Behrens 9d565ba433 Btrfs: get rid of one BUG() in write_all_supers()
The second round uses btrfs_error() and return -EIO, the first round
can handle write errors the same way.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:28 -04:00
Wang Shilong b9e9a6cbc6 Btrfs: allocate prelim_ref with a slab allocater
struct __prelim_ref is allocated and freed frequently when
walking backref tree, using slab allocater can not only
speed up allocating but also detect memory leaks.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:27 -04:00
Wang Shilong 742916b885 Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC
Currently, only add_delayed_refs have to allocate with GFP_ATOMIC,
So just pass arg 'gfp_t' to decide which allocation mode.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-09-01 08:16:26 -04:00