Commit graph

1437 commits

Author SHA1 Message Date
Sage Weil 050006a753 Btrfs: fix clone ioctl where range is adjacent to extent
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Sage Weil 9a019196ec Btrfs: fix delalloc checks in clone ioctl
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().

Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:37:33 -04:00
Chris Mason d8e39c457b Btrfs: drop unused variable in block_alloc_rsv
The alloc_target variable is not really used.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:17:41 -04:00
Andi Kleen 559af82114 Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.

Still needs more review.

Found by gcc 4.6's new warnings

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:37 -04:00
Andi Kleen 411fc6bcef Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
These are all the cases where a variable is set, but not
read which are really bugs.

- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things

Still needs more review.

Found by gcc 4.6's new warnings.

[akpm@linux-foundation.org: fix build.  Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:31 -04:00
Julia Lawall d0b678cb0a Btrfs: Use ERR_CAST helpers
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)).  The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
type T;
T x;
identifier f;
@@

T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
 ...+> }

@@
expression x;
@@

- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:23 -04:00
Julia Lawall 2354d08fe9 Btrfs: use memdup_user helpers
Use memdup_user when user data is immediately copied into the
allocated region.

The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@

-  to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+  to = memdup_user(from,size);
   if (
-      to==NULL
+      IS_ERR(to)
                 || ...) {
   <+... when != goto l1;
-  -ENOMEM
+  PTR_ERR(to)
   ...+>
   }
-  if (copy_from_user(to, from, size) != 0) {
-    <+... when != goto l2;
-    -EFAULT
-    ...+>
-  }
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:14:18 -04:00
Chris Mason 18e503d695 Btrfs: fix raid code for removing missing drives
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices.  This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.

This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:46 -04:00
Miao Xie 19fe0a8b78 Btrfs: Switch the extent buffer rbtree into a radix tree
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.

I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].

Before applying this patch:
Create files:
	Total files: 50000
	Total time: 0.971531
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.366761
	Average time: 0.000027

After applying this patch:
Create files:
	Total files: 50000
	Total time: 0.927455
	Average time: 0.000019
Delete files:
	Total files: 50000
	Total time: 1.292280
	Average time: 0.000026

[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Miao Xie 897ca6e9b4 Btrfs: restructure try_release_extent_buffer()
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:45 -04:00
Chris Mason bf9022e06a Btrfs: use the flusher threads for delalloc throttling
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.

This switches us to kick the bdi flusher threads instead.  All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:36 -04:00
Chris Mason e5bc245829 Btrfs: tune the chunk allocation to 5% of the FS as metadata
An earlier commit tried to keep us from allocating too many
empty metadata chunks.  It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.

This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 11:25:35 -04:00
Chris Mason cb44921a09 Btrfs: don't loop forever on bad btree blocks
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.

This changes things to properly kick out the EIO.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:31:30 -04:00
Chris Mason 6b5b817f10 Merge branch 'bug-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-work
Conflicts:
	fs/btrfs/extent-tree.c

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 09:27:49 -04:00
Josef Bacik 8216ef866d Btrfs: let the user know space caching is enabled
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:37 -04:00
Josef Bacik 88c2ba3b06 Btrfs: Add a clear_cache mount option
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody.  When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik 67377734fd Btrfs: add support for mixed data+metadata block groups
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups.  Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them.  Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only.  Tested this with xfstests and it works
nicely.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:36 -04:00
Josef Bacik dde5abee12 Btrfs: check cache->caching_ctl before returning if caching has started
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl.  This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw).  So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik 9d66e233c7 Btrfs: load free space cache if it exists
This patch actually loads the free space cache if it exists.  The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it.  Previously we did not do this since the caching
kthread would pick it up.  With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group.  This has been tested with all sorts of things.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:35 -04:00
Josef Bacik 0cb59c9953 Btrfs: write out free space cache
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups.  There are a bunch of changes
in inode.c in order to account for special cases.  Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions.  Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock.  This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-29 09:26:29 -04:00
Al Viro 061dbc6b90 convert btrfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-29 04:17:21 -04:00
Josef Bacik 0af3d00bad Btrfs: create special free space cache inode
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group.  So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have.  We
truncate and pre-allocate space everytime to make sure it's uptodate.

This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way.  When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-28 15:59:09 -04:00
Josef Bacik e9bb7f10d3 Btrfs: remove warn_on from use_block_rsv
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space.  It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems.  But if it does get back ENOSPC it
handles it properly.  The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight.  So instead just remove
the warning.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:55:03 -04:00
Josef Bacik 382279336f Btrfs: set trans to null in reserve_metadata_bytes if we commit the transaction
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation.  So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction.  This fixes a panic I could reproduce
at will.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-26 12:52:53 -04:00
Al Viro 7de9c6ee3e new helper: ihold()
Clones an existing reference to inode; caller must already hold one.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:26:11 -04:00
Al Viro 1d3382cbf0 new helper: inode_unhashed()
note: for race-free uses you inode_lock held

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-25 21:24:15 -04:00
Linus Torvalds a2887097f2 Merge branch 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
  xen-blkfront: disable barrier/flush write support
  Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
  block: remove BLKDEV_IFL_WAIT
  aic7xxx_old: removed unused 'req' variable
  block: remove the BH_Eopnotsupp flag
  block: remove the BLKDEV_IFL_BARRIER flag
  block: remove the WRITE_BARRIER flag
  swap: do not send discards as barriers
  fat: do not send discards as barriers
  ext4: do not send discards as barriers
  jbd2: replace barriers with explicit flush / FUA usage
  jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
  jbd: replace barriers with explicit flush / FUA usage
  nilfs2: replace barriers with explicit flush / FUA usage
  reiserfs: replace barriers with explicit flush / FUA usage
  gfs2: replace barriers with explicit flush / FUA usage
  btrfs: replace barriers with explicit flush / FUA usage
  xfs: replace barriers with explicit flush / FUA usage
  block: pass gfp_mask and flags to sb_issue_discard
  dm: convey that all flushes are processed as empty
  ...
2010-10-22 17:07:18 -07:00
Josef Bacik 0e78340f3c Btrfs: fix error handling in btrfs_get_sb
If we failed to find the root subvol id, or the subvol=<name>, we would
deactivate the locked super and close the devices.  The problem is at this point
we have gotten the SB all setup, which includes setting super_operations, so
when we'd deactiveate the super, we'd do a close_ctree() which closes the
devices, so we'd end up closing the devices twice.  So if you do something like
this

mount /dev/sda1 /mnt/test1
mount /dev/sda1 /mnt/test2 -o subvol=xxx
umount /mnt/test1

it would blow up (if subvol xxx doesn't exist).  This patch fixes that problem.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:03 -04:00
Josef Bacik 8bb8ab2e93 Btrfs: rework how we reserve metadata bytes
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC.  So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing.  The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves.  This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform.  This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:01 -04:00
Josef Bacik 14ed0ca6e8 Btrfs: don't allocate chunks as aggressively
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should.  For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used.  So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:55:00 -04:00
Josef Bacik 0019f10db6 Btrfs: re-work delalloc flushing
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen.  Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate.  The
sync option will be used in a future patch.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:58 -04:00
Josef Bacik 6d48755d02 Btrfs: fix reservation code for mixed block groups
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups.  Instead if we have mixed block groups just set
data used to 0.  Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:56 -04:00
Josef Bacik 89a55897a2 Btrfs: fix df regression
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system.  This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:55 -04:00
Josef Bacik bf5fc093c5 Btrfs: fix the df ioctl to report raid types
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space
info's for each RAID type.  So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:53 -04:00
Josef Bacik a1f765061e Btrfs: stop trying to shrink delalloc if there are no inodes to reclaim
In very severe ENOSPC cases we can run out of inodes to do delalloc on, which
means we'll just keep looping trying to shrink delalloc.  Instead, if we fail to
shrink delalloc 3 times in a row break out since we're not likely to make any
progress.  Tested this with a 100mb fs an xfstests test 13.  Before the patch it
would hang the box, with the patch we get -ENOSPC like we should.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
2010-10-22 15:54:51 -04:00
Arnd Bergmann 6038f373a3 llseek: automatically add .llseek fop
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.

The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.

New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time.  Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.

The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.

Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.

Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.

===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
//   but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}

@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}

@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
   *off = E
|
   *off += E
|
   func(..., off, ...)
|
   E = *off
)
...+>
}

@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}

@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
  *off = E
|
  *off += E
|
  func(..., off, ...)
|
  E = *off
)
...+>
}

@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}

@ fops0 @
identifier fops;
@@
struct file_operations fops = {
 ...
};

@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
 .llseek = llseek_f,
...
};

@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
 .read = read_f,
...
};

@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
 .write = write_f,
...
};

@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
 .open = open_f,
...
};

// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
...  .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};

@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
...  .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};

// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
...  .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};

// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};

// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};

@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+	.llseek = default_llseek, /* write accesses f_pos */
};

// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////

@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
 .write = write_f,
 .read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};

@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};

@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};

@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
2010-10-15 15:53:27 +02:00
Christoph Hellwig dd3932eddf block: remove BLKDEV_IFL_WAIT
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller.  To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead.  So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout.  For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-16 20:52:58 +02:00
Christoph Hellwig c3b9a62c8f btrfs: replace barriers with explicit flush / FUA usage
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-09-10 12:35:39 +02:00
Linus Torvalds 2f9e825d3e Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
  block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
  xen-blkfront: fix missing out label
  blkdev: fix blkdev_issue_zeroout return value
  block: update request stacking methods to support discards
  block: fix missing export of blk_types.h
  writeback: fix bad _bh spinlock nesting
  drbd: revert "delay probes", feature is being re-implemented differently
  drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
  drbd: Disable delay probes for the upcomming release
  writeback: cleanup bdi_register
  writeback: add new tracepoints
  writeback: remove unnecessary init_timer call
  writeback: optimize periodic bdi thread wakeups
  writeback: prevent unnecessary bdi threads wakeups
  writeback: move bdi threads exiting logic to the forker thread
  writeback: restructure bdi forker loop a little
  writeback: move last_active to bdi
  writeback: do not remove bdi from bdi_list
  writeback: simplify bdi code a little
  writeback: do not lose wake-ups in bdi threads
  ...

Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
2010-08-10 15:22:42 -07:00
Artem Bityutskiy 696ac96c27 btrfs: remove junk sb_dirt change
BTRFS does not define a '->write_super()' method, so it should
not mark its superblock as dirty. This looks like some left-over.

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:55 -04:00
Al Viro 45321ac543 Make ->drop_inode() just return whether inode needs to be dropped
... and let iput_final() do the actual eviction or retention

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:35 -04:00
Al Viro bd55597520 convert btrfs to ->evict_inode()
NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:48:22 -04:00
Al Viro a4ffdde6e5 simplify checks for I_CLEAR/I_FREEING
add I_CLEAR instead of replacing I_FREEING with it.  I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information.  As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:44 -04:00
Christoph Hellwig 1025774ce4 remove inode_setattr
Replace inode_setattr with opencoded variants of it in all callers.  This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.

In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:

 spufs: explicitly checks for ATTR_SIZE earlier
 btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
 ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above

In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09 16:47:37 -04:00
Christoph Hellwig 7b6d91daee block: unify flags for struct bio and struct request
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver.  There were two flags in the bio that were
missing in the requests:  BIO_RW_UNPLUG and BIO_RW_AHEAD.  Also I've
renamed two request flags that had a superflous RW in them.

Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-08-07 18:20:39 +02:00
Linus Torvalds ee1039307a Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
  Btrfs: fix CLONE ioctl destination file size expansion to block boundary
  Btrfs: fix split_leaf double split corner case
2010-07-19 19:33:02 -07:00
Dan Rosenberg 2ebc346478 Btrfs: fix checks in BTRFS_IOC_CLONE_RANGE
1.  The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.

2.  The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around).  I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't.  Even if it's not
exploitable, it couldn't hurt to be safe.

Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:58:20 -04:00
Sage Weil b5384d48f4 Btrfs: fix CLONE ioctl destination file size expansion to block boundary
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE).  It was then using that offset when
extending the destination file's i_size.  Fix this by not setting i_size
beyond the originally requested ending offset.

This bug was introduced by a22285a6 (2.6.35-rc1).

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:15:06 -04:00
Chris Mason 99d8f83c98 Btrfs: fix split_leaf double split corner case
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice.  This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.

If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-07-19 16:14:50 -04:00
Christoph Hellwig 9c3a8ee8a1 writeback: remove writeback_inodes_wbc
This was just an odd wrapper around writeback_inodes_wb.  Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-07-06 08:54:03 +02:00
Linus Torvalds b25b550bb1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: The file argument for fsync() is never null
  Btrfs: handle ERR_PTR from posix_acl_from_xattr()
  Btrfs: avoid BUG when dropping root and reference in same transaction
  Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
  Btrfs: should add a permission check for setfacl
  Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
  Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
  Btrfs: unwind after btrfs_start_transaction() errors
  Btrfs: btrfs_iget() returns ERR_PTR
  Btrfs: handle kzalloc() failure in open_ctree()
  Btrfs: handle error returns from btrfs_lookup_dir_item()
  Btrfs: Fix BUG_ON for fs converted from extN
  Btrfs: Fix null dereference in relocation.c
  Btrfs: fix remap_file_pages error
  Btrfs: uninitialized data is check_path_shared()
  Btrfs: fix fallocate regression
  Btrfs: fix loop device on top of btrfs
2010-06-11 14:18:47 -07:00
Dan Carpenter 6f902af400 Btrfs: The file argument for fsync() is never null
The "file" argument for fsync is never null so we can remove this check.

What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:40 -04:00
Dan Carpenter 834e74759a Btrfs: handle ERR_PTR from posix_acl_from_xattr()
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:39 -04:00
Sage Weil 15e7000095 Btrfs: avoid BUG when dropping root and reference in same transaction
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction.  If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).

This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.

Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:39 -04:00
Shi Weihua 731e3d1b43 Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa    <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
  setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.

So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.

Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:38 -04:00
Shi Weihua 2f26afba46 Btrfs: should add a permission check for setfacl
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
  # file: aaa
  # owner: user1
  # group: user1
  user::rw-
  group::rw-
  other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
  # file: aaa
  # owner: user1
  # group: user1
  user::rwx           <- successed to setfacl
  group::rw-
  other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
  setfacl: aaa: Operation not permitted

This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:37 -04:00
Dan Carpenter cf1e99a4e0 Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
btrfs_lookup_dir_item() can return either ERR_PTRs or null.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:37 -04:00
Dan Carpenter 3140c9a34b Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that.  It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:36 -04:00
Dan Carpenter d327099a23 Btrfs: unwind after btrfs_start_transaction() errors
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction".  If we goto out here then we skip all the
unwinding and there are locks still held etc.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:35 -04:00
Dan Carpenter 4cbd1149fb Btrfs: btrfs_iget() returns ERR_PTR
btrfs_iget() returns an ERR_PTR() on failure and not null.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:35 -04:00
Dan Carpenter 676e4c8639 Btrfs: handle kzalloc() failure in open_ctree()
Unwind and return -ENOMEM if the allocation fails here.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:34 -04:00
Dan Carpenter fb4f6f910c Btrfs: handle error returns from btrfs_lookup_dir_item()
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:57:33 -04:00
Yan, Zheng 3bf84a5a83 Btrfs: Fix BUG_ON for fs converted from extN
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:48:35 -04:00
Yan, Zheng 046f264f6b Btrfs: Fix null dereference in relocation.c
Fix a potential null dereference in relocation.c

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 15:48:34 -04:00
Miao Xie 058a457ef0 Btrfs: fix remap_file_pages error
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 11:46:12 -04:00
Dan Carpenter 0e4dcbef1c Btrfs: uninitialized data is check_path_shared()
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop.  In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.

Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 11:46:12 -04:00
Josef Bacik 8360977972 Btrfs: fix fallocate regression
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 11:46:12 -04:00
Miao Xie 4a001071d3 Btrfs: fix loop device on top of btrfs
We cannot use the loop device which has been connected to a file in the btrf

The reproduce steps is following:
 # dd if=/dev/zero of=vdev0 bs=1M count=1024
 # losetup /dev/loop0 vdev0
 # mkfs.btrfs /dev/loop0
 ...
 failed to zero device start -5

The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-06-11 11:46:11 -04:00
Christoph Hellwig 7ea8085910 drop unused dentry argument to ->fsync
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-27 22:05:02 -04:00
Linus Torvalds 105a048a4f Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
  Btrfs: add more error checking to btrfs_dirty_inode
  Btrfs: allow unaligned DIO
  Btrfs: drop verbose enospc printk
  Btrfs: Fix block generation verification race
  Btrfs: fix preallocation and nodatacow checks in O_DIRECT
  Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
  Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
  Btrfs: rework O_DIRECT enospc handling
  Btrfs: use async helpers for DIO write checksumming
  Btrfs: don't walk around with task->state != TASK_RUNNING
  Btrfs: do aio_write instead of write
  Btrfs: add basic DIO read/write support
  direct-io: do not merge logically non-contiguous requests
  direct-io: add a hook for the fs to provide its own submit_bio function
  fs: allow short direct-io reads to be completed via buffered IO
  Btrfs: Metadata ENOSPC handling for balance
  Btrfs: Pre-allocate space for data relocation
  Btrfs: Metadata ENOSPC handling for tree log
  Btrfs: Metadata reservation for orphan inodes
  Btrfs: Introduce global metadata reservation
  ...
2010-05-27 10:43:44 -07:00
Chris Mason 9aeead7378 Btrfs: add more error checking to btrfs_dirty_inode
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-27 10:23:00 -04:00
Chris Mason 5a5f79b570 Btrfs: allow unaligned DIO
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:35:35 -04:00
Chris Mason 933b585f70 Btrfs: drop verbose enospc printk
Less printk is good printk.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:35:34 -04:00
Yan, Zheng 5bdd3536cb Btrfs: Fix block generation verification race
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:35:33 -04:00
Chris Mason 46bfbb5c07 Btrfs: fix preallocation and nodatacow checks in O_DIRECT
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents.  This means it
wasn't honoring snapshots properly.

The fix here is to add an explicit check for multiple references
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 21:34:45 -04:00
Chris Mason 94b604429a Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons.  This
usually works well but can cause problems when there are
many many writers.

When btrfs_update_inode fails with ENOSPC, we fallback
to a slower btrfs_start_transaction call that will reserve
some space.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 11:02:00 -04:00
Chris Mason 3f7c579c41 Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO.  This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-26 10:59:53 -04:00
Chris Mason 4845e44ffd Btrfs: rework O_DIRECT enospc handling
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them.  Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.

There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong.  This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage.  We have
to clear them ourselves as the DIO ends.

With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.

btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 21:52:08 -04:00
Kay Sievers 578454ff7e driver core: add devname module aliases to allow module on-demand auto-loading
This adds:
  alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.

Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.

The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
  $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
  # Device nodes to trigger on-demand module loading.
  microcode cpu/microcode c10:184
  fuse fuse c10:229
  ppp_generic ppp c108:0
  tun net/tun c10:200
  dm_mod mapper/control c10:235

Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
  $ /sbin/udevd --debug
  ...
  static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
  static_dev_create_from_modules: mknod '/dev/fuse' c10:229
  static_dev_create_from_modules: mknod '/dev/ppp' c108:0
  static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
  static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
  udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
  udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666

A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.

Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.

This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)

Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-25 15:08:26 -07:00
Chris Mason eaf25d933e Btrfs: use async helpers for DIO write checksumming
The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster.  This
changes the O_DIRECT write code to use them.  The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:58 -04:00
Chris Mason ed3b3d314c Btrfs: don't walk around with task->state != TASK_RUNNING
Yan Zheng noticed two places we were doing a lot of work
without task->state set to TASK_RUNNING.  This sets the state
properly after we get ready to sleep but decide not to.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:58 -04:00
Josef Bacik 11c65dccf7 Btrfs: do aio_write instead of write
In order for AIO to work, we need to implement aio_write.  This patch converts
our btrfs_file_write to btrfs_aio_write.  I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:57 -04:00
Josef Bacik 4b46fce233 Btrfs: add basic DIO read/write support
This provides basic DIO support for reading and writing.  It does not do the
work to recover from mismatching checksums, that will come later.  A few design
changes have been made from Jim's code (sorry Jim!)

1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.

2) Fallback on buffered IO for compressed or inline extents.  Jim's code did
it's own buffering to make dio with compressed extents work.  Now we just
fallback onto normal buffered IO.

3) Use ordered extents for the writes so that all of the

lock_extent()
lookup_ordered()

type checks continue to work.

4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.

I've tested this with fsx and everything works great.  This patch depends on my
dio and filemap.c patches to work.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:57 -04:00
Yan, Zheng 3fd0a5585e Btrfs: Metadata ENOSPC handling for balance
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:

1. Avoid COW tree leave in the phrase of merging tree.

2. Handle interaction with snapshot creation.

3. make the backref cache can live across transactions.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:54 -04:00
Yan, Zheng efa5646456 Btrfs: Pre-allocate space for data relocation
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:53 -04:00
Yan, Zheng 4a500fd178 Btrfs: Metadata ENOSPC handling for tree log
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:53 -04:00
Yan, Zheng d68fc57b7e Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng 8929ecfa50 Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:52 -04:00
Yan, Zheng 0ca1f7ceb1 Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation
and update various related functions.

This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:51 -04:00
Yan, Zheng a22285a6a3 Btrfs: Integrate metadata reservation with start_transaction
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.

Changes since V1:

Add code that check if unlink and rmdir will free space.

Add ENOSPC handling for clone ioctl.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng f0486c68e4 Btrfs: Introduce contexts for metadata reservation
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.

Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:

Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().

Make space accounting more accurate, mainly for handling read only
block groups.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:50 -04:00
Yan, Zheng 2ead6ae770 Btrfs: Kill init_btrfs_i()
All code in init_btrfs_i can be moved into btrfs_alloc_inode()

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:49 -04:00
Yan, Zheng 5da9d01b66 Btrfs: Shrink delay allocated space in a synchronized
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng 424499dbd0 Btrfs: Kill allocate_wait in space_info
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:48 -04:00
Yan, Zheng b742bb82f1 Btrfs: Link block groups of different raid types
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.

Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-05-25 10:34:47 -04:00
Linus Torvalds e8bebe2f71 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
  fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
  get rid of home-grown mutex in cris eeprom.c
  switch ecryptfs_write() to struct inode *, kill on-stack fake files
  switch ecryptfs_get_locked_page() to struct inode *
  simplify access to ecryptfs inodes in ->readpage() and friends
  AFS: Don't put struct file on the stack
  Ban ecryptfs over ecryptfs
  logfs: replace inode uid,gid,mode initialization with helper function
  ufs: replace inode uid,gid,mode initialization with helper function
  udf: replace inode uid,gid,mode init with helper
  ubifs: replace inode uid,gid,mode initialization with helper function
  sysv: replace inode uid,gid,mode initialization with helper function
  reiserfs: replace inode uid,gid,mode initialization with helper function
  ramfs: replace inode uid,gid,mode initialization with helper function
  omfs: replace inode uid,gid,mode initialization with helper function
  bfs: replace inode uid,gid,mode initialization with helper function
  ocfs2: replace inode uid,gid,mode initialization with helper function
  nilfs2: replace inode uid,gid,mode initialization with helper function
  minix: replace inode uid,gid,mode init with helper
  ext4: replace inode uid,gid,mode init with helper
  ...

Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
2010-05-21 19:37:45 -07:00
Dmitry Monakhov ecc11fabf7 btrfs: replace inode uid,gid,mode initialization with helper function
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:23 -04:00
Stephen Hemminger f01cbd3f81 btrfs: constify xattr_handler
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-05-21 18:31:18 -04:00
Jens Axboe ee9a3607fb Merge branch 'master' into for-2.6.35
Conflicts:
	fs/ext3/fsync.c

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-05-21 21:27:26 +02:00
Linus Torvalds 18e41da89d Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
  Btrfs: check for read permission on src file in the clone ioctl
2010-05-15 12:55:31 -07:00