Commit graph

254 commits

Author SHA1 Message Date
Curt Wohlgemuth 4188188bdc ext4: add comments to definition of ext4_io_end_t
This should make it more clear what this structure is used
for, and how some of the (mutually exclusive) fields are
used to keep page cache references.

Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-05 10:40:22 -05:00
Jeff Moyer 491caa4363 ext4: fix race between sync and completed io work
The following command line will leave the aio-stress process unkillable
on an ext4 file system (in my case, mounted on /mnt/test):

aio-stress -t 20 -s 10 -O -S -o 2 -I 1000 /mnt/test/aiostress.3561.4 /mnt/test/aiostress.3561.4.20 /mnt/test/aiostress.3561.4.19 /mnt/test/aiostress.3561.4.18 /mnt/test/aiostress.3561.4.17 /mnt/test/aiostress.3561.4.16 /mnt/test/aiostress.3561.4.15 /mnt/test/aiostress.3561.4.14 /mnt/test/aiostress.3561.4.13 /mnt/test/aiostress.3561.4.12 /mnt/test/aiostress.3561.4.11 /mnt/test/aiostress.3561.4.10 /mnt/test/aiostress.3561.4.9 /mnt/test/aiostress.3561.4.8 /mnt/test/aiostress.3561.4.7 /mnt/test/aiostress.3561.4.6 /mnt/test/aiostress.3561.4.5 /mnt/test/aiostress.3561.4.4 /mnt/test/aiostress.3561.4.3 /mnt/test/aiostress.3561.4.2

This is using the aio-stress program from the xfstests test suite.
That particular command line tells aio-stress to do random writes to
20 files from 20 threads (one thread per file).  The files are NOT
preallocated, so you will get writes to random offsets within the
file, thus creating holes and extending i_size.  It also opens the
file with O_DIRECT and O_SYNC.

On to the problem.  When an I/O requires unwritten extent conversion,
it is queued onto the completed_io_list for the ext4 inode.  Two code
paths will pull work items from this list.  The first is the
ext4_end_io_work routine, and the second is ext4_flush_completed_IO,
which is called via the fsync path (and O_SYNC handling, as well).
There are two issues I've found in these code paths.  First, if the
fsync path beats the work routine to a particular I/O, the work
routine will free the io_end structure!  It does not take into account
the fact that the io_end may still be in use by the fsync path.  I've
fixed this issue by adding yet another IO_END flag, indicating that
the io_end is being processed by the fsync path.

The second problem is that the work routine will make an assignment to
io->flag outside of the lock.  I have witnessed this result in a hang
at umount.  Moving the flag setting inside the lock resolved that
problem.

The problem was introduced by commit b82e384c7b ("ext4: optimize
locking for end_io extent conversion"), which first appeared in 3.2.
As such, the fix should be backported to that release (probably along
with the unwritten extent conversion race fix).

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
CC: stable@kernel.org
2012-03-05 10:29:52 -05:00
Theodore Ts'o 5a916be1b3 ext4: make ext4_show_options() be table-driven
Consistently show mount options which are the non-default, so that
/proc/mounts accurately shows the mount options that would be
necessary to mount the file system in its current mode of operation.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-04 19:27:31 -05:00
Theodore Ts'o 39ef17f1b0 ext4: simplify handling of the errors=* mount options
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-03 17:56:23 -05:00
Theodore Ts'o c64db50e76 ext4: remove the I_VERSION mount flag and use the super_block flag instead
There's no point to have two bits that are set in parallel; so use the
MS_I_VERSION flag that is needed by the VFS anyway, and that way we
free up a bit in sbi->s_mount_opts.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-02 12:23:11 -05:00
Jeff Moyer 266991b138 ext4: fix race between unwritten extent conversion and truncate
The following comment in ext4_end_io_dio caught my attention:

	/* XXX: probably should move into the real I/O completion handler */
        inode_dio_done(inode);

The truncate code takes i_mutex, then calls inode_dio_wait.  Because the
ext4 code path above will end up dropping the mutex before it is
reacquired by the worker thread that does the extent conversion, it
seems to me that the truncate can happen out of order.  Jan Kara
mentioned that this might result in error messages in the system logs,
but that should be the extent of the "damage."

The fix is pretty straight-forward: don't call inode_dio_done until the
extent conversion is complete.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2012-02-20 17:59:24 -05:00
Theodore Ts'o 856cbcf9a9 ext4: fix INCOMPAT feature codepoint reservation for INLINEDATA
In commit 9b90e5e028 I incorrectly reserved the wrong bit for
EXT4_FEATURE_INCOMPAT_INLINEDATA per the discussion on the linux-ext4
list on December 7, 2011.  The codepoint 0x2000 should be used for
EXT4_FEATURE_INCOMPAT_USE_META_CSUM, so INLINEDATA will be assigned
the value 0x8000.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:53:01 -05:00
Theodore Ts'o 813e57276f ext4: fix race when setting bitmap_uptodate flag
In ext4_read_{inode,block}_bitmap() we were setting bitmap_uptodate()
before submitting the buffer for read.  The is bad, since we check
bitmap_uptodate() without locking the buffer, and so if another
process is racing with us, it's possible that they will think the
bitmap is uptodate even though the read has not completed yet,
resulting in inodes and blocks potentially getting allocated more than
once if we get really unlucky.

Addresses-Google-Bug: 2828254

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-02-20 17:52:46 -05:00
Theodore Ts'o ff9cb1c4ee Merge branch 'for_linus' into for_linus_merged
Conflicts:
	fs/ext4/ioctl.c
2012-01-10 11:54:07 -05:00
Eric Sandeen 5f163cc759 ext4: make more symbols static
A couple more functions can reasonably be made static if desired.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 22:33:28 -05:00
Theodore Ts'o 9b90e5e028 ext4: reserve new feature flag codepoints
Reserve the ext4 features flags EXT4_FEATURE_RO_COMPAT_METADATA_CSUM,
EXT4_FEATURE_INCOMPAT_INLINEDATA, and EXT4_FEATURE_INCOMPAT_LARGEDIR.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 22:01:53 -05:00
Yongqiang Yang 19c5246d25 ext4: add new online resize interface
This patch adds new online resize interface, whose input argument is a
64-bit integer indicating how many blocks there are in the resized fs.

In new resize impelmentation, all work like allocating group tables
are done by kernel side, so the new resize interface can support
flex_bg feature and prepares ground for suppoting resize with features
like bigalloc and exclude bitmap. Besides these, user-space tools just
passes in the new number of blocks.

We delay initializing the bitmaps and inode tables of added groups if
possible and add multi groups (a flex groups) each time, so new resize
is very fast like mkfs.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-04 17:09:44 -05:00
Yongqiang Yang 33afdcc540 ext4: add a function which sets up group blocks of a flex bg
This patch adds a function named setup_new_flex_group_blocks() which
sets up group blocks of a flex bg.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-01-03 23:32:52 -05:00
Al Viro dcca3fec9f ext4: propagate umode_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-01-03 22:54:59 -05:00
Akinobu Mita 597d508c17 ext4: use proper little-endian bitops
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4.  Only two ext4_{set,clear}_bit() calls check the return value.  The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().

This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.

This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, new ext4_{set,clear}_bit
don't have return value and causes compiler errors where the return value
is used.

This also removes unused ext4_find_first_zero_bit().

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-28 20:32:07 -05:00
Zheng Liu ccb4d7af91 ext4: remove no longer used functions in inode.c
The functions ext4_block_truncate_page() and ext4_block_zero_page_range()
are no longer used, so remove them.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-12-28 20:25:40 -05:00
Linus Torvalds f1f8935a5c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
  jbd2: Unify log messages in jbd2 code
  jbd/jbd2: validate sb->s_first in journal_get_superblock()
  ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
  ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
  ext4: fix a typo in struct ext4_allocation_context
  ext4: Don't normalize an falloc request if it can fit in 1 extent.
  ext4: remove comments about extent mount option in ext4_new_inode()
  ext4: let ext4_discard_partial_buffers handle unaligned range correctly
  ext4: return ENOMEM if find_or_create_pages fails
  ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
  ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
  ext4: optimize locking for end_io extent conversion
  ext4: remove unnecessary call to waitqueue_active()
  ext4: Use correct locking for ext4_end_io_nolock()
  ext4: fix race in xattr block allocation path
  ext4: trace punch_hole correctly in ext4_ext_map_blocks
  ext4: clean up AGGRESSIVE_TEST code
  ext4: move variables to their scope
  ext4: fix quota accounting during migration
  ext4: migrate cleanup
  ...
2011-11-02 10:06:20 -07:00
Joe Perches b9075fa968 treewide: use __printf not __attribute__((format(printf,...)))
Standardize the style for compiler based printf format verification.
Standardized the location of __printf too.

Done via script and a little typing.

$ grep -rPl --include=*.[ch] -w "__attribute__" * | \
  grep -vP "^(tools|scripts|include/linux/compiler-gcc.h)" | \
  xargs perl -n -i -e 'local $/; while (<>) { s/\b__attribute__\s*\(\s*\(\s*format\s*\(\s*printf\s*,\s*(.+)\s*,\s*(.+)\s*\)\s*\)\s*\)/__printf($1, $2)/g ; print; }'

[akpm@linux-foundation.org: revert arch bits]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-10-31 17:30:54 -07:00
Tao Ma 0edeb71dc9 ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
EXT4_IO_END_UNWRITTEN flag set and the increase of i_aiodio_unwritten
should be done simultaneously since ext4_end_io_nolock always clear
the flag and decrease the counter in the same time.

We have found some bugs that the flag is set while leaving
i_aiodio_unwritten unchanged(commit 32c80b32c0). So this patch just tries
to create a helper function to wrap them to avoid any future bug.
The idea is inspired by Eric.

Cc: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-31 17:30:44 -04:00
Dmitry Monakhov 5cb81dabcc ext4: fix quota accounting during migration
The tmp_inode should have same uid/gid as the original inode.
Otherwise new metadata blocks will be accounted to wrong quota-id,
which will result in a quota leak after the inode migration is
completed.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-29 09:05:00 -04:00
Dmitry Monakhov a4e5d88b1b ext4: update EOFBLOCKS flag on fallocate properly
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
Currently it happens only if new extent was allocated.

TESTCASE:
fallocate test_file -n -l4096
fallocate test_file -l4096
Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
fsck will complain about that.

Also remove ping pong in ext4_fallocate() in case of new extents,
where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
ext4_falloc_update_inode() restore it again.

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-25 08:15:12 -04:00
Tao Ma 7fd59c83b0 ext4: remove the obsolete/broken EXT4_IOC_WAIT_FOR_READONLY ioctl
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is
also broken.  No one sets the set_ro_timer, no one wakes up us and our
state is set to TASK_INTERRUPTIBLE not RUNNING.  So remove it.

Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 15:56:35 -04:00
Lukas Czerner 4113c4caa4 ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.

This is a part of the effort to reduce number of ext4 options hence the
test matrix.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-10-08 14:34:47 -04:00
Aditya Kali 5356f2615c ext4: attempt to fix race in bigalloc code path
Currently, there exists a race between delayed allocated writes and
the writeback when bigalloc feature is in use. The race was because we
wanted to determine what blocks in a cluster are under delayed
allocation and we were using buffer_delayed(bh) check for it. But, the
writeback codepath clears this bit without any synchronization which
resulted in a race and an ext4 warning similar to:

EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
		reserved data blocks

The race existed in two places.
(1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
    writeback code path.
(2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
    buffer_delayed(bh) is set.

To fix (1), this patch introduces a new buffer_head state bit -
BH_Da_Mapped.  This bit is set under the protection of
EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
allocated blocks during the writeout time. We can now reliably check
for this bit inside ext4_find_delalloc_range() to determine whether
the reservation for the blocks have already been claimed or not.

To fix (2), it was necessary to set buffer_delay(bh) under the
protection of i_data_sem.  So, I extracted the very beginning of
ext4_map_blocks into a new function - ext4_da_map_blocks() - and
performed the required setting of bh_delay bit and the quota
reservation under the protection of i_data_sem.  These two fixes makes
the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
thus removing the race.

Tested: I was able to reproduce the problem by running 'dd' and
'fsync' in parallel. Also, xfstests sometimes used to reproduce this
race. After the fix both my test and xfstests were successful and no
race (warning message) was observed.

Google-Bug-Id: 4997027

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:20:51 -04:00
Aditya Kali d8990240d8 ext4: add some tracepoints in ext4/extents.c
This patch adds some tracepoints in ext4/extents.c and updates a tracepoint in
ext4/inode.c.

Tested: Built and ran the kernel and verified that these tracepoints work.
Also ran xfstests.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:18:51 -04:00
Theodore Ts'o df55c99dc8 ext4: rename ext4_has_free_blocks() to ext4_has_free_clusters()
Rename the function so it is more clear what is going on.  Also rename
the various variables so it's clearer what's happening.

Also fix a missing blocks to cluster conversion when reading the
number of reserved blocks for root.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:16:51 -04:00
Theodore Ts'o e7d5f3156e ext4: rename ext4_claim_free_blocks() to ext4_claim_free_clusters()
This function really claims a number of free clusters, not blocks, so
rename it so it's clearer what's going on.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:14:51 -04:00
Theodore Ts'o cff1dfd767 ext4: rename ext4_free_blocks_after_init() to ext4_free_clusters_after_init()
This function really returns the number of clusters after initializing
an uninitalized block bitmap has been initialized.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:12:51 -04:00
Theodore Ts'o 5dee54372c ext4: rename ext4_count_free_blocks() to ext4_count_free_clusters()
This function really counts the free clusters reported in the block
group descriptors, so rename it to reduce confusion.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:10:51 -04:00
Theodore Ts'o 021b65bb1e ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions.  So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.

Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:08:51 -04:00
Theodore Ts'o 6f16b60690 ext4: enable mounting bigalloc as read/write
Now that we have implemented all of the changes needed for bigalloc,
we can finally enable it!

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:06:51 -04:00
Aditya Kali 7b415bf60f ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.

Signed-off-by: Aditya Kali <adityakali@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 19:04:51 -04:00
Theodore Ts'o 24aaa8ef4e ext4: convert the free_blocks field in s_flex_groups to be free_clusters
Convert the free_blocks to be free_clusters to make the final revised
bigalloc changes easier to read/understand.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:58:51 -04:00
Theodore Ts'o 5704265188 ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:56:51 -04:00
Theodore Ts'o 84130193e0 ext4: teach ext4_free_blocks() about bigalloc and clusters
The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not.  That will be up the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.

We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:50:51 -04:00
Theodore Ts'o d5b8f31007 ext4: bigalloc changes to block bitmap initialization functions
Add bigalloc support to ext4_init_block_bitmap() and
ext4_free_blocks_after_init().

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:44:51 -04:00
Theodore Ts'o fd034a84e1 ext4: split out ext4_free_blocks_after_init()
The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap().  This actually made it difficult to
understand how the function worked, and made it hard make changes to
support clusters.  So as an initial cleanup, I've separated out the
functionality of initializing block bitmap from calculating the number
of free blocks in the new block group.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:42:51 -04:00
Theodore Ts'o 281b599597 ext4: read-only support for bigalloc file systems
This adds supports for bigalloc file systems.  It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-09 18:34:51 -04:00
Theodore Ts'o 56889787cf ext4: improve handling of conflicting mount options
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).

Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported.  Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.

Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03 18:22:38 -04:00
Allison Henderson 4e96b2dbbf ext4: Add new ext4_discard_partial_page_buffers routines
This patch adds two new routines: ext4_discard_partial_page_buffers
and ext4_discard_partial_page_buffers_no_lock.

The ext4_discard_partial_page_buffers routine is a wrapper
function to ext4_discard_partial_page_buffers_no_lock.
The wrapper function locks the page and passes it to
ext4_discard_partial_page_buffers_no_lock.
Calling functions that already have the page locked can call
ext4_discard_partial_page_buffers_no_lock directly.

The ext4_discard_partial_page_buffers_no_lock function
zeros a specified range in a page, and unmaps the
corresponding buffer heads.  Only block aligned regions of the
page will have their buffer heads unmapped.  Unblock aligned regions
will be mapped if needed so that they can be updated with the
partial zero out.  This function is meant to
be used to update a page and its buffer heads to be zeroed
and unmapped when the corresponding blocks have been released
or will be released.

This routine is used in the following scenarios:
* A hole is punched and the non page aligned regions
  of the head and tail of the hole need to be discarded

* The file is truncated and the partial page beyond EOF needs
  to be discarded

* The end of a hole is in the same page as EOF.  After the
  page is flushed, the partial page beyond EOF needs to be
  discarded.

* A write operation begins or ends inside a hole and the partial
  page appearing before or after the write needs to be discarded

* A write operation extends EOF and the partial page beyond EOF
  needs to be discarded

This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
which is used when a write operation begins or ends in a hole.
When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only
buffer heads that are already unmapped will have the corresponding
regions of the page zeroed.

Signed-off-by: Allison Henderson <achender@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-09-03 11:51:09 -04:00
Theodore Ts'o 1cd9f0976a ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodes
This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file).  This was
discovered when xfstests #79 was generalized so it could run on all
file systems.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc:stable@kernel.org
2011-08-31 11:54:51 -04:00
Jiaying Zhang 8c0bec2151 ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places.  In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.

This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode().  Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix.  Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.

This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero.  However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue.  As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock.  Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-31 11:50:51 -04:00
Linus Torvalds 60ad446682 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits)
  ext4: prevent memory leaks from ext4_mb_init_backend() on error path
  ext4: use EXT4_BAD_INO for buddy cache to avoid colliding with valid inode #
  ext4: use ext4_msg() instead of printk in mballoc
  ext4: use ext4_kvzalloc()/ext4_kvmalloc() for s_group_desc and s_group_info
  ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
  ext4: use the correct error exit path in ext4_init_inode_table()
  ext4: add missing kfree() on error return path in add_new_gdb()
  ext4: change umode_t in tracepoint headers to be an explicit __u16
  ext4: fix races in ext4_sync_parent()
  ext4: Fix overflow caused by missing cast in ext4_fallocate()
  ext4: add action of moving index in ext4_ext_rm_idx for Punch Hole
  ext4: simplify parameters of reserve_backup_gdb()
  ext4: simplify parameters of add_new_gdb()
  ext4: remove lock_buffer in bclean() and setup_new_group_blocks()
  ext4: simplify journal handling in setup_new_group_blocks()
  ext4: let setup_new_group_blocks() set multiple bits at a time
  ext4: fix a typo in ext4_group_extend()
  ext4: let ext4_group_add_blocks() handle 0 blocks quickly
  ext4: let ext4_group_add_blocks() return an error code
  ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
  ...

Fix up conflict in fs/ext4/inode.c: commit aacfc19c62 ("fs: simplify
the blockdev_direct_IO prototype") had changed the ext4_ind_direct_IO()
function for the new simplified calling convention, while commit
dae1e52cb1 ("ext4: move ext4_ind_* functions from inode.c to
indirect.c") moved the function to another file.
2011-08-01 13:56:03 -10:00
Theodore Ts'o 9933fc0ac1 ext4: introduce ext4_kvmalloc(), ext4_kzalloc(), and ext4_kvfree()
Introduce new helper functions which try kmalloc, and then fall back
to vmalloc if necessary, and use them for allocating and deallocating
s_flex_groups.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-08-01 08:45:02 -04:00
Yongqiang Yang c3e94d1df9 ext4: let setup_new_group_blocks() set multiple bits at a time
Rename mb_set_bits() to ext4_set_bits() and make it a global function
so that setup_new_group_blocks() can use it.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 22:05:53 -04:00
Yongqiang Yang cc7365dfe4 ext4: let ext4_group_add_blocks() return an error code
This patch lets ext4_group_add_blocks() return an error code if it
fails, so that upper functions can handle error correctly.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:46:07 -04:00
Yongqiang Yang 0529155e8a ext4: rename ext4_add_groupblocks() to ext4_group_add_blocks()
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:43:56 -04:00
Yongqiang Yang 8f82f840ec ext4: prevent parallel resizers by atomic bit ops
Before this patch, parallel resizers are allowed and protected by a
mutex lock, actually, there is no need to support parallel resizer, so
this patch prevents parallel resizers by atmoic bit ops, like
lock_page() and unlock_page() do.

To do this, the patch removed the mutex lock s_resize_lock from struct
ext4_sb_info and added a unsigned long field named s_resize_flags
which inidicates if there is a resizer.

Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-26 21:35:44 -04:00
Josef Bacik 02c24a8218 fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers.  Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2.  For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20 20:47:59 -04:00
Tao Ma 3d56b8d2c7 ext4: Speed up FITRIM by recording flags in ext4_group_info
In ext4, when FITRIM is called every time, we iterate all the
groups and do trim one by one. It is a bit time wasting if the
group has been trimmed and there is no change since the last
trim.

So this patch adds a new flag in ext4_group_info->bb_state to
indicate that the group has been trimmed, and it will be cleared
if some blocks is freed(in release_blocks_on_commit). Another
trim_minlen is added in ext4_sb_info to record the last minlen
we use to trim the volume, so that if the caller provide a small
one, we will go on the trim regardless of the bb_state.

A simple test with my intel x25m ssd:
df -h shows:
/dev/sdb1              40G   21G   17G  56% /mnt/ext4
Block size:               4096

run the FITRIM with the following parameter:
range.start = 0;
range.len = UINT64_MAX;
range.minlen = 1048576;

without the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.505s
user	0m0.000s
sys	0m1.224s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.359s
user	0m0.000s
sys	0m1.178s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.228s
user	0m0.000s
sys	0m1.151s

with the patch:
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m5.625s
user	0m0.000s
sys	0m1.269s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m0.002s
user	0m0.000s
sys	0m0.001s
[root@boyu-tm linux-2.6]# time ./ftrim /mnt/ext4/a
real	0m0.002s
user	0m0.000s
sys	0m0.001s

A big improvement for the 2nd and 3rd run.

Even after I delete some big image files, it is still much
faster than iterating the whole disk.

[root@boyu-tm test]# time ./ftrim /mnt/ext4/a
real	0m1.217s
user	0m0.000s
sys	0m0.196s

Cc: Lukas Czerner <lczerner@redhat.com>
Reviewed-by: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2011-07-11 00:03:38 -04:00