alistair23-linux

redonkable

Author	SHA1	Message	Date
Linus Torvalds	0c93ea4064	Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (61 commits) Dynamic debug: fix pr_fmt() build error Dynamic debug: allow simple quoting of words dynamic debug: update docs dynamic debug: combine dprintk and dynamic printk sysfs: fix some bin_vm_ops errors kobject: don't block for each kobject_uevent sysfs: only allow one scheduled removal callback per kobj Driver core: Fix device_move() vs. dpm list ordering, v2 Driver core: some cleanup on drivers/base/sys.c Driver core: implement uevent suppress in kobject vcs: hook sysfs devices into object lifetime instead of "binding" driver core: fix passing platform_data driver core: move platform_data into platform_device sysfs: don't block indefinitely for unmapped files. driver core: move knode_bus into private structure driver core: move knode_driver into private structure driver core: move klist_children into private structure driver core: create a private portion of struct device driver core: remove polling for driver_probe_done(v5) sysfs: reference sysfs_dirent from sysfs inodes ... Fixed conflicts in drivers/sh/maple/maple.c manually	2009-03-26 11:17:04 -07:00
Linus Torvalds	8ff64b539b	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: GFS2: Fix freeze issue Fix a minor bug in the previous patch GFS2: Clean up of glops.c GFS2: Fix locking bug in failed shared to exclusive conversion GFS2: Pagecache usage optimization on GFS2 GFS2: fix sparse warning: Should it be static? GFS2: fix sparse warnings: constant is so big it is ... GFS2: Support quota/noquota mount arguments GFS2: Fix alignment issue and tidy gfs2_bitfit GFS2: Add a "demote a glock" interface to sysfs GFS2: Expose UUID via sysfs/uevent GFS2: Support generation of discard requests GFS2: Fix deadlock on journal flush GFS2: Fix error path ref counting for root inode GFS2: Remove unused field from glock GFS2: Merge lock_dlm module into GFS2 GFS2: Remove "double" locking in quota GFS2: change gfs2_quota_scan into a shrinker GFS2: Bring back lvb-related stuff to lock_nolock to support quotas GFS2: Fix remount argument parsing	2009-03-26 11:08:47 -07:00
Linus Torvalds	8d80ce80e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits) SELinux: inode_doinit_with_dentry drop no dentry printk SELinux: new permission between tty audit and audit socket SELinux: open perm for sock files smack: fixes for unlabeled host support keys: make procfiles per-user-namespace keys: skip keys from another user namespace keys: consider user namespace in key_permission keys: distinguish per-uid keys in different namespaces integrity: ima iint radix_tree_lookup locking fix TOMOYO: Do not call tomoyo_realpath_init unless registered. integrity: ima scatterlist bug fix smack: fix lots of kernel-doc notation TOMOYO: Don't create securityfs entries unless registered. TOMOYO: Fix exception policy read failure. SELinux: convert the avc cache hash list to an hlist SELinux: code readability with avc_cache SELinux: remove unused av.decided field SELinux: more careful use of avd in avc_has_perm_noaudit SELinux: remove the unused ae.used SELinux: check seqno when updating an avc_node ...	2009-03-26 11:03:39 -07:00
Matthew Garrett	0a1c01c947	Make relatime default Change the default behaviour of the kernel to use relatime for all filesystems. This can be overridden with the "strictatime" mount option. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 11:01:10 -07:00
Matthew Garrett	d0adde574b	Add a strictatime mount option Add support for explicitly requesting full atime updates. This makes it possible for kernels to default to relatime but still allow userspace to override it. Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 10:56:35 -07:00
Matthew Garrett	11ff6f05f1	Allow relatime to update atime once a day Allow atime to be updated once per day even with relatime. This lets utilities like tmpreaper (which delete files based on last access time) continue working, making relatime a plausible default for distributions. Signed-off-by: Matthew Garrett <mjg@redhat.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Acked-by: Valerie Aurora Henson <vaurora@redhat.com> Acked-by: Alan Cox <alan@redhat.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-26 10:48:13 -07:00
Stefan Weinhuber	b44b0ab3ba	[S390] dasd: add large volume support The dasd device driver will now support ECKD devices with more then 65520 cylinders. In the traditional ECKD adressing scheme each track is addressed by a 16-bit cylinder and 16-bit head number. The new addressing scheme makes use of the fact that the actual number of heads is never larger then 15, so 12 bits of the head number can be redefined to be part of the cylinder address. Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>	2009-03-26 15:24:05 +01:00
Jens Axboe	a2a9537ac0	Get rid of pdflush_operation() in emergency sync and remount Opencode a cheasy approach with kevent. The idea here is that we'll add some generic delayed work infrastructure, which probably wont be based on pdflush (or maybe it will, in which case we can just add it back). This is in preparation for getting rid of pdflush completely. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-26 11:01:36 +01:00
Jens Axboe	6933c02e9c	btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty Chris says it's safe to kill. Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-26 11:01:35 +01:00
Theodore Ts'o	563bdd61fe	ext4: Check for an valid i_mode when reading the inode from disk Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-26 00:06:19 -04:00
Theodore Ts'o	7058548cd5	ext4: Use WRITE_SYNC for commits which are caused by fsync() If a commit is triggered by fsync(), set a flag indicating the journal blocks associated with the transaction should be flushed out using WRITE_SYNC. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-25 23:35:46 -04:00
Manish Katiyar	c16831b4cc	ext2: Zero our b_size in ext2_quota_read() ext2_quota_read() doesn't initialize tmp_bh.b_size before calling ext2_get_block() where we access it. Since it is a local variable it might contain some garbage. Make sure it is filled with reasonable value before passing. Signed-off-by: Manish Katiyar <mkatiyar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Matt LaPlante	620372a9ff	trivial: fix typos/grammar errors in fs/Kconfig Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Jan Kara	268157ba67	quota: Coding style fixes Wrap long lines, remove assignments from conditions, rewrite two overcomplicated for loops. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:38 +01:00
Jan Kara	7a2435d874	quota: Remove superfluous inlines Remove inlines of large functions to decrease code size (saved 1543 bytes). Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:37 +01:00
Jan Kara	90c0af05a5	nfsd: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. CC: bfields@fieldses.org CC: neilb@suse.de Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:37 +01:00
Jan Kara	c94d2a22f2	jfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>	2009-03-26 02:18:37 +01:00
Jan Kara	bacfb7c2e5	udf: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:36 +01:00
Jan Kara	5f5fa796c6	ufs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Evgeniy Dushistov <dushistov@mail.ru>	2009-03-26 02:18:36 +01:00
Jan Kara	77db4f25bc	reiserfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: reiserfs-devel@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	a269eb1829	ext4: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Mingming Cao <cmm@us.ibm.com> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	81a0522739	ext3: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	6f90bee506	ext2: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: linux-ext4@vger.kernel.org	2009-03-26 02:18:36 +01:00
Jan Kara	314649558d	ramfs: Remove quota call Ramfs has no bussiness in quotas. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	9e3509e273	vfs: Use lowercase names of quota functions Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Alexander Viro <viro@zeniv.linux.org.uk>	2009-03-26 02:18:35 +01:00
Jan Kara	d26ac1a812	quota: Remove dqbuf_t and other cleanups Remove bogus typedef which is just a definition of char *. Remove unnecessary type casts. Substitute freedqbuf() with kfree. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	dd6f3c6d5a	quota: Remove NODQUOT macro Remove this macro which is just a definition of NULL. Fix a few coding style issues along the way. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Jan Kara	c516610cfe	quota: Make global quota locks cacheline aligned Andrew Morton has suggested that three global quota locks can end up in the same cacheline which can result in bad cacheline ping-pong on SMP machines. Make locks cacheline aligned so that we avoid this problem (thanks goes to Andrew for the idea). Signed-off-by: Jan Kara <jack@suse.cz> CC: Andrew Morton <akpm@linux-foundation.org>	2009-03-26 02:18:35 +01:00
Jan Kara	884d179dff	quota: Move quota files into separate directory Quota subsystem has more and more files. It's time to create a dir for it. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:35 +01:00
Mingming Cao	60e58e0f30	ext4: quota reservation for delayed allocation Uses quota reservation/claim/release to handle quota properly for delayed allocation in the three steps: 1) quotas are reserved when data being copied to cache when block allocation is defered 2) when new blocks are allocated. reserved quotas are converted to the real allocated quota, 2) over-booked quotas for metadata blocks are released back. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	643d00ccc3	reiserfs: Remove unnecessary quota functions reiserfs_dquot_initialize() and reiserfs_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	edf7245362	ext4: Remove unnecessary quota functions ext4_dquot_initialize() and ext4_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Jan Kara	a219ce3748	ext3: Remove unnecessary quota functions ext3_dquot_initialize() and ext3_dquot_drop() is no longer needed because of modified quota locking. Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Mingming Cao	08d0350ce9	quota: Move EXPORT_SYMBOL immediately next to the functions/varibles According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its function/variable Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:34 +01:00
Mingming Cao	740d9dcd94	quota: Add quota reservation claim and released operations Reserved quota will be claimed at the block allocation time. Over-booked quota could be returned back with the release callback function. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:18:24 +01:00
Mingming Cao	f18df22899	quota: Add quota reservation support Delayed allocation defers the block allocation at the dirty pages flush-out time, doing quota charge/check at that time is too late. But we can't charge the quota blocks until blocks are really allocated, otherwise users could get overcharged after reboot from system crash. This patch adds quota reservation for delayed allocation. Quota blocks are reserved in memory, inode and quota won't gets dirtied until later block allocation time. Signed-off-by: Mingming Cao <cmm@us.ibm.com> Signed-off-by: Jan Kara <jack@suse.cz>	2009-03-26 02:15:50 +01:00
Chris Mason	1a81af4d1d	Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod btrfs_update_delayed_ref is optimized to add and remove different references in one pass through the delayed ref tree. It is a zero sum on the total number of refs on a given extent. But, the code was recording an extra ref in the head node. This never made it down to the disk but was used when deciding if it was safe to free the extent while dropping snapshots. The fix used here is to make sure the ref_mod count is unchanged on the head ref when btrfs_update_delayed_ref is called. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-25 09:55:11 -04:00
Hugh Dickins	095160aee9	sysfs: fix some bin_vm_ops errors Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a "sysfs: don't block indefinitely for unmapped files" in linux-next crashes the PowerMac G5 when X starts up. It's caught out by the way powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting a new vma->vm_file whose private_data no longer points to the bin_buffer (substitution done because some versions of X crash if that mmap fails). The fix to this is straightforward: the original vm_file is fput() in that case, so this mmap won't block sysfs at all, so just don't switch over to bin_vm_ops if vm_file has changed. But more fixes made before realizing that was the problem:- It should not be an error if bin_page_mkwrite() finds no underlying page_mkwrite(). Check that a file already mmap'ed has the same underlying vm_ops _before_ pointing vma->vm_ops at bin_vm_ops. If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap on CONFIG_NUMA=y, just because that has a set_policy and get_policy: provide bin_set_policy, bin_get_policy and bin_migrate. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Eric Biederman <ebiederm@aristanetworks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Alex Chiang	669420644c	sysfs: only allow one scheduled removal callback per kobj The only way for a sysfs attribute to remove itself (without deadlock) is to use the sysfs_schedule_callback() interface. Vegard Nossum discovered that a poorly written sysfs ->store callback can repeatedly schedule remove callbacks on the same device over and over, e.g. $ while true ; do echo 1 > /sys/devices/.../remove ; done If the 'remove' attribute uses the sysfs_schedule_callback API and also does not protect itself from concurrent accesses, its callback handler will be called multiple times, and will eventually attempt to perform operations on a freed kobject, leading to many problems. Instead of requiring all callers of sysfs_schedule_callback to implement their own synchronization, provide the protection in the infrastructure. Now, sysfs_schedule_callback will only allow one scheduled callback per kobject. On subsequent calls with the same kobject, return -EAGAIN. This is a short term fix. The long term fix is to allow sysfs attributes to remove themselves directly, without any of this callback hokey pokey. [cornelia.huck@de.ibm.com: s390 ccwgroup bits] Reported-by: vegard.nossum@gmail.com Signed-off-by: Alex Chiang <achiang@hp.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Ming Lei	f67f129e51	Driver core: implement uevent suppress in kobject This patch implements uevent suppress in kobject and removes it from struct device, based on the following ideas: 1,Uevent sending should be one attribute of kobject, so suppressing it in kobject layer is more natural than in device layer. By this way, we can do it for other objects embedded with kobject. 2,It may save several bytes for each instance of struct device.(On my omap3(32bit ARM) based box, can save 8bytes per device object) This patch also introduces dev_set\|get_uevent_suppress() helpers to set and query uevent_suppress attribute in case to help kobject as private part of struct device in future. [This version is against the latest driver-core patch set of Greg,please ignore the last version.] Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Eric W. Biederman	e0edd3c65a	sysfs: don't block indefinitely for unmapped files. Modify sysfs bin files so that we can remove the bin file while they are still mapped. When the kobject is removed we unmap the bin file and arrange for future accesses to the mapping to receive SIGBUS. Implementing this prevents a nasty DOS when pci devices are hot plugged and unplugged. Where if any of their resources were mmaped the kernel could not free up their pci resources or release their pci data structures. [akpm@linux-foundation.org: remove unused var] Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:26 -07:00
Eric W. Biederman	04256b4a8f	sysfs: reference sysfs_dirent from sysfs inodes The sysfs_dirent serves as both an inode and a directory entry for sysfs. To prevent the sysfs inode numbers from being freed prematurely hold a reference to sysfs_dirent from the sysfs inode. [akpm@linux-foundation.org: add comment] Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:25 -07:00
Alex Chiang	425cb02912	sysfs: sysfs_add_one WARNs with full path to duplicate filename sysfs: sysfs_add_one WARNs with full path to duplicate filename As a debugging aid, it can be useful to know the full path to a duplicate file being created in sysfs. We now will display warnings such as: sysfs: cannot create duplicate filename '/foo' when attempting to create multiple files named 'foo' in the sysfs root, or: sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo' when attempting to create multiple files named 'foo' under a given directory in sysfs. The path displayed is always a relative path to sysfs_root. The leading '/' in the path name refers to the sysfs_root mount point, and should not be confused with the "real" '/'. Thanks to Alex Williamson for essentially writing sysfs_pathname. Cc: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:25 -07:00
Eric W. Biederman	4a67a1bc0b	sysfs: Take sysfs_mutex when fetching the root inode. sysfs_get_inode ultimately calls sysfs_count_nlink when the a directory inode is fectched. sysfs_count_nlink needs to be called under the sysfs_mutex to guard against the unlikely but possible scenario that the root directory is changing as we are counting the number entries in it, and just in general to be consistent. Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:24 -07:00
Qinghuang Feng	8231f2f99a	SYSFS: use standard magic.h for sysfs SYSFS_MAGIC has been added into magic.h, so only use that definition in magic.h to avoid potential consistency problem. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2009-03-24 16:38:24 -07:00
Chris Mason	af4176b49c	Btrfs: optimize fsyncs on old files The fsync log has code to make sure all of the parents of a file are in the log along with the file. It uses a minimal log of the parent directory inodes, just enough to get the parent directory on disk. If the transaction that originally created a file is fully on disk, and the file hasn't been renamed or linked into other directories, we can safely skip the parent directory walk. We know the file is on disk somewhere and we can go ahead and just log that single file. This is more important now because unrelated unlinks in the parent directory might make us force a commit if we try to log the parent. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	12fcfd22fe	Btrfs: tree logging unlink/rename fixes The tree logging code allows individual files or directories to be logged without including operations on other files and directories in the FS. It tries to commit the minimal set of changes to disk in order to fsync the single file or directory that was sent to fsync or O_SYNC. The tree logging code was allowing files and directories to be unlinked if they were part of a rename operation where only one directory in the rename was in the fsync log. This patch adds a few new rules to the tree logging. 1) on rename or unlink, if the inode being unlinked isn't in the fsync log, we must force a full commit before doing an fsync of the directory where the unlink was done. The commit isn't done during the unlink, but it is forced the next time we try to log the parent directory. Solution: record transid of last unlink/rename per directory when the directory wasn't already logged. For renames this is only done when renaming to a different directory. mkdir foo/some_dir normal commit rename foo/some_dir foo2/some_dir mkdir foo/some_dir fsync foo/some_dir/some_file The fsync above will unlink the original some_dir without recording it in its new location (foo2). After a crash, some_dir will be gone unless the fsync of some_file forces a full commit 2) we must log any new names for any file or dir that is in the fsync log. This way we make sure not to lose files that are unlinked during the same transaction. 2a) we must log any new names for any file or dir during rename when the directory they are being removed from was logged. 2a is actually the more important variant. Without the extra logging a crash might unlink the old name without recreating the new one 3) after a crash, we must go through any directories with a link count of zero and redo the rm -rf mkdir f1/foo normal commit rm -rf f1/foo fsync(f1) The directory f1 was fully removed from the FS, but fsync was never called on f1, only its parent dir. After a crash the rm -rf must be replayed. This must be able to recurse down the entire directory tree. The inode link count fixup code takes care of the ugly details. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:52 -04:00
Chris Mason	a74ac32207	Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay During log replay, inodes are copied from the log to the main filesystem btrees. Sometimes they have a zero link count in the log but they actually gain links during the replay or have some in the main btree. This patch updates the link count to be at least one after copying the inode out of the log. This makes sure the inode is deleted during an iput while the rest of the replay code is still working on it. The log replay has fixup code to make sure that link counts are correct at the end of the replay, so we could use any non-zero number here and it would work fine. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	a4b6e07d1a	Btrfs: limit balancing work while flushing delayed refs The delayed reference mechanism is responsible for all updates to the extent allocation trees, including those updates created while processing the delayed references. This commit tries to limit the amount of work that gets created during the final run of delayed refs before a commit. It avoids cowing new blocks unless it is required to finish the commit, and so it avoids new allocations that were not really required. The goal is to avoid infinite loops where we are always making more work on the final run of delayed refs. Over the long term we'll make a special log for the last delayed ref updates as well. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	5d13a98f3b	Btrfs: readahead checksums during btrfs_finish_ordered_io This reads in blocks in the checksum btree before starting the transaction in btrfs_finish_ordered_io. It makes it much more likely we'll be able to do operations inside the transaction without needing any btree reads, which limits transaction latencies overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:51 -04:00
Chris Mason	b9473439d3	Btrfs: leave btree locks spinning more often btrfs_mark_buffer dirty would set dirty bits in the extent_io tree for the buffers it was dirtying. This may require a kmalloc and it was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to set any btree locks they were holding to blocking first. This commit changes dirty tracking for extent buffers to just use a flag in the extent buffer. Now that we have one and only one extent buffer per page, this can be safely done without losing dirty bits along the way. This also introduces a path->leave_spinning flag that callers of btrfs_search_slot can use to indicate they will properly deal with a path returned where all the locks are spinning instead of blocking. Many of the btree search callers now expect spinning paths, resulting in better btree concurrency overall. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	89573b9c51	Btrfs: Only let very young transactions grow during commit Commits are fairly expensive, and so btrfs has code to sit around for a while during the commit and let new writers come in. But, while we're sitting there, new delayed refs might be added, and those can be expensive to process as well. Unless the transaction is very very young, it makes sense to go ahead and let the commit finish without hanging around. The commit grow loop isn't as important as it used to be, the fsync logging code handles most performance critical syncs now. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:28 -04:00
Chris Mason	66d7e85ea7	Btrfs: Check for a blocking lock before taking the spin This reduces contention on the extent buffer spin locks by testing for a blocking lock before trying to take the spinlock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	7f366cfecf	Btrfs: reduce stack in cow_file_range The fs/btrfs/inode.c code to run delayed allocation during writout needed some stack usage optimization. This is the first pass, it does the check for compression earlier on, which allows us to do the common (no compression) case higher up in the call chain. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:27 -04:00
Chris Mason	b7ec40d784	Btrfs: reduce stalls during transaction commit To avoid deadlocks and reduce latencies during some critical operations, some transaction writers are allowed to jump into the running transaction and make it run a little longer, while others sit around and wait for the commit to finish. This is a bit unfair, especially when the callers that jump in do a bunch of IO that makes all the others procs on the box wait. This commit reduces the stalls this produces by pre-reading file extent pointers during btrfs_finish_ordered_io before the transaction is joined. It also tunes the drop_snapshot code to politely wait for transactions that have started writing out their delayed refs to finish. This avoids new delayed refs being flooded into the queue while we're trying to close off the transaction. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	c3e69d58e8	Btrfs: process the delayed reference queue in clusters The delayed reference queue maintains pending operations that need to be done to the extent allocation tree. These are processed by finding records in the tree that are not currently being processed one at a time. This is slow because it uses lots of time searching through the rbtree and because it creates lock contention on the extent allocation tree when lots of different procs are running delayed refs at the same time. This commit changes things to grab a cluster of refs for processing, using a cursor into the rbtree as the starting point of the next search. This way we walk smoothly through the rbtree. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	1887be66dc	Btrfs: try to cleanup delayed refs while freeing extents When extents are freed, it is likely that we've removed the last delayed reference update for the extent. This checks the delayed ref tree when things are freed, and if no ref updates area left it immediately processes the delayed ref. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:26 -04:00
Chris Mason	44871b1b24	Btrfs: reduce stack usage in some crucial tree balancing functions Many of the tree balancing functions follow the same pattern. 1) cow a block 2) do something to the result This commit breaks them up into two functions so the variables and code required for part two don't suck down stack during part one. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	56bec294de	Btrfs: do extent allocation and reference count updates in the background The extent allocation tree maintains a reference count and full back reference information for every extent allocated in the filesystem. For subvolume and snapshot trees, every time a block goes through COW, the new copy of the block adds a reference on every block it points to. If a btree node points to 150 leaves, then the COW code needs to go and add backrefs on 150 different extents, which might be spread all over the extent allocation tree. These updates currently happen during btrfs_cow_block, and most COWs happen during btrfs_search_slot. btrfs_search_slot has locks held on both the parent and the node we are COWing, and so we really want to avoid IO during the COW if we can. This commit adds an rbtree of pending reference count updates and extent allocations. The tree is ordered by byte number of the extent and byte number of the parent for the back reference. The tree allows us to: 1) Modify back references in something close to disk order, reducing seeks 2) Significantly reduce the number of modifications made as block pointers are balanced around 3) Do all of the extent insertion and back reference modifications outside of the performance critical btrfs_search_slot code. #3 has the added benefit of greatly reducing the btrfs stack footprint. The extent allocation tree modifications are done without the deep (and somewhat recursive) call chains used in the past. These delayed back reference updates must be done before the transaction commits, and so the rbtree is tied to the transaction. Throttling is implemented to help keep the queue of backrefs at a reasonable size. Since there was a similar mechanism in place for the extent tree extents, that is removed and replaced by the delayed reference tree. Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Chris Mason	9fa8cfe706	Btrfs: don't preallocate metadata blocks during btrfs_search_slot In order to avoid doing expensive extent management with tree locks held, btrfs_search_slot will preallocate tree blocks for use by COW without any tree locks held. A later commit moves all of the extent allocation work for COW into a delayed update mechanism, and this preallocation will no longer be required. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-24 16:14:25 -04:00
Martin K. Petersen	6d2a78e783	block: add private bio_set for bio integrity allocations The integrity bio allocation needs its own bio_set to avoid violating the mempool allocation rules and risking deadlocks. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:17 +01:00
Jens Axboe	a7fcd37cdc	block: don't create bio_vec slabs of less than the inline number If we don't have CONFIG_BLK_DEV_INTEGRITY set, then we don't have any external dependencies on the bio_vec slabs. So don't create the ones that we will inline anyway. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:16 +01:00
Ingo Molnar	34053979fb	block: cleanup bio_alloc_bioset() this warning (which got fixed by commit `b2bf968`): fs/bio.c: In function ‘bio_alloc_bioset’: fs/bio.c:305: warning: ‘p’ may be used uninitialized in this function Triggered because the code flow in bio_alloc_bioset() is correct but a bit complex for the compiler to see through. Streamline it a bit - this also makes the code a tiny bit more compact: text data bss dec hex filename 7540 256 40 7836 1e9c bio.o.before 7539 256 40 7835 1e9b bio.o.after Also remove an older compiler-warnings annotation from this function, it's not needed. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-24 12:35:16 +01:00
Steven Whitehouse	df3647b245	GFS2: Fix freeze issue This removes some old code that was causing issues during filesystem freeze. Reported-by: Andrew Price <andy@andrewprice.me.uk> Tested-by: Andrew Price <andy@andrewprice.me.uk> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:31:30 +00:00
Steven Whitehouse	9c538837d8	Fix a minor bug in the previous patch The logic requires that we mark the glock dirty in page_mkwrite otherwise we might not flush correctly in the case that no allocation was required in the process of dirying the page. Also we need to set the shared write flag early for the same reason. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:27 +00:00
Steven Whitehouse	6bac243f07	GFS2: Clean up of glops.c This cleans up a number of bits of code mostly based in glops.c. A couple of simple functions have been merged into the callers to make it more obvious what is going on, the mysterious raising of i_writecount around the truncate_inode_pages() call has been removed. The meta_go_* operations have been renamed rgrp_go_* since that is the only lock type that they are used with. The unused argument of gfs2_read_sb has been removed. Also a bug has been fixed where a check for the rindex inode was in the wrong callback. More comments are added, and the debugging code is improved too. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:27 +00:00
Benjamin Marzinski	02ffad08e8	GFS2: Fix locking bug in failed shared to exclusive conversion After calling out to the dlm, GFS2 sets the new state of a glock to gl_target in gdlm_ast(). However, gl_target is not always the lock state that was requested. If a conversion from shared to exclusive fails, finish_xmote() will call do_xmote() with LM_ST_UNLOCKED, instead of gl->gl_target, so that it can reacquire the lock in exlusive the next time around. In this case, setting the lock to gl_target in gdlm_ast() will make GFS2 think that it has the glock in exclusive mode, when really, it doesn't have the glock locked at all. This patch adds a new field to the gfs2_glock structure, gl_req, to track the mode that was requested. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:26 +00:00
Hisashi Hifumi	229615def3	GFS2: Pagecache usage optimization on GFS2 I introduced "is_partially_uptodate" aops for GFS2. A page can have multiple buffers and even if a page is not uptodate, some buffers can be uptodate on pagesize != blocksize environment. This aops checks that all buffers which correspond to a part of a file that we want to read are uptodate. If so, we do not have to issue actual read IO to HDD even if a page is not uptodate because the portion we want to read are uptodate. "block_is_partially_uptodate" function is already used by ext2/3/4. With the following patch random read/write mixed workloads or random read after random write workloads can be optimized and we can get performance improvement. I did a performance test using the sysbench. #sysbench --num-threads=16 --max-requests=200000 --test=fileio --file-num=1 --file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0 --file-rw-ratio=1 run -2.6.29-rc6 Test execution summary: total time: 202.6389s total number of events: 200000 total time taken by event execution: 2580.0480 per-request statistics: min: 0.0000s avg: 0.0129s max: 49.5852s approx. 95 percentile: 0.0462s -2.6.29-rc6-patched Test execution summary: total time: 177.8639s total number of events: 200000 total time taken by event execution: 2419.0199 per-request statistics: min: 0.0000s avg: 0.0121s max: 52.4306s approx. 95 percentile: 0.0444s arch: ia64 pagesize: 16k blocksize: 4k Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:25 +00:00
Hannes Eder	02ab172159	GFS2: fix sparse warning: Should it be static? Impact: Make symbol static. Fix this sparse warning: fs/gfs2/rgrp.c:188:5: warning: symbol 'gfs2_bitfit' was not declared. Should it be static? Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:25 +00:00
Hannes Eder	075ac44875	GFS2: fix sparse warnings: constant is so big it is ... Fix this sparse warnings: fs/gfs2/rgrp.c:156:23: warning: constant 0xffffffffffffffff is so big it is unsigned long long fs/gfs2/rgrp.c:157:23: warning: constant 0xaaaaaaaaaaaaaaaa is so big it is unsigned long long fs/gfs2/rgrp.c:158:23: warning: constant 0x5555555555555555 is so big it is long long fs/gfs2/rgrp.c:194:20: warning: constant 0x5555555555555555 is so big it is long long fs/gfs2/rgrp.c:204:44: warning: constant 0x5555555555555555 is so big it is long long Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:24 +00:00
Steven Whitehouse	b9a9694570	GFS2: Support quota/noquota mount arguments This adds support for "quota" and "noquota" mount options in addition to the existing "quota=on/off/account" so that we are compatible with the names by which these options are more generally known. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:23 +00:00
Steven Whitehouse	223b2b889f	GFS2: Fix alignment issue and tidy gfs2_bitfit An alignment issue with the existing bitfit algorithm was reported on IA64. This patch attempts to fix that, and also to tidy up the code a bit. There is now more documentation about how this works and it has survived a number of different tests. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:22 +00:00
Steven Whitehouse	64d576ba23	GFS2: Add a "demote a glock" interface to sysfs This adds a sysfs file called demote_rq to GFS2's per filesystem directory. Its possible to use this file to demote arbitrary glocks in exactly the same way as if a request had come in from a remote node. This is intended for testing issues relating to caching of data under glocks. Despite that, the interface is generic enough to send requests to any type of glock, but be careful as its not always safe to send an arbitrary message to an arbitrary glock. For that reason and to prevent DoS, this interface is restricted to root only. The messages look like this: <type>:<glocknumber> <mode> Example: echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq Which means "please demote inode glock (type 2) number 13324 so that I can get an EX (exclusive) lock". The lock modes are those which would normally be sent by a remote node in its callback so if you want to unlock a glock, you use EX, to demote to shared, use SH or PR (depending on whether you like GFS2 or DLM lock modes better!). If the glock doesn't exist, you'll get -ENOENT returned. If the arguments don't make sense, you'll get -EINVAL returned. The plan is that this interface will be used in combination with the blktrace patch which I recently posted for comments although it is, of course, still useful in its own right. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:22 +00:00
Steven Whitehouse	02e3cc70ec	GFS2: Expose UUID via sysfs/uevent Since we have a UUID, we ought to expose it to the user via sysfs and uevents. We already have the fs name in both of these places (a combination of the lock proto and lock table name) so if we add the UUID as well, we have a full set. For older filesystems (i.e. those created before mkfs.gfs2 was writing UUIDs by default) the sysfs file will appear zero length, and no UUID env var will be added to the uevents. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:21 +00:00
Steven Whitehouse	f15ab5619d	GFS2: Support generation of discard requests This patch allows GFS2 to generate discard requests for blocks which are no longer useful to the filesystem (i.e. those which have been freed as the result of an unlink operation). The requests are generated at the time which those blocks become available for reuse in the filesystem. In order to use this new feature, you have to specify the "discard" mount option. The code coalesces adjacent blocks into a single extent when generating the discard requests, thus generating the minimum number. If an error occurs when the request has been sent to the block device, then it will print a message and turn off the requests for that filesystem. If the problem is temporary, then you can use remount to turn the option back on again. There is also a nodiscard mount option so that you can use remount to turn discard requests off, if required. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:20 +00:00
Steven Whitehouse	d8348de06f	GFS2: Fix deadlock on journal flush This patch fixes a deadlock when the journal is flushed and there are dirty inodes other than the one which caused the journal flush. Originally the journal flushing code was trying to obtain the transaction glock while running the flush code for an inode glock. We no longer require the transaction glock at this point in time since we know that any attempt to get the transaction glock from another node will result in a journal flush. So if we are flushing the journal, we can be sure that the transaction lock is still cached from when the transaction was started. By inlining a version of gfs2_trans_begin() (minus the bit which gets the transaction glock) we can avoid the deadlock problems caused if there is a demote request queued up on the transaction glock. In addition I've also moved the umount rwsem so that it covers the glock workqueue, since it all demotions are done by this workqueue now. That fixes a bug on umount which I came across while fixing the original problem. Reported-by: David Teigland <teigland@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:18 +00:00
Steven Whitehouse	e7c8707ea2	GFS2: Fix error path ref counting for root inode We were keeping hold of an extra ref to the root inode in one of the error paths, that resulted in a hang. Reported-by: Nate Straz <nstraz@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Tested-by: Robert Peterson <rpeterso@redhat.com>	2009-03-24 11:21:17 +00:00
Steven Whitehouse	ac2425e7d3	GFS2: Remove unused field from glock The time stamp field is unused in the glock now that we are using a shrinker, so that we can remove it and save sizeof(unsigned long) bytes in each glock. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:17 +00:00
Steven Whitehouse	f057f6cdf6	GFS2: Merge lock_dlm module into GFS2 This is the big patch that I've been working on for some time now. There are many reasons for wanting to make this change such as: o Reducing overhead by eliminating duplicated fields between structures o Simplifcation of the code (reduces the code size by a fair bit) o The locking interface is now the DLM interface itself as proposed some time ago. o Fewer lookups of glocks when processing replies from the DLM o Fewer memory allocations/deallocations for each glock o Scope to do further optimisations in the future (but this patch is more than big enough for now!) Please note that (a) this patch relates to the lock_dlm module and not the DLM itself, that is still a separate module; and (b) that we retain the ability to build GFS2 as a standalone single node filesystem with out requiring the DLM. This patch needs a lot of testing, hence my keeping it I restarted my -git tree after the last merge window. That way, this has the maximum exposure before its merged. This is (modulo a few minor bug fixes) the same patch that I've been posting on and off the the last three months and its passed a number of different tests so far. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:14 +00:00
Steven Whitehouse	22077f57de	GFS2: Remove "double" locking in quota We only really need a single spin lock for the quota data, so lets just use the lru lock for now. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Cc: Abhijith Das <adas@redhat.com>	2009-03-24 11:21:13 +00:00
Abhijith Das	0a7ab79c5b	GFS2: change gfs2_quota_scan into a shrinker Deallocation of gfs2_quota_data objects now happens on-demand through a shrinker instead of routinely deallocating through the quotad daemon. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:12 +00:00
Abhijith Das	2db2aac255	GFS2: Bring back lvb-related stuff to lock_nolock to support quotas The quota code uses lvbs and this is currently not implemented in lock_nolock, thereby causing panics when quota is enabled with lock_nolock. This patch adds the relevant bits. Signed-off-by: Abhijith Das <adas@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:11 +00:00
Steven Whitehouse	6f04c1c7fe	GFS2: Fix remount argument parsing The following patch fixes an issue relating to remount and argument parsing. After this fix is applied, remount becomes atomic in that it either succeeds changing the mount to the new state, or it fails and leaves it in the old state. Previously it was possible for the parsing of options to fail part way though and for the fs to be left in a state where some of the new arguments had been applied, but some had not. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>	2009-03-24 11:21:10 +00:00
James Morris	703a3cd728	Merge branch 'master' into next	2009-03-24 10:52:46 +11:00
Gertjan van Wingerde	f762dd6821	Update my email address Update all previous incarnations of my email address to the correct one. Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:28:37 -07:00
Tyler Hicks	2aac0cf886	eCryptfs: NULL crypt_stat dereference during lookup If ecryptfs_encrypted_view or ecryptfs_xattr_metadata were being specified as mount options, a NULL pointer dereference of crypt_stat was possible during lookup. This patch moves the crypt_stat assignment into ecryptfs_lookup_and_interpose_lower(), ensuring that crypt_stat will not be NULL before we attempt to dereference it. Thanks to Dan Carpenter and his static analysis tool, smatch, for finding this bug. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:20:43 -07:00
Tyler Hicks	8faece5f90	eCryptfs: Allocate a variable number of pages for file headers When allocating the memory used to store the eCryptfs header contents, a single, zeroed page was being allocated with get_zeroed_page(). However, the size of an eCryptfs header is either PAGE_CACHE_SIZE or ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE (8192), whichever is larger, and is stored in the file's private_data->crypt_stat->num_header_bytes_at_front field. ecryptfs_write_metadata_to_contents() was using num_header_bytes_at_front to decide how many bytes should be written to the lower filesystem for the file header. Unfortunately, at least 8K was being written from the page, despite the chance of the single, zeroed page being smaller than 8K. This resulted in random areas of kernel memory being written between the 0x1000 and 0x1FFF bytes offsets in the eCryptfs file headers if PAGE_SIZE was 4K. This patch allocates a variable number of pages, calculated with num_header_bytes_at_front, and passes the number of allocated pages along to ecryptfs_write_metadata_to_contents(). Thanks to Florian Streibelt for reporting the data leak and working with me to find the problem. 2.6.28 is the only kernel release with this vulnerability. Corresponds to CVE-2009-0787 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Greg KH <greg@kroah.com> Cc: dann frazier <dannf@dannf.org> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Florian Streibelt <florian@f-streibelt.de> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-22 11:20:43 -07:00
Jeff Moyer	65c24491b4	aio: lookup_ioctx can return the wrong value when looking up a bogus context The libaio test harness turned up a problem whereby lookup_ioctx on a bogus io context was returning the 1 valid io context from the list (harness/cases/3.p). Because of that, an extra put_iocontext was done, and when the process exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio (since we expect a users count of 1 and instead get 0). The problem was introduced by "aio: make the lookup_ioctx() lockless" (commit `abf137dd77`). Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not return with a NULL tpos at the end of the loop, even if the entry was not found. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 15:57:18 -07:00
Davide Libenzi	87c3a86e1c	eventfd: remove fput() call from possible IRQ context Remove a source of fput() call from inside IRQ context. Myself, like Eric, wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was able to, with the attached test program. Independently from this, the bug is conceptually there, so we might be better off fixing it. This patch adds an optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd. Playing with ->f_count directly is not pretty in general, but the alternative here would be to add a brand new delayed fput() infrastructure, that I'm not sure is worth it. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 15:57:18 -07:00
Linus Torvalds	fe2fd6cc34	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: Clear space_info full when adding new devices Btrfs: Fix locking around adding new space_info	2009-03-19 14:49:55 -07:00
Trond Myklebust	7fe5c398fc	NFS: Optimise NFS close() Close-to-open cache consistency rules really only require us to flush out writes on calls to close(), and require us to revalidate attributes on the very last close of the file. Currently we appear to be doing a lot of extra attribute revalidation and cache flushes. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:35:50 -04:00
Trond Myklebust	b1e4adf4ea	NFS: Fix the notifications when renaming onto an existing file NFS appears to be returning an unnecessary "delete" notification when we're doing an atomic rename. See http://bugzilla.gnome.org/show_bug.cgi?id=575684 The fix is to get rid of the redundant call to d_delete(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:35:49 -04:00
Trond Myklebust	47c6256420	NFS: Fix up a mismerged patch Move the definition of nfs_need_commit() into the #ifdef CONFIG_NFS_V3 section as originally intended in the patch "NFS: cleanup - remove struct nfs_inode->ncommit" Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-19 15:17:40 -04:00
Linus Torvalds	a8e7d49aa7	Fix race in create_empty_buffers() vs __set_page_dirty_buffers() Nick Piggin noticed this (very unlikely) race between setting a page dirty and creating the buffers for it - we need to hold the mapping private_lock until we've set the page dirty bit in order to make sure that create_empty_buffers() might not build up a set of buffers without the dirty bits set when the page is dirty. I doubt anybody has ever hit this race (and it didn't solve the issue Nick was looking at), but as Nick says: "Still, it does appear to solve a real race, which we should close." Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-19 11:32:05 -07:00
Linus Torvalds	85bff8857c	Merge branch 'for-2.6.29' of git://linux-nfs.org/~bfields/linux * 'for-2.6.29' of git://linux-nfs.org/~bfields/linux: nfsd: nfsd should drop CAP_MKNOD for non-root NFSD: provide encode routine for OP_OPENATTR	2009-03-18 09:27:20 -07:00
Steve French	b363b3304b	[CIFS] Fix memory overwrite when saving nativeFileSystem field during mount CIFS can allocate a few bytes to little for the nativeFileSystem field during tree connect response processing during mount. This can result in a "Redzone overwritten" message to be logged. Signed-off-by: Sridhar Vinay <vinaysridhar@in.ibm.com> Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com> CC: Stable <stable@kernel.org> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-18 05:57:22 +00:00
Steve French	c6c00919ab	[CIFS] Rename compose_mount_options to cifs_compose_mount_options. Make it available to others for reuse. Signed-off-by: Igor Mammedov <niallain@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-18 05:50:07 +00:00
Linus Torvalds	58cefd2b1e	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix bb_prealloc_list corruption due to wrong group locking ext4: fix bogus BUG_ONs in in mballoc code ext4: Print the find_group_flex() warning only once ext4: fix header check in ext4_ext_search_right() for deep extent trees.	2009-03-17 20:55:40 -07:00
Benny Halevy	84f09f46b4	NFSD: provide encode routine for OP_OPENATTR Although this operation is unsupported by our implementation we still need to provide an encode routine for it to merely encode its (error) status back in the compound reply. Thanks for Bill Baker at sun.com for testing with the Sun OpenSolaris' client, finding, and reporting this bug at Connectathon 2009. This bug was introduced in 2.6.27 Signed-off-by: Benny Halevy <bhalevy@panasas.com> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2009-03-17 14:54:45 -04:00
Linus Torvalds	ee568b25ee	Avoid 64-bit "switch()" statements on 32-bit architectures Commit `ee6f779b9e` ("filp->f_pos not correctly updated in proc_task_readdir") changed the proc code to use filp->f_pos directly, rather than through a temporary variable. In the process, that caused the operations to be done on the full 64 bits, even though the offset is never that big. That's all fine and dandy per se, but for some unfathomable reason gcc generates absolutely horrid code when using 64-bit values in switch() statements. To the point of actually calling out to gcc helper functions like __cmpdi2 rather than just doing the trivial comparisons directly the way gcc does for normal compares. At which point we get link failures, because we really don't want to support that kind of crazy code. Fix this by just casting the f_pos value to "unsigned long", which is plenty big enough for /proc, and avoids the gcc code generation issue. Reported-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Zhang Le <r0bertz@gentoo.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-17 10:02:35 -07:00
Eric Sandeen	d33a1976fb	ext4: fix bb_prealloc_list corruption due to wrong group locking This is for Red Hat bug 490026: EXT4 panic, list corruption in ext4_mb_new_inode_pa ext4_lock_group(sb, group) is supposed to protect this list for each group, and a common code flow to remove an album is like this: ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL); ext4_lock_group(sb, grp); list_del(&pa->pa_group_list); ext4_unlock_group(sb, grp); so it's critical that we get the right group number back for this prealloc context, to lock the right group (the one associated with this pa) and prevent concurrent list manipulation. however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a comment, "-1 is to protect from crossing allocation group". This makes sense for the group_pa, where pa_pstart is advanced by the length which has been used (in ext4_mb_release_context()), and when the entire length has been used, pa_pstart has been advanced to the first block of the next group. However, for inode_pa, pa_pstart is never advanced; it's just set once to the first block in the group and not moved after that. So in this case, if we subtract one in ext4_mb_put_pa(), we are actually locking the previous group, and opening the race with the other threads which do not subtract off the extra block. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-16 23:25:40 -04:00
Theodore Ts'o	afd4672dc7	ext4: Add auto_da_alloc mount option Add a mount option which allows the user to disable automatic allocation of blocks whose allocation by delayed allocation when the file was originally truncated or when the file is renamed over an existing file. This feature is intended to save users from the effects of naive application writers, but it reduces the effectiveness of the delayed allocation code. This mount option disables this safety feature, which may be desirable for prodcutions systems where the risk of unclean shutdowns or unexpected system crashes is low. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-16 23:12:23 -04:00
Zhang Le	ee6f779b9e	filp->f_pos not correctly updated in proc_task_readdir filp->f_pos only get updated at the end of the function. Thus d_off of those dirents who are in the middle will be 0, and this will cause a problem in glibc's readdir implementation, specifically endless loop. Because when overflow occurs, f_pos will be set to next dirent to read, however it will be 0, unless the next one is the last one. So it will start over again and again. There is a sample program in man 2 gendents. This is the output of the program running on a multithread program's task dir before this patch is applied: $ ./a.out /proc/3807/task --------------- nread=128 --------------- i-node# file type d_reclen d_off d_name 506442 directory 16 1 . 506441 directory 16 0 .. 506443 directory 16 0 3807 506444 directory 16 0 3809 506445 directory 16 0 3812 506446 directory 16 0 3861 506447 directory 16 0 3862 506448 directory 16 8 3863 This is the output after this patch is applied $ ./a.out /proc/3807/task --------------- nread=128 --------------- i-node# file type d_reclen d_off d_name 506442 directory 16 1 . 506441 directory 16 2 .. 506443 directory 16 3 3807 506444 directory 16 4 3809 506445 directory 16 5 3812 506446 directory 16 6 3861 506447 directory 16 7 3862 506448 directory 16 8 3863 Signed-off-by: Zhang Le <r0bertz@gentoo.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-16 07:51:33 -07:00
Jonathan Corbet	60aa49243d	Rationalize fasync return values Most fasync implementations do something like: return fasync_helper(...); But fasync_helper() will return a positive value at times - a feature used in at least one place. Thus, a number of other drivers do: err = fasync_helper(...); if (err < 0) return err; return 0; In the interests of consistency and more concise code, it makes sense to map positive return values onto zero where ->fasync() is called. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:34:35 -06:00
Jonathan Corbet	76398425bb	Move FASYNC bit handling to f_op->fasync() Removing the BKL from FASYNC handling ran into the challenge of keeping the setting of the FASYNC bit in filp->f_flags atomic with regard to calls to the underlying fasync() function. Andi Kleen suggested moving the handling of that bit into fasync(); this patch does exactly that. As a result, we have a couple of internal API changes: fasync() must now manage the FASYNC bit, and it will be called without the BKL held. As it happens, every fasync() implementation in the kernel with one exception calls fasync_helper(). So, if we make fasync_helper() set the FASYNC bit, we can avoid making any changes to the other fasync() functions - as long as those functions, themselves, have proper locking. Most fasync() implementations do nothing but call fasync_helper() - which has its own lock - so they are easily verified as correct. The BKL had already been pushed down into the rest. The networking code has its own version of fasync_helper(), so that code has been augmented with explicit FASYNC bit handling. Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Miller <davem@davemloft.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Jonathan Corbet	db1dd4d376	Use f_lock to protect f_flags Traditionally, changes to struct file->f_flags have been done under BKL protection, or with no protection at all. This patch causes all f_flags changes after file open/creation time to be done under protection of f_lock. This allows the removal of some BKL usage and fixes a number of longstanding (if microscopic) races. Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Jonathan Corbet	6849991490	Rename struct file->f_ep_lock This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now, epoll remains the only user, but a future patch will use it to protect f_flags as well. Cc: Davide Libenzi <davidel@xmailserver.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jonathan Corbet <corbet@lwn.net>	2009-03-16 08:32:27 -06:00
Linus Torvalds	18553c38bc	Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix Xilinx SystemACE driver to handle empty CF slot block: fix memory leak in bio_clone() block: Add gfp_mask parameter to bio_integrity_clone()	2009-03-14 13:43:18 -07:00
Li Zefan	059ea3318c	block: fix memory leak in bio_clone() If bio_integrity_clone() fails, bio_clone() returns NULL without freeing the newly allocated bio. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-14 21:06:52 +01:00
un'ichi Nomura	87092698c6	block: Add gfp_mask parameter to bio_integrity_clone() Stricter gfp_mask might be required for clone allocation. For example, request-based dm may clone bio in interrupt context so it has to use GFP_ATOMIC. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Cc: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-03-14 21:06:51 +01:00
Linus Torvalds	f1823acfbc	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined... SUNRPC: xprt_connect() don't abort the task if the transport isn't bound SUNRPC: Fix an Oops due to socket not set up yet... Bug 11061, NFS mounts dropped NFS: Handle -ESTALE error in access() NLM: Fix GRANT callback address comparison when IPv6 is enabled NLM: Shrink the IPv4-only version of nlm_cmp_addr() NFSv3: Fix posix ACL code NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2) SUNRPC: Tighten up the task locking rules in __rpc_execute()	2009-03-14 12:00:18 -07:00
Linus Torvalds	ff9cb43ce0	Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: ocfs2: Use xs->bucket to set xattr value outside ocfs2: Fix a bug found by sparse check. ocfs2: tweak to get the maximum inline data size with xattr ocfs2: reserve xattr block for new directory with inline data	2009-03-14 11:59:22 -07:00
Tyler Hicks	84814d642a	eCryptfs: don't encrypt file key with filename key eCryptfs has file encryption keys (FEK), file encryption key encryption keys (FEKEK), and filename encryption keys (FNEK). The per-file FEK is encrypted with one or more FEKEKs and stored in the header of the encrypted file. I noticed that the FEK is also being encrypted by the FNEK. This is a problem if a user wants to use a different FNEK than their FEKEK, as their file contents will still be accessible with the FNEK. This is a minimalistic patch which prevents the FNEKs signatures from being copied to the inode signatures list. Ultimately, it keeps the FEK from being encrypted with a FNEK. Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Acked-by: Dustin Kirkland <kirkland@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Johannes Weiner	15e7b87676	nommu: ramfs: don't leak pages when adding to page cache fails When a ramfs nommu mapping is expanded, contiguous pages are allocated and added to the pagecache. The caller's reference is then passed on by moving whole pagevecs to the file lru list. If the page cache adding fails, make sure that the error path also moves the pagevec contents which might still contain up to PAGEVEC_SIZE successfully added pages, of which we would leak references otherwise. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Howells <dhowells@redhat.com> Cc: Enrik Berkhan <Enrik.Berkhan@ge.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Enrik Berkhan	020fe22ff1	nommu: ramfs: pages allocated to an inode's pagecache may get wrongly discarded The pages attached to a ramfs inode's pagecache by truncation from nothing - as done by SYSV SHM for example - may get discarded under memory pressure. The problem is that the pages are not marked dirty. Anything that creates data in an MMU-based ramfs will cause the pages holding that data will cause the set_page_dirty() aop to be called. For the NOMMU-based mmap, set_page_dirty() may be called by write(), but it won't be called by page-writing faults on writable mmaps, and it isn't called by ramfs_nommu_expand_for_mapping() when a file is being truncated from nothing to allocate a contiguous run. The solution is to mark the pages dirty at the point of allocation by the truncation code. Signed-off-by: Enrik Berkhan <Enrik.Berkhan@ge.com> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-14 11:57:22 -07:00
Eric Sandeen	8d03c7a0c5	ext4: fix bogus BUG_ONs in in mballoc code Thiemo Nagel reported that: # dd if=/dev/zero of=image.ext4 bs=1M count=2 # mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \ -O large_file,dir_index,flex_bg,extent,sparse_super image.ext4 # mount -o loop image.ext4 mnt/ # dd if=/dev/zero of=mnt/file oopsed, with a BUG_ON in ext4_mb_normalize_request because size == EXT4_BLOCKS_PER_GROUP It appears to me (esp. after talking to Andreas) that the BUG_ON is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should be allowed, though larger sizes do indicate a problem. Fix that an another (apparently rare) codepath with a similar check. Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-14 11:51:46 -04:00
Tao Ma	712e53e46a	ocfs2: Use xs->bucket to set xattr value outside A long time ago, xs->base is allocated a 4K size and all the contents in the bucket are copied to the it. Now we use ocfs2_xattr_bucket to abstract xattr bucket and xs->base is initialized to the start of the bu_bhs[0]. So xs->base + offset will overflow when the value root is stored outside the first block. Then why we can survive the xattr test by now? It is because we always read the bucket contiguously now and kernel mm allocate continguous memory for us. We are lucky, but we should fix it. So just get the right value root as other callers do. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:46:09 -07:00
Tao Ma	74e77eb30d	ocfs2: Fix a bug found by sparse check. We need to use le32_to_cpu to test rec->e_cpos in ocfs2_dinode_insert_check. Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:46:01 -07:00
Tiger Yang	d9ae49d6e2	ocfs2: tweak to get the maximum inline data size with xattr Replace max_inline_data with max_inline_data_with_xattr to ensure it correct when xattr inlined. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:45:46 -07:00
Tiger Yang	6c9fd1dc0a	ocfs2: reserve xattr block for new directory with inline data If this is a new directory with inline data, we choose to reserve the entire inline area for directory contents and force an external xattr block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-03-12 16:45:40 -07:00
Linus Torvalds	188de5ec56	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch	2009-03-12 16:32:36 -07:00
Nick Piggin	7ef0d7377c	fs: new inode i_state corruption fix There was a report of a data corruption http://lkml.org/lkml/2008/11/14/121. There is a script included to reproduce the problem. During testing, I encountered a number of strange things with ext3, so I tried ext2 to attempt to reduce complexity of the problem. I found that fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be cleared, even though instrumentation showed that unlock_new_inode had already been called for that inode. This points to memory scribble, or synchronisation problme. i_state of I_NEW inodes is not protected by inode_lock because other processes are not supposed to touch them until I_LOCK (and I_NEW) is cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify i_state revealed that generic_sync_sb_inodes is picking up new inodes from the inode lists and passing them to __writeback_single_inode without waiting for I_NEW. Subsequently modifying i_state causes corruption. In my case it would look like this: CPU0 CPU1 unlock_new_inode() __sync_single_inode() reg <- inode->i_state reg -> reg & ~(I_LOCK\|I_NEW) reg <- inode->i_state reg -> inode->i_state reg -> reg \| I_SYNC reg -> inode->i_state Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK\|I_NEW again. Fix for this is rather than wait for I_NEW inodes, just skip over them: inodes concurrently being created are not subject to data integrity operations, and should not significantly contribute to dirty memory either. After this change, I'm unable to reproduce any of the added warnings or hangs after ~1hour of running. Previously, the new warnings would start immediately and hang would happen in under 5 minutes. I'm also testing on ext3 now, and so far no problems there either. I don't know whether this fixes the problem reported above, but it fixes a real problem for me. Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net> Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:24 -07:00
Li Zefan	a3cfbb53b1	vfs: add missing unlock in sget() In sget(), destroy_super(s) is called with s->s_umount held, which makes lockdep unhappy. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:23 -07:00
Oleg Nesterov	e5bc49ba74	pipe_rdwr_fasync: fix the error handling to prevent the leak/crash If the second fasync_helper() fails, pipe_rdwr_fasync() returns the error but leaves the file on ->fasync_readers. This was always wrong, but since `233e70f422` "saner FASYNC handling on file close" we have the new problem. Because in this case setfl() doesn't set FASYNC bit, __fput() will not do ->fasync(0), and we leak fasync_struct with ->fa_file pointing to the freed file. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-12 16:20:23 -07:00
Trond Myklebust	9f4c899c0d	NFS: Fix the fix to Bugzilla #11061 , when IPv6 isn't defined... Stephen Rothwell reports: Today's linux-next build (powerpc ppc64_defconfig) failed like this: fs/built-in.o: In function `.nfs_get_client': client.c:(.text+0x115010): undefined reference to `.__ipv6_addr_type' Fix by moving the IPV6 specific parts of commit `d7371c41b0` ("Bug 11061, NFS mounts dropped") into the '#ifdef IPV6..." section. Also fix up a couple of formatting issues. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-12 14:51:32 -04:00
Theodore Ts'o	2842c3b544	ext4: Print the find_group_flex() warning only once This is a short-term warning, and even printk_ratelimit() can result in too much noise in system logs. So only print it once as a warning. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-12 12:20:01 -04:00
Phillip Lougher	363911d027	Squashfs: Valid filesystems are flagged as bad by the corrupted fs patch The corrupted filesystem patch added a check against zlib trying to output too much data in the presence of data corruption. This check triggered if zlib_inflate asked to be called again (Z_OK) with avail_out == 0 and no more output buffers available. This check proves to be rather dumb, as it incorrectly catches the case where zlib has generated all the output, but there are still input bytes to be processed. This patch does a number of things. It removes the original check and replaces it with code to not move to the next output buffer if there are no more output buffers available, relying on zlib to error if it wants an extra output buffer in the case of data corruption. It also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH flag, and makes the error messages more understandable to non-technical users. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk> Reported-by: Stefan Lippers-Hollmann <s.L-H@gmx.de>	2009-03-12 03:23:48 +00:00
Steve French	64cc2c6369	[CIFS] work around bug in Samba server handling for posix open Samba server (version 3.3.1 and earlier, and 3.2.8 and earlier) incorrectly required the O_CREAT flag on posix open (even when a file was not being created). This disables posix open (create is still ok) after the first attempt returns EINVAL (and logs an error, once, recommending that they update their server). Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	276a74a483	[CIFS] Use posix open on file open when server supports it Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Jeff Layton	fcc7c09d94	cifs: fix buffer format byte on NT Rename/hardlink Discovered at Connnectathon 2009... The buffer format byte and the pad are transposed in NT_RENAME calls (which are used to set hardlinks). Most servers seem to ignore this fact, but NetApp filers throw back an error due to this problem. This patch fixes it. CC: Stable <stable@kernel.org> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	0382457744	[CIFS] Add definitions for remoteably fsctl calls There are about 60 fsctl calls which Windows claims would be able to be sent remotely and handled by the server. This adds the #defines for them. A few of them look immediately useful, but need to also add the structure definitions for them so they can be sent as SMBs. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	1adcb71092	[CIFS] add extra null attr check Although attr == NULL can not happen, this makes cifs_set_file_info safer in the future since it may not be obvious that the caller can not set attr to NULL. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:21 +00:00
Steve French	4717bed680	[CIFS] fix build error Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	7fc8f4e95b	[CIFS] reopen file via newer posix open protocol operation if available If the network connection crashes, and we have to reopen files, preferentially use the newer cifs posix open protocol operation if the server supports it. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	be652445fd	[CIFS] Add new nostrictsync cifs mount option to avoid slow SMB flush If this mount option is set, when an application does an fsync call then the cifs client does not send an SMB Flush to the server (to force the server to write all dirty data for this file immediately to disk), although cifs still sends all dirty (cached) file data to the server and waits for the server to respond to the write write. Since SMB Flush can be very slow, and some servers may be reliable enough (to risk delaying slightly flushing the data to disk on the server), turning on this option may be useful to improve performance for applications that fsync too much, at a small risk of server crash. If this mount option is not set, by default cifs will send an SMB flush request (and wait for a response) on every fsync call. Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	10e70afa75	[CIFS] DFS no longer experimental Also updates some DFS flag definitions Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Steve French	b298f22355	[CIFS] Send SMB flush in cifs_fsync In contrast to the now-obsolete smbfs, cifs does not send SMB_COM_FLUSH in response to an explicit fsync(2) to guarantee that all volatile data is written to stable storage on the server side, provided the server honors the request (which, to my knowledge, is true for Windows and Samba with 'strict sync' enabled). This patch modifies the cifs_fsync implementation to restore the fsync-behavior of smbfs by triggering SMB_COM_FLUSH after sending outstanding data on the client side to the server. Signed-off-by: Horst Reiterer <horst.reiterer@gmail.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>	2009-03-12 01:36:20 +00:00
Linus Torvalds	0789d8fccb	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: only issues a cache flush on unmount if barriers are enabled xfs: prevent lockdep false positive in xfs_iget_cache_miss xfs: prevent kernel crash due to corrupted inode log format	2009-03-11 14:29:03 -07:00
OGAWA Hirofumi	3a95ea1155	Fix _fat_bmap() locking On swapon() path, it has already i_mutex. So, this uses i_alloc_sem instead of it. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: Laurent GUERBY <laurent@guerby.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-11 12:04:18 -07:00
Tom Talpey	a67d18f89f	NFS: load the rpc/rdma transport module automatically When mounting an NFS/RDMA server with the "-o proto=rdma" or "-o rdma" options, attempt to dynamically load the necessary "xprtrdma" client transport module. Doing so improves usability, while avoiding a static module dependency and any unnecesary resources. Signed-off-by: Tom Talpey <tmtalpey@gmail.com> Cc: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:37:56 -04:00
Trond Myklebust	e1ebfd33be	NFS: Kill the "defined but not used" compile error on nommu machines Bryan Wu reports that when compiling NFS on nommu machines he gets a "defined but not used" error on nfs_file_mmap(). The easiest fix is simply to get rid of the special casing in NFS, and just always call generic_file_mmap() to set up the file. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:37:54 -04:00
Trond Myklebust	72cb77f4a5	NFS: Throttle page dirtying while we're flushing to disk The following patch is a combination of a patch by myself and Peter Staubach. Trond: If we allow other processes to dirty pages while a process is doing a consistency sync to disk, we can end up never making progress. Peter: Attached is a patch which addresses a continuing problem with the NFS client generating out of order WRITE requests. While this is compliant with all of the current protocol specifications, there are servers in the market which can not handle out of order WRITE requests very well. Also, this may lead to sub-optimal block allocations in the underlying file system on the server. This may cause the read throughputs to be reduced when reading the file from the server. Peter: There has been a lot of work recently done to address out of order issues on a systemic level. However, the NFS client is still susceptible to the problem. Out of order WRITE requests can occur when pdflush is in the middle of writing out pages while the process dirtying the pages calls generic_file_buffered_write which calls generic_perform_write which calls balance_dirty_pages_rate_limited which ends up calling writeback_inodes which ends up calling back into the NFS client to writes out dirty pages for the same file that pdflush happens to be working with. Signed-off-by: Peter Staubach <staubach@redhat.com> [modification by Trond to merge the two similar patches] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:30 -04:00
Trond Myklebust	fb8a1f11b6	NFS: cleanup - remove struct nfs_inode->ncommit Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:29 -04:00
Trond Myklebust	a65318bf3a	NFSv4: Simplify some cache consistency post-op GETATTRs Certain asynchronous operations such as write() do not expect (or care) that other metadata such as the file owner, mode, acls, ... change. All they want to do is update and/or check the change attribute, ctime, and mtime. By skipping the file owner and group update, we also avoid having to do a potential idmapper upcall for these asynchronous RPC calls. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:28 -04:00
Trond Myklebust	69aaaae18f	NFSv4: A referral is assumed to always point to a directory. Fix a bug whereby we would fail to create a mount point for a referral. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:28 -04:00
Trond Myklebust	409924e4c9	NFSv4: Make decode_getfattr() set fattr->valid to reflect what was decoded Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:27 -04:00
Trond Myklebust	f26c7a7887	NFSv4: Clean up decode_getfattr() Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:26 -04:00
Trond Myklebust	bca794785c	NFS: Fix the type of struct nfs_fattr->mode There is no point in using anything other than umode_t, since we copy the content pretty much directly into inode->i_mode. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:26 -04:00
Trond Myklebust	1ca277d88d	NFS: Shrink the struct nfs_fattr We don't need the bitmap[] field anymore, since the 'valid' field tells us all we need to know about which attributes were filled in... Also move the pre-op attributes in order to improve the structure packing. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:25 -04:00
Trond Myklebust	9e6e70f8d8	NFSv4: Support NFSv4 optional attributes in the struct nfs_fattr Currently, filling struct nfs_fattr is more or less an all or nothing operation, since NFSv2 and NFSv3 have only mandatory attributes. In NFSv4, some attributes are optional, and so we may simply not be able to fill in those fields. Furthermore, NFSv4 allows you to specify which attributes you are interested in retrieving, thus permitting you to optimise away retrieval of attributes that you know will no change... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:24 -04:00
Trond Myklebust	78f945f88e	NFSv4: Ignore errors on the post-op attributes in SETATTR calls There is no need to fail or retry a SETATTR call just because the post-op GETATTR failed. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:23 -04:00
NeilBrown	37d9d76d8b	NFS: flush cached directory information slightly more readily. If cached directory contents becomes incorrect, there is no way to flush the contents. This contrasts with files where file locking is the recommended way to ensure cache consistency between multiple applications (a read-lock always flushes the cache). Also while changes to files often change the size of the file (thus triggering a cache flush), changes to directories often do not change the apparent size (as the size is often rounded to a block size). So it is particularly important with directories to avoid the possibility of an incorrect cache wherever possible. When the link count on a directory changes it implies a change in the number of child directories, and so a change in the contents of this directory. So use that as a trigger to flush cached contents. When the ctime changes but the mtime does not, there are two possible reasons. 1/ The owner/mode information has been changed. 2/ utimes has been used to set the mtime backwards. In the first case, a data-cache flush is not required. In the second case it is. So on the basis that correctness trumps performance, flush the directory contents cache in this case also. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:23 -04:00
Suresh Jayaraman	2b57dc6cf9	NFS: Minor __nfs_revalidate_inode cleanup Remove redundant NFS_STALE() check, a leftover due to the commit `691beb13cd` Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-11 14:10:22 -04:00
David Teigland	1fecb1c4b6	dlm: fix length calculation in compat code Using offsetof() to calculate name length does not work because it does not produce consistent results with with structure packing. This caused memcpy to corrupt memory by copying 4 extra bytes off the end of the buffer on 64 bit kernels with 32 bit userspace (the only case where this 32/64 compat code is used). The fix is to calculate name length directly from the start instead of trying to derive it later using count and offsetof. Signed-off-by: David Teigland <teigland@redhat.com>	2009-03-11 12:23:59 -05:00
David Teigland	a536e38125	dlm: ignore cancel on granted lock Return immediately from dlm_unlock(CANCEL) if the lock is granted and not being converted; there's nothing to cancel. Signed-off-by: David Teigland <teigland@redhat.com>	2009-03-11 12:23:58 -05:00
David Teigland	43279e5376	dlm: clear defunct cancel state When a conversion completes successfully and finds that a cancel of the convert is still in progress (which is now a moot point), preemptively clear the state associated with outstanding cancel. That state could cause a subsequent conversion to be ignored. Also, improve the consistency and content of error and debug messages in this area. Signed-off-by: David Teigland <teigland@redhat.com>	2009-03-11 12:23:39 -05:00
Christine Caulfield	5e9ccc372d	dlm: replace idr with hash table for connections Integer nodeids can be too large for the idr code; use a hash table instead. Signed-off-by: Christine Caulfield <ccaulfie@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2009-03-11 12:20:58 -05:00
Wu Fengguang	ad3bdefe87	proc: fix kflags to uflags copying in /proc/kpageflags Fix kpf_copy_bit(src,dst) to be kpf_copy_bit(dst,src) to match the actual call patterns, e.g. kpf_copy_bit(kflags, KPF_LOCKED, PG_locked). This misplacement of src/dst only affected reporting of PG_writeback, PG_reclaim and PG_buddy. For others kflags==uflags so not affected. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-11 07:43:33 -07:00
Ian Dall	d7371c41b0	Bug 11061, NFS mounts dropped Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=11061 sockaddr structures can't be reliably compared using memcmp() because there are padding bytes in the structure which can't be guaranteed to be the same even when the sockaddr structures refer to the same socket. Instead compare all the relevant fields. In the case of IPv6 sin6_flowinfo is not compared because it only affects QoS and sin6_scope_id is only compared if the address is "link local" because "link local" addresses need only be unique to a specific link. Signed-off-by: Ian Dall <ian@beware.dropbear.id.au> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-10 20:33:22 -04:00
Suresh Jayaraman	a71ee337b3	NFS: Handle -ESTALE error in access() Hi Trond, I have been looking at a bugreport where trying to open applications on KDE on a NFS mounted home fails temporarily. There have been multiple reports on different kernel versions pointing to this common issue: http://bugzilla.kernel.org/show_bug.cgi?id=12557 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html This issue can be reproducible consistently by doing this on a NFS mounted home (KDE): 1. Open 2 xterm sessions 2. From one of the xterm session, do "ssh -X <remote host>" 3. "stat ~/.Xauthority" on the remote SSH session 4. Close the two xterm sessions 5. On the server do a "stat ~/.Xauthority" 6. Now on the client, try to open xterm This will fail. Even if the filehandle had become stale, the NFS client should invalidate the cache/inode and should repeat LOOKUP. Looking at the packet capture when the failure occurs shows that there were two subsequent ACCESS() calls with the same filehandle and both fails with -ESTALE error. I have tested the fix below. Now the client issue a LOOKUP after the ACCESS() call fails with -ESTALE. If all this makes sense to you, can you consider this for inclusion? Thanks, If the server returns an -ESTALE error due to stale filehandle in response to an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP() can be retried. Without this change, the nfs client retries ACCESS() with the same filehandle, fails again and could lead to temporary failure of applications running on nfs mounted home. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-10 20:33:21 -04:00
Chuck Lever	57df675c60	NLM: Fix GRANT callback address comparison when IPv6 is enabled The NFS mount command may pass an AF_INET server address to lockd. If lockd happens to be using a PF_INET6 listener, the nlm_cmp_addr() in nlmclnt_grant() will fail to match requests from that host because they will all have a mapped IPv4 AF_INET6 address. Adopt the same solution used in nfs_sockaddr_match_ipaddr() for NFSv4 callbacks: if either address is AF_INET, map it to an AF_INET6 address before doing the comparison. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-10 20:33:20 -04:00
Trond Myklebust	ae46141ff0	NFSv3: Fix posix ACL code Fix a memory leak due to allocation in the XDR layer. In cases where the RPC call needs to be retransmitted, we end up allocating new pages without clearing the old ones. Fix this by moving the allocation into nfs3_proc_setacls(). Also fix an issue discovered by Kevin Rudd, whereby the amount of memory reserved for the acls in the xdr_buf->head was miscalculated, and causing corruption. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-10 20:33:18 -04:00
Trond Myklebust	ef95d31e6d	NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2) The changeset `ea31a4437c` (nfs: Fix misparsing of nfsv4 fs_locations attribute) causes the mountpath that is calculated at the beginning of try_location() to be clobbered when we later strncpy a non-nul terminated hostname using an incorrect buffer length. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2009-03-10 20:33:17 -04:00
Alexey Dobriyan	260219cc48	devpts: remove graffiti Very annoying when working with containters. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-03-10 15:55:11 -07:00
Eric Sandeen	395a87bfef	ext4: fix header check in ext4_ext_search_right() for deep extent trees. The ext4_ext_search_right() function is confusing; it uses a "depth" variable which is 0 at the root and maximum at the leaves, but the on-disk metadata uses a "depth" (actually eh_depth) which is opposite: maximum at the root, and 0 at the leaves. The ext4_ext_check_header() function is given a depth and checks the header agaisnt that depth; it expects the on-disk semantics, but we are giving it the opposite in the while loop in this function. We should be giving it the on-disk notion of "depth" which we can get from (p_depth - depth) - and if you look, the last (more commonly hit) call to ext4_ext_check_header() does just this. Sending in the wrong depth results in (incorrect) messages about corruption: EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0), depth 1(2) http://bugzilla.kernel.org/show_bug.cgi?id=12821 Reported-by: David Dindorp <ddi@dubex.dk> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-10 18:18:47 -04:00
Chris Mason	913d952eb5	Btrfs: Clear space_info full when adding new devices The full flag on the space info structs tells the allocator not to try and allocate more chunks because the devices in the FS are fully allocated. When more devices are added, we need to clear the full flag so the allocator knows it has more space available. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-10 13:17:18 -04:00
Chris Mason	4184ea7f90	Btrfs: Fix locking around adding new space_info Storage allocated to different raid levels in btrfs is tracked by a btrfs_space_info structure, and all of the current space_infos are collected into a list_head. Most filesystems have 3 or 4 of these structs total, and the list is only changed when new raid levels are added or at unmount time. This commit adds rcu locking on the list head, and properly frees things at unmount time. It also clears the space_info->full flag whenever new space is added to the FS. The locking for the space info list goes like this: reads: protected by rcu_read_lock() writes: protected by the chunk_mutex At unmount time we don't need special locking because all the readers are gone. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-10 12:39:20 -04:00
Linus Torvalds	1c91ffc896	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: fix spinlock assertions on UP systems	2009-03-09 09:13:16 -07:00
Chris Mason	b9447ef80b	Btrfs: fix spinlock assertions on UP systems btrfs_tree_locked was being used to make sure a given extent_buffer was properly locked in a few places. But, it wasn't correct for UP compiled kernels. This switches it to using assert_spin_locked instead, and renames it to btrfs_assert_tree_locked to better reflect how it was really being used. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2009-03-09 11:45:38 -04:00
Linus Torvalds	5b61f6accf	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix ext4_free_inode() vs. ext4_claim_inode() race	2009-03-08 10:24:57 -07:00
Christoph Hellwig	c141b2928f	xfs: only issues a cache flush on unmount if barriers are enabled Currently we unconditionally issue a flush from xfs_free_buftarg, but since 2.6.29-rc1 this gives a warning in the style of end_request: I/O error, dev vdb, sector 0 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:35:12 -06:00
Christoph Hellwig	7d46be4a25	xfs: prevent lockdep false positive in xfs_iget_cache_miss The inode can't be locked by anyone else as we just created it a few lines above and it's not been added to any lookup data structure yet. So use a trylock that must succeed to get around the lockdep warnings. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Alexander Beregalov <a.beregalov@gmail.com> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:34:59 -06:00
Christoph Hellwig	ff392c497b	xfs: prevent kernel crash due to corrupted inode log format Andras Korn reported an oops on log replay causes by a corrupted xfs_inode_log_format_t passing a 0 size to kmem_zalloc. This patch handles to small or too large numbers of log regions gracefully by rejecting the log replay with a useful error message. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Andras Korn <korn-sgi.com@chardonnay.math.bme.hu> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Signed-off-by: Felix Blyakher <felixb@sgi.com>	2009-03-06 17:34:45 -06:00
David S. Miller	508827ff0a	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/tokenring/tmspci.c drivers/net/ucc_geth_mii.c	2009-03-05 02:06:47 -08:00
Roel Kluin	f4f8056a86	Squashfs: frag_size should be signed, as it can hold an error result Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-03-05 00:55:31 +00:00
Theodore Ts'o	7d39db14a4	ext4: Use struct flex_groups to calculate get_orlov_stats() Instead of looping over all of the block groups in a flex group summing their summary statistics, start tracking used_dirs in struct flex_groups, and use struct flex_groups instead. This should save a bit of CPU for mkdir-heavy workloads. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-04 19:31:53 -05:00
Phillip Lougher	118e1ef6fa	Squashfs: Fix oops when reading fsfuzzer corrupted filesystems This fixes a code regression caused by the recent mainlining changes. The recent code changes call zlib_inflate repeatedly, decompressing into separate 4K buffers, this code didn't check for the possibility that zlib_inflate might ask for too many buffers when decompressing corrupted data. Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>	2009-03-05 00:31:12 +00:00
Theodore Ts'o	9f24e4208f	ext4: Use atomic_t's in struct flex_groups Reduce pressure on the sb_bgl_lock family of locks by using atomic_t's to track the number of free blocks and inodes in each flex_group. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-04 19:09:10 -05:00
Theodore Ts'o	b713a5ec55	ext4: remove /proc tuning knobs Remove tuning knobs in /proc/fs/ext4/<dev/* since they have been replaced by knobs in sysfs at /sys/fs/ext4/<dev>/*. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-31 09:11:14 -04:00
Theodore Ts'o	3197ebdb13	ext4: Add sysfs support Add basic sysfs support so that information about the mounted filesystem and various tuning parameters can be accessed via /sys/fs/ext4/<dev>/*. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-31 09:10:09 -04:00
Eric Sandeen	7ce9d5d1f3	ext4: fix ext4_free_inode() vs. ext4_claim_inode() race I was seeing fsck errors on inode bitmaps after a 4 thread dbench run on a 4 cpu machine: Inode bitmap differences: -50736 -(50752--50753) etc... I believe that this is because ext4_free_inode() uses atomic bitops, and although ext4_new_inode() used to also use atomic bitops for synchronization, commit `393418676a` changed this to use the sb_bgl_lock, so that we could also synchronize against read_inode_bitmap and initialization of uninit inode tables. However, that change left ext4_free_inode using atomic bitops, which I think leaves no synchronization between setting & unsetting bits in the inode table. The below patch fixes it for me, although I wonder if we're getting at all heavy-handed with this spinlock... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-04 18:38:18 -05:00
Linus Torvalds	36b31106b7	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: don't call jbd2_journal_force_commit_nested without journal ext4: Reorder fs/Makefile so that ext2 root fs's are mounted using ext2 ext4: Remove duplicate call to ext4_commit_super() in ext4_freeze()	2009-03-02 15:42:26 -08:00
David S. Miller	aa4abc9bcc	Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/wireless/iwlwifi/iwl-tx.c net/8021q/vlan_core.c net/core/dev.c	2009-03-01 21:35:16 -08:00
Theodore Ts'o	afc32f7ee9	ext4: Track lifetime disk writes Add a new superblock value which tracks the lifetime amount of writes to the filesystem. This is useful in estimating the amount of wear on solid state drives (SSD's) caused by writes to the filesystem. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-28 19:39:58 -05:00
Aneesh Kumar K.V	d6014301b5	ext4: Fix discard of inode prealloc space with delayed allocation. With delayed allocation we should not/cannot discard inode prealloc space during file close. We would still have dirty pages for which we haven't allocated blocks yet. With this fix after each get_blocks request we check whether we have zero reserved blocks and if yes and we don't have any writers on the file we discard inode prealloc space. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-03-27 22:36:43 -04:00
Christoph Hellwig	5cf8cf4146	Fix FREEZE/THAW compat_ioctl regression Commit `8e961870bb` removed the FREEZE/THAW handling in xfs_compat_ioctl but never added any compat handler back, so now any freeze/thaw request from a 32-bit binary ond 64-bit userspace will fail. As these ioctls are 32/64-bit compatible two simple COMPATIBLE_IOCTL entries in fs/compat_ioctl.c will do the job. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-27 16:27:45 -08:00
Benny Halevy	adc487204a	EXPORT_SYMBOL(d_obtain_alias) rather than EXPORT_SYMBOL_GPL Commit `4ea3ada295` declares d_obtain_alias() as EXPORT_SYMBOL_GPL where it's supposed to replace d_alloc_anon which was previously declared as EXPORT_SYMBOL and thus available to any loadable module. This patch reverts that. Signed-off-by: Benny Halevy <bhalevy@panasas.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2009-02-27 16:26:20 -08:00
Linus Torvalds	221be177e6	Merge git://git.infradead.org/mtd-2.6 * git://git.infradead.org/mtd-2.6: [MTD] [MAPS] Remove MODULE_DEVICE_TABLE() from ck804rom driver. [JFFS2] fix mount crash caused by removed nodes [JFFS2] force the jffs2 GC daemon to behave a bit better [MTD] [MAPS] blackfin async requires complex mappings [MTD] [MAPS] blackfin: fix memory leak in error path [MTD] [MAPS] physmap: fix wrong free and del_mtd_{partition,device} [MTD] slram: Handle negative devlength correctly [MTD] map_rom has NULL erase pointer [MTD] [LPDDR] qinfo_probe depends on lpddr	2009-02-26 14:45:57 -08:00
wengang wang	28d57d4377	ocfs2: add IO error check in ocfs2_get_sector() Check for IO error in ocfs2_get_sector(). Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:12 -08:00
Tiger Yang	4442f51826	ocfs2: set gap to seperate entry and value when xattr in bucket This patch set a gap (4 bytes) between xattr entry and name/value when xattr in bucket. This gap use to seperate entry and name/value when a bucket is full. It had already been set when xattr in inode/block. Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:11 -08:00
Tao Ma	c8b9cf9a7c	ocfs2: lock the metaecc process for xattr bucket For other metadata in ocfs2, metaecc is checked in ocfs2_read_blocks with io_mutex held. While for xattr bucket, it is calculated by the whole buckets. So we have to add a spin_lock to prevent multiple processes calculating metaecc. Signed-off-by: Tao Ma <tao.ma@oracle.com> Tested-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:11 -08:00
Tao Ma	89a907afe0	ocfs2: Use the right access_* method in ctime update of xattr. In ctime updating of xattr, it use the wrong type of access for inode, so use ocfs2_journal_access_di instead. Reported-and-Tested-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Tao Ma <tao.ma@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:11 -08:00
Sunil Mushran	53ecd25e14	ocfs2/dlm: Make dlm_assert_master_handler() kill itself instead of the asserter In dlm_assert_master_handler(), if we get an incorrect assert master from a node that, we reply with EINVAL asking the asserter to die. The problem is that an assert is sent after so many hoops, it is invariably the node that thinks the asserter is wrong, is actually wrong. So instead of killing the asserter, this patch kills the assertee. This patch papers over a race that is still being addressed. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:11 -08:00
Sunil Mushran	dabc47de7a	ocfs2/dlm: Use ast_lock to protect ast_list The code was using dlm->spinlock instead of dlm->ast_lock to protect the ast_list. This patch fixes the issue. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:09 -08:00
Sunil Mushran	c74ff8bb22	ocfs2: Cleanup the lockname print in dlmglue.c The dentry lock has a different format than other locks. This patch fixes ocfs2_log_dlm_error() macro to make it print the dentry lock correctly. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:09 -08:00
Sunil Mushran	7dc102b737	ocfs2/dlm: Retract fix for race between purge and migrate Mainline commit `d4f7e650e5` attempts to delay the dlm_thread from sending the drop ref message if the lockres is being migrated. The problem is that we make the dlm_thread wait for the migration to complete. This causes a deadlock as dlm_thread also participates in the lockres migration process. A better fix for the original oss bugzilla#1012 is in testing. Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com> Acked-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:09 -08:00
Tao Ma	47be12e4ee	ocfs2: Access and dirty the buffer_head in mark_written. In __ocfs2_mark_extent_written, when we meet with the situation of c_split_covers_rec, the old solution just replace the extent record and forget to access and dirty the buffer_head. This will cause a problem when the unwritten extent is in an extent block. So access and dirty it. Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com>	2009-02-26 11:51:09 -08:00
Linus Torvalds	64e71303e4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: Btrfs: try committing transaction before returning ENOSPC Btrfs: add better -ENOSPC handling	2009-02-26 10:37:00 -08:00
Jens Axboe	b2bf96833c	block: fix bogus gcc warning for uninitialized var usage Newer gcc throw this warning: fs/bio.c: In function ?bio_alloc_bioset?: fs/bio.c:305: warning: ?p? may be used uninitialized in this function since it cannot figure out that 'p' is only ever used if 'bs' is non-NULL. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2009-02-26 10:45:48 +01:00
Eric Sandeen	8f64b32eb7	ext4: don't call jbd2_journal_force_commit_nested without journal Running without a journal, I oopsed when I ran out of space, because we called jbd2_journal_force_commit_nested() from ext4_should_retry_alloc() without a journal. This should take care of it, I think. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>	2009-02-26 00:57:35 -05:00

... 2 3 4 5 6 ...

13159 Commits (30930554f23511645ad9cfb89792219bf398b654)