redonkable/alistair23-linux

Author	SHA1	Message	Date
Boaz Harrosh	861d66601a	exofs: don't leak io_state and pages on read error Same bug as fixed by Idan for write_exec was in read_exec. Fix the io_state leak and pages state on read error. Also while at it: The if (!pcol->read_4_write) at the error path is redundant because all goto err; are after the if (pcol->read_4_write) bail out. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-12-14 12:17:32 +02:00
Idan Kedar	af402ab2b0	exofs: clean up the correct page collection on write error If ore_write() fails, we would unlock the pages of pcol, which is now empty, rather than pcol_copy, which owns the pages when ore_write() is called. This means that no pages will actually be unlocked (pcol.nr_pages == 0) and the writing process (more accurately, the syncing process) will hang waiting for a writeback notification that never comes. Moreover, if ore_write() fails, pcol_free() is called for pcol, whereas pcol_copy is the object owning the ore_io_state, thus leaking the ore_io_state. [Boaz] I have simplified Idan's original patch a bit, everything else still holds Signed-off-by: Idan Kedar <idank@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-12-11 18:56:18 +02:00
Linus Torvalds	79360ddd73	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull pile 2 of vfs updates from Al Viro: "Stuff in this one - assorted fixes, lglock tidy-up, death to lock_super(). There'll be a VFS pile tomorrow (with patches from Jeff Layton, sanitizing getname() and related parts of audit and preparing for ESTALE fixes), but I'd rather push the stuff in this one ASAP - some of the bugs closed here are quite unpleasant." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: bogus warnings in fs/namei.c consitify do_mount() arguments lglock: add DEFINE_STATIC_LGLOCK() lglock: make the per_cpu locks static lglock: remove unused DEFINE_LGLOCK_LOCKDEP() MAX_LFS_FILESIZE definition for 64bit needs LL... tmpfs,ceph,gfs2,isofs,reiserfs,xfs: fix fh_len checking vfs: drop lock/unlock super ufs: drop lock/unlock super sysv: drop lock/unlock super hpfs: drop lock/unlock super fat: drop lock/unlock super ext3: drop lock/unlock super exofs: drop lock/unlock super dup3: Return an error when oldfd == newfd. fs: handle failed audit_log_start properly fs: prevent use after free in auditing when symlink following was denied	2012-10-12 10:52:03 +09:00
Linus Torvalds	ce40be7a82	Merge branch 'for-3.7/core' of git://git.kernel.dk/linux-block Pull block IO update from Jens Axboe: "Core block IO bits for 3.7. Not a huge round this time, it contains: - First series from Kent cleaning up and generalizing bio allocation and freeing. - WRITE_SAME support from Martin. - Mikulas patches to prevent O_DIRECT crashes when someone changes the block size of a device. - Make bio_split() work on data-less bio's (like trim/discards). - A few other minor fixups." Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew Morton. It is due to the VM no longer using a prio-tree (see commit `6b2dbba8b6`: "mm: replace vma prio_tree with an interval tree"). So make set_blocksize() use mapping_mapped() instead of open-coding the internal VM knowledge that has changed. * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits) block: makes bio_split support bio without data scatterlist: refactor the sg_nents scatterlist: add sg_nents fs: fix include/percpu-rwsem.h export error percpu-rw-semaphore: fix documentation typos fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared blockdev: turn a rw semaphore into a percpu rw semaphore Fix a crash when block device is read and block size is changed at the same time block: fix request_queue->flags initialization block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue() block: ioctl to zero block ranges block: Make blkdev_issue_zeroout use WRITE SAME block: Implement support for WRITE SAME block: Consolidate command flag and queue limit checks for merges block: Clean up special command handling logic block/blk-tag.c: Remove useless kfree block: remove the duplicated setting for congestion_threshold block: reject invalid queue attribute values block: Add bio_clone_bioset(), bio_clone_kmalloc() block: Consolidate bio_alloc_bioset(), bio_kmalloc() ...	2012-10-11 09:04:23 +09:00
Marco Stornelli	4f7754c889	exofs: drop lock/unlock super Removed lock/unlock super. Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-09 23:33:38 -04:00
Linus Torvalds	8711798772	Merge branch 'linux-next' of git://git.open-osd.org/linux-open-osd Pull exofs update from Boaz Harrosh: "Just three one liners" * 'linux-next' of git://git.open-osd.org/linux-open-osd: pnfs_osd_xdr: Remove unused #include from pnfs_osd_xdr.h ore: signedness bug in _sp2d_min_pg() exofs: check for allocation failure in uri_store()	2012-10-09 15:54:27 +09:00
Dan Carpenter	74b217d0d3	ore: signedness bug in _sp2d_min_pg() This for loop doesn't work correctly when "p" is unsigned. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-10-03 13:51:51 -07:00
Linus Torvalds	aab174f0df	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c splitoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." 
Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...	2012-10-02 20:25:04 -07:00
Kirill A. Shutemov	8c0a853770	fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Linus Torvalds	437589a74b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values as far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privilege checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user namespace to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. 
I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/gid logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...	2012-10-02 11:11:09 -07:00
Eric W. Biederman	d001b05365	userns: Convert exofs to use kuid/kgid where appropriate Cc: Benny Halevy <bhalevy@tonian.com> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-09-21 03:13:10 -07:00
Kent Overstreet	bf800ef181	block: Add bio_clone_bioset(), bio_clone_kmalloc() Previously, there was bio_clone() but it only allocated from the fs bio set; as a result various users were open coding it and using __bio_clone(). This changes bio_clone() to become bio_clone_bioset(), and then we add bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of the functionality the last patch added. This will also help in a later patch changing how bio cloning works. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: NeilBrown <neilb@suse.de> CC: Alasdair Kergon <agk@redhat.com> CC: Boaz Harrosh <bharrosh@panasas.com> CC: Jeff Garzik <jeff@garzik.org> Acked-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2012-09-09 10:35:39 +02:00
Alexey Khoroshilov	b8017d2957	exofs: check for allocation failure in uri_store() There is no memory allocation failure check in uri_store(). That can lead to NULL pointer dereference. Found by Linux Driver Verification project (linuxtesting.org). Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-12 21:54:44 +03:00
Linus Torvalds	d42d1dabf3	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd Pull exofs update from Boaz Harrosh: "They are all mostly fixes, except the most important patch by Artem Bityutskiy which removes the use of s_dirt. After this patch s_dirt can be completely removed from the tree." * 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Fix out-of-bounds access in _ios_obj() exofs: Use proper max_IO calculations from ore exofs: Fix __r4w_get_page when offset is beyond i_size exofs: stop using s_dirt exofs: readpage_strip: Add a BUG_ON to check for PageLocked(page)	2012-08-03 13:24:07 -07:00
Boaz Harrosh	9e62bb4458	ore: Fix out-of-bounds access in _ios_obj() _ios_obj() is accessed by group_index not device_table index. The oc->comps array is only a group_full of devices at a time; it is not like ore_comp_dev(), which is indexed by a global device_table index. This did not BUG until now because exofs only uses a single COMP for all devices. But with other FSs like PanFS this is not true. This bug was only in the write_path, all other users were using it correctly. [This is a bug since 3.2 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-02 16:41:56 +03:00
Boaz Harrosh	be388f3d9a	exofs: Use proper max_IO calculations from ore exofs_max_io_pages should just use the ORE's calculated layout->max_io_length and avoid unnecessary BUGs; the calculations made here were also a layering violation. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-02 16:39:17 +03:00
Boaz Harrosh	4b74f6ea84	exofs: Fix __r4w_get_page when offset is beyond i_size It is very common for the end of the file to be unaligned on stripe size. But since we know it's beyond file's end then the XOR should be performed with all zeros. Old code used to just read zeros out of the OSD devices, which is a great waste. But what scares me more about this situation is that we now have pages attached to the file's mapping that are beyond i_size. I don't like the kind of bugs this calls for. Fix both birds, by returning a global ZERO_PAGE, if offset is beyond i_size. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-02 14:58:22 +03:00
Artem Bityutskiy	66153f6e0f	exofs: stop using s_dirt Exofs has the '->write_super()' handler and makes some use of the '->s_dirt' superblock flag, but it really needs neither of them because it never sets 's_dirt' to one which means the VFS never calls its '->write_super()' handler. Thus, remove both. Note, I am trying to remove both 's_dirt' and 'write_super()' from VFS altogether once all users are gone. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-02 14:52:13 +03:00
Kautuk Consul	0e8d96dd2c	exofs: readpage_strip: Add a BUG_ON to check for PageLocked(page) readpage_strip can be called from several code paths all of which require that the page be locked before any operations are carried out. Since we export the exofs_readpage callback to the VFS, add a BUG_ON to check for PageLocked(page) to make sure that this understanding is never compromised. Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-08-02 14:52:12 +03:00
Linus Torvalds	a66d2c8f7e	Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull the big VFS changes from Al Viro: "This one is big and changes quite a few things around VFS. What's in there: - the first of two really major architecture changes - death to open intents. The former is finally there; it was very long in making, but with Miklos getting through really hard and messy final push in fs/namei.c, we finally have it. Unlike his variant, this one doesn't introduce struct opendata; what we have instead is ->atomic_open() taking preallocated struct file * and passing everything via its fields. Instead of returning struct file *, it returns -E... on error, 0 on success and 1 in "deal with it yourself" case (e.g. symlink found on server, etc.). See comments before fs/namei.c:atomic_open(). That made a lot of goodies finally possible and quite a few are in that pile: ->lookup(), ->d_revalidate() and ->create() do not get struct nameidata anymore; ->lookup() and ->d_revalidate() get lookup flags instead, ->create() gets "do we want it exclusive" flag. With the introduction of new helper (kern_path_locked()) we are rid of all struct nameidata instances outside of fs/namei.c; it's still visible in namei.h, but not for long. Come the next cycle, declaration will move either to fs/internal.h or to fs/namei.c itself. [me, miklos, hch] - The second major change: behaviour of final fput(). Now we have __fput() done without any locks held by caller and not from deep in call stack. That obviously lifts a lot of constraints on the locking in there. Moreover, it's legal now to call fput() from atomic contexts (which has immediately simplified life for aio.c). We also don't need anti-recursion logics in __scm_destroy() anymore. There is a price, though - the damn thing has become partially asynchronous. 
For fput() from normal process we are guaranteed that pending __fput() will be done before the caller returns to userland, exits or gets stopped for ptrace. For kernel threads and atomic contexts it's done via schedule_work(), so theoretically we might need a way to make sure it's finished; so far only one such place had been found, but there might be more. There's flush_delayed_fput() (do all pending __fput()) and there's __fput_sync() (fput() analog doing __fput() immediately). I hope we won't need them often; see warnings in fs/file_table.c for details. [me, based on task_work series from Oleg merged last cycle] - sync series from Jan - large part of "death to sync_supers()" work from Artem; the only bits missing here are exofs and ext4 ones. As far as I understand, those are going via the exofs and ext4 trees resp.; once they are in, we can put ->write_super() to the rest, along with the thread calling it. - preparatory bits from unionmount series (from dhowells). - assorted cleanups and fixes all over the place, as usual. This is not the last pile for this cycle; there's at least jlayton's ESTALE work and fsfreeze series (the latter - in dire need of fixes, so I'm not sure it'll make the cut this cycle). I'll probably throw symlink/hardlink restrictions stuff from Kees into the next pile, too. Plus there's a lot of misc patches I hadn't thrown into that one - it's large enough as it is..." 
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (127 commits) ext4: switch EXT4_IOC_RESIZE_FS to mnt_want_write_file() btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() switch dentry_open() to struct path, make it grab references itself spufs: shift dget/mntget towards dentry_open() zoran: don't bother with struct file * in zoran_map ecryptfs: don't reinvent the wheels, please - use struct completion don't expose I_NEW inodes via dentry->d_inode tidy up namei.c a bit unobfuscate follow_up() a bit ext3: pass custom EOF to generic_file_llseek_size() ext4: use core vfs llseek code for dir seeks vfs: allow custom EOF in generic_file_llseek code vfs: Avoid unnecessary WB_SYNC_NONE writeback during sys_sync and reorder sync passes vfs: Remove unnecessary flushing of block devices vfs: Make sys_sync writeout also block device inodes vfs: Create function for iterating over block devices vfs: Reorder operations during sys_sync quota: Move quota syncing to ->sync_fs method quota: Split dquot_quota_sync() to writeback and cache flushing part vfs: Move noop_backing_dev_info check from sync into writeback ...	2012-07-23 12:27:27 -07:00
Boaz Harrosh	537632e0a5	ore: Unlock r4w pages in exact reverse order of locking The read-4-write pages are locked in address ascending order. But were unlocked in a way easiest for coding. Fix that, locks should be released in opposite order of locking, i.e. descending address order. I have not hit this dead-lock. It was found by inspecting the debug print-outs. I suspect there is a higher lock at caller that protects us, but fix it regardless. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-07-20 11:49:25 +03:00
Boaz Harrosh	62b62ad873	ore: Remove support of partial IO request (NFS crash) Due to OOM situations the ore might fail to allocate all resources needed for IO of the full request. If some progress was possible it would proceed with a partial/short request, for the sake of forward progress. Since this crashes NFS-core and exofs is just fine without it, just remove this contraption, and fail. TODO: Support real forward progress with some reserved allocations of resources, such as mem pools and/or bio_sets [Bug since 3.2 Kernel] CC: Stable Tree <stable@kernel.org> CC: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-07-20 11:47:43 +03:00
Boaz Harrosh	9ff19309a9	ore: Fix NFS crash by supporting any unaligned RAID IO In RAID_5/6 we used to not permit an IO whose end byte is not stripe_size aligned and that spans more than one stripe, i.e. the caller must check if after submission the actual transferred bytes is shorter, and would need to resubmit a new IO with the remainder. Exofs supports this, and NFS was supposed to support this as well with its short write mechanism. But late testing has exposed a CRASH when this is used with non-RPC layout-drivers. The change at NFS is deep and risky; in its place the fix at ORE to lift the limitation is actually clean and simple. So here it is below. The principle here is that in the case of unaligned IO on both ends, beginning and end, we will send two read requests: one like old code, before the calculation of the first stripe, and also a new site, before the calculation of the last stripe. If any "boundary" is aligned or the complete IO is within a single stripe, we do a single read like before. The code is clean and simple by splitting the old _read_4_write into 3 even parts: 1. _read_4_write_first_stripe 2. _read_4_write_last_stripe 3. _read_4_write_execute And calling 1+3 at the same place as before, 2+3 before last stripe, and in the case of all in a single stripe then 1+2+3 is performed additively. Why did I not think of it before. Well I had a strike of genius because I have stared at this code for 2 years, and did not find this simple solution, till today. Not that I did not try. This solution is much better for NFS than the previous supposed solution because the short write was dealt with out-of-band after IO_done, which would cause a seeky IO pattern, whereas here we execute in order. At both solutions we do 2 separate reads, only here we do it within a single IO request. 
(And actually combine two writes into a single submission) NFS/exofs code need not change since the ORE API communicates the new shorter length on return, what will happen is that this case would not occur anymore. hurray!! [Stable this is an NFS bug since 3.2 Kernel should apply cleanly] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-07-20 11:45:28 +03:00
Al Viro	ebfc3b49a7	don't pass nameidata to ->create() boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:47 +04:00
Al Viro	00cd8dd3bf	stop passing nameidata to ->lookup() Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:32 +04:00
Randy Dunlap	b84297197c	exofs: fix sparse non-ANSI function warning Fix sparse non-ANSI function warning: fs/exofs/sys.c:112:28: warning: non-ANSI function declaration of function 'exofs_sysfs_dbg_print' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-06-12 06:33:22 +03:00
Linus Torvalds	a01ee165a1	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd Pull exofs updates from Boaz Harrosh: "Just a couple of patches. The first is a BUG fix destined for stable which missed the 3.4-rc7 Kernel. The second is just a fixture addition so exofs is able to be better exported as a cluster file system via pNFS." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Add SYSFS info for autologin/pNFS export exofs: Fix CRASH on very early IO errors.	2012-05-28 13:10:41 -07:00
Sachin Bhamare	8b56a30caa	exofs: Add SYSFS info for autologin/pNFS export Introduce sysfs infrastructure for exofs cluster filesystem. Each OSD target shows up as below in the sysfs hierarchy: /sys/fs/exofs/<osdname>_<partition_id>/devX Where <osdname>_<partition_id> is the unique identification of a Superblock. Where devX: 0 <= X < device_table_size. They are ordered in device-table order as specified to the mkfs.exofs command. Each OSD device devX has following attributes : osdname - ReadOnly systemid - ReadOnly uri - Read/Write It is up to user-mode to update devX/uri for support of autologin. This sysfs information is used both for autologin as well as support for exporting exofs via a pNFSD server in user-mode. (e.g. NFS-Ganesha) Signed-off-by: Sachin Bhamare <sbhamare@panasas.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-05-21 12:24:01 +03:00
Boaz Harrosh	6abe4a87f7	exofs: Fix CRASH on very early IO errors. If at exofs_fill_super() we had an early termination due to any error, like an IO error while reading the super-block, we would crash inside exofs_free_sbi(). This is because sbi->oc.numdevs was set to 1, before we actually have a device table at all. Fix it by moving the sbi->oc.numdevs = 1 to after the allocation of the device table. Reported-by: Johannes Schild <JSchild@gmx.de> Stable: This is a bug since v3.2.0 CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-05-20 19:42:41 +03:00
Jan Kara	dbd5768f87	vfs: Rename end_writeback() to clear_inode() After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2012-05-06 13:43:41 +08:00
Linus Torvalds	afb9bd704c	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd Pull trivial exofs changes from Boaz Harrosh: "Just nothingness really. The big exofs changes are reserved for the next merge window." * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: Cap on the memcpy() size exofs: (trivial) Fix typo in super.c exofs: fix endian conversion in exofs_sync_fs()	2012-03-28 20:04:27 -07:00
Linus Torvalds	e2a0883e40	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "This is _not_ all; in particular, Miklos' and Jan's stuff is not there yet." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits) ext4: initialization of ext4_li_mtx needs to be done earlier debugfs-related mode_t whack-a-mole hfsplus: add an ioctl to bless files hfsplus: change finder_info to u32 hfsplus: initialise userflags qnx4: new helper - try_extent() qnx4: get rid of qnx4_bread/qnx4_getblk take removal of PF_FORKNOEXEC to flush_old_exec() trim includes in inode.c um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it um: embed ->stub_pages[] into mmu_context gadgetfs: list_for_each_safe() misuse ocfs2: fix leaks on failure exits in module_init ecryptfs: make register_filesystem() the last potential failure exit ntfs: forgets to unregister sysctls on register_filesystem() failure logfs: missing cleanup on register_filesystem() failure jfs: mising cleanup on register_filesystem() failure make configfs_pin_fs() return root dentry on success configfs: configfs_create_dir() has parent dentry in dentry->d_parent configfs: sanitize configfs_create() ...	2012-03-21 13:36:41 -07:00
Al Viro	48fde701af	switch open-coded instances of d_make_root() to new helper Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:35 -04:00
Al Viro	8de5277879	vfs: check i_nlink limits in vfs_{mkdir,rename_dir,link} New field of struct super_block - ->s_max_links. Maximal allowed value of ->i_nlink or 0; in the latter case all checks still need to be done in ->link/->mkdir/->rename instances. Note that this limit applies both to directories and to non-directories. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-03-20 21:29:32 -04:00
Cong Wang	bf7014b67f	exofs: remove the second argument of k[un]map_atomic() Ack-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Cong Wang <amwang@redhat.com>	2012-03-20 21:48:22 +08:00
Dan Carpenter	72749a270b	exofs: Cap on the memcpy() size This data comes from the device, so probably it's fairly trustworthy but it makes the static checkers happy if we check it. [Boaz] the system_id_len is zero, if not present, or always OSD_SYSTEMID_LEN. So always copy OSD_SYSTEMID_LEN bytes. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-03-19 13:39:12 -07:00
Masanari Iida	3e57638bb1	exofs: (trivial) Fix typo in super.c Correct spelling "faild" to "failed" in fs/exofs/super.c Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-03-19 13:39:12 -07:00
Dan Carpenter	b6d1f2dd61	exofs: fix endian conversion in exofs_sync_fs() fscb->s_numfiles is an __le64 field so we need to use cpu_to_le64() to get a little endian 64 bit on big endian systems. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-03-19 13:39:11 -07:00
Linus Torvalds	9e203936ea	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Must support none-PAGE-aligned IO ore: fix BUG_ON, too few sgs when reading ore: Fix crash in case of an IO error. ore: FIX breakage when MISC_FILESYSTEMS is not set	2012-01-09 12:51:01 -08:00
Al Viro	da01636a65	exofs: oops after late failure in mount We have already set ->s_root, so ->put_super() is going to be called. Freeing ->s_fs_info is a bloody bad idea when it's going to be dereferenced very shortly... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-08 20:19:12 -05:00
Boaz Harrosh	724577ca35	ore: Must support none-PAGE-aligned IO NFS might send us offsets that are not PAGE aligned. So we must read in the remainder of the first/last pages, in case we need it for parity calculations. We only add sg segments to read the partial page, but we don't mark it as read=true because it is a lock-for-write page. TODO: In some cases (IO spans a single unit) we can just adjust the raid_unit offset/length, but this is left for later Kernels. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-01-08 10:43:13 +02:00
Boaz Harrosh	361aba569f	ore: fix BUG_ON, too few sgs when reading When reading RAID5 files, in rare cases, we calculated too few sg segments. There should be two extra for the beginning and end partial units. Also "too few sg segments" should not be a BUG_ON there is all the mechanics in place to handle it, as a short read. So just return -ENOMEM and the rest of the code will gracefully split the IO. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-01-06 16:49:07 +02:00
Boaz Harrosh	ffefb8eaa3	ore: Fix crash in case of an IO error. The users of ore_check_io() expect the reported device (In case of error) to be indexed relative to the passed-in ore_components table, and not the logical dev index. This causes a crash inside objlayoutdriver in case of an IO error. [Bug in 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-01-06 16:49:06 +02:00
Boaz Harrosh	831c2dc5f4	ore: FIX breakage when MISC_FILESYSTEMS is not set As Reported by Randy Dunlap When MISC_FILESYSTEMS is not enabled and NFS4.1 is: fs/built-in.o: In function `objio_alloc_io_state': objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state' fs/built-in.o: In function `_write_done': objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io' fs/built-in.o: In function `_read_done': ... When MISC_FILESYSTEMS, which is more of a GUI thing than anything else, is not selected, exofs/Kconfig is never examined during Kconfig, and it cannot do its magic stuff to automatically select everything needed. We must split exofs/Kconfig in two. The ore one is always included. And the exofs one is left in its old place in the menu. [Needed for the 3.2.0 Kernel] CC: Stable Tree <stable@kernel.org> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2012-01-06 16:48:14 +02:00
Al Viro	bef41c267e	exofs: propagate umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:55:05 -05:00
Al Viro	1a67aafb5f	switch ->mknod() to umode_t Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:54 -05:00
Al Viro	4acdaf27eb	switch ->create() to umode_t vfs_create() ignores everything outside of 16bit subset of its mode argument; switching it to umode_t is obviously equivalent and it's the only caller of the method Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	18bb1db3e7	switch vfs_mkdir() and ->mkdir() to umode_t vfs_mkdir() gets int, but immediately drops everything that might not fit into umode_t and that's the only caller of ->mkdir()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:54:53 -05:00
Al Viro	6b520e0565	vfs: fix the stupidity with i_dentry in inode destructors Seeing that just about every destructor got that INIT_LIST_HEAD() copied into it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once(); the cost of taking it into inode_init_always() will be negligible for pipes and sockets and negative for everything else. Not to mention the removal of boilerplate code from ->destroy_inode() instances... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-01-03 22:52:40 -05:00
Linus Torvalds	32aaeffbd4	Merge branch 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h	2011-11-06 19:44:47 -08:00
Linus Torvalds	6736c04799	Merge branch 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs * 'nfs-for-3.2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (25 commits) nfs: set vs_hidden on nfs4_callback_version4 (try #2) pnfs-obj: Support for RAID5 read-4-write interface. pnfs-obj: move to ore 03: Remove old raid engine pnfs-obj: move to ore 02: move to ORE pnfs-obj: move to ore 01: ore_layout & ore_components pnfs-obj: Rename objlayout_io_state => objlayout_io_res pnfs-obj: Get rid of objlayout_{alloc,free}_io_state pnfs-obj: Return PNFS_NOT_ATTEMPTED in case of read/write_pagelist pnfs-obj: Remove redundant EOF from objlayout_io_state nfs: Remove unused variable from write.c nfs: Fix unused variable warning from file.c NFS: Remove no-op less-than-zero checks on unsigned variables. NFS: Clean up nfs4_xdr_dec_secinfo() NFS: Fix documenting comment for nfs_create_request() NFS4: fix cb_recallany decode error nfs4: serialize layoutcommit SUNRPC: remove rpcbind clients destruction on module cleanup SUNRPC: remove rpcbind clients creation during service registering NFSd: call svc rpcbind cleanup explicitly SUNRPC: cleanup service destruction ...	2011-11-04 12:27:43 -07:00
Boaz Harrosh	eecfc6312a	pnfs-obj: move to ore 02: move to ORE In this patch we are actually moving to the ORE. (Object Raid Engine). objio_state holds a pointer to an ore_io_state. Once we have an ore_io_state at hand we can call the ore for reading/writing. We register on the done path to kick off the nfs io_done mechanism. Again for Ease of reviewing the old code is "#if 0" but is not removed so the diff command works better. The old code will be removed in the next patch. fs/exofs/Kconfig::ORE is modified to also be auto-included if PNFS_OBJLAYOUT is set. Since we now depend on ORE. (See comments in fs/exofs/Kconfig) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2011-11-02 23:56:08 -04:00
Miklos Szeredi	bfe8684869	filesystems: add set_nlink() Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2011-11-02 12:53:43 +01:00
Paul Gortmaker	143cb494cb	fs: add module.h to files that were implicitly using it Some files were using the complete module.h infrastructure without actually including the header at all. Fix them up in advance so once the implicit presence is removed, we won't get failures like this: CC [M] fs/nfsd/nfssvc.o fs/nfsd/nfssvc.c: In function 'nfsd_create_serv': fs/nfsd/nfssvc.c:335: error: 'THIS_MODULE' undeclared (first use in this function) fs/nfsd/nfssvc.c:335: error: (Each undeclared identifier is reported only once fs/nfsd/nfssvc.c:335: error: for each function it appears in.) fs/nfsd/nfssvc.c: In function 'nfsd': fs/nfsd/nfssvc.c:555: error: implicit declaration of function 'module_put_and_exit' make[3]: *** [fs/nfsd/nfssvc.o] Error 1 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>	2011-10-31 19:30:31 -04:00
Boaz Harrosh	44231e686b	ore: Enable RAID5 mounts Now that we support raid5, enable it at mount. Raid6 will come next. Raid4 is not in demand, so it will probably not be enabled. (Until someone wants it) NOTE: mkfs.exofs has had support for raid5/6 for a long time. (Making an empty raidX FS is just as easy as raid0 ;-} ) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:22:29 -07:00
Boaz Harrosh	dd29661997	exofs: Support for RAID5 read-4-write interface. The ore needs the filesystem to supply an r4w_get_page/r4w_put_page API so it can get cache pages to read into when writing partial stripes. Also I commented out and NULLed the .writepage (singular) vector, because it gives a terrible write pattern to raid and is apparently not needed. Even in OOM conditions the system copes (even better) without it. TODO: How to specify to write_cache_pages() to start or include a certain page? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:22:28 -07:00
Boaz Harrosh	769ba8d920	ore: RAID5 Write This is finally the RAID5 Write support. The bigger part of this patch is not the XOR engine itself, But the read4write logic, which is a complete mini prepare_for_striping reading engine that can read scattered pages of a stripe into cache so it can be used for XOR calculation. That is, if the write was not stripe aligned. The main algorithm behind the XOR engine is the 2 dimensional array: struct __stripe_pages_2d. A drawing might save 1000 words --- __stripe_pages_2d \| n = pages_in_stripe_unit; w = group_width - parity; \| pages array presented to the XOR lib \| \| V \| __1_page_stripe[0].pages --> [c0][c1]..[cw][c_par] <---\| \| \| __1_page_stripe[1].pages --> [c0][c1]..[cw][c_par] <--- \| ... \| ... \| __1_page_stripe[n].pages --> [c0][c1]..[cw][c_par] ^ \| data added columns first then row --- The pages are put on this array columns first. .i.e: p0-of-c0, p1-of-c0, ... pn-of-c0, p0-of-c1, ... So we are doing a corner turn of the pages. Note that pages will zigzag down and left. but are put sequentially in growing order. So when the time comes to XOR the stripe, only the beginning and end of the array need be checked. We scan the array and any NULL spot will be field by pages-to-be-read. The FS that wants to support RAID5 needs to supply an operations-vector that searches a given page in cache, and specifies if the page is uptodate or need reading. All these pages to be read are put on a slave ore_io_state and synchronously read. All the pages of a stripe are read in one IO, using the scatter gather mechanism. In write we constrain our IO to only be incomplete on a single stripe. Meaning either the complete IO is within a single stripe so we might have pages to read from both beginning or end of the strip. Or we have some reading to do at beginning but end at strip boundary. 
The leftover pages are pushed to the next IO by the API already established by previous work, where an IO offset/length combination presented to the ORE might get the length truncated and the user must re-submit the leftover pages. (Both exofs and NFS support this) But any ORE user should make its best effort to align its IO beforehand and avoid complications. A cached ore_layout->stripe_size member can be used for that calculation. (NOTE: that ORE demands that stripe_size may not be bigger than 32bit) What else? Well read it and tell me. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 17:15:33 -07:00
Boaz Harrosh	a1fec1dbbc	ore: RAID5 read This patch introduces the first stage of RAID5 support mainly the skip-over-raid-units when reading. For writes it inserts BLANK units, into where XOR blocks should be calculated and written to. It introduces the new "general raid maths", and the main additional parameters and components needed for raid5. Since at this stage it could corrupt future version that actually do support raid5. The enablement of raid5 mounting and setting of parity-count > 0 is disabled. So the raid5 code will never be used. Mounting of raid5 is only enabled later once the basic XOR write is also in. But if the patch "enable RAID5" is applied this code has been tested to be able to properly read raid5 volumes and is according to standard. Also it has been tested that the new maths still properly supports RAID0 and grouping code just as before. (BTW: I have found more bugs in the pnfs-obj RAID math fixed here) The ore.c file is getting too big, so new ore_raid.[hc] files are added that will include the special raid stuff that are not used in striping and mirrors. In future write support these will get bigger. When adding the ore_raid.c to Kbuild file I was forced to rename ore.ko to libore.ko. Is it possible to keep source file, say ore.c and module file ore.ko the same even if there are multiple files inside ore.ko? Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 16:55:36 -07:00
Boaz Harrosh	611d7a5dc6	ore: Make ore_calc_stripe_info EXPORT_SYMBOL ore_calc_stripe_info is needed by exofs::export.c for the layout calculations. Make it exportable Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-24 16:30:08 -07:00
Boaz Harrosh	4b46c9f5cf	ore/exofs: Change ore_check_io API Current ore_check_io API receives a residual pointer, to report partial IO. But it is actually not used, because in a multiple devices IO there is never a linearity in the IO failure. On the other hand if every failing device is reported through a received callback measures can be taken to handle only failed devices. One at a time. This will also be needed by the objects-layout-driver for it's error reporting facility. Exofs is not currently using the new information and keeps the old behaviour of failing the complete IO in case of an error. (No partial completion) TODO: Use an ore_check_io callback to set_page_error only the failing pages. And re-dirty write pages. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:54:42 +02:00
Boaz Harrosh	5a51c0c7e9	ore/exofs: Define new ore_verify_layout All users of the ore will need to check if current code supports the given layout. For example RAID5/6 is not currently supported. So move all the checks from exofs/super.c to a new ore_verify_layout() to be used by ore users. Note that any new layout should be passed through the ore_verify_layout() because the ore engine will prepare and verify some internal members of ore_layout, and assumes it's called. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:54:41 +02:00
Boaz Harrosh	3bd9856857	ore: Support for partial component table Users like the objlayout-driver would like to only pass a partial device table that covers the IO in question. For example exofs divides the file into raid-group-sized chunks and only serves group_width number of devices at a time. The partiality is communicated by setting ore_componets->first_dev and the array covers all logical devices from oc->first_dev upto (oc->first_dev + oc->numdevs) The ore_comp_dev() API receives a logical device index and returns the actual present device in the table. An out-of-range dev_index will BUG. Logical device index is the theoretical device index as if all the devices of a file are present. .i.e: total_devs = group_width * mirror_p1 * group_count 0 <= dev_index < total_devs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:54:41 +02:00
Boaz Harrosh	bbf9a31bba	ore: Support for short read/writes Memory conditions and max_bio constraints might cause us to not comply with the full length of the requested IO. Instead of failing the complete IO we can issue a shorter read/write and report how much was actually executed in the ios->length member. All users must check ios->length at IO_done or upon return of ore_read/write and re-issue the remainder of the bytes, because, unlike before, no error is returned in this case. This is part of the effort to support the pnfs-obj layout driver. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:54:40 +02:00
Boaz Harrosh	154a9300cd	exofs: Support for short read/writes If at read/write_done the actual IO was shorter than requested, as reported in the returned ios->length, it is not an error. The remainder of the pages should just be unlocked but not marked uptodate or end_page_writeback. They will be re-issued later by the VFS. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:54:39 +02:00
Boaz Harrosh	6851a5e5c1	ore: Remove check for ios->kern_buff in _prepare_for_striping to later Move the check and preparation of the ios->kern_buff case to later inside _write_mirror(). Since read was never used with ios->kern_buff its support is removed instead of fixed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:53:55 +02:00
Boaz Harrosh	9826075404	ore: cleanup: Embed an ore_striping_info inside ore_io_state Now that each ore_io_state covers only a single raid group. A single striping_info math is needed. Embed one inside ore_io_state to cache the calculation results and eliminate an extra call. Also the outer _prepare_for_striping is removed since it does nothing. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:53:54 +02:00
Boaz Harrosh	b916c5cd4d	ore: Only IO one group at a time (API change) Usually a single IO is confined to one group of devices (group_width) and at the boundary of a raid group it can spill into a second group. Current code would allocate a full device_table size array at each io_state so it can comply with requests that span two groups. Needless to say that is very wasteful, especially when device_table count can get very large (hundreds even thousands), while a group_width is usually 8 or 10. * Change ore API to trim on IO that spans two raid groups. The user passes offset+length to ore_get_rw_state, the ore might trim on that length if spanning a group boundary. The user must check ios->length or ios->nrpages to see how much IO will be performed. It is the responsibility of the user to re-issue the remainder of the IO. * Modify exofs to copy spilled pages on to the next IO. This means one last kick is needed after all coalescing of pages is done. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-14 18:52:50 +02:00
Boaz Harrosh	d866d875f6	ore/exofs: Change the type of the devices array (API change) In the pNFS obj-LD the device table at the layout level needs to point to a device_cache node, where it is possible and likely that many layouts will point to the same device-nodes. In Exofs we have a more orderly structure where we have a single array of devices that repeats twice for a round-robin view of the device table This patch moves to a model that can be used by the pNFS obj-LD where struct ore_components holds an array of ore_dev-pointers. (ore_dev is newly defined and contains a struct osd_dev *od member) Each pointer in the array of pointers will point to a bigger user-defined dev_struct. That can be accessed by use of the container_of macro. In Exofs an __alloc_dev_table() function allocates the ore_dev-pointers array as well as an exofs_dev array, in one allocation and does the addresses dance to set everything pointing correctly. It still keeps the double allocation trick for the inodes round-robin view of the table. The device table is always allocated dynamically, also for the single device case. So it is unconditionally freed at umount. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-04 12:13:59 +02:00
Boaz Harrosh	eb507bc189	ore: Make ore_striping_info and ore_calc_stripe_info public The struct ore_striping_info will be used later in other structures. And ore_calc_stripe_info as well. Rename them make struct ore_striping_info public. ore_calc_stripe_info is still static, will be made public on first use. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:51 +02:00
Boaz Harrosh	8d2d83a835	exofs: Remove unused data_map member from exofs_sb_info The struct pnfs_osd_data_map data_map member of exofs_sb_info was never used after mount. In fact all its members were duplicated by the ore_layout structure. So just remove the duplicated information. Also removed some stupid restrictions on layout parameters that are in fact perfectly supported. The case where num_devices is not divisible by mirror_count+1 is perfectly fine since the rotating device view will eventually use all the devices it can get. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com>	2011-10-03 17:07:51 +02:00
Boaz Harrosh	5bf696dad4	exofs: Rename struct ore_components comps => oc ore_components already has a comps member so this leads to things like comps->comps which is annoying. the name oc was already used in new code. So rename all old usage of ore_components comps => ore_components oc. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:50 +02:00
H Hartley Sweeten	de74b05ace	exofs/super.c: local functions should be static This quiets the following sparse noise: warning: symbol 'exofs_sync_fs' was not declared. Should it be static? warning: symbol 'exofs_free_sbi' was not declared. Should it be static? warning: symbol 'exofs_get_parent' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:07:29 +02:00
H Hartley Sweeten	1958c7c284	exofs/ore.c: local functions should be static This quiets the sparse noise: warning: symbol '_calc_trunk_info' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-10-03 17:06:47 +02:00
Linus Torvalds	c2f340a69c	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: ore: Make ore its own module exofs: Rename raid engine from exofs/ios.c => ore exofs: ios: Move to a per inode components & device-table exofs: Move exofs specific osd operations out of ios.c exofs: Add offset/length to exofs_get_io_state exofs: Fix truncate for the raid-groups case exofs: Small cleanup of exofs_fill_super exofs: BUG: Avoid sbi realloc exofs: Remove pnfs-osd private definitions nfs_xdr: Move nfs4_string definition out of #ifdef CONFIG_NFS_V4	2011-08-06 22:56:03 -07:00
Boaz Harrosh	cf283ade08	ore: Make ore its own module Export everything from ore that needs exporting. Change Kbuild and Kconfig to build ore.ko as an independent module. Import ore from exofs Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:36:19 -07:00
Boaz Harrosh	8ff660ab85	exofs: Rename raid engine from exofs/ios.c => ore ORE stands for "Objects Raid Engine" This patch is a mechanical rename of everything that was in ios.c and its API declaration to an ore.c and an osd_ore.h header. The ore engine will later be used by the pnfs objects layout driver. * File ios.c => ore.c * Declaration of types and API are moved from exofs.h to a new osd_ore.h * All used types are prefixed by ore_ from their exofs_ name. * Shift includes from exofs.h to osd_ore.h so osd_ore.h is independent, include it from exofs.h. Other than a pure rename there are no other changes. Next patch will move the ore into it's own module and will export the API to be used by exofs and later the layout driver Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:36:18 -07:00
Boaz Harrosh	9e9db45649	exofs: ios: Move to a per inode components & device-table Exofs raid engine was saving on memory space by having a single layout-info, single pid, and a single device-table, global to the filesystem. Then passing a credential and object_id info at the io_state level, private for each inode. It would also devise this contraption of rotating the device table view for each inode->ino to spread out the device usage. This is not compatible with the pnfs-objects standard, demanding that each inode can have it's own layout-info, device-table, and each object component it's own pid, oid and creds. So: Bring exofs raid engine to be usable for generic pnfs-objects use by: * Define an exofs_comp structure that holds obj_id and credential info. * Break up exofs_layout struct to an exofs_components structure that holds a possible array of exofs_comp and the array of devices + the size of the arrays. * Add a "comps" parameter to get_io_state() that specifies the ids creds and device array to use for each IO. This enables to keep the layout global, but the device-table view, creds and IDs at the inode level. It only adds two 64bit to each inode, since some of these members already existed in another form. * ios raid engine now access layout-info and comps-info through the passed pointers. Everything is pre-prepared by caller for generic access of these structures and arrays. At the exofs Level: * Super block holds an exofs_components struct that holds the device array, previously in layout. The devices there are in device-table order. The device-array is twice bigger and repeats the device-table twice so now each inode's device array can point to a random device and have a round-robin view of the table, making it compatible to previous exofs versions. * Each inode has an exofs_components struct that is initialized at load time, with it's own view of the device table IDs and creds. When doing IO this gets passed to the io_state together with the layout. 
While performing this change, bugs were found where credentials with the wrong IDs were used to access the different SB objects (super.c), as well as some dead code. It was never noticed because the target we use does not check the credentials. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:32 -07:00
Boaz Harrosh	85e44df474	exofs: Move exofs specific osd operations out of ios.c ios.c will be moving to an external library, for use by the objects-layout-driver. Remove from it some exofs specific functions. Also g_attr_logical_length is used both by inode.c and ios.c; move the definition to the latter, to keep it independent Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:31 -07:00
Boaz Harrosh	e1042ba099	exofs: Add offset/length to exofs_get_io_state In future raid code we will need to know the IO offset/length and if it's a read or write to determine some of the array sizes we'll need. So add a new exofs_get_rw_state() API for use when writing/reading. All other simple cases are left using the old way. The major change to this is that now we need to call exofs_get_io_state later at inode.c::read_exec and inode.c::write_exec when we actually know these things. So this patch is kept separate so I can test things apart from other changes. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-06 19:35:31 -07:00
Boaz Harrosh	16f75bb35d	exofs: Fix truncate for the raid-groups case In the general raid-group case the truncate was wrong in that it did not also fix the object length of the neighboring groups. There are two bad cases in the old code: 1. Space that should be freed was not. 2. If a file That was big is truncated small, then made bigger again, the holes would not contain zeros but could expose old data. (If the growing of the file expands to more than a full groups cycle + group size (> S + T)) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:25 -07:00
Boaz Harrosh	9ce730475e	exofs: Small cleanup of exofs_fill_super Small cleanup that unifies duplicated code used in both the error and success cases Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:23 -07:00
Boaz Harrosh	6d4073e881	exofs: BUG: Avoid sbi realloc Since the beginning we realloced the sbi structure when a bigger than one device table was specified. (I know that was really stupid). Then much later when "register bdi" was added (By Jens) it was registering the pointer to sbi->bdi before the realloc. We never saw this problem because up till now the realloc did not do anything since the device table was small enough to fit in the original allocation. But once we started testing with large device tables (Bigger than 28) we noticed the crash of writeback operating on a deallocated pointer. * Avoid the whole mess by allocating the device-table as a second array and get rid of the variable-sized structure and the rest of this mess. * Take the chance to clean nearby structures and comments. * Add a needed dprint on startup to indicate the loaded layout. * Also move the bdi registration to the very end because it will only fail in a low memory condition, which will probably cause a failure beforehand. There are many more likely causes to not load before that. This way the error handling is made simpler. (Just doing this would be enough to fix the BUG) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:20 -07:00
Boaz Harrosh	26ae93c2dc	exofs: Remove pnfs-osd private definitions Now that pnfs-osd has hit mainline we can remove exofs's private header. (And the FIXME comment) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-08-04 12:35:18 -07:00
Josef Bacik	02c24a8218	fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 20:47:59 -04:00
Al Viro	a9049376ee	make d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) ... and simplify the living hell out of callers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-20 01:44:26 -04:00
Al Viro	a803b8067e	fix exofs ->get_parent() NULL is not a possible return value for that method, TYVM... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-07-17 23:20:29 -04:00
Lucas De Marchi	25985edced	Fix common misspellings Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>	2011-03-31 11:26:23 -03:00
Linus Torvalds	6c51038900	Merge branch 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits) Documentation/iostats.txt: bit-size reference etc. cfq-iosched: removing unnecessary think time checking cfq-iosched: Don't clear queue stats when preempt. blk-throttle: Reset group slice when limits are changed blk-cgroup: Only give unaccounted_time under debug cfq-iosched: Don't set active queue in preempt block: fix non-atomic access to genhd inflight structures block: attempt to merge with existing requests on plug flush block: NULL dereference on error path in __blkdev_get() cfq-iosched: Don't update group weights when on service tree fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away block: Require subsystems to explicitly allocate bio_set integrity mempool jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging fs: make fsync_buffers_list() plug mm: make generic_writepages() use plugging blk-cgroup: Add unaccounted time to timeslice_used. block: fixup plugging stubs for !CONFIG_BLOCK block: remove obsolete comments for blkdev_issue_zeroout. blktrace: Use rq->cmd_flags directly in blk_add_trace_rq. ... Fix up conflicts in fs/{aio.c,super.c}	2011-03-24 10:16:26 -07:00
Boaz Harrosh	a49fb4c3d0	exofs: deprecate the commands pending counter One leftover from the days of IBM's original code, is an SB counter that counts in-flight asynchronous commands. And a piece of code that waits for the counter to reach zero at unmount. I guess it might have been needed then, cause of some reference missing or something. I'm not removing it yet but am putting a warning message if ever this counter triggers at unmount. If I'll never see it triggers or reported I'll remove the counter for good. (I had this print as a debug output for a long time and never had it trigger) Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:52 +02:00
Boaz Harrosh	1cea312ad4	exofs: Write sbi->s_nextid as part of the Create command Before when creating a new inode, we'd set the sb->s_dirt flag, and sometime later the system would write out s_nextid as part of the sb_info. Also on inode sync we would force the sb sync as well. Define the s_nextid as a new partition attribute and set it every time we create a new object. At mount we read it from its new place. We now never set sb->s_dirt anywhere in exofs. write_super is actually never called. The call to exofs_write_super from exofs_put_super is also removed because the VFS always calls ->sync_fs before calling ->put_super twice. To stay backward-and-forward compatible we also write the old s_nextid in the super_block object at unmount, and support zero length attribute on mount. This also fixes a BUG where in layouts when group_width was not a divisor of EXOFS_SUPER_ID (0x10000) the s_nextid was not read from the device it was written to. Because of the sliding window layout trick, and because the read was always done from the 0 device but the write was done via the raid engine that might slide the device view. Now we read and write through the raid engine. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:51 +02:00
Boaz Harrosh	9ed9648431	exofs: Add option to mount by osdname If /dev/osd* devices are shuffled because more devices were added, and/or login order has changed, it is hard to mount the FS you want. Add an option to mount by osdname. osdname is any osd-device's osdname as specified to the mkfs.exofs command when formatting the osd-devices. The new mount format is: OPT="osdname=$UUID0,pid=$PID,_netdev" mount -t exofs -o $OPT $DEV_OSD0 $MOUNTDIR if "osdname=" is specified in options above $DEV_OSD0 is ignored and can be empty. Also while at it: Removed some old unused Opt_* enums. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:51 +02:00
bharrosh@panasas.com	66cd6cad49	exofs: Override read-ahead to align on stripe_size * Set all inode->i_mapping->backing_dev_info to point to the per super-block sb->s_bdi. * Calculating a read_ahead that is: - preferably 2 stripes long (Future patch will add a mount option to override this) - Minimum 128K aligned up to stripe-size - Capped to maximum-IO-sizes rounded down to stripe_size. (Max sizes are governed by max bio-size that fits in a page times number-of-devices) CC: Marc Dionne <marc.c.dionne@gmail.com> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:50 +02:00
Nick Piggin	97178b7b6c	exofs: simple fsync race fix It is incorrect to test inode dirty bits without participating in the inode writeback protocol. Inode writeback sets I_SYNC and clears I_DIRTY_?, then writes out the particular bits, then clears I_SYNC when it is done. BTW. it may not completely write all pages out, so I_DIRTY_PAGES would get set again. This is a standard pattern used throughout the kernel's writeback caches (I_SYNC ~= I_WRITEBACK, if that makes it clearer). And so it is not possible to determine an inode's dirty status just by checking I_DIRTY bits. Especially not for the purpose of data integrity syncs. Missing the check for these bits means that fsync can complete while writeback to the inode is underway. Inode writeback functions get this right, so call into them rather than try to shortcut things by testing dirty state improperly. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:50 +02:00
Boaz Harrosh	a8f1418f9e	exofs: Optimize read_4_write Don't attempt a read past i_size, just zero the page and be done with it. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:02:49 +02:00
Boaz Harrosh	0a935519cc	exofs: Trivial: fix some indentation and debug prints I stumbled on some of these prints in log files so, might just submit the fixes. * All i_ino prints in exofs should be hex * All OSD_ERR prints should end with a "\n" Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>	2011-03-15 15:00:27 +02:00
Tobias Klauser	2c722c9a47	exofs: Remove redundant unlikely() IS_ERR() already implies unlikely(), so it can be omitted here. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2011-03-15 12:33:42 +02:00
Jens Axboe	7eaceaccab	block: remove per-queue plugging Code has been converted over to the new explicit on-stack plugging, and delay users have been converted to use the new API for that. So lets kill off the old plugging along with aops->sync_page(). Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2011-03-10 08:52:07 +01:00
Al Viro	babfe56046	exofs: i_nlink races in rename() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2011-03-03 01:28:17 -05:00
Boaz Harrosh	0b0abeaf3d	Revert "exofs: Set i_mapping->backing_dev_info anyway" This reverts commit `115e19c535`. Apparently setting inode->bdi to one's own sb->s_bdi stops VFS from sending read-aheads. This problem was bisected to this commit. A revert fixes it. I'll investigate farther why is this happening for the next Kernel, but for now a revert. I'm sending to stable@kernel.org as well, since it exists also in 2.6.37. 2.6.36 is good and does not have this patch. CC: Stable Tree <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-02-02 17:53:27 -08:00
Nick Piggin	fa0d7e3de6	fs: icache RCU free inodes RCU free the struct inode. This will allow: - Subsequent store-free path walking patch. The inode must be consulted for permissions when walking, so an RCU inode reference is a must. - sb_inode_list_lock to be moved inside i_lock because sb list walkers who want to take i_lock no longer need to take sb_inode_list_lock to walk the list in the first place. This will simplify and optimize locking. - Could remove some nested trylock loops in dcache code - Could potentially simplify things a bit in VM land. Do not need to take the page lock to follow page->mapping. The downsides of this is the performance cost of using RCU. In a simple creat/unlink microbenchmark, performance drops by about 10% due to inability to reuse cache-hot slab objects. As iterations increase and RCU freeing starts kicking over, this increases to about 20%. In cases where inode lifetimes are longer (ie. many inodes may be allocated during the average life span of a single inode), a lot of this cache reuse is not applicable, so the regression caused by this patch is smaller. The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU, however this adds some complexity to list walking and store-free path walking, so I prefer to implement this at a later date, if it is shown to be a win in real situations. I haven't found a regression in any non-micro benchmark so I doubt it will be a problem. Signed-off-by: Nick Piggin <npiggin@kernel.dk>	2011-01-07 17:50:26 +11:00
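
The read-ahead sizing rule described in commit 66cd6cad49 above (prefer 2 stripes, at least 128K aligned up to the stripe size, capped to the maximum I/O size rounded down to the stripe size) can be illustrated roughly as follows. This is only a sketch of the arithmetic as stated in the commit message, not the exofs code itself: the function and helper names are hypothetical, and passing the maximum I/O size in as a plain byte count is an assumption made for illustration.

```c
#include <stdint.h>

/* Hypothetical helpers: round v up or down to a multiple of m (m > 0). */
static uint64_t sketch_round_up(uint64_t v, uint64_t m)
{
    return ((v + m - 1) / m) * m;
}

static uint64_t sketch_round_down(uint64_t v, uint64_t m)
{
    return (v / m) * m;
}

/*
 * Hypothetical sketch of the rule in the commit message:
 *   - prefer 2 stripes,
 *   - but no less than 128K aligned up to stripe_size,
 *   - and no more than max_io_bytes rounded down to stripe_size.
 * All sizes are in bytes. max_io_bytes is an assumed input; the commit
 * only says it derives from the max bio size and the number of devices.
 */
uint64_t sketch_read_ahead_bytes(uint64_t stripe_size, uint64_t max_io_bytes)
{
    const uint64_t min_ra = 128 * 1024;
    uint64_t ra = 2 * stripe_size;
    uint64_t lower = sketch_round_up(min_ra, stripe_size);
    uint64_t cap = sketch_round_down(max_io_bytes, stripe_size);

    if (ra < lower)
        ra = lower;
    /* Skip the cap when it rounds down to zero (max I/O below one stripe). */
    if (cap > 0 && ra > cap)
        ra = cap;
    return ra;
}
```

With a 48K stripe, for example, two stripes (96K) fall below the 128K minimum, so the sketch rounds 128K up to the next stripe multiple; with a 256K stripe and a 300K I/O limit, the cap brings the result back down to a single stripe.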


220 commits