Commit graph

13646 commits

Author SHA1 Message Date
Ryusuke Konishi 612392307c nilfs2: support nanosecond timestamp
After a review of user's feedback for finding out other compatibility
issues, I found nilfs improperly initializes timestamps in inode;
CURRENT_TIME was used there instead of CURRENT_TIME_SEC even though nilfs
didn't have nanosecond timestamps on disk.  A few users gave us the report
that the tar program sometimes failed to expand symbolic links on nilfs,
and it turned out to be the cause.

Instead of applying the above displacement, I've decided to support
nanosecond timestamps on this occation.  Fortunetaly, a needless 64-bit
field was in the nilfs_inode struct, and I found it's available for this
purpose without impact for the users.

So, this will do the enhancement and resolve the tar problem.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi e339ad31f5 nilfs2: introduce secondary super block
The former versions didn't have extra super blocks.  This improves the
weak point by introducing another super block at unused region in tail of
the partition.

This doesn't break disk format compatibility; older versions just ingore
the secondary super block, and new versions just recover it if it doesn't
exist.  The partition created by an old mkfs may not have unused region,
but in that case, the secondary super block will not be added.

This doesn't make more redundant copies of the super block; it is a future
work.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi cece552074 nilfs2: simplify handling of active state of segments
will reduce some lines of segment constructor.  Previously, the state was
complexly controlled through a list of segments in order to keep
consistency in meta data of usage state of segments.  Instead, this
presents ``calculated'' active flags to userland cleaner program and stop
maintaining its real flag on disk.

Only by this fake flag, the cleaner cannot exactly know if each segment is
reclaimable or not.  However, the recent extension of nilfs_sustat ioctl
struct (nilfs2-extend-nilfs_sustat-ioctl-struct.patch) can prevent the
cleaner from reclaiming in-use segment wrongly.

So, now I can apply this for simplification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:20 -07:00
Ryusuke Konishi c96fa464a5 nilfs2: mark minor flag for checkpoint created by internal operation
Nilfs creates checkpoints even for garbage collection or metadata updates
such as checkpoint mode change.  So, user often sees checkpoints created
only by such internal operations.

This is inconvenient in some situations.  For example, application that
monitors checkpoints and changes them to snapshots, will fall into an
infinite loop because it cannot distinguish internally created
checkpoints.

This patch solves this sort of problem by adding a flag to checkpoint for
identification.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 458c5b0822 nilfs2: clean up sketch file
The sketch file is a file to mark checkpoints with user data.  It was
experimentally introduced in the original implementation, and now
obsolete.  The file was handled differently with regular files; the file
size got truncated when a checkpoint was created.

This stops the special treatment and will treat it as a regular file.
Most users are not affected because mkfs.nilfs2 no longer makes this file.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi e626874685 nilfs2: super block operations fix endian bug
This adds a missing endian conversion of checksum field in the super
block.  This fixes compatibility issue on big endian machines which will
come to surface after supporting recovery of super block.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 1f5abe7e7d nilfs2: replace BUG_ON and BUG calls triggerable from ioctl
Pekka Enberg advised me:
> It would be nice if BUG(), BUG_ON(), and panic() calls would be
> converted to proper error handling using WARN_ON() calls. The BUG()
> call in nilfs_cpfile_delete_checkpoints(), for example, looks to be
> triggerable from user-space via the ioctl() system call.

This will follow the comment and keep them to a minimum.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 2c2e52fc4f nilfs2: extend nilfs_sustat ioctl struct
This adds a new argument to the nilfs_sustat structure.

The extended field allows to delete volatile active state of segments,
which was needed to protect freshly-created segments from garbage
collection but has confused code dealing with segments.  This
extension alleviates the mess and gives room for further
simplifications.

The volatile active flag is not persistent, so it's eliminable on this
occasion without affecting compatibility other than the ioctl change.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 7a9461939a nilfs2: use unlocked_ioctl
Pekka Enberg suggested converting ->ioctl operations to use
->unlocked_ioctl to avoid BKL.

The conversion was verified to be safe, so I will take it on this
occasion.

Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:19 -07:00
Ryusuke Konishi 8082d36aed nilfs2: remove compat ioctl code
This removes compat code from the nilfs ioctls and applies the same
function for both .ioctl and .compat_ioctl file operations.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi dc498d09be nilfs2: use fixed sized types for ioctl structures
Nilfs ioctl had structures not having fixed sized types such as:

  struct nilfs_argv {
         void *v_base;
         size_t v_nmembs;
         size_t v_size;
         int v_index;
         int v_flags;
  };

Further, some of them are wrongly aligned:

  e.g.

  struct nilfs_cpmode {
        __u64 cm_cno;
        int cm_mode;
  };

The size of wrongly aligned structures varies depending on
architectures, and it breaks the identity of ioctl commands, which
leads to arch dependent errors.

Previously, these are compensated by using compat_ioctl.

This fixes these problems and allows removal of compat ioctl.

Since this will change sizes of those structures, binary compatibility
for the past utilities will once break; new utilities have to be used
instead.  However, it would be helpful to avoid platform dependent
problems in the long term.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi 1088dcf4c3 nilfs2: remove timedwait ioctl command
This removes NILFS_IOCTL_TIMEDWAIT command from ioctl interface along
with the related flags and wait queue.

The command is terrible because it just sleeps in the ioctl.  I prefer
to avoid this by devising means of event polling in userland program.
By reconsidering the userland GC daemon, I found this is possible
without changing behaviour of the daemon and sacrificing efficiency.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Ryusuke Konishi 76068c4ff1 nilfs2: fix buggy behavior seen in enumerating checkpoints
This will fix the weird behavior of lscp command in listing continuously
created checkpoints; the output of lscp is rewinded regularly for the
recent nilfs.  As a result of debugging, a defect was found in
nilfs_cpfile_do_get_cpinfo() function.

Though the function can be repeatedly called to enumerate checkpoints and
it can skip invalid checkpoint entries, the index value was not carried
between successive calls.

The bug has long been present, and came to surface after applying a bugfix
nilfs2-fix-problems-of-memory-allocation-in-ioctl.patch, which increased
frequency of calling the function.  The similar bugfix was already applied
for ``snapshots'' by
nilfs2-fix-gc-failure-on-volumes-keeping-numerous-snapshots.patch.

This fixes the problem by making the index argument bidirectional on the
function.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:18 -07:00
Pekka Enberg 8acfbf0939 nilfs2: clean up indirect function calling conventions
This cleans up the strange indirect function calling convention used in
nilfs to follow the normal kernel coding style.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 7fa10d2001 nilfs2: fix improper return values of nilfs_get_cpinfo ioctl
A few tool developers gave me requests for fixing inconvenient return
value of nilfs_get_cpinfo() ioctl; if the requested mode is NILFS_SNAPSHOT
and the specified start entry is not a snapshot, the ioctl unnaturally
returns one as the number of acquired snapshot item.

In addition, the ioctl function returns an ENOENT error for checkpoints
within blocks deleted by garbage collection.

These behaviors require corrections for programs which enumerate
snapshots.  This resolves the inconvenience by changing the return values
to zero for the above cases.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi b028fcfc4c nilfs2: fix gc failure on volumes keeping numerous snapshots
This resolves the following failure of nilfs2 cleaner daemon:

 nilfs_cleanerd[20670]: cannot clean segments: No such file or directory
 nilfs_cleanerd[20670]: shutdown

When creating thousands of snapshots, the cleaner daemon had rarely died
as above due to an error returned from the kernel code.

After applying the recent patch which fixed memory allocation problems in
ioctl (Message-Id: <20081215.155840.105124170.ryusuke@osrg.net>), the
problem gets more frequent.

It turned out to be a bug of nilfs_ioctl_wrap_copy function and one of its
callback routines to read out information of snapshots; if the
nilfs_ioctl_wrap_copy function divided a large read request into multiple
requests, the second and later requests have failed since a restart
position on snapshot meta data was not properly set forward.

It's a deficiency of the callback interface that cannot pass the restart
position among multiple requests.  This patch fixes the issue by allowing
nilfs_ioctl_wrap_copy and snapshot read functions to exchange a position
argument.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 047180f2d7 nilfs2: insert explanations in gcinode file
The file gcinode.c gives buffer cache functions for on-disk blocks
moved in garbage collection.  Joern Engel has suggested inserting its
explanations in the source file (Message-ID:
<20080917144146.GD8750@logfs.org> and
<20080917224953.GB14644@logfs.org>).

This follows the comment.

Cc: Joern Engel <joern@logfs.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 47420c7998 nilfs2: avoid double error caused by nilfs_transaction_end
Pekka Enberg pointed out that double error handlings found after
nilfs_transaction_end() can be avoided by separating abort operation:

 OK, I don't understand this. The only way nilfs_transaction_end() can
 fail is if we have NILFS_TI_SYNC set and we fail to construct the
 segment. But why do we want to construct a segment if we don't commit?

 I guess what I'm asking is why don't we have a separate
 nilfs_transaction_abort() function that can't fail for the erroneous
 case to avoid this double error value tracking thing?

This does the separation and renames nilfs_transaction_end() to
nilfs_transaction_commit() for clarification.

Since, some calls of these functions were used just for exclusion control
against the segment constructor, they are replaced with semaphore
operations.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi a2e7d2df82 nilfs2: cleanup nilfs_clear_inode
This will remove the following unnecessary locks and cleanup code in
nilfs_clear_inode():

- unnecessary protection using nilfs_transaction_begin() and
  nilfs_transaction_end().

- cleanup code of i_dirty list field which is never chained
  when this function is called.

- spinlock used when releasing i_bh field.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:17 -07:00
Ryusuke Konishi 3358b4aaa8 nilfs2: fix problems of memory allocation in ioctl
This is another patch for fixing the following problems of a memory
copy function in nilfs2 ioctl:

(1) It tries to allocate 128KB size of memory even for small objects.

(2) Though the function repeatedly tries large memory allocations
    while reducing the size, GFP_NOWAIT flag is not specified.
    This increases the possibility of system memory shortage.

(3) During the retries of (2), verbose warnings are printed
    because _GFP_NOWARN flag is not used for the kmalloc calls.

The first patch was still doing large allocations by kmalloc which are
repeatedly tried while reducing the size.

Andi Kleen told me that using copy_from_user for large memory is not
good from the viewpoint of preempt latency:

 On Fri, 12 Dec 2008 21:24:11 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > > In the current interface, each data item is copied twice: one is to
 > > the allocated memory from user space (via copy_from_user), and another
 >
 > For such large copies it is better to use multiple smaller (e.g. 4K)
 > copy user, that gives better real time preempt latencies. Each cfu has a
 > cond_resched(), but only one, not multiple times in the inner loop.

He also advised me that:

 On Sun, 14 Dec 2008 16:13:27 +0100, Andi Kleen <andi@firstfloor.org> wrote:
 > Better would be if you could go to PAGE_SIZE. order 0 allocations
 > are typically the fastest / least likely to stall.
 >
 > Also in this case it's a good idea to use __get_free_pages()
 > directly, kmalloc tends to be become less efficient at larger
 > sizes.

For the function in question, the size of buffer memory can be reduced
since the buffer is repeatedly used for a number of small objects.  On
the other hand, it may incur large preempt latencies for larger buffer
because a copy_from_user (and a copy_to_user) was applied only once
each cycle.

With that, this revision uses the order 0 allocations with
__get_free_pages() to fix the original problems.

Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 0c4fb87764 nilfs2: update makefile and Kconfig
This adds a Makefile for the nilfs2 file system, and updates the
makefile and Kconfig file in the file system directory.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Koji Sato 7942b919f7 nilfs2: ioctl operations
This adds userland interface implemented with ioctl.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi a3d93f709e nilfs2: block cache for garbage collection
This adds the cache of on-disk blocks to be moved in garbage
collection.  The disk blocks are held with dummy inodes (called
gcinodes), and this file provides lookup function of the dummy inodes,
and their buffer read function.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 84ef1ecfde nilfs2: another dat for garbage collection
NILFS2 uses another DAT inode during garbage collection to ensure
atomicity and consistency of the DAT in the transient state.  This
twin inode is called GCDAT.

This adds functions to initialize the GCDAT and to switch page caches
and B-tree node caches between these two inodes.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi 0f3e1c7f23 nilfs2: recovery functions
This adds recovery function on mount.

Usually the recovery is achieved by just finding the latest super
root.  When logs without checkpoints were appended for data sync
operations after the latest super root, the recovery function will
perform roll forwarding and reconstruct new log(s) with a super root.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:16 -07:00
Ryusuke Konishi f30bf3e40f nilfs2: fix missed-sync issue for do_sync_mapping_range()
Chris Mason pointed out that there is a missed sync issue in
nilfs_writepages():

On Wed, 17 Dec 2008 21:52:55 -0500, Chris Mason wrote:
> It looks like nilfs_writepage ignores WB_SYNC_NONE, which is used by
> do_sync_mapping_range().

where WB_SYNC_NONE in do_sync_mapping_range() was replaced with
WB_SYNC_ALL by Nick's patch (commit:
ee53a891f4).

This fixes the problem by letting nilfs_writepages() write out the log of
file data within the range if sync_mode is WB_SYNC_ALL.

This involves removal of nilfs_file_aio_write() which was previously
needed to ensure O_SYNC sync writes.

Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 9ff05123e3 nilfs2: segment constructor
This adds the segment constructor (also called log writer).

The segment constructor collects dirty buffers for every dirty inode,
makes summaries of the buffers, assigns disk block addresses to the
buffers, and then submits BIOs for the buffers.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 64b5a32e0b nilfs2: segment buffer
This adds the segment buffer which is used to constuct logs.

[akpm@linux-foundation.org: BIO_RW_SYNC got removed]
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 783f61843e nilfs2: super block operations
This adds super block operations for the nilfs2 file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi 8a9d2191e9 nilfs2: operations for the_nilfs core object
This adds functions on the_nilfs object, which keeps shared resources and
states among a read/write mount and snapshots mounts going individually.

the_nilfs is allocated per block device; it is created when user first
mount a snapshot or a read/write mount on the device, then it is reused
for successive mounts.  It will be freed when all mount instances on the
device are detached.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi d25006523d nilfs2: pathname operations
This adds pathname operations, most of which comes from the ext2 file
system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Yoshiji Amagai 2ba466d74e nilfs2: directory entry operations
This adds directory handling functions, most of which comes from the ext2
file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:15 -07:00
Ryusuke Konishi f183ff4f05 nilfs2: file operations
This adds primitives for regular file handling.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 05fe58fdc1 nilfs2: inode operations
This adds inode level operations of the nilfs2 file system.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato 6c98cd4ecb nilfs2: segment usage file
This adds a meta data file which stores the allocation state of segments.

[konishi.ryusuke@lab.ntt.co.jp: fix wrong counting of checkpoints and dirty segments]
Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato 2961980972 nilfs2: checkpoint file
This adds a meta data file which holds checkpoint entries in its data
blocks.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 43bfb45ed4 nilfs2: inode map file
This adds a meta data file which stores on-disk inodes in its data blocks.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Koji Sato a17564f58b nilfs2: disk address translator
This adds the disk address translation file (DAT) whose primary function
is to convert virtual disk block numbers to actual disk block numbers.

The virtual block numbers of NILFS are associated with checkpoint
generation numbers, and this file also provides functions to manage the
lifetime information of each virtual block number.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:14 -07:00
Ryusuke Konishi 5442680fd2 nilfs2: persistent object allocator
This adds common functions to allocate or deallocate entries with bitmaps
on a meta data file.  This feature is used by the DAT and ifile.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Yoshiji Amagai <amagai.yoshiji@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 5eb563f5f2 nilfs2: meta data file
This adds the meta data file, which serves common buffer functions to the
DAT, sufile, cpfile, ifile, and so forth.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 0bd49f9446 nilfs2: buffer and page operations
This adds common routines for buffer/page operations used in B-tree
node caches, meta data files, or segment constructor (log writer).

NILFS uses copy functions for buffers and pages due to the following
reasons:

 1) Relocation required for COW
    Since NILFS changes address of on-disk blocks, moving buffers
    in page cache is needed for the buffers which are not addressed
    by a file offset.  If buffer size is smaller than page size,
    this involves partial copy of pages.

 2) Freezing mmapped pages
    NILFS calculates checksums for each log to ensure its validity.
    If page data changes after the checksum calculation, this validity
    check will not work correctly.  To avoid this failure for mmaped
    pages, NILFS freezes their data by copying.

 3) Copy-on-write for DAT pages
    NILFS makes clones of DAT page caches in a copy-on-write manner
    during GC processes, and this ensures atomicity and consistency
    of the DAT in the transient state.

In addition, NILFS uses two obsolete functions, nilfs_mark_buffer_dirty()
and nilfs_clear_page_dirty() respectively.

* nilfs_mark_buffer_dirty() was required to avoid NULL pointer
  dereference faults:

  Since the page cache of B-tree node pages or data page cache of pseudo
  inodes does not have a valid mapping->host, calling mark_buffer_dirty()
  for their buffers causes the fault; it calls __mark_inode_dirty(NULL)
  through __set_page_dirty().

* nilfs_clear_page_dirty() was needed in the two cases:

 1) For B-tree node pages and data pages of the dat/gcdat, NILFS2 clears
    page dirty flags when it copies back pages from the cloned cache
    (gcdat->{i_mapping,i_btnode_cache}) to its original cache
    (dat->{i_mapping,i_btnode_cache}).

 2) Some B-tree operations like insertion or deletion may dispose buffers
    in dirty state, and this needs to cancel the dirty state of their
    pages.  clear_page_dirty_for_io() caused faults because it does not
    clear the dirty tag on the page cache.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi a60be987d4 nilfs2: B-tree node cache
This adds routines for B-tree node buffers.

Signed-off-by: Seiji Kihara <kihara.seiji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato 36a580eb48 nilfs2: direct block mapping
This adds block mappings using direct pointers which are stored in the
i_bmap array of inode.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato 17c76b0104 nilfs2: B-tree based block mapping
This adds declarations and functions of NILFS2 B-tree.

Two variants are integrated in the NILFS2 B-tree.  The B-tree for the most
files points to the child nodes or data blocks with virtual block
addresses, whereas the B-tree of the DAT uses actual block addresses.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Koji Sato bdb265eae0 nilfs2: integrated block mapping
This adds structures and operations for the block mapping (bmap for
short).  NILFS2 uses direct mappings for short files or B-tree based
mappings for longer files.

Every on-disk data block is held with inodes and managed through this
block mapping.  The nilfs_bmap structure and a set of functions here
provide this capability to the NILFS2 inode.

[penberg@cs.helsinki.fi: remove a bunch of bmap wrapper macros]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:13 -07:00
Ryusuke Konishi 65b4643d3b nilfs2: add inode and other major structures
This adds the following common structures of the NILFS2 file system.

* nilfs_inode_info structure:
  gives on-memory inode.

* nilfs_sb_info structure:
  keeps per-mount state and a special inode for the ifile.
  This structure is attached to the super_block structure.

* the_nilfs structure:
  keeps shared state and locks among a read/write mount and snapshot
  mounts.  This keeps special inodes for the sufile, cpfile, dat, and
  another dat inode used during GC (gcdat).  This also has a hash table
  of dummy inodes to cache disk blocks during GC (gcinodes).

* nilfs_transaction_info structure:
  keeps per task state while nilfs is writing logs or doing indivisible
  inode or namespace operations.  This structure is used to identify
  context during log making and store nest level of the lock which
  ensures atomicity of file system operations.

Signed-off-by: Koji Sato <sato.koji@lab.ntt.co.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:12 -07:00
Coly Li 8a59f5d252 fs/romfs: return f_fsid for statfs(2)
Make romfs return f_fsid info for statfs(2).

Signed-off-by: Coly Li <coly.li@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:10 -07:00
Serge E. Hallyn 909e6d9479 namespaces: move proc_net_get_sb to a generic fs/super.c helper
The mqueuefs filesystem will use this helper as well.  Proc's main get_sb
could also be made to use it, but that will require a bit more rework.

Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:09 -07:00
KAMEZAWA Hiroyuki 6260a4b052 /proc/pid/maps: don't show pgoff of pure ANON VMAs
Recently, it's argued that what proc/pid/maps shows is ugly when a 32bit
binary runs on 64bit host.

/proc/pid/maps outputs vma's pgoff member but vma->pgoff is of no use
information is the vma is for ANON.  With this patch, /proc/pid/maps shows
just 0 if no file backing store.

[akpm@linux-foundation.org: coding-style fixes]
[kamezawa.hiroyu@jp.fujitsu.com: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mike Waychison <mikew@google.com>
Reported-by: Ying Han <yinghan@google.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 08:31:03 -07:00
Ingo Molnar f8201abcb2 ramfs: fix double freeing s_fs_info on failed mount
If ramfs mount fails, s_fs_info will be freed twice in ramfs_fill_super()
and ramfs_kill_sb(), leading to kernel oops.

Consolidate and beautify the code.
Make sure s_fs_info and s_root are in known good states.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-07 07:39:59 -07:00
Trond Myklebust d508afb437 NFS: Fix a double free in nfs_parse_mount_options()
Due to an apparent typo, commit a67d18f89f
(NFS: load the rpc/rdma transport module automatically) lead to the
'proto=' mount option doing a double free, while Opt_mountproto leaks a
string.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 17:19:48 -07:00
Linus Torvalds bbae8bcc49 ext3: make default data ordering mode configurable
This makes the defautl ext3 data ordering mode (when no explicit
ordering is set) configurable, so as to allow people to default to
'data=writeback' and get the resulting latency improvements.

This is a non-issue if a filesystem has been explicitly set to some
ordering (with 'tune2fs').

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 17:16:47 -07:00
Linus Torvalds e0724bf6e4 Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
  UBIFS: fix recovery bug
  UBIFS: add R/O compatibility
  UBIFS: fix compiler warnings
  UBIFS: fully sort GCed nodes
  UBIFS: fix commentaries
  UBIFS: introduce a helpful variable
  UBIFS: use KERN_CONT
  UBIFS: fix lprops committing bug
  UBIFS: fix bogus assertion
  UBIFS: fix bug where page is marked uptodate when out of space
  UBIFS: amend key_hash return value
  UBIFS: improve find function interface
  UBIFS: list usage cleanup
  UBIFS: fix dbg_chk_lpt_sz()
2009-04-06 15:00:19 -07:00
Linus Torvalds 22ae77bc7a Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (53 commits)
  [MTD] struct device - replace bus_id with dev_name(), dev_set_name()
  [MTD] [NOR] Fixup for Numonyx M29W128 chips
  [MTD] mtdpart: Make ecc_stats more realistic.
  powerpc/85xx: TQM8548: Update DTS file for multi-chip support
  powerpc: NAND: FSL UPM: document new bindings
  [MTD] [NAND] FSL-UPM: Add wait flags to support board/chip specific delays
  [MTD] [NAND] FSL-UPM: add multi chip support
  [MTD] [NOR] Add device parent info to physmap_of
  [MTD] [NAND] Add support for NAND on the Socrates board
  [MTD] [NAND] Add support for 4KiB pages.
  [MTD] sysfs support should not depend on CONFIG_PROC_FS
  [MTD] [NAND] Add parent info for CAFÉ controller
  [MTD] support driver model updates
  [MTD] driver model updates (part 2)
  [MTD] driver model updates
  [MTD] [NAND] move gen_nand's probe function to .devinit.text
  [MTD] [MAPS] move sa1100 flash's probe function to .devinit.text
  [MTD] fix use after free in register_mtd_blktrans
  [MTD] [MAPS] Drop now unused sharpsl-flash map
  [MTD] ofpart: Check name property to determine partition nodes.
  ...

Manually fix trivial conflict in drivers/mtd/maps/Makefile
2009-04-06 14:56:26 -07:00
Linus Torvalds 12fe32e4f9 Merge branch 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'kmemtrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  kmemtrace: trace kfree() calls with NULL or zero-length objects
  kmemtrace: small cleanups
  kmemtrace: restore original tracing data binary format, improve ABI
  kmemtrace: kmemtrace_alloc() must fill type_id
  kmemtrace: use tracepoints
  kmemtrace, rcu: don't include unnecessary headers, allow kmemtrace w/ tracepoints
  kmemtrace, rcu: fix rcupreempt.c data structure dependencies
  kmemtrace, rcu: fix rcu_tree_trace.c data structure dependencies
  kmemtrace, rcu: fix linux/rcutree.h and linux/rcuclassic.h dependencies
  kmemtrace, mm: fix slab.h dependency problem in mm/failslab.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_unlzma.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_bunzip2.c
  kmemtrace, kbuild: fix slab.h dependency problem in lib/decompress_inflate.c
  kmemtrace, squashfs: fix slab.h dependency problem in squasfs
  kmemtrace, befs: fix slab.h dependency problem
  kmemtrace, security: fix linux/key.h header file dependencies
  kmemtrace, fs: fix linux/fdtable.h header file dependencies
  kmemtrace, fs: uninline simple_transaction_set()
  kmemtrace, fs, security: move alloc_secdata() and free_secdata() to linux/security.h
2009-04-06 13:30:00 -07:00
Linus Torvalds a63856252d Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits)
  nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4
  nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
  nfsd41: Documentation/filesystems/nfs41-server.txt
  nfsd41: CREATE_EXCLUSIVE4_1
  nfsd41: SUPPATTR_EXCLCREAT attribute
  nfsd41: support for 3-word long attribute bitmask
  nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
  nfsd41: pass writable attrs mask to nfsd4_decode_fattr
  nfsd41: provide support for minor version 1 at rpc level
  nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
  nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
  nfsd41: access_valid
  nfsd41: clientid handling
  nfsd41: check encode size for sessions maxresponse cached
  nfsd41: stateid handling
  nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
  nfsd41: destroy_session operation
  nfsd41: non-page DRC for solo sequence responses
  nfsd41: Add a create session replay cache
  nfsd41: create_session operation
  ...
2009-04-06 13:25:56 -07:00
Dave Chinner 8de2bf937a xfs: remove xfs_flush_space
The only thing we need to do now when we get an ENOSPC condition during delayed
allocation reservation is flush all the other inodes with delalloc blocks on
them and retry without EOF preallocation. Remove the unneeded mess that is
xfs_flush_space() and just call xfs_flush_inodes() directly from
xfs_iomap_write_delay().

Also, change the location of the retry label to avoid trying to do EOF
preallocation because we don't want to do that at ENOSPC. This enables us to
remove the BMAPI_SYNC flag as it is no longer used.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:49:12 +02:00
Dave Chinner 153fec43ce xfs: flush delayed allcoation blocks on ENOSPC in create
If we are creating lots of small files, we can fail to get
a reservation for inode create earlier than we should due to
EOF preallocation done during delayed allocation reservation.
Hence on the first reservation ENOSPC failure flush all the
delayed allocation blocks out of the system and retry.

This fixes the last commonly triggered spurious ENOSPC issue
that has been reported.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:48:30 +02:00
Dave Chinner e43afd72d2 xfs: block callers of xfs_flush_inodes() correctly
xfs_flush_inodes() currently uses a magic timeout to wait for
some inodes to be flushed before returning. This isn't
really reliable but used to be the best that could be done
due to deadlock potential of waiting for the entire flush.

Now the inode flush is safe to execute while we hold page
and inode locks, we can wait for all the inodes to flush
synchronously. Convert the wait mechanism to a completion
to do this efficiently. This should remove all remaining
spurious ENOSPC errors from the delayed allocation reservation
path.

This is extracted almost line for line from a larger patch
from Mikulas Patocka.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:47:27 +02:00
Dave Chinner 5825294edd xfs: make inode flush at ENOSPC synchronous
When we are writing to a single file and hit ENOSPC, we trigger a background
flush of the inode and try again.  Because we hold page locks and the iolock,
the flush won't proceed until after we release these locks. This occurs once
we've given up and ENOSPC has been reported. Hence if this one is the only
dirty inode in the system, we'll get an ENOSPC prematurely.

To fix this, remove the async flush from the allocation routines and move
it to the top of the write path where we can do a synchronous flush
and retry the write again. Only retry once as a second ENOSPC indicates
that we really are ENOSPC.

This avoids a page cache deadlock when trying to do this flush synchronously
in the allocation layer that was identified by Mikulas Patocka.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:45:44 +02:00
Dave Chinner a8d770d987 xfs: use xfs_sync_inodes() for device flushing
Currently xfs_device_flush calls sync_blockdev() which is
a no-op for XFS as all it's metadata is held in a different
address to the one sync_blockdev() works on.

Call xfs_sync_inodes() instead to flush all the delayed
allocation blocks out. To do this as efficiently as possible,
do it via two passes - one to do an async flush of all the
dirty blocks and a second to wait for all the IO to complete.
This requires some modification to the xfs-sync_inodes_ag()
flush code to do efficiently.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:44:54 +02:00
Dave Chinner 9d7fef74b2 xfs: inform the xfsaild of the push target before sleeping
When trying to reserve log space, we find the amount of space
we need, then go to sleep waiting for space. When we are
woken, we try to push the tail of the log forward to make
sure we have space available.

Unfortunately, this means that if there is not space available, and
everyone who needs space goes to sleep there is no-one left to push
the tail of the log to make space available. Once we have a thread
waiting for space to become available, the others queue up behind
it in a FIFO, and none of them push the tail of the log.

This can result in everyone going to sleep in xlog_grant_log_space()
if the first sleeper races with the last I/O that moves the tail
of the log forward. With no further I/O tomove the tail of the log,
there is nothing to wake the sleepers and hence all transactions
just stop.

Fix this by making sure the xfsaild will create enough space for the
transaction that is about to sleep by moving the push target far
enough forwards to ensure that that the curent proceeees will have
enough space available when it is woken. That is, we push the
AIL before we go to sleep.

Because we've inserted the log ticket into the queue before we've
pushed and gone to sleep, subsequent transactions will wait behind
this one. Hence we are guaranteed to have space available when we
are woken.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:42:59 +02:00
Dave Chinner c626d174cf xfs: prevent unwritten extent conversion from blocking I/O completion
Unwritten extent conversion can recurse back into the filesystem due
to memory allocation. Memory reclaim requires I/O completions to be
processed to allow the callers to make progress. If the I/O
completion workqueue thread is doing the recursion, then we have a
deadlock situation.

Move unwritten extent completion into it's own workqueue so it
doesn't block I/O completions for normal delayed allocation or
overwrite data.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:42:11 +02:00
Dave Chinner 705db3fd46 xfs: fix double free of inode
If we fail to initialise the VFS inode in inode_init_always(),
it will call ->delete_inode internally resulting in the inode being
freed. Hence we need to delay the call to inode_init_always()
until after the XFS inode is sufficient set up to handle a
call to ->delete_inode, and then if that fails do not touch
the inode again at all as it has been freed.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:40:17 +02:00
Dave Chinner a6cb767e24 xfs: validate log feature fields correctly
If the large log sector size feature bit is set in the
superblock by accident (say disk corruption), the then
fields that are now considered valid are not checked on
production kernels. The checks are present as ASSERT
statements so cause a panic on a debug kernel.

Change this so that the fields are validity checked if
the feature bit is set and abort the log mount if the
fields do not contain valid values.

Reported-by: Eric Sesterhenn <snakebyte@gmx.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2009-04-06 18:39:27 +02:00
Benny Halevy f0ad670d70 nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc
Fixes the following compiler error:
fs/nfsd/nfssvc.c: In function 'set_max_drc':
fs/nfsd/nfssvc.c:240: error: 'NFSD_DRC_SIZE_SHIFT' undeclared

CONFIG_NFSD_V4 is not set

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-06 09:17:53 -07:00
Jens Axboe 1aa2a7cc6f block: switch sync_dirty_buffer() over to WRITE_SYNC
We should now have the logic in place to handle this properly
without regressing on the write performance, so re-enable
the sync writes.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe aeb6fafb8f block: Add flag for telling the IO schedulers NOT to anticipate more IO
By default, CFQ will anticipate more IO from a given io context if the
previously completed IO was sync. This used to be fine, since the only
sync IO was reads and O_DIRECT writes. But with more "normal" sync writes
being used now, we don't want to anticipate for those.

Add a bio/request flag that informs the IO scheduler that this is a sync
request that we should not idle for. Introduce WRITE_ODIRECT specifically
for O_DIRECT writes, and make sure that the other sync writes set this
flag.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe 4194b1eaf1 jbd2: use WRITE_SYNC_PLUG instead of WRITE_SYNC
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:54 -07:00
Jens Axboe 6c4bac6b33 jbd: use WRITE_SYNC_PLUG instead of WRITE_SYNC
When you are going to be submitting several sync writes, we want to
give the IO scheduler a chance to merge some of them. Instead of
using the implicitly unplugging WRITE_SYNC variant, use WRITE_SYNC_PLUG
and rely on sync_buffer() doing the unplug when someone does a
wait_on_buffer()/lock_buffer().

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:53 -07:00
Jens Axboe 9cf6b720f8 block: fsync_buffers_list() should use SWRITE_SYNC_PLUG
Then it can submit all the buffers without unplugging for each one.
We will kick off the pending IO if we come across a new address space.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-06 08:04:53 -07:00
Linus Torvalds 714f83d5d9 Merge branch 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
  tracing, net: fix net tree and tracing tree merge interaction
  tracing, powerpc: fix powerpc tree and tracing tree interaction
  ring-buffer: do not remove reader page from list on ring buffer free
  function-graph: allow unregistering twice
  trace: make argument 'mem' of trace_seq_putmem() const
  tracing: add missing 'extern' keywords to trace_output.h
  tracing: provide trace_seq_reserve()
  blktrace: print out BLK_TN_MESSAGE properly
  blktrace: extract duplidate code
  blktrace: fix memory leak when freeing struct blk_io_trace
  blktrace: fix blk_probes_ref chaos
  blktrace: make classic output more classic
  blktrace: fix off-by-one bug
  blktrace: fix the original blktrace
  blktrace: fix a race when creating blk_tree_root in debugfs
  blktrace: fix timestamp in binary output
  tracing, Text Edit Lock: cleanup
  tracing: filter fix for TRACE_EVENT_FORMAT events
  ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
  x86: kretprobe-booster interrupt emulation code fix
  ...

Fix up trivial conflicts in
 arch/parisc/include/asm/ftrace.h
 include/linux/memory.h
 kernel/extable.c
 kernel/module.c
2009-04-05 11:04:19 -07:00
Thiemo Nagel e44543b83b ext4: Fix off-by-one-error in ext4_valid_extent_idx()
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2009-04-04 23:30:44 -04:00
Thiemo Nagel f73953c065 ext4: Fix big-endian problem in __ext4_check_blockref()
Commit fe2c8191 introduced a regression on big-endian system, because
the checks to make sure block references in non-extent inodes are
valid failed to use le32_to_cpu().

Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
2009-04-07 18:46:47 -04:00
Linus Torvalds 601cc11d05 Make non-compat preadv/pwritev use native register size
Instead of always splitting the file offset into 32-bit 'high' and 'low'
parts, just split them into the largest natural word-size - which in C
terms is 'unsigned long'.

This allows 64-bit architectures to avoid the unnecessary 32-bit
shifting and masking for native format (while the compat interfaces will
obviously always have to do it).

This also changes the order of 'high' and 'low' to be "low first".  Why?
Because when we have it like this, the 64-bit system calls now don't use
the "pos_high" argument at all, and it makes more sense for the native
system call to simply match the user-mode prototype.

This results in a much more natural calling convention, and allows the
compiler to generate much more straightforward code.  On x86-64, we now
generate

        testq   %rcx, %rcx      # pos_l
        js      .L122   #,
        movq    %rcx, -48(%rbp) # pos_l, pos

from the C source

        loff_t pos = pos_from_hilo(pos_h, pos_l);
	...
        if (pos < 0)
                return -EINVAL;

and the 'pos_h' register isn't even touched.  It used to generate code
like

        mov     %r8d, %r8d      # pos_low, pos_low
        salq    $32, %rcx       #, tmp71
        movq    %r8, %rax       # pos_low, pos.386
        orq     %rcx, %rax      # tmp71, pos.386
        js      .L122   #,
        movq    %rax, -48(%rbp) # pos.386, pos

which isn't _that_ horrible, but it does show how the natural word size
is just a more sensible interface (same arguments will hold in the user
level glibc wrapper function, of course, so the kernel side is just half
of the equation!)

Note: in all cases the user code wrapper can again be the same. You can
just do

	#define HALF_BITS (sizeof(unsigned long)*4)
	__syscall(PWRITEV, fd, iov, count, offset, (offset >> HALF_BITS) >> HALF_BITS);

or something like that.  That way the user mode wrapper will also be
nicely passing in a zero (it won't actually have to do the shifts, the
compiler will understand what is going on) for the last argument.

And that is a good idea, even if nobody will necessarily ever care: if
we ever do move to a 128-bit lloff_t, this particular system call might
be left alone.  Of course, that will be the least of our worries if we
really ever need to care, so this may not be worth really caring about.

[ Fixed for lost 'loff_t' cast noticed by Andrew Morton ]

Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ralf Baechle <ralf@linux-mips.org>>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-04 14:20:34 -07:00
Benny Halevy 79fb54abd2 nfsd41: CREATE_EXCLUSIVE4_1
Implement the CREATE_EXCLUSIVE4_1 open mode conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

This mode allows the client to atomically create a file
if it doesn't exist while setting some of its attributes.

It must be implemented if the server supports persistent
reply cache and/or pnfs.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Benny Halevy 8c18f2052e nfsd41: SUPPATTR_EXCLCREAT attribute
Return bitmask for supported EXCLUSIVE4_1 create attributes.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Andy Adamson 7e70570647 nfsd41: support for 3-word long attribute bitmask
Also, use client minorversion to generate supported attrs

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:23 -07:00
Benny Halevy 95ec28cda3 nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify
_nfsd4_verify currently skips 3 words from the encoded buffer begining.
With support for 3-word attr bitmaps in nfsd41, nfsd4_encode_fattr
may encode 1, 2, or 3 words, and not always 2 as it used to be, hence
we need to find out where to skip using the encoded bitmap length.

Note: This patch may be applied over pre-nfsd41 nfsd.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:22 -07:00
Benny Halevy c0d6fc8a2d nfsd41: pass writable attrs mask to nfsd4_decode_fattr
In preparation for EXCLUSIVE4_1

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:22 -07:00
Benny Halevy 8daf220a6a nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions
Support enabling and disabling nfsv4.1 via /proc/fs/nfsd/versions
by writing the strings "+4.1" or "-4.1" correspondingly.

Use user mode nfs-utils (rpc.nfsd option) to enable.
This will allow us to get rid of CONFIG_NFSD_V4_1

[nfsd41: disable support for minorversion by default]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:21 -07:00
Andy Adamson 84459a1162 nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap
Separate the access bits from the want bits and enable __set_bit to
work correctly with st_access_bmap.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:21 -07:00
Andy Adamson d87a8ade95 nfsd41: access_valid
For nfs41, the open share flags are used also for
delegation "wants" and "signals".  Check that they are valid.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:21 -07:00
Andy Adamson 60adfc50de nfsd41: clientid handling
Extract the clientid from sessionid to set the op_clientid on open.
Verify that the clid for other stateful ops is zero for minorversion != 0
Do all other checks for stateful ops without sessions.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixed whitespace indent]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41 remove sl_session from nfsd4_open]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:20 -07:00
Andy Adamson 496c262cf0 nfsd41: check encode size for sessions maxresponse cached
Calculate the space the compound response has taken after encoding the current
operation.

pad: add on 8 bytes for the next operation's op_code and status so that
there is room to cache a failure on the next operation.

Compare this length to the session se_fmaxresp_cached and return
nfserr_rep_too_big_to_cache if the length is too large.

Our se_fmaxresp_cached will always be a multiple of PAGE_SIZE, and so
will be at least a page and will therefore hold the xdr_buf head.

Signed-off-by: Andy Adamson <andros@netapp.com>
[nfsd41: non-page DRC for solo sequence responses]
[fixed nfsd4_check_drc_limit cosmetics]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use cstate session in nfsd4_check_drc_limit]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:20 -07:00
Andy Adamson 6668958fac nfsd41: stateid handling
When sessions are used, stateful operation sequenceid and stateid handling
are not used. When sessions are used,  on the first open set the seqid to 1,
mark state confirmed and skip seqid processing.

When sessionas are used the stateid generation number is ignored when it is zero
whereas without sessions bad_stateid or stale stateid is returned.

Add flags to propagate session use to all stateful ops and down to
check_stateid_generation.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfsd4_has_session should return a boolean, not u32]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: pass nfsd4_compoundres * to nfsd4_process_open1]
[nfsd41: calculate HAS_SESSION in nfs4_preprocess_stateid_op]
[nfsd41: calculate HAS_SESSION in nfs4_preprocess_seqid_op]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:19 -07:00
Benny Halevy dd453dfd70 nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op
Currently we only use cstate->current_fh,
will also be used by nfsd41 code.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:19 -07:00
Benny Halevy e10e0cfc2f nfsd41: destroy_session operation
Implement the destory_session operation confoming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

[use sessionid_lock spin lock]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:19 -07:00
Andy Adamson bf864a31d5 nfsd41: non-page DRC for solo sequence responses
A session inactivity time compound (lease renewal) or a compound where the
sequence operation has sa_cachethis set to FALSE do not require any pages
to be held in the v4.1 DRC. This is because struct nfsd4_slot is already
caching the session information.

Add logic to the nfs41 server to not cache response pages for solo sequence
responses.

Return nfserr_replay_uncached_rep on the operation following the sequence
operation when sa_cachethis is FALSE.

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use cstate session in nfsd4_replay_cache_entry]
[nfsd41: rename nfsd4_no_page_in_cache]
[nfsd41 rename nfsd4_enc_no_page_replay]
[nfsd41 nfsd4_is_solo_sequence]
[nfsd41 change nfsd4_not_cached return]
Signed-off-by: Andy Adamson <andros@netapp.com>
[changed return type to bool]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41 drop parens in nfsd4_is_solo_sequence call]
Signed-off-by: Andy Adamson <andros@netapp.com>
[changed "== 0" to "!"]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:19 -07:00
Andy Adamson 38eb76a54d nfsd41: Add a create session replay cache
Replace the nfs4_client cl_seqid field with a single struct nfs41_slot used
for the create session replay cache.

The CREATE_SESSION slot sets the sl_session pointer to NULL. Otherwise, the
slot and it's replay cache are used just like the session slots.

Fix unconfirmed create_session replay response by initializing the
create_session slot sequence id to 0.

A future patch will set the CREATE_SESSION cache when a SEQUENCE operation
preceeds the CREATE_SESSION operation. This compound is currently only cached
in the session slot table.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use bool inuse for slot state]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: revert portion of nfsd4_set_cache_entry]
Signed-off-by: Andy Adamson <andros@netpp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:18 -07:00
Andy Adamson ec6b5d7b50 nfsd41: create_session operation
Implement the create_session operation confoming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Look up the client id (generated by the server on exchange_id,
given by the client on create_session).
If neither a confirmed or unconfirmed client is found
then the client id is stale
If a confirmed cilent is found (i.e. we already received
create_session for it) then compare the sequence id
to determine if it's a replay or possibly a mis-ordered rpc.
If the seqid is in order, update the confirmed client seqid
and procedd with updating the session parameters.

If an unconfirmed client_id is found then verify the creds
and seqid.  If both match move the client id to confirmed state
and proceed with processing the create_session.

Currently, we do not support persistent sessions, and RDMA.

alloc_init_session generates a new sessionid and creates
a session structure.

NFSD_PAGES_PER_SLOT is used for the max response cached calculation, and for
the counting of DRC pages using the hard limits set in struct srv_serv.

A note on NFSD_PAGES_PER_SLOT:

Other patches in this series allow for NFSD_PAGES_PER_SLOT + 1 pages to be
cached in a DRC slot when the response size is less than NFSD_PAGES_PER_SLOT *
PAGE_SIZE but xdr_buf pages are used. e.g. a READDIR operation will encode a
small amount of data in the xdr_buf head, and then the READDIR in the xdr_buf
pages.  So, the hard limit calculation use of pages by a session is
underestimated by the number of cached operations using the xdr_buf pages.

Yet another patch caches no pages for the solo sequence operation, or any
compound where cache_this is False.  So the hard limit calculation use of
pages by a session is overestimated by the number of these operations in the
cache.

TODO: improve resource pre-allocation and negotiate session
parameters accordingly.  Respect and possibly adjust
backchannel attributes.

Signed-off-by: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Dean Hildebrand <dhildeb@us.ibm.com>
[nfsd41: remove headerpadsz from channel attributes]
Our client and server only support a headerpadsz of 0.
[nfsd41: use DRC limits in fore channel init]
[nfsd41: do not change CREATE_SESSION back channel attrs]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[use sessionid_lock spin lock]
[nfsd41: use bool inuse for slot state]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41 remove sl_session from alloc_init_session]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[simplify nfsd4_encode_create_session error handling]
[nfsd41: fix comment style in init_forechannel_attrs]
[nfsd41: allocate struct nfsd4_session and slot table in one piece]
[nfsd41: no need to INIT_LIST_HEAD in alloc_init_session just prior to list_add]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:18 -07:00
Andy Adamson 14778a133e nfsd41: clear DRC cache on free_session
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:18 -07:00
Andy Adamson da3846a286 nfsd41: nfsd DRC logic
Replay a request in nfsd4_sequence.
Add a minorversion to struct nfsd4_compound_state.

Pass the current slot to nfs4svc_encode_compound res via struct
nfsd4_compoundres to set an NFSv4.1 DRC entry.

Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use bool inuse for slot state]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use cstate session in nfs4svc_encode_compoundres]
[nfsd41 replace nfsd4_set_cache_entry]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:17 -07:00
Andy Adamson c3d06f9ce8 nfsd41: hard page limit for DRC
Use no more than 1/128th of the number of free pages at nfsd startup for the
v4.1 DRC.

This is an arbitrary default which should probably end up under the control
of an administrator.

Signed-off-by: Andy Adamson <andros@netapp.com>
[moved added fields in struct svc_serv under CONFIG_NFSD_V4_1]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[fix set_max_drc calculation of sv_drc_max_pages]
[moved NFSD_DRC_SIZE_SHIFT's declaration up in header file]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:17 -07:00
Andy Adamson 074fe89753 nfsd41: DRC save, restore, and clear functions
Cache all the result pages, including the rpc header in rq_respages[0],
for a request in the slot table cache entry.

Cache the statp pointer from nfsd_dispatch which points into rq_respages[0]
just past the rpc header. When setting a cache entry, calculate and save the
length of the nfs data minus the rpc header for rq_respages[0].

When replaying a cache entry, replace the cached rpc header with the
replayed request rpc result header, unless there is not enough room in the
cached results first page. In that case, use the cached rpc header.

The sessions fore channel maxresponse size cached is set to NFSD_PAGES_PER_SLOT
* PAGE_SIZE. For compounds we are cacheing with operations such as READDIR
that use the xdr_buf->pages to hold data, we choose to cache the extra page of
data rather than copying data from xdr_buf->pages into the xdr_buf->head page.

[nfsd41: limit cache to maxresponsesize_cached]
[nfsd41: mv nfsd4_set_statp under CONFIG_NFSD_V4_1]
[nfsd41: rename nfsd4_move_pages]
[nfsd41: rename page_no variable]
[nfsd41: rename nfsd4_set_cache_entry]
[nfsd41: fix nfsd41_copy_replay_data comment]
[nfsd41: add to nfsd4_set_cache_entry]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:17 -07:00
Andy Adamson f9bb94c4c6 nfsd41: enforce NFS4ERR_SEQUENCE_POS operation order rules for minorversion != 0 only.
Signed-off-by: Andy Adamson<andros@netapp.com>
[nfsd41: do not verify nfserr_sequence_pos for minorversion 0]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:16 -07:00
Benny Halevy b85d4c01b7 nfsd41: sequence operation
Implement the sequence operation conforming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-26

Check for stale clientid (as derived from the sessionid).
Enforce slotid range and exactly-once semantics using
the slotid and seqid.

If everything went well renew the client lease and
mark the slot INPROGRESS.

Add a struct nfsd4_slot pointer to struct nfsd4_compound_state.
To be used for sessions DRC replay.

[nfsd41: rename sequence catchthis to cachethis]
Signed-off-by: Andy Adamson<andros@netapp.com>
[pulled some code to set cstate->slot from "nfsd DRC logic"]
[use sessionid_lock spin lock]
[nfsd41: use bool inuse for slot state]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd: add a struct nfsd4_slot pointer to struct nfsd4_compound_state]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: add nfsd4_session pointer to nfsd4_compound_state]
[nfsd41: set cstate session]
[nfsd41: use cstate session in nfsd4_sequence]
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[simplify nfsd4_encode_sequence error handling]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:16 -07:00
Andy Adamson a1bcecd29c nfsd41: match clientid establishment method
We need to distinguish between client names provided by NFSv4.0 clients
SETCLIENTID and those provided by NFSv4.1 via EXCHANGE_ID when looking
up the clientid by string.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamson <andros@netapp.com>
[nfsd41: use boolean values for use_exchange_id argument]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: simplify match_clientid_establishment logic]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:15 -07:00
Andy Adamson 0733d21338 nfsd41: exchange_id operation
Implement the exchange_id operation confoming to
http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-28

Based on the client provided name, hash a client id.
If a confirmed one is found, compare the op's creds and
verifier.  If the creds match and the verifier is different
then expire the old client (client re-incarnated), otherwise,
if both match, assume it's a replay and ignore it.

If an unconfirmed client is found, then copy the new creds
and verifer if need update, otherwise assume replay.

The client is moved to a confirmed state on create_session.

In the nfs41 branch set the exchange_id flags to
EXCHGID4_FLAG_USE_NON_PNFS | EXCHGID4_FLAG_SUPP_MOVED_REFER
(pNFS is not supported, Referrals are supported,
Migration is not.).

Address various scenarios from section 18.35 of the spec:

1. Check for EXCHGID4_FLAG_UPD_CONFIRMED_REC_A and set
   EXCHGID4_FLAG_CONFIRMED_R as appropriate.

2. Return error codes per 18.35.4 scenarios.

3. Update client records or generate new client ids depending on
   scenario.

Note: 18.35.4 case 3 probably still needs revisiting.  The handling
seems not quite right.

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Andy Adamosn <andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use utsname for major_id (and copy to server_scope)]
[nfsd41: fix handling of various exchange id scenarios]
Signed-off-by: Mike Sager <sager@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: reverse use of EXCHGID4_INVAL_FLAG_MASK_A]
[simplify nfsd4_encode_exchange_id error handling]
[nfsd41: embed an xdr_netobj in nfsd4_exchange_id]
[nfsd41: return nfserr_serverfault for spa_how == SP4_MACH_CRED]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:15 -07:00
Andy Adamson 069b6ad4bb nfsd41: proc stubs
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2009-04-03 17:41:14 -07:00