1
0
Fork 0
Commit Graph

683 Commits (c771d683a62e5d36bc46036f5c07f4f5bb7dda61)

Author SHA1 Message Date
J. Bruce Fields b21996e36c locks: break delegations on unlink
We need to break delegations on any operation that changes the set of
links pointing to an inode.  Start with unlink.

Such operations also hold the i_mutex on a parent directory.  Breaking a
delegation may require waiting for a timeout (by default 90 seconds) in
the case of a unresponsive NFS client.  To avoid blocking all directory
operations, we therefore drop locks before waiting for the delegation.
The logic then looks like:

	acquire locks
	...
	test for delegation; if found:
		take reference on inode
		release locks
		wait for delegation break
		drop reference on inode
		retry

It is possible this could never terminate.  (Even if we take precautions
to prevent another delegation being acquired on the same inode, we could
get a different inode on each retry.)  But this seems very unlikely.

The initial test for a delegation happens after the lock on the target
inode is acquired, but the directory inode may have been acquired
further up the call stack.  We therefore add a "struct inode **"
argument to any intervening functions, which we use to pass the inode
back up to the caller in the case it needs a delegation synchronously
broken.

Cc: David Howells <dhowells@redhat.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:42 -05:00
J. Bruce Fields 9accbb977a namei: minor vfs_unlink cleanup
We'll be using dentry->d_inode in one more place.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:41 -05:00
J. Bruce Fields 6cedba8962 vfs: take i_mutex on renamed file
A read delegation is used by NFSv4 as a guarantee that a client can
perform local read opens without informing the server.

The open operation takes the last component of the pathname as an
argument, thus is also a lookup operation, and giving the client the
above guarantee means informing the client before we allow anything that
would change the set of names pointing to the inode.

Therefore, we need to break delegations on rename, link, and unlink.

We also need to prevent new delegations from being acquired while one of
these operations is in progress.

We could add some completely new locking for that purpose, but it's
simpler to use the i_mutex, since that's already taken by all the
operations we care about.

The single exception is rename.  So, modify rename to take the i_mutex
on the file that is being renamed.

Also fix up lockdep and Documentation/filesystems/directory-locking to
reflect the change.

Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:40 -05:00
J. Bruce Fields 13a2c3be03 dcache: fix outdated DCACHE_NEED_LOOKUP comment
The DCACHE_NEED_LOOKUP case referred to here was removed with
39e3c9553f "vfs: remove
DCACHE_NEED_LOOKUP".

There are only four real_lookup() callers and all of them pass in an
unhashed dentry just returned from d_alloc.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:34 -05:00
David Howells b18825a7c8 VFS: Put a small type field into struct dentry::d_flags
Put a type field into struct dentry::d_flags to indicate if the dentry is one
of the following types that relate particularly to pathwalk:

	Miss (negative dentry)
	Directory
	"Automount" directory (defective - no i_op->lookup())
	Symlink
	Other (regular, socket, fifo, device)

The type field is set to one of the first five types on a dentry by calls to
__d_instantiate() and d_obtain_alias() from information in the inode (if one is
given).

The type is cleared by dentry_unlink_inode() when it reconstitutes an existing
dentry as a negative dentry.

Accessors provided are:

	d_set_type(dentry, type)
	d_is_directory(dentry)
	d_is_autodir(dentry)
	d_is_symlink(dentry)
	d_is_file(dentry)
	d_is_negative(dentry)
	d_is_positive(dentry)

A bunch of checks in pathname resolution switched to those.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:30 -05:00
Al Viro 8b61e74ffc get rid of {lock,unlock}_rcu_walk()
those have become aliases for rcu_read_{lock,unlock}()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:20 -05:00
Al Viro 48a066e72d RCU'd vfsmounts
* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode.  We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock.  It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-09 00:16:19 -05:00
Jeff Layton 14e972b451 audit: add child record before the create to handle case where create fails
Historically, when a syscall that creates a dentry fails, you get an audit
record that looks something like this (when trying to create a file named
"new" in "/tmp/tmp.SxiLnCcv63"):

    type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

This record makes no sense since it's associating the inode information for
"/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent
patch I posted to fix the audit_inode call in do_last fixes this, by making it
look more like this:

    type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023

While this is more correct, if the creation of the file fails, then we
have no record of the filename that the user tried to create.

This patch adds a call to audit_inode_child to may_create. This creates
an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the
create succeeds. When and if the create does succeed, then this record
will be updated with the correct inode info from the create.

This fixes what was broken in commit bfcec708.
Commit 79f6530c should also be backported to stable v3.7+.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-11-05 11:08:44 -05:00
Al Viro 474279dc0f split __lookup_mnt() in two functions
Instead of passing the direction as argument (and checking it on every
step through the hash chain), just have separate __lookup_mnt() and
__lookup_mnt_last().  And use the standard iterators...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-10-24 23:35:00 -04:00
Randy Dunlap 606d6fe3ff fs/namei.c: fix new kernel-doc warning
Add @path parameter to fix kernel-doc warning.
Also fix a spello/typo.

  Warning(fs/namei.c:2304): No description found for parameter 'path'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-22 12:02:40 +01:00
Al Viro 03da633aa7 atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in fs/namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-17 17:08:50 -04:00
Miklos Szeredi 116cc02253 vfs: don't set FILE_CREATED before calling ->atomic_open()
If O_CREAT|O_EXCL are passed to open, then we know that either

 - the file is successfully created, or
 - the operation fails in some way.

So previously we set FILE_CREATED before calling ->atomic_open() so the
filesystem doesn't have to.  This, however, led to bugs in the
implementation that went unnoticed when the filesystem didn't check for
existence, yet returned success.  To prevent this kind of bug, require
filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and
verify this in the VFS.

Also added a couple more verifications for the result of atomic_open():

 - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT.
 - Warn if filesystem set FILE_CREATED but gave a negative dentry.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-16 19:17:24 -04:00
Dave Jones bcceeeba9b Add missing unlocks to error paths of mountpoint_last.
Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 17:09:32 -04:00
Al Viro 443ed254c3 ... and fold the renamed __vfs_follow_link() into its only caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 17:06:45 -04:00
Christoph Hellwig 4aa32895c3 fs: remove vfs_follow_link
For a long time no filesystem has been using vfs_follow_link, and as seen
by recent filesystem submissions any new use is accidental as well.

Remove vfs_follow_link, document the replacement in
Documentation/filesystems/porting and also rename __vfs_follow_link
to match its only caller better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-10 17:06:45 -04:00
Linus Torvalds b05430fc93 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 3 (of many) from Al Viro:
 "Waiman's conversion of d_path() and bits related to it,
  kern_path_mountpoint(), several cleanups and fixes (exportfs
  one is -stable fodder, IMO).

  There definitely will be more...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  split read_seqretry_or_unlock(), convert d_walk() to resulting primitives
  dcache: Translating dentry into pathname without taking rename_lock
  autofs4 - fix device ioctl mount lookup
  introduce kern_path_mountpoint()
  rename user_path_umountat() to user_path_mountpoint_at()
  take unlazy_walk() into umount_lookup_last()
  Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c
  prune_super(): sb->s_op is never NULL
  exportfs: don't assume that ->iterate() won't feed us too long entries
  afs: get rid of redundant ->d_name.len checks
2013-09-10 12:44:24 -07:00
Linus Torvalds d0d2727710 vfs: make sure we don't have a stale root path if unlazy_walk() fails
When I moved the RCU walk termination into unlazy_walk(), I didn't copy
quite all of it: for the successful RCU termination we properly add the
necessary reference counts to our temporary copy of the root path, but
for the failure case we need to make sure that any temporary root path
information is cleared out (since it does _not_ have the proper
reference counts from the RCU lookup).

We could clean up this mess by just always dropping the temporary root
information, but Al points out that that would mean that a single lookup
through symlinks could see multiple different root entries if it races
with another thread doing chroot.  Not that I think we should really
care (we had that before too, back before we had a copy of the root path
in the nameidata).

Al says he has a cunning plan.  In the meantime, this is the minimal fix
for the problem, even if it's not all that pretty.

Reported-by: Mace Moneta <moneta.mace@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-10 12:17:49 -07:00
Linus Torvalds e5c832d555 vfs: fix dentry RCU to refcounting possibly sleeping dput()
This is the fix that the last two commits indirectly led up to - making
sure that we don't call dput() in a bad context on the dentries we've
looked up in RCU mode after the sequence count validation fails.

This basically expands d_rcu_to_refcount() into the callers, and then
fixes the callers to delay the dput() in the failure case until _after_
we've dropped all locks and are no longer in an RCU-locked region.

The case of 'complete_walk()' was trivial, since its failure case did
the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
and as such that is just a pure expansion of the function with a trivial
movement of the resulting dput() to after 'unlock_rcu_walk()'.

In contrast, the unlazy_walk() case was much more complicated, because
not only does convert two different dentries from RCU to be reference
counted, but it used to not call unlock_rcu_walk() at all, and instead
just returned an error and let the caller clean everything up in
"terminate_walk()".

Happily, one of the dentries in question (called "parent" inside
unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
a refcount to anyway for the non-RCU case.

So what the new and improved unlazy_walk() does is to first turn that
dentry into a refcounted one, and once that is set up, the error cases
can continue to use the terminate_walk() helper for cleanup, but for the
non-RCU case.  Which makes it possible to drop out of RCU mode if we
actually hit the sequence number failure case.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-08 18:13:49 -07:00
Al Viro 2d86465101 introduce kern_path_mountpoint()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-08 20:20:23 -04:00
Al Viro 197df04c74 rename user_path_umountat() to user_path_mountpoint_at()
... and move the extern from linux/namei.h to fs/internal.h,
along with that of vfs_path_lookup().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-08 20:20:21 -04:00
Al Viro 35759521ee take unlazy_walk() into umount_lookup_last()
... and massage it a bit to reduce nesting

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-08 20:20:19 -04:00
Linus Torvalds 0d98439ea3 vfs: use lockred "dead" flag to mark unrecoverably dead dentries
This simplifies the RCU to refcounting code in particular.

I was originally intending to leave this for later, but walking through
all the dput() logic (see previous commit), I realized that the dput()
"might_sleep()" check was misleadingly weak.  And I removed it as
misleading, both for performance profiling and for debugging.

However, the might_sleep() debugging case is actually true: the final
dput() can indeed sleep, if the inode of the dentry that you are
releasing ends up sleeping at iput time (see dentry_iput()).  So the
problem with the might_sleep() in dput() wasn't that it wasn't true, it
was that it wasn't actually testing and triggering on the interesting
case.

In particular, just about *any* dput() can indeed sleep, if you happen
to race with another thread deleting the file in question, and you then
lose the race to the be the last dput() for that file.  But because it's
a very rare race, the debugging code would never trigger it in practice.

Why is this problematic? The new d_rcu_to_refcount() (see commit
15570086b590: "vfs: reimplement d_rcu_to_refcount() using
lockref_get_or_lock()") does a dput() for the failure case, and it does
it under the RCU lock.  So potentially sleeping really is a bug.

But there's no way I'm going to fix this with the previous complicated
"lockref_get_or_lock()" interface.  And rather than revert to the old
and crufty nested dentry locking code (which did get this right by
delaying the reference count updates until they were verified to be
safe), let's make forward progress.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-08 13:46:52 -07:00
Linus Torvalds 45d9a2220f Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs pile 1 from Al Viro:
 "Unfortunately, this merge window it'll have a be a lot of small piles -
  my fault, actually, for not keeping #for-next in anything that would
  resemble a sane shape ;-/

  This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
  cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
  components) + several long-standing patches from various folks.

  There definitely will be a lot more (starting with Miklos'
  check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  direct-io: Handle O_(D)SYNC AIO
  direct-io: Implement generic deferred AIO completions
  add formats for dentry/file pathnames
  kvm eventfd: switch to fdget
  powerpc kvm: use fdget
  switch fchmod() to fdget
  switch epoll_ctl() to fdget
  switch copy_module_from_fd() to fdget
  git simplify nilfs check for busy subtree
  ibmasmfs: don't bother passing superblock when not needed
  don't pass superblock to hypfs_{mkdir,create*}
  don't pass superblock to hypfs_diag_create_files
  don't pass superblock to hypfs_vm_create_files()
  oprofile: get rid of pointless forward declarations of struct super_block
  oprofilefs_create_...() do not need superblock argument
  oprofilefs_mkdir() doesn't need superblock argument
  don't bother with passing superblock to oprofile_create_stats_files()
  oprofile: don't bother with passing superblock to ->create_files()
  don't bother passing sb to oprofile_create_files()
  coh901318: don't open-code simple_read_from_buffer()
  ...
2013-09-05 08:50:26 -07:00
Jeff Layton 8033426e6b vfs: allow umount to handle mountpoints without revalidating them
Christopher reported a regression where he was unable to unmount a NFS
filesystem where the root had gone stale. The problem is that
d_revalidate handles the root of the filesystem differently from other
dentries, but d_weak_revalidate does not. We could simply fix this by
making d_weak_revalidate return success on IS_ROOT dentries, but there
are cases where we do want to revalidate the root of the fs.

A umount is really a special case. We generally aren't interested in
anything but the dentry and vfsmount that's attached at that point. If
the inode turns out to be stale we just don't care since the intent is
to stop using it anyway.

Try to handle this situation better by treating umount as a special
case in the lookup code. Have it resolve the parent using normal
means, and then do a lookup of the final dentry without revalidating
it. In most cases, the final lookup will come out of the dcache, but
the case where there's a trailing symlink or !LAST_NORM entry on the
end complicates things a bit.

Cc: Neil Brown <neilb@suse.de>
Reported-by: Christopher T Vogan <cvogan@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-09-03 22:50:29 -04:00
Linus Torvalds 15570086b5 vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()
This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
and re-implements it using the lockref infrastructure instead.  It also
adds a lot of comments about what is actually going on, because turning
a dentry that was looked up using RCU into a long-lived reference
counted entry is one of the more subtle parts of the rcu walk.

We also used to be _particularly_ subtle in unlazy_walk() where we
re-validate both the dentry and its parent using the same sequence
count.  We used to do it by nesting the locks and then verifying the
sequence count just once.

That was silly, because nested locking is expensive, but the sequence
count check is not.  So this just re-validates the dentry and the parent
separately, avoiding the nested locking, and making the lockref lookup
possible.

Acked-by: Waiman Long <waiman.long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-02 11:38:06 -07:00
Waiman Long 98474236f7 vfs: make the dentry cache use the lockref infrastructure
This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.

There are no semantic changes here, it's purely syntactic.  The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.

This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.

[ As with the previous commit, this is a rewritten version of a concept
  originally from Waiman, so credit goes to him, blame for any errors
  goes to me.

  Waiman's patch had some semantic differences for taking advantage of
  the lockless update in dget_parent(), while this patch is
  intentionally a pure search-and-replace change with no semantic
  changes.     - Linus ]

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 18:24:59 -07:00
Linus Torvalds f0cc6ffb8c Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"
This reverts commit bb2314b479.

It wasn't necessarily wrong per se, but we're still busily discussing
the exact details of this all, so I'm going to revert it for now.

It's true that you can already do flink() through /proc and that flink()
isn't new.  But as Brad Spengler points out, some secure environments do
not mount proc, and flink adds a new interface that can avoid path
lookup of the source for those kinds of environments.

We may re-do this (and even mark it for stable backporting back in 3.11
and possibly earlier) once the whole discussion about the interface is done.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-28 09:18:05 -07:00
Andy Lutomirski bb2314b479 fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink
Every now and then someone proposes a new flink syscall, and this spawns
a long discussion of whether it would be a security problem.  I think
that this is missing the point: flink is *already* allowed without
privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.

Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
write it, and link it in is very convenient.  The only problem is that
it requires that /proc be mounted so that you can do:

linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_NOFOLLOW)

This sucks -- it's much nicer to do:

linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)

Let's allow it.

If this turns out to be excessively scary, it we could instead require
that the inode in question be I_LINKABLE, but this seems pointless given
the /proc situation

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-05 18:24:11 +04:00
Al Viro bb458c644a Safer ABI for O_TMPFILE
[suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE;
that will fail on old kernels in a lot more cases than what I came up with.
And make sure O_CREAT doesn't get there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-13 13:26:37 +04:00
Linus Torvalds da53be12bb Don't pass inode to ->d_hash() and ->d_compare()
Instances either don't look at it at all (the majority of cases) or
only want it to find the superblock (which can be had as dentry->d_sb).
A few cases that want more are actually safe with dentry->d_inode -
the only precaution needed is the check that it hadn't been replaced with
NULL by rmdir() or by overwriting rename(), which case should be simply
treated as cache miss.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:36 +04:00
Al Viro f4e0c30c19 allow the temp files created by open() to be linked to
O_TMPFILE | O_CREAT => linkat() with AT_SYMLINK_FOLLOW and /proc/self/fd/<n>
as oldpath (i.e. flink()) will create a link
O_TMPFILE | O_CREAT | O_EXCL => ENOENT on attempt to link those guys

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:11 +04:00
Al Viro 60545d0d46 [O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:10 +04:00
Al Viro f9652e10c1 allow build_open_flags() to return an error
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:09 +04:00
Al Viro bc77daa783 do_last(): fix missing checks for LAST_BIND case
/proc/self/cwd with O_CREAT should fail with EISDIR.  /proc/self/exe, OTOH,
should fail with ENOTDIR when opened with O_DIRECTORY.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:07 +04:00
Al Viro 0525290119 use can_lookup() instead of direct checks of ->i_op->lookup
a couple of places got missed back when Linus has introduced that one...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-15 05:41:45 +04:00
Linus Torvalds c4cc75c332 Merge git://git.infradead.org/users/eparis/audit
Pull audit changes from Eric Paris:
 "Al used to send pull requests every couple of years but he told me to
  just start pushing them to you directly.

  Our touching outside of core audit code is pretty straight forward.  A
  couple of interface changes which hit net/.  A simple argument bug
  calling audit functions in namei.c and the removal of some assembly
  branch prediction code on ppc"

* git://git.infradead.org/users/eparis/audit: (31 commits)
  audit: fix message spacing printing auid
  Revert "audit: move kaudit thread start from auditd registration to kaudit init"
  audit: vfs: fix audit_inode call in O_CREAT case of do_last
  audit: Make testing for a valid loginuid explicit.
  audit: fix event coverage of AUDIT_ANOM_LINK
  audit: use spin_lock in audit_receive_msg to process tty logging
  audit: do not needlessly take a lock in tty_audit_exit
  audit: do not needlessly take a spinlock in copy_signal
  audit: add an option to control logging of passwords with pam_tty_audit
  audit: use spin_lock_irqsave/restore in audit tty code
  helper for some session id stuff
  audit: use a consistent audit helper to log lsm information
  audit: push loginuid and sessionid processing down
  audit: stop pushing loginid, uid, sessionid as arguments
  audit: remove the old depricated kernel interface
  audit: make validity checking generic
  audit: allow checking the type of audit message in the user filter
  audit: fix build break when AUDIT_DEBUG == 2
  audit: remove duplicate export of audit_enabled
  Audit: do not print error when LSMs disabled
  ...
2013-05-11 14:29:11 -07:00
Jeff Layton 33e2208acf audit: vfs: fix audit_inode call in O_CREAT case of do_last
Jiri reported a regression in auditing of open(..., O_CREAT) syscalls.
In older kernels, creating a file with open(..., O_CREAT) created
audit_name records that looked like this:

type=PATH msg=audit(1360255720.628:64): item=1 name="/abc/foo" inode=138810 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255720.628:64): item=0 name="/abc/" inode=138635 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

...in recent kernels though, they look like this:

type=PATH msg=audit(1360255402.886:12574): item=2 name=(null) inode=264599 dev=fd:00 mode=0100640 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255402.886:12574): item=1 name=(null) inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0
type=PATH msg=audit(1360255402.886:12574): item=0 name="/abc/foo" inode=264598 dev=fd:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=unconfined_u:object_r:default_t:s0

Richard bisected to determine that the problems started with commit
bfcec708, but the log messages have changed with some later
audit-related patches.

The problem is that this audit_inode call is passing in the parent of
the dentry being opened, but audit_inode is being called with the parent
flag false. This causes later audit_inode and audit_inode_child calls to
match the wrong entry in the audit_names list.

This patch simply sets the flag to properly indicate that this inode
represents the parent. With this, the audit_names entries are back to
looking like they did before.

Cc: <stable@vger.kernel.org> # v3.7+
Reported-by: Jiri Jaburek <jjaburek@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Test By: Richard Guy Briggs <rbriggs@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2013-05-07 22:27:19 -04:00
Linus Torvalds 7b54c165a0 vfs: don't BUG_ON() if following a /proc fd pseudo-symlink results in a symlink
It's "normal" - it can happen if the file descriptor you followed was
opened with O_NOFOLLOW.

Reported-by: Dave Jones <davej@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-08 09:03:07 -08:00
Al Viro dcf787f391 constify path_get/path_put and fs_struct.c stuff
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-01 23:51:07 -05:00
Jeff Layton ecf3d1f1aa vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
The following set of operations on a NFS client and server will cause

    server# mkdir a
    client# cd a
    server# mv a a.bak
    client# sleep 30  # (or whatever the dir attrcache timeout is)
    client# stat .
    stat: cannot stat `.': Stale NFS file handle

Obviously, we should not be getting an ESTALE error back there since the
inode still exists on the server. The problem is that the lookup code
will call d_revalidate on the dentry that "." refers to, because NFS has
FS_REVAL_DOT set.

nfs_lookup_revalidate will see that the parent directory has changed and
will try to reverify the dentry by redoing a LOOKUP. That of course
fails, so the lookup code returns ESTALE.

The problem here is that d_revalidate is really a bad fit for this case.
What we really want to know at this point is whether the inode is still
good or not, but we don't really care what name it goes by or whether
the dcache is still valid.

Add a new d_op->d_weak_revalidate operation and have complete_walk call
that instead of d_revalidate. The intent there is to allow for a
"weaker" d_revalidate that just checks to see whether the inode is still
good. This is also gives us an opportunity to kill off the FS_REVAL_DOT
special casing.

[AV: changed method name, added note in porting, fixed confusion re
having it possibly called from RCU mode (it won't be)]

Cc: NeilBrown <neilb@suse.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:09 -05:00
Al Viro cc2a527115 lookup_slow: get rid of name argument
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:35 -05:00
Al Viro e97cdc87be lookup_fast: get rid of name argument
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:34 -05:00
Al Viro 21b9b07392 get rid of name and type arguments of walk_component()
... always can be found in nameidata now.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:34 -05:00
Al Viro 5f4a6a6950 link_path_walk(): move assignments to nd->last/nd->last_type up
... and clean the main loop a bit

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:34 -05:00
Al Viro 1afc99beaf propagate error from get_empty_filp() to its callers
Based on parts from Anatol's patch (the rest is the next commit).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:32 -05:00
Al Viro 496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Jeff Layton c6a9428401 vfs: fix renameat to retry on ESTALE errors
...as always, rename is the messiest of the bunch. We have to track
whether to retry or not via a separate flag since the error handling
is already quite complex.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:05 -05:00
Jeff Layton 5d18f8133c vfs: make do_unlinkat retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton c6ee920698 vfs: make do_rmdir retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton 9e790bd65c vfs: add a flags argument to user_path_parent
...so we can pass in LOOKUP_REVAL. For now, nothing does yet.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:04 -05:00
Jeff Layton 442e31ca5a vfs: fix linkat to retry once on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:03 -05:00
Jeff Layton f46d3567b2 vfs: fix symlinkat to retry on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:03 -05:00
Jeff Layton b76d8b8226 vfs: fix mkdirat to retry once on an ESTALE error
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton 972567f14c vfs: fix mknodat to retry on ESTALE errors
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton 1ac12b4b6d vfs: turn is_dir argument to kern_path_create into a lookup_flags arg
Where we can pass in LOOKUP_DIRECTORY or LOOKUP_REVAL. Any other flags
passed in here are currently ignored.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 18:50:02 -05:00
Jeff Layton 39e3c9553f vfs: remove DCACHE_NEED_LOOKUP
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.

Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:36 -05:00
Al Viro 741b7c3f77 path_init(): make -ENOTDIR failure exits consistent
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:35 -05:00
Jeff Layton 582aa64a04 vfs: remove unneeded permission check from path_init
When path_init is called with a valid dfd, that code checks permissions
on the open directory fd and returns an error if the check fails. This
permission check is redundant, however.

Both callers of path_init immediately call link_path_walk afterward. The
first thing that link_path_walk does for pathnames that do not consist
only of slashes is to check for exec permissions at the starting point of
the path walk.  And this check in path_init() is on the path taken only
when *name != '/' && *name != '\0'.

In most cases, these checks are very quick, but when the dfd is for a
file on a NFS mount with the actimeo=0, each permission check goes
out onto the wire. The result is 2 identical ACCESS calls.

Given that these codepaths are fairly "hot", I think it makes sense to
eliminate the permission check in path_init and simply assume that the
caller will eventually check the permissions before proceeding.

Reported-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 13:57:04 -05:00
Al Viro 21d8a15ac3 lookup_one_len: don't accept . and ..
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-29 22:17:21 -05:00
Linus Torvalds 561ec64ae6 VFS: don't do protected {sym,hard}links by default
In commit 800179c9b8 ("This adds symlink and hardlink restrictions to
the Linux VFS"), the new link protections were enabled by default, in
the hope that no actual application would care, despite it being
technically against legacy UNIX (and documented POSIX) behavior.

However, it does turn out to break some applications.  It's rare, and
it's unfortunate, but it's unacceptable to break existing systems, so
we'll have to default to legacy behavior.

In particular, it has broken the way AFD distributes files, see

  http://www.dwd.de/AFD/

along with some legacy scripts.

Distributions can end up setting this at initrd time or in system
scripts: if you have security problems due to link attacks during your
early boot sequence, you have bigger problems than some kernel sysctl
setting. Do:

	echo 1 > /proc/sys/fs/protected_symlinks
	echo 1 > /proc/sys/fs/protected_hardlinks

to re-enable the link protections.

Alternatively, we may at some point introduce a kernel config option
that sets these kinds of "more secure but not traditional" behavioural
options automatically.

Reported-by: Nick Bowler <nbowler@elliptictech.com>
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # v3.6
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-26 10:05:07 -07:00
Jeff Layton 7950e3852a vfs: embed struct filename inside of names_cache allocation if possible
In the common case where a name is much smaller than PATH_MAX, an extra
allocation for struct filename is unnecessary. Before allocating a
separate one, try to embed the struct filename inside the buffer first. If
it turns out that that's not long enough, then fall back to allocating a
separate struct filename and redoing the copy.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:10 -04:00
Jeff Layton adb5c2473d audit: make audit_inode take struct filename
Keep a pointer to the audit_names "slot" in struct filename.

Have all of the audit_inode callers pass a struct filename ponter to
audit_inode instead of a string pointer. If the aname field is already
populated, then we can skip walking the list altogether and just use it
directly.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:09 -04:00
Jeff Layton 669abf4e55 vfs: make path_openat take a struct filename pointer
...and fix up the callers. For do_file_open_root, just declare a
struct filename on the stack and fill out the .name field. For
do_filp_open, make it also take a struct filename pointer, and fix up its
callers to call it appropriately.

For filp_open, add a variant that takes a struct filename pointer and turn
filp_open into a wrapper around it.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:09 -04:00
Jeff Layton 873f1eedc1 vfs: turn do_path_lookup into wrapper around struct filename variant
...and make the user_path callers use that variant instead.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:08 -04:00
Jeff Layton 7ac86265dc audit: allow audit code to satisfy getname requests from its names_list
Currently, if we call getname() on a userland string more than once,
we'll get multiple copies of the string and multiple audit_names
records.

Add a function that will allow the audit_names code to satisfy getname
requests using info from the audit_names list, avoiding a new allocation
and audit_names records.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:15:08 -04:00
Jeff Layton 91a27b2a75 vfs: define struct filename and have getname() return it
getname() is intended to copy pathname strings from userspace into a
kernel buffer. The result is just a string in kernel space. It would
however be quite helpful to be able to attach some ancillary info to
the string.

For instance, we could attach some audit-related info to reduce the
amount of audit-related processing needed. When auditing is enabled,
we could also call getname() on the string more than once and not
need to recopy it from userspace.

This patchset converts the getname()/putname() interfaces to return
a struct instead of a string. For now, the struct just tracks the
string in kernel space and the original userland pointer for it.

Later, we'll add other information to the struct as it becomes
convenient.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 20:14:55 -04:00
Jeff Layton 8e377d1507 vfs: unexport getname and putname symbols
I see no callers in module code.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:09 -04:00
Jeff Layton 4fa6b5ecbf audit: overhaul __audit_inode_child to accomodate retrying
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.

If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:03 -04:00
Jeff Layton bfcec70874 audit: set the name_len in audit_inode for parent lookups
Currently, this gets set mostly by happenstance when we call into
audit_inode_child. While that might be a little more efficient, it seems
wrong. If the syscall ends up failing before audit_inode_child ever gets
called, then you'll have an audit_names record that shows the full path
but has the parent inode info attached.

Fix this by passing in a parent flag when we call audit_inode that gets
set to the value of LOOKUP_PARENT. We can then fix up the pathname for
the audit entry correctly from the get-go.

While we're at it, clean up the no-op macro for audit_inode in the
!CONFIG_AUDITSYSCALL case.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:01 -04:00
Jeff Layton c43a25abba audit: reverse arguments to audit_inode_child
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.

Reverse those arguments for a micro-optimization.

Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:32:00 -04:00
Jeff Layton f78570dd6a audit: remove unnecessary NULL ptr checks from do_path_lookup
As best I can tell, whenever retval == 0, nd->path.dentry and nd->inode
are also non-NULL. Eliminate those checks and the superfluous
audit_context check.

Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-12 00:31:59 -04:00
Arnd Bergmann 98f6ef64b1 vfs: bogus warnings in fs/namei.c
The follow_link() function always initializes its *p argument,
or returns an error, but when building with 'gcc -s', the compiler
gets confused by the __always_inline attribute to the function
and can no longer detect where the cookie was initialized.

The solution is to always initialize the pointer from follow_link,
even in the error path. When building with -O2, this has zero impact
on generated code and adds a single instruction in the error path
for a -Os build on ARM.

Without this patch, building with gcc-4.6 through gcc-4.8 and
CONFIG_CC_OPTIMIZE_FOR_SIZE results in:

fs/namei.c: In function 'link_path_walk':
fs/namei.c:649:24: warning: 'cookie' may be used uninitialized in this function [-Wuninitialized]
fs/namei.c:1544:9: note: 'cookie' was declared here
fs/namei.c: In function 'path_lookupat':
fs/namei.c:649:24: warning: 'cookie' may be used uninitialized in this function [-Wuninitialized]
fs/namei.c:1934:10: note: 'cookie' was declared here
fs/namei.c: In function 'path_openat':
fs/namei.c:649:24: warning: 'cookie' may be used uninitialized in this function [-Wuninitialized]
fs/namei.c:2899:9: note: 'cookie' was declared here

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-11 20:02:16 -04:00
Sasha Levin ffd8d101a3 fs: prevent use after free in auditing when symlink following was denied
Commit "fs: add link restriction audit reporting" has added auditing of failed
attempts to follow symlinks. Unfortunately, the auditing was being done after
the struct path structure was released earlier.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-10-09 23:33:37 -04:00
Linus Torvalds aab174f0df Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:

 - big one - consolidation of descriptor-related logics; almost all of
   that is moved to fs/file.c

   (BTW, I'm seriously tempted to rename the result to fd.c.  As it is,
   we have a situation when file_table.c is about handling of struct
   file and file.c is about handling of descriptor tables; the reasons
   are historical - file_table.c used to be about a static array of
   struct file we used to have way back).

   A lot of stray ends got cleaned up and converted to saner primitives,
   disgusting mess in android/binder.c is still disgusting, but at least
   doesn't poke so much in descriptor table guts anymore.  A bunch of
   relatively minor races got fixed in process, plus an ext4 struct file
   leak.

 - related thing - fget_light() partially unuglified; see fdget() in
   there (and yes, it generates the code as good as we used to have).

 - also related - bits of Cyrill's procfs stuff that got entangled into
   that work; _not_ all of it, just the initial move to fs/proc/fd.c and
   switch of fdinfo to seq_file.

 - Alex's fs/coredump.c spiltoff - the same story, had been easier to
   take that commit than mess with conflicts.  The rest is a separate
   pile, this was just a mechanical code movement.

 - a few misc patches all over the place.  Not all for this cycle,
   there'll be more (and quite a few currently sit in akpm's tree)."

Fix up trivial conflicts in the android binder driver, and some fairly
simple conflicts due to two different changes to the sock_alloc_file()
interface ("take descriptor handling from sock_alloc_file() to callers"
vs "net: Providing protocol type via system.sockprotoname xattr of
/proc/PID/fd entries" adding a dentry name to the socket)

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
  MAX_LFS_FILESIZE should be a loff_t
  compat: fs: Generic compat_sys_sendfile implementation
  fs: push rcu_barrier() from deactivate_locked_super() to filesystems
  btrfs: reada_extent doesn't need kref for refcount
  coredump: move core dump functionality into its own file
  coredump: prevent double-free on an error path in core dumper
  usb/gadget: fix misannotations
  fcntl: fix misannotations
  ceph: don't abuse d_delete() on failure exits
  hypfs: ->d_parent is never NULL or negative
  vfs: delete surplus inode NULL check
  switch simple cases of fget_light to fdget
  new helpers: fdget()/fdput()
  switch o2hb_region_dev_write() to fget_light()
  proc_map_files_readdir(): don't bother with grabbing files
  make get_file() return its argument
  vhost_set_vring(): turn pollstart/pollstop into bool
  switch prctl_set_mm_exe_file() to fget_light()
  switch xfs_find_handle() to fget_light()
  switch xfs_swapext() to fget_light()
  ...
2012-10-02 20:25:04 -07:00
Linus Torvalds 437589a74b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull user namespace changes from Eric Biederman:
 "This is a mostly modest set of changes to enable basic user namespace
  support.  This allows the code to code to compile with user namespaces
  enabled and removes the assumption there is only the initial user
  namespace.  Everything is converted except for the most complex of the
  filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs,
  nfs, ocfs2 and xfs as those patches need a bit more review.

  The strategy is to push kuid_t and kgid_t values are far down into
  subsystems and filesystems as reasonable.  Leaving the make_kuid and
  from_kuid operations to happen at the edge of userspace, as the values
  come off the disk, and as the values come in from the network.
  Letting compile type incompatible compile errors (present when user
  namespaces are enabled) guide me to find the issues.

  The most tricky areas have been the places where we had an implicit
  union of uid and gid values and were storing them in an unsigned int.
  Those places were converted into explicit unions.  I made certain to
  handle those places with simple trivial patches.

  Out of that work I discovered we have generic interfaces for storing
  quota by projid.  I had never heard of the project identifiers before.
  Adding full user namespace support for project identifiers accounts
  for most of the code size growth in my git tree.

  Ultimately there will be work to relax privlige checks from
  "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing
  root in a user names to do those things that today we only forbid to
  non-root users because it will confuse suid root applications.

  While I was pushing kuid_t and kgid_t changes deep into the audit code
  I made a few other cleanups.  I capitalized on the fact we process
  netlink messages in the context of the message sender.  I removed
  usage of NETLINK_CRED, and started directly using current->tty.

  Some of these patches have also made it into maintainer trees, with no
  problems from identical code from different trees showing up in
  linux-next.

  After reading through all of this code I feel like I might be able to
  win a game of kernel trivial pursuit."

Fix up some fairly trivial conflicts in netfilter uid/git logging code.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits)
  userns: Convert the ufs filesystem to use kuid/kgid where appropriate
  userns: Convert the udf filesystem to use kuid/kgid where appropriate
  userns: Convert ubifs to use kuid/kgid
  userns: Convert squashfs to use kuid/kgid where appropriate
  userns: Convert reiserfs to use kuid and kgid where appropriate
  userns: Convert jfs to use kuid/kgid where appropriate
  userns: Convert jffs2 to use kuid and kgid where appropriate
  userns: Convert hpfs to use kuid and kgid where appropriate
  userns: Convert btrfs to use kuid/kgid where appropriate
  userns: Convert bfs to use kuid/kgid where appropriate
  userns: Convert affs to use kuid/kgid wherwe appropriate
  userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids
  userns: On ia64 deal with current_uid and current_gid being kuid and kgid
  userns: On ppc convert current_uid from a kuid before printing.
  userns: Convert s390 getting uid and gid system calls to use kuid and kgid
  userns: Convert s390 hypfs to use kuid and kgid where appropriate
  userns: Convert binder ipc to use kuids
  userns: Teach security_path_chown to take kuids and kgids
  userns: Add user namespace support to IMA
  userns: Convert EVM to deal with kuids and kgids in it's hmac computation
  ...
2012-10-02 11:11:09 -07:00
Al Viro 2903ff019b switch simple cases of fget_light to fdget
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 22:20:08 -04:00
Al Viro f6d2ac5ca7 namei.c: fix BS comment
get_write_access() is needed for nfsd, not binfmt_aout (the latter
has no business doing anything of that kind, of course)

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26 21:10:02 -04:00
Randy Dunlap 55852635a8 fs: fix fs/namei.c kernel-doc warnings
Fix kernel-doc warnings in fs/namei.c:

Warning(fs/namei.c:360): No description found for parameter 'inode'
Warning(fs/namei.c:672): No description found for parameter 'nd'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc:	Alexander Viro <viro@zeniv.linux.org.uk>
Cc:	linux-fsdevel@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-22 10:30:10 -04:00
Sage Weil 62b2ce964b vfs: fix propagation of atomic_open create error on negative dentry
If ->atomic_open() returns -ENOENT, we take care to return the create
error (e.g., EACCES), if any.  Do the same when ->atomic_open() returns 1
and provides a negative dentry.

This fixes a regression where an unprivileged open O_CREAT fails with
ENOENT instead of EACCES, introduced with the new atomic_open code.  It
is tested by the open/08.t test in the pjd posix test suite, and was
observed on top of fuse (backed by ceph-fuse).

Signed-off-by: Sage Weil <sage@inktank.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
2012-08-16 19:29:09 +02:00
Miklos Szeredi 38227f78a5 vfs: pass right create mode to may_o_create()
Pass the umask-ed create mode to may_o_create() instead of the original one.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2012-08-15 13:01:24 +02:00
Miklos Szeredi 62b259d8b3 vfs: atomic_open(): fix create mode usage
Don't mask S_ISREG off the create mode before passing to ->atomic_open().  Other
methods (->create, ->mknod) also get the complete file mode and filesystems
expect it.

Reported-by: Steve <steveamigauk@yahoo.co.uk>
Reported-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
2012-08-15 13:01:24 +02:00
Eric W. Biederman 81abe27b10 userns: Fix link restrictions to use uid_eq
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-08-03 19:23:45 -07:00
Jan Kara c30dabfe5d fs: Push mnt_want_write() outside of i_mutex
Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (essentially has a trylock semantics) but when the
function starts to handle also frozen filesystems, it will get a full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.

One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex.  Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.

[AV: scratch the previous, mnt_want_write() has been moved to kern_path_create()
by now]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 01:02:49 +04:00
Al Viro 64894cf843 simplify lookup_open()/atomic_open() - do the temporary mnt_want_write() early
The write ref to vfsmount taken in lookup_open()/atomic_open() is going to
be dropped; we take the one to stay in dentry_open().  Just grab the temporary
in caller if it looks like we are going to need it (create/truncate/writable open)
and pass (by value) "has it succeeded" flag.  Instead of doing mnt_want_write()
inside, check that flag and treat "false" as "mnt_want_write() has just failed".
mnt_want_write() is cheap and the things get considerably simpler and more robust
that way - we get it and drop it in the same function, to start with, rather
than passing a "has something in the guts of really scary functions taken it"
back to caller.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-31 00:53:35 +04:00
Al Viro f8310c5920 fix O_EXCL handling for devices
O_EXCL without O_CREAT has different semantics; it's "fail if already opened",
not "fail if already exists".  commit 71574865 broke that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-30 11:50:30 +04:00
Kees Cook a51d9eaa41 fs: add link restriction audit reporting
Adds audit messages for unexpected link restriction violations so that
system owners will have some sort of potentially actionable information
about misbehaving processes.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:43:08 +04:00
Kees Cook 800179c9b8 fs: add link restrictions
This adds symlink and hardlink restrictions to the Linux VFS.

Symlinks:

A long-standing class of security issues is the symlink-based
time-of-check-time-of-use race, most commonly seen in world-writable
directories like /tmp. The common method of exploitation of this flaw
is to cross privilege boundaries when following a given symlink (i.e. a
root process follows a symlink belonging to another user). For a likely
incomplete list of hundreds of examples across the years, please see:
http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp

The solution is to permit symlinks to only be followed when outside
a sticky world-writable directory, or when the uid of the symlink and
follower match, or when the directory owner matches the symlink's owner.

Some pointers to the history of earlier discussion that I could find:

 1996 Aug, Zygo Blaxell
  http://marc.info/?l=bugtraq&m=87602167419830&w=2
 1996 Oct, Andrew Tridgell
  http://lkml.indiana.edu/hypermail/linux/kernel/9610.2/0086.html
 1997 Dec, Albert D Cahalan
  http://lkml.org/lkml/1997/12/16/4
 2005 Feb, Lorenzo Hernández García-Hierro
  http://lkml.indiana.edu/hypermail/linux/kernel/0502.0/1896.html
 2010 May, Kees Cook
  https://lkml.org/lkml/2010/5/30/144

Past objections and rebuttals could be summarized as:

 - Violates POSIX.
   - POSIX didn't consider this situation and it's not useful to follow
     a broken specification at the cost of security.
 - Might break unknown applications that use this feature.
   - Applications that break because of the change are easy to spot and
     fix. Applications that are vulnerable to symlink ToCToU by not having
     the change aren't. Additionally, no applications have yet been found
     that rely on this behavior.
 - Applications should just use mkstemp() or O_CREATE|O_EXCL.
   - True, but applications are not perfect, and new software is written
     all the time that makes these mistakes; blocking this flaw at the
     kernel is a single solution to the entire class of vulnerability.
 - This should live in the core VFS.
   - This should live in an LSM. (https://lkml.org/lkml/2010/5/31/135)
 - This should live in an LSM.
   - This should live in the core VFS. (https://lkml.org/lkml/2010/8/2/188)

Hardlinks:

On systems that have user-writable directories on the same partition
as system files, a long-standing class of security issues is the
hardlink-based time-of-check-time-of-use race, most commonly seen in
world-writable directories like /tmp. The common method of exploitation
of this flaw is to cross privilege boundaries when following a given
hardlink (i.e. a root process follows a hardlink created by another
user). Additionally, an issue exists where users can "pin" a potentially
vulnerable setuid/setgid file so that an administrator will not actually
upgrade a system fully.

The solution is to permit hardlinks to only be created when the user is
already the existing file's owner, or if they already have read/write
access to the existing file.

Many Linux users are surprised when they learn they can link to files
they have no access to, so this change appears to follow the doctrine
of "least surprise". Additionally, this change does not violate POSIX,
which states "the implementation may require that the calling process
has permission to access the existing file"[1].

This change is known to break some implementations of the "at" daemon,
though the version used by Fedora and Ubuntu has been fixed[2] for
a while. Otherwise, the change has been undisruptive while in use in
Ubuntu for the last 1.5 years.

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/linkat.html
[2] http://anonscm.debian.org/gitweb/?p=collab-maint/at.git;a=commitdiff;h=f4114656c3a6c6f6070e315ffdf940a49eda3279

This patch is based on the patches in Openwall and grsecurity, along with
suggestions from Al Viro. I have added a sysctl to enable the protected
behavior, and documentation.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:37:58 +04:00
Jeff Layton 3134f37e93 vfs: don't let do_last pass negative dentry to audit_inode
I can reliably reproduce the following panic by simply setting an audit
rule on a recent 3.5.0+ kernel:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
 IP: [<ffffffff810d1250>] audit_copy_inode+0x10/0x90
 PGD 7acd9067 PUD 7b8fb067 PMD 0
 Oops: 0000 [#86] SMP
 Modules linked in: nfs nfs_acl auth_rpcgss fscache lockd sunrpc tpm_bios btrfs zlib_deflate libcrc32c kvm_amd kvm joydev virtio_net pcspkr i2c_piix4 floppy virtio_balloon microcode virtio_blk cirrus drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
 CPU 0
 Pid: 1286, comm: abrt-dump-oops Tainted: G      D      3.5.0+ #1 Bochs Bochs
 RIP: 0010:[<ffffffff810d1250>]  [<ffffffff810d1250>] audit_copy_inode+0x10/0x90
 RSP: 0018:ffff88007aebfc38  EFLAGS: 00010282
 RAX: 0000000000000000 RBX: ffff88003692d860 RCX: 00000000000038c4
 RDX: 0000000000000000 RSI: ffff88006baf5d80 RDI: ffff88003692d860
 RBP: ffff88007aebfc68 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
 R13: ffff880036d30f00 R14: ffff88006baf5d80 R15: ffff88003692d800
 FS:  00007f7562634740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000040 CR3: 000000003643d000 CR4: 00000000000006f0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process abrt-dump-oops (pid: 1286, threadinfo ffff88007aebe000, task ffff880079614530)
 Stack:
  ffff88007aebfdf8 ffff88007aebff28 ffff88007aebfc98 ffffffff81211358
  ffff88003692d860 0000000000000000 ffff88007aebfcc8 ffffffff810d4968
  ffff88007aebfcc8 ffff8800000038c4 0000000000000000 0000000000000000
 Call Trace:
  [<ffffffff81211358>] ? ext4_lookup+0xe8/0x160
  [<ffffffff810d4968>] __audit_inode+0x118/0x2d0
  [<ffffffff811955a9>] do_last+0x999/0xe80
  [<ffffffff81191fe8>] ? inode_permission+0x18/0x50
  [<ffffffff81171efa>] ? kmem_cache_alloc_trace+0x11a/0x130
  [<ffffffff81195b4a>] path_openat+0xba/0x420
  [<ffffffff81196111>] do_filp_open+0x41/0xa0
  [<ffffffff811a24bd>] ? alloc_fd+0x4d/0x120
  [<ffffffff811855cd>] do_sys_open+0xed/0x1c0
  [<ffffffff810d40cc>] ? __audit_syscall_entry+0xcc/0x300
  [<ffffffff811856c1>] sys_open+0x21/0x30
  [<ffffffff81611ca9>] system_call_fastpath+0x16/0x1b
  RSP <ffff88007aebfc38>
 CR2: 0000000000000040

The problem is that do_last is passing a negative dentry to audit_inode.
The comments on lookup_open note that it can pass back a negative dentry
if O_CREAT is not set.

This patch fixes the oops, but I'm not clear on whether there's a better
approach.

Cc: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:27:03 +04:00
Al Viro a8104a9fcd pull mnt_want_write()/mnt_drop_write() into kern_path_create()/done_path_create() resp.
One side effect - attempt to create a cross-device link on a read-only fs fails
with EROFS instead of EXDEV now.  Makes more sense, POSIX allows, etc.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:15 +04:00
Al Viro 8e4bfca1d1 mknod: take sanity checks on mode into the very beginning
Note that applying umask can't affect their results.  While
that affects errno in cases like
	mknod("/no_such_directory/a", 030000)
yielding -EINVAL (due to impossible mode_t) instead of
-ENOENT (due to inexistent directory), IMO that makes a lot
more sense, POSIX allows to return either and any software
that relies on getting -ENOENT instead of -EINVAL in that
case deserves everything it gets.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:14 +04:00
Al Viro 921a1650de new helper: done_path_create()
releases what needs to be released after {kern,user}_path_create()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-29 21:24:13 +04:00
Al Viro 32a7991b6a tidy up namei.c a bit
locking/unlocking for rcu walk taken to a couple of inline helpers

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-23 00:00:55 +04:00
Al Viro 3c0a616368 unobfuscate follow_up() a bit
really convoluted test in there has grown up during struct mount
introduction; what it checks is that we'd reached the root of
mount tree.
2012-07-23 00:00:45 +04:00
Al Viro 1e0ea00144 use __lookup_hash() in kern_path_parent()
No need to bother with lookup_one_len() here - it's an overkill

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2012-07-22 23:57:53 +04:00
David Howells 0bdaea9017 VFS: Split inode_permission()
Split inode_permission() into inode- and superblock-dependent parts.

This is aimed at unionmounts where the superblock from the upper layer has to
be checked rather than the superblock from the lower layer as the upper layer
may be writable, thus allowing an unwritable file from the lower layer to be
copied up and modified.

Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com> (Further development)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:36 +04:00
David Howells f015f1267b VFS: Comment mount following code
Add comments describing what the directions "up" and "down" mean and ref count
handling to the VFS mount following family of functions.

Signed-off-by: Valerie Aurora <vaurora@redhat.com> (Original author)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:38:32 +04:00
Christoph Hellwig b5fb63c183 fs: add nd_jump_link
Add a helper that abstracts out the jump to an already parsed struct path
from ->follow_link operation from procfs.  Not only does this clean up
the code by moving the two sides of this game into a single helper, but
it also prepares for making struct nameidata private to namei.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:40 +04:00
Christoph Hellwig 408ef013cc fs: move path_put on failure out of ->follow_link
Currently the non-nd_set_link based versions of ->follow_link are expected
to do a path_put(&nd->path) on failure.  This calling convention is unexpected,
undocumented and doesn't match what the nd_set_link-based instances do.

Move the path_put out of the only non-nd_set_link based ->follow_link
instance into the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:35 +04:00
Al Viro 79714f72d3 get rid of kern_path_parent()
all callers want the same thing, actually - a kinda-sorta analog of
kern_path_create().  I.e. they want parent vfsmount/dentry (with
->i_mutex held, to make sure the child dentry is still their child)
+ the child dentry.

Signed-off-by Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:35:02 +04:00
David Howells 1acf0af9b9 VFS: Fix the banner comment on lookup_open()
Since commit 197e37d9, the banner comment on lookup_open() no longer matches
what the function returns.  It used to return a struct file pointer or NULL and
now it returns an integer and is passed the struct file pointer it is to use
amongst its arguments.  Update the comment to reflect this.

Also add a banner comment to atomic_open().

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-07-14 16:34:57 +04:00