Commit graph

34285 commits

Author SHA1 Message Date
David Sterba e43f998e47 btrfs: call mnt_drop_write after interrupted subvol deletion
If btrfs_ioctl_snap_destroy blocks on the mutex and the process is
killed, mnt_write count is unbalanced and leads to unmountable
filesystem.

CC: stable@vger.kernel.org
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:38 -08:00
Miao Xie a7e252af5a Btrfs: don't clear the default compression type
We met a oops caused by the wrong compression type:
[  556.512356] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98
[SNIP]
[  556.512490]  [<ffffffff811dbb44>] ? list_del+0xd/0x2b
[  556.512539]  [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs]
[  556.512546]  [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10
[  556.512576]  [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs]
[  556.512601]  [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs]
[  556.512627]  [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs]
[  556.512655]  [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs]
[  556.512661]  [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8
[  556.512689]  [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs]
[  556.512695]  [<ffffffff8104fa4e>] kthread+0x8d/0x95
[  556.512699]  [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d
[  556.512704]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65
[  556.512709]  [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0
[  556.512713]  [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65

Steps to reproduce:
 # mkfs.btrfs -f <dev>
 # mount -o nodatacow <dev> <mnt>
 # touch <mnt>/<file>
 # chattr =c <mnt>/<file>
 # dd if=/dev/zero of=<mnt>/<file> bs=1M count=10

It is because we cleared the default compression type when setting the
nodatacow. In fact, we needn't do it because we have used COMPRESS flag to
indicate if we need compressed the file data or not, needn't use the
variant -- compress_type -- in btrfs_info to do the same thing, and just
use it to hold the default compression type. Or we would get a wrong compress
type for a file whose own compress flag is set but the compress flag of its
filesystem is not set.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12 07:11:19 -08:00
Jeff Layton 781c2a5a5f nfsd: when reusing an existing repcache entry, unhash it first
The DRC code will attempt to reuse an existing, expired cache entry in
preference to allocating a new one. It'll then search the cache, and if
it gets a hit it'll then free the cache entry that it was going to
reuse.

The cache code doesn't unhash the entry that it's going to reuse
however, so it's possible for it end up designating an entry for reuse
and then subsequently freeing the same entry after it finds it.  This
leads it to a later use-after-free situation and usually some list
corruption warnings or an oops.

Fix this by simply unhashing the entry that we intend to reuse. That
will mean that it's not findable via a search and should prevent this
situation from occurring.

Cc: stable@vger.kernel.org # v3.10+
Reported-by: Christoph Hellwig <hch@infradead.org>
Reported-by: g. artim <gartim@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-12-10 20:34:44 -05:00
Dave Chinner f94c44573e xfs: growfs overruns AGFL buffer on V4 filesystems
This loop in xfs_growfs_data_private() is incorrect for V4
superblocks filesystems:

		for (bucket = 0; bucket < XFS_AGFL_SIZE(mp); bucket++)
			agfl->agfl_bno[bucket] = cpu_to_be32(NULLAGBLOCK);

For V4 filesystems, we don't have a agfl header structure, and so
XFS_AGFL_SIZE() returns an entire sector's worth of entries, which
we then index from an offset into the sector. Hence: buffer overrun.

This problem was introduced in 3.10 by commit 77c95bba ("xfs: add
CRC checks to the AGFL") which changed the AGFL structure but failed
to update the growfs code to handle the different structures.

Fix it by using the correct offset into the buffer for both V4 and
V5 filesystems.

Cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit b7d961b35b)
2013-12-10 10:04:27 -06:00
Jie Liu 2f42d612e7 xfs: don't perform discard if the given range length is less than block size
For discard operation, we should return EINVAL if the given range length
is less than a block size, otherwise it will go through the file system
to discard data blocks as the end range might be evaluated to -1, e.g,
# fstrim -v -o 0 -l 100 /xfs7
/xfs7: 9811378176 bytes were trimmed

This issue can be triggered via xfstests/generic/288.

Also, it seems to get the request queue pointer via bdev_get_queue()
instead of the hard code pointer dereference is not a bad thing.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit f9fd013561)
2013-12-10 10:00:33 -06:00
Dan Carpenter 31978b5cc6 xfs: underflow bug in xfs_attrlist_by_handle()
If we allocate less than sizeof(struct attrlist) then we end up
corrupting memory or doing a ZERO_PTR_SIZE dereference.

This can only be triggered with CAP_SYS_ADMIN.

Reported-by: Nico Golde <nico@ngolde.de>
Reported-by: Fabian Yamaguchi <fabs@goesec.de>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>

(cherry picked from commit 071c529eb6)
2013-12-10 09:59:37 -06:00
Dmitry Monakhov a67c848a8b jbd2: rename obsoleted msg JBD->JBD2
Rename performed via: perl -pi -e 's/JBD:/JBD2:/g' fs/jbd2/*.c

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2013-12-08 21:14:59 -05:00
Jan Kara 75685071cd jbd2: revise KERN_EMERG error messages
Some of KERN_EMERG printk messages do not really deserve this log
level and the one in log_wait_commit() is even rather useless (the
journal has been previously aborted and *that* is where we should have
been complaining). So make some messages just KERN_ERR and remove the
useless message.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2013-12-08 21:13:59 -05:00
Theodore Ts'o f6c07cad08 jbd2: don't BUG but return ENOSPC if a handle runs out of space
If a handle runs out of space, we currently stop the kernel with a BUG
in jbd2_journal_dirty_metadata().  This makes it hard to figure out
what might be going on.  So return an error of ENOSPC, so we can let
the file system layer figure out what is going on, to make it more
likely we can get useful debugging information).  This should make it
easier to debug problems such as the one which was reported by:

    https://bugzilla.kernel.org/show_bug.cgi?id=44731

The only two callers of this function are ext4_handle_dirty_metadata()
and ocfs2_journal_dirty().  The ocfs2 function will trigger a
BUG_ON(), which means there will be no change in behavior.  The ext4
function will call ext4_error_inode() which will print the useful
debugging information and then handle the situation using ext4's error
handling mechanisms (i.e., which might mean halting the kernel or
remounting the file system read-only).

Also, since both file systems already call WARN_ON(), drop the WARN_ON
from jbd2_journal_dirty_metadata() to avoid two stack traces from
being displayed.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <jlbec@evilplan.org>
2013-12-08 21:12:59 -05:00
Jan Kara 30fac0f75d ext4: Do not reserve clusters when fs doesn't support extents
When the filesystem doesn't support extents (like in ext2/3
compatibility modes), there is no need to reserve any clusters. Space
estimates for writing are exact, hole punching doesn't need new
metadata, and there are no unwritten extents to convert.

This fixes a problem when filesystem still having some free space when
accessed with a native ext2/3 driver suddently reports ENOSPC when
accessed with ext4 driver.

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 21:11:59 -05:00
Al Viro 9105bb149b ext4: fix del_timer() misuse for ->s_err_report
That thing should be del_timer_sync(); consider what happens
if ext4_put_super() call of del_timer() happens to come just as it's
getting run on another CPU.  Since that timer reschedules itself
to run next day, you are pretty much guaranteed that you'll end up
with kfree'd scheduled timer, with usual fun consequences.  AFAICS,
that's -stable fodder all way back to 2010... [the second del_timer_sync()
is almost certainly not needed, but it doesn't hurt either]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-08 20:52:31 -05:00
Tejun Heo a8b1474442 sysfs: give different locking key to regular and bin files
027a485d12 ("sysfs: use a separate locking class for open files
depending on mmap") assigned different lockdep key to
sysfs_open_file->mutex depending on whether the file implements mmap
or not in an attempt to avoid spurious lockdep warning caused by
merging of regular and bin file paths.

While this restored some of the original behavior of using different
locks (at least lockdep is concerned) for the different clases of
files.  The restoration wasn't full because now the lockdep key
assignment depends on whether the file has mmap or not instead of
whether it's a regular file or not.

This means that bin files which don't implement mmap will get assigned
the same lockdep class as regular files.  This is problematic because
file_operations for bin files still implements the mmap file operation
and checking whether the sysfs file actually implements mmap happens
in the file operation after grabbing @sysfs_open_file->mutex.  We
still end up adding locking dependency from mmap locking to
sysfs_open_file->mutex to the regular file mutex which triggers
spurious circular locking warning.

Fix it by restoring the original behavior fully by differentiating
lockdep key by whether the file is regular or bin, instead of the
existence of mmap.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/g/20131203184324.GA11320@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-07 21:22:00 -08:00
Linus Torvalds c537aba00e Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fix from Benjamin LaHaise:
 "AIO fix from Gu Zheng that fixes a GPF that Dave Jones uncovered with
  trinity"

* git://git.kvack.org/~bcrl/aio-next:
  aio: clean up aio ring in the fail path
2013-12-06 08:32:59 -08:00
Gu Zheng d1b9432712 aio: clean up aio ring in the fail path
Clean up the aio ring file in the fail path of aio_setup_ring
and ioctx_alloc. And maybe it can fix the GPF issue reported by
Dave Jones:
https://lkml.org/lkml/2013/11/25/898

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-06 10:22:55 -05:00
Linus Torvalds 002acf1fc1 Power management fixes for 3.13-rc3
- cpufreq regression fix from Bjørn Mork restoring the pre-3.12
    behavior of the framework during system suspend/hibernation to
    avoid garbage sysfs files from being left behind in case of a
    suspend error.
 
  - PNP regression fix to restore the correct states of devices after
    resume from hibernation broken in 3.12.  From Dmitry Torokhov.
 
  - cpuidle fix to prevent cpuidle device unregistration from crashing
    due to a NULL pointer dereference if cpuidle has been disabled
    from the kernel command line.  From Konrad Rzeszutek Wilk.
 
  - intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
    processors from Arne Bockholdt.
 
  - Power capping framework fix to make the energy_uj sysfs attribute
    work in accordance with the documentation.  From Srinivas Pandruvada.
 
  - epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
    been compiled with CONFIG_PM_SLEEP unset (in which case that flag
    should not have any effect).  From Amit Pundir.
 
  - cpufreq fix to prevent governor sysfs files from being lost over
    system suspend/resume in some (arguably unusual) situations.  From
    Viresh Kumar.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABCAAGBQJSoTE5AAoJEILEb/54YlRxh/sP/jFZhLTc8g4MC3XzhguuROzT
 u0Pu9VJVfACqz8LyiCOtfOvvb2EPV7VSq7qYqszL5y9Dn5gwIHvQMMBK2PjZ4cc2
 MtSiw02Bk/DEESXYOjt++n5ja/0lc05CtJTlb3uoJXBOCqp3cMrvW7+QqnLbEfbG
 S+TcPFBr+4Owt/J7r2Z2JBYGZ6NbVol/x1hAFjiM+rBan6UGw7uNcg2LgQrVHcs4
 S1Cm6lsJTwRcSiswvJv9/C+ML9Z/1gYYUyu7ijQnGdbNUolyzHY6AxLWZdnSkAQO
 s8JVDRKy9+V44LtnWSENnJNftjlOoXWcZRJxvDePyM3dVpxESBa8Z/AxYWwCcmcB
 e4rsgm/WOF86DMhRu+gfeTF+1OkU7KhuPhbXskbw+JDcZKCui2FP/xti6IAaTsU/
 9M30/VeOpD1UBqckLnDTGcsFif7hVZ9LOHH5wK8OctjyaTMfUYtPd7WxfTQCpcSc
 1M0NQapwfXHASmPmMW4SszAaeduecUdgXU1epOPx0EpOMQhvuLeENJBgVC5uu1cA
 KAQ7suOx9ReS3slso8lpTlTEw2rsDPRuiHcF3hv7YpklNXV2jvtxsl4upHT5VN2t
 3L59Unq8vY2lt3SdzWMypaAquphc3Te5woYoFwgsSPfw40Kr+jg+oUtO6IHqM/ga
 OATUkTffzp+Rp3pg036Y
 =gHBV
 -----END PGP SIGNATURE-----

Merge tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:

 - cpufreq regression fix from Bjørn Mork restoring the pre-3.12
   behavior of the framework during system suspend/hibernation to avoid
   garbage sysfs files from being left behind in case of a suspend error

 - PNP regression fix to restore the correct states of devices after
   resume from hibernation broken in 3.12.  From Dmitry Torokhov.

 - cpuidle fix to prevent cpuidle device unregistration from crashing
   due to a NULL pointer dereference if cpuidle has been disabled from
   the kernel command line.  From Konrad Rzeszutek Wilk.

 - intel_idle fix for the C6 state definition on Intel Avoton/Rangeley
   processors from Arne Bockholdt.

 - Power capping framework fix to make the energy_uj sysfs attribute
   work in accordance with the documentation.  From Srinivas Pandruvada.

 - epoll fix to make it ignore the EPOLLWAKEUP flag if the kernel has
   been compiled with CONFIG_PM_SLEEP unset (in which case that flag
   should not have any effect).  From Amit Pundir.

 - cpufreq fix to prevent governor sysfs files from being lost over
   system suspend/resume in some (arguably unusual) situations.  From
   Viresh Kumar.

* tag 'pm-3.13-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PowerCap: Fix mode for energy counter
  PNP: fix restoring devices after hibernation
  cpuidle: Check for dev before deregistering it.
  epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
  cpufreq: fix garbage kobjects on errors during suspend/resume
  cpufreq: suspend governors on system suspend/hibernate
  intel_idle: Fixed C6 state on Avoton/Rangeley processors
2013-12-05 18:26:40 -08:00
Linus Torvalds 5ee540613d Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series. It contains:

   - A fix for a use-after-free of a request in blk-mq.  From Ming Lei

   - A fix for a blk-mq bug that could attempt to dereference a NULL rq
     if allocation failed

   - Two xen-blkfront small fixes

   - Cleanup of submit_bio_wait() type uses in the kernel, unifying
     that.  From Kent

   - A fix for 32-bit blkg_rwstat reading.  I apologize for this one
     looking mangled in the shortlog, it's entirely my fault for missing
     an empty line between the description and body of the text"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: fix use-after-free of request
  blk-mq: fix dereference of rq->mq_ctx if allocation fails
  block: xen-blkfront: Fix possible NULL ptr dereference
  xen-blkfront: Silence pfn maybe-uninitialized warning
  block: submit_bio_wait() conversions
  Update of blkg_stat and blkg_rwstat may happen in bh context
2013-12-05 15:33:27 -08:00
Linus Torvalds 29be6345bb NFS client bugfixes
- Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSoLTpAAoJEGcL54qWCgDy2dgQAIKkKAXccg3OG2b1SxJmiaja
 PcrovNmgg3HvYQ7clUMqtrMByiXEpSybl6tAeXYUWE3sS1DISSBVEwO3MoOiASiM
 951Ssx+CoyhsHYo5aH83sUIiWFl/YsRhpKmSr2cdQd13DQTFbPq896k64Inf6L2/
 9fngoqOD7FunQHn8AiVPoDOQzObB0OuKhYCwuwLt47oPiwgmm12JQNCDxU1i4sxb
 lkGUBLkPMs6D5IyI8XHaMyX3+8MvmPiIsjIKaNJRdhkuX/k7ollucTJXyvyEQKK0
 PhBIWyUULmKcAXYwCfHf9UoyGZFvmj47YggyKcBd26OZUEFekcWrULfym46F1xak
 EcO6D4mlTy5i5W0RBqYCj1oGud57rixZBmhLTbeq6sSJaiqBfGEs225Q17H7rsEB
 YIghHiEFNnBmVWELhHxbJHQoY6HOugmZOuc0dxopaikN/7to8gnYoVyTIVlMfe/t
 UNXZoer6GOOohJGtZ7s7v4Al7EzvwnVnBCBklEAKFJ7Ca2LEmq+b58oQW3nJ1mPn
 y4TnihxYXsSEbqy+Lds9rumRhJLG1oVTpwficAm7N3HdK3abzCIPEt6iOHoCmXQz
 J1B4gmwOKsDqVlCSpBsnc3ZiBlSJGOn6MmVQUCNFpzv/DetWn/BxEUPE8cNm8DaI
 WioD0grC0/9bR8oD1m+w
 =UZ51
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Stable fix for a NFSv4.1 delegation and state recovery deadlock
 - Stable fix for a loop on irrecoverable errors when returning
   delegations
 - Fix a 3-way deadlock between layoutreturn, open, and state recovery
 - Update the MAINTAINERS file with contact information for Trond
   Myklebust
 - Close needs to handle NFS4ERR_ADMIN_REVOKED
 - Enabling v4.2 should not recompile nfsd and lockd
 - Fix a couple of compile warnings

* tag 'nfs-for-3.13-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  nfs: fix do_div() warning by instead using sector_div()
  MAINTAINERS: Update contact information for Trond Myklebust
  NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
  SUNRPC: do not fail gss proc NULL calls with EACCES
  NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
  NFSv4: Update list of irrecoverable errors on DELEGRETURN
  NFSv4 wait on recovery for async session errors
  NFS: Fix a warning in nfs_setsecurity
  NFS: Enabling v4.2 should not recompile nfsd and lockd
2013-12-05 13:05:48 -08:00
Helge Deller 3873d064b8 nfs: fix do_div() warning by instead using sector_div()
When compiling a 32bit kernel with CONFIG_LBDAF=n the compiler complains like
shown below.  Fix this warning by instead using sector_div() which is provided
by the kernel.h header file.

fs/nfs/blocklayout/extents.c: In function ‘normalize’:
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
fs/nfs/blocklayout/extents.c:47:13: note: in expansion of macro ‘do_div’
nfs/blocklayout/extents.c:47:2: warning: right shift count >= width of type [enabled by default]
fs/nfs/blocklayout/extents.c:47:2: warning: passing argument 1 of ‘__div64_32’ from incompatible pointer type [enabled by default]
include/asm-generic/div64.h:35:17: note: expected ‘uint64_t *’ but argument is of type ‘sector_t *’
 extern uint32_t __div64_32(uint64_t *dividend, uint32_t divisor);

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-12-04 12:57:37 -05:00
Trond Myklebust f22e5edd22 NFSv4.1: Prevent a 3-way deadlock between layoutreturn, open and state recovery
Andy Adamson reports:

The state manager is recovering expired state and recovery OPENs are being
processed. If kswapd is pruning inodes at the same time, a deadlock can occur
when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
resultant layoutreturn gets an error that the state mangager is to handle,
causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.

At the same time an open is waiting for the inode deletion to complete in
__wait_on_freeing_inode.

If the open is either the open called by the state manager, or an open from
the same open owner that is holding the NFSv4 sequence id which causes the
OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
then the state is deadlocked with kswapd.

The fix is simply to have layoutreturn ignore all errors except NFS4ERR_DELAY.
We already know that layouts are dropped on all server reboots, and that
it has to be coded to deal with the "forgetful client model" that doesn't
send layoutreturns.

Reported-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1385402270-14284-1-git-send-email-andros@netapp.com
Signed-off-by: Trond Myklebust <Trond.Myklebust@primarydata.com>
2013-12-04 12:32:19 -05:00
Linus Torvalds 278717909d Just a single bug fix to the new "directly decompress into the
page cache" code.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSnpD0AAoJEJAch/D1fbHUYFwP/iGUTnyqJYMLOyvDz4CjPCjr
 w0Pfncr78oXiKkOe/ExBVH5JHlkmSa2Lfmt28tiltPgfNQbqLQB9eSvjOzOg+c73
 KNqCAXXUp5a+Jhjlapo1iuyZEnOEPaJHUgiRmneFHzesE9eW6u86JntHR3wLbD8v
 Hw0Z5zkrPGB+M/AsX96dt7hBiz9HJI7IQrurtx2ijrpCBeXq30YNtBMjxzqoYx1C
 qcyMw/qGREObD3qlWKeBgmsKguwMmaygLaA2lUk73RcswKcISwkrUQN+BEqfP+Sc
 cXvXf1C2lG7SOLQH29Osa5Y8EGFnRQlFyTaIxLFWopl3Oks7emDJOrH4rW2SuPLt
 mEUOIVhZ0v6VhJ5uRmauyH4bQxBl0Uonc74bxRm602ThnUEIg2E6i+RBiy0tr/3Z
 UOQB2lACEIfz/jspeCcTTOlnR1r6fKp89XLF+OLKq7qJN3XQ2EtNKrfGHyEUVBxq
 GqLqJuTYbqxidJtx+PDZPW3AT71f4GSJGnL4Tdg13oHoJmLQagIRJCh6Gf5pMp1W
 za0g7jDHimHJbQW/jujDhVu9bLEoWhfq+jo/JW6x+0bblrm8GM4h1FFEqcVjh0SV
 UyG3QjMA4uMLbofB4MNNkgRhGC3o0L2Z3CuYVCNIz6fbQKP1tp6NqyVP89uJ1Br9
 VM3Hhw2TWhXxPYDsDYeg
 =OvYc
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs bugfix from Phillip Lougher:
 "Just a single bug fix to the new "directly decompress into the page
  cache" code"

* tag 'squashfs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: fix failure to unlock pages on decompress error
2013-12-04 08:54:00 -08:00
Jan Kara df4e7ac0bb ext2: Fix oops in ext2_get_block() called from ext2_quota_write()
ext2_quota_write() doesn't properly setup bh it passes to
ext2_get_block() and thus we hit assertion BUG_ON(maxblocks == 0) in
ext2_get_blocks() (or we could actually ask for mapping arbitrary number
of blocks depending on whatever value was on stack).

Fix ext2_quota_write() to properly fill in number of blocks to map.

CC: stable@vger.kernel.org # >= 2.6.12
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2013-12-04 12:26:51 +01:00
Eryu Guan 5946d08937 ext4: check for overlapping extents in ext4_valid_extent_entries()
A corrupted ext4 may have out of order leaf extents, i.e.

extent: lblk 0--1023, len 1024, pblk 9217, flags: LEAF UNINIT
extent: lblk 1000--2047, len 1024, pblk 10241, flags: LEAF UNINIT
             ^^^^ overlap with previous extent

Reading such extent could hit BUG_ON() in ext4_es_cache_extent().

	BUG_ON(end < lblk);

The problem is that __read_extent_tree_block() tries to cache holes as
well but assumes 'lblk' is greater than 'prev' and passes underflowed
length to ext4_es_cache_extent(). Fix it by checking for overlapping
extents in ext4_valid_extent_entries().

I hit this when fuzz testing ext4, and am able to reproduce it by
modifying the on-disk extent by hand.

Also add the check for (ee_block + len - 1) in ext4_valid_extent() to
make sure the value is not overflow.

Ran xfstests on patched ext4 and no regression.

Cc: Lukáš Czerner <lczerner@redhat.com>
Signed-off-by: Eryu Guan <guaneryu@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 21:22:21 -05:00
Junho Ryu 4e8d213980 ext4: fix use-after-free in ext4_mb_new_blocks
ext4_mb_put_pa should hold pa->pa_lock before accessing pa->pa_count.
While ext4_mb_use_preallocated checks pa->pa_deleted first and then
increments pa->count later, ext4_mb_put_pa decrements pa->pa_count
before holding pa->pa_lock and then sets pa->pa_deleted.

* Free sequence
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa

* Use sequence
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_use_preallocated (4):		increase pa->pa_count
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_release_context:	access pa

* Use-after-free sequence
[initial status]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (1):	iterate over preallocation
ext4_mb_use_preallocated (2):	lock pa->pa_lock
ext4_mb_use_preallocated (3):		check pa->pa_deleted
ext4_mb_put_pa (1):		atomic_dec_and_test pa->pa_count
[pa_count decremented]		<pa->pa_deleted = 0, pa_count = 0>
ext4_mb_use_preallocated (4):		increase pa->pa_count
[pa_count incremented]		<pa->pa_deleted = 0, pa_count = 1>
ext4_mb_use_preallocated (5):	unlock pa->pa_lock
ext4_mb_put_pa (2):		lock pa->pa_lock
ext4_mb_put_pa (3):			check pa->pa_deleted
ext4_mb_put_pa (4):			set pa->pa_deleted=1
[race condition!]		<pa->pa_deleted = 1, pa_count = 1>
ext4_mb_put_pa (5):		unlock pa->pa_lock
ext4_mb_put_pa (6):		remove pa from a list
ext4_mb_pa_callback:		free pa
ext4_mb_release_context:	access pa

AddressSanitizer has detected use-after-free in ext4_mb_new_blocks
Bug report: http://goo.gl/rG1On3

Signed-off-by: Junho Ryu <jayr@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
2013-12-03 18:10:28 -05:00
Amit Pundir 95f19f658c epoll: drop EPOLLWAKEUP if PM_SLEEP is disabled
Drop EPOLLWAKEUP from epoll events mask if CONFIG_PM_SLEEP is disabled.

Signed-off-by: Amit Pundir <amit.pundir@linaro.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2013-12-03 15:35:52 +01:00
Linus Torvalds b0d8d22921 vfs: fix subtle use-after-free of pipe_inode_info
The pipe code was trying (and failing) to be very careful about freeing
the pipe info only after the last access, with a pattern like:

        spin_lock(&inode->i_lock);
        if (!--pipe->files) {
                inode->i_pipe = NULL;
                kill = 1;
        }
        spin_unlock(&inode->i_lock);
        __pipe_unlock(pipe);
        if (kill)
                free_pipe_info(pipe);

where the final freeing is done last.

HOWEVER.  The above is actually broken, because while the freeing is
done at the end, if we have two racing processes releasing the pipe
inode info, the one that *doesn't* free it will decrement the ->files
count, and unlock the inode i_lock, but then still use the
"pipe_inode_info" afterwards when it does the "__pipe_unlock(pipe)".

This is *very* hard to trigger in practice, since the race window is
very small, and adding debug options seems to just hide it by slowing
things down.

Simon originally reported this way back in July as an Oops in
kmem_cache_allocate due to a single bit corruption (due to the final
"spin_unlock(pipe->mutex.wait_lock)" incrementing a field in a different
allocation that had re-used the free'd pipe-info), it's taken this long
to figure out.

Since the 'pipe->files' accesses aren't even protected by the pipe lock
(we very much use the inode lock for that), the simple solution is to
just drop the pipe lock early.  And since there were two users of this
pattern, create a helper function for it.

Introduced commit ba5bb14733 ("pipe: take allocation and freeing of
pipe_inode_info out of ->i_mutex").

Reported-by: Simon Kirby <sim@hostway.ca>
Reported-by: Ian Applegate <ia@cloudflare.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org   # v3.10+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-02 09:44:51 -08:00
Theodore Ts'o ae1495b12d ext4: call ext4_error_inode() if jbd2_journal_dirty_metadata() fails
While it's true that errors can only happen if there is a bug in
jbd2_journal_dirty_metadata(), if a bug does happen, we need to halt
the kernel or remount the file system read-only in order to avoid
further data loss.  The ext4_journal_abort_handle() function doesn't
do any of this, and while it's likely that this call (since it doesn't
adjust refcounts) will likely result in the file system eventually
deadlocking since the current transaction will never be able to close,
it's much cleaner to call let ext4's error handling system deal with
this situation.

There's a separate bug here which is that if certain jbd2 errors
errors occur and file system is mounted errors=continue, the file
system will probably eventually end grind to a halt as described
above.  But things have been this way in a long time, and usually when
we have these sorts of errors it's pretty much a disaster --- and
that's why the jbd2 layer aggressively retries memory allocations,
which is the most likely cause of these jbd2 errors.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2013-12-02 09:31:36 -05:00
Linus Torvalds b01537bfbc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs dentry reference count fix from Al Viro.

This fixes a possible inode_permission NULL pointer dereference (and
other problems) that were due to the root dentry count being decremented
too much.  In commit 48a066e72d ("RCU'd vfsmounts") the placement of
clearing the LOOKUP_RCU bit changed, and we then returned failure of
incrementing the lockref on the parent dentry with LOOKUP_RCU cleared.

But that meant we needed to go through the same cleanup routines that
the later failures did wrt LOOKUP_ROOT and nd->root.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fix bogus path_put() of nd->root after some unlazy_walk() failures
2013-11-29 09:27:19 -08:00
Al Viro d870b4a191 fix bogus path_put() of nd->root after some unlazy_walk() failures
Failure to grab reference to parent dentry should go through the
same cleanup as nd->seq mismatch.  As it is, we might end up with
caller thinking it needs to path_put() nd->root, with obvious
nasty results once we'd hit that bug enough times to drive the
refcount of root dentry all the way to zero...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-29 01:50:51 -05:00
Linus Torvalds 3bad8bb5cd Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
 "SMB3 "validate negotiate" is needed to prevent certain types of
  downgrade attacks.

  Also changes SMB2/SMB3 copy offload from using the BTRFS copy ioctl
  (BTRFS_IOC_CLONE) to a cifs specific ioctl (CIFS_IOC_COPYCHUNK_FILE)
  to address Christoph's comment that there are semantic differences
  between requesting copy offload in which copy-on-write is mandatory
  (as in the BTRFS ioctl) and optional in the SMB2/SMB3 case.  Also
  fixes SMB2/SMB3 copychunk for large files"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  [CIFS] Do not use btrfs refcopy ioctl for SMB2 copy offload
  Check SMB3 dialects against downgrade attacks
  Removed duplicated (and unneeded) goto
  CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files
2013-11-28 09:50:25 -08:00
Linus Torvalds f496863658 Driver core fixes for 3.13-rc2
Here are 3 patches for sysfs issues that have been reported.  Well, 1
 patch really, the first one is reverted as it's not really needed (the
 correct fix is coming in through the different driver subsystems
 instead.)  But that 1 sysfs fix is needed, so this is still a good thing
 to pull in now.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iEYEABECAAYFAlKWdcsACgkQMUfUDdst+yn9HgCgvXeP/GeK7Bt+1YhFIsrdRrNq
 7qsAnAvXOHh5FCn7h2Cw0yYb35kgUMQx
 =ZZSr
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core fixes from Greg KH:
 "Here are 3 patches for sysfs issues that have been reported.  Well, 1
  patch really, the first one is reverted as it's not really needed (the
  correct fix is coming in through the different driver subsystems
  instead)

  But that 1 sysfs fix is needed, so this is still a good thing to pull
  in now"

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

* tag 'driver-core-3.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  Revert "sysfs: handle duplicate removal attempts in sysfs_remove_group()"
  sysfs: use a separate locking class for open files depending on mmap
  sysfs: handle duplicate removal attempts in sysfs_remove_group()
2013-11-27 21:04:37 -08:00
Dave Jones dad337501d remove obsolete references to powertweak
This tool hasn't been maintained in over a decade, and is pretty much
useless these days.  Let's pretend it never happened.

Also remove a long-dead email address.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-27 20:34:32 -08:00
Greg Kroah-Hartman 81440e7374 Revert "sysfs: handle duplicate removal attempts in sysfs_remove_group()"
This reverts commit 54d71145a4.

The root cause of these "inverted" sysfs removals have now been found,
so there is no need for this patch.  Keep this functionality around so
that this type of error doesn't show up in driver code again.

Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-27 09:44:55 -08:00
Linus Torvalds 4f9e5df211 Merge branch 'for-linus-bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph bug-fixes from Sage Weil:
 "These include a couple fixes to the new fscache code that went in
  during the last cycle (which will need to go stable@ shortly as well),
  a couple client-side directory fragmentation fixes, a fix for a race
  in the cap release queuing path, and a couple race fixes in the
  request abort and resend code.

  Obviously some of this could have gone into 3.12 final, but I
  preferred to overtest rather than send things in for a late -rc, and
  then my travel schedule intervened"

* 'for-linus-bugs' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: allocate non-zero page to fscache in readpage()
  ceph: wake up 'safe' waiters when unregistering request
  ceph: cleanup aborted requests when re-sending requests.
  ceph: handle race between cap reconnect and cap release
  ceph: set caps count after composing cap reconnect message
  ceph: queue cap release in __ceph_remove_cap()
  ceph: handle frag mismatch between readdir request and reply
  ceph: remove outdated frag information
  ceph: hung on ceph fscache invalidate in some cases
2013-11-26 18:02:46 -08:00
Steve French f19e84df37 [CIFS] Do not use btrfs refcopy ioctl for SMB2 copy offload
Change cifs.ko to using CIFS_IOCTL_COPYCHUNK instead
of BTRFS_IOC_CLONE to avoid confusion about whether
copy-on-write is required or optional for this operation.

SMB2/SMB3 copyoffload had used the BTRFS_IOC_CLONE ioctl since
they both speed up copy by offloading the copy rather than
passing many read and write requests back and forth and both have
identical syntax (passing file handles), but for SMB2/SMB3
CopyChunk the server is not required to use copy-on-write
to make a copy of the file (although some do), and Christoph
has commented that since CopyChunk does not require
copy-on-write we should not reuse BTRFS_IOC_CLONE.

This patch renames the ioctl to use a cifs specific IOCTL
CIFS_IOCTL_COPYCHUNK.  This ioctl is particularly important
for SMB2/SMB3 since large file copy over the network otherwise
can be very slow, and with this is often more than 100 times
faster putting less load on server and client.

Note that if a copy syscall is ever introduced, depending on
its requirements/format it could end up using one of the other
three methods that CIFS/SMB2/SMB3 can do for copy offload,
but this method is particularly useful for file copy
and broadly supported (not just by Samba server).

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: David Disseldorp <ddiss@samba.org>
2013-11-25 09:50:31 -06:00
Kent Overstreet c170bbb45f block: submit_bio_wait() conversions
It was being open coded in a few places.

Signed-off-by: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joern Engel <joern@logfs.org>
Cc: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-24 16:33:41 -07:00
Phillip Lougher 6d56540950 Squashfs: fix failure to unlock pages on decompress error
Direct decompression into the page cache.  If we fall back
to using an intermediate buffer (because we cannot grab all the
page cache pages) and we get a decompress fail, we forgot to
release the pages.

Reported-by: Roman Peniaev <r.peniaev@gmail.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2013-11-24 01:02:50 +00:00
Li Wang ff638b7df5 ceph: allocate non-zero page to fscache in readpage()
ceph_osdc_readpages() returns number of bytes read, currently,
the code only allocate full-zero page into fscache, this patch
fixes this.

Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Reviewed-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:07 -08:00
Yan, Zheng fc55d2c944 ceph: wake up 'safe' waiters when unregistering request
We also need to wake up 'safe' waiters if error occurs or request
aborted. Otherwise sync(2)/fsync(2) may hang forever.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:05 -08:00
Yan, Zheng eb1b8af33c ceph: cleanup aborted requests when re-sending requests.
Aborted requests usually get cleared when the reply is received.
If MDS crashes, no reply will be received. So we need to cleanup
aborted requests when re-sending requests.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:04 -08:00
Yan, Zheng 99a9c273b9 ceph: handle race between cap reconnect and cap release
When a cap get released while composing the cap reconnect message.
We should skip queuing the release message if the cap hasn't been
added to the cap reconnect message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:02 -08:00
Yan, Zheng 44c99757fa ceph: set caps count after composing cap reconnect message
It's possible that some caps get released while composing the cap
reconnect message.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:01:01 -08:00
Yan, Zheng a096b09aee ceph: queue cap release in __ceph_remove_cap()
call __queue_cap_release() in __ceph_remove_cap(), this avoids
acquiring s_cap_lock twice.

Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
2013-11-23 11:00:59 -08:00
Tejun Heo 027a485d12 sysfs: use a separate locking class for open files depending on mmap
The following two commits implemented mmap support in the regular file
path and merged bin file support into the regular path.

 73d9714627 ("sysfs: copy bin mmap support from fs/sysfs/bin.c to fs/sysfs/file.c")
 3124eb1679 ("sysfs: merge regular and bin file handling")

After the merge, the following commands trigger a spurious lockdep
warning.  "test-mmap-read" simply mmaps the file and dumps the
content.

  $ cat /sys/block/sda/trace/act_mask
  $ test-mmap-read /sys/devices/pci0000\:00/0000\:00\:03.0/resource0 4096

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.12.0-work+ #378 Not tainted
  -------------------------------------------------------
  test-mmap-read/567 is trying to acquire lock:
   (&of->mutex){+.+.+.}, at: [<ffffffff8120a8df>] sysfs_bin_mmap+0x4f/0x120

  but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: [<ffffffff8114b399>] vm_mmap_pgoff+0x49/0xa0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&mm->mmap_sem){++++++}:
  ...
  -> #2 (sr_mutex){+.+.+.}:
  ...
  -> #1 (&bdev->bd_mutex){+.+.+.}:
  ...
  -> #0 (&of->mutex){+.+.+.}:
  ...

  other info that might help us debug this:

  Chain exists of:
   &of->mutex --> sr_mutex --> &mm->mmap_sem

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&mm->mmap_sem);
				 lock(sr_mutex);
				 lock(&mm->mmap_sem);
    lock(&of->mutex);

   *** DEADLOCK ***

  1 lock held by test-mmap-read/567:
   #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff8114b399>] vm_mmap_pgoff+0x49/0xa0

  stack backtrace:
  CPU: 3 PID: 567 Comm: test-mmap-read Not tainted 3.12.0-work+ #378
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
   ffffffff81ed41a0 ffff880009441bc8 ffffffff81611ad2 ffffffff81eccb80
   ffff880009441c08 ffffffff8160f215 ffff880009441c60 ffff880009c75208
   0000000000000000 ffff880009c751e0 ffff880009c75208 ffff880009c74ac0
  Call Trace:
   [<ffffffff81611ad2>] dump_stack+0x4e/0x7a
   [<ffffffff8160f215>] print_circular_bug+0x2b0/0x2bf
   [<ffffffff8109ca0a>] __lock_acquire+0x1a3a/0x1e60
   [<ffffffff8109d6ba>] lock_acquire+0x9a/0x1d0
   [<ffffffff81615547>] mutex_lock_nested+0x67/0x3f0
   [<ffffffff8120a8df>] sysfs_bin_mmap+0x4f/0x120
   [<ffffffff8115d363>] mmap_region+0x3b3/0x5b0
   [<ffffffff8115d8ae>] do_mmap_pgoff+0x34e/0x3d0
   [<ffffffff8114b3ba>] vm_mmap_pgoff+0x6a/0xa0
   [<ffffffff8115be3e>] SyS_mmap_pgoff+0xbe/0x250
   [<ffffffff81008282>] SyS_mmap+0x22/0x30
   [<ffffffff8161a4d2>] system_call_fastpath+0x16/0x1b

This happens because one file nests sr_mutex, which nests mm->mmap_sem
under it, under of->mutex while mmap implementation naturally nests
of->mutex under mm->mmap_sem.  The warning is false positive as
of->mutex is per open-file and the two paths belong to two different
files.  This warning didn't trigger before regular and bin file
supports were merged because only bin file supported mmap and the
other side of locking happened only on regular files which used
equivalent but separate locking.

It'd be best if we give separate locking classes per file but we can't
easily do that.  Let's differentiate on ->mmap() for now.  Later we'll
add explicit file operations struct and can add per-ops lockdep key
there.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-23 10:52:13 -08:00
Mika Westerberg 54d71145a4 sysfs: handle duplicate removal attempts in sysfs_remove_group()
Commit bcdde7e221 (sysfs: make __sysfs_remove_dir() recursive) changed
the behavior so that directory removals will be done recursively. This
means that the sysfs group might already be removed if its parent directory
has been removed.

The current code outputs warnings similar to following log snippet when it
detects that there is no group for the given kobject:

 WARNING: CPU: 0 PID: 4 at fs/sysfs/group.c:214 sysfs_remove_group+0xc6/0xd0()
 sysfs group ffffffff81c6f1e0 not found for kobject 'host7'
 Modules linked in:
 CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 3.12.0+ #13
 Hardware name:                  /D33217CK, BIOS GKPPT10H.86A.0042.2013.0422.1439 04/22/2013
 Workqueue: kacpi_hotplug acpi_hotplug_work_fn
  0000000000000009 ffff8801002459b0 ffffffff817daab1 ffff8801002459f8
  ffff8801002459e8 ffffffff810436b8 0000000000000000 ffffffff81c6f1e0
  ffff88006d440358 ffff88006d440188 ffff88006e8b4c28 ffff880100245a48
 Call Trace:
  [<ffffffff817daab1>] dump_stack+0x45/0x56
  [<ffffffff810436b8>] warn_slowpath_common+0x78/0xa0
  [<ffffffff81043727>] warn_slowpath_fmt+0x47/0x50
  [<ffffffff811ad319>] ? sysfs_get_dirent_ns+0x49/0x70
  [<ffffffff811ae526>] sysfs_remove_group+0xc6/0xd0
  [<ffffffff81432f7e>] dpm_sysfs_remove+0x3e/0x50
  [<ffffffff8142a0d0>] device_del+0x40/0x1b0
  [<ffffffff8142a24d>] device_unregister+0xd/0x20
  [<ffffffff8144131a>] scsi_remove_host+0xba/0x110
  [<ffffffff8145f526>] ata_host_detach+0xc6/0x100
  [<ffffffff8145f578>] ata_pci_remove_one+0x18/0x20
  [<ffffffff812e8f48>] pci_device_remove+0x28/0x60
  [<ffffffff8142d854>] __device_release_driver+0x64/0xd0
  [<ffffffff8142d8de>] device_release_driver+0x1e/0x30
  [<ffffffff8142d257>] bus_remove_device+0xf7/0x140
  [<ffffffff8142a1b1>] device_del+0x121/0x1b0
  [<ffffffff812e43d4>] pci_stop_bus_device+0x94/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e437b>] pci_stop_bus_device+0x3b/0xa0
  [<ffffffff812e44dd>] pci_stop_and_remove_bus_device+0xd/0x20
  [<ffffffff812fc743>] trim_stale_devices+0x73/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fc78b>] trim_stale_devices+0xbb/0xe0
  [<ffffffff812fcb6e>] acpiphp_check_bridge+0x7e/0xd0
  [<ffffffff812fd90d>] hotplug_event+0xcd/0x160
  [<ffffffff812fd9c5>] hotplug_event_work+0x25/0x60
  [<ffffffff81316749>] acpi_hotplug_work_fn+0x17/0x22
  [<ffffffff8105cf3a>] process_one_work+0x17a/0x430
  [<ffffffff8105db29>] worker_thread+0x119/0x390
  [<ffffffff8105da10>] ? manage_workers.isra.25+0x2a0/0x2a0
  [<ffffffff81063a5d>] kthread+0xcd/0xf0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180
  [<ffffffff817eb33c>] ret_from_fork+0x7c/0xb0
  [<ffffffff81063990>] ? kthread_create_on_node+0x180/0x180

On this particular machine I see ~16 of these message during Thunderbolt
hot-unplug.

Fix this in similar way that was done for sysfs_remove_one() by checking
if the parent directory has already been removed and bailing out early.

Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-23 10:52:13 -08:00
Linus Torvalds 57498f9cb9 Quiet static checkers by removing unneeded conditionals
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.14 (GNU/Linux)
 
 iQIcBAABCgAGBQJSj6OkAAoJENaSAD2qAscK9ukP/Ar6Lracra9IViBvFRIgBGnf
 K0jUwF0atd9Uj+E7phUzG2cfTKHqCVHYHfP/tCSurjjQnb9TpQ1ZkFy22VGrlDYK
 jrU39wOfQQCQX+pPmJIX2L3W5ZAEFbBMgJZ7PaIM1qDV9cjy+mt/iNHJrH0BHBH4
 HryvCMeALRR5elwYkhfTql4VN2ClS9veAWdcGhAsbZ8vaNB01x+9Wps+t/VPiCVp
 +K3C+mC/LennTzdmDJUIbMFbKSsTjrdynRtuPLeSgVAA6Y05ps5vf+hoNmiLmphJ
 z1j0HCht1FApM8VrllRmJla8vo6DRoqn+DUqRRuoI7kE5zFS61x5EIpogGdqmhn/
 9zYA9xdj7Fpw/57G3xSpUdpS5u0JsxETc580iitl1tDgEHcQEgmRSCa89uhaaDFi
 pSKgV7au5DqJdBtTDvWeQPAKrkT9LWOsrlEXX5aqM8wIs8FheIPZMEz11m6vMp6n
 e9geohCvpLL0opUx209ofpYHpYLS6IXY/LrQeHSYeZ2DYGyDZ7R/KHt8GN1NwLrE
 AUJTOzdq3H37ZDlmkcTEO9z84qfgaHioiHGlzYcCfzR6aw6cey1jOj8Y9lg0gciQ
 nWuWXtES0UtvpR9w84nbXCpgOrQMYpGOhqvduDALa+7XhbDhpFtmN9e4f9CfTumv
 PYQ9hYRKD3w6iWsoYubd
 =IgXu
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.13-rc1-quiet-checkers' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull minor eCryptfs fix from Tyler Hicks:
 "Quiet static checkers by removing unneeded conditionals"

* tag 'ecryptfs-3.13-rc1-quiet-checkers' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  eCryptfs: file->private_data is always valid
2013-11-22 10:58:14 -08:00
Linus Torvalds d0f278c1dd Merge git://git.kvack.org/~bcrl/aio-next
Pull aio fixes from Benjamin LaHaise.

* git://git.kvack.org/~bcrl/aio-next:
  aio: nullify aio->ring_pages after freeing it
  aio: prevent double free in ioctx_alloc
  aio: Fix a trinity splat
2013-11-22 08:42:14 -08:00
Linus Torvalds 533db9b3d4 Merge branch 'for-3.13' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
 "A couple nfsd bugfixes"

* 'for-3.13' of git://linux-nfs.org/~bfields/linux:
  nfsd4: fix xdr decoding of large non-write compounds
  nfsd: make sure to balance get/put_write_access
  nfsd: split up nfsd_setattr
2013-11-22 08:41:17 -08:00
Linus Torvalds c85e07278e A couple of small, but important bug fixes for GFS2. The first one
fixes a possible NULL pointer dereference, and the second one
 resolves a reference counting issue in one of the lesser used paths
 through atomic_open.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJSjy+IAAoJEMrg3m4a/8jSy4wP+wXU26Hbj0nIkzxEokxgQLO/
 64flbmRx/960CAEFiFWhrSQUIpb2R2Lw1tNGiRiAhch4wLmRrKMEZnR/dI57KaMR
 /dcm78q2aPbNvdB/5pxrTQx8bkRYcVOCZdD7lizUUHs/oLIhl0IZ4uGSOy0kUdTd
 gcgus1Rd7mPJp+Smr33W7rPB6WAfo6+iAn7RNcdxgRcH2Ng8BZm82Z4Gn8TDN3H1
 cZ8yzFbycLvj67kpkdEesjaDTQv/zWrtNg/ocVDGbPRxykzf89v94LpodZkpvRAz
 vvCT7MNy9R8konDcGImSV6zRstYFtQwg8hqzIvSgla6yppghpGXRp9c7PpPDNzc9
 ynVvZqc0MbZeTlGt5/DQW6QmFKwgJcT/tJSaDdx1NRoHQTOXzSM9hw5it048Hm8L
 AUrsAvFJtTtm7Uwd7CE78IvrOQWNl0EAgw20twKR1GR6dnGmvvaVFrLXSacCNsFm
 VkOs6QLRoax6DbjHgC0ERka5J7YcGniNff+vx2ndRkth3KWv3N8fqh43mBtuQcRN
 yzAffJMXYGWWBHvaXR71dJ35Z4lFbMfKU+oPZokucl9c1dnFJ26cqVKMlhD+hUGK
 ms0qVDiRf4C8qhIyWor1FqWCGeFlXwtTzgGtoovrOoWJxqp6vLdwFjEEPKb1Tlai
 pBmD16Z3ehx8I76Yzgjx
 =aMoW
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes

Pull GFS2 fixes from Steven Whitehouse:
 "A couple of small, but important bug fixes for GFS2.  The first one
  fixes a possible NULL pointer dereference, and the second one resolves
  a reference counting issue in one of the lesser used paths through
  atomic_open"

* tag 'gfs2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
  GFS2: Fix ref count bug relating to atomic_open
  GFS2: fix potential NULL pointer dereference
2013-11-22 08:39:44 -08:00
Linus Torvalds fb0d1eb892 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Almost all of these are bug fixes.  Dave Sterba's documentation update
  is the big exception because he removed our promises to set any
  machine running Btrfs on fire"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Documentation: filesystems: update btrfs tools section
  Documentation: filesystems: add new btrfs mount options
  btrfs: update kconfig help text
  btrfs: fix bio_size_ok() for max_sectors > 0xffff
  btrfs: Use trace condition for get_extent tracepoint
  btrfs: fix typo in the log message
  Btrfs: fix list delete warning when removing ordered root from the list
  Btrfs: print bytenr instead of page pointer in check-int
  Btrfs: remove dead codes from ctree.h
  Btrfs: don't wait for ordered data outside desired range
  Btrfs: fix lockdep error in async commit
  Btrfs: avoid heavy operations in btrfs_commit_super
  Btrfs: fix __btrfs_start_workers retval
  Btrfs: disable online raid-repair on ro mounts
  Btrfs: do not inc uncorrectable_errors counter on ro scrubs
  Btrfs: only drop modified extents if we logged the whole inode
  Btrfs: make sure to copy everything if we rename
  Btrfs: don't BUG_ON() if we get an error walking backrefs
2013-11-22 08:38:55 -08:00
Linus Torvalds 6ea9786e76 xfs: update #2 for v3.13-rc1
Here we have a performance fix for inode iversion, increased inode cluster size
 for v5 superblock filesystems, a fix for error handling in
 xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.10 (GNU/Linux)
 
 iQIcBAABAgAGBQJSjjbHAAoJENaLyazVq6ZO/l4QAIajqdOy56MDDCaE1ocNRMej
 oPXqxMZpi+YAFRXIWQK/evjXjYE4hcCjiLI0tAKzQM0FGRHd1GgQQJvWKjvHBrLG
 rlrHWDcKq+3bsJ6KIw+JvLnhsOxxKXbRovIG1PI4frOeFtS3DCtzMZ4FGBErh+BV
 PkFjf5Lxe7VnegDL4eNpkAfSM/2/pEtn2gIMMj8eeKemTPAbDplMAk4UUxp18R7F
 FCr6mOKETWthHdBgkLB1xA6OVGUuYcleWvluFc5PONJN1VPSutEHJaRw01lY1VLY
 4COUt7MjLAqlAu24LTD1aNszhUlajYq0AL+nmd4gULZI2fpu+meUIGCHQG4BpVZl
 ds9isn80SmKiLT4ZQCbJAv4XMEXn6p41+uxdKP6ZkYXso4zJmVw0TyGLg/ZOJpw0
 9mNcaJG1s57ronN07dMCSMqsyNFLuLtX7+mVa5liO3sEkXv7hEK7EB9qojQ9Qn/p
 xC2r4jrE0//xgcbOE+uKOyIad3L0IBM6TXuy58xVPi5l0dpM69z4LCyUGZ+L1G8x
 0QhkA7NL3xI2tNgehzJBsBuxjtEePtm/jCIS1HfDSIRQZR3y+PxVEjgO4naS4vm2
 xKwo5dBodsxgXeSMUY3miMuEX7ZQse+xVDK1q0Zk4VXZyC2+erNS5hxs313yzeLD
 7X2ZESfIJaMrkLSKOVUe
 =w0M3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs

Pull second xfs update from Ben Myers:
 "There are a couple of patches that I wasn't quite sure about in time
  for our initial 3.13 pull request, a bugfix, and an update to add Dave
  to MAINTAINERS:

  Here we have a performance fix for inode iversion, increased inode
  cluster size for v5 superblock filesystems, a fix for error handling
  in xfs_bmap_add_attrfork, and a MAINTAINERS update to add Dave"

* tag 'xfs-for-linus-v3.13-rc1-2' of git://oss.sgi.com/xfs/xfs:
  xfs: open code inc_inode_iversion when logging an inode
  xfs: increase inode cluster size for v5 filesystems
  xfs: fix unlock in xfs_bmap_add_attrfork
  xfs: update maintainers
2013-11-22 08:37:47 -08:00
Linus Torvalds a5d6e63323 Merge branch 'akpm' (fixes from Andrew)
Merge patches from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: place page->pmd_huge_pte to right union
  MAINTAINERS: add keyboard driver to Hyper-V file list
  x86, mm: do not leak page->ptl for pmd page tables
  ipc,shm: correct error return value in shmctl (SHM_UNLOCK)
  mm, mempolicy: silence gcc warning
  block/partitions/efi.c: fix bound check
  ARM: drivers/rtc/rtc-at91rm9200.c: disable interrupts at shutdown
  mm: hugetlbfs: fix hugetlbfs optimization
  kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS cleanly
  ipc,shm: fix shm_file deletion races
  mm: thp: give transparent hugepage code a separate copy_page
  checkpatch: fix "Use of uninitialized value" warnings
  configfs: fix race between dentry put and lookup
2013-11-21 21:32:38 -08:00
Linus Torvalds 3eaded86ac Merge git://git.infradead.org/users/eparis/audit
Pull audit updates from Eric Paris:
 "Nothing amazing.  Formatting, small bug fixes, couple of fixes where
  we didn't get records due to some old VFS changes, and a change to how
  we collect execve info..."

Fixed conflict in fs/exec.c as per Eric and linux-next.

* git://git.infradead.org/users/eparis/audit: (28 commits)
  audit: fix type of sessionid in audit_set_loginuid()
  audit: call audit_bprm() only once to add AUDIT_EXECVE information
  audit: move audit_aux_data_execve contents into audit_context union
  audit: remove unused envc member of audit_aux_data_execve
  audit: Kill the unused struct audit_aux_data_capset
  audit: do not reject all AUDIT_INODE filter types
  audit: suppress stock memalloc failure warnings since already managed
  audit: log the audit_names record type
  audit: add child record before the create to handle case where create fails
  audit: use given values in tty_audit enable api
  audit: use nlmsg_len() to get message payload length
  audit: use memset instead of trying to initialize field by field
  audit: fix info leak in AUDIT_GET requests
  audit: update AUDIT_INODE filter rule to comparator function
  audit: audit feature to set loginuid immutable
  audit: audit feature to only allow unsetting the loginuid
  audit: allow unsetting the loginuid (with priv)
  audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE
  audit: loginuid functions coding style
  selinux: apply selinux checks on new audit message types
  ...
2013-11-21 19:18:14 -08:00
Junxiao Bi 76ae281f63 configfs: fix race between dentry put and lookup
A race window in configfs, it starts from one dentry is UNHASHED and end
before configfs_d_iput is called.  In this window, if a lookup happen,
since the original dentry was UNHASHED, so a new dentry will be
allocated, and then in configfs_attach_attr(), sd->s_dentry will be
updated to the new dentry.  Then in configfs_d_iput(),
BUG_ON(sd->s_dentry != dentry) will be triggered and system panic.

sys_open:                     sys_close:
 ...                           fput
                                dput
                                 dentry_kill
                                  __d_drop <--- dentry unhashed here,
                                           but sd->dentry still point
                                           to this dentry.

 lookup_real
  configfs_lookup
   configfs_attach_attr---> update sd->s_dentry
                            to new allocated dentry here.

                                   d_kill
                                     configfs_d_iput <--- BUG_ON(sd->s_dentry != dentry)
                                                     triggered here.

To fix it, change configfs_d_iput to not update sd->s_dentry if
sd->s_count > 2, that means there are another dentry is using the sd
beside the one that is going to be put.  Use configfs_dirent_lock in
configfs_attach_attr to sync with configfs_d_iput.

With the following steps, you can reproduce the bug.

1. enable ocfs2, this will mount configfs at /sys/kernel/config and
   fill configure in it.

2. run the following script.
	while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &
	while [ 1 ]; do cat /sys/kernel/config/cluster/$your_cluster_name/idle_timeout_ms > /dev/null; done &

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-21 16:42:27 -08:00
Steven Whitehouse ea0341e071 GFS2: Fix ref count bug relating to atomic_open
In the case that atomic_open calls finish_no_open() with
the dentry that was supplied to gfs2_atomic_open() an
extra reference count is required. This patch fixes that
issue preventing a bug trap triggering at umount time.

Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-11-21 18:47:57 +00:00
Michal Nazarewicz e3c4269d13 GFS2: fix potential NULL pointer dereference
Commit [e66cf1610: GFS2: Use lockref for glocks] replaced call:
    atomic_read(&gi->gl->gl_ref) == 0
with:
    __lockref_is_dead(&gl->gl_lockref)
therefore changing how gl is accessed, from gi->gl to plan gl.
However, gl can be a NULL pointer, and so gi->gl needs to be
used instead (which is guaranteed not to be NULL because fo
the while loop checking that condition).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2013-11-21 09:55:45 +00:00
David Sterba 4204617d14 btrfs: update kconfig help text
Reflect the current status. Portions of the text taken from the
wiki pages.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:49:09 -05:00
Akinobu Mita 475bf36ffb btrfs: fix bio_size_ok() for max_sectors > 0xffff
The data type of max_sectors in queue settings is unsigned int.  But
this value is stored to the local variable whose type is unsigned short
in bio_size_ok().  This can cause unexpected result when max_sectors >
0xffff.

Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:48:44 -05:00
Steven Rostedt 4cd8587ce8 btrfs: Use trace condition for get_extent tracepoint
Doing an if statement to test some condition to know if we should
trigger a tracepoint is pointless when tracing is disabled. This just
adds overhead and wastes a branch prediction. This is why the
TRACE_EVENT_CONDITION() was created. It places the check inside the jump
label so that the branch does not happen unless tracing is enabled.

That is, instead of doing:

	if (em)
		trace_btrfs_get_extent(root, em);

Which is basically this:

	if (em)
		if (static_key(trace_btrfs_get_extent)) {

Using a TRACE_EVENT_CONDITION() we can just do:

	trace_btrfs_get_extent(root, em);

And the condition trace event will do:

	if (static_key(trace_btrfs_get_extent)) {
		if (em) {
			...

The static key is a non conditional jump (or nop) that is faster than
having to check if em is NULL or not.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:47 -05:00
Anand Jain 52a1575921 btrfs: fix typo in the log message
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:47 -05:00
Miao Xie 931aa87791 Btrfs: fix list delete warning when removing ordered root from the list
Commit b02441999e "Btrfs: don't wait for
the completion of all the ordered extents" introduced a bug that broke
the ordered root list:
 WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98()

It is because we forgot to return the roots in the splice list to the
ordered list of the fs. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:46 -05:00
Stefan Behrens 56d140f5f6 Btrfs: print bytenr instead of page pointer in check-int
The page pointer information was useless. The bytenr is what you
want when you search for submitted write bios.

Additionally, a new bit in the print mask is added that allows
to selectively enable the check-int submit_bio verbose mode. Before,
the global verbose mode had to be enabled leading to many million
useless lines in the kernel log.

And a comment is added that explains that LOG_BUF_SHIFT needs to
be set to a really high value.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:46 -05:00
Wang Shilong 9650e05c07 Btrfs: remove dead codes from ctree.h
These two functions are only stated but undefined.

Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:45 -05:00
Filipe David Borba Manana b52abf1e3b Btrfs: don't wait for ordered data outside desired range
In btrfs_wait_ordered_range(), if we found an extent to the left
of the start of our desired wait range and the last byte of that
extent is 1 less than the desired range's start, we would would
wait for the IO completion of that extent unnecessarily.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:45 -05:00
Liu Bo b1a06a4b57 Btrfs: fix lockdep error in async commit
Lockdep complains about btrfs's async commit:

[ 2372.462171] [ BUG: bad unlock balance detected! ]
[ 2372.462191] 3.12.0+ #32 Tainted: G        W
[ 2372.462209] -------------------------------------
[ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at:
[ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462305] but there are no more locks to release!
[ 2372.462324]
[ 2372.462324] other info that might help us debug this:
[ 2372.462349] no locks held by ceph-osd/14048.
[ 2372.462367]
[ 2372.462367] stack backtrace:
[ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G        W    3.12.0+ #32
[ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015  11/09/2011
[ 2372.462455]  ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320
[ 2372.462491]  ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650
[ 2372.462526]  ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000
[ 2372.462562] Call Trace:
[ 2372.462584]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462619]  [<ffffffff816f094a>] dump_stack+0x45/0x56
[ 2372.462642]  [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100
[ 2372.462677]  [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs]
[ 2372.462710]  [<ffffffff810b01ee>] lock_release+0x18e/0x210
[ 2372.462742]  [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs]
[ 2372.462783]  [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs]
[ 2372.462822]  [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs]
[ 2372.462849]  [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0
[ 2372.462873]  [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0
[ 2372.462897]  [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100
[ 2372.462922]  [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0
[ 2372.462946]  [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0
[ 2372.462969]  [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0
[ 2372.462991]  [<ffffffff817045a4>] tracesys+0xdd/0xe2

====================================================

It's because that we don't do the right thing when checking if it's ok to
tell lockdep that we're trying to release the rwsem.

If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but
as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze
rwsem, which makes lockdep complains a lot.

Reported-by: Ma Jianpeng <majianpeng@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:44:44 -05:00
Liu Bo d52c1bcc64 Btrfs: avoid heavy operations in btrfs_commit_super
The 'git blame' history shows that, the old transaction commit code has to do
twice to ensure roots are updated and we have to flush metadata and super block
manually, however, right now all of these can be handled well inside
the transaction commit code without extra efforts.

And the error handling part remains same with the current code, -- 'return to
caller once we get error'.

This saves us a transaction commit and a flush of super block, which are both
heavy operations according to ftrace output analysis.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:16 -05:00
Ilya Dryomov ba69994a40 Btrfs: fix __btrfs_start_workers retval
__btrfs_start_workers returns 0 in case it raced with
btrfs_stop_workers and lost the race.  This is wrong because worker in
this case is not allowed to start and is in fact destroyed.  Return
-EINVAL instead.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:11 -05:00
Ilya Dryomov 908960c6c0 Btrfs: disable online raid-repair on ro mounts
This disables the "if needed, write the good copy back before the read
is completed" part of the read sequence for read-only mounts.

Cc: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:42:05 -05:00
Ilya Dryomov 33ef30add1 Btrfs: do not inc uncorrectable_errors counter on ro scrubs
Currently if we discover an error when scrubbing in ro mode we a)
blindly increment the uncorrectable_errors counter, and b) spam the
dmesg with the 'unable to fixup (regular) error at ...' message, even
though a) we haven't tried to determine if the error is correctable or
not, and b) we haven't tried to fixup anything.  Fix this.

Cc: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:38 -05:00
Josef Bacik d006a04816 Btrfs: only drop modified extents if we logged the whole inode
If we fsync, seek and write, rename and then fsync again we will lose the
modified hole extent because the rename will drop all of the modified extents
since we didn't do the fast search.  We need to only drop the modified extents
if we didn't do the fast search and we were logging the entire inode as we don't
need them anymore, otherwise this is being premature.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:32 -05:00
Josef Bacik 6cfab851f4 Btrfs: make sure to copy everything if we rename
If we rename a file that is already in the log and we fsync again we will lose
the new name.  This is because we just log the inode update and not the new ref.
To fix this we just need to check if we are logging the new name of the inode
and copy all the metadata instead of just updating the inode itself.  With this
patch my testcase now passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:24 -05:00
Josef Bacik 4724b106b9 Btrfs: don't BUG_ON() if we get an error walking backrefs
We can just return false for this so we stop doing the snapshot aware defrag
stuff.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-11-20 20:41:16 -05:00
Linus Torvalds af2e2f3280 These patches optionally improve the multi-threading peformance
of Squashfs by adding parallel decompression, and direct
 decompression into the page cache, eliminating an intermediate
 buffer (removing memcpy overhead and lock contention).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJSjP25AAoJEJAch/D1fbHUsFAQAJjpCfWyv7JfNtJSUk20UgbC
 kQvpMUbISwrLnszW6ooWBJxQ4OCyQ3AWN5yrAk6hb86oR0SN33WAJjqW5hR5htOy
 5ZMs3OnzE3haUej+Xxw/VTK61FoOq/PjK8UZ6NBfhdnihfE/fQykrgDhHznJ68iq
 0wWeqTY68sF5ZBQ2kKhBSfF+lGlWeLqhiFGOq68MAv4Rd8dGZsiLIFG7JQsrwmAn
 cswmiCTQppGGz/+6FBWvaEEpD+nUCX/h/1XKwMhuzWwZZTFWPM+BkEfMOKv78txW
 GWn/o1/kWA/u1f5V+nZlUhNtj+KCU11YZfTAJ30Ie1erzKCh8DGcLhCyfC0N+Hw/
 Na5vxyjEnTdJoBnRbcPpHcGwPB0J5Q2nCzu1b/3blUGdpXQrNp/zZ4hg53fYEKHy
 2KAf9j5rqs85IqoKwrzeod/V1WakjMQJPntoJ2r7ILP4lKfvOHt6m1D5/7tVodxZ
 mGa8eaQtH5SrtnLldKo4vGgh65/ViQ2cVlAbGC7I9rfXJ0fITYO8PvKBTcXvtOHc
 +rjCnoOHtSv8FvFf1G9qfbBMwaC+3n95rYSac0Ibl6O7x2pdQusUmiuyUf1NXsDg
 V4ENspn/DTrltZbTbBTgI3LizxvJOMtf72zo+Bhghitp09yeIFfieVqM/kuR74Ym
 O4EaVGcFdXJJc3UmK/69
 =lV/M
 -----END PGP SIGNATURE-----

Merge tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

Pull squashfs updates from Phillip Lougher:
 "These patches optionally improve the multi-threading peformance of
  Squashfs by adding parallel decompression, and direct decompression
  into the page cache, eliminating an intermediate buffer (removing
  memcpy overhead and lock contention)"

* tag 'squashfs-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
  Squashfs: Check stream is not NULL in decompressor_multi.c
  Squashfs: Directly decompress into the page cache for file data
  Squashfs: Restructure squashfs_readpage()
  Squashfs: Generalise paging handling in the decompressors
  Squashfs: add multi-threaded decompression using percpu variable
  squashfs: Enhance parallel I/O
  Squashfs: Refactor decompressor interface and code
2013-11-20 14:51:37 -08:00
Linus Torvalds b5898cd057 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs bits and pieces from Al Viro:
 "Assorted bits that got missed in the first pull request + fixes for a
  couple of coredump regressions"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fold try_to_ascend() into the sole remaining caller
  dcache.c: get rid of pointless macros
  take read_seqbegin_or_lock() and friends to seqlock.h
  consolidate simple ->d_delete() instances
  gfs2: endianness misannotations
  dump_emit(): use __kernel_write(), not vfs_write()
  dump_align(): fix the dumb braino
2013-11-20 14:25:39 -08:00
Linus Torvalds 5a1efc6e68 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block IO fixes from Jens Axboe:
 "Normally I'd defer my initial for-linus pull request until after the
  merge window, but a race was uncovered in the virtio-blk conversion to
  blk-mq that could cause hangs.  So here's a small collection of fixes
  for you to pull:

   - The fix for the virtio-blk IO hang reported by Dave Chinner, from
     Shaohua and myself.

   - Add the Insert blktrace event for blk-mq.  This makes 'btt' happy
     when it is doing it's state transition analysis.

   - Ensure that blk-mq has disk/partition stats enabled by default,
     instead of making it opt-in.

   - A fix for __bio_add_page() and large sector counts"

* 'for-linus' of git://git.kernel.dk/linux-block:
  blk-mq: add blktrace insert event trace
  virtio-blk: virtqueue_kick() must be ordered with other virtqueue operations
  blk-mq: ensure that we set REQ_IO_STAT so diskstats work
  bio: fix argument of __bio_add_page() for max_sectors > 0xffff
2013-11-20 13:06:20 -08:00
Trond Myklebust 69794ad70c NFSv4: close needs to handle NFS4ERR_ADMIN_REVOKED
Also ensure that we zero out the stateid mode when exiting

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-20 15:55:39 -05:00
Trond Myklebust c97cf606e4 NFSv4: Update list of irrecoverable errors on DELEGRETURN
If the DELEGRETURN errors out with something like NFS4ERR_BAD_STATEID
then there is no recovery possible. Just quit without returning an error.

Also, note that the client must not assume that the NFSv4 lease has been
renewed when it sees an error on DELEGRETURN.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
2013-11-20 15:54:27 -05:00
Andy Adamson 4a82fd7c4e NFSv4 wait on recovery for async session errors
When the state manager is processing the NFS4CLNT_DELEGRETURN flag, session
draining is off, but DELEGRETURN can still get a session error.
The async handler calls nfs4_schedule_session_recovery returns -EAGAIN, and
the DELEGRETURN done then restarts the RPC task in the prepare state.
With the state manager still processing the NFS4CLNT_DELEGRETURN flag with
session draining off, these DELEGRETURNs will cycle with errors filling up the
session slots.

This prevents OPEN reclaims (from nfs_delegation_claim_opens) required by the
NFS4CLNT_DELEGRETURN state manager processing from completing, hanging the
state manager in the __rpc_wait_for_completion_task in nfs4_run_open_task
as seen in this kernel thread dump:

kernel: 4.12.32.53-ma D 0000000000000000     0  3393      2 0x00000000
kernel: ffff88013995fb60 0000000000000046 ffff880138cc5400 ffff88013a9df140
kernel: ffff8800000265c0 ffffffff8116eef0 ffff88013fc10080 0000000300000001
kernel: ffff88013a4ad058 ffff88013995ffd8 000000000000fbc8 ffff88013a4ad058
kernel: Call Trace:
kernel: [<ffffffff8116eef0>] ? cache_alloc_refill+0x1c0/0x240
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffffa0358152>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
kernel: [<ffffffff8152914f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffff815291f8>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa035810d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
kernel: [<ffffffffa040d44c>] nfs4_run_open_task+0x11c/0x160 [nfs]
kernel: [<ffffffffa04114e7>] nfs4_open_recover_helper+0x87/0x120 [nfs]
kernel: [<ffffffffa0411646>] nfs4_open_recover+0xc6/0x150 [nfs]
kernel: [<ffffffffa040cc6f>] ? nfs4_open_recoverdata_alloc+0x2f/0x60 [nfs]
kernel: [<ffffffffa0414e1a>] nfs4_open_delegation_recall+0x6a/0xa0 [nfs]
kernel: [<ffffffffa0424020>] nfs_end_delegation_return+0x120/0x2e0 [nfs]
kernel: [<ffffffff8109580f>] ? queue_work+0x1f/0x30
kernel: [<ffffffffa0424347>] nfs_client_return_marked_delegations+0xd7/0x110 [nfs]
kernel: [<ffffffffa04225d8>] nfs4_run_state_manager+0x548/0x620 [nfs]
kernel: [<ffffffffa0422090>] ? nfs4_run_state_manager+0x0/0x620 [nfs]
kernel: [<ffffffff8109b0f6>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109b060>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

The state manager can not therefore process the DELEGRETURN session errors.
Change the async handler to wait for recovery on session errors.

Signed-off-by: Andy Adamson <andros@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-20 15:54:08 -05:00
Steve French ff1c038add Check SMB3 dialects against downgrade attacks
When we are running SMB3 or SMB3.02 connections which are signed
we need to validate the protocol negotiation information,
to ensure that the negotiate protocol response was not tampered with.

Add the missing FSCTL which is sent at mount time (immediately after
the SMB3 Tree Connect) to validate that the capabilities match
what we think the server sent.

"Secure dialect negotiation is introduced in SMB3 to protect against
man-in-the-middle attempt to downgrade dialect negotiation.
The idea is to prevent an eavesdropper from downgrading the initially
negotiated dialect and capabilities between the client and the server."

For more explanation see 2.2.31.4 of MS-SMB2 or
http://blogs.msdn.com/b/openspecification/archive/2012/06/28/smb3-secure-dialect-negotiation.aspx

Reviewed-by: Pavel Shilovsky <piastry@etersoft.ru>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-11-19 23:52:54 -06:00
Phillip Lougher ed4f381ec1 Squashfs: Check stream is not NULL in decompressor_multi.c
Fix static checker complaint that stream is not checked in
squashfs_decompressor_destroy().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
2013-11-20 03:59:20 +00:00
Phillip Lougher 0d455c12c6 Squashfs: Directly decompress into the page cache for file data
This introduces an implementation of squashfs_readpage_block()
that directly decompresses into the page cache.

This uses the previously added page handler abstraction to push
down the necessary kmap_atomic/kunmap_atomic operations on the
page cache buffers into the decompressors.  This enables
direct copying into the page cache without using the slow
kmap/kunmap calls.

The code detects when multiple threads are racing in
squashfs_readpage() to decompress the same block, and avoids
this regression by falling back to using an intermediate
buffer.

This patch enhances the performance of Squashfs significantly
when multiple processes are accessing the filesystem simultaneously
because it not only reduces memcopying, but it more importantly
eliminates the lock contention on the intermediate buffer.

Using single-thread decompression.

        dd if=file1 of=/dev/null bs=4096 &
        dd if=file2 of=/dev/null bs=4096 &
        dd if=file3 of=/dev/null bs=4096 &
        dd if=file4 of=/dev/null bs=4096

Before:

629145600 bytes (629 MB) copied, 45.8046 s, 13.7 MB/s

After:

629145600 bytes (629 MB) copied, 9.29414 s, 67.7 MB/s

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
2013-11-20 03:59:13 +00:00
Phillip Lougher 5f55dbc0c5 Squashfs: Restructure squashfs_readpage()
Restructure squashfs_readpage() splitting it into separate
functions for datablocks, fragments and sparse blocks.

Move the memcpying (from squashfs cache entry) implementation of
squashfs_readpage_block into file_cache.c

This allows different implementations to be supported.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
2013-11-20 03:59:07 +00:00
Phillip Lougher 846b730e99 Squashfs: Generalise paging handling in the decompressors
Further generalise the decompressors by adding a page handler
abstraction.  This adds helpers to allow the decompressors
to access and process the output buffers in an implementation
independant manner.

This allows different types of output buffer to be passed
to the decompressors, with the implementation specific
aspects handled at decompression time, but without the
knowledge being held in the decompressor wrapper code.

This will allow the decompressors to handle Squashfs
cache buffers, and page cache pages.

This patch adds the abstraction and an implementation for
the caches.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
2013-11-20 03:59:01 +00:00
Phillip Lougher d208383d64 Squashfs: add multi-threaded decompression using percpu variable
Add a multi-threaded decompression implementation which uses
percpu variables.

Using percpu variables has advantages and disadvantages over
implementations which do not use percpu variables.

Advantages:
  * the nature of percpu variables ensures decompression is
    load-balanced across the multiple cores.
  * simplicity.

Disadvantages: it limits decompression to one thread per core.

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2013-11-20 03:58:03 +00:00
Minchan Kim cd59c2ec5f squashfs: Enhance parallel I/O
Now squashfs have used for only one stream buffer for decompression
so it hurts parallel read performance so this patch supports
multiple decompressor to enhance performance parallel I/O.

Four 1G file dd read on KVM machine which has 2 CPU and 4G memory.

dd if=test/test1.dat of=/dev/null &
dd if=test/test2.dat of=/dev/null &
dd if=test/test3.dat of=/dev/null &
dd if=test/test4.dat of=/dev/null &

old : 1m39s -> new : 9s

* From v1
  * Change comp_strm with decomp_strm - Phillip
  * Change/add comments - Phillip

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
2013-11-20 03:35:18 +00:00
Phillip Lougher 9508c6b90b Squashfs: Refactor decompressor interface and code
The decompressor interface and code was written from
the point of view of single-threaded operation.  In doing
so it mixed a lot of single-threaded implementation specific
aspects into the decompressor code and elsewhere which makes it
difficult to seamlessly support multiple different decompressor
implementations.

This patch does the following:

1.  It removes compressor_options parsing from the decompressor
    init() function.  This allows the decompressor init() function
    to be dynamically called to instantiate multiple decompressors,
    without the compressor options needing to be read and parsed each
    time.

2.  It moves threading and all sleeping operations out of the
    decompressors.  In doing so, it makes the decompressors
    non-blocking wrappers which only deal with interfacing with
    the decompressor implementation.

3. It splits decompressor.[ch] into decompressor generic functions
   in decompressor.[ch], and moves the single threaded
   decompressor implementation into decompressor_single.c.

The result of this patch is Squashfs should now be able to
support multiple decompressors by adding new decompressor_xxx.c
files with specialised implementations of the functions in
decompressor_single.c

Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reviewed-by: Minchan Kim <minchan@kernel.org>
2013-11-20 03:35:18 +00:00
Linus Torvalds 1ee2dcc224 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Mostly these are fixes for fallout due to merge window changes, as
  well as cures for problems that have been with us for a much longer
  period of time"

 1) Johannes Berg noticed two major deficiencies in our genetlink
    registration.  Some genetlink protocols we passing in constant
    counts for their ops array rather than something like
    ARRAY_SIZE(ops) or similar.  Also, some genetlink protocols were
    using fixed IDs for their multicast groups.

    We have to retain these fixed IDs to keep existing userland tools
    working, but reserve them so that other multicast groups used by
    other protocols can not possibly conflict.

    In dealing with these two problems, we actually now use less state
    management for genetlink operations and multicast groups.

 2) When configuring interface hardware timestamping, fix several
    drivers that simply do not validate that the hwtstamp_config value
    is one the driver actually supports.  From Ben Hutchings.

 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar.

 4) In dev_forward_skb(), set the skb->protocol in the right order
    relative to skb_scrub_packet().  From Alexei Starovoitov.

 5) Bridge erroneously fails to use the proper wrapper functions to make
    calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid.  Fix from Toshiaki
    Makita.

 6) When detaching a bridge port, make sure to flush all VLAN IDs to
    prevent them from leaking, also from Toshiaki Makita.

 7) Put in a compromise for TCP Small Queues so that deep queued devices
    that delay TX reclaim non-trivially don't have such a performance
    decrease.  One particularly problematic area is 802.11 AMPDU in
    wireless.  From Eric Dumazet.

 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts
    here.  Fix from Eric Dumzaet, reported by Dave Jones.

 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn.

10) When computing mergeable buffer sizes, virtio-net fails to take the
    virtio-net header into account.  From Michael Dalton.

11) Fix seqlock deadlock in ip4_datagram_connect() wrt.  statistic
    bumping, this one has been with us for a while.  From Eric Dumazet.

12) Fix NULL deref in the new TIPC fragmentation handling, from Erik
    Hugne.

13) 6lowpan bit used for traffic classification was wrong, from Jukka
    Rissanen.

14) macvlan has the same issue as normal vlans did wrt.  propagating LRO
    disabling down to the real device, fix it the same way.  From Michal
    Kubecek.

15) CPSW driver needs to soft reset all slaves during suspend, from
    Daniel Mack.

16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet.

17) The xen-netfront RX buffer refill timer isn't properly scheduled on
    partial RX allocation success, from Ma JieYue.

18) When ipv6 ping protocol support was added, the AF_INET6 protocol
    initialization cleanup path on failure was borked a little.  Fix
    from Vlad Yasevich.

19) If a socket disconnects during a read/recvmsg/recvfrom/etc that
    blocks we can do the wrong thing with the msg_name we write back to
    userspace.  From Hannes Frederic Sowa.  There is another fix in the
    works from Hannes which will prevent future problems of this nature.

20) Fix route leak in VTI tunnel transmit, from Fan Du.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits)
  genetlink: make multicast groups const, prevent abuse
  genetlink: pass family to functions using groups
  genetlink: add and use genl_set_err()
  genetlink: remove family pointer from genl_multicast_group
  genetlink: remove genl_unregister_mc_group()
  hsr: don't call genl_unregister_mc_group()
  quota/genetlink: use proper genetlink multicast APIs
  drop_monitor/genetlink: use proper genetlink multicast APIs
  genetlink: only pass array to genl_register_family_with_ops()
  tcp: don't update snd_nxt, when a socket is switched from repair mode
  atm: idt77252: fix dev refcnt leak
  xfrm: Release dst if this dst is improper for vti tunnel
  netlink: fix documentation typo in netlink_set_err()
  be2net: Delete secondary unicast MAC addresses during be_close
  be2net: Fix unconditional enabling of Rx interface options
  net, virtio_net: replace the magic value
  ping: prevent NULL pointer dereference on write to msg_name
  bnx2x: Prevent "timeout waiting for state X"
  bnx2x: prevent CFC attention
  bnx2x: Prevent panic during DMAE timeout
  ...
2013-11-19 15:50:47 -08:00
J. Bruce Fields 365da4adeb nfsd4: fix xdr decoding of large non-write compounds
This fixes a regression from 247500820e
"nfsd4: fix decoding of compounds across page boundaries".  The previous
code was correct: argp->pagelist is initialized in
nfs4svc_deocde_compoundargs to rqstp->rq_arg.pages, and is therefore a
pointer to the page *after* the page we are currently decoding.

The reason that patch nevertheless fixed a problem with decoding
compounds containing write was a bug in the write decoding introduced by
5a80a54d21 "nfsd4: reorganize write
decoding", after which write decoding no longer adhered to the rule that
argp->pagelist point to the next page.

Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-11-19 18:06:54 -05:00
Sasha Levin ddb8c45ba1 aio: nullify aio->ring_pages after freeing it
After freeing ring_pages we leave it as is causing a dangling pointer. This
has already caused an issue so to help catching any issues in the future
NULL it out.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-11-19 17:40:48 -05:00
Sasha Levin d558023207 aio: prevent double free in ioctx_alloc
ioctx_alloc() calls aio_setup_ring() to allocate a ring. If aio_setup_ring()
fails to do so it would call aio_free_ring() before returning, but
ioctx_alloc() would call aio_free_ring() again causing a double free of
the ring.

This is easily reproducible from userspace.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-11-19 17:40:48 -05:00
Johannes Berg 2a94fe48f3 genetlink: make multicast groups const, prevent abuse
Register generic netlink multicast groups as an array with
the family and give them contiguous group IDs. Then instead
of passing the global group ID to the various functions that
send messages, pass the ID relative to the family - for most
families that's just 0 because the only have one group.

This avoids the list_head and ID in each group, adding a new
field for the mcast group ID offset to the family.

At the same time, this allows us to prevent abusing groups
again like the quota and dropmon code did, since we can now
check that a family only uses a group it owns.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19 16:39:06 -05:00
Johannes Berg 68eb55031d genetlink: pass family to functions using groups
This doesn't really change anything, but prepares for the
next patch that will change the APIs to pass the group ID
within the family, rather than the global group ID.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19 16:39:06 -05:00
Johannes Berg 2ecf7536b2 quota/genetlink: use proper genetlink multicast APIs
The quota code is abusing the genetlink API and is using
its family ID as the multicast group ID, which is invalid
and may belong to somebody else (and likely will.)

Make the quota code use the correct API, but since this
is already used as-is by userspace, reserve a family ID
for this code and also reserve that group ID to not break
userspace assumptions.

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19 16:39:05 -05:00
Johannes Berg c53ed74236 genetlink: only pass array to genl_register_family_with_ops()
As suggested by David Miller, make genl_register_family_with_ops()
a macro and pass only the array, evaluating ARRAY_SIZE() in the
macro, this is a little safer.

The openvswitch has some indirection, assing ops/n_ops directly in
that code. This might ultimately just assign the pointers in the
family initializations, saving the struct genl_family_and_ops and
code (once mcast groups are handled differently.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-19 16:39:05 -05:00
Trond Myklebust 829e57d7c5 NFS: Fix a warning in nfs_setsecurity
Fix the following warning:

linux-nfs/fs/nfs/inode.c:315:1: warning: ‘inline’ is not at
beginning of declaration [-Wold-style-declaration]

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-19 16:20:41 -05:00
Anna Schumaker 694e096fd7 NFS: Enabling v4.2 should not recompile nfsd and lockd
When CONFIG_NFS_V4_2 is toggled nfsd and lockd will be recompiled,
instead of only the nfs client.  This patch moves a small amount of code
into the client directory to avoid unnecessary recompiles.

Signed-off-by: Anna Schumaker <bjschuma@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2013-11-19 16:20:40 -05:00
Al Viro 801a76050b seq_file: always clear m->count when we free m->buf
Once we'd freed m->buf, m->count should become zero - we have no valid
contents reachable via m->buf.

Reported-by: Charley (Hao Chuan) Chu <charley.chu@broadcom.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-18 19:07:53 -08:00
Steve French 7d3fb24bce Removed duplicated (and unneeded) goto
Remove an unneeded goto (and also was duplicated goto target name).

Signed-off-by: Steve French <smfrench@gmail.com>
2013-11-18 17:24:24 -06:00
Steve French 9bf0c9cd43 CIFS: Fix SMB2/SMB3 Copy offload support (refcopy) for large files
This third version of the patch, incorparating feedback from David Disseldorp
extends the ability of copychunk (refcopy) over smb2/smb3 mounts to
handle servers with smaller than usual maximum chunk sizes
and also fixes it to handle files bigger than the maximum chunk sizes

In the future this can be extended further to handle sending
multiple chunk requests in on SMB2 ioctl request which will
further improve performance, but even with one 1MB chunk per
request the speedup on cp is quite large.

Reviewed-by: David Disseldorp <ddiss@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2013-11-18 17:24:14 -06:00
Akinobu Mita 34f2fd8dfe bio: fix argument of __bio_add_page() for max_sectors > 0xffff
The data type of max_sectors and max_hw_sectors in queue settings are
unsigned int.  But these values are passed to __bio_add_page() as an
argument whose data type is unsigned short.  In the worst case such as
max_sectors is 0x10000, bio_add_page() can't add a page and IOs can't
proceed.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-11-18 12:31:27 -07:00
Christoph Hellwig 987da47910 nfsd: make sure to balance get/put_write_access
Use a straight goto error label style in nfsd_setattr to make sure
we always do the put_write_access call after we got it earlier.

Note that the we have been failing to do that in the case
nfsd_break_lease() returns an error, a bug introduced into 2.6.38 with
6a76bebefe "nfsd4: break lease on nfsd
setattr".

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2013-11-18 12:06:48 -05:00