Commit graph

49973 commits

Author SHA1 Message Date
NeilBrown 26fde4dfcb NFS: check for nfs_refresh_inode() errors in nfs_fhget()
If an NFS server returns a filehandle that we have previously
seen, and reports a different type, then nfs_refresh_inode()
will log a warning and return an error.

nfs_fhget() does not check for this error and may return an
inode with a different type than the one that the server
reported.

This is likely to cause confusion, and is one way that
->open_context() could return a directory inode as discussed
in the previous patch.

So if nfs_refresh_inode() returns and error, return that error
from nfs_fhget() to avoid the confusion propagating.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:09 -04:00
NeilBrown eaa2b82c3b NFS: guard against confused server in nfs_atomic_open()
A confused server could return a filehandle for an
NFSv4 OPEN request, which it previously returned for a directory.
So the inode returned by  ->open_context() in nfs_atomic_open()
could conceivably be a directory inode.

This has particular implications for the call to
nfs_file_set_open_context() in nfs_finish_open().
If that is called on a directory inode, then the nfs_open_context
that gets stored in the filp->private_data will be linked to
nfs_inode->open_files.

When the directory is closed, nfs_closedir() will (ultimately)
free the ->private_data, but not unlink it from nfs_inode->open_files
(because it doesn't expect an nfs_open_context there).

Subsequently the memory could get used for something else and eventually
if the ->open_files list is walked, the walker will fall off the end and
crash.

So: change nfs_finish_open() to only call nfs_file_set_open_context()
for regular-file inodes.

This failure mode has been seen in a production setting (unknown NFS
server implementation).  The kernel was v3.0 and the specific sequence
seen would not affect more recent kernels, but I think a risk is still
present, and caution is wise.

Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
NeilBrown cc89684c9a NFS: only invalidate dentrys that are clearly invalid.
Since commit bafc9b754f ("vfs: More precise tests in d_invalidate")
in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
to be invalidated even if it has filesystems mounted on or it or on a
descendant.  The mounted filesystem is unmounted.

This means we need to be careful not to return 0 unless the directory
referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
the directory.  Other errors such a -EPERM or -ERESTARTSYS should be
returned from ->d_revalidate() so they are propagated to the caller.

A particular problem can be demonstrated by:

1/ mount an NFS filesystem using NFSv3 on /mnt
2/ mount any other filesystem on /mnt/foo
3/ ls /mnt/foo
4/ turn off network, or otherwise make the server unable to respond
5/ ls /mnt/foo &
6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
7/ kill -9 $! # this results in -ERESTARTSYS being returned
8/ observe that /mnt/foo has been unmounted.

This patch changes nfs_lookup_revalidate() to only treat
  -ESTALE from nfs_lookup_verify_inode() and
  -ESTALE or -ENOENT from ->lookup()
as indicating an invalid inode.  Other errors are returned.

Also nfs_check_inode_attributes() is changed to return -ESTALE rather
than -EIO.  This is consistent with the error returned in similar
circumstances from nfs_update_inode().

As this bug allows any user to unmount a filesystem mounted on an NFS
filesystem, this fix is suitable for stable kernels.

Fixes: bafc9b754f ("vfs: More precise tests in d_invalidate")
Cc: stable@vger.kernel.org (v3.18+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia 22368ff11d PNFS for stateid errors retry against MDS first
Upon receiving a stateid error such as BAD_STATEID, the client
should retry the operation against the MDS before deciding to
do stateid recovery.

Previously, the code would initiate state recovery and it could
lead to a race in a state manager that could chose an incorrect
recovery method which would lead to the EIO failure for the
application.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 16:00:08 -04:00
Olga Kornievskaia a0bc01e0f1 PNFS fix EACCESS on commit to DS handling
Commit fabbbee0eb "PNFS fix fallback to MDS if got error on
commit to DS" moved the pnfs_set_lo_fail() to unhandled errors
which was not correct and lead to a kernel oops on umount.

Instead, fix the original EACCESS on commit to DS error by
getting the new layout and re-doing the IO.

Fixes: fabbbee0eb ("PNFS fix fallback to MDS if got error on commit to DS")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Cc: stable@vger.kernel.org # v4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:59:57 -04:00
Dan Carpenter 4cd1ec95bd NFS: silence a uninitialized variable warning
Static checkers have gotten clever enough to complain that "id_long" is
uninitialized on the failure path.  It's harmless, but simple to fix.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:28 -04:00
Tuo Chen Peng ce85bd2921 nfs: Fix fscache stat printing in nfs_show_stats()
nfs_show_stats() was incorrectly reading statistics for bytes when printing that
for fsc. It caused files like /proc/self/mountstats to report incorrect fsc
statistics for NFS mounts.

Signed-off-by: Tuo Chen Peng <tpeng@nvidia.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:27 -04:00
Benjamin Coddington 2eb3aea7d9 NFS: Fix initialization of nfs_page_array->npages
Commit 8ef9b0b9e1 open-coded nfs_pgarray_set(), and left out the
initialization of the nfs_page_array's npages.  This mistake didn't show up
until testing with block layouts, and there shows that all pNFS reads
return -EIO.

Fixes: 8ef9b0b9e1 ("NFS: move nfs_pgarray_set() to open code")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # 4.12
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:06 -04:00
Trond Myklebust 1a4edf0f46 NFS: Fix commit policy for non-blocking calls to nfs_write_inode()
Now that the writes will schedule a commit on their own, we don't
need nfs_write_inode() to schedule one if there are outstanding
writes, and we're being called in non-blocking mode.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Trond Myklebust 919e3bd9a8 NFS: Ensure we commit after writeback is complete
If the page cache is being flushed, then we want to ensure that we
do start a commit once the pages are done being flushed.
If we just wait until all I/O is done to that file, we can end up
livelocking until the balance_dirty_pages() mechanism puts its
foot down and forces I/O to stop.
So instead we do more or less the same thing that O_DIRECT does,
and set up a counter to tell us when the flush is done,

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Trond Myklebust b5973a8c1c NFS: Remove unused fields in the page I/O structures
Remove the 'layout_private' fields that were only used by the pNFS OSD
layout driver.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:05 -04:00
Benjamin Coddington 818a8dbe83 NFS: nfs_rename() - revalidate directories on -ERESTARTSYS
An interrupted rename will leave the old dentry behind if the rename
succeeds.  Fix this by forcing a lookup the next time through
->d_revalidate.

A previous attempt at solving this problem took the approach to complete
the work of the rename asynchronously, however that approach was wrong
since it would allow the d_move() to occur after the directory's i_mutex
had been dropped by the original process.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:04 -04:00
Benjamin Coddington a7a3b1e971 NFS: convert flags to bool
NFS uses some int, and unsigned int :1, and bool as flags in structs and
args.  Assert the preference for uniformly replacing these with the bool
type.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:04 -04:00
Anna Schumaker 18fe6a23e3 NFS: Set FATTR4_WORD0_TYPE for . and .. entries
The current code worked okay for getdents(), but getdents64() expects
the d_type field to get filled out properly in the stat structure.
Setting this field fixes xfstests generic/401.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13 15:58:03 -04:00
Christoph Hellwig 800222f80f nfsd4: const-ify nfsd4_ops
nfsd4_ops contains function pointers, and marking it as constant avoids
it being able to be used as an attach vector for code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:03 -04:00
Christoph Hellwig aa8217d5dc sunrpc: mark all struct svc_version instances as const
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:03 -04:00
Christoph Hellwig b9c744c19c sunrpc: mark all struct svc_procinfo instances as const
struct svc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig 0becc1181c sunrpc: move pc_count out of struct svc_procinfo
pc_count is the only writeable memeber of struct svc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct svc_procinfo, and into a
separate writable array that is pointed to by struct svc_version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig 72edc37a2c nfsd4: properly type op_func callbacks
Pass union nfsd4_op_u to the op_func callbacks instead of using unsafe
function pointer casts.

It also adds two missing structures to struct nfsd4_op.u to facilitate
this.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:02 -04:00
Christoph Hellwig 62bbf8bbb2 nfsd4: remove nfsd4op_rsize
Except for a lot of unnecessary casts this typedef only has one user,
so remove the casts and expand it in struct nfsd4_operation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig c2a1102aa2 nfsd4: properly type op_get_currentstateid callbacks
Pass union nfsd4_op_u to the op_set_currentstateid callbacks instead of
using unsafe function pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig 6c9600a71d nfsd4: properly type op_set_currentstateid callbacks
Given the args union in struct nfsd4_op a name, and pass it to the
op_set_currentstateid callbacks instead of using unsafe function
pointer casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:01 -04:00
Christoph Hellwig d16d186721 sunrpc: properly type pc_encode callbacks
Drop the resp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:58:00 -04:00
Christoph Hellwig cc6acc20a6 sunrpc: properly type pc_decode callbacks
Drop the argp argument as it can trivially be derived from the rqstp
argument.  With that all functions now have the same prototype, and we
can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:58:00 -04:00
Christoph Hellwig 1150ded804 sunrpc: properly type pc_release callbacks
Drop the p and resp arguments as they are always NULL or can trivially
be derived from the rqstp argument.  With that all functions now have the
same prototype, and we can remove the unsafe casting to kxdrproc_t.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig 1c8a5409f3 sunrpc: properly type pc_func callbacks
Drop the argp and resp arguments as they can trivially be derived from
the rqstp argument.  With that all functions now have the same prototype,
and we can remove the unsafe casting to svc_procfunc as well as the
svc_procfunc typedef itself.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:59 -04:00
Christoph Hellwig 36ba89c2a4 nfsd: remove the unused PROC() macro in nfs3proc.c
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig ec7e8caec9 nfsd: use named initializers in PROC()
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig 39d43f75c6 nfsd4: const-ify nfs_cb_version4
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:58 -04:00
Christoph Hellwig 511e936bf2 sunrpc: mark all struct rpc_procinfo instances as const
struct rpc_procinfo contains function pointers, and marking it as
constant avoids it being able to be used as an attach vector for
code injections.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:57 -04:00
Christoph Hellwig 9ae7d8ff29 nfs: use ARRAY_SIZE() in the nfsacl_version3 declaration
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:57 -04:00
Christoph Hellwig c551858a88 sunrpc: move p_count out of struct rpc_procinfo
p_count is the only writeable memeber of struct rpc_procinfo, which is
a good candidate to be const-ified as it contains function pointers.

This patch moves it into out out struct rpc_procinfo, and into a
separate writable array that is pointed to by struct rpc_version and
indexed by p_statidx.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-07-13 15:57:57 -04:00
Christoph Hellwig e91ff8e3c4 lockd: fix some weird indentation
Remove double indentation of a few struct rpc_version and
struct rpc_program instance.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig 947c6e431d nfs: don't cast callback decode/proc/encode routines
Instead declare all functions with the proper methods signature.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig fc016483eb nfs: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:56 -04:00
Christoph Hellwig 04000564c1 lockd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:55 -04:00
Christoph Hellwig 5362a4ec83 nfsd: fix decoder callback prototypes
Declare the p_decode callbacks with the proper prototype instead of
casting to kxdrdproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:55 -04:00
Christoph Hellwig 843efb7d7d nfsd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
2017-07-13 15:57:54 -04:00
Christoph Hellwig fcc85819ee nfs: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:53 -04:00
Christoph Hellwig d16073389b lockd: fix encoder callback prototypes
Declare the p_encode callbacks with the proper prototype instead of
casting to kxdreproc_t and losing all type safety.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-07-13 15:57:53 -04:00
Linus Torvalds ad51271afc Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

- various misc things

- kexec updates

- sysctl core updates

- scripts/gdb udpates

- checkpoint-restart updates

- ipc updates

- kernel/watchdog updates

- Kees's "rough equivalent to the glibc _FORTIFY_SOURCE=1 feature"

- "stackprotector: ascii armor the stack canary"

- more MM bits

- checkpatch updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
  writeback: rework wb_[dec|inc]_stat family of functions
  ARM: samsung: usb-ohci: move inline before return type
  video: fbdev: omap: move inline before return type
  video: fbdev: intelfb: move inline before return type
  USB: serial: safe_serial: move __inline__ before return type
  drivers: tty: serial: move inline before return type
  drivers: s390: move static and inline before return type
  x86/efi: move asmlinkage before return type
  sh: move inline before return type
  MIPS: SMP: move asmlinkage before return type
  m68k: coldfire: move inline before return type
  ia64: sn: pci: move inline before type
  ia64: move inline before return type
  FRV: tlbflush: move asmlinkage before return type
  CRIS: gpio: move inline before return type
  ARM: HP Jornada 7XX: move inline before return type
  ARM: KVM: move asmlinkage before type
  checkpatch: improve the STORAGE_CLASS test
  mm, migration: do not trigger OOM killer when migrating memory
  drm/i915: use __GFP_RETRY_MAYFAIL
  ...
2017-07-13 12:38:49 -07:00
Filipe Manana 6592e58c6b Btrfs: fix write corruption due to bio cloning on raid5/6
The recent changes to make bio cloning faster (added in the 4.13 merge
window) by using the bio_clone_fast() API introduced a regression on
raid5/6 modes, because cloned bios have an invalid bi_vcnt field
(therefore it can not be used) and the raid5/6 code uses the
bio_for_each_segment_all() API to iterate the segments of a bio, and this
API uses a bio's bi_vcnt field.

The issue is very simple to trigger by doing for example a direct IO write
against a raid5 or raid6 filesystem and then attempting to read what we
wrote before:

  $ mkfs.btrfs -m raid5 -d raid5 -f /dev/sdc /dev/sdd /dev/sde /dev/sdf
  $ mount /dev/sdc /mnt
  $ xfs_io -f -d -c "pwrite -S 0xab 0 1M" /mnt/foobar
  $ od -t x1 /mnt/foobar
  od: /mnt/foobar: read error: Input/output error

For that example, the following is also reported in dmesg/syslog:

  [18274.985557] btrfs_print_data_csum_error: 18 callbacks suppressed
  [18274.995277] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18274.997205] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.025221] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.047422] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054818] BTRFS warning (device sdf): csum failed root 5 ino 257 off 4096 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054834] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.054943] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 2
  [18275.055207] BTRFS warning (device sdf): csum failed root 5 ino 257 off 8192 csum 0x98f94189 expected csum 0x94374193 mirror 3
  [18275.055571] BTRFS warning (device sdf): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x94374193 mirror 1
  [18275.062171] BTRFS warning (device sdf): csum failed root 5 ino 257 off 12288 csum 0x98f94189 expected csum 0x94374193 mirror 1

A scrub will also fail correcting bad copies, mentioning the following in
dmesg/syslog:

  [18276.128696] scrub_handle_errored_block: 498 callbacks suppressed
  [18276.129617] BTRFS warning (device sdf): checksum error at logical 2186346496 on dev /dev/sde, sector 2116608, root 5, inode 257, offset 65536, length 4096, links $
  [18276.149235] btrfs_dev_stat_print_on_error: 498 callbacks suppressed
  [18276.157897] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.206059] BTRFS warning (device sdf): checksum error at logical 2186477568 on dev /dev/sdd, sector 2116736, root 5, inode 257, offset 196608, length 4096, links$
  [18276.206059] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.306552] BTRFS warning (device sdf): checksum error at logical 2186543104 on dev /dev/sdd, sector 2116864, root 5, inode 257, offset 262144, length 4096, links$
  [18276.319152] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.394316] BTRFS warning (device sdf): checksum error at logical 2186739712 on dev /dev/sdf, sector 2116992, root 5, inode 257, offset 458752, length 4096, links$
  [18276.396348] BTRFS error (device sdf): bdev /dev/sdf errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
  [18276.434127] BTRFS warning (device sdf): checksum error at logical 2186870784 on dev /dev/sde, sector 2117120, root 5, inode 257, offset 589824, length 4096, links$
  [18276.434127] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
  [18276.500504] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186477568 on dev /dev/sdd
  [18276.538400] BTRFS warning (device sdf): checksum error at logical 2186481664 on dev /dev/sdd, sector 2116744, root 5, inode 257, offset 200704, length 4096, links$
  [18276.540452] BTRFS error (device sdf): bdev /dev/sdd errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.542012] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186481664 on dev /dev/sdd
  [18276.585030] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186346496 on dev /dev/sde
  [18276.598306] BTRFS warning (device sdf): checksum error at logical 2186412032 on dev /dev/sde, sector 2116736, root 5, inode 257, offset 131072, length 4096, links$
  [18276.598310] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
  [18276.598582] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186350592 on dev /dev/sde
  [18276.603455] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
  [18276.638362] BTRFS warning (device sdf): checksum error at logical 2186354688 on dev /dev/sde, sector 2116624, root 5, inode 257, offset 73728, length 4096, links $
  [18276.640445] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
  [18276.645942] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186354688 on dev /dev/sde
  [18276.657204] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186412032 on dev /dev/sde
  [18276.660563] BTRFS warning (device sdf): checksum error at logical 2186416128 on dev /dev/sde, sector 2116744, root 5, inode 257, offset 135168, length 4096, links$
  [18276.664609] BTRFS error (device sdf): bdev /dev/sde errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
  [18276.664609] BTRFS error (device sdf): unable to fixup (regular) error at logical 2186358784 on dev /dev/sde

So fix this by using the bio_for_each_segment() API and setting before
the bio's bi_iter field to the value of the corresponding btrfs bio
container's saved iterator if we are processing a cloned bio in the
raid5/6 code (the same code processes both cloned and non-cloned bios).

This incorrect iteration of cloned bios was also causing some occasional
BUG_ONs when running fstest btrfs/064, which have a trace like the
following:

  [ 6674.416156] ------------[ cut here ]------------
  [ 6674.416157] kernel BUG at fs/btrfs/raid56.c:1897!
  [ 6674.416159] invalid opcode: 0000 [#1] PREEMPT SMP
  [ 6674.416160] Modules linked in: dm_flakey dm_mod dax ppdev tpm_tis parport_pc tpm_tis_core evdev tpm psmouse sg i2c_piix4 pcspkr parport i2c_core serio_raw button s
  [ 6674.416184] CPU: 3 PID: 19236 Comm: kworker/u32:10 Not tainted 4.12.0-rc6-btrfs-next-44+ #1
  [ 6674.416185] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
  [ 6674.416210] Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
  [ 6674.416211] task: ffff880147f6c740 task.stack: ffffc90001fb8000
  [ 6674.416229] RIP: 0010:__raid_recover_end_io+0x1ac/0x370 [btrfs]
  [ 6674.416230] RSP: 0018:ffffc90001fbbb90 EFLAGS: 00010217
  [ 6674.416231] RAX: ffff8801ff4b4f00 RBX: 0000000000000002 RCX: 0000000000000001
  [ 6674.416232] RDX: ffff880099b045d8 RSI: ffffffff81a5f6e0 RDI: 0000000000000004
  [ 6674.416232] RBP: ffffc90001fbbbc8 R08: 0000000000000001 R09: 0000000000000001
  [ 6674.416233] R10: ffffc90001fbbac8 R11: 0000000000001000 R12: 0000000000000002
  [ 6674.416234] R13: ffff880099b045c0 R14: 0000000000000004 R15: ffff88012bff2000
  [ 6674.416235] FS:  0000000000000000(0000) GS:ffff88023f2c0000(0000) knlGS:0000000000000000
  [ 6674.416235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 6674.416236] CR2: 00007f28cf282000 CR3: 00000001000c6000 CR4: 00000000000006e0
  [ 6674.416239] Call Trace:
  [ 6674.416259]  __raid56_parity_recover+0xfc/0x16e [btrfs]
  [ 6674.416276]  raid56_parity_recover+0x157/0x16b [btrfs]
  [ 6674.416293]  btrfs_map_bio+0xe0/0x259 [btrfs]
  [ 6674.416310]  btrfs_submit_bio_hook+0xbf/0x147 [btrfs]
  [ 6674.416327]  end_bio_extent_readpage+0x27b/0x4a0 [btrfs]
  [ 6674.416331]  bio_endio+0x17d/0x1b3
  [ 6674.416346]  end_workqueue_fn+0x3c/0x3f [btrfs]
  [ 6674.416362]  btrfs_scrubparity_helper+0x1aa/0x3b8 [btrfs]
  [ 6674.416379]  btrfs_endio_helper+0xe/0x10 [btrfs]
  [ 6674.416381]  process_one_work+0x276/0x4b6
  [ 6674.416384]  worker_thread+0x1ac/0x266
  [ 6674.416386]  ? rescuer_thread+0x278/0x278
  [ 6674.416387]  kthread+0x106/0x10e
  [ 6674.416389]  ? __list_del_entry+0x22/0x22
  [ 6674.416391]  ret_from_fork+0x27/0x40
  [ 6674.416395] Code: 44 89 e2 be 00 10 00 00 ff 15 b0 ab ef ff eb 72 4d 89 e8 89 d9 44 89 e2 be 00 10 00 00 ff 15 a3 ab ef ff eb 5d 41 83 fc ff 74 02 <0f> 0b 49 63 97
  [ 6674.416432] RIP: __raid_recover_end_io+0x1ac/0x370 [btrfs] RSP: ffffc90001fbbb90
  [ 6674.416434] ---[ end trace 74d56ebe7489dd6a ]---

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2017-07-13 19:26:01 +01:00
David Howells fdb254db21 isofs: Fix isofs_show_options()
The isofs patch needs a small fix to handle a signed/unsigned comparison that
the compiler didn't flag - thanks to Dan for catching it.

It should be noted, however, the session number handing appears to be incorrect
between where it is parsed and where it is used.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-13 12:30:43 -04:00
Ernesto A. Fernández 4d9bcaddac ext2: Fix memory leak when truncate races ext2_get_blocks
Buffer heads referencing indirect blocks may not be released if the file
is truncated at the right time. This happens because ext2_get_branch()
returns NULL when it finds the whole chain of indirect blocks already
set, and when truncate alters the chain this value of NULL is
treated as the address of the last head to be released. Handle this in the
same way as it's done after the got_it label.

Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2017-07-13 13:45:08 +02:00
Linus Torvalds 4ca6df1348 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull sysctl fix from Eric Biederman:
 "A rather embarassing and hard to hit bug was merged into 4.11-rc1.

  Andrei Vagin tracked this bug now and after some staring at the code
  I came up with a fix"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: Fix proc_sys_prune_dcache to hold a sb reference
2017-07-12 19:43:20 -07:00
Nikolay Borisov 3e8f399da4 writeback: rework wb_[dec|inc]_stat family of functions
Currently the writeback statistics code uses a percpu counters to hold
various statistics.  Furthermore we have 2 families of functions - those
which disable local irq and those which doesn't and whose names begin
with double underscore.  However, they both end up calling
__add_wb_stats which in turn calls percpu_counter_add_batch which is
already irq-safe.

Exploiting this fact allows to eliminated the __wb_* functions since
they don't add any further protection than we already have.
Furthermore, refactor the wb_* function to call __add_wb_stat directly
without the irq-disabling dance.  This will likely result in better
runtime of code which deals with modifying the stat counters.

While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.

Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:05 -07:00
Michal Hocko 91c63ecda7 xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
so it relied on the default page allocator behavior for the given set of
flags.  This means that small allocations actually never failed.

Now that we have __GFP_RETRY_MAYFAIL flag which works independently on
the allocation request size we can map KM_MAYFAIL to it.  The allocator
will try as hard as it can to fulfill the request but fails eventually
if the progress cannot be made.  It does so without triggering the OOM
killer which can be seen as an improvement because KM_MAYFAIL users
should be able to deal with allocation failures.

Link: http://lkml.kernel.org/r/20170623085345.11304-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alex Belits <alex.belits@cavium.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: David Daney <david.daney@cavium.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: NeilBrown <neilb@suse.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:03 -07:00
Dmitry Vyukov e41d58185f fault-inject: support systematic fault injection
Add /proc/self/task/<current-tid>/fail-nth file that allows failing
0-th, 1-st, 2-nd and so on calls systematically.
Excerpt from the added documentation:

 "Write to this file of integer N makes N-th call in the current task
  fail (N is 0-based). Read from this file returns a single char 'Y' or
  'N' that says if the fault setup with a previous write to this file
  was injected or not, and disables the fault if it wasn't yet injected.
  Note that this file enables all types of faults (slab, futex, etc).
  This setting takes precedence over all other generic settings like
  probability, interval, times, etc. But per-capability settings (e.g.
  fail_futex/ignore-private) take precedence over it. This feature is
  intended for systematic testing of faults in a single system call. See
  an example below"

Why add a new setting:
1. Existing settings are global rather than per-task.
   So parallel testing is not possible.
2. attr->interval is close but it depends on attr->count
   which is non reset to 0, so interval does not work as expected.
3. Trying to model this with existing settings requires manipulations
   of all of probability, interval, times, space, task-filter and
   unexposed count and per-task make-it-fail files.
4. Existing settings are per-failure-type, and the set of failure
   types is potentially expanding.
5. make-it-fail can't be changed by unprivileged user and aggressive
   stress testing better be done from an unprivileged user.
   Similarly, this would require opening the debugfs files to the
   unprivileged user, as he would need to reopen at least times file
   (not possible to pre-open before dropping privs).

The proposed interface solves all of the above (see the example).

We want to integrate this into syzkaller fuzzer.  A prototype has found
10 bugs in kernel in first day of usage:

  https://groups.google.com/forum/#!searchin/syzkaller/%22FAULT_INJECTION%22%7Csort:relevance

I've made the current interface work with all types of our sandboxes.
For setuid the secret sauce was prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) to
make /proc entries non-root owned.  So I am fine with the current
version of the code.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/20170328130128.101773-1-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov 92ef6da3d0 kcmp: fs/epoll: wrap kcmp code with CONFIG_CHECKPOINT_RESTORE
kcmp syscall is build iif CONFIG_CHECKPOINT_RESTORE is selected, so wrap
appropriate helpers in epoll code with the config to build it
conditionally.

Link: http://lkml.kernel.org/r/20170513083456.GG1881@uranus.lan
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Andrew Morton <akpm@linuxfoundation.org>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov 0791e3644e kcmp: add KCMP_EPOLL_TFD mode to compare epoll target files
With current epoll architecture target files are addressed with
file_struct and file descriptor number, where the last is not unique.
Moreover files can be transferred from another process via unix socket,
added into queue and closed then so we won't find this descriptor in the
task fdinfo list.

Thus to checkpoint and restore such processes CRIU needs to find out
where exactly the target file is present to add it into epoll queue.
For this sake one can use kcmp call where some particular target file
from the queue is compared with arbitrary file passed as an argument.

Because epoll target files can have same file descriptor number but
different file_struct a caller should explicitly specify the offset
within.

To test if some particular file is matching entry inside epoll one have
to

 - fill kcmp_epoll_slot structure with epoll file descriptor,
   target file number and target file offset (in case if only
   one target is present then it should be 0)

 - call kcmp as kcmp(pid1, pid2, KCMP_EPOLL_TFD, fd, &kcmp_epoll_slot)
    - the kernel fetch file pointer matching file descriptor @fd of pid1
    - lookups for file struct in epoll queue of pid2 and returns traditional
      0,1,2 result for sorting purpose

Link: http://lkml.kernel.org/r/20170424154423.511592110@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrey Vagin <avagin@openvz.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Cyrill Gorcunov 77493f04b7 procfs: fdinfo: extend information about epoll target files
Since it is possbile to have same number in tfd field (say file added,
closed, then nother file dup'ed to same number and added back) it is
imposible to distinguish such target files solely by their numbers.

Strictly speaking regular applications don't need to recognize these
targets at all but for checkpoint/restore sake we need to collect
targets to be able to push them back on restore stage in a proper order.

Thus lets add file position, inode and device number where this target
lays.  This three fields can be used as a primary key for sorting, and
together with kcmp help CRIU can find out an exact file target (from the
whole set of processes being checkpointed).

Link: http://lkml.kernel.org/r/20170424154423.436491881@gmail.com
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrei Vagin <avagin@virtuozzo.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:01 -07:00
Davidlohr Bueso 59224ac1cf fs/Kconfig: kill CONFIG_PERCPU_RWSEM some more
As of commit bf3eac84c4 ("percpu-rwsem: kill CONFIG_PERCPU_RWSEM") we
unconditionally build pcpu-rwsems.  Remove a leftover in for
FILE_LOCKING.

Link: http://lkml.kernel.org/r/20170518180115.2794-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Rakesh Pandit 5f9f48f5b3 bfs: fix sanity checks for empty files
Mount fails if file system image has empty files because of sanity check
while reading superblock.  For empty files disk offset to end of file
(i_eoffset) is cpu_to_le32(-1).  Sanity check comparison, which compares
disk offset with file system size isn't valid for this value and hence
is ignored with this patch.

Steps to reproduce:

  $  dd if=/dev/zero of=bfs-image count=204800
  $  mkfs.bfs bfs-image
  $  mkdir bfs-mount-point
  $  sudo mount -t bfs -o loop bfs-image bfs-mount-point/
  $  cd bfs-mount-point/
  $  sudo touch a
  $  cd ..
  $  sudo umount bfs-mount-point/
  $  sudo mount -t bfs -o loop bfs-image bfs-mount-point/
  mount: /dev/loop0: can't read superblock

  $  dmesg
  [25526.689580] BFS-fs: bfs_fill_super(): Inode 0x00000003 corrupted

Tigran said:
 "If you had created the filesystem with the proper mkfs under SCO
  UnixWare 7 you (probably) wouldn't encounter this issue. But since
  commercial Unix-es are now part of history and the only proper way is
  the Linux mkfs.bfs utility, your patch is fine"

Link: http://lkml.kernel.org/r/20170505201625.GA3097@hercules.tuxera.com
Signed-off-by: Rakesh Pandit <rakesh@tuxera.com>
Acked-by: Tigran Aivazian <aivazian.tigran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez 61d9b56a89 sysctl: add unsigned int range support
To keep parity with regular int interfaces provide the an unsigned int
proc_douintvec_minmax() which allows you to specify a range of allowed
valid numbers.

Adding proc_douintvec_minmax_sysadmin() is easy but we can wait for an
actual user for that.

Link: http://lkml.kernel.org/r/20170519033554.18592-6-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez 4f2fec00af sysctl: simplify unsigned int support
Commit e7d316a02f ("sysctl: handle error writing UINT_MAX to u32
fields") added proc_douintvec() to start help adding support for
unsigned int, this however was only half the work needed.  Two fixes
have come in since then for the following issues:

  o Printing the values shows a negative value, this happens since
    do_proc_dointvec() and this uses proc_put_long()

This was fixed by commit 5380e5644a ("sysctl: don't print negative
flag for proc_douintvec").

  o We can easily wrap around the int values: UINT_MAX is 4294967295, if
    we echo in 4294967295 + 1 we end up with 0, using 4294967295 + 2 we
    end up with 1.
  o We echo negative values in and they are accepted

This was fixed by commit 425fffd886 ("sysctl: report EINVAL if value
is larger than UINT_MAX for proc_douintvec").

It still also failed to be added to sysctl_check_table()...  instead of
adding it with the current implementation just provide a proper and
simplified unsigned int support without any array unsigned int support
with no negative support at all.

Historically sysctl proc helpers have supported arrays, due to the
complexity this adds though we've taken a step back to evaluate array
users to determine if its worth upkeeping for unsigned int.  An
evaluation using Coccinelle has been done to perform a grammatical
search to ask ourselves:

  o How many sysctl proc_dointvec() (int) users exist which likely
    should be moved over to proc_douintvec() (unsigned int) ?
	Answer: about 8
	- Of these how many are array users ?
		Answer: Probably only 1
  o How many sysctl array users exist ?
	Answer: about 12

This last question gives us an idea just how popular arrays: they are not.
Array support should probably just be kept for strings.

The identified uint ports are:

  drivers/infiniband/core/ucma.c - max_backlog
  drivers/infiniband/core/iwcm.c - default_backlog
  net/core/sysctl_net_core.c - rps_sock_flow_sysctl()
  net/netfilter/nf_conntrack_timestamp.c - nf_conntrack_timestamp -- bool
  net/netfilter/nf_conntrack_acct.c nf_conntrack_acct -- bool
  net/netfilter/nf_conntrack_ecache.c - nf_conntrack_events -- bool
  net/netfilter/nf_conntrack_helper.c - nf_conntrack_helper -- bool
  net/phonet/sysctl.c proc_local_port_range()

The only possible array users is proc_local_port_range() but it does not
seem worth it to add array support just for this given the range support
works just as well.  Unsigned int support should be desirable more for
when you *need* more than INT_MAX or using int min/max support then does
not suffice for your ranges.

If you forget and by mistake happen to register an unsigned int proc
entry with an array, the driver will fail and you will get something as
follows:

sysctl table check failed: debug/test_sysctl//uint_0002 array now allowed
CPU: 2 PID: 1342 Comm: modprobe Tainted: G        W   E <etc>
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <etc>
Call Trace:
 dump_stack+0x63/0x81
 __register_sysctl_table+0x350/0x650
 ? kmem_cache_alloc_trace+0x107/0x240
 __register_sysctl_paths+0x1b3/0x1e0
 ? 0xffffffffc005f000
 register_sysctl_table+0x1f/0x30
 test_sysctl_init+0x10/0x1000 [test_sysctl]
 do_one_initcall+0x52/0x1a0
 ? kmem_cache_alloc_trace+0x107/0x240
 do_init_module+0x5f/0x200
 load_module+0x1867/0x1bd0
 ? __symbol_put+0x60/0x60
 SYSC_finit_module+0xdf/0x110
 SyS_finit_module+0xe/0x10
 entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f042b22d119
<etc>

Fixes: e7d316a02f ("sysctl: handle error writing UINT_MAX to u32 fields")
Link: http://lkml.kernel.org/r/20170519033554.18592-5-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Suggested-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Liping Zhang <zlpnobody@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
Luis R. Rodriguez 89c5b53b16 sysctl: fix lax sysctl_check_table() sanity check
Patch series "sysctl: few fixes", v5.

I've been working on making kmod more deterministic, and as I did that I
couldn't help but notice a few issues with sysctl.  My end goal was just
to fix unsigned int support, which back then was completely broken.
Liping Zhang has sent up small atomic fixes, however it still missed yet
one more fix and Alexey Dobriyan had also suggested to just drop array
support given its complexity.

I have inspected array support using Coccinelle and indeed its not that
popular, so if in fact we can avoid it for new interfaces, I agree its
best.

I did develop a sysctl stress driver but will hold that off for another
series.

This patch (of 5):

Commit 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
improved sanity checks considerbly, however the enhancements on
sysctl_check_table() meant adding a functional change so that only the
last table entry's sanity error is propagated.  It also changed the way
errors were propagated so that each new check reset the err value, this
means only last sanity check computed is used for an error.  This has
been in the kernel since v3.4 days.

Fix this by carrying on errors from previous checks and iterations as we
traverse the table and ensuring we keep any error from previous checks.
We keep iterating on the table even if an error is found so we can
complain for all errors found in one shot.  This works as -EINVAL is
always returned on error anyway, and the check for error is any non-zero
value.

Fixes: 7c60c48f58 ("sysctl: Improve the sysctl sanity checks")
Link: http://lkml.kernel.org/r/20170519033554.18592-2-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:00 -07:00
J. Bruce Fields 630458e730 nfsd4: factor ctime into change attribute
Factoring ctime into the nfsv4 change attribute gives us better
properties than just i_version alone.

Eventually we'll likely also expose this (as opposed to raw i_version)
to userspace, at which point we'll want to move it to a common helper,
called from either userspace or individual filesystems.  For now, nfsd
is the only user.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-07-12 15:55:00 -04:00
Linus Torvalds 6b1c776d3e Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
 "This work from Amir introduces the inodes index feature, which
  provides:

   - hardlinks are not broken on copy up

   - infrastructure for overlayfs NFS export

  This also fixes constant st_ino for samefs case for lower hardlinks"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (33 commits)
  ovl: mark parent impure and restore timestamp on ovl_link_up()
  ovl: document copying layers restrictions with inodes index
  ovl: cleanup orphan index entries
  ovl: persistent overlay inode nlink for indexed inodes
  ovl: implement index dir copy up
  ovl: move copy up lock out
  ovl: rearrange copy up
  ovl: add flag for upper in ovl_entry
  ovl: use struct copy_up_ctx as function argument
  ovl: base tmpfile in workdir too
  ovl: factor out ovl_copy_up_inode() helper
  ovl: extract helper to get temp file in copy up
  ovl: defer upper dir lock to tempfile link
  ovl: hash overlay non-dir inodes by copy up origin
  ovl: cleanup bad and stale index entries on mount
  ovl: lookup index entry for copy up origin
  ovl: verify index dir matches upper dir
  ovl: verify upper root dir matches lower root dir
  ovl: introduce the inodes index dir feature
  ovl: generalize ovl_create_workdir()
  ...
2017-07-12 09:28:55 -07:00
Linus Torvalds 908b852df1 Upgrade default dialect to more secure SMB3 from older cifs dialect
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQGcBAABAgAGBQJZZEkpAAoJEIosvXAHck9RHxML/2fAdq8fzWMACInmycbJuAuS
 o/mjXyli5EtwVwcTCE5Z4mW303ch1CY964hKKT/egeOe4WrtUS6a904UCDTkre48
 KAJBCoi52jqTT6ruTC4EoSlMZi5V8q6/O91fwTVZGBRzEsAz/oCb0uwg1dfyFgu7
 g5W+ppYmQcDTImPR9r3BuYJj56pYWj77vlrRwfN5pAko5OocZXL71JPtBWqYuXoi
 jxicWnHc1RPdCIgaLanqQtTOvPub8f19a5cAz3/IAR6AEo0ySzS45CQGKag+Da86
 JVuXiAQ8SQJUoFvEWQ8XdAMu/U+9Vn6UenB8k2MlrOXNh406X3Rdv0cF0UzSdE93
 E+6xJ0S47pno/3eOgKPs1kDuy5edqgqxTicpurvzzjtAHDJtJGhYxSYxHK9i8R2S
 iNmnkuqBjQf9bprabZG7yze38nTyf0vlO9FviYZnMAy7Pxwpd9ADNHhooDbPaZtG
 qbquIAr0s0XZQVHCM/1jCvvXxjdX+ENpTZ4z79u0mA==
 =8UiP
 -----END PGP SIGNATURE-----

Merge tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes and sane default from Steve French:
 "Upgrade default dialect to more secure SMB3 from older cifs dialect"

* tag 'smb3-security-fixes-for-4.13' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Clean up unused variables in smb2pdu.c
  [SMB3] Improve security, move default dialect to SMB3 from old CIFS
  [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
  CIFS: Reconnect expired SMB sessions
  CIFS: Display SMB2 error codes in the hex format
  cifs: Use smb 2 - 3 and cifsacl mount options setacl function
  cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options
2017-07-11 14:04:48 -07:00
Linus Torvalds 3bf7878f0f The main item here is support for v12.y.z ("Luminous") clusters:
RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
 feature bits, and various other changes in the RADOS client protocol.
 On top of that we have a new fsc mount option to allow supplying
 fscache uniquifier (similar to NFS) and the usual pile of filesystem
 fixes from Zheng.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZZQT+AAoJEEp/3jgCEfOLSsMH/i8ZdSzp7ocX00oLMlIxzFEk
 5BUXZ086mEPAE4fjJFPO7+qYk6y26MzAhJL+bj8r5E0GvBEpQkoAoSQZ19Mj5ApC
 nZnllzQ2C8kYvM4hp4Z2pLrF/OYACj/WJJgbTxubBET1zRq1iPj4EgbzBEraPvma
 K76W9ILKNUjIoSDlNR5qvykXXfvi2dxRpi/8nvfMCOcjlw/7orjXVLa05fKmmOoX
 OvpOjicWOrc8NlacGK+j1j1aaKlmLvZb9Ff+45hfC/L5PPQblM0dypFCVfq3MFFq
 nUxKgTCAQDPrndzCdURCtdovjFKbskRGKmhnd0EZkdDCcnUmg6nLxqta6g2Dbs0=
 =ioKM
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main item here is support for v12.y.z ("Luminous") clusters:
  RESEND_ON_SPLIT, RADOS_BACKOFF, OSDMAP_PG_UPMAP and CRUSH_CHOOSE_ARGS
  feature bits, and various other changes in the RADOS client protocol.

  On top of that we have a new fsc mount option to allow supplying
  fscache uniquifier (similar to NFS) and the usual pile of filesystem
  fixes from Zheng"

* tag 'ceph-for-4.13-rc1' of git://github.com/ceph/ceph-client: (44 commits)
  libceph: advertise support for NEW_OSDOP_ENCODING and SERVER_LUMINOUS
  libceph: osd_state is 32 bits wide in luminous
  crush: remove an obsolete comment
  crush: crush_init_workspace starts with struct crush_work
  libceph, crush: per-pool crush_choose_arg_map for crush_do_rule()
  crush: implement weight and id overrides for straw2
  libceph: apply_upmap()
  libceph: compute actual pgid in ceph_pg_to_up_acting_osds()
  libceph: pg_upmap[_items] infrastructure
  libceph: ceph_decode_skip_* helpers
  libceph: kill __{insert,lookup,remove}_pg_mapping()
  libceph: introduce and switch to decode_pg_mapping()
  libceph: don't pass pgid by value
  libceph: respect RADOS_BACKOFF backoffs
  libceph: make DEFINE_RB_* helpers more general
  libceph: avoid unnecessary pi lookups in calc_target()
  libceph: use target pi for calc_target() calculations
  libceph: always populate t->target_{oid,oloc} in calc_target()
  libceph: make sure need_resend targets reflect latest map
  libceph: delete from need_resend_linger before check_linger_pool_dne()
  ...
2017-07-11 12:12:28 -07:00
Eric W. Biederman 2fd1d2c4ce proc: Fix proc_sys_prune_dcache to hold a sb reference
Andrei Vagin writes:
FYI: This bug has been reproduced on 4.11.7
> BUG: Dentry ffff895a3dd01240{i=4e7c09a,n=lo}  still in use (1) [unmount of proc proc]
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 umount_check+0x6e/0x80
> CPU: 1 PID: 13588 Comm: kworker/1:1 Not tainted 4.11.7-200.fc25.x86_64 #1
> Hardware name: CompuLab sbc-flt1/fitlet, BIOS SBCFLT_0.08.04 06/27/2015
> Workqueue: events proc_cleanup_work
> Call Trace:
>  dump_stack+0x63/0x86
>  __warn+0xcb/0xf0
>  warn_slowpath_null+0x1d/0x20
>  umount_check+0x6e/0x80
>  d_walk+0xc6/0x270
>  ? dentry_free+0x80/0x80
>  do_one_tree+0x26/0x40
>  shrink_dcache_for_umount+0x2d/0x90
>  generic_shutdown_super+0x1f/0xf0
>  kill_anon_super+0x12/0x20
>  proc_kill_sb+0x40/0x50
>  deactivate_locked_super+0x43/0x70
>  deactivate_super+0x5a/0x60
>  cleanup_mnt+0x3f/0x90
>  mntput_no_expire+0x13b/0x190
>  kern_unmount+0x3e/0x50
>  pid_ns_release_proc+0x15/0x20
>  proc_cleanup_work+0x15/0x20
>  process_one_work+0x197/0x450
>  worker_thread+0x4e/0x4a0
>  kthread+0x109/0x140
>  ? process_one_work+0x450/0x450
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x2c/0x40
> ---[ end trace e1c109611e5d0b41 ]---
> VFS: Busy inodes after unmount of proc. Self-destruct in 5 seconds.  Have a nice day...
> BUG: unable to handle kernel NULL pointer dereference at           (null)
> IP: _raw_spin_lock+0xc/0x30
> PGD 0

Fix this by taking a reference to the super block in proc_sys_prune_dcache.

The superblock reference is the core of the fix however the sysctl_inodes
list is converted to a hlist so that hlist_del_init_rcu may be used.  This
allows proc_sys_prune_dache to remove inodes the sysctl_inodes list, while
not causing problems for proc_sys_evict_inode when if it later choses to
remove the inode from the sysctl_inodes list.  Removing inodes from the
sysctl_inodes list allows proc_sys_prune_dcache to have a progress
guarantee, while still being able to drop all locks.  The fact that
head->unregistering is set in start_unregistering ensures that no more
inodes will be added to the the sysctl_inodes list.

Previously the code did a dance where it delayed calling iput until the
next entry in the list was being considered to ensure the inode remained on
the sysctl_inodes list until the next entry was walked to.  The structure
of the loop in this patch does not need that so is much easier to
understand and maintain.

Cc: stable@vger.kernel.org
Reported-by: Andrei Vagin <avagin@gmail.com>
Tested-by: Andrei Vagin <avagin@openvz.org>
Fixes: ace0c791e6 ("proc/sysctl: Don't grab i_lock under sysctl_lock.")
Fixes: d6cffbbe9a ("proc/sysctl: prune stale dentries during unregistering")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-07-11 11:01:24 -05:00
David Howells 1d278a8790 VFS: Kill off s_options and helpers
Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves.  This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells 4dfdb71307 orangefs: Implement show_options
Implement the show_options superblock op for orangefs as part of a bid to
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Mike Marshall <hubcap@omnibond.com>
cc: pvfs2-developers@beowulf-underground.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:09:21 -04:00
David Howells c4fac91004 9p: Implement show_options
Implement the show_options superblock op for 9p as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Eric Van Hensbergen <ericvh@gmail.com>
cc: Ron Minnich <rminnich@sandia.gov>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: v9fs-developer@lists.sourceforge.net
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:08:58 -04:00
David Howells 86a1da6d30 isofs: Implement show_options
Implement the show_options superblock op for omfs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:18 -04:00
David Howells 677018a6ce afs: Implement show_options
Implement the show_options superblock op for afs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Also implement the show_devname op to display the correct device name and thus
avoid the need to display the cell= and volume= options.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:18 -04:00
David Howells 26a7655e6a affs: Implement show_options
Implement the show_options superblock op for affs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:17 -04:00
David Howells 3ab7947ac3 befs: Implement show_options
Implement the show_options superblock op for befs as part of a bid to get
rid of s_options and generic_show_options() to make it easier to implement
a context-based mount where the mount options can be passed individually
over a file descriptor.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Luis de Bethencourt <luisbg@osg.samsung.com>
cc: Salah Triki <salah.triki@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-11 06:06:17 -04:00
Linus Torvalds 9967468c0a Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - most of the rest of MM

 - KASAN updates

 - lib/ updates

 - checkpatch updates

 - some binfmt_elf changes

 - various misc bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (115 commits)
  kernel/exit.c: avoid undefined behaviour when calling wait4()
  kernel/signal.c: avoid undefined behaviour in kill_something_info
  binfmt_elf: safely increment argv pointers
  s390: reduce ELF_ET_DYN_BASE
  powerpc: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm64: move ELF_ET_DYN_BASE to 4GB / 4MB
  arm: move ELF_ET_DYN_BASE to 4MB
  binfmt_elf: use ELF_ET_DYN_BASE only for PIE
  fs, epoll: short circuit fetching events if thread has been killed
  checkpatch: improve multi-line alignment test
  checkpatch: improve macro reuse test
  checkpatch: change format of --color argument to --color[=WHEN]
  checkpatch: silence perl 5.26.0 unescaped left brace warnings
  checkpatch: improve tests for multiple line function definitions
  checkpatch: remove false warning for commit reference
  checkpatch: fix stepping through statements with $stat and ctx_statement_block
  checkpatch: [HLP]LIST_HEAD is also declaration
  checkpatch: warn when a MAINTAINERS entry isn't [A-Z]:\t
  checkpatch: improve the unnecessary OOM message test
  lib/bsearch.c: micro-optimize pivot position calculation
  ...
2017-07-10 16:58:42 -07:00
Kees Cook 67c6777a5d binfmt_elf: safely increment argv pointers
When building the argv/envp pointers, the envp is needlessly
pre-incremented instead of just continuing after the argv pointers are
finished.  In some (likely impossible) race where the strings could be
changed from userspace between copy_strings() and here, it might be
possible to confuse the envp position.  Instead, just use sp like
everything else.

Link: http://lkml.kernel.org/r/20170622173838.GA43308@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Kees Cook eab09532d4 binfmt_elf: use ELF_ET_DYN_BASE only for PIE
The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries.  (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.)

With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
ELF_ET_DYN_BASE continued to be used since the kernel was only looking
at ET_DYN.  However, since ELF_ET_DYN_BASE is traditionally set at the
top 1/3rd of the TASK_SIZE, a substantial portion of the address space
is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
loaded above the mmap region.  This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions.

Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
region in all cases, and will now additionally avoid programs falling
back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
if it would have collided with the stack, now it will fail to load
instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader.  Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without risk
of loaders colliding with their subsequently loaded programs.

For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
the entire 32-bit address space for 32-bit pointers.

Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
suggestions on how to implement this solution.

Fixes: d1fd836dcf ("mm: split ET_DYN ASLR from mmap ASLR")
Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Qualys Security Advisory <qsa@qualys.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
David Rientjes c257a340ed fs, epoll: short circuit fetching events if thread has been killed
We've encountered zombies that are waiting for a thread to exit that are
looping in ep_poll() almost endlessly although there is a pending
SIGKILL as a result of a group exit.

This happens because we always find ep_events_available() and fetch more
events and never are able to check for signal_pending() that would break
from the loop and return -EINTR.

Special case fatal signals and break immediately to guarantee that we
loop to fetch more events and delay making a timely exit.

It would also be possible to simply move the check for signal_pending()
higher than checking for ep_events_available(), but there have been no
reports of delayed signal handling other than SIGKILL preventing zombies
from exiting that would be fixed by this.

It fixes an issue for us where we have witnessed zombies sticking around
for at least O(minutes), but considering the code has been like this
forever and nobody else has complained that I have found, I would simply
queue it up for 4.12.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705031722350.76784@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:36 -07:00
Heiner Kallweit cde1b69389 fs/proc/generic.c: switch to ida_simple_get/remove
The code can be much simplified by switching to ida_simple_get/remove.

Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Sahitya Tummala b17c070fb6 fs/dcache.c: fix spin lockup issue on nlru->lock
__list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer
duration if there are more number of items in the lru list.  As per the
current code, it can hold the spin lock for upto maximum UINT_MAX
entries at a time.  So if there are more number of items in the lru
list, then "BUG: spinlock lockup suspected" is observed in the below
path:

  spin_bug+0x90
  do_raw_spin_lock+0xfc
  _raw_spin_lock+0x28
  list_lru_add+0x28
  dput+0x1c8
  path_put+0x20
  terminate_walk+0x3c
  path_lookupat+0x100
  filename_lookup+0x6c
  user_path_at_empty+0x54
  SyS_faccessat+0xd0
  el0_svc_naked+0x24

This nlru->lock is acquired by another CPU in this path -

  d_lru_shrink_move+0x34
  dentry_lru_isolate_shrink+0x48
  __list_lru_walk_one.isra.10+0x94
  list_lru_walk_node+0x40
  shrink_dcache_sb+0x60
  do_remount_sb+0xbc
  do_emergency_remount+0xb0
  process_one_work+0x228
  worker_thread+0x2e0
  kthread+0xf4
  ret_from_fork+0x10

Fix this lockup by reducing the number of entries to be shrinked from
the lru list to 1024 at once.  Also, add cond_resched() before
processing the lru list again.

Link: http://marc.info/?t=149722864900001&r=1&w=2
Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Alexander Polakov <apolyakov@beget.ru>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:33 -07:00
Vasily Averin 8c03cc85a0 fs/proc/task_mmu.c: remove obsolete comment in show_map_vma()
After commit 1be7107fbe ("mm: larger stack guard gap, between vmas")
we do not hide stack guard page in /proc/<pid>/maps

Link: http://lkml.kernel.org/r/211f3c2a-f7ef-7c13-82bf-46fd426f6e1b@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:32 -07:00
Naoya Horiguchi 78bb920344 mm: hwpoison: dissolve in-use hugepage in unrecoverable memory error
Currently me_huge_page() relies on dequeue_hwpoisoned_huge_page() to
keep the error hugepage away from the system, which is OK but not good
enough because the hugepage still has a refcount and unpoison doesn't
work on the error hugepage (PageHWPoison flags are cleared but pages are
still leaked.) And there's "wasting health subpages" issue too.  This
patch reworks on me_huge_page() to solve these issues.

For hugetlb file, recently we have truncating code so let's use it in
hugetlbfs specific ->error_remove_page().

For anonymous hugepage, it's helpful to dissolve the error page after
freeing it into free hugepage list.  Migration entry and PageHWPoison in
the head page prevent the access to it.

TODO: dissolve_free_huge_page() can fail but we don't considered it yet.
It's not critical (and at least no worse that now) because in such case
the error hugepage just stays in free hugepage list without being
dissolved.  By virtue of PageHWPoison in head page, it's never allocated
to processes.

[akpm@linux-foundation.org: fix unused var warnings]
Fixes: 23a003bfd2 ("mm/madvise: pass return code of memory_failure() to userspace")
Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop
Link: http://lkml.kernel.org/r/1496305019-5493-8-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Eric Biggers 241f01fbed fs/buffer.c: make bh_lru_install() more efficient
To install a buffer_head into the cpu's LRU queue, bh_lru_install()
would construct a new copy of the queue and then memcpy it over the real
queue.  But it's easily possible to do the update in-place, which is
faster and simpler.  Some work can also be skipped if the buffer_head
was already in the queue.

As a microbenchmark I timed how long it takes to run sb_getblk()
10,000,000 times alternating between BH_LRU_SIZE + 1 blocks.
Effectively, this benchmarks looking up buffer_heads that are in the
page cache but not in the LRU:

	Before this patch: 1.758s
	After this patch: 1.653s

This patch also removes about 350 bytes of compiled code (on x86_64),
partly due to removal of the memcpy() which was being inlined+unrolled.

Link: http://lkml.kernel.org/r/20161229193445.1913-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:30 -07:00
Linus Torvalds 5cdd4c0468 for-f2fs-4.13
In this round, we've added new features such as disk quota and statx, and
 modified internal bio management flow to merge more IOs depending on block
 types. We've also made internal threads freezeable for Android battery life.
 In addition to them, there are some patches to avoid lock contention as well
 as a couple of deadlock conditions.
 
 = Enhancement
 - support usrquota, grpquota, and statx
 - manage DATA/NODE typed bios separately to serialize more IOs
 - modify f2fs_lock_op/wio_mutex to avoid lock contention
 - prevent lock contention in migratepage
 
 = Bug fix
 - miss to load written inode flag
 - fix worst case victim selection in GC
 - freezeable GC and discard threads for Android battery life
 - sanitize f2fs metadata to deal with security hole
 - clean up sysfs-related code and docs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAllj6fMACgkQQBSofoJI
 UNJ6Ng/+PqdGV/b6KroYIXI/scFx/1t87/0W+rY9tyLr1jX7nIHn9KLPjeDdvdlk
 5vEeZ/dGfW8wSI+ESzscvKberG2QlOPwJRyTB4jWR+bLatwzg7YjEblz+RX4/wfJ
 jKjnR7M//gRdhHdqA0xXrqguAjPbcEDK2RiVbhioMjWbZ/77j0IjcRokjMYdEf0m
 cJc2oMXFtlo+DJ1h9/8BmwQPTI9FfVdgbkPFTTJzV0ydQnBdxcAigrzwYZhPOVv0
 n2M1dKOiQewB4OADMuepZLFqJheItlgG9wlvEjGq7zTd5epHXRIqhM6h9GikQVb9
 YKAkajlKfWcwEXaEcVXtsMHC9x69Yf8xxOSQ1VrhypSUNbaynC9LDsErJx6yrF3P
 XC5baiqXsd/btg7tfrHJjk3gI+ck97d6TrTfUVR91X+1Tpkz7cyB226WxFKbyOG3
 EYCFVMbrIN2CaHHt1xWIT2zCfX5w9ycp8kFjY6jPi0OOZrKXpFw+1AwwTu9kn4xJ
 iuUc8pmc0/FyPqokmLef4Qp/RRM83+f+nzW/y//lkEf3nMn6qlHzNI1RAxXnBvGV
 DMXzuJDcJcHGcSDr7mWyKkm6gYcak/E4DdQLQqJ6VCt6KCdCEXP/XDlig5ey5ODY
 uGEr1QhXIpiYAON45HUi3gmytB3J3ZdzzpsG1PEco4+hjSuFhyE=
 =N4GZ
 -----END PGP SIGNATURE-----

Merge tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've added new features such as disk quota and statx,
  and modified internal bio management flow to merge more IOs depending
  on block types. We've also made internal threads freezeable for
  Android battery life. In addition to them, there are some patches to
  avoid lock contention as well as a couple of deadlock conditions.

  Enhancements:
   - support usrquota, grpquota, and statx
   - manage DATA/NODE typed bios separately to serialize more IOs
   - modify f2fs_lock_op/wio_mutex to avoid lock contention
   - prevent lock contention in migratepage

  Bug fixes:
   - fix missing load of written inode flag
   - fix worst case victim selection in GC
   - freezeable GC and discard threads for Android battery life
   - sanitize f2fs metadata to deal with security hole
   - clean up sysfs-related code and docs"

* tag 'for-f2fs-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (59 commits)
  f2fs: support plain user/group quota
  f2fs: avoid deadlock caused by lock order of page and lock_op
  f2fs: use spin_{,un}lock_irq{save,restore}
  f2fs: relax migratepage for atomic written page
  f2fs: don't count inode block in in-memory inode.i_blocks
  Revert "f2fs: fix to clean previous mount option when remount_fs"
  f2fs: do not set LOST_PINO for renamed dir
  f2fs: do not set LOST_PINO for newly created dir
  f2fs: skip ->writepages for {mete,node}_inode during recovery
  f2fs: introduce __check_sit_bitmap
  f2fs: stop gc/discard thread in prior during umount
  f2fs: introduce reserved_blocks in sysfs
  f2fs: avoid redundant f2fs_flush after remount
  f2fs: report # of free inodes more precisely
  f2fs: add ioctl to do gc with target block address
  f2fs: don't need to check encrypted inode for partial truncation
  f2fs: measure inode.i_blocks as generic filesystem
  f2fs: set CP_TRIMMED_FLAG correctly
  f2fs: require key for truncate(2) of encrypted file
  f2fs: move sysfs code from super.c to fs/f2fs/sysfs.c
  ...
2017-07-10 14:29:45 -07:00
Linus Torvalds 7cee9384cb Fix up over-eager 'wait_queue_t' renaming
Commit ac6424b981 ("sched/wait: Rename wait_queue_t =>
wait_queue_entry_t") had scripted the renaming incorrectly, and didn't
actually check that the 'wait_queue_t' was a full token.

As a result, it also triggered on 'wait_queue_token', and renamed that
to 'wait_queue_entry_token' entry in the autofs4 packet structure
definition too.  That was entirely incorrect, and not intended.

The end result built fine when building just the kernel - because
everything had been renamed consistently there - but caused problems in
user space because the "struct autofs_packet_missing" type is exported
as part of the uapi.

This scripts it all back again:

    git grep -lw wait_queue_entry_token |
        xargs sed -i 's/wait_queue_entry_token/wait_queue_token/g'

and checks the end result.

Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Fixes: ac6424b981 ("sched/wait: Rename wait_queue_t => wait_queue_entry_t")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 11:40:19 -07:00
Linus Torvalds 642338ba33 Changes for 4.13:
- Avoid quotacheck deadlocks
 - Fix transaction overflows when bunmapping fragmented files
 - Refactor directory readahead
 - Allow admin to configure if ASSERT is fatal
 - Improve transaction usage detail logging during overflows
 - Minor cleanups
 - Don't leak log items when the log shuts down
 - Remove double-underscore typedefs
 - Various preparation for online scrubbing
 - Introduce new error injection configuration sysfs knobs
 - Refactor dq_get_next to use extent map directly
 - Fix problems with iterating the page cache for unwritten data
 - Implement SEEK_{HOLE,DATA} via iomap
 - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
 - Don't use MAXPATHLEN to check on-disk symlink target lengths
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZYDw4AAoJEPh/dxk0SrTr2IMP/3JLeygIDtKBBVRPvlCmEXQC
 j8w1C/ntn46zZKQ8l14fAFV4HV2d+KJWf8+yDuPuGdMXJfPeKZf95otYhnSx/9Th
 MvCH7Nzg63yjEGqXpBkfIVr/GT0KTx28lxiqNViChr7XiXWookgf3SSLINO+vU4J
 L2jgLqieJfijiHTBs4qGCQPDwSXVoSOi5XCCQWDYQrXz6DI5UEJc70U53WkH4tRu
 RctOgp1lralwEO0PhfomD3m/Gk94taE/4ZpX/j/5Y4tvH/yh5aY3/KTCLm6+mYT3
 rgMpmg5hmm+UiCTNoTnQ5RxzGZWCfI1I9FZ3HqDsbhmFtaWh32ti0dEEDYsF8Opj
 ARnTty3cRx41LH9dULrVWdwW105AHgwEz8/OZlG0JOca9qzj9GKERMg/hpHINAKN
 TrBlkweg86LWZDy23udZJ/v35svNqSFsqL1yV8j5dXyBi+Yi2CGfU27zbBwnj4Jk
 047l+OuRbBnEOUULqJTEVBY3euoclwl/yQrW2m409s7vPGkGQBLuFCsDKQdnvJ/A
 D7frZqH8XypwnhFOkKybUnBkn4P7vZ2sEuCIZMsrH5k/ys8XyEkaBaOurjvMBOKA
 vLIMD6RXDWrFbOoovfK/stEM6/UFoQkgMhBe7vB9EXk1AjM8NYyWZgp5BkHtytC7
 qa6GRjtGefhc67hbwXJd
 =/GZI
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull XFS updates from Darrick Wong:
 "Here are some changes for you for 4.13. For the most part it's fixes
  for bugs and deadlock problems, and preparation for online fsck in
  some future merge window.

   - Avoid quotacheck deadlocks

   - Fix transaction overflows when bunmapping fragmented files

   - Refactor directory readahead

   - Allow admin to configure if ASSERT is fatal

   - Improve transaction usage detail logging during overflows

   - Minor cleanups

   - Don't leak log items when the log shuts down

   - Remove double-underscore typedefs

   - Various preparation for online scrubbing

   - Introduce new error injection configuration sysfs knobs

   - Refactor dq_get_next to use extent map directly

   - Fix problems with iterating the page cache for unwritten data

   - Implement SEEK_{HOLE,DATA} via iomap

   - Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA

   - Don't use MAXPATHLEN to check on-disk symlink target lengths"

* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
  xfs: don't crash on unexpected holes in dir/attr btrees
  xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
  xfs: fix contiguous dquot chunk iteration livelock
  xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
  vfs: Add iomap_seek_hole and iomap_seek_data helpers
  vfs: Add page_cache_seek_hole_data helper
  xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
  xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
  xfs: Check for m_errortag initialization in xfs_errortag_test
  xfs: grab dquots without taking the ilock
  xfs: fix semicolon.cocci warnings
  xfs: Don't clear SGID when inheriting ACLs
  xfs: free cowblocks and retry on buffered write ENOSPC
  xfs: replace log_badcrc_factor knob with error injection tag
  xfs: convert drop_writes to use the errortag mechanism
  xfs: remove unneeded parameter from XFS_TEST_ERROR
  xfs: expose errortag knobs via sysfs
  xfs: make errortag a per-mountpoint structure
  xfs: free uncommitted transactions during log recovery
  xfs: don't allow bmap on rt files
  ...
2017-07-10 10:51:53 -07:00
Linus Torvalds 6618a24ab2 Merge branch 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
 "This fixes a user-visible bug introduced by the nowait-aio patches
  merged in this cycle"

* 'nowait-aio-btrfs-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: nowait aio: Correct assignment of pos
2017-07-10 10:27:48 -07:00
Goldwyn Rodrigues ff0fa73247 btrfs: nowait aio: Correct assignment of pos
Assigning pos for usage early messes up in append mode, where the pos is
re-assigned in generic_write_checks(). Assign pos later to get the
correct position to write from iocb->ki_pos.

Since check_can_nocow also uses the value of pos, we shift
generic_write_checks() before check_can_nocow(). Checks with IOCB_DIRECT
are present in generic_write_checks(), so checking for IOCB_NOWAIT is
enough.

Also, put locking sequence in the fast path.

This fixes a user visible bug, as reported:

"apparently breaks several shell related features on my system.
In zsh history stopped working, because no new entries are added
anymore.
I fist noticed the issue when I tried to build mplayer. It uses a shell
script to generate a help_mp.h file:
[...]

Here is a simple testcase:

 % echo "foo" >> test
 % echo "foo" >> test
 % cat test
 foo
 %
"

Fixes: edf064e7c6 ("btrfs: nowait aio support")
CC: Jens Axboe <axboe@kernel.dk>
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Link: https://lkml.kernel.org/r/20170704042306.GA274@x4
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-10 15:29:44 +02:00
Christos Gkekas 68a6afa7fa cifs: Clean up unused variables in smb2pdu.c
There are multiple unused variables struct TCP_Server_Info *server
defined in many methods in smb2pdu.c. They should be removed and related
logic simplified.

Signed-off-by: Christos Gkekas <chris.gekas@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-09 23:38:00 -05:00
David Howells d3e3b7eac8 afs: Add metadata xattrs
Add xattrs to allow the user to get/set metadata in lieu of having pioctl()
available.  The following xattrs are now available:

 - "afs.cell"

   The name of the cell in which the vnode's volume resides.

 - "afs.fid"

   The volume ID, vnode ID and vnode uniquifier of the file as three hex
   numbers separated by colons.

 - "afs.volume"

   The name of the volume in which the vnode resides.

For example:

	# getfattr -d -m ".*" /mnt/scratch
	getfattr: Removing leading '/' from absolute path names
	# file: mnt/scratch
	afs.cell="mycell.myorg.org"
	afs.fid="10000b:1:1"
	afs.volume="scratch"

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09 14:40:12 -07:00
Marc Dionne fd2498211a afs: Ignore AFS_ACE_READ and AFS_ACE_WRITE for directories
The AFS_ACE_READ and AFS_ACE_WRITE permission bits should not
be used to make access decisions for the directory itself.  They
are meant to control access for the objects contained in that
directory.

Reading a directory is allowed if the AFS_ACE_LOOKUP bit is set.
This would cause an incorrect access denied error for a directory
with AFS_ACE_LOOKUP but not AFS_ACE_READ.

The AFS_ACE_WRITE bit does not allow operations that modify the
directory.  For a directory with AFS_ACE_WRITE but neither
AFS_ACE_INSERT nor AFS_ACE_DELETE, this would result in trying
operations that would ultimately be denied by the server.

Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09 14:40:12 -07:00
Linus Torvalds bc2c6421cb The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
 directory entries (assuming ~64 byte file names; in practice, users
 will run into practical performance limits first.)  This feature was
 originally written by the Lustre team, and credit goes to Artem
 Blagodarenko from Seagate for getting this feature upstream.
 
 The second major major feature allows ext4 to support extended
 attribute values up to 64k.  This feature was also originally from
 Lustre, and has been enhanced by Tahsin Erdogan from Google with a
 deduplication feature so that if multiple files have the same xattr
 value (for example, Windows ACL's stored by Samba), only one copy will
 be stored on disk for encoding and caching efficiency.
 
 We also have the usual set of bug fixes, cleanups, and optimizations.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhl5AACgkQ8vlZVpUN
 gaOiNQf+L23sT9KIQmFwQP38vkBVw67Eo7gBfevmk7oqQLiRppT5mmLzW8EWEDxR
 PVaDQXvSZi18wSCAAcCd1ZqeIZk0P6tst0ufnIT60tGlZdUlwSLyrqvV/30axR2g
 6kcnv90ZszrQNx5U8q8bMzNrs1KtyPHFCRzavFsBX11WezNSpWnH2in/uxO+t9Jy
 F2zlrLUrE2m9AVMH48Dh6LbeaB6pqgr4k3jq1jG4Iqb2h9xgU8OKhs8gL07YS+Qi
 5A7s8GIvYQSoZUO9DOOie2f1zhpO0KrhXchyZTJukVQH7TsmFxoSh0vhXnP1Bohu
 CNLV6dzetDT0VfmPr1WhVe7lhZeeVw==
 =FFkF
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "The first major feature for ext4 this merge window is the largedir
  feature, which allows ext4 directories to support over 2 billion
  directory entries (assuming ~64 byte file names; in practice, users
  will run into practical performance limits first.) This feature was
  originally written by the Lustre team, and credit goes to Artem
  Blagodarenko from Seagate for getting this feature upstream.

  The second major major feature allows ext4 to support extended
  attribute values up to 64k. This feature was also originally from
  Lustre, and has been enhanced by Tahsin Erdogan from Google with a
  deduplication feature so that if multiple files have the same xattr
  value (for example, Windows ACL's stored by Samba), only one copy will
  be stored on disk for encoding and caching efficiency.

  We also have the usual set of bug fixes, cleanups, and optimizations"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
  ext4: fix spelling mistake: "prellocated" -> "preallocated"
  ext4: fix __ext4_new_inode() journal credits calculation
  ext4: skip ext4_init_security() and encryption on ea_inodes
  fs: generic_block_bmap(): initialize all of the fields in the temp bh
  ext4: change fast symlink test to not rely on i_blocks
  ext4: require key for truncate(2) of encrypted file
  ext4: don't bother checking for encryption key in ->mmap()
  ext4: check return value of kstrtoull correctly in reserved_clusters_store
  ext4: fix off-by-one fsmap error on 1k block filesystems
  ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
  ext4: return EIO on read error in ext4_find_entry
  ext4: forbid encrypting root directory
  ext4: send parallel discards on commit completions
  ext4: avoid unnecessary stalls in ext4_evict_inode()
  ext4: add nombcache mount option
  ext4: strong binding of xattr inode references
  ext4: eliminate xattr entry e_hash recalculation for removes
  ext4: reserve space for xattr entries/names
  quota: add get_inode_usage callback to transfer multi-inode charges
  ext4: xattr inode deduplication
  ...
2017-07-09 09:31:22 -07:00
Linus Torvalds 58f587cb0b Add support for 128-bit AES and some cleanups to fscrypt
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhktgACgkQ8vlZVpUN
 gaOQIQf+KM2s46sxxEl0/hjdBXR4OxTmSS2/0900NPyg7JHKlL8PdYslOyvMiKjo
 wEi+YPwwQgbHtxhI1VINfV/q12MZHwvmFOfD9NzjrISwfmfsKj0dBgZDAfBH82sK
 12wKgUxA8xJ4P+Xdvnz2PokRcFCsh1YUr5IUQkP3JR2RZOxNFUj42QwPJ2yWzqxO
 MsnepMjIHsxvXZi0E7sPjRaoFsh3DDeLmNl8sX6INodC7hxJ1LotYKqJhA4stQpB
 ezXY2tabwg3gaOWvWH7THyHhGntbZVDga3iRrKdNLahXN8OBdHktmG75ubiN6tEg
 x80pqQLgr41yIQuJVOuyeh5jLYZrww==
 =i4r9
 -----END PGP SIGNATURE-----

Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt

Pull fscrypt updates from Ted Ts'o:
 "Add support for 128-bit AES and some cleanups to fscrypt"

* tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt:
  fscrypt: make ->dummy_context() return bool
  fscrypt: add support for AES-128-CBC
  fscrypt: inline fscrypt_free_filename()
2017-07-09 09:03:31 -07:00
Tommy Nguyen 799ce1dbb9 befs: add kernel-doc formatting for befs_bt_read_super()
fs/befs/TODO mentions some comments needing conversion to Kernel-Doc
formatting. This patch changes the comment describing befs_bt_read_super().

Signed-off-by: Tommy Nguyen <remyabel@gmail.com>
Signed-off-by: Luis de Bethencourt <luisbg@kernel.org>
2017-07-09 10:42:50 +01:00
Chao Yu 0abd675e97 f2fs: support plain user/group quota
This patch adds to support plain user/group quota.

Change Note by Jaegeuk Kim.

- Use f2fs page cache for quota files in order to consider garbage collection.
  so, quota files are not tolerable for sudden power-cuts, so user needs to do
  quotacheck.

- setattr() calls dquot_transfer which will transfer inode->i_blocks.
  We can't reclaim that during f2fs_evict_inode(). So, we need to count
  node blocks as well in order to match i_blocks with dquot's space.

  Note that, Chao wrote a patch to count inode->i_blocks without inode block.
  (f2fs: don't count inode block in in-memory inode.i_blocks)

- in f2fs_remount, we need to make RW in prior to dquot_resume.

- handle fault_injection case during f2fs_quota_off_umount

- TODO: Project quota

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-07-08 23:12:27 -07:00
Steve French eef914a9eb [SMB3] Improve security, move default dialect to SMB3 from old CIFS
Due to recent publicity about security vulnerabilities in the
much older CIFS dialect, move the default dialect to the
widely accepted (and quite secure) SMB3.0 dialect from the
old default of the CIFS dialect.

We do not want to be encouraging use of less secure dialects,
and both Microsoft and CERT now strongly recommend not using the
older CIFS dialect (SMB Security Best Practices
"recommends disabling SMBv1").

SMB3 is both secure and widely available: in Windows 8 and later,
Samba and Macs.

Users can still choose to explicitly mount with the less secure
dialect (for old servers) by choosing "vers=1.0" on the cifs
mount

Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 18:59:42 -05:00
Steve French 2a38e12053 [SMB3] Remove ifdef since SMB3 (and later) now STRONGLY preferred
Remove the CONFIG_CIFS_SMB2 ifdef and Kconfig option since they
must always be on now.

For various security reasons, SMB3 and later are STRONGLY preferred
over CIFS and older dialects, and SMB3 (and later) will now be
the default dialects so we do not want to allow them to be
ifdeffed out.

In the longer term, we may be able to make older CIFS support
disableable in Kconfig with a new set of #ifdef, but we always
want SMB3 and later support enabled.

Signed-off-by: Steven French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 18:57:07 -05:00
Pavel Shilovsky 511c54a2f6 CIFS: Reconnect expired SMB sessions
According to the MS-SMB2 spec (3.2.5.1.6) once the client receives
STATUS_NETWORK_SESSION_EXPIRED error code from a server it should
reconnect the current SMB session. Currently the client doesn't do
that. This can result in subsequent client requests failing by
the server. The patch adds an additional logic to the demultiplex
thread to identify expired sessions and reconnect them.

Cc: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-08 17:23:10 -05:00
Pavel Shilovsky 4395d484b9 CIFS: Display SMB2 error codes in the hex format
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-07-08 17:23:10 -05:00
Shirish Pargaonkar 366ed846df cifs: Use smb 2 - 3 and cifsacl mount options setacl function
Added set acl function. Very similar to set cifs acl function for smb1.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 17:23:10 -05:00
Shirish Pargaonkar dac953401c cifs: prototype declaration and definition to set acl for smb 2 - 3 and cifsacl mount options
Modified current set info function to accommodate multiple info types and
additional information.

Added cifs acl specific function to invoke set info functionality.

Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2017-07-08 15:08:38 -05:00
Linus Torvalds b8d4c1f9f4 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc filesystem updates from Al Viro:
 "Assorted normal VFS / filesystems stuff..."

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  dentry name snapshots
  Make statfs properly return read-only state after emergency remount
  fs/dcache: init in_lookup_hashtable
  minix: Deinline get_block, save 2691 bytes
  fs: Reorder inode_owner_or_capable() to avoid needless
  fs: warn in case userspace lied about modprobe return
2017-07-08 10:50:54 -07:00
Linus Torvalds 46ace66b3b Merge branch 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull __copy_in_user removal from Al Viro:
 "There used to be 6 places in the entire tree calling __copy_in_user(),
  all of them bogus.

  Four got killed off in work.drm branch, this takes care of the
  remaining ones and kills the definition of that sucker"

* 'work.__copy_in_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kill __copy_in_user()
  sanitize do_i2c_smbus_ioctl()
2017-07-08 10:15:02 -07:00
Linus Torvalds cee37d83e6 Merge branch 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull read/write fix from Al Viro:
 "file_start_write()/file_end_write() got mixed into vfs_iter_write() by
  accident; that's a deadlock for all existing callers - they already do
  that, some - quite a bit outside.

  Easily fixed, fortunately"

* 'work.read_write' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move file_{start,end}_write() out of do_iter_write()
2017-07-07 21:48:15 -07:00
Kees Cook da029c11e6 exec: Limit arg stack to at most 75% of _STK_LIM
To avoid pathological stack usage or the need to special-case setuid
execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07 20:05:08 -07:00
Linus Torvalds 088737f44b Writeback error handling fixes (pile #2)
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJZXhmCAAoJEAAOaEEZVoIVpRkP/1qlYn3pq6d5Kuz84pejOmlL
 5jbkS/cOmeTxeUU4+B1xG8Lx7bAk8PfSXQOADbSJGiZd0ug95tJxplFYIGJzR/tG
 aNMHeu/BVKKhUKORGuKR9rJKtwC839L/qao+yPBo5U3mU4L73rFWX8fxFuhSJ8HR
 hvkgBu3Hx6GY59CzxJ8iJzj+B+uPSFrNweAk0+0UeWkBgTzEdiGqaXBX4cHIkq/5
 hMoCG+xnmwHKbCBsQ5js+YJT+HedZ4lvfjOqGxgElUyjJ7Bkt/IFYOp8TUiu193T
 tA4UinDjN8A7FImmIBIftrECmrAC9HIGhGZroYkMKbb8ReDR2ikE5FhKEpuAGU3a
 BXBgX2mPQuArvZWM7qeJCkxV9QJ0u/8Ykbyzo30iPrICyrzbEvIubeB/mDA034+Z
 Z0/z8C3v7826F3zP/NyaQEojUgRq30McMOIS8GMnx15HJwRsRKlzjfy9Wm4tWhl0
 t3nH1jMqAZ7068s6rfh/oCwdgGOwr5o4hW/bnlITzxbjWQUOnZIe7KBxIezZJ2rv
 OcIwd5qE8PNtpagGj5oUbnjGOTkERAgsMfvPk5tjUNt28/qUlVs2V0aeo47dlcsh
 oYr8WMOIzw98Rl7Bo70mplLrqLD6nGl0LfXOyUlT4STgLWW4ksmLVuJjWIUxcO/0
 yKWjj9wfYRQ0vSUqhsI5
 =3Z93
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux

Pull Writeback error handling updates from Jeff Layton:
 "This pile represents the bulk of the writeback error handling fixes
  that I have for this cycle. Some of the earlier patches in this pile
  may look trivial but they are prerequisites for later patches in the
  series.

  The aim of this set is to improve how we track and report writeback
  errors to userland. Most applications that care about data integrity
  will periodically call fsync/fdatasync/msync to ensure that their
  writes have made it to the backing store.

  For a very long time, we have tracked writeback errors using two flags
  in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a
  writeback error occurs (via mapping_set_error) and are cleared as a
  side-effect of filemap_check_errors (as you noted yesterday). This
  model really sucks for userland.

  Only the first task to call fsync (or msync or fdatasync) will see the
  error. Any subsequent task calling fsync on a file will get back 0
  (unless another writeback error occurs in the interim). If I have
  several tasks writing to a file and calling fsync to ensure that their
  writes got stored, then I need to have them coordinate with one
  another. That's difficult enough, but in a world of containerized
  setups that coordination may even not be possible.

  But wait...it gets worse!

  The calls to filemap_check_errors can be buried pretty far down in the
  call stack, and there are internal callers of filemap_write_and_wait
  and the like that also end up clearing those errors. Many of those
  callers ignore the error return from that function or return it to
  userland at nonsensical times (e.g. truncate() or stat()). If I get
  back -EIO on a truncate, there is no reason to think that it was
  because some previous writeback failed, and a subsequent fsync() will
  (incorrectly) return 0.

  This pile aims to do three things:

   1) ensure that when a writeback error occurs that that error will be
      reported to userland on a subsequent fsync/fdatasync/msync call,
      regardless of what internal callers are doing

   2) report writeback errors on all file descriptions that were open at
      the time that the error occurred. This is a user-visible change,
      but I think most applications are written to assume this behavior
      anyway. Those that aren't are unlikely to be hurt by it.

   3) document what filesystems should do when there is a writeback
      error. Today, there is very little consistency between them, and a
      lot of cargo-cult copying. We need to make it very clear what
      filesystems should do in this situation.

  To achieve this, the set adds a new data type (errseq_t) and then
  builds new writeback error tracking infrastructure around that. Once
  all of that is in place, we change the filesystems to use the new
  infrastructure for reporting wb errors to userland.

  Note that this is just the initial foray into cleaning up this mess.
  There is a lot of work remaining here:

   1) convert the rest of the filesystems in a similar fashion. Once the
      initial set is in, then I think most other fs' will be fairly
      simple to convert. Hopefully most of those can in via individual
      filesystem trees.

   2) convert internal waiters on writeback to use errseq_t for
      detecting errors instead of relying on the AS_* flags. I have some
      draft patches for this for ext4, but they are not quite ready for
      prime time yet.

  This was a discussion topic this year at LSF/MM too. If you're
  interested in the gory details, LWN has some good articles about this:

      https://lwn.net/Articles/718734/
      https://lwn.net/Articles/724307/"

* tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
  btrfs: minimal conversion to errseq_t writeback error reporting on fsync
  xfs: minimal conversion to errseq_t writeback error reporting
  ext4: use errseq_t based error handling for reporting data writeback errors
  fs: convert __generic_file_fsync to use errseq_t based reporting
  block: convert to errseq_t based writeback error tracking
  dax: set errors in mapping when writeback fails
  Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: new infrastructure for writeback error handling and reporting
  lib: add errseq_t type and infrastructure for handling it
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  jbd2: don't clear and reset errors after waiting on writeback
  buffer: set errors in mapping at the time that the error occurs
  fs: check for writeback errors after syncing out buffers in generic_file_fsync
  buffer: use mapping_set_error instead of setting the flag
  mm: fix mapping_set_error call in me_pagecache_dirty
2017-07-07 19:38:17 -07:00