Commit graph

38407 commits

Author SHA1 Message Date
Linus Torvalds 6929c35897 LLVMLinux patches for v3.18
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUPOQPAAoJEHKvRublQaWiZnQP/0bDfPvhaNKZtmDKYzbqQm9x
 Gh3fB53d7TtIei///5eD/LUGObg0Ze8g2k8aAC05hcq5Be8Haitma3iDDSvJq6ig
 umYU9BWkv47BHQy0gAVMXyaxHocICq7W9bnOvU4SSEW/sWeneBllyxgCfV7EKSTd
 B/OXP+Ovr4FtjAweOACCN8b0M9w4wdxNKfDNV+WDIHAddYBngxrwq7zASKNNCx2N
 0u9sXIdTp0Fvxmyx/lYLC5NXN0CUDjB3Ffdx3+eehrBp2lT6JdlkYU403c85cIMP
 oIPJVKFbOnM04MfCjhFNTrK9OtC2eD6PoWq+FLL0UJx3YW5HkbLsiHGm/2UTJ2Z1
 5QOwDebMxlvrb6f6Gv846ADl7YcByiXkieTDHRlOnplVRNV5Sj8UWAgq+zyq7sWq
 2uRuW2UvKx7vYoAwKRwCaIoqpIe3NIvZQzE7C9mGOprIawZ5e0YJzaR6OoBs4Y8i
 gmBeFx266URJun7isy1R7JJsMjYzxbEXju9zH/SkghbLHnf8yqafIG+pG1GD7n5R
 o2C/5TVXjmEhIoDn8j2ZozaElX4mD7REKILpIGaE4XltExNTexq8neRo9D3ajzif
 N5RrMCAkBzIxMz83evDppe3ObtkEaf0K43VCO3AVQ2g9jXg7ttKhR2hb8HRqCMHe
 Lp3d8qKyZyL/HZ5F62yX
 =zJyI
 -----END PGP SIGNATURE-----

Merge tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel

Pull LLVM updates from Behan Webster:
 "These patches remove the use of VLAIS using a new SHASH_DESC_ON_STACK
  macro.

  Some of the previously accepted VLAIS removal patches haven't used
  this macro.  I will push new patches to consistently use this macro in
  all those older cases for 3.19"

[ More LLVM patches coming in through subsystem trees, and LLVM itself
  needs some fixes that are already in many distributions but not in
  released versions of LLVM.  Some day this will all "just work"  - Linus ]

* tag 'llvmlinux-for-v3.18' of git://git.linuxfoundation.org/llvmlinux/kernel:
  crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c
  security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c
  crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c
  crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c
  crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt
  crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c
  crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c
  crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c
  crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c
  crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c
  btrfs: LLVMLinux: Remove VLAIS
  crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code
2014-10-15 07:30:52 +02:00
Linus Torvalds 6b04908166 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph updates from Sage Weil:
 "There is the long-awaited discard support for RBD (Guangliang Zhao,
  Josh Durgin), a pile of RBD bug fixes that didn't belong in late -rc's
  (Ilya Dryomov, Li RongQing), a pile of fs/ceph bug fixes and
  performance and debugging improvements (Yan, Zheng, John Spray), and a
  smattering of cleanups (Chao Yu, Fabian Frederick, Joe Perches)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits)
  ceph: fix divide-by-zero in __validate_layout()
  rbd: rbd workqueues need a resque worker
  libceph: ceph-msgr workqueue needs a resque worker
  ceph: fix bool assignments
  libceph: separate multiple ops with commas in debugfs output
  libceph: sync osd op definitions in rados.h
  libceph: remove redundant declaration
  ceph: additional debugfs output
  ceph: export ceph_session_state_name function
  ceph: include the initial ACL in create/mkdir/mknod MDS requests
  ceph: use pagelist to present MDS request data
  libceph: reference counting pagelist
  ceph: fix llistxattr on symlink
  ceph: send client metadata to MDS
  ceph: remove redundant code for max file size verification
  ceph: remove redundant io_iter_advance()
  ceph: move ceph_find_inode() outside the s_mutex
  ceph: request xattrs if xattr_version is zero
  rbd: set the remaining discard properties to enable support
  rbd: use helpers to handle discard for layered images correctly
  ...
2014-10-15 06:46:01 +02:00
Linus Torvalds ce9d7f7b45 Merge branch 'CVE-2014-7970' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux
Pull pivot_root() fix from Andy Lutomirski.

Prevent a leak of unreachable mounts.

* 'CVE-2014-7970' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  mnt: Prevent pivot_root from creating a loop in the mount tree
2014-10-15 06:43:27 +02:00
Eric W. Biederman 0d0826019e mnt: Prevent pivot_root from creating a loop in the mount tree
Andy Lutomirski recently demonstrated that when chroot is used to set
the root path below the path for the new ``root'' passed to pivot_root
the pivot_root system call succeeds and leaks mounts.

In examining the code I see that starting with a new root that is
below the current root in the mount tree will result in a loop in the
mount tree after the mounts are detached and then reattached to one
another.  Resulting in all kinds of ugliness including a leak of that
mounts involved in the leak of the mount loop.

Prevent this problem by ensuring that the new mount is reachable from
the current root of the mount tree.

[Added stable cc.  Fixes CVE-2014-7970.  --Andy]

Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
2014-10-14 14:27:19 -07:00
Neale Ferguson c07127b48c dlm: fix missing endian conversion of rcom_status flags
The flags are already converted to le when being sent,
but are not being converted back to cpu when received.

Signed-off-by: Neale Ferguson <neale@sinenomine.net>
Signed-off-by: David Teigland <teigland@redhat.com>
2014-10-14 15:11:48 -05:00
Yan, Zheng 0bc62284ee ceph: fix divide-by-zero in __validate_layout()
The 'stripe_unit' field is 64 bits, casting it to 32 bits can result zero.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 12:57:05 -07:00
Fabian Frederick ab6c2c3ebe ceph: fix bool assignments
Fix some coccinelle warnings:
fs/ceph/caps.c:2400:6-10: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2401:6-15: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2402:6-17: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2403:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2404:6-22: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2405:6-19: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2440:4-20: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2469:3-16: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2490:2-18: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2519:3-7: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2549:3-12: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2575:2-6: WARNING: Assignment of bool to 0/1
fs/ceph/caps.c:2589:3-7: WARNING: Assignment of bool to 0/1

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
2014-10-14 12:57:04 -07:00
John Spray 14ed97033d ceph: additional debugfs output
MDS session state and client global ID is
useful instrumentation when testing.

Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14 12:57:01 -07:00
John Spray a687ecaf50 ceph: export ceph_session_state_name function
...so that it can be used from the ceph debugfs
code when dumping session info.

Signed-off-by: John Spray <john.spray@redhat.com>
2014-10-14 12:56:50 -07:00
Yan, Zheng b1ee94aa59 ceph: include the initial ACL in create/mkdir/mknod MDS requests
Current code set new file/directory's initial ACL in a non-atomic
manner.
Client first sends request to MDS to create new file/directory, then set
the initial ACL after the new file/directory is successfully created.

The fix is include the initial ACL in create/mkdir/mknod MDS requests.
So MDS can handle creating file/directory and setting the initial ACL in
one request.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:56:49 -07:00
Yan, Zheng 25e6bae356 ceph: use pagelist to present MDS request data
Current code uses page array to present MDS request data. Pages in the
array are allocated/freed by caller of ceph_mdsc_do_request(). If request
is interrupted, the pages can be freed while they are still being used by
the request message.

The fix is use pagelist to present MDS request data. Pagelist is
reference counted.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:56:49 -07:00
Yan, Zheng e4339d28f6 libceph: reference counting pagelist
this allow pagelist to present data that may be sent multiple times.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 12:56:48 -07:00
Yan, Zheng 0abb43dcac ceph: fix llistxattr on symlink
only regular file and directory have vxattrs.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 12:56:48 -07:00
John Spray dbd0c8bf79 ceph: send client metadata to MDS
Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax,
which includes additional client metadata to allow
the MDS to report on clients by user-sensible names
like hostname.

Signed-off-by: John Spray <john.spray@redhat.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 12:56:47 -07:00
Chao Yu a4483e8a42 ceph: remove redundant code for max file size verification
Both ceph_update_writeable_page and ceph_setattr will verify file size
with max size ceph supported.
There are two caller for ceph_update_writeable_page, ceph_write_begin and
ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
chance to change file size when mmap. Likewise we have already verified the size
in inode_change_ok when we call ceph_setattr.
So let's remove the redundant code for max file size verification.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Reviewed-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 21:03:40 +04:00
Yan, Zheng 3b70b388e3 ceph: remove redundant io_iter_advance()
ceph_sync_read and generic_file_read_iter() have already advanced the
IO iterator.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 21:03:39 +04:00
Yan, Zheng 6cd3bcad0d ceph: move ceph_find_inode() outside the s_mutex
ceph_find_inode() may wait on freeing inode, using it inside the s_mutex
may cause deadlock. (the freeing inode is waiting for OSD read reply, but
dispatch thread is blocked by the s_mutex)

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 21:03:39 +04:00
Yan, Zheng 508b32d866 ceph: request xattrs if xattr_version is zero
Following sequence of events can happen.
  - Client releases an inode, queues cap release message.
  - A 'lookup' reply brings the same inode back, but the reply
    doesn't contain xattrs because MDS didn't receive the cap release
    message and thought client already has up-to-data xattrs.

The fix is force sending a getattr request to MDS if xattrs_version
is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client
does not have xattr.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 21:03:38 +04:00
Yan, Zheng 03974e8177 ceph: make sure request isn't in any waiting list when kicking request.
we may corrupt waiting list if a request in the waiting list is kicked.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 21:03:24 +04:00
Yan, Zheng 656e438294 ceph: protect kick_requests() with mdsc->mutex
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2014-10-14 21:03:24 +04:00
Yan, Zheng 5d23371fdb ceph: trim unused inodes before reconnecting to recovering MDS
So the recovering MDS does not need to fetch these ununsed inodes during
cache rejoin. This may reduce MDS recovery time.

Signed-off-by: Yan, Zheng <zyan@redhat.com>
2014-10-14 21:03:22 +04:00
Martin K. Petersen e19a8a0ad2 block: Remove REQ_KERNEL
REQ_KERNEL is no longer used. Remove it and drop the redundant uio
argument to nfs_file_direct_{read,write}.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2014-10-14 09:00:44 -06:00
Vinícius Tinti 0458a953d8 btrfs: LLVMLinux: Remove VLAIS
Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent.  This patch instead allocates the appropriate amount of
memory using a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Vinícius Tinti <viniciustinti@gmail.com>
Reviewed-by: Jan-Simon Möller <dl9pf@gmx.de>
Reviewed-by: Mark Charlebois <charlebm@gmail.com>
Signed-off-by: Behan Webster <behanw@converseincode.com>
Acked-by: Chris Mason <clm@fb.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Cc: "David S. Miller" <davem@davemloft.net>
2014-10-14 10:51:22 +02:00
Linus Torvalds 1b5a5f59e3 FS-Cache fixes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAVDwD3ROxKuMESys7AQLayg//Tmdi4eLzcky/HcOfAoVIY3B5Wvs1MBbN
 3HhaYWKDeJvWxFmRDfQK0c1dyjBA2Xe7bPhdwQ8S9epAWAoW6D4g3Mg2+YReGLCK
 U/CcrMHN77RSydTG0Mj/Z99IynSdf9rwdNrCEy8NiNkGe8Z/JCFPpZurRCc4PL44
 4miTUq3ESMTGkUsa9BH+T0ngEka2ZdwnmzlYkdzeqmjmlbFx8RxcEewBeAoAlU73
 eihKKyX+1uWX/2DmJol5NtZx+BbNkFsO+pX+s+70TsbjiyILCAmgh5meTpkGsDrW
 iJGcgxwhcmyq1aTPcHRmXeNsVenbqRefGUtz7B5Q0x1Uk+ofRYfVVdiyTS2juGbC
 DFGyNBUcFqsmbSMxM+yZGSzgR9KbzoZHDR/ppbJfMqIoe+oGju/NE+AZ6Q3f2/Es
 AIGc8imc96QU08OnrZtreZxfgFMcFxBoGHvAM9AUr1ue80SWhVRZjwYx/JcIP7Cm
 TKyilgb5hfxJ7zon+JuHSqttpeG3zOTjjhcKDmJlybYkKlTeRXm6ZcKVrro5d2+z
 GLnH32HQRJvXBZslymqb7OgkxIW4ySO3PcAWTosUv9+zG0BPR1mB0NVQrSLEPk4L
 JHA+Mjp8O37pN3kRantVNHk73t0z4qkbi8Ixft0yAus9qNNFMeKh+7NbBRjxUZAU
 ARcAbvVMyT0=
 =RtLr
 -----END PGP SIGNATURE-----

Merge tag 'fscache-fixes-20141013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull fs-cache fixes from David Howells:
 "Two fixes for bugs in CacheFiles and a cleanup in FS-Cache"

* tag 'fscache-fixes-20141013' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  fs/fscache/object-list.c: use __seq_open_private()
  CacheFiles: Fix incorrect test for in-memory object collision
  CacheFiles: Handle object being killed before being set up
2014-10-14 08:40:15 +02:00
Linus Torvalds b11445f830 * Fix for a theoretical race condition which could lead to a situation when
UBIFS is unable to mount a file-system (Hujianyang)
 * Few fixes for the ubiblock sybsystem, error path fixes
 * The ubiblock subsystem has had the volume size change handling improved
 * Few fixes and nicifications in the fastmap subsystem
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUO9l0AAoJECmIfjd9wqK01TwP/jAcA7GEnxUpQ8UFBZhJEIN0
 0Ad4oDrGShpuEYgYyRFjCstuXErJBhMwImrevJhRmwxaY2fzGqBeDO9YKKGkKDfa
 qjGsQrUaCJgV6qC2iT056ZmI7V/XyZfnZQ4Z8nQbzafoJ3MPbB6ExqBy8CZi8q/6
 A516cen/cnZfHOQ1aqN6gyw2l976IzdJx8v0WOeYaXcvfDMrmfY8mkfh7EahOIVm
 Kz9BVlVRxxfKPCqMpm+xV8KAOsMueOnKy+6zL7rFh+AvLQBACq44BV1HkZtg2avX
 NBAo1RTPumeCht2t4nLJfgc+BJZ7cNpNFAijsWVJxp6umUqlsnbqckAx69O+JE9/
 VZjM1KN1suI0bm01bj6xysGvg+JNTMiZ+HEqiseICSWtDbnCT4qDL3MPFgmD9OYh
 9ar92Ku2HeY3DakKNd89gqw0ey28cv4i957KleneYzewcfFQ5pC/dp4thcDWa5fH
 AHoblC4ShmcURDPYsIKRZsiTUf/uf3iLFIWAGJBDnSRg4dzzjoJkenz4W5ecWFDj
 JokceklSf0zm8qAAdIUXw5Sihza1cnSBAIYBxVR808U+bwkCTOFF5xcTQy6wKf3y
 NBb+ygh/ugps8B2evJEmp6ByLWQZr8j1q7IokZtglKWN2qOTfzyMxzlWl9vOQJYq
 EQytnka5OEEXamr7g1iB
 =2XCN
 -----END PGP SIGNATURE-----

Merge tag 'upstream-3.18-rc1-v2' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS fixes from Artem Bityutskiy:
 - fix for a theoretical race condition which could lead to a situation
   when UBIFS is unable to mount a file-system (Hujianyang)
 - a few fixes for the ubiblock sybsystem, error path fixes
 - the ubiblock subsystem has had the volume size change handling
   improved
 - a few fixes and nicifications in the fastmap subsystem

* tag 'upstream-3.18-rc1-v2' of git://git.infradead.org/linux-ubifs:
  UBI: Fastmap: Calc fastmap size correctly
  UBIFS: Fix trivial typo in power_cut_emulated()
  UBI: Fix trivial typo in __schedule_ubi_work
  UBI: wl: Rename cancel flag to shutdown
  UBI: ubi_eba_read_leb: Remove in vain variable assignment
  UBIFS: Align the dump messages of SB_NODE
  UBI: Fix livelock in produce_free_peb()
  UBI: return on error in rename_volumes()
  UBI: Improve comment on work_sem
  UBIFS: Remove bogus assert
  UBI: Dispatch update notification if the volume is updated
  UBI: block: Add support for the UBI_VOLUME_UPDATED notification
  UBI: block: Fix block device size setting
  UBI: block: fix dereference on uninitialized dev
  UBI: add missing kmem_cache_free() in process_pool_aeb error path
  UBIFS: fix free log space calculation
  UBIFS: fix a race condition
2014-10-14 08:38:54 +02:00
Darrick J. Wong 813d32f913 ext4: check s_chksum_driver when looking for bg csum presence
Convert the ext4_has_group_desc_csum predicate to look for a checksum
driver instead of the metadata_csum flag and change the bg checksum
calculation function to look for GDT_CSUM before taking the crc16
path.

Without this patch, if we mount with ^uninit_bg,^metadata_csum and
later metadata_csum gets turned on by accident, the block group
checksum functions will incorrectly assume that checksumming is
enabled (metadata_csum) but that crc16 should be used
(!s_chksum_driver).  This is totally wrong, so fix the predicate
and the checksum formula selection.

(Granted, if the metadata_csum feature bit gets enabled on a live FS
then something underhanded is going on, but we could at least avoid
writing garbage into the on-disk fields.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Cc: stable@vger.kernel.org
2014-10-14 02:35:49 -04:00
Linus Torvalds 0ef3a56b1c Merge branch 'CVE-2014-7975' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux
Pull do_umount fix from Andy Lutomirski:
 "This fix really ought to be safe.  Inside a mountns owned by a
  non-root user namespace, the namespace root almost always has
  MNT_LOCKED set (if it doesn't, then there's a bug, because rootfs
  could be exposed).  In that case, calling umount on "/" will return
  -EINVAL with or without this patch.

  Outside a userns, this patch will have no effect.  may_mount, required
  by umount, already checks
     ns_capable(current->nsproxy->mnt_ns->user_ns, CAP_SYS_ADMIN)
  so an additional capable(CAP_SYS_ADMIN) check will have no effect.

  That leaves anything that calls umount on "/" in a non-root userns
  while chrooted.  This is the case that is currently broken (it
  remounts ro, which shouldn't be allowed) and that my patch changes to
  -EPERM.  If anything relies on *that*, I'd be surprised"

* 'CVE-2014-7975' of git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux:
  fs: Add a missing permission check to do_umount
2014-10-14 08:35:01 +02:00
Peter Feiner 64e455079e mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared
For VMAs that don't want write notifications, PTEs created for read faults
have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
cleared, then the PTE's softdirty bit will remain clear after subsequent
writes.

Here's a simple code snippet to demonstrate the bug:

  char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                 MAP_ANONYMOUS | MAP_SHARED, -1, 0);
  system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
  assert(*m == '\0');     /* new PTE allows write access */
  assert(!soft_dirty(x));
  *m = 'x';               /* should dirty the page */
  assert(soft_dirty(x));  /* fails */

With this patch, write notifications are enabled when VM_SOFTDIRTY is
cleared.  Furthermore, to avoid unnecessary faults, write notifications
are disabled when VM_SOFTDIRTY is set.

As a side effect of enabling and disabling write notifications with
care, this patch fixes a bug in mprotect where vm_page_prot bits set by
drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
commit c9d0bf2414 ("mm: uncached vma support with writenotify").

Signed-off-by: Peter Feiner <pfeiner@google.com>
Reported-by: Peter Feiner <pfeiner@google.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:28 +02:00
Zach Brown 9470dd5d35 fs: check bh blocknr earlier when searching lru
It's very common for the buffer heads in the lru to have different block
numbers.  By comparing the blocknr before the bdev and size we can
reduce the cost of searching in the very common case where all the
entries have the same bdev and size.

In quick hot cache cycle counting tests on a single fs workstation this
cut the cost of a miss by about 20%.

A diff of the disassembly shows the reordering of the bdev and blocknr
comparisons.  This is in such a tiny loop that skipping one comparison
is a meaningful portion of the total work being done:

     1628:      83 c1 01                add    $0x1,%ecx
     162b:      83 f9 08                cmp    $0x8,%ecx
     162e:      74 60                   je     1690 <__find_get_block+0xa0>
     1630:      89 c8                   mov    %ecx,%eax
     1632:      65 4c 8b 04 c5 00 00    mov    %gs:0x0(,%rax,8),%r8
     1639:      00 00
     163b:      4d 85 c0                test   %r8,%r8
     163e:      4c 89 c3                mov    %r8,%rbx
     1641:      74 e5                   je     1628 <__find_get_block+0x38>
-    1643:      4d 3b 68 30             cmp    0x30(%r8),%r13
+    1643:      4d 3b 68 18             cmp    0x18(%r8),%r13
     1647:      75 df                   jne    1628 <__find_get_block+0x38>
-    1649:      4d 3b 60 18             cmp    0x18(%r8),%r12
+    1649:      4d 3b 60 30             cmp    0x30(%r8),%r12
     164d:      75 d9                   jne    1628 <__find_get_block+0x38>
     164f:      49 39 50 20             cmp    %rdx,0x20(%r8)
     1653:      75 d3                   jne    1628 <__find_get_block+0x38>

Signed-off-by: Zach Brown <zab@zabbo.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:26 +02:00
Rasmus Villemoes a97df4277d isofs: replace strnicmp with strncasecmp
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp.  The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.

To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:24 +02:00
Rasmus Villemoes 2bd63329cb ocfs2: replace strnicmp with strncasecmp
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp.  The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.

To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:24 +02:00
Rasmus Villemoes 87e747cdb9 cifs: replace strnicmp with strncasecmp
The kernel used to contain two functions for length-delimited,
case-insensitive string comparison, strnicmp with correct semantics and
a slightly buggy strncasecmp.  The latter is the POSIX name, so strnicmp
was renamed to strncasecmp, and strnicmp made into a wrapper for the new
strncasecmp to avoid breaking existing users.

To allow the compat wrapper strnicmp to be removed at some point in the
future, and to avoid the extra indirection cost, do
s/strnicmp/strncasecmp/g.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:24 +02:00
Fabian Frederick 76e5121089 FS/OMFS: block number sanity check during fill_super operation
This patch defines maximum block number to 2^31.  It also converts
bitmap_size and array_size to unsigned int in omfs_get_imap

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Bob Copeland <me@bobcopeland.com>
Acked-by: Bob Copeland <me@bobcopeland.com>
Tested-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:22 +02:00
Fabian Frederick c70b17b653 fs/affs: remove redundant sys_tz declarations
sys_tz is already declared in include/linux/time.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:22 +02:00
Fabian Frederick 73516ace94 fs/affs/file.c: fix shadow warnings
Four functions declared variables twice resulting in shadow warnings.

This patch renames internal variables and adds blank line after
declarations.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:22 +02:00
Fabian Frederick 3bc759931d fs/affs/inode.c: remove unused variable
head is set to AFFS_HEAD(bh) but never used.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:22 +02:00
Fabian Frederick 1e907f4f11 fs/affs/super.c: remove unused variable
key is set in affs_fill_super but never used.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:21 +02:00
Oleg Nesterov b03023ecbd coredump: add %i/%I in core_pattern to report the tid of the crashed thread
format_corename() can only pass the leader's pid to the core handler,
but there is no simple way to figure out which thread originated the
coredump.

As Jan explains, this also means that there is no simple way to create
the backtrace of the crashed process:

As programs are mostly compiled with implicit gcc -fomit-frame-pointer
one needs program's .eh_frame section (equivalently PT_GNU_EH_FRAME
segment) or .debug_frame section.  .debug_frame usually is present only
in separate debug info files usually not even installed on the system.
While .eh_frame is a part of the executable/library (and it is even
always mapped for C++ exceptions unwinding) it no longer has to be
present anywhere on the disk as the program could be upgraded in the
meantime and the running instance has its executable file already
unlinked from disk.

One possibility is to echo 0x3f >/proc/*/coredump_filter and dump all
the file-backed memory including the executable's .eh_frame section.
But that can create huge core files, for example even due to mmapped
data files.

Other possibility would be to read .eh_frame from /proc/PID/mem at the
core_pattern handler time of the core dump.  For the backtrace one needs
to read the register state first which can be done from core_pattern
handler:

    ptrace(PTRACE_SEIZE, tid, 0, PTRACE_O_TRACEEXIT)
    close(0);    // close pipe fd to resume the sleeping dumper
    waitpid();   // should report EXIT
    PTRACE_GETREGS or other requests

The remaining problem is how to get the 'tid' value of the crashed
thread.  It could be read from the first NT_PRSTATUS note of the core
file but that makes the core_pattern handler complicated.

Unfortunately %t is already used so this patch uses %i/%I.

Automatic Bug Reporting Tool (https://github.com/abrt/abrt/wiki/overview)
is experimenting with this.  It is using the elfutils
(https://fedorahosted.org/elfutils/) unwinder for generating the
backtraces.  Apart from not needing matching executables as mentioned
above, another advantage is that we can get the backtrace without saving
the core (which might be quite large) to disk.

[mmilata@redhat.com: final paragraph of changelog]
Signed-off-by: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: Mark Wielaard <mjw@redhat.com>
Cc: Martin Milata <mmilata@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:21 +02:00
Fabian Frederick 877aabd6ce fat: remove redundant sys_tz declaration
sys_tz is already declared extern struct in include/linux/time.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Fabian Frederick 54cc6cea73 fs/reiserfs/journal.c: fix sparse context imbalance warning
Merge conditional unlock/lock in the same condition to avoid sparse
warning:

  fs/reiserfs/journal.c:703:36: warning: context imbalance in 'add_to_chunk' - unexpected unlock

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Fabian Frederick 35c0b380d8 fs/ufs/balloc.c: remove unused variable
ucg is defined and set in ufs_bitmap_search but never used.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Evgeniy Dushistov <dushistov@mail.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Fabian Frederick a792d90829 fs/hfs/hfs_fs.h: remove redundant sys_tz declaration
sys_tz is already declared in include/linux/time.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Andreas Rohner b9f6614072 nilfs2: improve the performance of fdatasync()
Support for fdatasync() has been implemented in NILFS2 for a long time,
but whenever the corresponding inode is dirty the implementation falls
back to a full-flegded sync().  Since every write operation has to
update the modification time of the file, the inode will almost always
be dirty and fdatasync() will fall back to sync() most of the time.  But
this fallback is only necessary for a change of the file size and not
for a change of the various timestamps.

This patch adds a new flag NILFS_I_INODE_SYNC to differentiate between
those two situations.

 * If it is set the file size was changed and a full sync is necessary.
 * If it is not set then only the timestamps were updated and
   fdatasync() can go ahead.

There is already a similar flag I_DIRTY_DATASYNC on the VFS layer with
the exact same semantics.  Unfortunately it cannot be used directly,
because NILFS2 doesn't implement write_inode() and doesn't clear the VFS
flags when inodes are written out.  So the VFS writeback thread can
clear I_DIRTY_DATASYNC at any time without notifying NILFS2.  So
I_DIRTY_DATASYNC has to be mapped onto NILFS_I_INODE_SYNC in
nilfs_update_inode().

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Andreas Rohner e2c7617ae3 nilfs2: add missing blkdev_issue_flush() to nilfs_sync_fs()
Under normal circumstances nilfs_sync_fs() writes out the super block,
which causes a flush of the underlying block device.  But this depends
on the THE_NILFS_SB_DIRTY flag, which is only set if the pointer to the
last segment crosses a segment boundary.  So if only a small amount of
data is written before the call to nilfs_sync_fs(), no flush of the
block device occurs.

In the above case an additional call to blkdev_issue_flush() is needed.
To prevent unnecessary overhead, the new flag nilfs->ns_flushed_device
is introduced, which is cleared whenever new logs are written and set
whenever the block device is flushed.  For convenience the function
nilfs_flush_device() is added, which contains the above logic.

Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
Himangi Saraogi 0f2a84f41a fs/befs/btree.c: remove typedef befs_btree_node
The Linux kernel coding style guidelines suggest not using typedefs for
structure types.  This patch gets rid of the typedef for befs_btree_node.

The following Coccinelle semantic patch detects the case.

@tn1@
type td;
@@

typedef struct { ... } td;

@script:python tf@
td << tn1.td;
tdres;
@@

coccinelle.tdres = td;

@@
type tn1.td;
identifier tf.tdres;
@@

-typedef
 struct
+  tdres
   { ... }
-td
 ;

@@
type tn1.td;
identifier tf.tdres;
@@

-td
+ struct tdres

Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:20 +02:00
NeilBrown ef16cc5909 autofs4: d_manage() should return -EISDIR when appropriate in rcu-walk mode.
If rcu-walk mode we don't *have* to return -EISDIR for non-mount-traps
as we will simply drop into REF-walk and handling DCACHE_NEED_AUTOMOUNT
dentrys the slow way.  But it is better if we do when possible.

In 'oz_mode', use the same condition as ref-walk: if not a mountpoint,
then it must be -EISDIR.

In regular mode there are most tests needed.  Most of them can be
performed without taking any spinlocks.  If we find a directory that
isn't obviously empty, and isn't mounted on, we need to call
'simple_empty()' which does take a spinlock.  If this turned out to hurt
performance, some other approach could be found to signal when a
directory is known to be empty.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 4d885f90e3 autofs4: avoid taking fs_lock during rcu-walk
->fs_lock protects AUTOFS_INF_EXPIRING.  We need to be sure that once
the flag is set, no new references beneath the dentry are taken.  So
rcu-walk currently needs to take fs_lock before checking the flag.  This
hurts performance.

Change the expiry to a two-stage process.  First set AUTOFS_INF_NO_RCU
which forces any path walk into ref-walk mode, then drop the lock and
call synchronize_rcu().  Once that returns we can be sure no rcu-walk is
active beneath the dentry and we can check reference counts again.

Now during an RCU-walk we can test AUTOFS_INF_EXPIRING without taking
the lock as along as we test AUTOFS_INF_NO_RCU too.  If either are set,
we must abort the RCU-walk If neither are set, we know that refcounts
will be tested again after we finish the RCU-walk so we are safe to
continue.

->fs_lock is still taken in d_manage() to check for a non-trap
directory.  That will be resolved in the next patch.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 6ece08e618 autofs4: make "autofs4_can_expire" idempotent.
Have a "test" function change the value it is testing can be confusing,
particularly as a future patch will be calling this function twice.

So move the update for 'last_used' to avoid repeat expiry to the place
where the final determination on what to expire is known.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown a5d1dba143 autofs4: factor should_expire() out of autofs4_expire_indirect.
Future patch will potentially call this twice, so make it separate.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
NeilBrown 23bfc2a24e autofs4: allow RCU-walk to walk through autofs4
This series teaches autofs about RCU-walk so that we don't drop straight
into REF-walk when we hit an autofs directory, and so that we avoid
spinlocks as much as possible when performing an RCU-walk.

This is needed so that the benefits of the recent NFS support for
RCU-walk are fully available when NFS filesystems are automounted.

Patches have been carefully reviewed and tested both with test suites
and in production - thanks a lot to Ian Kent for his support there.

This patch (of 6):

Any attempt to look up a pathname that passes though an autofs4 mount is
currently forced out of RCU-walk into REF-walk.

This can significantly hurt performance of many-thread work loads on
many-core systems, especially if the automounted filesystem supports
RCU-walk but doesn't get to benefit from it.

So if autofs4_d_manage is called with rcu_walk set, only fail with -ECHILD
if it is necessary to wait longer than a spinlock.

Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Ian Kent <raven@themaw.net>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Fabian Frederick 8a273345dc fs/ncpfs/dir.c: remove redundant sys_tz declaration
sys_tz is already declared in include/linux/time.h

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Petr Vandrovec <petr@vandrovec.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Arnd Bergmann de8288b1f8 binfmt_misc: work around gcc-4.9 warning
gcc-4.9 on ARM gives us a mysterious warning about the binfmt_misc
parse_command function:

  fs/binfmt_misc.c: In function 'parse_command.part.3':
  fs/binfmt_misc.c:405:7: warning: array subscript is above array bounds [-Warray-bounds]

I've managed to trace this back to the ARM implementation of memset,
which is called from copy_from_user in case of a fault and which does

 #define memset(p,v,n)                                                  \
        ({                                                              \
                void *__p = (p); size_t __n = n;                        \
                if ((__n) != 0) {                                       \
                        if (__builtin_constant_p((v)) && (v) == 0)      \
                                __memzero((__p),(__n));                 \
                        else                                            \
                                memset((__p),(v),(__n));                \
                }                                                       \
                (__p);                                                  \
        })

Apparently gcc gets confused by the check for "size != 0" and believes
that the size might be zero when it gets to the line that does "if
(s[count-1] == '\n')", so it would access data outside of the array.

gcc is clearly wrong here, since this condition was already checked
earlier in the function and the 'size' value can not change in the
meantime.

Fortunately, we can work around it and get rid of the warning by
rearranging the function to check for zero size after doing the
copy_from_user.  It is still safe to pass a zero size into
copy_from_user, so it does not cause any side effects.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:16 +02:00
Mike Frysinger bbaecc0882 binfmt_misc: expand the register format limit to 1920 bytes
The current code places a 256 byte limit on the registration format.
This ends up being fairly limited when you try to do matching against a
binary format like ELF:

 - the magic & mask formats cannot have any embedded NUL chars
   (string_unescape_inplace halts at the first NUL)
 - each escape sequence quadruples the size: \x00 is needed for NUL
 - trying to match bytes at the start of the file as well as further
   on leads to a lot of \x00 sequences in the mask
 - magic & mask have to be the same length (when decoded)
 - still need bytes for the other fields
 - impossible!

Let's look at a concrete (and common) example: using QEMU to run MIPS
ELFs.  The name field uses 11 bytes "qemu-mipsel".  The interp uses 20
bytes "/usr/bin/qemu-mipsel".  The type & flags takes up 4 bytes.  We
need 7 bytes for the delimiter (usually ":").  We can skip offset.  So
already we're down to 107 bytes to use with the magic/mask instead of
the real limit of 128 (BINPRM_BUF_SIZE).  If people use shell code to
register (which they do the majority of the time), they're down to ~26
possible bytes since the escape sequence must be \x##.

The ELF format looks like (both 32 & 64 bit):

	e_ident: 16 bytes
	e_type: 2 bytes
	e_machine: 2 bytes

Those 20 bytes are enough for most architectures because they have so few
formats in the first place, thus they can be uniquely identified.  That
also means for shell users, since 20 is smaller than 26, they can sanely
register a handler.

But for some targets (like MIPS), we need to poke further.  The ELF fields
continue on:

	e_entry: 4 or 8 bytes
	e_phoff: 4 or 8 bytes
	e_shoff: 4 or 8 bytes
	e_flags: 4 bytes

We only care about e_flags here as that includes the bits to identify
whether the ELF is O32/N32/N64.  But now we have to consume another 16
bytes (for 32 bit ELFs) or 28 bytes (for 64 bit ELFs) just to match the
flags.  If every byte is escaped, we send 288 more bytes to the kernel
((20 {e_ident,e_type,e_machine} + 12 {e_entry,e_phoff,e_shoff} + 4
{e_flags}) * 2 {mask,magic} * 4 {escape}) and we've clearly blown our
budget.

Even if we try to be clever and do the decoding ourselves (rather than
relying on the kernel to process \x##), we still can't hit the mark --
string_unescape_inplace treats mask & magic as C strings so NUL cannot
be embedded.  That leaves us with having to pass \x00 for the 12/24
entry/phoff/shoff bytes (as those will be completely random addresses),
and that is a minimum requirement of 48/96 bytes for the mask alone.
Add up the rest and we blow through it (this is for 64 bit ELFs):
magic: 20 {e_ident,e_type,e_machine} + 24 {e_entry,e_phoff,e_shoff} +
       4 {e_flags} = 48              # ^^ See note below.
mask: 20 {e_ident,e_type,e_machine} + 96 {e_entry,e_phoff,e_shoff} +
       4 {e_flags} = 120
Remember above we had 107 left over, and now we're at 168.  This is of
course the *best* case scenario -- you'll also want to have NUL bytes
in the magic & mask too to match literal zeros.

Note: the reason we can use 24 in the magic is that we can work off of the
fact that for bytes the mask would clobber, we can stuff any value into
magic that we want.  So when mask is \x00, we don't need the magic to also
be \x00, it can be an unescaped raw byte like '!'.  This lets us handle
more formats (barely) under the current 256 limit, but that's a pretty
tall hoop to force people to jump through.

With all that said, let's bump the limit from 256 bytes to 1920.  This way
we support escaping every byte of the mask & magic field (which is 1024
bytes by themselves -- 128 * 4 * 2), and we leave plenty of room for other
fields.  Like long paths to the interpreter (when you have source in your
/really/long/homedir/qemu/foo).  Since the current code stuffs more than
one structure into the same buffer, we leave a bit of space to easily
round up to 2k.  1920 is just as arbitrary as 256 ;).

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14 02:18:15 +02:00
Rob Jones d5d962265d fs/fscache/object-list.c: use __seq_open_private()
Reduce boilerplate code by using __seq_open_private() instead of seq_open()
in fscache_objlist_open().

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
2014-10-13 17:52:21 +01:00
David Howells a30efe261b CacheFiles: Fix incorrect test for in-memory object collision
When CacheFiles cache objects are in use, they have in-memory representations,
as defined by the cachefiles_object struct.  These are kept in a tree rooted in
the cache and indexed by dentry pointer (since there's a unique mapping between
object index key and dentry).

Collisions can occur between a representation already in the tree and a new
representation being set up because it takes time to dispose of an old
representation - particularly if it must be unlinked or renamed.

When such a collision occurs, cachefiles_mark_object_active() is meant to check
to see if the old, already-present representation is in the process of being
discarded (ie. FSCACHE_OBJECT_IS_LIVE is not set on it) - and, if so, wait for
the representation to be removed (ie. CACHEFILES_OBJECT_ACTIVE is then
cleared).

However, the test for whether the old representation is still live is checking
the new object - which always will be live at this point.  This leads to an
oops looking like:

	CacheFiles: Error: Unexpected object collision
	object: OBJ1b354
	objstate=LOOK_UP_OBJECT fl=8 wbusy=2 ev=0[0]
	ops=0 inp=0 exc=0
	parent=ffff88053f5417c0
	cookie=ffff880538f202a0 [pr=ffff8805381b7160 nd=ffff880509c6eb78 fl=27]
	key=[8] '2490000000000000'
	xobject: OBJ1a600
	xobjstate=DROP_OBJECT fl=70 wbusy=2 ev=0[0]
	xops=0 inp=0 exc=0
	xparent=ffff88053f5417c0
	xcookie=ffff88050f4cbf70 [pr=ffff8805381b7160 nd=          (null) fl=12]
	------------[ cut here ]------------
	kernel BUG at fs/cachefiles/namei.c:200!
	...
	Workqueue: fscache_object fscache_object_work_func [fscache]
	...
	RIP: ... cachefiles_walk_to_object+0x7ea/0x860 [cachefiles]
	...
	Call Trace:
	 [<ffffffffa04dadd8>] ? cachefiles_lookup_object+0x58/0x100 [cachefiles]
	 [<ffffffffa01affe9>] ? fscache_look_up_object+0xb9/0x1d0 [fscache]
	 [<ffffffffa01afc4d>] ? fscache_parent_ready+0x2d/0x80 [fscache]
	 [<ffffffffa01b0672>] ? fscache_object_work_func+0x92/0x1f0 [fscache]
	 [<ffffffff8107e82b>] ? process_one_work+0x16b/0x400
	 [<ffffffff8107fc16>] ? worker_thread+0x116/0x380
	 [<ffffffff8107fb00>] ? manage_workers.isra.21+0x290/0x290
	 [<ffffffff81085edc>] ? kthread+0xbc/0xe0
	 [<ffffffff81085e20>] ? flush_kthread_worker+0x80/0x80
	 [<ffffffff81502d0c>] ? ret_from_fork+0x7c/0xb0
	 [<ffffffff81085e20>] ? flush_kthread_worker+0x80/0x80

Reported-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Steve Dickson <steved@redhat.com>
2014-10-13 17:52:21 +01:00
Trond Myklebust b8fb9c30f2 NFS: Fix a bogus warning in nfs_generic_pgio
It is OK for pageused == pagecount in the loop, as long as we don't add
another entry to the *pages array. Move the test so that it only triggers
in that case.

Reported-by: Steve Dickson <SteveD@redhat.com>
Fixes: bba5c1887a (nfs: disallow duplicate pages in pgio page vectors)
Cc: Weston Andros Adamson <dros@primarydata.com>
Cc: stable@vger.kernel.org # 3.16.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-13 11:04:02 -04:00
Trond Myklebust 3caa0c6ed7 NFS: Fix an uninitialised pointer Oops in the writeback error path
SteveD reports the following Oops:
 RIP: 0010:[<ffffffffa053461d>]  [<ffffffffa053461d>] __put_nfs_open_context+0x1d/0x100 [nfs]
 RSP: 0018:ffff880fed687b90  EFLAGS: 00010286
 RAX: 0000000000000024 RBX: 0000000000000000 RCX: 0000000000000006
 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: ffff880fed687bc0 R08: 0000000000000092 R09: 000000000000047a
 R10: 0000000000000000 R11: ffff880fed6878d6 R12: ffff880fed687d20
 R13: ffff880fed687d20 R14: 0000000000000070 R15: ffffea000aa33ec0
 FS:  00007fce290f0740(0000) GS:ffff8807ffc60000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000070 CR3: 00000007f2e79000 CR4: 00000000000007e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Stack:
  0000000000000000 ffff880036c5e510 ffff880fed687d20 ffff880fed687d20
  ffff880036c5e200 ffffea000aa33ec0 ffff880fed687bd0 ffffffffa0534710
  ffff880fed687be8 ffffffffa053d5f0 ffff880036c5e200 ffff880fed687c08
 Call Trace:
  [<ffffffffa0534710>] put_nfs_open_context+0x10/0x20 [nfs]
  [<ffffffffa053d5f0>] nfs_pgio_data_destroy+0x20/0x40 [nfs]
  [<ffffffffa053d672>] nfs_pgio_error+0x22/0x40 [nfs]
  [<ffffffffa053d8f4>] nfs_generic_pgio+0x74/0x2e0 [nfs]
  [<ffffffffa06b18c3>] pnfs_generic_pg_writepages+0x63/0x210 [nfsv4]
  [<ffffffffa053d579>] nfs_pageio_doio+0x19/0x50 [nfs]
  [<ffffffffa053eb84>] nfs_pageio_complete+0x24/0x30 [nfs]
  [<ffffffffa053cb25>] nfs_direct_write_schedule_iovec+0x115/0x1f0 [nfs]
  [<ffffffffa053675f>] ? nfs_get_lock_context+0x4f/0x120 [nfs]
  [<ffffffffa053d252>] nfs_file_direct_write+0x262/0x420 [nfs]
  [<ffffffffa0532d91>] nfs_file_write+0x131/0x1d0 [nfs]
  [<ffffffffa0532c60>] ? nfs_need_sync_write.isra.17+0x40/0x40 [nfs]
  [<ffffffff812127b8>] do_io_submit+0x3b8/0x840
  [<ffffffff81212c50>] SyS_io_submit+0x10/0x20
  [<ffffffff81610f29>] system_call_fastpath+0x16/0x1b

This is due to the calls to nfs_pgio_error() in nfs_generic_pgio(), which
happen before the nfs_pgio_header's open context is referenced in
nfs_pgio_rpcsetup().

Reported-by: Steve Dickson <SteveD@redhat.com>
Cc: stable@vger.kernel.org # 3.16.x
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2014-10-13 10:26:43 -04:00
Linus Torvalds faafcba3b5 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Optimized support for Intel "Cluster-on-Die" (CoD) topologies (Dave
     Hansen)

   - Various sched/idle refinements for better idle handling (Nicolas
     Pitre, Daniel Lezcano, Chuansheng Liu, Vincent Guittot)

   - sched/numa updates and optimizations (Rik van Riel)

   - sysbench speedup (Vincent Guittot)

   - capacity calculation cleanups/refactoring (Vincent Guittot)

   - Various cleanups to thread group iteration (Oleg Nesterov)

   - Double-rq-lock removal optimization and various refactorings
     (Kirill Tkhai)

   - various sched/deadline fixes

  ... and lots of other changes"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
  sched/dl: Use dl_bw_of() under rcu_read_lock_sched()
  sched/fair: Delete resched_cpu() from idle_balance()
  sched, time: Fix build error with 64 bit cputime_t on 32 bit systems
  sched: Improve sysbench performance by fixing spurious active migration
  sched/x86: Fix up typo in topology detection
  x86, sched: Add new topology for multi-NUMA-node CPUs
  sched/rt: Use resched_curr() in task_tick_rt()
  sched: Use rq->rd in sched_setaffinity() under RCU read lock
  sched: cleanup: Rename 'out_unlock' to 'out_free_new_mask'
  sched: Use dl_bw_of() under RCU read lock
  sched/fair: Remove duplicate code from can_migrate_task()
  sched, mips, ia64: Remove __ARCH_WANT_UNLOCKED_CTXSW
  sched: print_rq(): Don't use tasklist_lock
  sched: normalize_rt_tasks(): Don't use _irqsave for tasklist_lock, use task_rq_lock()
  sched: Fix the task-group check in tg_has_rt_tasks()
  sched/fair: Leverage the idle state info when choosing the "idlest" cpu
  sched: Let the scheduler see CPU idle states
  sched/deadline: Fix inter- exclusive cpusets migrations
  sched/deadline: Clear dl_entity params when setscheduling to different class
  sched/numa: Kill the wrong/dead TASK_DEAD check in task_numa_fault()
  ...
2014-10-13 16:23:15 +02:00
Linus Torvalds d6dd50e07c Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - changes related to No-CBs CPUs and NO_HZ_FULL

   - RCU-tasks implementation

   - torture-test updates

   - miscellaneous fixes

   - locktorture updates

   - RCU documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (81 commits)
  workqueue: Use cond_resched_rcu_qs macro
  workqueue: Add quiescent state between work items
  locktorture: Cleanup header usage
  locktorture: Cannot hold read and write lock
  locktorture: Fix __acquire annotation for spinlock irq
  locktorture: Support rwlocks
  rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
  locktorture: Document boot/module parameters
  rcutorture: Rename rcutorture_runnable parameter
  locktorture: Add test scenario for rwsem_lock
  locktorture: Add test scenario for mutex_lock
  locktorture: Make torture scripting account for new _runnable name
  locktorture: Introduce torture context
  locktorture: Support rwsems
  locktorture: Add infrastructure for torturing read locks
  torture: Address race in module cleanup
  locktorture: Make statistics generic
  locktorture: Teach about lock debugging
  locktorture: Support mutexes
  locktorture: Add documentation
  ...
2014-10-13 15:44:12 +02:00
Linus Torvalds 5ff0b9e1a1 xfs: update for 3.18-rc1
This update contains:
 o various cleanups
 o log recovery debug hooks
 o seek hole/data implementation merge
 o extent shift rework to fix collapse range bugs
 o various sparse warning fixes
 o log recovery transaction processing rework to fix use after free bugs
 o metadata buffer IO infrastructuer rework to ensure all buffers under IO have
   valid reference counts
 o various fixes for ondisk flags, writeback and zero range corner cases
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJUOxyCAAoJEK3oKUf0dfodzt8QAKcFdE8hyCAnD8IK85v46gWG
 IHnxOlTLrhs/22wfD1fSUcjCBQsQAIloQihvVGStugFnkEUHOUjlZ/oMcGNFPECC
 L7B4Ns6WmA9TA8ibgYvLZepautNjzhS5/lGfqSWpw4hQPsJJp2fGyCVF/ZhwnP6D
 qPeflVic8E8rgaJp98X8uFyZ+9EEoSF7/9EhmvVNwsO6UaThhIO/oPydx8oNrhKS
 k6aADmxNYtFWJb6kUjFbXaJwrFIFLvJc60FZz2eUViVGBx6K8D5FBiVbzZKe2WZ6
 VOz4fj63BYI7Nxk4rZGJPoyql+ChO/pIVwH15ZmYRkkgUXs8FGy85mNKMg7DHnFm
 K/ZUhW5IBc6GtkwPCjNIM642IQYnTR5SdQfFxMS2EYPBUumcQ3EbD44aGZY69YYu
 pP+2g4b1diadNkGACccj6teQ9V0fbyF0lfZqoZMeN/W0As6l9oYa0yFBGsK9sblq
 yrPfce+wEy5HBy9M7Fqpvm3bwMunNViqilGZXKlOyodSgXxSF3JwXuc+8/TNwcnL
 O0RSD7R7k6TvrmAntTgwT4beZi4ziG+/tVa0rD3mJM/sXyzcP2bwbA1APM74NcHh
 p8mrJRci6vtKPwIylQ1xeCeK/WD21OhbJWBYR+0JOEJSnAjtv8flk7mqGLhy+M+Y
 yCdHJIfuJ4NKj4X3f0Kc
 =TdAB
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs update from Dave Chinner:
 "This update contains:
   - various cleanups
   - log recovery debug hooks
   - seek hole/data implementation merge
   - extent shift rework to fix collapse range bugs
   - various sparse warning fixes
   - log recovery transaction processing rework to fix use after free
     bugs
   - metadata buffer IO infrastructuer rework to ensure all buffers
     under IO have valid reference counts
   - various fixes for ondisk flags, writeback and zero range corner
     cases"

* tag 'xfs-for-linus-3.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (56 commits)
  xfs: fix agno increment in xfs_inumbers() loop
  xfs: xfs_iflush_done checks the wrong log item callback
  xfs: flush the range before zero range conversion
  xfs: restore buffer_head unwritten bit on ioend cancel
  xfs: check for null dquot in xfs_quota_calc_throttle()
  xfs: fix crc field handling in xfs_sb_to/from_disk
  xfs: don't send null bp to xfs_trans_brelse()
  xfs: check for inode size overflow in xfs_new_eof()
  xfs: only set extent size hint when asked
  xfs: project id inheritance is a directory only flag
  xfs: kill time.h
  xfs: compat_xfs_bstat does not have forkoff
  xfs: simplify xfs_zero_remaining_bytes
  xfs: check xfs_buf_read_uncached returns correctly
  xfs: introduce xfs_buf_submit[_wait]
  xfs: kill xfs_bioerror_relse
  xfs: xfs_bioerror can die.
  xfs: kill xfs_bdstrat_cb
  xfs: rework xfs_buf_bio_endio error handling
  xfs: xfs_buf_ioend and xfs_buf_iodone_work duplicate functionality
  ...
2014-10-13 12:06:54 +02:00
Linus Torvalds 77c688ac87 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "The big thing in this pile is Eric's unmount-on-rmdir series; we
  finally have everything we need for that.  The final piece of prereqs
  is delayed mntput() - now filesystem shutdown always happens on
  shallow stack.

  Other than that, we have several new primitives for iov_iter (Matt
  Wilcox, culled from his XIP-related series) pushing the conversion to
  ->read_iter()/ ->write_iter() a bit more, a bunch of fs/dcache.c
  cleanups and fixes (including the external name refcounting, which
  gives consistent behaviour of d_move() wrt procfs symlinks for long
  and short names alike) and assorted cleanups and fixes all over the
  place.

  This is just the first pile; there's a lot of stuff from various
  people that ought to go in this window.  Starting with
  unionmount/overlayfs mess...  ;-/"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (60 commits)
  fs/file_table.c: Update alloc_file() comment
  vfs: Deduplicate code shared by xattr system calls operating on paths
  reiserfs: remove pointless forward declaration of struct nameidata
  don't need that forward declaration of struct nameidata in dcache.h anymore
  take dname_external() into fs/dcache.c
  let path_init() failures treated the same way as subsequent link_path_walk()
  fix misuses of f_count() in ppp and netlink
  ncpfs: use list_for_each_entry() for d_subdirs walk
  vfs: move getname() from callers to do_mount()
  gfs2_atomic_open(): skip lookups on hashed dentry
  [infiniband] remove pointless assignments
  gadgetfs: saner API for gadgetfs_create_file()
  f_fs: saner API for ffs_sb_create_file()
  jfs: don't hash direct inode
  [s390] remove pointless assignment of ->f_op in vmlogrdr ->open()
  ecryptfs: ->f_op is never NULL
  android: ->f_op is never NULL
  nouveau: __iomem misannotations
  missing annotation in fs/file.c
  fs: namespace: suppress 'may be used uninitialized' warnings
  ...
2014-10-13 11:28:42 +02:00
Dmitry Monakhov aef4885ae1 ext4: move error report out of atomic context in ext4_init_block_bitmap()
Error report likely result in IO so it is bad idea to do it from
atomic context.

This patch should fix following issue:

BUG: sleeping function called from invalid context at include/linux/buffer_head.h:349
in_atomic(): 1, irqs_disabled(): 0, pid: 137, name: kworker/u128:1
5 locks held by kworker/u128:1/137:
 #0:  ("writeback"){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #1:  ((&(&wb->dwork)->work)){......}, at: [<ffffffff81085618>] process_one_work+0x228/0x4d0
 #2:  (jbd2_handle){......}, at: [<ffffffff81242622>] start_this_handle+0x712/0x7b0
 #3:  (&ei->i_data_sem){......}, at: [<ffffffff811fa387>] ext4_map_blocks+0x297/0x430
 #4:  (&(&bgl->locks[i].lock)->rlock){......}, at: [<ffffffff811f3180>] ext4_read_block_bitmap_nowait+0x5d0/0x630
CPU: 3 PID: 137 Comm: kworker/u128:1 Not tainted 3.17.0-rc2-00184-g82752e4 #165
Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
Workqueue: writeback bdi_writeback_workfn (flush-1:0)
 0000000000000411 ffff880813777288 ffffffff815c7fdc ffff880813777288
 ffff880813a8bba0 ffff8808137772a8 ffffffff8108fb30 ffff880803e01e38
 ffff880803e01e38 ffff8808137772c8 ffffffff811a8d53 ffff88080ecc6000
Call Trace:
 [<ffffffff815c7fdc>] dump_stack+0x51/0x6d
 [<ffffffff8108fb30>] __might_sleep+0xf0/0x100
 [<ffffffff811a8d53>] __sync_dirty_buffer+0x43/0xe0
 [<ffffffff811a8e03>] sync_dirty_buffer+0x13/0x20
 [<ffffffff8120f581>] ext4_commit_super+0x1d1/0x230
 [<ffffffff8120fa03>] save_error_info+0x23/0x30
 [<ffffffff8120fd06>] __ext4_error+0xb6/0xd0
 [<ffffffff8120f260>] ? ext4_group_desc_csum+0x140/0x190
 [<ffffffff811f2d8c>] ext4_read_block_bitmap_nowait+0x1dc/0x630
 [<ffffffff8122e23a>] ext4_mb_init_cache+0x21a/0x8f0
 [<ffffffff8113ae95>] ? lru_cache_add+0x55/0x60
 [<ffffffff8112e16c>] ? add_to_page_cache_lru+0x6c/0x80
 [<ffffffff8122eaa0>] ext4_mb_init_group+0x190/0x280
 [<ffffffff8122ec51>] ext4_mb_good_group+0xc1/0x190
 [<ffffffff8123309a>] ext4_mb_regular_allocator+0x17a/0x410
 [<ffffffff8122c821>] ? ext4_mb_use_preallocated+0x31/0x380
 [<ffffffff81233535>] ? ext4_mb_new_blocks+0x205/0x8e0
 [<ffffffff8116ed5c>] ? kmem_cache_alloc+0xfc/0x180
 [<ffffffff812335b0>] ext4_mb_new_blocks+0x280/0x8e0
 [<ffffffff8116f2c4>] ? __kmalloc+0x144/0x1c0
 [<ffffffff81221797>] ? ext4_find_extent+0x97/0x320
 [<ffffffff812257f4>] ext4_ext_map_blocks+0xbc4/0x1050
 [<ffffffff811fa387>] ? ext4_map_blocks+0x297/0x430
 [<ffffffff811fa3ab>] ext4_map_blocks+0x2bb/0x430
 [<ffffffff81200e43>] ? ext4_init_io_end+0x23/0x50
 [<ffffffff811feb44>] ext4_writepages+0x564/0xaf0
 [<ffffffff815cde3b>] ? _raw_spin_unlock+0x2b/0x40
 [<ffffffff810ac7bd>] ? lock_release_non_nested+0x2fd/0x3c0
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811a009e>] ? writeback_sb_inodes+0x10e/0x490
 [<ffffffff811377e3>] do_writepages+0x23/0x40
 [<ffffffff8119c8ce>] __writeback_single_inode+0x9e/0x280
 [<ffffffff811a026b>] writeback_sb_inodes+0x2db/0x490
 [<ffffffff811a0664>] wb_writeback+0x174/0x2d0
 [<ffffffff810ac359>] ? lock_release_holdtime+0x29/0x190
 [<ffffffff811a0863>] wb_do_writeback+0xa3/0x200
 [<ffffffff811a0a40>] bdi_writeback_workfn+0x80/0x230
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff810856cd>] process_one_work+0x2dd/0x4d0
 [<ffffffff81085618>] ? process_one_work+0x228/0x4d0
 [<ffffffff81085c1d>] worker_thread+0x35d/0x460
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff810858c0>] ? process_one_work+0x4d0/0x4d0
 [<ffffffff8108a885>] kthread+0xf5/0x100
 [<ffffffff810990e5>] ? local_clock+0x25/0x30
 [<ffffffff8108a790>] ? __init_kthread_worker+0x70/0x70
 [<ffffffff815ce2ac>] ret_from_fork+0x7c/0xb0
 [<ffffffff8108a790>] ? __init_kthread_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:42:12 -04:00
Dmitry Monakhov 9aa5d32ba2 ext4: Replace open coded mdata csum feature to helper function
Besides the fact that this replacement improves code readability
it also protects from errors caused direct EXT4_S(sb)->s_es manipulation
which may result attempt to use uninitialized  csum machinery.

#Testcase_BEGIN
IMG=/dev/ram0
MNT=/mnt
mkfs.ext4 $IMG
mount $IMG $MNT
#Enable feature directly on disk, on mounted fs
tune2fs -O metadata_csum  $IMG
# Provoke metadata update, likey result in OOPS
touch $MNT/test
umount $MNT
#Testcase_END

# Replacement script
@@
expression E;
@@
- EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
+ ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2014-10-13 03:36:16 -04:00
Dave Chinner 6889e783cd Merge branch 'xfs-misc-fixes-for-3.18-3' into for-next 2014-10-13 10:22:45 +11:00
Eric Sandeen a8b1ee8baf xfs: fix agno increment in xfs_inumbers() loop
caused a regression in xfs_inumbers, which in turn broke
xfsdump, causing incomplete dumps.

The loop in xfs_inumbers() needs to fill the user-supplied
buffers, and iterates via xfs_btree_increment, reading new
ags as needed.

But the first time through the loop, if xfs_btree_increment()
succeeds, we continue, which triggers the ++agno at the bottom
of the loop, and we skip to soon to the next ag - without
the proper setup under next_ag to read the next ag.

Fix this by removing the agno increment from the loop conditional,
and only increment agno if we have actually hit the code under
the next_ag: target.

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2014-10-13 10:21:53 +11:00
Eric Biggers a457606a6f fs/file_table.c: Update alloc_file() comment
This comment is 5 years outdated; init_file() no longer exists.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:10 -04:00
Eric Biggers 8cc431165d vfs: Deduplicate code shared by xattr system calls operating on paths
The following pairs of system calls dealing with extended attributes only
differ in their behavior on whether the symbolic link is followed (when
the named file is a symbolic link):

- setxattr() and lsetxattr()
- getxattr() and lgetxattr()
- listxattr() and llistxattr()
- removexattr() and lremovexattr()

Despite this, the implementations all had duplicated code, so this commit
redirects each of the above pairs of system calls to a corresponding
function to which different lookup flags (LOOKUP_FOLLOW or 0) are passed.

For me this reduced the stripped size of xattr.o from 8824 to 8248 bytes.

Signed-off-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:10 -04:00
Al Viro 50b220bbe7 reiserfs: remove pointless forward declaration of struct nameidata
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:07 -04:00
Al Viro 810bb17267 take dname_external() into fs/dcache.c
never used outside and it's too low-level for legitimate uses outside
of fs/dcache.c anyway

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:05 -04:00
Al Viro 115cbfdc60 let path_init() failures treated the same way as subsequent link_path_walk()
As it is, path_lookupat() and path_mounpoint() might end up leaking struct file
reference in some cases.

Spotted-by: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-12 17:09:04 -04:00
Linus Torvalds 5e40d331bd Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris.

Mostly ima, selinux, smack and key handling updates.

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
  integrity: do zero padding of the key id
  KEYS: output last portion of fingerprint in /proc/keys
  KEYS: strip 'id:' from ca_keyid
  KEYS: use swapped SKID for performing partial matching
  KEYS: Restore partial ID matching functionality for asymmetric keys
  X.509: If available, use the raw subjKeyId to form the key description
  KEYS: handle error code encoded in pointer
  selinux: normalize audit log formatting
  selinux: cleanup error reporting in selinux_nlmsg_perm()
  KEYS: Check hex2bin()'s return when generating an asymmetric key ID
  ima: detect violations for mmaped files
  ima: fix race condition on ima_rdwr_violation_check and process_measurement
  ima: added ima_policy_flag variable
  ima: return an error code from ima_add_boot_aggregate()
  ima: provide 'ima_appraise=log' kernel option
  ima: move keyring initialization to ima_init()
  PKCS#7: Handle PKCS#7 messages that contain no X.509 certs
  PKCS#7: Better handling of unsupported crypto
  KEYS: Overhaul key identification when searching for asymmetric keys
  KEYS: Implement binary asymmetric key ID handling
  ...
2014-10-12 10:13:55 -04:00
Xiaoguang Wang 65dd8327eb ext4: delete useless comments about ext4_move_extents
In patch 'ext4: refactor ext4_move_extents code base',  Dmitry Monakhov has
refactored ext4_move_extents' implementation, but forgot to update the
corresponding comments, this patch will try to delete some useless comments.

Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Xiaoguang Wang <wangxg.fnst@cn.fujitsu.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2014-10-11 19:56:34 -04:00
Eric Sandeen 0ff8947fc5 ext4: fix reservation overflow in ext4_da_write_begin
Delalloc write journal reservations only reserve 1 credit,
to update the inode if necessary.  However, it may happen
once in a filesystem's lifetime that a file will cross
the 2G threshold, and require the LARGE_FILE feature to
be set in the superblock as well, if it was not set already.

This overruns the transaction reservation, and can be
demonstrated simply on any ext4 filesystem without the LARGE_FILE
feature already set:

dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
	conv=notrunc of=testfile
sync
dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
	conv=notrunc of=testfile

leads to:

EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28

Adjust the number of credits based on whether the flag is
already set, and whether the current write may extend past the
LARGE_FILE limit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2014-10-11 19:51:17 -04:00
Linus Torvalds ef4a48c513 File locking related changes for v3.18 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUNZK4AAoJEAAOaEEZVoIVI08P/iM7eaIVRnqaqtWw/JBzxiba
 EMDlJYUBSlv6lYk9s8RJT4bMmcmGAKSYzVAHSoPahzNcqTDdFLeDTLGxJ8uKBbjf
 d1qRRdH1yZHGUzCvJq3mEendjfXn435Y3YburUxjLfmzrzW7EbMvndiQsS5dhAm9
 PEZ+wrKF/zFL7LuXa1YznYrbqOD/GRsJAXGEWc3kNwfS9avephVG/RI3GtpI2PJj
 RY1mf8P7+WOlrShYoEuUo5aqs01MnU70LbqGHzY8/QKH+Cb0SOkCHZPZyClpiA+G
 MMJ+o2XWcif3BZYz+dobwz/FpNZ0Bar102xvm2E8fqByr/T20JFjzooTKsQ+PtCk
 DetQptrU2gtyZDKtInJUQSDPrs4cvA13TW+OEB1tT8rKBnmyEbY3/TxBpBTB9E6j
 eb/V3iuWnywR3iE+yyvx24Qe7Pov6deM31s46+Vj+GQDuWmAUJXemhfzPtZiYpMT
 exMXTyDS3j+W+kKqHblfU5f+Bh1eYGpG2m43wJVMLXKV7NwDf8nVV+Wea962ga+w
 BAM3ia4JRVgRWJBPsnre3lvGT5kKPyfTZsoG+kOfRxiorus2OABoK+SIZBZ+c65V
 Xh8VH5p3qyCUBOynXlHJWFqYWe2wH0LfbPrwe9dQwTwON51WF082EMG5zxTG0Ymf
 J2z9Shz68zu0ok8cuSlo
 =Hhee
 -----END PGP SIGNATURE-----

Merge tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux

Pull file locking related changes from Jeff Layton:
 "This release is a little more busy for file locking changes than the
  last:

   - a set of patches from Kinglong Mee to fix the lockowner handling in
     knfsd
   - a pile of cleanups to the internal file lease API.  This should get
     us a bit closer to allowing for setlease methods that can block.

  There are some dependencies between mine and Bruce's trees this cycle,
  and I based my tree on top of the requisite patches in Bruce's tree"

* tag 'locks-v3.18-1' of git://git.samba.org/jlayton/linux: (26 commits)
  locks: fix fcntl_setlease/getlease return when !CONFIG_FILE_LOCKING
  locks: flock_make_lock should return a struct file_lock (or PTR_ERR)
  locks: set fl_owner for leases to filp instead of current->files
  locks: give lm_break a return value
  locks: __break_lease cleanup in preparation of allowing direct removal of leases
  locks: remove i_have_this_lease check from __break_lease
  locks: move freeing of leases outside of i_lock
  locks: move i_lock acquisition into generic_*_lease handlers
  locks: define a lm_setup handler for leases
  locks: plumb a "priv" pointer into the setlease routines
  nfsd: don't keep a pointer to the lease in nfs4_file
  locks: clean up vfs_setlease kerneldoc comments
  locks: generic_delete_lease doesn't need a file_lock at all
  nfsd: fix potential lease memory leak in nfs4_setlease
  locks: close potential race in lease_get_mtime
  security: make security_file_set_fowner, f_setown and __f_setown void return
  locks: consolidate "nolease" routines
  locks: remove lock_may_read and lock_may_write
  lockd: rip out deferred lock handling from testlock codepath
  NFSD: Get reference of lockowner when coping file_lock
  ...
2014-10-11 13:21:34 -04:00
Linus Torvalds 90d0c376f5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
 "The largest set of changes here come from Miao Xie.  He's cleaning up
  and improving read recovery/repair for raid, and has a number of
  related fixes.

  I've merged another set of fsync fixes from Filipe, and he's also
  improved the way we handle metadata write errors to make sure we force
  the FS readonly if things go wrong.

  Otherwise we have a collection of fixes and cleanups.  Dave Sterba
  gets a cookie for removing the most lines (thanks Dave)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (139 commits)
  btrfs: Fix compile error when CONFIG_SECURITY is not set.
  Btrfs: fix compiles when CONFIG_BTRFS_FS_RUN_SANITY_TESTS is off
  btrfs: Make btrfs handle security mount options internally to avoid losing security label.
  Btrfs: send, don't delay dir move if there's a new parent inode
  btrfs: add more superblock checks
  Btrfs: fix race in WAIT_SYNC ioctl
  Btrfs: be aware of btree inode write errors to avoid fs corruption
  Btrfs: remove redundant btrfs_verify_qgroup_counts declaration.
  btrfs: fix shadow warning on cmp
  Btrfs: fix compilation errors under DEBUG
  Btrfs: fix crash of btrfs_release_extent_buffer_page
  Btrfs: add missing end_page_writeback on submit_extent_page failure
  btrfs: Fix the wrong condition judgment about subset extent map
  Btrfs: fix build_backref_tree issue with multiple shared blocks
  Btrfs: cleanup error handling in build_backref_tree
  btrfs: move checks for DUMMY_ROOT into a helper
  btrfs: new define for the inline extent data start
  btrfs: kill extent_buffer_page helper
  btrfs: drop constant param from btrfs_release_extent_buffer_page
  btrfs: hide typecast to definition of BTRFS_SEND_TRANS_STUB
  ...
2014-10-11 08:03:52 -04:00
Linus Torvalds ac0c49396d Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF and quota updates from Jan Kara:
 "A few UDF fixes and also a few patches which are preparing filesystems
  for support of project quotas in VFS"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix loading of special inodes
  ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()
  udf: remove redundant sys_tz declaration
  ocfs2: Don't use MAXQUOTAS value
  reiserfs: Don't use MAXQUOTAS value
  ext3: Don't use MAXQUOTAS value
  udf: Fix race between write(2) and close(2)
2014-10-11 08:02:31 -04:00
Linus Torvalds eca9fdf32d Minor code cleanups and a fix for when eCryptfs metadata is stored in xattrs
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJUNtXXAAoJENaSAD2qAscKx1UP/jqt4gm1AYFBpkwBhVRUssIQ
 wHckk8QPasIdEGeKvyCKXl88sUDLSsJwf/mUpl8pBfKm64LokP2fmUBU9Pkf9hVU
 lcuaFNIEmHh8p1IqcfaFbnZOjuuHc9M2ULQLmo5ShoTHNu6JYAP2zRBMfFrEdcMR
 vKh+RCARa5jr1CdwTHX+dH5vJQIQXW/qgRK5G5Z6KeBI766jK2BvZxYHivUgLWuC
 dV6K4RzvHHJYEVoXjCUhrgGepGwHlDoEgx/Y0GK9vbFPG38IrfSlN6fgvzV6mMYE
 ien8FsuHPv5oiBmFM2byRmWpJtWUOViCbMaMmKqY5Ix21E0lUafA7ixH3nSpOHNZ
 b29dxmHnEDomCJXLAF0NQUE84yTw6ITLp2FldUR2o+sidnJsDx/hph/KvmsK6d6P
 sDfEN/DtzPluZoXKY0jrRtoAhi0citNTgKfrujmx6baBstxRp7AfwGcP4skXJq7w
 wkg0Seo449CUaBJK9A4s7nIjMBQ6/3hjvF/NVcry1aY+/RyhJDz6uFrtnOEqW3So
 6khwJOOcRkmLLXrk+DzvthDRCVNJhWM80cB5/UBBfVG2Kk3PHa86Jfp5Y4teWugx
 zyoacEWiitKcK4Po8skZpLd2vUwny/cD0qS43LT+SMvgRsYZ4XdUnceoY8FkOnLr
 ooIFKHCR/+2XVnAaC101
 =mjS7
 -----END PGP SIGNATURE-----

Merge tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs

Pull eCryptfs updates from Tyler Hicks:
 "Minor code cleanups and a fix for when eCryptfs metadata is stored in
  xattrs"

* tag 'ecryptfs-3.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
  ecryptfs: remove unneeded buggy code in ecryptfs_do_create()
  ecryptfs: avoid to access NULL pointer when write metadata in xattr
  ecryptfs: remove unnecessary break after goto
  ecryptfs: Remove unnecessary include of syscall.h in keystore.c
  fs/ecryptfs/messaging.c: remove null test before kfree
  ecryptfs: Drop cast
  Use %pd in eCryptFS
2014-10-11 08:01:27 -04:00
Linus Torvalds 41e46ac0fa This time we have a couple of bug fixes, one relating to bad i_goal values
which are now ignored (i_goal is basically a hint so it is safe to so this)
 and another relating to the saving of the dirent location during rename.
 
 There is one performance improvement, which is an optimisation in rgblk_free
 so that multiple block deallocations will now be more efficient,
 and one clean up patch to use _RET_IP_ rather than writing it out longhand.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.15 (GNU/Linux)
 
 iQIcBAABAgAGBQJUNQIcAAoJEMrg3m4a/8jSKaQP+wT4SCjKI0TMkVqfyuyDjTAW
 5TxV9h3B5sOxlroJY063WSZUFFWCuooLn4rIK+IT72Jju/AtW1NOb+kx6T8PZ+bT
 5Qj7sReGNJADwWdFNlrE3l+7SecVHO2fxfoI5zKX06YgL8WDptR+nYo/Hn1kwQFL
 4/cukJSkKMLzvrParR/S7RildlG98jQUhS1WtgxUhyyLVC3+b/9HvwAznq6JUX2U
 DLePL2rliCTPZEEvq7tKZGi4uv3ZSbhfSgJHeQeYW565OAiYiJ4fmPFXK9LSyxvv
 +WWoRLGkKDjwHLejkBwjwVbpDOrzVSXuwGrRh0DixL/0xwSAsraBMhiIr6XzXxq8
 3S2uTbl+i+aeuIEwcE/OjDW1Mxp9GovuW0cKdpscVW/Hz6Id5asEtF+YUPBWFi/R
 fFydsn7Gjrjtd2K++n+094N6kqhYo6Vud9es0kigp9D7pq4NSaoKSu74qz2sCWel
 51SgysflmL92lXM2f619LgVL4IBdciECNKaC/JPiUJSc8K0MgUNWE4wreGoeTsuc
 ceQZ8dW1i0w/3MVy5RbjPiEkNpN2ISNs0vDga07dr1Imy/6CJQR+hq9IdLEoku9V
 iBJH46EorVWN4Flt2jeM3uvIc7j+w3kMc5Mxia8D4+aF5eIx/D44ucPb1zPDrILV
 O+LD3XOy+kxecCn1wesC
 =+QNn
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

Pull gfs2 updates from Steven Whitehouse:
 "This time we have a couple of bug fixes, one relating to bad i_goal
  values which are now ignored (i_goal is basically a hint so it is safe
  to so this) and another relating to the saving of the dirent location
  during rename.

  There is one performance improvement, which is an optimisation in
  rgblk_free so that multiple block deallocations will now be more
  efficient, and one clean up patch to use _RET_IP_ rather than writing
  it out longhand"

* tag 'gfs2-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
  GFS2: use _RET_IP_ instead of (unsigned long)__builtin_return_address(0)
  GFS2: Use gfs2_rbm_incr in rgblk_free
  GFS2: Make rename not save dirent location
  GFS2: fix bad inode i_goal values during block allocation
2014-10-11 08:00:16 -04:00
Linus Torvalds c798360cd1 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "A lot of activities on percpu front.  Notable changes are...

   - percpu allocator now can take @gfp.  If @gfp doesn't contain
     GFP_KERNEL, it tries to allocate from what's already available to
     the allocator and a work item tries to keep the reserve around
     certain level so that these atomic allocations usually succeed.

     This will replace the ad-hoc percpu memory pool used by
     blk-throttle and also be used by the planned blkcg support for
     writeback IOs.

     Please note that I noticed a bug in how @gfp is interpreted while
     preparing this pull request and applied the fix 6ae833c7fe
     ("percpu: fix how @gfp is interpreted by the percpu allocator")
     just now.

   - percpu_ref now uses longs for percpu and global counters instead of
     ints.  It leads to more sparse packing of the percpu counters on
     64bit machines but the overhead should be negligible and this
     allows using percpu_ref for refcnting pages and in-memory objects
     directly.

   - The switching between percpu and single counter modes of a
     percpu_ref is made independent of putting the base ref and a
     percpu_ref can now optionally be initialized in single or killed
     mode.  This allows avoiding percpu shutdown latency for cases where
     the refcounted objects may be synchronously created and destroyed
     in rapid succession with only a fraction of them reaching fully
     operational status (SCSI probing does this when combined with
     blk-mq support).  It's also planned to be used to implement forced
     single mode to detect underflow more timely for debugging.

  There's a separate branch percpu/for-3.18-consistent-ops which cleans
  up the duplicate percpu accessors.  That branch causes a number of
  conflicts with s390 and other trees.  I'll send a separate pull
  request w/ resolutions once other branches are merged"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
  percpu: fix how @gfp is interpreted by the percpu allocator
  blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
  percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
  percpu_ref: add PERCPU_REF_INIT_* flags
  percpu_ref: decouple switching to percpu mode and reinit
  percpu_ref: decouple switching to atomic mode and killing
  percpu_ref: add PCPU_REF_DEAD
  percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
  percpu_ref: replace pcpu_ prefix with percpu_
  percpu_ref: minor code and comment updates
  percpu_ref: relocate percpu_ref_reinit()
  Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
  Revert "percpu: free percpu allocation info for uniprocessor system"
  percpu-refcount: make percpu_ref based on longs instead of ints
  percpu-refcount: improve WARN messages
  percpu: fix locking regression in the failure path of pcpu_alloc()
  percpu-refcount: add @gfp to percpu_ref_init()
  proportions: add @gfp to init functions
  percpu_counter: add @gfp to percpu_counter_init()
  percpu_counter: make percpu_counters_lock irq-safe
  ...
2014-10-10 07:26:02 -04:00
Linus Torvalds b211e9d7c8 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "Nothing too interesting.  Just a handful of cleanup patches"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  Revert "cgroup: remove redundant variable in cgroup_mount()"
  cgroup: remove redundant variable in cgroup_mount()
  cgroup: fix missing unlock in cgroup_release_agent()
  cgroup: remove CGRP_RELEASABLE flag
  perf/cgroup: Remove perf_put_cgroup()
  cgroup: remove redundant check in cgroup_ino()
  cpuset: simplify proc_cpuset_show()
  cgroup: simplify proc_cgroup_show()
  cgroup: use a per-cgroup work for release agent
  cgroup: remove bogus comments
  cgroup: remove redundant code in cgroup_rmdir()
  cgroup: remove some useless forward declarations
  cgroup: fix a typo in comment.
2014-10-10 07:24:40 -04:00
Sebastien Buisson 86cf78d73d fs/buffer.c: increase the buffer-head per-CPU LRU size
Increase the buffer-head per-CPU LRU size to allow efficient filesystem
operations that access many blocks for each transaction.  For example,
creating a file in a large ext4 directory with quota enabled will access
multiple buffer heads and will overflow the LRU at the default 8-block LRU
size:

* parent directory inode table block (ctime, nlinks for subdirs)
* new inode bitmap
* inode table block
* 2 quota blocks
* directory leaf block (not reused, but pollutes one cache entry)
* 2 levels htree blocks (only one is reused, other pollutes cache)
* 2 levels indirect/index blocks (only one is reused)

The buffer-head per-CPU LRU size is raised to 16, as it shows in metadata
performance benchmarks up to 10% gain for create, 4% for lookup and 7% for
destroy.

Signed-off-by: Liang Zhen <liang.zhen@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Sebastien Buisson <sebastien.buisson@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:26:02 -04:00
Konstantin Khlebnikov 09316c09dd mm/balloon_compaction: add vmstat counters and kpageflags bit
Always mark pages with PageBalloon even if balloon compaction is disabled
and expose this mark in /proc/kpageflags as KPF_BALLOON.

Also this patch adds three counters into /proc/vmstat: "balloon_inflate",
"balloon_deflate" and "balloon_migrate".  They accumulate balloon
activity.  Current size of balloon is (balloon_inflate - balloon_deflate)
pages.

All generic balloon code now gathered under option CONFIG_MEMORY_BALLOON.
It should be selected by ballooning driver which wants use this feature.
Currently virtio-balloon is the only user.

Signed-off-by: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:26:01 -04:00
Peter Feiner 81d0fa623c mm: softdirty: unmapped addresses between VMAs are clean
If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
as softdirty.  Here's a program to demonstrate the bug:

int main() {
	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
	uint64_t pme[3];
	int fd = open("/proc/self/pagemap", O_RDONLY);;
	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	munmap(m + getpagesize(), getpagesize());
	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
	return 0;
}

(Note that all pages in new VMAs are softdirty until cleared).

Tested:
	Used the program given above. I'm going to include this code in
	a selftest in the future.

[n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Jamie Liu <jamieliu@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Xue jiufei b246d3d11e ocfs2: fix a deadlock while o2net_wq doing direct memory reclaim
Fix a deadlock problem caused by direct memory reclaim in o2net_wq.  The
situation is as follows:

1) Receive a connect message from another node, node queues a
   work_struct o2net_listen_work.

2) o2net_wq processes this work and call the following functions:

o2net_wq
-> o2net_accept_one
  -> sock_create_lite
    -> sock_alloc()
      -> kmem_cache_alloc with GFP_KERNEL
        -> ____cache_alloc_node
          ->__alloc_pages_nodemask
            -> do_try_to_free_pages
              -> shrink_slab
                -> evict
                  -> ocfs2_evict_inode
                    -> ocfs2_drop_lock
                      -> dlmunlock
                        -> o2net_send_message_vec

   then o2net_wq wait for the unlock reply from master.

3) tcp layer received the reply, call o2net_data_ready() and queue
   sc_rx_work, waiting o2net_wq to process this work.

4) o2net_wq is a single thread workqueue, it process the work one by
   one.  Right now it is still doing o2net_listen_work and cannot handle
   sc_rx_work.  so we deadlock.

Junxiao Bi's patch "mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set"
(http://ozlabs.org/~akpm/mmots/broken-out/mm-clear-__gfp_fs-when-pf_memalloc_noio-is-set.patch)
clears __GFP_FS in memalloc_noio_flags() besides __GFP_IO.  We use
memalloc_noio_save() to set process flag PF_MEMALLOC_NOIO so that all
allocations done by this process are done as if GFP_NOIO was specified.
We are not reentering filesystem while doing memory reclaim.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:58 -04:00
Oleg Nesterov 498f237178 mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race
9e7814404b "hold task->mempolicy while numa_maps scans." fixed the
race with the exiting task but this is not enough.

The current code assumes that get_vma_policy(task) should either see
task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
by hold_task_mempolicy(), so we can never race with __mpol_put(). But
this can only work if we can't race with do_set_mempolicy(), and thus
we can't race with another do_set_mempolicy() or do_exit() after that.

However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
race. This task can exec, change it's ->mm, and call do_set_mempolicy()
after that; in this case they take 2 different locks.

Change hold_task_mempolicy() to use get_task_policy(), it never returns
NULL, and change show_numa_map() to use __get_vma_policy() or fall back
to proc_priv->task_mempolicy.

Note: this is the minimal fix, we will cleanup this code later. I think
hold_task_mempolicy() and release_task_mempolicy() should die, we can
move this logic into show_numa_map(). Or we can move get_task_policy()
outside of ->mmap_sem and !CONFIG_NUMA code at least.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:56 -04:00
Akinobu Mita 447f05bb48 block_dev: implement readpages() to optimize sequential read
Sequential read from a block device is expected to be equal or faster than
from the file on a filesystem.  But it is not correct due to the lack of
effective readpages() in the address space operations for block device.

This implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

Install 1GB of RAM disk storage:

	# modprobe scsi_debug dev_size_mb=1024 delay=0

Sequential read from file on a filesystem:

	# mkfs.ext4 /dev/$DEV
	# mount /dev/$DEV /mnt
	# fio --name=t --size=512m --rw=read --filename=/mnt/file
	...
	  read : io=524288KB, bw=2133.4MB/s, iops=546133, runt=   240msec

Sequential read from a block device:
	# fio --name=t --size=512m --rw=read --filename=/dev/$DEV
	...
(Without this commit)
	  read : io=524288KB, bw=1700.2MB/s, iops=435455, runt=   301msec

(With this commit)
	  read : io=524288KB, bw=2160.4MB/s, iops=553046, runt=   237msec

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Akinobu Mita 4db96b71e3 vfs: guard end of device for mpage interface
Add guard_bio_eod() check for mpage code in order to allow us to do IO
even on the odd last sectors of a device, even if the block size is some
multiple of the physical sector size.

Using mpage_readpages() for block device requires this guard check.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Akinobu Mita 59d43914ed vfs: make guard_bh_eod() more generic
This patchset implements readpages() operation for block device by using
mpage_readpages() which can create multipage BIOs instead of BIOs for each
page and reduce system CPU time consumption.

This patch (of 3):

guard_bh_eod() is used in submit_bh() to allow us to do IO even on the odd
last sectors of a device, even if the block size is some multiple of the
physical sector size.  This makes guard_bh_eod() more generic and renames
it guard_bio_eod() so that we can use it without struct buffer_head
argument.

The reason for this change is that using mpage_readpages() for block
device requires to add this guard check in mpage code.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:53 -04:00
Baoquan He bf3e269246 fs/proc/kcore.c: don't add modules range to kcore if it's equal to vmcore range
On some ARCHs modules range is eauql to vmalloc range. E.g on i686

	"#define MODULES_VADDR   VMALLOC_START"
	"#define MODULES_END     VMALLOC_END"

This will cause 2 duplicate program segments in /proc/kcore, and no flag
to indicate they are different.  This is confusing.  And usually people
who need check the elf header or read the content of kcore will check
memory ranges.  Two program segments which are the same are unnecessary.

So check if the modules range is equal to vmalloc range.  If so, just skip
adding the modules range.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 58cb65487e proc/maps: make vm_is_stack() logic namespace-friendly
- Rename vm_is_stack() to task_of_stack() and change it to return
  "struct task_struct *" rather than the global (and thus wrong in
  general) pid_t.

- Add the new pid_of_stack() helper which calls task_of_stack() and
  uses the right namespace to report the correct pid_t.

  Unfortunately we need to define this helper twice, in task_mmu.c
  and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
  and move at least pid_of_stack/task_of_stack there to avoid the
  code duplication.

- Change show_map_vma() and show_numa_map() to use the new helper.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 2c03376d2d proc/maps: replace proc_maps_private->pid with "struct inode *inode"
m_start() can use get_proc_task() instead, and "struct inode *"
provides more potentially useful info, see the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:50 -04:00
Oleg Nesterov 47fecca15c fs/proc/task_nommu.c: don't use priv->task->mm
I do not know if CONFIG_PREEMPT/SMP is possible without CONFIG_MMU
but the usage of task->mm in m_stop(). The task can exit/exec before
we take mmap_sem, in this case m_stop() can hit NULL or unlock the
wrong rw_semaphore.

Also, this code uses priv->task != NULL to decide whether we need
up_read/mmput. This is correct, but we will probably kill priv->task.
Change m_start/m_stop to rely on IS_ERR_OR_NULL() like task_mmu.c does.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 27692cd56e fs/proc/task_nommu.c: shift mm_access() from m_start() to proc_maps_open()
Copy-and-paste the changes from "fs/proc/task_mmu.c: shift mm_access()
from m_start() to proc_maps_open()" into task_nommu.c.

Change maps_open() to initialize priv->mm using proc_mem_open(), m_start()
can rely on atomic_inc_not_zero(mm_users) like task_mmu.c does.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ce34fddb5b fs/proc/task_nommu.c: change maps_open() to use __seq_open_private()
Cleanup and preparation. maps_open() can use __seq_open_private()
like proc_maps_open() does.

[akpm@linux-foundation.org: deuglify]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 557c2d8a73 fs/proc/task_mmu.c: update m->version in the main loop in m_start()
Change the main loop in m_start() to update m->version. Mostly for
consistency, but this can help to avoid the same loop if the very
1st ->show() fails due to seq_overflow().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov b8c20a9b85 fs/proc/task_mmu.c: reintroduce m->version logic
Add the "last_addr" optimization back. Like before, every ->show()
method checks !seq_overflow() and sets m->version = vma->vm_start.

However, it also checks that m_next_vma(vma) != NULL, otherwise it
sets m->version = -1 for the lockless "EOF" fast-path in m_start().

m_start() can simply do find_vma() + m_next_vma() if last_addr is
not zero, the code looks clear and simple and this case is clearly
separated from "scan vmas" path.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ad2a00e4b7 fs/proc/task_mmu.c: introduce m_next_vma() helper
Extract the tail_vma/vm_next calculation from m_next() into the new
trivial helper, m_next_vma().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 0c255321f8 fs/proc/task_mmu.c: simplify m_start() to make it readable
Now that m->version is gone we can cleanup m_start(). In particular,

  - Remove the "unsigned long" typecast, m->index can't be negative
    or exceed ->map_count. But lets use "unsigned int pos" to make
    it clear that "pos < map_count" is safe.

  - Remove the unnecessary "vma != NULL" check in the main loop. It
    can't be NULL unless we have a vm bug.

  - This also means that "pos < map_count" case can simply return the
    valid vma and avoid "goto" and subsequent checks.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov ebb6cdde1a fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic
m_start() carefully documents, checks, and sets "m->version = -1" if
we are going to return NULL. The only problem is that we will be never
called again if m_start() returns NULL, so this is simply pointless
and misleading.

Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is
just wrong, we want -1 in this case. And in fact we also want -1 if
->vm_next == NULL and ->tail_vma == NULL.

And it is not used consistently, the "scan vmas" loop in m_start()
should update last_addr too.

Finally, imo the whole "last_addr" logic in m_start() looks horrible.
find_vma(last_addr) is called unconditionally even if we are not going
to use the result. But the main problem is that this code participates
in tail_vma-or-NULL mess, and this looks simply unfixable.

Remove this optimization. We will add it back after some cleanups.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 0d5f5f45f9 fs/proc/task_mmu.c: shift "priv->task = NULL" from m_start() to m_stop()
1. There is no reason to reset ->tail_vma in m_start(), if we return
   IS_ERR_OR_NULL() it won't be used.

2. m_start() also clears priv->task to ensure that m_stop() won't use
   the stale pointer if we fail before get_task_struct(). But this is
   ugly and confusing, move this initialization in m_stop().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:49 -04:00
Oleg Nesterov 23d54837e4 fs/proc/task_mmu.c: cleanup the "tail_vma" horror in m_next()
1. Kill the first "vma != NULL" check. Firstly this is not possible,
   m_next() won't be called if ->start() or the previous ->next()
   returns NULL.

   And if it was possible the 2nd "vma != tail_vma" check is buggy,
   we should not wrongly return ->tail_vma.

2. Make this function readable. The logic is very simple, we should
   return check "vma != tail" once and return "vm_next || tail_vma".

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 59b4bf12d4 fs/proc/task_mmu.c: simplify the vma_stop() logic
m_start() drops ->mmap_sem and does mmput() if it retuns vsyscall
vma. This is because in this case m_stop()->vma_stop() obviously
can't use gate_vma->vm_mm.

Now that we have proc_maps_private->mm we can simplify this logic:

  - Change m_start() to return with ->mmap_sem held unless it returns
    IS_ERR_OR_NULL().

  - Change vma_stop() to use priv->mm and avoid the ugly vma checks,
    this makes "vm_area_struct *vma" unnecessary.

  - This also allows m_start() to use vm_stop().

  - Cleanup m_next() to follow the new locking rule.

    Note: m_stop() looks very ugly, and this temporary uglifies it
    even more. Fixed by the next change.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 29a40ace84 fs/proc/task_mmu.c: shift mm_access() from m_start() to proc_maps_open()
A simple test-case from Kirill Shutemov

	cat /proc/self/maps >/dev/null
	chmod +x /proc/self/net/packet
	exec /proc/self/net/packet

makes lockdep unhappy, cat/exec take seq_file->lock + cred_guard_mutex in
the opposite order.

It's a false positive and probably we should not allow "chmod +x" on proc
files. Still I think that we should avoid mm_access() and cred_guard_mutex
in sys_read() paths, security checking should happen at open time. Besides,
this doesn't even look right if the task changes its ->mm between m_stop()
and m_start().

Add the new "mm_struct *mm" member into struct proc_maps_private and change
proc_maps_open() to initialize it using proc_mem_open(). Change m_start() to
use priv->mm if atomic_inc_not_zero(mm_users) succeeds or return NULL (eof)
otherwise.

The only complication is that proc_maps_open() users should additionally do
mmdrop() in fop->release(), add the new proc_map_release() helper for that.

Note: this is the user-visible change, if the task execs after open("maps")
the new ->mm won't be visible via this file. I hope this is fine, and this
matches /proc/pid/mem bahaviour.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 5381e169e7 proc: introduce proc_mem_open()
Extract the mm_access() code from __mem_open() into the new helper,
proc_mem_open(), the next patch will add another caller.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 4db7d0ee19 fs/proc/task_mmu.c: unify/simplify do_maps_open() and numa_maps_open()
do_maps_open() and numa_maps_open() are overcomplicated, they could use
__seq_open_private().  Plus they do the same, just sizeof(*priv)

Change them to use a new simple helper, proc_maps_open(ops, psize).  This
simplifies the code and allows us to do the next changes.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Oleg Nesterov 46c298cf69 fs/proc/task_mmu.c: don't use task->mm in m_start() and show_*map()
get_gate_vma(priv->task->mm) looks ugly and wrong, task->mm can be NULL or
it can changed by exec right after mm_access().

And in theory this race is not harmless, the task can exec and then later
exit and free the new mm_struct.  In this case get_task_mm(oldmm) can't
help, get_gate_vma(task->mm) can read the freed/unmapped memory.

I think that priv->task should simply die and hold_task_mempolicy() logic
can be simplified.  tail_vma logic asks for cleanups too.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Junxiao Bi f775da2fc2 ocfs2: fix deadlock due to wrong locking order
For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex
osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it
will wait until jbd2 commit thread done. In order journal mode, jbd2
needs flushing dirty data pages first, and this needs get page lock.
So osb->journal->j_trans_barrier should be got before page lock.

But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this
locking order, and this will cause deadlock and hung the whole cluster.

One deadlock catched is the following:

PID: 13449  TASK: ffff8802e2f08180  CPU: 31  COMMAND: "oracle"
 #0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524
 #1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf
 #2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85
 #3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55
 #4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4
 #5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2]
 #6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2]
 #7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2]
 #8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2]
 #9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2]
 #10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2]
 #11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2]
 #12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29
 #13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1
 #14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd
 #15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142
    RIP: 00007f8de750c6f7  RSP: 00007fffe786e478  RFLAGS: 00000206
    RAX: 000000000000004d  RBX: ffffffff81515142  RCX: 0000000000000000
    RDX: 0000000000000200  RSI: 0000000000028400  RDI: 000000000000000d
    RBP: 00007fffe786e040   R8: 0000000000000000   R9: 000000000000000d
    R10: 0000000000000000  R11: 0000000000000206  R12: 000000000000000d
    R13: 00007fffe786e710  R14: 00007f8de70f8340  R15: 0000000000028400
    ORIG_RAX: 000000000000004d  CS: 0033  SS: 002b

crash64> bt
PID: 7610   TASK: ffff88100fd56140  CPU: 1   COMMAND: "ocfs2cmt"
 #0 [ffff88100f4d1c50] __schedule at ffffffff8150a524
 #1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf
 #2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2]
 #3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2]
 #4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2]
 #5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2]
 #6 [ffff88100f4d1ee8] kthread at ffffffff81090db6
 #7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284

crash64> bt
PID: 7609   TASK: ffff88100f2d4480  CPU: 0   COMMAND: "jbd2/dm-20-86"
 #0 [ffff88100def3920] __schedule at ffffffff8150a524
 #1 [ffff88100def39c8] schedule at ffffffff8150acbf
 #2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c
 #3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e
 #4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a
 #5 [ffff88100def3a58] __lock_page at ffffffff81110687
 #6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752
 #7 [ffff88100def3be8] generic_writepages at ffffffff8111b901
 #8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2]
 #9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2]
 #10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2]
 #11 [ffff88100def3ee8] kthread at ffffffff81090db6
 #12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Alex Chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:48 -04:00
Joseph Qi 70e82a12db ocfs2: fix deadlock between o2hb thread and o2net_wq
The following case may lead to o2net_wq and o2hb thread deadlock on
o2hb_callback_sem.
Currently there are 2 nodes say N1, N2 in the cluster. And N2 down, at
the same time, N3 tries to join the cluster. So N1 will handle node
down (N2) and join (N3) simultaneously.
    o2hb                               o2net_wq
    ->o2hb_do_disk_heartbeat
    ->o2hb_check_slot
    ->o2hb_run_event_list
    ->o2hb_fire_callbacks
    ->down_write(&o2hb_callback_sem)
    ->o2net_hb_node_down_cb
    ->flush_workqueue(o2net_wq)
                                       ->o2net_process_message
                                       ->dlm_query_join_handler
                                       ->o2hb_check_node_heartbeating
                                       ->o2hb_fill_node_map
                                       ->down_read(&o2hb_callback_sem)

No need to take o2hb_callback_sem in dlm_query_join_handler,
o2hb_live_lock is enough to protect live node map.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: xMark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: jiangyiwen <jiangyiwen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Junxiao Bi 5046f18d5b ocfs2: don't fire quorum before connection established
Firing quorum before connection established can cause unexpected node to
reboot.

Assume there are 3 nodes in the cluster, Node 1, 2, 3.  Node 2 and 3 have
wrong ip address of Node 1 in cluster.conf and global heartbeat is enabled
in the cluster.  After the heatbeats are started on these three nodes,
Node 1 will reboot due to quorum fencing.  It is similar case if Node 1's
networking is not ready when starting the global heartbeat.

The reboot is not friendly as customer is not fully ready for ocfs2 to
work.  Fix it by not allowing firing quorum before the connection is
established.  In this case, ocfs2 will wait until the wrong configuration
is fixed or networking is up to continue.  Also update the log to guide
the user where to check when connection is not built for a long time.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Rob Jones 1848cb5530 fs/ocfs2/dlmglue.c: use __seq_open_private() not seq_open()
Reduce boilerplate code by using seq_open_private() instead of seq_open()

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Rob Jones f328833848 fs/ocfs2/cluster/netdebug.c: use seq_open_private() not seq_open()
Reduce boilerplate code by using seq_open_private() instead of seq_open()

Note that the code in and using sc_common_open() has been quite
extensively changed.  Not least because there was a latent memory leak in
the code as was: if sc_common_open() failed, the previously allocated
buffer was not freed.

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Rob Jones 8f9ac03232 fs/ocfs2/dlm/dlmdebug.c: use seq_open_private() not seq_open()
Reduce boilerplate code by using seq_open_private() instead of seq_open()

Signed-off-by: Rob Jones <rob.jones@codethink.co.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Xue jiufei 6ae075485e ocfs2: remove unused code in dlm_new_lockres()
Remove the branch that free res->lockname.name because the condition
is never satisfied when jump to label error.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
alex chen 9a7e6b5a0a ocfs2/dlm: call dlm_lockres_put without resource spinlock
dlm_lockres_put() should be called without &res->spinlock, otherwise a
deadlock case may happen.

spin_lock(&res->spinlock)
...
dlm_lockres_put
  ->dlm_lockres_release
    ->dlm_print_one_lock_resource
      ->spin_lock(&res->spinlock)

Signed-off-by: Alex Chen <alex.chen@huawei.com>
Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Joseph Qi 4a4e07c1bd ocfs2: call o2quo_exit() if malloc failed in o2net_init()
In o2net_init, if malloc failed, it directly returns -ENOMEM.  Then
o2quo_exit won't be called in init_o2nm.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Joseph Qi 7fa05c6e46 ocfs2: fix shift left operations overflow
ocfs2_inode_info->ip_clusters and ocfs2_dinode->id1.bitmap1.i_total are
defined as type u32, so the shift left operations may overflow if volume
size is large, for example, 2TB and cluster size is 1MB.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:47 -04:00
Joseph Qi 190a7721ac ocfs2/dlm: refactor error handling in dlm_alloc_ctxt
Refactoring error handling in dlm_alloc_ctxt to simplify code.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:46 -04:00
Andrew Morton 98acbf63d6 fs/ocfs2/stack_user.c: fix typo in ocfs2_control_release()
It is supposed to zero pv_minor.

Reported-by: Himangi Saraogi <himangi774@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:46 -04:00
Andrea Gelmini 7143e49441 ntfs: remove bogus space
fs/ntfs/debug.c:124: WARNING: space prohibited between function name and
open parenthesis '('

Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:46 -04:00
Anton Altaparmakov 5272d036b2 ntfs: use find_get_page_flags() to mark page accessed as it is no longer marked later on
Mel Gorman's commit 2457aec637 ("mm: non-atomically mark page accessed
during page cache allocation where possible") removed mark_page_accessed()
calls from NTFS without updating the matching find_lock_page() to
find_get_page_flags(GFP_LOCK | FGP_ACCESSED) thus causing the page to
never be marked accessed.

This patch fixes that.

Signed-off-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:46 -04:00
Yann Droneaud 0b37e097a6 fanotify: enable close-on-exec on events' fd when requested in fanotify_init()
According to commit 80af258867 ("fanotify: groups can specify their
f_flags for new fd"), file descriptors created as part of file access
notification events inherit flags from the event_f_flags argument passed
to syscall fanotify_init(2)[1].

Unfortunately O_CLOEXEC is currently silently ignored.

Indeed, event_f_flags are only given to dentry_open(), which only seems to
care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in
open_check_o_direct() and O_LARGEFILE in generic_file_open().

It's a pity, since, according to some lookup on various search engines and
http://codesearch.debian.net/, there's already some userspace code which
use O_CLOEXEC:

- in systemd's readahead[2]:

    fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK, O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);

- in clsync[3]:

    #define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)

    int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);

- in examples [4] from "Filesystem monitoring in the Linux
  kernel" article[5] by Aleksander Morgado:

    if ((fanotify_fd = fanotify_init (FAN_CLOEXEC,
                                      O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)

Additionally, since commit 48149e9d3a ("fanotify: check file flags
passed in fanotify_init").  having O_CLOEXEC as part of fanotify_init()
second argument is expressly allowed.

So it seems expected to set close-on-exec flag on the file descriptors if
userspace is allowed to request it with O_CLOEXEC.

But Andrew Morton raised[6] the concern that enabling now close-on-exec
might break existing applications which ask for O_CLOEXEC but expect the
file descriptor to be inherited across exec().

In the other hand, as reported by Mihai Dontu[7] close-on-exec on the file
descriptor returned as part of file access notify can break applications
due to deadlock.  So close-on-exec is needed for most applications.

More, applications asking for close-on-exec are likely expecting it to be
enabled, relying on O_CLOEXEC being effective.  If not, it might weaken
their security, as noted by Jan Kara[8].

So this patch replaces call to macro get_unused_fd() by a call to function
get_unused_fd_flags() with event_f_flags value as argument.  This way
O_CLOEXEC flag in the second argument of fanotify_init(2) syscall is
interpreted and close-on-exec get enabled when requested.

[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
    https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz

Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Don\u021bu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:46 -04:00
Sasha Levin 105d1b4253 fsnotify: don't put user context if it was never assigned
On some failure paths we may attempt to free user context even if it
wasn't assigned yet.  This will cause a NULL ptr deref and a kernel BUG.

The path I was looking at is in inotify_new_group():

        oevent = kmalloc(sizeof(struct inotify_event_info), GFP_KERNEL);
        if (unlikely(!oevent)) {
                fsnotify_destroy_group(group);
                return ERR_PTR(-ENOMEM);
        }

fsnotify_destroy_group() would get called here, but
group->inotify_data.user is only getting assigned later:

	group->inotify_data.user = get_current_user();

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Reviewed-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:45 -04:00
Andrew Morton cafbaae8af fs/notify/group.c: make fsnotify_final_destroy_group() static
No callers outside this file.

Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:45 -04:00
Jan Kara 6174c2eb8e udf: Fix loading of special inodes
Some UDF media have special inodes (like VAT or metadata partition
inodes) whose link_count is 0. Thus commit 4071b91362 (udf: Properly
detect stale inodes) broke loading these inodes because udf_iget()
started returning -ESTALE for them. Since we still need to properly
detect stale inodes queried by NFS, create two variants of udf_iget() -
one which is used for looking up special inodes (which ignores
link_count == 0) and one which is used for other cases which return
ESTALE when link_count == 0.

Fixes: 4071b91362
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
2014-10-09 13:06:14 +02:00
Linus Torvalds 47137c6ba1 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Nothing really exciting this time:

   - a few fixlets in the NOHZ code

   - a new ARM SoC timer abomination.  One should expect that we have
     enough of them already, but they insist on inventing new ones.

   - the usual bunch of ARM SoC timer updates.  That feels like herding
     cats"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  clocksource: arm_arch_timer: Consolidate arch_timer_evtstrm_enable
  clocksource: arm_arch_timer: Enable counter access for 32-bit ARM
  clocksource: arm_arch_timer: Change clocksource name if CP15 unavailable
  clocksource: sirf: Disable counter before re-setting it
  clocksource: cadence_ttc: Add support for 32bit mode
  clocksource: tcb_clksrc: Sanitize IRQ request
  clocksource: arm_arch_timer: Discard unavailable timers correctly
  clocksource: vf_pit_timer: Support shutdown mode
  ARM: meson6: clocksource: Add Meson6 timer support
  ARM: meson: documentation: Add timer documentation
  clocksource: sh_tmu: Document r8a7779 binding
  clocksource: sh_mtu2: Document r7s72100 binding
  clocksource: sh_cmt: Document SoC specific bindings
  timerfd: Remove an always true check
  nohz: Avoid tick's double reprogramming in highres mode
  nohz: Fix spurious periodic tick behaviour in low-res dynticks mode
2014-10-09 06:35:05 -04:00
Al Viro 821cc3070f ncpfs: use list_for_each_entry() for d_subdirs walk
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:16 -04:00
Seunghun Lee 5e6123f347 vfs: move getname() from callers to do_mount()
It would make more sense to pass char __user * instead of
char * in callers of do_mount() and do getname() inside do_mount().

Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Seunghun Lee <waydi1@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:16 -04:00
Al Viro 4d93bc3e81 gfs2_atomic_open(): skip lookups on hashed dentry
hashed dentry can be passed to ->atomic_open() only if
a) it has just passed revalidation and
b) it's negative

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:15 -04:00
Al Viro 9bb8730ed3 jfs: don't hash direct inode
hlist_add_fake(inode->i_hash), same as for the rest of special ones...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:13 -04:00
Al Viro c2e3f5d5f4 ecryptfs: ->f_op is never NULL
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:12 -04:00
Al Viro e983094d6d missing annotation in fs/file.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:11 -04:00
Tim Gardner b8850d1fa8 fs: namespace: suppress 'may be used uninitialized' warnings
The gcc version 4.9.1 compiler complains Even though it isn't possible for
these variables to not get initialized before they are used.

fs/namespace.c: In function ‘SyS_mount’:
fs/namespace.c:2720:8: warning: ‘kernel_dev’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
        ^
fs/namespace.c:2699:8: note: ‘kernel_dev’ was declared here
  char *kernel_dev;
        ^
fs/namespace.c:2720:8: warning: ‘kernel_type’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  ret = do_mount(kernel_dev, kernel_dir->name, kernel_type, flags,
        ^
fs/namespace.c:2697:8: note: ‘kernel_type’ was declared here
  char *kernel_type;
        ^

Fix the warnings by simplifying copy_mount_string() as suggested by Al Viro.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:10 -04:00
Al Viro 2ec3a12a66 cachefiles_write_page(): switch to __kernel_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:05 -04:00
Al Viro 4b8e992392 9p: switch to %p[dD]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:04 -04:00
Al Viro 35c265e008 cifs: switch to use of %p[dD]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:03 -04:00
Mikulas Patocka c2ca0fcd20 fs: make cont_expand_zero interruptible
This patch makes it possible to kill a process looping in
cont_expand_zero. A process may spend a lot of time in this function, so
it is desirable to be able to kill it.

It happened to me that I wanted to copy a piece data from the disk to a
file. By mistake, I used the "seek" parameter to dd instead of "skip". Due
to the "seek" parameter, dd attempted to extend the file and became stuck
doing so - the only possibility was to reset the machine or wait many
hours until the filesystem runs out of space and cont_expand_zero fails.
We need this patch to be able to terminate the process.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:03 -04:00
Tetsuo Handa 475d0db742 fs: Fix theoretical division by 0 in super_cache_scan().
total_objects could be 0 and is used as a denom.

While total_objects is a "long", total_objects == 0 unlikely happens for
3.12 and later kernels because 32-bit architectures would not be able to
hold (1 << 32) objects. However, total_objects == 0 may happen for kernels
between 3.1 and 3.11 because total_objects in prune_super() was an "int"
and (e.g.) x86_64 architecture might be able to hold (1 << 32) objects.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: stable <stable@kernel.org> # 3.1+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:02 -04:00
Daeseok Youn b8314f9303 dcache: Fix no spaces at the start of a line in dcache.c
Fixed coding style in dcache.c

Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:02 -04:00
Al Viro 99358a1ca5 [jffs2] kill wbuf_queued/wbuf_dwork_lock
schedule_delayed_work() happening when the work is already pending is
a cheap no-op.  Don't bother with ->wbuf_queued logics - it's both
broken (cancelling ->wbuf_dwork leaves it set, as spotted by Jeff Harris)
and pointless.  It's cheaper to let schedule_delayed_work() handle that
case.

Reported-by: Jeff Harris <jefftharris@gmail.com>
Tested-by: Jeff Harris <jefftharris@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:01 -04:00
Al Viro 19d860a140 handle suicide on late failure exits in execve() in search_binary_handler()
... rather than doing that in the guts of ->load_binary().
[updated to fix the bug spotted by Shentino - for SIGSEGV we really need
something stronger than send_sig_info(); again, better do that in one place]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:39:00 -04:00
Al Viro 2926620145 dcache.c: call ->d_prune() regardless of d_unhashed()
the only in-tree instance checks d_unhashed() anyway,
out-of-tree code can preserve the current behaviour by
adding such check if they want it and we get an ability
to use it in cases where we *want* to be notified of
killing being inevitable before ->d_lock is dropped,
whether it's unhashed or not.  In particular, autofs
would benefit from that.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:59 -04:00
Al Viro 29355c3904 d_prune_alias(): just lock the parent and call __dentry_kill()
The only reason for games with ->d_prune() was __d_drop(), which
was needed only to force dput() into killing the sucker off.

Note that lock_parent() can be called under ->i_lock and won't
drop it, so dentry is safe from somebody managing to kill it
under us - it won't happen while we are holding ->i_lock.

__dentry_kill() is called only with ->d_lockref.count being 0
(here and when picked from shrink list) or 1 (dput() and dropping
the ancestors in shrink_dentry_list()), so it will never be called
twice - the first thing it's doing is making ->d_lockref.count
negative and once that happens, nothing will increment it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:59 -04:00
Eric W. Biederman bbd5192412 proc: Update proc_flush_task_mnt to use d_invalidate
Now that d_invalidate always succeeds and flushes mount points use
it in stead of a combination of shrink_dcache_parent and d_drop
in proc_flush_task_mnt.  This removes the danger of a mount point
under /proc/<pid>/... becoming unreachable after the d_drop.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:58 -04:00
Eric W. Biederman c143c2333c vfs: Remove d_drop calls from d_revalidate implementations
Now that d_invalidate always succeeds it is not longer necessary or
desirable to hard code d_drop calls into filesystem specific
d_revalidate implementations.

Remove the unnecessary d_drop calls and rely on d_invalidate
to drop the dentries.  Using d_invalidate ensures that paths
to mount points will not be dropped.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:58 -04:00
Eric W. Biederman 5542aa2fa7 vfs: Make d_invalidate return void
Now that d_invalidate can no longer fail, stop returning a useless
return code.  For the few callers that checked the return code update
remove the handling of d_invalidate failure.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:57 -04:00
Eric W. Biederman 1ffe46d11c vfs: Merge check_submounts_and_drop and d_invalidate
Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:57 -04:00
Eric W. Biederman 9b053f3207 vfs: Remove unnecessary calls of check_submounts_and_drop
Now that check_submounts_and_drop can not fail and is called from
d_invalidate there is no longer a need to call check_submounts_and_drom
from filesystem d_revalidate methods so remove it.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:56 -04:00
Eric W. Biederman 8ed936b567 vfs: Lazily remove mounts on unlinked files and directories.
With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths.  It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else.  With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.

The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.

In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.

There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
maintained by fusermount.

In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.

In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger.  fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.

In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary.  Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.

v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
    Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop shrink_submounts_and_drop
    Document what why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
    renames.

Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:56 -04:00
Eric W. Biederman 80b5dce8c5 vfs: Add a function to lazily unmount all mounts from any dentry.
The new function detach_mounts comes in two pieces.  The first piece
is a static inline test of d_mounpoint that returns immediately
without taking any locks if d_mounpoint is not set.  In the common
case when mountpoints are absent this allows the vfs to continue
running with it's same cacheline foot print.

The second piece of detach_mounts __detach_mounts actually does the
work and it assumes that a mountpoint is present so it is slow and
takes namespace_sem for write, and then locks the mount hash (aka
mount_lock) after a struct mountpoint has been found.

With those two locks held each entry on the list of mounts on a
mountpoint is selected and lazily unmounted until all of the mount
have been lazily unmounted.

v7: Wrote a proper change description and removed the changelog
    documenting deleted wrong turns.

Signed-off-by: Eric W. Biederman <ebiederman@twitter.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:55 -04:00
Eric W. Biederman e2dfa93546 vfs: factor out lookup_mountpoint from new_mountpoint
I am shortly going to add a new user of struct mountpoint that
needs to look up existing entries but does not want to create
a struct mountpoint if one does not exist.  Therefore to keep
the code simple and easy to read split out lookup_mountpoint
from new_mountpoint.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-09 02:38:55 -04:00