Commit graph

483 commits

Author SHA1 Message Date
Linus Torvalds 10d20bd25e shmem: fix shm fallocate() list corruption
The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
want to race with generating new pages by faulting them in.

However, the wait-queue used to delay the page faulting has a serious
problem: the wait queue head (in shmem_fallocate()) is allocated on the
stack, and the code expects that "wake_up_all()" will make sure that all
the queue entries are gone before the stack frame is de-allocated.

And that is not at all necessarily the case.

Yes, a normal wake-up sequence will remove the wait-queue entry that
caused the wakeup (see "autoremove_wake_function()"), but the key
wording there is "that caused the wakeup".  When there are multiple
possible wakeup sources, the wait queue entry may well stay around.

And _particularly_ in a page fault path, we may be faulting in new pages
from user space while we also have other things going on, and there may
well be other pending wakeups.

So despite the "wake_up_all()", it's not at all guaranteed that all list
entries are removed from the wait queue head on the stack.

Fix this by introducing a new wakeup function that removes the list
entry unconditionally, even if the target process had already woken up
for other reasons.  Use that "synchronous" function to set up the
waiters in shmem_fault().

This problem has never been seen in the wild afaik, but Dave Jones has
reported it on and off while running trinity.  We thought we fixed the
stack corruption with the blk-mq rq_list locking fix (commit
7fe311302f: "blk-mq: update hardware and software queues for sleeping
alloc"), but it turns out there was _another_ stack corruptor hiding
in the trinity runs.

Vegard Nossum (also running trinity) was able to trigger this one fairly
consistently, and made us look once again at the shmem code due to the
faults often being in that area.

Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-06 08:59:05 -08:00
Hugh Dickins 9956edf37e shmem: fix pageflags after swapping DMA32 object
If shmem_alloc_page() does not set PageLocked and PageSwapBacked, then
shmem_replace_page() needs to do so for itself.  Without this, it puts
newpage on the wrong lru, re-unlocks the unlocked newpage, and system
descends into "Bad page" reports and freeze; or if CONFIG_DEBUG_VM=y, it
hits an earlier VM_BUG_ON_PAGE(!PageLocked), depending on config.

But shmem_replace_page() is not a common path: it's only called when
swapin (or swapoff) finds the page was already read into an unsuitable
zone: usually all zones are suitable, but gem objects for a few drm
devices (gma500, omapdrm, crestline, broadwater) require zone DMA32 if
there's more than 4GB of ram.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1611062003510.11253@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>	[4.8.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-11 08:12:37 -08:00
Linus Torvalds 101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro 3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds 97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Linus Torvalds abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Linus Torvalds b66484cd74 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - fsnotify updates

 - ocfs2 updates

 - all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
  console: don't prefer first registered if DT specifies stdout-path
  cred: simpler, 1D supplementary groups
  CREDITS: update Pavel's information, add GPG key, remove snail mail address
  mailmap: add Johan Hovold
  .gitattributes: set git diff driver for C source code files
  uprobes: remove function declarations from arch/{mips,s390}
  spelling.txt: "modeled" is spelt correctly
  nmi_backtrace: generate one-line reports for idle cpus
  arch/tile: adopt the new nmi_backtrace framework
  nmi_backtrace: do a local dump_stack() instead of a self-NMI
  nmi_backtrace: add more trigger_*_cpu_backtrace() methods
  min/max: remove sparse warnings when they're nested
  Documentation/filesystems/proc.txt: add more description for maps/smaps
  mm, proc: fix region lost in /proc/self/smaps
  proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
  proc: add LSM hook checks to /proc/<tid>/timerslack_ns
  proc: relax /proc/<tid>/timerslack_ns capability requirements
  meminfo: break apart a very long seq_printf with #ifdefs
  seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
  proc: faster /proc/*/status
  ...
2016-10-07 21:38:00 -07:00
Andreas Gruenbacher fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Rasmus Villemoes 19938e350a mm/shmem.c: constify anon_ops
Every other dentry_operations instance is const, and this one might as
well be.

Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Al Viro 82c156f853 switch generic_file_splice_read() to use of ->read_iter()
... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:56 -04:00
Deepa Dinamani 078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Miklos Szeredi 2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Hugh Dickins 71664665c3 huge tmpfs: fix Committed_AS leak
Under swapping load on huge tmpfs, /proc/meminfo's Committed_AS grows
bigger and bigger: just a cosmetic issue for most users, but disabling
for those who run without overcommit (/proc/sys/vm/overcommit_memory 2).

shmem_uncharge() was forgetting to unaccount __vm_enough_memory's
charge, and shmem_charge() was forgetting it on the filesystem-full
error path.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-24 11:20:01 -07:00
Toshi Kani 3089bf614c shmem: fix tmpfs to handle the huge= option properly
shmem_get_unmapped_area() checks SHMEM_SB(sb)->huge incorrectly, which
leads to a reversed effect of "huge=" mount option.

Fix the check in shmem_get_unmapped_area().

Note, the default value of SHMEM_SB(sb)->huge remains as
SHMEM_HUGE_NEVER.  User will need to specify "huge=" option to enable
huge page mappings.

Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-24 11:20:01 -07:00
Jan Kara 31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Arnd Bergmann 3b33719c9b thp: move shmem_huge_enabled() outside of SYSFS ifdef
The newly introduced shmem_huge_enabled() function has two definitions,
but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
build error:

  mm/khugepaged.o: In function `khugepaged':
  khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'

This changes the #ifdef guards around the definition to match those that
are used in the header file.

Fixes: e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Geert Uytterhoeven 4620a06e4b shmem: Fix link error if huge pages support is disabled
If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
BUILD_BUG_ON(), and may cause (e.g. with gcc 4.12):

    mm/built-in.o: In function `shmem_alloc_hugepage':
    shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'

To fix this, move the assignment to hindex after the check for huge
pages support.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-03 18:20:12 -04:00
Mel Gorman 11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Kirill A. Shutemov 779750d20b shmem: split huge pages beyond i_size under memory pressure
Even if user asked to allocate huge pages always (huge=always), we
should be able to free up some memory by splitting pages which are
partly byound i_size if memory presure comes or once we hit limit on
filesystem size (-o size=).

In order to do this we maintain per-superblock list of inodes, which
potentially have huge pages on the border of file size.

Per-fs shrinker can reclaim memory by splitting such pages.

If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to
free up space on the filesystem and retry allocation if it succeed.

Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov e496cf3d78 thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
For file mappings, we don't deposit page tables on THP allocation
because it's not strictly required to implement split_huge_pmd(): we can
just clear pmd and let following page faults to reconstruct the page
table.

But Power makes use of deposited page table to address MMU quirk.

Let's hide THP page cache, including huge tmpfs, under separate config
option, so it can be forbidden on Power.

We can revert the patch later once solution for Power found.

Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov f3f0e1d215 khugepaged: add support of collapse for tmpfs/shmem pages
This patch extends khugepaged to support collapse of tmpfs/shmem pages.
We share fair amount of infrastructure with anon-THP collapse.

Few design points:

  - First we are looking for VMA which can be suitable for mapping huge
    page;

  - If the VMA maps shmem file, the rest scan/collapse operations
    operates on page cache, not on page tables as in anon VMA case.

  - khugepaged_scan_shmem() finds a range which is suitable for huge
    page. The scan is lockless and shouldn't disturb system too much.

  - once the candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

      + scan over radix tree, making the range point to new huge page;

      + new huge page is not-uptodate, locked and freezed (refcount
        is 0), so nobody can touch them until we say so.

      + we swap in pages during the scan. khugepaged_scan_shmem()
        filters out ranges with more than khugepaged_max_ptes_swap
	swapped out pages. It's HPAGE_PMD_NR/8 by default.

      + old pages are isolated, unmapped and put to local list in case
        to be restored back if collapse failed.

  - if collapse succeed, we retract pte page tables from VMAs where huge
    pages mapping is possible. The huge page will be mapped as PMD on
    next minor fault into the range.

Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 4595ef88d1 shmem: make shmem_inode_info::lock irq-safe
We are going to need to call shmem_charge() under tree_lock to get
accoutning right on collapse of small tmpfs pages into a huge one.

The problem is that tree_lock is irq-safe and lockdep is not happy, that
we take irq-unsafe lock under irq-safe[1].

Let's convert the lock to irq-safe.

[1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc

Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 657e3038c4 shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
Let's wire up existing madvise() hugepage hints for file mappings.

MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the
VMA.  It only has effect if the filesystem is mounted with huge=advise
or huge=within_size.

MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in
the VMA.  It doesn't prevent a huge page from being allocated by other
means, i.e.  page fault into different mapping or write(2) into file.

Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 800d8c63b2 shmem: add huge pages support
Here's basic implementation of huge pages support for shmem/tmpfs.

It's all pretty streight-forward:

  - shmem_getpage() allcoates huge page if it can and try to inserd into
    radix tree with shmem_add_to_page_cache();

  - shmem_add_to_page_cache() puts the page onto radix-tree if there's
    space for it;

  - shmem_undo_range() removes huge pages, if it fully within range.
    Partial truncate of huge pages zero out this part of THP.

    This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending what
    pages happened to be allocated.

  - no need to change shmem_fault: core-mm will map an compound page as
    huge if VMA is suitable;

Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Hugh Dickins c01d5b3007 shmem: get_unmapped_area align huge page
Provide a shmem_get_unmapped_area method in file_operations, called at
mmap time to decide the mapping address.  It could be conditional on
CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout).  Lots of conditions, and in most cases it
just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back to
ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(), not via
the file_operations: because of the way shmem_zero_setup() is called to
create a shmem object late in the mmap sequence, when MAP_SHARED is
requested with MAP_ANONYMOUS or /dev/zero.  Though this only matters
when /proc/sys/vm/shmem_huge has been set.

Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 5a6e75f811 shmem: prepare huge= mount option and sysfs knob
This patch adds new mount option "huge=".  It can have following values:

  - "always":
	Attempt to allocate huge pages every time we need a new page;

  - "never":
	Do not allocate huge pages;

  - "within_size":
	Only allocate huge page if it will be fully within i_size.
	Also respect fadvise()/madvise() hints;

  - "advise:
	Only allocate huge pages if requested with fadvise()/madvise();

Default is "never" for now.

"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
is the appropriate option to protect those who don't want the new bloat,
and with which we shall share some pmd code.

Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).

Allow enabling THP only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
objects, Ashmem.  Though unlikely to suit all usages, provide sysfs knob
/sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
huge on those.

And allow shmem_enabled two further values:

  - "deny":
	For use in emergencies, to force the huge option off from
	all mounts;
  - "force":
	Force the huge option on for all - very useful for testing;

Based on patch by Hugh Dickins.

Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Hugh Dickins 7f55656703 tmpfs: fix regression hang in fallocate undo
The well-spotted fallocate undo fix is good in most cases, but not when
fallocate failed on the very first page.  index 0 then passes lend -1
to shmem_undo_range(), and that has two bad effects: (a) that it will
undo every fallocation throughout the file, unrestricted by the current
range; but more importantly (b) it can cause the undo to hang, because
lend -1 is treated as truncation, which makes it keep on retrying until
every page has gone, but those already fully instantiated will never go
away.  Big thank you to xfstests generic/269 which demonstrates this.

Fixes: b9b4bb26af ("tmpfs: don't undo fallocate past its last page")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-10 20:08:44 -07:00
Anthony Romano b9b4bb26af tmpfs: don't undo fallocate past its last page
When fallocate is interrupted it will undo a range that extends one byte
past its range of allocated pages.  This can corrupt an in-use page by
zeroing out its first byte.  Instead, undo using the inclusive byte
range.

Fixes: 1635f6a741 ("tmpfs: undo fallocation on failure")
Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com
Signed-off-by: Anthony Romano <anthony.romano@coreos.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Brandon Philips <brandon@ifup.co>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Al Viro 5930122683 switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27 15:39:43 -04:00
Andres Lagar-Cavilla 9e18eb2935 tmpfs: mem_cgroup charge fault to vm_mm not current mm
Although shmem_fault() has been careful to count a major fault to vm_mm,
shmem_getpage_gfp() has been careless in charging a remote access fault
to current->mm owner's memcg instead of to vma->vm_mm owner's memcg:
that is inconsistent with all the mem_cgroup charging on remote access
faults in mm/memory.c.

Fix it by passing fault_mm along with fault_type to
shmem_get_page_gfp(); but in that case, now knowing the right mm, it's
better for it to handle the PGMAJFAULT updates itself.

And let's keep this clutter out of most callers' way: change the common
shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp.

Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Hugh Dickins 75edd345e8 tmpfs: preliminary minor tidyups
Make a few cleanups in mm/shmem.c, before going on to complicate it.

shmem_alloc_page() will become more complicated: we can't afford to to
have that complication duplicated between a CONFIG_NUMA version and a
!CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a
single shmem_swapin() and a single shmem_alloc_page().

Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
configurations, but eliminating it is a larger cleanup: I have an
alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and
bug-prone, and changed yet again since my last version.

Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to
shmem_alloc_page(): that SwapBacked flag will be useful in future, to
help to distinguish different cases appropriately.

And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
little use (IIRC it dates back to when shmem_getpage() returned the page
unlocked): kill it and do the necessary in shmem_file_read_iter().

But an arm64 build then complained that info may be uninitialized (where
shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and
advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Hugh Dickins fa9949da59 mm: use __SetPageSwapBacked and dont ClearPageSwapBacked
v3.16 commit 07a4278843 ("mm: shmem: avoid atomic operation during
shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
by __SetPageSwapBacked, pointing out that the newly allocated page is
not yet visible to other users (except speculative get_page_unless_zero-
ers, who may not update page flags before their further checks).

That was part of a series in which Mel was focused on tmpfs profiles:
but almost all SetPageSwapBacked uses can be so optimized, with the same
justification.

Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
it's not an error to free a page with PG_swapbacked set.

Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
doing it differently in different places; but that's for tidiness - if
the ordering actually mattered, we should not be using the __variants.

There's probably scope for further __SetPageFlags in other places, but
SwapBacked is the one I'm interested in at the moment.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andres Lagar-Cavilla <andreslc@google.com>
Cc: Yang Shi <yang.shi@linaro.org>
Cc: Ning Qu <quning@gmail.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Al Viro 84e710da2a parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup
dentry with desired parent/name before adding to in-lookup set.

One possible solution would be to hold the parent's ->d_lock through
both checks, but while the in-lookup set is relatively small at any
time, dcache is not.  And holding the parent's ->d_lock through
something like __d_lookup_rcu() would suck too badly.

So we leave the parent's ->d_lock alone, which means that we watch
out for the following scenario:
	* we verify that there's no hashed match
	* existing in-lookup match gets hashed by another process
	* we verify that there's no in-lookup matches and decide
that everything's fine.

Solution: per-directory kinda-sorta seqlock, bumped around the times
we hash something that used to be in-lookup or move (and hash)
something in place of in-lookup.  Then the above would turn into
	* read the counter
	* do dcache lookup
	* if no matches found, check for in-lookup matches
	* if there had been none of those either, check if the
counter has changed; repeat if it has.

The "kinda-sorta" part is due to the fact that we don't have much spare
space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
so the counter part is not a problem, but spinlock is a different story.

We could use the parent's ->d_lock, and it would be less painful in
terms of contention, for __d_add() it would be rather inconvenient to
grab; we could do that (using lock_parent()), but...

Fortunately, we can get serialization on the counter itself, and it
might be a good idea in general; we can use cmpxchg() in a loop to
get from even to odd and smp_store_release() from odd to even.

This commit adds the counter and updating logics; the readers will be
added in the next commit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:26 -04:00
Al Viro 84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Al Viro b296821a7c xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 20:48:24 -04:00
Kirill A. Shutemov 09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Matthew Wilcox 7165092fe5 radix-tree,shmem: introduce radix_tree_iter_next()
shmem likes to occasionally drop the lock, schedule, then reacqire the
lock and continue with the iteration from the last place it left off.
This is currently done with a pretty ugly goto.  Introduce
radix_tree_iter_next() and use it throughout shmem.c.

[koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Matthew Wilcox 2cf938aae1 mm: use radix_tree_iter_retry()
Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
restart from our current position.  This will make a difference when
there are more ways to happen across an indirect pointer.  And it
eliminates some confusing gotos.

[vbabka@suse.cz: remove now-obsolete-and-misleading comment]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Joe Perches 1170532bb4 mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent.

Miscellanea:

 - Realign arguments
 - Add missing newline to format
 - kmemleak-test.c has a "kmemleak: " prefix added to the
   "Kmemleak testing" logging message via pr_fmt

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-17 15:09:34 -07:00
Johannes Weiner 6a93ca8fde mm: migrate: do not touch page->mem_cgroup of live pages
Changing a page's memcg association complicates dealing with the page,
so we want to limit this as much as possible.  Page migration e.g.  does
not have to do that.  Just like page cache replacement, it can forcibly
charge a replacement page, and then uncharge the old page when it gets
freed.  Temporarily overcharging the cgroup by a single page is not an
issue in practice, and charging is so cheap nowadays that this is much
preferrable to the headache of messing with live pages.

The only place that still changes the page->mem_cgroup binding of live
pages is when pages move along with a task to another cgroup.  But that
path isolates the page from the LRU, takes the page lock, and the move
lock (lock_page_memcg()).  That means page->mem_cgroup is always stable
in callers that have the page isolated from the LRU or locked.  Lighter
unlocked paths, like writeback accounting, can use lock_page_memcg().

[akpm@linux-foundation.org: fix build]
[vdavydov@virtuozzo.com: fix lockdep splat]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 16:55:16 -07:00
Al Viro 3ed47db34f make sure that freeing shmem fast symlinks is RCU-delayed
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:08:52 -05:00
Al Viro 5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
Vladimir Davydov 37e8435119 mm: memcontrol: charge swap to cgroup2
This patchset introduces swap accounting to cgroup2.

This patch (of 7):

In the legacy hierarchy we charge memsw, which is dubious, because:

 - memsw.limit must be >= memory.limit, so it is impossible to limit
   swap usage less than memory usage. Taking into account the fact that
   the primary limiting mechanism in the unified hierarchy is
   memory.high while memory.limit is either left unset or set to a very
   large value, moving memsw.limit knob to the unified hierarchy would
   effectively make it impossible to limit swap usage according to the
   user preference.

 - memsw.usage != memory.usage + swap.usage, because a page occupying
   both swap entry and a swap cache page is charged only once to memsw
   counter. As a result, it is possible to effectively eat up to
   memory.limit of memory pages *and* memsw.limit of swap entries, which
   looks unexpected.

That said, we should provide a different swap limiting mechanism for
cgroup2.

This patch adds mem_cgroup->swap counter, which charges the actual number
of swap entries used by a cgroup.  It is only charged in the unified
hierarchy, while the legacy hierarchy memsw logic is left intact.

The swap usage can be monitored using new memory.swap.current file and
limited using memory.swap.max.

Note, to charge swap resource properly in the unified hierarchy, we have
to make swap_entry_free uncharge swap only when ->usage reaches zero, not
just ->count, i.e.  when all references to a swap entry, including the one
taken by swap cache, are gone.  This is necessary, because otherwise
swap-in could result in uncharging swap even if the page is still in swap
cache and hence still occupies a swap entry.  At the same time, this
shouldn't break memsw counter logic, where a page is never charged twice
for using both memory and swap, because in case of legacy hierarchy we
uncharge swap on commit (see mem_cgroup_commit_charge).

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-20 17:09:18 -08:00
Kirill A. Shutemov f627c2f537 memcg: adjust to support new THP refcounting
As with rmap, with new refcounting we cannot rely on PageTransHuge() to
check if we need to charge size of huge page form the cgroup.  We need
to get information from caller to know whether it was mapped with PMD or
PTE.

We do uncharge when last reference on the page gone.  At that point if
we see PageTransHuge() it means we need to unchange whole huge page.

The tricky part is partial unmap -- when we try to unmap part of huge
page.  We don't do a special handing of this situation, meaning we don't
uncharge the part of huge page unless last user is gone or
split_huge_page() is triggered.  In case of cgroup memory pressure
happens the partial unmapped page will be split through shrinker.  This
should be good enough.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Sasha Levin <sasha.levin@oracle.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Kirill A. Shutemov 48c935ad88 page-flags: define PG_locked behavior on compound pages
lock_page() must operate on the whole compound page.  It doesn't make
much sense to lock part of compound page.  Change code to use head
page's PG_locked, if tail page is passed.

This patch also gets rid of custom helper functions --
__set_page_locked() and __clear_page_locked().  They are replaced with
helpers generated by __SETPAGEFLAG/__CLEARPAGEFLAG.  Tail pages to these
helper would trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin locked.  IIUC, tail pages should never
appear there.  VM_BUG_ON() is added to make sure that this assumption is
correct.

[akpm@linux-foundation.org: fix fs/cifs/file.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Linus Torvalds 875fc4f5dd Merge branch 'akpm' (patches from Andrew)
Merge first patch-bomb from Andrew Morton:

 - A few hotfixes which missed 4.4 becasue I was asleep.  cc'ed to
   -stable

 - A few misc fixes

 - OCFS2 updates

 - Part of MM.  Including pretty large changes to page-flags handling
   and to thp management which have been buffered up for 2-3 cycles now.

  I have a lot of MM material this time.

[ It turns out the THP part wasn't quite ready, so that got dropped from
  this series  - Linus ]

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (117 commits)
  zsmalloc: reorganize struct size_class to pack 4 bytes hole
  mm/zbud.c: use list_last_entry() instead of list_tail_entry()
  zram/zcomp: do not zero out zcomp private pages
  zram: pass gfp from zcomp frontend to backend
  zram: try vmalloc() after kmalloc()
  zram/zcomp: use GFP_NOIO to allocate streams
  mm: add tracepoint for scanning pages
  drivers/base/memory.c: fix kernel warning during memory hotplug on ppc64
  mm/page_isolation: use macro to judge the alignment
  mm: fix noisy sparse warning in LIBCFS_ALLOC_PRE()
  mm: rework virtual memory accounting
  include/linux/memblock.h: fix ordering of 'flags' argument in comments
  mm: move lru_to_page to mm_inline.h
  Documentation/filesystems: describe the shared memory usage/accounting
  memory-hotplug: don't BUG() in register_memory_resource()
  hugetlb: make mm and fs code explicitly non-modular
  mm/swapfile.c: use list_for_each_entry_safe in free_swap_count_continuations
  mm: /proc/pid/clear_refs: no need to clear VM_SOFTDIRTY in clear_soft_dirty_pmd()
  mm: make sure isolate_lru_page() is never called for tail page
  vmstat: make vmstat_updater deferrable again and shut down on idle
  ...
2016-01-15 11:41:44 -08:00
Vlastimil Babka 48131e03ca mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings
Following the previous patch, further reduction of /proc/pid/smaps cost
is possible for private writable shmem mappings with unpopulated areas
where the page walk invokes the .pte_hole function.  We can use radix
tree iterator for each such area instead of calling find_get_entry() in
a loop.  This is possible at the extra maintenance cost of introducing
another shmem function shmem_partial_swap_usage().

To demonstrate the diference, I have measured this on a process that
creates a private writable 2GB mapping of a partially swapped out
/dev/shm/file (which cannot employ the optimizations from the prvious
patch) and doesn't populate it at all.  I time how long does it take to
cat /proc/pid/smaps of this process 100 times.

Before this patch:

real    0m3.831s
user    0m0.180s
sys     0m3.212s

After this patch:

real    0m1.176s
user    0m0.180s
sys     0m0.684s

The time is similar to the case where a radix tree iterator is employed
on the whole mapping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vlastimil Babka 6a15a37097 mm, proc: reduce cost of /proc/pid/smaps for shmem mappings
The previous patch has improved swap accounting for shmem mapping, which
however made /proc/pid/smaps more expensive for shmem mappings, as we
consult the radix tree for each pte_none entry, so the overal complexity
is O(n*log(n)).

We can reduce this significantly for mappings that cannot contain COWed
pages, because then we can either use the statistics tha shmem object
itself tracks (if the mapping contains the whole object, or the swap
usage of the whole object is zero), or use the radix tree iterator,
which is much more effective than repeated find_get_entry() calls.

This patch therefore introduces a function shmem_swap_usage(vma) and
makes /proc/pid/smaps use it when possible.  Only for writable private
mappings of shmem objects (i.e.  tmpfs files) with the shmem object
itself (partially) swapped outwe have to resort to the find_get_entry()
approach.

Hopefully such mappings are relatively uncommon.

To demonstrate the diference, I have measured this on a process that
creates a 2GB mapping and dirties single pages with a stride of 2MB, and
time how long does it take to cat /proc/pid/smaps of this process 100
times.

Private writable mapping of a /dev/shm/file (the most complex case):

real    0m3.831s
user    0m0.180s
sys     0m3.212s

Shared mapping of an almost full mapping of a partially swapped /dev/shm/file
(which needs to employ the radix tree iterator).

real    0m1.351s
user    0m0.096s
sys     0m0.768s

Same, but with /dev/shm/file not swapped (so no radix tree walk needed)

real    0m0.935s
user    0m0.128s
sys     0m0.344s

Private anonymous mapping:

real    0m0.949s
user    0m0.116s
sys     0m0.348s

The cost is now much closer to the private anonymous mapping case, unless
the shmem mapping is private and writable.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Vladimir Davydov 5d097056c9 kmemcg: account certain kmem allocations to memcg
Mark those kmem allocations that are known to be easily triggered from
userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to
memcg.  For the list, see below:

 - threadinfo
 - task_struct
 - task_delay_info
 - pid
 - cred
 - mm_struct
 - vm_area_struct and vm_region (nommu)
 - anon_vma and anon_vma_chain
 - signal_struct
 - sighand_struct
 - fs_struct
 - files_struct
 - fdtable and fdtable->full_fds_bits
 - dentry and external_name
 - inode for all filesystems. This is the most tedious part, because
   most filesystems overwrite the alloc_inode method.

The list is far from complete, so feel free to add more objects.
Nevertheless, it should be close to "account everything" approach and
keep most workloads within bounds.  Malevolent users will be able to
breach the limit, but this was possible even with the former "account
everything" approach (simply because it did not account everything in
fact).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Greg Thelen <gthelen@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Al Viro e8ecde25f5 Make sure that highmem pages are not added to symlink page cache
inode_nohighmem() is sufficient to make sure that page_get_link()
won't try to allocate a highmem page.  Moreover, it is sufficient
to make sure that page_symlink/__page_symlink won't do the same
thing.  However, any filesystem that manually preseeds the symlink's
page cache upon symlink(2) needs to make sure that the page it
inserts there won't be a highmem one.

Fortunately, only nfs and shmem have run afoul of that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-14 17:56:54 -05:00
Linus Torvalds ddf1d6238d Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "Andreas' xattr cleanup series.

  It's a followup to his xattr work that went in last cycle; -0.5KLoC"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  xattr handlers: Simplify list operation
  ocfs2: Replace list xattr handler operations
  nfs: Move call to security_inode_listsecurity into nfs_listxattr
  xfs: Change how listxattr generates synthetic attributes
  tmpfs: listxattr should include POSIX ACL xattrs
  tmpfs: Use xattr handler infrastructure
  btrfs: Use xattr handler infrastructure
  vfs: Distinguish between full xattr names and proper prefixes
  posix acls: Remove duplicate xattr name definitions
  gfs2: Remove gfs2_xattr_acl_chmod
  vfs: Remove vfs_xattr_cmp
2016-01-11 13:32:10 -08:00
Linus Torvalds 32fb378437 Merge branch 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs RCU symlink updates from Al Viro:
 "Replacement of ->follow_link/->put_link, allowing to stay in RCU mode
  even if the symlink is not an embedded one.

  No changes since the mailbomb on Jan 1"

* 'work.symlinks' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->get_link() to delayed_call, kill ->put_link()
  kill free_page_put_link()
  teach nfs_get_link() to work in RCU mode
  teach proc_self_get_link()/proc_thread_self_get_link() to work in RCU mode
  teach shmem_get_link() to work in RCU mode
  teach page_get_link() to work in RCU mode
  replace ->follow_link() with new method that could stay in RCU mode
  don't put symlink bodies in pagecache into highmem
  namei: page_getlink() and page_follow_link_light() are the same thing
  ufs: get rid of ->setattr() for symlinks
  udf: don't duplicate page_symlink_inode_operations
  logfs: don't duplicate page_symlink_inode_operations
  switch befs long symlinks to page_symlink_operations
2016-01-11 13:13:23 -08:00
Al Viro fceef393a5 switch ->get_link() to delayed_call, kill ->put_link()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-30 13:01:03 -05:00
Hugh Dickins 267a4c76bb tmpfs: fix shmem_evict_inode() warnings on i_blocks
Dmitry Vyukov provides a little program, autogenerated by syzkaller,
which races a fault on a mapping of a sparse memfd object, against
truncation of that object below the fault address: run repeatedly for a
few minutes, it reliably generates shmem_evict_inode()'s
WARN_ON(inode->i_blocks).

(But there's nothing specific to memfd here, nor to the fstat which it
happened to use to generate the fault: though that looked suspicious,
since a shmem_recalc_inode() had been added there recently.  The same
problem can be reproduced with open+unlink in place of memfd_create, and
with fstatfs in place of fstat.)

v3.7 commit 0f3c42f522 ("tmpfs: change final i_blocks BUG to WARNING")
explains one cause of such a warning (a race with shmem_writepage to
swap), and possible solutions; but we never took it further, and this
syzkaller incident turns out to have a different cause.

shmem_getpage_gfp()'s error recovery, when a freshly allocated page is
then found to be beyond eof, looks plausible - decrementing the alloced
count that was just before incremented - but in fact can go wrong, if a
racing thread (the truncator, for example) gets its shmem_recalc_inode()
in just after our delete_from_page_cache().  delete_from_page_cache()
decrements nrpages, that shmem_recalc_inode() will balance the books by
decrementing alloced itself, then our decrement of alloced take it one
too low: leading to the WARNING when the object is finally evicted.

Once the new page has been exposed in the page cache,
shmem_getpage_gfp() must leave it to shmem_recalc_inode() itself to get
the accounting right in all cases (and not fall through from "trunc:" to
"decused:").  Adjust that error recovery block; and the reinitialization
of info and sbinfo can be removed too.

While we're here, fix shmem_writepage() to avoid the original issue: it
will be safe against a racing shmem_recalc_inode(), if it merely
increments swapped before the shmem_delete_from_page_cache() which
decrements nrpages (but it must then do its own shmem_recalc_inode()
before that, while still in balance, instead of after).  (Aside: why do
we shmem_recalc_inode() here in the swap path? Because its raison d'etre
is to cope with clean sparse shmem pages being reclaimed behind our
back: so here when swapping is a good place to look for that case.) But
I've not now managed to reproduce this bug, even without the patch.

I don't see why I didn't do that earlier: perhaps inhibited by the
preference to eliminate shmem_recalc_inode() altogether.  Driven by this
incident, I do now have a patch to do so at last; but still want to sit
on it for a bit, there's a couple of questions yet to be resolved.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-12-12 10:15:34 -08:00
Al Viro 6a6c990496 teach shmem_get_link() to work in RCU mode
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:55 -05:00
Al Viro 6b2553918d replace ->follow_link() with new method that could stay in RCU mode
new method: ->get_link(); replacement of ->follow_link().  The differences
are:
	* inode and dentry are passed separately
	* might be called both in RCU and non-RCU mode;
the former is indicated by passing it a NULL dentry.
	* when called that way it isn't allowed to block
and should return ERR_PTR(-ECHILD) if it needs to be called
in non-RCU mode.

It's a flagday change - the old method is gone, all in-tree instances
converted.  Conversion isn't hard; said that, so far very few instances
do not immediately bail out when called in RCU mode.  That'll change
in the next commits.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:54 -05:00
Al Viro 21fc61c73c don't put symlink bodies in pagecache into highmem
kmap() in page_follow_link_light() needed to go - allowing to hold
an arbitrary number of kmaps for long is a great way to deadlocking
the system.

new helper (inode_nohighmem(inode)) needs to be used for pagecache
symlinks inodes; done for all in-tree cases.  page_follow_link_light()
instrumented to yell about anything missed.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-08 22:41:36 -05:00
Andreas Gruenbacher 786534b92f tmpfs: listxattr should include POSIX ACL xattrs
When a file on tmpfs has an ACL or a Default ACL, listxattr should include the
corresponding xattr name.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:34:15 -05:00
Andreas Gruenbacher aa7c5241c3 tmpfs: Use xattr handler infrastructure
Use the VFS xattr handler infrastructure and get rid of similar code in
the filesystem.  For implementing shmem_xattr_handler_set, we need a
version of simple_xattr_set which removes the attribute when value is
NULL.  Use this to implement kernfs_iop_removexattr as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: James Morris <james.l.morris@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: linux-mm@kvack.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-06 21:34:15 -05:00
Mel Gorman dd56b04642 mm: page_alloc: hide some GFP internals and document the bits and flag combinations
Andrew stated the following

	We have quite a history of remote parts of the kernel using
	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
	to think that this is because we didn't adequately explain the
	interface.

	And I don't think that gfp.h really improved much in this area as
	a result of this patchset.  Could you go through it some time and
	decide if we've adequately documented all this stuff?

This patches first moves some GFP flag combinations that are part of the MM
internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
bits under various headings and then documents the flag combinations. It
will not help callers that are brain damaged but the clarity might motivate
some fixes and avoid future mistakes.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vitaly Wool <vitalywool@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06 17:50:42 -08:00
Hugh Dickins d0424c429f tmpfs: avoid a little creat and stat slowdown
LKP reports that v4.2 commit afa2db2fb6 ("tmpfs: truncate prealloc
blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
benchmark.

creat-clo does just what you'd expect from the name, and creat's O_TRUNC
on 0-length file does indeed get into more overhead now shmem_setattr()
tests "0 <= 0" instead of "0 < 0".

I'm not sure how much we care, but I think it would not be too VW-like to
add in a check for whether any pages (or swap) are allocated: if none are
allocated, there's none to remove from the radix_tree.  At first I thought
that check would be good enough for the unmaps too, but no: we should not
skip the unlikely case of unmapping pages beyond the new EOF, which were
COWed from holes which have now been reclaimed, leaving none.

This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere, and
running a debug config before and after: I hope those account for the
lesser speedup.

And probably someone has a benchmark where a thousand threads keep on
stat'ing the same file repeatedly: forestall that report by adjusting v4.3
commit 44a30220bc ("shmem: recalculate file inode when fstat") not to
take the spinlock in shmem_getattr() when there's no work to do.

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Ying Huang <ying.huang@linux.intel.com>
Tested-by: Ying Huang <ying.huang@linux.intel.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Hugh Dickins 45637bab30 mm: rename mem_cgroup_migrate to mem_cgroup_replace_page
After v4.3's commit 0610c25daa ("memcg: fix dirty page migration")
mem_cgroup_migrate() doesn't have much to offer in page migration: convert
migrate_misplaced_transhuge_page() to set_page_memcg() instead.

Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its
remaining callers are replace_page_cache_page() and shmem_replace_page():
both of whom passed lrucare true, so just eliminate that argument.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
Yu Zhao 44a30220bc shmem: recalculate file inode when fstat
Shmem uses shmem_recalc_inode to update i_blocks when it allocates page,
undoes range or swaps.  But mm can drop clean page without notifying
shmem.  This makes fstat sometimes return out-of-date block size.

The problem can be partially solved when we add
inode_operations->getattr which calls shmem_recalc_inode to update
i_blocks for fstat.

shmem_recalc_inode also updates counter used by statfs and
vm_committed_as.  For them the situation is not changed.  They still
suffer from the discrepancy after dropping clean page and before the
function is called by aforementioned triggers.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-08 15:35:28 -07:00
Stephen Smalley e1832f2923 ipc: use private shmem or hugetlbfs inodes for shm segments.
The shm implementation internally uses shmem or hugetlbfs inodes for shm
segments.  As these inodes are never directly exposed to userspace and
only accessed through the shm operations which are already hooked by
security modules, mark the inodes with the S_PRIVATE flag so that inode
security initialization and permission checking is skipped.

This was motivated by the following lockdep warning:

  ======================================================
   [ INFO: possible circular locking dependency detected ]
   4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G        W
  -------------------------------------------------------
   httpd/1597 is trying to acquire lock:
   (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130
   but task is already holding lock:
   (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180
   which lock already depends on the new lock.
   the existing dependency chain (in reverse order) is:
   -> #3 (&mm->mmap_sem){++++++}:
        lock_acquire+0xc7/0x270
        __might_fault+0x7a/0xa0
        filldir+0x9e/0x130
        xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs]
        xfs_readdir+0x1b4/0x330 [xfs]
        xfs_file_readdir+0x2b/0x30 [xfs]
        iterate_dir+0x97/0x130
        SyS_getdents+0x91/0x120
        entry_SYSCALL_64_fastpath+0x12/0x76
   -> #2 (&xfs_dir_ilock_class){++++.+}:
        lock_acquire+0xc7/0x270
        down_read_nested+0x57/0xa0
        xfs_ilock+0x167/0x350 [xfs]
        xfs_ilock_attr_map_shared+0x38/0x50 [xfs]
        xfs_attr_get+0xbd/0x190 [xfs]
        xfs_xattr_get+0x3d/0x70 [xfs]
        generic_getxattr+0x4f/0x70
        inode_doinit_with_dentry+0x162/0x670
        sb_finish_set_opts+0xd9/0x230
        selinux_set_mnt_opts+0x35c/0x660
        superblock_doinit+0x77/0xf0
        delayed_superblock_init+0x10/0x20
        iterate_supers+0xb3/0x110
        selinux_complete_init+0x2f/0x40
        security_load_policy+0x103/0x600
        sel_write_load+0xc1/0x750
        __vfs_write+0x37/0x100
        vfs_write+0xa9/0x1a0
        SyS_write+0x58/0xd0
        entry_SYSCALL_64_fastpath+0x12/0x76
  ...

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Reported-by: Morten Stevens <mstevens@fedoraproject.org>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07 04:39:41 +03:00
Josef Bacik afa2db2fb6 tmpfs: truncate prealloc blocks past i_size
One of the rocksdb people noticed that when you do something like this

    fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 10M)
    pwrite(fd, buf, 5M, 0)
    ftruncate(5M)

on tmpfs, the file would still take up 10M: which led to super fun
issues because we were getting ENOSPC before we thought we should be
getting ENOSPC.  This patch fixes the problem, and mirrors what all the
other fs'es do (and was agreed to be the correct behaviour at LSF).

I tested it locally to make sure it worked properly with the following

    xfs_io -f -c "falloc -k 0 10M" -c "pwrite 0 5M" -c "truncate 5M" file

Without the patch we have "Blocks: 20480", with the patch we have the
correct value of "Blocks: 10240".

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:45 -07:00
Linus Torvalds 052b398a43 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "In this pile: pathname resolution rewrite.

   - recursion in link_path_walk() is gone.

   - nesting limits on symlinks are gone (the only limit remaining is
     that the total amount of symlinks is no more than 40, no matter how
     nested).

   - "fast" (inline) symlinks are handled without leaving rcuwalk mode.

   - stack footprint (independent of the nesting) is below kilobyte now,
     about on par with what it used to be with one level of nested
     symlinks and ~2.8 times lower than it used to be in the worst case.

   - struct nameidata is entirely private to fs/namei.c now (not even
     opaque pointers are being passed around).

   - ->follow_link() and ->put_link() calling conventions had been
     changed; all in-tree filesystems converted, out-of-tree should be
     able to follow reasonably easily.

     For out-of-tree conversions, see Documentation/filesystems/porting
     for details (and in-tree filesystems for examples of conversion).

  That has sat in -next since mid-May, seems to survive all testing
  without regressions and merges clean with v4.1"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (131 commits)
  turn user_{path_at,path,lpath,path_dir}() into static inlines
  namei: move saved_nd pointer into struct nameidata
  inline user_path_create()
  inline user_path_parent()
  namei: trim do_last() arguments
  namei: stash dfd and name into nameidata
  namei: fold path_cleanup() into terminate_walk()
  namei: saner calling conventions for filename_parentat()
  namei: saner calling conventions for filename_create()
  namei: shift nameidata down into filename_parentat()
  namei: make filename_lookup() reject ERR_PTR() passed as name
  namei: shift nameidata inside filename_lookup()
  namei: move putname() call into filename_lookup()
  namei: pass the struct path to store the result down into path_lookupat()
  namei: uninline set_root{,_rcu}()
  namei: be careful with mountpoint crossings in follow_dotdot_rcu()
  Documentation: remove outdated information from automount-support.txt
  get rid of assorted nameidata-related debris
  lustre: kill unused helper
  lustre: kill unused macro (LOOKUP_CONTINUE)
  ...
2015-06-22 12:51:21 -07:00
Hugh Dickins 66fc130394 mm: shmem_zero_setup skip security check and lockdep conflict with XFS
It appears that, at some point last year, XFS made directory handling
changes which bring it into lockdep conflict with shmem_zero_setup():
it is surprising that mmap() can clone an inode while holding mmap_sem,
but that has been so for many years.

Since those few lockdep traces that I've seen all implicated selinux,
I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
v3.13's commit c727709092 ("security: shmem: implement kernel private
shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.

This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
(and MAP_SHARED|MAP_ANONYMOUS).  I thought there were also drivers
which cloned inode in mmap(), but if so, I cannot locate them now.

Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
Reported-and-tested-by: Daniel Wagner <wagi@monom.org>
Reported-and-tested-by: Morten Stevens <mstevens@fedoraproject.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-17 20:40:19 -10:00
Al Viro 5f2c4179e1 switch ->put_link() from dentry to inode
only one instance looks at that argument at all; that sole
exception wants inode rather than dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-11 08:13:12 -04:00
Al Viro 6e77137b36 don't pass nameidata to ->follow_link()
its only use is getting passed to nd_jump_link(), which can obtain
it from current->nameidata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:20:15 -04:00
Al Viro 680baacbca new ->follow_link() and ->put_link() calling conventions
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
is ignored in all cases except the last one.

Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().

b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is.  In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:19:45 -04:00
Al Viro 60380f193e shmem: switch to simple_follow_link()
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:18:24 -04:00
David Howells 75c3cfa855 VFS: assorted weird filesystems: d_inode() annotations
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:58 -04:00
Al Viro 5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Christoph Hellwig e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Sasha Levin f0774d884b mm: shmem: check for mapping owner before dereferencing
mapping->host can be NULL and shouldn't be dereferenced before being checked.

[ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
[ 1295.746387] Dumping ftrace buffer:
[ 1295.748217]    (ftrace buffer empty)
[ 1295.749527] Modules linked in:
[ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
[ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
[ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
[ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
[ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
[ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
[ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
[ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
[ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
[ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
[ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
[ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
[ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[ 1295.750268] Stack:
[ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
[ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
[ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
[ 1295.750268] Call Trace:
[ 1295.750268] mincore_page (mm/mincore.c:61)
[ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
[ 1295.750268] mincore_pte_range (mm/mincore.c:151)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
[ 1295.750268] walk_page_range (mm/pagewalk.c:275)
[ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
[ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
[ 1295.750268] ? ptlock_free (mm/mincore.c:24)
[ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
[ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48

All code
========
   0:	e5 48                	in     $0x48,%eax
   2:	c1 ea 03             	shr    $0x3,%edx
   5:	53                   	push   %rbx
   6:	48 89 fb             	mov    %rdi,%rbx
   9:	48 83 ec 08          	sub    $0x8,%rsp
   d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
  11:	75 4f                	jne    0x62
  13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  1a:	fc ff df
  1d:	48 8b 1b             	mov    (%rbx),%rbx
  20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
  24:	48 89 fa             	mov    %rdi,%rdx
  27:	48 c1 ea 03          	shr    $0x3,%rdx
  2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
  2f:	75 3f                	jne    0x70
  31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  38:	fc ff df
  3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  3f:	48                   	rex.W
	...

Code starting with the faulting instruction
===========================================
   0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   4:	75 3f                	jne    0x45
   6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   d:	fc ff df
  10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  14:	48                   	rex.W
	...
[ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
[ 1295.750268]  RSP <ffff8803a4dcfbf8>

Fixes: 97b713ba3e ("fs: kill BDI_CAP_SWAP_BACKED")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-23 10:00:11 -08:00
David Howells e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
Linus Torvalds 6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Vladimir Davydov 93aa7d9524 swap: remove unused mem_cgroup_uncharge_swapcache declaration
The body of this function was removed by commit 0a31bc97c8 ("mm:
memcontrol: rewrite uncharge API").

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:00 -08:00
Kirill A. Shutemov d83a08db5b mm: drop vm_ops->remap_pages and generic_file_remap_pages() stub
Nobody uses it anymore.

[akpm@linux-foundation.org: fix filemap_xip.c]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:30:30 -08:00
Michal Hocko f5e03a4989 memcg, shmem: fix shmem migration to use lrucare
It has been reported that 965GM might trigger

  VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage)

in mem_cgroup_migrate when shmem wants to replace a swap cache page
because of shmem_should_replace_page (the page is allocated from an
inappropriate zone).  shmem_replace_page expects that the oldpage is not
on LRU list and calls mem_cgroup_migrate without lrucare.  This is
obviously incorrect because swapcache pages might be on the LRU list
(e.g. swapin readahead page).

Fix this by enabling lrucare for the migration in shmem_replace_page.
Also clarify that lrucare should be used even if one of the pages might
be on LRU list.

The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even
without that the migration code might leave the old page on an
inappropriate memcg' LRU which is not that critical because the page
would get removed with its last reference but it is still confusing.

Fixes: 0a31bc97c8 ("mm: memcontrol: rewrite uncharge API")
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
Reported-by: Dave Airlie <airlied@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>	[3.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-05 13:35:29 -08:00
Christoph Hellwig b83ae6d421 fs: remove mapping->backing_dev_info
Now that we never use the backing_dev_info pointer in struct address_space
we can simply remove it and save 4 to 8 bytes in every inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:03:05 -07:00
Christoph Hellwig 97b713ba3e fs: kill BDI_CAP_SWAP_BACKED
This bdi flag isn't too useful - we can determine that a vma is backed by
either swap or shmem trivially in the caller.

This also allows removing the backing_dev_info instaces for swap and shmem
in favor of noop_backing_dev_info.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-01-20 14:02:56 -07:00
Al Viro 777eda2c5b new helper: iter_is_iovec()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-17 06:43:56 -05:00
Miklos Szeredi 46fdb794e3 shmem: support RENAME_WHITEOUT
Allocate a dentry, initialize it with a whiteout and hash it in the place
of the old dentry.  Later the old dentry will be moved away and the
whiteout will remain.

i_mutex protects agains concurrent readdir.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
2014-10-24 00:14:37 +02:00
Linus Torvalds c798360cd1 Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu updates from Tejun Heo:
 "A lot of activities on percpu front.  Notable changes are...

   - percpu allocator now can take @gfp.  If @gfp doesn't contain
     GFP_KERNEL, it tries to allocate from what's already available to
     the allocator and a work item tries to keep the reserve around
     certain level so that these atomic allocations usually succeed.

     This will replace the ad-hoc percpu memory pool used by
     blk-throttle and also be used by the planned blkcg support for
     writeback IOs.

     Please note that I noticed a bug in how @gfp is interpreted while
     preparing this pull request and applied the fix 6ae833c7fe
     ("percpu: fix how @gfp is interpreted by the percpu allocator")
     just now.

   - percpu_ref now uses longs for percpu and global counters instead of
     ints.  It leads to more sparse packing of the percpu counters on
     64bit machines but the overhead should be negligible and this
     allows using percpu_ref for refcnting pages and in-memory objects
     directly.

   - The switching between percpu and single counter modes of a
     percpu_ref is made independent of putting the base ref and a
     percpu_ref can now optionally be initialized in single or killed
     mode.  This allows avoiding percpu shutdown latency for cases where
     the refcounted objects may be synchronously created and destroyed
     in rapid succession with only a fraction of them reaching fully
     operational status (SCSI probing does this when combined with
     blk-mq support).  It's also planned to be used to implement forced
     single mode to detect underflow more timely for debugging.

  There's a separate branch percpu/for-3.18-consistent-ops which cleans
  up the duplicate percpu accessors.  That branch causes a number of
  conflicts with s390 and other trees.  I'll send a separate pull
  request w/ resolutions once other branches are merged"

* 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits)
  percpu: fix how @gfp is interpreted by the percpu allocator
  blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode
  percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
  percpu_ref: add PERCPU_REF_INIT_* flags
  percpu_ref: decouple switching to percpu mode and reinit
  percpu_ref: decouple switching to atomic mode and killing
  percpu_ref: add PCPU_REF_DEAD
  percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
  percpu_ref: replace pcpu_ prefix with percpu_
  percpu_ref: minor code and comment updates
  percpu_ref: relocate percpu_ref_reinit()
  Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe"
  Revert "percpu: free percpu allocation info for uniprocessor system"
  percpu-refcount: make percpu_ref based on longs instead of ints
  percpu-refcount: improve WARN messages
  percpu: fix locking regression in the failure path of pcpu_alloc()
  percpu-refcount: add @gfp to percpu_ref_init()
  proportions: add @gfp to init functions
  percpu_counter: add @gfp to percpu_counter_init()
  percpu_counter: make percpu_counters_lock irq-safe
  ...
2014-10-10 07:26:02 -04:00
Andrew Morton 1c93923cc2 include/linux/migrate.h: remove migrate_page #define
This is designed to avoid a few ifdefs in .c files but it's obnoxious
because it can cause unsuspecting "migrate_page" symbols to get turned into
"NULL".

Just nuke it and use the ifdefs.

Cc: Konstantin Khlebnikov <k.khlebnikov@samsung.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-09 22:25:56 -04:00
Miklos Szeredi b928095b0a shmem: fix nlink for rename overwrite directory
If overwriting an empty directory with rename, then need to drop the extra
nlink.

Test prog:

#include <stdio.h>
#include <fcntl.h>
#include <err.h>
#include <sys/stat.h>

int main(void)
{
	const char *test_dir1 = "test-dir1";
	const char *test_dir2 = "test-dir2";
	int res;
	int fd;
	struct stat statbuf;

	res = mkdir(test_dir1, 0777);
	if (res == -1)
		err(1, "mkdir(\"%s\")", test_dir1);

	res = mkdir(test_dir2, 0777);
	if (res == -1)
		err(1, "mkdir(\"%s\")", test_dir2);

	fd = open(test_dir2, O_RDONLY);
	if (fd == -1)
		err(1, "open(\"%s\")", test_dir2);

	res = rename(test_dir1, test_dir2);
	if (res == -1)
		err(1, "rename(\"%s\", \"%s\")", test_dir1, test_dir2);

	res = fstat(fd, &statbuf);
	if (res == -1)
		err(1, "fstat(%i)", fd);

	if (statbuf.st_nlink != 0) {
		fprintf(stderr, "nlink is %lu, should be 0\n", statbuf.st_nlink);
		return 1;
	}

	return 0;
}

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-09-26 21:16:42 -04:00
Tejun Heo 908c7f1949 percpu_counter: add @gfp to percpu_counter_init()
Percpu allocator now supports allocation mask.  Add @gfp to
percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
with percpu_counters too.

We could have left percpu_counter_init() alone and added
percpu_counter_init_gfp(); however, the number of users isn't that
high and introducing _gfp variants to all percpu data structures would
be quite ugly, so let's just do the conversion.  This is the one with
the most users.  Other percpu data structures are a lot easier to
convert.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: x86@kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
2014-09-08 09:51:29 +09:00
Linus Torvalds f6f993328b Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Stuff in here:

   - acct.c fixes and general rework of mnt_pin mechanism.  That allows
     to go for delayed-mntput stuff, which will permit mntput() on deep
     stack without worrying about stack overflows - fs shutdown will
     happen on shallow stack.  IOW, we can do Eric's umount-on-rmdir
     series without introducing tons of stack overflows on new mntput()
     call chains it introduces.
   - Bruce's d_splice_alias() patches
   - more Miklos' rename() stuff.
   - a couple of regression fixes (stable fodder, in the end of branch)
     and a fix for API idiocy in iov_iter.c.

  There definitely will be another pile, maybe even two.  I'd like to
  get Eric's series in this time, but even if we miss it, it'll go right
  in the beginning of for-next in the next cycle - the tricky part of
  prereqs is in this pile"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  fix copy_tree() regression
  __generic_file_write_iter(): fix handling of sync error after DIO
  switch iov_iter_get_pages() to passing maximal number of pages
  fs: mark __d_obtain_alias static
  dcache: d_splice_alias should detect loops
  exportfs: update Exporting documentation
  dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED
  dcache: remove unused d_find_alias parameter
  dcache: d_obtain_alias callers don't all want DISCONNECTED
  dcache: d_splice_alias should ignore DCACHE_DISCONNECTED
  dcache: d_splice_alias mustn't create directory aliases
  dcache: close d_move race in d_splice_alias
  dcache: move d_splice_alias
  namei: trivial fix to vfs_rename_dir comment
  VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode.
  cifs: support RENAME_NOREPLACE
  hostfs: support rename flags
  shmem: support RENAME_EXCHANGE
  shmem: support RENAME_NOREPLACE
  btrfs: add RENAME_NOREPLACE
  ...
2014-08-11 11:44:11 -07:00
David Herrmann 05f65b5c70 shm: wait for pins to be released when sealing
If we set SEAL_WRITE on a file, we must make sure there cannot be any
ongoing write-operations on the file.  For write() calls, we simply lock
the inode mutex, for mmap() we simply verify there're no writable
mappings.  However, there might be pages pinned by AIO, Direct-IO and
similar operations via GUP.  We must make sure those do not write to the
memfd file after we set SEAL_WRITE.

As there is no way to notify GUP users to drop pages or to wait for them
to be done, we implement the wait ourself: When setting SEAL_WRITE, we
check all pages for their ref-count.  If it's bigger than 1, we know
there's some user of the page.  We then mark the page and wait for up to
150ms for those ref-counts to be dropped.  If the ref-counts are not
dropped in time, we refuse the seal operation.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
David Herrmann 9183df25fe shm: add memfd_create() syscall
memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap().  It can support sealing and avoids any
connection to user-visible mount-points.  Thus, it's not subject to quotas
on mounted file-systems, but can be used like malloc()'ed memory, but with
a file-descriptor to it.

memfd_create() returns the raw shmem file, so calls like ftruncate() can
be used to modify the underlying inode.  Also calls like fstat() will
return proper information and mark the file as regular file.  If you want
sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to a filesystem size limit.  It is still properly accounted to
memcg limits, though, and to the same overcommit or no-overcommit
accounting as all user memory.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
David Herrmann 40e041a2c8 shm: add sealing API
If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement.  However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies.  Look at the following two
use-cases:

  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy, anymore. The
     source may have put sensible data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING.  If you seal a file, a
specific set of operations is blocked on that file forever.  Unlike locks,
seals can only be set, never removed.  Hence, once you verified a specific
set of seals is set, you're guaranteed that no-one can perform the blocked
operations on this file, anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEAL operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:

  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ryan Lortie <desrt@desrt.ca>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <zonque@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:31 -07:00
Johannes Weiner 0a31bc97c8 mm: memcontrol: rewrite uncharge API
The memcg uncharging code that is involved towards the end of a page's
lifetime - truncation, reclaim, swapout, migration - is impressively
complicated and fragile.

Because anonymous and file pages were always charged before they had their
page->mapping established, uncharges had to happen when the page type
could still be known from the context; as in unmap for anonymous, page
cache removal for file and shmem pages, and swap cache truncation for swap
pages.  However, these operations happen well before the page is actually
freed, and so a lot of synchronization is necessary:

- Charging, uncharging, page migration, and charge migration all need
  to take a per-page bit spinlock as they could race with uncharging.

- Swap cache truncation happens during both swap-in and swap-out, and
  possibly repeatedly before the page is actually freed.  This means
  that the memcg swapout code is called from many contexts that make
  no sense and it has to figure out the direction from page state to
  make sure memory and memory+swap are always correctly charged.

- On page migration, the old page might be unmapped but then reused,
  so memcg code has to prevent untimely uncharging in that case.
  Because this code - which should be a simple charge transfer - is so
  special-cased, it is not reusable for replace_page_cache().

But now that charged pages always have a page->mapping, introduce
mem_cgroup_uncharge(), which is called after the final put_page(), when we
know for sure that nobody is looking at the page anymore.

For page migration, introduce mem_cgroup_migrate(), which is called after
the migration is successful and the new page is fully rmapped.  Because
the old page is no longer uncharged after migration, prevent double
charges by decoupling the page's memcg association (PCG_USED and
pc->mem_cgroup) from the page holding an actual charge.  The new bits
PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
to the new page during migration.

mem_cgroup_migrate() is suitable for replace_page_cache() as well,
which gets rid of mem_cgroup_replace_page_cache().  However, care
needs to be taken because both the source and the target page can
already be charged and on the LRU when fuse is splicing: grab the page
lock on the charge moving side to prevent changing pc->mem_cgroup of a
page under migration.  Also, the lruvecs of both pages change as we
uncharge the old and charge the new during migration, and putback may
race with us, so grab the lru lock and isolate the pages iff on LRU to
prevent races and ensure the pages are on the right lruvec afterward.

Swap accounting is massively simplified: because the page is no longer
uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
before the final put_page() in page reclaim.

Finally, page_cgroup changes are now protected by whatever protection the
page itself offers: anonymous pages are charged under the page table lock,
whereas page cache insertions, swapin, and migration hold the page lock.
Uncharging happens under full exclusion with no outstanding references.
Charging and uncharging also ensure that the page is off-LRU, which
serializes against charge migration.  Remove the very costly page_cgroup
lock and set pc->flags non-atomically.

[mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
[vdavydov@parallels.com: fix flags definition]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
Johannes Weiner 00501b531c mm: memcontrol: rewrite charge API
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code and
reduces charging and uncharging overhead.  The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
 executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.31%              cat  [kernel.kallsyms]   [k] memset
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.38%              cat  [kernel.kallsyms]   [k] put_page
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn

After:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.48%           cat  [kernel.kallsyms]   [k] memset
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.46%           cat  [kernel.kallsyms]   [k] put_page
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
     1.30%           cat  [kernel.kallsyms]   [k] kfree

As you can see, the memcg footprint has shrunk quite a bit.

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

This patch (of 4):

The memcg charge API charges pages before they are rmapped - i.e.  have an
actual "type" - and so every callsite needs its own set of charge and
uncharge functions to know what type is being operated on.  Worse,
uncharge has to happen from a context that is still type-specific, rather
than at the end of the page's lifetime with exclusive access, and so
requires a lot of synchronization.

Rewrite the charge API to provide a generic set of try_charge(),
commit_charge() and cancel_charge() transaction operations, much like
what's currently done for swap-in:

  mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
  pages from the memcg if necessary.

  mem_cgroup_commit_charge() commits the page to the charge once it
  has a valid page->mapping and PageAnon() reliably tells the type.

  mem_cgroup_cancel_charge() aborts the transaction.

This reduces the charge API and enables subsequent patches to
drastically simplify uncharging.

As pages need to be committed after rmap is established but before they
are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
additions again.  Revive lru_cache_add_active_or_unevictable().

[hughd@google.com: fix shmem_unuse]
[hughd@google.com: Add comments on the private use of -EAGAIN]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:17 -07:00
Miklos Szeredi 37456771c5 shmem: support RENAME_EXCHANGE
This is really simple in tmpfs since the VFS already takes care of
shuffling the dentries.  Just adjust nlink on parent directories and touch
c & mtimes.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Miklos Szeredi 3b69ff51d0 shmem: support RENAME_NOREPLACE
Implement ->rename2 instead of ->rename.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Wang Sheng-Hui fed400a181 mm/shmem.c: remove the unused gfp arg to shmem_add_to_page_cache()
The gfp arg is not used in shmem_add_to_page_cache.  Remove this unused
arg.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:20 -07:00
Hugh Dickins eb39d618f9 mm: replace init_page_accessed by __SetPageReferenced
Do we really need an exported alias for __SetPageReferenced()? Its
callers better know what they're doing, in which case the page would not
be already marked referenced.  Kill init_page_accessed(), just
__SetPageReferenced() inline.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Prabhakar Lad <prabhakar.csengg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06 18:01:19 -07:00