Commit graph

735239 commits

Author SHA1 Message Date
Colin Ian King 2e85283dab be2net: remove redundant initialization of 'head' and pointer txq
Variable head is initialized to a value that is never read and is
being updated to a new value a few lines later, hence this
initialization is redundant and can be safely removed as well
as the now unused pointer txq.

Cleans up clang warning:
drivers/net/ethernet/emulex/benet/be_main.c:996:6: warning: Value
stored to 'head' during its initialization is never read

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:43:00 -05:00
David S. Miller 26c26ab02c Merge branch 'bnx2x-disable-GSO-on-too-large-packets'
Daniel Axtens says:

====================
bnx2x: disable GSO on too-large packets

We observed a case where a packet received on an ibmveth device had a
GSO size of around 10kB. This was forwarded by Open vSwitch to a bnx2x
device, where it caused a firmware assert. This is described in detail
at [0].

Ultimately we want a fix in the core, but that is very tricky to
backport. So for now, just stop the bnx2x driver from crashing.

When net-next re-opens I will send the fix to the core and a revert
for this.

v4 changes:
  - fix compilation error with EXPORTs (patch 1)
  - only do slow test if gso_size is greater than 9000 bytes (patch 2)

Thanks,
Daniel

[0]: https://patchwork.ozlabs.org/patch/859410/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:36:04 -05:00
Daniel Axtens 8914a59511 bnx2x: disable GSO where gso_size is too big for hardware
If a bnx2x card is passed a GSO packet with a gso_size larger than
~9700 bytes, it will cause a firmware error that will bring the card
down:

bnx2x: [bnx2x_attn_int_deasserted3:4323(enP24p1s0f0)]MC assert!
bnx2x: [bnx2x_mc_assert:720(enP24p1s0f0)]XSTORM_ASSERT_LIST_INDEX 0x2
bnx2x: [bnx2x_mc_assert:736(enP24p1s0f0)]XSTORM_ASSERT_INDEX 0x0 = 0x00000000 0x25e43e47 0x00463e01 0x00010052
bnx2x: [bnx2x_mc_assert:750(enP24p1s0f0)]Chip Revision: everest3, FW Version: 7_13_1
... (dump of values continues) ...

Detect when the mac length of a GSO packet is greater than the maximum
packet size (9700 bytes) and disable GSO.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:36:03 -05:00
Daniel Axtens 2b16f04872 net: create skb_gso_validate_mac_len()
If you take a GSO skb, and split it into packets, will the MAC
length (L2 + L3 + L4 headers + payload) of those packets be small
enough to fit within a given length?

Move skb_gso_mac_seglen() to skbuff.h with other related functions
like skb_gso_network_seglen() so we can use it, and then create
skb_gso_validate_mac_len to do the full calculation.

Signed-off-by: Daniel Axtens <dja@axtens.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-01 09:36:03 -05:00
Thomas Gleixner 1beaeacdc8 genirq: Make legacy autoprobing work again
Meelis reported the following warning on a quad P3 HP NetServer museum piece:

WARNING: CPU: 3 PID: 258 at kernel/irq/chip.c:244 __irq_startup+0x80/0x100
EIP: __irq_startup+0x80/0x100
irq_startup+0x7e/0x170
probe_irq_on+0x128/0x2b0
parport_irq_probe.constprop.18+0x8d/0x1af [parport_pc]
parport_pc_probe_port+0xf11/0x1260 [parport_pc]
parport_pc_init+0x78a/0xf10 [parport_pc]
parport_parse_param.constprop.16+0xf0/0xf0 [parport_pc]
do_one_initcall+0x45/0x1e0

This is caused by the rewrite of the irq activation/startup sequence which
missed to convert a callsite in the irq legacy auto probing code.

To fix this irq_activate_and_startup() needs to gain a return value so the
pending logic can work proper.

Fixes: c942cee46b ("genirq: Separate activation and startup")
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Meelis Roos <mroos@linux.ee>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801301935410.1797@nanos
2018-02-01 11:09:40 +01:00
Dan Williams 085331dfc6 x86/kvm: Update spectre-v1 mitigation
Commit 75f139aaf8 "KVM: x86: Add memory barrier on vmcs field lookup"
added a raw 'asm("lfence");' to prevent a bounds check bypass of
'vmcs_field_to_offset_table'.

The lfence can be avoided in this path by using the array_index_nospec()
helper designed for these types of fixes.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Honig <ahonig@google.com>
Cc: kvm@vger.kernel.org
Cc: Jim Mattson <jmattson@google.com>
Link: https://lkml.kernel.org/r/151744959670.6342.3001723920950249067.stgit@dwillia2-desk3.amr.corp.intel.com
2018-02-01 10:59:10 +01:00
Dmitry Torokhov d67ad78e09 Merge branch 'next' into for-linus
Prepare input updates for 4.16 merge window.
2018-02-01 00:37:30 -08:00
Darrick J. Wong 131fa58d39 xfs: fix u32 type usage in sb validation function
Don't use u32, use uint32_t, because this won't work in xfsprogs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2018-01-31 20:39:20 -08:00
Linus Torvalds 255442c938 Documentation updates for 4.16. New stuff includes refcount_t
documentation, errseq documentation, kernel-doc support for nested
 structure definitions, the removal of lots of crufty kernel-doc support for
 unused formats, SPDX tag documentation, the beginnings of a manual for
 subsystem maintainers, and lots of fixes and updates.
 
 As usual, some of the changesets reach outside of Documentation/ to effect
 kerneldoc comment fixes.  It also adds the new LICENSES directory, of which
 Thomas promises I do not need to be the maintainer.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJab11TAAoJEI3ONVYwIuV6i1UP/1LgGPHW9Ygq5qaLFbReZd/u
 Mx/orrhHX0PdkbCCE+CbL8Vm1m4UKFDTBdlpk3s542zxeeG0ZBXuTnvq4Kyk+cTN
 p4/vsIEzk/Ih13/glGE5MlV+EjiEK+8hK69TIUj7bAyuHmpzofjRz9/1M6RLDGDC
 HY6UI58AXG0yOQWMWCGRMYpQAFUGij2equ7Doe1ugXRq14dx7V4RsOhI140iRk7t
 bquAq1rS2fXniiuPFmLBUe4dWW28isVa/Vl/aXcaWQDKMyT0OLhjOMW36wWKqtPi
 WdVCpHv1NLZNyZZr9S3kvfOwW+BUqpEzfVwssyBLW4h0tsnIx0U0HVhSTY8/TvFZ
 QD9yCSana4LB/e5CHXIX5lBHbjHxf+rETXqVV4MgwDaMvM3mCo4X6WUTJDmZADo6
 vQISEKeb4su5uWAbc9T9xwRSLhZnFVdJ/QuYdNQ5+EpFJYLhzQ9eBvEz6JstSIXL
 p9ASBiPNY3ulpVZ8q0JOHJRBhq5mHJH6Dy8achzbILy2l/ZI4b8lJ53mw9II04cp
 puF96E6HpvuZ8Tgjjrg9U3ZdxXNrUgc/tjk2ZDkyTglk1XF2jKSq2tiNSZ3oLrJm
 XqJPnpCeyJM5UDvwkIBzgC41WEHwe8uvoNbUnc4X7UJSZegFzcSLQXf5qaprHS5k
 XeQ7sbd+S+jzVVjFi0W5
 =Z15Z
 -----END PGP SIGNATURE-----

Merge tag 'docs-4.16' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "Documentation updates for 4.16.

  New stuff includes refcount_t documentation, errseq documentation,
  kernel-doc support for nested structure definitions, the removal of
  lots of crufty kernel-doc support for unused formats, SPDX tag
  documentation, the beginnings of a manual for subsystem maintainers,
  and lots of fixes and updates.

  As usual, some of the changesets reach outside of Documentation/ to
  effect kerneldoc comment fixes. It also adds the new LICENSES
  directory, of which Thomas promises I do not need to be the
  maintainer"

* tag 'docs-4.16' of git://git.lwn.net/linux: (65 commits)
  linux-next: docs-rst: Fix typos in kfigure.py
  linux-next: DOC: HWPOISON: Fix path to debugfs in hwpoison.txt
  Documentation: Fix misconversion of #if
  docs: add index entry for networking/msg_zerocopy
  Documentation: security/credentials.rst: explain need to sort group_list
  LICENSES: Add MPL-1.1 license
  LICENSES: Add the GPL 1.0 license
  LICENSES: Add Linux syscall note exception
  LICENSES: Add the MIT license
  LICENSES: Add the BSD-3-clause "Clear" license
  LICENSES: Add the BSD 3-clause "New" or "Revised" License
  LICENSES: Add the BSD 2-clause "Simplified" license
  LICENSES: Add the LGPL-2.1 license
  LICENSES: Add the LGPL 2.0 license
  LICENSES: Add the GPL 2.0 license
  Documentation: Add license-rules.rst to describe how to properly identify file licenses
  scripts: kernel_doc: better handle show warnings logic
  fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at
  doc: md: Fix a file name to md-fault.c in fault-injection.txt
  errseq: Add to documentation tree
  ...
2018-01-31 19:25:25 -08:00
Linus Torvalds d76e0a050e Merge branch 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vmci iov_iter updates from Al Viro:
 "Get rid of "is it an iovec or an entire array?" flags in vmxi - just
  use iov_iter. Simplifies the living hell out of that code..."

* 'work.vmci' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vmci: the same on the send side...
  vmci: simplify qp_dequeue_locked()
  vmci: get rid of qp_memcpy_from_queue()
  vmci: fix buf_size in case of iovec-based accesses
2018-01-31 19:21:14 -08:00
Linus Torvalds 40b9672a2f Merge branch 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull asm/uaccess.h whack-a-mole from Al Viro:
 "It's linux/uaccess.h, damnit... Oh, well - eventually they'll stop
  cropping up..."

* 'work.whack-a-mole' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  asm-prototypes.h: use linux/uaccess.h, not asm/uaccess.h
  riscv: use linux/uaccess.h, not asm/uaccess.h...
  ppc: for put_user() pull linux/uaccess.h, not asm/uaccess.h
2018-01-31 19:18:12 -08:00
Linus Torvalds dc1efc3cfa Merge branch 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull dcache updates from Al Viro:
 "Neil Brown's d_move()/d_path() race fix"

* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: close race between getcwd() and d_move()
2018-01-31 19:15:23 -08:00
Linus Torvalds 73da9e1a9f Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - misc fixes

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  mm: remove PG_highmem description
  tools, vm: new option to specify kpageflags file
  mm/swap.c: make functions and their kernel-doc agree
  mm, memory_hotplug: fix memmap initialization
  mm: correct comments regarding do_fault_around()
  mm: numa: do not trap faults on shared data section pages.
  hugetlb, mbind: fall back to default policy if vma is NULL
  hugetlb, mempolicy: fix the mbind hugetlb migration
  mm, hugetlb: further simplify hugetlb allocation API
  mm, hugetlb: get rid of surplus page accounting tricks
  mm, hugetlb: do not rely on overcommit limit during migration
  mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
  mm, hugetlb: unify core page allocation accounting and initialization
  mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
  mm/memcontrol.c: make local symbol static
  mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
  include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
  mm/compaction.c: fix comment for try_to_compact_pages()
  mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
  zsmalloc: use U suffix for negative literals being shifted
  ...
2018-01-31 18:46:22 -08:00
Daniel Vetter 24b8ef699e drm/ast: Load lut in crtc_commit
In the past the ast driver relied upon the fbdev emulation helpers to
call ->load_lut at boot-up. But since

commit b8e2b0199c
Author: Peter Rosin <peda@axentia.se>
Date:   Tue Jul 4 12:36:57 2017 +0200

    drm/fb-helper: factor out pseudo-palette

that's cleaned up and drivers are expected to boot into a consistent
lut state. This patch fixes that.

Fixes: b8e2b0199c ("drm/fb-helper: factor out pseudo-palette")
Cc: Peter Rosin <peda@axenita.se>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: <stable@vger.kernel.org> # v4.14+
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198123
Cc: Bill Fraser <bill.fraser@gmail.com>
Reported-and-Tested-by: Bill Fraser <bill.fraser@gmail.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
2018-02-01 11:35:46 +10:00
Dave Airlie 7ec3c0957f This contains a fix to restrict what lessee can do with masters and
another one when waiting for timeouts on reservation objects.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJacduMAAoJEEN0HIUfOBk0y5IQALpys8ycG3b5Wm5Qz/ftKr3b
 YfhVx2kRXSPKLWutKN+Mseo4hNKhbrk6MsLctlXOAKLGCIG9cs8erwB4bQm+a7sX
 3LTx9QNRIQsSK09bVY+MGrfSjbTcyzDVC2qCk/4w6F53jRVlvZZpZ2JZQ/OeRihK
 F9Jg37P2TyXfJbS5xq3rP7N18+0TwdIZcUhqSYenTrpXJIguWqL8w836v+GD/cJA
 3DqfzQI+TSNiY9uAT76BAz1x7GOqecf5Cd5G7Sag4z9IeutqIrnt3+hcl7fZ3a2v
 CwY+TmTzu9GO0WOG2J8Az0/p0B9nr3OaYlSILMQ9fvVwGNY1Y8wF94SLrjEapMsu
 wv1X5V5/nwUOqWPbZZJNP1849PYuI7oTOZcrB013lrA+AE/yCxF+2XuF4u/GPMvf
 ULmXll8D2+bWq4Oqv0dJFPXYGvAj+5Ksn9yin0CDW+g2cHe1JQRGjitajd40usPh
 qBHr7Q7ARyWX7MccGQUx5nYHNCp7su5ofNPlX5aTi8CeUiTp8py1xAYCzzyULDBA
 ScHIWz83dc2f/Kq65g8lBQotfSPCqcKpduv4/bkvu+tqMSxI1NEaGQ/iQpWzsI8v
 PwMl7dbQRp66iN2VxNtvkdEcVLfOBofKowU/exD7eSY7tuQLhb7YpwLHMAJYzUhn
 J/f2NiZMFDCXzUHAranR
 =h8qb
 -----END PGP SIGNATURE-----

Merge tag 'drm-misc-next-fixes-2018-01-31' of git://anongit.freedesktop.org/drm/drm-misc into drm-next

This contains a fix to restrict what lessee can do with masters and
another one when waiting for timeouts on reservation objects.

* tag 'drm-misc-next-fixes-2018-01-31' of git://anongit.freedesktop.org/drm/drm-misc:
  drm: Check for lessee in DROP_MASTER ioctl
  dma-buf: fix reservation_object_wait_timeout_rcu once more v2
2018-02-01 11:34:47 +10:00
Miles Chen 3f56a2f803 mm: remove PG_highmem description
Commit cbe37d0937 ("[PATCH] mm: remove PG_highmem") removed PG_highmem
to save a page flag.  So the description of PG_highmem is no longer
needed.

Link: http://lkml.kernel.org/r/1517391212-2950-1-git-send-email-miles.chen@mediatek.com
Signed-off-by: Miles Chen <miles.chen@mediatek.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
David Rientjes c7905f2002 tools, vm: new option to specify kpageflags file
page-types currently hardcodes /proc/kpageflags as the file to parse.
This works when using the tool to examine the state of pageflags on the
same system, but does not allow storing a snapshot of pageflags at a
given time to debug issues nor on a different system.

This allows the user to specify a saved version of kpageflags with a new
page-types -F option.

[akpm@linux-foundation.org: add "filename" to fix usage() string]
[rientjes@google.com: fix layout]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301840050.140969@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801301458180.153857@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Randy Dunlap e02a9f048e mm/swap.c: make functions and their kernel-doc agree
Fix some basic kernel-doc notation in mm/swap.c:

 - for function lru_cache_add_anon(), make its kernel-doc function name
   match its function name and change colon to hyphen following the
   function name

 - for function pagevec_lookup_entries(), change the function parameter
   name from nr_pages to nr_entries since that is more descriptive of
   what the parameter actually is and then it matches the kernel-doc
   comments also

Fix function kernel-doc to match the change in commit 67fd707f46:

 - drop the kernel-doc notation for @nr_pages from
   pagevec_lookup_range() and correct the function description for that
   change

Link: http://lkml.kernel.org/r/3b42ee3e-04a9-a6ca-6be4-f00752a114fe@infradead.org
Fixes: 67fd707f46 ("mm: remove nr_pages argument from pagevec_lookup_{,range}_tag()")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko 9bb5a391f9 mm, memory_hotplug: fix memmap initialization
Bharata has noticed that onlining a newly added memory doesn't increase
the total memory, pointing to commit f7f99100d8 ("mm: stop zeroing
memory during allocation in vmemmap") as a culprit.  This commit has
changed the way how the memory for memmaps is initialized and moves it
from the allocation time to the initialization time.  This works
properly for the early memmap init path.

It doesn't work for the memory hotplug though because we need to mark
page as reserved when the sparsemem section is created and later
initialize it completely during onlining.  memmap_init_zone is called in
the early stage of onlining.  With the current code it calls
__init_single_page and as such it clears up the whole stage and
therefore online_pages_range skips those pages.

Fix this by skipping mm_zero_struct_page in __init_single_page for
memory hotplug path.  This is quite uggly but unifying both early init
and memory hotplug init paths is a large project.  Make sure we plug the
regression at least.

Link: http://lkml.kernel.org/r/20180130101141.GW21609@dhcp22.suse.cz
Fixes: f7f99100d8 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Tested-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
William Kucharski da391d640c mm: correct comments regarding do_fault_around()
There are multiple comments surrounding do_fault_around that memtion
fault_around_pages() and fault_around_mask(), two routines that do not
exist.  These comments should be reworded to reference
fault_around_bytes, the value which is used to determine how much
do_fault_around() will attempt to read when processing a fault.

These comments should have been updated when fault_around_pages() and
fault_around_mask() were removed in commit aecd6f4426 ("mm: close race
between do_fault_around() and fault_around_bytes_set()").

Fixes: aecd6f4426 ("mm: close race between do_fault_around() and fault_around_bytes_set()")
Link: http://lkml.kernel.org/r/302D0B14-C7E9-44C6-8BED-033F9ACBD030@oracle.com
Signed-off-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Henry Willard 859d4adc34 mm: numa: do not trap faults on shared data section pages.
Workloads consisting of a large number of processes running the same
program with a very large shared data segment may experience performance
problems when numa balancing attempts to migrate the shared cow pages.
This manifests itself with many processes or tasks in
TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.

The program listed below simulates the conditions with these results
when run with 288 processes on a 144 core/8 socket machine.

Average throughput 	Average throughput     Average throughput
with numa_balancing=0	with numa_balancing=1  with numa_balancing=1
     			without the patch      with the patch
---------------------	---------------------  ---------------------
2118782			2021534		       2107979

Complex production environments show less variability and fewer poorly
performing outliers accompanied with a smaller number of processes
waiting on NUMA page migration with this patch applied.  In some cases,
%iowait drops from 16%-26% to 0.

  // SPDX-License-Identifier: GPL-2.0
  /*
   * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
   */
  #include <sys/time.h>
  #include <stdio.h>
  #include <wait.h>
  #include <sys/mman.h>

  int a[1000000] = {13};

  int  main(int argc, const char **argv)
  {
	int n = 0;
	int i;
	pid_t pid;
	int stat;
	int *count_array;
	int cpu_count = 288;
	long total = 0;

	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

	if (argc > 2)
		cpu_count = atoi(argv[2]);

	count_array = mmap(NULL, cpu_count * sizeof(int),
			   (PROT_READ|PROT_WRITE),
			   (MAP_SHARED|MAP_ANONYMOUS), 0, 0);

	if (count_array == MAP_FAILED) {
		perror("mmap:");
		return 0;
	}

	for (i = 0; i < cpu_count; ++i) {
		pid = fork();
		if (pid <= 0)
			break;
		if ((i & 0xf) == 0)
			usleep(2);
	}

	if (pid != 0) {
		if (i == 0) {
			perror("fork:");
			return 0;
		}

		for (;;) {
			pid = wait(&stat);
			if (pid < 0)
				break;
		}

		for (i = 0; i < cpu_count; ++i)
			total += count_array[i];

		printf("Total %ld\n", total);
		munmap(count_array, cpu_count * sizeof(int));
		return 0;
	}

	gettimeofday(&t1, 0);
	timeradd(&t1, &t2, &t1);
	while (timercmp(&t2, &t1, <)) {
		int b = 0;
		int j;

		for (j = 0; j < 1000000; j++)
			b += a[j];
		gettimeofday(&t2, 0);
		n++;
	}
	count_array[i] = n;
	return 0;
  }

This patch changes change_pte_range() to skip shared copy-on-write pages
when called from change_prot_numa().

NOTE: change_prot_numa() is nominally called from task_numa_work() and
queue_pages_test_walk().  task_numa_work() is the auto NUMA balancing
path, and queue_pages_test_walk() is part of explicit NUMA policy
management.  However, queue_pages_test_walk() only calls
change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
not allowed, so change_prot_numa() is only called from auto NUMA
balancing.

In the case of explicit NUMA policy management, shared pages are not
migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
depends on CAP_SYS_NICE.  Currently, there is no way to pass information
about MPOL_MF_MOVE_ALL to change_pte_range.  This will have to be fixed
if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
migration mode.

task_numa_work() skips the read-only VMAs of programs and shared
libraries.

Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
Signed-off-by: Henry Willard <henry.willard@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko 389c8178d0 hugetlb, mbind: fall back to default policy if vma is NULL
Dan Carpenter has noticed that mbind migration callback (new_page) can
get a NULL vma pointer and choke on it inside alloc_huge_page_vma which
relies on the VMA to get the hstate.  We used to BUG_ON this case but
the BUG_+ON has been removed recently by "hugetlb, mempolicy: fix the
mbind hugetlb migration".

The proper way to handle this is to get the hstate from the migrated
page and rely on huge_node (resp.  get_vma_policy) do the right thing
with null VMA.  We are currently falling back to the default mempolicy
in that case which is in line what THP path is doing here.

Link: http://lkml.kernel.org/r/20180110104712.GR1732@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko ebd6372358 hugetlb, mempolicy: fix the mbind hugetlb migration
do_mbind migration code relies on alloc_huge_page_noerr for hugetlb
pages.  alloc_huge_page_noerr uses alloc_huge_page which is a highlevel
allocation function which has to take care of reserves, overcommit or
hugetlb cgroup accounting.  None of that is really required for the page
migration because the new page is only temporal and either will replace
the original page or it will be dropped.  This is essentially as for
other migration call paths and there shouldn't be any reason to handle
mbind in a special way.

The current implementation is even suboptimal because the migration
might fail just because the hugetlb cgroup limit is reached, or the
overcommit is saturated.

Fix this by making mbind like other hugetlb migration paths.  Add a new
migration helper alloc_huge_page_vma as a wrapper around
alloc_huge_page_nodemask with additional mempolicy handling.

alloc_huge_page_noerr has no more users and it can go.

Link: http://lkml.kernel.org/r/20180103093213.26329-7-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko 0c397daea1 mm, hugetlb: further simplify hugetlb allocation API
Hugetlb allocator has several layer of allocation functions depending
and the purpose of the allocation.  There are two allocators depending
on whether the page can be allocated from the page allocator or we need
a contiguous allocator.  This is currently opencoded in
alloc_fresh_huge_page which is the only path that might allocate giga
pages which require the later allocator.  Create alloc_fresh_huge_page
which hides this implementation detail and use it in all callers which
hardcoded the buddy allocator path (__hugetlb_alloc_buddy_huge_page).
This shouldn't introduce any funtional change because both migration and
surplus allocators exlude giga pages explicitly.

While we are at it let's do some renaming.  The current scheme is not
consistent and overly painfull to read and understand.  Get rid of
prefix underscores from most functions.  There is no real reason to make
names longer.

* alloc_fresh_huge_page is the new layer to abstract underlying
  allocator
* __hugetlb_alloc_buddy_huge_page becomes shorter and neater
  alloc_buddy_huge_page.
* Former alloc_fresh_huge_page becomes alloc_pool_huge_page because we put
  the new page directly to the pool
* alloc_surplus_huge_page can drop the opencoded prep_new_huge_page code
  as it uses alloc_fresh_huge_page now
* others lose their excessive prefix underscores to make names shorter

[dan.carpenter@oracle.com: fix double unlock bug in alloc_surplus_huge_page()]
  Link: http://lkml.kernel.org/r/20180109200559.g3iz5kvbdrz7yydp@mwanda
Link: http://lkml.kernel.org/r/20180103093213.26329-6-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko 9980d744a0 mm, hugetlb: get rid of surplus page accounting tricks
alloc_surplus_huge_page increases the pool size and the number of
surplus pages opportunistically to prevent from races with the pool size
change.  See commit d1c3fb1f8f ("hugetlb: introduce
nr_overcommit_hugepages sysctl") for more details.

The resulting code is unnecessarily hairy, cause code duplication and
doesn't allow to share the allocation paths.  Moreover pool size changes
tend to be very seldom so optimizing for them is not really reasonable.
Simplify the code and allow to allocate a fresh surplus page as long as
we are under the overcommit limit and then recheck the condition after
the allocation and drop the new page if the situation has changed.  This
should provide a reasonable guarantee that an abrupt allocation requests
will not go way off the limit.

If we consider races with the pool shrinking and enlarging then we
should be reasonably safe as well.  In the first case we are off by one
in the worst case and the second case should work OK because the page is
not yet visible.  We can waste CPU cycles for the allocation but that
should be acceptable for a relatively rare condition.

Link: http://lkml.kernel.org/r/20180103093213.26329-5-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko ab5ac90aec mm, hugetlb: do not rely on overcommit limit during migration
hugepage migration relies on __alloc_buddy_huge_page to get a new page.
This has 2 main disadvantages.

1) it doesn't allow to migrate any huge page if the pool is used
   completely which is not an exceptional case as the pool is static and
   unused memory is just wasted.

2) it leads to a weird semantic when migration between two numa nodes
   might increase the pool size of the destination NUMA node while the
   page is in use.  The issue is caused by per NUMA node surplus pages
   tracking (see free_huge_page).

Address both issues by changing the way how we allocate and account
pages allocated for migration.  Those should temporal by definition.  So
we mark them that way (we will abuse page flags in the 3rd page) and
update free_huge_page to free such pages to the page allocator.  Page
migration path then just transfers the temporal status from the new page
to the old one which will be freed on the last reference.  The global
surplus count will never change during this path but we still have to be
careful when migrating a per-node suprlus page.  This is now handled in
move_hugetlb_state which is called from the migration path and it copies
the hugetlb specific page state and fixes up the accounting when needed

Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
reflect its purpose.  The new allocation routine for the migration path
is __alloc_migrate_huge_page.

The user visible effect of this patch is that migrated pages are really
temporal and they travel between NUMA nodes as per the migration
request:

Before migration
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

After
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
  /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
  /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0

with the previous implementation, both nodes would have nr_hugepages:1
until the page is freed.

Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko d9cc948f6f mm, hugetlb: integrate giga hugetlb more naturally to the allocation path
Gigantic hugetlb pages were ingrown to the hugetlb code as an alien
specie with a lot of special casing.  The allocation path is not an
exception.  Unnecessarily so to be honest.  It is true that the
underlying allocator is different but that is an implementation detail.

This patch unifies the hugetlb allocation path that a prepares fresh
pool pages.  alloc_fresh_gigantic_page basically copies
alloc_fresh_huge_page logic so we can move everything there.  This will
simplify set_max_huge_pages which doesn't have to care about what kind
of huge page we allocate.

Link: http://lkml.kernel.org/r/20180103093213.26329-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Michal Hocko af0fb9df78 mm, hugetlb: unify core page allocation accounting and initialization
Patch series "mm, hugetlb: allocation API and migration improvements"

Motivation:

this is a follow up for [3] for the allocation API and [4] for the
hugetlb migration.  It wasn't really easy to split those into two
separate patch series as they share some code.

My primary motivation to touch this code is to make the gigantic pages
migration working.  The giga pages allocation code is just too fragile
and hacked into the hugetlb code now.  This series tries to move giga
pages closer to the first class citizen.  We are not there yet but
having 5 patches is quite a lot already and it will already make the
code much easier to follow.  I will come with other changes on top after
this sees some review.

The first two patches should be trivial to review.  The third patch
changes the way how we migrate huge pages.  Newly allocated pages are a
subject of the overcommit check and they participate surplus accounting
which is quite unfortunate as the changelog explains.  This patch
doesn't change anything wrt.  giga pages.

Patch #4 removes the surplus accounting hack from
__alloc_surplus_huge_page.  I hope I didn't miss anything there and a
deeper review is really due there.

Patch #5 finally unifies allocation paths and giga pages shouldn't be
any special anymore.  There is also some renaming going on as well.

This patch (of 6):

hugetlb allocator has two entry points to the page allocator
 - alloc_fresh_huge_page_node
 - __hugetlb_alloc_buddy_huge_page

The two differ very subtly in two aspects.  The first one doesn't care
about HTLB_BUDDY_* stats and it doesn't initialize the huge page.
prep_new_huge_page is not used because it not only initializes hugetlb
specific stuff but because it also put_page and releases the page to the
hugetlb pool which is not what is required in some contexts.  This makes
things more complicated than necessary.

Simplify things by a) removing the page allocator entry point duplicity
and only keep __hugetlb_alloc_buddy_huge_page and b) make
prep_new_huge_page more reusable by removing the put_page which moves
the page to the allocator pool.  All current callers are updated to call
put_page explicitly.  Later patches will add new callers which won't
need it.

This patch shouldn't introduce any functional change.

Link: http://lkml.kernel.org/r/20180103093213.26329-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Reale <ar@linux.vnet.ibm.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Andrey Ryabinin 1ab5c05695 mm/memcontrol.c: try harder to decrease [memory,memsw].limit_in_bytes
mem_cgroup_resize_[memsw]_limit() tries to free only 32
(SWAP_CLUSTER_MAX) pages on each iteration.  This makes it practically
impossible to decrease limit of memory cgroup.  Tasks could easily
allocate back 32 pages, so we can't reduce memory usage, and once
retry_count reaches zero we return -EBUSY.

Easy to reproduce the problem by running the following commands:

  mkdir /sys/fs/cgroup/memory/test
  echo $$ >> /sys/fs/cgroup/memory/test/tasks
  cat big_file > /dev/null &
  sleep 1 && echo $((100*1024*1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
  -bash: echo: write error: Device or resource busy

Instead of relying on retry_count, keep retrying the reclaim until the
desired limit is reached or fail if the reclaim doesn't make any
progress or a signal is pending.

Link: http://lkml.kernel.org/r/20180119132544.19569-1-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Christopher Díaz Riveros 8ad6e404ef mm/memcontrol.c: make local symbol static
Fix the following sparse warning:

  mm/memcontrol.c:1097:14: warning: symbol 'memcg1_stats' was not declared. Should it be static?

Link: http://lkml.kernel.org/r/20180118193327.14200-1-chrisadr@gentoo.org
Signed-off-by: Christopher Díaz Riveros <chrisadr@gentoo.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Ralph Campbell 8d63e4cd62 mm/hmm: fix uninitialized use of 'entry' in hmm_vma_walk_pmd()
The variable 'entry' is used before being initialized in
hmm_vma_walk_pmd().

No bad effect (beside performance hit) so !non_swap_entry(0) evaluate to
true which trigger a fault as if CPU was trying to access migrated
memory and migrate memory back from device memory to regular memory.

This function (hmm_vma_walk_pmd()) is called when a device driver tries
to populate its own page table.  For migrated memory it should not
happen as the device driver should already have populated its page table
correctly during the migration.

Only case I can think of is multi-GPU where a second GPU triggers
migration back to regular memory.  Again this would just result in a
performance hit, nothing bad would happen.

Link: http://lkml.kernel.org/r/20180122185759.26286-1-jglisse@redhat.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:40 -08:00
Petr Tesarik def9b71ee6 include/linux/mmzone.h: fix explanation of lower bits in the SPARSEMEM mem_map pointer
The comment is confusing.  On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this would
only guarantee that the 2 lowest bits must be zero.  On the other hand,
it claims that at least 3 bits are available, and 3 bits are actually
used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious.  Let's fix the comment to make it clear how many bits
are available and why.

Although memmap arrays are allocated in various places, the resulting
pointer is encoded eventually, so I am adding a BUG_ON() here to enforce
at runtime that all expected bits are indeed available.

I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can be easily checked
at build time.

[ptesarik@suse.com: v2]
  Link: http://lkml.kernel.org/r/20180125100516.589ea6af@ezekiel.suse.cz
Link: http://lkml.kernel.org/r/20180119080908.3a662e6f@ezekiel.suse.cz
Signed-off-by: Petr Tesarik <ptesarik@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Yang Shi 112d2d29fc mm/compaction.c: fix comment for try_to_compact_pages()
"mode" argument is not used by try_to_compact_pages() and sub functions
anymore, it has been replaced by "prio".  Fix the comment to explain the
use of "prio" argument.

Link: http://lkml.kernel.org/r/1515801336-20611-1-git-send-email-yang.shi@linux.alibaba.com
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Oscar Salvador 3a45acc086 mm/page_ext.c: make page_ext_init a noop when CONFIG_PAGE_EXTENSION but nothing uses it
static struct page_ext_operations *page_ext_ops[] always contains debug_guardpage_ops,

static struct page_ext_operations *page_ext_ops[] = {
        &debug_guardpage_ops,
 #ifdef CONFIG_PAGE_OWNER
        &page_owner_ops,
 #endif
...
}

but for it to work, CONFIG_DEBUG_PAGEALLOC must be enabled first.  If
someone has CONFIG_PAGE_EXTENSION, but has none of its users, eg:
(CONFIG_PAGE_OWNER, CONFIG_DEBUG_PAGEALLOC, CONFIG_IDLE_PAGE_TRACKING),
we can shrink page_ext_init() to a simple retq.

  $ size vmlinux  (before patch)
        text      data       bss       dec       hex  filename
    14356698   5681582   1687748  21726028   14b834c  vmlinux

  $ size vmlinux  (after patch)
        text      data       bss       dec       hex  filename
    14356008   5681538   1687748  21725294   14b806e  vmlinux

On the other hand, it might does not even make sense, since if someone
enables CONFIG_PAGE_EXTENSION, I would expect him to enable also at
least one of its users.

Link: http://lkml.kernel.org/r/20180105130235.GA21241@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jaewon Kim <jaewon31.kim@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Nick Desaulniers 01a6ad9ac8 zsmalloc: use U suffix for negative literals being shifted
Fix warning about shifting unsigned literals being undefined behavior.

Link: http://lkml.kernel.org/r/1515642078-4259-1-git-send-email-nick.desaulniers@gmail.com
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nick Desaulniers <nick.desaulniers@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Oscar Salvador 6787c1dab1 mm/page_owner.c: clean up init_pages_in_zone()
Remove two redundant assignments in init_pages_in_zone().

[osalvador@techadventures.net: v3]
  Link: http://lkml.kernel.org/r/20180117124513.GA876@techadventures.net
[akpm@linux-foundation.org: coding style tweaks]
Link: http://lkml.kernel.org/r/20180110084355.GA22822@techadventures.net
Signed-off-by: Oscar Salvador <osalvador@techadventures.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Shile Zhang 3c2c648842 mm/page_alloc.c: fix typos in comments
Link: http://lkml.kernel.org/r/1515485774-4768-1-git-send-email-zhangshile@gmail.com
Signed-off-by: Shile Zhang <zhangshile@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Yu Zhao c054a78c66 memcg: refactor mem_cgroup_resize_limit()
mem_cgroup_resize_limit() and mem_cgroup_resize_memsw_limit() have
identical logics.  Refactor code so we don't need to keep two pieces of
code that does same thing.

Link: http://lkml.kernel.org/r/20180108224238.14583-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Yu Zhao 9c3760eb80 zswap: only save zswap header when necessary
We waste sizeof(swp_entry_t) for zswap header when using zsmalloc as
zpool driver because zsmalloc doesn't support eviction.

Add zpool_evictable() to detect if zpool is potentially evictable, and
use it in zswap to avoid waste memory for zswap header.

[yuzhao@google.com: The zpool->" prefix is a result of copy & paste]
  Link: http://lkml.kernel.org/r/20180110225626.110330-1-yuzhao@google.com
Link: http://lkml.kernel.org/r/20180110224741.83751-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Dan Streetman <ddstreet@ieee.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
shidao.ytt a7ab400d6f mm/fadvise: discard partial page if endbyte is also EOF
During our recent testing with fadvise(FADV_DONTNEED), we find that if
given offset/length is not page-aligned, the last page will not be
discarded.  The tool we use is vmtouch (https://hoytech.com/vmtouch/),
we map a 10KB-sized file into memory and then try to run this tool to
evict the whole file mapping, but the last single page always remains
staying in the memory:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
         Elapsed: 5.5e-05 seconds

However when we test with an older kernel, say 3.10, this problem is
gone.  So we wonder if this is a regression:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
         Elapsed: 5e-05 seconds

After digging a little bit into this problem, we find it seems not a
regression.  Not discarding partial page is likely to be on purpose
according to commit 441c228f81 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel
Gorman.  He explained why partial pages should be preserved instead of
being discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT work as
the same as it was described, the partial page was still discarded
anyway, due to a calculation mistake of `end_index' passed to
invalidate_mapping_pages().  This mistake has not been fixed until
recently, that's why we fail to reproduce our problem in old kernels.
The fix is done in commit 18aba41cbf ("mm/fadvise.c: do not discard
partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.

Back to the original testing, our problem becomes that there is a
special case that, if the page-unaligned `endbyte' is also the end of
file, it is not necessary at all to preserve the last partial page, as
we all know no one else will use the rest of it.  It should be safe
enough if we just discard the whole page.  So we add an EOF check in
this patch.

We also find a poosbile real world issue in mainline kernel.  Assume
such scenario: A userspace backup application want to backup a huge
amount of small files (<4k) at once, the developer might (I guess) want
to use fadvise(FADV_DONTNEED) to save memory.  However, FADV_DONTNEED
won't really happen since the only page mapped is a partial page, and
kernel will preserve it.  Our patch also fixes this problem, since we
know the endbyte is EOF, so we discard it.

Here is a simple reproducer to reproduce and verify each scenario we
described above:

  test_fadvise.c
  ==============================
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
  	int i, fd, ret, len;
  	struct stat buf;
  	void *addr;
  	unsigned char *vec;
  	char *strbuf;
  	ssize_t pagesize = getpagesize();
  	ssize_t filesize;

  	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
  	if (fd < 0)
  		return -1;
  	filesize = strtoul(argv[2], NULL, 10);

  	strbuf = malloc(filesize);
  	memset(strbuf, 42, filesize);
  	write(fd, strbuf, filesize);
  	free(strbuf);
  	fsync(fd);

  	len = (filesize + pagesize - 1) / pagesize;
  	printf("length of pages: %d\n", len);

  	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
  	if (addr == MAP_FAILED)
  		return -1;

  	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
  	if (ret < 0)
  		return -1;

  	vec = malloc(len);
  	ret = mincore(addr, filesize, (void *)vec);
  	if (ret < 0)
  		return -1;

  	for (i = 0; i < len; i++)
  		printf("pages[%d]: %x\n", i, vec[i] & 0x1);

  	free(vec);
  	close(fd);

  	return 0;
  }
  ==============================

Test 1: running on kernel with commit 18aba41cbf reverted:

  [root@caspar ~]# uname -r
  4.15.0-rc6.revert+
  [root@caspar ~]# ./test_fadvise file1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page discarded
  [root@caspar ~]# ./test_fadvise file2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise file3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page discarded

Test 2: running on mainline kernel:

  [root@caspar ~]# uname -r
  4.15.0-rc6+
  [root@caspar ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 1    # <-- partial and the only page not discarded
  [root@caspar ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 1    # <-- partial page not discarded

Test 3: running on kernel with this patch:

  [root@caspar ~]# uname -r
  4.15.0-rc6.patched+
  [root@caspar ~]# ./test_fadvise test1 1024
  length of pages: 1
  pages[0]: 0    # <-- partial page and EOF, discarded
  [root@caspar ~]# ./test_fadvise test2 8192
  length of pages: 2
  pages[0]: 0
  pages[1]: 0
  [root@caspar ~]# ./test_fadvise test3 10240
  length of pages: 3
  pages[0]: 0
  pages[1]: 0
  pages[2]: 0    # <-- partial page and EOF, discarded

[akpm@linux-foundation.org: tweak code comment]
Link: http://lkml.kernel.org/r/5222da9ee20e1695eaabb69f631f200d6e6b8876.1515132470.git.jinli.zjl@alibaba-inc.com
Signed-off-by: shidao.ytt <shidao.ytt@alibaba-inc.com>
Signed-off-by: Caspar Zhang <jinli.zjl@alibaba-inc.com>
Reviewed-by: Oliver Yang <zhiche.yy@alibaba-inc.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Mel Gorman 69d763fc6d mm: pin address_space before dereferencing it while isolating an LRU page
Minchan Kim asked the following question -- what locks protects
address_space destroying when race happens between inode trauncation and
__isolate_lru_page? Jan Kara clarified by describing the race as follows

CPU1                                            CPU2

truncate(inode)                                 __isolate_lru_page()
  ...
  truncate_inode_page(mapping, page);
    delete_from_page_cache(page)
      spin_lock_irqsave(&mapping->tree_lock, flags);
        __delete_from_page_cache(page, NULL)
          page_cache_tree_delete(..)
            ...                                   mapping = page_mapping(page);
            page->mapping = NULL;
            ...
      spin_unlock_irqrestore(&mapping->tree_lock, flags);
      page_cache_free_page(mapping, page)
        put_page(page)
          if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space

                                                  if (mapping && !mapping->a_ops->migratepage)
- we've dereferenced mapping which is potentially already free.

The race is theoretically possible but unlikely.  Before the
delete_from_page_cache, truncate_cleanup_page is called so the page is
likely to be !PageDirty or PageWriteback which gets skipped by the only
caller that checks the mappping in __isolate_lru_page.  Even if the race
occurs, a substantial amount of work has to happen during a tiny window
with no preemption but it could potentially be done using a virtual
machine to artifically slow one CPU or halt it during the critical
window.

This patch should eliminate the race with truncation by try-locking the
page before derefencing mapping and aborting if the lock was not
acquired.  There was a suggestion from Huang Ying to use RCU as a
side-effect to prevent mapping being freed.  However, I do not like the
solution as it's an unconventional means of preserving a mapping and
it's not a context where rcu_read_lock is obviously protecting rcu data.

Link: http://lkml.kernel.org/r/20180104102512.2qos3h5vqzeisrek@techsingularity.net
Fixes: c824493528 ("mm: compaction: make isolate_lru_page() filter-aware again")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Eric Biggers 284cd241a1 userfaultfd: convert to use anon_inode_getfd()
Nothing actually calls userfaultfd_file_create() besides the
userfaultfd() system call itself.  So simplify things by folding it into
the system call and using anon_inode_getfd() instead of
anon_inode_getfile().  Do the same in resolve_userfault_fork() as well.

This removes over 50 lines with no change in functionality.

Link: http://lkml.kernel.org/r/20171229212403.22800-1-ebiggers3@gmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau c5c63835e5 memfd-test: run fuse test on hugetlb backend memory
Link: http://lkml.kernel.org/r/20171107122800.25517-10-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau 29f34d1dd6 memfd-test: move common code to a shared unit
The memfd & fuse tests will share more common code in the following
commits to test hugetlb support.

Link: http://lkml.kernel.org/r/20171107122800.25517-9-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau 3037aeb991 memfd-test: add 'memfd-hugetlb:' prefix when testing hugetlbfs
Link: http://lkml.kernel.org/r/20171107122800.25517-8-marcandre.lureau@redhat.com
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau 724978457f memfd-test: test hugetlbfs sealing
Remove most of the special-casing of hugetlbfs now that sealing is
supported.

Link: http://lkml.kernel.org/r/20171107122800.25517-7-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau 47b9012ecd shmem: add sealing support to hugetlb-backed memfd
Adapt add_seals()/get_seals() to work with hugetbfs-backed memory.

Teach memfd_create() to allow sealing operations on MFD_HUGETLB.

Link: http://lkml.kernel.org/r/20171107122800.25517-6-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau ff62a34210 hugetlb: implement memfd sealing
Implements memfd sealing, similar to shmem:
 - WRITE: deny fallocate(PUNCH_HOLE). mmap() write is denied in
   memfd_add_seals(). write() doesn't exist for hugetlbfs.
 - SHRINK: added similar check as shmem_setattr()
 - GROW: added similar check as shmem_setattr() & shmem_fallocate()

Except write() operation that doesn't exist with hugetlbfs, that should
make sealing as close as it can be to shmem support.

Link: http://lkml.kernel.org/r/20171107122800.25517-5-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau da14c1e524 hugetlb: expose hugetlbfs_inode_info in header
hugetlbfs inode information will need to be accessed by code in
mm/shmem.c for file sealing operations.  Move inode information
definition from .c file to header for needed access.

Link: http://lkml.kernel.org/r/20171107122800.25517-4-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00
Marc-André Lureau 5aadc431a5 shmem: rename functions that are memfd-related
Those functions are called for memfd files, backed by shmem or hugetlb
(the next patches will handle hugetlb).

Link: http://lkml.kernel.org/r/20171107122800.25517-3-marcandre.lureau@redhat.com
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-31 17:18:39 -08:00