1
0
Fork 0
Commit Graph

13559 Commits (6e22fd003e28d4414d2a84eb51f1491f49a70ab9)

Author SHA1 Message Date
Henry Burns 810481a246 mm/z3fold.c: lock z3fold page before __SetPageMovable()
Following zsmalloc.c's example we call trylock_page() and unlock_page().
Also make z3fold_page_migrate() assert that newpage is passed in locked,
as per the documentation.

[akpm@linux-foundation.org: fix trylock_page return value test, per Shakeel]
Link: http://lkml.kernel.org/r/20190702005122.41036-1-henryburns@google.com
Link: http://lkml.kernel.org/r/20190702233538.52793-1-henryburns@google.com
Signed-off-by: Henry Burns <henryburns@google.com>
Suggested-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: Vitaly Wool <vitalywool@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Vitaly Vul <vitaly.vul@sony.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Xidong Wang <wangxidong_97@163.com>
Cc: Jonathan Adams <jwadams@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:40 -07:00
Yafang Shao dd9239900e mm/memcontrol: fix wrong statistics in memory.stat
When we calculate total statistics for memcg1_stats and memcg1_events,
we use the the index 'i' in the for loop as the events index.  Actually
we should use memcg1_stats[i] and memcg1_events[i] as the events index.

Link: http://lkml.kernel.org/r/1562116978-19539-1-git-send-email-laoar.shao@gmail.com
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty").
Signed-off-by: Yafang Shao <laoar.shao@gmail.com
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:40 -07:00
Kuo-Hsin Yang 2c012a4ad1 mm: vmscan: scan anonymous pages on file refaults
When file refaults are detected and there are many inactive file pages,
the system never reclaim anonymous pages, the file pages are dropped
aggressively when there are still a lot of cold anonymous pages and
system thrashes.  This issue impacts the performance of applications
with large executable, e.g.  chrome.

With this patch, when file refault is detected, inactive_list_is_low()
always returns true for file pages in get_scan_count() to enable
scanning anonymous pages.

The problem can be reproduced by the following test program.

---8<---
void fallocate_file(const char *filename, off_t size)
{
	struct stat st;
	int fd;

	if (!stat(filename, &st) && st.st_size >= size)
		return;

	fd = open(filename, O_WRONLY | O_CREAT, 0600);
	if (fd < 0) {
		perror("create file");
		exit(1);
	}
	if (posix_fallocate(fd, 0, size)) {
		perror("fallocate");
		exit(1);
	}
	close(fd);
}

long *alloc_anon(long size)
{
	long *start = malloc(size);
	memset(start, 1, size);
	return start;
}

long access_file(const char *filename, long size, long rounds)
{
	int fd, i;
	volatile char *start1, *end1, *start2;
	const int page_size = getpagesize();
	long sum = 0;

	fd = open(filename, O_RDONLY);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	/*
	 * Some applications, e.g. chrome, use a lot of executable file
	 * pages, map some of the pages with PROT_EXEC flag to simulate
	 * the behavior.
	 */
	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
		      fd, 0);
	if (start1 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	end1 = start1 + size / 2;

	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
	if (start2 == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	for (i = 0; i < rounds; ++i) {
		struct timeval before, after;
		volatile char *ptr1 = start1, *ptr2 = start2;
		gettimeofday(&before, NULL);
		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
			sum += *ptr1 + *ptr2;
		gettimeofday(&after, NULL);
		printf("File access time, round %d: %f (sec)
", i,
		       (after.tv_sec - before.tv_sec) +
		       (after.tv_usec - before.tv_usec) / 1000000.0);
	}
	return sum;
}

int main(int argc, char *argv[])
{
	const long MB = 1024 * 1024;
	long anon_mb, file_mb, file_rounds;
	const char filename[] = "large";
	long *ret1;
	long ret2;

	if (argc != 4) {
		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS
");
		exit(0);
	}
	anon_mb = atoi(argv[1]);
	file_mb = atoi(argv[2]);
	file_rounds = atoi(argv[3]);

	fallocate_file(filename, file_mb * MB);
	printf("Allocate %ld MB anonymous pages
", anon_mb);
	ret1 = alloc_anon(anon_mb * MB);
	printf("Access %ld MB file pages
", file_mb);
	ret2 = access_file(filename, file_mb * MB, file_rounds);
	printf("Print result to prevent optimization: %ld
",
	       *ret1 + ret2);
	return 0;
}
---8<---

Running the test program on 2GB RAM VM with kernel 5.2.0-rc5, the program
fills ram with 2048 MB memory, access a 200 MB file for 10 times.  Without
this patch, the file cache is dropped aggresively and every access to the
file is from disk.

  $ ./thrash 2048 200 10
  Allocate 2048 MB anonymous pages
  Access 200 MB file pages
  File access time, round 0: 2.489316 (sec)
  File access time, round 1: 2.581277 (sec)
  File access time, round 2: 2.487624 (sec)
  File access time, round 3: 2.449100 (sec)
  File access time, round 4: 2.420423 (sec)
  File access time, round 5: 2.343411 (sec)
  File access time, round 6: 2.454833 (sec)
  File access time, round 7: 2.483398 (sec)
  File access time, round 8: 2.572701 (sec)
  File access time, round 9: 2.493014 (sec)

With this patch, these file pages can be cached.

  $ ./thrash 2048 200 10
  Allocate 2048 MB anonymous pages
  Access 200 MB file pages
  File access time, round 0: 2.475189 (sec)
  File access time, round 1: 2.440777 (sec)
  File access time, round 2: 2.411671 (sec)
  File access time, round 3: 1.955267 (sec)
  File access time, round 4: 0.029924 (sec)
  File access time, round 5: 0.000808 (sec)
  File access time, round 6: 0.000771 (sec)
  File access time, round 7: 0.000746 (sec)
  File access time, round 8: 0.000738 (sec)
  File access time, round 9: 0.000747 (sec)

Checked the swap out stats during the test [1], 19006 pages swapped out
with this patch, 3418 pages swapped out without this patch. There are
more swap out, but I think it's within reasonable range when file backed
data set doesn't fit into the memory.

$ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS
PROCESSES Allocate 2000 MB anonymous pages active_anon: 1613644,
inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
pswpout: 7972443, pgpgin: 478615246 Access 100 MB executable file pages
Access 2100 MB regular file pages File access time, round 0: 12.165,
(sec) active_anon: 1433788, inactive_anon: 478116, active_file: 17896,
inactive_file: 24328 (kB) File access time, round 1: 11.493, (sec)
active_anon: 1430576, inactive_anon: 477144, active_file: 25440,
inactive_file: 26172 (kB) File access time, round 2: 11.455, (sec)
active_anon: 1427436, inactive_anon: 476060, active_file: 21112,
inactive_file: 28808 (kB) File access time, round 3: 11.454, (sec)
active_anon: 1420444, inactive_anon: 473632, active_file: 23216,
inactive_file: 35036 (kB) File access time, round 4: 11.479, (sec)
active_anon: 1413964, inactive_anon: 471460, active_file: 31728,
inactive_file: 32224 (kB) pswpout: 7991449 (+ 19006), pgpgin: 489924366
(+ 11309120)

With 4 processes accessing non-overlapping parts of a large file, 30316
pages swapped out with this patch, 5152 pages swapped out without this
patch.  The swapout number is small comparing to pgpgin.

[1]: https://github.com/vovo/testing/blob/master/mem_thrash.c

Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
Fixes: e986850598 ("mm,vmscan: only evict file pages when we have plenty")
Fixes: 7c5bd705d8 ("mm: memcg: only evict file pages when we have plenty")
Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Sonny Rao <sonnyrao@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>	[4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-12 11:05:39 -07:00
Linus Torvalds 2e756758e5 Many bug fixes and cleanups, and an optimization for case-insensitive
lookups.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl0lFIoACgkQ8vlZVpUN
 gaOwNQf/aJxFxHVf4t3lga8kfoMhlbwINQknsGUVwg32HporMa1NxQXjbEMMhs6V
 A31gBJ44nYVz1enz7nvbE4kx4quF4E8rDVprEetphv4i8GSdUAihwJwY5/H0oSd8
 rxzTZzNKddoyN/j7H4LgAh7bo6IFk54kUuaAWuZDJnJtfLNQ6RBaIwg6u6Z8Fael
 9H3u/RtFHqWPQp5j50PMUG06abr26GKi1gLL+yeoFD1tuzC54B5i6uy34amrXlon
 5agIQ7YuB9bigK4VaLoF4df7o+7+Oa6ENaQ9O/TQc9Uy9ngdVlPpNb2bVDizRLNn
 e369sBFTf3C8sMycJy6x9TCqg2B7Hw==
 =EpCF
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 updates from Ted Ts'o:
 "Many bug fixes and cleanups, and an optimization for case-insensitive
  lookups"

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix coverity warning on error path of filename setup
  ext4: replace ktype default_attrs with default_groups
  ext4: rename htree_inline_dir_to_tree() to ext4_inlinedir_to_tree()
  ext4: refactor initialize_dirent_tail()
  ext4: rename "dirent_csum" functions to use "dirblock"
  ext4: allow directory holes
  jbd2: drop declaration of journal_sync_buffer()
  ext4: use jbd2_inode dirty range scoping
  jbd2: introduce jbd2_inode dirty range scoping
  mm: add filemap_fdatawait_range_keep_errors()
  ext4: remove redundant assignment to node
  ext4: optimize case-insensitive lookups
  ext4: make __ext4_get_inode_loc plug
  ext4: clean up kerneldoc warnigns when building with W=1
  ext4: only set project inherit bit for directory
  ext4: enforce the immutable flag on open files
  ext4: don't allow any modifications to an immutable file
  jbd2: fix typo in comment of journal_submit_inode_data_buffers
  jbd2: fix some print format mistakes
  ext4: gracefully handle ext4_break_layouts() failure during truncate
2019-07-10 21:06:01 -07:00
Linus Torvalds 40f06c7995 Changes to copy_file_range for 5.3 from Dave and Amir:
- Create a generic copy_file_range handler and make individual
   filesystems responsible for calling it (i.e. no more assuming that
   do_splice_direct will work or is appropriate)
 - Refactor copy_file_range and remap_range parameter checking where they
   are the same
 - Install missing copy_file_range parameter checking(!)
 - Remove suid/sgid and update mtime like any other file write
 - Change the behavior so that a copy range crossing the source file's
   eof will result in a short copy to the source file's eof instead of
   EINVAL
 - Permit filesystems to decide if they want to handle cross-superblock
   copy_file_range in their local handlers.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl0BGvAACgkQ+H93GTRK
 tOu2aw/+KGG7PiXm9ED3ZXUppKVddrZMOgqM7mSfHo6TBgW3pJUJcRIhawK0Wz/P
 stgTsOkurHSl3iT3vQyX4GTZvLoGN/rfsRLPxogJptBUqVv3BOrXsrI53f7V/kbm
 rtjlYsgExji7VBUiMTe5kOWWqxyR7B4nXyvY/8rier57rW/8C1I58B0OrxAmTK0k
 rz1e5BtE1dg91xA7cSdEc38FInz8MW8cvsrEzW9vyYY4IVE0PBuhhA1EvryxTrAZ
 hfthHFfzwxhJkI0mdha8uqNufNWrHLSqiwyjYC7pwAwSQzQPiQz9U17flu+URnfF
 kXaR5LdXbBP3pl46RdthrfuonWsv612cC1Qwfjs8PBG9lG7b9PGJ40MGVTiw7LlQ
 924/03ho0zAnV0E8Qn5O9nPshQNDJhwhzMS39EmMyFKb1D5XGzdMV0gDdIfx6hdO
 HDbw6VQ33S59gvk7v/gxsFB5Bs4PKfamHx/QmwQwpqWM5XExcr0yJ90OTBtAuY4r
 S+9gwG6uED3aPh8HbQ5UgnA8bZmMmi8AkcBvqJ9GgNw5SbZl0oyv9Sj6JNpoOejV
 8y9JkhoZUxqiihnKTw/vtMrj5RCOfifNBjMSwrShfLdLKtK0AZl1mXC0/1Q3VnEQ
 TUcyRHEzrtHgJ9/AK9xIyDNvNYzvHSLZj7maoZZumgQa2FOFrmw=
 =qM44
 -----END PGP SIGNATURE-----

Merge tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull copy_file_range updates from Darrick Wong:
 "This fixes numerous parameter checking problems and inconsistent
  behaviors in the new(ish) copy_file_range system call.

  Now the system call will actually check its range parameters
  correctly; refuse to copy into files for which the caller does not
  have sufficient privileges; update mtime and strip setuid like file
  writes are supposed to do; and allows copying up to the EOF of the
  source file instead of failing the call like we used to.

  Summary:

   - Create a generic copy_file_range handler and make individual
     filesystems responsible for calling it (i.e. no more assuming that
     do_splice_direct will work or is appropriate)

   - Refactor copy_file_range and remap_range parameter checking where
     they are the same

   - Install missing copy_file_range parameter checking(!)

   - Remove suid/sgid and update mtime like any other file write

   - Change the behavior so that a copy range crossing the source file's
     eof will result in a short copy to the source file's eof instead of
     EINVAL

   - Permit filesystems to decide if they want to handle
     cross-superblock copy_file_range in their local handlers"

* tag 'copy-file-range-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  fuse: copy_file_range needs to strip setuid bits and update timestamps
  vfs: allow copy_file_range to copy across devices
  xfs: use file_modified() helper
  vfs: introduce file_modified() helper
  vfs: add missing checks to copy_file_range
  vfs: remove redundant checks from generic_remap_checks()
  vfs: introduce generic_file_rw_checks()
  vfs: no fallback for ->copy_file_range
  vfs: introduce generic_copy_file_range()
2019-07-10 20:32:37 -07:00
Linus Torvalds e9a83bd232 It's been a relatively busy cycle for docs:
- A fair pile of RST conversions, many from Mauro.  These create more
    than the usual number of simple but annoying merge conflicts with other
    trees, unfortunately.  He has a lot more of these waiting on the wings
    that, I think, will go to you directly later on.
 
  - A new document on how to use merges and rebases in kernel repos, and one
    on Spectre vulnerabilities.
 
  - Various improvements to the build system, including automatic markup of
    function() references because some people, for reasons I will never
    understand, were of the opinion that :c:func:``function()`` is
    unattractive and not fun to type.
 
  - We now recommend using sphinx 1.7, but still support back to 1.4.
 
  - Lots of smaller improvements, warning fixes, typo fixes, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl0krAEPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5Yg98H/AuLqO9LpOgUjF4LhyjxGPdzJkY9RExSJ7km
 gznyreLCZgFaJR+AY6YDsd4Jw6OJlPbu1YM/Qo3C3WrZVFVhgL/s2ebvBgCo50A8
 raAFd8jTf4/mGCHnAqRotAPQ3mETJUk315B66lBJ6Oc+YdpRhwXWq8ZW2bJxInFF
 3HDvoFgMf0KhLuMHUkkL0u3fxH1iA+KvDu8diPbJYFjOdOWENz/CV8wqdVkXRSEW
 DJxIq89h/7d+hIG3d1I7Nw+gibGsAdjSjKv4eRKauZs4Aoxd1Gpl62z0JNk6aT3m
 dtq4joLdwScydonXROD/Twn2jsu4xYTrPwVzChomElMowW/ZBBY=
 =D0eO
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.3' of git://git.lwn.net/linux

Pull Documentation updates from Jonathan Corbet:
 "It's been a relatively busy cycle for docs:

   - A fair pile of RST conversions, many from Mauro. These create more
     than the usual number of simple but annoying merge conflicts with
     other trees, unfortunately. He has a lot more of these waiting on
     the wings that, I think, will go to you directly later on.

   - A new document on how to use merges and rebases in kernel repos,
     and one on Spectre vulnerabilities.

   - Various improvements to the build system, including automatic
     markup of function() references because some people, for reasons I
     will never understand, were of the opinion that
     :c:func:``function()`` is unattractive and not fun to type.

   - We now recommend using sphinx 1.7, but still support back to 1.4.

   - Lots of smaller improvements, warning fixes, typo fixes, etc"

* tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
  docs: automarkup.py: ignore exceptions when seeking for xrefs
  docs: Move binderfs to admin-guide
  Disable Sphinx SmartyPants in HTML output
  doc: RCU callback locks need only _bh, not necessarily _irq
  docs: format kernel-parameters -- as code
  Doc : doc-guide : Fix a typo
  platform: x86: get rid of a non-existent document
  Add the RCU docs to the core-api manual
  Documentation: RCU: Add TOC tree hooks
  Documentation: RCU: Rename txt files to rst
  Documentation: RCU: Convert RCU UP systems to reST
  Documentation: RCU: Convert RCU linked list to reST
  Documentation: RCU: Convert RCU basic concepts to reST
  docs: filesystems: Remove uneeded .rst extension on toctables
  scripts/sphinx-pre-install: fix out-of-tree build
  docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
  Documentation: PGP: update for newer HW devices
  Documentation: Add section about CPU vulnerabilities for Spectre
  Documentation: platform: Delete x86-laptop-drivers.txt
  docs: Note that :c:func: should no longer be used
  ...
2019-07-09 12:34:26 -07:00
Linus Torvalds 5ad18b2e60 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull force_sig() argument change from Eric Biederman:
 "A source of error over the years has been that force_sig has taken a
  task parameter when it is only safe to use force_sig with the current
  task.

  The force_sig function is built for delivering synchronous signals
  such as SIGSEGV where the userspace application caused a synchronous
  fault (such as a page fault) and the kernel responded with a signal.

  Because the name force_sig does not make this clear, and because the
  force_sig takes a task parameter the function force_sig has been
  abused for sending other kinds of signals over the years. Slowly those
  have been fixed when the oopses have been tracked down.

  This set of changes fixes the remaining abusers of force_sig and
  carefully rips out the task parameter from force_sig and friends
  making this kind of error almost impossible in the future"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
  signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
  signal: Remove the signal number and task parameters from force_sig_info
  signal: Factor force_sig_info_to_task out of force_sig_info
  signal: Generate the siginfo in force_sig
  signal: Move the computation of force into send_signal and correct it.
  signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
  signal: Remove the task parameter from force_sig_fault
  signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
  signal: Explicitly call force_sig_fault on current
  signal/unicore32: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from __do_user_fault
  signal/arm: Remove tsk parameter from ptrace_break
  signal/nds32: Remove tsk parameter from send_sigtrap
  signal/riscv: Remove tsk parameter from do_trap
  signal/sh: Remove tsk parameter from force_sig_info_fault
  signal/um: Remove task parameter from send_sigtrap
  signal/x86: Remove task parameter from send_sigtrap
  signal: Remove task parameter from force_sig_mceerr
  signal: Remove task parameter from force_sig
  signal: Remove task parameter from force_sigsegv
  ...
2019-07-08 21:48:15 -07:00
Linus Torvalds dfd437a257 arm64 updates for 5.3:
- arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}
 
 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly
 
 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)
 
 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new XAFLAG
   and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)
 
 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)
 
 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic
 
 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms
 
 - perf: DDR performance monitor support for iMX8QXP
 
 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers
 
 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent
 
 - arm64 do_page_fault() and hugetlb cleanups
 
 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)
 
 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the 'arm_boot_flags'
   introduced in 5.1)
 
 - CONFIG_RANDOMIZE_BASE now enabled in defconfig
 
 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area
 
 - Make ZONE_DMA32 configurable
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl0eHqcACgkQa9axLQDI
 XvFyNA/+L+bnkz8m3ncydlqqfXomQn4eJJVQ8Uksb0knJz+1+3CUxxbO4ry4jXZN
 fMkbggYrDPRKpDbsUl0lsRipj7jW9bqan+N37c3SWqCkgb6HqDaHViwxdx6Ec/Uk
 gHudozDSPh/8c7hxGcSyt/CFyuW6b+8eYIQU5rtIgz8aVY2BypBvS/7YtYCbIkx0
 w4CFleRTK1zXD5mJQhrc6jyDx659sVkrAvdhf6YIymOY8nBTv40vwdNo3beJMYp8
 Po/+0Ixu+VkHUNtmYYZQgP/AGH96xiTcRnUqd172JdtRPpCLqnLqwFokXeVIlUKT
 KZFMDPzK+756Ayn4z4huEePPAOGlHbJje8JVNnFyreKhVVcCotW7YPY/oJR10bnc
 eo7yD+DxABTn+93G2yP436bNVa8qO1UqjOBfInWBtnNFJfANIkZweij/MQ6MjaTA
 o7KtviHnZFClefMPoiI7HDzwL8XSmsBDbeQ04s2Wxku1Y2xUHLx4iLmadwLQ1ZPb
 lZMTZP3N/T1554MoURVA1afCjAwiqU3bt1xDUGjbBVjLfSPBAn/25IacsG9Li9AF
 7Rp1M9VhrfLftjFFkB2HwpbhRASOxaOSx+EI3kzEfCtM2O9I1WHgP3rvCdc3l0HU
 tbK0/IggQicNgz7GSZ8xDlWPwwSadXYGLys+xlMZEYd3pDIOiFc=
 =0TDT
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - arm64 support for syscall emulation via PTRACE_SYSEMU{,_SINGLESTEP}

 - Wire up VM_FLUSH_RESET_PERMS for arm64, allowing the core code to
   manage the permissions of executable vmalloc regions more strictly

 - Slight performance improvement by keeping softirqs enabled while
   touching the FPSIMD/SVE state (kernel_neon_begin/end)

 - Expose a couple of ARMv8.5 features to user (HWCAP): CondM (new
   XAFLAG and AXFLAG instructions for floating point comparison flags
   manipulation) and FRINT (rounding floating point numbers to integers)

 - Re-instate ARM64_PSEUDO_NMI support which was previously marked as
   BROKEN due to some bugs (now fixed)

 - Improve parking of stopped CPUs and implement an arm64-specific
   panic_smp_self_stop() to avoid warning on not being able to stop
   secondary CPUs during panic

 - perf: enable the ARM Statistical Profiling Extensions (SPE) on ACPI
   platforms

 - perf: DDR performance monitor support for iMX8QXP

 - cache_line_size() can now be set from DT or ACPI/PPTT if provided to
   cope with a system cache info not exposed via the CPUID registers

 - Avoid warning on hardware cache line size greater than
   ARCH_DMA_MINALIGN if the system is fully coherent

 - arm64 do_page_fault() and hugetlb cleanups

 - Refactor set_pte_at() to avoid redundant READ_ONCE(*ptep)

 - Ignore ACPI 5.1 FADTs reported as 5.0 (infer from the
   'arm_boot_flags' introduced in 5.1)

 - CONFIG_RANDOMIZE_BASE now enabled in defconfig

 - Allow the selection of ARM64_MODULE_PLTS, currently only done via
   RANDOMIZE_BASE (and an erratum workaround), allowing modules to spill
   over into the vmalloc area

 - Make ZONE_DMA32 configurable

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (54 commits)
  perf: arm_spe: Enable ACPI/Platform automatic module loading
  arm_pmu: acpi: spe: Add initial MADT/SPE probing
  ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens
  ACPI/PPTT: Modify node flag detection to find last IDENTICAL
  x86/entry: Simplify _TIF_SYSCALL_EMU handling
  arm64: rename dump_instr as dump_kernel_instr
  arm64/mm: Drop [PTE|PMD]_TYPE_FAULT
  arm64: Implement panic_smp_self_stop()
  arm64: Improve parking of stopped CPUs
  arm64: Expose FRINT capabilities to userspace
  arm64: Expose ARMv8.5 CondM capability to userspace
  arm64: defconfig: enable CONFIG_RANDOMIZE_BASE
  arm64: ARM64_MODULES_PLTS must depend on MODULES
  arm64: bpf: do not allocate executable memory
  arm64/kprobes: set VM_FLUSH_RESET_PERMS on kprobe instruction pages
  arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
  arm64: module: create module allocations without exec permissions
  arm64: Allow user selection of ARM64_MODULE_PLTS
  acpi/arm64: ignore 5.1 FADTs that are reported as 5.0
  arm64: Allow selecting Pseudo-NMI again
  ...
2019-07-08 09:54:55 -07:00
Linus Torvalds 69bf4b6b54 Revert "mm: page cache: store only head pages in i_pages"
This reverts commit 5fd4ca2d84.

Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
__delete_from_swap_cache() to trigger:

   page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
   anon
   flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
   raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
   raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
   page dumped because: VM_BUG_ON_PAGE(entry != page)
   page->mem_cgroup:ffff978433ace000
   ------------[ cut here ]------------
   kernel BUG at mm/swap_state.c:170!
   invalid opcode: 0000 [#1] SMP NOPTI
   CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
   Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
   RIP: 0010:__delete_from_swap_cache+0x20d/0x240
   Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
   RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
   RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
   RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
   RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
   R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
   R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
   FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
   Call Trace:
    delete_from_swap_cache+0x46/0xa0
    try_to_free_swap+0xbc/0x110
    swap_writepage+0x13/0x70
    pageout.isra.0+0x13c/0x350
    shrink_page_list+0xc14/0xdf0
    shrink_inactive_list+0x1e5/0x3c0
    shrink_node_memcg+0x202/0x760
    shrink_node+0xe0/0x470
    balance_pgdat+0x2d1/0x510
    kswapd+0x220/0x420
    kthread+0xfb/0x130
    ret_from_fork+0x22/0x40

and it's not immediately obvious why it happens.  It's too late in the
rc cycle to do anything but revert for now.

Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Kirill Shutemov <kirill@shutemov.name>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 19:55:18 -07:00
Oleg Nesterov 8751853091 swap_readpage(): avoid blk_wake_io_task() if !synchronous
swap_readpage() sets waiter = bio->bi_private even if synchronous = F,
this means that the caller can get the spurious wakeup after return.

This can be fatal if blk_wake_io_task() does
set_current_state(TASK_RUNNING) after the caller does
set_special_state(), in the worst case the kernel can crash in
do_task_dead().

Link: http://lkml.kernel.org/r/20190704160301.GA5956@redhat.com
Fixes: 0619317ff8 ("block: add polled wakeup task helper")
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Qian Cai <cai@lca.pw>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 11:12:07 +09:00
Shakeel Butt dffcac2cb8 mm/vmscan.c: prevent useless kswapd loops
In production we have noticed hard lockups on large machines running
large jobs due to kswaps hoarding lru lock within isolate_lru_pages when
sc->reclaim_idx is 0 which is a small zone.  The lru was couple hundred
GiBs and the condition (page_zonenum(page) > sc->reclaim_idx) in
isolate_lru_pages() was basically skipping GiBs of pages while holding
the LRU spinlock with interrupt disabled.

On further inspection, it seems like there are two issues:

(1) If kswapd on the return from balance_pgdat() could not sleep (i.e.
    node is still unbalanced), the classzone_idx is unintentionally set
    to 0 and the whole reclaim cycle of kswapd will try to reclaim only
    the lowest and smallest zone while traversing the whole memory.

(2) Fundamentally isolate_lru_pages() is really bad when the
    allocation has woken kswapd for a smaller zone on a very large machine
    running very large jobs.  It can hoard the LRU spinlock while skipping
    over 100s of GiBs of pages.

This patch only fixes (1).  (2) needs a more fundamental solution.  To
fix (1), in the kswapd context, if pgdat->kswapd_classzone_idx is
invalid use the classzone_idx of the previous kswapd loop otherwise use
the one the waker has requested.

Link: http://lkml.kernel.org/r/20190701201847.251028-1-shakeelb@google.com
Fixes: e716f2eb24 ("mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 11:12:07 +09:00
Juergen Gross b9705d8778 mm/page_alloc.c: fix regression with deferred struct page init
Commit 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time
instead of doing larger sections") is causing a regression on some
systems when the kernel is booted as Xen dom0.

The system will just hang in early boot.

Reason is an endless loop in get_page_from_freelist() in case the first
zone looked at has no free memory.  deferred_grow_zone() is always
returning true due to the following code snipplet:

  /* If the zone is empty somebody else may have cleared out the zone */
  if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
                                           first_deferred_pfn)) {
          pgdat->first_deferred_pfn = ULONG_MAX;
          pgdat_resize_unlock(pgdat, &flags);
          return true;
  }

This in turn results in the loop as get_page_from_freelist() is assuming
forward progress can be made by doing some more struct page
initialization.

Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
Fixes: 0e56acae4b ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
Signed-off-by: Juergen Gross <jgross@suse.com>
Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-07-05 11:12:07 +09:00
Huang Ying 1a5f439c7c mm, swap: fix THP swap out
0-Day test system reported some OOM regressions for several THP
(Transparent Huge Page) swap test cases.  These regressions are bisected
to 6861428921 ("block: always define BIO_MAX_PAGES as 256").  In the
commit, BIO_MAX_PAGES is set to 256 even when THP swap is enabled.  So the
bio_alloc(gfp_flags, 512) in get_swap_bio() may fail when swapping out
THP.  That causes the OOM.

As in the patch description of 6861428921 ("block: always define
BIO_MAX_PAGES as 256"), THP swap should use multi-page bvec to write THP
to swap space.  So the issue is fixed via doing that in get_swap_bio().

BTW: I remember I have checked the THP swap code when 6861428921
("block: always define BIO_MAX_PAGES as 256") was merged, and thought the
THP swap code needn't to be changed.  But apparently, I was wrong.  I
should have done this at that time.

Link: http://lkml.kernel.org/r/20190624075515.31040-1-ying.huang@intel.com
Fixes: 6861428921 ("block: always define BIO_MAX_PAGES as 256")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Arnd Bergmann 2c9292336a mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
gcc gets confused in pcpu_get_vm_areas() because there are too many
branches that affect whether 'lva' was initialized before it gets used:

  mm/vmalloc.c: In function 'pcpu_get_vm_areas':
  mm/vmalloc.c:991:4: error: 'lva' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      insert_vmap_area_augment(lva, &va->rb_node,
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       &free_vmap_area_root, &free_vmap_area_list);
       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/vmalloc.c:916:20: note: 'lva' was declared here
    struct vmap_area *lva;
                      ^~~

Add an intialization to NULL, and check whether this has changed before
the first use.

[akpm@linux-foundation.org: tweak comments]
Link: http://lkml.kernel.org/r/20190618092650.2943749-1-arnd@arndb.de
Fixes: 68ad4a3304 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Joel Fernandes <joelaf@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Colin Ian King 7298e3b0a1 mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
Currently the calcuation of end_pfn can round up the pfn number to more
than the actual maximum number of pfns, causing an Oops.  Fix this by
ensuring end_pfn is never more than max_pfn.

This can be easily triggered when on systems where the end_pfn gets
rounded up to more than max_pfn using the idle-page stress-ng stress test:

sudo stress-ng --idle-page 0

  BUG: unable to handle kernel paging request at 00000000000020d8
  #PF error: [normal kernel read fault]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP PTI
  CPU: 1 PID: 11039 Comm: stress-ng-idle- Not tainted 5.0.0-5-generic #6-Ubuntu
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
  RIP: 0010:page_idle_get_page+0xc8/0x1a0
  Code: 0f b1 0a 75 7d 48 8b 03 48 89 c2 48 c1 e8 33 83 e0 07 48 c1 ea 36 48 8d 0c 40 4c 8d 24 88 49 c1 e4 07 4c 03 24 d5 00 89 c3 be <49> 8b 44 24 58 48 8d b8 80 a1 02 00 e8 07 d5 77 00 48 8b 53 08 48
  RSP: 0018:ffffafd7c672fde8 EFLAGS: 00010202
  RAX: 0000000000000005 RBX: ffffe36341fff700 RCX: 000000000000000f
  RDX: 0000000000000284 RSI: 0000000000000275 RDI: 0000000001fff700
  RBP: ffffafd7c672fe00 R08: ffffa0bc34056410 R09: 0000000000000276
  R10: ffffa0bc754e9b40 R11: ffffa0bc330f6400 R12: 0000000000002080
  R13: ffffe36341fff700 R14: 0000000000080000 R15: ffffa0bc330f6400
  FS: 00007f0ec1ea5740(0000) GS:ffffa0bc7db00000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000020d8 CR3: 0000000077d68000 CR4: 00000000000006e0
  Call Trace:
    page_idle_bitmap_write+0x8c/0x140
    sysfs_kf_bin_write+0x5c/0x70
    kernfs_fop_write+0x12e/0x1b0
    __vfs_write+0x1b/0x40
    vfs_write+0xab/0x1b0
    ksys_write+0x55/0xc0
    __x64_sys_write+0x1a/0x20
    do_syscall_64+0x5a/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

Link: http://lkml.kernel.org/r/20190618124352.28307-1-colin.king@canonical.com
Fixes: 33c3fc71c8 ("mm: introduce idle page tracking")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Yafang Shao 432b1de0de mm/oom_kill.c: fix uninitialized oc->constraint
In dump_oom_summary() oc->constraint is used to show oom_constraint_text,
but it hasn't been set before.  So the value of it is always the default
value 0.  We should inititialize it before.

Bellow is the output when memcg oom occurs,

before this patch:
  oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0

after this patch:
  oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0

Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com
Fixes: ef8444ea01 ("mm, oom: reorganize the oom report in dump_header")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Wind Yu <yuzhoujian@didichuxing.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Naoya Horiguchi faf53def3b mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
madvise(MADV_SOFT_OFFLINE) often returns -EBUSY when calling soft offline
for hugepages with overcommitting enabled.  That was caused by the
suboptimal code in current soft-offline code.  See the following part:

    ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
                            MIGRATE_SYNC, MR_MEMORY_FAILURE);
    if (ret) {
            ...
    } else {
            /*
             * We set PG_hwpoison only when the migration source hugepage
             * was successfully dissolved, because otherwise hwpoisoned
             * hugepage remains on free hugepage list, then userspace will
             * find it as SIGBUS by allocation failure. That's not expected
             * in soft-offlining.
             */
            ret = dissolve_free_huge_page(page);
            if (!ret) {
                    if (set_hwpoison_free_buddy_page(page))
                            num_poisoned_pages_inc();
            }
    }
    return ret;

Here dissolve_free_huge_page() returns -EBUSY if the migration source page
was freed into buddy in migrate_pages(), but even in that case we actually
has a chance that set_hwpoison_free_buddy_page() succeeds.  So that means
current code gives up offlining too early now.

dissolve_free_huge_page() checks that a given hugepage is suitable for
dissolving, where we should return success for !PageHuge() case because
the given hugepage is considered as already dissolved.

This change also affects other callers of dissolve_free_huge_page(), which
are cleaned up together.

[n-horiguchi@ah.jp.nec.com: v3]
  Link: http://lkml.kernel.org/r/1560761476-4651-3-git-send-email-n-horiguchi@ah.jp.nec.comLink: http://lkml.kernel.org/r/1560154686-18497-3-git-send-email-n-horiguchi@ah.jp.nec.com
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Chen, Jerry T <jerry.t.chen@intel.com>
Tested-by: Chen, Jerry T <jerry.t.chen@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
Naoya Horiguchi b38e5962f8 mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
The pass/fail of soft offline should be judged by checking whether the
raw error page was finally contained or not (i.e.  the result of
set_hwpoison_free_buddy_page()), but current code do not work like
that.  It might lead us to misjudge the test result when
set_hwpoison_free_buddy_page() fails.

Without this fix, there are cases where madvise(MADV_SOFT_OFFLINE) may
not offline the original page and will not return an error.

Link: http://lkml.kernel.org/r/1560154686-18497-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Fixes: 6bc9b56433 ("mm: fix race on soft-offlining")
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Xishi Qiu <xishi.qiuxishi@alibaba-inc.com>
Cc: "Chen, Jerry T" <jerry.t.chen@intel.com>
Cc: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>
Cc: <stable@vger.kernel.org>	[4.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:45 +08:00
zhong jiang 29b190fa77 mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
mempoclicies when the tasks's cpuset's mems_allowed changes.  For
policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
it works by remapping the policy's allowed nodes (stored in v.nodes)
using the previous value of mems_allowed (stored in
w.cpuset_mems_allowed) as the domain of map and the new mems_allowed
(passed as nodes) as the range of the map (see the comment of
bitmap_remap() for details).

The result of remapping is stored back as policy's nodemask in v.nodes,
and the new value of mems_allowed should be stored in
w.cpuset_mems_allowed to facilitate the next rebind, if it happens.

However, 213980c0f2 ("mm, mempolicy: simplify rebinding mempolicies
when updating cpusets") introduced a bug where the result of remapping
is stored in w.cpuset_mems_allowed instead.  Thus, a mempolicy's
allowed nodes can evolve in an unexpected way after a series of
rebinding due to cpuset mems_allowed changes, possibly binding to a
wrong node or a smaller number of nodes which may e.g.  overload them.
This patch fixes the bug so rebinding again works as intended.

[vbabka@suse.cz: new changlog]
  Link: http://lkml.kernel.org/r/ef6a69c6-c052-b067-8f2c-9d615c619bb9@suse.cz
Link: http://lkml.kernel.org/r/1558768043-23184-1-git-send-email-zhongjiang@huawei.com
Fixes: 213980c0f2 ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-29 16:43:44 +08:00
Ard Biesheuvel 4739d53fcd arm64/mm: wire up CONFIG_ARCH_HAS_SET_DIRECT_MAP
Wire up the special helper functions to manipulate aliases of vmalloc
regions in the linear map.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2019-06-24 18:10:39 +01:00
Ross Zwisler aa0bfcd939 mm: add filemap_fdatawait_range_keep_errors()
In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler <zwisler@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2019-06-20 17:05:37 -04:00
Thomas Gleixner d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Thomas Gleixner 20c8ccb197 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2 see
  the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:53 +02:00
Thomas Gleixner 7a338472f2 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 482
Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 48 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:52 +02:00
Thomas Gleixner 8092f73c51 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 248
Based on 1 normalized pattern(s):

  this file is released under the gpl v2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204655.103854853@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:08 +02:00
Linus Torvalds 963172d9c7 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
 "The accumulated fixes from this and last week:

   - Fix vmalloc TLB flush and map range calculations which lead to
     stale TLBs, spurious faults and other hard to diagnose issues.

   - Use fault_in_pages_writable() for prefaulting the user stack in the
     FPU code as it's less fragile than the current solution

   - Use the PF_KTHREAD flag when checking for a kernel thread instead
     of current->mm as the latter can give the wrong answer due to
     use_mm()

   - Compute the vmemmap size correctly for KASLR and 5-Level paging.
     Otherwise this can end up with a way too small vmemmap area.

   - Make KASAN and 5-level paging work again by making sure that all
     invalid bits are masked out when computing the P4D offset. This
     worked before but got broken recently when the LDT remap area was
     moved.

   - Prevent a NULL pointer dereference in the resource control code
     which can be triggered with certain mount options when the
     requested resource is not available.

   - Enforce ordering of microcode loading vs. perf initialization on
     secondary CPUs. Otherwise perf tries to access a non-existing MSR
     as the boot CPU marked it as available.

   - Don't stop the resource control group walk early otherwise the
     control bitmaps are not updated correctly and become inconsistent.

   - Unbreak kgdb by returning 0 on success from
     kgdb_arch_set_breakpoint() instead of an error code.

   - Add more Icelake CPU model defines so depending changes can be
     queued in other trees"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode, cpuhotplug: Add a microcode loader CPU hotplug callback
  x86/kasan: Fix boot with 5-level paging and KASAN
  x86/fpu: Don't use current->mm to check for a kthread
  x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
  x86/resctrl: Prevent NULL pointer dereference when local MBM is disabled
  x86/resctrl: Don't stop walking closids when a locksetup group is found
  x86/fpu: Update kernel's FPU state before using for the fsave header
  x86/mm/KASLR: Compute the size of the vmemmap section properly
  x86/fpu: Use fault_in_pages_writeable() for pre-faulting
  x86/CPU: Add more Icelake model numbers
  mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
  mm/vmalloc: Fix calculation of direct map addr range
2019-06-16 07:28:14 -10:00
Jonathan Corbet 8afecfb0ec Linux 5.2-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlz8fAYeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiG1asH/3ySguxqtqL1MCBa
 4/SZ37PHeWKMerfX6ZyJdgEqK3B+PWlmuLiOMNK5h2bPLzeQQQAmHU/mfKmpXqgB
 dHwUbG9yNnyUtTfsfRqAnCA6vpuw9Yb1oIzTCVQrgJLSWD0j7scBBvmzYqguOkto
 ThwigLUq3AILr8EfR4rh+GM+5Dn9OTEFAxwil9fPHQo7QoczwZxpURhScT6Co9TB
 DqLA3fvXbBvLs/CZy/S5vKM9hKzC+p39ApFTURvFPrelUVnythAM0dPDJg3pIn5u
 g+/+gDxDFa+7ANxvxO2ng1sJPDqJMeY/xmjJYlYyLpA33B7zLNk2vDHhAP06VTtr
 XCMhQ9s=
 =cb80
 -----END PGP SIGNATURE-----

Merge tag 'v5.2-rc4' into mauro

We need to pick up post-rc1 changes to various document files so they don't
get lost in Mauro's massive RST conversion push.
2019-06-14 14:18:53 -06:00
Dan Williams 50f44ee724 mm/devm_memremap_pages: fix final page put race
Logan noticed that devm_memremap_pages_release() kills the percpu_ref
drops all the page references that were acquired at init and then
immediately proceeds to unplug, arch_remove_memory(), the backing pages
for the pagemap.  If for some reason device shutdown actually collides
with a busy / elevated-ref-count page then arch_remove_memory() should
be deferred until after that reference is dropped.

As it stands the "wait for last page ref drop" happens *after*
devm_memremap_pages_release() returns, which is obviously too late and
can lead to crashes.

Fix this situation by assigning the responsibility to wait for the
percpu_ref to go idle to devm_memremap_pages() with a new ->cleanup()
callback.  Implement the new cleanup callback for all
devm_memremap_pages() users: pmem, devdax, hmm, and p2pdma.

Link: http://lkml.kernel.org/r/155727339156.292046.5432007428235387859.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: 41e94a8513 ("add devm_memremap_pages")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Minchan Kim a58f2cef26 mm/vmscan.c: fix trying to reclaim unevictable LRU page
There was the below bug report from Wu Fangsuo.

On the CMA allocation path, isolate_migratepages_range() could isolate
unevictable LRU pages and reclaim_clean_page_from_list() can try to
reclaim them if they are clean file-backed pages.

  page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
  flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
  raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
  raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
  page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
  page->mem_cgroup:ffffffc09bf3ee80
  ------------[ cut here ]------------
  kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
  Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
  Modules linked in:
  CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
  Hardware name: ASR AQUILAC EVB (DT)
  task: ffffffc00a54cd00 task.stack: ffffffc009b00000
  PC is at shrink_page_list+0x1998/0x3240
  LR is at shrink_page_list+0x1998/0x3240
  pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
  sp : ffffffc009b05940
  ..
     shrink_page_list+0x1998/0x3240
     reclaim_clean_pages_from_list+0x3c0/0x4f0
     alloc_contig_range+0x3bc/0x650
     cma_alloc+0x214/0x668
     ion_cma_allocate+0x98/0x1d8
     ion_alloc+0x200/0x7e0
     ion_ioctl+0x18c/0x378
     do_vfs_ioctl+0x17c/0x1780
     SyS_ioctl+0xac/0xc0

Wu found it's due to commit ad6b67041a ("mm: remove SWAP_MLOCK in
ttu").  Before that, unevictable pages go to cull_mlocked so that we
can't reach the VM_BUG_ON_PAGE line.

To fix the issue, this patch filters out unevictable LRU pages from the
reclaim_clean_pages_from_list in CMA.

Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
Fixes: ad6b67041a ("mm: remove SWAP_MLOCK in ttu")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
Cc: <stable@vger.kernel.org>	[4.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Andrea Arcangeli 59ea6d06cf coredump: fix race condition between collapse_huge_page() and core dumping
When fixing the race conditions between the coredump and the mmap_sem
holders outside the context of the process, we focused on
mmget_not_zero()/get_task_mm() callers in 04f5866e41 ("coredump: fix
race condition between mmget_not_zero()/get_task_mm() and core
dumping"), but those aren't the only cases where the mmap_sem can be
taken outside of the context of the process as Michal Hocko noticed
while backporting that commit to older -stable kernels.

If mmgrab() is called in the context of the process, but then the
mm_count reference is transferred outside the context of the process,
that can also be a problem if the mmap_sem has to be taken for writing
through that mm_count reference.

khugepaged registration calls mmgrab() in the context of the process,
but the mmap_sem for writing is taken later in the context of the
khugepaged kernel thread.

collapse_huge_page() after taking the mmap_sem for writing doesn't
modify any vma, so it's not obvious that it could cause a problem to the
coredump, but it happens to modify the pmd in a way that breaks an
invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
needs the mmap_sem for writing just to block concurrent page faults that
call pmd_trans_huge_lock().

Specifically the invariant that "!pmd_trans_huge()" cannot become a
"pmd_trans_huge()" doesn't hold while collapse_huge_page() runs.

The coredump will call __get_user_pages() without mmap_sem for reading,
which eventually can invoke a lockless page fault which will need a
functional pmd_trans_huge_lock().

So collapse_huge_page() needs to use mmget_still_valid() to check it's
not running concurrently with the coredump...  as long as the coredump
can invoke page faults without holding the mmap_sem for reading.

This has "Fixes: khugepaged" to facilitate backporting, but in my view
it's more a bug in the coredump code that will eventually have to be
rewritten to stop invoking page faults without the mmap_sem for reading.
So the long term plan is still to drop all mmget_still_valid().

Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
Fixes: ba76149f47 ("thp: khugepaged")
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
swkhack 0874bb49bb mm/mlock.c: change count_mm_mlocked_page_nr return type
On a 64-bit machine the value of "vma->vm_end - vma->vm_start" may be
negative when using 32 bit ints and the "count >> PAGE_SHIFT"'s result
will be wrong.  So change the local variable and return value to
unsigned long to fix the problem.

Link: http://lkml.kernel.org/r/20190513023701.83056-1-swkhack@gmail.com
Fixes: 0cf2f6f6dc ("mm: mlock: check against vma for actual mlock() size")
Signed-off-by: swkhack <swkhack@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Yang Shi 7a30df49f6 mm: mmu_gather: remove __tlb_reset_range() for force flush
A few new fields were added to mmu_gather to make TLB flush smarter for
huge page by telling what level of page table is changed.

__tlb_reset_range() is used to reset all these page table state to
unchanged, which is called by TLB flush for parallel mapping changes for
the same range under non-exclusive lock (i.e.  read mmap_sem).

Before commit dd2283f260 ("mm: mmap: zap pages with read mmap_sem in
munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
PTEs in parallel don't remove page tables.  But, the forementioned
commit may do munmap() under read mmap_sem and free page tables.  This
may result in program hang on aarch64 reported by Jan Stancek.  The
problem could be reproduced by his test program with slightly modified
below.

---8<---

static int map_size = 4096;
static int num_iter = 500;
static long threads_total;

static void *distant_area;

void *map_write_unmap(void *ptr)
{
	int *fd = ptr;
	unsigned char *map_address;
	int i, j = 0;

	for (i = 0; i < num_iter; i++) {
		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		if (map_address == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}

		for (j = 0; j < map_size; j++)
			map_address[j] = 'b';

		if (munmap(map_address, map_size) == -1) {
			perror("munmap");
			exit(1);
		}
	}

	return NULL;
}

void *dummy(void *ptr)
{
	return NULL;
}

int main(void)
{
	pthread_t thid[2];

	/* hint for mmap in map_write_unmap() */
	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
	distant_area += DISTANT_MMAP_SIZE / 2;

	while (1) {
		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
		pthread_create(&thid[1], NULL, dummy, NULL);

		pthread_join(thid[0], NULL);
		pthread_join(thid[1], NULL);
	}
}
---8<---

The program may bring in parallel execution like below:

        t1                                        t2
munmap(map_address)
  downgrade_write(&mm->mmap_sem);
  unmap_region()
  tlb_gather_mmu()
    inc_tlb_flush_pending(tlb->mm);
  free_pgtables()
    tlb->freed_tables = 1
    tlb->cleared_pmds = 1

                                        pthread_exit()
                                        madvise(thread_stack, 8M, MADV_DONTNEED)
                                          zap_page_range()
                                            tlb_gather_mmu()
                                              inc_tlb_flush_pending(tlb->mm);

  tlb_finish_mmu()
    if (mm_tlb_flush_nested(tlb->mm))
      __tlb_reset_range()

__tlb_reset_range() would reset freed_tables and cleared_* bits, but this
may cause inconsistency for munmap() which do free page tables.  Then it
may result in some architectures, e.g.  aarch64, may not flush TLB
completely as expected to have stale TLB entries remained.

Use fullmm flush since it yields much better performance on aarch64 and
non-fullmm doesn't yields significant difference on x86.

The original proposed fix came from Jan Stancek who mainly debugged this
issue, I just wrapped up everything together.

Jan's testing results:

v5.2-rc2-24-gbec7550cca10
--------------------------
         mean     stddev
real    37.382   2.780
user     1.420   0.078
sys     54.658   1.855

v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
---------------------------------------------------------------------------------------_
         mean     stddev
real    37.119   2.105
user     1.548   0.087
sys     55.698   1.357

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: dd2283f260 ("mm: mmap: zap pages with read mmap_sem in munmap")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Tested-by: Jan Stancek <jstancek@redhat.com>
Suggested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[4.20+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Kirill Tkhai b17f18aff2 mm/vmscan.c: fix recent_rotated history
Johannes pointed out that after commit 886cf1901d ("mm: move
recent_rotated pages calculation to shrink_inactive_list()") we lost all
zone_reclaim_stat::recent_rotated history.

This fixes it.

Link: http://lkml.kernel.org/r/155905972210.26456.11178359431724024112.stgit@localhost.localdomain
Fixes: 886cf1901d ("mm: move recent_rotated pages calculation to shrink_inactive_list()")
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Potyra, Stefan dedca63504 mm/mlock.c: mlockall error for flag MCL_ONFAULT
If mlockall() is called with only MCL_ONFAULT as flag, it removes any
previously applied lockings and does nothing else.

This behavior is counter-intuitive and doesn't match the Linux man page.

  For mlockall():

  EINVAL Unknown flags were specified or MCL_ONFAULT was specified
  without either MCL_FUTURE or MCL_CURRENT.

Consequently, return the error EINVAL, if only MCL_ONFAULT is passed.
That way, applications will at least detect that they are calling
mlockall() incorrectly.

Link: http://lkml.kernel.org/r/20190527075333.GA6339@er01809n.ebgroup.elektrobit.com
Fixes: b0f205c2a3 ("mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage")
Signed-off-by: Stefan Potyra <Stefan.Potyra@elektrobit.com>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Shakeel Butt 3510955b32 mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node
Syzbot reported following memory leak:

ffffffffda RBX: 0000000000000003 RCX: 0000000000441f79
BUG: memory leak
unreferenced object 0xffff888114f26040 (size 32):
  comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
  hex dump (first 32 bytes):
    40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff  @`......@`......
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     slab_post_alloc_hook mm/slab.h:439 [inline]
     slab_alloc mm/slab.c:3326 [inline]
     kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
     kmalloc include/linux/slab.h:547 [inline]
     __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
     memcg_init_list_lru_node mm/list_lru.c:375 [inline]
     memcg_init_list_lru mm/list_lru.c:459 [inline]
     __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
     alloc_super+0x2e0/0x310 fs/super.c:269
     sget_userns+0x94/0x2a0 fs/super.c:609
     sget+0x8d/0xb0 fs/super.c:660
     mount_nodev+0x31/0xb0 fs/super.c:1387
     fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
     legacy_get_tree+0x27/0x80 fs/fs_context.c:661
     vfs_get_tree+0x2e/0x120 fs/super.c:1476
     do_new_mount fs/namespace.c:2790 [inline]
     do_mount+0x932/0xc50 fs/namespace.c:3110
     ksys_mount+0xab/0x120 fs/namespace.c:3319
     __do_sys_mount fs/namespace.c:3333 [inline]
     __se_sys_mount fs/namespace.c:3330 [inline]
     __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
     do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is a simple off by one bug on the error path.

Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
Fixes: 60d3fd32a7 ("list_lru: introduce per-memcg lists")
Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.0+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Johannes Weiner 815744d751 mm: memcontrol: don't batch updates of local VM stats and events
The kernel test robot noticed a 26% will-it-scale pagefault regression
from commit 42a3003535 ("mm: memcontrol: fix recursive statistics
correctness & scalabilty").  This appears to be caused by bouncing the
additional cachelines from the new hierarchical statistics counters.

We can fix this by getting rid of the batched local counters instead.

Originally, there were *only* group-local counters, and they were fully
maintained per cpu.  A reader of a stats file high up in the cgroup tree
would have to walk the entire subtree and collect each level's per-cpu
counters to get the recursive view.  This was prohibitively expensive,
and so we switched to per-cpu batched updates of the local counters
during a983b5ebee ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), reducing the complexity from nr_subgroups *
nr_cpus to nr_subgroups.

With growing machines and cgroup trees, the tree walk itself became too
expensive for monitoring top-level groups, and this is when the culprit
patch added hierarchy counters on each cgroup level.  When the per-cpu
batch size would be reached, both the local and the hierarchy counters
would get batch-updated from the per-cpu delta simultaneously.

This makes local and hierarchical counter reads blazingly fast, but it
unfortunately makes the write-side too cache line intense.

Since local counter reads were never a problem - we only centralized
them to accelerate the hierarchy walk - and use of the local counters
are becoming rarer due to replacement with hierarchical views (ongoing
rework in the page reclaim and workingset code), we can make those local
counters unbatched per-cpu counters again.

The scheme will then be as such:

   when a memcg statistic changes, the writer will:
   - update the local counter (per-cpu)
   - update the batch counter (per-cpu). If the batch is full:
   - spill the batch into the group's atomic_t
   - spill the batch into all ancestors' atomic_ts
   - empty out the batch counter (per-cpu)

   when a local memcg counter is read, the reader will:
   - collect the local counter from all cpus

   when a hiearchy memcg counter is read, the reader will:
   - read the atomic_t

We might be able to simplify this further and make the recursive
counters unbatched per-cpu counters as well (batch upward propagation,
but leave per-cpu collection to the readers), but that will require a
more in-depth analysis and testing of all the callsites.  Deal with the
immediate regression for now.

Link: http://lkml.kernel.org/r/20190521151647.GB2870@cmpxchg.org
Fixes: 42a3003535 ("mm: memcontrol: fix recursive statistics correctness & scalabilty")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Tested-by: kernel test robot <rong.a.chen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-13 17:34:56 -10:00
Amir Goldstein 96e6e8f4a6 vfs: add missing checks to copy_file_range
Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-09 10:06:19 -07:00
Amir Goldstein 646955cd54 vfs: remove redundant checks from generic_remap_checks()
The access limit checks on input file range in generic_remap_checks()
are redundant because the input file size is guaranteed to be within
limits and pos+len are already checked to be within input file size.

Beyond the fact that the check cannot fail, if it would have failed,
it could return -EFBIG for input file range error. There is no precedent
for that. -EFBIG is returned in syscalls that would change file length.

With that call removed, we can fold generic_access_check_limits() into
generic_write_check_limits().

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-09 10:06:19 -07:00
Amir Goldstein a31713517d vfs: introduce generic_file_rw_checks()
Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-09 10:06:19 -07:00
Mauro Carvalho Chehab cb1aaebea8 docs: fix broken documentation links
Mostly due to x86 and acpi conversion, several documentation
links are still pointing to the old file. Fix them.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Wolfram Sang <wsa@the-dreams.de>
Reviewed-by: Sven Van Asbroeck <TheSven73@gmail.com>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Acked-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-08 13:42:13 -06:00
Thomas Gleixner b886d83c5b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 of the license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 315 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190115.503150771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:17 +02:00
Thomas Gleixner 55716d2643 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 428
Based on 1 normalized pattern(s):

  this file is released under the gplv2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 68 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190114.292346262@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:16 +02:00
Thomas Gleixner b2139ce04f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 403
Based on 1 normalized pattern(s):

  this software may be redistributed and or modified under the terms
  of the gnu general public license gpl version 2 as published by the
  free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531190112.039124428@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:13 +02:00
Thomas Gleixner 4505153954 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 333
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 59 temple place suite 330 boston ma 02111
  1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 136 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.384967451@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:06 +02:00
Thomas Gleixner 1439f94c54 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 263
Based on 1 normalized pattern(s):

  this software may be redistributed and or modified under the terms
  of the gnu general public license gpl version 2 only as published by
  the free software foundation

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.676969322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:28 +02:00
Rick Edgecombe 31e67340cc mm/vmalloc: Avoid rare case of flushing TLB with weird arguments
In a rare case, flush_tlb_kernel_range() could be called with a start
higher than the end.

In vm_remove_mappings(), in case page_address() returns 0 for all pages
(for example they were all in highmem), _vm_unmap_aliases() will be
called with start = ULONG_MAX, end = 0 and flush = 1.

If at the same time, the vmalloc purge operation is triggered by something
else while the current operation is between remove_vm_area() and
_vm_unmap_aliases(), then the vm mapping just removed will be already
purged. In this case the call of vm_unmap_aliases() may not find any other
mappings to flush and so end up flushing start = ULONG_MAX, end = 0. So
only set flush = true if we find something in the direct mapping that we
need to flush, and this way this can't happen.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Nadav Amit <namit@vmware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 868b104d73 ("mm/vmalloc: Add flag for freeing of special permsissions")
Link: https://lkml.kernel.org/r/20190527211058.2729-3-rick.p.edgecombe@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:47:25 +02:00
Rick Edgecombe 8e41f8726d mm/vmalloc: Fix calculation of direct map addr range
The calculation of the direct map address range to flush was wrong.
This could cause the RO direct map alias to not get flushed. Today
this shouldn't be a problem because this flush is only needed on x86
right now and the spurious fault handler will fix cached RO->RW
translations. In the future though, it could cause the permissions
to remain RO in the TLB for the direct map alias, and then the page
would return from the page allocator to some other component as RO
and cause a crash.

So fix fix the address range calculation so the flush will include the
direct map range.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Meelis Roos <mroos@linux.ee>
Cc: Nadav Amit <namit@vmware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 868b104d73 ("mm/vmalloc: Add flag for freeing of special permsissions")
Link: https://lkml.kernel.org/r/20190527211058.2729-2-rick.p.edgecombe@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-06-03 11:47:25 +02:00
Suzuki K Poulose e577c8b64d mm, compaction: make sure we isolate a valid PFN
When we have holes in a normal memory zone, we could endup having
cached_migrate_pfns which may not necessarily be valid, under heavy memory
pressure with swapping enabled ( via __reset_isolation_suitable(),
triggered by kswapd).

Later if we fail to find a page via fast_isolate_freepages(), we may end
up using the migrate_pfn we started the search with, as valid page.  This
could lead to accessing NULL pointer derefernces like below, due to an
invalid mem_section pointer.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [47/1825]
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000082f94ae9
 [0000000000000008] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 ...
 CPU: 10 PID: 6080 Comm: qemu-system-aar Not tainted 510-rc1+ #6
 Hardware name: AmpereComputing(R) OSPREY EV-883832-X3-0001/OSPREY, BIOS 4819 09/25/2018
 pstate: 60000005 (nZCv daif -PAN -UAO)
 pc : set_pfnblock_flags_mask+0x58/0xe8
 lr : compaction_alloc+0x300/0x950
 [...]
 Process qemu-system-aar (pid: 6080, stack limit = 0x0000000095070da5)
 Call trace:
  set_pfnblock_flags_mask+0x58/0xe8
  compaction_alloc+0x300/0x950
  migrate_pages+0x1a4/0xbb0
  compact_zone+0x750/0xde8
  compact_zone_order+0xd8/0x118
  try_to_compact_pages+0xb4/0x290
  __alloc_pages_direct_compact+0x84/0x1e0
  __alloc_pages_nodemask+0x5e0/0xe18
  alloc_pages_vma+0x1cc/0x210
  do_huge_pmd_anonymous_page+0x108/0x7c8
  __handle_mm_fault+0xdd4/0x1190
  handle_mm_fault+0x114/0x1c0
  __get_user_pages+0x198/0x3c0
  get_user_pages_unlocked+0xb4/0x1d8
  __gfn_to_pfn_memslot+0x12c/0x3b8
  gfn_to_pfn_prot+0x4c/0x60
  kvm_handle_guest_abort+0x4b0/0xcd8
  handle_exit+0x140/0x1b8
  kvm_arch_vcpu_ioctl_run+0x260/0x768
  kvm_vcpu_ioctl+0x490/0x898
  do_vfs_ioctl+0xc4/0x898
  ksys_ioctl+0x8c/0xa0
  __arm64_sys_ioctl+0x28/0x38
  el0_svc_common+0x74/0x118
  el0_svc_handler+0x38/0x78
  el0_svc+0x8/0xc
 Code: f8607840 f100001f 8b011401 9a801020 (f9400400)
 ---[ end trace af6a35219325a9b6 ]---

The issue was reported on an arm64 server with 128GB with holes in the
zone (e.g, [32GB@4GB, 96GB@544GB]), with a swap device enabled, while
running 100 KVM guest instances.

This patch fixes the issue by ensuring that the page belongs to a valid
PFN when we fallback to using the lower limit of the scan range upon
failure in fast_isolate_freepages().

Link: http://lkml.kernel.org/r/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com
Fixes: 5a811889de ("mm, compaction: use free lists to quickly locate a migration target")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reported-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:32 -07:00
Nathan Chancellor 0600597c85 kasan: initialize tag to 0xff in __kasan_kmalloc
When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
warns:

mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
used here [-Wuninitialized]
        kasan_unpoison_shadow(set_tag(object, tag), size);
                                              ^~~

set_tag ignores tag in this configuration but clang doesn't realize it at
this point in its pipeline, as it points to arch_kasan_set_tag as being
the point where it is used, which will later be expanded to (void
*)(object) without a use of tag.  Initialize tag to 0xff, as it removes
this warning and doesn't change the meaning of the code.

Link: https://github.com/ClangBuiltLinux/linux/issues/465
Link: http://lkml.kernel.org/r/20190502163057.6603-1-natechancellor@gmail.com
Fixes: 7f94ffbc4c ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00
Vitaly Wool bb9f6f63f3 z3fold: fix sheduling while atomic
kmem_cache_alloc() may be called from z3fold_alloc() in atomic context, so
we need to pass correct gfp flags to avoid "scheduling while atomic" bug.

Link: http://lkml.kernel.org/r/20190523153245.119dfeed55927e8755250ddd@gmail.com
Fixes: 7c2b8baa61 ("mm/z3fold.c: add structure for buddy handles")
Signed-off-by: Vitaly Wool <vitaly.vul@sony.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-01 15:51:31 -07:00