1
0
Fork 0
remarkable-linux/mm
Michel Lespinasse 940e7da516 mm: remap_file_pages() fixes
We have many vma manipulation functions that are fast in the typical
case, but can optionally be instructed to populate an unbounded number
of ptes within the region they work on:

 - mmap with MAP_POPULATE or MAP_LOCKED flags;
 - remap_file_pages() with MAP_NONBLOCK not set or when working on a
   VM_LOCKED vma;
 - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
   effect;
 - brk() when mlock(MCL_FUTURE) is in effect.

Current code handles these pte operations locally, while the
sourrounding code has to hold the mmap_sem write side since it's
manipulating vmas.  This means we're doing an unbounded amount of pte
population work with mmap_sem held, and this causes problems as Andy
Lutomirski reported (we've hit this at Google as well, though it's not
entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
first place).

I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released.  mm_populate()
does need to acquire the mmap_sem read side, but critically, it doesn't
need to hold it continuously for the entire duration of the operation -
it can drop it whenever things take too long (such as when hitting disk
for a file read) and re-acquire it later on.

The following patches are included

- Patches 1 fixes some issues I noticed while working on the existing code.
  If needed, they could potentially go in before the rest of the patches.

- Patch 2 introduces the new mm_populate() function and changes
  mmap_region() call sites to use it after they drop mmap_sem. This is
  inspired from Andy Lutomirski's proposal and is built as an extension
  of the work I had previously done for mlock() and mlockall() around
  v2.6.38-rc1. I had tried doing something similar at the time but had
  given up as there were so many do_mmap() call sites; the recent cleanups
  by Linus and Viro are a tremendous help here.

- Patches 3-5 convert some of the less-obvious places doing unbounded
  pte populates to the new mm_populate() mechanism.

- Patches 6-7 are code cleanups that are made possible by the
  mm_populate() work. In particular, they remove more code than the
  entire patch series added, which should be a good thing :)

- Patch 8 is optional to this entire series. It only helps to deal more
  nicely with racy userspace programs that might modify their mappings
  while we're trying to populate them. It adds a new VM_POPULATE flag
  on the mappings we do want to populate, so that if userspace replaces
  them with mappings it doesn't want populated, mm_populate() won't
  populate those replacement mappings.

This patch:

Assorted small fixes. The first two are quite small:

- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
  within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
  Purely cosmetic.

- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
  range, make sure we own the mmap_sem write lock around the
  munlock_vma_pages_range call as this manipulates the vma's vm_flags.

Last fix requires a longer explanation. remap_file_pages() can do its work
either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when did they fault in the newly mapped file pages:

- In the VM_NONLINEAR case, new file pages would be populated if
  the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
  new file pages wouldn't be populated even if the vma is already
  marked as VM_LOCKED.

- In the linear (emulated) case, the work is passed to the mmap_region()
  function which would populate the pages if the vma is marked as
  VM_LOCKED, and would not otherwise - regardless of the value of the
  MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
  mmap_region().

The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Andy Lutomirski <luto@amacapital.net>
Cc: Greg Ungerer <gregungerer@westnet.com.au>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:10 -08:00
..
Kconfig Merge branch 'akpm' (incoming from Andrew) 2013-02-21 17:38:49 -08:00
Kconfig.debug mm: more intensive memory corruption debugging 2012-01-10 16:30:42 -08:00
Makefile mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
backing-dev.c bdi: allow block devices to say that they require stable page writes 2013-02-21 17:22:19 -08:00
balloon_compaction.c mm: introduce a common interface for balloon pages mobility 2012-12-11 17:22:26 -08:00
bootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
bounce.c block: optionally snapshot page contents to provide stable pages during write 2013-02-21 17:22:20 -08:00
cleancache.c ->encode_fh() API change 2012-05-29 23:28:33 -04:00
compaction.c mm: compaction: make __compact_pgdat() and compact_pgdat() return void 2013-02-23 17:50:10 -08:00
debug-pagealloc.c mm, x86: Remove debug_pagealloc_enabled 2011-12-06 09:24:07 +01:00
dmapool.c dmapool: make DMAPOOL_DEBUG detect corruption of free marker 2012-12-11 17:22:24 -08:00
fadvise.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
failslab.c switch debugfs to umode_t 2012-01-03 22:54:56 -05:00
filemap.c mm: only enforce stable page writes if the backing device requires it 2013-02-21 17:22:19 -08:00
filemap_xip.c mm: move all mmu notifier invocations to be done outside the PT lock 2012-10-09 16:22:58 +09:00
fremap.c mm: remap_file_pages() fixes 2013-02-23 17:50:10 -08:00
frontswap.c frontswap: support exclusive gets if tmem backend is capable 2012-09-21 10:38:12 -04:00
highmem.c Some nice cleanups, and even a patch my wife did as a "live" demo for 2012-12-20 08:37:05 -08:00
huge_memory.c mm/huge_memory.c: use new hashtable implementation 2013-02-23 17:50:10 -08:00
hugetlb.c mm/hugetlb.c: convert to pr_foo() 2013-02-23 17:50:09 -08:00
hugetlb_cgroup.c mm/hugetlb: create hugetlb cgroup file in hugetlb_init 2012-12-18 15:02:15 -08:00
hwpoison-inject.c memcg: rename config variables 2012-07-31 18:42:43 -07:00
init-mm.c atomic: use <linux/atomic.h> 2011-07-26 16:49:47 -07:00
internal.h mm: compaction: partially revert capture of suitable high-order page 2013-01-11 14:54:56 -08:00
interval_tree.c mm: add CONFIG_DEBUG_VM_RB build option 2012-10-09 16:22:42 +09:00
kmemcheck.c kmemcheck: Fix build errors due to missing slab.h 2010-03-30 22:02:32 +09:00
kmemleak-test.c kmemleak: remove memset by using kzalloc 2011-01-27 18:31:51 +00:00
kmemleak.c mm/kmemleak.c: remove obsolete simple_strtoul 2012-12-18 15:02:15 -08:00
ksm.c mm/ksm.c: use new hashtable implementation 2013-02-23 17:50:10 -08:00
maccess.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
madvise.c mm: make madvise(MADV_WILLNEED) support swap file prefetch 2013-02-23 17:50:10 -08:00
memblock.c memblock: Add memblock_mem_size() 2013-01-29 19:32:57 -08:00
memcontrol.c mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo() 2013-02-23 17:50:09 -08:00
memory-failure.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00
memory.c mm: reduce rmap overhead for ex-KSM page copies created on swap faults 2013-02-23 17:50:09 -08:00
memory_hotplug.c mm/memory_hotplug.c: improve comments 2012-12-18 15:02:15 -08:00
mempolicy.c mm: mempolicy: Convert shared_policy mutex to spinlock 2013-01-02 17:32:13 -08:00
mempool.c mempool: add @gfp_mask to mempool_create_node() 2012-06-25 11:53:47 +02:00
migrate.c mm/hugetlb: set PTE as huge in hugetlb_change_protection and remove_migration_pte 2013-02-05 20:38:47 +11:00
mincore.c mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode 2012-03-21 17:54:54 -07:00
mlock.c mm: don't overwrite mm->def_flags in do_mlockall() 2013-02-12 14:34:00 -08:00
mm_init.c mm: Map most files to use export.h instead of module.h 2011-10-31 09:20:12 -04:00
mmap.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2013-02-19 18:19:48 -08:00
mmu_context.c mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() 2012-03-21 17:54:59 -07:00
mmu_notifier.c mm/mmu_notifier: allocate mmu_notifier in advance 2012-10-25 14:37:53 -07:00
mmzone.c memcg: fix hotplugged memory zone oops 2012-11-16 14:33:04 -08:00
mprotect.c mm/mprotect.c: coding-style cleanups 2012-12-18 15:02:15 -08:00
mremap.c sched: Move sched.h sysctl bits into separate header 2013-02-07 20:50:54 +01:00
msync.c sanitize vfs_fsync calling conventions 2010-05-21 18:31:21 -04:00
nobootmem.c mm: Add alloc_bootmem_low_pages_nopanic() 2013-01-29 19:32:59 -08:00
nommu.c sched: Move sched.h sysctl bits into separate header 2013-02-07 20:50:54 +01:00
oom_kill.c memcg, oom: provide more precise dump info while memcg oom happening 2013-02-23 17:50:08 -08:00
page-writeback.c block: optionally snapshot page contents to provide stable pages during write 2013-02-21 17:22:20 -08:00
page_alloc.c mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long 2013-02-23 17:50:10 -08:00
page_cgroup.c memcontrol: use N_MEMORY instead N_HIGH_MEMORY 2012-12-12 17:38:32 -08:00
page_io.c mm: add support for direct_IO to highmem pages 2012-07-31 18:42:47 -07:00
page_isolation.c mm: fix zone_watermark_ok_safe() accounting of isolated pages 2013-01-04 16:11:46 -08:00
pagewalk.c thp: change split_huge_page_pmd() interface 2012-12-12 17:38:31 -08:00
percpu-km.c percpu: clear memory allocated with the km allocator 2010-10-02 10:28:42 +03:00
percpu-vm.c mm: fix kernel-doc warnings 2012-06-20 14:39:36 -07:00
percpu.c mm, percpu: Make sure percpu_alloc early parameter has an argument 2012-12-02 06:23:04 -08:00
pgtable-generic.c mm: Only flush the TLB when clearing an accessible pte 2012-12-11 14:28:34 +00:00
process_vm_access.c aio/vfs: cleanup of rw_copy_check_uvector() and compat_rw_copy_check_uvector() 2012-05-31 17:49:32 -07:00
quicklist.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
readahead.c switch simple cases of fget_light to fdget 2012-09-26 22:20:08 -04:00
rmap.c s390/mm: implement software dirty bits 2013-02-14 15:55:23 +01:00
shmem.c mempolicy: remove arg from mpol_parse_str, mpol_to_str 2013-01-02 09:27:10 -08:00
slab.c memcg: add comments clarifying aspects of cache attribute propagation 2012-12-18 15:02:15 -08:00
slab.h slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slab_common.c slab: propagate tunable values 2012-12-18 15:02:14 -08:00
slob.c sl[au]b: always get the cache from its page in kmem_cache_free() 2012-12-18 15:02:14 -08:00
slub.c slub: drop mutex before deleting sysfs entry 2012-12-18 15:02:15 -08:00
sparse-vmemmap.c mm: delete various needless include <linux/module.h> 2011-10-31 09:20:11 -04:00
sparse.c memory-hotplug, mm/sparse.c: clear the memory to store struct page 2012-12-11 17:22:23 -08:00
swap.c mm: remove vma arg from page_evictable 2012-10-09 16:22:55 +09:00
swap_state.c mm: add support for a filesystem to activate swap files and use direct_IO for writing swap pages 2012-07-31 18:42:47 -07:00
swapfile.c mm, oom: fix race when specifying a thread as the oom origin 2012-12-11 17:22:27 -08:00
truncate.c mm: drop vmtruncate 2012-12-20 18:46:29 -05:00
util.c Merge branch 'master' into for-next 2012-10-28 19:29:19 +01:00
vmalloc.c mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILD 2012-12-11 17:22:22 -08:00
vmscan.c mm: avoid calling pgdat_balanced() needlessly 2013-02-23 17:50:10 -08:00
vmstat.c Automatic NUMA Balancing V11 2012-12-16 15:18:08 -08:00