alistair23-linux/mm
Andy Whitcroft 84afd99b83 hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits
When a hugetlb mapping with a reservation is split, a new VMA is cloned
from the original.  This new VMA is a direct copy of the original
including the reservation count.  When this pair of VMAs is unmapped we
will incorrectly double account the unused reservation and the overall
reservation count will be incorrect; in extreme cases it will wrap.

The problem occurs when we split an existing VMA, say to unmap a page in
the middle.  split_vma() will create a new VMA copying all fields from the
original.  As we are storing our reservation count in vm_private_data this
is also copied, endowing the new VMA with a duplicate of the original
VMA's reservation.  Neither of the new VMAs can exhaust these reservations
as they are too small, but when we unmap and close these VMAs we will
incorrectly credit the remainder twice and resv_huge_pages will become out
of sync.  This can lead to allocation failures on mappings with
reservations and even to resv_huge_pages wrapping which prevents all
subsequent hugepage allocations.
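
The double accounting can be illustrated with a small userspace model.
This is only a sketch and not the kernel code: every dup_* name is
hypothetical, the counter stands in for resv_huge_pages, and private_resv
stands in for the count held in vm_private_data.

```c
#include <stdio.h>

/* Toy userspace model; all dup_* names are hypothetical, not kernel APIs. */
struct dup_vma {
	unsigned long start, end;	/* range, in huge pages */
	unsigned long private_resv;	/* unused reservation, as vm_private_data held it */
};

static unsigned long dup_resv_huge_pages;	/* stands in for resv_huge_pages */

/* Map a range and reserve one huge page per page of the mapping. */
static struct dup_vma dup_mmap(unsigned long start, unsigned long end)
{
	struct dup_vma v = { start, end, end - start };

	dup_resv_huge_pages += v.private_resv;
	return v;
}

/* Split at 'at': the new VMA copies every field, reservation included. */
static struct dup_vma dup_split(struct dup_vma *orig, unsigned long at)
{
	struct dup_vma new = *orig;

	new.start = at;
	orig->end = at;
	return new;
}

/* Close: hand the VMA's idea of its unused reservation back to the pool. */
static void dup_close(struct dup_vma *v)
{
	dup_resv_huge_pages -= v->private_resv;
	v->private_resv = 0;
}

/* Reserve 4 pages, split in the middle, close both halves. */
static unsigned long dup_buggy_scenario(void)
{
	struct dup_vma a, b;

	dup_resv_huge_pages = 0;
	a = dup_mmap(0, 4);
	b = dup_split(&a, 2);
	dup_close(&a);		/* returns 4 */
	dup_close(&b);		/* returns the same 4 again */
	return dup_resv_huge_pages;
}
```

In this model the counter ends four below zero, which for an unsigned
long is a wrapped, very large value.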

The simple fix would be to correctly apportion the remaining reservation
count when the split is made.  However the only hook we have, vm_ops->open,
only has the new VMA; we do not know the identity of the preceding VMA.
Also even if we did have that VMA to hand we do not know how much of the
reservation was consumed each side of the split.

This patch therefore takes a different tack.  We know that the whole of
any private mapping (which has a reservation) has a reservation over its
whole size.  Any present pages represent consumed reservation.  Therefore
if we track the instantiated pages we can calculate the remaining
reservation.

This patch reuses the existing regions code to track the regions for which
we have consumed reservation (i.e. the instantiated pages); as each page
is faulted in we record the consumption of reservation for the new page.
When we need to return unused reservations at unmap time we simply count
the consumed reservation region, subtracting that from the whole of the
map.  During a VMA split the newly opened VMA will point to the same
region map; as this map is offset oriented it remains valid for both of
the split VMAs.  This map is reference counted so that it is removed when
all VMAs which are part of the mmap are gone.
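
The approach can be sketched in userspace as follows.  This is a
simplified model under stated assumptions, not the patch itself: all
trk_* names are hypothetical, and a fixed-size per-page array stands in
for the kernel's regions code.  What it keeps from the description above
is the offset-oriented, reference-counted map shared by the split VMAs,
and the unused reservation computed as the VMA size minus the consumed
pages in its range.

```c
#include <stdlib.h>

#define TRK_MAX_PAGES 64	/* arbitrary bound for this toy model */

/* Shared, reference-counted record of consumed reservation (hypothetical). */
struct trk_map {
	int refs;
	unsigned char consumed[TRK_MAX_PAGES];	/* indexed by page offset */
};

struct trk_vma {
	unsigned long start, end;	/* range, in huge pages */
	struct trk_map *map;
};

static unsigned long trk_resv_huge_pages;	/* stands in for resv_huge_pages */

static struct trk_vma trk_mmap(unsigned long start, unsigned long end)
{
	struct trk_vma v = { start, end, NULL };

	if (start > end || end > TRK_MAX_PAGES)
		abort();
	v.map = calloc(1, sizeof(*v.map));
	if (!v.map)
		abort();
	v.map->refs = 1;
	trk_resv_huge_pages += end - start;
	return v;
}

/* Fault a page in: record it as consumed and use up one reserved page. */
static void trk_fault(struct trk_vma *v, unsigned long page)
{
	if (page < v->start || page >= v->end || v->map->consumed[page])
		return;
	v->map->consumed[page] = 1;
	trk_resv_huge_pages--;
}

/* Split at 'at': the new VMA shares the map and takes a reference. */
static struct trk_vma trk_split(struct trk_vma *orig, unsigned long at)
{
	struct trk_vma new = *orig;

	new.start = at;
	orig->end = at;
	new.map->refs++;
	return new;
}

/* Unused reservation = size of this VMA minus consumed pages within it. */
static unsigned long trk_unused(const struct trk_vma *v)
{
	unsigned long p, used = 0;

	for (p = v->start; p < v->end; p++)
		used += v->map->consumed[p];
	return (v->end - v->start) - used;
}

/* Close: return only this VMA's unused share, then drop the map reference. */
static void trk_close(struct trk_vma *v)
{
	trk_resv_huge_pages -= trk_unused(v);
	if (--v->map->refs == 0)
		free(v->map);
	v->map = NULL;
}

/* Reserve 4 pages, fault one, split in the middle, close both halves. */
static unsigned long trk_scenario(void)
{
	struct trk_vma a, b;

	trk_resv_huge_pages = 0;
	a = trk_mmap(0, 4);
	trk_fault(&a, 1);
	b = trk_split(&a, 2);
	trk_close(&a);
	trk_close(&b);
	return trk_resv_huge_pages;
}
```

Because the map is indexed by offset rather than by position within a
VMA, neither half needs to know about the other: each counts only the
consumed pages that fall inside its own range.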

Thanks to Adam Litke and Mel Gorman for their review feedback.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 10:47:16 -07:00
allocpercpu.c Merge commit 'v2.6.26-rc9' into cpus4096 2008-07-06 14:23:39 +02:00
backing-dev.c mm: bdi: fix race in bdi_class device creation 2008-05-20 13:31:53 -07:00
bootmem.c mm: unexport __alloc_bootmem_core() 2008-07-24 10:47:14 -07:00
bounce.c
dmapool.c dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too 2008-04-28 08:58:20 -07:00
fadvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
filemap.c kill generic_file_direct_IO() 2008-07-24 10:47:14 -07:00
filemap_xip.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
fremap.c mm: fix various kernel-doc comments 2008-03-19 18:53:35 -07:00
highmem.c highmem: Export totalhigh_pages. 2008-07-19 22:39:46 -07:00
hugetlb.c hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits 2008-07-24 10:47:16 -07:00
internal.h mm: remove double indirection on tlb parameter to free_pgd_range() & Co 2008-07-24 10:47:15 -07:00
Kconfig Merge git://git.kernel.org/pub/scm/linux/kernel/git/hskinnemoen/avr32-2.6 2008-07-14 13:37:29 -07:00
maccess.c kgdb: fix optional arch functions and probe_kernel_* 2008-04-17 20:05:39 +02:00
madvise.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
Makefile mm: add a basic debugging framework for memory initialisation 2008-07-24 10:47:13 -07:00
memcontrol.c memcg: simple stats for memory resource controller 2008-05-01 08:04:02 -07:00
memory.c hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed 2008-07-24 10:47:16 -07:00
memory_hotplug.c mm: drop unneeded pgdat argument from free_area_init_node() 2008-07-24 10:47:16 -07:00
mempolicy.c mempolicy: mask off internal flags for userspace API 2008-07-04 13:03:05 -07:00
mempool.c
migrate.c mm/migrate.c should #include <linux/syscalls.h> 2008-07-24 10:47:14 -07:00
mincore.c mm: remove nopage 2008-04-28 08:58:18 -07:00
mlock.c
mm_init.c mm: print out the zonelists on request for manual verification 2008-07-24 10:47:14 -07:00
mmap.c mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations 2008-07-24 10:47:16 -07:00
mmzone.c mm: filter based on a nodemask as well as a gfp_mask 2008-04-28 08:58:19 -07:00
mprotect.c mm: record MAP_NORESERVE status on vmas and fix small page mprotect reservations 2008-07-24 10:47:16 -07:00
mremap.c
msync.c
nommu.c nommu: Correct kobjsize() page validity checks. 2008-06-12 07:56:17 -07:00
oom_kill.c oom_kill: remove unused parameter in badness() 2008-04-28 08:58:26 -07:00
page-writeback.c Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2008-07-15 08:36:38 -07:00
page_alloc.c mm: drop unneeded pgdat argument from free_area_init_node() 2008-07-24 10:47:16 -07:00
page_io.c
page_isolation.c
pagewalk.c pagemap: pass mm into pagewalkers 2008-06-12 18:05:41 -07:00
pdflush.c mm/pdflush.c: merge the same code in two path 2008-05-13 08:02:24 -07:00
prio_tree.c
quicklist.c
readahead.c mm: bdi: export BDI attributes in sysfs 2008-04-30 08:29:49 -07:00
rmap.c mm: remove nopage 2008-04-28 08:58:18 -07:00
shmem.c mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
shmem_acl.c
slab.c Merge branch 'generic-ipi' into generic-ipi-for-linus 2008-07-15 21:55:59 +02:00
slob.c slob: record page flag overlays explicitly 2008-07-24 10:47:15 -07:00
slub.c slub: record page flag overlays explicitly 2008-07-24 10:47:15 -07:00
sparse-vmemmap.c Christoph has moved 2008-07-04 10:40:04 -07:00
sparse.c mm: make defensive checks around PFN values registered for memory usage 2008-07-24 10:47:13 -07:00
swap.c mm: fix atomic_t overflow in vm 2008-05-24 09:56:09 -07:00
swap_state.c mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
swapfile.c mm: use non-racy method for /proc/swaps creation 2008-04-29 08:06:20 -07:00
thrash.c
tiny-shmem.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2008-03-25 08:57:47 -07:00
truncate.c fix invalidate_inode_pages2_range() to not clear ret 2008-04-28 08:58:18 -07:00
util.c
vmalloc.c docbook: fix vmalloc missing parameter notation 2008-05-01 08:03:59 -07:00
vmscan.c mm: fix incorrect variable type in do_try_to_free_pages() 2008-06-12 18:05:39 -07:00
vmstat.c mm/vmstat.c: proper externs 2008-07-24 10:47:14 -07:00