alistair23-linux/mm
Nishanth Aravamudan 63b4613c3f hugetlb: fix hugepage allocation with memoryless nodes
Anton found a problem with the hugetlb pool allocation when some nodes have
no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2).  Lee worked
on versions that tried to fix it, but none were accepted.  Christoph has
created a set of patches which allow for GFP_THISNODE allocations to fail
if the node has no memory.

Currently, alloc_fresh_huge_page() returns NULL when it is not able to
allocate a huge page on the current node, as specified by its custom
interleave variable.  The callers of this function, though, assume that a
failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
on the system period.  This might not be the case, for instance, if we have
an uneven NUMA system, and we happen to try to allocate a hugepage on a
node with less memory and fail, while there is still plenty of free memory
on the other nodes.

To correct this, make alloc_fresh_huge_page() search through all online
nodes before deciding no hugepages can be allocated.  Add a helper function
for actually allocating the hugepage.  Use a new global nid iterator to
control which nid to allocate on.

Note: we expect particular semantics for __GFP_THISNODE, which are now
enforced even for memoryless nodes.  That is, there is should be no
fallback to other nodes.  Therefore, we rely on the nid passed into
alloc_pages_node() to be the nid the page comes from.  If this is
incorrect, accounting will break.

Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
memoryless nodes).

Before on the ppc64 box:
Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 0 HugePages_Free:     25
Node 1 HugePages_Free:     75
Node 2 HugePages_Free:      0
Node 3 HugePages_Free:      0
Done. Initially     100 free
Trying to resize the pool to 200
Node 0 HugePages_Free:     50
Node 1 HugePages_Free:    150
Node 2 HugePages_Free:      0
Node 3 HugePages_Free:      0
Done.     200 free

After:
Trying to clear the hugetlb pool
Done.       0 free
Trying to resize the pool to 100
Node 0 HugePages_Free:     50
Node 1 HugePages_Free:     50
Node 2 HugePages_Free:      0
Node 3 HugePages_Free:      0
Done. Initially     100 free
Trying to resize the pool to 200
Node 0 HugePages_Free:    100
Node 1 HugePages_Free:    100
Node 2 HugePages_Free:      0
Node 3 HugePages_Free:      0
Done.     200 free

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:03 -07:00
..
allocpercpu.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
backing-dev.c remove mm/backing-dev.c:congestion_wait_interruptible() 2007-07-16 09:05:52 -07:00
bootmem.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
bounce.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
fadvise.c [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path 2006-12-08 08:28:43 -08:00
filemap.c fs: remove some AOP_TRUNCATED_PAGE 2007-10-16 09:42:58 -07:00
filemap_xip.c mm: write iovec cleanup 2007-10-16 09:42:54 -07:00
fremap.c fix VM_CAN_NONLINEAR check in sys_remap_file_pages 2007-10-08 12:58:14 -07:00
highmem.c Create the ZONE_MOVABLE zone 2007-07-17 10:22:59 -07:00
hugetlb.c hugetlb: fix hugepage allocation with memoryless nodes 2007-10-16 09:43:03 -07:00
internal.h Breakout page_order() to internal.h to avoid special knowledge of the buddy allocator 2007-10-16 09:43:01 -07:00
Kconfig memory unplug: page offline 2007-10-16 09:43:02 -07:00
madvise.c speed up madvise_need_mmap_write() usage 2007-07-16 09:05:36 -07:00
Makefile memory unplug: page isolation 2007-10-16 09:43:02 -07:00
memory.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
memory_hotplug.c fix memory hot remove not configured case. 2007-10-16 09:43:02 -07:00
mempolicy.c memoryless nodes: fixup uses of node_online_map in generic code 2007-10-16 09:42:59 -07:00
mempool.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
migrate.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
mincore.c [PATCH] mincore: vma crossing fix 2007-02-15 09:57:03 -08:00
mlock.c do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITY 2007-07-16 09:05:37 -07:00
mmap.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
mmzone.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
mprotect.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
mremap.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
msync.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
nommu.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
oom_kill.c Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly 2007-10-16 09:42:58 -07:00
page-writeback.c memoryless nodes: fixup uses of node_online_map in generic code 2007-10-16 09:42:59 -07:00
page_alloc.c memory unplug: page offline 2007-10-16 09:43:02 -07:00
page_io.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
page_isolation.c memory unplug: page isolation 2007-10-16 09:43:02 -07:00
pdflush.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
prio_tree.c
quicklist.c Quicklists for page table pages 2007-05-07 12:12:54 -07:00
readahead.c mm: buffered write cleanup 2007-10-16 09:42:54 -07:00
rmap.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
shmem.c Group short-lived and reclaimable kernel allocations 2007-10-16 09:43:00 -07:00
shmem_acl.c
slab.c Group short-lived and reclaimable kernel allocations 2007-10-16 09:43:00 -07:00
slob.c Slab allocators: fail if ksize is called with a NULL parameter 2007-10-16 09:42:53 -07:00
slub.c slub: list_locations() can use GFP_TEMPORARY 2007-10-16 09:43:01 -07:00
sparse-vmemmap.c memory hotplug: Hot-add with sparsemem-vmemmap 2007-10-16 09:43:02 -07:00
sparse.c memory hotplug: Hot-add with sparsemem-vmemmap 2007-10-16 09:43:02 -07:00
swap.c mm: use pagevec to rotate reclaimable page 2007-10-16 09:42:54 -07:00
swap_state.c mm: clarify __add_to_swap_cache locking 2007-10-16 09:42:53 -07:00
swapfile.c Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION 2007-07-29 16:45:38 -07:00
thrash.c Bug in mm/thrash.c function grab_swap_token() 2007-05-11 08:29:32 -07:00
tiny-shmem.c [PATCH] mm/{,tiny-}shmem.c cleanups 2007-03-01 14:53:35 -08:00
truncate.c mm: merge populate and nopage into fault (fixes nonlinear) 2007-07-19 10:04:41 -07:00
util.c Slab allocators: fail if ksize is called with a NULL parameter 2007-10-16 09:42:53 -07:00
vmalloc.c Categorize GFP flags 2007-10-16 09:42:59 -07:00
vmscan.c make swappiness safer to use 2007-10-16 09:42:59 -07:00
vmstat.c Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo 2007-10-16 09:43:00 -07:00