alistair23-linux/kernel
Nishanth Aravamudan d1c3fb1f8f hugetlb: introduce nr_overcommit_hugepages sysctl

While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease administration, and to help pave the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition

	nr_overcommit_hugepages > 0

indicates the same administrative setting as

	hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
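
The relationship between the two sysctls can be sketched as a small
userspace model. This is an illustration only, not the kernel code: the
struct and function names are hypothetical, and it assumes (consistent
with caveat 2 below) that nr_overcommit_hugepages caps the number of
surplus huge pages allocated on top of the static pool.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two sysctls; not kernel code. */
struct hugepage_pool {
	unsigned long nr_hugepages;            /* static pool size (low watermark) */
	unsigned long nr_overcommit_hugepages; /* assumed cap on surplus pages */
	unsigned long surplus_hugepages;       /* pages allocated beyond the static pool */
};

/* Replaces the old boolean: the dynamic pool is on iff the cap is non-zero. */
static bool dynamic_pool_enabled(const struct hugepage_pool *p)
{
	assert(p);
	return p->nr_overcommit_hugepages > 0;
}

/* Try to grow the pool by one surplus page; refuse once the cap is reached. */
static bool try_alloc_surplus(struct hugepage_pool *p)
{
	assert(p);
	if (p->surplus_hugepages >= p->nr_overcommit_hugepages)
		return false;
	p->surplus_hugepages++;
	return true;
}

/* Overall pool size: static pages plus any surplus pages. */
static unsigned long pool_size(const struct hugepage_pool *p)
{
	assert(p);
	return p->nr_hugepages + p->surplus_hugepages;
}
```

In this model, a pool with nr_hugepages = 4 and
nr_overcommit_hugepages = 2 accepts two surplus allocations and refuses
the third, while a pool with nr_overcommit_hugepages = 0 behaves like
hugetlb_dynamic_pool == 0 and never grows.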

A few caveats:

1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could
then try to grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.
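
Caveat 2 can be walked through with a simplified model. The accounting
below is an assumption made for illustration (in-use pages that no
longer fit in the shrunken static pool are counted as surplus, and
surplus pages are freed as soon as they are released); all names are
hypothetical and none of this is the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: surplus is derived as total minus the static target. */
struct pool_model {
	unsigned long total;         /* huge pages currently in the pool */
	unsigned long in_use;        /* pages currently in use */
	unsigned long nr_hugepages;  /* static pool target */
	unsigned long nr_overcommit; /* assumed cap on surplus pages */
};

static unsigned long surplus_pages(const struct pool_model *p)
{
	return p->total > p->nr_hugepages ? p->total - p->nr_hugepages : 0;
}

/* Resize the static pool; pages that are in use cannot be freed. */
static void set_nr_hugepages(struct pool_model *p, unsigned long count)
{
	unsigned long floor = count > p->in_use ? count : p->in_use;

	p->nr_hugepages = count;
	if (p->total > floor)
		p->total = floor;   /* shrink: release free pages only */
	if (p->total < count)
		p->total = count;   /* grow: allocate up to the new target */
	assert(p->in_use <= p->total);
}

/* Allocate one surplus page for immediate use, if the cap allows it. */
static bool model_alloc_surplus(struct pool_model *p)
{
	if (surplus_pages(p) >= p->nr_overcommit)
		return false;
	p->total++;
	p->in_use++;
	return true;
}

/* Release one in-use page; a surplus page is freed rather than kept. */
static void release_page(struct pool_model *p)
{
	assert(p->in_use > 0);
	if (surplus_pages(p) > 0)
		p->total--;
	p->in_use--;
}
```

Starting from four static pages, all in use, with an overcommit value of
two: two surplus allocations succeed, then shrinking nr_hugepages to two
leaves four surplus pages, which is above the overcommit value. Further
surplus allocations are refused until enough pages are released, or one
of the two values is raised.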

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:17 -08:00
irq __do_IRQ does not check IRQ_DISABLED when IRQ_PER_CPU is set 2007-11-14 18:45:43 -08:00
power hibernate: fix lockdep report 2007-11-14 18:45:43 -08:00
time clockevents: warn once when program_event() is called with negative expiry 2007-12-07 19:16:17 +01:00
.gitignore
acct.c sched: fix kernel/acct.c comment 2007-11-26 21:21:49 +01:00
audit.c [PATCH] audit: watching subtrees 2007-10-21 02:37:45 -04:00
audit.h [PATCH] audit: watching subtrees 2007-10-21 02:37:45 -04:00
audit_tree.c [PATCH] audit: watching subtrees 2007-10-21 02:37:45 -04:00
auditfilter.c [PATCH] audit: watching subtrees 2007-10-21 02:37:45 -04:00
auditsc.c auditsc: fix kernel-doc param warnings 2007-10-22 19:40:02 -07:00
capability.c Uninline find_pid etc set of functions 2007-10-19 11:53:41 -07:00
cgroup.c Improve cgroup printks 2007-11-14 18:45:37 -08:00
cgroup_debug.c Task Control Groups: simple task cgroup debug info subsystem 2007-10-19 11:53:36 -07:00
compat.c Merge ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt 2007-10-18 15:12:41 -07:00
configs.c
cpu.c CPU HOTPLUG: avoid hotadd when proper possible_map isn't specified 2007-10-19 11:53:44 -07:00
cpuset.c hotplug cpu: migrate a task within its cpuset 2007-10-19 11:53:44 -07:00
delayacct.c Add scaled time to taskstats based process accounting 2007-10-18 14:37:28 -07:00
dma.c whitespace fixes: DMA channel allocator 2007-10-18 14:37:24 -07:00
exec_domain.c whitespace fixes: execution domains 2007-10-18 14:37:26 -07:00
exit.c wait_task_stopped(): pass correct exit_code to wait_noreap_copyout() 2007-11-29 09:24:55 -08:00
extable.c
fork.c fix clone(CLONE_NEWPID) 2007-12-05 09:21:18 -08:00
futex.c futex: correctly return -EFAULT not -EINVAL 2007-12-05 15:46:09 +01:00
futex_compat.c [FUTEX] Fix address computation in compat code. 2007-11-09 16:13:08 -08:00
hrtimer.c hrtimers: avoid overflow for large relative timeouts 2007-12-07 19:16:17 +01:00
itimer.c whitespace fixes: interval timers 2007-10-18 14:37:26 -07:00
kallsyms.c FRV: fix the extern declaration of kallsyms_num_syms 2007-11-29 09:24:54 -08:00
Kconfig.hz
Kconfig.instrumentation Tiny clean-up of OPROFILE/KPROBES configuration 2007-12-06 09:41:12 -08:00
Kconfig.preempt Move PREEMPT_NOTIFIERS into an always-included Kconfig 2007-10-17 08:42:55 -07:00
kexec.c Extended crashkernel command line 2007-10-19 11:53:49 -07:00
kfifo.c
kmod.c
kprobes.c kprobes: support kretprobe blacklist 2007-10-16 09:43:10 -07:00
ksysfs.c add-vmcore: cleanup the coding style according to Andrew's comments 2007-10-17 08:42:54 -07:00
kthread.c
latency.c
lockdep.c lockdep: make cli/sti annotation warnings clearer 2007-12-07 19:02:47 +01:00
lockdep_internals.h
lockdep_proc.c
Makefile revert "Task Control Groups: example CPU accounting subsystem" 2007-11-14 18:45:40 -08:00
marker.c Linux Kernel Markers: fix marker mutex not taken upon module load 2007-11-14 18:45:40 -08:00
module.c module: fix and elaborate comments 2007-11-19 11:20:43 +11:00
mutex-debug.c
mutex-debug.h
mutex.c
mutex.h
notifier.c Add kernel/notifier.c 2007-10-19 11:53:34 -07:00
ns_cgroup.c cgroups: implement namespace tracking subsystem 2007-10-19 11:53:37 -07:00
nsproxy.c pid namespaces: allow cloning of new namespace 2007-10-19 11:53:39 -07:00
panic.c trivial comment wording/typo fix regarding taint flags 2007-10-20 00:30:06 +02:00
params.c fix param_sysfs_builtin name length check 2007-11-14 18:45:42 -08:00
pid.c pidns: Place under CONFIG_EXPERIMENTAL 2007-11-14 18:45:43 -08:00
posix-cpu-timers.c Isolate some explicit usage of task->tgid 2007-10-19 11:53:40 -07:00
posix-timers.c Isolate some explicit usage of task->tgid 2007-10-19 11:53:40 -07:00
printk.c serial: turn serial console suspend a boot rather than compile time option 2007-10-18 14:37:19 -07:00
profile.c sched: document profile=sleep requiring CONFIG_SCHEDSTATS 2007-10-24 18:23:50 +02:00
ptrace.c Isolate some explicit usage of task->tgid 2007-10-19 11:53:40 -07:00
rcupdate.c Clean up duplicate includes in kernel/ 2007-10-17 08:42:48 -07:00
rcutorture.c Make rcutorture RNG use temporal entropy 2007-10-17 08:42:53 -07:00
relay.c whitespace fixes: relayfs 2007-10-18 14:37:24 -07:00
resource.c Add IORESOUCE_BUSY flag for System RAM 2007-11-14 18:45:39 -08:00
rtmutex-debug.c Use helpers to obtain task pid in printks 2007-10-19 11:53:43 -07:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c Use helpers to obtain task pid in printks 2007-10-19 11:53:43 -07:00
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c sched: enable early use of sched_clock() 2007-12-07 19:02:47 +01:00
sched_debug.c sched: clean up overlong line in kernel/sched_debug.c 2007-11-28 15:52:56 +01:00
sched_fair.c sched: default to more agressive yield for SCHED_BATCH tasks 2007-12-04 17:04:39 +01:00
sched_idletask.c sched: isolate SMP balancing code a bit more 2007-10-24 18:23:51 +02:00
sched_rt.c sched: cpu accounting controller (V2) 2007-12-02 20:04:49 +01:00
sched_stats.h sched: clean up kernel/sched_stat.h 2007-11-28 15:52:56 +01:00
seccomp.c
signal.c sigwait eats blocked default-ignore signals 2007-11-12 16:05:23 -08:00
softirq.c
softlockup.c Use helpers to obtain task pid in printks 2007-10-19 11:53:43 -07:00
spinlock.c
srcu.c
stacktrace.c
stop_machine.c
sys.c x86: ignore the sys_getcpu() tcache parameter 2007-11-17 16:27:00 +01:00
sys_ni.c [COMPAT]: Fix build on COMPAT platforms when CONFIG_NET is disabled. 2007-10-30 21:29:56 -07:00
sysctl.c hugetlb: introduce nr_overcommit_hugepages sysctl 2007-12-17 19:28:17 -08:00
sysctl_check.c [SYSCTL_CHECK]: Fix typo in KERN_SPARC_SCONS_PWROFF entry string. 2007-12-05 05:37:56 -08:00
taskstats.c kernel/taskstats.c: fix bogus nlmsg_free() 2007-11-14 18:45:44 -08:00
time.c whitespace fixes: time syscalls 2007-10-18 14:37:24 -07:00
timer.c sched: restore deterministic CPU accounting on powerpc 2007-11-09 22:39:38 +01:00
tsacct.c Add scaled time to taskstats based process accounting 2007-10-18 14:37:28 -07:00
uid16.c
user.c sched: don't forget to unlock uids_mutex on error paths 2007-11-26 21:21:49 +01:00
user_namespace.c
utsname.c
utsname_sysctl.c Isolate the UTS namespace's domainname and hostname back 2007-11-29 09:24:53 -08:00
wait.c
workqueue.c Use helpers to obtain task pid in printks 2007-10-19 11:53:43 -07:00