1
0
Fork 0
Commit Graph

2754 Commits (827afdf093cce59bbbf9dc9f1c2eab86e2232b9e)

Author SHA1 Message Date
Avi Kivity e98c320291 Move PREEMPT_NOTIFIERS into an always-included Kconfig
Kconfig.preempt is not included on some archs (for example, m68k).  On those
archs, the Kconfig machinery complains that KVM selects an undefined symbol
PREEMPT_NOTIFIERS (which lives in Kconfig.preempt).

So move the offending symbol into a Kconfig file which is included by
everyone.

Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Alexey Dobriyan 42b2dd0a02 Shrink task_struct if CONFIG_FUTEX=n
robust_list, compat_robust_list, pi_state_list, pi_state_cache are
really used if futexes are on.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:55 -07:00
Ken'ichi Ohmichi bcbba6c10e add-vmcore: add a prefix "VMCOREINFO_" to the vmcoreinfo macros
Add a prefix "VMCOREINFO_" to the vmcoreinfo macros.  Old vmcoreinfo macros
were defined as generic names SYMBOL/SIZE/OFFSET /LENGTH/CONFIG, and it is
impossible to grep for them.  So these names should be changed.  This
discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.1/0415.html

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi 6cfa062f01 add-vmcore: add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data
[2/3] Add nodemask_t's size and NR_FREE_PAGES's value to vmcoreinfo_data.
  The dump filetering command 'makedumpfile'(v1.1.6 or before) had assumed
  the above values, and it was not good from the reliability viewpoint.
  So makedumpfile v1.2.0 came to need these values and I created the patch
  to let the kernel output them.
  makedumpfile site:
  https://sourceforge.net/projects/makedumpfile/

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi d768281e97 add-vmcore: cleanup the coding style according to Andrew's comments
[1/3] Cleanup the coding style according to Andrew's comments:
http://lists.infradead.org/pipermail/kexec/2007-August/000522.html
- vmcoreinfo_append_str() should have suitable __attribute__s so that
  the compiler can check its use.
- vmcoreinfo_max_size should have size_t.
- Use get_seconds() instead of xtime.tv_sec.
- Use init_uts_ns.name.release instead of UTS_RELEASE.

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Ken'ichi Ohmichi fd59d231f8 Add vmcoreinfo
This patch set frees the restriction that makedumpfile users should install a
vmlinux file (including the debugging information) into each system.

makedumpfile command is the dump filtering feature for kdump.  It creates a
small dumpfile by filtering unnecessary pages for the analysis.  To
distinguish unnecessary pages, it needs a vmlinux file including the debugging
information.  These days, the debugging package becomes a huge file, and it is
hard to install it into each system.

To solve the problem, kdump developers discussed it at lkml and kexec-ml.  As
the result, we reached the conclusion that necessary information for dump
filtering (called "vmcoreinfo") should be embedded into the first kernel file
and it should be accessed through /proc/vmcore during the second kernel.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.0/1806.html)

Dan Aloni created the patch set for the above implementation.
(http://www.uwsg.iu.edu/hypermail/linux/kernel/0707.1/1053.html)

And I updated it for multi architectures and memory models.
(http://lists.infradead.org/pipermail/kexec/2007-August/000479.html)

Signed-off-by: Dan Aloni <da-x@monatomic.org>
Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Signed-off-by: Bernhard Walle <bwalle@suse.de>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Oleg Nesterov 13fbcb7312 do_sigaction: don't worry about signal_pending()
do_sigaction() returns -ERESTARTNOINTR if signal_pending(). The comment says:

	* If there might be a fatal signal pending on multiple
	* threads, make sure we take it before changing the action.

I think this is not needed. We should only worry about SIGNAL_GROUP_EXIT case,
bit it implies a pending SIGKILL which can't be cleared by do_sigaction.

Kill this special case.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Oleg Nesterov 6db840fa78 exec: RT sub-thread can livelock and monopolize CPU on exec
de_thread() yields waiting for ->group_leader to be a zombie. This deadlocks
if an rt-prio execer shares the same cpu with ->group_leader. Change the code
to use ->group_exit_task/notify_count mechanics.

This patch certainly uglifies the code, perhaps someone can suggest something
better.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:54 -07:00
Paul E. McKenney c17ac85504 Make rcutorture RNG use temporal entropy
Repost of http://lkml.org/lkml/2007/8/10/472 made available by request.

The locking used by get_random_bytes() can conflict with the
preempt_disable() and synchronize_sched() form of RCU.  This patch changes
rcutorture's RNG to gather entropy from the new cpu_clock() interface
(relying on interrupts, preemption, daemons, and rcutorture's reader
thread's rock-bottom scheduling priority to provide useful entropy), and
also adds and EXPORT_SYMBOL_GPL() to make that interface available to GPLed
kernel modules such as rcutorture.

Passes several hours of rcutorture.

[ego@in.ibm.com: Use raw_smp_processor_id() in rcu_random()]
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
john stultz b2d9323d13 Use num_possible_cpus() instead of NR_CPUS for timer distribution
To avoid lock contention, we distribute the sched_timer calls across the
cpus so they do not trigger at the same instant.  However, I used NR_CPUS,
which can cause needless grouping on small smp systems depending on your
kernel config.  This patch converts to using num_possible_cpus() so we
spread it as evenly as possible on every machine.

Briefly tested w/ NR_CPUS=255 and verified reduced contention.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Adrian Bunk ba2a631b14 kernel/time/timekeeping.c: cleanups
- remove the no longer required __attribute__((weak)) of xtime_lock
- remove the following no longer used EXPORT_SYMBOL's:
  - xtime
  - xtime_lock

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:53 -07:00
Oleg Nesterov 715015e8da wait_task_stopped/continued: remove unneeded p->signal != NULL check
The child was found on ->children list under tasklist_lock, it must have a
valid ->signal. __exit_signal() both removes the task from parent->children
and clears ->signal "atomically" under write_lock(tasklist).

Remove unneeded checks.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:52 -07:00
Oleg Nesterov 18442cf28a __group_complete_signal: eliminate unneeded wakeup of ->group_exit_task
Cleanup.  __group_complete_signal() wakes up ->group_exit_task twice.  The
second wakeup's state includes TASK_UNINTERRUPTIBLE, which is not very
appropriate.

Change the code to pass the "correct" argument to signal_wake_up() and kill
now unneeded wake_up_process().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 442a10cf9e wait_task_zombie: don't fight with non-existing race with a dying ptracee
The "p->exit_signal == -1 && p->ptrace == 0" check and the comment are
bogus.  We already did exactly the same check in eligible_child(), we did
not drop tasklist_lock since then, and both variables need
write_lock(tasklist) to be changed.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov ebca4cda11 zap_other_threads: don't optimize thread_group_empty() case
Nowadays thread_group_empty() and next_thread() are simple list operations,
this optimization doesn't make sense: we are doing exactly same check one
line below.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 3ae4cbadf4 exit_notify: don't take tasklist for TIF_SIGPENDING re-targeting
->siglock provides enough protection to iterate over the thread group.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 2f4e6e2a81 wait_task_zombie: fix 2/3 races vs forget_original_parent()
Two threads, T1 and T2.  T2 ptraces P, and P is not a child of ptracer's
thread group.  P exits and goes to TASK_ZOMBIE.

T1 does wait_task_zombie(P):

	P->exit_state = TASK_DEAD;
	...
	read_unlock(&tasklist_lock);

					T2 does exit(), takes tasklist,
					forget_original_parent() does
					__ptrace_unlink(P) but doesn't
					call do_notify_parent(P) because
					p->exit_state == EXIT_DEAD.

Now, P is not visible to our process: __ptrace_unlink() removed it from
->children. We should send notification to P->parent and release P if and
only if SIGCHLD is ignored.

And we have 3 bugs:

1. P->parent does do_wait() and gets -ECHILD (P is on ->parent->children,
   but its state is TASK_DEAD).

2. // wait_task_zombie() continues

	if (put_user(...)) {
		// TODO: is this safe?
		p->exit_state = EXIT_ZOMBIE;
		return;
	}

   we return without notification/release, task_struct leaked.

   Solution: ignore -EFAULT and proceed. It is an application's bug if
   we can't fill infop/stat_addr (in case of VM_FAULT_OOM we have much
   more problems).

3. // wait_task_zombie() continues

	if (p->real_parent != p->parent) {
		// Not taken, it was untraced'ed
		...
	}

	release_task(p);

   we released the task which we shouldn't.

   Solution: check ->real_parent != ->parent before, under tasklist_lock,
   but use ptrace_unlink() instead of __ptrace_unlink() to check ->ptrace.

This patch hopefully solves 2 and 3, the 1st bug will be fixed later, we need
some cleanups in forget_original_parent/reparent_thread.

However, the first race is very unlikely and not critical, so I hope it makes
sense to fix 1 and 2 for now.

4. Small cleanup: don't "restore" EXIT_ZOMBIE unless we know we are not going
   to realease the child.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 407af46a96 wait_task_zombie: remove unneeded child->signal check
A zombie must have a valid ->signal, we are going to release it and
__exit_signal() starts with BUG_ON(!sig).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 84eb646b6e handle the multi-threaded init's exit() properly
With or without this patch, multi-threaded init's are not fully supported,
but do_exit() is completely wrong.  This becomes a real problem when we
support pid namespaces.

1. do_exit() panics when the main thread of /sbin/init exits. It should not
   until the whole thread group exits. Move the code below, under the
   "if (group_dead)" check.

   Note: this means that forget_original_parent() can use an already dead
   child_reaper()'s task_struct. This is OK for /sbin/init because

   	- do_wait() from alive sub-thread still can reap a zombie, we iterate
   	  over all sub-thread's ->children lists

   	- do_notify_parent() will wakeup some alive sub-thread because it sends
   	  the group-wide signal

   However, we should remove choose_new_parent()->BUG_ON(reaper->exit_state)
   for this.

2. We are playing games with ->nsproxy->pid_ns. This code is bogus today, and
   it has to be changed anyway when we really support pid namespaces, just
   remove it.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov 045f902de5 do_sigaction: remove now unneeded recalc_sigpending()
With the recent changes, do_sigaction()->recalc_sigpending_and_wake() can
never clear TIF_SIGPENDING. Instead, it can set this flag and wake up the
thread without any reason. Harmless, but unneeded and wastes CPU.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Oleg Nesterov d2ee7198cc pi-futex: set PF_EXITING without taking ->pi_lock
It is a bit annoying that do_exit() takes ->pi_lock to set PF_EXITING.  All
we need is to synchronize with lookup_pi_state() which saw this task
without PF_EXITING under ->pi_lock.

Change do_exit() to use spin_unlock_wait().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:51 -07:00
Mike Frysinger 0b15d04af3 printk: add interfaces for external access to the log buffer
Add two new functions for reading the kernel log buffer.  The intention is for
them to be used by recovery/dump/debug code so the kernel log can be easily
retrieved/parsed in a crash scenario, but they are generic enough for other
people to dream up other fun uses.

[akpm@linux-foundation.org: buncha fixes]
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: Greg Ungerer <gerg@snapgear.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Paul Mundt <lethal@linux-sh.org>
Acked-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
Adrian Bunk a4d63e729e kernel/rtmutex-debug.c: cleanups
This patch contains the following cleanups:
- make the needlessly global variable rt_trace_on static
- remove the unused global function deadlock_trace_off()

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
Roland McGrath 6d76013381 Add /sys/module/name/notes
This patch adds the /sys/module/<name>/notes/ magic directory, which has a
file for each allocated SHT_NOTE section that appears in <name>.ko.  This
is the counterpart for each module of /sys/kernel/notes for vmlinux.
Reading this delivers the contents of the module's SHT_NOTE sections.  This
lets userland easily glean any detailed information about that module's
build that was stored there at compile time (e.g.  by ld --build-id).

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:50 -07:00
David Woodhouse 1d99493b3a Fix CONFIG_DEBUG_SHIRQ trigger on free_irq()
Andy Gospodarek pointed out that because we return in the middle of the
free_irq() function, we never actually do call the IRQ handler that just
got deregistered. This should fix it, although I expect Andrew will want
to convert those 'return's to 'break'. That's a separate change though.

Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: Fernando Luis Vzquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Rusty Russell af49d9248f Remove "unsafe" from module struct
Adrian Bunk points out that "unsafe" was used to mark modules touched by
the deprecated MOD_INC_USE_COUNT interface, which has long gone.  It's time
to remove the member from the module structure, as well.

If you want a module which can't unload, don't register an exit function.

(Vlad Yasevich says SCTP is now safe to unload, so just remove the
__unsafe there).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Shannon Nelson <shannon.nelson@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Cc: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:49 -07:00
Avi Kivity bf020cb7b3 time: simplify smp_call_function_single() call sequence
smp_call_function_single() now knows how to call the function on the
current cpu.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Jesper Juhl a9022e9cb9 Clean up duplicate includes in kernel/
This patch cleans up duplicate includes in
	kernel/

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Satyam Sharma <ssatyam@cse.iitk.ac.in>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:48 -07:00
Alexey Dobriyan 040b5c6f95 SLAB_PANIC more (proc, posix-timers, shmem)
These aren't modular, so SLAB_PANIC is OK.

Signed-off-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ravikiran G Thirumalai c4f3b63fe1 softlockup: add a /proc tuning parameter
Control the trigger limit for softlockup warnings.  This is useful for
debugging softlockups, by lowering the softlockup_thresh to identify
possible softlockups earlier.

This patch:
1. Adds a sysctl softlockup_thresh with valid values of 1-60s
   (Higher value to disable false positives)
2. Changes the softlockup printk to print the cpu softlockup time

[akpm@linux-foundation.org: Fix various warnings and add definition of "two"]
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ingo Molnar a5f2ce3c60 softlockup watchdog: style cleanups
kernel/softirq.c grew a few style uncleanlinesses in the past few
months, clean that up. No functional changes:

   text    data     bss     dec     hex filename
   1126      76       4    1206     4b6 softlockup.o.before
   1129      76       4    1209     4b9 softlockup.o.after

( the 3 bytes .text increase is due to the "<1>" appended to one of
  the printk messages. )

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ingo Molnar 43581a1007 softlockup: improve debug output
Improve the debuggability of kernel lockups by enhancing the debug
output of the softlockup detector: print the task that causes the lockup
and try to print a more intelligent backtrace.

The old format was:

  BUG: soft lockup detected on CPU#1!
   [<c0105e4a>] show_trace_log_lvl+0x19/0x2e
   [<c0105f43>] show_trace+0x12/0x14
   [<c0105f59>] dump_stack+0x14/0x16
   [<c015f6bc>] softlockup_tick+0xbe/0xd0
   [<c013457d>] run_local_timers+0x12/0x14
   [<c01346b8>] update_process_times+0x3e/0x63
   [<c0145fb8>] tick_sched_timer+0x7c/0xc0
   [<c0140a75>] hrtimer_interrupt+0x135/0x1ba
   [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80
   [<c0105aa3>] apic_timer_interrupt+0x33/0x38
   [<c0104f8a>] syscall_call+0x7/0xb
   =======================

The new format is:

  BUG: soft lockup detected on CPU#1! [prctl:2363]

  Pid: 2363, comm:                prctl
  EIP: 0060:[<c013915f>] CPU: 1
  EIP is at sys_prctl+0x24/0x18c
   EFLAGS: 00000213    Not tainted  (2.6.22-cfs-v20 #26)
  EAX: 00000001 EBX: 000003e7 ECX: 00000001 EDX: f6df0000
  ESI: 000003e7 EDI: 000003e7 EBP: f6df0fb0 DS: 007b ES: 007b FS: 00d8
  CR0: 8005003b CR2: 4d8c3340 CR3: 3731d000 CR4: 000006d0
   [<c0105e4a>] show_trace_log_lvl+0x19/0x2e
   [<c0105f43>] show_trace+0x12/0x14
   [<c01040be>] show_regs+0x1ab/0x1b3
   [<c015f807>] softlockup_tick+0xef/0x108
   [<c013457d>] run_local_timers+0x12/0x14
   [<c01346b8>] update_process_times+0x3e/0x63
   [<c0145fcc>] tick_sched_timer+0x7c/0xc0
   [<c0140a89>] hrtimer_interrupt+0x135/0x1ba
   [<c011bde7>] smp_apic_timer_interrupt+0x6e/0x80
   [<c0105aa3>] apic_timer_interrupt+0x33/0x38
   [<c0104f8a>] syscall_call+0x7/0xb
   =======================

Note that in the old format we only knew that some system call locked
up, we didnt know _which_. With the new format we know that it's at a
specific place in sys_prctl(). [which was where i created an artificial
kernel lockup to test the new format.]

This is also useful if the lockup happens in user-space - the user-space
EIP (and other registers) will be printed too. (such a lockup would
either suggest that the task was running at SCHED_FIFO:99 and looping
for more than 10 seconds, or that the softlockup detector has a
false-positive.)

The task name is printed too first, just in case we dont manage to print
a useful backtrace.

[satyam@infradead.org: fix warning]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ingo Molnar a115d5caca fix the softlockup watchdog to actually work
this Xen related commit:

   commit 966812dc98
   Author: Jeremy Fitzhardinge <jeremy@goop.org>
   Date:   Tue May 8 00:28:02 2007 -0700

       Ignore stolen time in the softlockup watchdog

broke the softlockup watchdog to never report any lockups. (!)

print_timestamp defaults to 0, this makes the following condition
always true:

	if (print_timestamp < (touch_timestamp + 1) ||

and we'll in essence never report soft lockups.

apparently the functionality of the soft lockup watchdog was never
actually tested with that patch applied ...

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:47 -07:00
Ingo Molnar a3b13c23f1 softlockup: use cpu_clock() instead of sched_clock()
sched_clock() is not a reliable time-source, use cpu_clock() instead.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes bbe373f2c6 oom: compare cpuset mems_allowed instead of exclusive ancestors
Instead of testing for overlap in the memory nodes of the the nearest
exclusive ancestor of both current and the candidate task, it is better to
simply test for intersection between the task's mems_allowed in their task
descriptors.  This does not require taking callback_mutex since it is only
used as a hint in the badness scoring.

Tasks that do not have an intersection in their mems_allowed with the current
task are not explicitly restricted from being OOM killed because it is quite
possible that the candidate task has allocated memory there before and has
since changed its mems_allowed.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
David Rientjes fe071d7e8a oom: add oom_kill_allocating_task sysctl
Adds a new sysctl, 'oom_kill_allocating_task', which will automatically kill
the OOM-triggering task instead of scanning through the tasklist to find a
memory-hogging target.  This is helpful for systems with an insanely large
number of tasks where scanning the tasklist significantly degrades
performance.

Cc: Andrea Arcangeli <andrea@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:46 -07:00
Christoph Lameter 4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra 3e26c149c3 mm: dirty balancing for tasks
Based on ideas of Andrew:
  http://marc.info/?l=linux-kernel&m=102912915020543&w=2

Scale the bdi dirty limit inversly with the tasks dirty rate.
This makes heavy writers have a lower dirty limit than the occasional writer.

Andrea proposed something similar:
  http://lwn.net/Articles/152277/

The main disadvantage to his patch is that he uses an unrelated quantity to
measure time, which leaves him with a workload dependant tunable. Other than
that the two approaches appear quite similar.

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Peter Zijlstra 04fbfdc14e mm: per device dirty threshold
Scale writeback cache per backing device, proportional to its writeout speed.

By decoupling the BDI dirty thresholds a number of problems we currently have
will go away, namely:

 - mutual interference starvation (for any number of BDIs);
 - deadlocks with stacked BDIs (loop, FUSE and local NFS mounts).

It might be that all dirty pages are for a single BDI while other BDIs are
idling. By giving each BDI a 'fair' share of the dirty limit, each one can have
dirty pages outstanding and make progress.

A global threshold also creates a deadlock for stacked BDIs; when A writes to
B, and A generates enough dirty pages to get throttled, B will never start
writeback until the dirty pages go away. Again, by giving each BDI its own
'independent' dirty limit, this problem is avoided.

So the problem is to determine how to distribute the total dirty limit across
the BDIs fairly and efficiently. A DBI that has a large dirty limit but does
not have any dirty pages outstanding is a waste.

What is done is to keep a floating proportion between the DBIs based on
writeback completions. This way faster/more active devices get a larger share
than slower/idle devices.

[akpm@linux-foundation.org: fix warnings]
[hugh@veritas.com: Fix occasional hang when a task couldn't get out of balance_dirty_pages]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Ingo Molnar f20bf61256 time: introduce xtime_seconds
improve performance of sys_time(). sys_time() returns time in seconds,
but it does so by calling do_gettimeofday() and then returning the
tv_sec portion of the GTOD time. But the data structure "xtime", which
is updated by every timer/scheduler tick, already offers HZ granularity
time.

the patch improves the sysbench oltp macrobenchmark by 4-5% on an AMD
dual-core system:

v2.6.23:

#threads

   1:     transactions:                        4073   (407.23 per sec.)
   2:     transactions:                        8530   (852.81 per sec.)
   3:     transactions:                        8321   (831.88 per sec.)
   4:     transactions:                        8407   (840.58 per sec.)
   5:     transactions:                        8070   (806.74 per sec.)

v2.6.23 + sys_time-speedup.patch:

   1:     transactions:                        4281   (428.09 per sec.)
   2:     transactions:                        8910   (890.85 per sec.)
   3:     transactions:                        8659   (865.79 per sec.)
   4:     transactions:                        8676   (867.34 per sec.)
   5:     transactions:                        8532   (852.91 per sec.)

and by 4-5% on an Intel dual-core system too:

2.6.23:

  1:     transactions:                        4560   (455.94 per sec.)
  2:     transactions:                        10094  (1009.30 per sec.)
  3:     transactions:                        9755   (975.36 per sec.)
  4:     transactions:                        9859   (985.78 per sec.)
  5:     transactions:                        9701   (969.72 per sec.)

2.6.23 + sys_time-speedup.patch:

  1:     transactions:                        4779   (477.84 per sec.)
  2:     transactions:                        10103  (1010.14 per sec.)
  3:     transactions:                        10141  (1013.93 per sec.)
  4:     transactions:                        10371  (1036.89 per sec.)
  5:     transactions:                        10178  (1017.50 per sec.)

(the more CPUs the system has, the more speedup this patch gives for
this particular workload.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 10:01:50 -07:00
Masami Hiramatsu f438d914b2 kprobes: support kretprobe blacklist
Introduce architecture dependent kretprobe blacklists to prohibit users
from inserting return probes on the function in which kprobes can be
inserted but kretprobes can not.

This patch also removes "__kprobes" mark from "__switch_to" on x86_64 and
registers "__switch_to" to the blacklist on x86-64, because that mark is to
prohibit user from inserting only kretprobe.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:10 -07:00
Paul Jackson 607717a65d cpuset: remove sched domain hooks from cpusets
Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.

The cpu_exclusive flag can only be set on a child if it is set on the
parent.

This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.

It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpusets siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.

Furthermore, there was no way on production systems to query the
result.

Using the cpu_exclusive flag for this was simply wrong from the get go.

Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this.  One real time group did use it to
affectively isolate CPUs from any load balancing efforts.  They are willing
to adapt to alternative mechanisms for this, such as someway to manipulate
the list of isolated CPUs on a running system.  They can do without this
present cpu_exclusive based mechanism while we develop an alternative.

There is a real risk, to the best of my understanding, of users
accidentally setting up a partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side affects of the cpu_exclusive flag.

Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.

Initial discussion on lkml of this patch has generated much comment.  My
(probably controversial) take on that discussion is that it has reached a
rough concensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked.  There is no concensus on the
replacement.  But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the schedulers
load balancing, we should remove it while we can, as we proceed to work the
replacement scheduler domain mechanisms.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:09 -07:00
Christoph Hellwig 0ac1555915 m32r: convert to generic sys_ptrace
Convert m32r to the generic sys_ptrace.  The conversion requires an
architecture hook after ptrace_attach which this patch adds.  The hook
will also be needed for a conersion of ia64 to the generic ptrace code.

Thanks to Hirokazu Takata for fixing a bug in the first version of this
code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:04 -07:00
Adam Litke 54f9f80d65 hugetlb: Add hugetlb_dynamic_pool sysctl
The maximum size of the huge page pool can be controlled using the overall
size of the hugetlb filesystem (via its 'size' mount option).  However in the
common case the this will not be set as the pool is traditionally fixed in
size at boot time.  In order to maintain the expected semantics, we need to
prevent the pool expanding by default.

This patch introduces a new sysctl controlling dynamic pool resizing.  When
this is enabled the pool will expand beyond its base size up to the size of
the hugetlb filesystem.  It is disabled by default.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Dave McCracken <dave.mccracken@oracle.com>
Cc: William Irwin <bill.irwin@oracle.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
KAMEZAWA Hiroyuki 75884fb1c6 memory unplug: memory hotplug cleanup
A clean up patch for "scanning memory resource [start, end)" operation.

Now, find_next_system_ram() function is used in memory hotplug, but this
interface is not easy to use and codes are complicated.

This patch adds walk_memory_resouce(start,len,arg,func) function.
The function 'func' is called per valid memory resouce range in [start,pfn).

[pbadari@us.ibm.com: Error handling in walk_memory_resource()]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Mel Gorman e12ba74d8f Group short-lived and reclaimable kernel allocations
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Christoph Lameter 0e1e7c7a73 Memoryless nodes: Use N_HIGH_MEMORY for cpusets
cpusets try to ensure that any node added to a cpuset's mems_allowed is
on-line and contains memory.  The assumption was that online nodes contained
memory.  Thus, it is possible to add memoryless nodes to a cpuset and then add
tasks to this cpuset.  This results in continuous series of oom-kill and
apparent system hang.

Change cpusets to use node_states[N_HIGH_MEMORY] [a.k.a.  node_memory_map] in
place of node_online_map when vetting memories.  Return error if admin
attempts to write a non-empty mems_allowed node mask containing only
memoryless-nodes.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Bob Picco <bob.picco@hp.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter 4199cfa02b Memoryless nodes: Allow profiling data to fall back to other nodes
Processors on memoryless nodes must be able to fall back to remote nodes in
order to get a profiling buffer.  This may lead to excessive NUMA traffic but
I think we should allow this rather than failing.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Hellwig 74a0b57627 x86: optimize page faults like all other achitectures and kill notifier cruft
x86(-64) are the last architectures still using the page fault notifier
cruft for the kprobes page fault hook.  This patch converts them to the
proper direct calls, and removes the now unused pagefault notifier bits
aswell as the cruft in kprobes.c that was related to this mess.

I know Andi didn't really like this, but all other architecture maintainers
agreed the direct calls are much better and besides the obvious cruft
removal a common way of dealing with kprobes across architectures is
important aswell.

[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix sparc64]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Andi Kleen <ak@suse.de>
Cc: <linux-arch@vger.kernel.org>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00
Mike Travis d5a7430ddc Convert cpu_sibling_map to be a per cpu variable
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable.  This saves sizeof(cpumask_t) * NR unused cpus.  Access is mostly
from startup and CPU HOTPLUG functions.

Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:50 -07:00