Commit graph

3996 commits

Author SHA1 Message Date
Ingo Molnar 1fc8afa4c8 sched: feat affine wakeups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar b85d066726 sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off
turn off sync wakeups by default. They are not needed anymore - the
buddy logic should be smart enough to keep the system from
overscheduling.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Peter Zijlstra 0bbd3336ee sched: fix wakeup granularity for buddies
The wakeup buddy logic didn't use the same wakeup granularity logic as the
wakeup preemption did, this might cause the ->next buddy to be selected past
the point where we would have preempted had the task been a single running
instance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Guillaume Chazarain 15934a3732 sched: fix rq->clock overflows detection with CONFIG_NO_HZ
When using CONFIG_NO_HZ, rq->tick_timestamp is not updated every TICK_NSEC.
We check that the number of skipped ticks matches the clock jump seen in
__update_rq_clock().

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Reynes Philippe 30914a58af sched: sched.c needs tick.h
kernel/sched.c:506: erreur: implicit declaration of function tick_get_tick_sched
kernel/sched.c:506: erreur: invalid type argument of ->
kernel/sched.c:506: erreur: NOHZ_MODE_INACTIVE undeclared (first use in this function)
kernel/sched.c:506: erreur: (Each undeclared identifier is reported only once
kernel/sched.c:506: erreur: for each function it appears in.)

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar 27ec440779 sched: make cpu_clock() globally synchronous
Alexey Zaytsev reported (and bisected) that the introduction of
cpu_clock() in printk made the timestamps jump back and forth.

Make cpu_clock() more reliable while still keeping it fast when it's
called frequently.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Ingo Molnar 018d6db4cb sched: re-do "sched: fix fair sleepers"
re-apply:

| commit e22ecef1d2
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Fri Mar 14 22:16:08 2008 +0100
|
|     sched: fix fair sleepers
|
|     Fair sleepers need to scale their latency target down by runqueue
|     weight. Otherwise busy systems will gain ever larger sleep bonus.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:44:57 +02:00
Suresh Siddha 2adee9b30d x86: fpu xstate split fix
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-19 19:19:55 +02:00
Suresh Siddha 61c4628b53 x86, fpu: split FPU state from task struct - v5
Split the FPU save area from the task struct. This allows easy migration
of FPU context, and it's generally cleaner. It also allows the following
two optimizations:

1) only allocate when the application actually uses FPU, so in the first
lazy FPU trap. This could save memory for non-fpu using apps. Next patch
does this lazy allocation.

2) allocate the right size for the actual cpu rather than 512 bytes always.
Patches enabling xsave/xrstor support (coming shortly) will take advantage
of this.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-19 19:19:55 +02:00
Thomas Gleixner d8bb6f4c16 x86: tsc prevent time going backwards
We already catch most of the TSC problems by sanity checks, but there
is a subtle bug which has been in the code forever. This can cause
time jumps in the range of hours.

This was reported in:
     http://lkml.org/lkml/2007/8/23/96
and
     http://lkml.org/lkml/2008/3/31/23

I was able to reproduce the problem with a gettimeofday loop test on a
dual core and a quad core machine which both have sychronized
TSCs. The TSCs seems not to be perfectly in sync though, but the
kernel is not able to detect the slight delta in the sync check. Still
there exists an extremly small window where this delta can be observed
with a real big time jump. So far I was only able to reproduce this
with the vsyscall gettimeofday implementation, but in theory this
might be observable with the syscall based version as well.

CPU 0 updates the clock source variables under xtime/vyscall lock and
CPU1, where the TSC is slighty behind CPU0, is reading the time right
after the seqlock was unlocked.

The clocksource reference data was updated with the TSC from CPU0 and
the value which is read from TSC on CPU1 is less than the reference
data. This results in a huge delta value due to the unsigned
subtraction of the TSC value and the reference value. This algorithm
can not be changed due to the support of wrapping clock sources like
pm timer.

The huge delta is converted to nanoseconds and added to xtime, which
is then observable by the caller. The next gettimeofday call on CPU1
will show the correct time again as now the TSC has advanced above the
reference value.

To prevent this TSC specific wreckage we need to compare the TSC value
against the reference value and return the latter when it is larger
than the actual TSC value.

I pondered to mark the TSC unstable when the readout is smaller than
the reference value, but this would render an otherwise good and fast
clocksource unusable without a real good reason.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-19 19:19:55 +02:00
Erik Bosman 8fb402bccf generic, x86: add prctl commands PR_GET_TSC and PR_SET_TSC
This patch adds prctl commands that make it possible
to deny the execution of timestamp counters in userspace.
If this is not implemented on a specific architecture,
prctl will return -EINVAL.

ned-off-by: Erik Bosman <ejbosman@cs.vu.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-19 19:19:55 +02:00
Matthew Wilcox a655020753 kernel: Remove unnecessary inclusions of asm/semaphore.h
None of these files use any of the functionality promised by
asm/semaphore.h.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-18 22:17:04 -04:00
Ahmed S. Darwish 04305e4aff Audit: Final renamings and cleanup
Rename the se_str and se_rule audit fields elements to
lsm_str and lsm_rule to avoid confusion.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
2008-04-19 09:59:43 +10:00
Ahmed S. Darwish 9d57a7f9e2 SELinux: use new audit hooks, remove redundant exports
Setup the new Audit LSM hooks for SELinux.
Remove the now redundant exported SELinux Audit interface.

Audit: Export 'audit_krule' and 'audit_field' to the public
since their internals are needed by the implementation of the
new LSM hook 'audit_rule_known'.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
2008-04-19 09:53:46 +10:00
Ahmed S. Darwish d7a96f3a1a Audit: internally use the new LSM audit hooks
Convert Audit to use the new LSM Audit hooks instead of
the exported SELinux interface.

Basically, use:
security_audit_rule_init
secuirty_audit_rule_free
security_audit_rule_known
security_audit_rule_match

instad of (respectively) :
selinux_audit_rule_init
selinux_audit_rule_free
audit_rule_has_selinux
selinux_audit_rule_match

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
2008-04-19 09:52:37 +10:00
Ahmed S. Darwish 2a862b32f3 Audit: use new LSM hooks instead of SELinux exports
Stop using the following exported SELinux interfaces:
selinux_get_inode_sid(inode, sid)
selinux_get_ipc_sid(ipcp, sid)
selinux_get_task_sid(tsk, sid)
selinux_sid_to_string(sid, ctx, len)
kfree(ctx)

and use following generic LSM equivalents respectively:
security_inode_getsecid(inode, secid)
security_ipc_getsecid*(ipcp, secid)
security_task_getsecid(tsk, secid)
security_sid_to_secctx(sid, ctx, len)
security_release_secctx(ctx, len)

Call security_release_secctx only if security_secid_to_secctx
succeeded.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
Acked-by: James Morris <jmorris@namei.org>
Reviewed-by: Paul Moore <paul.moore@hp.com>
2008-04-19 09:52:34 +10:00
Linus Torvalds 73e3e6481f Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
  clocksource: make clocksource watchdog cycle through online CPUs
  Documentation: move timer related documentation to a single place
  clockevents: optimise tick_nohz_stop_sched_tick() a bit
  locking: remove unused double_spin_lock()
  hrtimers: simplify lockdep handling
  timers: simplify lockdep handling
  posix-timers: fix shadowed variables
  timer_list: add annotations to workqueue.c
  hrtimer: use nanosleep specific restart_block fields
  hrtimer: add nanosleep specific restart_block member
2008-04-18 08:37:41 -07:00
Linus Torvalds 9732b61123 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-kgdb
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-kgdb:
  kgdb: always use icache flush for sw breakpoints
  kgdb: fix SMP NMI kgdb_handle_exception exit race
  kgdb: documentation fixes
  kgdb: allow static kgdbts boot configuration
  kgdb: add documentation
  kgdb: Kconfig fix
  kgdb: add kgdb internal test suite
  kgdb: fix several kgdb regressions
  kgdb: kgdboc pl011 I/O module
  kgdb: fix optional arch functions and probe_kernel_*
  kgdb: add x86 HW breakpoints
  kgdb: print breakpoint removed on exception
  kgdb: clocksource watchdog
  kgdb: fix NMI hangs
  kgdb: fix kgdboc dynamic module configuration
  kgdb: document parameters
  x86: kgdb support
  consoles: polling support, kgdboc
  kgdb: core
  uaccess: add probe_kernel_write()
2008-04-18 08:37:01 -07:00
Linus Torvalds d7bb545d86 Merge branch 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'semaphore' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc:
  Remove DEBUG_SEMAPHORE from Kconfig
  Improve semaphore documentation
  Simplify semaphore implementation
  Add down_timeout and change ACPI to use it
  Introduce down_killable()
  Generic semaphore implementation
  Add semaphore.h to kernel_lock.c
  Fix quota.h includes
2008-04-18 08:25:29 -07:00
Linus Torvalds 4cba84b5d6 Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6: (36 commits)
  [S390] Remove code duplication from monreader / dcssblk.
  [S390] kernel: show last breaking-event-address on oops
  [S390] lowcore: Change type of lowcores softirq_pending to __u32.
  [S390] zcrypt: Comments and kernel-doc cleanup
  [S390] uaccess: Always access the correct address space.
  [S390] Fix a lot of sparse warnings.
  [S390] Convert s390 to GENERIC_CLOCKEVENTS.
  [S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h
  [S390] Convert monitor calls to function calls.
  [S390] qdio (new feature): enhancing info-retrieval from QDIO-adapters
  [S390] replace remaining __FUNCTION__ occurrences
  [S390] remove redundant display of free swap space in show_mem()
  [S390] qdio: remove outdated developerworks link.
  [S390] Add debug_register_mode() function to debug feature API
  [S390] crypto: use more descriptive function names for init/exit routines.
  [S390] switch sched_clock to store-clock-extended.
  [S390] zcrypt: add support for large random numbers
  [S390] hw_random: allow rng_dev_read() to return hardware errors.
  [S390] Vertical cpu management.
  [S390] cpu topology support for s390.
  ...
2008-04-18 08:19:15 -07:00
Roland McGrath 18c98b6527 ptrace_signal subroutine
This breaks out the ptrace handling from get_signal_to_deliver into a
new subroutine.  The actual code there doesn't change, and it gets
inlined into nearly identical compiled code.  This makes the function
substantially shorter and thus easier to read, and it nicely isolates
the ptrace magic.

Signed-off-by: Roland McGrath <roland@redhat.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-18 08:17:57 -07:00
Li Zefan 0e04388f01 cgroup: fix a race condition in manipulating tsk->cg_list
When I ran a test program to fork mass processes and at the same time
'cat /cgroup/tasks', I got the following oops:

  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:72!
  invalid opcode: 0000 [#1] SMP
  Pid: 4178, comm: a.out Not tainted (2.6.25-rc9 #72)
  ...
  Call Trace:
   [<c044a5f9>] ? cgroup_exit+0x55/0x94
   [<c0427acf>] ? do_exit+0x217/0x5ba
   [<c0427ed7>] ? do_group_exit+0.65/0x7c
   [<c0427efd>] ? sys_exit_group+0xf/0x11
   [<c0404842>] ? syscall_call+0x7/0xb
   [<c05e0000>] ? init_cyrix+0x2fa/0x479
  ...
  EIP: [<c04df671>] list_del+0x35/0x53 SS:ESP 0068:ebc7df4
  ---[ end trace caffb7332252612b ]---
  Fixing recursive fault but reboot is needed!

After digging into the code and debugging, I finlly found out a race
situation:

				do_exit()
				  ->cgroup_exit()
				    ->if (!list_empty(&tsk->cg_list))
				        list_del(&tsk->cg_list);

  cgroup_iter_start()
    ->cgroup_enable_task_cg_list()
      ->list_add(&tsk->cg_list, ..);

In this case the list won't be deleted though the process has exited.

We got two bug reports in the past, which seem to be the same bug as
this one:
	http://lkml.org/lkml/2008/3/5/332
	http://lkml.org/lkml/2007/10/17/224

Actually sometimes I got oops on list_del, sometimes oops on list_add.
And I can change my test program a bit to trigger other oops.

The patch has been tested both on x86_32 and x86_64.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-18 08:17:57 -07:00
Jason Wessel 1a9a3e76dd kgdb: always use icache flush for sw breakpoints
On the ppc 4xx architecture the instruction cache must be flushed as
well as the data cache.  This patch just makes it generic for all
architectures where CACHE_FLUSH_IS_SAFE is set to 1.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:43 +02:00
Jason Wessel 56fb709329 kgdb: fix SMP NMI kgdb_handle_exception exit race
Fix the problem of protecting the kgdb handle_exception exit
which had an NMI race condition, while trying to restore
normal system operation.

There was a small window after the master processor sets cpu_in_debug
to zero but before it has set kgdb_active to zero where a
non-master processor in an SMP system could receive an NMI and
re-enter the kgdb_wait() loop.

As long as the master processor sets the cpu_in_debug before sending
the cpu roundup the cpu_in_debug variable can also be used to guard
against the race condition.

The kgdb_wait() function no longer needs to check
kgdb_active because it is done in the arch specific code
and handled along with the nmi traps at the low level.
This also allows kgdb_wait() to exit correctly if it was
entered for some unknown reason due to a spurious NMI that
could not be handled by the arch specific code.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:43 +02:00
Jason Wessel 737a460f21 kgdb: fix several kgdb regressions
kgdb core fixes:
- Check to see that mm->mmap_cache is not null before calling
  flush_cache_range(), else on arch=ARM it will cause a fatal
  fault.

- Breakpoints should only be restored if they are in the BP_ACTIVE
  state.

- Fix a typo in comments to "kgdb_register_io_module"

x86 kgdb fixes:
- Fix the x86 arch handler such that on a kill or detach that the
  appropriate cleanup on the single stepping flags gets run.

- Add in the DIE_NMIWATCHDOG call for x86_64

- Touch the nmi watchdog before returning the system to normal
  operation after performing any kind of kgdb operation, else
  the possibility exists to trigger the watchdog.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:40 +02:00
Jason Wessel b4b8ac524d kgdb: fix optional arch functions and probe_kernel_*
Fix two regressions dealing with the kgdb core.

1) kgdb_skipexception and kgdb_post_primary_code are optional
functions that are only required on archs that need special exception
fixups.

2) The kernel address space scope must be set on any probe_kernel_*
function or archs such as ARCH=arm will not allow access to the kernel
memory space.  As an example, it is required to allow the full kernel
address space is when you the kernel debugger to inspect a system
call.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:39 +02:00
Jason Wessel 64e9ee3095 kgdb: add x86 HW breakpoints
Add HW breakpoints into the arch specific portion of x86 kgdb.  In the
current x86 kernel.org kernels HW breakpoints are changed out in lazy
fashion because there is no infrastructure around changing them when
changing to a kernel task or entering the kernel mode via a system
call.  This lazy approach means that if a user process uses HW
breakpoints the kgdb will loose out.  This is an acceptable trade off
because the developer debugging the kernel is assumed to know what is
going on system wide and would be aware of this trade off.

There is a minor bug fix to the kgdb core so as to correctly call the
hw breakpoint functions with a valid value from the enum.

There is also a minor change to the x86_64 startup code when using
early HW breakpoints.  When the debugger is connected, the cpu startup
code must not zero out the HW breakpoint registers or you cannot hit
the breakpoints you are interested in, in the first place.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:39 +02:00
Jason Wessel 67baf94cd2 kgdb: print breakpoint removed on exception
If kgdb does remove a breakpoint that had a problem on the recursion
check, it should also print the address of the breakpoint.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:39 +02:00
Jason Wessel 7c3078b637 kgdb: clocksource watchdog
In order to not trip the clocksource watchdog, kgdb must touch the
clocksource watchdog on the return to normal system run state.

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 20:05:38 +02:00
Jason Wessel dc7d552705 kgdb: core
kgdb core code. Handles the protocol and the arch details.

[ mingo@elte.hu: heavily modified, simplified and cleaned up. ]
[ xemul@openvz.org: use find_task_by_pid_ns ]

Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 20:05:37 +02:00
Matthew Wilcox 714493cd54 Improve semaphore documentation
Move documentation from semaphore.h to semaphore.c as requested by
Andrew Morton.  Also reformat to kernel-doc style and add some more
notes about the implementation.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-17 10:43:01 -04:00
Matthew Wilcox b17170b2fa Simplify semaphore implementation
By removing the negative values of 'count' and relying on the wait_list to
indicate whether we have any waiters, we can simplify the implementation
by removing the protection against an unlikely race condition.  Thanks to
David Howells for his suggestions.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-17 10:42:54 -04:00
Matthew Wilcox f1241c87a1 Add down_timeout and change ACPI to use it
ACPI currently emulates a timeout for semaphores with calls to
down_trylock and sleep.  This produces horrible behaviour in terms of
fairness and excessive wakeups.  Now that we have a unified semaphore
implementation, adding a real down_trylock is almost trivial.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-17 10:42:46 -04:00
Matthew Wilcox f06d968658 Introduce down_killable()
down_killable() is the functional counterpart of mutex_lock_killable.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
2008-04-17 10:42:40 -04:00
Matthew Wilcox 64ac24e738 Generic semaphore implementation
Semaphores are no longer performance-critical, so a generic C
implementation is better for maintainability, debuggability and
extensibility.  Thanks to Peter Zijlstra for fixing the lockdep
warning.  Thanks to Harvey Harrison for pointing out that the
unlikely() was unnecessary.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
2008-04-17 10:42:34 -04:00
Andi Kleen 6993fc5bbc clocksource: make clocksource watchdog cycle through online CPUs
This way it checks if the clocks are synchronized between CPUs too.
This might be able to detect slowly drifting TSCs which only
go wrong over longer time.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:31 +02:00
Karsten Wiese 903b8a8d48 clockevents: optimise tick_nohz_stop_sched_tick() a bit
Call
	ts = &per_cpu(tick_cpu_sched, cpu);
and
	cpu = smp_processor_id();
once instead of twice.

No functional change done, as changed code runs with local irq off.
Reduces source lines and text size (20bytes on x86_64).

[ akpm@linux-foundation.org: Build fix ]

Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:31 +02:00
Oleg Nesterov 8e60e05fdc hrtimers: simplify lockdep handling
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_hrtimers() uses double_spin_lock().

This is overcomplicated: except for migrate_hrtimers() we never take 2 locks
at once, and migrate_hrtimers() can use spin_lock_nested().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:31 +02:00
Oleg Nesterov 0d180406f2 timers: simplify lockdep handling
In order to avoid the false positive from lockdep, each per-cpu base->lock has
the separate lock class and migrate_timers() uses double_spin_lock().

This all is overcomplicated: except for migrate_timers() we never take 2 locks
at once, and migrate_timers() can use spin_lock_nested().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:31 +02:00
WANG Cong ee7dd205b5 posix-timers: fix shadowed variables
Fix sparse warnings like this:
kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one
kernel/posix-cpu-timers.c:1058:21: originally declared here

Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:30 +02:00
Pavel Machek d59b949f77 timer_list: add annotations to workqueue.c
Add timer list annotations to workqueue.c so we can see the call site
in the timer stats.

Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:30 +02:00
Thomas Gleixner 029a07e031 hrtimer: use nanosleep specific restart_block fields
Convert all the nanosleep related users of restart_block to the
new nanosleep specific restart_block fields.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-04-17 12:22:30 +02:00
Russell King d7b906897e [S390] genirq/clockevents: move irq affinity prototypes/inlines to interrupt.h
> Generic code is not supposed to include irq.h. Replace this include
> by linux/hardirq.h instead and add/replace an include of linux/irq.h
> in asm header files where necessary.
> This change should only matter for architectures that make use of
> GENERIC_CLOCKEVENTS.
> Architectures in question are mips, x86, arm, sh, powerpc, uml and sparc64.
>
> I did some cross compile tests for mips, x86_64, arm, powerpc and sparc64.
> This patch fixes also build breakages caused by the include replacement in
> tick-common.h.

I generally dislike adding optional linux/* includes in asm/* includes -
I'm nervous about this causing include loops.

However, there's a separate point to be discussed here.

That is, what interfaces are expected of every architecture in the kernel.
If generic code wants to be able to set the affinity of interrupts, then
that needs to become part of the interfaces listed in linux/interrupt.h
rather than linux/irq.h.

So what I suggest is this approach instead (against Linus' tree of a
couple of days ago) - we move irq_set_affinity() and irq_can_set_affinity()
to linux/interrupt.h, change the linux/irq.h includes to linux/interrupt.h
and include asm/irq_regs.h where needed (asm/irq_regs.h is supposed to be
rarely used include since not much touches the stacked parent context
registers.)

Build tested on ARM PXA family kernels and ARM's Realview platform
kernels which both use genirq.

[ tglx@linutronix.de: add GENERIC_HARDIRQ dependencies ]

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2008-04-17 07:47:05 +02:00
Linus Torvalds 093a07e2fd Fix locking bug in "acquire_console_semaphore_for_printk()"
When I cleaned up printk() and split up the printk locking logic in
commit 266c2e0abe ("Make printk() console
semaphore accesses sensible") I had incorrectly moved the call to
have_callable_console() outside of the console semaphore.

That was buggy.  The console semaphore protects the console_drivers list
that is used by have_callable_console().

Thanks go to Bongani Hlope who saw this as a hang on shutdown and reboot
and bisected the bug to the right commit, and tested this patch. See

	http://lkml.org/lkml/2008/4/11/315

Bisected-and-tested-by: Bongani Hlope <bonganilinux@mweb.co.za>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-15 13:09:54 -07:00
Pavel Machek 6afe1a1fe8 PM: Remove legacy PM
AFAICT pm_send_all is a nop when noone uses pm_register...

Hmm.. can we just force CONFIG_PM_LEGACY=n, and see what happens?

Or maybe this is better idea? It may break build somewhere, but it
should be easy to fix... (it builds here, i386 and x86-64).

Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-04-15 03:19:07 -04:00
Ingo Molnar e2df9e0905 revert "sched: fix fair sleepers"
revert "sched: fix fair sleepers" (e22ecef1d2),
because it is causing audio skipping, see:

   http://bugzilla.kernel.org/show_bug.cgi?id=10428

the patch is correct and the real cause of the skipping is not
understood (tracing makes it go away), but time has run out so we'll
revert it and re-try in 2.6.26.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-04-14 14:26:23 +02:00
Paul Menage b6c3006d20 cgroups: include hierarchy ids in /proc/<pid>/cgroup
Extend the /proc/<pid>/cgroup file to include the appropriate hierarchy ID on
each line.

Currently this ID isn't really needed since a hierarchy can be completely
identified by the set of subsystems bound to it, but this is likely to change
in the near future in order to support stateless subsystems and
merging/rebinding of subsystems.  Getting this change into 2.6.25 reduces the
need for an API change later.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-11 08:06:43 -07:00
Roland McGrath 54a0151041 asmlinkage_protect replaces prevent_tail_call
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs.  The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.

Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.

More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function.  This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else.  It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-10 17:28:26 -07:00
Paul Menage 8bab8dded6 cgroups: add cgroup support for enabling controllers at boot time
The effects of cgroup_disable=foo are:

- foo isn't auto-mounted if you mount all cgroups in a single hierarchy
- foo isn't visible as an individually mountable subsystem

As a result there will only ever be one call to foo->create(), at init time;
all processes will stay in this group, and the group will never be mounted on
a visible hierarchy.  Any additional effects (e.g.  not allocating metadata)
are up to the foo subsystem.

This doesn't handle early_init subsystems (their "disabled" bit isn't set be,
but it could easily be extended to do so if any of the early_init systems
wanted it - I think it would just involve some nastier parameter processing
since it would occur before the command-line argument parser had been run.

Hugh said:

  Ballpark figures, I'm trying to get this question out rather than
  processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
  to the affected paths, booting with cgroup_disable=memory cuts that back to
  1% overhead (due to slightly bigger struct page).

  I'm no expert on distros, they may have no interest whatever in
  CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
  without it, or apply the cgroup_disable=memory patches.

Unix bench's execl test result on x86_64 was

== just after boot without mounting any cgroup fs.==
mem_cgorup=off : Execl Throughput       43.0     3150.1      732.6
mem_cgroup=on  : Execl Throughput       43.0     2932.6      682.0
==

[lizf@cn.fujitsu.com: fix boot option parsing]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-04 14:46:26 -07:00
Mathieu Desnoyers 6496968e6c markers: use synchronize_sched()
Markers do not mix well with CONFIG_PREEMPT_RCU because it uses
preempt_disable/enable() and not rcu_read_lock/unlock for minimal
intrusiveness.  We would need call_sched and sched_barrier primitives.

Currently, the modification (connection and disconnection) of probes
from markers requires changes to the data structure done in RCU-style :
a new data structure is created, the pointer is changed atomically, a
quiescent state is reached and then the old data structure is freed.

The quiescent state is reached once all the currently running
preempt_disable regions are done running.  We use the call_rcu mechanism
to execute kfree() after such quiescent state has been reached.
However, the new CONFIG_PREEMPT_RCU version of call_rcu and rcu_barrier
does not guarantee that all preempt_disable code regions have finished,
hence the race.

The "proper" way to do this is to use rcu_read_lock/unlock, but we don't
want to use it to minimize intrusiveness on the traced system.  (we do
not want the marker code to call into much of the OS code, because it
would quickly restrict what can and cannot be instrumented, such as the
scheduler).

The temporary fix, until we get call_rcu_sched and rcu_barrier_sched in
mainline, is to use synchronize_sched before each call_rcu calls, so we
wait for the quiescent state in the system call code path.  It will slow
down batch marker enable/disable, but will make sure the race is gone.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-02 15:28:19 -07:00
Al Viro 8481664d37 futex_compat __user annotation
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-30 14:18:41 -07:00
Al Viro 9dce07f1a4 NULL noise: fs/*, mm/*, kernel/*
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-30 14:18:41 -07:00
Dave Jones f706d5d22c audit: silence two kerneldoc warnings in kernel/audit.c
Silence two kerneldoc warnings.

Warning(kernel/audit.c:1276): No description found for parameter 'string'
Warning(kernel/audit.c:1276): No description found for parameter 'len'

[also fix a typo for bonus points]

Signed-off-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 14:45:21 -07:00
YAMAMOTO Takashi 1d4a788f15 memcgroup: fix spurious EBUSY on memory cgroup removal
Call mm_free_cgroup earlier.  Otherwise a reference due to lazy mm switching
can prevent cgroup removal.

Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-28 14:45:21 -07:00
Benjamin Herrenschmidt f6d107fb10 Give futex init a proper name
The futex init function is called init(). This is a pain in the neck
when debugging when you code dies in ... init :-)

This renames it to futex_init().

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-27 08:02:13 -07:00
Linus Torvalds 8f404faa72 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
  NOHZ: reevaluate idle sleep length after add_timer_on()
  clocksource: revert: use init_timer_deferrable for clocksource_watchdog
2008-03-26 11:29:35 -07:00
Jens Axboe 5eb7f9fa84 relay: set an spd_release() hook for splice
relay doesn't reference the pages it adds, however we need a non-NULL
hook or splice_to_pipe() can oops.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-26 12:04:09 +01:00
Lai Jiangshan 37529fe9f6 set relay file can not be read by pread(2)
I found that relay files can be read by pread(2). I fix it,
for relay files are not capable of seeking.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-26 12:01:28 +01:00
Thomas Gleixner 06d8308c61 NOHZ: reevaluate idle sleep length after add_timer_on()
add_timer_on() can add a timer on a CPU which is currently in a long
idle sleep, but the timer wheel is not reevaluated by the nohz code on
that CPU. So a timer can be delayed for quite a long time. This
triggered a false positive in the clocksource watchdog code.

To avoid this we need to wake up the idle CPU and enforce the
reevaluation of the timer wheel for the next timer event.

Add a function, which checks a given CPU for idle state, marks the
idle task with NEED_RESCHED and sends a reschedule IPI to notify the
other CPU of the change in the timer wheel.

Call this function from add_timer_on().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org

--
 include/linux/sched.h |    6 ++++++
 kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
 kernel/timer.c        |   10 +++++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)
2008-03-26 08:28:55 +01:00
Thomas Gleixner 898a19de15 clocksource: revert: use init_timer_deferrable for clocksource_watchdog
Revert

commit 1077f5a917
Author: Parag Warudkar <parag.warudkar@gmail.com>
Date:   Wed Jan 30 13:30:01 2008 +0100

    clocksource.c: use init_timer_deferrable for clocksource_watchdog
    
    clocksource_watchdog can use a deferrable timer - reduces wakeups from
    idle per second.

The watchdog timer needs to run with the specified interval. Otherwise
it will miss the possible wrap of the watchdog clocksource.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@kernel.org
2008-03-25 20:13:25 +01:00
Linus Torvalds 266c2e0abe Make printk() console semaphore accesses sensible
The printk() logic on when/how to get the console semaphore was
unreadable, this splits the code up into a few helper functions and
makes it easier to follow what is going on.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 19:25:08 -07:00
Pavel Emelyanov 5f7b703fe2 bsd_acct: using task_struct->tgid is not right in pid-namespaces
In case we're accounting from a sub-namespace, the tgids reported will not
refer to the right namespace.

Save the pid_namespace we're accounting in on the acct_glbs and use it in
do_acct_process.

Two less :) places using the task_struct.tgid member.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 19:22:20 -07:00
Pavel Emelyanov a846a1954b bsd_acct: plain current->real_parent access is not always safe
This is minor, but dereferencing even current real_parent is not safe on debug
kernels, since the memory, this points to, can be unmapped - RCU protection is
required.

Besides, the tgid field is deprecated and is to be replaced with task_tgid_xxx
call (the 2nd patch), so RCU will be required anyway.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 19:22:19 -07:00
Mathieu Desnoyers 58336114af markers: remove ACCESS_ONCE
As Paul pointed out, the ACCESS_ONCE are not needed because we already have
the explicit surrounding memory barriers.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 19:22:19 -07:00
Mathieu Desnoyers fd3c36f8b5 markers: update preempt_disable. call_rcu, rcu_barrier comments
Add comments requested by Andrew.

Updated comments about synchronize_sched().  Since we use call_rcu and
rcu_barrier now, these comments were out of sync with the code.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Adrian Bunk <adrian.bunk@movial.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 19:22:19 -07:00
Linus Torvalds 92896bd9fd Don't 'printk()' while holding xtime lock for writing
The printk() can deadlock because it can wake up klogd(), and
task enqueueing will try to read the time in order to set a hrtimer.

Reported-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Debugged-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24 11:07:15 -07:00
Heiko Carstens 22e52b072d sched: add arch_update_cpu_topology hook.
Will be called each time the scheduling domains are rebuild.
Needed for architectures that don't have a static cpu topology.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:48 +01:00
Heiko Carstens 9aefd0abd8 sched: add exported arch_reinit_sched_domains() to header file.
Needed so it can be called from outside of sched.c.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Roel Kluin 23e3c3cd2e sched: remove double unlikely from schedule()
Combine two unlikely's

Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Peter Zijlstra 2070ee01d3 sched: cleanup old and rarely used 'debug' features.
TREE_AVG and APPROX_AVG are initial task placement policies that have been
disabled for a long while.. time to remove them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-21 16:43:47 +01:00
Linus Torvalds 7d3628b230 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (46 commits)
  [NET] ifb: set separate lockdep classes for queue locks
  [IPV6] KCONFIG: Fix description about IPV6_TUNNEL.
  [TCP]: Fix shrinking windows with window scaling
  netpoll: zap_completion_queue: adjust skb->users counter
  bridge: use time_before() in br_fdb_cleanup()
  [TG3]: Fix build warning on sparc32.
  MAINTAINERS: bluez-devel is subscribers-only
  audit: netlink socket can be auto-bound to pid other than current->pid (v2)
  [NET]: Fix permissions of /proc/net
  [SCTP]: Fix a race between module load and protosw access
  [NETFILTER]: ipt_recent: sanity check hit count
  [NETFILTER]: nf_conntrack_h323: logical-bitwise & confusion in process_setup()
  [RT2X00] drivers/net/wireless/rt2x00/rt2x00dev.c: remove dead code, fix warning
  [IPV4]: esp_output() misannotations
  [8021Q]: vlan_dev misannotations
  xfrm: ->eth_proto is __be16
  [IPV4]: ipv4_is_lbcast() misannotations
  [SUNRPC]: net/* NULL noise
  [SCTP]: fix misannotated __sctp_rcv_asconf_lookup()
  [PKT_SCHED]: annotate cls_u32
  ...
2008-03-21 07:57:45 -07:00
Pavel Emelyanov 75c0371a2d audit: netlink socket can be auto-bound to pid other than current->pid (v2)
From:	Pavel Emelyanov <xemul@openvz.org>

This patch is based on the one from Thomas.

The kauditd_thread() calls the netlink_unicast() and passes 
the audit_pid to it. The audit_pid, in turn, is received from 
the user space and the tool (I've checked the audit v1.6.9) 
uses getpid() to pass one in the kernel. Besides, this tool 
doesn't bind the netlink socket to this id, but simply creates 
it allowing the kernel to auto-bind one.

That's the preamble.

The problem is that netlink_autobind() _does_not_ guarantees
that the socket will be auto-bound to the current pid. Instead
it uses the current pid as a hint to start looking for a free
id. So, in case of conflict, the audit messages can be sent
to a wrong socket. This can happen (it's unlikely, but can be)
in case some task opens more than one netlink sockets and then
the audit one starts - in this case the audit's pid can be busy
and its socket will be bound to another id.

The proposal is to introduce an audit_nlk_pid in audit subsys,
that will point to the netlink socket to send packets to. It
will most often be equal to audit_pid. The socket id can be 
got from the skb's netlink CB right in the audit_receive_msg.
The audit_nlk_pid reset to 0 is not required, since all the
decisions are taken based on audit_pid value only.

Later, if the audit tools will bind the socket themselves, the
kernel will have to provide a way to setup the audit_nlk_pid
as well.

A good side effect of this patch is that audit_pid can later 
be converted to struct pid, as it is not longer safe to use 
pid_t-s in the presence of pid namespaces. But audit code still 
uses the tgid from task_struct in the audit_signal_info and in
the audit_filter_syscall.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-03-20 15:39:41 -07:00
Andrew Morton 3150e63df4 revert "clocksource: make clocksource watchdog cycle through online CPUs"
Revert commit 1ada5cba6a ("clocksource:
make clocksource watchdog cycle through online CPUs") due to the
regression reported by Gabriel C at

	http://lkml.org/lkml/2008/2/24/281

(short vesion: it makes TSC be marked as always unstable on his
machine).

Cc: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Robert Hancock <hancockr@shaw.ca>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Gabriel C <nix.or.die@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19 18:53:37 -07:00
Ingo Molnar 74e3cd7f48 sched: retune wake granularity
reduce wake-up granularity for better interactivity.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Ingo Molnar f540a6080a sched: wakeup-buddy tasks are cache-hot
Wakeup-buddy tasks are cache-hot - this makes it a bit harder
for the load-balancer to tear them apart. (but it's still possible,
if the load is sufficiently assymetric)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Ingo Molnar 4ae7d5cefd sched: improve affine wakeups
improve affine wakeups. Maintain the 'overlap' metric based on CFS's
sum_exec_runtime - which means the amount of time a task executes
after it wakes up some other task.

Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
it means there's strong workload coupling between this task and the
woken up task. If the 'overlap' is large then the workload is decoupled
and the scheduler will move them to separate CPUs more easily.

( Also slightly move the preempt_check within try_to_wake_up() - this has
  no effect on functionality but allows 'early wakeups' (for still-on-rq
  tasks) to be correctly accounted as well.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Ingo Molnar f48273860e sched: clean up wakeup balancing, code flow
Clean up the code flow. No code changed:

kernel/sched.o:

   text	   data	    bss	    dec	    hex	filename
  42521	   2858	    232	  45611	   b22b	sched.o.before
  42521	   2858	    232	  45611	   b22b	sched.o.after

md5:
   09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
   09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:53 +01:00
Ingo Molnar ac192d3921 sched: clean up wakeup balancing, rename variables
rename 'cpu' to 'prev_cpu'. No code changed:

kernel/sched.o:

   text	   data	    bss	    dec	    hex	filename
  42521	   2858	    232	  45611	   b22b	sched.o.before
  42521	   2858	    232	  45611	   b22b	sched.o.after

md5:
   09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
   09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:52 +01:00
Ingo Molnar 098fb9db2c sched: clean up wakeup balancing, move wake_affine()
split out the affine-wakeup bits.

No code changed:

kernel/sched.o:

   text	   data	    bss	    dec	    hex	filename
  42521	   2858	    232	  45611	   b22b	sched.o.before
  42521	   2858	    232	  45611	   b22b	sched.o.after

md5:
   9d76738f1272aa82f0b7affd2f51df6b  sched.o.before.asm
   09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm

(the md5's changed because stack slots changed and some registers
get scheduled by gcc in a different order - but otherwise the before
and after assembly is instruction for instruction equivalent.)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-19 04:27:52 +01:00
Jens Axboe 16d5466942 relay: fix subbuf_splice_actor() adding too many pages
If subbuf_pages was larger than the max number of pages the pipe
buffer will hold, subbuf_splice_actor() would happily go beyond
the array size.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-03-17 09:04:59 +01:00
Ingo Molnar 6a6029b8ce sched: simplify sched_slice()
Use the existing calc_delta_mine() calculation for sched_slice(). This
saves a divide and simplifies the code because we share it with the
other /cfs_rq->load users.

It also improves code size:

      text    data     bss     dec     hex filename
     42659    2740     144   45543    b1e7 sched.o.before
     42093    2740     144   44977    afb1 sched.o.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:50 +01:00
Ingo Molnar e22ecef1d2 sched: fix fair sleepers
Fair sleepers need to scale their latency target down by runqueue
weight. Otherwise busy systems will gain ever larger sleep bonus.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:50 +01:00
Peter Zijlstra aa2ac25229 sched: fix overload performance: buddy wakeups
Currently we schedule to the leftmost task in the runqueue. When the
runtimes are very short because of some server/client ping-pong,
especially in over-saturated workloads, this will cycle through all
tasks trashing the cache.

Reduce cache trashing by keeping dependent tasks together by running
newly woken tasks first. However, by not running the leftmost task first
we could starve tasks because the wakee can gain unlimited runtime.

Therefore we only run the wakee if its within a small
(wakeup_granularity) window of the leftmost task. This preserves
fairness, but does alternate server/client task groups.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15 03:02:50 +01:00
Ingo Molnar 27d1172660 sched: fix calc_delta_mine()
lw->weight can be 0 for a short time during bootup.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:50 +01:00
Ingo Molnar e89996ae3f sched: fix update_load_add()/sub()
Clear the cached inverse value when updating load. This is needed for
calc_delta_mine() to work correctly when using the rq load.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2008-03-15 03:02:49 +01:00
Peter Zijlstra 3fe69747da sched: min_vruntime fix
Current min_vruntime tracking is incorrect and will cause serious
problems when we don't run the leftmost task for some reason.

min_vruntime does two things; 1) it's used to determine a forward
direction when the u64 vruntime wraps, 2) it's used to track the
leftmost vruntime to position newly enqueued tasks from.

The current logic advances min_vruntime whenever the current task's
vruntime advance. Because the current task may pass the leftmost task
still waiting we're failing the second goal. This causes new tasks to be
placed too far ahead and thus penalizes their runtime.

Fix this by making min_vruntime the min_vruntime of the waiting tasks by
tracking it in enqueue/dequeue, and compare against current's vruntime
to obtain the absolute minimum when placing new tasks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15 03:02:49 +01:00
Hiroshi Shimamoto 0e1f34833b sched: fix race in schedule()
Fix a hard to trigger crash seen in the -rt kernel that also affects
the vanilla scheduler.

There is a race condition between schedule() and some dequeue/enqueue
functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processor. It might drop the rq lock. It means that those 3
functions encounter on_rq=0 and running=1. The current task should be
put when running.

Here is a possible scenario:

   CPU0                               CPU1
    |                              schedule()
    |                              ->deactivate_task()
    |                              ->idle_balance()
    |                              -->load_balance_newidle()
rt_mutex_setprio()                     |
    |                              --->double_lock_balance()
    *get lock                          *rel lock
    * on_rq=0, ruuning=1               |
    * sched_class is changed           |
    *rel lock                          *get lock
    :                                  |
                                       :
                                   ->put_prev_task_rt()
                                   ->pick_next_task_fair()
                                       => panic

The current process of CPU1(P1) is scheduling. Deactivated P1, and the
scheduler looks for another process on other CPU's runqueue because CPU1
will be idle. idle_balance(), load_balance_newidle() and
double_lock_balance() are called and double_lock_balance() could drop
the rq lock. On the other hand, CPU0 is trying to boost the priority of
P1. The result of boosting only P1's prio and sched_class are changed to
RT. The sched entities of P1 and P1's group are never put. It makes
cfs_rq invalid, because the cfs_rq has curr and no leaf, but
pick_next_task_fair() is called, then the kernel panics.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-15 03:02:49 +01:00
Len Brown 29ea5171cb Merge branches 'release' and 'doc' into release 2008-03-13 01:59:53 -04:00
Randy Dunlap 53471121a8 documentation: Move power-related files to Documentation/power/
Move 00-INDEX entries to power/00-INDEX (and add entry for
pm_qos_interface.txt).

Update references to moved filenames.

Fix some trailing whitespace.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-03-12 18:10:51 -04:00
Rafael J. Wysocki a82f7119fd Hibernation: Fix mark_nosave_pages()
There is a problem in the hibernation code that triggers on some NUMA
systems on which pfn_valid() returns 'true' for some PFNs that don't
belong to any zone.  Namely, there is a BUG_ON() in
memory_bm_find_bit() that triggers for PFNs not belonging to any
zone and passing the pfn_valid() test.  On the affected systems it
triggers when we mark PFNs reported by the platform as not saveable,
because the PFNs in question belong to a region mapped directly using
iorepam() (i.e. the ACPI data area) and they pass the pfn_valid()
test.

Modify memory_bm_find_bit() so that it returns an error if given PFN
doesn't belong to any zone instead of crashing the kernel and ignore
the result returned by it in mark_nosave_pages(), while marking the
"nosave" memory regions.

This doesn't affect the hibernation functionality, as we won't touch
the PFNs in question anyway.

http://bugzilla.kernel.org/show_bug.cgi?id=9966 .

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-03-11 23:15:55 -04:00
Gregory Haskins 08f503b0c0 keep rd->online and cpu_online_map in sync
It is possible to allow the root-domain cache of online cpus to
become out of sync with the global cpu_online_map.  This is because we
currently trigger removal of cpus too early in the notifier chain.
Other DOWN_PREPARE handlers may in fact run and reconfigure the
root-domain topology, thereby stomping on our own offline handling.

The end result is that rd->online may become out of sync with
cpu_online_map, which results in potential task misrouting.

So change the offline handling to be more tightly coupled with the
global offline process by triggering on CPU_DYING intead of
CPU_DOWN_PREPARE.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 14:02:58 +01:00
Gregory Haskins 1f94ef598e Revert "cpu hotplug: adjust root-domain->online span in response to hotplug event"
This reverts commit 393d94d98b.

Lets fix this right.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-11 14:02:58 +01:00
Paul E. McKenney 21bbb39c37 rcu: move PREEMPT_RCU config option back under PREEMPT
The original preemptible-RCU patch put the choice between classic and
preemptible RCU into kernel/Kconfig.preempt, which resulted in build failures
on machines not supporting CONFIG_PREEMPT.  This choice was therefore moved to
init/Kconfig, which worked, but placed the choice between classic and
preemptible RCU at the top level, a very obtuse choice indeed.

This patch changes from the Kconfig "choice" mechanism to a pair of booleans,
only one of which (CONFIG_PREEMPT_RCU) is user-visible, and is located in
kernel/Kconfig.preempt, where one would expect it to be.  The other
(CONFIG_CLASSIC_RCU) is in init/Kconfig so that it is available to all
architectures, hopefully avoiding build breakage.  Thanks to Roman Zippel for
suggesting this approach.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Josh Triplett <josh@freedesktop.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:20 -07:00
Alexey Dobriyan e24e2e64c4 modules: warn about suspicious return values from module's ->init() hook
Return value convention of module's init functions is 0/-E.  Sometimes,
e.g.  during forward-porting mistakes happen and buggy module created,
where result of comparison "workqueue != NULL" is propagated all the way up
to sys_init_module.  What happens is that some other module created
workqueue in question, our module created it again and module was
successfully loaded.

Or it could be some other bug.

Let's make such mistakes much more visible.  In retrospective, such
messages would noticeably shorten some of my head-scratching sessions.

Note, that dump_stack() is just a way to get attention from user.  Sample
message:

sys_init_module: 'foo'->init suspiciously returned 1, it should follow 0/-E convention
sys_init_module: loading module anyway...
Pid: 4223, comm: modprobe Not tainted 2.6.24-25f666300625d894ebe04bac2b4b3aadb907c861 #5

Call Trace:
 [<ffffffff80254b05>] sys_init_module+0xe5/0x1d0
 [<ffffffff8020b39b>] system_call_after_swapgs+0x7b/0x80

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:20 -07:00
Rusty Russell 6c5db22d28 modules: fix module waiting for dependent modules' init
Commit c9a3ba55 (module: wait for dependent modules doing init.) didn't quite
work because the waiter holds the module lock, meaning that the state of the
module it's waiting for cannot change.

Fortunately, it's fairly simple to update the state outside the lock and do
the wakeup.

Thanks to Jan Glauber for tracking this down and testing (qdio and qeth).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10 18:01:19 -07:00
Linus Torvalds bf5a25e1ff Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
  time: remove obsolete CLOCK_TICK_ADJUST
  time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer()
  time: prevent the loop in timespec_add_ns() from being optimised away
  ntp: use unsigned input for do_div()
2008-03-09 10:06:49 -07:00
Gregory Haskins 393d94d98b cpu hotplug: adjust root-domain->online span in response to hotplug event
We currently set the root-domain online span automatically when the
domain is added to the cpu if the cpu is already a member of
cpu_online_map.

This was done as a hack/bug-fix for s2ram, but it also causes a problem
with hotplug CPU_DOWN transitioning.  The right way to fix the original
problem is to actually respond to CPU_UP events, instead of CPU_ONLINE,
which is already too late.

This solves the hung reboot regression reported by Andrew Morton and
others.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-09 10:05:14 -07:00
Roman Zippel 10a398d04c time: remove obsolete CLOCK_TICK_ADJUST
The first version of the ntp_interval/tick_length inconsistent usage patch was
recently merged as bbe4d18ac2

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bbe4d18ac2e058c56adb0cd71f49d9ed3216a405

While the fix did greatly improve the situation, it was correctly pointed out
by Roman that it does have a small bug: If the users change clocksources after
the system has been running and NTP has made corrections, the correctoins made
against the old clocksource will be applied against the new clocksource,
causing error.

The second attempt, which corrects the issue in the NTP_INTERVAL_LENGTH
definition has also made it up-stream as commit
e13a2e61dd

http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e13a2e61dd5152f5499d2003470acf9c838eab84

Roman has correctly pointed out that CLOCK_TICK_ADJUST is calculated
based on the PIT's frequency, and isn't really relevant to non-PIT
driven clocksources (that is, clocksources other then jiffies and pit).

This patch reverts both of those changes, and simply removes
CLOCK_TICK_ADJUST.

This does remove the granularity error correction for users of PIT and Jiffies
clocksource users, but the granularity error but for the majority of users, it
should be within the 500ppm range NTP can accommodate for.

For systems that have granularity errors greater then 500ppm, the
"ntp_tick_adj=" boot option can be used to compensate.

[johnstul@us.ibm.com: provided changelog]
[mattilinnanvuori@yahoo.com: maek ntp_tick_adj static]
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Acked-by: john stultz <johnstul@us.ibm.com>
Signed-off-by: Matti Linnanvuori <mattilinnanvuori@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: mingo@elte.hu
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-09 08:42:57 +01:00
Karsten Wiese a79017660e time: don't touch an offlined CPU's ts->tick_stopped in tick_cancel_sched_timer()
Silences WARN_ONs in rcu_enter_nohz() and rcu_exit_nohz(), which appeared
before caused by (repeated) calls to:
        $ echo 0 > /sys/devices/system/cpu/cpu1/online
        $ echo 1 > /sys/devices/system/cpu/cpu1/online

Signed-off-by: Karsten Wiese <fzu@wemgehoertderstaat.de>
Cc: johnstul@us.ibm.com
Cc: Rafael Wysocki <rjw@sisk.pl>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-09 08:42:57 +01:00
David Howells e48af19f56 ntp: use unsigned input for do_div()
The kernel NTP code shouldn't hand 64-bit *signed* values to do_div().  Make it
instead hand 64-bit unsigned values.  This gets rid of a couple of warnings.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-03-09 08:42:57 +01:00
Roland McGrath 6efcae4601 Fix waitid si_code regression
In commit ee7c82da83 ("wait_task_stopped:
simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
cast when storing si_code was lost in wait_task_stopped.  This leaks the
in-kernel CLD_* values that do not match what userland expects.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-08 11:54:00 -08:00
Dhaval Giani 521f1a2489 sched: don't allow rt_runtime_us to be zero for groups having rt tasks
This patch checks if we can set the rt_runtime_us to 0. If there is a
realtime task in the group, we don't want to set the rt_runtime_us as 0
or bad things will happen. (that task wont get any CPU time despite
being TASK_RUNNNG)

Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Peter Zijlstra 2692a2406b sched: rt-group: fixup schedulability constraints calculation
it was only possible to configure the rt-group scheduling parameters
beyond the default value in a very small range.

that's because div64_64() has a different calling convention than
do_div() :/

fix a few untidies while we are here; sysctl_sched_rt_period may overflow
due to that multiplication, so cast to u64 first. Also that RUNTIME_INF
juggling makes little sense although its an effective NOP.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Miao Xie 1868f958eb sched: fix the wrong time slice value for SCHED_FIFO tasks
Function sys_sched_rr_get_interval returns wrong time slice value for
SCHED_FIFO tasks. The time slice for SCHED_FIFO tasks should be 0.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Pavel Roskin 150d8bede7 sched: export task_nice
The API is trivial, and so is the implementation.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Steven Rostedt 6fa46fa526 sched: balance RT task resched only on runqueue
Sripathi Kodi reported a crash in the -rt kernel:

  https://bugzilla.redhat.com/show_bug.cgi?id=435674

this is due to a place that can reschedule a task without holding
the tasks runqueue lock.  This was caused by the RT balancing code
that pulls RT tasks to the current run queue and will reschedule the
current task.

There's a slight chance that the pulling of the RT tasks will release
the current runqueue's lock and retake it (in the double_lock_balance).
During this time that the runqueue is released, the current task can
migrate to another runqueue.

In the prio_changed_rt code, after the pull, if the current task is of
lesser priority than one of the RT tasks pulled, resched_task is called
on the current task. If the current task had migrated in that small
window, resched_task will be called without holding the runqueue lock
for the runqueue that the task is on.

This race condition also exists in the mainline kernel and this patch
adds a check to make sure the task hasn't migrated before calling
resched_task.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Tested-by: Sripathi Kodi <sripathik@in.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:43:00 +01:00
Peter Zijlstra 810b38179e sched: retain vruntime
Kei Tokunaga reported an interactivity problem when moving tasks
between control groups.

Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.

Reported-by: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-07 16:42:59 +01:00
David Rientjes 41f7f60d31 cpusets: fix obsolete comment
mm migration is no longer done in cpuset_update_task_memory_state() so it
can no longer take current->mm->mmap_sem, so fix the obsolete comment.

[ This changed in commit 04c19fa6f1
  ("cpuset: migrate all tasks in cpuset at once") when the mm migration
  was moved from cpuset_update_task_memory_state() to update_nodemask() ]

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-05 17:53:33 -08:00
Pavel Roskin 9b37ccfc63 module: allow ndiswrapper to use GPL-only symbols
A change after 2.6.24 broke ndiswrapper by accidentally removing its
access to GPL-only symbols.  Revert that change and add comments about
the reasons why ndiswrapper and driverloader are treated in a special
way.

Signed-off-by: Pavel Roskin <proski@gnu.org>
Acked-by: Greg KH <gregkh@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 20:29:40 -08:00
Masami Hiramatsu b2a5cd6938 kprobes: fix a null pointer bug in register_kretprobe()
Fix a bug in regiseter_kretprobe() which does not check rp->kp.symbol_name ==
NULL before calling kprobe_lookup_name.

For maintainability, this introduces kprobe_addr helper function which
resolves addr field.  It is used by register_kprobe and register_kretprobe.

Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:19 -08:00
Jesper Juhl 544adb4107 markers: don't risk NULL deref in marker
get_marker() may return NULL, so test for it.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:14 -08:00
Ananth N Mavinakayanahalli 9edddaa200 Kprobes: indicate kretprobe support in Kconfig
Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant
architectures with kprobes support.  This facilitates easy handling of
in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on
kretprobes being present in the kernel.

Thanks to Sam Ravnborg for helping make the patch more lean.

Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:11 -08:00
Balbir Singh fb78922ce9 Memory Resource Controller use strstrip while parsing arguments
The memory controller has a requirement that while writing values, we need
to use echo -n. This patch fixes the problem and makes the UI more consistent.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:09 -08:00
Li Zefan b6abdb0e6c cgroup: fix default notify_on_release setting
The documentation says the default value of notify_on_release of a child
cgroup is inherited from its parent, which is reasonable, but the
implementation just sets the flag disabled.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 16:35:09 -08:00
Linus Torvalds 87baa2bb90 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel:
  sched: revert load_balance_monitor() changes
2008-03-04 09:23:28 -08:00
Peter Zijlstra 62fb185130 sched: revert load_balance_monitor() changes
The following commits cause a number of regressions:

  commit 58e2d4ca58
  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Date:   Fri Jan 25 21:08:00 2008 +0100
  sched: group scheduling, change how cpu load is calculated

  commit 6b2d770026
  Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
  Date:   Fri Jan 25 21:08:00 2008 +0100
  sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups

Namely:
 - very frequent wakeups on SMP, reported by PowerTop users.
 - cacheline trashing on (large) SMP
 - some latencies larger than 500ms

While there is a mergeable patch to fix the latter, the former issues
are not fixable in a manner suitable for .25 (we're at -rc3 now).

Hence we revert them and try again in v2.6.26.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-03-04 17:54:06 +01:00
Roland McGrath 13b1c3d4b4 freezer vs stopped or traced
This changes the "freezer" code used by suspend/hibernate in its treatment
of tasks in TASK_STOPPED (job control stop) and TASK_TRACED (ptrace) states.

As I understand it, the intent of the "freezer" is to hold all tasks
from doing anything significant.  For this purpose, TASK_STOPPED and
TASK_TRACED are "frozen enough".  It's possible the tasks might resume
from ptrace calls (if the tracer were unfrozen) or from signals
(including ones that could come via timer interrupts, etc).  But this
doesn't matter as long as they quickly block again while "freezing" is
in effect.  Some minor adjustments to the signal.c code make sure that
try_to_freeze() very shortly follows all wakeups from both kinds of
stop.  This lets the freezer code safely leave stopped tasks unmolested.

Changing this fixes the longstanding bug of seeing after resuming from
suspend/hibernate your shell report "[1] Stopped" and the like for all
your jobs stopped by ^Z et al, as if you had freshly fg'd and ^Z'd them.
It also removes from the freezer the arcane special case treatment for
ptrace'd tasks, which relied on intimate knowledge of ptrace internals.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04 07:59:54 -08:00
Oleg Nesterov 821c7de719 exit_notify: fix kill_orphaned_pgrp() usage with mt exit
1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
   should do this only when the whole process exits.

2. exit_notify() uses "current" as "ignored_task", obviously wrong.
   Use ->group_leader instead.

Test case:

	void hup(int sig)
	{
		printf("HUP received\n");
	}

	void *tfunc(void *arg)
	{
		sleep(2);
		printf("sub-thread exited\n");
		return NULL;
	}

	int main(int argc, char *argv[])
	{
		if (!fork()) {
			signal(SIGHUP, hup);
			kill(getpid(), SIGSTOP);
			exit(0);
		}

		pthread_t thr;
		pthread_create(&thr, NULL, tfunc, NULL);

		sleep(1);
		printf("main thread exited\n");
		syscall(__NR_exit, 0);

		return 0;
	}

output:

	main thread exited
	HUP received
	Hangup

With this patch the output is:

	main thread exited
	sub-thread exited
	HUP received

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03 14:53:16 -08:00
Oleg Nesterov 05e83df624 will_become_orphaned_pgrp: partially fix insufficient ->exit_state check
p->exit_state != 0 doesn't mean this process is dead, it may have
sub-threads.  Change the code to use "p->exit_state && thread_group_empty(p)"
instead.

Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
if the main thread has exited.

However, the new check is not perfect either.  There is a window when
exit_notify() drops tasklist and before release_task().  Suppose that
the last (non-leader) thread exits.  This means that entire group exits,
but thread_group_empty() is not true yet.

As Eric pointed out, is_global_init() is wrong as well, but I did not
dare to do other changes.

Just for the record, has_stopped_jobs() is absolutely wrong too.  But we
can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.

Even with this patch ^Z doesn't play well with the dead main thread.
The task is stopped correctly but do_wait(WSTOPPED) won't see it.  This
is another unrelated issue, will be (hopefully) fixed separately.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03 14:53:16 -08:00
Oleg Nesterov f49ee505b1 introduce kill_orphaned_pgrp() helper
Factor out the common code in reparent_thread() and exit_notify().

No functional changes.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-03 14:53:16 -08:00
Steve Grubb 8d07a67cfa [PATCH] drop EOE records from printk
Hi,

While we are looking at the printk issue, I see that its printk'ing the EOE
(end of event) records which is really not something that we need in syslog.
Its really intended for the realtime audit event stream handled by the audit
daemon. So, lets avoid printk'ing that record type.

Signed-off-by: Steve Grubb <sgrubb@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-01 07:16:06 -05:00
Eric Paris b29ee87e9b [RFC] AUDIT: do not panic when printk loses messages
On the latest kernels if one was to load about 15 rules, set the failure
state to panic, and then run service auditd stop the kernel will panic.
This is because auditd stops, then the script deletes all of the rules.
These deletions are sent as audit messages out of the printk kernel
interface which is already known to be lossy.  These will overun the
default kernel rate limiting (10 really fast messages) and will call
audit_panic().  The same effect can happen if a slew of avc's come
through while auditd is stopped.

This can be fixed a number of ways but this patch fixes the problem by
just not panicing if auditd is not running.  We know printk is lossy and
if the user chooses to set the failure mode to panic and tries to use
printk we can't make any promises no matter how hard we try, so why try?
At least in this way we continue to get lost message accounting and will
eventually know that things went bad.

The other change is to add a new call to audit_log_lost() if auditd
disappears.  We already pulled the skb off the queue and couldn't send
it so that message is lost.  At least this way we will account for the
last message and panic if the machine is configured to panic.  This code
path should only be run if auditd dies for unforeseen reasons.  If
auditd closes correctly audit_pid will get set to 0 and we won't walk
this code path.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-01 07:16:06 -05:00
Paul Moore 422b03cf75 [PATCH] Audit: Fix the format type for size_t variables
Fix the following compiler warning by using "%zu" as defined in C99.

  CC      kernel/auditsc.o
  kernel/auditsc.c: In function 'audit_log_single_execve_arg':
  kernel/auditsc.c:1074: warning: format '%ld' expects type 'long int', but
  argument 4 has type 'size_t'

Signed-off-by: Paul Moore <paul.moore@hp.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-01 07:16:06 -05:00
Paul E. McKenney c9e71002aa rcupreempt: remove never-migrates assumption from rcu_process_callbacks()
This patch fixes a potentially invalid access to a per-CPU variable in
rcu_process_callbacks().

This per-CPU access needs to be done in such a way as to guarantee that
the code using it cannot move to some other CPU before all uses of the
value accessed have completed.  Even though this code is currently only
invoked from softirq context, which currrently cannot migrate to some
other CPU, life would be better if this code did not silently make such
an assumption.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-29 20:21:13 +01:00
Paul E. McKenney ae778869ae rcupreempt: fix hibernate/resume in presence of PREEMPT_RCU and hotplug
This fixes a oops encountered when doing hibernate/resume in presence of
PREEMPT_RCU.

The problem was that the code failed to disable preemption when
accessing a per-CPU variable.  This is OK when called from code that
already has preemption disabled, but such is not the case from the
suspend/resume code path.

Reported-by: Dave Young <hidave.darkstar@gmail.com>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-29 20:21:13 +01:00
Dmitry Adamushko 7be2a03e31 softlockup: fix task state setting
kthread_stop() can be called when a 'watchdog' thread is executing after
kthread_should_stop() but before set_task_state(TASK_INTERRUPTIBLE).

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-29 18:46:53 +01:00
Steven Rostedt 2232c2d8e0 rcu: add support for dynamic ticks and preempt rcu
The PREEMPT-RCU can get stuck if a CPU goes idle and NO_HZ is set. The
idle CPU will not progress the RCU through its grace period and a
synchronize_rcu my get stuck. Without this patch I have a box that will
not boot when PREEMPT_RCU and NO_HZ are set. That same box boots fine
with this patch.

This patch comes from the -rt kernel where it has been tested for
several months.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-29 18:46:50 +01:00
Linus Torvalds 3fca96eed1 Merge branch 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep
* 'v2.6.25-rc3-lockdep' of git://git.kernel.org/pub/scm/linux/kernel/git/peterz/linux-2.6-lockdep:
  Subject: lockdep: include all lock classes in all_lock_classes
  lockdep: increase MAX_LOCK_DEPTH
2008-02-26 07:49:15 -08:00
Linus Torvalds 37c00b84d0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  latencytop: change /proc task_struct access method
  latencytop: fix memory leak on latency proc file
  latencytop: fix kernel panic while reading latency proc file
  sched: add declaration of sched_tail to sched.h
  sched: fix signedness warnings in sched.c
  sched: clean up __pick_last_entity() a bit
  sched: remove duplicate code from sched_fair.c
  sched: make early bootup sched_clock() use safer
2008-02-26 07:43:31 -08:00
Tejun Heo cf3680b90c printk: fix possible printk overrun
printk recursion detection prepends message to printk_buf and offsets
printk_buf when actual message is printed but it forgets to trim buffer
length accordingly. This can result in overrun in extreme cases. Fix it.

[ mingo@elte.hu:

  bug was introduced by me via:

   commit 32a7600668
   Author: Ingo Molnar <mingo@elte.hu>
   Date:   Fri Jan 25 21:07:58 2008 +0100

       printk: make printk more robust by not allowing recursion
]

Signed-off-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-26 07:42:37 -08:00
Dale Farnsworth 1481197b50 Subject: lockdep: include all lock classes in all_lock_classes
Add each lock class to the all_lock_classes list when it is
first registered.

Previously, lock classes were added to all_lock_classes when
the lock class was first used.  Since one of the uses of the
list is to find unused locks, this didn't work well.

Signed-off-by: Dale Farnsworth <dale@farnsworth.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 23:03:02 +01:00
Harvey Harrison 67ca7bde2e sched: fix signedness warnings in sched.c
Unsigned long values are always assigned to switch_count,
make it unsigned long.

kernel/sched.c:3897:15: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3897:15:    expected long *switch_count
kernel/sched.c:3897:15:    got unsigned long *<noident>
kernel/sched.c:3921:16: warning: incorrect type in assignment (different signedness)
kernel/sched.c:3921:16:    expected long *switch_count
kernel/sched.c:3921:16:    got unsigned long *<noident>

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:17 +01:00
Ingo Molnar 7eee3e677d sched: clean up __pick_last_entity() a bit
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:17 +01:00
Balbir Singh 70eee74b70 sched: remove duplicate code from sched_fair.c
pick_task_entity() duplicates existing code. This functionality can be
easily obtained using rb_last(). Avoid code duplication by using rb_last().

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:17 +01:00
Ingo Molnar 6892b75e60 sched: make early bootup sched_clock() use safer
do not call sched_clock() too early. Not only might rq->idle
not be set up - but pure per-cpu data might not be accessible
either.

this solves an ia64 early bootup hang with CONFIG_PRINTK_TIME=y.

Tested-by: Tony Luck <tony.luck@gmail.com>
Acked-by: Tony Luck <tony.luck@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-25 16:34:16 +01:00
Linus Torvalds 04e2f1741d Add memory barrier semantics to wake_up() & co
Oleg Nesterov and others have pointed out that on some architectures,
the traditional sequence of

	set_current_state(TASK_INTERRUPTIBLE);
	if (CONDITION)
		return;
	schedule();

is racy wrt another CPU doing

	CONDITION = 1;
	wake_up_process(p);

because while set_current_state() has a memory barrier separating
setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION
variable, there is no such memory barrier on the wakeup side.

Now, wake_up_process() does actually take a spinlock before it reads and
sets the task state on the waking side, and on x86 (and many other
architectures) that spinlock is in fact equivalent to a memory barrier,
but that is not generally guaranteed.  The write that sets CONDITION
could move into the critical region protected by the runqueue spinlock.

However, adding a smp_wmb() to before the spinlock should now order the
writing of CONDITION wrt the lock itself, which in turn is ordered wrt
the accesses within the spinlock (which includes the reading of the old
state).

This should thus close the race (which probably has never been seen in
practice, but since smp_wmb() is a no-op on x86, it's not like this will
make anything worse either on the most common architecture where the
spinlock already gave the required protection).

Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 18:05:03 -08:00
Li Zefan bc231d2a04 cgroup: remove dead code in cgroup_get_rootdir()
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:25 -08:00
Li Zefan 68db38f153 cgroup: remove duplicate code in find_css_set()
The list head res->tasks gets initialized twice in find_css_set().

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:25 -08:00
Li Zefan 8d53d55d27 cgroup: fix subsys bitops
Cgroup uses unsigned long for subsys bitops, not unsigned long long.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Li Zefan f777073848 cgroup: fix memory leak in cgroup_get_sb()
opts.release_agent is not kfree()ed in all necessary places.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Li Zefan a043e3b2c6 cgroup: fix comments
fix:
- comments about need_forkexit_callback
- comments about release agent
- typo and comment style, etc.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Srinivasa Ds 4362758279 kprobes: refuse kprobe insertion on add/sub_preempt_counter()
Kprobes makes use of preempt_disable(),preempt_enable_noresched() and these
functions inturn call add/sub_preempt_count().  So we need to refuse user from
inserting probe in to these functions.

This patch disallows user from probing add/sub_preempt_count().

Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:13:24 -08:00
Thomas Gleixner a0c1e9073e futex: runtime enable pi and robust functionality
Not all architectures implement futex_atomic_cmpxchg_inatomic().  The default
implementation returns -ENOSYS, which is currently not handled inside of the
futex guts.

Futex PI calls and robust list exits with a held futex result in an endless
loop in the futex code on architectures which have no support.

Fixing up every place where futex_atomic_cmpxchg_inatomic() is called would
add a fair amount of extra if/else constructs to the already complex code.  It
is also not possible to disable the robust feature before user space tries to
register robust lists.

Compile time disabling is not a good idea either, as there are already
architectures with runtime detection of futex_atomic_cmpxchg_inatomic support.

Detect the functionality at runtime instead by calling
cmpxchg_futex_value_locked() with a NULL pointer from the futex initialization
code.  This is guaranteed to fail, but the call of
futex_atomic_cmpxchg_inatomic() happens with pagefaults disabled.

On architectures, which use the asm-generic implementation or have a runtime
CPU feature detection, a -ENOSYS return value disables the PI/robust features.

On architectures with a working implementation the call returns -EFAULT and
the PI/robust features are enabled.

The relevant syscalls return -ENOSYS and the robust list exit code is blocked,
when the detection fails.

Fixes http://lkml.org/lkml/2008/2/11/149
Originally reported by: Lennart Buytenhek

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:12:15 -08:00
Thomas Gleixner 3e4ab747ef futex: fix init order
When the futex init code fails to initialize the futex pseudo file system it
returns early without initializing the hash queues.  Should the boot succeed
then a futex syscall which tries to enqueue a waiter on the hashqueue will
crash due to the unitilialized plist heads.

Initialize the hash queues before the filesystem.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Lennert Buytenhek <buytenh@wantstofly.org>
Cc: Riku Voipio <riku.voipio@movial.fi>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:12:15 -08:00
Harvey Harrison de4fc64f0f markers: fix sparse warnings in markers.c
char can be unsigned
kernel/marker.c:64:20: error: dubious one-bit signed bitfield
kernel/marker.c:65:14: error: dubious one-bit signed bitfield

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 17:12:14 -08:00
Rafael J. Wysocki 3a2d5b7001 PM: Introduce PM_EVENT_HIBERNATE callback state
During the last step of hibernation in the "platform" mode (with the
help of ACPI) we use the suspend code, including the devices'
->suspend() methods, to prepare the system for entering the ACPI S4
system sleep state.

But at least for some devices the operations performed by the
->suspend() callback in that case must be different from its operations
during regular suspend.

For this reason, introduce the new PM event type PM_EVENT_HIBERNATE and
pass it to the device drivers' ->suspend() methods during the last phase
of hibernation, so that they can distinguish this case and handle it as
appropriate.  Modify the drivers that handle PM_EVENT_SUSPEND in a
special way and need to handle PM_EVENT_HIBERNATE in the same way.

These changes are necessary to fix a hibernation regression related
to the i915 driver (ref. http://lkml.org/lkml/2008/2/22/488).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Jeff Chua <jeff.chua.linux@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-23 10:40:04 -08:00
Linus Torvalds 20f8d2a493 Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (26 commits)
  PM: Make suspend_device() static
  PCI ACPI: Fix comment describing acpi_pci_choose_state
  Hibernation: Handle DEBUG_PAGEALLOC on x86
  ACPI: fix build warning
  ACPI: TSC breaks atkbd suspend
  ACPI: remove is_processor_present prototype
  acer-wmi: Add DMI match for mail LED on Acer TravelMate 4200 series
  ACPI: sparse fix, replace macro with static function
  ACPI: thinkpad-acpi: add tablet-mode reporting
  ACPI: thinkpad-acpi: minor hotkey_radio_sw fixes
  ACPI: thinkpad-acpi: improve thinkpad-acpi input device documentation
  ACPI: thinkpad-acpi: issue input events for tablet swivel events
  ACPI: thinkpad-acpi: make the video output feature optional
  ACPI: thinkpad-acpi: synchronize input device switches
  ACPI: thinkpad-acpi: always track input device open/close
  ACPI: thinkpad-acpi: trivial fix to documentation
  ACPI: thinkpad-acpi: trivial fix to module_desc typo
  intel_menlo: extract return values using PTR_ERR
  ACPI video: check for error from thermal_cooling_device_register
  ACPI thermal: extract return values using PTR_ERR
  ...
2008-02-21 16:33:19 -08:00
Kay Sievers 120fc3d77a modules: do not try to add sysfs attributes if !CONFIG_SYSFS
Thanks to Alexey for the testing and the fix of the fix.

Cc: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-02-21 15:27:08 -08:00
Rafael J. Wysocki 8a235efad5 Hibernation: Handle DEBUG_PAGEALLOC on x86
Make hibernation work with CONFIG_DEBUG_PAGEALLOC set on x86, by
checking if the pages to be copied are marked as present in the
kernel mapping and temporarily marking them as present if that's not
the case.  No functional modifications are introduced if
CONFIG_DEBUG_PAGEALLOC is unset.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-21 02:15:28 -05:00
Thomas Gleixner 89d694b9db genirq: do not leave interupts enabled on free_irq
The default_disable() function was changed in commit:

 76d2160147
 genirq: do not mask interrupts by default

It removed the mask function in favour of the default delayed
interrupt disabling. Unfortunately this also broke the shutdown in
free_irq() when the last handler is removed from the interrupt for
those architectures which rely on the default implementations. Now we
can end up with a enabled interrupt line after the last handler was
removed, which can result in spurious interrupts.

Fix this by adding a default_shutdown function, which is only
installed, when the irqchip implementation does provide neither a
shutdown nor a disable function.

[@stable: affected versions: .21 - .24 ]

Pointed-out-by: Michael Hennerich <Michael.Hennerich@analog.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Tested-by: Michael Hennerich <Michael.Hennerich@analog.com>
2008-02-19 10:43:58 +01:00
S.Caglar Onur 188fd89d53 genirq: spurious.c: use time_* macros
The functions time_before, time_before_eq, time_after, and
time_after_eq are more robust for comparing jiffies against other
values.

So following patch implements usage of the time_after() macro, defined
at linux/jiffies.h, which deals with wrapping correctly

Signed-off-by: S.Caglar Onur <caglar@pardus.org.tr>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-19 10:43:58 +01:00
Eric Paris b0abcfc146 Audit: use == not = in if statements
Clearly this was supposed to be an == not an = in the if statement.
This patch also causes us to stop processing execve args once we have
failed rather than continuing to loop on failure over and over and over.

Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-18 18:46:28 -08:00
Pavel Machek db4315d6f5 timer_list: print relative expiry time signed
Relative expiry time can get negative, so it should be signed.

Signed-off-by: Pavel Machek <Pavel@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-17 17:29:38 +01:00
Linus Torvalds c24ce1d887 Merge git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt
* git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt:
  hrtimer: catch expired CLOCK_REALTIME timers early
  hrtimer: check relative timeouts for overflow
2008-02-14 21:27:52 -08:00
Jan Blunck cf28b4863f d_path: Make d_path() use a struct path
d_path() is used on a <dentry,vfsmount> pair.  Lets use a struct path to
reflect this.

[akpm@linux-foundation.org: fix build in mm/memory.c]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Bryan Wu <bryan.wu@analog.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:09 -08:00
Jan Blunck 44707fdf59 d_path: Use struct path in struct avc_audit_data
audit_log_d_path() is a d_path() wrapper that is used by the audit code.  To
use a struct path in audit_log_d_path() I need to embed it into struct
avc_audit_data.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:17:08 -08:00
Jan Blunck 6ac08c39a1 Use struct path in fs_struct
* Use struct path in fs_struct.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck 1d957f9bf8 Introduce path_put()
* Add path_put() functions for releasing a reference to the dentry and
  vfsmount of a struct path in the right order

* Switch from path_release(nd) to path_put(&nd->path)

* Rename dput_path() to path_put_conditional()

[akpm@linux-foundation.org: fix cifs]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Steven French <sfrench@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck 4ac9137858 Embed a struct path into struct nameidata instead of nd->{dentry,mnt}
This is the central patch of a cleanup series. In most cases there is no good
reason why someone would want to use a dentry for itself. This series reflects
that fact and embeds a struct path into nameidata.

Together with the other patches of this series
- it enforced the correct order of getting/releasing the reference count on
  <dentry,vfsmount> pairs
- it prepares the VFS for stacking support since it is essential to have a
  struct path in every place where the stack can be traversed
- it reduces the overall code size:

without patch series:
   text    data     bss     dec     hex filename
5321639  858418  715768 6895825  6938d1 vmlinux

with patch series:
   text    data     bss     dec     hex filename
5320026  858418  715768 6894212  693284 vmlinux

This patch:

Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix cifs]
[akpm@linux-foundation.org: fix smack]
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:33 -08:00
Jan Blunck db74ece990 Dont touch fs_struct in usermodehelper
This test seems to be unnecessary since we always have rootfs mounted before
calling a usermodehelper.

Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Jan Blunck <jblunck@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-14 21:13:32 -08:00
Thomas Gleixner 63070a79ba hrtimer: catch expired CLOCK_REALTIME timers early
A CLOCK_REALTIME timer, which has an absolute expiry time less than
the clock realtime offset calls with a negative delta into the clock
events code and triggers the WARN_ON() there.

This is a false positive and needs to be prevented. Check the result
of timer->expires - timer->base->offset right away and return -ETIME
right away.

Thanks to Frans Pop, who reported the problem and tested the fixes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Frans Pop <elendil@planet.nl>
2008-02-14 22:08:30 +01:00
Thomas Gleixner 5a7780e725 hrtimer: check relative timeouts for overflow
Various user space callers ask for relative timeouts. While we fixed
that overflow issue in hrtimer_start(), the sites which convert
relative user space values to absolute timeouts themself were uncovered.

Instead of putting overflow checks into each place add a function
which does the sanity checking and convert all affected callers to use
it.

Thanks to Frans Pop, who reported the problem and tested the fixes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Frans Pop <elendil@planet.nl>
2008-02-14 22:08:30 +01:00
Mathieu Desnoyers fb40bd78b0 Linux Kernel Markers: support multiple probes
RCU style multiple probes support for the Linux Kernel Markers.  Common case
(one probe) is still fast and does not require dynamic allocation or a
supplementary pointer dereference on the fast path.

- Move preempt disable from the marker site to the callback.

Since we now have an internal callback, move the preempt disable/enable to the
callback instead of the marker site.

Since the callback change is done asynchronously (passing from a handler that
supports arguments to a handler that does not setup the arguments is no
arguments are passed), we can safely update it even if it is outside the
preempt disable section.

- Move probe arm to probe connection. Now, a connected probe is automatically
  armed.

Remove MARK_MAX_FORMAT_LEN, unused.

This patch modifies the Linux Kernel Markers API : it removes the probe
"arm/disarm" and changes the probe function prototype : it now expects a
va_list * instead of a "...".

If we want to have more than one probe connected to a marker at a given
time (LTTng, or blktrace, ssytemtap) then we need this patch. Without it,
connecting a second probe handler to a marker will fail.

It allow us, for instance, to do interesting combinations :

Do standard tracing with LTTng and, eventually, to compute statistics
with SystemTAP, or to have a special trigger on an event that would call
a systemtap script which would stop flight recorder tracing.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Mason <mmlnx@us.ibm.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: David Smith <dsmith@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:20 -08:00
Nishanth Aravamudan 064d9efe94 hugetlb: fix overcommit locking
proc_doulongvec_minmax() calls copy_to_user()/copy_from_user(), so we can't
hold hugetlb_lock over the call.  Use a dummy variable to store the sysctl
result, like in hugetlb_sysctl_handler(), then grab the lock to update
nr_overcommit_huge_pages.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Harvey Harrison b5606c2d44 remove final fastcall users
fastcall always expands to empty, remove it.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Paul E. McKenney fbf6bfca76 rcupdate: fix comment
This comment caused some consternation during fastcall removal.  Make it
truthful.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-13 16:21:18 -08:00
Linus Torvalds 3174ffaa93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched
* git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched:
  sched: rt-group: refure unrunnable tasks
  sched: rt-group: clean up the ifdeffery
  sched: rt-group: make rt groups scheduling configurable
  sched: rt-group: interface
  sched: rt-group: deal with PI
  sched: fix incorrect irq lock usage in normalize_rt_tasks()
  sched: fair-group: separate tg->shares from task_group_lock
  hrtimer: more hrtimer_init_sleeper() fallout.
2008-02-13 08:22:41 -08:00
Peter Zijlstra b68aa2300c sched: rt-group: refure unrunnable tasks
Refuse to accept or create RT tasks in groups that can't run them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra bccbe08a60 sched: rt-group: clean up the ifdeffery
Clean up some of the excessive ifdeffery introduces in the last patch.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra 052f1dc7eb sched: rt-group: make rt groups scheduling configurable
Make the rt group scheduler compile time configurable.
Keep it experimental for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:40 +01:00
Peter Zijlstra 9f0c1e560c sched: rt-group: interface
Change the rt_ratio interface to rt_runtime_us, to match rt_period_us.
This avoids picking a granularity for the ratio.

Extend the /sys/kernel/uids/<uid>/ interface to allow setting
the group's rt_runtime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra 23b0fdfc92 sched: rt-group: deal with PI
Steven mentioned the fun case where a lock holding task will be throttled.

Simple fix: allow groups that have boosted tasks to run anyway.

If a runnable task in a throttled group gets boosted the dequeue/enqueue
done by rt_mutex_setprio() is enough to unthrottle the group.

This is ofcourse not quite correct. Two possible ways forward are:
  - second prio array for boosted tasks
  - boost to a prio ceiling (this would also work for deadline scheduling)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra 4cf5d77a6e sched: fix incorrect irq lock usage in normalize_rt_tasks()
lockdep spotted this bogus irq locking. normalize_rt_tasks() can be called
from hardirq context through sysrq-n

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra 8ed3699682 sched: fair-group: separate tg->shares from task_group_lock
On Mon, 2008-02-11 at 15:09 +0300, Denis V. Lunev wrote:
> BUG: sleeping function called from invalid context
> at /home/den/src/linux-netns26/kernel/mutex.c:209
> in_atomic():1, irqs_disabled():0
> no locks held by swapper/0.
> Pid: 0, comm: swapper Not tainted 2.6.24 #304
>
> Call Trace:
>  <IRQ>  [<ffffffff80252d1e>] ? __debug_show_held_locks+0x15/0x27
>  [<ffffffff8022c2a8>] __might_sleep+0xc0/0xdf
>  [<ffffffff8049f1df>] mutex_lock_nested+0x28/0x2a9
>  [<ffffffff80231294>] sched_destroy_group+0x18/0xea
>  [<ffffffff8023e835>] sched_destroy_user+0xd/0xf
>  [<ffffffff8023e8c1>] free_uid+0x8a/0xab
>  [<ffffffff80233e24>] __put_task_struct+0x3f/0xd3
>  [<ffffffff80236708>] delayed_put_task_struct+0x23/0x25
>  [<ffffffff8026fda7>] __rcu_process_callbacks+0x8d/0x215
>  [<ffffffff8026ff52>] rcu_process_callbacks+0x23/0x44
>  [<ffffffff8023a2ae>] __do_softirq+0x79/0xf8
>  [<ffffffff8020f8c3>] ? profile_pc+0x2a/0x67
>  [<ffffffff8020d38c>] call_softirq+0x1c/0x30
>  [<ffffffff8020f689>] do_softirq+0x61/0x9c
>  [<ffffffff8023a233>] irq_exit+0x51/0x53
>  [<ffffffff8021bd1a>] smp_apic_timer_interrupt+0x77/0xad
>  [<ffffffff8020ce3b>] apic_timer_interrupt+0x6b/0x70
>  <EOI>  [<ffffffff8020b0dd>] ? default_idle+0x43/0x76
>  [<ffffffff8020b0db>] ? default_idle+0x41/0x76
>  [<ffffffff8020b09a>] ? default_idle+0x0/0x76
>  [<ffffffff8020b186>] ? cpu_idle+0x76/0x98

separate the tg->shares protection from the task_group lock.

Reported-by: Denis V. Lunev <den@openvz.org>
Tested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:39 +01:00
Peter Zijlstra 720a2592cf hrtimer: more hrtimer_init_sleeper() fallout.
Missed an instance...

  futex_lock_pi()
    hrtimer_init_sleeper()
    rt_mutex_timed_lock()
      rt_mutex_timed_fastlock()
        rt_mutex_slowlock()
          hrtimer_start()

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-13 15:45:36 +01:00
H. Peter Anvin c98aa86df3 timeconst.pl: correct reversal of USEC_TO_HZ and HZ_TO_USEC
The USEC_TO_HZ and HZ_TO_USEC constant sets were mislabelled, with
seriously incorrect results.  This among other things manifested
itself as cpufreq not working when a tickless kernel was configured.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Tested-by: Carlos R. Mafra <crmafra@ift.unesp.br>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-12 14:29:26 -08:00
Oleg Nesterov c289b074b6 hrtimer: don't modify restart_block->fn in restart functions
hrtimer_nanosleep_restart() clears/restores restart_block->fn. This is
pointless and complicates its usage. Note that if sys_restart_syscall()
doesn't actually happen, we have a bogus "pending" restart->fn anyway,
this is harmless.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-10 10:48:03 +01:00
Oleg Nesterov 416529374b hrtimer: fix *rmtp/restarts handling in compat_sys_nanosleep()
Spotted by Pavel Emelyanov and Alexey Dobriyan.

compat_sys_nanosleep() implicitly uses hrtimer_nanosleep_restart(), this can't
work. Make a suitable compat_nanosleep_restart() helper.

Introduced by commit c70878b4e0
hrtimer: hook compat_sys_nanosleep up to high res timer code

Also, set ->addr_limit = KERNEL_DS before doing hrtimer_nanosleep(), this func
was changed by the previous patch and now takes the "__user *" parameter.

Thanks to Ingo Molnar for fixing the bug in this patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-10 10:48:03 +01:00
Oleg Nesterov 080344b988 hrtimer: fix *rmtp handling in hrtimer_nanosleep()
Spotted by Pavel Emelyanov and Alexey Dobriyan.

hrtimer_nanosleep() sets restart_block->arg1 = rmtp, but this rmtp points to
the local variable which lives in the caller's stack frame. This means that
if sys_restart_syscall() actually happens and it is interrupted as well, we
don't update the user-space variable, but write into the already dead stack
frame.

Introduced by commit 04c227140f
hrtimer: Rework hrtimer_nanosleep to make sys_compat_nanosleep easier

Change the callers to pass "__user *rmtp" to hrtimer_nanosleep(), and change
hrtimer_nanosleep() to use copy_to_user() to actually update *rmtp.

Small problem remains. man 2 nanosleep states that *rtmp should be written if
nanosleep() was interrupted (it says nothing whether it is OK to update *rmtp
if nanosleep returns 0), but (with or without this patch) we can dirty *rem
even if nanosleep() returns 0.

NOTE: this patch doesn't change compat_sys_nanosleep(), because it has other
bugs. Fixed by the next patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Alexey Dobriyan <adobriyan@sw.ru>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Pavel Emelyanov <xemul@sw.ru>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Toyo Abe <toyoa@mvista.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

 include/linux/hrtimer.h |    2 -
 kernel/hrtimer.c        |   51 +++++++++++++++++++++++++-----------------------
 kernel/posix-timers.c   |   14 +------------
 3 files changed, 30 insertions(+), 37 deletions(-)
2008-02-10 10:48:03 +01:00
john stultz e13a2e61dd ntp: correct inconsistent interval/tick_length usage
clocksource initialization and error accumulation.  This corrects a 280ppm
drift seen on some systems using acpi_pm, and affects other clocksources as
well (likely to a lesser degree).

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-02-10 10:48:03 +01:00
S.Çağlar Onur c1cb795338 Update kernel/.gitignore with new auto-generated files
Signed-off-by: S.Çağlar Onur <caglar@pardus.org.tr>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-09 23:27:01 -08:00
Linus Torvalds dde0013782 Merge branch 'for-2.6.25' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'for-2.6.25' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
  [POWERPC] Add arch-specific walk_memory_remove() for 64-bit powerpc
  [POWERPC] Enable hotplug memory remove for 64-bit powerpc
  [POWERPC] Add remove_memory() for 64-bit powerpc
  [POWERPC] Make cell IOMMU fixed mapping printk more useful
  [POWERPC] Fix potential cell IOMMU bug when switching back to default DMA ops
  [POWERPC] Don't enable cell IOMMU fixed mapping if there are no dma-ranges
  [POWERPC] Fix cell IOMMU null pointer explosion on old firmwares
  [POWERPC] spufs: Fix timing dependent false return from spufs_run_spu
  [POWERPC] spufs: No need to have a runnable SPU for libassist update
  [POWERPC] spufs: Update SPU_Status[CISHP] in backing runcntl write
  [POWERPC] spufs: Fix state_mutex leaks
  [POWERPC] Disable G5 NAP mode during SMU commands on U3
2008-02-08 09:31:42 -08:00
Ralf Baechle 46f4f8f665 IRQ_NOPROBE helper functions
Probing non-ISA interrupts using the handle_percpu_irq as their handle_irq
method may crash the system because handle_percpu_irq does not check
IRQ_WAITING.  This for example hits the MIPS Qemu configuration.

This patch provides two helper functions set_irq_noprobe and set_irq_probe to
set rsp.  clear the IRQ_NOPROBE flag.  The only current caller is MIPS code
but this really belongs into generic code.

As an aside, interrupt probing these days has become a mostly obsolete if not
dangerous art.  I think Linux interrupts should be changed to default to
non-probing but that's subject of this patch.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-and-tested-by: Rob Landley <rob@landley.net>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:42 -08:00
Yi Yang 06b2a76d25 Add new string functions strict_strto* and convert kernel params to use them
Currently, for every sysfs node, the callers will be responsible for
implementing store operation, so many many callers are doing duplicate
things to validate input, they have the same mistakes because they are
calling simple_strtol/ul/ll/uul, especially for module params, they are
just numeric, but you can echo such values as 0x1234xxx, 07777888 and
1234aaa, for these cases, module params store operation just ignores
succesive invalid char and converts prefix part to a numeric although input
is acctually invalid.

This patch tries to fix the aforementioned issues and implements
strict_strtox serial functions, kernel/params.c uses them to strictly
validate input, so module params will reject such values as 0x1234xxxx and
returns an error:

write error: Invalid argument

Any modules which export numeric sysfs node can use strict_strtox instead of
simple_strtox to reject any invalid input.

Here are some test results:

Before applying this patch:

[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]#

After applying this patch:

[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000g > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo 0x1000gggggggg > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# echo 010000 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# echo 0100008 > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# echo 010000aaaaa > /sys/module/e1000/parameters/copybreak
-bash: echo: write error: Invalid argument
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]# echo -n 4096 > /sys/module/e1000/parameters/copybreak
[root@yangyi-dev /]# cat /sys/module/e1000/parameters/copybreak
4096
[root@yangyi-dev /]#

[akpm@linux-foundation.org: fix compiler warnings]
[akpm@linux-foundation.org: fix off-by-one found by tiwai@suse.de]
Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Cc: Greg KH <greg@kroah.com>
Cc: "Randy.Dunlap" <rdunlap@xenotime.net>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:41 -08:00
Sam Ravnborg fa7303e22c cpu: fix section mismatch warnings for enable_nonboot_cpus
Fix following warning:
WARNING: o-x86_64/kernel/built-in.o(.text+0x36d8b): Section mismatch in reference from the function enable_nonboot_cpus() to the function .cpuinit.text:_cpu_up()

enable_nonboot_cpus() are used solely from CONFIG_CONFIG_PM_SLEEP_SMP=y
and PM_SLEEP_SMP imply HOTPLUG_CPU therefore the reference
to _cpu_up() is valid.
Annotate enable_nonboot_cpus() with __ref to silence modpost.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:41 -08:00
Pavel Emelyanov 48d13e483c Don't operate with pid_t in rtmutex tester
The proper behavior to store task's pid and get this task later is to get the
struct pid pointer and get the task with the pid_task() call.

Make it for rt_mutex_waiter->deadlock_task_pid field.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:41 -08:00
Pavel Emelyanov 8dc86af006 Use find_task_by_vpid in posix timers
All the functions that need to lookup a task by pid in posix timers obtain
this pid from a user space, and thus this value refers to a task in the same
namespace, as the current task lives in.

So the proper behavior is to call find_task_by_vpid() here.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:41 -08:00
H. Peter Anvin bdc807871d avoid overflows in kernel/time.c
When the conversion factor between jiffies and milli- or microseconds is
not a single multiply or divide, as for the case of HZ == 300, we currently
do a multiply followed by a divide.  The intervening result, however, is
subject to overflows, especially since the fraction is not simplified (for
HZ == 300, we multiply by 300 and divide by 1000).

This is exposed to the user when passing a large timeout to poll(), for
example.

This patch replaces the multiply-divide with a reciprocal multiplication on
32-bit platforms.  When the input is an unsigned long, there is no portable
way to do this on 64-bit platforms there is no portable way to do this
since it requires a 128-bit intermediate result (which gcc does support on
64-bit platforms but may generate libgcc calls, e.g.  on 64-bit s390), but
since the output is a 32-bit integer in the cases affected, just simplify
the multiply-divide (*3/10 instead of *300/1000).

The reciprocal multiply used can have off-by-one errors in the upper half
of the valid output range.  This could be avoided at the expense of having
to deal with a potential 65-bit intermediate result.  Since the intent is
to avoid overflow problems and most of the other time conversions are only
semiexact, the off-by-one errors were considered an acceptable tradeoff.

At Ralf Baechle's suggestion, this version uses a Perl script to compute
the necessary constants.  We already have dependencies on Perl for kernel
compiles.  This does, however, require the Perl module Math::BigInt, which
is included in the standard Perl distribution starting with version 5.8.0.
In order to support older versions of Perl, include a table of canned
constants in the script itself, and structure the script so that
Math::BigInt isn't required if pulling values from said table.

Running the script requires that the HZ value is available from the
Makefile.  Thus, this patch also adds the Kconfig variable CONFIG_HZ to the
architectures which didn't already have it (alpha, cris, frv, h8300, m32r,
m68k, m68knommu, sparc, v850, and xtensa.) It does *not* touch the sh or
sh64 architectures, since Paul Mundt has dealt with those separately in the
sh tree.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Ralf Baechle <ralf@linux-mips.org>,
Cc: Sam Ravnborg <sam@ravnborg.org>,
Cc: Paul Mundt <lethal@linux-sh.org>,
Cc: Richard Henderson <rth@twiddle.net>,
Cc: Michael Starvik <starvik@axis.com>,
Cc: David Howells <dhowells@redhat.com>,
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>,
Cc: Hirokazu Takata <takata@linux-m32r.org>,
Cc: Geert Uytterhoeven <geert@linux-m68k.org>,
Cc: Roman Zippel <zippel@linux-m68k.org>,
Cc: William L. Irwin <sparclinux@vger.kernel.org>,
Cc: Chris Zankel <chris@zankel.net>,
Cc: H. Peter Anvin <hpa@zytor.com>,
Cc: Jan Engelhardt <jengelh@computergmbh.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:39 -08:00
Joe Perches 7ef3d2fd17 printk_ratelimit() functions should use CONFIG_PRINTK
Makes an embedded image a bit smaller.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:39 -08:00
Li Zefan 6d141c3ff6 workqueue: make delayed_work_timer_fn() static
delayed_work_timer_fn() is a timer function, make it static.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:37 -08:00
Adrian Bunk a36219ac93 The scheduled 'time' option removal
The scheduled removal of the 'time' option.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:36 -08:00
Jesper Juhl efae09f3e9 Nuke duplicate header from sysctl.c
Don't include linux/security.h twice in kernel/sysctl.c

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:34 -08:00
Jesper Juhl f8db694e46 Nuke a duplicate include from profile.c
Remove duplicate inclusion of linux/profile.h from kernel/profile.c

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:34 -08:00
Jesper Juhl 2dc9c91315 Nuke duplicate include from printk.c
Remove the duplicate inclusion of linux/jiffies.h from kernel/printk.c

Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:34 -08:00
Jan Beulich 8b21985c91 constify tables in kernel/sysctl_check.c
Remains the question whether it is intended that many, perhaps even large,
tables are compiled in without ever having a chance to get used, i.e.
whether there shouldn't #ifdef CONFIG_xxx get added.

[akpm@linux-foundation.org: fix cut-n-paste error]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:31 -08:00
Harvey Harrison 7ad5b3a505 kernel: remove fastcall in kernel/*
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:31 -08:00
Li Zefan 3eb056764d time: fix typo in comments
Fix typo in comments.

BTW: I have to fix coding style in arch/ia64/kernel/time.c also, otherwise
checkpatch.pl will be complaining.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Li Zefan cf4fc6cb76 timekeeping: rename timekeeping_is_continuous to timekeeping_valid_for_hres
Function timekeeping_is_continuous() no longer checks flag
CLOCK_IS_CONTINUOUS, and it checks CLOCK_SOURCE_VALID_FOR_HRES now.  So rename
the function accordingly.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Li Zefan 0b858e6ff9 clockevent: simplify list operations
list_for_each_safe() suffices here.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Li Zefan 818c357802 clocksource: remove redundant code
Flag CLOCK_SOURCE_WATCHDOG is cleared twice.  Note clocksource_change_rating()
won't do anyting with the cs flag.

Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Pavel Emelyanov 146a505d49 Get rid of the kill_pgrp_info() function
There's only one caller left - the kill_pgrp one - so merge these two
functions and forget the kill_pgrp_info one.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Pavel Emelyanov d5df763b81 Clean up the kill_something_info
This is the first step (of two) in removing the kill_pgrp_info.

All the users of this function are in kernel/signal.c, but all they need is to
call __kill_pgrp_info() with the tasklist_lock read-locked.

Fortunately, one of its users is the kill_something_info(), which already
needs this lock in one of its branches, so clean these branches up and call
the __kill_pgrp_info() directly.

Based on Oleg's view of how this function should look.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Pavel Emelyanov 6c5f3e7b43 Pidns: make full use of xxx_vnr() calls
Some time ago the xxx_vnr() calls (e.g.  pid_vnr or find_task_by_vpid) were
_all_ converted to operate on the current pid namespace.  After this each call
like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
one.

Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
appropriate.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Reviewed-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Oleg Nesterov fea9d17554 ITIMER_REAL: convert to use struct pid
signal_struct->tsk points to the ->group_leader and thus we have the nasty
code in de_thread() which has to change it and restart ->real_timer if the
leader is changed.

Use "struct pid *leader_pid" instead.  This also allows us to kill now
unneeded send_group_sig_info().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:29 -08:00
Oleg Nesterov d36174bc2b uglify kill_pid_info() to fix kill() vs exec() race
kill_pid_info()->pid_task() could be the old leader of the execing process.
In that case it is possible that the leader will be released before we take
siglock. This means that kill_pid_info() (and thus sys_kill()) can return a
false -ESRCH.

Change the code to retry when lock_task_sighand() fails. The endless loop is
not possible, __exit_signal() both clears ->sighand and does detach_pid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:28 -08:00
Oleg Nesterov ac9a8e3f0f sys_getsid: don't use ->nsproxy directly
With the new semantics of find_vpid() we don't need to play with ->nsproxy
explicitely, _vxx() do the right things.

Also s/tasklist/rcu/.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:28 -08:00
Eric W. Biederman 44c4e1b258 pid: Extend/Fix pid_vnr
pid_vnr returns the user space pid with respect to the pid namespace the
struct pid was allocated in.  What we want before we return a pid to user
space is the user space pid with respect to the pid namespace of current.

pid_vnr is a very nice optimization but because it isn't quite what we want
it is easy to use pid_vnr at times when we aren't certain the struct pid
was allocated in our pid namespace.

Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c the
parent process reported in the core dumps and the parent process in
get_signal_to_deliver.

So unless the performance impact is huge having an interface that does what
we want instead of always what we want should be much more reliable and
much less error prone.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Eric W. Biederman 161550d74c pid: sys_wait... fixes
This modifies do_wait and eligible child to take a pair of enum pid_type
and struct pid *pid to precisely specify what set of processes are eligible
to be waited for, instead of the raw pid_t value from sys_wait4.

This fixes a bug in sys_waitid where you could not wait for children in
just process group 1.

This fixes a pid namespace crossing case in eligible_child.  Allowing us to
wait for a processes in our current process group even if our current
process group == 0.

This allows the no child with this pid case to be optimized.  This allows
us to optimize the pid membership test in eligible child to be optimized.

This even closes a theoretical pid wraparound race where in a threaded
parent if two threads are waiting for the same child and one thread picks
up the child and the pid numbers wrap around and generate another child
with that same pid before the other thread is scheduled (teribly insanely
unlikely) we could end up waiting on the second child with the same pid#
and not discover that the specific child we were waiting for has exited.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 5dee1707df move the related code from exit_notify() to exit_signals()
The previous bugfix was not optimal, we shouldn't care about group stop
when we are the only thread or the group stop is in progress.  In that case
nothing special is needed, just set PF_EXITING and return.

Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().

So, from the performance POV the only difference is that we don't trust
!signal_pending() until we take ->siglock.  But this in fact fixes another
___pure___ theoretical minor race.  __group_complete_signal() finds the
task without PF_EXITING and chooses it as the target for signal_wake_up().
But nothing prevents this task from exiting in between without noticing the
pending signal and thus unpredictably delaying the actual delivery.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 6806aac6d2 sys_setsid: remove now unneeded session != 1 check
Eric's "fix clone(CLONE_NEWPID)" eliminated the last reason for this hack.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov d12619b5ff fix group stop with exit race
do_signal_stop() counts all sub-thread and sets ->group_stop_count
accordingly.  Every thread should decrement ->group_stop_count and stop,
the last one should notify the parent.

However a sub-thread can exit before it notices the signal_pending(), or it
may be somewhere in do_exit() already.  In that case the group stop never
finishes properly.

Note: this is a minimal fix, we can add some optimizations later.  Say we
can return quickly if thread_group_empty().  Also, we can move some signal
related code from exit_notify() to exit_signals().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 430c623121 start the global /sbin/init with 0,0 special pids
As Eric pointed out, there is no problem with init starting with sid == pgid
== 0, and this was historical linux behavior changed in 2.6.18.

Remove kernel_init()->__set_special_pids(), this is unneeded and complicates
the rules for sys_setsid().

This change and the previous change in daemonize() mean that /sbin/init does
not need the special "session != 1" hack in sys_setsid() any longer. We can't
remove this check yet, we should cleanup copy_process(CLONE_NEWPID) first, so
update the comment only.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 297bd42b15 move daemonized kernel threads into the swapper's session
Daemonized kernel threads run in the init's session. This doesn't match the
behaviour of kthread_create()'ed threads, and this is one of the 2 reasons
why we need a special hack in sys_setsid().

Now that set_special_pids() was changed to use struct pid, not pid_t, we can
use init_struct_pid and set 0,0 special pids.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 8520d7c7f8 teach set_special_pids() to use struct pid
Change set_special_pids() to work with struct pid, not pid_t from global name
space. This again speedups and imho cleanups the code, also a preparation for
the next patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov e4cc0a9c87 fix setsid() for sub-namespace /sbin/init
sys_setsid() still deals with pid_t's from the global namespace. This means
that the "session > 1" check can't help for sub-namespace init, setsid() can't
succeed because copy_process(CLONE_NEWPID) populates PIDTYPE_PGID/SID links.

Remove the usage of task_struct->pid and convert the code to use "struct pid".
This also simplifies and speedups the code, saves one find_pid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 4e021306cf sys_setpgid(): simplify pid/ns interaction
sys_setpgid() does unneeded conversions from pid_t to "struct pid" and vice
versa.  Use "struct pid" more consistently.  Saves one find_vpid() and
eliminates the explicit usage of ->nsproxy->pid_ns.  Imho, cleanups the
code.

Also use the same_thread_group() helper.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov c543f1ee08 wait_task_zombie: remove ->exit_state/exit_signal checks for WNOWAIT
The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense.
The exit_state was EXIT_ZOMBIE when the function was called, and another
thread can change it to EXIT_DEAD right after the check.

The second condition is not possible, detached non-traced threads were already
filtered out by eligible_child(), we didn't drop tasklist since then.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:27 -08:00
Oleg Nesterov 3a515e4a62 wait_task_continued/zombie: don't use task_pid_nr_ns() lockless
Surprise, the other two wait_task_*() functions also abuse the
task_pid_nr_ns() function, and may cause read-after-free or report nr == 0
in wait_task_continued().  wait_task_zombie() doesn't have this problem,
but it is still better to cache pid_t rather than call task_pid_nr_ns()
three times on the saved pid_namespace.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov f2cc3eb133 do_wait: fix security checks
Imho, the current usage of security_task_wait() is not logical.

Suppose we have the single child p, and security_task_wait(p) return
-EANY.  In that case waitpid(-1) returns this error.  Why? Isn't it
better to return ECHLD? We don't really have reapable children.

Now suppose that child was stolen by gdb.  In that case we find this
child on ->ptrace_children and set flag = 1, but we don't check that the
child was denied.  So, do_wait(..., WNOHANG) returns 0, this doesn't
match the behaviour above.  Without WNOHANG do_wait() blocks only to
return the error later, when the child will be untraced.  Inho, really
strange.

I think eligible_child() should return the error only if the child's pid
was requested explicitly, otherwise we should silently ignore the tasks
which were nacked by security_task_wait().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 96fabbf55a do_wait: cleanup delay_group_leader() usage
eligible_child() == 2 means delay_group_leader().  With the previous patch
this only matters for EXIT_ZOMBIE task, we can move that special check to
the only place it is really needed.

Also, with this patch we don't skip security_task_wait() for the group
leaders in a non-empty thread group.  I don't really understand the exact
semantics of security_task_wait(), but imho this change is a bugfix.

Also rearrange the code a bit to kill an ugly "check_continued" backdoor.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Eric Paris <eparis@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Roland McGrath <roland@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 1bad95c3be wait_task_stopped(): remove unneeded delay_group_leader check
wait_task_stopped() doesn't need the "delay_group_leader" parameter.  If
the child is not traced it must be a group leader.  With or without
subthreads ->group_stop_count == 0 when the whole task is stopped.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Mika Penttila <mika.penttila@kolumbus.fi>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 20686a309a ptrace_stop: fix racy nonstop_code setting
If the tracer is gone and we are not going to stop, ptrace_stop() sets
->exit_code = nostop_code.  However, the tracer could actually clear the
exit code before detaching.  In that case get_signal_to_deliver() "resends"
the signal which was cancelled by the debugger.  For example, it is
possible that a quick PTRACE_ATTACH + PTRACE_DETACH can leave the tracee in
STOPPED state.

Change the behaviour of ptrace_stop().  If the caller is ptrace notify(),
we should always clear ->exit_code.  If the caller is
get_signal_to_deliver(), we should not touch it at all.  To do so, change
the nonstop_code parameter to "bool clear_code" and change the callers
accordingly.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 9cbab81005 do_wait: factor out "retval != 0" checks
Every branch if the main "if" statement does the same code at the end.  Move
it down.  Also, fix the indentation.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov ee7c82da83 wait_task_stopped: simplify and fix races with SIGCONT/SIGKILL/untrace
wait_task_stopped() has multiple races with SIGCONT/SIGKILL.  tasklist_lock
does not pin the child in TASK_TRACED/TASK_STOPPED stated, almost all info
reported (including exit_code) may be wrong.

In fact, the code under write_lock_irq(tasklist_lock) is not safe.  The child
may be PTRACE_DETACH'ed at this time by another subthread, in that case it is
possible we are no longer its ->parent.

Change wait_task_stopped() to take ->siglock before inspecting the task.  This
guarantees that the child can't resume and (for example) clear its
->exit_code, so we don't need to use xchg(&p->exit_code) and re-check.  The
only exception is ptrace_stop() which changes ->state and ->exit_code without
->siglock held during abort.  But this can only happen if both the tracer and
the tracee are dying (coredump is in progress), we don't care.

With this patch wait_task_stopped() doesn't move the child to the end of
the ->parent list on success.  This optimization could be restored, but
in that case we have to take write_lock(tasklist) and do some nasty
checks.

Also change the do_wait() since we don't return EAGAIN any longer.

[akpm@linux-foundation.org: fix up after Willy renamed everything]
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 6405f7f467 ptrace_stop: fix the race with ptrace detach+attach
If the tracer went away (may_ptrace_stop() failed), ptrace_stop() drops
tasklist and then changes the ->state from TASK_TRACED to TASK_RUNNING.

This can fool another tracer which attaches to us in between.  Change the
->state under tasklist_lock to ensure that ptrace_check_attach() can't wrongly
succeed.  Also, remove the unnecessary mb().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov c0c0b649d6 ptrace_check_attach: remove unneeded ->signal != NULL check
It is not possible to see the PT_PTRACED task without ->signal/sighand under
tasklist_lock, release_task() does ptrace_unlink() first.  If the task was
already released before, ptrace_attach() can't succeed and set PT_PTRACED.
Remove this check.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 34a1738f7d kill my_ptrace_child()
Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED"
inline and simplify the corresponding logic in do_wait: we can't find the
child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace()
either sets TASK_STOPPED or wakes up the tracee.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Oleg Nesterov 6b39c7bfbd kill PT_ATTACHED
Since the patch

	"Fix ptrace_attach()/ptrace_traceme()/de_thread() race"
	commit f5b40e363a

we set PT_ATTACHED and change child->parent "atomically" wrt task_list lock.

This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer"
which were needed to catch the "ptrace attach is in progress" case.  We can
also remove the flag itself since nobody else uses it.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:26 -08:00
Andrew Morton 92dfc9dc7b fix "modules: make module_address_lookup() safe"
Get the constness right, avoid nasty cast.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Christoph Lameter 6d7623943c modules: include sections.h to avoid defining linker variables explicitly
module.c should not define linker variables on its own. We have an include
file for that.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Christoph Lameter 88173507e4 Modules: handle symbols that have a zero value
The module subsystem cannot handle symbols that are zero.  If symbols are
present that have a zero value then the module resolver prints out a
message that these symbols are unresolved.

[akinobu.mita@gmail.com: fix __find_symbl() error checks]
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Kay Sievers <kay.sievers@vrfy.org
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Eric W. Biederman df5f8314ca proc: seqfile convert proc_pid_status to properly handle pid namespaces
Currently we possibly lookup the pid in the wrong pid namespace.  So
seq_file convert proc_pid_status which ensures the proper pid namespaces is
passed in.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: another build fix]
[akpm@linux-foundation.org: s390 build fix]
[akpm@linux-foundation.org: fix task_name() output]
[akpm@linux-foundation.org: fix nommu build]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Andrew Morgan <morgan@kernel.org>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:24 -08:00
Pavel Emelyanov 74bd59bb39 namespaces: cleanup the code managed with PID_NS option
Just like with the user namespaces, move the namespace management code into
the separate .c file and mark the (already existing) PID_NS option as "depend
on NAMESPACES"

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Pavel Emelyanov aee16ce73c namespaces: cleanup the code managed with the USER_NS option
Make the user_namespace.o compilation depend on this option and move the
init_user_ns into user.c file to make the kernel compile and work without the
namespaces support.  This make the user namespace code be organized similar to
other namespaces'.

Also mask the USER_NS option as "depend on NAMESPACES".

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Pavel Emelyanov ae5e1b22f1 namespaces: move the IPC namespace under IPC_NS option
Currently the IPC namespace management code is spread over the ipc/*.c files.
I moved this code into ipc/namespace.c file which is compiled out when needed.

The linux/ipc_namespace.h file is used to store the prototypes of the
functions in namespace.c and the stubs for NAMESPACES=n case.  This is done
so, because the stub for copy_ipc_namespace requires the knowledge of the
CLONE_NEWIPC flag, which is in sched.h.  But the linux/ipc.h file itself in
included into many many .c files via the sys.h->sem.h sequence so adding the
sched.h into it will make all these .c depend on sched.h which is not that
good.  On the other hand the knowledge about the namespaces stuff is required
in 4 .c files only.

Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
msg.c and shm.c files.  It turned out that moving these functions into
namespaces.c is not that easy because they use many other calls and macros
from the original file.  Moving them would make this patch complicated.  On
the other hand all these functions can be consolidated, so I will send a
separate patch doing this a bit later.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Pavel Emelyanov 58bfdd6dee namespaces: move the UTS namespace under UTS_NS option
Currently all the namespace management code is in the kernel/utsname.c file,
so just compile it out and make stubs in the appropriate header.

The init namespace itself is in init/version.c and is in the kernel all the
time.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Nishanth Aravamudan a3d0c6aa1b hugetlb: add locking for overcommit sysctl
When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly.  However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock.  Add a
callback into the hugetlb code similar to the one for nr_hugepages.  Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-08 09:22:23 -08:00
Badari Pulavarty a99824f327 [POWERPC] Add arch-specific walk_memory_remove() for 64-bit powerpc
walk_memory_resource() verifies if there are holes in a given memory
range, by checking against /proc/iomem.  On x86/ia64 system memory is
represented in /proc/iomem.  On powerpc, we don't show system memory as
IO resource in /proc/iomem - instead it's maintained in
/proc/device-tree.

This provides a way for an architecture to provide its own
walk_memory_resource() function.  On powerpc, the memory region is
small (16MB), contiguous and non-overlapping.  So extra checking
against the device-tree is not needed.

Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Kumar Gala <galak@gate.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
2008-02-08 19:52:48 +11:00
Linus Torvalds f0f1b3364a Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (112 commits)
  ACPI: fix build warning
  Revert "cpuidle: build fix for non-x86"
  ACPI: update intrd DSDT override console messages
  ACPI: update DSDT override documentation
  ACPI: Add "acpi_no_initrd_override" kernel parameter
  ACPI: its a directory not a folder....
  ACPI: misc cleanups
  ACPI: add missing prink prefix strings
  ACPI: cleanup acpi.h
  ACPICA: fix CONFIG_ACPI_DEBUG_FUNC_TRACE build
  ACPI: video: Ignore ACPI video devices that aren't present in hardware
  ACPI: video: reset brightness on resume
  ACPI: video: call ACPI notifier chain for ACPI video notifications
  ACPI: create notifier chain to get hotkey events to graphics driver
  ACPI: video: delete unused display switch on hotkey event code
  ACPI: video: create "brightness_switch_enabled" modparam
  cpuidle: Add a poll_idle method
  ACPI: cpuidle: Support C1 idle time accounting
  ACPI: enable MWAIT for C1 idle
  ACPI: idle: Fix acpi_safe_halt usages and interrupt enabling/disabling
  ...
2008-02-07 09:45:58 -08:00
Ken'ichi Ohmichi bba1f603b8 vmcoreinfo: add "VMCOREINFO_" to all the call for vmcoreinfo_append_str()
For readability, all the calls to vmcoreinfo_append_str() are changed to macros
having a prefix "VMCOREINFO_".

This discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.3/0584.html

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Acked-by: Simon Horman <horms@verge.net.au>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:25 -08:00
Ken'ichi Ohmichi c76f860c44 vmcoreinfo: rename vmcoreinfo's macros returning the size
This patchset is for the vmcoreinfo data.

The vmcoreinfo data has the minimum debugging information only for dump
filtering.  makedumpfile (dump filtering command) gets it to distinguish
unnecessary pages, and makedumpfile creates a small dumpfile.

This patch:

VMCOREINFO_SIZE() should be renamed VMCOREINFO_STRUCT_SIZE() since it's always
returning the size of the struct with a given name. This change would allow
VMCOREINFO_TYPEDEF_SIZE() to simply become VMCOREINFO_SIZE() since it need not
be used exclusively for typedefs.

This discussion is the following:
http://www.ussg.iu.edu/hypermail/linux/kernel/0709.3/0582.html

Signed-off-by: Ken'ichi Ohmichi <oomichi@mxs.nes.nec.co.jp>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:25 -08:00
Pavel Emelyanov 73507f335f Handle pid namespaces in cgroups code
There's one place that works with task pids - its the "tasks" file in cgroups.
 The read/write handlers assume, that the pid values go to/come from the user
space and thus it is a virtual pid, i.e.  the pid as it is seen from inside a
namespace.

Tune the code accordingly.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Paul Jackson b450129554 hotplug cpu move tasks in empty cpusets - refinements
- Narrow the scope of callback_mutex in scan_for_empty_cpusets().

- Avoid rewriting the cpus, mems of cpusets except when it is likely that
  we'll be changing them.

- Have remove_tasks_in_empty_cpuset() also check for empty mems.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Paul Jackson c8d9c90c7e hotplug cpu: move tasks in empty cpusets to parent various other fixes
Various minor formatting and comment tweaks to Cliff Wickman's
[PATCH_3_of_3]_cpusets__update_cpumask_revision.patch

I had had "iff", meaning "if and only if" in a comment.  However, except for
ancient mathematicians, the abbreviation "iff" was a tad too cryptic.  Cliff
changed it to "if", presumably figuring that the "iff" was a typo.  However,
it was the "only if" half of the conjunction that was most interesting.
Reword to emphasis the "only if" aspect.

The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
callback_mutex had to be held on entry.  The opposite is true.

Several mentions of attach_task() in comments needed to be
changed to cgroup_attach_task().

A comment about notify_on_release was no longer relevant,
as the line of code it had commented, namely:
	set_bit(CS_RELEASED_RESOURCE, &parent->flags);
is no longer present in that place in the cpuset.c code.

Similarly a comment about notify_on_release before the
scan_for_empty_cpusets() routine was no longer relevant.

Removed extra parentheses and unnecessary return statement.

Renamed attach_task() to cpuset_attach() in various comments.

Removed comment about not needing memory migration, as it seems the migration
is done anyway, via the cpuset_attach() callback from cgroup_attach_task().

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Cliff Wickman <cpw@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Paul Menage 2df167a300 cgroups: update comments in cpuset.c
Some of the comments in kernel/cpuset.c were stale following the
transition to control groups; this patch updates them to more closely
match reality.

Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Cliff Wickman 58f4790b73 cpusets: update_cpumask revision
Use the new function cgroup_scan_tasks() to step through all tasks in a
cpuset.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Cliff Wickman 956db3ca06 hotplug cpu: move tasks in empty cpusets to parent
This patch corrects a situation that occurs when one disables all the cpus in
a cpuset.

Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
which is incorrect because it may then overlap its cpu-exclusive sibling.

Tasks of an empty cpuset should be moved to the cpuset which is the parent of
their current cpuset.  Or if the parent cpuset has no cpus, to its parent,
etc.

And the empty cpuset should be released (if it is flagged notify_on_release).

Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
iterate through all tasks in the cpu-less cpuset.  We are deliberately
avoiding a walk of the tasklist.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
Cliff Wickman 31a7df01fd cgroups: mechanism to process each task in a cgroup
Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
calling a test function and a process function for each.  And call the process
function without holding the css_set_lock lock.

The idea is David Rientjes', predicting that such a function will make it much
easier in the future to extend things that require access to each task in a
cgroup without holding the lock,

[akpm@linux-foundation.org: cleanup]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:22 -08:00
KAMEZAWA Hiroyuki 4fca88c87b memory cgroup enhancements: add- pre_destroy() handler
Add a handler "pre_destroy" to cgroup_subsys.  It is called before
cgroup_rmdir() checks all subsys's refcnt.

I think this is useful for subsys which have some extra refs even if there
are no tasks in cgroup.  By adding pre_destroy(), the kernel keeps the rule
"destroy() against subsystem is called only when refcnt=0." and allows css
ref to be used by other objects than tasks.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Paul Menage <menage@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:20 -08:00
David Rientjes fef1bdd68c oom: add sysctl to enable task memory dump
Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
dump of all system tasks (excluding kernel threads) when performing an
OOM-killing.  Information includes pid, uid, tgid, vm size, rss, cpu,
oom_adj score, and name.

This is helpful for determining why there was an OOM condition and which
rogue task caused it.

It is configurable so that large systems, such as those with several
thousand tasks, do not incur a performance penalty associated with dumping
data they may not desire.

If an OOM was triggered as a result of a memory controller, the tasklist
shall be filtered to exclude tasks that are not a member of the same
cgroup.

Cc: Andrea Arcangeli <andrea@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:19 -08:00
Balbir Singh 0eea103017 Memory controller improve user interface
Change the interface to use bytes instead of pages.  Page sizes can vary
across platforms and configurations.  A new strategy routine has been added
to the resource counters infrastructure to format the data as desired.

Suggested by David Rientjes, Andrew Morton and Herbert Poetzl

Tested on a UML setup with the config for memory control enabled.

[kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Pavel Emelianov 78fb74669e Memory controller: accounting setup
Basic setup routines, the mm_struct has a pointer to the cgroup that
it belongs to and the the page has a page_cgroup associated with it.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Pavel Emelianov e552b66170 Memory controller: resource counters
With fixes from David Rientjes <rientjes@google.com>

Introduce generic structures and routines for resource accounting.

Each resource accounting cgroup is supposed to aggregate it,
cgroup_subsystem_state and its resource-specific members within.

Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Kirill Korotaev <dev@sw.ru>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Adrian Bunk e9685a03c8 kernel/cgroup.c: make 2 functions static
cgroup_is_releasable() and notify_on_release() should be static,
not global inline.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Paul Menage 8dc4f3e17d cgroups: move cgroups destroy() callbacks to cgroup_diput()
Move the calls to the cgroup subsystem destroy() methods from
cgroup_rmdir() to cgroup_diput().  This allows control file reads and
writes to access their subsystem state without having to be concerned with
locking against cgroup destruction - the control file dentry will keep the
cgroup and its subsystem state objects alive until the file is closed.

The documentation is updated to reflect the changed semantics of destroy();
additionally the locking comments for destroy() and some other methods were
clarified and decrustified.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Paul Jackson 622d42cac9 cgroup simplify space stripping
Simplify the space stripping code in cgroup file write.

[akpm@linux-foundation.org: s/BUG_ON/BUILD_BUG_ON/]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:18 -08:00
Paul Jackson e18f6318e5 cgroup brace coding style fix
Coding style fix - one line conditionals don't get braces.

Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:17 -08:00
Adrian Bunk 3cdeed2986 kernel/cgroup.c: remove dead code
This patch removes dead code spotted by the Coverity checker
(look at the "(nbytes >= PATH_MAX)" check).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Paul Jackson <pj@sgi.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:17 -08:00
Pavel Emelyanov eccba06891 gfs2: make gfs2_glock.gl_owner_pid be a struct pid *
The gl_owner_pid field is used to get the lock owning task by its pid, so make
it in a proper manner, i.e.  by using the struct pid pointer and pid_task()
function.

The pid_task() becomes exported for the gfs2 module.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-07 08:42:06 -08:00
Len Brown 81e242d0ef Merge branches 'release' and 'dsdt-override' into release 2008-02-07 04:01:53 -05:00
Pavel Machek 23b168d425 PM: documentation cleanups
Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-07 01:27:17 -05:00
Éric Piel 6ed31e92e9 ACPI: Taint kernel on ACPI table override (format corrected)
When an ACPI table is overridden (for now this can happen only for DSDT)
display a big warning and taint the kernel with flag A.

Signed-off-by: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-06 22:07:51 -05:00
Abhishek Sagar f47cd9b553 kprobes: kretprobe user entry-handler
Provide support to add an optional user defined callback to be run at
function entry of a kretprobe'd function.  Also modify the kprobe smoke
tests to include an entry-handler during the kretprobe sanity test.

Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Acked-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:11 -08:00
Andrew Morton ec03d70739 speed up jiffies conversion functions if HZ==USER_HZ
Avoid calling do_div(x, 1) in this case.

Cc: David Fries <david@fries.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:10 -08:00
David Fries 6ffc787a44 system timer: fix crash in <100Hz system timer
The kernel has a divide by zero crash when trying to run the system timer
less than 100Hz.  The problem is x/(HZ/USER_HZ) and related.  Now
x*(USER_HZ/HZ) will be used if HZ<USER_HZ.

I'm running the Linux kernel under qemu and went to run a slower system
timer to take less CPU (and battery) on the host.  I found that the kernel
paniced under emulation because of a divide by zero in three places.  Here
is the patch.  The base git was updated today 01-05-2008.  I went for a
20Hz system time by adding config HZ_20 etc to kernel/Kconfig.hz.  With
this patch I verified the system timer by looking at /proc/interrupts.

[akpm@linux-foundation.org: partially clean up the macro maze]
Signed-off-by: David Fries <david@fries.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:10 -08:00
Eric Dumazet 1bf47346d7 kernel/sys.c: get rid of expensive divides in groups_sort()
groups_sort() can be quite long if user loads a large gid table.

This is because GROUP_AT(group_info, some_integer) uses an integer divide.
So having to do XXX thousand divides during one syscall can lead to very
high latencies.  (NGROUPS_MAX=65536)

In the past (25 Mar 2006), an analog problem was found in groups_search()
(commit d74beb9f33 ) and at that time I
changed some variables to unsigned int.

I believe that a more generic fix is to make sure NGROUPS_PER_BLOCK is
unsigned.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:09 -08:00
Adrian Bunk 6b2fb3c658 idle_regs() must be __cpuinit
Fix the following section mismatch with CONFIG_HOTPLUG=n,
CONFIG_HOTPLUG_CPU=y:

WARNING: vmlinux.o(.text+0x399a6): Section mismatch: reference to .init.text.5:idle_regs (between 'fork_idle' and 'get_task_mm')

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:08 -08:00
Richard Knutsson eb38a996eb kernel/params.c: remove sparse-warning (different signedness)
Fixing:
  CHECK   kernel/params.c
kernel/params.c:329:41: warning: incorrect type in argument 8 (different signedness)
kernel/params.c:329:41:    expected int *num
kernel/params.c:329:41:    got unsigned int *

Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:08 -08:00
Daniel Walker 6c6080f74c stopmachine: semaphore to mutex
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Daniel Walker <dwalker@mvista.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:08 -08:00
Roland McGrath 1a669c2f16 Add arch_ptrace_stop
This adds support to allow asm/ptrace.h to define two new macros,
arch_ptrace_stop_needed and arch_ptrace_stop.  These control special
machine-specific actions to be done before a ptrace stop.  The new code
compiles away to nothing when the new macros are not defined.  This is the
case on all machines to begin with.

On ia64, these macros will be defined to solve the long-standing issue of
ptrace vs register backing store.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Petr Tesarik <ptesarik@suse.cz>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:07 -08:00
Nick Piggin a1e096129b relay: nopage
Convert relay from nopage to fault.
Remove redundant vma range checks.
Switch from OOM to SIGBUS if the resource is not available.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:07 -08:00
Eric Dumazet 9cfe015aa4 get rid of NR_OPEN and introduce a sysctl_nr_open
NR_OPEN (historically set to 1024*1024) actually forbids processes to open
more than 1024*1024 handles.

Unfortunatly some production servers hit the not so 'ridiculously high
value' of 1024*1024 file descriptors per process.

Changing NR_OPEN is not considered safe because of vmalloc space potential
exhaust.

This patch introduces a new sysctl (/proc/sys/fs/nr_open) wich defaults to
1024*1024, so that admins can decide to change this limit if their workload
needs it.

[akpm@linux-foundation.org: export it for sparc64]
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:06 -08:00
Oleg Nesterov 0a76fe8e50 do_wait: remove one "else if" branch
Minor cleanup. We can remove one "else if" branch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:04 -08:00
Denys Vlasenko eed4a2aba7 printk.c: use unsigned ints instead of longs for logbuf index
Stop using unsigned _longs_ for printk buffer indexes.  Log buffer is way
smaller than 2 gigabytes and unsigned ints will work too .  Indeed, they do
work nicely on all 32-bit platforms where longs and ints are the same.

With this patch, we have following size savings on amd64:

   text    data     bss     dec     hex filename
   5997     313   17736   24046    5dee 2.6.23.1.t64/kernel/printk.o
   5858     313   17700   23871    5d3f 2.6.23.1.printk.t64/kernel/printk.o

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:04 -08:00
Miao Xie 5e2cb1018a time: fix sysfs_show_{available,current}_clocksources() buffer overflow problem
I found that there is a buffer overflow problem in the following code.

Version:	2.6.24-rc2,
File:		kernel/time/clocksource.c:417-432
--------------------------------------------------------------------
static ssize_t
sysfs_show_available_clocksources(struct sys_device *dev, char *buf)
{
	struct clocksource *src;
	char *curr = buf;

	spin_lock_irq(&clocksource_lock);
	list_for_each_entry(src, &clocksource_list, list) {
		curr += sprintf(curr, "%s ", src->name);
	}
	spin_unlock_irq(&clocksource_lock);

	curr += sprintf(curr, "\n");

	return curr - buf;
}
-----------------------------------------------------------------------

sysfs_show_current_clocksources() also has the same problem though in practice
the size of current clocksource's name won't exceed PAGE_SIZE.

I fix the bug by using snprintf according to the specification of the kernel
(Version:2.6.24-rc2,File:Documentation/filesystems/sysfs.txt)

Fix sysfs_show_available_clocksources() and sysfs_show_current_clocksources()
buffer overflow problem with snprintf().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:03 -08:00
Adrian Bunk c166f23cb5 kernel/notifier.c should #include <linux/reboot.h>
Every file should include the headers containing the prototypes for its global
functions (in this case {,un}register_reboot_notifier()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:02 -08:00
Adrian Bunk bb695170d8 make srcu_readers_active() static
Make the needlessly global srcu_readers_active() static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:02 -08:00
Adrian Bunk f17d30a803 kernel/ptrace.c should #include <linux/syscalls.h>
Every file should include the headers containing the prototypes for its global
functions (in this case sys_ptrace()).

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:02 -08:00
Robin Getz a3b81113fb remove support for un-needed _extratext section
When passing a zero address to kallsyms_lookup(), the kernel thought it was
a valid kernel address, even if it is not.  This is because is_ksym_addr()
called is_kernel_extratext() and checked against labels that don't exist on
many archs (which default as zero).  Since PPC was the only kernel which
defines _extra_text, (in 2005), and no longer needs it, this patch removes
_extra_text support.

For some history (provided by Jon):
 http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019734.html
 http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019736.html
 http://ozlabs.org/pipermail/linuxppc-dev/2005-September/019751.html

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Robin Getz <rgetz@blackfin.uclinux.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Jon Loeliger <jdl@freescale.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:01 -08:00
Oleg Nesterov d9ae90ac4b use __set_task_state() for TRACED/STOPPED tasks
1. It is much easier to grep for ->state change if __set_task_state() is used
   instead of the direct assignment.

2. ptrace_stop() and handle_group_stop() use set_task_state() which adds the
   unneeded mb() (btw even if we use mb() it is still possible that do_wait()
   sees the new ->state but not ->exit_code, but this is ok).

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:00 -08:00
Michael Neuling 06b8e878a9 taskstats scaled time cleanup
This moves the ability to scale cputime into generic code.  This allows us
to fix the issue in kernel/timer.c (noticed by Balbir) where we could only
add an unscaled value to the scaled utime/stime.

This adds a cputime_to_scaled function.  As before, the POWERPC version
does the scaling based on the last SPURR/PURR ratio calculated.  The
generic and s390 (only other arch to implement asm/cputime.h) versions are
both NOPs.

Also moves the SPURR and PURR snapshots closer.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-06 10:41:00 -08:00
Mark Gross f011e2e2df latency.c: use QoS infrastructure
Replace latency.c use with pm_qos_params use.

Signed-off-by: mark gross <mgross@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:22 -08:00
Mark Gross d82b35186e pm qos infrastructure and interface
The following patch is a generalization of the latency.c implementation done
by Arjan last year.  It provides infrastructure for more than one parameter,
and exposes a user mode interface for processes to register pm_qos
expectations of processes.

This interface provides a kernel and user mode interface for registering
performance expectations by drivers, subsystems and user space applications on
one of the parameters.

Currently we have {cpu_dma_latency, network_latency, network_throughput} as
the initial set of pm_qos parameters.

The infrastructure exposes multiple misc device nodes one per implemented
parameter.  The set of parameters implement is defined by pm_qos_power_init()
and pm_qos_params.h.  This is done because having the available parameters
being runtime configurable or changeable from a driver was seen as too easy to
abuse.

For each parameter a list of performance requirements is maintained along with
an aggregated target value.  The aggregated target value is updated with
changes to the requirement list or elements of the list.  Typically the
aggregated target value is simply the max or min of the requirement values
held in the parameter list elements.

>From kernel mode the use of this interface is simple:

pm_qos_add_requirement(param_id, name, target_value):

  Will insert a named element in the list for that identified PM_QOS
  parameter with the target value.  Upon change to this list the new target is
  recomputed and any registered notifiers are called only if the target value
  is now different.

pm_qos_update_requirement(param_id, name, new_target_value):

  Will search the list identified by the param_id for the named list element
  and then update its target value, calling the notification tree if the
  aggregated target is changed.  with that name is already registered.

pm_qos_remove_requirement(param_id, name):

  Will search the identified list for the named element and remove it, after
  removal it will update the aggregate target and call the notification tree
  if the target was changed as a result of removing the named requirement.

>From user mode:

  Only processes can register a pm_qos requirement.  To provide for
  automatic cleanup for process the interface requires the process to register
  its parameter requirements in the following way:

  To register the default pm_qos target for the specific parameter, the
  process must open one of /dev/[cpu_dma_latency, network_latency,
  network_throughput]

  As long as the device node is held open that process has a registered
  requirement on the parameter.  The name of the requirement is
  "process_<PID>" derived from the current->pid from within the open system
  call.

  To change the requested target value the process needs to write a s32
  value to the open device node.  This translates to a
  pm_qos_update_requirement call.

  To remove the user mode request for a target value simply close the device
  node.

[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: fix build again]
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: mark gross <mgross@linux.intel.com>
Cc: "John W. Linville" <linville@tuxdriver.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Jaroslav Kysela <perex@suse.cz>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Venki Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Adam Belay <abelay@novell.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:22 -08:00
Adrian Bunk 4ef7229ffa make kernel_shutdown_prepare() static
kernel_shutdown_prepare() can now become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:22 -08:00
Adrian Bunk 47a460d5a3 kernel/power/disk.c: make code static
resume_file[] and create_image() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:22 -08:00
Serge E. Hallyn 3b7391de67 capabilities: introduce per-process capability bounding set
The capability bounding set is a set beyond which capabilities cannot grow.
 Currently cap_bset is per-system.  It can be manipulated through sysctl,
but only init can add capabilities.  Root can remove capabilities.  By
default it includes all caps except CAP_SETPCAP.

This patch makes the bounding set per-process when file capabilities are
enabled.  It is inherited at fork from parent.  Noone can add elements,
CAP_SETPCAP is required to remove them.

One example use of this is to start a safer container.  For instance, until
device namespaces or per-container device whitelists are introduced, it is
best to take CAP_MKNOD away from a container.

The bounding set will not affect pP and pE immediately.  It will only
affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
and exec() does not constrain pI'.  So to really start a shell with no way
of regain CAP_MKNOD, you would do

	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
	cap_t cap = cap_get_proc();
	cap_value_t caparray[1];
	caparray[0] = CAP_MKNOD;
	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
	cap_set_proc(cap);
	cap_free(cap);

The following test program will get and set the bounding
set (but not pI).  For instance

	./bset get
		(lists capabilities in bset)
	./bset drop cap_net_raw
		(starts shell with new bset)
		(use capset, setuid binary, or binary with
		file capabilities to try to increase caps)

************************************************************
cap_bound.c
************************************************************
 #include <sys/prctl.h>
 #include <linux/capability.h>
 #include <sys/types.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #ifndef PR_CAPBSET_READ
 #define PR_CAPBSET_READ 23
 #endif

 #ifndef PR_CAPBSET_DROP
 #define PR_CAPBSET_DROP 24
 #endif

int usage(char *me)
{
	printf("Usage: %s get\n", me);
	printf("       %s drop <capability>\n", me);
	return 1;
}

 #define numcaps 32
char *captable[numcaps] = {
	"cap_chown",
	"cap_dac_override",
	"cap_dac_read_search",
	"cap_fowner",
	"cap_fsetid",
	"cap_kill",
	"cap_setgid",
	"cap_setuid",
	"cap_setpcap",
	"cap_linux_immutable",
	"cap_net_bind_service",
	"cap_net_broadcast",
	"cap_net_admin",
	"cap_net_raw",
	"cap_ipc_lock",
	"cap_ipc_owner",
	"cap_sys_module",
	"cap_sys_rawio",
	"cap_sys_chroot",
	"cap_sys_ptrace",
	"cap_sys_pacct",
	"cap_sys_admin",
	"cap_sys_boot",
	"cap_sys_nice",
	"cap_sys_resource",
	"cap_sys_time",
	"cap_sys_tty_config",
	"cap_mknod",
	"cap_lease",
	"cap_audit_write",
	"cap_audit_control",
	"cap_setfcap"
};

int getbcap(void)
{
	int comma=0;
	unsigned long i;
	int ret;

	printf("i know of %d capabilities\n", numcaps);
	printf("capability bounding set:");
	for (i=0; i<numcaps; i++) {
		ret = prctl(PR_CAPBSET_READ, i);
		if (ret < 0)
			perror("prctl");
		else if (ret==1)
			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
	}
	printf("\n");
	return 0;
}

int capdrop(char *str)
{
	unsigned long i;

	int found=0;
	for (i=0; i<numcaps; i++) {
		if (strcmp(captable[i], str) == 0) {
			found=1;
			break;
		}
	}
	if (!found)
		return 1;
	if (prctl(PR_CAPBSET_DROP, i)) {
		perror("prctl");
		return 1;
	}
	return 0;
}

int main(int argc, char *argv[])
{
	if (argc<2)
		return usage(argv[0]);
	if (strcmp(argv[1], "get")==0)
		return getbcap();
	if (strcmp(argv[1], "drop")!=0 || argc<3)
		return usage(argv[0]);
	if (capdrop(argv[2])) {
		printf("unknown capability\n");
		return 1;
	}
	return execl("/bin/bash", "/bin/bash", NULL);
}
************************************************************

[serue@us.ibm.com: fix typo]
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>a
Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
Tested-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:20 -08:00
Andrew Morgan e338d263a7 Add 64-bit capability support to the kernel
The patch supports legacy (32-bit) capability userspace, and where possible
translates 32-bit capabilities to/from userspace and the VFS to 64-bit
kernel space capabilities.  If a capability set cannot be compressed into
32-bits for consumption by user space, the system call fails, with -ERANGE.

FWIW libcap-2.00 supports this change (and earlier capability formats)

 http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

[akpm@linux-foundation.org: coding-syle fixes]
[akpm@linux-foundation.org: use get_task_comm()]
[ezk@cs.sunysb.edu: build fix]
[akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
[akpm@linux-foundation.org: unused var]
[serue@us.ibm.com: export __cap_ symbols]
Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: James Morris <jmorris@namei.org>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:20 -08:00
Bron Gondwana 195cf453d2 mm/page-writeback: highmem_is_dirtyable option
Add vm.highmem_is_dirtyable toggle

A 32 bit machine with HIGHMEM64 enabled running DCC has an MMAPed file of
approximately 2Gb size which contains a hash format that is written
randomly by the dbclean process.  On 2.6.16 this process took a few
minutes.  With lowmem only accounting of dirty ratios, this takes about 12
hours of 100% disk IO, all random writes.

Include a toggle in /proc/sys/vm/highmem_is_dirtyable which can be set to 1 to
add the highmem back to the total available memory count.

[akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
Signed-off-by: Bron Gondwana <brong@fastmail.fm>
Cc: Ethan Solomita <solo@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
Benjamin Herrenschmidt 5e5419734c add mm argument to pte/pmd/pud/pgd_free
(with Martin Schwidefsky <schwidefsky@de.ibm.com>)

The pgd/pud/pmd/pte page table allocation functions get a mm_struct pointer as
first argument.  The free functions do not get the mm_struct argument.  This
is 1) asymmetrical and 2) to do mm related page table allocations the mm
argument is needed on the free function as well.

[kamalesh@linux.vnet.ibm.com: i386 fix]
[akpm@linux-foundation.org: coding-syle fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
Christoph Lameter 9f8f217253 Page allocator: clean up pcp draining functions
- Add comments explaing how drain_pages() works.

- Eliminate useless functions

- Rename drain_all_local_pages to drain_all_pages(). It does drain
  all pages not only those of the local processor.

- Eliminate useless interrupt off / on sequences. drain_pages()
  disables interrupts on its own. The execution thread is
  pinned to processor by the caller. So there is no need to
  disable interrupts.

- Put drain_all_pages() declaration in gfp.h and remove the
  declarations from suspend.h and from mm/memory_hotplug.c

- Make software suspend call drain_all_pages(). The draining
  of processor local pages is may not the right approach if
  software suspend wants to support SMP. If they call drain_all_pages
  then we can make drain_pages() static.

[akpm@linux-foundation.org: fix build]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Daniel Walker <dwalker@mvista.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:17 -08:00
Davide Libenzi 4d672e7ac7 timerfd: new timerfd API
This is the new timerfd API as it is implemented by the following patch:

int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
		    const struct itimerspec *utmr,
		    struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);

The timerfd_create() API creates an un-programmed timerfd fd.  The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.

The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not
NULL).

The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter.  Otherwise it's a relative time.

The timerfd_gettime() API returns the next expiration time of the timer, or
{0, 0} if the timerfd has not been set yet.

Like the previous timerfd API implementation, read(2) and poll(2) are
supported (with the same interface).  Here's a simple test program I used to
exercise the new timerfd APIs:

http://www.xmailserver.org/timerfd-test2.c

[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: fix ia64 build]
[akpm@linux-foundation.org: fix m68k build]
[akpm@linux-foundation.org: fix mips build]
[akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
[heiko.carstens@de.ibm.com: fix s390]
[akpm@linux-foundation.org: fix powerpc build]
[akpm@linux-foundation.org: fix sparc64 more]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
Oleg Nesterov ed5d2cac11 exec: rework the group exit and fix the race with kill
As Roland pointed out, we have the very old problem with exec.  de_thread()
sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
clears signal->flags.  All signals (even fatal ones) sent in this window
(which is not too small) will be lost.

With this patch exec doesn't abuse SIGNAL_GROUP_EXIT.  signal_group_exit(),
the new helper, should be used to detect exit_group() or exec() in progress.
It can have more users, but this patch does only strictly necessary changes.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
Oleg Nesterov f558b7e408 remove handle_group_stop() in favor of do_signal_stop()
Every time we set SIGNAL_GROUP_EXIT or clear SIGNAL_STOP_DEQUEUED we also
reset ->group_stop_count.

This means that the SIGNAL_GROUP_EXIT check in handle_group_stop() is not
needed, and do_signal_stop() should check SIGNAL_STOP_DEQUEUED only when
->group_stop_count == 0. With these changes handle_group_stop() becomes the
subset of do_signal_stop(), we can kill it and use do_signal_stop() instead.

Also, a preparation for the next patch.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
Oleg Nesterov 198466b41d __group_complete_signal(): fix coredump with group stop race
When __group_complete_signal() sees sig_kernel_coredump() signal, it starts
the group stop, but sets ->group_exit_task = t in a hope that "t" will
actually dequeue this signal and invoke do_coredump().  However, by the
time "t" enters get_signal_to_deliver() it is possible that the signal was
blocked/ignored or we have another pending !SIG_KERNEL_COREDUMP_MASK signal
which will be dequeued first.  This means the task could be stopped but not
killed.

Remove this code from __group_complete_signal().  Note also this patch
removes the bogus signal_wake_up(t, 1).  This thread can't be
STOPPED/TRACED, note the corresponding check in wants_signal().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
Andrew Morton bdff746a39 clone: prepare to recycle CLONE_STOPPED
Ulrich says that we never used this clone flags and that nothing should be
using it.

As we're down to only a single bit left in clone's flags argument, let's add a
warning to check that no userspace is actually using it.  Hopefully we will
be able to recycle it.

Roland said:

  CLONE_STOPPED was previously used by some NTPL versions when under
  thread_db (i.e.  only when being actively debugged by gdb), but not for a
  long time now, and it never worked reliably when it was used.  Removing it
  seems fine to me.

[akpm@linux-foundation.org: it looks like CLONE_DETACHED is being used]
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:07 -08:00
Linus Torvalds f5bb3a5e9d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (79 commits)
  Jesper Juhl is the new trivial patches maintainer
  Documentation: mention email-clients.txt in SubmittingPatches
  fs/binfmt_elf.c: spello fix
  do_invalidatepage() comment typo fix
  Documentation/filesystems/porting fixes
  typo fixes in net/core/net_namespace.c
  typo fix in net/rfkill/rfkill.c
  typo fixes in net/sctp/sm_statefuns.c
  lib/: Spelling fixes
  kernel/: Spelling fixes
  include/scsi/: Spelling fixes
  include/linux/: Spelling fixes
  include/asm-m68knommu/: Spelling fixes
  include/asm-frv/: Spelling fixes
  fs/: Spelling fixes
  drivers/watchdog/: Spelling fixes
  drivers/video/: Spelling fixes
  drivers/ssb/: Spelling fixes
  drivers/serial/: Spelling fixes
  drivers/scsi/: Spelling fixes
  ...
2008-02-04 07:58:52 -08:00
Linus Torvalds 519cb68807 Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild:
  scsi: fix dependency bug in aic7 Makefile
  kbuild: add svn revision information to setlocalversion
  kbuild: do not warn about __*init/__*exit symbols being exported
  Move Kconfig.instrumentation to arch/Kconfig and init/Kconfig
  Add HAVE_KPROBES
  Add HAVE_OPROFILE
  Create arch/Kconfig
  Fix ARM to play nicely with generic Instrumentation menu
  kconfig: ignore select of unknown symbol
  kconfig: mark config as changed when loading an alternate config
  kbuild: Spelling/grammar fixes for config DEBUG_SECTION_MISMATCH
  Remove __INIT_REFOK and __INITDATA_REFOK
  kbuild: print only total number of section mismatces found
2008-02-04 07:56:17 -08:00
Nick Piggin 2f98735c9c vm audit: add VM_DONTEXPAND to mmap for drivers that need it
Drivers that register a ->fault handler, but do not range-check the
offset argument, must set VM_DONTEXPAND in the vm_flags in order to
prevent an expanding mremap from overflowing the resource.

I've audited the tree and attempted to fix these problems (usually by
adding VM_DONTEXPAND where it is not obvious).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-04 07:55:38 -08:00
Joe Perches 0b0a3e7b18 kernel/: Spelling fixes
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-02-03 17:48:04 +02:00
Pierre Peiffer 06c93e8757 Remove one useless extern declaration
The file exit.c contains one useless extern declaration of sem_exit().
Moreover, it refers to nothing.

This trivial patch removes it.

Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
2008-02-03 16:22:12 +02:00
Mathieu Desnoyers 125e564582 Move Kconfig.instrumentation to arch/Kconfig and init/Kconfig
Move the instrumentation Kconfig to

arch/Kconfig for architecture dependent options
  - oprofile
  - kprobes

and

init/Kconfig for architecture independent options
  - profiling
  - markers

Remove the "Instrumentation Support" menu. Everything moves to "General setup".
Delete the kernel/Kconfig.instrumentation file.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-02-03 08:58:08 +01:00
Mathieu Desnoyers 3f550096de Add HAVE_KPROBES
Linus:

On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like

        depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32

really shouldn't exist in a file like kernel/Kconfig.instrumentation.

It would be much better to do

        depends on ARCH_SUPPORTS_KPROBES

in that generic file, and then architectures that do support it would just
have a

        bool ARCH_SUPPORTS_KPROBES
                default y

in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...

Changelog:

Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use

        config KPROBES_SUPPORT
                def_bool y

instead, which is a bit denser.

We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...

- Use HAVE_KPROBES
- Use a select

- Yet another update :
Moving to HAVE_* now.

- Update ARM for kprobes support.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-02-03 08:58:07 +01:00
Mathieu Desnoyers 42d4b839c8 Add HAVE_OPROFILE
Linus:
On the per-architecture side, I do think it would be better to *not* have
internal architecture knowledge in a generic file, and as such a line like

        depends on X86_32 || IA64 || PPC || S390 || SPARC64 || X86_64 || AVR32

really shouldn't exist in a file like kernel/Kconfig.instrumentation.

It would be much better to do

        depends on ARCH_SUPPORTS_KPROBES

in that generic file, and then architectures that do support it would just
have a

        bool ARCH_SUPPORTS_KPROBES
                default y

in *their* architecture files. That would seem to be much more logical,
and is readable both for arch maintainers *and* for people who have no
clue - and don't care - about which architecture is supposed to support
which interface...

Changelog:

Actually, I know I gave this as the magic incantation, but now that I see
it, I realize that I should have told you to just use

        config ARCH_SUPPORTS_KPROBES
                def_bool y

instead, which is a bit denser.

We seem to use both kinds of syntax for these things, but this is really
what "def_bool" is there for...

Changelog :

- Moving to HAVE_*.
- Add AVR32 oprofile.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-02-03 08:58:07 +01:00
Mathieu Desnoyers c0ffa3a951 Fix ARM to play nicely with generic Instrumentation menu
The conflicting commit for
move-kconfiginstrumentation-to-arch-kconfig-and-init-kconfig.patch
is the ARM fix from Linus :

commit 38ad9aebe7

He just seemed to agree that my approach (just putting the missing ARM
config options in arch/arm/Kconfig) works too. The main advantage it has
is that it is smaller, does not need a cleanup in the future and does
not break the following patches unnecessarily.

It's just been discussed here

http://lkml.org/lkml/2008/1/15/267

However, Linus might prefer to stay with his own patch and I would
totally understand it that late in the release cycle. Therefore I submit
this for the next release cycle.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
CC: Russell King <rmk+lkml@arm.linux.org.uk>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2008-02-03 08:58:07 +01:00
Paul Mundt dfacd68e49 kobject: Always build in kernel/ksysfs.o.
kernel/ksysfs.c seems to be a random dumping group for misc globals
that the rest of the tree depend on. This has caused problems with
exports in the past when sysfs is disabled, which can already be
observed in commit-id 51107301b6.

The latest one is the kernel_kobj usage, which presently results in:

fs/built-in.o: In function `debugfs_init':
inode.c:(.init.text+0xc34): undefined reference to `kernel_kobj'
make: *** [.tmp_vmlinux1] Error 1

kernel/ksysfs.c itself at this point only contains globals and some
basic sysfs initialization, the sysfs initialization code is optimized
out when we build with sysfs disabled. Given that, it's easier to just
build in unconditionally, rather than trying to find some other random
place to dump and initialize the globals.

Additionally, the current trend seems to be decoupling of kobjects from
sysfs, in which case it still makes sense to perform the kernel_kobj
initialization that happens here even if sysfs is disabled, as
lib/kobject.o is built-in unconditionally.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-02-02 15:14:46 -08:00
Linus Torvalds 687fcdf741 Merge branch 'suspend' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'suspend' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (38 commits)
  suspend: cleanup reference to swsusp_pg_dir[]
  PM: Remove obsolete /sys/devices/.../power/state docs
  Hibernation: Invoke suspend notifications after console switch
  Suspend: Invoke suspend notifications after console switch
  Suspend: Clean up suspend_64.c
  Suspend: Add config option to disable the freezer if architecture wants that
  ACPI: Print message before calling _PTS
  ACPI hibernation: Call _PTS before suspending devices
  Hibernation: Introduce begin() and end() callbacks
  ACPI suspend: Call _PTS before suspending devices
  ACPI: Separate disabling of GPEs from _PTS
  ACPI: Separate invocations of _GTS and _BFS from _PTS and _WAK
  Suspend: Introduce begin() and end() callbacks
  suspend: fix ia64 allmodconfig build
  ACPI: clear GPE earily in resume to avoid warning
  Suspend: Clean up Kconfig (V2)
  Hibernation: Clean up Kconfig (V2)
  Hibernation: Update messages
  Suspend: Use common prefix in messages
  Hibernation: Remove unnecessary variable declaration
  ...
2008-02-02 14:29:57 +11:00
Peter Zijlstra ed50d6cbc3 debug: softlockup looping fix
Rafael J. Wysocki reported weird, multi-seconds delays during
suspend/resume and bisected it back to:

  commit 82a1fcb902
  Author: Ingo Molnar <mingo@elte.hu>
  Date:   Fri Jan 25 21:08:02 2008 +0100

      softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks

fix it:

 - restore the old wakeup mechanism
 - fix break usage in do_each_thread() { } while_each_thread().
 - fix the hotplug switch stmt, a fall-through case was broken.

Bisected-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-02 14:27:45 +11:00
Rafael J. Wysocki 5a0a2f3046 Hibernation: Invoke suspend notifications after console switch
Following the recent change in the suspend code path, switch consoles before
calling PM notifiers during hibernation.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:59 -05:00
Johannes Berg af258f516b Suspend: Invoke suspend notifications after console switch
In order to fix APM emulation it is necessary to enable apm-emulation
notifications for suspends triggered in various ways via the suspend
notifiers.  However, this will cause the systems using APM emulation
to lock up between X being needed to switch away from the VT and X
already waiting for resume in the APM ioctl.

This patch moves the console switch (if enabled) before the suspend
notification (and after the resume notification) to avoid this issue.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:58 -05:00
Johannes Berg b28f508112 Suspend: Add config option to disable the freezer if architecture wants that
This patch makes the freezer optional for suspend to allow the
system to work (or not work) like the original PMU suspend.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:58 -05:00
Rafael J. Wysocki caea99ef33 Hibernation: Introduce begin() and end() callbacks
Introduce global hibernation callback .end() and rename global
hibernation callback .start() to .begin(), in analogy with the
recent modifications of the global suspend callbacks.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:58 -05:00
Rafael J. Wysocki c697eecebc Suspend: Introduce begin() and end() callbacks
On ACPI systems the target state set by acpi_pm_set_target() is
reset by acpi_pm_finish(), but that need not be called if the
suspend fails.  All platforms that use the .set_target() global
suspend callback are affected by analogous issues.

For this reason, we need an additional global suspend callback that
will reset the target state regardless of whether or not the suspend
is successful.  Also, it is reasonable to rename the .set_target()
callback, since it will be used for a different purpose on ACPI
systems (due to ACPI 1.0x code ordering requirements).

Introduce the global suspend callback .end() to be executed at the
end of the suspend sequence and rename the .set_target() global
suspend callback to .begin().

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:56 -05:00
Rafael J. Wysocki 7671b8ae53 suspend: fix ia64 allmodconfig build
kernel/power/main.c:488: error: ‘pm_test_attr’ undeclared here (not in a function)

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:56 -05:00
Johannes Berg f4cb570076 Suspend: Clean up Kconfig (V2)
This cleans up the suspend Kconfig and removes the need to
declare centrally which architectures support suspend. All
architectures that currently support suspend are modified
accordingly.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Acked-by: Paul Mackerras <paulus@samba.org>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Johannes Berg 801e4062fd Hibernation: Clean up Kconfig (V2)
This cleans up the hibernation Kconfig and removes the need to
declare centrally which architectures support hibernation. All
architectures that currently support hibernation are modified
accordingly.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Cc: Pavel Machek <pavel@suse.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki 23976728a4 Hibernation: Update messages
Make hibernation messages start with one common prefix "PM: " and use
the word "hibernation" in the messages as a synonym of "suspend to
disk".

Turn some KERN_INFO messages into debug ones.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki 465d2b477f Suspend: Use common prefix in messages
Make suspend messages start with one common prefix "PM: ".

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki b6887a2944 Hibernation: Remove unnecessary variable declaration
Remove the unnecessary extern declaration of resume_file[]
from kernel/power/swap.c .

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki 9575809c6f Hibernation: Fix comment in disk.c
Fix a comment in kernel/power/disk.c so that it doesn't contain lines
longer that 80 characters.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki 9628c0ee6a Suspend: Fix comment in main.c
Fix a comment in kernel/power/main.c so that it doesn't contain lines
longer that 80 characters.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:55 -05:00
Rafael J. Wysocki 72df68ca8e Hibernation: Move low level resume to disk.c
Move the low level restore code to kernel/power/disk.c , since the
corresponding low level hibernation code is already there.

Make restore fail if device_power_down(PMSG_PRETHAW) returns an
error.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Rafael J. Wysocki 2ed43b6328 Suspend: Fix compilation warning for CONFIG_SUSPEND unset
Suspend: Make debug facility depend on CONFIG_SUSPEND

Make the new suspend debug facility code depend on CONFIG_SUSPEND,
as appropriate, to remove the compiler warning printed when CONFIG_PM is set
and CONFIG_SUSPEND is not set.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Alan Stern 8252575693 PM: Convert PM notifiers to out-of-line code
This patch (as1008b) converts the PM notifier routines from inline
calls to out-of-line code.  It also prevents pm_chain_head from
being created when CONFIG_PM_SLEEP isn't enabled, and EXPORTs the
notifier registration and unregistration routines.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Johannes Berg 90dda1cb6a PM: Make PM_TRACE more architecture independent
When trying to debug a suspend failure I started implementing
PM_TRACE for powerpc. I then noticed that I'm debugging a suspend
failure and so PM_TRACE isn't useful at all, but thought that
nonetheless this could be useful in the future.

Basically, to support PM_TRACE, you add a Kconfig option that
selects PM_TRACE and provides the infrastructure as per the
help text of PM_TRACE.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Rafael J. Wysocki 4cc79776c9 Hibernation: New testing facility (rev. 2)
Make it possible to test the hibernation core code with the help of the
/sys/power/pm_test attribute introduced for suspend testing in the previous
patch.

Writing an appropriate string to this file causes the hibernation code to work
in one of the test modes defined as follows:

freezer
- test the freezing of processes

devices
- test the freezing of processes and suspending of devices

platform
- test the freezing of processes, suspending of devices and platform global
  control methods(*)

processors
- test the freezing of processes, suspending of devices, platform global
  control methods(*) and the disabling of nonboot CPUs

core
- test the freezing of processes, suspending of devices, platform global
  control methods(*), the disabling of nonboot CPUs and suspending of
  platform/system devices

(*) - the platform global control methods are only available on ACPI systems
      and are only tested if the hibernation mode is set to "platform"

Then, if a hibernation is started by normal means, the hibernation core will
perform its normal operations up to the point indicated by given test level.
Next, it will wait for 5 seconds and carry out the resume operations needed to
transition the system back to the fully functional state.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Rafael J. Wysocki 039a75c6e1 suspend: build fix responding to 2.6.25 kset change
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Rafael J. Wysocki 0e7d56e3d9 Suspend: Testing facility (rev. 2)
Introduce sysfs attribute /sys/power/pm_test allowing one to test the suspend
core code.  Namely, writing one of the strings:

freezer
devices
platform
processors
core

to this file causes the suspend code to work in one of the test modes defined as
follows:

freezer
- test the freezing of processes

devices
- test the freezing of processes and suspending of devices

platform
- test the freezing of processes, suspending of devices and platform global
  control methods(*)

processors
- test the freezing of processes, suspending of devices, platform global
  control methods and the disabling of nonboot CPUs

core
- test the freezing of processes, suspending of devices, platform global
  control methods, the disabling of nonboot CPUs and suspending of
  platform/system devices

(*) These are ACPI global control methods on ACPI systems

Then, if a suspend is started by normal means, the suspend core will perform
its normal operations up to the point indicated by given test level.  Next, it
will wait for 5 seconds and carry out the resume operations needed to transition
the system back to the fully functional state.

Writing "none" to /sys/power/pm_test turns the testing off.

When open for reading, /sys/power/pm_test contains a space-separated list of all
available tests (including "none" that represents the normal functionality) in
which the current test level is indicated by square brackets.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:54 -05:00
Alan Stern c3e94d899c Hibernation: Add PM_RESTORE_PREPARE and PM_POST_RESTORE notifiers (rev. 2)
Add PM_RESTORE_PREPARE and PM_POST_RESTORE notifiers to the PM core, to be used
in analogy with the existing PM_HIBERNATION_PREPARE and PM_POST_HIBERNATION
notifiers.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Adrian Bunk 2f8ed1c60b Hibernation: Move function prototypes to header
This patch moves the prototypes of count_highmem_pages() and
restore_highmem() to kernel/power/power.h

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Rafael J. Wysocki 3010f8caa4 Hibernation: Introduce exportable suspend ioctls header (rev. 2)
Move the definitions of hibernation ioctls to a separate header file in
include/linux, which can be exported to the user space.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Rafael J. Wysocki cc5d207c85 Hibernation: Correct definitions of some ioctls (rev. 2)
Three ioctl numbers belonging to the hibernation userland interface,
SNAPSHOT_ATOMIC_SNAPSHOT, SNAPSHOT_AVAIL_SWAP, SNAPSHOT_GET_SWAP_PAGE,
are defined in a wrong way (eg. not portable).  Provide new ioctl numbers for
these ioctls and mark the existing ones as deprecated.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Rafael J. Wysocki 96f737490c Hibernation: Mark SNAPSHOT_SET_SWAP_FILE ioctl as deprecated (rev. 2)
Mark the SNAPSHOT_SET_SWAP_FILE ioctl belonging to the hibernation userland
interface as deprecated.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Rafael J. Wysocki eb57c1cf05 Hibernation: Rework platform support ioctls (rev. 2)
Modify the hibernation userland interface by adding two new ioctls to it,
SNAPSHOT_PLATFORM_SUPPORT and SNAPSHOT_POWER_OFF, that can be used,
respectively, to switch the hibernation platform support on/off and to make the
kernel transition the system to the hibernation state (eg. ACPI S4) using the
platform (eg. ACPI) driver.

These ioctls are intended to replace the misdesigned SNAPSHOT_PMOPS ioctl,
which from now is regarded as obsolete and will be removed in the future.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:53 -05:00
Rafael J. Wysocki af508b34d2 Hibernation: Introduce SNAPSHOT_GET_IMAGE_SIZE ioctl
Add a new ioctl, SNAPSHOT_GET_IMAGE_SIZE, returning the size of the (just
created) hibernation image, to the hibernation userland interface.

This ioctl is necessary so that the userland utilities using the interface need
not access the hibernation image header, owned by the kernel, in order to obtain
the size of the image.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Len Brown <len.brown@intel.com>
2008-02-01 18:30:52 -05:00
Linus Torvalds dd5f5fed6c Merge branch 'audit.b46' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
* 'audit.b46' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  [AUDIT] Add uid, gid fields to ANOM_PROMISCUOUS message
  [AUDIT] ratelimit printk messages audit
  [patch 2/2] audit: complement va_copy with va_end()
  [patch 1/2] kernel/audit.c: warning fix
  [AUDIT] create context if auditing was ever enabled
  [AUDIT] clean up audit_receive_msg()
  [AUDIT] make audit=0 really stop audit messages
  [AUDIT] break large execve argument logging into smaller messages
  [AUDIT] include audit type in audit message when using printk
  [AUDIT] do not panic on exclude messages in audit_log_pid_context()
  [AUDIT] Add End of Event record
  [AUDIT] add session id to audit messages
  [AUDIT] collect uid, loginuid, and comm in OBJ_PID records
  [AUDIT] return EINTR not ERESTART*
  [PATCH] get rid of loginuid races
  [PATCH] switch audit_get_loginuid() to task_struct *
2008-02-02 08:37:03 +11:00
Eric Paris 320f1b1ed2 [AUDIT] ratelimit printk messages audit
some printk messages from the audit system can become excessive.  This
patch ratelimits those messages.  It was found that messages, such as
the audit backlog lost printk message could flood the logs to the point
that a machine could take an nmi watchdog hit or otherwise become
unresponsive.

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:25:04 -05:00
Richard Knutsson 148b38dc93 [patch 2/2] audit: complement va_copy with va_end()
Complement va_copy() with va_end().

Signed-off-by: Richard Knutsson <ricknu-0@student.ltu.se>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-01 14:24:57 -05:00
Andrew Morton ef00be0554 [patch 1/2] kernel/audit.c: warning fix
kernel/audit.c: In function 'audit_log_start':
kernel/audit.c:1133: warning: 'serial' may be used uninitialized in this function

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-01 14:24:51 -05:00
Eric Paris b593d384ef [AUDIT] create context if auditing was ever enabled
Disabling audit at runtime by auditctl doesn't mean that we can
stop allocating contexts for new processes; we don't want to miss them
when that sucker is reenabled.

(based on work from Al Viro in the RHEL kernel series)

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:24:45 -05:00
Eric Paris 50397bd1e4 [AUDIT] clean up audit_receive_msg()
generally clean up audit_receive_msg() don't free random memory if
selinux_sid_to_string fails for some reason.  Move generic auditing
to a helper function

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:24:39 -05:00
Eric Paris 1a6b9f2317 [AUDIT] make audit=0 really stop audit messages
Some audit messages (namely configuration changes) are still emitted even if
the audit subsystem has been explicitly disabled.  This patch turns those
messages off as well.

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:24:33 -05:00
Eric Paris de6bbd1d30 [AUDIT] break large execve argument logging into smaller messages
execve arguments can be quite large.  There is no limit on the number of
arguments and a 4G limit on the size of an argument.

this patch prints those aruguments in bite sized pieces.  a userspace size
limitation of 8k was discovered so this keeps messages around 7.5k

single arguments larger than 7.5k in length are split into multiple records
and can be identified as aX[Y]=

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:23:55 -05:00
Eric Paris e445deb593 [AUDIT] include audit type in audit message when using printk
Currently audit drops the audit type when an audit message goes through
printk instead of the audit deamon.  This is a minor annoyance in
that the audit type is no longer part of the message and the information
the audit type conveys needs to be carried in, or derived from the
message data.

The attached patch includes the type number as part of the printk.
Admittedly it isn't the type name that the audit deamon provides but I
think this is better than dropping the type completely.

Signed-pff-by: John Johansen <jjohansen@suse.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:08:14 -05:00
Eric Paris 6246ccab99 [AUDIT] do not panic on exclude messages in audit_log_pid_context()
If we fail to get an ab in audit_log_pid_context this may be due to an exclude
rule rather than a memory allocation failure.  If it was due to a memory
allocation failue we would have already paniced and no need to do it again.

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:07:46 -05:00
Eric Paris c0641f28dc [AUDIT] Add End of Event record
This patch adds an end of event record type. It will be sent by the kernel as
the last record when a multi-record event is triggered. This will aid realtime
analysis programs since they will now reliably know they have the last record
to complete an event. The audit daemon filters this and will not write it to
disk.

Signed-off-by: Steve Grubb <sgrubb redhat com>
Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:07:19 -05:00
Eric Paris 4746ec5b01 [AUDIT] add session id to audit messages
In order to correlate audit records to an individual login add a session
id.  This is incremented every time a user logs in and is included in
almost all messages which currently output the auid.  The field is
labeled ses=  or oses=

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:06:51 -05:00
Eric Paris c2a7780efe [AUDIT] collect uid, loginuid, and comm in OBJ_PID records
Add uid, loginuid, and comm collection to OBJ_PID records.  This just
gives users a little more information about the task that received a
signal.  pid is rather meaningless after the fact, and even though comm
isn't great we can't collect exe reasonably on this code path for
performance reasons.

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:06:23 -05:00
Eric Paris f701b75ed5 [AUDIT] return EINTR not ERESTART*
The syscall exit code will change ERESTART* kernel internal return codes
to EINTR if it does not restart the syscall.  Since we collect the audit
info before that point we should fix those in the audit log as well.

Signed-off-by: Eric Paris <eparis@redhat.com>
2008-02-01 14:05:55 -05:00
Al Viro bfef93a5d1 [PATCH] get rid of loginuid races
Keeping loginuid in audit_context is racy and results in messier
code.  Taken to task_struct, out of the way of ->audit_context
changes.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-01 14:05:28 -05:00
Al Viro 0c11b9428f [PATCH] switch audit_get_loginuid() to task_struct *
all callers pass something->audit_context

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-02-01 14:04:59 -05:00
Thomas Gleixner cd689985cf futex: Add bitset conditional wait/wakeup functionality
To allow the implementation of optimized rw-locks in user space, glibc
needs a possibility to select waiters for wakeup depending on a bitset
mask.

This requires two new futex OPs: FUTEX_WAIT_BITS and FUTEX_WAKE_BITS
These OPs are basically the same as FUTEX_WAIT and FUTEX_WAKE plus an
additional argument - a bitset. Further the FUTEX_WAIT_BITS OP is
expecting an absolute timeout value instead of the relative one, which
is used for the FUTEX_WAIT OP.

FUTEX_WAIT_BITS calls into the kernel with a bitset. The bitset is
stored in the futex_q structure, which is used to enqueue the waiter
into the hashed futex waitqueue.

FUTEX_WAKE_BITS also calls into the kernel with a bitset. The wakeup
function logically ANDs the bitset with the bitset stored in each
waiters futex_q structure. If the result is zero (i.e. none of the set
bits in the bitsets is matching), then the waiter is not woken up. If
the result is not zero (i.e. one of the set bits in the bitsets is
matching), then the waiter is woken.

The bitset provided by the caller must be non zero. In case the
provided bitset is zero the kernel returns EINVAL.

Internaly the new OPs are only extensions to the existing FUTEX_WAIT
and FUTEX_WAKE functions. The existing OPs hand a bitset with all bits
set into the futex_wait() and futex_wake() functions.

Signed-off-by: Thomas Gleixner <tgxl@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-01 17:45:14 +01:00
Thomas Gleixner 83e96c604e futex: Remove warn on in return fixup path
The WARN_ON() in the fixup return path of futex_lock_pi() can
trigger with false positives.

The following scenario happens:
t1 holds the futex and t2 and t3 are blocked on the kernel side rt_mutex.
t1 releases the futex (and the rt_mutex) and assigned t2 to be the next
owner of the futex.

t2 is interrupted and returns w/o acquiring the rt_mutex, before t1 can
release the rtmutex.

t1 releases the rtmutex and t3 becomes the pending owner of the rtmutex.

t2 notices that it is the designated owner (user space variable) and
fails to acquire the rt_mutex via trylock, because it is not allowed to
steal the rt_mutex from t3. Now it looks at the rt_mutex pending owner (t3)
and assigns the futex and the pi_state to it.

During the fixup t4 steals the rtmutex from t3.

t2 returns from the fixup and the owner of the rt_mutex has changed from
t3 to t4.

There is no need to do another round of fixups from t2. The important
part (t2 is not returning as the user space visible owner) is
done. The further fixups are done, before either t3 or t4 return to
user space.

For the user space it is not relevant which task (t3 or t4) is the real
owner, as long as those are both in the kernel, which is guaranteed by
the serialization of the hash bucket lock. Both tasks (which ever returns
first to userspace - t4 because it locked the rt_mutex or t3 due to a signal)
are going through the lock_futex_pi() return path where the ownership is
fixed before the return to user space.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-01 17:45:14 +01:00
Thomas Gleixner 5df7fa1c62 tick-sched: add more debug information
To allow better diagnosis of tick-sched related, especially NOHZ
related problems, we need to know when the last wakeup via an irq
happened and when the CPU left the idle state.

Add two fields (idle_waketime, idle_exittime) to the tick_sched
structure and add them to the timer_list output.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-01 17:45:14 +01:00
Thomas Gleixner 1001d0a9ee timekeeping: update xtime_cache when time(zone) changes
xtime_cache needs to be updated whenever xtime and or wall_to_monotic
are changed. Otherwise users of xtime_cache might see a stale (and in
the case of timezone changes utterly wrong) value until the next
update happens.

Fixup the obvious places, which miss this update.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-01 17:45:13 +01:00
Peter Zijlstra 3588a085cd hrtimer: fix hrtimer_init_sleeper() users
this patch:

 commit 37bb6cb409
 Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 Date:   Fri Jan 25 21:08:32 2008 +0100

     hrtimer: unlock hrtimer_wakeup

Broke hrtimer_init_sleeper() users. It forgot to fix up the futex
caller of this function to detect the failed queueing and messed up
the do_nanosleep() caller in that it could leak a TASK_INTERRUPTIBLE
state.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-02-01 17:45:13 +01:00
Herbert Xu 406a1d8680 [AUDIT]: Increase skb->truesize in audit_expand
The recent UDP patch exposed this bug in the audit code.  It
was calling pskb_expand_head without increasing skb->truesize.
The caller of pskb_expand_head needs to do so because that function
is designed to be called in places where truesize is already fixed
and therefore it doesn't update its value.

Because the audit system is using it in a place where the truesize
has not yet been fixed, it needs to update its value manually.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31 19:27:08 -08:00
Trond Myklebust 13f09b95a8 Ensure that we export __fatal_signal_pending()
It may be used by the modules nfs.ko and sunrpc.ko

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Made it a regular export rather than GPL-only  - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-01 12:58:14 +11:00
Linus Torvalds 75659ca0c1 Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
  Remove commented-out code copied from NFS
  NFS: Switch from intr mount option to TASK_KILLABLE
  Add wait_for_completion_killable
  Add wait_event_killable
  Add schedule_timeout_killable
  Use mutex_lock_killable in vfs_readdir
  Add mutex_lock_killable
  Use lock_page_killable
  Add lock_page_killable
  Add fatal_signal_pending
  Add TASK_WAKEKILL
  exit: Use task_is_*
  signal: Use task_is_*
  sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
  ptrace: Use task_is_*
  power: Use task_is_*
  wait: Use TASK_NORMAL
  proc/base.c: Use task_is_*
  proc/array.c: Use TASK_REPORT
  perfmon: Use task_is_*
  ...

Fixed up conflicts in NFS/sunrpc manually..
2008-02-01 11:45:47 +11:00
Ingo Molnar c4772d9930 debug: turn ignore_loglevel into an early param
i was debugging early crashes and wondered where all the printks
went. The reason: ignore_loglevel_setup() was not called yet ...

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-31 22:45:23 +01:00
Gerald Stralko 5aff0531ee sched: remove unused params
This removes the extra struct task_struct *p parameter in inc_nr_running
and dec_nr_running functions.

Signed-off by: Jerry Stralko <gerb.stralko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-31 22:45:23 +01:00
Peter Zijlstra ef9884e6f2 sched: let +nice tasks have smaller impact
Michel Dänzr has bisected an interactivity problem with
plus-reniced tasks back to this commit:

 810e95ccd5 is first bad commit
 commit 810e95ccd5
 Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
 Date:   Mon Oct 15 17:00:14 2007 +0200

 sched: another wakeup_granularity fix

      unit mis-match: wakeup_gran was used against a vruntime

fix this by assymetrically scaling the vtime of positive reniced
tasks.

Bisected-by: Michel Dänzer <michel@tungstengraphics.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-31 22:45:22 +01:00
Srivatsa Vaddagiri 296825cbe1 sched: fix high wake up latencies with FAIR_USER_SCHED
The reason why we are getting better wakeup latencies for
!FAIR_USER_SCHED is because of this snippet of code in place_entity():

	if (!initial) {
		/* sleeps upto a single latency don't count. */
		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
						     ^^^^^^^^^^^^^^^^^^
			vruntime -= sysctl_sched_latency;

		/* ensure we never gain time by being placed backwards. */
		vruntime = max_vruntime(se->vruntime, vruntime);
	}

NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks and
not group-level entities. With the patch attached, I could see that
wakeup latencies with FAIR_USER_SCHED are restored to the same level as
!FAIR_USER_SCHED.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-31 22:45:22 +01:00
Paul Mackerras bd45ac0c5d Merge branch 'linux-2.6' 2008-01-31 11:25:51 +11:00
Avi Kivity 6d4e4c4fca KVM: Disallow fork() and similar games when using a VM
We don't want the meaning of guest userspace changing under our feet.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 17:53:13 +02:00
Mike Travis dd5af90a7f x86/non-x86: percpu, node ids, apic ids x86.git fixup
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:32 +01:00
Ingo Molnar 70edcd77a0 genirq: stackdump after the "Trying to free already-free IRQ" message
these bugs are harder to find than they seem, a stackdump helps.

make it dependent on CONFIG_DEBUG_SHIRQ so that people can turn it off
if it annoys them.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:24 +01:00
Arjan van de Ven 6dab27784b x86: add a simple backtrace test module
During the work on the x86 32 and 64 bit backtrace code I found it useful
to have a simple test module to test a process and irq context backtrace.
Since the existing backtrace code was buggy, I figure it might be useful
to have such a test module in the kernel so that maybe we can even
detect such bugs earlier..

[ mingo@elte.hu: build fix ]

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:08 +01:00
Ingo Molnar 076f9776f5 x86: make early printk selectable on 64-bit as well
Enable CONFIG_EMBEDDED to select CONFIG_EARLY_PRINTK on 64-bit as well.

saves ~2K:

   text    data     bss     dec     hex filename
   7290283 3672091 1907848 12870222         c4624e vmlinux.before
   7288373 3671795 1907848 12868016         c459b0 vmlinux.after

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:33:06 +01:00
Ananth N Mavinakayanahalli 8c1c935642 x86: kprobes: add kprobes smoke tests that run on boot
Here is a quick and naive smoke test for kprobes. This is intended to
just verify if some unrelated change broke the *probes subsystem. It is
self contained, architecture agnostic and isn't of any great use by itself.

This needs to be built in the kernel and runs a basic set of tests to
verify if kprobes, jprobes and kretprobes run fine on the kernel. In case
of an error, it'll print out a message with a "BUG" prefix.

This is a start; we intend to add more tests to this bucket over time.

Thanks to Jim Keniston and Masami Hiramatsu for comments and suggestions.

Tested on x86 (32/64) and powerpc.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-30 13:32:53 +01:00
Arjan van de Ven 71c339116a debug: add the end-of-trace marker and the module list to
Unlike oopses, WARN_ON() currently does't print the loaded modules list.
This makes it harder to take action on certain bug reports. For example,
recently there were a set of WARN_ON()s reported in the mac80211 stack,
which were just signalling a driver bug. It takes then anther round trip
to the bug reporter (if he responds at all) to find out which driver
is at fault.

Another issue is that, unlike oopses, WARN_ON() doesn't currently printk
the helpful "cut here" line, nor the "end of trace" marker.
Now that WARN_ON() is out of line, the size increase due to this is
minimal and it's worth adding.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:50 +01:00
Arjan van de Ven 79b4cc5ee7 debug: move WARN_ON() out of line
A quick grep shows that there are currently 1145 instances of WARN_ON
in the kernel. Currently, WARN_ON is pretty much entirely inlined,
which makes it hard to enhance it without growing the size of the kernel
(and getting Andrew unhappy).

This patch build on top of Olof's patch that introduces __WARN,
and places the slowpath out of line. It also uses Ingo's suggestion
to not use __FUNCTION__ but to use kallsyms to do the lookup;
this saves a ton of extra space since gcc doesn't need to store the function
string twice now:

3936367  833603  624736 5394706  525112 vmlinux.before
3917508  833603  624736 5375847  520767 vmlinux-slowpath

15Kb savings...

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Olof Johansson <olof@lixom.net>
Acked-by: Matt Meckall <mpm@selenic.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:50 +01:00
Andi Kleen 96d97cf03b x86: add /proc/irq/*/spurious to dump the spurious irq debugging state
This is useful to debug problems with interrupt handlers that return
sometimes IRQ_NONE.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:48 +01:00
Andi Kleen 9e094c17ee genirq: turn irq debugging options into module params
This allows to change them at runtime using sysfs. No need to
reboot to set them.

I only added aliases (kernel.noirqdebug etc.) so the old options
still work.

Signed-off-by: Andi Kleen <ak@suse.de>

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:32:48 +01:00
Roland McGrath c269f19617 x86: compat_sys_ptrace
This adds a generic definition of compat_sys_ptrace that calls
compat_arch_ptrace, parallel to sys_ptrace/arch_ptrace.  Some
machines needing this already define a function by that name.
The new generic function is defined only on machines that
put #define __ARCH_WANT_COMPAT_SYS_PTRACE into asm/ptrace.h.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:48 +01:00
Roland McGrath 032d82d906 x86: compat_ptrace_request
This adds a compat_ptrace_request that is the analogue of ptrace_request
for the things that 32-on-64 ptrace implementations can share in common.
So far there are just a couple of requests handled generically.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:47 +01:00
Roland McGrath 16c3e389e7 x86: ptrace_request peekdata/pokedata
This makes ptrace_request handle {PEEK,POKE}{TEXT,DATA} directly.
Every arch_ptrace that could call generic_ptrace_peekdata already
has a default case calling ptrace_request, so this keeps things
simpler for the arch code.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:47 +01:00
Nick Piggin 95c354fe9f spinlock: lockbreak cleanup
The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.

Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.

Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:31:20 +01:00
H. Peter Anvin 65ea5b0349 x86: rename the struct pt_regs members for 32/64-bit consistency
We have a lot of code which differs only by the naming of specific
members of structures that contain registers.  In order to enable
additional unifications, this patch drops the e- or r- size prefix
from the register names in struct pt_regs, and drops the x- prefixes
for segment registers on the 32-bit side.

This patch also performs the equivalent renames in some additional
places that might be candidates for unification in the future.

Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:56 +01:00
Roland McGrath 5b88abbf77 ptrace: generic PTRACE_SINGLEBLOCK
This makes ptrace_request handle PTRACE_SINGLEBLOCK along with
PTRACE_CONT et al.  The new generic code makes use of the
arch_has_block_step macro and generic entry points on machines
that define them.

[ mingo@elte.hu: bugfix ]

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:53 +01:00
Roland McGrath 36df29d799 ptrace: generic resume
This makes ptrace_request handle all the ptrace requests that wake
up the traced task.  These do low-level ptrace implementation magic
that is not arch-specific and should be kept out of arch code.  The
implementations on each arch usually do the same thing.  The new
generic code makes use of the arch_has_single_step macro and generic
entry points to handle PTRACE_SINGLESTEP.

Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:51 +01:00
Ingo Molnar 6e7c402590 x86: various changes and cleanups to in_p/out_p delay details
various changes to the in_p/out_p delay details:

- add the io_delay=none method
- make each method selectable from the kernel config
- simplify the delay code a bit by getting rid of an indirect function call
- add the /proc/sys/kernel/io_delay_type sysctl
- change 'io_delay=standard|alternate' to io_delay=0x80 and io_delay=0xed
- make the io delay config not depend on CONFIG_DEBUG_KERNEL

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: "David P. Reed" <dpreed@reed.com>
2008-01-30 13:30:05 +01:00
Venki Pallipadi 6378ddb592 time: track accurate idle time with tick_sched.idle_sleeptime
Current idle time in kstat is based on jiffies and is coarse grained.
tick_sched.idle_sleeptime is making some attempt to keep track of idle time
in a fine grained manner.  But, it is not handling the time spent in
interrupts fully.

Make tick_sched.idle_sleeptime accurate with respect to time spent on
handling interrupts and also add tick_sched.idle_lastupdate, which keeps
track of last time when idle_sleeptime was updated.

This statistics will be crucial for cpufreq-ondemand governor, which can
shed some conservative gaurd band that is uses today while setting the
frequency.  The ondemand changes that uses the exact idle time is coming
soon.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:04 +01:00
john stultz bbe4d18ac2 NTP: correct inconsistent ntp interval/tick_length usage
I recently noticed on one of my boxes that when synched with an NTP
server, the drift value reported for the system was ~283ppm. While in
some cases, clock hardware can be that bad, it struck me as unusual as
the system was using the acpi_pm clocksource, which is one of the more
trustworthy and accurate clocksources on x86 hardware.

I brought up another system and let it sync to the same NTP server, and
I noticed a similar 280some ppm drift.

In looking at the code, I found that the acpi_pm's constant frequency
was being computed correctly at boot-up, however once the system was up,
even without the ntp daemon running, the clocksource's frequency was
being modified by the clocksource_adjust() function.

Digging deeper, I realized that in the code that keeps track of how much
the clocksource is skewing from the ntp desired time, we were using
different lengths to establish how long an time interval was.

The clocksource was being setup with the following interval:
	NTP_INTERVAL_LENGTH = NSEC_PER_SEC/NTP_INTERVAL_FREQ

While the ntp code was using the tick_length_base value:
	tick_length_base ~= (tick_usec * NSEC_PER_USEC * USER_HZ)
					/NTP_INTERVAL_FREQ

The subtle difference is:
	(tick_usec * NSEC_PER_USEC * USER_HZ) != NSEC_PER_SEC

This difference in calculation was causing the clocksource correction
code to apply a correction factor to the clocksource so the two
intervals were the same, however this results in the actual frequency of
the clocksource to be made incorrect. I believe this difference would
affect all clocksources, although to differing degrees depending on the
clocksource resolution.

The issue was introduced when my HZ free ntp patch landed in 2.6.21-rc1,
so my apologies for the mistake, and for not noticing it until now.

The following patch, corrects the clocksource's initialization code so
it uses the same interval length as the code in ntp.c. After applying
this patch, the drift value for the same system went from ~283ppm to
only 2.635ppm.

I believe this patch to be good, however it does affect all arches and
I've only tested on x86, so some caution is advised. I do think it would
be a likely candidate for a stable 2.6.24.x release.

Any thoughts or feedback would be appreciated.

Signed-off-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:03 +01:00
Ingo Molnar 45fe4fe191 x86: make clockevents more robust
detect zero event-device multiplicators - they then cause
division-by-zero crashes if a clockevent has been initialized
incorrectly.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:03 +01:00
Thomas Gleixner 4713e22ce8 clocksource: add unregister function to disable unusable clocksources
On x86 the PIT might become an unusable clocksource. Add an unregister
function to provide a possibilty to remove the PIT from the list of
available clock sources.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-30 13:30:02 +01:00
Andi Kleen 1ada5cba6a clocksource: make clocksource watchdog cycle through online CPUs
This way it checks if the clocks are synchronized between CPUs too.
This might be able to detect slowly drifting TSCs which only
go wrong over longer time.

Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:02 +01:00
Parag Warudkar 1077f5a917 clocksource.c: use init_timer_deferrable for clocksource_watchdog
clocksource_watchdog can use a deferrable timer - reduces wakeups from
idle per second.

Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:01 +01:00
Geert Uytterhoeven efd9ac8630 time: fold __get_realtime_clock_ts() into getnstimeofday()
- getnstimeofday() was just a wrapper around __get_realtime_clock_ts()
  - Replace calls to __get_realtime_clock_ts() by calls to getnstimeofday()
  - Fix bogus reference to get_realtime_clock_ts(), which never existed

Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:01 +01:00
Thomas Gleixner 186e3cb8a4 timer: clean up tick-broadcast.c
clean up tick-broadcast.c

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-30 13:30:01 +01:00
Pavel Machek b10db7f0d2 time: more timer related cleanups
I was confused by FSEC = 10^15 NSEC statement, plus small whitespace
fixes. When there's copyright, there should be GPL.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:00 +01:00
Pavel Machek 4c9dc64122 time: timer cleanups
Small cleanups to tick-related code. Wrong preempt count is followed
by BUG(), so it is hardly KERN_WARNING.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:00 +01:00
Pavel Machek a6fa8e5a61 time: clean hungarian notation from timers
Clean up hungarian notation from timer code.

Signed-off-by: Pavel Machek <pavel@suse.cz>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2008-01-30 13:30:00 +01:00
Linus Torvalds 0ba6c33bcd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.25: (1470 commits)
  [IPV6] ADDRLABEL: Fix double free on label deletion.
  [PPP]: Sparse warning fixes.
  [IPV4] fib_trie: remove unneeded NULL check
  [IPV4] fib_trie: More whitespace cleanup.
  [NET_SCHED]: Use nla_policy for attribute validation in ematches
  [NET_SCHED]: Use nla_policy for attribute validation in actions
  [NET_SCHED]: Use nla_policy for attribute validation in classifiers
  [NET_SCHED]: Use nla_policy for attribute validation in packet schedulers
  [NET_SCHED]: sch_api: introduce constant for rate table size
  [NET_SCHED]: Use typeful attribute parsing helpers
  [NET_SCHED]: Use typeful attribute construction helpers
  [NET_SCHED]: Use NLA_PUT_STRING for string dumping
  [NET_SCHED]: Use nla_nest_start/nla_nest_end
  [NET_SCHED]: Propagate nla_parse return value
  [NET_SCHED]: act_api: use PTR_ERR in tcf_action_init/tcf_action_get
  [NET_SCHED]: act_api: use nlmsg_parse
  [NET_SCHED]: act_api: fix netlink API conversion bug
  [NET_SCHED]: sch_netem: use nla_parse_nested_compat
  [NET_SCHED]: sch_atm: fix format string warning
  [NETNS]: Add namespace for ICMP replying code.
  ...
2008-01-29 22:54:01 +11:00
Greg Kroah-Hartman 6494a93d55 Module: check to see if we have a built in module with the same name
When trying to load a module with the same name as a built-in one, a
scary kobject backtrace comes up.  Prevent that from checking for this
condition and warning the user as to what exactly is going on.

Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:27 +11:00
Jon Masters 0aa5bd52d0 module: add module taint on ndiswrapper
The struct module taints member is supposed to store per-module taint
data. The kernel knows about certain specific external modules that will
taint the kernel, such as ndiswrapper. Use of ndiswrapper possibly
should set the per-module taint in addition to the global kernel
taint flag, unless we're arguing not because wrapper module itself
is not what actually causes the kernel to be tainted as such?

Signed-off-by: Jon Masters <jcm@jonmasters.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:26 +11:00
Denis Cheng 8686c99875 module: fix the module name length in param_sysfs_builtin
the original code use KOBJ_NAME_LEN for built-in module name length,
that's defined to 20 in linux/kobject.h, but this is not enough appearntly,
many module names are longer than this;
 #define KOBJ_NAME_LEN                   20

another macro is MODULE_NAME_LEN defined in linux/module.h, I think this is
enough for module names:
 #define MODULE_NAME_LEN (64 - sizeof(unsigned long))

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:24 +11:00
Rusty Russell 6dd06c9fbe module: make module_address_lookup safe
module_address_lookup releases preemption then returns a pointer into
the module space.  The only user (kallsyms) copies the result, so just
do that under the preempt disable.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:23 +11:00
Rusty Russell bb9d3d56e7 module: better OOPS and lockdep coverage for loading modules
If we put the module in the linked list *before* calling into to, we
get the module name and functions in the OOPS (is_module_address can
find the module).  It also helps lockdep in a similar way.

Acked-and-tested-by: Joern Engel <joern@lazybastard.org>
Tested-by: Erez Zadok <ezk@cs.sunysb.edu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:22 +11:00
Rusty Russell efa5345e39 module: Fix gratuitous sprintf in module.c
Andrew sent an older version of this patch: we shouldn't use sprintf
to copy a string.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:21 +11:00
Rusty Russell c9a3ba55bb module: wait for dependent modules doing init.
There have been reports of modules failing to load because the modules
they depend on are still loading.  This changes the modules to wait
for a reasonable length of time in that case.  We time out eventually,
because there can be module loops or broken modules.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:20 +11:00
Rusty Russell a2da4052f1 module: Don't report discarded init pages as kernel text.
Current code could cause a bug in symbol_put_addr() if an arch used
kmalloc module text: we might think the symbol belongs to the core
kernel.

The downside is that this might make backtraces through (discarded)
init functions harder to read on some archs, but we already have that
issue for modules and noone has complained.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-01-29 17:13:18 +11:00
Pavel Emelyanov 08913681e4 [NET]: Remove the empty net_table
I have removed all the entries from this table (core_table,
ipv4_table and tr_table), so now we can safely drop it.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:56:29 -08:00
Eric W. Biederman e51b6ba077 sysctl: Infrastructure for per namespace sysctls
This patch implements the basic infrastructure for per namespace sysctls.

A list of lists of sysctl headers is added, allowing each namespace to have
it's own list of sysctl headers.

Each list of sysctl headers has a lookup function to find the first
sysctl header in the list, allowing the lists to have a per namespace
instance.

register_sysct_root is added to tell sysctl.c about additional
lists of sysctl_headers.  As all of the users are expected to be in
kernel no unregister function is provided.

sysctl_head_next is updated to walk through the list of lists.

__register_sysctl_paths is added to add a new sysctl table on
a non-default sysctl list.

The only intrusive part of this patch is propagating the information
to decided which list of sysctls to use for sysctl_check_table.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:17 -08:00
Eric W. Biederman 23eb06de7d sysctl: Remember the ctl_table we passed to register_sysctl_paths
By doing this we allow users of register_sysctl_paths that build
and dynamically allocate their ctl_table to be simpler.  This allows
them to just remember the ctl_table_header returned from
register_sysctl_paths from which they can now find the
ctl_table array they need to free.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:17 -08:00
Eric W. Biederman 29e796fd4d sysctl: Add register_sysctl_paths function
There are a number of modules that register a sysctl table
somewhere deeply nested in the sysctl hierarchy, such as
fs/nfs, fs/xfs, dev/cdrom, etc.

They all specify several dummy ctl_tables for the path name.
This patch implements register_sysctl_path that takes
an additional path name, and makes up dummy sysctl nodes
for each component.

This patch was originally written by Olaf Kirch and
brought to my attention and reworked some by Olaf Hering.
I have changed a few additional things so the bugs are mine.

After converting all of the easy callers Olaf Hering observed
allyesconfig ARCH=i386, the patch reduces the final binary size by 9369 bytes.

.text +897
.data -7008

   text    data     bss     dec     hex filename
   26959310        4045899 4718592 35723801        2211a19 ../vmlinux-vanilla
   26960207        4038891 4718592 35717690        221023a ../O-allyesconfig/vmlinux

So this change is both a space savings and a code simplification.

CC: Olaf Kirch <okir@suse.de>
CC: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Daniel Lezcano <dlezcano@fr.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28 14:55:16 -08:00
Jens Axboe fadad878cc kernel: add CLONE_IO to specifically request sharing of IO contexts
syslets (or other threads/processes that want io context sharing) can
set this to enforce sharing of io context.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:36 +01:00
Jens Axboe d38ecf935f io context sharing: preliminary support
Detach task state from ioc, instead keep track of how many processes
are accessing the ioc.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:31 +01:00
Jens Axboe fd0928df98 ioprio: move io priority from task_struct to io_context
This is where it belongs and then it doesn't take up space for a
process that doesn't do IO.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-01-28 10:50:29 +01:00
Ingo Molnar 326e96b923 printk: revert ktime_get() timestamps
revert 19ef930927.

Kevin Winchester reported a lockup during X startup an bisected
it to this commit.

Reported-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-27 08:03:54 +01:00
Heiko Carstens 81ef16e763 [S390] Remove appldata include from sysctl_check.c
Forgot to remove this when removing the appldata binary sysctls.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2008-01-26 14:11:16 +01:00
Linus Torvalds 9b73e76f3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (200 commits)
  [SCSI] usbstorage: use last_sector_bug flag universally
  [SCSI] libsas: abstract STP task status into a function
  [SCSI] ultrastor: clean up inline asm warnings
  [SCSI] aic7xxx: fix firmware build
  [SCSI] aacraid: fib context lock for management ioctls
  [SCSI] ch: remove forward declarations
  [SCSI] ch: fix device minor number management bug
  [SCSI] ch: handle class_device_create failure properly
  [SCSI] NCR5380: fix section mismatch
  [SCSI] sg: fix /proc/scsi/sg/devices when no SCSI devices
  [SCSI] IB/iSER: add logical unit reset support
  [SCSI] don't use __GFP_DMA for sense buffers if not required
  [SCSI] use dynamically allocated sense buffer
  [SCSI] scsi.h: add macro for enclosure bit of inquiry data
  [SCSI] sd: add fix for devices with last sector access problems
  [SCSI] fix pcmcia compile problem
  [SCSI] aacraid: add Voodoo Lite class of cards.
  [SCSI] aacraid: add new driver features flags
  [SCSI] qla2xxx: Update version number to 8.02.00-k7.
  [SCSI] qla2xxx: Issue correct MBC_INITIALIZE_FIRMWARE command.
  ...
2008-01-25 17:19:08 -08:00
Arjan van de Ven 6d082592b6 sched: keep total / count stats in addition to the max for
Right now, the linux kernel (with scheduler statistics enabled) keeps track
of the maximum time a process is waiting to be scheduled. While the maximum
is a very useful metric, tracking average and total is equally useful
(at least for latencytop) to figure out the accumulated effect of scheduler
delays. The accumulated effect is important to judge the performance impact
of scheduler tuning/behavior.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:35 +01:00
Peter Zijlstra 5973e5b954 sched: fix: don't take a mutex from interrupt context
print_cfs_stats is callable from interrupt context (sysrq), hence it should
not take mutexes. Change it to use RCU since the task group data is RCU
freed anyway.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Nick Piggin 5fb5e6de55 sched: print backtrace of running tasks too
The attached patch is something really simple that can sometimes help
in getting more info out of a hung system.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Ingo Molnar 19ef930927 printk: use ktime_get()
printk timestamps: use ktime_get().

Some platforms have a functioning clocksource function only after
they are done with early bootup, so delay this until out of
SYSTEM_BOOTING state.

it's also inherently safe now, as any bugs in this area will be
caught by the printk recursion checks.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Ingo Molnar 90739081ef softlockup: fix signedness
fix softlockup tunables signedness.

mark tunables read-mostly.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Arjan van de Ven 9745512ce7 sched: latencytop support
LatencyTOP kernel infrastructure; it measures latencies in the
scheduler and tracks it system wide and per process.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Dmitry Adamushko 326587b840 sched: fix goto retry in pick_next_task_rt()
looking at it one more time:

(1) it looks to me that there is no need to call
sched_rt_ratio_exceeded() from pick_next_rt_entity()

- [ for CONFIG_FAIR_GROUP_SCHED ] queues with rt_rq->rt_throttled are
not within this 'tree-like hierarchy' (or whatever we should call it
:-)

- there is also no need to re-check 'rt_rq->rt_time > ratio' at this
point as 'rt_rq->rt_time' couldn't have been increased since the last
call to update_curr_rt() (which obviously calls
sched_rt_ratio_esceeded())
well, it might be that 'ratio' for this rt_rq has been re-configured
(and the period over which this rt_rq was active has not yet been
finished)... but I don't think we should really take this into
account.

(2) now pick_next_rt_entity() must never return NULL, so let's change
pick_next_task_rt() accordingly.

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Guillaume Chazarain cc203d2422 sched: monitor clock underflows in /proc/sched_debug
We monitor clock overflows, let's also monitor clock underflows.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:34 +01:00
Guillaume Chazarain 782daeee3d sched: fix rq->clock warps on frequency changes
sched: fix rq->clock warps on frequency changes

Fix 2bacec8c31
(sched: touch softlockup watchdog after idling) that reintroduced warps
on frequency changes. touch_softlockup_watchdog() calls __update_rq_clock
that checks rq->clock for warps, so call it after adjusting rq->clock.

Signed-off-by: Guillaume Chazarain <guichaz@yahoo.fr>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Michal Schmidt 4f05b98d54 sched: fix, always create kernel threads with normal priority
Ensure that the kernel threads are created with the usual nice level
and affinity even if kthreadd's properties were changed from the
default by root.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Paolo Ciarrocchi 1ad82fd547 debug: clean up kernel/profile.c
Before:
 total: 25 errors, 13 warnings, 602 lines checked

 After:
 total: 0 errors, 2 warnings, 601 lines checked

No code changed:

kernel/profile.o:
   text    data     bss     dec     hex filename
   3048     236      24    3308     cec profile.o.before
   3048     236      24    3308     cec profile.o.after
 md5:
   2501d64748a4d350dffb11203e2a5182  profile.o.before.asm
   2501d64748a4d350dffb11203e2a5182  profile.o.after.asm

Signed-off-by: Paolo Ciarrocchi <paolo.ciarrocchi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Ingo Molnar 6478d8800b sched: remove the !PREEMPT_BKL code
remove the !PREEMPT_BKL code.

this removes 160 lines of legacy code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Ingo Molnar 58b8a73ab8 sched: make PREEMPT_BKL the default
make PREEMPT_BKL the default.

precursor to removal of the !PREEMPT_BKL code.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Arjan van de Ven e14af7eeb4 debug: track and print last unloaded module in the oops trace
Based on a suggestion from Andi:

 In various cases, the unload of a module may leave some bad state around
 that causes a kernel crash AFTER a module is unloaded; and it's then hard
 to find which module caused that.

This patch tracks the last unloaded module, and prints this as part of the
module list in the oops trace.

Right now, only the last 1 module is tracked; I expect that this is enough
for the vast majority of cases where this information matters; if it turns
out that tracking more is important, we can always extend it to that.

[ mingo@elte.hu: build fix ]

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Arjan van de Ven 21aa9280b9 debug: show being-loaded/being-unloaded indicator for modules
It's rather common that an oops/WARN_ON/BUG happens during the load or
unload of a module. Unfortunatly, it's not always easy to see directly
which module is being loaded/unloaded from the oops itself. Worse,
it's not even always possible to ask the bug reporter, since there
are so many components (udev etc) that auto-load modules that there's
a good chance that even the reporter doesn't know which module this is.

This patch extends the existing "show if it's tainting" print code,
which is used as part of printing the modules in the oops/BUG/WARN_ON
to include a "+" for "being loaded" and a "-" for "being unloaded".

As a result this extension, the "taint_flags()" function gets renamed to
"module_flags()" (and takes a module struct as argument, not a taint
flags int).

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:33 +01:00
Peter Zijlstra 5a52dd5009 sched: rt-watchdog: fix .rlim_max = RLIM_INFINITY
Remove the curious logic to set it_sched_expires in the future. It useless
because rt.timeout wouldn't be incremented anyway.

Explicity check for RLIM_INFINITY as a test programm that had a 1s soft limit
and a inf hard limit would SIGKILL at 1s. This is because RLIM_INFINITY+d-1
is d-2.

Signed-off-by: Peter Zijlsta <a.p.zijlstra@chello.nl>
CC: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:32 +01:00
Peter Zijlstra 1020387f5f sched: rt-group: reduce rescheduling
Only reschedule if the new group has a higher prio task.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:32 +01:00
Peter Zijlstra 37bb6cb409 hrtimer: unlock hrtimer_wakeup
hrtimer_wakeup creates a

  base->lock
    rq->lock

lock dependancy. Avoid this by switching to HRTIMER_CB_IRQSAFE_NO_SOFTIRQ
which doesn't hold base->lock.

This fully untangles hrtimer locks from the scheduler locks, and allows
hrtimer usage in the scheduler proper.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:32 +01:00
Peter Zijlstra d3d74453c3 hrtimer: fixup the HRTIMER_CB_IRQSAFE_NO_SOFTIRQ fallback
Currently all highres=off timers are run from softirq context, but
HRTIMER_CB_IRQSAFE_NO_SOFTIRQ timers expect to run from irq context.

Fix this up by splitting it similar to the highres=on case.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:31 +01:00
Peter Zijlstra 2d44ae4d71 hrtimer: clean up cpu->base locking tricks
In order to more easily allow for the scheduler to use timers, clean up
the locking a bit.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:31 +01:00
Peter Zijlstra 48d5e25821 sched: rt throttling vs no_hz
We need to teach no_hz about the rt throttling because its tick driven.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:31 +01:00
Mike Galbraith 614ee1f61f sched: pull_rt_task() cleanup
"goto out" is an odd way to spell "skip".

Signed-off-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:30 +01:00
Peter Zijlstra 6f505b1642 sched: rt group scheduling
Extend group scheduling to also cover the realtime classes. It uses the time
limiting introduced by the previous patch to allow multiple realtime groups.

The hard time limit is required to keep behaviour deterministic.

The algorithms used make the realtime scheduler O(tg), linear scaling wrt the
number of task groups. This is the worst case behaviour I can't seem to get out
of, the avg. case of the algorithms can be improved, I focused on correctness
and worst case.

[ akpm@linux-foundation.org: move side-effects out of BUG_ON(). ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:30 +01:00
Peter Zijlstra fa85ae2418 sched: rt time limit
Very simple time limit on the realtime scheduling classes.
Allow the rq's realtime class to consume sched_rt_ratio of every
sched_rt_period slice. If the class exceeds this quota the fair class
will preempt the realtime class.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:29 +01:00
Peter Zijlstra 8f4d37ec07 sched: high-res preemption tick
Use HR-timers (when available) to deliver an accurate preemption tick.

The regular scheduler tick that runs at 1/HZ can be too coarse when nice
level are used. The fairness system will still keep the cpu utilisation 'fair'
by then delaying the task that got an excessive amount of CPU time but try to
minimize this by delivering preemption points spot-on.

The average frequency of this extra interrupt is sched_latency / nr_latency.
Which need not be higher than 1/HZ, its just that the distribution within the
sched_latency period is important.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:29 +01:00
Herbert Xu 02b67cc3ba sched: do not do cond_resched() when CONFIG_PREEMPT
Why do we even have cond_resched when real preemption
is on? It seems to be a waste of space and time.

remove cond_resched with CONFIG_PREEMPT on.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:28 +01:00
Ingo Molnar 03319ec8b0 sched: documentation, whitespace fixes
whitespace fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:28 +01:00
Peter Zijlstra 78f2c7db60 sched: SCHED_FIFO/SCHED_RR watchdog timer
Introduce a new rlimit that allows the user to set a runtime timeout on
real-time tasks their slice. Once this limit is exceeded the task will receive
SIGXCPU.

So it measures runtime since the last sleep.

Input and ideas by Thomas Gleixner and Lennart Poettering.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Lennart Poettering <mzxreary@0pointer.de>
CC: Michael Kerrisk <mtk.manpages@googlemail.com>
CC: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:27 +01:00
Peter Zijlstra fa717060f1 sched: sched_rt_entity
Move the task_struct members specific to rt scheduling together.
A future optimization could be to put sched_entity and sched_rt_entity
into a union.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:27 +01:00
Pavel Emelyanov 8eb703e4f3 uids: merge multiple error paths in alloc_uid() into one
There are already 4 error paths in alloc_uid() that do incremental rollbacks.
I think it's time to merge them.  This costs us 8 lines of code :)

Maybe it would be better to merge this patch with the previous one, but I
remember that some time ago I sent a similar patch (fixing the error path and
cleaning it), but I was told to make two patches in such cases.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:26 +01:00
Gregory Haskins dc938520d2 sched: dynamically update the root-domain span/online maps
The baseline code statically builds the span maps when the domain is formed.
Previous attempts at dynamically updating the maps caused a suspend-to-ram
regression, which should now be fixed.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:26 +01:00
Paul E. McKenney eaf649e9fe Preempt-RCU: CPU Hotplug handling
This patch allows preemptible RCU to tolerate CPU-hotplug operations.
It accomplishes this by maintaining a local copy of a map of online
CPUs, which it accesses under its own lock.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:25 +01:00
Paul E. McKenney e260be673a Preempt-RCU: implementation
This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections and flips them
when all tasks exit read-side critical section. The details
of this implementation can be found in this paper -

	http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf

and the article-

	http://lwn.net/Articles/253651/

This patch was developed as a part of the -rt kernel development and
meant to provide better latencies when read-side critical sections of
RCU don't disable preemption.  As a consequence of keeping track of RCU
readers, the readers have a slight overhead (optimizations in the paper).
This implementation co-exists with the "classic" RCU implementations
and can be switched to at compiler.

Also includes RCU tracing summarized in debugfs.

[ akpm@linux-foundation.org: build fixes on non-preempt architectures ]

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:24 +01:00
Paul E. McKenney e0ecfa7917 Preempt-RCU: fix rcu_barrier for preemptive environment.
Fix rcu_barrier() to work properly in preemptive kernel environment.
Also, the ordering of callback must be preserved while moving
callbacks to another CPU during CPU hotplug.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:24 +01:00
Paul E. McKenney 01c1c660f4 Preempt-RCU: reorganize RCU code into rcuclassic.c and rcupdate.c
This patch re-organizes the RCU code to enable multiple implementations
of RCU. Users of RCU continues to include rcupdate.h and the
RCU interfaces remain the same. This is in preparation for
subsequently merging the preemptible RCU implementation.

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:24 +01:00
Dipankar Sarma c2d727aa2f Preempt-RCU: Use softirq instead of tasklets for
This patch makes RCU use softirq instead of tasklets.

It also adds a memory barrier after raising the softirq
inorder to ensure that the cpu sees the most recently updated
value of rcu->cur while processing callbacks.
The discussion of the related theoretical race pointed out
by James Huang can be found here --> http://lkml.org/lkml/2007/11/20/603

Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Reviewed-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:23 +01:00
Gregory Haskins c49443c538 sched: remove some old cpuset logic
We had support for overlapping cpuset based rto logic in early
prototypes that is no longer used, so remove it.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:23 +01:00
Gregory Haskins cdc8eb984c sched: RT-balance, only adjust overload state when changing
The overload set/clears were originally idempotent when this logic was first
implemented.  But that is no longer true due to the addition of the atomic
counter and this logic was never updated to work properly with that change.
So only adjust the overload state if it is actually changing to avoid
getting out of sync.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:23 +01:00
Steven Rostedt cb46984504 sched: RT-balance, add new methods to sched_class
Dmitry Adamushko found that the current implementation of the RT
balancing code left out changes to the sched_setscheduler and
rt_mutex_setprio.

This patch addresses this issue by adding methods to the schedule classes
to handle being switched out of (switched_from) and being switched into
(switched_to) a sched_class. Also a method for changing of priorities
is also added (prio_changed).

This patch also removes some duplicate logic between rt_mutex_setprio and
sched_setscheduler.

Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-01-25 21:08:22 +01:00