Commit graph

1474 commits

Author SHA1 Message Date
Jay Lan 8f0ab51479 [PATCH] csa: convert CONFIG tag for extended accounting routines
There were a few accounting data/macros that are used in CSA but are #ifdef'ed
inside CONFIG_BSD_PROCESS_ACCT.  This patch is to change those ifdef's from
CONFIG_BSD_PROCESS_ACCT to CONFIG_TASK_XACCT.  A few defines are moved from
kernel/acct.c and include/linux/acct.h to kernel/tsacct.c and
include/linux/tsacct_kern.h.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan 9acc185351 [PATCH] csa: Extended system accounting over taskstats
Add extended system accounting handling over taskstats interface.  A
CONFIG_TASK_XACCT flag is created to enable the extended accounting code.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Jay Lan f3cef7a994 [PATCH] csa: basic accounting over taskstats
Add some basic accounting fields to the taskstats struct, add a new
kernel/tsacct.c to handle basic accounting data handling upon exit.  A handle
is added to taskstats.c to invoke the basic accounting data handling.

Signed-off-by: Jay Lan <jlan@sgi.com>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Chris Sturtivant <csturtiv@sgi.com>
Cc: Tony Ernst <tee@sgi.com>
Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net>
Cc: "Michal Piotrowski" <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Balbir Singh 0ae646845b [PATCH] Fix taskstats size calculation (use the new genetlink utility functions)
The addition of the CSA patch pushed the size of struct taskstats to 256
bytes.  This exposed a problem with prepare_reply(), we were not allocating
space for the netlink and genetlink header.  It worked earlier because
alloc_skb() would align the skb to SMP_CACHE_BYTES, which added some additonal
bytes.

Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jamal Hadi <hadi@cyberus.ca>
Cc: Shailabh Nagar <nagar@watson.ibm.com>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:29 -07:00
Atsushi Nemoto 8ef386092d [PATCH] kill wall_jiffies
With 2.6.18-rc4-mm2, now wall_jiffies will always be the same as jiffies.
So we can kill wall_jiffies completely.

This is just a cleanup and logically should not change any real behavior
except for one thing: RTC updating code in (old) ppc and xtensa use a
condition "jiffies - wall_jiffies == 1".  This condition is never met so I
suppose it is just a bug.  I just remove that condition only instead of
kill the whole "if" block.

[heiko.carstens@de.ibm.com: s390 build fix and cleanup]
Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Adrian Bunk 70bc42f90a [PATCH] kernel/time/ntp.c: possible cleanups
This patch contains the following possible cleanups:
- make the following needlessly global function static:
  - ntp_update_frequency()
- make the following needlessly global variables static:
  - time_state
  - time_offset
  - time_constant
  - time_reftime
- remove the following read-only global variable:
  - time_precision

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel f199239373 [PATCH] ntp: convert to the NTP4 reference model
This converts the kernel ntp model into a model which matches the nanokernel
reference implementations.  The previous patches already increased the
resolution and precision of the computations, so that this conversion becomes
quite simple.

<linux@horizon.com> explains:

The original NTP kernel interface was defined in units of microseconds.
That's what Linux implements.  As computers have gotten faster and can now
split microseconds easily, a new kernel interface using nanosecond units was
defined ("the nanokernel", confusing as that name is to OS hackers), and
there's an STA_NANO bit in the adjtimex() status field to tell the application
which units it's using.

The current ntpd supports both, but Linux loses some possible timing
resolution because of quantization effects, and the ntpd hackers would really
like to be able to drop the backwards compatibility code.

Ulrich Windl has been maintaining a patch set to do the conversion for years,
but it's hard to keep in sync.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel 04b617e71e [PATCH] ntp: convert time_freq to nsec value
This converts time_freq to a scaled nsec value and adds around 6bit of extra
resolution.  This pushes the time_freq to its 32bit limits so the calculatons
have to be done with 64bit.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:27 -07:00
Roman Zippel 97eebe138c [PATCH] ntp: remove time_tolerance
time_tolerance isn't changed at all in the kernel, so simply remove it, this
simplifies the next patch, as it avoids a number of conversions.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel 8f807f8d21 [PATCH] ntp: add time_adjust to tick length
This folds update_ntp_one_tick() into second_overflow() and adds time_adjust
to the tick length, this makes time_next_adjust unnecessary.  This slightly
changes the adjtime() behaviour, instead of applying it to the next tick, it's
applied to the next second.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel 3d3675cc3d [PATCH] ntp: prescale time_offset
This converts time_offset into a scaled per tick value.  This avoids now
completely the crude compensation in second_overflow().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel dc6a43e46f [PATCH] ntp: add time_freq to tick length
This adds the frequency part to ntp_update_frequency().

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel ab8783b688 [PATCH] ntp: add time_adj to tick length
This makes time_adj local to second_overflow() and integrates it into the tick
length instead of adding it everytime.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Roman Zippel b0ee75561b [PATCH] ntp: add ntp_update_frequency
This introduces ntp_update_frequency() and deinlines ntp_clear() (as it's not
performance critical).  ntp_update_frequency() calculates the base tick length
using tick_usec and adds a base adjustment, in case the frequency doesn't
divide evenly by HZ.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
john stultz 4c7ee8de95 [PATCH] NTP: Move all the NTP related code to ntp.c
Move all the NTP related code to ntp.c

[akpm@osdl.org: cleanups, build fix]
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:26 -07:00
Martin Schwidefsky ef6edc9746 [PATCH] Directed yield: cpu_relax variants for spinlocks and rw-locks
On systems running with virtual cpus there is optimization potential in
regard to spinlocks and rw-locks.  If the virtual cpu that has taken a lock
is known to a cpu that wants to acquire the same lock it is beneficial to
yield the timeslice of the virtual cpu in favour of the cpu that has the
lock (directed yield).

With CONFIG_PREEMPT="n" this can be implemented by the architecture without
common code changes.  Powerpc already does this.

With CONFIG_PREEMPT="y" the lock loops are coded with _raw_spin_trylock,
_raw_read_trylock and _raw_write_trylock in kernel/spinlock.c.  If the lock
could not be taken cpu_relax is called.  A directed yield is not possible
because cpu_relax doesn't know anything about the lock.  To be able to
yield the lock in favour of the current lock holder variants of cpu_relax
for spinlocks and rw-locks are needed.  The new _raw_spin_relax,
_raw_read_relax and _raw_write_relax primitives differ from cpu_relax
insofar that they have an argument: a pointer to the lock structure.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:21 -07:00
Cal Peake 756184b7d7 [PATCH] CodingStyle cleanup for kernel/sys.c
Fix up kernel/sys.c to be consistent with CodingStyle and the rest of the
file.

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:20 -07:00
Arjan van de Ven 5c87579e65 [PATCH] maximum latency tracking infrastructure
Add infrastructure to track "maximum allowable latency" for power saving
policies.

The reason for adding this infrastructure is that power management in the
idle loop needs to make a tradeoff between latency and power savings
(deeper power save modes have a longer latency to running code again).  The
code that today makes this tradeoff just does a rather simple algorithm;
however this is not good enough: There are devices and use cases where a
lower latency is required than that the higher power saving states provide.
 An example would be audio playback, but another example is the ipw2100
wireless driver that right now has a very direct and ugly acpi hook to
disable some higher power states randomly when it gets certain types of
error.

The proposed solution is to have an interface where drivers can

* announce the maximum latency (in microseconds) that they can deal with
* modify this latency
* give up their constraint

and a function where the code that decides on power saving strategy can
query the current global desired maximum.

This patch has a user of each side: on the consumer side, ACPI is patched
to use this, on the producer side the ipw2100 driver is patched.

A generic maximum latency is also registered of 2 timer ticks (more and you
lose accurate time tracking after all).

While the existing users of the patch are x86 specific, the infrastructure
is not.  I'd like to ask the arch maintainers of other architectures if the
infrastructure is generic enough for their use (assuming the architecture
has such a tradeoff as concept at all), and the sound/multimedia driver
owners to look at the driver facing API to see if this is something they
can use.

[akpm@osdl.org: cleanups]
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jesse Barnes <jesse.barnes@intel.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01 00:39:19 -07:00
David Howells 9361401eb7 [PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer.  Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.

This patch does the following:

 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.

 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer.  This includes:

     (*) Block I/O tracing.

     (*) Disk partition code.

     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.

     (*) The SCSI layer.  As far as I can tell, even SCSI chardevs use the
     	 block layer to do scheduling.  Some drivers that use SCSI facilities -
     	 such as USB storage - end up disabled indirectly from this.

     (*) Various block-based device drivers, such as IDE and the old CDROM
     	 drivers.

     (*) MTD blockdev handling and FTL.

     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
     	 taking a leaf out of JFFS2's book.

 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set.  sector_div() is,
     however, still used in places, and so is still available.

 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.

 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.

 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.

 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.

 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:

     (*) Default blockdev file operations (to give error ENODEV on opening).

 (*) Makes some /proc changes:

     (*) /proc/devices does not list any blockdevs.

     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.

 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.

 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.

 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined.  This does not prohibit NFS roots or JFFS2.

 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).

 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:31 +02:00
David Howells 07f3f05c1e [PATCH] BLOCK: Move extern declarations out of fs/*.c into header files [try #6]
Create a new header file, fs/internal.h, for common definitions local to the
sources in the fs/ directory.

Move extern definitions that should be in header files from fs/*.c to
fs/internal.h or other main header files where they span directories.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:52:18 +02:00
David Howells 0d67a46df0 [PATCH] BLOCK: Remove duplicate declaration of exit_io_context() [try #6]
Remove the duplicate declaration of exit_io_context() from linux/sched.h.

Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-09-30 20:31:20 +02:00
Andi Kleen 34596dc9e5 [PATCH] Define vsyscall cache as blob to make clearer that user space shouldn't use it
Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Andi Kleen 29cbc78b90 [PATCH] x86: Clean up x86 NMI sysctls
Use prototypes in headers
Don't define panic_on_unrecovered_nmi for all architectures

Cc: dzickus@redhat.com

Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-30 01:47:55 +02:00
Paul Jackson 181b648036 [PATCH] cpuset: fix obscure attach_task vs exiting race
Fix obscure race condition in kernel/cpuset.c attach_task() code.

There is basically zero chance of anyone accidentally being harmed by this
race.

It requires a special 'micro-stress' load and a special timing loop hacks
in the kernel to hit in less than an hour, and even then you'd have to hit
it hundreds or thousands of times, followed by some unusual and senseless
cpuset configuration requests, including removing the top cpuset, to cause
any visibly harm affects.

One could, with perhaps a few days or weeks of such effort, get the
reference count on the top cpuset below zero, and manage to crash the
kernel by asking to remove the top cpuset.

I found it by code inspection.

The race was introduced when 'the_top_cpuset_hack' was introduced, and one
piece of code was not updated.  An old check for a possibly null task
cpuset pointer needed to be changed to a check for a task marked
PF_EXITING.  The pointer can't be null anymore, thanks to
the_top_cpuset_hack (documented in kernel/cpuset.c).  But the task could
have gone into PF_EXITING state after it was found in the task_list scan.

If a task is PF_EXITING in this code, it is possible that its task->cpuset
pointer is pointing to the top cpuset due to the_top_cpuset_hack, rather
than because the top_cpuset was that tasks last valid cpuset.  In that
case, the wrong cpuset reference counter would be decremented.

The fix is trivial.  Instead of failing the system call if the tasks cpuset
pointer is null here, fail it if the task is in PF_EXITING state.

The code for 'the_top_cpuset_hack' that changes an exiting tasks cpuset to
the top_cpuset is done without locking, so could happen at anytime.  But it
is done during the exit handling, after the PF_EXITING flag is set.  So if
we verify that a task is still not PF_EXITING after we copy out its cpuset
pointer (into 'oldcs', below), we know that 'oldcs' is not one of these
hack references to the top_cpuset.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Ingo Molnar 03cbc358aa [PATCH] lockdep core: improve the lock-chain-hash
With CONFIG_DEBUG_LOCK_ALLOC turned off i was getting sporadic failures in
the locking self-test:

  ------------>
  | Locking API testsuite:
  ----------------------------------------------------------------------------
                                   | spin |wlock |rlock |mutex | wsem | rsem |
    --------------------------------------------------------------------------
                       A-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
                   A-B-B-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-B-C-C-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
               A-B-C-A-B-C deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-B-C-C-D-D-A deadlock:  ok  |FAILED|  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-D-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |  ok  |
           A-B-C-D-B-C-D-A deadlock:  ok  |  ok  |  ok  |  ok  |  ok  |FAILED|

after much debugging it turned out to be caused by accidental chain-hash
key collisions.  The current hash is:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS/2) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS/2)) ^ \
 	(key2))

where MAX_LOCKDEP_KEYS_BITS is 11.  This hash is pretty good as it will
shift by 5 bits in every iteration, where every new ID 'mixed' into the
hash would have up to 11 bits.  But because there was a 6 bits overlap
between subsequent IDs and their high bits tended to be similar, there was
a chance for accidental chain-hash collision for a low number of locks
held.

the solution is to shift by 11 bits:

 #define iterate_chain_key(key1, key2) \
	(((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
	((key1) >> (64-MAX_LOCKDEP_KEYS_BITS)) ^ \
 	(key2))

This keeps the hash perfect up to 5 locks held, but even above that the
hash is still good because 11 bits is a relative prime to the total 64
bits, so a complete match will only occur after 64 held locks (which doesnt
happen in Linux).  Even after 5 locks held, entropy of the 5 IDs mixed into
the hash is already good enough so that overlap doesnt generate a colliding
hash ID.

with this change the false positives went away.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Alan Cox eb84a20e9e [PATCH] audit/accounting: tty locking
Add tty locking around the audit and accounting code.

The whole current->signal-> locking is all deeply strange but it's for
someone else to sort out.  Add rather than replace the lock for acct.c

Signed-off-by: Alan Cox <alan@redhat.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:25 -07:00
Rusty Russell e5582ca21a [PATCH] stop_machine.c copyright
I had to look back: this code was extracted from the module.c code in 2005.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:24 -07:00
Ian S. Nelson 04b1db9fd7 [PATCH] /sys/modules: allow full length section names
I've been using systemtap for some debugging and I noticed that it can't
probe a lot of modules.  Turns out it's kind of silly, the sections section
of /sys/module is limited to 32byte filenames and many of the actual
sections are a a bit longer than that.

[akpm@osdl.org: rewrite to use dymanic allocation]
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:23 -07:00
Paul Jackson b1aac8bb82 [PATCH] cpuset: hotunplug cpus and mems in all cpusets
The cpuset code handling hot unplug of CPUs or Memory Nodes was incorrect -
it could remove a CPU or Node from the top cpuset, while leaving it still
in some child cpusets.

One basic rule of cpusets is that each cpusets cpus and mems are subsets of
its parents.  The cpuset hot unplug code violated this rule.

So the cpuset hotunplug handler must walk down the tree, removing any
removed CPU or Node from all cpusets.

However, it is not allowed to make a cpusets cpus or mems become empty.
They can only transition from empty to non-empty, not back.

So if the last CPU or Node would be removed from a cpuset by the above
walk, we scan back up the cpuset hierarchy, finding the nearest ancestor
that still has something online, and copy its CPU or Memory placement.

Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Nathan Lynch <ntl@pobox.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Paul Jackson 38837fc75a [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map
Change the list of memory nodes allowed to tasks in the top (root) nodeset
to dynamically track what cpus are online, using a call to a cpuset hook
from the memory hotplug code.  Make this top cpus file read-only.

On systems that have cpusets configured in their kernel, but that aren't
actively using cpusets (for some distros, this covers the majority of
systems) all tasks end up in the top cpuset.

If that system does support memory hotplug, then these tasks cannot make
use of memory nodes that are added after system boot, because the memory
nodes are not allowed in the top cpuset.  This is a surprising regression
over earlier kernels that didn't have cpusets enabled.

One key motivation for this change is to remain consistent with the
behaviour for the top_cpuset's 'cpus', which is also read-only, and which
automatically tracks the cpu_online_map.

This change also has the minor benefit that it fixes a long standing,
little noticed, minor bug in cpusets.  The cpuset performance tweak to
short circuit the cpuset_zone_allowed() check on systems with just a single
cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
changing the 'mems' of the top_cpuset had no affect, even though the change
(the write system call) appeared to succeed.  With the following change,
that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
refuses to be changed via user space writes.  Thus no one should be mislead
into thinking they've changed the top_cpusets's 'mems' when in affect they
haven't.

In order to keep the behaviour of cpusets consistent between systems
actively making use of them and systems not using them, this patch changes
the behaviour of the 'mems' file in the top (root) cpuset, making it read
only, and making it automatically track the value of node_online_map.  Thus
tasks in the top cpuset will have automatic use of hot plugged memory nodes
allowed by their cpuset.

[akpm@osdl.org: build fix]
[bunk@stusta.de: build fix]
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov c394cc9fbb [PATCH] introduce TASK_DEAD state
I am not sure about this patch, I am asking Ingo to take a decision.

task_struct->state == EXIT_DEAD is a very special case, to avoid a confusion
it makes sense to introduce a new state, TASK_DEAD, while EXIT_DEAD should
live only in ->exit_state as documented in sched.h.

Note that this state is not visible to user-space, get_task_state() masks off
unsuitable states.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:21 -07:00
Oleg Nesterov 55a101f8f7 [PATCH] kill PF_DEAD flag
After the previous change (->flags & PF_DEAD) <=> (->state == EXIT_DEAD), we
don't need PF_DEAD any longer.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov 29b8849216 [PATCH] set EXIT_DEAD state in do_exit(), not in schedule()
schedule() checks PF_DEAD on every context switch and sets ->state = EXIT_DEAD
to ensure that the exiting task will be deactivated.  Note that this EXIT_DEAD
is in fact a "random" value, we can use any bit except normal TASK_XXX values.

It is better to set this state in do_exit() along with PF_DEAD flag and remove
that check in schedule().

We are safe wrt concurrent try_to_wake_up() (for example ptrace, tkill), it
can not change task's ->state: the 'state' argument of try_to_wake_up() can't
have EXIT_DEAD bit.  And in case when try_to_wake_up() sees a stale value of
->state == TASK_RUNNING it will do nothing.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:20 -07:00
Oleg Nesterov aaa2a97eb9 [PATCH] sys_get_robust_list(): don't take tasklist_lock
use rcu locks for find_task_by_pid().

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov d359b549bf [PATCH] futex_find_get_task(): don't take tasklist_lock
It is ok to do find_task_by_pid() + get_task_struct() under
rcu_read_lock(), we cand drop tasklist_lock.

Note that testing of ->exit_state is racy with or without tasklist anyway.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov 5b160f5ecd [PATCH] copy_process: cosmetic ->ioprio tweak
copy_process:
// holds tasklist_lock + ->siglock
       /*
        * inherit ioprio
        */
       p->ioprio = current->ioprio;

Why?  ->ioprio was already copied in dup_task_struct().  I guess this is
needed to ensure that the child can't escape
sys_ioprio_set(IOPRIO_WHO_{PGRP,USER}), yes?

In that case we don't need ->siglock held, and the comment should be
updated.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov 1c573afebc [PATCH] reparent_to_init(): use has_rt_policy()
Remove open-coded has_rt_policy(), no changes in kernel/exit.o

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:18 -07:00
Oleg Nesterov 8dc3e9099e [PATCH] sched_setscheduler: fix? policy checks
I am not sure this patch is correct: I can't understand what the current
code does, and I don't know what it was supposed to do.

The comment says:

		 * can't change policy, except between SCHED_NORMAL
		 * and SCHED_BATCH:

The code:

		if (((policy != SCHED_NORMAL && p->policy != SCHED_BATCH) &&
			(policy != SCHED_BATCH && p->policy != SCHED_NORMAL)) &&

But this is equivalent to:

		if ( (is_rt_policy(policy) && has_rt_policy(p)) &&

which means something different.  We can't _decrease_ the current
->rt_priority with such a check (if rlim[RLIMIT_RTPRIO] == 0).

Probably, it was supposed to be:

		if (	!(policy == SCHED_NORMAL && p->policy == SCHED_BATCH)  &&
			!(policy == SCHED_BATCH  && p->policy == SCHED_NORMAL)

this matches the comment, but strange: it doesn't allow to _drop_ the
realtime priority when rlim[RLIMIT_RTPRIO] == 0.

I think the right check would be:

		/* can't set/change rt policy */
		if (is_rt_policy(policy) &&
				policy != p->policy &&
				!rlim_rtprio)
			return -EPERM;

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov 57a6f51c42 [PATCH] introduce is_rt_policy() helper
Imho, makes the code a bit easier to read.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Oleg Nesterov 5fe1d75f34 [PATCH] do_sched_setscheduler(): don't take tasklist_lock
Use rcu locks instead. sched_setscheduler() now takes ->siglock
before reading ->signal->rlim[].

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:17 -07:00
Cal Peake c9472e0f28 [PATCH] kill extraneous printk in kernel_restart()
Get rid of an extraneous printk in kernel_restart().

Signed-off-by: Cal Peake <cp@absolutedigital.net>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Björn Steinbrink 111dbe0c8a [PATCH] Fix ____call_usermodehelper errors being silently ignored
If ____call_usermodehelper fails, we're not interested in the child
process' exit value, but the real error, so let's stop wait_for_helper from
overwriting it in that case.

Issue discovered by Benedikt Böhm while working on a Linux-VServer usermode
helper.

Signed-off-by: Björn Steinbrink <B.Steinbrink@gmx.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Alan Cox 54306cf04c [PATCH] exit: fix crash case
If we are going to BUG() not panic() here then we should cover the case of
the BUG being compiled out

Signed-off-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:16 -07:00
Roland McGrath 0b4a8a789a [PATCH] kexec warning fix
This fixes a couple of compiler warnings, and adds paranoia checks as well.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Atsushi Nemoto 3171a0305d [PATCH] simplify update_times (avoid jiffies/jiffies_64 aliasing problem)
Pass ticks to do_timer() and update_times(), and adjust x86_64 and s390
timer interrupt handler with this change.

Currently update_times() calculates ticks by "jiffies - wall_jiffies", but
callers of do_timer() should know how many ticks to update.  Passing ticks
get rid of this redundant calculation.  Also there are another redundancy
pointed out by Martin Schwidefsky.

This cleanup make a barrier added by
5aee405c66 needless.  So this patch removes
it.

As a bonus, this cleanup make wall_jiffies can be removed easily, since now
wall_jiffies is always synced with jiffies.  (This patch does not really
remove wall_jiffies.  It would be another cleanup patch)

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <ak@muc.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Acked-by: Russell King <rmk@arm.linux.org.uk>
Cc: Ian Molton <spyro@f2s.com>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Hirokazu Takata <takata.hirokazu@renesas.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
Cc: Richard Curnow <rc@rc0.org.uk>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: "Luck, Tony" <tony.luck@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath 27d91e07f9 [PATCH] __dequeue_signal() cleanup
This tightens up __dequeue_signal a little.  It also avoids doing
recalc_sigpending twice in a row, instead doing it once in dequeue_signal.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Roland McGrath b9ecb2bd5d [PATCH] has_stopped_jobs() cleanup
This check has been obsolete since the introduction of TASK_TRACED.  Now
TASK_STOPPED always means job control stop.

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe e4b765551a [PATCH] posix-timers: Fix the flags handling in posix_cpu_nsleep()
When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it
incorrectly reports the sleep time remaining to the user.  Because
posix_cpu_nsleep() doesn't report back to the user when it's called from
restart function due to the wrong flags handling.

This patch, which applies after previous one, moves the nanosleep() function
from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling
appropriately.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Toyo Abe 1711ef3866 [PATCH] posix-timers: Fix clock_nanosleep() doesn't return the remaining time in compatibility mode
The clock_nanosleep() function does not return the time remaining when the
sleep is interrupted by a signal.

This patch creates a new call out, compat_clock_nanosleep_restart(), which
handles returning the remaining time after a sleep is interrupted.  This
patch revives clock_nanosleep_restart().  It is now accessed via the new
call out.  The compat_clock_nanosleep_restart() is used for compatibility
access.

Since this is implemented in compatibility mode the normal path is
virtually unaffected - no real performance impact.

Signed-off-by: Toyo Abe <toyoa@mvista.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:15 -07:00
Akinobu Mita 07dccf3344 [PATCH] check return value of cpu_callback
Spawing ksoftirqd, migration, or watchdog, and calling init_timers_cpu()
may fail with small memory.  If it happens in initcalls, kernel NULL
pointer dereference happens later.  This patch makes crash happen
immediately in such cases.  It seems a bit better than getting kernel NULL
pointer dereference later.

Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Akinobu Mita <mita@miraclelinux.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:14 -07:00
Paul E. McKenney a45bce4954 [PATCH] memory ordering in __kfifo primitives
Both __kfifo_put() and __kfifo_get() have header comments stating that if
there is but one concurrent reader and one concurrent writer, locking is not
necessary.  This is almost the case, but a couple of memory barriers are
needed.  Another option would be to change the header comments to remove the
bit about locking not being needed, and to change the those callers who
currently don't use locking to add the required locking.  The attachment
analyzes this approach, but the patch below seems simpler.

Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Stelian Pop <stelian@popies.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Dave Jones 99de055ac0 [PATCH] lockdep: print kernel version
Lets do the same thing we do for oopses - print out the version in the
report.  It's an extra line of output though.  We could tack it on the end
of the INFO: lines, but that screws up Ingo's pretty output.

Signed-off-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:13 -07:00
Sukadev Bhattiprolu f400e198b2 [PATCH] pidspace: is_init()
This is an updated version of Eric Biederman's is_init() patch.
(http://lkml.org/lkml/2006/2/6/280).  It applies cleanly to 2.6.18-rc3 and
replaces a few more instances of ->pid == 1 with is_init().

Further, is_init() checks pid and thus removes dependency on Eric's other
patches for now.

Eric's original description:

	There are a lot of places in the kernel where we test for init
	because we give it special properties.  Most  significantly init
	must not die.  This results in code all over the kernel test
	->pid == 1.

	Introduce is_init to capture this case.

	With multiple pid spaces for all of the cases affected we are
	looking for only the first process on the system, not some other
	process that has pid == 1.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: <lxc-devel@lists.sourceforge.net>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Kirill Korotaev 3b9b8ab65d [PATCH] Fix unserialized task->files changing
Fixed race on put_files_struct on exec with proc.  Restoring files on
current on error path may lead to proc having a pointer to already kfree-d
files_struct.

->files changing at exit.c and khtread.c are safe as exit_files() makes all
things under lock.

Found during OpenVZ stress testing.

[akpm@osdl.org: add export]
Signed-off-by: Pavel Emelianov <xemul@openvz.org>
Signed-off-by: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:12 -07:00
Chuck Ebbert e6cab99bb4 [PATCH] unwind: fix unused variable warning when !CONFIG_MODULES
Fix "variable defined but not used" compiler warning in unwind.c when
CONFIG_MODULES is not set.

Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:11 -07:00
Rolf Eike Beer 2aae4a108d [PATCH] Fix kerneldoc comments in kernel/timer.c
Some of the kerneldoc comments in this file are ignored since the lead-in
is malformed, using either "/*" or "/***" instead of "/**".

[rdunlap@xenotime.net: kerneldoc fixes]
Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:10 -07:00
Steven Rostedt db630637b2 [PATCH] clean up and remove some extra spinlocks from rtmutex
Oleg brought up some interesting points about grabbing the pi_lock for some
protections.  In this discussion, I realized that there are some places
that the pi_lock is being grabbed when it really wasn't necessary.  Also
this patch does a little bit of clean up.

This patch basically does three things:

1) renames the "boost" variable to "chain_walk".  Since it is used in
   the debugging case when it isn't going to be boosted.  It better
   describes what the test is going to do if it succeeds.

2) moves get_task_struct to just before the unlocking of the wait_lock.
   This removes duplicate code, and makes it a little easier to read.  The
   owner wont go away while either the pi_lock or the wait_lock are held.

3) removes the pi_locking and owner blocked checking completely from the
   debugging case.  This is because the grabbing the lock and doing the
   check, then releasing the lock is just so full of races.  It's just as
   good to go ahead and call the pi_chain_walk function, since after
   releasing the lock the owner can then block anyway, and we would have
   missed that.  For the debug case, we really do want to do the chain walk
   to test for deadlocks anyway.

[oleg@tv-sign.ru: more of the same]
Signed-of-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Esben Nielsen <nielsen.esben@googlemail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Alexey Dobriyan 6c5c934153 [PATCH] ifdef blktrace debugging fields
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Josh Triplett 89e7e374dd [PATCH] timer: add lock annotation to lock_timer_base
lock_timer_base acquires a lock and returns with that lock held.  Add a
lock annotation to this function so that sparse can check callers for lock
pairing, and so that sparse will not complain about this function since it
intentionally uses the lock in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:09 -07:00
Mark Huang d10be6d1bd [PATCH] module_subsys: initialize earlier
Initialize module_subsys earlier (or at least earlier than devices) since
it could be used very early in the boot process if kmod loads a module
before the device initcalls.  Otherwise, kmod will crash in
kernel/module.c:mod_sysfs_setup() since the kset in module_subsys is not
initialized yet.

I only noticed this problem because occasionally, kmod loads the modules
for my SCSI and Ethernet adapters very early, during the boot process
itself.  I don't quite understand why it loads them sometimes and doesn't
load them other times.  Or who is telling kmod to do so.  Can someone
explain?

Signed-off-by: Mark Huang <mlhuang@cs.princeton.edu>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Josh Triplett a49a4af759 [PATCH] rcu: add lock annotations to rcu{,_bh}_torture_read_{lock,unlock}
rcu_torture_read_lock and rcu_bh_torture_read_lock acquire locks without
releasing them, and the matching functions rcu_torture_read_unlock and
rcu_bh_torture_read_unlock get called with the corresponding locks held and
release them.  Add lock annotations to these four functions so that sparse
can check callers for lock pairing, and so that sparse will not complain
about these functions since they intentionally use locks in this manner.

Signed-off-by: Josh Triplett <josh@freedesktop.org>
Acked-by: Paul McKenney <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:08 -07:00
Yoichi Yuasa 538d9d532b [PATCH] irq: remove a extra line
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Yoichi Yuasa 2ff6fd8f4a [PATCH] irq: fixed coding style
Signed-off-by: Yoichi Yuasa <yoichi_yuasa@tripeaks.co.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:07 -07:00
Randy Dunlap 4c78a66393 [PATCH] kernel-doc for relay interface
Add relay interface support to DocBook/kernel-api.tmpl.  Fix typos etc.  in
relay.c and relayfs.txt.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Acked-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:06 -07:00
Randy Dunlap d8c7649e99 [PATCH] kernel/params: driver layer error checking
Check driver layer return values in kernel/params.c

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:04 -07:00
Matthew Wilcox 910067d188 [PATCH] remove generic__raw_read_trylock()
If the cpu has the lock held for write, is interrupted, and the interrupt
handler calls read_trylock(), it's an instant deadlock.

Now, Dave Miller has subsequently pointed out that we don't have any
situations where this can occur.  Nevertheless, we should delete
generic__raw_read_lock (and its associated EXPORT to make Arjan happy) so that
nobody thinks they can use it.

Acked-by: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29 09:18:03 -07:00
Linus Torvalds ebdea46fec Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (130 commits)
  [ARM] 3856/1: Add clocksource for Intel IXP4xx platforms
  [ARM] 3855/1: Add generic time support
  [ARM] 3873/1: S3C24XX: Add irq_chip names
  [ARM] 3872/1: S3C24XX: Apply consistant tabbing to irq_chips
  [ARM] 3871/1: S3C24XX: Fix ordering of EINT4..23
  [ARM] nommu: confirms the CR_V bit in nommu mode
  [ARM] nommu: abort handler fixup for !CPU_CP15_MMU cores.
  [ARM] 3870/1: AT91: Start removing static memory mappings
  [ARM] 3869/1: AT91: NAND support for DK and KB9202 boards
  [ARM] 3868/1: AT91 hardware header update
  [ARM] 3867/1: AT91 GPIO update
  [ARM] 3866/1: AT91 clock update
  [ARM] 3865/1: AT91RM9200 header updates
  [ARM] 3862/2: S3C2410 - add basic power management support for AML M5900 series
  [ARM] kthread: switch arch/arm/kernel/apm.c
  [ARM] Off-by-one in arch/arm/common/icst*
  [ARM] 3864/1: Refactore sharpsl_pm
  [ARM] 3863/1: Add Locomo SPI Device
  [ARM] 3847/2:  Convert LOMOMO to use struct device for GPIOs
  [ARM] Use CPU_CACHE_* where possible in asm/cacheflush.h
  ...
2006-09-28 14:40:39 -07:00
Russell King 2dc94310bd Merge master.kernel.org:/pub/scm/linux/kernel/git/tmlind/linux-omap-upstream into devel 2006-09-27 19:57:54 +01:00
Eric W. Biederman 65800ac77e [PATCH] pid: remove temporary debug code in attach_pid
With the patches flying between Oleg and myself somehow this temporary
debug code got left in pid.c.  It was never intended to make it to the
stable kernel.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman c18258c6f0 [PATCH] pid: Implement transfer_pid and use it to simplify de_thread
In de_thread we move pids from one process to another, a rather ugly case.
The function transfer_pid makes it clear what we are doing, and makes the
action atomic.  This is useful we ever want to atomically traverse the
process group and session lists, in a rcu safe manner.

Even if the atomic properties this change should be a win as transfer_pid
should be less code to execute than executing both attach_pid and
detach_pid, and this should make de_thread slightly smaller as only a
single function call needs to be emitted.  The only downside is that the
code might be slower to execute as the odds are against transfer_pid being
in cache.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Eric W. Biederman b89a81712f [PATCH] sysctl: Allow /proc/sys without sys_sysctl
Since sys_sysctl is deprecated start allow it to be compiled out.  This
should catch any remaining user space code that cares, and paves the way
for further sysctl cleanups.

[akpm@osdl.org: If sys_sysctl() is not compiled-in, emit a warning]
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:19 -07:00
Theodore Ts'o ba52de123d [PATCH] inode-diet: Eliminate i_blksize from the inode structure
This eliminates the i_blksize field from struct inode.  Filesystems that want
to provide a per-inode st_blksize can do so by providing their own getattr
routine instead of using the generic_fillattr() function.

Note that some filesystems were providing pretty much random (and incorrect)
values for i_blksize.

[bunk@stusta.de: cleanup]
[akpm@osdl.org: generic_fillattr() fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:18 -07:00
Theodore Ts'o 8e18e2941c [PATCH] inode_diet: Replace inode.u.generic_ip with inode.i_private
The following patches reduce the size of the VFS inode structure by 28 bytes
on a UP x86.  (It would be more on an x86_64 system).  This is a 10% reduction
in the inode size on a UP kernel that is configured in a production mode
(i.e., with no spinlock or other debugging functions enabled; if you want to
save memory taken up by in-core inodes, the first thing you should do is
disable the debugging options; they are responsible for a huge amount of bloat
in the VFS inode structure).

This patch:

The filesystem or device-specific pointer in the inode is inside a union,
which is pretty pointless given that all 30+ users of this field have been
using the void pointer.  Get rid of the union and rename it to i_private, with
a comment to explain who is allowed to use the void pointer.  This is just a
cleanup, but it allows us to reuse the union 'u' for something something where
the union will actually be used.

[judith@osdl.org: powerpc build fix]
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Judith Lebzelter <judith@osdl.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:17 -07:00
David Howells f269fdd182 [PATCH] NOMMU: move the fallback arch_vma_name() to a sensible place
Move the fallback arch_vma_name() to a sensible place (kernel/signal.c).

Currently it's in fs/proc/task_mmu.c, a file that is dependent on both
CONFIG_PROC_FS and CONFIG_MMU being enabled, but it's used from
kernel/signal.c from where it is called unconditionally.

[akpm@osdl.org: build fix]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:15 -07:00
David Howells 0ec76a110f [PATCH] NOMMU: Check that access_process_vm() has a valid target
Check that access_process_vm() is accessing a valid mapping in the target
process.

This limits ptrace() accesses and accesses through /proc/<pid>/maps to only
those regions actually mapped by a program.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27 08:26:14 -07:00
Matthew Wilcox d33b6fba2c Resources: insert identical resources above existing resources
If you have two resources which aree exactly the same size,
insert_resource() currently inserts the new one below the existing one. 
This is wrong because there's no way to insert a resource of the same size
above an existing one.

I took this opportunity to rewrite the initial loop to be a for-loop
instead of a goto-loop and fix the documentation.

Signed-off-by: Matthew Wilcox <matthew@wil.cx>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-09-26 17:43:52 -07:00
Linus Torvalds b278240839 Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits)
  [PATCH] Don't set calgary iommu as default y
  [PATCH] i386/x86-64: New Intel feature flags
  [PATCH] x86: Add a cumulative thermal throttle event counter.
  [PATCH] i386: Make the jiffies compares use the 64bit safe macros.
  [PATCH] x86: Refactor thermal throttle processing
  [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64)
  [PATCH] Fix unwinder warning in traps.c
  [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1
  [PATCH] x86: Move direct PCI scanning functions out of line
  [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI
  [PATCH] Don't leak NT bit into next task
  [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder
  [PATCH] Fix some broken white space in ia32_signal.c
  [PATCH] Initialize argument registers for 32bit signal handlers.
  [PATCH] Remove all traces of signal number conversion
  [PATCH] Don't synchronize time reading on single core AMD systems
  [PATCH] Remove outdated comment in x86-64 mmconfig code
  [PATCH] Use string instructions for Core2 copy/clear
  [PATCH] x86: - restore i8259A eoi status on resume
  [PATCH] i386: Split multi-line printk in oops output.
  ...
2006-09-26 13:07:55 -07:00
Linus Torvalds dd77a4ee0f Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/driver-2.6: (47 commits)
  Driver core: Don't call put methods while holding a spinlock
  Driver core: Remove unneeded routines from driver core
  Driver core: Fix potential deadlock in driver core
  PCI: enable driver multi-threaded probe
  Driver Core: add ability for drivers to do a threaded probe
  sysfs: add proper sysfs_init() prototype
  drivers/base: check errors
  drivers/base: Platform notify needs to occur before drivers attach to the device
  v4l-dev2: handle __must_check
  add CONFIG_ENABLE_MUST_CHECK
  add __must_check to device management code
  Driver core: fixed add_bind_files() definition
  Driver core: fix comments in drivers/base/power/resume.c
  sysfs_remove_bin_file: no return value, dump_stack on error
  kobject: must_check fixes
  Driver core: add ability for devices to create and remove bin files
  Class: add support for class interfaces for devices
  Driver core: create devices/virtual/ tree
  Driver core: add device_rename function
  Driver core: add ability for classes to handle devices properly
  ...
2006-09-26 11:49:46 -07:00
Rafael J. Wysocki c5c6ba4e08 [PATCH] PM: Add pm_trace switch
Add the pm_trace attribute in /sys/power which has to be explicitly set to
one to really enable the "PM tracing" code compiled in when CONFIG_PM_TRACE
is set (which modifies the machine's CMOS clock in unpredictable ways).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:04 -07:00
Rafael J. Wysocki c8eb8b4025 [PATCH] PM: make it possible to disable console suspending
Change suspend_console() so that it waits for all consoles to flush the
remaining messages and make it possible to switch the console suspending off
with the help of a Kconfig option.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Stefan Seyfried <seife@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:03 -07:00
Rafael J. Wysocki 940864ddab [PATCH] swsusp: Use memory bitmaps during resume
Make swsusp use memory bitmaps to store its internal information during the
resume phase of the suspend-resume cycle.

If the pfns of saveable pages are saved during the suspend phase instead of
the kernel virtual addresses of these pages, we can use them during the resume
phase directly to set the corresponding bits in a memory bitmap.  Then, this
bitmap is used to mark the page frames corresponding to the pages that were
saveable before the suspend (aka "unsafe" page frames).

Next, we allocate as many page frames as needed to store the entire suspend
image and make sure that there will be some extra free "safe" page frames for
the list of PBEs constructed later.  Subsequently, the image is loaded and, if
possible, the data loaded from it are written into their "original" page
frames (ie.  the ones they had occupied before the suspend).

The image data that cannot be written into their "original" page frames are
loaded into "safe" page frames and their "original" kernel virtual addresses,
as well as the addresses of the "safe" pages containing their copies, are
stored in a list of PBEs.  Finally, the list of PBEs is used to copy the
remaining image data into their "original" page frames (this is done
atomically, by the architecture-dependent parts of swsusp).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki b788db7989 [PATCH] swsusp: Introduce memory bitmaps
Introduce the memory bitmap data structure and make swsusp use in the suspend
phase.

The current swsusp's internal data structure is not very efficient from the
memory usage point of view, so it seems reasonable to replace it with a data
structure that will require less memory, such as a pair of bitmaps.

The idea is to use bitmaps that may be allocated as sets of individual pages,
so that we can avoid making allocations of order greater than 0.  For this
reason the memory bitmap structure consists of several linked lists of objects
that contain pointers to memory pages with the actual bitmap data.  Still, for
a typical system all of these lists fit in a single page, so it's reasonable
to introduce an additional mechanism allowing us to allocate all of them
efficiently without sacrificing the generality of the design.  This is done
with the help of the chain_allocator structure and associated functions.

We need to use two memory bitmaps during the suspend phase of the
suspend-resume cycle.  One of them is necessary for marking the saveable
pages, and the second is used to mark the pages in which to store the copies
of them (aka image pages).

First, the bitmaps are created and we allocate as many image pages as needed
(the corresponding bits in the second bitmap are set as soon as the pages are
allocated).  Second, the bits corresponding to the saveable pages are set in
the first bitmap and the saveable pages are copied to the image pages.
Finally, the first bitmap is used to save the kernel virtual addresses of the
saveable pages and the second one is used to save the contents of the image
pages.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki 0bcd888d64 [PATCH] swsusp: Introduce some helpful constants
Introduce some constants that hopefully will help improve the readability of
code in kernel/power/snapshot.c.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:02 -07:00
Rafael J. Wysocki 75534b50cc [PATCH] Change the name of pagedir_nosave
The name of the pagedir_nosave variable does not make sense any more, so it
seems reasonable to change it to something more meaningful.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:49:01 -07:00
Rafael J. Wysocki cd560bb2f9 [PATCH] swsusp: Fix alloc_pagedir
Get rid of the FIXME in kernel/power/snapshot.c#alloc_pagedir() and
simplify the functions called by it.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki f6143aa60e [PATCH] swsusp: Reorder memory-allocating functions
Move some functions in kernel/power/snapshot.c to a better place (in the
same file) and introduce free_image_page() (will be necessary in the
future).

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki f623f0db8e [PATCH] swsusp: Fix mark_free_pages
Clean up mm/page_alloc.c#mark_free_pages() and make it avoid clearing
PageNosaveFree for PageNosave pages.  This allows us to get rid of an ugly
hack in kernel/power/snapshot.c#copy_data_pages().

Additionally, the page-copying loop in copy_data_pages() is moved to an
inline function.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki e3920fb42c [PATCH] Disable CPU hotplug during suspend
The current suspend code has to be run on one CPU, so we use the CPU
hotplug to take the non-boot CPUs offline on SMP machines.  However, we
should also make sure that these CPUs will not be enabled by someone else
after we have disabled them.

The functions disable_nonboot_cpus() and enable_nonboot_cpus() are moved to
kernel/cpu.c, because they now refer to some stuff in there that should
better be static.  Also it's better if disable_nonboot_cpus() returns an
error instead of panicking if something goes wrong, and
enable_nonboot_cpus() has no reason to panic(), because the CPUs may have
been enabled by the userland before it tries to take them online.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:59 -07:00
Rafael J. Wysocki fb13a28b0f [PATCH] swsusp: struct snapshot_handle cleanup
Add comments describing struct snapshot_handle and its members, change the
confusing name of its member 'page' to 'cur'.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Rafael J. Wysocki ae83c5eef5 [PATCH] swsusp: clean up browsing of pfns
Clean up some loops over pfns for each zone in snapshot.c: reduce the
number of additions to perform, rework detection of saveable pages and make
the code a bit less difficult to understand, hopefully.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton 546e0d2719 [PATCH] swsusp: read speedup
Implement async reads for swsusp resuming.

Crufty old PIII testbox:
	15.7 MB/s -> 20.3 MB/s

Sony Vaio:
	14.6 MB/s -> 33.3 MB/s

I didn't implement the post-resume bio_set_pages_dirty().  I don't really
understand why resume needs to run set_page_dirty() against these pages.

It might be a worry that this code modifies PG_Uptodate, PG_Error and
PG_Locked against the image pages.  Can this possibly affect the resumed-into
kernel?  Hopefully not, if we're atomically restoring its mem_map?

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Jens Axboe <axboe@suse.de>
Cc: Laurent Riffard <laurent.riffard@free.fr>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton 8c002494b5 [PATCH] swsusp: add read-speed instrumentation
Add some instrumentation to the swsusp readin code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton ab95416035 [PATCH] swsusp: write speedup
Switch the swsusp writeout code from 4k-at-a-time to 4MB-at-a-time.

Crufty old PIII testbox:
	12.9 MB/s -> 20.9 MB/s

Sony Vaio:
	14.7 MB/s -> 26.5 MB/s

The implementation is crude.  A better one would use larger BIOs, but wouldn't
gain any performance.

The memcpys will be mostly pipelined with the IO and basically come for free.

The ENOMEM path has not been tested.  It should be.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
Andrew Morton 3a4f7577c9 [PATCH] swsusp: add write-speed instrumentation
Add some instrumentation to the swsusp writeout code to show what bandwidth
we're achieving.

Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:58 -07:00
David Howells af8c65b57a [PATCH] FRV: permit __do_IRQ() to be dispensed with
Permit __do_IRQ() to be dispensed with based on a configuration option.

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:53 -07:00
Stephen Smalley 1a70cd40cb [PATCH] selinux: rename selinux_ctxid_to_string
Rename selinux_ctxid_to_string to selinux_sid_to_string to be
consistent with other interfaces.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Stephen Smalley 62bac0185a [PATCH] selinux: eliminate selinux_task_ctxid
Eliminate selinux_task_ctxid since it duplicates selinux_task_get_sid.

Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter 89fa30242f [PATCH] NUMA: Add zone_to_nid function
There are many places where we need to determine the node of a zone.
Currently we use a difficult to read sequence of pointer dereferencing.
Put that into an inline function and use throughout VM.  Maybe we can find
a way to optimize the lookup in the future.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:52 -07:00
Christoph Lameter 0ff38490c8 [PATCH] zone_reclaim: dynamic slab reclaim
Currently one can enable slab reclaim by setting an explicit option in
/proc/sys/vm/zone_reclaim_mode.  Slab reclaim is then used as a final
option if the freeing of unmapped file backed pages is not enough to free
enough pages to allow a local allocation.

However, that means that the slab can grow excessively and that most memory
of a node may be used by slabs.  We have had a case where a machine with
46GB of memory was using 40-42GB for slab.  Zone reclaim was effective in
dealing with pagecache pages.  However, slab reclaim was only done during
global reclaim (which is a bit rare on NUMA systems).

This patch implements slab reclaim during zone reclaim.  Zone reclaim
occurs if there is a danger of an off node allocation.  At that point we

1. Shrink the per node page cache if the number of pagecache
   pages is more than min_unmapped_ratio percent of pages in a zone.

2. Shrink the slab cache if the number of the nodes reclaimable slab pages
   (patch depends on earlier one that implements that counter)
   are more than min_slab_ratio (a new /proc/sys/vm tunable).

The shrinking of the slab cache is a bit problematic since it is not node
specific.  So we simply calculate what point in the slab we want to reach
(current per node slab use minus the number of pages that neeed to be
allocated) and then repeately run the global reclaim until that is
unsuccessful or we have reached the limit.  I hope we will have zone based
slab reclaim at some point which will make that easier.

The default for the min_slab_ratio is 5%

Also remove the slab option from /proc/sys/vm/zone_reclaim_mode.

[akpm@osdl.org: cleanups]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:51 -07:00
Christoph Lameter fbd98167e6 [PATCH] Profiling: require buffer allocation on the correct node
Profiling really suffers with off node buffers.  Fail if no memory is
available on the nodes.  The profiling code can deal with these failures
should they occur.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26 08:48:50 -07:00