1
0
Fork 0
Commit Graph

391 Commits (90810645f78f894acfb04b3768e8a7d45f2b303a)

Author SHA1 Message Date
Alexey Dobriyan 88e4ccf294 slub: current is always valid
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-07-15 20:36:01 +03:00
Christoph Lameter 0937502af7 slub: Add check for kfree() of non slab objects.
We can detect kfree()s on non slab objects by checking for PageCompound().
Works in the same way as for ksize. This helped me catch an invalid
kfree().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-07-15 20:36:01 +03:00
Linus Torvalds 7daf705f36 Start using the new '%pS' infrastructure to print symbols
This simplifies the code significantly, and was the whole point of the
exercise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-14 12:12:53 -07:00
Dmitry Adamushko bdb2192851 slub: Fix use-after-preempt of per-CPU data structure
Vegard Nossum reported a crash in kmem_cache_alloc():

	BUG: unable to handle kernel paging request at da87d000
	IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0
	*pde = 28180163 *pte = 1a87d160
	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC
	Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5)
	EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0
	EIP is at kmem_cache_alloc+0xc7/0xe0
	EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b
	ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54
	DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068

and analyzed it:

  "The register %ecx looks innocent but is very important here. The disassembly:

       mov    %edx,%ecx
       shr    $0x2,%ecx
       rep stos %eax,%es:(%edi) <-- the fault

   So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE.
   (0x6b6b6b6b >> 2 == 0x1adadada.)

   %ecx is the counter for the memset, from here:

       memset(object, 0, c->objsize);

  i.e. %ecx was loaded from c->objsize, so "c" must have been freed.
  Where did "c" come from? Uh-oh...

       c = get_cpu_slab(s, smp_processor_id());

  This looks like it has very much to do with CPU hotplug/unplug. Is
  there a race between SLUB/hotplug since the CPU slab is used after it
  has been freed?"

Good analysis.

Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc()
can be migrated on another CPU right after local_irq_restore() and
before memset().  The inital cpu can become offline in the mean time (or
a migration is a consequence of the CPU going offline) so its
'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback).

At some point of time the caller continues on another CPU having an
obsolete pointer...

Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-10 15:18:50 -07:00
Christoph Lameter cde5353599 Christoph has moved
Remove all clameter@sgi.com addresses from the kernel tree since they will
become invalid on June 27th.  Change my maintainer email address for the
slab allocators to cl@linux-foundation.org (which will be the new email
address for the future).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04 10:40:04 -07:00
Christoph Lameter 41d54d3bf8 slub: Do not use 192 byte sized cache if minimum alignment is 128 byte
The 192 byte cache is not necessary if we have a basic alignment of 128
byte. If it would be used then the 192 would be aligned to the next 128 byte
boundary which would result in another 256 byte cache. Two 256 kmalloc caches
cause sysfs to complain about a duplicate entry.

MIPS needs 128 byte aligned kmalloc caches and spits out warnings on boot without
this patch.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-07-03 19:01:55 +03:00
Jens Axboe 15c8b6c1aa on_each_cpu(): kill unused 'retry' parameter
It's not even passed on to smp_call_function() anymore, since that
was removed. So kill it.

Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2008-06-26 11:24:38 +02:00
Pekka Enberg 76994412f8 slub: ksize() abuse checks
Add a WARN_ON for pages that don't have PageSlab nor PageCompound set to catch
the worst abusers of ksize() in the kernel.

Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-22 19:52:18 +03:00
Benjamin Herrenschmidt 4ea33e2dc2 slub: fix atomic usage in any_slab_objects()
any_slab_objects() does an atomic_read on an atomic_long_t, this
fixes it to use atomic_long_read instead.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-08 10:46:56 -07:00
Christoph Lameter f6acb63508 slub: #ifdef simplification
If we make SLUB_DEBUG depend on SYSFS then we can simplify some
#ifdefs and avoid others.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-02 00:27:13 +03:00
Christoph Lameter 0121c619d0 slub: Whitespace cleanup and use of strict_strtoul
Fix some issues with wrapping and use strict_strtoul to make parameter
passing from sysfs safer.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-05-02 00:26:31 +03:00
Roman Zippel f8bd2258e2 remove div_long_long_rem
x86 is the only arch right now, which provides an optimized for
div_long_long_rem and it has the downside that one has to be very careful that
the divide doesn't overflow.

The API is a little akward, as the arguments for the unsigned divide are
signed.  The signed version also doesn't handle a negative divisor and
produces worse code on 64bit archs.

There is little incentive to keep this API alive, so this converts the few
users to the new API.

Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-05-01 08:03:58 -07:00
Thomas Gleixner 3ac7fe5a4a infrastructure to debug (dynamic) objects
We can see an ever repeating problem pattern with objects of any kind in the
kernel:

1) freeing of active objects
2) reinitialization of active objects

Both problems can be hard to debug because the crash happens at a point where
we have no chance to decode the root cause anymore.  One problem spot are
kernel timers, where the detection of the problem often happens in interrupt
context and usually causes the machine to panic.

While working on a timer related bug report I had to hack specialized code
into the timer subsystem to get a reasonable hint for the root cause.  This
debug hack was fine for temporary use, but far from a mergeable solution due
to the intrusiveness into the timer code.

The code further lacked the ability to detect and report the root cause
instantly and keep the system operational.

Keeping the system operational is important to get hold of the debug
information without special debugging aids like serial consoles and special
knowledge of the bug reporter.

The problems described above are not restricted to timers, but timers tend to
expose it usually in a full system crash.  Other objects are less explosive,
but the symptoms caused by such mistakes can be even harder to debug.

Instead of creating specialized debugging code for the timer subsystem a
generic infrastructure is created which allows developers to verify their code
and provides an easy to enable debug facility for users in case of trouble.

The debugobjects core code keeps track of operations on static and dynamic
objects by inserting them into a hashed list and sanity checking them on
object operations and provides additional checks whenever kernel memory is
freed.

The tracked object operations are:
- initializing an object
- adding an object to a subsystem list
- deleting an object from a subsystem list

Each operation is sanity checked before the operation is executed and the
subsystem specific code can provide a fixup function which allows to prevent
the damage of the operation.  When the sanity check triggers a warning message
and a stack trace is printed.

The list of operations can be extended if the need arises.  For now it's
limited to the requirements of the first user (timers).

The core code enqueues the objects into hash buckets.  The hash index is
generated from the address of the object to simplify the lookup for the check
on kfree/vfree.  Each bucket has it's own spinlock to avoid contention on a
global lock.

The debug code can be compiled in without being active.  The runtime overhead
is minimal and could be optimized by asm alternatives.  A kernel command line
option enables the debugging code.

Thanks to Ingo Molnar for review, suggestions and cleanup patches.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Greg KH <greg@kroah.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:53 -07:00
Nadia Derbey 0c40ba4fd6 ipc: define the slab_memory_callback priority as a constant
This is a trivial patch that defines the priority of slab_memory_callback in
the callback chain as a constant.  This is to prepare for next patch in the
series.

Signed-off-by: Nadia Derbey <Nadia.Derbey@bull.net>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Pierre Peiffer <pierre.peiffer@bull.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-29 08:06:12 -07:00
Linus Torvalds e97e386b12 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: pack objects denser
  slub: Calculate min_objects based on number of processors.
  slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS
  slub: Simplify any_slab_object checks
  slub: Make the order configurable for each slab cache
  slub: Drop fallback to page allocator method
  slub: Fallback to minimal order during slab page allocation
  slub: Update statistics handling for variable order slabs
  slub: Add kmem_cache_order_objects struct
  slub: for_each_object must be passed the number of objects in a slab
  slub: Store max number of objects in the page struct.
  slub: Dump list of objects not freed on kmem_cache_close()
  slub: free_list() cleanup
  slub: improve kmem_cache_destroy() error message
  slob: fix bug - when slob allocates "struct kmem_cache", it does not force alignment.
2008-04-28 14:08:56 -07:00
Pekka Enberg 1b27d05b6e mm: move cache_line_size() to <linux/cache.h>
Not all architectures define cache_line_size() so as suggested by Andrew move
the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:19 -07:00
Mel Gorman dd1a239f6f mm: have zonelist contains structs with both a zone pointer and zone_idx
Filtering zonelists requires very frequent use of zone_idx().  This is costly
as it involves a lookup of another structure and a substraction operation.  As
the zone_idx is often required, it should be quickly accessible.  The node idx
could also be stored here if it was found that accessing zone->node is
significant which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs which
are looked up as necessary.  Helpers are given for accessing the zone index as
well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
[hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
[hugh@veritas.com: just return do_try_to_free_pages]
[hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman 54a6eb5c47 mm: use two zonelist that are filtered by GFP mask
Currently a node has two sets of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations.  Based on the zones
allowed by a gfp mask, one of these zonelists is selected.  All of these
zonelists consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists.  The
first contains all populated zones in the system, ordered by distance, for
fallback allocations when the target/preferred node has no free pages.  The
second contains all populated zones in the node suitable for GFP_THISNODE
allocations.

An iterator macro is introduced called for_each_zone_zonelist() that interates
through each zone allowed by the GFP flags in the selected zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Mel Gorman 0e88460da6 mm: introduce node_zonelist() for accessing the zonelist for a GFP mask
Introduce a node_zonelist() helper function.  It is used to lookup the
appropriate zonelist given a node and a GFP mask.  The patch on its own is a
cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
necessary, it can be merged with the next patch in this set without problems.

Reviewed-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 08:58:18 -07:00
Christoph Lameter c124f5b54f slub: pack objects denser
Since we now have more orders available use a denser packing.
Increase slab order if more than 1/16th of a slab would be wasted.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:40 +03:00
Christoph Lameter 9b2cd506e5 slub: Calculate min_objects based on number of processors.
The mininum objects per slab is calculated based on the number of processors
that may come online.

Processors    min_objects
---------------------------
1             8
2             12
4             16
8             20
16            24
32            28
64            32
1024          48
4096          56

The higher the number of processors the large the order sizes used for various
slab caches will become. This has been shown to address the performance issues
in hackbench on 16p etc.

The calculation is only performed if slub_min_objects is zero (default). If one
specifies a slub_min_objects on boot then that setting is taken.

As suggested by Zhang Yanmin's performance tests on 16-core Tigerton, use the
formula '4 * (fls(nr_cpu_ids) + 1)':

  ./hackbench 100 process 2000:

  1) 2.6.25-rc6slab: 23.5 seconds
  2) 2.6.25-rc7SLUB+slub_min_objects=20: 31 seconds
  3) 2.6.25-rc7SLUB+slub_min_objects=24: 23.5 seconds

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:40 +03:00
Christoph Lameter 114e9e89e6 slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS
We can now fallback to order 0 slabs. So set the slub_max_order to
PAGE_CACHE_ORDER_COSTLY but keep the slub_min_objects at 4. This
will mostly preserve the orders used in 2.6.25. F.e. The 2k kmalloc slab
will use order 1 allocs and the 4k kmalloc slab order 2.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:39 +03:00
Christoph Lameter 31d33baf36 slub: Simplify any_slab_object checks
Since we now have total_objects counter per node use that to
check for the presence of any objects. The loop over all cpu slabs
is not that useful since any cpu slab would require an object allocation
first. So drop that.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:18 +03:00
Christoph Lameter 06b285dc3d slub: Make the order configurable for each slab cache
Makes /sys/kernel/slab/<slabname>/order writable. The allocation
order of a slab cache can then be changed dynamically during runtime.
This can be used to override the objects per slabs value establisheed
with the slub_min_objects setting that was manually specified or
calculated on bootup.

The changes of the slab order can occur while allocate_slab() runs.
Allocate slab needs the order and the number of slab objects that
are both changed by the change of order. Both are put into
a single word (struct kmem_cache_order_objects). They can then
be atomically updated and retrieved.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:18 +03:00
Christoph Lameter 319d1e2406 slub: Drop fallback to page allocator method
There is now a generic method of falling back to a slab page of minimal
order. No need anymore for the fallback to kmalloc_large().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:18 +03:00
Christoph Lameter 65c3376aac slub: Fallback to minimal order during slab page allocation
If any higher order allocation fails then fall back the smallest order
necessary to contain at least one object. This enables fallback for all
allocations to order 0 pages. The fallback will waste more memory (objects
will not fit neatly) and the fallback slabs will be not as efficient as larger
slabs since they contain less objects.

Note that SLAB also depends on order 1 allocations for some slabs that waste
too much memory if forced into PAGE_SIZE'd page. SLUB now can now deal with
failing order 1 allocs which SLAB cannot do.

Add a new field min that will contain the objects for the smallest possible order
for a slab cache.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:18 +03:00
Christoph Lameter 205ab99dd1 slub: Update statistics handling for variable order slabs
Change the statistics to consider that slabs of the same slabcache
can have different number of objects in them since they may be of
different order.

Provide a new sysfs field

	total_objects

which shows the total objects that the allocated slabs of a slabcache
could hold.

Add a max field that holds the largest slab order that was ever used
for a slab cache.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:17 +03:00
Christoph Lameter 834f3d1192 slub: Add kmem_cache_order_objects struct
Pack the order and the number of objects into a single word.
This saves some memory in the kmem_cache_structure and more importantly
allows us to fetch both values atomically.

Later the slab orders become runtime configurable and we need to fetch these
two items together in order to properly allocate a slab and initialize its
objects.

Fix the race by fetching the order and the number of objects in one word.

[penberg@cs.helsinki.fi: fix memset() page order in new_slab()]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:17 +03:00
Christoph Lameter 224a88be40 slub: for_each_object must be passed the number of objects in a slab
Pass the number of objects to the for_each_object macro. Most of these are
debug related.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:17 +03:00
Christoph Lameter 39b264641a slub: Store max number of objects in the page struct.
Split the inuse field up to be able to store the number of objects in this
page in the page struct as well. Necessary if we want to have pages of
various orders for a slab. Also avoids touching struct kmem_cache cachelines in
__slab_alloc().

Update diagnostic code to check the number of objects and make sure that
the number of objects always stays within the bounds of a 16 bit unsigned
integer.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:28:16 +03:00
Christoph Lameter 33b12c3813 slub: Dump list of objects not freed on kmem_cache_close()
Dump a list of unfreed objects if a slab cache is closed but
objects still remain.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:27:37 +03:00
Christoph Lameter 599870b175 slub: free_list() cleanup
free_list looked a bit screwy so here is an attempt to clean it up.

free_list is is only used for freeing partial lists. We do not need to return a
parameter if we decrement nr_partial within the function which allows a
simplification of the whole thing.

The current version modifies nr_partial outside of the list_lock which is
technically not correct. It was only ok because we should be the only user of
this slab cache at this point.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:26:18 +03:00
Pekka Enberg d629d81957 slub: improve kmem_cache_destroy() error message
As pointed out by Ingo, the SLUB warning of calling kmem_cache_destroy()
with cache that still has objects triggers in practice. So turn this
WARN_ON() into a nice SLUB specific error message to avoid people
confusing it to a SLUB bug.

Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-27 18:26:06 +03:00
Christoph Lameter 3dc5063786 slab_err: Pass parameters correctly to slab_bug
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-23 12:47:48 -07:00
Christoph Lameter 0f389ec630 slub: No need for per node slab counters if !SLUB_DEBUG
The per node counters are used mainly for showing data through the sysfs API.
If that API is not compiled in then there is no point in keeping track of this
data. Disable counters for the number of slabs and the number of total slabs
if !SLUB_DEBUG. Incrementing the per node counters is also accessing a
potentially contended cacheline so this could actually be a performance
benefit to embedded systems.

SLABINFO support is also affected. It now must depends on SLUB_DEBUG (which
is on by default).

Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab()
if the system is not compiled with NUMA support.

[penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:53:02 +03:00
Christoph Lameter 49bd5221ce slub: Move map/flag clearing to __free_slab
__free_slab does some diagnostics. The resetting of mapcount etc
in discard_slab() can interfere with debug processing. So move
the reset immediately before the page is freed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:52:18 +03:00
Christoph Lameter 50ef37b96c slub: Fixes to per cpu stat output in sysfs
Only output per cpu stats if the kernel is build for SMP.

Use a capital "C" as a leading character for the processor number
(same as the numa statistics that also use a capital letter "N").

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:52:05 +03:00
Christoph Lameter 5b06c853ad slub: Deal with config variable dependencies
count_partial() is used by both slabinfo and the sysfs proc support. Move
the function directly before the beginning of the sysfs code so that it can
be easily found. Rework the preprocessor conditional to take into account
that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG.

Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There
is no point of keeping statistics if no one can restrive them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:51:34 +03:00
Christoph Lameter 4097d60175 slub: Reduce #ifdef ZONE_DMA by moving kmalloc_caches_dma near dma logic
Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA.
This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:51:18 +03:00
Pekka Enberg 62f75532b5 slub: Initialize per-cpu stats
As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before
using it.

[kmem_cache_cpu structures are usually allocated from arrays defined via
DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl].

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-04-14 18:50:44 +03:00
Christoph Lameter 00460dd5f4 Fix undefined count_partial if !CONFIG_SLABINFO
Small typo in the patch recently merged to avoid the unused symbol
message for count_partial(). Discussion thread with confirmation of fix at
http://marc.info/?t=120696854400001&r=1&w=2

Typo in the check if we need the count_partial function that was
introduced by 53625b4204

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-01 12:44:06 -07:00
Linus Torvalds e72e9c23ee Revert "SLUB: remove useless masking of GFP_ZERO"
This reverts commit 3811dbf671.

The masking was not at all useless, and it was sensible.  We handle
GFP_ZERO in the caller, and passing it down to any page allocator logic
is buggy and wrong.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-27 20:56:33 -07:00
Christoph Lameter 53625b4204 count_partial() is not used if !SLUB_DEBUG and !CONFIG_SLABINFO
Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO
is defined. This patch will be reversed when slab defrag is merged since slab
defrag requires count_partial() to determine the fragmentation status of
slab caches.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-26 10:42:28 -07:00
Christoph Lameter caeab084de slub page alloc fallback: Enable interrupts for GFP_WAIT.
The fallback path needs to enable interrupts like done for
the other page allocator calls. This was not necessary with
the alternate fast path since we handled irq enable/disable in
the slow path. The regular fastpath handles irq enable/disable
around calls to the slow path so we need to restore the proper
status before calling the page allocator from the slowpath.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-17 11:14:17 -07:00
Nick Piggin b621038678 slub: Do not cross cacheline boundaries for very small objects
SLUB should pack even small objects nicely into cachelines if that is what
has been asked for. Use the same algorithm as SLAB for this.

The effect of this patch for a system with a cacheline size of 64
bytes is that the 24 byte sized slab caches will now put exactly
2 objects into a cacheline instead of 3 with some overlap into
the next cacheline. This reduces the object density in a 4k slab
from 170 to 128 objects (same as SLAB).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06 16:21:50 -08:00
Christoph Lameter b773ad7369 slub statistics: Fix check for DEACTIVATE_REMOTE_FREES
The remote frees are in the freelist of the page and not in the
percpu freelist.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06 16:21:49 -08:00
Cyrill Gorcunov 62e5c4b4d6 slub: fix possible NULL pointer dereference
This patch fix possible NULL pointer dereference if kzalloc
failed. To be able to return proper error code the function
return type is changed to ssize_t (according to callees and
sysfs definitions).

Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter f619cfe1bd slub: Add kmalloc_large_node() to support kmalloc_node fallback
Slub is missing some NUMA support for large kmallocs. Provide that.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Pekka J Enberg 7693143481 slub: look up object from the freelist once
We only need to look up object from c->page->freelist once in
__slab_alloc().

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter 6446faa2ff slub: Fix up comments
Provide comments and fix up various spelling / style issues.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:32 -08:00
Christoph Lameter d8b42bf54b slub: Rearrange #ifdef CONFIG_SLUB_DEBUG in calculate_sizes()
Group SLUB_DEBUG code together to reduce the number of #ifdefs. Move some
debug checks under the #ifdef.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter ae20bfda68 slub: Remove BUG_ON() from ksize and omit checks for !SLUB_DEBUG
The BUG_ONs are useless since the pointer derefs will lead to
NULL deref errors anyways. Some of the checks are not necessary
if no debugging is possible.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter 27d9e4e948 slub: Use the objsize from the kmem_cache_cpu structure
No need to access the kmem_cache structure. We have the same value
in kmem_cache_cpu.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter d692ef6dcd slub: Remove useless checks in alloc_debug_processing
Alloc debug processing is never called with a NULL object pointer.
No reason to check for NULL.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:31 -08:00
Christoph Lameter e153362a50 slub: Remove objsize check in kmem_cache_flags()
There is no page->offset anymore and also no associated limit on the number
of objects. The page->offset field was removed for 2.6.24. So the check
in kmem_cache_flags() is now also obsolete (should have been dropped
earlier, somehow a hunk vanished).

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Christoph Lameter d9acf4b7b6 slub: rename slab_objects to show_slab_objects
The sysfs callback is better named show_slab_objects since it is always
called from the xxx_show callbacks. We need the name for other purposes
later.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Christoph Lameter a973e9dd1e Revert "unique end pointer" patch
This only made sense for the alternate fastpath which was reverted last week.

Mathieu is working on a new version that addresses the fastpath issues but that
new code first needs to go through mm and it is not clear if we need the
unique end pointers with his new scheme.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-03 12:22:30 -08:00
Linus Torvalds 00e962c540 Revert "SLUB: Alternate fast paths using cmpxchg_local"
This reverts commit 1f84260c8c, which is
suspected to be the reason for some very occasional and hard-to-trigger
crashes that usually look related to memory allocation (mostly reported
in networking, but since that's generally the most common source of
shortlived allocations - and allocations in interrupt contexts - that in
itself is not a big clue).

See for example
	http://bugzilla.kernel.org/show_bug.cgi?id=9973
	http://lkml.org/lkml/2008/2/19/278
etc.

One promising suspicion for what the root cause of bug is (which also
explains why it's so hard to trigger in practice) came from Eric
Dumazet:

   "I wonder how SLUB_FASTPATH is supposed to work, since it is affected
    by a classical ABA problem of lockless algo.

    cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
    while an interrupt came (on this cpu), and several allocations were
    done, and one free was performed at the end of this interruption, so
    'object' was recycled.

    c->freelist can then contain the previous value (object), but
    object[c->offset] was changed by IRQ.

    We then put back in freelist an already allocated object."

but another reason for the revert is simply that everybody agrees that
this code was the main suspect just by virtue of the pattern of oopses.

Cc: Torsten Kaiser <just.for.lkml@googlemail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-19 09:08:49 -08:00
Christoph Lameter 331dc558fa slub: Support 4k kmallocs again to compensate for page allocator slowness
Currently we hand off PAGE_SIZEd kmallocs to the page allocator in the
mistaken belief that the page allocator can handle these allocations
effectively. However, measurements indicate a minimum slowdown by the
factor of 8 (and that is only SMP, NUMA is much worse) vs the slub fastpath
which causes regressions in tbench.

Increase the number of kmalloc caches by one so that we again handle 4k
kmallocs directly from slub. 4k page buffering for the page allocator
will be performed by slub like done by slab.

At some point the page allocator fastpath should be fixed. A lot of the kernel
would benefit from a faster ability to allocate a single page. If that is
done then the 4k allocs may again be forwarded to the page allocator and this
patch could be reverted.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:02 -08:00
Christoph Lameter 71c7a06ff0 slub: Fallback to kmalloc_large for failing higher order allocs
Slub already has two ways of allocating an object. One is via its own
logic and the other is via the call to kmalloc_large to hand off object
allocation to the page allocator. kmalloc_large is typically used
for objects >= PAGE_SIZE.

We can use that handoff to avoid failing if a higher order kmalloc slab
allocation cannot be satisfied by the page allocator. If we reach the
out of memory path then simply try a kmalloc_large(). kfree() can
already handle the case of an object that was allocated via the page
allocator and so this will work just fine (apart from object
accounting...).

For any kmalloc slab that already requires higher order allocs (which
makes it impossible to use the page allocator fastpath!)
we just use PAGE_ALLOC_COSTLY_ORDER to get the largest number of
objects in one go from the page allocator slowpath.

On a 4k platform this patch will lead to the following use of higher
order pages for the following kmalloc slabs:

8 ... 1024	order 0
2048 .. 4096	order 3 (4k slab only after the next patch)

We may waste some space if fallback occurs on a 2k slab but we
are always able to fallback to an order 0 alloc.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Christoph Lameter b7a49f0d4c slub: Determine gfpflags once and not every time a slab is allocated
Currently we determine the gfp flags to pass to the page allocator
each time a slab is being allocated.

Determine the bits to be set at the time the slab is created. Store
in a new allocflags field and add the flags in allocate_slab().

Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Adrian Bunk dada123d99 make slub.c:slab_address() static
slab_address() can become static.

Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Pekka Enberg eada35efcb slub: kmalloc page allocator pass-through cleanup
This adds a proper function for kmalloc page allocator pass-through. While it
simplifies any code that does slab tracing code a lot, I think it's a
worthwhile cleanup in itself.

Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-14 15:30:01 -08:00
Ingo Molnar 3adbefee6f SLUB: fix checkpatch warnings
fix checkpatch --file mm/slub.c errors and warnings.

 $ q-code-quality-compare
                                      errors   lines of code   errors/KLOC
 mm/slub.c      [before]                  22            4204           5.2
 mm/slub.c      [after]                    0            4210             0

no code changed:

    text    data     bss     dec     hex filename
   22195    8634     136   30965    78f5 slub.o.before
   22195    8634     136   30965    78f5 slub.o.after

   md5:
     93cdfbec2d6450622163c590e1064358  slub.o.before.asm
     93cdfbec2d6450622163c590e1064358  slub.o.after.asm

[clameter: rediffed against Pekka's cleanup patch, omitted
moves of the name of a function to the start of line]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07 17:52:39 -08:00
Nick Piggin a76d354629 Use non atomic unlock
Slub can use the non-atomic version to unlock because other flags will not
get modified with the lock held.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07 17:47:42 -08:00
Christoph Lameter 8ff12cfc00 SLUB: Support for performance statistics
The statistics provided here allow the monitoring of allocator behavior but
at the cost of some (minimal) loss of performance. Counters are placed in
SLUB's per cpu data structure. The per cpu structure may be extended by the
statistics to grow larger than one cacheline which will increase the cache
footprint of SLUB.

There is a compile option to enable/disable the inclusion of the runtime
statistics and its off by default.

The slabinfo tool is enhanced to support these statistics via two options:

-D 	Switches the line of information displayed for a slab from size
	mode to activity mode.

-A	Sorts the slabs displayed by activity. This allows the display of
	the slabs most important to the performance of a certain load.

-r	Report option will report detailed statistics on

Example (tbench load):

slabinfo -AD		->Shows the most active slabs

Name                   Objects    Alloc     Free   %Fast
skbuff_fclone_cache         33 111953835 111953835  99  99
:0000192                  2666  5283688  5281047  99  99
:0001024                   849  5247230  5246389  83  83
vm_area_struct            1349   119642   118355  91  22
:0004096                    15    66753    66751  98  98
:0000064                  2067    25297    23383  98  78
dentry                   10259    28635    18464  91  45
:0000080                 11004    18950     8089  98  98
:0000096                  1703    12358    10784  99  98
:0000128                   762    10582     9875  94  18
:0000512                   184     9807     9647  95  81
:0002048                   479     9669     9195  83  65
anon_vma                   777     9461     9002  99  71
kmalloc-8                 6492     9981     5624  99  97
:0000768                   258     7174     6931  58  15

So the skbuff_fclone_cache is of highest importance for the tbench load.
Pretty high load on the 192 sized slab. Look for the aliases

slabinfo -a | grep 000192
:0000192     <- xfs_btree_cur filp kmalloc-192 uid_cache tw_sock_TCP
	request_sock_TCPv6 tw_sock_TCPv6 skbuff_head_cache xfs_ili

Likely skbuff_head_cache.


Looking into the statistics of the skbuff_fclone_cache is possible through

slabinfo skbuff_fclone_cache	->-r option implied if cache name is mentioned


.... Usual output ...

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             111953360 111946981  99  99
Slowpath                 1044     7423   0   0
Page Alloc                272      264   0   0
Add partial                25      325   0   0
Remove partial             86      264   0   0
RemoteObj/SlabFrozen      350     4832   0   0
Total                111954404 111954404

Flushes       49 Refill        0
Deactivate Full=325(92%) Empty=0(0%) ToHead=24(6%) ToTail=1(0%)

Looks good because the fastpath is overwhelmingly taken.


skbuff_head_cache:

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath              5297262  5259882  99  99
Slowpath                 4477    39586   0   0
Page Alloc                937      824   0   0
Add partial                 0     2515   0   0
Remove partial           1691      824   0   0
RemoteObj/SlabFrozen     2621     9684   0   0
Total                 5301739  5299468

Deactivate Full=2620(100%) Empty=0(0%) ToHead=0(0%) ToTail=0(0%)


Descriptions of the output:

Total:		The total number of allocation and frees that occurred for a
		slab

Fastpath:	The number of allocations/frees that used the fastpath.

Slowpath:	Other allocations

Page Alloc:	Number of calls to the page allocator as a result of slowpath
		processing

Add Partial:	Number of slabs added to the partial list through free or
		alloc (occurs during cpuslab flushes)

Remove Partial:	Number of slabs removed from the partial list as a result of
		allocations retrieving a partial slab or by a free freeing
		the last object of a slab.

RemoteObj/Froz:	How many times were remotely freed object encountered when a
		slab was about to be deactivated. Frozen: How many times was
		free able to skip list processing because the slab was in use
		as the cpuslab of another processor.

Flushes:	Number of times the cpuslab was flushed on request
		(kmem_cache_shrink, may result from races in __slab_alloc)

Refill:		Number of times we were able to refill the cpuslab from
		remotely freed objects for the same slab.

Deactivate:	Statistics how slabs were deactivated. Shows how they were
		put onto the partial list.

In general fastpath is very good. Slowpath without partial list processing is
also desirable. Any touching of partial list uses node specific locks which
may potentially cause list lock contention.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07 17:47:41 -08:00
Christoph Lameter 1f84260c8c SLUB: Alternate fast paths using cmpxchg_local
Provide an alternate implementation of the SLUB fast paths for alloc
and free using cmpxchg_local. The cmpxchg_local fast path is selected
for arches that have CONFIG_FAST_CMPXCHG_LOCAL set. An arch should only
set CONFIG_FAST_CMPXCHG_LOCAL if the cmpxchg_local is faster than an
interrupt enable/disable sequence. This is known to be true for both
x86 platforms so set FAST_CMPXCHG_LOCAL for both arches.

Currently another requirement for the fastpath is that the kernel is
compiled without preemption. The restriction will go away with the
introduction of a new per cpu allocator and new per cpu operations.

The advantages of a cmpxchg_local based fast path are:

1. Potentially lower cycle count (30%-60% faster)

2. There is no need to disable and enable interrupts on the fast path.
   Currently interrupts have to be disabled and enabled on every
   slab operation. This is likely avoiding a significant percentage
   of interrupt off / on sequences in the kernel.

3. The disposal of freed slabs can occur with interrupts enabled.

The alternate path is realized using #ifdef's. Several attempts to do the
same with macros and inline functions resulted in a mess (in particular due
to the strange way that local_interrupt_save() handles its argument and due
to the need to define macros/functions that sometimes disable interrupts
and sometimes do something else).

[clameter: Stripped preempt bits and disabled fastpath if preempt is enabled]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07 17:47:41 -08:00
Christoph Lameter 683d0baad3 SLUB: Use unique end pointer for each slab page.
We use a NULL pointer on freelists to signal that there are no more objects.
However the NULL pointers of all slabs match in contrast to the pointers to
the real objects which are in different ranges for different slab pages.

Change the end pointer to be a pointer to the first object and set bit 0.
Every slab will then have a different end pointer. This is necessary to ensure
that end markers can be matched to the source slab during cmpxchg_local.

Bring back the use of the mapping field by SLUB since we would otherwise have
to call a relatively expensive function page_address() in __slab_alloc().  Use
of the mapping field allows avoiding a call to page_address() in various other
functions as well.

There is no need to change the page_mapping() function since bit 0 is set on
the mapping as also for anonymous pages.  page_mapping(slab_page) will
therefore still return NULL although the mapping field is overloaded.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-07 17:47:41 -08:00
Christoph Lameter 5bb983b0cc SLUB: Deal with annoying gcc warning on kfree()
gcc 4.2 spits out an annoying warning if one casts a const void *
pointer to a void * pointer. No warning is generated if the
conversion is done through an assignment.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-07 17:47:41 -08:00
root ba84c73c7a SLUB: Do not upset lockdep
inconsistent {softirq-on-W} -> {in-softirq-W} usage.
swapper/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
 (&n->list_lock){-+..}, at: [<ffffffff802935c1>] add_partial+0x31/0xa0
{softirq-on-W} state was registered at:
  [<ffffffff80259fb8>] __lock_acquire+0x3e8/0x1140
  [<ffffffff80259838>] debug_check_no_locks_freed+0x188/0x1a0
  [<ffffffff8025ad65>] lock_acquire+0x55/0x70
  [<ffffffff802935c1>] add_partial+0x31/0xa0
  [<ffffffff805c76de>] _spin_lock+0x1e/0x30
  [<ffffffff802935c1>] add_partial+0x31/0xa0
  [<ffffffff80296f9c>] kmem_cache_open+0x1cc/0x330
  [<ffffffff805c7984>] _spin_unlock_irq+0x24/0x30
  [<ffffffff802974f4>] create_kmalloc_cache+0x64/0xf0
  [<ffffffff80295640>] init_alloc_cpu_cpu+0x70/0x90
  [<ffffffff8080ada5>] kmem_cache_init+0x65/0x1d0
  [<ffffffff807f1b4e>] start_kernel+0x23e/0x350
  [<ffffffff807f112d>] _sinittext+0x12d/0x140
  [<ffffffffffffffff>] 0xffffffffffffffff

This change isn't really necessary for correctness, but it prevents lockdep
from getting upset and then disabling itself.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04 10:56:02 -08:00
Pekka Enberg 064287807c SLUB: Fix coding style violations
This fixes most of the obvious coding style violations in mm/slub.c as
reported by checkpatch.

Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04 10:56:02 -08:00
Christoph Lameter 7c2e132c54 Add parameter to add_partial to avoid having two functions
Add a parameter to add_partial instead of having separate functions.  The
parameter allows a more detailed control of where the slab pages is placed in
the partial queues.

If we put slabs back to the front then they are likely immediately used for
allocations.  If they are put at the end then we can maximize the time that
the partial slabs spent without being subject to allocations.

When deactivating slab we can put the slabs that had remote objects freed (we
can see that because objects were put on the freelist that requires locks) to
them at the end of the list so that the cachelines of remote processors can
cool down.  Slabs that had objects from the local cpu freed to them (objects
exist in the lockless freelist) are put in the front of the list to be reused
ASAP in order to exploit the cache hot state of the local cpu.

Patch seems to slightly improve tbench speed (1-2%).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04 10:56:02 -08:00
Christoph Lameter 9824601ead SLUB: rename defrag to remote_node_defrag_ratio
The NUMA defrag works by allocating objects from partial slabs on remote
nodes.  Rename it to

	remote_node_defrag_ratio

to be clear about this.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04 10:56:02 -08:00
Christoph Lameter f61396aed9 Move count_partial before kmem_cache_shrink
Move the counting function for objects in partial slabs so that it is placed
before kmem_cache_shrink.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2008-02-04 10:56:01 -08:00
Christoph Lameter 151c602f79 SLUB: Fix sysfs refcounting
If CONFIG_SYSFS is set then free the kmem_cache structure when
sysfs tells us its okay.

Otherwise there is the danger (as pointed out by
Al Viro) that sysfs thinks the kobject still exists after
kmem_cache_destroy() removed it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka J Enberg <penberg@cs.helsinki.fi>
2008-02-04 10:56:01 -08:00
Harvey Harrison e374d48356 slub: fix shadowed variable sparse warnings
Introduce 'len' at outer level:
mm/slub.c:3406:26: warning: symbol 'n' shadows an earlier one
mm/slub.c:3393:6: originally declared here

No need to declare new node:
mm/slub.c:3501:7: warning: symbol 'node' shadows an earlier one
mm/slub.c:3491:6: originally declared here

No need to declare new x:
mm/slub.c:3513:9: warning: symbol 'x' shadows an earlier one
mm/slub.c:3492:6: originally declared here

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-02-04 10:56:01 -08:00
Greg Kroah-Hartman 1eada11c88 Kobject: convert mm/slub.c to use kobject_init/add_ng()
This converts the code to use the new kobject functions, cleaning up the
logic in doing so.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:31 -08:00
Greg Kroah-Hartman 0ff21e4663 kobject: convert kernel_kset to be a kobject
kernel_kset does not need to be a kset, but a much simpler kobject now
that we have kobj_attributes.

We also rename kernel_kset to kernel_kobj to catch all users of this
symbol with a build error instead of an easy-to-ignore build warning.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:24 -08:00
Greg Kroah-Hartman 081248de0a kset: move /sys/slab to /sys/kernel/slab
/sys/kernel is where these things should go.
Also updated the documentation and tool that used this directory.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:16 -08:00
Greg Kroah-Hartman 27c3a314d5 kset: convert slub to use kset_create
Dynamically create the kset instead of declaring it statically.

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:15 -08:00
Greg Kroah-Hartman 3514faca19 kobject: remove struct kobj_type from struct kset
We don't need a "default" ktype for a kset.  We should set this
explicitly every time for each kset.  This change is needed so that we
can make ksets dynamic, and cleans up one of the odd, undocumented
assumption that the kset/kobject/ktype model has.

This patch is based on a lot of help from Kay Sievers.

Nasty bug in the block code was found by Dave Young
<hidave.darkstar@gmail.com>

Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-24 20:40:10 -08:00
Linus Torvalds 158a962422 Unify /proc/slabinfo configuration
Both SLUB and SLAB really did almost exactly the same thing for
/proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.

This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
and SLAB, and shares all the setup code.  Maybe SLOB will want this some
day too.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-02 13:04:48 -08:00
Pekka J Enberg 57ed3eda97 slub: provide /proc/slabinfo
This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

[ mingo@elte.hu: build fix. ]

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-01-01 11:32:02 -08:00
Christoph Lameter 76be895001 SLUB: Improve hackbench speed
Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-21 15:51:07 -08:00
Christoph Lameter 3811dbf671 SLUB: remove useless masking of GFP_ZERO
Remove a recently added useless masking of GFP_ZERO.  GFP_ZERO is already
masked out in new_slab() (See how it calls allocate_slab).  No need to do
it twice.

This reverts the SLUB parts of 7fd272550b.

Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-17 19:28:17 -08:00
Linus Torvalds 7fd272550b Avoid double memclear() in SLOB/SLUB
Both slob and slub react to __GFP_ZERO by clearing the allocation, which
means that passing the GFP_ZERO bit down to the page allocator is just
wasteful and pointless.

Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-09 10:17:52 -08:00
Vegard Nossum 294a80a8ed SLUB's ksize() fails for size > 2048
I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.

The error of ksize() seems to be that it does not check if the allocation
was made by SLUB or the page allocator.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-12-05 09:21:20 -08:00
Denis Cheng efe44183f6 SLUB: killed the unused "end" variable
Since the macro "for_each_object" introduced, the "end" variable becomes unused anymore.

Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-12 10:32:29 -08:00
Christoph Lameter 05aa345034 SLUB: Fix memory leak by not reusing cpu_slab
Fix the memory leak that may occur when we attempt to reuse a cpu_slab
that was allocated while we reenabled interrupts in order to be able to
grow a slab cache.

The per cpu freelist may contain objects and in that situation we may
overwrite the per cpu freelist pointer loosing objects.  This only
occurs if we find that the concurrently allocated slab fits our
allocation needs.

If we simply always deactivate the slab then the freelist will be
properly reintegrated and the memory leak will go away.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-11-05 11:37:12 -08:00
Al Viro 27bb628a1d missing atomic_read_long() in slub.c
nr_slabs is atomic_long_t, not atomic_t

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-29 07:41:32 -07:00
Yasunori Goto b9049e2344 memory hotplug: make kmem_cache_node for SLUB on memory online avoid panic
Fix a panic due to access NULL pointer of kmem_cache_node at discard_slab()
after memory online.

When memory online is called, kmem_cache_nodes are created for all SLUBs
for new node whose memory are available.

slab_mem_going_online_callback() is called to make kmem_cache_node() in
callback of memory online event.  If it (or other callbacks) fails, then
slab_mem_offline_callback() is called for rollback.

In memory offline, slab_mem_going_offline_callback() is called to shrink
all slub cache, then slab_mem_offline_callback() is called later.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: locking fix]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-22 08:13:17 -07:00
Christoph Lameter 4ba9b9d0ba Slab API: remove useless ctor parameter and reorder parameters
Slab constructors currently have a flags parameter that is never used.  And
the order of the arguments is opposite to other slab functions.  The object
pointer is placed before the kmem_cache pointer.

Convert

        ctor(void *object, struct kmem_cache *s, unsigned long flags)

to

        ctor(struct kmem_cache *s, void *object)

throughout the kernel

[akpm@linux-foundation.org: coupla fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Christoph Lameter b811c202a0 SLUB: simplify IRQ off handling
Move irq handling out of new slab into __slab_alloc.  That is useful for
Mathieu's cmpxchg_local patchset and also allows us to remove the crude
local_irq_off in early_kmem_cache_alloc().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17 08:42:45 -07:00
Andrew Morton ea3061d227 slub: list_locations() can use GFP_TEMPORARY
It's a short-lived allocation.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter 42a9fdbb12 SLUB: Optimize cacheline use for zeroing
We touch a cacheline in the kmem_cache structure for zeroing to get the
size. However, the hot paths in slab_alloc and slab_free do not reference
any other fields in kmem_cache, so we may have to just bring in the
cacheline for this one access.

Add a new field to kmem_cache_cpu that contains the object size. That
cacheline must already be used in the hotpaths. So we save one cacheline
on every slab_alloc if we zero.

We need to update the kmem_cache_cpu object size if an aliasing operation
changes the objsize of an non debug slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter 4c93c355d5 SLUB: Place kmem_cache_cpu structures in a NUMA aware way
The kmem_cache_cpu structures introduced are currently an array placed in the
kmem_cache struct. Meaning the kmem_cache_cpu structures are overwhelmingly
on the wrong node for systems with a higher amount of nodes. These are
performance critical structures since the per node information has
to be touched for every alloc and free in a slab.

In order to place the kmem_cache_cpu structure optimally we put an array
of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB).

However, the kmem_cache_cpu structures can now be allocated in a more
intelligent way.

We would like to put per cpu structures for the same cpu but different
slab caches in cachelines together to save space and decrease the cache
footprint. However, the slab allocators itself control only allocations
per node. We set up a simple per cpu array for every processor with
100 per cpu structures which is usually enough to get them all set up right.
If we run out then we fall back to kmalloc_node. This also solves the
bootstrap problem since we do not have to use slab allocator functions
early in boot to get memory for the small per cpu structures.

Pro:
	- NUMA aware placement improves memory performance
	- All global structures in struct kmem_cache become readonly
	- Dense packing of per cpu structures reduces cacheline
	  footprint in SMP and NUMA.
	- Potential avoidance of exclusive cacheline fetches
	  on the free and alloc hotpath since multiple kmem_cache_cpu
	  structures are in one cacheline. This is particularly important
	  for the kmalloc array.

Cons:
	- Additional reference to one read only cacheline (per cpu
	  array of pointers to kmem_cache_cpu) in both slab_alloc()
	  and slab_free().

[akinobu.mita@gmail.com: fix cpu hotplug offline/online path]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter ee3c72a14b SLUB: Avoid touching page struct when freeing to per cpu slab
Set c->node to -1 if we allocate from a debug slab instead for SlabDebug
which requires access the page struct cacheline.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Alexey Dobriyan <adobriyan@sw.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter b3fba8da65 SLUB: Move page->offset to kmem_cache_cpu->offset
We need the offset from the page struct during slab_alloc and slab_free. In
both cases we also reference the cacheline of the kmem_cache_cpu structure.
We can therefore move the offset field into the kmem_cache_cpu structure
freeing up 16 bits in the page struct.

Moving the offset allows an allocation from slab_alloc() without touching the
page struct in the hot path.

The only thing left in slab_free() that touches the page struct cacheline for
per cpu freeing is the checking of SlabDebug(page). The next patch deals with
that.

Use the available 16 bits to broaden page->inuse. More than 64k objects per
slab become possible and we can get rid of the checks for that limitation.

No need anymore to shrink the order of slabs if we boot with 2M sized slabs
(slub_min_order=9).

No need anymore to switch off the offset calculation for very large slabs
since the field in the kmem_cache_cpu structure is 32 bits and so the offset
field can now handle slab sizes of up to 8GB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter 8e65d24c7c SLUB: Do not use page->mapping
After moving the lockless_freelist to kmem_cache_cpu we no longer need
page->lockless_freelist. Restructure the use of the struct page fields in
such a way that we never touch the mapping field.

This is turn allows us to remove the special casing of SLUB when determining
the mapping of a page (needed for corner cases of virtual caches machines that
need to flush caches of processors mapping a page).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Christoph Lameter dfb4f09609 SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab
A remote free may access the same page struct that also contains the lockless
freelist for the cpu slab. If objects have a short lifetime and are freed by
a different processor then remote frees back to the slab from which we are
currently allocating are frequent. The cacheline with the page struct needs
to be repeately acquired in exclusive mode by both the allocating thread and
the freeing thread. If this is frequent enough then performance will suffer
because of cacheline bouncing.

This patchset puts the lockless_freelist pointer in its own cacheline. In
order to make that happen we introduce a per cpu structure called
kmem_cache_cpu.

Instead of keeping an array of pointers to page structs we now keep an array
to a per cpu structure that--among other things--contains the pointer to the
lockless freelist. The freeing thread can then keep possession of exclusive
access to the page struct cacheline while the allocating thread keeps its
exclusive access to the cacheline containing the per cpu structure.

This works as long as the allocating cpu is able to service its request
from the lockless freelist. If the lockless freelist runs empty then the
allocating thread needs to acquire exclusive access to the cacheline with
the page struct lock the slab.

The allocating thread will then check if new objects were freed to the per
cpu slab. If so it will keep the slab as the cpu slab and continue with the
recently remote freed objects. So the allocating thread can take a series
of just freed remote pages and dish them out again. Ideally allocations
could be just recycling objects in the same slab this way which will lead
to an ideal allocation / remote free pattern.

The number of objects that can be handled in this way is limited by the
capacity of one slab. Increasing slab size via slub_min_objects/
slub_max_order may increase the number of objects and therefore performance.

If the allocating thread runs out of objects and finds that no objects were
put back by the remote processor then it will retrieve a new slab (from the
partial lists or from the page allocator) and start with a whole
new set of objects while the remote thread may still be freeing objects to
the old cpu slab. This may then repeat until the new slab is also exhausted.
If remote freeing has freed objects in the earlier slab then that earlier
slab will now be on the partial freelist and the allocating thread will
pick that slab next for allocation. So the loop is extended. However,
both threads need to take the list_lock to make the swizzling via
the partial list happen.

It is likely that this kind of scheme will keep the objects being passed
around to a small set that can be kept in the cpu caches leading to increased
performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around.
  Allows reducing the number of parameters to various functions.
- Can define a new node_match() function for NUMA to encapsulate locality
  checks.

Effect on allocations:

Cachelines touched before this patch:

	Write:	page cache struct and first cacheline of object

Cachelines touched after this patch:

	Write:	kmem_cache_cpu cacheline and first cacheline of object
	Read: page cache struct (but see later patch that avoids touching
		that cacheline)

The handling when the lockless alloc list runs empty gets to be a bit more
complicated since another cacheline has now to be written to. But that is
halfway out of the hot path.

Effect on freeing:

Cachelines touched before this patch:

	Write: page_struct and first cacheline of object

Cachelines touched after this patch depending on how we free:

  Write(to cpu_slab):	kmem_cache_cpu struct and first cacheline of object
  Write(to other):	page struct and first cacheline of object

  Read(to cpu_slab):	page struct to id slab etc. (but see later patch that
  			avoids touching the page struct on free)
  Read(to other):	cpu local kmem_cache_cpu struct to verify its not
  			the cpu slab.

Summary:

Pro:
	- Distinct cachelines so that concurrent remote frees and local
	  allocs on a cpuslab can occur without cacheline bouncing.
	- Avoids potential bouncing cachelines because of neighboring
	  per cpu pointer updates in kmem_cache's cpu_slab structure since
	  it now grows to a cacheline (Therefore remove the comment
	  that talks about that concern).

Cons:
	- Freeing objects now requires the reading of one additional
	  cacheline. That can be mitigated for some cases by the following
	  patches but its not possible to completely eliminate these
	  references.

	- Memory usage grows slightly.

	The size of each per cpu object is blown up from one word
	(pointing to the page_struct) to one cacheline with various data.
	So this is NR_CPUS*NR_SLABS*L1_BYTES more memory use. Lets say
	NR_SLABS is 100 and a cache line size of 128 then we have just
	increased SLAB metadata requirements by 12.8k per cpu.
	(Another later patch reduces these requirements)

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:01 -07:00
Mel Gorman e12ba74d8f Group short-lived and reclaimable kernel allocations
This patch marks a number of allocations that are either short-lived such as
network buffers or are reclaimable such as inode allocations.  When something
like updatedb is called, long-lived and unmovable kernel allocations tend to
be spread throughout the address space which increases fragmentation.

This patch groups these allocations together as much as possible by adding a
new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
them and re-reading the information from elsewhere.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:00 -07:00
Christoph Lameter 6cb062296f Categorize GFP flags
The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
flags:

GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.

GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.

GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.

These replace the uses of GFP_LEVEL mask in the slab allocators and in
vmalloc.c.

The use of the flags not included in these sets may occur as a result of a
slab allocation standing in for a page allocation when constructing scatter
gather lists.  Extraneous flags are cleared and not passed through to the
page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
now be ignored if passed to a slab allocator.

Change the allocation of allocator meta data in SLAB and vmalloc to not
pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
__GFP_THISNODE flag for such allocations.  Generalize that to also cover
vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.

The impact of allocator metadata placement on access latency to the
cachelines of the object itself is minimal since metadata is only
referenced on alloc and free.  The attempt is still made to place the meta
data optimally but we consistently allow fallback both in SLAB and vmalloc
(SLUB does not need to allocate metadata like that).

Allocator metadata may serve multiple in kernel users and thus should not
be subject to the limitations arising from a single allocation context.

[akpm@linux-foundation.org: fix fallback_alloc()]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:59 -07:00
Christoph Lameter f64dc58c54 Memoryless nodes: SLUB support
Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY).
That way SLUB only operates on nodes with regular memory.  Any allocation
attempt on a memoryless node or a node with just highmem will fall whereupon
SLUB will fetch memory from a nearby node (depending on how memory policies
and cpuset describe fallback).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Bob Picco <bob.picco@hp.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:58 -07:00
Christoph Lameter ef8b4520bd Slab allocators: fail if ksize is called with a NULL parameter
A NULL pointer means that the object was not allocated.  One cannot
determine the size of an object that has not been allocated.  Currently we
return 0 but we really should BUG() on attempts to determine the size of
something nonexistent.

krealloc() interprets NULL to mean a zero sized object.  Handle that
separately in krealloc().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Satyam Sharma 2408c55037 {slub, slob}: use unlikely() for kfree(ZERO_OR_NULL_PTR) check
Considering kfree(NULL) would normally occur only in error paths and
kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the
condition check in SLUB's and SLOB's kfree() to optimize for the common
case.  SLAB has this already.

Signed-off-by: Satyam Sharma <satyam@infradead.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Christoph Lameter aadb4bc4a1 SLUB: direct pass through of page size or higher kmalloc requests
This gets rid of all kmalloc caches larger than page size.  A kmalloc
request larger than PAGE_SIZE > 2 is going to be passed through to the page
allocator.  This works both inline where we will call __get_free_pages
instead of kmem_cache_alloc and in __kmalloc.

kfree is modified to check if the object is in a slab page. If not then
the page is freed via the page allocator instead. Roughly similar to what
SLOB does.

Advantages:
- Reduces memory overhead for kmalloc array
- Large kmalloc operations are faster since they do not
  need to pass through the slab allocator to get to the
  page allocator.
- Performance increase of 10%-20% on alloc and 50% on free for
  PAGE_SIZEd allocations.
  SLUB must call page allocator for each alloc anyways since
  the higher order pages which that allowed avoiding the page alloc calls
  are not available in a reliable way anymore. So we are basically removing
  useless slab allocator overhead.
- Large kmallocs yields page aligned object which is what
  SLAB did. Bad things like using page sized kmalloc allocations to
  stand in for page allocate allocs can be transparently handled and are not
  distinguishable from page allocator uses.
- Checking for too large objects can be removed since
  it is done by the page allocator.

Drawbacks:
- No accounting for large kmalloc slab allocations anymore
- No debugging of large kmalloc slab allocations.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:53 -07:00
Adrian Bunk 1cd7daa51b slub.c:early_kmem_cache_node_alloc() shouldn't be __init
WARNING: mm/built-in.o(.text+0x24bd3): Section mismatch: reference to .init.text:early_kmem_cache_node_alloc (between 'init_kmem_cache_nodes' and 'calculate_sizes')
...

Signed-off-by: Adrian Bunk <bunk@stusta.de>
Acked-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:42:51 -07:00
Christoph Lameter ba0268a8b0 SLUB: accurately compare debug flags during slab cache merge
This was posted on Aug 28 and fixes an issue that could cause troubles
when slab caches >=128k are created.

http://marc.info/?l=linux-mm&m=118798149918424&w=2

Currently we simply add the debug flags unconditional when checking for a
matching slab.  This creates issues for sysfs processing when slabs exist
that are exempt from debugging due to their huge size or because only a
subset of slabs was selected for debugging.

We need to only add the flags if kmem_cache_open() would also add them.

Create a function to calculate the flags that would be set
if the cache would be opened and use that function to determine
the flags before looking for a compatible slab.

[akpm@linux-foundation.org: fixlets]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-09-11 17:21:27 -07:00
Christoph Lameter 5d540fb715 slub: do not fail if we cannot register a slab with sysfs
Do not BUG() if we cannot register a slab with sysfs.  Just print an error.
 The only consequence of not registering is that the slab cache is not
visible via /sys/slab.  A BUG() may not be visible that early during boot
and we have had multiple issues here already.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-31 01:42:22 -07:00
Christoph Lameter a2f92ee7e7 SLUB: do not fail on broken memory configurations
Print a big fat warning and do what is necessary to continue if a node is
marked as up (meaning either node is online (upstream) or node has memory
(Andrew's tree)) but allocations from the node do not succeed.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Christoph Lameter 9e86943b6c SLUB: use atomic_long_read for atomic_long variables
SLUB is using atomic_read() for variables declared atomic_long_t.
Switch to atomic_long_read().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-08-22 19:52:47 -07:00
Christoph Lameter 1ceef40249 SLUB: Fix dynamic dma kmalloc cache creation
The dynamic dma kmalloc creation can run into trouble if a
GFP_ATOMIC allocation is the first one performed for a certain size
of dma kmalloc slab.

- Move the adding of the slab to sysfs into a workqueue
  (sysfs does GFP_KERNEL allocations)
- Do not call kmem_cache_destroy() (uses slub_lock)
- Only acquire the slub_lock once and--if we cannot wait--do a trylock.

  This introduces a slight risk of the first kmalloc(x, GFP_DMA|GFP_ATOMIC)
  for a range of sizes failing due to another process holding the slub_lock.
  However, we only need to acquire the spinlock once in order to establish
  each power of two DMA kmalloc cache. The possible conflict is with the
  slub_lock taken during slab management actions (create / remove slab cache).

  It is rather typical that a driver will first fill its buffers using
  GFP_KERNEL allocations which will wait until the slub_lock can be acquired.
  Drivers will also create its slab caches first outside of an atomic
  context before starting to use atomic kmalloc from an interrupt context.

  If there are any failures then they will occur early after boot or when
  loading of multiple drivers concurrently. Drivers can already accomodate
  failures of GFP_ATOMIC for other reasons. Retries will then create the slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-08-09 21:57:16 -07:00
Christoph Lameter fcda3d89bf SLUB: Remove checks for MAX_PARTIAL from kmem_cache_shrink
The MAX_PARTIAL checks were supposed to be an optimization. However, slab
shrinking is a manually triggered process either through running slabinfo
or by the kernel calling kmem_cache_shrink.

If one really wants to shrink a slab then all operations should be done
regardless of the size of the partial list. This also fixes an issue that
could surface if the number of partial slabs was initially above MAX_PARTIAL
in kmem_cache_shrink and later drops below MAX_PARTIAL through the
elimination of empty slabs on the partial list (rare). In that case a few
slabs may be left off the partial list (and only be put back when they
are empty).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-08-09 21:57:15 -07:00
Peter Zijlstra 2208b764c1 slub: fix bug in slub debug support
We ClearSlabDebug() before the last SlabDebug() check. Clear it later.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-07-30 12:15:15 -07:00
Peter Zijlstra 02febdf7f6 slub: add lock debugging check
Ingo noticed that the SLUB code does include the lock debugging free
check.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
2007-07-30 12:12:39 -07:00
Paul Mundt 20c2df83d2 mm: Remove slab destructors from kmem_cache_create().
Slab destructors were no longer supported after Christoph's
c59def9f22 change. They've been
BUGs for both slab and slub, and slob never supported them
either.

This rips out support for the dtor pointer from kmem_cache_create()
completely and fixes up every single callsite in the kernel (there were
about 224, not including the slab allocator definitions themselves,
or the documentation references).

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-20 10:11:58 +09:00
Linus Torvalds 9550b105b8 slub: fix ksize() for zero-sized pointers
The slab and slob allocators already did this right, but slub would call
"get_object_page()" on the magic ZERO_SIZE_PTR, with all kinds of nasty
end results.

Noted by Ingo Molnar.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-19 13:21:34 -07:00
Christoph Lameter 8ab1372fac SLUB: Fix CONFIG_SLUB_DEBUG use for CONFIG_NUMA
We currently cannot disable CONFIG_SLUB_DEBUG for CONFIG_NUMA.  Now that
embedded systems start to use NUMA we may need this.

Put an #ifdef around places where NUMA only code uses fields only valid
for CONFIG_SLUB_DEBUG.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Christoph Lameter a0e1d1be20 SLUB: Move sysfs operations outside of slub_lock
Sysfs can do a gazillion things when called.  Make sure that we do not call
any sysfs functions while holding the slub_lock.

Just protect the essentials:

1. The list of all slab caches
2. The kmalloc_dma array
3. The ref counters of the slabs.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Christoph Lameter 434e245ddd SLUB: Do not allocate object bit array on stack
The objects per slab increase with the current patches in mm since we allow up
to order 3 allocs by default.  More patches in mm actually allow to use 2M or
higher sized slabs.  For slab validation we need per object bitmaps in order
to check a slab.  We end up with up to 64k objects per slab resulting in a
potential requirement of 8K stack space.  That does not look good.

Allocate the bit arrays via kmalloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:02 -07:00
Christoph Lameter 81cda66261 Slab allocators: Cleanup zeroing allocations
It becomes now easy to support the zeroing allocs with generic inline
functions in slab.h.  Provide inline definitions to allow the continued use of
kzalloc, kmem_cache_zalloc etc but remove other definitions of zeroing
functions from the slab allocators and util.c.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter ce15fea827 SLUB: Do not use length parameter in slab_alloc()
We can get to the length of the object through the kmem_cache_structure.  The
additional parameter does no good and causes the compiler to generate bad
code.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 12ad6843dd SLUB: Style fix up the loop to disable small slabs
Do proper spacing and we only need to do this in steps of 8.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Adrian Bunk 5af328a510 mm/slub.c: make code static
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 7b55f620e6 SLUB: Simplify dma index -> size calculation
There is no need to caculate the dma slab size ourselves. We can simply
lookup the size of the corresponding non dma slab.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter f1b2633936 SLUB: faster more efficient slab determination for __kmalloc
kmalloc_index is a long series of comparisons.  The attempt to replace
kmalloc_index with something more efficient like ilog2 failed due to compiler
issues with constant folding on gcc 3.3 / powerpc.

kmalloc_index()'es long list of comparisons works fine for constant folding
since all the comparisons are optimized away.  However, SLUB also uses
kmalloc_index to determine the slab to use for the __kmalloc_xxx functions.
This leads to a large set of comparisons in get_slab().

The patch here allows to get rid of that list of comparisons in get_slab():

1. If the requested size is larger than 192 then we can simply use
   fls to determine the slab index since all larger slabs are
   of the power of two type.

2. If the requested size is smaller then we cannot use fls since there
   are non power of two caches to be considered. However, the sizes are
   in a managable range. So we divide the size by 8. Then we have only
   24 possibilities left and then we simply look up the kmalloc index
   in a table.

Code size of slub.o decreases by more than 200 bytes through this patch.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter dfce8648d6 SLUB: do proper locking during dma slab creation
We modify the kmalloc_cache_dma[] array without proper locking.  Do the proper
locking and undo the dma cache creation if another processor has already
created it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 2e443fd003 SLUB: extract dma_kmalloc_cache from get_cache.
The rarely used dma functionality in get_slab() makes the function too
complex.  The compiler begins to spill variables from the working set onto the
stack.  The created function is only used in extremely rare cases so make sure
that the compiler does not decide on its own to merge it back into get_slab().

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 0c71001320 SLUB: add some more inlines and #ifdef CONFIG_SLUB_DEBUG
Add #ifdefs around data structures only needed if debugging is compiled into
SLUB.

Add inlines to small functions to reduce code size.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter d07dbea464 Slab allocators: support __GFP_ZERO in all allocators
A kernel convention for many allocators is that if __GFP_ZERO is passed to an
allocator then the allocated memory should be zeroed.

This is currently not supported by the slab allocators.  The inconsistency
makes it difficult to implement in derived allocators such as in the uncached
allocator and the pool allocators.

In addition the support zeroed allocations in the slab allocators does not
have a consistent API.  There are no zeroing allocator functions for NUMA node
placement (kmalloc_node, kmem_cache_alloc_node).  The zeroing allocations are
only provided for default allocs (kzalloc, kmem_cache_zalloc_node).
__GFP_ZERO will make zeroing universally available and does not require any
addititional functions.

So add the necessary logic to all slab allocators to support __GFP_ZERO.

The code is added to the hot path.  The gfp flags are on the stack and so the
cacheline is readily available for checking if we want a zeroed object.

Zeroing while allocating is now a frequent operation and we seem to be
gradually approaching a 1-1 parity between zeroing and not zeroing allocs.
The current tree has 3476 uses of kmalloc vs 2731 uses of kzalloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 6cb8f91320 Slab allocators: consistent ZERO_SIZE_PTR support and NULL result semantics
Define ZERO_OR_NULL_PTR macro to be able to remove the checks from the
allocators.  Move ZERO_SIZE_PTR related stuff into slab.h.

Make ZERO_SIZE_PTR work for all slab allocators and get rid of the
WARN_ON_ONCE(size == 0) that is still remaining in SLAB.

Make slub return NULL like the other allocators if a too large memory segment
is requested via __kmalloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter ef2ad80c7d Slab allocators: consolidate code for krealloc in mm/util.c
The size of a kmalloc object is readily available via ksize().  ksize is
provided by all allocators and thus we can implement krealloc in a generic
way.

Implement krealloc in mm/util.c and drop slab specific implementations of
krealloc.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter d45f39cb06 SLUB Debug: fix initial object debug state of NUMA bootstrap objects
The function we are calling to initialize object debug state during early NUMA
bootstrap sets up an inactive object giving it the wrong redzone signature.
The bootstrap nodes are active objects and should have active redzone
signatures.

Currently slab validation complains and reverts the object to active state.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 6300ea7503 SLUB: ensure that the number of objects per slab stays low for high orders
Currently SLUB has no provision to deal with too high page orders that may
be specified on the kernel boot line.  If an order higher than 6 (on a 4k
platform) is generated then we will BUG() because slabs get more than 65535
objects.

Add some logic that decreases order for slabs that have too many objects.
This allow booting with slab sizes up to MAX_ORDER.

For example

	slub_min_order=10

will boot with a default slab size of 4M and reduce slab sizes for small
object sizes to lower orders if the number of objects becomes too big.
Large slab sizes like that allow a concentration of objects of the same
slab cache under as few as possible TLB entries and thus potentially
reduces TLB pressure.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 68dff6a9af SLUB slab validation: Move tracking information alloc outside of lock
We currently have to do an GFP_ATOMIC allocation because the list_lock is
already taken when we first allocate memory for tracking allocation
information.  It would be better if we could avoid atomic allocations.

Allocate a size of the tracking table that is usually sufficient (one page)
before we take the list lock.  We will then only do the atomic allocation
if we need to resize the table to become larger than a page (mostly only
needed under large NUMA because of the tracking of cpus and nodes otherwise
the table stays small).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 5b95a4acf1 SLUB: use list_for_each_entry for loops over all slabs
Use list_for_each_entry() instead of list_for_each().

Get rid of for_all_slabs(). It had only one user. So fold it into the
callback. This also gets rid of cpu_slab_flush.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter 2492268472 SLUB: change error reporting format to follow lockdep loosely
Changes the error reporting format to loosely follow lockdep.

If data corruption is detected then we generate the following lines:

============================================
BUG <slab-cache>: <problem>
--------------------------------------------

INFO: <more information> [possibly multiple times]

<object dump>

FIX <slab-cache>: <remedial action>

This also adds some more intelligence to the data corruption detection. Its
now capable of figuring out the start and end.

Add a comment on how to configure SLUB so that a production system may
continue to operate even though occasional slab corruption occur through
a misbehaving kernel component. See "Emergency operations" in
Documentation/vm/slub.txt.

[akpm@linux-foundation.org: build fix]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-17 10:23:01 -07:00
Christoph Lameter f0630fff54 SLUB: support slub_debug on by default
Add a new configuration variable

CONFIG_SLUB_DEBUG_ON

If set then the kernel will be booted by default with slab debugging
switched on. Similar to CONFIG_SLAB_DEBUG. By default slab debugging
is available but must be enabled by specifying "slub_debug" as a
kernel parameter.

Also add support to switch off slab debugging for a kernel that was
built with CONFIG_SLUB_DEBUG_ON. This works by specifying

slub_debug=-

as a kernel parameter.

Dave Jones wanted this feature.
http://marc.info/?l=linux-kernel&m=118072189913045&w=2

[akpm@linux-foundation.org: clean up switch statement]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:36 -07:00
Christoph Lameter d23cf676d0 slub: remove useless EXPORT_SYMBOL
kmem_cache_open is static. EXPORT_SYMBOL was leftover from some earlier
time period where kmem_cache_open was usable outside of slub.

(Fixes powerpc build error)

Signed-off-by: Chrsitoph Lameter <clameter@sgi.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06 11:45:11 -07:00
Christoph Lameter dbc55faa64 SLUB: Make lockdep happy by not calling add_partial with interrupts enabled during bootstrap
If we move the local_irq_enable() to the end of the function then
add_partial() in early_kmem_cache_node_alloc() will be called
with interrupts disabled like during regular operations.

This makes lockdep happy.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Andre Noll <maan@systemlinux.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-03 13:56:13 -07:00
Christoph Lameter 8496634302 SLUB: fix behavior if the text output of list_locations overflows PAGE_SIZE
If slabs are allocated or freed from a large set of call sites (typical for
the kmalloc area) then we may create more output than fits into a single
PAGE and sysfs only gives us one page.  The output should be truncated.
This patch fixes the checks to do the truncation properly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24 08:59:11 -07:00
Christoph Lameter 4b356be019 SLUB: minimum alignment fixes
If ARCH_KMALLOC_MINALIGN is set to a value greater than 8 (SLUBs smallest
kmalloc cache) then SLUB may generate duplicate slabs in sysfs (yes again)
because the object size is padded to reach ARCH_KMALLOC_MINALIGN.  Thus the
size of the small slabs is all the same.

No arch sets ARCH_KMALLOC_MINALIGN larger than 8 though except mips which
for some reason wants a 128 byte alignment.

This patch increases the size of the smallest cache if
ARCH_KMALLOC_MINALIGN is greater than 8.  In that case more and more of the
smallest caches are disabled.

If we do that then the count of the active general caches that is displayed
on boot is not correct anymore since we may skip elements of the kmalloc
array.  So count them separately.

This approach was tested by Havard yesterday.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:16 -07:00
Christoph Lameter dd08c40e3e SLUB slab validation: Alloc while interrupts are disabled must use GFP_ATOMIC
The data structure to manage the information gathered about functions
allocating and freeing objects is allocated when the list_lock has already
been taken.  We need to allocate with GFP_ATOMIC instead of GFP_KERNEL.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16 13:16:15 -07:00
Christoph Lameter 272c1d21d6 SLUB: return ZERO_SIZE_PTR for kmalloc(0)
Instead of returning the smallest available object return ZERO_SIZE_PTR.

A ZERO_SIZE_PTR can be legitimately used as an object pointer as long as it
is not deferenced.  The dereference of ZERO_SIZE_PTR causes a distinctive
fault.  kfree can handle a ZERO_SIZE_PTR in the same way as NULL.

This enables functions to use zero sized object. e.g. n = number of objects.

	objects = kmalloc(n * sizeof(object));

	for (i = 0; i < n; i++)
		objects[i].x = y;

	kfree(objects);

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-08 17:23:33 -07:00
Christoph Lameter 27390bc335 SLUB: fix locking for hotplug callbacks
Hotplug callbacks are performed with interrupts enabled.  Slub requires
interrupts to be disabled for flushing caches.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Michal Piotrowski <michal.k.k.piotrowski@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-01 08:18:30 -07:00
Christoph Lameter 8ffa68755a SLUB: Fix NUMA / SYSFS bootstrap issue
We need this patch in ASAP.  Patch fixes the mysterious hang that remained
on some particular configurations with lockdep on after the first fix that
moved the #idef CONFIG_SLUB_DEBUG to the right location.  See
http://marc.info/?t=117963072300001&r=1&w=2

The kmem_cache_node cache is very special because it is needed for NUMA
bootstrap.  Under certain conditions (like for example if lockdep is
enabled and significantly increases the size of spinlock_t) the structure
may become exactly the size as one of the larger caches in the kmalloc
array.

That early during bootstrap we cannot perform merging properly.  The unique
id for the kmem_cache_node cache will match one of the kmalloc array.
Sysfs will complain about a duplicate directory entry.  All of this occurs
while the console is not yet fully operational.  Thus boot may appear to be
silently failing.

The kmem_cache_node cache is very special.  During early boostrap the main
allocation function is not operational yet and so we have to run our own
small special alloc function during early boot.  It is also special in that
it is never freed.

We really do not want any merging on that cache.  Set the refcount -1 and
forbid merging of slabs that have a negative refcount.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-31 07:58:14 -07:00
Christoph Lameter 33e9e24101 SLUB Debug: fix check for super sized slabs (>512k 64bit, >256k 32bit)
The check for super sized slabs where we can no longer move the free
pointer behind the object for debugging purposes etc is accessing a
field that is not setup yet.  We must use objsize here since the size of
the slab has not been determined yet.

The effect of this is that a global slab shrink via "slabinfo -s" will
show errors about offsets being wrong if booted with slub_debug.
Potentially there are other troubles with huge slabs under slub_debug
because the calculated free pointer offset is truncated.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:13 -07:00
Christoph Lameter c12b3c6251 SLUB Debug: Fix object size calculation
The object size calculation is wrong if !CONFIG_SLUB_DEBUG because the
#ifdef CONFIG_SLUB_DEBUG is now switching off the size adjustments for
DESTROY_BY_RCU and ctor.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-23 20:14:11 -07:00
Christoph Lameter 3ec0974210 SLUB: Simplify debug code
Consolidate functionality into the #ifdef section.

Extract tracing into one subroutine.

Move object debug processing into the #ifdef section so that the
code in __slab_alloc and __slab_free becomes minimal.

Reduce number of functions we need to provide stubs for in the !SLUB_DEBUG case.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter a35afb830f Remove SLAB_CTOR_CONSTRUCTOR
SLAB_CTOR_CONSTRUCTOR is always specified. No point in checking it.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:04 -07:00
Christoph Lameter 5577bd8a85 SLUB: Do our own flags based on PG_active and PG_error
The atomicity when handling flags in SLUB is not necessary since both flags
used by SLUB are not updated in a racy way.  Flag updates are either done
during slab creation or destruction or under slab_lock.  Some of these flags
do not have the non atomic variants that we need.  So define our own.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter 4b6f075045 SLUB: Define functions for cpu slab handling instead of using PageActive
Use inline functions to access the per cpu bit.  Intoduce the notion of
"freezing" a slab to make things more understandable.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Christoph Lameter c59def9f22 Slab allocators: Drop support for destructors
There is no user of destructors left.  There is no reason why we should keep
checking for destructors calls in the slab allocators.

The RFC for this patch was discussed at
http://marc.info/?l=linux-kernel&m=117882364330705&w=2

Destructors were mainly used for list management which required them to take a
spinlock.  Taking a spinlock in a destructor is a bit risky since the slab
allocators may run the destructors anytime they decide a slab is no longer
needed.

Patch drops destructor support.  Any attempt to use a destructor will BUG().

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-17 05:23:03 -07:00
Hugh Dickins 1800782016 slub: don't confuse ctor and dtor
kmem_cache_create() was swapping ctor and dtor in calling find_mergeable():
though it caused no bug, and probably never would, even if destructors are
retained; but fix it so as not to generate anxiety ;)

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-16 21:19:15 -07:00
Christoph Lameter bcf889f965 SLUB: remove nr_cpu_ids hack
This was in SLUB in order to head off trouble while the nr_cpu_ids
functionality was not merged.  Its merged now so no need to still have this.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:53 -07:00
Christoph Lameter 894b8788d7 slub: support concurrent local and remote frees and allocs on a slab
Avoid atomic overhead in slab_alloc and slab_free

SLUB needs to use the slab_lock for the per cpu slabs to synchronize with
potential kfree operations.  This patch avoids that need by moving all free
objects onto a lockless_freelist.  The regular freelist continues to exist
and will be used to free objects.  So while we consume the
lockless_freelist the regular freelist may build up objects.

If we are out of objects on the lockless_freelist then we may check the
regular freelist.  If it has objects then we move those over to the
lockless_freelist and do this again.  There is a significant savings in
terms of atomic operations that have to be performed.

We can even free directly to the lockless_freelist if we know that we are
running on the same processor.  So this speeds up short lived objects.
They may be allocated and freed without taking the slab_lock.  This is
particular good for netperf.

In order to maximize the effect of the new faster hotpath we extract the
hottest performance pieces into inlined functions.  These are then inlined
into kmem_cache_alloc and kmem_cache_free.  So hotpath allocation and
freeing no longer requires a subroutine call within SLUB.

[I am not sure that it is worth doing this because it changes the easy to
read structure of slub just to reduce atomic ops.  However, there is
someone out there with a benchmark on 4 way and 8 way processor systems
that seems to show a 5% regression vs.  Slab.  Seems that the regression is
due to increased atomic operations use vs.  SLAB in SLUB).  I wonder if
this is applicable or discernable at all in a real workload?]

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-10 09:26:52 -07:00
Christoph Lameter 4037d45220 Move remote node draining out of slab allocators
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes.  This requires SLUB to have
a whole subsystem in order to be compatible with SLAB.  Moving node draining
out of the slab allocators avoids a section of code in SLUB.

Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.

Add a expire counter there.  If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.

The expire counter will be decremented on each vm stats update pass until it
reaches zero.  Then we will drain one batch from the pageset.  The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty.  So we will drain a batch every 3 seconds.

Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues.  The number of pcp is determined by the
number of processors and nodes in a system.  A system with 4 processors and 2
nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Christoph Lameter d1187ed210 vmstat: use our own timer events
vmstat is currently using the cache reaper to periodically bring the
statistics up to date.  The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB.  This patch removes the vmstat calls from the
slab allocators and provides its own handling.

The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick.  This will lead to some overlap in large
systems.  F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.

However, the vm stats update only accesses per node information.  It is only
necessary to stagger the vm statistics updates per processor in each node.  Vm
counter updates occurring on distant nodes will not cause cacheline
contention.

We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick.  That may be useful to keep a large amount of the second free
of timer activity.  Maybe the timer folks will have some feedback on this one?

[jirislaby@gmail.com: add missing break]
Cc: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Rafael J. Wysocki 8bb7844286 Add suspend-related notifications for CPU hotplug
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress.  This
patch introduces such notifications and causes them to be used during
suspend and resume transitions.  It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).

[oleg@tv-sign.ru: cleanups]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:56 -07:00
Pekka J Enberg 7ae439ce0c krealloc: fix kerneldoc comments
No "blank" (or "*") line is allowed between the function name and lines for
it parameter(s).

Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 5e6d444ea1 SLUB: rework slab order determination
In some cases SLUB is creating uselessly slabs that are larger than
slub_max_order. Also the layout of some of the slabs was not satisfactory.

Go to an iterarive approach.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 45edfa580b SLUB: include lifetime stats and sets of cpus / nodes in tracking output
We have information about how long an object existed and about the nodes and
cpus where the allocations and frees took place.  Add that information to the
tracking output in /sys/slab/xx/alloc_calls and /sys/slab/free_calls

This will then enable slabinfo to output nice reports like this:

  christoph@qirst:~/slub$ ./slabinfo kmalloc-128

  Slabcache: kmalloc-128           Aliases:  0 Order :  0

  Sizes (bytes)     Slabs              Debug                Memory
  ------------------------------------------------------------------------
  Object :     128  Total  :      12   Sanity Checks : On   Total:   49152
  SlabObj:     200  Full   :       7   Redzoning     : On   Used :   24832
  SlabSiz:    4096  Partial:       4   Poisoning     : On   Loss :   24320
  Loss   :      72  CpuSlab:       1   Tracking      : On   Lalig:   13968
  Align  :       8  Objects:      20   Tracing       : Off  Lpadd:    1152

  kmalloc-128 has no kmem_cache operations

  kmalloc-128: Kernel object allocation
  -----------------------------------------------------------------------
        6 param_sysfs_setup+0x71/0x130 age=284512/284512/284512 pid=1 nodes=0-1,3
       11 percpu_populate+0x39/0x80 age=283914/284428/284512 pid=1 nodes=0
       21 __register_chrdev_region+0x31/0x170 age=282896/284347/284473 pid=1-1705 nodes=0-2
        1 sys_inotify_init+0x76/0x1c0 age=283423 pid=1004 nodes=0
       19 as_get_io_context+0x32/0xd0 age=6/247567/283988 pid=1-11782 nodes=0,2
       10 ida_pre_get+0x4a/0x80 age=277666/283773/284526 pid=0-2177 nodes=0,2
       24 kobject_kset_add_dir+0x37/0xb0 age=282727/283860/284472 pid=1-1723 nodes=0-2
        1 acpi_ds_build_internal_buffer_obj+0xd3/0x11d age=284508 pid=1 nodes=0
       24 con_insert_unipair+0xd7/0x110 age=284438/284438/284438 pid=1 nodes=0,2
        1 uart_open+0x2d2/0x4b0 age=283896 pid=1 nodes=0
       26 dma_pool_create+0x73/0x1a0 age=282762/282833/282916 pid=1705-1723 nodes=0
        1 neigh_table_init_no_netlink+0xd2/0x210 age=284461 pid=1 nodes=0
        2 neigh_parms_alloc+0x2b/0xe0 age=284410/284411/284412 pid=1 nodes=2
        2 neigh_resolve_output+0x1e1/0x280 age=276289/276291/276293 pid=0-2443 nodes=0
        1 netlink_kernel_create+0x90/0x170 age=284472 pid=1 nodes=0
        4 xt_alloc_table_info+0x39/0xf0 age=283958/283958/283959 pid=1 nodes=1
        3 fn_hash_insert+0x473/0x720 age=277653/277661/277666 pid=2177-2185 nodes=0
        1 get_mtrr_state+0x285/0x2a0 age=284526 pid=0 nodes=0
        1 cacheinfo_cpu_callback+0x26d/0x3e0 age=284458 pid=1 nodes=0
       29 kernel_param_sysfs_setup+0x25/0x90 age=284511/284511/284512 pid=1 nodes=0-1,3
        5 process_zones+0x5e/0x170 age=284546/284546/284546 pid=0 nodes=0
        1 drm_core_init+0x48/0x160 age=284421 pid=1 nodes=2

  kmalloc-128: Kernel object freeing
  ------------------------------------------------------------------------
      163 <not-available> age=4295176847 pid=0 nodes=0-3
        1 __vunmap+0x6e/0xf0 age=282907 pid=1723 nodes=0
       28 free_as_io_context+0x12/0x90 age=9243/262197/283474 pid=42-11754 nodes=0
        1 acpi_get_object_info+0x1b7/0x1d4 age=284475 pid=1 nodes=0
        1 do_acpi_find_child+0x45/0x4e age=284475 pid=1 nodes=0

  NUMA nodes           :    0    1    2    3
  ------------------------------------------
  All slabs                 7    2    2    1
  Partial slabs             2    2    0    0

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:46 -07:00
Christoph Lameter 41ecc55b8a SLUB: add CONFIG_SLUB_DEBUG
CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components
of SLUB.  Thus SLUB will be able to replace SLOB.  SLUB can arrange objects in
a denser way than SLOB and the code size should be minimal without debugging
and sysfs support.

Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG.
CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB.  SLUB enables
debugging via a boot parameter.  SLUB debug code should always be present.

CONFIG_SLUB_DEBUG can be modified in the embedded config section.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 02cbc87446 SLUB: move tracking definitions and check_valid_pointer() away from debug code
Move the tracking definitions and the check_valid_pointer() function away from
the debugging related functions.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 636f0d7de8 SLUB: consolidate trace code
Trace in both slab_alloc and slab_free has a lot of common code.  Use a single
function for both.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 35e5d7ee27 SLUB: introduce DebugSlab(page)
This replaces the PageError() checking.  DebugSlab is clearer and allows for
future changes to the page bit used.  We also need it to support
CONFIG_SLUB_DEBUG.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter b345970905 SLUB: move resiliency check into SYSFS section
Move the resiliency check into the SYSFS section after validate_slab that is
used by the resiliency check.  This will avoid a forward declaration.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 7656c72b5a SLUB: add macros for scanning objects in a slab
Scanning of objects happens in a number of functions.  Consolidate that code.
DECLARE_BITMAP instead of coding the declaration for bitmaps.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 672bba3a4b SLUB: update comments
Update comments throughout SLUB to reflect the new developments.  Fix up
various awkward sentences.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 26a7bd0302 SLUB: get rid of finish_bootstrap
Its only purpose was to bring some sort of symmetry to sysfs usage when
dealing with bootstrapping per cpu flushing.  Since we do not time out slabs
anymore we have no need to run finish_bootstrap even without sysfs.  Fold it
back into slab_sysfs_init and drop the initcall for the !SYFS case.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter 1f99a283dc SLUB: clean up krealloc
We really do not need all this gaga there.

ksize gives us all the information we need to figure out if the object can
cope with the new size.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:45 -07:00
Christoph Lameter abcd08a6f5 SLUB: use check_valid_pointer in kmem_ptr_validate
We needlessly duplicate code. Also make check_valid_pointer inline.

Signed-off-by: Christoph LAemter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Christoph Lameter be7b3fbcef SLUB: after object padding only needed for Redzoning
If no redzoning is selected then we do not need padding before the next
object.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Christoph Lameter 65c02d4cfb SLUB: add support for dynamic cacheline size determination
SLUB currently assumes that the cacheline size is static.  However, i386 f.e.
supports dynamic cache line size determination.

Use cache_line_size() instead of L1_CACHE_BYTES in the allocator.

That also explains the purpose of SLAB_HWCACHE_ALIGN.  So we will need to keep
that one around to allow dynamic aligning of objects depending on boot
determination of the cache line size.

[akpm@linux-foundation.org: need to define it before we use it]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-09 12:30:44 -07:00
Linus Torvalds 0f9008ef38 Fix up SLUB compile
The newly merged SLUB allocator patches had been generated before the
removal of "struct subsystem", and ended up applying fine, but wouldn't
build based on the current tree as a result.

Fix up that merge error - not that SLUB is likely really ready for
showtime yet, but at least I can fix the trivial stuff.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:31:58 -07:00
Christoph Lameter cfce66047f Slab allocators: remove useless __GFP_NO_GROW flag
There is no user remaining and I have never seen any use of that flag.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 4f10493459 slab allocators: Remove SLAB_CTOR_ATOMIC
SLAB_CTOR atomic is never used which is no surprise since I cannot imagine
that one would want to do something serious in a constructor or destructor.
 In particular given that the slab allocators run with interrupts disabled.
 Actions in constructors and destructors are by their nature very limited
and usually do not go beyond initializing variables and list operations.

(The i386 pgd ctor and dtors do take a spinlock in constructor and
destructor.....  I think that is the furthest we go at this point.)

There is no flag passed to the destructor so removing SLAB_CTOR_ATOMIC also
establishes a certain symmetry.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 50953fe9e0 slab allocators: Remove SLAB_DEBUG_INITIAL flag
I have never seen a use of SLAB_DEBUG_INITIAL.  It is only supported by
SLAB.

I think its purpose was to have a callback after an object has been freed
to verify that the state is the constructor state again?  The callback is
performed before each freeing of an object.

I would think that it is much easier to check the object state manually
before the free.  That also places the check near the code object
manipulation of the object.

Also the SLAB_DEBUG_INITIAL callback is only performed if the kernel was
compiled with SLAB debugging on.  If there would be code in a constructor
handling SLAB_DEBUG_INITIAL then it would have to be conditional on
SLAB_DEBUG otherwise it would just be dead code.  But there is no such code
in the kernel.  I think SLUB_DEBUG_INITIAL is too problematic to make real
use of, difficult to understand and there are easier ways to accomplish the
same effect (i.e.  add debug code before kfree).

There is a related flag SLAB_CTOR_VERIFY that is frequently checked to be
clear in fs inode caches.  Remove the pointless checks (they would even be
pointless without removeal of SLAB_DEBUG_INITIAL) from the fs constructors.

This is the last slab flag that SLUB did not support.  Remove the check for
unimplemented flags from SLUB.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:57 -07:00
Christoph Lameter 5af6083990 slab allocators: Remove obsolete SLAB_MUST_HWCACHE_ALIGN
This patch was recently posted to lkml and acked by Pekka.

The flag SLAB_MUST_HWCACHE_ALIGN is

1. Never checked by SLAB at all.

2. A duplicate of SLAB_HWCACHE_ALIGN for SLUB

3. Fulfills the role of SLAB_HWCACHE_ALIGN for SLOB.

The only remaining use is in sparc64 and ppc64 and their use there
reflects some earlier role that the slab flag once may have had. If
its specified then SLAB_HWCACHE_ALIGN is also specified.

The flag is confusing, inconsistent and has no purpose.

Remove it.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:55 -07:00
Christoph Lameter 70d71228af slub: remove object activities out of checking functions
Make sure that the check function really only check things and do not perform
activities.  Extract the tracing and object seeding out of the two check
functions and place them into slab_alloc and slab_free

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 2086d26a05 SLUB: Free slabs and sort partial slab lists in kmem_cache_shrink
At kmem_cache_shrink check if we have any empty slabs on the partial
if so then remove them.

Also--as an anti-fragmentation measure--sort the partial slabs so that
the most fully allocated ones come first and the least allocated last.

The next allocations may fill up the nearly full slabs. Having the
least allocated slabs last gives them the maximum chance that their
remaining objects may be freed. Thus we can hopefully minimize the
partial slabs.

I think this is the best one can do in terms antifragmentation
measures. Real defragmentation (meaning moving objects out of slabs with
the least free objects to those that are almost full) can be implemted
by reverse scanning through the list produced here but that would mean
that we need to provide a callback at slab cache creation that allows
the deletion or moving of an object. This will involve slab API
changes, so defer for now.

Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 88a420e4e2 slub: add ability to list alloc / free callers per slab
This patch enables listing the callers who allocated or freed objects in a
cache.

For example to list the allocators for kmalloc-128 do

cat /sys/slab/kmalloc-128/alloc_calls
      7 sn_io_slot_fixup+0x40/0x700
      7 sn_io_slot_fixup+0x80/0x700
      9 sn_bus_fixup+0xe0/0x380
      6 param_sysfs_setup+0xf0/0x280
    276 percpu_populate+0xf0/0x1a0
     19 __register_chrdev_region+0x30/0x360
      8 expand_files+0x2e0/0x6e0
      1 sys_epoll_create+0x60/0x200
      1 __mounts_open+0x140/0x2c0
     65 kmem_alloc+0x110/0x280
      3 alloc_disk_node+0xe0/0x200
     33 as_get_io_context+0x90/0x280
     74 kobject_kset_add_dir+0x40/0x140
     12 pci_create_bus+0x2a0/0x5c0
      1 acpi_ev_create_gpe_block+0x120/0x9e0
     41 con_insert_unipair+0x100/0x1c0
      1 uart_open+0x1c0/0xba0
      1 dma_pool_create+0xe0/0x340
      2 neigh_table_init_no_netlink+0x260/0x4c0
      6 neigh_parms_alloc+0x30/0x200
      1 netlink_kernel_create+0x130/0x320
      5 fz_hash_alloc+0x50/0xe0
      2 sn_common_hubdev_init+0xd0/0x6e0
     28 kernel_param_sysfs_setup+0x30/0x180
     72 process_zones+0x70/0x2e0

cat /sys/slab/kmalloc-128/free_calls
    558 <not-available>
      3 sn_io_slot_fixup+0x600/0x700
     84 free_fdtable_rcu+0x120/0x260
      2 seq_release+0x40/0x60
      6 kmem_free+0x70/0xc0
     24 free_as_io_context+0x20/0x200
      1 acpi_get_object_info+0x3a0/0x3e0
      1 acpi_add_single_object+0xcf0/0x1e40
      2 con_release_unimap+0x80/0x140
      1 free+0x20/0x40

SLAB_STORE_USER must be enabled for a slab cache by either booting with
"slab_debug" or enabling user tracking specifically for the slab of interest.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter e95eed571e SLUB: Add MIN_PARTIAL
We leave a mininum of partial slabs on nodes when we search for
partial slabs on other node. Define a constant for that value.

Then modify slub to keep MIN_PARTIAL slabs around.

This avoids bad situations where a function frees the last object
in a slab (which results in the page being returned to the page
allocator) only to then allocate one again (which requires getting
a page back from the page allocator if the partial list was empty).
Keeping a couple of slabs on the partial list reduces overhead.

Empty slabs are added to the end of the partial list to insure that
partially allocated slabs are consumed first (defragmentation).

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 53e15af03b slub: validation of slabs (metadata and guard zones)
This enables validation of slab.  Validation means that all objects are
checked to see if there are redzone violations, if padding has been
overwritten or any pointers have been corrupted.  Also checks the consistency
of slab counters.

Validation enables the detection of metadata corruption without the kernel
having to execute code that actually uses (allocs/frees) and object.  It
allows one to make sure that the slab metainformation and the guard values
around an object have not been compromised.

A single slabcache can be checked by writing a 1 to the "validate" file.

i.e.

echo 1 >/sys/slab/kmalloc-128/validate

or use the slabinfo tool to check all slabs

slabinfo -v

Error messages will show up in the syslog.

Note that validation can only reach slabs that are on a list.  This means that
we are usually restricted to partial slabs and active slabs unless
SLAB_STORE_USER is active which will build a full slab list and allows
validation of slabs that are fully in use.  Booting with "slub_debug" set will
enable SLAB_STORE_USER and then full diagnostic are available.

Note that we attempt to push cpu slabs back to the lists when we start the
check.  If the cpu slab is reactivated before we get to it (another processor
grabs it before we get to it) then it cannot be checked.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 643b113849 slub: enable tracking of full slabs
If slab tracking is on then build a list of full slabs so that we can verify
the integrity of all slabs and are also able to built list of alloc/free
callers.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter 77c5e2d01a slub: fix object tracking
Object tracking did not work the right way for several call chains. Fix this up
by adding a new parameter to slub_alloc and slub_free that specifies the
caller address explicitly.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter b49af68ff9 Add virt_to_head_page and consolidate code in slab and slub
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:54 -07:00
Christoph Lameter d85f33855c Make page->private usable in compound pages
If we add a new flag so that we can distinguish between the first page and the
tail pages then we can avoid to use page->private in the first page.
page->private == page for the first page, so there is no real information in
there.

Freeing up page->private makes the use of compound pages more transparent.
They become more usable like real pages.  Right now we have to be careful f.e.
 if we are going beyond PAGE_SIZE allocations in the slab on i386 because we
can then no longer use the private field.  This is one of the issues that
cause us not to support debugging for page size slabs in SLAB.

Having page->private available for SLUB would allow more meta information in
the page struct.  I can probably avoid the 16 bit ints that I have in there
right now.

Also if page->private is available then a compound page may be equipped with
buffer heads.  This may free up the way for filesystems to support larger
blocks than page size.

We add PageTail as an alias of PageReclaim.  Compound pages cannot currently
be reclaimed.  Because of the alias one needs to check PageCompound first.

The RFC for the this approach was discussed at
http://marc.info/?t=117574302800001&r=1&w=2

[nacc@us.ibm.com: fix hugetlbfs]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 614410d589 SLUB: allocate smallest object size if the user asks for 0 bytes
Makes SLUB behave like SLAB in this area to avoid issues....

Throw a stack dump to alert people.

At some point the behavior should be switched back.  NULL is no memory as
far as I can tell and if the use asked for 0 bytes then he need to get no
memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 47bfdc0d5a SLUB: change default alignments
Structures may contain u64 items on 32 bit platforms that are only able to
address 64 bit items on 64 bit boundaries.  Change the mininum alignment of
slabs to conform to those expectations.

ARCH_KMALLOC_MINALIGN must be changed for good since a variety of structure
are mixed in the general slabs.

ARCH_SLAB_MINALIGN is changed because currently there is no consistent
specification of object alignment.  We may have that in the future when the
KMEM_CACHE and related macros are used to generate slabs.  These pass the
alignment of the structure generated by the compiler to the slab.

With KMEM_CACHE etc we could align structures that do not contain 64
bit values to 32 bit boundaries potentially saving some memory.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00
Christoph Lameter 81819f0fc8 SLUB core
This is a new slab allocator which was motivated by the complexity of the
existing code in mm/slab.c. It attempts to address a variety of concerns
with the existing implementation.

A. Management of object queues

   A particular concern was the complex management of the numerous object
   queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
   each allocating CPU and use objects from a slab directly instead of
   queueing them up.

B. Storage overhead of object queues

   SLAB Object queues exist per node, per CPU. The alien cache queue even
   has a queue array that contain a queue for each processor on each
   node. For very large systems the number of queues and the number of
   objects that may be caught in those queues grows exponentially. On our
   systems with 1k nodes / processors we have several gigabytes just tied up
   for storing references to objects for those queues  This does not include
   the objects that could be on those queues. One fears that the whole
   memory of the machine could one day be consumed by those queues.

C. SLAB meta data overhead

   SLAB has overhead at the beginning of each slab. This means that data
   cannot be naturally aligned at the beginning of a slab block. SLUB keeps
   all meta data in the corresponding page_struct. Objects can be naturally
   aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
   boundaries and can fit tightly into a 4k page with no bytes left over.
   SLAB cannot do this.

D. SLAB has a complex cache reaper

   SLUB does not need a cache reaper for UP systems. On SMP systems
   the per CPU slab may be pushed back into partial list but that
   operation is simple and does not require an iteration over a list
   of objects. SLAB expires per CPU, shared and alien object queues
   during cache reaping which may cause strange hold offs.

E. SLAB has complex NUMA policy layer support

   SLUB pushes NUMA policy handling into the page allocator. This means that
   allocation is coarser (SLUB does interleave on a page level) but that
   situation was also present before 2.6.13. SLABs application of
   policies to individual slab objects allocated in SLAB is
   certainly a performance concern due to the frequent references to
   memory policies which may lead a sequence of objects to come from
   one node after another. SLUB will get a slab full of objects
   from one node and then will switch to the next.

F. Reduction of the size of partial slab lists

   SLAB has per node partial lists. This means that over time a large
   number of partial slabs may accumulate on those lists. These can
   only be reused if allocator occur on specific nodes. SLUB has a global
   pool of partial slabs and will consume slabs from that pool to
   decrease fragmentation.

G. Tunables

   SLAB has sophisticated tuning abilities for each slab cache. One can
   manipulate the queue sizes in detail. However, filling the queues still
   requires the uses of the spin lock to check out slabs. SLUB has a global
   parameter (min_slab_order) for tuning. Increasing the minimum slab
   order can decrease the locking overhead. The bigger the slab order the
   less motions of pages between per CPU and partial lists occur and the
   better SLUB will be scaling.

G. Slab merging

   We often have slab caches with similar parameters. SLUB detects those
   on boot up and merges them into the corresponding general caches. This
   leads to more effective memory use. About 50% of all caches can
   be eliminated through slab merging. This will also decrease
   slab fragmentation because partial allocated slabs can be filled
   up again. Slab merging can be switched off by specifying
   slub_nomerge on boot up.

   Note that merging can expose heretofore unknown bugs in the kernel
   because corrupted objects may now be placed differently and corrupt
   differing neighboring objects. Enable sanity checks to find those.

H. Diagnostics

   The current slab diagnostics are difficult to use and require a
   recompilation of the kernel. SLUB contains debugging code that
   is always available (but is kept out of the hot code paths).
   SLUB diagnostics can be enabled via the "slab_debug" option.
   Parameters can be specified to select a single or a group of
   slab caches for diagnostics. This means that the system is running
   with the usual performance and it is much more likely that
   race conditions can be reproduced.

I. Resiliency

   If basic sanity checks are on then SLUB is capable of detecting
   common error conditions and recover as best as possible to allow the
   system to continue.

J. Tracing

   Tracing can be enabled via the slab_debug=T,<slabcache> option
   during boot. SLUB will then protocol all actions on that slabcache
   and dump the object contents on free.

K. On demand DMA cache creation.

   Generally DMA caches are not needed. If a kmalloc is used with
   __GFP_DMA then just create this single slabcache that is needed.
   For systems that have no ZONE_DMA requirement the support is
   completely eliminated.

L. Performance increase

   Some benchmarks have shown speed improvements on kernbench in the
   range of 5-10%. The locking overhead of slub is based on the
   underlying base allocation size. If we can reliably allocate
   larger order pages then it is possible to increase slub
   performance much further. The anti-fragmentation patches may
   enable further performance increases.

Tested on:
i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator

SLUB Boot options

slub_nomerge		Disable merging of slabs
slub_min_order=x	Require a minimum order for slab caches. This
			increases the managed chunk size and therefore
			reduces meta data and locking overhead.
slub_min_objects=x	Mininum objects per slab. Default is 8.
slub_max_order=x	Avoid generating slabs larger than order specified.
slub_debug		Enable all diagnostics for all caches
slub_debug=<options>	Enable selective options for all caches
slub_debug=<o>,<cache>	Enable selective options for a certain set of
			caches

Available Debug options
F		Double Free checking, sanity and resiliency
R		Red zoning
P		Object / padding poisoning
U		Track last free / alloc
T		Trace all allocs / frees (only use for individual slabs).

To use SLUB: Apply this patch and then select SLUB as the default slab
allocator.

[hugh@veritas.com: fix an oops-causing locking error]
[akpm@linux-foundation.org: various stupid cleanups and small fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-05-07 12:12:53 -07:00