1
0
Fork 0
Commit Graph

356798 Commits (3da4cd2015630f50d8d80c6ff5089d3daa2306c6)

Author SHA1 Message Date
AceLan Kao 3da4cd2015 asus-laptop: add all video switch keys
Fill up all the video switch keys in the map.

Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:55 -08:00
AceLan Kao 98bfcb8e99 asus-nb-wmi: correct a touchpad hotkey mapping
0x60 is touchpad enable key, but is misdefined in the keymap.

Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:55 -08:00
AceLan Kao 19d3ab12e8 asus-laptop: correct a touchpad hotkey mapping
0x60 is touchpad enable key, but is misdefined in the keymap.

Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:55 -08:00
Corentin Chary a935eaecef asus-{nb-wmi|laptop}.c: sync keymaps
Maybe this should be shared in another module...

Signed-off-by: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:55 -08:00
Corentin Chary 982d385ad1 asus-laptop: map some new keys
Signed-off-by: Corentin Chary <corentin.chary@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:55 -08:00
Maxim Mikityanskiy c11ac2aa52 msi-wmi: Add MSI Wind support
Add MSI Wind support to msi-wmi driver. MSI Wind has different GUID for
key events, different WMI key scan codes, it does not need filtering
consecutive identical events and it does not support backlight control
via MSIWMI_BIOS_GUID WMI. Tested on MSI Wind U100.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy fedda8e738 msi-wmi: Introduced quirk_last_pressed
Introduced quirk_last_pressed variable that would indicate if
last_pressed is used or not. Also converted last_pressed to simple
variable in order to allow keymap to be non-contiguous.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy da8506288f msi-wmi: Make keys and backlight independent
Introduced function msi_wmi_backlight_setup() that initializes backlight
device. Made driver load and work if only one WMI (only for hotkeys or
only for backlight) is present.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy b0d3bb53be msi-wmi: Use enums for scancodes
Use enums for consecutive scancodes, rename key names from MSI_WMI_* to
MSI_KEY_* and use tabs for whitespace in msi_wmi_keymap.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy dd2b025157 msi-wmi: Avoid repeating constants
Use UUID defines in MODULE_ALIAS strings to avoid repeating strings.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy 51c94491c8 msi-wmi: Fix memory leak
Fix memory leak - don't forget to kfree ACPI object when returning from
msi_wmi_notify() after suppressing key event.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Anisse Astier <anisse@astier.eu>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:54 -08:00
Maxim Mikityanskiy 03696e51d7 msi-laptop: Disable brightness control for new EC
It seems that existing brightness control works only for old EC models.
On newer ones auto_brightness access always timeouts and lcd_level
always shows 0. So disable brightness control for new EC models. It
works fine with ACPI video driver anyway.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
Maxim Mikityanskiy cdeaf38683 msi-laptop: Add missing ABI documentation
Add ABI documentation for all sysfs files exposed by msi-laptop driver.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
Maxim Mikityanskiy 0de6575ad0 msi-laptop: Add MSI Wind U90/U100 support
Add MSI Wind U90/U100 to DMI table and add some missing EC features
support such as basic fan control, turbo and ECO modes and touchpad
state. Tested on MSI Wind U100.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
Lee, Chun-Yi 0816392b97 msi-laptop: merge quirk tables to one
This patch introduced a quirk_entry struct, then we merged all quirk
tables to msi_dmi_table. Then we can more easily to set different quirk
attributes for different machine.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>

Changed this patch so that it could be applied before MSI Wind U100
support patch. Changed rfkill logic for ec_read_only quirk support.
Removed delays if ec_delay = false.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Acked-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
Maxim Mikityanskiy 1b6517a0a9 msi-laptop: Work around gcc warning
Assign initial value to variable in order to prevent gcc warning about
uninitialized variable.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
Maxim Mikityanskiy 27eb9e7f12 msi-laptop: Use proper return codes instead of -1
Use proper function return codes instead of -1

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Signed-off-by: Matthew Garrett <matthew.garrett@nebula.com>
2013-02-24 14:49:53 -08:00
AceLan Kao 6cae06e603 asus-wmi: update wlan LED through rfkill led trigger
For those machines with wapf=4, BIOS won't update the wireless LED,
since wapf=4 means user application will take in chage of the wifi and bt.
So, we have to update wlan LED status explicitly.

But I found there is another wireless LED bug in launchpad and which is
not in the wapf=4 quirk.
So, it might be better to set wireless LED status explicitly for all
machines.

BugLink: https://launchpad.net/bugs/901105

Signed-off-by: AceLan Kao <acelan.kao@canonical.com>
Signed-off-by: Matthew Garrett <mjg@redhat.com>
2013-02-24 14:49:52 -08:00
Linus Torvalds 9e2d59ad58 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal
Pull signal handling cleanups from Al Viro:
 "This is the first pile; another one will come a bit later and will
  contain SYSCALL_DEFINE-related patches.

   - a bunch of signal-related syscalls (both native and compat)
     unified.

   - a bunch of compat syscalls switched to COMPAT_SYSCALL_DEFINE
     (fixing several potential problems with missing argument
     validation, while we are at it)

   - a lot of now-pointless wrappers killed

   - a couple of architectures (cris and hexagon) forgot to save
     altstack settings into sigframe, even though they used the
     (uninitialized) values in sigreturn; fixed.

   - microblaze fixes for delivery of multiple signals arriving at once

   - saner set of helpers for signal delivery introduced, several
     architectures switched to using those."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (143 commits)
  x86: convert to ksignal
  sparc: convert to ksignal
  arm: switch to struct ksignal * passing
  alpha: pass k_sigaction and siginfo_t using ksignal pointer
  burying unused conditionals
  make do_sigaltstack() static
  arm64: switch to generic old sigaction() (compat-only)
  arm64: switch to generic compat rt_sigaction()
  arm64: switch compat to generic old sigsuspend
  arm64: switch to generic compat rt_sigqueueinfo()
  arm64: switch to generic compat rt_sigpending()
  arm64: switch to generic compat rt_sigprocmask()
  arm64: switch to generic sigaltstack
  sparc: switch to generic old sigsuspend
  sparc: COMPAT_SYSCALL_DEFINE does all sign-extension as well as SYSCALL_DEFINE
  sparc: kill sign-extending wrappers for native syscalls
  kill sparc32_open()
  sparc: switch to use of generic old sigaction
  sparc: switch sys_compat_rt_sigaction() to COMPAT_SYSCALL_DEFINE
  mips: switch to generic sys_fork() and sys_clone()
  ...
2013-02-23 18:50:11 -08:00
Linus Torvalds 5ce1a70e2f Merge branch 'akpm' (more incoming from Andrew)
Merge second patch-bomb from Andrew Morton:

 - A little DM fix

 - the MM queue

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (154 commits)
  ksm: allocate roots when needed
  mm: cleanup "swapcache" in do_swap_page
  mm,ksm: swapoff might need to copy
  mm,ksm: FOLL_MIGRATION do migration_entry_wait
  ksm: shrink 32-bit rmap_item back to 32 bytes
  ksm: treat unstable nid like in stable tree
  ksm: add some comments
  tmpfs: fix mempolicy object leaks
  tmpfs: fix use-after-free of mempolicy object
  mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
  mm: export mmu notifier invalidates
  mm: accelerate mm_populate() treatment of THP pages
  mm: use long type for page counts in mm_populate() and get_user_pages()
  mm: accurately document nr_free_*_pages functions with code comments
  HWPOISON: change order of error_states[]'s elements
  HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
  memcg: stop warning on memcg_propagate_kmem
  net: change type of virtio_chan->p9_max_pages
  vmscan: change type of vm_total_pages to unsigned long
  fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
  ...
2013-02-23 17:50:35 -08:00
Hugh Dickins ef53d16cde ksm: allocate roots when needed
It is a pity to have MAX_NUMNODES+MAX_NUMNODES tree roots statically
allocated, particularly when very few users will ever actually tune
merge_across_nodes 0 to use more than 1+1 of those trees.  Not a big
deal (only 16kB wasted on each machine with CONFIG_MAXSMP), but a pity.

Start off with 1+1 statically allocated, then if merge_across_nodes is
ever tuned, allocate for nr_node_ids+nr_node_ids.  Do not attempt to
free up the extra if it's tuned back, that would be a waste of effort.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:24 -08:00
Hugh Dickins 56f31801cc mm: cleanup "swapcache" in do_swap_page
I dislike the way in which "swapcache" gets used in do_swap_page():
there is always a page from swapcache there (even if maybe uncached by
the time we lock it), but tests are made according to "swapcache".
Rework that with "page != swapcache", as has been done in unuse_pte().

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:24 -08:00
Hugh Dickins 9e16b7fb1d mm,ksm: swapoff might need to copy
Before establishing that KSM page migration was the cause of my
WARN_ON_ONCE(page_mapped(page))s, I suspected that they came from the
lack of a ksm_might_need_to_copy() in swapoff's unuse_pte() - which in
many respects is equivalent to faulting in a page.

In fact I've never caught that as the cause: but in theory it does at
least need the KSM_RUN_UNMERGE check in ksm_might_need_to_copy(), to
avoid bringing a KSM page back in when it's not supposed to be.

I intended to copy how it's done in do_swap_page(), but have a strong
aversion to how "swapcache" ends up being used there: rework it with
"page != swapcache".

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins 5117b3b835 mm,ksm: FOLL_MIGRATION do migration_entry_wait
In "ksm: remove old stable nodes more thoroughly" I said that I'd never
seen its WARN_ON_ONCE(page_mapped(page)).  True at the time of writing,
but it soon appeared once I tried fuller tests on the whole series.

It turned out to be due to the KSM page migration itself: unmerge_and_
remove_all_rmap_items() failed to locate and replace all the KSM pages,
because of that hiatus in page migration when old pte has been replaced
by migration entry, but not yet by new pte.  follow_page() finds no page
at that instant, but a KSM page reappears shortly after, without a
fault.

Add FOLL_MIGRATION flag, so follow_page() can do migration_entry_wait()
for KSM's break_cow().  I'd have preferred to avoid another flag, and do
it every time, in case someone else makes the same easy mistake; but did
not find another transgressor (the common get_user_pages() is of course
safe), and cannot be sure that every follow_page() caller is prepared to
sleep - ia64's xencomm_vtop()? Now, THP's wait_split_huge_page() can
already sleep there, since anon_vma locking was changed to mutex, but
maybe that's somehow excluded.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins bc56620b49 ksm: shrink 32-bit rmap_item back to 32 bytes
Think of struct rmap_item as an extension of struct page (restricted to
MADV_MERGEABLE areas): there may be a lot of them, we need to keep them
small, especially on 32-bit architectures of limited lowmem.

Siting "int nid" after "unsigned int checksum" works nicely on 64-bit,
making no change to its 64-byte struct rmap_item; but bloats the 32-bit
struct rmap_item from (nicely cache-aligned) 32 bytes to 36 bytes, which
rounds up to 40 bytes once allocated from slab.  We'd better avoid that.

Hey, I only just remembered that the anon_vma pointer in struct
rmap_item has no purpose until the rmap_item is hung from a stable tree
node (which has its own nid field); and rmap_item's nid field no purpose
than to say which tree root to tell rb_erase() when unlinking from an
unstable tree.

Double them up in a union.  There's just one place where we set anon_vma
early (when we already hold mmap_sem): now we must remove tree_rmap_item
from its unstable tree there, before overwriting nid.  No need to
spatter BUG()s around: we'd be seeing oopses if this were wrong.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins b599cbdf1c ksm: treat unstable nid like in stable tree
An inconsistency emerged in reviewing the NUMA node changes to KSM: when
meeting a page from the wrong NUMA node in a stable tree, we say that
it's okay for comparisons, but not as a leaf for merging; whereas when
meeting a page from the wrong NUMA node in an unstable tree, we bail out
immediately.

Now, it might be that a wrong NUMA node in an unstable tree is more
likely to correlate with instablility (different content, with rbnode
now misplaced) than page migration; but even so, we are accustomed to
instablility in the unstable tree.

Without strong evidence for which strategy is generally better, I'd
rather be consistent with what's done in the stable tree: accept a page
from the wrong NUMA node for comparison, but not as a leaf for merging.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Hugh Dickins 8fdb3dbf02 ksm: add some comments
Added slightly more detail to the Documentation of merge_across_nodes, a
few comments in areas indicated by review, and renamed get_ksm_page()'s
argument from "locked" to "lock_it".  No functional change.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Petr Holasek <pholasek@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Greg Thelen 49cd0a5c29 tmpfs: fix mempolicy object leaks
Fix several mempolicy leaks in the tmpfs mount logic.  These leaks are
slow - on the order of one object leaked per mount attempt.

Leak 1 (umount doesn't free mpol allocated in mount):
    while true; do
        mount -t tmpfs -o mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

Leak 2 (errors parsing remount options will leak mpol):
    mount -t tmpfs -o size=100M nodev /mnt
    while true; do
        mount -o remount,mpol=interleave,size=x /mnt 2> /dev/null
    done
    umount /mnt

Leak 3 (multiple mpol per mount leak mpol):
    while true; do
        mount -t tmpfs -o mpol=interleave,mpol=interleave,size=100M nodev /mnt
        umount /mnt
    done

This patch fixes all of the above.  I could have broken the patch into
three pieces but is seemed easier to review as one.

[akpm@linux-foundation.org: fix handling of mpol_parse_str() errors, per Hugh]
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Greg Thelen 5f00110f72 tmpfs: fix use-after-free of mempolicy object
The tmpfs remount logic preserves filesystem mempolicy if the mpol=M
option is not specified in the remount request.  A new policy can be
specified if mpol=M is given.

Before this patch remounting an mpol bound tmpfs without specifying
mpol= mount option in the remount request would set the filesystem's
mempolicy object to a freed mempolicy object.

To reproduce the problem boot a DEBUG_PAGEALLOC kernel and run:
    # mkdir /tmp/x

    # mount -t tmpfs -o size=100M,mpol=interleave nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=102400k,mpol=interleave:0-3 0 0

    # mount -o remount,size=200M nodev /tmp/x

    # grep /tmp/x /proc/mounts
    nodev /tmp/x tmpfs rw,relatime,size=204800k,mpol=??? 0 0
        # note ? garbage in mpol=... output above

    # dd if=/dev/zero of=/tmp/x/f count=1
        # panic here

Panic:
    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<          (null)>]           (null)
    [...]
    Oops: 0010 [#1] SMP DEBUG_PAGEALLOC
    Call Trace:
      mpol_shared_policy_init+0xa5/0x160
      shmem_get_inode+0x209/0x270
      shmem_mknod+0x3e/0xf0
      shmem_create+0x18/0x20
      vfs_create+0xb5/0x130
      do_last+0x9a1/0xea0
      path_openat+0xb3/0x4d0
      do_filp_open+0x42/0xa0
      do_sys_open+0xfe/0x1e0
      compat_sys_open+0x1b/0x20
      cstar_dispatch+0x7/0x1f

Non-debug kernels will not crash immediately because referencing the
dangling mpol will not cause a fault.  Instead the filesystem will
reference a freed mempolicy object, which will cause unpredictable
behavior.

The problem boils down to a dropped mpol reference below if
shmem_parse_options() does not allocate a new mpol:

    config = *sbinfo
    shmem_parse_options(data, &config, true)
    mpol_put(sbinfo->mpol)
    sbinfo->mpol = config.mpol  /* BUG: saves unreferenced mpol */

This patch avoids the crash by not releasing the mempolicy if
shmem_parse_options() doesn't create a new mpol.

How far back does this issue go? I see it in both 2.6.36 and 3.3.  I did
not look back further.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Mel Gorman 67d46b296a mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
Rob van der Heij reported the following (paraphrased) on private mail.

	The scenario is that I want to avoid backups to fill up the page
	cache and purge stuff that is more likely to be used again (this is
	with s390x Linux on z/VM, so I don't give it as much memory that
	we don't care anymore). So I have something with LD_PRELOAD that
	intercepts the close() call (from tar, in this case) and issues
	a posix_fadvise() just before closing the file.

	This mostly works, except for small files (less than 14 pages)
	that remains in page cache after the face.

Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.

The issue is the per-cpu pagevecs for LRU additions.  If the pages are
added by one CPU but fadvise() is called on another then the pages
remain resident as the invalidate_mapping_pages() only drains the local
pagevecs via its call to pagevec_release().  The user-visible effect is
that a program that uses fadvise() properly is not obeyed.

A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a
pagevec page could not be discarded.  The downside with this is that an
inode cache shrink would send a global IPI and memory pressure
potentially causing global IPI storms is very undesirable.

Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages.
If a subset of pages are discarded it drains the LRU pagevecs and tries
again.  If the second attempt fails, it assumes it is due to the pages
being mapped, locked or dirty and does not care.  With this patch, an
application using fadvise() correctly will be obeyed but there is a
downside that a malicious application can force the kernel to send
global IPIs and increase overhead.

If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.

The following test program demonstrates the problem.  It should never
report that pages are still resident but will without this patch.  It
assumes that CPU 0 and 1 exist.

int main() {
	int fd;
	int pagesize = getpagesize();
	ssize_t written = 0, expected;
	char *buf;
	unsigned char *vec;
	int resident, i;
	cpu_set_t set;

	/* Prepare a buffer for writing */
	expected = FILESIZE_PAGES * pagesize;
	buf = malloc(expected + 1);
	if (buf == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}
	buf[expected] = 0;
	memset(buf, 'a', expected);

	/* Prepare the mincore vec */
	vec = malloc(FILESIZE_PAGES);
	if (vec == NULL) {
		printf("ENOMEM\n");
		exit(EXIT_FAILURE);
	}

	/* Bind ourselves to CPU 0 */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* open file, unlink and write buffer */
	fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
	if (fd == -1) {
		perror("open");
		exit(EXIT_FAILURE);
	}
	unlink("fadvise-test-file");
	while (written < expected) {
		ssize_t this_write;
		this_write = write(fd, buf + written, expected - written);

		if (this_write == -1) {
			perror("write");
			exit(EXIT_FAILURE);
		}

		written += this_write;
	}
	free(buf);

	/*
	 * Force ourselves to another CPU. If fadvise only flushes the local
	 * CPUs pagevecs then the fadvise will fail to discard all file pages
	 */
	CPU_ZERO(&set);
	CPU_SET(1, &set);
	if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
		perror("sched_setaffinity");
		exit(EXIT_FAILURE);
	}

	/* sync and fadvise to discard the page cache */
	fsync(fd);
	if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
		perror("posix_fadvise");
		exit(EXIT_FAILURE);
	}

	/* map the file and use mincore to see which parts of it are resident */
	buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == NULL) {
		perror("mmap");
		exit(EXIT_FAILURE);
	}
	if (mincore(buf, expected, vec) == -1) {
		perror("mincore");
		exit(EXIT_FAILURE);
	}

	/* Check residency */
	for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
		if (vec[i])
			resident++;
	}
	if (resident != 0) {
		printf("Nr unexpected pages resident: %d\n", resident);
		exit(EXIT_FAILURE);
	}

	munmap(buf, expected);
	close(fd);
	free(vec);
	exit(EXIT_SUCCESS);
}

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Rob van der Heij <rvdheij@gmail.com>
Tested-by: Rob van der Heij <rvdheij@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Cliff Wickman fa794199e3 mm: export mmu notifier invalidates
We at SGI have a need to address some very high physical address ranges
with our GRU (global reference unit), sometimes across partitioned
machine boundaries and sometimes with larger addresses than the cpu
supports.  We do this with the aid of our own 'extended vma' module
which mimics the vma.  When something (either unmap or exit) frees an
'extended vma' we use the mmu notifiers to clean them up.

We had been able to mimic the functions
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end() by locking the per-mm lock and
walking the per-mm notifier list.  But with the change to a global srcu
lock (static in mmu_notifier.c) we can no longer do that.  Our module has
no access to that lock.

So we request that these two functions be exported.

Signed-off-by: Cliff Wickman <cpw@sgi.com>
Acked-by: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Michel Lespinasse 240aadeedc mm: accelerate mm_populate() treatment of THP pages
This change adds a follow_page_mask function which is equivalent to
follow_page, but with an extra page_mask argument.

follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters
a THP page, and to 0 in other cases.

__get_user_pages() makes use of this in order to accelerate populating
THP ranges - that is, when both the pages and vmas arrays are NULL, we
don't need to iterate HPAGE_PMD_NR times to cover a single THP page (and
we also avoid taking mm->page_table_lock that many times).

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:23 -08:00
Michel Lespinasse 28a35716d3 mm: use long type for page counts in mm_populate() and get_user_pages()
Use long type for page counts in mm_populate() so as to avoid integer
overflow when running the following test code:

int main(void) {
  void *p = mmap(NULL, 0x100000000000, PROT_READ,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  printf("p: %p\n", p);
  mlockall(MCL_CURRENT);
  printf("done\n");
  return 0;
}

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei e0fb581529 mm: accurately document nr_free_*_pages functions with code comments
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
in case of the misuse of them.

[akpm@linux-foundation.org: tweak comments]
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Naoya Horiguchi 5f4b9fc5c1 HWPOISON: change order of error_states[]'s elements
error_states[] has two separate states "unevictable LRU page" and
"mlocked LRU page", and the former one has the higher priority now.  But
because of that the latter one is rarely chosen because pages with
PageMlocked highly likely have PG_unevictable set.  On the other hand,
PG_unevictable without PageMlocked is common for ramfs or SHM_LOCKed
shared memory, so reversing the priority of these two states helps us
clearly distinguish them.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Naoya Horiguchi 524fca1e73 HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
memory_failure() can't handle memory errors on mlocked pages correctly,
because page_action() judges such errors as ones on "unknown pages"
instead of ones on "unevictable LRU page" or "mlocked LRU page".  In
order to determine page_state page_action() checks page flags at the
timing of the judgement, but such page flags are not the same with those
just after memory_failure() is called, because memory_failure() does
unmapping of the error pages before doing page_action().  This unmapping
changes the page state, especially page_remove_rmap() (called from
try_to_unmap_one()) clears PG_mlocked, so page_action() can't catch
mlocked pages after that.

With this patch, we store the page flag of the error page before doing
unmap, and (only) if the first check with page flags at the time decided
the error page is unknown, we do the second check with the stored page
flag.  This implementation doesn't change error handling for the page
types for which the first check can determine the page state correctly.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Hugh Dickins 6d04399040 memcg: stop warning on memcg_propagate_kmem
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei 7293bfba03 net: change type of virtio_chan->p9_max_pages
This member of struct virtio_chan is calculated from nr_free_buffer_pages
so change its type to unsigned long in case of overflow.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei b21e0b90cc vmscan: change type of vm_total_pages to unsigned long
This variable is calculated from nr_free_pagecache_pages so
change its type to unsigned long.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei 697ce9be7d fs/nfsd: change type of max_delegations, nfsd_drc_max_mem and nfsd_drc_mem_used
The three variables are calculated from nr_free_buffer_pages so change
their types to unsigned long in case of overflow.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei 43be594a6b fs/buffer.c: change type of max_buffer_heads to unsigned long
max_buffer_heads is calculated from nr_free_buffer_pages(), so change
its type to unsigned long in case of overflow.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei 6434b94a16 ia64: use %ld to print pages calculated in nr_free_buffer_pages
Now the function nr_free_buffer_pages returns unsigned long, so use %ld
to print its return value.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:22 -08:00
Zhang Yanfei ebec3862fd mm: fix return type for functions nr_free_*_pages
Currently, the amount of RAM that functions nr_free_*_pages return is
held in unsigned int.  But in machines with big memory (exceeding 16TB),
the amount may be incorrect because of overflow, so fix it.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Michal Hocko 1081312f95 memcg: cleanup mem_cgroup_init comment
We should encourage all memcg controller initialization independent on a
specific mem_cgroup to be done here rather than exploit css_alloc
callback and assume that nothing happens before root cgroup is created.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Michal Hocko e477749624 memcg: move memcg_stock initialization to mem_cgroup_init
memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.

This patch wraps the current memcg_stock initialization code into a
helper calls it from the controller subsystem initialization code.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Michal Hocko 8787a1df30 memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg
allocation code with something that can be called when the memcg
subsystem is initialized by mem_cgroup_init along with other controller
specific parts.

While we are at it let's make mem_cgroup_soft_limit_tree_init void
because it doesn't make much sense to report memory failure because if
we fail to allocate memory that early during the boot then we are
screwed anyway (this saves some code).

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Minchan Kim 0e50ce3b50 mm: use up free swap space before reaching OOM kill
Recently, Luigi reported there are lots of free swap space when OOM
happens.  It's easily reproduced on zram-over-swap, where many instance
of memory hogs are running and laptop_mode is enabled.  He said there
was no problem when he disabled laptop_mode.  The problem when I
investigate problem is following as.

Assumption for easy explanation: There are no page cache page in system
because they all are already reclaimed.

1. try_to_free_pages disable may_writepage when laptop_mode is enabled.
2. shrink_inactive_list isolates victim pages from inactive anon lru list.
3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't
   pageout because sc->may_writepage is 0 so the page is rotated back into
   inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty.
4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and
   retry reclaim with higher priority.
5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list
   but got failed because it try to isolate pages with ISOLATE_CLEAN mode but
   inactive anon lru list is full of dirty pages by 3 so it just returns
   without  any reclaim progress.
6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned.
   Because sc->nr_scanned is increased by shrink_page_list but we don't call
   shrink_page_list in 5 due to short of isolated pages.

Above loop is continued until OOM happens.

The problem didn't happen before [1] was merged because old logic's
isolatation in shrink_inactive_list was successful and tried to call
shrink_page_list to pageout them but it still ends up failed to page out
by may_writepage.  But important point is that sc->nr_scanned was
increased although we couldn't swap out them so do_try_to_free_pages
could set may_writepages.

Since commit f80c067361 ("mm: zone_reclaim: make isolate_lru_page()
filter-aware") was introduced, it's not a good idea any more to depends
on only the number of scanned pages for setting may_writepage.  So this
patch adds new trigger point of setting may_writepage as below
DEF_PRIOIRTY - 2 which is used to show the significant memory pressure
in VM so it's good fit for our purpose which would be better to lose
power saving or clickety rather than OOM killing.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Luigi Semenzato <semenzato@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
David Rientjes 00ef2d2f84 mm: use NUMA_NO_NODE
Make a sweep through mm/ and convert code that uses -1 directly to using
the more appropriate NUMA_NO_NODE.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Robin Holt 751efd8610 mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().

Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).

                A                               B
t1                                              srcu_read_lock()
t2              if (!hlist_unhashed())
t3                                              srcu_read_unlock()
t4              srcu_read_lock()
t5                                              hlist_del_init_rcu()
t6                                              synchronize_srcu()
t7              srcu_read_unlock()
t8              hlist_del_rcu()  <--- NULL pointer deref.

Additionally, the list traversal in __mmu_notifier_release() is not
protected by the by the mmu_notifier_mm->hlist_lock which can result in
callouts to the ->release() notifier from both mmu_notifier_unregister()
and __mmu_notifier_release().

-stable suggestions:

The stable trees prior to 3.7.y need commits 21a92735f6 and
70400303ce cherry-picked in that order prior to cherry-picking this
commit.  The 3.7.y tree already has those two commits.

Signed-off-by: Robin Holt <holt@sgi.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Sagi Grimberg <sagig@mellanox.co.il>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00
Cody P Schafer c1f1949527 mm/memory_hotplug: use pgdat_end_pfn() instead of open coding the same.
Replace open coded pgdat_end_pfn() with helper function.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Hansen <dave@linux.vnet.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 17:50:21 -08:00