Commit graph

9032 commits

Author SHA1 Message Date
Linus Torvalds 9c8e30d12d Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "15 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm: numa: mark huge PTEs young when clearing NUMA hinting faults
  mm: numa: slow PTE scan rate if migration failures occur
  mm: numa: preserve PTE write permissions across a NUMA hinting fault
  mm: numa: group related processes based on VMA flags instead of page table flags
  hfsplus: fix B-tree corruption after insertion at position 0
  MAINTAINERS: add Jan as DMI/SMBIOS support maintainer
  fs/affs/file.c: unlock/release page on error
  mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
  mm/slub: fix lockups on PREEMPT && !SMP kernels
  mm/memory hotplug: postpone the reset of obsolete pgdat
  MAINTAINERS: correct rtc armada38x pattern entry
  mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
  mm: fix anon_vma->degree underflow in anon_vma endless growing prevention
  drivers/rtc/rtc-mrst: fix suspend/resume
  aoe: update aoe maintainer information
2015-03-25 16:21:17 -07:00
Mel Gorman b7b04004ec mm: numa: mark huge PTEs young when clearing NUMA hinting faults
Base PTEs are marked young when the NUMA hinting information is cleared
but the same does not happen for huge pages which this patch addresses.

Note that migrated pages are not marked young as the base page migration
code does not assume that migrated pages have been referenced.  This
could be addressed but beyond the scope of this series which is aimed at
Dave Chinners shrink workload that is unlikely to be affected by this
issue.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman 074c238177 mm: numa: slow PTE scan rate if migration failures occur
Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman b191f9b106 mm: numa: preserve PTE write permissions across a NUMA hinting fault
Protecting a PTE to trap a NUMA hinting fault clears the writable bit
and further faults are needed after trapping a NUMA hinting fault to set
the writable bit again.  This patch preserves the writable bit when
trapping NUMA hinting faults.  The impact is obvious from the number of
minor faults trapped during the basis balancing benchmark and the system
CPU usage;

  autonumabench
                                             4.0.0-rc4             4.0.0-rc4
                                              baseline              preserve
  Time System-NUMA01                  107.13 (  0.00%)      103.13 (  3.73%)
  Time System-NUMA01_THEADLOCAL       131.87 (  0.00%)       83.30 ( 36.83%)
  Time System-NUMA02                    8.95 (  0.00%)       10.72 (-19.78%)
  Time System-NUMA02_SMT                4.57 (  0.00%)        3.99 ( 12.69%)
  Time Elapsed-NUMA01                 515.78 (  0.00%)      517.26 ( -0.29%)
  Time Elapsed-NUMA01_THEADLOCAL      384.10 (  0.00%)      384.31 ( -0.05%)
  Time Elapsed-NUMA02                  48.86 (  0.00%)       48.78 (  0.16%)
  Time Elapsed-NUMA02_SMT              47.98 (  0.00%)       48.12 ( -0.29%)

               4.0.0-rc4   4.0.0-rc4
                baseline    preserve
  User          44383.95    43971.89
  System          252.61      201.24
  Elapsed         998.68     1000.94

  Minor Faults   2597249     1981230
  Major Faults       365         364

There is a similar drop in system CPU usage using Dave Chinner's xfsrepair
workload

                                      4.0.0-rc4             4.0.0-rc4
                                       baseline              preserve
  Amean    real-xfsrepair      454.14 (  0.00%)      442.36 (  2.60%)
  Amean    syst-xfsrepair      277.20 (  0.00%)      204.68 ( 26.16%)

The patch looks hacky but the alternatives looked worse.  The tidest was
to rewalk the page tables after a hinting fault but it was more complex
than this approach and the performance was worse.  It's not generally
safe to just mark the page writable during the fault if it's a write
fault as it may have been read-only for COW so that approach was
discarded.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Mel Gorman bea66fbd11 mm: numa: group related processes based on VMA flags instead of page table flags
These are three follow-on patches based on the xfsrepair workload Dave
Chinner reported was problematic in 4.0-rc1 due to changes in page table
management -- https://lkml.org/lkml/2015/3/1/226.

Much of the problem was reduced by commit 53da3bc2ba ("mm: fix up numa
read-only thread grouping logic") and commit ba68bc0115 ("mm: thp:
Return the correct value for change_huge_pmd").  It was known that the
performance in 3.19 was still better even if is far less safe.  This
series aims to restore the performance without compromising on safety.

For the test of this mail, I'm comparing 3.19 against 4.0-rc4 and the
three patches applied on top

  autonumabench
                                                3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                               vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Time System-NUMA01                  124.00 (  0.00%)      161.86 (-30.53%)      107.13 ( 13.60%)      103.13 ( 16.83%)      145.01 (-16.94%)
  Time System-NUMA01_THEADLOCAL       115.54 (  0.00%)      107.64 (  6.84%)      131.87 (-14.13%)       83.30 ( 27.90%)       92.35 ( 20.07%)
  Time System-NUMA02                    9.35 (  0.00%)       10.44 (-11.66%)        8.95 (  4.28%)       10.72 (-14.65%)        8.16 ( 12.73%)
  Time System-NUMA02_SMT                3.87 (  0.00%)        4.63 (-19.64%)        4.57 (-18.09%)        3.99 ( -3.10%)        3.36 ( 13.18%)
  Time Elapsed-NUMA01                 570.06 (  0.00%)      567.82 (  0.39%)      515.78 (  9.52%)      517.26 (  9.26%)      543.80 (  4.61%)
  Time Elapsed-NUMA01_THEADLOCAL      393.69 (  0.00%)      384.83 (  2.25%)      384.10 (  2.44%)      384.31 (  2.38%)      380.73 (  3.29%)
  Time Elapsed-NUMA02                  49.09 (  0.00%)       49.33 ( -0.49%)       48.86 (  0.47%)       48.78 (  0.63%)       50.94 ( -3.77%)
  Time Elapsed-NUMA02_SMT              47.51 (  0.00%)       47.15 (  0.76%)       47.98 ( -0.99%)       48.12 ( -1.28%)       49.56 ( -4.31%)

                3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
               vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  User        46334.60    46391.94    44383.95    43971.89    44372.12
  System        252.84      284.66      252.61      201.24      249.00
  Elapsed      1062.14     1050.96      998.68     1000.94     1026.78

Overall the system CPU usage is comparable and the test is naturally a
bit variable.  The slowing of the scanner hurts numa01 but on this
machine it is an adverse workload and patches that dramatically help it
often hurt absolutely everything else.

Due to patch 2, the fault activity is interesting

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                   2097811     2656646     2597249     1981230     1636841
  Major Faults                       362         450         365         364         365

Note the impact preserving the write bit across protection updates and
fault reduces faults.

  NUMA alloc hit                 1229008     1217015     1191660     1178322     1199681
  NUMA alloc miss                      0           0           0           0           0
  NUMA interleave hit                  0           0           0           0           0
  NUMA alloc local               1228514     1216317     1190871     1177448     1199021
  NUMA base PTE updates        245706197   240041607   238195516   244704842   115012800
  NUMA huge PMD updates           479530      468448      464868      477573      224487
  NUMA page range updates      491225557   479886983   476207932   489222218   229950144
  NUMA hint faults                659753      656503      641678      656926      294842
  NUMA hint local faults          381604      373963      360478      337585      186249
  NUMA hint local percent             57          56          56          51          63
  NUMA pages migrated            5412140     6374899     6266530     5277468     5755096
  AutoNUMA cost                    5121%       5083%       4994%       5097%       2388%

Here the impact of slowing the PTE scanner on migratrion failures is
obvious as "NUMA base PTE updates" and "NUMA huge PMD updates" are
massively reduced even though the headline performance is very similar.

As xfsrepair was the reported workload here is the impact of the series
on it.

  xfsrepair
                                         3.19.0             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4             4.0.0-rc4
                                        vanilla               vanilla          vmwrite-v5r8         preserve-v5r8         slowscan-v5r8
  Min      real-fsmark        1183.29 (  0.00%)     1165.73 (  1.48%)     1152.78 (  2.58%)     1153.64 (  2.51%)     1177.62 (  0.48%)
  Min      syst-fsmark        4107.85 (  0.00%)     4027.75 (  1.95%)     3986.74 (  2.95%)     3979.16 (  3.13%)     4048.76 (  1.44%)
  Min      real-xfsrepair      441.51 (  0.00%)      463.96 ( -5.08%)      449.50 ( -1.81%)      440.08 (  0.32%)      439.87 (  0.37%)
  Min      syst-xfsrepair      195.76 (  0.00%)      278.47 (-42.25%)      262.34 (-34.01%)      203.70 ( -4.06%)      143.64 ( 26.62%)
  Amean    real-fsmark        1188.30 (  0.00%)     1177.34 (  0.92%)     1157.97 (  2.55%)     1158.21 (  2.53%)     1182.22 (  0.51%)
  Amean    syst-fsmark        4111.37 (  0.00%)     4055.70 (  1.35%)     3987.19 (  3.02%)     3998.72 (  2.74%)     4061.69 (  1.21%)
  Amean    real-xfsrepair      450.88 (  0.00%)      468.32 ( -3.87%)      454.14 ( -0.72%)      442.36 (  1.89%)      440.59 (  2.28%)
  Amean    syst-xfsrepair      199.66 (  0.00%)      290.60 (-45.55%)      277.20 (-38.84%)      204.68 ( -2.51%)      150.55 ( 24.60%)
  Stddev   real-fsmark           4.12 (  0.00%)       10.82 (-162.29%)       4.14 ( -0.28%)        5.98 (-45.05%)        4.60 (-11.53%)
  Stddev   syst-fsmark           2.63 (  0.00%)       20.32 (-671.82%)       0.37 ( 85.89%)       16.47 (-525.59%)      15.05 (-471.79%)
  Stddev   real-xfsrepair        6.87 (  0.00%)        4.55 ( 33.75%)        3.46 ( 49.58%)        1.78 ( 74.12%)        0.52 ( 92.50%)
  Stddev   syst-xfsrepair        3.02 (  0.00%)       10.30 (-241.37%)      13.17 (-336.37%)       0.71 ( 76.63%)        5.00 (-65.61%)
  CoeffVar real-fsmark           0.35 (  0.00%)        0.92 (-164.73%)       0.36 ( -2.91%)        0.52 (-48.82%)        0.39 (-12.10%)
  CoeffVar syst-fsmark           0.06 (  0.00%)        0.50 (-682.41%)       0.01 ( 85.45%)        0.41 (-543.22%)       0.37 (-478.78%)
  CoeffVar real-xfsrepair        1.52 (  0.00%)        0.97 ( 36.21%)        0.76 ( 49.94%)        0.40 ( 73.62%)        0.12 ( 92.33%)
  CoeffVar syst-xfsrepair        1.51 (  0.00%)        3.54 (-134.54%)       4.75 (-214.31%)       0.34 ( 77.20%)        3.32 (-119.63%)
  Max      real-fsmark        1193.39 (  0.00%)     1191.77 (  0.14%)     1162.90 (  2.55%)     1166.66 (  2.24%)     1188.50 (  0.41%)
  Max      syst-fsmark        4114.18 (  0.00%)     4075.45 (  0.94%)     3987.65 (  3.08%)     4019.45 (  2.30%)     4082.80 (  0.76%)
  Max      real-xfsrepair      457.80 (  0.00%)      474.60 ( -3.67%)      457.82 ( -0.00%)      444.42 (  2.92%)      441.03 (  3.66%)
  Max      syst-xfsrepair      203.11 (  0.00%)      303.65 (-49.50%)      294.35 (-44.92%)      205.33 ( -1.09%)      155.28 ( 23.55%)

The really relevant lines as syst-xfsrepair which is the system CPU
usage when running xfsrepair.  Note that on my machine the overhead was
45% higher on 4.0-rc4 which may be part of what Dave is seeing.  Once we
preserve the write bit across faults, it's only 2.51% higher on average.
With the full series applied, system CPU usage is 24.6% lower on
average.

Again, the impact of preserving the write bit on minor faults is obvious
and the impact of slowing scanning after migration failures is obvious
on the PTE updates.  Note also that the number of pages migrated is much
reduced even though the headline performance is comparable.

                                  3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                                 vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Minor Faults                 153466827   254507978   249163829   153501373   105737890
  Major Faults                       610         702         690         649         724
  NUMA base PTE updates        217735049   210756527   217729596   216937111   144344993
  NUMA huge PMD updates           129294       85044      106921      127246       79887
  NUMA pages migrated           21938995    29705270    28594162    22687324    16258075

                        3.19.0   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4   4.0.0-rc4
                       vanilla     vanillavmwrite-v5r8preserve-v5r8slowscan-v5r8
  Mean sdb-avgqusz       13.47        2.54        2.55        2.47        2.49
  Mean sdb-avgrqsz      202.32      140.22      139.50      139.02      138.12
  Mean sdb-await         25.92        5.09        5.33        5.02        5.22
  Mean sdb-r_await        4.71        0.19        0.83        0.51        0.11
  Mean sdb-w_await      104.13        5.21        5.38        5.05        5.32
  Mean sdb-svctm          0.59        0.13        0.14        0.13        0.14
  Mean sdb-rrqm           0.16        0.00        0.00        0.00        0.00
  Mean sdb-wrqm           3.59     1799.43     1826.84     1812.21     1785.67
  Max  sdb-avgqusz      111.06       12.13       14.05       11.66       15.60
  Max  sdb-avgrqsz      255.60      190.34      190.01      187.33      191.78
  Max  sdb-await        168.24       39.28       49.22       44.64       65.62
  Max  sdb-r_await      660.00       52.00      280.00       76.00       12.00
  Max  sdb-w_await     7804.00       39.28       49.22       44.64       65.62
  Max  sdb-svctm          4.00        2.82        2.86        1.98        2.84
  Max  sdb-rrqm           8.30        0.00        0.00        0.00        0.00
  Max  sdb-wrqm          34.20     5372.80     5278.60     5386.60     5546.15

FWIW, I also checked SPECjbb in different configurations but it's
similar observations -- minor faults lower, PTE update activity lower
and performance is roughly comparable against 3.19.

This patch (of 3):

Threads that share writable data within pages are grouped together as
related tasks.  This decision is based on whether the PTE is marked
dirty which is subject to timing races between the PTE scanner update
and when the application writes the page.  If the page is file-backed,
then background flushes and sync also affect placement.  This is
unpredictable behaviour which is impossible to reason about so this
patch makes grouping decisions based on the VMA flags.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:31 -07:00
Laura Abbott cfa8694382 mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolate
Commit 3c605096d3 ("mm/page_alloc: restrict max order of merging on
isolated pageblock") changed the logic of unset_migratetype_isolate to
check the buddy allocator and explicitly call __free_pages to merge.

The page that is being freed in this path never had prep_new_page called
so set_page_refcounted is called explicitly but there is no call to
kernel_map_pages.  With the default kernel_map_pages this is mostly
harmless but if kernel_map_pages does any manipulation of the page
tables (unmapping or setting pages to read only) this may trigger a
fault:

    alloc_contig_range test_pages_isolated(ceb00, ced00) failed
    Unable to handle kernel paging request at virtual address ffffffc0cec00000
    pgd = ffffffc045fc4000
    [ffffffc0cec00000] *pgd=0000000000000000
    Internal error: Oops: 9600004f [#1] PREEMPT SMP
    Modules linked in: exfatfs
    CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1
    task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000
    PC is at memset+0xc8/0x1c0
    LR is at kernel_map_pages+0x1ec/0x244

Fix this by calling kernel_map_pages to ensure the page is set in the
page table properly

Fixes: 3c605096d3 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Mark Rutland 859b7a0e89 mm/slub: fix lockups on PREEMPT && !SMP kernels
Commit 9aabf810a6 ("mm/slub: optimize alloc/free fastpath by removing
preemption on/off") introduced an occasional hang for kernels built with
CONFIG_PREEMPT && !CONFIG_SMP.

The problem is the following loop the patch introduced to
slab_alloc_node and slab_free:

    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));

GCC 4.9 has been observed to hoist the load of c and c->tid above the
loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
constant and does not force a reload).  On arm64 the generated assembly
looks like:

         ldr     x4, [x0,#8]
  loop:
         ldr     x1, [x0,#8]
         cmp     x1, x4
         b.ne    loop

If the thread is preempted between the load of c->tid (into x1) and tid
(into x4), and an allocation or free occurs in another thread (bumping
the cpu_slab's tid), the thread will be stuck in the loop until
s->cpu_slab->tid wraps, which may be forever in the absence of
allocations/frees on the same CPU.

This patch changes the loop condition to access c->tid with READ_ONCE.
This ensures that the value is reloaded even when the compiler would
otherwise assume it could cache the value, and also ensures that the
load will not be torn.

Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Gu Zheng b0dc3a342a mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:

  BUG: unable to handle kernel paging request at 0000000000025f60
  IP: next_online_pgdat+0x1/0x50
  PGD 0
  Oops: 0000 [#1] SMP
  ACPI: Device does not support D3cold
  Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
  CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G           O 3.10.15-5885-euler0302 #1
  Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
  Workqueue: events vmstat_update
  task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
  RIP: 0010: next_online_pgdat+0x1/0x50
  RSP: 0018:ffffa800d32afce8  EFLAGS: 00010286
  RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
  RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
  RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
  R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
  R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
  FS:  0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    refresh_cpu_vm_stats+0xd0/0x140
    vmstat_update+0x11/0x50
    process_one_work+0x194/0x3d0
    worker_thread+0x12b/0x410
    kthread+0xc6/0xd0
    ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.

process A:				offline node XX:

vmstat_updat()
   refresh_cpu_vm_stats()
     for_each_populated_zone()
       find online node XX
     cond_resched()
					offline cpu and memory, then try_offline_node()
					node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
       zone = next_zone(zone)
         pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
           next_online_pgdat(pgdat)
             next_online_node(pgdat->node_id);  // NULL pointer access

So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Naoya Horiguchi f683739539 mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers
walk_page_test() is purely pagewalk's internal stuff, and its positive
return values are not intended to be passed to the callers of pagewalk.

However, in the current code if the last vma in the do-while loop in
walk_page_range() happens to return a positive value, it leaks outside
walk_page_range().  So the user visible effect is invalid/unexpected
return value (according to the reporter, mbind() causes it.)

This patch fixes it simply by reinitializing the return value after
checked.

Another exposed interface, walk_page_vma(), already returns 0 for such
cases so no problem.

Fixes: fafaa4264e ("pagewalk: improve vma handling")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
Reported-by: Kazutomo Yoshii <kazutomo.yoshii@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Leon Yu 3fe89b3e2a mm: fix anon_vma->degree underflow in anon_vma endless growing prevention
I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after
upgrading to 3.19 and had no luck with 4.0-rc1 neither.

So, after looking into new logic introduced by commit 7a3ef208e6 ("mm:
prevent endless growth of anon_vma hierarchy"), I found chances are that
unlink_anon_vmas() is called without incrementing dst->anon_vma->degree
in anon_vma_clone() due to allocation failure.  If dst->anon_vma is not
NULL in error path, its degree will be incorrectly decremented in
unlink_anon_vmas() and eventually underflow when exiting as a result of
another call to unlink_anon_vmas().  That's how "kernel BUG at
mm/rmap.c:399!" is triggered for me.

This patch fixes the underflow by dropping dst->anon_vma when allocation
fails.  It's safe to do so regardless of original value of dst->anon_vma
because dst->anon_vma doesn't have valid meaning if anon_vma_clone()
fails.  Besides, callers don't care dst->anon_vma in such case neither.

Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as
anon_vma_clone() now does the work.

[akpm@linux-foundation.org: tweak comment]
Fixes: 7a3ef208e6 ("mm: prevent endless growth of anon_vma hierarchy")
Signed-off-by: Leon Yu <chianglungyu@gmail.com>
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-25 16:20:30 -07:00
Linus Torvalds b8517e9830 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes that has been gathered over the last few
  weeks.  This contains:

   - A one-liner fix for NVMe, fixing a missing list_head init that
     could makes us oops on hitting recovery at load time.

   - Two small blk-mq fixes:
        - Fixup a bad goto jump on error handling.
        - Fix for oopsing if running out of reserved tags.

   - A memory leak fix for NBD.

   - Two small writeback fixes from Tejun, fixing a missing init to
     INITIAL_JIFFIES, and a possible underflow introduced recently.

   - A core merge fixup in sg gap detection, where rq->biotail was
     indexed with the count of rq->bio"

* 'for-linus' of git://git.kernel.dk/linux-block:
  writeback: fix possible underflow in write bandwidth calculation
  NVMe: Initialize device list head before starting
  Fix bug in blk_rq_merge_ok
  blkmq: Fix NULL pointer deref when all reserved tags in
  blk-mq: fix use of incorrect goto label in blk_mq_init_queue error path
  nbd: fix possible memory leak
  writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()
2015-03-25 15:40:21 -07:00
Tejun Heo c72efb658f writeback: fix possible underflow in write bandwidth calculation
From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001
From: Tejun Heo <tj@kernel.org>
Date: Mon, 23 Mar 2015 00:08:19 -0400

2f800fbd77 ("writeback: fix dirtied pages accounting on redirty")
introduced account_page_redirty() which reverts stat updates for a
redirtied page, making BDI_DIRTIED no longer monotonically increasing.

bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the
basis for bandwidth calculation.  While unlikely, since the above
patch, the newer value may be lower than the recorded past value and
underflow the bandwidth calculation leading to a wild result.

Fix it by subtracing min of the old and new values when calculating
delta.  AFAIK, there hasn't been any report of it happening but the
resulting erratic behavior would be non-critical and temporary, so
it's possible that the issue is happening without being reported.  The
risk of the fix is very low, so tagged for -stable.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Fixes: 2f800fbd77 ("writeback: fix dirtied pages accounting on redirty")
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-23 09:35:58 -06:00
Linus Torvalds f788baadbd Merge branch 'gadget' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull gadgetfs fixes from Al Viro:
 "Assorted fixes around AIO on gadgetfs: leaks, use-after-free, troubles
  caused by ->f_op flipping"

* 'gadget' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  gadgetfs: really get rid of switching ->f_op
  gadgetfs: get rid of flipping ->f_op in ep_config()
  gadget: switch ep_io_operations to ->read_iter/->write_iter
  gadgetfs: use-after-free in ->aio_read()
  gadget/function/f_fs.c: switch to ->{read,write}_iter()
  gadget/function/f_fs.c: use put iov_iter into io_data
  gadget/function/f_fs.c: close leaks
  move iov_iter.c from mm/ to lib/
  new helper: dup_iter()
2015-03-13 10:55:32 -07:00
Linus Torvalds c202baf017 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "13 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  memcg: disable hierarchy support if bound to the legacy cgroup hierarchy
  mm: reorder can_do_mlock to fix audit denial
  kasan, module: move MODULE_ALIGN macro into <linux/moduleloader.h>
  kasan, module, vmalloc: rework shadow allocation for modules
  fanotify: fix event filtering with FAN_ONDIR set
  mm/nommu.c: export symbol max_mapnr
  arch/c6x/include/asm/pgtable.h: define dummy pgprot_writecombine for !MMU
  nilfs2: fix deadlock of segment constructor during recovery
  mm: cma: fix CMA aligned offset calculation
  mm, hugetlb: close race when setting PageTail for gigantic pages
  mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled
  drivers/rtc/rtc-s3c.c: add .needs_src_clk to s3c6410 RTC data
  ocfs2: make append_dio an incompat feature
2015-03-12 18:46:19 -07:00
Vladimir Davydov 7feee590bb memcg: disable hierarchy support if bound to the legacy cgroup hierarchy
If the memory cgroup controller is initially mounted in the scope of the
default cgroup hierarchy and then remounted to a legacy hierarchy, it will
still have hierarchy support enabled, which is incorrect.  We should
disable hierarchy support if bound to the legacy cgroup hierarchy.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Jeff Vander Stoep a5a6579db3 mm: reorder can_do_mlock to fix audit denial
A userspace call to mmap(MAP_LOCKED) may result in the successful locking
of memory while also producing a confusing audit log denial.  can_do_mlock
checks capable and rlimit.  If either of these return positive
can_do_mlock returns true.  The capable check leads to an LSM hook used by
apparmour and selinux which produce the audit denial.  Reordering so
rlimit is checked first eliminates the denial on success, only recording a
denial when the lock is unsuccessful as a result of the denial.

Signed-off-by: Jeff Vander Stoep <jeffv@google.com>
Acked-by: Nick Kralevich <nnk@google.com>
Cc: Jeff Vander Stoep <jeffv@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Paul Cassella <cassella@cray.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Andrey Ryabinin a5af5aa8b6 kasan, module, vmalloc: rework shadow allocation for modules
Current approach in handling shadow memory for modules is broken.

Shadow memory could be freed only after memory shadow corresponds it is no
longer used.  vfree() called from interrupt context could use memory its
freeing to store 'struct llist_node' in it:

    void vfree(const void *addr)
    {
    ...
        if (unlikely(in_interrupt())) {
            struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
            if (llist_add((struct llist_node *)addr, &p->list))
                    schedule_work(&p->wq);

Later this list node used in free_work() which actually frees memory.
Currently module_memfree() called in interrupt context will free shadow
before freeing module's memory which could provoke kernel crash.

So shadow memory should be freed after module's memory.  However, such
deallocation order could race with kasan_module_alloc() in module_alloc().

Free shadow right before releasing vm area.  At this point vfree()'d
memory is not used anymore and yet not available for other allocations.
New VM_KASAN flag used to indicate that vm area has dynamically allocated
shadow memory so kasan frees shadow only if it was previously allocated.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
gchen gchen 5b8bf30721 mm/nommu.c: export symbol max_mapnr
Several modules may need max_mapnr, so export, the related error with
allmodconfig under c6x:

  MODPOST 3327 modules
  ERROR: "max_mapnr" [fs/pstore/ramoops.ko] undefined!
  ERROR: "max_mapnr" [drivers/media/v4l2-core/videobuf2-dma-contig.ko] undefined!

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:08 -07:00
Danesh Petigara 850fc430f4 mm: cma: fix CMA aligned offset calculation
The CMA aligned offset calculation is incorrect for non-zero order_per_bit
values.

For example, if cma->order_per_bit=1, cma->base_pfn= 0x2f800000 and
align_order=12, the function returns a value of 0x17c00 instead of 0x400.

This patch fixes the CMA aligned offset calculation.

The previous calculation was wrong and would return too-large values for
the offset, so that when cma_alloc looks for free pages in the bitmap with
the requested alignment > order_per_bit, it starts too far into the bitmap
and so CMA allocations will fail despite there actually being plenty of
free pages remaining.  It will also probably have the wrong alignment.
With this change, we will get the correct offset into the bitmap.

One affected user is powerpc KVM, which has kvm_cma->order_per_bit set to
KVM_CMA_CHUNK_ORDER - PAGE_SHIFT, or 18 - 12 = 6.

[gregory.0xf0@gmail.com: changelog additions]
Signed-off-by: Danesh Petigara <dpetigara@broadcom.com>
Reviewed-by: Gregory Fong <gregory.0xf0@gmail.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
David Rientjes 44fc80573c mm, hugetlb: close race when setting PageTail for gigantic pages
Now that gigantic pages are dynamically allocatable, care must be taken to
ensure that p->first_page is valid before setting PageTail.

If this isn't done, then it is possible to race and have compound_head()
return NULL.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
Michal Hocko e009d5dc0a mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled
Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
after OOM killer is disabled if the allocation is performed by a kernel
thread.  This behavior was introduced from the very beginning by
7f33d49a2e ("mm, PM/Freezer: Disable OOM killer when tasks are frozen").
 This means that the basic contract for the allocation request is broken
and the context requesting such an allocation might blow up unexpectedly.

There are basically two ways forward.

1) move oom_killer_disable after kernel threads are frozen.  This has a
   risk that the OOM victim wouldn't be able to finish because it would
   depend on an already frozen kernel thread.  This would be really tricky
   to debug.

2) do not fail GFP_NOFAIL allocation no matter what and risk a
   potential Freezable kernel threads will loop and fail the suspend.
   Incidental allocations after kernel threads are frozen will at least
   dump a warning - if we are lucky and the serial console is still active
   of course...

This patch implements the later option because it is safer.  We would see
warning rather than allocation failures for the kernel threads which would
blow up otherwise and have a higher chances to identify __GFP_NOFAIL users
from deeper pm code.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@gooogle.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 18:46:07 -07:00
Mel Gorman ba68bc0115 mm: thp: Return the correct value for change_huge_pmd
The wrong value is being returned by change_huge_pmd since commit
10c1045f28 ("mm: numa: avoid unnecessary TLB flushes when setting
NUMA hinting entries") which allows a fallthrough that tries to adjust
non-existent PTEs. This patch corrects it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 14:07:41 -07:00
Linus Torvalds 53da3bc2ba mm: fix up numa read-only thread grouping logic
Dave Chinner reported that commit 4d94246699 ("mm: convert
p[te|md]_mknonnuma and remaining page table manipulations") slowed down
his xfsrepair test enormously.  In particular, it was using more system
time due to extra TLB flushing.

The ultimate reason turns out to be how the change to use the regular
page table accessor functions broke the NUMA grouping logic.  The old
special mknuma/mknonnuma code accessed the page table present bit and
the magic NUMA bit directly, while the new code just changes the page
protections using PROT_NONE and the regular vma protections.

That sounds equivalent, and from a fault standpoint it really is, but a
subtle side effect is that the *other* protection bits of the page table
entries also change.  And the code to decide how to group the NUMA
entries together used the writable bit to decide whether a particular
page was likely to be shared read-only or not.

And with the change to make the NUMA handling use the regular permission
setting functions, that writable bit was basically always cleared for
private mappings due to COW.  So even if the page actually ends up being
written to in the end, the NUMA balancing would act as if it was always
shared RO.

This code is a heuristic anyway, so the fix - at least for now - is to
instead check whether the page is dirty rather than writable.  The bit
doesn't change with protection changes.

NOTE! This also adds a FIXME comment to revisit this issue,

Not only should we probably re-visit the whole "is this a shared
read-only page" heuristic (we might want to take the vma permissions
into account and base this more on those than the per-page ones, and
also look at whether the particular access that triggers it is a write
or not), but the whole COW issue shows that we should think about the
NUMA fault handling some more.

For example, maybe we should do the early-COW thing that a regular fault
does.  Or maybe we should accept that while using the same bits as
PROTNONE was a good thing (and got rid of the specual NUMA bit), we
might still want to just preseve the other protection bits across NUMA
faulting.

Those are bigger questions, left for later.  This just fixes up the
heuristic so that it at least approximates working again.  More analysis
and work needed.

Reported-by: Dave Chinner <david@fromorbit.com>
Tested-by: Mel Gorman <mgorman@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-12 08:45:46 -07:00
Tejun Heo 7d70e15480 writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()
global_update_bandwidth() uses static variable update_time as the
timestamp for the last update but forgets to initialize it to
INITIALIZE_JIFFIES.

This means that global_dirty_limit will be 5 mins into the future on
32bit and some large amount jiffies into the past on 64bit.  This
isn't critical as the only effect is that global_dirty_limit won't be
updated for the first 5 mins after booting on 32bit machines,
especially given the auxiliary nature of global_dirty_limit's role -
protecting against global dirty threshold's sudden dips; however, it
does lead to unintended suboptimal behavior.  Fix it.

Fixes: c42843f2f0 ("writeback: introduce smoothed global dirty limit")
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-03-04 08:38:59 -07:00
Linus Torvalds a015d33c98 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "Two smaller fixes for this cycle:

   - A fixup from Keith so that NVMe compiles without BLK_INTEGRITY,
     basically just moving the code around appropriately.

   - A fixup for shm, fixing an oops in shmem_mapping() for mapping with
     no inode.  From Sasha"

[ The shmem fix doesn't look block-layer-related, but fixes a bug that
  happened due to the backing_dev_info removal..  - Linus ]

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm: shmem: check for mapping owner before dereferencing
  NVMe: Fix for BLK_DEV_INTEGRITY not set
2015-02-28 10:21:57 -08:00
Johannes Weiner cc87317726 mm: page_alloc: revert inadvertent !__GFP_FS retry behavior change
Historically, !__GFP_FS allocations were not allowed to invoke the OOM
killer once reclaim had failed, but nevertheless kept looping in the
allocator.

Commit 9879de7373 ("mm: page_alloc: embed OOM killing naturally into
allocation slowpath"), which should have been a simple cleanup patch,
accidentally changed the behavior to aborting the allocation at that
point.  This creates problems with filesystem callers (?) that currently
rely on the allocator waiting for other tasks to intervene.

Revert the behavior as it shouldn't have been changed as part of a
cleanup patch.

Fixes: 9879de7373 ("mm: page_alloc: embed OOM killing naturally into allocation slowpath")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>	[3.19.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Johannes Weiner d2973697b3 mm: memcontrol: use "max" instead of "infinity" in control knobs
The memcg control knobs indicate the highest possible value using the
symbolic name "infinity", which is long and awkward to type.

Switch to the string "max", which is just as descriptive but shorter and
sweeter.

This changes a user interface, so do it before the release and before
the development flag is dropped from the default hierarchy.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Michal Hocko 4e54dede38 memcg: fix low limit calculation
A memcg is considered low limited even when the current usage is equal to
the low limit.  This leads to interesting side effects e.g.
groups/hierarchies with no memory accounted are considered protected and
so the reclaim will emit MEMCG_LOW event when encountering them.

Another and much bigger issue was reported by Joonsoo Kim.  He has hit a
NULL ptr dereference with the legacy cgroup API which even doesn't have
low limit exposed.  The limit is 0 by default but the initial check fails
for memcg with 0 consumption and parent_mem_cgroup() would return NULL if
use_hierarchy is 0 and so page_counter_read would try to dereference NULL.

I suppose that the current implementation is just an overlook because the
documentation in Documentation/cgroups/unified-hierarchy.txt says:

  "The memory.low boundary on the other hand is a top-down allocated
  reserve.  A cgroup enjoys reclaim protection when it and all its
  ancestors are below their low boundaries"

Fix the usage and the low limit comparision in mem_cgroup_low accordingly.

Fixes: 241994ed86 (mm: memcontrol: default hierarchy interface for memory)
Reported-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:51 -08:00
Joonsoo Kim da616534ed mm/nommu: fix memory leak
Maxime reported the following memory leak regression due to commit
dbc8358c72 ("mm/nommu: use alloc_pages_exact() rather than its own
implementation").

On v3.19, I am facing a memory leak.  Each time I run a command one page
is lost.  Here an example with busybox's free command:

  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1972       5956          0          0        492
  -/+ buffers/cache:       1480       6448
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1976       5952          0          0        492
  -/+ buffers/cache:       1484       6444
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1980       5948          0          0        492
  -/+ buffers/cache:       1488       6440
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1984       5944          0          0        492
  -/+ buffers/cache:       1492       6436
  / # free
               total       used       free     shared    buffers     cached
  Mem:          7928       1988       5940          0          0        492
  -/+ buffers/cache:       1496       6432

At some point, the system fails to sastisfy 256KB allocations:

  free: page allocation failure: order:6, mode:0xd0
  CPU: 0 PID: 67 Comm: free Not tainted 3.19.0-05389-gacf2cf1-dirty #64
  Hardware name: STM32 (Device Tree Support)
    show_stack+0xb/0xc
    warn_alloc_failed+0x97/0xbc
    __alloc_pages_nodemask+0x295/0x35c
    __get_free_pages+0xb/0x24
    alloc_pages_exact+0x19/0x24
    do_mmap_pgoff+0x423/0x658
    vm_mmap_pgoff+0x3f/0x4e
    load_flat_file+0x20d/0x4f8
    load_flat_binary+0x3f/0x26c
    search_binary_handler+0x51/0xe4
    do_execveat_common+0x271/0x35c
    do_execve+0x19/0x1c
    ret_fast_syscall+0x1/0x4a
  Mem-info:
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  2048 pages of RAM
  1538 free pages
  66 reserved pages
  109 slab pages
  -46 pages shared
  0 pages swap cached
  nommu: Allocation of length 221184 from process 67 (free) failed
  Normal per-cpu:
  CPU    0: hi:    0, btch:   1 usd:   0
  active_anon:0 inactive_anon:0 isolated_anon:0
   active_file:0 inactive_file:0 isolated_file:0
   unevictable:123 dirty:0 writeback:0 unstable:0
   free:1515 slab_reclaimable:17 slab_unreclaimable:139
   mapped:0 shmem:0 pagetables:0 bounce:0
   free_cma:0
  Normal free:6060kB min:352kB low:440kB high:528kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:492kB isolated(anon):0ks
  lowmem_reserve[]: 0 0
  Normal: 23*4kB (U) 22*8kB (U) 24*16kB (U) 23*32kB (U) 23*64kB (U) 23*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6060kB
  123 total pagecache pages
  Unable to allocate RAM for process text/data, errno 12 SEGV

This problem happens because we allocate ordered page through
__get_free_pages() in do_mmap_private() in some cases and we try to free
individual pages rather than ordered page in free_page_series().  In
this case, freeing pages whose refcount is not 0 won't be freed to the
page allocator so memory leak happens.

To fix the problem, this patch changes __get_free_pages() to
alloc_pages_exact() since alloc_pages_exact() returns
physically-contiguous pages but each pages are refcounted.

Fixes: dbc8358c72 ("mm/nommu: use alloc_pages_exact() rather than its own implementation").
Reported-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Tested-by: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>	[3.19]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-28 09:57:50 -08:00
Sasha Levin f0774d884b mm: shmem: check for mapping owner before dereferencing
mapping->host can be NULL and shouldn't be dereferenced before being checked.

[ 1295.741844] GPF could be caused by NULL-ptr deref or user memory accessgeneral protection fault: 0000 [#1] SMP KASAN
[ 1295.746387] Dumping ftrace buffer:
[ 1295.748217]    (ftrace buffer empty)
[ 1295.749527] Modules linked in:
[ 1295.750268] CPU: 62 PID: 23410 Comm: trinity-c70 Not tainted 3.19.0-next-20150219-sasha-00045-g9130270f #1939
[ 1295.750268] task: ffff8803a49db000 ti: ffff8803a4dc8000 task.ti: ffff8803a4dc8000
[ 1295.750268] RIP: shmem_mapping (mm/shmem.c:1458)
[ 1295.750268] RSP: 0000:ffff8803a4dcfbf8  EFLAGS: 00010206
[ 1295.750268] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 00000000000f2804
[ 1295.750268] RDX: 0000000000000005 RSI: 0400000000000794 RDI: 0000000000000028
[ 1295.750268] RBP: ffff8803a4dcfc08 R08: 0000000000000000 R09: 00000000031de000
[ 1295.750268] R10: dffffc0000000000 R11: 00000000031c1000 R12: 0400000000000794
[ 1295.750268] R13: 00000000031c2000 R14: 00000000031de000 R15: ffff880e3bdc1000
[ 1295.750268] FS:  00007f8703c7e700(0000) GS:ffff881164800000(0000) knlGS:0000000000000000
[ 1295.750268] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1295.750268] CR2: 0000000004e58000 CR3: 00000003a9f3c000 CR4: 00000000000007a0
[ 1295.750268] DR0: ffffffff81000000 DR1: 0000009494949494 DR2: 0000000000000000
[ 1295.750268] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[ 1295.750268] Stack:
[ 1295.750268]  ffff8803a4dcfec8 ffffffffbb1dc770 ffff8803a4dcfc38 ffffffffad6f230b
[ 1295.750268]  ffffffffad6f2b0d 0000014100000000 ffff88001e17c08b ffff880d9453fe08
[ 1295.750268]  ffff8803a4dcfd18 ffffffffad6f2ce2 ffff8803a49dbcd8 ffff8803a49dbce0
[ 1295.750268] Call Trace:
[ 1295.750268] mincore_page (mm/mincore.c:61)
[ 1295.750268] ? mincore_pte_range (include/linux/spinlock.h:312 mm/mincore.c:131)
[ 1295.750268] mincore_pte_range (mm/mincore.c:151)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] __walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:90 mm/pagewalk.c:116 mm/pagewalk.c:204)
[ 1295.750268] walk_page_range (mm/pagewalk.c:275)
[ 1295.750268] SyS_mincore (mm/mincore.c:191 mm/mincore.c:253 mm/mincore.c:220)
[ 1295.750268] ? mincore_pte_range (mm/mincore.c:220)
[ 1295.750268] ? mincore_unmapped_range (mm/mincore.c:113)
[ 1295.750268] ? __mincore_unmapped_range (mm/mincore.c:105)
[ 1295.750268] ? ptlock_free (mm/mincore.c:24)
[ 1295.750268] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1610)
[ 1295.750268] ia32_do_call (arch/x86/ia32/ia32entry.S:446)
[ 1295.750268] Code: e5 48 c1 ea 03 53 48 89 fb 48 83 ec 08 80 3c 02 00 75 4f 48 b8 00 00 00 00 00 fc ff df 48 8b 1b 48 8d 7b 28 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 3f 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 28 48

All code
========
   0:	e5 48                	in     $0x48,%eax
   2:	c1 ea 03             	shr    $0x3,%edx
   5:	53                   	push   %rbx
   6:	48 89 fb             	mov    %rdi,%rbx
   9:	48 83 ec 08          	sub    $0x8,%rsp
   d:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
  11:	75 4f                	jne    0x62
  13:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  1a:	fc ff df
  1d:	48 8b 1b             	mov    (%rbx),%rbx
  20:	48 8d 7b 28          	lea    0x28(%rbx),%rdi
  24:	48 89 fa             	mov    %rdi,%rdx
  27:	48 c1 ea 03          	shr    $0x3,%rdx
  2b:*	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)		<-- trapping instruction
  2f:	75 3f                	jne    0x70
  31:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  38:	fc ff df
  3b:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  3f:	48                   	rex.W
	...

Code starting with the faulting instruction
===========================================
   0:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1)
   4:	75 3f                	jne    0x45
   6:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
   d:	fc ff df
  10:	48 8b 5b 28          	mov    0x28(%rbx),%rbx
  14:	48                   	rex.W
	...
[ 1295.750268] RIP shmem_mapping (mm/shmem.c:1458)
[ 1295.750268]  RSP <ffff8803a4dcfbf8>

Fixes: 97b713ba3e ("fs: kill BDI_CAP_SWAP_BACKED")
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-02-23 10:00:11 -08:00
Linus Torvalds be5e6616dd Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 "Assorted stuff from this cycle.  The big ones here are multilayer
  overlayfs from Miklos and beginning of sorting ->d_inode accesses out
  from David"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (51 commits)
  autofs4 copy_dev_ioctl(): keep the value of ->size we'd used for allocation
  procfs: fix race between symlink removals and traversals
  debugfs: leave freeing a symlink body until inode eviction
  Documentation/filesystems/Locking: ->get_sb() is long gone
  trylock_super(): replacement for grab_super_passive()
  fanotify: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  Cachefiles: Fix up scripted S_ISDIR/S_ISREG/S_ISLNK conversions
  VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
  SELinux: Use d_is_positive() rather than testing dentry->d_inode
  Smack: Use d_is_positive() rather than testing dentry->d_inode
  TOMOYO: Use d_is_dir() rather than d_inode and S_ISDIR()
  Apparmor: Use d_is_positive/negative() rather than testing dentry->d_inode
  Apparmor: mediated_filesystem() should use dentry->d_sb not inode->i_sb
  VFS: Split DCACHE_FILE_TYPE into regular and special types
  VFS: Add a fallthrough flag for marking virtual dentries
  VFS: Add a whiteout dentry type
  VFS: Introduce inode-getting helpers for layered/unioned fs environments
  Infiniband: Fix potential NULL d_inode dereference
  posix_acl: fix reference leaks in posix_acl_create
  autofs4: Wrong format for printing dentry
  ...
2015-02-22 17:42:14 -08:00
David Howells e36cb0b89c VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry)
Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:38:41 -05:00
Linus Torvalds b11a278397 Merge branch 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kconfig updates from Michal Marek:
 "Yann E Morin was supposed to take over kconfig maintainership, but
  this hasn't happened.  So I'm sending a few kconfig patches that I
  collected:

   - Fix for missing va_end in kconfig
   - merge_config.sh displays used if given too few arguments
   - s/boolean/bool/ in Kconfig files for consistency, with the plan to
     only support bool in the future"

* 'kconfig' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  kconfig: use va_end to match corresponding va_start
  merge_config.sh: Display usage if given too few arguments
  kconfig: use bool instead of boolean for type definition attributes
2015-02-19 10:36:45 -08:00
Al Viro d879cb8341 move iov_iter.c from mm/ to lib/
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:22:17 -05:00
Al Viro 4b8164b91d new helper: dup_iter()
Copy iter and kmemdup the underlying array for the copy.  Returns
a pointer to result of kmemdup() to be kfree()'d later.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17 22:21:11 -05:00
Linus Torvalds 038911597e Merge branch 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull lazytime mount option support from Al Viro:
 "Lazytime stuff from tytso"

* 'lazytime' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ext4: add optimization for the lazytime mount option
  vfs: add find_inode_nowait() function
  vfs: add support for a lazytime mount option
2015-02-17 16:12:34 -08:00
Linus Torvalds 66dc830d14 Merge branch 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "More iov_iter work - missing counterpart of iov_iter_init() for
  bvec-backed ones and vfs_read_iter()/vfs_write_iter() - wrappers for
  sync calls of ->read_iter()/->write_iter()"

* 'iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: add vfs_iter_{read,write} helpers
  new helper: iov_iter_bvec()
2015-02-17 15:48:33 -08:00
Matthew Wilcox e748dcd095 vfs: remove get_xip_mem
All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
 Also remove mm/filemap_xip.c as it is now empty.  Also remove
documentation of the long-gone get_xip_page().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox 4c0ccfef2e dax,ext2: replace xip_truncate_page with dax_truncate_page
It takes a get_block parameter just like nobh_truncate_page() and
block_truncate_page()

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox f7ca90b160 dax,ext2: replace the XIP page fault handler with the DAX page fault handler
Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

This requires that all architectures implement copy_user_page().  At the
time of writing, mips and arm do not.  Patches exist and are in progress.

[akpm@linux-foundation.org: remap_file_pages went away]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox d475c6346a dax,ext2: replace XIP read and write with DAX I/O
Use the generic AIO infrastructure instead of custom read and write
methods.  In addition to giving us support for AIO, this adds the missing
locking between read() and truncate().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox fbbbad4bc2 vfs,ext2: introduce IS_DAX(inode)
Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().  Prevent I/Os to files
tagged with the DAX flag from falling back to buffered I/O.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox 2e4cdab058 mm: allow page fault handlers to perform the COW
Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page.  It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The handler cannot insert the page itself if there is already a read-only
mapping at that address, so allow the handler to return VM_FAULT_LOCKED
and set the fault_page to be NULL.  This indicates to the MM code that the
i_mmap_lock is held instead of the page lock.

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:03 -08:00
Matthew Wilcox 283307c760 mm: fix XIP fault vs truncate race
DAX is a replacement for the variation of XIP currently supported by the
ext2 filesystem.  We have three different things in the tree called 'XIP',
and the new focus is on access to data rather than executables, so a name
change was in order.  DAX stands for Direct Access.  The X is for
eXciting.

The new focus on data access has resulted in more careful attention to
races that exist in the current XIP code, but are not hit by the use-case
that it was designed for.  XIP's architecture worked fine for ext2, but
DAX is architected to work with modern filsystems such as ext4 and XFS.
DAX is not intended for use with btrfs; the value that btrfs adds relies
on manipulating data and writing data to different locations, while DAX's
value is for write-in-place and keeping the kernel from touching the data.

DAX was developed in order to support NV-DIMMs, but it's become clear that
its usefuless extends beyond NV-DIMMs and there are several potential
customers including the tracing machinery.  Other people want to place the
kernel log in an area of memory, as long as they have a BIOS that does not
clear DRAM on reboot.

Patch 1 is a bug fix, probably worth including in 3.18.

Patches 2 & 3 are infrastructure for DAX.

Patches 4-8 replace the XIP code with its DAX equivalents, transforming
ext2 to use the DAX code as we go.  Note that patch 10 is the
Documentation patch.

Patches 9-15 clean up after the XIP code, removing the infrastructure
that is no longer needed and renaming various XIP things to DAX.
Most of these patches were added after Jan found things he didn't
like in an earlier version of the ext4 patch ... that had been copied
from ext2.  So ext2 i being transformed to do things the same way that
ext4 will later.  The ability to mount ext2 filesystems with the 'xip'
option is retained, although the 'dax' option is now preferred.

Patch 16 adds some DAX infrastructure to support ext4.

Patch 17 adds DAX support to ext4.  It is broadly similar to ext2's DAX
support, but it is more efficient than ext4's due to its support for
unwritten extents.

Patch 18 is another cleanup patch renaming XIP to DAX.

My thanks to Mathieu Desnoyers for his reviews of the v11 patchset.  Most
of the changes below were based on his feedback.

This patch (of 18):

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate.  We don't have a page to lock in
the XIP case, so use i_mmap_lock_read() instead.  It is locked in the
truncate path in unmap_mapping_range() after updating i_size.  So while we
hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the thread
has a mapping to a page which will be removed from the file, but this is
harmless as the page will not be allocated to a different purpose before
the thread's access to it is revoked.

[akpm@linux-foundation.org: switch to i_mmap_lock_read(), add comment in unmap_single_vma()]
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:02 -08:00
Linus Torvalds c833e17e27 Tighten rules for ACCESS_ONCE
This series tightens the rules for ACCESS_ONCE to only work
 on scalar types. It also contains the necessary fixups as
 indicated by build bots of linux-next.
 Now everything is in place to prevent new non-scalar users
 of ACCESS_ONCE and we can continue to convert code to
 READ_ONCE/WRITE_ONCE.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJU2H5MAAoJEBF7vIC1phx8Jm4QALPqKOMDSUBCrqJFWJeujtv2
 ILxJKsnjrAlt3dxnlVI3q6e5wi896hSce75PcvZ/vs/K3GdgMxOjrakBJGTJ2Qjg
 5njW9aGJDDr/SYFX33MLWfqy222TLtpxgSz379UgXjEzB0ymMWbJJ3FnGjVqQJdp
 RXDutpncRySc/rGHh9UPREIRR5GvimONsWE2zxgXjUzB8vIr2fCGvHTXfIb6RKbQ
 yaFoihzn0m+eisc5Gy4tQ1qhhnaYyWEGrINjHTjMFTQOWTlH80BZAyQeLdbyj2K5
 qloBPS/VhBTr/5TxV5onM+nVhu0LiblVNrdMHVeb7jyST4LeFOCaWK98lB3axSB5
 v/2D1YKNb3g1U1x3In/oNGQvs36zGiO1uEdMF1l8ZFXgCvHmATSFSTWBtqUhb5Ew
 JA3YyqMTG6dpRTMSnmu3/frr4wDqnxlB/ktQC1pf3tDp87mr1ZYEy/dQld+tltjh
 9Z5GSdrw0nf91wNI3DJf+26ZDdz5B+EpDnPnOKG8anI1lc/mQneI21/K/xUteFXw
 UZ1XGPLV2vbv9/a13u44SdjenHvQs1egsGeebMxVPoj6WmDLVmcIqinyS6NawYzn
 IlDGy/b3bSnXWMBP0ZVBX94KWLxqDDc4a/ayxsmxsP1tPZ+jDXjVDa7E3zskcHxG
 Uj5ULCPyU087t8Sl76mv
 =Dj70
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux

Pull ACCESS_ONCE() rule tightening from Christian Borntraeger:
 "Tighten rules for ACCESS_ONCE

  This series tightens the rules for ACCESS_ONCE to only work on scalar
  types.  It also contains the necessary fixups as indicated by build
  bots of linux-next.  Now everything is in place to prevent new
  non-scalar users of ACCESS_ONCE and we can continue to convert code to
  READ_ONCE/WRITE_ONCE"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux:
  kernel: Fix sparse warning for ACCESS_ONCE
  next: sh: Fix compile error
  kernel: tighten rules for ACCESS ONCE
  mm/gup: Replace ACCESS_ONCE with READ_ONCE
  x86/spinlock: Leftover conversion ACCESS_ONCE->READ_ONCE
  x86/xen/p2m: Replace ACCESS_ONCE with READ_ONCE
  ppc/hugetlbfs: Replace ACCESS_ONCE with READ_ONCE
  ppc/kvm: Replace ACCESS_ONCE with READ_ONCE
2015-02-14 10:54:28 -08:00
Andrey Ryabinin bebf56a1b1 kasan: enable instrumentation of global variables
This feature let us to detect accesses out of bounds of global variables.
This will work as for globals in kernel image, so for globals in modules.
Currently this won't work for symbols in user-specified sections (e.g.
__init, __read_mostly, ...)

The idea of this is simple.  Compiler increases each global variable by
redzone size and add constructors invoking __asan_register_globals()
function.  Information about global variable (address, size, size with
redzone ...) passed to __asan_register_globals() so we could poison
variable's redzone.

This patch also forces module_alloc() to return 8*PAGE_SIZE aligned
address making shadow memory handling (
kasan_module_alloc()/kasan_module_free() ) more simple.  Such alignment
guarantees that each shadow page backing modules address space correspond
to only one module_alloc() allocation.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:42 -08:00
Andrey Ryabinin cb9e3c292d mm: vmalloc: pass additional vm_flags to __vmalloc_node_range()
For instrumenting global variables KASan will shadow memory backing memory
for modules.  So on module loading we will need to allocate memory for
shadow and map it at address in shadow that corresponds to the address
allocated in module_alloc().

__vmalloc_node_range() could be used for this purpose, except it puts a
guard hole after allocated area.  Guard hole in shadow memory should be a
problem because at some future point we might need to have a shadow memory
at address occupied by guard hole.  So we could fail to allocate shadow
for module_alloc().

Now we have VM_NO_GUARD flag disabling guard page, so we need to pass into
__vmalloc_node_range().  Add new parameter 'vm_flags' to
__vmalloc_node_range() function.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:42 -08:00
Andrey Ryabinin 71394fe501 mm: vmalloc: add flag preventing guard hole allocation
For instrumenting global variables KASan will shadow memory backing memory
for modules.  So on module loading we will need to allocate memory for
shadow and map it at address in shadow that corresponds to the address
allocated in module_alloc().

__vmalloc_node_range() could be used for this purpose, except it puts a
guard hole after allocated area.  Guard hole in shadow memory should be a
problem because at some future point we might need to have a shadow memory
at address occupied by guard hole.  So we could fail to allocate shadow
for module_alloc().

Add a new vm_struct flag 'VM_NO_GUARD' indicating that vm area doesn't
have a guard hole.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:42 -08:00
Andrey Ryabinin c420f167db kasan: enable stack instrumentation
Stack instrumentation allows to detect out of bounds memory accesses for
variables allocated on stack.  Compiler adds redzones around every
variable on stack and poisons redzones in function's prologue.

Such approach significantly increases stack usage, so all in-kernel stacks
size were doubled.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:41 -08:00
Andrey Ryabinin 393f203f5f x86_64: kasan: add interceptors for memset/memmove/memcpy functions
Recently instrumentation of builtin functions calls was removed from GCC
5.0.  To check the memory accessed by such functions, userspace asan
always uses interceptors for them.

So now we should do this as well.  This patch declares
memset/memmove/memcpy as weak symbols.  In mm/kasan/kasan.c we have our
own implementation of those functions which checks memory before accessing
it.

Default memset/memmove/memcpy now now always have aliases with '__'
prefix.  For files that built without kasan instrumentation (e.g.
mm/slub.c) original mem* replaced (via #define) with prefixed variants,
cause we don't want to check memory accesses there.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
Cc: Yuri Gribov <tetra2005@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-13 21:21:41 -08:00