Commit graph

616509 commits

Author SHA1 Message Date
Mike Travis e363d24c2b x86/platform/UV: Fix bug with iounmap() of the UV4 EFI System Table causing a crash
Save the uv_systab::size field before doing the iounmap()
of the struct pointer, to avoid a NULL dereference crash.

Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.250424783@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:55:38 +02:00
Mike Travis 054f621fd5 x86/platform/UV: Fix problem with UV4 Socket IDs not being contiguous
The UV4 Socket IDs are not guaranteed to equate to Node values which
can cause the GAM (Global Addressable Memory) table lookups to fail.
Fix this by using an independent index into the GAM table instead of
the Socket ID to reference the base address.

Tested-by: Frank Ramsay <framsay@sgi.com>
Tested-by: John Estabrook <estabrook@sgi.com>
Signed-off-by: Mike Travis <travis@sgi.com>
Reviewed-by: Dimitri Sivanich <sivanich@sgi.com>
Reviewed-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russ Anderson <rja@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160801184050.048755337@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:55:38 +02:00
Borislav Petkov 3e03530587 x86/entry: Clarify the RF saving/restoring situation with SYSCALL/SYSRET
Clarify why exactly RF cannot be restored properly by SYSRET to avoid
confusion.

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160803171429.GA2590@nazgul.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:53:43 +02:00
Andy Shevchenko e95d0dfb22 pinctrl: intel: merrifield: Add missed header
On x86 builds the absense of <linux/io.h> makes static analyzer and compiler
unhappy which fails to build the driver.

CHECK   drivers/pinctrl/intel/pinctrl-merrifield.c
drivers/pinctrl/intel/pinctrl-merrifield.c:518:17:
  error: undefined identifier 'readl'
drivers/pinctrl/intel/pinctrl-merrifield.c:570:17:
  error: undefined identifier 'readl'
drivers/pinctrl/intel/pinctrl-merrifield.c:575:9:
  error: undefined identifier 'writel'
drivers/pinctrl/intel/pinctrl-merrifield.c:645:17:
  error: undefined identifier 'readl'
  CC      drivers/pinctrl/intel/pinctrl-merrifield.o
drivers/pinctrl/intel/pinctrl-merrifield.c: In function ‘mrfld_pin_dbg_show’:
drivers/pinctrl/intel/pinctrl-merrifield.c:518:10:
  error: implicit declaration of function ‘readl’
  [-Werror=implicit-function-declaration]
  value = readl(bufcfg);
            ^
drivers/pinctrl/intel/pinctrl-merrifield.c: In function ‘mrfld_update_bufcfg’:
drivers/pinctrl/intel/pinctrl-merrifield.c:575:2:
  error: implicit declaration of function ‘writel’
  [-Werror=implicit-function-declaration]
  writel(value, bufcfg);
    ^
cc1: some warnings being treated as errors

Add header to the top of the module.

Fixes: 4e80c8f505 ("pinctrl: intel: Add Intel Merrifield pin controller support")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-08-10 15:46:28 +02:00
Agrawal, Nitesh-kumar 8cf4345575 pinctrl/amd: Remove the default de-bounce time
In the function amd_gpio_irq_enable() and
amd_gpio_direction_input(), remove the code which is setting
the default de-bounce time to 2.75ms.

The driver code shall use the same settings as specified in
BIOS. Any default assignment impacts TouchPad behaviour when
the LevelTrig is set to EDGE FALLING.

Cc: stable@vger.kernel.org
Reviewed-by:  Ken Xue <Ken.Xue@amd.com>
Signed-off-by: Nitesh Kumar Agrawal <Nitesh-kumar.Agrawal@amd.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-08-10 15:45:54 +02:00
Wei Yongjun b120a3c286 pinctrl: pistachio: Drop pinctrl_unregister for devm_ registered device
It's not necessary to unregister pin controller device registered
with devm_pinctrl_register() and using pinctrl_unregister() leads
to a double free.

This is detected by Coccinelle semantic patch.

Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-08-10 15:45:54 +02:00
Wei Yongjun 5b236d0fde pinctrl: meson: Drop pinctrl_unregister for devm_ registered device
It's not necessary to unregister pin controller device registered
with devm_pinctrl_register() and using pinctrl_unregister() leads
to a double free.

This is detected by Coccinelle semantic patch.

Fixes: e649f7ec8c ("pinctrl: meson: Use devm_pinctrl_register() for pinctrl registration")
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Acked-by: Kevin Hilman <khilman@baylibre.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
2016-08-10 15:45:54 +02:00
Sebastian Andrzej Siewior 5cf0791da5 x86/mm: Disable preemption during CR3 read+write
There's a subtle preemption race on UP kernels:

Usually current->mm (and therefore mm->pgd) stays the same during the
lifetime of a task so it does not matter if a task gets preempted during
the read and write of the CR3.

But then, there is this scenario on x86-UP:

TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by:

 -> mmput()
 -> exit_mmap()
 -> tlb_finish_mmu()
 -> tlb_flush_mmu()
 -> tlb_flush_mmu_tlbonly()
 -> tlb_flush()
 -> flush_tlb_mm_range()
 -> __flush_tlb_up()
 -> __flush_tlb()
 ->  __native_flush_tlb()

At this point current->mm is NULL but current->active_mm still points to
the "old" mm.

Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its
own mm so CR3 has changed.

Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's
mm and so CR3 remains unchanged. Once taskA gets active it continues
where it was interrupted and that means it writes its old CR3 value
back. Everything is fine because userland won't need its memory
anymore.

Now the fun part:

Let's preempt taskA one more time and get back to taskB. This
time switch_mm() won't do a thing because oldmm (->active_mm)
is the same as mm (as per context_switch()). So we remain
with a bad CR3 / PGD and return to userland.

The next thing that happens is handle_mm_fault() with an address for
the execution of its code in userland. handle_mm_fault() realizes that
it has a PTE with proper rights so it returns doing nothing. But the
CPU looks at the wrong PGD and insists that something is wrong and
faults again. And again. And one more time…

This pagefault circle continues until the scheduler gets tired of it and
puts another task on the CPU. It gets little difficult if the task is a
RT task with a high priority. The system will either freeze or it gets
fixed by the software watchdog thread which usually runs at RT-max prio.
But waiting for the watchdog will increase the latency of the RT task
which is no good.

Fix this by disabling preemption across the critical code section.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1470404259-26290-1-git-send-email-bigeasy@linutronix.de
[ Prettified the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 15:37:16 +02:00
Michael Ellerman ca49e64f0c selftests/powerpc: Specify we expect to build with std=gnu99
We have some tests that assume we're using std=gnu99, which is fine on
most compilers, but some old compilers use a different default.

So make it explicit that we want to use std=gnu99.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 23:21:37 +10:00
Takashi Iwai a52ff34e5e ALSA: hda - Manage power well properly for resume
For SKL and later Intel chips, we control the power well per codec
basis via link_power callback since the commit [03b135cebc: ALSA:
hda - remove dependency on i915 power well for SKL].
However, there are a few exceptional cases where the gfx registers are
accessed from the audio driver: namely the wakeup override bit
toggling at (both system and runtime) resume.  This seems causing a
kernel warning when accessed during the power well down (and likely
resulting in the bogus register accesses).

This patch puts the proper power up / down sequence around the resume
code so that the wakeup bit is fiddled properly while the power is
up.  (The other callback, sync_audio_rate, is used only in the PCM
callback, so it's guaranteed in the power-on.)

Also, by this proper power up/down, the instantaneous flip of wakeup
bit in the resume callback that was introduced by the commit
[033ea349a7: ALSA: hda - Fix Skylake codec timeout] becomes
superfluous, as snd_hdac_display_power() already does it.  So we can
clean it up together.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96214
Fixes: 03b135cebc ('ALSA: hda - remove dependency on i915 power well for SKL')
Cc: <stable@vger.kernel.org> # v4.2+
Tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-08-10 15:05:17 +02:00
Nicholas Piggin b9a4a0d02c powerpc/vdso: Fix build rules to rebuild vdsos correctly
When using if_changed, we need to add FORCE as a dependency (see
Documentation/kbuild/makefiles.txt) otherwise we don't get command line
change checking amongst other things. This has resulted in vdsos not
being rebuilt when switching between big and little endian.

The vdso64/32ld commands have to be changed around to avoid pulling
FORCE into the linker command line (code copied from x86).

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 23:04:12 +10:00
Michael Ellerman 164af597ce powerpc/Makefile: Use cflags-y/aflags-y for setting endian options
When we introduced the little endian support, we added the endian flags
to CC directly using override. I don't know the history of why we did
that, I suspect no one does.

Although this mostly works, it has one bug, which is that CROSS32CC
doesn't get -mbig-endian. That means when the compiler is little endian
by default and the user is building big endian, vdso32 is incorrectly
compiled as little endian and the kernel fails to build.

Instead we can add the endian flags to cflags-y/aflags-y, and then
append those to KBUILD_CFLAGS/KBUILD_AFLAGS.

This has the advantage of being 1) less ugly, 2) the documented way of
adding flags in the arch Makefile and 3) it fixes building vdso32 with a
LE toolchain.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 23:01:53 +10:00
Thomas Garnier fb754f958f x86/mm/KASLR: Increase BRK pages for KASLR memory randomization
Default implementation expects 6 pages maximum are needed for low page
allocations. If KASLR memory randomization is enabled, the worse case
of e820 layout would require 12 pages (no large pages). It is due to the
PUD level randomization and the variable e820 memory layout.

This bug was found while doing extensive testing of KASLR memory
randomization on different type of hardware.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 021182e52f ("Enable KASLR for physical mapping memory regions")
Link: http://lkml.kernel.org/r/1470762665-88032-2-git-send-email-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:45:19 +02:00
Thomas Garnier c7d2361f75 x86/mm/KASLR: Fix physical memory calculation on KASLR memory randomization
Initialize KASLR memory randomization after max_pfn is initialized. Also
ensure the size is rounded up. It could create problems on machines
with more than 1Tb of memory on certain random addresses.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: kernel-hardening@lists.openwall.com
Fixes: 021182e52f ("Enable KASLR for physical mapping memory regions")
Link: http://lkml.kernel.org/r/1470762665-88032-1-git-send-email-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:45:19 +02:00
Arnd Bergmann 22cc1ca3c5 x86/hpet: Fix /dev/rtc breakage caused by RTC cleanup
Ville Syrjälä reports "The first time I run hwclock after rebooting
I get this:

 open("/dev/rtc", O_RDONLY)              = 3
 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = 0
 select(4, [3], NULL, NULL, {10, 0})     = 0 (Timeout)
 ioctl(3, PHN_NOT_OH or RTC_UIE_OFF, 0)  = 0
 close(3)                                = 0

On all subsequent runs I get this:

 open("/dev/rtc", O_RDONLY)              = 3
 ioctl(3, PHN_SET_REGS or RTC_UIE_ON, 0) = -1 EINVAL (Invalid argument)
 ioctl(3, RTC_RD_TIME, 0x7ffd76b3ae70)   = -1 EINVAL (Invalid argument)
 close(3)                                = 0"

This was caused by a stupid typo in a patch that should have been
a simple rename to move around contents of a header file, but
accidentally wrote zeroes into the rtc rather than reading from
it:

  463a86304c ("char/genrtc: x86: remove remnants of asm/rtc.h")

Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: rtc-linux@googlegroups.com
Fixes: 463a86304c ("char/genrtc: x86: remove remnants of asm/rtc.h")
Link: http://lkml.kernel.org/r/20160809195528.1604312-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:37:06 +02:00
Ingo Molnar fdbdfefbab Merge branch 'linus' into timers/urgent, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:36:23 +02:00
Alexander Potapenko 469f002312 x86, kasan, ftrace: Put APIC interrupt handlers into .irqentry.text
Dmitry Vyukov has reported unexpected KASAN stackdepot growth:

  https://github.com/google/kasan/issues/36

... which is caused by the APIC handlers not being present in .irqentry.text:

When building with CONFIG_FUNCTION_GRAPH_TRACER=y or CONFIG_KASAN=y, put the
APIC interrupt handlers into the .irqentry.text section. This is needed
because both KASAN and function graph tracer use __irqentry_text_start and
__irqentry_text_end to determine whether a function is an IRQ entry point.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aryabinin@virtuozzo.com
Cc: kasan-dev@googlegroups.com
Cc: kcc@google.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1468575763-144889-1-git-send-email-glider@google.com
[ Minor edits. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:19:33 +02:00
Pan Xinhui c2ace36b88 locking/pvqspinlock: Fix a bug in qstat_read()
It's obviously wrong to set stat to NULL. So lets remove it.
Otherwise it is always zero when we check the latency of kick/wake.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:13:29 +02:00
Wanpeng Li 229ce63157 locking/pvqspinlock: Fix double hash race
When the lock holder vCPU is racing with the queue head:

   CPU 0 (lock holder)    CPU1 (queue head)
   ===================    =================
   spin_lock();           spin_lock();
    pv_kick_node():        pv_wait_head_or_lock():
                            if (!lp) {
                             lp = pv_hash(lock, pn);
                             xchg(&l->locked, _Q_SLOW_VAL);
                            }
                            WRITE_ONCE(pn->state, vcpu_halted);
     cmpxchg(&pn->state,
      vcpu_halted, vcpu_hashed);
     WRITE_ONCE(l->locked, _Q_SLOW_VAL);
     (void)pv_hash(lock, pn);

In this case, lock holder inserts the pv_node of queue head into the
hash table and set _Q_SLOW_VAL unnecessary. This patch avoids it by
restoring/setting vcpu_hashed state after failing adaptive locking
spinning.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:13:28 +02:00
pan xinhui 2db34e8bf9 locking/qrwlock: Fix write unlock bug on big endian systems
This patch aims to get rid of endianness in queued_write_unlock(). We
want to set  __qrwlock->wmode to NULL, however the address is not
&lock->cnts in big endian machine. That causes queued_write_unlock()
write NULL to the wrong field of __qrwlock.

So implement __qrwlock_write_byte() which returns the correct
__qrwlock->wmode address.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hpe.com
Cc: arnd@arndb.de
Cc: boqun.feng@gmail.com
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1468835259-4486-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:13:27 +02:00
Ingo Molnar a2071cd765 Merge branch 'linus' into locking/urgent, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:11:54 +02:00
Wanpeng Li c0c8c9fa21 sched/deadline: Fix lock pinning warning during CPU hotplug
The following warning can be triggered by hot-unplugging the CPU
on which an active SCHED_DEADLINE task is running on:

  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
  releasing a pinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   ? dl_task_timer+0x1a1/0x2b0
   warn_slowpath_fmt+0x4f/0x60
   ? sched_clock+0x13/0x20
   lock_release+0x690/0x6a0
   ? enqueue_pushable_dl_task+0x9b/0xa0
   ? enqueue_task_dl+0x1ca/0x480
   _raw_spin_unlock+0x1f/0x40
   dl_task_timer+0x1a1/0x2b0
   ? push_dl_task.part.31+0x190/0x190
  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
  unpinning an unpinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   warn_slowpath_fmt+0x4f/0x60
   lock_unpin_lock+0x181/0x1a0
   dl_task_timer+0x127/0x2b0
   ? push_dl_task.part.31+0x190/0x190

As per the comment before this code, its safe to drop the RQ lock
here, and since we (potentially) change rq, unpin and repin to avoid
the splat.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:02:55 +02:00
Giovanni Gherdovich 6075620b05 sched/cputime: Mitigate performance regression in times()/clock_gettime()
Commit:

  6e998916df ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
allow a task to wake early. It addressed the problem by calling the scheduling
classes update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls
times() and clock_gettimes(CLOCK_PROCESS_CPUTIME_ID). There are some
debuggers and applications that monitor their own performance that
accidentally depend on the performance of these specific calls.

This patch mitigates the performace loss by prefetching data in the CPU
cache, as stalls due to cache misses appear to be where most time is spent
in our benchmarks.

Here are the performance gain of this patch over v4.7-rc7 on a Sandy Bridge
box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
variable number of threads, from 2 to 4*num_cpus; the results are in
seconds and correspond to the average of 10 runs; the percentage gain is
computed with (before-after)/before so a positive value is an improvement
(it's faster). The improvement varies between a few percents for 5-20
threads and more than 10% for 2 or >20 threads.

pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.48        3.06 ( 11.83%)
      5           3.33        3.25 (  2.40%)
      8           3.37        3.26 (  3.30%)
     12           3.32        3.37 ( -1.60%)
     21           4.01        3.90 (  2.74%)
     30           3.63        3.36 (  7.41%)
     48           3.71        3.11 ( 16.27%)
     79           3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.65        3.25 ( 11.03%)
      5           3.45        3.17 (  7.92%)
      8           3.52        3.22 (  8.69%)
     12           3.29        3.36 ( -2.04%)
     21           4.07        3.92 (  3.78%)
     30           3.87        3.40 ( 12.17%)
     48           3.79        3.16 ( 16.61%)
     79           3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

pound_clock_gettime and pound_clock_gettime are two benchmarks included in
the MMTests framework. They launch a given number of threads which
repeatedly call times() or clock_gettimes(). The results above can be
reproduced with cloning MMTests from github.com and running the "poundtime"
workload:

  $ git clone https://github.com/gormanm/mmtests.git
  $ cd mmtests
  $ cp configs/config-global-dhp__workload_poundtime config
  $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime" measuring the kernel currently running on
the machine; Once a new kernel is installed and the machine rebooted,
running again

  $ cd mmtests
  $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output
with:

  $ cd mmtests/work/log
  $ ../../compare-kernels.sh

the table will contain a lot of entries; grepping for "Amean" (as in
"arithmetic mean") will give the tables presented above. The source code
for the two benchmarks is reported at the end of this changelog for
clairity.

The cache misses addressed by this patch were found using a combination of
`perf top`, `perf record` and `perf annotate`. The incriminated lines were
found to be

    struct sched_entity *curr = cfs_rq->curr;

and

    delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch
prefetches the data from memory just before update_curr is called in the
interested execution path.

A comparison of the total number of cycles before and after the patch
follows; the data is obtained using `perf stat -r 10 -ddd <program>`
running over the same sequence of number of threads used above (a positive
gain is an improvement):

  threads   cycles before                 cycles after                gain

    2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
    5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
    8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
   12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
   21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
   30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
   48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
   79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
  110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
  128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%

A comparison of cache miss vs total cache loads ratios, before and after
the patch (again from the `perf stat -r 10 -ddd <program>` tables):

  threads   L1 misses/total*100     L1 misses/total*100            gain
		         before                   after
      2           7.43  +-4.90%           7.36  +-4.70%           0.94%
      5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
      8          13.79  +-5.61%          12.90  +-3.27%           6.45%
     12          11.57  +-2.44%           8.71  +-1.40%          24.72%
     21          12.39  +-3.92%           9.97  +-1.84%          19.53%
     30          13.91  +-2.53%          11.73  +-2.28%          15.67%
     48          13.71  +-1.59%          12.32  +-1.97%          10.14%
     79          14.44  +-0.66%          13.40  +-1.06%           7.20%
    110          15.86  +-0.50%          14.46  +-0.59%           8.83%
    128          16.51  +-0.32%          15.06  +-0.78%           8.78%

As a final note, the following shows the evolution of performance figures
in the "poundtime" benchmark and pinpoints commit 6e998916df
("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
major source of degradation, mostly unaddressed to this day (figures
expressed in seconds).

pound_clock_gettime:

  threads   parent of         6e998916df        4.7-rc7
	    6e998916df            itself
    2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
    5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
    8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
    12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
    21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
    30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
    48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
    79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
    110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
    128      3.05          6.42 (-110.08%)        3.88 (-27.04%)

pound_times:

  threads   parent of         6e998916df        4.7-rc7
	    6e998916df            itself
    2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
    5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
    8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
    12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
    21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
    30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
    48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
    79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
    110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
    128      3.10          6.35 (-104.61%)        4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

  NR_THREADS=42
  for FILE in pound_times pound_clock_gettime; do
      gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
  done

==== BEGIN pound_times.c ====

struct tms start;

void *pound (void *threadid)
{
  struct tms end;
  int oldutime = 0;
  int utime;
  int i;
  for (i = 0; i < 5000000 / NUM_THREADS; i++) {
          times(&end);
          utime = ((int)end.tms_utime - (int)start.tms_utime);
          if (oldutime > utime) {
            printf("utime decreased, was %d, now %d!\n", oldutime, utime);
          }
          oldutime = utime;
  }
  pthread_exit(NULL);
}

int main()
{
  pthread_t th[NUM_THREADS];
  long i;
  times(&start);
  for (i = 0; i < NUM_THREADS; i++) {
    pthread_create (&th[i], NULL, pound, (void *)i);
  }
  pthread_exit(NULL);
  return 0;
}
==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

void *pound (void *threadid)
{
	struct timespec ts;
	int rc, i;
	unsigned long prev = 0, this = 0;

	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
		if (rc < 0)
			perror("clock_gettime");
		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
		if (0 && this < prev)
			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
		prev = this;
	}
	pthread_exit(NULL);
}

int main()
{
	pthread_t th[NUM_THREADS];
	long rc, i;
	pid_t pgid;

	for (i = 0; i < NUM_THREADS; i++) {
		rc = pthread_create(&th[i], NULL, pound, (void *)i);
		if (rc < 0)
			perror("pthread_create");
	}

	pthread_exit(NULL);
	return 0;
}
==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:32:56 +02:00
Xunlei Pang b8922125e4 sched/fair: Fix typo in sync_throttle()
We should update cfs_rq->throttled_clock_task, not
pcfs_rq->throttle_clock_task.

The effects of this bug was probably occasionally erratic
group scheduling, particularly in cgroups-intense workloads.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 55e16d30bd ("sched/fair: Rework throttle_count sync")
Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:32:55 +02:00
Tommaso Cucinotta a23eadfae2 sched/deadline: Fix wrap-around in DL heap
Current code in cpudeadline.c has a bug in re-heapifying when adding a
new element at the end of the heap, because a deadline value of 0 is
temporarily set in the new elem, then cpudl_change_key() is called
with the actual elem deadline as param.

However, the function compares the new deadline to set with the one
previously in the elem, which is 0.  So, if current absolute deadlines
grew so much to have negative values as s64, the comparison in
cpudl_change_key() makes the wrong decision.  Instead, as from
dl_time_before(), the kernel should handle correctly abs deadlines
wrap-arounds.

This patch fixes the problem with a minimally invasive change that
forces cpudl_change_key() to heapify up in this case.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:32:55 +02:00
Laura Garcia Liebana 4da449ae1d netfilter: nft_exthdr: Add size check on u8 nft_exthdr attributes
Fix the direct assignment of offset and length attributes included in
nft_exthdr structure from u32 data to u8.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-10 13:10:13 +02:00
David Carrillo-Cisneros db4a835601 perf/core: Set cgroup in CPU contexts for new cgroup events
There's a perf stat bug easy to observer on a machine with only one cgroup:

  $ perf stat -e cycles -I 1000 -C 0 -G /
  #          time             counts unit events
      1.000161699      <not counted>      cycles                    /
      2.000355591      <not counted>      cycles                    /
      3.000565154      <not counted>      cycles                    /
      4.000951350      <not counted>      cycles                    /

We'd expect some output there.

The underlying problem is that there is an optimization in
perf_cgroup_sched_{in,out}() that skips the switch of cgroup events
if the old and new cgroups in a task switch are the same.

This optimization interacts with the current code in two ways
that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a
cgroup event matches the current task. These are:

  1. On creation of the first cgroup event in a CPU: In current code,
  cpuctx->cpu is only set in perf_cgroup_sched_in, but due to the
  aforesaid optimization, perf_cgroup_sched_in will run until the next
  cgroup switches in that CPU. This may happen late or never happen,
  depending on system's number of cgroups, CPU load, etc.

  2. On deletion of the last cgroup event in a cpuctx: In list_del_event,
  cpuctx->cgrp is set NULL. Any new cgroup event will not be sched in
  because cpuctx->cgrp == NULL until a cgroup switch occurs and
  perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

This patch fixes both problems by setting cpuctx->cgrp in list_add_event,
mirroring what list_del_event does when removing a cgroup event from CPU
context, as introduced in:

  commit 68cacd2916 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

With this patch, cpuctx->cgrp is always set/clear when installing/removing
the first/last cgroup event in/from the CPU context. With cpuctx->cgrp
correctly set, event_filter_match works as intended when events are
sched in/out.

After the fix, the output is as expected:

  $ perf stat -e cycles -I 1000 -a -G /
  #         time             counts unit events
     1.004699159          627342882      cycles                    /
     2.007397156          615272690      cycles                    /
     3.010019057          616726074      cycles                    /

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:05:52 +02:00
Peter Zijlstra 0b8f1e2e26 perf/core: Fix sideband list-iteration vs. event ordering NULL pointer deference crash
Vegard Nossum reported that perf fuzzing generates a NULL
pointer dereference crash:

> Digging a bit deeper into this, it seems the event itself is getting
> created by perf_event_open() and it gets added to the pmu_event_list
> through:
>
> perf_event_open()
>  - perf_event_alloc()
>     - account_event()
>        - account_pmu_sb_event()
>           - attach_sb_event()
>
> so at this point the event is being attached but its ->ctx is still
> NULL. It seems like ->ctx is set just a bit later in
> perf_event_open(), though.
>
> But before that, __schedule() comes along and creates a stack trace
> similar to the one above:
>
> __schedule()
>  - __perf_event_task_sched_out()
>    - perf_iterate_sb()
>      - perf_iterate_sb_cpu()
>         - event_filter_match()
>           - perf_cgroup_match()
>             - __get_cpu_context()
>               - (dereference ctx which is NULL)
>
> So I guess the question is... should the event be attached (= put on
> the list) before ->ctx gets set? Or should the cgroup code check for a
> NULL ->ctx?

The latter seems like the simplest solution. Moving the list-add later
creates a bit of a mess.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f2fb6bef92 ("perf/core: Optimize side-band event delivery")
Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 13:05:51 +02:00
Nicolai Stange 6731b0d611 x86/timers/apic: Inform TSC deadline clockevent device about recalibration
This patch eliminates a source of imprecise APIC timer interrupts,
which imprecision may result in double interrupts or even late
interrupts.

The TSC deadline clockevent devices' configuration and registration
happens before the TSC frequency calibration is refined in
tsc_refine_calibration_work().

This results in the TSC clocksource and the TSC deadline clockevent
devices being configured with slightly different frequencies: the former
gets the refined one and the latter are configured with the inaccurate
frequency detected earlier by means of the "Fast TSC calibration using PIT".

Within the APIC code, introduce the notifier function
lapic_update_tsc_freq() which reconfigures all per-CPU TSC deadline
clockevent devices with the current tsc_khz.

Call it from the TSC code after TSC calibration refinement has happened.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christopher S. Hall <christopher.s.hall@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20160714152255.18295-3-nicstange@gmail.com
[ Pushed #ifdef CONFIG_X86_LOCAL_APIC into header, improved changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 12:38:12 +02:00
Nicolai Stange 1a9e4c564a x86/timers/apic: Fix imprecise timer interrupts by eliminating TSC clockevents frequency roundoff error
I noticed the following bug/misbehavior on certain Intel systems: with a
single task running on a NOHZ CPU on an Intel Haswell, I recognized
that I did not only get the one expected local_timer APIC interrupt, but
two per second at minimum. (!)

Further tracing showed that the first one precedes the programmed deadline
by up to ~50us and hence, it did nothing except for reprogramming the TSC
deadline clockevent device to trigger shortly thereafter again.

The reason for this is imprecise calibration, the timeout we program into
the APIC results in 'too short' timer interrupts. The core (hr)timer code
notices this (because it has a precise ktime source and sees the short
interrupt) and fixes it up by programming an additional very short
interrupt period.

This is obviously suboptimal.

The reason for the imprecise calibration is twofold, and this patch
fixes the first reason:

In setup_APIC_timer(), the registered clockevent device's frequency
is calculated by first dividing tsc_khz by TSC_DIVISOR and multiplying
it with 1000 afterwards:

  (tsc_khz / TSC_DIVISOR) * 1000

The multiplication with 1000 is done for converting from kHz to Hz and the
division by TSC_DIVISOR is carried out in order to make sure that the final
result fits into an u32.

However, with the order given in this calculation, the roundoff error
introduced by the division gets magnified by a factor of 1000 by the
following multiplication.

To fix it, reversing the order of the division and the multiplication a la:

  (tsc_khz * 1000) / TSC_DIVISOR

... reduces the roundoff error already.

Furthermore, if TSC_DIVISOR divides 1000, associativity holds:

  (tsc_khz * 1000) / TSC_DIVISOR = tsc_khz * (1000 / TSC_DIVISOR)

and thus, the roundoff error even vanishes and the whole operation can be
carried out within 32 bits.

The powers of two that divide 1000 are 2, 4 and 8. A value of 8 for
TSC_DIVISOR still allows for TSC frequencies up to
2^32 / 10^9ns * 8 = 34.4GHz which is way larger than anything to expect
in the next years.

Thus we also replace the current TSC_DIVISOR value of 32 by 8. Reverse
the order of the divison and the multiplication in the calculation of
the registered clockevent device's frequency.

Signed-off-by: Nicolai Stange <nicstange@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Christopher S. Hall <christopher.s.hall@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20160714152255.18295-2-nicstange@gmail.com
[ Improved changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 12:37:38 +02:00
Robin Murphy c987ff0d3c iommu/dma: Respect IOMMU aperture when allocating
Where a device driver has set a 64-bit DMA mask to indicate the absence
of addressing limitations, we still need to ensure that we don't
allocate IOVAs beyond the actual input size of the IOMMU. The reported
aperture is the most reliable way we have of inferring that input
address size, so use that to enforce a hard upper limit where available.

Fixes: 0db2e5d18f ("iommu: Implement common IOMMU ops for DMA mapping")
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2016-08-10 12:02:02 +02:00
Benjamin Herrenschmidt 97f6e0cc35 powerpc/32: Fix crash during static key init
We cannot do those initializations from apply_feature_fixups() as
this function runs in a very restricted environment on 32-bit where
the kernel isn't running at its linked address and the PTRRELOC()
macro must be used for any global accesss.

Instead, split them into a separtate steup_feature_keys() function
which is called in a more suitable spot on ppc32.

Fixes: 309b315b6e ("powerpc: Call jump_label_init() in apply_feature_fixups()")
Reported-and-tested-by: Christian Kujau <lists@nerdbynature.de>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 19:41:58 +10:00
Benjamin Herrenschmidt f9cc1d1f80 powerpc: Update obsolete comment in setup_32.c about early_init()
We don't identify the machine type anymore...

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 19:38:02 +10:00
Benjamin Herrenschmidt 7d70c63c71 powerpc: Print the kernel load address at the end of prom_init()
This makes it easier to debug crashes that happen very early before
the kernel takes over Open Firmware by allowing us to relate the OF
reported crashing addresses to offsets within the kernel.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 19:37:43 +10:00
Heiko Carstens 4d81aaa53c s390/pageattr: handle numpages parameter correctly
Both set_memory_ro() and set_memory_rw() will modify the page
attributes of at least one page, even if the numpages parameter is
zero.

The author expected that calling these functions with numpages == zero
would never happen. However with the new 444d13ff10 ("modules: add
ro_after_init support") feature this happens frequently.

Therefore do the right thing and make these two functions return
gracefully if nothing should be done.

Fixes crashes on module load like this one:

Unable to handle kernel pointer dereference in virtual kernel address space
Failing address: 000003ff80008000 TEID: 000003ff80008407
Fault in home space mode while using kernel ASCE.
AS:0000000000d18007 R3:00000001e6aa4007 S:00000001e6a10800 P:00000001e34ee21d
Oops: 0004 ilc:3 [#1] SMP
Modules linked in: x_tables
CPU: 10 PID: 1 Comm: systemd Not tainted 4.7.0-11895-g3fa9045 #4
Hardware name: IBM              2964 N96              703              (LPAR)
task: 00000001e9118000 task.stack: 00000001e9120000
Krnl PSW : 0704e00180000000 00000000005677f8 (rb_erase+0xf0/0x4d0)
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
Krnl GPRS: 000003ff80008b20 000003ff80008b20 000003ff80008b70 0000000000b9d608
           000003ff80008b20 0000000000000000 00000001e9123e88 000003ff80008950
           00000001e485ab40 000003ff00000000 000003ff80008b00 00000001e4858480
           0000000100000000 000003ff80008b68 00000000001d5998 00000001e9123c28
Krnl Code: 00000000005677e8: ec1801c3007c        cgij    %r1,0,8,567b6e
           00000000005677ee: e32010100020        cg      %r2,16(%r1)
          #00000000005677f4: a78401c2            brc     8,567b78
          >00000000005677f8: e35010080024        stg     %r5,8(%r1)
           00000000005677fe: ec5801af007c        cgij    %r5,0,8,567b5c
           0000000000567804: e30050000024        stg     %r0,0(%r5)
           000000000056780a: ebacf0680004        lmg     %r10,%r12,104(%r15)
           0000000000567810: 07fe                bcr     15,%r14
Call Trace:
([<000003ff80008900>] __this_module+0x0/0xffffffffffffd700 [x_tables])
([<0000000000264fd4>] do_init_module+0x12c/0x220)
([<00000000001da14a>] load_module+0x24e2/0x2b10)
([<00000000001da976>] SyS_finit_module+0xbe/0xd8)
([<0000000000803b26>] system_call+0xd6/0x264)
Last Breaking-Event-Address:
 [<000000000056771a>] rb_erase+0x12/0x4d0
 Kernel panic - not syncing: Fatal exception: panic_on_oops

Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reported-and-tested-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Fixes: e8a97e42dc ("s390/pageattr: allow kernel page table splitting")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-10 10:12:19 +02:00
Stefan Haberland 9ba333dc55 s390/dasd: fix hanging device after clear subchannel
When a device is in a status where CIO has killed all I/O by itself the
interrupt for a clear request may not contain an irb to determine the
clear function. Instead it contains an error pointer -EIO.
This was ignored by the DASD int_handler leading to a hanging device
waiting for a clear interrupt.

Handle -EIO error pointer correctly for requests that are clear pending and
treat the clear as successful.

Signed-off-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-08-10 10:12:16 +02:00
Cyril Bur c7a318ba86 powerpc/ptrace: Fix coredump since ptrace TM changes
Commit 8d460f6156 ("powerpc/process: Add the function
flush_tmregs_to_thread") added flush_tmregs_to_thread() and included
the assumption that it would only be called for a task which is not
current.

Although this is correct for ptrace, when generating a core dump, some
of the routines which call flush_tmregs_to_thread() are called. This
leads to a WARNing such as:

  Not expecting ptrace on self: TM regs may be incorrect
  ------------[ cut here ]------------
  WARNING: CPU: 123 PID: 7727 at arch/powerpc/kernel/process.c:1088 flush_tmregs_to_thread+0x78/0x80
  CPU: 123 PID: 7727 Comm: libvirtd Not tainted 4.8.0-rc1-gcc6x-g61e8a0d #1
  task: c000000fe631b600 task.stack: c000000fe63b0000
  NIP: c00000000001a1a8 LR: c00000000001a1a4 CTR: c000000000717780
  REGS: c000000fe63b3420 TRAP: 0700   Not tainted  (4.8.0-rc1-gcc6x-g61e8a0d)
  MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 28004222  XER: 20000000
  ...
  NIP [c00000000001a1a8] flush_tmregs_to_thread+0x78/0x80
  LR [c00000000001a1a4] flush_tmregs_to_thread+0x74/0x80
  Call Trace:
   flush_tmregs_to_thread+0x74/0x80 (unreliable)
   vsr_get+0x64/0x1a0
   elf_core_dump+0x604/0x1430
   do_coredump+0x5fc/0x1200
   get_signal+0x398/0x740
   do_signal+0x54/0x2b0
   do_notify_resume+0x98/0xb0
   ret_from_except_lite+0x70/0x74

So fix flush_tmregs_to_thread() to detect the case where it is called on
current, and a transaction is active, and in that case flush the TM regs
to the thread_struct.

This patch also moves flush_tmregs_to_thread() into ptrace.c as it is
only called from that file.

Fixes: 8d460f6156 ("powerpc/process: Add the function flush_tmregs_to_thread")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Flesh out change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 16:34:20 +10:00
Christophe Leroy 1bc8b816cb powerpc/32: Fix csum_partial_copy_generic()
Commit 7aef413656 ("powerpc32: rewrite csum_partial_copy_generic()
based on copy_tofrom_user()") introduced a bug when destination
address is odd and initial csum is not null

In that (rare) case the initial csum value has to be rotated one byte
as well as the resulting value is

This patch also fixes related comments

Fixes: 7aef413656 ("powerpc32: rewrite csum_partial_copy_generic() based on copy_tofrom_user()")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 14:52:45 +10:00
Toshiaki Makita 7bb90c3715 bridge: Fix problems around fdb entries pointing to the bridge device
Adding fdb entries pointing to the bridge device uses fdb_insert(),
which lacks various checks and does not respect added_by_user flag.

As a result, some inconsistent behavior can happen:
* Adding temporary entries succeeds but results in permanent entries.
* Same goes for "dynamic" and "use".
* Changing mac address of the bridge device causes deletion of
  user-added entries.
* Replacing existing entries looks successful from userspace but actually
  not, regardless of NLM_F_EXCL flag.

Use the same logic as other entries and fix them.

Fixes: 3741873b4f ("bridge: allow adding of fdb entries pointing to the bridge device")
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 21:42:44 -07:00
Frederic Barrat c6d2ee09c2 cxl: Set psl_fir_cntl to production environment value
Switch the setting of psl_fir_cntl from debug to production
environment recommended value. It mostly affects the PSL behavior when
an error is raised in psl_fir1/2.

Tested with cxlflash.

Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Uma Krishnan <ukrishn@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-10 14:02:45 +10:00
Konstantin Khlebnikov 51350ea0d7 mm, writeback: flush plugged IO in wakeup_flusher_threads()
I've found funny live-lock between raid10 barriers during resync and
memory controller hard limits. Inside mpage_readpages() task holds on to
its plug bio which blocks the barrier in raid10. Its memory cgroup have
no free memory thus the task goes into reclaimer but all reclaimable
pages are dirty and cannot be written because raid10 is rebuilding and
stuck on the barrier.

Common flush of such IO in schedule() never happens, because the caller
doesn't go to sleep.

Lock is 'live' because changing memory limit or killing tasks which
holds that stuck bio unblock whole progress.

That was what happened in 3.18.x but I see no difference in upstream
logic.  Theoretically this might happen even without memory cgroup.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-09 19:58:06 -06:00
Wenyou Yang 836384d250 net: phy: micrel: Add specific suspend
Disable all interrupts when suspend, they will be enabled
when resume. Otherwise, the suspend/resume process will be
blocked occasionally.

Signed-off-by: Wenyou Yang <wenyou.yang@atmel.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 16:19:15 -07:00
Sylwester Nawrocki a96d3b7593 dm9000: Fix irq trigger type setup on non-dt platforms
Commit b5a099c67a "net: ethernet: davicom: fix devicetree irq
resource" causes an interrupt storm after the ethernet interface
is activated on S3C24XX platform (ARM non-dt), due to the interrupt
trigger type not being set properly.

It seems, after adding parsing of IRQ flags in commit 7085a7401b
"drivers: platform: parse IRQ flags from resources", there is no path
for non-dt platforms where irq_set_type callback could be invoked when
we don't pass the trigger type flags to the request_irq() call.

In case of a board where the regression is seen the interrupt trigger
type flags are passed through a platform device's resource and it is
not currently handled properly without passing the irq trigger type
flags to the request_irq() call.  In case of OF an of_irq_get() call
within platform_get_irq() function seems to be ensuring required irq_chip
setup, but there is no equivalent code for non OF/ACPI platforms.

This patch mostly restores irq trigger type setting code which has been
removed in commit ("net: ethernet: davicom: fix devicetree irq resource").

Fixes: b5a099c67a ("net: ethernet: davicom: fix devicetree irq resource")

Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com>
Acked-by: Robert Jarzmik <robert.jarzmik@free.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 15:08:22 -07:00
Zhu Yanjun 0d039f337f bonding: fix the typo
The message "803.ad" should be "802.3ad".

Signed-off-by: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 14:57:14 -07:00
Grygorii Strashko 254a49d513 drivers: net: cpsw: fix kmemleak false-positive reports for sk buffers
Kmemleak reports following false positive memory leaks for each sk
buffers allocated by CPSW (__netdev_alloc_skb_ip_align()) in
cpsw_ndo_open() and cpsw_rx_handler():

unreferenced object 0xea915000 (size 2048):
  comm "systemd-network", pid 713, jiffies 4294938323 (age 102.180s)
  hex dump (first 32 bytes):
    00 58 91 ea ff ff ff ff ff ff ff ff ff ff ff ff  .X..............
    ff ff ff ff ff ff fd 0f 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<c0108680>] __kmalloc_track_caller+0x1a4/0x230
    [<c0529eb4>] __alloc_skb+0x68/0x16c
    [<c052c884>] __netdev_alloc_skb+0x40/0x104
    [<bf1ad29c>] cpsw_ndo_open+0x374/0x670 [ti_cpsw]
    [<c053c3d4>] __dev_open+0xb0/0x114
    [<c053c690>] __dev_change_flags+0x9c/0x14c
    [<c053c760>] dev_change_flags+0x20/0x50
    [<c054bdcc>] do_setlink+0x2cc/0x78c
    [<c054c358>] rtnl_setlink+0xcc/0x100
    [<c054b34c>] rtnetlink_rcv_msg+0x184/0x224
    [<c056467c>] netlink_rcv_skb+0xa8/0xc4
    [<c054b1c0>] rtnetlink_rcv+0x2c/0x34
    [<c0564018>] netlink_unicast+0x16c/0x1f8
    [<c0564498>] netlink_sendmsg+0x334/0x348
    [<c052015c>] sock_sendmsg+0x1c/0x2c
    [<c05213e0>] SyS_sendto+0xc0/0xe8

unreferenced object 0xec861780 (size 192):
  comm "softirq", pid 0, jiffies 4294938759 (age 109.540s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 b0 5a ed 00 00 00 00 00 00 00 00  ......Z.........
  backtrace:
    [<c0107830>] kmem_cache_alloc+0x190/0x208
    [<c052c768>] __build_skb+0x30/0x98
    [<c052c8fc>] __netdev_alloc_skb+0xb8/0x104
    [<bf1abc54>] cpsw_rx_handler+0x68/0x1e4 [ti_cpsw]
    [<bf11aa30>] __cpdma_chan_free+0xa8/0xc4 [davinci_cpdma]
    [<bf11ab98>] __cpdma_chan_process+0x14c/0x16c [davinci_cpdma]
    [<bf11abfc>] cpdma_chan_process+0x44/0x5c [davinci_cpdma]
    [<bf1adc78>] cpsw_rx_poll+0x1c/0x9c [ti_cpsw]
    [<c0539180>] net_rx_action+0x1f0/0x2ec
    [<c003881c>] __do_softirq+0x134/0x258
    [<c0038a00>] do_softirq+0x68/0x70
    [<c0038adc>] __local_bh_enable_ip+0xd4/0xe8
    [<c0640994>] _raw_spin_unlock_bh+0x30/0x34
    [<c05f4e9c>] igmp6_group_added+0x4c/0x1bc
    [<c05f6600>] ipv6_dev_mc_inc+0x398/0x434
    [<c05dba74>] addrconf_dad_work+0x224/0x39c

This happens because CPSW allocates SK buffers and then passes
pointers on them in CPDMA where they stored in internal CPPI RAM
(SRAM) which belongs to DEV MMIO space. Kmemleak does not scan IO
memory and so reports memory leaks.

Hence, mark allocated sk buffers as false positive explicitly.

Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 14:54:59 -07:00
Lance Richardson a5d0dc810a vti: flush x-netns xfrm cache when vti interface is removed
When executing the script included below, the netns delete operation
hangs with the following message (repeated at 10 second intervals):

  kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1

This occurs because a reference to the lo interface in the "secure" netns
is still held by a dst entry in the xfrm bundle cache in the init netns.

Address this problem by garbage collecting the tunnel netns flow cache
when a cross-namespace vti interface receives a NETDEV_DOWN notification.

A more detailed description of the problem scenario (referencing commands
in the script below):

(1) ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1

  The vti_test interface is created in the init namespace. vti_tunnel_init()
  attaches a struct ip_tunnel to the vti interface's netdev_priv(dev),
  setting the tunnel net to &init_net.

(2) ip link set vti_test netns secure

  The vti_test interface is moved to the "secure" netns. Note that
  the associated struct ip_tunnel still has tunnel->net set to &init_net.

(3) ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

  The first packet sent using the vti device causes xfrm_lookup() to be
  called as follows:

      dst = xfrm_lookup(tunnel->net, skb_dst(skb), fl, NULL, 0);

  Note that tunnel->net is the init namespace, while skb_dst(skb) references
  the vti_test interface in the "secure" namespace. The returned dst
  references an interface in the init namespace.

  Also note that the first parameter to xfrm_lookup() determines which flow
  cache is used to store the computed xfrm bundle, so after xfrm_lookup()
  returns there will be a cached bundle in the init namespace flow cache
  with a dst referencing a device in the "secure" namespace.

(4) ip netns del secure

  Kernel begins to delete the "secure" namespace.  At some point the
  vti_test interface is deleted, at which point dst_ifdown() changes
  the dst->dev in the cached xfrm bundle flow from vti_test to lo (still
  in the "secure" namespace however).
  Since nothing has happened to cause the init namespace's flow cache
  to be garbage collected, this dst remains attached to the flow cache,
  so the kernel loops waiting for the last reference to lo to go away.

<Begin script>
ip link add br1 type bridge
ip link set dev br1 up
ip addr add dev br1 1.1.1.1/8

ip netns add secure
ip link add vti_test type vti local 1.1.1.1 remote 1.1.1.2 key 1
ip link set vti_test netns secure
ip netns exec secure ip link set vti_test up
ip netns exec secure ip link s lo up
ip netns exec secure ip addr add dev lo 192.168.100.1/24
ip netns exec secure ip route add 192.168.200.0/24 dev vti_test
ip xfrm policy flush
ip xfrm state flush
ip xfrm policy add dir out tmpl src 1.1.1.1 dst 1.1.1.2 \
   proto esp mode tunnel mark 1
ip xfrm policy add dir in tmpl src 1.1.1.2 dst 1.1.1.1 \
   proto esp mode tunnel mark 1
ip xfrm state add src 1.1.1.1 dst 1.1.1.2 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788
ip xfrm state add src 1.1.1.2 dst 1.1.1.1 proto esp spi 1 \
   mode tunnel enc des3_ede 0x112233445566778811223344556677881122334455667788

ip netns exec secure ping -c 4 -i 0.02 -I 192.168.100.1 192.168.200.1

ip netns del secure
<End script>

Reported-by: Hangbin Liu <haliu@redhat.com>
Reported-by: Jan Tluka <jtluka@redhat.com>
Signed-off-by: Lance Richardson <lrichard@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 12:57:49 -07:00
David S. Miller aee4eda1ab RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAV6oCE/Sw1s6N8H32AQK1bxAAj2FoASombJwHomUjnvKb4RWdFkwA2Km3
 A2R6qdPYCZLbdiH7gw8TqSYGBp//8EL6qlcMKKVvSciAFlS0oQb2G99Tni4cVB4W
 /xmRatAeU2/k1YBndPsOFHkhdDto6BlsUJbWaW3WIH2vqIfFsW1nh4oIR0N9yvt0
 awF34s2yXfp77hr2t7Ca/PMibEcyCmg2kF83Kxilx5z/Ubpl/TtyqswLiPDO7EJq
 lfyoplgxoG0kvpN1wh2z8i4dKAh8+e3PMIMBdodG/2aA/srWVlECRDx0IqQ4hzQg
 iSaJDBp+E9ZhKExkPIDpEx49iH1kawerFub+udBDCR6n9JPSlPOWdHiYxXaVE7yA
 WGW2JGyv2Wo7vkEM76/3dzHBzn0w203T6C+Sa25I82oOxvVbAZfQRxHTJVmCdcOc
 lHyeGcEuk4h2lYc7U/ty3uGEwm7dL+tyoEw0MGcHQEDpnainKjFPdXiKSOOMb/Vd
 kiqXrsK9nPZ9JjDtRA1oK7uLNYbllFH13aDR5kU1VAH4s2zwD1498/tegcZ+Hds5
 UKhyWfqHBr3uBrxEwRVLFnJ5yi6LQy+O4jNHg1gTakaY6B2j34sLzd5fuc1idhGe
 yUuCU+XC7mcdQQWoXAc96/orsqmtEiY6yABuTTijeZTSbpKBT3uJpelzAnuxCePZ
 2u0lddGIl7M=
 =c47N
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20160809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Miscellaneous fixes

Here are a bunch of miscellaneous fixes to AF_RXRPC:

 (*) Fix an uninitialised pointer.

 (*) Fix error handling when we fail to connect a call.

 (*) Fix a NULL pointer dereference.

 (*) Fix two occasions where a packet is accessed again after being queued
     for someone else to deal with.

 (*) Fix a missing skb free.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-09 12:56:43 -07:00
Ingo Molnar 69766c40c3 perf/urgent fixes:
User visible fixes:
 
 - Fix the lookup for a kernel module in 'perf probe', fixing for instance, the
   erroneous return of "[raid10]" when looking for "[raid1]"  (Konstantin Khlebnikov)
 
 - Disable counters in a group before reading them in 'perf stat', to avoid skew (Mark Rutland)
 
 - Fix adding probes to function aliases in systems using kaslr (Masami Hiramatsu)
 
 - Trip libtraceevent trace_seq buffers, removing unnecessary memory usage that could
   bring a system using tracepoint events with 'perf top' to a crawl, as the trace_seq
   buffers start at a whooping 4 KB, which is very rarely used in perf's usecases,
   so realloc it to the really used space as a last measure after using libtraceevent
   functions to format the fields of tracepoint events (Arnaldo Carvalho de Melo)
 
 - Fix 'perf probe' location when using DWARF on ppc64le (Ravi Bangoria)
 
 Improvement:
 
 - Allow specifying signedness casts to a 'perf probe' variable, to shorten
   the number of steps to see signed values that otherwise would always appear
   as hex values (Naohiro Aota)
 
 Documentation fixes:
 
 - Add 'bpf-output' field to 'perf script' usage message (Brendan Gregg)
 
 Infrastructure fixes:
 
 - Sync kernel header files: cpufeatures.h, {disabled,required}-features.h,
   bpf.h and vmx.h, so that we get a clean build, without warnings about files
   being different from the kernel counterparts.
 
   A verification of the need or desirability of changes in tools/ based on what
   was done in the kernel changesets was made and documented in the respective
   file sync changesets (Arnaldo Carvalho de Melo)
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJXqfpSAAoJENZQFvNTUqpAkJUQALGxbMIZmCOE2O/lok7t6PDZ
 VUsq0vs3V1mPqdin1UnNsfMCvWQExeJZQ4P8EUTSLoMbpiifq2BY5696xs35LUtl
 +UTTtVgtN+/W5gLiQ78U8kEkG8Q/PiMeWKyLLKgBSAEtibC4pOLnQCu/g4DP3e3c
 oYQTaoaSq4eBQMQaDKF2Y6EVQFEdQs4PI1JUIsGn9zTfR5qtRiKwZxrNkAfmNAVO
 opDN42JR3HewwXiOKWuoAVHDi5QVsHgDUnPuYlFujbx306WV+EiypRpzA4Rbr0Cu
 AZrtkqQdSkKYVhEop3Az5kW9m3qZ6DRcZfJNVmD0Cax637gQNbOyIVVxo2KiS5JQ
 8kZknTuQoR8GTARUwlUlQVydwKaRXsox4M1o71FVAOuvEKzBFpIUF46c+Ljgj4O0
 zo1q9I2GnmxjakP0528oLrtT4UyndWTxjK0bwPcr+AwFGVfICT5OteUoTkLSbTAO
 WdqfIhtS1PL+yJmy9iQkaPtTWVtgHmJbQ8PtBVBZZjwvDbx2sDv84/83picwUM+u
 4g0WaqEzKMhRznDHXjc6EkSWRcP20AFqAotRNn8Uxhinx6OvLMs2C6TLzt1lHuw/
 Zsz6wJnM3qaOs9Oxe1U8J04Zm47C8y6iVx5zpJkgFaAf49mJ475me89NratFrAfq
 T1OsSy9dejfB4xrBvOjk
 =/B7T
 -----END PGP SIGNATURE-----

Merge tag 'perf-urgent-for-mingo-20160809' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/urgent fixes from Arnaldo Carvalho de Melo:

User visible fixes:

- Fix the lookup for a kernel module in 'perf probe', fixing for instance, the
  erroneous return of "[raid10]" when looking for "[raid1]"  (Konstantin Khlebnikov)

- Disable counters in a group before reading them in 'perf stat', to avoid skew (Mark Rutland)

- Fix adding probes to function aliases in systems using kaslr (Masami Hiramatsu)

- Trip libtraceevent trace_seq buffers, removing unnecessary memory usage that could
  bring a system using tracepoint events with 'perf top' to a crawl, as the trace_seq
  buffers start at a whooping 4 KB, which is very rarely used in perf's usecases,
  so realloc it to the really used space as a last measure after using libtraceevent
  functions to format the fields of tracepoint events (Arnaldo Carvalho de Melo)

- Fix 'perf probe' location when using DWARF on ppc64le (Ravi Bangoria)

- Allow specifying signedness casts to a 'perf probe' variable, to shorten
  the number of steps to see signed values that otherwise would always appear
  as hex values (Naohiro Aota)

Documentation fixes:

- Add 'bpf-output' field to 'perf script' usage message (Brendan Gregg)

Infrastructure fixes:

- Sync kernel header files: cpufeatures.h, {disabled,required}-features.h,
  bpf.h and vmx.h, so that we get a clean build, without warnings about files
  being different from the kernel counterparts.

  A verification of the need or desirability of changes in tools/ based on what
  was done in the kernel changesets was made and documented in the respective
  file sync changesets (Arnaldo Carvalho de Melo)

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-09 21:11:00 +02:00
Linus Torvalds a0cba2179e Revert "printk: create pr_<level> functions"
This reverts commit 874f9c7da9.

Geert Uytterhoeven reports:
 "This change seems to have an (unintendent?) side-effect.

  Before, pr_*() calls without a trailing newline characters would be
  printed with a newline character appended, both on the console and in
  the output of the dmesg command.

  After this commit, no new line character is appended, and the output
  of the next pr_*() call of the same type may be appended, like in:

    - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
    - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
    + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"

Joe Perches says:
 "No, that is not intentional.

  The newline handling code inside vprintk_emit is a bit involved and
  for now I suggest a revert until this has all the same behavior as
  earlier"

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Requested-by: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09 10:48:18 -07:00
Linus Torvalds 84bd8d33a9 Luiz Capitulino noticed that the tick_stop tracepoint wasn't being parsed
properly by the tracing user space tools. This was due to the
 TRACE_DEFINE_ENUM() being set to a define, when it should have been set
 to the enum itself. The define was of the MASK that used the BIT to shift.
 The BIT was the enum and by adding that, everything gets converted nicely.
 The MASK is still kept just in case it gets converted to an enum in the
 future.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXqeA/AAoJEKKk/i67LK/8wLAH/0nD8L5pxtn+pZi3mQnWNwbn
 qEBtfKK8cvnt0IWH2HlKmRKLAiIJfp8UrkGoPiLT7Nb83PlKaw3UT868t7eDmknu
 b29SsZroMgvJ1MeeNH9Yzk7cK3/K213VO02P4ce8EWSELYCqFlxJDE3dhl52K9Lj
 clJSoZbIGTLlx4pk6zZnPksTt3Z9WZXcVJITwxEiz/Cr+CKAWZpLoPPUaqJOmA4j
 9oS6d++0DYZdz3cWCmYBBkphmc5IQkBNZWGMYLcAR+M+m5fsN+baWlP3Dhq4j7he
 WknHwu4WDFMk6a2Kh0Ggi6yUWVUIkLNY2Z4QUF2gNmJT1g/FH5lZka4/kIxjvKw=
 =rgBA
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix tick_stop tracepoint symbols for user export.

  Luiz Capitulino noticed that the tick_stop tracepoint wasn't being
  parsed properly by the tracing user space tools.

  This was due to the TRACE_DEFINE_ENUM() being set to a define, when it
  should have been set to the enum itself.  The define was of the MASK
  that used the BIT to shift.  The BIT was the enum and by adding that,
  everything gets converted nicely.  The MASK is still kept just in case
  it gets converted to an enum in the future"

* tag 'trace-v4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix tick_stop tracepoint symbols for user export
2016-08-09 10:34:09 -07:00