1
0
Fork 0
Commit Graph

28835 Commits (49ef341ab668df248949c8e011d176ac6f8918fe)

Author SHA1 Message Date
Linus Torvalds 343a9f3540 The biggest change here is the updates to kprobes
Back in January I posted patches to create function based events. These were
 the events that you suggested I make to allow developers to easily create
 events in code where no trace event exists. After posting those changes for
 review, it was suggested that we implement this instead with kprobes.
 
 The problem with kprobes is that the interface is too complex and needs to
 be simplified. Masami Hiramatsu posted patches in March and I've been
 playing with them a bit. There's been a bit of clean up in the kprobe code
 that was inspired by the function based event patches, and a couple of
 enhancements to the kprobe event interface.
 
  - If the arch supports it (we added support for x86), you can place a
    kprobe event at the start of a function and use $arg1, $arg2, etc
    to reference the arguments of a function. (Before you needed to know
    what register or where on the stack the argument was).
 
  - The second is a way to see array of events. For example, if you reference
    a mac address, you can add:
 
    echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events
 
    And this will produce:
 
    mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}
 
 Other changes include
 
  - Exporting trace_dump_stack to modules
 
  - Have the stack tracer trace the entire stack (stop trying to remove
    tracing itself, as we keep removing too much).
 
  - Added support for SDT in uprobes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW9hdjxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmtbAP9GS/o2WSvsYLSIw4+mF94eCL06lUxp
 rRrktkEofm/PagEAl2JNmvHrAJN+LIrajqXTbwlZ7Ckk1rZhCW41Am7qnQs=
 =sTUM
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:
 "The biggest change here is the updates to kprobes

  Back in January I posted patches to create function based events.
  These were the events that you suggested I make to allow developers to
  easily create events in code where no trace event exists. After
  posting those changes for review, it was suggested that we implement
  this instead with kprobes.

  The problem with kprobes is that the interface is too complex and
  needs to be simplified. Masami Hiramatsu posted patches in March and
  I've been playing with them a bit. There's been a bit of clean up in
  the kprobe code that was inspired by the function based event patches,
  and a couple of enhancements to the kprobe event interface.

   - If the arch supports it (we added support for x86), you can place a
     kprobe event at the start of a function and use $arg1, $arg2, etc
     to reference the arguments of a function. (Before you needed to
     know what register or where on the stack the argument was).

   - The second is a way to see array of events. For example, if you
     reference a mac address, you can add:

	echo 'p:mac ip_rcv perm_addr=+574($arg2):x8[6]' > kprobe_events

     And this will produce:

	mac: (ip_rcv+0x0/0x140) perm_addr={0x52,0x54,0x0,0xc0,0x76,0xec}

  Other changes include

   - Exporting trace_dump_stack to modules

   - Have the stack tracer trace the entire stack (stop trying to remove
     tracing itself, as we keep removing too much).

   - Added support for SDT in uprobes"

[ SDT - "Statically Defined Tracing" are userspace markers for tracing.
  Let's not use random TLA's in explanations unless they are fairly
  well-established as generic (at least for kernel people) - Linus ]

* tag 'trace-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (24 commits)
  tracing: Have stack tracer trace full stack
  tracing: Export trace_dump_stack to modules
  tracing: probeevent: Fix uninitialized used of offset in parse args
  tracing/kprobes: Allow kprobe-events to record module symbol
  tracing/kprobes: Check the probe on unloaded module correctly
  tracing/uprobes: Fix to return -EFAULT if copy_from_user failed
  tracing: probeevent: Add $argN for accessing function args
  x86: ptrace: Add function argument access API
  tracing: probeevent: Add array type support
  tracing: probeevent: Add symbol type
  tracing: probeevent: Unify fetch_insn processing common part
  tracing: probeevent: Append traceprobe_ for exported function
  tracing: probeevent: Return consumed bytes of dynamic area
  tracing: probeevent: Unify fetch type tables
  tracing: probeevent: Introduce new argument fetching code
  tracing: probeevent: Remove NOKPROBE_SYMBOL from print functions
  tracing: probeevent: Cleanup argument field definition
  tracing: probeevent: Cleanup print argument functions
  trace_uprobe: support reference counter in fd-based uprobe
  perf probe: Support SDT markers having reference counter (semaphore)
  ...
2018-10-30 09:49:56 -07:00
Linus Torvalds f4267b3604 Masami had a couple more fixes to the synthetic events. One was a proper
error return value, and the other is for the self tests.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW9hYsBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiWPAQCARhYpeiWNHYGAirpI1vxaNHzutZvP
 j6eaqRXu0JDfsQD7Bb19KxSHUEaPSpZIo68N5OJkoTv9UxD2KYuRysu4VQ8=
 =y1Gc
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.19-rc8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Masami had a couple more fixes to the synthetic events. One was a
  proper error return value, and the other is for the self tests"

* tag 'trace-v4.19-rc8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  selftests/ftrace: Fix synthetic event test to delete event correctly
  tracing: Return -ENOENT if there is no target synthetic event
2018-10-30 09:47:28 -07:00
Linus Torvalds 6ef746769e More power management updates for 4.20-rc1
- Fix build regression in the intel_pstate driver that doesn't
    build without CONFIG_ACPI after recent changes (Dominik Brodowski).
 
  - One of the heuristics in the menu cpuidle governor is based on a
    function returning 0 most of the time, so drop it and clean up
    the scheduler code related to it (Daniel Lezcano).
 
  - Prevent the arm_big_little cpufreq driver from being used on ARM64
    which is not suitable for it and drop the arm_big_little_dt driver
    that is not used any more (Sudeep Holla).
 
  - Prevent the hung task watchdog from triggering during resume from
    system-wide sleep states by disabling it before freezing tasks and
    enabling it again after they have been thawed (Vitaly Kuznetsov).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJb2BJ7AAoJEILEb/54YlRx/kwP/iD7tUUZ6mT84OI0FTbEj8A/
 fM+uHrwy25PmqyWGGtbHpaWU9OxVxUReSicsBCt+2LZmX3sFYpbSb243mv3pmxqb
 A0kLflG4lWCKJNIfa/a3OMDTUw26mxSTCidE3jJXkd8HkWrzeAWvMair+UCuzMf3
 A4Omu0IkNL8C0MKtUOb3PlUk3dnLYMxuairNhozBPhi+P+0tLW9/9XvgPJBVhnbZ
 CKn/aFsDoc08tAfxC8N32cgKwE7nbeIgTJTBFyu2lQmInsd4TTuoM50vSC5i+x88
 AmBOoH9IX0fhXJ6hgm+VMW8+x9S+H7jAVy/3C2xoUBeCclzlxX6eUCtjV5YNZqqn
 1nXQfGeAwgzX6Tyu6HjM7vjbfObk59ZwpmDRPJEUEhLDEBMS+iDStlp9zmKTedNm
 G4iSTzS6qJCNPtx4y5wkLp/FvzTofIuWqVFJSJC4+EoVKkbbw9xwaY+JKXUt1Uwx
 j+U6EtRhzL/kVX0nq+iQXXeANxCFNzI56Ov5O7mxjF1m/hDE/Gb2QEeIb6nRZC2A
 H3I2so2J3h1yTgadpGFFvJWaqfHkgcBTsm06tSgHVb86quiTANJIQ9mqfFyOzDDJ
 KaZ82MROt7UuCMI6X9n+oIBDZWLHmADge6RdHCD1wB+zrUmusCtNEHUZACXd0mPf
 s8MUK4bWVhViVXGS5bMP
 =/bnR
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These remove a questionable heuristic from the menu cpuidle governor,
  fix a recent build regression in the intel_pstate driver, clean up ARM
  big-Little support in cpufreq and fix up hung task watchdog's
  interaction with system-wide power management transitions.

  Specifics:

   - Fix build regression in the intel_pstate driver that doesn't build
     without CONFIG_ACPI after recent changes (Dominik Brodowski).

   - One of the heuristics in the menu cpuidle governor is based on a
     function returning 0 most of the time, so drop it and clean up the
     scheduler code related to it (Daniel Lezcano).

   - Prevent the arm_big_little cpufreq driver from being used on ARM64
     which is not suitable for it and drop the arm_big_little_dt driver
     that is not used any more (Sudeep Holla).

   - Prevent the hung task watchdog from triggering during resume from
     system-wide sleep states by disabling it before freezing tasks and
     enabling it again after they have been thawed (Vitaly Kuznetsov)"

* tag 'pm-4.20-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  kernel: hung_task.c: disable on suspend
  cpufreq: remove unused arm_big_little_dt driver
  cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
  cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
  cpuidle: menu: Remove get_loadavg() from the performance multiplier
  sched: Factor out nr_iowait and nr_iowait_cpu
2018-10-30 09:08:07 -07:00
Rafael J. Wysocki c4ac688993 Merge branches 'pm-cpuidle' and 'pm-cpufreq'
* pm-cpuidle:
  cpuidle: menu: Remove get_loadavg() from the performance multiplier
  sched: Factor out nr_iowait and nr_iowait_cpu

* pm-cpufreq:
  cpufreq: remove unused arm_big_little_dt driver
  cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
  cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI
2018-10-30 08:47:14 +01:00
Linus Torvalds 9f51ae62c8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) GRO overflow entries are not unlinked properly, resulting in list
    poison pointers being dereferenced.

 2) Fix bridge build with ipv6 disabled, from Nikolay Aleksandrov.

 3) Direct packet access and other fixes in BPF from Daniel Borkmann.

 4) gred_change_table_def() gets passed the wrong pointer, a pointer to
    a set of unparsed attributes instead of the attribute itself. From
    Jakub Kicinski.

 5) Allow macsec device to be brought up even if it's lowerdev is down,
    from Sabrina Dubroca.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: diag: document swapped src/dst in udp_dump_one.
  macsec: let the administrator set UP state even if lowerdev is down
  macsec: update operstate when lower device changes
  net: sched: gred: pass the right attribute to gred_change_table_def()
  ptp: drop redundant kasprintf() to create worker name
  net: bridge: remove ipv6 zero address check in mcast queries
  net: Properly unlink GRO packets on overflow.
  bpf: fix wrong helper enablement in cgroup local storage
  bpf: add bpf_jit_limit knob to restrict unpriv allocations
  bpf: make direct packet write unclone more robust
  bpf: fix leaking uninitialized memory on pop/peek helpers
  bpf: fix direct packet write into pop/peek helpers
  bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
  bpf: fix direct packet access for flow dissector progs
  bpf: disallow direct packet access for unpriv in cg_skb
  bpf: fix test suite to enable all unpriv program types
  bpf, btf: fix a missing check bug in btf_parse
  selftests/bpf: add config fragments BPF_STREAM_PARSER and XDP_SOCKETS
  bpf: devmap: fix wrong interface selection in notifier_call
2018-10-28 20:17:49 -07:00
Linus Torvalds ac747c0715 Kbuild updates for v4.20
- optimize kallsyms slightly
 
 - remove check for old CFLAGS usage
 
 - add some compiler flags unconditionally instead of evaluating
   $(call cc-option,...)
 
 - fix variable shadowing in host tools
 
 - refactor scripts/mkmakefile
 
 - refactor various makefiles
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJb1eyEAAoJED2LAQed4NsGUjwP/0/W0nLP+nKCJ0NpsSD151Ea
 Vrbm+RMxl8uPuQ0+Bh59rdO2yWR8v7jOX8CPbX1DqW0XwW4vTNRZm9j6A83rJzHv
 dPpinUObq6vXGJCYvMLoumhOkM1lRZie1AeBeLgKF0G0jprxUspGvakSyM/5SOo6
 3gIhpPcczijC980KHIJbmwiGqRFVs3/zwqcKjQaRD3C6f4HdJL3i9Zr7kuZF0g1x
 dxCdN4Shv5x0igggp58z8646bbKlN7hItYaPq2csDop55/jWKzUkBkFkeaDjiUdB
 B6ysQfwkuTb6sbKXE6euYLUTyc6epx9v9pey0FpOx5tjXT+QmgK1ddLQEwdFH2+/
 Fd5VR70h8uKTNvmqFJ+iebpR/aC71sUqYAzsoiTVZibR/F6+QAuhHJPZJDlSSwLC
 QH7j0fwJ3X5p9PYe3JBWyWzOd9BDvtV+HGE67Kx17iRNW0FDyKSvn7tezRVtqvHr
 KENrMT4n+lNQzB+c4lfjOE9ANpOf+PP4+ODhbrtvKItyb5QfF4F+5CDVif3wv3Kt
 6dvh13hvR06gQCpbuxcFgD17mT1T/lVieuzOnWjtEh4HiMI1H0abYfkRWbQshrti
 u7TfYxwAyWKtisiPiT/1THYsW3ux7QmTSIo2iOznQxDlbmzENpQx+f22HsLaC0XU
 AUgvKGBaN0WOar2f2gyq
 =jM4E
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - optimize kallsyms slightly

 - remove check for old CFLAGS usage

 - add some compiler flags unconditionally instead of evaluating
   $(call cc-option,...)

 - fix variable shadowing in host tools

 - refactor scripts/mkmakefile

 - refactor various makefiles

* tag 'kbuild-v4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  modpost: Create macro to avoid variable shadowing
  ASN.1: Remove unnecessary shadowed local variable
  kbuild: use 'else ifeq' for checksrc to improve readability
  kbuild: remove unneeded link_multi_deps
  kbuild: add -Wno-unused-but-set-variable flag unconditionally
  kbuild: add -Wdeclaration-after-statement flag unconditionally
  kbuild: add -Wno-pointer-sign flag unconditionally
  modpost: remove leftover symbol prefix handling for module device table
  kbuild: simplify command line creation in scripts/mkmakefile
  kbuild: do not pass $(objtree) to scripts/mkmakefile
  kbuild: remove user ID check in scripts/mkmakefile
  kbuild: remove VERSION and PATCHLEVEL from $(objtree)/Makefile
  kbuild: add --include-dir flag only for out-of-tree build
  kbuild: remove dead code in cmd_files calculation in top Makefile
  kbuild: hide most of targets when running config or mixed targets
  kbuild: remove old check for CFLAGS use
  kbuild: prefix Makefile.dtbinst path with $(srctree) unconditionally
  kallsyms: remove left-over Blackfin code
  kallsyms: reduce size a little on 64-bit
2018-10-28 13:22:35 -07:00
Linus Torvalds dad4f140ed Merge branch 'xarray' of git://git.infradead.org/users/willy/linux-dax
Pull XArray conversion from Matthew Wilcox:
 "The XArray provides an improved interface to the radix tree data
  structure, providing locking as part of the API, specifying GFP flags
  at allocation time, eliminating preloading, less re-walking the tree,
  more efficient iterations and not exposing RCU-protected pointers to
  its users.

  This patch set

   1. Introduces the XArray implementation

   2. Converts the pagecache to use it

   3. Converts memremap to use it

  The page cache is the most complex and important user of the radix
  tree, so converting it was most important. Converting the memremap
  code removes the only other user of the multiorder code, which allows
  us to remove the radix tree code that supported it.

  I have 40+ followup patches to convert many other users of the radix
  tree over to the XArray, but I'd like to get this part in first. The
  other conversions haven't been in linux-next and aren't suitable for
  applying yet, but you can see them in the xarray-conv branch if you're
  interested"

* 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
  radix tree: Remove multiorder support
  radix tree test: Convert multiorder tests to XArray
  radix tree tests: Convert item_delete_rcu to XArray
  radix tree tests: Convert item_kill_tree to XArray
  radix tree tests: Move item_insert_order
  radix tree test suite: Remove multiorder benchmarking
  radix tree test suite: Remove __item_insert
  memremap: Convert to XArray
  xarray: Add range store functionality
  xarray: Move multiorder_check to in-kernel tests
  xarray: Move multiorder_shrink to kernel tests
  xarray: Move multiorder account test in-kernel
  radix tree test suite: Convert iteration test to XArray
  radix tree test suite: Convert tag_tagged_items to XArray
  radix tree: Remove radix_tree_clear_tags
  radix tree: Remove radix_tree_maybe_preload_order
  radix tree: Remove split/join code
  radix tree: Remove radix_tree_update_node_t
  page cache: Finish XArray conversion
  dax: Convert page fault handlers to XArray
  ...
2018-10-28 11:35:40 -07:00
Masami Hiramatsu 18858511fd tracing: Return -ENOENT if there is no target synthetic event
Return -ENOENT error if there is no target synthetic event.
This notices an operation failure to user as below;

  # echo 'wakeup_latency u64 lat; pid_t pid;' > synthetic_events
  # echo '!wakeup' >> synthetic_events
  sh: write error: No such file or directory

Link: http://lkml.kernel.org/r/154013449986.25576.9487131386597290172.stgit@devbox

Acked-by: Tom Zanussi <zanussi@linux.intel.com>
Tested-by: Tom Zanussi <zanussi@linux.intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Rajvi Jingar <rajvi.jingar@intel.com>
Cc: stable@vger.kernel.org
Fixes: 4b147936fa ('tracing: Add support for 'synthetic' events')
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-28 04:58:15 -04:00
Steven Rostedt (VMware) a2acce5369 tracing: Have stack tracer trace full stack
The stack tracer traces every function call checking the current stack (in
non interrupt context), looking for the deepest stack, and saving it when it
finds a new max depth. The problem is that it calls save_stack_trace(), and
with the new ORC unwinder, it can skip too much. As it looks at the ip of
the function call in the backtrace to find where it should start, it doesn't
need to skip anything.

The stack trace selftest would fail when the kernel was complied with the
ORC UNDWINDER enabled. Without skipping functions when doing the stack
trace, it now passes again.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-27 11:15:04 -04:00
Nikolay Borisov da387e5c93 tracing: Export trace_dump_stack to modules
There is no reason for this function to be unexprted and it's a useful
debugging aid.

Link: http://lkml.kernel.org/r/1539759103-5923-1-git-send-email-nborisov@suse.com

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-10-27 11:15:03 -04:00
David S. Miller 6788fac820 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-10-27

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix toctou race in BTF header validation, from Martin and Wenwen.

2) Fix devmap interface comparison in notifier call which was
   neglecting netns, from Taehee.

3) Several fixes in various places, for example, correcting direct
   packet access and helper function availability, from Daniel.

4) Fix BPF kselftest config fragment to include af_xdp and sockmap,
   from Naresh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-26 21:41:49 -07:00
Linus Torvalds 345671ea0f Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...
2018-10-26 19:33:41 -07:00
Alexander Duyck 966cf44f63 mm: defer ZONE_DEVICE page initialization to the point where we init pgmap
The ZONE_DEVICE pages were being initialized in two locations.  One was
with the memory_hotplug lock held and another was outside of that lock.
The problem with this is that it was nearly doubling the memory
initialization time.  Instead of doing this twice, once while holding a
global lock and once without, I am opting to defer the initialization to
the one outside of the lock.  This allows us to avoid serializing the
overhead for memory init and we can instead focus on per-node init times.

One issue I encountered is that devm_memremap_pages and
hmm_devmmem_pages_create were initializing only the pgmap field the same
way.  One wasn't initializing hmm_data, and the other was initializing it
to a poison value.  Since this is something that is exposed to the driver
in the case of hmm I am opting for a third option and just initializing
hmm_data to 0 since this is going to be exposed to unknown third party
drivers.

[alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
  Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:34 -07:00
Johannes Weiner 2ce7135adc psi: cgroup support
On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner eb414681d5 psi: pressure stall information for CPU, memory, and IO
When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner 246b3b3342 sched: introduce this_rq_lock_irq()
do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
patch is adding another site with the same pattern, so provide a
convenience function for it.

Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner 1f351d7f75 sched: sched.h: make rq locking and clock functions available in stats.h
kernel/sched/sched.h includes "stats.h" half-way through the file.  The
next patch introduces users of sched.h's rq locking functions and
update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
the file so they are available in stats.h.

Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner 5c54f5b9ed sched: loadavg: make calc_load_n() public
It's going to be used in a later patch. Keep the churn separate.

Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner 8508cf3ffa sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD
There are several definitions of those functions/macros in places that
mess with fixed-point load averages.  Provide an official version.

[akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Tested-by: Daniel Drake <drake@endlessm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Johannes Weiner b1d29ba82c delayacct: track delays from thrashing cache pages
Delay accounting already measures the time a task spends in direct reclaim
and waiting for swapin, but in low memory situations tasks spend can spend
a significant amount of their time waiting on thrashing page cache.  This
isn't tracked right now.

To know the full impact of memory contention on an individual task,
measure the delay when waiting for a recently evicted active cache page to
read back into memory.

Also update tools/accounting/getdelays.c:

     [hannes@computer accounting]$ sudo ./getdelays -d -p 1
     print delayacct stats ON
     PID     1

     CPU             count     real total  virtual total    delay total  delay average
                     50318      745000000      847346785      400533713          0.008ms
     IO              count    delay total  delay average
                       435      122601218              0ms
     SWAP            count    delay total  delay average
                         0              0              0ms
     RECLAIM         count    delay total  delay average
                         0              0              0ms
     THRASHING       count    delay total  delay average
                        19       12621439              0ms

Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:26:32 -07:00
Roman Gushchin 9b6f7e163c mm: rework memcg kernel stack accounting
If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT.  So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.

The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.

Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.

To make this happen we need more than two subsequent exits without forks
in between on the current cpu, which makes it very unlikely to happen.  As
a result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.

As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.

Link: http://lkml.kernel.org/r/20180827162621.30187-1-guro@fb.com
Fixes: ac496bf48d ("fork: Optimize task creation by caching two thread stacks per CPU if CONFIG_VMAP_STACK=y")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-10-26 16:25:19 -07:00
Linus Torvalds befa936331 Second batch of dma-mapping updates for 4.20:
- various swiotlb cleanups
  - do not dip into the ѕwiotlb pool for dma coherent allocations
  - add support for not cache coherent DMA to swiotlb
  - switch ARM64 to use the generic swiotlb_dma_ops
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlvTDNELHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOfdg//a4nGw5bTwj2uol/oS/vOdpE0otR3XpnNC3oopHd3
 3xbb0LUR12A0YJmV5Ao9iIRFc60boLNblZglho8Diqks3P9+nYWCQ2rQ+o7ZsfK9
 xGBNP0HrdGKPBzs9JodvhP/WBNWEReNbb6KwMe63LtDqUqNDNFmlYfxUgQ+fbgST
 ZscR7iHJT1Nl0PoJ0mVziPFDA5DFB6JN7z0nU9uddVXLM7MPOFVX/WsFdLZLg/8W
 /jgQ1gQ+gRMBSyctR8u+LG5KTxtYofiPo46lbyuSa5sx+cnZmjo7Uk5OZgs3soHn
 gnobSjxgNbPgt8ttaKoFtOpyuO/3cLQeObS8i2MQuzvYpUD/nBT5q6l8XZBAFQCc
 BkOe0qaBa2BrSG2O7BRgagHuSUjpgkxR3vyVT5e6eyr7uFk+LWslJKx4Uem3uIRf
 yfNQzx91Xaw7ypaF7//jVPeWucLFZuwYsy13ZAg5D71vYpZBkfgHoMSCrmZQT9o/
 V1ijmJy2ZDlGIKH95A0Utr6QKrci1drigg8Zce5A2E2EPAzTiro90ZeGT2tcJmVs
 LOrbxEN1Dctkkk5l6PJE3k4zYCP7rW4lbW7wPcLZrJoptsm3sQprE10zaAFGKks6
 vPLY615FK/JydJ9AIjx6PhFiY6M6Zb2er6CbRo18BjpqjKHgmK0A/uCbLhN+Qnac
 IoE=
 =pI5p
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-4.20-1' of git://git.infradead.org/users/hch/dma-mapping

Pull more dma-mapping updates from Christoph Hellwig:

 - various swiotlb cleanups

 - do not dip into the ѕwiotlb pool for dma coherent allocations

 - add support for not cache coherent DMA to swiotlb

 - switch ARM64 to use the generic swiotlb_dma_ops

* tag 'dma-mapping-4.20-1' of git://git.infradead.org/users/hch/dma-mapping:
  arm64: use the generic swiotlb_dma_ops
  swiotlb: add support for non-coherent DMA
  swiotlb: don't dip into swiotlb pool for coherent allocations
  swiotlb: refactor swiotlb_map_page
  swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs
  swiotlb: merge swiotlb_unmap_page and unmap_single
  swiotlb: remove the overflow buffer
  swiotlb: do not panic on mapping failures
  swiotlb: mark is_swiotlb_buffer static
  swiotlb: remove a pointless comment
2018-10-26 11:29:17 -07:00
Linus Torvalds a67eefad99 Printk changes for 4.20
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJb0GRmAAoJEFKgDEdIgJTy/IEQAKOC4fHonA5LUa8GEO7s+byX
 LNQzH/NAR+86CCKdWzaiCpyNEbwzXC/5kFuGB+NIKGrutQ+HO1haKG6URRvZiw0c
 YgxaRpJ1h5OfZNuCjql5dX/bFAuBPwEPUAPusA4YJYSiXota2O76OW+RwEpr71i5
 /Z2ygi3nlPECOhS1jTwY+cxGci67cfBIzKKdTXEft53xO38xAp0+Ea5Ljf2kIbgl
 rNz6XqJcy7rcAwEvh1kHw0AVEauLWs4NRlLX5eX7FHnqoh4TVFxWhLfNKirRo7gb
 vHemuucVUdvgG8yoFyg9CkFNLIMV9fWyDXkxab7dvrgD61oceLbNZ7dL86eijz7j
 qBoQy/igiH1nqIiczhTtp+JltIMzjPmC3unaie7f+oTHnKinzAaaND3wUjqObdZm
 MZQWsjIpBXC1nIcIs35NZiVMs8xcOG/sekkRcjU6/kbrBkoRqR5xhbm/tIcaCj0Z
 wKTlgET9b4dnmX8ZiEpvrfeMGxEu4yqfh1O3rvKnk8hKgxTvnzsSriHKh86KSv1L
 Gby4BA1zYwQxsJJ6LZMJjtHptxKBTcLANx8C/E9wPETP4EM5A1m5egJYRlDW3hb9
 MhM4vzUQfq63b9gPduP10jlLrXsWBQRAcAvtvm2lou3TNqipm6ZqVn9vqkv0retR
 Auk7mO33MVpHbDOQw1GK
 =N+Ts
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk

Pull printk updates from Petr Mladek:

 - Fix two more locations where printf formatting leaked pointers

 - Better log_buf_len parameter handling

 - Add prefix to messages from printk code

 - Do not miss messages on other consoles when the log is replayed on a
   new one

 - Reduce race between console registration and panic() when the log
   might get replayed on all consoles

 - Some cont buffer code clean up

 - Call console only when there is something to do (log vs cont buffer)

* tag 'printk-for-4.20' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk:
  lib/vsprintf: Hash printed address for netdev bits fallback
  lib/vsprintf: Hash legacy clock addresses
  lib/vsprintf: Prepare for more general use of ptr_to_id()
  lib/vsprintf: Make ptr argument conts in ptr_to_id()
  printk: fix integer overflow in setup_log_buf()
  printk: do not preliminary split up cont buffer
  printk: lock/unlock console only for new logbuf entries
  printk: keep kernel cont support always enabled
  printk: Give error on attempt to set log buffer length to over 2G
  printk: Add KBUILD_MODNAME and remove a redundant print prefix
  printk: Correct wrong casting
  printk: Fix panic caused by passing log_buf_len to command line
  printk: CON_PRINTBUFFER console registration is a bit racy
  printk: Do not miss new messages when replaying the log
2018-10-25 17:11:52 -07:00
Daniel Borkmann ede95a63b5 bpf: add bpf_jit_limit knob to restrict unpriv allocations
Rick reported that the BPF JIT could potentially fill the entire module
space with BPF programs from unprivileged users which would prevent later
attempts to load normal kernel modules or privileged BPF programs, for
example. If JIT was enabled but unsuccessful to generate the image, then
before commit 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
we would always fall back to the BPF interpreter. Nowadays in the case
where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
with a failure since the BPF interpreter was compiled out.

Add a global limit and enforce it for unprivileged users such that in case
of BPF interpreter compiled out we fail once the limit has been reached
or we fall back to BPF interpreter earlier w/o using module mem if latter
was compiled in. In a next step, fair share among unprivileged users can
be resolved in particular for the case where we would fail hard once limit
is reached.

Fixes: 290af86629 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Fixes: 0a14842f5a ("net: filter: Just In Time compiler for x86-64")
Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:11:42 -07:00
Daniel Borkmann b09928b976 bpf: make direct packet write unclone more robust
Given this seems to be quite fragile and can easily slip through the
cracks, lets make direct packet write more robust by requiring that
future program types which allow for such write must provide a prologue
callback. In case of XDP and sk_msg it's noop, thus add a generic noop
handler there. The latter starts out with NULL data/data_end unconditionally
when sg pages are shared.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann d3f66e4116 bpf: fix leaking uninitialized memory on pop/peek helpers
Commit f1a2e44a3a ("bpf: add queue and stack maps") added helpers
with ARG_PTR_TO_UNINIT_MAP_VALUE. Meaning, the helper is supposed to
fill the map value buffer with data instead of reading from it like
in other helpers such as map update. However, given the buffer is
allowed to be uninitialized (since we fill it in the helper anyway),
it also means that the helper is obliged to wipe the memory in case
of an error in order to not allow for leaking uninitialized memory.
Given pop/peek is both handled inside __{stack,queue}_map_get(),
lets wipe it there on error case, that is, empty stack/queue.

Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Mauricio Vasquez B<mauricio.vasquez@polito.it>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann 80b0d86a17 bpf: fix direct packet write into pop/peek helpers
Commit f1a2e44a3a ("bpf: add queue and stack maps") probably just
copy-pasted .pkt_access for bpf_map_{pop,peek}_elem() helpers, but
this is buggy in this context since it would allow writes into cloned
skbs which is invalid. Therefore, disable .pkt_access for the two.

Fixes: f1a2e44a3a ("bpf: add queue and stack maps")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Mauricio Vasquez B <mauricio.vasquez@polito.it>
Acked-by: Mauricio Vasquez B<mauricio.vasquez@polito.it>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann d5563d367c bpf: fix cg_skb types to hint access type in may_access_direct_pkt_data
Commit b39b5f411d ("bpf: add cg_skb_is_valid_access for
BPF_PROG_TYPE_CGROUP_SKB") added direct packet access for skbs in
cg_skb program types, however allowed access type was not added to
the may_access_direct_pkt_data() helper. Therefore the latter always
returns false. This is not directly an issue, it just means writes
are unconditionally disabled (which is correct) but also reads.
Latter is relevant in this function when BPF helpers may read direct
packet data which is unconditionally disabled then. Fix it by properly
adding BPF_PROG_TYPE_CGROUP_SKB to may_access_direct_pkt_data().

Fixes: b39b5f411d ("bpf: add cg_skb_is_valid_access for BPF_PROG_TYPE_CGROUP_SKB")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Daniel Borkmann 5d66fa7d9e bpf: fix direct packet access for flow dissector progs
Commit d58e468b11 ("flow_dissector: implements flow dissector BPF
hook") added direct packet access for skbs in may_access_direct_pkt_data()
function where this enables read and write access to the skb->data. This
is buggy because without a prologue generator such as bpf_unclone_prologue()
we would allow for writing into cloned skbs. Original intention might have
been to only allow read access where this is not needed (similar as the
flow_dissector_func_proto() indicates which enables only bpf_skb_load_bytes()
as well), therefore this patch fixes it to restrict to read-only.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Petar Penkov <ppenkov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-10-25 17:02:06 -07:00
Martin Lau 4a6998aff8 bpf, btf: fix a missing check bug in btf_parse
Wenwen Wang reported:

  In btf_parse(), the header of the user-space btf data 'btf_data'
  is firstly parsed and verified through btf_parse_hdr().
  In btf_parse_hdr(), the header is copied from user-space 'btf_data'
  to kernel-space 'btf->hdr' and then verified. If no error happens
  during the verification process, the whole data of 'btf_data',
  including the header, is then copied to 'data' in btf_parse(). It
  is obvious that the header is copied twice here. More importantly,
  no check is enforced after the second copy to make sure the headers
  obtained in these two copies are same. Given that 'btf_data' resides
  in the user space, a malicious user can race to modify the header
  between these two copies. By doing so, the user can inject
  inconsistent data, which can cause undefined behavior of the
  kernel and introduce potential security risk.

This issue is similar to the one fixed in commit 8af03d1ae2 ("bpf:
btf: Fix a missing check bug"). To fix it, this patch copies the user
'btf_data' *before* parsing / verifying the BTF header.

Fixes: 69b693f0ae ("bpf: btf: Introduce BPF Type Format (BTF)")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Co-developed-by: Wenwen Wang <wang6495@umn.edu>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-26 00:42:03 +02:00
Taehee Yoo f592f80483 bpf: devmap: fix wrong interface selection in notifier_call
The dev_map_notification() removes interface in devmap if
unregistering interface's ifindex is same.
But only checking ifindex is not enough because other netns can have
same ifindex. so that wrong interface selection could occurred.
Hence netdev pointer comparison code is added.

v2: compare netdev pointer instead of using net_eq() (Daniel Borkmann)
v1: Initial patch

Fixes: 2ddf71e23c ("net: add notifier hooks for devmap bpf map")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-10-26 00:32:21 +02:00
Linus Torvalds 5947a64a7e Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "The interrupt brigade came up with the following updates:

   - Driver for the Marvell System Error Interrupt machinery

   - Overhaul of the GIC-V3 ITS driver

   - Small updates and fixes all over the place"

* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  genirq: Fix race on spurious interrupt detection
  softirq: Fix typo in __do_softirq() comments
  genirq: Fix grammar s/an /a /
  irqchip/gic: Unify GIC priority definitions
  irqchip/gic-v3: Remove acknowledge loop
  dt-bindings/interrupt-controller: Add documentation for Marvell SEI controller
  dt-bindings/interrupt-controller: Update Marvell ICU bindings
  irqchip/irq-mvebu-icu: Add support for System Error Interrupts (SEI)
  arm64: marvell: Enable SEI driver
  irqchip/irq-mvebu-sei: Add new driver for Marvell SEI
  irqchip/irq-mvebu-icu: Support ICU subnodes
  irqchip/irq-mvebu-icu: Disociate ICU and NSR
  irqchip/irq-mvebu-icu: Clarify the reset operation of configured interrupts
  irqchip/irq-mvebu-icu: Fix wrong private data retrieval
  dt-bindings/interrupt-controller: Fix Marvell ICU length in the example
  genirq/msi: Allow creation of a tree-based irqdomain for platform-msi
  dt-bindings: irqchip: renesas-irqc: Document r8a7744 support
  dt-bindings: irqchip: renesas-irqc: Document R-Car E3 support
  irqchip/pdc: Setup all edge interrupts as rising edge at GIC
  irqchip/gic-v3-its: Allow use of LPI tables in reserved memory
  ...
2018-10-25 11:43:47 -07:00
Linus Torvalds 4dcb9239da Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
 "The timers and timekeeping departement provides:

   - Another large y2038 update with further preparations for providing
     the y2038 safe timespecs closer to the syscalls.

   - An overhaul of the SHCMT clocksource driver

   - SPDX license identifier updates

   - Small cleanups and fixes all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
  tick/sched : Remove redundant cpu_online() check
  clocksource/drivers/dw_apb: Add reset control
  clocksource: Remove obsolete CLOCKSOURCE_OF_DECLARE
  clocksource/drivers: Unify the names to timer-* format
  clocksource/drivers/sh_cmt: Add R-Car gen3 support
  dt-bindings: timer: renesas: cmt: document R-Car gen3 support
  clocksource/drivers/sh_cmt: Properly line-wrap sh_cmt_of_table[] initializer
  clocksource/drivers/sh_cmt: Fix clocksource width for 32-bit machines
  clocksource/drivers/sh_cmt: Fixup for 64-bit machines
  clocksource/drivers/sh_tmu: Convert to SPDX identifiers
  clocksource/drivers/sh_mtu2: Convert to SPDX identifiers
  clocksource/drivers/sh_cmt: Convert to SPDX identifiers
  clocksource/drivers/renesas-ostm: Convert to SPDX identifiers
  clocksource: Convert to using %pOFn instead of device_node.name
  tick/broadcast: Remove redundant check
  RISC-V: Request newstat syscalls
  y2038: signal: Change rt_sigtimedwait to use __kernel_timespec
  y2038: socket: Change recvmmsg to use __kernel_timespec
  y2038: sched: Change sched_rr_get_interval to use __kernel_timespec
  y2038: utimes: Rework #ifdef guards for compat syscalls
  ...
2018-10-25 11:14:36 -07:00
Vitaly Kuznetsov a1c6ca3c6d kernel: hung_task.c: disable on suspend
It is possible to observe hung_task complaints when system goes to
suspend-to-idle state:

 # echo freeze > /sys/power/state

 PM: Syncing filesystems ... done.
 Freezing user space processes ... (elapsed 0.001 seconds) done.
 OOM killer disabled.
 Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
 sd 0:0:0:0: [sda] Synchronizing SCSI cache
 INFO: task bash:1569 blocked for more than 120 seconds.
       Not tainted 4.19.0-rc3_+ #687
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 bash            D    0  1569    604 0x00000000
 Call Trace:
  ? __schedule+0x1fe/0x7e0
  schedule+0x28/0x80
  suspend_devices_and_enter+0x4ac/0x750
  pm_suspend+0x2c0/0x310

Register a PM notifier to disable the detector on suspend and re-enable
back on wakeup.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-25 18:45:08 +02:00
Daniel Lezcano a7fe5190c0 cpuidle: menu: Remove get_loadavg() from the performance multiplier
The function get_loadavg() returns almost always zero. To be more
precise, statistically speaking for a total of 1023379 times passing
in the function, the load is equal to zero 1020728 times, greater than
100, 610 times, the remaining is between 0 and 5.

In 2011, the get_loadavg() was removed from the Android tree because
of the above [1]. At this time, the load was:

unsigned long this_cpu_load(void)
{
        struct rq *this = this_rq();
        return this->cpu_load[0];
}

In 2014, the code was changed by commit 372ba8cb46 (cpuidle: menu: Lookup CPU
runqueues less) and the load is:

void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
{
        struct rq *rq = this_rq();
        *nr_waiters = atomic_read(&rq->nr_iowait);
        *load = rq->load.weight;
}

with the same result.

Both measurements show using the load in this code path does no matter
anymore. Removing it.

[1] 4dedd9f124

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-25 16:49:27 +02:00
Daniel Lezcano 145d952a29 sched: Factor out nr_iowait and nr_iowait_cpu
The function nr_iowait_cpu() can be used directly by nr_iowait() instead
of duplicating code.

Call nr_iowait_cpu() from nr_iowait()

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2018-10-25 16:49:26 +02:00
Linus Torvalds 638820d8da Merge branch 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "In this patchset, there are a couple of minor updates, as well as some
  reworking of the LSM initialization code from Kees Cook (these prepare
  the way for ordered stackable LSMs, but are a valuable cleanup on
  their own)"

* 'next-general' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  LSM: Don't ignore initialization failures
  LSM: Provide init debugging infrastructure
  LSM: Record LSM name in struct lsm_info
  LSM: Convert security_initcall() into DEFINE_LSM()
  vmlinux.lds.h: Move LSM_TABLE into INIT_DATA
  LSM: Convert from initcall to struct lsm_info
  LSM: Remove initcall tracing
  LSM: Rename .security_initcall section to .lsm_info
  vmlinux.lds.h: Avoid copy/paste of security_init section
  LSM: Correctly announce start of LSM initialization
  security: fix LSM description location
  keys: Fix the use of the C++ keyword "private" in uapi/linux/keyctl.h
  seccomp: remove unnecessary unlikely()
  security: tomoyo: Fix obsolete function
  security/capabilities: remove check for -EINVAL
2018-10-24 11:49:35 +01:00
Linus Torvalds ba9f6f8954 Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo updates from Eric Biederman:
 "I have been slowly sorting out siginfo and this is the culmination of
  that work.

  The primary result is in several ways the signal infrastructure has
  been made less error prone. The code has been updated so that manually
  specifying SEND_SIG_FORCED is never necessary. The conversion to the
  new siginfo sending functions is now complete, which makes it
  difficult to send a signal without filling in the proper siginfo
  fields.

  At the tail end of the patchset comes the optimization of decreasing
  the size of struct siginfo in the kernel from 128 bytes to about 48
  bytes on 64bit. The fundamental observation that enables this is by
  definition none of the known ways to use struct siginfo uses the extra
  bytes.

  This comes at the cost of a small user space observable difference.
  For the rare case of siginfo being injected into the kernel only what
  can be copied into kernel_siginfo is delivered to the destination, the
  rest of the bytes are set to 0. For cases where the signal and the
  si_code are known this is safe, because we know those bytes are not
  used. For cases where the signal and si_code combination is unknown
  the bits that won't fit into struct kernel_siginfo are tested to
  verify they are zero, and the send fails if they are not.

  I made an extensive search through userspace code and I could not find
  anything that would break because of the above change. If it turns out
  I did break something it will take just the revert of a single change
  to restore kernel_siginfo to the same size as userspace siginfo.

  Testing did reveal dependencies on preferring the signo passed to
  sigqueueinfo over si->signo, so bit the bullet and added the
  complexity necessary to handle that case.

  Testing also revealed bad things can happen if a negative signal
  number is passed into the system calls. Something no sane application
  will do but something a malicious program or a fuzzer might do. So I
  have fixed the code that performs the bounds checks to ensure negative
  signal numbers are handled"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
  signal: Guard against negative signal numbers in copy_siginfo_from_user32
  signal: Guard against negative signal numbers in copy_siginfo_from_user
  signal: In sigqueueinfo prefer sig not si_signo
  signal: Use a smaller struct siginfo in the kernel
  signal: Distinguish between kernel_siginfo and siginfo
  signal: Introduce copy_siginfo_from_user and use it's return value
  signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
  signal: Fail sigqueueinfo if si_signo != sig
  signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
  signal/unicore32: Use force_sig_fault where appropriate
  signal/unicore32: Generate siginfo in ucs32_notify_die
  signal/unicore32: Use send_sig_fault where appropriate
  signal/arc: Use force_sig_fault where appropriate
  signal/arc: Push siginfo generation into unhandled_exception
  signal/ia64: Use force_sig_fault where appropriate
  signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
  signal/ia64: Use the generic force_sigsegv in setup_frame
  signal/arm/kvm: Use send_sig_mceerr
  signal/arm: Use send_sig_fault where appropriate
  signal/arm: Use force_sig_fault where appropriate
  ...
2018-10-24 11:22:39 +01:00
Linus Torvalds 50b825d7e8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Add VF IPSEC offload support in ixgbe, from Shannon Nelson.

 2) Add zero-copy AF_XDP support to i40e, from Björn Töpel.

 3) All in-tree drivers are converted to {g,s}et_link_ksettings() so we
    can get rid of the {g,s}et_settings ethtool callbacks, from Michal
    Kubecek.

 4) Add software timestamping to veth driver, from Michael Walle.

 5) More work to make packet classifiers and actions lockless, from Vlad
    Buslov.

 6) Support sticky FDB entries in bridge, from Nikolay Aleksandrov.

 7) Add ipv6 version of IP_MULTICAST_ALL sockopt, from Andre Naujoks.

 8) Support batching of XDP buffers in vhost_net, from Jason Wang.

 9) Add flow dissector BPF hook, from Petar Penkov.

10) i40e vf --> generic iavf conversion, from Jesse Brandeburg.

11) Add NLA_REJECT netlink attribute policy type, to signal when users
    provide attributes in situations which don't make sense. From
    Johannes Berg.

12) Switch TCP and fair-queue scheduler over to earliest departure time
    model. From Eric Dumazet.

13) Improve guest receive performance by doing rx busy polling in tx
    path of vhost networking driver, from Tonghao Zhang.

14) Add per-cgroup local storage to bpf

15) Add reference tracking to BPF, from Joe Stringer. The verifier can
    now make sure that references taken to objects are properly released
    by the program.

16) Support in-place encryption in TLS, from Vakul Garg.

17) Add new taprio packet scheduler, from Vinicius Costa Gomes.

18) Lots of selftests additions, too numerous to mention one by one here
    but all of which are very much appreciated.

19) Support offloading of eBPF programs containing BPF to BPF calls in
    nfp driver, frm Quentin Monnet.

20) Move dpaa2_ptp driver out of staging, from Yangbo Lu.

21) Lots of u32 classifier cleanups and simplifications, from Al Viro.

22) Add new strict versions of netlink message parsers, and enable them
    for some situations. From David Ahern.

23) Evict neighbour entries on carrier down, also from David Ahern.

24) Support BPF sk_msg verdict programs with kTLS, from Daniel Borkmann
    and John Fastabend.

25) Add support for filtering route dumps, from David Ahern.

26) New igc Intel driver for 2.5G parts, from Sasha Neftin et al.

27) Allow vxlan enslavement to bridges in mlxsw driver, from Ido
    Schimmel.

28) Add queue and stack map types to eBPF, from Mauricio Vasquez B.

29) Add back byte-queue-limit support to r8169, with all the bug fixes
    in other areas of the driver it works now! From Florian Westphal and
    Heiner Kallweit.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2147 commits)
  tcp: add tcp_reset_xmit_timer() helper
  qed: Fix static checker warning
  Revert "be2net: remove desc field from be_eq_obj"
  Revert "net: simplify sock_poll_wait"
  net: socionext: Reset tx queue in ndo_stop
  net: socionext: Add dummy PHY register read in phy_write()
  net: socionext: Stop PHY before resetting netsec
  net: stmmac: Set OWN bit for jumbo frames
  arm64: dts: stratix10: Support Ethernet Jumbo frame
  tls: Add maintainers
  net: ethernet: ti: cpsw: unsync mcast entries while switch promisc mode
  octeontx2-af: Support for NIXLF's UCAST/PROMISC/ALLMULTI modes
  octeontx2-af: Support for setting MAC address
  octeontx2-af: Support for changing RSS algorithm
  octeontx2-af: NIX Rx flowkey configuration for RSS
  octeontx2-af: Install ucast and bcast pkt forwarding rules
  octeontx2-af: Add LMAC channel info to NIXLF_ALLOC response
  octeontx2-af: NPC MCAM and LDATA extract minimal configuration
  octeontx2-af: Enable packet length and csum validation
  octeontx2-af: Support for VTAG strip and capture
  ...
2018-10-24 06:47:44 +01:00
Linus Torvalds 034bda1cd5 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "Two main changes:

   - Cleanups, simplifications and CLOCK_TAI support (Thomas Gleixner)

   - Improve code generation (Andy Lutomirski)"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Rearrange do_hres() to improve code generation
  x86/vdso: Document vgtod_ts better
  x86/vdso: Remove "memory" clobbers in the vDSO syscall fallbacks
  x66/vdso: Add CLOCK_TAI support
  x86/vdso: Move cycle_last handling into the caller
  x86/vdso: Simplify the invalid vclock case
  x86/vdso: Replace the clockid switch case
  x86/vdso: Collapse coarse functions
  x86/vdso: Collapse high resolution functions
  x86/vdso: Introduce and use vgtod_ts
  x86/vdso: Use unsigned int consistently for vsyscall_gtod_data:: Seq
  x86/vdso: Enforce 64bit clocksource
  x86/time: Implement clocksource_arch_init()
  clocksource: Provide clocksource_arch_init()
2018-10-23 19:07:25 +01:00
Linus Torvalds d82924c3b8 Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 pti updates from Ingo Molnar:
 "The main changes:

   - Make the IBPB barrier more strict and add STIBP support (Jiri
     Kosina)

   - Micro-optimize and clean up the entry code (Andy Lutomirski)

   - ... plus misc other fixes"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/speculation: Propagate information about RSB filling mitigation to sysfs
  x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation
  x86/speculation: Apply IBPB more strictly to avoid cross-process data leak
  x86/speculation: Add RETPOLINE_AMD support to the inline asm CALL_NOSPEC variant
  x86/CPU: Fix unused variable warning when !CONFIG_IA32_EMULATION
  x86/pti/64: Remove the SYSCALL64 entry trampoline
  x86/entry/64: Use the TSS sp2 slot for SYSCALL/SYSRET scratch space
  x86/entry/64: Document idtentry
2018-10-23 18:43:04 +01:00
Linus Torvalds 99792e0cea Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Lots of changes in this cycle:

   - Lots of CPA (change page attribute) optimizations and related
     cleanups (Thomas Gleixner, Peter Zijstra)

   - Make lazy TLB mode even lazier (Rik van Riel)

   - Fault handler cleanups and improvements (Dave Hansen)

   - kdump, vmcore: Enable kdumping encrypted memory with AMD SME
     enabled (Lianbo Jiang)

   - Clean up VM layout documentation (Baoquan He, Ingo Molnar)

   - ... plus misc other fixes and enhancements"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
  x86/stackprotector: Remove the call to boot_init_stack_canary() from cpu_startup_entry()
  x86/mm: Kill stray kernel fault handling comment
  x86/mm: Do not warn about PCI BIOS W+X mappings
  resource: Clean it up a bit
  resource: Fix find_next_iomem_res() iteration issue
  resource: Include resource end in walk_*() interfaces
  x86/kexec: Correct KEXEC_BACKUP_SRC_END off-by-one error
  x86/mm: Remove spurious fault pkey check
  x86/mm/vsyscall: Consider vsyscall page part of user address space
  x86/mm: Add vsyscall address helper
  x86/mm: Fix exception table comments
  x86/mm: Add clarifying comments for user addr space
  x86/mm: Break out user address space handling
  x86/mm: Break out kernel address space handling
  x86/mm: Clarify hardware vs. software "error_code"
  x86/mm/tlb: Make lazy TLB mode lazier
  x86/mm/tlb: Add freed_tables element to flush_tlb_info
  x86/mm/tlb: Add freed_tables argument to flush_tlb_mm_range
  smp,cpumask: introduce on_each_cpu_cond_mask
  smp: use __cpumask_set_cpu in on_each_cpu_cond
  ...
2018-10-23 17:05:28 +01:00
Linus Torvalds cbbfb0ae2c Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 apic updates from Ingo Molnar:
 "Improve the spreading of managed IRQs at allocation time"

* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irq/matrix: Spread managed interrupts on allocation
  irq/matrix: Split out the CPU selection code into a helper
2018-10-23 15:15:20 +01:00
Linus Torvalds 42f52e1c59 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes are:

   - Migrate CPU-intense 'misfit' tasks on asymmetric capacity systems,
     to better utilize (much) faster 'big core' CPUs. (Morten Rasmussen,
     Valentin Schneider)

   - Topology handling improvements, in particular when CPU capacity
     changes and related load-balancing fixes/improvements (Morten
     Rasmussen)

   - ... plus misc other improvements, fixes and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  sched/completions/Documentation: Add recommendation for dynamic and ONSTACK completions
  sched/completions/Documentation: Clean up the document some more
  sched/completions/Documentation: Fix a couple of punctuation nits
  cpu/SMT: State SMT is disabled even with nosmt and without "=force"
  sched/core: Fix comment regarding nr_iowait_cpu() and get_iowait_load()
  sched/fair: Remove setting task's se->runnable_weight during PELT update
  sched/fair: Disable LB_BIAS by default
  sched/pelt: Fix warning and clean up IRQ PELT config
  sched/topology: Make local variables static
  sched/debug: Use symbolic names for task state constants
  sched/numa: Remove unused numa_stats::nr_running field
  sched/numa: Remove unused code from update_numa_stats()
  sched/debug: Explicitly cast sched_feat() to bool
  sched/core: Disable SD_PREFER_SIBLING on asymmetric CPU capacity domains
  sched/fair: Don't move tasks to lower capacity CPUs unless necessary
  sched/fair: Set rq->rd->overload when misfit
  sched/fair: Wrap rq->rd->overload accesses with READ/WRITE_ONCE()
  sched/core: Change root_domain->overload type to int
  sched/fair: Change 'prefer_sibling' type to bool
  sched/fair: Kick nohz balance if rq->misfit_task_load
  ...
2018-10-23 15:00:03 +01:00
Linus Torvalds c05f3642f4 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "The main updates in this cycle were:

   - Lots of perf tooling changes too voluminous to list (big perf trace
     and perf stat improvements, lots of libtraceevent reorganization,
     etc.), so I'll list the authors and refer to the changelog for
     details:

       Benjamin Peterson, Jérémie Galarneau, Kim Phillips, Peter
       Zijlstra, Ravi Bangoria, Sangwon Hong, Sean V Kelley, Steven
       Rostedt, Thomas Gleixner, Ding Xiang, Eduardo Habkost, Thomas
       Richter, Andi Kleen, Sanskriti Sharma, Adrian Hunter, Tzvetomir
       Stoyanov, Arnaldo Carvalho de Melo, Jiri Olsa.

     ... with the bulk of the changes written by Jiri Olsa, Tzvetomir
     Stoyanov and Arnaldo Carvalho de Melo.

   - Continued intel_rdt work with a focus on playing well with perf
     events. This also imported some non-perf RDT work due to
     dependencies. (Reinette Chatre)

   - Implement counter freezing for Arch Perfmon v4 (Skylake and newer).
     This allows to speed up the PMI handler by avoiding unnecessary MSR
     writes and make it more accurate. (Andi Kleen)

   - kprobes cleanups and simplification (Masami Hiramatsu)

   - Intel Goldmont PMU updates (Kan Liang)

   - ... plus misc other fixes and updates"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (155 commits)
  kprobes/x86: Use preempt_enable() in optimized_callback()
  x86/intel_rdt: Prevent pseudo-locking from using stale pointers
  kprobes, x86/ptrace.h: Make regs_get_kernel_stack_nth() not fault on bad stack
  perf/x86/intel: Export mem events only if there's PEBS support
  x86/cpu: Drop pointless static qualifier in punit_dev_state_show()
  x86/intel_rdt: Fix initial allocation to consider CDP
  x86/intel_rdt: CBM overlap should also check for overlap with CDP peer
  x86/intel_rdt: Introduce utility to obtain CDP peer
  tools lib traceevent, perf tools: Move struct tep_handler definition in a local header file
  tools lib traceevent: Separate out tep_strerror() for strerror_r() issues
  perf python: More portable way to make CFLAGS work with clang
  perf python: Make clang_has_option() work on Python 3
  perf tools: Free temporary 'sys' string in read_event_files()
  perf tools: Avoid double free in read_event_file()
  perf tools: Free 'printk' string in parse_ftrace_printk()
  perf tools: Cleanup trace-event-info 'tdata' leak
  perf strbuf: Match va_{add,copy} with va_end
  perf test: S390 does not support watchpoints in test 22
  perf auxtrace: Include missing asm/bitsperlong.h to get BITS_PER_LONG
  tools include: Adopt linux/bits.h
  ...
2018-10-23 13:32:18 +01:00
Linus Torvalds 0200fbdd43 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking and misc x86 updates from Ingo Molnar:
 "Lots of changes in this cycle - in part because locking/core attracted
  a number of related x86 low level work which was easier to handle in a
  single tree:

   - Linux Kernel Memory Consistency Model updates (Alan Stern, Paul E.
     McKenney, Andrea Parri)

   - lockdep scalability improvements and micro-optimizations (Waiman
     Long)

   - rwsem improvements (Waiman Long)

   - spinlock micro-optimization (Matthew Wilcox)

   - qspinlocks: Provide a liveness guarantee (more fairness) on x86.
     (Peter Zijlstra)

   - Add support for relative references in jump tables on arm64, x86
     and s390 to optimize jump labels (Ard Biesheuvel, Heiko Carstens)

   - Be a lot less permissive on weird (kernel address) uaccess faults
     on x86: BUG() when uaccess helpers fault on kernel addresses (Jann
     Horn)

   - macrofy x86 asm statements to un-confuse the GCC inliner. (Nadav
     Amit)

   - ... and a handful of other smaller changes as well"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  locking/lockdep: Make global debug_locks* variables read-mostly
  locking/lockdep: Fix debug_locks off performance problem
  locking/pvqspinlock: Extend node size when pvqspinlock is configured
  locking/qspinlock_stat: Count instances of nested lock slowpaths
  locking/qspinlock, x86: Provide liveness guarantee
  x86/asm: 'Simplify' GEN_*_RMWcc() macros
  locking/qspinlock: Rework some comments
  locking/qspinlock: Re-order code
  locking/lockdep: Remove duplicated 'lock_class_ops' percpu array
  x86/defconfig: Enable CONFIG_USB_XHCI_HCD=y
  futex: Replace spin_is_locked() with lockdep
  locking/lockdep: Make class->ops a percpu counter and move it under CONFIG_DEBUG_LOCKDEP=y
  x86/jump-labels: Macrofy inline assembly code to work around GCC inlining bugs
  x86/cpufeature: Macrofy inline assembly code to work around GCC inlining bugs
  x86/extable: Macrofy inline assembly code to work around GCC inlining bugs
  x86/paravirt: Work around GCC inlining bugs when compiling paravirt ops
  x86/bug: Macrofy the BUG table section handling, to work around GCC inlining bugs
  x86/alternatives: Macrofy lock prefixes to work around GCC inlining bugs
  x86/refcount: Work around GCC inlining bug
  x86/objtool: Use asm macros to work around GCC inlining bugs
  ...
2018-10-23 13:08:53 +01:00
Linus Torvalds cee1352f79 Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The biggest change in this cycle is the conclusion of the big
  'simplify RCU to two primary flavors' consolidation work - i.e.
  there's a single RCU flavor for any kernel variant (PREEMPT and
  !PREEMPT):

    - Consolidate the RCU-bh, RCU-preempt, and RCU-sched flavors into a
      single flavor similar to RCU-sched in !PREEMPT kernels and into a
      single flavor similar to RCU-preempt (but also waiting on
      preempt-disabled sequences of code) in PREEMPT kernels.

      This branch also includes a refactoring of
      rcu_{nmi,irq}_{enter,exit}() from Byungchul Park.

    - Now that there is only one RCU flavor in any given running kernel,
      the many "rsp" pointers are no longer required, and this cleanup
      series removes them.

    - This branch carries out additional cleanups made possible by the
      RCU flavor consolidation, including inlining now-trivial
      functions, updating comments and definitions, and removing
      now-unneeded rcutorture scenarios.

    - Now that there is only one flavor of RCU in any running kernel,
      there is also only on rcu_data structure per CPU. This means that
      the rcu_dynticks structure can be merged into the rcu_data
      structure, a task taken on by this branch. This branch also
      contains a -rt-related fix from Mike Galbraith.

  There were also other updates:

    - Documentation updates, including some good-eye catches from Joel
      Fernandes.

    - SRCU updates, most notably changes enabling call_srcu() to be
      invoked very early in the boot sequence.

    - Torture-test updates, including some preliminary work towards
      making rcutorture better able to find problems that result in
      insufficient grace-period forward progress.

    - Initial changes to RCU to better promote forward progress of grace
      periods, including fixing a bug found by Marius Hillenbrand and
      David Woodhouse, with the fix suggested by Peter Zijlstra"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (140 commits)
  srcu: Make early-boot call_srcu() reuse workqueue lists
  rcutorture: Test early boot call_srcu()
  srcu: Make call_srcu() available during very early boot
  rcu: Convert rcu_state.ofl_lock to raw_spinlock_t
  rcu: Remove obsolete ->dynticks_fqs and ->cond_resched_completed
  rcu: Switch ->dynticks to rcu_data structure, remove rcu_dynticks
  rcu: Switch dyntick nesting counters to rcu_data structure
  rcu: Switch urgent quiescent-state requests to rcu_data structure
  rcu: Switch lazy counts to rcu_data structure
  rcu: Switch last accelerate/advance to rcu_data structure
  rcu: Switch ->tick_nohz_enabled_snap to rcu_data structure
  rcu: Merge rcu_dynticks structure into rcu_data structure
  rcu: Remove unused rcu_dynticks_snap() from Tiny RCU
  rcu: Convert "1UL << x" to "BIT(x)"
  rcu: Avoid resched_cpu() when rescheduling the current CPU
  rcu: More aggressively enlist scheduler aid for nohz_full CPUs
  rcu: Compute jiffies_till_sched_qs from other kernel parameters
  rcu: Provide functions for determining if call_rcu() has been invoked
  rcu: Eliminate ->rcu_qs_ctr from the rcu_dynticks structure
  rcu: Motivate Tiny RCU forward progress
  ...
2018-10-23 12:31:17 +01:00
Ingo Molnar dda93b4538 Merge branch 'x86/cache' into perf/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-23 12:30:19 +02:00
Linus Torvalds 12dd08fa95 Power management updates for 4.20-rc1
- Backport hibernation bug fixes from x86-64 to x86-32 and
    consolidate hibernation handling on x86 to allow 32-bit
    systems to work in all of the cases in which 64-bit ones
    work (Zhimin Gu, Chen Yu).
 
  - Fix hibernation documentation (Vladimir D. Seleznev).
 
  - Update the menu cpuidle governor to fix a couple of issues
    with it, make it more efficient in some cases and clean it
    up (Rafael Wysocki).
 
  - Rework the cpuidle polling state implementation to make it
    more efficient (Rafael Wysocki).
 
  - Clean up the cpuidle core somewhat (Fieah Lim).
 
  - Fix the cpufreq conservative governor to take policy limits
    into account properly in some cases (Rafael Wysocki).
 
  - Add support for retrieving guaranteed performance information
    to the ACPI CPPC library and make the intel_pstate driver use
    it to expose the CPU base frequency via sysfs on systems with
    the hardware-managed P-states (HWP) feature enabled (Srinivas
    Pandruvada).
 
  - Fix clang warning in the CPPC cpufreq driver (Nathan Chancellor).
 
  - Get rid of device_node.name printing from cpufreq (Rob Herring).
 
  - Remove unnecessary unlikely() from the cpufreq core (Igor Stoppa).
 
  - Add support for the r8a7744 SoC to the cpufreq-dt driver (Biju Das).
 
  - Update the dt-platdev cpufreq driver to allow RK3399 to have
    separate tunables per cluster (Dmitry Torokhov).
 
  - Fix the dma_alloc_coherent() usage in the tegra186 cpufreq driver
    (Christoph Hellwig).
 
  - Make the imx6q cpufreq driver read OCOTP through nvmem for
    imx6ul/imx6ull (Anson Huang).
 
  - Fix several bugs in the operating performance points (OPP)
    framework and make it more stable (Viresh Kumar, Dave Gerlach).
 
  - Update the devfreq subsystem to take changes in the APIs used
    by into account, fix some issues with it and make it stop
    print device_node.name directly (Bjorn Andersson, Enric Balletbo
    i Serra, Matthias Kaehlcke, Rob Herring, Vincent Donnefort, zhong
    jiang).
 
  - Prepare the generic power domains (genpd) framework for dealing
    with domains containing CPUs (Ulf Hansson).
 
  - Prevent sysfs attributes representing low-power S0 residency
    counters from being exposed if low-power S0 support is not
    indicated in ACPI FADT (Rajneesh Bhardwaj).
 
  - Get rid of custom CPU features macros for Intel CPUs from the
    intel_idle and RAPL drivers (Andy Shevchenko).
 
  - Update the tasks freezer to list tasks that refused to freeze
    and caused a system transition to a sleep state to be aborted
    (Todd Brandt).
 
  - Update the pm-graph set of tools to v5.2 (Todd Brandt).
 
  - Fix some issues in the cpupower utility (Anders Roxell, Prarit
    Bhargava).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJbyaznAAoJEILEb/54YlRxUkoP/iOroh5pMW7PDa1g8sG26bfN
 ICln5Tt9lv1Euk3QALc5r05kLjyObfoMoDwvH2oiM0TgwSw6G64tm/ansTsvbPpc
 DCk53d0/gSqv5B1dZxV6OUYoXP0Z5hD+nW+1dg6EiGr1h24kesdEXdSB09bfTUY3
 N4zUurWDUD92havuV3PakI/d/aOdxlwt9drwxv/cx4/gSYS0q5KtB2/N8YdWrk8Q
 1UNwZkQLO8I0URfp9bwvwG3VhgKn0SKpLHlajq9KzWDPRgCl32oB0tY+3fOHW9Q+
 djgMRA7xlAzAcCCL0vYJnEja6uMenvx3hZa1m68ZWFr0C25LQ5V87IEyZ3znvJQu
 IlcY9jMbYkX8dZz1M8LZA+nOtyYM5GxvgylaQvHRn8fi0jzYJWfJbAKdyvEX94qz
 UWtY35ihXFVBkhJuSxDPzluhMwxtd5uux1zO09/KlpUg8nnhxRx5l7AF7k7YyRk9
 TZ5dVa6kp8CdmBZK6E9FNHstfvECL64oc9Ig3CB/bRXYBm60hN9pLXO2abJKV7dU
 FUe4kmWUNus5QKOzfGuPKJokw34/vxBW2CVrOeRUNcuaRhlUwuboijeLPf23XLI/
 fYDI4EiMxAZvcEZ5h0KKDS0MaLv4uy0LbAdrWx8Eg7pNeFUiovDgovYUF7HOmn6M
 BzesklDaXWUSPWxlnASg
 =WJgu
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "These make hibernation on 32-bit x86 systems work in all of the cases
  in which it works on 64-bit x86 ones, update the menu cpuidle governor
  and the "polling" state to make them more efficient, add more hardware
  support to cpufreq drivers and fix issues with some of them, fix a bug
  in the conservative cpufreq governor, fix the operating performance
  points (OPP) framework and make it more stable, update the devfreq
  subsystem to take changes in the APIs used by into account and clean
  up some things all over.

  Specifics:

   - Backport hibernation bug fixes from x86-64 to x86-32 and
     consolidate hibernation handling on x86 to allow 32-bit systems to
     work in all of the cases in which 64-bit ones work (Zhimin Gu, Chen
     Yu).

   - Fix hibernation documentation (Vladimir D. Seleznev).

   - Update the menu cpuidle governor to fix a couple of issues with it,
     make it more efficient in some cases and clean it up (Rafael
     Wysocki).

   - Rework the cpuidle polling state implementation to make it more
     efficient (Rafael Wysocki).

   - Clean up the cpuidle core somewhat (Fieah Lim).

   - Fix the cpufreq conservative governor to take policy limits into
     account properly in some cases (Rafael Wysocki).

   - Add support for retrieving guaranteed performance information to
     the ACPI CPPC library and make the intel_pstate driver use it to
     expose the CPU base frequency via sysfs on systems with the
     hardware-managed P-states (HWP) feature enabled (Srinivas
     Pandruvada).

   - Fix clang warning in the CPPC cpufreq driver (Nathan Chancellor).

   - Get rid of device_node.name printing from cpufreq (Rob Herring).

   - Remove unnecessary unlikely() from the cpufreq core (Igor Stoppa).

   - Add support for the r8a7744 SoC to the cpufreq-dt driver (Biju
     Das).

   - Update the dt-platdev cpufreq driver to allow RK3399 to have
     separate tunables per cluster (Dmitry Torokhov).

   - Fix the dma_alloc_coherent() usage in the tegra186 cpufreq driver
     (Christoph Hellwig).

   - Make the imx6q cpufreq driver read OCOTP through nvmem for
     imx6ul/imx6ull (Anson Huang).

   - Fix several bugs in the operating performance points (OPP)
     framework and make it more stable (Viresh Kumar, Dave Gerlach).

   - Update the devfreq subsystem to take changes in the APIs used by
     into account, fix some issues with it and make it stop print
     device_node.name directly (Bjorn Andersson, Enric Balletbo i Serra,
     Matthias Kaehlcke, Rob Herring, Vincent Donnefort, zhong jiang).

   - Prepare the generic power domains (genpd) framework for dealing
     with domains containing CPUs (Ulf Hansson).

   - Prevent sysfs attributes representing low-power S0 residency
     counters from being exposed if low-power S0 support is not
     indicated in ACPI FADT (Rajneesh Bhardwaj).

   - Get rid of custom CPU features macros for Intel CPUs from the
     intel_idle and RAPL drivers (Andy Shevchenko).

   - Update the tasks freezer to list tasks that refused to freeze and
     caused a system transition to a sleep state to be aborted (Todd
     Brandt).

   - Update the pm-graph set of tools to v5.2 (Todd Brandt).

   - Fix some issues in the cpupower utility (Anders Roxell, Prarit
     Bhargava)"

* tag 'pm-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (73 commits)
  PM / Domains: Document flags for genpd
  PM / Domains: Deal with multiple states but no governor in genpd
  PM / Domains: Don't treat zero found compatible idle states as an error
  cpuidle: menu: Avoid computations when result will be discarded
  cpuidle: menu: Drop redundant comparison
  cpufreq: tegra186: don't pass GFP_DMA32 to dma_alloc_coherent()
  cpufreq: conservative: Take limits changes into account properly
  Documentation: intel_pstate: Add base_frequency information
  cpufreq: intel_pstate: Add base_frequency attribute
  ACPI / CPPC: Add support for guaranteed performance
  cpuidle: menu: Simplify checks related to the polling state
  PM / tools: sleepgraph and bootgraph: upgrade to v5.2
  PM / tools: sleepgraph: first batch of v5.2 changes
  cpupower: Fix coredump on VMWare
  cpupower: Fix AMD Family 0x17 msr_pstate size
  cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull
  cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster
  cpuidle: poll_state: Revise loop termination condition
  cpuidle: menu: Move the latency_req == 0 special case check
  cpuidle: menu: Avoid computations for very close timers
  ...
2018-10-23 10:28:21 +01:00
Olivier Brunel 876dcf2f3a umh: Add command line to user mode helpers
User mode helpers were spawned without a command line, and because
an empty command line is used by many tools to identify processes as
kernel threads, this could cause some issues.

Notably during killing spree on shutdown, since such helper would then
be skipped (i.e. not killed) which would result in the process remaining
alive, and thus preventing unmouting of the rootfs (as experienced with
the bpfilter umh).

Fixes: 449325b52b ("umh: introduce fork_usermode_blob() helper")
Signed-off-by: Olivier Brunel <jjk@jjacky.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-22 19:37:36 -07:00