1
0
Fork 0
Commit Graph

29365 Commits (1d5b82331ee8b8f2c951b055329b943efcf5c834)

Author SHA1 Message Date
Arnd Bergmann d33c577ccc y2038: rename old time and utime syscalls
The time, stime, utime, utimes, and futimesat system calls are only
used on older architectures, and we do not provide y2038 safe variants
of them, as they are replaced by clock_gettime64, clock_settime64,
and utimensat_time64.

However, for consistency it seems better to have the 32-bit architectures
that still use them call the "time32" entry points (leaving the
traditional handlers for the 64-bit architectures), like we do for system
calls that now require two versions.

Note: We used to always define __ARCH_WANT_SYS_TIME and
__ARCH_WANT_SYS_UTIME and only set __ARCH_WANT_COMPAT_SYS_TIME and
__ARCH_WANT_SYS_UTIME32 for compat mode on 64-bit kernels. Now this is
reversed: only 64-bit architectures set __ARCH_WANT_SYS_TIME/UTIME, while
we need __ARCH_WANT_SYS_TIME32/UTIME32 for 32-bit architectures and compat
mode. The resulting asm/unistd.h changes look a bit counterintuitive.

This is only a cleanup patch and it should not change any behavior.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
2019-02-07 00:13:28 +01:00
Arnd Bergmann 8dabe7245b y2038: syscalls: rename y2038 compat syscalls
A lot of system calls that pass a time_t somewhere have an implementation
using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
been reworked so that this implementation can now be used on 32-bit
architectures as well.

The missing step is to redefine them using the regular SYSCALL_DEFINEx()
to get them out of the compat namespace and make it possible to build them
on 32-bit architectures.

Any system call that ends in 'time' gets a '32' suffix on its name for
that version, while the others get a '_time32' suffix, to distinguish
them from the normal version, which takes a 64-bit time argument in the
future.

In this step, only 64-bit architectures are changed, doing this rename
first lets us avoid touching the 32-bit architectures twice.

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Deepa Dinamani 3876ced476 timex: change syscalls to use struct __kernel_timex
struct timex is not y2038 safe.
Switch all the syscall apis to use y2038 safe __kernel_timex.

Note that sys_adjtimex() does not have a y2038 safe solution.  C libraries
can implement it by calling clock_adjtime(CLOCK_REALTIME, ...).

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Deepa Dinamani ead25417f8 timex: use __kernel_timex internally
struct timex is not y2038 safe.
Replace all uses of timex with y2038 safe __kernel_timex.

Note that struct __kernel_timex is an ABI interface definition.
We could define a new structure based on __kernel_timex that
is only available internally instead. Right now, there isn't
a strong motivation for this as the structure is isolated to
a few defined struct timex interfaces and such a structure would
be exactly the same as struct timex.

The patch was generated by the following coccinelle script:

virtual patch

@depends on patch forall@
identifier ts;
expression e;
@@
(
- struct timex ts;
+ struct __kernel_timex ts;
|
- struct timex ts = {};
+ struct __kernel_timex ts = {};
|
- struct timex ts = e;
+ struct __kernel_timex ts = e;
|
- struct timex *ts;
+ struct __kernel_timex *ts;
|
(memset \| copy_from_user \| copy_to_user \)(...,
- sizeof(struct timex))
+ sizeof(struct __kernel_timex))
)

@depends on patch forall@
identifier ts;
identifier fn;
@@
fn(...,
- struct timex *ts,
+ struct __kernel_timex *ts,
...) {
...
}

@depends on patch forall@
identifier ts;
identifier fn;
@@
fn(...,
- struct timex *ts) {
+ struct __kernel_timex *ts) {
...
}

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: linux-alpha@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Arnd Bergmann 1a596398a3 sparc64: add custom adjtimex/clock_adjtime functions
sparc64 is the only architecture on Linux that has a 'timeval'
definition with a 32-bit tv_usec but a 64-bit tv_sec. This causes
problems for sparc32 compat mode when we convert it to use the
new __kernel_timex type that has the same layout as all other
64-bit architectures.

To avoid adding sparc64 specific code into the generic adjtimex
implementation, this adds a wrapper in the sparc64 system call handling
that converts the sparc64 'timex' into the new '__kernel_timex'.

At this point, the two structures are defined to be identical,
but that will change in the next step once we convert sparc32.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Arnd Bergmann 4d5f007eed time: make adjtime compat handling available for 32 bit
We want to reuse the compat_timex handling on 32-bit architectures the
same way we are using the compat handling for timespec when moving to
64-bit time_t.

Move all definitions related to compat_timex out of the compat code
into the normal timekeeping code, along with a rename to old_timex32,
corresponding to the timespec/timeval structures, and make it controlled
by CONFIG_COMPAT_32BIT_TIME, which 32-bit architectures will then select.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-02-07 00:13:27 +01:00
Arnd Bergmann 275f22148e ipc: rename old-style shmctl/semctl/msgctl syscalls
The behavior of these system calls is slightly different between
architectures, as determined by the CONFIG_ARCH_WANT_IPC_PARSE_VERSION
symbol. Most architectures that implement the split IPC syscalls don't set
that symbol and only get the modern version, but alpha, arm, microblaze,
mips-n32, mips-n64 and xtensa expect the caller to pass the IPC_64 flag.

For the architectures that so far only implement sys_ipc(), i.e. m68k,
mips-o32, powerpc, s390, sh, sparc, and x86-32, we want the new behavior
when adding the split syscalls, so we need to distinguish between the
two groups of architectures.

The method I picked for this distinction is to have a separate system call
entry point: sys_old_*ctl() now uses ipc_parse_version, while sys_*ctl()
does not. The system call tables of the five architectures are changed
accordingly.

As an additional benefit, we no longer need the configuration specific
definition for ipc_parse_version(), it always does the same thing now,
but simply won't get called on architectures with the modern interface.

A small downside is that on architectures that do set
ARCH_WANT_IPC_PARSE_VERSION, we now have an extra set of entry points
that are never called. They only add a few bytes of bloat, so it seems
better to keep them compared to adding yet another Kconfig symbol.
I considered adding new syscall numbers for the IPC_64 variants for
consistency, but decided against that for now.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-01-25 17:22:50 +01:00
Arnd Bergmann 58fa4a410f ipc: introduce ksys_ipc()/compat_ksys_ipc() for s390
The sys_ipc() and compat_ksys_ipc() functions are meant to only
be used from the system call table, not called by another function.

Introduce ksys_*() interfaces for this purpose, as we have done
for many other system calls.

Link: https://lore.kernel.org/lkml/20190116131527.2071570-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[heiko.carstens@de.ibm.com: compile fix for !CONFIG_COMPAT]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2019-01-18 09:33:18 +01:00
Jonathan Neuschäfer b7285b4253 kernel/sys.c: Clarify that UNAME26 does not generate unique versions anymore
UNAME26 is a mechanism to report Linux's version as 2.6.x, for
compatibility with old/broken software.  Due to the way it is
implemented, it would have to be updated after 5.0, to keep the
resulting versions unique.  Linus Torvalds argued:

 "Do we actually need this?

  I'd rather let it bitrot, and just let it return random versions. It
  will just start again at 2.4.60, won't it?

  Anybody who uses UNAME26 for a 5.x kernel might as well think it's
  still 4.x. The user space is so old that it can't possibly care about
  differences between 4.x and 5.x, can it?

  The only thing that matters is that it shows "2.4.<largeenough>",
  which it will do regardless"

Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-14 10:38:03 +12:00
Linus Torvalds a88cc8da02 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "14 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  mm, page_alloc: do not wake kswapd with zone lock held
  hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization"
  hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"
  mm: page_mapped: don't assume compound page is huge or THP
  mm/memory.c: initialise mmu_notifier_range correctly
  tools/vm/page_owner: use page_owner_sort in the use example
  kasan: fix krealloc handling for tag-based mode
  kasan: make tag based mode work with CONFIG_HARDENED_USERCOPY
  kasan, arm64: use ARCH_SLAB_MINALIGN instead of manual aligning
  mm, memcg: fix reclaim deadlock with writeback
  mm/usercopy.c: no check page span for stack objects
  slab: alien caches must not be initialized if the allocation of the alien cache failed
  fork, memcg: fix cached_stacks case
  zram: idle writeback fixes and cleanup
2019-01-08 18:58:29 -08:00
Shakeel Butt ba4a45746c fork, memcg: fix cached_stacks case
Commit 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on
memcg charge fail") fixes a crash caused due to failed memcg charge of
the kernel stack.  However the fix misses the cached_stacks case which
this patch fixes.  So, the same crash can happen if the memcg charge of
a cached stack is failed.

Link: http://lkml.kernel.org/r/20190102180145.57406-1-shakeelb@google.com
Fixes: 5eed6f1dff ("fork,memcg: fix crash in free_thread_stack on memcg charge fail")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 17:15:11 -08:00
David Herrmann 7b55851367 fork: record start_time late
This changes the fork(2) syscall to record the process start_time after
initializing the basic task structure but still before making the new
process visible to user-space.

Technically, we could record the start_time anytime during fork(2).  But
this might lead to scenarios where a start_time is recorded long before
a process becomes visible to user-space.  For instance, with
userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
for an indefinite amount of time (and will, if this causes network
access, or similar).

By recording the start_time late, it much closer reflects the point in
time where the process becomes live and can be observed by other
processes.

Lastly, this makes it much harder for user-space to predict and control
the start_time they get assigned.  Previously, user-space could fork a
process and stall it in copy_thread_tls() before its pid is allocated,
but after its start_time is recorded.  This can be misused to later-on
cycle through PIDs and resume the stalled fork(2) yielding a process
that has the same pid and start_time as a process that existed before.
This can be used to circumvent security systems that identify processes
by their pid+start_time combination.

Even though user-space was always aware that start_time recording is
flaky (but several projects are known to still rely on start_time-based
identification), changing the start_time to be recorded late will help
mitigate existing attacks and make it much harder for user-space to
control the start_time a process gets assigned.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Tom Gundersen <teg@jklm.no>
Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-08 09:40:53 -08:00
Linus Torvalds 85e1ffbd42 Kbuild late updates for v4.21
- improve boolinit.cocci and use_after_iter.cocci semantic patches
 
 - fix alignment for kallsyms
 
 - move 'asm goto' compiler test to Kconfig and clean up jump_label
   CONFIG option
 
 - generate asm-generic wrappers automatically if arch does not implement
   mandatory UAPI headers
 
 - remove redundant generic-y defines
 
 - misc cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJcMV5GAAoJED2LAQed4NsGs9gQAI/oGg8wJgk9a7+dJCX245W5
 F4ReftnQd4AFptFCi9geJkr+sfViXNgwPLqlJxiXz8Qe8XP7z3LcArDw3FUzwvGn
 bMSBiN9ggwWkOFgF523XesYgUVtcLpkNch/Migzf1Ac0FHk0G9o7gjcdsvAWHkUu
 qFwtNcUB6PElRbhsHsh5qCY1/6HaAXgf/7O7wztnaKRe9myN6f2HzT4wANS9HHde
 1e1r0LcIQeGWfG+3va3fZl6SDxSI/ybl244OcDmDyYl6RA1skSDlHbIBIFgUPoS0
 cLyzoVj+GkfI1fRFEIfou+dj7lpukoAXHsggHo0M+ofqtbMF+VB2T3jvg4txanCP
 TXzDc+04QUguK5yVnBfcnyC64Htrhnbq0eGy43kd1VZWAEGApl+680P8CRsWU3ZV
 kOiFvZQ6RP/Ssw+a42yU3SHr31WD7feuQqHU65osQt4rdyL5wnrfU1vaUvJSkltF
 cyPr9Kz/Ism0kPodhpFkuKxwtlKOw6/uwdCQoQHtxAPkvkcydhYx93x3iE0nxObS
 CRMximiRyE12DOcv/3uv69n0JOPn6AsITcMNp8XryASYrR2/52txhGKGhvo3+Zoq
 5pwc063JsuxJ/5/dcOw/erQar5d1eBRaBJyEWnXroxUjbsLPAznE+UIN8tmvyVly
 SunlxNOXBdYeWN6t6S3H
 =I+r6
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - improve boolinit.cocci and use_after_iter.cocci semantic patches

 - fix alignment for kallsyms

 - move 'asm goto' compiler test to Kconfig and clean up jump_label
   CONFIG option

 - generate asm-generic wrappers automatically if arch does not
   implement mandatory UAPI headers

 - remove redundant generic-y defines

 - misc cleanups

* tag 'kbuild-v4.21-3' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kconfig: rename generated .*conf-cfg to *conf-cfg
  kbuild: remove unnecessary stubs for archheader and archscripts
  kbuild: use assignment instead of define ... endef for filechk_* rules
  arch: remove redundant UAPI generic-y defines
  kbuild: generate asm-generic wrappers if mandatory headers are missing
  arch: remove stale comments "UAPI Header export list"
  riscv: remove redundant kernel-space generic-y
  kbuild: change filechk to surround the given command with { }
  kbuild: remove redundant target cleaning on failure
  kbuild: clean up rule_dtc_dt_yaml
  kbuild: remove UIMAGE_IN and UIMAGE_OUT
  jump_label: move 'asm goto' support test to Kconfig
  kallsyms: lower alignment on ARM
  scripts: coccinelle: boolinit: drop warnings on named constants
  scripts: coccinelle: check for redeclaration
  kconfig: remove unused "file" field of yylval union
  nds32: remove redundant kernel-space generic-y
  nios2: remove unneeded HAS_DMA define
2019-01-06 16:33:10 -08:00
Linus Torvalds e2b745f469 dma-mapping fixes for Linux 4.21-rc1
Fix various regressions introduced in this cycles:
 
  - fix dma-debug tracking for the map_page / map_single consolidatation
  - properly stub out DMA mapping symbols for !HAS_DMA builds to avoid
    link failures
  - fix AMD Gart direct mappings
  - setup the dma address for no kernel mappings using the remap
    allocator
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlwyR9ULHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYPvOA/+L+32p2pm8o6NTgvtRvqsKNrbOm02fORLrhBqAiok
 AcirFDxTfMuUWU2isr7E7WNqwEmUQ1nVUa+I0IJ/IJFfKdTggXcaTX1M19+62KWa
 1LHpZLg1t2rl2yFQHgTrFKr5sz1PwUKZO8UbrYaYYgLgQkWDRzJs4E/tFNju8pMm
 0Usexo/bkI5mreJBImMsFwAnuk0k3NT058XIeD+eNttKjcuz5kEH+bE/999vySW3
 sOj9Peic/EFelOGb4ODxUIPjhiGFMv5dVusSAsFBH26iwQfX/tFSmXhrI5cnDewg
 NlREennfyM+6uTH/DO+BlX7eGCRYbFc1GU5H9q4rRMXhEam6oc2AzVKuElJOVstZ
 XVjP6zTwmuOh/5ff0NG6EPjA/OFcmlBEsmeWu4xSS8KsNILOkpUaPed/uWnA7O+2
 mvU104NA5cHgVMgiGNM/4ilirkEZEFEHYhafH42bQxjMigm7ZHN14NtwM7StLTu6
 QgyfPUcW/LmHj2scgvB1AZ+iQX0z7yJJMGifUxtz+eMCWCC7neOJ7JLvNnS9WI5w
 9RwYaCOcDAZyAmCpbSADWxeG9cfsCDp8wmaGs3YVyhkDU8tCSqbxWJutvyDQnC17
 GtZ0vYLTaJXBCq1L/FC0y8NCCGgvySPXYU7/ZYuOCzS4q2jvjwTWD3dKodvnS+mb
 B0s=
 =H9J6
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix various regressions introduced in this cycles:

   - fix dma-debug tracking for the map_page / map_single
     consolidatation

   - properly stub out DMA mapping symbols for !HAS_DMA builds to avoid
     link failures

   - fix AMD Gart direct mappings

   - setup the dma address for no kernel mappings using the remap
     allocator"

* tag 'dma-mapping-4.21-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocations
  x86/amd_gart: fix unmapping of non-GART mappings
  dma-mapping: remove a few unused exports
  dma-mapping: properly stub out the DMA API for !CONFIG_HAS_DMA
  dma-mapping: remove dmam_{declare,release}_coherent_memory
  dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrs
  dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs
2019-01-06 11:47:26 -08:00
Masahiro Yamada ad77408635 kbuild: change filechk to surround the given command with { }
filechk_* rules often consist of multiple 'echo' lines. They must be
surrounded with { } or ( ) to work correctly. Otherwise, only the
string from the last 'echo' would be written into the target.

Let's take care of that in the 'filechk' in scripts/Kbuild.include
to clean up filechk_* rules.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
2019-01-06 09:46:51 +09:00
Masahiro Yamada e9666d10a5 jump_label: move 'asm goto' support test to Kconfig
Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

The jump label is controlled by HAVE_JUMP_LABEL, which is defined
like this:

  #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
  # define HAVE_JUMP_LABEL
  #endif

We can improve this by testing 'asm goto' support in Kconfig, then
make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
match to the real kernel capability.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
2019-01-06 09:46:51 +09:00
Linus Torvalds a65981109f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - procfs updates

 - various misc bits

 - lib/ updates

 - epoll updates

 - autofs

 - fatfs

 - a few more MM bits

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits)
  mm/page_io.c: fix polled swap page in
  checkpatch: add Co-developed-by to signature tags
  docs: fix Co-Developed-by docs
  drivers/base/platform.c: kmemleak ignore a known leak
  fs: don't open code lru_to_page()
  fs/: remove caller signal_pending branch predictions
  mm/: remove caller signal_pending branch predictions
  arch/arc/mm/fault.c: remove caller signal_pending_branch predictions
  kernel/sched/: remove caller signal_pending branch predictions
  kernel/locking/mutex.c: remove caller signal_pending branch predictions
  mm: select HAVE_MOVE_PMD on x86 for faster mremap
  mm: speed up mremap by 20x on large regions
  mm: treewide: remove unused address argument from pte_alloc functions
  initramfs: cleanup incomplete rootfs
  scripts/gdb: fix lx-version string output
  kernel/kcov.c: mark write_comp_data() as notrace
  kernel/sysctl: add panic_print into sysctl
  panic: add options to print system info when panic happens
  bfs: extra sanity checking and static inode bitmap
  exec: separate MM_ANONPAGES and RLIMIT_STACK accounting
  ...
2019-01-05 09:16:18 -08:00
Christoph Hellwig 8270f3a11c dma-direct: fix DMA_ATTR_NO_KERNEL_MAPPING for remapped allocations
We need to return a dma_addr_t even if we don't have a kernel mapping.
Do so by consolidating the phys_to_dma call in a single place and jump
to it from all the branches that return successfully.

Fixes: bfd56cd605 ("dma-mapping: support highmem in the generic remap allocator")
Reported-by: Liviu Dudau <liviu@dudau.co.uk
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Liviu Dudau <liviu@dudau.co.uk>
2019-01-05 08:28:29 +01:00
Davidlohr Bueso 34ec35ad8f kernel/sched/: remove caller signal_pending branch predictions
This is already done for us internally by the signal machinery.

Link: http://lkml.kernel.org/r/20181116002713.8474-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Davidlohr Bueso 3bb5f4ac55 kernel/locking/mutex.c: remove caller signal_pending branch predictions
This is already done for us internally by the signal machinery.

Link: http://lkml.kernel.org/r/20181116002713.8474-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:48 -08:00
Anders Roxell 6347244316 kernel/kcov.c: mark write_comp_data() as notrace
Since __sanitizer_cov_trace_const_cmp4 is marked as notrace, the
function called from __sanitizer_cov_trace_const_cmp4 shouldn't be
traceable either.  ftrace_graph_caller() gets called every time func
write_comp_data() gets called if it isn't marked 'notrace'.  This is the
backtrace from gdb:

 #0  ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:179
 #1  0xffffff8010201920 in ftrace_caller () at ../arch/arm64/kernel/entry-ftrace.S:151
 #2  0xffffff8010439714 in write_comp_data (type=5, arg1=0, arg2=0, ip=18446743524224276596) at ../kernel/kcov.c:116
 #3  0xffffff8010439894 in __sanitizer_cov_trace_const_cmp4 (arg1=<optimized out>, arg2=<optimized out>) at ../kernel/kcov.c:188
 #4  0xffffff8010201874 in prepare_ftrace_return (self_addr=18446743524226602768, parent=0xffffff801014b918, frame_pointer=18446743524223531344) at ./include/generated/atomic-instrumented.h:27
 #5  0xffffff801020194c in ftrace_graph_caller () at ../arch/arm64/kernel/entry-ftrace.S:182

Rework so that write_comp_data() that are called from
__sanitizer_cov_trace_*_cmp*() are marked as 'notrace'.

Commit 903e8ff867 ("kernel/kcov.c: mark funcs in __sanitizer_cov_trace_pc() as notrace")
missed to mark write_comp_data() as 'notrace'. When that patch was
created gcc-7 was used. In lib/Kconfig.debug
config KCOV_ENABLE_COMPARISONS
	depends on $(cc-option,-fsanitize-coverage=trace-cmp)

That code path isn't hit with gcc-7. However, it were that with gcc-8.

Link: http://lkml.kernel.org/r/20181206143011.23719-1-anders.roxell@linaro.org
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Co-developed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Feng Tang 81c9d43f94 kernel/sysctl: add panic_print into sysctl
So that we can also runtime chose to print out the needed system info
for panic, other than setting the kernel cmdline.

Link: http://lkml.kernel.org/r/1543398842-19295-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Feng Tang d999bd9392 panic: add options to print system info when panic happens
Kernel panic issues are always painful to debug, partially because it's
not easy to get enough information of the context when panic happens.

And we have ramoops and kdump for that, while this commit tries to
provide a easier way to show the system info by adding a cmdline
parameter, referring some idea from sysrq handler.

Link: http://lkml.kernel.org/r/1543398842-19295-2-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Yi Wang fb5bf31722 fork: fix some -Wmissing-prototypes warnings
We get a warning when building kernel with W=1:

  kernel/fork.c:167:13: warning: no previous prototype for `arch_release_thread_stack' [-Wmissing-prototypes]
  kernel/fork.c:779:13: warning: no previous prototype for `fork_init' [-Wmissing-prototypes]

Add the missing declaration in head file to fix this.

Also, remove arch_release_thread_stack() completely because no arch
seems to implement it since bb9d81264 (arch: remove tile port).

Link: http://lkml.kernel.org/r/1542170087-23645-1-git-send-email-wang.yi59@zte.com.cn
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:47 -08:00
Tetsuo Handa 304ae42739 kernel/hung_task.c: break RCU locks based on jiffies
check_hung_uninterruptible_tasks() is currently calling rcu_lock_break()
for every 1024 threads.  But check_hung_task() is very slow if printk()
was called, and is very fast otherwise.

If many threads within some 1024 threads called printk(), the RCU grace
period might be extended enough to trigger RCU stall warnings.
Therefore, calling rcu_lock_break() for every some fixed jiffies will be
safer.

Link: http://lkml.kernel.org/r/1544800658-11423-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:45 -08:00
Liu, Chuansheng 168e06f793 kernel/hung_task.c: force console verbose before panic
Based on commit 401c636a0e ("kernel/hung_task.c: show all hung tasks
before panic"), we could get the call stack of hung task.

However, if the console loglevel is not high, we still can not see the
useful panic information in practice, and in most cases users don't set
console loglevel to high level.

This patch is to force console verbose before system panic, so that the
real useful information can be seen in the console, instead of being
like the following, which doesn't have hung task information.

  INFO: task init:1 blocked for more than 120 seconds.
        Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Kernel panic - not syncing: hung_task: blocked tasks
  CPU: 2 PID: 479 Comm: khungtaskd Tainted: G     U  W         4.19.0-quilt-2e5dc0ac-g51b6c21d76cc #1
  Call Trace:
   dump_stack+0x4f/0x65
   panic+0xde/0x231
   watchdog+0x290/0x410
   kthread+0x12c/0x150
   ret_from_fork+0x35/0x40
  reboot: panic mode set: p,w
  Kernel Offset: 0x34000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A6015B675@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:45 -08:00
Cheng Lin 09be178400 proc/sysctl: fix return error for proc_doulongvec_minmax()
If the number of input parameters is less than the total parameters, an
EINVAL error will be returned.

For example, we use proc_doulongvec_minmax to pass up to two parameters
with kern_table:

{
	.procname       = "monitor_signals",
	.data           = &monitor_sigs,
	.maxlen         = 2*sizeof(unsigned long),
	.mode           = 0644,
	.proc_handler   = proc_doulongvec_minmax,
},

Reproduce:

When passing two parameters, it's work normal.  But passing only one
parameter, an error "Invalid argument"(EINVAL) is returned.

  [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  1       2
  [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
  -bash: echo: write error: Invalid argument
  [root@cl150 ~]# echo $?
  1
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  3       2
  [root@cl150 ~]#

The following is the result after apply this patch.  No error is
returned when the number of input parameters is less than the total
parameters.

  [root@cl150 ~]# echo 1 2 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  1       2
  [root@cl150 ~]# echo 3 > /proc/sys/kernel/monitor_signals
  [root@cl150 ~]# echo $?
  0
  [root@cl150 ~]# cat /proc/sys/kernel/monitor_signals
  3       2
  [root@cl150 ~]#

There are three processing functions dealing with digital parameters,
__do_proc_dointvec/__do_proc_douintvec/__do_proc_doulongvec_minmax.

This patch deals with __do_proc_doulongvec_minmax, just as
__do_proc_dointvec does, adding a check for parameters 'left'.  In
__do_proc_douintvec, its code implementation explicitly does not support
multiple inputs.

static int __do_proc_douintvec(...){
         ...
         /*
          * Arrays are not supported, keep this simple. *Do not* add
          * support for them.
          */
         if (vleft != 1) {
                 *lenp = 0;
                 return -EINVAL;
         }
         ...
}

So, just __do_proc_doulongvec_minmax has the problem.  And most use of
proc_doulongvec_minmax/proc_doulongvec_ms_jiffies_minmax just have one
parameter.

Link: http://lkml.kernel.org/r/1544081775-15720-1-git-send-email-cheng.lin130@zte.com.cn
Signed-off-by: Cheng Lin <cheng.lin130@zte.com.cn>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 13:13:45 -08:00
Linus Torvalds 594cc251fd make 'user_access_begin()' do 'access_ok()'
Originally, the rule used to be that you'd have to do access_ok()
separately, and then user_access_begin() before actually doing the
direct (optimized) user access.

But experience has shown that people then decide not to do access_ok()
at all, and instead rely on it being implied by other operations or
similar.  Which makes it very hard to verify that the access has
actually been range-checked.

If you use the unsafe direct user accesses, hardware features (either
SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
Access Never - on ARM) do force you to use user_access_begin().  But
nothing really forces the range check.

By putting the range check into user_access_begin(), we actually force
people to do the right thing (tm), and the range check vill be visible
near the actual accesses.  We have way too long a history of people
trying to avoid them.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04 12:56:09 -08:00
Christoph Hellwig 48e638fb68 dma-mapping: remove a few unused exports
Now that the slow path DMA API calls are implemented out of line a few
helpers only used by them don't need to be exported anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04 09:03:17 +01:00
Christoph Hellwig 4788ba5792 dma-mapping: remove dmam_{declare,release}_coherent_memory
These functions have never been used.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04 09:03:17 +01:00
Christoph Hellwig d7076f0784 dma-mapping: implement dmam_alloc_coherent using dmam_alloc_attrs
dmam_alloc_coherent is just the default no-flags case of
dmam_alloc_attrs, so take advantage of this similar to the non-managed
version.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2019-01-04 09:03:16 +01:00
Christoph Hellwig 2e05ea5cdc dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs
And also switch the way we implement the unmap side around to stay
consistent.  This ensures dma-debug works again because it records which
function we used for mapping to ensure it is also used for unmapping,
and also reduces further code duplication.  Last but not least this
also officially allows calling dma_sync_single_* for mappings created
using dma_map_page, which is perfectly fine given that the sync calls
only take a dma_addr_t, but not a virtual address or struct page.

Fixes: 7f0fee242e ("dma-mapping: merge dma_unmap_page_attrs and dma_unmap_single_attrs")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: LABBE Corentin <clabbe.montjoie@gmail.com>
2019-01-04 09:02:17 +01:00
Linus Torvalds 96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Linus Torvalds 43d86ee8c6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Several fixes here. Basically split down the line between newly
  introduced regressions and long existing problems:

   1) Double free in tipc_enable_bearer(), from Cong Wang.

   2) Many fixes to nf_conncount, from Florian Westphal.

   3) op->get_regs_len() can throw an error, check it, from Yunsheng
      Lin.

   4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman
      driver, from Scott Wood.

   5) Inifnite loop in fib_empty_table(), from Yue Haibing.

   6) Use after free in ax25_fillin_cb(), from Cong Wang.

   7) Fix socket locking in nr_find_socket(), also from Cong Wang.

   8) Fix WoL wakeup enable in r8169, from Heiner Kallweit.

   9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani.

  10) Fix ptr_ring wrap during queue swap, from Cong Wang.

  11) Missing shutdown callback in hinic driver, from Xue Chaojing.

  12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano
      Brivio.

  13) BPF out of bounds speculation fixes from Daniel Borkmann"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
  ipv6: Consider sk_bound_dev_if when binding a socket to an address
  ipv6: Fix dump of specific table with strict checking
  bpf: add various test cases to selftests
  bpf: prevent out of bounds speculation on pointer arithmetic
  bpf: fix check_map_access smin_value test when pointer contains offset
  bpf: restrict unknown scalars of mixed signed bounds for unprivileged
  bpf: restrict stack pointer arithmetic for unprivileged
  bpf: restrict map value pointer arithmetic for unprivileged
  bpf: enable access to ax register also from verifier rewrite
  bpf: move tmp variable into ax register in interpreter
  bpf: move {prev_,}insn_idx into verifier env
  isdn: fix kernel-infoleak in capi_unlocked_ioctl
  ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error
  net/hamradio/6pack: use mod_timer() to rearm timers
  net-next/hinic:add shutdown callback
  net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT
  ip: validate header length on virtual device xmit
  tap: call skb_probe_transport_header after setting skb->dev
  ptr_ring: wrap back ->producer in __ptr_ring_swap_queue()
  net: rds: remove unnecessary NULL check
  ...
2019-01-03 12:53:47 -08:00
Linus Torvalds e6b9257280 NFS client updates for Linux 4.21
Note that there is a conflict with the rdma tree in this pull request, since
 we delete a file that has been changed in the rdma tree.  Hopefully that's
 easy enough to resolve!
 
 We also were unable to track down a maintainer for Neil Brown's changes to
 the generic cred code that are prerequisites to his RPC cred cleanup patches.
 We've been asking around for several months without any response, so
 hopefully it's okay to include those patches in this pull request.
 
 Stable bugfixes:
 - xprtrdma: Yet another double DMA-unmap # v4.20
 
 Features:
 - Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
 - Per-xprt rdma receive workqueues
 - Drop support for FMR memory registration
 - Make port= mount option optional for RDMA mounts
 
 Other bugfixes and cleanups:
 - Remove unused nfs4_xdev_fs_type declaration
 - Fix comments for behavior that has changed
 - Remove generic RPC credentials by switching to 'struct cred'
 - Fix crossing mountpoints with different auth flavors
 - Various xprtrdma fixes from testing and auditing the close code
 - Fixes for disconnect issues when using xprtrdma with krb5
 - Clean up and improve xprtrdma trace points
 - Fix NFS v4.2 async copy reboot recovery
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlwtO50ACgkQ18tUv7Cl
 QOtZWQ//e5Hhp2TnQZ6U+99YKedjwBHP6psH3GKSEdeHSNdlSpZ5ckgHxvMb9TBa
 6t4ecgv5P/uYLIePQ0u2ubUFc9+TlyGi7Iacx13/YhK7kihGHDPnZhfl0QbYixV7
 rwa9bFcKmOrXs8ld+Hw3P2UL22G1gMf/LHDhPNshbW7LFZmcshKz+mKTk70kwkq9
 v7tFC59p6GwV8Sr2YI2NXn2fOWsUS00sQfgj2jceJYJ8PsNa+wHYF4wPj2IY5NsE
 D5Oq2kLPbytBhCllOHgopNZaf4qb5BfqhVETyc1O+kDF3BZKUhQ1PoDi2FPinaHM
 5/d8hS+5fr3eMBsQrPWQLXYjWQFUXnkQQJvU3Bo52AIgomsk/8uBq3FvH7XmFcBd
 C8sgnuUAkAS8feMes8GCS50BTxclnGuYGdyFJyCRXoG9Kn9rMrw9EKitky6EVq0v
 NmXhW79jK84a3yDXVlAIpZ8Y9BU/HQ3GviGX8lQEdZU9YiYRzDIHvpMFwzMgqaBi
 XvLbr8PlLOm8GZokThS8QYT/G2Wu6IwfUq/AufVjVD4+HiL3duKKfWSGAvcm6aAa
 GoRF6UG+OmjWlzKojtRc1dI+sy22Fzh+DW+Mx6tuf/b/66wkmYnW7eKcV4rt6Tm5
 /JEhvTMo9q7elL/4FgCoMCcdoc5eXqQyXRXrQiOU7YHLzn2aWU0=
 =DvVW
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - xprtrdma: Yet another double DMA-unmap # v4.20

  Features:
   - Allow some /proc/sys/sunrpc entries without CONFIG_SUNRPC_DEBUG
   - Per-xprt rdma receive workqueues
   - Drop support for FMR memory registration
   - Make port= mount option optional for RDMA mounts

  Other bugfixes and cleanups:
   - Remove unused nfs4_xdev_fs_type declaration
   - Fix comments for behavior that has changed
   - Remove generic RPC credentials by switching to 'struct cred'
   - Fix crossing mountpoints with different auth flavors
   - Various xprtrdma fixes from testing and auditing the close code
   - Fixes for disconnect issues when using xprtrdma with krb5
   - Clean up and improve xprtrdma trace points
   - Fix NFS v4.2 async copy reboot recovery"

* tag 'nfs-for-4.21-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (63 commits)
  sunrpc: convert to DEFINE_SHOW_ATTRIBUTE
  sunrpc: Add xprt after nfs4_test_session_trunk()
  sunrpc: convert unnecessary GFP_ATOMIC to GFP_NOFS
  sunrpc: handle ENOMEM in rpcb_getport_async
  NFS: remove unnecessary test for IS_ERR(cred)
  xprtrdma: Prevent leak of rpcrdma_rep objects
  NFSv4.2 fix async copy reboot recovery
  xprtrdma: Don't leak freed MRs
  xprtrdma: Add documenting comment for rpcrdma_buffer_destroy
  xprtrdma: Replace outdated comment for rpcrdma_ep_post
  xprtrdma: Update comments in frwr_op_send
  SUNRPC: Fix some kernel doc complaints
  SUNRPC: Simplify defining common RPC trace events
  NFS: Fix NFSv4 symbolic trace point output
  xprtrdma: Trace mapping, alloc, and dereg failures
  xprtrdma: Add trace points for calls to transport switch methods
  xprtrdma: Relocate the xprtrdma_mr_map trace points
  xprtrdma: Clean up of xprtrdma chunk trace points
  xprtrdma: Remove unused fields from rpcrdma_ia
  xprtrdma: Cull dprintk() call sites
  ...
2019-01-02 16:35:23 -08:00
Daniel Borkmann 979d63d50c bpf: prevent out of bounds speculation on pointer arithmetic
Jann reported that the original commit back in b2157399cc
("bpf: prevent out-of-bounds speculation") was not sufficient
to stop CPU from speculating out of bounds memory access:
While b2157399cc only focussed on masking array map access
for unprivileged users for tail calls and data access such
that the user provided index gets sanitized from BPF program
and syscall side, there is still a more generic form affected
from BPF programs that applies to most maps that hold user
data in relation to dynamic map access when dealing with
unknown scalars or "slow" known scalars as access offset, for
example:

  - Load a map value pointer into R6
  - Load an index into R7
  - Do a slow computation (e.g. with a memory dependency) that
    loads a limit into R8 (e.g. load the limit from a map for
    high latency, then mask it to make the verifier happy)
  - Exit if R7 >= R8 (mispredicted branch)
  - Load R0 = R6[R7]
  - Load R0 = R6[R0]

For unknown scalars there are two options in the BPF verifier
where we could derive knowledge from in order to guarantee
safe access to the memory: i) While </>/<=/>= variants won't
allow to derive any lower or upper bounds from the unknown
scalar where it would be safe to add it to the map value
pointer, it is possible through ==/!= test however. ii) another
option is to transform the unknown scalar into a known scalar,
for example, through ALU ops combination such as R &= <imm>
followed by R |= <imm> or any similar combination where the
original information from the unknown scalar would be destroyed
entirely leaving R with a constant. The initial slow load still
precedes the latter ALU ops on that register, so the CPU
executes speculatively from that point. Once we have the known
scalar, any compare operation would work then. A third option
only involving registers with known scalars could be crafted
as described in [0] where a CPU port (e.g. Slow Int unit)
would be filled with many dependent computations such that
the subsequent condition depending on its outcome has to wait
for evaluation on its execution port and thereby executing
speculatively if the speculated code can be scheduled on a
different execution port, or any other form of mistraining
as described in [1], for example. Given this is not limited
to only unknown scalars, not only map but also stack access
is affected since both is accessible for unprivileged users
and could potentially be used for out of bounds access under
speculation.

In order to prevent any of these cases, the verifier is now
sanitizing pointer arithmetic on the offset such that any
out of bounds speculation would be masked in a way where the
pointer arithmetic result in the destination register will
stay unchanged, meaning offset masked into zero similar as
in array_index_nospec() case. With regards to implementation,
there are three options that were considered: i) new insn
for sanitation, ii) push/pop insn and sanitation as inlined
BPF, iii) reuse of ax register and sanitation as inlined BPF.

Option i) has the downside that we end up using from reserved
bits in the opcode space, but also that we would require
each JIT to emit masking as native arch opcodes meaning
mitigation would have slow adoption till everyone implements
it eventually which is counter-productive. Option ii) and iii)
have both in common that a temporary register is needed in
order to implement the sanitation as inlined BPF since we
are not allowed to modify the source register. While a push /
pop insn in ii) would be useful to have in any case, it
requires once again that every JIT needs to implement it
first. While possible, amount of changes needed would also
be unsuitable for a -stable patch. Therefore, the path which
has fewer changes, less BPF instructions for the mitigation
and does not require anything to be changed in the JITs is
option iii) which this work is pursuing. The ax register is
already mapped to a register in all JITs (modulo arm32 where
it's mapped to stack as various other BPF registers there)
and used in constant blinding for JITs-only so far. It can
be reused for verifier rewrites under certain constraints.
The interpreter's tmp "register" has therefore been remapped
into extending the register set with hidden ax register and
reusing that for a number of instructions that needed the
prior temporary variable internally (e.g. div, mod). This
allows for zero increase in stack space usage in the interpreter,
and enables (restricted) generic use in rewrites otherwise as
long as such a patchlet does not make use of these instructions.
The sanitation mask is dynamic and relative to the offset the
map value or stack pointer currently holds.

There are various cases that need to be taken under consideration
for the masking, e.g. such operation could look as follows:
ptr += val or val += ptr or ptr -= val. Thus, the value to be
sanitized could reside either in source or in destination
register, and the limit is different depending on whether
the ALU op is addition or subtraction and depending on the
current known and bounded offset. The limit is derived as
follows: limit := max_value_size - (smin_value + off). For
subtraction: limit := umax_value + off. This holds because
we do not allow any pointer arithmetic that would
temporarily go out of bounds or would have an unknown
value with mixed signed bounds where it is unclear at
verification time whether the actual runtime value would
be either negative or positive. For example, we have a
derived map pointer value with constant offset and bounded
one, so limit based on smin_value works because the verifier
requires that statically analyzed arithmetic on the pointer
must be in bounds, and thus it checks if resulting
smin_value + off and umax_value + off is still within map
value bounds at time of arithmetic in addition to time of
access. Similarly, for the case of stack access we derive
the limit as follows: MAX_BPF_STACK + off for subtraction
and -off for the case of addition where off := ptr_reg->off +
ptr_reg->var_off.value. Subtraction is a special case for
the masking which can be in form of ptr += -val, ptr -= -val,
or ptr -= val. In the first two cases where we know that
the value is negative, we need to temporarily negate the
value in order to do the sanitation on a positive value
where we later swap the ALU op, and restore original source
register if the value was in source.

The sanitation of pointer arithmetic alone is still not fully
sufficient as is, since a scenario like the following could
happen ...

  PTR += 0x1000 (e.g. K-based imm)
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  PTR += 0x1000
  PTR -= BIG_NUMBER_WITH_SLOW_COMPARISON
  [...]

... which under speculation could end up as ...

  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  PTR += 0x1000
  PTR -= 0 [ truncated by mitigation ]
  [...]

... and therefore still access out of bounds. To prevent such
case, the verifier is also analyzing safety for potential out
of bounds access under speculative execution. Meaning, it is
also simulating pointer access under truncation. We therefore
"branch off" and push the current verification state after the
ALU operation with known 0 to the verification stack for later
analysis. Given the current path analysis succeeded it is
likely that the one under speculation can be pruned. In any
case, it is also subject to existing complexity limits and
therefore anything beyond this point will be rejected. In
terms of pruning, it needs to be ensured that the verification
state from speculative execution simulation must never prune
a non-speculative execution path, therefore, we mark verifier
state accordingly at the time of push_stack(). If verifier
detects out of bounds access under speculative execution from
one of the possible paths that includes a truncation, it will
reject such program.

Given we mask every reg-based pointer arithmetic for
unprivileged programs, we've been looking into how it could
affect real-world programs in terms of size increase. As the
majority of programs are targeted for privileged-only use
case, we've unconditionally enabled masking (with its alu
restrictions on top of it) for privileged programs for the
sake of testing in order to check i) whether they get rejected
in its current form, and ii) by how much the number of
instructions and size will increase. We've tested this by
using Katran, Cilium and test_l4lb from the kernel selftests.
For Katran we've evaluated balancer_kern.o, Cilium bpf_lxc.o
and an older test object bpf_lxc_opt_-DUNKNOWN.o and l4lb
we've used test_l4lb.o as well as test_l4lb_noinline.o. We
found that none of the programs got rejected by the verifier
with this change, and that impact is rather minimal to none.
balancer_kern.o had 13,904 bytes (1,738 insns) xlated and
7,797 bytes JITed before and after the change. Most complex
program in bpf_lxc.o had 30,544 bytes (3,817 insns) xlated
and 18,538 bytes JITed before and after and none of the other
tail call programs in bpf_lxc.o had any changes either. For
the older bpf_lxc_opt_-DUNKNOWN.o object we found a small
increase from 20,616 bytes (2,576 insns) and 12,536 bytes JITed
before to 20,664 bytes (2,582 insns) and 12,558 bytes JITed
after the change. Other programs from that object file had
similar small increase. Both test_l4lb.o had no change and
remained at 6,544 bytes (817 insns) xlated and 3,401 bytes
JITed and for test_l4lb_noinline.o constant at 5,080 bytes
(634 insns) xlated and 3,313 bytes JITed. This can be explained
in that LLVM typically optimizes stack based pointer arithmetic
by using K-based operations and that use of dynamic map access
is not overly frequent. However, in future we may decide to
optimize the algorithm further under known guarantees from
branch and value speculation. Latter seems also unclear in
terms of prediction heuristics that today's CPUs apply as well
as whether there could be collisions in e.g. the predictor's
Value History/Pattern Table for triggering out of bounds access,
thus masking is performed unconditionally at this point but could
be subject to relaxation later on. We were generally also
brainstorming various other approaches for mitigation, but the
blocker was always lack of available registers at runtime and/or
overhead for runtime tracking of limits belonging to a specific
pointer. Thus, we found this to be minimally intrusive under
given constraints.

With that in place, a simple example with sanitized access on
unprivileged load at post-verification time looks as follows:

  # bpftool prog dump xlated id 282
  [...]
  28: (79) r1 = *(u64 *)(r7 +0)
  29: (79) r2 = *(u64 *)(r7 +8)
  30: (57) r1 &= 15
  31: (79) r3 = *(u64 *)(r0 +4608)
  32: (57) r3 &= 1
  33: (47) r3 |= 1
  34: (2d) if r2 > r3 goto pc+19
  35: (b4) (u32) r11 = (u32) 20479  |
  36: (1f) r11 -= r2                | Dynamic sanitation for pointer
  37: (4f) r11 |= r2                | arithmetic with registers
  38: (87) r11 = -r11               | containing bounded or known
  39: (c7) r11 s>>= 63              | scalars in order to prevent
  40: (5f) r11 &= r2                | out of bounds speculation.
  41: (0f) r4 += r11                |
  42: (71) r4 = *(u8 *)(r4 +0)
  43: (6f) r4 <<= r1
  [...]

For the case where the scalar sits in the destination register
as opposed to the source register, the following code is emitted
for the above example:

  [...]
  16: (b4) (u32) r11 = (u32) 20479
  17: (1f) r11 -= r2
  18: (4f) r11 |= r2
  19: (87) r11 = -r11
  20: (c7) r11 s>>= 63
  21: (5f) r2 &= r11
  22: (0f) r2 += r0
  23: (61) r0 = *(u32 *)(r2 +0)
  [...]

JIT blinding example with non-conflicting use of r10:

  [...]
   d5:	je     0x0000000000000106    _
   d7:	mov    0x0(%rax),%edi       |
   da:	mov    $0xf153246,%r10d     | Index load from map value and
   e0:	xor    $0xf153259,%r10      | (const blinded) mask with 0x1f.
   e7:	and    %r10,%rdi            |_
   ea:	mov    $0x2f,%r10d          |
   f0:	sub    %rdi,%r10            | Sanitized addition. Both use r10
   f3:	or     %rdi,%r10            | but do not interfere with each
   f6:	neg    %r10                 | other. (Neither do these instructions
   f9:	sar    $0x3f,%r10           | interfere with the use of ax as temp
   fd:	and    %r10,%rdi            | in interpreter.)
  100:	add    %rax,%rdi            |_
  103:	mov    0x0(%rdi),%eax
 [...]

Tested that it fixes Jann's reproducer, and also checked that test_verifier
and test_progs suite with interpreter, JIT and JIT with hardening enabled
on x86-64 and arm64 runs successfully.

  [0] Speculose: Analyzing the Security Implications of Speculative
      Execution in CPUs, Giorgi Maisuradze and Christian Rossow,
      https://arxiv.org/pdf/1801.04084.pdf

  [1] A Systematic Evaluation of Transient Execution Attacks and
      Defenses, Claudio Canella, Jo Van Bulck, Michael Schwarz,
      Moritz Lipp, Benjamin von Berg, Philipp Ortner, Frank Piessens,
      Dmitry Evtyushkin, Daniel Gruss,
      https://arxiv.org/pdf/1811.05441.pdf

Fixes: b2157399cc ("bpf: prevent out-of-bounds speculation")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann b7137c4eab bpf: fix check_map_access smin_value test when pointer contains offset
In check_map_access() we probe actual bounds through __check_map_access()
with offset of reg->smin_value + off for lower bound and offset of
reg->umax_value + off for the upper bound. However, even though the
reg->smin_value could have a negative value, the final result of the
sum with off could be positive when pointer arithmetic with known and
unknown scalars is combined. In this case we reject the program with
an error such as "R<x> min value is negative, either use unsigned index
or do a if (index >=0) check." even though the access itself would be
fine. Therefore extend the check to probe whether the actual resulting
reg->smin_value + off is less than zero.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann 9d7eceede7 bpf: restrict unknown scalars of mixed signed bounds for unprivileged
For unknown scalars of mixed signed bounds, meaning their smin_value is
negative and their smax_value is positive, we need to reject arithmetic
with pointer to map value. For unprivileged the goal is to mask every
map pointer arithmetic and this cannot reliably be done when it is
unknown at verification time whether the scalar value is negative or
positive. Given this is a corner case, the likelihood of breaking should
be very small.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann e4298d2583 bpf: restrict stack pointer arithmetic for unprivileged
Restrict stack pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with
a stack pointer as a destination we simulate a check_stack_access()
of 1 byte on the destination and once that fails the program is
rejected for unprivileged program loads. This is analog to map
value pointer arithmetic and needed for masking later on.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann 0d6303db79 bpf: restrict map value pointer arithmetic for unprivileged
Restrict map value pointer arithmetic for unprivileged users in that
arithmetic itself must not go out of bounds as opposed to the actual
access later on. Therefore after each adjust_ptr_min_max_vals() with a
map value pointer as a destination it will simulate a check_map_access()
of 1 byte on the destination and once that fails the program is rejected
for unprivileged program loads. We use this later on for masking any
pointer arithmetic with the remainder of the map value space. The
likelihood of breaking any existing real-world unprivileged eBPF
program is very small for this corner case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann 9b73bfdd08 bpf: enable access to ax register also from verifier rewrite
Right now we are using BPF ax register in JIT for constant blinding as
well as in interpreter as temporary variable. Verifier will not be able
to use it simply because its use will get overridden from the former in
bpf_jit_blind_insn(). However, it can be made to work in that blinding
will be skipped if there is prior use in either source or destination
register on the instruction. Taking constraints of ax into account, the
verifier is then open to use it in rewrites under some constraints. Note,
ax register already has mappings in every eBPF JIT.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann 144cd91c4c bpf: move tmp variable into ax register in interpreter
This change moves the on-stack 64 bit tmp variable in ___bpf_prog_run()
into the hidden ax register. The latter is currently only used in JITs
for constant blinding as a temporary scratch register, meaning the BPF
interpreter will never see the use of ax. Therefore it is safe to use
it for the cases where tmp has been used earlier. This is needed to later
on allow restricted hidden use of ax in both interpreter and JITs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Daniel Borkmann c08435ec7f bpf: move {prev_,}insn_idx into verifier env
Move prev_insn_idx and insn_idx from the do_check() function into
the verifier environment, so they can be read inside the various
helper functions for handling the instructions. It's easier to put
this into the environment rather than changing all call-sites only
to pass it along. insn_idx is useful in particular since this later
on allows to hold state in env->insn_aux_data[env->insn_idx].

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-01-02 16:01:24 -08:00
Linus Torvalds d9a7fa67b4 Merge branch 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull seccomp updates from James Morris:

 - Add SECCOMP_RET_USER_NOTIF

 - seccomp fixes for sparse warnings and s390 build (Tycho)

* 'next-seccomp' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  seccomp, s390: fix build for syscall type change
  seccomp: fix poor type promotion
  samples: add an example of seccomp user trap
  seccomp: add a return code to trap to userspace
  seccomp: switch system call argument type to void *
  seccomp: hoist struct seccomp_data recalculation higher
2019-01-02 09:48:13 -08:00
Linus Torvalds fcf010449e kgdb patches for 4.20-rc1
Mostly clean ups although whilst Doug's was chasing down a odd
 lockdep warning he also did some work to improved debugger resilience
 when some CPUs fail to respond to the round up request.
 
 The main changes are:
 
  * Fixing a lockdep warning on architectures that cannot use an NMI for
    the round up plus related changes to make CPU round up and all CPU
    backtrace more resilient.
 
  * Constify the arch ops tables
 
  * A couple of other small clean ups
 
 Two of the three patchsets here include changes that spill over into
 arch/.  Changes in the arch space are relatively narrow in scope
 (and directly related to kgdb). Didn't get comprehensive acks but
 all impacted maintainers were Cc:ed in good time.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJPBAABCAA5FiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAlwonoUbHGRhbmllbC50
 aG9tcHNvbkBsaW5hcm8ub3JnAAoJEHzjJV0594ihmooP/1uzSMGQIoQMB8XeU/jT
 Da2iILybi6hGp7ILA27d0yN3tsJBxWGWs8wzNdzMo3NQ3J0o4foAUnS/R0Vjkg9w
 uphe5EA4HDsIrH05OouNb984BeEgNaC9HSqtyr9fXuh024NboULFKIm7REYm+QHT
 C5SrBtmonL1xE7FmAhudLWjl7ZlvxM6DJeoVViH4kKq0raTiILt6VJaGl9JfcAdL
 m9GEf9r/nh0sCq3GNgyc0y4BvHed+Kxzy1fsIi3jE6t8elaYYR72gNRQ5LaFxcnQ
 F04/UtH75qB4rqYsqqV1q0rFi+tj+p9wYTmxixaGWsVDX4Gb5KXuLWJhaRb5IvwC
 bdq/0IAXRr4vUL3y0tFWfCj7pHGaVc/gfXi8aieRXLGAZG+tdfuu99NCiulIZTfc
 QqZz12Z+99/qi6dK7dBQtaN8SyPeB1QXKWefeGo2Bt5QqiBmcKHxsQYMUo3nkf3J
 UXHpj4LG6Ldsi/w8VZfvXmM0/vbO/jrus9m+X2v+4tJyisjrsyv0FRnREI4avfbC
 l09P1ajv7RrAaxtab0smV9krqWZ/mSn0zcgcaD6RdKe0+SwsiP/CEx1z1Wb1MH9c
 wjEiClXjdVB39YVT0YVfG2Ho7qH8WRErxVyNb/f4QKHMXL1Mu91hFWhBBpUOGUj2
 7Jrq2zK1uWramtt7GBDpHYYH
 =Aqlc
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb updates from Daniel Thompson:
 "Mostly clean ups although while Doug's was chasing down a odd lockdep
  warning he also did some work to improved debugger resilience when
  some CPUs fail to respond to the round up request.

  The main changes are:

   - Fixing a lockdep warning on architectures that cannot use an NMI
     for the round up plus related changes to make CPU round up and all
     CPU backtrace more resilient.

   - Constify the arch ops tables

   - A couple of other small clean ups

  Two of the three patchsets here include changes that spill over into
  arch/. Changes in the arch space are relatively narrow in scope (and
  directly related to kgdb). Didn't get comprehensive acks but all
  impacted maintainers were Cc:ed in good time"

* tag 'kgdb-4.21-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb/treewide: constify struct kgdb_arch arch_kgdb_ops
  mips/kgdb: prepare arch_kgdb_ops for constness
  kdb: use bool for binary state indicators
  kdb: Don't back trace on a cpu that didn't round up
  kgdb: Don't round up a CPU that failed rounding up before
  kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
  kgdb: Remove irq flags from roundup
2019-01-01 15:38:14 -08:00
Linus Torvalds 495d714ad1 Tracing changes for v4.21:
- Rework of the kprobe/uprobe and synthetic events to consolidate all
    the dynamic event code. This will make changes in the future easier.
 
  - Partial rewrite of the function graph tracing infrastructure.
    This will allow for multiple users of hooking onto functions
    to get the callback (return) of the function. This is the ground
    work for having kprobes and function graph tracer using one code base.
 
  - Clean up of the histogram code that will facilitate adding more
    features to the histograms in the future.
 
  - Addition of str_has_prefix() and a few use cases. There currently
    is a similar function strstart() that is used in a few places, but
    only returns a bool and not a length. These instances will be
    removed in the future to use str_has_prefix() instead.
 
  - A few other various clean ups as well.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXCawlBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qhbcAQCFeT0fWWTUxofBQz5jqsHaRnVg21+9
 X4sTldYRYEn4YgEAmWOyiwq7zvrsAu4ZwkNBMeqxn3tVymYHiGOGe3Y4BAw=
 =u96o
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Rework of the kprobe/uprobe and synthetic events to consolidate all
   the dynamic event code. This will make changes in the future easier.

 - Partial rewrite of the function graph tracing infrastructure. This
   will allow for multiple users of hooking onto functions to get the
   callback (return) of the function. This is the ground work for having
   kprobes and function graph tracer using one code base.

 - Clean up of the histogram code that will facilitate adding more
   features to the histograms in the future.

 - Addition of str_has_prefix() and a few use cases. There currently is
   a similar function strstart() that is used in a few places, but only
   returns a bool and not a length. These instances will be removed in
   the future to use str_has_prefix() instead.

 - A few other various clean ups as well.

* tag 'trace-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (57 commits)
  tracing: Use the return of str_has_prefix() to remove open coded numbers
  tracing: Have the historgram use the result of str_has_prefix() for len of prefix
  tracing: Use str_has_prefix() instead of using fixed sizes
  tracing: Use str_has_prefix() helper for histogram code
  string.h: Add str_has_prefix() helper function
  tracing: Make function ‘ftrace_exports’ static
  tracing: Simplify printf'ing in seq_print_sym
  tracing: Avoid -Wformat-nonliteral warning
  tracing: Merge seq_print_sym_short() and seq_print_sym_offset()
  tracing: Add hist trigger comments for variable-related fields
  tracing: Remove hist trigger synth_var_refs
  tracing: Use hist trigger's var_ref array to destroy var_refs
  tracing: Remove open-coding of hist trigger var_ref management
  tracing: Use var_refs[] for hist trigger reference checking
  tracing: Change strlen to sizeof for hist trigger static strings
  tracing: Remove unnecessary hist trigger struct field
  tracing: Fix ftrace_graph_get_ret_stack() to use task and not current
  seq_buf: Use size_t for len in seq_buf_puts()
  seq_buf: Make seq_buf_puts() null-terminate the buffer
  arm64: Use ftrace_graph_get_ret_stack() instead of curr_ret_stack
  ...
2018-12-31 11:46:59 -08:00
Linus Torvalds e3ed513bcf Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Ingo Molnar:
 "This is a revert for a lockup in cgroups-intense workloads - the real
  fixes will come later"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b
2018-12-31 09:54:17 -08:00
Linus Torvalds c40f7d74c7 sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f6544b
Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
scheduler under high loads, starting at around the v4.18 time frame,
and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
manipulation.

Do a (manual) revert of:

  a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")

It turns out that the list_del_leaf_cfs_rq() introduced by this commit
is a surprising property that was not considered in followup commits
such as:

  9c2791f936 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

As Vincent Guittot explains:

 "I think that there is a bigger problem with commit a9e7f6544b and
  cfs_rq throttling:

  Let take the example of the following topology TG2 --> TG1 --> root:

   1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
      cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
      one path because it has never been used and can't be throttled so
      tmp_alone_branch will point to leaf_cfs_rq_list at the end.

   2) Then TG1 is throttled

   3) and we add TG3 as a new child of TG1.

   4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
      cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.

  With commit a9e7f6544b, we can del a cfs_rq from rq->leaf_cfs_rq_list.
  So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
  cfs_rq is removed from the list.
  Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
  but tmp_alone_branch still points to TG3 cfs_rq because its throttled
  parent can't be enqueued when the lock is released.
  tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

  So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
  points on another TG cfs_rq, the next TG cfs_rq that will be added,
  will be linked outside rq->leaf_cfs_rq_list - which is bad.

  In addition, we can break the ordering of the cfs_rq in
  rq->leaf_cfs_rq_list but this ordering is used to update and
  propagate the update from leaf down to root."

Instead of trying to work through all these cases and trying to reproduce
the very high loads that produced the lockup to begin with, simplify
the code temporarily by reverting a9e7f6544b - which change was clearly
not thought through completely.

This (hopefully) gives us a kernel that doesn't lock up so people
can continue to enjoy their holidays without worrying about regressions. ;-)

[ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Reported-by: Sargun Dhillon <sargun@sargun.me>
Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
Tested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: <stable@vger.kernel.org> # v4.13+
Cc: Bin Li <huawei.libin@huawei.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: a9e7f6544b ("sched/fair: Fix O(nr_cgroups) in load balance path")
Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-12-30 13:54:31 +01:00
Nicholas Mc Guire 7faedcd4de kdb: use bool for binary state indicators
defcmd_in_progress  is the state trace for command group processing
- within a command group or not -  usable  is an indicator if a command
set is valid (allocated/non-empty) - so use a bool for those binary
indication here.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-12-30 08:31:52 +00:00
Douglas Anderson 162bc7f5af kdb: Don't back trace on a cpu that didn't round up
If you have a CPU that fails to round up and then run 'btc' you'll end
up crashing in kdb becaue we dereferenced NULL.  Let's add a check.
It's wise to also set the task to NULL when leaving the debugger so
that if we fail to round up on a later entry into the debugger we
won't backtrace a stale task.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Acked-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2018-12-30 08:31:23 +00:00