Fork of alistair23 Linux kernel for reMarkable from https://github.com/alistair23/linux
Go to file
Adrian Reber 124ea650d3 capabilities: Introduce CAP_CHECKPOINT_RESTORE
This patch introduces CAP_CHECKPOINT_RESTORE, a new capability facilitating
checkpoint/restore for non-root users.

Over the last years, The CRIU (Checkpoint/Restore In Userspace) team has
been asked numerous times if it is possible to checkpoint/restore a
process as non-root. The answer usually was: 'almost'.

The main blocker to restore a process as non-root was to control the PID
of the restored process. This feature available via the clone3 system
call, or via /proc/sys/kernel/ns_last_pid is unfortunately guarded by
CAP_SYS_ADMIN.

In the past two years, requests for non-root checkpoint/restore have
increased due to the following use cases:
* Checkpoint/Restore in an HPC environment in combination with a
  resource manager distributing jobs where users are always running as
  non-root. There is a desire to provide a way to checkpoint and
  restore long running jobs.
* Container migration as non-root
* We have been in contact with JVM developers who are integrating
  CRIU into a Java VM to decrease the startup time. These
  checkpoint/restore applications are not meant to be running with
  CAP_SYS_ADMIN.

We have seen the following workarounds:
* Use a setuid wrapper around CRIU:
  See https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c
* Use a setuid helper that writes to ns_last_pid.
  Unfortunately, this helper delegation technique is impossible to use
  with clone3, and is thus prone to races.
  See https://github.com/twosigma/set_ns_last_pid
* Cycle through PIDs with fork() until the desired PID is reached:
  This has been demonstrated to work with cycling rates of 100,000 PIDs/s
  See https://github.com/twosigma/set_ns_last_pid
* Patch out the CAP_SYS_ADMIN check from the kernel
* Run the desired application in a new user and PID namespace to provide
  a local CAP_SYS_ADMIN for controlling PIDs. This technique has limited
  use in typical container environments (e.g., Kubernetes) as /proc is
  typically protected with read-only layers (e.g., /proc/sys) for
  hardening purposes. Read-only layers prevent additional /proc mounts
  (due to proc's SB_I_USERNS_VISIBLE property), making the use of new
  PID namespaces limited as certain applications need access to /proc
  matching their PID namespace.

The introduced capability allows to:
* Control PIDs when the current user is CAP_CHECKPOINT_RESTORE capable
  for the corresponding PID namespace via ns_last_pid/clone3.
* Open files in /proc/pid/map_files when the current user is
  CAP_CHECKPOINT_RESTORE capable in the root namespace, useful for
  recovering files that are unreachable via the file system such as
  deleted files, or memfd files.

See corresponding selftest for an example with clone3().

Signed-off-by: Adrian Reber <areber@redhat.com>
Signed-off-by: Nicolas Viennot <Nicolas.Viennot@twosigma.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20200719100418.2112740-2-areber@redhat.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-07-19 20:14:42 +02:00
arch Xtensa fixes for v5.8: 2020-07-12 13:29:07 -07:00
block block-5.8-2020-07-10 2020-07-10 09:55:46 -07:00
certs .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
crypto crypto: af_alg - fix use-after-free in af_alg_accept() due to bh_lock_sock() 2020-06-18 17:09:54 +10:00
Documentation Extend coding-style with inclusive-terminology recommendations. 2020-07-10 21:15:25 -07:00
drivers SCSI fixes on 20200711 2020-07-11 18:15:17 -07:00
fs io_uring-5.8-2020-07-12 2020-07-12 12:17:58 -07:00
include capabilities: Introduce CAP_CHECKPOINT_RESTORE 2020-07-19 20:14:42 +02:00
init kbuild: fix CONFIG_CC_CAN_LINK(_STATIC) for cross-compilation with Clang 2020-07-02 00:57:45 +09:00
ipc mmap locking API: use coccinelle to convert mmap_sem rwsem call sites 2020-06-09 09:39:14 -07:00
kernel RISC-V Fixes for 5.8-rc5 (ideally) 2020-07-11 19:22:46 -07:00
lib RISC-V Fixes for 5.8-rc5 (ideally) 2020-07-11 19:22:46 -07:00
LICENSES LICENSES: Rename other to deprecated 2019-05-03 06:34:32 -06:00
mm Fix gfs2 readahead deadlocks 2020-07-10 08:53:21 -07:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-07-10 18:16:22 -07:00
samples samples/vfs: avoid warning in statx override 2020-07-03 16:15:25 -07:00
scripts kbuild: Move -Wtype-limits to W=2 2020-07-09 18:00:56 -07:00
security capabilities: Introduce CAP_CHECKPOINT_RESTORE 2020-07-19 20:14:42 +02:00
sound sound fixes for 5.8-rc5 2020-07-08 11:07:09 -07:00
tools Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-07-10 18:16:22 -07:00
usr bpfilter: match bit size of bpfilter_umh to that of the kernel 2020-05-17 18:52:01 +09:00
virt kvm: use more precise cast and do not drop __user 2020-07-02 05:39:31 -04:00
.clang-format block: add bio_for_each_bvec_all() 2020-05-25 11:25:24 +02:00
.cocciconfig
.get_maintainer.ignore Opt out of scripts/get_maintainer.pl 2019-05-16 10:53:40 -07:00
.gitattributes .gitattributes: use 'dts' diff driver for dts files 2019-12-04 19:44:11 -08:00
.gitignore .gitignore: Do not track defconfig from make savedefconfig 2020-07-05 16:15:46 +09:00
.mailmap MAINTAINERS: update email address for Gerald Schaefer 2020-07-10 15:06:49 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS mailmap: change email for Ricardo Ribalda 2020-05-25 18:59:59 -06:00
Kbuild kbuild: rename hostprogs-y/always to hostprogs/always-y 2020-02-04 01:53:07 +09:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-07-10 18:16:22 -07:00
Makefile Linux 5.8-rc5 2020-07-12 16:34:50 -07:00
README Drop all 00-INDEX files from Documentation/ 2018-09-09 15:08:58 -06:00

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.