pidfd: add CLONE_WAIT_PIDFD

If CLONE_WAIT_PIDFD is set the newly created process will not be
considered by process wait requests that wait generically on children
such as:

	syscall(__NR_wait4, -1, wstatus, options, rusage)
	syscall(__NR_waitpid, -1, wstatus, options)
	syscall(__NR_waitid, P_ALL, -1, siginfo, options, rusage)
	syscall(__NR_waitid, P_PGID, -1, siginfo, options, rusage)
	syscall(__NR_waitpid, -pid, wstatus, options)
	syscall(__NR_wait4, -pid, wstatus, options, rusage)

A process created with CLONE_WAIT_PIDFD can only be waited upon with a
focused wait call, i.e. one that names the process explicitly. This
ensures that such a process can still be reaped even if all file
descriptors referring to it are closed.

/* Usecases */
This feature has been requested multiple times in discussions when I
presented this work. Here are concrete use cases people have:
1. Process managers that would like to use pidfd for all process
   watching needs require this feature.
   A process manager (e.g. PID 1) that needs to reap all children
   assigned to it needs to invoke some form of waitall request as
   outlined above. This has to be done since the process manager might
   not know about processes that got re-parented to it. Without
   CLONE_WAIT_PIDFD such a waitall request will also reap the crucial
   internal processes that the process manager watches through pidfds.
2. Various libraries want to be able to fork off helper processes
   internally that do not otherwise affect the program they are used in.
   This is currently not possible: if the program invokes a waitall
   request, the library's internal helper process might get reaped,
   confusing the library, which expected to reap the helper itself.
   Careful programs will thus generally avoid waitall requests, which is
   inefficient.
3. A general class of programs uses event loops (e.g. GLib, systemd,
   and LXC). Such event loops currently call focused wait requests
   iteratively on all processes they are configured to watch to avoid
   waitall request pitfalls.
   This is ugly and inefficient since the O(n) cost is paid on each
   event loop iteration, which does not scale to large numbers of
   watched processes.
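
The pitfall described in use case 2 can be reproduced today with plain
fork() and waitpid(). The following is only a minimal sketch for a kernel
without CLONE_WAIT_PIDFD; the function name waitall_steals_helper() is
hypothetical and the "library" and "application" roles are simulated
inside a single function. It assumes the caller has no other children
and has not set SIGCHLD to SIG_IGN.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration only.
 * Returns 1 if a generic wait reaped the helper before the focused
 * wait could, 0 if the focused wait still succeeded, -1 on error.
 */
int waitall_steals_helper(void)
{
	int status;
	pid_t helper, reaped;

	/* "Library": fork an internal helper that exits immediately. */
	helper = fork();
	if (helper < 0)
		return -1;
	if (helper == 0)
		_exit(EXIT_SUCCESS);
	assert(helper > 0);

	/* "Application": generic waitall request reaps any child. */
	reaped = waitpid(-1, &status, 0);
	if (reaped != helper)
		return -1;

	/* "Library": the focused wait now fails, the helper is gone. */
	if (waitpid(helper, &status, 0) < 0 && errno == ECHILD)
		return 1;

	return 0;
}
```

With CLONE_WAIT_PIDFD as proposed here, the waitpid(-1, ...) call would
be expected to skip the helper, leaving it for the focused wait.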

/* Prior art */
FreeBSD has a similar concept (cf. [1], [2]). They currently do it the
other way around, i.e. by default procdescs are not visible in waitall
requests. However, they originally allowed procdescs to appear in
waitall requests and changed this later (cf. [1]).

Currently, CLONE_WAIT_PIDFD can only be used in conjunction with
CLONE_PIDFD.

/* References */
[1]: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201054
[2]: https://svnweb.freebsd.org/base/head/sys/kern/kern_exit.c

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-api@vger.kernel.org
7 files changed