fork: add clone3

This adds the clone3 system call.

We recently merged the CLONE_PIDFD patchset. It took the last free flag
from clone(). Independent of the CLONE_PIDFD patchset, a time namespace
was discussed at the Linux Plumbers Conference last year, and a patchset
for it has since been sent out and reviewed. It is expected to go upstream
in the not too distant future. However, it relies on the addition of
another flag - CLONE_TIMENS - to clone(). Given that we grabbed the last
clone() flag and thereby blocked the CLONE_TIMENS patchset, it seems only
right that we offer a solution.

Our idea is to keep clone3() very simple and very close to the original
clone(), i.e. to keep supporting clone()-based workloads.
We know there have been various creative proposals for what a new process
creation syscall, or even API, should look like. Some people even go so
far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea, this patchset
has nothing to do with that and does not want to. Our stance is that there
is no really good alternative to fork()+exec(), and we need and want to
support this model going forward. The following requirements guided us
for clone3():
- bump the number of available flags as much as possible while ensuring
  that all flag arguments are passed in registers so they remain easily
  accessible for seccomp.
- move non-flag arguments that are currently passed as separate arguments
  in clone() into a dedicated struct
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
- do not try to be clever or complex: keep clone3() as dumb as possible

What we came up with is clone3() which has the following signature:

struct clone3_args {
        __u32 version;
        __s32 pidfd;
        __aligned_u64 parent_tidptr;
        __aligned_u64 child_tidptr;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;

        /* keep kernel-only parameters at the end of the structure */
};

long sys_clone3(struct clone3_args __user *uargs,
                unsigned int flags1,
                unsigned int flags2,
                unsigned int flags3,
                unsigned int flags4,
                unsigned int flags5);
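
To illustrate the "easy to handle on 32 and on 64 bit" requirement, here is
a minimal userspace sketch that is not part of this patch. It mirrors
struct clone3_args with <stdint.h> stand-ins for __u32, __s32 and
__aligned_u64; the names layout_aligned_u64, clone3_args_layout and
clone3_ptr_to_u64() are made up for this example. Because every
pointer-sized value lives in an 8-byte aligned 64-bit field, the offsets
and the total size should come out the same on both ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed userspace stand-in for the kernel's __aligned_u64. */
typedef uint64_t layout_aligned_u64 __attribute__((aligned(8)));

/* Mirror of struct clone3_args from above, used for layout checks only. */
struct clone3_args_layout {
        uint32_t version;
        int32_t pidfd;
        layout_aligned_u64 parent_tidptr;
        layout_aligned_u64 child_tidptr;
        layout_aligned_u64 stack;
        layout_aligned_u64 stack_size;
        layout_aligned_u64 tls;
};

/* Hypothetical helper: widen a userspace pointer into a 64-bit field. */
static inline uint64_t clone3_ptr_to_u64(const void *ptr)
{
        return (uint64_t)(uintptr_t)ptr;
}

/* Two 4-byte members followed by five 8-byte members: 48 bytes. */
static_assert(sizeof(struct clone3_args_layout) == 48,
              "layout must not depend on pointer size");
static_assert(offsetof(struct clone3_args_layout, parent_tidptr) == 8,
              "first 64-bit field starts right after version and pidfd");
static_assert(offsetof(struct clone3_args_layout, tls) == 40,
              "tls is the last 8-byte field");
```

A caller would store pointers as, for example,
args.child_tidptr = clone3_ptr_to_u64(&tid), so a 32 bit process never
writes a narrower value than a 64 bit kernel reads.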

clone3() cleanly supports all of the flags supported by clone() in
flags1, i.e. flags1 is full and all legacy workloads are supported with
clone3().
With five 32-bit flag arguments, clone3() has 160 flag values in total,
which - even for a syscall that keeps growing features like clone() -
should last quite a while. If they are really all taken at some point, we
can simply bite the bullet and start adding additional flag arguments to
struct clone3_args itself.
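
The count of 160 follows from five unsigned int arguments of 32 bits each.
The sketch below only spells out that arithmetic. The helper names and the
idea of numbering flags 0..159 across flags1..flags5 are assumptions made
for illustration; the patch itself does not define such a numbering.

```c
#include <assert.h>
#include <limits.h>

/* Number of flag arguments in the sys_clone3() signature above. */
#define CLONE3_FLAG_ARGS 5u

/* Bits per flag argument; unsigned int is 32 bits on Linux targets. */
#define CLONE3_BITS_PER_ARG ((unsigned int)(sizeof(unsigned int) * CHAR_BIT))

/* Hypothetical helper: total number of single-bit flag values. */
static inline unsigned int clone3_total_flags(void)
{
        return CLONE3_FLAG_ARGS * CLONE3_BITS_PER_ARG;
}

/* Hypothetical helper: argument index (0 = flags1 ... 4 = flags5). */
static inline unsigned int clone3_flag_arg(unsigned int n)
{
        assert(n < clone3_total_flags());
        return n / CLONE3_BITS_PER_ARG;
}

/* Hypothetical helper: bit mask of flag n within its argument. */
static inline unsigned int clone3_flag_mask(unsigned int n)
{
        assert(n < clone3_total_flags());
        return 1u << (n % CLONE3_BITS_PER_ARG);
}
```

Under this illustrative numbering, flags 0..31 land in flags1, which is
the argument already filled by the legacy clone() flags, and flag 32 would
be the first bit of flags2.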

Another advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new API. Quite a lot of userspace APIs (e.g.
pthreads) are based on the clone() syscall. Because the new clone3()
syscall supports all of the old workloads and opens up the ability to add
new features, switching to it should be more appealing for userspace. In
essence, glibc can just write a simple wrapper to accommodate clone3().
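
As a rough idea of how small such a wrapper could be, the sketch below
translates legacy clone()-style arguments into the struct and the flag
words described above. It is not glibc code and it does not invoke the
syscall, since no syscall number appears in this description. All names
ending in _wrap are hypothetical, and version, pidfd and stack_size are
left at zero because the values the kernel expects are not stated here.

```c
#include <stdint.h>
#include <string.h>

/* Assumed userspace stand-in for the kernel's __aligned_u64. */
typedef uint64_t wrap_aligned_u64 __attribute__((aligned(8)));

/* Mirror of struct clone3_args from above. */
struct clone3_args_wrap {
        uint32_t version;
        int32_t pidfd;
        wrap_aligned_u64 parent_tidptr;
        wrap_aligned_u64 child_tidptr;
        wrap_aligned_u64 stack;
        wrap_aligned_u64 stack_size;
        wrap_aligned_u64 tls;
};

/* The five flag words that sys_clone3() takes in registers. */
struct clone3_flags_wrap {
        unsigned int flags1, flags2, flags3, flags4, flags5;
};

/* Hypothetical translation step of a clone()-compatible wrapper. */
static inline void clone_to_clone3_wrap(unsigned long clone_flags,
                                        void *stack, int *parent_tid,
                                        int *child_tid, unsigned long tls,
                                        struct clone3_args_wrap *args,
                                        struct clone3_flags_wrap *flags)
{
        memset(args, 0, sizeof(*args));
        memset(flags, 0, sizeof(*flags));

        /* All legacy clone() flags are carried in flags1. */
        flags->flags1 = (unsigned int)clone_flags;

        /* Former separate arguments move into the struct. */
        args->parent_tidptr = (uint64_t)(uintptr_t)parent_tid;
        args->child_tidptr = (uint64_t)(uintptr_t)child_tid;
        args->stack = (uint64_t)(uintptr_t)stack;
        args->tls = (uint64_t)tls;
}
```

A real wrapper would then pass args and the five flag words to the new
syscall; flags2 to flags5 stay zero for every legacy workload.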

Co-developed-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Christian Brauner <christian@brauner.io>
8 files changed