vfs: Allow filesystems with foreign owner IDs to override UID checks

Several ownership checks made by the VFS rest on a number of assumptions:

 (1) that it is meaningful to compare inode->i_uid to a second ->i_uid or
     to current_fsuid(),

 (2) that current_fsuid() represents the subject of the action,

 (3) that the number in ->i_uid belongs to the system's ID space, and

 (4) that the IDs can be represented by 32-bit integers.

Network filesystems, however, may violate all four of these assumptions.
Indeed, a network filesystem may not even have an actual concept of a UNIX
integer UID (cifs without POSIX extensions, for example).  Plug-in block
filesystems (e.g. USB drives) may also violate some of these assumptions.

In particular, AFS implements its own ACL security model with its own
per-cell user ID space with 64-bit IDs for some server variants.  The
subject is represented by a token in a key, not current_fsuid().  The AFS
user IDs and the system user IDs for a cell may be numerically equivalent,
but that's a matter of administrative policy and should perhaps be noted in
the cell definition or by mount option.  A subsequent patch will address
AFS.

To help fix this, three functions are defined to perform UID comparison
within the VFS:

 (1) vfs_inode_is_owned_by_me().  This defaults to comparing i_uid to
     current_fsuid(), with appropriate namespace mapping, assuming that the
     fsuid identifies the subject of the action.  The filesystem may
     override it by implementing an inode op:

	int (*is_owned_by_me)(struct mnt_idmap *idmap, struct inode *inode);

     This should return 0 if owned, 1 if not or an error if there's some
     sort of lookup failure.  It may use a means of identifying the subject
     of the action other than fsuid, for example by using an authentication
     token stored in a key.

 (2) vfs_inodes_have_same_owner().  This defaults to comparing the i_uids
     of two different inodes with appropriate namespace mapping.  The
     filesystem may override it by implementing another inode op:

	int (*have_same_owner)(struct mnt_idmap *idmap, struct inode *inode1,
			       struct inode *inode2);

     Again, this should return 0 if matching, 1 if not or an error if
     there's some sort of lookup failure.

 (3) vfs_inode_and_dir_have_same_owner().  This is similar to (2), but
     assumes that the second inode is the parent directory to the first and
     takes a nameidata struct instead of a second inode pointer.
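
The override-or-default dispatch described above can be modelled in
userspace roughly as follows.  This is only a sketch of the return
convention (0 if matching, 1 if not, negative errno on lookup failure),
not the kernel implementation: every *_model name is hypothetical, the
idmap/namespace mapping is left out entirely, and a fixed variable stands
in for current_fsuid().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins for kernel types; every *_model name is hypothetical. */
struct inode_model;

struct inode_ops_model {
	/* Each returns 0 on match, 1 on mismatch, negative errno on failure. */
	int (*is_owned_by_me)(const struct inode_model *inode);
	int (*have_same_owner)(const struct inode_model *inode1,
			       const struct inode_model *inode2);
};

struct inode_model {
	unsigned int i_uid;			/* stands in for ->i_uid */
	const struct inode_ops_model *i_op;	/* NULL: use the defaults */
};

/* Stands in for current_fsuid(); a fixed value keeps the sketch simple. */
static unsigned int model_fsuid = 1000;

/* Use the filesystem's op if it supplies one, else compare against fsuid. */
static int model_inode_is_owned_by_me(const struct inode_model *inode)
{
	assert(inode != NULL);
	if (inode->i_op && inode->i_op->is_owned_by_me)
		return inode->i_op->is_owned_by_me(inode);
	return inode->i_uid == model_fsuid ? 0 : 1;
}

/* Use the first inode's op if supplied, else compare the two i_uid values. */
static int model_inodes_have_same_owner(const struct inode_model *inode1,
					const struct inode_model *inode2)
{
	assert(inode1 != NULL && inode2 != NULL);
	if (inode1->i_op && inode1->i_op->have_same_owner)
		return inode1->i_op->have_same_owner(inode1, inode2);
	return inode1->i_uid == inode2->i_uid ? 0 : 1;
}

/*
 * Hypothetical override for a filesystem that cannot identify the subject
 * (say, no authentication token is available): report a lookup failure.
 */
static int model_no_token_is_owned_by_me(const struct inode_model *inode)
{
	(void)inode;
	return -EIO;
}

static const struct inode_ops_model model_netfs_ops = {
	.is_owned_by_me = model_no_token_is_owned_by_me,
	/* .have_same_owner left NULL, so the default comparison applies */
};
```

The point of the three-way result is that a caller must not treat every
nonzero value as "not the owner": a negative value means the filesystem
could not answer and should normally be propagated.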

Fix a number of places within the VFS where such UID checks are made that
should be deferring interpretation to the filesystem.

 (*) chown_ok()
 (*) chgrp_ok()

     Call vfs_inode_is_owned_by_me().  Possibly these need to defer all
     their checks to the network filesystem as the interpretation of the
     new UID/GID depends on the netfs too, but the ->setattr() method gets
     a chance to deal with that.

 (*) coredump_file()

     Call vfs_inode_is_owned_by_me() to check that the file created is owned
     by the caller - but the check that's there might be sufficient.

 (*) inode_owner_or_capable()

     Call vfs_inode_is_owned_by_me().  I'm not sure whether the namespace
     mapping makes sense in such a case, but it probably could be used.

 (*) vfs_setlease()

     Call vfs_inode_is_owned_by_me().  Actually, it should query if leasing
     is permitted.

     Also, setting locks could perhaps do with a permission call to the
     filesystem driver as AFS, for example, has a lock permission bit in
     the ACL, but since the AFS server checks that when the RPC call is
     made, it's probably unnecessary.

 (*) acl_permission_check()
 (*) posix_acl_permission()

     Unchanged.  These functions are only used by generic_permission()
     which is overridden if ->permission() is supplied, and when evaluating
     a POSIX ACL, it should arguably be checking the UID anyway.

     AFS, for example, implements its own ACLs and evaluates them in
     ->permission() and on the server.

 (*) may_follow_link()

     Call vfs_inode_and_dir_have_same_owner() and
     vfs_inode_is_owned_by_me() on the link and its parent dir.

 (*) may_create_in_sticky()

     Call both vfs_inode_is_owned_by_me() and
     vfs_inode_and_dir_have_same_owner().

     [?] Should this return ok immediately if the open call we're in
     created the file being checked?

 (*) __check_sticky()

     Call vfs_inode_is_owned_by_me() on both the dir and the inode, but for
     AFS vfs_inode_is_owned_by_me() on a directory doesn't work, so call
     vfs_inodes_have_same_owner() instead to check the directory (as is
     done in may_create_in_sticky()).

 (*) may_dedupe_file()

     Call vfs_inode_is_owned_by_me().

 (*) IMA policy ops.

     Unchanged for now.  I'm not sure what the best way to deal with this
     is - if, indeed, it needs any changes.
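
As an illustration of how a converted call site might consume the
three-way results, here is a userspace model of the decision order in a
may_follow_link()-style check.  It is a sketch under assumptions, not the
code in this patch: the function and macro names are hypothetical, the
protected-symlinks sysctl test is omitted, and the two ownership results
are passed in as already-computed values rather than being obtained by
calling the helpers lazily.

```c
#include <errno.h>

/* Local copies of the mode bits so the sketch does not need <sys/stat.h>. */
#define MODEL_S_ISVTX 01000	/* sticky bit */
#define MODEL_S_IWOTH 00002	/* world-writable bit */

/*
 * link_owned_by_me and link_and_dir_same_owner follow the convention
 * described for the new helpers: 0 if matching, 1 if not, negative errno
 * if the filesystem could not work it out.
 *
 * Returns 0 if following the link is allowed, -EACCES if refused, or the
 * propagated lookup error.
 */
static int model_may_follow_link(int link_owned_by_me,
				 unsigned int dir_mode,
				 int link_and_dir_same_owner)
{
	/* Allowed if the follower owns the link; errors are passed up. */
	if (link_owned_by_me <= 0)
		return link_owned_by_me;

	/* Allowed if the parent is not both sticky and world-writable. */
	if ((dir_mode & (MODEL_S_ISVTX | MODEL_S_IWOTH)) !=
	    (MODEL_S_ISVTX | MODEL_S_IWOTH))
		return 0;

	/* Allowed if the parent dir and the link share an owner. */
	if (link_and_dir_same_owner <= 0)
		return link_and_dir_same_owner;

	return -EACCES;
}
```

Testing "<= 0" handles the match and the error cases in one branch: a
match returns 0 (allowed) and a lookup failure returns the negative errno
unchanged, leaving only the definite mismatch to fall through.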

Note that wrapping stuff up into vfs_inode_is_owned_by_me() isn't
necessarily the most efficient as it means we may end up doing the uid
idmapping an extra time - though this is only done in three places, all to
do with world-writable sticky dir checks.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Etienne Champetier <champetier.etienne@gmail.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeffrey Altman <jaltman@auristor.com>
cc: Chet Ramey <chet.ramey@case.edu>
cc: Cheyenne Wills <cwills@sinenomine.net>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christian Brauner <brauner@kernel.org>
cc: Steve French <sfrench@samba.org>
cc: Mimi Zohar <zohar@linux.ibm.com>
cc: linux-afs@lists.infradead.org
cc: openafs-devel@openafs.org
cc: linux-cifs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
cc: linux-integrity@vger.kernel.org
Link: https://groups.google.com/g/gnu.bash.bug/c/6PPTfOgFdL4/m/2AQU-S1N76UJ
Link: https://git.savannah.gnu.org/cgit/bash.git/tree/redir.c?h=bash-5.3-rc1#n733
9 files changed