| Idmappings |
| ========== |
| |
| Most filesystem developers will have encountered idmappings. They have to be |
| used when reading from or writing ownership to disk, reporting ownership to |
| userspace, or for permission checking. This document is aimed at filesystem |
| developers that want to know how idmappings work. |
| |
| Formal notes |
| ------------ |
| |
| An idmapping is essentially a translation of a range of ids into another or the |
| same range of ids. The notational convention for idmappings that is widely used |
| in userspace is:: |
| |
| x:y:K |
| |
| The ``K`` parameter indicates the range of the idmapping, i.e. how many ids are |
| mapped. More generally, ``x`` is an element of the upper idmapset ``X`` and |
| ``y`` is an element of the lower idmapset ``Y``. |
| |
| To see what this looks like in practice, let's take the following idmapping:: |
| |
| 22:10000:3 |
| |
| and write down the mappings it will generate:: |
| |
| 22 -> 10000 |
| 23 -> 10001 |
| 24 -> 10002 |
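
As a minimal sketch, the expansion above can be reproduced in a few lines of Python. The helper name ``expand`` is hypothetical and only serves this illustration; it is not a kernel or userspace API.

```python
def expand(idmapping: str) -> list[tuple[int, int]]:
    """Expand an "x:y:K" string into the (upper, lower) pairs it generates."""
    x, y, k = (int(part) for part in idmapping.split(":"))
    # The i-th id of the upper idmapset maps to the i-th id of the lower one.
    return [(x + i, y + i) for i in range(k)]


for upper, lower in expand("22:10000:3"):
    print(f"{upper} -> {lower}")
```

Because both idmapsets are walked in the same order, the pairs also illustrate why the mapping preserves order.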
| |
| From a mathematical viewpoint ``X`` and ``Y`` are well-ordered sets and an |
| idmapping is an order isomorphism from ``X`` into ``Y``. So ``X`` and ``Y`` are |
| order isomorphic. In fact, ``X`` and ``Y`` are always well-ordered subsets of |
| the set of all possible ids useable on a given system. |
| |
| Looking at this mathematically briefly will help us highlight some properties |
| that make it easier to understand how we can translate between idmappings. For |
| example, we know that the inverse idmapping is an order isomorphism as well:: |
| |
| 10000 -> 22 |
| 10001 -> 23 |
| 10002 -> 24 |
| |
| Given that we are dealing with order isomorphisms plus the fact that we're |
| dealing with subsets we can embed idmappings into each other, i.e. we can |
| sensibly translate between different idmappings. For example, assume we've been |
| given the three idmappings:: |
| |
| 1. 0:10000:10000 |
| 2. 0:20000:10000 |
| 3. 0:30000:10000 |
| |
| and we're given the id ``11000`` which has been generated by the first |
| idmapping by mapping ``1000 -> 11000`` down from the upper into the lower |
| idmapset. |
| |
| Because we're dealing with order isomorphic subsets it is meaningful to ask |
| what id ``11000`` corresponds to in the second or third idmapping. The |
| straightforward algorithm to use is to apply the inverse of the first idmapping |
| ``11000 -> 1000`` and then use the second idmapping ``1000 -> 21000`` or the |
| third idmapping ``1000 -> 31000``. If we were given the same task for the |
| following three idmappings:: |
| |
| 1. 0:10000:10000 |
| 2. 0:20000:200 |
| 3. 0:30000:300 |
| |
| we would fail to translate as the sets aren't order isomorphic anymore over the |
| full range of the first idmapping. (However, they are order isomorphic over the |
| full range of the second idmapping.) Neither the second nor the third idmapping |
| contains id ``1000`` in the upper idmapset ``X``. This is equivalent to not |
| having an id mapped, so ``1000`` is an unmapped id in the second and third |
| idmapping. The kernel will report unmapped ids as the overflowuid ``(uid_t)-1`` |
| or overflowgid ``(gid_t)-1`` to userspace. |
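
The translation between idmappings described above, including the unmapped case, might be sketched as follows. The tuple representation ``(x, y, K)`` and the names ``map_down``, ``map_up`` and ``translate`` are assumptions made for this illustration, and ``None`` stands in for an unmapped id.

```python
from typing import Optional

Idmapping = tuple[int, int, int]  # (x, y, K)


def map_down(m: Idmapping, id_: int) -> Optional[int]:
    """Map an id from the upper into the lower idmapset, or None if unmapped."""
    x, y, k = m
    return id_ - x + y if x <= id_ < x + k else None


def map_up(m: Idmapping, id_: int) -> Optional[int]:
    """Map an id from the lower into the upper idmapset, or None if unmapped."""
    x, y, k = m
    return id_ - y + x if y <= id_ < y + k else None


def translate(src: Idmapping, dst: Idmapping, lower_id: int) -> Optional[int]:
    """Apply the inverse of src, then dst; None means the id is unmapped."""
    upper_id = map_up(src, lower_id)
    if upper_id is None:
        return None
    return map_down(dst, upper_id)


first = (0, 10000, 10000)
print(translate(first, (0, 20000, 10000), 11000))  # 21000
print(translate(first, (0, 30000, 10000), 11000))  # 31000
print(translate(first, (0, 20000, 200), 11000))    # None: 1000 is unmapped
print(translate(first, (0, 30000, 300), 11000))    # None: 1000 is unmapped
```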
| |
| The algorithm to calculate what a given id maps to is pretty simple. First, we |
| need to verify that the range can contain our target id. We will skip this step |
| for simplicity. After that if we want to know what the id ``id`` maps to we can |
| do simple calculations: |
| |
| - If we want to map from left to right:: |
| |
| x:y:K |
| id - x + y = z |
| |
| - If we want to map from right to left:: |
| |
| x:y:K |
| id - y + x = z |
| |
| Instead of "left to right" we can also say "down" and instead of "right to |
| left" we can also say "up". Obviously mapping down and up invert each other. |
| |
| To see whether the simple formulas above work, consider the following two |
| idmappings:: |
| |
| 1. 0:20000:10000 |
| 2. 500:30000:10000 |
| |
| Assume we are given the id ``21000`` in the lower idmapset of the first |
| idmapping. We want to know what id this was mapped from in the upper idmapset |
| of the first idmapping. So we're mapping up in the first idmapping:: |
| |
| id - y + x = z |
| 21000 - 20000 + 0 = 1000 |
| |
| Now assume we are given the id ``1100`` in the upper idmapset of the second |
| idmapping and we want to know what this id maps down to in the lower idmapset |
| of the second idmapping. This means we're mapping down in the second idmapping:: |
| |
| id - x + y = z |
| 1100 - 500 + 30000 = 30600 |
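
The two formulas, together with the range check we skipped, can be written down as a small Python sketch. The class ``Idmapping`` and its methods ``down`` and ``up`` are hypothetical names chosen for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idmapping:
    x: int  # first id of the upper idmapset
    y: int  # first id of the lower idmapset
    k: int  # number of ids mapped

    def down(self, id_: int) -> int:
        """Map left to right: id - x + y."""
        if not self.x <= id_ < self.x + self.k:
            raise ValueError(f"{id_} is not in the upper idmapset")
        return id_ - self.x + self.y

    def up(self, id_: int) -> int:
        """Map right to left: id - y + x."""
        if not self.y <= id_ < self.y + self.k:
            raise ValueError(f"{id_} is not in the lower idmapset")
        return id_ - self.y + self.x


first = Idmapping(0, 20000, 10000)
second = Idmapping(500, 30000, 10000)
print(first.up(21000))    # 1000
print(second.down(1100))  # 30600
```

Raising an exception for an out-of-range id is a choice made for this sketch; the kernel instead yields an invalid id in that case.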
| |
| General notes |
| ------------- |
| |
| In the context of the kernel an idmapping can be interpreted as mapping a range |
| of userspace ids into a range of kernel ids:: |
| |
| userspace-id:kernel-id:range |
| |
| A userspace id is always an element in the source idmapset of an idmapping of |
| type ``uid_t`` or ``gid_t`` and a kernel id is always an element in the target |
| idmapset of an idmapping of type ``kuid_t`` or ``kgid_t``. From now on |
| "userspace id" will be used to refer to the well known ``uid_t`` and ``gid_t`` |
| types and "kernel id" will be used to refer to ``kuid_t`` and ``kgid_t``. |
| |
| The kernel is mostly concerned with kernel ids. They are used when performing |
| permission checks and are stored in an inode's ``i_uid`` and ``i_gid`` fields. |
| A userspace id on the other hand is an id that is reported to userspace by the |
| kernel, or is passed by userspace to the kernel, or a raw device id that is |
| written or read from disk. |
| |
| Note that we are only concerned with idmappings as the kernel stores them, not |
| how userspace would specify them. |
| |
| A kernel id is always created by an idmapping. Such idmappings are associated |
| with user namespaces. Since we mainly care about how idmappings work we're not |
| going to be concerned with how idmappings are created nor how they are used |
| outside of the filesystem context. This is best left to an explanation of user |
| namespaces. |
| |
| The initial user namespace is special. It always has an idmapping of the |
| following form:: |
| |
| 0:0:4294967295 |
| |
| which is an identity idmapping over the full range of ids available on this |
| system. |
| |
| Other user namespaces usually have non-identity idmappings such as:: |
| |
| 0:10000:10000 |
| |
| When a process creates or wants to change ownership of a file, or when the |
| ownership of a file is read from disk by a filesystem, the userspace id is |
| immediately translated into a kernel id according to the idmapping associated |
| with the relevant user namespace. |
| |
| For instance, a file that is stored on disk by a filesystem as being owned by |
| userspace id ``1000``: |
| |
| - If a filesystem were to be mounted in the initial user namespace (as most |
| filesystems are) then the initial idmapping will be used. As we saw this is |
| simply the identity idmapping. This would mean the userspace id ``1000`` read |
| from disk would be mapped to kernel id ``1000``. So a VFS inode's ``i_uid`` |
| and ``i_gid`` fields would contain kernel id ``1000``. |
| |
| - If a filesystem were to be mounted in a user namespace with an idmapping of |
| ``0:10000:10000`` then the userspace id ``1000`` read from disk would be |
| mapped to kernel id ``11000``. So a VFS inode's ``i_uid`` and ``i_gid`` would |
| contain ``11000``. |
| |
| An idmapping ``0:10000:10000`` consists of a set of userspace ids or "userspace |
| idmapset" and a set of kernel ids or "kernel idmapset". This distinction is |
| important when translating between different idmappings. |
| |
| Translation algorithms |
| ---------------------- |
| |
| We've already seen briefly that it is possible to translate between different |
| idmappings. We'll now take a closer look at how that works. |
| |
| Crossmapping |
| ~~~~~~~~~~~~ |
| |
| This translation algorithm is used by the kernel in quite a few places. For |
| example, it is used when reporting back the ownership of a file to userspace |
| via the ``stat()`` system call family. |
| |
| If we've been given a kernel id ``11000`` from one idmapping we can map that id |
| up in another idmapping. In order for this to work both idmappings need to |
| contain the same kernel id in their kernel idmapsets. For example, consider the |
| following idmappings:: |
| |
| 1. 0:10000:10000 |
| 2. 20000:10000:10000 |
| |
| and we are mapping the userspace id ``1000`` according to the first idmapping |
| ``1000 -> 11000``. We can translate the kernel id ``11000`` into a userspace id |
| in the second idmapping using the kernel idmapset of the second idmapping:: |
| |
| /* Map the kernel id up into a userspace id in the second idmapping. */ |
| from_kuid(20000:10000:10000, 11000) = 21000 |
| |
| Note how we can get back to the kernel id in the first idmapping by inverting |
| the algorithm:: |
| |
| /* Map the userspace id down into a kernel id in the second idmapping. */ |
| make_kuid(20000:10000:10000, 21000) = 11000 |
| |
| /* Map the kernel id up into a userspace id in the first idmapping. */ |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| This algorithm allows us to answer the question what userspace id a given |
| kernel id corresponds to in a given idmapping. In order to be able to answer |
| this question both idmappings need to contain the same kernel id in their |
| respective kernel idmapsets. |
| |
| For example, when the kernel reads a raw userspace id from disk it maps it into |
| a kernel id according to the idmapping associated with the filesystem. Let's |
| assume the filesystem was mounted with an idmapping of ``0:20000:10000`` and it |
| reads a file owned by userspace id ``1000`` from disk. This means userspace id |
| ``1000`` will be mapped to kernel id ``21000`` which is what will be stored in |
| the VFS inode's ``i_uid`` and ``i_gid`` fields. |
| |
| When someone in userspace calls ``stat()`` or a related function to get |
| ownership information of the file the kernel can't simply map the id back up |
| according to the filesystem's idmapping as this would give the wrong owner. |
| Instead, the kernel will map the id back up in the idmapping of the caller. |
| Let's assume the caller has the slightly unconventional idmapping |
| ``3000:20000:10000``. Then the kernel id ``21000`` would map back up to |
| userspace id ``4000`` in this idmapping and consequently the user would see |
| that this file is owned by userspace id ``4000`` according to their idmapping. |
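
To make the ``stat()`` case concrete, here is a minimal Python sketch of the crossmapping. The functions ``make_kuid`` and ``from_kuid`` below are simplified stand-ins that take an ``(x, y, K)`` tuple instead of a user namespace, and ``UNMAPPED`` is an assumed placeholder for ``(uid_t)-1``; none of this is the kernel's actual implementation.

```python
UNMAPPED = -1  # placeholder for (uid_t)-1 in this sketch


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id."""
    x, y, k = idmapping
    return uid - x + y if x <= uid < x + k else UNMAPPED


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id."""
    x, y, k = idmapping
    return kuid - y + x if y <= kuid < y + k else UNMAPPED


filesystem = (0, 20000, 10000)
caller = (3000, 20000, 10000)

# Reading from disk: the raw id is mapped down in the filesystem's idmapping.
i_uid = make_kuid(filesystem, 1000)
print(i_uid)                         # 21000

# stat(): the kernel id is mapped up in the caller's idmapping.
print(from_kuid(caller, i_uid))      # 4000

# Mapping up in the filesystem's idmapping would yield the on-disk id instead.
print(from_kuid(filesystem, i_uid))  # 1000
```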
| |
| Remapping |
| ~~~~~~~~~ |
| |
| It is possible to translate the id from one idmapping to another one via the |
| userspace idmapset of the two idmappings. This is equivalent to remapping an |
| id. |
| |
| Let's look at an example. We are given the following two idmappings:: |
| |
| 1. 0:10000:10000 |
| 2. 0:20000:10000 |
| |
| and we are given the kernel id ``11000`` in the first idmapping. In order to |
| translate this kernel id in the first idmapping into a kernel id in the second |
| idmapping we need to perform two steps: |
| |
| 1. Map the kernel id up into a userspace id in the first idmapping:: |
| |
| /* Map the kernel id up into a userspace id in the first idmapping. */ |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| 2. Map the userspace id down into a kernel id in the second idmapping:: |
| |
| /* Map the userspace id down into a kernel id in the second idmapping. */ |
| make_kuid(0:20000:10000, 1000) = 21000 |
| |
| As you can see we used the userspace idmapset in both idmappings to translate |
| the kernel id in one idmapping to a kernel id in another idmapping. |
| |
| This allows us to answer the question what kernel id we would need to use to |
| get the same userspace id in another idmapping. In order to be able to answer |
| this question both idmappings need to contain the same userspace id in their |
| respective userspace idmapsets. |
| |
| Note how we can easily get back to the kernel id in the first idmapping by |
| inverting the algorithm: |
| |
| 1. Map the kernel id up into a userspace id in the second idmapping:: |
| |
| /* Map the kernel id up into a userspace id in the second idmapping. */ |
| from_kuid(0:20000:10000, 21000) = 1000 |
| |
| 2. Map the userspace id down into a kernel id in the first idmapping:: |
| |
| /* Map the userspace id down into a kernel id in the first idmapping. */ |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| Another way to look at this translation is to treat it as undoing an already |
| active idmapping and applying another idmapping. This will come in handy when |
| working with idmapped mounts. |
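
The remapping steps can be sketched in Python as follows. As before, ``make_kuid`` and ``from_kuid`` are simplified stand-ins operating on ``(x, y, K)`` tuples, and ``remap`` is a hypothetical helper name used only for this illustration.

```python
UNMAPPED = -1  # placeholder for (uid_t)-1 in this sketch


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id."""
    x, y, k = idmapping
    return uid - x + y if x <= uid < x + k else UNMAPPED


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id."""
    x, y, k = idmapping
    return kuid - y + x if y <= kuid < y + k else UNMAPPED


def remap(src, dst, kuid):
    """Undo the src idmapping, then apply the dst idmapping."""
    uid = from_kuid(src, kuid)
    if uid == UNMAPPED:
        return UNMAPPED
    return make_kuid(dst, uid)


first = (0, 10000, 10000)
second = (0, 20000, 10000)
print(remap(first, second, 11000))  # 21000
print(remap(second, first, 21000))  # 11000
```

Swapping the two idmapping arguments inverts the translation, which is the property idmapped mounts rely on.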
| |
| Invalid translations |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
| It is never valid to use an id in the kernel idmapset of one idmapping as the |
| id in the userspace idmapset of another or the same idmapping. While the kernel |
| idmapset always indicates an idmapset in the kernel id space the userspace |
| idmapset indicates a userspace id. So the following translations are forbidden:: |
| |
| /* Map the userspace id down into a kernel id in the first idmapping. */ |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| /* INVALID: Map the kernel id down into a kernel id in the second idmapping. */ |
| make_kuid(10000:20000:10000, 11000) = 21000 |
| |
| and equally wrong:: |
| |
| /* Map the kernel id up into a userspace id in the first idmapping. */ |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| /* INVALID: Map the userspace id up into a userspace id in the second idmapping. */ |
| from_kuid(20000:0:10000, 1000) = 21000 |
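
One way to see why these translations are invalid is to give userspace ids and kernel ids distinct types, so that mixing them is rejected. The following Python sketch does this with two hypothetical wrapper classes, ``Uid`` and ``Kuid``; it is an illustration of the idea under these assumptions, not a description of how the kernel enforces it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Uid:
    val: int  # an id in the userspace idmapset


@dataclass(frozen=True)
class Kuid:
    val: int  # an id in the kernel idmapset


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id; only accepts a Uid."""
    if not isinstance(uid, Uid):
        raise TypeError("make_kuid() expects a userspace id")
    x, y, k = idmapping
    if not x <= uid.val < x + k:
        raise ValueError("unmapped userspace id")
    return Kuid(uid.val - x + y)


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id; only accepts a Kuid."""
    if not isinstance(kuid, Kuid):
        raise TypeError("from_kuid() expects a kernel id")
    x, y, k = idmapping
    if not y <= kuid.val < y + k:
        raise ValueError("unmapped kernel id")
    return Uid(kuid.val - y + x)


kuid = make_kuid((0, 10000, 10000), Uid(1000))
print(kuid)  # Kuid(val=11000)

try:
    # INVALID: a kernel id used as if it were a userspace id.
    make_kuid((10000, 20000, 10000), kuid)
except TypeError as err:
    print("rejected:", err)
```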
| |
| Idmappings when creating filesystem objects |
| ------------------------------------------- |
| |
| The concepts of mapping an id down or mapping an id up are expressed in the two |
| kernel functions filesystem developers are rather familiar with:: |
| |
| /* Map the userspace id down into a kernel id. */ |
| make_kuid(idmapping, uid) |
| |
| /* Map the kernel id up into a userspace id. */ |
| from_kuid(idmapping, kuid) |
| |
| We will take an abbreviated look into how idmappings figure into creating |
| filesystem objects. For simplicity we will only look at what happens when the |
| VFS has already completed path lookup right before it calls into the filesystem |
| itself. So we're concerned with what happens when e.g. ``vfs_mkdir()`` is |
| called. We will also assume that the directory we're creating filesystem |
| objects in is readable and writable for everyone. |
| |
| When creating a filesystem object the kernel will look at the caller's |
| filesystem ids. These are just regular ``uid_t`` and ``gid_t`` userspace ids |
| but they are exclusively used when determining file ownership which is why they |
| are called "filesystem ids". They are usually identical to the uid and gid of |
| the caller but can differ. We will just assume they are always identical to not |
| get lost in too many details. |
| |
| When the caller enters the kernel two things happen: |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping. |
| (To be precise, the kernel will simply look at the kernel ids stashed in the |
| credentials of the current task but for our education we'll pretend this |
| translation happens just in time.) |
| 2. Verify that the caller's kernel ids can be mapped to userspace ids in the |
| filesystem's idmapping. |
| |
| The second step is important as regular filesystems will ultimately need to |
| translate the kernel id back into a raw userspace id when writing to disk. |
| So with the second step the kernel guarantees that a valid userspace id can be |
| written to disk. If it can't the kernel will refuse the creation request to not |
| even remotely risk filesystem corruption. |
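
The two steps above can be sketched in Python. The helper ``disk_uid_on_create`` is a hypothetical name, ``make_kuid`` and ``from_kuid`` are simplified stand-ins operating on ``(x, y, K)`` tuples, and the sketch only models the mapping check, not the rest of the creation path.

```python
UNMAPPED = -1  # placeholder for (uid_t)-1 in this sketch
INITIAL = (0, 0, 4294967295)


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id."""
    x, y, k = idmapping
    return uid - x + y if x <= uid < x + k else UNMAPPED


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id."""
    x, y, k = idmapping
    return kuid - y + x if y <= kuid < y + k else UNMAPPED


def disk_uid_on_create(caller_uid, caller, filesystem):
    """Return the id that would land on disk, or None if creation is refused."""
    # Step 1: map the caller's userspace id down in the caller's idmapping.
    kuid = make_kuid(caller, caller_uid)
    if kuid == UNMAPPED:
        return None
    # Step 2: verify the kernel id maps up in the filesystem's idmapping.
    disk_uid = from_kuid(filesystem, kuid)
    return None if disk_uid == UNMAPPED else disk_uid


print(disk_uid_on_create(1000, INITIAL, INITIAL))                      # 1000
print(disk_uid_on_create(1000, (0, 10000, 10000), (0, 20000, 10000)))  # None
print(disk_uid_on_create(1000, (0, 10000, 10000), INITIAL))            # 11000
```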
| |
| Example 1 |
| ~~~~~~~~~ |
| |
| :: |
| |
| caller userspace id: 1000 |
| caller idmapping: 0:0:4294967295 |
| filesystem idmapping: 0:0:4294967295 |
| |
| Both the caller and the filesystem use the identity idmapping: |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 2. Verify that the caller's kernel ids can be mapped to userspace ids in the |
| filesystem's idmapping. |
| |
| For this second step the kernel will call the function |
| ``fsuidgid_has_mapping()`` which ultimately boils down to calling |
| ``from_kuid()``:: |
| |
| from_kuid(0:0:4294967295, 1000) = 1000 |
| |
| The astute reader will have realized that this is simply a variation of the |
| crossmapping algorithm we mentioned above in a previous section. First, the |
| kernel maps the caller's userspace id down into a kernel id according to the |
| caller's idmapping and then maps that kernel id up according to the |
| filesystem's idmapping. In this example both idmappings are the same so there's |
| nothing exciting going on. Ultimately the userspace id that lands on disk will |
| be ``1000``. |
| |
| Example 2 |
| ~~~~~~~~~ |
| |
| :: |
| |
| caller userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:20000:10000 |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 2. Verify that the caller's kernel ids can be mapped to userspace ids in the |
| filesystem's idmapping:: |
| |
| from_kuid(0:20000:10000, 11000) = -1 |
| |
| It's immediately clear that while the caller's userspace id could be |
| successfully mapped down into kernel ids in the caller's idmapping the kernel |
| ids could not be mapped up according to the filesystem's idmapping. So the |
| kernel will deny this creation request. |
| |
| Note that while this example is less common, because most filesystems can't be |
| mounted with non-initial idmappings, this is a general problem. |
| |
| Example 3 |
| ~~~~~~~~~ |
| |
| :: |
| |
| caller userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:0:4294967295 |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 2. Verify that the caller's kernel ids can be mapped to userspace ids in the |
| filesystem's idmapping:: |
| |
| from_kuid(0:0:4294967295, 11000) = 11000 |
| |
| We can see that the translation always succeeds. The userspace id that the |
| filesystem will ultimately put to disk will always be identical to the value of |
| the kernel id that was created in the caller's idmapping. In this example |
| ``11000``. This has mainly two consequences. |
| |
| First, that we can't allow a caller to ultimately write to disk with another |
| userspace id. We could only do this if we were to mount the whole filesystem |
| with the caller's or another idmapping. But as we've seen that is limited to |
| a few filesystems and not very flexible. But this is a use-case that is pretty |
| important in containerized workloads. |
| |
| Second, the caller will usually not be able to create any files or access |
| directories that have stricter permissions because none of the filesystem's |
| kernel ids map up into valid userspace ids in the caller's idmapping: |
| |
| 1. Map raw userspace ids into kernel ids in the filesystem's idmapping:: |
| |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 2. Map kernel ids into userspace ids in the caller's idmapping:: |
| |
| from_kuid(0:10000:10000, 1000) = -1 |
| |
| Example 4 |
| ~~~~~~~~~ |
| |
| :: |
| |
| file userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:0:4294967295 |
| |
| In order to report ownership to userspace the kernel uses the crossmapping |
| algorithm introduced in a previous section: |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:10000:10000, 1000) = -1 |
| |
| The crossmapping algorithm fails in this case because the kernel id in the |
| filesystem idmapping cannot be mapped to a userspace id in the caller's |
| idmapping. Thus, the kernel will report the ownership of this file as the |
| overflowid. |
| |
| Example 5 |
| ~~~~~~~~~ |
| |
| :: |
| |
| file userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:20000:10000 |
| |
| In order to report ownership to userspace the kernel uses the crossmapping |
| algorithm introduced in a previous section: |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:20000:10000, 1000) = 21000 |
| |
| 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:10000:10000, 21000) = -1 |
| |
| Again, the crossmapping algorithm fails in this case because the kernel id in |
| the filesystem idmapping cannot be mapped to a userspace id in the caller's |
| idmapping. Thus, the kernel will report the ownership of this file as the |
| overflowid. |
| |
| Note how in the last two examples things would be simple if the caller were |
| using the initial idmapping. For a filesystem mounted with the initial |
| idmapping it would be trivial. So we only consider a filesystem with an |
| idmapping of ``0:20000:10000``: |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:20000:10000, 1000) = 21000 |
| |
| 2. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:0:4294967295, 21000) = 21000 |
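
The ownership reporting in the last two examples can be replayed with a short Python sketch. ``report_owner`` is a hypothetical helper, ``make_kuid`` and ``from_kuid`` are simplified stand-ins operating on ``(x, y, K)`` tuples, and ``UNMAPPED`` plays the role of the overflowid.

```python
UNMAPPED = -1  # placeholder for (uid_t)-1 in this sketch
INITIAL = (0, 0, 4294967295)


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id."""
    x, y, k = idmapping
    return uid - x + y if x <= uid < x + k else UNMAPPED


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id."""
    x, y, k = idmapping
    return kuid - y + x if y <= kuid < y + k else UNMAPPED


def report_owner(disk_uid, filesystem, caller):
    """Crossmap an on-disk id into the caller's idmapping."""
    kuid = make_kuid(filesystem, disk_uid)
    return from_kuid(caller, kuid)


caller = (0, 10000, 10000)
print(report_owner(1000, INITIAL, caller))             # -1 (Example 4)
print(report_owner(1000, (0, 20000, 10000), caller))   # -1 (Example 5)
print(report_owner(1000, (0, 20000, 10000), INITIAL))  # 21000
```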
| |
| Idmappings on idmapped mounts |
| ----------------------------- |
| |
| The examples we've seen in the previous section, where the caller's idmapping |
| and the filesystem's idmapping are incompatible, cause various issues for |
| workloads. For a more complex but common example, consider two containers |
| started on the host. To completely prevent the two containers from affecting |
| each other, an administrator may often use different non-overlapping idmappings |
| for the two containers:: |
| |
| container1 idmapping: 0:10000:10000 |
| container2 idmapping: 0:20000:10000 |
| filesystem idmapping: 0:30000:10000 |
| |
| An administrator wanting to provide easy read-write access to the following set |
| of files:: |
| |
| dir userspace id: 0 |
| dir/file1 userspace id: 1000 |
| dir/file2 userspace id: 2000 |
| |
| to both containers currently can't. |
| |
| Of course the administrator has the option to recursively change ownership via |
| ``chown()``. For example, they could change ownership so that ``dir`` and all |
| files below it can be crossmapped from the filesystem's into the container's |
| idmapping. Let's assume they change ownership so it is compatible with the |
| first container's idmapping:: |
| |
| dir userspace id: 10000 |
| dir/file1 userspace id: 11000 |
| dir/file2 userspace id: 12000 |
| |
| This would still leave ``dir`` rather useless to the second container. In fact, |
| ``dir`` and all files below it would continue to appear owned by the overflowid |
| for the second container. |
| |
| Or consider another increasingly popular example. Some service managers such as |
| systemd implement a concept called "portable home directories". A user may want |
| to use their home directories on different machines where they are assigned |
| different login userspace ids. Most users will have ``1000`` as the login id on |
| their machine at home and all files in their home directory will usually be |
| owned by id ``1000``. At uni or at work they may have another login id such as |
| ``1125``. This makes it rather difficult to interact with their home directory |
| on the work machine. |
| |
| In both cases changing ownership recursively has grave implications. The most |
| obvious one is that ownership is changed globally and permanently. In the home |
| directory case this change in ownership would even need to happen every time the |
| user switches from their home to their work machine. For really large sets of |
| files this becomes increasingly costly. |
| |
| If the user is lucky, they are dealing with a filesystem that is mountable |
| inside user namespaces. But this would also change ownership globally and the |
| change in ownership is tied to the lifetime of the filesystem mount, i.e. the |
| superblock. The only way to change ownership is to completely unmount the |
| filesystem and mount it again in another user namespace. This is usually |
| impossible because it would mean that all users currently accessing the |
| filesystem can't anymore. And it means that ``dir`` still can't be shared |
| between two containers with different idmappings. |
| But usually the user doesn't even have this option since most filesystems |
| aren't mountable inside containers. And not having them mountable might be |
| desirable as it doesn't require the filesystem to deal with malicious |
| filesystem images. |
| |
| But the usecases mentioned above and more can be handled by idmapped mounts. |
| They allow exposing the same set of dentries with different ownership at |
| different mounts. This is achieved by marking the mounts with a user namespace |
| through the ``mount_setattr()`` system call. The idmapping associated with it |
| is then used to translate from the caller's idmapping to the filesystem's |
| idmapping and vice versa using the remapping algorithm we introduced above. |
| |
| In contrast, idmapped mounts make it possible to change ownership in |
| a temporary and localized way. The ownership changes are restricted to |
| a specific mount and the ownership changes are tied to the lifetime of the |
| mount. All other users and locations where the filesystem is exposed are |
| unaffected. |
| |
| Filesystems that support idmapped mounts don't have any real reason to support |
| being mountable inside user namespaces. A filesystem could be exposed |
| completely under an idmapped mount to get the same effect. This has the |
| advantage that filesystems can leave the creation of the superblock to |
| privileged users in the initial user namespace. |
| |
| However, it is perfectly possible to combine idmapped mounts with filesystems |
| mountable inside user namespaces. We will touch on this further below. |
| |
| Idmapping functions were added that translate between idmappings. They make use |
| of the remapping algorithm we've introduced earlier. We're going to look at |
| two: |
| |
| - ``mapped_fsuid()`` and ``mapped_fsgid()`` |
| |
| The ``mapped_fs*id()`` functions translate the caller's kernel ids into |
| kernel ids in the filesystem's idmapping. This translation is achieved by |
| remapping the caller's kernel ids using the mount's idmapping:: |
| |
| /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ |
| uid = from_kuid(mount, id) |
| |
| /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ |
| kuid = make_kuid(filesystem, uid) |
| |
| - ``i_uid_into_mnt()`` and ``i_gid_into_mnt()`` |
| |
| The ``i_*id_into_mnt()`` functions translate filesystem's kernel ids into |
| kernel ids in the mount's idmapping:: |
| |
| /* Map the filesystem's kernel id up into a userspace id in the filesystem's idmapping. */ |
| uid = from_kuid(filesystem, id) |
| |
| /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ |
| kuid = make_kuid(mount, uid) |
| |
| Note that these two functions invert each other. Consider the following |
| idmappings:: |
| |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:20000:10000 |
| mount idmapping: 0:10000:10000 |
| |
| Assume a file with userspace id ``1000`` is read from disk. The filesystem maps |
| this userspace id into kernel id ``21000`` according to its idmapping. This is |
| what is stored in the inode's ``i_uid`` and ``i_gid`` fields. |
| |
| When the caller queries the ownership of this file via ``stat()`` the kernel |
| would usually simply use the crossmapping algorithm and map the filesystem's |
| kernel id up to a userspace id in the caller's idmapping. |
| |
| But when the caller is accessing the file on an idmapped mount the kernel will |
| first call ``i_uid_into_mnt()`` thereby translating the filesystem's kernel id |
| into a kernel id in the mount's idmapping:: |
| |
| i_uid_into_mnt(21000): |
| /* Map the filesystem's kernel id up into a userspace id. */ |
| 1000 = from_kuid(0:20000:10000, 21000) |
| |
| /* Map the filesystem's userspace id down into a kernel id in the mount's idmapping. */ |
| 11000 = make_kuid(0:10000:10000, 1000) |
| |
| Finally, when the kernel reports the owner to the caller it will turn the |
| kernel id in the mount's idmapping into a userspace id in the caller's |
| idmapping:: |
| |
| 1000 = from_kuid(0:10000:10000, 11000) |
| |
| We can test whether this algorithm really works by verifying what happens when |
| we create a new file. Let's say the user is creating a file with filesystem |
| userspace id ``1000``. |
| |
| The kernel maps this to kernel id ``11000`` in the caller's idmapping. Usually |
| the kernel would now apply the crossmapping, verifying that the kernel id |
| ``11000`` can be mapped to a userspace id in the filesystem's idmapping and |
| ultimately write that userspace id to disk. |
| |
| But when the caller is accessing the file on an idmapped mount the kernel will |
| first call ``mapped_fs*id()`` thereby translating the caller's kernel id into |
| a kernel id according to the mount's idmapping:: |
| |
| mapped_fsuid(11000): |
| /* Map the caller's kernel id up into a userspace id in the mount's idmapping. */ |
| 1000 = from_kuid(0:10000:10000, 11000) |
| |
| /* Map the mount's userspace id down into a kernel id in the filesystem's idmapping. */ |
| 21000 = make_kuid(0:20000:10000, 1000) |
| |
| When finally writing to disk the kernel will then map the kernel id ``21000`` |
| up into a userspace id in the filesystem's idmapping:: |
| |
| 1000 = from_kuid(0:20000:10000, 21000) |
| |
| As we can see, we end up with a reversible and information-preserving |
| algorithm. A file created from userspace id ``1000`` from an idmapped mount |
| will also be reported as being owned by userspace id ``1000`` and vice versa. |
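
That the two translation functions invert each other can be checked with a small Python model. The functions below reuse the kernel names for readability, but their signatures are assumptions made for this sketch: they take ``(x, y, K)`` tuples for the mount's and the filesystem's idmapping rather than the arguments the kernel functions actually take.

```python
UNMAPPED = -1  # placeholder for (uid_t)-1 in this sketch


def make_kuid(idmapping, uid):
    """Map a userspace id down into a kernel id."""
    x, y, k = idmapping
    return uid - x + y if x <= uid < x + k else UNMAPPED


def from_kuid(idmapping, kuid):
    """Map a kernel id up into a userspace id."""
    x, y, k = idmapping
    return kuid - y + x if y <= kuid < y + k else UNMAPPED


def mapped_fsuid(mount, filesystem, caller_kuid):
    """Model: remap the caller's kernel id into the filesystem's idmapping."""
    return make_kuid(filesystem, from_kuid(mount, caller_kuid))


def i_uid_into_mnt(mount, filesystem, i_uid):
    """Model: remap the filesystem's kernel id into the mount's idmapping."""
    return make_kuid(mount, from_kuid(filesystem, i_uid))


mount = (0, 10000, 10000)
filesystem = (0, 20000, 10000)

print(i_uid_into_mnt(mount, filesystem, 21000))  # 11000
print(mapped_fsuid(mount, filesystem, 11000))    # 21000

# Every kernel id covered by the mount's idmapping survives the round trip.
print(all(
    i_uid_into_mnt(mount, filesystem, mapped_fsuid(mount, filesystem, kuid)) == kuid
    for kuid in range(10000, 20000)
))  # True
```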
| |
| Let's now briefly reconsider the failing examples from earlier in the context |
| of idmapped mounts. |
| |
| Example 2 reconsidered |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| :: |
| |
| caller userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:20000:10000 |
| mount idmapping: 0:10000:10000 |
| |
| When the caller is using a non-initial idmapping the common case is to attach |
| the same idmapping to the mount. We now perform three steps: |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 2. Translate the caller's kernel id into a kernel id in the filesystem's |
| idmapping:: |
| |
| mapped_fsuid(11000): |
| /* Map the kernel id up into a userspace id in the mount's idmapping. */ |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| make_kuid(0:20000:10000, 1000) = 21000 |
| |
| 3. Verify that the caller's kernel ids can be mapped to userspace ids in the |
| filesystem's idmapping:: |
| |
| from_kuid(0:20000:10000, 21000) = 1000 |
| |
| So the ownership that lands on disk will be the userspace id ``1000``. |
| |
| Example 3 reconsidered |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| :: |
| |
| caller userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:0:4294967295 |
| mount idmapping: 0:10000:10000 |
| |
| The same translation algorithm works with the third example. |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 2. Translate the caller's kernel id into a kernel id in the filesystem's |
| idmapping:: |
| |
| mapped_fsuid(11000): |
| /* Map the kernel id up into a userspace id in the mount's idmapping. */ |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
| filesystem's idmapping:: |
| |
| from_kuid(0:0:4294967295, 1000) = 1000
| |
| So the ownership that lands on disk will be the userspace id ``1000``. |
| |
| Example 4 reconsidered |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| :: |
| |
| file userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:0:4294967295 |
| mount idmapping: 0:10000:10000 |
| |
| In order to report ownership to userspace the kernel now does three steps with |
| a translation algorithm we introduced earlier: |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 2. Translate the kernel id into a kernel id in the mount's idmapping:: |
| |
| i_uid_into_mnt(1000): |
| /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ |
| from_kuid(0:0:4294967295, 1000) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the mount's idmapping. */
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 3. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| Earlier, the file's kernel id couldn't be mapped up in the caller's
| idmapping. With the idmapped mount in place it can now be crossmapped from the
| filesystem's idmapping into the caller's idmapping via the mount's idmapping.
| The file is now reported as owned by userspace id ``1000``.
| |
| Example 5 reconsidered |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
| :: |
| |
| file userspace id: 1000 |
| caller idmapping: 0:10000:10000 |
| filesystem idmapping: 0:20000:10000 |
| mount idmapping: 0:10000:10000 |
| |
| Again, in order to report ownership to userspace the kernel now does three |
| steps with a translation algorithm we introduced earlier: |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:20000:10000, 1000) = 21000 |
| |
| 2. Translate the kernel id into a kernel id in the mount's idmapping:: |
| |
| i_uid_into_mnt(21000): |
| /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ |
| from_kuid(0:20000:10000, 21000) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the mount's idmapping. */
| make_kuid(0:10000:10000, 1000) = 11000 |
| |
| 3. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:10000:10000, 11000) = 1000 |
| |
| Earlier, the file's kernel id couldn't be mapped up in the caller's
| idmapping. With the idmapped mount in place it can now be crossmapped from the
| filesystem's idmapping into the caller's idmapping via the mount's idmapping.
| The file is now reported as owned by userspace id ``1000``.
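| 
| The reporting direction used in Examples 4 and 5 can be sketched the same way.
| As before this is a hedged illustration: ``struct idmap``, ``INVALID_ID`` and
| ``reported_owner()`` are hypothetical names, each idmapping is a single range,
| and the sketch returns ``INVALID_ID`` where the kernel would fall back to the
| overflow id.
| 

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "upper:lower:count" (not a kernel type). */
struct idmap {
    uint32_t upper; /* first userspace id */
    uint32_t lower; /* first kernel id */
    uint32_t count; /* number of ids mapped */
};

#define INVALID_ID UINT32_MAX /* stand-in for an unmapped id */

/* Map a userspace id down into a kernel id. */
static uint32_t make_kuid(struct idmap m, uint32_t uid)
{
    if (uid < m.upper || uid - m.upper >= m.count)
        return INVALID_ID;
    return m.lower + (uid - m.upper);
}

/* Map a kernel id up into a userspace id. */
static uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
    if (kuid < m.lower || kuid - m.lower >= m.count)
        return INVALID_ID;
    return m.upper + (kuid - m.lower);
}

/* Step 2: translate the file's kernel id into the mount's idmapping. */
static uint32_t i_uid_into_mnt(struct idmap mnt, struct idmap fs, uint32_t kuid)
{
    uint32_t uid = from_kuid(fs, kuid);
    return uid == INVALID_ID ? INVALID_ID : make_kuid(mnt, uid);
}

/* Userspace id the caller sees for a file, or INVALID_ID if unmappable. */
uint32_t reported_owner(struct idmap caller, struct idmap mnt,
                        struct idmap fs, uint32_t disk_uid)
{
    uint32_t kuid = make_kuid(fs, disk_uid);            /* step 1 */
    if (kuid == INVALID_ID)
        return INVALID_ID;
    uint32_t mnt_kuid = i_uid_into_mnt(mnt, fs, kuid);  /* step 2 */
    if (mnt_kuid == INVALID_ID)
        return INVALID_ID;
    return from_kuid(caller, mnt_kuid);                 /* step 3 */
}

/* Example 5 reconsidered, with the values from the text. */
void example_5_reconsidered(void)
{
    struct idmap caller = { 0, 10000, 10000 };
    struct idmap mnt    = { 0, 10000, 10000 };
    struct idmap fs     = { 0, 20000, 10000 };

    assert(reported_owner(caller, mnt, fs, 1000) == 1000);
}
```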
| |
| Changing ownership on a home directory |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| We've seen above how idmapped mounts can be used to translate between
| idmappings when either the caller, the filesystem or both use a non-initial
| idmapping. A wide range of use cases exists when the caller is using
| a non-initial idmapping. This mostly happens in the context of containerized
| workloads. As we have seen, the consequence is that access fails both for
| filesystems mounted with the initial idmapping and for filesystems mounted
| with non-initial idmappings, because the kernel ids can't be crossmapped
| between the caller's and the filesystem's idmapping.
| |
| As we've seen above idmapped mounts provide a solution to this by translating |
| between the caller's and the filesystem's idmapping. |
| |
| Aside from containerized workloads, idmapped mounts have the advantage that |
| they also work when both the caller and the filesystem use the initial |
| idmapping which means users on the host can change the ownership of dentries on |
| a per-mount basis. |
| |
| Consider our previous example where a user has their home directory on portable |
| storage. At home they have id ``1000`` and all files in their home directory |
| are owned by id ``1000`` whereas at uni or work they have login id ``1125``. |
| |
| Taking their home directory with them becomes problematic. They can't easily |
| access their files, they might not be able to write to disk without applying |
| lax permissions or ACLs and even if they can, they will end up with an annoying |
| mix of files and directories owned by id ``1000`` and id ``1125``. |
| |
| Idmapped mounts solve this problem. A user can create an idmapped
| mount for their home directory on their work computer or their computer at home |
| depending on what ownership they would prefer to end up on the portable storage |
| itself. |
| |
| Let's assume they want all files on disk to belong to userspace id ``1000``. |
| When the user plugs in their portable storage at their workstation they can
| set up a job that creates an idmapped mount with the minimal idmapping
| ``1000:1125:1``. So now when they create a file the kernel performs the |
| following steps we already know from above: |
| |
| :: |
| |
| caller userspace id: 1125 |
| caller idmapping: 0:0:4294967295 |
| filesystem idmapping: 0:0:4294967295 |
| mount idmapping: 1000:1125:1 |
| |
| 1. Map the caller's userspace ids into kernel ids in the caller's idmapping:: |
| |
| make_kuid(0:0:4294967295, 1125) = 1125 |
| |
| 2. Translate the caller's kernel id into a kernel id in the filesystem's |
| idmapping:: |
| |
| mapped_fsuid(1125): |
| /* Map the kernel id up into a userspace id in the mount's idmapping. */ |
| from_kuid(1000:1125:1, 1125) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 3. Verify that the caller's kernel ids can be mapped to userspace ids in the
| filesystem's idmapping:: |
| |
| from_kuid(0:0:4294967295, 1000) = 1000 |
| |
| So ultimately the file will be created with userspace id ``1000`` on disk. |
| |
| Now let's briefly look at what ownership the caller with id ``1125`` will see |
| on their work computer: |
| |
| :: |
| |
| file userspace id: 1000 |
| caller idmapping: 0:0:4294967295 |
| filesystem idmapping: 0:0:4294967295 |
| mount idmapping: 1000:1125:1 |
| |
| 1. Map the userspace id on disk down into a kernel id in the filesystem's |
| idmapping:: |
| |
| make_kuid(0:0:4294967295, 1000) = 1000 |
| |
| 2. Translate the kernel id into a kernel id in the mount's idmapping:: |
| |
| i_uid_into_mnt(1000): |
| /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ |
| from_kuid(0:0:4294967295, 1000) = 1000 |
| |
| /* Map the userspace id down into a kernel id in the mount's idmapping. */
| make_kuid(1000:1125:1, 1000) = 1125 |
| |
| 3. Map the kernel id up into a userspace id in the caller's idmapping:: |
| |
| from_kuid(0:0:4294967295, 1125) = 1125 |
| |
| So ultimately the kernel will report to the caller that the file belongs to
| userspace id ``1125`` which is the caller's userspace id on their workstation
| in our example.
| |
| The raw userspace id that is put on disk is ``1000`` so when the user takes |
| their home directory back to their home computer where they are assigned |
| userspace id ``1000`` using the initial idmapping and mount the filesystem with |
| the initial idmapping they will see all those files belonging to id ``1000``. |
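| 
| The round trip for the home directory example can be checked with a small C
| sketch. It rests on the same assumptions as the walkthrough: every idmapping is
| a single ``x:y:K`` range, and ``struct idmap`` and ``INVALID_ID`` are
| hypothetical helpers, not kernel types. It also shows that with the minimal
| idmapping ``1000:1125:1`` no id other than ``1125`` can be translated through
| the mount in this model.
| 

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "upper:lower:count" (not a kernel type). */
struct idmap {
    uint32_t upper; /* first userspace id */
    uint32_t lower; /* first kernel id */
    uint32_t count; /* number of ids mapped */
};

#define INVALID_ID UINT32_MAX /* stand-in for an unmapped id */

/* Map a userspace id down into a kernel id. */
static uint32_t make_kuid(struct idmap m, uint32_t uid)
{
    if (uid < m.upper || uid - m.upper >= m.count)
        return INVALID_ID;
    return m.lower + (uid - m.upper);
}

/* Map a kernel id up into a userspace id. */
static uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
    if (kuid < m.lower || kuid - m.lower >= m.count)
        return INVALID_ID;
    return m.upper + (kuid - m.lower);
}

/* Caller's kernel id -> filesystem kernel id, via the mount's idmapping. */
static uint32_t mapped_fsuid(struct idmap mnt, struct idmap fs, uint32_t kuid)
{
    uint32_t uid = from_kuid(mnt, kuid);
    return uid == INVALID_ID ? INVALID_ID : make_kuid(fs, uid);
}

/* Filesystem kernel id -> mount kernel id. */
static uint32_t i_uid_into_mnt(struct idmap mnt, struct idmap fs, uint32_t kuid)
{
    uint32_t uid = from_kuid(fs, kuid);
    return uid == INVALID_ID ? INVALID_ID : make_kuid(mnt, uid);
}

void example_home_directory(void)
{
    struct idmap init = { 0, 0, UINT32_MAX }; /* 0:0:4294967295 */
    struct idmap mnt  = { 1000, 1125, 1 };    /* 1000:1125:1 */

    /* Creating a file at work: caller 1125 ends up as 1000 on disk. */
    uint32_t kuid = make_kuid(init, 1125);
    uint32_t disk = from_kuid(init, mapped_fsuid(mnt, init, kuid));
    assert(disk == 1000);

    /* Reporting ownership at work: on-disk 1000 is seen as 1125. */
    uint32_t seen = from_kuid(init,
                              i_uid_into_mnt(mnt, init, make_kuid(init, disk)));
    assert(seen == 1125);

    /* Any other caller id falls outside the one-element mount idmapping. */
    assert(mapped_fsuid(mnt, init, make_kuid(init, 1001)) == INVALID_ID);
}
```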
| |
| Shortcircuiting
| ---------------
| |
| Currently, the implementation of idmapped mounts enforces that the filesystem |
| is mounted with the initial idmapping. The reason is simply that none of the |
| filesystems that we targeted were mountable with a non-initial idmapping. But |
| that might change soon enough. As we've seen above, thanks to the properties of |
| idmappings the translation works for both filesystems mounted with the initial
| idmapping and filesystems mounted with non-initial idmappings.
| |
| Based on this current restriction to filesystems mounted with the initial
| idmapping two notable shortcuts have been taken:
| |
| 1. We always stash a reference to the initial user namespace in ``struct |
| vfsmount``. Idmapped mounts are thus mounts that have a non-initial user |
| namespace attached to them. |
| |
| In order to support idmapped mounts on filesystems mounted with a non-initial
| idmapping this needs to be changed. Instead of stashing the initial user
| namespace, the user namespace the filesystem was mounted with must be stashed.
| An idmapped mount is then any mount that has a different user namespace
| attached than the filesystem was mounted with. This has no user-visible
| consequences.
| |
| 2. The translation algorithms in ``mapped_fs*id()`` and ``i_*id_into_mnt()`` |
| are simplified. |
| |
| Let's consider ``mapped_fs*id()`` first. This function translates the |
| caller's kernel id into a kernel id in the filesystem's idmapping via |
| a mount's idmapping. The full algorithm is:: |
| |
| mapped_fsuid(): |
| /* Map the kernel id up into a userspace id in the mount's idmapping. */ |
| uid_t uid = from_kuid(mount-idmapping, id) |
| |
| /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| kuid_t kuid = make_kuid(filesystem-idmapping, uid) |
| |
| We know that the filesystem is always mounted with the initial idmapping as |
| we enforce this in ``mount_setattr()``. So this can be shortened to:: |
| |
| mapped_fsuid(): |
| /* Map the kernel id up into a userspace id in the mount's idmapping. */ |
| uid_t uid = from_kuid(mount-idmapping, id) |
| |
| /* Map the userspace id down into a kernel id in the filesystem's idmapping. */ |
| kuid_t kuid = KUIDT_INIT(uid); |
| |
| Similarly, ``i_*id_into_mnt()`` translates the filesystem's kernel id into
| a mount's kernel id. The full algorithm is::
| |
| i_uid_into_mnt(): |
| /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ |
| uid_t uid = from_kuid(filesystem-idmapping, id) |
| |
| /* Map the userspace id down into a kernel id in the mount's idmapping. */
| kuid_t kuid = make_kuid(mount-idmapping, uid) |
| |
| Again, we know that the filesystem is always mounted with the initial |
| idmapping as we enforce this in ``mount_setattr()``. So this can be |
| shortened to:: |
| |
| i_uid_into_mnt(): |
| /* Map the kernel id up into a userspace id in the filesystem's idmapping. */ |
| uid_t uid = __kuid_val(kuid) |
| |
| /* Map the userspace id down into a kernel id in the mount's idmapping. */
| kuid_t kuid = make_kuid(mount-idmapping, uid) |
| |
| Handling filesystems mounted with non-initial idmappings requires that the |
| translation functions be converted to their full form. They can still be |
| shortcircuited on non-idmapped mounts. This has no user-visible consequences. |
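| 
| To see why the shortcuts are only safe under the current restriction, the full
| and shortened forms can be compared in a small C sketch. The ``*_full`` and
| ``*_short`` function names, ``struct idmap`` and ``INVALID_ID`` are
| hypothetical and exist only for this illustration. Each idmapping is assumed
| to be a single range, and plain integers stand in for ``kuid_t``, so
| ``KUIDT_INIT()`` and ``__kuid_val()`` become the identity.
| 

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single-range idmapping "upper:lower:count" (not a kernel type). */
struct idmap {
    uint32_t upper; /* first userspace id */
    uint32_t lower; /* first kernel id */
    uint32_t count; /* number of ids mapped */
};

#define INVALID_ID UINT32_MAX /* stand-in for an unmapped id */

/* Map a userspace id down into a kernel id. */
static uint32_t make_kuid(struct idmap m, uint32_t uid)
{
    if (uid < m.upper || uid - m.upper >= m.count)
        return INVALID_ID;
    return m.lower + (uid - m.upper);
}

/* Map a kernel id up into a userspace id. */
static uint32_t from_kuid(struct idmap m, uint32_t kuid)
{
    if (kuid < m.lower || kuid - m.lower >= m.count)
        return INVALID_ID;
    return m.upper + (kuid - m.lower);
}

/* Full form: valid for any filesystem idmapping. */
static uint32_t mapped_fsuid_full(struct idmap mnt, struct idmap fs, uint32_t kuid)
{
    uint32_t uid = from_kuid(mnt, kuid);
    return uid == INVALID_ID ? INVALID_ID : make_kuid(fs, uid);
}

/* Shortened form: the make_kuid() step is replaced by the identity. */
static uint32_t mapped_fsuid_short(struct idmap mnt, uint32_t kuid)
{
    return from_kuid(mnt, kuid);
}

/* Full form: valid for any filesystem idmapping. */
static uint32_t i_uid_into_mnt_full(struct idmap mnt, struct idmap fs, uint32_t kuid)
{
    uint32_t uid = from_kuid(fs, kuid);
    return uid == INVALID_ID ? INVALID_ID : make_kuid(mnt, uid);
}

/* Shortened form: the from_kuid() step is replaced by the identity. */
static uint32_t i_uid_into_mnt_short(struct idmap mnt, uint32_t kuid)
{
    return make_kuid(mnt, kuid);
}

void example_shortcircuit(void)
{
    struct idmap init = { 0, 0, UINT32_MAX }; /* 0:0:4294967295 */
    struct idmap mnt  = { 0, 10000, 10000 };
    struct idmap fs   = { 0, 20000, 10000 };

    /* With the initial filesystem idmapping both forms agree. */
    assert(mapped_fsuid_full(mnt, init, 11000) == mapped_fsuid_short(mnt, 11000));
    assert(i_uid_into_mnt_full(mnt, init, 1000) == i_uid_into_mnt_short(mnt, 1000));

    /* With a non-initial filesystem idmapping they diverge. */
    assert(mapped_fsuid_full(mnt, fs, 11000) == 21000);
    assert(mapped_fsuid_short(mnt, 11000) == 1000);
}
```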