Documentation/filesystems/propagate_umount.txt - linux/kernel/git/tj/wq - Git at Google

 	Notes on propagate_umount()

 Umount propagation starts with a set of mounts we are already going to
 take out.  Ideally, we would like to add all downstream cognates to
 that set - anything with the same mountpoint as one of the removed
 mounts and with parent that would receive events from the parent of that
 mount.  However, there are some constraints the resulting set must
 satisfy.

 It is convenient to define several properties of sets of mounts:

 1) A set S of mounts is non-shifting if for any mount X belonging
 to S all subtrees mounted strictly inside of X (i.e. not overmounting
 the root of X) contain only elements of S.

 2) A set S is non-revealing if all locked mounts that belong to S have
 parents that also belong to S.

 3) A set S is closed if it contains all children of its elements.

 The set of mounts taken out by umount(2) must be non-shifting and
 non-revealing; the first constraint is what allows to reparent
 any remaining mounts and the second is what prevents the exposure
 of any concealed mountpoints.

 propagate_umount() takes the original set as an argument and tries to
 extend that set.  The original set is a full subtree and its root is
 unlocked; what matters is that it's closed and non-revealing.
 Resulting set may not be closed; there might still be mounts outside
 of that set, but only on top of stacks of root-overmounting elements
 of set.  They can be reparented to the place where the bottom of
 stack is attached to a mount that will survive.  NOTE: doing that
 will violate a constraint on having no more than one mount with
 the same parent/mountpoint pair; however, the caller (umount_tree())
 will immediately remedy that - it may keep unmounted element attached
 to parent, but only if the parent itself is unmounted.  Since all
 conflicts created by reparenting have common parent *not* in the
 set and one side of the conflict (bottom of the stack of overmounts)
 is in the set, it will be resolved.  However, we rely upon umount_tree()
 doing that pretty much immediately after the call of propagate_umount().

 Algorithm is based on two statements:
 	1) for any set S, there is a maximal non-shifting subset of S
 and it can be calculated in O(#S) time.
 	2) for any non-shifting set S, there is a maximal non-revealing
 subset of S.  That subset is also non-shifting and it can be calculated
 in O(#S) time.

 		Finding candidates.

 We are given a closed set U and we want to find all mounts that have
 the same mountpoint as some mount m in U *and* whose parent receives
 propagation from the parent of the same mount m.  Naive implementation
 would be
 	S = {}
 	for each m in U
 		add m to S
 		p = parent(m)
 		for each q in Propagation(p) - {p}
 			child = look_up(q, mountpoint(m))
 			if child
 				add child to S
 but that can lead to excessive work - there might be propagation among the
 subtrees of U, in which case we'd end up examining the same candidates
 many times.  Since propagation is transitive, the same will happen to
 everything downstream of that candidate and it's not hard to construct
 cases where the approach above leads to the time quadratic by the actual
 number of candidates.

 Note that if we run into a candidate we'd already seen, it must've been
 added on an earlier iteration of the outer loop - all additions made
 during one iteration of the outer loop have different parents.  So
 if we find a child already added to the set, we know that everything
 in Propagation(parent(child)) with the same mountpoint has been already
 added.
 	S = {}
 	for each m in U
 		if m in S
 			continue
 		add m to S
 		p = parent(m)
 		q = propagation_next(p, p)
 		while q
 			child = look_up(q, mountpoint(m))
 			if child
 				if child in S
 					q = skip_them(q, p)
 					continue;
 				add child to S
 			q = propagation_next(q, p)
 where
 skip_them(q, p)
 	keep walking Propagation(p) from q until we find something
 	not in Propagation(q)

 would get rid of that problem, but we need a sane implementation of
 skip_them().  That's not hard to do - split propagation_next() into
 "down into mnt_slave_list" and "forward-and-up" parts, with the
 skip_them() being "repeat the forward-and-up part until we get NULL
 or something that isn't a peer of the one we are skipping".

 Note that there can be no absolute roots among the extra candidates -
 they all come from mount lookups.  Absolute root among the original
 set is _currently_ impossible, but it might be worth protecting
 against.

 		Maximal non-shifting subsets.

 Let's call a mount m in a set S forbidden in that set if there is a
 subtree mounted strictly inside m and containing mounts that do not
 belong to S.

 The set is non-shifting when none of its elements are forbidden in it.

 If mount m is forbidden in a set S, it is forbidden in any subset S' it
 belongs to.  In other words, it can't belong to any of the non-shifting
 subsets of S.  If we had a way to find a forbidden mount or show that
 there's none, we could use it to find the maximal non-shifting subset
 simply by finding and removing them until none remain.

 Suppose mount m is forbidden in S; then any mounts forbidden in S - {m}
 must have been forbidden in S itself.  Indeed, since m has descendents
 that do not belong to S, any subtree that fits into S will fit into
 S - {m} as well.

 So in principle we could go through elements of S, checking if they
 are forbidden in S and removing the ones that are.  Removals will
 not invalidate the checks done for earlier mounts - if they were not
 forbidden at the time we checked, they won't become forbidden later.
 It's too costly to be practical, but there is a similar approach that
 is linear by size of S.

 Let's say that mount x in a set S is forbidden by mount y, if
 	* both x and y belong to S.
 	* there is a chain of mounts starting at x and leaving S
 	  immediately after passing through y, with the first
 	  mountpoint strictly inside x.
 Note 1: x may be equal to y - that's the case when something not
 belonging to S is mounted strictly inside x.
 Note 2: if y does not belong to S, it can't forbid anything in S.
 Note 3: if y has no children outside of S, it can't forbid anything in S.

 It's easy to show that mount x is forbidden in S if and only if x is
 forbidden in S by some mount y.  And it's easy to find all mounts in S
 forbidden by a given mount.

 Consider the following operation:
 	Trim(S, m) = S - {x : x is forbidden by m in S}

 Note that if m does not belong to S or has no children outside of S we
 are guaranteed that Trim(S, m) is equal to S.

 The following is true: if x is forbidden by y in Trim(S, m), it was
 already forbidden by y in S.

 Proof: Suppose x is forbidden by y in Trim(S, m).  Then there is a
 chain of mounts (x_0 = x, ..., x_k = y, x_{k+1} = r), such that x_{k+1}
 is the first element that doesn't belong to Trim(S, m) and the
 mountpoint of x_1 is strictly inside x.  If mount r belongs to S, it must
 have been removed by Trim(S, m), i.e. it was forbidden in S by m.
 Then there was a mount chain from r to some child of m that stayed in
 S all the way until m, but that's impossible since x belongs to Trim(S, m)
 and prepending (x_0, ..., x_k) to that chain demonstrates that x is also
 forbidden in S by m, and thus can't belong to Trim(S, m).
 Therefore r can not belong to S and our chain demonstrates that
 x is forbidden by y in S.  QED.

 Corollary: no mount is forbidden by m in Trim(S, m).  Indeed, any
 such mount would have been forbidden by m in S and thus would have been
 in the part of S removed in Trim(S, m).

 Corollary: no mount is forbidden by m in Trim(Trim(S, m), n).  Indeed,
 any such would have to have been forbidden by m in Trim(S, m), which
 is impossible.

 Corollary: after
 	S = Trim(S, x_1)
 	S = Trim(S, x_2)
 	...
 	S = Trim(S, x_k)
 no mount remaining in S will be forbidden by either of x_1,...,x_k.

 The following will reduce S to its maximal non-shifting subset:
 	visited = {}
 	while S contains elements not belonging to visited
 		let m be an arbitrary such element of S
 		S = Trim(S, m)
 		add m to visited

 S never grows, so the number of elements of S not belonging to visited
 decreases at least by one on each iteration.  When the loop terminates,
 all mounts remaining in S belong to visited.  It's easy to see that at
 the beginning of each iteration no mount remaining in S will be forbidden
 by any element of visited.  In other words, no mount remaining in S will
 be forbidden, i.e. final value of S will be non-shifting.  It will be
 the maximal non-shifting subset, since we were removing only forbidden
 elements.

 	There are two difficulties in implementing the above in linear
 time, both due to the fact that Trim() might need to remove more than one
 element.  Naive implementation of Trim() is vulnerable to running into a
 long chain of mounts, each mounted on top of parent's root.  Nothing in
 that chain is forbidden, so nothing gets removed from it.  We need to
 recognize such chains and avoid walking them again on subsequent calls of
 Trim(), otherwise we will end up with worst-case time being quadratic by
 the number of elements in S.  Another difficulty is in implementing the
 outer loop - we need to iterate through all elements of a shrinking set.
 That would be trivial if we never removed more than one element at a time
 (linked list, with list_for_each_entry_safe for iterator), but we may
 need to remove more than one entry, possibly including the ones we have
 already visited.

 	Let's start with naive algorithm for Trim():

 Trim_one(m)
 	found = false
 	for each n in children(m)
 		if n not in S
 			found = true
 			if (mountpoint(n) != root(m))
 				remove m from S
 				break
 	if found
 		Trim_ancestors(m)

 Trim_ancestors(m)
 	for (; parent(m) in S; m = parent(m)) {
 		if (mountpoint(m) != root(parent(m)))
 			remove parent(m) from S
 	}

 If m belongs to S, Trim_one(m) will replace S with Trim(S, m).
 Proof:
 	Consider the chains excluding elements from Trim(S, m).  The last
 two elements in such chain are m and some child of m that does not belong
 to S.  If m has no such children, Trim(S, m) is equal to S.
 	m itself is removed if and only if the chain has exactly two
 elements, i.e. when the last element does not overmount the root of m.
 In other words, that happens when m has a child not in S that does not
 overmount the root of m.
 	All other elements to remove will be ancestors of m, such that
 the entire descent chain from them to m is contained in S.  Let
 (x_0, x_1, ..., x_k = m) be the longest such chain.  x_i needs to be
 removed if and only if x_{i+1} does not overmount its root.  It's easy
 to see that Trim_ancestors(m) will iterate through that chain from
 x_k to x_1 and that it will remove exactly the elements that need to be
 removed.

 	Note that if the loop in Trim_ancestors() walks into an already
 visited element, we are guaranteed that remaining iterations will see
 only elements that had already been visited and remove none of them.
 That's the weakness that makes it vulnerable to long chains of full
 overmounts.

 	It's easy to deal with, if we can afford setting marks on
 elements of S; we would mark all elements already visited by
 Trim_ancestors() and have it bail out as soon as it sees an already
 marked element.

 	The problems with iterating through the set can be dealt with in
 several ways, depending upon the representation we choose for our set.
 One useful observation is that we are given a closed subset in S - the
 original set passed to propagate_umount().  Its elements can neither
 forbid anything nor be forbidden by anything - all their descendents
 belong to S, so they can not occur anywhere in any excluding chain.
 In other words, the elements of that subset will remain in S until
 the end and Trim_one(S, m) is a no-op for all m from that subset.

 	That suggests keeping S as a disjoint union of a closed set U
 ('will be unmounted, no matter what') and the set of all elements of
 S that do not belong to U.  That set ('candidates') is all we need
 to iterate through.  Let's represent it as a subset in a cyclic list,
 consisting of all list elements that are marked as candidates (initially -
 all of them).  Then we could have Trim_ancestors() only remove the mark,
 leaving the elements on the list.  Then Trim_one() would never remove
 anything other than its argument from the containing list, allowing to
 use list_for_each_entry_safe() as iterator.

 	Assuming that representation we get the following:

 	list_for_each_entry_safe(m, ..., Candidates, ...)
 		Trim_one(m)
 where
 Trim_one(m)
 	if (m is not marked as a candidate)
 		strip the "seen by Trim_ancestors" mark from m
 		remove m from the Candidates list
 		return

 	remove_this = false
 	found = false
 	for each n in children(m)
 		if n not in S
 			found = true
 			if (mountpoint(n) != root(m))
 				remove_this = true
 				break
 	if found
 		Trim_ancestors(m)
 	if remove_this
 		strip the "seen by Trim_ancestors" mark from m
 		strip the "candidate" mark from m
 		remove m from the Candidate list

 Trim_ancestors(m)
 	for (p = parent(m); p is marked as candidate ; m = p, p = parent(p)) {
 		if m is marked as seen by Trim_ancestors
 			return
 		mark m as seen by Trim_ancestors
 		if (mountpoint(m) != root(p))
 			strip the "candidate" mark from p
 	}

 	Terminating condition in the loop in Trim_ancestors() is correct,
 since that that loop will never run into p belonging to U - p is always
 an ancestor of argument of Trim_one() and since U is closed, the argument
 of Trim_one() would also have to belong to U.  But Trim_one() is never
 called for elements of U.  In other words, p belongs to S if and only
 if it belongs to candidates.

 	Time complexity:
 * we get no more than O(#S) calls of Trim_one()
 * the loop over children in Trim_one() never looks at the same child
 twice through all the calls.
 * iterations of that loop for children in S are no more than O(#S)
 in the worst case
 * at most two children that are not elements of S are considered per
 call of Trim_one().
 * the loop in Trim_ancestors() sets its mark once per iteration and
 no element of S has is set more than once.

 	In the end we may have some elements excluded from S by
 Trim_ancestors() still stuck on the list.  We could do a separate
 loop removing them from the list (also no worse than O(#S) time),
 but it's easier to leave that until the next phase - there we will
 iterate through the candidates anyway.

 	The caller has already removed all elements of U from their parents'
 lists of children, which means that checking if child belongs to S is
 equivalent to checking if it's marked as a candidate; we'll never see
 the elements of U in the loop over children in Trim_one().

 	What's more, if we see that children(m) is empty and m is not
 locked, we can immediately move m into the committed subset (remove
 from the parent's list of children, etc.).  That's one fewer mount we'll
 have to look into when we check the list of children of its parent *and*
 when we get to building the non-revealing subset.

 		Maximal non-revealing subsets

 If S is not a non-revealing subset, there is a locked element x in S
 such that parent of x is not in S.

 Obviously, no non-revealing subset of S may contain x.  Removing such
 elements one by one will obviously end with the maximal non-revealing
 subset (possibly empty one).  Note that removal of an element will
 require removal of all its locked children, etc.

 If the set had been non-shifting, it will remain non-shifting after
 such removals.
 Proof: suppose S was non-shifting, x is a locked element of S, parent of x
 is not in S and S - {x} is not non-shifting.  Then there is an element m
 in S - {x} and a subtree mounted strictly inside m, such that m contains
 an element not in in S - {x}.  Since S is non-shifting, everything in
 that subtree must belong to S.  But that means that this subtree must
 contain x somewhere *and* that parent of x either belongs that subtree
 or is equal to m.  Either way it must belong to S.  Contradiction.

 // same representation as for finding maximal non-shifting subsets:
 // S is a disjoint union of a non-revealing set U (the ones we are committed
 // to unmount) and a set of candidates, represented as a subset of list
 // elements that have "is a candidate" mark on them.
 // Elements of U are removed from their parents' lists of children.
 // In the end candidates becomes empty and maximal non-revealing non-shifting
 // subset of S is now in U
 	while (Candidates list is non-empty)
 		handle_locked(first(Candidates))

 handle_locked(m)
 	if m is not marked as a candidate
 		strip the "seen by Trim_ancestors" mark from m
 		remove m from the list
 		return
 	cutoff = m
 	for (p = m; p in candidates; p = parent(p)) {
 		strip the "seen by Trim_ancestors" mark from p
 		strip the "candidate" mark from p
 		remove p from the Candidates list
 		if (!locked(p))
 			cutoff = parent(p)
 	}
 	if p in U
 		cutoff = p
 	while m != cutoff
 		remove m from children(parent(m))
 		add m to U
 		m = parent(m)

 Let (x_0, ..., x_n = m) be the maximal chain of descent of m within S.
 * If it contains some elements of U, let x_k be the last one of those.
 Then union of U with {x_{k+1}, ..., x_n} is obviously non-revealing.
 * otherwise if all its elements are locked, then none of {x_0, ..., x_n}
 may be elements of a non-revealing subset of S.
 * otherwise let x_k be the first unlocked element of the chain.  Then none
 of {x_0, ..., x_{k-1}} may be an element of a non-revealing subset of
 S and union of U and {x_k, ..., x_n} is non-revealing.

 handle_locked(m) finds which of these cases applies and adjusts Candidates
 and U accordingly.  U remains non-revealing, union of Candidates and
 U still contains any non-revealing subset of S and after the call of
 handle_locked(m) m is guaranteed to be not in Candidates list.  So having
 it called for each element of S would suffice to empty Candidates,
 leaving U the maximal non-revealing subset of S.

 However, handle_locked(m) is a no-op when m belongs to U, so it's enough
 to have it called for elements of Candidates list until none remain.

 Time complexity: number of calls of handle_locked() is limited by
 #Candidates, each iteration of the first loop in handle_locked() removes
 an element from the list, so their total number of executions is also
 limited by #Candidates; number of iterations in the second loop is no
 greater than the number of iterations of the first loop.


 		Reparenting

 After we'd calculated the final set, we still need to deal with
 reparenting - if an element of the final set has a child not in it,
 we need to reparent such child.

 Such children can only be root-overmounting (otherwise the set wouldn't
 be non-shifting) and their parents can not belong to the original set,
 since the original is guaranteed to be closed.


 		Putting all of that together

 The plan is to
 	* find all candidates
 	* trim down to maximal non-shifting subset
 	* trim down to maximal non-revealing subset
 	* reparent anything that needs to be reparented
 	* return the resulting set to the caller

 For the 2nd and 3rd steps we want to separate the set into growing
 non-revealing subset, initially containing the original set ("U" in
 terms of the pseudocode above) and everything we are still not sure about
 ("candidates").  It means that for the output of the 1st step we'd like
 the extra candidates separated from the stuff already in the original set.
 For the 4th step we would like the additions to U separate from the
 original set.

 So let's go for
 	* original set ("set").  Linkage via mnt_list
 	* undecided candidates ("candidates").  Subset of a list,
 consisting of all its elements marked with a new flag (T_UMOUNT_CANDIDATE).
 Initially all elements of the list will be marked that way; in the
 end the list will become empty and no mounts will remain marked with
 that flag.
 	* Reuse T_MARKED for "has been already seen by trim_ancestors()".
 	* anything in U that hadn't been in the original set - elements of
 candidates will gradually be either discarded or moved there.  In other
 words, it's the candidates we have already decided to unmount.	Its role
 is reasonably close to the old "to_umount", so let's use that name.
 Linkage via mnt_list.

 For gather_candidates() we'll need to maintain both candidates (S -
 set) and intersection of S with set.  Use T_UMOUNT_CANDIDATE for
 all elements we encounter, putting the ones not already in the original
 set into the list of candidates.  When we are done, strip that flag from
 all elements of the original set.  That gives a cheap way to check
 if element belongs to S (in gather_candidates) and to candidates
 itself (at later stages).  Call that predicate is_candidate(); it would
 be m->mnt_t_flags & T_UMOUNT_CANDIDATE.

 All elements of the original set are marked with MNT_UMOUNT and we'll
 need the same for elements added when joining the contents of to_umount
 to set in the end.  Let's set MNT_UMOUNT at the time we add an element
 to to_umount; that's close to what the old 'umount_one' is doing, so
 let's keep that name.  It also gives us another predicate we need -
 "belongs to union of set and to_umount"; will_be_unmounted() for now.

 Removals from the candidates list should strip both T_MARKED and
 T_UMOUNT_CANDIDATE; call it remove_from_candidates_list().