memcg: simplify mem_cgroup_reclaim_iter

mem_cgroup_iter() shares mem_cgroup_reclaim_iters among multiple
reclaimers to keep them from banging on the same cgroups.
To achieve this, mem_cgroup_reclaim_iter remembers the last visited
cgroup.  Before the recent changes, cgroup_next_descendant_pre()
required that the current cgroup either be alive or have been removed
less than an RCU grace period ago, as ->sibling.next couldn't be
trusted otherwise.

As bumping the cgroup_subsys_state reference doesn't prevent the
cgroup from being removed, instead of pinning the current cgroup,
mem_cgroup_reclaim_iter tracks the number of cgroup removal events in
the subtree and resets the iteration if any removal has happened since
caching the current cgroup.  This requires overly elaborate and
hard-to-follow synchronization as it needs to game the RCU grace
period of cgroup removal.
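
For illustration only, the removal-count idea described above can be
reduced to the following userspace sketch.  It is not the code being
removed: struct group, struct old_cursor and both helpers are
hypothetical names, and the sketch ignores the memory ordering and RCU
concerns that make the real scheme hard to follow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup; not a kernel structure. */
struct group {
	int id;
};

/* Hypothetical cursor: a cached position plus the removal count seen then. */
struct old_cursor {
	struct group *last_visited;
	unsigned long last_dead_count;
};

/* Cache @pos together with the subtree's removal count at that moment. */
static void old_cursor_store(struct old_cursor *c, struct group *pos,
			     unsigned long dead_count)
{
	assert(c != NULL);
	c->last_visited = pos;
	c->last_dead_count = dead_count;
}

/*
 * Return the cached position only if no removal has happened since it
 * was stored.  NULL tells the caller to restart the iteration.
 */
static struct group *old_cursor_load(const struct old_cursor *c,
				     unsigned long dead_count)
{
	assert(c != NULL);
	if (c->last_dead_count != dead_count)
		return NULL;
	return c->last_visited;
}
```

Note how a single removal anywhere in the subtree invalidates the
cached position, even if the cached cgroup itself is unaffected.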

Now that cgroup_next_descendant_pre() can return the next sibling
reliably regardless of the state of the current cgroup, this can be
implemented in a much simpler and more conventional way.
mem_cgroup_reclaim_iter can pin the current cgroup and use
__mem_cgroup_iter_next() on it for the next iteration.  The whole
thing becomes normal RCU synchronization.  Updating the cursor to the
next position is slightly more involved as multiple tasks could be
trying to update it at the same time; however, it can be easily
implemented using xchg().
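
As a rough userspace sketch of the cursor update, assuming C11 atomics
in place of the kernel primitives: atomic_exchange() plays the role of
xchg() and a plain counter stands in for the css reference.  struct
node, struct reclaim_cursor and the helpers are hypothetical names, not
the ones used in the patch.  A real implementation would also have to
load the cursor and take its reference under RCU protection, which
this sketch doesn't attempt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup's css; not the kernel structure. */
struct node {
	atomic_int refcnt;
	struct node *next;	/* pre-order successor, NULL at the end */
};

/* Hypothetical shared cursor; holds one reference while non-NULL. */
struct reclaim_cursor {
	_Atomic(struct node *) pos;
};

static void node_get(struct node *n)
{
	if (n)
		atomic_fetch_add(&n->refcnt, 1);
}

static void node_put(struct node *n)
{
	if (n) {
		int old = atomic_fetch_sub(&n->refcnt, 1);

		assert(old > 0);	/* catch an unbalanced put */
		(void)old;
	}
}

/*
 * Publish @next as the new position.  Each caller swaps in its own
 * pinned pointer and drops whatever it displaced, so concurrent
 * updaters never leak or double-put a reference.
 */
static void cursor_advance(struct reclaim_cursor *c, struct node *next)
{
	struct node *old;

	node_get(next);				/* reference owned by the cursor */
	old = atomic_exchange(&c->pos, next);	/* stands in for xchg() */
	node_put(old);				/* drop the displaced reference */
}

/* Step from the cached position, restarting at @first after a full round. */
static struct node *cursor_next(struct reclaim_cursor *c, struct node *first)
{
	struct node *last = atomic_load(&c->pos);
	struct node *next = last ? last->next : first;

	cursor_advance(c, next);
	return next;
}
```

Because the exchange hands every updater exactly one old pointer, no
lock is needed around the cursor itself; whichever update lands last
simply determines where the next reclaimer resumes.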

This replaces the overly elaborate synchronization scheme along with
->dead_count management with more conventional RCU usage.  As an
added bonus, the new implementation doesn't reset the cursor every
time a cgroup is deleted in the subtree.  It safely continues the
iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
1 file changed