 =========================
 BPF Graph Data Structures
 =========================

 This document describes implementation details of new-style "graph" data
 structures (linked_list, rbtree), with particular focus on the verifier's
 implementation of semantics specific to those data structures.

 Although no specific verifier code is referred to in this document, the document
 assumes that the reader has general knowledge of BPF verifier internals, BPF
 maps, and BPF program writing.

 Note that the intent of this document is to describe the current state of
 these graph data structures. **No guarantees** of stability for either
 semantics or APIs are made or implied here.

 .. contents::
     :local:
     :depth: 2

 Introduction
 ------------

 The BPF map API has historically been the main way to expose data structures
 of various types for use within BPF programs. Some data structures fit naturally
 with the map API (HASH, ARRAY), others less so. Consequently, programs
 interacting with the latter group of data structures can be hard to parse
 for kernel programmers without previous BPF experience.

 Luckily, some restrictions which necessitated the use of BPF map semantics are
 no longer relevant. With the introduction of kfuncs, kptrs, and the any-context
 BPF allocator, it is now possible to implement BPF data structures whose API
 and semantics more closely match those exposed to the rest of the kernel.

 Two such data structures - linked_list and rbtree - have many verification
 details in common. Because both have "root"s ("head" for linked_list) and
 "node"s, the verifier code and this document refer to common functionality
 as "graph_api", "graph_root", "graph_node", etc.

 Unless otherwise stated, examples and semantics below apply to both graph data
 structures.

 Unstable API
 ------------

 Data structures implemented using the BPF map API have historically used BPF
 helper functions - either standard map API helpers like ``bpf_map_update_elem``
 or map-specific helpers. The new-style graph data structures instead use kfuncs
 to define their manipulation helpers. Because there are no stability guarantees
 for kfuncs, the API and semantics for these data structures can be evolved in
 a way that breaks backwards compatibility if necessary.

 Root and node types for the new data structures are opaquely defined in the
 ``uapi/linux/bpf.h`` header.

 Locking
 -------

 The new-style data structures are intrusive and are defined similarly to their
 vanilla kernel counterparts:

 .. code-block:: c

         struct node_data {
           long key;
           long data;
           struct bpf_rb_node node;
         };

         struct bpf_spin_lock glock;
         struct bpf_rb_root groot __contains(node_data, node);

 The "root" type for both linked_list and rbtree expects to be in a map_value
 which also contains a ``bpf_spin_lock`` - in the above example both global
 variables are placed in a single-value arraymap. The verifier considers this
 spin_lock to be associated with the ``bpf_rb_root`` by virtue of both being in
 the same map_value and will enforce that the correct lock is held when
 verifying BPF programs that manipulate the tree. Since this lock checking
 happens at verification time, there is no runtime penalty.
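
 The association rule can be illustrated with a small userspace model. This is
 only a sketch: ``toy_lock``, ``toy_root``, ``toy_map_value``,
 ``toy_verifier_state`` and ``toy_root_access_ok`` are hypothetical names
 invented for illustration and do not correspond to actual verifier code. The
 model assumes the verifier remembers which lock is currently held and accepts
 an operation on a root only if that lock lives in the same map value as the
 root.

 .. code-block:: c

         #include <stdbool.h>
         #include <stddef.h>

         /* Toy stand-ins; these are NOT the kernel's types. */
         struct toy_lock { int unused; };
         struct toy_root { int unused; };

         /* One map value holding both the root and the lock that guards it. */
         struct toy_map_value {
           struct toy_lock lock;
           struct toy_root root;
         };

         /* Hypothetical verifier-side state: the lock held right now, if any. */
         struct toy_verifier_state {
           const struct toy_lock *held_lock; /* NULL when no lock is held */
         };

         /*
          * The root inside val may be manipulated only while the lock from
          * the same map value is held. Holding some other lock is not enough.
          */
         static bool toy_root_access_ok(const struct toy_verifier_state *st,
                                        const struct toy_map_value *val)
         {
           return st->held_lock == &val->lock;
         }

 Because the check is a comparison made while the program is being analyzed,
 nothing in this model needs to run when the program itself executes.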

 Non-owning references
 ---------------------

 **Motivation**

 Consider the following BPF code:

 .. code-block:: c

         struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

         bpf_spin_lock(&lock);

         bpf_rbtree_add(&tree, n); /* PASSED */

         bpf_spin_unlock(&lock);

 From the verifier's perspective, the pointer ``n`` returned from ``bpf_obj_new``
 has type ``PTR_TO_BTF_ID | MEM_ALLOC``, with a ``btf_id`` of
 ``struct node_data`` and a nonzero ``ref_obj_id``. Because it holds ``n``, the
 program has ownership of the pointee's (object pointed to by ``n``) lifetime.
 The BPF program must pass off ownership before exiting - either via
 ``bpf_obj_drop``, which ``free``'s the object, or by adding it to ``tree`` with
 ``bpf_rbtree_add``.

 (The ``ACQUIRED`` and ``PASSED`` comments in the example denote statements where
 "ownership is acquired" and "ownership is passed", respectively.)

 What should the verifier do with ``n`` after ownership is passed off? If the
 object was ``free``'d with ``bpf_obj_drop`` the answer is obvious: the verifier
 should reject programs which attempt to access ``n`` after ``bpf_obj_drop`` as
 the object is no longer valid. The underlying memory may have been reused for
 some other allocation, unmapped, etc.

 When ownership is passed to ``tree`` via ``bpf_rbtree_add`` the answer is less
 obvious. The verifier could enforce the same semantics as for ``bpf_obj_drop``,
 but that would result in programs with useful, common coding patterns being
 rejected, e.g.:

 .. code-block:: c

         int x;
         struct node_data *n = bpf_obj_new(typeof(*n)); /* ACQUIRED */

         bpf_spin_lock(&lock);

         bpf_rbtree_add(&tree, n); /* PASSED */
         x = n->data;
         n->data = 42;

         bpf_spin_unlock(&lock);

 Both the read from and write to ``n->data`` would be rejected. The verifier
 can do better, though, by taking advantage of two details:

   * Graph data structure APIs can only be used when the ``bpf_spin_lock``
     associated with the graph root is held

   * Both graph data structures have pointer stability

      * Because graph nodes are allocated with ``bpf_obj_new`` and
        adding / removing from the root involves fiddling with the
        ``bpf_{list,rb}_node`` field of the node struct, a graph node will
        remain at the same address after either operation.

 Because the associated ``bpf_spin_lock`` must be held by any program adding
 or removing, if we're in the critical section bounded by that lock, we know
 that no other program can add or remove until the end of the critical section.
 This combined with pointer stability means that, until the critical section
 ends, we can safely access the graph node through ``n`` even after it was used
 to pass ownership.

 The verifier considers such a reference a *non-owning reference*. The ref
 returned by ``bpf_obj_new`` is accordingly considered an *owning reference*.
 Both terms currently only have meaning in the context of graph nodes and API.
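
 The life cycle of ``n`` described above can be sketched as a tiny state
 machine. This is a userspace toy, not verifier code: ``toy_state``,
 ``toy_ref_kind`` and the ``toy_*`` functions are hypothetical names, and the
 model tracks a single pointer and a single lock only.

 .. code-block:: c

         #include <stdbool.h>

         enum toy_ref_kind { TOY_REF_INVALID, TOY_REF_OWNING, TOY_REF_NON_OWNING };

         struct toy_state {
           bool lock_held;
           enum toy_ref_kind n; /* kind currently tracked for the pointer n */
         };

         /* Models bpf_obj_new: the result is an owning reference. */
         static void toy_obj_new(struct toy_state *st) { st->n = TOY_REF_OWNING; }

         static void toy_spin_lock(struct toy_state *st) { st->lock_held = true; }

         /*
          * Models bpf_rbtree_add: needs the lock and an owning reference.
          * Ownership moves to the tree; n stays usable as non-owning.
          */
         static bool toy_rbtree_add(struct toy_state *st)
         {
           if (!st->lock_held || st->n != TOY_REF_OWNING)
             return false;
           st->n = TOY_REF_NON_OWNING;
           return true;
         }

         /* Reads and writes through n are fine unless n was invalidated. */
         static bool toy_access_ok(const struct toy_state *st)
         {
           return st->n != TOY_REF_INVALID;
         }

         /* Models bpf_spin_unlock: a non-owning n does not survive it. */
         static void toy_spin_unlock(struct toy_state *st)
         {
           st->lock_held = false;
           if (st->n == TOY_REF_NON_OWNING)
             st->n = TOY_REF_INVALID;
         }

 In this model the accesses to ``n->data`` from the earlier example pass
 ``toy_access_ok`` between the add and the unlock, and fail it afterwards.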

 **Details**

 Let's enumerate the properties of both types of references.

 *owning reference*

   * This reference controls the lifetime of the pointee

   * Ownership of pointee must be 'released' by passing it to some graph API
     kfunc, or via ``bpf_obj_drop``, which ``free``'s the pointee

     * If not released before program ends, verifier considers program invalid

   * Access to the pointee's memory will not page fault

 *non-owning reference*

   * This reference does not own the pointee

      * It cannot be used to add the graph node to a graph root, nor ``free``'d via
        ``bpf_obj_drop``

   * No explicit control of lifetime, but can infer valid lifetime based on
     non-owning ref existence (see explanation below)

   * Access to the pointee's memory will not page fault

 From the verifier's perspective, non-owning references can only exist
 between ``bpf_spin_lock`` and ``bpf_spin_unlock``. Why? After
 ``bpf_spin_unlock``, another program can do arbitrary operations on the data
 structure, such as removing a node and ``free``-ing it via ``bpf_obj_drop``.
 A non-owning ref to memory that was removed, ``free``'d, and reused via
 ``bpf_obj_new`` would point to an entirely different object. Or the memory
 could go away entirely.

 To prevent this logic violation, all non-owning references are invalidated by
 the verifier after a critical section ends. This is necessary to ensure the
 "will not page fault" property of non-owning references. So if the verifier
 hasn't invalidated a non-owning ref, accessing it will not page fault.

 Currently ``bpf_obj_drop`` is not allowed in the critical section, so
 if there's a valid non-owning ref, we must be in a critical section, and can
 conclude that the ref's memory hasn't been dropped-and- ``free``'d or
 dropped-and-reused.

 Any reference to a node that is in an rbtree *must* be non-owning, since
 the tree has control of the pointee's lifetime. Similarly, any ref to a node
 that isn't in an rbtree *must* be owning. This results in a nice property:
 graph API add / remove implementations don't need to check if a node
 has already been added (or already removed), as the ownership model
 allows the verifier to prevent such a state from being valid by simply checking
 types.
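
 Checking types can be pictured as a table of which reference kind each
 operation accepts. The sketch below is a hypothetical userspace model (the
 ``toy_*`` names are invented here); it encodes only the rules stated in this
 section and is not taken from the verifier.

 .. code-block:: c

         #include <stdbool.h>

         enum toy_ref_kind { TOY_REF_INVALID, TOY_REF_OWNING, TOY_REF_NON_OWNING };

         /* Adding hands ownership to the tree, so the caller must own the node. */
         static bool toy_may_add(enum toy_ref_kind k)
         {
           return k == TOY_REF_OWNING;
         }

         /* Dropping frees the node, so the caller must own it as well. */
         static bool toy_may_drop(enum toy_ref_kind k)
         {
           return k == TOY_REF_OWNING;
         }

         /* A node that is in the tree is only reachable via non-owning refs. */
         static bool toy_may_remove(enum toy_ref_kind k)
         {
           return k == TOY_REF_NON_OWNING;
         }

 Under these rules a node that is already in the tree cannot be added again,
 and a node that is not in the tree cannot be removed, without any runtime
 check in the add / remove implementations.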

 However, pointer aliasing poses an issue for the above "nice property".
 Consider the following example:

 .. code-block:: c

         struct node_data *n, *m, *o, *p;
         n = bpf_obj_new(typeof(*n));     /* 1 */

         bpf_spin_lock(&lock);

         bpf_rbtree_add(&tree, n);        /* 2 */
         m = bpf_rbtree_first(&tree);     /* 3 */

         o = bpf_rbtree_remove(&tree, n); /* 4 */
         p = bpf_rbtree_remove(&tree, m); /* 5 */

         bpf_spin_unlock(&lock);

         bpf_obj_drop(o);
         bpf_obj_drop(p); /* 6 */

 Assume the tree is empty before this program runs. If we track verifier state
 changes here using numbers in above comments:

   1) n is an owning reference

   2) n is a non-owning reference, it's been added to the tree

   3) n and m are non-owning references, they both point to the same node

   4) o is an owning reference, n and m non-owning, all point to same node

   5) o and p are owning, n and m non-owning, all point to the same node

   6) a double-free has occurred, since o and p point to same node and o was
      ``free``'d in previous statement

 States 4 and 5 violate our "nice property", as there are non-owning refs to
 a node which is not in an rbtree. Statement 5 will try to remove a node which
 has already been removed as a result of this violation. State 6 is a dangerous
 double-free.

 At a minimum we should prevent state 6 from being possible. If we can't also
 prevent state 5 then we must abandon our "nice property" and check whether a
 node has already been removed at runtime.

 We prevent both by generalizing the "invalidate non-owning references" behavior
 of ``bpf_spin_unlock`` and doing similar invalidation after
 ``bpf_rbtree_remove``. The logic is that any graph API kfunc which:

   * takes an arbitrary node argument

   * removes it from the data structure

   * returns an owning reference to the removed node

 may result in a state where some other non-owning reference points to the same
 node. So ``remove``-type kfuncs must be considered a non-owning reference
 invalidation point as well.
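
 The effect of this rule on the aliasing example can be traced with a toy
 model. Everything here is hypothetical and for illustration only: the
 ``toy_*`` names and the ``REF_*`` indices are invented, the lock is assumed
 to be held throughout, and ``toy_first`` ignores the possibility of an empty
 tree.

 .. code-block:: c

         #include <stdbool.h>

         enum toy_ref_kind { TOY_REF_INVALID, TOY_REF_OWNING, TOY_REF_NON_OWNING };

         /* Indices for the four pointers used in the aliasing example. */
         enum { REF_N, REF_M, REF_O, REF_P, REF_COUNT };

         struct toy_state {
           enum toy_ref_kind refs[REF_COUNT];
         };

         /* Invalidate every non-owning ref tracked in the state. */
         static void toy_invalidate_non_owning(struct toy_state *st)
         {
           for (int i = 0; i < REF_COUNT; i++)
             if (st->refs[i] == TOY_REF_NON_OWNING)
               st->refs[i] = TOY_REF_INVALID;
         }

         /* Models add: an owning ref becomes non-owning. */
         static bool toy_add(struct toy_state *st, int node)
         {
           if (st->refs[node] != TOY_REF_OWNING)
             return false;
           st->refs[node] = TOY_REF_NON_OWNING;
           return true;
         }

         /* Models first: the result is a non-owning ref. */
         static void toy_first(struct toy_state *st, int dst)
         {
           st->refs[dst] = TOY_REF_NON_OWNING;
         }

         /*
          * Models remove: the argument must be non-owning. All non-owning
          * refs are invalidated, then dst becomes the only (owning) ref.
          */
         static bool toy_remove(struct toy_state *st, int dst, int node)
         {
           if (st->refs[node] != TOY_REF_NON_OWNING)
             return false;
           toy_invalidate_non_owning(st);
           st->refs[dst] = TOY_REF_OWNING;
           return true;
         }

 In this model statement 4 invalidates both ``n`` and ``m``, so statement 5 is
 rejected, ``p`` never becomes an owning reference, and the double free of
 statement 6 cannot be expressed.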