649 lines
32 KiB
ReStructuredText
649 lines
32 KiB
ReStructuredText
|
======================================================
|
|||
|
A Tour Through TREE_RCU's Grace-Period Memory Ordering
|
|||
|
======================================================
|
|||
|
|
|||
|
August 8, 2017
|
|||
|
|
|||
|
This article was contributed by Paul E. McKenney
|
|||
|
|
|||
|
Introduction
|
|||
|
============
|
|||
|
|
|||
|
This document gives a rough visual overview of how Tree RCU's
|
|||
|
grace-period memory ordering guarantee is provided.
|
|||
|
|
|||
|
What Is Tree RCU's Grace Period Memory Ordering Guarantee?
|
|||
|
==========================================================
|
|||
|
|
|||
|
RCU grace periods provide extremely strong memory-ordering guarantees
|
|||
|
for non-idle non-offline code.
|
|||
|
Any code that happens after the end of a given RCU grace period is guaranteed
|
|||
|
to see the effects of all accesses prior to the beginning of that grace
|
|||
|
period that are within RCU read-side critical sections.
|
|||
|
Similarly, any code that happens before the beginning of a given RCU grace
|
|||
|
period is guaranteed to not see the effects of all accesses following the end
|
|||
|
of that grace period that are within RCU read-side critical sections.
|
|||
|
|
|||
|
Note well that RCU-sched read-side critical sections include any region
|
|||
|
of code for which preemption is disabled.
|
|||
|
Given that each individual machine instruction can be thought of as
|
|||
|
an extremely small region of preemption-disabled code, one can think of
|
|||
|
``synchronize_rcu()`` as ``smp_mb()`` on steroids.
|
|||
|
|
|||
|
RCU updaters use this guarantee by splitting their updates into
|
|||
|
two phases, one of which is executed before the grace period and
|
|||
|
the other of which is executed after the grace period.
|
|||
|
In the most common use case, phase one removes an element from
|
|||
|
a linked RCU-protected data structure, and phase two frees that element.
|
|||
|
For this to work, any readers that have witnessed state prior to the
|
|||
|
phase-one update (in the common case, removal) must not witness state
|
|||
|
following the phase-two update (in the common case, freeing).
|
|||
|
|
|||
|
The RCU implementation provides this guarantee using a network
|
|||
|
of lock-based critical sections, memory barriers, and per-CPU
|
|||
|
processing, as is described in the following sections.
|
|||
|
|
|||
|
Tree RCU Grace Period Memory Ordering Building Blocks
|
|||
|
=====================================================
|
|||
|
|
|||
|
The workhorse for RCU's grace-period memory ordering is the
|
|||
|
critical section for the ``rcu_node`` structure's
|
|||
|
``->lock``. These critical sections use helper functions for lock
|
|||
|
acquisition, including ``raw_spin_lock_rcu_node()``,
|
|||
|
``raw_spin_lock_irq_rcu_node()``, and ``raw_spin_lock_irqsave_rcu_node()``.
|
|||
|
Their lock-release counterparts are ``raw_spin_unlock_rcu_node()``,
|
|||
|
``raw_spin_unlock_irq_rcu_node()``, and
|
|||
|
``raw_spin_unlock_irqrestore_rcu_node()``, respectively.
|
|||
|
For completeness, a ``raw_spin_trylock_rcu_node()`` is also provided.
|
|||
|
The key point is that the lock-acquisition functions, including
|
|||
|
``raw_spin_trylock_rcu_node()``, all invoke ``smp_mb__after_unlock_lock()``
|
|||
|
immediately after successful acquisition of the lock.
|
|||
|
|
|||
|
Therefore, for any given ``rcu_node`` structure, any access
|
|||
|
happening before one of the above lock-release functions will be seen
|
|||
|
by all CPUs as happening before any access happening after a later
|
|||
|
one of the above lock-acquisition functions.
|
|||
|
Furthermore, any access happening before one of the
|
|||
|
above lock-release function on any given CPU will be seen by all
|
|||
|
CPUs as happening before any access happening after a later one
|
|||
|
of the above lock-acquisition functions executing on that same CPU,
|
|||
|
even if the lock-release and lock-acquisition functions are operating
|
|||
|
on different ``rcu_node`` structures.
|
|||
|
Tree RCU uses these two ordering guarantees to form an ordering
|
|||
|
network among all CPUs that were in any way involved in the grace
|
|||
|
period, including any CPUs that came online or went offline during
|
|||
|
the grace period in question.
|
|||
|
|
|||
|
The following litmus test exhibits the ordering effects of these
|
|||
|
lock-acquisition and lock-release functions::
|
|||
|
|
|||
|
1 int x, y, z;
|
|||
|
2
|
|||
|
3 void task0(void)
|
|||
|
4 {
|
|||
|
5 raw_spin_lock_rcu_node(rnp);
|
|||
|
6 WRITE_ONCE(x, 1);
|
|||
|
7 r1 = READ_ONCE(y);
|
|||
|
8 raw_spin_unlock_rcu_node(rnp);
|
|||
|
9 }
|
|||
|
10
|
|||
|
11 void task1(void)
|
|||
|
12 {
|
|||
|
13 raw_spin_lock_rcu_node(rnp);
|
|||
|
14 WRITE_ONCE(y, 1);
|
|||
|
15 r2 = READ_ONCE(z);
|
|||
|
16 raw_spin_unlock_rcu_node(rnp);
|
|||
|
17 }
|
|||
|
18
|
|||
|
19 void task2(void)
|
|||
|
20 {
|
|||
|
21 WRITE_ONCE(z, 1);
|
|||
|
22 smp_mb();
|
|||
|
23 r3 = READ_ONCE(x);
|
|||
|
24 }
|
|||
|
25
|
|||
|
26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0);
|
|||
|
|
|||
|
The ``WARN_ON()`` is evaluated at "the end of time",
|
|||
|
after all changes have propagated throughout the system.
|
|||
|
Without the ``smp_mb__after_unlock_lock()`` provided by the
|
|||
|
acquisition functions, this ``WARN_ON()`` could trigger, for example
|
|||
|
on PowerPC.
|
|||
|
The ``smp_mb__after_unlock_lock()`` invocations prevent this
|
|||
|
``WARN_ON()`` from triggering.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But the chain of rcu_node-structure lock acquisitions guarantees |
|
|||
|
| that new readers will see all of the updater's pre-grace-period |
|
|||
|
| accesses and also guarantees that the updater's post-grace-period |
|
|||
|
| accesses will see all of the old reader's accesses. So why do we |
|
|||
|
| need all of those calls to smp_mb__after_unlock_lock()? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| Because we must provide ordering for RCU's polling grace-period |
|
|||
|
| primitives, for example, get_state_synchronize_rcu() and |
|
|||
|
| poll_state_synchronize_rcu(). Consider this code:: |
|
|||
|
| |
|
|||
|
| CPU 0 CPU 1 |
|
|||
|
| ---- ---- |
|
|||
|
| WRITE_ONCE(X, 1) WRITE_ONCE(Y, 1) |
|
|||
|
| g = get_state_synchronize_rcu() smp_mb() |
|
|||
|
| while (!poll_state_synchronize_rcu(g)) r1 = READ_ONCE(X) |
|
|||
|
| continue; |
|
|||
|
| r0 = READ_ONCE(Y) |
|
|||
|
| |
|
|||
|
| RCU guarantees that the outcome r0 == 0 && r1 == 0 will not |
|
|||
|
| happen, even if CPU 1 is in an RCU extended quiescent state |
|
|||
|
| (idle or offline) and thus won't interact directly with the RCU |
|
|||
|
| core processing at all. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
This approach must be extended to include idle CPUs, which need
|
|||
|
RCU's grace-period memory ordering guarantee to extend to any
|
|||
|
RCU read-side critical sections preceding and following the current
|
|||
|
idle sojourn.
|
|||
|
This case is handled by calls to the strongly ordered
|
|||
|
``atomic_add_return()`` read-modify-write atomic operation that
|
|||
|
is invoked within ``rcu_dynticks_eqs_enter()`` at idle-entry
|
|||
|
time and within ``rcu_dynticks_eqs_exit()`` at idle-exit time.
|
|||
|
The grace-period kthread invokes ``rcu_dynticks_snap()`` and
|
|||
|
``rcu_dynticks_in_eqs_since()`` (both of which invoke
|
|||
|
an ``atomic_add_return()`` of zero) to detect idle CPUs.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But what about CPUs that remain offline for the entire grace period? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| Such CPUs will be offline at the beginning of the grace period, so |
|
|||
|
| the grace period won't expect quiescent states from them. Races |
|
|||
|
| between grace-period start and CPU-hotplug operations are mediated |
|
|||
|
| by the CPU's leaf ``rcu_node`` structure's ``->lock`` as described |
|
|||
|
| above. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
The approach must be extended to handle one final case, that of waking a
|
|||
|
task blocked in ``synchronize_rcu()``. This task might be affinitied to
|
|||
|
a CPU that is not yet aware that the grace period has ended, and thus
|
|||
|
might not yet be subject to the grace period's memory ordering.
|
|||
|
Therefore, there is an ``smp_mb()`` after the return from
|
|||
|
``wait_for_completion()`` in the ``synchronize_rcu()`` code path.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| What? Where??? I don't see any ``smp_mb()`` after the return from |
|
|||
|
| ``wait_for_completion()``!!! |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| That would be because I spotted the need for that ``smp_mb()`` during |
|
|||
|
| the creation of this documentation, and it is therefore unlikely to |
|
|||
|
| hit mainline before v4.14. Kudos to Lance Roy, Will Deacon, Peter |
|
|||
|
| Zijlstra, and Jonathan Cameron for asking questions that sensitized |
|
|||
|
| me to the rather elaborate sequence of events that demonstrate the |
|
|||
|
| need for this memory barrier. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
Tree RCU's grace--period memory-ordering guarantees rely most heavily on
|
|||
|
the ``rcu_node`` structure's ``->lock`` field, so much so that it is
|
|||
|
necessary to abbreviate this pattern in the diagrams in the next
|
|||
|
section. For example, consider the ``rcu_prepare_for_idle()`` function
|
|||
|
shown below, which is one of several functions that enforce ordering of
|
|||
|
newly arrived RCU callbacks against future grace periods:
|
|||
|
|
|||
|
::
|
|||
|
|
|||
|
1 static void rcu_prepare_for_idle(void)
|
|||
|
2 {
|
|||
|
3 bool needwake;
|
|||
|
4 struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
|
|||
|
5 struct rcu_node *rnp;
|
|||
|
6 int tne;
|
|||
|
7
|
|||
|
8 lockdep_assert_irqs_disabled();
|
|||
|
9 if (rcu_rdp_is_offloaded(rdp))
|
|||
|
10 return;
|
|||
|
11
|
|||
|
12 /* Handle nohz enablement switches conservatively. */
|
|||
|
13 tne = READ_ONCE(tick_nohz_active);
|
|||
|
14 if (tne != rdp->tick_nohz_enabled_snap) {
|
|||
|
15 if (!rcu_segcblist_empty(&rdp->cblist))
|
|||
|
16 invoke_rcu_core(); /* force nohz to see update. */
|
|||
|
17 rdp->tick_nohz_enabled_snap = tne;
|
|||
|
18 return;
|
|||
|
19 }
|
|||
|
20 if (!tne)
|
|||
|
21 return;
|
|||
|
22
|
|||
|
23 /*
|
|||
|
24 * If we have not yet accelerated this jiffy, accelerate all
|
|||
|
25 * callbacks on this CPU.
|
|||
|
26 */
|
|||
|
27 if (rdp->last_accelerate == jiffies)
|
|||
|
28 return;
|
|||
|
29 rdp->last_accelerate = jiffies;
|
|||
|
30 if (rcu_segcblist_pend_cbs(&rdp->cblist)) {
|
|||
|
31 rnp = rdp->mynode;
|
|||
|
32 raw_spin_lock_rcu_node(rnp); /* irqs already disabled. */
|
|||
|
33 needwake = rcu_accelerate_cbs(rnp, rdp);
|
|||
|
34 raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */
|
|||
|
35 if (needwake)
|
|||
|
36 rcu_gp_kthread_wake();
|
|||
|
37 }
|
|||
|
38 }
|
|||
|
|
|||
|
But the only part of ``rcu_prepare_for_idle()`` that really matters for
|
|||
|
this discussion are lines 32–34. We will therefore abbreviate this
|
|||
|
function as follows:
|
|||
|
|
|||
|
.. kernel-figure:: rcu_node-lock.svg
|
|||
|
|
|||
|
The box represents the ``rcu_node`` structure's ``->lock`` critical
|
|||
|
section, with the double line on top representing the additional
|
|||
|
``smp_mb__after_unlock_lock()``.
|
|||
|
|
|||
|
Tree RCU Grace Period Memory Ordering Components
|
|||
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
Tree RCU's grace-period memory-ordering guarantee is provided by a
|
|||
|
number of RCU components:
|
|||
|
|
|||
|
#. `Callback Registry`_
|
|||
|
#. `Grace-Period Initialization`_
|
|||
|
#. `Self-Reported Quiescent States`_
|
|||
|
#. `Dynamic Tick Interface`_
|
|||
|
#. `CPU-Hotplug Interface`_
|
|||
|
#. `Forcing Quiescent States`_
|
|||
|
#. `Grace-Period Cleanup`_
|
|||
|
#. `Callback Invocation`_
|
|||
|
|
|||
|
Each of the following section looks at the corresponding component in
|
|||
|
detail.
|
|||
|
|
|||
|
Callback Registry
|
|||
|
^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
If RCU's grace-period guarantee is to mean anything at all, any access
|
|||
|
that happens before a given invocation of ``call_rcu()`` must also
|
|||
|
happen before the corresponding grace period. The implementation of this
|
|||
|
portion of RCU's grace period guarantee is shown in the following
|
|||
|
figure:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-callback-registry.svg
|
|||
|
|
|||
|
Because ``call_rcu()`` normally acts only on CPU-local state, it
|
|||
|
provides no ordering guarantees, either for itself or for phase one of
|
|||
|
the update (which again will usually be removal of an element from an
|
|||
|
RCU-protected data structure). It simply enqueues the ``rcu_head``
|
|||
|
structure on a per-CPU list, which cannot become associated with a grace
|
|||
|
period until a later call to ``rcu_accelerate_cbs()``, as shown in the
|
|||
|
diagram above.
|
|||
|
|
|||
|
One set of code paths shown on the left invokes ``rcu_accelerate_cbs()``
|
|||
|
via ``note_gp_changes()``, either directly from ``call_rcu()`` (if the
|
|||
|
current CPU is inundated with queued ``rcu_head`` structures) or more
|
|||
|
likely from an ``RCU_SOFTIRQ`` handler. Another code path in the middle
|
|||
|
is taken only in kernels built with ``CONFIG_RCU_FAST_NO_HZ=y``, which
|
|||
|
invokes ``rcu_accelerate_cbs()`` via ``rcu_prepare_for_idle()``. The
|
|||
|
final code path on the right is taken only in kernels built with
|
|||
|
``CONFIG_HOTPLUG_CPU=y``, which invokes ``rcu_accelerate_cbs()`` via
|
|||
|
``rcu_advance_cbs()``, ``rcu_migrate_callbacks``,
|
|||
|
``rcutree_migrate_callbacks()``, and ``takedown_cpu()``, which in turn
|
|||
|
is invoked on a surviving CPU after the outgoing CPU has been completely
|
|||
|
offlined.
|
|||
|
|
|||
|
There are a few other code paths within grace-period processing that
|
|||
|
opportunistically invoke ``rcu_accelerate_cbs()``. However, either way,
|
|||
|
all of the CPU's recently queued ``rcu_head`` structures are associated
|
|||
|
with a future grace-period number under the protection of the CPU's lead
|
|||
|
``rcu_node`` structure's ``->lock``. In all cases, there is full
|
|||
|
ordering against any prior critical section for that same ``rcu_node``
|
|||
|
structure's ``->lock``, and also full ordering against any of the
|
|||
|
current task's or CPU's prior critical sections for any ``rcu_node``
|
|||
|
structure's ``->lock``.
|
|||
|
|
|||
|
The next section will show how this ordering ensures that any accesses
|
|||
|
prior to the ``call_rcu()`` (particularly including phase one of the
|
|||
|
update) happen before the start of the corresponding grace period.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But what about ``synchronize_rcu()``? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| The ``synchronize_rcu()`` passes ``call_rcu()`` to ``wait_rcu_gp()``, |
|
|||
|
| which invokes it. So either way, it eventually comes down to |
|
|||
|
| ``call_rcu()``. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
Grace-Period Initialization
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Grace-period initialization is carried out by the grace-period kernel
|
|||
|
thread, which makes several passes over the ``rcu_node`` tree within the
|
|||
|
``rcu_gp_init()`` function. This means that showing the full flow of
|
|||
|
ordering through the grace-period computation will require duplicating
|
|||
|
this tree. If you find this confusing, please note that the state of the
|
|||
|
``rcu_node`` changes over time, just like Heraclitus's river. However,
|
|||
|
to keep the ``rcu_node`` river tractable, the grace-period kernel
|
|||
|
thread's traversals are presented in multiple parts, starting in this
|
|||
|
section with the various phases of grace-period initialization.
|
|||
|
|
|||
|
The first ordering-related grace-period initialization action is to
|
|||
|
advance the ``rcu_state`` structure's ``->gp_seq`` grace-period-number
|
|||
|
counter, as shown below:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp-init-1.svg
|
|||
|
|
|||
|
The actual increment is carried out using ``smp_store_release()``, which
|
|||
|
helps reject false-positive RCU CPU stall detection. Note that only the
|
|||
|
root ``rcu_node`` structure is touched.
|
|||
|
|
|||
|
The first pass through the ``rcu_node`` tree updates bitmasks based on
|
|||
|
CPUs having come online or gone offline since the start of the previous
|
|||
|
grace period. In the common case where the number of online CPUs for
|
|||
|
this ``rcu_node`` structure has not transitioned to or from zero, this
|
|||
|
pass will scan only the leaf ``rcu_node`` structures. However, if the
|
|||
|
number of online CPUs for a given leaf ``rcu_node`` structure has
|
|||
|
transitioned from zero, ``rcu_init_new_rnp()`` will be invoked for the
|
|||
|
first incoming CPU. Similarly, if the number of online CPUs for a given
|
|||
|
leaf ``rcu_node`` structure has transitioned to zero,
|
|||
|
``rcu_cleanup_dead_rnp()`` will be invoked for the last outgoing CPU.
|
|||
|
The diagram below shows the path of ordering if the leftmost
|
|||
|
``rcu_node`` structure onlines its first CPU and if the next
|
|||
|
``rcu_node`` structure has no online CPUs (or, alternatively if the
|
|||
|
leftmost ``rcu_node`` structure offlines its last CPU and if the next
|
|||
|
``rcu_node`` structure has no online CPUs).
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp-init-2.svg
|
|||
|
|
|||
|
The final ``rcu_gp_init()`` pass through the ``rcu_node`` tree traverses
|
|||
|
breadth-first, setting each ``rcu_node`` structure's ``->gp_seq`` field
|
|||
|
to the newly advanced value from the ``rcu_state`` structure, as shown
|
|||
|
in the following diagram.
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp-init-3.svg
|
|||
|
|
|||
|
This change will also cause each CPU's next call to
|
|||
|
``__note_gp_changes()`` to notice that a new grace period has started,
|
|||
|
as described in the next section. But because the grace-period kthread
|
|||
|
started the grace period at the root (with the advancing of the
|
|||
|
``rcu_state`` structure's ``->gp_seq`` field) before setting each leaf
|
|||
|
``rcu_node`` structure's ``->gp_seq`` field, each CPU's observation of
|
|||
|
the start of the grace period will happen after the actual start of the
|
|||
|
grace period.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But what about the CPU that started the grace period? Why wouldn't it |
|
|||
|
| see the start of the grace period right when it started that grace |
|
|||
|
| period? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| In some deep philosophical and overly anthromorphized sense, yes, the |
|
|||
|
| CPU starting the grace period is immediately aware of having done so. |
|
|||
|
| However, if we instead assume that RCU is not self-aware, then even |
|
|||
|
| the CPU starting the grace period does not really become aware of the |
|
|||
|
| start of this grace period until its first call to |
|
|||
|
| ``__note_gp_changes()``. On the other hand, this CPU potentially gets |
|
|||
|
| early notification because it invokes ``__note_gp_changes()`` during |
|
|||
|
| its last ``rcu_gp_init()`` pass through its leaf ``rcu_node`` |
|
|||
|
| structure. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
Self-Reported Quiescent States
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
When all entities that might block the grace period have reported
|
|||
|
quiescent states (or as described in a later section, had quiescent
|
|||
|
states reported on their behalf), the grace period can end. Online
|
|||
|
non-idle CPUs report their own quiescent states, as shown in the
|
|||
|
following diagram:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-qs.svg
|
|||
|
|
|||
|
This is for the last CPU to report a quiescent state, which signals the
|
|||
|
end of the grace period. Earlier quiescent states would push up the
|
|||
|
``rcu_node`` tree only until they encountered an ``rcu_node`` structure
|
|||
|
that is waiting for additional quiescent states. However, ordering is
|
|||
|
nevertheless preserved because some later quiescent state will acquire
|
|||
|
that ``rcu_node`` structure's ``->lock``.
|
|||
|
|
|||
|
Any number of events can lead up to a CPU invoking ``note_gp_changes``
|
|||
|
(or alternatively, directly invoking ``__note_gp_changes()``), at which
|
|||
|
point that CPU will notice the start of a new grace period while holding
|
|||
|
its leaf ``rcu_node`` lock. Therefore, all execution shown in this
|
|||
|
diagram happens after the start of the grace period. In addition, this
|
|||
|
CPU will consider any RCU read-side critical section that started before
|
|||
|
the invocation of ``__note_gp_changes()`` to have started before the
|
|||
|
grace period, and thus a critical section that the grace period must
|
|||
|
wait on.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But a RCU read-side critical section might have started after the |
|
|||
|
| beginning of the grace period (the advancing of ``->gp_seq`` from |
|
|||
|
| earlier), so why should the grace period wait on such a critical |
|
|||
|
| section? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| It is indeed not necessary for the grace period to wait on such a |
|
|||
|
| critical section. However, it is permissible to wait on it. And it is |
|
|||
|
| furthermore important to wait on it, as this lazy approach is far |
|
|||
|
| more scalable than a “big bang” all-at-once grace-period start could |
|
|||
|
| possibly be. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
If the CPU does a context switch, a quiescent state will be noted by
|
|||
|
``rcu_note_context_switch()`` on the left. On the other hand, if the CPU
|
|||
|
takes a scheduler-clock interrupt while executing in usermode, a
|
|||
|
quiescent state will be noted by ``rcu_sched_clock_irq()`` on the right.
|
|||
|
Either way, the passage through a quiescent state will be noted in a
|
|||
|
per-CPU variable.
|
|||
|
|
|||
|
The next time an ``RCU_SOFTIRQ`` handler executes on this CPU (for
|
|||
|
example, after the next scheduler-clock interrupt), ``rcu_core()`` will
|
|||
|
invoke ``rcu_check_quiescent_state()``, which will notice the recorded
|
|||
|
quiescent state, and invoke ``rcu_report_qs_rdp()``. If
|
|||
|
``rcu_report_qs_rdp()`` verifies that the quiescent state really does
|
|||
|
apply to the current grace period, it invokes ``rcu_report_rnp()`` which
|
|||
|
traverses up the ``rcu_node`` tree as shown at the bottom of the
|
|||
|
diagram, clearing bits from each ``rcu_node`` structure's ``->qsmask``
|
|||
|
field, and propagating up the tree when the result is zero.
|
|||
|
|
|||
|
Note that traversal passes upwards out of a given ``rcu_node`` structure
|
|||
|
only if the current CPU is reporting the last quiescent state for the
|
|||
|
subtree headed by that ``rcu_node`` structure. A key point is that if a
|
|||
|
CPU's traversal stops at a given ``rcu_node`` structure, then there will
|
|||
|
be a later traversal by another CPU (or perhaps the same one) that
|
|||
|
proceeds upwards from that point, and the ``rcu_node`` ``->lock``
|
|||
|
guarantees that the first CPU's quiescent state happens before the
|
|||
|
remainder of the second CPU's traversal. Applying this line of thought
|
|||
|
repeatedly shows that all CPUs' quiescent states happen before the last
|
|||
|
CPU traverses through the root ``rcu_node`` structure, the “last CPU”
|
|||
|
being the one that clears the last bit in the root ``rcu_node``
|
|||
|
structure's ``->qsmask`` field.
|
|||
|
|
|||
|
Dynamic Tick Interface
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Due to energy-efficiency considerations, RCU is forbidden from
|
|||
|
disturbing idle CPUs. CPUs are therefore required to notify RCU when
|
|||
|
entering or leaving idle state, which they do via fully ordered
|
|||
|
value-returning atomic operations on a per-CPU variable. The ordering
|
|||
|
effects are as shown below:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-dyntick.svg
|
|||
|
|
|||
|
The RCU grace-period kernel thread samples the per-CPU idleness variable
|
|||
|
while holding the corresponding CPU's leaf ``rcu_node`` structure's
|
|||
|
``->lock``. This means that any RCU read-side critical sections that
|
|||
|
precede the idle period (the oval near the top of the diagram above)
|
|||
|
will happen before the end of the current grace period. Similarly, the
|
|||
|
beginning of the current grace period will happen before any RCU
|
|||
|
read-side critical sections that follow the idle period (the oval near
|
|||
|
the bottom of the diagram above).
|
|||
|
|
|||
|
Plumbing this into the full grace-period execution is described
|
|||
|
`below <Forcing Quiescent States_>`__.
|
|||
|
|
|||
|
CPU-Hotplug Interface
|
|||
|
^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
RCU is also forbidden from disturbing offline CPUs, which might well be
|
|||
|
powered off and removed from the system completely. CPUs are therefore
|
|||
|
required to notify RCU of their comings and goings as part of the
|
|||
|
corresponding CPU hotplug operations. The ordering effects are shown
|
|||
|
below:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-hotplug.svg
|
|||
|
|
|||
|
Because CPU hotplug operations are much less frequent than idle
|
|||
|
transitions, they are heavier weight, and thus acquire the CPU's leaf
|
|||
|
``rcu_node`` structure's ``->lock`` and update this structure's
|
|||
|
``->qsmaskinitnext``. The RCU grace-period kernel thread samples this
|
|||
|
mask to detect CPUs having gone offline since the beginning of this
|
|||
|
grace period.
|
|||
|
|
|||
|
Plumbing this into the full grace-period execution is described
|
|||
|
`below <Forcing Quiescent States_>`__.
|
|||
|
|
|||
|
Forcing Quiescent States
|
|||
|
^^^^^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
As noted above, idle and offline CPUs cannot report their own quiescent
|
|||
|
states, and therefore the grace-period kernel thread must do the
|
|||
|
reporting on their behalf. This process is called “forcing quiescent
|
|||
|
states”, it is repeated every few jiffies, and its ordering effects are
|
|||
|
shown below:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp-fqs.svg
|
|||
|
|
|||
|
Each pass of quiescent state forcing is guaranteed to traverse the leaf
|
|||
|
``rcu_node`` structures, and if there are no new quiescent states due to
|
|||
|
recently idled and/or offlined CPUs, then only the leaves are traversed.
|
|||
|
However, if there is a newly offlined CPU as illustrated on the left or
|
|||
|
a newly idled CPU as illustrated on the right, the corresponding
|
|||
|
quiescent state will be driven up towards the root. As with
|
|||
|
self-reported quiescent states, the upwards driving stops once it
|
|||
|
reaches an ``rcu_node`` structure that has quiescent states outstanding
|
|||
|
from other CPUs.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| The leftmost drive to root stopped before it reached the root |
|
|||
|
| ``rcu_node`` structure, which means that there are still CPUs |
|
|||
|
| subordinate to that structure on which the current grace period is |
|
|||
|
| waiting. Given that, how is it possible that the rightmost drive to |
|
|||
|
| root ended the grace period? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| Good analysis! It is in fact impossible in the absence of bugs in |
|
|||
|
| RCU. But this diagram is complex enough as it is, so simplicity |
|
|||
|
| overrode accuracy. You can think of it as poetic license, or you can |
|
|||
|
| think of it as misdirection that is resolved in the |
|
|||
|
| `stitched-together diagram <Putting It All Together_>`__. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
Grace-Period Cleanup
|
|||
|
^^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Grace-period cleanup first scans the ``rcu_node`` tree breadth-first
|
|||
|
advancing all the ``->gp_seq`` fields, then it advances the
|
|||
|
``rcu_state`` structure's ``->gp_seq`` field. The ordering effects are
|
|||
|
shown below:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp-cleanup.svg
|
|||
|
|
|||
|
As indicated by the oval at the bottom of the diagram, once grace-period
|
|||
|
cleanup is complete, the next grace period can begin.
|
|||
|
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Quick Quiz**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| But when precisely does the grace period end? |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| **Answer**: |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
| There is no useful single point at which the grace period can be said |
|
|||
|
| to end. The earliest reasonable candidate is as soon as the last CPU |
|
|||
|
| has reported its quiescent state, but it may be some milliseconds |
|
|||
|
| before RCU becomes aware of this. The latest reasonable candidate is |
|
|||
|
| once the ``rcu_state`` structure's ``->gp_seq`` field has been |
|
|||
|
| updated, but it is quite possible that some CPUs have already |
|
|||
|
| completed phase two of their updates by that time. In short, if you |
|
|||
|
| are going to work with RCU, you need to learn to embrace uncertainty. |
|
|||
|
+-----------------------------------------------------------------------+
|
|||
|
|
|||
|
Callback Invocation
|
|||
|
^^^^^^^^^^^^^^^^^^^
|
|||
|
|
|||
|
Once a given CPU's leaf ``rcu_node`` structure's ``->gp_seq`` field has
|
|||
|
been updated, that CPU can begin invoking its RCU callbacks that were
|
|||
|
waiting for this grace period to end. These callbacks are identified by
|
|||
|
``rcu_advance_cbs()``, which is usually invoked by
|
|||
|
``__note_gp_changes()``. As shown in the diagram below, this invocation
|
|||
|
can be triggered by the scheduling-clock interrupt
|
|||
|
(``rcu_sched_clock_irq()`` on the left) or by idle entry
|
|||
|
(``rcu_cleanup_after_idle()`` on the right, but only for kernels build
|
|||
|
with ``CONFIG_RCU_FAST_NO_HZ=y``). Either way, ``RCU_SOFTIRQ`` is
|
|||
|
raised, which results in ``rcu_do_batch()`` invoking the callbacks,
|
|||
|
which in turn allows those callbacks to carry out (either directly or
|
|||
|
indirectly via wakeup) the needed phase-two processing for each update.
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-callback-invocation.svg
|
|||
|
|
|||
|
Please note that callback invocation can also be prompted by any number
|
|||
|
of corner-case code paths, for example, when a CPU notes that it has
|
|||
|
excessive numbers of callbacks queued. In all cases, the CPU acquires
|
|||
|
its leaf ``rcu_node`` structure's ``->lock`` before invoking callbacks,
|
|||
|
which preserves the required ordering against the newly completed grace
|
|||
|
period.
|
|||
|
|
|||
|
However, if the callback function communicates to other CPUs, for
|
|||
|
example, doing a wakeup, then it is that function's responsibility to
|
|||
|
maintain ordering. For example, if the callback function wakes up a task
|
|||
|
that runs on some other CPU, proper ordering must in place in both the
|
|||
|
callback function and the task being awakened. To see why this is
|
|||
|
important, consider the top half of the `grace-period
|
|||
|
cleanup`_ diagram. The callback might be
|
|||
|
running on a CPU corresponding to the leftmost leaf ``rcu_node``
|
|||
|
structure, and awaken a task that is to run on a CPU corresponding to
|
|||
|
the rightmost leaf ``rcu_node`` structure, and the grace-period kernel
|
|||
|
thread might not yet have reached the rightmost leaf. In this case, the
|
|||
|
grace period's memory ordering might not yet have reached that CPU, so
|
|||
|
again the callback function and the awakened task must supply proper
|
|||
|
ordering.
|
|||
|
|
|||
|
Putting It All Together
|
|||
|
~~~~~~~~~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
A stitched-together diagram is here:
|
|||
|
|
|||
|
.. kernel-figure:: TreeRCU-gp.svg
|
|||
|
|
|||
|
Legal Statement
|
|||
|
~~~~~~~~~~~~~~~
|
|||
|
|
|||
|
This work represents the view of the author and does not necessarily
|
|||
|
represent the view of IBM.
|
|||
|
|
|||
|
Linux is a registered trademark of Linus Torvalds.
|
|||
|
|
|||
|
Other company, product, and service names may be trademarks or service
|
|||
|
marks of others.
|