mirror of
				https://github.com/qemu/qemu.git
				synced 2025-10-31 04:06:46 +00:00 
			
		
		
		
	 ac06724a71
			
		
	
	
		ac06724a71
		
	
	
	
	
		
			
			Developer documentation should be its own manual. As a start, move all developer-oriented files to a separate directory. Also move non-text files to their own directories: docs/config/ for QEMU -readconfig input, and docs/spin/ for formal models to be used with the SPIN model checker. Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
		
			
				
	
	
		
			391 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			391 lines
		
	
	
		
			14 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| Using RCU (Read-Copy-Update) for synchronization
 | |
| ================================================
 | |
| 
 | |
| Read-copy update (RCU) is a synchronization mechanism that is used to
 | |
| protect read-mostly data structures.  RCU is very efficient and scalable
 | |
| on the read side (it is wait-free), and thus can make the read paths
 | |
| extremely fast.
 | |
| 
 | |
| RCU supports concurrency between a single writer and multiple readers,
 | |
| thus it is not used alone.  Typically, the write-side will use a lock to
 | |
| serialize multiple updates, but other approaches are possible (e.g.,
 | |
| restricting updates to a single task).  In QEMU, when a lock is used,
 | |
| this will often be the "iothread mutex", also known as the "big QEMU
 | |
| lock" (BQL).  Also, restricting updates to a single task is done in
 | |
| QEMU using the "bottom half" API.
 | |
| 
 | |
| RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
 | |
| sections of code with "critical sections", and the update side will wait
 | |
| for the execution of all *currently running* critical sections before
 | |
| proceeding, or before asynchronously executing a callback.
 | |
| 
 | |
| The key point here is that only the currently running critical sections
 | |
| are waited for; critical sections that are started _after_ the beginning
 | |
| of the wait do not extend the wait, despite running concurrently with
 | |
| the updater.  This is the reason why RCU is more scalable than,
 | |
| for example, reader-writer locks.  It is so much more scalable that
 | |
| the system will have a single instance of the RCU mechanism; a single
 | |
| mechanism can be used for an arbitrary number of "things", without
 | |
| having to worry about things such as contention or deadlocks.
 | |
| 
 | |
| How is this possible?  The basic idea is to split updates in two phases,
 | |
| "removal" and "reclamation".  During removal, we ensure that subsequent
 | |
| readers will not be able to get a reference to the old data.  After
 | |
| removal has completed, a critical section will not be able to access
 | |
| the old data.  Therefore, critical sections that begin after removal
 | |
| do not matter; as soon as all previous critical sections have finished,
 | |
| there cannot be any readers who hold references to the data structure,
 | |
| and these can now be safely reclaimed (e.g., freed or unref'ed).
 | |
| 
 | |
| Here is a picture:
 | |
| 
 | |
|         thread 1                  thread 2                  thread 3
 | |
|     -------------------    ------------------------    -------------------
 | |
|     enter RCU crit.sec.
 | |
|            |                finish removal phase
 | |
|            |                begin wait
 | |
|            |                      |                    enter RCU crit.sec.
 | |
|     exit RCU crit.sec             |                           |
 | |
|                             complete wait                     |
 | |
|                             begin reclamation phase           |
 | |
|                                                        exit RCU crit.sec.
 | |
| 
 | |
| 
 | |
| Note how thread 3 is still executing its critical section when thread 2
 | |
| starts reclaiming data.  This is possible, because the old version of the
 | |
| data structure was not accessible at the time thread 3 began executing
 | |
| that critical section.
 | |
| 
 | |
| 
 | |
| RCU API
 | |
| =======
 | |
| 
 | |
| The core RCU API is small:
 | |
| 
 | |
|      void rcu_read_lock(void);
 | |
| 
 | |
|         Used by a reader to inform the reclaimer that the reader is
 | |
|         entering an RCU read-side critical section.
 | |
| 
 | |
|      void rcu_read_unlock(void);
 | |
| 
 | |
|         Used by a reader to inform the reclaimer that the reader is
 | |
|         exiting an RCU read-side critical section.  Note that RCU
 | |
|         read-side critical sections may be nested and/or overlapping.
 | |
| 
 | |
|      void synchronize_rcu(void);
 | |
| 
 | |
|         Blocks until all pre-existing RCU read-side critical sections
 | |
|         on all threads have completed.  This marks the end of the removal
 | |
|         phase and the beginning of reclamation phase.
 | |
| 
 | |
|         Note that it would be valid for another update to come while
 | |
|         synchronize_rcu is running.  Because of this, it is better that
 | |
|         the updater releases any locks it may hold before calling
 | |
|         synchronize_rcu.  If this is not possible (for example, because
 | |
|         the updater is protected by the BQL), you can use call_rcu.
 | |
| 
 | |
|      void call_rcu1(struct rcu_head * head,
 | |
|                     void (*func)(struct rcu_head *head));
 | |
| 
 | |
|         This function invokes func(head) after all pre-existing RCU
 | |
|         read-side critical sections on all threads have completed.  This
 | |
|         marks the end of the removal phase, with func taking care
 | |
|         asynchronously of the reclamation phase.
 | |
| 
 | |
|         The foo struct needs to have an rcu_head structure added,
 | |
|         perhaps as follows:
 | |
| 
 | |
|             struct foo {
 | |
|                 struct rcu_head rcu;
 | |
|                 int a;
 | |
|                 char b;
 | |
|                 long c;
 | |
|             };
 | |
| 
 | |
|         so that the reclaimer function can fetch the struct foo address
 | |
|         and free it:
 | |
| 
 | |
|             call_rcu1(&foo.rcu, foo_reclaim);
 | |
| 
 | |
|             void foo_reclaim(struct rcu_head *rp)
 | |
|             {
 | |
|                 struct foo *fp = container_of(rp, struct foo, rcu);
 | |
|                 g_free(fp);
 | |
|             }
 | |
| 
 | |
|         For the common case where the rcu_head member is the first of the
 | |
|         struct, you can use the following macro.
 | |
| 
 | |
|      void call_rcu(T *p,
 | |
|                    void (*func)(T *p),
 | |
|                    field-name);
 | |
|      void g_free_rcu(T *p,
 | |
|                      field-name);
 | |
| 
 | |
|         call_rcu1 is typically used through these macro, in the common case
 | |
|         where the "struct rcu_head" is the first field in the struct.  If
 | |
|         the callback function is g_free, in particular, g_free_rcu can be
 | |
|         used.  In the above case, one could have written simply:
 | |
| 
 | |
|             g_free_rcu(&foo, rcu);
 | |
| 
 | |
|      typeof(*p) atomic_rcu_read(p);
 | |
| 
 | |
|         atomic_rcu_read() is similar to atomic_mb_read(), but it makes
 | |
|         some assumptions on the code that calls it.  This allows a more
 | |
|         optimized implementation.
 | |
| 
 | |
|         atomic_rcu_read assumes that whenever a single RCU critical
 | |
|         section reads multiple shared data, these reads are either
 | |
|         data-dependent or need no ordering.  This is almost always the
 | |
|         case when using RCU, because read-side critical sections typically
 | |
|         navigate one or more pointers (the pointers that are changed on
 | |
|         every update) until reaching a data structure of interest,
 | |
|         and then read from there.
 | |
| 
 | |
|         RCU read-side critical sections must use atomic_rcu_read() to
 | |
|         read data, unless concurrent writes are prevented by another
 | |
|         synchronization mechanism.
 | |
| 
 | |
|         Furthermore, RCU read-side critical sections should traverse the
 | |
|         data structure in a single direction, opposite to the direction
 | |
|         in which the updater initializes it.
 | |
| 
 | |
|      void atomic_rcu_set(p, typeof(*p) v);
 | |
| 
 | |
|         atomic_rcu_set() is also similar to atomic_mb_set(), and it also
 | |
|         makes assumptions on the code that calls it in order to allow a more
 | |
|         optimized implementation.
 | |
| 
 | |
|         In particular, atomic_rcu_set() suffices for synchronization
 | |
|         with readers, if the updater never mutates a field within a
 | |
|         data item that is already accessible to readers.  This is the
 | |
|         case when initializing a new copy of the RCU-protected data
 | |
|         structure; just ensure that initialization of *p is carried out
 | |
|         before atomic_rcu_set() makes the data item visible to readers.
 | |
|         If this rule is observed, writes will happen in the opposite
 | |
|         order as reads in the RCU read-side critical sections (or if
 | |
|         there is just one update), and there will be no need for other
 | |
|         synchronization mechanism to coordinate the accesses.
 | |
| 
 | |
| The following APIs must be used before RCU is used in a thread:
 | |
| 
 | |
|      void rcu_register_thread(void);
 | |
| 
 | |
|         Mark a thread as taking part in the RCU mechanism.  Such a thread
 | |
|         will have to report quiescent points regularly, either manually
 | |
|         or through the QemuCond/QemuSemaphore/QemuEvent APIs.
 | |
| 
 | |
|      void rcu_unregister_thread(void);
 | |
| 
 | |
|         Mark a thread as not taking part anymore in the RCU mechanism.
 | |
|         It is not a problem if such a thread reports quiescent points,
 | |
|         either manually or by using the QemuCond/QemuSemaphore/QemuEvent
 | |
|         APIs.
 | |
| 
 | |
| Note that these APIs are relatively heavyweight, and should _not_ be
 | |
| nested.
 | |
| 
 | |
| 
 | |
| DIFFERENCES WITH LINUX
 | |
| ======================
 | |
| 
 | |
| - Waiting on a mutex is possible, though discouraged, within an RCU critical
 | |
|   section.  This is because spinlocks are rarely (if ever) used in userspace
 | |
|   programming; not allowing this would prevent upgrading an RCU read-side
 | |
|   critical section to become an updater.
 | |
| 
 | |
| - atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
 | |
|   rcu_assign_pointer.  They take a _pointer_ to the variable being accessed.
 | |
| 
 | |
| - call_rcu is a macro that has an extra argument (the name of the first
 | |
|   field in the struct, which must be a struct rcu_head), and expects the
 | |
|   type of the callback's argument to be the type of the first argument.
 | |
|   call_rcu1 is the same as Linux's call_rcu.
 | |
| 
 | |
| 
 | |
| RCU PATTERNS
 | |
| ============
 | |
| 
 | |
| Many patterns using read-writer locks translate directly to RCU, with
 | |
| the advantages of higher scalability and deadlock immunity.
 | |
| 
 | |
| In general, RCU can be used whenever it is possible to create a new
 | |
| "version" of a data structure every time the updater runs.  This may
 | |
| sound like a very strict restriction, however:
 | |
| 
 | |
| - the updater does not mean "everything that writes to a data structure",
 | |
|   but rather "everything that involves a reclamation step".  See the
 | |
|   array example below
 | |
| 
 | |
| - in some cases, creating a new version of a data structure may actually
 | |
|   be very cheap.  For example, modifying the "next" pointer of a singly
 | |
|   linked list is effectively creating a new version of the list.
 | |
| 
 | |
| Here are some frequently-used RCU idioms that are worth noting.
 | |
| 
 | |
| 
 | |
| RCU list processing
 | |
| -------------------
 | |
| 
 | |
| TBD (not yet used in QEMU)
 | |
| 
 | |
| 
 | |
| RCU reference counting
 | |
| ----------------------
 | |
| 
 | |
| Because grace periods are not allowed to complete while there is an RCU
 | |
| read-side critical section in progress, the RCU read-side primitives
 | |
| may be used as a restricted reference-counting mechanism.  For example,
 | |
| consider the following code fragment:
 | |
| 
 | |
|     rcu_read_lock();
 | |
|     p = atomic_rcu_read(&foo);
 | |
|     /* do something with p. */
 | |
|     rcu_read_unlock();
 | |
| 
 | |
| The RCU read-side critical section ensures that the value of "p" remains
 | |
| valid until after the rcu_read_unlock().  In some sense, it is acquiring
 | |
| a reference to p that is later released when the critical section ends.
 | |
| The write side looks simply like this (with appropriate locking):
 | |
| 
 | |
|     qemu_mutex_lock(&foo_mutex);
 | |
|     old = foo;
 | |
|     atomic_rcu_set(&foo, new);
 | |
|     qemu_mutex_unlock(&foo_mutex);
 | |
|     synchronize_rcu();
 | |
|     free(old);
 | |
| 
 | |
| If the processing cannot be done purely within the critical section, it
 | |
| is possible to combine this idiom with a "real" reference count:
 | |
| 
 | |
|     rcu_read_lock();
 | |
|     p = atomic_rcu_read(&foo);
 | |
|     foo_ref(p);
 | |
|     rcu_read_unlock();
 | |
|     /* do something with p. */
 | |
|     foo_unref(p);
 | |
| 
 | |
| The write side can be like this:
 | |
| 
 | |
|     qemu_mutex_lock(&foo_mutex);
 | |
|     old = foo;
 | |
|     atomic_rcu_set(&foo, new);
 | |
|     qemu_mutex_unlock(&foo_mutex);
 | |
|     synchronize_rcu();
 | |
|     foo_unref(old);
 | |
| 
 | |
| or with call_rcu:
 | |
| 
 | |
|     qemu_mutex_lock(&foo_mutex);
 | |
|     old = foo;
 | |
|     atomic_rcu_set(&foo, new);
 | |
|     qemu_mutex_unlock(&foo_mutex);
 | |
|     call_rcu(foo_unref, old, rcu);
 | |
| 
 | |
| In both cases, the write side only performs removal.  Reclamation
 | |
| happens when the last reference to a "foo" object is dropped.
 | |
| Using synchronize_rcu() is undesirably expensive, because the
 | |
| last reference may be dropped on the read side.  Hence you can
 | |
| use call_rcu() instead:
 | |
| 
 | |
|      foo_unref(struct foo *p) {
 | |
|         if (atomic_fetch_dec(&p->refcount) == 1) {
 | |
|             call_rcu(foo_destroy, p, rcu);
 | |
|         }
 | |
|     }
 | |
| 
 | |
| 
 | |
| Note that the same idioms would be possible with reader/writer
 | |
| locks:
 | |
| 
 | |
|     read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
 | |
|     p = foo;                        p = foo;
 | |
|     /* do something with p. */      foo = new;
 | |
|     read_unlock(&foo_rwlock);       free(p);
 | |
|                                     write_mutex_unlock(&foo_rwlock);
 | |
|                                     free(p);
 | |
| 
 | |
|     ------------------------------------------------------------------
 | |
| 
 | |
|     read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
 | |
|     p = foo;                        old = foo;
 | |
|     foo_ref(p);                     foo = new;
 | |
|     read_unlock(&foo_rwlock);       foo_unref(old);
 | |
|     /* do something with p. */      write_mutex_unlock(&foo_rwlock);
 | |
|     read_lock(&foo_rwlock);
 | |
|     foo_unref(p);
 | |
|     read_unlock(&foo_rwlock);
 | |
| 
 | |
| foo_unref could use a mechanism such as bottom halves to move deallocation
 | |
| out of the write-side critical section.
 | |
| 
 | |
| 
 | |
| RCU resizable arrays
 | |
| --------------------
 | |
| 
 | |
| Resizable arrays can be used with RCU.  The expensive RCU synchronization
 | |
| (or call_rcu) only needs to take place when the array is resized.
 | |
| The two items to take care of are:
 | |
| 
 | |
| - ensuring that the old version of the array is available between removal
 | |
|   and reclamation;
 | |
| 
 | |
| - avoiding mismatches in the read side between the array data and the
 | |
|   array size.
 | |
| 
 | |
| The first problem is avoided simply by not using realloc.  Instead,
 | |
| each resize will allocate a new array and copy the old data into it.
 | |
| The second problem would arise if the size and the data pointers were
 | |
| two members of a larger struct:
 | |
| 
 | |
|     struct mystuff {
 | |
|         ...
 | |
|         int data_size;
 | |
|         int data_alloc;
 | |
|         T   *data;
 | |
|         ...
 | |
|     };
 | |
| 
 | |
| Instead, we store the size of the array with the array itself:
 | |
| 
 | |
|     struct arr {
 | |
|         int size;
 | |
|         int alloc;
 | |
|         T   data[];
 | |
|     };
 | |
|     struct arr *global_array;
 | |
| 
 | |
|     read side:
 | |
|         rcu_read_lock();
 | |
|         struct arr *array = atomic_rcu_read(&global_array);
 | |
|         x = i < array->size ? array->data[i] : -1;
 | |
|         rcu_read_unlock();
 | |
|         return x;
 | |
| 
 | |
|     write side (running under a lock):
 | |
|         if (global_array->size == global_array->alloc) {
 | |
|             /* Creating a new version.  */
 | |
|             new_array = g_malloc(sizeof(struct arr) +
 | |
|                                  global_array->alloc * 2 * sizeof(T));
 | |
|             new_array->size = global_array->size;
 | |
|             new_array->alloc = global_array->alloc * 2;
 | |
|             memcpy(new_array->data, global_array->data,
 | |
|                    global_array->alloc * sizeof(T));
 | |
| 
 | |
|             /* Removal phase.  */
 | |
|             old_array = global_array;
 | |
|             atomic_rcu_set(&new_array->data, new_array);
 | |
|             synchronize_rcu();
 | |
| 
 | |
|             /* Reclamation phase.  */
 | |
|             free(old_array);
 | |
|         }
 | |
| 
 | |
| 
 | |
| SOURCES
 | |
| =======
 | |
| 
 | |
| * Documentation/RCU/ from the Linux kernel
 |