			Replace Qemu -> QEMU. Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com> Reviewed-by: Darren Kenny <darren.kenny@oracle.com> Reviewed-by: Markus Armbruster <armbru@redhat.com> Message-Id: <20211118143401.4101497-1-philmd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| == General ==
 | |
| 
 | |
| A qcow2 image file is organized in units of constant size, which are called
 | |
| (host) clusters. A cluster is the unit in which all allocations are done,
 | |
| both for actual guest data and for image metadata.
 | |
| 
 | |
| Likewise, the virtual disk as seen by the guest is divided into (guest)
 | |
| clusters of the same size.
 | |
| 
 | |
| All numbers in qcow2 are stored in big-endian byte order.
 | |
| 
 | |
| 
 | |
| == Header ==
 | |
| 
 | |
| The first cluster of a qcow2 image contains the file header:
 | |
| 
 | |
|     Byte  0 -  3:   magic
 | |
|                     QCOW magic string ("QFI\xfb")
 | |
| 
 | |
|           4 -  7:   version
 | |
|                     Version number (valid values are 2 and 3)
 | |
| 
 | |
|           8 - 15:   backing_file_offset
 | |
|                     Offset into the image file at which the backing file name
 | |
|                     is stored (NB: The string is not null terminated). 0 if the
 | |
|                     image doesn't have a backing file.
 | |
| 
 | |
|                     Note: backing files are incompatible with raw external data
 | |
|                     files (auto-clear feature bit 1).
 | |
| 
 | |
|          16 - 19:   backing_file_size
 | |
|                     Length of the backing file name in bytes. Must not be
 | |
|                     longer than 1023 bytes. Undefined if the image doesn't have
 | |
|                     a backing file.
 | |
| 
 | |
|          20 - 23:   cluster_bits
 | |
|                     Number of bits that are used for addressing an offset
 | |
|                     within a cluster (1 << cluster_bits is the cluster size).
 | |
|                     Must not be less than 9 (i.e. 512 byte clusters).
 | |
| 
 | |
|                     Note: QEMU currently has an implementation limit of 2 MB
 | |
|                     as the maximum cluster size and won't be able to open images
 | |
|                     with larger cluster sizes.
 | |
| 
 | |
|                     Note: if the image has Extended L2 Entries then cluster_bits
 | |
|                     must be at least 14 (i.e. 16384 byte clusters).
 | |
| 
 | |
|          24 - 31:   size
 | |
|                     Virtual disk size in bytes.
 | |
| 
 | |
|                     Note: QEMU has an implementation limit of 32 MB as
 | |
|                     the maximum L1 table size.  With a 2 MB cluster
 | |
|                     size, it is unable to populate a virtual cluster
 | |
|                     beyond 2 EB (61 bits); with a 512 byte cluster
 | |
|                     size, it is unable to populate a virtual size
 | |
|                     larger than 128 GB (37 bits).  Meanwhile, L1/L2
 | |
|                     table layouts limit an image to no more than 64 PB
 | |
|                     (56 bits) of populated clusters, and an image may
 | |
|                     hit other limits first (such as a file system's
 | |
|                     maximum size).
 | |
| 
 | |
|          32 - 35:   crypt_method
 | |
|                     0 for no encryption
 | |
|                     1 for AES encryption
 | |
|                     2 for LUKS encryption
 | |
| 
 | |
|          36 - 39:   l1_size
 | |
|                     Number of entries in the active L1 table
 | |
| 
 | |
|          40 - 47:   l1_table_offset
 | |
|                     Offset into the image file at which the active L1 table
 | |
|                     starts. Must be aligned to a cluster boundary.
 | |
| 
 | |
|          48 - 55:   refcount_table_offset
 | |
|                     Offset into the image file at which the refcount table
 | |
|                     starts. Must be aligned to a cluster boundary.
 | |
| 
 | |
|          56 - 59:   refcount_table_clusters
 | |
|                     Number of clusters that the refcount table occupies
 | |
| 
 | |
|          60 - 63:   nb_snapshots
 | |
|                     Number of snapshots contained in the image
 | |
| 
 | |
|          64 - 71:   snapshots_offset
 | |
|                     Offset into the image file at which the snapshot table
 | |
|                     starts. Must be aligned to a cluster boundary.
 | |
| 
 | |
| For version 2, the header is exactly 72 bytes in length, and finishes here.
 | |
| For version 3 or higher, the header length is at least 104 bytes, including
 | |
| the next fields through header_length.
 | |
| 
 | |
|          72 -  79:  incompatible_features
 | |
|                     Bitmask of incompatible features. An implementation must
 | |
|                     fail to open an image if an unknown bit is set.
 | |
| 
 | |
|                     Bit 0:      Dirty bit.  If this bit is set then refcounts
 | |
|                                 may be inconsistent; make sure to scan L1/L2
 | |
|                                 tables to repair refcounts before accessing the
 | |
|                                 image.
 | |
| 
 | |
|                     Bit 1:      Corrupt bit.  If this bit is set then any data
 | |
|                                 structure may be corrupt and the image must not
 | |
|                                 be written to (except to regain
 | |
|                                 consistency).
 | |
| 
 | |
|                     Bit 2:      External data file bit.  If this bit is set, an
 | |
|                                 external data file is used. Guest clusters are
 | |
|                                 then stored in the external data file. For such
 | |
|                                 images, clusters in the external data file are
 | |
|                                 not refcounted. The offset field in the
 | |
|                                 Standard Cluster Descriptor must match the
 | |
|                                 guest offset and neither compressed clusters
 | |
|                                 nor internal snapshots are supported.
 | |
| 
 | |
|                                 An External Data File Name header extension may
 | |
|                                 be present if this bit is set.
 | |
| 
 | |
|                     Bit 3:      Compression type bit.  If this bit is set,
 | |
|                                 a non-default compression is used for compressed
 | |
|                                 clusters. The compression_type field must be
 | |
|                                 present and not zero.
 | |
| 
 | |
|                     Bit 4:      Extended L2 Entries.  If this bit is set then
 | |
|                                 L2 table entries use an extended format that
 | |
|                                 allows subcluster-based allocation. See the
 | |
|                                 Extended L2 Entries section for more details.
 | |
| 
 | |
|                     Bits 5-63:  Reserved (set to 0)
 | |
| 
 | |
|          80 -  87:  compatible_features
 | |
|                     Bitmask of compatible features. An implementation can
 | |
|                     safely ignore any unknown bits that are set.
 | |
| 
 | |
|                     Bit 0:      Lazy refcounts bit.  If this bit is set then
 | |
|                                 lazy refcount updates can be used.  This means
 | |
|                                 marking the image file dirty and postponing
 | |
|                                 refcount metadata updates.
 | |
| 
 | |
|                     Bits 1-63:  Reserved (set to 0)
 | |
| 
 | |
|          88 -  95:  autoclear_features
 | |
|                     Bitmask of auto-clear features. An implementation may only
 | |
|                     write to an image with unknown auto-clear features if it
 | |
|                     clears the respective bits from this field first.
 | |
| 
 | |
|                     Bit 0:      Bitmaps extension bit
 | |
|                                 This bit indicates consistency for the bitmaps
 | |
|                                 extension data.
 | |
| 
 | |
|                                 It is an error if this bit is set without the
 | |
|                                 bitmaps extension present.
 | |
| 
 | |
|                                 If the bitmaps extension is present but this
 | |
|                                 bit is unset, the bitmaps extension data must be
 | |
|                                 considered inconsistent.
 | |
| 
 | |
|                     Bit 1:      Raw external data bit
 | |
|                                 If this bit is set, the external data file can
 | |
|                                 be read as a consistent standalone raw image
 | |
|                                 without looking at the qcow2 metadata.
 | |
| 
 | |
|                                 Setting this bit has a performance impact for
 | |
|                                 some operations on the image (e.g. writing
 | |
|                                 zeros requires writing to the data file instead
 | |
|                                 of only setting the zero flag in the L2 table
 | |
|                                 entry) and conflicts with backing files.
 | |
| 
 | |
|                                 This bit may only be set if the External Data
 | |
|                                 File bit (incompatible feature bit 2) is also
 | |
|                                 set.
 | |
| 
 | |
|                     Bits 2-63:  Reserved (set to 0)
 | |
| 
 | |
|          96 -  99:  refcount_order
 | |
|                     Describes the width of a reference count block entry (width
 | |
|                     in bits: refcount_bits = 1 << refcount_order). For version 2
 | |
|                     images, the order is always assumed to be 4
 | |
|                     (i.e. refcount_bits = 16).
 | |
|                     This value may not exceed 6 (i.e. refcount_bits = 64).
 | |
| 
 | |
|         100 - 103:  header_length
 | |
|                     Length of the header structure in bytes. For version 2
 | |
|                     images, the length is always assumed to be 72 bytes.
 | |
|                     For version 3 it's at least 104 bytes and must be a multiple
 | |
|                     of 8.
 | |
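As an illustration of the layout above, the fixed header fields can be decoded
with Python's struct module. This is only a sketch under stated assumptions:
the names parse_header, V2_FORMAT, V3_FORMAT and KNOWN_INCOMPATIBLE are made up
for this example and are not part of QEMU, and versions other than 2 and 3 are
simply rejected.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"
# All fields are big endian: bytes 0-71 (version 2) and 72-103 (version 3).
V2_FORMAT = ">4sIQIIQIIQQIIQ"
V3_FORMAT = ">QQQII"
V2_FIELDS = ("magic", "version", "backing_file_offset", "backing_file_size",
             "cluster_bits", "size", "crypt_method", "l1_size",
             "l1_table_offset", "refcount_table_offset",
             "refcount_table_clusters", "nb_snapshots", "snapshots_offset")
V3_FIELDS = ("incompatible_features", "compatible_features",
             "autoclear_features", "refcount_order", "header_length")
KNOWN_INCOMPATIBLE = 0b11111  # bits 0-4 are the ones defined above


def parse_header(buf):
    """Return the header fields as a dict, or raise ValueError."""
    hdr = dict(zip(V2_FIELDS, struct.unpack_from(V2_FORMAT, buf, 0)))
    if hdr["magic"] != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    if hdr["version"] == 2:
        # Version 2 ends at byte 71; these values are the assumed defaults.
        hdr.update(incompatible_features=0, compatible_features=0,
                   autoclear_features=0, refcount_order=4, header_length=72)
    elif hdr["version"] == 3:
        hdr.update(zip(V3_FIELDS, struct.unpack_from(V3_FORMAT, buf, 72)))
    else:
        raise ValueError("unsupported version")
    if hdr["cluster_bits"] < 9:
        raise ValueError("cluster_bits must be at least 9")
    if hdr["incompatible_features"] & ~KNOWN_INCOMPATIBLE:
        raise ValueError("unknown incompatible feature bit set")
    return hdr


# A hand-built version 3 header: 64 KiB clusters, 1 GiB virtual disk.
example = struct.pack(V2_FORMAT, QCOW2_MAGIC, 3, 0, 0, 16, 1 << 30, 0, 2,
                      0x30000, 0x10000, 1, 0, 0)
example += struct.pack(V3_FORMAT, 0, 0, 0, 4, 104)
header = parse_header(example)
print(header["cluster_bits"], header["size"], header["header_length"])
```

The check against KNOWN_INCOMPATIBLE mirrors the rule that an image must not be
opened when an unknown incompatible feature bit is set.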
| 
 | |
| 
 | |
| === Additional fields (version 3 and higher) ===
 | |
| 
 | |
| In general, these fields are optional and may be safely ignored by software.
 | |
| They may also be filled with zeros (which is equivalent to the field being
 | |
| absent), for example when software needs to set field B but does not care
 | |
| about field A, which precedes B. More formally, additional fields have the
 | |
| following compatibility rules:
 | |
| 
 | |
| 1. If the value of the additional field must not be ignored for correct
 | |
| handling of the file, it will be accompanied by a corresponding incompatible
 | |
| feature bit.
 | |
| 
 | |
| 2. If there are no unrecognized incompatible feature bits set, an unknown
 | |
| additional field may be safely ignored other than preserving its value when
 | |
| rewriting the image header.
 | |
| 
 | |
| 3. An explicit value of 0 will have the same behavior as when the field is not
 | |
| present*, if not altered by a specific incompatible bit.
 | |
| 
 | |
| *. A field is considered not present when header_length is less than or equal
 | |
| to the field's offset. Also, all additional fields are not present for
 | |
| version 2.
 | |
| 
 | |
|               104:  compression_type
 | |
| 
 | |
|                     Defines the compression method used for compressed clusters.
 | |
|                     All compressed clusters in an image use the same compression
 | |
|                     type.
 | |
| 
 | |
|                     If the incompatible bit "Compression type" is set: the field
 | |
|                     must be present and non-zero (which means non-zlib
 | |
|                     compression type). Otherwise, this field must not be present
 | |
|                     or must be zero (which means zlib).
 | |
| 
 | |
|                     Available compression type values:
 | |
|                         0: zlib <https://www.zlib.net/>
 | |
|                         1: zstd <http://github.com/facebook/zstd>
 | |
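The presence rule for additional fields and the compression_type constraints
can be combined into one small check. The following is a minimal sketch, not
QEMU's implementation; the function and constant names are hypothetical, and
it assumes the caller already knows header_length and incompatible_features.

```python
COMPRESSION_TYPE_OFFSET = 104
COMPRESSION_TYPE_BIT = 1 << 3  # incompatible_features bit 3
COMPRESSION_NAMES = {0: "zlib", 1: "zstd"}


def field_present(header_length, field_offset):
    # A field is not present when header_length <= the field's offset.
    return header_length > field_offset


def compression_type(header_bytes, header_length, incompatible_features):
    value = 0  # an absent field behaves like an explicit 0 (zlib)
    if field_present(header_length, COMPRESSION_TYPE_OFFSET):
        value = header_bytes[COMPRESSION_TYPE_OFFSET]
    bit_set = bool(incompatible_features & COMPRESSION_TYPE_BIT)
    # The bit must be set exactly when the value is non-zero.
    if bit_set != (value != 0):
        raise ValueError("compression_type disagrees with incompatible bit 3")
    if value not in COMPRESSION_NAMES:
        raise ValueError("unknown compression type")
    return COMPRESSION_NAMES[value]


zstd_header = bytes(104) + b"\x01" + bytes(7)  # header_length = 112
print(compression_type(zstd_header, 112, COMPRESSION_TYPE_BIT))
print(compression_type(bytes(104), 104, 0))
```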
| 
 | |
| 
 | |
| === Header padding ===
 | |
| 
 | |
| @header_length must be a multiple of 8, which means that if the end of the last
 | |
| additional field is not aligned, some padding is needed. This padding must be
 | |
| zeroed, so that if some existing (or future) additional field falls into
 | |
| the padding, it is interpreted according to point 3 of the previous
 | |
| section, i.e. in the same manner as when this field is not present.
 | |
| 
 | |
| 
 | |
| === Header extensions ===
 | |
| 
 | |
| Directly after the image header, optional sections called header extensions can
 | |
| be stored. Each extension has a structure like the following:
 | |
| 
 | |
|     Byte  0 -  3:   Header extension type:
 | |
|                         0x00000000 - End of the header extension area
 | |
|                         0xe2792aca - Backing file format name string
 | |
|                         0x6803f857 - Feature name table
 | |
|                         0x23852875 - Bitmaps extension
 | |
|                         0x0537be77 - Full disk encryption header pointer
 | |
|                         0x44415441 - External data file name string
 | |
|                         other      - Unknown header extension, can be safely
 | |
|                                      ignored
 | |
| 
 | |
|           4 -  7:   Length of the header extension data
 | |
| 
 | |
|           8 -  n:   Header extension data
 | |
| 
 | |
|           n -  m:   Padding to round up the header extension size to the next
 | |
|                     multiple of 8.
 | |
| 
 | |
| Unless stated otherwise, each header extension type shall appear at most once
 | |
| in the same image.
 | |
| 
 | |
| If the image has a backing file then the backing file name should be stored in
 | |
| the remaining space between the end of the header extension area and the end of
 | |
| the first cluster. It is not allowed to store other data here, so that an
 | |
| implementation can safely modify the header and add extensions without harming
 | |
| data of compatible features that it doesn't support. Compatible features that
 | |
| need space for additional data can use a header extension.
 | |
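Walking the header extension area follows directly from the structure above:
read the type and length, take the data, then skip the padding up to the next
multiple of 8. The sketch below assumes the whole first cluster is available
as a bytes-like object; iter_header_extensions and align8 are names invented
for this example.

```python
import struct

EXT_END = 0x00000000


def align8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def iter_header_extensions(first_cluster, header_length):
    """Yield (type, data) for each extension that follows the header."""
    pos = header_length
    while pos + 8 <= len(first_cluster):
        ext_type, ext_len = struct.unpack_from(">II", first_cluster, pos)
        if ext_type == EXT_END:
            return
        start = pos + 8
        if start + ext_len > len(first_cluster):
            raise ValueError("header extension overruns the first cluster")
        yield ext_type, bytes(first_cluster[start:start + ext_len])
        pos = start + align8(ext_len)


# A 512-byte first cluster with one backing file format extension ("qcow2").
cluster = bytearray(512)
struct.pack_into(">II5s", cluster, 104, 0xE2792ACA, 5, b"qcow2")
extensions = list(iter_header_extensions(cluster, 104))
print(extensions)
```

Unknown extension types are yielded like any other, so a caller can ignore
them as the type table allows.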
| 
 | |
| 
 | |
| == String header extensions ==
 | |
| 
 | |
| Some header extensions (such as the backing file format name and the external
 | |
| data file name) are just a single string. In this case, the header extension
 | |
| length is the string length and the string is not '\0' terminated. (The header
 | |
| extension padding can make it look like a string is '\0' terminated, but
 | |
| neither is padding always necessary nor is there a guarantee that zero bytes
 | |
| are used for padding.)
 | |
| 
 | |
| 
 | |
| == Feature name table ==
 | |
| 
 | |
| The feature name table is an optional header extension that contains the name
 | |
| for features used by the image. It can be used by applications that don't know
 | |
| the respective feature (e.g. because the feature was introduced only later) to
 | |
| display a useful error message.
 | |
| 
 | |
| The number of entries in the feature name table is determined by the length of
 | |
| the header extension data. Each entry looks like this:
 | |
| 
 | |
|     Byte       0:   Type of feature (select feature bitmap)
 | |
|                         0: Incompatible feature
 | |
|                         1: Compatible feature
 | |
|                         2: Autoclear feature
 | |
| 
 | |
|                1:   Bit number within the selected feature bitmap (valid
 | |
|                     values: 0-63)
 | |
| 
 | |
|           2 - 47:   Feature name (padded with zeros, but not necessarily null
 | |
|                     terminated if it has full length)
 | |
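Since every entry is 48 bytes long, the table can be decoded by stepping
through the extension data. This is a hedged sketch: parse_feature_name_table
is a hypothetical helper, decoding names as ASCII is an assumption (the text
above does not specify an encoding), and rejecting a trailing partial entry is
a choice made for this example.

```python
FEATURE_TYPES = {0: "incompatible", 1: "compatible", 2: "autoclear"}
ENTRY_SIZE = 48  # 1 byte type + 1 byte bit number + 46 bytes name


def parse_feature_name_table(data):
    """Map (feature type, bit number) to the feature name."""
    if len(data) % ENTRY_SIZE:
        raise ValueError("length is not a multiple of the entry size")
    table = {}
    for pos in range(0, len(data), ENTRY_SIZE):
        kind, bit = data[pos], data[pos + 1]
        # Names are zero padded; a full-length name has no terminator.
        raw = bytes(data[pos + 2:pos + ENTRY_SIZE]).rstrip(b"\0")
        name = raw.decode("ascii", "replace")
        table[(FEATURE_TYPES.get(kind, kind), bit)] = name
    return table


entry_a = bytes([0, 0]) + b"dirty bit".ljust(46, b"\0")
entry_b = bytes([2, 1]) + b"x" * 46  # full-length name, no terminator
names = parse_feature_name_table(entry_a + entry_b)
print(names[("incompatible", 0)])
```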
| 
 | |
| 
 | |
| == Bitmaps extension ==
 | |
| 
 | |
| The bitmaps extension is an optional header extension. It provides the ability
 | |
| to store bitmaps related to a virtual disk. For now, there is only one bitmap
 | |
| type: the dirty tracking bitmap, which tracks virtual disk changes from some
 | |
| point in time.
 | |
| 
 | |
| The data of the extension should be considered consistent only if the
 | |
| corresponding auto-clear feature bit is set, see autoclear_features above.
 | |
| 
 | |
| The fields of the bitmaps extension are:
 | |
| 
 | |
|     Byte  0 -  3:  nb_bitmaps
 | |
|                    The number of bitmaps contained in the image. Must be
 | |
|                    greater than or equal to 1.
 | |
| 
 | |
|                    Note: QEMU currently only supports up to 65535 bitmaps per
 | |
|                    image.
 | |
| 
 | |
|           4 -  7:  Reserved, must be zero.
 | |
| 
 | |
|           8 - 15:  bitmap_directory_size
 | |
|                    Size of the bitmap directory in bytes. It is the cumulative
 | |
|                    size of all (nb_bitmaps) bitmap directory entries.
 | |
| 
 | |
|          16 - 23:  bitmap_directory_offset
 | |
|                    Offset into the image file at which the bitmap directory
 | |
|                    starts. Must be aligned to a cluster boundary.
 | |
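The four fields of the bitmaps extension occupy 24 bytes and can be validated
as they are read. The following sketch uses a hypothetical helper name and
reports consistency from autoclear_features bit 0, as described in the header
section; it does not read the bitmap directory itself.

```python
import struct

BITMAPS_AUTOCLEAR_BIT = 1 << 0  # autoclear_features bit 0


def parse_bitmaps_extension(data, cluster_size, autoclear_features):
    nb_bitmaps, reserved, dir_size, dir_offset = struct.unpack_from(
        ">IIQQ", data, 0)
    if nb_bitmaps < 1:
        raise ValueError("nb_bitmaps must be at least 1")
    if reserved != 0:
        raise ValueError("reserved field must be zero")
    if dir_offset % cluster_size:
        raise ValueError("bitmap directory is not cluster aligned")
    return {
        "nb_bitmaps": nb_bitmaps,
        "bitmap_directory_size": dir_size,
        "bitmap_directory_offset": dir_offset,
        # Without the auto-clear bit the data must be treated as inconsistent.
        "consistent": bool(autoclear_features & BITMAPS_AUTOCLEAR_BIT),
    }


ext_data = struct.pack(">IIQQ", 1, 0, 40, 0x20000)
info = parse_bitmaps_extension(ext_data, 65536, BITMAPS_AUTOCLEAR_BIT)
print(info)
```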
| 
 | |
| == Full disk encryption header pointer ==
 | |
| 
 | |
| The full disk encryption header must be present if, and only if, the
 | |
| 'crypt_method' header requires metadata. Currently this is only true
 | |
| of the 'LUKS' crypt method. The header extension must be absent for
 | |
| other methods.
 | |
| 
 | |
| This header provides the offset at which the crypt method can store
 | |
| its additional data, as well as the length of such data.
 | |
| 
 | |
|     Byte  0 -  7:   Offset into the image file at which the encryption
 | |
|                     header starts in bytes. Must be aligned to a cluster
 | |
|                     boundary.
 | |
|     Byte  8 - 15:   Length of the written encryption header in bytes.
 | |
|                     Note actual space allocated in the qcow2 file may
 | |
|                     be larger than this value, since it will be rounded
 | |
|                     up to a multiple of the cluster size. Any
 | |
|                     unused bytes in the allocated space will be initialized
 | |
|                     to 0.
 | |
| 
 | |
| For the LUKS crypt method, the encryption header works as follows.
 | |
| 
 | |
| The first 592 bytes of the header clusters will contain the LUKS
 | |
| partition header. This is then followed by the key material data areas.
 | |
| The size of the key material data areas is determined by the number of
 | |
| stripes in the key slot and key size. Refer to the LUKS format
 | |
| specification ('docs/on-disk-format.pdf' in the cryptsetup source
 | |
| package) for details of the LUKS partition header format.
 | |
| 
 | |
| In the LUKS partition header, the "payload-offset" field will be
 | |
| calculated as normal for the LUKS spec, i.e. the size of the LUKS
 | |
| header, plus key material regions, plus padding, relative to the
 | |
| start of the LUKS header. This offset value is not required to be
 | |
| qcow2 cluster aligned. Its value is currently never used in the
 | |
| context of qcow2, since the qcow2 file format itself defines where
 | |
| the real payload offset is, but nonetheless a valid payload offset
 | |
| should always be present.
 | |
| 
 | |
| In the LUKS key slots header, the "key-material-offset" is relative
 | |
| to the start of the LUKS header clusters in the qcow2 container,
 | |
| not the start of the qcow2 file.
 | |
| 
 | |
| Logically the layout looks like:
 | |
| 
 | |
|   +-----------------------------+
 | |
|   | QCow2 header                |
 | |
|   | QCow2 header extension X    |
 | |
|   | QCow2 header extension FDE  |
 | |
|   | QCow2 header extension ...  |
 | |
|   | QCow2 header extension Z    |
 | |
|   +-----------------------------+
 | |
|   | ....other QCow2 tables....  |
 | |
|   .                             .
 | |
|   .                             .
 | |
|   +-----------------------------+
 | |
|   | +-------------------------+ |
 | |
|   | | LUKS partition header   | |
 | |
|   | +-------------------------+ |
 | |
|   | | LUKS key material 1     | |
 | |
|   | +-------------------------+ |
 | |
|   | | LUKS key material 2     | |
 | |
|   | +-------------------------+ |
 | |
|   | | LUKS key material ...   | |
 | |
|   | +-------------------------+ |
 | |
|   | | LUKS key material 8     | |
 | |
|   | +-------------------------+ |
 | |
|   +-----------------------------+
 | |
|   | QCow2 cluster payload       |
 | |
|   .                             .
 | |
|   .                             .
 | |
|   .                             .
 | |
|   |                             |
 | |
|   +-----------------------------+
 | |
| 
 | |
| == Data encryption ==
 | |
| 
 | |
| When an encryption method is requested in the header, the image payload
 | |
| data must be encrypted/decrypted on every write/read. The image headers
 | |
| and metadata are never encrypted.
 | |
| 
 | |
| The algorithms used for encryption vary depending on the method:
 | |
| 
 | |
|  - AES:
 | |
| 
 | |
|    The AES cipher, in CBC mode, with 256 bit keys.
 | |
| 
 | |
|    Initialization vectors are generated using the plain64 method, with
 | |
|    the virtual disk sector as the input tweak.
 | |
| 
 | |
|    This format is no longer supported in QEMU system emulators, due
 | |
|    to a number of design flaws affecting its security. It is only
 | |
|    supported in the command line tools for the sake of backward compatibility
 | |
|    and data liberation.
 | |
| 
 | |
|  - LUKS:
 | |
| 
 | |
|    The algorithms are specified in the LUKS header.
 | |
| 
 | |
|    Initialization vectors are generated using the method specified
 | |
|    in the LUKS header, with the physical disk sector as the
 | |
|    input tweak.
 | |
| 
 | |
| == Host cluster management ==
 | |
| 
 | |
| qcow2 manages the allocation of host clusters by maintaining a reference count
 | |
| for each host cluster. A refcount of 0 means that the cluster is free, 1 means
 | |
| that it is used, and >= 2 means that it is used and any write access must
 | |
| perform a COW (copy on write) operation.
 | |
| 
 | |
| The refcounts are managed in a two-level table. The first level is called
 | |
| the refcount table and has a variable size (which is stored in the header). The
 | |
| refcount table can cover multiple clusters, however it needs to be contiguous
 | |
| in the image file.
 | |
| 
 | |
| It contains pointers to the second level structures which are called refcount
 | |
| blocks and are exactly one cluster in size.
 | |
| 
 | |
| Although a large enough refcount table can reserve clusters past 64 PB
 | |
| (56 bits) (assuming the underlying protocol can even be sized that
 | |
| large), note that some qcow2 metadata such as L1/L2 tables must point
 | |
| to clusters prior to that point.
 | |
| 
 | |
| Note: QEMU has an implementation limit of 8 MB as the maximum refcount
 | |
| table size.  With a 2 MB cluster size and a default refcount_order of
 | |
| 4, it is unable to reference host resources beyond 2 EB (61 bits); in
 | |
| the worst case, with a 512 byte cluster size and refcount_order of 6, it is
 | |
| unable to access beyond 32 GB (35 bits).
 | |
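The limits quoted in the note can be reproduced with simple arithmetic: a
refcount table of a given size holds one 64-bit entry per refcount block, and
each block covers as many clusters as it has entries. The helper name below is
made up for this worked example.

```python
def max_refcounted_bytes(table_bytes, cluster_bits, refcount_order):
    """Largest host offset range a refcount table of this size can cover."""
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    table_entries = table_bytes // 8  # refcount table entries are 64 bits wide
    block_entries = cluster_size * 8 // refcount_bits
    return table_entries * block_entries * cluster_size


limit = 8 * 1024 * 1024  # the 8 MB refcount table limit from the note
big = max_refcounted_bytes(limit, 21, 4)   # 2 MB clusters, 16-bit refcounts
small = max_refcounted_bytes(limit, 9, 6)  # 512 byte clusters, 64-bit refcounts
print(big == 1 << 61, small == 1 << 35)
```

The two results correspond to the 2 EB (61 bits) and 32 GB (35 bits) figures
given above.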
| 
 | |
| Given an offset into the image file, the refcount of its cluster can be
 | |
| obtained as follows:
 | |
| 
 | |
|     refcount_block_entries = (cluster_size * 8 / refcount_bits)
 | |
| 
 | |
|     refcount_block_index = (offset / cluster_size) % refcount_block_entries
 | |
|     refcount_table_index = (offset / cluster_size) / refcount_block_entries
 | |
| 
 | |
|     refcount_block = load_cluster(refcount_table[refcount_table_index]);
 | |
|     return refcount_block[refcount_block_index];
 | |
| 
 | |
| Refcount table entry:
 | |
| 
 | |
|     Bit  0 -  8:    Reserved (set to 0)
 | |
| 
 | |
|          9 - 63:    Bits 9-63 of the offset into the image file at which the
 | |
|                     refcount block starts. Must be aligned to a cluster
 | |
|                     boundary.
 | |
| 
 | |
|                     If this is 0, the corresponding refcount block has not yet
 | |
|                     been allocated. All refcounts managed by this refcount block
 | |
|                     are 0.
 | |
| 
 | |
| Refcount block entry (x = refcount_bits - 1):
 | |
| 
 | |
|     Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a
 | |
|                     sub-byte width, note that bit 0 means the least significant
 | |
|                     bit in this context.
 | |
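Putting the lookup formula and the two entry formats together, a refcount can
be read from an image held in memory roughly as follows. This is a sketch under
assumptions: get_refcount is a hypothetical name, the image is a bytes-like
object, and no bounds check is made against refcount_table_clusters.

```python
import struct

REFTABLE_OFFSET_MASK = 0xFFFFFFFFFFFFFE00  # bits 9-63; bits 0-8 are reserved


def get_refcount(image, offset, cluster_bits, refcount_order,
                 refcount_table_offset):
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    block_entries = cluster_size * 8 // refcount_bits
    cluster_index = offset // cluster_size
    block_index = cluster_index % block_entries
    table_index = cluster_index // block_entries

    (entry,) = struct.unpack_from(
        ">Q", image, refcount_table_offset + 8 * table_index)
    block_offset = entry & REFTABLE_OFFSET_MASK
    if block_offset == 0:
        return 0  # refcount block not allocated: all its refcounts are 0

    if refcount_bits >= 8:
        width = refcount_bits // 8
        start = block_offset + block_index * width
        return int.from_bytes(image[start:start + width], "big")
    # Sub-byte width: entry 0 sits in the least significant bits of the byte.
    per_byte = 8 // refcount_bits
    byte = image[block_offset + block_index // per_byte]
    shift = (block_index % per_byte) * refcount_bits
    return (byte >> shift) & ((1 << refcount_bits) - 1)


# Toy image with 512 byte clusters: cluster 1 holds the refcount table,
# cluster 2 the only refcount block (16-bit entries, refcount_order = 4).
image = bytearray(3 * 512)
struct.pack_into(">Q", image, 512, 1024)
struct.pack_into(">HHH", image, 1024, 1, 1, 1)
print(get_refcount(image, 1024, 9, 4, 512))
```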
| 
 | |
| 
 | |
| == Cluster mapping ==
 | |
| 
 | |
| Just as for refcounts, qcow2 uses a two-level structure for the mapping of
 | |
| guest clusters to host clusters. These are called the L1 and L2 tables.
 | |
| 
 | |
| The L1 table has a variable size (stored in the header) and may use multiple
 | |
| clusters, however it must be contiguous in the image file. L2 tables are
 | |
| exactly one cluster in size.
 | |
| 
 | |
| The L1 and L2 tables have implications on the maximum virtual file
 | |
| size; for a given L1 table size, a larger cluster size is required for
 | |
| the guest to have access to more space.  Furthermore, a virtual
 | |
| cluster must currently map to a host offset below 64 PB (56 bits)
 | |
| (although this limit could be relaxed by putting reserved bits into
 | |
| use).  Additionally, as cluster size increases, the maximum host
 | |
| offset for a compressed cluster is reduced (a 2 MB cluster size requires
 | |
| compressed clusters to reside below 512 TB (49 bits), and this limit
 | |
| cannot be relaxed without an incompatible layout change).
 | |
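The relationship between L1 table size, cluster size and addressable space can
be checked numerically. The sketch below assumes standard 64-bit L2 entries
(no Extended L2 Entries) and uses hypothetical helper names; the compressed
cluster figure uses the x = 62 - (cluster_bits - 8) rule from the Compressed
Clusters Descriptor.

```python
def max_virtual_size(l1_table_bytes, cluster_bits):
    """Virtual disk bytes reachable through an L1 table of the given size."""
    cluster_size = 1 << cluster_bits
    l1_entries = l1_table_bytes // 8  # 64-bit L1 entries
    l2_entries = cluster_size // 8    # 64-bit L2 entries, one cluster per table
    return l1_entries * l2_entries * cluster_size


def compressed_offset_bits(cluster_bits):
    """Number of bits available for a compressed cluster's host offset."""
    return 62 - (cluster_bits - 8)


l1_limit = 32 * 1024 * 1024  # the 32 MB L1 table limit noted in the header
print(max_virtual_size(l1_limit, 21) == 1 << 61)  # 2 MB clusters
print(max_virtual_size(l1_limit, 9) == 1 << 37)   # 512 byte clusters
print(compressed_offset_bits(21))                 # 2 MB clusters
```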
| 
 | |
| Given an offset into the virtual disk, the offset into the image file can be
 | |
| obtained as follows:
 | |
| 
 | |
|     l2_entries = (cluster_size / sizeof(uint64_t))        [*]
 | |
| 
 | |
|     l2_index = (offset / cluster_size) % l2_entries
 | |
|     l1_index = (offset / cluster_size) / l2_entries
 | |
| 
 | |
|     l2_table = load_cluster(l1_table[l1_index]);
 | |
|     cluster_offset = l2_table[l2_index];
 | |
| 
 | |
|     return cluster_offset + (offset % cluster_size)
 | |
| 
 | |
|     [*] this changes if Extended L2 Entries are enabled, see next section
 | |
| 
 | |
| L1 table entry:
 | |
| 
 | |
|     Bit  0 -  8:    Reserved (set to 0)
 | |
| 
 | |
|          9 - 55:    Bits 9-55 of the offset into the image file at which the L2
 | |
|                     table starts. Must be aligned to a cluster boundary. If the
 | |
|                     offset is 0, the L2 table and all clusters described by this
 | |
|                     L2 table are unallocated.
 | |
| 
 | |
|         56 - 62:    Reserved (set to 0)
 | |
| 
 | |
|              63:    0 for an L2 table that is unused or requires COW, 1 if its
 | |
|                     refcount is exactly one. This information is only accurate
 | |
|                     in the active L1 table.
 | |
| 
 | |
| L2 table entry:
 | |
| 
 | |
|     Bit  0 -  61:   Cluster descriptor
 | |
| 
 | |
|               62:   0 for standard clusters
 | |
|                     1 for compressed clusters
 | |
| 
 | |
|               63:   0 for clusters that are unused, compressed or require COW.
 | |
|                     1 for standard clusters whose refcount is exactly one.
 | |
|                     This information is only accurate in L2 tables
 | |
|                     that are reachable from the active L1 table.
 | |
| 
 | |
|                     With external data files, all guest clusters have an
 | |
|                     implicit refcount of 1 (because of the fixed host = guest
 | |
|                     mapping for guest cluster offsets), so this bit should be 1
 | |
|                     for all allocated clusters.
 | |
| 
 | |
| Standard Cluster Descriptor:
 | |
| 
 | |
|     Bit       0:    If set to 1, the cluster reads as all zeros. The host
 | |
|                     cluster offset can be used to describe a preallocation,
 | |
|                     but it won't be used for reading data from this cluster,
 | |
|                     nor is data read from the backing file if the cluster is
 | |
|                     unallocated.
 | |
| 
 | |
|                     With version 2 or with extended L2 entries (see the next
 | |
|                     section), this is always 0.
 | |
| 
 | |
|          1 -  8:    Reserved (set to 0)
 | |
| 
 | |
|          9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
 | |
|                     cluster boundary. If the offset is 0 and bit 63 is clear,
 | |
|                     the cluster is unallocated. The offset may only be 0 with
 | |
|                     bit 63 set (indicating a host cluster offset of 0) when an
 | |
|                     external data file is used.
 | |
| 
 | |
|         56 - 61:    Reserved (set to 0)
 | |
| 
 | |
| 
 | |
| Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
 | |
| 
 | |
|     Bit  0 - x-1:   Host cluster offset. This is usually _not_ aligned to a
 | |
|                     cluster or sector boundary!  If cluster_bits is
 | |
|                     small enough that this field includes bits beyond
 | |
|                     55, those upper bits must be set to 0.
 | |
| 
 | |
|          x - 61:    Number of additional 512-byte sectors used for the
 | |
|                     compressed data, beyond the sector containing the offset
 | |
|                     in the previous field. Some of these sectors may reside
 | |
|                     in the next contiguous host cluster.
 | |
| 
 | |
|                     Note that the compressed data does not necessarily occupy
 | |
|                     all of the bytes in the final sector; rather, decompression
 | |
|                     stops when it has produced a cluster of data.
 | |
| 
 | |
|                     Another compressed cluster may map to the tail of the final
 | |
|                     sector used by this compressed cluster.
 | |
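The split point x of the Compressed Clusters Descriptor depends on
cluster_bits, so decoding it takes a little care. The following is a minimal
sketch with an invented function name; it returns the host offset, the number
of additional sectors, and the number of bytes from the host offset to the end
of the last sector, which is an upper bound on the compressed data length.

```python
def decode_compressed_descriptor(l2_entry, cluster_bits):
    if not l2_entry & (1 << 62):
        raise ValueError("not a compressed cluster")
    x = 62 - (cluster_bits - 8)
    descriptor = l2_entry & ((1 << 62) - 1)   # bits 0-61
    host_offset = descriptor & ((1 << x) - 1)  # bits 0 to x-1
    extra_sectors = descriptor >> x            # bits x to 61
    if host_offset >> 56:
        raise ValueError("offset bits beyond 55 must be zero")
    # The data may start mid-sector and ends with the last sector used.
    first_sector_start = host_offset - host_offset % 512
    end = first_sector_start + (extra_sectors + 1) * 512
    return host_offset, extra_sectors, end - host_offset


# 64 KiB clusters (cluster_bits = 16), so x = 54.
entry = (1 << 62) | (2 << 54) | 0x50100
print(decode_compressed_descriptor(entry, 16))
```

Because another compressed cluster may share the final sector, a reader would
stop decompressing once one full cluster of data has been produced rather than
consuming every returned byte.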
| 
 | |
| If a cluster is unallocated, read requests shall read the data from the backing
 | |
| file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
 | |
| no backing file or the backing file is smaller than the image, they shall read
 | |
| zeros for all parts that are not covered by the backing file.
 | |
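The lookup procedure and the entry formats of this section can be combined
into a single translation routine. This is only a sketch: map_guest_offset is
a hypothetical name, the image is assumed to be a bytes-like object, and
external data files and Extended L2 Entries are not handled. For the
"unallocated" result a caller would fall back to the backing file or to zeros,
as described above.

```python
import struct

OFFSET_MASK = 0x00FFFFFFFFFFFE00  # bits 9-55 of an L1 or L2 entry
COMPRESSED_FLAG = 1 << 62
ZERO_FLAG = 1 << 0


def map_guest_offset(image, offset, cluster_bits, l1_table_offset, l1_size):
    """Return (kind, host_offset); host_offset is None unless kind is 'data'."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8
    l2_index = (offset // cluster_size) % l2_entries
    l1_index = (offset // cluster_size) // l2_entries
    if l1_index >= l1_size:
        raise ValueError("offset is beyond the active L1 table")

    (l1_entry,) = struct.unpack_from(
        ">Q", image, l1_table_offset + 8 * l1_index)
    l2_table_offset = l1_entry & OFFSET_MASK
    if l2_table_offset == 0:
        return ("unallocated", None)

    (l2_entry,) = struct.unpack_from(
        ">Q", image, l2_table_offset + 8 * l2_index)
    if l2_entry & COMPRESSED_FLAG:
        return ("compressed", None)  # needs the compressed descriptor
    if l2_entry & ZERO_FLAG:
        return ("zero", None)
    cluster_offset = l2_entry & OFFSET_MASK
    if cluster_offset == 0:
        return ("unallocated", None)
    return ("data", cluster_offset + offset % cluster_size)


# Toy image, 512 byte clusters: L1 table in cluster 1, L2 table in cluster 2,
# guest cluster 0 mapped to host cluster 3, guest cluster 1 marked as zero.
image = bytearray(4 * 512)
struct.pack_into(">Q", image, 512, 1024 | (1 << 63))
struct.pack_into(">Q", image, 1024, 1536 | (1 << 63))
struct.pack_into(">Q", image, 1032, ZERO_FLAG)
print(map_guest_offset(image, 10, 9, 512, 1))
```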
| 
 | |
| == Extended L2 Entries ==
 | |
| 
 | |
| An image uses Extended L2 Entries if bit 4 is set in the incompatible_features
 | |
| field of the header.
 | |
| 
 | |
| In these images standard data clusters are divided into 32 subclusters of the
 | |
| same size. They are contiguous and start from the beginning of the cluster.
 | |
| Subclusters can be allocated independently and the L2 entry contains information
 | |
| indicating the status of each one of them. Compressed data clusters don't have
 | |
| subclusters so they are treated the same as in images without this feature.
 | |
| 
 | |
| The size of an extended L2 entry is 128 bits so the number of entries per table
 | |
| is calculated using this formula:
 | |
| 
 | |
|     l2_entries = (cluster_size / (2 * sizeof(uint64_t)))
 | |
| 
 | |
| The first 64 bits have the same format as the standard L2 table entry described
 | |
| in the previous section, with the exception of bit 0 of the standard cluster
 | |
| descriptor.
 | |
| 
 | |
| The last 64 bits contain a subcluster allocation bitmap with this format:
 | |
| 
 | |
| Subcluster Allocation Bitmap (for standard clusters):
 | |
| 
 | |
|     Bit  0 - 31:    Allocation status (one bit per subcluster)
 | |
| 
 | |
|                     1: the subcluster is allocated. In this case the
 | |
|                        host cluster offset field must contain a valid
 | |
|                        offset.
 | |
|                     0: the subcluster is not allocated. In this case
 | |
|                        read requests shall go to the backing file or
 | |
|                        return zeros if there is no backing file data.
 | |
| 
 | |
|                     Bits are assigned starting from the least significant
 | |
|                     one (i.e. bit x is used for subcluster x).
 | |
| 
 | |
|         32 - 63:    Subcluster reads as zeros (one bit per subcluster)
 | |
| 
 | |
|                     1: the subcluster reads as zeros. In this case the
 | |
|                        allocation status bit must be unset. The host
 | |
|                        cluster offset field may or may not be set.
 | |
|                     0: no effect.
 | |
| 
 | |
|                     Bits are assigned starting from the least significant
 | |
|                     one (i.e. bit x is used for subcluster x - 32).
 | |
| 
 | |
| Subcluster Allocation Bitmap (for compressed clusters):
 | |
| 
 | |
|     Bit  0 - 63:    Reserved (set to 0)
 | |
|                     Compressed clusters don't have subclusters,
 | |
|                     so this field is not used.
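The bitmap layout for standard clusters can be decoded as in the sketch
below. The function name and the returned status strings are assumptions made
for this example; the bit positions are the ones given above.

```python
def subcluster_status(bitmap, index):
    """Classify subcluster `index` (0..31) from the 64-bit allocation bitmap."""
    if not 0 <= index < 32:
        raise ValueError("subcluster index must be in the range 0..31")
    allocated = (bitmap >> index) & 1          # bits 0 - 31
    reads_zero = (bitmap >> (32 + index)) & 1  # bits 32 - 63
    if allocated and reads_zero:
        # The allocation bit must be unset when the zero bit is set.
        return "invalid"
    if reads_zero:
        return "zero"
    if allocated:
        return "allocated"
    # Falls through to the backing file, or zeros if there is none.
    return "unallocated"
```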
 | |
| 
 | |
| == Snapshots ==
 | |
| 
 | |
| qcow2 supports internal snapshots. Their basic principle of operation is to
 | |
| switch the active L1 table, so that a different set of host clusters is
 | |
| exposed to the guest.
 | |
| 
 | |
| When creating a snapshot, the L1 table should be copied and the refcount of all
 | |
| L2 tables and clusters reachable from this L1 table must be increased, so that
 | |
| a write causes a COW and isn't visible in other snapshots.
 | |
| 
 | |
| When loading a snapshot, bit 63 of all entries in the new active L1 table and
 | |
| all L2 tables referenced by it must be reconstructed from the refcount table
 | |
| as it doesn't need to be accurate in inactive L1 tables.
 | |
| 
 | |
| A directory of all snapshots is stored in the snapshot table, a contiguous area
 | |
| in the image file, whose starting offset and length are given by the header
 | |
| fields snapshots_offset and nb_snapshots. The entries of the snapshot table
 | |
| have variable length, depending on the length of ID, name and extra data.
 | |
| 
 | |
| Snapshot table entry:
 | |
| 
 | |
|     Byte 0 -  7:    Offset into the image file at which the L1 table for the
 | |
|                     snapshot starts. Must be aligned to a cluster boundary.
 | |
| 
 | |
|          8 - 11:    Number of entries in the L1 table of the snapshot
 | |
| 
 | |
|         12 - 13:    Length of the unique ID string describing the snapshot
 | |
| 
 | |
|         14 - 15:    Length of the name of the snapshot
 | |
| 
 | |
|         16 - 19:    Time at which the snapshot was taken in seconds since the
 | |
|                     Epoch
 | |
| 
 | |
|         20 - 23:    Subsecond part of the time at which the snapshot was taken
 | |
|                     in nanoseconds
 | |
| 
 | |
|         24 - 31:    Time that the guest was running until the snapshot was
 | |
|                     taken in nanoseconds
 | |
| 
 | |
|         32 - 35:    Size of the VM state in bytes. 0 if no VM state is saved.
 | |
|                     If there is VM state, it starts at the first cluster
 | |
|                     described by the first L1 table entry that doesn't describe a
 | |
|                     regular guest cluster (i.e. VM state is stored like guest
 | |
|                     disk content, except that it is stored at offsets that are
 | |
|                     larger than the virtual disk presented to the guest)
 | |
| 
 | |
|         36 - 39:    Size of extra data in the table entry (used for future
 | |
|                     extensions of the format)
 | |
| 
 | |
|         variable:   Extra data for future extensions. Unknown fields must be
 | |
|                     ignored. Currently defined are (offset relative to snapshot
 | |
|                     table entry):
 | |
| 
 | |
|                     Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
 | |
|                                     state is saved. If this field is present,
 | |
|                                     the 32-bit value in bytes 32-35 is ignored.
 | |
| 
 | |
|                     Byte 48 - 55:   Virtual disk size of the snapshot in bytes
 | |
| 
 | |
|                     Byte 56 - 63:   icount value which corresponds to
 | |
|                                     the record/replay instruction count
 | |
|                                     when the snapshot was taken. Set to -1
 | |
|                                     if icount was disabled
 | |
| 
 | |
|                     Version 3 images must include extra data at least up to
 | |
|                     byte 55.
 | |
| 
 | |
|         variable:   Unique ID string for the snapshot (not null terminated)
 | |
| 
 | |
|         variable:   Name of the snapshot (not null terminated)
 | |
| 
 | |
|         variable:   Padding to round up the snapshot table entry size to the
 | |
|                     next multiple of 8.
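A snapshot table entry laid out as described above could be parsed roughly as
follows. This is a sketch under stated assumptions: the function name and the
dictionary keys are invented for the example, the icount field is read as a
signed value so that "-1" round-trips, and the padding is computed relative to
the start of the entry.

```python
import struct

# Fixed part, bytes 0 - 39, big endian:
# L1 offset, L1 size, ID length, name length, seconds, nanoseconds,
# guest run time (ns), 32-bit VM state size, extra data size.
SNAPSHOT_FIXED = struct.Struct(">QIHHIIQII")


def parse_snapshot_entry(buf, pos=0):
    """Parse one entry at `pos`; return (fields, position of the next entry)."""
    (l1_offset, l1_size, id_len, name_len, sec, nsec,
     vm_clock_ns, vm_state_size, extra_size) = SNAPSHOT_FIXED.unpack_from(buf, pos)
    cur = pos + SNAPSHOT_FIXED.size
    extra = bytes(buf[cur:cur + extra_size])
    disk_size = icount = None
    if extra_size >= 8:    # bytes 40 - 47 override the 32-bit size
        vm_state_size = struct.unpack_from(">Q", extra, 0)[0]
    if extra_size >= 16:   # bytes 48 - 55
        disk_size = struct.unpack_from(">Q", extra, 8)[0]
    if extra_size >= 24:   # bytes 56 - 63, -1 if icount was disabled
        icount = struct.unpack_from(">q", extra, 16)[0]
    cur += extra_size      # unknown trailing extra data is skipped
    snap_id = bytes(buf[cur:cur + id_len])
    cur += id_len
    name = bytes(buf[cur:cur + name_len])
    cur += name_len
    next_pos = pos + ((cur - pos + 7) & ~7)  # round entry size up to 8
    fields = dict(l1_offset=l1_offset, l1_size=l1_size, id=snap_id, name=name,
                  sec=sec, nsec=nsec, vm_clock_ns=vm_clock_ns,
                  vm_state_size=vm_state_size, disk_size=disk_size,
                  icount=icount)
    return fields, next_pos
```

Calling the function repeatedly, feeding each returned position back in as
`pos`, walks the whole table for nb_snapshots entries.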
 | |
| 
 | |
| 
 | |
| == Bitmaps ==
 | |
| 
 | |
| As mentioned above, the bitmaps extension provides the ability to store bitmaps
 | |
| related to a virtual disk. This section describes how these bitmaps are stored.
 | |
| 
 | |
| All stored bitmaps are related to the virtual disk stored in the same image, so
 | |
| each bitmap size is equal to the virtual disk size.
 | |
| 
 | |
| Each bit of the bitmap is responsible for a strictly defined range of the virtual
 | |
| disk. For bit number bit_nr the corresponding range (in bytes) will be:
 | |
| 
 | |
|     [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1]
 | |
| 
 | |
| Granularity is a property of the concrete bitmap, see below.
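The range formula above, together with its inverse, can be written as the
following sketch. The function names are illustrative, and granularity is
passed as granularity_bits (the directory entry field described below).

```python
def bit_range(bit_nr, granularity_bits):
    """Inclusive byte range of the virtual disk covered by bit `bit_nr`."""
    granularity = 1 << granularity_bits
    first = bit_nr * granularity
    return first, first + granularity - 1


def bit_for_byte(byte_nr, granularity_bits):
    """Number of the bitmap bit that covers virtual disk offset `byte_nr`."""
    return byte_nr >> granularity_bits
```

With granularity_bits = 16 (64 KiB granularity), bit 3 covers bytes 196608
through 262143 of the virtual disk.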
 | |
| 
 | |
| 
 | |
| === Bitmap directory ===
 | |
| 
 | |
| Each bitmap saved in the image is described in a bitmap directory entry. The
 | |
| bitmap directory is a contiguous area in the image file, whose starting offset
 | |
| and length are given by the header extension fields bitmap_directory_offset and
 | |
| bitmap_directory_size. The entries of the bitmap directory have variable
 | |
| length, depending on the lengths of the bitmap name and extra data.
 | |
| 
 | |
| Structure of a bitmap directory entry:
 | |
| 
 | |
|     Byte 0 -  7:    bitmap_table_offset
 | |
|                     Offset into the image file at which the bitmap table
 | |
|                     (described below) for the bitmap starts. Must be aligned to
 | |
|                     a cluster boundary.
 | |
| 
 | |
|          8 - 11:    bitmap_table_size
 | |
|                     Number of entries in the bitmap table of the bitmap.
 | |
| 
 | |
|         12 - 15:    flags
 | |
|                     Bit
 | |
|                       0: in_use
 | |
|                          The bitmap was not saved correctly and may be
 | |
|                          inconsistent. Although the bitmap metadata is still
 | |
|                          well-formed from a qcow2 perspective, the metadata
 | |
|                          (such as the auto flag or bitmap size) or data
 | |
|                          contents may be outdated.
 | |
| 
 | |
|                       1: auto
 | |
|                          The bitmap must reflect all changes of the virtual
 | |
|                          disk by any application that would write to this qcow2
 | |
|                          file (including writes, snapshot switching, etc.). The
 | |
|                          type of this bitmap must be 'dirty tracking bitmap'.
 | |
| 
 | |
|                       2: extra_data_compatible
 | |
|                          This flag is meaningful when the extra data is
 | |
|                          unknown to the software (currently any extra data is
 | |
|                          unknown to QEMU).
 | |
|                          If it is set, the bitmap may be used as expected, extra
 | |
|                          data must be left as is.
 | |
|                          If it is not set, the bitmap must not be used, but
 | |
|                          both it and its extra data must be left as is.
 | |
| 
 | |
|                     Bits 3 - 31 are reserved and must be 0.
 | |
| 
 | |
|              16:    type
 | |
|                     This field describes the sort of the bitmap.
 | |
|                     Values:
 | |
|                       1: Dirty tracking bitmap
 | |
| 
 | |
|                     Values 0, 2 - 255 are reserved.
 | |
| 
 | |
|              17:    granularity_bits
 | |
|                     Granularity bits. Valid values: 0 - 63.
 | |
| 
 | |
|                     Note: QEMU currently supports only values 9 - 31.
 | |
| 
 | |
|                     Granularity is calculated as
 | |
|                         granularity = 1 << granularity_bits
 | |
| 
 | |
|                     A bitmap's granularity is how many bytes of the image
 | |
|                     account for one bit of the bitmap.
 | |
| 
 | |
|         18 - 19:    name_size
 | |
|                     Size of the bitmap name. Must be non-zero.
 | |
| 
 | |
|                     Note: QEMU currently doesn't support values greater than
 | |
|                     1023.
 | |
| 
 | |
|         20 - 23:    extra_data_size
 | |
|                     Size of type-specific extra data.
 | |
| 
 | |
|                     For now, as no extra data is defined, extra_data_size is
 | |
|                     reserved and should be zero. If it is non-zero the
 | |
|                     behavior is defined by the extra_data_compatible flag.
 | |
| 
 | |
|         variable:   extra_data
 | |
|                     Extra data for the bitmap, occupying extra_data_size bytes.
 | |
|                     Extra data must never contain references to clusters or in
 | |
|                     some other way allocate additional clusters.
 | |
| 
 | |
|         variable:   name
 | |
|                     The name of the bitmap (not null terminated), occupying
 | |
|                     name_size bytes. Must be unique among all bitmap names
 | |
|                     within the bitmaps extension.
 | |
| 
 | |
|         variable:   Padding to round up the bitmap directory entry size to the
 | |
|                     next multiple of 8. All bytes of the padding must be zero.
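The entry structure above maps onto a 24-byte fixed part followed by the
variable-length fields. The sketch below parses one entry; the function name,
flag constant names and dictionary keys are assumptions for this example, and
the `usable` result only encodes the extra_data_compatible rule (it does not
check in_use or the reserved bits).

```python
import struct

# Fixed part, bytes 0 - 23, big endian: bitmap_table_offset,
# bitmap_table_size, flags, type, granularity_bits, name_size,
# extra_data_size.
BITMAP_DIR_FIXED = struct.Struct(">QIIBBHI")

FLAG_IN_USE = 1 << 0
FLAG_AUTO = 1 << 1
FLAG_EXTRA_DATA_COMPATIBLE = 1 << 2


def parse_bitmap_dir_entry(buf, pos=0):
    """Parse one entry at `pos`; return (fields, position of the next entry)."""
    (table_offset, table_size, flags, btype, granularity_bits,
     name_size, extra_size) = BITMAP_DIR_FIXED.unpack_from(buf, pos)
    cur = pos + BITMAP_DIR_FIXED.size
    extra_data = bytes(buf[cur:cur + extra_size])  # extra data precedes name
    cur += extra_size
    name = bytes(buf[cur:cur + name_size])
    cur += name_size
    next_pos = pos + ((cur - pos + 7) & ~7)  # round entry size up to 8
    fields = dict(
        table_offset=table_offset, table_size=table_size, type=btype,
        granularity=1 << granularity_bits, name=name, extra_data=extra_data,
        in_use=bool(flags & FLAG_IN_USE), auto=bool(flags & FLAG_AUTO),
        # Unknown extra data is tolerated only if the flag allows it.
        usable=extra_size == 0 or bool(flags & FLAG_EXTRA_DATA_COMPATIBLE),
    )
    return fields, next_pos
```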
 | |
| 
 | |
| 
 | |
| === Bitmap table ===
 | |
| 
 | |
| Each bitmap is stored using a one-level structure (as opposed to two-level
 | |
| structures like for refcounts and guest clusters mapping) for the mapping of
 | |
| bitmap data to host clusters. This structure is called the bitmap table.
 | |
| 
 | |
| Each bitmap table has a variable size (stored in the bitmap directory entry)
 | |
| and may use multiple clusters; however, it must be contiguous in the image
 | |
| file.
 | |
| 
 | |
| Structure of a bitmap table entry:
 | |
| 
 | |
|     Bit       0:    Reserved and must be zero if bits 9 - 55 are non-zero.
 | |
|                     If bits 9 - 55 are zero:
 | |
|                       0: Cluster should be read as all zeros.
 | |
|                       1: Cluster should be read as all ones.
 | |
| 
 | |
|          1 -  8:    Reserved and must be zero.
 | |
| 
 | |
|          9 - 55:    Bits 9 - 55 of the host cluster offset. Must be aligned to
 | |
|                     a cluster boundary. If the offset is 0, the cluster is
 | |
|                     unallocated; in that case, bit 0 determines how this
 | |
|                     cluster should be treated during reads.
 | |
| 
 | |
|         56 - 63:    Reserved and must be zero.
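A bitmap table entry can be decoded along these lines. The function name and
the returned tuples are assumptions made for the example; the masks are
derived directly from the bit ranges listed above.

```python
OFFSET_MASK = 0x00FFFFFFFFFFFE00    # bits 9 - 55
RESERVED_MASK = 0xFF000000000001FE  # bits 1 - 8 and 56 - 63


def decode_bitmap_table_entry(entry, cluster_size):
    """Return (kind, host_offset) for a 64-bit bitmap table entry."""
    if entry & RESERVED_MASK:
        raise ValueError("reserved bits are set")
    offset = entry & OFFSET_MASK
    if offset == 0:
        # Unallocated: bit 0 selects the value the cluster reads as.
        return ("all_ones" if entry & 1 else "all_zeros", None)
    if entry & 1:
        raise ValueError("bit 0 must be zero when an offset is present")
    if offset % cluster_size:
        raise ValueError("host offset is not cluster aligned")
    return ("allocated", offset)
```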
 | |
| 
 | |
| 
 | |
| === Bitmap data ===
 | |
| 
 | |
| As noted above, bitmap data is stored in separate clusters, described by the
 | |
| bitmap table. Given an offset (in bytes) into the bitmap data, the offset into
 | |
| the image file can be obtained as follows:
 | |
| 
 | |
|     image_offset(bitmap_data_offset) =
 | |
|         bitmap_table[bitmap_data_offset / cluster_size] +
 | |
|             (bitmap_data_offset % cluster_size)
 | |
| 
 | |
| This offset is not defined if bits 9 - 55 of the bitmap table entry are zero (see
 | |
| above).
 | |
| 
 | |
| Given an offset byte_nr into the virtual disk and the bitmap's granularity, the
 | |
| offset (in bits) into the image file of the corresponding bit of the bitmap can
 | |
| be calculated like this:
 | |
| 
 | |
|     bit_offset(byte_nr) =
 | |
|         image_offset(byte_nr / granularity / 8) * 8 +
 | |
|             (byte_nr / granularity) % 8
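Both formulas in this section can be transcribed into code as shown below.
This is a sketch: the bitmap table is assumed to be available as a list of
already byte-swapped 64-bit entries, and the function signatures are chosen
for the example. All divisions are integer divisions.

```python
OFFSET_MASK = 0x00FFFFFFFFFFFE00  # bits 9 - 55 of a bitmap table entry


def image_offset(bitmap_table, bitmap_data_offset, cluster_size):
    """Byte offset into the image file for a byte offset into the bitmap data."""
    entry = bitmap_table[bitmap_data_offset // cluster_size]
    host = entry & OFFSET_MASK
    if host == 0:
        # Unallocated bitmap cluster: no image offset exists.
        raise ValueError("image offset is undefined for this entry")
    return host + bitmap_data_offset % cluster_size


def bit_offset(bitmap_table, byte_nr, granularity, cluster_size):
    """Bit offset into the image file for virtual disk offset `byte_nr`."""
    bit_nr = byte_nr // granularity
    return image_offset(bitmap_table, bit_nr // 8, cluster_size) * 8 + bit_nr % 8
```

For a table whose first entry points at host offset 0x30000, with 64 KiB
clusters and 64 KiB granularity, virtual disk offset 10 * 65536 maps to bit 10
of the bitmap, which is bit 2 of the image file byte at offset 0x30001.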
 | |
| 
 | |
| If the size of the bitmap data is not a multiple of the cluster size then the
 | |
| last cluster of the bitmap data contains some unused tail bits. These bits must
 | |
| be zero.
 | |
| 
 | |
| 
 | |
| === Dirty tracking bitmaps ===
 | |
| 
 | |
| Bitmaps with 'type' field equal to one are dirty tracking bitmaps.
 | |
| 
 | |
| When the virtual disk is in use, a dirty tracking bitmap may be 'enabled' or
 | |
| 'disabled'. While the bitmap is 'enabled', all writes to the virtual disk
 | |
| should be reflected in the bitmap. A set bit in the bitmap means that the
 | |
| corresponding range of the virtual disk (see above) was written to while the
 | |
| bitmap was 'enabled'. An unset bit means that this range was not written to.
 | |
| 
 | |
| The software doesn't have to sync the bitmap in the image file with its
 | |
| representation in RAM after each write or metadata change. Flag 'in_use'
 | |
| should be set while the bitmap is not synced.
 | |
| 
 | |
| In the image file the 'enabled' state is reflected by the 'auto' flag. If this
 | |
| flag is set, the software must consider the bitmap as 'enabled' and start
 | |
| tracking virtual disk changes to this bitmap from the first write to the
 | |
| virtual disk. If this flag is not set then the bitmap is disabled.
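The tracking behaviour described in this section can be sketched on an
in-memory copy of the bitmap. Several things here are assumptions made for
illustration: the function names, the use of a bytearray as the RAM
representation, and the bit order within a byte (least significant bit first,
which is how the bit offset formula above is read here). Setting and clearing
'in_use' in the image file is left out.

```python
FLAG_IN_USE = 1 << 0
FLAG_AUTO = 1 << 1


def is_enabled(flags):
    """A stored bitmap is treated as 'enabled' if its 'auto' flag is set."""
    return bool(flags & FLAG_AUTO)


def mark_dirty(bitmap, offset, length, granularity_bits):
    """Set every bit whose range overlaps a guest write of `length` bytes."""
    if length <= 0:
        return
    first = offset >> granularity_bits
    last = (offset + length - 1) >> granularity_bits
    for bit_nr in range(first, last + 1):
        # Assumed bit order: bit_nr % 8 counts from the least significant bit.
        bitmap[bit_nr // 8] |= 1 << (bit_nr % 8)
```

A write that only partially overlaps the range of a bit still marks that bit,
since the range was written to.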
 |