The Role of the Journal During Protection
After defining a VPG, the protected virtual machine disks are synced with the recovery site. After initial synchronization, every write to a protected virtual machine is copied by Zerto Virtual Replication to the recovery site. The write continues to be processed normally on the protected site and the copy is sent asynchronously to the recovery site and written to a journal managed by a Virtual Replication Appliance (VRA). Each protected virtual machine has its own journal.
In addition to the writes, every few seconds all journals are updated with a checkpoint time-stamp. Checkpoints are used to ensure write order fidelity and crash-consistency. A recovery can be done to the last checkpoint or to a user-selected, crash-consistent checkpoint. This enables recovering the virtual machines either to the last crash-consistent point-in-time or, for example, when the virtual machine is attacked by a virus, to a point-in-time before the virus attack.
Data and checkpoints are written to the journal until the specified journal history size is reached, which is the optimum situation. At this point, as new writes and checkpoints are written to a journal, the older writes are written to the virtual machine recovery virtual disks. When specifying a checkpoint to recover to, the checkpoint must still be in the journal. For example, if the value specified is 24 hours, then recovery can be specified to any checkpoint from the last 24 hours. Writes older than the time specified are applied to the mirror virtual disk volumes maintained by the VRA.
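The recovery window described above can be illustrated with a minimal Python sketch. It assumes a purely time-based window and ignores the journal hard limit, which can shorten the history that is actually kept. The function name `recoverable_checkpoints` and its parameters are hypothetical and are not part of Zerto Virtual Replication.

```python
from datetime import datetime, timedelta

def recoverable_checkpoints(checkpoints, now, journal_history):
    """Return the checkpoints that still fall inside the journal history window.

    Hypothetical helper: models only the time-based rule from the text.
    """
    oldest_allowed = now - journal_history
    return [cp for cp in checkpoints if oldest_allowed <= cp <= now]

# Example: a 24-hour journal history.
now = datetime(2024, 1, 2, 12, 0)
checkpoints = [
    now - timedelta(hours=25),  # already applied to the recovery disks
    now - timedelta(hours=23),  # still in the journal
    now - timedelta(hours=1),   # still in the journal
]
window = recoverable_checkpoints(checkpoints, now, timedelta(hours=24))
```

In this example only the two most recent checkpoints can be selected for recovery.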
During recovery, the virtual machines at the recovery site are created and the recovery disks for each virtual machine, managed by the VRA, are attached to the recovered virtual machines. Information in the journal is promoted to the virtual machines to bring them up to the date and time of the selected checkpoint. To improve the RTO during recovery, the virtual machine can be used even before the journal data has been fully promoted. Every request is analyzed and the response is returned from the virtual machine directly or, if the information in the journal is more up-to-date, it comes from the journal. This continues until the recovery site’s virtual environment is fully restored to the selected checkpoint.
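The way a request is answered while promotion is still in progress can be pictured with the following conceptual Python sketch. It is a simplified model under stated assumptions, not the VRA implementation: blocks are identified by an integer, the journal is a list of writes in write order, and the names `read_block`, `recovery_disk`, and `journal_writes` are hypothetical.

```python
def read_block(block_id, recovery_disk, journal_writes, checkpoint):
    """Serve a read during promotion (conceptual model only).

    journal_writes is a list of (timestamp, block_id, data) tuples in write
    order. A journal entry is used only if it is at or before the selected
    checkpoint; otherwise the data comes from the recovery disk.
    """
    latest = None
    for timestamp, written_block, data in journal_writes:
        if written_block == block_id and timestamp <= checkpoint:
            latest = data  # later entries overwrite earlier ones
    return latest if latest is not None else recovery_disk[block_id]

recovery_disk = {0: b"disk-0", 1: b"disk-1"}
journal_writes = [(1, 0, b"a"), (2, 0, b"b"), (5, 0, b"c")]

from_journal = read_block(0, recovery_disk, journal_writes, checkpoint=3)
from_disk = read_block(1, recovery_disk, journal_writes, checkpoint=3)
```

Here block 0 is answered from the journal, using the newest write at or before the checkpoint, while block 1 has no journal entry and is answered from the recovery disk.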
Each protected virtual machine has its own dedicated journal, consisting of one or more volumes. A dedicated journal enables journal data to be maintained, even when changing the host for the recovery. The default storage used for a journal is the storage used for recovery of each virtual machine. Thus, for example, if protected virtual machines in a VPG are configured with different recovery storage, the journal data for each virtual machine is by default stored on that virtual machine's recovery storage. When protecting to VMware vCloud Director, the default storage used for a journal is the storage with the most free space, chosen from the storage defined as journal storage for the provider vDC in the Configure Provider vDCs dialog or, if no journal storage was defined in that dialog, from any storage visible to the recovery host.
The journals for the protected virtual machines are defined as part of the VPG definition and by default are defined to reside on the same storage as the virtual machine. This can be overridden at the virtual machine and VPG levels as follows.
Journal Option | Allows Storage Tiering | Notes |
Default Journal | No | The journal is located on the virtual machine recovery datastore. By default, the recovery datastore for each virtual machine journal is the same as the virtual machine recovery datastore. |
Journal datastore separate from VM datastore for each VM | No | Specify a journal datastore for each virtual machine. All journals for the virtual machine are stored in this datastore. |
Journal datastore for each VPG | Yes | Specify a journal datastore for each VPG. All journals for the virtual machines in the VPG are stored in this datastore. |
Journal datastore for multiple VPGs | Yes | Enables the use of advanced settings such as storage IO controls etc., to provide individualized service to customers by grouping VPGs by customer and assigning each group to a specific storage. This option is recommended for cloud service providers. |
Journal Sizing
The journal space is always allocated on demand. The provisioned journal size initially allocated for a journal is 16GB. The provisioned journal size is the current size of all the journal volumes.
If the journal grows to approximately 80% of the provisioned journal size or less than 6GB remains free, a new volume is added to increase the journal size. Each new journal volume added is bigger than the previous volume. The journal size can increase up until a specified hard limit. If the size of the journal is reduced in the VPG definition after new volumes have been added, these volumes are not reduced and continue to be used if required. In this case, the journal size can be bigger than the set size and the reduced journal size definition is not applied, except to ensure that no new volumes are created if the new journal size is reached or exceeded.
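The growth rule above can be expressed as a short Python sketch. It treats the "approximately 80%" threshold as exact and assumes that no volume is added once the provisioned size has reached the hard limit. The function name and parameters are hypothetical.

```python
def needs_new_volume(used_gb, provisioned_gb, hard_limit_gb=None):
    """Decide whether another journal volume should be added (sketch).

    Grows when usage reaches 80% of the provisioned size or when less
    than 6GB remains free, unless the hard limit is already reached.
    """
    if hard_limit_gb is not None and provisioned_gb >= hard_limit_gb:
        return False  # assumption: never grow at or past the hard limit
    free_gb = provisioned_gb - used_gb
    return used_gb >= provisioned_gb * 80 / 100 or free_gb < 6

# With the initial 16GB journal, the 6GB rule is reached first.
small_journal_grows = needs_new_volume(used_gb=10.5, provisioned_gb=16)
# With a 100GB journal, the 80% rule is reached first.
large_journal_grows = needs_new_volume(used_gb=85, provisioned_gb=100)
```

For the initial 16GB journal, less than 6GB is free as soon as more than 10GB is used, which is before the 80% mark of 12.8GB, so the free-space condition is the one that normally applies to small journals.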
The provisioned journal size reported in the Resources report can fluctuate considerably when new volumes are added or removed.
When the amount of the journal used is approximately 50% of the provisioned journal size, the biggest unused journal volume from the added volumes is marked for removal. If the volume is still unused after a period equal to three times the journal history or twenty-four hours, whichever is longer, it is removed.
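The waiting period before a marked volume is removed is a simple maximum of two durations. The sketch below shows the arithmetic only; the function name `removal_delay` is hypothetical.

```python
from datetime import timedelta

def removal_delay(journal_history):
    """Time a marked, unused volume is kept before removal (sketch)."""
    return max(3 * journal_history, timedelta(hours=24))

short_history_delay = removal_delay(timedelta(hours=4))   # 3 x 4h = 12h, so 24h applies
long_history_delay = removal_delay(timedelta(hours=24))   # 3 x 24h = 72h applies
```

With a 4-hour journal history the 24-hour minimum applies, while with a 24-hour history the volume is kept for 72 hours.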
Note: With VMware vSphere, when using VMFS datastores and when the VRA is on an ESXi host that is version 5.1 or higher, the journal can also reclaim unused space on a volume. Unused space is not reclaimed when using NFS datastores or any datastore with an ESXi host that is lower than version 5.1.
Reclaiming space on a volume does not change the provisioned journal size, which is the current size of all the journal volumes.
When a virtual machine journal comes close to a specified hard limit, Zerto Virtual Replication starts to move data to the target disks. Once this begins, the maintained history begins to decrease. If the journal history falls below 75% of the value specified for the journal history, a warning alert is issued in the GUI. If the history falls below one hour, an error is issued. However, if the amount of history defined is only one hour, an error is issued if it is less than 45 minutes.
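The alert thresholds above can be summarized in a small Python sketch. It assumes the error check takes precedence over the warning check; the function name `history_alert` and its return values are hypothetical labels, not Zerto alert identifiers.

```python
from datetime import timedelta

def history_alert(actual_history, configured_history):
    """Classify the maintained journal history against the configured value (sketch)."""
    # Error floor is one hour, or 45 minutes when only one hour is configured.
    if configured_history == timedelta(hours=1):
        error_floor = timedelta(minutes=45)
    else:
        error_floor = timedelta(hours=1)
    if actual_history < error_floor:
        return "error"
    if actual_history < 0.75 * configured_history:
        return "warning"
    return "ok"

# 24-hour history configured: the warning threshold is 18 hours.
status_20h = history_alert(timedelta(hours=20), timedelta(hours=24))
status_17h = history_alert(timedelta(hours=17), timedelta(hours=24))
status_30m = history_alert(timedelta(minutes=30), timedelta(hours=24))
```

For a configured history of one hour, 75% of the history and the 45-minute error floor coincide, so in this model the history goes straight from no alert to an error.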
The datastore where the journal resides must have at least 30GB free, or at least 15% of the total datastore space free, whichever number of GBs is smaller.
If the available storage of the journal datastore falls below 30GB or 15% of the total datastore size:
■ The datastore itself is considered full.
■ An error alert is issued and all writes to the journal volumes on that datastore are blocked.
■ Replication is halted, but history is not lost.
■ The RPO begins to steadily increase until additional datastore space is made available.
Examples:
■ For a large (2TB) datastore: 15% free space remaining = 307GB.
The ZVM would not consider the datastore full if 307GB of free space were remaining. 30GB free space remaining would trigger an alert, as it is the smaller figure.
■ For a small (100GB) datastore: 15% free space remaining = 15GB.
The ZVM would not consider the datastore full if 30GB of free space were remaining. 15GB free space remaining would trigger an alert, as it is the smaller figure.
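The two examples can be reproduced with a minimal Python sketch of the threshold calculation. It assumes that 2TB means 2048GB and that the datastore counts as full only when free space is strictly below the threshold; the behavior at exactly the threshold is an assumption. The function names are hypothetical.

```python
def datastore_full_threshold_gb(total_gb):
    """Smaller of 30GB and 15% of the total datastore size (sketch)."""
    return min(30.0, total_gb * 15 / 100)

def is_datastore_full(free_gb, total_gb):
    """True when free space falls below the threshold (strict comparison assumed)."""
    return free_gb < datastore_full_threshold_gb(total_gb)

large_threshold = datastore_full_threshold_gb(2048)  # 15% is 307.2GB, so 30GB applies
small_threshold = datastore_full_threshold_gb(100)   # 15% is 15GB, so 15GB applies
```

The fixed 30GB figure governs any datastore larger than 200GB, and the 15% figure governs smaller ones.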
Testing Considerations When Determining Journal Size
When a VPG is tested, either during a failover test or before committing a Move or Failover operation, a scratch volume is created for each virtual machine being tested. The scratch volume created uses the same size limit defined for the virtual machine journal.
The size limit of the scratch volume determines the length of time that you can test for. Larger limits enable longer testing times, assuming the rate of change is constant. If a small hard limit size is set for this amount of history, for example 2–3 hours, the scratch volume created for testing will also be small, thus limiting the time available for testing.
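As a rough planning aid, the available testing time can be estimated by dividing the scratch volume size limit by the rate of change. The Python sketch below assumes a constant change rate and that the scratch volume is filled only by writes made during the test; the function name and the figures used are hypothetical.

```python
def max_test_hours(scratch_limit_gb, change_rate_gb_per_hour):
    """Estimate how long a test can run before the scratch volume is full (sketch)."""
    if change_rate_gb_per_hour <= 0:
        raise ValueError("change rate must be positive")
    return scratch_limit_gb / change_rate_gb_per_hour

# Example: a 150GB limit with 5GB of changes per hour.
estimated_hours = max_test_hours(150, 5)
```

Under these assumptions a 150GB limit allows about 30 hours of testing, whereas a 15GB limit at the same change rate allows only about 3 hours.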