Introduction to Protecting Virtual Machines
Virtual machines are protected in virtual protection groups. A virtual protection group (VPG) is a group of virtual machines that you group together for recovery purposes. For example, an application like Microsoft Exchange might use one virtual machine for the software, one for the database, and a third for the Web Server. All three virtual machines must be replicated together to maintain data integrity.
Any virtual machine whose operating system is supported in both the protected site and recovery site can be protected in a VPG.
Once a virtual machine is protected, all changes made on the machine are replicated in the remote site. The replicated virtual machines in the remote site can be recovered to any point in time defined for the VPG or, if a period further in the past is required, a retention set can be restored.
When a VPG is created, a replica of each virtual machine disk in the VPG is created under a VRA on the recovery site. These replica virtual disks must be populated with the data in the protected virtual machines, which is done by synchronizing the protected virtual machines with the recovery site replicas. This synchronization between the protected site and remote site takes time, depending on the size of the virtual machines.
After the initial synchronization completes, only the writes to disk from the virtual machines in the protected site are sent to the remote site. These writes are stored by the VRA in the remote site in journals for a specified period, after which they are promoted to the replica virtual disks managed by the VRA.
The number of VPGs that can be defined on a site is limited only by the number of virtual machines that can be protected.
For the maximum number of virtual machines, either being protected or recovered to a site, see Zerto Scale and Benchmarking Guidelines.
Note: If the total number of protected virtual machines on the paired sites is 5000, then any additional machines are not protected.
Any virtual machine that is supported by the hypervisor can be protected. When recovering to a different hypervisor, the protected virtual machines must also be supported by the recovery hypervisors.
The following topics are described in this section:
• Configuring Virtual Protection Groups
• The Role of the Journal During Protection
• What Happens After the VPG is Defined
Configuring Virtual Protection Groups
Use the following guidelines:
• You protect one or more virtual machines in a VPG.
• The VPG must include at least one virtual machine.
• After creating a VPG, you can add or remove virtual machines as required.
• You can only protect a virtual machine in a VPG when the virtual machine has no more than 60 disks.
The 60 disks can be a combination of IDE and SCSI disks, where each virtual machine can have up to 2 IDE controllers each with a maximum of 4 IDE disks and up to 4 SCSI controllers each with a maximum of 15 disks, so that the total of IDE and SCSI disks does not exceed 60 disks.
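The disk limits above can be sketched as a simple pre-check. This is a minimal illustration only: the function name `can_protect` and its list-based inputs are hypothetical and are not part of any Zerto API. The limits are the ones stated in this section.

```python
# Hypothetical pre-check for the per-VM disk limits described above.
# The function and its inputs are illustrative, not a Zerto API.
MAX_DISKS_PER_VM = 60
MAX_IDE_CONTROLLERS = 2
MAX_DISKS_PER_IDE_CONTROLLER = 4
MAX_SCSI_CONTROLLERS = 4
MAX_DISKS_PER_SCSI_CONTROLLER = 15


def can_protect(ide_disks, scsi_disks):
    """Return True if the disk layout fits the documented limits.

    ide_disks and scsi_disks are lists holding the number of disks
    attached to each IDE or SCSI controller, one entry per controller.
    """
    if len(ide_disks) > MAX_IDE_CONTROLLERS:
        return False
    if len(scsi_disks) > MAX_SCSI_CONTROLLERS:
        return False
    if any(n > MAX_DISKS_PER_IDE_CONTROLLER for n in ide_disks):
        return False
    if any(n > MAX_DISKS_PER_SCSI_CONTROLLER for n in scsi_disks):
        return False
    # The combined IDE and SCSI total must not exceed 60 disks.
    return sum(ide_disks) + sum(scsi_disks) <= MAX_DISKS_PER_VM
```

Note that fully populating every controller (8 IDE disks plus 60 SCSI disks) gives 68 disks, which is why the total is checked separately from the per-controller limits.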
When the recovery site is VMware vSphere, any IDE disks are converted to SCSI disks. When the recovery site is Amazon Web Services (AWS), you can only protect virtual machines in the protected site that are supported by AWS in the recovery site. For AWS, the maximum number of supported disks is 12 for virtual machines running a Windows operating system and 1 for virtual machines running a Linux operating system.
You can protect a virtual machine in several VPGs. A virtual machine can be in a maximum of three VPGs. VPGs that contain the same virtual machine cannot be recovered to the same site.
Note: Protecting virtual machines in several VPGs is enabled only if both the protected site and the recovery site, as well as the VRAs installed on these sites, are version 5.0 or higher.
The virtual machines can be defined under a single hypervisor host or under multiple hosts. The recovery can also be to a single host or multiple hosts. The virtual machines are recovered with the same configuration as the protected machines. For example, if a virtual machine in the protected site is configured so that space is allocated on demand and this machine is protected in a VPG, then during recovery the machine is defined in the recovery site with the same space allocation configuration.
You protect virtual machines by creating the VPG on the site hosting these virtual machines. After the VPG is created, you can add or remove virtual machines from the VPG by editing the VPG in the Zerto User Interface running on either the protected or recovery site.
Note: To create a VPG you must have a recovery site available with a host with a VRA installed. The recovery site can either be a remote site, paired with the protected site, or the protected site itself, where both protection and recovery are to the same Zerto Virtual Manager site.
The VPG definition consists of the following:
General: A name to identify the VPG and the priority to assign to the VPG.
Virtual machines: The list of virtual machines being protected as well as the boot order and boot delay to apply to the virtual machines during recovery.
Replication Settings: VPG replication settings, such as the recovery site, host, and storage, and the VPG SLA. SLA information includes the default journal history settings and how often tests should be performed on the VPG. The defaults are applied to every virtual machine in the VPG but can be overridden per virtual machine, as required.
Cloud service providers can group the VPG SLA properties together in a service profile. When a service profile is used, the VPG SLA settings cannot be modified unless a Custom service profile is available.
Storage Settings: By default the storage used for the virtual machine definition is also used for the virtual machine data. This storage can be overridden per virtual machine, as required.
Recovery Settings: Recovery details include the networks to use for recovered virtual machines and scripts that should be run either at the start or end of a recovery operation.
NIC Settings: Specify the network details to use for the recovered virtual machines after a live or test failover or migration.
Retention Policy Settings: Specify the VPG’s retention properties, including the repository where the retention sets are saved.
The Role of the Journal During Protection
After defining a VPG, the protected virtual machine disks are synced with the recovery site. After initial synchronization, every write to a protected virtual machine is copied by Zerto to the recovery site. The write continues to be processed normally on the protected site and the copy is sent asynchronously to the recovery site and written to a journal managed by a Virtual Replication Appliance (VRA). Each protected virtual machine has its own journal.
In addition to the writes, every few seconds all journals are updated with a checkpoint time-stamp. Checkpoints are used to ensure write order fidelity and crash-consistency. A recovery can be done to the last checkpoint or to a user-selected, crash-consistent checkpoint. This enables recovering the virtual machines either to the last crash-consistent point-in-time or, for example, when the virtual machine is attacked by a virus, to a point-in-time before the virus attack.
Data and checkpoints are written to the journal until the specified journal history size is reached, which is the optimum situation. At this point, as new writes and checkpoints are written to a journal, the older writes are written to the virtual machine recovery virtual disks. When specifying a checkpoint to recover to, the checkpoint must still be in the journal. For example, if the value specified is 24 hours, then recovery can be specified to any checkpoint up to 24 hours old. After the time specified, the mirror virtual disk volumes maintained by the VRA are updated.
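The rule that a checkpoint must still be in the journal can be pictured as a time-window test. The sketch below assumes the journal holds exactly the configured history, which is the optimum situation described above. The function name `is_recoverable` is hypothetical and is not a Zerto API.

```python
from datetime import datetime, timedelta


def is_recoverable(checkpoint_time, now, journal_history):
    """Return True if the checkpoint is still inside the journal window.

    Assumes the journal currently holds the full configured history;
    in practice the maintained history can be shorter.
    """
    age = now - checkpoint_time
    return timedelta(0) <= age <= journal_history


# Example: with 24 hours of journal history, a checkpoint from
# 23 hours ago is selectable, but one from 25 hours ago is not.
now = datetime(2024, 1, 2, 12, 0)
recent = is_recoverable(now - timedelta(hours=23), now, timedelta(hours=24))
stale = is_recoverable(now - timedelta(hours=25), now, timedelta(hours=24))
```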
During recovery, the virtual machines at the recovery site are created and the recovery disks for each virtual machine, managed by the VRA, are attached to the recovered virtual machines. Information in the journal is promoted to the virtual machines to bring them up to the date and time of the selected checkpoint. To improve the RTO during recovery, the virtual machine can be used even before the journal data has been fully promoted. Every request is analyzed and the response is returned from the virtual machine directly or, if the information in the journal is more up-to-date, it comes from the journal. This continues until the recovery site’s virtual environment is fully restored to the selected checkpoint.
Each protected virtual machine has its own dedicated journal, consisting of one or more volumes. A dedicated journal enables journal data to be maintained, even when changing the host for the recovery. The default storage used for a journal is the storage used for recovery of each virtual machine. Thus for example, if protected virtual machines in a VPG are configured with different recovery storage, the journal data is by default stored for each virtual machine on that virtual machine recovery storage.
The journals for the protected virtual machines are defined as part of the VPG definition and by default are defined to reside on the same storage as the virtual machine. This can be overridden at the virtual machine and VPG levels as follows.
Journal Storage Option | Allows Storage Tiering | Notes |
---|---|---|
Default Journal | No | The journal is located on the virtual machine recovery storage. By default, the recovery storage for each virtual machine journal is the same as the virtual machine recovery storage. |
Journal storage separate from VM storage for each VM | No | Specify a journal storage for each virtual machine. All journals for the virtual machine are stored in this storage. |
Journal storage for each VPG | Yes | Specify a journal storage for each VPG. All journals for the virtual machines in the VPG are stored in this storage. |
Journal storage for multiple VPGs | Yes | Enables the use of advanced settings such as storage IO controls etc., to provide individualized service to customers by grouping VPGs by customer and assigning each group to a specific storage. This option is recommended for cloud service providers. |
Journal Sizing
The journal space is always allocated on demand. The provisioned journal size initially allocated for a journal is 16GB. The provisioned journal size is the current size of all the journal volumes.
If the journal grows to approximately 80% of the provisioned journal size or less than 6GB remains free, a new volume is added to increase the journal size. Each new journal volume added is bigger than the previous volume. The journal size can increase up until a specified hard limit. If the size of the journal is reduced in the VPG definition after new volumes have been added, these volumes are not reduced and continue to be used if required. In this case, the journal size can be bigger than the set size and the reduced journal size definition is not applied, except to ensure that no new volumes are created if the new journal size is reached or exceeded.
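The growth rule above can be sketched as follows. This is an illustration under stated assumptions, not Zerto's implementation: the function name `should_add_volume` is hypothetical, "approximately 80%" is treated as an exact threshold, and the hard limit is modeled as a cap on the provisioned size.

```python
# Hypothetical sketch of the journal growth trigger described above.
GROWTH_USED_FRACTION = 0.80   # "approximately 80%" treated as exact here
GROWTH_MIN_FREE_GB = 6


def should_add_volume(used_gb, provisioned_gb, hard_limit_gb):
    """Return True if a new journal volume would be added."""
    # No new volumes are created once the limit is reached or exceeded.
    if provisioned_gb >= hard_limit_gb:
        return False
    free_gb = provisioned_gb - used_gb
    return (used_gb >= GROWTH_USED_FRACTION * provisioned_gb
            or free_gb < GROWTH_MIN_FREE_GB)
```

For the initial 16GB journal, the 6GB rule applies first: less than 6GB remains free as soon as more than 10GB is used, whereas 80% of 16GB is 12.8GB.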
The provisioned journal size reported in the Resources report can fluctuate considerably when new volumes are added or removed.
When the amount of the journal used is approximately 50% of the provisioned journal size, the biggest unused journal volume from the added volumes is marked for removal. If the volume is still unused, it is removed after a period equal to three times the journal history or twenty-four hours, whichever is longer.
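The waiting period before a marked volume is removed can be written as a one-line calculation. The function name `removal_delay` is hypothetical and is used here only for illustration.

```python
from datetime import timedelta


def removal_delay(journal_history):
    """Return how long a marked volume must stay unused before removal.

    The delay is three times the journal history or 24 hours,
    whichever is longer.
    """
    return max(3 * journal_history, timedelta(hours=24))


# A 4-hour journal history gives 12 hours, so the 24-hour floor applies.
short_history_delay = removal_delay(timedelta(hours=4))
# A 24-hour journal history gives 72 hours.
long_history_delay = removal_delay(timedelta(hours=24))
```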
The datastore where the journal resides must have at least 30GB free, or 15% free space relative to the total datastore space, whichever number of GBs is smaller.
If the available storage of the journal datastore falls below this threshold:
• The datastore itself is considered full.
• An error alert is issued and all writes to the journal volumes on that datastore are blocked.
• Replication is halted, but history is not lost.
• The RPO begins to steadily increase until additional datastore space is made available.
Examples:
• For a large (2TB) datastore: 15% free space remaining = 307GB.
The ZVM would not consider the datastore full if 307GB of free space were remaining. Free space falling below 30GB would trigger an alert, as 30GB is the smaller figure.
• For a small (100GB) datastore: 15% free space remaining = 15GB.
The ZVM would not consider the datastore full if 30GB of free space were remaining. Free space falling below 15GB would trigger an alert, as 15GB is the smaller figure.
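The two examples can be reproduced with a short calculation. The function names below are hypothetical, and 2TB is taken as 2048GB to match the 307GB figure.

```python
# Hypothetical helpers reproducing the datastore examples above.
MIN_FREE_GB = 30
MIN_FREE_PERCENT = 15


def free_space_threshold_gb(total_gb):
    """Return the smaller of 30GB and 15% of the total datastore size."""
    return min(MIN_FREE_GB, total_gb * MIN_FREE_PERCENT / 100)


def is_datastore_full(total_gb, free_gb):
    """Return True if free space is below the threshold."""
    return free_gb < free_space_threshold_gb(total_gb)


large_threshold = free_space_threshold_gb(2048)  # 15% is 307.2GB, so 30GB applies
small_threshold = free_space_threshold_gb(100)   # 15% is 15GB, so 15GB applies
```

With these definitions, a 100GB datastore with 20GB free is not considered full, while a 2TB datastore with 29GB free is.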
n a volume does not change the provisioned journal size, which is the current size of all the journal volumes.
When a virtual machine journal comes close to a specified hard limit, Zerto starts to move data to the target disks. Once this begins, the maintained history begins to decrease. If the journal history falls below 75% of the value specified for the journal history, a warning alert is issued in the GUI. If the history falls below one hour, an error is issued. However, if the amount of history defined is only one hour, an error is issued if it is less than 45 minutes.
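The alert thresholds above can be sketched as a small classifier. The function name `history_alert` and the returned strings are hypothetical. Applying the 45-minute error threshold to any configured history of one hour or less is an assumption, since the text only describes the one-hour case.

```python
from datetime import timedelta

ONE_HOUR = timedelta(hours=1)


def history_alert(configured, actual):
    """Classify the maintained journal history as ok, warning, or error."""
    # Assumption: the 45-minute floor applies when the configured
    # history is one hour (or less); otherwise the floor is one hour.
    error_floor = timedelta(minutes=45) if configured <= ONE_HOUR else ONE_HOUR
    if actual < error_floor:
        return "error"
    # A warning is issued below 75% of the configured journal history.
    if actual < 0.75 * configured:
        return "warning"
    return "ok"
```

For example, with 4 hours configured, 2 hours of maintained history yields a warning and 30 minutes yields an error.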
Testing Considerations When Determining Journal Size
When a VPG is tested, either during a failover test or before committing a Move or Failover operation, a scratch volume is created for each virtual machine being tested. The scratch volume created uses the same size limit defined for the virtual machine journal.
The size limit of the scratch volume determines the length of time that you can test for. Larger limits enable longer testing times if the rate of change is constant. If a small hard limit size is set for the journal, for example enough for only 2–3 hours of history, the scratch volume created for testing will also be small, thus limiting the time available for testing.
What Happens After the VPG is Defined
After defining a VPG, the VPG is created. For the creation to be successful, the storage used for the recovery must have either 30GB free space or 15% of its size free. This requirement ensures that during protection the VRA, which manages the virtual machine journal and data, cannot completely fill the storage, which would result in the VRA freezing and no longer protecting any of the virtual machines using that VRA.
The VRA in the remote site is updated with information about the VPG and then the data on the protected virtual machines are synchronized with the replication virtual machines managed by the VRA on the recovery site. This process can take some time, depending on the size of the VMs and the bandwidth between the sites.
During this synchronization, you cannot perform any replication task, such as adding a checkpoint.
For synchronization to work, the protected virtual machines must be powered on. The VRA requires an active IO stack to access the virtual machine data to be synchronized across the sites. If the virtual machine is not powered on, there is no IO stack to use to access the protected data to replicate to the target recovery disks and an alert is issued.
Once synchronized, the VRA on the recovery site includes a complete copy of every virtual machine in the VPG. After synchronization the virtual machines in the VPG are fully protected, meeting their SLA, and the delta changes to these virtual machines are sent to the recovery site.
For details of monitoring a VPG, see Monitoring a Single VPG.
Recovery
After initializing the VPG, all writes to the protected virtual machines are sent by the VRA on the relevant host for each virtual machine on the protected site to the VRA on the recovery site specified as the recovery host for the virtual machine. The information is saved in the journal for the virtual machine with a timestamp, ensuring write order fidelity. Every few seconds the Zerto Virtual Manager writes a checkpoint to every journal on the recovery site for every virtual machine in the VPG, ensuring crash-consistency.
The data remains in the journal for the time defined by the journal history configuration, after which it is moved to the relevant mirror disks for each virtual machine. Both the journal and the mirror disks are managed by the VRA.
When you recover virtual machines with a failover or move, test a failover, or clone protected virtual machines in the recovery site, you specify the checkpoint to which you want the virtual machines to be recovered. The mirror disks and journal are used to recover the virtual machines to this point-in-time.
File and Folder Recovery
After initializing the VPG, instead of recovering a virtual machine, you can recover specific files and folders in the protected virtual machines from a checkpoint.