How to get started with the Disaster Recovery solution for Performance Cloud VMware (NSX-T)

TABLE OF CONTENTS

Description
Definitions
- Recovery Point Objective (RPO)
Requirements
Important Notes
Procedures
References

Description

This guide helps you deploy your disaster recovery (or “DR”) solution for Performance Cloud VMware, based on VMware Cloud Director Availability. It covers how to set up the replication of virtual machines, how to run a test failover, and how to run a failover to recover from a disaster.

Definitions

Recovery Point Objective (RPO)

The RPO is the longest tolerable timeframe of data loss.

Example:

With a one (1) hour RPO, the recovered virtual machine can have lost no more than one (1) hour of data. Shorter RPO intervals reduce data loss during recovery but consume more network bandwidth to keep the replicated virtual machines up to date.

However, this does not mean that virtual machines are replicated exactly once every hour. Also, an RPO violation can occur if the RPO is too short for the size of the virtual machines, the data change rate inside them, and the available network bandwidth.
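
As a rough illustration of how change rate and bandwidth interact with the RPO, the sketch below estimates how long it takes to transfer the data changed since the last replication and compares that with the RPO. This is a deliberately simplified model, not how the replication scheduler actually works: the function names and the sample figures are assumptions, and it ignores compression, protocol overhead, and competing traffic.

```python
def transfer_minutes(changed_gb: float, bandwidth_mbps: float) -> float:
    """Minutes needed to send `changed_gb` of changed data at `bandwidth_mbps`."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    megabits = changed_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    return megabits / bandwidth_mbps / 60


def likely_rpo_violation(changed_gb: float, bandwidth_mbps: float,
                         rpo_minutes: float) -> bool:
    """True when the estimated transfer takes longer than the RPO (simplified)."""
    return transfer_minutes(changed_gb, bandwidth_mbps) > rpo_minutes


# Hypothetical figures: 20 GB of changed data over a 100 Mbps link.
print(round(transfer_minutes(20, 100), 1))   # about 26.7 minutes
print(likely_rpo_violation(20, 100, 15))     # a 15-minute RPO is at risk
print(likely_rpo_violation(20, 100, 60))     # a 60-minute RPO is not
```

If the estimate is close to or above the RPO, consider a longer RPO or more bandwidth before enabling the replication.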

For more information about the replication scheduler and how the replication policy works, please see the following articles:

- https://docs.vmware.com/en/vSphere-Replication/8.6/com.vmware.vsphere.replication-admin.doc/GUID-84FAF645-1C65-413D-A89B-70DBA0990631.html

- https://docs.vmware.com/en/vSphere-Replication/8.6/com.vmware.vsphere.replication-admin.doc/GUID-07B5263A-8E10-42E7-B68B-325BBA910489.html

Requirements

For now, there is no offer or SKU to enable in your Cumulus account management portal to get this feature. Please contact your account manager for more information.
This guide assumes that you already have your production servers running in the primary site.
Required virtual networks, IP Sets, custom applications, and NAT & firewall rules must be preconfigured in your secondary organization to allow a faster recovery and to properly test recovery plans. Please follow the Getting Started guide for Performance Cloud VMware for guidance.
To configure email notifications, you must provide your own SMTP settings.

Important Notes

For now, this disaster recovery (or “DR”) solution is only available in Canada for organizations in Performance Cloud VMware (NSX-T).
In your secondary site, a new WAN IP address will be used. Plan your disaster recovery accordingly, including any required DNS and/or VPN changes.
It is strongly recommended to test your recovery plan at regular intervals.
You will receive a separate set of credentials to access the secondary organization.
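
Because the secondary site uses a new WAN IP address, it can be useful to confirm after a DNS change that a public hostname resolves to the new address. The sketch below is one possible way to check this with the Python standard library. The hostname and IP address in the usage comment are placeholders, and the `resolver` parameter is only an illustration device so the lookup can be replaced during testing.

```python
import socket
from typing import Callable, List, Tuple

Resolver = Callable[[str], Tuple[str, List[str], List[str]]]


def resolves_to(hostname: str, expected_ip: str,
                resolver: Resolver = socket.gethostbyname_ex) -> bool:
    """Return True if `hostname` currently resolves to `expected_ip` (IPv4)."""
    try:
        _, _, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror: the name could not be resolved at all.
        return False
    return expected_ip in addresses


# Usage (placeholder values, replace with your own record and WAN IP):
# resolves_to("app.example.com", "203.0.113.10")
```

Keep in mind that cached DNS records may still return the old address until their TTL expires.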

Procedures

Review the peer sites

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Peer Sites.

You should see two (2) peer sites. In this example, VCAV-CAE2 will be the destination site for the replication task.

How to configure the replication of virtual machines

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Incoming Replications. Then, click on the “New protection” icon.
In the new window, enter the credentials of a user with the “Organization Administrator” role in the primary organization (in the production site), using this format: username@organization

In this example: username@org-name-prod (“username” is a user with the “Organization Administrator” role for the organization named “org-name-prod”)

Note: Credentials for the other organization are not saved. You will not be asked for them again as long as your session remains active on the portal.
Select the source virtual machines to replicate and click on NEXT.

Note: If you were not able to authenticate at the previous step, the NEXT button will remain greyed out.

In this example, the vApp named “vApp-SW” is chosen.
Select the destination virtual data center or "VDC" and the storage policy for the replication. The storage policy is applied to the whole virtual machine. A specific disk cannot be replicated to a different storage tier. Then, click on NEXT.

In most cases, you will have few or no customization options.
In the Settings section, you can enable the option to exclude some disks from the replication task.
Then, click on NEXT.
If the option to exclude disks was previously enabled, select the virtual machine(s) and deselect the disk(s) that you do not want to replicate. Then, click on NEXT.
Review selected replication settings and click on FINISH.

After this step, the replication will occur, and disks will be created in the destination site.

Monitor the replication status

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
From the dashboard, you can review the replication status, the overall replication health and some charts.
You can also check whether RPO violations occur frequently.

Configure email notifications

If the setup of email notifications is already in place in the Performance Cloud VMware portal, steps 2 to 4 can be skipped.

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
Click on Administration, then, under the Settings section, click on Email.
In the Email sub-section, click on EDIT.
Fill all required fields with your SMTP settings and click on Notification Settings.
Enter the desired sender’s email address and click on SAVE.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
Click on Events and Notifications. From the Events and Notifications section, you can quickly enable email notifications for all event types.
Optionally, you can then configure replication notifications based on RPO violation thresholds by clicking on EDIT.

Once thresholds are updated, click on APPLY.
Test your SMTP settings and update them if needed.
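
If you want to verify your SMTP settings independently of the portal, a small script such as the sketch below can send a test message. It is only an illustration: the host, port, and addresses in the usage comment are placeholders, and it assumes your SMTP server accepts STARTTLS with username and password authentication, which may not match your provider.

```python
import smtplib
from email.message import EmailMessage


def build_test_message(sender: str, recipient: str) -> EmailMessage:
    """Build a short plain-text message used only to validate SMTP settings."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = "SMTP settings test"
    message.set_content("Test message to validate SMTP settings for DR notifications.")
    return message


def send_test_message(host: str, port: int, username: str, password: str,
                      message: EmailMessage) -> None:
    """Send `message` through the given SMTP server (assumes STARTTLS + login)."""
    with smtplib.SMTP(host, port, timeout=10) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)


# Usage (placeholder values, replace with your own SMTP settings):
# msg = build_test_message("dr-alerts@example.com", "admin@example.com")
# send_test_message("smtp.example.com", 587, "dr-alerts@example.com", "password", msg)
```

An authentication or connection error raised here usually points to the same setting that would prevent the portal from sending notifications.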

How to configure the recovery settings

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Incoming Replications. Then, select a vApp or a virtual machine (change the Grouping view if needed). Once the selection is made, click on ALL ACTIONS, then click on Recovery Settings.
On the next screen, the target network can be enforced for the failover or the test failover.
Settings for NICs or the guest customization can also be customized for failovers or test failovers if needed.
Once you are done with recovery settings, click on APPLY.

Manage virtual machine’s instances

In some cases (such as security investigations, ransomware, or legal hold), you may want to keep a particular virtual machine’s instance available without having to stop the replication. In that case, you can store that specific instance.

Store a virtual machine’s instance

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Incoming Replications. Then, click on the replication task.
Identify and select the virtual machine’s instance time you want to store. Then, click on STORE.
Accept by clicking on STORE again.
The retention period for the selected virtual machine’s instance will then change to Permanent.

Stop storing a virtual machine’s instance

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Incoming Replications. Then, click on the replication task.
Identify and select the virtual machine’s instance time you don’t want to store anymore. Then, click on DON’T STORE.
Accept by clicking on “DON’T STORE” again.

Configure a recovery plan

Here are steps to create a simple recovery plan.

For an advanced network, or to recover in different ways depending on the disaster, you can create multiple recovery plans. Recovery plans can also be cloned to speed up their creation.

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Recovery Plans. Then, click on the “New recovery plan” icon.
Name your plan and add a description. Then, click on OK.
Start configuring the plan by adding a new step to the plan.
Name your step. Optionally, add some wait time before the next step, or add a message to display for a manual validation before the next step. Then, click on NEXT.

Example: start the domain controller and wait 60 seconds before starting the second step.
Select the virtual machine(s) to recover in this first step and click on NEXT.
Review chosen settings and click on FINISH.

Repeat steps 5 to 8 to add more steps to the recovery plan.

Example: start the database server as a second step and ask for a manual validation.

Example: start the remaining virtual machines as a third step, without validation or wait time.
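
To make the ordering in the three examples above easier to review before entering it in the portal, the plan can be written down as a small data structure. The sketch below is only a planning aid and does not interact with the portal. The class, the virtual machine names, and the prompt text are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanStep:
    name: str
    vms: List[str]
    wait_seconds: int = 0          # delay before the next step starts
    prompt: Optional[str] = None   # message requiring manual validation


def describe(plan: List[PlanStep]) -> List[str]:
    """Return one readable line per step, in execution order."""
    lines = []
    for number, step in enumerate(plan, start=1):
        line = f"{number}. {step.name}: start {', '.join(step.vms)}"
        if step.wait_seconds:
            line += f", then wait {step.wait_seconds}s"
        if step.prompt:
            line += f", then prompt: {step.prompt}"
        lines.append(line)
    return lines


# Hypothetical plan mirroring the three example steps.
plan = [
    PlanStep("Domain controller", ["dc01"], wait_seconds=60),
    PlanStep("Database server", ["sql01"], prompt="Confirm the database is online"),
    PlanStep("Remaining servers", ["app01", "web01"]),
]

for line in describe(plan):
    print(line)
```

Writing the plan down this way can also help when cloning a plan and adapting it to a different disaster scenario.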

Test a recovery plan

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Recovery Plans. Then, click on the recovery plan to test and click on the Test button.
Click on OK.
vApps and virtual machines will now be created in the recovery site. Virtual machines will start according to the steps in the recovery plan (including any configured wait time or prompts). Required networks will be added to vApps and recovery settings will be applied.

If a prompt was configured, you must acknowledge the message to allow the plan to continue its execution with next steps.
Once the recovery plan execution is completed, validate the network functionality and test server roles, accessible services and applications for remote users (including but not limited to Remote Desktop Services and Web Services).
Adjust the recovery plan if needed. If applicable, add any missing configuration that would allow a faster recovery.
Once tests are completed, run the test cleanup.
vApps and virtual machines will now be deleted from the recovery site.
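
Part of the validation described above (for example, Remote Desktop Services and Web Services) can be scripted as a basic TCP reachability check while the test is still running, before the cleanup. The sketch below only confirms that a port accepts connections. It does not validate the service behind it. The addresses and ports in the usage comment are placeholders.

```python
import socket
from typing import Dict, List, Tuple

Target = Tuple[str, int]


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out, or name resolution failed.
        return False


def check_services(targets: List[Target], timeout: float = 3.0) -> Dict[Target, bool]:
    """Check every (host, port) pair and map it to its reachability."""
    return {target: port_is_open(target[0], target[1], timeout) for target in targets}


# Usage (placeholder values: RDP on 3389 and HTTPS on 443):
# results = check_services([("203.0.113.10", 3389), ("203.0.113.10", 443)])
# for (host, port), ok in results.items():
#     print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

A port that is not reachable often points to a missing NAT or firewall rule in the secondary organization.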

Proceed with a failover

Use this option to recover from a disaster affecting the primary site.

A failover and a test failover are similar. The main differences are that after a failover, there is no cleanup option and the replication is stopped.

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Recovery Plans. Then, click on the recovery plan to start and click on the Failover button.
Click on OK.
vApps and virtual machines will now be created in the recovery site. Virtual machines will start according to the steps in the recovery plan (including any configured wait time or prompts). Required networks will be added to vApps and recovery settings will be applied.

If a prompt was configured, you must acknowledge the message to allow the plan to continue its execution.
Once the recovery plan execution is completed, the production servers now run in the secondary site.
Validate the network functionality and test server roles, accessible services and applications for remote users (including but not limited to Remote Desktop Services and Web Services).
If applicable, add any missing configuration to make all services available.
Once the primary site is back online, you can also consider deleting the virtual machines in the primary location, as their data is outdated.

Failback to the production site

Once the primary site is back online, a failback to the primary site can be considered.

Log in to the Performance Cloud VMware portal as a user with the “Organization Administrator” role on the secondary organization.
In the More section, click on Availability (VCAV-XXXX). XXXX will change according to the site in use.
On the next screen, click on Incoming Replications and select the replication task. Then, click on ALL ACTIONS, then click on Reverse.
In the new window, enter the credentials of a user with the “Organization Administrator” role in the primary organization (in the primary/production site), using this format: username@organization

In this example: username@org-name-prod (“username” is a user with the “Organization Administrator” role for the organization named “org-name-prod”)

Note: Credentials for the other organization are not saved. You will not be asked for them again as long as your session remains active on the portal.
Click on REVERSE.
The source and the destination for the replication task will now be reversed.

If the reverse operation fails, check that you have enough free disk space in the destination. It may also be impossible to replicate all virtual machines to the same storage policy if multiple disk tiers are in use. If applicable, skipped virtual machines can be added later using a different storage policy.

You can also consider removing all replication jobs and all recovery plans, then creating them again to replicate virtual machines in the opposite direction (toward the primary site).
Proceed with a failover (refer to the previous section).

References

- https://techdocs.broadcom.com/us/en/vmware-cis/cloud-director/availability/4-7/availability-user-guide-4-7.html
