Disaster Recovery

Scope and Objectives of Recovery Plan

This plan is limited in scope to recovery and business continuance from a serious disruption in activities due to non-availability of migVisor’s facilities.

The objective of this plan is to coordinate the recovery of critical business functions, and to manage and support business recovery, in the event of a facilities disruption or disaster.

This includes short- or long-term disasters and other disruptions, such as fires, floods, earthquakes, explosions, terrorism, tornadoes, extended power interruptions, hazardous chemical spills, and other natural or man-made disasters.

Business Continuity and DR Plan Description

migVisor relies on the availability of services from its cloud service provider (CSP), as well as internal services, for normal operation. Below is a list of the major services and the guaranteed SLA availability of each:

Guaranteed SLA Availability
Service	Availability Target	Provider
Elasticsearch	>= 99.95%	migVisor
Cloud Storage	>= 99.95%	CSP
Cloud SQL	>= 99.99%	CSP
Kubernetes Engine	>= 99.95%	CSP
Pub/Sub	>= 99.95%	CSP
Cloud DNS	100%	CSP
Cloud Load Balancing	>= 99.99%	CSP
Mean Guaranteed SLA: (99.95% + 99.95% + 99.99% + 99.95% + 99.95% + 100% + 99.99%) / 7 ≈ 99.9686% availability
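
The mean above can be reproduced from the seven targets in the table. The following is a minimal Python sketch, not part of the plan itself. It also shows a stricter figure, the combined availability if all services must be up at the same time, which assumes that failures are independent. The names SLA_TARGETS, mean_availability, serial_availability and allowed_downtime_hours are hypothetical and used for illustration only.

```python
# Availability targets copied from the table above, in percent.
SLA_TARGETS = {
    "Elasticsearch": 99.95,
    "Cloud Storage": 99.95,
    "Cloud SQL": 99.99,
    "Kubernetes Engine": 99.95,
    "Pub/Sub": 99.95,
    "Cloud DNS": 100.0,
    "Cloud Load Balancing": 99.99,
}

HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def mean_availability(targets):
    """Arithmetic mean of the per-service targets, in percent."""
    return sum(targets.values()) / len(targets)


def serial_availability(targets):
    """Combined availability, in percent, if every service must be up at once.

    Assumes service failures are independent, which is a simplification.
    """
    combined = 1.0
    for pct in targets.values():
        combined *= pct / 100.0
    return combined * 100.0


def allowed_downtime_hours(availability_pct):
    """Yearly downtime budget implied by an availability percentage."""
    return (1.0 - availability_pct / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    mean_pct = mean_availability(SLA_TARGETS)
    serial_pct = serial_availability(SLA_TARGETS)
    print(f"Mean of targets: {mean_pct:.4f}%")
    print(f"All services up at once: {serial_pct:.4f}%")
    print(f"Downtime budget at that figure: {allowed_downtime_hours(serial_pct):.1f} h/year")
```

The serial figure is lower than the mean because any single service outage counts as downtime for the whole system, so it may be the more conservative number to plan against.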

Operations At Risk

Risks that can affect migVisor’s operations include natural disasters, cyber-attacks, and the loss of a critical CSP.

Managing these risks involves the following activities:

Determining how those risks will affect operations
Implementing threat modeling to mitigate the risks
Testing procedures to ensure they work
Reviewing the process to make sure that it is up to date
Migrating the infrastructure to an alternative CSP

Service Resiliency
Process	Processing Schedule
Working hours for incident processing	8x7 (GMT+2)
Response time	1 business day
Recovery plan testing frequency	Annually

Recovery Strategy

The recovery strategy is organized in the following order:

Identification of the incident (automatic via monitoring tools, or user reported)
Investigation phase
DRP activation phase
Recovery implementation phase
Return to normal operation

Each activity is assigned to an appropriate team member, who has primary responsibility for completing it.
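
The ordering of the phases listed above can be sketched as a small state machine. This is an illustrative Python sketch only: the names Phase and Incident are hypothetical, and it assumes that phases are never skipped, which the plan does not state explicitly.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Recovery phases, numbered in the order the plan lists them."""
    IDENTIFICATION = 1
    INVESTIGATION = 2
    DRP_ACTIVATION = 3
    RECOVERY_IMPLEMENTATION = 4
    NORMAL_OPERATION = 5


class Incident:
    """Tracks one incident as it moves through the phases in order."""

    def __init__(self, source):
        # The plan names two detection sources: monitoring tools or a user report.
        if source not in ("monitoring", "user"):
            raise ValueError("source must be 'monitoring' or 'user'")
        self.source = source
        self.phase = Phase.IDENTIFICATION
        self.history = [Phase.IDENTIFICATION]

    def advance(self):
        """Move to the next phase; refuse to go past normal operation."""
        if self.phase is Phase.NORMAL_OPERATION:
            raise RuntimeError("incident is already closed")
        self.phase = Phase(self.phase + 1)
        self.history.append(self.phase)
        return self.phase


if __name__ == "__main__":
    incident = Incident("monitoring")
    while incident.phase is not Phase.NORMAL_OPERATION:
        print(incident.advance().name)
```

Recording the history of phases gives the post-incident analysis a simple trail of what happened and in which order.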

Recovery implementations by provider/resource type:

Provider/Resource Type	Recovery Implementation
Elasticsearch	Elasticsearch nodes are located across multiple zones within a region; data in Elasticsearch is replicated across the nodes; persistent volumes maintain storage availability independently of the individual containers
Cloud Storage	Dual-region or multi-region support; service is resilient and not interrupted; data and metadata are stored redundantly across regions; object versioning and cloud backup
Cloud SQL	Database support is regional with high availability; backup location is multi-regional in the United States; Cloud Storage is multi-regional with high availability; point-in-time recovery is configured for protection against accidental deletion or writes
Kubernetes Engine	Kubernetes resources are distributed across multiple zones within a region; persistent volumes maintain storage availability independently of the individual containers; liveness probes restart failed pods; node auto-repair
Pub/Sub	Replication is within a single region; each topic uses three zones to store data; synchronous replication is guaranteed to at least two zones, with best-effort replication to a third zone
Cloud DNS	Uses GCP's global network of anycast name servers to serve clients' DNS zones from redundant locations around the world, providing high availability and lower latency
Cloud Load Balancing	Software-based managed service; distributed across multiple zones in the region

Roles and Responsibilities

This section describes key personnel and their assigned tasks during or after an incident. Each team member has a unique set of responsibilities for successfully completing the BCP for each business function.

Roles and Responsibilities
Role	Area of responsibility
On-duty support engineer	Monitoring suspicious or abnormal activity
	Initial investigation, notification of DR Team
	Fixing minor issues
Support team	Detailed investigation
	Recovery actions
	Restoration of affected systems to normal operation
	Security and DR testing
DR team head	Impact assessment
	DR plan activation; decision on the necessary alternative strategy and recovery methods
	Informing customers and partners about potential harm
	Post-incident analysis and DRP improvement