Business Continuity and Disaster Recovery (BCDR)
Business continuity and disaster recovery (BCDR) is a critical design area that ensures workloads can meet their recovery time objective (RTO) and recovery point objective (RPO) during unplanned events. For Microsoft Fabric landing zones, BCDR capabilities are essential to maintain resilience, minimize downtime, and uphold data integrity and availability.
Design Considerations
When designing BCDR strategies for Fabric application workloads, consider the following factors:
Application and Data Availability Requirements
- Define the RTO and RPO for each workload and align infrastructure and services accordingly.
- Determine whether active-active or active-passive models are required.
- Understand the tolerance for reduced functionality or degraded performance during an outage.
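The per-workload targets above can be captured in a small, reviewable structure. The following is a minimal sketch, not a prescribed format: the `RecoveryObjective` class, the `meets_objective` helper, and the workload names and numbers are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjective:
    """Recovery targets for one workload (all values are illustrative)."""
    workload: str
    rto: timedelta  # longest acceptable time to restore service
    rpo: timedelta  # longest acceptable window of lost data
    model: str      # "active-active" or "active-passive"


def meets_objective(objective, downtime, data_loss):
    """Return True when a measured outage stayed within both targets."""
    return downtime <= objective.rto and data_loss <= objective.rpo


# Hypothetical workloads; the names and numbers are placeholders.
OBJECTIVES = [
    RecoveryObjective("sales-warehouse", timedelta(hours=1), timedelta(minutes=15), "active-active"),
    RecoveryObjective("hr-lakehouse", timedelta(hours=8), timedelta(hours=4), "active-passive"),
]
```

Keeping the targets in one place makes it easier to compare the results of a test failover against what each workload owner agreed to.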
Support for PaaS and SaaS Services
- Evaluate native high availability (HA) and disaster recovery (DR) features of Microsoft Fabric services like Data Warehouses, Lakehouses, Notebooks, Eventstreams, and Pipelines.
- Enable geo-redundancy and cross-region support for critical services.
- Leverage Fabric capabilities such as global OneLake-backed datasets and distributed compute engines.
Availability Zones and Sets
- Use Availability Zones (AZs) where supported for zonal redundancy of supporting services.
- Understand data and service dependencies between zones and the impact on failover consistency.
- For infrastructure components (e.g., Kusto engines or Data Gateway VMs), assess suitability for availability sets vs. zones.
Backup and Restore
- Use native backup features for Fabric items:
  - Automatically versioned datasets (OneLake)
  - Artifact version history (e.g., Notebooks, Pipelines)
  - Externalized logs and metadata (e.g., storing Eventstream offsets)
- Consider integrating Azure Backup or GitOps-based snapshot mechanisms where applicable.
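Whichever backup mechanism is used, the age of the newest restore point bounds the data that could be lost. The sketch below shows one way to flag items whose latest restore point is older than the RPO. It assumes the timestamps have already been collected from whatever backup or versioning source applies; the `stale_items` helper and the item names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_items(last_restore_points, rpo, now=None):
    """Return the items whose newest restore point is older than the RPO.

    last_restore_points maps an item name to a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name
        for name, taken_at in last_restore_points.items()
        if now - taken_at > rpo
    )
```

For example, calling `stale_items({"sales-notebook": some_timestamp}, timedelta(hours=1))` returns the names that need attention, which can then feed an alert or a report.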
Network Considerations
- Plan bandwidth and failover routing for hybrid connectivity using Azure ExpressRoute.
- Ensure redundant regional peering and non-overlapping IP ranges for production and DR networks.
- Design for DNS resilience and traffic redirection strategies.
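The non-overlap requirement for production and DR address spaces can be checked mechanically. This is a minimal sketch using Python's standard `ipaddress` module; the `overlapping_ranges` helper and the network names and CIDR blocks in the usage note are made up for illustration.

```python
import ipaddress
from itertools import combinations


def overlapping_ranges(named_ranges):
    """Return pairs of network names whose CIDR ranges overlap."""
    networks = {
        name: ipaddress.ip_network(cidr) for name, cidr in named_ranges.items()
    }
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        # IPv4 and IPv6 ranges cannot be compared, so skip mixed pairs.
        if net_a.version == net_b.version and net_a.overlaps(net_b)
    ]
```

With hypothetical input such as `{"prod": "10.10.0.0/16", "dr": "10.20.0.0/16", "dr-gateway": "10.10.8.0/24"}`, the function reports the `prod` and `dr-gateway` pair, because the gateway subnet sits inside the production range.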
Planned and Unplanned Failovers
- Plan for failover and failback strategies with:
  - Consistent IP address retention (where applicable)
  - Role-based access control and reauthentication mechanisms
  - Continuity of DevOps pipelines and deployment automation
- Ensure that Azure Key Vault and, if used, Microsoft Purview are geo-redundant or replicated, so that secrets and governance metadata remain available after a failover.
Data Residency and Compliance
- Follow in-country or regional regulations for cross-region replication and storage.
- Refer to Azure cross-region replication documentation.
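A replication plan can be screened against residency rules before it is deployed. The sketch below is illustrative only: the `ALLOWED_REGIONS` mapping is an assumed policy, not an authoritative list, and the `residency_violations` helper and workload names are hypothetical. Which regions are acceptable must come from your own compliance requirements.

```python
# Illustrative policy only: the regions that count as "in geography"
# must be taken from your own regulatory and compliance requirements.
ALLOWED_REGIONS = {
    "eu": {"westeurope", "northeurope", "francecentral", "germanywestcentral"},
    "us": {"eastus", "eastus2", "westus2", "centralus"},
}


def residency_violations(replication_plan, geography):
    """Map each workload to the planned regions that fall outside the geography."""
    allowed = ALLOWED_REGIONS[geography]
    return {
        workload: sorted(set(regions) - allowed)
        for workload, regions in replication_plan.items()
        if set(regions) - allowed
    }
```

An empty result means every planned replica stays inside the chosen geography under the assumed policy.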
Design Recommendations
Use the following practices to implement robust BCDR strategies in Fabric landing zones:
- Employ Azure Site Recovery
  For supporting workloads (e.g., SQL Server in VMs or Data Gateway clusters), replicate across Azure regions. Site Recovery provides low RPO and RTO capabilities.
- Leverage Native Fabric Redundancy
  For core Fabric workloads, replicate business logic and data ingestion pipelines across environments or workspaces. Use Fabric Git integration and DevOps pipelines for deployment repeatability.
- Use Fabric Git Integration
  Store Notebooks, Dataflows, and Pipelines in source control. Automate their deployment across staging, pre-prod, and production.
- Backup with Policy Enforcement
  Use Azure Policy or custom scripts to ensure backup configurations exist across Fabric artifacts. Integrate external monitoring where APIs are available.
- Avoid Overlapping IP Ranges
  Ensure DR networks use distinct IP space to simplify failover networking.
- Design for Multi-Region Deployments
  Where latency allows, build multi-region workspaces for critical analytics scenarios. Use OneLake's cross-region capabilities for data access.
- Document DR Plans and Test Regularly
  Establish clear DR runbooks. Conduct test failovers regularly, including for Fabric-specific artifacts.
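A DR runbook can also be exercised as an ordered list of timed steps, so that each test failover produces a result that is comparable against the RTO. The following is a minimal sketch under that assumption: the `run_drill` helper is hypothetical, and the step actions stand in for whatever scripts or manual checkpoints the real runbook contains.

```python
import time


def run_drill(steps, rto_seconds):
    """Run runbook steps in order and compare the total duration to the RTO.

    steps is a list of (name, action) pairs, where action is a callable
    that performs the step and raises an exception on failure.
    """
    timings = []
    drill_start = time.monotonic()
    for name, action in steps:
        step_start = time.monotonic()
        action()  # an exception here aborts the drill at the failing step
        timings.append((name, time.monotonic() - step_start))
    total = time.monotonic() - drill_start
    return {
        "timings": timings,
        "total_seconds": total,
        "within_rto": total <= rto_seconds,
    }
```

Recording per-step timings shows which part of the failover consumes most of the recovery budget, which helps decide where automation is worth the effort.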