# Disaster recovery

> Strategies for backup, failover, and recovery of GoRules deployments.

GoRules separates rule management (BRMS) from rule execution (Agent/SDK). This decoupled architecture simplifies disaster recovery — your production workloads continue even if the BRMS is temporarily unavailable.

## Recovery priorities

| Component                          | Impact if unavailable                                         | Priority |
| ---------------------------------- | ------------------------------------------------------------- | -------- |
| **Execution layer** (Agent or SDK) | Applications can't evaluate rules                             | High     |
| **Object Storage**                 | Agents can't load new rules (existing rules remain in memory) | High     |
| **BRMS**                           | Can't author or publish new rules                             | Medium   |
| **PostgreSQL**                     | BRMS unavailable                                              | Medium   |

## Management layer (BRMS)

Because the management layer is decoupled from execution, the BRMS does not require maximum availability. A straightforward setup is sufficient:

**Horizontal scaling** — Run at least 2 BRMS replicas behind a load balancer for failover.

**Database HA** — Use managed PostgreSQL with high availability enabled:

* Amazon Aurora PostgreSQL (Multi-AZ)
* Azure Database for PostgreSQL Flexible Server (HA mode)
* Google Cloud SQL for PostgreSQL (high availability)

**Backups** — Configure automated backups with point-in-time recovery. This is standard in modern managed database services and protects against data corruption or accidental deletion.

<Note>
  If BRMS becomes unavailable, rule execution continues uninterrupted. Users cannot author or publish new rules until BRMS is restored, but all existing rules remain operational.
</Note>

## Execution layer

The execution layer requires high availability since it handles live traffic.

### Agent deployment

The Agent's only external dependency is object storage, which cloud providers design for high availability across multiple availability zones.

**Run at least 2 Agent replicas** in production with health checks. Spread replicas across availability zones when possible.

**Agent resilience** — Once rules are loaded into memory, the Agent continues serving requests even if object storage becomes temporarily unavailable. Rules are not ejected automatically on storage failure.
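The "keep serving from memory" behavior can be illustrated with a minimal Python sketch. This is not the Agent's actual implementation; the `LastKnownGoodRules` class and the `fetch` callable are hypothetical names used only to show the pattern of retaining the last successfully loaded rules when a refresh fails.

```python
from typing import Callable, Dict, Optional


class LastKnownGoodRules:
    """Keeps the most recently loaded rules in memory across storage outages."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        # `fetch` is a hypothetical loader, e.g. one that reads a release from a bucket.
        self._fetch = fetch
        self._rules: Optional[Dict[str, str]] = None

    def refresh(self) -> bool:
        """Try to load newer rules; on failure keep whatever is already in memory."""
        try:
            self._rules = self._fetch()
            return True
        except Exception:
            # Storage is unreachable: do not evict the rules we already hold.
            return False

    def rules(self) -> Dict[str, str]:
        """Return the rules currently held in memory."""
        if self._rules is None:
            raise RuntimeError("no rules loaded yet")
        return self._rules
```

A failed `refresh()` returns `False` and leaves the previous rules in place, so callers of `rules()` are unaffected by a temporary storage outage. Only a process that has never loaded rules has nothing to serve.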

### SDK deployment

With the SDK and bundled rules, there are no external dependencies: your application is self-contained. High availability depends entirely on how you deploy your service, so scale horizontally and apply the standard HA patterns for your platform.

## Cross-region availability

For extreme availability requirements, you can deploy across multiple regions without running BRMS in every region.

```mermaid theme={null}
flowchart TB
    subgraph primary[Primary Region]
        brms[BRMS] -- Publish --> storageA[Object Storage]
        agentA[Agent] -- Poll --> storageA
    end

    subgraph secondary[Secondary Region]
        storageB[Object Storage]
        agentB[Agent] -- Poll --> storageB
    end

    storageA -.-> storageB
```

**How it works:**

1. BRMS runs in a single region and publishes releases to object storage
2. Object storage replicates to a secondary region (using native cloud replication)
3. Agents in each region poll their local storage bucket
4. When BRMS publishes a release, both regions receive the same rules automatically once replication completes

This approach provides regional failover for rule execution while keeping the management layer simple.
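The publish, replicate, and poll flow above can be modeled with a small in-memory Python sketch. Everything here is an illustrative stand-in: `Bucket`, `publish`, `replicate`, and `Agent` are hypothetical names, and `replicate` only imitates what native cloud replication does between real buckets.

```python
from typing import Dict, Optional


class Bucket:
    """In-memory stand-in for an object storage bucket (hypothetical)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}


def publish(primary: Bucket, key: str, release: bytes) -> None:
    """Step 1: BRMS writes a release to the primary bucket."""
    primary.objects[key] = release


def replicate(source: Bucket, target: Bucket) -> None:
    """Step 2: stand-in for native replication; copies new or changed objects."""
    for key, value in source.objects.items():
        if target.objects.get(key) != value:
            target.objects[key] = value


class Agent:
    """Step 3: each Agent polls only the bucket in its own region."""

    def __init__(self, local: Bucket) -> None:
        self.local = local
        self.loaded: Optional[bytes] = None

    def poll(self, key: str) -> None:
        # Load the release if it is present locally; otherwise keep what we have.
        if key in self.local.objects:
            self.loaded = self.local.objects[key]
```

In this model, an Agent in the secondary region that polls between `publish` and `replicate` sees nothing new and keeps its current rules. It picks up the release on the first poll after replication has run.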

## Recovery procedures

### Agent failure

Traffic automatically routes to healthy replicas via your load balancer. Replace failed instances and investigate the root cause using logs.

### Storage failure

Agents continue serving with rules already in memory. If using cross-region replication, update Agent configuration to use the replica bucket. Restore primary storage when possible.
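If your own code loads releases (for example, around the SDK), the switch to the replica bucket could also be expressed as an ordered fallback. The Python sketch below is an assumption-laden illustration, not a documented Agent feature: `load_from_first_available`, the `read_release` callable, and the bucket names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def load_from_first_available(
    buckets: Sequence[str],
    read_release: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Return the release from the first bucket that can be read, in order."""
    errors: List[str] = []
    for bucket in buckets:
        try:
            return read_release(bucket)
        except Exception as exc:
            # Treat any read error as "this bucket is unavailable" and move on.
            errors.append(f"{bucket}: {exc}")
    raise RuntimeError("no bucket reachable: " + "; ".join(errors))
```

Listing the primary bucket first keeps normal reads on primary storage, and the replica is consulted only when the primary read fails.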

### BRMS failure

Rule execution continues unaffected. Redeploy BRMS containers and verify database connectivity. Users can resume authoring once restored.

## Recovery objectives

| Metric                             | Typical target                                              |
| ---------------------------------- | ----------------------------------------------------------- |
| **RTO** (Recovery Time Objective)  | Near-zero with multi-replica Agents                         |
| **RPO** (Recovery Point Objective) | Near-zero with storage replication (typically asynchronous) |

With replicated storage and multi-replica Agents, most failures are handled automatically without downtime.
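To see why a second replica matters, here is a back-of-the-envelope Python sketch. It assumes replicas fail independently, which is a simplification (a shared dependency or a zone-wide outage breaks that assumption), and the function name `combined_availability` is purely illustrative.

```python
def combined_availability(per_replica: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent replicas is up."""
    if not 0.0 <= per_replica <= 1.0:
        raise ValueError("per_replica must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # All replicas are down with probability (1 - a)^n; the complement is "at least one up".
    return 1.0 - (1.0 - per_replica) ** replicas
```

Under the independence assumption, two replicas that are each available 99% of the time yield roughly 99.99% combined availability, since both are down together only about 0.01% of the time.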
