# edited by glg
# P2 Runbook: DR & Fleet Operations

## 1) Automated Backup Policy
### Objective
- Guarantee the availability of consistent, checksum-validated backups.

### Daily Procedure
1. Run:
```bash
python scripts/ops/run_backup_policy.py --tag daily --keep-last 14 --max-age-days 30
```
2. Confirm that `manifest.json` was created in the most recent backup folder.
3. Validate:
   - the number of files copied
   - the `sha256` hashes
   - that there are no critical `missing_files`.
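
The validation step can be scripted. The sketch below is one possible approach and not the logic of `run_backup_policy.py`: it assumes a hypothetical manifest layout with a `files` list of `{"path", "sha256"}` entries and a top-level `missing_files` list. Adjust the key names to whatever the real manifest contains.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(backup_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means the backup passed."""
    # ASSUMED schema: {"files": [{"path": ..., "sha256": ...}], "missing_files": [...]}
    manifest = json.loads((backup_dir / "manifest.json").read_text(encoding="utf-8"))
    problems = []
    for entry in manifest.get("files", []):
        target = backup_dir / entry["path"]
        if not target.is_file():
            problems.append(f"not copied: {entry['path']}")
        elif sha256_of(target) != entry["sha256"]:
            problems.append(f"sha256 mismatch: {entry['path']}")
    for missing in manifest.get("missing_files", []):
        problems.append(f"missing at source: {missing}")
    return problems
```

Calling `verify_manifest(Path("ops_backups/<latest>"))` and treating any non-empty result as a failed daily check keeps the three validation points in one place.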

### Weekly Recovery Drill
1. Take the most recent backup.
2. Restore it to a sandbox environment.
3. Run the application smoke test plus the minimum quality gate.
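
If the restored artifact is a SQLite database, one small piece of the smoke test could be an integrity check on the restored file. This is a minimal sketch under that assumption; the function name is hypothetical and the real smoke test and quality gate are not described in this runbook.

```python
import sqlite3
from pathlib import Path


def sqlite_smoke_test(db_path: Path) -> bool:
    """Return True when the restored database opens and passes integrity_check."""
    if not db_path.is_file():
        return False
    # Open read-only so the drill can never modify the restored copy.
    conn = sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError:
        # Raised when the file is corrupt or is not a SQLite database at all.
        return False
    finally:
        conn.close()
    # A healthy database reports a single row containing "ok".
    return rows == [("ok",)]
```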

## 2) Remote Config Rollout
### Objective
- Limit the blast radius of config changes rolled out across branches.

### Build Plan
```bash
python scripts/ops/build_fleet_rollout_plan.py \
  --branches-file branches.json \
  --config-version cfg-2026.04.02 \
  --canary-percent 1 \
  --waves 5,20,50,100 \
  --output rollout_plan.json
```
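
To make the flags concrete, here is a rough sketch of how a canary percentage and a wave list could partition a fleet. It assumes the wave values are cumulative percentages of all branches, which this runbook does not state, and it is not the implementation of `build_fleet_rollout_plan.py`. The `wave_<percent>` names are hypothetical.

```python
import math


def build_waves(branches: list[str], canary_percent: float, waves: list[int]) -> dict[str, list[str]]:
    """Split branches into a canary group plus cumulative-percentage waves."""
    ordered = sorted(branches)  # deterministic order so reruns give the same plan
    plan: dict[str, list[str]] = {}
    assigned = 0
    stages = [("canary", canary_percent)] + [(f"wave_{p}", p) for p in waves]
    for name, percent in stages:
        # Cumulative target, rounded up so even a tiny fleet gets one canary branch.
        target = min(len(ordered), math.ceil(len(ordered) * percent / 100))
        target = max(target, assigned)  # never shrink below what is already rolled out
        plan[name] = ordered[assigned:target]
        assigned = target
    return plan
```

Under these assumptions a 200-branch fleet with `--canary-percent 1 --waves 5,20,50,100` yields groups of 2, 8, 30, 60, and 100 branches, with each branch appearing in exactly one group.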

### Execution
1. Apply the canary.
2. Monitor for 15-30 minutes:
   - error rate
   - p95 latency
   - retry spikes.
3. If healthy, proceed to the next wave.
4. If unhealthy, stop the wave and roll back.

## 3) Canary + Staged Deployment
### Health Gate Minimum
- Error rate <= 2%.
- p95 latency <= 2000 ms.
- No increase in critical backlog.

### Decision
- `continue`: all gates pass.
- `halt_and_rollback`: one or more gates fail.
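
The gate and decision rules above can be expressed as a small function. This is an illustrative sketch only: the `WaveMetrics` field names are hypothetical and do not necessarily match the layout of `wave_metrics.json`, and the backlog gate is modeled here as a delta against a pre-wave baseline.

```python
from dataclasses import dataclass


@dataclass
class WaveMetrics:
    error_rate_pct: float        # percentage of failed requests during the wave
    p95_latency_ms: float        # 95th percentile latency in milliseconds
    critical_backlog_delta: int  # change in critical backlog vs. pre-wave baseline (assumed metric)


def evaluate_gate(m: WaveMetrics) -> str:
    """Return "continue" only when every minimum health gate passes."""
    gates = [
        m.error_rate_pct <= 2.0,        # Error rate <= 2%
        m.p95_latency_ms <= 2000,       # p95 latency <= 2000 ms
        m.critical_backlog_delta <= 0,  # no increase in critical backlog
    ]
    return "continue" if all(gates) else "halt_and_rollback"
```

Because the thresholds are inclusive, a wave sitting exactly at 2% errors and 2000 ms still continues. A single failing gate is enough to halt.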

## 4) Automated Production Pipeline
### Objective
- Combine DR backup, restore drill, reconciliation dedup, and fleet gate into a single automated run.

### Main Command
```bash
python scripts/ops/run_dr_fleet_pipeline.py \
  --backup-root ./ops_backups \
  --restore-root ./ops_restore_drill \
  --db-path ./db/beta_sb_pos_sqlite.db \
  --tag scheduled_ops \
  --run-reconciliation 1 \
  --run-retention 1 \
  --fail-on-dead 1 \
  --fleet-plan-file ./rollout_plan.json \
  --fleet-wave-name canary \
  --fleet-metrics-file ./wave_metrics.json \
  --output ./ops_pipeline_report.json
```

### Artifacts That Must Be Checked
- `backup.manifest_path` is valid and `checks.backup_ok=true`
- `restore_drill.ok=true`
- `checks.reconciliation_ok=true` (in particular `outbox_dead=0` when `fail_on_dead=1`)
- `checks.fleet_gate_ok=true` with `decision=continue`
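
These checks can be applied to `ops_pipeline_report.json` mechanically. The sketch below reads the dotted names listed above as nested JSON keys. Where `decision` sits in the report is not stated here, so the `fleet_gate.decision` path is an assumption, and "valid" for `manifest_path` is reduced to "present and non-empty".

```python
REQUIRED_TRUE = [
    "checks.backup_ok",
    "restore_drill.ok",
    "checks.reconciliation_ok",
    "checks.fleet_gate_ok",
]


def lookup(report: dict, dotted: str):
    """Walk a dotted path through nested dicts; return None if any key is absent."""
    node = report
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def failed_checks(report: dict) -> list[str]:
    """Return the report fields that do not meet the runbook's requirements."""
    failures = [path for path in REQUIRED_TRUE if lookup(report, path) is not True]
    if not lookup(report, "backup.manifest_path"):
        failures.append("backup.manifest_path")
    # ASSUMPTION: the decision is nested under "fleet_gate"; adjust to the real report.
    if lookup(report, "fleet_gate.decision") != "continue":
        failures.append("fleet_gate.decision")
    return failures
```

Loading the report with `json.load` and failing the job whenever `failed_checks` returns a non-empty list gives a single pass/fail signal for the run.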

### Scheduled CI Automation
- Workflow: `.github/workflows/ops-dr-fleet-pipeline.yml`
- Triggers:
  - weekly `schedule`
  - manual `workflow_dispatch`

## 5) Incident Response
### Trigger
- Backlog exceeds the critical threshold.
- Error burst across branches.
- Ingest SLA drops below target.

### Actions
1. Enable throttle/backpressure mode.
2. Suspend the rollout wave.
3. Prioritize recovery of critical branches.
4. Hold a postmortem 24 hours after the system stabilizes.
