Health check
Returns {"status": "healthy"} when the control plane is running.
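A monitoring script could interpret that response along the following lines. This is only a minimal sketch: the helper name is_healthy is hypothetical, and fetching the body (host, port, path) is left to the caller because the endpoint itself is not shown here.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only if body is a JSON object whose status is "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # Anything that is not valid JSON is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"
```

Treating unparseable or unexpected bodies as unhealthy is a design choice of this sketch, so that a proxy error page is never mistaken for a healthy control plane.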
System status
Prometheus metrics
The control plane exposes a Prometheus-compatible metrics endpoint.
Scrape config
Host status
Monitor host health across the cluster:
| Status | Meaning |
|---|---|
| online | Healthy, receiving heartbeats |
| offline | No heartbeat for 30+ seconds, machines marked failed |
| draining | Migrating VMs off the host |
| maintenance | Drain complete, host deactivated |
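The online/offline rule in the table can be sketched as a small classifier. The names HostStatus, OFFLINE_AFTER and heartbeat_status are hypothetical, and the sketch assumes "30+ seconds" means an age of 30 seconds or more. How draining and maintenance interact with the heartbeat check is not described here, so those states are only listed, not derived.

```python
from datetime import datetime, timedelta
from enum import Enum

# Threshold taken from the status table; treated as inclusive (assumption).
OFFLINE_AFTER = timedelta(seconds=30)


class HostStatus(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    DRAINING = "draining"
    MAINTENANCE = "maintenance"


def heartbeat_status(last_heartbeat: datetime, now: datetime) -> HostStatus:
    """Classify a host as online or offline from heartbeat age alone."""
    if now - last_heartbeat >= OFFLINE_AFTER:
        return HostStatus.OFFLINE
    return HostStatus.ONLINE
```

Passing now explicitly, rather than reading the clock inside the function, keeps the rule easy to test.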
Background services
These services run inside zwrmd and are monitored via logs:
| Service | Interval | Description |
|---|---|---|
| Local host heartbeat | 10s | Updates last_heartbeat in database |
| Host health monitor | 15s | Marks hosts offline after 30s, fails their machines |
| VM cleanup | 30s | Detects orphaned Firecracker processes, schedules restarts |
| Restart manager | on demand | Processes restart queue with exponential backoff |
| Cache cleanup | periodic | Evicts stale build cache entries |
| Proxy health checker | 10s | Checks backend health, rotates unhealthy VMs out |
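To illustrate what exponential backoff means for the restart manager, the delay before each retry grows geometrically up to a ceiling. The sketch below is not zwrmd's implementation: the function name and the base, factor and cap values are all assumptions, since the actual parameters are not documented here, and it omits jitter.

```python
def backoff_delays(
    attempts: int,
    base: float = 1.0,    # assumed first delay, in seconds
    factor: float = 2.0,  # assumed growth factor per attempt
    cap: float = 60.0,    # assumed upper bound, in seconds
) -> list[float]:
    """Return the delay in seconds before each of the first `attempts` restarts."""
    return [min(cap, base * factor ** n) for n in range(attempts)]
```

With these assumed defaults, the delays double from 1 second and stay at 60 seconds once the cap is reached.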
Logs
Control plane
Requests are logged as METHOD PATH STATUS DURATION BYTES.
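A line in the METHOD PATH STATUS DURATION BYTES layout could be split into fields roughly as follows. This sketch assumes the five fields are separated by whitespace with no extra prefix such as a timestamp, which is not confirmed here. RequestLog and parse_request_log are hypothetical names, and the duration is kept as text because its unit and format are not specified.

```python
from typing import NamedTuple, Optional


class RequestLog(NamedTuple):
    method: str
    path: str
    status: int
    duration: str  # left unparsed: unit and format are not documented
    size_bytes: int


def parse_request_log(line: str) -> Optional[RequestLog]:
    """Parse one request log line, or return None if it does not match."""
    fields = line.split()
    if len(fields) != 5:
        return None
    method, path, status, duration, size = fields
    try:
        return RequestLog(method, path, int(status), duration, int(size))
    except ValueError:
        # STATUS or BYTES was not an integer.
        return None
```

Returning None for non-matching lines lets a caller skip unrelated output that may be interleaved in the same log stream.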