Monitoring - ZWRM

Health check

curl http://localhost:8080/v1/health

Returns {"status": "healthy"} when the control plane is running.
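For scripted checks, the response body can be validated rather than relying on the HTTP status alone. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the endpoint and response shape shown above.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:8080/v1/health"  # endpoint from this page


def is_healthy(body: str) -> bool:
    """Return True only if body is a JSON object with status == "healthy"."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Fetch the health endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.read().decode("utf-8", errors="replace"))
    except OSError:  # covers URLError, HTTPError, timeouts, refused connections
        return False
```

Calling `check_health()` from a cron job or a systemd timer gives a simple liveness probe that treats an unreachable control plane the same as an unhealthy one.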

curl -H "Authorization: Bearer <token>" http://localhost:8080/v1/status

Returns system-wide statistics scoped to the user’s organization: app count, machine count, database count, and host status.
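The same authenticated call can be made from a script. This sketch assumes the token is supplied by the caller and that the response is JSON; the function names are hypothetical, and the field names in the response are not documented here, so the result is returned as a plain dictionary.

```python
import json
import urllib.request


def build_status_request(token: str,
                         base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a GET request for /v1/status carrying the bearer token."""
    return urllib.request.Request(
        f"{base_url}/v1/status",
        headers={"Authorization": f"Bearer {token}"},
    )


def fetch_status(token: str, timeout: float = 5.0) -> dict:
    """Call the status endpoint and decode the JSON body (assumed to be an object)."""
    with urllib.request.urlopen(build_status_request(token), timeout=timeout) as resp:
        return json.load(resp)
```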

The control plane exposes a Prometheus-compatible metrics endpoint:

curl http://localhost:8080/metrics

Metrics include HTTP request duration and count (by method, path, status), VM count, and host resource usage. To scrape them with Prometheus, add a job such as:

scrape_configs:
  - job_name: "zwrmd"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: "/metrics"
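Without a Prometheus server, the `/metrics` output can still be inspected directly, since it uses the Prometheus text exposition format. The parser below is a simplified sketch: it skips comment lines, ignores optional timestamps, and does not unescape label values. The metric names in the usage note are hypothetical, because this page does not list the actual names.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Parse Prometheus text format into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples
```

For example, a line such as `http_requests_total{method="GET",status="200"} 12` (a hypothetical metric name) would be returned as `("http_requests_total", {"method": "GET", "status": "200"}, 12.0)`.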

Monitor host health across the cluster:

zwrm host list

| Status | Meaning |
| --- | --- |
| `online` | Healthy, receiving heartbeats |
| `offline` | No heartbeat for 30+ seconds, machines marked failed |
| `draining` | Migrating VMs off the host |
| `maintenance` | Drain complete, host deactivated |
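The `offline` rule in the table amounts to comparing the current time with a host's last heartbeat. The sketch below illustrates that rule under the assumption that the heartbeat is available as a timezone-aware timestamp; the function name is hypothetical and this is not zwrmd's actual implementation. The `draining` and `maintenance` states cannot be derived from heartbeats alone, so they are left out.

```python
from datetime import datetime, timedelta, timezone

OFFLINE_AFTER = timedelta(seconds=30)  # threshold from the status table


def heartbeat_status(last_heartbeat: datetime, now: datetime) -> str:
    """Classify a host as online or offline from the age of its last heartbeat."""
    if now - last_heartbeat >= OFFLINE_AFTER:
        return "offline"
    return "online"
```

With a fixed `now`, a heartbeat 10 seconds old yields `"online"` and one 30 seconds old yields `"offline"`.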

The following background services run inside zwrmd and are monitored via its logs:

| Service | Interval | Description |
| --- | --- | --- |
| Local host heartbeat | 10s | Updates `last_heartbeat` in database |
| Host health monitor | 15s | Marks hosts offline after 30s, fails their machines |
| VM cleanup | 30s | Detects orphaned Firecracker processes, schedules restarts |
| Restart manager | on demand | Processes restart queue with exponential backoff |
| Cache cleanup | periodic | Evicts stale build cache entries |
| Proxy health checker | 10s | Checks backend health, rotates unhealthy VMs out |
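The restart manager's exponential backoff means that successive restart attempts wait progressively longer. The sketch below shows the general technique only: the base delay, the cap, and the function name are assumptions, since the values zwrmd actually uses are not documented here.

```python
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before restart attempt number `attempt` (0-based).

    Doubles the delay on every attempt and never exceeds `cap`.
    The base and cap defaults are illustrative assumptions.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so the power stays small; for realistic base
    # values the cap takes over long before this limit is reached.
    exponent = min(attempt, 64)
    return min(cap, base * (2.0 ** exponent))
```

With these defaults, attempts 0 through 7 wait 1, 2, 4, 8, 16, 32, 60, and 60 seconds.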

# Follow control plane (zwrmd) logs live
sudo journalctl -u zwrmd -f
# Show logs from the last hour
sudo journalctl -u zwrmd --since "1 hour ago"

The control plane logs every request as METHOD PATH STATUS DURATION BYTES.
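Because the request log has a fixed field order, it can be split into structured records for ad hoc analysis. This sketch assumes the five fields are whitespace-separated and are the last tokens on each journal line; the exact duration format is not documented here, so it is kept as a raw string. The type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class RequestLog(NamedTuple):
    method: str
    path: str
    status: int
    duration: str  # kept as text; the exact format is an unknown
    size: int      # response size in bytes


def parse_request_line(line: str) -> Optional[RequestLog]:
    """Parse 'METHOD PATH STATUS DURATION BYTES' from the end of a log line."""
    parts = line.split()
    if len(parts) < 5:
        return None
    # Take the last five tokens so a journalctl prefix is ignored.
    method, path, status, duration, size = parts[-5:]
    if not (status.isdecimal() and size.isdecimal()):
        return None  # not a request line
    return RequestLog(method, path, int(status), duration, int(size))
```

Piping `journalctl -u zwrmd -o cat` output through `parse_request_line` and filtering on `status >= 500` is one way to surface failing requests.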

# Follow agent (zwrm-agent) logs live
sudo journalctl -u zwrm-agent -f

# Via CLI
zwrm logs --app <name>

# Via API
curl -H "Authorization: Bearer <token>" \
  "http://localhost:8080/v1/machines/<id>/logs?lines=100"