Monitoring
Real-time cluster health monitoring via SSH. The system has two parts:
- `orama node report` — Runs on each VPS node, collects all local health data, outputs JSON
- `orama monitor` — Runs on your local machine, SSHes into nodes, aggregates results, displays via TUI or tables
Architecture
```
Developer Machine                 VPS Nodes (via SSH)
┌──────────────────┐              ┌────────────────────┐
│ orama monitor    │ ──SSH──────> │ orama node report  │
│ (TUI / tables)   │ <──JSON───── │ (local collector)  │
│                  │              └────────────────────┘
│ CollectOnce()    │ ──SSH──────> │ orama node report  │
│ DeriveAlerts()   │ <──JSON───── │ (local collector)  │
│ Render()         │              └────────────────────┘
└──────────────────┘
```
Each node runs orama node report --json locally (no SSH to other nodes), collecting data via os/exec and net/http to localhost services. The monitor SSHes into all nodes in parallel, collects reports, then runs cross-node analysis to detect cluster-wide issues.
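The fan-out can be sketched in Python (a hypothetical illustration, not the actual Go implementation; the `run` parameter exists only so the SSH call can be stubbed out):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One SSH call per node; each node runs the collector locally.
REPORT_CMD = ["sudo", "orama", "node", "report", "--json"]

def collect_node(host, run=subprocess.run):
    """Run the collector on one node over SSH and parse its JSON output."""
    proc = run(["ssh", host, *REPORT_CMD], capture_output=True, text=True, timeout=10)
    if proc.returncode != 0:
        return {"host": host, "status": "unreachable", "report": None}
    return {"host": host, "status": "collected", "report": json.loads(proc.stdout)}

def collect_once(hosts):
    """Query every node concurrently, keeping one result entry per node."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(collect_node, hosts))
```

A node that fails SSH collection is recorded as unreachable rather than aborting the whole run, which is what lets the monitor raise a critical alert for it later.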
Quick Start
```bash
# Interactive TUI (auto-refreshes every 30s)
orama monitor --env testnet

# Cluster overview table
orama monitor cluster --env testnet

# Alerts only
orama monitor alerts --env testnet

# Full JSON report (pipe to jq or feed to an LLM)
orama monitor report --env testnet
```
orama monitor — Local Orchestrator
Usage
```bash
orama monitor [subcommand] --env <environment> [flags]
```
Without a subcommand, launches the interactive TUI.
Global Flags
| Flag | Default | Description |
|---|---|---|
| `--env` | (required) | Environment: `devnet`, `testnet`, `mainnet` |
| `--json` | `false` | Machine-readable JSON output (for one-shot subcommands) |
| `--node` | | Filter to a specific node host/IP |
| `--config` | `scripts/remote-nodes.conf` | Path to node configuration file |
Subcommands
| Subcommand | Description |
|---|---|
| `live` | Interactive TUI monitor (default when no subcommand) |
| `cluster` | Cluster overview: all nodes, roles, RQLite state, WG peers |
| `node` | Per-node health details (system, services, WG, DNS) |
| `service` | Service status matrix across all nodes |
| `mesh` | WireGuard mesh connectivity and peer details |
| `dns` | DNS health: CoreDNS, Caddy, TLS cert expiry, resolution |
| `namespaces` | Namespace health across nodes |
| `alerts` | Active alerts and warnings sorted by severity |
| `report` | Full JSON dump optimized for LLM consumption |
Examples
```bash
# Cluster overview
orama monitor cluster --env testnet

# Cluster overview as JSON
orama monitor cluster --env testnet --json

# Alerts for all nodes
orama monitor alerts --env testnet

# Single-node deep dive
orama monitor node --env testnet --node 51.195.109.238

# Services for one node
orama monitor service --env testnet --node 51.195.109.238

# WireGuard mesh details
orama monitor mesh --env testnet

# DNS health
orama monitor dns --env testnet

# Namespace health
orama monitor namespaces --env testnet

# Full report for LLM analysis
orama monitor report --env testnet | jq .

# Single-node report
orama monitor report --env testnet --node 51.195.109.238

# Custom config file
orama monitor cluster --config /path/to/nodes.conf --env devnet
```
Interactive TUI
The live subcommand (default) launches a full-screen terminal UI:
Tabs: Overview | Nodes | Services | WG Mesh | DNS | Namespaces | Alerts
Key Bindings:
| Key | Action |
|---|---|
| `Tab` / `Shift+Tab` | Switch tabs |
| `j` / `k` or `↑` / `↓` | Scroll content |
| `r` | Force refresh |
| `q` / `Ctrl+C` | Quit |
The TUI auto-refreshes every 30 seconds. A spinner shows during data collection. Colors indicate health: green = healthy, red = critical, yellow = warning.
LLM Report Format
orama monitor report outputs structured JSON designed for AI consumption:
```json
{
  "meta": {
    "environment": "testnet",
    "collected_at": "2026-02-16T12:00:00Z",
    "duration_seconds": 3.2,
    "node_count": 3,
    "healthy_count": 3
  },
  "summary": {
    "rqlite_leader": "10.0.0.1",
    "rqlite_voters": "3/3",
    "rqlite_raft_term": 42,
    "wg_mesh_status": "all connected",
    "service_health": "all nominal",
    "critical_alerts": 0,
    "warning_alerts": 1,
    "info_alerts": 0
  },
  "alerts": [...],
  "nodes": [
    {
      "host": "51.195.109.238",
      "status": "healthy",
      "collection_ms": 526,
      "report": { ... }
    }
  ]
}
```
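As a sketch of how a downstream consumer might reduce this payload to a triage line (field names are taken from the example above; `triage_line` itself is a hypothetical helper, not part of the CLI):

```python
def triage_line(report):
    """Collapse a `orama monitor report` payload into one summary line."""
    meta, summary = report["meta"], report["summary"]
    health = f'{meta["healthy_count"]}/{meta["node_count"]} nodes healthy'
    alerts = f'{summary["critical_alerts"]} critical, {summary["warning_alerts"]} warning'
    return f'{meta["environment"]}: {health}, leader {summary["rqlite_leader"]}, {alerts}'
```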
orama node report — VPS-Side Collector
Runs locally on a VPS node. Collects all system and service data in parallel and outputs a single JSON blob. Requires root privileges.
Usage
```bash
# On a VPS node
sudo orama node report --json
```
What It Collects
| Section | Data |
|---|---|
| system | CPU count, load average, memory/disk/swap usage, OOM kills, kernel version, uptime, clock time |
| services | Systemd service states (active, restarts, memory, CPU, restart loop detection) for 10 core services |
| rqlite | Raft state, leader, term, applied/commit index, peers, strong read test, readyz, debug vars |
| olric | Service state, memberlist, member count, restarts, memory, log analysis |
| ipfs | Daemon/cluster state, swarm/cluster peers, repo size, versions, swarm key |
| gateway | HTTP health check, subsystem status |
| wireguard | Interface state, WG IP, peers, handshake ages, MTU, config permissions |
| dns | CoreDNS/Caddy state, port bindings, resolution tests, TLS cert expiry |
| anyone | Relay/client state, bootstrap progress, fingerprint |
| network | Internet reachability, TCP stats, retransmission rate, listening ports, UFW rules |
| processes | Zombie count, orphan orama processes, panic/fatal count in logs |
| namespaces | Per-namespace service probes (RQLite, Olric, Gateway) |
Performance
All 12 collectors run in parallel with goroutines. Typical collection time is < 1 second per node. HTTP timeouts are 3 seconds, command timeouts are 4 seconds.
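The pattern is roughly the following (a Python sketch of the Go fan-out, not the actual implementation; the 4-second default matches the command timeout stated above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_collectors(collectors, timeout=4.0):
    """Run independent collectors in parallel; record per-collector errors
    instead of failing the whole report."""
    report, errors = {}, []
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, fut in futures.items():
            try:
                report[name] = fut.result(timeout=timeout)
            except Exception as exc:
                errors.append(f"{name}: {exc}")
    return report, errors
```

Keeping errors out-of-band is what lets a node with, say, a hung RQLite probe still return the other eleven sections, matching the `errors` array in the schema below.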
Output Schema
```json
{
  "timestamp": "2026-02-16T12:00:00Z",
  "hostname": "ns1",
  "version": "0.107.0",
  "collect_ms": 526,
  "errors": [],
  "system": { "cpu_count": 4, "load_avg_1": 0.1, "mem_total_mb": 7937, ... },
  "services": { "services": [...], "failed_units": [] },
  "rqlite": { "responsive": true, "raft_state": "Leader", "term": 42, ... },
  "olric": { "service_active": true, "memberlist_up": true, ... },
  "ipfs": { "daemon_active": true, "swarm_peers": 2, ... },
  "gateway": { "responsive": true, "http_status": 200, ... },
  "wireguard": { "interface_up": true, "wg_ip": "10.0.0.1", "peers": [...], ... },
  "dns": { "coredns_active": true, "caddy_active": true, "base_tls_days_left": 88, ... },
  "anyone": { "relay_active": true, "bootstrapped": true, ... },
  "network": { "internet_reachable": true, "ufw_active": true, ... },
  "processes": { "zombie_count": 0, "orphan_count": 0, "panic_count": 0, ... },
  "namespaces": []
}
```
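A consumer can sanity-check a payload before trusting it; this hypothetical helper just verifies the twelve top-level sections listed in the schema above are present:

```python
# Section names taken from the output schema above.
EXPECTED_SECTIONS = {
    "system", "services", "rqlite", "olric", "ipfs", "gateway",
    "wireguard", "dns", "anyone", "network", "processes", "namespaces",
}

def missing_sections(report):
    """Return the expected top-level sections absent from a report dict."""
    return sorted(EXPECTED_SECTIONS - report.keys())
```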
Alert Detection
Alerts are derived from cross-node analysis of all collected reports. Each alert has a severity level and identifies the affected subsystem and node.
Alert Severities
| Severity | Examples |
|---|---|
| critical | SSH collection failed (node unreachable), no RQLite leader, split brain, RQLite unresponsive, WireGuard interface down, WG peer never handshaked, OOM kills, service failed, UFW inactive |
| warning | Strong read failed, memory > 90%, disk > 85%, stale WG handshake (> 3min), Raft term inconsistency, applied index lag > 100, restart loop detected, TLS cert < 14 days, DNS down, namespace gateway down, Anyone not bootstrapped, clock skew > 5s, binary version mismatch, internet unreachable, high TCP retransmission |
| info | Zombie processes, orphan orama processes, swap usage > 30% |
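A few of the per-node system thresholds above can be sketched like this (the percentage field names are assumptions for illustration; the real report exposes raw values the monitor derives percentages from):

```python
def system_alerts(sys):
    """Derive (severity, message) alerts from one node's system section."""
    alerts = []
    if sys.get("oom_kills", 0) > 0:
        alerts.append(("critical", "OOM kills detected"))
    if sys.get("mem_used_pct", 0) > 90:
        alerts.append(("warning", "memory > 90%"))
    if sys.get("disk_used_pct", 0) > 85:
        alerts.append(("warning", "disk > 85%"))
    if sys.get("swap_used_pct", 0) > 30:
        alerts.append(("info", "swap usage > 30%"))
    return alerts
```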
Cross-Node Checks
These checks compare data across all nodes:
- RQLite Leader: Exactly one leader exists (no split brain)
- Leader Agreement: All nodes agree on the same leader address
- Raft Term Consistency: Term values within 1 of each other
- Applied Index Lag: Followers within 100 entries of the leader
- WireGuard Peer Symmetry: Each node has N-1 peers
- Clock Skew: Node clocks within 5 seconds of each other
- Binary Version: All nodes running the same version
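Three of these cross-node checks might look like the following sketch, applied to the per-node `rqlite` sections (field names assumed from the collector schema):

```python
def cross_node_rqlite_alerts(nodes):
    """Check leader count and Raft term spread across all node reports."""
    alerts = []
    leaders = [n["host"] for n in nodes if n["raft_state"] == "Leader"]
    if len(leaders) == 0:
        alerts.append(("critical", "no RQLite leader"))
    elif len(leaders) > 1:
        alerts.append(("critical", f"split brain: multiple leaders {leaders}"))
    terms = [n["term"] for n in nodes]
    if terms and max(terms) - min(terms) > 1:
        alerts.append(("warning", "Raft term inconsistency"))
    return alerts
```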
Per-Node Checks
- RQLite: Responsive, ready, strong read
- WireGuard: Interface up, handshake freshness
- System: Memory, disk, load, OOM kills, swap
- Services: Systemd state, restart loops
- DNS: CoreDNS/Caddy up, TLS cert expiry, SOA resolution
- Anyone: Bootstrap progress
- Processes: Zombies, orphans, panics in logs
- Namespaces: Gateway and RQLite per namespace
- Network: UFW, internet reachability, TCP retransmission
Monitor vs Inspector
Both tools check cluster health, but they serve different purposes:
| | `orama monitor` | `orama inspect` |
|---|---|---|
| Data source | `orama node report --json` (single SSH call per node) | 15+ SSH commands per node per subsystem |
| Speed | ~3-5s for full cluster | ~4-10s for full cluster |
| Output | TUI, tables, JSON | Tables, JSON |
| Focus | Real-time monitoring, alert detection | Deep diagnostic checks with pass/fail/warn |
| AI support | `report` subcommand for LLM input | `--ai` flag for inline analysis |
| Use case | "Is anything wrong right now?" | "What exactly is wrong and why?" |
Use monitor for day-to-day health checks and the interactive TUI. Use inspect for deep diagnostics when something is already known to be broken.
Configuration
Uses the same scripts/remote-nodes.conf as the inspector. See INSPECTOR.md for format details.
Prerequisites
Nodes must have the orama CLI installed (via orama node install or upload-source.sh). The monitor runs sudo orama node report --json over SSH, so the binary must be at /usr/local/bin/orama on each node.