
Monitoring

Real-time cluster health monitoring via SSH. The system has two parts:

  1. orama node report — Runs on each VPS node, collects all local health data, outputs JSON
  2. orama monitor — Runs on your local machine, SSHes into nodes, aggregates results, displays via TUI or tables

Architecture

Developer Machine                    VPS Nodes (via SSH)
┌──────────────────┐                 ┌────────────────────┐
│ orama monitor    │ ──SSH──────────>│ orama node report  │
│  (TUI / tables)  │ <──JSON─────────│  (local collector) │
│                  │                 └────────────────────┘
│  CollectOnce()   │ ──SSH──────────>│ orama node report  │
│  DeriveAlerts()  │ <──JSON─────────│  (local collector) │
│  Render()        │                 └────────────────────┘
└──────────────────┘

Each node runs orama node report --json locally (no SSH to other nodes), collecting data via os/exec commands and net/http requests to localhost services. The monitor SSHes into all nodes in parallel, gathers their reports, and then runs cross-node analysis to detect cluster-wide issues.
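
The same per-node data path can be exercised by hand, which is handy when debugging the monitor itself. The example below is a rough equivalent of what the monitor does for each node; root@<node-ip> is illustrative, and the real SSH user and hosts come from the node configuration file.

# Manually collect one node's report over SSH and inspect the collection metadata
ssh root@<node-ip> 'sudo orama node report --json' | jq '{hostname, collect_ms, errors}'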

Quick Start

# Interactive TUI (auto-refreshes every 30s)
orama monitor --env testnet

# Cluster overview table
orama monitor cluster --env testnet

# Alerts only
orama monitor alerts --env testnet

# Full JSON report (pipe to jq or feed to LLM)
orama monitor report --env testnet

orama monitor — Local Orchestrator

Usage

orama monitor [subcommand] --env <environment> [flags]

Without a subcommand, orama monitor launches the interactive TUI.

Global Flags

Flag      Default                    Description
--env     (required)                 Environment: devnet, testnet, mainnet
--json    false                      Machine-readable JSON output (for one-shot subcommands)
--node                               Filter to a specific node host/IP
--config  scripts/remote-nodes.conf  Path to node configuration file

Subcommands

Subcommand   Description
live         Interactive TUI monitor (default when no subcommand)
cluster      Cluster overview: all nodes, roles, RQLite state, WG peers
node         Per-node health details (system, services, WG, DNS)
service      Service status matrix across all nodes
mesh         WireGuard mesh connectivity and peer details
dns          DNS health: CoreDNS, Caddy, TLS cert expiry, resolution
namespaces   Namespace health across nodes
alerts       Active alerts and warnings sorted by severity
report       Full JSON dump optimized for LLM consumption

Examples

# Cluster overview
orama monitor cluster --env testnet

# Cluster overview as JSON
orama monitor cluster --env testnet --json

# Alerts for all nodes
orama monitor alerts --env testnet

# Single-node deep dive
orama monitor node --env testnet --node 51.195.109.238

# Services for one node
orama monitor service --env testnet --node 51.195.109.238

# WireGuard mesh details
orama monitor mesh --env testnet

# DNS health
orama monitor dns --env testnet

# Namespace health
orama monitor namespaces --env testnet

# Full report for LLM analysis
orama monitor report --env testnet | jq .

# Single-node report
orama monitor report --env testnet --node 51.195.109.238

# Custom config file
orama monitor cluster --config /path/to/nodes.conf --env devnet

Interactive TUI

The live subcommand (default) launches a full-screen terminal UI:

Tabs: Overview | Nodes | Services | WG Mesh | DNS | Namespaces | Alerts

Key Bindings:

Key                Action
Tab / Shift+Tab    Switch tabs
j / k or ↑ / ↓     Scroll content
r                  Force refresh
q / Ctrl+C         Quit

The TUI auto-refreshes every 30 seconds. A spinner shows during data collection. Colors indicate health: green = healthy, red = critical, yellow = warning.

LLM Report Format

orama monitor report outputs structured JSON designed for AI consumption:

{
  "meta": {
    "environment": "testnet",
    "collected_at": "2026-02-16T12:00:00Z",
    "duration_seconds": 3.2,
    "node_count": 3,
    "healthy_count": 3
  },
  "summary": {
    "rqlite_leader": "10.0.0.1",
    "rqlite_voters": "3/3",
    "rqlite_raft_term": 42,
    "wg_mesh_status": "all connected",
    "service_health": "all nominal",
    "critical_alerts": 0,
    "warning_alerts": 1,
    "info_alerts": 0
  },
  "alerts": [...],
  "nodes": [
    {
      "host": "51.195.109.238",
      "status": "healthy",
      "collection_ms": 526,
      "report": { ... }
    }
  ]
}
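
Because the report is stable JSON, quick checks can be scripted against it with jq. The examples below use only fields shown above; status values other than "healthy" are assumed to indicate a failed or degraded collection.

# Cluster summary at a glance
orama monitor report --env testnet | jq '{meta, summary}'

# Hosts whose report could not be collected or is degraded
orama monitor report --env testnet | jq -r '.nodes[] | select(.status != "healthy") | .host'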

orama node report — VPS-Side Collector

Runs locally on a VPS node. Collects all system and service data in parallel and outputs a single JSON blob. Requires root privileges.

Usage

# On a VPS node
sudo orama node report --json

What It Collects

Section      Data
system       CPU count, load average, memory/disk/swap usage, OOM kills, kernel version, uptime, clock time
services     Systemd service states (active, restarts, memory, CPU, restart loop detection) for 10 core services
rqlite       Raft state, leader, term, applied/commit index, peers, strong read test, readyz, debug vars
olric        Service state, memberlist, member count, restarts, memory, log analysis
ipfs         Daemon/cluster state, swarm/cluster peers, repo size, versions, swarm key
gateway      HTTP health check, subsystem status
wireguard    Interface state, WG IP, peers, handshake ages, MTU, config permissions
dns          CoreDNS/Caddy state, port bindings, resolution tests, TLS cert expiry
anyone       Relay/client state, bootstrap progress, fingerprint
network      Internet reachability, TCP stats, retransmission rate, listening ports, UFW rules
processes    Zombie count, orphan orama processes, panic/fatal count in logs
namespaces   Per-namespace service probes (RQLite, Olric, Gateway)

Performance

All 12 collectors run in parallel with goroutines. Typical collection time is < 1 second per node. HTTP timeouts are 3 seconds, command timeouts are 4 seconds.

Output Schema

{
  "timestamp": "2026-02-16T12:00:00Z",
  "hostname": "ns1",
  "version": "0.107.0",
  "collect_ms": 526,
  "errors": [],
  "system": { "cpu_count": 4, "load_avg_1": 0.1, "mem_total_mb": 7937, ... },
  "services": { "services": [...], "failed_units": [] },
  "rqlite": { "responsive": true, "raft_state": "Leader", "term": 42, ... },
  "olric": { "service_active": true, "memberlist_up": true, ... },
  "ipfs": { "daemon_active": true, "swarm_peers": 2, ... },
  "gateway": { "responsive": true, "http_status": 200, ... },
  "wireguard": { "interface_up": true, "wg_ip": "10.0.0.1", "peers": [...], ... },
  "dns": { "coredns_active": true, "caddy_active": true, "base_tls_days_left": 88, ... },
  "anyone": { "relay_active": true, "bootstrapped": true, ... },
  "network": { "internet_reachable": true, "ufw_active": true, ... },
  "processes": { "zombie_count": 0, "orphan_count": 0, "panic_count": 0, ... },
  "namespaces": []
}
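
When logged into a node directly, individual sections of this schema can be pulled out with jq; the field paths below are taken from the example above.

# Quick local check of RQLite, WireGuard, and collector errors on this node
sudo orama node report --json | jq '{raft_state: .rqlite.raft_state, wg_up: .wireguard.interface_up, errors}'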

Alert Detection

Alerts are derived from cross-node analysis of all collected reports. Each alert has a severity level and identifies the affected subsystem and node.

Alert Severities

Severity   Examples
critical   SSH collection failed (node unreachable), no RQLite leader, split brain, RQLite unresponsive, WireGuard interface down, WG peer never handshaked, OOM kills, service failed, UFW inactive
warning    Strong read failed, memory > 90%, disk > 85%, stale WG handshake (> 3min), Raft term inconsistency, applied index lag > 100, restart loop detected, TLS cert < 14 days, DNS down, namespace gateway down, Anyone not bootstrapped, clock skew > 5s, binary version mismatch, internet unreachable, high TCP retransmission
info       Zombie processes, orphan orama processes, swap usage > 30%
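
The alert counts in the report summary make it easy to wire monitoring into automation. A minimal sketch, using only the summary fields documented above:

# Exit non-zero when any critical alert is present (suitable for cron or CI)
orama monitor report --env testnet | jq -e '.summary.critical_alerts == 0' > /dev/null

jq -e derives its exit status from the last output value, so the pipeline fails exactly when the comparison is false.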

Cross-Node Checks

These checks compare data across all nodes; a manual jq spot-check is sketched after the list:

  • RQLite Leader: Exactly one leader exists (no split brain)
  • Leader Agreement: All nodes agree on the same leader address
  • Raft Term Consistency: Term values within 1 of each other
  • Applied Index Lag: Followers within 100 entries of the leader
  • WireGuard Peer Symmetry: Each node has N-1 peers
  • Clock Skew: Node clocks within 5 seconds of each other
  • Binary Version: All nodes running the same version
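
Two of the invariants above (a single leader and consistent Raft terms) can be spot-checked by hand from the report output; the jq path assumes each per-node report is embedded under nodes[].report, as shown in the schemas above.

# One line per node: host, Raft state, and term (expect exactly one Leader, terms within 1)
orama monitor report --env testnet | jq -r '.nodes[] | "\(.host)  \(.report.rqlite.raft_state)  term=\(.report.rqlite.term)"'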

Per-Node Checks

  • RQLite: Responsive, ready, strong read
  • WireGuard: Interface up, handshake freshness
  • System: Memory, disk, load, OOM kills, swap
  • Services: Systemd state, restart loops
  • DNS: CoreDNS/Caddy up, TLS cert expiry, SOA resolution
  • Anyone: Bootstrap progress
  • Processes: Zombies, orphans, panics in logs
  • Namespaces: Gateway and RQLite per namespace
  • Network: UFW, internet reachability, TCP retransmission

Monitor vs Inspector

Both tools check cluster health, but they serve different purposes:

             orama monitor                                         orama inspect
Data source  orama node report --json (single SSH call per node)   15+ SSH commands per node per subsystem
Speed        ~3-5s for full cluster                                ~4-10s for full cluster
Output       TUI, tables, JSON                                     Tables, JSON
Focus        Real-time monitoring, alert detection                 Deep diagnostic checks with pass/fail/warn
AI support   report subcommand for LLM input                       --ai flag for inline analysis
Use case     "Is anything wrong right now?"                        "What exactly is wrong and why?"

Use monitor for day-to-day health checks and the interactive TUI. Use inspect for deep diagnostics when something is already known to be broken.

Configuration

Uses the same scripts/remote-nodes.conf as the inspector. See INSPECTOR.md for format details.

Prerequisites

Nodes must have the orama CLI installed (via orama node install or upload-source.sh). The monitor runs sudo orama node report --json over SSH, so the binary must be at /usr/local/bin/orama on each node.
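
A quick way to confirm a node is ready before pointing the monitor at it (root@<node-ip> is again illustrative; use whatever SSH access your node configuration defines):

# Verify the binary is present and the collector runs cleanly
ssh root@<node-ip> 'test -x /usr/local/bin/orama && sudo orama node report --json > /dev/null && echo ok'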