Monitoring
Real-time cluster health monitoring via SSH. The system has two parts:
- `orama node report` — Runs on each VPS node, collects all local health data, outputs JSON
- `orama monitor` — Runs on your local machine, SSHes into nodes, aggregates results, displays via TUI or tables
Architecture
```
Developer Machine                 VPS Nodes (via SSH)
┌──────────────────┐              ┌────────────────────┐
│ orama monitor    │ ──SSH──────> │ orama node report  │
│ (TUI / tables)   │ <──JSON───── │ (local collector)  │
│                  │              └────────────────────┘
│ CollectOnce()    │ ──SSH──────> │ orama node report  │
│ DeriveAlerts()   │ <──JSON───── │ (local collector)  │
│ Render()         │              └────────────────────┘
└──────────────────┘
```
Each node runs orama node report --json locally (no SSH to other nodes), collecting data via os/exec and net/http to localhost services. The monitor SSHes into all nodes in parallel, collects reports, then runs cross-node analysis to detect cluster-wide issues.
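The fan-out can be sketched in Python (a hypothetical illustration, not the actual Go implementation; the `run` parameter exists only so the SSH call can be stubbed out):

```python
import json
import subprocess
from concurrent.futures import ThreadPoolExecutor

# One SSH call per node; each node runs the collector locally.
REPORT_CMD = ["sudo", "orama", "node", "report", "--json"]

def collect_node(host, run=subprocess.run):
    """Run the collector on one node over SSH and parse its JSON output."""
    proc = run(["ssh", host, *REPORT_CMD], capture_output=True, text=True, timeout=10)
    if proc.returncode != 0:
        return {"host": host, "status": "unreachable", "report": None}
    return {"host": host, "status": "collected", "report": json.loads(proc.stdout)}

def collect_once(hosts):
    """Query every node concurrently, keeping one result entry per node."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(collect_node, hosts))
```

A node that fails SSH collection is recorded as unreachable rather than aborting the whole run, which is what lets the monitor raise a critical alert for it later.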
Quick Start
```bash
# Interactive TUI (auto-refreshes every 30s)
orama monitor --env testnet

# Cluster overview table
orama monitor cluster --env testnet

# Alerts only
orama monitor alerts --env testnet

# Full JSON report (pipe to jq or feed to an LLM)
orama monitor report --env testnet
```
orama monitor — Local Orchestrator
Usage
```bash
orama monitor [subcommand] --env <environment> [flags]
```
Without a subcommand, launches the interactive TUI.
Global Flags
| Flag | Default | Description |
|---|---|---|
| `--env` | (required) | Environment: `devnet`, `testnet`, `mainnet` |
| `--json` | `false` | Machine-readable JSON output (for one-shot subcommands) |
| `--node` | | Filter to a specific node host/IP |
| `--config` | `scripts/remote-nodes.conf` | Path to node configuration file |
Subcommands
| Subcommand | Description |
|---|---|
| `live` | Interactive TUI monitor (default when no subcommand) |
| `cluster` | Cluster overview: all nodes, roles, RQLite state, WG peers |
| `node` | Per-node health details (system, services, WG, DNS) |
| `service` | Service status matrix across all nodes |
| `mesh` | WireGuard mesh connectivity and peer details |
| `dns` | DNS health: CoreDNS, Caddy, TLS cert expiry, resolution |
| `namespaces` | Namespace health across nodes |
| `alerts` | Active alerts and warnings sorted by severity |
| `report` | Full JSON dump optimized for LLM consumption |
Examples
```bash
# Cluster overview
orama monitor cluster --env testnet

# Cluster overview as JSON
orama monitor cluster --env testnet --json

# Alerts for all nodes
orama monitor alerts --env testnet

# Single-node deep dive
orama monitor node --env testnet --node 51.195.109.238

# Services for one node
orama monitor service --env testnet --node 51.195.109.238

# WireGuard mesh details
orama monitor mesh --env testnet

# DNS health
orama monitor dns --env testnet

# Namespace health
orama monitor namespaces --env testnet

# Full report for LLM analysis
orama monitor report --env testnet | jq .

# Single-node report
orama monitor report --env testnet --node 51.195.109.238

# Custom config file
orama monitor cluster --config /path/to/nodes.conf --env devnet
```
Interactive TUI
The live subcommand (default) launches a full-screen terminal UI:
Tabs: Overview | Nodes | Services | WG Mesh | DNS | Namespaces | Alerts
Key Bindings:
| Key | Action |
|---|---|
| `Tab` / `Shift+Tab` | Switch tabs |
| `j` / `k` or `↑` / `↓` | Scroll content |
| `r` | Force refresh |
| `q` / `Ctrl+C` | Quit |
The TUI auto-refreshes every 30 seconds. A spinner shows during data collection. Colors indicate health: green = healthy, red = critical, yellow = warning.
LLM Report Format
orama monitor report outputs structured JSON designed for AI consumption:
```json
{
  "meta": {
    "environment": "testnet",
    "collected_at": "2026-02-16T12:00:00Z",
    "duration_seconds": 3.2,
    "node_count": 3,
    "healthy_count": 3
  },
  "summary": {
    "rqlite_leader": "10.0.0.1",
    "rqlite_voters": "3/3",
    "rqlite_raft_term": 42,
    "wg_mesh_status": "all connected",
    "service_health": "all nominal",
    "critical_alerts": 0,
    "warning_alerts": 1,
    "info_alerts": 0
  },
  "alerts": [...],
  "nodes": [
    {
      "host": "51.195.109.238",
      "status": "healthy",
      "collection_ms": 526,
      "report": { ... }
    }
  ]
}
```
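As a sketch of how a downstream consumer might reduce this payload to a triage line (field names are taken from the example above; `triage_line` itself is a hypothetical helper, not part of the CLI):

```python
def triage_line(report):
    """Collapse a `orama monitor report` payload into one summary line."""
    meta, summary = report["meta"], report["summary"]
    health = f'{meta["healthy_count"]}/{meta["node_count"]} nodes healthy'
    alerts = f'{summary["critical_alerts"]} critical, {summary["warning_alerts"]} warning'
    return f'{meta["environment"]}: {health}, leader {summary["rqlite_leader"]}, {alerts}'
```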
orama node report — VPS-Side Collector
Runs locally on a VPS node. Collects all system and service data in parallel and outputs a single JSON blob. Requires root privileges.
Usage
```bash
# On a VPS node
sudo orama node report --json
```
What It Collects
| Section | Data |
|---|---|
| system | CPU count, load average, memory/disk/swap usage, OOM kills, kernel version, uptime, clock time |
| services | Systemd service states (active, restarts, memory, CPU, restart loop detection) for 10 core services |
| rqlite | Raft state, leader, term, applied/commit index, peers, strong read test, readyz, debug vars |
| olric | Service state, memberlist, member count, restarts, memory, log analysis |
| ipfs | Daemon/cluster state, swarm/cluster peers, repo size, versions, swarm key |
| gateway | HTTP health check, subsystem status |
| wireguard | Interface state, WG IP, peers, handshake ages, MTU, config permissions |
| dns | CoreDNS/Caddy state, port bindings, resolution tests, TLS cert expiry |
| anyone | Relay/client state, bootstrap progress, fingerprint |
| network | Internet reachability, TCP stats, retransmission rate, listening ports, UFW rules |
| processes | Zombie count, orphan orama processes, panic/fatal count in logs |
| namespaces | Per-namespace service probes (RQLite, Olric, Gateway) |
Performance
All 12 collectors run in parallel with goroutines. Typical collection time is < 1 second per node. HTTP timeouts are 3 seconds, command timeouts are 4 seconds.
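The pattern is roughly the following (a Python sketch of the Go fan-out, not the actual implementation; the 4-second default matches the command timeout stated above):

```python
from concurrent.futures import ThreadPoolExecutor

def run_collectors(collectors, timeout=4.0):
    """Run independent collectors in parallel; record per-collector errors
    instead of failing the whole report."""
    report, errors = {}, []
    with ThreadPoolExecutor(max_workers=len(collectors)) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, fut in futures.items():
            try:
                report[name] = fut.result(timeout=timeout)
            except Exception as exc:
                errors.append(f"{name}: {exc}")
    return report, errors
```

Keeping errors out-of-band is what lets a node with, say, a hung RQLite probe still return the other eleven sections, matching the `errors` array in the schema below.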
Output Schema
```json
{
  "timestamp": "2026-02-16T12:00:00Z",
  "hostname": "ns1",
  "version": "0.107.0",
  "collect_ms": 526,
  "errors": [],
  "system": { "cpu_count": 4, "load_avg_1": 0.1, "mem_total_mb": 7937, ... },
  "services": { "services": [...], "failed_units": [] },
  "rqlite": { "responsive": true, "raft_state": "Leader", "term": 42, ... },
  "olric": { "service_active": true, "memberlist_up": true, ... },
  "ipfs": { "daemon_active": true, "swarm_peers": 2, ... },
  "gateway": { "responsive": true, "http_status": 200, ... },
  "wireguard": { "interface_up": true, "wg_ip": "10.0.0.1", "peers": [...], ... },
  "dns": { "coredns_active": true, "caddy_active": true, "base_tls_days_left": 88, ... },
  "anyone": { "relay_active": true, "bootstrapped": true, ... },
  "network": { "internet_reachable": true, "ufw_active": true, ... },
  "processes": { "zombie_count": 0, "orphan_count": 0, "panic_count": 0, ... },
  "namespaces": []
}
```
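A consumer can sanity-check a payload before trusting it; this hypothetical helper just verifies the twelve top-level sections listed in the schema above are present:

```python
# Section names taken from the output schema above.
EXPECTED_SECTIONS = {
    "system", "services", "rqlite", "olric", "ipfs", "gateway",
    "wireguard", "dns", "anyone", "network", "processes", "namespaces",
}

def missing_sections(report):
    """Return the expected top-level sections absent from a report dict."""
    return sorted(EXPECTED_SECTIONS - report.keys())
```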
Alert Detection
Alerts are derived from cross-node analysis of all collected reports. Each alert has a severity level and identifies the affected subsystem and node.
Alert Severities
| Severity | Examples |
|---|---|
| critical | SSH collection failed (node unreachable), no RQLite leader, split brain, RQLite unresponsive, WireGuard interface down, WG peer never handshaked, OOM kills, service failed, UFW inactive |
| warning | Strong read failed, memory > 90%, disk > 85%, stale WG handshake (> 3min), Raft term inconsistency, applied index lag > 100, restart loop detected, TLS cert < 14 days, DNS down, namespace gateway down, Anyone not bootstrapped, clock skew > 5s, binary version mismatch, internet unreachable, high TCP retransmission |
| info | Zombie processes, orphan orama processes, swap usage > 30% |
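A few of the per-node system thresholds above can be sketched like this (the percentage field names are assumptions for illustration; the real report exposes raw values the monitor derives percentages from):

```python
def system_alerts(sys):
    """Derive (severity, message) alerts from one node's system section."""
    alerts = []
    if sys.get("oom_kills", 0) > 0:
        alerts.append(("critical", "OOM kills detected"))
    if sys.get("mem_used_pct", 0) > 90:
        alerts.append(("warning", "memory > 90%"))
    if sys.get("disk_used_pct", 0) > 85:
        alerts.append(("warning", "disk > 85%"))
    if sys.get("swap_used_pct", 0) > 30:
        alerts.append(("info", "swap usage > 30%"))
    return alerts
```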
Cross-Node Checks
These checks compare data across all nodes:
- RQLite Leader: Exactly one leader exists (no split brain)
- Leader Agreement: All nodes agree on the same leader address
- Raft Term Consistency: Term values within 1 of each other
- Applied Index Lag: Followers within 100 entries of the leader
- WireGuard Peer Symmetry: Each node has N-1 peers
- Clock Skew: Node clocks within 5 seconds of each other
- Binary Version: All nodes running the same version
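Three of these cross-node checks might look like the following sketch, applied to the per-node `rqlite` sections (field names assumed from the collector schema):

```python
def cross_node_rqlite_alerts(nodes):
    """Check leader count and Raft term spread across all node reports."""
    alerts = []
    leaders = [n["host"] for n in nodes if n["raft_state"] == "Leader"]
    if len(leaders) == 0:
        alerts.append(("critical", "no RQLite leader"))
    elif len(leaders) > 1:
        alerts.append(("critical", f"split brain: multiple leaders {leaders}"))
    terms = [n["term"] for n in nodes]
    if terms and max(terms) - min(terms) > 1:
        alerts.append(("warning", "Raft term inconsistency"))
    return alerts
```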
Per-Node Checks
- RQLite: Responsive, ready, strong read
- WireGuard: Interface up, handshake freshness
- System: Memory, disk, load, OOM kills, swap
- Services: Systemd state, restart loops
- DNS: CoreDNS/Caddy up, TLS cert expiry, SOA resolution
- Anyone: Bootstrap progress
- Processes: Zombies, orphans, panics in logs
- Namespaces: Gateway and RQLite per namespace
- Network: UFW, internet reachability, TCP retransmission
Monitor vs Inspector
Both tools check cluster health, but they serve different purposes:
| | `orama monitor` | `orama inspect` |
|---|---|---|
| Data source | `orama node report --json` (single SSH call per node) | 15+ SSH commands per node per subsystem |
| Speed | ~3-5s for full cluster | ~4-10s for full cluster |
| Output | TUI, tables, JSON | Tables, JSON |
| Focus | Real-time monitoring, alert detection | Deep diagnostic checks with pass/fail/warn |
| AI support | `report` subcommand for LLM input | `--ai` flag for inline analysis |
| Use case | "Is anything wrong right now?" | "What exactly is wrong and why?" |
Use monitor for day-to-day health checks and the interactive TUI. Use inspect for deep diagnostics when something is already known to be broken.
Configuration
Uses the same scripts/remote-nodes.conf as the inspector. See INSPECTOR.md for format details.
Prerequisites
Nodes must have the orama CLI installed (via orama node install or upload-source.sh). The monitor runs sudo orama node report --json over SSH, so the binary must be at /usr/local/bin/orama on each node.