orama-vault/docs/OPERATOR_GUIDE.md

Orama Vault -- Operator Guide

Monitoring

Health Endpoint

The simplest way to check if a guardian is running:

curl -s http://127.0.0.1:7500/v1/vault/health | jq .

Expected response:

{
  "status": "ok",
  "version": "0.1.0"
}

If this endpoint does not respond, the guardian process is not running or the port is blocked. Check systemd status first.

Status Endpoint

Provides runtime configuration:

curl -s http://127.0.0.1:7500/v1/vault/status | jq .

Expected response:

{
  "status": "ok",
  "version": "0.1.0",
  "data_dir": "/opt/orama/.orama/data/vault",
  "client_port": 7500,
  "peer_port": 7501
}

Guardians Endpoint

Lists known guardian nodes in the cluster:

curl -s http://127.0.0.1:7500/v1/vault/guardians | jq .

Note (v0.1.0): This currently returns only the local node. Full cluster listing requires RQLite integration (Phase 2).

Systemd Journal

The guardian logs to stderr using Zig's structured logging, which is captured by the systemd journal:

# View recent logs
sudo journalctl -u orama-vault -n 50 --no-pager

# Follow live logs
sudo journalctl -u orama-vault -f

# View logs since last boot
sudo journalctl -u orama-vault -b

# View error-level logs only
sudo journalctl -u orama-vault -p err

Log messages include:

  • vault-guardian v0.1.0 starting -- startup confirmation
  • config: <path> -- config file path
  • listening on <addr>:<port> (client) -- client listener bound
  • listening on <addr>:<port> (peer) -- peer listener bound
  • data directory: <path> -- data directory path
  • guardian ready -- starting HTTP server -- initialization complete
  • stored share for identity <hex> (<n> bytes, version <v>) -- successful push
  • served share for identity <hex> (<n> bytes) -- successful pull
  • rejected rollback for <hex>: version <v> <= current <v> -- anti-rollback rejection
  • accept error: <err> -- TCP accept failure (non-fatal, retried)
  • connection error: <err> -- individual connection handling error
  • failed to write share for <hex>: <err> -- disk write failure

Service Status

sudo systemctl status orama-vault

Check for:

  • Active: active (running) -- service is up
  • Main PID: <pid> -- process ID
  • Memory and CPU usage in the status output

Troubleshooting

Port Already In Use

Symptom: Guardian fails to start with failed to bind 0.0.0.0:7500: error.AddressInUse

Diagnosis:

# Find what's using the port
sudo ss -tlnp | grep 7500

Resolution:

  • If another vault-guardian is running: sudo systemctl stop orama-vault first.
  • If another service is using port 7500: change the vault port with --port <other>.
  • If the port is in TIME_WAIT state from a recent restart: wait 30-60 seconds. The guardian sets SO_REUSEADDR which should handle most cases.

Data Directory Permissions

Symptom: failed to create data directory <path>: error.AccessDenied

Diagnosis:

ls -la /opt/orama/.orama/data/vault/

Resolution:

sudo chown -R orama:orama /opt/orama/.orama/data/vault
sudo chmod 700 /opt/orama/.orama/data/vault

The systemd service uses ProtectSystem=strict with ReadWritePaths=/opt/orama/.orama/data/vault, so the data directory must be under this exact path or a CLI override must be used.
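For reference, the hardening directives mentioned in this guide correspond to unit-file entries along these lines. This is a sketch of the relevant `[Service]` section only, not the complete unit; the exact file in your deployment may differ:

```ini
# Sketch: hardening directives referenced in this guide (not the full unit).
[Service]
ProtectSystem=strict
ReadWritePaths=/opt/orama/.orama/data/vault
MemoryMax=512M
```

With ProtectSystem=strict, the entire filesystem is read-only to the service except the paths listed in ReadWritePaths, which is why a data directory outside that path fails with AccessDenied.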

RQLite Connectivity

Symptom: Log shows failed to fetch node list from RQLite, running in single-node mode

Diagnosis:

# Check if RQLite is running
sudo systemctl status orama-*-rqlite

# Test RQLite endpoint
curl -s http://127.0.0.1:4001/status | jq .store.raft.state

Resolution:

  • This warning is non-fatal. The guardian continues in single-node mode.
  • Ensure RQLite is started before the vault guardian (normal dependency ordering).
  • Verify the rqlite_url in config matches the actual RQLite address.

Note (v0.1.0): RQLite node discovery is a stub. The guardian always falls back to single-node mode. This warning is expected in the current version.

Share Write Failures

Symptom: Push returns 500 Internal Server Error, logs show failed to write share for <hex>: <err>

Diagnosis:

# Check disk space
df -h /opt/orama/.orama/data/vault

# Check inode usage
df -i /opt/orama/.orama/data/vault

# Check directory permissions
ls -la /opt/orama/.orama/data/vault/shares/

Resolution:

  • If disk is full: free space or expand the partition.
  • If inodes are exhausted (unlikely but possible with millions of users): clean up orphaned temp files.
  • If permissions are wrong: fix ownership as shown above.

Anti-Rollback Rejections

Symptom: Push returns 400 with "version must be greater than current stored version"

This is normal behavior -- the client tried to push an older version of a share. Common causes:

  • Client retry after a network timeout (the first push actually succeeded).
  • Client software bug sending stale version numbers.

Diagnosis:

# Check current stored version for an identity
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version

Resolution: The client must send a version number strictly greater than the stored value. This is not a guardian bug.

Guardian Crash Loop

Symptom: systemctl status shows rapid restarts.

Diagnosis:

# View recent crash logs
sudo journalctl -u orama-vault -n 100 --no-pager | tail -50

# Check for OOM kills
sudo journalctl -k | grep -i "oom\|kill"

Resolution:

  • If OOM killed: the 512 MiB memory limit may be too low. Check if share data has grown unexpectedly.
  • If config parse error: check the config file syntax (or remove it to use defaults).
  • If bind error: another process is using the port.

Manual Operations

Check Stored Shares

List all identities with stored shares:

ls /opt/orama/.orama/data/vault/shares/

Check a specific identity's share:

# View version
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version

# View share size
ls -la /opt/orama/.orama/data/vault/shares/<identity_hex>/share.bin

# View checksum
xxd /opt/orama/.orama/data/vault/shares/<identity_hex>/checksum.bin

Count Total Shares

ls -d /opt/orama/.orama/data/vault/shares/*/ 2>/dev/null | wc -l

Verify Share Integrity Manually

The guardian verifies HMAC integrity on every read. To manually check if a share file has been corrupted:

# If you know the integrity key, you can compute HMAC externally:
# (The integrity key is internal to the guardian and not stored on disk in the current version)

# Check file exists and is non-empty
test -s /opt/orama/.orama/data/vault/shares/<identity>/share.bin && echo "OK" || echo "MISSING/EMPTY"
test -s /opt/orama/.orama/data/vault/shares/<identity>/checksum.bin && echo "OK" || echo "MISSING/EMPTY"
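The same existence checks can be run across every stored identity in one pass. A minimal sketch; the scan_shares helper and its name are illustrative, not part of the guardian tooling, and it assumes the on-disk layout shown above:

```shell
# Report identities whose share.bin or checksum.bin is missing or empty.
# Illustrative helper; pass the shares directory as the first argument.
scan_shares() {
  shares_dir="$1"
  bad=0
  for d in "$shares_dir"/*/; do
    [ -d "$d" ] || continue               # no share directories at all
    for f in share.bin checksum.bin; do
      if ! test -s "$d$f"; then
        echo "BAD: $(basename "$d") $f"
        bad=$((bad + 1))
      fi
    done
  done
  echo "problems: $bad"
}

# Example:
# scan_shares /opt/orama/.orama/data/vault/shares
```

Note this only detects missing or truncated files; actual HMAC verification happens inside the guardian on every read.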

Test Push/Pull

# Push a test share
curl -X POST http://127.0.0.1:7500/v1/vault/push \
  -H "Content-Type: application/json" \
  -d '{
    "identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
    "share": "dGVzdCBzaGFyZSBkYXRh",
    "version": 1
  }'

# Expected: {"status":"stored"}

# Pull it back
curl -X POST http://127.0.0.1:7500/v1/vault/pull \
  -H "Content-Type: application/json" \
  -d '{
    "identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
  }'

# Expected: {"share":"dGVzdCBzaGFyZSBkYXRh"}

Delete a Share (Emergency)

Warning: Deleting a share is destructive and cannot be undone. Only do this if you understand the implications for the user's recovery capability.

# Remove a specific identity's share directory
rm -rf /opt/orama/.orama/data/vault/shares/<identity_hex>

Disaster Recovery

Failure Scenarios

| Scenario | Impact | Recovery |
|---|---|---|
| 1 node dies | No impact. N-1 nodes remain, well above K. | Replace the node; shares will be re-pushed by clients. |
| K-1 nodes die | No impact. N-K+1 nodes remain, still at least K. | Replace nodes; reshare when quorum recovers. |
| N-K nodes die | No impact. Exactly K nodes survive -- the minimum needed to reconstruct. | Replace nodes. |
| N-K+1 nodes die | CRITICAL. Only K-1 nodes remain; reconstruction is impossible. Data loss for users who only have vault-based recovery. | Users with a mnemonic (Path A) are unaffected. |
| All nodes die | TOTAL LOSS. All shares destroyed. | Users must recover via mnemonic (Path A); vault recovery (Path B) is permanently lost. |
| Data directory corrupted on 1 node | HMAC integrity check fails on read; the node returns errors for affected shares. | Delete the corrupted share directory; the repair protocol will redistribute it once implemented. |

Key Insight

The system can tolerate losing up to N - K nodes without any data loss. With default thresholds:

| Cluster Size | K | Max Node Loss |
|---|---|---|
| 5 nodes | 3 | 2 nodes |
| 14 nodes | 4 | 10 nodes |
| 50 nodes | 16 | 34 nodes |
| 100 nodes | 33 | 67 nodes |
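The table is simple arithmetic: with the default threshold K = max(3, floor(N/3)) described later in this guide, the maximum tolerable loss is N - K. A quick sanity check in shell; the max_loss helper is illustrative, and `$(( ))` integer division floors, matching the formula:

```shell
# Max tolerable node loss is N - K, with K = max(3, floor(N / 3)).
# Illustrative helper; shell integer division floors for positive values.
max_loss() {
  k=$(( $1 / 3 ))
  [ "$k" -lt 3 ] && k=3
  echo $(( $1 - k ))
}

max_loss 5     # -> 2
max_loss 14    # -> 10
max_loss 100   # -> 67
```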

Backup Strategy

The data directory can be backed up with standard tools:

# Rsync backup
rsync -av /opt/orama/.orama/data/vault/ /backup/vault/

# Tarball backup
tar czf vault-backup-$(date +%Y%m%d).tar.gz -C /opt/orama/.orama/data/vault .

However, backups are generally unnecessary because:

  1. Every node stores a share, so the cluster itself is the redundancy.
  2. Shares are re-pushed by clients on updates.
  3. The proactive re-sharing protocol (when fully wired) will redistribute shares automatically.

Backups are only useful if you fear simultaneous catastrophic failure of many nodes.


Capacity Planning

Per-Share Storage

Each user's share directory contains:

| File | Typical Size | Description |
|---|---|---|
| share.bin | ~1 KB | Encrypted share data (same size as the original secret) |
| version | ~1-20 bytes | Version counter as ASCII digits |
| checksum.bin | 32 bytes | HMAC-SHA256 checksum |
| Directory entry | ~4 KB | Filesystem overhead (depends on filesystem) |

Total per user per node: ~5 KB (including filesystem overhead).

Cluster-Wide Storage

With all-node replication, each user has one share on every node:

| Users | Per Node | Per Cluster (14 nodes) |
|---|---|---|
| 1,000 | ~5 MB | ~70 MB |
| 10,000 | ~50 MB | ~700 MB |
| 100,000 | ~500 MB | ~7 GB |
| 1,000,000 | ~5 GB | ~70 GB |

At 1 million users, each node stores approximately 5 GB. This is well within the capability of any modern VPS.
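These figures are straightforward arithmetic. A sketch for recomputing them; the storage_mb helper is illustrative and assumes the ~5 KB-per-user estimate above, using decimal units:

```shell
# Rough per-node and cluster-wide storage estimate, assuming ~5 KB per user
# (see the per-share table above). Illustrative helper, decimal units.
storage_mb() {   # usage: storage_mb <users> <nodes>
  echo "per node:    $(( $1 * 5 / 1000 )) MB"
  echo "per cluster: $(( $1 * 5 * $2 / 1000 )) MB"
}

storage_mb 1000000 14
# per node:    5000 MB
# per cluster: 70000 MB
```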

Memory Usage

The guardian uses minimal memory:

  • Static binary: ~2-5 MB RSS.
  • Per-connection: ~128 KB (64 KB read buffer + 64 KB write buffer).
  • Single-threaded: only one connection is active at a time (MVP).
  • No in-memory caching of shares: every read/write goes to disk.

The systemd MemoryMax=512M limit is generous for the current architecture. Actual usage is typically under 20 MB.

Inode Usage

Each user creates one directory and 2-3 files. At 1 million users:

  • ~4 million inodes.
  • Most Linux filesystems default to millions of inodes (ext4: 1 inode per 16 KB).
  • A 100 GB partition with ext4 defaults has ~6.5 million inodes.

This is unlikely to be a bottleneck but is worth monitoring on small partitions.


Cluster Scaling

Adding Nodes

When a new node joins the Orama network:

  1. The vault guardian starts and registers itself via RQLite.
  2. Other guardians detect the new node via the discovery module (join event).
  3. The new node initially has zero shares.
  4. Shares are populated in two ways:
    • Client push: When clients push new versions, they include the new node.
    • Repair protocol: The re-sharing protocol redistributes shares to include the new node (Phase 2).

The threshold K is recomputed based on the alive count: K = max(3, floor(N/3)).
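The recomputation can be checked directly in shell, since `$(( ))` integer division floors for positive values. The threshold_k helper is illustrative, not part of the guardian tooling:

```shell
# K = max(3, floor(alive_count / 3)) -- illustrative helper.
threshold_k() {
  k=$(( $1 / 3 ))
  [ "$k" -lt 3 ] && k=3
  echo "$k"
}

threshold_k 5     # -> 3
threshold_k 14    # -> 4
threshold_k 100   # -> 33
```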

Removing Nodes

When a node leaves (graceful shutdown or failure):

  1. Other guardians detect the departure via missed heartbeats (suspect at 15s, dead at 60s).
  2. The departed node's shares are lost.
  3. As long as at least K nodes remain alive (no more than N - K lost), no data is lost.
  4. If a departure drops the alive count below the safety threshold (K+1), the repair protocol triggers re-sharing to adjust.

Threshold Adjustment

The threshold is dynamic and automatic:

K = max(3, floor(alive_count / 3))
  • Adding nodes generally does not change K until the cluster grows significantly.
  • Removing nodes may reduce K if the alive count drops enough.
  • K never drops below 3, ensuring a minimum collusion resistance.

Write Quorum

Write quorum requires supermajority acknowledgment:

W = ceil(2/3 * alive_count)

| Alive | Write Quorum (W) |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 4 |
| 14 | 10 |
| 100 | 67 |

A push succeeds only if W guardians acknowledge storage. This ensures consistency even with some nodes being slow or temporarily unreachable.

Note (v0.1.0): Write quorum is computed but not enforced in the current single-node push handler. Multi-guardian fan-out is Phase 2.
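The quorum values above follow from ceiling division, which in integer shell arithmetic is `ceil(a/b) == (a + b - 1) / b`. A sketch; the write_quorum helper is illustrative:

```shell
# W = ceil(2/3 * alive_count); ceil(2n/3) == (2n + 2) / 3 in integer math.
# Illustrative helper, not part of the guardian tooling.
write_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

write_quorum 3     # -> 2
write_quorum 14    # -> 10
write_quorum 100   # -> 67
```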


Operational Checklist

Pre-Deploy

  • Binary built with -Doptimize=ReleaseSafe for the correct target.
  • Data directory exists with correct ownership and permissions.
  • systemd service file installed and enabled.
  • Firewall rules allow port 7500 (client) and 7501 (peer, WireGuard only).
  • WireGuard is up and peers are connected.

Post-Deploy

  • Health endpoint responds: curl http://127.0.0.1:7500/v1/vault/health
  • Status endpoint shows correct config: curl http://127.0.0.1:7500/v1/vault/status
  • No error-level log messages: sudo journalctl -u orama-vault -p err -n 10
  • Test push/pull cycle works (see "Test Push/Pull" section above).

Periodic Checks

  • Health endpoint responds on all nodes.
  • Share count is consistent across nodes (same identities on each).
  • Disk usage is within expected bounds.
  • No repeated error messages in the journal.
  • systemd reports active (running) with uptime matching expectations.