Orama Vault -- Operator Guide
Monitoring
Health Endpoint
The simplest way to check if a guardian is running:
curl -s http://127.0.0.1:7500/v1/vault/health | jq .
Expected response:
{
"status": "ok",
"version": "0.1.0"
}
If this endpoint does not respond, the guardian process is not running or the port is blocked. Check systemd status first.
Status Endpoint
Provides runtime configuration:
curl -s http://127.0.0.1:7500/v1/vault/status | jq .
Expected response:
{
"status": "ok",
"version": "0.1.0",
"data_dir": "/opt/orama/.orama/data/vault",
"client_port": 7500,
"peer_port": 7501
}
Guardians Endpoint
Lists known guardian nodes in the cluster:
curl -s http://127.0.0.1:7500/v1/vault/guardians | jq .
Note (v0.1.0): This currently returns only the local node. Full cluster listing requires RQLite integration (Phase 2).
Systemd Journal
The guardian logs to stderr using Zig's structured logging, which is captured by the systemd journal:
# View recent logs
sudo journalctl -u orama-vault -n 50 --no-pager
# Follow live logs
sudo journalctl -u orama-vault -f
# View logs since last boot
sudo journalctl -u orama-vault -b
# View error-level logs only
sudo journalctl -u orama-vault -p err
Log messages include:
- `vault-guardian v0.1.0 starting` -- startup confirmation
- `config: <path>` -- config file path
- `listening on <addr>:<port> (client)` -- client listener bound
- `listening on <addr>:<port> (peer)` -- peer listener bound
- `data directory: <path>` -- data directory path
- `guardian ready -- starting HTTP server` -- initialization complete
- `stored share for identity <hex> (<n> bytes, version <v>)` -- successful push
- `served share for identity <hex> (<n> bytes)` -- successful pull
- `rejected rollback for <hex>: version <v> <= current <v>` -- anti-rollback rejection
- `accept error: <err>` -- TCP accept failure (non-fatal, retried)
- `connection error: <err>` -- individual connection handling error
- `failed to write share for <hex>: <err>` -- disk write failure
Service Status
sudo systemctl status orama-vault
Check for:
- `Active: active (running)` -- service is up
- `Main PID: <pid>` -- process ID
- Memory and CPU usage in the status output
Troubleshooting
Port Already In Use
Symptom: Guardian fails to start with failed to bind 0.0.0.0:7500: error.AddressInUse
Diagnosis:
# Find what's using the port
sudo ss -tlnp | grep 7500
Resolution:
- If another vault-guardian is running: `sudo systemctl stop orama-vault` first.
- If another service is using port 7500: change the vault port with `--port <other>`.
- If the port is in TIME_WAIT state from a recent restart: wait 30-60 seconds. The guardian sets `SO_REUSEADDR`, which should handle most cases.
Data Directory Permissions
Symptom: failed to create data directory <path>: error.AccessDenied
Diagnosis:
ls -la /opt/orama/.orama/data/vault/
Resolution:
sudo chown -R orama:orama /opt/orama/.orama/data/vault
sudo chmod 700 /opt/orama/.orama/data/vault
The systemd service uses ProtectSystem=strict with ReadWritePaths=/opt/orama/.orama/data/vault, so the data directory must be under this exact path or a CLI override must be used.
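If the data directory must live elsewhere, the sandbox allowlist can be widened with a standard systemd drop-in rather than editing the installed unit. A sketch, assuming the service is named orama-vault and using a hypothetical alternate path:

```ini
# /etc/systemd/system/orama-vault.service.d/override.conf
[Service]
# Add the alternate location to the write allowlist; the default
# ReadWritePaths from the installed unit remains in effect.
ReadWritePaths=/srv/orama/vault-data
```

Apply with `sudo systemctl daemon-reload && sudo systemctl restart orama-vault`. The guardian still needs the matching CLI override pointing at the same path.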
RQLite Connectivity
Symptom: Log shows failed to fetch node list from RQLite, running in single-node mode
Diagnosis:
# Check if RQLite is running
sudo systemctl status orama-*-rqlite
# Test RQLite endpoint
curl -s http://127.0.0.1:4001/status | jq .store.raft.state
Resolution:
- This warning is non-fatal. The guardian continues in single-node mode.
- Ensure RQLite is started before the vault guardian (normal dependency ordering).
- Verify the `rqlite_url` in config matches the actual RQLite address.
Note (v0.1.0): RQLite node discovery is a stub. The guardian always falls back to single-node mode. This warning is expected in the current version.
Share Write Failures
Symptom: Push returns 500 Internal Server Error, logs show failed to write share for <hex>: <err>
Diagnosis:
# Check disk space
df -h /opt/orama/.orama/data/vault
# Check inode usage
df -i /opt/orama/.orama/data/vault
# Check directory permissions
ls -la /opt/orama/.orama/data/vault/shares/
Resolution:
- If disk is full: free space or expand the partition.
- If inodes are exhausted (unlikely but possible with millions of users): clean up orphaned temp files.
- If permissions are wrong: fix ownership as shown above.
Anti-Rollback Rejections
Symptom: Push returns 400 with "version must be greater than current stored version"
This is normal behavior -- the client tried to push an older version of a share. Common causes:
- Client retry after a network timeout (the first push actually succeeded).
- Client software bug sending stale version numbers.
Diagnosis:
# Check current stored version for an identity
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version
Resolution: The client must send a version number strictly greater than the stored value. This is not a guardian bug.
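The strictly-greater rule can be illustrated with a small helper that reads the stored version file and computes the minimum acceptable next version. `next_version` is a hypothetical name for illustration, not part of the guardian:

```shell
# Smallest version the guardian would accept for an identity,
# given its stored version file (treats a missing share as version 0).
next_version() {
  local vfile=$1 cur=0
  [ -s "$vfile" ] && cur=$(cat "$vfile")   # version is stored as ASCII digits
  echo $(( cur + 1 ))                      # must be strictly greater than current
}

# Example:
# next_version /opt/orama/.orama/data/vault/shares/<identity_hex>/version
```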
Guardian Crash Loop
Symptom: systemctl status shows rapid restarts.
Diagnosis:
# View recent crash logs
sudo journalctl -u orama-vault -n 100 --no-pager | tail -50
# Check for OOM kills
sudo journalctl -k | grep -i "oom\|kill"
Resolution:
- If OOM killed: the 512 MiB memory limit may be too low. Check if share data has grown unexpectedly.
- If config parse error: check the config file syntax (or remove it to use defaults).
- If bind error: another process is using the port.
Manual Operations
Check Stored Shares
List all identities with stored shares:
ls /opt/orama/.orama/data/vault/shares/
Check a specific identity's share:
# View version
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version
# View share size
ls -la /opt/orama/.orama/data/vault/shares/<identity_hex>/share.bin
# View checksum
xxd /opt/orama/.orama/data/vault/shares/<identity_hex>/checksum.bin
Count Total Shares
ls -d /opt/orama/.orama/data/vault/shares/*/ 2>/dev/null | wc -l
Verify Share Integrity Manually
The guardian verifies HMAC integrity on every read. To manually check if a share file has been corrupted:
# If you know the integrity key, you can compute HMAC externally:
# (The integrity key is internal to the guardian and not stored on disk in the current version)
# Check file exists and is non-empty
test -s /opt/orama/.orama/data/vault/shares/<identity>/share.bin && echo "OK" || echo "MISSING/EMPTY"
test -s /opt/orama/.orama/data/vault/shares/<identity>/checksum.bin && echo "OK" || echo "MISSING/EMPTY"
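The per-identity checks above can be extended to a sweep over every stored share. A minimal sketch (`check_shares` is an illustrative helper; it only verifies that the expected files exist and are non-empty, not the HMAC itself, since the integrity key is not available outside the guardian):

```shell
# Flag any share directory whose expected files are missing or empty.
check_shares() {
  local base=$1 bad=0
  for dir in "$base"/*/; do
    [ -d "$dir" ] || continue
    for f in share.bin checksum.bin version; do
      [ -s "$dir$f" ] || { echo "BAD: $dir$f"; bad=$((bad + 1)); }
    done
  done
  echo "$bad problem file(s)"
}

# Example: check_shares /opt/orama/.orama/data/vault/shares
```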
Test Push/Pull
# Push a test share
curl -X POST http://127.0.0.1:7500/v1/vault/push \
-H "Content-Type: application/json" \
-d '{
"identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
"share": "dGVzdCBzaGFyZSBkYXRh",
"version": 1
}'
# Expected: {"status":"stored"}
# Pull it back
curl -X POST http://127.0.0.1:7500/v1/vault/pull \
-H "Content-Type: application/json" \
-d '{
"identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}'
# Expected: {"share":"dGVzdCBzaGFyZSBkYXRh"}
Delete a Share (Emergency)
Warning: Deleting a share is destructive and cannot be undone. Only do this if you understand the implications for the user's recovery capability.
# Remove a specific identity's share directory
rm -rf /opt/orama/.orama/data/vault/shares/<identity_hex>
Disaster Recovery
Failure Scenarios
| Scenario | Impact | Recovery |
|---|---|---|
| 1 node dies | No impact. N-1 nodes remain, still at least K. | Replace node, shares will be re-pushed by clients. |
| Up to N-K nodes die | No impact. At least K nodes survive -- the reconstruction minimum. | Replace nodes, reshare when quorum recovers. |
| N-K+1 nodes die | CRITICAL. Only K-1 nodes remain. Cannot reconstruct. | Data loss for users who only have vault-based recovery. Users with mnemonic (Path A) are unaffected. |
| All nodes die | TOTAL LOSS. All shares destroyed. | Users must recover via mnemonic (Path A). Vault recovery (Path B) is permanently lost. |
| Data directory corrupted on 1 node | HMAC integrity check fails on read. Node returns errors for affected shares. | Delete corrupted share directory. Repair protocol will re-distribute once implemented. |
Key Insight
The system can tolerate losing up to N - K nodes without any data loss. With default thresholds:
| Cluster Size | K | Max Node Loss |
|---|---|---|
| 5 nodes | 3 | 2 nodes |
| 14 nodes | 4 | 10 nodes |
| 50 nodes | 16 | 34 nodes |
| 100 nodes | 33 | 67 nodes |
Backup Strategy
The data directory can be backed up with standard tools:
# Rsync backup
rsync -av /opt/orama/.orama/data/vault/ /backup/vault/
# Tarball backup
tar czf vault-backup-$(date +%Y%m%d).tar.gz -C /opt/orama/.orama/data/vault .
However, backups are generally unnecessary because:
- Every node stores a share, so the cluster itself is the redundancy.
- Shares are re-pushed by clients on updates.
- The proactive re-sharing protocol (when fully wired) will redistribute shares automatically.
Backups are only useful if you fear simultaneous catastrophic failure of many nodes.
Capacity Planning
Per-Share Storage
Each user's share directory contains:
| File | Typical Size | Description |
|---|---|---|
| `share.bin` | ~1 KB | Encrypted share data (same size as original secret) |
| `version` | ~1-20 bytes | Version counter as ASCII digits |
| `checksum.bin` | 32 bytes | HMAC-SHA256 checksum |
| Directory entry | ~4 KB | Filesystem overhead (depends on filesystem) |
Total per user per node: ~5 KB (including filesystem overhead).
Cluster-Wide Storage
With all-node replication, each user has one share on every node:
| Users | Per Node | Per Cluster (14 nodes) |
|---|---|---|
| 1,000 | ~5 MB | ~70 MB |
| 10,000 | ~50 MB | ~700 MB |
| 100,000 | ~500 MB | ~7 GB |
| 1,000,000 | ~5 GB | ~70 GB |
At 1 million users, each node stores approximately 5 GB. This is well within the capability of any modern VPS.
Memory Usage
The guardian uses minimal memory:
- Static binary: ~2-5 MB RSS.
- Per-connection: ~128 KB (64 KB read buffer + 64 KB write buffer).
- Single-threaded: only one connection is active at a time (MVP).
- No in-memory caching of shares: every read/write goes to disk.
The systemd MemoryMax=512M limit is generous for the current architecture. Actual usage is typically under 20 MB.
Inode Usage
Each user creates one directory and 2-3 files. At 1 million users:
- ~4 million inodes.
- Most Linux filesystems default to millions of inodes (ext4: 1 inode per 16 KB).
- A 100 GB partition with ext4 defaults has ~6.5 million inodes.
This is unlikely to be a bottleneck but is worth monitoring on small partitions.
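The estimate above can be reproduced with shell arithmetic (per-user counts are the guide's own figures):

```shell
users=1000000
inodes_per_user=4   # 1 directory + up to 3 files (share.bin, version, checksum.bin)
echo $(( users * inodes_per_user ))   # prints 4000000
```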
Cluster Scaling
Adding Nodes
When a new node joins the Orama network:
- The vault guardian starts and registers itself via RQLite.
- Other guardians detect the new node via the discovery module (join event).
- The new node initially has zero shares.
- Shares are populated in two ways:
- Client push: When clients push new versions, they include the new node.
- Repair protocol: The re-sharing protocol redistributes shares to include the new node (Phase 2).
The threshold K is recomputed based on the alive count: K = max(3, floor(N/3)).
Removing Nodes
When a node leaves (graceful shutdown or failure):
- Other guardians detect the departure via missed heartbeats (suspect at 15s, dead at 60s).
- The departed node's shares are lost.
- If N - K nodes remain alive, no data is lost.
- If the departure drops below the safety threshold (K+1 alive), the repair protocol triggers re-sharing to adjust.
Threshold Adjustment
The threshold is dynamic and automatic:
K = max(3, floor(alive_count / 3))
- Adding nodes generally does not change K until the cluster grows significantly.
- Removing nodes may reduce K if the alive count drops enough.
- K never drops below 3, ensuring a minimum collusion resistance.
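The threshold rule can be sketched as a shell function (`threshold_k` is an illustrative name, not a guardian command):

```shell
# K = max(3, floor(alive_count / 3))
threshold_k() {
  local alive=$1
  local k=$(( alive / 3 ))   # integer division = floor
  [ "$k" -lt 3 ] && k=3      # never below 3: minimum collusion resistance
  echo "$k"
}

threshold_k 5    # -> 3
threshold_k 14   # -> 4
```

These values match the "Key Insight" table above (K = 16 at 50 nodes, K = 33 at 100 nodes).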
Write Quorum
Write quorum requires supermajority acknowledgment:
W = ceil(2/3 * alive_count)
| Alive | Write Quorum (W) |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 4 |
| 14 | 10 |
| 100 | 67 |
A push succeeds only if W guardians acknowledge storage. This ensures consistency even with some nodes being slow or temporarily unreachable.
Note (v0.1.0): Write quorum is computed but not enforced in the current single-node push handler. Multi-guardian fan-out is Phase 2.
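The quorum formula maps directly to integer arithmetic (`write_quorum` is an illustrative helper, not part of the guardian):

```shell
# W = ceil(2/3 * alive_count); with floor division,
# ceil(2a / 3) == (2a + 2) / 3.
write_quorum() {
  local alive=$1
  echo $(( (2 * alive + 2) / 3 ))
}

write_quorum 14   # -> 10
```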
Operational Checklist
Pre-Deploy
- Binary built with `-Doptimize=ReleaseSafe` for the correct target.
- Data directory exists with correct ownership and permissions.
- systemd service file installed and enabled.
- Firewall rules allow port 7500 (client) and 7501 (peer, WireGuard only).
- WireGuard is up and peers are connected.
Post-Deploy
- Health endpoint responds: `curl http://127.0.0.1:7500/v1/vault/health`
- Status endpoint shows correct config: `curl http://127.0.0.1:7500/v1/vault/status`
- No error-level log messages: `sudo journalctl -u orama-vault -p err -n 10`
- Test push/pull cycle works (see "Test Push/Pull" section above).
Periodic Checks
- Health endpoint responds on all nodes.
- Share count is consistent across nodes (same identities on each).
- Disk usage is within expected bounds.
- No repeated error messages in the journal.
- systemd reports `active (running)` with uptime matching expectations.
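For the share-count consistency check, one low-tech approach is to capture the identity listing from each node and diff them locally. A sketch assuming listings have been copied to local files such as node_a.txt and node_b.txt (hypothetical filenames; `diff_shares` is an illustrative helper):

```shell
# On each node: ls /opt/orama/.orama/data/vault/shares/ > listing.txt
# Copy the listings locally, then compare any two of them.
diff_shares() {
  comm -3 <(sort "$1") <(sort "$2")   # identities present in only one listing
}

# Example: diff_shares node_a.txt node_b.txt
# Empty output means both nodes hold shares for the same identities.
```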