# Orama Vault -- Operator Guide

## Monitoring

### Health Endpoint

The simplest way to check if a guardian is running:

```bash
curl -s http://127.0.0.1:7500/v1/vault/health | jq .
```

Expected response:

```json
{
  "status": "ok",
  "version": "0.1.0"
}
```

If this endpoint does not respond, the guardian process is not running or the port is blocked. Check systemd status first.
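
On hosts without `jq`, the same check can be scripted with plain `grep`. This is a sketch that assumes the response shape shown above; the `check_health` helper name is illustrative, not part of the guardian.

```bash
#!/bin/sh
# check_health: succeed only when a health response body reports "status": "ok".
# Pure shell + grep, so it works on minimal hosts without jq installed.
check_health() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Typical use against a live guardian:
#   body=$(curl -s --max-time 5 http://127.0.0.1:7500/v1/vault/health)
#   check_health "$body" || echo "guardian DOWN" >&2

# Demonstration against the documented response shape:
if check_health '{ "status": "ok", "version": "0.1.0" }'; then
  echo healthy
fi
```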

### Status Endpoint

Provides runtime configuration:

```bash
curl -s http://127.0.0.1:7500/v1/vault/status | jq .
```

Expected response:

```json
{
  "status": "ok",
  "version": "0.1.0",
  "data_dir": "/opt/orama/.orama/data/vault",
  "client_port": 7500,
  "peer_port": 7501
}
```

### Guardians Endpoint

Lists known guardian nodes in the cluster:

```bash
curl -s http://127.0.0.1:7500/v1/vault/guardians | jq .
```

> **Note (v0.1.0):** This currently returns only the local node. Full cluster listing requires RQLite integration (Phase 2).

### Systemd Journal

The guardian logs to stderr using Zig's structured logging, which is captured by the systemd journal:

```bash
# View recent logs
sudo journalctl -u orama-vault -n 50 --no-pager

# Follow live logs
sudo journalctl -u orama-vault -f

# View logs since last boot
sudo journalctl -u orama-vault -b

# View error-level logs only
sudo journalctl -u orama-vault -p err
```

Log messages include:

- `vault-guardian v0.1.0 starting` -- startup confirmation
- `config: <path>` -- config file path
- `listening on <addr>:<port> (client)` -- client listener bound
- `listening on <addr>:<port> (peer)` -- peer listener bound
- `data directory: <path>` -- data directory path
- `guardian ready -- starting HTTP server` -- initialization complete
- `stored share for identity <hex> (<n> bytes, version <v>)` -- successful push
- `served share for identity <hex> (<n> bytes)` -- successful pull
- `rejected rollback for <hex>: version <v> <= current <v>` -- anti-rollback rejection
- `accept error: <err>` -- TCP accept failure (non-fatal, retried)
- `connection error: <err>` -- individual connection handling error
- `failed to write share for <hex>: <err>` -- disk write failure
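
For a quick aggregate view, the rollback-rejection lines above can be tallied per identity with standard text tools. A sketch (the `tally_rollbacks` helper name is illustrative, not part of the tooling):

```bash
# Tally anti-rollback rejections per identity from journal output,
# most frequent first. Relies only on sed/sort/uniq.
tally_rollbacks() {
  # stdin: log lines; stdout: "count identity" pairs
  sed -n 's/.*rejected rollback for \([0-9a-f]*\):.*/\1/p' | sort | uniq -c | sort -rn
}

# Real use:
#   sudo journalctl -u orama-vault --no-pager | tally_rollbacks

# Demonstration on sample lines in the documented format:
printf '%s\n' \
  'rejected rollback for aabb: version 3 <= current 5' \
  'rejected rollback for aabb: version 4 <= current 5' \
  'rejected rollback for ccdd: version 1 <= current 2' | tally_rollbacks
```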

### Service Status

```bash
sudo systemctl status orama-vault
```

Check for:

- `Active: active (running)` -- service is up
- `Main PID: <pid>` -- process ID
- Memory and CPU usage in the status output

---

## Troubleshooting

### Port Already In Use

**Symptom:** Guardian fails to start with `failed to bind 0.0.0.0:7500: error.AddressInUse`

**Diagnosis:**

```bash
# Find what's using the port
sudo ss -tlnp | grep 7500
```

**Resolution:**

- If another vault-guardian is running: `sudo systemctl stop orama-vault` first.
- If another service is using port 7500: change the vault port with `--port <other>`.
- If the port is in TIME_WAIT state from a recent restart: wait 30-60 seconds. The guardian sets `SO_REUSEADDR`, which should handle most cases.

### Data Directory Permissions

**Symptom:** `failed to create data directory <path>: error.AccessDenied`

**Diagnosis:**

```bash
ls -la /opt/orama/.orama/data/vault/
```

**Resolution:**

```bash
sudo chown -R orama:orama /opt/orama/.orama/data/vault
sudo chmod 700 /opt/orama/.orama/data/vault
```

The systemd service uses `ProtectSystem=strict` with `ReadWritePaths=/opt/orama/.orama/data/vault`, so the data directory must be under this exact path or a CLI override must be used.

### RQLite Connectivity

**Symptom:** Log shows `failed to fetch node list from RQLite, running in single-node mode`

**Diagnosis:**

```bash
# Check if RQLite is running
sudo systemctl status orama-*-rqlite

# Test RQLite endpoint
curl -s http://127.0.0.1:4001/status | jq .store.raft.state
```

**Resolution:**

- This warning is non-fatal. The guardian continues in single-node mode.
- Ensure RQLite is started before the vault guardian (normal dependency ordering).
- Verify the `rqlite_url` in config matches the actual RQLite address.

> **Note (v0.1.0):** RQLite node discovery is a stub. The guardian always falls back to single-node mode. This warning is expected in the current version.

### Share Write Failures

**Symptom:** Push returns 500 Internal Server Error, and logs show `failed to write share for <hex>: <err>`

**Diagnosis:**

```bash
# Check disk space
df -h /opt/orama/.orama/data/vault

# Check inode usage
df -i /opt/orama/.orama/data/vault

# Check directory permissions
ls -la /opt/orama/.orama/data/vault/shares/
```

**Resolution:**

- If the disk is full: free space or expand the partition.
- If inodes are exhausted (unlikely, but possible with millions of users): clean up orphaned temp files.
- If permissions are wrong: fix ownership as shown above.
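
A small cron-friendly preflight can catch a filling disk before pushes start failing. This is a sketch; the 90% threshold and the `usage_pct` helper name are assumptions, not part of the guardian tooling.

```bash
# Warn before the data partition fills up. POSIX `df -P` column 5 is the
# used-capacity percentage; strip the trailing "%" and compare.
usage_pct() {
  # $1: mount point or path; prints used percentage as a bare number
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

threshold=90
pct=$(usage_pct /)    # use /opt/orama/.orama/data/vault on a real node
if [ "$pct" -ge "$threshold" ]; then
  echo "WARNING: disk ${pct}% full" >&2
fi
```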

### Anti-Rollback Rejections

**Symptom:** Push returns 400 with `"version must be greater than current stored version"`

This is normal behavior -- the client tried to push an older version of a share. Common causes:

- Client retry after a network timeout (the first push actually succeeded).
- Client software bug sending stale version numbers.

**Diagnosis:**

```bash
# Check current stored version for an identity
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version
```

**Resolution:** The client must send a version number strictly greater than the stored value. This is not a guardian bug.
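
The acceptance rule itself is simple enough to sketch in shell, which is handy when reasoning about client retries:

```bash
# The guardian's acceptance rule, sketched: a push is accepted only when
# the incoming version is strictly greater than the stored one.
accepts_push() {
  # $1: stored version, $2: incoming version
  [ "$2" -gt "$1" ]
}

accepts_push 5 6 && echo accepted    # newer version: stored
accepts_push 5 5 || echo rejected    # equal version: anti-rollback rejection
accepts_push 5 4 || echo rejected    # older version: anti-rollback rejection
```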

### Guardian Crash Loop

**Symptom:** `systemctl status` shows rapid restarts.

**Diagnosis:**

```bash
# View recent crash logs
sudo journalctl -u orama-vault -n 100 --no-pager | tail -50

# Check for OOM kills
sudo journalctl -k | grep -i "oom\|kill"
```

**Resolution:**

- If OOM killed: the 512 MiB memory limit may be too low. Check if share data has grown unexpectedly.
- If config parse error: check the config file syntax (or remove it to use defaults).
- If bind error: another process is using the port.

---

## Manual Operations

### Check Stored Shares

List all identities with stored shares:

```bash
ls /opt/orama/.orama/data/vault/shares/
```

Check a specific identity's share:

```bash
# View version
cat /opt/orama/.orama/data/vault/shares/<identity_hex>/version

# View share size
ls -la /opt/orama/.orama/data/vault/shares/<identity_hex>/share.bin

# View checksum
xxd /opt/orama/.orama/data/vault/shares/<identity_hex>/checksum.bin
```

### Count Total Shares

```bash
ls -d /opt/orama/.orama/data/vault/shares/*/ 2>/dev/null | wc -l
```
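
On very large share directories the glob in the one-liner above can hit argument-list limits; `find` avoids that. A sketch, demonstrated on a throwaway directory so it is safe to run anywhere:

```bash
# Count immediate subdirectories (one per identity) without expanding
# a shell glob, so it scales to millions of entries.
count_shares() {
  # $1: shares directory
  find "$1" -mindepth 1 -maxdepth 1 -type d 2>/dev/null | wc -l
}

demo=$(mktemp -d)
mkdir "$demo/aa11" "$demo/bb22" "$demo/cc33"
count_shares "$demo"    # counts the three directories just created
rm -rf "$demo"
```

On a real node: `count_shares /opt/orama/.orama/data/vault/shares`.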

### Verify Share Integrity Manually

The guardian verifies HMAC integrity on every read. To manually check if a share file has been corrupted:

```bash
# If you know the integrity key, you can compute HMAC externally:
# (The integrity key is internal to the guardian and not stored on disk in the current version)

# Check file exists and is non-empty
test -s /opt/orama/.orama/data/vault/shares/<identity>/share.bin && echo "OK" || echo "MISSING/EMPTY"
test -s /opt/orama/.orama/data/vault/shares/<identity>/checksum.bin && echo "OK" || echo "MISSING/EMPTY"
```
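
The existence checks above extend naturally into a sweep over every identity, flagging directories that are missing either file. A sketch (the `sweep_shares` helper name is illustrative), demonstrated on a temp directory; point it at `/opt/orama/.orama/data/vault/shares` on a real node:

```bash
# Flag incomplete share directories: missing or empty share.bin/checksum.bin.
sweep_shares() {
  dir="$1"
  for d in "$dir"/*/; do
    [ -d "$d" ] || continue
    id=$(basename "$d")
    { [ -s "$d/share.bin" ] && [ -s "$d/checksum.bin" ]; } || echo "INCOMPLETE: $id"
  done
}

demo=$(mktemp -d)
mkdir "$demo/good" "$demo/bad"
printf 'x' > "$demo/good/share.bin"
printf 'y' > "$demo/good/checksum.bin"
printf 'x' > "$demo/bad/share.bin"    # no checksum.bin -> flagged
sweep_shares "$demo"
rm -rf "$demo"
```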

### Test Push/Pull

```bash
# Push a test share
curl -X POST http://127.0.0.1:7500/v1/vault/push \
  -H "Content-Type: application/json" \
  -d '{
    "identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
    "share": "dGVzdCBzaGFyZSBkYXRh",
    "version": 1
  }'

# Expected: {"status":"stored"}

# Pull it back
curl -X POST http://127.0.0.1:7500/v1/vault/pull \
  -H "Content-Type: application/json" \
  -d '{
    "identity": "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
  }'

# Expected: {"share":"dGVzdCBzaGFyZSBkYXRh"}
```
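
The `share` field is base64-encoded. For reference, this is how the test payload above was produced, and how to decode a pulled share for inspection:

```bash
# Encode raw bytes for the "share" field of a push request
encoded=$(printf '%s' 'test share data' | base64)
echo "$encoded"                     # dGVzdCBzaGFyZSBkYXRh

# Decode a share returned by pull
printf '%s' "$encoded" | base64 -d  # test share data
```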

### Delete a Share (Emergency)

> **Warning:** Deleting a share is destructive and cannot be undone. Only do this if you understand the implications for the user's recovery capability.

```bash
# Remove a specific identity's share directory
rm -rf /opt/orama/.orama/data/vault/shares/<identity_hex>
```

---

## Disaster Recovery

### Failure Scenarios

| Scenario | Impact | Recovery |
|----------|--------|----------|
| 1 node dies | No impact. N-1 nodes survive; any K of them can reconstruct. | Replace node; shares will be re-pushed by clients. |
| K-1 nodes die | No impact. N-K+1 nodes survive, still at least K. | Replace nodes, reshare when quorum recovers. |
| N-K nodes die | No impact. Exactly K nodes survive -- the minimum needed to reconstruct. | Replace nodes. |
| N-K+1 nodes die | **CRITICAL.** Only K-1 nodes remain. Cannot reconstruct. | Data loss for users who only have vault-based recovery. Users with mnemonic (Path A) are unaffected. |
| All nodes die | **TOTAL LOSS.** All shares destroyed. | Users must recover via mnemonic (Path A). Vault recovery (Path B) is permanently lost. |
| Data directory corrupted on 1 node | HMAC integrity check fails on read. Node returns errors for affected shares. | Delete corrupted share directory. Repair protocol will re-distribute once implemented. |

### Key Insight

The system can tolerate losing up to **N - K** nodes without any data loss. With default thresholds:

| Cluster Size | K | Max Node Loss |
|--------------|---|---------------|
| 5 nodes | 3 | 2 nodes |
| 14 nodes | 4 | 10 nodes |
| 50 nodes | 16 | 34 nodes |
| 100 nodes | 33 | 67 nodes |
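
The table follows directly from `K = max(3, floor(N/3))`; this sketch recomputes it so an operator can check any cluster size:

```bash
# Threshold K for a given cluster size N: floor(N/3), with a floor of 3.
threshold_k() {
  n=$1
  k=$(( n / 3 ))
  [ "$k" -lt 3 ] && k=3
  echo "$k"
}

# Reproduce the table: max tolerable node loss is N - K.
for n in 5 14 50 100; do
  k=$(threshold_k "$n")
  echo "N=$n K=$k max_loss=$(( n - k ))"
done
```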

### Backup Strategy

The data directory can be backed up with standard tools:

```bash
# Rsync backup
rsync -av /opt/orama/.orama/data/vault/ /backup/vault/

# Tarball backup
tar czf vault-backup-$(date +%Y%m%d).tar.gz -C /opt/orama/.orama/data/vault .
```
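
Before trusting a backup, it is worth confirming the archive lists and unpacks cleanly. A sketch using throwaway directories:

```bash
# Build a tiny stand-in data directory, archive it the same way as above,
# then list and restore the archive to prove it is readable.
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/shares/aa11"
printf 'data' > "$src/shares/aa11/share.bin"

tar czf "$dst/backup.tar.gz" -C "$src" .
tar tzf "$dst/backup.tar.gz" > /dev/null && echo "archive readable"
tar xzf "$dst/backup.tar.gz" -C "$dst"
test -s "$dst/shares/aa11/share.bin" && echo "restore OK"

rm -rf "$src" "$dst"
```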

However, backups are generally unnecessary because:

1. Every node stores a share, so the cluster itself is the redundancy.
2. Shares are re-pushed by clients on updates.
3. The proactive re-sharing protocol (when fully wired) will redistribute shares automatically.

Backups are only useful if you fear simultaneous catastrophic failure of many nodes.

---

## Capacity Planning

### Per-Share Storage

Each user's share directory contains:

| File | Typical Size | Description |
|------|--------------|-------------|
| `share.bin` | ~1 KB | Encrypted share data (same size as original secret) |
| `version` | ~1-20 bytes | Version counter as ASCII digits |
| `checksum.bin` | 32 bytes | HMAC-SHA256 checksum |
| Directory entry | ~4 KB | Filesystem overhead (depends on filesystem) |

**Total per user per node: ~5 KB** (including filesystem overhead).

### Cluster-Wide Storage

With all-node replication, each user has one share on every node:

| Users | Per Node | Per Cluster (14 nodes) |
|-------|----------|------------------------|
| 1,000 | ~5 MB | ~70 MB |
| 10,000 | ~50 MB | ~700 MB |
| 100,000 | ~500 MB | ~7 GB |
| 1,000,000 | ~5 GB | ~70 GB |

At 1 million users, each node stores approximately 5 GB. This is well within the capability of any modern VPS.
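
The table is simple arithmetic: per-node storage is users times ~5 KB, and cluster-wide storage is per-node times the node count (all-node replication). A sketch for checking other cluster sizes (the `estimate_kb` helper name is illustrative):

```bash
# Back-of-envelope capacity estimate in kilobytes.
estimate_kb() {
  users=$1; nodes=$2
  per_node_kb=$(( users * 5 ))
  echo "users=$users per_node_kb=$per_node_kb cluster_kb=$(( per_node_kb * nodes ))"
}

estimate_kb 1000 14       # ~5 MB per node, ~70 MB cluster-wide
estimate_kb 1000000 14    # ~5 GB per node, ~70 GB cluster-wide
```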

### Memory Usage

The guardian uses minimal memory:

- Static binary: ~2-5 MB RSS.
- Per-connection: ~128 KB (64 KB read buffer + 64 KB write buffer).
- Single-threaded: only one connection is active at a time (MVP).
- No in-memory caching of shares: every read/write goes to disk.

The systemd `MemoryMax=512M` limit is generous for the current architecture. Actual usage is typically under 20 MB.

### Inode Usage

Each user creates one directory and 2-3 files. At 1 million users:

- ~4 million inodes.
- Most Linux filesystems default to millions of inodes (ext4: 1 inode per 16 KB).
- A 100 GB partition with ext4 defaults has ~6.5 million inodes.

This is unlikely to be a bottleneck but is worth monitoring on small partitions.

---

## Cluster Scaling

### Adding Nodes

When a new node joins the Orama network:

1. The vault guardian starts and registers itself via RQLite.
2. Other guardians detect the new node via the discovery module (join event).
3. The new node initially has zero shares.
4. Shares are populated in two ways:
   - **Client push:** When clients push new versions, they include the new node.
   - **Repair protocol:** The re-sharing protocol redistributes shares to include the new node (Phase 2).

The threshold K is recomputed based on the alive count: `K = max(3, floor(N/3))`.

### Removing Nodes

When a node leaves (graceful shutdown or failure):

1. Other guardians detect the departure via missed heartbeats (suspect at 15s, dead at 60s).
2. The departed node's shares are lost.
3. As long as at least K nodes remain alive, no data is lost.
4. If the departure drops the cluster below the safety threshold (K+1 alive), the repair protocol triggers re-sharing to adjust.

### Threshold Adjustment

The threshold is dynamic and automatic:

```
K = max(3, floor(alive_count / 3))
```

- Adding nodes generally does not change K until the cluster grows significantly.
- Removing nodes may reduce K if the alive count drops enough.
- K never drops below 3, ensuring a minimum collusion resistance.

### Write Quorum

Write quorum requires supermajority acknowledgment:

```
W = ceil(2/3 * alive_count)
```

| Alive | Write Quorum (W) |
|-------|------------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 3 |
| 5 | 4 |
| 14 | 10 |
| 100 | 67 |

A push succeeds only if W guardians acknowledge storage. This ensures consistency even when some nodes are slow or temporarily unreachable.
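
`ceil(2/3 * alive_count)` can be computed with integer arithmetic as `(2 * alive + 2) / 3` under truncating division, which reproduces the table above:

```bash
# Integer-only write quorum: ceil(2a/3) == (2a + 2) / 3 with truncation.
write_quorum() {
  echo $(( (2 * $1 + 2) / 3 ))
}

for a in 1 2 3 4 5 14 100; do
  echo "alive=$a W=$(write_quorum "$a")"
done
```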

> **Note (v0.1.0):** Write quorum is computed but not enforced in the current single-node push handler. Multi-guardian fan-out is Phase 2.

---

## Operational Checklist

### Pre-Deploy

- [ ] Binary built with `-Doptimize=ReleaseSafe` for the correct target.
- [ ] Data directory exists with correct ownership and permissions.
- [ ] systemd service file installed and enabled.
- [ ] Firewall rules allow port 7500 (client) and 7501 (peer, WireGuard only).
- [ ] WireGuard is up and peers are connected.

### Post-Deploy

- [ ] Health endpoint responds: `curl http://127.0.0.1:7500/v1/vault/health`
- [ ] Status endpoint shows correct config: `curl http://127.0.0.1:7500/v1/vault/status`
- [ ] No error-level log messages: `sudo journalctl -u orama-vault -p err -n 10`
- [ ] Test push/pull cycle works (see "Test Push/Pull" section above).

### Periodic Checks

- [ ] Health endpoint responds on all nodes.
- [ ] Share count is consistent across nodes (same identities on each).
- [ ] Disk usage is within expected bounds.
- [ ] No repeated error messages in the journal.
- [ ] systemd reports `active (running)` with uptime matching expectations.