orama/docs/SECURITY.md
anonpenguin23 e2b6f7d721 docs: add security hardening and OramaOS deployment docs
- Document WireGuard IPv6 disable, service auth, token security, process isolation
- Introduce OramaOS architecture, enrollment flow, and management via Gateway API
- Add troubleshooting for RQLite/Olric auth, OramaOS LUKS/enrollment issues
2026-02-28 15:41:04 +02:00

# Security Hardening
This document describes all security measures applied to the Orama Network, covering both Phase 1 (service hardening on existing Ubuntu nodes) and Phase 2 (OramaOS locked-down image).
## Phase 1: Service Hardening
These measures apply to all nodes (Ubuntu and OramaOS).
### Network Isolation
**CIDR Validation (Step 1.1)**
- WireGuard subnet restricted to `10.0.0.0/24` across all components: firewall rules, rate limiter, auth module, and WireGuard PostUp/PostDown iptables rules
- Prevents other tenants on shared VPS providers from bypassing the firewall via overlapping `10.x.x.x` ranges
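
The subnet check can be sketched as follows. This is a minimal illustration using Python's standard `ipaddress` module; the helper name `is_wireguard_peer` is hypothetical and not taken from the Orama codebase.

```python
import ipaddress

# Subnet from the text above; everything outside this /24 is rejected,
# including other 10.x.x.x addresses that a co-tenant might control.
WG_SUBNET = ipaddress.ip_network("10.0.0.0/24")


def is_wireguard_peer(remote_ip: str) -> bool:
    """Return True only if remote_ip parses and lies inside the WireGuard /24."""
    try:
        return ipaddress.ip_address(remote_ip) in WG_SUBNET
    except ValueError:
        # Unparseable input is treated as "not a peer".
        return False
```

Matching on the exact `/24` rather than a `10.` string prefix is what closes the overlapping-range bypass.
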

**IPv6 Disabled (Step 1.2)**
- IPv6 disabled system-wide via sysctl: `net.ipv6.conf.all.disable_ipv6=1`
- Prevents services that listen on all interfaces from being reachable over IPv6, for which no firewall rules existed
### Authentication
**Internal Endpoint Auth (Step 1.3)**
- `/v1/internal/wg/peers` and `/v1/internal/wg/peer/remove` now require cluster secret validation
- Peer removal additionally validates the request originates from a WireGuard subnet IP

**RQLite Authentication (Step 1.7)**
- RQLite runs with `-auth` flag pointing to a credentials file
- All RQLite HTTP requests include `Authorization: Basic <base64>` headers
- Credentials generated at cluster genesis, distributed to joining nodes via join response
- Both the central RQLite client wrapper and the standalone CoreDNS RQLite client send auth
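
For reference, the `Authorization: Basic <base64>` value is the Base64 encoding of `username:password`. The sketch below shows how a client might build the header; the function name is illustrative, and how Orama loads the credentials is not shown.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic auth header for a username/password pair."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Basic auth only encodes the credentials; it does not encrypt them, so the header should travel only over WireGuard or TLS.
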

**Olric Gossip Encryption (Step 1.8)**
- Olric memberlist uses a 32-byte encryption key for all gossip traffic
- Key generated at genesis, distributed via join response
- Prevents rogue nodes from joining the gossip ring and poisoning caches
- Note: encryption is all-or-nothing (coordinated restart required when enabling)
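
A minimal sketch of genesis-time key generation, assuming the key is drawn from the operating system's CSPRNG. The function names are hypothetical, and the way the key is encoded in the join response is not specified here.

```python
import secrets

GOSSIP_KEY_SIZE = 32  # bytes, per the text above


def new_gossip_key() -> bytes:
    """Generate a fresh gossip encryption key from the OS CSPRNG."""
    return secrets.token_bytes(GOSSIP_KEY_SIZE)


def validate_gossip_key(key: bytes) -> None:
    """Reject a key of the wrong size before it is handed to the gossip layer."""
    if len(key) != GOSSIP_KEY_SIZE:
        raise ValueError(f"gossip key must be {GOSSIP_KEY_SIZE} bytes, got {len(key)}")
```
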

**IPFS Cluster TrustedPeers (Step 1.9)**
- IPFS Cluster `TrustedPeers` populated with actual cluster peer IDs (was `["*"]`)
- New peers added to TrustedPeers on all existing nodes during join
- Prevents unauthorized peers from controlling IPFS pinning

**Vault V1 Auth Enforcement (Step 1.14)**
- V1 push/pull endpoints require a valid session token when vault-guardian is configured
- Previously, auth was optional for backward compatibility — any WG peer could read/overwrite Shamir shares
### Token & Key Storage
**Refresh Token Hashing (Step 1.5)**
- Refresh tokens stored as SHA-256 hashes in RQLite (never plaintext)
- On lookup: hash the incoming token, query by hash
- On revocation: hash before revoking (both single-token and by-subject)
- Existing tokens invalidated on upgrade (users re-authenticate)
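
The hash-on-write, hash-on-lookup pattern above can be sketched like this. The `RefreshTokenStore` class is a hypothetical in-memory stand-in for the RQLite table, not the actual schema.

```python
import hashlib
import secrets


def hash_token(token: str) -> str:
    """SHA-256 hex digest of a refresh token; only this value is stored."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class RefreshTokenStore:
    """In-memory stand-in for the token table: maps token hash -> subject."""

    def __init__(self):
        self._by_hash = {}

    def issue(self, subject: str) -> str:
        token = secrets.token_urlsafe(32)
        self._by_hash[hash_token(token)] = subject
        return token  # the plaintext goes to the client and is never stored

    def lookup(self, token: str):
        # Hash the incoming token, then query by hash.
        return self._by_hash.get(hash_token(token))

    def revoke(self, token: str) -> None:
        self._by_hash.pop(hash_token(token), None)

    def revoke_subject(self, subject: str) -> None:
        self._by_hash = {h: s for h, s in self._by_hash.items() if s != subject}
```

A plain, unsalted SHA-256 is adequate here because refresh tokens are long random strings, not user-chosen passwords.
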

**API Key Hashing (Step 1.6)**
- API keys stored as HMAC-SHA256 hashes using a server-side secret
- HMAC secret generated at cluster genesis, stored in `~/.orama/secrets/api-key-hmac-secret`
- On lookup: compute HMAC, query by hash — fast enough for every request (unlike bcrypt)
- In-memory cache uses raw key as cache key (never persisted)
- During rolling upgrade: dual lookup (HMAC first, then raw as fallback) until all nodes upgraded
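
The HMAC lookup with rolling-upgrade fallback can be sketched as below. The `table` dict stands in for the RQLite key table, and both function names are illustrative.

```python
import hashlib
import hmac


def hmac_api_key(secret: bytes, api_key: str) -> str:
    """HMAC-SHA256 of the API key under the server-side secret, as hex."""
    return hmac.new(secret, api_key.encode("utf-8"), hashlib.sha256).hexdigest()


def find_key_record(table: dict, secret: bytes, api_key: str):
    """Dual lookup: try the HMAC form first, then the legacy raw key."""
    record = table.get(hmac_api_key(secret, api_key))
    if record is None:
        # Fallback for rows written before the upgrade; remove this branch
        # once every node stores HMAC hashes.
        record = table.get(api_key)
    return record
```

Because the HMAC secret never leaves the nodes, a leaked database dump cannot be used to test guessed keys offline.
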

**TURN Secret Encryption (Step 1.15)**
- TURN shared secrets encrypted at rest in RQLite using AES-256-GCM
- Encryption key derived via HKDF from the cluster secret with purpose string `"turn-encryption"`
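
The key derivation can be illustrated with a small HKDF (RFC 5869) implementation built on the standard library. The hash function (SHA-256) and the empty salt are assumptions, since the text above names only HKDF and the purpose string. The 32-byte output matches the AES-256 key size.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF extract-then-expand (RFC 5869) using HMAC-SHA256."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested length too large for HKDF-SHA256")
    if not salt:
        salt = b"\x00" * HASH_LEN  # RFC 5869 default when no salt is given
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_turn_key(cluster_secret: bytes) -> bytes:
    """Derive the AES-256 key using the purpose string from the text above."""
    return hkdf_sha256(cluster_secret, info=b"turn-encryption", length=32)
```

Binding the purpose string into the derivation means a key derived for TURN cannot be confused with one derived from the same cluster secret for another purpose.
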
### TLS & Transport
**InsecureSkipVerify Fix (Step 1.10)**
- During node join, TLS verification uses TOFU (Trust On First Use) in place of `InsecureSkipVerify`
- Invite token output includes the CA certificate fingerprint (SHA-256)
- Joining node verifies the server cert fingerprint matches before proceeding
- After join: CA cert stored locally for future connections
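
The fingerprint comparison might look like the sketch below. It assumes the fingerprint is the SHA-256 digest of the DER-encoded certificate, written as hex with optional colons; the exact encoding in the invite token is not specified here, and the function names are illustrative.

```python
import hashlib
import hmac


def cert_fingerprint(der_bytes: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def fingerprint_matches(der_bytes: bytes, expected: str) -> bool:
    """Compare a presented certificate against the pinned fingerprint."""
    normalized = expected.replace(":", "").strip().lower()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(cert_fingerprint(der_bytes).encode(), normalized.encode())
```

In Python, the DER bytes of a peer certificate can be obtained with `SSLSocket.getpeercert(binary_form=True)`. The join should abort if `fingerprint_matches` returns `False`.
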

**WebSocket Origin Validation (Step 1.4)**
- All WebSocket upgraders validate the `Origin` header against the node's configured domain
- Non-browser clients (no Origin header) are still allowed
- Prevents cross-site WebSocket hijacking attacks
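
A minimal sketch of the origin check, assuming an exact, case-insensitive match between the `Origin` host and the node's configured domain. Whether subdomains are also accepted is not stated above, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def origin_allowed(origin, domain: str) -> bool:
    """Decide whether a WebSocket upgrade request may proceed."""
    if not origin:
        # Non-browser clients send no Origin header and are allowed.
        return True
    try:
        host = urlsplit(origin).hostname
    except ValueError:
        return False  # malformed Origin value
    return host is not None and host.lower() == domain.lower()
```

Comparing the parsed hostname, not a substring of the header, rejects look-alike origins such as `https://node.example.com.evil.example`.
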
### Process Isolation
**Dedicated User (Step 1.11)**
- All services run as the `orama` user (not root)
- Caddy and CoreDNS get `AmbientCapabilities=CAP_NET_BIND_SERVICE` for ports 80/443 and 53
- WireGuard stays as root (kernel netlink requires it)
- vault-guardian already had proper hardening

**systemd Hardening (Step 1.12)**
- All service units include:
```ini
ProtectSystem=strict
ProtectHome=yes
NoNewPrivileges=yes
PrivateDevices=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
RestrictNamespaces=yes
ReadWritePaths=/opt/orama/.orama
```
- Applied to both template files (`pkg/environments/templates/`) and hardcoded unit generators (`pkg/environments/production/services.go`)
### Supply Chain
**Binary Signing (Step 1.13)**
- Build archives include `manifest.sig` — a rootwallet EVM signature of the manifest hash
- During install, the signature is verified against the embedded Orama public key
- Unsigned or tampered archives are rejected
## Phase 2: OramaOS
These measures apply only to OramaOS nodes (mainnet, devnet, testnet).
### Immutable OS
- **Read-only rootfs** — SquashFS with dm-verity integrity verification
- **No shell** — `/bin/sh` symlinked to `/bin/false`, no bash/ash/ssh
- **No SSH** — OpenSSH not included in the image
- **Minimal packages** — only what's needed for systemd, cryptsetup, and the agent
### Full-Disk Encryption
- **LUKS2** with AES-XTS-Plain64 on the data partition
- **Shamir's Secret Sharing** over GF(256) — LUKS key split across peer vault-guardians
- **Adaptive threshold** — K = max(3, N/3) where N is the number of peers
- **Key zeroing** — LUKS key wiped from memory immediately after use
- **Malicious share detection** — fetch K+1 shares when possible, verify consistency
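
The threshold rule and the K+1 consistency check can be sketched as below. Three things are assumptions: `N/3` is treated as integer division, the field uses the AES reduction polynomial `0x11B`, and shares are indexed by x-coordinates 1..N. None of the function names come from vault-guardian.

```python
import secrets
from itertools import combinations


def threshold(n: int) -> int:
    """K = max(3, N/3), with N/3 rounded down (assumed)."""
    return max(3, n // 3)


def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (assumed polynomial)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse via a^254 (valid for nonzero a)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def split(secret: bytes, n: int, k: int) -> dict:
    """Split secret into n shares {x: bytes}; any k of them reconstruct it."""
    shares = {x: bytearray() for x in range(1, n + 1)}
    for byte in secret:
        coeffs = [byte] + [secrets.randbelow(256) for _ in range(k - 1)]
        for x in shares:
            y = 0
            for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
                y = gf_mul(y, x) ^ c
            shares[x].append(y)
    return {x: bytes(s) for x, s in shares.items()}


def combine(shares: dict) -> bytes:
    """Lagrange interpolation at x = 0 over the given shares."""
    xs = list(shares)
    out = bytearray(len(shares[xs[0]]))
    for xi in xs:
        num, den = 1, 1
        for xj in xs:
            if xj != xi:
                num = gf_mul(num, xj)
                den = gf_mul(den, xi ^ xj)  # subtraction is XOR in GF(2^8)
        basis = gf_mul(num, gf_inv(den))
        for pos, y in enumerate(shares[xi]):
            out[pos] ^= gf_mul(y, basis)
    return bytes(out)


def consistent(shares: dict, k: int) -> bool:
    """With more than k shares, every k-subset must yield the same secret."""
    return len({combine(dict(sub)) for sub in combinations(shares.items(), k)}) == 1
```

Fetching one share beyond the threshold is what makes tampering detectable: a single altered share changes the result of every subset that includes it, so the subsets disagree.
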
### Service Sandboxing
Each service runs in isolated Linux namespaces:
- **CLONE_NEWNS** — mount namespace (filesystem isolation)
- **CLONE_NEWUTS** — hostname namespace
- **Dedicated UID/GID** — each service has its own user
- **Seccomp filtering** — per-service syscall allowlist

Note: CLONE_NEWPID is intentionally omitted. It would make each service PID 1 in its own namespace, which changes signal semantics: PID 1 ignores SIGTERM unless it has installed a handler for it.
### Signed Updates
- A/B partition scheme with systemd-boot and boot counting (`tries_left=3`)
- All updates signed with rootwallet EVM signature (secp256k1 + keccak256)
- Signer address: `0xb5d8a496c8b2412990d7D467E17727fdF5954afC`
- P2P distribution over WireGuard between nodes
- Automatic rollback on 3 consecutive boot failures
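
The boot-counting behaviour can be modelled with a toy state machine. This is a simplified illustration only: the `BootState` type and slot names are hypothetical, and it does not reflect how systemd-boot actually stores its counters.

```python
from dataclasses import dataclass

MAX_TRIES = 3  # matches tries_left=3 above


@dataclass
class BootState:
    active: str      # slot holding the newly installed update, "A" or "B"
    fallback: str    # previously known-good slot
    tries_left: int = MAX_TRIES


def after_boot(state: BootState, boot_ok: bool) -> BootState:
    """Return the state after one boot attempt of the active slot."""
    if boot_ok:
        # A successful boot confirms the slot and resets the counter.
        return BootState(state.active, state.fallback, MAX_TRIES)
    remaining = state.tries_left - 1
    if remaining <= 0:
        # Third consecutive failure: swap back to the known-good slot.
        return BootState(state.fallback, state.active, MAX_TRIES)
    return BootState(state.active, state.fallback, remaining)
```

Starting from `BootState("B", "A")`, three failed boots in a row return the node to slot `A`. A single successful boot at any point keeps it on `B`.
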
### Zero Operator Access
- Operators cannot read data on the machine (LUKS encrypted, no shell)
- Management only through Gateway API → agent over WireGuard
- All commands are logged and auditable
- No root access, no console access, no file system access
## Rollout Strategy
### Phase 1 Batches
```
Batch 1 (zero-risk, no restart):
- CIDR fix
- IPv6 disable
- Internal endpoint auth
- WebSocket origin check
Batch 2 (medium-risk, restart needed):
- Hash refresh tokens
- Hash API keys
- Binary signing
- Vault V1 auth enforcement
- TURN secret encryption
Batch 3 (high-risk, coordinated rollout):
- RQLite auth (followers first, leader last)
- Olric encryption (simultaneous restart)
- IPFS Cluster TrustedPeers
Batch 4 (infrastructure changes):
- InsecureSkipVerify fix
- Dedicated user
- systemd hardening
```
### Phase 2
1. Build and test OramaOS image in QEMU
2. Deploy to sandbox cluster alongside Ubuntu nodes
3. Verify interop and stability
4. Gradual migration: testnet → devnet → mainnet (one node at a time, maintaining Raft quorum)
## Verification
All changes are verified on the sandbox cluster before production deployment:
- `make test` — all unit tests pass
- `orama monitor report --env sandbox` — full cluster health
- Manual endpoint testing (e.g., curl without auth → 401)
- Security-specific checks (IPv6 listeners, RQLite auth, binary signatures)