- Add `ntfyFanoutResolver` to distribute push notifications across all active cluster nodes, ensuring delivery when nodes lack shared state.
- Refactor secrets encryption key derivation to use cluster-wide secrets via HKDF, replacing ephemeral per-node keys to fix cross-node decryption issues.
- Add unit tests for fan-out resolution logic and caching behavior.
Root cause of the recurring "turn.credentials → namespace_not_configured" on a
distant node: at converge the gateway resolves its TURN secret from the
namespace rqlite, and on a slow/just-restarted node that read fails ONCE, so
the gateway is written with TURN disabled. Removing the node is not a fix — the
software must tolerate a slow read.
Two-part fix (complements e7ed718's "don't blank a warm config"):
- RETRY the secret read (5×2s) at converge so a node whose rqlite is still
syncing waits for it to land instead of writing an empty block once. A
genuine decrypt failure still exhausts the retries → unresolved → the
running config is preserved.
- CACHE the resolved secret into the node's own cluster-state.json
(applyResolvedWebRTCToState), so the NEXT cold start reads it from disk —
chooseRestoreWebRTC is state-first and short-circuits before the DB. The
state struct already had TURNSharedSecret "for cold start" but nothing
populated it; now it's filled on every successful resolve (only rewritten
on change). Each node self-heals its own cache; nothing new is sent
cross-node.
cluster-state.json now carries the TURN secret, so both writers (local
saveLocalState and the remote SaveClusterState) are tightened to 0600 + chmod.
Stale-secret self-heals: disable/enable webrtc re-pushes every node's config
and the next converge re-caches the new value.
Dual-reviewed: code-quality APPROVED; security SECURE after the remote-write
0600 fix. Tests: cache populate + short-circuit, no-change, turn-only node.
VoIP call-invite pushes set no apns-expiration, so apns2 omits the header and
APNs store-and-forwards the push — delivering it minutes late and firing a
phantom "missed call" ring long after the call ended (and burning PushKit
goodwill, inviting throttling). Cap the VoIP apns-expiration to the ring window
(30s) so APNs delivers promptly or DISCARDS, never a stale invite. Alert pushes
keep the default store-and-forward so a message notification still lands after
the device reconnects.
Also surface HTTP 200 on a successful dispatch instead of leaving HTTPStatus at
0 — a successful push was logging "http=0", which reads like an opaque failure
and masked real false-success classes.
Tests: VoIP push carries an expiration within the ring-window cap; alert push
carries none. push package green.
A lost rotation response strands the client on a just-revoked token: the retry
hits res.Count==0 → genuine 401 → SIWE, which is impossible on a VoIP-woken
locked screen, so the call dies. This recurred under the reconnect storms from
today's gateway rolls.
Add an RFC 9700 §4.13.2 reuse grace: a refresh token revoked within 60s whose
grace_used_at is still NULL is accepted ONCE more and mints a fresh session.
The grace path skips the revoke CAS (the token is already revoked — the CAS
would 0-match and mis-fire the replay tripwire) and is locked instead by a
single-use CAS on grace_used_at, so a stolen token can't be replayed at
leisure. The window predicate is repeated on the CAS to close the
SELECT→UPDATE TOCTOU, and the grace SELECT excludes expired tokens.
Security (found + fixed in review): explicit revocation (RevokeToken /
/v1/auth/logout) now also stamps grace_used_at, so a deliberately-logged-out
token can never be grace-recovered — closes a logout-bypass where a just-
revoked token would otherwise be resurrectable for 60s. Transient rqlite
errors on the grace lookup/CAS surface as 503 (retryable), not 401, preserving
the #125 transient-vs-genuine distinction.
Migration 032 adds grace_used_at (additive ALTER, rolling-safe; NULL = grace
available, the window predicate keeps historically-revoked tokens ineligible).
Dual-reviewed: code-quality APPROVED; security SECURE after the logout-bypass
fix. Tests: lost-response recovery, single-use second-attempt 401, genuine bad
token 401, and the logout-bypass regression.
The HTTP client re-threw the raw platform error on a transport failure, so
callers (e.g. AnChat's JwtSession driving client.auth.refresh/challenge) could
only tell "couldn't reach the gateway" from a real HTTP error by string-
matching `TypeError: Network request failed`. The typed SDKError was built
only for the onNetworkError callback, never for the thrown error, and native
errors weren't retryable so they bubbled raw.
normalizeError() now maps a fetch failure -> SDKError{code:"NETWORK_ERROR",
httpStatus:0} and a timeout AbortError -> {code:"TIMEOUT",httpStatus:0}, and
that typed error is thrown to the caller. Genuine HTTP errors pass through
unchanged. httpStatus:0 is the stable "no HTTP response received" signal to
branch on; the original platform message is preserved for diagnostics.
Deliberately NOT auto-retrying network failures: a blind retry on a non-
idempotent POST like /v1/auth/refresh could burn the rotated refresh token on
a lost response. The SDK now just gives the app a typed error to drive its own
retry/failover.
Tests: tests/unit/http/network-errors-bug-129.test.ts (TypeError->NETWORK_ERROR,
AbortError->TIMEOUT, real 401 passes through, callback gets the typed error).
Full unit suite green, typecheck clean.
A freshly-joined namespace node could come up with TURN silently disabled
(turn.credentials -> namespace_not_configured) when GetWebRTCConfig errored:
the stored TURN shared secret was encrypted with a pre-rotation
cluster-secret-derived key and failed to decrypt, and the converge swallowed
that error into "WebRTC disabled", writing a TURN-disabled gateway config.
Distinguish "resolved but not enabled" (genuinely disabled, fine to write the
empty block) from "unresolved" (DB/decrypt error). chooseRestoreWebRTC's
dbFetch callback now returns a `resolved` bool; an unresolved lookup forces
enabled=false AND sets restoreWebRTC.unresolved. The converge then:
- logs the decrypt/read error loudly with the exact remediation
(`orama namespace disable webrtc` then `enable webrtc`) instead of
swallowing it;
- on the warm path, SKIPS ReconcileGateway so it doesn't rewrite a running
gateway's working WebRTC block to empty (preserves TURN);
- on the cold path, still spawns the gateway (the namespace needs one) but
warns loudly that TURN is degraded until the secret is regenerated.
Healthy nodes are unaffected: a node whose state file holds the secret
short-circuits before dbFetch, so a flaky/rotated DB cannot disable it.
Dual-reviewed (code-quality APPROVED, security SECURE — no secret material is
logged; operator disable still resolves to disabled, not unresolved).
Pure-function coverage in restore_webrtc_test.go: unresolved marking,
resolved-empty-is-disabled, and state-secret-wins-over-db-error.
- implement `wsJWTExpired` to validate token lifetime with a grace period
- capture jwt expiry at connection upgrade and update via auth.refresh
- close connections with custom code 4401 when tokens expire to force re-auth
- add unit tests to verify expiry logic and state transitions
Custom JWT claims survive token refresh: migration 031 adds the
custom-claims column to refresh tokens, the new gateway ClaimsProvider
re-resolves claims on refresh, and the serverless invoke path carries
them through. Includes refresh-rotation, WS-JWT middleware, and
claims-provider test coverage.
Reconciles the divergence created by the May-13 nightly history rewrite
(main absorbed pre-rewrite SHAs via PRs #90-92). Content conflicts all
resolve in nightly's favor; nightly is the deployed, verified branch
(v0.122.47 live on devnet).
- refactor(turn): extract decodeTURNConfig for testability
- feat(turn): add stealth domain fields to config
- fix(apns): nest custom data under "body" for expo-notifications compatibility
- Implement `resolveStealthCert` to use existing `*.<baseDomain>` wildcard certificates instead of dynamic Caddyfile provisioning.
- Avoids EROFS errors caused by `ProtectSystem=strict` on the orama-node service.
- Add strict validation to ensure stealth hosts are single-label subdomains covered by the wildcard.
b9d5f54 (stealth TURN discovery) emits a top-level `sni_router:` block
into node.yaml unconditionally, but only added a lenient ad-hoc parse
in the carry-forward logic — not the field on config.Config that
orama-node strict-decodes (KnownFields(true)) at boot. Identical
failure mode to the v0.122.42 secrets_encryption_key incident: the
unknown key fails the whole node.yaml parse and orama-node crash-loops.
Caught pre-deploy this time by the strict-decode gate check; devnet
never saw it. Regression test added alongside the v0.122.42 one in
decode_test.go.
- Add `turn_stealth_domain` to gateway config for stealth TURN support
- Introduce `turn_discovery` in `sni-router` to auto-discover per-namespace routes
- Add database migration to enable stealth TURN per namespace
- Document ephemeral state API in `SERVERLESS.md`
- add `secrets_encryption_key` to gateway config for serverless secrets
- implement durable TURN secret persistence to prevent config regen outages
- add regression test for gateway config loading and field mapping
v0.122.42 (f412425, secrets encryption) shipped the template emission,
the per-cluster secret generator, and the gateway.Config consumer — but
NOT the parse field on config.HTTPGatewayConfig. Phase 4 writes
`secrets_encryption_key` into node.yaml under the http_gateway section,
and pkg/config/yaml.go decodes with KnownFields(true) (strict). The
unknown field made every node.yaml parse fail, so orama-node exited 1
on every start and systemd crash-looped it (restart counter hit 380+ on
the first upgraded devnet node before the rolling controller halted).
Root cause: a generated-config field with no matching struct field under
strict unmarshal. Fix is the missing field. The runtime key itself is
still consumed from ~/.orama/secrets/secrets-encryption-key (pkg/node/
gateway.go), which already worked — so this one-field addition fully
restores boot AND the feature.
The standalone gateway (cmd/gateway/config.go) uses lenient parsing and
was unaffected.
Regression test in pkg/config/decode_test.go decodes a node.yaml
carrying secrets_encryption_key under strict mode.
- Add `raw_http_response` configuration to functions to allow verbatim HTTP responses
- Implement cluster-wide secrets encryption key generation and distribution for serverless functions
- Update documentation with UnifiedPush support for ntfy on Android/GrapheneOS
- Add `FileRouteReloader` to watch and atomically update routes from disk
- Refactor `main` to support seamless configuration updates without restarts
- Ensure existing routes are preserved if a reload encounters an error
- Remove the 2-second polling wait for gossipsub mesh formation in `Publish`
to eliminate unnecessary latency, relying on `FloodPublish` for delivery.
- Introduce a per-invocation publish budget (1000 messages) to prevent
potential flooding of the shared gossipsub router by WASM functions.
- Add regression tests to ensure `Publish` remains non-blocking and that
the publish budget is strictly enforced.
- Introduce `BatchQueryConsistency` with `ReadConsistencyNone` to allow
local SQLite reads, bypassing leader round-trips for performance.
- Add `function_invoke_async` host function to support non-blocking
fire-and-forget function execution.
- split webrtc route gating into `webrtcServeTURNCredentials` and `webrtcServeSFURoutes` to allow non-SFU gateways to mint TURN credentials
- update `chooseRestoreWebRTC` to correctly resolve configurations for nodes without local SFU ports
- add unit tests to verify independent route registration logic (bugboard #25)
- Add logic to reconcile gateway configuration drift for running instances
- Prevent unnecessary restart loops by verifying on-disk config state
- Add unit tests to validate synchronization logic and prevent regressions
- Add internal WebRTC management endpoints to public path exemption list
- Implement DB fallback for WebRTC configuration during cluster restore
- Add unit tests to verify WebRTC config precedence and state self-healing
- opt into `WithSysWalltime` and `WithSysNanotime` to prevent wazero from using a frozen sentinel clock
- add regression tests to verify real-time clock behavior in wasm execution
- ensure serverless functions receive accurate timestamps for audit and cursor logic
- Add `enable` and `disable` commands to manage function status
- Implement process re-exec in the upgrade orchestrator to ensure
Phase 4 config generation uses the newly-installed binary version
(fixes bugboard #15)
- wire PubSubDispatcher to host functions to support local wildcard
triggers for WASM-published topics
- implement batch deduplication by topic to prevent redundant trigger
invocations and bound fan-out
- propagate trigger depth through function invocations to maintain
recursion limits during local dispatch
- Update route registration logic to rely solely on SFUPort > 0, resolving a silent 404 issue where gateways with valid SFU configurations were incorrectly disabled.
- Retain WebRTCEnabled in config for backward compatibility with existing operator YAML and request schemas.
- Add unit tests to pin registration behavior and prevent future regressions.