77 Commits

Author SHA1 Message Date
anonpenguin23
9c213a166c feat(serverless,namespace): cut namespace gateway RPC latency (#708)
The 5-10s RPCs that broke calling were not cold-start — they were
per-RPC sequential rqlite reads, each forwarded to a raft leader that
geography-blind election had placed on a 256ms-distant node.

Lever A (serverless): cache function metadata + env vars in-process
(5s TTL, invalidated on deploy/enable/disable/delete) and stop the hot
invoke path re-fetching the function for the authorization check —
removes ~820ms of leader-routed pre-flight reads from every op.
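
A minimal sketch of the Lever A idea, assuming a generic in-process TTL cache with explicit invalidation. Only the 5s TTL and the invalidate-on-deploy behaviour come from the message above; `ttlCache`, `Get`, `Invalidate` and `simulate` are hypothetical names, and the clock is injected purely so the scenario is deterministic.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry[V any] struct {
	val     V
	expires time.Time
}

// ttlCache is a hypothetical cache, not the project's actual type.
type ttlCache[V any] struct {
	mu    sync.Mutex
	ttl   time.Duration
	now   func() time.Time
	items map[string]entry[V]
}

func newTTLCache[V any](ttl time.Duration, now func() time.Time) *ttlCache[V] {
	return &ttlCache[V]{ttl: ttl, now: now, items: make(map[string]entry[V])}
}

// Get returns a fresh cached value, or calls load (the slow read) and caches its result.
func (c *ttlCache[V]) Get(key string, load func() (V, error)) (V, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.items[key]; ok && c.now().Before(e.expires) {
		return e.val, nil
	}
	v, err := load()
	if err != nil {
		var zero V
		return zero, err
	}
	c.items[key] = entry[V]{val: v, expires: c.now().Add(c.ttl)}
	return v, nil
}

// Invalidate drops a key; it would be called on deploy/enable/disable/delete.
func (c *ttlCache[V]) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.items, key)
}

// simulate runs a fixed scenario and returns how many slow reads were made.
func simulate() int {
	clock := time.Unix(0, 0)
	c := newTTLCache[string](5*time.Second, func() time.Time { return clock })
	loads := 0
	load := func() (string, error) { loads++; return "meta", nil }

	c.Get("fn", load)                  // miss: slow read 1
	c.Get("fn", load)                  // hit: served from memory
	clock = clock.Add(6 * time.Second) // TTL elapsed
	c.Get("fn", load)                  // miss: slow read 2
	c.Invalidate("fn")                 // e.g. a redeploy
	c.Get("fn", load)                  // miss: slow read 3
	return loads
}

func main() {
	fmt.Println(simulate()) // prints 3
}
```

The sketch holds the lock across `load` for simplicity; a real hot path would likely want per-key loading so one slow read does not block unrelated keys.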

Lever B (namespace): a locality-aware leadership reconciler hands raft
leadership off a geographically-isolated namespace leader to the nearest
co-located voter, via rqlite's transfer-leadership API. All nodes stay
voters — membership, quorum and fault tolerance are unchanged. Cuts the
per-hop cost from ~274ms to ~20ms when a distant node had become leader.
2026-06-15 08:05:38 +03:00
anonpenguin23
34f9da6f8d feat(gateway): implement ntfy cluster fan-out and improve secrets encryption
- Add `ntfyFanoutResolver` to distribute push notifications across all active cluster nodes, ensuring delivery when nodes lack shared state.
- Refactor secrets encryption key derivation to use cluster-wide secrets via HKDF, replacing ephemeral per-node keys to fix cross-node decryption issues.
- Add unit tests for fan-out resolution logic and caching behavior.
2026-06-13 09:23:14 +03:00
anonpenguin23
2b184f0398 fix(namespace): make WebRTC config survive slow/cold node restarts (#130)
Root cause of the recurring "turn.credentials → namespace_not_configured" on a
distant node: at converge the gateway resolves its TURN secret from the
namespace rqlite, and on a slow/just-restarted node that read fails ONCE, so
the gateway is written with TURN disabled. Removing the node is not a fix — the
software must tolerate a slow read.

Two-part fix (complements e7ed718's "don't blank a warm config"):
  - RETRY the secret read (5×2s) at converge so a node whose rqlite is still
    syncing waits for it to land instead of writing an empty block once. A
    genuine decrypt failure still exhausts the retries → unresolved → the
    running config is preserved.
  - CACHE the resolved secret into the node's own cluster-state.json
    (applyResolvedWebRTCToState), so the NEXT cold start reads it from disk —
    chooseRestoreWebRTC is state-first and short-circuits before the DB. The
    state struct already had TURNSharedSecret "for cold start" but nothing
    populated it; now it's filled on every successful resolve (only rewritten
    on change). Each node self-heals its own cache; nothing new is sent
    cross-node.
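
The retry half of the fix can be sketched as a bounded retry loop. This is an illustration under assumptions: `retry`, `tryRead` and `errNotSynced` are hypothetical names, and only the 5 attempts with a 2s delay come from the message above. The sleep function is injected so the example runs instantly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errNotSynced = errors.New("secret not readable yet")

// retry calls read up to attempts times, sleeping delay between failures.
// Exhausting the attempts yields an error, i.e. the "unresolved" outcome.
func retry(attempts int, delay time.Duration, sleep func(time.Duration), read func() (string, error)) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := read()
		if err == nil {
			return v, nil
		}
		lastErr = err
		if i < attempts-1 {
			sleep(delay)
		}
	}
	return "", fmt.Errorf("unresolved after %d attempts: %w", attempts, lastErr)
}

// tryRead simulates a read that fails `failures` times before succeeding.
func tryRead(failures int) (calls int, resolved bool) {
	read := func() (string, error) {
		calls++
		if calls <= failures {
			return "", errNotSynced
		}
		return "turn-secret", nil
	}
	noSleep := func(time.Duration) {}
	_, err := retry(5, 2*time.Second, noSleep, read)
	return calls, err == nil
}

func main() {
	fmt.Println(tryRead(2))  // prints: 3 true
	fmt.Println(tryRead(10)) // prints: 5 false
}
```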

cluster-state.json now carries the TURN secret, so both writers (local
saveLocalState and the remote SaveClusterState) are tightened to 0600 + chmod.
Stale-secret self-heals: disable/enable webrtc re-pushes every node's config
and the next converge re-caches the new value.

Dual-reviewed: code-quality APPROVED; security SECURE after the remote-write
0600 fix. Tests: cache populate + short-circuit, no-change, turn-only node.
2026-06-13 08:12:48 +03:00
anonpenguin23
cf21668782 fix(push): cap VoIP apns-expiration to the ring window; record success status (#132)
VoIP call-invite pushes set no apns-expiration, so apns2 omits the header and
APNs store-and-forwards the push — delivering it minutes late and firing a
phantom "missed call" ring long after the call ended (and burning PushKit
goodwill, inviting throttling). Cap the VoIP apns-expiration to the ring window
(30s) so APNs delivers promptly or DISCARDS, never a stale invite. Alert pushes
keep the default store-and-forward so a message notification still lands after
the device reconnects.
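
As a rough illustration of the cap, the sketch below computes the `apns-expiration` value (a UNIX timestamp in seconds) per push type. The function names and the `"voip"`/`"alert"` strings are assumptions for the example; only the 30s ring window and the "alert pushes carry none" behaviour come from the message above.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// ringWindow is the 30s cap described above.
const ringWindow = 30 * time.Second

// expirationFor returns the expiration to send. The zero time means
// "omit the header", which leaves APNs on its default behaviour.
func expirationFor(pushType string, now time.Time) time.Time {
	if pushType == "voip" {
		return now.Add(ringWindow)
	}
	return time.Time{}
}

// expiryHeader renders the header value, or "" when the header is omitted.
func expiryHeader(pushType string, nowUnix int64) string {
	exp := expirationFor(pushType, time.Unix(nowUnix, 0))
	if exp.IsZero() {
		return ""
	}
	return strconv.FormatInt(exp.Unix(), 10)
}

func main() {
	fmt.Printf("%q %q\n", expiryHeader("voip", 1000), expiryHeader("alert", 1000))
	// prints: "1030" ""
}
```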

Also surface HTTP 200 on a successful dispatch instead of leaving HTTPStatus at
0 — a successful push was logging "http=0", which reads like an opaque failure
and masked real false-success classes.

Tests: VoIP push carries an expiration within the ring-window cap; alert push
carries none. push package green.
2026-06-12 17:49:44 +03:00
anonpenguin23
33600092a8 fix(auth): bounded single-use refresh-token reuse grace (#125)
A lost rotation response strands the client on a just-revoked token: the retry
hits res.Count==0 → genuine 401 → SIWE, which is impossible on a VoIP-woken
locked screen, so the call dies. This recurred under the reconnect storms from
today's gateway rolls.

Add an RFC 9700 §4.13.2 reuse grace: a refresh token revoked within 60s whose
grace_used_at is still NULL is accepted ONCE more and mints a fresh session.
The grace path skips the revoke CAS (the token is already revoked — the CAS
would 0-match and mis-fire the replay tripwire) and is locked instead by a
single-use CAS on grace_used_at, so a stolen token can't be replayed at
leisure. The window predicate is repeated on the CAS to close the
SELECT→UPDATE TOCTOU, and the grace SELECT excludes expired tokens.

Security (found + fixed in review): explicit revocation (RevokeToken /
/v1/auth/logout) now also stamps grace_used_at, so a deliberately-logged-out
token can never be grace-recovered — closes a logout-bypass where a just-
revoked token would otherwise be resurrectable for 60s. Transient rqlite
errors on the grace lookup/CAS surface as 503 (retryable), not 401, preserving
the #125 transient-vs-genuine distinction.
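
The single-use grace can be modelled in memory as below. This is a sketch, not the migration or the real query: the struct, the method names and the exact boundary handling are assumptions, and a mutex stands in for the database's compare-and-swap UPDATE. It shows the three properties described above: one recovery per token, a 60s window, and no recovery after explicit logout.

```go
package main

import (
	"fmt"
	"sync"
)

const graceWindowSec = 60

// refreshToken uses 0 to stand for SQL NULL in this sketch.
type refreshToken struct {
	revokedAt   int64
	graceUsedAt int64
	expiresAt   int64
}

type store struct {
	mu     sync.Mutex
	tokens map[string]*refreshToken
}

func newStore(expiresAt int64, ids ...string) *store {
	s := &store{tokens: make(map[string]*refreshToken)}
	for _, id := range ids {
		s.tokens[id] = &refreshToken{expiresAt: expiresAt}
	}
	return s
}

// rotate revokes a token through normal rotation; grace stays available.
func (s *store) rotate(id string, now int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if t := s.tokens[id]; t != nil {
		t.revokedAt = now
	}
}

// logout revokes explicitly and also stamps graceUsedAt, closing the bypass.
func (s *store) logout(id string, now int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if t := s.tokens[id]; t != nil {
		t.revokedAt = now
		t.graceUsedAt = now
	}
}

// claimGrace mimics a single UPDATE whose WHERE clause repeats every
// predicate, so check and write are one atomic step (no TOCTOU gap).
func (s *store) claimGrace(id string, now int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	t := s.tokens[id]
	if t == nil || t.revokedAt == 0 || t.graceUsedAt != 0 {
		return false
	}
	if now-t.revokedAt > graceWindowSec || now >= t.expiresAt {
		return false
	}
	t.graceUsedAt = now
	return true
}

func main() {
	s := newStore(10000, "rotated", "loggedout")
	s.rotate("rotated", 100)
	s.logout("loggedout", 100)
	fmt.Println(s.claimGrace("rotated", 130))   // true: first retry inside the window
	fmt.Println(s.claimGrace("rotated", 131))   // false: grace already used
	fmt.Println(s.claimGrace("loggedout", 130)) // false: explicit logout
}
```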

Migration 032 adds grace_used_at (additive ALTER, rolling-safe; NULL = grace
available, the window predicate keeps historically-revoked tokens ineligible).

Dual-reviewed: code-quality APPROVED; security SECURE after the logout-bypass
fix. Tests: lost-response recovery, single-use second-attempt 401, genuine bad
token 401, and the logout-bypass regression.
2026-06-12 17:42:36 +03:00
anonpenguin23
e7ed718965 fix(namespace): don't silently disable TURN on unresolvable WebRTC secret (#130)
A freshly-joined namespace node could come up with TURN silently disabled
(turn.credentials -> namespace_not_configured) when GetWebRTCConfig errored:
the stored TURN shared secret was encrypted with a pre-rotation
cluster-secret-derived key and failed to decrypt, and the converge swallowed
that error into "WebRTC disabled", writing a TURN-disabled gateway config.

Distinguish "resolved but not enabled" (genuinely disabled, fine to write the
empty block) from "unresolved" (DB/decrypt error). chooseRestoreWebRTC's
dbFetch callback now returns a `resolved` bool; an unresolved lookup forces
enabled=false AND sets restoreWebRTC.unresolved. The converge then:
  - logs the decrypt/read error loudly with the exact remediation
    (`orama namespace disable webrtc` then `enable webrtc`) instead of
    swallowing it;
  - on the warm path, SKIPS ReconcileGateway so it doesn't rewrite a running
    gateway's working WebRTC block to empty (preserves TURN);
  - on the cold path, still spawns the gateway (the namespace needs one) but
    warns loudly that TURN is degraded until the secret is regenerated.

Healthy nodes are unaffected: a node whose state file holds the secret
short-circuits before dbFetch, so a flaky/rotated DB cannot disable it.
Dual-reviewed (code-quality APPROVED, security SECURE — no secret material is
logged; operator disable still resolves to disabled, not unresolved).

Pure-function coverage in restore_webrtc_test.go: unresolved marking,
resolved-empty-is-disabled, and state-secret-wins-over-db-error.
2026-06-12 16:44:25 +03:00
anonpenguin23
4d700aed54 feat(gateway): enforce jwt expiry on persistent websockets
- implement `wsJWTExpired` to validate token lifetime with a grace period
- capture jwt expiry at connection upgrade and update via auth.refresh
- close connections with custom code 4401 when tokens expire to force re-auth
- add unit tests to verify expiry logic and state transitions
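
A hedged sketch of what an expiry check with a grace period might look like. The real `wsJWTExpired` signature is not shown in this log, so `jwtExpired`, its parameters, the 30s grace used in `main`, and the "zero expiry never expires" rule are all assumptions; only the 4401 close code is taken from the message above.

```go
package main

import (
	"fmt"
	"time"
)

// closeCodeJWTExpired is the custom close code named above. It falls in the
// 4000-4999 range that the WebSocket protocol reserves for private use.
const closeCodeJWTExpired = 4401

// jwtExpired reports whether now is past exp plus the grace period.
// Assumption: a zero exp means no expiry was captured, so never expire.
func jwtExpired(exp, now time.Time, grace time.Duration) bool {
	if exp.IsZero() {
		return false
	}
	return now.After(exp.Add(grace))
}

// expiredAtUnix is a convenience wrapper over UNIX-second inputs.
func expiredAtUnix(expUnix, nowUnix, graceSec int64) bool {
	return jwtExpired(time.Unix(expUnix, 0), time.Unix(nowUnix, 0),
		time.Duration(graceSec)*time.Second)
}

func main() {
	fmt.Println(expiredAtUnix(1000, 1020, 30)) // false: still inside the grace period
	fmt.Println(expiredAtUnix(1000, 1031, 30)) // true: close with the code below
	fmt.Println(closeCodeJWTExpired)           // 4401
}
```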
2026-06-12 10:12:21 +03:00
anonpenguin23
d113b75497 feat(auth): refresh-token custom claims hook (#548)
Custom JWT claims survive token refresh: migration 031 adds the
custom-claims column to refresh tokens, the new gateway ClaimsProvider
re-resolves claims on refresh, and the serverless invoke path carries
them through. Includes refresh-rotation, WS-JWT middleware, and
claims-provider test coverage.
2026-06-12 08:05:27 +03:00
anonpenguin23
cd8c717363 chore(version): bump to 0.122.47
- refactor(turn): extract decodeTURNConfig for testability
- feat(turn): add stealth domain fields to config
- fix(apns): nest custom data under "body" for expo-notifications compatibility
2026-06-11 11:45:12 +03:00
anonpenguin23
f4c58db710 release: 0.122.46
2026-06-11 10:06:19 +03:00
anonpenguin23
8375d92109 feat(namespace): reuse caddy wildcard certificate for stealth turns
- Implement `resolveStealthCert` to use existing `*.<baseDomain>` wildcard certificates instead of dynamic Caddyfile provisioning.
- Avoids EROFS errors caused by `ProtectSystem=strict` on the orama-node service.
- Add strict validation to ensure stealth hosts are single-label subdomains covered by the wildcard.
2026-06-11 10:04:45 +03:00
anonpenguin23
b425f80efb fix(config): add sni_router to root Config — prevents feat-124 boot crash
b9d5f54 (stealth TURN discovery) emits a top-level `sni_router:` block
into node.yaml unconditionally, but only added a lenient ad-hoc parse
in the carry-forward logic — not the field on config.Config that
orama-node strict-decodes (KnownFields(true)) at boot. Identical
failure mode to the v0.122.42 secrets_encryption_key incident: the
unknown key fails the whole node.yaml parse and orama-node crash-loops.

Caught pre-deploy this time by the strict-decode gate check; devnet
never saw it. Regression test added alongside the v0.122.42 one in
decode_test.go.
2026-06-11 08:00:31 +03:00
anonpenguin23
b9d5f542e1 feat(gateway): implement stealth TURN discovery and configuration
- Add `turn_stealth_domain` to gateway config for stealth TURN support
- Introduce `turn_discovery` in `sni-router` to auto-discover per-namespace routes
- Add database migration to enable stealth TURN per namespace
- Document ephemeral state API in `SERVERLESS.md`
2026-06-11 07:04:50 +03:00
anonpenguin23
ff3e273da8 feat(gateway): implement persistent secrets and webrtc configuration
- add `secrets_encryption_key` to gateway config for serverless secrets
- implement durable TURN secret persistence to prevent config regen outages
- add regression test for gateway config loading and field mapping
2026-06-10 12:10:52 +03:00
anonpenguin23
e685c864fc fix(config): add secrets_encryption_key to HTTPGatewayConfig — fixes orama-node boot crash
v0.122.42 (f412425, secrets encryption) shipped the template emission,
the per-cluster secret generator, and the gateway.Config consumer — but
NOT the parse field on config.HTTPGatewayConfig. Phase 4 writes
`secrets_encryption_key` into node.yaml under the http_gateway section,
and pkg/config/yaml.go decodes with KnownFields(true) (strict). The
unknown field made every node.yaml parse fail, so orama-node exited 1
on every start and systemd crash-looped it (restart counter hit 380+ on
the first upgraded devnet node before the rolling controller halted).

Root cause: a generated-config field with no matching struct field under
strict unmarshal. Fix is the missing field. The runtime key itself is
still consumed from ~/.orama/secrets/secrets-encryption-key (pkg/node/
gateway.go), which already worked — so this one-field addition fully
restores boot AND the feature.
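
The failure mode can be reproduced with the standard library alone. The sketch below swaps in `encoding/json` with `DisallowUnknownFields` as a stand-in for the YAML decoder's `KnownFields(true)`; the struct names and the `listen_addr` field are hypothetical, and only `secrets_encryption_key` comes from the message above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// gatewayBefore models the parse struct before the fix: the generator
// emits secrets_encryption_key, but no field accepts it.
type gatewayBefore struct {
	ListenAddr string `json:"listen_addr"`
}

// gatewayAfter adds the missing field, which is the entire fix.
type gatewayAfter struct {
	ListenAddr           string `json:"listen_addr"`
	SecretsEncryptionKey string `json:"secrets_encryption_key"`
}

// strictDecode rejects any key that has no matching struct field.
func strictDecode(doc string, out any) error {
	dec := json.NewDecoder(strings.NewReader(doc))
	dec.DisallowUnknownFields()
	return dec.Decode(out)
}

// generated stands in for the config that the upgrade writes to disk.
const generated = `{"listen_addr": ":6001", "secrets_encryption_key": "k"}`

func main() {
	fmt.Println(strictDecode(generated, &gatewayBefore{}) != nil) // true: boot would fail
	fmt.Println(strictDecode(generated, &gatewayAfter{}) == nil)  // true: boot restored
}
```

A regression test of this shape, decoding a fully generated config under strict mode, would presumably catch any future emitted key that lacks a struct field.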

The standalone gateway (cmd/gateway/config.go) uses lenient parsing and
was unaffected.

Regression test in pkg/config/decode_test.go decodes a node.yaml
carrying secrets_encryption_key under strict mode.
2026-06-09 15:57:32 +03:00
anonpenguin23
f41242538e feat(serverless): add raw http response mode and secrets encryption
- Add `raw_http_response` configuration to functions to allow verbatim HTTP responses
- Implement cluster-wide secrets encryption key generation and distribution for serverless functions
- Update documentation with UnifiedPush support for ntfy on Android/GrapheneOS
2026-06-09 13:01:02 +03:00
anonpenguin23
f8de4af704 feat(sni-router): implement hot-reloading for route configuration
- Add `FileRouteReloader` to watch and atomically update routes from disk
- Refactor `main` to support seamless configuration updates without restarts
- Ensure existing routes are preserved if a reload encounters an error
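
One way to get "atomic update, keep old routes on error" is to parse into a fresh map and swap a pointer only on success. This is a sketch under assumptions: `routeTable`, `parseRoutes` and the `host=backend` line format are invented for illustration, and the file watching itself is left out.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// routeTable is a hypothetical holder; readers never see a half-built map.
type routeTable struct {
	current atomic.Pointer[map[string]string]
}

// parseRoutes reads "host=backend" lines (an assumed format).
func parseRoutes(raw string) (map[string]string, error) {
	routes := make(map[string]string)
	for _, line := range strings.Split(raw, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		host, backend, ok := strings.Cut(line, "=")
		if !ok || host == "" || backend == "" {
			return nil, fmt.Errorf("malformed route line %q", line)
		}
		routes[host] = backend
	}
	return routes, nil
}

// Reload swaps in the new table only if the whole input parsed cleanly.
func (t *routeTable) Reload(raw string) error {
	routes, err := parseRoutes(raw)
	if err != nil {
		return err // previous routes stay in place
	}
	t.current.Store(&routes)
	return nil
}

// Lookup reads whichever table is current, without locking.
func (t *routeTable) Lookup(host string) (string, bool) {
	m := t.current.Load()
	if m == nil {
		return "", false
	}
	backend, ok := (*m)[host]
	return backend, ok
}

func main() {
	var rt routeTable
	fmt.Println(rt.Reload("turn.example.test=10.0.0.6:3478")) // <nil>
	fmt.Println(rt.Reload("not a route") != nil)              // true: rejected
	fmt.Println(rt.Lookup("turn.example.test"))               // 10.0.0.6:3478 true
}
```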
2026-06-09 09:23:54 +03:00
anonpenguin23
eade6e1742 feat(pubsub): remove mesh formation wait and add publish rate limiting
- Remove the 2-second polling wait for gossipsub mesh formation in `Publish`
  to eliminate unnecessary latency, relying on `FloodPublish` for delivery.
- Introduce a per-invocation publish budget (1000 messages) to prevent
  potential flooding of the shared gossipsub router by WASM functions.
- Add regression tests to ensure `Publish` remains non-blocking and that
  the publish budget is strictly enforced.
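
A per-invocation budget of this kind can be sketched as an atomic countdown. The names below (`publishBudget`, `publish`, `countAccepted`) are hypothetical; only the 1000-message figure comes from the message above.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var errBudgetExhausted = errors.New("publish budget exhausted")

// publishBudget is created once per invocation and shared by its publishes.
type publishBudget struct {
	remaining atomic.Int64
}

func newPublishBudget(n int64) *publishBudget {
	b := &publishBudget{}
	b.remaining.Store(n)
	return b
}

// publish spends one unit of budget before sending; once the budget is
// gone, it fails fast without touching the shared router.
func (b *publishBudget) publish(send func() error) error {
	if b.remaining.Add(-1) < 0 {
		return errBudgetExhausted
	}
	return send()
}

// countAccepted tries `attempts` publishes against a fresh budget.
func countAccepted(budget, attempts int64) int64 {
	b := newPublishBudget(budget)
	var accepted int64
	for i := int64(0); i < attempts; i++ {
		if b.publish(func() error { return nil }) == nil {
			accepted++
		}
	}
	return accepted
}

func main() {
	fmt.Println(countAccepted(1000, 1500)) // prints 1000
}
```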
2026-06-04 10:08:10 +03:00
anonpenguin23
9373c2ad92 feat(rqlite,serverless): add local read consistency and async invocation
- Introduce `BatchQueryConsistency` with `ReadConsistencyNone` to allow
  local SQLite reads, bypassing leader round-trips for performance.
- Add `function_invoke_async` host function to support non-blocking
  fire-and-forget function execution.
2026-06-01 19:59:30 +03:00
anonpenguin23
ca4ccbfcd4 feat(gateway): decouple turn credentials and sfu route registration
- split webrtc route gating into `webrtcServeTURNCredentials` and `webrtcServeSFURoutes` to allow non-SFU gateways to mint TURN credentials
- update `chooseRestoreWebRTC` to correctly resolve configurations for nodes without local SFU ports
- add unit tests to verify independent route registration logic (bugboard #25)
2026-06-01 10:12:07 +03:00
anonpenguin23
bf0d5f9f9f feat(namespace): implement warm reconciliation for gateway webrtc config
- Add logic to reconcile gateway configuration drift for running instances
- Prevent unnecessary restart loops by verifying on-disk config state
- Add unit tests to validate synchronization logic and prevent regressions
2026-05-30 19:26:26 +03:00
anonpenguin23
4fc975216f feat(gateway): fix WebRTC config persistence and endpoint access
- Add internal WebRTC management endpoints to public path exemption list
- Implement DB fallback for WebRTC configuration during cluster restore
- Add unit tests to verify WebRTC config precedence and state self-healing
2026-05-30 14:39:39 +03:00
anonpenguin23
325a2471c7 Changes
2026-05-29 11:46:20 +03:00
anonpenguin23
cfff08d91e feat(serverless): add turn_credentials host function and slow invocation diagnostics
- Implement `turn_credentials` host function to provide TURN configuration to WASM modules.
- Add structured logging for slow serverless invocations exceeding 5s, providing per-phase timing (rate-limit, module-load, execution) to identify performance bottlenecks.
- Enhance WebSocket handler logging to capture request context when 30s timeouts occur.
2026-05-28 09:54:24 +03:00
anonpenguin23
8fbc4485c1 fix(serverless): enable system clocks for wasm modules
- opt into `WithSysWalltime` and `WithSysNanotime` to prevent wazero from using a frozen sentinel clock
- add regression tests to verify real-time clock behavior in wasm execution
- ensure serverless functions receive accurate timestamps for audit and cursor logic
2026-05-26 10:53:07 +03:00
anonpenguin23
1faf04e2a3 feat(cli): add function enable/disable and fix upgrade re-exec
- Add `enable` and `disable` commands to manage function status
- Implement process re-exec in the upgrade orchestrator to ensure
  Phase 4 config generation uses the newly-installed binary version
  (fixes bugboard #15)
2026-05-25 10:25:04 +03:00
anonpenguin23
b2d35bbde1 feat(gateway): enable local wildcard triggers for pubsub
- wire PubSubDispatcher to host functions to support local wildcard
  triggers for WASM-published topics
- implement batch deduplication by topic to prevent redundant trigger
  invocations and bound fan-out
- propagate trigger depth through function invocations to maintain
  recursion limits during local dispatch
2026-05-25 09:34:01 +03:00
anonpenguin23
98dad46a81 fix(gateway): decouple webrtc route registration from legacy flag
- Update route registration logic to rely solely on SFUPort > 0, resolving a silent 404 issue where gateways with valid SFU configurations were incorrectly disabled.
- Retain WebRTCEnabled in config for backward compatibility with existing operator YAML and request schemas.
- Add unit tests to pin registration behavior and prevent future regressions.
2026-05-24 20:56:08 +03:00
anonpenguin23
62e4d1963b feat(gateway): add apns_voip provider support
- register "apns_voip" provider to handle PushKit/CallKit signals
- implement target provider filtering in dispatcher to prevent cross-talk
  between alert and VoIP push paths
- add comprehensive tests to ensure backward compatibility for fan-out
  and correct filtering behavior
2026-05-24 19:38:38 +03:00
anonpenguin23
ccbcea0f3f fix(serverless): prevent invocation context race condition
- Attach InvocationContext to the execution context in Engine.Execute to
  ensure host functions resolve identity from the request context.
- Fixes a race condition where concurrent stateless invocations would
  overwrite the global singleton, causing cross-tenant leaks or nil
  namespace errors.
- Added a regression test to verify per-invocation isolation under load.
2026-05-23 12:48:45 +03:00
anonpenguin23
e2bc9577ff feat(serverless): isolate invocation logs and enforce cron poll interval
- Fix log cross-contamination by introducing per-invocation LogBuffers
  (bugboard #108)
- Enforce a 100ms minimum for CronPollInterval to prevent scheduler
  starvation (bugboard #109)
- Add comprehensive validation tests for cron interval constraints
2026-05-21 15:52:46 +03:00
anonpenguin23
3b8139802c feat: APNs silent-drop guard + persistent-WS mid-session JWT refresh
#348 - APNs silent-drop guard
Apple's APNs silently returns HTTP 200 for pushes with no visible
content (no title, no body, no badge, no sound, no
content-available=1) and then drops them — which looked to the WASM
caller like a successful delivery. Now rejected up-front with the new
push.ErrEmptyContent sentinel, and the APNs provider returns the
structured push.PushError shape (HTTPStatus, Reason, Unregistered,
Wrapped) so the dispatcher can branch on Unregistered to remove dead
tokens automatically. Legacy ErrDeviceUnregistered sentinel is
preserved for errors.Is compatibility (wrapped inside PushError).

Always logs APNs HTTP response (status, reason, apns_id, token prefix)
so future silent-drop classes show up in operator logs.

content-available is also now correctly mapped from snake_case
Data["content_available"] (any truthy variant) into Apple's
canonical "content-available": 1 inside the aps dictionary.

#321 - mid-session JWT refresh on persistent WS
Long-lived persistent WS connections used to have to close+reconnect
when the JWT rolled — losing per-instance state, message queues, and
subscriptions. The handler now accepts an "auth.refresh" control
frame: client sends the new token, the gateway re-verifies it via
the new JWTVerifier interface, updates the per-instance invCtx
in-place (persistent.Instance.UpdateInvCtx), and acks. No close, no
state loss.

JWTVerifier is optional — handlers set it via SetJWTVerifier at
gateway init. When unwired, the handler nacks with a "not supported
on this gateway" response and clients fall back to the old
close+reconnect path, so older deploys don't break.

Other:
- push/dispatcher.go: SendToUserDetailed returns per-device PushError
  shape so callers can act on Unregistered / HTTPStatus / Reason.
- serverless/hostfunctions/push.go: WASM host functions for the new
  detailed-error shape.
- serverless/persistent/instance.go: UpdateInvCtx mid-session.

Tests:
- ws_persistent_control_test.go: auth.refresh ack/nack paths.
- apns_test.go: empty-content rejection, PushError shape on 410 +
  generic non-200, content-available mapping.
- dispatcher_detailed_test.go: SendToUserDetailed result shape.
- instance_update_invctx_test.go: invCtx update is per-instance, not
  cross-tenant.

VERSION bumped to 0.122.27.
2026-05-19 18:19:21 +03:00
anonpenguin23
ebc9d51167 feat(gateway): implement pubsub dispatcher and batch query support
- Integrate PubSubDispatcher to enable libp2p subscription for trigger patterns
- Add BatchQuery to rqlite client to reduce round-trips for multi-query operations
- Implement lifecycle management for dispatcher and add safety limits for batch queries
2026-05-17 16:27:05 +03:00
anonpenguin23
17b06d38e4 fix(gateway,serverless): libp2p mesh peer-port + system-trigger auth bypass
Two serious bugs found via cross-node behavior observation:

1. libp2p peer-discovery published wrong port
   PeerDiscovery's multiaddr was using the gateway's HTTP API port (e.g.
   10004), not the actual libp2p TCP port. Remote gateways dialed that
   port, hit the HTTP server, received 400, and failed the libp2p
   multistream handshake ("message did not have trailing newline").
   Result: cluster-wide cross-node libp2p mesh had 0 connected peers
   and cross-node pubsub silently dropped 100% of messages.

   The libp2p port is OS-assigned at startup (client.go uses
   /ip4/0.0.0.0/tcp/0). It's not anywhere in cfg — it's only on
   host.Addrs(). Fix: drop the listenPort field from PeerDiscovery
   entirely and derive the port live from host.Addrs() via
   extractLibp2pTCPPort. WG IP still comes from getWireGuardIP
   (libp2p filters its own enumeration so WG IPs don't appear in
   host.Addrs(), but the listener is bound 0.0.0.0 so the port is
   reachable on the WG interface).

2. System triggers silently blocked by CanInvoke (#264)
   Cron, pubsub, database, timer, and job triggers all fire from
   gateway-internal state with no caller identity. Invoke() ran every
   request through CanInvoke(callerWallet) which returned false for
   the empty wallet — every fire returned ErrUnauthorized. Reported as
   a cron firing every minute with "unauthorized" for 19+ hours.

   Auth boundary for system triggers belongs at REGISTRATION time
   (POST /v1/functions/{name}/triggers, deploy-time auto-register
   from function.yaml). Skip the per-invocation check for system
   trigger types; user-driven triggers (HTTP, WebSocket) still gate
   on caller identity as before.
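
For item 1, deriving the port from the host's live listen addresses can be illustrated with plain string handling. This sketch parses the textual multiaddr form with the standard library rather than a multiaddr package, and `tcpPortFromMultiaddrs` is a hypothetical name, not the `extractLibp2pTCPPort` implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tcpPortFromMultiaddrs returns the first valid TCP port found in a list of
// textual multiaddrs such as "/ip4/0.0.0.0/tcp/43721".
func tcpPortFromMultiaddrs(addrs []string) (int, bool) {
	for _, a := range addrs {
		parts := strings.Split(strings.Trim(a, "/"), "/")
		for i := 0; i+1 < len(parts); i++ {
			if parts[i] != "tcp" {
				continue
			}
			port, err := strconv.Atoi(parts[i+1])
			if err == nil && port > 0 && port <= 65535 {
				return port, true
			}
		}
	}
	return 0, false
}

func main() {
	addrs := []string{
		"/ip4/127.0.0.1/udp/4001/quic-v1", // not TCP, skipped
		"/ip4/0.0.0.0/tcp/43721",          // OS-assigned listener port
	}
	fmt.Println(tcpPortFromMultiaddrs(addrs)) // prints: 43721 true
}
```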

Tests:
- gateway/peer_discovery_test.go covers extractLibp2pTCPPort.
- serverless/invoke_system_trigger_test.go covers the bypass and the
  user-trigger gate.

VERSION bumped to 0.122.25.
2026-05-16 15:43:18 +03:00
anonpenguin23
251630a5c7 fix(serverless): per-call invCtx propagation prevents cross-tenant identity leak in persistent WS
HostFunctions is a process-wide singleton (one per gateway engine).
Its `invCtx` field is shared across all WASM instances. For STATELESS
execution the executor sets/clears it per-call but the lock is
released before WASM runs — two concurrent invocations can race on
the field and one's host call can read the other's identity. Window
is microseconds.

For PERSISTENT WS the bug was much worse: invCtx used to be bound
ONCE at instantiation and reused for the connection's lifetime. Two
simultaneous persistent WS connections from different namespaces /
wallets overwrote each other's invCtx, and EVERY subsequent
function_invoke / GetCallerJWTSubject / GetCallerWallet / GetSecret
call from inside the WASM read whatever was bound LAST. Result:
silent identity leak across tenants for as long as the connections
overlapped.

Fix: per-call invCtx propagation through Go's context.Context.
wazero passes the ctx given to api.Function.Call through to host
function callbacks, so every WASM-host hop carries its own invCtx.
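
The context-carried approach can be sketched with `context.WithValue` and an unexported key type. The helper names match those listed below, but their signatures, the `InvocationContext` fields and `getCallerWallet` are assumptions made for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// InvocationContext carries caller identity; the fields are assumed.
type InvocationContext struct {
	Namespace    string
	CallerWallet string
}

// invCtxKey is unexported so no other package can collide with the key.
type invCtxKey struct{}

func WithInvocationContext(ctx context.Context, ic *InvocationContext) context.Context {
	return context.WithValue(ctx, invCtxKey{}, ic)
}

func InvocationContextFromCtx(ctx context.Context) (*InvocationContext, bool) {
	ic, ok := ctx.Value(invCtxKey{}).(*InvocationContext)
	return ic, ok
}

// getCallerWallet stands in for a host function callback: it resolves
// identity from the ctx of this call, never from shared state.
func getCallerWallet(ctx context.Context) string {
	if ic, ok := InvocationContextFromCtx(ctx); ok {
		return ic.CallerWallet
	}
	return ""
}

// concurrentWalletsMatch runs n overlapping "invocations" and reports
// whether every one of them saw its own wallet.
func concurrentWalletsMatch(n int) bool {
	var wg sync.WaitGroup
	results := make([]bool, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			wallet := fmt.Sprintf("wallet-%d", i)
			ctx := WithInvocationContext(context.Background(),
				&InvocationContext{Namespace: "ns", CallerWallet: wallet})
			results[i] = getCallerWallet(ctx) == wallet
		}(i)
	}
	wg.Wait()
	for _, ok := range results {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(concurrentWalletsMatch(64)) // prints true
}
```

Because each call's identity travels in its own `ctx`, there is no shared field for a second connection to overwrite.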

- pkg/serverless/invocation_context.go (new): WithInvocationContext +
  InvocationContextFromCtx helpers using an unexported invCtxKey.
- pkg/serverless/hostfunctions/invocation_context.go (new):
  currentInvocationContext(ctx) — ctx-attached invCtx wins over the
  singleton field.
- All host accessors (FunctionInvoke, GetEnv, GetSecret, GetRequestID,
  GetCallerWallet, GetWSClientID, GetCallerClaim, GetCallerJWTSubject)
  now route through currentInvocationContext(ctx).
- pkg/serverless/persistent/instance.go: every export call's ctx is
  wrapped with the per-instance invCtx before being passed to wazero.
- pkg/gateway/handlers/serverless/ws_persistent_handler.go: invCtx is
  built per-frame and attached to ctx, not stored on a shared field.
- pkg/serverless/engine.go: removed the SetInvocationContext call at
  InstantiatePersistent (no longer needed; ctx carries it).

Stateless still uses the singleton field — its race is latent since
the host-functions split and migrating it is a separate scoped
change.

Tests:
- hostfunctions/invocation_context_test.go covers ctx-wins-over-singleton.
- gateway/handlers/serverless/ws_persistent_handler_test.go covers the
  per-frame ctx wiring.
- cli/functions/build_test.go is new coverage for the build path
  touched in this change.

VERSION bumped to 0.122.24.
2026-05-15 13:36:35 +03:00
anonpenguin23
80b466af68 fix(serverless): override WASI proc_exit so command-mode persistent WS stays alive
The previous fix (v0.122.22) made `InstantiatePersistent` call `_start`
to bootstrap TinyGo's runtime, then catch the resulting ExitError(0).
That got past init, but the module STILL died — wazero's stock
`proc_exit` implementation calls `mod.CloseWithExitCode(exitCode)`
before panicking, which invalidates the module regardless of what
the caller does with the panic. Every subsequent call to ws_open /
ws_frame / ws_close / orama_alloc returned ExitError(0) ("module
already closed").

Wazero exposes no flag for this — the close is hard-coded. The only
intercept point is to override `proc_exit` at the WASI host-module
boundary. Documented pattern at imports/wasi_snapshot_preview1/wasi.go
lines 111-127.

Fix: build the WASI host module manually so we can override
`proc_exit`:

  - exit code 0 → panic ExitError(0) BUT do NOT close the module.
    This is TinyGo's "_start completed cleanly" signal; the module's
    other exports must stay callable for the persistent lifecycle.
  - exit code != 0 → preserve standard WASI behavior (close + panic).
    A non-zero exit is a genuine app-signaled failure; we want
    `proc_exit(N != 0)` to behave exactly as upstream does.

The InstantiatePersistent caller already distinguishes the two cases
via errors.As + ExitCode() check — added in v0.122.22, no change here.

Safe for stateless functions on the same runtime: the stateless
execution path closes its own module after each invocation, so the
"module stays alive on exit 0" override has no effect on that path.

VERSION bumped to 0.122.23.
2026-05-15 11:56:29 +03:00
anonpenguin23
6a0043a244 fix(serverless): bootstrap TinyGo runtime in persistent WS instances (#240/#249)
InstantiatePersistent passed WithStartFunctions() with no args,
explicitly disabling both wasi entry points. The intent was to skip
main(); the side effect was leaving the TinyGo runtime
uninitialized. The first call to any export traps via
wasmExportCheckRun and managed-memory ops panic. Every persistent WS
function was effectively dead since plan #06 landed.

Earlier patch in this thread restored the call but only handled
wasi-reactor builds (_initialize). AnChat's rpc-router is a wasi
command build (`_start` export only, no `_initialize`) — wasm-objdump
confirms — so the reactor-only fix still left it broken.

This fix tries `_initialize` first, falls back to `_start`, and
bounds whichever runs with a 5s timeout so a buggy main() can't hang
instantiation forever. Logs the chosen hook at Debug, warns when
neither is exported.

Still pass WithStartFunctions() (no args) so wazero doesn't
auto-call `_start` during InstantiateModule — we want full control
over which hook runs and the timeout that bounds it.

VERSION bumped to 0.122.22.
2026-05-15 10:40:27 +03:00
anonpenguin23
62a8fbf2df fix(serverless): registry read paths now load WS persistent metadata (#240/#249)
Register() writes the four ws_* columns (ws_persistent,
ws_idle_timeout_sec, ws_max_frame_bytes, ws_max_inflight_per_conn) to
the functions table, but every read path — Get, List, GetByID,
GetByNameInternal — silently dropped them from the SELECT. functionRow
had no fields for them either. Result: fn.WSPersistent was always the
zero value (false) at runtime, no matter what the DB row said. Every
WS function ran in per-frame stateless mode regardless of its
`ws_persistent: true` config.

AnChat's rpc-router was the canary: it relies on per-connection
instance state (request_id ↔ reply correlation, subscription
bookkeeping) that the stateless model destroys every frame. The
gateway telemetry envelope still reached the client
({request_id, status, duration_ms}) so the failure looked like
"function works, frames don't" — every RPC timed out at 15 s.

Fix: include the four columns in every SELECT, add the matching
functionRow fields, and copy them into Function in rowToFunction.
No schema change (columns have been in migration 011 from the start).

Regression tests in registry_ws_columns_test.go cover the Get / List
paths against an in-memory SQLite that mirrors the production DDL.

VERSION bumped to 0.122.21.
2026-05-15 09:01:42 +03:00
anonpenguin23
a0a1decd06 fix(ws): prefer X-Forwarded-Host in Origin check — root cause #240/#249
handleNamespaceGatewayRequest rewrites r.Host to the backend target
IP:port (e.g. "10.0.0.6:10004") before forwarding. The original
public host (e.g. "ns-anchat-test.orama-devnet.network") is preserved
in X-Forwarded-Host. checkWSOrigin in both pubsub/ws_client.go and
serverless/ws_handler.go was comparing the client's Origin against
the proxied r.Host only — so every browser / RN-iOS WS upgrade was
rejected 403 because their Origin's public hostname can never match
10.0.0.6.

curl probes don't send Origin, so the check returned true unconditionally
and the bug was invisible to operator smoke tests. AnChat's iPhone
WS clients hit `code=1006 reason="Received bad response code from
server: 403"` for ~24h.

Fix: prefer X-Forwarded-Host (the original public host) when present,
fall back to r.Host for direct (non-proxied) connections. Applied
identically to both WS handlers. Regression test in
serverless/ws_origin_test.go covers the proxy-hop case, no-Origin
case, and direct-connection case.
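
A minimal sketch of the host-selection logic, assuming the check compares hostnames case-insensitively. `originAllowed`, `hostOnly` and `check` are hypothetical names, not the real `checkWSOrigin`. It also assumes X-Forwarded-Host is set only by a trusted proxy hop, since a client-supplied value would otherwise defeat the check.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/url"
	"strings"
)

// hostOnly strips a port if one is present.
func hostOnly(hostport string) string {
	if h, _, err := net.SplitHostPort(hostport); err == nil {
		return h
	}
	return hostport
}

// originAllowed prefers X-Forwarded-Host (the original public host) and
// falls back to r.Host for direct, non-proxied connections.
func originAllowed(r *http.Request) bool {
	origin := r.Header.Get("Origin")
	if origin == "" {
		return true // non-browser clients such as curl send no Origin
	}
	u, err := url.Parse(origin)
	if err != nil {
		return false
	}
	host := r.Header.Get("X-Forwarded-Host")
	if host == "" {
		host = r.Host
	}
	return strings.EqualFold(u.Hostname(), hostOnly(host))
}

// check builds a request with the given headers and runs the origin check.
func check(origin, forwardedHost, host string) bool {
	r := &http.Request{Header: http.Header{}, Host: host}
	if origin != "" {
		r.Header.Set("Origin", origin)
	}
	if forwardedHost != "" {
		r.Header.Set("X-Forwarded-Host", forwardedHost)
	}
	return originAllowed(r)
}

func main() {
	pub := "ns-anchat-test.orama-devnet.network"
	fmt.Println(check("https://"+pub, pub, "10.0.0.6:10004")) // true: proxy hop
	fmt.Println(check("https://"+pub, "", "10.0.0.6:10004"))  // false: the old behaviour
	fmt.Println(check("", "", "10.0.0.6:10004"))              // true: no Origin sent
}
```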

This is the real fix; v0.122.19 only closed a separate silent-forward
auth hole that produced opaque 401s on a different code path.

VERSION bumped to 0.122.20.
2026-05-15 07:03:28 +03:00
anonpenguin23
872c553d1c fix(gateway): namespace-proxy rejects unauthed requests at main, logs WS audit
Root-cause hardening for bug #240 and #249's "intermittent 401 over WS"
reports. handleNamespaceGatewayRequest previously had a third code
path beyond "auth ok" and "auth error": when validateAuthForNamespaceProxy
returned empty namespace AND empty error (i.e. "no credentials found"),
the request fell through to a silent forward to the namespace gateway
WITHOUT internal-auth headers. The namespace gateway then rejected
with 401 "missing API key" in ~60µs.

From the client's perspective: opaque 401.
From our side: only the namespace gateway logged it, and that tier
can't validate API keys (they live in the main cluster RQLite), so
the operator had no signal that the main gateway had even seen the
request. AnChat's intermittent 401-on-WS reports went unsolved for
this exact reason.

Fix:
- Explicit reject at main when no credentials extracted AND path
  isn't public. Returns 401 with WWW-Authenticate: Bearer realm and a
  clear message naming the three accepted credential sources.
- Rich structured logging on every WS upgrade auth outcome: presence
  of api_key/token/jwt query params, Authorization + X-API-Key
  headers, Connection/Upgrade headers, Origin, User-Agent, client IP,
  raw query length. Steady-state stays low-noise: success path logs at
  debug, reject paths log at warn.
- Namespace-mismatch reject (existing branch) now also logs.

VERSION bumped to 0.122.19.
2026-05-14 17:53:38 +03:00
anonpenguin23
5c1404849b fix(#72): correct ntfy upstream checksum URL
Upstream publishes the checksums asset as a plain "checksums.txt" at
the release root, not "ntfy_<VER>_checksums.txt". The version-prefixed
URL we were constructing 404'd, so InstallNtfy bailed in the
download-binary step and ntfy never landed even after we wired
InstallNtfy into the pre-built install path.

Verified against the v2.11.0 release assets list. If a future version
changes the naming convention, the install will fail loudly with a 404
and this URL gets bumped in the same PR as ntfyVersion.

VERSION bumped to 0.122.18.
2026-05-14 14:29:24 +03:00
anonpenguin23
7e47f42f91 fix(#72): install ntfy in pre-built path too — devnet path was missing it
Phase 2b auto-detects pre-built archive mode and routes to
installFromPreBuilt(). That path copies bundled binaries (caddy, orama,
gateway, …) into place but never called InstallNtfy() — because ntfy
is downloaded from upstream GitHub, not bundled. Result: on devnet
(which always uses pre-built mode), ntfy never installed even though
the always-on code path in installFromSource() was correctly wired up.

Fix: add InstallNtfy() call to installFromPreBuilt right after the
binary deploy + setCapabilities steps, before disableResolvedStub
runs. Ordering matters because Phase 4's ConfigureNtfy chowns
/etc/ntfy/server.yml to the ntfy user, which needs to exist.

VERSION bumped to 0.122.17.
2026-05-14 12:16:28 +03:00
anonpenguin23
8b4abb7eef feat(#72): install ntfy on every node, drop --with-ntfy gating
ntfy is now part of the standard node install, just like Caddy. The
binary, /etc/ntfy/server.yml, and the Caddy push.<dnsZone> reverse-
proxy block are written unconditionally on every node, and the
ntfy.service starts as part of the standard service order.

Why uniform: ntfy listens on 127.0.0.1:NtfyListenPort only, reachable
exclusively via the local Caddy reverse-proxy block. Nodes that don't
serve a public push.* DNS entry just have an idle ntfy with no
inbound traffic — zero operational cost, zero attack surface change.
Removing the flag means no per-node toggling, no preference drift
between nodes, no "did we remember to set --with-ntfy" mistakes when
DNS topology changes (e.g. promoting a node to nameserver later).

Removed:
- NodePreferences.NtfyHost (yaml: ntfy_host)
- ProductionSetup.isNtfyHost field, SetNtfyHost, IsNtfyHost
- install/flags.go --with-ntfy + NtfyHost field
- upgrade/flags.go --with-ntfy + NtfyHost field + isFlagPassed helper
  (was only used for --with-ntfy tri-state semantics)
- upgrade/orchestrator.go preference-load and persist for ntfy
- upgrade/remote.go --with-ntfy forwarding

Phase 2 always calls InstallNtfy.
Phase 4 always calls EnableCaddyNtfyProxy + ConfigureNtfy.
Phase 5 always enables ntfy.service.
Phase 5b always starts ntfy.service.

VERSION bumped to 0.122.16.
2026-05-14 11:51:08 +03:00
anonpenguin23
8c37ef547e fix(upgrade): forward per-node flags to remote so --with-ntfy actually lands
`orama node upgrade --node <ip> --with-ntfy --restart` parsed the flag
locally but `upgradeNode()` ran a hardcoded
`orama node upgrade --restart` on the remote — dropping --with-ntfy,
--nameserver, --force, and --skip-checks on the floor. The remote
orchestrator then read the SAVED preference (or default false for
nameserver/ntfy), so operator overrides like enabling ntfy on a
nameserver were silently ignored. Bug surfaced in devnet today:
running --with-ntfy reported success but ntfy was never installed.

The fix forwards the four passthrough flags to the remote command,
preserving the tri-state semantics for the pointer flags (nil = honor
saved preference; non-nil = explicit override).

VERSION bumped to 0.122.15.
2026-05-14 11:44:47 +03:00
anonpenguin23
07638354d2 feat(#72): full-privacy push — self-hosted ntfy + APNs-direct provider
Migration 028: namespace_push_credentials
- Per-(namespace, provider) AES-256-GCM encrypted credential blob.
- Generic schema — apns/ntfy/expo/future plug in with zero migration.
- Separated from migration 026's namespace_push_config (preferences vs
  credentials, different access patterns).

pkg/push/credentials
- Manager + Registry + RQLite store; HKDF purpose "namespace-push-credentials"
  via pkg/secrets. Provider Validator interface for per-provider schema.

pkg/push/providers/apns
- Apple Push Notification service direct provider (no Expo proxy).
- Validator + dispatcher; credentials are p8 signing key + key_id + team_id.

pkg/push/providers/ntfy/credentials.go
- ntfy credential schema (auth_token + default topic). Used both with
  the public ntfy.sh and our self-hosted instance.

pkg/environments/production/installers/ntfy.go
- Self-hosted ntfy server installer. Binary, system user, hardened
  /etc/ntfy/server.yml, systemd unit. Listens on 127.0.0.1:NtfyListenPort
  only — Caddy is the only public path.

pkg/environments/production/installers/caddy.go
- Emit reverse_proxy block for push.<dnsZone> -> 127.0.0.1:NtfyListenPort
  when operator enables ntfy on a node.

CLI: install/upgrade orchestrators learn a new "ntfy" install/preserve
phase; flag gating in install/flags.go + upgrade/flags.go.

Gateway handlers/push/credentials_handler.go
- GET/PUT/DELETE /v1/namespace/push-credentials/{provider}.
- PUT validates against provider Validator before encrypting and storing.
- GET returns a redacted view (booleans + non-secret fields only).

Push manager: provider resolution now also consults
namespace_push_credentials before falling back to YAML defaults.

Docs: core/docs/PUSH_NOTIFICATIONS.md walks through end-to-end setup.

VERSION bumped to 0.122.14.
2026-05-14 10:48:00 +03:00
anonpenguin23
32a2a62e0d fix(caddy): disable HTTP/2 to keep WebSocket upgrade auth working (#249)
HTTP/2 forbids the `Connection: Upgrade` and `Upgrade: websocket`
headers per RFC 7540 §8.1.2.2. With h2 advertised at the listener,
ALPN negotiates h2 for TLS-capable clients, the WS-upgrade request
arrives at Caddy with those headers stripped, and Caddy forwards a
plain HTTP/1.1 GET to the gateway. The gateway's `isWebSocketUpgrade(r)`
then returns false, the `?api_key=` / `?jwt=` query-string WS-auth
fallback never runs, and clients see 401.

RFC 8441 ("Bootstrapping WebSockets with HTTP/2") fixes this, but iOS
RN and most other mobile WS libraries don't implement it. Until they
do, h1 is the only protocol that keeps WS auth working.

Trade-off: lose h2 multiplexing on plain HTTP traffic. Acceptable for
an API gateway whose dominant workload is REST + WebSocket — neither
benefits much from h2 streams.

caddy_test.go adds a regression guard so that re-enabling h2 in the
listener protocols fails CI loudly.

Also (separate, was uncommitted): pkg/cli/build/builder.go now reads
VERSION from the repo-root /VERSION file first, falling back to
parsing the Makefile only if absent. The previous Makefile-only path
broke after VERSION moved to /VERSION (Makefile got `$(shell cat ...)`
which the CLI builder pulled in literally).

VERSION bumped to 0.122.13.
2026-05-14 07:50:47 +03:00
anonpenguin23
fda47533c3 feat: per-namespace rate-limit self-service + WS JWT auth + release 0.122.12
Per-namespace rate-limit config (feature #69)
- Migration 027: new `namespace_rate_limit_config` table
  (namespace PK, requests_per_minute, burst, audit metadata).
- pkg/ratelimit: Manager + RQLite ConfigStore + types. Same pattern
  as the push config in bug #220's follow-up — LRU cache, invalidate
  on PUT/DELETE, falls back to YAML defaults when no row exists.
- pkg/gateway/handlers/ratelimit: GET/PUT/DELETE /v1/namespace/rate-limit.
  PUT requests are rejected if they exceed the operator's configured
  ceiling (MaxRequestsPerMinute / MaxBurst) — tenants self-serve but
  cannot raise their quota past the cap.
- pkg/gateway/rate_limiter.go: per-namespace lookup, default fallback.
- pkg/gateway/middleware.go: WS JWT middleware (middleware_ws_jwt_test.go).
- pkg/gateway/auth/service.go: refresh-token rotation hardening with
  regression test in refresh_rotation_test.go.

AI agent instructions
- Add AGENTS.md, CLAUDE.md, .github/copilot-instructions.md (DeBros v0.2.0
  baseline).

DeBros rules bumped to v0.2.0 (sha bb6e6ef).

VERSION bumped to 0.122.12.
2026-05-13 15:41:36 +03:00
anonpenguin23
3676b000a6 chore: adopt DeBros DAO baseline rules + release 0.122.11
Standardization batch — no application code changes. Pulls in the
DeBros DAO baseline rules (v0.1.0, sha 51ce3f8) for supply-chain
defense and toolchain pinning.

Files added:
- DEBROS.md + debros.json — adopted-rules manifest
- .debros/compliance/{go,javascript-typescript,zig}.md — per-language
  compliance docs
- .github/workflows/security.yml — auto-detecting security CI
  (npm audit + govulncheck), runs on main + weekly cron
- renovate.json — 30-day dependency cooldown, no auto-merge,
  vulnerability alerts bypass cooldown
- .nvmrc — pin Node 20.18.0
- vault/.zigversion — pin Zig 0.14.0
- sdk/.npmrc, website/.npmrc — supply-chain hardening
  (ignore-scripts, strict-peer-dependencies, save-exact, etc.)

Files modified:
- core/go.mod, os/agent/go.mod, website/invest-api/go.mod —
  add `toolchain go1.24.6` directive for reproducible builds
- VERSION + sdk/package.json — bump to 0.122.11
2026-05-12 11:10:10 +03:00
anonpenguin23
8e4d11a6ce ci: single VERSION file, version guards, goreleaser v2, CI on push
Workflow hardening based on the four-cycle release-debugging session:

Centralized versioning
- Add /VERSION at repo root as single source of truth.
- core/Makefile reads VERSION via `$(shell cat ../VERSION)`.
- Add `make bump VER=X.Y.Z` target that updates /VERSION and syncs
  sdk/package.json in one shot.

Version mismatch guards
- All three release workflows (release.yaml, release-apt.yml,
  publish-sdk.yml) now verify the release tag matches /VERSION at the
  very first step. Stale-VERSION releases fail fast with a clear hint
  to run `make bump`.
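
The guard logic itself is small. The workflows presumably do this in a shell step; the Go sketch below only illustrates the comparison, with a hypothetical function name and the assumption that release tags carry a leading "v".

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkTag fails when the release tag does not match the VERSION contents.
// Assumption: tags look like "v0.122.8" and /VERSION holds "0.122.8".
func checkTag(tag, versionFile string) error {
	want := strings.TrimSpace(versionFile)
	got := strings.TrimPrefix(strings.TrimSpace(tag), "v")
	if want == "" {
		return fmt.Errorf("VERSION is empty")
	}
	if got != want {
		return fmt.Errorf("tag %q does not match VERSION %q; run `make bump VER=%s`", tag, want, got)
	}
	return nil
}

func main() {
	// In a workflow the second argument would be the contents of /VERSION.
	if err := checkTag("v0.122.8", "0.122.8\n"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("tag matches VERSION")
}
```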

GoReleaser v2 migration
- Upgrade goreleaser-action v5 -> v6 (pinned `~> v2`).
- Add `version: 2` to .goreleaser.yaml.
- Migrate to v2 syntax: `archives.format` -> `formats: [...]`,
  `brews.folder` -> `directory`, `snapshot.name_template` ->
  `version_template`, `builds`-style references replaced with `ids:`.
- `before.hooks` can use map syntax again (v2 supports it).

Homebrew tap on stable only
- `brews.skip_upload` is now `'{{ if .Prerelease }}true{{ else }}false{{ end }}'`.
- Stops nightly releases from polluting the tap and from hitting 401
  on stale HOMEBREW_TAP_TOKEN. Stable main releases still publish.

CI on every push
- New ci.yml runs `go vet` + `go test -race` on the core module and
  typecheck/build/unit-tests on the SDK for every push to main/nightly
  and every PR. version-sanity job warns when /VERSION and
  sdk/package.json drift.

Version bump for next pipeline test
- /VERSION: 0.122.8
- sdk/package.json: 0.122.8
2026-05-12 09:49:33 +03:00
anonpenguin23
6e31184d0e ci: point goreleaser at renamed DeBrosDAO/orama repo, bump to 0.122.7
The repo moved from DeBrosOfficial/network to DeBrosDAO/orama.
GoReleaser was uploading artifacts to the old URL and getting 307
redirects, then retrying until secondary rate limits kicked in.

- release.github.owner/name: DeBrosOfficial/network -> DeBrosDAO/orama
- brews.repository.owner: DeBrosOfficial -> DeBrosDAO
- all homepage URLs updated
- bump VERSION to 0.122.7 for fourth pipeline test
2026-05-12 09:42:51 +03:00