# Dynamic Database Clustering — Implementation Plan

### Scope

Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.

### Guiding Principles

- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations; feature-flag-controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.

### Phase 0: Prep & Scaffolding

- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for the new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
  - `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
  - `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.

### Phase 1: Metadata Layer (No hibernation yet)

- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
  - `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using the existing `pkg/pubsub` bridge:
  - Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
  - Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs).
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.

### Phase 2: Database Creation & Client API

- Port management:
  - `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping.
- Subprocess control:
  - `RQLiteInstance` lifecycle (start, wait for readiness via `/status` and a simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, fans in readiness, and surfaces failures.
- Client API:
  - Update `pkg/client/interface.go` to include `Database(name string)`.
  - Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
  - Backoff polling for readiness during creation.
- Data isolation:
  - Data dir per db: `./data/_/rqlite` (respect node `data_dir` base).
- Integration tests: create a single db across 3 nodes; multiple databases coexisting; cross-node read/write.

### Phase 3: Hibernation & Wake-Up

- Idle detection and coordination:
  - Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule.
- Hibernation protocol:
  - Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
  - Client detects `hibernating`, performs a CAS to `waking`, and triggers a wake request; reuse ports if available, else re-negotiate; start instances; status → `active`.
- Client retry UX (see the sketch after this list):
  - Transparent retries with exponential backoff; treat `waking` as a wait-only state.
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
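To make the client retry UX concrete, here is a minimal sketch of transparent retries with exponential backoff that treats `waking` as a wait-only state. The `Status` values, the `status`/`wake`/`op` callbacks, and `withWakeRetry` itself are illustrative assumptions for this plan, not existing APIs in `pkg/client`.

```go
package client

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Status is a hypothetical view of a database's state in the metadata layer.
type Status string

const (
	StatusActive      Status = "active"
	StatusWaking      Status = "waking"
	StatusHibernating Status = "hibernating"
)

// withWakeRetry runs op against a database, transparently waking it if needed.
// status and wake are placeholders for the metadata lookup and the wake request;
// the CAS from "hibernating" to "waking" is assumed to happen inside wake.
func withWakeRetry(ctx context.Context, status func(context.Context) (Status, error),
	wake func(context.Context) error, op func(context.Context) error) error {

	backoff := 250 * time.Millisecond
	const maxBackoff = 8 * time.Second

	for {
		st, err := status(ctx)
		if err != nil {
			return fmt.Errorf("lookup status: %w", err)
		}
		switch st {
		case StatusActive:
			if err := op(ctx); err == nil {
				return nil
			}
			// On failure, fall through to the backoff and re-check status.
		case StatusHibernating:
			// Only a hibernating database triggers a wake request.
			if err := wake(ctx); err != nil {
				return fmt.Errorf("wake request: %w", err)
			}
		case StatusWaking:
			// Wait-only state: never send another wake, just poll again.
		default:
			return errors.New("unknown database status: " + string(st))
		}

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}
```

The capped doubling keeps a thundering herd of waiting clients from hammering the metadata layer while a cluster restarts.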
### Phase 4: Resilience (Failure & Replacement)

- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
  - Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, selection is confirmed, and the new node joins raft via `-join`, then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.

### Phase 5: Production Hardening & Optimization

- Metrics/logging:
  - Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1, 3, 5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.

### File Changes (Essentials)

- `pkg/config/config.go`
  - Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
  - Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval`.
- `pkg/client/interface.go` / `pkg/client/client.go`
  - Add `Database(name string)` and the app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
  - Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
  - Refactor from singleton to instance-oriented helpers.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect the new `database` block.

### Config Example (target end-state)

```yaml
node:
  data_dir: "./data"

database:
  replication_factor: 3
  hibernation_timeout: 60
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999

discovery:
  health_check_interval: 10s
```

### Rollout Strategy

- Keep the feature flag off by default; support the legacy single-cluster path.
- Ship Phase 1 behind the flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3), per environment.
- Remove legacy config after the deprecation window.

### Testing & Quality Gates

- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.

### Risks & Mitigations (operationalized)

- Metadata divergence → vector clocks + periodic checksums + majority read checks in the client (see the sketch after this list).
- Raft churn → adaptive timeouts; allow an `always_on` flag per db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.
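To make the vector-clock mitigation concrete, below is a minimal sketch of the merge and comparison rules the metadata layer could use; `VectorClock` and its methods are assumptions for illustration, not the actual `pkg/metadata` types.

```go
package metadata

// VectorClock maps a peer ID to a logical counter. This type and its merge
// rules are an illustrative sketch, not the shipped pkg/metadata API.
type VectorClock map[string]uint64

// Tick increments this node's counter before publishing a metadata update.
func (vc VectorClock) Tick(peerID string) {
	vc[peerID]++
}

// Merge folds a remote clock into the local one, keeping the max per peer.
// It is applied when gossip or checksum-triggered diffs arrive.
func (vc VectorClock) Merge(remote VectorClock) {
	for peer, counter := range remote {
		if counter > vc[peer] {
			vc[peer] = counter
		}
	}
}

// Descends reports whether vc has seen everything recorded in other.
// If neither clock descends from the other, the updates are concurrent and a
// deterministic tie-break (e.g. lowest peer ID, as in the coordinator rule
// from Phase 1) decides which state wins.
func (vc VectorClock) Descends(other VectorClock) bool {
	for peer, counter := range other {
		if vc[peer] < counter {
			return false
		}
	}
	return true
}
```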
### Timeline (indicative)

- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5

### To-dos

- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation (see the sketch after this list)
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides
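As a rough sketch of the bind-probing approach planned for `PortManager`, the helper below tries random ports within a configured range and keeps the first one the OS lets it bind. The function name and signature are assumptions for illustration; the real manager would also record the allocation in its local bookkeeping before handing the port to an rqlite subprocess.

```go
package dbcluster

import (
	"fmt"
	"math/rand"
	"net"
)

// allocatePort probes random ports in [start, end] until one binds, which is
// taken as evidence the port is currently free on this node. There is an
// inherent race between closing the probe listener and the rqlite subprocess
// re-binding, so the caller should reserve the port in local state first.
func allocatePort(start, end int) (int, error) {
	const maxAttempts = 50
	for i := 0; i < maxAttempts; i++ {
		port := start + rand.Intn(end-start+1)
		l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
		if err != nil {
			continue // port in use or not bindable; try another
		}
		_ = l.Close() // release the probe; the subprocess re-binds shortly
		return port, nil
	}
	return 0, fmt.Errorf("no free port in range %d-%d after %d attempts", start, end, maxAttempts)
}
```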