Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide

- Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files as they are no longer relevant to the current implementation.
- Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources.
anonpenguin23 2025-10-16 10:29:58 +03:00
parent dd4cb832dc
commit 36002d342c
No known key found for this signature in database
GPG Key ID: 1CBB1FE35AFBEE30
3 changed files with 0 additions and 1496 deletions


@@ -1,165 +0,0 @@
# Dynamic Database Clustering — Implementation Plan
### Scope
Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.
### Guiding Principles
- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations, feature-flag controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.
### Phase 0: Prep & Scaffolding
- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
- `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
- `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.
### Phase 1: Metadata Layer (No hibernation yet)
- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
- `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using existing `pkg/pubsub` bridge:
- Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
- Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs).
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.
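A minimal sketch of what this Phase 1 metadata store could look like is shown below; the type and field names are assumptions for illustration, not the final `pkg/rqlite/metadata.go` API:
```go
package metadata

import (
  "sync"
  "time"
)

// DatabaseMetadata describes one dynamic database cluster (fields assumed).
type DatabaseMetadata struct {
  Name              string            // namespaced name, e.g. "myapp_users"
  Status            string            // "creating" | "active" | "hibernating" | "waking"
  ReplicationFactor int
  Nodes             []string          // peer IDs hosting this database
  Ports             map[string][2]int // peer ID -> [httpPort, raftPort]
  LastQuery         time.Time
  VectorClock       map[string]uint64 // peer ID -> counter
}

// MetadataStore is a thread-safe, versioned view of all known databases.
type MetadataStore struct {
  mu        sync.RWMutex
  databases map[string]*DatabaseMetadata
}

func NewMetadataStore() *MetadataStore {
  return &MetadataStore{databases: make(map[string]*DatabaseMetadata)}
}

func (s *MetadataStore) Get(name string) (*DatabaseMetadata, bool) {
  s.mu.RLock()
  defer s.mu.RUnlock()
  md, ok := s.databases[name]
  return md, ok
}

func (s *MetadataStore) Set(md *DatabaseMetadata) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.databases[md.Name] = md
}
```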
### Phase 2: Database Creation & Client API
- Port management:
- `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping.
- Subprocess control:
- `RQLiteInstance` lifecycle (start, wait ready via /status and simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, fans in readiness signals, and surfaces failures.
- Client API:
- Update `pkg/client/interface.go` to include `Database(name string)`.
- Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
- Backoff polling for readiness during creation.
- Data isolation:
- Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
- Integration tests: create single db across 3 nodes; multiple databases coexisting; cross-node read/write.
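A rough sketch of a bind-probing port allocator along these lines; names and the probing strategy are illustrative, not the actual `PortManager` implementation:
```go
package ports

import (
  "fmt"
  "math/rand"
  "net"
  "sync"
)

// PortManager hands out ports from a configured range, probing each
// candidate with a real bind to make sure it is actually free.
type PortManager struct {
  mu         sync.Mutex
  start, end int
  allocated  map[int]bool
}

func NewPortManager(start, end int) *PortManager {
  return &PortManager{start: start, end: end, allocated: make(map[int]bool)}
}

// Allocate picks a random free port in the range and confirms it by binding.
func (pm *PortManager) Allocate() (int, error) {
  pm.mu.Lock()
  defer pm.mu.Unlock()
  size := pm.end - pm.start + 1
  for attempts := 0; attempts < size; attempts++ {
    port := pm.start + rand.Intn(size)
    if pm.allocated[port] {
      continue
    }
    // Bind-probe: if we can listen, the port is usable right now.
    l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
    if err != nil {
      continue
    }
    l.Close()
    pm.allocated[port] = true
    return port, nil
  }
  return 0, fmt.Errorf("no free ports in range %d-%d", pm.start, pm.end)
}

// Release returns a port to the pool.
func (pm *PortManager) Release(port int) {
  pm.mu.Lock()
  defer pm.mu.Unlock()
  delete(pm.allocated, port)
}
```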
### Phase 3: Hibernation & Wake-Up
- Idle detection and coordination:
- Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule.
- Hibernation protocol:
- Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
- Client detects `hibernating`, performs CAS to `waking`, triggers wake request; port reuse if available else re-negotiate; start instances; status → `active`.
- Client retry UX:
- Transparent retries with exponential backoff; treat `waking` as wait-only state.
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
### Phase 4: Resilience (Failure & Replacement)
- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
- Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, confirm selection, new node joins raft via `-join` then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.
### Phase 5: Production Hardening & Optimization
- Metrics/logging:
- Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1,3,5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.
### File Changes (Essentials)
- `pkg/config/config.go`
- Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
- Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval`.
- `pkg/client/interface.go`/`pkg/client/client.go`
- Add `Database(name string)` and app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
- Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
- Refactor to instance-oriented helpers from singleton.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect new `database` block.
### Config Example (target end-state)
```yaml
node:
  data_dir: "./data"
database:
  replication_factor: 3
  hibernation_timeout: 60
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999
discovery:
  health_check_interval: 10s
```
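For illustration, this block could map onto Go config structs like the following; the YAML tags and struct layout are assumptions based on the field list in the previous section:
```go
package config

import "time"

// DatabaseConfig mirrors the target `database` block above.
type DatabaseConfig struct {
  ReplicationFactor  int           `yaml:"replication_factor"`
  HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
  MaxDatabases       int           `yaml:"max_databases"`
  PortRange          PortRange     `yaml:"port_range"`
}

// PortRange bounds the dynamically allocated HTTP and Raft ports.
type PortRange struct {
  HTTPStart int `yaml:"http_start"`
  HTTPEnd   int `yaml:"http_end"`
  RaftStart int `yaml:"raft_start"`
  RaftEnd   int `yaml:"raft_end"`
}

// DiscoveryConfig gains a health-check interval for failure detection.
type DiscoveryConfig struct {
  HealthCheckInterval time.Duration `yaml:"health_check_interval"`
}
```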
### Rollout Strategy
- Keep feature flag off by default; support legacy single-cluster path.
- Ship Phase 1 behind flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
- Remove legacy config after deprecation window.
### Testing & Quality Gates
- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.
### Risks & Mitigations (operationalized)
- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.
### Timeline (indicative)
- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5
### To-dos
- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides


@@ -1,504 +0,0 @@
# Dynamic Database Clustering - User Guide
## Overview
Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as its own rqlite cluster (three nodes by default) with dedicated data directories and port allocations.
## Key Features
- **Multi-Database Support** - Create isolated databases on demand (up to the configured `max_databases`)
- **3-Node Replication** - Fault-tolerant by default (configurable)
- **Auto Hibernation** - Idle databases hibernate to save resources
- **Transparent Wake-Up** - Automatic restart on access
- **App Namespacing** - Databases are scoped by application name
- **Decentralized Metadata** - LibP2P pubsub-based coordination
- **Failure Recovery** - Automatic node replacement on failures
- **Resource Optimization** - Dynamic port allocation and data isolation
## Configuration
### Node Configuration (`configs/node.yaml`)
```yaml
node:
  data_dir: "./data"
  listen_addresses:
    - "/ip4/0.0.0.0/tcp/4001"
  max_connections: 50

database:
  replication_factor: 3        # Number of replicas per database
  hibernation_timeout: 60s     # Idle time before hibernation
  max_databases: 100           # Max databases per node
  port_range_http_start: 5001  # HTTP port range start
  port_range_http_end: 5999    # HTTP port range end
  port_range_raft_start: 7001  # Raft port range start
  port_range_raft_end: 7999    # Raft port range end

discovery:
  bootstrap_peers:
    - "/ip4/127.0.0.1/tcp/4001/p2p/..."
  discovery_interval: 30s
  health_check_interval: 10s
```
### Key Configuration Options
#### `database.replication_factor` (default: 3)
Number of nodes that will host each database cluster. Minimum 1, recommended 3 for fault tolerance.
#### `database.hibernation_timeout` (default: 60s)
Time of inactivity before a database is hibernated. Set to 0 to disable hibernation.
#### `database.max_databases` (default: 100)
Maximum number of databases this node can host simultaneously.
#### `database.port_range_*`
Port ranges for dynamic allocation. Ensure ranges are large enough for `max_databases * 2` ports (HTTP + Raft per database).
## Client Usage
### Creating/Accessing Databases
```go
package main

import (
  "context"

  "github.com/DeBrosOfficial/network/pkg/client"
)

func main() {
  // Create client with app name for namespacing
  cfg := client.DefaultClientConfig("myapp")
  cfg.BootstrapPeers = []string{
    "/ip4/127.0.0.1/tcp/4001/p2p/...",
  }

  c, err := client.NewClient(cfg)
  if err != nil {
    panic(err)
  }

  // Connect to network
  if err := c.Connect(); err != nil {
    panic(err)
  }
  defer c.Disconnect()

  // Get database client (creates database if it doesn't exist)
  db, err := c.Database().Database("users")
  if err != nil {
    panic(err)
  }

  // Use the database
  ctx := context.Background()
  err = db.CreateTable(ctx, `
    CREATE TABLE users (
      id INTEGER PRIMARY KEY,
      name TEXT NOT NULL,
      email TEXT UNIQUE
    )
  `)
  if err != nil {
    panic(err)
  }

  // Query data
  result, err := db.Query(ctx, "SELECT * FROM users")
  if err != nil {
    panic(err)
  }
  _ = result
  // ...
}
```
### Database Naming
Databases are automatically namespaced by your application name:
- `client.Database("users")` → creates `myapp_users` internally
- This prevents name collisions between different applications
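A possible shape for that namespacing step, assuming sanitization rules consistent with the `<app>_<db>` data-directory convention (illustrative only, not the actual client code):
```go
package client

import (
  "fmt"
  "regexp"
  "strings"
)

var unsafeChars = regexp.MustCompile(`[^a-z0-9_]+`)

// namespacedName turns ("myapp", "users") into "myapp_users",
// lower-casing both parts and replacing characters that are unsafe
// in directory names and identifiers.
func namespacedName(appName, dbName string) string {
  sanitize := func(s string) string {
    s = strings.ToLower(strings.TrimSpace(s))
    return unsafeChars.ReplaceAllString(s, "_")
  }
  return fmt.Sprintf("%s_%s", sanitize(appName), sanitize(dbName))
}
```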
## Gateway API Usage
If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints:
### Base URL
```
http://gateway-host:8080/v1/database/
```
### Execute SQL (INSERT, UPDATE, DELETE, DDL)
```bash
POST /v1/database/exec
Content-Type: application/json
{
  "database": "users",
  "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
  "args": ["Alice", "alice@example.com"]
}

Response:
{
  "rows_affected": 1,
  "last_insert_id": 1
}
```
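For reference, the same `exec` call can be made from Go with only the standard library; the endpoint and payload mirror the example above, and the token value is a placeholder:
```go
package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
)

func main() {
  payload := map[string]any{
    "database": "users",
    "sql":      "INSERT INTO users (name, email) VALUES (?, ?)",
    "args":     []any{"Alice", "alice@example.com"},
  }
  body, _ := json.Marshal(payload)

  req, err := http.NewRequest("POST", "http://gateway-host:8080/v1/database/exec", bytes.NewReader(body))
  if err != nil {
    panic(err)
  }
  req.Header.Set("Content-Type", "application/json")
  req.Header.Set("Authorization", "Bearer <your-token>") // JWT or API key

  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()
  fmt.Println("status:", resp.Status)
}
```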
### Query Data (SELECT)
```bash
POST /v1/database/query
Content-Type: application/json
{
  "database": "users",
  "sql": "SELECT * FROM users WHERE name LIKE ?",
  "args": ["A%"]
}

Response:
{
  "items": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"}
  ],
  "count": 1
}
```
### Execute Transaction
```bash
POST /v1/database/transaction
Content-Type: application/json
{
  "database": "users",
  "queries": [
    "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')",
    "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'"
  ]
}

Response:
{
  "success": true
}
```
### Get Schema
```bash
GET /v1/database/schema?database=users
# OR
POST /v1/database/schema
Content-Type: application/json
{
  "database": "users"
}

Response:
{
  "tables": [
    {
      "name": "users",
      "columns": ["id", "name", "email"]
    }
  ]
}
```
### Create Table
```bash
POST /v1/database/create-table
Content-Type: application/json
{
  "database": "users",
  "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}

Response:
{
  "rows_affected": 0
}
```
### Drop Table
```bash
POST /v1/database/drop-table
Content-Type: application/json
{
  "database": "users",
  "table_name": "old_table"
}

Response:
{
  "rows_affected": 0
}
```
### List Databases
```bash
GET /v1/database/list
Response:
{
"databases": ["users", "products", "orders"]
}
```
### Important Notes
1. **Authentication Required**: All endpoints require authentication (JWT or API key)
2. **Database Creation**: Databases are created automatically on first access
3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on first query to a hibernating database
4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s
5. **Namespacing**: Database names are automatically prefixed with your app name
6. **Concurrent Access**: All endpoints are safe for concurrent use
## Database Lifecycle
### 1. Creation
When you first access a database:
1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST`
2. **Node Selection** - Eligible nodes respond with available ports
3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen
4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM`
5. **Instance Startup** - Selected nodes start rqlite subprocesses
6. **Readiness** - Nodes report `active` status when ready
**Typical creation time: < 10 seconds**
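The pubsub messages behind this flow might look roughly like the following sketch; message and field names are assumptions consistent with the protocol steps above, not the exact wire format:
```go
package rqlite

// DatabaseCreateRequest is broadcast by the node that first sees a
// request for a database that does not exist yet.
type DatabaseCreateRequest struct {
  Database          string `json:"database"`  // namespaced name
  Requester         string `json:"requester"` // peer ID
  ReplicationFactor int    `json:"replication_factor"`
}

// DatabaseCreateResponse is sent by nodes willing to host the database,
// along with the ports they have reserved for it.
type DatabaseCreateResponse struct {
  Database string `json:"database"`
  NodeID   string `json:"node_id"`
  HTTPPort int    `json:"http_port"`
  RaftPort int    `json:"raft_port"`
}

// DatabaseCreateConfirm is broadcast by the deterministic coordinator
// once enough responses have arrived, naming the selected nodes.
type DatabaseCreateConfirm struct {
  Database      string   `json:"database"`
  SelectedNodes []string `json:"selected_nodes"`
}
```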
### 2. Active State
- Database instances run as rqlite subprocesses
- Each instance tracks `LastQuery` timestamp
- Queries update the activity timestamp
- Metadata synced across all network nodes
### 3. Hibernation
After `hibernation_timeout` of inactivity:
1. **Idle Detection** - Nodes detect idle databases
2. **Idle Notification** - Nodes broadcast idle status
3. **Coordinated Shutdown** - When all nodes report idle, coordinator schedules shutdown
4. **Graceful Stop** - SIGTERM sent to rqlite processes
5. **Port Release** - Ports freed for reuse
6. **Status Update** - Metadata updated to `hibernating`
**Data persists on disk during hibernation**
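A simplified sketch of the idle scan described above; the notification hook is a placeholder for the real pubsub broadcast:
```go
package dbcluster

import (
  "sync"
  "time"
)

// instance tracks the minimum state the idle scan needs.
type instance struct {
  mu        sync.Mutex
  lastQuery time.Time
}

// touch records query activity and resets the idle timer.
func (i *instance) touch() {
  i.mu.Lock()
  i.lastQuery = time.Now()
  i.mu.Unlock()
}

// isIdle reports whether the instance has been quiet longer than timeout.
func (i *instance) isIdle(timeout time.Duration) bool {
  i.mu.Lock()
  defer i.mu.Unlock()
  return timeout > 0 && time.Since(i.lastQuery) > timeout
}

// scanIdle runs periodically and reports idle databases so the node can
// broadcast its idle status; notify stands in for the pubsub call.
func scanIdle(instances map[string]*instance, timeout time.Duration, notify func(db string)) {
  for name, inst := range instances {
    if inst.isIdle(timeout) {
      notify(name)
    }
  }
}
```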
### 4. Wake-Up
On first query to hibernating database:
1. **Detection** - Client/node detects `hibernating` status
2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST`
3. **Port Allocation** - Reuse original ports or allocate new ones
4. **Instance Restart** - Restart rqlite with existing data
5. **Status Update** - Update to `active` when ready
**Typical wake-up time: < 8 seconds**
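On the client side, wake-up is hidden behind retries with exponential backoff; a minimal sketch, assuming the client can distinguish a "waking" condition from other failures:
```go
package client

import (
  "context"
  "errors"
  "time"
)

// errWaking stands in for whatever error/status the client sees while a
// database is still hibernating or waking up.
var errWaking = errors.New("database is waking up")

// withWakeupRetry retries op with exponential backoff while the database
// reports a hibernating/waking state, up to the context deadline.
func withWakeupRetry(ctx context.Context, op func() error) error {
  backoff := 250 * time.Millisecond
  for {
    err := op()
    if err == nil || !errors.Is(err, errWaking) {
      return err
    }
    select {
    case <-ctx.Done():
      return ctx.Err()
    case <-time.After(backoff):
    }
    if backoff < 4*time.Second {
      backoff *= 2
    }
  }
}
```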
### 5. Failure Recovery
When a node fails:
1. **Health Detection** - Missed health checks trigger failure detection
2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED`
3. **Offers** - Healthy nodes with capacity offer to replace
4. **Selection** - First offer accepted (simple approach)
5. **Join Cluster** - New node joins existing Raft cluster
6. **Sync** - Data synced from existing members
## Data Management
### Data Directories
Each database gets its own data directory:
```
./data/
├── myapp_users/        # Database: users
│   └── rqlite/
│       ├── db.sqlite
│       └── raft/
├── myapp_products/     # Database: products
│   └── rqlite/
└── myapp_orders/       # Database: orders
    └── rqlite/
```
### Orphaned Data Cleanup
On node startup, the system automatically:
- Scans data directories
- Checks against metadata
- Removes directories for:
- Non-existent databases
- Databases where this node is not a member
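A sketch of that reconciliation pass, assuming a metadata lookup that can answer whether this node is a member of a given database:
```go
package dbcluster

import (
  "os"
  "path/filepath"
)

// reconcileDataDirs removes local database directories that either no
// longer exist in metadata or do not list this node as a member.
// isMember is a placeholder for the metadata lookup.
func reconcileDataDirs(dataDir string, isMember func(db string) bool) error {
  entries, err := os.ReadDir(dataDir)
  if err != nil {
    return err
  }
  for _, e := range entries {
    if !e.IsDir() {
      continue
    }
    if isMember(e.Name()) {
      continue
    }
    // Orphaned: not in metadata, or this node is not a member.
    if err := os.RemoveAll(filepath.Join(dataDir, e.Name())); err != nil {
      return err
    }
  }
  return nil
}
```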
## Monitoring & Debugging
### Structured Logging
All operations are logged with structured fields:
```
INFO Starting cluster manager node_id=12D3... max_databases=100
INFO Received database create request database=myapp_users requester=12D3...
INFO Database instance started database=myapp_users http_port=5001 raft_port=7001
INFO Database is idle database=myapp_users idle_time=62s
INFO Database hibernated successfully database=myapp_users
INFO Received wakeup request database=myapp_users
INFO Database woke up successfully database=myapp_users
```
### Health Checks
Nodes perform periodic health checks:
- Every `health_check_interval` (default: 10s)
- Tracks last-seen time for each peer
- 3 missed checks → node marked unhealthy
- Triggers replacement protocol for affected databases
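A simplified tracker matching this behavior (three missed checks mark a peer unhealthy); the interval handling and replacement trigger are placeholders, not the actual health-check code:
```go
package dbcluster

import (
  "sync"
  "time"
)

const missedChecksThreshold = 3

// healthTracker records when each peer was last seen and counts
// consecutive missed health checks.
type healthTracker struct {
  mu       sync.Mutex
  lastSeen map[string]time.Time
  missed   map[string]int
}

func newHealthTracker() *healthTracker {
  return &healthTracker{lastSeen: map[string]time.Time{}, missed: map[string]int{}}
}

// markSeen is called whenever a peer responds to a health check.
func (h *healthTracker) markSeen(peer string) {
  h.mu.Lock()
  defer h.mu.Unlock()
  h.lastSeen[peer] = time.Now()
  h.missed[peer] = 0
}

// check runs every health_check_interval and returns peers that have just
// crossed the unhealthy threshold so replacement can be triggered.
func (h *healthTracker) check(interval time.Duration) []string {
  h.mu.Lock()
  defer h.mu.Unlock()
  var unhealthy []string
  for peer, seen := range h.lastSeen {
    if time.Since(seen) > interval {
      h.missed[peer]++
      if h.missed[peer] == missedChecksThreshold {
        unhealthy = append(unhealthy, peer)
      }
    }
  }
  return unhealthy
}
```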
## Best Practices
### 1. **Capacity Planning**
```yaml
# For 100 databases with 3-node replication:
database:
  max_databases: 100
  port_range_http_start: 5001
  port_range_http_end: 5200    # 200 ports (100 databases * 2)
  port_range_raft_start: 7001
  port_range_raft_end: 7200
```
### 2. **Hibernation Tuning**
- **High Traffic**: Set `hibernation_timeout: 300s` or higher
- **Development**: Set `hibernation_timeout: 30s` for faster cycles
- **Always-On DBs**: Set `hibernation_timeout: 0` to disable
### 3. **Replication Factor**
- **Development**: `replication_factor: 1` (single node, no replication)
- **Production**: `replication_factor: 3` (fault tolerant)
- **High Availability**: `replication_factor: 5` (survives 2 failures)
### 4. **Network Topology**
- Use at least 3 nodes for `replication_factor: 3`
- Ensure total database instances fit the cluster: `databases × replication_factor <= nodes × max_databases`
- Example: 3 nodes × 100 `max_databases` = capacity for 300 instances, i.e. 100 databases at replication factor 3
## Troubleshooting
### Database Creation Fails
**Problem**: `insufficient nodes responded: got 1, need 3`
**Solution**:
- Ensure you have at least `replication_factor` nodes online
- Check `max_databases` limit on nodes
- Verify port ranges aren't exhausted
### Database Not Waking Up
**Problem**: Database stays in `waking` status
**Solution**:
- Check node logs for rqlite startup errors
- Verify rqlite binary is installed
- Check port conflicts (use different port ranges)
- Ensure data directory is accessible
### Orphaned Data
**Problem**: Disk space consumed by old databases
**Solution**:
- Orphaned data is automatically cleaned on node restart
- Manual cleanup: Delete directories from `./data/` that don't match metadata
- Check logs for reconciliation results
### Node Replacement Not Working
**Problem**: Failed node not replaced
**Solution**:
- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`)
- Check network connectivity between nodes
- Verify health check interval is reasonable (not too aggressive)
## Advanced Topics
### Metadata Consistency
- **Vector Clocks**: Each metadata update includes vector clock for conflict resolution
- **Gossip Protocol**: Periodic metadata sync via checksums
- **Eventual Consistency**: All nodes eventually agree on database state
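A minimal vector clock with the merge and compare semantics described above; the type and method names are illustrative:
```go
package rqlite

// VectorClock maps peer IDs to monotonically increasing counters.
type VectorClock map[string]uint64

// Increment bumps this node's counter.
func (vc VectorClock) Increment(nodeID string) { vc[nodeID]++ }

// Merge keeps the per-node maximum of both clocks.
func (vc VectorClock) Merge(other VectorClock) {
  for node, v := range other {
    if v > vc[node] {
      vc[node] = v
    }
  }
}

// Compare returns -1 if vc happened before other, 1 if after,
// and 0 if the clocks are concurrent or identical.
func (vc VectorClock) Compare(other VectorClock) int {
  less, greater := false, false
  for node := range union(vc, other) {
    a, b := vc[node], other[node]
    if a < b {
      less = true
    }
    if a > b {
      greater = true
    }
  }
  switch {
  case less && !greater:
    return -1
  case greater && !less:
    return 1
  default:
    return 0
  }
}

func union(a, b VectorClock) map[string]struct{} {
  keys := make(map[string]struct{}, len(a)+len(b))
  for k := range a {
    keys[k] = struct{}{}
  }
  for k := range b {
    keys[k] = struct{}{}
  }
  return keys
}
```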
### Port Management
- Ports allocated randomly within configured ranges
- Bind-probing ensures ports are actually available
- Ports reused during wake-up when possible
- Failed allocations fall back to new random ports
### Coordinator Election
- Deterministic selection based on lexicographical peer ID ordering
- Lowest peer ID becomes coordinator
- No persistent coordinator state
- Re-election occurs for each database operation
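Election therefore reduces to picking the lexicographically smallest peer ID; a sketch of such an `ElectCoordinator` helper:
```go
package rqlite

import (
  "errors"
  "sort"
)

// ElectCoordinator deterministically picks the lowest peer ID.
// Every node running this over the same candidate set picks the same
// coordinator, so no extra coordination round is needed.
func ElectCoordinator(peerIDs []string) (string, error) {
  if len(peerIDs) == 0 {
    return "", errors.New("no candidate nodes")
  }
  sorted := append([]string(nil), peerIDs...)
  sort.Strings(sorted)
  return sorted[0], nil
}
```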
## Migration from Legacy Mode
If upgrading from single-cluster rqlite:
1. **Backup Data**: Backup your existing `./data/rqlite` directory
2. **Update Config**: Remove deprecated fields:
- `database.data_dir`
- `database.rqlite_port`
- `database.rqlite_raft_port`
- `database.rqlite_join_address`
3. **Add New Fields**: Configure dynamic clustering (see Configuration section)
4. **Restart Nodes**: Restart all nodes with new configuration
5. **Migrate Data**: Create new database and import data from backup
## Future Enhancements
The following features are planned for future releases:
### **Advanced Metrics** (Future)
- Prometheus-style metrics export
- Per-database query counters
- Hibernation/wake-up latency histograms
- Resource utilization gauges
### **Performance Benchmarks** (Future)
- Automated benchmark suite
- Creation time SLOs
- Wake-up latency targets
- Query overhead measurements
### **Enhanced Monitoring** (Future)
- Dashboard for cluster visualization
- Database status API endpoint
- Capacity planning tools
- Alerting integration
## Support
For issues, questions, or contributions:
- GitHub Issues: https://github.com/DeBrosOfficial/network/issues
- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md
## License
See LICENSE file for details.


@@ -1,827 +0,0 @@
# Dynamic Database Clustering - Testing Guide
This guide provides a comprehensive list of unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.
## Unit Tests
### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)
```go
// Test cases to implement:
func TestMetadataStore_GetSetDatabase(t *testing.T)
- Create store
- Set database metadata
- Get database metadata
- Verify data matches
func TestMetadataStore_DeleteDatabase(t *testing.T)
- Set database metadata
- Delete database
- Verify Get returns nil
func TestMetadataStore_ListDatabases(t *testing.T)
- Add multiple databases
- List all databases
- Verify count and contents
func TestMetadataStore_ConcurrentAccess(t *testing.T)
- Spawn multiple goroutines
- Concurrent reads and writes
- Verify no race conditions (run with -race)
func TestMetadataStore_NodeCapacity(t *testing.T)
- Set node capacity
- Get node capacity
- Update capacity
- List nodes
```
### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)
```go
func TestVectorClock_Increment(t *testing.T)
- Create empty vector clock
- Increment for node A
- Verify counter is 1
- Increment again
- Verify counter is 2
func TestVectorClock_Merge(t *testing.T)
- Create two vector clocks with different nodes
- Merge them
- Verify max values are preserved
func TestVectorClock_Compare(t *testing.T)
- Test strictly less than case
- Test strictly greater than case
- Test concurrent case
- Test identical case
func TestVectorClock_Concurrent(t *testing.T)
- Create clocks with overlapping updates
- Verify Compare returns 0 (concurrent)
```
### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)
```go
func TestElectCoordinator_SingleNode(t *testing.T)
- Pass single node ID
- Verify it's elected
func TestElectCoordinator_MultipleNodes(t *testing.T)
- Pass multiple node IDs
- Verify lowest lexicographical ID wins
- Verify deterministic (same input = same output)
func TestElectCoordinator_EmptyList(t *testing.T)
- Pass empty list
- Verify error returned
func TestElectCoordinator_Deterministic(t *testing.T)
- Run election multiple times with same inputs
- Verify same coordinator each time
```
### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)
```go
func TestPortManager_AllocatePortPair(t *testing.T)
- Create manager with port range
- Allocate port pair
- Verify HTTP and Raft ports different
- Verify ports within range
func TestPortManager_ReleasePortPair(t *testing.T)
- Allocate port pair
- Release ports
- Verify ports can be reallocated
func TestPortManager_Exhaustion(t *testing.T)
- Allocate all available ports
- Attempt one more allocation
- Verify error returned
func TestPortManager_IsPortAllocated(t *testing.T)
- Allocate ports
- Check IsPortAllocated returns true
- Release ports
- Check IsPortAllocated returns false
func TestPortManager_AllocateSpecificPorts(t *testing.T)
- Allocate specific ports
- Verify allocation succeeds
- Attempt to allocate same ports again
- Verify error returned
```
### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)
```go
func TestRQLiteInstance_Create(t *testing.T)
- Create instance configuration
- Verify fields set correctly
func TestRQLiteInstance_IsIdle(t *testing.T)
- Set LastQuery to old timestamp
- Verify IsIdle returns true
- Update LastQuery
- Verify IsIdle returns false
// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
- Create instance
- Start instance
- Verify HTTP endpoint responsive
- Stop instance
- Verify process terminated
```
### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)
```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
- Create each message type
- Marshal to bytes
- Unmarshal back
- Verify data preserved
func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```
### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)
```go
func TestCreateCoordinator_AddResponse(t *testing.T)
- Create coordinator
- Add responses
- Verify response count
func TestCreateCoordinator_SelectNodes(t *testing.T)
- Add more responses than needed
- Call SelectNodes
- Verify correct number selected
- Verify deterministic selection
func TestCreateCoordinator_WaitForResponses(t *testing.T)
- Create coordinator
- Wait in goroutine
- Add responses from another goroutine
- Verify wait completes when enough responses
func TestCoordinatorRegistry(t *testing.T)
- Register coordinator
- Get coordinator
- Remove coordinator
- Verify lifecycle
```
## Integration Tests
### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)
```go
func TestSingleNodeDatabaseCreation(t *testing.T)
- Start 1 node
- Set replication_factor = 1
- Create database
- Verify database active
- Write data
- Read data back
- Verify data matches
```
### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)
```go
func TestThreeNodeDatabaseCreation(t *testing.T)
- Start 3 nodes
- Set replication_factor = 3
- Create database from node 1
- Wait for all nodes to report active
- Write data to node 1
- Read from node 2
- Verify replication worked
```
### 3. Multiple Databases (`e2e/multiple_databases_test.go`)
```go
func TestMultipleDatabases(t *testing.T)
- Start 3 nodes
- Create database "users"
- Create database "products"
- Create database "orders"
- Verify all databases active
- Write to each database
- Verify data isolation
```
### 4. Hibernation Cycle (`e2e/hibernation_test.go`)
```go
func TestHibernationCycle(t *testing.T)
- Start 3 nodes with hibernation_timeout=5s
- Create database
- Write initial data
- Wait 10 seconds (no activity)
- Verify status = hibernating
- Verify processes stopped
- Verify data persisted on disk
func TestWakeUpCycle(t *testing.T)
- Create and hibernate database
- Issue query
- Wait for wake-up
- Verify status = active
- Verify data still accessible
- Verify LastQuery updated
```
### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)
```go
func TestNodeFailureDetection(t *testing.T)
- Start 3 nodes
- Create database
- Kill one node (SIGKILL)
- Wait for health checks to detect failure
- Verify NODE_REPLACEMENT_NEEDED broadcast
func TestNodeReplacement(t *testing.T)
- Start 4 nodes
- Create database on nodes 1,2,3
- Kill node 3
- Wait for replacement
- Verify node 4 joins cluster
- Verify data accessible from node 4
```
### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)
```go
func TestOrphanedDataCleanup(t *testing.T)
- Start node
- Manually create orphaned data directory
- Restart node
- Verify orphaned directory removed
- Check logs for reconciliation message
```
### 7. Concurrent Operations (`e2e/concurrent_test.go`)
```go
func TestConcurrentDatabaseCreation(t *testing.T)
- Start 5 nodes
- Create 10 databases concurrently
- Verify all successful
- Verify no port conflicts
- Verify proper distribution
func TestConcurrentHibernation(t *testing.T)
- Create multiple databases
- Let all go idle
- Verify all hibernate correctly
- No race conditions
```
## Manual Test Scenarios
### Test 1: Basic Flow - Three Node Cluster
**Setup:**
```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001
# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002
# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```
**Test Steps:**
1. **Create Database**
```bash
# Use client or API to create database "testdb"
```
2. **Verify Creation**
- Check logs on all 3 nodes for "Database instance started"
- Verify `./data/*/testdb/` directories exist on all nodes
- Check different ports allocated on each node
3. **Write Data**
```sql
CREATE TABLE users (id INT, name TEXT);
INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');
```
4. **Verify Replication**
- Query from each node
- Verify same data returned
**Expected Results:**
- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node
---
### Test 2: Hibernation and Wake-Up
**Setup:** Same as Test 1 with database created
**Test Steps:**
1. **Check Activity**
```bash
# In logs, verify "last_query" timestamps updating on queries
```
2. **Wait for Hibernation**
- Stop issuing queries
- Wait `hibernation_timeout` + 10s
- Check logs for "Database is idle"
- Verify "Coordinated shutdown message sent"
- Verify "Database hibernated successfully"
3. **Verify Hibernation**
```bash
# Check that rqlite processes are stopped
ps aux | grep rqlite
# Verify data directories still exist
ls -la data/*/testdb/
```
4. **Wake Up**
- Issue a query to the database
- Watch logs for "Received wakeup request"
- Verify "Database woke up successfully"
- Verify query succeeds
**Expected Results:**
- Hibernation happens after idle timeout
- All 3 nodes hibernate coordinated
- Wake-up completes in < 8 seconds
- Data persists across hibernation cycle
---
### Test 3: Multiple Databases
**Setup:** 3 nodes running
**Test Steps:**
1. **Create Multiple Databases**
```
Create: users_db
Create: products_db
Create: orders_db
```
2. **Verify Isolation**
- Insert data in users_db
- Verify data NOT in products_db
- Verify data NOT in orders_db
3. **Check Port Allocation**
```bash
# Verify different ports for each database
netstat -tlnp | grep rqlite
# OR
ss -tlnp | grep rqlite
```
4. **Verify Data Directories**
```bash
tree data/bootstrap/
# Should show:
# ├── users_db/
# ├── products_db/
# └── orders_db/
```
**Expected Results:**
- 3 separate database clusters
- Each with 3 nodes (9 total instances)
- Complete data isolation
- Unique port pairs for each instance
---
### Test 4: Node Failure and Recovery
**Setup:** 4 nodes running, database created on nodes 1-3
**Test Steps:**
1. **Verify Initial State**
- Database active on nodes 1, 2, 3
- Node 4 idle
2. **Simulate Failure**
```bash
# Kill node 3 (SIGKILL for unclean shutdown)
kill -9 <node3_pid>
```
3. **Watch for Detection**
- Check logs on nodes 1 and 2
- Wait for health check failures (3 missed pings)
- Verify "Node detected as unhealthy" messages
4. **Watch for Replacement**
- Check for "NODE_REPLACEMENT_NEEDED" broadcast
- Node 4 should offer to replace
- Verify "Starting as replacement node" on node 4
- Verify node 4 joins Raft cluster
5. **Verify Data Integrity**
- Query database from node 4
- Verify all data present
- Insert new data from node 4
- Verify replication to nodes 1 and 2
**Expected Results:**
- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from new node
- No data loss
---
### Test 5: Port Exhaustion
**Setup:** 1 node with small port range
**Configuration:**
```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```
**Test Steps:**
1. **Create Databases**
- Create database 1 (succeeds - uses 2 ports)
- Create database 2 (succeeds - uses 2 ports)
- Create database 3 (fails - only 1 port left)
2. **Verify Error**
- Check logs for "Cannot allocate ports"
- Verify error returned to client
3. **Free Ports**
- Hibernate or delete database 1
- Ports should be freed
4. **Retry**
- Create database 3 again
- Should succeed now
**Expected Results:**
- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled
---
### Test 6: Orphaned Data Cleanup
**Setup:** 1 node stopped
**Test Steps:**
1. **Create Orphaned Data**
```bash
# While node is stopped
mkdir -p data/bootstrap/orphaned_db/rqlite
echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
```
2. **Start Node**
```bash
./bin/node --data bootstrap --id bootstrap
```
3. **Check Reconciliation**
- Watch logs for "Starting orphaned data reconciliation"
- Verify "Found orphaned database directory"
- Verify "Removed orphaned database directory"
4. **Verify Cleanup**
```bash
ls data/bootstrap/
# orphaned_db should be gone
```
**Expected Results:**
- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged
---
### Test 7: Stress Test - Many Databases
**Setup:** 5 nodes with high capacity
**Configuration:**
```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```
**Test Steps:**
1. **Create Many Databases**
```
Loop: Create databases db_1 through db_25
```
2. **Verify Distribution**
- Check logs for node capacity announcements
- Verify databases distributed across nodes
- No single node overloaded
3. **Concurrent Operations**
- Write to multiple databases simultaneously
- Read from multiple databases
- Verify no conflicts
4. **Hibernation Wave**
- Stop all activity
- Wait for hibernation
- Verify all databases hibernate
- Check resource usage drops
5. **Wake-Up Storm**
- Query all 25 databases at once
- Verify all wake up successfully
- Check for thundering herd issues
**Expected Results:**
- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up
---
### Test 8: Gateway API Access
**Setup:** Gateway running with 3 nodes
**Test Steps:**
1. **Authenticate**
```bash
# Get JWT token
TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"wallet": "..."}' | jq -r .token)
```
2. **Create Table**
```bash
curl -X POST http://localhost:8080/v1/database/create-table \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}'
```
3. **Insert Data**
```bash
curl -X POST http://localhost:8080/v1/database/exec \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"sql": "INSERT INTO users (name, email) VALUES (?, ?)",
"args": ["Alice", "alice@example.com"]
}'
```
4. **Query Data**
```bash
curl -X POST http://localhost:8080/v1/database/query \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"sql": "SELECT * FROM users"
}'
```
5. **Test Transaction**
```bash
curl -X POST http://localhost:8080/v1/database/transaction \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"queries": [
"INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
"INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
]
}'
```
6. **Get Schema**
```bash
curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
-H "Authorization: Bearer $TOKEN"
```
7. **Test Hibernation**
- Wait for hibernation timeout
- Query again and measure wake-up time
- Should see delay on first query after hibernation
**Expected Results:**
- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects created tables
- Hibernation/wake-up transparent to API
- Response times reasonable (< 30s for queries)
---
## Test Checklist
### Unit Tests (To Implement)
- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry
### Integration Tests (To Implement)
- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation
### Manual Tests (To Perform)
- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases
### Performance Validation
- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional
## Running Tests
### Unit Tests
```bash
# Run all tests
go test ./pkg/rqlite/... -v
# Run with race detector
go test ./pkg/rqlite/... -race
# Run specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v
# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```
### Integration Tests
```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m
# Run specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```
### Manual Tests
Follow the scenarios above in dedicated terminals for each node.
## Success Criteria
### Correctness
✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected
### Performance
✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms
### Reliability
✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles stable
✅ Concurrent operations safe
## Notes for Future Test Enhancements
When implementing advanced metrics and benchmarks:
1. **Prometheus Metrics Tests**
- Verify metric export
- Validate metric values
- Test metric reset on restart
2. **Benchmark Suite**
- Automated performance regression detection
- Latency percentile tracking (p50, p95, p99)
- Throughput measurements
- Resource usage profiling
3. **Chaos Engineering**
- Random node kills
- Network partitions
- Clock skew simulation
- Disk full scenarios
4. **Long-Running Stability**
- 24-hour soak test
- Memory leak detection
- Slow-growing resource usage
## Debugging Failed Tests
### Common Issues
**Port Conflicts**
```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999
# Kill stale processes
pkill rqlited
```
**Stale Data**
```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```
**Timing Issues**
- Increase timeouts in flaky tests
- Add retry logic with exponential backoff
- Use proper synchronization primitives
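One way to add that retry logic in Go tests; this `eventually` helper is a sketch, not an existing helper in the repository:
```go
package e2e

import (
  "testing"
  "time"
)

// eventually retries check with exponential backoff until it succeeds or
// the deadline passes, then fails the test. Useful for flaky e2e waits.
func eventually(t *testing.T, deadline time.Duration, check func() error) {
  t.Helper()
  backoff := 100 * time.Millisecond
  end := time.Now().Add(deadline)
  var lastErr error
  for time.Now().Before(end) {
    if lastErr = check(); lastErr == nil {
      return
    }
    time.Sleep(backoff)
    if backoff < 2*time.Second {
      backoff *= 2
    }
  }
  t.Fatalf("condition not met within %v: %v", deadline, lastErr)
}
```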
**Race Conditions**
```bash
# Always run with race detector during development
go test -race ./...
```