Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide

- Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files, as they are no longer relevant to the current implementation.
- Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources.
This commit is contained in: parent dd4cb832dc · commit 36002d342c
@@ -1,165 +0,0 @@
<!-- ec358e91-8e19-4fc8-a81e-cb388a4b2fc9 4c357d4a-bae7-4fe2-943d-84e5d3d3714c -->

# Dynamic Database Clustering — Implementation Plan

### Scope

Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.
### Guiding Principles

- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations, feature-flag controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.

### Phase 0: Prep & Scaffolding

- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
  - `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
  - `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.
### Phase 1: Metadata Layer (No hibernation yet)

- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
  - `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using the existing `pkg/pubsub` bridge:
  - Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
  - Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs); see the vector-clock sketch after this list.
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.
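To make the merge and compare rules concrete, here is a minimal vector-clock sketch in Go. It is illustrative only: the type and method names are assumptions, not the actual `pkg/rqlite/vector_clock.go` API.

```go
package rqlite

// VectorClock maps a node (peer) ID to a monotonically increasing counter.
// This is an illustrative sketch, not the real implementation.
type VectorClock map[string]uint64

// Increment bumps the counter for the given node.
func (vc VectorClock) Increment(nodeID string) {
	vc[nodeID]++
}

// Merge takes the element-wise maximum of two clocks (used when gossiping state).
func (vc VectorClock) Merge(other VectorClock) {
	for node, counter := range other {
		if counter > vc[node] {
			vc[node] = counter
		}
	}
}

// Compare returns -1 if vc happened-before other, 1 if it happened-after,
// and 0 if the clocks are equal or concurrent (conflicting updates).
func (vc VectorClock) Compare(other VectorClock) int {
	less, greater := false, false
	for node, counter := range vc {
		if counter < other[node] {
			less = true
		} else if counter > other[node] {
			greater = true
		}
	}
	for node, counter := range other {
		if _, ok := vc[node]; !ok && counter > 0 {
			less = true
		}
	}
	switch {
	case less && !greater:
		return -1
	case greater && !less:
		return 1
	default:
		return 0
	}
}
```

A `Compare` result of 0 for unequal clocks means the updates were concurrent; the gossip reconciliation then needs a deterministic tie-break (for example, the coordinator's view or peer-ID ordering).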
### Phase 2: Database Creation & Client API

- Port management:
  - `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping (a bind-probing sketch follows this list).
- Subprocess control:
  - `RQLiteInstance` lifecycle (start, wait ready via `/status` and a simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, readiness fan-in, failure surfaces.
- Client API:
  - Update `pkg/client/interface.go` to include `Database(name string)`.
  - Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
  - Backoff polling for readiness during creation.
- Data isolation:
  - Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
- Integration tests: create a single db across 3 nodes; multiple databases coexisting; cross-node read/write.
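A minimal sketch of the bind-probing idea: try to bind a TCP listener on a candidate port and treat success as "available". The `tryBindPort`/`allocateInRange` names are illustrative assumptions, not the actual `PortManager` API.

```go
package dbcluster

import (
	"fmt"
	"math/rand"
	"net"
)

// tryBindPort reports whether the port can currently be bound on localhost.
// Binding and immediately closing is a cheap availability probe; a small race
// window remains until the real rqlite process binds the port.
func tryBindPort(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = l.Close()
	return true
}

// allocateInRange picks a random free port in [start, end], retrying a bounded
// number of times before giving up.
func allocateInRange(start, end int) (int, error) {
	if end < start {
		return 0, fmt.Errorf("invalid port range %d-%d", start, end)
	}
	for attempt := 0; attempt < 50; attempt++ {
		candidate := start + rand.Intn(end-start+1)
		if tryBindPort(candidate) {
			return candidate, nil
		}
	}
	return 0, fmt.Errorf("no free port found in range %d-%d", start, end)
}
```

The real manager would additionally record allocations locally so that two databases created back-to-back on the same node cannot race for the same pair.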
### Phase 3: Hibernation & Wake-Up

- Idle detection and coordination:
  - Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule (an idle-scan sketch follows this list).
- Hibernation protocol:
  - Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
  - Client detects `hibernating`, performs CAS to `waking`, triggers a wake request; reuse ports if available, else re-negotiate; start instances; status → `active`.
- Client retry UX:
  - Transparent retries with exponential backoff; treat `waking` as a wait-only state.
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
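An illustrative idle-scan loop, assuming an instance exposes its last-query timestamp; the `instance`, `isIdle`, and `broadcastIdle` names are placeholders, not the actual manager API.

```go
package dbcluster

import (
	"sync"
	"time"
)

type instance struct {
	mu        sync.Mutex
	lastQuery time.Time
}

func (i *instance) isIdle(timeout time.Duration) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	return time.Since(i.lastQuery) > timeout
}

// scanForIdle periodically checks every local instance and reports the idle ones,
// so the coordinator can schedule a coordinated shutdown once all replicas agree.
func scanForIdle(instances map[string]*instance, timeout time.Duration, broadcastIdle func(db string)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for db, inst := range instances {
			if inst.isIdle(timeout) {
				broadcastIdle(db) // placeholder for the pubsub idle notice
			}
		}
	}
}
```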
### Phase 4: Resilience (Failure & Replacement)

- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
  - Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, selection is confirmed, and the new node joins raft via `-join`, then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.
### Phase 5: Production Hardening & Optimization

- Metrics/logging:
  - Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1, 3, 5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.
### File Changes (Essentials)

- `pkg/config/config.go`
  - Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
  - Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval` (a sketch of the resulting struct follows this list).
- `pkg/client/interface.go` / `pkg/client/client.go`
  - Add `Database(name string)` and the app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
  - Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
  - Refactor from a singleton to instance-oriented helpers.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect the new `database` block.
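A rough sketch of how the new `database` block could map onto Go types, based on the field list above; the exact struct and field names in `pkg/config/config.go` may differ.

```go
package config

import "time"

// PortRange bounds the dynamically allocated HTTP and Raft ports.
type PortRange struct {
	HTTPStart int `yaml:"http_start"`
	HTTPEnd   int `yaml:"http_end"`
	RaftStart int `yaml:"raft_start"`
	RaftEnd   int `yaml:"raft_end"`
}

// DatabaseConfig is the proposed replacement for the legacy single-cluster fields.
type DatabaseConfig struct {
	ReplicationFactor  int           `yaml:"replication_factor"`
	HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
	MaxDatabases       int           `yaml:"max_databases"`
	PortRange          PortRange     `yaml:"port_range"`
}
```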
### Config Example (target end-state)

```yaml
node:
  data_dir: "./data"

database:
  replication_factor: 3
  hibernation_timeout: 60s
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999

discovery:
  health_check_interval: 10s
```
### Rollout Strategy

- Keep feature flag off by default; support legacy single-cluster path.
- Ship Phase 1 behind flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
- Remove legacy config after deprecation window.

### Testing & Quality Gates

- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.

### Risks & Mitigations (operationalized)

- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.

### Timeline (indicative)

- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5

### To-dos

- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides
@@ -1,504 +0,0 @@
# Dynamic Database Clustering - User Guide

## Overview

Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as a separate 3-node cluster with its own data directory and port allocation.

## Key Features

✅ **Multi-Database Support** - Create unlimited isolated databases on-demand
✅ **3-Node Replication** - Fault-tolerant by default (configurable)
✅ **Auto Hibernation** - Idle databases hibernate to save resources
✅ **Transparent Wake-Up** - Automatic restart on access
✅ **App Namespacing** - Databases are scoped by application name
✅ **Decentralized Metadata** - LibP2P pubsub-based coordination
✅ **Failure Recovery** - Automatic node replacement on failures
✅ **Resource Optimization** - Dynamic port allocation and data isolation
## Configuration

### Node Configuration (`configs/node.yaml`)

```yaml
node:
  data_dir: "./data"
  listen_addresses:
    - "/ip4/0.0.0.0/tcp/4001"
  max_connections: 50

database:
  replication_factor: 3        # Number of replicas per database
  hibernation_timeout: 60s     # Idle time before hibernation
  max_databases: 100           # Max databases per node
  port_range_http_start: 5001  # HTTP port range start
  port_range_http_end: 5999    # HTTP port range end
  port_range_raft_start: 7001  # Raft port range start
  port_range_raft_end: 7999    # Raft port range end

discovery:
  bootstrap_peers:
    - "/ip4/127.0.0.1/tcp/4001/p2p/..."
  discovery_interval: 30s
  health_check_interval: 10s
```

### Key Configuration Options

#### `database.replication_factor` (default: 3)

Number of nodes that will host each database cluster. Minimum 1; 3 is recommended for fault tolerance.

#### `database.hibernation_timeout` (default: 60s)

Time of inactivity before a database is hibernated. Set to 0 to disable hibernation.

#### `database.max_databases` (default: 100)

Maximum number of databases this node can host simultaneously.

#### `database.port_range_*`

Port ranges for dynamic allocation. Ensure the ranges are large enough for `max_databases * 2` ports in total (one HTTP and one Raft port per database).
## Client Usage

### Creating/Accessing Databases

```go
package main

import (
	"context"

	"github.com/DeBrosOfficial/network/pkg/client"
)

func main() {
	// Create client with app name for namespacing
	cfg := client.DefaultClientConfig("myapp")
	cfg.BootstrapPeers = []string{
		"/ip4/127.0.0.1/tcp/4001/p2p/...",
	}

	c, err := client.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	// Connect to network
	if err := c.Connect(); err != nil {
		panic(err)
	}
	defer c.Disconnect()

	// Get database client (creates the database if it doesn't exist)
	db, err := c.Database("users")
	if err != nil {
		panic(err)
	}

	// Use the database
	ctx := context.Background()
	err = db.CreateTable(ctx, `
		CREATE TABLE users (
			id INTEGER PRIMARY KEY,
			name TEXT NOT NULL,
			email TEXT UNIQUE
		)
	`)
	if err != nil {
		panic(err)
	}

	// Query data
	result, err := db.Query(ctx, "SELECT * FROM users")
	if err != nil {
		panic(err)
	}
	_ = result // ...
}
```
### Database Naming

Databases are automatically namespaced by your application name:

- `client.Database("users")` → creates `myapp_users` internally
- This prevents name collisions between different applications
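A minimal sketch of what the namespacing and sanitization could look like; the `namespaceDatabase` helper and the exact sanitization rules are assumptions for illustration, not the actual `pkg/client` implementation.

```go
package client

import (
	"regexp"
	"strings"
)

// unsafeChars matches anything outside a conservative identifier charset.
var unsafeChars = regexp.MustCompile(`[^a-z0-9_]+`)

// sanitize lowercases a name and replaces unsafe characters with underscores.
func sanitize(name string) string {
	return unsafeChars.ReplaceAllString(strings.ToLower(name), "_")
}

// namespaceDatabase combines the app name and database name into the
// internal identifier, e.g. ("myapp", "users") -> "myapp_users".
func namespaceDatabase(appName, dbName string) string {
	return sanitize(appName) + "_" + sanitize(dbName)
}
```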
## Gateway API Usage

If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints:

### Base URL

```
http://gateway-host:8080/v1/database/
```
### Execute SQL (INSERT, UPDATE, DELETE, DDL)

```bash
POST /v1/database/exec
Content-Type: application/json

{
  "database": "users",
  "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
  "args": ["Alice", "alice@example.com"]
}

Response:
{
  "rows_affected": 1,
  "last_insert_id": 1
}
```

### Query Data (SELECT)

```bash
POST /v1/database/query
Content-Type: application/json

{
  "database": "users",
  "sql": "SELECT * FROM users WHERE name LIKE ?",
  "args": ["A%"]
}

Response:
{
  "items": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"}
  ],
  "count": 1
}
```

### Execute Transaction

```bash
POST /v1/database/transaction
Content-Type: application/json

{
  "database": "users",
  "queries": [
    "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')",
    "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'"
  ]
}

Response:
{
  "success": true
}
```

### Get Schema

```bash
GET /v1/database/schema?database=users

# OR

POST /v1/database/schema
Content-Type: application/json

{
  "database": "users"
}

Response:
{
  "tables": [
    {
      "name": "users",
      "columns": ["id", "name", "email"]
    }
  ]
}
```
### Create Table

```bash
POST /v1/database/create-table
Content-Type: application/json

{
  "database": "users",
  "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}

Response:
{
  "rows_affected": 0
}
```

### Drop Table

```bash
POST /v1/database/drop-table
Content-Type: application/json

{
  "database": "users",
  "table_name": "old_table"
}

Response:
{
  "rows_affected": 0
}
```

### List Databases

```bash
GET /v1/database/list

Response:
{
  "databases": ["users", "products", "orders"]
}
```

### Important Notes

1. **Authentication Required**: All endpoints require authentication (JWT or API key)
2. **Database Creation**: Databases are created automatically on first access
3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on the first query to a hibernating database (see the Go example below)
4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s
5. **Namespacing**: Database names are automatically prefixed with your app name
6. **Concurrent Access**: All endpoints are safe for concurrent use
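Because wake-up can add a few seconds to the first request after hibernation, it helps to give gateway calls a generous client-side timeout. A minimal Go sketch using only the standard library; the endpoint path and JSON shape follow the examples above, while the token handling is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Allow enough headroom for a hibernating database to wake up (< 8s)
	// plus normal query time.
	httpClient := &http.Client{Timeout: 30 * time.Second}

	payload, _ := json.Marshal(map[string]any{
		"database": "users",
		"sql":      "SELECT * FROM users WHERE name LIKE ?",
		"args":     []any{"A%"},
	})

	req, err := http.NewRequest(http.MethodPost,
		"http://gateway-host:8080/v1/database/query", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <your-token>") // JWT or API key (placeholder)

	resp, err := httpClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```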
## Database Lifecycle

### 1. Creation

When you first access a database:

1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST`
2. **Node Selection** - Eligible nodes respond with available ports
3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen
4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM`
5. **Instance Startup** - Selected nodes start rqlite subprocesses
6. **Readiness** - Nodes report `active` status when ready

**Typical creation time: < 10 seconds**

### 2. Active State

- Database instances run as rqlite subprocesses
- Each instance tracks a `LastQuery` timestamp
- Queries update the activity timestamp
- Metadata is synced across all network nodes

### 3. Hibernation

After `hibernation_timeout` of inactivity:

1. **Idle Detection** - Nodes detect idle databases
2. **Idle Notification** - Nodes broadcast idle status
3. **Coordinated Shutdown** - When all nodes report idle, the coordinator schedules shutdown
4. **Graceful Stop** - SIGTERM sent to rqlite processes
5. **Port Release** - Ports freed for reuse
6. **Status Update** - Metadata updated to `hibernating`

**Data persists on disk during hibernation**

### 4. Wake-Up

On the first query to a hibernating database (see the client retry sketch below):

1. **Detection** - Client/node detects `hibernating` status
2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST`
3. **Port Allocation** - Reuse original ports or allocate new ones
4. **Instance Restart** - Restart rqlite with existing data
5. **Status Update** - Update to `active` when ready

**Typical wake-up time: < 8 seconds**
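A sketch of the transparent retry behaviour from the client's perspective: retry with exponential backoff while the database reports `hibernating` or `waking`. The `statusOf` and `runQuery` callbacks are placeholders for illustration, not the actual client API.

```go
package client

import (
	"context"
	"fmt"
	"time"
)

// queryWithWakeup retries a query while the database is hibernating or waking,
// backing off exponentially, and gives up when the context is cancelled.
func queryWithWakeup(
	ctx context.Context,
	statusOf func(ctx context.Context) (string, error), // e.g. "active", "hibernating", "waking"
	runQuery func(ctx context.Context) error,
) error {
	backoff := 250 * time.Millisecond
	for {
		status, err := statusOf(ctx)
		if err != nil {
			return err
		}
		if status == "active" {
			return runQuery(ctx)
		}
		// "hibernating" triggers a wake request upstream; "waking" is wait-only.
		select {
		case <-ctx.Done():
			return fmt.Errorf("database not active before deadline: last status %q", status)
		case <-time.After(backoff):
		}
		if backoff < 4*time.Second {
			backoff *= 2
		}
	}
}
```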
### 5. Failure Recovery

When a node fails:

1. **Health Detection** - Missed health checks trigger failure detection
2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED`
3. **Offers** - Healthy nodes with capacity offer to replace
4. **Selection** - First offer accepted (simple approach)
5. **Join Cluster** - New node joins the existing Raft cluster
6. **Sync** - Data synced from existing members

## Data Management

### Data Directories

Each database gets its own data directory:

```
./data/
├── myapp_users/        # Database: users
│   └── rqlite/
│       ├── db.sqlite
│       └── raft/
├── myapp_products/     # Database: products
│   └── rqlite/
└── myapp_orders/       # Database: orders
    └── rqlite/
```
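A small sketch of how the per-database path could be derived from the node's `data_dir`; the helper name is an assumption for illustration.

```go
package dbcluster

import "path/filepath"

// databaseDataDir returns the rqlite data directory for a namespaced database,
// e.g. ("./data", "myapp", "users") -> "data/myapp_users/rqlite".
func databaseDataDir(baseDataDir, appName, dbName string) string {
	return filepath.Join(baseDataDir, appName+"_"+dbName, "rqlite")
}
```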
### Orphaned Data Cleanup

On node startup, the system automatically:

- Scans data directories
- Checks against metadata
- Removes directories for:
  - Non-existent databases
  - Databases where this node is not a member

## Monitoring & Debugging

### Structured Logging

All operations are logged with structured fields:

```
INFO Starting cluster manager node_id=12D3... max_databases=100
INFO Received database create request database=myapp_users requester=12D3...
INFO Database instance started database=myapp_users http_port=5001 raft_port=7001
INFO Database is idle database=myapp_users idle_time=62s
INFO Database hibernated successfully database=myapp_users
INFO Received wakeup request database=myapp_users
INFO Database woke up successfully database=myapp_users
```

### Health Checks

Nodes perform periodic health checks:

- Every `health_check_interval` (default: 10s)
- Tracks last-seen time for each peer
- 3 missed checks → node marked unhealthy
- Triggers the replacement protocol for affected databases

## Best Practices

### 1. **Capacity Planning**

```yaml
# For 100 databases with 3-node replication:
database:
  max_databases: 100
  port_range_http_start: 5001
  port_range_http_end: 5200    # 200 ports (100 databases * 2)
  port_range_raft_start: 7001
  port_range_raft_end: 7200
```

### 2. **Hibernation Tuning**

- **High Traffic**: Set `hibernation_timeout: 300s` or higher
- **Development**: Set `hibernation_timeout: 30s` for faster cycles
- **Always-On DBs**: Set `hibernation_timeout: 0` to disable

### 3. **Replication Factor**

- **Development**: `replication_factor: 1` (single node, no replication)
- **Production**: `replication_factor: 3` (fault tolerant)
- **High Availability**: `replication_factor: 5` (survives 2 failures)

### 4. **Network Topology**

- Use at least 3 nodes for `replication_factor: 3`
- Ensure `max_databases * replication_factor <= total_cluster_capacity`
- Example: 3 nodes × 100 max_databases = 300 database instances total

## Troubleshooting

### Database Creation Fails

**Problem**: `insufficient nodes responded: got 1, need 3`

**Solution**:

- Ensure you have at least `replication_factor` nodes online
- Check the `max_databases` limit on nodes
- Verify port ranges aren't exhausted

### Database Not Waking Up

**Problem**: Database stays in `waking` status

**Solution**:

- Check node logs for rqlite startup errors
- Verify the rqlite binary is installed
- Check for port conflicts (use different port ranges)
- Ensure the data directory is accessible

### Orphaned Data

**Problem**: Disk space consumed by old databases

**Solution**:

- Orphaned data is automatically cleaned on node restart
- Manual cleanup: delete directories from `./data/` that don't match metadata
- Check logs for reconciliation results

### Node Replacement Not Working

**Problem**: Failed node not replaced

**Solution**:

- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`)
- Check network connectivity between nodes
- Verify the health check interval is reasonable (not too aggressive)

## Advanced Topics

### Metadata Consistency

- **Vector Clocks**: Each metadata update includes a vector clock for conflict resolution
- **Gossip Protocol**: Periodic metadata sync via checksums
- **Eventual Consistency**: All nodes eventually agree on database state

### Port Management

- Ports allocated randomly within configured ranges
- Bind-probing ensures ports are actually available
- Ports reused during wake-up when possible
- Failed allocations fall back to new random ports

### Coordinator Election

- Deterministic selection based on lexicographical peer ID ordering
- Lowest peer ID becomes coordinator
- No persistent coordinator state
- Re-election occurs for each database operation (see the sketch below)
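The election rule is simple enough to show in full: sort the candidate peer IDs lexicographically and take the lowest. A minimal illustrative sketch; the function name is an assumption, not the actual `pkg/rqlite/consensus.go` API.

```go
package rqlite

import (
	"errors"
	"sort"
)

// electCoordinator deterministically picks the coordinator for an operation:
// the lexicographically lowest peer ID among the candidates. Every node that
// runs this over the same candidate set arrives at the same answer, so no
// extra coordination round is needed.
func electCoordinator(peerIDs []string) (string, error) {
	if len(peerIDs) == 0 {
		return "", errors.New("no candidate peers")
	}
	sorted := append([]string(nil), peerIDs...)
	sort.Strings(sorted)
	return sorted[0], nil
}
```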
## Migration from Legacy Mode

If upgrading from single-cluster rqlite:

1. **Backup Data**: Back up your existing `./data/rqlite` directory
2. **Update Config**: Remove deprecated fields:
   - `database.data_dir`
   - `database.rqlite_port`
   - `database.rqlite_raft_port`
   - `database.rqlite_join_address`
3. **Add New Fields**: Configure dynamic clustering (see the Configuration section)
4. **Restart Nodes**: Restart all nodes with the new configuration
5. **Migrate Data**: Create a new database and import data from the backup

## Future Enhancements

The following features are planned for future releases:

### **Advanced Metrics** (Future)

- Prometheus-style metrics export
- Per-database query counters
- Hibernation/wake-up latency histograms
- Resource utilization gauges

### **Performance Benchmarks** (Future)

- Automated benchmark suite
- Creation time SLOs
- Wake-up latency targets
- Query overhead measurements

### **Enhanced Monitoring** (Future)

- Dashboard for cluster visualization
- Database status API endpoint
- Capacity planning tools
- Alerting integration

## Support

For issues, questions, or contributions:

- GitHub Issues: https://github.com/DeBrosOfficial/network/issues
- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md

## License

See the LICENSE file for details.
TESTING_GUIDE.md
@@ -1,827 +0,0 @@
# Dynamic Database Clustering - Testing Guide

This guide provides a comprehensive list of the unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.

## Unit Tests
### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)

```go
// Test cases to implement:

func TestMetadataStore_GetSetDatabase(t *testing.T)
- Create store
- Set database metadata
- Get database metadata
- Verify data matches

func TestMetadataStore_DeleteDatabase(t *testing.T)
- Set database metadata
- Delete database
- Verify Get returns nil

func TestMetadataStore_ListDatabases(t *testing.T)
- Add multiple databases
- List all databases
- Verify count and contents

func TestMetadataStore_ConcurrentAccess(t *testing.T)
- Spawn multiple goroutines
- Concurrent reads and writes
- Verify no race conditions (run with -race)

func TestMetadataStore_NodeCapacity(t *testing.T)
- Set node capacity
- Get node capacity
- Update capacity
- List nodes
```

### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)

```go
func TestVectorClock_Increment(t *testing.T)
- Create empty vector clock
- Increment for node A
- Verify counter is 1
- Increment again
- Verify counter is 2

func TestVectorClock_Merge(t *testing.T)
- Create two vector clocks with different nodes
- Merge them
- Verify max values are preserved

func TestVectorClock_Compare(t *testing.T)
- Test strictly less than case
- Test strictly greater than case
- Test concurrent case
- Test identical case

func TestVectorClock_Concurrent(t *testing.T)
- Create clocks with overlapping updates
- Verify Compare returns 0 (concurrent)
```

### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)

```go
func TestElectCoordinator_SingleNode(t *testing.T)
- Pass single node ID
- Verify it's elected

func TestElectCoordinator_MultipleNodes(t *testing.T)
- Pass multiple node IDs
- Verify lowest lexicographical ID wins
- Verify deterministic (same input = same output)

func TestElectCoordinator_EmptyList(t *testing.T)
- Pass empty list
- Verify error returned

func TestElectCoordinator_Deterministic(t *testing.T)
- Run election multiple times with same inputs
- Verify same coordinator each time
```
### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)

```go
func TestPortManager_AllocatePortPair(t *testing.T)
- Create manager with port range
- Allocate port pair
- Verify HTTP and Raft ports different
- Verify ports within range

func TestPortManager_ReleasePortPair(t *testing.T)
- Allocate port pair
- Release ports
- Verify ports can be reallocated

func TestPortManager_Exhaustion(t *testing.T)
- Allocate all available ports
- Attempt one more allocation
- Verify error returned

func TestPortManager_IsPortAllocated(t *testing.T)
- Allocate ports
- Check IsPortAllocated returns true
- Release ports
- Check IsPortAllocated returns false

func TestPortManager_AllocateSpecificPorts(t *testing.T)
- Allocate specific ports
- Verify allocation succeeds
- Attempt to allocate same ports again
- Verify error returned
```

### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)

```go
func TestRQLiteInstance_Create(t *testing.T)
- Create instance configuration
- Verify fields set correctly

func TestRQLiteInstance_IsIdle(t *testing.T)
- Set LastQuery to old timestamp
- Verify IsIdle returns true
- Update LastQuery
- Verify IsIdle returns false

// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
- Create instance
- Start instance
- Verify HTTP endpoint responsive
- Stop instance
- Verify process terminated
```

### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)

```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
- Create each message type
- Marshal to bytes
- Unmarshal back
- Verify data preserved

func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```

### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)

```go
func TestCreateCoordinator_AddResponse(t *testing.T)
- Create coordinator
- Add responses
- Verify response count

func TestCreateCoordinator_SelectNodes(t *testing.T)
- Add more responses than needed
- Call SelectNodes
- Verify correct number selected
- Verify deterministic selection

func TestCreateCoordinator_WaitForResponses(t *testing.T)
- Create coordinator
- Wait in goroutine
- Add responses from another goroutine
- Verify wait completes when enough responses

func TestCoordinatorRegistry(t *testing.T)
- Register coordinator
- Get coordinator
- Remove coordinator
- Verify lifecycle
```
## Integration Tests

### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)

```go
func TestSingleNodeDatabaseCreation(t *testing.T)
- Start 1 node
- Set replication_factor = 1
- Create database
- Verify database active
- Write data
- Read data back
- Verify data matches
```

### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)

```go
func TestThreeNodeDatabaseCreation(t *testing.T)
- Start 3 nodes
- Set replication_factor = 3
- Create database from node 1
- Wait for all nodes to report active
- Write data to node 1
- Read from node 2
- Verify replication worked
```

### 3. Multiple Databases (`e2e/multiple_databases_test.go`)

```go
func TestMultipleDatabases(t *testing.T)
- Start 3 nodes
- Create database "users"
- Create database "products"
- Create database "orders"
- Verify all databases active
- Write to each database
- Verify data isolation
```
### 4. Hibernation Cycle (`e2e/hibernation_test.go`)

A polling helper for these status checks is sketched after this block.

```go
func TestHibernationCycle(t *testing.T)
- Start 3 nodes with hibernation_timeout=5s
- Create database
- Write initial data
- Wait 10 seconds (no activity)
- Verify status = hibernating
- Verify processes stopped
- Verify data persisted on disk

func TestWakeUpCycle(t *testing.T)
- Create and hibernate database
- Issue query
- Wait for wake-up
- Verify status = active
- Verify data still accessible
- Verify LastQuery updated
```
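Status transitions such as `hibernating` → `active` are timing-dependent, so e2e assertions usually poll rather than sleep a fixed amount. A hedged sketch of such a helper; `statusOf` stands in for whatever the test harness exposes.

```go
package e2e

import (
	"testing"
	"time"
)

// waitForStatus polls until the database reports the wanted status or the
// deadline passes, failing the test with the last observed status otherwise.
func waitForStatus(t *testing.T, statusOf func() string, want string, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	last := ""
	for time.Now().Before(deadline) {
		last = statusOf()
		if last == want {
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	t.Fatalf("database never reached status %q (last seen %q)", want, last)
}
```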
### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)

```go
func TestNodeFailureDetection(t *testing.T)
- Start 3 nodes
- Create database
- Kill one node (SIGKILL)
- Wait for health checks to detect failure
- Verify NODE_REPLACEMENT_NEEDED broadcast

func TestNodeReplacement(t *testing.T)
- Start 4 nodes
- Create database on nodes 1,2,3
- Kill node 3
- Wait for replacement
- Verify node 4 joins cluster
- Verify data accessible from node 4
```

### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)

```go
func TestOrphanedDataCleanup(t *testing.T)
- Start node
- Manually create orphaned data directory
- Restart node
- Verify orphaned directory removed
- Check logs for reconciliation message
```
### 7. Concurrent Operations (`e2e/concurrent_test.go`)

A sketch of driving concurrent creations from a test follows this block.

```go
func TestConcurrentDatabaseCreation(t *testing.T)
- Start 5 nodes
- Create 10 databases concurrently
- Verify all successful
- Verify no port conflicts
- Verify proper distribution

func TestConcurrentHibernation(t *testing.T)
- Create multiple databases
- Let all go idle
- Verify all hibernate correctly
- No race conditions
```
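A minimal sketch of the concurrent-creation driver, assuming a hypothetical `createDatabase(name) error` helper provided by the test harness.

```go
package e2e

import (
	"fmt"
	"sync"
	"testing"
)

// createManyConcurrently creates n databases in parallel and reports every failure.
func createManyConcurrently(t *testing.T, n int, createDatabase func(name string) error) {
	t.Helper()
	var wg sync.WaitGroup
	errs := make(chan error, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("db_%d", i)
			if err := createDatabase(name); err != nil {
				errs <- fmt.Errorf("create %s: %w", name, err)
			}
		}(i)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		t.Error(err)
	}
}
```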
## Manual Test Scenarios

### Test 1: Basic Flow - Three Node Cluster

**Setup:**

```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001

# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002

# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```

**Test Steps:**

1. **Create Database**
   ```bash
   # Use client or API to create database "testdb"
   ```

2. **Verify Creation**
   - Check logs on all 3 nodes for "Database instance started"
   - Verify `./data/*/testdb/` directories exist on all nodes
   - Check that different ports were allocated on each node

3. **Write Data**
   ```sql
   CREATE TABLE users (id INT, name TEXT);
   INSERT INTO users VALUES (1, 'Alice');
   INSERT INTO users VALUES (2, 'Bob');
   ```

4. **Verify Replication**
   - Query from each node
   - Verify the same data is returned

**Expected Results:**

- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node

---

### Test 2: Hibernation and Wake-Up

**Setup:** Same as Test 1 with database created

**Test Steps:**

1. **Check Activity**
   ```bash
   # In logs, verify "last_query" timestamps updating on queries
   ```

2. **Wait for Hibernation**
   - Stop issuing queries
   - Wait `hibernation_timeout` + 10s
   - Check logs for "Database is idle"
   - Verify "Coordinated shutdown message sent"
   - Verify "Database hibernated successfully"

3. **Verify Hibernation**
   ```bash
   # Check that rqlite processes are stopped
   ps aux | grep rqlite

   # Verify data directories still exist
   ls -la data/*/testdb/
   ```

4. **Wake Up**
   - Issue a query to the database
   - Watch logs for "Received wakeup request"
   - Verify "Database woke up successfully"
   - Verify the query succeeds

**Expected Results:**

- Hibernation happens after the idle timeout
- All 3 nodes hibernate in a coordinated fashion
- Wake-up completes in < 8 seconds
- Data persists across the hibernation cycle

---
### Test 3: Multiple Databases

**Setup:** 3 nodes running

**Test Steps:**

1. **Create Multiple Databases**
   ```
   Create: users_db
   Create: products_db
   Create: orders_db
   ```

2. **Verify Isolation**
   - Insert data in users_db
   - Verify data NOT in products_db
   - Verify data NOT in orders_db

3. **Check Port Allocation**
   ```bash
   # Verify different ports for each database
   netstat -tlnp | grep rqlite
   # OR
   ss -tlnp | grep rqlite
   ```

4. **Verify Data Directories**
   ```bash
   tree data/bootstrap/
   # Should show:
   # ├── users_db/
   # ├── products_db/
   # └── orders_db/
   ```

**Expected Results:**

- 3 separate database clusters
- Each with 3 nodes (9 total instances)
- Complete data isolation
- Unique port pairs for each instance

---

### Test 4: Node Failure and Recovery

**Setup:** 4 nodes running, database created on nodes 1-3

**Test Steps:**

1. **Verify Initial State**
   - Database active on nodes 1, 2, 3
   - Node 4 idle

2. **Simulate Failure**
   ```bash
   # Kill node 3 (SIGKILL for unclean shutdown)
   kill -9 <node3_pid>
   ```

3. **Watch for Detection**
   - Check logs on nodes 1 and 2
   - Wait for health check failures (3 missed pings)
   - Verify "Node detected as unhealthy" messages

4. **Watch for Replacement**
   - Check for "NODE_REPLACEMENT_NEEDED" broadcast
   - Node 4 should offer to replace
   - Verify "Starting as replacement node" on node 4
   - Verify node 4 joins the Raft cluster

5. **Verify Data Integrity**
   - Query the database from node 4
   - Verify all data present
   - Insert new data from node 4
   - Verify replication to nodes 1 and 2

**Expected Results:**

- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from the new node
- No data loss

---
### Test 5: Port Exhaustion

**Setup:** 1 node with a small port range

**Configuration:**

```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```

**Test Steps:**

1. **Create Databases**
   - Create database 1 (succeeds - uses 2 ports)
   - Create database 2 (succeeds - uses 2 ports)
   - Create database 3 (fails - only 1 port left)

2. **Verify Error**
   - Check logs for "Cannot allocate ports"
   - Verify error returned to client

3. **Free Ports**
   - Hibernate or delete database 1
   - Ports should be freed

4. **Retry**
   - Create database 3 again
   - Should succeed now

**Expected Results:**

- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled

---

### Test 6: Orphaned Data Cleanup

**Setup:** 1 node stopped

**Test Steps:**

1. **Create Orphaned Data**
   ```bash
   # While node is stopped
   mkdir -p data/bootstrap/orphaned_db/rqlite
   echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
   ```

2. **Start Node**
   ```bash
   ./bin/node --data bootstrap --id bootstrap
   ```

3. **Check Reconciliation**
   - Watch logs for "Starting orphaned data reconciliation"
   - Verify "Found orphaned database directory"
   - Verify "Removed orphaned database directory"

4. **Verify Cleanup**
   ```bash
   ls data/bootstrap/
   # orphaned_db should be gone
   ```

**Expected Results:**

- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged

---
### Test 7: Stress Test - Many Databases

**Setup:** 5 nodes with high capacity

**Configuration:**

```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```

**Test Steps:**

1. **Create Many Databases**
   ```
   Loop: Create databases db_1 through db_25
   ```

2. **Verify Distribution**
   - Check logs for node capacity announcements
   - Verify databases are distributed across nodes
   - No single node overloaded

3. **Concurrent Operations**
   - Write to multiple databases simultaneously
   - Read from multiple databases
   - Verify no conflicts

4. **Hibernation Wave**
   - Stop all activity
   - Wait for hibernation
   - Verify all databases hibernate
   - Check that resource usage drops

5. **Wake-Up Storm**
   - Query all 25 databases at once
   - Verify all wake up successfully
   - Check for thundering herd issues

**Expected Results:**

- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up

---
### Test 8: Gateway API Access

**Setup:** Gateway running with 3 nodes

**Test Steps:**

1. **Authenticate**
   ```bash
   # Get JWT token
   TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
     -H "Content-Type: application/json" \
     -d '{"wallet": "..."}' | jq -r .token)
   ```

2. **Create Table**
   ```bash
   curl -X POST http://localhost:8080/v1/database/create-table \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
     }'
   ```

3. **Insert Data**
   ```bash
   curl -X POST http://localhost:8080/v1/database/exec \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
       "args": ["Alice", "alice@example.com"]
     }'
   ```

4. **Query Data**
   ```bash
   curl -X POST http://localhost:8080/v1/database/query \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "SELECT * FROM users"
     }'
   ```

5. **Test Transaction**
   ```bash
   curl -X POST http://localhost:8080/v1/database/transaction \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "queries": [
         "INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
         "INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
       ]
     }'
   ```

6. **Get Schema**
   ```bash
   curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
     -H "Authorization: Bearer $TOKEN"
   ```

7. **Test Hibernation**
   - Wait for the hibernation timeout
   - Query again and measure wake-up time
   - Should see a delay on the first query after hibernation

**Expected Results:**

- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects created tables
- Hibernation/wake-up transparent to the API
- Response times reasonable (< 30s for queries)

---
## Test Checklist

### Unit Tests (To Implement)

- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry

### Integration Tests (To Implement)

- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation

### Manual Tests (To Perform)

- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases

### Performance Validation

- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional

## Running Tests

### Unit Tests

```bash
# Run all tests
go test ./pkg/rqlite/... -v

# Run with race detector
go test ./pkg/rqlite/... -race

# Run specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v

# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```

### Integration Tests

```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m

# Run specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```

### Manual Tests

Follow the scenarios above in dedicated terminals for each node.

## Success Criteria

### Correctness

✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected

### Performance

✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms

### Reliability

✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles stable
✅ Concurrent operations safe
## Notes for Future Test Enhancements

When implementing advanced metrics and benchmarks:

1. **Prometheus Metrics Tests**
   - Verify metric export
   - Validate metric values
   - Test metric reset on restart

2. **Benchmark Suite**
   - Automated performance regression detection
   - Latency percentile tracking (p50, p95, p99)
   - Throughput measurements
   - Resource usage profiling

3. **Chaos Engineering**
   - Random node kills
   - Network partitions
   - Clock skew simulation
   - Disk full scenarios

4. **Long-Running Stability**
   - 24-hour soak test
   - Memory leak detection
   - Slow-growing resource usage

## Debugging Failed Tests

### Common Issues

**Port Conflicts**

```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999

# Kill stale processes
pkill rqlited
```

**Stale Data**

```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```

**Timing Issues**

- Increase timeouts in flaky tests
- Add retry logic with exponential backoff
- Use proper synchronization primitives

**Race Conditions**

```bash
# Always run with race detector during development
go test -race ./...
```