# Dynamic Database Clustering - Testing Guide

This guide provides a comprehensive list of the unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.

## Unit Tests

### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)

```go
// Test cases to implement:

func TestMetadataStore_GetSetDatabase(t *testing.T)
- Create store
- Set database metadata
- Get database metadata
- Verify data matches

func TestMetadataStore_DeleteDatabase(t *testing.T)
- Set database metadata
- Delete database
- Verify Get returns nil

func TestMetadataStore_ListDatabases(t *testing.T)
- Add multiple databases
- List all databases
- Verify count and contents

func TestMetadataStore_ConcurrentAccess(t *testing.T)
- Spawn multiple goroutines
- Concurrent reads and writes
- Verify no race conditions (run with -race)

func TestMetadataStore_NodeCapacity(t *testing.T)
- Set node capacity
- Get node capacity
- Update capacity
- List nodes
```

### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)

```go
func TestVectorClock_Increment(t *testing.T)
- Create empty vector clock
- Increment for node A
- Verify counter is 1
- Increment again
- Verify counter is 2

func TestVectorClock_Merge(t *testing.T)
- Create two vector clocks with different nodes
- Merge them
- Verify max values are preserved

func TestVectorClock_Compare(t *testing.T)
- Test strictly less than case
- Test strictly greater than case
- Test concurrent case
- Test identical case

func TestVectorClock_Concurrent(t *testing.T)
- Create clocks with overlapping updates
- Verify Compare returns 0 (concurrent)
```

### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)

```go
func TestElectCoordinator_SingleNode(t *testing.T)
- Pass single node ID
- Verify it's elected

func TestElectCoordinator_MultipleNodes(t *testing.T)
- Pass multiple node IDs
- Verify lowest lexicographical ID wins
- Verify deterministic (same input = same output)

func TestElectCoordinator_EmptyList(t *testing.T)
- Pass empty list
- Verify error returned

func TestElectCoordinator_Deterministic(t *testing.T)
- Run election multiple times with same inputs
- Verify same coordinator each time
```

### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)

```go
func TestPortManager_AllocatePortPair(t *testing.T)
- Create manager with port range
- Allocate port pair
- Verify HTTP and Raft ports differ
- Verify ports within range

func TestPortManager_ReleasePortPair(t *testing.T)
- Allocate port pair
- Release ports
- Verify ports can be reallocated

func TestPortManager_Exhaustion(t *testing.T)
- Allocate all available ports
- Attempt one more allocation
- Verify error returned

func TestPortManager_IsPortAllocated(t *testing.T)
- Allocate ports
- Check IsPortAllocated returns true
- Release ports
- Check IsPortAllocated returns false

func TestPortManager_AllocateSpecificPorts(t *testing.T)
- Allocate specific ports
- Verify allocation succeeds
- Attempt to allocate the same ports again
- Verify error returned
```

### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)

```go
func TestRQLiteInstance_Create(t *testing.T)
- Create instance configuration
- Verify fields set correctly

func TestRQLiteInstance_IsIdle(t *testing.T)
- Set LastQuery to an old timestamp
- Verify IsIdle returns true
- Update LastQuery
- Verify IsIdle returns false

// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
- Create instance
- Start instance
- Verify HTTP endpoint responsive
- Stop instance
- Verify process terminated
```
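As a concrete reference for how these outlines translate into code, here is a minimal sketch of the `IsIdle` test above. It assumes an `RQLiteInstance` struct with a `LastQuery` timestamp and an `IsIdle` method that takes the idle timeout as an argument; the real struct may instead read the timeout from config, so adjust to the actual API.

```go
package rqlite

import (
	"testing"
	"time"
)

// Sketch of TestRQLiteInstance_IsIdle: an instance whose LastQuery is
// older than the idle timeout should report idle; a freshly queried
// instance should not. (Field and method names are assumptions.)
func TestRQLiteInstance_IsIdle(t *testing.T) {
	inst := &RQLiteInstance{LastQuery: time.Now().Add(-10 * time.Minute)}
	if !inst.IsIdle(5 * time.Minute) {
		t.Fatal("expected instance with stale LastQuery to be idle")
	}

	inst.LastQuery = time.Now()
	if inst.IsIdle(5 * time.Minute) {
		t.Fatal("expected recently queried instance not to be idle")
	}
}
```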
### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)

```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
- Create each message type
- Marshal to bytes
- Unmarshal back
- Verify data preserved

func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```

### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)

```go
func TestCreateCoordinator_AddResponse(t *testing.T)
- Create coordinator
- Add responses
- Verify response count

func TestCreateCoordinator_SelectNodes(t *testing.T)
- Add more responses than needed
- Call SelectNodes
- Verify correct number selected
- Verify deterministic selection

func TestCreateCoordinator_WaitForResponses(t *testing.T)
- Create coordinator
- Wait in goroutine
- Add responses from another goroutine
- Verify wait completes when enough responses arrive

func TestCoordinatorRegistry(t *testing.T)
- Register coordinator
- Get coordinator
- Remove coordinator
- Verify lifecycle
```

## Integration Tests

### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)

```go
func TestSingleNodeDatabaseCreation(t *testing.T)
- Start 1 node
- Set replication_factor = 1
- Create database
- Verify database active
- Write data
- Read data back
- Verify data matches
```

### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)

```go
func TestThreeNodeDatabaseCreation(t *testing.T)
- Start 3 nodes
- Set replication_factor = 3
- Create database from node 1
- Wait for all nodes to report active
- Write data to node 1
- Read from node 2
- Verify replication worked
```

### 3. Multiple Databases (`e2e/multiple_databases_test.go`)

```go
func TestMultipleDatabases(t *testing.T)
- Start 3 nodes
- Create database "users"
- Create database "products"
- Create database "orders"
- Verify all databases active
- Write to each database
- Verify data isolation
```

### 4. Hibernation Cycle (`e2e/hibernation_test.go`)

```go
func TestHibernationCycle(t *testing.T)
- Start 3 nodes with hibernation_timeout=5s
- Create database
- Write initial data
- Wait 10 seconds (no activity)
- Verify status = hibernating
- Verify processes stopped
- Verify data persisted on disk

func TestWakeUpCycle(t *testing.T)
- Create and hibernate database
- Issue query
- Wait for wake-up
- Verify status = active
- Verify data still accessible
- Verify LastQuery updated
```

### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)

```go
func TestNodeFailureDetection(t *testing.T)
- Start 3 nodes
- Create database
- Kill one node (SIGKILL)
- Wait for health checks to detect the failure
- Verify NODE_REPLACEMENT_NEEDED broadcast

func TestNodeReplacement(t *testing.T)
- Start 4 nodes
- Create database on nodes 1, 2, 3
- Kill node 3
- Wait for replacement
- Verify node 4 joins the cluster
- Verify data accessible from node 4
```

### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)

```go
func TestOrphanedDataCleanup(t *testing.T)
- Start node
- Manually create an orphaned data directory
- Restart node
- Verify orphaned directory removed
- Check logs for reconciliation message
```
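Several of these integration tests contain a "wait for X" step (nodes reporting active, health checks detecting a failure, hibernation kicking in). Rather than sprinkling raw `time.Sleep` calls, a shared polling helper keeps those steps deterministic. A sketch with illustrative names — the suite may already have an equivalent:

```go
package e2e

import (
	"testing"
	"time"
)

// waitFor polls cond every 250ms until it returns true or the timeout
// elapses, failing the test with desc on timeout.
func waitFor(t *testing.T, timeout time.Duration, desc string, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	t.Fatalf("timed out after %s waiting for %s", timeout, desc)
}
```

Used as, e.g., `waitFor(t, 30*time.Second, "database active on all nodes", func() bool { ... })`.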
### 7. Concurrent Operations (`e2e/concurrent_test.go`)

```go
func TestConcurrentDatabaseCreation(t *testing.T)
- Start 5 nodes
- Create 10 databases concurrently
- Verify all successful
- Verify no port conflicts
- Verify proper distribution

func TestConcurrentHibernation(t *testing.T)
- Create multiple databases
- Let all go idle
- Verify all hibernate correctly
- Verify no race conditions
```

## Manual Test Scenarios

### Test 1: Basic Flow - Three Node Cluster

**Setup:**

```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001

# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002

# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```

**Test Steps:**

1. **Create Database**

   ```bash
   # Use the client or API to create database "testdb"
   ```

2. **Verify Creation**
   - Check logs on all 3 nodes for "Database instance started"
   - Verify `./data/*/testdb/` directories exist on all nodes
   - Check that different ports were allocated on each node

3. **Write Data**

   ```sql
   CREATE TABLE users (id INT, name TEXT);
   INSERT INTO users VALUES (1, 'Alice');
   INSERT INTO users VALUES (2, 'Bob');
   ```

4. **Verify Replication**
   - Query from each node
   - Verify the same data is returned

**Expected Results:**
- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node

---

### Test 2: Hibernation and Wake-Up

**Setup:** Same as Test 1, with the database created

**Test Steps:**

1. **Check Activity**

   ```bash
   # In logs, verify "last_query" timestamps updating on queries
   ```

2. **Wait for Hibernation**
   - Stop issuing queries
   - Wait `hibernation_timeout` + 10s
   - Check logs for "Database is idle"
   - Verify "Coordinated shutdown message sent"
   - Verify "Database hibernated successfully"

3. **Verify Hibernation**

   ```bash
   # Check that rqlite processes are stopped
   ps aux | grep rqlite

   # Verify data directories still exist
   ls -la data/*/testdb/
   ```

4. **Wake Up**
   - Issue a query to the database
   - Watch logs for "Received wakeup request"
   - Verify "Database woke up successfully"
   - Verify the query succeeds

**Expected Results:**
- Hibernation happens after the idle timeout
- All 3 nodes hibernate in a coordinated fashion
- Wake-up completes in < 8 seconds
- Data persists across the hibernation cycle

---

### Test 3: Multiple Databases

**Setup:** 3 nodes running

**Test Steps:**

1. **Create Multiple Databases**

   ```
   Create: users_db
   Create: products_db
   Create: orders_db
   ```

2. **Verify Isolation** (a Go sketch of this check follows this test)
   - Insert data in users_db
   - Verify the data is NOT in products_db
   - Verify the data is NOT in orders_db

3. **Check Port Allocation**

   ```bash
   # Verify different ports for each database
   netstat -tlnp | grep rqlite
   # OR
   ss -tlnp | grep rqlite
   ```

4. **Verify Data Directories**

   ```bash
   tree data/bootstrap/
   # Should show:
   # ├── users_db/
   # ├── products_db/
   # └── orders_db/
   ```

**Expected Results:**
- 3 separate database clusters
- Each with 3 nodes (9 instances total)
- Complete data isolation
- Unique port pairs for each instance
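For suites that automate Test 3, the isolation check can be expressed directly in Go. A minimal sketch, where `exec` and `count` are hypothetical stand-ins for whatever client helpers the suite exposes:

```go
package e2e

import "testing"

// verifyIsolation writes a probe row into users_db and asserts that
// the probe table is absent (or empty) in the sibling databases.
func verifyIsolation(t *testing.T,
	exec func(db, sql string) error,
	count func(db, sql string) (int, error),
) {
	if err := exec("users_db", "CREATE TABLE probe (id INT)"); err != nil {
		t.Fatal(err)
	}
	if err := exec("users_db", "INSERT INTO probe VALUES (1)"); err != nil {
		t.Fatal(err)
	}
	for _, db := range []string{"products_db", "orders_db"} {
		// In an isolated database the probe table should not exist, so
		// the query should error; any returned rows mean leakage.
		if n, err := count(db, "SELECT COUNT(*) FROM probe"); err == nil && n > 0 {
			t.Errorf("%s: found %d leaked probe rows", db, n)
		}
	}
}
```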
---

### Test 4: Node Failure and Recovery

**Setup:** 4 nodes running, database created on nodes 1-3

**Test Steps:**

1. **Verify Initial State**
   - Database active on nodes 1, 2, 3
   - Node 4 idle

2. **Simulate Failure**

   ```bash
   # Kill node 3 (SIGKILL for an unclean shutdown)
   kill -9 <node3-pid>
   ```

3. **Watch for Detection**
   - Check logs on nodes 1 and 2
   - Wait for health check failures (3 missed pings)
   - Verify "Node detected as unhealthy" messages

4. **Watch for Replacement**
   - Check for the "NODE_REPLACEMENT_NEEDED" broadcast
   - Node 4 should offer to take over
   - Verify "Starting as replacement node" on node 4
   - Verify node 4 joins the Raft cluster

5. **Verify Data Integrity**
   - Query the database from node 4
   - Verify all data present
   - Insert new data from node 4
   - Verify replication to nodes 1 and 2

**Expected Results:**
- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from the new node
- No data loss

---

### Test 5: Port Exhaustion

**Setup:** 1 node with a small port range

**Configuration:**

```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```

**Test Steps:**

1. **Create Databases**
   - Create database 1 (succeeds - uses 2 ports)
   - Create database 2 (succeeds - uses 2 ports)
   - Create database 3 (fails - only 1 port left)

2. **Verify Error**
   - Check logs for "Cannot allocate ports"
   - Verify an error is returned to the client

3. **Free Ports**
   - Hibernate or delete database 1
   - Its ports should be freed

4. **Retry**
   - Create database 3 again
   - It should now succeed

**Expected Results:**
- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled

---

### Test 6: Orphaned Data Cleanup

**Setup:** 1 node, stopped

**Test Steps:**

1. **Create Orphaned Data**

   ```bash
   # While the node is stopped
   mkdir -p data/bootstrap/orphaned_db/rqlite
   echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
   ```

2. **Start Node**

   ```bash
   ./bin/node --data bootstrap --id bootstrap
   ```

3. **Check Reconciliation**
   - Watch logs for "Starting orphaned data reconciliation"
   - Verify "Found orphaned database directory"
   - Verify "Removed orphaned database directory"

4. **Verify Cleanup**

   ```bash
   ls data/bootstrap/
   # orphaned_db should be gone
   ```

**Expected Results:**
- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged

---

### Test 7: Stress Test - Many Databases

**Setup:** 5 nodes with high capacity

**Configuration:**

```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```

**Test Steps:**

1. **Create Many Databases**

   ```
   Loop: Create databases db_1 through db_25
   ```

2. **Verify Distribution**
   - Check logs for node capacity announcements
   - Verify databases are distributed across nodes
   - No single node overloaded

3. **Concurrent Operations**
   - Write to multiple databases simultaneously
   - Read from multiple databases
   - Verify no conflicts

4. **Hibernation Wave**
   - Stop all activity
   - Wait for hibernation
   - Verify all databases hibernate
   - Check that resource usage drops

5. **Wake-Up Storm** (see the sketch after this test)
   - Query all 25 databases at once
   - Verify all wake up successfully
   - Check for thundering herd issues

**Expected Results:**
- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up
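The wake-up storm in step 5 is worth scripting so that wake-up latencies are actually measured rather than eyeballed. A sketch, with `queryDB` as a hypothetical stand-in for the suite's client call:

```go
package e2e

import (
	"fmt"
	"sync"
	"testing"
	"time"
)

// wakeUpStorm queries n hibernated databases simultaneously and flags
// any that fail or exceed the wake-up budget.
func wakeUpStorm(t *testing.T, n int, budget time.Duration, queryDB func(name string) error) {
	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("db_%d", i)
			start := time.Now()
			if err := queryDB(name); err != nil {
				t.Errorf("%s: wake-up query failed: %v", name, err)
				return
			}
			if d := time.Since(start); d > budget {
				t.Errorf("%s: wake-up took %s (budget %s)", name, d, budget)
			}
		}(i)
	}
	wg.Wait()
}
```

For the scenario above this would be called as `wakeUpStorm(t, 25, 8*time.Second, queryDB)`, matching the < 8s wake-up target.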
---

### Test 8: Gateway API Access

**Setup:** Gateway running with 3 nodes

**Test Steps:**

1. **Authenticate**

   ```bash
   # Get a JWT token
   TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
     -H "Content-Type: application/json" \
     -d '{"wallet": "..."}' | jq -r .token)
   ```

2. **Create Table**

   ```bash
   curl -X POST http://localhost:8080/v1/database/create-table \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
     }'
   ```

3. **Insert Data**

   ```bash
   curl -X POST http://localhost:8080/v1/database/exec \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
       "args": ["Alice", "alice@example.com"]
     }'
   ```

4. **Query Data**

   ```bash
   curl -X POST http://localhost:8080/v1/database/query \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "SELECT * FROM users"
     }'
   ```

5. **Test Transaction**

   ```bash
   curl -X POST http://localhost:8080/v1/database/transaction \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "queries": [
         "INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
         "INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
       ]
     }'
   ```

6. **Get Schema**

   ```bash
   curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
     -H "Authorization: Bearer $TOKEN"
   ```

7. **Test Hibernation**
   - Wait for the hibernation timeout
   - Query again and measure wake-up time
   - Expect a delay on the first query after hibernation

**Expected Results:**
- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects the created tables
- Hibernation/wake-up is transparent to the API
- Response times reasonable (< 30s for queries)

---

## Test Checklist

### Unit Tests (To Implement)
- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry

### Integration Tests (To Implement)
- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation

### Manual Tests (To Perform)
- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases
- [ ] Gateway API access

### Performance Validation
- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional

A small timing helper for enforcing these budgets is sketched after the Running Tests section below.

## Running Tests

### Unit Tests

```bash
# Run all tests
go test ./pkg/rqlite/... -v

# Run with the race detector
go test ./pkg/rqlite/... -race

# Run a specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v

# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```

### Integration Tests

```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m

# Run a specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```

### Manual Tests

Follow the scenarios above, with a dedicated terminal for each node.
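As referenced under Performance Validation, the timing targets are easiest to enforce with a small wrapper rather than by reading timestamps out of logs. A sketch (the helper name and the wrapped client call are illustrative):

```go
package e2e

import (
	"testing"
	"time"
)

// measure runs op and fails the test if it errors or exceeds budget.
func measure(t *testing.T, name string, budget time.Duration, op func() error) {
	t.Helper()
	start := time.Now()
	if err := op(); err != nil {
		t.Fatalf("%s failed: %v", name, err)
	}
	if d := time.Since(start); d > budget {
		t.Errorf("%s took %s, budget is %s", name, d, budget)
	}
}
```

Used as, e.g., `measure(t, "database creation", 10*time.Second, func() error { return createDatabase("perf_db") })` against the "< 10s" target.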
## Success Criteria

### Correctness
✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected

### Performance
✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms

### Reliability
✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles are stable
✅ Concurrent operations are safe

## Notes for Future Test Enhancements

When implementing advanced metrics and benchmarks:

1. **Prometheus Metrics Tests**
   - Verify metric export
   - Validate metric values
   - Test metric reset on restart

2. **Benchmark Suite**
   - Automated performance regression detection
   - Latency percentile tracking (p50, p95, p99)
   - Throughput measurements
   - Resource usage profiling

3. **Chaos Engineering**
   - Random node kills
   - Network partitions
   - Clock skew simulation
   - Disk-full scenarios

4. **Long-Running Stability**
   - 24-hour soak test
   - Memory leak detection
   - Detection of slow-growing resource usage

## Debugging Failed Tests

### Common Issues

**Port Conflicts**

```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999

# Kill stale processes
pkill rqlited
```

**Stale Data**

```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```

**Timing Issues**
- Increase timeouts in flaky tests
- Add retry logic with exponential backoff (see the sketch below)
- Use proper synchronization primitives

**Race Conditions**

```bash
# Always run with the race detector during development
go test -race ./...
```
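As referenced under Timing Issues, a minimal exponential backoff helper might look like this (names are illustrative):

```go
package e2e

import (
	"fmt"
	"time"
)

// retry runs op up to attempts times, doubling the wait between tries,
// and returns the last error if every attempt fails.
func retry(attempts int, initialDelay time.Duration, op func() error) error {
	delay := initialDelay
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}
```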