From 36002d342c1e45ed5f916828e4aa9ab34f9ed24c Mon Sep 17 00:00:00 2001 From: anonpenguin23 Date: Thu, 16 Oct 2025 10:29:58 +0300 Subject: [PATCH] Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide - Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files as they are no longer relevant to the current implementation. - Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources. --- .cursor/plans/dynamic-ec358e91.plan.md | 165 ----- DYNAMIC_CLUSTERING_GUIDE.md | 504 --------------- TESTING_GUIDE.md | 827 ------------------------- 3 files changed, 1496 deletions(-) delete mode 100644 .cursor/plans/dynamic-ec358e91.plan.md delete mode 100644 DYNAMIC_CLUSTERING_GUIDE.md delete mode 100644 TESTING_GUIDE.md diff --git a/.cursor/plans/dynamic-ec358e91.plan.md b/.cursor/plans/dynamic-ec358e91.plan.md deleted file mode 100644 index 1d428d9..0000000 --- a/.cursor/plans/dynamic-ec358e91.plan.md +++ /dev/null @@ -1,165 +0,0 @@ - -# Dynamic Database Clustering — Implementation Plan - -### Scope - -Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing. - -### Guiding Principles - -- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons. -- Backward-compatible config migration with deprecations, feature-flag controlled rollout. -- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes. -- Tests and observability at each phase. - -### Phase 0: Prep & Scaffolding - -- Add feature flag `dynamic_db_clustering` (env/config) → default off. -- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated). -- Create empty packages and interfaces to enable incremental compilation: - - `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}` - - `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}` -- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed). -- Establish CI jobs for new unit/integration suites and longer-running e2e. - -### Phase 1: Metadata Layer (No hibernation yet) - -- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`: - - `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`. -- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using existing `pkg/pubsub` bridge: - - Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health. -- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`: - - Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs). -- Reuse existing node connectivity/backoff; no new ping service required. -- Skip unit tests for now; validate by wiring e2e flows later. - -### Phase 2: Database Creation & Client API - -- Port management: - - `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping. -- Subprocess control: - - `RQLiteInstance` lifecycle (start, wait ready via /status and simple query, stop, status). 
-- Cluster manager:
-  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, fans in readiness signals, and surfaces failures.
-- Client API:
-  - Update `pkg/client/interface.go` to include `Database(name string)`.
-  - Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
-  - Backoff polling for readiness during creation.
-- Data isolation:
-  - Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
-- Integration tests: create single db across 3 nodes; multiple databases coexisting; cross-node read/write.
-
-### Phase 3: Hibernation & Wake-Up
-
-- Idle detection and coordination:
-  - Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule.
-- Hibernation protocol:
-  - Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
-- Wake-up protocol:
-  - Client detects `hibernating`, performs CAS to `waking`, triggers wake request; reuse ports if available, else re-negotiate; start instances; status → `active`.
-- Client retry UX:
-  - Transparent retries with exponential backoff; treat `waking` as a wait-only state.
-- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
-
-### Phase 4: Resilience (Failure & Replacement)
-
-- Continuous health checks with timeouts → mark node unhealthy.
-- Replacement orchestration:
-  - Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, confirm selection, new node joins raft via `-join` then syncs.
-- Startup reconciliation:
-  - Detect and clean up orphaned or non-member local data directories.
-- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
-- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.
-
-### Phase 5: Production Hardening & Optimization
-
-- Metrics/logging:
-  - Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
-- Config validation, replication factor settings (1, 3, 5), and debugging APIs (read-only metadata dump, node status).
-- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
-- Performance benchmarks and operator-facing docs.
-
-### File Changes (Essentials)
-
-- `pkg/config/config.go`
-  - Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
-  - Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval` (see the sketch below).
-- `pkg/client/interface.go`/`pkg/client/client.go`
-  - Add `Database(name string)` and app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
-- `pkg/node/node.go`
-  - Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
-- `pkg/rqlite/*`
-  - Refactor from the current singleton to instance-oriented helpers.
-- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
-- `configs/node.yaml` and validation paths to reflect the new `database` block.
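-
-A minimal sketch of how the new `database` block could map onto Go types in `pkg/config/config.go`, using only the fields listed above (the struct names and YAML tags here are assumptions, not the final API):
-
-```go
-package config
-
-import "time"
-
-// PortRange bounds the HTTP and Raft ports available for dynamic allocation.
-type PortRange struct {
-	HTTPStart int `yaml:"http_start"`
-	HTTPEnd   int `yaml:"http_end"`
-	RaftStart int `yaml:"raft_start"`
-	RaftEnd   int `yaml:"raft_end"`
-}
-
-// DatabaseConfig replaces the legacy single-cluster fields
-// (DataDir, RQLitePort, RQLiteRaftPort, RQLiteJoinAddress).
-type DatabaseConfig struct {
-	ReplicationFactor  int           `yaml:"replication_factor"`
-	HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
-	MaxDatabases       int           `yaml:"max_databases"`
-	PortRange          PortRange     `yaml:"port_range"`
-}
-```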
-### Config Example (target end-state)
-
-```yaml
-node:
-  data_dir: "./data"
-
-database:
-  replication_factor: 3
-  hibernation_timeout: 60s
-  max_databases: 100
-  port_range:
-    http_start: 5001
-    http_end: 5999
-    raft_start: 7001
-    raft_end: 7999
-
-discovery:
-  health_check_interval: 10s
-```
-
-### Rollout Strategy
-
-- Keep feature flag off by default; support legacy single-cluster path.
-- Ship Phase 1 behind flag; enable in dev/e2e only.
-- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
-- Remove legacy config after deprecation window.
-
-### Testing & Quality Gates
-
-- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
-- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
-- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
-- Chaos suite for randomized failures and partitions.
-
-### Risks & Mitigations (operationalized)
-
-- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
-- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
-- Cascading replacements → global rate limiter and prioritization.
-- Debuggability → verbose structured logging and metadata dump endpoints.
-
-### Timeline (indicative)
-
-- Weeks 1-2: Phases 0-1
-- Weeks 3-4: Phase 2
-- Weeks 5-6: Phase 3
-- Weeks 7-8: Phase 4
-- Weeks 9-10+: Phase 5
-
-### To-dos
-
-- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
-- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
-- [ ] Implement metadata types and thread-safe store with versioning
-- [ ] Implement pubsub messages and handlers using existing pubsub manager
-- [ ] Implement coordinator election, vector clocks, gossip reconciliation
-- [ ] Implement `PortManager` with bind-probing and allocation
-- [ ] Implement rqlite subprocess control and readiness checks
-- [ ] Implement `ClusterManager` and creation lifecycle orchestration
-- [ ] Add `Database(name)` and app namespacing to client; backoff polling
-- [ ] Adopt per-database data dirs under node `data_dir`
-- [ ] Integration tests for creation and isolation across nodes
-- [ ] Idle detection, coordinated shutdown, status updates
-- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
-- [ ] Client transparent retry/backoff for hibernation and waking
-- [ ] Health checks, replacement orchestration, rate limiting
-- [ ] Implement orphaned data reconciliation on startup
-- [ ] Add metrics and structured logging across managers
-- [ ] Benchmarks for creation, wake-up, sync, query overhead
-- [ ] Operator and developer docs; config and migration guides
\ No newline at end of file
diff --git a/DYNAMIC_CLUSTERING_GUIDE.md b/DYNAMIC_CLUSTERING_GUIDE.md
deleted file mode 100644
index eac217a..0000000
--- a/DYNAMIC_CLUSTERING_GUIDE.md
+++ /dev/null
@@ -1,504 +0,0 @@
-# Dynamic Database Clustering - User Guide
-
-## Overview
-
-Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as a separate 3-node cluster (by default) with its own data directory and port allocation.
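-
-Each instance's HTTP and Raft ports are chosen by bind-probing random candidates within the configured ranges (see Port Management under Advanced Topics). A minimal Go sketch of that idea, independent of the real `PortManager` API:
-
-```go
-package main
-
-import (
-	"fmt"
-	"math/rand"
-	"net"
-)
-
-// probe reports whether the port can actually be bound right now.
-func probe(port int) bool {
-	l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
-	if err != nil {
-		return false
-	}
-	l.Close()
-	return true
-}
-
-// allocate picks a random bindable port within [start, end].
-func allocate(start, end int) (int, error) {
-	for i := 0; i < 50; i++ { // bounded retries before giving up
-		p := start + rand.Intn(end-start+1)
-		if probe(p) {
-			return p, nil
-		}
-	}
-	return 0, fmt.Errorf("no free port in range %d-%d", start, end)
-}
-
-func main() {
-	httpPort, _ := allocate(5001, 5999)
-	raftPort, _ := allocate(7001, 7999)
-	fmt.Println("allocated HTTP/Raft ports:", httpPort, raftPort)
-}
-```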
- -## Key Features - -✅ **Multi-Database Support** - Create unlimited isolated databases on-demand -✅ **3-Node Replication** - Fault-tolerant by default (configurable) -✅ **Auto Hibernation** - Idle databases hibernate to save resources -✅ **Transparent Wake-Up** - Automatic restart on access -✅ **App Namespacing** - Databases are scoped by application name -✅ **Decentralized Metadata** - LibP2P pubsub-based coordination -✅ **Failure Recovery** - Automatic node replacement on failures -✅ **Resource Optimization** - Dynamic port allocation and data isolation - -## Configuration - -### Node Configuration (`configs/node.yaml`) - -```yaml -node: - data_dir: "./data" - listen_addresses: - - "/ip4/0.0.0.0/tcp/4001" - max_connections: 50 - -database: - replication_factor: 3 # Number of replicas per database - hibernation_timeout: 60s # Idle time before hibernation - max_databases: 100 # Max databases per node - port_range_http_start: 5001 # HTTP port range start - port_range_http_end: 5999 # HTTP port range end - port_range_raft_start: 7001 # Raft port range start - port_range_raft_end: 7999 # Raft port range end - -discovery: - bootstrap_peers: - - "/ip4/127.0.0.1/tcp/4001/p2p/..." - discovery_interval: 30s - health_check_interval: 10s -``` - -### Key Configuration Options - -#### `database.replication_factor` (default: 3) -Number of nodes that will host each database cluster. Minimum 1, recommended 3 for fault tolerance. - -#### `database.hibernation_timeout` (default: 60s) -Time of inactivity before a database is hibernated. Set to 0 to disable hibernation. - -#### `database.max_databases` (default: 100) -Maximum number of databases this node can host simultaneously. - -#### `database.port_range_*` -Port ranges for dynamic allocation. Ensure ranges are large enough for `max_databases * 2` ports (HTTP + Raft per database). - -## Client Usage - -### Creating/Accessing Databases - -```go -package main - -import ( - "context" - "github.com/DeBrosOfficial/network/pkg/client" -) - -func main() { - // Create client with app name for namespacing - cfg := client.DefaultClientConfig("myapp") - cfg.BootstrapPeers = []string{ - "/ip4/127.0.0.1/tcp/4001/p2p/...", - } - - c, err := client.NewClient(cfg) - if err != nil { - panic(err) - } - - // Connect to network - if err := c.Connect(); err != nil { - panic(err) - } - defer c.Disconnect() - - // Get database client (creates database if it doesn't exist) - db, err := c.Database().Database("users") - if err != nil { - panic(err) - } - - // Use the database - ctx := context.Background() - err = db.CreateTable(ctx, ` - CREATE TABLE users ( - id INTEGER PRIMARY KEY, - name TEXT NOT NULL, - email TEXT UNIQUE - ) - `) - - // Query data - result, err := db.Query(ctx, "SELECT * FROM users") - // ... 
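-	if err != nil {
-		panic(err)
-	}
-	// If "users" was hibernating, this first query may block for a few
-	// seconds while the cluster wakes; the client retries with backoff
-	// transparently (see Database Lifecycle below), so no extra handling
-	// is needed here.
-	_ = result // inspect returned rows here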
-} -``` - -### Database Naming - -Databases are automatically namespaced by your application name: -- `client.Database("users")` → creates `myapp_users` internally -- This prevents name collisions between different applications - -## Gateway API Usage - -If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints: - -### Base URL -``` -http://gateway-host:8080/v1/database/ -``` - -### Execute SQL (INSERT, UPDATE, DELETE, DDL) -```bash -POST /v1/database/exec -Content-Type: application/json - -{ - "database": "users", - "sql": "INSERT INTO users (name, email) VALUES (?, ?)", - "args": ["Alice", "alice@example.com"] -} - -Response: -{ - "rows_affected": 1, - "last_insert_id": 1 -} -``` - -### Query Data (SELECT) -```bash -POST /v1/database/query -Content-Type: application/json - -{ - "database": "users", - "sql": "SELECT * FROM users WHERE name LIKE ?", - "args": ["A%"] -} - -Response: -{ - "items": [ - {"id": 1, "name": "Alice", "email": "alice@example.com"} - ], - "count": 1 -} -``` - -### Execute Transaction -```bash -POST /v1/database/transaction -Content-Type: application/json - -{ - "database": "users", - "queries": [ - "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')", - "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'" - ] -} - -Response: -{ - "success": true -} -``` - -### Get Schema -```bash -GET /v1/database/schema?database=users - -# OR - -POST /v1/database/schema -Content-Type: application/json - -{ - "database": "users" -} - -Response: -{ - "tables": [ - { - "name": "users", - "columns": ["id", "name", "email"] - } - ] -} -``` - -### Create Table -```bash -POST /v1/database/create-table -Content-Type: application/json - -{ - "database": "users", - "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)" -} - -Response: -{ - "rows_affected": 0 -} -``` - -### Drop Table -```bash -POST /v1/database/drop-table -Content-Type: application/json - -{ - "database": "users", - "table_name": "old_table" -} - -Response: -{ - "rows_affected": 0 -} -``` - -### List Databases -```bash -GET /v1/database/list - -Response: -{ - "databases": ["users", "products", "orders"] -} -``` - -### Important Notes - -1. **Authentication Required**: All endpoints require authentication (JWT or API key) -2. **Database Creation**: Databases are created automatically on first access -3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on first query to a hibernating database -4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s -5. **Namespacing**: Database names are automatically prefixed with your app name -6. **Concurrent Access**: All endpoints are safe for concurrent use - -## Database Lifecycle - -### 1. Creation - -When you first access a database: - -1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST` -2. **Node Selection** - Eligible nodes respond with available ports -3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen -4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM` -5. **Instance Startup** - Selected nodes start rqlite subprocesses -6. **Readiness** - Nodes report `active` status when ready - -**Typical creation time: < 10 seconds** - -### 2. Active State - -- Database instances run as rqlite subprocesses -- Each instance tracks `LastQuery` timestamp -- Queries update the activity timestamp -- Metadata synced across all network nodes - -### 3. 
Hibernation - -After `hibernation_timeout` of inactivity: - -1. **Idle Detection** - Nodes detect idle databases -2. **Idle Notification** - Nodes broadcast idle status -3. **Coordinated Shutdown** - When all nodes report idle, coordinator schedules shutdown -4. **Graceful Stop** - SIGTERM sent to rqlite processes -5. **Port Release** - Ports freed for reuse -6. **Status Update** - Metadata updated to `hibernating` - -**Data persists on disk during hibernation** - -### 4. Wake-Up - -On first query to hibernating database: - -1. **Detection** - Client/node detects `hibernating` status -2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST` -3. **Port Allocation** - Reuse original ports or allocate new ones -4. **Instance Restart** - Restart rqlite with existing data -5. **Status Update** - Update to `active` when ready - -**Typical wake-up time: < 8 seconds** - -### 5. Failure Recovery - -When a node fails: - -1. **Health Detection** - Missed health checks trigger failure detection -2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED` -3. **Offers** - Healthy nodes with capacity offer to replace -4. **Selection** - First offer accepted (simple approach) -5. **Join Cluster** - New node joins existing Raft cluster -6. **Sync** - Data synced from existing members - -## Data Management - -### Data Directories - -Each database gets its own data directory: -``` -./data/ - ├── myapp_users/ # Database: users - │ └── rqlite/ - │ ├── db.sqlite - │ └── raft/ - ├── myapp_products/ # Database: products - │ └── rqlite/ - └── myapp_orders/ # Database: orders - └── rqlite/ -``` - -### Orphaned Data Cleanup - -On node startup, the system automatically: -- Scans data directories -- Checks against metadata -- Removes directories for: - - Non-existent databases - - Databases where this node is not a member - -## Monitoring & Debugging - -### Structured Logging - -All operations are logged with structured fields: - -``` -INFO Starting cluster manager node_id=12D3... max_databases=100 -INFO Received database create request database=myapp_users requester=12D3... -INFO Database instance started database=myapp_users http_port=5001 raft_port=7001 -INFO Database is idle database=myapp_users idle_time=62s -INFO Database hibernated successfully database=myapp_users -INFO Received wakeup request database=myapp_users -INFO Database woke up successfully database=myapp_users -``` - -### Health Checks - -Nodes perform periodic health checks: -- Every `health_check_interval` (default: 10s) -- Tracks last-seen time for each peer -- 3 missed checks → node marked unhealthy -- Triggers replacement protocol for affected databases - -## Best Practices - -### 1. **Capacity Planning** - -```yaml -# For 100 databases with 3-node replication: -database: - max_databases: 100 - port_range_http_start: 5001 - port_range_http_end: 5200 # 200 ports (100 databases * 2) - port_range_raft_start: 7001 - port_range_raft_end: 7200 -``` - -### 2. **Hibernation Tuning** - -- **High Traffic**: Set `hibernation_timeout: 300s` or higher -- **Development**: Set `hibernation_timeout: 30s` for faster cycles -- **Always-On DBs**: Set `hibernation_timeout: 0` to disable - -### 3. **Replication Factor** - -- **Development**: `replication_factor: 1` (single node, no replication) -- **Production**: `replication_factor: 3` (fault tolerant) -- **High Availability**: `replication_factor: 5` (survives 2 failures) - -### 4. 
**Network Topology** - -- Use at least 3 nodes for `replication_factor: 3` -- Ensure `max_databases * replication_factor <= total_cluster_capacity` -- Example: 3 nodes × 100 max_databases = 300 database instances total - -## Troubleshooting - -### Database Creation Fails - -**Problem**: `insufficient nodes responded: got 1, need 3` - -**Solution**: -- Ensure you have at least `replication_factor` nodes online -- Check `max_databases` limit on nodes -- Verify port ranges aren't exhausted - -### Database Not Waking Up - -**Problem**: Database stays in `waking` status - -**Solution**: -- Check node logs for rqlite startup errors -- Verify rqlite binary is installed -- Check port conflicts (use different port ranges) -- Ensure data directory is accessible - -### Orphaned Data - -**Problem**: Disk space consumed by old databases - -**Solution**: -- Orphaned data is automatically cleaned on node restart -- Manual cleanup: Delete directories from `./data/` that don't match metadata -- Check logs for reconciliation results - -### Node Replacement Not Working - -**Problem**: Failed node not replaced - -**Solution**: -- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`) -- Check network connectivity between nodes -- Verify health check interval is reasonable (not too aggressive) - -## Advanced Topics - -### Metadata Consistency - -- **Vector Clocks**: Each metadata update includes vector clock for conflict resolution -- **Gossip Protocol**: Periodic metadata sync via checksums -- **Eventual Consistency**: All nodes eventually agree on database state - -### Port Management - -- Ports allocated randomly within configured ranges -- Bind-probing ensures ports are actually available -- Ports reused during wake-up when possible -- Failed allocations fall back to new random ports - -### Coordinator Election - -- Deterministic selection based on lexicographical peer ID ordering -- Lowest peer ID becomes coordinator -- No persistent coordinator state -- Re-election occurs for each database operation - -## Migration from Legacy Mode - -If upgrading from single-cluster rqlite: - -1. **Backup Data**: Backup your existing `./data/rqlite` directory -2. **Update Config**: Remove deprecated fields: - - `database.data_dir` - - `database.rqlite_port` - - `database.rqlite_raft_port` - - `database.rqlite_join_address` -3. **Add New Fields**: Configure dynamic clustering (see Configuration section) -4. **Restart Nodes**: Restart all nodes with new configuration -5. **Migrate Data**: Create new database and import data from backup - -## Future Enhancements - -The following features are planned for future releases: - -### **Advanced Metrics** (Future) -- Prometheus-style metrics export -- Per-database query counters -- Hibernation/wake-up latency histograms -- Resource utilization gauges - -### **Performance Benchmarks** (Future) -- Automated benchmark suite -- Creation time SLOs -- Wake-up latency targets -- Query overhead measurements - -### **Enhanced Monitoring** (Future) -- Dashboard for cluster visualization -- Database status API endpoint -- Capacity planning tools -- Alerting integration - -## Support - -For issues, questions, or contributions: -- GitHub Issues: https://github.com/DeBrosOfficial/network/issues -- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md - -## License - -See LICENSE file for details. 
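-
-As a closing reference, the deterministic coordinator election described under Advanced Topics reduces to a sort; a minimal illustration in Go (peer IDs are plain strings here rather than real libp2p types):
-
-```go
-package main
-
-import (
-	"errors"
-	"fmt"
-	"sort"
-)
-
-// electCoordinator returns the lowest peer ID in lexicographical order.
-// There is no persistent coordinator state; every caller re-derives the
-// same result for the same peer set.
-func electCoordinator(peers []string) (string, error) {
-	if len(peers) == 0 {
-		return "", errors.New("no peers to elect from")
-	}
-	sorted := append([]string(nil), peers...) // copy to avoid mutating input
-	sort.Strings(sorted)
-	return sorted[0], nil
-}
-
-func main() {
-	c, _ := electCoordinator([]string{"12D3KooWB", "12D3KooWA", "12D3KooWC"})
-	fmt.Println("coordinator:", c) // always 12D3KooWA for this input
-}
-```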
- diff --git a/TESTING_GUIDE.md b/TESTING_GUIDE.md deleted file mode 100644 index 85d189b..0000000 --- a/TESTING_GUIDE.md +++ /dev/null @@ -1,827 +0,0 @@ -# Dynamic Database Clustering - Testing Guide - -This guide provides a comprehensive list of unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature. - -## Unit Tests - -### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`) - -```go -// Test cases to implement: - -func TestMetadataStore_GetSetDatabase(t *testing.T) - - Create store - - Set database metadata - - Get database metadata - - Verify data matches - -func TestMetadataStore_DeleteDatabase(t *testing.T) - - Set database metadata - - Delete database - - Verify Get returns nil - -func TestMetadataStore_ListDatabases(t *testing.T) - - Add multiple databases - - List all databases - - Verify count and contents - -func TestMetadataStore_ConcurrentAccess(t *testing.T) - - Spawn multiple goroutines - - Concurrent reads and writes - - Verify no race conditions (run with -race) - -func TestMetadataStore_NodeCapacity(t *testing.T) - - Set node capacity - - Get node capacity - - Update capacity - - List nodes -``` - -### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`) - -```go -func TestVectorClock_Increment(t *testing.T) - - Create empty vector clock - - Increment for node A - - Verify counter is 1 - - Increment again - - Verify counter is 2 - -func TestVectorClock_Merge(t *testing.T) - - Create two vector clocks with different nodes - - Merge them - - Verify max values are preserved - -func TestVectorClock_Compare(t *testing.T) - - Test strictly less than case - - Test strictly greater than case - - Test concurrent case - - Test identical case - -func TestVectorClock_Concurrent(t *testing.T) - - Create clocks with overlapping updates - - Verify Compare returns 0 (concurrent) -``` - -### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`) - -```go -func TestElectCoordinator_SingleNode(t *testing.T) - - Pass single node ID - - Verify it's elected - -func TestElectCoordinator_MultipleNodes(t *testing.T) - - Pass multiple node IDs - - Verify lowest lexicographical ID wins - - Verify deterministic (same input = same output) - -func TestElectCoordinator_EmptyList(t *testing.T) - - Pass empty list - - Verify error returned - -func TestElectCoordinator_Deterministic(t *testing.T) - - Run election multiple times with same inputs - - Verify same coordinator each time -``` - -### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`) - -```go -func TestPortManager_AllocatePortPair(t *testing.T) - - Create manager with port range - - Allocate port pair - - Verify HTTP and Raft ports different - - Verify ports within range - -func TestPortManager_ReleasePortPair(t *testing.T) - - Allocate port pair - - Release ports - - Verify ports can be reallocated - -func TestPortManager_Exhaustion(t *testing.T) - - Allocate all available ports - - Attempt one more allocation - - Verify error returned - -func TestPortManager_IsPortAllocated(t *testing.T) - - Allocate ports - - Check IsPortAllocated returns true - - Release ports - - Check IsPortAllocated returns false - -func TestPortManager_AllocateSpecificPorts(t *testing.T) - - Allocate specific ports - - Verify allocation succeeds - - Attempt to allocate same ports again - - Verify error returned -``` - -### 5. 
RQLite Instance Tests (`pkg/rqlite/instance_test.go`) - -```go -func TestRQLiteInstance_Create(t *testing.T) - - Create instance configuration - - Verify fields set correctly - -func TestRQLiteInstance_IsIdle(t *testing.T) - - Set LastQuery to old timestamp - - Verify IsIdle returns true - - Update LastQuery - - Verify IsIdle returns false - -// Integration test (requires rqlite binary): -func TestRQLiteInstance_StartStop(t *testing.T) - - Create instance - - Start instance - - Verify HTTP endpoint responsive - - Stop instance - - Verify process terminated -``` - -### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`) - -```go -func TestMarshalUnmarshalMetadataMessage(t *testing.T) - - Create each message type - - Marshal to bytes - - Unmarshal back - - Verify data preserved - -func TestDatabaseCreateRequest_Marshal(t *testing.T) -func TestDatabaseCreateResponse_Marshal(t *testing.T) -func TestDatabaseCreateConfirm_Marshal(t *testing.T) -func TestDatabaseStatusUpdate_Marshal(t *testing.T) -// ... for all message types -``` - -### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`) - -```go -func TestCreateCoordinator_AddResponse(t *testing.T) - - Create coordinator - - Add responses - - Verify response count - -func TestCreateCoordinator_SelectNodes(t *testing.T) - - Add more responses than needed - - Call SelectNodes - - Verify correct number selected - - Verify deterministic selection - -func TestCreateCoordinator_WaitForResponses(t *testing.T) - - Create coordinator - - Wait in goroutine - - Add responses from another goroutine - - Verify wait completes when enough responses - -func TestCoordinatorRegistry(t *testing.T) - - Register coordinator - - Get coordinator - - Remove coordinator - - Verify lifecycle -``` - -## Integration Tests - -### 1. Single Node Database Creation (`e2e/single_node_database_test.go`) - -```go -func TestSingleNodeDatabaseCreation(t *testing.T) - - Start 1 node - - Set replication_factor = 1 - - Create database - - Verify database active - - Write data - - Read data back - - Verify data matches -``` - -### 2. Three Node Database Creation (`e2e/three_node_database_test.go`) - -```go -func TestThreeNodeDatabaseCreation(t *testing.T) - - Start 3 nodes - - Set replication_factor = 3 - - Create database from node 1 - - Wait for all nodes to report active - - Write data to node 1 - - Read from node 2 - - Verify replication worked -``` - -### 3. Multiple Databases (`e2e/multiple_databases_test.go`) - -```go -func TestMultipleDatabases(t *testing.T) - - Start 3 nodes - - Create database "users" - - Create database "products" - - Create database "orders" - - Verify all databases active - - Write to each database - - Verify data isolation -``` - -### 4. Hibernation Cycle (`e2e/hibernation_test.go`) - -```go -func TestHibernationCycle(t *testing.T) - - Start 3 nodes with hibernation_timeout=5s - - Create database - - Write initial data - - Wait 10 seconds (no activity) - - Verify status = hibernating - - Verify processes stopped - - Verify data persisted on disk - -func TestWakeUpCycle(t *testing.T) - - Create and hibernate database - - Issue query - - Wait for wake-up - - Verify status = active - - Verify data still accessible - - Verify LastQuery updated -``` - -### 5. 
Node Failure and Recovery (`e2e/failure_recovery_test.go`) - -```go -func TestNodeFailureDetection(t *testing.T) - - Start 3 nodes - - Create database - - Kill one node (SIGKILL) - - Wait for health checks to detect failure - - Verify NODE_REPLACEMENT_NEEDED broadcast - -func TestNodeReplacement(t *testing.T) - - Start 4 nodes - - Create database on nodes 1,2,3 - - Kill node 3 - - Wait for replacement - - Verify node 4 joins cluster - - Verify data accessible from node 4 -``` - -### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`) - -```go -func TestOrphanedDataCleanup(t *testing.T) - - Start node - - Manually create orphaned data directory - - Restart node - - Verify orphaned directory removed - - Check logs for reconciliation message -``` - -### 7. Concurrent Operations (`e2e/concurrent_test.go`) - -```go -func TestConcurrentDatabaseCreation(t *testing.T) - - Start 5 nodes - - Create 10 databases concurrently - - Verify all successful - - Verify no port conflicts - - Verify proper distribution - -func TestConcurrentHibernation(t *testing.T) - - Create multiple databases - - Let all go idle - - Verify all hibernate correctly - - No race conditions -``` - -## Manual Test Scenarios - -### Test 1: Basic Flow - Three Node Cluster - -**Setup:** -```bash -# Terminal 1: Bootstrap node -cd data/bootstrap -../../bin/node --data bootstrap --id bootstrap --p2p-port 4001 - -# Terminal 2: Node 2 -cd data/node -../../bin/node --data node --id node2 --p2p-port 4002 - -# Terminal 3: Node 3 -cd data/node2 -../../bin/node --data node2 --id node3 --p2p-port 4003 -``` - -**Test Steps:** -1. **Create Database** - ```bash - # Use client or API to create database "testdb" - ``` - -2. **Verify Creation** - - Check logs on all 3 nodes for "Database instance started" - - Verify `./data/*/testdb/` directories exist on all nodes - - Check different ports allocated on each node - -3. **Write Data** - ```sql - CREATE TABLE users (id INT, name TEXT); - INSERT INTO users VALUES (1, 'Alice'); - INSERT INTO users VALUES (2, 'Bob'); - ``` - -4. **Verify Replication** - - Query from each node - - Verify same data returned - -**Expected Results:** -- All nodes show `status=active` for testdb -- Data replicated across all nodes -- Unique port pairs per node - ---- - -### Test 2: Hibernation and Wake-Up - -**Setup:** Same as Test 1 with database created - -**Test Steps:** -1. **Check Activity** - ```bash - # In logs, verify "last_query" timestamps updating on queries - ``` - -2. **Wait for Hibernation** - - Stop issuing queries - - Wait `hibernation_timeout` + 10s - - Check logs for "Database is idle" - - Verify "Coordinated shutdown message sent" - - Verify "Database hibernated successfully" - -3. **Verify Hibernation** - ```bash - # Check that rqlite processes are stopped - ps aux | grep rqlite - - # Verify data directories still exist - ls -la data/*/testdb/ - ``` - -4. **Wake Up** - - Issue a query to the database - - Watch logs for "Received wakeup request" - - Verify "Database woke up successfully" - - Verify query succeeds - -**Expected Results:** -- Hibernation happens after idle timeout -- All 3 nodes hibernate coordinated -- Wake-up completes in < 8 seconds -- Data persists across hibernation cycle - ---- - -### Test 3: Multiple Databases - -**Setup:** 3 nodes running - -**Test Steps:** -1. **Create Multiple Databases** - ``` - Create: users_db - Create: products_db - Create: orders_db - ``` - -2. 
**Verify Isolation**
-   - Insert data in users_db
-   - Verify data NOT in products_db
-   - Verify data NOT in orders_db
-
-3. **Check Port Allocation**
-   ```bash
-   # Verify different ports for each database
-   netstat -tlnp | grep rqlite
-   # OR
-   ss -tlnp | grep rqlite
-   ```
-
-4. **Verify Data Directories**
-   ```bash
-   tree data/bootstrap/
-   # Should show:
-   # ├── users_db/
-   # ├── products_db/
-   # └── orders_db/
-   ```
-
-**Expected Results:**
-- 3 separate database clusters
-- Each with 3 nodes (9 total instances)
-- Complete data isolation
-- Unique port pairs for each instance
-
----
-
-### Test 4: Node Failure and Recovery
-
-**Setup:** 4 nodes running, database created on nodes 1-3
-
-**Test Steps:**
-1. **Verify Initial State**
-   - Database active on nodes 1, 2, 3
-   - Node 4 idle
-
-2. **Simulate Failure**
-   ```bash
-   # Kill node 3 (SIGKILL for unclean shutdown)
-   kill -9 <node3-pid>
-   ```
-
-3. **Watch for Detection**
-   - Check logs on nodes 1 and 2
-   - Wait for health check failures (3 missed pings)
-   - Verify "Node detected as unhealthy" messages
-
-4. **Watch for Replacement**
-   - Check for "NODE_REPLACEMENT_NEEDED" broadcast
-   - Node 4 should offer to replace
-   - Verify "Starting as replacement node" on node 4
-   - Verify node 4 joins Raft cluster
-
-5. **Verify Data Integrity**
-   - Query database from node 4
-   - Verify all data present
-   - Insert new data from node 4
-   - Verify replication to nodes 1 and 2
-
-**Expected Results:**
-- Failure detected within 30 seconds
-- Replacement completes automatically
-- Data accessible from new node
-- No data loss
-
----
-
-### Test 5: Port Exhaustion
-
-**Setup:** 1 node with small port range
-
-**Configuration:**
-```yaml
-database:
-  max_databases: 10
-  port_range_http_start: 5001
-  port_range_http_end: 5002 # Only 2 HTTP ports
-  port_range_raft_start: 7001
-  port_range_raft_end: 7002 # Only 2 Raft ports
-```
-
-**Test Steps:**
-1. **Create Databases**
-   - Create database 1 (succeeds - uses 1 HTTP + 1 Raft port)
-   - Create database 2 (succeeds - uses the remaining HTTP and Raft ports)
-   - Create database 3 (fails - both ranges exhausted)
-
-2. **Verify Error**
-   - Check logs for "Cannot allocate ports"
-   - Verify error returned to client
-
-3. **Free Ports**
-   - Hibernate or delete database 1
-   - Ports should be freed
-
-4. **Retry**
-   - Create database 3 again
-   - Should succeed now
-
-**Expected Results:**
-- Graceful handling of port exhaustion
-- Clear error messages
-- Ports properly recycled
-
----
-
-### Test 6: Orphaned Data Cleanup
-
-**Setup:** 1 node stopped
-
-**Test Steps:**
-1. **Create Orphaned Data**
-   ```bash
-   # While node is stopped
-   mkdir -p data/bootstrap/orphaned_db/rqlite
-   echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
-   ```
-
-2. **Start Node**
-   ```bash
-   ./bin/node --data bootstrap --id bootstrap
-   ```
-
-3. **Check Reconciliation**
-   - Watch logs for "Starting orphaned data reconciliation"
-   - Verify "Found orphaned database directory"
-   - Verify "Removed orphaned database directory"
-
-4. **Verify Cleanup**
-   ```bash
-   ls data/bootstrap/
-   # orphaned_db should be gone
-   ```
-
-**Expected Results:**
-- Orphaned directories automatically detected
-- Removed on startup
-- Clean reconciliation logged
-
----
-
-### Test 7: Stress Test - Many Databases
-
-**Setup:** 5 nodes with high capacity
-
-**Configuration:**
-```yaml
-database:
-  max_databases: 50
-  port_range_http_start: 5001
-  port_range_http_end: 5150
-  port_range_raft_start: 7001
-  port_range_raft_end: 7150
-```
-
-**Test Steps:**
-1. 
**Create Many Databases** - ``` - Loop: Create databases db_1 through db_25 - ``` - -2. **Verify Distribution** - - Check logs for node capacity announcements - - Verify databases distributed across nodes - - No single node overloaded - -3. **Concurrent Operations** - - Write to multiple databases simultaneously - - Read from multiple databases - - Verify no conflicts - -4. **Hibernation Wave** - - Stop all activity - - Wait for hibernation - - Verify all databases hibernate - - Check resource usage drops - -5. **Wake-Up Storm** - - Query all 25 databases at once - - Verify all wake up successfully - - Check for thundering herd issues - -**Expected Results:** -- All 25 databases created successfully -- Even distribution across nodes -- No port conflicts -- Successful mass hibernation/wake-up - ---- - -### Test 8: Gateway API Access - -**Setup:** Gateway running with 3 nodes - -**Test Steps:** -1. **Authenticate** - ```bash - # Get JWT token - TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \ - -H "Content-Type: application/json" \ - -d '{"wallet": "..."}' | jq -r .token) - ``` - -2. **Create Table** - ```bash - curl -X POST http://localhost:8080/v1/database/create-table \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "database": "testdb", - "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)" - }' - ``` - -3. **Insert Data** - ```bash - curl -X POST http://localhost:8080/v1/database/exec \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "database": "testdb", - "sql": "INSERT INTO users (name, email) VALUES (?, ?)", - "args": ["Alice", "alice@example.com"] - }' - ``` - -4. **Query Data** - ```bash - curl -X POST http://localhost:8080/v1/database/query \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "database": "testdb", - "sql": "SELECT * FROM users" - }' - ``` - -5. **Test Transaction** - ```bash - curl -X POST http://localhost:8080/v1/database/transaction \ - -H "Authorization: Bearer $TOKEN" \ - -H "Content-Type: application/json" \ - -d '{ - "database": "testdb", - "queries": [ - "INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")", - "INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")" - ] - }' - ``` - -6. **Get Schema** - ```bash - curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \ - -H "Authorization: Bearer $TOKEN" - ``` - -7. 
**Test Hibernation** - - Wait for hibernation timeout - - Query again and measure wake-up time - - Should see delay on first query after hibernation - -**Expected Results:** -- All API calls succeed -- Data persists across calls -- Transactions are atomic -- Schema reflects created tables -- Hibernation/wake-up transparent to API -- Response times reasonable (< 30s for queries) - ---- - -## Test Checklist - -### Unit Tests (To Implement) -- [ ] Metadata Store operations -- [ ] Metadata Store concurrency -- [ ] Vector Clock increment -- [ ] Vector Clock merge -- [ ] Vector Clock compare -- [ ] Coordinator election (single node) -- [ ] Coordinator election (multiple nodes) -- [ ] Coordinator election (deterministic) -- [ ] Port Manager allocation -- [ ] Port Manager release -- [ ] Port Manager exhaustion -- [ ] Port Manager specific ports -- [ ] RQLite Instance creation -- [ ] RQLite Instance IsIdle -- [ ] Message marshal/unmarshal (all types) -- [ ] Coordinator response collection -- [ ] Coordinator node selection -- [ ] Coordinator registry - -### Integration Tests (To Implement) -- [ ] Single node database creation -- [ ] Three node database creation -- [ ] Multiple databases isolation -- [ ] Hibernation cycle -- [ ] Wake-up cycle -- [ ] Node failure detection -- [ ] Node replacement -- [ ] Orphaned data cleanup -- [ ] Concurrent database creation -- [ ] Concurrent hibernation - -### Manual Tests (To Perform) -- [ ] Basic three node flow -- [ ] Hibernation and wake-up -- [ ] Multiple databases -- [ ] Node failure and recovery -- [ ] Port exhaustion handling -- [ ] Orphaned data cleanup -- [ ] Stress test with many databases - -### Performance Validation -- [ ] Database creation < 10s -- [ ] Wake-up time < 8s -- [ ] Metadata sync < 5s -- [ ] Query overhead < 10ms additional - -## Running Tests - -### Unit Tests -```bash -# Run all tests -go test ./pkg/rqlite/... -v - -# Run with race detector -go test ./pkg/rqlite/... -race - -# Run specific test -go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v - -# Run with coverage -go test ./pkg/rqlite/... -cover -coverprofile=coverage.out -go tool cover -html=coverage.out -``` - -### Integration Tests -```bash -# Run e2e tests -go test ./e2e/... -v -timeout 30m - -# Run specific e2e test -go test ./e2e/ -run TestThreeNodeDatabaseCreation -v -``` - -### Manual Tests -Follow the scenarios above in dedicated terminals for each node. - -## Success Criteria - -### Correctness -✅ All unit tests pass -✅ All integration tests pass -✅ All manual scenarios complete successfully -✅ No data loss in any scenario -✅ No race conditions detected - -### Performance -✅ Database creation < 10 seconds -✅ Wake-up < 8 seconds -✅ Metadata sync < 5 seconds -✅ Query overhead < 10ms - -### Reliability -✅ Survives node failures -✅ Automatic recovery works -✅ No orphaned data accumulates -✅ Hibernation/wake-up cycles stable -✅ Concurrent operations safe - -## Notes for Future Test Enhancements - -When implementing advanced metrics and benchmarks: - -1. **Prometheus Metrics Tests** - - Verify metric export - - Validate metric values - - Test metric reset on restart - -2. **Benchmark Suite** - - Automated performance regression detection - - Latency percentile tracking (p50, p95, p99) - - Throughput measurements - - Resource usage profiling - -3. **Chaos Engineering** - - Random node kills - - Network partitions - - Clock skew simulation - - Disk full scenarios - -4. 
**Long-Running Stability**
-   - 24-hour soak test
-   - Memory leak detection
-   - Watch for slowly growing resource usage
-
-## Debugging Failed Tests
-
-### Common Issues
-
-**Port Conflicts**
-```bash
-# Check for processes using test ports
-lsof -i :5001-5999
-lsof -i :7001-7999
-
-# Kill stale processes
-pkill rqlited
-```
-
-**Stale Data**
-```bash
-# Clean test data directories
-rm -rf data/test_*/
-rm -rf /tmp/debros_test_*/
-```
-
-**Timing Issues**
-- Increase timeouts in flaky tests
-- Add retry logic with exponential backoff
-- Use proper synchronization primitives
-
-**Race Conditions**
-```bash
-# Always run with race detector during development
-go test -race ./...
-```
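-
-To make the checklist above concrete, here is one possible shape for the coordinator-election determinism test — a sketch assuming a helper `electCoordinator(peers []string) (string, error)` in `pkg/rqlite`; the real name and signature may differ:
-
-```go
-package rqlite
-
-import "testing"
-
-func TestElectCoordinator_Deterministic(t *testing.T) {
-	peers := []string{"12D3KooWC", "12D3KooWA", "12D3KooWB"}
-
-	first, err := electCoordinator(peers)
-	if err != nil {
-		t.Fatalf("election failed: %v", err)
-	}
-	// Lowest lexicographical peer ID must win.
-	if first != "12D3KooWA" {
-		t.Fatalf("expected 12D3KooWA, got %s", first)
-	}
-
-	// The same input must always elect the same coordinator.
-	for i := 0; i < 10; i++ {
-		again, err := electCoordinator(peers)
-		if err != nil {
-			t.Fatalf("run %d failed: %v", i, err)
-		}
-		if again != first {
-			t.Fatalf("non-deterministic election: %s vs %s", again, first)
-		}
-	}
-}
-```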