Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide

- Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files as they are no longer relevant to the current implementation.
- Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources.
anonpenguin23 2025-10-16 10:29:58 +03:00
parent dd4cb832dc
commit 36002d342c
No known key found for this signature in database
GPG Key ID: 1CBB1FE35AFBEE30
3 changed files with 0 additions and 1496 deletions


@@ -1,165 +0,0 @@
# Dynamic Database Clustering — Implementation Plan
### Scope
Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.
### Guiding Principles
- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations, feature-flag controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.
### Phase 0: Prep & Scaffolding
- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
- `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
- `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.
### Phase 1: Metadata Layer (No hibernation yet)
- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
- `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using existing `pkg/pubsub` bridge:
- Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
- Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs).
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.
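A minimal sketch of what this Phase 1 metadata store could look like is shown below; the type and field names are assumptions for illustration, not the final `pkg/rqlite/metadata.go` API:
```go
package metadata

import (
  "sync"
  "time"
)

// DatabaseMetadata describes one dynamic database cluster (fields assumed).
type DatabaseMetadata struct {
  Name              string            // namespaced name, e.g. "myapp_users"
  Status            string            // "creating" | "active" | "hibernating" | "waking"
  ReplicationFactor int
  Nodes             []string          // peer IDs hosting this database
  Ports             map[string][2]int // peer ID -> [httpPort, raftPort]
  LastQuery         time.Time
  VectorClock       map[string]uint64 // peer ID -> counter
}

// MetadataStore is a thread-safe, versioned view of all known databases.
type MetadataStore struct {
  mu        sync.RWMutex
  databases map[string]*DatabaseMetadata
}

func NewMetadataStore() *MetadataStore {
  return &MetadataStore{databases: make(map[string]*DatabaseMetadata)}
}

func (s *MetadataStore) Get(name string) (*DatabaseMetadata, bool) {
  s.mu.RLock()
  defer s.mu.RUnlock()
  md, ok := s.databases[name]
  return md, ok
}

func (s *MetadataStore) Set(md *DatabaseMetadata) {
  s.mu.Lock()
  defer s.mu.Unlock()
  s.databases[md.Name] = md
}
```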
### Phase 2: Database Creation & Client API
- Port management:
- `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping.
- Subprocess control:
- `RQLiteInstance` lifecycle (start, wait ready via /status and simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, fans in readiness signals, and surfaces failures.
- Client API:
- Update `pkg/client/interface.go` to include `Database(name string)`.
- Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
- Backoff polling for readiness during creation.
- Data isolation:
- Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
- Integration tests: create single db across 3 nodes; multiple databases coexisting; cross-node read/write.
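A rough sketch of a bind-probing port allocator along these lines; names and the probing strategy are illustrative, not the actual `PortManager` implementation:
```go
package ports

import (
  "fmt"
  "math/rand"
  "net"
  "sync"
)

// PortManager hands out ports from a configured range, probing each
// candidate with a real bind to make sure it is actually free.
type PortManager struct {
  mu         sync.Mutex
  start, end int
  allocated  map[int]bool
}

func NewPortManager(start, end int) *PortManager {
  return &PortManager{start: start, end: end, allocated: make(map[int]bool)}
}

// Allocate picks a random free port in the range and confirms it by binding.
func (pm *PortManager) Allocate() (int, error) {
  pm.mu.Lock()
  defer pm.mu.Unlock()
  size := pm.end - pm.start + 1
  for attempts := 0; attempts < size; attempts++ {
    port := pm.start + rand.Intn(size)
    if pm.allocated[port] {
      continue
    }
    // Bind-probe: if we can listen, the port is usable right now.
    l, err := net.Listen("tcp", fmt.Sprintf(":%d", port))
    if err != nil {
      continue
    }
    l.Close()
    pm.allocated[port] = true
    return port, nil
  }
  return 0, fmt.Errorf("no free ports in range %d-%d", pm.start, pm.end)
}

// Release returns a port to the pool.
func (pm *PortManager) Release(port int) {
  pm.mu.Lock()
  defer pm.mu.Unlock()
  delete(pm.allocated, port)
}
```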
### Phase 3: Hibernation & Wake-Up
- Idle detection and coordination:
- Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule.
- Hibernation protocol:
- Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
- Client detects `hibernating`, performs CAS to `waking`, triggers wake request; port reuse if available else re-negotiate; start instances; status → `active`.
- Client retry UX:
- Transparent retries with exponential backoff; treat `waking` as wait-only state.
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
### Phase 4: Resilience (Failure & Replacement)
- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
- Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, confirm selection, new node joins raft via `-join` then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.
### Phase 5: Production Hardening & Optimization
- Metrics/logging:
- Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1,3,5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.
### File Changes (Essentials)
- `pkg/config/config.go`
- Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
- Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval`.
- `pkg/client/interface.go`/`pkg/client/client.go`
- Add `Database(name string)` and app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
- Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
- Refactor to instance-oriented helpers from singleton.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect new `database` block.
### Config Example (target end-state)
```yaml
node:
  data_dir: "./data"
database:
  replication_factor: 3
  hibernation_timeout: 60
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999
discovery:
  health_check_interval: 10s
```
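For illustration, this block could map onto Go config structs like the following; the YAML tags and struct layout are assumptions based on the field list in the previous section:
```go
package config

import "time"

// DatabaseConfig mirrors the target `database` block above.
type DatabaseConfig struct {
  ReplicationFactor  int           `yaml:"replication_factor"`
  HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
  MaxDatabases       int           `yaml:"max_databases"`
  PortRange          PortRange     `yaml:"port_range"`
}

// PortRange bounds the dynamically allocated HTTP and Raft ports.
type PortRange struct {
  HTTPStart int `yaml:"http_start"`
  HTTPEnd   int `yaml:"http_end"`
  RaftStart int `yaml:"raft_start"`
  RaftEnd   int `yaml:"raft_end"`
}

// DiscoveryConfig gains a health-check interval for failure detection.
type DiscoveryConfig struct {
  HealthCheckInterval time.Duration `yaml:"health_check_interval"`
}
```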
### Rollout Strategy
- Keep feature flag off by default; support legacy single-cluster path.
- Ship Phase 1 behind flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
- Remove legacy config after deprecation window.
### Testing & Quality Gates
- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.
### Risks & Mitigations (operationalized)
- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.
### Timeline (indicative)
- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5
### To-dos
- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides


@@ -1,504 +0,0 @@
# Dynamic Database Clustering - User Guide
## Overview
Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as its own rqlite cluster (three nodes by default) with dedicated data directories and port allocations.
## Key Features
- **Multi-Database Support** - Create isolated databases on demand (up to the configured `max_databases`)
- **3-Node Replication** - Fault-tolerant by default (configurable)
- **Auto Hibernation** - Idle databases hibernate to save resources
- **Transparent Wake-Up** - Automatic restart on access
- **App Namespacing** - Databases are scoped by application name
- **Decentralized Metadata** - LibP2P pubsub-based coordination
- **Failure Recovery** - Automatic node replacement on failures
- **Resource Optimization** - Dynamic port allocation and data isolation
## Configuration
### Node Configuration (`configs/node.yaml`)
```yaml
node:
  data_dir: "./data"
  listen_addresses:
    - "/ip4/0.0.0.0/tcp/4001"
  max_connections: 50

database:
  replication_factor: 3        # Number of replicas per database
  hibernation_timeout: 60s     # Idle time before hibernation
  max_databases: 100           # Max databases per node
  port_range_http_start: 5001  # HTTP port range start
  port_range_http_end: 5999    # HTTP port range end
  port_range_raft_start: 7001  # Raft port range start
  port_range_raft_end: 7999    # Raft port range end

discovery:
  bootstrap_peers:
    - "/ip4/127.0.0.1/tcp/4001/p2p/..."
  discovery_interval: 30s
  health_check_interval: 10s
```
### Key Configuration Options
#### `database.replication_factor` (default: 3)
Number of nodes that will host each database cluster. Minimum 1, recommended 3 for fault tolerance.
#### `database.hibernation_timeout` (default: 60s)
Time of inactivity before a database is hibernated. Set to 0 to disable hibernation.
#### `database.max_databases` (default: 100)
Maximum number of databases this node can host simultaneously.
#### `database.port_range_*`
Port ranges for dynamic allocation. Ensure ranges are large enough for `max_databases * 2` ports (HTTP + Raft per database).
## Client Usage
### Creating/Accessing Databases
```go
package main

import (
  "context"

  "github.com/DeBrosOfficial/network/pkg/client"
)

func main() {
  // Create client with app name for namespacing
  cfg := client.DefaultClientConfig("myapp")
  cfg.BootstrapPeers = []string{
    "/ip4/127.0.0.1/tcp/4001/p2p/...",
  }

  c, err := client.NewClient(cfg)
  if err != nil {
    panic(err)
  }

  // Connect to network
  if err := c.Connect(); err != nil {
    panic(err)
  }
  defer c.Disconnect()

  // Get database client (creates database if it doesn't exist)
  db, err := c.Database().Database("users")
  if err != nil {
    panic(err)
  }

  // Use the database
  ctx := context.Background()
  err = db.CreateTable(ctx, `
    CREATE TABLE users (
      id INTEGER PRIMARY KEY,
      name TEXT NOT NULL,
      email TEXT UNIQUE
    )
  `)
  if err != nil {
    panic(err)
  }

  // Query data
  result, err := db.Query(ctx, "SELECT * FROM users")
  if err != nil {
    panic(err)
  }
  _ = result
  // ...
}
```
### Database Naming
Databases are automatically namespaced by your application name:
- `client.Database("users")` → creates `myapp_users` internally
- This prevents name collisions between different applications
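A possible shape for that namespacing step, assuming sanitization rules consistent with the `<app>_<db>` data-directory convention (illustrative only, not the actual client code):
```go
package client

import (
  "fmt"
  "regexp"
  "strings"
)

var unsafeChars = regexp.MustCompile(`[^a-z0-9_]+`)

// namespacedName turns ("myapp", "users") into "myapp_users",
// lower-casing both parts and replacing characters that are unsafe
// in directory names and identifiers.
func namespacedName(appName, dbName string) string {
  sanitize := func(s string) string {
    s = strings.ToLower(strings.TrimSpace(s))
    return unsafeChars.ReplaceAllString(s, "_")
  }
  return fmt.Sprintf("%s_%s", sanitize(appName), sanitize(dbName))
}
```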
## Gateway API Usage
If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints:
### Base URL
```
http://gateway-host:8080/v1/database/
```
### Execute SQL (INSERT, UPDATE, DELETE, DDL)
```bash
POST /v1/database/exec
Content-Type: application/json
{
  "database": "users",
  "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
  "args": ["Alice", "alice@example.com"]
}

Response:
{
  "rows_affected": 1,
  "last_insert_id": 1
}
```
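For reference, the same `exec` call can be made from Go with only the standard library; the endpoint and payload mirror the example above, and the token value is a placeholder:
```go
package main

import (
  "bytes"
  "encoding/json"
  "fmt"
  "net/http"
)

func main() {
  payload := map[string]any{
    "database": "users",
    "sql":      "INSERT INTO users (name, email) VALUES (?, ?)",
    "args":     []any{"Alice", "alice@example.com"},
  }
  body, _ := json.Marshal(payload)

  req, err := http.NewRequest("POST", "http://gateway-host:8080/v1/database/exec", bytes.NewReader(body))
  if err != nil {
    panic(err)
  }
  req.Header.Set("Content-Type", "application/json")
  req.Header.Set("Authorization", "Bearer <your-token>") // JWT or API key

  resp, err := http.DefaultClient.Do(req)
  if err != nil {
    panic(err)
  }
  defer resp.Body.Close()
  fmt.Println("status:", resp.Status)
}
```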
### Query Data (SELECT)
```bash
POST /v1/database/query
Content-Type: application/json
{
  "database": "users",
  "sql": "SELECT * FROM users WHERE name LIKE ?",
  "args": ["A%"]
}

Response:
{
  "items": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"}
  ],
  "count": 1
}
```
### Execute Transaction
```bash
POST /v1/database/transaction
Content-Type: application/json
{
  "database": "users",
  "queries": [
    "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')",
    "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'"
  ]
}

Response:
{
  "success": true
}
```
### Get Schema
```bash
GET /v1/database/schema?database=users
# OR
POST /v1/database/schema
Content-Type: application/json
{
  "database": "users"
}

Response:
{
  "tables": [
    {
      "name": "users",
      "columns": ["id", "name", "email"]
    }
  ]
}
```
### Create Table
```bash
POST /v1/database/create-table
Content-Type: application/json
{
  "database": "users",
  "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}

Response:
{
  "rows_affected": 0
}
```
### Drop Table
```bash
POST /v1/database/drop-table
Content-Type: application/json
{
  "database": "users",
  "table_name": "old_table"
}

Response:
{
  "rows_affected": 0
}
```
### List Databases
```bash
GET /v1/database/list
Response:
{
"databases": ["users", "products", "orders"]
}
```
### Important Notes
1. **Authentication Required**: All endpoints require authentication (JWT or API key)
2. **Database Creation**: Databases are created automatically on first access
3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on first query to a hibernating database
4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s
5. **Namespacing**: Database names are automatically prefixed with your app name
6. **Concurrent Access**: All endpoints are safe for concurrent use
## Database Lifecycle
### 1. Creation
When you first access a database:
1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST`
2. **Node Selection** - Eligible nodes respond with available ports
3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen
4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM`
5. **Instance Startup** - Selected nodes start rqlite subprocesses
6. **Readiness** - Nodes report `active` status when ready
**Typical creation time: < 10 seconds**
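The pubsub messages behind this flow might look roughly like the following sketch; message and field names are assumptions consistent with the protocol steps above, not the exact wire format:
```go
package rqlite

// DatabaseCreateRequest is broadcast by the node that first sees a
// request for a database that does not exist yet.
type DatabaseCreateRequest struct {
  Database          string `json:"database"`  // namespaced name
  Requester         string `json:"requester"` // peer ID
  ReplicationFactor int    `json:"replication_factor"`
}

// DatabaseCreateResponse is sent by nodes willing to host the database,
// along with the ports they have reserved for it.
type DatabaseCreateResponse struct {
  Database string `json:"database"`
  NodeID   string `json:"node_id"`
  HTTPPort int    `json:"http_port"`
  RaftPort int    `json:"raft_port"`
}

// DatabaseCreateConfirm is broadcast by the deterministic coordinator
// once enough responses have arrived, naming the selected nodes.
type DatabaseCreateConfirm struct {
  Database      string   `json:"database"`
  SelectedNodes []string `json:"selected_nodes"`
}
```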
### 2. Active State
- Database instances run as rqlite subprocesses
- Each instance tracks `LastQuery` timestamp
- Queries update the activity timestamp
- Metadata synced across all network nodes
### 3. Hibernation
After `hibernation_timeout` of inactivity:
1. **Idle Detection** - Nodes detect idle databases
2. **Idle Notification** - Nodes broadcast idle status
3. **Coordinated Shutdown** - When all nodes report idle, coordinator schedules shutdown
4. **Graceful Stop** - SIGTERM sent to rqlite processes
5. **Port Release** - Ports freed for reuse
6. **Status Update** - Metadata updated to `hibernating`
**Data persists on disk during hibernation**
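A simplified sketch of the idle scan described above; the notification hook is a placeholder for the real pubsub broadcast:
```go
package dbcluster

import (
  "sync"
  "time"
)

// instance tracks the minimum state the idle scan needs.
type instance struct {
  mu        sync.Mutex
  lastQuery time.Time
}

// touch records query activity and resets the idle timer.
func (i *instance) touch() {
  i.mu.Lock()
  i.lastQuery = time.Now()
  i.mu.Unlock()
}

// isIdle reports whether the instance has been quiet longer than timeout.
func (i *instance) isIdle(timeout time.Duration) bool {
  i.mu.Lock()
  defer i.mu.Unlock()
  return timeout > 0 && time.Since(i.lastQuery) > timeout
}

// scanIdle runs periodically and reports idle databases so the node can
// broadcast its idle status; notify stands in for the pubsub call.
func scanIdle(instances map[string]*instance, timeout time.Duration, notify func(db string)) {
  for name, inst := range instances {
    if inst.isIdle(timeout) {
      notify(name)
    }
  }
}
```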
### 4. Wake-Up
On first query to hibernating database:
1. **Detection** - Client/node detects `hibernating` status
2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST`
3. **Port Allocation** - Reuse original ports or allocate new ones
4. **Instance Restart** - Restart rqlite with existing data
5. **Status Update** - Update to `active` when ready
**Typical wake-up time: < 8 seconds**
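On the client side, wake-up is hidden behind retries with exponential backoff; a minimal sketch, assuming the client can distinguish a "waking" condition from other failures:
```go
package client

import (
  "context"
  "errors"
  "time"
)

// errWaking stands in for whatever error/status the client sees while a
// database is still hibernating or waking up.
var errWaking = errors.New("database is waking up")

// withWakeupRetry retries op with exponential backoff while the database
// reports a hibernating/waking state, up to the context deadline.
func withWakeupRetry(ctx context.Context, op func() error) error {
  backoff := 250 * time.Millisecond
  for {
    err := op()
    if err == nil || !errors.Is(err, errWaking) {
      return err
    }
    select {
    case <-ctx.Done():
      return ctx.Err()
    case <-time.After(backoff):
    }
    if backoff < 4*time.Second {
      backoff *= 2
    }
  }
}
```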
### 5. Failure Recovery
When a node fails:
1. **Health Detection** - Missed health checks trigger failure detection
2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED`
3. **Offers** - Healthy nodes with capacity offer to replace
4. **Selection** - First offer accepted (simple approach)
5. **Join Cluster** - New node joins existing Raft cluster
6. **Sync** - Data synced from existing members
## Data Management
### Data Directories
Each database gets its own data directory:
```
./data/
├── myapp_users/        # Database: users
│   └── rqlite/
│       ├── db.sqlite
│       └── raft/
├── myapp_products/     # Database: products
│   └── rqlite/
└── myapp_orders/       # Database: orders
    └── rqlite/
```
### Orphaned Data Cleanup
On node startup, the system automatically:
- Scans data directories
- Checks against metadata
- Removes directories for:
- Non-existent databases
- Databases where this node is not a member
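A sketch of that reconciliation pass, assuming a metadata lookup that can answer whether this node is a member of a given database:
```go
package dbcluster

import (
  "os"
  "path/filepath"
)

// reconcileDataDirs removes local database directories that either no
// longer exist in metadata or do not list this node as a member.
// isMember is a placeholder for the metadata lookup.
func reconcileDataDirs(dataDir string, isMember func(db string) bool) error {
  entries, err := os.ReadDir(dataDir)
  if err != nil {
    return err
  }
  for _, e := range entries {
    if !e.IsDir() {
      continue
    }
    if isMember(e.Name()) {
      continue
    }
    // Orphaned: not in metadata, or this node is not a member.
    if err := os.RemoveAll(filepath.Join(dataDir, e.Name())); err != nil {
      return err
    }
  }
  return nil
}
```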
## Monitoring & Debugging
### Structured Logging
All operations are logged with structured fields:
```
INFO Starting cluster manager node_id=12D3... max_databases=100
INFO Received database create request database=myapp_users requester=12D3...
INFO Database instance started database=myapp_users http_port=5001 raft_port=7001
INFO Database is idle database=myapp_users idle_time=62s
INFO Database hibernated successfully database=myapp_users
INFO Received wakeup request database=myapp_users
INFO Database woke up successfully database=myapp_users
```
### Health Checks
Nodes perform periodic health checks:
- Every `health_check_interval` (default: 10s)
- Tracks last-seen time for each peer
- 3 missed checks → node marked unhealthy
- Triggers replacement protocol for affected databases
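A simplified tracker matching this behavior (three missed checks mark a peer unhealthy); the interval handling and replacement trigger are placeholders, not the actual health-check code:
```go
package dbcluster

import (
  "sync"
  "time"
)

const missedChecksThreshold = 3

// healthTracker records when each peer was last seen and counts
// consecutive missed health checks.
type healthTracker struct {
  mu       sync.Mutex
  lastSeen map[string]time.Time
  missed   map[string]int
}

func newHealthTracker() *healthTracker {
  return &healthTracker{lastSeen: map[string]time.Time{}, missed: map[string]int{}}
}

// markSeen is called whenever a peer responds to a health check.
func (h *healthTracker) markSeen(peer string) {
  h.mu.Lock()
  defer h.mu.Unlock()
  h.lastSeen[peer] = time.Now()
  h.missed[peer] = 0
}

// check runs every health_check_interval and returns peers that have just
// crossed the unhealthy threshold so replacement can be triggered.
func (h *healthTracker) check(interval time.Duration) []string {
  h.mu.Lock()
  defer h.mu.Unlock()
  var unhealthy []string
  for peer, seen := range h.lastSeen {
    if time.Since(seen) > interval {
      h.missed[peer]++
      if h.missed[peer] == missedChecksThreshold {
        unhealthy = append(unhealthy, peer)
      }
    }
  }
  return unhealthy
}
```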
## Best Practices
### 1. **Capacity Planning**
```yaml
# For 100 databases with 3-node replication:
database:
  max_databases: 100
  port_range_http_start: 5001
  port_range_http_end: 5200    # 200 ports (100 databases * 2)
  port_range_raft_start: 7001
  port_range_raft_end: 7200
```
### 2. **Hibernation Tuning**
- **High Traffic**: Set `hibernation_timeout: 300s` or higher
- **Development**: Set `hibernation_timeout: 30s` for faster cycles
- **Always-On DBs**: Set `hibernation_timeout: 0` to disable
### 3. **Replication Factor**
- **Development**: `replication_factor: 1` (single node, no replication)
- **Production**: `replication_factor: 3` (fault tolerant)
- **High Availability**: `replication_factor: 5` (survives 2 failures)
### 4. **Network Topology**
- Use at least 3 nodes for `replication_factor: 3`
- Ensure total database instances fit the cluster: `databases × replication_factor <= nodes × max_databases`
- Example: 3 nodes × 100 `max_databases` = capacity for 300 instances, i.e. 100 databases at replication factor 3
## Troubleshooting
### Database Creation Fails
**Problem**: `insufficient nodes responded: got 1, need 3`
**Solution**:
- Ensure you have at least `replication_factor` nodes online
- Check `max_databases` limit on nodes
- Verify port ranges aren't exhausted
### Database Not Waking Up
**Problem**: Database stays in `waking` status
**Solution**:
- Check node logs for rqlite startup errors
- Verify rqlite binary is installed
- Check port conflicts (use different port ranges)
- Ensure data directory is accessible
### Orphaned Data
**Problem**: Disk space consumed by old databases
**Solution**:
- Orphaned data is automatically cleaned on node restart
- Manual cleanup: Delete directories from `./data/` that don't match metadata
- Check logs for reconciliation results
### Node Replacement Not Working
**Problem**: Failed node not replaced
**Solution**:
- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`)
- Check network connectivity between nodes
- Verify health check interval is reasonable (not too aggressive)
## Advanced Topics
### Metadata Consistency
- **Vector Clocks**: Each metadata update includes vector clock for conflict resolution
- **Gossip Protocol**: Periodic metadata sync via checksums
- **Eventual Consistency**: All nodes eventually agree on database state
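A minimal vector clock with the merge and compare semantics described above; the type and method names are illustrative:
```go
package rqlite

// VectorClock maps peer IDs to monotonically increasing counters.
type VectorClock map[string]uint64

// Increment bumps this node's counter.
func (vc VectorClock) Increment(nodeID string) { vc[nodeID]++ }

// Merge keeps the per-node maximum of both clocks.
func (vc VectorClock) Merge(other VectorClock) {
  for node, v := range other {
    if v > vc[node] {
      vc[node] = v
    }
  }
}

// Compare returns -1 if vc happened before other, 1 if after,
// and 0 if the clocks are concurrent or identical.
func (vc VectorClock) Compare(other VectorClock) int {
  less, greater := false, false
  for node := range union(vc, other) {
    a, b := vc[node], other[node]
    if a < b {
      less = true
    }
    if a > b {
      greater = true
    }
  }
  switch {
  case less && !greater:
    return -1
  case greater && !less:
    return 1
  default:
    return 0
  }
}

func union(a, b VectorClock) map[string]struct{} {
  keys := make(map[string]struct{}, len(a)+len(b))
  for k := range a {
    keys[k] = struct{}{}
  }
  for k := range b {
    keys[k] = struct{}{}
  }
  return keys
}
```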
### Port Management
- Ports allocated randomly within configured ranges
- Bind-probing ensures ports are actually available
- Ports reused during wake-up when possible
- Failed allocations fall back to new random ports
### Coordinator Election
- Deterministic selection based on lexicographical peer ID ordering
- Lowest peer ID becomes coordinator
- No persistent coordinator state
- Re-election occurs for each database operation
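Election therefore reduces to picking the lexicographically smallest peer ID; a sketch of such an `ElectCoordinator` helper:
```go
package rqlite

import (
  "errors"
  "sort"
)

// ElectCoordinator deterministically picks the lowest peer ID.
// Every node running this over the same candidate set picks the same
// coordinator, so no extra coordination round is needed.
func ElectCoordinator(peerIDs []string) (string, error) {
  if len(peerIDs) == 0 {
    return "", errors.New("no candidate nodes")
  }
  sorted := append([]string(nil), peerIDs...)
  sort.Strings(sorted)
  return sorted[0], nil
}
```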
## Migration from Legacy Mode
If upgrading from single-cluster rqlite:
1. **Backup Data**: Backup your existing `./data/rqlite` directory
2. **Update Config**: Remove deprecated fields:
- `database.data_dir`
- `database.rqlite_port`
- `database.rqlite_raft_port`
- `database.rqlite_join_address`
3. **Add New Fields**: Configure dynamic clustering (see Configuration section)
4. **Restart Nodes**: Restart all nodes with new configuration
5. **Migrate Data**: Create new database and import data from backup
## Future Enhancements
The following features are planned for future releases:
### **Advanced Metrics** (Future)
- Prometheus-style metrics export
- Per-database query counters
- Hibernation/wake-up latency histograms
- Resource utilization gauges
### **Performance Benchmarks** (Future)
- Automated benchmark suite
- Creation time SLOs
- Wake-up latency targets
- Query overhead measurements
### **Enhanced Monitoring** (Future)
- Dashboard for cluster visualization
- Database status API endpoint
- Capacity planning tools
- Alerting integration
## Support
For issues, questions, or contributions:
- GitHub Issues: https://github.com/DeBrosOfficial/network/issues
- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md
## License
See LICENSE file for details.


@@ -1,827 +0,0 @@
# Dynamic Database Clustering - Testing Guide
This guide provides a comprehensive list of unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.
## Unit Tests
### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)
```go
// Test cases to implement:
func TestMetadataStore_GetSetDatabase(t *testing.T)
- Create store
- Set database metadata
- Get database metadata
- Verify data matches
func TestMetadataStore_DeleteDatabase(t *testing.T)
- Set database metadata
- Delete database
- Verify Get returns nil
func TestMetadataStore_ListDatabases(t *testing.T)
- Add multiple databases
- List all databases
- Verify count and contents
func TestMetadataStore_ConcurrentAccess(t *testing.T)
- Spawn multiple goroutines
- Concurrent reads and writes
- Verify no race conditions (run with -race)
func TestMetadataStore_NodeCapacity(t *testing.T)
- Set node capacity
- Get node capacity
- Update capacity
- List nodes
```
### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)
```go
func TestVectorClock_Increment(t *testing.T)
- Create empty vector clock
- Increment for node A
- Verify counter is 1
- Increment again
- Verify counter is 2
func TestVectorClock_Merge(t *testing.T)
- Create two vector clocks with different nodes
- Merge them
- Verify max values are preserved
func TestVectorClock_Compare(t *testing.T)
- Test strictly less than case
- Test strictly greater than case
- Test concurrent case
- Test identical case
func TestVectorClock_Concurrent(t *testing.T)
- Create clocks with overlapping updates
- Verify Compare returns 0 (concurrent)
```
### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)
```go
func TestElectCoordinator_SingleNode(t *testing.T)
- Pass single node ID
- Verify it's elected
func TestElectCoordinator_MultipleNodes(t *testing.T)
- Pass multiple node IDs
- Verify lowest lexicographical ID wins
- Verify deterministic (same input = same output)
func TestElectCoordinator_EmptyList(t *testing.T)
- Pass empty list
- Verify error returned
func TestElectCoordinator_Deterministic(t *testing.T)
- Run election multiple times with same inputs
- Verify same coordinator each time
```
### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)
```go
func TestPortManager_AllocatePortPair(t *testing.T)
- Create manager with port range
- Allocate port pair
- Verify HTTP and Raft ports different
- Verify ports within range
func TestPortManager_ReleasePortPair(t *testing.T)
- Allocate port pair
- Release ports
- Verify ports can be reallocated
func TestPortManager_Exhaustion(t *testing.T)
- Allocate all available ports
- Attempt one more allocation
- Verify error returned
func TestPortManager_IsPortAllocated(t *testing.T)
- Allocate ports
- Check IsPortAllocated returns true
- Release ports
- Check IsPortAllocated returns false
func TestPortManager_AllocateSpecificPorts(t *testing.T)
- Allocate specific ports
- Verify allocation succeeds
- Attempt to allocate same ports again
- Verify error returned
```
### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)
```go
func TestRQLiteInstance_Create(t *testing.T)
- Create instance configuration
- Verify fields set correctly
func TestRQLiteInstance_IsIdle(t *testing.T)
- Set LastQuery to old timestamp
- Verify IsIdle returns true
- Update LastQuery
- Verify IsIdle returns false
// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
- Create instance
- Start instance
- Verify HTTP endpoint responsive
- Stop instance
- Verify process terminated
```
### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)
```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
- Create each message type
- Marshal to bytes
- Unmarshal back
- Verify data preserved
func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```
### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)
```go
func TestCreateCoordinator_AddResponse(t *testing.T)
- Create coordinator
- Add responses
- Verify response count
func TestCreateCoordinator_SelectNodes(t *testing.T)
- Add more responses than needed
- Call SelectNodes
- Verify correct number selected
- Verify deterministic selection
func TestCreateCoordinator_WaitForResponses(t *testing.T)
- Create coordinator
- Wait in goroutine
- Add responses from another goroutine
- Verify wait completes when enough responses
func TestCoordinatorRegistry(t *testing.T)
- Register coordinator
- Get coordinator
- Remove coordinator
- Verify lifecycle
```
## Integration Tests
### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)
```go
func TestSingleNodeDatabaseCreation(t *testing.T)
- Start 1 node
- Set replication_factor = 1
- Create database
- Verify database active
- Write data
- Read data back
- Verify data matches
```
### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)
```go
func TestThreeNodeDatabaseCreation(t *testing.T)
- Start 3 nodes
- Set replication_factor = 3
- Create database from node 1
- Wait for all nodes to report active
- Write data to node 1
- Read from node 2
- Verify replication worked
```
### 3. Multiple Databases (`e2e/multiple_databases_test.go`)
```go
func TestMultipleDatabases(t *testing.T)
- Start 3 nodes
- Create database "users"
- Create database "products"
- Create database "orders"
- Verify all databases active
- Write to each database
- Verify data isolation
```
### 4. Hibernation Cycle (`e2e/hibernation_test.go`)
```go
func TestHibernationCycle(t *testing.T)
- Start 3 nodes with hibernation_timeout=5s
- Create database
- Write initial data
- Wait 10 seconds (no activity)
- Verify status = hibernating
- Verify processes stopped
- Verify data persisted on disk
func TestWakeUpCycle(t *testing.T)
- Create and hibernate database
- Issue query
- Wait for wake-up
- Verify status = active
- Verify data still accessible
- Verify LastQuery updated
```
### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)
```go
func TestNodeFailureDetection(t *testing.T)
- Start 3 nodes
- Create database
- Kill one node (SIGKILL)
- Wait for health checks to detect failure
- Verify NODE_REPLACEMENT_NEEDED broadcast
func TestNodeReplacement(t *testing.T)
- Start 4 nodes
- Create database on nodes 1,2,3
- Kill node 3
- Wait for replacement
- Verify node 4 joins cluster
- Verify data accessible from node 4
```
### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)
```go
func TestOrphanedDataCleanup(t *testing.T)
- Start node
- Manually create orphaned data directory
- Restart node
- Verify orphaned directory removed
- Check logs for reconciliation message
```
### 7. Concurrent Operations (`e2e/concurrent_test.go`)
```go
func TestConcurrentDatabaseCreation(t *testing.T)
- Start 5 nodes
- Create 10 databases concurrently
- Verify all successful
- Verify no port conflicts
- Verify proper distribution
func TestConcurrentHibernation(t *testing.T)
- Create multiple databases
- Let all go idle
- Verify all hibernate correctly
- No race conditions
```
## Manual Test Scenarios
### Test 1: Basic Flow - Three Node Cluster
**Setup:**
```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001
# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002
# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```
**Test Steps:**
1. **Create Database**
```bash
# Use client or API to create database "testdb"
```
2. **Verify Creation**
- Check logs on all 3 nodes for "Database instance started"
- Verify `./data/*/testdb/` directories exist on all nodes
- Check different ports allocated on each node
3. **Write Data**
```sql
CREATE TABLE users (id INT, name TEXT);
INSERT INTO users VALUES (1, 'Alice');
INSERT INTO users VALUES (2, 'Bob');
```
4. **Verify Replication**
- Query from each node
- Verify same data returned
**Expected Results:**
- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node
---
### Test 2: Hibernation and Wake-Up
**Setup:** Same as Test 1 with database created
**Test Steps:**
1. **Check Activity**
```bash
# In logs, verify "last_query" timestamps updating on queries
```
2. **Wait for Hibernation**
- Stop issuing queries
- Wait `hibernation_timeout` + 10s
- Check logs for "Database is idle"
- Verify "Coordinated shutdown message sent"
- Verify "Database hibernated successfully"
3. **Verify Hibernation**
```bash
# Check that rqlite processes are stopped
ps aux | grep rqlite
# Verify data directories still exist
ls -la data/*/testdb/
```
4. **Wake Up**
- Issue a query to the database
- Watch logs for "Received wakeup request"
- Verify "Database woke up successfully"
- Verify query succeeds
**Expected Results:**
- Hibernation happens after idle timeout
- All 3 nodes hibernate coordinated
- Wake-up completes in < 8 seconds
- Data persists across hibernation cycle
---
### Test 3: Multiple Databases
**Setup:** 3 nodes running
**Test Steps:**
1. **Create Multiple Databases**
```
Create: users_db
Create: products_db
Create: orders_db
```
2. **Verify Isolation**
- Insert data in users_db
- Verify data NOT in products_db
- Verify data NOT in orders_db
3. **Check Port Allocation**
```bash
# Verify different ports for each database
netstat -tlnp | grep rqlite
# OR
ss -tlnp | grep rqlite
```
4. **Verify Data Directories**
```bash
tree data/bootstrap/
# Should show:
# ├── users_db/
# ├── products_db/
# └── orders_db/
```
**Expected Results:**
- 3 separate database clusters
- Each with 3 nodes (9 total instances)
- Complete data isolation
- Unique port pairs for each instance
---
### Test 4: Node Failure and Recovery
**Setup:** 4 nodes running, database created on nodes 1-3
**Test Steps:**
1. **Verify Initial State**
- Database active on nodes 1, 2, 3
- Node 4 idle
2. **Simulate Failure**
```bash
# Kill node 3 (SIGKILL for unclean shutdown)
kill -9 <node3_pid>
```
3. **Watch for Detection**
- Check logs on nodes 1 and 2
- Wait for health check failures (3 missed pings)
- Verify "Node detected as unhealthy" messages
4. **Watch for Replacement**
- Check for "NODE_REPLACEMENT_NEEDED" broadcast
- Node 4 should offer to replace
- Verify "Starting as replacement node" on node 4
- Verify node 4 joins Raft cluster
5. **Verify Data Integrity**
- Query database from node 4
- Verify all data present
- Insert new data from node 4
- Verify replication to nodes 1 and 2
**Expected Results:**
- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from new node
- No data loss
---
### Test 5: Port Exhaustion
**Setup:** 1 node with small port range
**Configuration:**
```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```
**Test Steps:**
1. **Create Databases**
- Create database 1 (succeeds - uses 2 ports)
- Create database 2 (succeeds - uses 2 ports)
- Create database 3 (fails - only 1 port left)
2. **Verify Error**
- Check logs for "Cannot allocate ports"
- Verify error returned to client
3. **Free Ports**
- Hibernate or delete database 1
- Ports should be freed
4. **Retry**
- Create database 3 again
- Should succeed now
**Expected Results:**
- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled
---
### Test 6: Orphaned Data Cleanup
**Setup:** 1 node stopped
**Test Steps:**
1. **Create Orphaned Data**
```bash
# While node is stopped
mkdir -p data/bootstrap/orphaned_db/rqlite
echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
```
2. **Start Node**
```bash
./bin/node --data bootstrap --id bootstrap
```
3. **Check Reconciliation**
- Watch logs for "Starting orphaned data reconciliation"
- Verify "Found orphaned database directory"
- Verify "Removed orphaned database directory"
4. **Verify Cleanup**
```bash
ls data/bootstrap/
# orphaned_db should be gone
```
**Expected Results:**
- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged
---
### Test 7: Stress Test - Many Databases
**Setup:** 5 nodes with high capacity
**Configuration:**
```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```
**Test Steps:**
1. **Create Many Databases**
```
Loop: Create databases db_1 through db_25
```
2. **Verify Distribution**
- Check logs for node capacity announcements
- Verify databases distributed across nodes
- No single node overloaded
3. **Concurrent Operations**
- Write to multiple databases simultaneously
- Read from multiple databases
- Verify no conflicts
4. **Hibernation Wave**
- Stop all activity
- Wait for hibernation
- Verify all databases hibernate
- Check resource usage drops
5. **Wake-Up Storm**
- Query all 25 databases at once
- Verify all wake up successfully
- Check for thundering herd issues
**Expected Results:**
- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up
---
### Test 8: Gateway API Access
**Setup:** Gateway running with 3 nodes
**Test Steps:**
1. **Authenticate**
```bash
# Get JWT token
TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
-H "Content-Type: application/json" \
-d '{"wallet": "..."}' | jq -r .token)
```
2. **Create Table**
```bash
curl -X POST http://localhost:8080/v1/database/create-table \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}'
```
3. **Insert Data**
```bash
curl -X POST http://localhost:8080/v1/database/exec \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"sql": "INSERT INTO users (name, email) VALUES (?, ?)",
"args": ["Alice", "alice@example.com"]
}'
```
4. **Query Data**
```bash
curl -X POST http://localhost:8080/v1/database/query \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"sql": "SELECT * FROM users"
}'
```
5. **Test Transaction**
```bash
curl -X POST http://localhost:8080/v1/database/transaction \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"database": "testdb",
"queries": [
"INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
"INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
]
}'
```
6. **Get Schema**
```bash
curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
-H "Authorization: Bearer $TOKEN"
```
7. **Test Hibernation**
- Wait for hibernation timeout
- Query again and measure wake-up time
- Should see delay on first query after hibernation
**Expected Results:**
- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects created tables
- Hibernation/wake-up transparent to API
- Response times reasonable (< 30s for queries)
---
## Test Checklist
### Unit Tests (To Implement)
- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry
### Integration Tests (To Implement)
- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation
### Manual Tests (To Perform)
- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases
### Performance Validation
- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional
## Running Tests
### Unit Tests
```bash
# Run all tests
go test ./pkg/rqlite/... -v
# Run with race detector
go test ./pkg/rqlite/... -race
# Run specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v
# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```
### Integration Tests
```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m
# Run specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```
### Manual Tests
Follow the scenarios above in dedicated terminals for each node.
## Success Criteria
### Correctness
✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected
### Performance
✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms
### Reliability
✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles stable
✅ Concurrent operations safe
## Notes for Future Test Enhancements
When implementing advanced metrics and benchmarks:
1. **Prometheus Metrics Tests**
- Verify metric export
- Validate metric values
- Test metric reset on restart
2. **Benchmark Suite**
- Automated performance regression detection
- Latency percentile tracking (p50, p95, p99)
- Throughput measurements
- Resource usage profiling
3. **Chaos Engineering**
- Random node kills
- Network partitions
- Clock skew simulation
- Disk full scenarios
4. **Long-Running Stability**
- 24-hour soak test
- Memory leak detection
- Slow-growing resource usage
## Debugging Failed Tests
### Common Issues
**Port Conflicts**
```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999
# Kill stale processes
pkill rqlited
```
**Stale Data**
```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```
**Timing Issues**
- Increase timeouts in flaky tests
- Add retry logic with exponential backoff
- Use proper synchronization primitives
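One way to add that retry logic in Go tests; this `eventually` helper is a sketch, not an existing helper in the repository:
```go
package e2e

import (
  "testing"
  "time"
)

// eventually retries check with exponential backoff until it succeeds or
// the deadline passes, then fails the test. Useful for flaky e2e waits.
func eventually(t *testing.T, deadline time.Duration, check func() error) {
  t.Helper()
  backoff := 100 * time.Millisecond
  end := time.Now().Add(deadline)
  var lastErr error
  for time.Now().Before(end) {
    if lastErr = check(); lastErr == nil {
      return
    }
    time.Sleep(backoff)
    if backoff < 2*time.Second {
      backoff *= 2
    }
  }
  t.Fatalf("condition not met within %v: %v", deadline, lastErr)
}
```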
**Race Conditions**
```bash
# Always run with race detector during development
go test -race ./...
```