Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide

- Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files, as they are no longer relevant to the current implementation.
- Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources.

@@ -1,165 +0,0 @@

<!-- ec358e91-8e19-4fc8-a81e-cb388a4b2fc9 4c357d4a-bae7-4fe2-943d-84e5d3d3714c -->

# Dynamic Database Clustering — Implementation Plan

### Scope

Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.

### Guiding Principles

- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations; feature-flag controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.

### Phase 0: Prep & Scaffolding

- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
  - `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
  - `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.

### Phase 1: Metadata Layer (No hibernation yet)

- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
  - `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using the existing `pkg/pubsub` bridge:
  - Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
  - Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs); a vector-clock sketch follows this list.
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.

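A minimal sketch of the vector-clock rules referenced in the consensus bullet above, assuming a simple map keyed by peer ID; the concrete types in `pkg/rqlite/vector_clock.go` may differ:

```go
package rqlite

// VectorClock maps a peer ID to the number of updates observed from that peer.
type VectorClock map[string]uint64

// Increment records one more local update for the given node.
func (vc VectorClock) Increment(nodeID string) { vc[nodeID]++ }

// Merge keeps the element-wise maximum of both clocks (the usual CRDT-style merge rule).
func (vc VectorClock) Merge(other VectorClock) {
	for node, n := range other {
		if n > vc[node] {
			vc[node] = n
		}
	}
}

// Compare returns -1 if vc happened before other, 1 if it happened after,
// and 0 if the clocks are concurrent or identical.
func (vc VectorClock) Compare(other VectorClock) int {
	var less, greater bool
	for node, n := range vc {
		if n < other[node] {
			less = true
		} else if n > other[node] {
			greater = true
		}
	}
	for node, n := range other {
		if _, seen := vc[node]; !seen && n > 0 {
			less = true
		}
	}
	switch {
	case less && !greater:
		return -1
	case greater && !less:
		return 1
	default:
		return 0
	}
}
```
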
### Phase 2: Database Creation & Client API

- Port management:
  - `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping (bind-probing is sketched after this list).
- Subprocess control:
  - `RQLiteInstance` lifecycle (start, wait ready via /status and a simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, fans in readiness, and surfaces failures.
- Client API:
  - Update `pkg/client/interface.go` to include `Database(name string)`.
  - Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
  - Backoff polling for readiness during creation.
- Data isolation:
  - Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
- Integration tests: create single db across 3 nodes; multiple databases coexisting; cross-node read/write.

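A minimal sketch of the bind-probing idea behind `PortManager`, assuming allocation simply retries random ports within the configured range until one binds; the real manager also pairs HTTP/Raft ports and keeps local bookkeeping:

```go
package dbcluster

import (
	"fmt"
	"math/rand"
	"net"
)

// probePort reports whether a TCP port can currently be bound on localhost.
func probePort(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = l.Close()
	return true
}

// allocateInRange picks a random free port in [start, end], or fails after a few attempts.
func allocateInRange(start, end int) (int, error) {
	for attempt := 0; attempt < 50; attempt++ {
		candidate := start + rand.Intn(end-start+1)
		if probePort(candidate) {
			return candidate, nil
		}
	}
	return 0, fmt.Errorf("no free port found in range %d-%d", start, end)
}
```
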
### Phase 3: Hibernation & Wake-Up

- Idle detection and coordination:
  - Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule.
- Hibernation protocol:
  - Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
  - Client detects `hibernating`, performs CAS to `waking`, triggers wake request; port reuse if available, else re-negotiate; start instances; status → `active`.
- Client retry UX:
  - Transparent retries with exponential backoff; treat `waking` as a wait-only state (sketched after this list).
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.

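A minimal sketch of the transparent retry behavior, assuming a hypothetical `queryOnce` callback and an `ErrDatabaseWaking` sentinel that the client returns while the cluster is in the `waking` state:

```go
package client

import (
	"context"
	"errors"
	"time"
)

// ErrDatabaseWaking is a hypothetical sentinel returned while a database is waking up.
var ErrDatabaseWaking = errors.New("database is waking")

// queryWithBackoff retries a query with exponential backoff while the database wakes up.
func queryWithBackoff(ctx context.Context, queryOnce func(context.Context) error) error {
	delay := 250 * time.Millisecond
	for {
		err := queryOnce(ctx)
		if err == nil || !errors.Is(err, ErrDatabaseWaking) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay < 4*time.Second {
			delay *= 2 // cap the backoff so it stays below the < 8s wake-up target
		}
	}
}
```
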
### Phase 4: Resilience (Failure & Replacement)

- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
  - Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, selection is confirmed, and the new node joins raft via `-join`, then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.

### Phase 5: Production Hardening & Optimization

- Metrics/logging:
  - Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1, 3, 5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.

### File Changes (Essentials)

- `pkg/config/config.go`
  - Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
  - Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval` (a possible Go shape follows this list).
- `pkg/client/interface.go` / `pkg/client/client.go`
  - Add `Database(name string)` and the app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
  - Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
  - Refactor from the singleton to instance-oriented helpers.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect the new `database` block.

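A minimal Go sketch of the new `database` config fields listed above; the struct names, YAML tags, and placement are assumptions, not the final `pkg/config/config.go` shape:

```go
package config

import "time"

// PortRange bounds the dynamically allocated HTTP and Raft ports.
type PortRange struct {
	HTTPStart int `yaml:"http_start"`
	HTTPEnd   int `yaml:"http_end"`
	RaftStart int `yaml:"raft_start"`
	RaftEnd   int `yaml:"raft_end"`
}

// DatabaseConfig replaces the legacy single-cluster rqlite settings.
type DatabaseConfig struct {
	ReplicationFactor  int           `yaml:"replication_factor"`
	HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
	MaxDatabases       int           `yaml:"max_databases"`
	PortRange          PortRange     `yaml:"port_range"`
}

// DiscoveryConfig gains the health-check interval used for failure detection.
type DiscoveryConfig struct {
	HealthCheckInterval time.Duration `yaml:"health_check_interval"`
}
```
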
### Config Example (target end-state)

```yaml
node:
  data_dir: "./data"

database:
  replication_factor: 3
  hibernation_timeout: 60s
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999

discovery:
  health_check_interval: 10s
```

### Rollout Strategy

- Keep feature flag off by default; support legacy single-cluster path.
- Ship Phase 1 behind flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
- Remove legacy config after deprecation window.

### Testing & Quality Gates

- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.

### Risks & Mitigations (operationalized)

- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.

### Timeline (indicative)

- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5

### To-dos

- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides

@@ -1,504 +0,0 @@

# Dynamic Database Clustering - User Guide

## Overview

Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as a separate 3-node cluster with its own data directory and port allocation.

## Key Features

✅ **Multi-Database Support** - Create unlimited isolated databases on-demand
✅ **3-Node Replication** - Fault-tolerant by default (configurable)
✅ **Auto Hibernation** - Idle databases hibernate to save resources
✅ **Transparent Wake-Up** - Automatic restart on access
✅ **App Namespacing** - Databases are scoped by application name
✅ **Decentralized Metadata** - LibP2P pubsub-based coordination
✅ **Failure Recovery** - Automatic node replacement on failures
✅ **Resource Optimization** - Dynamic port allocation and data isolation

## Configuration

### Node Configuration (`configs/node.yaml`)

```yaml
node:
  data_dir: "./data"
  listen_addresses:
    - "/ip4/0.0.0.0/tcp/4001"
  max_connections: 50

database:
  replication_factor: 3        # Number of replicas per database
  hibernation_timeout: 60s     # Idle time before hibernation
  max_databases: 100           # Max databases per node
  port_range_http_start: 5001  # HTTP port range start
  port_range_http_end: 5999    # HTTP port range end
  port_range_raft_start: 7001  # Raft port range start
  port_range_raft_end: 7999    # Raft port range end

discovery:
  bootstrap_peers:
    - "/ip4/127.0.0.1/tcp/4001/p2p/..."
  discovery_interval: 30s
  health_check_interval: 10s
```

### Key Configuration Options

#### `database.replication_factor` (default: 3)

Number of nodes that will host each database cluster. Minimum 1, recommended 3 for fault tolerance.

#### `database.hibernation_timeout` (default: 60s)

Time of inactivity before a database is hibernated. Set to 0 to disable hibernation.

#### `database.max_databases` (default: 100)

Maximum number of databases this node can host simultaneously.

#### `database.port_range_*`

Port ranges for dynamic allocation. Ensure ranges are large enough for `max_databases * 2` ports (HTTP + Raft per database).

## Client Usage

### Creating/Accessing Databases

```go
package main

import (
	"context"

	"github.com/DeBrosOfficial/network/pkg/client"
)

func main() {
	// Create client with app name for namespacing
	cfg := client.DefaultClientConfig("myapp")
	cfg.BootstrapPeers = []string{
		"/ip4/127.0.0.1/tcp/4001/p2p/...",
	}

	c, err := client.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	// Connect to network
	if err := c.Connect(); err != nil {
		panic(err)
	}
	defer c.Disconnect()

	// Get database client (creates database if it doesn't exist)
	db, err := c.Database().Database("users")
	if err != nil {
		panic(err)
	}

	// Use the database
	ctx := context.Background()
	err = db.CreateTable(ctx, `
		CREATE TABLE users (
			id INTEGER PRIMARY KEY,
			name TEXT NOT NULL,
			email TEXT UNIQUE
		)
	`)

	// Query data
	result, err := db.Query(ctx, "SELECT * FROM users")
	// ...
}
```

### Database Naming

Databases are automatically namespaced by your application name:

- `client.Database("users")` → creates `myapp_users` internally
- This prevents name collisions between different applications

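A minimal sketch of how this namespacing could be derived, using a hypothetical `namespacedName` helper; the actual sanitization rules live in `pkg/client/client.go` and may differ:

```go
package client

import (
	"fmt"
	"regexp"
	"strings"
)

// unsafeChars matches anything outside the characters assumed to be allowed in database names.
var unsafeChars = regexp.MustCompile(`[^a-z0-9_]+`)

// sanitize lowercases the input and replaces disallowed characters with underscores.
func sanitize(s string) string {
	return unsafeChars.ReplaceAllString(strings.ToLower(s), "_")
}

// namespacedName combines the app name and database name, e.g. ("myapp", "users") -> "myapp_users".
func namespacedName(appName, dbName string) string {
	return fmt.Sprintf("%s_%s", sanitize(appName), sanitize(dbName))
}
```
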
## Gateway API Usage

If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints:

### Base URL

```
http://gateway-host:8080/v1/database/
```

### Execute SQL (INSERT, UPDATE, DELETE, DDL)

```bash
POST /v1/database/exec
Content-Type: application/json

{
  "database": "users",
  "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
  "args": ["Alice", "alice@example.com"]
}

Response:
{
  "rows_affected": 1,
  "last_insert_id": 1
}
```

### Query Data (SELECT)

```bash
POST /v1/database/query
Content-Type: application/json

{
  "database": "users",
  "sql": "SELECT * FROM users WHERE name LIKE ?",
  "args": ["A%"]
}

Response:
{
  "items": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"}
  ],
  "count": 1
}
```

### Execute Transaction

```bash
POST /v1/database/transaction
Content-Type: application/json

{
  "database": "users",
  "queries": [
    "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')",
    "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'"
  ]
}

Response:
{
  "success": true
}
```

### Get Schema

```bash
GET /v1/database/schema?database=users

# OR

POST /v1/database/schema
Content-Type: application/json

{
  "database": "users"
}

Response:
{
  "tables": [
    {
      "name": "users",
      "columns": ["id", "name", "email"]
    }
  ]
}
```

### Create Table

```bash
POST /v1/database/create-table
Content-Type: application/json

{
  "database": "users",
  "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}

Response:
{
  "rows_affected": 0
}
```

### Drop Table

```bash
POST /v1/database/drop-table
Content-Type: application/json

{
  "database": "users",
  "table_name": "old_table"
}

Response:
{
  "rows_affected": 0
}
```

### List Databases

```bash
GET /v1/database/list

Response:
{
  "databases": ["users", "products", "orders"]
}
```

### Important Notes

1. **Authentication Required**: All endpoints require authentication (JWT or API key)
2. **Database Creation**: Databases are created automatically on first access
3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on first query to a hibernating database
4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s
5. **Namespacing**: Database names are automatically prefixed with your app name
6. **Concurrent Access**: All endpoints are safe for concurrent use

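For programmatic access, the same endpoints can be called from Go with the standard library. A minimal sketch of the exec call above; the gateway host and the bearer-token header follow the examples in this guide and are otherwise assumptions:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Request body matching the /v1/database/exec example above.
	body, _ := json.Marshal(map[string]any{
		"database": "users",
		"sql":      "INSERT INTO users (name, email) VALUES (?, ?)",
		"args":     []any{"Alice", "alice@example.com"},
	})

	req, err := http.NewRequest(http.MethodPost, "http://gateway-host:8080/v1/database/exec", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <your-token>") // JWT or API key, as required by the gateway

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```
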
## Database Lifecycle

### 1. Creation

When you first access a database:

1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST`
2. **Node Selection** - Eligible nodes respond with available ports
3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen
4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM`
5. **Instance Startup** - Selected nodes start rqlite subprocesses
6. **Readiness** - Nodes report `active` status when ready

**Typical creation time: < 10 seconds**

### 2. Active State

- Database instances run as rqlite subprocesses
- Each instance tracks `LastQuery` timestamp
- Queries update the activity timestamp
- Metadata synced across all network nodes

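A minimal sketch of the activity tracking described above, assuming each instance records its last query time and is considered idle once the hibernation timeout has elapsed; the field and method names are illustrative:

```go
package dbcluster

import (
	"sync"
	"time"
)

// activityTracker records the last query time for one database instance.
type activityTracker struct {
	mu        sync.Mutex
	lastQuery time.Time
}

// Touch is called on every query to refresh the activity timestamp.
func (a *activityTracker) Touch() {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastQuery = time.Now()
}

// IsIdle reports whether no query has been seen for at least the hibernation timeout.
func (a *activityTracker) IsIdle(hibernationTimeout time.Duration) bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	return time.Since(a.lastQuery) >= hibernationTimeout
}
```
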
### 3. Hibernation

After `hibernation_timeout` of inactivity:

1. **Idle Detection** - Nodes detect idle databases
2. **Idle Notification** - Nodes broadcast idle status
3. **Coordinated Shutdown** - When all nodes report idle, coordinator schedules shutdown
4. **Graceful Stop** - SIGTERM sent to rqlite processes
5. **Port Release** - Ports freed for reuse
6. **Status Update** - Metadata updated to `hibernating`

**Data persists on disk during hibernation**

### 4. Wake-Up

On first query to a hibernating database:

1. **Detection** - Client/node detects `hibernating` status
2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST`
3. **Port Allocation** - Reuse original ports or allocate new ones
4. **Instance Restart** - Restart rqlite with existing data
5. **Status Update** - Update to `active` when ready

**Typical wake-up time: < 8 seconds**

### 5. Failure Recovery

When a node fails:

1. **Health Detection** - Missed health checks trigger failure detection
2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED`
3. **Offers** - Healthy nodes with capacity offer to replace
4. **Selection** - First offer accepted (simple approach)
5. **Join Cluster** - New node joins existing Raft cluster
6. **Sync** - Data synced from existing members

## Data Management

### Data Directories

Each database gets its own data directory:

```
./data/
├── myapp_users/       # Database: users
│   └── rqlite/
│       ├── db.sqlite
│       └── raft/
├── myapp_products/    # Database: products
│   └── rqlite/
└── myapp_orders/      # Database: orders
    └── rqlite/
```

### Orphaned Data Cleanup

On node startup, the system automatically:

- Scans data directories
- Checks against metadata
- Removes directories for:
  - Non-existent databases
  - Databases where this node is not a member

## Monitoring & Debugging

### Structured Logging

All operations are logged with structured fields:

```
INFO Starting cluster manager node_id=12D3... max_databases=100
INFO Received database create request database=myapp_users requester=12D3...
INFO Database instance started database=myapp_users http_port=5001 raft_port=7001
INFO Database is idle database=myapp_users idle_time=62s
INFO Database hibernated successfully database=myapp_users
INFO Received wakeup request database=myapp_users
INFO Database woke up successfully database=myapp_users
```

### Health Checks

Nodes perform periodic health checks:

- Every `health_check_interval` (default: 10s)
- Tracks last-seen time for each peer
- 3 missed checks → node marked unhealthy
- Triggers replacement protocol for affected databases

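A minimal sketch of the three-missed-checks rule above, assuming a tracker that is fed each peer's last-seen time once per check interval; the threshold and names are illustrative:

```go
package dbcluster

import "time"

// healthTracker marks a peer unhealthy after a number of consecutive missed checks.
type healthTracker struct {
	interval     time.Duration
	maxMissed    int
	missedChecks map[string]int
}

func newHealthTracker(interval time.Duration, maxMissed int) *healthTracker {
	return &healthTracker{interval: interval, maxMissed: maxMissed, missedChecks: make(map[string]int)}
}

// observe records one health-check round for a peer and reports whether it is still healthy.
func (h *healthTracker) observe(peerID string, lastSeen, now time.Time) bool {
	if now.Sub(lastSeen) > h.interval {
		h.missedChecks[peerID]++ // peer did not respond within this interval
	} else {
		h.missedChecks[peerID] = 0 // any successful check resets the counter
	}
	return h.missedChecks[peerID] < h.maxMissed
}
```
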
## Best Practices

### 1. **Capacity Planning**

```yaml
# For 100 databases with 3-node replication:
database:
  max_databases: 100
  port_range_http_start: 5001
  port_range_http_end: 5200    # 200 ports (100 databases * 2)
  port_range_raft_start: 7001
  port_range_raft_end: 7200
```

### 2. **Hibernation Tuning**

- **High Traffic**: Set `hibernation_timeout: 300s` or higher
- **Development**: Set `hibernation_timeout: 30s` for faster cycles
- **Always-On DBs**: Set `hibernation_timeout: 0` to disable

### 3. **Replication Factor**

- **Development**: `replication_factor: 1` (single node, no replication)
- **Production**: `replication_factor: 3` (fault tolerant)
- **High Availability**: `replication_factor: 5` (survives 2 failures)

### 4. **Network Topology**

- Use at least 3 nodes for `replication_factor: 3`
- Ensure `max_databases * replication_factor <= total_cluster_capacity`
- Example: 3 nodes × 100 max_databases = 300 database instances total

## Troubleshooting

### Database Creation Fails

**Problem**: `insufficient nodes responded: got 1, need 3`

**Solution**:

- Ensure you have at least `replication_factor` nodes online
- Check `max_databases` limit on nodes
- Verify port ranges aren't exhausted

### Database Not Waking Up

**Problem**: Database stays in `waking` status

**Solution**:

- Check node logs for rqlite startup errors
- Verify rqlite binary is installed
- Check port conflicts (use different port ranges)
- Ensure data directory is accessible

### Orphaned Data

**Problem**: Disk space consumed by old databases

**Solution**:

- Orphaned data is automatically cleaned on node restart
- Manual cleanup: Delete directories from `./data/` that don't match metadata
- Check logs for reconciliation results

### Node Replacement Not Working

**Problem**: Failed node not replaced

**Solution**:

- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`)
- Check network connectivity between nodes
- Verify health check interval is reasonable (not too aggressive)

## Advanced Topics

### Metadata Consistency

- **Vector Clocks**: Each metadata update includes a vector clock for conflict resolution
- **Gossip Protocol**: Periodic metadata sync via checksums
- **Eventual Consistency**: All nodes eventually agree on database state

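A minimal sketch of the checksum comparison behind the gossip step above, assuming each node hashes its database name → version pairs in a deterministic order and only exchanges diffs when the digests differ; the hash choice and message flow are assumptions:

```go
package rqlite

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// metadataChecksum produces a deterministic digest of database name -> version pairs.
// Two nodes with the same databases and versions compute the same checksum.
func metadataChecksum(versions map[string]uint64) string {
	names := make([]string, 0, len(versions))
	for name := range versions {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic ordering is what makes the digests comparable

	h := sha256.New()
	for _, name := range names {
		fmt.Fprintf(h, "%s=%d;", name, versions[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

// needsSync reports whether a full diff exchange is required with a peer.
func needsSync(local, remote map[string]uint64) bool {
	return metadataChecksum(local) != metadataChecksum(remote)
}
```
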
### Port Management

- Ports allocated randomly within configured ranges
- Bind-probing ensures ports are actually available
- Ports reused during wake-up when possible
- Failed allocations fall back to new random ports

### Coordinator Election

- Deterministic selection based on lexicographical peer ID ordering
- Lowest peer ID becomes coordinator
- No persistent coordinator state
- Re-election occurs for each database operation

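A minimal sketch of this deterministic election, assuming peer IDs are compared as plain strings (which is what lexicographical ordering implies here):

```go
package rqlite

import (
	"errors"
	"sort"
)

// electCoordinator deterministically picks the lowest peer ID from the candidate set.
// Every node that sees the same candidates elects the same coordinator, with no extra messages.
func electCoordinator(peerIDs []string) (string, error) {
	if len(peerIDs) == 0 {
		return "", errors.New("no candidate peers")
	}
	sorted := append([]string(nil), peerIDs...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	return sorted[0], nil
}
```
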
## Migration from Legacy Mode

If upgrading from single-cluster rqlite:

1. **Backup Data**: Back up your existing `./data/rqlite` directory
2. **Update Config**: Remove deprecated fields:
   - `database.data_dir`
   - `database.rqlite_port`
   - `database.rqlite_raft_port`
   - `database.rqlite_join_address`
3. **Add New Fields**: Configure dynamic clustering (see the Configuration section)
4. **Restart Nodes**: Restart all nodes with the new configuration
5. **Migrate Data**: Create a new database and import data from the backup

## Future Enhancements

The following features are planned for future releases:

### **Advanced Metrics** (Future)

- Prometheus-style metrics export
- Per-database query counters
- Hibernation/wake-up latency histograms
- Resource utilization gauges

### **Performance Benchmarks** (Future)

- Automated benchmark suite
- Creation time SLOs
- Wake-up latency targets
- Query overhead measurements

### **Enhanced Monitoring** (Future)

- Dashboard for cluster visualization
- Database status API endpoint
- Capacity planning tools
- Alerting integration

## Support

For issues, questions, or contributions:

- GitHub Issues: https://github.com/DeBrosOfficial/network/issues
- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md

## License

See LICENSE file for details.

TESTING_GUIDE.md
@@ -1,827 +0,0 @@

# Dynamic Database Clustering - Testing Guide

This guide provides a comprehensive list of unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.

## Unit Tests

### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)

```go
// Test cases to implement:

func TestMetadataStore_GetSetDatabase(t *testing.T)
  - Create store
  - Set database metadata
  - Get database metadata
  - Verify data matches

func TestMetadataStore_DeleteDatabase(t *testing.T)
  - Set database metadata
  - Delete database
  - Verify Get returns nil

func TestMetadataStore_ListDatabases(t *testing.T)
  - Add multiple databases
  - List all databases
  - Verify count and contents

func TestMetadataStore_ConcurrentAccess(t *testing.T)
  - Spawn multiple goroutines
  - Concurrent reads and writes
  - Verify no race conditions (run with -race)

func TestMetadataStore_NodeCapacity(t *testing.T)
  - Set node capacity
  - Get node capacity
  - Update capacity
  - List nodes
```

### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)

```go
func TestVectorClock_Increment(t *testing.T)
  - Create empty vector clock
  - Increment for node A
  - Verify counter is 1
  - Increment again
  - Verify counter is 2

func TestVectorClock_Merge(t *testing.T)
  - Create two vector clocks with different nodes
  - Merge them
  - Verify max values are preserved

func TestVectorClock_Compare(t *testing.T)
  - Test strictly less than case
  - Test strictly greater than case
  - Test concurrent case
  - Test identical case

func TestVectorClock_Concurrent(t *testing.T)
  - Create clocks with overlapping updates
  - Verify Compare returns 0 (concurrent)
```

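As a concrete starting point, a minimal sketch of the merge case above, assuming a map-based `VectorClock` whose `Merge` keeps element-wise maxima; the real type in `pkg/rqlite` may differ:

```go
package rqlite

import "testing"

func TestVectorClock_Merge_KeepsMaxValues(t *testing.T) {
	a := VectorClock{"nodeA": 3, "nodeB": 1}
	b := VectorClock{"nodeB": 5, "nodeC": 2}

	a.Merge(b)

	// After merging, each entry should hold the element-wise maximum.
	want := VectorClock{"nodeA": 3, "nodeB": 5, "nodeC": 2}
	for node, n := range want {
		if a[node] != n {
			t.Fatalf("merged clock: got %s=%d, want %d", node, a[node], n)
		}
	}
}
```
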
### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)

```go
func TestElectCoordinator_SingleNode(t *testing.T)
  - Pass single node ID
  - Verify it's elected

func TestElectCoordinator_MultipleNodes(t *testing.T)
  - Pass multiple node IDs
  - Verify lowest lexicographical ID wins
  - Verify deterministic (same input = same output)

func TestElectCoordinator_EmptyList(t *testing.T)
  - Pass empty list
  - Verify error returned

func TestElectCoordinator_Deterministic(t *testing.T)
  - Run election multiple times with same inputs
  - Verify same coordinator each time
```

### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)

```go
func TestPortManager_AllocatePortPair(t *testing.T)
  - Create manager with port range
  - Allocate port pair
  - Verify HTTP and Raft ports different
  - Verify ports within range

func TestPortManager_ReleasePortPair(t *testing.T)
  - Allocate port pair
  - Release ports
  - Verify ports can be reallocated

func TestPortManager_Exhaustion(t *testing.T)
  - Allocate all available ports
  - Attempt one more allocation
  - Verify error returned

func TestPortManager_IsPortAllocated(t *testing.T)
  - Allocate ports
  - Check IsPortAllocated returns true
  - Release ports
  - Check IsPortAllocated returns false

func TestPortManager_AllocateSpecificPorts(t *testing.T)
  - Allocate specific ports
  - Verify allocation succeeds
  - Attempt to allocate same ports again
  - Verify error returned
```

### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)

```go
func TestRQLiteInstance_Create(t *testing.T)
  - Create instance configuration
  - Verify fields set correctly

func TestRQLiteInstance_IsIdle(t *testing.T)
  - Set LastQuery to old timestamp
  - Verify IsIdle returns true
  - Update LastQuery
  - Verify IsIdle returns false

// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
  - Create instance
  - Start instance
  - Verify HTTP endpoint responsive
  - Stop instance
  - Verify process terminated
```

### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)

```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
  - Create each message type
  - Marshal to bytes
  - Unmarshal back
  - Verify data preserved

func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```

### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)

```go
func TestCreateCoordinator_AddResponse(t *testing.T)
  - Create coordinator
  - Add responses
  - Verify response count

func TestCreateCoordinator_SelectNodes(t *testing.T)
  - Add more responses than needed
  - Call SelectNodes
  - Verify correct number selected
  - Verify deterministic selection

func TestCreateCoordinator_WaitForResponses(t *testing.T)
  - Create coordinator
  - Wait in goroutine
  - Add responses from another goroutine
  - Verify wait completes when enough responses

func TestCoordinatorRegistry(t *testing.T)
  - Register coordinator
  - Get coordinator
  - Remove coordinator
  - Verify lifecycle
```

## Integration Tests

### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)

```go
func TestSingleNodeDatabaseCreation(t *testing.T)
  - Start 1 node
  - Set replication_factor = 1
  - Create database
  - Verify database active
  - Write data
  - Read data back
  - Verify data matches
```

### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)

```go
func TestThreeNodeDatabaseCreation(t *testing.T)
  - Start 3 nodes
  - Set replication_factor = 3
  - Create database from node 1
  - Wait for all nodes to report active
  - Write data to node 1
  - Read from node 2
  - Verify replication worked
```

### 3. Multiple Databases (`e2e/multiple_databases_test.go`)

```go
func TestMultipleDatabases(t *testing.T)
  - Start 3 nodes
  - Create database "users"
  - Create database "products"
  - Create database "orders"
  - Verify all databases active
  - Write to each database
  - Verify data isolation
```

### 4. Hibernation Cycle (`e2e/hibernation_test.go`)

```go
func TestHibernationCycle(t *testing.T)
  - Start 3 nodes with hibernation_timeout=5s
  - Create database
  - Write initial data
  - Wait 10 seconds (no activity)
  - Verify status = hibernating
  - Verify processes stopped
  - Verify data persisted on disk

func TestWakeUpCycle(t *testing.T)
  - Create and hibernate database
  - Issue query
  - Wait for wake-up
  - Verify status = active
  - Verify data still accessible
  - Verify LastQuery updated
```

### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)

```go
func TestNodeFailureDetection(t *testing.T)
  - Start 3 nodes
  - Create database
  - Kill one node (SIGKILL)
  - Wait for health checks to detect failure
  - Verify NODE_REPLACEMENT_NEEDED broadcast

func TestNodeReplacement(t *testing.T)
  - Start 4 nodes
  - Create database on nodes 1,2,3
  - Kill node 3
  - Wait for replacement
  - Verify node 4 joins cluster
  - Verify data accessible from node 4
```

### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)

```go
func TestOrphanedDataCleanup(t *testing.T)
  - Start node
  - Manually create orphaned data directory
  - Restart node
  - Verify orphaned directory removed
  - Check logs for reconciliation message
```

### 7. Concurrent Operations (`e2e/concurrent_test.go`)

```go
func TestConcurrentDatabaseCreation(t *testing.T)
  - Start 5 nodes
  - Create 10 databases concurrently
  - Verify all successful
  - Verify no port conflicts
  - Verify proper distribution

func TestConcurrentHibernation(t *testing.T)
  - Create multiple databases
  - Let all go idle
  - Verify all hibernate correctly
  - No race conditions
```

## Manual Test Scenarios

### Test 1: Basic Flow - Three Node Cluster

**Setup:**

```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001

# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002

# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```

**Test Steps:**

1. **Create Database**

   ```bash
   # Use client or API to create database "testdb"
   ```

2. **Verify Creation**
   - Check logs on all 3 nodes for "Database instance started"
   - Verify `./data/*/testdb/` directories exist on all nodes
   - Check different ports allocated on each node

3. **Write Data**

   ```sql
   CREATE TABLE users (id INT, name TEXT);
   INSERT INTO users VALUES (1, 'Alice');
   INSERT INTO users VALUES (2, 'Bob');
   ```

4. **Verify Replication**
   - Query from each node
   - Verify same data returned

**Expected Results:**

- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node

---

### Test 2: Hibernation and Wake-Up

**Setup:** Same as Test 1, with the database created

**Test Steps:**

1. **Check Activity**

   ```bash
   # In logs, verify "last_query" timestamps updating on queries
   ```

2. **Wait for Hibernation**
   - Stop issuing queries
   - Wait `hibernation_timeout` + 10s
   - Check logs for "Database is idle"
   - Verify "Coordinated shutdown message sent"
   - Verify "Database hibernated successfully"

3. **Verify Hibernation**

   ```bash
   # Check that rqlite processes are stopped
   ps aux | grep rqlite

   # Verify data directories still exist
   ls -la data/*/testdb/
   ```

4. **Wake Up**
   - Issue a query to the database
   - Watch logs for "Received wakeup request"
   - Verify "Database woke up successfully"
   - Verify query succeeds

**Expected Results:**

- Hibernation happens after the idle timeout
- All 3 nodes hibernate in a coordinated fashion
- Wake-up completes in < 8 seconds
- Data persists across the hibernation cycle

---

### Test 3: Multiple Databases

**Setup:** 3 nodes running

**Test Steps:**

1. **Create Multiple Databases**

   ```
   Create: users_db
   Create: products_db
   Create: orders_db
   ```

2. **Verify Isolation**
   - Insert data in users_db
   - Verify data NOT in products_db
   - Verify data NOT in orders_db

3. **Check Port Allocation**

   ```bash
   # Verify different ports for each database
   netstat -tlnp | grep rqlite
   # OR
   ss -tlnp | grep rqlite
   ```

4. **Verify Data Directories**

   ```bash
   tree data/bootstrap/
   # Should show:
   # ├── users_db/
   # ├── products_db/
   # └── orders_db/
   ```

**Expected Results:**

- 3 separate database clusters
- Each with 3 nodes (9 total instances)
- Complete data isolation
- Unique port pairs for each instance

---

### Test 4: Node Failure and Recovery

**Setup:** 4 nodes running, database created on nodes 1-3

**Test Steps:**

1. **Verify Initial State**
   - Database active on nodes 1, 2, 3
   - Node 4 idle

2. **Simulate Failure**

   ```bash
   # Kill node 3 (SIGKILL for unclean shutdown)
   kill -9 <node3_pid>
   ```

3. **Watch for Detection**
   - Check logs on nodes 1 and 2
   - Wait for health check failures (3 missed pings)
   - Verify "Node detected as unhealthy" messages

4. **Watch for Replacement**
   - Check for "NODE_REPLACEMENT_NEEDED" broadcast
   - Node 4 should offer to replace
   - Verify "Starting as replacement node" on node 4
   - Verify node 4 joins Raft cluster

5. **Verify Data Integrity**
   - Query database from node 4
   - Verify all data present
   - Insert new data from node 4
   - Verify replication to nodes 1 and 2

**Expected Results:**

- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from new node
- No data loss

---

### Test 5: Port Exhaustion

**Setup:** 1 node with small port range

**Configuration:**

```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```

**Test Steps:**

1. **Create Databases**
   - Create database 1 (succeeds - uses 2 ports)
   - Create database 2 (succeeds - uses 2 ports)
   - Create database 3 (fails - only 1 port left)

2. **Verify Error**
   - Check logs for "Cannot allocate ports"
   - Verify error returned to client

3. **Free Ports**
   - Hibernate or delete database 1
   - Ports should be freed

4. **Retry**
   - Create database 3 again
   - Should succeed now

**Expected Results:**

- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled

---

### Test 6: Orphaned Data Cleanup

**Setup:** 1 node stopped

**Test Steps:**

1. **Create Orphaned Data**

   ```bash
   # While node is stopped
   mkdir -p data/bootstrap/orphaned_db/rqlite
   echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
   ```

2. **Start Node**

   ```bash
   ./bin/node --data bootstrap --id bootstrap
   ```

3. **Check Reconciliation**
   - Watch logs for "Starting orphaned data reconciliation"
   - Verify "Found orphaned database directory"
   - Verify "Removed orphaned database directory"

4. **Verify Cleanup**

   ```bash
   ls data/bootstrap/
   # orphaned_db should be gone
   ```

**Expected Results:**

- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged

---

### Test 7: Stress Test - Many Databases

**Setup:** 5 nodes with high capacity

**Configuration:**

```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```

**Test Steps:**

1. **Create Many Databases**

   ```
   Loop: Create databases db_1 through db_25
   ```

2. **Verify Distribution**
   - Check logs for node capacity announcements
   - Verify databases distributed across nodes
   - No single node overloaded

3. **Concurrent Operations**
   - Write to multiple databases simultaneously
   - Read from multiple databases
   - Verify no conflicts

4. **Hibernation Wave**
   - Stop all activity
   - Wait for hibernation
   - Verify all databases hibernate
   - Check resource usage drops

5. **Wake-Up Storm**
   - Query all 25 databases at once
   - Verify all wake up successfully
   - Check for thundering herd issues

**Expected Results:**

- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up

---

### Test 8: Gateway API Access

**Setup:** Gateway running with 3 nodes

**Test Steps:**

1. **Authenticate**

   ```bash
   # Get JWT token
   TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
     -H "Content-Type: application/json" \
     -d '{"wallet": "..."}' | jq -r .token)
   ```

2. **Create Table**

   ```bash
   curl -X POST http://localhost:8080/v1/database/create-table \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
     }'
   ```

3. **Insert Data**

   ```bash
   curl -X POST http://localhost:8080/v1/database/exec \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
       "args": ["Alice", "alice@example.com"]
     }'
   ```

4. **Query Data**

   ```bash
   curl -X POST http://localhost:8080/v1/database/query \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "SELECT * FROM users"
     }'
   ```

5. **Test Transaction**

   ```bash
   curl -X POST http://localhost:8080/v1/database/transaction \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "queries": [
         "INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
         "INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
       ]
     }'
   ```

6. **Get Schema**

   ```bash
   curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
     -H "Authorization: Bearer $TOKEN"
   ```

7. **Test Hibernation**
   - Wait for hibernation timeout
   - Query again and measure wake-up time
   - Should see delay on first query after hibernation

**Expected Results:**

- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects created tables
- Hibernation/wake-up transparent to API
- Response times reasonable (< 30s for queries)

---

## Test Checklist

### Unit Tests (To Implement)

- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry

### Integration Tests (To Implement)

- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation

### Manual Tests (To Perform)

- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases

### Performance Validation

- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional

## Running Tests

### Unit Tests

```bash
# Run all tests
go test ./pkg/rqlite/... -v

# Run with race detector
go test ./pkg/rqlite/... -race

# Run specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v

# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```

### Integration Tests

```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m

# Run specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```

### Manual Tests

Follow the scenarios above in dedicated terminals for each node.

## Success Criteria

### Correctness

✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected

### Performance

✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms

### Reliability

✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles stable
✅ Concurrent operations safe

## Notes for Future Test Enhancements

When implementing advanced metrics and benchmarks:

1. **Prometheus Metrics Tests**
   - Verify metric export
   - Validate metric values
   - Test metric reset on restart

2. **Benchmark Suite**
   - Automated performance regression detection
   - Latency percentile tracking (p50, p95, p99)
   - Throughput measurements
   - Resource usage profiling

3. **Chaos Engineering**
   - Random node kills
   - Network partitions
   - Clock skew simulation
   - Disk full scenarios

4. **Long-Running Stability**
   - 24-hour soak test
   - Memory leak detection
   - Slow-growing resource usage

## Debugging Failed Tests

### Common Issues

**Port Conflicts**

```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999

# Kill stale processes
pkill rqlited
```

**Stale Data**

```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```

**Timing Issues**

- Increase timeouts in flaky tests
- Add retry logic with exponential backoff (a small wait-for helper is sketched after this list)
- Use proper synchronization primitives

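A minimal sketch of such a retry helper, assuming tests poll a condition until a deadline instead of sleeping for fixed durations; the name and signature are illustrative:

```go
package e2e

import (
	"testing"
	"time"
)

// waitFor polls cond with exponential backoff until it returns true or the timeout elapses.
func waitFor(t *testing.T, timeout time.Duration, cond func() bool) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	delay := 100 * time.Millisecond
	for time.Now().Before(deadline) {
		if cond() {
			return
		}
		time.Sleep(delay)
		if delay < 2*time.Second {
			delay *= 2
		}
	}
	t.Fatalf("condition not met within %s", timeout)
}
```
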
**Race Conditions**

```bash
# Always run with race detector during development
go test -race ./...
```
