Remove obsolete documentation files for Dynamic Database Clustering and Testing Guide

- Deleted the DYNAMIC_CLUSTERING_GUIDE.md and TESTING_GUIDE.md files, as they are no longer relevant to the current implementation.
- Removed the dynamic implementation plan file to streamline project documentation and focus on updated resources.
This commit is contained in: parent dd4cb832dc · commit 36002d342c
@@ -1,165 +0,0 @@
<!-- ec358e91-8e19-4fc8-a81e-cb388a4b2fc9 4c357d4a-bae7-4fe2-943d-84e5d3d3714c -->

# Dynamic Database Clustering — Implementation Plan

### Scope

Implement the feature described in `DYNAMIC_DATABASE_CLUSTERING.md`: decentralized metadata via libp2p pubsub, dynamic per-database rqlite clusters (3-node default), idle hibernation/wake-up, node failure replacement, and client UX that exposes `cli.Database(name)` with app namespacing.
### Guiding Principles

- Reuse existing `pkg/pubsub` and `pkg/rqlite` where practical; avoid singletons.
- Backward-compatible config migration with deprecations, feature-flag controlled rollout.
- Strong eventual consistency (vector clocks + periodic gossip) over centralized control planes.
- Tests and observability at each phase.

### Phase 0: Prep & Scaffolding

- Add feature flag `dynamic_db_clustering` (env/config) → default off.
- Introduce config shape for new `database` fields while supporting legacy fields (soft deprecated).
- Create empty packages and interfaces to enable incremental compilation:
  - `pkg/metadata/{types.go,manager.go,pubsub.go,consensus.go,vector_clock.go}`
  - `pkg/dbcluster/{manager.go,lifecycle.go,subprocess.go,ports.go,health.go,metrics.go}`
- Ensure rqlite subprocess availability (binary path detection, `scripts/install-debros-network.sh` update if needed).
- Establish CI jobs for new unit/integration suites and longer-running e2e.
### Phase 1: Metadata Layer (No hibernation yet)

- Implement metadata types and store (RW locks, versioning) inside `pkg/rqlite/metadata.go`:
  - `DatabaseMetadata`, `NodeCapacity`, `PortRange`, `MetadataStore`.
- Pubsub schema and handlers inside `pkg/rqlite/pubsub.go` using the existing `pkg/pubsub` bridge:
  - Topic `/debros/metadata/v1`; messages for create request/response/confirm, status, node capacity, health.
- Consensus helpers inside `pkg/rqlite/consensus.go` and `pkg/rqlite/vector_clock.go`:
  - Deterministic coordinator (lowest peer ID), vector clocks, merge rules, periodic full-state gossip (checksums + fetch diffs); see the vector-clock sketch after this list.
- Reuse existing node connectivity/backoff; no new ping service required.
- Skip unit tests for now; validate by wiring e2e flows later.
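To make the merge and compare rules concrete, here is a minimal vector-clock sketch in Go. It is illustrative only: the type and method names are assumptions, not the actual `pkg/rqlite/vector_clock.go` API.

```go
package rqlite

// VectorClock maps a node (peer) ID to a monotonically increasing counter.
// This is an illustrative sketch, not the real implementation.
type VectorClock map[string]uint64

// Increment bumps the counter for the given node.
func (vc VectorClock) Increment(nodeID string) {
	vc[nodeID]++
}

// Merge takes the element-wise maximum of two clocks (used when gossiping state).
func (vc VectorClock) Merge(other VectorClock) {
	for node, counter := range other {
		if counter > vc[node] {
			vc[node] = counter
		}
	}
}

// Compare returns -1 if vc happened-before other, 1 if it happened-after,
// and 0 if the clocks are equal or concurrent (conflicting updates).
func (vc VectorClock) Compare(other VectorClock) int {
	less, greater := false, false
	for node, counter := range vc {
		if counter < other[node] {
			less = true
		} else if counter > other[node] {
			greater = true
		}
	}
	for node, counter := range other {
		if _, ok := vc[node]; !ok && counter > 0 {
			less = true
		}
	}
	switch {
	case less && !greater:
		return -1
	case greater && !less:
		return 1
	default:
		return 0
	}
}
```

A `Compare` result of 0 for unequal clocks means the updates were concurrent; the gossip reconciliation then needs a deterministic tie-break (for example, the coordinator's view or peer-ID ordering).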
### Phase 2: Database Creation & Client API

- Port management:
  - `PortManager` with bind-probing, random allocation within configured ranges; local bookkeeping (a bind-probing sketch follows this list).
- Subprocess control:
  - `RQLiteInstance` lifecycle (start, wait ready via `/status` and a simple query, stop, status).
- Cluster manager:
  - `ClusterManager` keeps `activeClusters`, listens to metadata events, executes the creation protocol, readiness fan-in, failure surfaces.
- Client API:
  - Update `pkg/client/interface.go` to include `Database(name string)`.
  - Implement app namespacing in `pkg/client/client.go` (sanitize app name + db name).
  - Backoff polling for readiness during creation.
- Data isolation:
  - Data dir per db: `./data/<app>_<db>/rqlite` (respect node `data_dir` base).
- Integration tests: create a single db across 3 nodes; multiple databases coexisting; cross-node read/write.
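A minimal sketch of the bind-probing idea: try to bind a TCP listener on a candidate port and treat success as "available". The `tryBindPort`/`allocateInRange` names are illustrative assumptions, not the actual `PortManager` API.

```go
package dbcluster

import (
	"fmt"
	"math/rand"
	"net"
)

// tryBindPort reports whether the port can currently be bound on localhost.
// Binding and immediately closing is a cheap availability probe; a small race
// window remains until the real rqlite process binds the port.
func tryBindPort(port int) bool {
	l, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	_ = l.Close()
	return true
}

// allocateInRange picks a random free port in [start, end], retrying a bounded
// number of times before giving up.
func allocateInRange(start, end int) (int, error) {
	if end < start {
		return 0, fmt.Errorf("invalid port range %d-%d", start, end)
	}
	for attempt := 0; attempt < 50; attempt++ {
		candidate := start + rand.Intn(end-start+1)
		if tryBindPort(candidate) {
			return candidate, nil
		}
	}
	return 0, fmt.Errorf("no free port found in range %d-%d", start, end)
}
```

The real manager would additionally record allocations locally so that two databases created back-to-back on the same node cannot race for the same pair.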
### Phase 3: Hibernation & Wake-Up

- Idle detection and coordination:
  - Track `LastQuery` per instance; periodic scan; all-nodes-idle quorum → coordinated shutdown schedule (an idle-scan sketch follows this list).
- Hibernation protocol:
  - Broadcast idle notices, coordinator schedules `DATABASE_SHUTDOWN_COORDINATED`, graceful SIGTERM, ports freed, status → `hibernating`.
- Wake-up protocol:
  - Client detects `hibernating`, performs CAS to `waking`, triggers a wake request; reuse ports if available, else re-negotiate; start instances; status → `active`.
- Client retry UX:
  - Transparent retries with exponential backoff; treat `waking` as a wait-only state.
- Tests: hibernation under load; thundering herd; resource verification and persistence across cycles.
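An illustrative idle-scan loop, assuming an instance exposes its last-query timestamp; the `instance`, `isIdle`, and `broadcastIdle` names are placeholders, not the actual manager API.

```go
package dbcluster

import (
	"sync"
	"time"
)

type instance struct {
	mu        sync.Mutex
	lastQuery time.Time
}

func (i *instance) isIdle(timeout time.Duration) bool {
	i.mu.Lock()
	defer i.mu.Unlock()
	return time.Since(i.lastQuery) > timeout
}

// scanForIdle periodically checks every local instance and reports the idle ones,
// so the coordinator can schedule a coordinated shutdown once all replicas agree.
func scanForIdle(instances map[string]*instance, timeout time.Duration, broadcastIdle func(db string)) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		for db, inst := range instances {
			if inst.isIdle(timeout) {
				broadcastIdle(db) // placeholder for the pubsub idle notice
			}
		}
	}
}
```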
### Phase 4: Resilience (Failure & Replacement)

- Continuous health checks with timeouts → mark node unhealthy.
- Replacement orchestration:
  - Coordinator initiates `NODE_REPLACEMENT_NEEDED`, eligible nodes respond, selection is confirmed, and the new node joins raft via `-join`, then syncs.
- Startup reconciliation:
  - Detect and clean up orphaned or non-member local data directories.
- Rate-limit replacements to prevent cascades; prioritize by usage metrics.
- Tests: forced crashes, partitions, replacement within target SLO; reconciliation sanity.
### Phase 5: Production Hardening & Optimization

- Metrics/logging:
  - Structured logs with trace IDs; counters for queries/min, hibernations, wake-ups, replacements; health and capacity gauges.
- Config validation, replication factor settings (1, 3, 5), and debugging APIs (read-only metadata dump, node status).
- Client metadata caching and query routing improvements (simple round-robin → latency-aware later).
- Performance benchmarks and operator-facing docs.
### File Changes (Essentials)

- `pkg/config/config.go`
  - Remove (deprecate, then delete): `Database.DataDir`, `RQLitePort`, `RQLiteRaftPort`, `RQLiteJoinAddress`.
  - Add: `ReplicationFactor int`, `HibernationTimeout time.Duration`, `MaxDatabases int`, `PortRange {HTTPStart, HTTPEnd, RaftStart, RaftEnd int}`, `Discovery.HealthCheckInterval` (a sketch of the resulting struct follows this list).
- `pkg/client/interface.go` / `pkg/client/client.go`
  - Add `Database(name string)` and the app namespace requirement (`DefaultClientConfig(appName)`); backoff polling.
- `pkg/node/node.go`
  - Wire `metadata.Manager` and `dbcluster.ClusterManager`; remove direct rqlite singleton usage.
- `pkg/rqlite/*`
  - Refactor from a singleton to instance-oriented helpers.
- New packages under `pkg/metadata` and `pkg/dbcluster` as listed above.
- `configs/node.yaml` and validation paths to reflect the new `database` block.
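A rough sketch of how the new `database` block could map onto Go types, based on the field list above; the exact struct and field names in `pkg/config/config.go` may differ.

```go
package config

import "time"

// PortRange bounds the dynamically allocated HTTP and Raft ports.
type PortRange struct {
	HTTPStart int `yaml:"http_start"`
	HTTPEnd   int `yaml:"http_end"`
	RaftStart int `yaml:"raft_start"`
	RaftEnd   int `yaml:"raft_end"`
}

// DatabaseConfig is the proposed replacement for the legacy single-cluster fields.
type DatabaseConfig struct {
	ReplicationFactor  int           `yaml:"replication_factor"`
	HibernationTimeout time.Duration `yaml:"hibernation_timeout"`
	MaxDatabases       int           `yaml:"max_databases"`
	PortRange          PortRange     `yaml:"port_range"`
}
```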
### Config Example (target end-state)

```yaml
node:
  data_dir: "./data"

database:
  replication_factor: 3
  hibernation_timeout: 60s
  max_databases: 100
  port_range:
    http_start: 5001
    http_end: 5999
    raft_start: 7001
    raft_end: 7999

discovery:
  health_check_interval: 10s
```
### Rollout Strategy

- Keep feature flag off by default; support legacy single-cluster path.
- Ship Phase 1 behind flag; enable in dev/e2e only.
- Incrementally enable creation (Phase 2), then hibernation (Phase 3) per environment.
- Remove legacy config after deprecation window.

### Testing & Quality Gates

- Unit tests: metadata ops, consensus, ports, subprocess, manager state machine.
- Integration tests under `e2e/` for creation, isolation, hibernation, failure handling, partitions.
- Benchmarks for creation (<10s), wake-up (<8s), metadata sync (<5s), query overhead (<10ms).
- Chaos suite for randomized failures and partitions.

### Risks & Mitigations (operationalized)

- Metadata divergence → vector clocks + periodic checksums + majority read checks in client.
- Raft churn → adaptive timeouts; allow `always_on` flag per-db (future).
- Cascading replacements → global rate limiter and prioritization.
- Debuggability → verbose structured logging and metadata dump endpoints.

### Timeline (indicative)

- Weeks 1-2: Phases 0-1
- Weeks 3-4: Phase 2
- Weeks 5-6: Phase 3
- Weeks 7-8: Phase 4
- Weeks 9-10+: Phase 5

### To-dos

- [ ] Add feature flag, scaffold packages, CI jobs, rqlite binary checks
- [ ] Extend `pkg/config/config.go` and YAML schemas; deprecate legacy fields
- [ ] Implement metadata types and thread-safe store with versioning
- [ ] Implement pubsub messages and handlers using existing pubsub manager
- [ ] Implement coordinator election, vector clocks, gossip reconciliation
- [ ] Implement `PortManager` with bind-probing and allocation
- [ ] Implement rqlite subprocess control and readiness checks
- [ ] Implement `ClusterManager` and creation lifecycle orchestration
- [ ] Add `Database(name)` and app namespacing to client; backoff polling
- [ ] Adopt per-database data dirs under node `data_dir`
- [ ] Integration tests for creation and isolation across nodes
- [ ] Idle detection, coordinated shutdown, status updates
- [ ] Wake-up CAS to `waking`, port reuse/renegotiation, restart
- [ ] Client transparent retry/backoff for hibernation and waking
- [ ] Health checks, replacement orchestration, rate limiting
- [ ] Implement orphaned data reconciliation on startup
- [ ] Add metrics and structured logging across managers
- [ ] Benchmarks for creation, wake-up, sync, query overhead
- [ ] Operator and developer docs; config and migration guides
@@ -1,504 +0,0 @@
# Dynamic Database Clustering - User Guide

## Overview

Dynamic Database Clustering enables on-demand creation of isolated, replicated rqlite database clusters with automatic resource management through hibernation. Each database runs as a separate 3-node cluster with its own data directory and port allocation.

## Key Features

✅ **Multi-Database Support** - Create unlimited isolated databases on-demand
✅ **3-Node Replication** - Fault-tolerant by default (configurable)
✅ **Auto Hibernation** - Idle databases hibernate to save resources
✅ **Transparent Wake-Up** - Automatic restart on access
✅ **App Namespacing** - Databases are scoped by application name
✅ **Decentralized Metadata** - LibP2P pubsub-based coordination
✅ **Failure Recovery** - Automatic node replacement on failures
✅ **Resource Optimization** - Dynamic port allocation and data isolation
## Configuration

### Node Configuration (`configs/node.yaml`)

```yaml
node:
  data_dir: "./data"
  listen_addresses:
    - "/ip4/0.0.0.0/tcp/4001"
  max_connections: 50

database:
  replication_factor: 3        # Number of replicas per database
  hibernation_timeout: 60s     # Idle time before hibernation
  max_databases: 100           # Max databases per node
  port_range_http_start: 5001  # HTTP port range start
  port_range_http_end: 5999    # HTTP port range end
  port_range_raft_start: 7001  # Raft port range start
  port_range_raft_end: 7999    # Raft port range end

discovery:
  bootstrap_peers:
    - "/ip4/127.0.0.1/tcp/4001/p2p/..."
  discovery_interval: 30s
  health_check_interval: 10s
```

### Key Configuration Options

#### `database.replication_factor` (default: 3)

Number of nodes that will host each database cluster. Minimum 1; 3 is recommended for fault tolerance.

#### `database.hibernation_timeout` (default: 60s)

Time of inactivity before a database is hibernated. Set to 0 to disable hibernation.

#### `database.max_databases` (default: 100)

Maximum number of databases this node can host simultaneously.

#### `database.port_range_*`

Port ranges for dynamic allocation. Ensure the ranges are large enough for `max_databases * 2` ports in total (one HTTP and one Raft port per database).
## Client Usage

### Creating/Accessing Databases

```go
package main

import (
	"context"

	"github.com/DeBrosOfficial/network/pkg/client"
)

func main() {
	// Create client with app name for namespacing
	cfg := client.DefaultClientConfig("myapp")
	cfg.BootstrapPeers = []string{
		"/ip4/127.0.0.1/tcp/4001/p2p/...",
	}

	c, err := client.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	// Connect to network
	if err := c.Connect(); err != nil {
		panic(err)
	}
	defer c.Disconnect()

	// Get database client (creates the database if it doesn't exist)
	db, err := c.Database("users")
	if err != nil {
		panic(err)
	}

	// Use the database
	ctx := context.Background()
	err = db.CreateTable(ctx, `
		CREATE TABLE users (
			id INTEGER PRIMARY KEY,
			name TEXT NOT NULL,
			email TEXT UNIQUE
		)
	`)
	if err != nil {
		panic(err)
	}

	// Query data
	result, err := db.Query(ctx, "SELECT * FROM users")
	if err != nil {
		panic(err)
	}
	_ = result // ...
}
```
### Database Naming

Databases are automatically namespaced by your application name:

- `client.Database("users")` → creates `myapp_users` internally
- This prevents name collisions between different applications
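A minimal sketch of what the namespacing and sanitization could look like; the `namespaceDatabase` helper and the exact sanitization rules are assumptions for illustration, not the actual `pkg/client` implementation.

```go
package client

import (
	"regexp"
	"strings"
)

// unsafeChars matches anything outside a conservative identifier charset.
var unsafeChars = regexp.MustCompile(`[^a-z0-9_]+`)

// sanitize lowercases a name and replaces unsafe characters with underscores.
func sanitize(name string) string {
	return unsafeChars.ReplaceAllString(strings.ToLower(name), "_")
}

// namespaceDatabase combines the app name and database name into the
// internal identifier, e.g. ("myapp", "users") -> "myapp_users".
func namespaceDatabase(appName, dbName string) string {
	return sanitize(appName) + "_" + sanitize(dbName)
}
```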
## Gateway API Usage

If you prefer HTTP/REST API access instead of the Go client, you can use the gateway endpoints:

### Base URL

```
http://gateway-host:8080/v1/database/
```
### Execute SQL (INSERT, UPDATE, DELETE, DDL)

```bash
POST /v1/database/exec
Content-Type: application/json

{
  "database": "users",
  "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
  "args": ["Alice", "alice@example.com"]
}

Response:
{
  "rows_affected": 1,
  "last_insert_id": 1
}
```

### Query Data (SELECT)

```bash
POST /v1/database/query
Content-Type: application/json

{
  "database": "users",
  "sql": "SELECT * FROM users WHERE name LIKE ?",
  "args": ["A%"]
}

Response:
{
  "items": [
    {"id": 1, "name": "Alice", "email": "alice@example.com"}
  ],
  "count": 1
}
```

### Execute Transaction

```bash
POST /v1/database/transaction
Content-Type: application/json

{
  "database": "users",
  "queries": [
    "INSERT INTO users (name, email) VALUES ('Bob', 'bob@example.com')",
    "UPDATE users SET email = 'alice.new@example.com' WHERE name = 'Alice'"
  ]
}

Response:
{
  "success": true
}
```

### Get Schema

```bash
GET /v1/database/schema?database=users

# OR

POST /v1/database/schema
Content-Type: application/json

{
  "database": "users"
}

Response:
{
  "tables": [
    {
      "name": "users",
      "columns": ["id", "name", "email"]
    }
  ]
}
```
### Create Table

```bash
POST /v1/database/create-table
Content-Type: application/json

{
  "database": "users",
  "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
}

Response:
{
  "rows_affected": 0
}
```

### Drop Table

```bash
POST /v1/database/drop-table
Content-Type: application/json

{
  "database": "users",
  "table_name": "old_table"
}

Response:
{
  "rows_affected": 0
}
```

### List Databases

```bash
GET /v1/database/list

Response:
{
  "databases": ["users", "products", "orders"]
}
```

### Important Notes

1. **Authentication Required**: All endpoints require authentication (JWT or API key)
2. **Database Creation**: Databases are created automatically on first access
3. **Hibernation**: The gateway handles hibernation/wake-up transparently - you may experience a delay (< 8s) on the first query to a hibernating database (see the Go example below)
4. **Timeouts**: Query timeout is 30s, transaction timeout is 60s
5. **Namespacing**: Database names are automatically prefixed with your app name
6. **Concurrent Access**: All endpoints are safe for concurrent use
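Because wake-up can add a few seconds to the first request after hibernation, it helps to give gateway calls a generous client-side timeout. A minimal Go sketch using only the standard library; the endpoint path and JSON shape follow the examples above, while the token handling is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Allow enough headroom for a hibernating database to wake up (< 8s)
	// plus normal query time.
	httpClient := &http.Client{Timeout: 30 * time.Second}

	payload, _ := json.Marshal(map[string]any{
		"database": "users",
		"sql":      "SELECT * FROM users WHERE name LIKE ?",
		"args":     []any{"A%"},
	})

	req, err := http.NewRequest(http.MethodPost,
		"http://gateway-host:8080/v1/database/query", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <your-token>") // JWT or API key (placeholder)

	resp, err := httpClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```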
## Database Lifecycle

### 1. Creation

When you first access a database:

1. **Request Broadcast** - Node broadcasts `DATABASE_CREATE_REQUEST`
2. **Node Selection** - Eligible nodes respond with available ports
3. **Coordinator Selection** - Deterministic coordinator (lowest peer ID) chosen
4. **Confirmation** - Coordinator selects nodes and broadcasts `DATABASE_CREATE_CONFIRM`
5. **Instance Startup** - Selected nodes start rqlite subprocesses
6. **Readiness** - Nodes report `active` status when ready

**Typical creation time: < 10 seconds**

### 2. Active State

- Database instances run as rqlite subprocesses
- Each instance tracks a `LastQuery` timestamp
- Queries update the activity timestamp
- Metadata is synced across all network nodes

### 3. Hibernation

After `hibernation_timeout` of inactivity:

1. **Idle Detection** - Nodes detect idle databases
2. **Idle Notification** - Nodes broadcast idle status
3. **Coordinated Shutdown** - When all nodes report idle, the coordinator schedules shutdown
4. **Graceful Stop** - SIGTERM sent to rqlite processes
5. **Port Release** - Ports freed for reuse
6. **Status Update** - Metadata updated to `hibernating`

**Data persists on disk during hibernation**

### 4. Wake-Up

On the first query to a hibernating database (see the client retry sketch below):

1. **Detection** - Client/node detects `hibernating` status
2. **Wake Request** - Broadcast `DATABASE_WAKEUP_REQUEST`
3. **Port Allocation** - Reuse original ports or allocate new ones
4. **Instance Restart** - Restart rqlite with existing data
5. **Status Update** - Update to `active` when ready

**Typical wake-up time: < 8 seconds**
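A sketch of the transparent retry behaviour from the client's perspective: retry with exponential backoff while the database reports `hibernating` or `waking`. The `statusOf` and `runQuery` callbacks are placeholders for illustration, not the actual client API.

```go
package client

import (
	"context"
	"fmt"
	"time"
)

// queryWithWakeup retries a query while the database is hibernating or waking,
// backing off exponentially, and gives up when the context is cancelled.
func queryWithWakeup(
	ctx context.Context,
	statusOf func(ctx context.Context) (string, error), // e.g. "active", "hibernating", "waking"
	runQuery func(ctx context.Context) error,
) error {
	backoff := 250 * time.Millisecond
	for {
		status, err := statusOf(ctx)
		if err != nil {
			return err
		}
		if status == "active" {
			return runQuery(ctx)
		}
		// "hibernating" triggers a wake request upstream; "waking" is wait-only.
		select {
		case <-ctx.Done():
			return fmt.Errorf("database not active before deadline: last status %q", status)
		case <-time.After(backoff):
		}
		if backoff < 4*time.Second {
			backoff *= 2
		}
	}
}
```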
### 5. Failure Recovery

When a node fails:

1. **Health Detection** - Missed health checks trigger failure detection
2. **Replacement Request** - Surviving nodes broadcast `NODE_REPLACEMENT_NEEDED`
3. **Offers** - Healthy nodes with capacity offer to replace
4. **Selection** - First offer accepted (simple approach)
5. **Join Cluster** - New node joins the existing Raft cluster
6. **Sync** - Data synced from existing members

## Data Management

### Data Directories

Each database gets its own data directory:

```
./data/
├── myapp_users/        # Database: users
│   └── rqlite/
│       ├── db.sqlite
│       └── raft/
├── myapp_products/     # Database: products
│   └── rqlite/
└── myapp_orders/       # Database: orders
    └── rqlite/
```
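A small sketch of how the per-database path could be derived from the node's `data_dir`; the helper name is an assumption for illustration.

```go
package dbcluster

import "path/filepath"

// databaseDataDir returns the rqlite data directory for a namespaced database,
// e.g. ("./data", "myapp", "users") -> "data/myapp_users/rqlite".
func databaseDataDir(baseDataDir, appName, dbName string) string {
	return filepath.Join(baseDataDir, appName+"_"+dbName, "rqlite")
}
```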
### Orphaned Data Cleanup

On node startup, the system automatically:

- Scans data directories
- Checks against metadata
- Removes directories for:
  - Non-existent databases
  - Databases where this node is not a member

## Monitoring & Debugging

### Structured Logging

All operations are logged with structured fields:

```
INFO Starting cluster manager node_id=12D3... max_databases=100
INFO Received database create request database=myapp_users requester=12D3...
INFO Database instance started database=myapp_users http_port=5001 raft_port=7001
INFO Database is idle database=myapp_users idle_time=62s
INFO Database hibernated successfully database=myapp_users
INFO Received wakeup request database=myapp_users
INFO Database woke up successfully database=myapp_users
```

### Health Checks

Nodes perform periodic health checks:

- Every `health_check_interval` (default: 10s)
- Tracks last-seen time for each peer
- 3 missed checks → node marked unhealthy
- Triggers the replacement protocol for affected databases

## Best Practices

### 1. **Capacity Planning**

```yaml
# For 100 databases with 3-node replication:
database:
  max_databases: 100
  port_range_http_start: 5001
  port_range_http_end: 5200    # 200 ports (100 databases * 2)
  port_range_raft_start: 7001
  port_range_raft_end: 7200
```

### 2. **Hibernation Tuning**

- **High Traffic**: Set `hibernation_timeout: 300s` or higher
- **Development**: Set `hibernation_timeout: 30s` for faster cycles
- **Always-On DBs**: Set `hibernation_timeout: 0` to disable

### 3. **Replication Factor**

- **Development**: `replication_factor: 1` (single node, no replication)
- **Production**: `replication_factor: 3` (fault tolerant)
- **High Availability**: `replication_factor: 5` (survives 2 failures)

### 4. **Network Topology**

- Use at least 3 nodes for `replication_factor: 3`
- Ensure `max_databases * replication_factor <= total_cluster_capacity`
- Example: 3 nodes × 100 max_databases = 300 database instances total

## Troubleshooting

### Database Creation Fails

**Problem**: `insufficient nodes responded: got 1, need 3`

**Solution**:

- Ensure you have at least `replication_factor` nodes online
- Check the `max_databases` limit on nodes
- Verify port ranges aren't exhausted

### Database Not Waking Up

**Problem**: Database stays in `waking` status

**Solution**:

- Check node logs for rqlite startup errors
- Verify the rqlite binary is installed
- Check for port conflicts (use different port ranges)
- Ensure the data directory is accessible

### Orphaned Data

**Problem**: Disk space consumed by old databases

**Solution**:

- Orphaned data is automatically cleaned on node restart
- Manual cleanup: delete directories from `./data/` that don't match metadata
- Check logs for reconciliation results

### Node Replacement Not Working

**Problem**: Failed node not replaced

**Solution**:

- Ensure remaining nodes have capacity (`CurrentDatabases < MaxDatabases`)
- Check network connectivity between nodes
- Verify the health check interval is reasonable (not too aggressive)

## Advanced Topics

### Metadata Consistency

- **Vector Clocks**: Each metadata update includes a vector clock for conflict resolution
- **Gossip Protocol**: Periodic metadata sync via checksums
- **Eventual Consistency**: All nodes eventually agree on database state

### Port Management

- Ports allocated randomly within configured ranges
- Bind-probing ensures ports are actually available
- Ports reused during wake-up when possible
- Failed allocations fall back to new random ports

### Coordinator Election

- Deterministic selection based on lexicographical peer ID ordering
- Lowest peer ID becomes coordinator
- No persistent coordinator state
- Re-election occurs for each database operation (see the sketch below)
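The election rule is simple enough to show in full: sort the candidate peer IDs lexicographically and take the lowest. A minimal illustrative sketch; the function name is an assumption, not the actual `pkg/rqlite/consensus.go` API.

```go
package rqlite

import (
	"errors"
	"sort"
)

// electCoordinator deterministically picks the coordinator for an operation:
// the lexicographically lowest peer ID among the candidates. Every node that
// runs this over the same candidate set arrives at the same answer, so no
// extra coordination round is needed.
func electCoordinator(peerIDs []string) (string, error) {
	if len(peerIDs) == 0 {
		return "", errors.New("no candidate peers")
	}
	sorted := append([]string(nil), peerIDs...)
	sort.Strings(sorted)
	return sorted[0], nil
}
```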
## Migration from Legacy Mode

If upgrading from single-cluster rqlite:

1. **Backup Data**: Back up your existing `./data/rqlite` directory
2. **Update Config**: Remove deprecated fields:
   - `database.data_dir`
   - `database.rqlite_port`
   - `database.rqlite_raft_port`
   - `database.rqlite_join_address`
3. **Add New Fields**: Configure dynamic clustering (see the Configuration section)
4. **Restart Nodes**: Restart all nodes with the new configuration
5. **Migrate Data**: Create a new database and import data from the backup

## Future Enhancements

The following features are planned for future releases:

### **Advanced Metrics** (Future)

- Prometheus-style metrics export
- Per-database query counters
- Hibernation/wake-up latency histograms
- Resource utilization gauges

### **Performance Benchmarks** (Future)

- Automated benchmark suite
- Creation time SLOs
- Wake-up latency targets
- Query overhead measurements

### **Enhanced Monitoring** (Future)

- Dashboard for cluster visualization
- Database status API endpoint
- Capacity planning tools
- Alerting integration

## Support

For issues, questions, or contributions:

- GitHub Issues: https://github.com/DeBrosOfficial/network/issues
- Documentation: https://github.com/DeBrosOfficial/network/blob/main/DYNAMIC_DATABASE_CLUSTERING.md

## License

See the LICENSE file for details.
TESTING_GUIDE.md
@@ -1,827 +0,0 @@
# Dynamic Database Clustering - Testing Guide

This guide provides a comprehensive list of the unit tests, integration tests, and manual tests needed to verify the dynamic database clustering feature.

## Unit Tests
### 1. Metadata Store Tests (`pkg/rqlite/metadata_test.go`)

```go
// Test cases to implement:

func TestMetadataStore_GetSetDatabase(t *testing.T)
- Create store
- Set database metadata
- Get database metadata
- Verify data matches

func TestMetadataStore_DeleteDatabase(t *testing.T)
- Set database metadata
- Delete database
- Verify Get returns nil

func TestMetadataStore_ListDatabases(t *testing.T)
- Add multiple databases
- List all databases
- Verify count and contents

func TestMetadataStore_ConcurrentAccess(t *testing.T)
- Spawn multiple goroutines
- Concurrent reads and writes
- Verify no race conditions (run with -race)

func TestMetadataStore_NodeCapacity(t *testing.T)
- Set node capacity
- Get node capacity
- Update capacity
- List nodes
```

### 2. Vector Clock Tests (`pkg/rqlite/vector_clock_test.go`)

```go
func TestVectorClock_Increment(t *testing.T)
- Create empty vector clock
- Increment for node A
- Verify counter is 1
- Increment again
- Verify counter is 2

func TestVectorClock_Merge(t *testing.T)
- Create two vector clocks with different nodes
- Merge them
- Verify max values are preserved

func TestVectorClock_Compare(t *testing.T)
- Test strictly less than case
- Test strictly greater than case
- Test concurrent case
- Test identical case

func TestVectorClock_Concurrent(t *testing.T)
- Create clocks with overlapping updates
- Verify Compare returns 0 (concurrent)
```

### 3. Consensus Tests (`pkg/rqlite/consensus_test.go`)

```go
func TestElectCoordinator_SingleNode(t *testing.T)
- Pass single node ID
- Verify it's elected

func TestElectCoordinator_MultipleNodes(t *testing.T)
- Pass multiple node IDs
- Verify lowest lexicographical ID wins
- Verify deterministic (same input = same output)

func TestElectCoordinator_EmptyList(t *testing.T)
- Pass empty list
- Verify error returned

func TestElectCoordinator_Deterministic(t *testing.T)
- Run election multiple times with same inputs
- Verify same coordinator each time
```
### 4. Port Manager Tests (`pkg/rqlite/ports_test.go`)

```go
func TestPortManager_AllocatePortPair(t *testing.T)
- Create manager with port range
- Allocate port pair
- Verify HTTP and Raft ports different
- Verify ports within range

func TestPortManager_ReleasePortPair(t *testing.T)
- Allocate port pair
- Release ports
- Verify ports can be reallocated

func TestPortManager_Exhaustion(t *testing.T)
- Allocate all available ports
- Attempt one more allocation
- Verify error returned

func TestPortManager_IsPortAllocated(t *testing.T)
- Allocate ports
- Check IsPortAllocated returns true
- Release ports
- Check IsPortAllocated returns false

func TestPortManager_AllocateSpecificPorts(t *testing.T)
- Allocate specific ports
- Verify allocation succeeds
- Attempt to allocate same ports again
- Verify error returned
```

### 5. RQLite Instance Tests (`pkg/rqlite/instance_test.go`)

```go
func TestRQLiteInstance_Create(t *testing.T)
- Create instance configuration
- Verify fields set correctly

func TestRQLiteInstance_IsIdle(t *testing.T)
- Set LastQuery to old timestamp
- Verify IsIdle returns true
- Update LastQuery
- Verify IsIdle returns false

// Integration test (requires rqlite binary):
func TestRQLiteInstance_StartStop(t *testing.T)
- Create instance
- Start instance
- Verify HTTP endpoint responsive
- Stop instance
- Verify process terminated
```

### 6. Pubsub Message Tests (`pkg/rqlite/pubsub_messages_test.go`)

```go
func TestMarshalUnmarshalMetadataMessage(t *testing.T)
- Create each message type
- Marshal to bytes
- Unmarshal back
- Verify data preserved

func TestDatabaseCreateRequest_Marshal(t *testing.T)
func TestDatabaseCreateResponse_Marshal(t *testing.T)
func TestDatabaseCreateConfirm_Marshal(t *testing.T)
func TestDatabaseStatusUpdate_Marshal(t *testing.T)
// ... for all message types
```

### 7. Coordinator Tests (`pkg/rqlite/coordinator_test.go`)

```go
func TestCreateCoordinator_AddResponse(t *testing.T)
- Create coordinator
- Add responses
- Verify response count

func TestCreateCoordinator_SelectNodes(t *testing.T)
- Add more responses than needed
- Call SelectNodes
- Verify correct number selected
- Verify deterministic selection

func TestCreateCoordinator_WaitForResponses(t *testing.T)
- Create coordinator
- Wait in goroutine
- Add responses from another goroutine
- Verify wait completes when enough responses

func TestCoordinatorRegistry(t *testing.T)
- Register coordinator
- Get coordinator
- Remove coordinator
- Verify lifecycle
```
## Integration Tests

### 1. Single Node Database Creation (`e2e/single_node_database_test.go`)

```go
func TestSingleNodeDatabaseCreation(t *testing.T)
- Start 1 node
- Set replication_factor = 1
- Create database
- Verify database active
- Write data
- Read data back
- Verify data matches
```

### 2. Three Node Database Creation (`e2e/three_node_database_test.go`)

```go
func TestThreeNodeDatabaseCreation(t *testing.T)
- Start 3 nodes
- Set replication_factor = 3
- Create database from node 1
- Wait for all nodes to report active
- Write data to node 1
- Read from node 2
- Verify replication worked
```

### 3. Multiple Databases (`e2e/multiple_databases_test.go`)

```go
func TestMultipleDatabases(t *testing.T)
- Start 3 nodes
- Create database "users"
- Create database "products"
- Create database "orders"
- Verify all databases active
- Write to each database
- Verify data isolation
```
### 4. Hibernation Cycle (`e2e/hibernation_test.go`)

A polling helper for these status checks is sketched after this block.

```go
func TestHibernationCycle(t *testing.T)
- Start 3 nodes with hibernation_timeout=5s
- Create database
- Write initial data
- Wait 10 seconds (no activity)
- Verify status = hibernating
- Verify processes stopped
- Verify data persisted on disk

func TestWakeUpCycle(t *testing.T)
- Create and hibernate database
- Issue query
- Wait for wake-up
- Verify status = active
- Verify data still accessible
- Verify LastQuery updated
```
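Status transitions such as `hibernating` → `active` are timing-dependent, so e2e assertions usually poll rather than sleep a fixed amount. A hedged sketch of such a helper; `statusOf` stands in for whatever the test harness exposes.

```go
package e2e

import (
	"testing"
	"time"
)

// waitForStatus polls until the database reports the wanted status or the
// deadline passes, failing the test with the last observed status otherwise.
func waitForStatus(t *testing.T, statusOf func() string, want string, timeout time.Duration) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	last := ""
	for time.Now().Before(deadline) {
		last = statusOf()
		if last == want {
			return
		}
		time.Sleep(250 * time.Millisecond)
	}
	t.Fatalf("database never reached status %q (last seen %q)", want, last)
}
```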
### 5. Node Failure and Recovery (`e2e/failure_recovery_test.go`)

```go
func TestNodeFailureDetection(t *testing.T)
- Start 3 nodes
- Create database
- Kill one node (SIGKILL)
- Wait for health checks to detect failure
- Verify NODE_REPLACEMENT_NEEDED broadcast

func TestNodeReplacement(t *testing.T)
- Start 4 nodes
- Create database on nodes 1,2,3
- Kill node 3
- Wait for replacement
- Verify node 4 joins cluster
- Verify data accessible from node 4
```

### 6. Orphaned Data Cleanup (`e2e/cleanup_test.go`)

```go
func TestOrphanedDataCleanup(t *testing.T)
- Start node
- Manually create orphaned data directory
- Restart node
- Verify orphaned directory removed
- Check logs for reconciliation message
```
### 7. Concurrent Operations (`e2e/concurrent_test.go`)

A sketch of driving concurrent creations from a test follows this block.

```go
func TestConcurrentDatabaseCreation(t *testing.T)
- Start 5 nodes
- Create 10 databases concurrently
- Verify all successful
- Verify no port conflicts
- Verify proper distribution

func TestConcurrentHibernation(t *testing.T)
- Create multiple databases
- Let all go idle
- Verify all hibernate correctly
- No race conditions
```
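A minimal sketch of the concurrent-creation driver, assuming a hypothetical `createDatabase(name) error` helper provided by the test harness.

```go
package e2e

import (
	"fmt"
	"sync"
	"testing"
)

// createManyConcurrently creates n databases in parallel and reports every failure.
func createManyConcurrently(t *testing.T, n int, createDatabase func(name string) error) {
	t.Helper()
	var wg sync.WaitGroup
	errs := make(chan error, n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			name := fmt.Sprintf("db_%d", i)
			if err := createDatabase(name); err != nil {
				errs <- fmt.Errorf("create %s: %w", name, err)
			}
		}(i)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		t.Error(err)
	}
}
```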
## Manual Test Scenarios

### Test 1: Basic Flow - Three Node Cluster

**Setup:**

```bash
# Terminal 1: Bootstrap node
cd data/bootstrap
../../bin/node --data bootstrap --id bootstrap --p2p-port 4001

# Terminal 2: Node 2
cd data/node
../../bin/node --data node --id node2 --p2p-port 4002

# Terminal 3: Node 3
cd data/node2
../../bin/node --data node2 --id node3 --p2p-port 4003
```

**Test Steps:**

1. **Create Database**
   ```bash
   # Use client or API to create database "testdb"
   ```

2. **Verify Creation**
   - Check logs on all 3 nodes for "Database instance started"
   - Verify `./data/*/testdb/` directories exist on all nodes
   - Check that different ports were allocated on each node

3. **Write Data**
   ```sql
   CREATE TABLE users (id INT, name TEXT);
   INSERT INTO users VALUES (1, 'Alice');
   INSERT INTO users VALUES (2, 'Bob');
   ```

4. **Verify Replication**
   - Query from each node
   - Verify the same data is returned

**Expected Results:**

- All nodes show `status=active` for testdb
- Data replicated across all nodes
- Unique port pairs per node

---

### Test 2: Hibernation and Wake-Up

**Setup:** Same as Test 1 with database created

**Test Steps:**

1. **Check Activity**
   ```bash
   # In logs, verify "last_query" timestamps updating on queries
   ```

2. **Wait for Hibernation**
   - Stop issuing queries
   - Wait `hibernation_timeout` + 10s
   - Check logs for "Database is idle"
   - Verify "Coordinated shutdown message sent"
   - Verify "Database hibernated successfully"

3. **Verify Hibernation**
   ```bash
   # Check that rqlite processes are stopped
   ps aux | grep rqlite

   # Verify data directories still exist
   ls -la data/*/testdb/
   ```

4. **Wake Up**
   - Issue a query to the database
   - Watch logs for "Received wakeup request"
   - Verify "Database woke up successfully"
   - Verify the query succeeds

**Expected Results:**

- Hibernation happens after the idle timeout
- All 3 nodes hibernate in a coordinated fashion
- Wake-up completes in < 8 seconds
- Data persists across the hibernation cycle

---
### Test 3: Multiple Databases

**Setup:** 3 nodes running

**Test Steps:**

1. **Create Multiple Databases**
   ```
   Create: users_db
   Create: products_db
   Create: orders_db
   ```

2. **Verify Isolation**
   - Insert data in users_db
   - Verify data NOT in products_db
   - Verify data NOT in orders_db

3. **Check Port Allocation**
   ```bash
   # Verify different ports for each database
   netstat -tlnp | grep rqlite
   # OR
   ss -tlnp | grep rqlite
   ```

4. **Verify Data Directories**
   ```bash
   tree data/bootstrap/
   # Should show:
   # ├── users_db/
   # ├── products_db/
   # └── orders_db/
   ```

**Expected Results:**

- 3 separate database clusters
- Each with 3 nodes (9 total instances)
- Complete data isolation
- Unique port pairs for each instance

---

### Test 4: Node Failure and Recovery

**Setup:** 4 nodes running, database created on nodes 1-3

**Test Steps:**

1. **Verify Initial State**
   - Database active on nodes 1, 2, 3
   - Node 4 idle

2. **Simulate Failure**
   ```bash
   # Kill node 3 (SIGKILL for unclean shutdown)
   kill -9 <node3_pid>
   ```

3. **Watch for Detection**
   - Check logs on nodes 1 and 2
   - Wait for health check failures (3 missed pings)
   - Verify "Node detected as unhealthy" messages

4. **Watch for Replacement**
   - Check for "NODE_REPLACEMENT_NEEDED" broadcast
   - Node 4 should offer to replace
   - Verify "Starting as replacement node" on node 4
   - Verify node 4 joins the Raft cluster

5. **Verify Data Integrity**
   - Query the database from node 4
   - Verify all data present
   - Insert new data from node 4
   - Verify replication to nodes 1 and 2

**Expected Results:**

- Failure detected within 30 seconds
- Replacement completes automatically
- Data accessible from the new node
- No data loss

---
### Test 5: Port Exhaustion

**Setup:** 1 node with a small port range

**Configuration:**

```yaml
database:
  max_databases: 10
  port_range_http_start: 5001
  port_range_http_end: 5005   # Only 5 ports
  port_range_raft_start: 7001
  port_range_raft_end: 7005   # Only 5 ports
```

**Test Steps:**

1. **Create Databases**
   - Create database 1 (succeeds - uses 2 ports)
   - Create database 2 (succeeds - uses 2 ports)
   - Create database 3 (fails - only 1 port left)

2. **Verify Error**
   - Check logs for "Cannot allocate ports"
   - Verify error returned to client

3. **Free Ports**
   - Hibernate or delete database 1
   - Ports should be freed

4. **Retry**
   - Create database 3 again
   - Should succeed now

**Expected Results:**

- Graceful handling of port exhaustion
- Clear error messages
- Ports properly recycled

---

### Test 6: Orphaned Data Cleanup

**Setup:** 1 node stopped

**Test Steps:**

1. **Create Orphaned Data**
   ```bash
   # While node is stopped
   mkdir -p data/bootstrap/orphaned_db/rqlite
   echo "fake data" > data/bootstrap/orphaned_db/rqlite/db.sqlite
   ```

2. **Start Node**
   ```bash
   ./bin/node --data bootstrap --id bootstrap
   ```

3. **Check Reconciliation**
   - Watch logs for "Starting orphaned data reconciliation"
   - Verify "Found orphaned database directory"
   - Verify "Removed orphaned database directory"

4. **Verify Cleanup**
   ```bash
   ls data/bootstrap/
   # orphaned_db should be gone
   ```

**Expected Results:**

- Orphaned directories automatically detected
- Removed on startup
- Clean reconciliation logged

---
### Test 7: Stress Test - Many Databases

**Setup:** 5 nodes with high capacity

**Configuration:**

```yaml
database:
  max_databases: 50
  port_range_http_start: 5001
  port_range_http_end: 5150
  port_range_raft_start: 7001
  port_range_raft_end: 7150
```

**Test Steps:**

1. **Create Many Databases**
   ```
   Loop: Create databases db_1 through db_25
   ```

2. **Verify Distribution**
   - Check logs for node capacity announcements
   - Verify databases are distributed across nodes
   - No single node overloaded

3. **Concurrent Operations**
   - Write to multiple databases simultaneously
   - Read from multiple databases
   - Verify no conflicts

4. **Hibernation Wave**
   - Stop all activity
   - Wait for hibernation
   - Verify all databases hibernate
   - Check that resource usage drops

5. **Wake-Up Storm**
   - Query all 25 databases at once
   - Verify all wake up successfully
   - Check for thundering herd issues

**Expected Results:**

- All 25 databases created successfully
- Even distribution across nodes
- No port conflicts
- Successful mass hibernation/wake-up

---
### Test 8: Gateway API Access

**Setup:** Gateway running with 3 nodes

**Test Steps:**

1. **Authenticate**
   ```bash
   # Get JWT token
   TOKEN=$(curl -X POST http://localhost:8080/v1/auth/login \
     -H "Content-Type: application/json" \
     -d '{"wallet": "..."}' | jq -r .token)
   ```

2. **Create Table**
   ```bash
   curl -X POST http://localhost:8080/v1/database/create-table \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "schema": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
     }'
   ```

3. **Insert Data**
   ```bash
   curl -X POST http://localhost:8080/v1/database/exec \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "INSERT INTO users (name, email) VALUES (?, ?)",
       "args": ["Alice", "alice@example.com"]
     }'
   ```

4. **Query Data**
   ```bash
   curl -X POST http://localhost:8080/v1/database/query \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "sql": "SELECT * FROM users"
     }'
   ```

5. **Test Transaction**
   ```bash
   curl -X POST http://localhost:8080/v1/database/transaction \
     -H "Authorization: Bearer $TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "database": "testdb",
       "queries": [
         "INSERT INTO users (name, email) VALUES (\"Bob\", \"bob@example.com\")",
         "INSERT INTO users (name, email) VALUES (\"Charlie\", \"charlie@example.com\")"
       ]
     }'
   ```

6. **Get Schema**
   ```bash
   curl -X GET "http://localhost:8080/v1/database/schema?database=testdb" \
     -H "Authorization: Bearer $TOKEN"
   ```

7. **Test Hibernation**
   - Wait for the hibernation timeout
   - Query again and measure wake-up time
   - Should see a delay on the first query after hibernation

**Expected Results:**

- All API calls succeed
- Data persists across calls
- Transactions are atomic
- Schema reflects created tables
- Hibernation/wake-up transparent to the API
- Response times reasonable (< 30s for queries)

---
## Test Checklist

### Unit Tests (To Implement)

- [ ] Metadata Store operations
- [ ] Metadata Store concurrency
- [ ] Vector Clock increment
- [ ] Vector Clock merge
- [ ] Vector Clock compare
- [ ] Coordinator election (single node)
- [ ] Coordinator election (multiple nodes)
- [ ] Coordinator election (deterministic)
- [ ] Port Manager allocation
- [ ] Port Manager release
- [ ] Port Manager exhaustion
- [ ] Port Manager specific ports
- [ ] RQLite Instance creation
- [ ] RQLite Instance IsIdle
- [ ] Message marshal/unmarshal (all types)
- [ ] Coordinator response collection
- [ ] Coordinator node selection
- [ ] Coordinator registry

### Integration Tests (To Implement)

- [ ] Single node database creation
- [ ] Three node database creation
- [ ] Multiple databases isolation
- [ ] Hibernation cycle
- [ ] Wake-up cycle
- [ ] Node failure detection
- [ ] Node replacement
- [ ] Orphaned data cleanup
- [ ] Concurrent database creation
- [ ] Concurrent hibernation

### Manual Tests (To Perform)

- [ ] Basic three node flow
- [ ] Hibernation and wake-up
- [ ] Multiple databases
- [ ] Node failure and recovery
- [ ] Port exhaustion handling
- [ ] Orphaned data cleanup
- [ ] Stress test with many databases

### Performance Validation

- [ ] Database creation < 10s
- [ ] Wake-up time < 8s
- [ ] Metadata sync < 5s
- [ ] Query overhead < 10ms additional

## Running Tests

### Unit Tests

```bash
# Run all tests
go test ./pkg/rqlite/... -v

# Run with race detector
go test ./pkg/rqlite/... -race

# Run specific test
go test ./pkg/rqlite/ -run TestMetadataStore_GetSetDatabase -v

# Run with coverage
go test ./pkg/rqlite/... -cover -coverprofile=coverage.out
go tool cover -html=coverage.out
```

### Integration Tests

```bash
# Run e2e tests
go test ./e2e/... -v -timeout 30m

# Run specific e2e test
go test ./e2e/ -run TestThreeNodeDatabaseCreation -v
```

### Manual Tests

Follow the scenarios above in dedicated terminals for each node.

## Success Criteria

### Correctness

✅ All unit tests pass
✅ All integration tests pass
✅ All manual scenarios complete successfully
✅ No data loss in any scenario
✅ No race conditions detected

### Performance

✅ Database creation < 10 seconds
✅ Wake-up < 8 seconds
✅ Metadata sync < 5 seconds
✅ Query overhead < 10ms

### Reliability

✅ Survives node failures
✅ Automatic recovery works
✅ No orphaned data accumulates
✅ Hibernation/wake-up cycles stable
✅ Concurrent operations safe
## Notes for Future Test Enhancements

When implementing advanced metrics and benchmarks:

1. **Prometheus Metrics Tests**
   - Verify metric export
   - Validate metric values
   - Test metric reset on restart

2. **Benchmark Suite**
   - Automated performance regression detection
   - Latency percentile tracking (p50, p95, p99)
   - Throughput measurements
   - Resource usage profiling

3. **Chaos Engineering**
   - Random node kills
   - Network partitions
   - Clock skew simulation
   - Disk full scenarios

4. **Long-Running Stability**
   - 24-hour soak test
   - Memory leak detection
   - Slow-growing resource usage

## Debugging Failed Tests

### Common Issues

**Port Conflicts**

```bash
# Check for processes using test ports
lsof -i :5001-5999
lsof -i :7001-7999

# Kill stale processes
pkill rqlited
```

**Stale Data**

```bash
# Clean test data directories
rm -rf data/test_*/
rm -rf /tmp/debros_test_*/
```

**Timing Issues**

- Increase timeouts in flaky tests
- Add retry logic with exponential backoff
- Use proper synchronization primitives

**Race Conditions**

```bash
# Always run with race detector during development
go test -race ./...
```