# Per-User Queue Isolation Design
## Overview
The per-user queue system ensures that only **one task per user** can execute concurrently. This prevents agent edit conflicts and ensures clean isolation when multiple agents work on the same user's project.
## Problem Statement
Before this implementation, multiple agents could simultaneously work on the same user's project, causing:
- **Edit conflicts** - Agents overwriting each other's changes
- **Race conditions** - Simultaneous file modifications
- **Data inconsistency** - Partial updates and rollbacks
- **Unpredictable behavior** - Non-deterministic execution order
Example conflict:
```
Agent 1: Read file.py (version 1)
Agent 2: Read file.py (version 1)
Agent 1: Modify and write file.py (version 2)
Agent 2: Modify and write file.py (version 2) ← Overwrites Agent 1's changes
```
## Solution Architecture
### 1. Per-User Lock Manager (`per_user_queue_manager.py`)
Implements exclusive file-based locking per user:
```python
manager = PerUserQueueManager()

# Acquire lock (blocks if another task is running for this user)
acquired, lock_id = manager.acquire_lock(user="alice", task_id="task_123", timeout=30)

if acquired:
    # Safe to execute task for this user
    execute_task()

    # Release lock when done
    manager.release_lock(user="alice", lock_id=lock_id)
```
**Lock Mechanism:**
- File-based locks at `/var/lib/luzia/locks/user_{username}.lock`
- Atomic creation using `O_EXCL | O_CREAT` flags
- Metadata file for monitoring and lock info
- Automatic cleanup of stale locks (1-hour timeout)
**Lock Files:**
```
/var/lib/luzia/locks/
├── user_alice.lock # Lock file (exists = locked)
├── user_alice.json # Lock metadata (acquired time, pid, etc)
├── user_bob.lock
└── user_bob.json
```
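A minimal sketch of how an acquire could be built from these pieces. The function name, the exact metadata fields, and storing the `lock_id` inside the lock file are illustrative assumptions, not the real `PerUserQueueManager` API:

```python
import json
import os
import time
from pathlib import Path

LOCK_DIR = Path("/var/lib/luzia/locks")  # location taken from this document


def acquire_user_lock(user: str, task_id: str, timeout_s: int = 3600):
    """Try to take the exclusive per-user lock; return (acquired, lock_id)."""
    lock_path = LOCK_DIR / f"user_{user}.lock"
    meta_path = LOCK_DIR / f"user_{user}.json"
    lock_id = f"{task_id}_{int(time.time())}"
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False, None  # another task already holds this user's lock
    with os.fdopen(fd, "w") as f:
        f.write(lock_id)
    # Sidecar metadata used for monitoring and stale-lock cleanup.
    meta_path.write_text(json.dumps({
        "lock_id": lock_id,
        "task_id": task_id,
        "acquired_by_pid": os.getpid(),
        "acquired_at": time.time(),
        "expires_at": time.time() + timeout_s,
    }))
    return True, lock_id
```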
### 2. Enhanced Queue Controller v2 (`queue_controller_v2.py`)
Extends original QueueController with per-user awareness:
```python
qc = QueueControllerV2()

# Enqueue task
task_id, position = qc.enqueue(
    project="alice_project",
    prompt="Fix the bug",
    priority=5,
)

# Queue daemon respects per-user locks:
# - Can select from other users' tasks
# - Skips tasks for users with active locks
# - Fair scheduling across projects/users
```
**Key Features:**
1. **Per-User Task Selection** - Task scheduler checks user locks before dispatch
2. **Capacity Tracking by User** - Monitors active tasks per user
3. **Lock Acquisition Before Dispatch** - Acquires lock BEFORE starting agent
4. **Lock Release on Completion** - Cleanup module releases locks when tasks finish
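As a rough illustration of features 1 and 3, the selection step might look like the sketch below. `is_locked()` is a hypothetical helper standing in for the lock check; the real controller also applies fair-share ordering first:

```python
def select_next_task(pending_tasks, capacity, lock_manager):
    """Pick the next dispatchable task, skipping users whose lock is held."""
    if capacity["slots"]["used"] >= capacity["slots"]["max"]:
        return None  # no free slot at all
    # Higher priority first, then oldest enqueue time.
    for task in sorted(pending_tasks, key=lambda t: (-t["priority"], t["enqueued_at"])):
        # Skip users whose per-user lock is already held by a running task.
        if lock_manager.is_locked(task["user"]):
            continue
        return task
    return None
```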
**Capacity JSON Structure:**
```json
{
  "slots": {
    "max": 4,
    "used": 2,
    "available": 2
  },
  "by_project": {
    "alice_project": 1,
    "bob_project": 1
  },
  "by_user": {
    "alice": 1,
    "bob": 1
  }
}
```
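The `by_user` section is what per-user selection consults. A sketch of how such a snapshot could be derived from the active-task list (how the real controller persists it may differ):

```python
from collections import Counter


def build_capacity(active_tasks, max_slots=4):
    """Summarise slot usage in the shape shown above (illustrative only)."""
    by_project = Counter(t["project"] for t in active_tasks)
    by_user = Counter(t["user"] for t in active_tasks)
    used = len(active_tasks)
    return {
        "slots": {"max": max_slots, "used": used, "available": max_slots - used},
        "by_project": dict(by_project),
        "by_user": dict(by_user),
    }
```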
### 3. Conductor Lock Cleanup (`conductor_lock_cleanup.py`)
Manages lock lifecycle tied to task execution:
```python
cleanup = ConductorLockCleanup()
# Called when task completes
cleanup.check_and_cleanup_conductor_locks(project="alice_project")
# Called periodically to clean stale locks
cleanup.cleanup_stale_task_locks(max_age_seconds=3600)
# Manual lock release (for administrative use)
cleanup.release_task_lock(user="alice", task_id="task_123")
```
**Integration with Conductor:**
Conductor's `meta.json` tracks lock information:
```json
{
  "id": "task_123",
  "status": "completed",
  "user": "alice",
  "lock_id": "task_123_1768005905",
  "lock_released": true
}
```
When a task finishes, the cleanup module:
- Detects the final status (completed, failed, cancelled)
- Looks up the associated user and lock_id
- Releases the lock
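A sketch of that completion check, assuming the `meta.json` layout shown above (the actual cleanup module may read more fields):

```python
import json
from pathlib import Path

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def cleanup_finished_task(conductor_dir: Path, lock_manager) -> bool:
    """Release the per-user lock once the conductor marks the task terminal."""
    meta_path = conductor_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if meta.get("status") not in TERMINAL_STATES or meta.get("lock_released"):
        return False
    # Release using the user and lock_id recorded at dispatch time.
    lock_manager.release_lock(user=meta["user"], lock_id=meta["lock_id"])
    meta["lock_released"] = True
    meta_path.write_text(json.dumps(meta, indent=2))
    return True
```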
## Configuration
Enable per-user serialization in config:
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```
**Settings:**
- `enabled`: Toggle per-user locking on/off
- `lock_timeout_seconds`: Maximum time before stale lock cleanup (1 hour default)
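For example, the settings could be read like this; the config path is an assumption, only the key names come from the snippet above:

```python
import json


def load_serialization_settings(path="config.json"):  # path is illustrative
    """Read the per_user_serialization block, falling back to safe defaults."""
    with open(path) as f:
        cfg = json.load(f).get("per_user_serialization", {})
    return cfg.get("enabled", False), cfg.get("lock_timeout_seconds", 3600)
```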
## Task Execution Flow
### Normal Flow
```
1. Task Enqueued
2. Queue Daemon Polls
   - Get pending tasks
   - Check system capacity
3. Task Selection
   - Filter by fair share rules
   - Check user has no active lock
4. Lock Acquisition
   - Try to acquire per-user lock
   - If it fails, skip this task (another task is running for this user)
5. Dispatch
   - Create conductor directory
   - Write meta.json with lock_id
   - Spawn agent
6. Agent Execution
   - Agent has exclusive access to the user's project
7. Completion
   - Agent finishes (success/failure/timeout)
   - Conductor status updated
8. Lock Cleanup
   - Watchdog detects completion
   - Conductor cleanup module releases the lock
9. Ready for Next Task
   - Lock released
   - Queue daemon can select the next task for this user
```
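Steps 4 and 5 can be wired together roughly as follows. This is a sketch: `write_conductor_meta` and `spawn_agent` are injected placeholders, not the real dispatch helpers.

```python
def try_dispatch(task, lock_manager, write_conductor_meta, spawn_agent):
    """Acquire the user's lock, record it in meta.json, then spawn the agent."""
    acquired, lock_id = lock_manager.acquire_lock(
        user=task["user"], task_id=task["id"], timeout=30
    )
    if not acquired:
        return False  # another task is running for this user; skip for now
    try:
        write_conductor_meta(task, lock_id)  # persists lock_id for later cleanup
        spawn_agent(task)
        return True
    except Exception:
        # If dispatch fails before the agent starts, give the lock back.
        lock_manager.release_lock(user=task["user"], lock_id=lock_id)
        raise
```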
### Contention Scenario
```
Queue Daemon 1                    User Locks         Queue Daemon 2
--------------                    ----------         --------------
                                  alice: LOCKED
Try acquire for alice -> FAIL
Skip this task
                                                     Try next eligible task -> alice_task_2
                                                     Try acquire for alice -> FAIL
                                                     Try different user (bob) -> SUCCESS
                                                     Start bob's task
                                  alice: LOCKED
                                  bob:   LOCKED

        (alice's original task completes)
                                  alice: RELEASED
Polling...
Try acquire for alice -> SUCCESS
Start alice_task_3                alice: LOCKED
                                  bob:   LOCKED
```
## Monitoring and Status
### Queue Status
```python
qc = QueueControllerV2()
status = qc.get_queue_status()

# Output includes:
{
    "pending": {
        "high": 2,
        "normal": 5,
        "total": 7
    },
    "active": {
        "slots_used": 2,
        "slots_max": 4,
        "by_project": {"alice_project": 1, "bob_project": 1},
        "by_user": {"alice": 1, "bob": 1}
    },
    "user_locks": {
        "active": 2,
        "details": [
            {
                "user": "alice",
                "lock_id": "task_123_1768005905",
                "task_id": "task_123",
                "acquired_at": "2024-01-09T15:30:45...",
                "acquired_by_pid": 12345,
                "expires_at": "2024-01-09T16:30:45..."
            },
            {
                "user": "bob",
                "lock_id": "task_124_1768005906",
                "task_id": "task_124",
                "acquired_at": "2024-01-09T15:31:10...",
                "acquired_by_pid": 12346,
                "expires_at": "2024-01-09T16:31:10..."
            }
        ]
    }
}
```
### Active Locks
```bash
# Check all active locks
python3 lib/per_user_queue_manager.py list_locks
# Check specific user
python3 lib/per_user_queue_manager.py check alice
# Release specific lock (admin)
python3 lib/conductor_lock_cleanup.py release alice task_123
```
## Stale Lock Recovery
Locks are automatically cleaned if:
1. **Age Exceeded** - Lock older than `lock_timeout_seconds` (default 1 hour)
2. **Expired Metadata** - Lock metadata has `expires_at` in the past
3. **Manual Cleanup** - Administrator runs cleanup command
**Cleanup Triggers:**
```bash
# Automatic (run by daemon periodically)
cleanup.cleanup_all_stale_locks()
# Manual (administrative)
python3 lib/conductor_lock_cleanup.py cleanup_stale 3600
# Per-project
python3 lib/conductor_lock_cleanup.py check_project alice_project
```
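A sketch of the age-based part of that cleanup, using lock-file modification time as the age signal; the real module may also parse `expires_at` from the metadata:

```python
import time
from pathlib import Path

LOCK_DIR = Path("/var/lib/luzia/locks")


def cleanup_stale_locks(max_age_seconds: int = 3600) -> int:
    """Remove lock/metadata pairs older than max_age_seconds; return the count."""
    removed = 0
    now = time.time()
    for lock_path in LOCK_DIR.glob("user_*.lock"):
        if now - lock_path.stat().st_mtime < max_age_seconds:
            continue  # still within its allowed lifetime
        lock_path.unlink(missing_ok=True)
        lock_path.with_suffix(".json").unlink(missing_ok=True)
        removed += 1
    return removed
```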
## Implementation Details
### Lock Atomicity
Lock acquisition is atomic using OS-level primitives:
```python
# Atomic lock creation - only one process succeeds
fd = os.open(
    lock_path,
    os.O_CREAT | os.O_EXCL | os.O_WRONLY,  # fail if the file already exists
    0o644
)
```
Race conditions are avoided because `O_CREAT | O_EXCL` creation is atomic at the filesystem level: if two processes race, exactly one `os.open` call succeeds and the other fails with `FileExistsError`.
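Release is the mirror image. A sketch that checks ownership before unlinking; storing the `lock_id` inside the lock file is an assumption, not something the document states:

```python
import os
from pathlib import Path


def release_user_lock(user: str, lock_id: str, lock_dir="/var/lib/luzia/locks") -> bool:
    """Release only if the lock file still carries the caller's lock_id."""
    lock_path = Path(lock_dir) / f"user_{user}.lock"
    try:
        # Guard against a late release clobbering a newer holder's lock.
        if lock_path.read_text().strip() != lock_id:
            return False
    except FileNotFoundError:
        return False
    lock_path.unlink()
    (Path(lock_dir) / f"user_{user}.json").unlink(missing_ok=True)
    return True
```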
### Lock Ordering
To prevent deadlocks:
1. Always acquire per-user lock BEFORE any other resources
2. Always release per-user lock AFTER all operations
3. Never hold multiple user locks simultaneously
### Lock Duration
Typical lock lifecycle:
- **Acquisition**: < 100ms
- **Holding**: Variable (task duration, typically 5-60 seconds)
- **Release**: < 100ms
- **Timeout**: 3600 seconds (1 hour) - prevents forever-locked users
## Testing
Comprehensive test suite in `tests/test_per_user_queue.py`:
```bash
cd /opt/server-agents/orchestrator
python3 tests/test_per_user_queue.py
```
**Tests Included:**
1. Basic lock acquire/release
2. Concurrent lock contention
3. Stale lock cleanup
4. Multiple user independence
5. QueueControllerV2 integration
6. Fair scheduling with locks
**Expected Results:**
```
Results: 6 passed, 0 failed
```
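Test 2 (concurrent lock contention) can be reproduced in a few lines against the API shown earlier; the import path is assumed from the file name:

```python
from per_user_queue_manager import PerUserQueueManager  # import path assumed


def test_concurrent_contention():
    m = PerUserQueueManager()
    ok1, lock1 = m.acquire_lock(user="alice", task_id="t1", timeout=1)
    ok2, _ = m.acquire_lock(user="alice", task_id="t2", timeout=1)
    assert ok1 and not ok2   # second acquire for the same user must fail
    ok3, lock3 = m.acquire_lock(user="bob", task_id="t3", timeout=1)
    assert ok3               # a different user is unaffected
    m.release_lock(user="alice", lock_id=lock1)
    m.release_lock(user="bob", lock_id=lock3)
```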
## Integration Points
### Conductor Integration
Conductor metadata tracks user and lock:
```json
{
  "meta.json": {
    "id": "task_id",
    "user": "alice",
    "lock_id": "task_id_timestamp",
    "status": "running|completed|failed"
  }
}
```
### Watchdog Integration
Watchdog detects task completion and triggers cleanup:
```python
# In watchdog loop
conductor_dir = Path(f"/home/{project}/conductor/active/{task_id}")
if is_task_complete(conductor_dir):
    lock_cleanup.check_and_cleanup_conductor_locks(project)
```
### Daemon Integration
Queue daemon respects user locks in task selection:
```python
# In queue daemon
while True:
    capacity = read_capacity()
    if has_capacity(capacity):
        task = select_next_task(capacity)  # respects per-user locks
        if task:
            dispatch(task)
    time.sleep(poll_interval)
```
## Performance Implications
### Lock Overhead
- **Acquisition**: ~1-5ms (filesystem I/O)
- **Check Active**: ~1ms (metadata file read)
- **Release**: ~1-5ms (filesystem I/O)
- **Total per task**: < 20ms overhead
### Scalability
- Per-user locking has O(1) complexity
- No contention between different users
- Fair sharing prevents starvation
- Tested with 100+ pending tasks
## Failure Handling
### Agent Crash
```
1. Agent crashes (no heartbeat)
2. Watchdog detects missing heartbeat
3. Task marked as failed in conductor
4. Lock cleanup runs, detects failed task
5. Lock released for user
6. Next task can proceed
```
### Queue Daemon Crash
```
1. Queue daemon dies (no dispatch)
2. Locks remain but accumulate stale ones
3. New queue daemon starts
4. Periodic cleanup removes stale locks
5. System recovers
```
### Lock File Corruption
```
1. Lock metadata corrupted
2. Cleanup detects invalid metadata
3. Lock file removed (safe)
4. Lock acquired again for same user
```
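A defensive metadata read along those lines (a sketch; field names follow the examples earlier in this document):

```python
import json
from pathlib import Path


def read_lock_metadata(meta_path: Path):
    """Return parsed lock metadata, or None if it is missing or corrupted.

    A None result tells the cleanup pass that the lock/metadata pair is
    unusable and can be safely removed so the user can acquire again.
    """
    try:
        meta = json.loads(meta_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if "lock_id" not in meta or "user" not in meta:
        return None  # treat incomplete metadata the same as corruption
    return meta
```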
## Configuration Recommendations
### Development
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 300
  }
}
```
```
Short timeout for testing (5 minutes).
### Production
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```
```
Standard timeout of 1 hour.
### Debugging (Disabled)
```json
{
  "per_user_serialization": {
    "enabled": false
  }
}
```
```
Disable for debugging or testing parallel execution.
## Migration from Old System
The old system allowed concurrent tasks per user. Migration is safe:
1. **Enable gradually**: Set `enabled: true`
2. **Monitor**: Watch task queue logs for impact
3. **Adjust timeout**: Increase if tasks need more time
4. **Deploy**: No data migration needed
The system is backward compatible - old queue tasks continue to work.
## Future Enhancements
1. **Per-project locks** - If projects have concurrent users
2. **Priority-based waiting** - High-priority tasks skip the queue
3. **Task grouping** - Related tasks stay together
4. **Preemptive cancellation** - Kill stale tasks automatically
5. **Lock analytics** - Track lock contention and timing
## References
- [Per-User Queue Manager](per_user_queue_manager.py)
- [Queue Controller v2](queue_controller_v2.py)
- [Conductor Lock Cleanup](conductor_lock_cleanup.py)
- [Test Suite](tests/test_per_user_queue.py)