Refactor cockpit to use DockerTmuxController pattern

Based on claude-code-tools TmuxCLIController, this refactor:
- Added DockerTmuxController class for robust tmux session management
- Implements send_keys() with configurable delay_enter
- Implements capture_pane() for output retrieval
- Implements wait_for_prompt() for pattern-based completion detection
- Implements wait_for_idle() for content-hash-based idle detection
- Implements wait_for_shell_prompt() for shell prompt detection

Also includes workflow improvements:
- Pre-task git snapshot before agent execution
- Post-task commit protocol in agent guidelines

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Per-User Queue Isolation Design

## Overview

The per-user queue system ensures that only **one task per user** can execute concurrently. This prevents agent edit conflicts and ensures clean isolation when multiple agents work on the same user's project.

## Problem Statement

Before this implementation, multiple agents could simultaneously work on the same user's project, causing:

- **Edit conflicts** - Agents overwriting each other's changes
- **Race conditions** - Simultaneous file modifications
- **Data inconsistency** - Partial updates and rollbacks
- **Unpredictable behavior** - Non-deterministic execution order

Example conflict:

```
Agent 1: Read file.py (version 1)
Agent 2: Read file.py (version 1)
Agent 1: Modify and write file.py (version 2)
Agent 2: Modify and write file.py (version 2) ← Overwrites Agent 1's changes
```

## Solution Architecture

### 1. Per-User Lock Manager (`per_user_queue_manager.py`)

Implements exclusive file-based locking per user:

```python
manager = PerUserQueueManager()

# Acquire lock (blocks if another task is running for this user)
acquired, lock_id = manager.acquire_lock(user="alice", task_id="task_123", timeout=30)

if acquired:
    # Safe to execute task for this user
    execute_task()

    # Release lock when done
    manager.release_lock(user="alice", lock_id=lock_id)
```

**Lock Mechanism:**
- File-based locks at `/var/lib/luzia/locks/user_{username}.lock`
- Atomic creation using `O_EXCL | O_CREAT` flags
- Metadata file for monitoring and lock info
- Automatic cleanup of stale locks (1-hour timeout)

**Lock Files:**
```
/var/lib/luzia/locks/
├── user_alice.lock   # Lock file (exists = locked)
├── user_alice.json   # Lock metadata (acquired time, pid, etc)
├── user_bob.lock
└── user_bob.json
```
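
The two files above are enough to implement acquisition atomically. A minimal sketch of the acquire path; the helper below is illustrative rather than the exact `PerUserQueueManager` implementation, and timestamps are kept as epoch seconds for brevity:

```python
import json
import os
import time
from pathlib import Path

LOCK_DIR = Path("/var/lib/luzia/locks")

def try_acquire(user: str, task_id: str, timeout_s: int = 3600):
    """Attempt to take the per-user lock; returns (acquired, lock_id)."""
    lock_path = LOCK_DIR / f"user_{user}.lock"
    meta_path = LOCK_DIR / f"user_{user}.json"
    lock_id = f"{task_id}_{int(time.time())}"
    try:
        # O_CREAT | O_EXCL guarantees exactly one process creates the lock file.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False, None  # another task already holds this user's lock
    with os.fdopen(fd, "w") as f:
        f.write(lock_id)
    # Metadata is advisory: it feeds monitoring and stale-lock cleanup.
    meta_path.write_text(json.dumps({
        "lock_id": lock_id,
        "task_id": task_id,
        "acquired_by_pid": os.getpid(),
        "acquired_at": time.time(),
        "expires_at": time.time() + timeout_s,
    }))
    return True, lock_id
```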

### 2. Enhanced Queue Controller v2 (`queue_controller_v2.py`)

Extends the original QueueController with per-user awareness:

```python
qc = QueueControllerV2()

# Enqueue task
task_id, position = qc.enqueue(
    project="alice_project",
    prompt="Fix the bug",
    priority=5
)

# Queue daemon respects per-user locks:
# - Can select from other users' tasks
# - Skips tasks for users with active locks
# - Fair scheduling across projects/users
```

**Key Features:**

1. **Per-User Task Selection** - Task scheduler checks user locks before dispatch (see the sketch below)
2. **Capacity Tracking by User** - Monitors active tasks per user
3. **Lock Acquisition Before Dispatch** - Acquires lock BEFORE starting agent
4. **Lock Release on Completion** - Cleanup module releases locks when tasks finish
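
As referenced in feature 1, a sketch of lock-aware task selection. The names here are illustrative (`is_locked()` is assumed for brevity); the real logic lives in `queue_controller_v2.py`:

```python
def select_next_task(pending_tasks, manager):
    """Pick the first eligible task whose user does not hold an active lock.

    `pending_tasks` is assumed to be ordered by priority / fair-share rules;
    `manager` stands for a PerUserQueueManager-like object.
    """
    for task in pending_tasks:
        if manager.is_locked(task["user"]):
            continue  # a task is already running for this user; skip it this cycle
        return task
    return None  # nothing eligible right now; poll again later
```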

**Capacity JSON Structure:**
```json
{
  "slots": {
    "max": 4,
    "used": 2,
    "available": 2
  },
  "by_project": {
    "alice_project": 1,
    "bob_project": 1
  },
  "by_user": {
    "alice": 1,
    "bob": 1
  }
}
```
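
A sketch of how a dispatcher might combine this capacity file with the one-task-per-user rule; field names follow the JSON above, the helper itself is illustrative:

```python
import json

def can_dispatch(capacity_path: str, user: str) -> bool:
    """True if there is a free slot and the user has no active task."""
    with open(capacity_path) as f:
        capacity = json.load(f)
    if capacity["slots"]["available"] <= 0:
        return False
    # Per-user serialization: at most one active task per user.
    return capacity["by_user"].get(user, 0) == 0
```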

### 3. Conductor Lock Cleanup (`conductor_lock_cleanup.py`)

Manages lock lifecycle tied to task execution:

```python
cleanup = ConductorLockCleanup()

# Called when task completes
cleanup.check_and_cleanup_conductor_locks(project="alice_project")

# Called periodically to clean stale locks
cleanup.cleanup_stale_task_locks(max_age_seconds=3600)

# Manual lock release (for administrative use)
cleanup.release_task_lock(user="alice", task_id="task_123")
```

**Integration with Conductor:**

Conductor's `meta.json` tracks lock information:
```json
{
  "id": "task_123",
  "status": "completed",
  "user": "alice",
  "lock_id": "task_123_1768005905",
  "lock_released": true
}
```

When a task finishes, the cleanup module:
- Detects the final status (completed, failed, cancelled)
- Looks up the associated user and lock_id
- Releases the lock
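
Put together, the completion path might look like the following sketch. The `meta.json` fields match the example above and `release_lock` mirrors the manager API shown earlier; the helper name itself is illustrative:

```python
import json
from pathlib import Path

FINAL_STATUSES = {"completed", "failed", "cancelled"}

def cleanup_task_lock(conductor_dir: Path, manager) -> bool:
    """Release the per-user lock once a task's meta.json shows a final status."""
    meta_path = conductor_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if meta.get("status") not in FINAL_STATUSES or meta.get("lock_released"):
        return False  # still running, or already released
    manager.release_lock(user=meta["user"], lock_id=meta["lock_id"])
    meta["lock_released"] = True
    meta_path.write_text(json.dumps(meta, indent=2))
    return True
```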

## Configuration

Enable per-user serialization in config:

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```

**Settings:**
- `enabled`: Toggle per-user locking on/off
- `lock_timeout_seconds`: Maximum time before stale lock cleanup (1 hour default)
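
Reading these settings is a one-liner per field; a sketch, assuming the block above lives in a JSON config file (the path is illustrative):

```python
import json

def load_lock_settings(config_path: str = "config.json"):
    """Return (enabled, lock_timeout_seconds) with the documented defaults."""
    with open(config_path) as f:
        cfg = json.load(f).get("per_user_serialization", {})
    return cfg.get("enabled", False), cfg.get("lock_timeout_seconds", 3600)
```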

## Task Execution Flow

### Normal Flow

```
1. Task Enqueued
       ↓
2. Queue Daemon Polls
   - Get pending tasks
   - Check system capacity
       ↓
3. Task Selection
   - Filter by fair share rules
   - Check user has no active lock
       ↓
4. Lock Acquisition
   - Try to acquire per-user lock
   - If it fails, skip this task (another task is running for this user)
       ↓
5. Dispatch
   - Create conductor directory
   - Write meta.json with lock_id
   - Spawn agent
       ↓
6. Agent Execution
   - Agent has exclusive access to user's project
       ↓
7. Completion
   - Agent finishes (success/failure/timeout)
   - Conductor status updated
       ↓
8. Lock Cleanup
   - Watchdog detects completion
   - Conductor cleanup module releases lock
       ↓
9. Ready for Next Task
   - Lock released
   - Queue daemon can select next task for this user
```
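
Steps 4-5 reduce to an acquire-then-dispatch cycle with release on failure; a condensed sketch, where the `write_meta` and `spawn_agent` callables are placeholders for logic outside this sketch:

```python
def dispatch_with_lock(task, manager, write_meta, spawn_agent) -> bool:
    """Acquire the user's lock, record it, then spawn the agent."""
    acquired, lock_id = manager.acquire_lock(
        user=task["user"], task_id=task["id"], timeout=30
    )
    if not acquired:
        return False  # another task is already running for this user; leave it queued
    try:
        write_meta(task, lock_id)  # meta.json records lock_id so cleanup can release it
        spawn_agent(task)
    except Exception:
        # The agent never started, so release now instead of waiting for stale cleanup.
        manager.release_lock(user=task["user"], lock_id=lock_id)
        raise
    return True
```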

### Contention Scenario

```
Queue Daemon 1                            User Lock              Queue Daemon 2
                                          (alice: LOCKED)
Try acquire for alice ---> FAIL
Skip this task
Try next eligible task ---> alice_task_2
Try acquire for alice ---> FAIL
Try different user (bob) -> SUCCESS
Start bob's task                          alice: LOCKED
                                          bob: LOCKED

                                          (after alice task completes)
                                          (alice: RELEASED)

                                                                  Polling...
                                                                  Try acquire for alice ---> SUCCESS
                                          alice: LOCKED           Start alice_task_3
                                          bob: LOCKED
```

## Monitoring and Status

### Queue Status

```python
qc = QueueControllerV2()
status = qc.get_queue_status()

# Output includes:
{
    "pending": {
        "high": 2,
        "normal": 5,
        "total": 7
    },
    "active": {
        "slots_used": 2,
        "slots_max": 4,
        "by_project": {"alice_project": 1, "bob_project": 1},
        "by_user": {"alice": 1, "bob": 1}
    },
    "user_locks": {
        "active": 2,
        "details": [
            {
                "user": "alice",
                "lock_id": "task_123_1768005905",
                "task_id": "task_123",
                "acquired_at": "2024-01-09T15:30:45...",
                "acquired_by_pid": 12345,
                "expires_at": "2024-01-09T16:30:45..."
            },
            {
                "user": "bob",
                "lock_id": "task_124_1768005906",
                "task_id": "task_124",
                "acquired_at": "2024-01-09T15:31:10...",
                "acquired_by_pid": 12346,
                "expires_at": "2024-01-09T16:31:10..."
            }
        ]
    }
}
```
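
Continuing from the snippet above, a quick way to list who currently holds locks (field names follow the example output):

```python
status = qc.get_queue_status()
for lock in status["user_locks"]["details"]:
    print(f"{lock['user']}: task {lock['task_id']} locked since {lock['acquired_at']}")
```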

### Active Locks

```bash
# Check all active locks
python3 lib/per_user_queue_manager.py list_locks

# Check specific user
python3 lib/per_user_queue_manager.py check alice

# Release specific lock (admin)
python3 lib/conductor_lock_cleanup.py release alice task_123
```

## Stale Lock Recovery

Locks are automatically cleaned if:

1. **Age Exceeded** - Lock older than `lock_timeout_seconds` (default 1 hour)
2. **Expired Metadata** - Lock metadata has `expires_at` in the past
3. **Manual Cleanup** - Administrator runs cleanup command

**Cleanup Triggers:**

```bash
# Automatic (run by daemon periodically)
cleanup.cleanup_all_stale_locks()

# Manual (administrative)
python3 lib/conductor_lock_cleanup.py cleanup_stale 3600

# Per-project
python3 lib/conductor_lock_cleanup.py check_project alice_project
```
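
The staleness test behind these commands combines criteria 1 and 2 above; a sketch, assuming the metadata fields shown earlier (epoch-second timestamps for brevity):

```python
import json
import time
from pathlib import Path

def is_stale(meta_path: Path, max_age_seconds: int = 3600) -> bool:
    """A lock is stale if its metadata is unreadable, expired, or too old."""
    try:
        meta = json.loads(meta_path.read_text())
    except (OSError, json.JSONDecodeError):
        return True  # corrupt or missing metadata is treated as stale (see Failure Handling)
    now = time.time()
    if meta.get("expires_at", 0) < now:
        return True
    return now - meta.get("acquired_at", now) > max_age_seconds
```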

## Implementation Details

### Lock Atomicity

Lock acquisition is atomic using OS-level primitives:

```python
# Atomic lock creation - only one process succeeds
fd = os.open(
    lock_path,
    os.O_CREAT | os.O_EXCL | os.O_WRONLY,  # Fail if exists
    0o644
)
```

There are no race conditions because `O_EXCL` is atomic at the filesystem level.
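
Release is the mirror image. A minimal sketch with an ownership check, so that a stale-lock cleanup followed by re-acquisition cannot be undone by the old holder; illustrative, not the exact implementation:

```python
import os

def release(lock_path: str, meta_path: str, lock_id: str) -> bool:
    """Remove the lock only if we still own it (the lock file stores our lock_id)."""
    try:
        with open(lock_path) as f:
            if f.read().strip() != lock_id:
                return False  # someone else re-acquired after a stale cleanup
    except FileNotFoundError:
        return False  # already released or cleaned up
    os.unlink(lock_path)
    try:
        os.unlink(meta_path)
    except FileNotFoundError:
        pass  # orphaned metadata is harmless; cleanup will remove it
    return True
```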

### Lock Ordering

To prevent deadlocks:
1. Always acquire per-user lock BEFORE any other resources
2. Always release per-user lock AFTER all operations
3. Never hold multiple user locks simultaneously

### Lock Duration

Typical lock lifecycle:
- **Acquisition**: < 100ms
- **Holding**: Variable (task duration, typically 5-60 seconds)
- **Release**: < 100ms
- **Timeout**: 3600 seconds (1 hour) - prevents forever-locked users

## Testing

Comprehensive test suite in `tests/test_per_user_queue.py`:

```bash
cd /opt/server-agents/orchestrator
python3 tests/test_per_user_queue.py
```

**Tests Included:**
1. Basic lock acquire/release
2. Concurrent lock contention
3. Stale lock cleanup
4. Multiple user independence
5. QueueControllerV2 integration
6. Fair scheduling with locks

**Expected Results:**
```
Results: 6 passed, 0 failed
```
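
Tests 1 and 2 can be reproduced by hand; a sketch of the contention case, assuming the acquire/release API shown earlier and that `timeout=0` means "do not wait":

```python
from per_user_queue_manager import PerUserQueueManager  # import path may differ (lib/per_user_queue_manager.py)

manager = PerUserQueueManager()

ok1, lock_id = manager.acquire_lock(user="alice", task_id="task_a", timeout=0)
ok2, _ = manager.acquire_lock(user="alice", task_id="task_b", timeout=0)

assert ok1 is True       # first attempt wins the lock
assert ok2 is False      # second attempt is rejected while the lock is held
manager.release_lock(user="alice", lock_id=lock_id)

ok3, lock_id3 = manager.acquire_lock(user="alice", task_id="task_c", timeout=0)
assert ok3 is True       # lock is available again after release
manager.release_lock(user="alice", lock_id=lock_id3)
```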

## Integration Points

### Conductor Integration

Conductor metadata tracks user and lock:

```json
{
  "meta.json": {
    "id": "task_id",
    "user": "alice",
    "lock_id": "task_id_timestamp",
    "status": "running|completed|failed"
  }
}
```

### Watchdog Integration

Watchdog detects task completion and triggers cleanup:

```python
# In watchdog loop
conductor_dir = Path(f"/home/{project}/conductor/active/{task_id}")
if is_task_complete(conductor_dir):
    lock_cleanup.check_and_cleanup_conductor_locks(project)
```
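
`is_task_complete` is not defined in the snippet above; a plausible sketch, based on the `meta.json` statuses used elsewhere in this document:

```python
import json
from pathlib import Path

def is_task_complete(conductor_dir: Path) -> bool:
    """A task is done once its meta.json reports a final status."""
    meta_path = conductor_dir / "meta.json"
    if not meta_path.exists():
        return False
    try:
        status = json.loads(meta_path.read_text()).get("status")
    except json.JSONDecodeError:
        return False  # partially written meta.json; check again next cycle
    return status in {"completed", "failed", "cancelled"}
```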

### Daemon Integration

Queue daemon respects user locks in task selection:

```python
# In queue daemon
while True:
    capacity = read_capacity()
    if has_capacity(capacity):
        task = select_next_task(capacity)  # Respects per-user locks
        if task:
            dispatch(task)
    time.sleep(poll_interval)
```

## Performance Implications

### Lock Overhead

- **Acquisition**: ~1-5ms (filesystem I/O)
- **Check Active**: ~1ms (metadata file read)
- **Release**: ~1-5ms (filesystem I/O)
- **Total per task**: < 20ms overhead

### Scalability

- Per-user locking has O(1) complexity
- No contention between different users
- Fair sharing prevents starvation
- Tested with 100+ pending tasks

## Failure Handling

### Agent Crash

```
1. Agent crashes (no heartbeat)
2. Watchdog detects missing heartbeat
3. Task marked as failed in conductor
4. Lock cleanup runs, detects failed task
5. Lock released for user
6. Next task can proceed
```

### Queue Daemon Crash

```
1. Queue daemon dies (no dispatch)
2. Existing locks remain and gradually go stale
3. New queue daemon starts
4. Periodic cleanup removes stale locks
5. System recovers
```

### Lock File Corruption

```
1. Lock metadata corrupted
2. Cleanup detects invalid metadata
3. Lock file removed (safe)
4. Lock can be acquired again for the same user
```

## Configuration Recommendations

### Development

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 300
  }
}
```

Short timeout for testing (5 minutes).

### Production

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```

Standard timeout of 1 hour.

### Debugging (Disabled)

```json
{
  "per_user_serialization": {
    "enabled": false
  }
}
```

Disable for debugging or testing parallel execution.

## Migration from Old System

The old system allowed concurrent tasks per user. Migration is safe:

1. **Enable gradually**: Set `enabled: true`
2. **Monitor**: Watch task queue logs for impact
3. **Adjust timeout**: Increase if tasks need more time
4. **Deploy**: No data migration needed

The system is backward compatible - old queue tasks continue to work.

## Future Enhancements

1. **Per-project locks** - If projects have concurrent users
2. **Priority-based waiting** - High-priority tasks skip the queue
3. **Task grouping** - Related tasks stay together
4. **Preemptive cancellation** - Kill stale tasks automatically
5. **Lock analytics** - Track lock contention and timing

## References

- [Per-User Queue Manager](per_user_queue_manager.py)
- [Queue Controller v2](queue_controller_v2.py)
- [Conductor Lock Cleanup](conductor_lock_cleanup.py)
- [Test Suite](tests/test_per_user_queue.py)