# Per-User Queue Isolation Design

## Overview

The per-user queue system ensures that only **one task per user** executes at a time. This prevents conflicting agent edits and guarantees clean isolation when multiple agents work on the same user's project.

## Problem Statement

Before this implementation, multiple agents could work on the same user's project simultaneously, causing:

- **Edit conflicts** - Agents overwriting each other's changes
- **Race conditions** - Simultaneous file modifications
- **Data inconsistency** - Partial updates and rollbacks
- **Unpredictable behavior** - Non-deterministic execution order

Example conflict:

```
Agent 1: Read file.py (version 1)
Agent 2: Read file.py (version 1)
Agent 1: Modify and write file.py (version 2)
Agent 2: Modify and write file.py (version 2) ← Overwrites Agent 1's changes
```

## Solution Architecture

### 1. Per-User Lock Manager (`per_user_queue_manager.py`)

Implements exclusive file-based locking per user:

```python
manager = PerUserQueueManager()

# Acquire lock (blocks if another task is running for this user)
acquired, lock_id = manager.acquire_lock(user="alice", task_id="task_123", timeout=30)

if acquired:
    # Safe to execute task for this user
    execute_task()

    # Release lock when done
    manager.release_lock(user="alice", lock_id=lock_id)
```

**Lock Mechanism:**

- File-based locks at `/var/lib/luzia/locks/user_{username}.lock`
- Atomic creation using `O_EXCL | O_CREAT` flags
- Metadata file for monitoring and lock info
- Automatic cleanup of stale locks (1-hour timeout)

**Lock Files:**

```
/var/lib/luzia/locks/
├── user_alice.lock   # Lock file (exists = locked)
├── user_alice.json   # Lock metadata (acquired time, pid, etc.)
├── user_bob.lock
└── user_bob.json
```

### 2. Enhanced Queue Controller v2 (`queue_controller_v2.py`)

Extends the original QueueController with per-user awareness:

```python
qc = QueueControllerV2()

# Enqueue task
task_id, position = qc.enqueue(
    project="alice_project",
    prompt="Fix the bug",
    priority=5
)

# Queue daemon respects per-user locks
# - Can select from other users' tasks
# - Skips tasks for users with active locks
# - Fair scheduling across projects/users
```

**Key Features:**

1. **Per-User Task Selection** - Task scheduler checks user locks before dispatch
2. **Capacity Tracking by User** - Monitors active tasks per user
3. **Lock Acquisition Before Dispatch** - Acquires the lock BEFORE starting an agent
4. **Lock Release on Completion** - Cleanup module releases locks when tasks finish

**Capacity JSON Structure:**

```json
{
  "slots": { "max": 4, "used": 2, "available": 2 },
  "by_project": { "alice_project": 1, "bob_project": 1 },
  "by_user": { "alice": 1, "bob": 1 }
}
```

### 3. Conductor Lock Cleanup (`conductor_lock_cleanup.py`)

Manages the lock lifecycle tied to task execution:

```python
cleanup = ConductorLockCleanup()

# Called when task completes
cleanup.check_and_cleanup_conductor_locks(project="alice_project")

# Called periodically to clean stale locks
cleanup.cleanup_stale_task_locks(max_age_seconds=3600)

# Manual lock release (for administrative use)
cleanup.release_task_lock(user="alice", task_id="task_123")
```

**Integration with Conductor:**

Conductor's `meta.json` tracks lock information:

```json
{
  "id": "task_123",
  "status": "completed",
  "user": "alice",
  "lock_id": "task_123_1768005905",
  "lock_released": true
}
```

When a task finishes, the cleanup module:

- Detects the final status (completed, failed, cancelled)
- Looks up the associated user and lock_id
- Releases the lock

## Configuration

Enable per-user serialization in config:

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```

**Settings:**

- `enabled`: Toggle per-user locking on/off
- `lock_timeout_seconds`: Maximum lock age before stale lock cleanup (default: 1 hour)
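As a minimal, runnable sketch of the locking scheme described above — not the real `PerUserQueueManager`, and using a throwaway directory instead of `/var/lib/luzia/locks` (timeouts and stale-lock handling are omitted) — acquisition and release might look like:

```python
import json
import os
import tempfile
import time
from pathlib import Path

# The real manager uses /var/lib/luzia/locks; this sketch uses a throwaway
# directory so it can run anywhere.
LOCK_DIR = Path(tempfile.mkdtemp())


def acquire_lock(user: str, task_id: str):
    """Return a lock_id if the per-user lock was acquired, else None."""
    lock_path = LOCK_DIR / f"user_{user}.lock"
    try:
        # O_CREAT | O_EXCL is atomic: creation fails with FileExistsError
        # if the lock file already exists, so exactly one caller wins.
        os.close(os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644))
    except FileExistsError:
        return None
    lock_id = f"{task_id}_{int(time.time())}"
    meta = {
        "user": user,
        "task_id": task_id,
        "lock_id": lock_id,
        "acquired_by_pid": os.getpid(),
    }
    # Companion metadata file, mirroring the user_{name}.json layout above
    (LOCK_DIR / f"user_{user}.json").write_text(json.dumps(meta))
    return lock_id


def release_lock(user: str, lock_id: str) -> bool:
    """Release the lock only if lock_id matches the stored metadata."""
    meta_path = LOCK_DIR / f"user_{user}.json"
    if not meta_path.exists() or json.loads(meta_path.read_text())["lock_id"] != lock_id:
        return False
    (LOCK_DIR / f"user_{user}.lock").unlink(missing_ok=True)
    meta_path.unlink(missing_ok=True)
    return True
```

Requiring the matching `lock_id` on release means a caller holding a stale id cannot release a lock that has since been re-acquired by another task.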
## Task Execution Flow

### Normal Flow

```
1. Task Enqueued
   ↓
2. Queue Daemon Polls
   - Get pending tasks
   - Check system capacity
   ↓
3. Task Selection
   - Filter by fair share rules
   - Check user has no active lock
   ↓
4. Lock Acquisition
   - Try to acquire per-user lock
   - If it fails, skip this task (another task is running for the user)
   ↓
5. Dispatch
   - Create conductor directory
   - Write meta.json with lock_id
   - Spawn agent
   ↓
6. Agent Execution
   - Agent has exclusive access to user's project
   ↓
7. Completion
   - Agent finishes (success/failure/timeout)
   - Conductor status updated
   ↓
8. Lock Cleanup
   - Watchdog detects completion
   - Conductor cleanup module releases lock
   ↓
9. Ready for Next Task
   - Lock released
   - Queue daemon can select next task for this user
```

### Contention Scenario

```
Queue Daemon 1                User Lock          Queue Daemon 2
                              (alice: LOCKED)
Try acquire for alice  --->   FAIL
Skip this task
Try next eligible task --->   alice_task_2
                                                 Try acquire for alice ---> FAIL
                                                 Try different user (bob) -> SUCCESS
                                                 Start bob's task
                              alice: LOCKED
                              bob: LOCKED

(after alice task completes)
                              (alice: RELEASED)
Polling...
Try acquire for alice  --->   SUCCESS
Start alice_task_3
                              alice: LOCKED
                              bob: LOCKED
```

## Monitoring and Status

### Queue Status

```python
qc = QueueControllerV2()
status = qc.get_queue_status()
```

Output includes:

```json
{
  "pending": { "high": 2, "normal": 5, "total": 7 },
  "active": {
    "slots_used": 2,
    "slots_max": 4,
    "by_project": {"alice_project": 1, "bob_project": 1},
    "by_user": {"alice": 1, "bob": 1}
  },
  "user_locks": {
    "active": 2,
    "details": [
      {
        "user": "alice",
        "lock_id": "task_123_1768005905",
        "task_id": "task_123",
        "acquired_at": "2024-01-09T15:30:45...",
        "acquired_by_pid": 12345,
        "expires_at": "2024-01-09T16:30:45..."
      },
      {
        "user": "bob",
        "lock_id": "task_124_1768005906",
        "task_id": "task_124",
        "acquired_at": "2024-01-09T15:31:10...",
        "acquired_by_pid": 12346,
        "expires_at": "2024-01-09T16:31:10..."
      }
    ]
  }
}
```
### Active Locks

```bash
# Check all active locks
python3 lib/per_user_queue_manager.py list_locks

# Check specific user
python3 lib/per_user_queue_manager.py check alice

# Release specific lock (admin)
python3 lib/conductor_lock_cleanup.py release alice task_123
```

## Stale Lock Recovery

Locks are automatically cleaned up if:

1. **Age Exceeded** - Lock is older than `lock_timeout_seconds` (default 1 hour)
2. **Expired Metadata** - Lock metadata has an `expires_at` in the past
3. **Manual Cleanup** - Administrator runs a cleanup command

**Cleanup Triggers:**

```bash
# Automatic (run by daemon periodically)
cleanup.cleanup_all_stale_locks()

# Manual (administrative)
python3 lib/conductor_lock_cleanup.py cleanup_stale 3600

# Per-project
python3 lib/conductor_lock_cleanup.py check_project alice_project
```

## Implementation Details

### Lock Atomicity

Lock acquisition is atomic, using OS-level primitives:

```python
# Atomic lock creation - only one process succeeds
fd = os.open(
    lock_path,
    os.O_CREAT | os.O_EXCL | os.O_WRONLY,  # Fail if the file exists
    0o644
)
```

There are no race conditions because `O_EXCL` is atomic at the filesystem level.

### Lock Ordering

To prevent deadlocks:

1. Always acquire the per-user lock BEFORE any other resources
2. Always release the per-user lock AFTER all other operations
3. Never hold multiple user locks simultaneously

### Lock Duration

Typical lock lifecycle:

- **Acquisition**: < 100 ms
- **Holding**: Variable (task duration, typically 5-60 seconds)
- **Release**: < 100 ms
- **Timeout**: 3600 seconds (1 hour) - prevents forever-locked users

## Testing

Comprehensive test suite in `tests/test_per_user_queue.py`:

```bash
cd /opt/server-agents/orchestrator
python3 tests/test_per_user_queue.py
```

**Tests Included:**

1. Basic lock acquire/release
2. Concurrent lock contention
3. Stale lock cleanup
4. Multiple user independence
5. QueueControllerV2 integration
6. Fair scheduling with locks
**Expected Results:**

```
Results: 6 passed, 0 failed
```

## Integration Points

### Conductor Integration

Conductor metadata tracks the user and lock:

```json
{
  "meta.json": {
    "id": "task_id",
    "user": "alice",
    "lock_id": "task_id_timestamp",
    "status": "running|completed|failed"
  }
}
```

### Watchdog Integration

Watchdog detects task completion and triggers cleanup:

```python
# In watchdog loop
conductor_dir = Path(f"/home/{project}/conductor/active/{task_id}")
if is_task_complete(conductor_dir):
    lock_cleanup.check_and_cleanup_conductor_locks(project)
```

### Daemon Integration

The queue daemon respects user locks during task selection:

```python
# In queue daemon
while True:
    capacity = read_capacity()
    if has_capacity(capacity):
        task = select_next_task(capacity)  # Respects per-user locks
        if task:
            dispatch(task)
    time.sleep(poll_interval)
```

## Performance Implications

### Lock Overhead

- **Acquisition**: ~1-5 ms (filesystem I/O)
- **Check Active**: ~1 ms (metadata file read)
- **Release**: ~1-5 ms (filesystem I/O)
- **Total per task**: < 20 ms overhead

### Scalability

- Per-user locking has O(1) complexity
- No contention between different users
- Fair sharing prevents starvation
- Tested with 100+ pending tasks

## Failure Handling

### Agent Crash

```
1. Agent crashes (no heartbeat)
2. Watchdog detects missing heartbeat
3. Task marked as failed in conductor
4. Lock cleanup runs, detects failed task
5. Lock released for user
6. Next task can proceed
```

### Queue Daemon Crash

```
1. Queue daemon dies (no dispatch)
2. Locks remain and stale ones accumulate
3. New queue daemon starts
4. Periodic cleanup removes stale locks
5. System recovers
```

### Lock File Corruption

```
1. Lock metadata corrupted
2. Cleanup detects invalid metadata
3. Lock file removed (safe)
4. Lock acquired again for the same user
```
## Configuration Recommendations

### Development

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 300
  }
}
```

Short timeout (5 minutes) for testing.

### Production

```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```

Standard timeout of 1 hour.

### Debugging (Disabled)

```json
{
  "per_user_serialization": {
    "enabled": false
  }
}
```

Disable for debugging or for testing parallel execution.

## Migration from Old System

The old system allowed concurrent tasks per user. Migration is safe:

1. **Enable gradually**: Set `enabled: true`
2. **Monitor**: Watch task queue logs for impact
3. **Adjust timeout**: Increase it if tasks need more time
4. **Deploy**: No data migration needed

The system is backward compatible - old queue tasks continue to work.

## Future Enhancements

1. **Per-project locks** - If projects gain concurrent users
2. **Priority-based waiting** - High-priority tasks skip the queue
3. **Task grouping** - Related tasks stay together
4. **Preemptive cancellation** - Kill stale tasks automatically
5. **Lock analytics** - Track lock contention and timing

## References

- [Per-User Queue Manager](per_user_queue_manager.py)
- [Queue Controller v2](queue_controller_v2.py)
- [Conductor Lock Cleanup](conductor_lock_cleanup.py)
- [Test Suite](tests/test_per_user_queue.py)