# Per-User Queue Isolation Design
## Overview
The per-user queue system ensures that only **one task per user** can execute concurrently. This prevents agent edit conflicts and ensures clean isolation when multiple agents work on the same user's project.
## Problem Statement
Before this implementation, multiple agents could simultaneously work on the same user's project, causing:
- **Edit conflicts** - Agents overwriting each other's changes
- **Race conditions** - Simultaneous file modifications
- **Data inconsistency** - Partial updates and rollbacks
- **Unpredictable behavior** - Non-deterministic execution order
Example conflict:
```
Agent 1: Read file.py (version 1)
Agent 2: Read file.py (version 1)
Agent 1: Modify and write file.py (version 2)
Agent 2: Modify and write file.py (version 2) ← Overwrites Agent 1's changes
```
## Solution Architecture
### 1. Per-User Lock Manager (`per_user_queue_manager.py`)
Implements exclusive file-based locking per user:
```python
manager = PerUserQueueManager()

# Acquire lock (blocks if another task is running for this user)
acquired, lock_id = manager.acquire_lock(user="alice", task_id="task_123", timeout=30)

if acquired:
    # Safe to execute task for this user
    execute_task()

    # Release lock when done
    manager.release_lock(user="alice", lock_id=lock_id)
```
**Lock Mechanism:**
- File-based locks at `/var/lib/luzia/locks/user_{username}.lock`
- Atomic creation using `O_EXCL | O_CREAT` flags
- Metadata file for monitoring and lock info
- Automatic cleanup of stale locks (1-hour timeout)
**Lock Files:**
```
/var/lib/luzia/locks/
├── user_alice.lock # Lock file (exists = locked)
├── user_alice.json # Lock metadata (acquired time, pid, etc)
├── user_bob.lock
└── user_bob.json
```
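A minimal sketch of how an acquire could be built from these pieces. The function name, the exact metadata fields, and storing the `lock_id` inside the lock file are illustrative assumptions, not the real `PerUserQueueManager` API:

```python
import json
import os
import time
from pathlib import Path

LOCK_DIR = Path("/var/lib/luzia/locks")  # location taken from this document


def acquire_user_lock(user: str, task_id: str, timeout_s: int = 3600):
    """Try to take the exclusive per-user lock; return (acquired, lock_id)."""
    lock_path = LOCK_DIR / f"user_{user}.lock"
    meta_path = LOCK_DIR / f"user_{user}.json"
    lock_id = f"{task_id}_{int(time.time())}"
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one caller can win.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False, None  # another task already holds this user's lock
    with os.fdopen(fd, "w") as f:
        f.write(lock_id)
    # Sidecar metadata used for monitoring and stale-lock cleanup.
    meta_path.write_text(json.dumps({
        "lock_id": lock_id,
        "task_id": task_id,
        "acquired_by_pid": os.getpid(),
        "acquired_at": time.time(),
        "expires_at": time.time() + timeout_s,
    }))
    return True, lock_id
```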
### 2. Enhanced Queue Controller v2 (`queue_controller_v2.py`)
Extends original QueueController with per-user awareness:
```python
qc = QueueControllerV2()

# Enqueue task
task_id, position = qc.enqueue(
    project="alice_project",
    prompt="Fix the bug",
    priority=5,
)

# Queue daemon respects per-user locks:
# - Can select from other users' tasks
# - Skips tasks for users with active locks
# - Fair scheduling across projects/users
```
**Key Features:**
1. **Per-User Task Selection** - Task scheduler checks user locks before dispatch
2. **Capacity Tracking by User** - Monitors active tasks per user
3. **Lock Acquisition Before Dispatch** - Acquires lock BEFORE starting agent
4. **Lock Release on Completion** - Cleanup module releases locks when tasks finish
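As a rough illustration of features 1 and 3, the selection step might look like the sketch below. `is_locked()` is a hypothetical helper standing in for the lock check; the real controller also applies fair-share ordering first:

```python
def select_next_task(pending_tasks, capacity, lock_manager):
    """Pick the next dispatchable task, skipping users whose lock is held."""
    if capacity["slots"]["used"] >= capacity["slots"]["max"]:
        return None  # no free slot at all
    # Higher priority first, then oldest enqueue time.
    for task in sorted(pending_tasks, key=lambda t: (-t["priority"], t["enqueued_at"])):
        # Skip users whose per-user lock is already held by a running task.
        if lock_manager.is_locked(task["user"]):
            continue
        return task
    return None
```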
**Capacity JSON Structure:**
```json
{
  "slots": {
    "max": 4,
    "used": 2,
    "available": 2
  },
  "by_project": {
    "alice_project": 1,
    "bob_project": 1
  },
  "by_user": {
    "alice": 1,
    "bob": 1
  }
}
```
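The `by_user` section is what per-user selection consults. A sketch of how such a snapshot could be derived from the active-task list (how the real controller persists it may differ):

```python
from collections import Counter


def build_capacity(active_tasks, max_slots=4):
    """Summarise slot usage in the shape shown above (illustrative only)."""
    by_project = Counter(t["project"] for t in active_tasks)
    by_user = Counter(t["user"] for t in active_tasks)
    used = len(active_tasks)
    return {
        "slots": {"max": max_slots, "used": used, "available": max_slots - used},
        "by_project": dict(by_project),
        "by_user": dict(by_user),
    }
```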
### 3. Conductor Lock Cleanup (`conductor_lock_cleanup.py`)
Manages lock lifecycle tied to task execution:
```python
cleanup = ConductorLockCleanup()
# Called when task completes
cleanup.check_and_cleanup_conductor_locks(project="alice_project")
# Called periodically to clean stale locks
cleanup.cleanup_stale_task_locks(max_age_seconds=3600)
# Manual lock release (for administrative use)
cleanup.release_task_lock(user="alice", task_id="task_123")
```
**Integration with Conductor:**
Conductor's `meta.json` tracks lock information:
```json
{
  "id": "task_123",
  "status": "completed",
  "user": "alice",
  "lock_id": "task_123_1768005905",
  "lock_released": true
}
```
When a task finishes, the cleanup module:
- Detects the final status (completed, failed, cancelled)
- Looks up the associated user and lock_id
- Releases the lock
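A sketch of that completion check, assuming the `meta.json` layout shown above (the actual cleanup module may read more fields):

```python
import json
from pathlib import Path

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def cleanup_finished_task(conductor_dir: Path, lock_manager) -> bool:
    """Release the per-user lock once the conductor marks the task terminal."""
    meta_path = conductor_dir / "meta.json"
    meta = json.loads(meta_path.read_text())
    if meta.get("status") not in TERMINAL_STATES or meta.get("lock_released"):
        return False
    # Release using the user and lock_id recorded at dispatch time.
    lock_manager.release_lock(user=meta["user"], lock_id=meta["lock_id"])
    meta["lock_released"] = True
    meta_path.write_text(json.dumps(meta, indent=2))
    return True
```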
## Configuration
Enable per-user serialization in config:
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```
**Settings:**
- `enabled`: Toggle per-user locking on/off
- `lock_timeout_seconds`: Maximum time before stale lock cleanup (1 hour default)
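For example, the settings could be read like this; the config path is an assumption, only the key names come from the snippet above:

```python
import json


def load_serialization_settings(path="config.json"):  # path is illustrative
    """Read the per_user_serialization block, falling back to safe defaults."""
    with open(path) as f:
        cfg = json.load(f).get("per_user_serialization", {})
    return cfg.get("enabled", False), cfg.get("lock_timeout_seconds", 3600)
```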
## Task Execution Flow
### Normal Flow
```
1. Task Enqueued
2. Queue Daemon Polls
   - Get pending tasks
   - Check system capacity
3. Task Selection
   - Filter by fair share rules
   - Check user has no active lock
4. Lock Acquisition
   - Try to acquire per-user lock
   - If it fails, skip this task (another task is running for this user)
5. Dispatch
   - Create conductor directory
   - Write meta.json with lock_id
   - Spawn agent
6. Agent Execution
   - Agent has exclusive access to the user's project
7. Completion
   - Agent finishes (success/failure/timeout)
   - Conductor status updated
8. Lock Cleanup
   - Watchdog detects completion
   - Conductor cleanup module releases the lock
9. Ready for Next Task
   - Lock released
   - Queue daemon can select the next task for this user
```
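Steps 4 and 5 can be wired together roughly as follows. This is a sketch: `write_conductor_meta` and `spawn_agent` are injected placeholders, not the real dispatch helpers.

```python
def try_dispatch(task, lock_manager, write_conductor_meta, spawn_agent):
    """Acquire the user's lock, record it in meta.json, then spawn the agent."""
    acquired, lock_id = lock_manager.acquire_lock(
        user=task["user"], task_id=task["id"], timeout=30
    )
    if not acquired:
        return False  # another task is running for this user; skip for now
    try:
        write_conductor_meta(task, lock_id)  # persists lock_id for later cleanup
        spawn_agent(task)
        return True
    except Exception:
        # If dispatch fails before the agent starts, give the lock back.
        lock_manager.release_lock(user=task["user"], lock_id=lock_id)
        raise
```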
### Contention Scenario
```
Queue Daemon 1                    User Locks         Queue Daemon 2
--------------                    ----------         --------------
                                  alice: LOCKED
Try acquire for alice -> FAIL
Skip this task
                                                     Try next eligible task -> alice_task_2
                                                     Try acquire for alice -> FAIL
                                                     Try different user (bob) -> SUCCESS
                                                     Start bob's task
                                  alice: LOCKED
                                  bob:   LOCKED

        (alice's original task completes)
                                  alice: RELEASED
Polling...
Try acquire for alice -> SUCCESS
Start alice_task_3                alice: LOCKED
                                  bob:   LOCKED
```
## Monitoring and Status
### Queue Status
```python
qc = QueueControllerV2()
status = qc.get_queue_status()

# Output includes:
{
    "pending": {
        "high": 2,
        "normal": 5,
        "total": 7
    },
    "active": {
        "slots_used": 2,
        "slots_max": 4,
        "by_project": {"alice_project": 1, "bob_project": 1},
        "by_user": {"alice": 1, "bob": 1}
    },
    "user_locks": {
        "active": 2,
        "details": [
            {
                "user": "alice",
                "lock_id": "task_123_1768005905",
                "task_id": "task_123",
                "acquired_at": "2024-01-09T15:30:45...",
                "acquired_by_pid": 12345,
                "expires_at": "2024-01-09T16:30:45..."
            },
            {
                "user": "bob",
                "lock_id": "task_124_1768005906",
                "task_id": "task_124",
                "acquired_at": "2024-01-09T15:31:10...",
                "acquired_by_pid": 12346,
                "expires_at": "2024-01-09T16:31:10..."
            }
        ]
    }
}
```
### Active Locks
```bash
# Check all active locks
python3 lib/per_user_queue_manager.py list_locks
# Check specific user
python3 lib/per_user_queue_manager.py check alice
# Release specific lock (admin)
python3 lib/conductor_lock_cleanup.py release alice task_123
```
## Stale Lock Recovery
Locks are automatically cleaned if:
1. **Age Exceeded** - Lock older than `lock_timeout_seconds` (default 1 hour)
2. **Expired Metadata** - Lock metadata has `expires_at` in the past
3. **Manual Cleanup** - Administrator runs cleanup command
**Cleanup Triggers:**
```bash
# Automatic (run by daemon periodically)
cleanup.cleanup_all_stale_locks()
# Manual (administrative)
python3 lib/conductor_lock_cleanup.py cleanup_stale 3600
# Per-project
python3 lib/conductor_lock_cleanup.py check_project alice_project
```
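A sketch of the age-based part of that cleanup, using lock-file modification time as the age signal; the real module may also parse `expires_at` from the metadata:

```python
import time
from pathlib import Path

LOCK_DIR = Path("/var/lib/luzia/locks")


def cleanup_stale_locks(max_age_seconds: int = 3600) -> int:
    """Remove lock/metadata pairs older than max_age_seconds; return the count."""
    removed = 0
    now = time.time()
    for lock_path in LOCK_DIR.glob("user_*.lock"):
        if now - lock_path.stat().st_mtime < max_age_seconds:
            continue  # still within its allowed lifetime
        lock_path.unlink(missing_ok=True)
        lock_path.with_suffix(".json").unlink(missing_ok=True)
        removed += 1
    return removed
```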
## Implementation Details
### Lock Atomicity
Lock acquisition is atomic using OS-level primitives:
```python
# Atomic lock creation - only one process succeeds
fd = os.open(
    lock_path,
    os.O_CREAT | os.O_EXCL | os.O_WRONLY,  # fail if the file already exists
    0o644
)
```
Race conditions are avoided because `O_CREAT | O_EXCL` creation is atomic at the filesystem level: if two processes race, exactly one `os.open` call succeeds and the other fails with `FileExistsError`.
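Release is the mirror image. A sketch that checks ownership before unlinking; storing the `lock_id` inside the lock file is an assumption, not something the document states:

```python
import os
from pathlib import Path


def release_user_lock(user: str, lock_id: str, lock_dir="/var/lib/luzia/locks") -> bool:
    """Release only if the lock file still carries the caller's lock_id."""
    lock_path = Path(lock_dir) / f"user_{user}.lock"
    try:
        # Guard against a late release clobbering a newer holder's lock.
        if lock_path.read_text().strip() != lock_id:
            return False
    except FileNotFoundError:
        return False
    lock_path.unlink()
    (Path(lock_dir) / f"user_{user}.json").unlink(missing_ok=True)
    return True
```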
### Lock Ordering
To prevent deadlocks:
1. Always acquire per-user lock BEFORE any other resources
2. Always release per-user lock AFTER all operations
3. Never hold multiple user locks simultaneously
### Lock Duration
Typical lock lifecycle:
- **Acquisition**: < 100ms
- **Holding**: Variable (task duration, typically 5-60 seconds)
- **Release**: < 100ms
- **Timeout**: 3600 seconds (1 hour) - prevents forever-locked users
## Testing
Comprehensive test suite in `tests/test_per_user_queue.py`:
```bash
cd /opt/server-agents/orchestrator
python3 tests/test_per_user_queue.py
```
**Tests Included:**
1. Basic lock acquire/release
2. Concurrent lock contention
3. Stale lock cleanup
4. Multiple user independence
5. QueueControllerV2 integration
6. Fair scheduling with locks
**Expected Results:**
```
Results: 6 passed, 0 failed
```
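Test 2 (concurrent lock contention) can be reproduced in a few lines against the API shown earlier; the import path is assumed from the file name:

```python
from per_user_queue_manager import PerUserQueueManager  # import path assumed


def test_concurrent_contention():
    m = PerUserQueueManager()
    ok1, lock1 = m.acquire_lock(user="alice", task_id="t1", timeout=1)
    ok2, _ = m.acquire_lock(user="alice", task_id="t2", timeout=1)
    assert ok1 and not ok2   # second acquire for the same user must fail
    ok3, lock3 = m.acquire_lock(user="bob", task_id="t3", timeout=1)
    assert ok3               # a different user is unaffected
    m.release_lock(user="alice", lock_id=lock1)
    m.release_lock(user="bob", lock_id=lock3)
```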
## Integration Points
### Conductor Integration
Conductor metadata tracks user and lock:
```json
{
  "meta.json": {
    "id": "task_id",
    "user": "alice",
    "lock_id": "task_id_timestamp",
    "status": "running|completed|failed"
  }
}
```
### Watchdog Integration
Watchdog detects task completion and triggers cleanup:
```python
# In watchdog loop
conductor_dir = Path(f"/home/{project}/conductor/active/{task_id}")
if is_task_complete(conductor_dir):
    lock_cleanup.check_and_cleanup_conductor_locks(project)
```
### Daemon Integration
Queue daemon respects user locks in task selection:
```python
# In queue daemon
while True:
    capacity = read_capacity()
    if has_capacity(capacity):
        task = select_next_task(capacity)  # respects per-user locks
        if task:
            dispatch(task)
    time.sleep(poll_interval)
```
## Performance Implications
### Lock Overhead
- **Acquisition**: ~1-5ms (filesystem I/O)
- **Check Active**: ~1ms (metadata file read)
- **Release**: ~1-5ms (filesystem I/O)
- **Total per task**: < 20ms overhead
### Scalability
- Per-user locking has O(1) complexity
- No contention between different users
- Fair sharing prevents starvation
- Tested with 100+ pending tasks
## Failure Handling
### Agent Crash
```
1. Agent crashes (no heartbeat)
2. Watchdog detects missing heartbeat
3. Task marked as failed in conductor
4. Lock cleanup runs, detects failed task
5. Lock released for user
6. Next task can proceed
```
### Queue Daemon Crash
```
1. Queue daemon dies (no dispatch)
2. Locks remain but accumulate stale ones
3. New queue daemon starts
4. Periodic cleanup removes stale locks
5. System recovers
```
### Lock File Corruption
```
1. Lock metadata corrupted
2. Cleanup detects invalid metadata
3. Lock file removed (safe)
4. Lock acquired again for same user
```
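A defensive metadata read along those lines (a sketch; field names follow the examples earlier in this document):

```python
import json
from pathlib import Path


def read_lock_metadata(meta_path: Path):
    """Return parsed lock metadata, or None if it is missing or corrupted.

    A None result tells the cleanup pass that the lock/metadata pair is
    unusable and can be safely removed so the user can acquire again.
    """
    try:
        meta = json.loads(meta_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return None
    if "lock_id" not in meta or "user" not in meta:
        return None  # treat incomplete metadata the same as corruption
    return meta
```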
## Configuration Recommendations
### Development
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 300
  }
}
```
```
Short timeout for testing (5 minutes).
### Production
```json
{
  "per_user_serialization": {
    "enabled": true,
    "lock_timeout_seconds": 3600
  }
}
```
```
Standard timeout of 1 hour.
### Debugging (Disabled)
```json
{
  "per_user_serialization": {
    "enabled": false
  }
}
```
```
Disable for debugging or testing parallel execution.
## Migration from Old System
The old system allowed concurrent tasks per user. Migration is safe:
1. **Enable gradually**: Set `enabled: true`
2. **Monitor**: Watch task queue logs for impact
3. **Adjust timeout**: Increase if tasks need more time
4. **Deploy**: No data migration needed
The system is backward compatible - old queue tasks continue to work.
## Future Enhancements
1. **Per-project locks** - If projects have concurrent users
2. **Priority-based waiting** - High-priority tasks skip the queue
3. **Task grouping** - Related tasks stay together
4. **Preemptive cancellation** - Kill stale tasks automatically
5. **Lock analytics** - Track lock contention and timing
## References
- [Per-User Queue Manager](per_user_queue_manager.py)
- [Queue Controller v2](queue_controller_v2.py)
- [Conductor Lock Cleanup](conductor_lock_cleanup.py)
- [Test Suite](tests/test_per_user_queue.py)