luzia/AGENT-AUTONOMY-RESEARCH.md

# Luzia Agent Autonomy Research
## Interactive Prompts and Autonomous Agent Patterns

**Date:** 2026-01-09
**Version:** 1.0
**Status:** Complete

---

## Executive Summary

This research documents how **Luzia** and the **Claude Agent SDK** enable autonomous agents to handle interactive scenarios without blocking. The key insight is that **blocking is prevented through architectural choices, not technical tricks**:

1. **Detached Execution** - Agents run in background processes, not waiting for input
2. **Non-Interactive Mode** - Permission mode set to `bypassPermissions` to avoid approval dialogs
3. **Async Communication** - Results delivered via files and notification logs, not stdin/stdout
4. **Failure Recovery** - Exit codes captured for retry logic without agent restart
5. **Context-First Design** - All necessary context provided upfront in prompts

---

## Part 1: How Luzia Prevents Agent Blocking

### 1.1 The Core Pattern: Detached Spawning

**File:** `/opt/server-agents/orchestrator/bin/luzia` (lines 1012-1200)

```bash
# Agents run detached with nohup, not waiting for completion
os.system(f'nohup "{script_file}" >/dev/null 2>&1 &')
```

**Key Design Decisions:**

| Aspect | Implementation | Why It Works |
|--------|---|---|
| **Process Isolation** | `nohup ... &` spawns detached process | Parent doesn't block; agent runs independently |
| **Permission Mode** | `--permission-mode bypassPermissions` | No permission dialogs to pause agent |
| **PID Tracking** | Job directory captures PID at startup | Can monitor/kill if needed without blocking |
| **Output Capture** | `tee` pipes output to log file | Results captured even if agent backgrounded |
| **Status Tracking** | Exit code appended to output.log | Job status determined post-execution |

### 1.2 The Full Agent Spawn Flow

**Complete lifecycle (simplified):**

```
1. spawn_claude_agent() called
   ↓
2. Job directory created: /var/log/luz-orchestrator/jobs/{job_id}/
   ├── prompt.txt (full context + task)
   ├── run.sh (executable shell script)
   ├── meta.json (job metadata)
   └── output.log (will capture all output)
   ↓
3. Script written with all environment setup
   ├── TMPDIR set to user's home (prevent /tmp collisions)
   ├── HOME set to project user
   ├── Current directory: project path
   └── Claude CLI invoked with full prompt
   ↓
4. Execution via nohup (detached)
   os.system(f'nohup "{script_file}" >/dev/null 2>&1 &')
   ↓
5. Control returns immediately to CLI
   ↓
6. Agent continues in background:
   ├── Reads prompt from file
   ├── Executes task (reads/writes files)
   ├── All output captured to output.log
   ├── Exit code captured: "exit:{code}"
   └── Completion logged to notifications.log
```

### 1.3 Permission Bypass Strategy

**Critical Flag:** `--permission-mode bypassPermissions`

```python
# From spawn_claude_agent()
claude_cmd = f'claude --dangerously-skip-permissions --permission-mode bypassPermissions ...'
```

**Why This Works:**
- **No User Prompts**: Claude doesn't ask for approval on tool use
- **Full Autonomy**: Agent makes all decisions without waiting
- **Pre-Authorization**: All permissions granted upfront in job spawning
- **Isolation**: Each agent runs as project user in their own space

**When This is Safe:**
- All agents have limited scope (project directory)
- Running as restricted user (not root)
- Task fully specified in prompt (no ambiguity)
- Agent context includes execution environment details

---

## Part 2: Handling Clarification Without Blocking

### 2.1 The AskUserQuestion Problem

When agents need clarification, Claude's `AskUserQuestion` tool blocks the agent process waiting for stdin input. For background agents, this is problematic.

**Solutions in Luzia:**

#### Solution 1: Context-First Design
Provide all necessary context upfront so agents rarely need to ask:

```python
prompt = f"""
You are a project agent working on the **{project}** project.

## Your Task
{task}

## Execution Environment
- You are running as user: {run_as_user}
- Working directory: {project_path}
- All file operations are pre-authorized
- Complete the task autonomously

## Guidelines
- Complete the task autonomously
- If you encounter errors, debug and fix them
- Store important findings in the shared knowledge graph
"""
```

#### Solution 2: Structured Task Format
Use specific, unambiguous task descriptions:

```
GOOD:   "Run tests in /workspace/tests and report pass/fail count"
BAD:    "Fix the test suite"  (unclear what 'fix' means)

GOOD:   "Analyze src/index.ts for complexity metrics"
BAD:    "Improve code quality" (needs clarification on what to improve)
```

#### Solution 3: Async Fallback Mechanism
If clarification is truly needed, agents can:

1. **Create a hold file** in job directory
2. **Log the question** to a status file
3. **Return exit code 1** (needs input)
4. **Await resolution** via file modification

```python
# Example pattern for agent code:
# (Not yet implemented in Luzia, but pattern is documented)

import json
from pathlib import Path

job_dir = Path("/var/log/luz-orchestrator/jobs/{job_id}")

# Agent encounters ambiguity
question = "Should I update production database or staging?"

# Write question to file
clarification = {
    "status": "awaiting_clarification",
    "question": question,
    "options": ["production", "staging"],
    "agent_paused_at": datetime.now().isoformat()
}
(job_dir / "clarification.json").write_text(json.dumps(clarification))

# Exit with code 1 to signal "needs input"
exit(1)
```

Then externally:
```bash
# Operator provides input
echo '{"choice": "staging"}' > /var/log/luz-orchestrator/jobs/{job_id}/clarification.json

# Restart agent (automatic retry system)
luzia retry {job_id}
```

### 2.2 Why AskUserQuestion Doesn't Work for Async Agents

| Scenario | Issue | Solution |
|----------|-------|----------|
| User runs `luzia project task` | User might close terminal | Store prompt in file, not stdin |
| Agent backgrounded | stdin not available | No interactive input possible |
| Multiple agents running | stdin interference | Use file-based IPC instead |
| Agent on remote machine | stdin tunneling complex | All I/O via files or HTTP |

---

## Part 3: Job State Machine and Exit Codes

### 3.1 Job Lifecycle States

**Defined in:** `/opt/server-agents/orchestrator/bin/luzia` (lines 607-646)

```python
def _get_actual_job_status(job_dir: Path) -> str:
    """Determine actual job status by checking output.log"""

    # Status values:
    # - "running" (process still active)
    # - "completed" (exit:0)
    # - "failed" (exit:non-zero)
    # - "killed" (exit:-9)
    # - "unknown" (no status info)
```

**State Transitions:**

```
Job Created
  ↓
[meta.json: status="running", output.log: empty]
  ↓
Agent Executes (captured in output.log)
  ↓
Agent Completes/Exits
  ↓
output.log appended with "exit:{code}"
  ↓
Status determined:
  ├─ exit:0     → "completed"
  ├─ exit:non-0 → "failed"
  ├─ exit:-9    → "killed"
  └─ no exit    → "running" (still active)
```

### 3.2 The Critical Line: Capturing Exit Code

**From run.sh template (lines 1148-1173):**

```bash
# Command with output capture
stdbuf ... {claude_cmd} 2>&1 | tee "{output_file}"
exit_code=${PIPESTATUS[0]}

# CRITICAL: Append exit code to log
echo "" >> "{output_file}"
echo "exit:$exit_code" >> "{output_file}"

# Notify completion
{notify_cmd}
```

**Why This Matters:**
- Job status determined by examining log file, not process exit
- Exit code persists in file even after process terminates
- Allows status queries without spawning process
- Enables automatic retry logic based on exit code

---

## Part 4: Handling Approval Prompts in Background

### 4.1 How Claude Code Approval Prompts Work

Claude Code tools can ask for permission before executing risky operations:

```
⚠️ This command has high privilege level. Approve?
[Y/n] _
```

In **interactive mode**: User can respond
In **background mode**: Command blocks indefinitely waiting for stdin

### 4.2 Luzia's Prevention Mechanism

**Three-layer approach:**

1. **CLI Flag**: `--permission-mode bypassPermissions`
   - Tells Claude CLI to skip permission dialogs
   - Requires `--dangerously-skip-permissions` flag

2. **Environment Setup**: User runs as project user, not root
   - Limited scope prevents catastrophic damage
   - Job runs in isolated directory
   - File ownership is correct by default

3. **Process Isolation**: Agent runs detached
   - Even if blocked, parent CLI returns immediately
   - Job continues in background
   - Can be monitored/killed separately

**Example: Safe Bash Execution**

```python
# This command would normally require approval in interactive mode
command = "rm -rf /opt/sensitive-data"

# But in agent context:
# 1. Agent running as limited user (not root)
# 2. Project path restricted (can't access /opt from project user)
# 3. Permission flags bypass confirmation dialog
# 4. Agent detached (blocking doesn't affect CLI)

# Result: Command executes without interactive prompt
```

---

## Part 5: Async Communication Patterns

### 5.1 File-Based Job Queue

**Implemented in:** `/opt/server-agents/orchestrator/lib/queue_controller.py`

**Pattern:**

```
User provides task → Enqueue to file-based queue → Status logged to disk
                                                       ↓
                                          Load-aware scheduler polls queue
                                                       ↓
                                          Task spawned as background agent
                                                       ↓
                                          Agent writes to output.log
                                                       ↓
                                          User queries status via filesystem
```

**Queue Structure:**

```
/var/lib/luzia/queue/
├── pending/
│   ├── high/
│   │   └── {priority}_{timestamp}_{project}_{task_id}.json
│   └── normal/
│       └── {priority}_{timestamp}_{project}_{task_id}.json
├── config.json
└── capacity.json
```

**Task File Format:**

```json
{
  "id": "a1b2c3d4",
  "project": "musica",
  "priority": 5,
  "prompt": "Run tests in /workspace/tests",
  "skill_match": "test-runner",
  "enqueued_at": "2026-01-09T15:30:45Z",
  "enqueued_by": "admin",
  "status": "pending"
}
```

### 5.2 Notification Log Pattern

**Location:** `/var/log/luz-orchestrator/notifications.log`

**Pattern:**

```
[14:23:15] Agent 142315-a1b2 finished (exit 0)
[14:24:03] Agent 142403-c3d4 finished (exit 1)
[14:25:12] Agent 142512-e5f6 finished (exit 0)
```

**Usage:**

```python
# Script can tail this file to await completion
# without polling job directories
tail -f /var/log/luz-orchestrator/notifications.log | \
  grep "Agent {job_id}"
```

### 5.3 Job Directory as IPC Channel

**Location:** `/var/log/luz-orchestrator/jobs/{job_id}/`

**Files Used for Communication:**

| File | Purpose | Direction |
|------|---------|-----------|
| `prompt.txt` | Task definition & context | Input (before agent starts) |
| `output.log` | Agent's stdout/stderr + exit code | Output (written during execution) |
| `meta.json` | Job metadata & status | Both (initial + final) |
| `clarification.json` | Awaiting user input (pattern) | Bidirectional |
| `run.sh` | Execution script | Input |
| `pid` | Process ID | Output |

**Example: Monitoring Job Completion**

```bash
#!/bin/bash
job_id="142315-a1b2"
job_dir="/var/log/luz-orchestrator/jobs/$job_id"

# Poll for completion
while true; do
  if grep -q "^exit:" "$job_dir/output.log"; then
    exit_code=$(grep "^exit:" "$job_dir/output.log" | tail -1 | cut -d: -f2)
    echo "Job completed with exit code: $exit_code"
    break
  fi
  sleep 1
done
```

---

## Part 6: Prompt Patterns for Agent Autonomy

### 6.1 The Ideal Autonomous Agent Prompt

**Pattern:**

```
1. Identity & Context
   - What role is the agent playing?
   - What project/domain?

2. Task Specification
   - What needs to be done?
   - What are success criteria?
   - What are the constraints?

3. Execution Environment
   - What tools are available?
   - What directories can be accessed?
   - What permissions are granted?

4. Decision Autonomy
   - What decisions can the agent make alone?
   - When should it ask for clarification? (ideally: never)
   - What should it do if ambiguous?

5. Communication
   - Where should results be written?
   - What format (JSON, text, files)?
   - When should it report progress?

6. Failure Handling
   - What to do if task fails?
   - Should it retry? How many times?
   - What exit codes to use?
```

### 6.2 Good vs Bad Prompts for Autonomy

**BAD - Requires Clarification:**
```
"Help me improve the code"
- Ambiguous: which files? What metrics?
- No success criteria
- Agent likely to ask questions

"Fix the bug"
- Which bug? What symptoms?
- Agent needs to investigate then ask
```

**GOOD - Autonomous:**
```
"Run tests in /workspace/tests and report:
- Total test count
- Passed count
- Failed count
- Exit code (0 if all pass, 1 if any fail)"

"Analyze src/index.ts for:
- Lines of code
- Number of functions
- Max function complexity
- Save results to analysis.json"
```

### 6.3 Prompt Template for Autonomous Agents

**Used in Luzia:** `/opt/server-agents/orchestrator/bin/luzia` (lines 1053-1079)

```python
prompt_template = """You are a project agent working on the **{project}** project.

{context}

## Your Task
{task}

## Execution Environment
- You are running as user: {run_as_user}
- You are running directly in the project directory: {project_path}
- You have FULL permission to read, write, and execute files in this directory
- Use standard Claude tools (Read, Write, Edit, Bash) directly
- All file operations are pre-authorized - proceed without asking for permission

## Knowledge Graph - IMPORTANT
Use the **shared/global knowledge graph** for storing knowledge:
- Use `mcp__shared-projects-memory__store_fact` to store facts
- Use `mcp__shared-projects-memory__query_relations` to query
- Use `mcp__shared-projects-memory__search_context` to search

## Guidelines
- Complete the task autonomously
- If you encounter errors, debug and fix them
- Store important findings in the shared knowledge graph
- Provide a summary of what was done when complete
"""
```

**Key Autonomy Features:**
- No "ask for help" - pre-authorization is explicit
- Clear environment details - no guessing about paths/permissions
- Knowledge graph integration - preserve learnings across runs
- Exit code expectations - clear success/failure criteria

---

## Part 7: Pattern Summary - Building Autonomous Agents

### 7.1 The Five Patterns

| Pattern | When to Use | Implementation |
|---------|------------|---|
| **Detached Spawning** | Background tasks that shouldn't block CLI | `nohup ... &` with PID tracking |
| **Permission Bypass** | Autonomous execution without prompts | `--permission-mode bypassPermissions` |
| **File-Based IPC** | Async communication with agents | Job directory as channel |
| **Exit Code Signaling** | Status determination without polling | Append "exit:{code}" to output |
| **Context-First Prompts** | Avoid clarification questions | Detailed spec + success criteria |

### 7.2 Comparison: Interactive vs Autonomous Patterns

| Aspect | Interactive Agent | Autonomous Agent |
|--------|---|---|
| **Execution** | Runs in foreground, blocks | Detached process, returns immediately |
| **Prompts** | Can use `AskUserQuestion` | Must provide all context upfront |
| **Approval** | Can request tool permission | Uses `--permission-mode bypassPermissions` |
| **I/O** | stdin/stdout with user | Files, logs, notification channels |
| **Failure** | User responds to errors | Agent handles/reports via exit code |
| **Monitoring** | User watches output | Query filesystem for status |

### 7.3 When to Use Each Pattern

**Use Interactive Agents When:**
- User is present and waiting
- Task requires user input/decisions
- Working in development/exploration mode
- Real-time feedback is valuable

**Use Autonomous Agents When:**
- Running background maintenance tasks
- Multiple parallel operations needed
- No user available to respond to prompts
- Results needed asynchronously

---

## Part 8: Real Implementation Examples

### 8.1 Example: Running Tests Autonomously

**Task:**
```
Run pytest in /workspace/tests and report results as JSON
```

**Luzia Command:**
```bash
luzia musica "Run pytest in /workspace/tests and save results to tests.json with {passed: int, failed: int, errors: int}"
```

**What Happens:**

1. Job directory created with UUID
2. Prompt written with full context
3. Script prepared with environment setup
4. Launched via nohup
5. Immediately returns job_id to user
6. Agent runs in background:
   ```bash
   cd /workspace
   pytest tests/ --json > tests.json
   # Results saved to file
   ```
7. Exit code captured (0 if all pass, 1 if failures)
8. Output logged to output.log
9. Completion notification sent

**User Monitor:**
```bash
luzia jobs {job_id}
# Status: running/completed/failed
# Exit code: 0/1
# Output preview: last 10 lines
```

### 8.2 Example: Code Analysis Autonomously

**Task:**
```
Analyze the codebase structure and save metrics to analysis.json
```

**Agent Does (no prompts needed):**

1. Reads prompt from job directory
2. Scans project structure
3. Collects metrics (LOC, functions, classes, complexity)
4. Writes results to analysis.json
5. Stores findings in knowledge graph
6. Exits with 0

**Success Criteria (in prompt):**
```
Results saved to analysis.json with:
- total_files: int
- total_lines: int
- total_functions: int
- total_classes: int
- average_complexity: float
```

---

## Part 9: Best Practices

### 9.1 Prompt Design for Autonomy

1. **Be Specific**
   - What files? What directories?
   - What exact metrics/outputs?
   - What format (JSON, CSV, text)?

2. **Provide Success Criteria**
   - What makes this task complete?
   - What should the output look like?
   - What exit code for success/failure?

3. **Include Error Handling**
   - What if file doesn't exist?
   - What if command fails?
   - Should it retry or report and exit?

4. **Minimize Ambiguity**
   - Don't say "improve code quality" - say what to measure
   - Don't say "fix bugs" - specify which bugs or how to find them
   - Don't say "optimize" - specify what metric to optimize

### 9.2 Environment Setup for Autonomy

1. **Pre-authorize Everything**
   - Set correct user/group
   - Ensure file permissions allow operations
   - Document what's accessible

2. **Provide Full Context**
   - Include CLAUDE.md or similar
   - Document project structure
   - Explain architectural decisions

3. **Set Clear Boundaries**
   - Which directories can be modified?
   - Which operations are allowed?
   - What can't be changed?

### 9.3 Failure Recovery for Autonomy

1. **Use Exit Codes Meaningfully**
   - 0 = success
   - 1 = recoverable failure
   - 2 = unrecoverable failure
   - -9 = killed/timeout

2. **Log Failures Comprehensively**
   - What was attempted?
   - What failed and why?
   - What was tried to recover?

3. **Enable Automatic Retry**
   - Retry on exit code 1 (optional)
   - Don't retry on exit code 2 (unrecoverable)
   - Track retry count to prevent infinite loops

---

## Part 10: Advanced Patterns

### 10.1 Multi-Phase Autonomous Tasks

For complex tasks requiring multiple steps:

```python
# Phase 1: Context gathering
# Phase 2: Analysis
# Phase 3: Implementation
# Phase 4: Verification
# Phase 5: Reporting

# All phases defined upfront in prompt
# Agent proceeds through all without asking
# Exit code reflects overall success
```

### 10.2 Knowledge Graph Integration

Agents store findings persistently:

```python
# From agent code:
from mcp__shared-projects-memory__store_fact import store_fact

# After analysis, store for future agents
store_fact(
    entity_source_name="musica-project",
    relation="has_complexity_metrics",
    entity_target_name="analysis-2026-01-09",
    context={
        "avg_complexity": 3.2,
        "hotspots": ["index.ts", "processor.ts"]
    }
)
```

### 10.3 Cross-Agent Coordination

Use shared state file for coordination:

```python
# /opt/server-agents/state/cross-agent-todos.json
# Agents read/update this to coordinate

{
  "current_tasks": [
    {
      "id": "analyze-musica",
      "project": "musica",
      "status": "in_progress",
      "assigned_to": "agent-a1b2",
      "started": "2026-01-09T14:23:00Z"
    }
  ],
  "completed_archive": { ... }
}
```

---

## Part 11: Failure Cases and Solutions

### 11.1 Common Blocking Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Agent pauses on permission prompt | Tool permission check enabled | Use `--permission-mode bypassPermissions` |
| Agent blocks on AskUserQuestion | Prompt causes clarification needed | Redesign prompt with full context |
| stdin unavailable | Agent backgrounded | Use file-based IPC for input |
| Exit code not recorded | Script exits before writing exit code | Ensure "exit:{code}" in output.log |
| Job marked "running" forever | Process dies but exit code not appended | Use `tee` and explicit exit code capture |

### 11.2 Debugging Blocking Agents

```bash
# Check if agent is actually running
ps aux | grep {job_id}

# Check job output (should show what it's doing)
tail -f /var/log/luz-orchestrator/jobs/{job_id}/output.log

# Check if waiting on stdin
strace -p {pid} | grep read

# Look for approval prompts in output
grep -i "approve\|confirm\|permission" /var/log/luz-orchestrator/jobs/{job_id}/output.log

# Check if exit code was written
tail -5 /var/log/luz-orchestrator/jobs/{job_id}/output.log
```

---

## Part 12: Conclusion - Key Takeaways

### 12.1 The Core Principle

**Autonomous agents don't ask for input because they don't need to.**

Rather than implementing complex async prompting, the better approach is:
1. **Specify tasks completely** - No ambiguity
2. **Provide full context** - No guessing required
3. **Set clear boundaries** - Know what's allowed
4. **Detach execution** - Run independent of CLI
5. **Capture results** - File-based communication

### 12.2 Implementation Checklist

- [ ] Prompt includes all necessary context
- [ ] Task has clear success criteria
- [ ] Environment fully described (user, directory, permissions)
- [ ] No ambiguous language in prompt
- [ ] Exit codes defined (0=success, 1=failure, 2=error)
- [ ] Output format specified (JSON, text, files)
- [ ] Job runs as appropriate user
- [ ] Results captured to files/logs
- [ ] Notification system tracks completion
- [ ] Status queryable without blocking

### 12.3 When Blocks Still Occur

1. **Rare**: Well-designed prompts rarely need clarification
2. **Detectable**: Agent exits with code 1 and logs to output.log
3. **Recoverable**: Can retry or modify task and re-queue
4. **Monitorable**: Parent CLI never blocks, can watch from elsewhere

---

## Appendix A: Key Code Locations

| Location | Purpose |
|----------|---------|
| `/opt/server-agents/orchestrator/bin/luzia` (lines 1012-1200) | `spawn_claude_agent()` - Core autonomous agent spawning |
| `/opt/server-agents/orchestrator/lib/docker_bridge.py` | Container isolation for project agents |
| `/opt/server-agents/orchestrator/lib/queue_controller.py` | File-based task queue with load awareness |
| `/var/log/luz-orchestrator/jobs/` | Job directory structure and IPC |
| `/opt/server-agents/orchestrator/CLAUDE.md` | Embedded instructions for agents |

---

## Appendix B: Environment Variables

Agents have these set automatically:

```bash
TMPDIR="/home/{user}/.tmp"      # Prevent /tmp collisions
TEMP="/home/{user}/.tmp"        # Same
TMP="/home/{user}/.tmp"         # Same
HOME="/home/{user}"             # User's home directory
PWD="/path/to/project"          # Working directory
```

---

## Appendix C: File Formats Reference

### Job Directory Files

**meta.json:**
```json
{
  "id": "142315-a1b2",
  "project": "musica",
  "task": "Run tests and report results",
  "type": "agent",
  "user": "musica",
  "pid": "12847",
  "started": "2026-01-09T14:23:15Z",
  "status": "running",
  "debug": false
}
```

**output.log:**
```
[14:23:15] Starting agent...
[14:23:16] Reading prompt from file
[14:23:17] Executing task...
[14:23:18] Running tests...
PASSED: test_1
PASSED: test_2
...
[14:23:25] Task complete

exit:0
```

---

## References

- **Luzia CLI**: `/opt/server-agents/orchestrator/bin/luzia`
- **Agent SDK**: Claude Agent SDK (Anthropic)
- **Docker Bridge**: Container isolation for agent execution
- **Queue Controller**: File-based task queue implementation
- **Bot Orchestration Protocol**: `/opt/server-agents/BOT-ORCHESTRATION-PROTOCOL.md`