Refactor cockpit to use DockerTmuxController pattern
Based on claude-code-tools TmuxCLIController, this refactor: - Added DockerTmuxController class for robust tmux session management - Implements send_keys() with configurable delay_enter - Implements capture_pane() for output retrieval - Implements wait_for_prompt() for pattern-based completion detection - Implements wait_for_idle() for content-hash-based idle detection - Implements wait_for_shell_prompt() for shell prompt detection Also includes workflow improvements: - Pre-task git snapshot before agent execution - Post-task commit protocol in agent guidelines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
398
docs/CLAUDE-DISPATCH-ANALYSIS.md
Normal file
398
docs/CLAUDE-DISPATCH-ANALYSIS.md
Normal file
@@ -0,0 +1,398 @@
|
||||
# Claude Dispatch and Monitor Flow Analysis
|
||||
|
||||
**Date:** 2026-01-11
|
||||
**Author:** Luzia Research Agent
|
||||
**Status:** Complete
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Luzia dispatches Claude tasks using a **fully autonomous, non-blocking pattern**. The current architecture intentionally **does not support human-in-the-loop interaction** for background agents. This analysis documents the current flow, identifies pain points, and researches potential improvements for scenarios where user input is needed.
|
||||
|
||||
---
|
||||
|
||||
## Part 1: Current Dispatch Flow
|
||||
|
||||
### 1.1 Task Dispatch Mechanism
|
||||
|
||||
**Entry Point:** `spawn_claude_agent()` in `/opt/server-agents/orchestrator/bin/luzia:1102-1401`
|
||||
|
||||
**Flow:**
|
||||
```
|
||||
User runs: luzia <project> <task>
|
||||
↓
|
||||
1. Permission check (project access validation)
|
||||
↓
|
||||
2. QA Preflight checks (optional, validates task)
|
||||
↓
|
||||
3. Job directory created: /var/log/luz-orchestrator/jobs/{job_id}/
|
||||
├── prompt.txt (task + context)
|
||||
├── run.sh (shell script with env setup)
|
||||
├── meta.json (job metadata)
|
||||
└── output.log (will capture output)
|
||||
↓
|
||||
4. Shell script generated with:
|
||||
- User-specific TMPDIR to avoid /tmp collisions
|
||||
- HOME set to target user
|
||||
- stdbuf for unbuffered output
|
||||
- tee to capture to output.log
|
||||
↓
|
||||
5. Launched via: os.system(f'nohup "{script_file}" >/dev/null 2>&1 &')
|
||||
↓
|
||||
6. Control returns IMMEDIATELY to CLI (job_id returned)
|
||||
↓
|
||||
7. Agent runs in background, detached from parent process
|
||||
```
|
||||
|
||||
### 1.2 Claude CLI Invocation
|
||||
|
||||
**Command Line Built:**
|
||||
```bash
|
||||
claude --dangerously-skip-permissions \
|
||||
--permission-mode bypassPermissions \
|
||||
--add-dir "{project_path}" \
|
||||
--add-dir /opt/server-agents \
|
||||
--print \
|
||||
--verbose \
|
||||
-p # Reads prompt from stdin
|
||||
```
|
||||
|
||||
**Critical Flags:**
|
||||
| Flag | Purpose |
|
||||
|------|---------|
|
||||
| `--dangerously-skip-permissions` | Required to use bypassPermissions mode |
|
||||
| `--permission-mode bypassPermissions` | Skip ALL interactive prompts |
|
||||
| `--print` | Non-interactive output mode |
|
||||
| `--verbose` | Progress visibility in logs |
|
||||
| `-p` | Read prompt from stdin (piped from prompt.txt) |
|
||||
|
||||
### 1.3 Output Capture
|
||||
|
||||
**Run script template:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
echo $$ > "{pid_file}"
|
||||
|
||||
# Environment setup
|
||||
export TMPDIR="{user_tmp_dir}"
|
||||
export HOME="{user_home}"
|
||||
|
||||
# Execute with unbuffered output capture
|
||||
sudo -u {user} bash -c '... cd "{project_path}" && cat "{prompt_file}" | stdbuf -oL -eL {claude_cmd}' 2>&1 | tee "{output_file}"
|
||||
|
||||
exit_code=${PIPESTATUS[0]}
|
||||
echo "" >> "{output_file}"
|
||||
echo "exit:$exit_code" >> "{output_file}"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Part 2: Output Monitoring Flow
|
||||
|
||||
### 2.1 Status Checking
|
||||
|
||||
**Function:** `get_job_status()` at line 1404
|
||||
|
||||
Status is determined by:
|
||||
1. Reading `output.log` for `exit:` line at end
|
||||
2. Checking if process is still running (via PID)
|
||||
3. Updating `meta.json` with completion time metrics
|
||||
|
||||
**Status Values:**
|
||||
- `running` - No exit code yet, PID may still be active
|
||||
- `completed` - `exit:0` found
|
||||
- `failed` - `exit:non-zero` found
|
||||
- `killed` - `exit:-9` or manual kill detected
|
||||
|
||||
### 2.2 User Monitoring Commands
|
||||
|
||||
```bash
|
||||
# List all jobs
|
||||
luzia jobs
|
||||
|
||||
# Show specific job status
|
||||
luzia jobs {job_id}
|
||||
|
||||
# View job output
|
||||
luzia logs {job_id}
|
||||
|
||||
# Show with timing details
|
||||
luzia jobs --timing
|
||||
```
|
||||
|
||||
### 2.3 Notification Flow
|
||||
|
||||
On completion, the run script appends to notification log:
|
||||
```bash
|
||||
echo "[$(date +%H:%M:%S)] Agent {job_id} finished (exit $exit_code)" >> /var/log/luz-orchestrator/notifications.log
|
||||
```
|
||||
|
||||
This allows external monitoring via:
|
||||
```bash
|
||||
tail -f /var/log/luz-orchestrator/notifications.log
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Part 3: Current User Interaction Handling
|
||||
|
||||
### 3.1 The Problem: No Interaction Supported
|
||||
|
||||
**Current Design:** Background agents **cannot** receive user input.
|
||||
|
||||
**Why:**
|
||||
1. `nohup` detaches from terminal - stdin unavailable
|
||||
2. `--permission-mode bypassPermissions` skips prompts
|
||||
3. No mechanism exists to pause agent and wait for input
|
||||
4. Output is captured to file, not interactive terminal
|
||||
|
||||
### 3.2 When Claude Would Ask Questions
|
||||
|
||||
Claude's `AskUserQuestion` tool would block waiting for stdin, which isn't available. Current mitigations:
|
||||
|
||||
1. **Context-First Design** - Prompts include all necessary context
|
||||
2. **Pre-Authorization** - Permissions granted upfront
|
||||
3. **Structured Tasks** - Clear success criteria reduce ambiguity
|
||||
4. **Exit Code Signaling** - Agent exits with code 1 if unable to proceed
|
||||
|
||||
### 3.3 Current Pain Points
|
||||
|
||||
| Pain Point | Impact | Current Workaround |
|
||||
|------------|--------|-------------------|
|
||||
| Agent can't ask clarifying questions | May proceed with wrong assumptions | Write detailed prompts |
|
||||
| User can't provide mid-task guidance | Task might fail when adjustments needed | Retry with modified task |
|
||||
| No approval workflow for risky actions | Security relies on upfront authorization | Careful permission scoping |
|
||||
| Long tasks give no progress updates | User doesn't know if task is stuck | Check output.log manually |
|
||||
| AskUserQuestion blocks indefinitely | Agent hangs, appears as "running" forever | Must kill and retry |
|
||||
|
||||
---
|
||||
|
||||
## Part 4: Research on Interaction Improvements
|
||||
|
||||
### 4.1 Pattern: File-Based Clarification Queue
|
||||
|
||||
**Concept:** Agent writes questions to file, waits for answer file.
|
||||
|
||||
```
|
||||
/var/log/luz-orchestrator/jobs/{job_id}/
|
||||
├── clarification.json # Agent writes question
|
||||
├── response.json # User writes answer
|
||||
└── output.log # Agent logs waiting status
|
||||
```
|
||||
|
||||
**Agent Behavior:**
|
||||
```python
|
||||
# Agent encounters ambiguity
|
||||
question = {
|
||||
"type": "choice",
|
||||
"question": "Which database: production or staging?",
|
||||
"options": ["production", "staging"],
|
||||
"timeout_minutes": 30,
|
||||
"default_if_timeout": "staging"
|
||||
}
|
||||
Path("clarification.json").write_text(json.dumps(question))
|
||||
|
||||
# Wait for response (polling)
|
||||
for _ in range(timeout * 60):
|
||||
if Path("response.json").exists():
|
||||
response = json.loads(Path("response.json").read_text())
|
||||
return response["choice"]
|
||||
time.sleep(1)
|
||||
|
||||
# Timeout - use default
|
||||
return question["default_if_timeout"]
|
||||
```
|
||||
|
||||
**User Side:**
|
||||
```bash
|
||||
# List pending questions
|
||||
luzia questions
|
||||
|
||||
# Answer a question
|
||||
luzia answer {job_id} staging
|
||||
```
|
||||
|
||||
### 4.2 Pattern: WebSocket Status Bridge
|
||||
|
||||
**Concept:** Real-time bidirectional communication via WebSocket.
|
||||
|
||||
```
|
||||
User Browser ←→ Luzia Status Server ←→ Agent Process
|
||||
↑
|
||||
/var/lib/luzia/status.sock
|
||||
```
|
||||
|
||||
**Implementation in Existing Code:**
|
||||
`lib/luzia_status_integration.py` already has a status publisher framework that could be extended.
|
||||
|
||||
**Flow:**
|
||||
1. Agent publishes status updates to socket
|
||||
2. Status server broadcasts to connected clients
|
||||
3. When question arises, server notifies all clients
|
||||
4. User responds via web UI or CLI
|
||||
5. Response routed back to agent
|
||||
|
||||
### 4.3 Pattern: Telegram/Chat Integration
|
||||
|
||||
**Existing:** `/opt/server-agents/mcp-servers/assistant-channel/` provides Telegram integration.
|
||||
|
||||
**Extended for Agent Questions:**
|
||||
```python
|
||||
# Agent needs input
|
||||
channel_query(
|
||||
sender=f"agent-{job_id}",
|
||||
question="Should I update production database?",
|
||||
context="Running migration task for musica project"
|
||||
)
|
||||
|
||||
# Bruno responds via Telegram
|
||||
# Response delivered to agent via file or status channel
|
||||
```
|
||||
|
||||
### 4.4 Pattern: Approval Gates
|
||||
|
||||
**Concept:** Pre-define checkpoints where agent must wait for approval.
|
||||
|
||||
```python
|
||||
# In task prompt
|
||||
"""
|
||||
## Approval Gates
|
||||
- Before running migrations: await approval
|
||||
- Before deleting files: await approval
|
||||
- Before modifying production config: await approval
|
||||
|
||||
Write to approval.json when reaching a gate. Wait for approved.json.
|
||||
"""
|
||||
```
|
||||
|
||||
**Gate File:**
|
||||
```json
|
||||
{
|
||||
"gate": "database_migration",
|
||||
"description": "About to run 3 migrations on staging DB",
|
||||
"awaiting_since": "2026-01-11T14:30:00Z",
|
||||
"auto_approve_after_minutes": null
|
||||
}
|
||||
```
|
||||
|
||||
### 4.5 Pattern: Interactive Mode Flag
|
||||
|
||||
**Concept:** Allow foreground execution when user is present.
|
||||
|
||||
```bash
|
||||
# Background (current default)
|
||||
luzia musica "run tests"
|
||||
|
||||
# Foreground/Interactive
|
||||
luzia musica "run tests" --fg
|
||||
|
||||
# Interactive session (already exists)
|
||||
luzia work on musica
|
||||
```
|
||||
|
||||
The `--fg` flag already exists but doesn't fully support interactive Q&A. Enhancement needed:
|
||||
- Don't detach process
|
||||
- Keep stdin connected
|
||||
- Allow Claude's AskUserQuestion to work normally
|
||||
|
||||
---
|
||||
|
||||
## Part 5: Recommendations
|
||||
|
||||
### 5.1 Short-Term (Quick Wins)
|
||||
|
||||
1. **Better Exit Code Semantics**
|
||||
- Exit 100 = "needs clarification" (new code)
|
||||
- Capture the question in `clarification.json`
|
||||
- `luzia questions` command to list pending
|
||||
|
||||
2. **Enhanced `--fg` Mode**
|
||||
- Don't background the process
|
||||
- Keep stdin/stdout connected
|
||||
- Allow normal interactive Claude session
|
||||
|
||||
3. **Progress Streaming**
|
||||
- Add `luzia watch {job_id}` for `tail -f` on output.log
|
||||
- Color-coded output for better readability
|
||||
|
||||
### 5.2 Medium-Term (New Features)
|
||||
|
||||
4. **File-Based Clarification System**
|
||||
- Agent writes to `clarification.json`
|
||||
- Luzia CLI watches for pending questions
|
||||
- `luzia answer {job_id} <response>` writes `response.json`
|
||||
- Agent polls and continues
|
||||
|
||||
5. **Telegram/Chat Bridge for Questions**
|
||||
- Extend assistant-channel for agent questions
|
||||
- Push notification when agent needs input
|
||||
- Reply via chat, response routed to agent
|
||||
|
||||
6. **Status Dashboard**
|
||||
- Web UI showing all running agents
|
||||
- Real-time output streaming
|
||||
- Question/response interface
|
||||
|
||||
### 5.3 Long-Term (Architecture Evolution)
|
||||
|
||||
7. **Approval Workflows**
|
||||
- Define approval gates in task specification
|
||||
- Configurable auto-approve timeouts
|
||||
- Audit log of approvals
|
||||
|
||||
8. **Agent Orchestration Layer**
|
||||
- Queue of pending questions across agents
|
||||
- Priority handling for urgent questions
|
||||
- SLA tracking for response times
|
||||
|
||||
9. **Hybrid Execution Mode**
|
||||
- Start background, attach to foreground if question arises
|
||||
- Agent sends signal when needing input
|
||||
- CLI can "attach" to running agent
|
||||
|
||||
---
|
||||
|
||||
## Part 6: Implementation Priority
|
||||
|
||||
| Priority | Feature | Effort | Impact |
|
||||
|----------|---------|--------|--------|
|
||||
| **P0** | Better `--fg` mode | Low | High - enables immediate interactive use |
|
||||
| **P0** | Exit code 100 for clarification | Low | Medium - better failure understanding |
|
||||
| **P1** | `luzia watch {job_id}` | Low | Medium - easier monitoring |
|
||||
| **P1** | File-based clarification | Medium | High - enables async Q&A |
|
||||
| **P2** | Telegram question bridge | Medium | Medium - mobile notification |
|
||||
| **P2** | Status dashboard | High | High - visual monitoring |
|
||||
| **P3** | Approval workflows | High | Medium - enterprise feature |
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
The current Luzia dispatch architecture is optimized for **fully autonomous** agent execution. This is the right default for background tasks. However, there's a gap for scenarios where:
|
||||
- Tasks are inherently ambiguous
|
||||
- User guidance is needed mid-task
|
||||
- High-stakes actions require approval
|
||||
|
||||
The recommended path forward is:
|
||||
1. **Improve `--fg` mode** for true interactive sessions
|
||||
2. **Add file-based clarification** for async Q&A on background tasks
|
||||
3. **Integrate with Telegram** for push notifications on questions
|
||||
4. **Build status dashboard** for visual monitoring and interaction
|
||||
|
||||
These improvements maintain the autonomous-by-default philosophy while enabling human-in-the-loop interaction when needed.
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Key File Locations
|
||||
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `/opt/server-agents/orchestrator/bin/luzia:1102-1401` | `spawn_claude_agent()` - main dispatch |
|
||||
| `/opt/server-agents/orchestrator/bin/luzia:1404-1449` | `get_job_status()` - status checking |
|
||||
| `/opt/server-agents/orchestrator/bin/luzia:4000-4042` | `route_logs()` - log viewing |
|
||||
| `/opt/server-agents/orchestrator/lib/responsive_dispatcher.py` | Async dispatch patterns |
|
||||
| `/opt/server-agents/orchestrator/lib/cli_feedback.py` | CLI output formatting |
|
||||
| `/opt/server-agents/orchestrator/AGENT-AUTONOMY-RESEARCH.md` | Prior research on autonomy |
|
||||
| `/var/log/luz-orchestrator/jobs/` | Job directories |
|
||||
| `/var/log/luz-orchestrator/notifications.log` | Completion notifications |
|
||||
Reference in New Issue
Block a user