Implements three complementary patterns for coordinating multi-agent swarms:
1. Status Polling (Fix 1): Orchestrator periodically spawns status-checker agents to monitor swarm health, detect stuck agents, and identify conflicts early.
2. File Claiming (Fix 2): Agents claim file ownership before editing via a claims registry (.claude/file-claims.md). Prevents multiple agents from editing the same file simultaneously.
3. Checkpoint-Based Orchestration (Fix 5): Separates swarm execution into phases: planning (read-only), conflict detection, resolution, then implementation with monitoring.
Plugin contents:
- /swarm command for full orchestrated workflow
- status-checker agent (haiku, lightweight polling)
- conflict-detector agent (analyzes plans for overlaps)
- plan-reviewer agent (validates individual plans)
- swarm-patterns skill with comprehensive documentation
Checkpoint-Based Orchestration
A phased approach to swarm execution that prevents conflicts through planning, review, and controlled implementation.
Overview
Checkpoint-based orchestration separates swarm execution into distinct phases:
- Planning - Agents analyze and plan (read-only)
- Review - Orchestrator detects conflicts
- Resolution - Conflicts resolved before implementation
- Claiming - Files assigned to agents
- Implementation - Agents execute plans
- Verification - Results validated
Why Checkpoints?
Without Checkpoints
Launch agents → Agents work in parallel → CONFLICT! →
Agents overwrite each other → Endless fix loops → Chaos
With Checkpoints
Launch planning agents → Collect plans → Detect conflicts →
Resolve conflicts → Claim files → Sequential/parallel execution → Success
Phase Details
Phase 1: Planning (Parallel, Read-Only)
Purpose: Gather implementation plans without making changes
Key Rules:
- Agents may READ any file
- Agents must NOT WRITE any file
- Each agent produces a structured plan
Agent Instructions:
You are in PLANNING MODE. Analyze the codebase and create an implementation plan.
CRITICAL RESTRICTIONS:
- DO NOT use Edit, Write, or any file modification tools
- DO NOT execute commands that modify files
- ONLY use Read, Glob, Grep for analysis
Your output must be a structured plan listing:
- All files you need to modify (with full paths)
- All files you need to create
- All files you need to delete
- Dependencies on other components
- Step-by-step implementation approach
Plan Format:
## Agent Plan: [agent-id]
### Task Summary
[1-2 sentence description of what this agent will accomplish]
### Files to Modify
- `src/auth/handler.ts`: Add validateToken() function and update handleRequest()
- `src/types/auth.ts`: Add TokenPayload interface
### Files to Create
- `src/auth/tokens.ts`: Token generation and validation utilities
### Files to Delete
- `src/auth/legacy-auth.ts`: Replaced by new implementation
### Dependencies
- **Requires**: Database schema must include users table
- **Blocks**: API routes cannot be updated until auth is complete
### Implementation Steps
1. Create TokenPayload interface in types
2. Implement token utilities in new file
3. Update handler with validation logic
4. Remove legacy file after verification
### Estimated Scope
- Files touched: 4
- Lines added: ~150
- Lines removed: ~80
- Risk level: Medium (touching auth system)
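For the orchestrator to compare plans, the file lists have to be extracted from each plan. The following is a minimal sketch of that step, assuming plans follow the format above exactly (backticked paths under `### Files to ...` headings). The names `AgentPlan`, `SECTION_FIELDS`, and `parse_plan` are illustrative and not part of the plugin.

```python
import re
from dataclasses import dataclass, field


@dataclass
class AgentPlan:
    """File-level summary of one agent's plan (hypothetical structure)."""
    agent_id: str
    modify: list[str] = field(default_factory=list)
    create: list[str] = field(default_factory=list)
    delete: list[str] = field(default_factory=list)


# Maps a "### ..." heading in the plan to the AgentPlan field it fills.
SECTION_FIELDS = {
    "Files to Modify": "modify",
    "Files to Create": "create",
    "Files to Delete": "delete",
}


def parse_plan(text: str) -> AgentPlan:
    """Extract the agent id and the three file lists from a plan document."""
    plan = AgentPlan(agent_id="")
    current = None  # field name for the section being read, if it is a file section
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## Agent Plan:"):
            plan.agent_id = line.split(":", 1)[1].strip()
        elif line.startswith("### "):
            current = SECTION_FIELDS.get(line[4:].strip())
        elif current and line.startswith("- "):
            match = re.match(r"- `([^`]+)`", line)
            if match:
                getattr(plan, current).append(match.group(1))
    return plan
```

Sections that are not file lists (Dependencies, Implementation Steps, Estimated Scope) are skipped by this sketch. A fuller parser would also read the Requires/Blocks entries.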
Phase 2: Conflict Detection
Purpose: Identify overlapping file edits before they happen
Process:
- Collect all agent plans
- Build file → agent mapping
- Identify conflicts:
  - Same file modified by multiple agents
  - File deleted by one agent and modified by another
  - Same file created by multiple agents
  - Dependency cycles between agents
Conflict Types:
| Type | Severity | Example |
|---|---|---|
| Same file modify | Critical | agent-1 and agent-2 both modify handler.ts |
| Create collision | Critical | Both agents create utils/helper.ts |
| Delete + Modify | Critical | agent-1 deletes file agent-2 modifies |
| Dependency cycle | Critical | agent-1 waits for agent-2, agent-2 waits for agent-1 |
| Same directory | Warning | Both agents add files to src/utils/ |
| Import chain | Info | agent-1's file imports from agent-2's file |
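The three file-level critical conflicts in the table can be found from a file → agent mapping. The sketch below is one possible way to do it, assuming each plan has already been reduced to lists of paths. The function name `detect_conflicts` and the input shape are assumptions made for illustration. Dependency cycles and the Warning/Info rows are not covered here.

```python
from collections import defaultdict


def detect_conflicts(plans):
    """Find critical file conflicts.

    plans: {agent_id: {"modify": [...], "create": [...], "delete": [...]}}
    Returns a list of (conflict_type, path, agents) tuples.
    """
    # path -> action -> set of agents performing that action on the path
    touches = defaultdict(lambda: defaultdict(set))
    for agent, plan in plans.items():
        for action in ("modify", "create", "delete"):
            for path in plan.get(action, []):
                touches[path][action].add(agent)

    conflicts = []
    for path in sorted(touches):
        by_action = touches[path]
        if len(by_action["modify"]) > 1:
            conflicts.append(("Same file modify", path, sorted(by_action["modify"])))
        if len(by_action["create"]) > 1:
            conflicts.append(("Create collision", path, sorted(by_action["create"])))
        # A delete conflicts with a modify only when a different agent modifies.
        modifiers = by_action["modify"] - by_action["delete"]
        if by_action["delete"] and modifiers:
            involved = sorted(by_action["delete"] | modifiers)
            conflicts.append(("Delete + Modify", path, involved))
    return conflicts
```

Paths are compared as plain strings, so this assumes every plan uses the same path style (for example, repository-relative paths).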
Phase 3: Resolution
Purpose: Resolve all conflicts before implementation begins
Resolution Strategies:
Sequential Execution:
Conflict: agent-1 and agent-2 both modify src/api/index.ts
Resolution: Execute sequentially
- Execution order: agent-1 first, then agent-2
- agent-2 will see agent-1's changes before starting
Scope Reassignment:
Conflict: agent-1 (auth) and agent-2 (logging) both modify middleware.ts
Resolution: Reassign to single agent
- Expand agent-1's scope to include logging changes
- Remove middleware.ts from agent-2's plan
File Splitting:
Conflict: agent-1 and agent-2 both modify large config.ts
Resolution: Split the file
- Create config/auth.ts (agent-1)
- Create config/db.ts (agent-2)
- Update config/index.ts to re-export
User Decision:
Conflict: Complex dependency between agent-1 and agent-3
Resolution: Present to user
"Agents 1 and 3 have interleaved dependencies. Options:
1. Merge into single agent
2. Manual sequencing with intermediate reviews
3. Redesign the task split"
Phase 4: File Claiming
Purpose: Register file ownership before implementation
Process:
- For each resolved plan, register claims
- Update .claude/file-claims.md
- Determine execution batches
Execution Order Determination:
Given resolved plans:
- agent-1: No dependencies
- agent-2: No dependencies
- agent-3: Depends on agent-1
- agent-4: Depends on agent-2 and agent-3
Execution order:
Batch 1 (parallel): agent-1, agent-2
Batch 2 (after batch 1): agent-3
Batch 3 (after batch 2): agent-4
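The batches above can be derived by repeatedly taking every agent whose dependencies are already satisfied (a layered topological sort). This is a minimal sketch under the assumption that dependencies are given as a mapping from agent ID to the set of agent IDs it waits on, and that every dependency is itself a key in that mapping. The name `execution_batches` is illustrative.

```python
def execution_batches(deps):
    """Group agents into batches that can run in parallel.

    deps: {agent_id: set of agent_ids it depends on}
    Raises ValueError if the remaining agents can never become ready,
    which for well-formed input means a dependency cycle.
    """
    remaining = {agent: set(needs) for agent, needs in deps.items()}
    batches = []
    while remaining:
        ready = sorted(agent for agent, needs in remaining.items() if not needs)
        if not ready:
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        batches.append(ready)
        for agent in ready:
            del remaining[agent]
        # Agents in this batch no longer block anyone.
        for needs in remaining.values():
            needs.difference_update(ready)
    return batches


deps = {
    "agent-1": set(),
    "agent-2": set(),
    "agent-3": {"agent-1"},
    "agent-4": {"agent-2", "agent-3"},
}
print(execution_batches(deps))
# [['agent-1', 'agent-2'], ['agent-3'], ['agent-4']]
```

The same function doubles as a dependency-cycle check: if agent-1 waits for agent-2 and agent-2 waits for agent-1, no agent is ever ready and the error is raised.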
Phase 5: Implementation with Monitoring
Purpose: Execute plans with status tracking
Process:
- Launch batch 1 agents
- Start polling loop (every 30-60 seconds)
- As agents complete:
  - Release their file claims
  - Launch dependent agents
- Handle issues as detected:
  - Stuck agents → investigate/reassign
  - Conflicts → pause and resolve
  - Failures → report and decide
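The bookkeeping inside the polling loop might look roughly like the sketch below. It assumes in-memory structures (a `claims` dict of path → agent ID, a `deps` mapping, and timestamps from the last status report) rather than the plugin's actual files. The function names and the 120-second stuck threshold are hypothetical.

```python
def release_and_launch(agent, claims, deps, done, launched):
    """Handle one completed agent.

    Releases the agent's file claims, then returns the agents that have
    not been launched yet and whose dependencies are now all complete.
    Mutates claims, done, and launched in place.
    """
    done.add(agent)
    for path in [p for p, owner in claims.items() if owner == agent]:
        del claims[path]
    ready = sorted(
        a for a, needs in deps.items() if a not in launched and needs <= done
    )
    launched.update(ready)
    return ready


def find_stuck(last_seen, now, limit_seconds=120):
    """Return agents whose last status report is older than the limit.

    last_seen: {agent_id: timestamp in seconds}; the threshold is an
    assumed value, not one the plugin specifies.
    """
    return sorted(a for a, t in last_seen.items() if now - t > limit_seconds)
```

Each polling pass would call `find_stuck` for the running agents and `release_and_launch` for any that finished since the previous pass. Actually starting the returned agents is left to the orchestrator.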
Agent Instructions for Implementation:
You are now in IMPLEMENTATION MODE. Execute your approved plan.
Your approved plan is in: .claude/swarm-plans/[your-agent-id].md
Your claimed files are in: .claude/file-claims.md
RULES:
1. Only modify files that are claimed by YOUR agent ID
2. Follow your plan exactly - do not expand scope
3. If you need to modify an unclaimed file, STOP and report
4. Update progress by completing your assigned tasks
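Rules 1 and 3 amount to a lookup against the claims registry before every write. The sketch below illustrates that check, assuming the registry has already been loaded into a dict of path → agent ID. The on-disk format of .claude/file-claims.md is not described in this section, so parsing is left out, and `may_write` is a hypothetical helper.

```python
import posixpath


def may_write(agent_id, path, claims):
    """Return (allowed, reason) for a proposed write.

    claims: {normalized path: owning agent_id}
    """
    key = posixpath.normpath(path)  # so "src/./a.ts" matches "src/a.ts"
    owner = claims.get(key)
    if owner == agent_id:
        return True, "claimed by this agent"
    if owner is None:
        return False, "unclaimed: stop and report"
    return False, f"claimed by {owner}: stop and report"
```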
Phase 6: Verification
Purpose: Validate swarm completed successfully
Checks:
- All agents reported completion
- All planned files were modified
- No orphaned file claims
- Build succeeds (if applicable)
- Tests pass (if applicable)
- No unexpected files modified
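Most of these checks reduce to set comparisons. The following sketch covers the completion, planned-versus-changed, and orphaned-claim checks, and leaves the build and test checks to whatever tooling the project uses. The function name and all parameters are assumptions. In practice `changed_files` might come from a version-control diff.

```python
def verify_swarm(agents, completed, planned_files, changed_files, open_claims):
    """Return a list of problems; an empty list means these checks passed.

    agents, completed: iterables of agent IDs
    planned_files, changed_files: iterables of paths
    open_claims: {path: agent_id} for claims not yet released
    """
    problems = []
    for agent in sorted(set(agents) - set(completed)):
        problems.append(f"{agent} did not report completion")
    for path in sorted(set(planned_files) - set(changed_files)):
        problems.append(f"planned but untouched: {path}")
    for path in sorted(set(changed_files) - set(planned_files)):
        problems.append(f"unexpected change: {path}")
    for path, owner in sorted(open_claims.items()):
        problems.append(f"orphaned claim: {path} still held by {owner}")
    return problems
```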
Checkpoint Gates
Each phase has a gate that must pass before proceeding:
| Gate | Condition | Failure Action |
|---|---|---|
| Planning → Review | All planning agents completed | Wait or timeout |
| Review → Resolution | Conflict report generated | Re-run detection |
| Resolution → Claiming | All conflicts resolved | Return to resolution |
| Claiming → Implementation | All files claimed, no overlaps | Fix claim issues |
| Implementation → Verification | All agents completed | Investigate failures |
| Verification → Complete | All checks pass | Fix issues or report |
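One way to make the gate table executable is to pair each gate with a predicate over the orchestrator's state. The sketch below does that with a plain dict. Every state key (`plans_received`, `agent_count`, and so on) is a hypothetical name chosen for illustration; only the gate names and failure actions come from the table.

```python
# Each entry: (gate name, predicate over a state dict, failure action).
GATES = [
    ("Planning -> Review",
     lambda s: s["plans_received"] >= s["agent_count"], "Wait or timeout"),
    ("Review -> Resolution",
     lambda s: s["conflict_report"] is not None, "Re-run detection"),
    ("Resolution -> Claiming",
     lambda s: s["unresolved_conflicts"] == 0, "Return to resolution"),
    ("Claiming -> Implementation",
     lambda s: s["unclaimed_files"] == 0 and s["overlapping_claims"] == 0,
     "Fix claim issues"),
    ("Implementation -> Verification",
     lambda s: s["agents_completed"] >= s["agent_count"], "Investigate failures"),
    ("Verification -> Complete",
     lambda s: s["checks_failed"] == 0, "Fix issues or report"),
]


def first_blocked_gate(state):
    """Return (gate name, failure action) for the first gate that fails, else None."""
    for name, passes, failure_action in GATES:
        if not passes(state):
            return name, failure_action
    return None
```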
State Machine
┌─────────────┐
│ INITIALIZED │
└──────┬──────┘
       │ Start swarm
       ▼
┌─────────────┐
│  PLANNING   │◄────────────────┐
└──────┬──────┘                 │
       │ All plans received     │
       ▼                        │
┌─────────────┐                 │
│  REVIEWING  │                 │
└──────┬──────┘                 │
       │ Conflicts identified   │
       ▼                        │
┌─────────────┐                 │
│  RESOLVING  │─────────────────┘
└──────┬──────┘   Need re-plan
       │ All resolved
       ▼
┌─────────────┐
│  CLAIMING   │
└──────┬──────┘
       │ Files assigned
       ▼
┌─────────────┐
│IMPLEMENTING │◄───┐
└──────┬──────┘    │
       │           │ Next batch
       ▼           │
┌─────────────┐    │
│  VERIFYING  │────┘
└──────┬──────┘   More batches
       │ All verified
       ▼
┌─────────────┐
│  COMPLETED  │
└─────────────┘
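The diagram translates directly into a transition table. The sketch below encodes the states and arrows shown above, including the two back-edges (RESOLVING → PLANNING for a re-plan, VERIFYING → IMPLEMENTING for the next batch). The `Phase` enum and `advance` helper are illustrative names, not part of the plugin.

```python
from enum import Enum


class Phase(Enum):
    INITIALIZED = "initialized"
    PLANNING = "planning"
    REVIEWING = "reviewing"
    RESOLVING = "resolving"
    CLAIMING = "claiming"
    IMPLEMENTING = "implementing"
    VERIFYING = "verifying"
    COMPLETED = "completed"


# Allowed next phases for each phase, taken from the diagram's arrows.
TRANSITIONS = {
    Phase.INITIALIZED: {Phase.PLANNING},
    Phase.PLANNING: {Phase.REVIEWING},
    Phase.REVIEWING: {Phase.RESOLVING},
    Phase.RESOLVING: {Phase.CLAIMING, Phase.PLANNING},        # re-plan loop
    Phase.CLAIMING: {Phase.IMPLEMENTING},
    Phase.IMPLEMENTING: {Phase.VERIFYING},
    Phase.VERIFYING: {Phase.COMPLETED, Phase.IMPLEMENTING},   # next batch loop
    Phase.COMPLETED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Rejecting illegal moves up front keeps an orchestrator from, for example, jumping from PLANNING straight to CLAIMING and skipping conflict review.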
Benefits
- Conflict Prevention: Overlapping edits are detected and resolved before implementation begins
- Visibility: Know exactly what each agent will do
- Control: Orchestrator maintains full oversight
- Recovery: Can roll back or adjust between phases
- Efficiency: Parallel execution where safe, sequential where needed