Runtime resilience
Liveness & Recovery
Long-running autonomy only works if the runtime can detect degradation, preserve state, recover honestly, and escalate when needed.
Why liveness matters
A system that can plan work but cannot detect its own degradation does not deliver reliable autonomy. tamux treats liveness as a separate architectural layer above execution, with checkpointing, health monitoring, stuck-pattern analysis, and recovery planning.
Checkpointing
Goal runs are checkpointed before and after meaningful step transitions. The checkpoint model captures multiple layers of state rather than just a shallow status string.
- Goal state
- Execution state
- Context state
- Runtime state
This is what lets the system recover honestly after crashes or other failures.
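The layered checkpoint can be sketched as a simple structure that serializes all four state layers together. The field names and JSON persistence here are illustrative assumptions, not tamux's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Checkpoint:
    # Hypothetical layer contents; tamux's real fields are not documented here.
    goal_state: dict       # goal definition, plan, progress so far
    execution_state: dict  # current step, recent tool activity
    context_state: dict    # context snapshot or summary
    runtime_state: dict    # health metrics, counters, timers

    def save(self, path: str) -> None:
        # Persist every layer, not just a shallow status string.
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path: str) -> "Checkpoint":
        # Restore the full layered state for honest recovery after a crash.
        with open(path) as f:
            return cls(**json.load(f))
```

Because every layer round-trips through the checkpoint, a recovered run can resume from the last meaningful step transition rather than restarting the goal from scratch.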
Health monitoring
The health monitor periodically evaluates the runtime instead of waiting until the operator notices something is wrong. It watches metrics such as tool-call frequency, error rate, context pressure, and consecutive failures.
Healthy
Execution is progressing normally and indicators remain within expected bounds.
Degraded
Progress exists, but the system is showing signs of rising risk, instability, or inefficiency.
Stuck
The runtime is looping, not making progress, or exhausting a key resource.
Crashed
The runtime has crossed error thresholds that require recovery, not optimistic continuation.
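A minimal classifier over the monitored metrics illustrates how the four states can be derived. The specific thresholds and parameter names are assumptions for illustration, not tamux's actual values:

```python
def classify_health(error_rate: float,
                    context_pressure: float,
                    consecutive_failures: int,
                    made_progress: bool) -> str:
    """Map monitored metrics to a health state (illustrative cutoffs only)."""
    if consecutive_failures >= 5:
        return "crashed"     # past the point of optimistic continuation
    if not made_progress or context_pressure > 0.95:
        return "stuck"       # looping/idle, or a key resource is exhausted
    if error_rate > 0.2 or context_pressure > 0.8:
        return "degraded"    # still progressing, but risk is rising
    return "healthy"         # indicators within expected bounds
```

The ordering matters: the most severe state is checked first, so a run that is both degraded and crashed reports "crashed" and routes to recovery rather than continuing.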
Stuck patterns
tamux does not reduce “stuck” to a single timeout. It can classify several failure patterns:
| Pattern | What it means |
|---|---|
| No progress | The runtime is idle or stagnant for too long. |
| Error loop | The same or equivalent errors repeat with no real change. |
| Tool-call loop | The runtime cycles through repeated tool patterns without progress. |
| Resource exhaustion | Context or runtime pressure is too high to continue honestly. |
| Timeout | Work exceeds its allowed duration in a way that implies failure rather than legitimately slow progress, and so warrants intervention. |
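One of these patterns, the error loop, can be sketched as a check over recent error signatures: the loop is flagged when one normalized signature dominates the recent window. The window size, threshold, and function name are hypothetical:

```python
from collections import Counter

def is_error_loop(recent_errors: list[str],
                  window: int = 6,
                  threshold: int = 4) -> bool:
    """Flag when the same (or equivalent) error dominates the recent window.

    Illustrative sketch: signatures are assumed pre-normalized so that
    equivalent errors compare equal.
    """
    tail = recent_errors[-window:]
    if len(tail) < window:
        return False  # not enough history to call it a loop
    _, count = Counter(tail).most_common(1)[0]
    return count >= threshold
```

A tool-call loop detector would follow the same shape over sequences of tool invocations instead of error signatures, which is why these patterns are classified structurally rather than by a single timeout.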
Recovery ladder
When the runtime detects problems, it should not always jump directly to operator interruption. tamux uses a graduated escalation pathway:
- Self-correction: retry, rotate strategy, compress context, or change local behavior.
- Sub-agent help: use a narrower specialist or bounded child path.
- Operator escalation: ask for input or show the blockage honestly.
- External escalation: notify through external channels when configured and appropriate.
This keeps the system from both over-escalating and silently grinding forever.
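The graduated pathway can be sketched as choosing the least disruptive rung that applies. The context flags and rung names here are assumptions for illustration, not tamux's API:

```python
def next_recovery_action(ctx: dict) -> str:
    """Return the least disruptive recovery rung that applies (sketch)."""
    if ctx.get("retryable"):
        return "self_correct"         # retry, rotate strategy, compress context
    if ctx.get("delegable"):
        return "sub_agent"            # hand off to a narrower specialist
    if ctx.get("operator_available", True):
        return "operator_escalation"  # ask for input, show the blockage honestly
    return "external_escalation"      # notify configured external channels
```

Walking the rungs in order is what prevents both failure modes named above: the system neither interrupts the operator for a locally fixable problem nor retries silently once local options are exhausted.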
Why this beats a plain agent loop
- It can preserve work across failure boundaries.
- It can explain what went wrong instead of just timing out.
- It can detect loops and stagnation structurally.
- It can replan or escalate instead of silently retrying forever.
- It makes long-running goals trustworthy enough to use in real workflows.
Goal runner connection
Liveness features are most visible in goal runs because that is where long-duration work lives. Checkpointing, health monitoring, stuck detection, and recovery are what make goal runners more than a superficial planning wrapper.
Operator expectation
Operators should expect that tamux will sometimes pause, replan, ask for approval, or escalate rather than forcing forward motion. That is a sign of a more trustworthy autonomous environment, not a sign of weakness.