Runtime resilience
Liveness & Recovery
Long-running autonomy only works if the runtime can detect degradation, preserve state, recover honestly, and escalate when needed.
Why liveness matters
A system that can plan work but cannot detect its own degradation does not deliver reliable autonomy. tamux treats liveness as a separate architectural layer above execution, with checkpointing, health monitoring, stuck-pattern analysis, and recovery planning.
Checkpointing
Goal runs are checkpointed before and after meaningful step transitions. The checkpoint model captures multiple layers of state rather than just a shallow status string.
- Goal state
- Execution state
- Context state
- Runtime state
This is what lets the system recover honestly after crashes or other failures.
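The layered checkpoint can be sketched as a simple structure that serializes all four state layers together. The field names and JSON persistence here are illustrative assumptions, not tamux's actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Checkpoint:
    # Hypothetical layer contents; tamux's real fields are not documented here.
    goal_state: dict       # goal definition, plan, progress so far
    execution_state: dict  # current step, recent tool activity
    context_state: dict    # context snapshot or summary
    runtime_state: dict    # health metrics, counters, timers

    def save(self, path: str) -> None:
        # Persist every layer, not just a shallow status string.
        with open(path, "w") as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path: str) -> "Checkpoint":
        # Restore the full layered state for honest recovery after a crash.
        with open(path) as f:
            return cls(**json.load(f))
```

Because every layer round-trips through the checkpoint, a recovered run can resume from the last meaningful step transition rather than restarting the goal from scratch.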
Health monitoring
The health monitor periodically evaluates the runtime instead of waiting until the operator notices something is wrong. It watches metrics such as tool-call frequency, error rate, context pressure, and consecutive failures.
Healthy
Execution is progressing normally and indicators remain within expected bounds.
Degraded
Progress exists, but the system is showing signs of rising risk, instability, or inefficiency.
Stuck
The runtime is looping, not making progress, or exhausting a key resource.
Crashed
The runtime has crossed error thresholds that require recovery, not optimistic continuation.
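A minimal classifier over the monitored metrics illustrates how the four states can be derived. The specific thresholds and parameter names are assumptions for illustration, not tamux's actual values:

```python
def classify_health(error_rate: float,
                    context_pressure: float,
                    consecutive_failures: int,
                    made_progress: bool) -> str:
    """Map monitored metrics to a health state (illustrative cutoffs only)."""
    if consecutive_failures >= 5:
        return "crashed"     # past the point of optimistic continuation
    if not made_progress or context_pressure > 0.95:
        return "stuck"       # looping/idle, or a key resource is exhausted
    if error_rate > 0.2 or context_pressure > 0.8:
        return "degraded"    # still progressing, but risk is rising
    return "healthy"         # indicators within expected bounds
```

The ordering matters: the most severe state is checked first, so a run that is both degraded and crashed reports "crashed" and routes to recovery rather than continuing.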
Stuck patterns
tamux does not reduce “stuck” to a single timeout. It can classify several failure patterns:
| Pattern | What it means |
|---|---|
| No progress | The runtime is idle or stagnant for too long. |
| Error loop | The same or equivalent errors repeat with no real change. |
| Tool-call loop | The runtime cycles through repeated tool patterns without progress. |
| Resource exhaustion | Context or runtime pressure is too high to continue honestly. |
| Timeout | Work exceeds its allowed duration in a way that implies failure rather than legitimately slow progress, and so warrants intervention. |
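One of these patterns, the error loop, can be sketched as a check over recent error signatures: the loop is flagged when one normalized signature dominates the recent window. The window size, threshold, and function name are hypothetical:

```python
from collections import Counter

def is_error_loop(recent_errors: list[str],
                  window: int = 6,
                  threshold: int = 4) -> bool:
    """Flag when the same (or equivalent) error dominates the recent window.

    Illustrative sketch: signatures are assumed pre-normalized so that
    equivalent errors compare equal.
    """
    tail = recent_errors[-window:]
    if len(tail) < window:
        return False  # not enough history to call it a loop
    _, count = Counter(tail).most_common(1)[0]
    return count >= threshold
```

A tool-call loop detector would follow the same shape over sequences of tool invocations instead of error signatures, which is why these patterns are classified structurally rather than by a single timeout.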
Recovery ladder
When the runtime detects problems, it should not always jump directly to operator interruption. tamux uses a graduated escalation pathway:
- Self-correction: retry, rotate strategy, compress context, or change local behavior.
- Sub-agent help: use a narrower specialist or bounded child path.
- Operator escalation: ask for input or show the blockage honestly.
- External escalation: notify through external channels when configured and appropriate.
This keeps the system from both over-escalating and silently grinding forever.
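The graduated pathway can be sketched as choosing the least disruptive rung that applies. The context flags and rung names here are assumptions for illustration, not tamux's API:

```python
def next_recovery_action(ctx: dict) -> str:
    """Return the least disruptive recovery rung that applies (sketch)."""
    if ctx.get("retryable"):
        return "self_correct"         # retry, rotate strategy, compress context
    if ctx.get("delegable"):
        return "sub_agent"            # hand off to a narrower specialist
    if ctx.get("operator_available", True):
        return "operator_escalation"  # ask for input, show the blockage honestly
    return "external_escalation"      # notify configured external channels
```

Walking the rungs in order is what prevents both failure modes named above: the system neither interrupts the operator for a locally fixable problem nor retries silently once local options are exhausted.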
Why this beats a plain agent loop
- It can preserve work across failure boundaries.
- It can explain what went wrong instead of just timing out.
- It can detect loops and stagnation structurally.
- It can replan or escalate instead of silently retrying forever.
- It makes long-running goals trustworthy enough to use in real workflows.
Goal runner connection
Liveness features are most visible in goal runs because that is where long-duration work lives. Checkpointing, health monitoring, stuck detection, and recovery are what make goal runners more than a superficial planning wrapper.
Operator expectation
Operators should expect that tamux will sometimes pause, replan, ask for approval, or escalate rather than forcing forward motion. That is a sign of a more trustworthy autonomous environment, not a sign of weakness.