TRD Section 10 — Resilience and Degradation
Failure Modes
The table below consolidates failure modes across all system components. Each row defines the failure condition, the system’s behaviour, the user-facing impact, and the recovery path.
| Component | Failure mode | System behaviour | User-facing impact | Recovery |
|---|---|---|---|---|
| Chat API | Service unreachable (Fly.io machine down, deploy failure) | Widget detects connection error on first request; activates fallback state after VITE_STREAM_TIMEOUT_MS (default 10s) | Visitor sees fallback message with link to contact form | Ops alert via uptime monitor; redeploy or Fly.io auto-restart |
| Chat API | HTTP 5xx on subsequent turn (session already active) | Widget shows inline error for that turn only; session continues | Visitor sees an inline error with a retry prompt for that turn | Automatic; the next visitor message retries normally |
| Conversation Orchestrator | update_state LLM call fails or times out | Log state_extraction_failure; proceed to score_router with unchanged SessionState | No visible impact; LLM continues with prior qualification state | None; next turn retries extraction from current message |
| Conversation Orchestrator | generate_response LLM call fails | Log llm_generation_failure; return graceful fallback message; route to propose_handoff with reason = "llm_failure" | Visitor receives: “I’m having trouble responding right now — can I connect you with the team directly?” | Handoff captures lead; session closes or continues from proposal |
| Conversation Orchestrator | generate_response stream timeout (> LLM_STREAM_TIMEOUT_MS) | Close stream; emit stream_timeout event; return same fallback message as LLM call failure | Same as LLM call failure | Same as LLM call failure |
| LLM (Claude Haiku 4.5) | Anthropic API degradation or outage | generate_response and update_state calls time out or return errors; Orchestrator error handling activates | Visitors receive fallback message; active sessions trigger llm_failure handoff path | Ops alert via LLM error rate metric; no automated recovery (Anthropic SLA) |
| RAG (Knowledge Retriever) | retrieve_knowledge returns no results above threshold | Log rag_no_result; proceed with response generation without retrieved context; LLM acknowledges knowledge limit | Visitor receives honest “I don’t have that information” response | None required; LLM handles gracefully via prompt instruction |
| RAG (Knowledge Retriever) | Embedding API (OpenAI) unavailable | Log embedding_api_failure; retrieval cannot proceed; treat as rag_no_result | Same as no-result case above | Ops alert via RAG failure rate metric; no automated retry in v1 |
| RAG (Knowledge Retriever) | pgvector / vector search failure | Log vector_search_failure; treat as rag_no_result | Same as no-result case above | Same as embedding API failure |
| PostgreSQL (Checkpointer) | Read failure at session start | Log checkpointer_read_failure at ERROR; initialise fresh SessionState; session proceeds as new | Visitor loses prior session context; conversation restarts | No automated recovery; ops alert via checkpointer failure rate metric |
| PostgreSQL (Checkpointer) | Write failure at turn end | Log checkpointer_write_failure at ERROR; turn considered complete (response already streamed) | No visible impact; response was delivered | Next turn loads last good persisted state; current turn’s qualification progress may be lost |
| Human Handoff Subsystem | Slack delivery fails after 3 retries | Log handoff_partial_failure at ERROR; send fallback email to FALLBACK_EMAIL_ADDRESS; persist HandoffRecord with slack_ok = False | No visible impact; visitor has already received the handoff proposal | Manual follow-up via ops alert |
| Human Handoff Subsystem | CRM delivery fails after 3 retries | Log handoff_partial_failure at ERROR; send fallback email; persist HandoffRecord with crm_ok = False | No visible impact | Same as Slack failure |
| Human Handoff Subsystem | Both Slack and CRM fail after retries | Log handoff_total_failure at CRITICAL; send fallback email; handoff_triggered = False in SessionState | No visible impact; visitor’s email was captured | Ops CRITICAL alert; manual follow-up required; handoff_triggered = False allows re-attempt if visitor returns (v2 feature) |
| Human Handoff Subsystem | Fallback SMTP also fails | Log fallback_email_failure at CRITICAL; no further automated delivery in v1 | No visible impact | Ops CRITICAL alert; manual recovery via raw session log in PostgreSQL |
| Human Handoff Subsystem | generate_context_packet() raises exception | Log context_packet_generation_failure at ERROR; abort delivery; emit CRITICAL alert; handoff_triggered = False | No visible impact | Ops CRITICAL alert; manual recovery |
| Chat Widget | api-url attribute missing | Widget logs ConfigurationError; renders in permanent fallback state | Visitor sees fallback message without link | Fix configuration and redeploy widget embed |
| Chat Widget | fallback-url attribute missing | Widget logs ConfigurationError; fallback state renders without a link (degraded but functional) | Visitor sees fallback message without a clickable link | Fix configuration; not a blocking failure |
| State Machine | update_state produces invalid QualificationDelta | Log state_update_validation_failure; discard delta; session continues with unchanged QualificationState | No visible impact | Next turn retries extraction from full conversation context |
| State Machine | Qualification dimension monotonicity violation | Log qualification_monotonicity_violation at WARN; reject downgrade silently; retain higher confidence level | No visible impact | No action required |
| State Machine | CONTEXT_WINDOW_TURNS set to 0 or negative | Raise ConfigurationError at startup; prevent service from starting | Service unavailable; widget activates fallback | Fix configuration and redeploy |
| Business Hours Detection | BUSINESS_HOURS_TIMEZONE not set or invalid | Default to Europe/Madrid; log ConfigurationError at WARN | No visible impact; handoff routing proceeds with default timezone | Fix configuration; low urgency |
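As an illustration of the monotonicity row in the table above, the following is a minimal sketch of rejecting a confidence downgrade. The level names, their ordering, and the `merge_confidence` helper are assumptions made for this example; this section does not define them.

```python
import logging

logger = logging.getLogger("state_machine")

# Hypothetical confidence ordering; the real levels are not defined in this section.
CONFIDENCE_ORDER = {"unknown": 0, "low": 1, "medium": 2, "high": 3}


def merge_confidence(dimension: str, current: str, proposed: str) -> str:
    """Return the confidence level to keep for one qualification dimension."""
    if CONFIDENCE_ORDER[proposed] < CONFIDENCE_ORDER[current]:
        # Downgrade attempt: log at WARN, reject it, and retain the higher level.
        logger.warning(
            "qualification_monotonicity_violation dimension=%s current=%s proposed=%s",
            dimension, current, proposed,
        )
        return current
    return proposed
```

For example, `merge_confidence("problem_fit", "high", "low")` keeps `"high"` and emits one WARN log line, while an upgrade from `"low"` to `"medium"` is accepted.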
Graceful Degradation
AI Backend Unavailable — Fallback Form
Resolution of EC-07.
When the AI backend is unavailable, the Chat Widget activates a permanent fallback state for the duration of the browser session. The fallback path has zero dependency on the AI backend — it is architecturally independent by design.
Fallback activation conditions:
| Condition | Trigger |
|---|---|
| Connection error on first request | Widget cannot reach api-url within VITE_STREAM_TIMEOUT_MS |
| HTTP 5xx on first request | Chat API returns a server error before any tokens are streamed |
| Stream timeout on first request | No token received within VITE_STREAM_TIMEOUT_MS |
Note: A connection error or timeout on a subsequent turn (session already active) does not activate fallback — it shows an inline error for that turn only. Fallback is only triggered on the first request failure, when no session context has been established.
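The activation rule above reduces to a single decision on whether the failing request was the first one. The sketch below expresses that rule in Python for consistency with the rest of this section; the widget itself would implement it in its own language, and the `Failure` enum and `widget_reaction` function are hypothetical names.

```python
from enum import Enum


class Failure(Enum):
    """Failure kinds from the activation table (names are illustrative)."""
    CONNECTION_ERROR = "connection_error"
    HTTP_5XX = "http_5xx"
    STREAM_TIMEOUT = "stream_timeout"


def widget_reaction(failure: Failure, is_first_request: bool) -> str:
    """Map a request failure to the widget's reaction."""
    if is_first_request:
        # No session context exists yet: enter the session-permanent fallback state.
        return "fallback"
    # A session is already active: show an inline error for this turn only.
    return "inline_error"
```

The failure kind does not change the outcome in this sketch; it is kept as a parameter because a real widget would likely log it.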
Fallback UI:
The chat panel replaces the message input with:
“Our chat assistant isn’t available right now. You can still reach us using our contact form.”
[Contact us →] (opens fallback-url in a new tab)
The visitor cannot send messages in fallback state. The launcher button remains visible.
Fallback is session-permanent. Once activated, no retry is attempted from the widget. Repeated retries on a down backend generate noise in error logs without improving the visitor experience.
Fallback form submission path:
The fallback-url attribute points to the existing company contact form or any external URL. The form submission is handled entirely by the host site’s own infrastructure. This system builds no backend endpoint for fallback submissions.
Known gap: Leads submitted via the fallback form are not automatically created in the CRM by this system. They are handled by whatever process currently handles the company contact form. The sales team has been informed. This is an accepted limitation for MVP. See also: Section 12 — Open Questions.
Handoff Partial Failure — One Channel Down
Resolution of FR-19.
When one delivery channel (Slack or CRM) fails after exhausting retries and the other succeeds, the handoff is considered partially failed.
Partial failure handling:
1. Log failed channel at ERROR:
fields: session_id, failed_channel, last_http_status, attempt_count, triggered_at
2. Emit WARN-level ops alert (Better Stack)
3. Send fallback email to FALLBACK_EMAIL_ADDRESS:
Subject: "[HANDOFF FALLBACK] [lead_level] lead — [visitor_email or session_id]"
Body: full ContextPacket as plain text
4. Persist HandoffRecord:
slack_ok / crm_ok reflects actual delivery outcome
5. Set SessionState.handoff_triggered = True
(one channel confirmed — handoff is considered dispatched)
The visitor is not informed of the delivery failure in either case. The propose_handoff node has already delivered its proposal and collected the visitor’s email before delivery is attempted.
Handoff Total Failure — Both Channels Down
When both Slack and CRM fail after exhausting retries:
Total failure handling:
1. Log both failed channels at ERROR and handoff_total_failure at CRITICAL
2. Emit CRITICAL-level ops alert (Better Stack)
3. Send fallback email to FALLBACK_EMAIL_ADDRESS (same format as partial failure)
4. Persist HandoffRecord: slack_ok = False, crm_ok = False
5. Set SessionState.handoff_triggered = False
// False allows re-attempt if the visitor returns (v2 feature)
If SMTP also fails, log fallback_email_failure at CRITICAL. No further automated delivery in v1. Manual recovery is required via the raw session log in PostgreSQL.
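The partial and total failure rules can be summarised as a function of the two delivery flags. This is a sketch under stated assumptions: `HandoffOutcome` and `resolve_handoff` are hypothetical names, and the both-channels-succeeded branch (no alert, no fallback email) is inferred rather than specified in this section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoffOutcome:
    slack_ok: bool
    crm_ok: bool
    alert_level: Optional[str]  # None when no ops alert is needed
    send_fallback_email: bool
    handoff_triggered: bool


def resolve_handoff(slack_ok: bool, crm_ok: bool) -> HandoffOutcome:
    """Derive alerting and state flags from the per-channel delivery results."""
    if slack_ok and crm_ok:
        # Assumed happy path: nothing to alert on, handoff dispatched.
        return HandoffOutcome(True, True, None, False, True)
    if slack_ok or crm_ok:
        # Partial failure: one channel confirmed, so the handoff counts as dispatched.
        return HandoffOutcome(slack_ok, crm_ok, "WARN", True, True)
    # Total failure: handoff_triggered stays False to allow a later re-attempt.
    return HandoffOutcome(False, False, "CRITICAL", True, False)
```

Keeping the decision in one pure function makes the three outcomes easy to unit-test separately from the Slack, CRM, and SMTP clients.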
LLM Failure Mid-Conversation — Capture Handoff
When generate_response fails or times out during an active session, the orchestrator does not terminate the session silently. Instead:
- It returns a graceful fallback message: “I’m having trouble responding right now — can I connect you with the team directly?”
- It routes to propose_handoff with reason = "llm_failure".
- The handoff proposal captures the visitor’s email before the session ends.
This ensures that even a hard LLM failure produces a lead capture attempt, not a silent drop.
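A minimal sketch of this routing, assuming the LLM call is wrapped in a callable that raises on failure or timeout. The `respond_or_handoff` function and the shape of the returned dictionary are hypothetical; only the log event name, the target node, and the reason value come from this section.

```python
import logging
from typing import Callable

logger = logging.getLogger("orchestrator")


def respond_or_handoff(generate: Callable[[], str], fallback_message: str) -> dict:
    """Return the LLM response, or the fallback message plus a handoff route."""
    try:
        return {"message": generate(), "next_node": None, "reason": None}
    except Exception:
        # Any generation error or timeout takes the same path in this sketch.
        logger.error("llm_generation_failure")
        return {
            "message": fallback_message,
            "next_node": "propose_handoff",
            "reason": "llm_failure",
        }
```

The fallback text is passed in rather than hard-coded so that the copy can live in configuration alongside other visitor-facing strings.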
RAG Unavailable — Retrieval Bypass
When the Knowledge Retriever cannot return results (embedding API failure, vector search failure, or no results above threshold):
- The orchestrator proceeds with response generation without retrieved context.
- The LLM is instructed to acknowledge the knowledge limit honestly: “I don’t have specific information on that — let me connect you with the team.”
- If the query was domain-specific, this typically triggers a natural handoff.
- No fallback retrieval source is implemented in v1.
RAG failure does not affect session continuity. The conversation continues; only the quality of domain-specific answers is degraded.
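The bypass can be sketched as a wrapper that always returns a list of chunks, empty on any failure. The exception classes, the `embed` and `search` callables, and the log levels are assumptions for illustration; the log event names are the ones used in this section.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("rag")


class EmbeddingAPIError(Exception):
    """Hypothetical error raised when the embedding API is unavailable."""


class VectorSearchError(Exception):
    """Hypothetical error raised when the vector search fails."""


def retrieve_or_bypass(
    query: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float]], list],
    threshold: float,
) -> list:
    """Return chunk texts scoring at or above threshold, or [] on any failure."""
    try:
        vector = embed(query)
    except EmbeddingAPIError:
        logger.warning("embedding_api_failure")
        return []  # treated as rag_no_result
    try:
        hits = search(vector)  # expected shape: list of (chunk_text, score)
    except VectorSearchError:
        logger.warning("vector_search_failure")
        return []  # treated as rag_no_result
    chunks = [text for text, score in hits if score >= threshold]
    if not chunks:
        logger.info("rag_no_result")
    return chunks
```

Because the caller always receives a list, response generation needs no special case: an empty list simply means the retrieved-chunks layer is omitted from the prompt.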
Checkpointer Failure — State Loss Scenarios
| Scenario | Behaviour | Data lost |
|---|---|---|
| Read failure at session start | Fresh SessionState initialised; session proceeds as new |
All prior session context — conversation restarts from scratch |
| Write failure at turn end | Turn completes normally (response already streamed) | Current turn’s qualification dimension updates |
Neither failure terminates the session from the visitor’s perspective. The risk is qualification state loss, which may cause the LLM to re-ask questions already answered in the turn whose state was lost. This is acceptable for MVP given the low expected frequency of checkpointer failures.
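Both scenarios can be sketched as thin wrappers around the storage calls. The simplified `SessionState` dataclass and the `read` and `write` callables are stand-ins invented for this example, not the actual checkpointer interface.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable

logger = logging.getLogger("checkpointer")


@dataclass
class SessionState:
    """Simplified stand-in for the real SessionState."""
    session_id: str
    messages: list = field(default_factory=list)


def load_state(session_id: str, read: Callable[[str], SessionState]) -> SessionState:
    """Load persisted state, falling back to a fresh state on read failure."""
    try:
        return read(session_id)
    except Exception:
        logger.error("checkpointer_read_failure session_id=%s", session_id)
        return SessionState(session_id=session_id)


def save_state(state: SessionState, write: Callable[[SessionState], None]) -> bool:
    """Persist state; report failure without raising, since the turn is already complete."""
    try:
        write(state)
        return True
    except Exception:
        logger.error("checkpointer_write_failure session_id=%s", state.session_id)
        return False
```

Returning a boolean from `save_state` lets the caller feed a failure rate metric without interrupting the response that has already been streamed.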
Context Window Management
Resolution of EC-13.
Strategy
The conversation history passed to the LLM is a sliding window of fixed maximum size. When the window is full and a new message is added, the oldest exchange pair (one visitor message + one assistant message) is evicted.
This strategy is chosen over hard limits (which terminate conversations abruptly) and summarisation (which adds LLM cost and latency per turn) for MVP. See Engineering Review EC-13 for the evaluation.
What the Sliding Window Contains
The window holds the last CONTEXT_WINDOW_TURNS visitor/assistant exchange pairs. Default: 10 pairs (20 individual messages).
What is evicted: raw message history — the text of older exchanges.
What is never evicted: qualification state. The QualificationState object (problem_fit, authority_fit, company_fit, timing_fit, confidence levels, signals_observed) is stored independently of the message window and injected fresh every turn. The LLM never loses qualification context due to window eviction.
Additionally, the following SessionState fields survive window eviction and are always available:
- lead_level
- turn_counter
- stage3_proposals_issued
- visitor_email, visitor_name, visitor_company, visitor_role
- is_consultant, is_negative_persona, is_no_fit
- referral_mentioned
- signals_observed
Eviction Behaviour
On each new exchange:
    if len(messages) >= CONTEXT_WINDOW_TURNS * 2:
        messages.pop(0)  # evict oldest visitor message
        messages.pop(0)  # evict oldest assistant message
    messages.append(new_visitor_message)
    messages.append(new_assistant_message)
Eviction happens before the new exchange is appended, ensuring the window never exceeds the configured size.
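To make the eviction order concrete, here is a small worked run with a window of 3 pairs instead of the default 10. The `append_exchange` helper and the message strings are illustrative only.

```python
def append_exchange(messages: list, visitor_msg: str, assistant_msg: str, max_turns: int) -> None:
    """Append one exchange pair in place, evicting the oldest pair if the window is full."""
    if len(messages) >= max_turns * 2:
        del messages[:2]  # oldest visitor message and oldest assistant message
    messages.append(visitor_msg)
    messages.append(assistant_msg)


window: list = []
for turn in range(1, 5):
    append_exchange(window, f"visitor-{turn}", f"assistant-{turn}", max_turns=3)

print(window)
# ['visitor-2', 'assistant-2', 'visitor-3', 'assistant-3', 'visitor-4', 'assistant-4']
```

After the fourth exchange, the first pair has been evicted and the window still holds exactly 3 pairs.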
Configuration
| Variable | Default | Description |
|---|---|---|
| CONTEXT_WINDOW_TURNS | 10 | Number of visitor/assistant exchange pairs retained in the sliding window. Setting to 0 or negative raises a ConfigurationError at startup. |
The window size is tunable post-launch. Increasing the window increases per-turn token cost and latency. Decreasing it risks the LLM re-asking questions already answered in evicted turns (mitigated by the always-fresh QualificationState injection).
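A minimal sketch of the startup check, assuming the value is read from an environment variable. The `load_context_window_turns` function is a hypothetical name, and rejecting non-integer values is an assumption that goes beyond the rule stated in the table.

```python
import os


class ConfigurationError(Exception):
    """Raised at startup when configuration is invalid."""


def load_context_window_turns(env=None, default: int = 10) -> int:
    """Read and validate CONTEXT_WINDOW_TURNS from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    raw = env.get("CONTEXT_WINDOW_TURNS", str(default))
    try:
        turns = int(raw)
    except ValueError as exc:
        # Assumption: a non-integer value is treated as a configuration error too.
        raise ConfigurationError(
            f"CONTEXT_WINDOW_TURNS must be an integer, got {raw!r}"
        ) from exc
    if turns <= 0:
        raise ConfigurationError(
            f"CONTEXT_WINDOW_TURNS must be positive, got {turns}"
        )
    return turns
```

Calling this once during service startup, before any request handling begins, gives the fail-fast behaviour described above.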
Context Window Budget
The full context budget per turn is allocated as follows (reference values for CONTEXT_WINDOW_TURNS = 10):
| Layer | Allocation | Notes |
|---|---|---|
| System prompt (stable layers 1–6) | ~2,000 tokens | Role, conversation model, prohibited behaviours, knowledge scope, handoff instructions |
| Qualification state (layer 7) | ~500 tokens | JSON-serialised QualificationState; injected fresh every turn |
| Retrieved chunks (layer 8) | ~1,500 tokens | Only when retrieve_knowledge is called; omitted otherwise |
| Conversation history (layer 9) | ~5,000 tokens | Last 10 exchange pairs (20 messages) at ~250 tokens per message |
| Total (with retrieval) | ~9,000 tokens | Well within Claude Haiku 4.5’s 200K token context window |
The system is not at risk of context overflow at default configuration. The CONTEXT_WINDOW_TURNS cap is a cost control, not a hard technical constraint at current window sizes.
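The budget arithmetic can be checked with a short estimate. The constants mirror the reference allocations in the table; the 250-token per-message figure is an assumption chosen so that 20 messages reproduce the ~5,000-token history allocation, and `estimate_turn_tokens` is a hypothetical helper.

```python
SYSTEM_PROMPT_TOKENS = 2_000
QUALIFICATION_STATE_TOKENS = 500
RETRIEVED_CHUNKS_TOKENS = 1_500
TOKENS_PER_MESSAGE = 250  # assumed average; 20 messages give ~5,000 tokens
MODEL_CONTEXT_TOKENS = 200_000


def estimate_turn_tokens(context_window_turns: int = 10, with_retrieval: bool = True) -> int:
    """Estimate per-turn prompt size from the reference allocations."""
    history = context_window_turns * 2 * TOKENS_PER_MESSAGE
    total = SYSTEM_PROMPT_TOKENS + QUALIFICATION_STATE_TOKENS + history
    if with_retrieval:
        total += RETRIEVED_CHUNKS_TOKENS
    return total


print(estimate_turn_tokens())                      # 9000
print(estimate_turn_tokens(with_retrieval=False))  # 7500
print(estimate_turn_tokens(20))                    # 14000
```

Even doubling the window to 20 pairs keeps the estimate far below the model's context limit, which is consistent with treating the cap as a cost control.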
v1 Limitation
No summarisation of evicted turns is performed. If a visitor references something said in an evicted exchange, the LLM will not have that context. In practice, the QualificationState injection mitigates most continuity risks — the key facts (problem, authority, company, timing) are always present regardless of window eviction.
Summarisation-based context compression is identified as a v2 enhancement if post-launch conversation depth metrics indicate meaningful continuity failures.
Engineering concerns resolved by this section: EC-07 (graceful degradation fallback destination), EC-13 (context window strategy and turn limit). FR-19 (partial handoff failure behaviour) is fully specified here. The failure mode table consolidates error handling dispersed across Section 3 into a single reference.