Performance Requirements

Targets

| Metric | Target | Percentile | Measurement point |
|---|---|---|---|
| Chat response TTFT (Time to First Token) | < 3s | p95 | From API request received at Chat API to first SSE token delivered to client |
| RAG retrieval latency (embedding + vector search + threshold filter) | < 500ms | p95 | From retrieve_knowledge tool call received to RetrievalResult returned |
| OpenAI embedding API call | < 200ms | p95 | From HTTP request sent to response received; EU endpoint required |
| HNSW vector search (pgvector) | < 100ms | p95 | SQL query execution time; valid for corpus < 10K vectors; re-evaluate if corpus exceeds 1M |
| Widget load time | < 1s | | Non-blocking; defer load strategy; bundle must not block host page critical path |
| Widget bundle size | ≤ 200KB gzipped | | Measured at Phase 1 build; if > 250KB gzipped, evaluate tree-shaking before considering alternative library |

Streaming requirement: Streaming must be enabled in the frontend widget from day one. The TTFT target is not achievable without streaming — full-response latency is not the target metric and is not measured (EC-09).
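
To make the TTFT measurement concrete, the following is a minimal sketch of timing the first token on an SSE stream. It assumes tokens arrive as non-empty `data:` lines and that the caller supplies the timestamp at which the request was received; `time_to_first_token` and `fake_stream` are hypothetical names for illustration, not part of the Chat API.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_token(
    sse_lines: Iterable[str],
    request_received_at: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Return seconds from request receipt to the first non-empty SSE data line.

    Returns None if the stream ends without delivering any token.
    """
    for line in sse_lines:
        # SSE payload lines start with "data:". Comment lines (":...") and
        # blank separator lines are skipped because they carry no token.
        if line.startswith("data:") and line[len("data:"):].strip():
            return clock() - request_received_at
    return None


def fake_stream() -> Iterator[str]:
    """Hypothetical stream: a keep-alive comment, then two token events."""
    yield ": keep-alive"
    yield ""
    yield "data: Hello"
    yield ""
    yield "data: world"
```

In a real harness `request_received_at` would be taken from the same clock when the request reaches the Chat API. Only the first token is timed, which matches the requirement that full-response latency is not measured.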


Latency Budget

The following breakdown is informative. The per-stage figures are not binding SLAs, but they validate that the 3s end-to-end TTFT target is achievable and provide a baseline for diagnosing regressions.

| Stage | Typical range | p95 budget |
|---|---|---|
| Embedding query (OpenAI EU endpoint) | 50–100ms | 200ms |
| HNSW vector search (pgvector) | 50–100ms | 100ms |
| Threshold filter + result assembly | < 10ms | 20ms |
| LLM first token (Anthropic, streaming enabled) | 500ms–2,000ms | ~2,500ms |
| Network round-trip (EU, CDN) | 50–150ms | 180ms |
| Total | 700ms–2,500ms | < 3,000ms |

The 500ms p95 retrieval target (embedding + vector search + filter) is derived from this budget. With retrieval completing within 500ms, the remaining ~2.5s is sufficient for LLM first-token delivery under normal operating conditions.

Turns that do not trigger a retrieve_knowledge call skip the embedding and vector search stages entirely, making the full ~2.8s available to the LLM.
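
The budget arithmetic can be checked with a short sketch. The stage figures are copied from the p95 budget column; the variable names are illustrative. Per-stage percentiles do not add exactly, so the summed figure is only a rough check, not a derived SLA.

```python
# p95 stage budgets in milliseconds, copied from the latency budget table.
P95_BUDGET_MS = {
    "embedding": 200,
    "vector_search": 100,
    "filter_and_assembly": 20,
    "llm_first_token": 2500,
    "network_round_trip": 180,
}
RETRIEVAL_STAGES = ("embedding", "vector_search", "filter_and_assembly")
TTFT_TARGET_MS = 3000
RETRIEVAL_TARGET_MS = 500

# The three retrieval stages together, compared against the 500ms target.
retrieval_budget_ms = sum(P95_BUDGET_MS[s] for s in RETRIEVAL_STAGES)
retrieval_slack_ms = RETRIEVAL_TARGET_MS - retrieval_budget_ms

# All five stages together, compared against the 3s TTFT target.
total_budget_ms = sum(P95_BUDGET_MS.values())

# Time left once retrieval has used its full 500ms target.
remaining_after_retrieval_ms = TTFT_TARGET_MS - RETRIEVAL_TARGET_MS

# Turn without a retrieve_knowledge call: only the network precedes the LLM.
llm_headroom_no_retrieval_ms = TTFT_TARGET_MS - P95_BUDGET_MS["network_round_trip"]
```

The retrieval stages sum to 320ms, leaving 180ms of slack inside the 500ms retrieval target. The five stage budgets sum to 3,000ms, so the p95 column sits right at the end-to-end target with no spare margin. The ~2.5s remaining after retrieval also has to absorb the network round-trip, while a turn without retrieval has 2,820ms (the ~2.8s quoted above) before the first token is due.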


Stress Test Plan

Observed traffic baseline: ~100 unique visitors per day, ~5 per hour, with no significant traffic spikes (corporate website profile).

Expected concurrent sessions in production: At a 5% chat activation rate and a 10-minute average session duration, expected peak concurrency is below 1 session. The system will not experience meaningful concurrency pressure under normal operating conditions.
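
The concurrency estimate can be reproduced with Little's law (average concurrency = arrival rate × average duration). This is a sketch under the assumption that arrivals are spread evenly across the hour; it yields a steady-state average rather than a true peak, and the function name is illustrative.

```python
def expected_concurrency(
    visitors_per_hour: float,
    activation_rate: float,
    session_minutes: float,
) -> float:
    """Little's law: average concurrent sessions = arrival rate x duration."""
    sessions_per_hour = visitors_per_hour * activation_rate
    return sessions_per_hour * (session_minutes / 60.0)


# Figures from the traffic baseline: ~5 visitors/hour, 5% activation, 10 minutes.
chat_sessions_per_hour = 5 * 0.05
baseline_concurrency = expected_concurrency(5, 0.05, 10)
```

Under these assumptions roughly 0.25 chat sessions start per hour and average concurrency is about 0.04 sessions, well below 1. The size of any safety margin depends on which baseline is used: 10 concurrent sessions is 40 times the hourly session-start figure and a considerably larger multiple of the steady-state concurrency estimate.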

Test design: Because observed concurrency is sub-1, a traffic-replication test provides no useful signal. The test is designed as a stress test against a fixed capacity target that provides a meaningful safety margin over expected load.

| Parameter | Value |
|---|---|
| Test type | Stress test (fixed concurrency, not traffic replay) |
| Target concurrent sessions | 10 (~40× expected peak production concurrency) |
| Sustained duration | 10 minutes at peak concurrency |
| Ramp-up | 2 minutes linear ramp to 10 concurrent sessions |
| Traffic pattern | Simulated visitor messages at realistic inter-message cadence (15–30s between turns) |
| RAG trigger rate | ~60% of turns trigger a retrieve_knowledge call |
| Test environment | Staging on Fly.io, same instance class as production (see Section 6.1) |
| Tooling | k6 or Locust; to be confirmed by engineering before Phase 5 |
| Success criterion | p95 TTFT < 3s sustained across the full 10-minute window |
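
Because the tooling choice is still open, the load profile and pass/fail check are sketched here in tool-agnostic Python. The sketch assumes that only TTFT samples collected during the 10-minute plateau count toward the criterion and that p95 is computed with the nearest-rank method (one of several common percentile definitions); all function names are hypothetical.

```python
import math
from typing import Sequence

RAMP_SECONDS = 2 * 60        # linear ramp-up
HOLD_SECONDS = 10 * 60       # sustained plateau
TARGET_SESSIONS = 10
TTFT_TARGET_SECONDS = 3.0


def target_sessions_at(elapsed_seconds: float) -> int:
    """Concurrent sessions the load generator should hold at a given time.

    Ramps linearly to TARGET_SESSIONS, holds the plateau, and returns 0
    outside the test window. The ramp starts at 1 session, not 0.
    """
    if elapsed_seconds < 0 or elapsed_seconds > RAMP_SECONDS + HOLD_SECONDS:
        return 0
    if elapsed_seconds >= RAMP_SECONDS:
        return TARGET_SESSIONS
    return max(1, math.ceil(TARGET_SESSIONS * elapsed_seconds / RAMP_SECONDS))


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank p95: smallest sample with at least 95% at or below it."""
    if not samples:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def meets_success_criterion(plateau_ttft_seconds: Sequence[float]) -> bool:
    """True if p95 TTFT over the plateau samples is under the 3s target."""
    return p95(plateau_ttft_seconds) < TTFT_TARGET_SECONDS
```

A schedule function of this shape can typically be adapted to whichever load tool is selected, with the plateau TTFT samples exported from the run and passed to `meets_success_criterion`.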

Re-evaluation trigger: If the corpus grows past 1M vectors, HNSW vector search performance must be re-benchmarked and the HNSW parameters (m and ef_construction at index build time, ef_search at query time) re-tuned before the stress test is re-run against the larger index.
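
As an illustration of what re-tuning involves, the sketch below checks the trigger and builds the pgvector statements for rebuilding an HNSW index with new parameters. The table, column, and index names and the cosine operator class are assumptions, since this section does not specify them; the example values are pgvector's documented defaults, not recommendations.

```python
from typing import List

REBENCHMARK_THRESHOLD = 1_000_000  # corpus size from the re-evaluation trigger


def needs_rebenchmark(vector_count: int) -> bool:
    """True once the corpus has grown past the 1M-vector trigger."""
    return vector_count > REBENCHMARK_THRESHOLD


def hnsw_retune_statements(
    table: str,
    column: str,
    index_name: str,
    m: int,
    ef_construction: int,
    ef_search: int,
) -> List[str]:
    """Build SQL to recreate an HNSW index and set the query-time ef_search.

    Identifiers are interpolated directly, so pass trusted constants only.
    """
    return [
        f"DROP INDEX IF EXISTS {index_name};",
        # m and ef_construction are fixed when the index is built.
        f"CREATE INDEX {index_name} ON {table} "
        f"USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});",
        # ef_search is a session setting applied at query time.
        f"SET hnsw.ef_search = {ef_search};",
    ]


# Hypothetical names; parameter values are pgvector's defaults.
statements = hnsw_retune_statements(
    "knowledge_chunks", "embedding", "knowledge_chunks_hnsw_idx", 16, 64, 40
)
```

Changing m or ef_construction requires rebuilding the index, while ef_search can be adjusted per session, so it is usually the cheaper parameter to vary first when re-benchmarking.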


Engineering concern resolved by this section: EC-09 — TTFT is confirmed as the target metric; full-response latency is explicitly out of scope. Load level is defined as 10 concurrent sessions based on observed site traffic (~100 visits/day, ~5/hour) and a 40× safety margin over expected peak concurrency.