Performance Requirements

Targets

Metric	Target	Percentile	Measurement point
Chat response TTFT (Time to First Token)	< 3s	p95	From API request received at Chat API to first SSE token delivered to client
RAG retrieval latency (embedding + vector search + threshold filter)	< 500ms	p95	From `retrieve_knowledge` tool call received to `RetrievalResult` returned
OpenAI embedding API call	< 200ms	p95	From HTTP request sent to response received; EU endpoint required
HNSW vector search (pgvector)	< 100ms	p95	SQL query execution time; valid for corpus < 10K vectors — re-evaluate if corpus exceeds 1M
Widget load time	< 1s	—	Non-blocking; `defer` load strategy; bundle must not block host page critical path
Widget bundle size	≤ 200KB gzipped	—	Measured at Phase 1 build; if > 250KB gzipped, evaluate tree-shaking before considering alternative library

Streaming requirement: Streaming must be enabled in the frontend widget from day one. The TTFT target is not achievable without streaming — full-response latency is not the target metric and is not measured (EC-09).
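As an illustration of how TTFT could be observed on a streamed response, the following is a minimal sketch rather than the project's measurement code. It assumes tokens arrive on SSE `data:` lines and that the caller supplies the timestamp at which the request is considered received; the function name `time_to_first_token` and the injectable `clock` parameter are hypothetical.

```python
import time
from typing import Callable, Iterable, Optional


def time_to_first_token(
    sse_lines: Iterable[str],
    started_at: float,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Return seconds from started_at to the first non-empty SSE data line."""
    for line in sse_lines:
        # In the SSE format, payloads arrive on "data:" lines; lines starting
        # with ":" are comments (often keep-alives) and carry no token.
        if line.startswith("data:") and line[len("data:"):].strip():
            return clock() - started_at
    return None  # stream ended without delivering a token


# Example with a canned stream and a fixed clock (assumed values).
stream = [": keep-alive", "event: token", "data: Hello", "data: world"]
ttft = time_to_first_token(stream, started_at=0.0, clock=lambda: 1.25)
```

Because the function stops at the first token, the remainder of the response never enters the measurement, which matches the decision to leave full-response latency unmeasured.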

Latency Budget

The following breakdown is informative. The per-stage figures are not binding SLAs, but they validate that the 3s end-to-end TTFT target is achievable and provide a baseline for diagnosing regressions.

Stage	Typical range	p95 budget
Embedding query (OpenAI EU endpoint)	50–100ms	200ms
HNSW vector search (pgvector)	50–100ms	100ms
Threshold filter + result assembly	< 10ms	20ms
LLM first token (Anthropic, streaming enabled)	500ms–2,000ms	~2,500ms
Network round-trip (EU, CDN)	50–150ms	180ms
Total	700ms–2,500ms	< 3,000ms

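The 500ms p95 retrieval target (embedding + vector search + filter) is derived from this budget: the three retrieval stages have p95 budgets of 200ms, 100ms, and 20ms, which sum to 320ms and leave roughly 180ms of headroom under the target. With retrieval completing within 500ms, the remaining ~2.5s is sufficient for LLM first-token delivery under normal operating conditions.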

Turns that do not trigger a `retrieve_knowledge` call skip the embedding and vector search stages entirely, making the full ~2.8s available to the LLM.
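The arithmetic behind the budget can be checked with a short script. This is a sketch that copies the p95 figures from the table; the dictionary keys and variable names are illustrative only, and the LLM figure is listed as approximate in the source.

```python
# p95 budgets per stage in milliseconds, copied from the latency budget table.
P95_BUDGET_MS = {
    "embedding": 200,
    "vector_search": 100,
    "filter_and_assembly": 20,
    "llm_first_token": 2500,  # listed as approximate (~2,500ms)
    "network_round_trip": 180,
}
RETRIEVAL_STAGES = ("embedding", "vector_search", "filter_and_assembly")
TTFT_TARGET_MS = 3000
RETRIEVAL_TARGET_MS = 500

# Retrieval path: embedding + vector search + filter.
retrieval_ms = sum(P95_BUDGET_MS[stage] for stage in RETRIEVAL_STAGES)

# End-to-end: every stage, retrieval included.
total_ms = sum(P95_BUDGET_MS.values())

# Turns without retrieval: only the network stage shares the 3s with the LLM.
llm_window_without_retrieval_ms = TTFT_TARGET_MS - P95_BUDGET_MS["network_round_trip"]

print(retrieval_ms)                     # 320
print(total_ms)                         # 3000
print(llm_window_without_retrieval_ms)  # 2820
```

The stage budgets add up to the 3,000ms ceiling with essentially no slack, so a regression in any single stage would show up directly in the total. Adding per-stage p95 figures is only a rough check, since percentiles do not combine exactly.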

Stress Test Plan

Observed traffic baseline: ~100 unique visitors per day, ~5 per hour, with no significant traffic spikes (corporate website profile).

Expected concurrent sessions in production: At a 5% chat activation rate and a 10-minute average session duration, expected peak concurrency is below 1 session. The system will not experience meaningful concurrency pressure under normal operating conditions.
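The concurrency estimate can be reproduced with Little's Law (average concurrency = arrival rate × average duration). The sketch below uses the traffic figures stated above; the function name `average_concurrency` is hypothetical, and the result is a steady-state average rather than a peak.

```python
def average_concurrency(
    visitors_per_hour: float,
    activation_rate: float,
    session_minutes: float,
) -> float:
    """Little's Law: sessions started per hour times session length in hours."""
    sessions_per_hour = visitors_per_hour * activation_rate
    return sessions_per_hour * (session_minutes / 60)


# ~5 visitors/hour, 5% chat activation, 10-minute sessions (from the text above).
estimate = average_concurrency(visitors_per_hour=5, activation_rate=0.05, session_minutes=10)
print(round(estimate, 3))  # 0.042
```

With these inputs the average is about 0.04 concurrent sessions, well below 1. The ~40× margin quoted for the 10-session target implies a peak estimate of about 0.25 sessions (10 / 40), which is higher than this steady-state average; that would be consistent with allowing for busier-than-average hours, though the source does not state how the peak was derived.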

Test design: Because observed concurrency is sub-1, a traffic-replication test provides no useful signal. The test is designed as a stress test against a fixed capacity target that provides a meaningful safety margin over expected load.

Parameter	Value
Test type	Stress test (fixed concurrency, not traffic replay)
Target concurrent sessions	10 (~40× expected peak production concurrency)
Sustained duration	10 minutes at peak concurrency
Ramp-up	2 minutes linear ramp to 10 concurrent sessions
Traffic pattern	Simulated visitor messages at realistic inter-message cadence (15–30s between turns)
RAG trigger rate	~60% of turns trigger a `retrieve_knowledge` call
Test environment	Staging on Fly.io, same instance class as production (see Section 6.1)
Tooling	k6 or Locust — to be confirmed by engineering before Phase 5
Success criterion	p95 TTFT < 3s sustained across the full 10-minute window
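The ramp schedule and the success criterion in the table can be sketched independently of the load tool, which is still to be confirmed. The code below is an assumption-laden illustration: the function names are hypothetical, the percentile uses the nearest-rank method, and the whole hold window is treated as a single sample set.

```python
import math
from typing import Sequence

RAMP_SECONDS = 120       # 2-minute linear ramp
HOLD_SECONDS = 600       # 10 minutes sustained at peak
PEAK_SESSIONS = 10
TTFT_TARGET_SECONDS = 3.0


def target_sessions(elapsed_seconds: float) -> int:
    """Concurrent sessions the load generator should hold at a given time."""
    if elapsed_seconds < 0 or elapsed_seconds > RAMP_SECONDS + HOLD_SECONDS:
        return 0  # outside the test window
    if elapsed_seconds >= RAMP_SECONDS:
        return PEAK_SESSIONS
    # Linear ramp, rounded up so the test starts with at least one session.
    return max(1, math.ceil(PEAK_SESSIONS * elapsed_seconds / RAMP_SECONDS))


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def meets_success_criterion(ttft_seconds: Sequence[float]) -> bool:
    """True when p95 TTFT over the collected samples is under the 3s target."""
    return p95(ttft_seconds) < TTFT_TARGET_SECONDS
```

Only TTFT samples collected during the 10-minute hold would normally be passed to `meets_success_criterion`, so that ramp-up measurements do not dilute the result. If "sustained" is meant per interval, the same check could be applied to each minute of samples separately.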

Re-evaluation trigger: If corpus grows past 1M vectors, HNSW vector search performance must be re-benchmarked and index parameters (m, ef_construction, ef_search) re-tuned before the stress test is re-run against the larger index.
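For reference, the three parameters named above map onto pgvector as two index build options (`m`, `ef_construction`) and one query-time setting (`hnsw.ef_search`). The helper below only renders the corresponding SQL as a sketch; the table name `knowledge_chunks`, the column name `embedding`, and the cosine operator class are assumptions not taken from this document, and the values shown are pgvector's defaults rather than tuned recommendations.

```python
def hnsw_index_sql(table: str, column: str, m: int, ef_construction: int) -> str:
    """Render a pgvector HNSW index definition (build-time parameters)."""
    # Identifiers are interpolated directly; pass only trusted, fixed names.
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )


def ef_search_sql(ef_search: int) -> str:
    """Render the query-time setting that trades recall against latency."""
    return f"SET hnsw.ef_search = {ef_search};"


# Hypothetical names; pgvector defaults are m=16, ef_construction=64, ef_search=40.
print(hnsw_index_sql("knowledge_chunks", "embedding", m=16, ef_construction=64))
print(ef_search_sql(40))
```

Changing `m` or `ef_construction` requires rebuilding the index, whereas `hnsw.ef_search` can be adjusted per session, so it is usually the first parameter to vary when re-benchmarking.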

Engineering concern resolved by this section: EC-09. TTFT is confirmed as the target metric; full-response latency is explicitly out of scope. The load level is defined as 10 concurrent sessions, based on observed site traffic (~100 unique visitors/day, ~5/hour) and a 40× safety margin over expected peak concurrency.
