title: Runbook: High Search Latency description: Diagnose and resolve slow responses from /retrieve and /retrieve_pack in AI-Box. icon: material/speedometer
High Search Latency¶
Impact: High — degraded UX / upstream timeouts
This alert fires when P95 latency for /retrieve or /retrieve_pack breaches the SLO.
Tackle this in two steps: confirm where time is spent, then mitigate per leg (BM25, kNN, RRF, rerank).
Triage (≤5 minutes)¶
- Check API latency histograms
- Metric:
aibox_request_duration_seconds(histogram) -
Quick peek:
-
Inspect per-leg timings returned by API
- Each response includes diagnostics (ms):
-
Find the slow leg in your recent requests (app logs):
-
Check OpenSearch health
Remediation playbooks¶
- Enable slowlogs (temporary) to capture bad queries:
- Reduce search fields / highlights: confirm you only search relevant fields and limit highlight fragments.
- Tune analyzers (Arabic/English) and stopwords; revisit custom analyzers if token bloat is observed.
- Right-size shards: avoid extreme shard counts; reindex if needed.
- Immediate user-level mitigation: temporarily reduce
k_bm25in API requests (e.g., 10–20).
- Check mapping & HNSW params (
m,ef_construction) and runtimeef_search: - Dimension sanity: ensure
VECTOR_FIELDmatches indexdimension; mismatches can cascade into retries. - Immediate mitigation: reduce
k_knn(e.g., from 50 → 10–20).
- RRF is normally cheap; high values often reflect huge candidate lists.
- Mitigate: lower
k_bm25andk_knnso fused set is smaller.
- Immediate switch off: set
ENABLE_RERANK=falseand/or call with"rerank": false. - Hardware: confirm you are not running a heavy cross-encoder on CPU inadvertently.
- Model: consider a lighter
RERANK_MODEL_ID, ONNX/quantization, or cap rerank candidates (e.g., top 50).
Prometheus quick views¶
- Overall latency (p95):
- Error rate (5xx):
- RRF processing ms (mean):
Post-incident actions¶
- Add the slow queries to a regression suite.
- Revisit defaults in callers (reduce
k_*for general UI, keep bigger for analysts). - Review OS heap / CPU; scale if sustained saturation.