title: Runbook: High Search Latency description: Diagnose and resolve slow responses from /retrieve and /retrieve_pack in AI-Box. icon: material/speedometer

High Search Latency¶

Impact: High — degraded UX / upstream timeouts

This alert fires when P95 latency for /retrieve or /retrieve_pack breaches the SLO. Tackle this in two steps: confirm where time is spent, then mitigate per leg (BM25, kNN, RRF, rerank).

Triage (≤5 minutes)¶

Check API latency histograms
Metric: aibox_request_duration_seconds (histogram)

Quick peek:

curl -s ${AIBOX_URL:-http://localhost:8001}/metrics | head -n 50

Inspect per-leg timings returned by API

Each response includes diagnostics (ms):

{ "diagnostics": { "bm25_ms": 12.3, "knn_ms": 4.8, "rrf_ms": 0.01, "rerank_ms": 0.0 } }

Find the slow leg in your recent requests (app logs):

docker compose logs --tail=200 ai-box | grep -E "diagnostics|bm25_ms|knn_ms|rerank_ms"

Check OpenSearch health

curl -s http://localhost:9200/_cluster/health?pretty
curl -s http://localhost:9200/_nodes/stats/jvm,os,fs,thread_pool?pretty | head -n 80

Remediation playbooks¶

A) BM25 slow (high bm25_ms)

Enable slowlogs (temporary) to capture bad queries:

curl -s -X PUT 'http://localhost:9200/news_docs/_settings' -H 'content-type: application/json' -d '{
  "index.search.slowlog.threshold.query.warn": "200ms",
  "index.search.slowlog.threshold.fetch.warn": "200ms",
  "index.search.slowlog.level": "info"
}'

Reduce search fields / highlights: confirm you only search relevant fields and limit highlight fragments.
Tune analyzers (Arabic/English) and stopwords; revisit custom analyzers if token bloat is observed.
Right-size shards: avoid extreme shard counts; reindex if needed.
Immediate user-level mitigation: temporarily reduce k_bm25 in API requests (e.g., 10–20).

B) Vector slow (high knn_ms)

Check mapping & HNSW params (m, ef_construction) and runtime ef_search:

curl -s -X PUT 'http://localhost:9200/news_docs/_settings' -H 'content-type: application/json' -d '{
  "index.knn.algo_param.ef_search": 64
}'

Dimension sanity: ensure VECTOR_FIELD matches index dimension; mismatches can cascade into retries.
Immediate mitigation: reduce k_knn (e.g., from 50 → 10–20).

C) RRF compute heavy (high rrf_ms)

RRF is normally cheap; high values often reflect huge candidate lists.
Mitigate: lower k_bm25 and k_knn so fused set is smaller.

D) Reranker slow (high rerank_ms)

Immediate switch off: set ENABLE_RERANK=false and/or call with "rerank": false.
Hardware: confirm you are not running a heavy cross-encoder on CPU inadvertently.
Model: consider a lighter RERANK_MODEL_ID, ONNX/quantization, or cap rerank candidates (e.g., top 50).

Prometheus quick views¶

Overall latency (p95):

histogram_quantile(0.95, sum(rate(aibox_request_duration_seconds_bucket[5m])) by (le, route))

Error rate (5xx):

sum(rate(aibox_requests_total{status=~"5.."}[5m])) by (route) / sum(rate(aibox_requests_total[5m])) by (route)

RRF processing ms (mean):

rate(aibox_retrieval_rrf_ms_sum[5m]) / rate(aibox_retrieval_rrf_ms_count[5m])

Post-incident actions¶

Add the slow queries to a regression suite.
Revisit defaults in callers (reduce k_* for general UI, keep bigger for analysts).
Review OS heap / CPU; scale if sustained saturation.