Skip to content

title: Runbook: High Search Latency description: Diagnose and resolve slow responses from /retrieve and /retrieve_pack in AI-Box. icon: material/speedometer


High Search Latency

Impact: High — degraded UX / upstream timeouts

This alert fires when P95 latency for /retrieve or /retrieve_pack breaches the SLO. Tackle this in two steps: confirm where time is spent, then mitigate per leg (BM25, kNN, RRF, rerank).

Triage (≤5 minutes)

  1. Check API latency histograms
  2. Metric: aibox_request_duration_seconds (histogram)
  3. Quick peek:

    curl -s ${AIBOX_URL:-http://localhost:8001}/metrics | head -n 50
    

  4. Inspect per-leg timings returned by API

  5. Each response includes diagnostics (ms):
    { "diagnostics": { "bm25_ms": 12.3, "knn_ms": 4.8, "rrf_ms": 0.01, "rerank_ms": 0.0 } }
    
  6. Find the slow leg in your recent requests (app logs):

    docker compose logs --tail=200 ai-box | grep -E "diagnostics|bm25_ms|knn_ms|rerank_ms"
    

  7. Check OpenSearch health

    curl -s http://localhost:9200/_cluster/health?pretty
    curl -s http://localhost:9200/_nodes/stats/jvm,os,fs,thread_pool?pretty | head -n 80
    


Remediation playbooks

  • Enable slowlogs (temporary) to capture bad queries:
    curl -s -X PUT 'http://localhost:9200/news_docs/_settings' -H 'content-type: application/json' -d '{
      "index.search.slowlog.threshold.query.warn": "200ms",
      "index.search.slowlog.threshold.fetch.warn": "200ms",
      "index.search.slowlog.level": "info"
    }'
    
  • Reduce search fields / highlights: confirm you only search relevant fields and limit highlight fragments.
  • Tune analyzers (Arabic/English) and stopwords; revisit custom analyzers if token bloat is observed.
  • Right-size shards: avoid extreme shard counts; reindex if needed.
  • Immediate user-level mitigation: temporarily reduce k_bm25 in API requests (e.g., 10–20).
  • Check mapping & HNSW params (m, ef_construction) and runtime ef_search:
    curl -s -X PUT 'http://localhost:9200/news_docs/_settings' -H 'content-type: application/json' -d '{
      "index.knn.algo_param.ef_search": 64
    }'
    
  • Dimension sanity: ensure VECTOR_FIELD matches index dimension; mismatches can cascade into retries.
  • Immediate mitigation: reduce k_knn (e.g., from 50 → 10–20).
  • RRF is normally cheap; high values often reflect huge candidate lists.
  • Mitigate: lower k_bm25 and k_knn so fused set is smaller.
  • Immediate switch off: set ENABLE_RERANK=false and/or call with "rerank": false.
  • Hardware: confirm you are not running a heavy cross-encoder on CPU inadvertently.
  • Model: consider a lighter RERANK_MODEL_ID, ONNX/quantization, or cap rerank candidates (e.g., top 50).

Prometheus quick views

  • Overall latency (p95):
    histogram_quantile(0.95, sum(rate(aibox_request_duration_seconds_bucket[5m])) by (le, route))
    
  • Error rate (5xx):
    sum(rate(aibox_requests_total{status=~"5.."}[5m])) by (route) / sum(rate(aibox_requests_total[5m])) by (route)
    
  • RRF processing ms (mean):
    rate(aibox_retrieval_rrf_ms_sum[5m]) / rate(aibox_retrieval_rrf_ms_count[5m])
    

Post-incident actions

  • Add the slow queries to a regression suite.
  • Revisit defaults in callers (reduce k_* for general UI, keep bigger for analysts).
  • Review OS heap / CPU; scale if sustained saturation.