AI-Box Observability Guide¶
This document is the SRE guide to the observability stack of the AI-Box service. It details the key signals the service emits and how to use them to assess its health and performance.
The Three Pillars of Observability¶
Our monitoring strategy is built on three pillars:
- Metrics: Aggregated numerical data that provides a high-level view of system health (e.g., request rates, error rates, latency percentiles).
- Logging: Detailed, structured records of discrete events, essential for deep debugging and root cause analysis.
- Tracing: A view of the entire lifecycle of a request as it flows through multiple services. (Note: Distributed tracing is a future goal for the platform.)
1. Metrics (Prometheus)¶
The service exposes a wide range of metrics in a Prometheus-compatible format at the /metrics endpoint. These are the primary source for dashboards and alerting.
Locked Down
The metrics endpoint is disabled by default outside of local development. To enable it, set the METRICS_TOKEN environment variable and send the token in the X-Metrics-Token request header on every scrape.
In production, use a strong token and restrict exposure to trusted networks only.
METRICS_TOKEN lives in the service's .env file. To rotate it, update the value in your secret store and in the Prometheus scrape configuration, then redeploy the service. Once the new token is confirmed working, remove the old one.
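As a quick check that the token is wired up correctly, you can scrape the endpoint manually. This is a minimal sketch; the host and port (localhost:8000) are illustrative and should be adjusted to your deployment.

```python
import os
import requests

METRICS_URL = "http://localhost:8000/metrics"  # assumption: adjust to your deployment
token = os.environ["METRICS_TOKEN"]

# Without a valid X-Metrics-Token header, the endpoint should refuse the request.
resp = requests.get(METRICS_URL, headers={"X-Metrics-Token": token}, timeout=5)
resp.raise_for_status()
print(resp.text[:500])  # Prometheus exposition format, e.g. aibox_requests_total{...} 42
```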
Metrics Reference¶
| Metric | Type | Labels | Units | Description |
|---|---|---|---|---|
| aibox_requests_total | Counter | route, method, code | requests | HTTP requests by route/method/status |
| aibox_request_duration_seconds | Histogram | route | seconds | Request duration per route |
| aibox_retrieval_rrf_ms | Histogram | — | milliseconds | RRF fusion time |
| aibox_rerank_ms | Histogram | — | milliseconds | Rerank model time |
| aibox_rrf_mode | Gauge | mode | 1 | Active RRF mode |
| aibox_rrf_fallback_total | Counter | — | fallbacks | Times the Python RRF fallback was used |
| s1_requests_total | Counter | — | requests | S1 check-worthiness requests |
| s1_latency_seconds | Histogram | — | seconds | S1 latency |
| aibox_s1_mode | Gauge | mode | 1 | Active S1 mode |
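The metrics above map onto standard Prometheus client primitives. The sketch below shows how metrics of these shapes are typically declared with the prometheus_client library; the label sets mirror the reference table, but the code is illustrative rather than the service's actual instrumentation, and the "native" mode value is a placeholder.

```python
from prometheus_client import Counter, Gauge, Histogram

# Illustrative declarations mirroring the reference table above.
REQUESTS_TOTAL = Counter(
    "aibox_requests_total", "HTTP requests by route/method/status",
    ["route", "method", "code"],
)
REQUEST_DURATION = Histogram(
    "aibox_request_duration_seconds", "Request duration per route", ["route"],
)
RRF_MODE = Gauge("aibox_rrf_mode", "Active RRF mode", ["mode"])

# Example usage inside a request handler:
REQUESTS_TOTAL.labels(route="/retrieve", method="POST", code="200").inc()
REQUEST_DURATION.labels(route="/retrieve").observe(0.152)  # seconds
RRF_MODE.labels(mode="native").set(1)  # "native" is a placeholder mode value
```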
Key Performance Indicators (KPIs)¶
| Metric | Prometheus Query | Threshold (Example) | Why It Matters |
|---|---|---|---|
| P95 Latency | histogram_quantile(0.95, rate(aibox_request_duration_seconds_bucket[5m])) | > 500ms | Captures the response time seen by the slowest 5% of requests; the primary measure of user experience. |
| Error Rate | sum(rate(aibox_requests_total{code=~"5.."}[5m])) / sum(rate(aibox_requests_total[5m])) | > 2% | A high error rate indicates a systemic problem with the service or its dependencies. |
| Request Rate | rate(aibox_requests_total[5m]) | N/A | Provides a baseline of service traffic. Sudden drops can indicate upstream issues. |
| CPU Usage | rate(container_cpu_usage_seconds_total{container="ai-box"}[5m]) | > 85% of CPU limit | Sustained high CPU can lead to increased latency and request queuing. |
| Memory Usage | container_memory_usage_bytes{container="ai-box"} | > 90% of memory limit | High memory usage risks the container being OOM-killed by the orchestrator. |
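These queries can also be evaluated ad hoc against the Prometheus HTTP API, which is handy when validating a threshold before turning it into an alert. A rough sketch follows; the Prometheus address (http://prometheus:9090) is an assumption and should be replaced with your own.

```python
import requests

PROM_URL = "http://prometheus:9090"  # assumption: adjust to your Prometheus address

P95_QUERY = (
    "histogram_quantile(0.95, rate(aibox_request_duration_seconds_bucket[5m]))"
)
ERROR_RATIO_QUERY = (
    'sum(rate(aibox_requests_total{code=~"5.."}[5m])) '
    "/ sum(rate(aibox_requests_total[5m]))"
)

def instant_query(query: str) -> list:
    """Run an instant query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Print per-route P95 latency and flag anything over the example 500ms threshold.
for sample in instant_query(P95_QUERY):
    route = sample["metric"].get("route", "all")
    p95_ms = float(sample["value"][1]) * 1000
    print(f"{route}: p95={p95_ms:.0f}ms (alert if > 500ms)")
```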
Per-Request Diagnostics¶
In addition to Prometheus metrics, the response body of the /retrieve and /retrieve_pack endpoints includes a diagnostics object with detailed timings for each stage of the retrieval process. This is invaluable for debugging specific slow queries.
```json
{
  ...
  "diagnostics": {
    "bm25_ms": 5.0,
    "knn_ms": 4.8,
    "rrf_ms": 0.01,
    "rerank_ms": 0.0
  }
}
```
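When investigating a specific slow query, it is often faster to replay the request and read these timings directly than to dig through dashboards. A sketch is shown below; the host, port, and request body shape are assumptions and should be adjusted to the service's actual request schema.

```python
import requests

# Assumptions: host/port and the {"query": ...} body shape are illustrative.
resp = requests.post(
    "http://localhost:8000/retrieve",
    json={"query": "elections in syria"},
    timeout=30,
)
resp.raise_for_status()

# Print the per-stage timings, slowest stage first.
diag = resp.json().get("diagnostics", {})
for stage, ms in sorted(diag.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>12}: {ms:.2f} ms")
```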
Dashboards & Alerts¶
- Dashboards: Ingest Latency, Search Latency, Queue Latency
- Alerts:
- API 5xx ratio exceeds 2% over 5m
- Queue latency p95 exceeds 5s
- AI-Box smoke test fails
2. Logging¶
The service uses the python-json-logger library to emit structured logs in JSON format. This is a critical feature for production-grade observability.
Why Structured Logs?
JSON logs are machine-readable, which allows them to be easily ingested, parsed, and queried in a centralized logging platform (such as OpenSearch, Loki, or Splunk). This enables powerful searching and filtering (e.g., "show all logs with level=ERROR for the /retrieve endpoint").
Example Log Entry¶
```json
{
  "timestamp": "2025-08-25T10:30:00.123Z",
  "level": "INFO",
  "message": "Hybrid search completed",
  "route": "/retrieve",
  "query": "elections in syria",
  "results_count": 20,
  "duration_ms": 152.4
}
```
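For reference, an entry like the one above can be produced with a python-json-logger setup along the following lines. This is a minimal sketch; the exact formatter options used by the service may differ.

```python
import logging
from pythonjsonlogger import jsonlogger  # pip install python-json-logger

handler = logging.StreamHandler()
handler.setFormatter(
    jsonlogger.JsonFormatter(
        "%(asctime)s %(levelname)s %(message)s",
        rename_fields={"asctime": "timestamp", "levelname": "level"},
    )
)

logger = logging.getLogger("aibox")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed via `extra` become top-level keys in the JSON output.
logger.info(
    "Hybrid search completed",
    extra={"route": "/retrieve", "results_count": 20, "duration_ms": 152.4},
)
```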
3. Tracing (Future Goal)¶
Distributed tracing is not yet implemented in the Labeeb platform. However, it is the next logical step in our observability journey.
- What it is: Tracing provides a way to visualize the entire lifecycle of a request as it moves from the client, through the API service, to the AI-Box, and finally to the OpenSearch cluster. Each step in the journey is a "span," and the collection of spans for a single request is a "trace."
- Why it matters: It is the single most powerful tool for debugging latency issues in a microservices architecture, as it can pinpoint exactly which service or which operation is the bottleneck.
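If and when tracing lands, OpenTelemetry is the de facto standard for instrumenting Python services. The sketch below shows roughly what manual instrumentation could look like; it is illustrative only, since no tracing code exists in the platform today, and a real deployment would export spans to a collector rather than to stdout.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to stdout for illustration; production would use an OTLP exporter.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("aibox")

# Each downstream call becomes a child span within the same trace,
# so a slow OpenSearch query shows up as a wide child span.
with tracer.start_as_current_span("retrieve") as span:
    span.set_attribute("query", "elections in syria")
    with tracer.start_as_current_span("opensearch.knn_search"):
        pass  # placeholder for the actual downstream call
```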