# API Observability Guide
This document is a guide to the observability signals for the API service. The API is the central component of the platform, so its health is critical; understanding its logs and metrics is essential for troubleshooting.
## 1. Logging
The API service uses structured logging to provide detailed, queryable insight into its operations. All logs are written to standard output.
### How to View Logs
Use the following command to tail the live logs for the API service:
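
For a Docker Compose deployment this might look like the sketch below; the service name `api` is an assumption, so substitute the name used in your Compose file:

```bash
# Tail live logs for the API service (service name "api" is an assumption)
docker compose logs -f api
```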
### Key Log Messages
When troubleshooting, look for these specific log messages:
"message": "payload too large"- Meaning: The Scraper service attempted to send a body over the 1.5 MB limit.
- Action: The API returns
413withRetry-After: 30and anX-Request-IDheader. Investigate spikes to ensure clients are backing off; correlate with the loggedrequest_id.
-
"message": "batch too large"- Meaning: More than the allowed number of articles were submitted in a single request.
- Action: The API returns
413withRetry-After: 30and anX-Request-IDheader. Split batches and monitor for repeated occurrences.
-
"message": "conflict for external_id ..."- Meaning: The Scraper sent an article that already exists in the database but with different content, indicating a potential content hash mismatch.
- Action: This may require manual investigation to determine why the content has changed.
-
Illuminate\Database\QueryException- Meaning: A fatal error indicating that the API cannot communicate with the PostgreSQL database.
- Action: This is a high-severity incident. Escalate to the database administrator immediately.
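
Because the logs are structured JSON on standard output, they can be filtered directly with common tools. A minimal sketch, assuming Docker Compose v2 (for the `--no-log-prefix` flag), a service named `api`, and one JSON object per line with `message` and `request_id` fields:

```bash
# Pull the request IDs of all "payload too large" events from the captured logs
docker compose logs --no-log-prefix api \
  | jq -r 'select(.message == "payload too large") | .request_id'
```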
## 2. Key Metrics & Dashboards
The API service's performance and health are tracked through a combination of application metrics and queue monitoring.
### Laravel Horizon Dashboard
**Primary Monitoring Tool: Horizon**
The most critical observability tool for the API service is the Laravel Horizon dashboard. Horizon provides a real-time view of the Redis queue, including job throughput, failure rates, and retry statistics.
- URL: `http://localhost/horizon` (or your production equivalent)
- What to Watch:
    - Failed Jobs: A rising number of failed jobs is the primary indicator of a problem with the ingestion or analysis pipeline.
    - Queue Wait Times: High wait times indicate that the queue workers are overloaded or stuck and may require scaling up the number of worker processes.
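
These signals can also be spot-checked from the command line with Laravel's built-in commands. The `docker compose exec api` wrapper and service name are assumptions about the deployment:

```bash
# Check whether the Horizon master process is running
docker compose exec api php artisan horizon:status

# List failed jobs with their IDs, connection, queue, and failure time
docker compose exec api php artisan queue:failed
```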
### Prometheus Metrics
The API exposes a Prometheus-compatible `/metrics` endpoint.
**Locked Down**

The endpoint is protected in non-local environments. Set a `METRICS_TOKEN` env var and present it via the `X-Metrics-Token` header.
```bash
# dev-only example
export METRICS_TOKEN=dev
curl -H "X-Metrics-Token: $METRICS_TOKEN" http://localhost/metrics
```
In production, use a strong token and restrict network access at the ingress or firewall level.
The token is stored in the service's `.env` file (`METRICS_TOKEN`). To rotate without downtime: (1) update the value in your secret store and Prometheus scrape config, (2) redeploy the service to pick up the new token, then (3) remove the old token after verifying scrapes succeed with the new one.
#### Metrics
| Metric | Type | Labels | Units | Description |
|---|---|---|---|---|
| `ingest_requests_total` | Counter | `outcome` | requests | Total ingest requests by outcome |
| `ingest_body_bytes_total` | Counter | — | bytes | Sum of request body sizes |
| `search_requests_total` | Counter | `mode`, `outcome` | requests | Search requests grouped by retrieval mode |
| `queue_latency_seconds` | Histogram | — | seconds | Time jobs spend waiting in the queue |
| `aibox_rrf_mode` | Gauge | `mode` | 1 | Active AI‑Box RRF mode |
| `ai_classify_requests_total` | Counter | `task`, `upstream`, `status` | requests | Total classify requests by task and upstream |
| `ai_classify_latency_ms` | Histogram | `task`, `upstream` | milliseconds | Latency distribution for classification |
| `ai_classify_failures_total` | Counter | `task`, `reason` | failures | Classification failures grouped by reason |
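
As an illustration of how these metrics compose in queries, a PromQL sketch for the ingest failure ratio; the `success` value for the `outcome` label is an assumption, so adjust it to the label values your deployment actually emits:

```promql
# Fraction of ingest requests with a non-success outcome over the last 5 minutes
sum(rate(ingest_requests_total{outcome!="success"}[5m]))
  /
sum(rate(ingest_requests_total[5m]))
```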
Dashboards & Alerts¶
- Dashboards: Ingest Latency, Search Latency, Queue Latency
- Alerts:
    - API 5xx ratio exceeds 2% over 5m
    - Queue latency p95 exceeds 5s
    - AI‑Box smoke test fails
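
The queue latency alert maps directly onto the `queue_latency_seconds` histogram in the table above. A minimal Prometheus rule-file sketch, with illustrative group and alert names that are not taken from the repository:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: QueueLatencyP95High
        # p95 of the time jobs spend waiting in the queue, over a 5-minute window
        expr: histogram_quantile(0.95, sum by (le) (rate(queue_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue latency p95 exceeds 5s"
```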
## 3. Health Checks
The API service provides a simple HTTP health check endpoint.
- Endpoint: `/api/v1/health` (Note: this may not be exposed publicly and may only be accessible from within the Docker network.)
- Command (from another container): see the sketch below.
- Success Response: A `200 OK` response with a simple JSON body.
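
A sketch of that in-network check; the `scraper` and `api` service names are assumptions about the Compose setup, and it assumes `curl` is available in the calling container:

```bash
# Call the health endpoint from another container on the same Docker network
docker compose exec scraper curl -fsS http://api/api/v1/health
```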