# API Observability Guide
This document is a guide to the observability signals for the API service. The API is the central component of the platform, so its health is critical; understanding its logs and metrics is essential for troubleshooting.
## 1. Logging
The API service uses structured logging to provide detailed, queryable insight into its operations. All logs are written to standard output.
### How to View Logs
Use the following command to tail the live logs for the API service:
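
For a Docker Compose deployment this might look like the sketch below; the service name `api` is an assumption, so substitute the name used in your Compose file:

```bash
# Tail live logs for the API service (service name "api" is an assumption)
docker compose logs -f api
```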
### Key Log Messages
When troubleshooting, look for these specific log messages:
"message": "payload too large"- Meaning: The Scraper service attempted to send a body over the 1.5 MB limit.
- Action: The API returns
413withRetry-After: 30and anX-Request-IDheader. Investigate spikes to ensure clients are backing off; correlate with the loggedrequest_id.
-
"message": "batch too large"- Meaning: More than the allowed number of articles were submitted in a single request.
- Action: The API returns
413withRetry-After: 30and anX-Request-IDheader. Split batches and monitor for repeated occurrences.
-
"message": "conflict for external_id ..."- Meaning: The Scraper sent an article that already exists in the database but with different content, indicating a potential content hash mismatch.
- Action: This may require manual investigation to determine why the content has changed.
-
Illuminate\Database\QueryException- Meaning: A fatal error indicating that the API cannot communicate with the PostgreSQL database.
- Action: This is a high-severity incident. Escalate to the database administrator immediately.
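
Because the logs are structured JSON on standard output, they can be filtered directly with common tools. A minimal sketch, assuming Docker Compose v2 (for the `--no-log-prefix` flag), a service named `api`, and one JSON object per line with `message` and `request_id` fields:

```bash
# Pull the request IDs of all "payload too large" events from the captured logs
docker compose logs --no-log-prefix api \
  | jq -r 'select(.message == "payload too large") | .request_id'
```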
## 2. Key Metrics & Dashboards
The API service's performance and health are tracked through a combination of application metrics and queue monitoring.
### Laravel Horizon Dashboard
**Primary Monitoring Tool: Horizon**
The most critical observability tool for the API service is the Laravel Horizon dashboard. Horizon provides a real-time view of the Redis queue, including job throughput, failure rates, and retry statistics.
- URL: `http://localhost/horizon` (or your production equivalent)
- What to Watch:
    - Failed Jobs: A rising number of failed jobs is the primary indicator of a problem with the ingestion or analysis pipeline.
    - Queue Wait Times: High wait times indicate that the queue workers are overloaded or stuck and may require scaling up the number of worker processes.
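
These signals can also be spot-checked from the command line with Laravel's built-in commands. The `docker compose exec api` wrapper and service name are assumptions about the deployment:

```bash
# Check whether the Horizon master process is running
docker compose exec api php artisan horizon:status

# List failed jobs with their IDs, connection, queue, and failure time
docker compose exec api php artisan queue:failed
```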
### Prometheus Metrics
The API exposes a Prometheus-compatible `/metrics` endpoint.
**Locked Down**

The endpoint is protected in non-local environments. Set a `METRICS_TOKEN` env var and present it via the `X-Metrics-Token` header.
```bash
# dev-only example
export METRICS_TOKEN=dev
curl -H "X-Metrics-Token: $METRICS_TOKEN" http://localhost/metrics
```
In production, use a strong token and restrict network access at the ingress or firewall level.
The token is stored in the service's `.env` file (`METRICS_TOKEN`). To rotate without downtime: (1) update the value in your secret store and Prometheus scrape config, (2) redeploy the service to pick up the new token, then (3) remove the old token after verifying scrapes succeed with the new one.
#### Metrics
| Metric | Type | Labels | Units | Description |
|---|---|---|---|---|
| `ingest_requests_total` | Counter | `outcome` | requests | Total ingest requests by outcome |
| `ingest_body_bytes_total` | Counter | — | bytes | Sum of request body sizes |
| `search_requests_total` | Counter | `mode`, `outcome` | requests | Search requests grouped by retrieval mode |
| `queue_latency_seconds` | Histogram | — | seconds | Time jobs spend waiting in the queue |
| `aibox_rrf_mode` | Gauge | `mode` | 1 | Active AI‑Box RRF mode |
| `ai_classify_requests_total` | Counter | `task`, `upstream`, `status` | requests | Total classify requests by task and upstream |
| `ai_classify_latency_ms` | Histogram | `task`, `upstream` | milliseconds | Latency distribution for classification |
| `ai_classify_failures_total` | Counter | `task`, `reason` | failures | Classification failures grouped by reason |
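
As an illustration of how these metrics compose in queries, a PromQL sketch for the ingest failure ratio; the `success` value for the `outcome` label is an assumption, so adjust it to the label values your deployment actually emits:

```promql
# Fraction of ingest requests with a non-success outcome over the last 5 minutes
sum(rate(ingest_requests_total{outcome!="success"}[5m]))
  /
sum(rate(ingest_requests_total[5m]))
```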
Dashboards & Alerts¶
- Dashboards: Ingest Latency, Search Latency, Queue Latency
- Alerts:
    - API 5xx ratio exceeds 2% over 5m
    - Queue latency p95 exceeds 5s
    - AI‑Box smoke test fails
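
The queue latency alert maps directly onto the `queue_latency_seconds` histogram in the table above. A minimal Prometheus rule-file sketch, with illustrative group and alert names that are not taken from the repository:

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: QueueLatencyP95High
        # p95 of the time jobs spend waiting in the queue, over a 5-minute window
        expr: histogram_quantile(0.95, sum by (le) (rate(queue_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue latency p95 exceeds 5s"
```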
## 3. Health Checks
The API service provides a simple HTTP health check endpoint.
- Endpoint: `/api/v1/health` (Note: this may not be exposed publicly and may only be accessible from within the Docker network.)
- Command (from another container): see the sketch below.
- Success Response: A `200 OK` response with a simple JSON body.
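
A sketch of that in-network check; the `scraper` and `api` service names are assumptions about the Compose setup, and it assumes `curl` is available in the calling container:

```bash
# Call the health endpoint from another container on the same Docker network
docker compose exec scraper curl -fsS http://api/api/v1/health
```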