# Scraper Observability Guide
This document provides a guide to the observability signals for the Scraper service. Understanding these signals is critical for monitoring the health of the data ingestion pipeline and for diagnosing problems.
## 1. Primary Health Signal: Upstream Ingestion Rate
> **Black-Box Monitoring Principle:** The most reliable indicator of the Scraper's health is the health of its primary consumer, the API Service. If the API is successfully ingesting data, the Scraper is healthy. This black-box approach is often more reliable than internal service metrics alone.
- **Metric to Watch:** `http_requests_total{route="/api/v1/ingest/articles", status_code="2xx"}` on the API service.
- **Expected Behavior:** A steady, non-zero rate of `2xx` status codes.
- **Alerting Threshold:** Fire a `P2` (Warning) alert if the rate of successful ingestions drops to zero for more than 30 minutes (see the rule sketch below).
- **Dashboard:** This metric should be displayed prominently on the main "Platform Health" Grafana dashboard.
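A minimal sketch of that threshold as a Prometheus alerting rule. The rule and group names, the 5-minute rate window, and the regex match on `status_code` are assumptions; if your API service records the literal label value `2xx`, keep the exact selector shown above instead.

```yaml
groups:
  - name: scraper-blackbox          # hypothetical group name
    rules:
      - alert: ScraperIngestionStalled   # hypothetical rule name
        # Zero successful ingestions on the API service, sustained for 30 minutes.
        expr: sum(rate(http_requests_total{route="/api/v1/ingest/articles", status_code=~"2.."}[5m])) == 0
        for: 30m
        labels:
          severity: P2
        annotations:
          summary: "Scraper ingestion stalled: no successful article ingestions for 30+ minutes"
```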
## 2. Structured Logging
The Scraper produces structured JSON logs to provide detailed, queryable insight into its operations.
- Every log line includes a `request_id` field to correlate scrape runs across services.
### How to View Logs
For real-time analysis during an incident, tail the logs directly from the container.
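A sketch of how to do this, assuming the service runs in a container named `scraper` (check `docker ps` for the real name) and that each log line is a single JSON object:

```bash
# Follow the log stream and show only error-level events.
# `fromjson?` silently skips any non-JSON lines in the stream.
docker logs -f scraper 2>&1 | jq -R 'fromjson? | select(.level == "error")'

# Correlate one scrape run across services via its request_id
# (the id here is a placeholder).
docker logs scraper 2>&1 | jq -R -c 'fromjson? | select(.request_id == "<request-id>")'
```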
### Key Log Events & Interpretation
When troubleshooting, filter your logs for these specific `event` fields (the `jq` sketch after this list shows one way to do so):
- `event: "Profile validation summary"`
    - **What it means:** The service has started and loaded all profile files from disk.
    - **Why it matters:** The `invalid_skipped` count is the first indicator of a configuration error. If this number is greater than zero, a deployment has introduced a malformed profile.
    - **Example:**

      ```json
      {"event": "Profile validation summary", "valid": 3, "invalid_skipped": 1, "level": "info"}
      ```
- `event: "Ingestion batch failed"`
    - **What it means:** The Scraper successfully fetched articles, but the upstream API service rejected the batch.
    - **Why it matters:** This log is critical for assigning responsibility during an incident. The `status` field tells you whether the error is on the API side (`5xx`) or a configuration/auth issue (`4xx`).
    - **Example:**

      ```json
      {"event": "Ingestion batch failed", "source": "aljazeera", "status": 500, "level": "error"}
      ```
- `event: "Scrape job completed"`
    - **What it means:** An entire scrape job (scheduled or on-demand) has finished.
    - **Why it matters:** The `articles_sent` field provides a clear signal of success. The `duration_s` field can be used to detect performance regressions over time.
    - **Example:**

      ```json
      {"event": "Scrape job completed", "source": "verify_sy", "articles_sent": 12, "duration_s": 15.2, "level": "info"}
      ```
## 3. Health & Metrics Endpoints
The service provides two simple endpoints for automated health checks and basic metrics.
- **Health Check:** `GET /health`
    - **Purpose:** A simple liveness probe used by the Docker health check to ensure the FastAPI server is running and responsive.
    - **Command:** `curl http://localhost:9001/health`
- **Metrics Endpoint:** `GET /metrics`
    - **Purpose:** Exposes Prometheus metrics (counters, histograms, gauges) for basic visibility.
    - **Auth:** Requires the `X-Metrics-Token` header when `DEBUG=false` (see the example below).
    - **Key Metrics:**
        - `scrape_runs_total{source}` – scrape jobs started
        - `pages_fetched_total{source}` – HTTP fetches
        - `items_emitted_total{source}` – normalized items produced
        - `ingest_requests_total{outcome}` – POSTs to `/ingest` (success/error)
        - `fetch_duration_seconds` / `parse_duration_seconds` – durations for fetch/parse
        - `scheduler_next_run_timestamp{source}` / `last_success_timestamp{source}` – gauges
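A quick sketch of querying the metrics endpoint by hand; the token value is a placeholder for whatever `X-Metrics-Token` your deployment configures:

```bash
# Fetch all metrics (the header is only enforced when DEBUG=false).
curl -fsS -H "X-Metrics-Token: <token>" http://localhost:9001/metrics

# Narrow the output to the per-source scrape counters.
curl -fsS -H "X-Metrics-Token: <token>" http://localhost:9001/metrics | grep '^scrape_runs_total'
```

The `last_success_timestamp{source}` gauge also pairs naturally with a staleness check such as `time() - last_success_timestamp{source}` in a Grafana panel or alert rule.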