Scraper Observability Guide

This guide describes the observability signals exposed by the Scraper service. Understanding these signals is critical for monitoring the health of the data ingestion pipeline and for diagnosing problems.


1. Primary Health Signal: Upstream Ingestion Rate

Black-Box Monitoring Principle

The most reliable indicator of the Scraper's health is the health of its primary consumer: the API Service. If the API is successfully ingesting data, the Scraper is healthy. This black-box approach is often more reliable than internal service metrics alone.

  • Metric to Watch: http_requests_total{route="/api/v1/ingest/articles", status_code="2xx"} on the API service.
  • Expected Behavior: A steady, non-zero rate of 2xx status codes.
  • Alerting Threshold: Fire a P2 (Warning) alert if the rate of successful ingestions drops to zero for more than 30 minutes (a PromQL sketch follows this list).
  • Dashboard: This metric should be displayed prominently on the main "Platform Health" Grafana dashboard.
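
In PromQL, the alert condition above can be sketched as follows. This is a sketch only: it assumes the route and status_code labels exactly as shown above, and the 5-minute rate window is illustrative.

sum(rate(http_requests_total{route="/api/v1/ingest/articles", status_code="2xx"}[5m])) == 0

Paired with a for: 30m clause in the alerting rule, this fires only after the successful-ingestion rate has been flat at zero for the full window. Note that if the series disappears entirely, the expression returns no data rather than firing; combining it with absent() is a common refinement.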

2. Structured Logging

The Scraper produces structured JSON logs to provide detailed, queryable insight into its operations.

  • Every log line includes a request_id field to correlate scrape runs across services (see the filtering example under "How to View Logs").

How to View Logs

For real-time analysis during an incident, tail the logs directly from the container.

docker compose logs -f scraper
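
Because every line is a JSON object carrying a request_id, you can also isolate a single scrape run by piping the logs through jq. A minimal sketch, assuming one JSON object per log line and jq installed:

# "3f2a9c" is a placeholder request_id; substitute the ID under investigation
docker compose logs --no-log-prefix scraper | jq -c 'select(.request_id == "3f2a9c")'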

Key Log Events & Interpretation

When troubleshooting, filter your logs for these specific event fields (a jq example follows the list):

  • event: "Profile validation summary"

    • What it means: The service has started and loaded all profile files from disk.
    • Why it matters: The invalid_skipped count is the first indicator of a configuration error. If this number is greater than zero, a deployment has introduced a malformed profile.
    • Example: {"event": "Profile validation summary", "valid": 3, "invalid_skipped": 1, "level": "info"}
  • event: "Ingestion batch failed"

    • What it means: The Scraper successfully fetched articles, but the upstream API service rejected the batch.
    • Why it matters: This log is critical for assigning responsibility during an incident. The status field tells you if the error is on the API side (5xx) or a configuration/auth issue (4xx).
    • Example: {"event": "Ingestion batch failed", "source": "aljazeera", "status": 500, "level": "error"}
  • event: "Scrape job completed"

    • What it means: An entire scrape job (scheduled or on-demand) has finished.
    • Why it matters: The articles_sent field provides a clear signal of success. The duration_s field can be used to detect performance regressions over time.
    • Example: {"event": "Scrape job completed", "source": "verify_sy", "articles_sent": 12, "duration_s": 15.2, "level": "info"}

3. Health & Metrics Endpoints

The service provides two simple endpoints for automated health checks and basic metrics.

  • Health Check: GET /health

    • Purpose: A simple liveness probe used by the Docker health check to ensure the FastAPI server is running and responsive.
    • Command: curl http://localhost:9001/health
  • Metrics Endpoint: GET /metrics

    • Purpose: Exposes Prometheus metrics (counters, histograms, gauges) for basic visibility.
    • Auth: Requires the X-Metrics-Token header when DEBUG=false (see the example after this list).
    • Key Metrics:
      • scrape_runs_total{source} – scrape jobs started
      • pages_fetched_total{source} – HTTP fetches
      • items_emitted_total{source} – normalized items produced
      • ingest_requests_total{outcome} – POSTs to /ingest (success/error)
      • fetch_duration_seconds / parse_duration_seconds – fetch and parse duration histograms
      • scheduler_next_run_timestamp{source} / last_success_timestamp{source} – gauges
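
When DEBUG=false, pass the token with the request. A sketch, assuming the token is available in a METRICS_TOKEN environment variable (how it is provisioned is deployment-specific):

# assumes the token is exported as METRICS_TOKEN
curl -H "X-Metrics-Token: $METRICS_TOKEN" http://localhost:9001/metrics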