
---
title: "Runbook: Scraper Backlog & Stuck Scheduler"
description: A playbook for diagnosing and recovering from a stalled scraper or a growing data backlog.
icon: material/backup-restore
---


Runbook: Scraper Backlog & Stuck Scheduler

Impact: High - Stale Platform Data

This alert fires when the Scraper service stops processing new articles, either due to a scheduler failure or a persistent inability to fetch/ingest data. The direct impact is that the entire Labeeb platform will be operating on stale data, and new events will not be reflected in search results or analysis.

Triage Checklist (5 Minutes)

Your immediate goal is to determine why the scraper is not processing data. Follow these steps methodically.

  1. Verify Service Health: First, confirm the service is running and responsive.

    curl http://localhost:9001/health
    

  2. Check for Scheduler Activity in Logs: Inspect the logs for messages from the APScheduler component. A lack of recent "scheduled" or "running job" messages indicates a stalled scheduler.

    docker compose logs --tail=100 scraper | grep "scheduler"
    

  3. Check for Errors: Look for obvious errors like network timeouts, crashes, or repeated exceptions.

    docker compose logs --tail=200 scraper | grep -E "ERROR|Traceback|timeout"
    

  4. Check Upstream API Health: The scraper depends on the main API to ingest data. If the API is down, the scraper will be blocked. Check the API's health.

    # Run from the host: this executes curl inside the scraper container, where the 'api' hostname resolves
    docker compose exec scraper curl http://api/api/v1/health
    


Remediation Playbooks

Based on your triage, select the appropriate playbook to resolve the issue.

Symptom: The logs show the scheduler is running, but no new jobs are being executed for a specific profile. This often points to a stale lock from a previous run that failed without cleanup.

Manual State Intervention Required

This procedure involves manually editing the service's state file. This is a high-risk operation. Proceed with caution and make a backup of the file before making any changes.

  1. Enter the Container: Open a shell inside the running scraper container.

    docker compose exec scraper bash
    

  2. Backup the State File: The state file (state.json) is the source of truth for the last-seen articles. Create a timestamped backup.

    cp /app/data/out/state.json /app/data/out/state.json.bak-$(date +%s)
    

  3. Inspect the State File: Examine the contents of the state file to identify the stale entry. Look for the profile that is no longer running.

    cat /app/data/out/state.json
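
    If the file is large, listing only the top-level profile keys makes a stale entry easier to spot (this assumes profiles are stored as top-level keys, as the jq example in the next step implies):

    jq 'keys' /app/data/out/state.json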
    

  4. Clear the Stale Lock: Using a text editor inside the container (like vi or nano), carefully remove the entry for the stalled profile or category. Alternatively, for a full reset of a single profile, you can use jq.

    # Example using jq to remove the 'aljazeera' profile state
    jq 'del(.aljazeera)' /app/data/out/state.json > /app/data/out/state.json.tmp && mv /app/data/out/state.json.tmp /app/data/out/state.json
    

  5. Trigger a Manual Scrape: Exit the container and trigger a manual scrape for the affected profile to verify that it now runs correctly.

    curl -X POST http://localhost:9001/scrape -H "Content-Type: application/json" -d '{"sources": ["aljazeera"]}'
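
    To confirm the profile is being processed again, tail the logs for the affected source (assuming source names appear in the scraper's log lines):

    docker compose logs --tail=50 scraper | grep -i "aljazeera"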
    

Symptom: The scraper logs are filled with connection errors or HTTP 5xx status codes when trying to contact the main API service.

  1. Confirm API Unavailability: Follow the triage steps in the API Service Runbook to diagnose and resolve the issue with the main API.
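
    Before switching runbooks, a quick check from the host confirms whether the api container is up at all (assuming the API runs as the api service in the same Compose project):

    docker compose ps api
    docker compose logs --tail=50 api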

  2. Temporarily Disable Ingestion: If the API service requires extended downtime, you can prevent the scraper from generating further errors by configuring it to write to disk instead of sending to the API. This is done by submitting a POST request to the /scrape endpoint with send_to_api set to false.
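
    A minimal example, assuming the request body accepts the same sources field as the manual scrape shown earlier:

    curl -X POST http://localhost:9001/scrape \
      -H "Content-Type: application/json" \
      -d '{"sources": ["aljazeera"], "send_to_api": false}'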

    Note

    This is a temporary measure. Once the API is restored, the data written to disk will need to be manually ingested.

Symptom: The scraper is running but is extremely slow, or the container is frequently restarting.

  1. Check Container Resource Usage: Use docker stats to check the CPU and memory usage of the scraper container.

    docker stats scraper
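
    If the container is frequently restarting, check whether Docker killed it for exceeding a memory limit (the docker compose ps -q lookup avoids assuming the exact container name):

    docker inspect --format 'OOMKilled={{.State.OOMKilled}} Restarts={{.RestartCount}}' $(docker compose ps -q scraper)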
    

  2. Increase Resources: If the container is hitting its resource limits, increase the memory or CPU allocation in your docker-compose.yml file.
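
    A sketch of what the limits might look like under the scraper service in docker-compose.yml (the values shown are illustrative, not recommendations):

    services:
      scraper:
        deploy:
          resources:
            limits:
              cpus: "1.0"
              memory: 1G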


Post-Incident Actions

  • Root Cause Analysis: Determine why the job stalled or the lock was not released. Was it a network blip, a bug in a provider, or a non-graceful shutdown?
  • Improve Lock Management: Investigate adding a TTL (Time To Live) to the job locks in state.py so that they expire automatically after a reasonable period.
  • Enhance Health Checks: The /health endpoint could be improved to include the status of the scheduler and the age of the last successfully completed job.