
---
title: "Runbook: Scraping Profile Failures"
description: A playbook for diagnosing and resolving failures in specific scraping profiles, such as from selector drift or content changes.
icon: material/file-alert-outline
---


Runbook: Scraping Profile Failures

Impact: Data Quality Degradation

This alert fires when a specific scraping profile is consistently failing to fetch or process articles, while other profiles are succeeding. This leads to a loss of data from the affected source, impacting the completeness and timeliness of the Labeeb platform.

Triage Checklist (5 Minutes)

Your immediate goal is to isolate the failing profile and understand the nature of the failure.

  1. Identify the Failing Profile: Check the service logs for repeated error messages associated with a specific profile name.

    # Look for log messages containing "Failed to fetch for profile ..."
    docker compose logs --tail=200 scraper | grep "Failed to fetch"
    

  2. Isolate the Failure: Trigger a manual, on-demand scrape for only the suspected profile. This provides a clean set of logs and a direct response to analyze.

    # Replace 'problem-source' with the name of the failing profile.
    # Keep the limit low for a fast test; send_to_api stays false so
    # partial/bad data is not sent upstream.
    curl -X POST http://localhost:9001/scrape -H "Content-Type: application/json" -d '{
      "sources": ["problem-source"],
      "limit": 5,
      "send_to_api": false,
      "write_to_disk": false
    }'
    

  3. Analyze the Response & Logs:

    • Check the failed array in the JSON response from the previous command.
    • Immediately after the manual scrape, tail the logs again for the detailed error message or Python traceback. This will tell you why it failed.
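    If you prefer to script this check, here is a minimal Python sketch of the same call; it assumes the response carries the failed array noted above and the count field referenced later in this runbook.

    # Trigger the isolated scrape and print the fields used for triage.
    import requests

    payload = {
        "sources": ["problem-source"],  # the suspected failing profile
        "limit": 5,
        "send_to_api": False,
        "write_to_disk": False,
    }
    resp = requests.post("http://localhost:9001/scrape", json=payload, timeout=120)
    resp.raise_for_status()
    body = resp.json()
    print("count:", body.get("count"))    # articles successfully scraped
    print("failed:", body.get("failed"))  # sources that failed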

Remediation Playbooks

Based on your triage, select the appropriate playbook to resolve the issue.

Playbook 1: Selector Drift (Site Redesign)

Symptom: The logs show parsing errors (e.g., AttributeError: 'NoneType' object has no attribute 'select_one'), or the scrape returns zero articles from a source that should have many. This is the most common failure mode and occurs when a website redesigns its HTML layout.
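For context, this specific error is what BeautifulSoup-style code raises when a stale selector matches nothing; a minimal illustration (the HTML and selectors are invented for the example):

    from bs4 import BeautifulSoup

    # After a redesign, the old class names no longer exist in the page.
    new_html = '<div class="story-tile"><h2 class="story-title">Title</h2></div>'

    soup = BeautifulSoup(new_html, "html.parser")
    card = soup.select_one("div.article-card")  # stale selector -> returns None
    # card.select_one("h2.headline")  # AttributeError: 'NoneType' object
    #                                 # has no attribute 'select_one'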

  1. Get the Target URL: Open the profile's JSON file in scraper/profiles/ and find a URL from the start_urls list.

  2. Inspect the Live HTML: Use curl from within the container to fetch the live HTML of the page, or open it in a browser and use the developer tools.

    # Run this from your local machine; curl executes inside the container
    docker compose exec scraper curl -s "https://example.com/news"
    

  3. Compare HTML to Selectors:

    • If using the generic_html provider, check the CSS selectors in the profile's meta.selectors object.
    • If using a site-specific provider (e.g., aljazeera.py), check the hardcoded selectors in the provider's Python file.
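    A quick, hedged way to test this, assuming BeautifulSoup (the selector values below are placeholders; copy the real ones from the profile or provider):

    # Fetch a start_url and count how many elements each selector matches.
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/news"  # a URL from the profile's start_urls
    selectors = {                     # placeholder values, e.g. from meta.selectors
        "article": "div.article-card",
        "title": "h2.headline",
    }

    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    for name, css in selectors.items():
        print(f"{name} ({css}): {len(soup.select(css))} match(es)")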
  4. Update the Selectors: Modify the selectors in the appropriate file to match the new HTML structure.

  5. Test the Fix: Re-run the isolated manual scrape command from the triage step. A response with a count greater than zero confirms the fix.

  6. Reload All Profiles: Once verified, apply the change permanently by reloading all profiles.

    curl -X POST http://localhost:9001/profiles/reload
    

Playbook 2: Content or Date Parsing Errors

Symptom: The logs show a traceback from a parsing library such as dateparser or newspaper3k. This can happen when a website changes its date format or article structure in a subtle way.

  1. Identify the Erroring Library: The Python traceback in the logs will clearly name the library that is failing (e.g., dateparser.parse()).

  2. Isolate the Problem Content: The logs should also contain the specific URL of the article that failed to parse. Manually inspect the content at that URL.

  3. Implement a Code Fix: This type of error almost always requires a code change in the provider's Python file in scraper/app/scraping/providers/. You may need to add more robust error handling or adjust the logic to account for the new content format.
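    As a sketch of what "more robust" usually means here, using dateparser as the example since it appears in the symptom (the function name is illustrative): dateparser.parse() returns None for unparseable input, so guard that case and wrap the call so one bad article cannot abort the scrape.

    import dateparser

    def parse_published_date(raw):
        """Return a datetime or None; never let date parsing crash the provider."""
        if not raw:
            return None
        try:
            return dateparser.parse(raw.strip())  # None if the format is unrecognized
        except Exception:
            return None  # log and skip the date rather than failing the article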

  4. Deploy the Fix: This change requires a new Docker image to be built and deployed.

Playbook 3: Source Website Down or Blocking Requests

Symptom: The logs show connection timeouts, 403 Forbidden, or 404 Not Found errors for every request to a specific domain.

  1. Verify External Status: Confirm the website is down for everyone, not just our scraper, by checking it from a browser or another vantage point outside our infrastructure (a quick probe is sketched after this list).

  2. Disable the Profile: The safest and most immediate action is to temporarily disable the profile to stop generating failing requests. Follow the procedure in the Source Rate-Limiting Runbook.

  3. Notify Stakeholders: Inform the relevant team that a data source is offline and that data will be missing until it is restored.
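For step 1, a quick probe from outside the scraper can help characterize the failure. A sketch using requests; the URL is a placeholder for one of the profile's start_urls:

    import requests

    url = "https://example.com/news"
    try:
        resp = requests.get(url, timeout=15)
        # 200 here while the scraper fails suggests our requests are being blocked;
        # 403/404 everywhere suggests the site itself changed or went away.
        print(resp.status_code)
    except requests.exceptions.Timeout:
        print("timed out: host unreachable or very slow")
    except requests.exceptions.ConnectionError as exc:
        print("connection failed (DNS or network):", exc)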


Post-Incident Actions

  • Improve Selector Robustness: For critical sources, consider adding fallback selectors or more resilient selection logic (e.g., searching for microdata schemas instead of CSS classes). A sketch of this pattern follows below.
  • Create Regression Tests: For complex, site-specific providers, add a simple test case to the scraper/tests/ directory that loads a saved copy of a known-good HTML page and asserts that the selectors can still parse it. This catches selector drift in CI before it reaches production. A minimal example also follows.
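A hedged sketch of fallback-selector logic (the selector strings are hypothetical; the last one targets a microdata attribute rather than a styling class):

    from bs4 import BeautifulSoup

    # Try progressively more generic selectors; use the first that matches.
    TITLE_SELECTORS = ["h2.headline", "h1.article-title", "[itemprop='headline']"]

    def select_first(soup, candidates):
        """Return the first element matched by any candidate selector, else None."""
        for css in candidates:
            el = soup.select_one(css)
            if el is not None:
                return el
        return None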
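And a minimal regression test in the style described above (assumes pytest and a saved HTML fixture; the paths, fixture name, and selector are hypothetical):

    from pathlib import Path
    from bs4 import BeautifulSoup

    FIXTURE = Path(__file__).parent / "fixtures" / "example_news_listing.html"

    def test_listing_selectors_still_match():
        soup = BeautifulSoup(FIXTURE.read_text(encoding="utf-8"), "html.parser")
        articles = soup.select("div.article-card")  # copy the real selector from the profile
        assert articles, "listing selector matched nothing: likely selector drift"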