
Scraper Service: Environment & Configuration

This document is the definitive reference for configuring the Scraper service. It covers all environment variables used by the application and the structure of the JSON profiles that drive the scraping logic.


1. Environment Variables

These variables are used to configure the service's runtime behavior. They can be set in your docker-compose.yml file or in a .env file in the scraper/ directory.

| Variable | Default | Required | Description |
|---|---|---|---|
| APP_PORT | 8080 | No | The internal port the FastAPI application will listen on. |
| LOG_LEVEL | INFO | No | The logging level for the application (e.g., DEBUG, INFO, WARNING). |
| DEBUG | false | No | If true, enables developer mode and leaves /metrics unprotected. |
| OUT_DIR | data/out | No | The directory where scraped data is written if write_to_disk is enabled. |
| PROFILES_DIR | /app/profiles | Yes | The absolute path inside the container where the JSON profile files are located. |
| DEFAULT_PROVIDER | generic_rss | No | The provider to use if a profile does not specify one. |
| HTTP_TIMEOUT | 25 | No | The timeout in seconds for all outbound HTTP requests. |
| MAX_RETRIES | 5 | No | The maximum number of retries for failed HTTP requests. |
| BACKOFF_FACTOR | 1.2 | No | The backoff factor to use between retries (sleep = backoff * 2 ** (attempt - 1)). |
| INGEST_API_URL | null | Yes | The full URL of the main API's ingestion endpoint (e.g., http://api/api/v1/ingest/articles). |
| INGEST_TOKEN | null | Yes | The secret bearer token for authenticating with the main API service. |
| INGEST_BATCH_SIZE | 50 | No | The number of articles to send in a single batch to the ingestion API. |
| SCHEDULER_ENABLED | false | No | Set to true to enable the built-in cron-based job scheduler. |
| METRICS_TOKEN | null | No | Token required to access /metrics when DEBUG is false. |
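
The retry formula given for BACKOFF_FACTOR can be worked through with a short sketch. The helper name retry_delays is hypothetical, and the sketch assumes that attempt counts retries starting at 1; the service's own retry code may differ.

```python
import os


def retry_delays(max_retries: int, backoff_factor: float) -> list[float]:
    """Seconds to sleep before each retry: backoff_factor * 2 ** (attempt - 1).

    Assumption: `attempt` is 1 for the first retry.
    """
    return [backoff_factor * 2 ** (attempt - 1) for attempt in range(1, max_retries + 1)]


# Read the documented variables, falling back to the documented defaults.
max_retries = int(os.environ.get("MAX_RETRIES", "5"))
backoff_factor = float(os.environ.get("BACKOFF_FACTOR", "1.2"))

for attempt, delay in enumerate(retry_delays(max_retries, backoff_factor), start=1):
    print(f"retry {attempt}: wait {delay:.1f}s")
```

Under that assumption, the defaults give waits of 1.2, 2.4, 4.8, 9.6 and 19.2 seconds, so a request that fails every time spends roughly 37 seconds sleeping in addition to the time taken by the requests themselves.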

2. Profile Configuration Schema

The Scraper is driven by .json files located in PROFILES_DIR. Each file defines one data source and how to scrape it, so a profile must follow the structure described below for the service to load it.

Profile Validation

All profiles are validated against a formal JSON Schema on service startup. Invalid profiles are logged and skipped. For the complete schema, see scraper/app/data/schemas/profile.schema.json.

Top-Level Profile Keys

| Key | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | A unique identifier for the profile (e.g., aljazeera). |
| provider | string | Yes | The name of the provider to use. Must be one of: generic_rss, generic_html, aljazeera, fatabyyano, verify_sy, matsd24. |
| start_urls | array | Yes | A list of entry-point URLs for the scraper to begin its work. |
| rule | object | No | A set of rules to filter articles after they have been scraped. |
| schedule | string | No | A cron string (e.g., "*/30 * * * *") that defines the automated scraping schedule. |
| enabled | boolean | No | If false, this profile will be ignored by both scheduled and on-demand scrapes. Defaults to true. |
| language | string | No | The primary language of the source. Defaults to ar. |
| meta | object | No | A flexible field for provider-specific settings (e.g., CSS selectors for the generic_html provider). |
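
The load-validate-skip behaviour described under Profile Validation can be approximated as below. This is a minimal sketch, not the service's implementation: it checks only the required keys and the provider list from the table above instead of applying the real profile.schema.json, and the names check_profile, load_profiles and the logger name are hypothetical.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("scraper.profiles")  # hypothetical logger name

# Values taken from the key table above.
KNOWN_PROVIDERS = {"generic_rss", "generic_html", "aljazeera",
                   "fatabyyano", "verify_sy", "matsd24"}
REQUIRED_KEYS = {"name": str, "provider": str, "start_urls": list}
DEFAULTS = {"enabled": True, "language": "ar"}


def check_profile(profile: dict) -> list[str]:
    """Return a list of problems; an empty list means the profile passed."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in profile:
            problems.append(f"missing required key: {key}")
        elif not isinstance(profile[key], expected):
            problems.append(f"{key} must be of type {expected.__name__}")
    provider = profile.get("provider")
    if isinstance(provider, str) and provider not in KNOWN_PROVIDERS:
        problems.append(f"unknown provider: {provider}")
    return problems


def load_profiles(profiles_dir: str) -> list[dict]:
    """Load every *.json file in profiles_dir, logging and skipping invalid ones."""
    loaded = []
    for path in sorted(Path(profiles_dir).glob("*.json")):
        try:
            profile = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:  # unreadable file or invalid JSON
            logger.warning("skipping %s: %s", path.name, exc)
            continue
        if not isinstance(profile, dict):
            logger.warning("skipping %s: profile must be a JSON object", path.name)
            continue
        problems = check_profile(profile)
        if problems:
            logger.warning("skipping %s: %s", path.name, "; ".join(problems))
            continue
        loaded.append({**DEFAULTS, **profile})  # fill in documented defaults
    return loaded
```

Calling load_profiles("/app/profiles") would return only the profiles that passed, each with enabled and language filled in when absent.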

The rule Object

The rule object defines filters that are applied after articles have been fetched.

  • keywords (array of string): Only keep articles containing at least one of these keywords.
  • categories (array of string): Only keep articles belonging to one of these categories.
  • limit (integer): The maximum number of articles to return from this source.
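
The three filters can be illustrated with a short sketch. Several details here are assumptions rather than documented behaviour: the article fields title, text and category, case-insensitive keyword matching, combining keywords and categories with AND, and applying limit last. The function name apply_rule is hypothetical.

```python
def apply_rule(articles: list[dict], rule: dict) -> list[dict]:
    """Filter already-fetched articles using keywords, categories and limit."""
    keywords = [k.lower() for k in rule.get("keywords", [])]
    categories = set(rule.get("categories", []))
    limit = rule.get("limit")

    kept = []
    for article in articles:
        # Assumed article fields: "title", "text", "category".
        haystack = f"{article.get('title', '')} {article.get('text', '')}".lower()
        if keywords and not any(k in haystack for k in keywords):
            continue  # none of the keywords appear
        if categories and article.get("category") not in categories:
            continue  # category is not in the allowed set
        kept.append(article)

    # Assumption: the limit is applied after the other filters.
    return kept if limit is None else kept[:limit]


articles = [
    {"title": "Election results announced", "category": "Politics"},
    {"title": "Local football final", "category": "Sports"},
    {"title": "Election debate recap", "category": "Politics"},
]
rule = {"keywords": ["election"], "categories": ["Politics"], "limit": 1}
print(apply_rule(articles, rule))
```

With this rule, both election articles pass the keyword and category filters, and limit then trims the result to the first one.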

The start_urls Array

Each entry in this array is either a plain URL string or an object that pairs a url with metadata such as category or verdict.

Example of start_urls (a plain URL string followed by an object with metadata):
"start_urls": [
  "https://example.com/feed.xml",
  {
    "url": "https://example.com/news/politics",
    "category": "Politics",
    "verdict": "real"
  }
]
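
Because start_urls mixes two entry shapes, a consumer will usually want to normalize them first. The sketch below shows one way to do that; the function name normalize_start_urls and the choice to raise ValueError on a malformed entry are assumptions, not the service's documented behaviour.

```python
def normalize_start_urls(start_urls: list) -> list[dict]:
    """Turn every entry into a dict that has at least a "url" key."""
    normalized = []
    for entry in start_urls:
        if isinstance(entry, str):
            # Plain URL string: no extra metadata.
            normalized.append({"url": entry})
        elif isinstance(entry, dict) and isinstance(entry.get("url"), str):
            # Object form: keep the metadata (e.g. category, verdict) as-is.
            normalized.append(dict(entry))
        else:
            raise ValueError(f"invalid start_urls entry: {entry!r}")
    return normalized


entries = normalize_start_urls([
    "https://example.com/feed.xml",
    {"url": "https://example.com/news/politics", "category": "Politics", "verdict": "real"},
])
for entry in entries:
    print(entry["url"], entry.get("category"))
```

After normalization, code that walks the list can read entry["url"] unconditionally and treat every other key as optional metadata.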