Scraper Service: Environment & Configuration

This document is the definitive reference for configuring the Scraper service. It covers all environment variables used by the application and the structure of the JSON profiles that drive the scraping logic.

1. Environment Variables

These variables configure the service's runtime behavior. They can be set in your `docker-compose.yml` file or in a `.env` file in the `scraper/` directory.

| Variable | Default | Required | Description |
| --- | --- | --- | --- |
| `APP_PORT` | `8080` | No | The internal port the FastAPI application will listen on. |
| `LOG_LEVEL` | `INFO` | No | The logging level for the application (e.g., `DEBUG`, `INFO`, `WARNING`). |
| `DEBUG` | `false` | No | If `true`, enables developer mode and leaves `/metrics` unprotected. |
| `OUT_DIR` | `data/out` | No | The directory where scraped data is written if `write_to_disk` is enabled. |
| `PROFILES_DIR` | `/app/profiles` | Yes | The absolute path inside the container where the JSON profile files are located. |
| `DEFAULT_PROVIDER` | `generic_rss` | No | The provider to use if a profile does not specify one. |
| `HTTP_TIMEOUT` | `25` | No | The timeout in seconds for all outbound HTTP requests. |
| `MAX_RETRIES` | `5` | No | The maximum number of retries for failed HTTP requests. |
| `BACKOFF_FACTOR` | `1.2` | No | The backoff factor to use between retries (`sleep = backoff * 2 ** (attempt - 1)`). |
| `INGEST_API_URL` | `null` | Yes | The full URL of the main API's ingestion endpoint (e.g., `http://api/api/v1/ingest/articles`). |
| `INGEST_TOKEN` | `null` | Yes | The secret bearer token for authenticating with the main API service. |
| `INGEST_BATCH_SIZE` | `50` | No | The number of articles to send in a single batch to the ingestion API. |
| `SCHEDULER_ENABLED` | `false` | No | Set to `true` to enable the built-in cron-based job scheduler. |
| `METRICS_TOKEN` | `null` | No | Token required to access `/metrics` when `DEBUG` is `false`. |
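
As a rough illustration of how the defaults and the retry formula in the table fit together, the sketch below reads a few of the variables from a mapping and computes the retry delays. `load_settings` and `retry_delays` are hypothetical helper names, not functions of the Scraper service, and treating a missing required variable as a startup error is an assumption.

```python
import os
from typing import Mapping

# Defaults copied from the table above (only a subset is shown).
DEFAULTS = {
    "APP_PORT": "8080",
    "PROFILES_DIR": "/app/profiles",
    "HTTP_TIMEOUT": "25",
    "MAX_RETRIES": "5",
    "BACKOFF_FACTOR": "1.2",
    "INGEST_BATCH_SIZE": "50",
}
# Variables marked as required whose documented default is null.
REQUIRED = ("INGEST_API_URL", "INGEST_TOKEN")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: merge defaults with the environment and check required keys."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        # Assumption: a missing required variable aborts startup.
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings.update({name: env[name] for name in REQUIRED})
    return settings


def retry_delays(backoff: float, max_retries: int) -> list[float]:
    """Delays from the documented formula: sleep = backoff * 2 ** (attempt - 1)."""
    return [backoff * 2 ** (attempt - 1) for attempt in range(1, max_retries + 1)]
```

Assuming attempts are numbered from 1, `retry_delays(1.2, 5)` yields waits of 1.2, 2.4, 4.8, 9.6 and 19.2 seconds, so a request that exhausts all retries with the default values spends roughly 37 seconds sleeping, in addition to any time lost to `HTTP_TIMEOUT`.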

2. Profile Configuration Schema

The Scraper is driven by `.json` files located in `PROFILES_DIR`. Each file defines one data source and how to scrape it, and must follow the schema described in this section.

Profile Validation

All profiles are validated against a formal JSON Schema at service startup. Invalid profiles are logged and skipped. For the complete schema, see `scraper/app/data/schemas/profile.schema.json`.
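
The "log and skip" behavior can be pictured with the minimal sketch below. It is not the service's loader: `load_profiles` is a hypothetical name, and the required-key check is a simplified stand-in for real JSON Schema validation, which would be driven by `profile.schema.json`.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("scraper.profiles")

# Keys marked as required in the profile documentation.
REQUIRED_KEYS = ("name", "provider", "start_urls")


def load_profiles(profiles_dir: str) -> dict:
    """Hypothetical loader: return {name: profile} for every usable *.json file."""
    profiles = {}
    for path in sorted(Path(profiles_dir).glob("*.json")):
        try:
            profile = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            logger.warning("skipping %s: cannot read or parse (%s)", path.name, exc)
            continue
        if not isinstance(profile, dict):
            logger.warning("skipping %s: top level is not an object", path.name)
            continue
        # Simplified stand-in for schema validation: only check required keys.
        missing = [key for key in REQUIRED_KEYS if key not in profile]
        if missing:
            logger.warning("skipping %s: missing keys %s", path.name, missing)
            continue
        profiles[profile["name"]] = profile
    return profiles
```

One broken file does not stop the others from loading, which matches the documented behavior of skipping invalid profiles rather than failing startup.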

Top-Level Profile Keys

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | `string` | Yes | A unique identifier for the profile (e.g., `aljazeera`). |
| `provider` | `string` | Yes | The name of the provider to use. Must be one of: `generic_rss`, `generic_html`, `aljazeera`, `fatabyyano`, `verify_sy`, `matsd24`. |
| `start_urls` | `array` | Yes | A list of entry-point URLs for the scraper to begin its work. |
| `rule` | `object` | No | A set of rules to filter articles after they have been scraped. |
| `schedule` | `string` | No | A cron string (e.g., `"*/30 * * * *"` for every 30 minutes) that defines the automated scraping schedule. |
| `enabled` | `boolean` | No | If `false`, this profile will be ignored by both scheduled and on-demand scrapes. Defaults to `true`. |
| `language` | `string` | No | The primary language of the source. Defaults to `ar`. |
| `meta` | `object` | No | A flexible field for provider-specific settings (e.g., CSS selectors for the `generic_html` provider). |
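
The documented defaults for `enabled` and `language` could be applied along the lines of the sketch below. `apply_profile_defaults`, `active_profiles` and `EXAMPLE_PROFILES` are hypothetical names used only for illustration, and the example URLs are placeholders.

```python
def apply_profile_defaults(profile: dict) -> dict:
    """Hypothetical helper: return a copy with the documented defaults filled in."""
    resolved = dict(profile)  # shallow copy, so the caller's dict is not modified
    resolved.setdefault("enabled", True)   # documented default: true
    resolved.setdefault("language", "ar")  # documented default: ar
    return resolved


def active_profiles(profiles: list[dict]) -> list[dict]:
    """Keep only profiles that are not explicitly disabled."""
    return [p for p in map(apply_profile_defaults, profiles) if p["enabled"]]


# Illustrative data only; the second profile is switched off.
EXAMPLE_PROFILES = [
    {
        "name": "aljazeera",
        "provider": "aljazeera",
        "start_urls": ["https://example.com/feed.xml"],
        "schedule": "*/30 * * * *",
    },
    {
        "name": "paused_source",
        "provider": "generic_rss",
        "start_urls": ["https://example.com/other.xml"],
        "enabled": False,
    },
]
```

Calling `active_profiles(EXAMPLE_PROFILES)` keeps only the `aljazeera` profile, with `language` resolved to `ar`, since `paused_source` sets `enabled` to `false`.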

The `rule` Object

The rule object defines filters that are applied after articles have been fetched.

- `keywords` (array of strings): Only keep articles containing at least one of these keywords.
- `categories` (array of strings): Only keep articles belonging to one of these categories.
- `limit` (integer): The maximum number of articles to return from this source.
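
A minimal sketch of how these three filters might combine is shown below. Several details are assumptions rather than documented behavior: articles are plain dicts with `title`, `text` and `category` fields, keyword matching is a case-insensitive substring test, the filters are combined with AND, and `limit` is applied last. `apply_rule` is a hypothetical name.

```python
def apply_rule(articles: list[dict], rule: dict) -> list[dict]:
    """Hypothetical filter: apply keywords, categories and limit to fetched articles."""
    keywords = [k.lower() for k in rule.get("keywords", [])]
    categories = set(rule.get("categories", []))
    limit = rule.get("limit")

    kept = []
    for article in articles:
        # Assumption: keywords are searched in the title and body text.
        text = (article.get("title", "") + " " + article.get("text", "")).lower()
        if keywords and not any(k in text for k in keywords):
            continue
        if categories and article.get("category") not in categories:
            continue
        kept.append(article)
    # Assumption: the limit caps the result after filtering, not before.
    return kept if limit is None else kept[:limit]
```

An empty `rule` leaves the article list unchanged, which is consistent with `rule` being an optional key.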

The `start_urls` Array

This array can mix plain URL strings with objects that carry a `url` plus extra metadata, such as `category` or `verdict`.

Example of start_urls

"start_urls": [
  "https://example.com/feed.xml",
  {
    "url": "https://example.com/news/politics",
    "category": "Politics",
    "verdict": "real"
  }
]
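
Because entries come in two shapes, a consumer will typically normalize them first. The sketch below is one way to do that, assuming that every key other than `url` is metadata and that any other entry shape is an error; `normalize_start_urls` is a hypothetical name.

```python
def normalize_start_urls(start_urls: list) -> list[tuple[str, dict]]:
    """Hypothetical helper: turn mixed entries into (url, metadata) pairs."""
    normalized = []
    for entry in start_urls:
        if isinstance(entry, str):
            # A plain URL string carries no metadata.
            normalized.append((entry, {}))
        elif isinstance(entry, dict) and "url" in entry:
            # Assumption: everything except `url` is metadata for the provider.
            metadata = {key: value for key, value in entry.items() if key != "url"}
            normalized.append((entry["url"], metadata))
        else:
            raise ValueError(f"unsupported start_urls entry: {entry!r}")
    return normalized
```

For the example above, this produces one pair with empty metadata for the feed URL and one pair whose metadata holds `category` and `verdict` for the politics page.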