# Scraper Service: Environment & Configuration
This document is the definitive reference for configuring the Scraper service. It covers all environment variables used by the application and the structure of the JSON profiles that drive the scraping logic.
## 1. Environment Variables
These variables configure the service's runtime behavior. Set them in your `docker-compose.yml` file or in a `.env` file in the `scraper/` directory.
| Variable | Default | Required | Description |
|---|---|---|---|
| `APP_PORT` | `8080` | No | The internal port the FastAPI application listens on. |
| `LOG_LEVEL` | `INFO` | No | The logging level for the application (e.g., `DEBUG`, `INFO`, `WARNING`). |
| `DEBUG` | `false` | No | If `true`, enables developer mode and leaves `/metrics` unprotected. |
| `OUT_DIR` | `data/out` | No | The directory where scraped data is written if `write_to_disk` is enabled. |
| `PROFILES_DIR` | `/app/profiles` | Yes | The absolute path inside the container where the JSON profile files are located. |
| `DEFAULT_PROVIDER` | `generic_rss` | No | The provider to use if a profile does not specify one. |
| `HTTP_TIMEOUT` | `25` | No | The timeout in seconds for all outbound HTTP requests. |
| `MAX_RETRIES` | `5` | No | The maximum number of retries for failed HTTP requests. |
| `BACKOFF_FACTOR` | `1.2` | No | The backoff factor used between retries (`sleep = backoff * 2 ** (attempt - 1)`). |
| `INGEST_API_URL` | `null` | Yes | The full URL of the main API's ingestion endpoint (e.g., `http://api/api/v1/ingest/articles`). |
| `INGEST_TOKEN` | `null` | Yes | The secret bearer token for authenticating with the main API service. |
| `INGEST_BATCH_SIZE` | `50` | No | The number of articles to send in a single batch to the ingestion API. |
| `SCHEDULER_ENABLED` | `false` | No | Set to `true` to enable the built-in cron-based job scheduler. |
| `METRICS_TOKEN` | `null` | No | Token required to access `/metrics` when `DEBUG` is `false`. |
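As a starting point, a minimal `.env` file might look like the sketch below. The values are illustrative, not canonical: only `PROFILES_DIR`, `INGEST_API_URL`, and `INGEST_TOKEN` are required, and the token shown is a placeholder you must replace.

```env
# Required settings
PROFILES_DIR=/app/profiles
INGEST_API_URL=http://api/api/v1/ingest/articles
INGEST_TOKEN=replace-with-a-real-secret

# Optional overrides (defaults are listed in the table above)
LOG_LEVEL=DEBUG
SCHEDULER_ENABLED=true
```

Note that with the defaults above (`BACKOFF_FACTOR=1.2`, `MAX_RETRIES=5`), the retry formula yields sleeps of 1.2, 2.4, 4.8, 9.6, and 19.2 seconds across the five attempts.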
## 2. Profile Configuration Schema
The Scraper is driven by `.json` files located in `PROFILES_DIR`. Each file defines a data source and how to scrape it. The structure of these files is critical for the correct operation of the service.
> **Profile Validation:** All profiles are validated against a formal JSON Schema on service startup. Invalid profiles are logged and skipped. For the complete schema, see `scraper/app/data/schemas/profile.schema.json`.
### Top-Level Profile Keys
| Key | Type | Required | Description |
|---|---|---|---|
| `name` | `string` | Yes | A unique identifier for the profile (e.g., `aljazeera`). |
| `provider` | `string` | Yes | The name of the provider to use. Must be one of: `generic_rss`, `generic_html`, `aljazeera`, `fatabyyano`, `verify_sy`, `matsd24`. |
| `start_urls` | `array` | Yes | A list of entry-point URLs for the scraper to begin its work. |
| `rule` | `object` | No | A set of rules to filter articles after they have been scraped. |
| `schedule` | `string` | No | A cron string (e.g., `"*/30 * * * *"`) that defines the automated scraping schedule. |
| `enabled` | `boolean` | No | If `false`, this profile is ignored by both scheduled and on-demand scrapes. Defaults to `true`. |
| `language` | `string` | No | The primary language of the source. Defaults to `ar`. |
| `meta` | `object` | No | A flexible field for provider-specific settings (e.g., CSS selectors for the `generic_html` provider). |
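To tie these keys together, here is an illustrative profile for a `generic_html` source. The `name`, URL, and the selector keys under `meta` are hypothetical placeholders; the `rule` keys are described in the next section, and the JSON Schema referenced above remains the authoritative definition.

```json
{
  "name": "example_source",
  "provider": "generic_html",
  "start_urls": ["https://example.com/news"],
  "schedule": "*/30 * * * *",
  "enabled": true,
  "language": "ar",
  "rule": {
    "keywords": ["syria"],
    "limit": 20
  },
  "meta": {
    "article_selector": "article.post",
    "title_selector": "h1.title"
  }
}
```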
### The `rule` Object

The `rule` object defines filters that are applied after articles have been fetched.

- `keywords` (array of string): Only keep articles containing at least one of these keywords.
- `categories` (array of string): Only keep articles belonging to one of these categories.
- `limit` (integer): The maximum number of articles to return from this source.
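For example, the following `rule` would keep at most 10 articles that mention at least one of the listed keywords and belong to the given category; the keyword and category values are illustrative.

```json
{
  "keywords": ["election", "ceasefire"],
  "categories": ["politics"],
  "limit": 10
}
```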
### The `start_urls` Array

This array can contain simple URL strings or more complex objects that pass per-URL metadata.
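A sketch of both forms, assuming the object form carries a `url` field plus free-form metadata (the exact object keys, such as `category` here, are hypothetical; check the profile schema for the canonical shape):

```json
"start_urls": [
  "https://example.com/feed.xml",
  {
    "url": "https://example.com/politics/feed.xml",
    "category": "politics"
  }
]
```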