Endpoints

The Scraper service exposes a FastAPI interface for profile-driven web scraping and content normalization. The endpoints support both scheduled and on-demand operation.

  • /health --- Liveness + dependency checks. Returns summary JSON; 200 on success.
  • /metrics --- Prometheus exposition format. Token-protected in non-debug mode.
  • /profiles --- List all available profiles with their configuration status.
  • /profiles/reload (POST) --- Reload profiles from disk (Git‑ops friendly). Validates and applies changes without restart.
  • /profiles/{name}/categories --- List categories available for a specific profile.
  • /runs (POST) --- Trigger a run: { "profile_id":"…", "limit":100 }
  • /runs (GET) --- Query historical runs with filters.
  • /replay (POST) --- Guarded endpoint to re‑emit cached items.
  • /scrape (POST) --- On‑demand scrape for profiles, with comprehensive filtering options.

GET /health

Verify that the service is running and check dependency status.

200 OK

{"status":"ok","deps":{"api":"ok","providers":{"apify":"ok","scrapingdog":"ok"}}}

Failure example:

{"status":"degraded","deps":{"api":"fail","providers":{"apify":"ok","scrapingdog":"ok"}}}

Example:

curl http://localhost:9001/health
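The same check can be scripted. The sketch below is one possible client, not part of the service: it assumes the host and port from the curl example, and the helper names failing_deps and check_health are made up for illustration. The page does not state which HTTP status a degraded response uses, so the network call may raise an error for non-2xx replies.

```python
import json
from urllib.request import urlopen

# Assumption: the service listens on localhost:9001, as in the curl example.
HEALTH_URL = "http://localhost:9001/health"


def failing_deps(payload: dict) -> list:
    """Return dotted names of dependencies whose status is not "ok"."""
    failures = []

    def walk(node: dict, prefix: str) -> None:
        for name, value in node.items():
            path = f"{prefix}.{name}" if prefix else name
            if isinstance(value, dict):
                walk(value, path)  # nested group such as "providers"
            elif value != "ok":
                failures.append(path)

    walk(payload.get("deps", {}), "")
    return failures


def check_health(url: str = HEALTH_URL, timeout: float = 5.0) -> list:
    """Fetch /health and return the failing dependencies (empty if healthy)."""
    with urlopen(url, timeout=timeout) as resp:
        return failing_deps(json.load(resp))
```

Applied to the failure example above, failing_deps returns ["api"]; for the healthy payload it returns an empty list.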

GET /metrics

Prometheus endpoint for basic counters and gauges.

  • Endpoint: GET /metrics
  • Headers (non-debug only): X-Metrics-Token: <token>

Key Metrics:

  • Request counters and latency histograms
  • Profile execution statistics
  • Provider success/failure rates
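A minimal sketch of a scrape client for this endpoint, assuming the X-Metrics-Token header described above and a base URL of your choosing. The function names build_metrics_request and fetch_metrics are hypothetical; only the header name and path come from this page.

```python
from typing import Optional
from urllib.request import Request, urlopen


def build_metrics_request(base_url: str, token: Optional[str] = None) -> Request:
    """Build a GET /metrics request, adding the token header when one is given."""
    headers = {}
    if token is not None:
        # Required in non-debug mode according to the section above.
        headers["X-Metrics-Token"] = token
    return Request(f"{base_url.rstrip('/')}/metrics", headers=headers, method="GET")


def fetch_metrics(base_url: str, token: Optional[str] = None, timeout: float = 5.0) -> str:
    """Return the raw Prometheus exposition text."""
    with urlopen(build_metrics_request(base_url, token), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

In debug mode the token can be omitted, for example fetch_metrics("http://localhost:9001").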

GET /profiles

List all available profiles with their configuration and status.

200 OK

[
  {
    "name": "aljazeera",
    "provider": "aljazeera",
    "enabled": true,
    "language": "ar",
    "schedule": "*/30 * * * *"
  }
]

GET /profiles/{name}/categories

List categories available for a specific profile.

Example:

curl http://localhost:9001/profiles/aljazeera/categories

200 OK

["politics", "economy", "sports"]

POST /profiles/reload

Reload profiles from disk. Call this endpoint after you add, remove, or edit any of the JSON files in the /profiles directory so that the changes take effect without restarting the service.

POST /profiles/reload
204 No Content

Validation: Profiles are validated against the JSON Schema in app/data/schemas/profile.schema.json. Invalid files are logged and skipped.
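The "log and skip" behaviour can be pictured with the following sketch. It is not the service's code: the validate callable is a hypothetical stand-in for the JSON Schema check, keying profiles by file name is an assumption, and require_name is a toy validator used only to make the example runnable.

```python
import json
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("profiles")


def load_profiles(directory: Path, validate: Callable[[object], None]) -> dict:
    """Load every *.json file in `directory`, logging and skipping invalid ones.

    `validate` stands in for the schema check and must raise ValueError
    when a profile is invalid.
    """
    profiles = {}
    for path in sorted(directory.glob("*.json")):
        try:
            profile = json.loads(path.read_text(encoding="utf-8"))
            validate(profile)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            logger.warning("skipping invalid profile %s: %s", path.name, exc)
            continue
        profiles[path.stem] = profile  # assumption: key by file name
    return profiles


def require_name(profile: object) -> None:
    """Toy validator: only checks for a string "name" field."""
    if not isinstance(profile, dict) or not isinstance(profile.get("name"), str):
        raise ValueError('missing string field "name"')
```

A malformed file and a file without a name are both skipped, while valid files are still loaded, so one bad profile does not block the reload.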

POST /runs

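Trigger a run for a profile. The request body carries the profile_id and a limit; the service answers 202 Accepted with the queued run.

POST /runs
Content-Type: application/json

{"profile_id":"newsdata:ar:sy","limit":100}

202 Accepted

{"run_id":"r_20250825_030000_001","profile_id":"newsdata:ar:sy","status":"queued"}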
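As a rough illustration, the request and response above could be handled from Python as follows. The helper names are hypothetical, and no auth header is included because this page does not specify the exact mechanism for write endpoints.

```python
import json
from urllib.request import Request


def build_run_request(base_url: str, profile_id: str, limit: int) -> Request:
    """Build the POST /runs request with a JSON body."""
    body = json.dumps({"profile_id": profile_id, "limit": limit}).encode("utf-8")
    return Request(
        f"{base_url.rstrip('/')}/runs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_run_response(raw: bytes) -> str:
    """Extract the run_id from the 202 response body."""
    return json.loads(raw)["run_id"]
```

The request can be sent with urllib.request.urlopen, and the returned run_id kept for later queries against GET /runs.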

POST /scrape (on‑demand)

Trigger an immediate scrape for one or more sources with comprehensive filtering options.

Request Body Parameters:

  • sources (array, optional): A list of profile names (e.g., ["aljazeera", "fatabyyano"]). Defaults to all enabled profiles.
  • query (string, optional): A keyword to filter article titles and content.
  • categories (array, optional): Filter by specific categories (e.g., ["fake", "real"]).
  • limit (integer, optional): A hard limit on the number of articles to return per source.
  • write_to_disk (boolean, optional): If true, appends the output to a .jsonl file in /app/data/out.
  • send_to_api (boolean, optional): If true (default), sends scraped articles to the ingest API.

POST /scrape
Content-Type: application/json

{
  "sources": ["verify_sy"],
  "query": "سوريا",
  "categories": ["fake"],
  "limit": 10,
  "write_to_disk": true,
  "send_to_api": false
}

200 OK. The response contains the list of normalized items, or a file path if write_to_disk is true.

Response:

{
  "results": [
    {
      "id": "unique_article_id",
      "title": "Article Title",
      "content": "Article content...",
      "url": "https://source.com/article",
      "published_at": "2025-01-01T00:00:00Z",
      "source": "Source Name",
      "category": "politics",
      "language": "ar"
    }
  ],
  "metadata": {
    "total_articles": 1,
    "sources_processed": ["verify_sy"],
    "processing_time_ms": 1250
  }
}
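One way to build the request body and read the response from Python is sketched below. ScrapeRequest and titles_by_source are illustrative names, not part of the service; unset optional fields are simply omitted so that the server-side defaults apply, and the default for write_to_disk is left unset because this page does not document it.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScrapeRequest:
    sources: Optional[List[str]] = None      # omitted -> all enabled profiles
    query: Optional[str] = None
    categories: Optional[List[str]] = None
    limit: Optional[int] = None
    write_to_disk: Optional[bool] = None     # default not documented here
    send_to_api: bool = True                 # documented default

    def to_json(self) -> str:
        """Serialize, dropping fields that were not set."""
        payload = {k: v for k, v in vars(self).items() if v is not None}
        return json.dumps(payload, ensure_ascii=False)


def titles_by_source(response: dict) -> dict:
    """Group article titles from a /scrape response by their source."""
    grouped = {}
    for item in response.get("results", []):
        grouped.setdefault(item["source"], []).append(item["title"])
    return grouped
```

Passing ensure_ascii=False keeps Arabic query strings such as "سوريا" readable in logs; the body should then be encoded as UTF-8 before sending.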

Auth

Protect write endpoints with service auth (token or IP allow‑list).
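The page does not say how the token or allow-list is checked. Purely as an illustration, a check that accepts either a matching token or a client address inside an allowed network might look like this; is_authorized and its parameters are hypothetical.

```python
import hmac
import ipaddress
from typing import Iterable, Optional


def is_authorized(
    token: Optional[str],
    client_ip: str,
    expected_token: str,
    allowed_networks: Iterable[str],
) -> bool:
    """Accept a request with a valid token or an allow-listed client IP."""
    # compare_digest avoids leaking the token through timing differences.
    if token is not None and hmac.compare_digest(
        token.encode("utf-8"), expected_token.encode("utf-8")
    ):
        return True
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)
```

In a FastAPI application a check like this would typically sit in a dependency attached to the write routes (POST /runs, POST /scrape, POST /replay, POST /profiles/reload).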


Last updated: 2025-08-25 · Docs version: v0.3