# Scraper Architecture
This document provides a detailed overview of the Scraper service's internal architecture and its role within the Labeeb platform. Understanding this architecture is critical for troubleshooting the service effectively and for extending it safely.
## 1. Platform Service Responsibilities
**System-Wide Context**

The Labeeb platform is a distributed system, so a failure in one service can surface as a symptom in another. The matrix below defines the ownership and responsibility of each service and is the foundation of our incident response process.
| Service | Tech | Core Responsibility | Inputs | Outputs | Depends On |
|---|---|---|---|---|---|
| API | Laravel/PHP | Central gateway; orchestrates jobs and owns all PostgreSQL (PG) and OpenSearch (OS) writes. | Ingest batches, client requests. | API responses, jobs. | PG, Redis, OS, AI-Box. |
| AI-Box | Python/FastAPI | Hosts AI models (search, NER, etc.). | API jobs/requests. | Analysis results (JSON). | OS, API (for hydration). |
| Scraper | Python/FastAPI | Fetches & normalizes articles from external sources. | Profiles, external websites. | Ingest batches. | API (for ingestion). |
| Search | OpenSearch | Provides search capabilities. | Indexing requests, search queries. | Search results. | (None) |
| Frontend | Next.js | User interface. | User actions. | Web pages. | API. |
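As the matrix shows, the Scraper's only downstream dependency is the API's ingestion path. The sketch below illustrates that hand-off; the endpoint route, payload shape, and `send_ingest_batch` helper are illustrative assumptions, not the real contract.

```python
import httpx


def send_ingest_batch(articles: list[dict], api_base: str) -> None:
    """Forward a batch of normalized articles to the Labeeb API.

    The route and payload shape below are assumptions for illustration;
    consult the API service for the actual ingestion contract.
    """
    response = httpx.post(
        f"{api_base}/ingest/batches",  # hypothetical ingest route
        json={"articles": articles},
        timeout=30.0,
    )
    response.raise_for_status()  # surface 4xx/5xx errors to the caller
```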
## 2. Internal Architecture & Data Flow

The service is organized into four top-level packages:
```text
/scraper/app/
├── core/            # Core application logic & configuration
├── data/            # Data models, normalization, and schemas
├── scraping/        # The business logic of scraping
│   └── providers/   # All specific provider implementations
└── services/        # Clients for external services & state
```
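To make the `scraping/providers/` layout concrete, here is a minimal sketch of what the provider contract might look like. The class names, method signatures, and `feed_url` key are assumptions for illustration, not the actual interface.

```python
from abc import ABC, abstractmethod
from typing import Any


class Provider(ABC):
    """Base contract each source-specific provider would implement."""

    def __init__(self, profile: dict[str, Any]) -> None:
        self.profile = profile  # the JSON profile configuring this source

    @abstractmethod
    def fetch(self) -> list[dict[str, Any]]:
        """Retrieve raw items from the external source (RSS feed, HTML page, ...)."""

    @abstractmethod
    def normalize(self, raw: dict[str, Any]) -> dict[str, Any]:
        """Map one raw item onto the platform's article schema."""


class RSSProvider(Provider):
    """Illustrative concrete provider; the real implementations live in providers/."""

    def fetch(self) -> list[dict[str, Any]]:
        raise NotImplementedError  # e.g. pull and parse self.profile["feed_url"]

    def normalize(self, raw: dict[str, Any]) -> dict[str, Any]:
        raise NotImplementedError
```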
**Architectural Principles**

The Scraper's architecture is designed for extensibility and operational safety. The key design decisions are:

- **Decoupling configuration from code:** Scraping logic (Python `Provider` classes) is kept entirely separate from scraping targets (JSON `Profile` files). This allows operators to add or change targets without deploying new code.
- **Provider-based strategy:** Each external source type has a dedicated `Provider` class (as sketched above). This isolates the logic for handling different website layouts (e.g., RSS vs. HTML) and makes adding new sources predictable.
- **Stateful operation:** The service maintains a simple state file (`state.json`) to track the last-seen article for each profile. This prevents reprocessing duplicate data and provides a clear audit trail.
- **Dual-mode triggers:** The service can run on a time-based schedule (`APScheduler`) for routine collection or be triggered on demand via a REST API for manual overrides and testing, as sketched after this list.
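The dual-mode wiring can be illustrated with a short sketch. Only the use of FastAPI and APScheduler comes from this document; the `run_all_profiles` entry point, the 30-minute interval, and the route shape are assumptions.

```python
from contextlib import asynccontextmanager

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from fastapi import FastAPI

scheduler = AsyncIOScheduler()


async def run_all_profiles() -> None:
    """Placeholder for the engine entry point shared by both trigger modes."""


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Routine collection: fire the shared entry point on a fixed interval.
    scheduler.add_job(run_all_profiles, "interval", minutes=30, id="routine-scrape")
    scheduler.start()
    yield
    scheduler.shutdown()


app = FastAPI(lifespan=lifespan)


@app.post("/scrape")
async def scrape_on_demand() -> dict:
    # Manual override: run the same entry point immediately.
    await run_all_profiles()
    return {"status": "completed"}
```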
### Data Flow Diagram (DFD)
This diagram illustrates the primary data flow for an on-demand scrape triggered via the API.
```mermaid
flowchart TD
    subgraph "User / Operator"
        U[Client]:::ext
    end
    subgraph "Scraper Service"
        S[FastAPI Server]:::svc
        P[Profile Loader]:::svc
        E[Scraper Engine]:::svc
        R[Provider Factory]:::svc
        F["Profiles/*.json"]:::store
    end
    subgraph "Downstream"
        API[(Labeeb API)]:::ext
        W[("data/out/*.jsonl")]:::store
    end
    U -- "POST /scrape" --> S
    S --> P --> F
    S -- "Builds Job" --> E
    E --> R
    P -- "Provides Profile" --> R
    R -- "Selects & Runs" --> Provider(Provider Instance)
    Provider -- "Fetches Articles" --> A(Normalized Articles)
    A --> E
    E -- "Returns Response" --> S
    S -- "Optionally Sends" --> API
    S -- "Optionally Writes" --> W
    S -- "Job Response" --> U
```