Labeeb Runbook (On-Call)
Single page for on-call engineers to restore ${SERVICE} with clarity, accuracy, and safety.
How to use this page
- Read top-to-bottom on first response.
- Use the checklists; copy/paste commands as-is.
- Escalation paths are at the bottom.
At-a-Glance
- Environments: [PROD_BADGE] [STAGE_BADGE] [DEV_BADGE]
- Services: api · ai-box · scraper · search
- Quick Links: Dashboards · Alerts · Runbooks
| Signal | Owner | Probe |
|---|---|---|
| Latency | ${OWNER_SRE} | `curl -w "%{time_total}\n" -o /dev/null -s ${API_URL}/health` |
| Errors | ${OWNER_SRE} | `docker compose logs --tail=50` |
| Saturation | ${OWNER_SRE} | `docker stats --no-stream` |
| Traffic | ${OWNER_SRE} | `curl -s ${API_URL}/metrics \| jq '.http_requests'` |
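The latency probe in the table prints the total request time in seconds. A minimal sketch of turning that number into a pass/fail check follows; the helper name `latency_exceeds` and the 0.5 s threshold are illustrative assumptions, not documented SLOs.

```shell
# Hypothetical helper: succeeds (exit 0) when a measured latency exceeds a threshold.
# Both arguments are seconds, in the format printed by curl's %{time_total}.
latency_exceeds() {
  awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Example usage (0.5 is an assumed threshold):
# t=$(curl -w "%{time_total}\n" -o /dev/null -s "${API_URL}/health")
# if latency_exceeds "$t" 0.5; then echo "SLOW: ${t}s"; else echo "OK: ${t}s"; fi
```

Floating-point comparison is delegated to `awk` because POSIX shell arithmetic is integer-only.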
On-call Overview
- First checks → health endpoints + traces
- Then → go to the specific runbook (API/AI-Box/Scraper/Search)
- Finally → consult the incident playbooks
curl -s ${API_URL}/health | jq
curl -s ${SCRAPER_URL}/health | jq
curl -s ${AI_BOX_URL}/health | jq
curl -s ${OS_URL}/_cluster/health | jq '.status'
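The four health probes above can be wrapped so that a hung or failing service yields a one-line verdict instead of a stalled terminal. This is a sketch under assumptions: the function name `check_health` and the 5-second timeout are hypothetical choices, not part of the service tooling.

```shell
# Hypothetical wrapper: probe one health URL and print a one-line verdict.
# --fail makes curl exit non-zero on HTTP errors (>= 400);
# --max-time bounds the whole request so a hung service cannot block triage.
check_health() {
  name="$1"
  url="$2"
  if curl -s --fail --max-time 5 -o /dev/null "$url"; then
    echo "OK   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example usage:
# check_health api     "${API_URL}/health"
# check_health scraper "${SCRAPER_URL}/health"
# check_health ai-box  "${AI_BOX_URL}/health"
# check_health search  "${OS_URL}/_cluster/health"
```

Note that an HTTP 200 from `_cluster/health` only shows that OpenSearch answered; the `status` field still has to be inspected separately.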
See also: Operational Contracts
First 5 Minutes: Universal Checklist
- Confirm incident scope (user-facing? ingestion-only? search-only?).
- Check health endpoints across services.
- Tail logs with filters.
- Capture context by recording the incident ticket and dashboard links:
  - ${TICKET_URL}
  - ${DASHBOARD_URL}
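For the "tail logs with filters" step, one possible filter is sketched below. The pattern is an assumption about the log format (plain-text lines containing a level keyword or an HTTP status code surrounded by spaces); adjust it to what the services actually emit.

```shell
# Hypothetical filter: keep only lines that look like errors or 5xx responses.
# Reads log lines on stdin and writes the matching lines to stdout.
filter_errors() {
  grep -E -i 'error|exception|timeout| 5[0-9]{2} '
}

# Example usage: last 15 minutes of api logs, errors only.
# docker compose logs --since 15m --no-color api | filter_errors
```

`--no-color` keeps ANSI escape codes out of the stream so the pattern matches reliably.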
# Start all services
docker compose up -d
# Stop all services
docker compose down
# Tail logs for a service
docker compose logs -f <service>
# Health check
curl -s "${API_URL}/health"
# Authenticated request
curl -s -X POST -H "Authorization: Bearer ${INGEST_TOKEN}" -H "Content-Type: application/json" -d @payload.json "${API_URL}/v1/ingest/articles"
Service Status & Health Commands
api
- Health: GET ${API_URL}/health → JSON {"status":"ok"}
- Depends on: PostgreSQL, Redis, OpenSearch
ai-box
- Health: GET ${AI_BOX_URL}/health → JSON {"status":"ok"}
- Depends on: OpenSearch, model cache
scraper
- Health: GET ${SCRAPER_URL}/health → JSON {"status":"ok"}
- Depends on: api
search
- Health: GET ${OS_URL}/_cluster/health → field `status`
- Depends on: disk, JVM heap
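Unlike the other services, search reports a three-valued `status` (green, yellow, red) rather than `ok`. The sketch below maps that value to an exit code for scripting; the function name and the exit-code convention are assumptions made for illustration.

```shell
# Hypothetical helper: map an OpenSearch cluster status to a message and exit code.
#   green  -> 0 (all shards assigned)
#   yellow -> 1 (all primaries assigned, some replicas not)
#   red    -> 2 (at least one primary shard unassigned)
classify_os_status() {
  case "$1" in
    green)  echo "healthy"; return 0 ;;
    yellow) echo "degraded: replica shards unassigned"; return 1 ;;
    red)    echo "critical: primary shards unassigned"; return 2 ;;
    *)      echo "unknown status: '$1'"; return 2 ;;
  esac
}

# Example usage:
# classify_os_status "$(curl -s "${OS_URL}/_cluster/health" | jq -r '.status')"
```

An empty or unexpected value is treated like red, so a failed probe is never mistaken for a healthy cluster.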
Common Runbooks (Pointers)
- services/api/runbook.md — for API 5xx or GraphQL failures.
- services/ai-box/runbook.md — for S1/S2 or retrieval errors.
- services/scraper/runbook.md — for ingestion stalls or 429s.
- services/search/troubleshooting.md — for query latency or index issues.
Safety Rails
Do not
- Purge indices in production without a recent snapshot.
- Reindex with stale mappings.
- Roll ai-box without a warm model cache.
Pre-flight checks
- Verify snapshots.
- Verify `.env` and compose overrides for the target environment.
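One way to make "verify snapshots" concrete is to check that the newest successful snapshot is recent enough. The sketch below is an assumption-laden example: `${SNAPSHOT_REPO}` is a hypothetical repository name, the 24-hour window is an assumed policy, and the commented query relies on the OpenSearch snapshot API returning `state` and `end_time_in_millis` per snapshot.

```shell
# Hypothetical helper: succeeds when a snapshot that ended at END_MS
# (epoch milliseconds) is no older than MAX_HOURS relative to NOW_S (epoch seconds).
snapshot_is_fresh() {
  end_ms="$1"
  now_s="$2"
  max_hours="$3"
  # Reject empty or non-numeric input (for example "null" when no snapshot exists).
  case "$end_ms" in
    ''|*[!0-9]*) return 1 ;;
  esac
  age_s=$(( now_s - end_ms / 1000 ))
  [ "$age_s" -ge 0 ] && [ "$age_s" -le $(( max_hours * 3600 )) ]
}

# Example usage (SNAPSHOT_REPO is an assumed variable, not defined on this page):
# end_ms=$(curl -s "${OS_URL}/_snapshot/${SNAPSHOT_REPO}/_all" \
#   | jq '[.snapshots[] | select(.state == "SUCCESS") | .end_time_in_millis] | max')
# snapshot_is_fresh "$end_ms" "$(date +%s)" 24 || echo "No successful snapshot in the last 24h"
```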
Quick Probes (Copy/Paste)
curl -s -o /dev/null -w "%{time_total}\n" ${API_URL}/health
curl -s "${API_URL}/v1/search?q=test" | jq '.hits | length'
curl -s ${API_URL}/metrics | jq '.http_requests'
curl -s ${AI_BOX_URL}/health | jq
curl -s ${AI_BOX_URL}/s1/check-worthiness -d '{"text":"test"}' | jq '.score'
curl -s ${AI_BOX_URL}/metrics | jq '.inflight'
curl -s ${SCRAPER_URL}/health | jq
curl -s ${SCRAPER_URL}/profiles | jq '. | length'
docker compose logs scraper -n 20
curl -s ${OS_URL}/_cluster/health | jq '.status'
curl -s ${OS_URL}/${INDEX}/_count | jq '.count'
curl -s -H 'Content-Type: application/json' ${OS_URL}/${INDEX}/_search -d '{"query":{"match_all":{}}}' | jq '.hits.total'
Triage Matrix
| Symptom | Likely Root Cause | Probe | Next Action |
|---|---|---|---|
| Search slow | OpenSearch overload | `curl -s "${OS_URL}/_nodes/stats/jvm?pretty"` | Scale nodes or clear heavy queries |
| 429 from sources | Upstream rate limit | `docker compose logs scraper \| grep 429` | Back off and adjust the profile schedule |
| Reranker timeout | AI-Box saturation | `curl -s ${AI_BOX_URL}/metrics \| jq '.latency_reranker'` | Restart ai-box; check model cache |
| Auth timeouts | API or DB latency | `curl -s ${API_URL}/health \| jq` | Restart API or database |
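For the "429 from sources" row, a plain `grep 429` also matches unrelated numbers that happen to contain those digits. The sketch below counts only lines where 429 appears as a standalone number; the function name is hypothetical and the log format is assumed to be plain text.

```shell
# Hypothetical helper: count log lines on stdin that contain 429 as a standalone number.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the exit status clean.
count_429() {
  grep -c -E '(^|[^0-9])429([^0-9]|$)' || true
}

# Example usage: rate-limit hits in the last 10 minutes of scraper logs.
# docker compose logs --since 10m --no-color scraper | count_429
```

A rising count across repeated runs suggests the upstream limit is still being hit and the schedule needs a longer backoff.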
Minimal Architecture (Orientation)
flowchart TD
FE[FE] --> API
API --> Queue
Queue --> AIB[AI-Box]
AIB --> OS[OpenSearch]
SCR[Scraper] --> API
Escalation
| SEV | Description | Response SLO |
|---|---|---|
| SEV-1 | Full outage | ≤5 min |
| SEV-2 | Major user impact | ≤15 min |
| SEV-3 | Partial degradation | ≤1 hr |
| SEV-4 | Minor issue | Next business day |
- Contacts: ${PRIMARY_ONCALL}, ${BACKUP_ONCALL}, ${ENG_MANAGER}
- Handover checklist:
  - Update incident channel.
  - Link dashboards and logs.
  - Transfer open actions.
Appendices
- services/api/runbook.md
- services/ai-box/runbook.md
- services/scraper/runbook.md
- services/search/troubleshooting.md
- architecture/platform.md
| Placeholder | Description |
|---|---|
| ${API_URL} | Base URL for API service |
| ${SCRAPER_URL} | Base URL for Scraper service |
| ${AI_BOX_URL} | Base URL for AI-Box service |
| ${OS_URL} | Base URL for OpenSearch cluster |
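The commands on this page expand these placeholders as shell variables, so an unset one silently produces a broken URL. A minimal sketch of a guard to run before the probes follows; the function name `require_vars` is a hypothetical choice, and the variable names are the ones used in the commands on this page.

```shell
# Hypothetical guard: report every named variable that is unset or empty.
# Returns 1 if any are missing, 0 otherwise.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
# require_vars API_URL SCRAPER_URL AI_BOX_URL OS_URL || echo "Export the variables above first"
```

Pass only literal variable names to this function, since the lookup uses `eval`.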
| Variable | Description |
| ---------- | ------------- |
| EXAMPLE_VAR | Example description |
Note
Follow Conventional Commits.
Warning
Commands assume execution from the repository root unless noted.
Last updated: ${DATE} · Version: ${DOCS_VERSION}