---
title: "Runbook: Job Queue Backed Up"
description: A playbook for diagnosing and resolving a stalled or backlogged Laravel Horizon queue in the API service.
icon: material/playlist-remove
---
# Runbook: Job Queue Backed Up

**Impact: High - Stale Data & Processing Delays**
This alert fires when the background job queue is not being processed, or when jobs are failing at a high rate. This will cause a significant delay in data processing, including content analysis from the AI-Box and indexing to OpenSearch. New data will not appear in the platform until the queue is cleared.
## Triage Checklist (5 Minutes)
Your immediate goal is to determine the status of the Horizon queue workers and identify why jobs are not being processed.
1.  **Check Horizon Status:** The primary command to check the health of the queue system is `horizon:status`.

    - **Healthy:** The output shows `Horizon is running.` and the `processes` count is greater than zero for your queues.
    - **Unhealthy:** The output shows `Horizon is inactive.` or the process count is zero.
2.  **Check for Failed Jobs:** List any jobs that have recently failed. This is the fastest way to see whether a specific job type is causing the backlog.
3.  **Check Redis Connectivity:** Horizon uses Redis as its message broker. Verify that the API container can connect to the Redis container.
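The three triage steps above can be run as one-liners. This is a sketch assuming a docker compose setup with services named `api` (the Laravel app) and `redis`; both names are assumptions, so adjust them to your environment.

```shell
# Assumed docker compose service names: "api" (Laravel) and "redis".

# 1. Horizon status -- expect "Horizon is running."
docker compose exec api php artisan horizon:status

# 2. Recently failed jobs -- stock Laravel lists these via queue:failed;
#    this runbook also refers to the list as `horizon:failed`.
docker compose exec api php artisan queue:failed

# 3. Redis reachability -- expect "PONG"
docker compose exec redis redis-cli ping
```

Step 3 confirms Redis itself is up; if you need to verify connectivity *from inside* the API container, running `php artisan tinker` there and calling `Redis::connection()->ping()` is one way to do it.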
## Remediation Playbooks
Based on your triage, select the appropriate playbook to resolve the issue.
**Symptom:** The `horizon:status` command shows `Horizon is inactive.`
1.  **Start Horizon:** The Horizon process has stopped and needs to be restarted. In a production environment with a process manager like `supervisor`, this would happen automatically. In the local Docker environment, you may need to restart the container.

2.  **Verify Status:** After the container restarts, check the status again to ensure the workers are running.
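Both steps as commands, assuming the Horizon workers run in a compose service named `api` (a hypothetical name; substitute your own):

```shell
# 1. Restart the container running Horizon (assumed service name: "api")
docker compose restart api

# 2. Confirm the workers came back up -- expect "Horizon is running."
docker compose exec api php artisan horizon:status
```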
**Symptom:** Horizon is active, but the failed jobs list (`horizon:failed`) is growing. This indicates a bug in a specific job handler or a problem with a downstream service (like the AI-Box).
1.  **Inspect a Failed Job:** Get the details of a specific failed job, including its exception and stack trace. Replace `[FAILED_JOB_ID]` with an ID from the `horizon:failed` list.

2.  **Analyze the Exception:** The stack trace will point to the root cause. Common causes include:

    - A bug in the job's `handle()` method.
    - The job is unable to connect to a downstream service (e.g., the AI-Box is down).
    - Unexpected data is causing the job to fail validation.
3.  **Clear Failed Jobs (After Fixing):** Once you have identified and fixed the underlying cause, clear the failed jobs from the queue. Do not do this until the root cause is resolved, or the jobs will simply fail again.
4.  **Restart Horizon Workers:** To ensure the workers pick up any new code, it's best practice to restart them after a deployment or fix.
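The steps above map onto artisan commands roughly as follows. The compose service name `api` and the `[FAILED_JOB_ID]` placeholder are assumptions, and exact command availability can vary by Horizon version:

```shell
# Assumed service name "api"; replace [FAILED_JOB_ID] with an ID
# taken from the failed-jobs list.

# 1. Inspect one failed job (exception and stack trace); the Horizon
#    dashboard's "Failed" tab shows the same details.
docker compose exec api php artisan horizon:failed [FAILED_JOB_ID]

# 3. After the fix: forget a single failed job...
docker compose exec api php artisan horizon:forget [FAILED_JOB_ID]
#    ...or clear ALL failed jobs at once.
docker compose exec api php artisan queue:flush

# 4. Gracefully terminate the workers so the process manager (or
#    container) restarts them with the new code loaded.
docker compose exec api php artisan horizon:terminate
```

`horizon:terminate` lets in-flight jobs finish before the workers exit, which is why it is preferred over killing the container after a deploy.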
## Post-Incident Actions
- **Root Cause Analysis:** If a job is failing due to a bug, create a ticket to fix it permanently.
- **Improve Job Resilience:** Can the failing job be made more resilient? Should it retry more or less often? Can it handle the unexpected data gracefully?
- **Enhance Monitoring:** Add specific monitoring and alerting for high-priority queue lengths and failed job counts. A high number of failed jobs should trigger a PagerDuty alert.
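Until automated alerting exists, the same metrics can be sampled by hand. This sketch assumes the default Redis queue driver, which keeps pending jobs in a list keyed `queues:<name>`, and the service names used earlier:

```shell
# Pending jobs on the default queue (adjust the queue name to match
# your config/queue.php; assumed Redis service name: "redis").
docker compose exec redis redis-cli llen queues:default

# Rough failed-job count (table header lines inflate it slightly).
docker compose exec api php artisan queue:failed | wc -l
```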