---
title: "Runbook: Job Queue Backed Up"
description: A playbook for diagnosing and resolving a stalled or backlogged Laravel Horizon queue in the API service.
icon: material/playlist-remove
---


Runbook: Job Queue Backed Up

Impact: High - Stale Data & Processing Delays

This alert fires when the background job queue is not being processed, or when jobs are failing at a high rate. This will cause a significant delay in data processing, including content analysis from the AI-Box and indexing to OpenSearch. New data will not appear in the platform until the queue is cleared.
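
To gauge how far behind the system is, check the pending job count for a queue before digging further. The sketch below assumes the default Redis queue connection and a queue named default; substitute the queue names this deployment actually uses.

    # Print the number of pending jobs on the "default" queue (queue name is an assumption)
    docker compose exec api php artisan tinker --execute="echo Queue::size('default');"

A steadily growing count confirms that jobs are being dispatched but not consumed.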

Triage Checklist (5 Minutes)

Your immediate goal is to determine the status of the Horizon queue workers and identify why jobs are not being processed.

  1. Check Horizon Status: The primary command to check the health of the queue system is horizon:status.

    docker compose exec api php artisan horizon:status
    

    • Healthy: The output shows Horizon is running.
    • Unhealthy: The output shows Horizon is inactive. (no master process) or Horizon is paused. (workers exist but are not picking up jobs). Per-supervisor worker process counts are visible in the Horizon dashboard at /horizon.
  2. Check for Failed Jobs: List any jobs that have recently failed. This is the fastest way to see if a specific job type is causing the backlog.

    docker compose exec api php artisan queue:failed
    

  3. Check Redis Connectivity: Horizon uses Redis as its message broker. Verify that the API container can connect to the Redis container (two further diagnostics are sketched just after this checklist).

    # 1. Open a shell in the API container
    docker compose exec api bash
    
    # 2. Ping the Redis service
    redis-cli -h redis ping
    # Expected output: PONG
    

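If the checklist above does not reveal the cause, two further checks are worth a look. The service names (api, redis) are the same ones used by the docker compose commands elsewhere in this runbook; adjust them if the compose file differs.

    # Recent output from the API container, where Horizon logs exceptions and restarts
    docker compose logs --tail=100 api
    
    # Ask Redis directly whether it is up and accepting connections
    docker compose exec redis redis-cli ping

Horizon startup failures (bad Redis credentials, out-of-memory kills) usually show up in the container logs before they show up anywhere else.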

Remediation Playbooks

Based on your triage, select the appropriate playbook to resolve the issue.

Symptom: The horizon:status command shows Horizon is inactive.

  1. Start Horizon: The Horizon process has stopped and needs to be restarted. In a production environment with a process manager such as supervisor, this happens automatically (an example of checking it is sketched after this playbook). In the local Docker environment, you may need to restart the container.

    # A simple restart will usually bring Horizon back up
    docker compose restart api
    

  2. Verify Status: After the container restarts, check the status again to ensure the workers are running.

    docker compose exec api php artisan horizon:status
    
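
The restart in step 1 covers the local Docker environment. On a production host where Horizon runs under supervisor, the equivalent checks are sketched below; the program name horizon and the config path are assumptions based on a typical Laravel deployment layout, so confirm them against the host's actual supervisor configuration.

    # Show the state of the Horizon program as supervisor sees it (program name assumed)
    supervisorctl status horizon
    
    # Restart it if supervisor reports it as STOPPED or FATAL
    supervisorctl restart horizon
    
    # The program definition normally lives under /etc/supervisor/conf.d/ (path assumed)
    cat /etc/supervisor/conf.d/horizon.conf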

Symptom: Horizon is active, but the failed-jobs list (queue:failed) keeps growing. This indicates a bug in a specific job handler or a problem with a downstream service (such as the AI-Box).

  1. Inspect a Failed Job: Get the details of a specific failed job, including its exception and stack trace. The quickest view is the Failed Jobs tab of the Horizon dashboard (served at /horizon by default); from the CLI, you can pull the stored exception out of the failed_jobs table. Replace [FAILED_JOB_ID] with a UUID from the queue:failed list.

    # Print the stored exception for one failed job (assumes the default database-backed failed-job driver)
    docker compose exec api php artisan tinker --execute="echo DB::table('failed_jobs')->where('uuid', '[FAILED_JOB_ID]')->value('exception');"
    

  2. Analyze the Exception: The stack trace will point to the root cause. Common causes include:

    • A bug in the job's handle() method.
    • The job cannot connect to a downstream service (e.g., the AI-Box is down); a quick reachability check is sketched after this playbook.
    • Unexpected data is causing the job to fail validation.
  3. Clear Failed Jobs (After Fixing): Once you have identified and fixed the underlying cause, clear the failed jobs from the queue, or retry them if the work still needs to run (see the note after this playbook). Do not do this until the root cause is resolved, or the jobs will simply fail again.

    # To forget a single failed job
    docker compose exec api php artisan horizon:forget [FAILED_JOB_ID]
    
    # To forget ALL failed jobs (clears the failed_jobs table)
    docker compose exec api php artisan queue:flush
    

  4. Restart Horizon Workers: To ensure the workers pick up any new code, it's best practice to restart them after a deployment or fix.

    docker compose exec api php artisan horizon:terminate
    # In production, the process manager (supervisor) restarts the workers automatically.
    # In the local Docker environment, restart the api container if Horizon does not come back up.
    
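
When the exceptions point at a downstream dependency, confirm that the dependency is actually reachable before changing any job code. The service name ai-box and the /health path below are assumptions, as is the presence of curl in the api container; substitute the real service name from docker-compose.yml and whatever health endpoint the service exposes.

    # Is the AI-Box container running at all? (service name is an assumption)
    docker compose ps ai-box
    
    # Can the API container reach it over the compose network? (path is an assumption)
    docker compose exec api curl -sf http://ai-box/health || echo "AI-Box unreachable"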

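If the failed jobs represent real work that still needs to happen (for example, documents that were never indexed), retry them instead of forgetting them once the fix is deployed. queue:retry pushes a failed job back onto its original queue.

    # Retry one failed job by UUID, or every failed job with "all"
    docker compose exec api php artisan queue:retry [FAILED_JOB_ID]
    docker compose exec api php artisan queue:retry all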

Post-Incident Actions

  • Root Cause Analysis: If a job is failing due to a bug, create a ticket to fix it permanently.
  • Improve Job Resilience: Can the failing job be made more resilient? Should it retry more or less often? Can it handle the unexpected data gracefully?
  • Enhance Monitoring: Add specific monitoring and alerting for high-priority queue lengths and failed-job counts; a high number of failed jobs should trigger a PagerDuty alert. A starting point is sketched below.
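
As a starting point for queue-length alerting, Laravel's queue:monitor command reports the size of each named queue and fires a QueueBusy event when a queue exceeds the given threshold. The queue name and threshold below are assumptions to adapt to this deployment.

    # Report queue sizes and flag any queue holding more than 1000 pending jobs (name and threshold are assumptions)
    docker compose exec api php artisan queue:monitor redis:default --max=1000

Run on a schedule (for example via the Laravel scheduler or an external monitor), this gives an early warning before the backlog is large enough to page on.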