title: Runbook: ML Model Loading Failure
description: Diagnose and resolve failures when a model fails to load in AI-Box.
icon: material/cpu-64-bit


Model Loading Failure

Impact: Critical — service start failure

Failure to load an optional model (e.g., S1 when S1_BACKEND=hf, or a reranker) can prevent the service from starting, or cause a crash the first time the model is used.

Triage (≤5 minutes)

  1. Inspect container logs

    docker compose logs --tail=200 ai-box
    
    Look for Python tracebacks mentioning model IDs/paths.

  2. Identify the failing component

    • S1 (AIB-15): controlled by ENABLE_AIB_15, S1_BACKEND, S1_MODEL_ID
    • Reranker: controlled by ENABLE_RERANK, RERANK_MODEL_ID

  3. Check env/config

    • Confirm the model paths/IDs, and that heavy dependencies are present when the selected backend needs them.
    • The default image may not include transformers/torch; using S1_BACKEND=hf without them will fail by design.
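
The triage steps above can be sketched as two small shell filters. The function names `find_model_errors` and `filter_model_env` are hypothetical, and the commented usage assumes that `env` and `python` are available inside the ai-box container, which this runbook does not confirm.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep only log lines that usually accompany a
# Python failure (traceback headers, exception lines, missing modules).
find_model_errors() {
  grep -E 'Traceback|Error:|No module named' || true
}

# Hypothetical helper: keep only the model-related variables named in
# this runbook from `env`-style output on stdin.
filter_model_env() {
  grep -E '^(ENABLE_AIB_15|S1_BACKEND|S1_MODEL_ID|ENABLE_RERANK|RERANK_MODEL_ID)=' || true
}

# Usage against the running service (requires Docker):
#   docker compose logs --tail=200 ai-box 2>&1 | find_model_errors
#   docker compose exec ai-box env | filter_model_env
#   # Assumption: `python` is on PATH in the image.
#   docker compose exec ai-box python -c 'import transformers, torch'
```

If the last command fails with an import error while S1_BACKEND=hf is set, the image is likely missing the heavy dependencies rather than the model itself.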

Remediation

  • Verify S1_MODEL_ID or RERANK_MODEL_ID is correct.
  • If mounting local models, confirm that the volume is declared for the service:
    services:
      ai-box:
        volumes:
          - ./models:/models
    
  • Rebuild/restart:
    docker compose up -d --build ai-box
    
  • If a cached download may be incomplete or corrupted, clear the Hugging Face cache inside the container and restart:
    docker compose exec ai-box bash -lc 'rm -rf ~/.cache/huggingface/*'
    docker compose restart ai-box
    
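  • If loading fails for lack of memory, increase the container memory limit or pick a smaller/quantized model.
  • As a temporary workaround, disable the failing optional component:
    • S1: ENABLE_AIB_15=false
    • Reranker: ENABLE_RERANK=false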
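
Before rebuilding, it can help to sanity-check a locally mounted model directory on the host. The following is a minimal sketch: `check_model_dir` is a hypothetical helper, and it assumes a transformers-style layout with `config.json` at the top level of the model directory. The model path in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash

# Hypothetical helper: verify that a local model directory exists and
# contains a config.json (assumed transformers-style layout).
check_model_dir() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing directory: $dir"
    return 1
  fi
  if [ ! -f "$dir/config.json" ]; then
    echo "no config.json in: $dir"
    return 1
  fi
  echo "ok: $dir"
}

# Usage (the path is a placeholder):
#   check_model_dir ./models/<model-name>
#
# To see whether the container was killed by the kernel OOM killer
# (requires Docker; assumes a single ai-box container):
#   docker inspect --format '{{.State.OOMKilled}}' "$(docker compose ps -aq ai-box)"
```

An `OOMKilled` value of `true` points to the memory limit rather than to the model ID or path.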

Post-incident

  • Add model load status to the /health endpoint, using a lazy probe with a cached result.
  • Document RAM/CPU needs per model in Requirements.
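
The "lazy probe with cache" idea could look roughly like the sketch below. The real probe would presumably live in the service's /health handler; this shell version only illustrates the caching behaviour from the outside. `cached_probe`, the cache file path, and the 30-second TTL are all hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical sketch: run a probe command at most once per TTL window
# and reuse the cached result ("ok" or "fail") in between.
# Arguments: cache file, TTL in seconds, then the probe command.
cached_probe() {
  local cache_file="$1" ttl="$2"
  shift 2
  local now ts status result
  now=$(date +%s)
  if [ -f "$cache_file" ]; then
    read -r ts status < "$cache_file"
    if [ $((now - ts)) -lt "$ttl" ]; then
      echo "$status"   # cache is still fresh, skip the probe
      return 0
    fi
  fi
  if "$@" >/dev/null 2>&1; then result=ok; else result=fail; fi
  echo "$now $result" > "$cache_file"
  echo "$result"
}

# Usage (requires Docker; assumes `python` exists in the image):
#   cached_probe /tmp/aibox_model_probe 30 \
#     docker compose exec -T ai-box python -c 'import torch'
```

Nothing runs until the first call, and repeated health checks within the TTL do not trigger another expensive probe.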