Troubleshooting Guide

Common Issues and Solutions

Service Startup Issues

Issue: "Connection refused" when accessing API

Symptoms:

Error: Connection refused
curl: (7) Failed to connect to localhost port 8125

Solutions:

Check if service is running:

bash

docker-compose ps
# or
ps aux | grep "[u]vicorn"  # the [u] keeps grep from matching its own process

Check port availability:

bash

netstat -tuln | grep 8125
lsof -i :8125

Start the service:

bash

docker-compose up -d api-server
# or
python -m app.main
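
After starting the service, the API may take a moment before it accepts connections. The sketch below is a generic retry helper for waiting on it; `retry` is a hypothetical name introduced here and is not part of the project, and the attempt count and delay are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
  local attempts="$1"
  local delay="$2"
  local i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (run manually): poll the health endpoint up to 15 times, 2 s apart.
# retry 15 2 curl -sf http://localhost:8125/health
```

If the helper returns non-zero, the service never answered; check the container or process logs next.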

Issue: "Redis connection failed"

Symptoms:

Error: ConnectionError: Error 111 connecting to localhost:6379

Solutions:

Verify Redis is running:

bash

redis-cli ping
# Should return: PONG

Check Redis configuration:

bash

redis-cli CONFIG GET maxmemory
redis-cli INFO memory

Restart Redis:

bash

sudo systemctl restart redis-server
# or
docker-compose restart redis

Audio Processing Issues

Issue: "Unsupported audio format"

Symptoms:

json

{
  "status": "error",
  "error": "Unsupported audio format"
}

Solutions:

Check supported formats:

Supported: WAV, MP3, FLAC, M4A, OGG

Convert audio to WAV:

bash

ffmpeg -i input.mp3 -acodec pcm_s16le -ar 16000 output.wav

Check sample rate:

bash

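# Sample rate should be between 8000 and 48000 Hz
ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of default=noprint_wrappers=1:nokey=1 audio.wav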
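
To script the sample-rate check, a small range test can be used. This is a sketch only: `is_supported_sample_rate` is a hypothetical helper, and treating both 8000 and 48000 Hz as allowed (inclusive bounds) is an assumption about how the service validates input.

```shell
# Returns 0 if RATE (in Hz) falls within the documented 8000-48000 range.
is_supported_sample_rate() {
  local rate="$1"
  # Reject empty or non-numeric input (for example, when ffprobe printed nothing).
  case "$rate" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ "$rate" -ge 8000 ] && [ "$rate" -le 48000 ]
}

# Example (run manually; ffprobe ships with ffmpeg):
# rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate \
#   -of default=noprint_wrappers=1:nokey=1 audio.wav)
# is_supported_sample_rate "$rate" || echo "Resample needed: ${rate} Hz"
```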

Issue: "Audio file too large"

Symptoms:

json

{
  "status": "error",
  "error": "File exceeds maximum size"
}

Solutions:

Check file size:

bash

ls -lh audio.wav

Compress audio:

bash

# WAV output is uncompressed PCM and ignores -b:a, so encode to a compressed format instead
ffmpeg -i input.wav -b:a 128k output.mp3

Adjust max size in configuration:

bash

MAX_AUDIO_SIZE_MB=1000  # Increase limit
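
Before uploading, a file can be checked against the configured limit locally. The helper below is a minimal sketch: `audio_size_ok` is a hypothetical name, and counting 1 MB as 1024 * 1024 bytes is an assumption, since the way the service interprets MAX_AUDIO_SIZE_MB is not documented here.

```shell
# audio_size_ok FILE LIMIT_MB
# Returns 0 if FILE is at most LIMIT_MB in size, 1 if larger, 2 if FILE is missing.
audio_size_ok() {
  local file="$1"
  local limit_mb="$2"
  local size_bytes
  [ -f "$file" ] || return 2
  # The arithmetic expansion strips the padding some wc implementations add.
  size_bytes=$(($(wc -c < "$file")))
  [ "$size_bytes" -le $((limit_mb * 1024 * 1024)) ]
}

# Example (run manually), using the limit value shown above:
# audio_size_ok audio.wav 1000 || echo "audio.wav exceeds the limit or is missing"
```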

Issue: "Audio processing timeout"

Symptoms:

Task timeout after 300 seconds

Solutions:

Increase request timeout:

bash

REQUEST_TIMEOUT=600  # 10 minutes

Check processing queue:

bash

curl http://localhost:8125/audio/queue/status

Scale up workers:

bash

docker-compose up -d --scale celery-worker=4

Model Loading Issues

Issue: "CUDA out of memory"

Symptoms:

CUDA out of memory. Tried to allocate X.XX GiB

Solutions:

Check GPU memory:

bash

nvidia-smi

Reduce batch size:

bash

BATCH_SIZE=1  # Process one at a time

Enable model quantization:

bash

QUANTIZE_MODELS=true
QUANTIZATION_TYPE="int8"

Use smaller model variant:

bash

# For real-time
REALTIME_WHISPER_MODEL=base

# For batch
POSTCALL_WHISPER_MODEL=small

Issue: "Model not found"

Symptoms:

FileNotFoundError: Model not found at path

Solutions:

Verify model path exists:

bash

ls -la models/

Download models:

bash

python scripts/download_models.py

Check model configuration:

bash

curl http://localhost:8125/health/models

Issue: "Insufficient memory for model loading"

Symptoms:

MemoryError: Unable to load model

Solutions:

Check system memory:

bash

free -h

Clear cache:

bash

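# Warning: FLUSHALL deletes every key in Redis, including any queued tasks and cached results
redis-cli FLUSHALL
# Warning: removes all unused images, stopped containers, and networks (frees disk space, not RAM)
docker system prune -a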

Reduce number of cached models:

bash

MODEL_CACHE_SIZE=2  # Only keep 2 models in memory

Database Issues

Issue: "Database connection failed"

Symptoms:

Error: (psycopg2.OperationalError) could not connect to server

Solutions:

Check database server:

bash

docker-compose logs postgres
# or
sudo systemctl status postgresql

Verify connection string:

bash

# Check .env file
grep DATABASE_URL .env
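
A connection string often embeds the database password, so printing it into a terminal log or a bug report can leak credentials. The filter below is a sketch that masks the password; `mask_db_password` is a hypothetical helper, and it assumes the URL has the form scheme://user:password@host/... with no unencoded "@" inside the password.

```shell
# Reads text on stdin and replaces the password in scheme://user:password@host
# URLs with ****. Lines without such a URL pass through unchanged.
mask_db_password() {
  sed -E 's#(://[^:/@]+):[^@]*@#\1:****@#'
}

# Example (run manually):
# grep DATABASE_URL .env | mask_db_password
```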

Test connection:

bash

psql -h localhost -U user -d ai_service -c "SELECT 1"

Issue: "Disk space full"

Symptoms:

Error: No space left on device

Solutions:

Check disk usage:

bash

df -h
du -sh /path/to/data

Clean up old logs:

bash

# Deletes regular files under logs/ last modified more than 30 days ago
find logs/ -type f -mtime +30 -delete
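
Because the deletion cannot be undone, it can help to preview what would be removed first. A minimal sketch, where `list_old_logs` is a hypothetical helper and the logs/ directory and 30-day threshold are taken from the command above:

```shell
# list_old_logs DIR DAYS
# Prints regular files under DIR last modified more than DAYS days ago,
# without deleting anything.
list_old_logs() {
  find "$1" -type f -mtime +"$2" -print
}

# Example (run manually):
# list_old_logs logs/ 30
```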

Clean up temporary files:

bash

rm -rf temp/*
docker system prune -a

API Response Issues

Issue: "502 Bad Gateway"

Symptoms:

502 Bad Gateway - The server is temporarily unable to service the request

Solutions:

Check API server:

bash

docker-compose logs api-server

Check NGINX configuration:

bash

nginx -t

Restart services:

bash

docker-compose restart nginx api-server

Issue: "504 Gateway Timeout"

Symptoms:

504 Gateway Timeout - The server did not respond within the timeout period

Solutions:

Increase timeout in NGINX:

nginx

proxy_connect_timeout 600s;
proxy_send_timeout 600s;
proxy_read_timeout 600s;

Increase processing timeout:

bash

REQUEST_TIMEOUT=600

Restart NGINX:

bash

docker-compose restart nginx

Authentication Issues

Issue: "Invalid or expired token"

Symptoms:

json

{
  "status": "error",
  "error": "Invalid or expired token"
}

Solutions:

Generate new token:

bash

curl -X POST http://localhost:8125/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username": "user", "password": "pass"}'

Check token expiry:

bash

# Decode the JWT payload and compare its "exp" claim (a Unix timestamp) with the current time
date +%s
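
One way to read the expiry without extra tooling is to decode the token's payload segment by hand. This is a sketch under assumptions: `jwt_exp` and `jwt_is_expired` are hypothetical helpers, the token is a standard three-part JWT with a numeric "exp" claim, and `base64 -d` is available (as in GNU coreutils). It does not verify the signature.

```shell
# Prints the "exp" claim (Unix timestamp) from a JWT, or nothing if absent.
jwt_exp() {
  local payload
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding omits.
  case $(( ${#payload} % 4 )) in
    2) payload="${payload}==" ;;
    3) payload="${payload}=" ;;
  esac
  printf '%s' "$payload" | base64 -d |
    grep -o '"exp"[[:space:]]*:[[:space:]]*[0-9][0-9]*' |
    grep -o '[0-9][0-9]*$'
}

# Returns 0 if the token has expired, 1 if still valid, 2 if no "exp" was found.
jwt_is_expired() {
  local exp
  exp=$(jwt_exp "$1") || return 2
  [ -n "$exp" ] || return 2
  [ "$exp" -le "$(date +%s)" ]
}

# Example (run manually, with TOKEN holding the value returned by /auth/login):
# jwt_is_expired "$TOKEN" && echo "Token expired, log in again"
```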

Increase token expiry:

bash

TOKEN_EXPIRY_MINUTES=120  # 2 hours

Logging and Diagnostics

Enable Debug Logging

bash

# Set environment variables
export LOG_LEVEL=DEBUG
export DEBUG=true

# Or update .env
LOG_LEVEL=DEBUG
DEBUG=true

# Restart services
docker-compose restart api-server celery-worker

View Logs

bash

# API logs
docker-compose logs -f api-server

# Worker logs
docker-compose logs -f celery-worker

# All logs
docker-compose logs -f

# Filter by service and follow
docker-compose logs -f --tail=100 api-server

Get System Information

bash

# Python and packages
python --version
pip list | grep -E "fastapi|celery|torch"

# Docker info
docker version
docker-compose version

# System resources
uname -a
nproc
free -h
nvidia-smi

Health Checks

Check Overall Health

bash

curl http://localhost:8125/health

Expected response:

json

{
  "status": "healthy"
}
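
For use in scripts or monitoring, the response can be tested rather than read by eye. A minimal sketch, assuming the body has the shape shown above; `is_healthy` is a hypothetical helper, and it matches text rather than parsing JSON.

```shell
# Reads a /health response body on stdin.
# Returns 0 if it contains "status": "healthy", non-zero otherwise.
is_healthy() {
  grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Example (run manually):
# curl -sf http://localhost:8125/health | is_healthy && echo "API is up"
```

Where jq is installed, `jq -e '.status == "healthy"'` is a stricter alternative because it parses the JSON.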

Check Component Health

bash

curl http://localhost:8125/health/detailed | jq

Check Model Status

bash

curl http://localhost:8125/health/models | jq

Check Queue Status

bash

curl http://localhost:8125/audio/queue/status | jq

Check Worker Status

bash

curl http://localhost:8125/audio/workers/status | jq

Performance Issues

Issue: Slow audio processing

Symptoms:

Processing takes longer than expected
High CPU/GPU usage

Solutions:

Check worker status:

bash

curl http://localhost:8125/audio/workers/status | jq

Check queue length:

bash

curl http://localhost:8125/audio/queue/status | jq '.queue.queued'

Scale up workers:

bash

docker-compose up -d --scale celery-worker=4

Profile performance:

bash

python -m cProfile -s cumtime app/main.py

Data Issues

Issue: Missing or corrupted results

Symptoms:

Task completed but no results
Partial results returned

Solutions:

Check task status:

bash

# Replace {task_id} with the ID of the task to inspect
curl http://localhost:8125/audio/task/{task_id}

Check database integrity:

bash

# For SQLite
sqlite3 ai_service.db "PRAGMA integrity_check;"

# For PostgreSQL 14+ (PRAGMA is SQLite-only; pg_amcheck needs the amcheck extension in the database)
pg_amcheck -d ai_service

Re-process task:

bash

curl -X DELETE http://localhost:8125/audio/task/{task_id}
# Then resubmit the audio

Getting Help

Collect Diagnostic Information

bash

#!/bin/bash
# save_diagnostics.sh
# Collects system, service, and log details into diagnostics.txt.
# stderr is captured as well, so commands that fail are recorded in the report.

{
  echo "=== System Info ==="
  uname -a

  echo -e "\n=== Python Version ==="
  python --version

  echo -e "\n=== Installed Packages ==="
  pip list

  echo -e "\n=== Service Status ==="
  docker-compose ps

  echo -e "\n=== API Health ==="
  curl -s http://localhost:8125/health/detailed

  echo -e "\n=== Recent Logs ==="
  docker-compose logs --tail=50
} > diagnostics.txt 2>&1

echo "Diagnostics saved to diagnostics.txt"

Support Resources

GitHub Issues: https://github.com/openchlai/ai-service/issues
Documentation: https://docs.openchs.org
Email: support@openchs.org
Slack: #ai-service (OpenCHS Slack workspace)
