Whisper Transcription Model Documentation

1. Model Overview

openchs/asr-whisper-helpline-sw-v1 is an automatic speech recognition (ASR) model based on OpenAI Whisper large-v2 and fine-tuned for Swahili transcription and translation in the context of child helpline services. It is trained on conversations from the 116 Child Helpline in Tanzania, making it domain-specific for sensitive topics including abuse reporting, emergency situations, emotional distress, trauma narratives, and general child welfare inquiries.

Key Features

  • Architecture: OpenAI Whisper large-v2 (fine-tuned)
  • Domain: Child helpline conversations (Swahili focus)
  • Deployment: Available via AI Service API and Hugging Face Hub
  • Repository: openchs/asr-whisper-helpline-sw-v1
  • Dual Capability: Supports both transcription (Swahili→Swahili) AND translation (Swahili→English)
  • Special Capabilities: Handles code-switching, fragmented trauma narratives, emergency language, emotional distress expressions, and hallucination filtering
  • Language Support: 99+ languages with explicit support for African languages (Swahili, Amharic, Luganda, Kinyarwanda, Somali, Yoruba, Igbo, Hausa, Zulu, Xhosa, Afrikaans, Chichewa)

Integration in AI Service Pipeline

The Whisper model is the first component in the BITZ AI Service pipeline, converting audio input into text:

Audio Input → Whisper Transcription (Swahili) → Translation Model (English) → NER → Classification → Summarization
                  ↓ (optional)
           Whisper Translation (English) → NER → Classification

The model serves as the entry point for both:

  • Real-time processing: Progressive transcription during live calls
  • Post-call processing: Full audio file transcription after call completion

2. Integration in AI Service Architecture

The Whisper model is deeply integrated into the AI service through multiple layers:

2.1. Configuration Layer

The Whisper model is configured through the central settings system (app/config/settings.py):

python
class Settings(BaseSettings):
    # Whisper Model Configuration
    whisper_model_variant: str = "large_v2"
    whisper_large_v2_path: str = "./models/whisper_large_v2"
    whisper_large_turbo_path: str = "./models/whisper_large_turbo"
    whisper_active_symlink: str = "./models/whisper"
    
    # HuggingFace Hub Configuration
    use_hf_models: bool = True
    hf_whisper_large_v2: str = "openchs/asr-whisper-helpline-sw-v1"
    hf_whisper_large_turbo: str = "openchs/asr-whisper-helpline-sw-v1"
    hf_token: Optional[str] = None  # No token needed for public models
    
    # Model paths
    models_path: str = "./models"  # Auto-adjusted for Docker

Configuration Helper Methods:

python
def get_active_whisper_path(self) -> str:
    """Get path to the currently active whisper model"""
    ...

def get_hf_model_kwargs(self) -> Dict[str, Any]:
    """Get common kwargs for HuggingFace model loading - NO TOKEN for public models"""
    ...
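
The helper bodies are elided above. A minimal sketch of what they might look like, assuming only the fields shown in the Settings class (the actual implementation may also resolve the whisper_active_symlink), is:

python
from typing import Any, Dict

# Sketched as methods of the Settings class shown above.
def get_active_whisper_path(self) -> str:
    """Sketch: resolve the local path for the configured Whisper variant."""
    if self.whisper_model_variant == "large_v2":
        return self.whisper_large_v2_path
    return self.whisper_large_turbo_path

def get_hf_model_kwargs(self) -> Dict[str, Any]:
    """Sketch: build Hugging Face loading kwargs; public models need no token."""
    kwargs: Dict[str, Any] = {}
    if self.hf_token:  # only pass a token if one was explicitly configured
        kwargs["token"] = self.hf_token
    return kwargs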

Environment Detection:

The system automatically detects whether it's running in Docker or local development and adjusts paths accordingly.
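
One common way to implement such detection is to check for Docker-specific markers; the sketch below is a hedged illustration, not the service's actual logic (the /.dockerenv check, the RUNNING_IN_DOCKER variable, and the /app/models path are assumptions):

python
import os

def detect_models_path(default: str = "./models") -> str:
    """Illustrative sketch: pick a models directory based on the runtime environment."""
    # /.dockerenv is created by Docker inside containers; this marker is an assumption,
    # not necessarily what the AI Service checks.
    running_in_docker = os.path.exists("/.dockerenv") or os.environ.get("RUNNING_IN_DOCKER") == "1"
    return "/app/models" if running_in_docker else default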

2.2. Model Loading and Initialization

The Whisper model is loaded through the WhisperModel class during FastAPI application startup:

python
class WhisperModel:
    """HuggingFace Whisper model supporting both transcription and translation"""
    
    def __init__(self, model_path: str = None, enable_translation: bool = True):
        from ..config.settings import settings
        
        self.settings = settings
        self.model_path = model_path or settings.get_model_path("whisper")
        self.enable_translation = enable_translation
        
        ...

        # Supported language codes for Whisper
        self.supported_languages = {
            "auto": "Auto-detect",
            "en": "English",
            "es": "Spanish", 
            "fr": "French",
            "de": "German",
            "it": "Italian",
            "pt": "Portuguese",
            "ru": "Russian",
            "ja": "Japanese",
            "ko": "Korean",
            "zh": "Chinese",
            "ar": "Arabic",
            "hi": "Hindi",
            "sw": "Swahili",
            "am": "Amharic",
            "lg": "Luganda",
            "rw": "Kinyarwanda",
            "so": "Somali",
            "yo": "Yoruba",
            "ig": "Igbo",
            "ha": "Hausa",
            "zu": "Zulu",
            "xh": "Xhosa",
            "af": "Afrikaans",
            "ny": "Chichewa"
        }

Model Loading Implementation:

python
def load(self) -> bool:
    """Load Whisper model with HuggingFace Hub support - NO AUTHENTICATION"""
    try:
        ...

Model Loader Integration:

The Whisper model is managed by the central model_loader which handles:

  • Startup initialization
  • Readiness checks
  • Dependency tracking
  • Health monitoring

python
from contextlib import asynccontextmanager
from fastapi import FastAPI

# During FastAPI startup
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Initialize Whisper model via model_loader
    model_loader.load_model("whisper")
    yield
    # Cleanup on shutdown

2.3. API Endpoints Layer

Whisper functionality is exposed through FastAPI routes (app/api/endpoints/whisper_routes.py):

python
from typing import Dict, Optional

from fastapi import APIRouter, File, Form, UploadFile
from pydantic import BaseModel

router = APIRouter(prefix="/whisper", tags=["whisper"])

class TranscriptionRequest(BaseModel):
    language: Optional[str] = None  # e.g., "en", "sw", "fr", "auto"

class TranscriptionResponse(BaseModel):
    transcript: str
    language: Optional[str]
    processing_time: float
    model_info: Dict
    timestamp: str
    audio_info: Dict

@router.post("/transcribe", response_model=TranscriptionResponse)
async def transcribe_audio(
    audio: UploadFile = File(...),
    language: Optional[str] = Form(None)
):
    """
    Transcribe uploaded audio file to text
    
    Parameters:
    - audio: Audio file (wav, mp3, flac, m4a, ogg, webm)
    - language: Language code (e.g., 'en', 'sw', 'fr') or 'auto' for auto-detection
    """
    ...
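
The endpoint body is elided above. A minimal sketch of the typical flow (validate format and size, persist the upload, call the model, build the response) might look like the following. Only model_loader, is_loaded, and transcribe_audio_file come from the snippets in this document; the helper name, status codes, and everything else are illustrative assumptions.

python
import os
import tempfile
import time
from datetime import datetime, timezone
from typing import Optional

from fastapi import HTTPException, UploadFile

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a", ".ogg", ".webm"}
MAX_FILE_SIZE_MB = 100

async def transcribe_audio_sketch(audio: UploadFile, language: Optional[str] = None) -> dict:
    """Illustrative sketch of the /whisper/transcribe flow; not the service's exact code."""
    ext = os.path.splitext(audio.filename or "")[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise HTTPException(status_code=400, detail=f"Unsupported audio format: {ext}")

    data = await audio.read()
    if len(data) > MAX_FILE_SIZE_MB * 1024 * 1024:
        raise HTTPException(status_code=400, detail="File too large")

    whisper_model = model_loader.models.get("whisper")  # model_loader as shown elsewhere in this doc
    if whisper_model is None or not whisper_model.is_loaded:
        raise HTTPException(status_code=503, detail="Whisper model not ready")

    # The model API takes a file path, so persist the upload to a temporary file first.
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        start = time.time()
        transcript = whisper_model.transcribe_audio_file(tmp_path, language=language)
        return {
            "transcript": transcript,
            "language": language,
            "processing_time": time.time() - start,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
    finally:
        os.unlink(tmp_path)  # remove the temporary audio file after transcription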

Information Endpoints:

python
@router.get("/info")
async def get_whisper_info():
    """Get Whisper model information"""
    ...

@router.get("/languages")
async def get_supported_languages():
    """Get list of supported languages"""
    if not model_loader.is_model_ready("whisper"):
        return {
            "error": "Whisper model not ready",
            "supported_languages": {}
        }
    
    whisper_model = model_loader.models.get("whisper")
    ...

2.4. Audio Processing Strategy

The WhisperModel class implements intelligent audio processing with automatic chunking and hallucination filtering:

Why Audio Chunking is Needed:

  • Whisper models have optimal performance on 30-second audio segments
  • Long helpline conversations often exceed this optimal length
  • Direct processing of very long audio can cause memory issues
  • Chunking with proper reconstruction preserves conversation continuity
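
The duration check that decides whether chunking is needed can be sketched as follows (a hedged illustration using librosa; the 30-second threshold comes from the notes above):

python
import librosa

CHUNK_DURATION_SECONDS = 30  # optimal Whisper segment length, per the notes above

def needs_chunking(audio_file_path: str) -> bool:
    """Illustrative sketch: decide whether an audio file should be split into chunks."""
    audio, sample_rate = librosa.load(audio_file_path, sr=16000, mono=True)
    duration_seconds = len(audio) / sample_rate
    return duration_seconds > CHUNK_DURATION_SECONDS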

Core Transcription Method:

python
def transcribe_audio_file(self, audio_file_path: str, language: Optional[str] = None, task: str = "transcribe") -> str:
    """Transcribe or translate audio file to text with support for long audio"""
    if not self.is_loaded:
        raise RuntimeError("Whisper model not loaded")
    
    ...

Hallucination Filtering:

Whisper models can produce false outputs during silence or low-energy audio. The implementation includes filtering:

python
def transcribe_pcm_audio(self, pcm_bytes: bytes, sample_rate: int = 16000, language: Optional[str] = None, task: str = "transcribe") -> str:
    """Transcribe or translate PCM audio data directly from bytes"""
    ...
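
The filtering logic is elided above. Based on the mitigation strategies described in section 4 (energy threshold, common-phrase filtering), a hedged sketch of such a filter is shown below; the threshold value and phrase list are assumptions, not the service's actual values.

python
import numpy as np

# Phrases Whisper is known to hallucinate on silence; the exact list is an assumption.
COMMON_HALLUCINATIONS = {"thank you.", "thanks for watching.", "goodbye."}
ENERGY_THRESHOLD = 1e-4  # assumed RMS energy floor below which audio is treated as silence

def filter_hallucination(audio: np.ndarray, transcript: str) -> str:
    """Illustrative sketch: drop transcripts from near-silent audio or known false phrases."""
    rms_energy = float(np.sqrt(np.mean(np.square(audio)))) if audio.size else 0.0
    if rms_energy < ENERGY_THRESHOLD:
        return ""  # near-silent input: treat any output as a hallucination
    if transcript.strip().lower() in COMMON_HALLUCINATIONS:
        return ""
    return transcript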

Dual-Task Processing Methods:

The model provides methods for getting both transcription AND translation in a single call:

python
def transcribe_and_translate(
    self,
    audio_file_path: str,
    language: Optional[str] = None
) -> Dict[str, str]:
    """
    Get both transcript (original language) AND translation (English) from audio file
    
    Args:
        audio_file_path: Path to audio file
        language: Source language code (e.g., "sw" for Swahili)
    
    Returns:
        Dict with keys:
            - 'transcript': Original language text
            - 'translation': English translation (or None if translation not enabled)
    """
    if not self.is_loaded:
        raise RuntimeError("Whisper model not loaded")
    
    ...

def process_audio_with_options(
    self,
    audio_file_path: str,
    language: Optional[str] = None,
    include_translation: bool = False
) -> Dict[str, Any]:
    """
    Flexible audio processing method that returns transcript and optional translation
    
    Args:
        audio_file_path: Path to audio file
        language: Source language code (e.g., "sw")
        include_translation: Whether to include English translation
    
    Returns:
        Dict with keys:
            - 'transcript': Original language text (ALWAYS present)
            - 'translation': English translation (only if include_translation=True)
            - 'language': Language code used
            - 'has_translation': Boolean indicating if translation was performed
    """
    ...
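
Assuming a loaded WhisperModel instance (as defined in section 2.2), these methods would be used roughly as follows; the file name is a placeholder:

python
# Usage sketch: obtain both the Swahili transcript and its English translation.
whisper_model = WhisperModel(enable_translation=True)
whisper_model.load()

result = whisper_model.transcribe_and_translate("helpline_call.wav", language="sw")
print(result["transcript"])   # original Swahili text
print(result["translation"])  # English translation (None if translation is disabled)

# Or request the translation only when needed:
flexible = whisper_model.process_audio_with_options(
    "helpline_call.wav", language="sw", include_translation=True
)
print(flexible["has_translation"])  # True when a translation was produced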

2.5. Health Monitoring

The Whisper model integrates with the AI service health monitoring system (app/api/endpoints/health_routes.py):

Model Status Endpoint:

python
@router.get("/health/models")
async def models_health():
    """Get detailed model status with dependency info"""
    ...

Model Readiness States:

  • Ready: Model loaded and available for transcription
  • Implementable: Model can be loaded but not yet initialized
  • Blocked: Missing dependencies preventing model loading

Model Info Method:

python
def get_model_info(self) -> Dict[str, Any]:
    """Get model information"""
   ...

Health Check Example Response:

json
{
  "ready_models": ["whisper", "translator", "classifier", "ner"],
  "implementable_models": [],
  "blocked_models": [],
  "details": {
    "whisper": {
      "loaded": true,
      "device": "cuda:0",
      "current_model_id": "openchs/asr-whisper-helpline-sw-v1",
      "translation_enabled": true,
      "version": "large-v2",
      "supported_formats": ["wav", "mp3", "flac", "m4a", "ogg", "webm"],
      "max_audio_length": "unlimited (chunked processing)"
    }
  }
}

2.6. Pipeline Integration

The Whisper model integrates into two processing modes:

Real-time Processing

For live calls, Whisper transcribes audio progressively:

python
# Progressive transcription during active call
@router.get("/{call_id}/transcript")
async def get_call_transcript(call_id: str):
    """Get cumulative transcript for active call"""
    ...

Real-time Flow:

  1. Audio stream → Whisper transcription (Swahili) in 30-second chunks
  2. Optional: Whisper translation (English)
  3. Translated/transcribed text → NER extraction
  4. Entities + classification → Agent notifications

Configuration for Real-time Mode:

python
# Settings control real-time behavior
realtime_enable_progressive_translation: bool = True
realtime_processing_interval_seconds: int = 30
realtime_enable_agent_notifications: bool = True

Post-call Processing

For completed calls, full audio file processing:

Complete pipeline after the call ends:

Audio File (SCP download) → Whisper Transcription (full file) → Optional Translation →
NER → Classification → Summarization → Unified Insights

Configuration for Post-call Mode:

python
# Settings control post-call processing
postcall_enable_full_pipeline: bool = True
postcall_enable_enhanced_transcription: bool = True
postcall_whisper_model: str = "large-v2"
postcall_enable_audio_quality_improvement: bool = True
postcall_enable_noise_reduction: bool = True
postcall_convert_to_wav: bool = True

Audio Download and Processing:

python
# SCP Configuration for audio retrieval
scp_user: str = "helpline"
scp_server: str = "192.168.10.3"
scp_password: str = "h3lpl1n3"
scp_remote_path_template: str = "/home/dat/helpline/calls/{call_id}.gsm"
scp_timeout_seconds: int = 30

# Audio preprocessing
postcall_enable_noise_reduction: bool = True
postcall_convert_to_wav: bool = True
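
A download-and-convert step consistent with this configuration could be sketched as follows. This is a hedged illustration using a plain scp subprocess and ffmpeg for the GSM→WAV conversion; the real service may use a different transfer or conversion mechanism, and key-based SSH authentication is assumed here.

python
import subprocess

def fetch_and_convert_call_audio(call_id: str, settings) -> str:
    """Illustrative sketch: copy a call recording via scp and convert it to 16 kHz mono WAV."""
    remote_path = settings.scp_remote_path_template.format(call_id=call_id)
    local_gsm = f"/tmp/{call_id}.gsm"
    local_wav = f"/tmp/{call_id}.wav"

    # Copy the recording from the helpline server (assumes key-based auth is configured).
    subprocess.run(
        ["scp", f"{settings.scp_user}@{settings.scp_server}:{remote_path}", local_gsm],
        check=True,
        timeout=settings.scp_timeout_seconds,
    )

    # Convert GSM to 16 kHz mono WAV with ffmpeg, matching Whisper's expected input.
    if settings.postcall_convert_to_wav:
        subprocess.run(
            ["ffmpeg", "-y", "-i", local_gsm, "-ar", "16000", "-ac", "1", local_wav],
            check=True,
        )
        return local_wav
    return local_gsm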

2.7. Memory Management

The Whisper model implements automatic GPU/CPU memory management:

GPU Memory Handling:

python
# Automatic device management in transcription
try:
    with torch.no_grad():
        predicted_ids = self.model.generate(
            input_features,
            attention_mask=attention_mask,
            max_length=448,
            num_beams=5,
            **generate_kwargs
        )
except torch.cuda.OutOfMemoryError:
    logger.warning("CUDA out of memory, falling back to CPU...")
    self.model.to("cpu")
    input_features = input_features.to(device="cpu", dtype=torch.float32)
    attention_mask = attention_mask.to("cpu")
    
    # Retry on CPU
    with torch.no_grad():
        predicted_ids = self.model.generate(...)
    
    # Restore to GPU
    self.model.to(self.device)

Memory Cleanup:

  • GPU cache cleared after processing
  • Temporary audio files deleted after transcription
  • Automatic garbage collection for long audio processing
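
The cleanup steps listed above can be sketched as a small helper (a hedged illustration; the AI Service's actual cleanup code may differ):

python
import gc
import os
from typing import Optional

import torch

def cleanup_after_chunk(temp_file_path: Optional[str] = None) -> None:
    """Illustrative sketch: free GPU cache, delete temporary audio, and collect garbage."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached GPU memory between chunks
    if temp_file_path and os.path.exists(temp_file_path):
        os.remove(temp_file_path)  # delete temporary audio files after transcription
    gc.collect()  # trigger garbage collection during long audio processing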

3. Using the Model

3.1. Via AI Service API (Production Use)

The Whisper model is deployed as part of the AI Service and accessible via REST API.

Endpoint

POST /whisper/transcribe

Request Format

Headers:

Content-Type: multipart/form-data

Request Body:

- audio: file (required) - Audio file in supported format
- language: string (optional) - Language code (e.g., "sw", "en", "auto")

Response Format

Success Response (200):

json
{
  "transcript": "string",
  "language": "string",
  "processing_time": 0,
  "model_info": {
    "model_name": "whisper",
    "model_path": "string",
    "fallback_model_id": "openchs/asr-whisper-helpline-sw-v1",
    "device": "cuda:0",
    "is_loaded": true,
    "translation_enabled": true,
    "version": "large-v2",
    "supported_formats": ["wav", "mp3", "flac", "m4a", "ogg", "webm"]
  },
  "timestamp": "string",
  "audio_info": {
    "filename": "string",
    "file_size_mb": 0,
    "format": "string",
    "content_type": "string"
  }
}

Validation Error (422):

json
{
  "detail": [
    {
      "loc": ["string"],
      "msg": "string",
      "type": "string"
    }
  ]
}

Example cURL Requests

Basic Transcription (Auto-detect language):

bash
curl -X POST "http://192.168.8.18:8123/whisper/transcribe" \
  -H "Content-Type: multipart/form-data" \
  -F "audio=@sample.wav"

Response:

json
{
  "transcript": "Mbukweli, kaitike vipi? Nukuanasema. Kikozana na baba mdo...",
  "language": null,
  "processing_time": 63.396394,
  "model_info": {
    "model_name": "whisper",
    "device": "cuda:0",
    "translation_enabled": true,
    "version": "large-v2"
  },
  "timestamp": "2025-10-21T12:05:27.726046",
  "audio_info": {
    "filename": "sample.wav",
    "file_size_mb": 1.97,
    "format": ".wav",
    "content_type": "audio/wav"
  }
}

Swahili Transcription (Specify language):

bash
curl -X POST "http://192.168.8.18:8123/whisper/transcribe" \
  -H "Content-Type: multipart/form-data" \
  -F "audio=@helpline_call.wav" \
  -F "language=sw"

English Transcription:

bash
curl -X POST "http://192.168.8.18:8123/whisper/transcribe" \
  -H "Content-Type: multipart/form-data" \
  -F "audio=@english_audio.mp3" \
  -F "language=en"

Long Audio File (Automatic chunking):

bash
curl -X POST "http://192.168.8.18:8123/whisper/transcribe" \
  -H "Content-Type: multipart/form-data" \
  -F "audio=@long_call_5_minutes.wav" \
  -F "language=sw"

Getting Whisper Model Info

Endpoint:

GET /whisper/info

Response:

json
{
  "status": "ready",
  "model_info": {
    "model_name": "whisper",
    "current_model_id": "openchs/asr-whisper-helpline-sw-v1",
    "device": "cuda:0",
    "is_loaded": true,
    "translation_enabled": true,
    "version": "large-v2",
    "supported_formats": ["wav", "mp3", "flac", "m4a", "ogg", "webm"],
    "max_audio_length": "unlimited (chunked processing)",
    "sample_rate": "16kHz",
    "tasks_supported": "transcribe, translate",
    "languages": "multilingual (99+ languages)"
  }
}

Getting Supported Languages

Endpoint:

GET /whisper/languages

Response:

json
{
  "supported_languages": {
    "auto": "Auto-detect",
    "en": "English",
    "sw": "Swahili",
    "am": "Amharic",
    "lg": "Luganda",
    "rw": "Kinyarwanda",
    "so": "Somali",
    "yo": "Yoruba",
    "ig": "Igbo",
    "ha": "Hausa",
    "zu": "Zulu",
    "xh": "Xhosa",
    "af": "Afrikaans",
    "ny": "Chichewa"
  },
  "total_supported": "99+ languages",
  "note": "Whisper supports many more languages beyond this list",
  "usage": "Pass language code in request: language='sw' for Swahili"
}

Features

  • Automatic Language Detection: Omit the language parameter or pass language=auto for auto-detection
  • Format Flexibility: Supports wav, mp3, flac, m4a, ogg, webm
  • Large File Handling: Automatic chunking for files >30 seconds
  • File Size Limit: 100MB maximum
  • GPU Acceleration: Automatic CUDA utilization with CPU fallback
  • Hallucination Filtering: Removes false transcriptions from silence
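
For programmatic access, the same endpoint can be called from Python with the requests library (a client-side sketch, assuming the service is reachable at the host shown in the cURL examples):

python
import requests

API_URL = "http://192.168.8.18:8123/whisper/transcribe"

with open("helpline_call.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        files={"audio": ("helpline_call.wav", audio_file, "audio/wav")},
        data={"language": "sw"},  # omit or use "auto" for automatic language detection
        timeout=300,  # long audio can take minutes to process
    )

response.raise_for_status()
result = response.json()
print(result["transcript"])
print(f"Processed in {result['processing_time']:.1f}s")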

3.2. Via Hugging Face Hub

The model is publicly available on Hugging Face for direct inference and fine-tuning.

Model Repository

The model is publicly hosted on the Hugging Face Hub at https://huggingface.co/openchs/asr-whisper-helpline-sw-v1.

Installation

bash
pip install transformers torch librosa

Inference Examples

Using Pipeline (Recommended for Transcription):

python
from transformers import pipeline
import torch

# Select device and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the ASR pipeline
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openchs/asr-whisper-helpline-sw-v1",
    device=device,
    torch_dtype=torch_dtype
)

# Transcribe audio file (Swahili)
result = transcriber("helpline_call.wav")
print(result["text"])
# Output: "Ninajisikia vibaya sana. Nahitaji mtu wa kuongea naye."

# With language specification
result = transcriber(
    "helpline_call.wav",
    generate_kwargs={"language": "sw", "task": "transcribe"}
)
print(result["text"])

Using Model and Processor Directly (For Translation):

python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load model and processor
model_id = "openchs/asr-whisper-helpline-sw-v1"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load audio
audio_array, sample_rate = librosa.load("helpline_call.wav", sr=16000, mono=True)

# Process audio
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate transcription
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="sw",
        task="transcribe"
    )

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Translation Example (Swahili to English):

python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

model_id = "openchs/asr-whisper-helpline-sw-v1"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load audio
audio_array, sample_rate = librosa.load("swahili_audio.wav", sr=16000, mono=True)

# Process audio
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(device, dtype=torch_dtype)

# Generate translation to English
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        language="sw",
        task="translate"  # Translate to English
    )

translation = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(translation)
# Output: "I feel very bad. I need someone to talk to."

Batch Transcription:

python
from transformers import pipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

transcriber = pipeline(
    "automatic-speech-recognition",
    model="openchs/asr-whisper-helpline-sw-v1",
    device=device,
    torch_dtype=torch_dtype,
    batch_size=8  # Process multiple files at once
)

audio_files = [
    "call_001.wav",
    "call_002.wav",
    "call_003.wav"
]

results = transcriber(audio_files)

for audio_file, result in zip(audio_files, results):
    print(f"{audio_file}: {result['text']}")

Long Audio Processing (Chunking):

python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

model_id = "openchs/asr-whisper-helpline-sw-v1"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Load long audio file
audio_array, sample_rate = librosa.load("long_helpline_call.wav", sr=16000, mono=True)

# Calculate chunks (30-second segments)
chunk_duration = 30
chunk_samples = chunk_duration * sample_rate
audio_length = len(audio_array)

transcriptions = []

for i in range(0, audio_length, chunk_samples):
    chunk_end = min(i + chunk_samples, audio_length)
    audio_chunk = audio_array[i:chunk_end]
    
    # Process chunk
    inputs = processor(audio_chunk, sampling_rate=16000, return_tensors="pt")
    input_features = inputs.input_features.to(device, dtype=torch_dtype)
    
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features,
            language="sw",
            task="transcribe",
            max_length=448,
            num_beams=5
        )
    
    chunk_transcript = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    transcriptions.append(chunk_transcript.strip())
    
    print(f"Processed chunk {i//chunk_samples + 1}/{(audio_length + chunk_samples - 1)//chunk_samples}")

# Combine all chunks
full_transcript = " ".join(transcriptions)
print("\nFull transcript:")
print(full_transcript)

Note: When using the model directly from Hugging Face, you'll need to implement your own chunking logic for audio files longer than 30 seconds. The AI Service API handles this automatically.


4. Production Considerations

Audio File Limits

  • Maximum File Size: 100MB
  • Supported Formats: wav, mp3, flac, m4a, ogg, webm
  • Sample Rate: Automatically resampled to 16kHz
  • Optimal Chunk Size: 30 seconds per segment
  • Maximum Audio Length: Unlimited (automatic chunking)

Processing Time

  • Short audio (< 10 seconds): ~0.5-2 seconds
  • Medium audio (10-60 seconds): ~2-10 seconds
  • Long audio (> 60 seconds): ~10-60 seconds (depends on length)
  • Very long audio (> 5 minutes): ~1-5 minutes with chunking

Processing times vary based on:

  • GPU availability (CUDA vs CPU)
  • Audio file length and quality
  • Number of chunks required
  • System load and available memory
  • Language (the model is optimized for Swahili)

Automatic Chunking

The API automatically handles long audio files by:

  1. Duration Check: Detecting if audio > 30 seconds
  2. Segmentation: Creating 30-second chunks
  3. Processing: Transcribing each chunk independently
  4. Reconstruction: Combining transcriptions with proper spacing
  5. Memory Management: Cleaning up between chunks

Error Handling

Common Error Scenarios:

  1. No Audio File:
json
{
  "detail": [
    {
      "loc": ["body", "audio"],
      "msg": "No audio file provided",
      "type": "value_error"
    }
  ]
}
  2. Unsupported Format:
json
{
  "detail": "Unsupported audio format: .avi. Supported: ['.wav', '.mp3', '.flac', '.m4a', '.ogg', '.webm']"
}
  3. File Too Large:
json
{
  "detail": "File too large: 105.3MB. Max: 100MB"
}
  4. Model Not Ready:
json
{
  "detail": "Whisper model not ready. Check /health/models for status."
}
  5. Service Unavailable:
json
{
  "detail": "Whisper model not available"
}
  6. Processing Failure:
json
{
  "detail": "Transcription failed: [error details]"
}

Health Checks

Monitor Whisper service health:

bash
# Check if Whisper is ready
curl -X GET "http://192.168.8.18:8123/whisper/info"

# Get detailed model status
curl -X GET "http://192.168.8.18:8123/health/models"

# Get supported languages
curl -X GET "http://192.168.8.18:8123/whisper/languages"

GPU Memory Management

The model automatically:

  • Uses CUDA when available (float16 precision)
  • Falls back to CPU on OOM errors (float32 precision)
  • Cleans up GPU cache between chunks
  • Restores GPU after CPU fallback

Memory Requirements:

  • GPU (CUDA): ~3-4GB VRAM for large-v2
  • CPU: ~6-8GB RAM for large-v2
  • Disk: ~3GB for model weights

Hallucination Mitigation

The model implements several strategies to reduce false transcriptions:

  1. Silence Detection: Filters outputs from near-silent audio
  2. Energy Threshold: Checks audio energy before processing
  3. Common Phrase Filtering: Removes known hallucinations like "Thank you", "Goodbye", etc.
  4. Confidence Scoring: Uses beam search with high beam count (5)

Language Optimization

While the model supports 99+ languages, it is specifically optimized for:

  • Swahili (primary training language)
  • English (translation target)
  • Code-switching (Swahili-English mixed speech)
  • African Languages (Somali, Luganda, Kinyarwanda, etc.)

For best results with non-Swahili languages, consider testing on sample audio first.


5. Model Limitations

Domain Specificity

  • Optimized for: Child helpline conversations, emotional support language, emergency situations, trauma narratives
  • May require adaptation for: General Swahili transcription, formal speeches, technical content, non-helpline domains
  • Performance varies on: Out-of-distribution data, highly specialized terminology, studio-quality recordings

Technical Constraints

  • Maximum File Size: 100MB per request
  • Optimal Audio Length: 30-second segments (longer audio is chunked)
  • Memory Requirements: GPU recommended for production (CPU fallback available but slower)
  • Processing Speed: Dependent on hardware, audio length, and quality
  • Supported Formats: Limited to wav, mp3, flac, m4a, ogg, webm

Audio Quality Considerations

  • Background Noise: May reduce transcription accuracy
  • Multiple Speakers: Works best with single speaker or clear alternating speakers
  • Audio Encoding: GSM and other low-bitrate formats may reduce transcription quality
  • Sample Rate: Automatically resampled to 16kHz (may affect very high-quality recordings)
  • Silence/Low Volume: May produce hallucinations (filtered automatically)

Known Considerations

  • Code-Switching: Generally successful, but complex patterns may reduce accuracy
  • Cultural Context: Some culturally-specific Tanzanian expressions may not transcribe perfectly
  • Emotion Recognition: Optimized for emotional distress language, may vary on neutral speech
  • Accents: Fine-tuned on Tanzanian Swahili, may vary on other regional accents
  • Long Audio: Chunking may occasionally affect cohesion at segment boundaries

Translation Limitations

  • Translation Quality: Swahili→English translation is generally good but may lose nuance
  • Cultural Context: Some cultural references may not translate well
  • Idiomatic Expressions: May translate literally rather than idiomatically
  • Proper Nouns: May be inconsistently translated or transliterated

Recommendations

  • Monitor transcription quality for edge cases in your domain
  • Use health check endpoints to verify model readiness before critical operations
  • Consider implementing quality feedback loops for continuous improvement
  • Review transcriptions for sensitive or critical content
  • Test with sample audio from your specific use case before production deployment
  • For non-Swahili languages, validate accuracy with ground truth data
  • Use the API's automatic chunking rather than pre-segmenting long audio

6. Citation

If you use this model in your research or application, please cite:

bibtex
@misc{asr_whisper_helpline_sw_v1,
  title={ASR Whisper Helpline Swahili v1},
  author={OpenCHS Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/openchs/asr-whisper-helpline-sw-v1},
  note={Fine-tuned Whisper large-v2 for child helpline transcription}
}

7. Support and Contact

Issues and Questions

For questions about using the model or this documentation, contact the BITZ AI Team at info@bitz-itc.com.

Contributing

For dataset contributions, model improvements, or bug reports, please contact the BITZ AI Team at info@bitz-itc.com.


8. License

This model is released under the Apache 2.0 License, allowing for both commercial and non-commercial use with proper attribution.


Documentation last updated: October 21, 2025