
Translator Model Documentation

1. Model Overview

The translator model is a sequence-to-sequence model designed to translate text from one language to another. It is based on the Hugging Face MarianMT architecture and has been fine-tuned for domain-specific Swahili–English translation.

The script (translation.py) supports training, evaluation, and experiment tracking via MLflow.

2. Model Details

  • Model Architecture: The model is based on the MarianMT architecture.
  • Training Data: The model is fine-tuned on a custom Swahili–English dataset, loaded through the Hugging Face datasets library.
  • Frameworks: PyTorch, Hugging Face Transformers, Datasets, and Evaluate.
  • Experiment Tracking: MLflow is used to log training parameters, metrics, and artifacts.

3. Training Pipeline

3.1. Configuration

The script requires a JSON configuration file. Example:

```json
{
  "language_pair": "en-sw",
  "train_file": "data/train.json",
  "validation_file": "data/val.json",
  "test_file": "data/test.json",
  "source_lang": "en",
  "target_lang": "sw",
  "num_train_epochs": 5,
  "per_device_train_batch_size": 16,
  "per_device_eval_batch_size": 16,
  "learning_rate": 5e-5
}
```
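For reference, loading this configuration in Python is straightforward. The helper below is a sketch of what translation.py presumably does; the function name and the treatment of defaults are assumptions, not the script's actual code.

```python
import json

# Defaults for optional keys (values assumed from the example config above).
DEFAULTS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,
}

def load_config(path):
    """Read a JSON config file and fill in defaults for missing optional keys."""
    with open(path) as f:
        config = json.load(f)
    for key, value in DEFAULTS.items():
        config.setdefault(key, value)
    return config
```

Any key present in the file overrides the default; keys such as language_pair and the dataset paths have no default and must be supplied.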

3.2. Running Training

```bash
python translation.py --config config.json
```

This will:

  • Load the dataset and tokenizer.
  • Fine-tune the MarianMT model.
  • Log metrics and hyperparameters in MLflow.
  • Save the trained model to ./models/finetuned-<lang-pair>/.
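The save path in the last step is derived from the configured language pair; a minimal sketch (the helper name is an assumption):

```python
from pathlib import Path

def model_output_dir(config):
    # Builds ./models/finetuned-<lang-pair>/ from the config's language pair.
    return Path("models") / f"finetuned-{config['language_pair']}"
```

For example, `model_output_dir({"language_pair": "en-sw"})` yields `models/finetuned-en-sw`.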

4. Request Format

Since this is a training script, the "request" is the configuration file passed to the program. Each language pair being trained has a separate configuration file.

Example:

```json
{
  "language_pair": "en-sw",
  "train_file": "data/train.json"
}
```

5. Response Format

The script produces:

```json
{
  "metrics": {
    "train_loss": "...",
    "eval_loss": "...",
    "chrf_score": "...",
    "bleu_score": "..."
  },
  "artifacts": {
    "saved_model": "./models/finetuned-en-sw",
    "mlflow_experiment": "translation-en-sw"
  }
}
```
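A sketch of how this response document could be assembled and written to disk; the function name and output path are assumptions, and the metric values would come from the training run.

```python
import json

def write_results(metrics, language_pair, path="results.json"):
    """Assemble the metrics/artifacts document shown above and save it as JSON."""
    results = {
        "metrics": metrics,
        "artifacts": {
            "saved_model": f"./models/finetuned-{language_pair}",
            "mlflow_experiment": f"translation-{language_pair}",
        },
    }
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return results
```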

6. Example Usage

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned English→Swahili model saved by the training script.
model_name = "./models/finetuned-en-sw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

text = ["Children need help."]
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```

Output:

"Watoto wanahitaji msaada."

7. Error Handling

  • If the configuration file is missing, a FileNotFoundError is raised.
  • If a dataset path in the configuration is invalid, dataset loading fails with an error from the datasets library.
  • If the model cannot be saved or loaded, an exception is raised and logged.
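These failure modes can be surfaced with explicit checks before training starts. The sketch below shows one way to do that; the helper name and error messages are assumptions, not the script's actual code.

```python
import json
from pathlib import Path

def validate_config(path):
    """Load the config, raising early if the config or any dataset file is missing."""
    config_file = Path(path)
    if not config_file.is_file():
        raise FileNotFoundError(f"Configuration file not found: {path}")
    config = json.loads(config_file.read_text())
    # Check every dataset path named in the config before training begins.
    for key in ("train_file", "validation_file", "test_file"):
        if key in config and not Path(config[key]).is_file():
            raise FileNotFoundError(f"Dataset file not found: {config[key]}")
    return config
```

Failing fast here keeps a bad path from surfacing only after the tokenizer and model have already been loaded.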