# Translator Model Documentation

## 1. Model Overview
The translator model is a sequence-to-sequence model designed to translate text from one language to another. It is based on the MarianMT architecture from Hugging Face and has been fine-tuned for domain-specific Swahili–English translation.

The script (`translation.py`) supports training, evaluation, and experiment tracking via MLflow.
## 2. Model Details
- Model Architecture: The model is based on the MarianMT architecture.
- Training Data: The model is fine-tuned on a custom Swahili–English dataset, loaded through the Hugging Face `datasets` library.
- Frameworks: PyTorch, Hugging Face Transformers, Datasets, and Evaluate.
- Experiment Tracking: MLflow is used to log training parameters, metrics, and artifacts.
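For reference, a MarianMT base checkpoint can be loaded with the Transformers classes used throughout this document. The `Helsinki-NLP/opus-mt-en-sw` checkpoint name below is an assumption; substitute whichever base model `translation.py` is actually configured with.

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed base checkpoint for English->Swahili; replace with the one used by translation.py.
base_checkpoint = "Helsinki-NLP/opus-mt-en-sw"

# Load the pretrained MarianMT weights and the matching tokenizer before fine-tuning.
tokenizer = MarianTokenizer.from_pretrained(base_checkpoint)
model = MarianMTModel.from_pretrained(base_checkpoint)
```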
## 3. Training Pipeline

### 3.1. Configuration
The script requires a JSON configuration file. Example:
```json
{
  "language_pair": "en-sw",
  "train_file": "data/train.json",
  "validation_file": "data/val.json",
  "test_file": "data/test.json",
  "source_lang": "en",
  "target_lang": "sw",
  "num_train_epochs": 5,
  "per_device_train_batch_size": 16,
  "per_device_eval_batch_size": 16,
  "learning_rate": 5e-5
}
```
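As an illustration of how these keys might be consumed, the sketch below parses the file and maps the hyperparameters onto Hugging Face `Seq2SeqTrainingArguments`. The output directory naming follows section 3.2; the exact mapping inside `translation.py` may differ.

```python
import json

from transformers import Seq2SeqTrainingArguments

# Parse the configuration file shown above into a plain dictionary.
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

# Map the hyperparameter keys onto Seq2SeqTrainingArguments.
training_args = Seq2SeqTrainingArguments(
    output_dir=f"./models/finetuned-{config['language_pair']}",
    num_train_epochs=config["num_train_epochs"],
    per_device_train_batch_size=config["per_device_train_batch_size"],
    per_device_eval_batch_size=config["per_device_eval_batch_size"],
    learning_rate=config["learning_rate"],
    predict_with_generate=True,
)
```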
### 3.2. Running Training
```bash
python translation.py --config config.json
```
This will:
- Load the dataset and tokenizer.
- Fine-tune the MarianMT model.
- Log metrics and hyperparameters in MLflow.
- Save the trained model to `./models/finetuned-<lang-pair>/`.
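These steps could be wired together roughly as in the following sketch. It is a simplified outline rather than the exact contents of `translation.py`: the base checkpoint, the preprocessing function, and the dataset field names (taken from `source_lang`/`target_lang`) are assumptions, and `config`/`training_args` come from the configuration sketch in section 3.1.

```python
import mlflow
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
)

# Assumed base checkpoint; the script may start from a different one.
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-sw")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-sw")

# Load the custom Swahili-English dataset from the JSON files named in the config.
raw = load_dataset(
    "json",
    data_files={
        "train": config["train_file"],
        "validation": config["validation_file"],
    },
)

def preprocess(batch):
    # Tokenize source and target text; the record field names are illustrative.
    model_inputs = tokenizer(batch[config["source_lang"]], truncation=True)
    labels = tokenizer(text_target=batch[config["target_lang"]], truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw.map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,  # built from the configuration file (section 3.1)
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)

# Track the run in MLflow, fine-tune, and save the resulting model.
mlflow.set_experiment(f"translation-{config['language_pair']}")
with mlflow.start_run():
    mlflow.log_params(config)
    trainer.train()
    mlflow.log_metrics(trainer.evaluate())
    trainer.save_model(f"./models/finetuned-{config['language_pair']}")
```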
## 4. Request Format
Since this is a training script, the request format is the configuration file passed to the program. Each language pair being trained has a separate configuration file.
Example:
```json
{
  "language_pair": "en-sw",
  "train_file": "data/train.json"
}
```
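A sketch of how the script might receive this "request"; the `--config` flag matches the command shown in section 3.2, but the parsing code itself is illustrative.

```python
import argparse
import json

# Accept the --config flag shown in section 3.2 and load the JSON file it points to.
parser = argparse.ArgumentParser(description="Fine-tune a MarianMT translation model.")
parser.add_argument("--config", required=True, help="Path to the JSON configuration file.")
args = parser.parse_args()

with open(args.config, "r", encoding="utf-8") as f:
    config = json.load(f)
```

Running the script once per configuration file then produces one fine-tuned model per language pair.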
## 5. Response Format
The script produces:
```json
{
  "metrics": {
    "train_loss": "...",
    "eval_loss": "...",
    "chrf_score": "...",
    "bleu_score": "..."
  },
  "artifacts": {
    "saved_model": "./models/finetuned-en-sw",
    "mlflow_experiment": "translation-en-sw"
  }
}
```
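The chrF and BLEU values would typically be computed with the `evaluate` library from decoded model outputs and reference translations. The sketch below assumes sacreBLEU for the BLEU score and uses placeholder data.

```python
import evaluate

# Load the two metrics reported in the summary above.
chrf = evaluate.load("chrf")
bleu = evaluate.load("sacrebleu")

# Placeholder data: decoded model outputs and their reference translations.
predictions = ["Children need help."]
references = [["Children need help."]]

chrf_score = chrf.compute(predictions=predictions, references=references)["score"]
bleu_score = bleu.compute(predictions=predictions, references=references)["score"]
print({"chrf_score": chrf_score, "bleu_score": bleu_score})
```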
## 6. Example Usage
```python
from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and its tokenizer from the local output directory.
model_name = "./models/finetuned-en-sw"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Tokenize the input text, generate a translation, and decode it.
text = ["Watoto wanahitaji msaada."]
inputs = tokenizer(text, return_tensors="pt", padding=True)
translated = model.generate(**inputs)
print(tokenizer.decode(translated[0], skip_special_tokens=True))
```
Output:

```
"Children need help."
```
## 7. Error Handling
- If the configuration file is missing, a `FileNotFoundError` will be raised.
- If the dataset path is invalid, loading will fail with an error.
- If the model cannot be saved or loaded, an exception will be thrown and logged.
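A minimal sketch of how these failures could be caught and logged around the script's entry point; the `run_training` placeholder and the logger setup are illustrative, not part of `translation.py`.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("translation")

def run_training(config: dict) -> None:
    # Placeholder for the fine-tuning logic outlined in section 3.2.
    ...

def main(config_path: str) -> None:
    try:
        # A missing configuration file raises FileNotFoundError, as documented above.
        with open(config_path, "r", encoding="utf-8") as f:
            config = json.load(f)
        run_training(config)
    except FileNotFoundError:
        logger.error("Configuration file not found: %s", config_path)
        raise
    except Exception:
        # Dataset loading and model save/load failures are logged before re-raising.
        logger.exception("Training run failed")
        raise

if __name__ == "__main__":
    main("config.json")
```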