
Breton-French translator m2m100_418M_br_fr

This model is a fine-tuned version of facebook/m2m100_418M (Fan et al., 2021) on a Breton-French parallel corpus. In order to obtain the best possible results, we use all of our parallel data for training and consequently report no quantitative evaluation at this time. Empirical qualitative evidence suggests that the translations are generally adequate for short and simple inputs; the behaviour of the model on long and/or complex inputs is currently unknown.

Try this model online in Troer; feedback and suggestions are welcome!

Model description

See the description of the base model.
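
For reference, here is a minimal inference sketch using the Hugging Face transformers library; the checkpoint identifier below is a placeholder for wherever this model lives (Hub id or local path):

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Placeholder: replace with the actual Hub id or a local path to this checkpoint.
checkpoint = "m2m100_418M_br_fr"

tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)

# M2M100 needs the source language set on the tokenizer and the target
# language forced as the first generated token.
tokenizer.src_lang = "br"
encoded = tokenizer("Demat d'an holl !", return_tensors="pt")
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))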

Intended uses & limitations

This is intended as a demonstration of the improvements brought by fine-tuning a large-scale many-to-many translation system on a medium-sized dataset of high-quality data. As far as I can tell, it usually provides translations that are at least as good as those of other available Breton-French translators, but it has not been evaluated quantitatively at a large scale.

Training and evaluation data

The training dataset consists of:

These are obtained from the OPUS collection (Tiedemann, 2012) and filtered using OpusFilter (Aulamo et al., 2020); see dl_opus.yaml for the details. The filtering is slightly non-deterministic due to the retraining of a statistical alignment model, but in my experience, different runs tend to give extremely similar results. Do not hesitate to reach out if you experience difficulties in using this to collect data.
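
As a rough illustration of the kind of surface filtering involved (this is not the actual dl_opus.yaml pipeline, and the thresholds below are made up), one could sketch it in Python as:

def keep_pair(src: str, tgt: str, max_len: int = 100, max_ratio: float = 3.0) -> bool:
    """Keep a sentence pair unless it is empty, overly long, or badly length-mismatched."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return False
    if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
        return False
    ratio = max(len(src_tokens), len(tgt_tokens)) / min(len(src_tokens), len(tgt_tokens))
    return ratio <= max_ratio

# Illustrative usage on a pair of line-aligned plain-text files.
with open("corpus.br", encoding="utf-8") as src_file, open("corpus.fr", encoding="utf-8") as tgt_file:
    kept = [(s.strip(), t.strip()) for s, t in zip(src_file, tgt_file) if keep_pair(s, t)]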

In addition to these, the training dataset also includes parallel br/fr sentences, provided as glosses in the Arbres wiki (Jouitteau, 2022), obtained from their ongoing port to Universal Dependencies in the Autogramm project.
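
For illustration, sentence pairs of that kind can be extracted from CoNLL-U files along the lines of the sketch below, assuming the French gloss is stored in a text_fr sentence-level comment (the actual metadata key used in those treebanks may differ):

import conllu

def extract_pairs(path):
    """Collect (Breton, French) sentence pairs from a CoNLL-U file.

    Assumes the French translation is stored in a `text_fr` comment, which
    may not match the actual convention used in the treebank.
    """
    pairs = []
    with open(path, encoding="utf-8") as conllu_file:
        for sentence in conllu.parse_incr(conllu_file):
            breton = sentence.metadata.get("text")
            french = sentence.metadata.get("text_fr")
            if breton and french:
                pairs.append((breton, french))
    return pairs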

Training procedure

The training hyperparameters are those suggested by Adelani et al. (2022) in their code release, which gave their best results for machine translation of several African languages.

More specifically, we train this model with zeldarose, using the following command:

zeldarose transformer \
   --config train_config.toml \
   --tokenizer "facebook/m2m100_418M" --pretrained-model "facebook/m2m100_418M" \
   --out-dir m2m100_418M+br-fr --model-name m2m100_418M+br-fr \
   --strategy ddp --accelerator gpu --num-devices 4 --device-batch-size 2 --num-workers 8 \
   --max-epochs 16 --precision 16 --tf32-mode medium \
   --val-data {val_path}.jsonl \
   {train_path}.jsonl

Training hyperparameters

The following hyperparameters were used during training:

[task]
change_ratio = 0.3
denoise_langs = []
poisson_lambda = 3.0
source_langs = ["br"]
target_langs = ["fr"]

[tuning]
batch_size = 16
betas = [0.9, 0.999]
epsilon = 1e-8
learning_rate = 5e-5
gradient_clipping = 1.0
lr_decay_steps = -1
warmup_steps = 1024

Framework versions

Carbon emissions

At this time, we estimate emissions of roughly 300 gCO<sub>2</sub> per fine-tuning run. So far, we account for

So far, the equivalent carbon emissions for this model are approximately 3300 gCO<sub>2</sub>.

References