Tags: text2text generation, spelling normalization, 19th-century Dutch

19th Century Dutch Spelling Normalization

This repository contains a pretrained and finetuned version of the original google/ByT5-small model, adapted for the task of 19th-century Dutch spelling normalization. We first further pretrained the model on 2 million sentences from Dutch historical novels. Afterward, we finetuned it on a dataset of 10,000 19th-century Dutch sentences that were automatically annotated by a rule-based system built for 19th-century Dutch spelling normalization (van Cranenburgh and van Noord, 2022).

The finetuned model is only available in TensorFlow format, but it can be converted to PyTorch, as sketched below. The pretrained-only weights are available in PyTorch format in the Pretrained_ByT5 directory; note that these weights must be finetuned before use. The train and validation sets used for finetuning are available in the main repository. For further information about the model, please see the GitHub repository.
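
The following is a minimal sketch of such a conversion, assuming a recent transformers version with both TensorFlow and PyTorch installed; the from_tf flag loads a TensorFlow checkpoint into the PyTorch model class, and the output directory name is only illustrative:

from transformers import T5ForConditionalGeneration

# Load the TensorFlow checkpoint into PyTorch (requires TensorFlow to be installed)
pt_model = T5ForConditionalGeneration.from_pretrained(
    'AWolters/ByT5_DutchSpellingNormalization', from_tf=True)

# Save a PyTorch copy for later use (illustrative path)
pt_model.save_pretrained('ByT5_DutchSpellingNormalization_pt')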

How to use:

from transformers import AutoTokenizer, TFT5ForConditionalGeneration

# Load the byte-level ByT5 tokenizer and the finetuned TensorFlow model
tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# A 19th-century Dutch sentence with archaic spelling ('menschen')
text = 'De menschen waren aan het werk.'
tokenized = tokenizer(text, return_tensors='tf')

prediction = model.generate(input_ids=tokenized['input_ids'],
                            attention_mask=tokenized['attention_mask'],
                            max_new_tokens=100)

# Decode the generated tokens into the normalized sentence
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
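
For this example, the expected output is the normalized sentence 'De mensen waren aan het werk.' ('menschen' modernized to 'mensen'), although exact generations may vary.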

Setup:

The model was finetuned with the following (hyper)parameter values:

Learning rate: 5e-5
Batch size: 32
Optimizer: AdamW
Epochs: 30, with early stopping

To further finetune the model, use the T5Trainer.py script.
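
The T5Trainer.py script implements the full training loop. As an illustration only, a minimal Keras sketch of the hyperparameters listed above might look as follows; the data is a toy pair, and tf.keras.optimizers.AdamW assumes TensorFlow 2.11 or later:

import tensorflow as tf
from transformers import AutoTokenizer, TFT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')
model = TFT5ForConditionalGeneration.from_pretrained('AWolters/ByT5_DutchSpellingNormalization')

# Toy source/target pair; in practice, load the train and validation sets from the repository
inputs = tokenizer(['Zoo gingen de menschen heen.'], return_tensors='tf')
labels = tokenizer(text_target=['Zo gingen de mensen heen.'], return_tensors='tf')

dataset = tf.data.Dataset.from_tensor_slices({
    'input_ids': inputs['input_ids'],
    'attention_mask': inputs['attention_mask'],
    'labels': labels['input_ids'],
}).batch(32)

# AdamW with the learning rate listed above; the model computes its own loss from the labels
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=5e-5))

# Early stopping on the training loss for this toy setup;
# monitor val_loss instead when a real validation set is passed to fit()
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model.fit(dataset, epochs=30, callbacks=[early_stopping])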