GPT2-Medium pre-trained on cleaned Dutch mC4 🇳🇱

A GPT2 medium-sized model (345M parameters) trained from scratch on Dutch, with perplexity 15.1 on cleaned Dutch mC4.

How To Use

You can use this GPT2 model directly with a pipeline for text generation.

from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel

MODEL_DIR = 'yhavinga/gpt2-medium-dutch'

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = GPT2Tokenizer.from_pretrained(MODEL_DIR)
model = GPT2LMHeadModel.from_pretrained(MODEL_DIR)
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# Sampling settings are passed per call; max_length counts prompt plus generated tokens.
generated_text = generator('In Antwerpen heeft zich gisteren', max_length=100, do_sample=True, top_k=40, top_p=0.95, repetition_penalty=2.0)

"In Antwerpen heeft zich gisteren" - " een dramatische ontknoping voorgedaan in de Vlaamse deelregering. De VLD, die sinds afgelopen woensdag aan het bewind is in Vlaams-Waals gebied (de zogenaamde gewestelijke en niet rechtstreeks met Vlaanderen samenwerkende gewesten), krijgt toch geen meerderheidszetels bij verkiezingen voor gemeenteraadsverkiezingen in oktober of november volgend jaar in Westmalle, Berchem, Tervuren enz., aldus premier Jean-Pierre Van Cauwenberghe van Wallonië vandaag"
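By default the text-generation pipeline returns a list with one dict per generated sequence, and each dict holds the full text (prompt plus continuation) under the key `generated_text`. The sample above shows the prompt and the continuation separately; a minimal sketch of that split follows. The helper name `split_continuation` is hypothetical, and the output used here is a hand-written, shortened stand-in for real pipeline output so that the example runs without downloading the model.

```python
def split_continuation(prompt, outputs):
    """Return (prompt, continuation) pairs from text-generation pipeline output.

    `outputs` is assumed to have the default pipeline shape: a list of dicts,
    each with a 'generated_text' key holding the prompt followed by the
    generated continuation.
    """
    pairs = []
    for item in outputs:
        text = item["generated_text"]
        # Strip the prompt only when the text really starts with it.
        continuation = text[len(prompt):] if text.startswith(prompt) else text
        pairs.append((prompt, continuation))
    return pairs


# Hand-written stand-in for real pipeline output (assumption, not model output).
prompt = "In Antwerpen heeft zich gisteren"
fake_outputs = [{"generated_text": prompt + " een dramatische ontknoping voorgedaan"}]

for p, c in split_continuation(prompt, fake_outputs):
    print(repr(p), "-", repr(c))
```

With a loaded model, the same helper could be called as `split_continuation(prompt, generator(prompt, max_length=100, do_sample=True))`.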

Tokenizer

Dataset

This model was trained on the full configuration (33B tokens) of cleaned Dutch mC4, which is the original mC4, except

Models

TL;DR: yhavinga/gpt2-medium-dutch is the best model.

| repository | model | params | train seq len | ppl | loss | batch size | epochs | steps | optim | lr | duration | config |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yhavinga/gpt-neo-125M-dutch | gpt neo | 125M | 512 | 20.9 | 3.04 | 128 | 1 | 190000/558608 | adam | 2.4e-3 | 1d 12h | full |
| yhavinga/gpt2-medium-dutch | gpt2 | 345M | 512 | 15.1 | 2.71 | 128 | 1 | 320000/520502 | adam | 8e-4 | 7d 2h | full |
| yhavinga/gpt2-large-dutch | gpt2 | 762M | 512 | 15.1 | 2.72 | 32 | 1 | 1100000/2082009 | adafactor | 3.3e-5 | 8d 15h | large |
| yhavinga/gpt-neo-1.3B-dutch | gpt neo | 1.3B | 512 | 16.0 | 2.77 | 16 | 1 | 960000/3049896 | adafactor | 5e-4 | 7d 11h | full |
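The ppl and loss columns are related: perplexity is the exponential of the cross-entropy loss, assuming the loss is measured in nats (natural logarithm), which is the usual convention. Because the table rounds the loss to two decimals, the recomputed values should match the reported ones only approximately. A short check, with the numbers copied from the table above:

```python
import math

# (reported perplexity, reported loss) copied from the table above.
reported = {
    "yhavinga/gpt-neo-125M-dutch": (20.9, 3.04),
    "yhavinga/gpt2-medium-dutch": (15.1, 2.71),
    "yhavinga/gpt2-large-dutch": (15.1, 2.72),
    "yhavinga/gpt-neo-1.3B-dutch": (16.0, 2.77),
}


def perplexity(loss):
    """Perplexity from a cross-entropy loss in nats (assumed convention)."""
    return math.exp(loss)


for name, (ppl, loss) in reported.items():
    print(f"{name}: reported ppl {ppl}, exp(loss) = {perplexity(loss):.2f}")
```

The small differences between the recomputed and reported values are consistent with rounding of the loss column.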

Acknowledgements

This project would not have been possible without compute generously provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also instrumental in most, if not all, parts of the training. The following repositories were helpful in setting up the TPU-VM and training the models:

Created by Yeb Havinga