
FolkGPT

This model is a fine-tuned version of gpt2 on the vicclab/fairy_tales dataset.

Model description

This model is the result of fine-tuning gpt2 on a dataset of fairy tales from various cultures.

Intended uses & limitations

The idea behind this model is to generate text in the style of fairy tales written in the 18th and 19th centuries.

Why? Fairy tales seemed an appropriate application for text generation, as stories are usually short(ish), self-contained, and easy to read.
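As a sketch of the intended use, the snippet below generates a short fairy-tale passage with the Hugging Face pipeline API. The repository id vicclab/FolkGPT and the sampling settings are illustrative assumptions, not values taken from this card.

```python
from transformers import pipeline

# Hypothetical Hub repository id for this model; adjust to the actual path.
generator = pipeline("text-generation", model="vicclab/FolkGPT")

# Prompt in fairy-tale style and sample a continuation.
prompt = "Once upon a time, in a kingdom beyond the mountains,"
result = generator(
    prompt,
    max_new_tokens=100,  # length of the continuation (illustrative)
    do_sample=True,      # sample rather than decode greedily
    top_p=0.95,          # nucleus sampling cutoff (illustrative)
    temperature=0.9,     # softer distribution for more varied prose
)

print(result[0]["generated_text"])
```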

Training and evaluation data

Trained on the vicclab/fairy_tales dataset. The dataset consists of a number of texts downloaded from Project Gutenberg and edited to remove everything except the stories themselves. These were then concatenated into a single text file and pushed to the Hub at https://huggingface.co/datasets/vicclab/fairy_tales. The latest update to the dataset, which was used to train this model, was created and uploaded on February 26th, 2023.

Texts used [and token count after removing boilerplate text]:

- https://www.gutenberg.org/files/2591/2591-0.txt [102927 tokens]
- https://www.gutenberg.org/files/503/503-0.txt [138353 tokens]
- https://www.gutenberg.org/cache/epub/69739/pg69739.txt [51035 tokens]
- https://www.gutenberg.org/files/2435/2435-0.txt [98791 tokens]
- https://www.gutenberg.org/cache/epub/7871/pg7871.txt [49410 tokens]
- https://www.gutenberg.org/files/8933/8933-0.txt [178622 tokens]
- https://www.gutenberg.org/cache/epub/30834/pg30834.txt [58359 tokens]
- https://www.gutenberg.org/cache/epub/68589/pg68589.txt [39815 tokens]
- https://www.gutenberg.org/cache/epub/34453/pg34453.txt [69365 tokens]
- https://www.gutenberg.org/cache/epub/8653/pg8653.txt [35351 tokens]

Total tokens in the assembled dataset: 1002654.
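For reference, the corpus can be pulled straight from the Hub with the datasets library. The snippet below is a minimal sketch; it assumes the repository exposes a default "train" split with a "text" column, which may differ from the exact layout of the repo.

```python
from datasets import load_dataset

# Load the concatenated fairy-tale corpus from the Hub.
# Split and column names are assumptions; adjust if the repo differs.
dataset = load_dataset("vicclab/fairy_tales", split="train")

print(dataset)                    # number of records and column names
print(dataset[0]["text"][:200])   # first 200 characters of the first record
```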

Training procedure

The dataset was loaded, sampling by paragraph, and then split 80/20 into a training set and a validation set. Both splits were tokenized, the model was set up, and the trainer was instantiated with the training arguments listed below, after which training was run.
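The sketch below mirrors that procedure with the transformers Trainer API. The column name, block size, and every value passed to TrainingArguments are illustrative assumptions; the actual hyperparameters used for this model are not reproduced here.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the paragraph-level records and make an 80/20 train/validation split.
raw = load_dataset("vicclab/fairy_tales", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)

# Tokenize with the GPT-2 tokenizer; GPT-2 has no pad token, so reuse EOS.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    # Assumes the raw records live in a "text" column.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = splits.map(tokenize, batched=True, remove_columns=["text"])

# Causal-LM collator (mlm=False) builds the labels from the input ids.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Illustrative settings only; not the values used to train this model.
args = TrainingArguments(
    output_dir="folkgpt",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)

trainer.train()
trainer.evaluate()  # eval loss on the 20% validation split
```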

Training hyperparameters

The following hyperparameters were used during training:

Training results

Framework versions