

# gpt2-shakespeare

This model is a fine-tuned version of gpt2 on a text corpus of Shakespeare's works. It achieves the following results on the evaluation set:

- Loss: 2.5738

## Model description

The GPT-2 model is fine-tuned on a text corpus of Shakespeare's works so that it generates text in Shakespeare's style.

## Intended uses & limitations

The intended use of this model is to generate fiction in the style of Shakespeare. It is not suited to writing in the style of other authors.

## Dataset description

A text corpus was developed for fine-tuning the gpt-2 model. The books were downloaded from Project Gutenberg as plain text files. A large corpus was needed to train the model to be able to write in Shakespeare's style.

The following books were used to develop the text corpus:

The corpus has 1,078,389 word tokens in total.
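The card does not say how the token count was taken; since the corpus is plain text, a simple whitespace-delimited word count is a reasonable reading. A minimal sketch (the corpus string below is a stand-in for the concatenated Gutenberg files):

```python
# Count whitespace-delimited word tokens in a text corpus.
# The string here is a placeholder for the concatenated plain-text files.
corpus = "Shall I compare thee to a summer's day? Thou art more lovely"

def count_word_tokens(text: str) -> int:
    """Return the number of whitespace-delimited word tokens."""
    return len(text.split())

print(count_word_tokens(corpus))  # 12
```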

## Dataset preprocessing

## Training and evaluation data

The training dataset has 880,447 word tokens and the test dataset has 197,913 word tokens.
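These counts correspond to roughly an 82/18 train/test split. The actual split procedure is not documented in this card; a minimal sketch of a sequential split by word tokens, under that assumption:

```python
# Sketch of an ~82/18 sequential train/test split by word tokens.
# The real corpus and split procedure are not documented in this card;
# the short string below is only an illustration.
corpus = "one two three four five six seven eight nine ten"
tokens = corpus.split()

split_point = int(len(tokens) * 0.82)  # ~82% train, matching the counts above
train_tokens = tokens[:split_point]
test_tokens = tokens[split_point:]

print(len(train_tokens), len(test_tokens))  # 8 2
```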

## Training procedure

The model was trained with the `Trainer` API from the Hugging Face Transformers library.
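The training script itself is not included in this card. A minimal sketch of fine-tuning gpt2 on plain-text files with the `Trainer` API might look as follows; the file paths, block size, and epoch count are illustrative assumptions, not the values used for this model:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    Trainer,
    TrainingArguments,
)

def fine_tune(train_path: str, eval_path: str, output_dir: str = "./gpt2-shakespeare"):
    """Fine-tune gpt2 on plain-text files; all paths and hyperparameters are illustrative."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Chunk each plain-text file into fixed-length blocks of token ids.
    train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_path, block_size=128)
    eval_dataset = TextDataset(tokenizer=tokenizer, file_path=eval_path, block_size=128)

    # GPT-2 is a causal LM, so masked-LM collation is disabled.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=3,   # assumption: consistent with the ~2.5 epochs in the results table
        logging_steps=500,    # first loss is logged at step 500, as in the results table
    )

    trainer = Trainer(
        model=model,
        args=args,
        data_collator=collator,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    trainer.save_model(output_dir)

if __name__ == "__main__":
    # fine_tune("train.txt", "test.txt")  # paths are placeholders
    pass
```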

### Training hyperparameters

The following hyperparameters were used during training:

### Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| No log        | 0.63  | 250  | 2.7133          |
| 2.8492        | 1.25  | 500  | 2.6239          |
| 2.8492        | 1.88  | 750  | 2.5851          |
| 2.3842        | 2.51  | 1000 | 2.5738          |
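The validation loss reported by the Trainer is the average cross-entropy per token in nats, so the corresponding perplexity is exp(loss). For the final checkpoint:

```python
import math

# Perplexity corresponding to the final validation loss of 2.5738.
val_loss = 2.5738
perplexity = math.exp(val_loss)
print(round(perplexity, 2))  # ≈ 13.12
```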

## Sample code using the Transformers pipeline

```python
from transformers import pipeline

# Load the fine-tuned model from the local directory, reusing the base gpt2 tokenizer
story = pipeline('text-generation', model='./gpt2-shakespeare', tokenizer='gpt2', max_length=300)
story("how art thou")
```

### Framework versions