# charles-dickens-gpt2

This model is a fine-tuned version of [gpt2](https://huggingface.co/gpt2) on a dataset created from Charles Dickens's books. It achieves the following results on the evaluation set:
- Loss: 3.2286
## Model description

The GPT-2 model is fine-tuned on a text corpus drawn from Charles Dickens's books.
## Intended uses & limitations

The model generates text in Charles Dickens's style. One limitation is that the generated text does not always follow a chronological order. Another stems from preprocessing: because the corpus was split into sentences at every full stop, sentences containing honorifics were truncated prematurely, for example at the period in "Mr.".
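As an illustration of the intended use, the snippet below sketches how text could be generated with the fine-tuned checkpoint using the `transformers` text-generation pipeline. The model identifier `charles-dickens-gpt2` is a placeholder for the actual Hub repository id or local path of the checkpoint, and the prompt and sampling settings are assumptions, not values from this card.

```python
from transformers import pipeline

# Placeholder model id; replace with the actual Hub repo id or local checkpoint path.
generator = pipeline("text-generation", model="charles-dickens-gpt2")

prompt = "It was a cold, foggy evening in London, and"
outputs = generator(
    prompt,
    max_new_tokens=60,        # length of the continuation
    do_sample=True,           # sample to get more varied, Dickens-flavoured text
    top_p=0.95,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
```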
## List of books included in the text corpus
- A Christmas Carol
- A Tale of Two Cities
- David Copperfield
- Great Expectations
- Hard Times
- Hunted Down
- Oliver Twist Vol 1 of 3
- Oliver Twist
- The Magic Fishbone
## Number of tokens in each book and in total
- Total number of tokens in the corpus: 1029751
- A Christmas Carol: 28691
- A Tale of Two Cities: 135641
- David Copperfield: 353905
- Great Expectations: 184350
- Hard Times: 102939
- Hunted Down: 8627
- Oliver Twist Vol 1 of 3: 54622
- Oliver Twist: 157172
- The Magic Fishbone: 3805
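The card does not state how these counts were computed, so the sketch below is only an assumption: it counts GPT-2 BPE tokens per book with the GPT-2 tokenizer, with the directory and file names as placeholders.

```python
from pathlib import Path
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Placeholder directory containing one plain-text file per book.
corpus_dir = Path("books")

total = 0
for book in sorted(corpus_dir.glob("*.txt")):
    text = book.read_text(encoding="utf-8")
    n_tokens = len(tokenizer(text)["input_ids"])
    total += n_tokens
    print(f"{book.stem}: {n_tokens}")
print(f"Total: {total}")
```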
## Dataset

The dataset comprises nine books by Charles Dickens. The books were downloaded in plain-text format from the Project Gutenberg website between 18 and 24 February 2023. The data was collected to generate text in Charles Dickens's style. Links to the dataset: https://github.umn.edu/tasni008/charles-dickens-gpt2 and https://drive.google.com/drive/folders/1f0R69L9jltXJaRcHJmcSkgMjnKk5VpRr?usp=sharing
## Text Preprocessing

Newlines were replaced with whitespace, and the text was divided into shorter sentences at the punctuation marks ?, ., ;, and !. Sentences with a length greater than 200 were discarded, and sentences with lengths between 100 and 200 were split once more at commas. The Project Gutenberg notes and boilerplate at the beginning and end of each book were removed manually.
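A minimal sketch of the preprocessing described above, assuming lengths are measured in characters and that each punctuation mark stays attached to the sentence it ends; the exact implementation used to build the dataset may differ.

```python
import re
from typing import List


def preprocess(raw_text: str) -> List[str]:
    # Replace newlines with whitespace.
    text = raw_text.replace("\n", " ")

    # Split into sentences at ?, ., ; and ! (the punctuation stays with its sentence).
    sentences = [s.strip() for s in re.split(r"(?<=[?.;!])\s+", text) if s.strip()]

    processed = []
    for sentence in sentences:
        if len(sentence) > 200:
            # Discard overly long sentences.
            continue
        if len(sentence) > 100:
            # Split once more at commas and keep the shorter pieces.
            processed.extend(p.strip() for p in sentence.split(",") if p.strip())
        else:
            processed.append(sentence)
    return processed
```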
## Training and evaluation data

The data is split into training and test sets, with 20% held out as test data. The training set contains 88981 tokens and the test set contains 22246 tokens.
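One way to obtain such a split with the `datasets` library is sketched below. The 20% test fraction comes from the card; the file name is a placeholder, and the seed actually used for the split is not stated.

```python
from datasets import load_dataset

# Placeholder file containing one preprocessed sentence per line.
dataset = load_dataset("text", data_files="dickens_sentences.txt")["train"]

# 80/20 train/test split; the split seed is an assumption.
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = split["train"], split["test"]
print(train_ds.num_rows, test_ds.num_rows)
```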
## Training procedure

The model was trained with the Trainer API from the Hugging Face Transformers library; a configuration sketch follows the hyperparameter list below.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 4
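The following is a sketch of how the Trainer could be configured with the hyperparameters listed above, assuming a causal language-modelling setup with the GPT-2 tokenizer and a single device (so the listed batch sizes map to the per-device arguments). The dataset variables `train_ds`/`test_ds` (assumed to be already tokenized), the output directory, and the logging settings are assumptions, not details from the card; the evaluation interval of 50 steps matches the results table below.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

args = TrainingArguments(
    output_dir="charles-dickens-gpt2",   # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=4,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,              # assumed tokenized training split
    eval_dataset=test_ds,                # assumed tokenized test split
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```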
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 4.5557        | 0.19  | 50   | 4.1287          |
| 4.0166        | 0.39  | 100  | 3.6971          |
| 3.7114        | 0.58  | 150  | 3.5248          |
| 3.5769        | 0.77  | 200  | 3.4414          |
| 3.4964        | 0.97  | 250  | 3.3912          |
| 3.4327        | 1.16  | 300  | 3.3578          |
| 3.3962        | 1.35  | 350  | 3.3368          |
| 3.3791        | 1.54  | 400  | 3.3164          |
| 3.3573        | 1.74  | 450  | 3.2998          |
| 3.3419        | 1.93  | 500  | 3.2851          |
| 3.294         | 2.12  | 550  | 3.2762          |
| 3.2767        | 2.32  | 600  | 3.2665          |
| 3.2534        | 2.51  | 650  | 3.2563          |
| 3.2607        | 2.7   | 700  | 3.2471          |
| 3.2593        | 2.9   | 750  | 3.2401          |
| 3.2224        | 3.09  | 800  | 3.2409          |
| 3.1909        | 3.28  | 850  | 3.2358          |
| 3.192         | 3.47  | 900  | 3.2330          |
| 3.2001        | 3.67  | 950  | 3.2301          |
| 3.199         | 3.86  | 1000 | 3.2286          |
### Framework versions
- Transformers 4.26.1
- Pytorch 1.13.1+cu116
- Tokenizers 0.13.2