
charles-dickens-gpt2

This model is a fine-tuned version of gpt2 on a dataset created from Charles Dickens's books. It reaches a final validation loss of 3.2286 on the evaluation set (see the training results table below).

Model description

The GPT-2 model is fine-tuned on a text corpus drawn from Charles Dickens's books.

Intended uses & limitations

The model generates text in Charles Dickens's style. One limitation is that the generated passages do not always read in a coherent chronological order. Another is that, because the fine-tuning data was split into sentences at every full stop, sentences containing honorifics were truncated prematurely; for example, a sentence could be cut off right after the period in "Mr.". A short usage sketch follows.
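
As a sketch of the intended use, the model can be loaded with the Transformers text-generation pipeline. The model identifier below is a placeholder; replace it with the actual repository name or a local checkpoint path.

```python
from transformers import pipeline

# Placeholder model identifier; point this at the published repository or a
# local directory containing the fine-tuned checkpoint.
generator = pipeline("text-generation", model="charles-dickens-gpt2")

prompt = "It was a cold, foggy evening in London, and"
outputs = generator(
    prompt,
    max_new_tokens=60,   # length of the generated continuation
    do_sample=True,      # sample rather than decode greedily
    top_p=0.95,
    temperature=0.9,
)
print(outputs[0]["generated_text"])
```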

List of books included in the text corpus

  1. A Christmas Carol
  2. A Tale of Two Cities
  3. David Copperfield
  4. Great Expectations
  5. Hard Times
  6. Hunted Down
  7. Oliver Twist Vol 1 of 3
  8. Oliver Twist
  9. The Magic Fishbone

Number of tokens in each book and in total

  1. Total number of tokens in the corpus: 1029751
  2. A Christmas Carol: 28691
  3. A Tale of Two Cities: 135641
  4. David Copperfield: 353905
  5. Great Expectations: 184350
  6. Hard Times: 102939
  7. Hunted Down: 8627
  8. Oliver Twist Vol 1 of 3: 54622
  9. Oliver Twist: 157172
  10. The Magic Fishbone: 3805
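
The card does not state how these counts were computed. The sketch below shows one way to reproduce a per-book count, under the assumption that the GPT-2 tokenizer was used and that each book is stored as a plain-text file (the file path is hypothetical).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def count_tokens(path: str) -> int:
    """Count GPT-2 tokens in a plain-text file."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Tokenize the whole book; only the token ids are needed for the count.
    return len(tokenizer(text)["input_ids"])

print(count_tokens("corpus/a_christmas_carol.txt"))
```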

Dataset

The dataset comprises nine books written by Charles Dickens. The books were downloaded in plain-text format from the Project Gutenberg website between 18 and 24 February 2023, and the data was collected to fine-tune a model that generates text in Charles Dickens's style. Links to the dataset: https://github.umn.edu/tasni008/charles-dickens-gpt2 and https://drive.google.com/drive/folders/1f0R69L9jltXJaRcHJmcSkgMjnKk5VpRr?usp=sharing
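
A minimal sketch of how the books could be fetched from Project Gutenberg follows; the ebook numbers and URL pattern are assumptions and should be checked against the actual Gutenberg page for each title.

```python
import os
import requests

# Assumed title -> Project Gutenberg plain-text URL mapping (the ebook numbers
# are illustrative; verify them on gutenberg.org before use).
BOOKS = {
    "A Christmas Carol": "https://www.gutenberg.org/cache/epub/46/pg46.txt",
    "Great Expectations": "https://www.gutenberg.org/cache/epub/1400/pg1400.txt",
}

def download_book(title: str, url: str, out_dir: str = "corpus") -> str:
    """Download one book as plain text and save it under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    path = os.path.join(out_dir, f"{title.lower().replace(' ', '_')}.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return path

for title, url in BOOKS.items():
    print(download_book(title, url))
```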

Text Preprocessing

Newlines were replaced with whitespace, and the text was split into shorter sentences at occurrences of the punctuation marks ?, ., ;, and !. Sentences with length greater than 200 were discarded, and sentences with length between 100 and 200 were split once more at commas. The Project Gutenberg notes and boilerplate at the beginning and end of each book were removed manually.
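
The following is a minimal sketch of the preprocessing described above, assuming the length thresholds refer to character counts; the exact rules used to build the original dataset may differ.

```python
import re

def preprocess(raw_text: str) -> list[str]:
    """Split a book into short sentence fragments as described above."""
    # Replace newlines (and surrounding whitespace) with a single space.
    text = re.sub(r"\s*\n\s*", " ", raw_text)
    # Split on the punctuation marks ? . ; ! and drop empty pieces.
    pieces = [p.strip() for p in re.split(r"[?.;!]", text) if p.strip()]

    sentences = []
    for piece in pieces:
        if len(piece) > 200:
            # Discard overly long sentences.
            continue
        if len(piece) > 100:
            # Shorten once more at commas.
            sentences.extend(s.strip() for s in piece.split(",") if s.strip())
        else:
            sentences.append(piece)
    return sentences
```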

Training and evaluation data

The data is split so that 20% of it serves as test data. The training set contains 88981 tokens and the test set contains 22246 tokens.
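
A sketch of the 80/20 split using the datasets library; the placeholder sentences stand in for the preprocessed fragments produced in the previous step.

```python
from datasets import Dataset

# Placeholder for the preprocessed sentence fragments from the previous step.
sentences = [
    "Marley was dead, to begin with",
    "There is no doubt whatever about that",
    "It was the best of times",
    "It was the worst of times",
    "Please, sir, I want some more",
]

dataset = Dataset.from_dict({"text": sentences})

# Hold out 20% of the data as the test set.
split = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, test_ds = split["train"], split["test"]
print(len(train_ds), len(test_ds))
```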

Training procedure

The model is trained with the Trainer API from the Hugging Face Transformers library.
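
Below is a minimal sketch of the fine-tuning setup with the Trainer API. The hyperparameter values shown are illustrative placeholders, not the ones used to produce the reported results, and `train_ds` / `test_ds` refer to the splits from the previous sketch.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no padding token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

# train_ds and test_ds are the splits from the previous sketch.
train_tok = train_ds.map(tokenize, batched=True, remove_columns=["text"])
test_tok = test_ds.map(tokenize, batched=True, remove_columns=["text"])

# Causal language modeling: mlm=False makes the collator build next-token labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="charles-dickens-gpt2",
    evaluation_strategy="steps",    # evaluate every eval_steps, as in the table below
    eval_steps=50,
    logging_steps=50,
    num_train_epochs=4,             # illustrative value
    per_device_train_batch_size=8,  # illustrative value
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=test_tok,
    data_collator=collator,
)
trainer.train()
```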

Training hyperparameters

The following hyperparameters were used during training:

Training results

| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:----:|:---------------:|
| 4.5557        | 0.19  | 50   | 4.1287          |
| 4.0166        | 0.39  | 100  | 3.6971          |
| 3.7114        | 0.58  | 150  | 3.5248          |
| 3.5769        | 0.77  | 200  | 3.4414          |
| 3.4964        | 0.97  | 250  | 3.3912          |
| 3.4327        | 1.16  | 300  | 3.3578          |
| 3.3962        | 1.35  | 350  | 3.3368          |
| 3.3791        | 1.54  | 400  | 3.3164          |
| 3.3573        | 1.74  | 450  | 3.2998          |
| 3.3419        | 1.93  | 500  | 3.2851          |
| 3.294         | 2.12  | 550  | 3.2762          |
| 3.2767        | 2.32  | 600  | 3.2665          |
| 3.2534        | 2.51  | 650  | 3.2563          |
| 3.2607        | 2.7   | 700  | 3.2471          |
| 3.2593        | 2.9   | 750  | 3.2401          |
| 3.2224        | 3.09  | 800  | 3.2409          |
| 3.1909        | 3.28  | 850  | 3.2358          |
| 3.192         | 3.47  | 900  | 3.2330          |
| 3.2001        | 3.67  | 950  | 3.2301          |
| 3.199         | 3.86  | 1000 | 3.2286          |

Framework versions