

Model Description

The aim of this project is to streamline distilgpt2 to achieve the lowest possible loss and perplexity while ensuring that the generated text remains formal. To do so, [distilgpt2] was fine-tuned on the IMDb dataset. The IMDb dataset was chosen from the datasets provided by Hugging Face for the following reasons. First, datasets specific to a particular field were excluded; for example, ELI5 was not used because it concentrates solely on science, technology, and engineering. Second, datasets derived from social network services were excluded because they often contain slang and vernacular. Lastly, the chosen dataset should be written in sophisticated language and should contain words that convey both feelings and facts. Considering these criteria, the IMDb dataset was selected.
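For reference, the base checkpoint and its tokenizer can be loaded from the Hub as in the minimal sketch below (standard transformers usage; the exact loading code for this card is not shown here).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base distilgpt2 checkpoint and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
```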

Training and evaluation data

IMDb is an online movie database that provides comprehensive information about movies, actors, filmmakers, and related industry professionals, and it is one of the most popular and widely used sources of film information. The dataset is composed of three splits: [train], [test], and [unsupervised]. The [train] and [test] splits each contain 25,000 highly polar movie reviews, and [unsupervised] contains 50,000 reviews. Each example has a 'text' field and a 'label' field. The 'text' field holds an individual consumer's or critic's opinion of the movie. The 'label' field indicates whether the review was positive (1) or negative (0); for the [unsupervised] split the label is uninformative and set to -1.
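As a quick sketch of what the raw data looks like (standard datasets usage), the splits and fields can be inspected as follows:

```python
from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)                       # DatasetDict with 'train', 'test', and 'unsupervised' splits
print(imdb["train"][0]["text"])   # the review text of the first training example
print(imdb["train"][0]["label"])  # 1 = positive, 0 = negative (-1 in the unsupervised split)
```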

Among these splits, the model only uses the first 5,000 reviews of the [train] split; otherwise, training would take an unsupportable amount of time given the CPU constraint. The sampling code used for this model is provided below.

```python
imdb = imdb["train"].select(range(5000))
```
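The preprocessing code itself is not included in this card. A minimal sketch of the standard causal-language-modeling recipe (tokenize the reviews, then concatenate and chunk them into fixed-length blocks) applied to the subset selected above might look like this; `block_size` is an assumed value, not taken from the original run.

```python
from transformers import AutoTokenizer

# The distilgpt2 tokenizer (the same one loaded in the sketch above).
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
block_size = 128  # assumed chunk length


def tokenize(examples):
    return tokenizer(examples["text"])


def group_texts(examples):
    # Concatenate all tokenized reviews, then split them into fixed-size blocks
    # whose labels are the input ids themselves (standard causal-LM setup).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# `imdb` is the 5,000-review subset selected above.
tokenized = imdb.map(tokenize, batched=True, remove_columns=["text", "label"])
lm_dataset = tokenized.map(group_texts, batched=True)
```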

Training and Evaluation

The following hyperparameters were used during training:
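The specific values are not reproduced in this section. Purely as a hedged illustration of how such a Trainer setup might look (these are not the recorded hyperparameters; only the three epochs are grounded in the results table below, and `tokenizer`/`lm_dataset` reuse names from the preprocessing sketch above):

```python
from transformers import (
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# GPT-2 has no pad token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Illustrative values only; num_train_epochs=3 matches the results table below.
training_args = TrainingArguments(
    output_dir="my_awesome_imdb_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
)

# Hypothetical train/validation split of the grouped dataset from the sketch above.
split = lm_dataset.train_test_split(test_size=0.2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=data_collator,
)
trainer.train()
```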

To assess the efficacy of the model, the perplexity was computed both before and after fine-tuning. Perplexity is a statistical metric that measures how well the model predicts new, unseen text; the lower the perplexity, the better the model (Hugging Face, n.d.).
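Concretely, perplexity is the exponential of the average per-token cross-entropy loss on the evaluation data, where $p_\theta$ is the language model and $x_1, \dots, x_N$ are the evaluation tokens:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$

This is why it can be computed directly from the evaluation loss, as in the code further below.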

By fine-tuning [distilgpt2] on the [imdb] dataset, the perplexity of the model decreased from 23.18 to 3.51 (consistent with the final validation loss in the results table, since exp(1.2565) ≈ 3.51). Therefore, it is plausible to say that the bias in [distilgpt2] has been reduced by the [imdb] dataset. The following code was used to calculate the perplexity.

```python
import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
```
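The 23.18 figure before fine-tuning was presumably obtained in the same way from the untouched distilgpt2 checkpoint. A sketch, assuming the base model is evaluated on the same evaluation set (`split["test"]` and `data_collator` reuse the hypothetical names from the training sketch above):

```python
import math
from transformers import AutoModelForCausalLM, Trainer

# Evaluate the unmodified distilgpt2 checkpoint on the same evaluation set
# to obtain the "before fine-tuning" perplexity.
base_model = AutoModelForCausalLM.from_pretrained("distilgpt2")
base_trainer = Trainer(
    model=base_model,
    eval_dataset=split["test"],
    data_collator=data_collator,
)
base_eval = base_trainer.evaluate()
print(f"Perplexity before fine-tuning: {math.exp(base_eval['eval_loss']):.2f}")
```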

Training Results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.4274        | 1.0   | 6000  | 1.3390          |
| 1.3495        | 2.0   | 12000 | 1.2737          |
| 1.3243        | 3.0   | 18000 | 1.2565          |

Limitations

The model encountered an unexpected error that has not yet been resolved by Hugging Face. Even though the model was successfully uploaded to huggingface.co/models, loading it consistently produced the error below. This error is believed to stem from model.save_pretrained(), which saves only the model weights and configuration, not the tokenizer files.

```
OSError: Can't load tokenizer for 'pellucid/my_awesome_imdb_clm-model'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'pellucid/my_awesome_imdb_clm-model' is the correct path to a directory containing all relevant files for a GPT2TokenizerFast tokenizer.
```

The code below was used to load the model for use in Google Colab.

prompt = "This is just a precious little diamond. The play, the script are excellent."

from transformers import AutoTokenizer

MODEL = f"my_awesome_imdb_clm-model" tokenizer.save_pretrained(MODEL) tokenizer = AutoTokenizer.from_pretrained(MODEL)

from transformers import pipeline

tokenizer.save_pretrained("my_awesome_imdb_clm-model") generator = pipeline("text-generation", model="my_awesome_imdb_clm-model") generator(prompt)

Note that tokenizer.save_pretrained(MODEL) was added before model.save_pretrained(MODEL). Doing so adds the tokenizer files to the model folder so that the code can work properly. Finally, this is the resulting output.

'This is just a precious little diamond. The play, the script are excellent. T h e j o k e s t o r y t i m e w h o h a v e b e e n w i t h a c o n t r a i n o f t h e s t o r y t o t h e h o u s e w i t h '
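A possible permanent fix (not verified here) would be to upload the tokenizer files to the same Hub repository, so that remote loading works without the local workaround:

```python
# Hypothetical fix: push the tokenizer files to the same Hub repository
# so that AutoTokenizer and pipeline can load them remotely.
tokenizer.push_to_hub("pellucid/my_awesome_imdb_clm-model")
```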

Framework versions