William Shakespeare Writer
NLP models have shown some of the most striking abilities among recent machine learning releases. The internet blew up when ChatGPT launched; people keep posting about it, dazzled by its communication skills, especially on summarization, translation, and question-answering tasks. As someone who has worked with machine learning for the last five years and understands how it works under the hood, I am still awed by how well LLMs condense information and how sharply they write.
At the core of LLMs and other NLP milestones are the Transformer architecture and its self-attention mechanism, which lets a neural network weigh every preceding position in a sequence when processing the current one. As a result, models capture syntax and semantics in new ways. Specifying those models in Python is more straightforward than it might look. Practitioners can use PyTorch, TensorFlow, and JAX to specify transformer-based architectures in a high-level language while making good use of local hardware. The HuggingFace libraries are a newer high-level Python option for specifying, training, and fine-tuning models.
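To make the self-attention idea concrete, here is a minimal sketch in plain Python of single-head scaled dot-product attention with a causal mask. It is an illustration only: the function names and the toy vectors are assumptions, and the learned query, key, and value projections of a real Transformer are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length vectors, one per position.
    Position i may only attend to positions 0..i.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # Score the query against every key up to and including position i.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Future positions receive zero weight.
        weights += [0.0] * (len(queries) - i - 1)
        all_weights.append(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ])
    return outputs, all_weights

# Toy, hand-picked vectors (not learned weights).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is the attention distribution of one position; the entries to the right of the diagonal are zero because a causal model never looks ahead.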
I decided to learn the HuggingFace library by building an old dream of mine: a generative model that writes like a classical author. I picked William Shakespeare. Project Gutenberg (gutenberg.org) hosts many classical masterpieces, including Shakespeare's complete works. The william_shakespeare__writer project attempts to continue a given initial text the way Shakespeare would. My favorite use case is checking how Shakespeare would have written popular songs if he had been the composer. In NLP terms, the task is causal language modeling. Without further ado, let's jump into it.
Model description
The william_shakespeare__writer model is a causal language model trained on the complete works of William Shakespeare. It uses the GPT-2 architecture and trains it from scratch. The full model specification relies on the HuggingFace transformers and datasets libraries.
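As a rough illustration of what "causal language model" means, the sketch below pairs every token with the token that follows it and scores predictions with the average cross-entropy. The helper names, the 4-token vocabulary, and the uniform "model" are all hypothetical; the real model computes this loss internally.

```python
import math

def next_token_pairs(token_ids):
    """Pair each position's input token with the token that follows it."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def average_cross_entropy(predicted_probs, targets):
    """Mean negative log-likelihood of the target tokens."""
    losses = [
        -math.log(probs[target])
        for probs, target in zip(predicted_probs, targets)
    ]
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary and a toy sequence.
sequence = [2, 0, 3, 1]
pairs = next_token_pairs(sequence)
print(pairs)  # [(2, 0), (0, 3), (3, 1)]

# A model that knows nothing spreads probability evenly over the vocabulary.
uniform = [[0.25] * 4 for _ in pairs]
loss = average_cross_entropy(uniform, [target for _, target in pairs])
print(round(loss, 4))  # 1.3863, i.e. ln(4)
```

A trained model should beat the uniform baseline, which for a vocabulary of size V is ln(V).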
Data pulling and dataset object
The following code snippet pulls the data from gutenberg.org, breaks the corpus into stanzas, and splits it into training and validation datasets. Two variables, hugging_face_dt_training and hugging_face_dt_val, hold the data for the upcoming training.
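import requests
import pandas as pd
from datasets import Dataset

# Plain-text edition of the complete works on Project Gutenberg
william_shakespeare__full_work = 'https://www.gutenberg.org/cache/epub/100/pg100.txt'
response = requests.get(william_shakespeare__full_work)
response.raise_for_status()  # fail early if the download did not succeed
data = response.text

# Skip the front matter and cut the Project Gutenberg footer
data = data.split('THE SONNETS')[2]
data = data.split("*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***")[0]

# A blank line separates stanzas; join each stanza into a single line
stanzas = data.split('\r\n\r\n')
stanzas = list(map(lambda x: x.replace("\r\n", " "), stanzas))
# Collapse double spaces into single spaces
stanzas = list(map(lambda x: x.replace("  ", " "), stanzas))
# Drop headings, stage directions, and other short fragments
stanzas = list(filter(lambda x: len(x) > 64, stanzas))

# First 15,500 stanzas for training, the rest for validation
training_dataset = stanzas[:15500]
valid_dataset = stanzas[15500:]
hugging_face_dt_training = Dataset.from_pandas(pd.DataFrame(training_dataset, columns=["corpus"]))
hugging_face_dt_val = Dataset.from_pandas(pd.DataFrame(valid_dataset, columns=["corpus"]))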
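The cleaning steps can be tried offline on a small in-memory sample before downloading the full corpus. The sketch below is a variant, not the snippet itself: the function name split_into_stanzas, the sample text, and the split point are assumptions, and it collapses whitespace with split/join rather than chained replace calls.

```python
def split_into_stanzas(text, min_chars=64):
    """Split raw text into single-line stanzas, dropping short fragments."""
    # Project Gutenberg plain-text files use Windows line endings,
    # so a blank line shows up as "\r\n\r\n".
    blocks = text.split("\r\n\r\n")
    # Join the lines of each block and collapse runs of whitespace.
    stanzas = [" ".join(block.split()) for block in blocks]
    return [s for s in stanzas if len(s) > min_chars]

# Hypothetical miniature corpus standing in for the downloaded file.
sample = (
    "ACT I\r\n\r\n"
    "Shall I compare thee to a summer's day?\r\n"
    "Thou art more lovely and more temperate:\r\n\r\n"
    "Exit.\r\n\r\n"
    "Rough winds do shake the darling buds of May,\r\n"
    "And summer's lease hath all too short a date:\r\n"
)
stanzas = split_into_stanzas(sample)
split_point = 1  # the full corpus is split at 15500 instead
training_sample, valid_sample = stanzas[:split_point], stanzas[split_point:]
print(len(stanzas), "stanzas kept")
```

The short fragments ("ACT I", "Exit.") fall below the 64-character threshold and are dropped, which is the same filter the full pipeline applies.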
Tokenizer
The AutoTokenizer class from transformers gives access to a myriad of tokenizers. I use the standard GPT-2 tokenizer, a byte-level version of Byte Pair Encoding (BPE).
from transformers import AutoTokenizer

context_length = 64
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(element):
    # Split every stanza into chunks of at most context_length tokens
    outputs = tokenizer(
        element["corpus"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep every chunk, including the shorter final one
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        input_batch.append(input_ids)
    return {"input_ids": input_batch}

tokenized_datasets__valid = hugging_face_dt_val.map(
    tokenize, batched=True, remove_columns=hugging_face_dt_val.column_names
)
tokenized_datasets__training = hugging_face_dt_training.map(
    tokenize, batched=True, remove_columns=hugging_face_dt_training.column_names
)
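To see what truncation combined with return_overflowing_tokens does to a long stanza, the following sketch imitates the chunking in plain Python, assuming the default stride of zero. The function chunk_token_ids and the fake token ids are hypothetical stand-ins; the real work happens inside the tokenizer.

```python
def chunk_token_ids(token_ids, context_length=64):
    """Split one sequence of token ids into consecutive windows."""
    return [
        token_ids[start:start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

# Hypothetical ids standing in for one tokenized stanza of 150 tokens.
fake_ids = list(range(150))
chunks = chunk_token_ids(fake_ids)
print([len(chunk) for chunk in chunks])  # [64, 64, 22]
```

The last chunk is usually shorter than context_length, which is why a padding token is needed when batches are assembled for training.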
Model Specification
With the transformers library we can instantiate the GPT-2 architecture seamlessly. The model variable holds the architecture with freshly initialized weights.
from transformers import GPT2LMHeadModel, AutoConfig

# Reuse the GPT-2 configuration, adjusted to this tokenizer and context length
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
# Building the model from a config initializes the weights randomly
model = GPT2LMHeadModel(config)
model_size = sum(t.numel() for t in model.parameters())
print(f"GPT size: {model_size/1000**2:.1f}M parameters")
... GPT size: 124.2M parameters
Model Training
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# GPT-2 has no padding token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects the causal (next-token) objective instead of masked LM
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="william_shakespeare__writer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=250,
    logging_steps=100,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=500,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets__training,
    eval_dataset=tokenized_datasets__valid,
)
trainer.train()
... Step Training Loss Validation Loss
... 250 5.741500 5.936911
... 500 4.634500 5.560539
Intended uses & limitations
The Shakespeare writer exists for study and amusement. It aims to explore the HuggingFace libraries and the power of transformers for causal language modeling.
How to use
You can use this model directly with a pipeline for causal language modeling:
import torch
from transformers import pipeline
pipe = pipeline(
    "text-generation", model="JeanFaootMaia/william_shakespeare__writer", device="cuda:0"
)
harry_styles = "If you are feeling down, I just wanna make you happier, baby"
pipe(harry_styles, num_return_sequences=3)
... Setting pad_token_id to eos_token_id:0 for open-end generation.
[{'generated_text': 'If you are feeling down, I just wanna make you happier, baby. You shall be in these arms, as you shall; If that our friends are honest; Our inward confess, and their de'},
{'generated_text': 'If you are feeling down, I just wanna make you happier, baby. Give me your passion. I am your good wife. I am sorry I had not rather have my wife than a garments and a'},
{'generated_text': 'If you are feeling down, I just wanna make you happier, baby of love with her, Which she ands but that I will praising of her, And the best virtuous, with all sighs, Have'}]
## Billy Joel - Honesty
billy_joel__honesty = "If you search for tenderness, It isn't hard to find"
pipe(billy_joel__honesty, num_return_sequences=3)
... [{'generated_text': "If you search for tenderness, It isn't hard to find me, I will not make it. You must bring you to me, my noble sword or no more more, but do myself. If he take my word, give out it me"},
{'generated_text': "If you search for tenderness, It isn't hard to find yourself. If you must find them. I beseech you, I know not how you would do not have the way that have not her. You could have heard it, but"},
{'generated_text': "If you search for tenderness, It isn't hard to find more than I can take To give your head, and it. We hope you no use or no more. You say more. Well then, I fear, they have no more Than"}]
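Each generated text begins with the prompt itself, as the outputs above show. If only the continuation is wanted, a small helper can strip the prompt. This is a minimal sketch; the function name continuations is hypothetical and the sample output is a shortened copy of one result above.

```python
def continuations(prompt, outputs):
    """Return only the newly generated text from text-generation outputs."""
    results = []
    for item in outputs:
        text = item["generated_text"]
        # The generated text starts with the prompt, so cut it off.
        if text.startswith(prompt):
            text = text[len(prompt):]
        results.append(text.strip())
    return results

# Shortened copy of one Billy Joel output, in the pipeline's format.
prompt = "If you search for tenderness, It isn't hard to find"
outputs = [{"generated_text": prompt + " me, I will not make it."}]
print(continuations(prompt, outputs))  # ['me, I will not make it.']
```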
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- num_epochs: 3
- mixed_precision_training: Native AMP
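Two of these settings are easier to read with a little arithmetic. The sketch below computes the effective batch size (assuming a single device) and the learning rate produced by a linear warmup followed by a cosine decay, which is the shape the cosine scheduler with warmup follows. The function name and, in particular, total_steps are assumptions for illustration; the actual total number of optimizer steps is not stated here.

```python
import math

def cosine_lr_with_warmup(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Values taken from the hyperparameter list.
peak_lr, warmup_steps = 5e-4, 500
effective_batch = 16 * 8  # train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 128

# ASSUMPTION: total_steps is a made-up figure for illustration only.
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr_with_warmup(step, peak_lr, warmup_steps, total_steps))
```

With 500 warmup steps, the learning rate only reaches its peak of 0.0005 at step 500, so the first two evaluation points in this card fall entirely inside the warmup phase.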
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---|---|---|---|
| 5.7327 | 1.2 | 250 | 5.9251 |
| 4.6322 | 2.4 | 500 | 5.5482 |
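A validation loss is easier to interpret as perplexity, the exponential of the mean cross-entropy. The short sketch below converts the two validation losses from the table, assuming they are reported in nats, as is standard for PyTorch cross-entropy.

```python
import math

# Validation losses copied from the table (mean cross-entropy in nats).
validation_loss = {250: 5.9251, 500: 5.5482}

for step, loss in validation_loss.items():
    perplexity = math.exp(loss)
    print(f"step {step}: perplexity ~ {perplexity:.0f}")
```

Roughly speaking, a perplexity of a few hundred means the model is about as uncertain as if it were choosing uniformly among that many tokens at each step.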
Framework versions
- Transformers 4.28.1
- Pytorch 2.0.0+cu118
- Datasets 2.12.0
- Tokenizers 0.13.3