William Shakespeare Writer

NLP models account for some of the most astonishing abilities among recent machine learning releases. The internet blew up when ChatGPT launched; people keep posting about it, dazzled by its communication skills, especially on summarization, translation, and question-answering tasks. As someone who has worked with machine learning for the last five years and understands how it works under the hood, I am still awed by how well LLMs condense information and how sharply they write.

The essence of LLMs and other NLP milestones is the Transformer and its self-attention mechanism, which lets a neural network weigh the preceding tokens when it processes each position. As a result, models can capture syntax and semantics in new ways. Specifying those models in Python is more straightforward than it might look. Practitioners can use PyTorch, TensorFlow, and JAX to specify transformer-based architectures in a high-level language while making good use of local resources. The HuggingFace library is a newer high-level Python option that supports model specification for training and fine-tuning.
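To make the self-attention idea concrete, here is a minimal sketch of single-head causal self-attention in plain Python. It is an illustration under simplifying assumptions, not how PyTorch or HuggingFace implement it: there are no learned projections (queries, keys, and values are the inputs themselves), and the function name causal_self_attention is made up for this example.

```python
import math

def causal_self_attention(x):
    """Single-head causal self-attention without learned weights.

    x is a list of token vectors; queries, keys, and values are all x itself
    (a simplifying assumption for illustration).
    """
    d = len(x[0])
    out = []
    for i, query in enumerate(x):
        # A causal model may only look at positions 0..i (no peeking ahead)
        visible = x[: i + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        # Softmax over the visible positions
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the visible value vectors
        out.append(
            [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        )
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(tokens)
```

The first position can only attend to itself, so its output equals its input, and changing a later token never alters an earlier output. That property is what makes the model "causal".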

I decided to learn the HuggingFace library by pursuing an old dream: a generative model that can write like a classical writer. I picked William Shakespeare. Project Gutenberg (gutenberg.org) hosts many classical masterpieces, including the complete works of W. Shakespeare. The william_shakespeare__writer project attempts to continue a given initial text the way Shakespeare would. My favorite use case is checking how Shakespeare would write several popular songs if he were the composer. In NLP terms, we will specify a causal language model. Without further ado, let's jump into it.

Model description

The william_shakespeare__writer model is a causal language model trained on all of William Shakespeare's texts. It uses the GPT-2 architecture and trains it from scratch. The full model specification relies on the HuggingFace ecosystem, using parts of the transformers and datasets libraries.
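In a causal language model, each position is trained to predict the token that follows it. The sketch below shows that pairing on a toy token list; the helper name next_token_pairs and the token IDs are invented for illustration. In the transformers library the shift happens inside the model when labels are supplied, so this is only a picture of the objective, not code you need to write.

```python
def next_token_pairs(input_ids):
    """Pair each context prefix with the token the model should predict next."""
    pairs = []
    for i in range(1, len(input_ids)):
        context = input_ids[:i]   # everything seen so far
        target = input_ids[i]     # the token to predict
        pairs.append((context, target))
    return pairs

toy_ids = [10, 20, 30, 40]  # hypothetical token IDs
for context, target in next_token_pairs(toy_ids):
    print(context, "->", target)
```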

Data pulling and dataset object

The following code snippet pulls the data from gutenberg.org, breaks the corpus into stanzas, and splits it into training and validation datasets. Two variables, hugging_face_dt_training and hugging_face_dt_val, hold the data for the upcoming training.

import requests
import re
import pandas as pd

from datasets import Dataset

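# Plain-text edition of the complete works on Project Gutenberg
william_shakespeare__full_work = 'https://www.gutenberg.org/cache/epub/100/pg100.txt'
response = requests.get(william_shakespeare__full_work)
data = response.text
# Keep the third segment when splitting on 'THE SONNETS' ...
data = data.split('THE SONNETS')[2]
# ... and drop the Project Gutenberg footer
data = data.split("*** END OF THE PROJECT GUTENBERG EBOOK THE COMPLETE WORKS OF WILLIAM SHAKESPEARE ***")[0]

# Blank lines (two consecutive CRLFs) separate stanzas
stanzas = data.split('\r\n\r\n')

# Flatten each stanza to one line, collapse triple spaces, drop fragments of 64 characters or fewer
stanzas = list(map(lambda x: x.replace("\r\n", " "), stanzas))
stanzas = list(map(lambda x: x.replace("   ", " "), stanzas))
stanzas = list(filter(lambda x: len(x) > 64, stanzas))

# First 15,500 stanzas for training, the remainder for validation
training_dataset = stanzas[:15500]
valid_dataset = stanzas[15500:]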

hugging_face_dt_training = Dataset.from_pandas(pd.DataFrame(training_dataset,columns=["corpus"]))
hugging_face_dt_val = Dataset.from_pandas(pd.DataFrame(valid_dataset,columns=["corpus"]))
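To see what the cleaning steps do without downloading the corpus, the same split, replace, and filter logic can be run on a tiny made-up string. The sample text, the helper name split_into_stanzas, and the lower min_chars threshold are assumptions chosen so the example stays short; the snippet above uses 64.

```python
def split_into_stanzas(text, min_chars=64):
    """Split on blank lines (CRLF), flatten each stanza, and drop short ones."""
    stanzas = text.split("\r\n\r\n")
    stanzas = [s.replace("\r\n", " ") for s in stanzas]
    stanzas = [s.replace("   ", " ") for s in stanzas]
    return [s for s in stanzas if len(s) > min_chars]

sample = (
    "Shall I compare thee to a summer's day?\r\n"
    "Thou art more lovely and more temperate:\r\n"
    "\r\n"
    "ACT I\r\n"
    "\r\n"
    "Rough winds do shake the darling buds of May,\r\n"
    "And summer's lease hath all too short a date:"
)
stanzas = split_into_stanzas(sample, min_chars=20)
for stanza in stanzas:
    print(stanza)
```

The short "ACT I" fragment is filtered out, which is the purpose of the length threshold: headings and stage labels carry little text for the model to learn from.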

Tokenizer

The AutoTokenizer class from transformers gives access to a myriad of tokenizers. I use the standard GPT-2 tokenizer, a byte-level version of Byte Pair Encoding (BPE).

from transformers import AutoTokenizer

context_length = 64
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(element):
    outputs = tokenizer(
        element["corpus"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
        return_length=True,
    )
    # Keep every chunk, including those shorter than context_length
    input_batch = []
    for input_ids in outputs["input_ids"]:
        input_batch.append(input_ids)
    return {"input_ids": input_batch}


tokenized_datasets__valid = hugging_face_dt_val.map(
    tokenize, batched=True, remove_columns=hugging_face_dt_val.column_names
)
tokenized_datasets__training = hugging_face_dt_training.map(
    tokenize, batched=True, remove_columns=hugging_face_dt_training.column_names
)
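With truncation=True and return_overflowing_tokens=True, a stanza longer than context_length is not cut off and discarded; the tokenizer returns the remainder as additional chunks. A rough plain-Python picture of that windowing, assuming no overlap between chunks (the default stride of 0) and using a hypothetical helper name, is:

```python
def chunk_token_ids(token_ids, context_length=64):
    """Split one token sequence into consecutive windows of at most context_length."""
    return [
        token_ids[start : start + context_length]
        for start in range(0, len(token_ids), context_length)
    ]

toy_ids = list(range(150))  # stand-in for one tokenized stanza
chunks = chunk_token_ids(toy_ids)
print([len(c) for c in chunks])  # [64, 64, 22]
```

Because tokenize keeps every chunk, the last, shorter window also ends up in the dataset and is padded later by the data collator.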

Model Specification

With the transformers library we can get the GPT-2 architecture seamlessly. The model variable holds the architecture with freshly initialized weights.

from transformers import GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

model_size = sum(t.numel() for t in model.parameters())
print(f"GPT size: {model_size/1000**2:.1f}M parameters")

... GPT size: 124.2M parameters

Model Training

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)


args = TrainingArguments(
    output_dir="william_shakespeare__writer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=250,
    logging_steps=100,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    weight_decay=0.1,
    warmup_steps=500,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=500,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets__training,
    eval_dataset=tokenized_datasets__valid,
)


trainer.train()

... Step Training Loss Validation Loss
... 250 5.741500 5.936911
... 500 4.634500 5.560539
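Two of the settings above interact in ways that are easy to compute by hand: the effective batch size per optimizer step, and the learning rate produced by a cosine schedule with linear warmup. The sketch below reproduces the usual formula for that schedule in plain Python. The function names are invented, the total_steps value is a hypothetical placeholder (the real number depends on the dataset size), and a single device is assumed.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

def cosine_lr(step, base_lr=5e-4, warmup_steps=500, total_steps=1000):
    """Linear warmup to base_lr, then half-cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch_size(16, 8))  # 128
for step in (0, 250, 500, 750, 1000):
    print(step, cosine_lr(step))
```

Note that the log above ends at step 500, which is exactly where warmup finishes under these settings, so the learning rate was still ramping up for the whole logged portion.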

Intended uses & limitations

The Shakespeare writer exists for study and amusement. It aims to explore the HuggingFace library and the power of transformers for causal language modeling.

How to use

You can use this model directly with a pipeline for causal language modeling:


import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation", model="JeanFaootMaia/william_shakespeare__writer", device="cuda:0"
)

harry_styles = "If you are feeling down, I just wanna make you happier, baby"

pipe(harry_styles, num_return_sequences=3)

... Setting pad_token_id to eos_token_id:0 for open-end generation. [{'generated_text': 'If you are feeling down, I just wanna make you happier, baby. You shall be in these arms, as you shall; If that our friends are honest; Our inward confess, and their de'}, {'generated_text': 'If you are feeling down, I just wanna make you happier, baby. Give me your passion. I am your good wife. I am sorry I had not rather have my wife than a garments and a'}, {'generated_text': 'If you are feeling down, I just wanna make you happier, baby of love with her, Which she ands but that I will praising of her, And the best virtuous, with all sighs, Have'}]

## Billy Joel - Honesty
billy_joel__honesty = "If you search for tenderness, It isn't hard to find"
pipe(billy_joel__honesty, num_return_sequences=3)

... [{'generated_text': "If you search for tenderness, It isn't hard to find me, I will not make it. You must bring you to me, my noble sword or no more more, but do myself. If he take my word, give out it me"}, {'generated_text': "If you search for tenderness, It isn't hard to find yourself. If you must find them. I beseech you, I know not how you would do not have the way that have not her. You could have heard it, but"}, {'generated_text': "If you search for tenderness, It isn't hard to find more than I can take To give your head, and it. We hope you no use or no more. You say more. Well then, I fear, they have no more Than"}]
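The pipeline returns a list of dictionaries whose generated_text field repeats the prompt before the new text, as the outputs above show. A small helper (hypothetical, not part of transformers) can strip the prompt so that only the continuation remains:

```python
def continuations(prompt, outputs):
    """Return only the newly generated text from text-generation pipeline outputs."""
    results = []
    for item in outputs:
        text = item["generated_text"]
        # Drop the echoed prompt when present; otherwise keep the text unchanged
        if text.startswith(prompt):
            text = text[len(prompt):]
        results.append(text.strip())
    return results

prompt = "If you search for tenderness, It isn't hard to find"
sample_outputs = [  # shortened from the output shown above
    {"generated_text": prompt + " me, I will not make it."},
    {"generated_text": prompt + " yourself. If you must find them."},
]
print(continuations(prompt, sample_outputs))
```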

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0005
train_batch_size: 16
eval_batch_size: 16
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 128
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 500
num_epochs: 3
mixed_precision_training: Native AMP

Training results

Training Loss	Epoch	Step	Validation Loss
5.7327	1.2	250	5.9251
4.6322	2.4	500	5.5482
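Cross-entropy losses are easier to interpret as perplexity, which is the exponential of the loss. This assumes, as is standard for this kind of training loop, that the reported loss is the mean per-token cross-entropy in nats. A short calculation on the validation losses from the table:

```python
import math

def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss measured in nats."""
    return math.exp(loss)

validation_losses = {250: 5.9251, 500: 5.5482}  # step -> validation loss, from the table
for step, loss in validation_losses.items():
    print(f"step {step}: perplexity ~ {perplexity(loss):.0f}")
```

Under that assumption, validation perplexity falls from roughly 370 at step 250 to roughly 260 at step 500, so the model is still far from confident about the next token.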

Framework versions

Transformers 4.28.1
Pytorch 2.0.0+cu118
Datasets 2.12.0
Tokenizers 0.13.3