# SCRIPBOZO
This model is GPT2-Medium finetuned on chat logs from twitch.tv/MOONMOON.
## Data
The data consists of ~3.8GB of plaintext across 632 days of logs, ranging from 2021-01-01 to 2022-09-26. The logs were sourced from https://logs.ivr.fi/ and cleaned by dropping the following (a filtering sketch follows the list):

- bots: messages from a (manually determined, non-exhaustive) list of bots
- links: messages matching the regex `r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"`
- long messages: messages longer than 300 characters
- short messages: messages shorter than 2 characters
- caps spam: messages containing more than 100 characters that are more than 80% capital letters
- commands: messages starting with `!`
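A minimal sketch of these message-level filters, assuming each log line has already been parsed into a `(username, message)` pair; the bot list shown here is illustrative, not the actual list that was used:

```python
import re

# Link-detection regex from the list above.
LINK_RE = re.compile(
    r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
)

# Illustrative stand-in; the real list was manually determined and non-exhaustive.
BOTS = {"nightbot", "streamelements", "fossabot"}


def keep_message(username: str, message: str) -> bool:
    """Return True if a chat message survives all of the cleaning filters."""
    if username.lower() in BOTS:          # bots
        return False
    if LINK_RE.search(message):           # links
        return False
    if len(message) > 300:                # long messages
        return False
    if len(message) < 2:                  # short messages
        return False
    caps = sum(c.isupper() for c in message)
    if len(message) > 100 and caps > 0.8 * len(message):  # caps spam
        return False
    if message.startswith("!"):           # commands
        return False
    return True
```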
The data was batched into groups of up to 512 tokens, preferring to end on a newline (`\n`) rather than start another line and truncate it. The batches were then padded to 512 tokens using a pad token added to the model and tokenizer.
10% of the data was set aside for validation.
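A minimal sketch of the packing scheme described above, assuming `lines` is the list of cleaned chat messages; the pad-token string and function names are illustrative, not taken from the card:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MAX_LEN = 512

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")
# GPT-2 ships without a pad token; add one to the tokenizer and resize the
# model's embeddings so the model knows about it (as described above).
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
model.resize_token_embeddings(len(tokenizer))


def build_blocks(lines):
    """Greedily pack whole lines (message + newline) into blocks of up to
    MAX_LEN tokens, starting a new block instead of truncating a line mid-way,
    then pad every block to exactly MAX_LEN tokens."""
    blocks, current = [], []
    for line in lines:
        ids = tokenizer(line + "\n")["input_ids"][:MAX_LEN]
        if current and len(current) + len(ids) > MAX_LEN:
            blocks.append(current)
            current = []
        current.extend(ids)
    if current:
        blocks.append(current)
    pad_id = tokenizer.pad_token_id
    return [b + [pad_id] * (MAX_LEN - len(b)) for b in blocks]
```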
## Training
Training was done on a system with an AMD Radeon RX 6800 XT (16GB of VRAM) and 32GB of system RAM. The following hyperparameters were used (a `TrainingArguments` sketch follows the list):
- epochs: 1
- learning rate: 3e-4
- weight decay: 1e-4
- warmup ratio: 0.01
- optimizer: adamw_torch
- gradient accumulation steps: 1
- gradient checkpointing: true
- fp16: true
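A hedged sketch of how the hyperparameters above map onto Hugging Face `TrainingArguments`; the output directory, batch size, and evaluation cadence are not stated in the card and are assumptions here:

```python
from transformers import TrainingArguments

# Only the hyperparameters listed above come from the card; the remaining
# values (output_dir, batch size, eval cadence) are illustrative assumptions.
training_args = TrainingArguments(
    output_dir="scripbozo-gpt2-medium",
    num_train_epochs=1,
    learning_rate=3e-4,
    weight_decay=1e-4,
    warmup_ratio=0.01,
    optim="adamw_torch",
    gradient_accumulation_steps=1,
    gradient_checkpointing=True,
    fp16=True,
    per_device_train_batch_size=4,   # assumption: small enough for 16GB of VRAM
    evaluation_strategy="steps",     # the card reports 10 evaluations over training
)
```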
## Evaluation
Evaluation was performed 10 times throughout training. Accuracy and perplexity were calculated.

<details>
<summary>View Metrics</summary>

### Training Metrics
| Epochs | Validation Loss | Accuracy |
|---|---|---|
| 0.1 | 1.778 | 0.6789 |
| 0.2 | 1.721 | 0.6858 |
| 0.3 | 1.687 | 0.6899 |
| 0.4 | 1.664 | 0.6925 |
| 0.5 | 1.645 | 0.6950 |
| 0.6 | 1.630 | 0.6969 |
| 0.7 | 1.616 | 0.6987 |
| 0.8 | 1.604 | 0.7003 |
| 0.9 | 1.594 | 0.7017 |
| 1.0 | 1.588 | 0.7025 |

</details>
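The table reports validation loss rather than perplexity directly. Assuming the loss is the usual mean token-level cross-entropy, perplexity is simply its exponential, e.g. for the final evaluation:

```python
import math

final_val_loss = 1.588               # validation loss at epoch 1.0, from the table above
perplexity = math.exp(final_val_loss)
print(f"{perplexity:.2f}")           # ~4.89
```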