| Hyperparameter | Value |
|---|---|
| Steps | 150k |
| Max length | 256 |
| LR | 1e-4 |
| LR schedule | constant |
| Optimizer | AdamW |
| beta_1, beta_2 | 0.9, 0.95 |
| Final eval loss | 2.245 |
| Final eval perplexity | 9.44 |
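
The last two rows appear to be related by the usual identity, perplexity = exp(loss). The sketch below checks this, assuming the eval loss is a mean per-token cross-entropy in nats; the table does not state that, and the names `perplexity` and `final_eval_loss` are illustrative.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy to perplexity.

    Assumes the loss uses the natural logarithm (nats).
    """
    return math.exp(loss_nats)


# Final eval loss copied from the table above.
final_eval_loss = 2.245

ppl = perplexity(final_eval_loss)
print(f"{ppl:.2f}")  # prints 9.44
```

Under that assumption the computed value matches the reported perplexity of 9.44 to two decimal places.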