A 2L, width 736 SoLU model trained on 15B tokens of the Pile. Bugs: the layernorm just before the unembed is actually an RMSNorm, and the width is not a multiple of 64, so with d_head fixed at 64 we get n_heads=11, and n_heads * d_head = 704 != d_model :(
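
For concreteness, here is the head-dimension arithmetic behind the mismatch (a minimal sketch in plain Python; the variable names are illustrative, not tied to any particular library's config fields):

```python
d_model = 736                  # model width; not a multiple of 64
d_head = 64                    # head dimension was fixed at 64
n_heads = d_model // d_head    # floor division gives 11 heads

# The heads together span only 704 of the 736 residual dimensions.
assert n_heads == 11
assert n_heads * d_head == 704 != d_model
```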