A GPT-2 Medium-sized SoLU model trained on 11.7B tokens of the Pile (training crashed at 11B tokens because of dodgy data loaders and wasn't resumed, so this run is shorter than the others). 12 layers, d_model=1536.
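For reference, a minimal sketch of the SoLU activation as usually defined (the MLP hidden pre-activations multiplied elementwise by their softmax, typically followed by a LayerNorm). The specific dimensions in the example below other than d_model=1536 (e.g. d_mlp = 4 * d_model) are assumptions, not details confirmed for this model.

```python
import torch
import torch.nn as nn


class SoLU(nn.Module):
    """Softmax Linear Unit: x * softmax(x) over the MLP hidden dimension,
    followed by a LayerNorm, as in the usual SoLU architecture."""

    def __init__(self, d_mlp: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_mlp)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise product with the softmax over the hidden dimension,
        # then LayerNorm over that same dimension.
        return self.ln(x * torch.softmax(x, dim=-1))


# Hypothetical usage: d_model=1536 from the description above,
# d_mlp assumed to be 4 * d_model = 6144.
solu = SoLU(d_mlp=6144)
hidden = torch.randn(2, 10, 6144)  # (batch, seq, d_mlp)
out = solu(hidden)
```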