A modified GPT-2 model with only 25 million non-embedding parameters that outperforms GPT-2 (124M), Pythia-70m/160m, and Cerebras-111M on the benchmarks below. It uses ScaledSinusoidal position embeddings, an embedding layernorm, and no bias terms, and was trained on only 8 billion tokens of the SlimPajama dataset at home on 2x A6000.
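
For reference, here is a minimal PyTorch sketch of what a scaled sinusoidal position embedding plus embedding layernorm could look like. The class name, scale initialization, hidden size, and bias-free layernorm wiring are assumptions for illustration, not this model's actual source.

```python
import math

import torch
import torch.nn as nn


class ScaledSinusoidalEmbedding(nn.Module):
    """Fixed sinusoidal position table multiplied by one learnable scale.

    Illustrative sketch only -- names and defaults are assumptions, not the
    model's actual code.
    """

    def __init__(self, dim: int, max_len: int = 1024):
        super().__init__()
        # Single trainable scalar; the sinusoidal table itself stays frozen.
        self.scale = nn.Parameter(torch.tensor(1.0 / math.sqrt(dim)))
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
        table = torch.zeros(max_len, dim)
        table[:, 0::2] = torch.sin(position * div_term)
        table[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("table", table, persistent=False)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, dim)
        seq_len = token_embeddings.size(1)
        return token_embeddings + self.scale * self.table[:seq_len]


# Assumed embedding path: token embeddings + scaled sinusoidal positions,
# then a layernorm before the transformer blocks. bias=False (PyTorch >= 2.1)
# mirrors the "no biases" choice.
dim = 512  # assumed hidden size for the example
tok_embed = nn.Embedding(50257, dim)
pos_embed = ScaledSinusoidalEmbedding(dim)
embed_ln = nn.LayerNorm(dim, bias=False)

tokens = torch.randint(0, 50257, (1, 16))
hidden = embed_ln(pos_embed(tok_embed(tokens)))
```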

| Model | Avg | ARC | HellaSwag | MMLU | TruthfulQA |
|---|---|---|---|---|---|
| cramp-41m | 30.57 | 21.76 | 27.35 | 25.53 | 47.66 |
| gpt2 (125m) | 30.06 | 22.1 | 31.6 | 25.86 | 40.67 |
| pythia-70m-deduped | 30.25 | 21.08 | 27.17 | 25.26 | 47.51 |
| pythia-70m | 30.46 | 21.59 | 27.29 | 25.9 | 47.06 |
| pythia-160m-deduped | 31.16 | 24.06 | 30.34 | 24.95 | 44.34 |
| pythia-160m | 30.58 | 22.78 | 30.34 | 24.95 | 44.26 |
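
These benchmarks (ARC, HellaSwag, MMLU, TruthfulQA) match standard EleutherAI lm-evaluation-harness tasks; below is a hedged sketch of how such an evaluation could be run with the harness's Python API. The checkpoint path is a placeholder and the exact task names and few-shot settings used for the table are assumptions.

```python
# pip install lm-eval
import lm_eval

# Placeholder checkpoint path; substitute the actual model location.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./cramp-41m",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    batch_size=8,
)

# Print the per-task metric dictionaries.
for task, metrics in results["results"].items():
    print(task, metrics)
```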
