random-mega-ar-small-4096

This is a random weight initialization of the architecture below:

It needs to be trained before being useful.

(Figure: architecture configuration)

The image below is from the paper. This architecture roughly follows the paper's enwik8 architecture, with some differences; these bring it to approximately 70M parameters.

(Figure: lm_arch — language model architecture diagram)

Note that the parameter counts in the figure will not match this model (or related ones), since this model uses the GPTNeoX tokenizer/vocab.
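Because the exact count depends on the vocabulary size, it is easiest to verify directly after loading the checkpoint. Below is a minimal sketch of a parameter-counting helper, demonstrated on a hypothetical toy module (for the real model, you would pass the loaded checkpoint, e.g. via `transformers.AutoModelForCausalLM.from_pretrained`, instead):

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # Total number of scalar weights across all parameter tensors.
    return sum(p.numel() for p in model.parameters())

# Hypothetical stand-in module, used here only to show the helper;
# substitute the actual loaded model to check the ~70M figure.
toy = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 100))
print(count_params(toy))  # 100*16 + (16*100 + 100) = 3300
```

Running the same helper on the loaded checkpoint should report a total in the neighborhood of 70M.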