random-mega-ar-small-4096
This is a random weight init for the following architecture:
- 12 decoder layers, 512 hidden size
- 4096 context length, chunked at 1024
- GPT-NeoX tokenizer
It needs to be trained before it is useful; a minimal loading sketch follows below.
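A minimal sketch of how the checkpoint might be loaded with the standard transformers Auto classes. The repo id below is a placeholder (assumption, not a verified hub path), and this assumes your installed transformers version still ships the MEGA model classes.

```python
# Minimal loading sketch (assumption: placeholder repo id; requires a
# transformers version that includes the MEGA model classes).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "random-mega-ar-small-4096"  # placeholder, adjust to the actual hub path

tokenizer = AutoTokenizer.from_pretrained(repo_id)    # GPT-NeoX tokenizer
model = AutoModelForCausalLM.from_pretrained(repo_id)

# The weights are random, so the causal-LM loss on any text is near chance;
# this is only a smoke test that the checkpoint loads and runs.
inputs = tokenizer("MEGA: moving average equipped gated attention.", return_tensors="pt")
out = model(**inputs, labels=inputs["input_ids"])
print(f"loss before any training: {out.loss.item():.2f}")
```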
Architecture
Here is an image from the paper. This architecture roughly follows the paper's enwik8 configuration, with some differences:
- dropout rates are set above 0 for attention and the FFN (the goal is to generate, not memorize)
- chunk size set to 1024
- context length / max positions set to 4096 (this may be specific to the Hugging Face implementation)
This results in approximately 70M parameters; a config sketch for checking these numbers is below.
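A rough way to verify the stated sizes and the ~70M total from the config. This is a sketch under assumptions: the repo id is again a placeholder, and MEGA config field names (e.g. `chunk_size`, `max_positions`) may differ across transformers versions.

```python
# Config sanity check (assumptions: placeholder repo id; field names such as
# chunk_size / max_positions may vary across transformers versions).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("random-mega-ar-small-4096")  # placeholder id
print(config.num_hidden_layers, config.hidden_size)  # expected: 12, 512
print(config.chunk_size, config.max_positions)       # expected: 1024, 4096

# Instantiate from config only: fresh random weights, same architecture.
model = AutoModelForCausalLM.from_config(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # roughly 70M
```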
Note that the parameter counts in the figure will not match this model (or the other models here), because this model uses the GPT-NeoX tokenizer/vocabulary.
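For a rough sense of scale: with a GPT-NeoX vocabulary of roughly 50K tokens and a 512-dimensional hidden size, the input embedding alone accounts for about 50,000 × 512 ≈ 25.6M parameters, whereas the character-level vocabulary typically used for enwik8 is only a few hundred symbols and contributes well under 1M.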