# EuroGPT2
**Note:** this is the original Megatron-DeepSpeed checkpoint, including optimizer states.
A GPT-2 language model for European languages (the 24 official EU languages plus Ukrainian). The model follows the same architecture as OpenAI's GPT-2, except that it uses rotary instead of learned positional embeddings.
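For reference, a minimal NumPy sketch of rotary position embeddings (RoPE) as applied to the query/key vectors of each attention head. This is an illustration of the general technique; the exact channel pairing in the Megatron-DeepSpeed implementation may differ:

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings to a tensor of shape
    (seq_len, head_dim). Pairs of channels are rotated by a
    position-dependent angle instead of adding a learned
    position vector to the input embeddings."""
    seq_len, head_dim = x.shape
    # One frequency per channel pair, as in Su et al. (2021).
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]  # even / odd channel pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```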
## Model settings
- parameters: 124M
- number of layers: 12
- hidden size: 768
- number of heads: 12
- sequence length: 1024
- batch size: 168
- test PPL after training: 23.6 (steps: 436,940)
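As a sanity check, the 124M figure can be reproduced from the settings above. The vocabulary size is not stated in this card, so a GPT-2-sized vocabulary of roughly 50k tokens is assumed here purely to make the arithmetic concrete; with rotary embeddings there are no learned position parameters:

```python
vocab_size = 50_257  # assumed, not stated in this card
hidden, layers = 768, 12

embeddings = vocab_size * hidden               # no learned position table (RoPE)
per_layer = 12 * hidden**2 + 13 * hidden       # QKV + attn proj + MLP weights and biases + 2 LayerNorms
transformer = layers * per_layer + 2 * hidden  # plus final LayerNorm
total = embeddings + transformer
print(f"{total / 1e6:.1f}M parameters")        # ~123.7M, i.e. the quoted 124M
```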
## Training data
- Wikimedia dumps (Wikipedia, Wikinews, Wikibooks, Wikisource, Wikivoyage; 20230301)
- EUR-Lex
- OSCAR 2023.01
- Tokens: 75,167,662,080
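The token count corresponds exactly to batch size × sequence length × number of training steps:

```python
batch_size, seq_len, steps = 168, 1024, 436_940
tokens = batch_size * seq_len * steps
print(f"{tokens:,}")  # 75,167,662,080 -- matches the figure above
```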
## Languages
Included languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish, and Ukrainian.
| Language | Ratio |
|---|---|
| bg | 5.92% |
| cs | 4.77% |
| da | 2.19% |
| de | 7.36% |
| el | 8.60% |
| en | 10.11% |
| es | 6.57% |
| et | 1.67% |
| fi | 2.70% |
| fr | 7.18% |
| ga | 0.25% |
| hr | 1.09% |
| hu | 6.38% |
| it | 5.80% |
| lt | 2.01% |
| lv | 1.76% |
| mt | 1.49% |
| nl | 5.20% |
| pl | 4.82% |
| pt | 4.64% |
| ro | 2.93% |
| sk | 2.03% |
| sl | 1.54% |
| sv | 3.00% |
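As an illustration of how such shares could be used, here is a minimal sketch of weighted language sampling. The card does not describe the actual data-mixing pipeline, so `sample_language` is hypothetical:

```python
import random

# Language shares from the table above, as fractions (they sum to ~1.0).
shares = {
    "bg": 0.0592, "cs": 0.0477, "da": 0.0219, "de": 0.0736, "el": 0.0860,
    "en": 0.1011, "es": 0.0657, "et": 0.0167, "fi": 0.0270, "fr": 0.0718,
    "ga": 0.0025, "hr": 0.0109, "hu": 0.0638, "it": 0.0580, "lt": 0.0201,
    "lv": 0.0176, "mt": 0.0149, "nl": 0.0520, "pl": 0.0482, "pt": 0.0464,
    "ro": 0.0293, "sk": 0.0203, "sl": 0.0154, "sv": 0.0300,
}

def sample_language(rng: random.Random) -> str:
    """Draw a language code with probability proportional to its share."""
    langs, weights = zip(*shares.items())
    return rng.choices(langs, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_language(rng) for _ in range(5)])
```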
## License
MIT