NB-ROBERTA Training Code
This is the current training code for the planned nb-roberta models.
We are currently planning to run the following experiments:
<table> <tr> <td><strong>Name</strong> </td> <td><strong>nb-roberta-base-old (C)</strong> </td> </tr> <tr> <td>Corpus </td> <td>NbAiLab/nb_bert </td> </tr> <tr> <td>Pod size </td> <td>v4-64 </td> </tr> <tr> <td>Batch size </td> <td>6248 = 1984 = 2k </td> </tr> <tr> <td>Learning rate </td> <td>3e-4 (RoBERTa article is using 6e-4 and bs=8k) </td> </tr> <tr> <td>Number of steps </td> <td>250k </td> </tr> </table>
<table> <tr> <td><strong>Name</strong> </td> <td><strong>nb-roberta-base-ext (B)</strong> </td> </tr> <tr> <td>Corpus </td> <td>NbAiLab/nbailab_extended </td> </tr> <tr> <td>Pod size </td> <td>v4-64 </td> </tr> <tr> <td>Batch size </td> <td>6248 = 1984 = 2k </td> </tr> <tr> <td>Learning rate </td> <td>3e-4 (RoBERTa article is using 6e-4 and bs=8k) </td> </tr> <tr> <td>Number of steps </td> <td>250k </td> </tr> </table>
<table> <tr> <td><strong>Name</strong> </td> <td><strong>nb-roberta-large-ext</strong> </td> </tr> <tr> <td>Corpus </td> <td>NbAiLab/nbailab_extended </td> </tr> <tr> <td>Pod size </td> <td>v4-64 </td> </tr> <tr> <td>Batch size </td> <td>3248 = 2024 = 1k </td> </tr> <tr> <td>Learning rate </td> <td>2-e4 (RoBERTa article is using 4e-4 and bs=8k) </td> </tr> <tr> <td>Number of steps </td> <td>500k </td> </tr> </table>
<table> <tr> <td><strong>Name</strong> </td> <td><strong>nb-roberta-base-scandi</strong> </td> </tr> <tr> <td>Corpus </td> <td>NbAiLab/scandinavian </td> </tr> <tr> <td>Pod size </td> <td>v4-64 </td> </tr> <tr> <td>Batch size </td> <td>6248 = 1984 = 2k </td> </tr> <tr> <td>Learning rate </td> <td>3e-4 (RoBERTa article is using 6e-4 and bs=8k) </td> </tr> <tr> <td>Number of steps </td> <td>250k </td> </tr> </table>
<table> <tr> <td><strong>Name</strong> </td> <td><strong>nb-roberta-large-scandi</strong> </td> </tr> <tr> <td>Corpus </td> <td>NbAiLab/scandinavian </td> </tr> <tr> <td>Pod size </td> <td>v4-64 </td> </tr> <tr> <td>Batch size </td> <td>3248 = 1024 = 1k </td> </tr> <tr> <td>Learning rate </td> <td>2-e4 (RoBERTa article is using 4e-4 and bs=8k) </td> </tr> <tr> <td>Number of steps </td> <td>500k </td> </tr> </table>
Calculations
Some basic that we used when estimating the number of training steps:
- The Scandinavic Corpus is 85GB
- The Scandinavic Corpus contains 13B words
- With a conversion factor of 2.3, this is estimated to around 30B tokens
- 30B tokens / (512 seq length * 3000 batch size) = 20.000 steps