NB-ROBERTA Training Code

This is the current training code for the planned nb-roberta models.

We are currently planning to run the following experiments:

<table>
  <tr><td><strong>Name</strong></td><td><strong>nb-roberta-base-old (C)</strong></td></tr>
  <tr><td>Corpus</td><td>NbAiLab/nb_bert</td></tr>
  <tr><td>Pod size</td><td>v4-64</td></tr>
  <tr><td>Batch size</td><td>62 × 4 × 8 = 1984 ≈ 2k</td></tr>
  <tr><td>Learning rate</td><td>3e-4 (the RoBERTa paper uses 6e-4 at batch size 8k)</td></tr>
  <tr><td>Number of steps</td><td>250k</td></tr>
</table>

<table>
  <tr><td><strong>Name</strong></td><td><strong>nb-roberta-base-ext (B)</strong></td></tr>
  <tr><td>Corpus</td><td>NbAiLab/nbailab_extended</td></tr>
  <tr><td>Pod size</td><td>v4-64</td></tr>
  <tr><td>Batch size</td><td>62 × 4 × 8 = 1984 ≈ 2k</td></tr>
  <tr><td>Learning rate</td><td>3e-4 (the RoBERTa paper uses 6e-4 at batch size 8k)</td></tr>
  <tr><td>Number of steps</td><td>250k</td></tr>
</table>

<table>
  <tr><td><strong>Name</strong></td><td><strong>nb-roberta-large-ext</strong></td></tr>
  <tr><td>Corpus</td><td>NbAiLab/nbailab_extended</td></tr>
  <tr><td>Pod size</td><td>v4-64</td></tr>
  <tr><td>Batch size</td><td>32 × 4 × 8 = 1024 ≈ 1k</td></tr>
  <tr><td>Learning rate</td><td>2e-4 (the RoBERTa paper uses 4e-4 at batch size 8k)</td></tr>
  <tr><td>Number of steps</td><td>500k</td></tr>
</table>

<table>
  <tr><td><strong>Name</strong></td><td><strong>nb-roberta-base-scandi</strong></td></tr>
  <tr><td>Corpus</td><td>NbAiLab/scandinavian</td></tr>
  <tr><td>Pod size</td><td>v4-64</td></tr>
  <tr><td>Batch size</td><td>62 × 4 × 8 = 1984 ≈ 2k</td></tr>
  <tr><td>Learning rate</td><td>3e-4 (the RoBERTa paper uses 6e-4 at batch size 8k)</td></tr>
  <tr><td>Number of steps</td><td>250k</td></tr>
</table>

<table>
  <tr><td><strong>Name</strong></td><td><strong>nb-roberta-large-scandi</strong></td></tr>
  <tr><td>Corpus</td><td>NbAiLab/scandinavian</td></tr>
  <tr><td>Pod size</td><td>v4-64</td></tr>
  <tr><td>Batch size</td><td>32 × 4 × 8 = 1024 ≈ 1k</td></tr>
  <tr><td>Learning rate</td><td>2e-4 (the RoBERTa paper uses 4e-4 at batch size 8k)</td></tr>
  <tr><td>Number of steps</td><td>500k</td></tr>
</table>
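The effective batch sizes in the tables are the product of the per-device batch size and the number of devices in the pod slice. A minimal sketch of that arithmetic, assuming the v4-64 slice is used as 8 hosts with 4 chips each (the helper function is illustrative and not part of the training code):

```python
def effective_batch_size(per_device_batch: int, chips_per_host: int = 4, hosts: int = 8) -> int:
    """Total batch size across the pod slice (assumed v4-64: 8 hosts x 4 chips)."""
    return per_device_batch * chips_per_host * hosts

# Base models: 62 * 4 * 8 = 1984 (~2k)
print(effective_batch_size(62))   # 1984
# Large models: 32 * 4 * 8 = 1024 (~1k)
print(effective_batch_size(32))   # 1024
```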

Calculations

Some basic calculations that we used when estimating the number of training steps:
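As a rough sketch of the kind of estimate involved, the token budget follows from the effective batch size, the sequence length, and the number of steps. The 512-token sequence length below is an assumption (it is not stated in the tables), and the variable names are illustrative:

```python
# Rough token budget for the base runs above, assuming max_seq_length = 512.
batch_size = 1984          # effective batch size (see tables above)
seq_length = 512           # assumed RoBERTa-style sequence length
train_steps = 250_000      # planned number of steps for the base models

tokens_per_step = batch_size * seq_length        # ~1.0M tokens per step
total_tokens = tokens_per_step * train_steps     # ~254B tokens over the full run
print(f"{tokens_per_step:,} tokens/step, {total_tokens:,} tokens total")
```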