Pretrained BART in Korean

This is a BART model pretrained on multiple Korean datasets.

I used multiple datasets so that the model generalizes to both colloquial and written text.

Training was supported by the TPU Research Cloud program.

The script used to pre-train the model is available here.

When you use the Inference API, you must wrap the sentence with [BOS] and [EOS], as in the example below.

[BOS] 안녕하세요? 반가워요~~ [EOS]

You can also test mask-filling performance using the [MASK] token, like this:

[BOS] [MASK] 먹었어? [EOS]

Benchmark

<style>
table { border-collapse: collapse; border-style: hidden; width: 100%; }
td, th { border: 1px solid #4d5562; padding: 8px; }
</style>

<table>
  <tr>
    <th>Dataset</th>
    <td>KLUE NLI dev</td>
    <td>NSMC test</td>
    <td>QuestionPair test</td>
    <td colspan="2">KLUE TC dev</td>
    <td colspan="3">KLUE STS dev</td>
    <td colspan="3">KorSTS dev</td>
    <td colspan="2">HateSpeech dev</td>
  </tr>
  <tr>
    <th>Metric</th>
    <!-- KLUE NLI --> <td>Acc</td>
    <!-- NSMC --> <td>Acc</td>
    <!-- QuestionPair --> <td>Acc</td>
    <!-- KLUE TC --> <td>Acc</td> <td>F1</td>
    <!-- KLUE STS --> <td>F1</td> <td>Pearson</td> <td>Spearman</td>
    <!-- KorSTS --> <td>F1</td> <td>Pearson</td> <td>Spearman</td>
    <!-- HateSpeech --> <td>Bias Acc</td> <td>Hate Acc</td>
  </tr>
  <tr>
    <th>Score</th>
    <!-- KLUE NLI --> <td>0.7390</td>
    <!-- NSMC --> <td>0.8877</td>
    <!-- QuestionPair --> <td>0.9208</td>
    <!-- KLUE TC --> <td>0.8667</td> <td>0.8637</td>
    <!-- KLUE STS --> <td>0.7654</td> <td>0.8090</td> <td>0.8040</td>
    <!-- KorSTS --> <td>0.8067</td> <td>0.7909</td> <td>0.7784</td>
    <!-- HateSpeech --> <td>0.8280</td> <td>0.5669</td>
  </tr>
</table>

Used Datasets

- Modu Corpus (모두의 말뭉치)
- AI Hub
- Sejong Corpus (세종 말뭉치)