OCR postcorrection task 1
This is a BertForTokenClassification model that predicts, for each token, whether it is part of an OCR mistake. It is based on bert-base-multilingual-cased and fine-tuned on the dataset of the 2019 ICDAR competition on post-OCR correction. The dataset contains texts in the following languages:
- BG
- CZ
- DE
- EN
- ES
- FI
- FR
- NL
- PL
- SL
10% of the texts, stratified by language, were held out for validation. The test set is used as provided.
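A language-stratified split like the one described above could be sketched as follows. This is a minimal illustration, not the code used to build the model: the function name `stratified_split`, the `(text_id, language)` input format, the seed, and the rule of keeping at least one validation text per language are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(texts, val_fraction=0.1, seed=42):
    """Split (text_id, language) pairs so each language contributes
    roughly val_fraction of its texts to the validation set."""
    by_language = defaultdict(list)
    for text_id, language in texts:
        by_language[language].append(text_id)

    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    train, val = [], []
    for language in sorted(by_language):
        ids = by_language[language][:]
        rng.shuffle(ids)
        # Assumption: keep at least one validation text per language.
        n_val = max(1, round(len(ids) * val_fraction))
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Toy input: 20 made-up text ids for each of three languages.
texts = [(f"{lang}_{i}", lang) for lang in ("DE", "EN", "NL") for i in range(20)]
train_ids, val_ids = stratified_split(texts)
```

With 20 texts per language, each language contributes 2 texts to validation and 18 to training.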
The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance below 0.3 were included in the training and validation sets. The test set was not filtered on edit distance.
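The windowing and filtering step can be illustrated with a short sketch. Several details here are assumptions, because the description above does not specify them: the window step of 75 tokens, the comparison of the OCR string against a gold-standard string, and normalization by the length of the longer string. All function names are hypothetical.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr, gold):
    # Assumption: normalize by the length of the longer string.
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def keep_sequence(ocr, gold, threshold=0.3):
    """True if the sequence passes the edit-distance filter."""
    return normalized_edit_distance(ocr, gold) < threshold


def windows(tokens, size=150, step=75):
    """Yield partially overlapping windows of `size` tokens.

    `step` is a hypothetical value. Trailing tokens that do not fill
    a whole window are dropped in this sketch.
    """
    for start in range(0, len(tokens) - size + 1, step):
        yield tokens[start:start + size]


sequences = list(windows(list(range(300))))
```

For example, `keep_sequence("hcllo world", "hello world")` passes the filter (distance 1 over 11 characters), while `keep_sequence("kitten", "sitting")` does not (distance 3 over 7 characters).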
There are 3 classes in the data:
- 0: No OCR mistake
- 1: Start token of an OCR mistake
- 2: Inside token of an OCR mistake
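To show how the three classes combine, the sketch below groups a list of per-token labels into mistake spans. It is an illustration only: the function name `mistake_spans` is hypothetical, and treating a stray "inside" label without a preceding start label as the beginning of a new span is an assumption about how to handle inconsistent predictions.

```python
def mistake_spans(labels):
    """Group per-token labels (0, 1, 2) into (start, end) spans.

    `end` is exclusive. A label of 1 always opens a new span, so two
    adjacent mistakes stay separate.
    """
    spans = []
    start = None
    for index, label in enumerate(labels):
        if label == 1:
            if start is not None:
                spans.append((start, index))
            start = index
        elif label == 2:
            # Assumption: a stray "inside" label opens a new span.
            if start is None:
                start = index
        else:
            if start is not None:
                spans.append((start, index))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


spans = mistake_spans([0, 1, 2, 0, 1, 1, 2])
# spans == [(1, 3), (4, 5), (5, 7)]
```

The start label is what makes it possible to tell one long mistake (1, 2, 2) apart from several consecutive short ones (1, 1, 1).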
Results
Set | Loss |
---|---|
Train | 0.224500 |
Val | 0.285791 |
Test | 0.417836 |
Average F1 by language:
BG | CZ | DE | EN | ES | FI | FR | NL | PL | SL |
---|---|---|---|---|---|---|---|---|---|
0.74 | 0.69 | 0.96 | 0.67 | 0.63 | 0.83 | 0.65 | 0.69 | 0.80 | 0.69 |