
OCR postcorrection task 1

This is a BertForTokenClassification model that predicts whether or not a token is an OCR mistake. It is based on bert-base-multilingual-cased and fine-tuned on the dataset of the ICDAR 2019 competition on post-OCR text correction. The dataset contains texts in the following languages:

10% of the texts (stratified by language) were selected for validation. The test set is used as provided.

The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance below 0.3 were included in the train and validation sets. The test set was not filtered on edit distance.
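The windowing and filtering described above could look roughly like the sketch below. It is not the code used to build the training data. The step size of 75 tokens, the character-level Levenshtein distance, the normalization by the longer string, and the helper names (`windows`, `normalized_edit_distance`, `keep_window`) are all assumptions made for illustration; the source only states the window length (150 tokens) and the threshold (0.3).

```python
from typing import List, Sequence, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr: str, gold: str) -> float:
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(ocr), len(gold))
    return levenshtein(ocr, gold) / longest if longest else 0.0


def windows(tokens: Sequence, size: int = 150, step: int = 75) -> List[Sequence]:
    """Overlapping windows of `size` tokens; `step` is an assumed value.

    Trailing tokens that do not fit a full step are not covered here.
    """
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]


def keep_window(pairs: Sequence[Tuple[str, str]], threshold: float = 0.3) -> bool:
    """Keep a window of (ocr_token, gold_token) pairs if it is below the threshold."""
    ocr_text = " ".join(ocr for ocr, _ in pairs)
    gold_text = " ".join(gold for _, gold in pairs)
    return normalized_edit_distance(ocr_text, gold_text) < threshold
```

For training and validation data, each document's aligned token pairs would be passed through `windows` and then filtered with `keep_window`; for the test set, only the windowing step would apply.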

There are 3 classes in the data:

Results

Set    Loss
Train  0.224500
Val    0.285791
Test   0.4178357720375061

Average F1 by language:

BG    CZ    DE    EN    ES    FI    FR    NL    PL    SL
0.74  0.69  0.96  0.67  0.63  0.83  0.65  0.69  0.80  0.69

Demo

Space for this model.

Code