OCR postcorrection task 1

This is a BertForTokenClassification model that predicts whether a token is an OCR mistake. It is based on bert-base-multilingual-cased and fine-tuned on the dataset of the 2019 ICDAR competition on post-OCR correction. The dataset contains texts in the following languages: BG, CZ, DE, EN, ES, FI, FR, NL, PL, SL.

10% of the texts (stratified by language) were selected for validation. The test set is used as provided by the competition.
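
A stratified split of this kind can be sketched as follows. This is a minimal illustration, not the code used to build the model: the function name `stratified_split`, the `(text_id, language)` input format, the seed, and the toy ids are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(texts, val_fraction=0.1, seed=0):
    """Split (text_id, language) pairs into train and validation ids.

    Roughly `val_fraction` of the texts of each language go to validation.
    """
    by_language = defaultdict(list)
    for text_id, language in texts:
        by_language[language].append(text_id)

    rng = random.Random(seed)
    train, val = [], []
    for language in sorted(by_language):
        ids = by_language[language]
        rng.shuffle(ids)  # random selection within each language
        n_val = round(len(ids) * val_fraction)
        val.extend(ids[:n_val])
        train.extend(ids[n_val:])
    return train, val


# Toy example with hypothetical ids: 20 Dutch and 10 German texts.
texts = [(f"nl_{i}", "NL") for i in range(20)] + [(f"de_{i}", "DE") for i in range(10)]
train_ids, val_ids = stratified_split(texts)
```

Splitting per language keeps the language proportions of the validation set close to those of the full training data, which matters here because the languages differ a lot in size and OCR quality.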

The training data consists of (partially overlapping) sequences of 150 tokens. Only sequences with a normalized edit distance of < 0.3 were included in the train and validation sets. The test set was not filtered on edit distance.
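
The windowing and filtering described above could look roughly like the sketch below. Several details are assumptions rather than facts from this model card: the step size of 75 tokens, the normalization (edit distance divided by the length of the longer string), and all function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(ocr, gold):
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(ocr), len(gold))
    return edit_distance(ocr, gold) / longest if longest else 0.0


def keep_sequence(ocr, gold, threshold=0.3):
    """True if the OCR text is close enough to the gold text to be kept."""
    return normalized_edit_distance(ocr, gold) < threshold


def windows(tokens, size=150, step=75):
    """Yield overlapping windows of at most `size` tokens (`step` is an assumption)."""
    for start in range(0, len(tokens), step):
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
```

For example, `keep_sequence("Tbe cat", "The cat")` keeps the pair (one substitution over seven characters), while a heavily garbled sequence would be dropped from the train and validation sets.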

There are 3 classes in the data:

0: No OCR mistake
1: Start token of an OCR mistake
2: Inside token of an OCR mistake
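
To turn per-token predictions in this label scheme into OCR mistake spans, one option is the sketch below. It is an illustration only: the function name `mistake_spans` is hypothetical, and treating an inside label without a preceding start label as the start of a new span is an assumption about how to handle inconsistent predictions.

```python
def mistake_spans(labels):
    """Group per-token labels into (start, end) token spans, end exclusive.

    0 = no OCR mistake, 1 = start token of a mistake, 2 = inside token.
    A 2 that does not follow a 1 or 2 starts a new span (assumption).
    """
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 or (label == 2 and start is None):
            if start is not None:
                spans.append((start, i))  # a new start closes the open span
            start = i
        elif label == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))  # span runs to the end of the sequence
    return spans


spans = mistake_spans([0, 1, 2, 0, 1, 1, 0])
```

Here `spans` contains three mistakes: tokens 1 to 2, token 4, and token 5. Note that the model's tokens are subword tokens, so mapping spans back to words or characters needs an extra alignment step that is not shown.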

Results

Set	Loss
Train	0.224500
Val	0.285791
Test	0.4178357720375061

Average F1 by language:

BG	CZ	DE	EN	ES	FI	FR	NL	PL	SL
0.74	0.69	0.96	0.67	0.63	0.83	0.65	0.69	0.80	0.69

Demo

Space for this model.

Code

OCR post correction package
Notebooks
