mmdet htrflow instance segmentation

Model Description

The Swedish National Archives presents an end-to-end Handwritten Text Recognition (HTR) pipeline for running-text documents ranging from the mid 17th century to the late 19th century. The pipeline consists of the following components:

  1. RTMDet Instance Segmentation Models: The pipeline utilizes two RTMDet instance segmentation models, trained using MMDetection. The first model is designed to segment text regions within the documents, while the second model focuses on segmenting text lines within these regions. These models enable the identification and localization of text-line regions, which is a crucial step in the HTR pipeline since text-recognition models work at the text-line level.

  2. SATRN HTR Model: The pipeline incorporates a SATRN (Self-Attention Text Recognition Network) model, trained using MMOCR (OpenMMLab's OCR toolbox). SATRN was designed for irregular scene-text recognition, which makes it a good fit for HTR, given that handwriting is highly irregular. The SATRN model consists of a shallow CNN, a 2D-transformer encoder, and a transformer decoder that works on the character level. It was trained on about a million text-line images of running-text handwritten documents ranging from the mid 17th century to the late 19th century.

The models are designed to provide a generic pipeline for handwritten text recognition, offering robust performance for running-text documents from the mid 17th to the late 19th century.
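The three stages described above can be sketched as a simple nested loop. This is a minimal illustration, not the pipeline's actual code: `run_htr_pipeline` and its three callable parameters are hypothetical names, and the toy stand-ins at the bottom replace the real RTMDet and SATRN models so the sketch runs without any dependencies.

```python
from typing import Any, Callable, List, Sequence


def run_htr_pipeline(
    page: Any,
    segment_regions: Callable[[Any], Sequence[Any]],
    segment_lines: Callable[[Any], Sequence[Any]],
    recognize_line: Callable[[Any], str],
) -> List[str]:
    """Transcribe one page: regions -> lines -> text, in the order returned."""
    transcription = []
    for region in segment_regions(page):      # stage 1: text regions on the page
        for line in segment_lines(region):    # stage 2: text lines within a region
            transcription.append(recognize_line(line))  # stage 3: line-level HTR
    return transcription


# Toy stand-ins (assumptions for illustration only): a "page" is a list of
# regions, a "region" is a list of lines, and a "line" is already a string.
toy_page = [["first line", "second line"], ["third line"]]
result = run_htr_pipeline(
    toy_page,
    segment_regions=lambda page: page,
    segment_lines=lambda region: region,
    recognize_line=lambda line: line.upper(),
)
print(result)  # ['FIRST LINE', 'SECOND LINE', 'THIRD LINE']
```

Passing the models in as callables keeps the control flow independent of any particular framework; in practice each callable would wrap the corresponding model's inference call.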

Evaluation

The Swedish National Archives HTR pipeline has been evaluated using standard evaluation metrics for Handwritten Text Recognition. The Character Error Rate (CER) is commonly used to assess the accuracy of the text-recognition model. The best way to evaluate the entire pipeline is to run all three models on unsegmented document images and calculate CER for the entire pipeline.
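For readers unfamiliar with the metric, CER is commonly computed as the character-level edit distance between the predicted and the reference transcription, divided by the number of reference characters. The sketch below is a plain-Python illustration under that definition; the function names and the sample strings are made up for the example, and the exact normalization used for the reported numbers is not stated here.

```python
from typing import Sequence


def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions needed to turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total edit distance divided by total reference length over all lines."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(ref) for ref in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# Invented sample lines: one missing ":" and one missing "c" give 2 edits
# over 21 reference characters.
references = ["kongl. maj:t", "stockholm"]
hypotheses = ["kongl. majt", "stokholm"]
cer = character_error_rate(references, hypotheses)
print(f"CER = {cer:.3f}")  # CER = 0.095
```

Summing edits and characters over all lines before dividing weights long lines more than short ones, which is the usual choice when reporting a single CER for a test set.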

The reported performance metrics were obtained on several test sets from archives that were not included in the training set, spanning the entire time period the model was trained on. These error rates are therefore what you should expect when running the pipeline out of the box on your own documents, provided that the documents contain running text and fall within the model's time period. Actual performance may vary depending on the specific layout and handwriting styles encountered in the document.

| Model | train-eval | 1661-testset | 1664-testset | 1688-testset-unusual-layout | 1735-testset | 1740-1793-testset | 1777-testset | 1840-1890-testset | 1861-testset |
|---|---|---|---|---|---|---|---|---|---|
| SATRN_1650_1900 | 0.033 | 0.096 | 0.078 | 0.215 | 0.079 | 0.066 | 0.074 | 0.037 | 0.043 |
| SATRN_1650_1800 | 0.039 | 0.109 | 0.085 | 0.243 | 0.079 | 0.079 | 0.087 | 0.239 | 0.157 |
| SATRN_1800_1900 | 0.031 | 0.455 | 0.382 | 0.381 | 0.309 | 0.252 | 0.182 | 0.046 | 0.051 |

The lower two rows are for comparison only. The model trained exclusively on 19th century material actually performed worse on the 19th century test sets than the model trained on the entire time period. This is why we published only the aggregated model rather than models specialized on a specific century.
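The comparison can be checked directly from the reported numbers. The short script below copies the CER values from the table (leaving out `train-eval`, which is not a held-out test set) and lists the best model per test set; the variable and function names are chosen for this example only.

```python
# Held-out test sets, in the same order as the table columns.
TEST_SETS = [
    "1661", "1664", "1688-unusual-layout", "1735",
    "1740-1793", "1777", "1840-1890", "1861",
]

# CER per model and test set, copied from the table.
CER = {
    "SATRN_1650_1900": [0.096, 0.078, 0.215, 0.079, 0.066, 0.074, 0.037, 0.043],
    "SATRN_1650_1800": [0.109, 0.085, 0.243, 0.079, 0.079, 0.087, 0.239, 0.157],
    "SATRN_1800_1900": [0.455, 0.382, 0.381, 0.309, 0.252, 0.182, 0.046, 0.051],
}


def best_models(column: int) -> list:
    """Return every model that reaches the lowest CER on one test set."""
    lowest = min(scores[column] for scores in CER.values())
    return [name for name, scores in CER.items() if scores[column] == lowest]


winners = {name: best_models(i) for i, name in enumerate(TEST_SETS)}
for test_set, names in winners.items():
    print(f"{test_set:>20}: {', '.join(names)}")

# True if the aggregated model is best, or tied for best, on every test set.
aggregated_never_worse = all("SATRN_1650_1900" in names for names in winners.values())
print("Aggregated model never worse:", aggregated_never_worse)
```

On these numbers the aggregated model is best or tied for best on every held-out test set, with the only tie on the 1735 test set.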

Regular evaluations are conducted to monitor and improve the performance of the pipeline. As new evaluation results become available, this table will be updated to reflect the most recent performance metrics.

We also ran fine-tuning experiments to indicate the performance benefit of fine-tuning the model on domain-specific material, and to give a rough estimate of how many pages need to be transcribed for fine-tuning.

| Model | 17th-century-testsets-combined | 18th-century-testsets-combined | 19th-century-testsets-combined |
|---|---|---|---|
| SATRN_1650_1900 | 0.124 | 0.095 | 0.038 |
| SATRN_1650_1900_ft | 0.064 | 0.084 | 0.026 |
| Number of pages | 57 | 28 | 29 |

As the table shows, 50-60 transcribed pages are enough to roughly halve the CER on 17th century documents. About 30 transcribed pages give significant improvements on 18th and 19th century text, but the improvements are not as steep. If you have a large domain you want to run the pipeline on, we recommend transcribing 50-100 pages and fine-tuning the text-recognition model on this data. Guides on how to do this will be forthcoming.
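To make the size of these gains concrete, the relative CER reduction for each experiment can be computed from the fine-tuning table. The snippet below is a small arithmetic check; the experiments are identified by their page counts, and the names `EXPERIMENTS` and `relative_cer_reduction` are chosen for this example.

```python
# (pages transcribed, CER before fine-tuning, CER after fine-tuning),
# one tuple per column of the fine-tuning table.
EXPERIMENTS = [(57, 0.124, 0.064), (28, 0.095, 0.084), (29, 0.038, 0.026)]


def relative_cer_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed by fine-tuning."""
    return (before - after) / before


reductions = [relative_cer_reduction(before, after) for _, before, after in EXPERIMENTS]
for (pages, before, after), reduction in zip(EXPERIMENTS, reductions):
    print(f"{pages} pages: CER {before:.3f} -> {after:.3f} ({reduction:.1%} lower)")

# 57 pages: CER 0.124 -> 0.064 (48.4% lower)
# 28 pages: CER 0.095 -> 0.084 (11.6% lower)
# 29 pages: CER 0.038 -> 0.026 (31.6% lower)
```

The 57-page experiment removes close to half of the errors, while the two experiments with about 30 pages remove roughly a tenth and a third.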

Intended Use

The Swedish National Archives HTR pipeline is intended to be used for the following purposes:

Note that the pipeline is optimized for running-text documents from the specified time period and may not perform optimally on other types of documents or handwriting styles. It is also currently better suited to book pages than to complex layouts such as tables or newspapers.

Performance and Limitations

The performance of the Swedish National Archives HTR pipeline is influenced by several factors:

Training Data

The Swedish National Archives HTR pipeline was trained using a diverse dataset of binarized, running-text documents from the 17th to the 19th century. The training data includes various types of historical texts, such as letters, manuscripts, and official records.

The dataset comprises both high-quality and challenging examples to ensure the models' robustness. It covers a wide range of handwriting styles, legibility levels, and document conditions.

The training data was annotated to provide ground truth for text region and line segmentation, as well as text transcription. Expert archivists and historians contributed to the annotation process to ensure accurate labeling.

The data can be found here: (WIP, will be added soon)

Caveats and Future Work

Although the Swedish National Archives HTR pipeline has been trained and optimized for running-text documents from the specified time period, there are a few caveats and considerations to keep in mind:

Continuous Improvement: The pipeline is continuously being updated and improved as new training data becomes available and advancements in OCR technology occur. With access to more training data, the models will be updated to further enhance their performance and adaptability.

User Feedback: Users are encouraged to provide feedback on the pipeline's performance, identify issues, and report any potential biases or limitations. This feedback is highly valuable in refining the pipeline, addressing concerns, and informing future updates.

References

If you would like to learn more about the Swedish National Archives HTR pipeline or access the training data, please refer to the following resources: