A Novel Machine Learning Based Approach for Post-OCR Error Detection
Paper in proceedings, 2021

Post-processing is the most conventional approach for correcting errors caused by Optical Character Recognition (OCR) systems. Two steps are usually taken to correct OCR errors: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel system for error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR-error detection by notable margins. We achieved state-of-the-art F1-scores for eight out of the ten involved European languages. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum is for Polish, from 0.82 to 0.84.
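To make the idea of "n-gram counts of a candidate token" concrete, here is a minimal sketch of how such features might be computed. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say whether the n-grams are character-level or word-level, which values of n are used, or which reference corpus the counts come from. The sketch assumes character n-grams counted over a small clean word list, and the function names (`char_ngrams`, `build_counts`, `ngram_features`) are hypothetical.

```python
from collections import Counter


def char_ngrams(token, n):
    """Return the character n-grams of a token, padded with boundary markers."""
    padded = f"^{token}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_counts(corpus_tokens, n_values=(2, 3)):
    """Count character n-grams over a (hypothetical) clean reference corpus."""
    counts = {n: Counter() for n in n_values}
    for token in corpus_tokens:
        for n in n_values:
            counts[n].update(char_ngrams(token.lower(), n))
    return counts


def ngram_features(token, counts):
    """For each n, emit the minimum and mean corpus count of the token's n-grams.

    Assumption: low counts suggest an unusual character sequence,
    which a downstream classifier could treat as a sign of an OCR error.
    """
    features = []
    for n in sorted(counts):
        grams = char_ngrams(token.lower(), n)
        # Tokens too short to contain any n-gram get a single zero count.
        values = [counts[n][g] for g in grams] or [0]
        features.extend([min(values), sum(values) / len(values)])
    return features


# Toy reference corpus; a real system would use a large clean corpus.
reference_counts = build_counts(["the", "there", "then"])
clean_features = ngram_features("the", reference_counts)
noisy_features = ngram_features("tbe", reference_counts)
```

In a supervised setup like the one the abstract describes, vectors such as `clean_features` and `noisy_features` would be passed, together with gold error labels, to a classifier of one's choice. The sketch deliberately leaves the classifier out.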

Authors

Shafqat Mumtaz Virk

University of Gothenburg

Dana Dannélls

University of Gothenburg

Muhammad Azam Sheikh

Chalmers, Computer Science and Engineering, CSE Operations Support

International Conference Recent Advances in Natural Language Processing, RANLP

1313-8502 (ISSN)

1463-1470
9789544520724 (ISBN)

International Conference on Recent Advances in Natural Language Processing: Deep Learning for Natural Language Processing Methods and Applications, RANLP 2021
Virtual, Online

Subject categories

Language Technology (Computational Linguistics)

Computer Systems

Computer Vision and Robotics (Autonomous Systems)

DOI

10.26615/978-954-452-072-4_164

More information

Last updated

2022-02-09