OCR Processing of Swedish Historical Newspapers Using Deep Hybrid CNN-LSTM Networks
Paper in proceeding, 2021

Deep CNN-LSTM hybrid neural networks have been shown to improve the accuracy of Optical Character Recognition (OCR) models for different languages. In this paper, we examine to what extent these networks improve OCR accuracy on Swedish historical newspapers. By experimenting with the open-source OCR engine Calamari, we show that mixed deep CNN-LSTM hybrid models outperform previous models on the task of character recognition of Swedish historical newspapers spanning 1818-1848. We achieved an average character accuracy rate (CAR) of 97.43%, which is a new state-of-the-art result on 19th-century Swedish newspaper text.
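For readers unfamiliar with the metric, the sketch below shows one common way to compute a character accuracy rate: one minus the Levenshtein edit distance between the ground-truth line and the OCR output, divided by the length of the ground truth. This record does not state the exact formula used in the paper, so the definition is an assumption, and the function names `levenshtein` and `character_accuracy_rate` are hypothetical and not part of Calamari.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: CAR = 1 - edit_distance / len(reference)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# One wrong character out of eight in the ground truth
print(character_accuracy_rate("Göteborg", "Goteborg"))  # 0.875
```

Under this definition the value can drop below zero when the OCR output contains many spurious insertions, so reported averages are usually computed over many lines.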

Newsprint

Long short-term memory

Authors

Molly Brandt Skelbye

Student at Chalmers

Dana Dannélls

University of Gothenburg

International Conference Recent Advances in Natural Language Processing, RANLP

13138502 (ISSN)

190-198
9789544520724 (ISBN)

International Conference on Recent Advances in Natural Language Processing: Deep Learning for Natural Language Processing Methods and Applications, RANLP 2021
Virtual, Online

Subject categories

Language Technology (Computational Linguistics)

Specific Languages

Bioinformatics (Computational Biology)

DOI

10.26615/978-954-452-072-4_023

More information

Last updated

2022-02-10