Language Model Assisted OCR Classification for Republican Chinese Newspaper Text

Henke, Konstantin ; Arnold, Matthias

PDF, English - main document
License: Creative Commons Attribution 4.0

Official URL: http://doi.org/10.6853/DADH.202310_(12).0001

Citation of documents: Please do not cite the URL displayed in your browser's address bar. Instead, use the DOI, URN, or persistent URL below, whose long-term accessibility we can guarantee.

DOI: 10.11588/heidok.00031416
URN: urn:nbn:de:bsz:16-heidok-314169

Abstract

In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. Our basis is a small fraction of the image corpus for which text ground truth exists. We introduce a character segmentation method which produces over 90,000 labeled images of single characters and train a GoogLeNet classifier as an OCR model. In addition, we create synthetic training data from character images extracted from Song-Ti fonts. Randomly augmented on the fly and used for pre-training, they increase OCR accuracy from 95.49% to 96.95% on our test set. Finally, we employ post-OCR correction based on a pre-trained masked language model and present heuristics to select the required hyperparameters, by which we are able to correct 16% of remaining classification errors, increasing accuracy on the test set to 97.44%.
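
The reported share of corrected errors can be checked against the accuracy figures above. The following is a minimal sketch, assuming that all three accuracies are measured on the same test set and that "remaining classification errors" means the error rate left after pre-training (100% minus 96.95%). The function name `relative_error_reduction` is a hypothetical helper introduced here for illustration, not part of the authors' code.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Return the fraction of errors present at acc_before that are gone at acc_after.

    Both accuracies are given in percent and are assumed to refer to the same test set.
    """
    err_before = 100.0 - acc_before  # error rate before the improvement, in percent
    err_after = 100.0 - acc_after    # error rate after the improvement, in percent
    return (err_before - err_after) / err_before


# Accuracy figures taken from the abstract.
pretrain_gain = relative_error_reduction(95.49, 96.95)  # effect of synthetic pre-training
lm_gain = relative_error_reduction(96.95, 97.44)        # effect of post-OCR correction

print(f"Pre-training removes {pretrain_gain:.1%} of the baseline errors")
print(f"Post-OCR correction removes {lm_gain:.1%} of the remaining errors")
```

Under these assumptions, the second value comes out at roughly 16%, which agrees with the figure stated in the abstract.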

Translation of abstract (other)

Chinese Abstract: 本文為研發使用神經網絡的光學字元辨識（optical character recognition, OCR）工具提出了一些方法，以辨識民國時期中文報紙中的文章部分。這項工作的基礎為一小部分已存在基準真相（ground truth）的圖像語料。我們引入了一種字符分割方法，從而生成了超過90,000 個有標籤的單一字符圖像，並且訓練了一個GoogLeNet 分類器作為OCR 模型。此外，我們從宋體字體中提取字符圖像，以此製作了訓練數據。這些圖像被隨機增強並被用於預訓練，測試集的OCR 準確率由95.49% 提高到96.95%。最後，我們採用了基於預訓練遮罩語言模型（Masked LM）的OCR 後校正，並提出啟發式方法來選擇所需的超參數。通過這些方法，我們能夠校正16% 的剩餘分類錯誤，將測試集的準確率提高到97.44%。

Document type:	Preprint
Publisher:	Taiwanese Association for Digital Humanities
Place of Publication:	Taipei, ROC
Date Deposited:	29 Feb 2024 14:04
Date:	2024
Faculties / Institutes:	Philosophische Fakultät > Institut für Sinologie; Service facilities > Heidelberg Center for Transcultural Studies (HCTS); Neuphilologische Fakultät > Institut für Computerlinguistik
DDC-classification:	004 Data processing Computer science; 020 Library and information sciences; 490 Other languages; 890 Literatures of other languages; 950 General history of Asia Far East
Additional Information:	Chinese Title: 以語言模型輔助民國報紙文本的光學字元辨識分類