title: Language Model Assisted OCR Classification for Republican Chinese Newspaper Text
creator: Henke, Konstantin
creator: Arnold, Matthias
subject: ddc-004
subject: 004 Data processing Computer science
subject: ddc-020
subject: 020 Library and information sciences
subject: ddc-490
subject: 490 Other languages
subject: ddc-890
subject: 890 Literatures of other languages
subject: ddc-950
subject: 950 General history of Asia Far East
description: In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. Our basis is a small fraction of the image corpus for which text ground truth exists. We introduce a character segmentation method which produces over 90,000 labeled images of single characters and train a GoogLeNet classifier as an OCR model. In addition, we create synthetic training data from character images extracted from Song-Ti fonts. Randomly augmented on the fly and used for pre-training, they increase OCR accuracy from 95.49% to 96.95% on our test set. Finally, we employ post-OCR correction based on a pre-trained masked language model and present heuristics to select the required hyperparameters, by which we are able to correct 16% of remaining classification errors, increasing accuracy on the test set to 97.44%.
publisher: Taiwanese Association for Digital Humanities
date: 2024
type: Preprint
type: info:eu-repo/semantics/preprint
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/31416/7/Language_model_Henke_Arnold_2023.pdf
identifier: DOI:10.11588/heidok.00031416
identifier: http://doi.org/10.6853/DADH.202310_(12).0001
identifier: urn:nbn:de:bsz:16-heidok-314169
identifier: Henke, Konstantin ; Arnold, Matthias (2024) Language Model Assisted OCR Classification for Republican Chinese Newspaper Text. [Preprint]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/31416/
rights: info:eu-repo/semantics/openAccess
rights: Please see front page of the work
language: eng