TY  - GEN
ID  - heidok31416
CY  - Taipei, ROC
AV  - public
Y1  - 2024///
TI  - Language Model Assisted OCR Classification for Republican Chinese Newspaper Text
PB  - Taiwanese Association for Digital Humanities
N2  - In this work, we present methods to obtain a neural optical character recognition (OCR) tool for article blocks in a Republican Chinese newspaper. Our basis is a small fraction of the image corpus for which text ground truth exists. We introduce a character segmentation method which produces over 90,000 labeled images of single characters and train a GoogLeNet classifier as an OCR model. In addition, we create synthetic training data from character images extracted from Song-Ti fonts. Randomly augmented on the fly and used for pre-training, they increase OCR accuracy from 95.49% to 96.95% on our test set. Finally, we employ post-OCR correction based on a pre-trained masked language model and present heuristics to select the required hyperparameters, by which we are able to correct 16% of the remaining classification errors, increasing accuracy on the test set to 97.44%.
A1  - Henke, Konstantin
A1  - Arnold, Matthias
UR  - http://doi.org/10.6853/DADH.202310_(12).0001
ER  - 