title: Building and Improving an OCR Classifier for Republican Chinese Newspaper Text
creator: Henke, Konstantin
description: This work presents methods and results of an initial step towards full text extraction from a Republican Chinese newspaper. My basis is a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method which produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images extracted from font files and randomly augmented on the fly, and afterwards fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95%, and that post-processing with a masked language model corrects up to 16% of the remaining errors, raising accuracy on the test set to 97.44%.
date: 2021
type: Bachelor thesis
type: info:eu-repo/semantics/bachelorThesis
type: NonPeerReviewed
format: application/pdf
identifier: https://archiv.ub.uni-heidelberg.de/volltextserver/30845/1/Bachelor_Thesis.pdf
identifier: DOI:10.11588/heidok.00030845
identifier: urn:nbn:de:bsz:16-heidok-308453
identifier: Henke, Konstantin (2021) Building and Improving an OCR Classifier for Republican Chinese Newspaper Text. [Bachelor thesis]
relation: https://archiv.ub.uni-heidelberg.de/volltextserver/30845/
rights: info:eu-repo/semantics/openAccess
rights: Please see front page of the work
language: eng
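
The accuracy figures quoted in the description can be cross-checked with a little arithmetic. The following is a minimal sketch, not taken from the thesis: the helper name `relative_error_reduction` is hypothetical, and it assumes that "remaining errors" means the share of test-set characters still misclassified before post-processing.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of errors removed when accuracy (in percent) rises
    from acc_before to acc_after. Hypothetical helper for illustration."""
    err_before = 100.0 - acc_before  # error rate before the step, in percent
    err_after = 100.0 - acc_after    # error rate after the step, in percent
    return (err_before - err_after) / err_before


# Accuracies as reported in the description (test set, in percent).
lm_gain = relative_error_reduction(96.95, 97.44)        # masked language model post-processing
pretrain_gain = relative_error_reduction(95.49, 96.95)  # pre-training on font images

print(f"post-processing removes {lm_gain:.1%} of remaining errors")
print(f"pre-training removes {pretrain_gain:.1%} of errors")
```

Under that assumption, the step from 96.95% to 97.44% removes roughly 16% of the remaining errors, which agrees with the figure stated in the description. The corresponding value for the pre-training step is derived here and is not a number reported in the record.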