eprintid: 30845
rev_number: 13
eprint_status: archive
userid: 6343
dir: disk0/00/03/08/45
datestamp: 2021-12-08 10:17:03
lastmod: 2021-12-14 06:48:18
status_changed: 2021-12-08 10:17:03
type: bachelorThesis
metadata_visibility: show
creators_name: Henke, Konstantin
title: Building and Improving an OCR Classifier for Republican Chinese Newspaper Text
divisions: i-728300
divisions: i-90500
adv_faculty: af-09
cterms_swd: Computerlinguistik
cterms_swd: Optische Zeichenerkennung
cterms_swd: Chinesisch
abstract: This work presents methods and results of an initial step towards full-text extraction from a Republican Chinese newspaper. My basis is a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method which produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images extracted from font files and randomly augmented on the fly, and fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95%, and that post-processing with a masked language model corrects up to 16% of the remaining errors, raising test-set accuracy to 97.44%.
date: 2021
id_scheme: DOI
id_number: 10.11588/heidok.00030845
ppn_swb: 1780724381
own_urn: urn:nbn:de:bsz:16-heidok-308453
date_accepted: 2021-11-26
language: eng
bibsort: HENKEKONSTBUILDINGAN2021
full_text_status: public
place_of_pub: Heidelberg
citation: Henke, Konstantin (2021) Building and Improving an OCR Classifier for Republican Chinese Newspaper Text. [Bachelor thesis]
document_url: https://archiv.ub.uni-heidelberg.de/volltextserver/30845/1/Bachelor_Thesis.pdf