eprintid: 30845
rev_number: 13
eprint_status: archive
userid: 6343
dir: disk0/00/03/08/45
datestamp: 2021-12-08 10:17:03
lastmod: 2021-12-14 06:48:18
status_changed: 2021-12-08 10:17:03
type: bachelorThesis
metadata_visibility: show
creators_name: Henke, Konstantin
title: Building and Improving an OCR Classifier for Republican Chinese Newspaper Text
divisions: i-728300
divisions: i-90500
adv_faculty: af-09
cterms_swd: Computerlinguistik
cterms_swd: Optische Zeichenerkennung
cterms_swd: Chinesisch
abstract: This work presents methods and results of an initial step towards full-text extraction from a Republican Chinese newspaper. My basis is a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method that produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images extracted from font files and randomly augmented on the fly, and afterwards fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95% and, finally, that post-processing with a masked language model corrects up to 16% of the remaining errors, raising test-set accuracy to 97.44%.
date: 2021
id_scheme: DOI
id_number: 10.11588/heidok.00030845
ppn_swb: 1780724381
own_urn: urn:nbn:de:bsz:16-heidok-308453
date_accepted: 2021-11-26
language: eng
bibsort: HENKEKONSTBUILDINGAN2021
full_text_status: public
place_of_pub: Heidelberg
citation: Henke, Konstantin (2021) Building and Improving an OCR Classifier for Republican Chinese Newspaper Text. [Bachelor thesis]
document_url: https://archiv.ub.uni-heidelberg.de/volltextserver/30845/1/Bachelor_Thesis.pdf