
Building and Improving an OCR Classifier for Republican Chinese Newspaper Text

Henke, Konstantin

Full text: PDF, English, main document (3MB)
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0


Abstract

This work presents methods and results for an initial step towards full-text extraction from a Republican Chinese newspaper. It builds on a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method that produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images that are extracted from font files and randomly augmented on the fly, and fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95%, and that post-processing with a masked language model corrects up to 16% of the remaining errors, raising test-set accuracy to 97.44%.
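The accuracy figures in the abstract can be related to the "16% of remaining errors" claim with a short calculation. The sketch below is only an illustration: it assumes that the relative error reduction is computed from the quoted test-set accuracies, which the abstract does not state explicitly, and the function and constant names are hypothetical rather than taken from the thesis.

```python
def error_rate(accuracy_pct: float) -> float:
    """Error rate in percentage points for an accuracy given in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the errors present at acc_before that are gone at acc_after."""
    before = error_rate(acc_before)
    after = error_rate(acc_after)
    return (before - after) / before


# Test-set accuracies quoted in the abstract, in percent.
# The constant names are illustrative, not from the thesis.
ACC_FINE_TUNE_ONLY = 95.49
ACC_WITH_PRETRAINING = 96.95
ACC_WITH_LM_POSTPROCESSING = 97.44

pretraining_gain = relative_error_reduction(
    ACC_FINE_TUNE_ONLY, ACC_WITH_PRETRAINING
)
postprocessing_gain = relative_error_reduction(
    ACC_WITH_PRETRAINING, ACC_WITH_LM_POSTPROCESSING
)

print(f"Errors removed by pre-training:    {pretraining_gain:.1%}")
print(f"Errors removed by post-processing: {postprocessing_gain:.1%}")
```

Under this assumption, post-processing removes roughly 16% of the errors left after fine-tuning (3.05 percentage points down to 2.56), which matches the figure in the abstract, and pre-training removes about a third of the errors of the model trained without it.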

Document type: Bachelor thesis
Supervisor: Frank, Prof. Dr. Anette
Place of Publication: Heidelberg
Date of thesis defense: 26 November 2021
Date Deposited: 08 Dec 2021 10:17
Date: 2021
Faculties / Institutes: Service facilities > Heidelberg Center for Transcultural Studies (HCTS); Neuphilologische Fakultät > Institut für Computerlinguistik
Controlled Keywords: Computational linguistics, Optical character recognition, Chinese