PDF, English (main document, 3 MB) | License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 |
Abstract
This work presents methods and results of an initial step towards full-text extraction from a Republican Chinese newspaper. My basis is a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method that produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images extracted from font files and randomly augmented on the fly, and afterwards fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95%, and finally that post-processing with a masked language model corrects up to 16% of the remaining errors, raising test-set accuracy to 97.44%.
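The accuracy figures in the abstract can be related to each other with a short calculation. The sketch below assumes that "remaining errors" means the test-set error rate (100% minus accuracy) and that the correction share is the relative drop in that error rate; the function name `relative_error_reduction` is a hypothetical helper, not something taken from the thesis.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Share of the errors present at acc_before that are gone at acc_after.

    Both arguments are accuracies in percent. Assumption: error rate is
    simply 100 minus accuracy.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracies quoted in the abstract (test set, in percent).
pretrain_gain = relative_error_reduction(95.49, 96.95)  # effect of pre-training
postproc_gain = relative_error_reduction(96.95, 97.44)  # effect of post-processing

print(f"pre-training removes {pretrain_gain:.1%} of errors")     # 32.4%
print(f"post-processing removes {postproc_gain:.1%} of errors")  # 16.1%
```

Under these assumptions, the step from 96.95% to 97.44% corresponds to roughly 16% fewer errors, which matches the figure quoted for the masked language model.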
Document type: | Bachelor thesis |
---|---|
Supervisor: | Frank, Prof. Dr. Anette |
Place of Publication: | Heidelberg |
Date of thesis defense: | 26 November 2021 |
Date Deposited: | 08 Dec 2021 10:17 |
Date: | 2021 |
Faculties / Institutes: | Service facilities > Heidelberg Center for Transcultural Studies (HCTS); Neuphilologische Fakultät > Institut für Computerlinguistik |
Controlled Keywords: | Computerlinguistik (computational linguistics), Optische Zeichenerkennung (optical character recognition), Chinesisch (Chinese) |