PDF, English
- Main document
Abstract
This work presents methods and results of an initial step towards full-text extraction from a Republican Chinese newspaper. It builds on a small fraction of the image corpus for which text ground truth exists. I introduce a character segmentation method that produces over 90,000 labeled images of single characters. I then pre-train a GoogLeNet classifier as an OCR model on character images extracted from font files and randomly augmented on the fly, and fine-tune it on the previously segmented character images. I show that the pre-training step increases OCR accuracy on the test set from 95.49% to 96.95%. Finally, I show that post-processing with a masked language model corrects up to 16% of the remaining errors, raising test-set accuracy to 97.44%.
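The relationship between the accuracy figures and the share of corrected errors can be checked with a short calculation. The sketch below is an illustration only, not code from the thesis: the helper name `relative_error_reduction` is hypothetical, and it assumes that "remaining errors" means the test-set error rate (100% minus accuracy) before post-processing.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of the errors present at acc_before that are gone at acc_after.

    Both accuracies are percentages in the range [0, 100).
    Hypothetical helper for illustration; not taken from the thesis.
    """
    error_before = 100.0 - acc_before
    if error_before <= 0:
        raise ValueError("acc_before must be below 100%")
    return (acc_after - acc_before) / error_before


# Accuracy figures quoted in the abstract (test set, in percent).
ACC_NO_PRETRAINING = 95.49
ACC_WITH_PRETRAINING = 96.95
ACC_AFTER_MLM = 97.44

# Share of errors removed by adding the pre-training step.
pretraining_gain = relative_error_reduction(ACC_NO_PRETRAINING, ACC_WITH_PRETRAINING)

# Share of the remaining errors removed by masked-language-model post-processing.
mlm_gain = relative_error_reduction(ACC_WITH_PRETRAINING, ACC_AFTER_MLM)

print(f"pre-training removes {pretraining_gain:.1%} of errors")
print(f"post-processing removes {mlm_gain:.1%} of remaining errors")
```

Under this reading, the test-set numbers for post-processing give 0.49 / 3.05, which is consistent with the "up to 16%" stated in the abstract.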
Document type: | Bachelor thesis |
---|---|
First examiner: | Frank, Prof. Dr. Anette |
Place of publication: | Heidelberg |
Date of examination: | 26 November 2021 |
Date created: | 08 Dec. 2021 10:17 |
Year of publication: | 2021 |
Institutes/Facilities: | Zentrale und Sonstige Einrichtungen > Heidelberger Zentrum für Transkulturelle Studien (HCTS); Neuphilologische Fakultät > Institut für Computerlinguistik |
Controlled keywords: | Computational linguistics, Optical character recognition, Chinese |