Analyzing Handwritten and Transcribed Symbols in Disparate Corpora

Bogacz, Bartosz Ryszard

DOI: 10.11588/heidok.00024537
URN: urn:nbn:de:bsz:16-heidok-245379

Abstract

Cuneiform tablets are among the oldest textual artifacts. The script was in use for more than three millennia, and the surviving tablets are comparable in number and relevance to texts written in Latin or ancient Greek. These tablets are typically found in the Middle East and were written by pressing wedge-shaped impressions into wet clay. Motivated by the increased demand for computerized analysis of documents within the Digital Humanities, we develop the foundation for quantitative processing of cuneiform script.

Acquiring a cuneiform tablet with a 3D scanner and manually creating line tracings yield two completely different representations of the same type of text source. Each representation is typically processed with its own tool set, so textual analysis is limited to one type of digital representation. To homogenize these data sources, a unifying minimal wedge feature description is introduced. It is extracted by pattern matching followed by conflict resolution, which is necessary because cuneiform is written densely with highly overlapping wedges.

Two similarity metrics for cuneiform signs, based on distinct assumptions, are presented. (i) An implicit model represents cuneiform signs as undirected mathematical graphs and measures the similarity of signs with graph kernels. (ii) An explicit model approaches the problem of recognition through an optimal assignment between the wedge configurations of two signs. Further, methods for spotting cuneiform script are developed by combining the feature descriptors for cuneiform wedges with prior work on segmentation-free word spotting using part-structured models. The ink-ball model is adapted by treating wedge feature descriptors as individual parts. The similarity metrics and the adapted spotting model are both evaluated on a real-world dataset, where they outperform the state of the art in cuneiform sign similarity and spotting.
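
The idea behind the explicit model (ii), an optimal assignment between the wedge configurations of two signs, can be illustrated with a minimal sketch. This is not the method of the dissertation. The wedge descriptor `(x, y, angle)`, the cost function `wedge_cost`, the parameters `angle_weight` and `unmatched_penalty`, and the brute-force search are all assumptions chosen for illustration only.

```python
from itertools import permutations
from math import hypot

# Hypothetical wedge descriptor: (x, y, angle_in_degrees).
# The dissertation's actual wedge feature description may differ.


def wedge_cost(a, b, angle_weight=0.01):
    """Cost of matching wedge a to wedge b (assumed form):
    Euclidean distance of positions plus a weighted angle difference."""
    diff = abs(a[2] - b[2]) % 360
    diff = min(diff, 360 - diff)  # wrap around, e.g. 350 vs 10 -> 20
    return hypot(a[0] - b[0], a[1] - b[1]) + angle_weight * diff


def sign_distance(sign_a, sign_b, unmatched_penalty=1.0):
    """Minimum total cost over all one-to-one assignments of the
    smaller sign's wedges to wedges of the larger sign.

    Brute force over permutations, so only suitable for signs with
    a handful of wedges. Wedges left unmatched add a fixed penalty.
    """
    small, large = sorted((sign_a, sign_b), key=len)
    best = float("inf")
    for candidate in permutations(large, len(small)):
        cost = sum(wedge_cost(a, b) for a, b in zip(small, candidate))
        best = min(best, cost)
    return best + unmatched_penalty * (len(large) - len(small))


sign_a = [(0, 0, 0), (1, 0, 90)]
sign_b = [(1, 0, 90), (0, 0, 0)]             # same wedges, other order
sign_c = [(0, 0, 0), (1, 0, 90), (2, 0, 45)]  # one extra wedge

print(sign_distance(sign_a, sign_b))
print(sign_distance(sign_a, sign_c))
```

The brute-force search grows factorially with the number of wedges. For larger signs, a polynomial-time solver for the assignment problem, such as the Hungarian algorithm, would be the usual choice.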

To demonstrate the applicability of these methods for computational cuneiform analysis, a novel approach is presented for mining frequent constellations of wedges, resulting in spatial n-grams. Furthermore, a method for automated transliteration of tablets is evaluated by employing structured and sequential learning on a dataset of parallel sentences. Finally, the conclusion outlines how the presented methods enable the development of new tools and computational analyses, which are objective and reproducible, for quantitative processing of cuneiform script.
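
As a rough illustration of what counting frequent wedge constellations could look like, the following sketch groups each wedge with its nearest neighbours and counts how often each combination of wedge types occurs. The abstract does not specify the mining procedure, so everything here is an assumption: the wedge record `(x, y, wedge_type)`, the nearest-neighbour grouping, and the names `spatial_ngrams` and `frequent_constellations` are hypothetical.

```python
from collections import Counter
from math import hypot

# Hypothetical wedge record: (x, y, wedge_type).


def spatial_ngrams(wedges, n=2):
    """Group each wedge with its n-1 nearest neighbours (assumed
    definition of a constellation) and return the sorted wedge types
    of every distinct group."""
    seen = set()
    grams = []
    for i, (x, y, _) in enumerate(wedges):
        others = sorted(
            (j for j in range(len(wedges)) if j != i),
            key=lambda j: hypot(wedges[j][0] - x, wedges[j][1] - y),
        )
        group = frozenset([i] + others[: n - 1])
        # Skip incomplete groups and groups already found from another wedge.
        if len(group) == n and group not in seen:
            seen.add(group)
            grams.append(tuple(sorted(wedges[j][2] for j in group)))
    return grams


def frequent_constellations(tablets, n=2, min_count=2):
    """Count constellations over all tablets and keep the frequent ones."""
    counts = Counter()
    for wedges in tablets:
        counts.update(spatial_ngrams(wedges, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


tablet_1 = [(0, 0, "v"), (1, 0, "h"), (10, 0, "v"), (11, 0, "h")]
tablet_2 = [(0, 0, "v"), (0, 1, "v")]

print(frequent_constellations([tablet_1, tablet_2], n=2, min_count=2))
```

Sorting the wedge types makes a constellation independent of the order in which its wedges are visited, at the price of discarding their relative geometry. A more faithful representation would keep relative positions or angles as well.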

Document type:	Dissertation
First examiner:	Mara, Dr. Hubert
Place of publication:	Heidelberg, Germany
Date of examination:	7 February 2018
Date created:	25 May 2018 10:39
Year of publication:	2018
Institutes/Facilities:	Fakultät für Mathematik und Informatik > Institut für Informatik
DDC subject group:	004 Computer science
Controlled keywords:	Machine learning, Visual search, Language analysis