Analyzing Handwritten and Transcribed Symbols in Disparate Corpora

Bogacz, Bartosz Ryszard

Preview

PDF, English - main document
Download (16MB) | Terms of use

Citation of documents: Please do not cite the URL that is displayed in your browser location input, instead use the DOI, URN or the persistent URL below, as we can guarantee their long-time accessibility.

DOI: 10.11588/heidok.00024537
URN: urn:nbn:de:bsz:16-heidok-245379
URL: http://www.ub.uni-heidelberg.de/archiv/24537

Abstract

Cuneiform tablets appertain to the oldest textual artifacts used for more than three millennia and are comparable in amount and relevance to texts written in Latin or ancient Greek. These tablets are typically found in the Middle East and were written by imprinting wedge-shaped impressions into wet clay. Motivated by the increased demand for computerized analysis of documents within the Digital Humanities, we develop the foundation for quantitative processing of cuneiform script.

Using a 3D-Scanner to acquire a cuneiform tablet or manually creating line tracings are two completely different representations of the same type of text source. Each representation is typically processed with its own tool-set and the textual analysis is therefore limited to a certain type of digital representation. To homogenize these data source a unifying minimal wedge feature description is introduced. It is extracted by pattern matching and subsequent conflict resolution as cuneiform is written densely with highly overlapping wedges.

Similarity metrics for cuneiform signs based on distinct assumptions are presented. (i) An implicit model represents cuneiform signs using undirected mathematical graphs and measures the similarity of signs with graph kernels. (ii) An explicit model approaches the problem of recognition by an optimal assignment between the wedge configurations of two signs. Further, methods for spotting cuneiform script are developed, combining the feature descriptors for cuneiform wedges with prior work on segmentation-free word spotting using part-structured models. The ink-ball model is adapted by treating wedge feature descriptors as individual parts. The similarity metrics and the adapted spotting model are both evaluated on a real-world dataset outperforming the state-of-the-art in cuneiform sign similarity and spotting.

To prove the applicability of these methods for computational cuneiform analysis, a novel approach is presented for mining frequent constellations of wedges resulting in spatial n-grams. Furthermore, a method for automatized transliteration of tablets is evaluated by employing structured and sequential learning on a dataset of parallel sentences. Finally, the conclusion outlines how the presented methods enable the development of new tools and computational analyses, which are objective and reproducible, for quantitative processing of cuneiform script.

Document type:	Dissertation
Supervisor:	Mara, Dr. Hubert
Place of Publication:	Heidelberg, Germany
Date of thesis defense:	7 February 2018
Date Deposited:	25 May 2018 10:39
Date:	2018
Faculties / Institutes:	The Faculty of Mathematics and Computer Science > Department of Computer Science
DDC-classification:	004 Data processing Computer science
Controlled Keywords:	Maschinelles Lernen, Visuelle Suche, Sprachanalyse