eprintid: 30918
rev_number: 24
eprint_status: archive
userid: 2006
dir: disk0/00/03/09/18
datestamp: 2021-12-10 12:12:34
lastmod: 2021-12-13 12:47:47
status_changed: 2021-12-10 12:12:34
type: preprint
metadata_visibility: show
creators_name: Arnold, Matthias
title: Multilingual research projects: Challenges for making use of standards, authority files, and character recognition
subjects: ddc-004
subjects: ddc-020
subjects: ddc-400
subjects: ddc-490
subjects: ddc-890
subjects: ddc-950
divisions: i-719000
divisions: i-72140
divisions: i-728300
keywords: language bias; multilingual and non-Latin script research data; metadata standards; document layout analysis; optical character recognition; page segmentation
cterms_swd: Mehrsprachigkeit
cterms_swd: Metadaten
cterms_swd: Optische Zeichenerkennung
cterms_swd: Segmentierung
abstract: Academic research on digital non-Latin script (hereafter: NLS) research data can pose a number of challenges simply because the material comes from a region where the Latin alphabet was not used. Not all of these challenges are easy to spot. In this paper, I introduce two use cases to demonstrate different aspects of the complex tasks that may be related to NLS material. The first use case focuses on metadata standards used to describe NLS material. Taking VRA Core 4 XML as an example, I show where we found limitations for NLS material and how we were able to overcome them by expanding the standard. In the second use case, I look at the research data itself. Although the full-text digitization of Western newspapers from the 20th century is usually no longer problematic, this is not the case for Chinese newspapers from the Republican era (1912-1949). A major obstacle here is the dense and complex layout of the pages, which prevents OCR solutions from even reaching the character recognition stage. In our approach, we combine different manual and computational methods, such as crowdsourcing, pattern recognition, and neural networks, to process the material more efficiently. The two use cases illustrate that data standards or processing methods which are established and stable for Latin script material may not always be easily adapted to non-Latin script research data.
date: 2021
id_scheme: DOI
id_number: 10.11588/heidok.00030918
ppn_swb: 1782000046
own_urn: urn:nbn:de:bsz:16-heidok-309181
language: eng
bibsort: ARNOLDMATTMULTILINGU2021
full_text_status: public
place_of_pub: Heidelberg
citation: Arnold, Matthias (2021) Multilingual research projects: Challenges for making use of standards, authority files, and character recognition. [Preprint]
document_url: https://archiv.ub.uni-heidelberg.de/volltextserver/30918/7/Arnold_Multilingual_research_2021.pdf