
Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition

Arnold, Matthias

In: Digital Studies / Le champ numérique, 12 (28 September 2022), No. 1, pp. 1-36. ISSN 1918-3666

Full text: PDF, English, main document (878 kB)
License: "Multilingual Research Projects: Non-Latin Script Challenges for Making Use of Standards, Authority Files, and Character Recognition" by Arnold, Matthias is licensed under Creative Commons Attribution 4.0.


Abstract

Academic research on digital non-Latin script (hereafter: NLS) data can pose a number of challenges simply because the material comes from a region where the Latin alphabet was not used. Not all of these challenges are easy to spot. In this paper, I introduce two use cases to demonstrate different aspects of the complex tasks that may be related to NLS material. The first use case focuses on metadata standards used to describe NLS material. Taking VRA Core 4 XML as an example, I will show where we found limitations for NLS material and how we were able to overcome them by expanding the standard. In the second use case, I look at the research data itself. Although the full-text digitization of Western newspapers from the 20th century is usually no longer problematic, this is not the case for Chinese newspapers from the Republican era (1912–1949). A major obstacle here is the dense and complex layout of the pages, which prevents OCR solutions from even reaching the character recognition step. In our approach, we combine different manual and computational methods, such as crowdsourcing, pattern recognition, and neural networks, to process the material more efficiently. The two use cases illustrate that data standards or processing methods that are established and stable for Latin script material may not always be easily adapted to non-Latin script research data.

Document type: Article
Journal or Publication Title: Digital Studies / Le champ numérique
Volume: 12
Number: 1
Publisher: University of Lethbridge
Place of Publication: Lethbridge
Date Deposited: 30 Sep 2022 09:03
Date: 28 September 2022
ISSN: 1918-3666
Page Range: pp. 1-36
Faculties / Institutes: Philosophische Fakultät > Institut für Sinologie; Service facilities > Heidelberg Center for Transcultural Studies (HCTS)
DDC-classification: 004 Data processing, Computer science; 020 Library and information sciences; 400 Linguistics; 490 Other languages; 950 General history of Asia, Far East