Reading Old Turkic runiform inscriptions with the aid of 3D simulation
"Augmenting parametric data synthesis with 3D simulation for OCR on Old Turkic runiform inscriptions: A case study of the Kül Tegin inscription", Mehmet Oğuz Derin and Erdem Uçar, Journal of Old Turkic Studies (7/21/24)
Abstract
Optical character recognition for historical scripts like the Old Turkic runiform script poses significant challenges due to the scarcity of annotated data and the variability of writing styles, materials, and degradation. This paper proposes a novel data synthesis pipeline that augments parametric generation with 3D rendering to build realistic and diverse training data for Old Turkic runiform script grapheme classification. Our approach synthesizes distance field variations of graphemes, applies parametric randomization, and renders them in simulated 3D scenes with varying textures, lighting, and environments. We train a Vision Transformer model on the synthesized data and evaluate its performance on photographs of the Kül Tegin inscription. Experimental results demonstrate the effectiveness of our approach, with the model achieving high accuracy without seeing any real-world data during training. Finally, we discuss avenues for future research. Our work provides a promising direction for overcoming data scarcity in Old Turkic runiform script OCR.
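To make the abstract's pipeline concrete, here is a minimal sketch of distance-field grapheme synthesis with parametric randomization. It is not the authors' code: the toy grapheme shape, the parameter ranges, and the surface model are illustrative assumptions, and a flat 2D composite stands in for the paper's simulated 3D scenes.

```python
# Minimal sketch of the synthesis pipeline described in the abstract.
# Assumptions (not from the paper): the toy grapheme shape, all parameter
# ranges, and the 2D "stone surface" composite that stands in for the
# paper's full 3D rendering with textures, lighting, and environments.
import numpy as np
from scipy.ndimage import distance_transform_edt, rotate

rng = np.random.default_rng(0)

def grapheme_mask(size=64):
    """Hypothetical stand-in for one runiform grapheme: two strokes."""
    m = np.zeros((size, size), dtype=bool)
    m[8:56, 30:34] = True   # vertical stroke
    m[20:24, 16:48] = True  # horizontal stroke
    return m

def signed_distance_field(mask):
    """Negative inside the strokes, positive outside."""
    return distance_transform_edt(~mask) - distance_transform_edt(mask)

def synthesize(mask):
    """One randomized grayscale training crop for this grapheme."""
    sdf = signed_distance_field(mask)
    # Parametric randomization: vary stroke width by thresholding the
    # distance field, then apply a random in-plane rotation.
    glyph = (sdf <= rng.uniform(-1.0, 3.0)).astype(float)
    glyph = rotate(glyph, angle=rng.uniform(-15, 15), reshape=False, order=1)
    # Simulated surface: random base tone plus grain noise.
    surface = rng.uniform(0.4, 0.7) + 0.1 * rng.standard_normal(mask.shape)
    # A horizontal brightness ramp as a crude proxy for varied lighting.
    ramp = np.linspace(rng.uniform(-0.15, 0.15),
                       rng.uniform(-0.15, 0.15), mask.shape[1])
    return np.clip(surface + ramp[None, :] - 0.3 * glyph, 0.0, 1.0)

# Eight randomized views of the same grapheme, all sharing one label.
samples = np.stack([synthesize(grapheme_mask()) for _ in range(8)])
print(samples.shape)  # (8, 64, 64)
```

The design point is that every factor of variation (stroke width, orientation, surface tone, lighting) is sampled rather than annotated, which is what lets the training set grow without any real photographs.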
Aside from the Abstract, the lead author also shared with me the following summary paragraph:
For Old Turkic, there is a problem with the text of inscriptions: the graphemes are deformed by aging and environmental conditions, and there is not a good enough amount of data correlating the various angles of a glyph to its value; as you know, data is the oil for AI. To tackle this problem, we developed a system that creates completely random strings and places them on virtual inscriptions with photorealistic rendering techniques, and it turns out that this works wonders: we have been able to go beyond 80% accuracy on actual photographs without the AI ever seeing one. Although we had success here, and synthetic image generation for training has already been applied to paper materials and the like, I am also pondering whether it might be helpful for other ancient inscriptions whose systematic nature is more or less known, but where a layer of complexity on the surface makes the data more challenging to annotate, and hence harder to train on with actual photographs or estampages.
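Reading the abstract and this summary together, the train-on-synthetic, test-on-real setup could be sketched as below. This is an assumed reconstruction, not the authors' code: the two dataset loaders are hypothetical placeholders, the class count is illustrative, and only the overall shape (a Vision Transformer that never sees a real photograph during training) follows the paper.

```python
# Sketch of the train-on-synthetic, evaluate-on-real setup (assumptions:
# PyTorch and torchvision; loaders yield (N, 3, 224, 224) float tensors
# with integer grapheme labels; `synthetic_loader` and `real_loader` are
# hypothetical placeholders for the rendered data and the Kül Tegin crops).
import torch
from torchvision.models import vit_b_16

NUM_CLASSES = 38  # illustrative grapheme inventory size, not from the paper

model = vit_b_16(weights=None, num_classes=NUM_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.CrossEntropyLoss()

def train(model, synthetic_loader, epochs=10):
    """Every gradient step uses rendered data; no real photo is seen."""
    model.train()
    for _ in range(epochs):
        for images, labels in synthetic_loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()

@torch.no_grad()
def evaluate(model, real_loader):
    """Classification accuracy on real inscription photographs."""
    model.eval()
    correct = total = 0
    for images, labels in real_loader:
        correct += (model(images).argmax(dim=1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```

The figure the author quotes, beyond 80% accuracy on actual photographs, is what `evaluate` would report for a model trained only through `train`.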
I am hoping that the techniques developed here for reading Old Turkic runiform script may also be adapted for use on other historical scripts.
Selected readings
- "Pugu, boga, beg" (8/11/20)
- "Tocharian, Turkic, and Old Sinitic 'ten thousand'" (4/23/19)
- "Northernmost runic finds in the world" (2/10/20)
- "Turkish written with Latin letters half a millennium ago" (8/29/16)
- "Unknown language #18" (6/3/24)
- "Unknown language #17" (5/2/24)
- "On the etymology of the title Tham of Burusho kings" (5/17/20)
Chris Button said,
July 21, 2024 @ 10:31 am
Wow. It's also great to see generative adversarial networks (GANs) mentioned in the role of synthetic data generation. Usually the GenAI talk on Language Log focuses solely on LLM-based generative pre-trained transformers (GPTs).