PROF Nathan Hill nh36@soas.ac.uk
Professor Tibetan & Historical Linguistics
Printed Text Recognition for Lexical Lists in Chinese-International Phonetic Alphabet (IPA) Glossing
Hill, Nathan W.; Li, Shihua
Authors
Shihua Li
Abstract
This study presents a benchmark dataset for recognizing printed text in lexical lists that use Chinese-IPA glossing. The paper gives an overview of the baseline model, the transcription model, and the PyLaia engines employed in the research. It explains why these lexical lists need to be digitized, outlines how the baseline model was trained for layout analysis, and describes how the transcription model was trained on ground truth data generated in Transkribus. Training takes both the images of the lexical list content and their corresponding transcriptions as input. The study also discusses the limitations of the model and identifies avenues for future development. The dataset is openly accessible, so researchers seeking to digitize lexical lists in Chinese-IPA glossing can use it. Moreover, because the model recognizes both Chinese characters and IPA symbols, it has the potential to contribute to linguistic analysis of languages documented in Chinese-IPA glossing.
Citation
Hill, N. W., & Li, S. (2023). Printed Text Recognition for Lexical Lists in Chinese-International Phonetic Alphabet (IPA) Glossing. Journal of Open Humanities Data, 9(15), 1-8. https://doi.org/10.5334/johd.119
Journal Article Type | Article |
---|---|
Acceptance Date | Oct 1, 2023 |
Publication Date | Jul 21, 2023 |
Deposit Date | Oct 25, 2023 |
Publicly Available Date | Oct 25, 2023 |
Journal | Journal of Open Humanities Data |
Electronic ISSN | 2059-481X |
Publisher | Ubiquity Press |
Peer Reviewed | Peer Reviewed |
Volume | 9 |
Issue | 15 |
Pages | 1-8 |
DOI | https://doi.org/10.5334/johd.119 |
Keywords | printed text recognition, Chinese, IPA, Burmish and Tujia languages, lexical lists, baseline model, transcription model, Transkribus |
Publisher URL | https://openhumanitiesdata.metajnl.com/articles/10.5334/johd.119 |
Files
Li and Hill 2023 Transkribus.pdf
(1.4 Mb)
PDF
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/