A rule based Tibetan part-of-speech (POS) tagger for the creation of gold standard training data
Garrett, Edward; Hill, Nathan W.
Authors
DR Edward Garrett eg15@soas.ac.uk
Research Assistant
PROF Nathan Hill nh36@soas.ac.uk
Professor Tibetan & Historical Linguistics
Abstract
This rule-based Tibetan part-of-speech (POS) tagger was prepared in the course of the research project 'Tibetan in Digital Communication' (2012-2015), hosted at SOAS, University of London, and funded by the UK's Arts and Humanities Research Council (grant code: AH/J00152X/1). For a description of the tag set, see Garrett et al. 2014 and Garrett et al. 2015. For a description of the tagger itself, see Garrett et al. 2014. Note that the tagger must be used together with a lexicon (for example, Hill & Garrett 2017a). The tagger does not assign tags itself: users must first run their own script to tag every word with all of the tags the lexicon lists for it, and then apply the tagger to remove the incorrect tags.
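The two-step workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the format used in Taggers.zip: the toy lexicon, the `word/TAG/TAG` serialisation, the tag names (`NOUN`, `VERB`, `CASE`), and the single pruning rule are all hypothetical placeholders chosen only to show the idea of over-tagging from a lexicon and then removing tags with regular expressions.

```python
import re

# Toy stand-in for a lexicon: word -> candidate tags. The words and tag
# names are placeholders, not entries from the published lexicon.
TOY_LEXICON = {
    "bod": {"NOUN", "VERB"},
    "la": {"CASE"},
    "song": {"VERB"},
}

def lexical_tag(words, lexicon):
    """Step 1: give each word every tag the lexicon lists for it.

    Output format (an assumption): 'word/TAG1/TAG2 word/TAG ...',
    with tags sorted so that rules can match them in a fixed order.
    """
    return " ".join("/".join([w, *sorted(lexicon.get(w, ()))]) for w in words)

# Step 2 (hypothetical rule): a word tagged NOUN/VERB that directly
# precedes an unambiguous CASE marker keeps only NOUN.
TOY_RULE = (
    re.compile(r"([^\s/]+)/NOUN/VERB(?= [^\s/]+/CASE(?: |$))"),
    r"\1/NOUN",
)

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order to prune tags."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text

ambiguous = lexical_tag(["bod", "la", "song"], TOY_LEXICON)
pruned = apply_rules(ambiguous, [TOY_RULE])
```

Here `ambiguous` carries every lexicon tag for each word, and `pruned` is the same string after the one illustrative rule has removed a tag. A real rule set would contain many such patterns, applied in a fixed order.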
On the associated corpus of 318,230 words (Hill & Garrett 2017b), the lexical tagger (i.e. simply applying all available tags to all words) tags 141,911 words with the correct unique tag and achieves an accuracy of 1.000 (by definition, since the right tag is always among those assigned to each word), with an ambiguity of 2.73111. In contrast, the Rule Tagger tags 241,256 words with the correct unique tag and achieves an accuracy of 0.99893 with an ambiguity of 1.38577.
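As a rough illustration of how figures of this kind can be computed, the sketch below assumes that accuracy is the share of words whose remaining tag set still contains the gold tag (consistent with the "by definition" remark above) and that ambiguity is the mean number of tags left per word. The ambiguity definition is an assumption here; the record itself does not spell it out. The function name `evaluate` and the data layout are hypothetical.

```python
def evaluate(candidates, gold):
    """Score a tagger's output against gold tags.

    candidates: list of sets, the tags left on each word
    gold:       list of strings, the single correct tag for each word
    """
    if len(candidates) != len(gold):
        raise ValueError("candidates and gold must have the same length")
    n = len(gold)
    pairs = list(zip(candidates, gold))
    return {
        # words left with exactly one tag, and that tag is the gold tag
        "unique_correct": sum(1 for c, g in pairs if c == {g}),
        # share of words whose tag set still contains the gold tag
        "accuracy": sum(1 for c, g in pairs if g in c) / n,
        # mean number of tags per word (assumed definition of ambiguity)
        "ambiguity": sum(len(c) for c in candidates) / n,
        # words from which every tag was removed
        "no_tags": sum(1 for c in candidates if not c),
    }

scores = evaluate(
    [{"NOUN"}, {"NOUN", "VERB"}, set(), {"CASE"}],
    ["NOUN", "VERB", "VERB", "CASE"],
)
```

Expressed as shares of the 318,230-word corpus, the unique-correct counts reported above are about 44.6% (141,911 words) for the lexical tagger and about 75.8% (241,256 words) for the Rule Tagger.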
Because this tagger does not achieve an ambiguity of 1.000, it is not suitable for tagging large-scale corpora; it is instead useful for the creation of gold standard training data.
N.B. In some rare cases the tagger removes all POS-tags.
Citation
Garrett, E., & Hill, N. W. A rule based Tibetan part-of-speech (POS) tagger for the creation of gold standard training data. [Data]
Online Publication Date | May 11, 2017 |
---|---|
Deposit Date | Jun 16, 2017 |
Publicly Available Date | Jun 16, 2017 |
Publisher URL | http://doi.org/10.5281/zenodo.574882 |
Type of Data | regular expressions |
Additional Information | References: Hill, Nathan W., & Garrett, Edward. (2017a). A part-of-speech (POS) lexicon of Classical Tibetan for NLP [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574876; Hill, Nathan W., & Garrett, Edward. (2017b). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878 |
Files
Taggers.zip
(42 Kb)
Archive
Licence
http://creativecommons.org/licenses/by/4.0/
Publisher Licence URL
http://creativecommons.org/licenses/by/4.0/