Tibetan BERT Model
We also open-sourced the training corpus here.
Citation
Please cite our paper if you use this model or the training corpus:
@inproceedings{10.1145/3548608.3559255,
author = {Zhang, Jiangyan and Kazhuo, Deji and Gadeng, Luosang and Trashi, Nyima and Qun, Nuo},
title = {Research and Application of Tibetan Pre-Training Language Model Based on BERT},
year = {2022},
isbn = {9781450397179},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3548608.3559255},
doi = {10.1145/3548608.3559255},
abstract = {In recent years, pre-training language models have been widely used in the field of natural language processing, but the research on Tibetan pre-training language models is still in the exploratory stage. To promote the further development of Tibetan natural language processing and effectively solve the problem of the scarcity of Tibetan annotation data sets, the article studies the Tibetan pre-training language model based on BERT. First, given the characteristics of the Tibetan language, we constructed a data set for the BERT pre-training language model and downstream text classification tasks. Secondly, construct a small-scale Tibetan BERT pre-training language model to train it. Finally, the performance of the model was verified through the downstream task of Tibetan text classification, and an accuracy rate of 86\% was achieved on the task of text classification. Experiments show that the model we built has a significant effect on the task of Tibetan text classification.},
booktitle = {Proceedings of the 2022 2nd International Conference on Control and Intelligent Robotics},
pages = {519--524},
numpages = {6},
location = {Nanjing, China},
series = {ICCIR '22}
}