Pre-trained language models for Chinese minority languages remain scarce: although the domestic minority-language model CINO shows strong understanding ability, research targeting generation and translation is still lacking. CMPT (Chinese Minority Pre-Trained Language Model) is an ultra-deep generative model built on BART with DeepNorm added for training stability; its largest configuration has 128 encoder and 128 decoder layers. It was pre-trained under constrained resources on more than 10 GB of Chinese, English, Uyghur, Tibetan, and Mongolian corpora, and it achieves strong understanding and generation performance.
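For reference, DeepNorm replaces the standard post-LayerNorm residual x <- LN(x + sublayer(x)) with x <- LN(alpha * x + sublayer(x)), where alpha grows with depth (the DeepNet paper also down-scales some sublayer weights by a factor beta at initialization). Below is a minimal, hypothetical sketch of that rule; the class name and wiring are illustrative and not taken from CMPT's actual code.

import torch
from torch import nn

class DeepNormResidual(nn.Module):
    """Illustrative sketch of the DeepNorm residual rule:
    x <- LayerNorm(alpha * x + sublayer(x))."""

    def __init__(self, dim: int, alpha: float):
        super().__init__()
        self.alpha = alpha
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # Up-weighting the residual branch by alpha bounds the per-layer update,
        # which is what lets a 128+128-layer encoder-decoder train stably.
        return self.norm(self.alpha * x + sublayer_out)

# DeepNet's published constants for an encoder-decoder with
# N encoder layers and M decoder layers:
N, M = 128, 128
encoder_alpha = 0.81 * (N ** 4 * M) ** (1 / 16)
decoder_alpha = (3 * M) ** (1 / 4)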

GitHub Link: https://github.com/WENGSYX/CMPT

Usage

>>> from modeling_cmpt import BartForConditionalGeneration
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('./CMPT')
>>> model = BartForConditionalGeneration.from_pretrained('./CMPT')
>>> inputs = tokenizer.encode("Hello world, 你好 世界", return_tensors='pt')
>>> pred_ids = model.generate(inputs, num_beams=4, max_length=20)
>>> print(tokenizer.convert_ids_to_tokens(pred_ids[0]))