CPM-Bee

CPM-Bee is a fully open-source, commercially usable Chinese-English bilingual base model with ten billion parameters. It is the second milestone of the CPM-Live training process. Built on a Transformer auto-regressive architecture, CPM-Bee has been pre-trained on an extensive corpus of trillions of tokens, giving it strong foundational capabilities.

Model description

Intended uses & limitations

You can use the raw model for many NLP tasks such as text generation, or fine-tune it on a downstream task.

How to use

>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True).cuda()
>>> result = model.generate({"input": "今天天气不错,", "<ans>": ""}, tokenizer)
>>> print(result)
[{'input': '今天天气不错,', '<ans>': '适合睡觉。'}]
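The prompt passed to generate is a Python dict: the conditioning text goes under keys such as "input" (plus task-specific keys such as "question" for extractive QA, as in the multi-GPU example below), and the model fills in the "<ans>" field. A minimal question-answering call, with hypothetical prompt text, might look like this (the exact output will vary):

>>> result = model.generate(
...     {"input": "北京是中国的首都。", "question": "中国的首都是哪里?", "<ans>": ""},  # "Beijing is the capital of China." / "What is the capital of China?"
...     tokenizer
... )
>>> # result[0]["<ans>"] should then hold the model's answer.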

If you want to use multiple GPUs for inference, you can use accelerate as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer
from accelerate import dispatch_model
from accelerate.utils import get_balanced_memory, infer_auto_device_map

tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True).cuda()

max_memory = get_balanced_memory(
    model, 
    no_split_module_classes=["CpmBeeTransformerBlock"]
)
device_map = infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["CpmBeeTransformerBlock"]) 
# keep the input embedding and the output layernorm on the same device, so hidden states can be projected to logits without a device mismatch
device_map["cpmbee.encoder.output_layernorm"] = device_map["cpmbee.input_embedding"] = 0

model = dispatch_model(model, device_map=device_map)

res = model.generate(
    [
        {"input": "今天天气是真的", "<ans>": ""},
        {"input": "NGC 6231是一个位于天蝎座的疏散星团,天球座标为赤经16时54分,赤纬-41度48分,视觉观测大小约45角分,亮度约2.6视星等,距地球5900光年。NGC 6231年龄约为三百二十万年,是一个非常年轻的星团,星团内的最亮星是5等的天蝎座 ζ1星。用双筒望远镜或小型望远镜就能看到个别的行星。NGC 6231在1654年被意大利天文学家乔瓦尼·巴蒂斯特·霍迪尔纳(Giovanni Battista Hodierna)以Luminosae的名字首次纪录在星表中,但是未见记载于夏尔·梅西耶的天体列表和威廉·赫歇尔的深空天体目录。这个天体在1678年被爱德蒙·哈雷(I.7)、1745年被夏西亚科斯(Jean-Phillippe Loys de Cheseaux)(9)、1751年被尼可拉·路易·拉卡伊(II.13)分别再次独立发现。", "question": "NGC 6231的经纬度是多少?", "<ans>": ""}
    ],
    tokenizer,
    max_new_tokens=100
)
print(res)
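
If the full-precision 10B checkpoint does not fit in your GPU memory, a common option (assuming the released weights run correctly in fp16, which you should verify for your setup) is to load the model in half precision before building the device map:

import torch
from transformers import AutoModelForCausalLM

# Assumption: fp16 inference is acceptable for your use case; this roughly halves memory usage.
model = AutoModelForCausalLM.from_pretrained(
    "openbmb/cpm-bee-10b",
    torch_dtype=torch.half,
    trust_remote_code=True,
)

The rest of the dispatch code above applies unchanged.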

We suggest using BMTrain to fine-tune CPM-Bee; you can also use accelerate or deepspeed. Here is a brief example of a training loop:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from accelerate import Accelerator
from torch.utils.data import Dataset, DataLoader

accelerator = Accelerator()

trainset = Dataset()  # placeholder: use your own Dataset subclass whose __getitem__() returns samples in the expected format, e.g. {"input": "...", "<ans>": "..."} with the target answer filled in
# for details, see https://github.com/OpenBMB/CPM-Bee/tree/main/tutorials/basic_task_finetune
train_loader = DataLoader(trainset, batch_size=1)

tokenizer = AutoTokenizer.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("openbmb/cpm-bee-10b", trust_remote_code=True).cuda()

optimizer = torch.optim.Adam(model.parameters())

model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

for step, data in enumerate(train_loader):
    optimizer.zero_grad()

    # change the data to a trainable format
    input_encoded = tokenizer.prepare_for_finetune(data, max_length=512).to(model.device)

    outputs = model(**input_encoded)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()

You should design your own parallelism and mixed-precision training strategy on top of this example.
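
For reference, a minimal (hypothetical) Dataset wrapper that reads one JSON object per line and yields samples in the format expected above might look like this; adapt the fields to your task following the fine-tuning tutorial linked in the code:

import json
from torch.utils.data import Dataset

class BeeJsonlDataset(Dataset):
    # Hypothetical helper: each line of the file is a JSON object such as
    # {"input": "今天天气不错,", "<ans>": "适合睡觉。"}
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.samples = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

trainset = BeeJsonlDataset("train.jsonl")  # example path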