Model Card for BELLE-7B-0.6M
Model description
BELLE is based on Bloomz-7b1-mt and finetuned on 0.6M Chinese data combined with 50,000 pieces of English data from the open-source Stanford Alpaca project, giving it good Chinese instruction understanding and response generation capabilities.
The code for Chinese data generation and other details can be found in our GitHub project repository: https://github.com/LianjiaTech/BELLE.
We trained models on instruction-learning datasets of different sizes (200,000, 600,000, and 1,000,000 samples), obtaining the model versions shown below:
Datasize | 200,000 | 600,000 | 1,000,000 |
---|---|---|---|
Finetuned Model | BELLE-7B-0.2M | BELLE-7B-0.6M | BELLE-7B-1M |
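Any of these checkpoints can also be loaded by name from the Hugging Face Hub instead of from a local copy; for example (the repository id below is an assumption based on the naming above, so substitute the repository you actually use):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Hypothetical Hub repository id following the naming above; adjust as needed.
repo_id = "BelleGroup/BELLE-7B-0.6M"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
```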
Training hyper-parameters
Parameter | Value |
---|---|
Batch size | 64 |
Learning rate | 3e-6 |
Epochs | 3 |
Weight_decay | 0.001 |
Warmup_rate | 0.1 |
LR_scheduler | linear |
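For orientation, the hyper-parameters above map onto a standard `transformers` fine-tuning configuration roughly as shown below. This is only an illustrative sketch, not the actual training script (see the GitHub repository for that); in particular, the split of the global batch size of 64 into per-device batch size and gradient-accumulation steps is an assumption.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyper-parameters onto TrainingArguments.
# The 8 x 8 = 64 batch-size split below is an assumption, not taken from the BELLE repo.
training_args = TrainingArguments(
    output_dir="./belle-7b-finetune",  # hypothetical output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    learning_rate=3e-6,
    num_train_epochs=3,
    weight_decay=0.001,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
)
```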
Using the model
Please note that the input should be formatted as follows in both training and inference.
Human: {input} \n\nAssistant:
BELLE can be easily loaded with AutoModelForCausalLM.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./"  # Path to the local model directory; modify as needed
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

print("Human:")
line = input()
while line:
    # Wrap the user input in the prompt format used during training.
    inputs = 'Human: ' + line.strip() + '\n\nAssistant:'
    input_ids = tokenizer(inputs, return_tensors="pt").input_ids
    outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, top_k=30,
                             top_p=0.85, temperature=0.35, repetition_penalty=1.2)
    rets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Strip the prompt from the decoded output so only the answer is printed.
    print("Assistant:\n" + rets[0].strip().replace(inputs, ""))
    print("\n------------------------------------------------\nHuman:")
    line = input()
```
The example above does not use context from previous turns of the conversation. If you want multi-turn dialogue understanding, concatenate the previous turns in the format Human:{input}\n\nAssistant:{output} and feed the result to the generate function, as sketched below.
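A minimal sketch of that concatenation, reusing the `model` and `tokenizer` loaded in the snippet above; the separator inserted between turns is an assumption, since the card only specifies the per-turn format:

```python
def build_prompt(history, new_input):
    """Concatenate previous (user, assistant) turns in the training format
    and append the new user input as the final, unanswered turn."""
    prompt = ""
    for user_msg, assistant_msg in history:
        # Per-turn format from the card; the "\n" between turns is an assumption.
        prompt += "Human: " + user_msg + "\n\nAssistant:" + assistant_msg + "\n"
    prompt += "Human: " + new_input + "\n\nAssistant:"
    return prompt

# Example: one previous exchange plus a follow-up question.
history = [("给一个北京三日游的旅行攻略", "Day 1: 参观故宫博物院...")]
prompt = build_prompt(history, "第二天的行程再详细一点")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=200, do_sample=True, top_k=30,
                         top_p=0.85, temperature=0.35, repetition_penalty=1.2)
rets = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(rets[0].strip().replace(prompt, ""))
```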
Limitations and use restrictions
The SFT model trained on the current base model and data still has a few issues:

- The model may produce factually incorrect answers when asked to follow instructions involving facts.
- It occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.
- Its reasoning and coding abilities still need improvement.
Given these limitations, we require that developers use the open-sourced code, data, models, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.
Some use cases
Human | Assistant |
---|---|
使用python写字符串翻转的函数 | 可以使用Python内置方法 reversed() 来实现字符串翻转,代码如下:<br/><br/>def reverse_string(s):<br/> return s[::-1]<br/> <br/>print(reverse_string("Hello, world!"))<br/># 输出:'!dlrow,olleH'<br/> |
做难而正确的事,翻译成英语 | Do difficult and correct things. |
The current Alpaca model is fine-tuned from a 7B LLaMA model on 52K instruction-following data generated by the techniques in the Self-Instruct paper, with some modifications that we discuss in the next section. In a preliminary human evaluation, we found that the Alpaca 7B model behaves similarly to the text-davinci-003 model on the Self-Instruct instruction-following evaluation suite. How many instruction-following data generated? | 52K |
给一个北京三日游的旅行攻略 | Day 1:<br/>- 参观故宫博物院,了解中国历史和文化。<br/>- 在天安门广场观看升旗仪式。 <br/>- 到颐和园游览美丽的园林景观。<br/>- 品尝烤鸭和其他地道美食。<br/>- 去王府井购物中心逛街购物。 <br/><br/>Day 2:<br/>- 前往长城游玩,感受古代中国的壮丽景色。<br/>- 乘坐索道缆车到达八达岭长城,欣赏美景。<br/>- 晚上去簋街体验当地的小吃文化。<br/><br/>Day 3:<br/>- 参观天坛公园,观赏古建筑群和祈年殿。<br/>- 到鸟巢和水立方观看奥运会比赛。<br/>- 午餐后到798艺术区参观当代艺术家的工作室。 |
来一个温馨的睡前故事 | 从前有一个小兔子,它住在森林里。每天晚上,小兔子都会在树下打盹儿。但是有一天晚上,一只大灰狼来了!他悄悄地溜进小兔子的洞穴里,想要偷走它的胡萝卜。<br/>然而,小兔子并没有被吓到。相反,它用自己的智慧和勇气保护了自己。它在洞口放了一堆胡萝卜来吸引大灰狼,然后趁机逃脱了。从那天起,小兔子学会了在危险时保持镇静并采取适当的行动 |
如何快速提升自己的写作能力 | 1. 阅读大量不同类型的文学作品,包括小说、散文和诗歌等。<br/>2. 练习写作,每天写一些东西并不断修改和完善。<br/>3. 参加写作班或请教专业作家以获得反馈和建议。<br/>4. 利用各种工具和技术来提高写作效率和质量,例如语法检查器、拼写检查器和自动润色功能等。 |
北京大学是哪一年成立的? | 北京大学成立于1898年 |