Taiyi (太一): A Bilingual (Chinese and English) Fine-Tuned Large Language Model for Diverse Biomedical Tasks

Demo | GitHub

This is the Taiyi model, developed by the DUTIR lab with Qwen-7b-base as its base model.

Project Background

With the rapid development of deep learning, large language models such as ChatGPT have made significant progress in natural language processing. In biomedical applications, large language models can facilitate communication between healthcare professionals and patients, provide valuable medical information, and show great potential for assisting diagnosis, biomedical knowledge discovery, drug development, and personalized healthcare. However, open-source biomedical large models remain relatively scarce in the AI community, and most of them focus on monolingual medical question-answering dialogue in either Chinese or English. This project therefore studies large models for the biomedical domain and introduces the initial version of a bilingual Chinese-English biomedical large model named 'Taiyi', aiming to explore how well large models handle a variety of Chinese-English natural language processing tasks in the biomedical field.

Project Highlights

Model Inference

We concatenate multi-turn dialogues into the following format and then tokenize them, where eod is the special token <|endoftext|> in the Qwen tokenizer.

<eod>input1<eod>answer1<eod>input2<eod>answer2<eod>.....
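As an illustration of the format above, here is a minimal sketch that builds the token ID sequence for a conversation. The function name `build_dialogue_ids` and its parameters are hypothetical and not part of the Taiyi codebase. The `encode` callable stands in for the tokenizer; with the Qwen tokenizer it could be something like `lambda s: tokenizer(s, add_special_tokens=False).input_ids`, with `eod_id=tokenizer.eod_id`.

```python
from typing import Callable, List, Sequence, Tuple


def build_dialogue_ids(
    turns: Sequence[Tuple[str, str]],
    next_input: str,
    encode: Callable[[str], List[int]],
    eod_id: int,
) -> List[int]:
    """Build <eod>input1<eod>answer1<eod>...<eod>next_input<eod> as token IDs.

    `turns` holds completed (input, answer) pairs; `next_input` is the new
    user message the model should answer. All names here are illustrative.
    """
    ids = [eod_id]  # the sequence always starts with eod
    for user_text, answer_text in turns:
        ids += encode(user_text) + [eod_id]
        ids += encode(answer_text) + [eod_id]
    # The trailing eod marks the end of the user input, so the model
    # continues with the answer.
    ids += encode(next_input) + [eod_id]
    return ids


if __name__ == "__main__":
    # Toy encoder (one ID per character) used only to show the layout.
    toy_encode = lambda s: [ord(c) for c in s]
    print(build_dialogue_ids([("a", "b")], "c", toy_encode, eod_id=0))
```

With no previous turns, the result reduces to `<eod>input<eod>`, which is the single-turn layout.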

The following code can be used to perform inference using our model:


from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "DUTIR-BioNLP/Taiyi-LLM"

device = 'cuda:0'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map = device
)


model.eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True
)

import logging
# Silence library warnings during generation
logging.disable(logging.WARNING)

# Use <|endoftext|> (eod) as the pad, bos, and eos token, matching the dialogue format
tokenizer.pad_token_id = tokenizer.eod_id
tokenizer.bos_token_id = tokenizer.eod_id
tokenizer.eos_token_id = tokenizer.eod_id

# Generation settings
max_new_tokens = 500
top_p = 0.9
temperature = 0.3
repetition_penalty = 1.0

# begin chat

user_input = "Hi, could you please introduce yourself?"

# Tokenize the user input without special tokens, then wrap it as <eod>input<eod>
input_ids = tokenizer(user_input, return_tensors="pt", add_special_tokens=False).input_ids
bos_token_id = torch.tensor([[tokenizer.bos_token_id]], dtype=torch.long)
eos_token_id = torch.tensor([[tokenizer.eos_token_id]], dtype=torch.long)
user_input_ids = torch.concat([bos_token_id, input_ids, eos_token_id], dim=1)


model_input_ids = user_input_ids.to(device)
with torch.no_grad():
    outputs = model.generate(
        input_ids=model_input_ids, max_new_tokens=max_new_tokens, do_sample=True, top_p=top_p,
        temperature=temperature, repetition_penalty=repetition_penalty, eos_token_id=tokenizer.eos_token_id
    )

# The decoded text contains the prompt followed by the generated answer
response = tokenizer.batch_decode(outputs)
print(response[0])
# Example output:
# <|endoftext|>Hi, could you please introduce yourself?<|endoftext|>Hello! My name is Taiyi,.....<|endoftext|>
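Because the decoded string includes the prompt as well as the answer, it can help to pull out only the model's reply. The sketch below is one possible way to do that on the decoded text. The helper `extract_last_reply` is hypothetical, and it assumes the reply is the last non-empty segment between `<|endoftext|>` markers.

```python
EOD = "<|endoftext|>"


def extract_last_reply(decoded: str, eod: str = EOD) -> str:
    """Return the last non-empty segment between eod markers.

    Assumes the decoded text looks like <eod>input<eod>answer<eod>.
    If the model produced no answer, the last segment is the user input.
    """
    segments = [seg.strip() for seg in decoded.split(eod)]
    non_empty = [seg for seg in segments if seg]
    return non_empty[-1] if non_empty else ""


if __name__ == "__main__":
    decoded = (
        "<|endoftext|>Hi, could you please introduce yourself?"
        "<|endoftext|>Hello! My name is Taiyi.<|endoftext|>"
    )
    print(extract_last_reply(decoded))
```

An alternative that avoids string parsing is to slice off the prompt tokens before decoding, for example `outputs[0][model_input_ids.shape[1]:]`, since `generate` returns the prompt followed by the new tokens for decoder-only models.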

We provide two test scripts for dialogue: use dialogue_one_trun.py to test single-turn QA dialogue, or dialogue_multi_trun.py to test multi-turn conversational QA.
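For multi-turn use, the conversation history has to fit within the model's context. The following is a minimal sketch of one common approach: keep a running list of token IDs and retain only the most recent ones. It is not taken from the scripts above. The helper `append_and_truncate` and the `max_len` parameter are assumptions made for illustration.

```python
from typing import List


def append_and_truncate(history: List[int], new_ids: List[int], max_len: int) -> List[int]:
    """Append new token IDs to the history and keep only the last max_len IDs.

    Hypothetical helper: truncating from the left drops the oldest turns first.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    combined = history + new_ids
    return combined[-max_len:]


if __name__ == "__main__":
    history: List[int] = []
    # Each turn appends the user input IDs, then the generated answer IDs.
    history = append_and_truncate(history, [0, 11, 12, 0], max_len=6)
    history = append_and_truncate(history, [21, 22, 0], max_len=6)
    print(history)
```

Note that a plain left truncation can cut a turn in the middle, so the kept history may not start with an eod token. Truncating at turn boundaries is a possible refinement.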

Citation

If you use this project's repository, please cite it as follows.

@misc{taiyi,
    author = {Taiyi-Team},
    title = {Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/DUTIR-BioNLP/Taiyi-LLM}}
}