
<p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/logo_zhixi.png?raw=true" alt="ZJU-KnowLM" style="width: 40%; min-width: 40px; display: block; margin: auto;"></a> </p>

These are the ZhiXi-13B LoRA weights. You can click here to learn more.

Knowledgeable Large Language Model Framework.

With the rapid development of deep learning technology, large language models such as ChatGPT have made substantial strides in the realm of natural language processing. However, these expansive models still encounter several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project endeavors to tackle these issues by launching an open-source large-scale knowledgeable language model framework and releasing corresponding models.

The project's initial phase introduced a knowledge extraction LLM based on LLaMA, dubbed ZhiXi (智析, meaning intelligent analysis of data for information extraction). To equip the model with Chinese understanding without compromising its inherent knowledge, we first <b>(1) performed full-scale pre-training of LLaMA (13B) on Chinese corpora, augmenting the model's understanding of Chinese and improving its knowledge richness while retaining its original English and code capabilities;</b> then <b>(2) fine-tuned the model obtained in the first step with an instruction dataset, bolstering its understanding of human instructions for knowledge extraction.</b>

The features of this project are as follows:

All weights have been uploaded to Hugging Face. The ZhiXi differential weights can be found here, and the LoRA weights can be found here.

Why is it called ZhiXi (智析)?

In Chinese, "Zhi" (智) signifies intelligence, referencing the AI's advanced language understanding capabilities. "Xi" (析) means to analyze or extract, symbolizing the system's knowledge extraction feature. Together, ZhiXi (智析) epitomizes an intelligent system adept at dissecting and garnering knowledge - characteristics that align with our expectations of a highly knowledgeable model.

Contents

<h2 id="1">1. Cases</h2>

<h3 id="1-1">1.1 Pretraining Cases</h3>

Our pre-trained model has demonstrated certain abilities in instruction following, coding, reasoning, as well as some translation capabilities, without any fine-tuning using instructions. Additionally, it has acquired new knowledge. Below are some of our sample cases. If you wish to reproduce our examples and view detailed decoding configuration, please first set up the environment and restore the weights, then follow the steps outlined here.

In the following cases, text in bold represents the prompt, while non-bold text represents the model's output.

Because the maximum inference length is set to 512, our cases fall into three situations:

  1. Completed output. The model generates the termination token EOS and completes the output. We mark this with :white_check_mark:.
  2. Incomplete output. The output is cut off due to the maximum inference length. We mark this with :eight_spoked_asterisk:.
  3. Repeated output. We remove repeated content manually and mark it with :arrow_left:.

<details> <summary><b>Translation</b></summary>

</details>

<details> <summary><b>Knowledge</b></summary>

</details>

<details> <summary><b>Instruction Following</b></summary>

</details>

<details> <summary><b>Coding</b></summary>

</details>

<details> <summary><b>Generate long text in Chinese</b></summary>

</details>

<details> <summary><b>Generate long text in English</b></summary>

</details>

<details> <summary><b>Reasoning</b></summary>

</details>

<h3 id="1-2">1.2 Information Extraction Cases</h3>

The effectiveness of information extraction is illustrated in the following figure. We tested a variety of instructions across different tasks, as well as different instructions for the same task, and achieved good results in all cases.

<p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/ie-case-new_logo-en.png?raw=true" alt="IE" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a> </p>

As shown in the figure, compared with other large models such as ChatGPT, our model achieves more accurate and comprehensive extraction results. However, we have also identified some extraction errors made by ZhiXi. In the future, we will continue to enhance the model's semantic understanding in both Chinese and English and introduce more high-quality instruction data to improve performance.

<p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/casevschatgpt.png?raw=true" alt="IE-cases-vs-chatgpt" style="width: 60%; min-width: 60px; display: block; margin: auto;"></a> </p>

<h3 id="1-3">1.3 General Abilities Cases</h3>

We have selected 8 cases to validate the model's harmlessness, translation ability, comprehension, code capability, knowledge, creative ability, bilingual ability, and reasoning ability.

<details> <summary><b>Harmlessness</b></summary>

</details>

<details> <summary><b>Translation Ability</b></summary>

</details>

<details> <summary><b>Comprehension</b></summary>

</details>

<details> <summary><b>Code Ability</b></summary>

</details>

<details> <summary><b>Knowledge</b></summary>

</details>

<details> <summary><b>Creative Ability</b></summary>

</details>

<details> <summary><b>Bilingual Ability</b></summary>

</details>

<details> <summary><b>Reasoning Ability</b></summary>

</details>

<h2 id="2">2. Quick Start</h2>

<h3 id="2-1">2.1 Environment Configuration</h3>

conda create -n zhixi python=3.9 -y
conda activate zhixi
pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

<h3 id="2-2">2.2 Pretraining model weight acquisition and restoration</h3>

❗❗❗ Note that, in terms of hardware, performing step 2.2 (merging LLaMA-13B with ZhiXi-13B-Diff) requires approximately 100GB of RAM but no VRAM; this is due to the memory overhead of our merging strategy. For convenience, we also provide fp16 weights at https://huggingface.co/zjunlp/zhixi-13b-diff-fp16. The fp16 weights require less memory but may slightly impact performance. We will improve our merging approach in future updates, and we are currently developing a 7B model as well, so stay tuned. Step 2.4 (inference with ZhiXi) requires a minimum of 26GB of VRAM.

1. Download LLaMA 13B and ZhiXi-13B-Diff

Please click here to apply for the official pre-trained weights of LLaMA from Meta. In this case, we use the 13B version of the model, so you only need to download the 13B version. Once downloaded, the directory will look as follows:

|-- 13B
|	|-- checklist.chk
|	|-- consolidated.00.pth
|	|-- consolidated.01.pth
|	|-- params.json
|-- llama.sh
|-- tokenizer.model
|-- tokenizer_checklist.chk

You can use the following command to download the ZhiXi-13B-Diff file (assuming it is saved in the ./zhixi-diff folder):

python tools/download.py --download_path ./zhixi-diff --only_base

If you want to download the diff weights in the fp16 format, please use the following command (assuming it is saved in the ./zhixi-diff-fp16 folder):

python tools/download.py --download_path ./zhixi-diff-fp16 --only_base --fp16

:exclamation:Note: if the download is interrupted, simply repeat the command above. Hugging Face supports resumable downloads, so the download will continue from where it stopped.
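Alternatively, if you prefer to script the download yourself, the diff weights can be fetched with the huggingface_hub library, which also resumes interrupted downloads. This is a minimal sketch, assuming the fp16 repository id from the link above; adjust repo_id and local_dir as needed:

```python
# Sketch: download the diff weights with huggingface_hub instead of tools/download.py.
# The repo_id is assumed from the fp16 link above; change it to the repository you need.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="zjunlp/zhixi-13b-diff-fp16",  # assumed repository id
    local_dir="./zhixi-diff-fp16",         # same folder used in the commands above
)
```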

2. Use the conversion script provided by Hugging Face

To convert the original LLaMA-13B model into the Hugging Face format, you can use the conversion script provided by Hugging Face, which can be found here. Below is the command to run it (assuming the downloaded original LLaMA-13B files are located in ./ and you want the converted files stored in ./converted):

python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted

3. Restore ZhiXi 13B

Use the script we provide at ./tools/weight_diff.py and execute the following command to obtain the complete ZhiXi weights:

python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff --path_tuned ./zhixi

The final complete ZhiXi weights are saved in the ./zhixi folder.

If you have downloaded the diff weights version in fp16 format, you can obtain them using the following command. Please note that there might be slight differences compared to the weights obtained in fp32 format:

python tools/weight_diff.py recover --path_raw ./converted --path_diff ./zhixi-diff-fp16 --path_tuned ./zhixi

❗NOTE. We do not provide an MD5 checksum for verifying a successful merge of ZhiXi-13B because the weights are split into six files. Instead, we employ the same validation strategy as Stanford Alpaca, which performs a sum check on the weights (see this link). If the merge completes without any errors, you have obtained the correct pre-trained model.
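For reference, the core of an Alpaca-style recovery is element-wise addition of the released diff to the converted LLaMA weights. The sketch below is illustrative only; use tools/weight_diff.py for the actual recovery, since it also performs the validation described above:

```python
# Illustrative sketch of an Alpaca-style weight recovery (not tools/weight_diff.py itself).
import torch
from transformers import LlamaForCausalLM

base = LlamaForCausalLM.from_pretrained("./converted", torch_dtype=torch.float32)   # converted LLaMA-13B
tuned = LlamaForCausalLM.from_pretrained("./zhixi-diff", torch_dtype=torch.float32) # released diff

base_sd = base.state_dict()
tuned_sd = tuned.state_dict()
for name in tuned_sd:
    tuned_sd[name].add_(base_sd[name])  # diff + base = full ZhiXi weights

tuned.save_pretrained("./zhixi")
```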

<h3 id="2-3">2.3 Instruction tuning LoRA weight acquisition</h3>

Use the script we provide at ./tools/download.py and execute the following command to obtain the LoRA weights (assuming they are saved to ./LoRA):

python tools/download.py --download_path ./LoRA --only_lora

The final complete weights are saved in the ./LoRA folder.

<h3 id="2-4">2.4 Model Usage Guide</h3>

1. Reproduce the results in Section 1

The cases in Section 1 were all run on a V100 GPU. If you run them on other devices, the results may vary; try running multiple times or adjusting the decoding parameters.

  1. If you want to reproduce the results in Section 1.1 (pre-training cases), please run the following command (assuming that the complete pre-trained weights of ZhiXi have been obtained according to the steps in Section 2.2, and that the ZhiXi weights are saved in the ./zhixi folder):

    python examples/generate_finetune.py --base_model ./zhixi
    

    The results in Section 1.1 can then be obtained.

  2. If you want to reproduce the results in Section 1.2 (information extraction cases), please run the following command (assuming that the LoRA weights of ZhiXi have been obtained according to the steps in Section 2.3, and that the LoRA weights are saved in the ./lora folder):

    python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_ie_cases
    

    The results in Section 1.2 can then be obtained.

  3. If you want to reproduce the results in Section 1.3 (general abilities cases), please run the following command (assuming that the LoRA weights of ZhiXi have been obtained according to the steps in Section 2.3, and that the LoRA weights are saved in the ./lora folder):

    python examples/generate_lora.py --load_8bit --base_model ./zhixi --lora_weights ./lora --run_general_cases
    

    The results in Section 1.3 can then be obtained.

2. Usage of Pretraining Model

We offer two methods: the first one is command-line interaction, and the second one is web-based interaction, which provides greater flexibility.

  1. Use the following command to enter command-line interaction:

    python examples/generate_finetune.py --base_model ./zhixi --interactive
    

    The disadvantage is the inability to dynamically change decoding parameters.

  2. Use the following command to enter web-based interaction:

    python examples/generate_finetune_web.py --base_model ./zhixi
    

    Here is a screenshot of the web-based interaction: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>
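If you would rather call the pre-trained model from your own code, a minimal sketch with transformers is shown below. The decoding parameters are illustrative and are not the configuration used for the cases in Section 1.1 (see examples/generate_finetune.py for those):

```python
# Minimal sketch for loading the recovered ZhiXi-13B weights with transformers.
# Decoding parameters are illustrative; device_map="auto" requires the accelerate package.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./zhixi")
model = LlamaForCausalLM.from_pretrained(
    "./zhixi", torch_dtype=torch.float16, device_map="auto"  # fp16 inference needs >= 26GB of VRAM
)

prompt = "你好，请介绍一下自然语言处理。"  # "Hello, please introduce natural language processing."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```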

3. Usage of Instruction tuning Model

Here, we provide a web-based interaction method. Use the following command to access the web:

python examples/generate_lora_web.py --base_model ./zhixi --lora_weights ./lora

Here is a screenshot of the web-based interaction: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>

The instruction is a required parameter, while input is an optional parameter. For general tasks (such as the examples provided in section 1.3), you can directly enter the input in the instruction field. For information extraction tasks (as shown in the example in section 1.2), please enter the instruction in the instruction field and the sentence to be extracted in the input field. We provide an information extraction prompt in section 2.5.

If you want to perform batch testing, please modify the examples/generate_lora.py file and update the examples and hyperparameters in the variable cases.
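If you want to call the instruction-tuned model from your own code, the LoRA weights can be loaded with peft. The sketch below assumes an Alpaca-style prompt template; the instruction and input values are illustrative, and the template actually used by ZhiXi is defined in examples/generate_lora.py:

```python
# Sketch: apply the ZhiXi LoRA weights on top of the recovered base model with peft.
# The prompt template below is an assumption (Alpaca-style); check examples/generate_lora.py
# for the template actually used in training.
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("./zhixi")
base = LlamaForCausalLM.from_pretrained("./zhixi", torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, "./lora")

instruction = "请从下面的句子中抽取出所有的人名和机构名。"  # example IE-style instruction (illustrative)
ie_input = "浙江大学位于杭州市。"                            # sentence to extract from (illustrative)
prompt = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{instruction}\n\n### Input:\n{ie_input}\n\n### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```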

<h3 id="2-5">2.5 Information Extraction Prompt</h3>

For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this link for examples. Of course, you can also try using your own prompts.

Here is a case where ZhiXi-13B-LoRA is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.

<h2 id="3">3. Training Details</h2>

The following figures illustrate the entire training process and dataset construction. The training process is divided into two stages:

(1) Full pre-training stage. The purpose of this stage is to enhance the model's Chinese language proficiency and knowledge base.

(2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.

<h3 id="3-1">3.1 Dataset Construction (Pretraining)</h3>

In order to enhance the model's understanding of Chinese while preserving its original code and English language capabilities, we did not expand the vocabulary. Instead, we collected Chinese corpora, English corpora, and code corpora. The Chinese corpora were sourced from Baidu Baike, Wudao, and Chinese Wikipedia. The English dataset was sampled from the original English corpus of LLaMA, with the exception of the Wikipedia data. The original paper's English Wikipedia data was up until August 2022, and we additionally crawled data from September 2022 to February 2023, covering a total of six months. As for the code dataset, due to the low-quality code in the Pile dataset, we crawled code data from GitHub and LeetCode. A portion of the data was used for pre-training, while another portion was used for fine-tuning with instructions.

For the crawled datasets mentioned above, we employed a heuristic approach to filter out harmful content. Additionally, we removed duplicate data.
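The filtering heuristics are specific to each source, but the deduplication step can be illustrated with a simple hash-based sketch (this is not our actual pipeline code):

```python
# Illustrative deduplication sketch: drop documents whose normalized text has already
# been seen, using a hash so the set of seen keys stays small in memory.
import hashlib

def deduplicate(docs):
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()  # whitespace-normalized hash
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```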

<h3 id="3-2">3.2 Training Process (Pretraining)</h3>

Detailed data processing code, training code, complete training scripts, and detailed training results can be found in ./pretrain.

Before training, we need to tokenize the data. We set the maximum length of a single sample to 1024, while most documents are much longer than this. Therefore, we need to partition these documents. We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample. Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to DeepSpeed-Megatron and used the mmap method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
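A simplified sketch of this greedy splitting idea is shown below: sentences are packed into a sample until adding the next one would exceed the 1024-token budget, so every sample ends on a sentence boundary. The tokenizer and sentence splitter here are placeholders; the actual implementation, including the mmap-based loading, is in ./pretrain.

```python
# Simplified sketch of the greedy document splitter described above.
# `tokenize` and `split_sentences` are placeholders for the real tokenizer / sentence splitter.
def greedy_split(document, tokenize, split_sentences, max_len=1024):
    samples, current, current_len = [], [], 0
    for sentence in split_sentences(document):
        n_tokens = len(tokenize(sentence))
        if current and current_len + n_tokens > max_len:
            samples.append("".join(current))  # close the current sample on a sentence boundary
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        samples.append("".join(current))      # flush the last sample
    return samples
```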

Finally, we performed pre-training on 5.5 million Chinese samples, 1.5 million English samples, and 0.9 million code samples. We used the transformers Trainer in conjunction with DeepSpeed ZeRO-3 (we observed that ZeRO-2 was slower in a multi-node, multi-GPU setup). Training was conducted across 3 nodes, each equipped with 8 32GB V100 GPUs. The table below shows our training speed:

| Parameter | Value |
| --- | --- |
| micro batch size | 20 |
| gradient accumulation | 3 |
| global batch size | 20*3*24=1440 |
| time per step | 260s |

<h3 id="3-3">3.3 Dataset Construction (Instruction tuning)</h3>

In addition to incorporating general capabilities such as reasoning and coding, we have also introduced information extraction abilities, including NER (Named Entity Recognition), RE (Relation Extraction), and EE (Event Extraction), into the current homogeneous models. It is important to note that many open-source datasets, such as the Alpaca dataset, the CoT dataset, and the code dataset, are in English. To obtain the corresponding Chinese datasets, we used GPT-4 for translation. Two approaches were used: 1) translating the questions and answers directly into Chinese, and 2) feeding the English questions to GPT-4 and having it generate Chinese responses. The second approach was employed for the general dataset, while the first was used for datasets such as the CoT and code datasets. These datasets are readily available online.

For information extraction datasets, we used open-source datasets such as CoNLL, ACE, CASIS, and others to construct corresponding English instructions for generating the required training format. For the Chinese part, for NER and EE tasks, we utilized open-source datasets such as DualEE, PEOPLE DAILY, and others, and then created corresponding Chinese instructions to synthesize the required training format. As for the RE task, we built a dataset called KG2Instruction. Specifically, we used Chinese Wikipedia data and BERT for Chinese entity recognition. We then aligned the recognized entities with the Wikipedia index. Due to potential ambiguity (i.e., a Chinese entity may have multiple indexes, such as apple referring to both a fruit and a company), we devised a strategy to disambiguate the entities. Subsequently, we used a distantly supervised method to generate possible triplets and applied predefined rules to filter out illegal or incorrect triplets. Finally, with the help of crowdsourcing, we refined the obtained triplets. Following that, we constructed corresponding Chinese instructions to generate the required training format.

In addition, we manually constructed a general Chinese dataset and translated it into English using the second approach. Finally, our data distribution is as follows:

| Dataset | Number |
| --- | --- |
| COT Datasets (Chinese, English) | 202333 |
| General Datasets (Chinese, English) | 105216 |
| Code Datasets (Chinese, English) | 44688 |
| Information Extraction Datasets (English) | 537429 |
| Information Extraction Datasets (Chinese) | 486768 |

Flow diagram of KG2Instruction and the other instruction fine-tuning datasets: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/kg2instructions-en.png?raw=true" style="width: 90%; min-width: 90px; display: block; margin: auto;"></a> </p>

<h3 id="3-4">3.4 Training Process (Instruction tuning)</h3>

Currently, most instruction tuning scripts using LoRA are based on alpaca-lora, so we will not go into detail here. Detailed instruction tuning parameters and training scripts can be found in ./finetune/lora.
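For orientation, an alpaca-lora style setup with peft typically looks like the following sketch; the hyperparameter values here are illustrative, and the ones we actually used are recorded in ./finetune/lora.

```python
# Illustrative alpaca-lora style LoRA configuration with peft; the actual
# hyperparameters used for ZhiXi are in ./finetune/lora.
from peft import LoraConfig, get_peft_model
from transformers import LlamaForCausalLM

base = LlamaForCausalLM.from_pretrained("./zhixi")
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, as in alpaca-lora
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA adapters are trainable
```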

<h2 id="4">4. Limitations</h2>

Due to time constraints, hardware limitations, and technical reasons, our model has limitations, including but not limited to:

<h2 id="5">5. TODO List</h2>

<h2 id="6">6. FAQ</h2>

<h2 id="7">7. Others</h2>

<h3 id="7-1">7.1 Contributors (in random order)</h3>

Pretraining: Xiang Chen, Jintian Zhang, Xiaozhuan Liang

Pretraining Data: Zhen Bi, Honghao Gui, Jing Chen, Runnan Fang

Instruction data and Instruction tuning: Xiaohan Wang, Shengyu Mao

Tool learning and Multimodal: Shuofei Qiao, Yixin Ou, Lei Li

Model Editing and Safety: Yunzhi Yao, Peng Wang, Siyuan Cheng, Bozhong Tian, Mengru Wang, Zhoubo Li

Model Testing and Deployment: Yinuo Jiang, Yuqi Zhu, Hongbin Ye, Zekun Xi

<h3 id="7-2">7.2 Citation</h3>

If you use our repository, please cite the following related papers:

@article{cama,
  author = {Jintian Zhang and Xiaohan Wang and Honghao Gui and Xiang Chen and Yinuo Jiang and Zhen Bi and Jing Chen and Shengyu Mao and Shuofei Qiao and Xiaozhuan Liang and Yixin Ou and Ruinan Fang and Zekun Xi and Shumin Deng and Huajun Chen and Ningyu Zhang},
  title = {DeepKE-LLM: A Large Language Model Based Knowledge Extraction Toolkit},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/}},
}

<h3 id="7-3">7.3 Acknowledgment</h3>

We are very grateful to the following open source projects for their help: