
<p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/KnowLM.png?raw=true" alt="ZJU-KnowLM" style="width: 40%; min-width: 40px; display: block; margin: auto;"></a> </p>

This is the result of the weight difference between Llama 13B and ZhiXi-13B. You can click here to learn more.

Knowledgeable Large Language Model Framework.

With the rapid development of deep learning technology, large language models such as ChatGPT have made substantial strides in natural language processing. However, these models still face several challenges in acquiring and comprehending knowledge, including the difficulty of updating knowledge and potential knowledge discrepancies and biases, collectively known as knowledge fallacies. The KnowLM project aims to tackle these issues by launching an open-source, large-scale knowledgeable language model framework and releasing the corresponding models.

The project's initial phase introduced a knowledge extraction LLM based on LLaMA, dubbed ZhiXi (智析, which means intelligent analysis of data for knowledge extraction). To add Chinese understanding to the language model without compromising its inherent knowledge, we first <b>(1) use Chinese corpora for full-scale pre-training of LLaMA (13B), which augments the language model's understanding of Chinese and improves its knowledge richness while retaining its original English and code capacities;</b> then <b>(2) we fine-tune the model obtained from the first step with an instruction dataset, thus bolstering the language model's understanding of human instructions for knowledge extraction.</b>

The features of this project are as follows:

All weights have been uploaded to HuggingFace🤗. Note that all results shown below are based on ZhiXi-13B-Diff. If you have downloaded ZhiXi-13B-Diff-fp16, the results may vary slightly.

| Model Name | Train Method | Weight Type | Size | Download Link | Notes |
| --- | --- | --- | --- | --- | --- |
| ZhiXi-13B-Diff | Full Pretraining | Differential Weights | 48GB | HuggingFace <br/> GoogleDrive | Restoring the pre-trained weights (i.e. ZhiXi-13B) requires combining them with the weights of LLaMA-13B; please refer to here for specific instructions. |
| ZhiXi-13B-Diff-fp16 | Full Pretraining | Differential Weights (fp16) | 24GB | HuggingFace <br/> GoogleDrive | The main difference from ZhiXi-13B-Diff is that the weights are stored in fp16, which reduces memory usage. However, the restored weights may differ slightly from those obtained in our actual training, which can slightly impact performance. For specific usage instructions, please refer to here. |
| ZhiXi-13B-LoRA | LoRA Instruction-tuning | LoRA Weights | 251MB | HuggingFace <br/> GoogleDrive | It needs to be used with ZhiXi-13B. For specific instructions, please refer to here. |
| ZhiXi-7B Series | Coming soon | Coming soon | Coming soon | Coming soon | Coming soon |


<h2 id="1">1. Quick Start</h2>

<h3 id="1-1">1.1 Environment Configuration</h3>

```shell
conda create -n knowlm python=3.9 -y
conda activate knowlm
pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt
```
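After installation, it can help to confirm that the pinned packages were actually picked up. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` is hypothetical and not part of this repository, and the comparison ignores a local build tag such as `+cu116` when the pin does not include one.

```python
from importlib import metadata

# Pins copied from the pip command above.
EXPECTED = {
    "torch": "1.12.0+cu116",
    "torchvision": "0.13.0+cu116",
    "torchaudio": "0.12.0",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (wanted, found)} for every package whose version differs.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Accept an exact match, or a match once the local tag ("+cu116") is dropped.
        ok = found is not None and (found == wanted or found.split("+")[0] == wanted)
        if not ok:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

Running the script inside the `knowlm` environment prints nothing when all three pins match.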

<h3 id="1-2">1.2 Pretraining model weight acquisition and restoration</h3>

❗❗❗ Hardware note: restoring the weights in this section (1.2), which merges LLaMA-13B with KnowLM-13B-Diff, requires approximately 100GB of RAM and no VRAM; the high RAM requirement is memory overhead caused by our merging strategy. For your convenience, we also provide fp16 weights at this link: https://huggingface.co/zjunlp/zhixi-13b-diff-fp16. The fp16 weights require less memory but may slightly impact performance. We will improve our merging approach in future updates, and we are currently developing a 7B model as well, so stay tuned. Inference with ZhiXi (section 1.4) requires a minimum of 26GB of VRAM.
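As a rough sanity check on these figures, the weight footprint can be estimated from the parameter count. This is back-of-the-envelope arithmetic only: it assumes a round 13 billion parameters, counts weights alone, and ignores activations and any overhead of the merging script.

```python
GIB = 2**30  # bytes in one gibibyte


def weight_size_bytes(n_params: int, bytes_per_param: int) -> int:
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bytes_per_param


N_PARAMS = 13_000_000_000  # assumption: a round 13 billion parameters

fp32_bytes = weight_size_bytes(N_PARAMS, 4)  # 4 bytes per fp32 value
fp16_bytes = weight_size_bytes(N_PARAMS, 2)  # 2 bytes per fp16 value

print(f"fp32: {fp32_bytes / GIB:.1f} GiB")  # fp32: 48.4 GiB
print(f"fp16: {fp16_bytes / GIB:.1f} GiB ({fp16_bytes / 1e9:.0f} GB decimal)")  # fp16: 24.2 GiB (26 GB decimal)
```

Under that assumption, the estimates line up with the download sizes listed for the two differential checkpoints and with the 26GB of VRAM quoted for fp16 inference.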

1. Download LLaMA 13B and KnowLM-13B-Diff

Please click here to apply for the official pre-trained weights of LLaMA from Meta. We use the 13B version of the model, so you only need to download that version. Once downloaded, the file directory will be as follows:

```shell
|-- 13B
|	|-- checklist.chk
|	|-- consolidated.00.pth
|	|-- consolidated.01.pth
|	|-- params.json
|-- llama.sh
|-- tokenizer.model
|-- tokenizer_checklist.chk
```

You can use the following command to download the KnowLM-13B-Diff file (assuming it is saved in the ./knowlm-diff folder):

```shell
python tools/download.py --specify --repo_name openkg/knowlm-13b-diff --download_path ./knowlm-diff
```

:exclamation: Note: if the download is interrupted, rerun the command above. HuggingFace supports resumable downloads, so the download continues from where it was interrupted.

2. Use the conversion script provided by HuggingFace

To convert the original LLaMA-13B model into the HuggingFace format, use the conversion script provided by HuggingFace, which can be found here. The command below runs the script, assuming the downloaded original LLaMA-13B files are located in ./ and the converted files should be stored in ./converted:

```shell
python convert_llama_weights_to_hf.py --input_dir ./ --model_size 13B --output_dir ./converted
```

3. Restore KnowLM 13B

Run the script we provide at ./tools/weight_diff.py with the following command to obtain the complete KnowLM weights:

```shell
python tools/weight_diff.py recover --path_raw ./converted --path_diff ./knowlm-diff --path_tuned ./knowlm --check_integrity_naively False
```

The final complete KnowLM weights are saved in the ./knowlm folder.
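Conceptually, restoring from differential weights amounts to adding the diff back onto the base model, parameter by parameter. The sketch below illustrates that idea on plain Python lists. It is an assumption about the scheme (diff = tuned − raw, element-wise), not the code in ./tools/weight_diff.py, which works on real tensors and may perform additional checks; the function names are hypothetical.

```python
def make_diff(raw, tuned):
    """Element-wise difference tuned - raw for every named parameter."""
    return {
        name: [t - r for r, t in zip(raw[name], tuned[name])]
        for name in tuned
    }


def recover(raw, diff):
    """Rebuild the tuned weights as raw + diff."""
    if raw.keys() != diff.keys():
        raise KeyError("base model and diff must contain the same parameter names")
    restored = {}
    for name, delta in diff.items():
        if len(raw[name]) != len(delta):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        restored[name] = [r + d for r, d in zip(raw[name], delta)]
    return restored


# Toy weights (hypothetical values standing in for real tensors).
raw = {"layer.weight": [0.5, 1.0, -2.0]}
tuned = {"layer.weight": [0.75, 1.5, -2.0]}

diff = make_diff(raw, tuned)
print(recover(raw, diff) == tuned)  # True
```

Under this scheme, the differential checkpoint is only usable together with the original LLaMA-13B weights.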

<h3 id="1-3">1.3 Instruction tuning LoRA weight acquisition</h3>

Run the script we provide at ./tools/download.py with the following command to get the LoRA weights (assuming they are saved in ./lora):

```shell
python tools/download.py --download_path ./lora --specify --repo_name openkg/knowlm-13b-lora
```

The final complete weights are saved in the ./lora folder.

<h3 id="1-4">1.4 Model Usage Guide</h3>

1. Usage of Pretraining Model

We offer two methods: the first one is command-line interaction, and the second one is web-based interaction, which provides greater flexibility.

  1. Use the following command to enter command-line interaction:

    python examples/generate_finetune.py --base_model ./knowlm --interactive
    

    The disadvantage is the inability to dynamically change decoding parameters.

  2. Use the following command to enter web-based interaction:

    python examples/generate_finetune_web.py --base_model ./knowlm
    

    Here is a screenshot of the web-based interaction: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/finetune_web.jpg?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>

2. Usage of Instruction tuning Model

Here, we provide a web-based interaction method. Use the following command to access the web:

```shell
python examples/generate_lora_web.py --base_model ./knowlm --lora_weights ./lora
```

Here is a screenshot of the web-based interaction: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/lora_web.png?raw=true" alt="finetune-web" style="width: 100%; min-width: 100px; display: block; margin: auto;"></a> </p>

The instruction field is required, while the input field is optional. For general tasks, you can enter everything directly in the instruction field. For information extraction tasks, enter the instruction in the instruction field and the sentence to extract from in the input field. We provide information extraction prompts in section 1.5.

If you want to perform batch testing, please modify the examples/generate_lora.py file and update the examples and hyperparameters in the variable cases.
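To make the roles of the two fields concrete, here is a minimal sketch of how an instruction and an optional input could be combined into a single prompt. It assumes the widely used Alpaca-style template; the template actually used by the scripts in examples/ may differ, and `build_prompt` is a hypothetical name.

```python
from typing import Optional

# Assumption: Alpaca-style prompt templates.
WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
WITHOUT_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Format a prompt; the instruction is required, the input is optional."""
    if not instruction.strip():
        raise ValueError("instruction is required")
    if input_text:
        return WITH_INPUT.format(instruction=instruction, input=input_text)
    return WITHOUT_INPUT.format(instruction=instruction)


# General task: everything goes in the instruction.
print(build_prompt("Explain what a knowledge graph is."))

# Information extraction task: instruction plus the sentence to extract from.
print(build_prompt(
    "Extract all person and location entities from the input.",
    "Alan Turing was born in London.",
))
```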

<h3 id="1-5">1.5 Information Extraction Prompt</h3>

For information extraction tasks such as named entity recognition (NER), event extraction (EE), and relation extraction (RE), we provide some prompts for ease of use. You can refer to this link for examples. Of course, you can also try using your own prompts.

Here is a case where KnowLM-13B-LoRA is used to accomplish the instruction-based knowledge graph construction task in CCKS2023.

<h2 id="2">2. Training Details</h2>

The following figures illustrate the entire training process and dataset construction. The training process is divided into two stages:

(1) Full pre-training stage. The purpose of this stage is to enhance the model's Chinese language proficiency and knowledge base.

(2) Instruction tuning stage using LoRA. This stage enables the model to understand human instructions and generate appropriate responses.

<h3 id="2-1">2.1 Dataset Construction (Pretraining)</h3>

In order to enhance the model's understanding of Chinese while preserving its original code and English language capabilities, we did not expand the vocabulary. Instead, we collected Chinese corpora, English corpora, and code corpora. The Chinese corpora were sourced from Baidu Baike, Wudao, and Chinese Wikipedia. The English dataset was sampled from the original English corpus of LLaMA, with the exception of the Wikipedia data. The original paper's English Wikipedia data was up until August 2022, and we additionally crawled data from September 2022 to February 2023, covering a total of six months. As for the code dataset, due to the low-quality code in the Pile dataset, we crawled code data from GitHub and LeetCode. A portion of the data was used for pre-training, while another portion was used for fine-tuning with instructions.

For the crawled datasets mentioned above, we employed a heuristic approach to filter out harmful content. Additionally, we removed duplicate data.

<h3 id="2-2">2.2 Training Process (Pretraining)</h3>

Detailed data processing code, training code, complete training scripts, and detailed training results can be found in ./pretrain.

Before training, we need to tokenize the data. We set the maximum length of a single sample to 1024, while most documents are much longer than this. Therefore, we need to partition these documents. We designed a greedy algorithm to split the documents, with the goal of ensuring that each sample consists of complete sentences and minimizing the number of segments while maximizing the length of each sample. Additionally, due to the diversity of data sources, we developed a comprehensive data preprocessing tool that can process and merge data from various sources. Finally, considering the large amount of data, loading it directly into memory would impose excessive hardware pressure. Therefore, we referred to DeepSpeed-Megatron and used the mmap method to process and load the data. This involves loading the indices into memory and accessing the corresponding data on disk when needed.
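The greedy splitting idea can be sketched as follows. This is a minimal illustration, not the code in ./pretrain. Assumptions: sentence boundaries are found with a simple punctuation regex (which would also split decimals such as "3.14"), and length is measured by a pluggable `length` function that defaults to character count, whereas the real pipeline would presumably count tokens. The names `split_sentences` and `greedy_pack` are hypothetical.

```python
import re

# Split after Chinese or English sentence-ending punctuation (zero-width match).
_SENT_END = re.compile(r"(?<=[。！？.!?])")


def split_sentences(text):
    """Cut text into sentences, keeping the punctuation with each sentence."""
    return [s for s in _SENT_END.split(text) if s]


def greedy_pack(sentences, max_len, length=len):
    """Pack consecutive sentences into samples of at most max_len units.

    Each sample is extended as far as possible before a new one is started.
    A single sentence longer than max_len is emitted as its own oversized
    sample; handling that case properly is left out of this sketch.
    """
    samples, current, current_len = [], [], 0
    for sent in sentences:
        n = length(sent)
        if current and current_len + n > max_len:
            samples.append("".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        samples.append("".join(current))
    return samples


text = "Aa. Bb. Cc."
samples = greedy_pack(split_sentences(text), max_len=8)
print(samples)  # ['Aa. Bb.', ' Cc.']
```

Because the order of sentences is preserved, extending each sample as far as possible before starting the next one also yields the smallest possible number of samples for a given limit.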

Finally, we performed pre-training on 5.5 million Chinese samples, 1.5 million English samples, and 0.9 million code samples. We used the transformers Trainer together with DeepSpeed ZeRO3 (we observed that ZeRO2 was slower in a multi-node, multi-GPU setup). Training ran on 3 nodes, each equipped with eight 32GB V100 GPUs. The table below shows our training speed:

| Parameter | Value |
| --- | --- |
| micro batch size | 20 |
| gradient accumulation | 3 |
| global batch size | 20 × 3 × 24 = 1440 |
| Time per step | 260s |
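The global batch size in the table follows from the hardware layout: 3 nodes with 8 GPUs each give 24 data-parallel workers. The short calculation below reproduces that figure and derives the throughput and the number of optimizer steps needed for one pass over the data. The derived values are plain arithmetic on the numbers quoted above, not separately reported results.

```python
import math

nodes, gpus_per_node = 3, 8
micro_batch_size = 20        # samples per GPU per forward/backward pass
gradient_accumulation = 3    # micro-batches accumulated before each update
seconds_per_step = 260

world_size = nodes * gpus_per_node                                      # 24
global_batch_size = micro_batch_size * gradient_accumulation * world_size  # 1440

samples_per_second = global_batch_size / seconds_per_step

# Sample counts quoted above: Chinese + English + code.
total_samples = 5_500_000 + 1_500_000 + 900_000
steps_per_pass = math.ceil(total_samples / global_batch_size)

print(f"global batch size: {global_batch_size}")
print(f"throughput: {samples_per_second:.2f} samples/s")
print(f"steps for one pass over the data: {steps_per_pass}")
```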

<h3 id="2-3">2.3 Dataset Construction (Instruction tuning)</h3>

Unlike many otherwise similar models, ours adds information extraction abilities, including NER (Named Entity Recognition), RE (Relation Extraction), and EE (Event Extraction), on top of general capabilities such as reasoning and coding. Many open-source datasets, such as the Alpaca dataset, the CoT dataset, and the code dataset, are in English. To obtain the corresponding Chinese datasets, we used GPT-4 in two ways: 1) translating the questions and answers directly into Chinese, and 2) feeding the English questions to GPT-4 and having it generate Chinese responses. We used the second approach for general datasets and the first approach for datasets such as the CoT dataset and code dataset. These datasets are readily available online.

For the information extraction datasets, we used open-source English datasets such as CoNLL, ACE, CASIS, and others, and constructed corresponding English instructions to produce the required training format. For the Chinese NER and EE tasks, we used open-source datasets such as DualEE, PEOPLE DAILY, and others, and created corresponding Chinese instructions in the same way. For the RE task, we built a dataset called KG2Instruction. Specifically, we ran BERT-based Chinese entity recognition over Chinese Wikipedia data and aligned the recognized entities with the Wikipedia index. Because mentions can be ambiguous (a Chinese entity may have multiple index entries, such as apple referring to both a fruit and a company), we devised a strategy to disambiguate the entities. We then used distant supervision to generate candidate triplets and applied predefined rules to filter out invalid or incorrect ones. Finally, we refined the remaining triplets with the help of crowdsourcing and constructed corresponding Chinese instructions to produce the required training format.
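The distant-supervision and rule-filtering steps of KG2Instruction can be illustrated with a toy example. Everything below is hypothetical: the knowledge graph entries, the type schema, the instruction wording, and the function names are made up for illustration, and the entity recognition, disambiguation, and crowdsourcing stages are omitted.

```python
from itertools import permutations

# Hypothetical knowledge graph: (head, tail) -> relation.
# The second entry is deliberately wrong so the rule filter has something to remove.
KG = {
    ("Zhejiang University", "Hangzhou"): "located in",
    ("Hangzhou", "Zhejiang University"): "located in",
}

# Hypothetical predefined rule: allowed (head type, tail type) per relation.
SCHEMA = {"located in": ("organization", "location")}


def candidate_triples(entities, kg):
    """Distant supervision: every ordered entity pair linked in the KG is a candidate."""
    candidates = []
    for (head, head_type), (tail, tail_type) in permutations(entities, 2):
        relation = kg.get((head, tail))
        if relation is not None:
            candidates.append((head, head_type, relation, tail, tail_type))
    return candidates


def filter_by_schema(candidates, schema):
    """Keep only candidates whose entity types match the relation's schema."""
    return [
        (head, relation, tail)
        for head, head_type, relation, tail, tail_type in candidates
        if schema.get(relation) == (head_type, tail_type)
    ]


def to_instruction_sample(sentence, triples):
    """Wrap a sentence and its triples in an instruction/input/output record."""
    return {
        "instruction": "Extract all relation triples from the input.",
        "input": sentence,
        "output": "; ".join(f"({h}, {r}, {t})" for h, r, t in triples),
    }


sentence = "Zhejiang University is located in Hangzhou."
entities = [("Zhejiang University", "organization"), ("Hangzhou", "location")]

triples = filter_by_schema(candidate_triples(entities, KG), SCHEMA)
print(to_instruction_sample(sentence, triples))
```

In this toy run the reversed pair is proposed by distant supervision but rejected by the type rule, which mirrors the role of the predefined filtering rules described above.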

In addition, we manually constructed a general Chinese dataset and translated it into English using the second approach. Finally, our data distribution is as follows:

| Dataset | Number |
| --- | --- |
| CoT Datasets (Chinese, English) | 202333 |
| General Datasets (Chinese, English) | 105216 |
| Code Datasets (Chinese, English) | 44688 |
| Information Extraction Datasets (English) | 537429 |
| Information Extraction Datasets (Chinese) | 486768 |

Flow diagram of KG2Instruction and other instruction fine-tuning datasets: <p align="center" width="100%"> <a href="" target="_blank"><img src="https://github.com/zjunlp/KnowLM/blob/main/assets/kg2instructions-en.png?raw=true" style="width: 90%; min-width: 90px; display: block; margin: auto;"></a> </p>

<h3 id="2-4">2.4 Training Process (Instruction tuning)</h3>

Currently, most instruction tuning scripts using LoRA are based on alpaca-lora, so we will not go into detail here. Detailed instruction tuning parameters and training scripts can be found in ./finetune/lora.
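For readers unfamiliar with LoRA, the core idea is that the base weight matrix W stays frozen and only a low-rank update is trained, so the effective weight is W + (alpha / r) · B·A. The sketch below shows that merge on tiny pure-Python matrices. It illustrates the standard LoRA formulation only; the rank, alpha, and target modules used for this model are defined in ./finetune/lora and are not reproduced here, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w:      d x k frozen base weight
    lora_a: r x k trained matrix A
    lora_b: d x r trained matrix B
    """
    r = len(lora_a)
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # d x k low-rank update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example with d = k = 2 and rank r = 1 (values are arbitrary).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]
lora_a = [[0.5, 0.0]]

print(merge_lora(w, lora_a, lora_b, alpha=2))  # [[2.0, 0.0], [2.0, 1.0]]
```

Only A and B need to be stored and distributed, which is why LoRA weights are far smaller than the base model they are applied to.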

<h2 id="3">3. Limitations</h2>

Due to time constraints, hardware limitations, and technical reasons, our model has limitations, including but not limited to:

<h2 id="4">4. TODO List</h2>

<h2 id="5">5. FAQ</h2>

<h2 id="7">6. Others</h2>

<h3 id="7-1">6.1 Contributors (In Random Order)</h3>

Pretraining: Xiang Chen, Jintian Zhang, Xiaozhuan Liang

Pretraining Data: Zhen Bi, Honghao Gui, Jing Chen, Runnan Fang

Instruction data and Instruction tuning: Xiaohan Wang, Shengyu Mao

Tool learning and Multimodal: Shuofei Qiao, Yixin Ou, Lei Li

Model Editing and Safety: Yunzhi Yao, Peng Wang, Siyuan Cheng, Bozhong Tian, Mengru Wang, Zhoubo Li

Model Testing and Deployment: Yinuo Jiang, Yuqi Zhu, Hongbin Ye, Zekun Xi, Xinrong Li

<h3 id="7-2">6.2 Citation</h3>

If you use our repository, please cite the following related papers:

```bibtex
@article{deepke-llm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Ruinan Fang and Zekun Xi and Xin Xu and Liankuan Tao and Lei Li and Peng Wang and Zhoubo Li and Guozhou Zheng and Huajun Chen},
  title = {DeepKE-LLM: A Large Language Model Based Knowledge Extraction Toolkit},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/}},
}
```

<h3 id="7-3">6.3 Acknowledgment</h3>

We are very grateful to the following open source projects for their help: