license: apache-2.0

1. Differences from knowlm-13b-zhixi

Compared to zjunlp/knowlm-13b-zhixi, zjunlp/knowlm-13b-ie is slightly more effective at information extraction, but at the cost of reduced general-purpose capability.

zjunlp/knowlm-13b-ie is trained on roughly 10% of the data sampled from Chinese and English information extraction datasets, to which negative sampling is then applied. For instance, if dataset A has the label set [a, b, c, d, e, f] and a given sample 's' contains only labels a and b, we randomly add relations it does not originally contain, such as c and d, drawn from the specified list of candidate relations. For these added relations, the model is trained to output text such as 'NAN'. This equips the model with the ability to produce 'NAN' outputs to a certain extent, enhancing its information extraction capability while weakening its generalization ability.
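
For concreteness, here is a minimal sketch of this negative-sampling step (the function name, field names, and data layout are illustrative assumptions, not the project's actual code):

import random

def add_negative_relations(sample, candidate_relations, num_neg=2):
    # `sample` is assumed to look like {"text": ..., "relations": {label: [triples]}}.
    # Relations added here have no gold triples, so their target output is 'NAN'.
    present = set(sample["relations"])
    absent = [r for r in candidate_relations if r not in present]
    for rel in random.sample(absent, min(num_neg, len(absent))):
        sample["relations"][rel] = "NAN"   # the model learns to answer 'NAN' for unsupported relations
    return sample

# Dataset A has labels [a, b, c, d, e, f]; sample s only contains a and b.
s = {"text": "...", "relations": {"a": [("h1", "a", "t1")], "b": [("h2", "b", "t2")]}}
print(add_negative_relations(s, ["a", "b", "c", "d", "e", "f"]))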

2. IE templates

Relation extraction (RE) supports the following templates:

relation_template_zh =  {
    0:'已知候选的关系列表:{s_schema},请你根据关系列表,从以下输入中抽取出可能存在的头实体与尾实体,并给出对应的关系三元组。请按照{s_format}的格式回答。',
    1:'我将给你个输入,请根据关系列表:{s_schema},从输入中抽取出可能包含的关系三元组,并以{s_format}的形式回答。',
    2:'我希望你根据关系列表从给定的输入中抽取可能的关系三元组,并以{s_format}的格式回答,关系列表={s_schema}。',
    3:'给定的关系列表是{s_schema}\n根据关系列表抽取关系三元组,在这个句子中可能包含哪些关系三元组?请以{s_format}的格式回答。',
} 

relation_int_out_format_zh = {
    0:['"(头实体,关系,尾实体)"', relation_convert_target0],
    1:['"头实体是\n关系是\n尾实体是\n\n"', relation_convert_target1],
    2:['"关系:头实体,尾实体\n"', relation_convert_target2],
    3:["JSON字符串[{'head':'', 'relation':'', 'tail':''}, ]", relation_convert_target3],
}

relation_template_en =  {
    0:'Identify the head entities (subjects) and tail entities (objects) in the following text and provide the corresponding relation triples from relation list {s_schema}. Please provide your answer as a list of relation triples in the form of {s_format}.',
    1:'From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples. The relations are {s_schema}. Please format your answer as a list of relation triples in the form of {s_format}.', 
}

relation_int_out_format_en = {
    0:['(Subject, Relation, Object)', relation_convert_target0_en],
    1:["{'head':'', 'relation':'', 'tail':''}", relation_convert_target1_en],
}

Both the schema and format placeholders ({s_schema} and {s_format}) are embedded within the templates and must be specified by users.
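
As a quick illustration (assumed usage, not code from the repository), the placeholders can be filled with Python's str.format:

schema = "['founder', 'located in', 'type']"        # candidate relation list -> {s_schema}
out_format = '(Subject, Relation, Object)'          # desired output format   -> {s_format}
prompt = relation_template_en[0].format(s_schema=schema, s_format=out_format)
print(prompt)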

For a more comprehensive understanding of the templates, please refer to the files ner_template.py, re_template.py, and ee_template.py.

3. Common relationship types

{
    '组织': ['别名', '位于', '类型', '成立时间', '解散时间', '成员', '创始人', '事件', '子组织', '产品', '成就', '运营'], 
    '医学': ['别名', '病因', '症状', '可能后果', '包含', '发病部位'], 
    '事件': ['别名', '类型', '发生时间', '发生地点', '参与者', '主办方', '提名者', '获奖者', '赞助者', '获奖作品', '获胜者', '奖项'], 
    '运输': ['别名', '位于', '类型', '属于', '途径', '开通时间', '创建时间', '车站等级', '长度', '面积'], 
    '人造物件': ['别名', '类型', '受众', '成就', '品牌', '产地', '长度', '宽度', '高度', '重量', '价值', '制造商', '型号', '生产时间', '材料', '用途', '发现者或发明者'], 
    '生物': ['别名', '学名', '类型', '分布', '父级分类单元', '主要食物来源', '用途', '长度', '宽度', '高度', '重量', '特征'], 
    '建筑': ['别名', '类型', '位于', '临近', '名称由来', '长度', '宽度', '高度', '面积', '创建时间', '创建者', '成就', '事件'], 
    '自然科学': ['别名', '类型', '性质', '生成物', '用途', '组成', '产地', '发现者或发明者'], 
    '地理地区': ['别名', '类型', '所在行政领土', '接壤', '事件', '面积', '人口', '行政中心', '产业', '气候'], 
    '作品': ['别名', '类型', '受众', '产地', '成就', '导演', '编剧', '演员', '平台', '制作者', '改编自', '包含', '票房', '角色', '作曲者', '作词者', '表演者', '出版时间', '出版商', '作者'], 
    '人物': ['别名', '籍贯', '国籍', '民族', '朝代', '出生时间', '出生地点', '死亡时间', '死亡地点', '专业', '学历', '作品', '职业', '职务', '成就', '所属组织', '父母', '配偶', '兄弟姊妹', '亲属', '同事', '参与'], 
    '天文对象': ['别名', '类型', '坐标', '发现者', '发现时间', '名称由来', '属于', '直径', '质量', '公转周期', '绝对星等', '临近']
}

This schema covers 12 text topics and the common relation types under each topic.
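
For example (illustrative only), assuming the dict above is loaded as topic_schema, the candidate relations for one topic can be looked up and substituted into the {s_schema} placeholder:

person_relations = topic_schema['人物']   # candidate relation types for the "Person" topic
s_schema = str(person_relations)          # value substituted into {s_schema}
print(s_schema)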

4. Conversion scripts

Two scripts, convert.py and convert_test.py, are provided to facilitate the uniform conversion of data into KnowLM instructions. The data directory contains the expected data format for each task before executing convert.py.

python kg2instruction/convert.py \
  --src_path data/NER/sample.json \
  --tgt_path data/NER/processed.json \
  --schema_path data/NER/schema.json \
  --language zh \
  --task NER \
  --sample 0 \
  --all

convert_test.py does not require the data to contain label (entity, relation, event) fields; it only needs an input field and a schema_path, which makes it suitable for processing test data.

python kg2instruction/convert_test.py \
    --src_path data/NER/sample.json \
    --tgt_path data/NER/processed.json \
    --schema_path data/NER/schema.json \
    --language zh \
    --task NER \
    --sample 0
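
For example, a minimal test file for convert_test.py only needs an "input" field per record (assuming JSON-lines input; the sentence below is purely illustrative):

import json

# "Zhejiang University is located in Hangzhou, China."
with open('data/NER/sample.json', 'w', encoding='utf-8') as f:
    f.write(json.dumps({'input': '浙江大学位于中国杭州市。'}, ensure_ascii=False) + '\n')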

5. Datasets

Below are some processed, ready-to-use datasets:

| Name | Download Links | Quantity | Description |
| --- | --- | --- | --- |
| KnowLM-IE.json | Google Drive <br/> HuggingFace | 281860 | Dataset mentioned in InstructIE |
| KnowLM-ke | HuggingFace | XXXX | Contains all instruction data (General, IE, Code, COT, etc.) used for training zjunlp/knowlm-13b-zhixi |

KnowLM-IE.json: Contains the fields 'id' (unique identifier), 'cate' (text category), 'instruction' (extraction instruction), 'input' (input text), 'output' (output text), and 'relation' (triples). The 'relation' field can be used to freely construct extraction instructions and outputs. 'instruction' has 16 formats (4 prompts × 4 output formats), and 'output' is generated in the format specified by 'instruction'.
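
An illustrative sketch of one record follows (field values and the exact nesting of 'relation' are assumptions based on the field list above):

{
    'id': '...',
    'cate': '人物',                          # text category, one of the 12 topics above
    'instruction': '已知候选的关系列表:...',  # one of the 16 instruction formats
    'input': '...',                          # input text
    'output': '...',                         # answer in the format requested by the instruction
    'relation': [{'head': '...', 'relation': '...', 'tail': '...'}]   # gold triples
}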

KnowLM-ke: Contains only the fields 'instruction', 'input', and 'output'. The files ee-en.json, ee_train.json, ner-en.json, ner_train.json, re-en.json, and re_train.json under its directory contain the Chinese and English IE instruction data.

6. Usage

We provide a script, inference.py, for direct inference using the zjunlp/knowlm-13b-ie model. Please refer to the README.md for environment configuration and other details.

CUDA_VISIBLE_DEVICES="0" python src/inference.py \
    --model_name_or_path 'models/knowlm-13b-ie' \
    --model_name 'llama' \
    --input_file 'data/NER/processed.json' \
    --output_file 'results/ner_test.json' \
    --fp16 

If GPU memory is insufficient, you can use --bits 8 or --bits 4 for quantized inference.
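
If you prefer to call the model directly instead of going through inference.py, a minimal sketch with the Hugging Face transformers library might look like this (the prompt and generation parameters are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zjunlp/knowlm-13b-ie")
model = AutoModelForCausalLM.from_pretrained("zjunlp/knowlm-13b-ie", device_map="auto", torch_dtype="auto")

prompt = ("Identify the head entities (subjects) and tail entities (objects) in the following text "
          "and provide the corresponding relation triples from relation list ['founder', 'located in']. "
          "Please provide your answer as a list of relation triples in the form of (Subject, Relation, Object).\n"
          "Apple was founded by Steve Jobs.")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))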

7. Evaluation

We provide a script, kg2instruction/evaluate.py, to convert the model's string outputs into lists of triples and to calculate the F1 score.

python kg2instruction/evaluate.py \
  --standard_path data/NER/processed.json \
  --submit_path data/NER/processed.json \
  --task ner \
  --language zh
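
For reference, the core of such an evaluation is a set-based micro F1 over triples; a minimal sketch (not the project's actual scoring code) looks like this:

def triple_f1(gold_lists, pred_lists):
    # Micro precision/recall/F1 over per-sentence lists of (head, relation, tail) triples.
    tp = fp = fn = 0
    for gold, pred in zip(gold_lists, pred_lists):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [[("Apple", "founder", "Steve Jobs")]]
pred = [[("Apple", "founder", "Steve Jobs"), ("Apple", "located in", "USA")]]
print(triple_f1(gold, pred))   # (0.5, 1.0, 0.666...)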