Pre-trained Language Model for the Humanities and Social Sciences in Chinese

Introduction

Research on Chinese social science texts requires the support of natural language processing tools.

Pre-trained language models have greatly improved the accuracy of text mining on general-domain texts. At present, however, there is an urgent need for a pre-trained language model tailored to the automatic processing of scientific texts in the Chinese social sciences.

We used abstracts of social science research articles as the training set. Based on the BERT deep language model framework, we constructed the CSSCI_ABS_BERT, CSSCI_ABS_roberta, and CSSCI_ABS_roberta-wwm pre-trained language models with the Hugging Face transformers/run_mlm.py and transformers/mlm_wwm scripts.
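
Because the models were trained with a masked-language-modeling objective, the fill-mask pipeline offers a quick sanity check of a downloaded checkpoint. The following is a minimal sketch assuming the released checkpoints retain their MLM heads; the example sentence is our own illustration, not taken from the training data.

from transformers import pipeline

# Load a checkpoint together with its masked-language-modeling head.
fill_mask = pipeline("fill-mask", model="KM4STfulltext/CSSCI_ABS_BERT")

# Predict the masked character in a social-science-style sentence.
for prediction in fill_mask("本文采用定量[MASK]析的方法开展研究。"):
    print(prediction["token_str"], prediction["score"])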

We designed four downstream text classification tasks on different Chinese social science article corpora to verify the performance of the models.

News

How to use

Huggingface Transformers

The CSSCI_ABS_BERT, CSSCI_ABS_roberta, and CSSCI_ABS_roberta-wwm models can be obtained directly online via the from_pretrained method of Hugging Face Transformers.

from transformers import AutoTokenizer, AutoModel

# CSSCI_ABS_BERT
tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")

# CSSCI_ABS_roberta
tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta")

# CSSCI_ABS_roberta_wwm
tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta_wwm")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta_wwm")
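
Once loaded, any of the three models can serve as a text encoder. The following is a minimal sketch of extracting a sentence-level representation from the final hidden state of the [CLS] token; the input sentence is our own example.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModel.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")

# Encode one sentence and take the final hidden state of [CLS].
inputs = tokenizer("数字人文研究的理论基础与分析方法", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)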

Download Models

From Huggingface
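
All three checkpoints are hosted under the KM4STfulltext organization on the Hugging Face Hub, under the same model IDs used in the from_pretrained calls above.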

Evaluation & Results

Discipline classification experiments on articles published in CSSCI journals

Detailed results for this task are reported in the project repository: https://github.com/S-T-Full-Text-Knowledge-Mining/CSSCI-BERT
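
For readers who want to run a comparable experiment, the following is a minimal fine-tuning sketch using the Hugging Face Trainer. The CSV file names, column names, label count, and hyperparameters are hypothetical placeholders, not the settings used in the reported experiments.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical corpus: each row holds an "abstract" and an integer "label".
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_BERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "KM4STfulltext/CSSCI_ABS_BERT", num_labels=20)  # placeholder label count

def tokenize(batch):
    return tokenizer(batch["abstract"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="cssci_discipline_cls",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # pads batches with the default collator
trainer.train()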

Movement recognition experiments on abstracts from the journal Data Analysis and Knowledge Discovery


Chinese literary entity recognition

Tag            bert-base-chinese   chinese-roberta-wwm-ext   CSSCI_ABS_BERT   CSSCI_ABS_roberta   CSSCI_ABS_roberta_wwm   support
Abstract       55.23               62.44                     56.80            57.96               58.26                     223
Location       61.61               54.38                     61.83            61.40               61.94                    2866
Metric         45.08               41.00                     45.27            46.74               47.13                     622
Organization   46.85               35.29                     45.72            45.44               44.65                     327
Person         88.66               82.79                     88.21            88.29               88.51                    4850
Thing          71.68               65.34                     71.88            71.68               71.81                    5993
Time           65.35               60.38                     64.15            65.26               66.03                    1272
avg            72.69               66.62                     72.59            72.61               72.89                   16153
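
The tags above imply a BIO-style token classification setup. The following is a minimal sketch of loading one of the models as a backbone for named entity recognition; the label count (7 entity types in B-/I- form plus O = 15) is inferred from the table, and the classification head is randomly initialized until fine-tuned.

from transformers import AutoModelForTokenClassification, AutoTokenizer

# 7 entity types (Abstract, Location, Metric, Organization, Person, Thing,
# Time) in BIO format: 7 * 2 + 1 ("O") = 15 labels.
tokenizer = AutoTokenizer.from_pretrained("KM4STfulltext/CSSCI_ABS_roberta_wwm")
model = AutoModelForTokenClassification.from_pretrained(
    "KM4STfulltext/CSSCI_ABS_roberta_wwm", num_labels=15)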

Cited

Disclaimer

Acknowledgment