COVID-SciBERT: A small language modelling expansion of SciBERT, a BERT model trained on scientific text.

Details of SciBERT

SciBERT was introduced in "SciBERT: A Pretrained Language Model for Scientific Text" by Iz Beltagy, Kyle Lo, and Arman Cohan. The abstract reads:

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks.

Details of the downstream task (Language Modeling) - Dataset 📚

Two datasets were used here:

Model training

The training script is available here.

Pipelining the Model

import transformers

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for BERT-style models.
model = transformers.AutoModelForMaskedLM.from_pretrained('lordtt13/COVID-SciBERT')

tokenizer = transformers.AutoTokenizer.from_pretrained('lordtt13/COVID-SciBERT')

# Build a fill-mask pipeline and ask for the most likely tokens at the mask position.
nlp_fill = transformers.pipeline('fill-mask', model=model, tokenizer=tokenizer)
nlp_fill('Coronavirus or COVID-19 can be prevented by a ' + nlp_fill.tokenizer.mask_token)

# Output:
# [{'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a combination [SEP]',
#   'score': 0.1719885915517807,
#   'token': 2702},
#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a simple [SEP]',
#   'score': 0.054218728095293045,
#   'token': 2177},
#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a novel [SEP]',
#   'score': 0.043364267796278,
#   'token': 3045},
#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a high [SEP]',
#   'score': 0.03732519596815109,
#   'token': 597},
#  {'sequence': '[CLS] coronavirus or covid - 19 can be prevented by a vaccine [SEP]',
#   'score': 0.021863549947738647,
#   'token': 7039}]
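
The pipeline returns a list of dicts, one per candidate token. As a minimal sketch, the filled-in word and its score can be pulled out of results shaped like the output above. The helper name `filled_words` is hypothetical (it is not part of transformers), and it assumes the mask was the last token of the prompt and that sequences still carry the `[CLS]` and `[SEP]` markers shown above. The scores are copied from the output above, rounded to four decimal places.

```python
# Hypothetical helper, not part of transformers: extract (word, score) pairs
# from fill-mask results shaped like the output shown above.
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def filled_words(results):
    """Return (word, score) pairs, best first.

    Assumes the mask was the last token of the prompt, so the predicted
    word is the last non-special token in each returned sequence.
    """
    pairs = []
    for item in results:
        words = [w for w in item["sequence"].split() if w not in SPECIAL_TOKENS]
        pairs.append((words[-1], item["score"]))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Results copied from the output above (scores rounded to four decimals).
prefix = "[CLS] coronavirus or covid - 19 can be prevented by a "
results = [
    {"sequence": prefix + "combination [SEP]", "score": 0.1720, "token": 2702},
    {"sequence": prefix + "simple [SEP]", "score": 0.0542, "token": 2177},
    {"sequence": prefix + "novel [SEP]", "score": 0.0434, "token": 3045},
    {"sequence": prefix + "high [SEP]", "score": 0.0373, "token": 597},
    {"sequence": prefix + "vaccine [SEP]", "score": 0.0219, "token": 7039},
]

for word, score in filled_words(results):
    print(f"{word:<12} {score:.3f}")
```

If the mask sits in the middle of the prompt, this positional shortcut no longer applies; in that case the `token` id in each result can be mapped back to text with the tokenizer instead.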

Created by Tanmay Thakur | LinkedIn

PS: Still looking for more resources to expand my expansion!