PubMedNCL
A pretrained language model for document representations of biomedical papers. PubMedNCL is based on PubMedBERT, which is a BERT model pretrained on abstracts and full-texts from PubMedCentral, and fine-tuned via citation neighborhood contrastive learning, as introduced by SciNCL.
How to use the pretrained model
from transformers import AutoTokenizer, AutoModel
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('malteos/PubMedNCL')
model = AutoModel.from_pretrained('malteos/PubMedNCL')
papers = [{'title': 'BERT', 'abstract': 'We introduce a new language representation model called BERT'},
{'title': 'Attention is all you need', 'abstract': ' The dominant sequence transduction models are based on complex recurrent or convolutional neural networks'}]
# concatenate title and abstract with [SEP] token
title_abs = [d['title'] + tokenizer.sep_token + (d.get('abstract') or '') for d in papers]
# preprocess the input
inputs = tokenizer(title_abs, padding=True, truncation=True, return_tensors="pt", max_length=512)
# inference
result = model(**inputs)
# take the first token ([CLS] token) in the batch as the embedding
embeddings = result.last_hidden_state[:, 0, :]
Citation
- Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings (EMNLP 2022 paper).
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing.
License
MIT