BART fine-tuned for keyphrase generation

This is the <a href="https://huggingface.co/facebook/bart-base">bart-base</a> model (<a href="https://arxiv.org/abs/1910.13461">Lewis et al., 2019</a>), <a href="https://ieeexplore.ieee.org/document/10139061">fine-tuned</a> to generate titles and keyphrases for scientific texts on the following corpora:

Krapivin (<a href = "http://eprints.biblio.unitn.it/1671/1/disi09055%2Dkrapivin%2Dautayeu%2Dmarchese.pdf">Krapivin et al., 2009</a>)
Inspec (<a href = "https://aclanthology.org/W03-1028.pdf">Hulth, 2003</a>)

Inspired by <a href="https://aclanthology.org/2020.findings-emnlp.428.pdf">Cachola et al. (2020)</a>, we used control codes to fine-tune BART in a multi-task manner. First, we form text-title and text-keyphrases pairs from the original text corpus, so that the training set contains both titles and comma-separated lists of keyphrases as text generation targets. Second, we prepend the control code <|TITLE|> or <|KEYPHRASES|> to each source text, depending on its target. The training set is then shuffled. Finally, the preprocessed training set is used to fine-tune the pre-trained BART model.
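The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' actual script: the record layout (`text`, `title`, `keyphrases`), the function name `build_multitask_examples`, and the fixed shuffle seed are assumptions made for the example. The control code is placed in front of the source text, matching the usage example in this card.

```python
import random

TITLE_CODE = "<|TITLE|>"
KEYPHRASES_CODE = "<|KEYPHRASES|>"


def build_multitask_examples(records, seed=42):
    """Turn each record into two (source, target) pairs, one per task.

    `records` is assumed to be an iterable of dicts with the keys
    'text', 'title' and 'keyphrases' (a list of strings). This layout
    is hypothetical and chosen only for illustration.
    """
    examples = []
    for rec in records:
        # Title task: control code + text -> title
        examples.append(
            {"source": f"{TITLE_CODE} {rec['text']}", "target": rec["title"]}
        )
        # Keyphrase task: control code + text -> comma-separated keyphrases
        examples.append(
            {
                "source": f"{KEYPHRASES_CODE} {rec['text']}",
                "target": ", ".join(rec["keyphrases"]),
            }
        )
    # Shuffle so that the two tasks are interleaved during training
    random.Random(seed).shuffle(examples)
    return examples


records = [
    {
        "text": "We study keyphrase generation with BART.",
        "title": "Keyphrase generation with BART",
        "keyphrases": ["keyphrase generation", "BART"],
    },
    {
        "text": "We explore transfer learning on small corpora.",
        "title": "Transfer learning for small corpora",
        "keyphrases": ["transfer learning", "small corpora"],
    },
]

examples = build_multitask_examples(records)
for ex in examples:
    print(ex["source"], "->", ex["target"])
```

Each source text appears twice in the resulting set, once per control code, so a single model learns both tasks and the code selects the task at inference time.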

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("beogradjanka/bart_multitask_finetuned_for_title_and_keyphrase_generation")
model = AutoModelForSeq2SeqLM.from_pretrained("beogradjanka/bart_multitask_finetuned_for_title_and_keyphrase_generation")


# Adjacent string literals are concatenated, which avoids stray indentation inside the text
text = (
    "In this paper, we investigate cross-domain limitations of keyphrase generation using the models for abstractive text summarization. "
    "We present an evaluation of BART fine-tuned for keyphrase generation across three types of texts, "
    "namely scientific texts from computer science and biomedical domains and news texts. "
    "We explore the role of transfer learning between different domains to improve the model performance on small text corpora."
)

# Generate \n-separated keyphrases: prefix the text with the <|KEYPHRASES|> control code.
# The tokenizer is called directly; prepare_seq2seq_batch is deprecated.
inputs = tokenizer(["<|KEYPHRASES|> " + text], return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs)
keyphrases = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(keyphrases)

# Generate a title: prefix the text with the <|TITLE|> control code.
inputs = tokenizer(["<|TITLE|> " + text], return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs)
title = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(title)
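The model returns the keyphrases as one string, so downstream code usually needs to split it into a list. The helper below is a hedged sketch: the function name `split_keyphrases` and the sample string are made up for illustration. This card mentions both comma-separated targets and \n-separated output, so the helper accepts either separator. It also drops case-insensitive duplicates, which is an extra design choice rather than something the model card prescribes.

```python
import re
from typing import List


def split_keyphrases(generated: str) -> List[str]:
    """Split a generated string into unique keyphrases, keeping first-seen order."""
    seen = set()
    phrases = []
    # Accept commas and newlines as separators, since either may occur
    for part in re.split(r"[,\n]+", generated):
        phrase = part.strip()
        key = phrase.lower()
        if phrase and key not in seen:
            seen.add(key)
            phrases.append(phrase)
    return phrases


# Illustrative string, not actual model output
sample = "keyphrase generation, transfer learning\nBART, Transfer learning"
print(split_keyphrases(sample))
# → ['keyphrase generation', 'transfer learning', 'BART']
```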

Training Hyperparameters

The following hyperparameters were used during training:

learning_rate: 4e-5
train_batch_size: 8
optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
num_epochs: 3
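For readers who want to reproduce a similar setup with the Hugging Face `Trainer` API, the values above could be expressed as keyword arguments for `Seq2SeqTrainingArguments`. The mapping below is an assumption, not the authors' configuration file: `output_dir` is a placeholder, and treating `train_batch_size` as a per-device value assumes training on a single device.

```python
# Hypothetical mapping of the listed hyperparameters onto
# Seq2SeqTrainingArguments keyword names (an assumption for illustration).
training_kwargs = {
    "output_dir": "bart-multitask-keyphrases",  # placeholder path
    "learning_rate": 4e-5,
    "per_device_train_batch_size": 8,  # assumes a single device
    "num_train_epochs": 3,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}

for name, value in training_kwargs.items():
    print(f"{name}: {value}")
```

With `transformers` installed, these could be passed as `Seq2SeqTrainingArguments(**training_kwargs)`. The default optimizer of `Trainer` is AdamW, which matches the list above.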

BibTeX:

@INPROCEEDINGS{10139061,
  author={Glazkova, Anna and Morozov, Dmitry},
  booktitle={2023 IX International Conference on Information Technology and Nanotechnology (ITNT)}, 
  title={Multi-task fine-tuning for generating keyphrases in a scientific domain}, 
  year={2023},
  pages={1-5},
  doi={10.1109/ITNT57377.2023.10139061}}
