Model Card

Model Details

Model Description

IssueReportClassifier-NLBSE22 is a RoBERTa model fine-tuned on the NLBSE22 dataset to classify GitHub issue reports into three categories: bug, enhancement, and question. It predicts the category of a new issue report from its text content (title and body).
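As a minimal sketch of how such a classifier might be called, the snippet below uses the Hugging Face `transformers` text-classification pipeline. The value of `MODEL_ID` is a hypothetical placeholder, and joining title and body with a newline is an assumption; this card does not state how the two fields were combined for training.

```python
from typing import Dict, List

# Hypothetical placeholder: replace with the real Hub ID or a local checkpoint path.
MODEL_ID = "path/to/IssueReportClassifier-NLBSE22"


def build_input(title: str, body: str) -> str:
    """Join title and body into one text (assumed format: newline-separated)."""
    return f"{title.strip()}\n{body.strip()}".strip()


def classify_issue(title: str, body: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Return the predicted label and score for a single issue report."""
    # Imported lazily so build_input works without transformers installed.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    # truncation=True keeps long issue bodies within the model's input limit.
    return classifier(build_input(title, body), truncation=True)
```

Calling `classify_issue("App crashes on start", "Steps to reproduce: ...")` would return a one-element list holding a `label` and a `score`; the exact label strings depend on the checkpoint's configuration.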

Dataset

| Category | Training Set | Test Set |
|---|---|---|
| bug | 361,239 (50%) | 40,152 (49.9%) |
| enhancement | 299,287 (41.4%) | 33,290 (41.3%) |
| question | 62,373 (8.6%) | 7,076 (8.8%) |
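The percentages follow directly from the counts, and they show that the question class is much smaller than the other two. A short sketch that recomputes the shares from the counts listed above (the variable and function names are illustrative):

```python
# Issue counts per category, copied from the dataset table.
counts = {
    "training": {"bug": 361_239, "enhancement": 299_287, "question": 62_373},
    "test": {"bug": 40_152, "enhancement": 33_290, "question": 7_076},
}


def class_shares(split_counts):
    """Return each category's share of the split, in percent (one decimal)."""
    total = sum(split_counts.values())
    return {label: round(100 * n / total, 1) for label, n in split_counts.items()}


shares = {split: class_shares(c) for split, c in counts.items()}
```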

Data preprocessing

The training data was preprocessed with ekphrasis, supplemented with regular expressions that remove code, images, and URLs. See our GitHub repository for details.
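To illustrate the regular-expression part of this step, here is a minimal sketch using only Python's `re` module. The patterns and the `strip_noise` helper are assumptions written for illustration; they are not the authors' expressions, and the ekphrasis stage is not shown.

```python
import re

# Illustrative patterns only; the authors' actual expressions may differ.
FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)   # Markdown fenced code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")                # inline code spans
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")        # Markdown images
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def strip_noise(text: str) -> str:
    """Remove code, images, and URLs, then collapse runs of whitespace."""
    # Images are removed before bare URLs because image markup contains a URL.
    for pattern in (FENCED_CODE, INLINE_CODE, MD_IMAGE, HTML_IMAGE, URL):
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```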

Metrics

The model is evaluated with the F1 score. The paper cited below reports F1 = .8591 for the best fine-tuned RoBERTa model.
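A minimal sketch of how an F1 score, the metric quoted in the abstract of the citation below, can be computed for the three categories. The averaging shown here (per-class and micro) is an assumption, since this card does not state which variant was used.

```python
from collections import Counter

LABELS = ("bug", "enhancement", "question")


def per_class_f1(y_true, y_pred, labels=LABELS):
    """Return F1 for each label: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label data this equals accuracy."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true) if y_true else 0.0
```

Both functions take two equal-length sequences of label strings, for example the gold labels of the test set and the model's predictions.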

References

Cite our work

@inproceedings{Colavito-2022,
  title = {Issue Report Classification Using Pre-trained Language Models},
  booktitle = {2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  author = {Colavito, Giuseppe and Lanubile, Filippo and Novielli, Nicole},
  year = {2022},
  month = may,
  pages = {29--32},
  doi = {10.1145/3528588.3528659},
  abstract = {This paper describes our participation in the tool competition organized in the scope of the 1st International Workshop on Natural Language-based Software Engineering. We propose a supervised approach relying on fine-tuned BERT-based language models for the automatic classification of GitHub issues. We experimented with different pre-trained models, achieving the best performance with fine-tuned RoBERTa (F1 = .8591).},
  keywords = {Issue classification, BERT, deep learning, labeling unstructured data, software maintenance and evolution},
}