Model Card

Model Details

Model Name: IssueReportClassifier-NLBSE22
Base Model: RoBERTa
Dataset: NLBSE22
Model Type: Fine-tuned
Model Version: 1.0
Model Date: 2023-03-21

Model Description

IssueReportClassifier-NLBSE22 is a RoBERTa model fine-tuned on the NLBSE22 dataset to classify GitHub issue reports into three categories: bug, enhancement, and question. Given a new issue report, it predicts the category from the report's text content (title and body).

Dataset

Category	Training Set	Test Set
bug	361,239 (50.0%)	40,152 (49.9%)
enhancement	299,287 (41.4%)	33,290 (41.3%)
question	62,373 (8.6%)	7,076 (8.8%)

Data preprocessing

The training data was preprocessed with ekphrasis, extended with regular expressions that remove code, images, and URLs. See our GitHub code for details.
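
To give a rough idea of what the regular-expression step might look like, here is a minimal sketch that uses only the Python standard library. It does not call ekphrasis, and the patterns and the function name clean_issue_text are assumptions made for illustration; the expressions actually used are the ones in the GitHub code.

```python
import re

# Illustrative patterns only; they are NOT the expressions used to train the model.
FENCE = "`" * 3  # Markdown fence marker, built here to avoid a literal fence in the source
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)  # fenced code blocks
INLINE_CODE = re.compile(r"`[^`\n]+`")                       # inline code spans
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")               # Markdown images: ![alt](src)
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)      # HTML <img> tags
URL = re.compile(r"https?://\S+")                            # bare http(s) URLs

# Images are removed before URLs so that an image's source URL
# does not leave a dangling "![alt](" behind.
PATTERNS = (FENCED_CODE, INLINE_CODE, MD_IMAGE, HTML_IMAGE, URL)


def clean_issue_text(title: str, body: str) -> str:
    """Join title and body, strip code, images and URLs, then collapse whitespace."""
    text = f"{title} {body}"
    for pattern in PATTERNS:
        text = pattern.sub(" ", text)
    return " ".join(text.split())
```

Fenced blocks are stripped before inline spans so that the triple backticks of a fence are not mistaken for inline code delimiters.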

Metrics

The model is evaluated using the following metrics:

Accuracy
Precision
Recall
F1 Score (micro and macro average)
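
As a reference for how these metrics relate to each other, the sketch below computes them from scratch for single-label predictions. It is not the evaluation script used for the reported results; the function names evaluate and f1 and the toy labels in the usage example are illustrative assumptions.

```python
from collections import Counter

LABELS = ("bug", "enhancement", "question")


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def evaluate(y_true, y_pred, labels=LABELS):
    """Compute accuracy, per-class precision/recall/F1, and micro/macro F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    per_class = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        per_class[label] = {"precision": precision, "recall": recall,
                            "f1": f1(precision, recall)}

    # Micro average: pool the counts over all classes before computing the score.
    all_tp, all_fp, all_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = all_tp / (all_tp + all_fp) if all_tp + all_fp else 0.0
    micro_r = all_tp / (all_tp + all_fn) if all_tp + all_fn else 0.0

    # Macro average: unweighted mean of the per-class F1 scores.
    macro_f1 = sum(scores["f1"] for scores in per_class.values()) / len(labels)

    return {"accuracy": all_tp / len(y_true), "per_class": per_class,
            "micro_f1": f1(micro_p, micro_r), "macro_f1": macro_f1}


# Toy usage example with made-up labels.
report = evaluate(["bug", "bug", "enhancement", "question"],
                  ["bug", "enhancement", "enhancement", "bug"])
print(report["accuracy"], report["micro_f1"], report["macro_f1"])
```

When every issue receives exactly one label, micro-averaged F1 equals accuracy, whereas macro-averaged F1 weights the three classes equally and is therefore more sensitive to performance on the small question class.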

References

NLBSE22 Dataset

Cite our work

@inproceedings{Colavito-2022,
  title = {Issue Report Classification Using Pre-trained Language Models},
  booktitle = {2022 IEEE/ACM 1st International Workshop on Natural Language-Based Software Engineering (NLBSE)},
  author = {Colavito, Giuseppe and Lanubile, Filippo and Novielli, Nicole},
  year = {2022},
  month = may,
  pages = {29--32},
  doi = {10.1145/3528588.3528659},
  abstract = {This paper describes our participation in the tool competition organized in the scope of the 1st International Workshop on Natural Language-based Software Engineering. We propose a supervised approach relying on fine-tuned BERT-based language models for the automatic classification of GitHub issues. We experimented with different pre-trained models, achieving the best performance with fine-tuned RoBERTa (F1 = .8591).},
  keywords = {Issue classification, BERT, deep learning, labeling unstructured data, software maintenance and evolution},
}