Punctuator for Uncased English
This model is a fine-tuned DistilBertForTokenClassification that restores punctuation in plain, uncased English text.
Usage
from transformers import DistilBertForTokenClassification, DistilBertTokenizerFast
model = DistilBertForTokenClassification.from_pretrained("Qishuai/distilbert_punctuator_en")
tokenizer = DistilBertTokenizerFast.from_pretrained("Qishuai/distilbert_punctuator_en")
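Once the model has produced one label per word, the labels still have to be turned back into punctuated text. The following is a minimal sketch of that last step, not code from the model repository. The helper name `apply_punctuation` is hypothetical, the label names are taken from the metrics reports in this card, and the `"O"` label for "no punctuation" is an assumption. It also assumes the predictions have already been aligned to whole words rather than word pieces.

```python
# Hypothetical helper, not part of the model repository.
# Assumed label scheme: "O" means "no punctuation after this word";
# the other names follow the classes listed in the metrics reports.
PUNCTUATION = {
    "COMMA": ",",
    "PERIOD": ".",
    "QUESTIONMARK": "?",
    "EXLAMATIONMARK": "!",  # spelled as in the metrics reports
}


def apply_punctuation(words, labels):
    """Append the predicted mark (if any) to each word and join the result."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    pieces = []
    for word, label in zip(words, labels):
        # Labels without a mapping, including the assumed "O", add nothing.
        pieces.append(word + PUNCTUATION.get(label, ""))
    return " ".join(pieces)


words = "hello how are you".split()
labels = ["COMMA", "O", "O", "QUESTIONMARK"]
print(apply_punctuation(words, labels))  # hello, how are you?
```

Keeping this step separate from inference makes it easy to check the label mapping on its own before wiring it to real model output.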
Model Overview
Training data
A combination of the following three datasets:
- BBC news: articles from the BBC news website covering stories in five topical areas from 2004-2005. Reference
- News articles: 20,000 samples of short news articles scraped from Hindu, Indian Times and Guardian between Feb 2017 and Aug 2017. Reference
- TED talks: transcripts of over 4,000 TED talks given between 2004 and 2019. Reference
Model Performance
- Validation with 500 samples of a dataset scraped from the https://www.thenews.com.pk website. Reference

  Metrics Report:

                      precision    recall  f1-score   support

               COMMA       0.66      0.55      0.60      7064
      EXLAMATIONMARK       1.00      0.00      0.00         5
              PERIOD       0.73      0.63      0.68      6573
        QUESTIONMARK       0.54      0.41      0.47        17

           micro avg       0.69      0.59      0.64     13659
           macro avg       0.73      0.40      0.44     13659
        weighted avg       0.69      0.59      0.64     13659

- Validation with 86 new TED talks from 2020, which are not included in the training dataset. Reference

  Metrics Report:

                      precision    recall  f1-score   support

               COMMA       0.71      0.56      0.63     10712
      EXLAMATIONMARK       0.45      0.07      0.12        75
              PERIOD       0.75      0.65      0.70      7921
        QUESTIONMARK       0.73      0.67      0.70       827

           micro avg       0.73      0.60      0.66     19535
           macro avg       0.66      0.49      0.53     19535
        weighted avg       0.73      0.60      0.66     19535
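
The macro and weighted averages in these reports differ sharply because the classes are very unbalanced. The sketch below recomputes both f1 averages from the per-class figures of the first report (the thenews.com.pk validation). It assumes the usual convention, as in scikit-learn's `classification_report`: the macro average is the unweighted mean over classes, and the weighted average is the mean weighted by support. The card does not say which tool produced the reports, so treat this as an illustration. Recomputing from the rounded per-class values can also differ from a reported average in the last digit.

```python
# Per-class (f1-score, support) copied from the first metrics report.
per_class = {
    "COMMA": (0.60, 7064),
    "EXLAMATIONMARK": (0.00, 5),
    "PERIOD": (0.68, 6573),
    "QUESTIONMARK": (0.47, 17),
}


def macro_average(scores):
    """Unweighted mean: every class counts equally, however rare it is."""
    values = [score for score, _ in scores.values()]
    return sum(values) / len(values)


def weighted_average(scores):
    """Mean weighted by support: frequent classes dominate the result."""
    total_support = sum(support for _, support in scores.values())
    weighted_sum = sum(score * support for score, support in scores.values())
    return weighted_sum / total_support


macro_f1 = macro_average(per_class)
weighted_f1 = weighted_average(per_class)
print(f"macro f1:    {macro_f1:.2f}")     # 0.44, as in the report
print(f"weighted f1: {weighted_f1:.2f}")  # 0.64, as in the report
```

The EXLAMATIONMARK class has only 5 samples and an f1-score of 0.00, which pulls the macro average down to 0.44 while barely affecting the weighted average of 0.64.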