mt5_correct_puntuation
This model is google/mt5-base fine-tuned on a Chinese Wikipedia corpus to correct Chinese (Mandarin) punctuation. Its current accuracy is 0.794.
Datasets
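The model was fine-tuned on publicly available Chinese Wikipedia data. The text was split at 「。」 or 「,」 into sentences of no more than 100 characters. Because commas and periods are overwhelmingly more frequent than other marks, only sentences containing a colon, semicolon, exclamation mark, or question mark were kept as correct sentences, to balance the dataset as far as possible. Incorrect sentences were produced by randomly replacing the marks 「,。:;、!?」 in each correct sentence with marks drawn from the same set 「,。:;、!?」. The training set contains 291,112 sentences in total.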
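The preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function names, the greedy merging of pieces up to the 100-character limit, the replacement probability `rate`, the choice to always substitute a different mark, and the use of the full-width comma 「,」 are all assumptions made for the example.

```python
import random
import re

MAX_LEN = 100                 # the card's 100-character limit
ALL_MARKS = ",。:;、!?"   # marks that may be corrupted (full-width comma assumed)
RARE_MARKS = ":;!?"       # a sentence must contain one of these to be kept


def split_sentences(text, max_len=MAX_LEN):
    """Cut after each 。 or , and greedily merge pieces up to max_len characters."""
    pieces = [p for p in re.split(r"(?<=[。,])", text) if p]
    sentences, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_len:
            sentences.append(current)
            current = ""
        current += piece
    if current:
        sentences.append(current)
    # Drop any single piece that is itself longer than the limit.
    return [s for s in sentences if len(s) <= max_len]


def is_kept(sentence):
    """Keep only sentences with a colon, semicolon, exclamation or question mark."""
    return any(mark in sentence for mark in RARE_MARKS)


def corrupt(sentence, rng, rate=0.5):
    """Replace each mark, with probability `rate` (hypothetical), by a different mark."""
    out = []
    for ch in sentence:
        if ch in ALL_MARKS and rng.random() < rate:
            ch = rng.choice([m for m in ALL_MARKS if m != ch])
        out.append(ch)
    return "".join(out)


def build_pairs(text, seed=42):
    """Return (incorrect, correct) sentence pairs built from raw text."""
    rng = random.Random(seed)
    correct = [s for s in split_sentences(text) if is_kept(s)]
    return [(corrupt(s, rng), s) for s in correct]
```

Each pair gives the corrupted sentence as model input and the original sentence as the target; only punctuation positions differ between the two.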
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1
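To make the linear schedule concrete, the sketch below computes the number of optimizer steps and the learning rate at a given step. It assumes things the card does not state: that all 291,112 sentences are in the training split, that training ran on a single device without gradient accumulation, and that no warmup was used. The helper names are hypothetical.

```python
import math

NUM_SENTENCES = 291_112  # training sentences reported in the card
BATCH_SIZE = 4           # train_batch_size
NUM_EPOCHS = 1           # num_epochs
BASE_LR = 5e-5           # learning_rate


def total_steps(n=NUM_SENTENCES, batch=BATCH_SIZE, epochs=NUM_EPOCHS):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n / batch) * epochs


def linear_lr(step, total, base_lr=BASE_LR):
    """Linear decay from base_lr to 0 over `total` steps, assuming zero warmup."""
    return base_lr * max(0.0, (total - step) / total)


steps = total_steps()
print(steps)                         # number of optimizer steps in the single epoch
print(linear_lr(0, steps))           # learning rate at the start
print(linear_lr(steps // 2, steps))  # learning rate halfway through
```

Under these assumptions the learning rate starts at 5e-05 and falls in a straight line to zero by the end of the single epoch.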
Framework versions
- Transformers 4.20.1
- PyTorch 1.12.0+cu113
- Datasets 2.3.2
- Tokenizers 0.12.1