DeBERTa 330M Continued Pretraining (CP) on Traditional Chinese
Dataset
- LT-TW5:
- NUDB data + data group g + RedPajama GitHub + RedPajama StackExchange (5 GB)
Training Methods
This project builds on the https://huggingface.co/microsoft/deberta-v3-small model. To adapt DeBERTa to Traditional Chinese text, we used the following training methods:
Token Expansion:
- Token Addition: To enrich the model's vocabulary, we added previously unseen tokens to the tokenizer and extended the model's embedding matrix accordingly (see the sketch after this list).
- Full Model Refinement: We then ran full Masked Language Modeling (MLM) training, updating the entire model, both the embeddings and the transformer layers.
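As a rough illustration of the token-addition step, the sketch below uses the Hugging Face transformers API; the new_tokens list is purely illustrative and not the actual vocabulary we added.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Base checkpoint as stated above; new_tokens is a placeholder list --
# in practice it would come from a Traditional Chinese vocabulary analysis.
base = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

new_tokens = ["臺灣", "繁體", "中文"]  # illustrative examples only
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new token ids get trainable vectors;
# existing embeddings are kept, new rows are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```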
SpanBERT-style Token Masking:
We adopted the SpanBERT approach to masking during training: contiguous spans of tokens are masked rather than individual subwords. Prior work suggests this technique can outperform Whole Word Masking (WWM) in some scenarios.
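For reference, a simplified position-level sketch of span masking is shown below; the geometric parameter p = 0.2, the 10-token span cap, and the 15% masking budget follow the SpanBERT paper's defaults and are assumptions about our exact settings.

```python
import random

def span_mask_positions(seq_len, mask_ratio=0.15, p=0.2, max_span=10):
    """Return positions to mask as contiguous spans (SpanBERT-style sketch).

    Span lengths are drawn from a geometric distribution clipped at
    max_span, and spans are sampled until roughly mask_ratio of the
    sequence is covered. Defaults follow the SpanBERT paper; the exact
    training settings here may differ.
    """
    budget = max(1, int(seq_len * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Sample span length L ~ Geometric(p), clipped to max_span.
        length = 1
        while random.random() > p and length < max_span:
            length += 1
        length = min(length, seq_len)
        start = random.randrange(0, seq_len - length + 1)
        masked.update(range(start, start + length))
    return masked

# Example: which positions of a 32-token sequence get masked.
print(sorted(span_mask_positions(32)))
```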
Implementation Details
For training, we used the following configuration, sketched in code after the list:
- Learning Rate: Set at 1e-4 for both stages of training.
- Maximum Sequence Length: Capped at 512 tokens.
- Batch Size: 8 × 4 × 32 (1,024 sequences in total)
- Warmup Steps: Incorporated 500 warmup steps into the training schedule.
- Hardware: 32 V100 GPUs with mixed-precision (fp16) training for efficiency.
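A minimal sketch of this configuration with Hugging Face TrainingArguments follows, assuming the batch size factors as 8 per-device × 4 gradient-accumulation steps × 32 GPUs (launched separately, e.g. via torchrun); the output directory is a placeholder.

```python
from transformers import TrainingArguments

# Assumed split of the 8 x 4 x 32 batch size: 8 = per-device batch,
# 4 = gradient accumulation steps, 32 = number of V100 GPUs.
training_args = TrainingArguments(
    output_dir="deberta-v2-320m-tc-mlm",   # placeholder path
    learning_rate=1e-4,                    # same LR for both stages
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    warmup_steps=500,
    fp16=True,                             # mixed-precision training
)
# The 512-token cap is enforced at tokenization time, e.g.
# tokenizer(text, truncation=True, max_length=512).
```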
Performance
DRCD
- Erlangshen-DeBERTa-v2-320M-Chinese: Dev 0.733 / Test 0.735
- TLLM/DeBERTa-v2-320M-TC (ours): Dev 0.804 / Test 0.800
- TLLM/deberta-xsmall-22m: Dev 0.685 / Test 0.675