GeBERTa

GeBERTa is a set of German DeBERTa models developed in a joint effort between the University of Florida, NVIDIA, and IKIM. The models range in size from 122M to 750M parameters.

Model details

The models follow the DeBERTa-v2 architecture and use SentencePiece tokenizers. The base and large models use a 50k token vocabulary, while the xlarge model uses a 128k token vocabulary. All models were trained with a batch size of 2k for a maximum of 1 million steps and have a maximum sequence length of 512 tokens.
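Because the maximum sequence length is 512 tokens, longer documents have to be truncated or split before they reach the model. The sketch below shows one common approach, overlapping windows over a list of token IDs. It is not taken from the GeBERTa code: the function name `chunk_token_ids`, the overlap of 64 tokens, and the assumption that two positions per sequence are reserved for special tokens are all illustrative.

```python
from typing import List

MAX_SEQ_LEN = 512  # maximum sequence length stated in the model details


def chunk_token_ids(
    token_ids: List[int],
    max_len: int = MAX_SEQ_LEN,
    num_special: int = 2,  # assumption: two special tokens are added per sequence
    overlap: int = 64,     # assumption: illustrative overlap between windows
) -> List[List[int]]:
    """Split token IDs into overlapping windows that fit the length limit."""
    window = max_len - num_special
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < max_len - num_special")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the current window already reaches the end of the input
    return chunks


if __name__ == "__main__":
    chunks = chunk_token_ids(list(range(1000)))
    print([len(c) for c in chunks])
```

Each window leaves room for the special tokens, so the final input stays within 512 positions once they are added.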

Dataset

The pre-training dataset consists of documents from different domains:

| Domain | Dataset | Data Size | #Docs | #Tokens |
| --- | --- | --- | --- | --- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Smaller public datasets | 253MB | 179,776 | 50M |
| Medical | CC medical texts | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts (translated) | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III (translated) | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS (translated) | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books (translated) | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
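To see how the token budget is distributed across domains, the rows above can be aggregated with a few lines of Python. This is only a reading aid: the token counts are transcribed from the table (in millions, already rounded there), and the names `CORPUS` and `tokens_by_domain` are illustrative.

```python
from collections import defaultdict

# (domain, dataset, tokens in millions), transcribed from the table above.
CORPUS = [
    ("Formal", "Wikipedia", 1900),
    ("Formal", "News", 6100),
    ("Formal", "GC4", 19400),
    ("Informal", "Reddit 2019-2023 (GER)", 1300),
    ("Informal", "Holiday Reviews", 428),
    ("Legal", "OpenLegalData: German cases and laws", 1000),
    ("Medical", "Smaller public datasets", 50),
    ("Medical", "CC medical texts", 682),
    ("Medical", "Medicine Dissertations", 295),
    ("Medical", "Pubmed abstracts (translated)", 1700),
    ("Medical", "MIMIC III (translated)", 695),
    ("Medical", "PMC-Patients-ReCDS (translated)", 414),
    ("Literature", "German Fiction", 243),
    ("Literature", "English books (translated)", 1600),
]


def tokens_by_domain(rows):
    """Sum the token counts (in millions) for each domain."""
    totals = defaultdict(int)
    for domain, _dataset, tokens_m in rows:
        totals[domain] += tokens_m
    return dict(totals)


if __name__ == "__main__":
    totals = tokens_by_domain(CORPUS)
    grand_total = sum(totals.values())
    for domain, tokens_m in sorted(totals.items(), key=lambda kv: -kv[1]):
        share = 100 * tokens_m / grand_total
        print(f"{domain:<10} {tokens_m / 1000:6.2f}B  {share:5.1f}%")
```

Because the per-dataset counts are rounded, the sums are approximate. They add up to roughly 35.8B tokens, in line with the Total row.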

Benchmark

In a comprehensive benchmark, we evaluated existing German models and our own. The benchmark included a variety of task types, such as question answering, classification, and named entity recognition (NER). In addition, we introduced a new task focused on hate speech detection using two existing datasets. When the datasets provided training, development, and test sets, we used them accordingly.

Where no such sets were available, we randomly split the data into 80% for training, 10% for validation, and 10% for testing. The following table presents the F1 scores:

| Model | GE14 | GQuAD | GE18 | TS | GGP | GRAS<sup>1</sup> | JS | DROC | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GBERT<sub>large</sub> | 88.48±0.23 | 81.51±0.84 | 54.37±1.65 | 73.60±0.61 | 79.17±0.14 | 69.28±0.80 | 76.32±4.42 | 90.29±0.15 | 76.63±0.63 |
| GELECTRA<sub>large</sub> | 88.39±0.13 | 80.51±0.41 | 55.41±1.54 | 73.84±0.86 | 79.09±0.09 | 70.16±0.92 | 73.73±2.35 | 89.83±0.27 | 76.37±0.69 |
| GeBERTa<sub>large</sub> | 88.84±0.18 | 82.52±0.59 | 53.76±1.86 | 75.32±0.53 | 78.35±0.08 | 70.02±1.34 | 82.16±2.36 | 90.39±0.24 | 77.67±0.69 |
| GeBERTa<sub>xlarge</sub> | 89.04±0.26 | 85.05±0.63 | 55.80±1.42 | 76.25±0.704 | 76.71±0.08 | 67.92±1.00 | 82.42±4.70 | 90.63±0.21 | 77.98±0.62 |

<sup>1</sup>GRAS is not yet published but is described in the MedBERT.de paper.
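Two small computations sit behind the benchmark described above: the random 80/10/10 split used for datasets without predefined sets, and the Avg column, which appears to be the unweighted mean of the eight task scores. The latter is an inference from the numbers in the table, not something the text states. The following Python sketch illustrates both. The function names and the fixed seed are assumptions made for illustration, not details of the actual benchmark code.

```python
import random
from decimal import Decimal, ROUND_HALF_UP
from typing import List, Sequence, Tuple


def split_80_10_10(examples: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle once, then cut into 80% train, 10% validation, 10% test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # the seed is an illustrative assumption
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no example is dropped
    return train, val, test


def average_score(task_scores: Sequence[str]) -> Decimal:
    """Unweighted mean of per-task scores, rounded half-up to two decimals."""
    total = sum(Decimal(s) for s in task_scores)
    return (total / len(task_scores)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Mean F1 per task, copied from the table (the ± terms are omitted).
GBERT_LARGE = ["88.48", "81.51", "54.37", "73.60", "79.17", "69.28", "76.32", "90.29"]
GEBERTA_XLARGE = ["89.04", "85.05", "55.80", "76.25", "76.71", "67.92", "82.42", "90.63"]

if __name__ == "__main__":
    train, val, test = split_80_10_10(range(1000))
    print(len(train), len(val), len(test))  # 800 100 100
    print(average_score(GBERT_LARGE))       # 76.63
    print(average_score(GEBERTA_XLARGE))    # 77.98
```

Scores are passed as strings and summed with `Decimal` so that rounding to two places is not affected by binary floating-point representation.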

Publication

@misc{dada2023impact,
      title={On the Impact of Cross-Domain Data on German Language Models}, 
      author={Amin Dada and Aokun Chen and Cheng Peng and Kaleb E Smith and Ahmad Idrissi-Yaghir and Constantin Marc Seibold and Jianning Li and Lars Heiliger and Xi Yang and Christoph M. Friedrich and Daniel Truhn and Jan Egger and Jiang Bian and Jens Kleesiek and Yonghui Wu},
      year={2023},
      eprint={2310.07321},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact

amin.dada@uk-essen.de