dfm-encoder-small-v1
This model is a fine-tuned version of jonfd/electra-small-nordic on the dcc_v1.1.0 dataset. It achieves the following results on the evaluation set:
- Loss: 3.2486
- Accuracy: 0.4795
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 2048
- eval_batch_size: 512
- seed: 42
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 10000
- training_steps: 100000
- mixed_precision_training: Native AMP
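The learning-rate settings above can be illustrated with a small sketch. It assumes the `linear` scheduler means linear warmup from zero to the peak rate over the warmup steps, followed by linear decay to zero at the final training step. The function name `linear_schedule_lr` is hypothetical and not part of any library.

```python
def linear_schedule_lr(step, peak_lr=2e-4, warmup_steps=10_000, total_steps=100_000):
    """Learning rate at `step`, assuming linear warmup then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: ramp from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the assumed learning rate at a few points of the run.
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at step 10000 and is back at half the peak value by step 55000.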
Training results
Training Loss | Epoch | Step | Validation Loss | Accuracy |
---|---|---|---|---|
5.1291 | 0.01 | 2000 | 5.5301 | 0.2411 |
4.0835 | 0.04 | 4000 | 4.6197 | 0.3315 |
3.4693 | 0.06 | 6000 | 4.1647 | 0.3880 |
3.1649 | 0.08 | 8000 | 3.9395 | 0.4101 |
3.0402 | 0.1 | 10000 | 3.8325 | 0.4176 |
2.9657 | 0.12 | 12000 | 3.6722 | 0.4385 |
2.9122 | 0.14 | 14000 | 3.5847 | 0.4399 |
2.7984 | 0.16 | 16000 | 3.5808 | 0.4419 |
2.7991 | 0.18 | 18000 | 3.5451 | 0.4482 |
2.8262 | 0.2 | 20000 | 3.5354 | 0.4443 |
2.8117 | 0.22 | 22000 | 3.5062 | 0.4529 |
2.7851 | 0.24 | 24000 | 3.4272 | 0.4579 |
2.7227 | 0.26 | 26000 | 3.4070 | 0.4596 |
2.7706 | 0.28 | 28000 | 3.4115 | 0.4616 |
2.7068 | 0.3 | 30000 | 3.3926 | 0.4597 |
2.6644 | 0.32 | 32000 | 3.4268 | 0.4567 |
2.6947 | 0.34 | 34000 | 3.3313 | 0.4622 |
2.675 | 0.36 | 36000 | 3.3661 | 0.4643 |
2.6374 | 0.38 | 38000 | 3.3463 | 0.4690 |
2.6722 | 0.4 | 40000 | 3.3454 | 0.4681 |
2.6843 | 0.42 | 42000 | 3.3430 | 0.4692 |
2.627 | 0.44 | 44000 | 3.3475 | 0.4713 |
2.5831 | 0.46 | 46000 | 3.3950 | 0.4637 |
2.6288 | 0.48 | 48000 | 3.3960 | 0.4647 |
2.5997 | 0.5 | 50000 | 3.4021 | 0.4666 |
2.5768 | 0.52 | 52000 | 3.4098 | 0.4652 |
2.5849 | 0.54 | 54000 | 3.3236 | 0.4711 |
2.5815 | 0.56 | 56000 | 3.3184 | 0.4736 |
2.5897 | 0.58 | 58000 | 3.3207 | 0.4742 |
2.5732 | 0.6 | 60000 | 3.3409 | 0.4702 |
2.5536 | 0.62 | 62000 | 3.2904 | 0.4771 |
2.5415 | 0.64 | 64000 | 3.3119 | 0.4748 |
2.543 | 0.66 | 66000 | 3.2655 | 0.4755 |
2.592 | 0.68 | 68000 | 3.2643 | 0.4785 |
2.5889 | 0.7 | 70000 | 3.2711 | 0.4786 |
2.5682 | 0.72 | 72000 | 3.2358 | 0.4773 |
2.527 | 0.74 | 74000 | 3.2889 | 0.4740 |
2.6039 | 0.76 | 76000 | 3.2613 | 0.4752 |
2.5239 | 0.78 | 78000 | 3.2569 | 0.4766 |
2.541 | 0.8 | 80000 | 3.2693 | 0.4761 |
2.4988 | 0.82 | 82000 | 3.2483 | 0.4801 |
2.5037 | 0.84 | 84000 | 3.2605 | 0.4797 |
2.5344 | 0.86 | 86000 | 3.2567 | 0.4790 |
2.5301 | 0.88 | 88000 | 3.2293 | 0.4833 |
2.545 | 0.9 | 90000 | 3.2578 | 0.4832 |
2.51 | 0.92 | 92000 | 3.2581 | 0.4831 |
2.5223 | 0.94 | 94000 | 3.2688 | 0.4770 |
2.5111 | 0.96 | 96000 | 3.3057 | 0.4740 |
2.5356 | 0.98 | 98000 | 3.2644 | 0.4797 |
2.5541 | 1.0 | 100000 | 3.2751 | 0.4762 |
Framework versions
- Transformers 4.20.1
- Pytorch 1.11.0+cu102
- Datasets 2.5.3.dev0
- Tokenizers 0.12.1
Model Card
Following [1], the following constitutes a model card for this model.
Organization developing the Model: The Danish Foundation Models project
Model Creation Date: June 2022
Model Type: Transformer encoder model [2]; BERT [3] (further pre-trained from an ELECTRA model)
Feedback on the Model: For feedback on the model please use the community forum.
Training logs and performance metrics: Check out this Weights & Biases dashboard.
Intended Uses
Primary Intended Uses:
The primary intended use case of this model is the reproduction and validation of dataset quality. The intended use cases for future iterations of this model are the application in industry and research for Danish natural language tasks.
Primary Intended Users:
Future iterations of the model are intended for NLP practitioners dealing with Danish text documents.
Out-of-Scope Uses:
Use of the model for profiling in a way which is inconsiderate of the potential harm it might cause, such as racial profiling.
Factors
Card prompts - Relevant Factors:
Relevant factors include which language is used. Our model is trained on a Danish text corpus and is intended for comparing training data.
Card prompts - Evaluation Factors:
Future iterations of this model should include a validation of biases pertaining to gender, race, and religious and social groups.
Metrics
Performance Metrics:
Our model is evaluated on the following performance metrics:
- Pseudo perplexity, following [4], across eight distinct domains, including Danish dialects, books, legal, social media (Reddit, Twitter), spontaneous speech, news and Wikipedia.
- The Danish subsection of ScandEval [6].
To see the performance metrics, check out this Weights & Biases dashboard.
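As a rough illustration of the pseudo-perplexity metric described in [4], the sketch below computes it from per-token log-probabilities. It assumes each value is the log-probability the masked language model assigns to the true token when that token alone is masked; obtaining those values from the model is not shown. The function names and the toy numbers are hypothetical.

```python
import math


def pseudo_log_likelihood(token_log_probs):
    """Sum of log P(token | sentence with that token masked) for one sentence."""
    return sum(token_log_probs)


def pseudo_perplexity(corpus_token_log_probs):
    """exp of the negative mean per-token pseudo-log-likelihood over a corpus."""
    total_tokens = sum(len(sentence) for sentence in corpus_token_log_probs)
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    total_pll = sum(pseudo_log_likelihood(s) for s in corpus_token_log_probs)
    return math.exp(-total_pll / total_tokens)


# Toy corpus (made-up values): two sentences where every token gets probability 0.25.
toy_corpus = [[math.log(0.25)] * 3, [math.log(0.25)] * 2]
print(pseudo_perplexity(toy_corpus))
```

Lower values indicate that the model predicts masked tokens in that domain more confidently.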
Decision Threshold:
N/A
Approaches to Uncertainty and Variability:
Due to the cost of training, the model is pre-trained only once, but ScandEval fine-tunes it ten times to obtain a reasonable estimate of model performance.
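One simple way to summarize repeated fine-tuning runs is a mean with a standard error, sketched below. This is an assumption for illustration only: it is not necessarily how ScandEval aggregates its runs, the helper name `summarize_runs` is hypothetical, and the scores are made up.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, standard error) for scores from repeated fine-tuning runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Hypothetical scores from ten fine-tuning runs (not real results).
example_scores = [0.71, 0.73, 0.70, 0.72, 0.74, 0.69, 0.72, 0.71, 0.73, 0.70]
mean, stderr = summarize_runs(example_scores)
# 1.96 * stderr is a normal-approximation 95% interval half-width (an assumption).
print(f"{mean:.3f} +/- {1.96 * stderr:.3f}")
```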
Evaluation Data
Datasets:
ScandEval's Danish benchmark includes:
- Named entity recognition on DaNE [7,8].
- Part-of-speech tagging and dependency parsing on DDT [8].
- Sentiment classification on AngryTweets [9], TwitterSent [9], Europarl [9], LCC [10].
- Hate speech classification on DKHate [11].
Motivation:
The ScandEval benchmark is the most comprehensive benchmark for Danish. Pseudo perplexity was analysed to examine the model's ability to model certain language domains.
Training Data
For our training data, we sample from HopeTwitter, DaNews, DAGW and Netarkivet Text (NAT) with probabilities 0.10, 0.10, 0.10 and 0.70, respectively. For more information on the training and datasets, see the respective datasheets on the Danish Foundation Models GitHub page.
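The sampling probabilities can be illustrated with a minimal sketch that picks a source dataset for each training document. How the actual pipeline interleaves the datasets is not described here, so per-document weighted selection is an assumption, and the names `SOURCE_WEIGHTS` and `sample_sources` are hypothetical.

```python
import random
from collections import Counter

# Sampling probabilities per source, as stated in the text.
SOURCE_WEIGHTS = {"HopeTwitter": 0.10, "DaNews": 0.10, "DAGW": 0.10, "NAT": 0.70}


def sample_sources(n, seed=42):
    """Draw n source names according to SOURCE_WEIGHTS (illustrative seed)."""
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = list(SOURCE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=n)


# Empirical shares over many draws should approach the configured weights.
counts = Counter(sample_sources(10_000))
for name in SOURCE_WEIGHTS:
    print(name, counts[name] / 10_000)
```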
Pre-processing:
Input documents are tokenized using the tokenizer of the Danish BERT by BotXO [12], which uses a BPE with a vocabulary size of ~30,000 and NFKC normalization.
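To give a feel for what NFKC normalization does before tokenization, the sketch below applies it with Python's standard library. It only illustrates the normalization form itself; it is not the tokenizer's own implementation, the BPE step is not shown, and the helper name `nfkc` is hypothetical.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)


examples = [
    "\ufb01lm",           # "fi" ligature followed by "lm"
    "\uff2b\u00f8benhavn",  # full-width "K" followed by "øbenhavn"
    "a\u030a",            # "a" plus combining ring above
]
for raw in examples:
    print(repr(raw), "->", repr(nfkc(raw)))
```

Compatibility characters such as ligatures and full-width letters are mapped to their plain forms, while Danish letters such as ø and å are kept as single composed characters.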
Ethical Considerations
Data: The data is sourced from News, DAGW, Twitter, and Netarkivet Text (NAT) and might thus contain hate speech, sexually explicit content and otherwise harmful content.
Mitigations: We considered removing sexually explicit content by filtering web domains using a DNS or using Google SafeSearch. However, on examining the filtered domains, these were also found to include news media pertaining to a specific demographic (e.g. Dagens.dk) and educational sites pertaining to sexual education. We also examined the use of word-based filters, but found that they might affect certain demographic groups disproportionately.
Risk and Harms: As Netarkivet Text covers such a wide array of the Danish internet, it undoubtedly contains personal information. To avoid model memorization of this information, we have deduplicated the data.
References:
- [1] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596
- [2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv:1706.03762 [Cs]. http://arxiv.org/abs/1706.03762
- [3] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805
- [4] Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020). Masked Language Model Scoring. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2699–2712. https://doi.org/10.18653/v1/2020.acl-main.240
- [6] Nielsen, D. S. (2021). ScandEval: Evaluation of language models on mono- or multilingual Scandinavian language tasks. GitHub. https://github.com/saattrupdan/ScandEval
- [7] Hvingelby, R., Pauli, A. B., Barrett, M., Rosted, C., Lidegaard, L. M., & Søgaard, A. (2020). DaNE: A named entity resource for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 4597–4604.
- [8] Kromann, M. T. (2003). The Danish Dependency Treebank and the DTAG Treebank Tool. https://research.cbs.dk/en/publications/the-danish-dependency-treebank-and-the-dtag-treebank-tool
- [9] Alexandrainst/danlp. (2022). Alexandra Institute. https://github.com/alexandrainst/danlp/blob/a1e9fa70fc5a3ae7ff78877062da3a8a8da80758/docs/docs/datasets.md (Original work published 2019)
- [10] Nielsen, F. Å. (2022). Lcc-sentiment. https://github.com/fnielsen/lcc-sentiment (Original work published 2016)
- [11] Sigurbergsson, G. I., & Derczynski, L. (2020). Offensive Language and Hate Speech Detection for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 3498–3508. https://aclanthology.org/2020.lrec-1.430
- [12] Møllerhøj, J. D. (2019, December 5). Danish BERT model: BotXO has trained the most advanced BERT model. BotXO. https://www.botxo.ai/blog/danish-bert-model/