dfm-encoder-small-v1

This model is a fine-tuned version of jonfd/electra-small-nordic on the dcc_v1.1.0 dataset. It achieves the following results on the evaluation set:

Loss: 3.2486
Accuracy: 0.4795

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0002
train_batch_size: 2048
eval_batch_size: 512
seed: 42
optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-06
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 10000
training_steps: 100000
mixed_precision_training: Native AMP
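
As a rough illustration of how the learning rate evolves under the settings above, the following sketch computes the rate at a given step. It assumes the linear scheduler warms up linearly from zero over the warmup steps and then decays linearly to zero at the final training step; the function name `linear_schedule_lr` is made up for this example and is not part of the training code.

```python
def linear_schedule_lr(step: int,
                       base_lr: float = 2e-4,
                       warmup_steps: int = 10_000,
                       total_steps: int = 100_000) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero.

    Defaults are taken from the hyperparameter list; the decay-to-zero
    shape is an assumption about the scheduler, not a documented fact.
    """
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 55_000, 100_000):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the rate peaks at 0.0002 at step 10000 and is back to half of that at step 55000.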

Training results

Training Loss	Epoch	Step	Validation Loss	Accuracy
5.1291	0.02	2000	5.5301	0.2411
4.0835	0.04	4000	4.6197	0.3315
3.4693	0.06	6000	4.1647	0.3880
3.1649	0.08	8000	3.9395	0.4101
3.0402	0.1	10000	3.8325	0.4176
2.9657	0.12	12000	3.6722	0.4385
2.9122	0.14	14000	3.5847	0.4399
2.7984	0.16	16000	3.5808	0.4419
2.7991	0.18	18000	3.5451	0.4482
2.8262	0.2	20000	3.5354	0.4443
2.8117	0.22	22000	3.5062	0.4529
2.7851	0.24	24000	3.4272	0.4579
2.7227	0.26	26000	3.4070	0.4596
2.7706	0.28	28000	3.4115	0.4616
2.7068	0.3	30000	3.3926	0.4597
2.6644	0.32	32000	3.4268	0.4567
2.6947	0.34	34000	3.3313	0.4622
2.675	0.36	36000	3.3661	0.4643
2.6374	0.38	38000	3.3463	0.4690
2.6722	0.4	40000	3.3454	0.4681
2.6843	0.42	42000	3.3430	0.4692
2.627	0.44	44000	3.3475	0.4713
2.5831	0.46	46000	3.3950	0.4637
2.6288	0.48	48000	3.3960	0.4647
2.5997	0.5	50000	3.4021	0.4666
2.5768	0.52	52000	3.4098	0.4652
2.5849	0.54	54000	3.3236	0.4711
2.5815	0.56	56000	3.3184	0.4736
2.5897	0.58	58000	3.3207	0.4742
2.5732	0.6	60000	3.3409	0.4702
2.5536	0.62	62000	3.2904	0.4771
2.5415	0.64	64000	3.3119	0.4748
2.543	0.66	66000	3.2655	0.4755
2.592	0.68	68000	3.2643	0.4785
2.5889	0.7	70000	3.2711	0.4786
2.5682	0.72	72000	3.2358	0.4773
2.527	0.74	74000	3.2889	0.4740
2.6039	0.76	76000	3.2613	0.4752
2.5239	0.78	78000	3.2569	0.4766
2.541	0.8	80000	3.2693	0.4761
2.4988	0.82	82000	3.2483	0.4801
2.5037	0.84	84000	3.2605	0.4797
2.5344	0.86	86000	3.2567	0.4790
2.5301	0.88	88000	3.2293	0.4833
2.545	0.9	90000	3.2578	0.4832
2.51	0.92	92000	3.2581	0.4831
2.5223	0.94	94000	3.2688	0.4770
2.5111	0.96	96000	3.3057	0.4740
2.5356	0.98	98000	3.2644	0.4797
2.5541	1.0	100000	3.2751	0.4762

Framework versions

Transformers 4.20.1
Pytorch 1.11.0+cu102
Datasets 2.5.3.dev0
Tokenizers 0.12.1

Model Card

Following [1], the following constitutes a model card for this model.

Organization developing the Model: The Danish Foundation Models project

Model Creation Date: June 2022

Model Type: Transformer encoder model [2]; BERT [3] (further pre-trained from an ELECTRA model)

Feedback on the Model: For feedback on the model please use the community forum.

Training logs and performance metrics: Check out this Weights & Biases dashboard.

Intended Uses

Primary Intended Uses:

The primary intended use case of this model is the reproduction and validation of dataset quality. Future iterations of this model are intended for use in industry and research on Danish natural language tasks.

Primary Intended Users:

Future iterations of the model are intended for NLP practitioners dealing with Danish text documents.

Out-of-Scope Uses:

Use of the model for profiling in a way which is inconsiderate of the potential harm it might cause, such as racial profiling.

Factors

Card prompts - Relevant Factors:

Relevant factors include which language is used. Our model is trained on a Danish text corpus and is intended for comparing training data.

Card prompts - Evaluation Factors:

Future iterations of this model should include a validation of biases pertaining to gender, race, and religious and social groups.

Metrics

Performance Metrics:

Our model is evaluated on the following performance metrics:

Pseudo-perplexity, following [4], across eight distinct domains: Danish dialects, books, legal, social media (Reddit, Twitter), spontaneous speech, news and Wikipedia.
The Danish subsection of ScandEval [6].

To see the performance metrics, check out this Weights & Biases dashboard.
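
The pseudo-perplexity metric listed above can be sketched as follows. In the formulation of [4], each token is masked in turn, the model's log-probability of the true token is recorded, and the corpus-level score is the exponential of the negative mean of those log-probabilities. The scorer below is a toy stand-in (a uniform distribution over a hypothetical four-word vocabulary), not the model itself; in practice it would be replaced by a call to a masked language model.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer signature: given a token sequence and a position,
# return log P(token at position | all other tokens), e.g. obtained from
# a masked LM with that position replaced by the mask token.
MaskedLogProb = Callable[[Sequence[str], int], float]


def pseudo_log_likelihood(tokens: Sequence[str],
                          masked_log_prob: MaskedLogProb) -> float:
    """Sum of per-token conditional log-probabilities for one sentence."""
    return sum(masked_log_prob(tokens, i) for i in range(len(tokens)))


def pseudo_perplexity(corpus: Sequence[Sequence[str]],
                      masked_log_prob: MaskedLogProb) -> float:
    """exp(-(1/N) * sum of pseudo-log-likelihoods), N = total token count."""
    total_pll = sum(pseudo_log_likelihood(s, masked_log_prob) for s in corpus)
    n_tokens = sum(len(s) for s in corpus)
    return math.exp(-total_pll / n_tokens)


def uniform_scorer(tokens: Sequence[str], position: int,
                   vocab_size: int = 4) -> float:
    """Toy scorer: every token is equally likely (illustration only)."""
    return -math.log(vocab_size)


if __name__ == "__main__":
    toy_corpus = [["jeg", "bor", "i"], ["aarhus"]]
    print(pseudo_perplexity(toy_corpus, uniform_scorer))
```

With the uniform toy scorer the pseudo-perplexity equals the vocabulary size, which is a convenient sanity check before plugging in a real model.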

Decision Threshold:

N/A

Approaches to Uncertainty and Variability:

Due to the cost of training, the model is pre-trained only once, but ScandEval fine-tunes it ten times to obtain a reasonable estimate of model performance.
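
One simple way to summarise repeated fine-tuning runs is a mean with an approximate 95% interval. The sketch below uses the standard error and a normal approximation; this is an illustrative choice and not necessarily how ScandEval aggregates its runs. The scores in the usage example are invented placeholders, not results for this model.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarise_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of an approximate 95% interval).

    Uses the sample standard deviation and a normal approximation
    (1.96 * standard error); an assumption made for illustration.
    """
    mean = statistics.mean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, 1.96 * stderr


if __name__ == "__main__":
    # Placeholder scores for ten hypothetical fine-tuning runs.
    placeholder_scores = [0.80, 0.82, 0.78, 0.80, 0.81,
                          0.79, 0.80, 0.83, 0.77, 0.80]
    mean, half_width = summarise_runs(placeholder_scores)
    print(f"{mean:.3f} +/- {half_width:.3f}")
```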

Evaluation Data

Datasets:

ScandEval's Danish benchmark includes:

Named entity recognition on DaNE [7,8].
Part-of-speech tagging and dependency parsing on DDT [8].
Sentiment classification on AngryTweets [9], TwitterSent [9], Europarl [9] and LCC [10].
Hate speech classification on DKHate [11].

Motivation:

The ScandEval benchmark is the most comprehensive benchmark for Danish. Pseudo perplexity was analysed to examine the model's ability to model certain language domains.

Training Data

For our training data, we sample from HopeTwitter, DaNews, DAGW and Netarkivet Text (NAT) with probabilities 0.10, 0.10, 0.10 and 0.70, respectively. For more information on the training and datasets, see the respective datasheets on the Danish Foundation Models GitHub page.
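
The sampling scheme described above can be sketched as a weighted draw over the four sources. This is a minimal illustration of the stated probabilities only; the names `SOURCES`, `PROBABILITIES` and `sample_sources` are made up here, and the actual data-loading code may interleave the datasets differently.

```python
import random
from collections import Counter
from typing import List

# Sources and sampling probabilities as stated in the text.
SOURCES = ["HopeTwitter", "DaNews", "DAGW", "NAT"]
PROBABILITIES = [0.10, 0.10, 0.10, 0.70]


def sample_sources(n: int, seed: int = 42) -> List[str]:
    """Draw n source names, each chosen independently with the given weights."""
    rng = random.Random(seed)
    return rng.choices(SOURCES, weights=PROBABILITIES, k=n)


if __name__ == "__main__":
    draws = sample_sources(20_000)
    for name, count in Counter(draws).most_common():
        print(f"{name}: {count / len(draws):.3f}")
```

Over many draws, roughly 70% of the sampled documents come from NAT and about 10% from each of the other three sources.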

Pre-processing:

Input documents are tokenized using the tokenizer of the Danish BERT by BotXO [12], which uses a BPE with a vocabulary size of ~30,000 and NFKC normalization.
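
To illustrate what NFKC normalization does to input text, the sketch below uses Python's standard `unicodedata` module. It shows only the normalization step, not the tokenizer; the helper name `nfkc` is made up for this example.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization (compatibility decomposition,
    followed by canonical composition)."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # The "fi" ligature (U+FB01) is expanded to the two letters "f" and "i".
    print(nfkc("\ufb01lm"))
    # "a" followed by a combining ring (U+030A) is composed into "å" (U+00E5).
    print(nfkc("a\u030a"))
    # A fullwidth "K" (U+FF2B) is mapped to the ASCII letter "K".
    print(nfkc("\uff2bøbenhavn"))
```

Normalizing in this way means visually equivalent strings map to the same character sequence before they reach the tokenizer.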

Ethical Considerations

Data: The data is sourced from news, DAGW, Twitter, and Netarkivet Text (NAT) and might thus contain hate speech, sexually explicit content and otherwise harmful content.

Mitigations: We considered removing sexually explicit content by filtering web domains using a DNS-based filter or Google SafeSearch. However, on examining the filtered domains, we found that they also included news media pertaining to a specific demographic (e.g. Dagens.dk) and educational sites pertaining to sexual education. We also examined the use of word-based filters, but found that they might affect certain demographic groups disproportionately.

Risk and Harms: As Netarkivet Text covers such a wide array of the Danish internet, it undoubtedly contains personal information. To limit model memorization of this information, we have deduplicated the data.
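
The text does not specify how the deduplication was carried out. As a minimal sketch of the general idea, the example below removes exact duplicates after collapsing whitespace and lowercasing; both the normalization and the function name `deduplicate` are assumptions made for illustration, and the actual pipeline may use a different (for example near-duplicate) method.

```python
import hashlib
from typing import Iterable, List


def deduplicate(documents: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing documents
    after whitespace collapsing and lowercasing (illustrative choice)."""
    seen = set()
    unique = []
    for doc in documents:
        normalised = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    docs = ["Hej  verden", "hej verden", "Farvel"]
    print(deduplicate(docs))
```

Storing hashes rather than full texts keeps memory use modest on large corpora, at the cost of only catching documents that are identical after normalization.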

References:

[1] Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596
[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. ArXiv:1706.03762 [Cs]. http://arxiv.org/abs/1706.03762
[3] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [Cs]. http://arxiv.org/abs/1810.04805
[4] Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020). Masked Language Model Scoring. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2699–2712. https://doi.org/10.18653/v1/2020.acl-main.240
[6] Nielsen, D. S. (2021). ScandEval: Evaluation of language models on mono- or multilingual Scandinavian language tasks. GitHub. https://github.com/saattrupdan/ScandEval
[7] Hvingelby, R., Pauli, A. B., Barrett, M., Rosted, C., Lidegaard, L. M., & Søgaard, A. (2020). DaNE: A named entity resource for danish. Proceedings of the 12th Language Resources and Evaluation Conference, 4597–4604.
[8] Kromann, M. T. (2003). The Danish Dependency Treebank and the DTAG Treebank Tool. https://research.cbs.dk/en/publications/the-danish-dependency-treebank-and-the-dtag-treebank-tool
[9] Alexandrainst/danlp. (2022). Alexandra Institute. https://github.com/alexandrainst/danlp/blob/a1e9fa70fc5a3ae7ff78877062da3a8a8da80758/docs/docs/datasets.md (Original work published 2019)
[10] Nielsen, F. Å. (2022). Lcc-sentiment. https://github.com/fnielsen/lcc-sentiment (Original work published 2016)
[11] Sigurbergsson, G. I., & Derczynski, L. (2020). Offensive Language and Hate Speech Detection for Danish. Proceedings of the 12th Language Resources and Evaluation Conference, 3498–3508. https://aclanthology.org/2020.lrec-1.430
[12] Møllerhøj, J. D. (2019, December 5). Danish BERT model: BotXO has trained the most advanced BERT model. BotXO. https://www.botxo.ai/blog/danish-bert-model/
