Model Card

Following [1], the following constitutes a model card for this model.


Organization developing the Model: The Danish Foundation Models project

Model Creation Date: June 2022

Model Type: Transformer encoder model [2]; BERT [3]

Feedback on the Model: For feedback on the model, please use the community forum.

Training logs and performance metrics: Check out this Weights & Biases dashboard.

Intended Uses

Primary Intended Uses:

The primary intended use case of this model is the reproduction and validation of dataset quality. Future iterations of this model are intended for application in industry and research on Danish natural language tasks.

Primary Intended Users:

Future iterations of the model are intended for NLP practitioners dealing with Danish text documents.

Out-of-Scope Uses:

Use of the model for profiling in a way that disregards the potential harm it might cause, such as racial profiling.

Factors

Card prompts - Relevant Factors:

Relevant factors include the language of the input text. Our model is trained on a Danish text corpus and is intended for comparing training datasets.

Card prompts - Evaluation Factors:

Future iterations of this model should include a validation of biases pertaining to gender, race, and religious and social groups.

Metrics

Performance Metrics:

The model's performance metrics can be found in the Weights & Biases dashboard.

Decision Threshold:

N/A

Approaches to Uncertainty and Variability:

Due to the cost of training, the model is only pre-trained once, but ScandEval fine-tunes it ten times to obtain a reasonable estimate of model performance.
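
As an illustration of how such repeated-evaluation results can be aggregated, the following sketch computes a mean score and an approximate 95% confidence interval over ten fine-tuning runs. The scores are placeholder values, not actual results for this model.

```python
import statistics

# Placeholder scores from ten hypothetical ScandEval fine-tuning runs on one Danish task.
scores = [71.2, 70.8, 72.1, 69.9, 71.5, 70.4, 71.9, 70.1, 71.0, 72.3]

mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)            # sample standard deviation across runs
std_err = std_dev / len(scores) ** 0.5        # standard error of the mean
ci_95 = 1.96 * std_err                        # approximate 95% confidence interval

print(f"score: {mean:.1f} ± {ci_95:.1f} (95% CI over {len(scores)} runs)")
```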

Evaluation Data

Datasets:

ScandEval's Danish benchmark includes:

Motivation:

The ScandEval benchmark is the most comprehensive benchmark for Danish. Pseudo-perplexity was analysed to examine the model's ability to model certain language domains.
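
For reference, pseudo-perplexity for a masked language model can be estimated by masking one token at a time and averaging the negative log-likelihood of the true tokens. The sketch below uses the Hugging Face transformers library; the checkpoint name is a stand-in, not this model's published identifier.

```python
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Stand-in checkpoint; substitute the actual model identifier.
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and average the negative log-likelihood of the true token."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nlls = []
    # Skip the [CLS] and [SEP] tokens at the ends.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        nlls.append(-log_probs[ids[i]].item())
    return math.exp(sum(nlls) / len(nlls))

print(pseudo_perplexity("Danmark er et land i Skandinavien."))
```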

Training Data

For our training data, we sample from HopeTwitter, DaNews, DAGW and Netarkivet Text (NAT) with the probabilities 0.10, 0.10, 0.10 and 0.70, respectively, as illustrated in the sketch below. For more information on the training and the datasets, see the respective datasheets on the Danish Foundation Models GitHub page.
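
A minimal sketch of such weighted sampling, using the interleave_datasets utility from the Hugging Face datasets library, is shown below. The file paths are placeholders; the actual corpora are described in the datasheets mentioned above.

```python
from datasets import interleave_datasets, load_dataset

# Placeholder paths for the four corpora; substitute the actual data locations.
hopetwitter = load_dataset("json", data_files="hopetwitter.jsonl", split="train", streaming=True)
danews = load_dataset("json", data_files="danews.jsonl", split="train", streaming=True)
dagw = load_dataset("json", data_files="dagw.jsonl", split="train", streaming=True)
nat = load_dataset("json", data_files="nat.jsonl", split="train", streaming=True)

# Interleave documents with the sampling probabilities stated above: 0.10 / 0.10 / 0.10 / 0.70.
mixed = interleave_datasets(
    [hopetwitter, danews, dagw, nat],
    probabilities=[0.10, 0.10, 0.10, 0.70],
    seed=42,
)
```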

Pre-processing:

Input documents are tokenized using the tokenizer of the Danish BERT by BotXO [12], which uses byte-pair encoding (BPE) with a vocabulary size of ~30,000 and NFKC normalization.
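
A short sketch of what this pre-processing step could look like is given below, assuming the BotXO tokenizer is available as a Hugging Face checkpoint; the checkpoint identifier is an assumption, not an official reference.

```python
import unicodedata
from transformers import AutoTokenizer

# Assumed checkpoint name for the Danish BERT by BotXO; substitute the actual identifier.
tokenizer = AutoTokenizer.from_pretrained("Maltehb/danish-bert-botxo")

text = "København er Danmarks hovedstad."
text = unicodedata.normalize("NFKC", text)          # NFKC normalization before tokenization
encoding = tokenizer(text, truncation=True, max_length=512)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```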

Ethical Considerations

Data: The data is sourced from DaNews, DAGW, HopeTwitter, and Netarkivet Text (NAT) and might thus contain hate speech, sexually explicit content and otherwise harmful content.

Mitigations: We considered removing sexually explicit content by filtering web domains using a DNS filter or Google Safe Search. However, upon examining the filtered domains, these were also found to include news media catering to a specific demographic (e.g. Dagens.dk) and educational sites pertaining to sexual education. We also examined the use of word-based filters, but found that they might affect certain demographic groups disproportionately.

Risk and Harms: As Netarkivet Text covers such a wide array of the Danish internet, it undoubtedly contains personal information. To reduce the risk of the model memorizing this information, we have deduplicated the data; a sketch of this kind of deduplication is given below.
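
As an illustration only, a minimal exact-match variant of document deduplication over normalized text might look as follows; the actual pipeline may rely on more sophisticated near-duplicate detection.

```python
import hashlib

def deduplicate(documents):
    """Drop documents whose normalized text has been seen before (exact-match deduplication)."""
    seen = set()
    unique = []
    for doc in documents:
        # Normalize lightly so trivial whitespace/casing differences still count as duplicates.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Et dokument.", "Et  dokument.", "Et andet dokument."]
print(deduplicate(docs))  # the second, near-identical document is removed
```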

References: