Model Card: EUBERT
Overview
- Model Name: EUBERT
 - Model Version: 1.1
 - Date of Release: 16 October 2023
 - Model Architecture: BERT (Bidirectional Encoder Representations from Transformers)
 - Training Data: Documents registered by the European Publications Office
 - Model Use Case: Text Classification, Question Answering, Language Understanding
 

Model Description
EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the European Publications Office. The documents span the last 30 years and cover a wide range of topics and domains. EUBERT is intended as a general-purpose language model that can be fine-tuned for a variety of natural language processing tasks.
Intended Use
EUBERT serves as a starting point for building more specific natural language understanding models. Its versatility makes it suitable for a wide range of tasks, including but not limited to:
- Text Classification: EUBERT can be fine-tuned for classifying text documents into different categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection.
- Question Answering: By fine-tuning EUBERT on question-answering datasets, it can be used to extract answers from text documents, facilitating tasks like information retrieval and document summarization.
- Language Understanding: EUBERT can be employed for general language understanding tasks, including named entity recognition, part-of-speech tagging, and text generation.
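As a rough illustration of the fine-tuning starting point described above, the sketch below loads the encoder with a fresh classification head through the Transformers `Auto*` classes. `MODEL_ID` is a placeholder rather than a confirmed repository name, and `num_labels`, `max_length=512`, and the helper names are assumptions made for this example only.

```python
# Placeholder: replace with the actual Hub repository id or a local checkpoint path.
MODEL_ID = "path/or/hub-id/of/EUBERT"


def load_for_classification(model_id=MODEL_ID, num_labels=2):
    """Load the tokenizer and the encoder with a newly initialised classification head."""
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch of texts into padded, truncated PyTorch tensors."""
    # max_length=512 is the usual BERT limit; it is assumed here, not stated in this card.
    return tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=max_length,
        return_tensors="pt",
    )
```

A typical call would be `tokenizer, model = load_for_classification(num_labels=3)` followed by `model(**encode(tokenizer, ["some text"]))`. The classification head is randomly initialised, so its outputs are only meaningful after fine-tuning.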
 
Performance
The specific performance metrics of EUBERT may vary depending on the downstream task and the quality and quantity of training data used for fine-tuning. Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly.
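For a classification task, one possible way to evaluate a fine-tuned model is to compare predicted and reference labels with accuracy and macro-averaged F1. The sketch below is plain Python; the label names and the sample data are invented for illustration and are not results for EUBERT.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores over all labels seen in either list."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels, for illustration only.
reference = ["legal", "legal", "budget", "budget"]
predicted = ["legal", "budget", "budget", "budget"]
print(accuracy(reference, predicted))  # 0.75
print(round(macro_f1(reference, predicted), 4))  # 0.7333
```

Macro-averaging weights every label equally, which can be useful when some categories are rare in the evaluation set.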
Considerations
- 
Data Privacy and Compliance: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.
 - 
Fine-Tuning: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.
 - 
Bias and Fairness: Users should be aware of potential biases in the training data and take appropriate measures to mitigate bias when fine-tuning EUBERT for specific tasks.
 
Conclusion
EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for developing natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding tasks. Users are encouraged to exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.
Training procedure
The model uses a dedicated WordPiece tokenizer with a vocabulary size of 2**16 (65,536 tokens).
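To make the tokenizer description concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting, using the `##` continuation prefix that standard BERT vocabularies use. The toy vocabulary is hypothetical and far smaller than the real one; it does not reflect EUBERT's actual entries.

```python
VOCAB_SIZE = 2 ** 16  # 65,536 entries, the size stated for the tokenizer


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"regul", "##ation", "##s", "euro", "##pean"}
print(wordpiece("regulations", toy_vocab))  # ['regul', '##ation', '##s']
print(wordpiece("european", toy_vocab))  # ['euro', '##pean']
```

A larger vocabulary generally lets more domain terms survive as single tokens instead of being split into several pieces.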
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
 - train_batch_size: 32
 - eval_batch_size: 32
 - seed: 42
 - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
 - lr_scheduler_type: linear
 - num_epochs: 1.85
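
The `linear` scheduler type decays the learning rate linearly from its initial value towards zero over the course of training. The sketch below reproduces that shape in plain Python; the total step count is a hypothetical value, and the absence of warmup is an assumption, since neither is reported in this card.

```python
BASE_LR = 5e-05  # learning_rate from the list above


def linear_lr(step, total_steps, base_lr=BASE_LR, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr (unused when warmup_steps is 0).
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical total step count, for illustration only.
for step in (0, 500, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

With these assumptions the rate starts at the full 5e-05, is halved at the midpoint, and reaches zero at the final step.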
 
Training results
Coming soon
Framework versions
- Transformers 4.33.3
 - PyTorch 2.0.1+cu117
 - Datasets 2.14.5
 - Tokenizers 0.13.3
 
Infrastructure
- Hardware Type: 4 x 24 GB GPUs
 - GPU Days: 16
 - Cloud Provider: EuroHPC
 - Compute Region: MeluXina
 
Model Card Authors
Sebastien Campion
Model Card Contact
sebastien.campion@europarl.europa.eu