Model Card for SzegedAI/bert-medium-mlsm

This medium-sized BERT model was created using the Masked Latent Semantic Modeling (MLSM) pre-training objective, a sample-efficient alternative to classic Masked Language Modeling (MLM).
During MLSM, the objective is to recover the latent semantic profile of the masked tokens, as opposed to recovering their exact identity.
The contextualized latent semantic profile used during pre-training is determined by performing sparse coding of the hidden representations of an already pre-trained teacher model (a base-sized BERT model in this particular case).
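
To make the objective concrete, the following minimal sketch (not the authors' released code) shows one plausible way of deriving MLSM-style targets: the teacher's hidden vectors at the masked positions are sparse coded over a pre-learned dictionary, the resulting non-negative coefficients are normalized into a distribution, and the student is trained to predict that distribution. The dictionary, the sparse coding settings, and the function names are illustrative assumptions.

from sklearn.decomposition import SparseCoder
import torch
import torch.nn.functional as F

def mlsm_targets(teacher_hidden, dictionary):
    # teacher_hidden: NumPy array (num_masked, hidden_dim) with the teacher's hidden states
    # at the masked positions; dictionary: NumPy array (num_atoms, hidden_dim), assumed to
    # have been learned beforehand via dictionary learning on teacher representations.
    coder = SparseCoder(dictionary=dictionary, transform_algorithm='lasso_lars',
                        transform_alpha=0.05, positive_code=True)
    alpha = torch.from_numpy(coder.transform(teacher_hidden)).float()  # sparse, non-negative codes
    return alpha / alpha.sum(dim=-1, keepdim=True).clamp_min(1e-9)     # latent semantic profile

def mlsm_loss(student_logits, targets):
    # Cross-entropy between the student's predicted distribution over dictionary atoms
    # and the teacher-derived latent semantic profile of each masked token.
    return -(targets * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()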

Model Details

Model Description

bert-medium-mlsm is a medium-sized BERT encoder pre-trained on English Wikipedia with the MLSM objective, using the sparse-coded hidden representations of an already pre-trained, base-sized BERT model as training targets.

Model Sources

Paper: https://aclanthology.org/2023.findings-acl.876

How to Get Started with the Model

The pre-trained model can be used in the usual manner, e.g., for fine-tuning on a particular sequence classification task it can be loaded as follows:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the MLSM pre-trained encoder together with a randomly initialized classification head
tokenizer = AutoTokenizer.from_pretrained('SzegedAI/bert-medium-mlsm')
model = AutoModelForSequenceClassification.from_pretrained('SzegedAI/bert-medium-mlsm')
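
As a quick illustration (not part of the original card), the freshly loaded model can be run on a single sentence; since the classification head is randomly initialized, its outputs are only meaningful after fine-tuning:

inputs = tokenizer("MLSM is a sample-efficient pre-training objective.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) with the default two-label head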

Training Details

Training Data

The model was pre-trained using a 2022 English Wikipedia dump pre-processed with wiki-bert-pipeline.

Training Procedure

Training Hyperparameters

Pre-training was conducted with a batch size of 32 sequences and gradient accumulation over 32 batches, resulting in an effective batch size of 1024 sequences.
A total of 300,000 update steps were performed using the AdamW optimizer with a linear learning rate schedule and a peak learning rate of 1e-04. A maximum sequence length of 128 tokens was employed over the first 90% of pre-training, while for the final 10% the maximum sequence length was increased to 512 tokens.
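
For reference, a hypothetical sketch of these hyperparameters expressed as transformers.TrainingArguments is shown below; the actual pre-training script is not part of this card, and the switch of the maximum sequence length from 128 to 512 tokens would be handled by the data pipeline rather than by these arguments.

from transformers import TrainingArguments

pretraining_args = TrainingArguments(
    output_dir="bert-medium-mlsm-pretraining",  # hypothetical output directory
    per_device_train_batch_size=32,
    gradient_accumulation_steps=32,             # 32 x 32 = 1024 sequences per update
    max_steps=300_000,
    learning_rate=1e-4,                         # peak learning rate
    lr_scheduler_type="linear",
    optim="adamw_torch",                        # AdamW optimizer
)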

Evaluation

Metrics

The model was evaluated on the GLUE tasks, on WiC, and on CoNLL2003 for named entity recognition.
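
As an illustration (the original evaluation scripts are not included in this card), the reported metrics can be computed with the evaluate library, e.g. Matthews correlation for CoLA and entity-level F1 for CoNLL2003:

import evaluate

cola_metric = evaluate.load("glue", "cola")  # Matthews correlation
print(cola_metric.compute(predictions=[1, 0, 1], references=[1, 0, 0]))

seqeval = evaluate.load("seqeval")           # entity-level precision/recall/F1 for CoNLL2003
print(seqeval.compute(predictions=[["B-PER", "O", "B-LOC"]],
                      references=[["B-PER", "O", "O"]]))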

Results

The results below were obtained after fine-tuning the model on a wide range of downstream tasks.
For each task, 10 fine-tuning runs were performed, differing only in the random initialization of the task-specific classification head. The average and the standard deviation over these runs are reported for each task below.

| Dataset | Metric | Avg. | Std. |
|---|---|---|---|
| CoLA | Matthews correlation | 0.403 | 0.012 |
| CoNLL2003 | F1 | 0.926 | 0.003 |
| MNLI (matched) | Accuracy | 0.798 | 0.001 |
| MNLI (mismatched) | Accuracy | 0.808 | 0.002 |
| MRPC | Accuracy | 0.786 | 0.020 |
| MRPC | F1 | 0.851 | 0.013 |
| QNLI | Accuracy | 0.870 | 0.004 |
| QQP | Accuracy | 0.892 | 0.001 |
| QQP | F1 | 0.855 | 0.001 |
| RTE | Accuracy | 0.571 | 0.011 |
| SST2 | Accuracy | 0.905 | 0.004 |
| STSB | Pearson correlation | 0.818 | 0.024 |
| STSB | Spearman correlation | 0.820 | 0.021 |
| WiC | Accuracy | 0.639 | 0.007 |
| Average | --- | 0.7815 | --- |

Summary

This model is more sample-efficient than classic MLM pre-training: it reached practically the same average performance as a base-sized language model with 2.5 times as many parameters that was pre-trained using the classical MLM objective.

Citation

The pre-training objective is introduced in the ACL 2023 Findings paper Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling (Berend, 2023).

BibTeX:

@inproceedings{berend-2023-masked,
    title = "Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling",
    author = "Berend, G{\'a}bor",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.876",
    pages = "13949--13962",
    abstract = "In this paper, we propose an alternative to the classic masked language modeling (MLM) pre-training paradigm, where the objective is altered from the reconstruction of the exact identity of randomly selected masked subwords to the prediction of their latent semantic properties. We coin the proposed pre-training technique masked latent semantic modeling (MLSM for short). In order to make the contextualized determination of the latent semantic properties of the masked subwords possible, we rely on an unsupervised technique which uses sparse coding. Our experimental results reveal that the fine-tuned performance of those models that we pre-trained via MLSM is consistently and significantly better compared to the use of vanilla MLM pretraining and other strong baselines.",
}