mekjr1/guilbert-base-uncased

This model is a fine-tuned version of bert-base-uncased on the guilbert dataset. It is a masked language model that predicts missing ([MASK]) tokens in a sentence.

Model description

The model is based on the bert-base-uncased architecture, which has 12 layers, a hidden size of 768, and 12 attention heads. It was fine-tuned on samples labeled as guilt or non-guilt drawn from the Vent dataset. Training used a maximum sequence length of 128 tokens, a batch size of 32, and the AdamW optimizer with a learning rate of 2e-5, a weight decay rate of 0.01, and a linear learning rate warmup over 1,000 steps. The model reached a validation loss of 1.8529 at the final epoch (epoch 8 in the zero-indexed results table below).
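
As a quick sanity check, the architecture claims above can be verified from the checkpoint's configuration. This is a minimal sketch, assuming the model is hosted on the Hugging Face Hub under mekjr1/guilbert-base-uncased:

```python
from transformers import TFAutoModelForMaskedLM

# Load the fine-tuned checkpoint; the config should mirror bert-base-uncased.
model = TFAutoModelForMaskedLM.from_pretrained("mekjr1/guilbert-base-uncased")

print(model.config.num_hidden_layers)    # 12
print(model.config.hidden_size)          # 768
print(model.config.num_attention_heads)  # 12
```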

Intended uses & limitations

This model can be used to predict missing tokens in text sequences, particularly in the context of detecting the emotion of guilt in documents and similar applications. Its accuracy may be limited by the quality and representativeness of the training data, as well as by biases inherited from the pre-trained bert-base-uncased weights.
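
The simplest way to run the model for masked-token prediction is the fill-mask pipeline from transformers. A minimal sketch follows; the example sentence is illustrative only:

```python
from transformers import pipeline

# Load the checkpoint as a fill-mask pipeline; bert-base-uncased
# tokenizers use [MASK] as the mask token.
fill_mask = pipeline("fill-mask", model="mekjr1/guilbert-base-uncased")

# Print the top predicted tokens for the masked position with their scores.
for pred in fill_mask("I feel so [MASK] about what happened yesterday."):
    print(f"{pred['token_str']}\t{pred['score']:.4f}")
```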

Training and evaluation data

The model was trained on samples labeled as guilt or non-guilt from the guilbert dataset, which was extracted from the Vent dataset.

Training procedure

The model was trained with TensorFlow Keras using the AdamW optimizer (learning rate 2e-5, weight decay rate 0.01, linear learning rate warmup over 1,000 steps), a batch size of 32, and a maximum sequence length of 128 tokens. Training ran through epoch 8 (zero-indexed; see the results table) with early stopping based on the validation loss, ending at a validation loss of 1.8529.
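
A minimal sketch of this setup using the create_optimizer helper from transformers; the total step count, early-stopping patience, and dataset pipeline are assumptions, as they are not specified in this card:

```python
import tensorflow as tf
from transformers import TFAutoModelForMaskedLM, create_optimizer

# AdamW with linear warmup over 1,000 steps and weight decay of 0.01,
# matching the hyperparameters above. num_train_steps is a placeholder:
# the real value depends on dataset size, batch size (32), and epochs.
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_warmup_steps=1_000,
    num_train_steps=10_000,  # hypothetical
    weight_decay_rate=0.01,
)

model = TFAutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.compile(optimizer=optimizer)  # the model computes the MLM loss internally

# Early stopping on validation loss; the patience value is an assumption.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)

# train_ds and val_ds would be tf.data.Dataset objects of masked, tokenized
# text (max length 128, batch size 32); their construction is omitted here.
# model.fit(train_ds, validation_data=val_ds, epochs=9, callbacks=[early_stop])
```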

Training hyperparameters

The following hyperparameters were used during training:

- optimizer: AdamW
- learning rate: 2e-5
- weight decay rate: 0.01
- learning rate warmup: linear, over 1,000 steps
- batch size: 32
- maximum sequence length: 128 tokens
- epochs: 0–8 (see the results table below), with early stopping on validation loss

Training results

The following table shows the training and validation loss for each epoch:

| Train Loss | Validation Loss | Epoch |
|------------|-----------------|-------|
| 2.0976     | 1.8593          | 0     |
| 1.9643     | 1.8547          | 1     |
| 1.9651     | 1.9003          | 2     |
| 1.9608     | 1.8617          | 3     |
| 1.9646     | 1.8756          | 4     |
| 1.9626     | 1.9024          | 5     |
| 1.9574     | 1.8421          | 6     |
| 1.9594     | 1.8632          | 7     |
| 1.9616     | 1.8529          | 8     |

Framework versions