**Important Note:**
I created a combined metric (55% F1 score + 45% exact match score) and load the checkpoint with the best combined score at the end of training. Here are the relevant settings in `TrainingArguments`:

```python
load_best_model_at_end=True,
metric_for_best_model='combined',
greater_is_better=True,
```
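For context, here is a minimal sketch of how such a combined score can be added to the metrics dictionary the Trainer sees. The 55%/45% weighting is from the note above; the helper name is hypothetical and this is not the exact training code:

```python
# Hypothetical helper: add a 'combined' score (55% F1 + 45% exact match)
# to a SQuAD-style metrics dictionary before returning it to the Trainer.
def add_combined_metric(metrics: dict) -> dict:
    metrics["combined"] = 0.55 * metrics["f1"] + 0.45 * metrics["exact"]
    return metrics

# Example with the post-fine-tuning scores reported below:
print(add_combined_metric({"exact": 67.0964, "f1": 74.4842})["combined"])  # ~71.16
```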
# DSPFirst-Finetuning-5
This model is a fine-tuned version of ahotrod/electra_large_discriminator_squad2_512 on a dataset of questions and answers generated from the DSPFirst textbook in SQuAD 2.0 format.<br /> It achieves the following results on the evaluation set:
- Loss: 0.8529
- Exact: 67.0964
- F1: 74.4842
- Combined: 71.1597
A more detailed breakdown of the metrics:

**Before fine-tuning:**
```python
{'HasAns_exact': 54.71817606079797,
 'HasAns_f1': 61.08672724332754,
 'HasAns_total': 1579,
 'NoAns_exact': 88.78048780487805,
 'NoAns_f1': 88.78048780487805,
 'NoAns_total': 205,
 'best_exact': 58.63228699551569,
 'best_exact_thresh': 0.0,
 'best_f1': 64.26902596256402,
 'best_f1_thresh': 0.0,
 'exact': 58.63228699551569,
 'f1': 64.26902596256404,
 'total': 1784}
```
**After fine-tuning:**
```python
{'HasAns_exact': 67.57441418619379,
 'HasAns_f1': 75.92137683558988,
 'HasAns_total': 1579,
 'NoAns_exact': 63.41463414634146,
 'NoAns_f1': 63.41463414634146,
 'NoAns_total': 205,
 'best_exact': 67.0964125560538,
 'best_exact_thresh': 0.0,
 'best_f1': 74.48422310728503,
 'best_f1_thresh': 0.0,
 'exact': 67.0964125560538,
 'f1': 74.48422310728503,
 'total': 1784}
```
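These SQuAD v2-style numbers (exact, F1, HasAns_*, NoAns_*) can be computed with the `squad_v2` metric from the Datasets library. Below is a minimal sketch with placeholder predictions and references, not the actual evaluation script used for this card:

```python
# Sketch: computing SQuAD v2-style metrics with datasets.load_metric.
# The prediction/reference contents here are made-up placeholders.
from datasets import load_metric

squad_v2 = load_metric("squad_v2")

predictions = [
    {"id": "q1", "prediction_text": "a discrete-time signal", "no_answer_probability": 0.0},
]
references = [
    {"id": "q1", "answers": {"text": ["a discrete-time signal"], "answer_start": [10]}},
]

results = squad_v2.compute(predictions=predictions, references=references)
print(results["exact"], results["f1"])
```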
## Dataset
A visualization of the dataset can be found here.<br /> The split between train and test is 70% and 30% respectively.
```python
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 4160
    })
    test: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 1784
    })
})
```
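A 70/30 split like the one above can be produced with the `datasets` library. This sketch assumes the generated QA pairs have already been flattened into one record per question in a local JSON file (hypothetical path), which may differ from how the dataset was actually built:

```python
# Sketch: load flattened QA records and create a 70/30 train/test split.
from datasets import load_dataset

raw = load_dataset("json", data_files="dspfirst_qa.json")["train"]  # hypothetical file
splits = raw.train_test_split(test_size=0.3, seed=42)
print(splits)  # DatasetDict with 'train' and 'test' splits, as shown above
```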
## Intended uses & limitations
This model is fine-tuned to answer questions from the DSPFirst textbook. I am still learning, so you should review its answers before relying on them.<br /> The dataset could also be improved, either by using a better question-and-answer generation model (currently https://github.com/patil-suraj/question_generation) or by performing data augmentation to increase the dataset size.
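A minimal sketch of querying the model with the Transformers question-answering pipeline; the Hub repo id and the example context below are placeholders:

```python
# Sketch: ask the fine-tuned model a question about a DSPFirst-style passage.
from transformers import pipeline

qa = pipeline("question-answering", model="your-username/DSPFirst-Finetuning-5")  # placeholder repo id

result = qa(
    question="What does the sampling theorem state?",
    context="The sampling theorem states that a bandlimited signal can be "
            "reconstructed from samples taken at more than twice its highest frequency.",
)
print(result["answer"], result["score"])
```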
## Training and evaluation data
- A `batch_size` of 6 uses about 14.03 GB of VRAM.
- `gradient_accumulation_steps` is used to bring the total batch size to 516 (the total batch size should be at least 256); this setup uses about 4.52 GB of RAM.
- 30% of the total questions are reserved for evaluation.
## Training procedure
- The model was trained on Google Colab.
- Training ran on a Tesla P100 (16 GB) and took 6.3 hours.
- `load_best_model_at_end` is enabled in `TrainingArguments`.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 6
- eval_batch_size: 6
- seed: 42
- gradient_accumulation_steps: 86
- total_train_batch_size: 516
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 7
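For reference, here is a sketch of a `TrainingArguments` configuration matching the values above. It is a reconstruction under assumptions, not the author's exact script; note that 6 samples per device × 86 accumulation steps gives the effective batch size of 516:

```python
# Sketch of TrainingArguments mirroring the hyperparameters listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="DSPFirst-Finetuning-5",
    learning_rate=2e-5,
    per_device_train_batch_size=6,
    per_device_eval_batch_size=6,
    gradient_accumulation_steps=86,   # 6 * 86 = 516 effective batch size
    num_train_epochs=7,
    lr_scheduler_type="linear",
    seed=42,
    evaluation_strategy="steps",      # assumption: periodic evaluation, as in the results table
    save_strategy="steps",            # assumption: required so the best checkpoint can be reloaded
    load_best_model_at_end=True,
    metric_for_best_model="combined",
    greater_is_better=True,
)
```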
### Model hyperparameters
- hidden_dropout_prob: 0.36
- attention_probs_dropout_prob: 0.36
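A sketch of how these dropout values can be applied when loading the base checkpoint, assuming they were set through the model config rather than hard-coded:

```python
# Sketch: override ELECTRA dropout probabilities via the config before loading.
from transformers import AutoConfig, AutoModelForQuestionAnswering

config = AutoConfig.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512",
    hidden_dropout_prob=0.36,
    attention_probs_dropout_prob=0.36,
)
model = AutoModelForQuestionAnswering.from_pretrained(
    "ahotrod/electra_large_discriminator_squad2_512",
    config=config,
)
```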
### Training results

| Training Loss | Epoch | Step | Validation Loss | Exact   | F1      | Combined |
|:-------------:|:-----:|:----:|:---------------:|:-------:|:-------:|:--------:|
| 2.3222        | 0.81  | 20   | 1.0363          | 60.3139 | 68.8586 | 65.0135  |
| 1.6149        | 1.65  | 40   | 0.9702          | 64.7422 | 72.5555 | 69.0395  |
| 1.2375        | 2.49  | 60   | 1.0007          | 64.6861 | 72.6306 | 69.0556  |
| 1.0417        | 3.32  | 80   | 0.9963          | 66.0874 | 73.8634 | 70.3642  |
| 0.9401        | 4.16  | 100  | 0.8803          | 67.0964 | 74.4842 | 71.1597  |
| 0.8799        | 4.97  | 120  | 0.8652          | 66.7040 | 74.1267 | 70.7865  |
| 0.8712        | 5.81  | 140  | 0.8921          | 66.3677 | 73.7213 | 70.4122  |
| 0.8311        | 6.65  | 160  | 0.8529          | 66.3117 | 73.4039 | 70.2124  |
### Framework versions
- Transformers 4.18.0
- Pytorch 1.10.0+cu111
- Datasets 2.1.0
- Tokenizers 0.12.1