
# llama-7b-SFT-qlora-wiki_DPO_ds_RM_contrast_1024_r_64_alpha_16

This model is a fine-tuned version of [dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged](https://huggingface.co/dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged) on an unknown dataset. It achieves the following results on the evaluation set (final logged evaluation, step 171):

- Loss: 0.6253
- Rewards/chosen: -0.1000
- Rewards/rejected: -0.3475
- Rewards/accuracies: 0.6339
- Rewards/margins: 0.2476
- Logps/rejected: -202.4036
- Logps/chosen: -208.0517
- Logits/rejected: 1.1149
- Logits/chosen: 1.1394
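A minimal inference sketch, assuming this repository hosts a PEFT (LoRA) adapter on top of the merged SFT base named above; if the adapter has instead been merged into the weights, loading the repository directly with `AutoModelForCausalLM` should suffice.

```python
# Minimal inference sketch; assumes this repo contains a PEFT adapter.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged"
adapter_id = "dhmeltzer/llama-7b-SFT-qlora-wiki_DPO_ds_RM_contrast_1024_r_64_alpha_16"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)  # attach the DPO adapter

prompt = "Summarize the idea behind Direct Preference Optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```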

## Model description

More information needed. The repository name suggests a QLoRA DPO fine-tune (LoRA rank 64, LoRA alpha 16, 1024-token context) of a LLaMA-7B SFT checkpoint, using a wiki-derived preference dataset scored by a reward model.

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The hyperparameter list was not recorded in this card. The repository name indicates LoRA rank 64, LoRA alpha 16, and 1024-token sequences; a hedged training sketch under these assumptions follows.
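The sketch below shows the kind of QLoRA DPO run the name describes, using TRL's `DPOTrainer`. The dataset path, `beta`, dropout, batch size, and `target_modules` are illustrative assumptions rather than recovered values, and `DPOTrainer` argument names vary across TRL releases.

```python
# Hedged sketch of a QLoRA DPO run; NOT the recorded recipe for this model.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

base_id = "dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged"

# 4-bit quantization is what the "qlora" in the model name implies.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(base_id)

peft_config = LoraConfig(          # r and alpha read off the model name
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,             # illustrative
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
)

# A preference dataset with "prompt"/"chosen"/"rejected" columns is assumed;
# the actual dataset behind this card is unknown, and this ID is hypothetical.
train_dataset = load_dataset("dhmeltzer/hypothetical_wiki_preferences", split="train")

args = DPOConfig(
    output_dir="llama-7b-dpo-qlora",
    beta=0.1,                      # TRL's default; the true value is unknown
    max_length=1024,               # sequence length from the model name
    per_device_train_batch_size=4, # illustrative
    num_train_epochs=1,            # the results table spans roughly one epoch
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # `tokenizer=` in older TRL releases
    peft_config=peft_config,
)
trainer.train()
```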

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.692         | 0.1   | 19   | 0.6510          | -0.3841        | -0.5657          | 0.5949             | 0.1816          | -204.5857      | -210.8932    | 1.1059          | 1.1283        |
| 0.6585        | 0.21  | 38   | 0.6389          | -0.0095        | -0.2372          | 0.6373             | 0.2276          | -201.3002      | -207.1476    | 1.1111          | 1.1363        |
| 0.6581        | 0.31  | 57   | 0.6299          | -0.0360        | -0.3003          | 0.6417             | 0.2643          | -201.9318      | -207.4127    | 1.1053          | 1.1315        |
| 0.6485        | 0.42  | 76   | 0.6332          | -0.2261        | -0.4511          | 0.6194             | 0.2250          | -203.4390      | -209.3129    | 1.0905          | 1.1138        |
| 0.6551        | 0.52  | 95   | 0.6270          | -0.1240        | -0.3577          | 0.6362             | 0.2337          | -202.5053      | -208.2919    | 1.1088          | 1.1331        |
| 0.6484        | 0.62  | 114  | 0.6293          | -0.1372        | -0.3680          | 0.6440             | 0.2308          | -202.6089      | -208.4242    | 1.1213          | 1.1467        |
| 0.6427        | 0.73  | 133  | 0.6264          | -0.1804        | -0.4360          | 0.6451             | 0.2556          | -203.2879      | -208.8561    | 1.1096          | 1.1347        |
| 0.645         | 0.83  | 152  | 0.6249          | -0.1145        | -0.3663          | 0.6451             | 0.2518          | -202.5918      | -208.1973    | 1.1131          | 1.1385        |
| 0.6335        | 0.94  | 171  | 0.6253          | -0.1000        | -0.3475          | 0.6339             | 0.2476          | -202.4036      | -208.0517    | 1.1149          | 1.1394        |
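For readers of the table: in DPO, each completion's implicit reward is the β-scaled log-probability ratio between the policy and the reference model, `Rewards/margins` is `Rewards/chosen` minus `Rewards/rejected`, and `Rewards/accuracies` is the fraction of preference pairs whose chosen completion earns the higher reward (here it climbs from roughly 0.59 to 0.64 over the run). The first row illustrates the margin arithmetic:

$$
r(x, y) = \beta\,\bigl[\log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\bigr],
\qquad
-0.3841 - (-0.5657) = 0.1816 .
$$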

## Framework versions

More information needed