---
tags:
- generated_from_trainer
---


# llama-7b-SFT-qlora-wiki_DPO_ds_RM_random_1024_r_64_alpha_16

This model is a fine-tuned version of [dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged](https://huggingface.co/dhmeltzer/llama-7b-SFT_ds_wiki65k_1024_r_64_alpha_16_merged) on an unknown dataset. Evaluation-set results logged over the course of training are reported in the [Training results](#training-results) table below.
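
No usage snippet was generated with this card. Below is a minimal inference sketch, assuming this repository hosts a PEFT (QLoRA) adapter for the base model above and that the repo id matches the card title; both are assumptions, not verified by the card.

```python
# Minimal inference sketch. Assumptions (not verified by this card): the repo
# id below is correct, the repo hosts a PEFT adapter whose config points at
# the base model, and the tokenizer is bundled with the adapter.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

repo_id = "dhmeltzer/llama-7b-SFT-qlora-wiki_DPO_ds_RM_random_1024_r_64_alpha_16"  # assumed

# Loads the base model named in the adapter config, then applies the adapter.
model = AutoPeftModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
# If the tokenizer is not bundled with the adapter, load it from the base
# model repo instead.
tokenizer = AutoTokenizer.from_pretrained(repo_id)

prompt = "Summarize the idea behind direct preference optimization."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```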

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The hyperparameter values used during training were not captured when this card was generated; more information needed.
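
As rough orientation only: the model name encodes a QLoRA setup with LoRA rank r = 64, alpha = 16, and 1024-token sequences, trained with DPO. The sketch below shows configs consistent with that naming; every other value is an illustrative assumption, not a recorded setting.

```python
# Illustrative reconstruction from the model name only (qlora, r=64, alpha=16,
# 1024-token context). Learning rate, batch size, DPO beta, optimizer, and
# schedule were not recorded in this card and are NOT shown here.
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# QLoRA: the frozen base model is loaded in 4-bit NF4; pass this as
# quantization_config when calling AutoModelForCausalLM.from_pretrained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter shape taken from the model name; dropout is an assumption.
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,  # assumed, not recorded
    task_type="CAUSAL_LM",
)
```

In TRL, a run like this is typically driven by `DPOTrainer` with `peft_config` passed through; the exact trainer arguments used here are unknown.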

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6904        | 0.1   | 19   | 0.6904          | -0.3143        | -0.3636          | 0.5458             | 0.0493          | -207.3793      | -204.3384    | 1.1224          | 1.1416        |
| 0.6725        | 0.21  | 38   | 0.6850          | -0.3901        | -0.4540          | 0.5547             | 0.0640          | -208.2836      | -205.0964    | 1.1270          | 1.1469        |
| 0.6818        | 0.31  | 57   | 0.6801          | -0.1790        | -0.2369          | 0.5469             | 0.0578          | -206.1121      | -202.9860    | 1.1465          | 1.1674        |
| 0.6671        | 0.41  | 76   | 0.6863          | -0.2598        | -0.3469          | 0.5580             | 0.0871          | -207.2126      | -203.7936    | 1.1468          | 1.1665        |
| 0.6683        | 0.52  | 95   | 0.6841          | -0.1475        | -0.2325          | 0.5502             | 0.0851          | -206.0687      | -202.6704    | 1.1388          | 1.1590        |
| 0.6626        | 0.62  | 114  | 0.6846          | -0.0836        | -0.1600          | 0.5480             | 0.0764          | -205.3429      | -202.0314    | 1.1263          | 1.1474        |
| 0.6593        | 0.72  | 133  | 0.6864          | -0.1272        | -0.2184          | 0.5625             | 0.0912          | -205.9276      | -202.4675    | 1.1106          | 1.1306        |
| 0.672         | 0.83  | 152  | 0.6857          | -0.1452        | -0.2334          | 0.5592             | 0.0882          | -206.0777      | -202.6477    | 1.1086          | 1.1293        |
| 0.6671        | 0.93  | 171  | 0.6855          | -0.1472        | -0.2350          | 0.5547             | 0.0878          | -206.0934      | -202.6673    | 1.1071          | 1.1270        |
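
For readers unfamiliar with the column names: they follow the standard DPO bookkeeping, where the implicit reward of a response is the scaled log-probability ratio between the policy and the frozen reference model. Assuming the standard objective (Rafailov et al., 2023):

```latex
% Implicit DPO reward and pairwise loss; y_w = chosen, y_l = rejected.
r_\theta(x, y) = \beta \,\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\qquad
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right)
```

Under these conventions, `Rewards/margins` is the mean chosen-minus-rejected reward, `Rewards/accuracies` is the fraction of pairs with a positive margin, and the `Logps/*` columns are the log-probabilities of each completion under the policy.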

### Framework versions