
# llama-7b-SFT-qlora-eli5-wiki_DPO_ds_RM_contrast_1024_r_64_alpha_16

This model is a fine-tuned version of [dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged](https://huggingface.co/dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged) on an unknown dataset. At the final evaluation (epoch 0.94, step 171; see the full table below) it achieves the following results on the evaluation set:

- Loss: 0.6269
- Rewards/chosen: 0.1007
- Rewards/rejected: -0.1655
- Rewards/accuracies: 0.6618
- Rewards/margins: 0.2662
- Logps/rejected: -197.9971
- Logps/chosen: -205.4121
- Logits/rejected: 0.7975
- Logits/chosen: 0.8353
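No usage snippet was generated with this card. Assuming this repository hosts QLoRA (PEFT) adapter weights trained on top of the merged SFT base named above, loading might look like the sketch below. The adapter repo id is inferred from the card title and the base model's namespace, so verify it before use.

```python
# Sketch only: assumes this repo contains PEFT adapter weights for the base model below.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged"
# Inferred from the card title; adjust if the adapter lives under a different repo id.
adapter_id = "dhmeltzer/llama-7b-SFT-qlora-eli5-wiki_DPO_ds_RM_contrast_1024_r_64_alpha_16"

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)  # attach the DPO-trained adapter

# The base was SFT-trained on ELI5/wiki data, so an ELI5-style prompt is a natural smoke test.
inputs = tokenizer("Explain like I'm five: why is the sky blue?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```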

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The full hyperparameter list was not preserved in this card. From the model name, training used QLoRA adapters with LoRA rank r = 64, LoRA alpha = 16, and a 1024-token sequence length, with DPO as the training objective; the remaining hyperparameters are unknown.
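The metric columns in the results table below (Rewards/chosen, Rewards/rejected, Logps/\*, Logits/\*) match the evaluation output of TRL's `DPOTrainer`, so the run was presumably configured along these lines. This is a minimal sketch under that assumption, not the actual training script: the dataset name is a placeholder, and `beta`, the batch size, and the optimizer settings are illustrative guesses.

```python
# Hypothetical reconstruction of the DPO run; unrecorded hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

base = "dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA rank and alpha taken from the model name (r_64_alpha_16).
peft_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

# DPO expects "prompt", "chosen", and "rejected" columns; this dataset name is a placeholder.
dataset = load_dataset("some/preference-dataset", split="train")

training_args = TrainingArguments(
    output_dir="dpo-out",
    per_device_train_batch_size=4,   # assumed
    evaluation_strategy="steps",
    eval_steps=19,                   # matches the evaluation cadence in the table below
    remove_unused_columns=False,     # required by DPOTrainer's data collator
)

trainer = DPOTrainer(
    model,
    ref_model=None,        # with a peft_config, TRL uses the frozen base as the implicit reference
    beta=0.1,              # assumed; the DPO temperature was not recorded in the card
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_length=1024,       # sequence length from the model name
)
trainer.train()
```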

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.6867        | 0.1   | 19   | 0.6390          | 0.0633         | -0.1318          | 0.6451             | 0.1951          | -197.8286      | -205.5991    | 0.7774          | 0.8133        |
| 0.6727        | 0.21  | 38   | 0.6384          | 0.0354         | -0.2285          | 0.6529             | 0.2639          | -198.3123      | -205.7386    | 0.8054          | 0.8432        |
| 0.6577        | 0.31  | 57   | 0.6391          | -0.0114        | -0.2258          | 0.6406             | 0.2145          | -198.2988      | -205.9725    | 0.7954          | 0.8346        |
| 0.6609        | 0.42  | 76   | 0.6344          | -0.3737        | -0.6175          | 0.6417             | 0.2438          | -200.2571      | -207.7841    | 0.7818          | 0.8194        |
| 0.6536        | 0.52  | 95   | 0.6285          | -0.1130        | -0.3816          | 0.6652             | 0.2687          | -199.0778      | -206.4805    | 0.7958          | 0.8350        |
| 0.654         | 0.62  | 114  | 0.6342          | 0.0007         | -0.2311          | 0.6484             | 0.2318          | -198.3250      | -205.9122    | 0.7917          | 0.8303        |
| 0.6435        | 0.73  | 133  | 0.6258          | 0.0462         | -0.2234          | 0.6562             | 0.2696          | -198.2865      | -205.6845    | 0.7949          | 0.8332        |
| 0.6508        | 0.83  | 152  | 0.6234          | 0.0858         | -0.1898          | 0.6574             | 0.2756          | -198.1188      | -205.4868    | 0.7931          | 0.8315        |
| 0.6361        | 0.94  | 171  | 0.6269          | 0.1007         | -0.1655          | 0.6618             | 0.2662          | -197.9971      | -205.4121    | 0.7975          | 0.8353        |
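A note on reading the reward columns: DPO does not train a separate reward model during this stage, so the logged rewards are implicit rewards derived from policy and reference log-probabilities. Assuming TRL's standard definition with temperature $\beta$:

$$
r_\theta(x, y) \;=\; \beta \left[ \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right]
$$

Rewards/margins is the chosen implicit reward minus the rejected one, and Rewards/accuracies is the fraction of evaluation pairs for which the chosen completion's reward exceeds the rejected completion's.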

### Framework versions