
Aira-RLHF-124M

Aira-RLHF-124M is an RLHF-fine-tuned version of Aira-Instruct-124M, trained with the RewardModel as the primary preference model and the ToxicityModel as an auxiliary preference model.
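
To illustrate what this dual-preference setup means in practice, the sketch below scores a prompt/completion pair with both models and folds the toxicity score into the reward signal. The repository IDs, the 0.5 weighting, the paired-input tokenization, and the way scores are read from the logits are assumptions for illustration only; the actual recipe is the one in the training notebook.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical sketch of the dual-preference signal: a primary reward score
# plus an auxiliary toxicity term. Model IDs, weights, and score extraction
# are illustrative assumptions, not the exact training configuration.
reward_tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/RewardModel")
reward_model = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/RewardModel")
toxicity_tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/ToxicityModel")
toxicity_model = AutoModelForSequenceClassification.from_pretrained("nicholasKluge/ToxicityModel")

def preference_score(prompt: str, completion: str, toxicity_weight: float = 0.5) -> float:
    """Combine the primary reward with an auxiliary toxicity term into one scalar."""
    with torch.no_grad():
        reward_inputs = reward_tokenizer(prompt, completion, return_tensors="pt", truncation=True)
        reward = reward_model(**reward_inputs).logits[0, 0].item()

        toxicity_inputs = toxicity_tokenizer(prompt, completion, return_tensors="pt", truncation=True)
        # Assumed convention: a higher logit means "less toxic", so toxic text lowers the score.
        non_toxic_score = toxicity_model(**toxicity_inputs).logits[0, 0].item()

    return reward + toxicity_weight * non_toxic_score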

This repository serves only to host the RLHF experiments performed on the Aira-Instruct series. The training notebook can be found in this repository.
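
For quick, hands-on inspection, the checkpoint can be loaded like any other causal language model in 🤗 Transformers. The snippet below is a minimal sketch; the plain-text prompt and sampling settings are assumptions and may not match the instruction format the model was actually tuned on.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal generation sketch; the plain-text prompt format is an assumption
# and may differ from the instruction template used during fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/Aira-RLHF-124M")
model = AutoModelForCausalLM.from_pretrained("nicholasKluge/Aira-RLHF-124M")

inputs = tokenizer("What is the capital of Brazil?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))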

Limitations

This model is intended for research purposes only. We recommend using Aira-Instruct-124M for better performance and reserving this model for light experimentation.

🤥 Generative models can perpetuate the generation of pseudo-informative content, that is, false information that may appear truthful.

🤬 In certain types of tasks, generative models can produce harmful and discriminatory content inspired by historical stereotypes.

Note: Aira-RLHF-124M performs worse than Aira-Instruct-124M in several domains. This may stem from several causes (model size, misalignment of the reward model, etc.) that we are still trying to iron out. 🤔

Cite as 🤗


@misc{nicholas22aira,
  doi = {10.5281/zenodo.6989727},
  url = {https://huggingface.co/nicholasKluge/Aira-RLHF-124M},
  author = {Nicholas Kluge Corrêa},
  title = {Aira},
  year = {2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
}

License

Aira-RLHF-124M is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.