RoPE Scaled QLoRA Fine-tune of Llama-2 13b on airoboros-2.1, with Long Context Pretraining (fp16 weights)

Overview

This is a finetune of Llama-2-13b, intended to extend the useful context window to 16384 tokens via position interpolation (PI). There are two training phases:

  1. The RoPE embeddings were scaled by a factor of 0.25 (linear method) and the model was trained on 16384-token sequences from the chapter component of the booksum dataset (one epoch, ~150M tokens).
  2. The model was then finetuned on Jon Durbin's Airoboros 2.1 dataset, with the same scaling approach, for 2 epochs.

This is a (merged) QLoRA fine-tune (rank 64).

The finetune was performed with 1x RTX 6000 Ada.
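
For those interested in a similar setup, below is a minimal sketch of the training-side configuration using Hugging Face transformers + peft + bitsandbytes. Only the LoRA rank (64), the 4-bit QLoRA quantization, and the linear scaling factor come from the description above; the alpha, dropout, target modules, and other hyperparameters are assumptions, not the exact recipe used here.

```python
# Illustrative sketch only -- hyperparameters beyond rank 64, 4-bit QLoRA,
# and linear RoPE scaling to 16384 tokens are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # QLoRA: base weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",            # base model
    quantization_config=bnb_config,
    # Linear position interpolation: factor 4.0 stretches the native 4096-token
    # window to 16384 (positions are multiplied by 1/4, i.e. 0.25).
    rope_scaling={"type": "linear", "factor": 4.0},
    device_map="auto",
)

lora_config = LoraConfig(
    r=64,                                   # LoRA rank used for this finetune
    lora_alpha=16,                          # assumed; not stated in this card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,                      # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```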

How to Use

This model employs linear RoPE scaling, which now has native support in Transformers (be sure to update it if you have issues). Use it as you would with any normal context length variant.
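
For example, a minimal loading and generation sketch with Transformers (the prompt string is a placeholder; see the Prompting section below for the airoboros-2.1 format). The merged fp16 weights should carry the linear rope_scaling entry in config.json, so an up-to-date Transformers release picks it up automatically; it can also be passed explicitly via rope_scaling={"type": "linear", "factor": 4.0}.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bhenrym14/airoboros-l2-13b-PI-16k-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Placeholder prompt -- format it per the airoboros-2.1 conventions.
prompt = "Summarize the key events of the first chapter."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```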

Please comment with any questions.

Ooba (text-generation-webui) use: be sure to increase the "Truncate the prompt up to this length" parameter to 16384 to utilize the full context capabilities.

Motivation

Given the excellent performance of llama-2 13b finetunes relative to llama 33b, I have received several requests for a 16k model using the latest airoboros dataset. Furthermore, while partial NTK scaling appears to be better for retaining short-context performance, it is not natively supported in Transformers and is thus less accessible to less technical audiences. This model is designed to offer long-context capabilities along with the stylistic characteristics of the new airoboros dataset, without any additional configuration.

Relative Performance (wikitext perplexity)

| Context (tokens) | bhenrym14/airoboros-l2-13b-PI-16k-fp16 | bhenrym14/airophin-v2-13b-PI-8k-fp16 | bhenrym14/airophin-13b-pntk-16k-fp16 | bhenrym14/airoboros-13b-gpt4-1.4.1-PI-8192-fp16 | bhenrym14/airoboros-33b-gpt4-1.4.1-lxctx-PI-16384-fp16 | jondurbin/airoboros-l2-13b-gpt4-1.4.1 |
| --- | --- | --- | --- | --- | --- | --- |
| 512 | 7.67 | 7.38 | 7.62 | 8.24 | 7.90 | 7.23 |
| 1024 | 6.15 | 5.99 | 6.20 | 6.71 | 6.17 | 5.85 |
| 2048 | 5.29 | 5.22 | 5.38 | 5.87 | 5.23 | 5.07 |
| 4096 | 4.94 | 4.90 | 5.08 | 5.50 | 4.91 | 4.77 |
| 8192 | 4.71 | 4.71 | 4.90 | 5.32 | Not Tested | 57.1 |
| 12000 | 4.54 | 55 | 4.82 | 56.1 | Not Tested | Not Tested |
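
For context, below is a rough sketch of how wikitext perplexity at a fixed context length can be measured with Transformers: non-overlapping chunks of the requested length are scored and the mean negative log-likelihood is exponentiated. The exact evaluation script, dataset split, and stride behind the numbers above are not specified here, so treat this as illustrative only.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bhenrym14/airoboros-l2-13b-PI-16k-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Concatenate the wikitext test split into one long token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

def ppl_at(context_len, max_chunks=20):
    """Perplexity over non-overlapping chunks of `context_len` tokens."""
    nlls = []
    for start in range(0, min(len(ids), context_len * max_chunks), context_len):
        chunk = ids[start : start + context_len].unsqueeze(0).to(model.device)
        if chunk.shape[1] < context_len:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        nlls.append(out.loss)
    return torch.exp(torch.stack(nlls).mean()).item()

print(ppl_at(4096))
```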

Prompting

Prompting differs with the airoboros 2.1 models; see jondurbin/airoboros-l2-13b-2.1 for the prompt format.