Fine-Tuning the Llama 2 model
How to fine-tune Llama 2
In this part, we illustrate how to fine-tune a Llama 2 model with 70 billion parameters, which requires a minimum of 80 GB of VRAM. Even 80 GB is not enough to store the weights of Llama 2-70b in FP16, which take 140 GB (70 billion parameters * 2 bytes), and full fine-tuning needs additional VRAM for optimizer states, gradients, and forward activations. Given these limitations, we use parameter-efficient fine-tuning (PEFT) techniques such as LoRA or QLoRA to make the most of the available resources.
To substantially decrease VRAM consumption, we fine-tune the model in 4-bit precision using QLoRA. A benefit of this approach is that it fits into the Hugging Face ecosystem through libraries such as transformers, accelerate, peft, trl, and bitsandbytes.
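The memory arithmetic above can be checked with a few lines of Python. This is a minimal sketch that counts only the storage for the weights themselves; the helper name `weight_memory_gb` is made up for this example, and the 4-bit figure ignores quantization constants, LoRA adapters, gradients, and activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9  # Llama 2-70b

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # half a byte per parameter

print(f"FP16 weights:  {fp16_gb:.0f} GB")
print(f"4-bit weights: {int4_gb:.0f} GB")
```

Under these assumptions the weights shrink from 140 GB to 35 GB, which is why the 4-bit base model can fit in 80 GB of VRAM while the FP16 one cannot.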
First, we load the llama-2-70b-chat-hf model and train it on the dataset created through the RAG methodology. The result is our fine-tuned model, which we refer to as llama-2-70b-dexter.
RAG Training Data
The training dataset used to fine-tune the model is auto-generated with Llama 2. To do this, we create a template like this:
prompt_template = prompt_template or """\
Using the following templated example, generate an equivalent Question, Context, and Answer for the new paper:
Template
--------------
Question: {Question}
Context: {Context}
Answer: {Answer}
--------------
New Paper:
{new_paper}
"""
We send this template to the Llama 2 model and ask it to produce an example in the same format, using the information in the new paper. The Question, Context, and Answer in the template come from a file of questions whose context and answers were reviewed by professionals in the field. The model's response gives a new Question, Context, and Answer based on the paper provided. We repeated this for 100 different PDFs and obtained a training dataset with 13K examples to fine-tune the model.
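As a rough illustration of how the template can be filled in before it is sent to the model, the sketch below uses plain `str.format`. The function name `build_generation_prompt` and the example values are hypothetical placeholders, not part of the original pipeline or dataset.

```python
PROMPT_TEMPLATE = """\
Using the following templated example, generate an equivalent Question, Context, and Answer for the new paper:
Template
--------------
Question: {Question}
Context: {Context}
Answer: {Answer}
--------------
New Paper:
{new_paper}
"""


def build_generation_prompt(example: dict, new_paper: str,
                            template: str = PROMPT_TEMPLATE) -> str:
    """Insert one reviewed example and the text of a new paper into the template."""
    return template.format(
        Question=example["Question"],
        Context=example["Context"],
        Answer=example["Answer"],
        new_paper=new_paper,
    )


# Placeholder values for illustration only
example = {
    "Question": "What is X?",
    "Context": "X is described in section 2.",
    "Answer": "X is a placeholder.",
}
prompt = build_generation_prompt(example, "Full text of the new paper goes here.")
print(prompt)
```

The model's reply would then be parsed back into Question, Context, and Answer fields to form one training example.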
Quantized Low-Rank Adaptation (QLoRA)
QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques.
-
NF4 Quantization: NF4 quantization exploits the typical distribution of pre-trained neural network weights, which is usually a zero-centered normal distribution with some standard deviation. By rescaling the weights to a fixed distribution that fits within the range of NF4 (-1 to 1), it quantizes the weights without the need for expensive quantile estimation algorithms.
-
Double Quantization: Double Quantization addresses the memory overhead of the quantization constants by quantizing the constants themselves, which reduces the memory footprint without compromising performance. The second quantization step uses 8-bit floats with a block size of 256.
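The two ideas above can be sketched in plain Python. This is a simplified illustration, not the bitsandbytes implementation: the codebook is built from evenly spaced quantiles of a standard normal and is not the exact NF4 table, and the first-level block size of 64 is an assumption (only the second-level block size of 256 is stated above). All function names are made up for this example.

```python
import random
from statistics import NormalDist


def make_normal_codebook(bits: int = 4) -> list:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to the range [-1, 1]."""
    n = 2 ** bits
    levels = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(v) for v in levels)
    return [v / scale for v in levels]


def quantize_block(weights, codebook):
    """Scale a block by its absolute maximum, then keep the nearest codebook index."""
    absmax = max(abs(w) for w in weights) or 1.0  # avoid dividing by zero
    codes = [
        min(range(len(codebook)), key=lambda i: abs(w / absmax - codebook[i]))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, codebook):
    """Map indices back to approximate weights."""
    return [codebook[c] * absmax for c in codes]


def constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Bits per parameter spent on storing quantization constants."""
    if not double_quant:
        return 32 / block_size  # one FP32 constant per block
    # One 8-bit constant per block, plus one FP32 constant per group of constants
    return 8 / block_size + 32 / (block_size * second_block_size)


random.seed(0)
codebook = make_normal_codebook()
block = [random.gauss(0.0, 0.02) for _ in range(64)]  # synthetic weights
codes, absmax = quantize_block(block, codebook)
restored = dequantize_block(codes, absmax, codebook)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"max reconstruction error: {max_error:.5f}")
print(f"overhead without double quantization: {constant_overhead_bits(double_quant=False):.3f} bits/param")
print(f"overhead with double quantization:    {constant_overhead_bits():.3f} bits/param")
```

With the assumed block sizes, the constants cost 0.5 bits per parameter without double quantization and about 0.127 bits per parameter with it.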
Advantages of QLoRA
-
Further Memory Reduction: QLoRA achieves even higher memory efficiency by introducing quantization, making it particularly valuable for working with large models on resource-constrained hardware.
-
Preserving Performance: Despite its parameter-efficient nature, QLoRA retains high model quality, performing on par with, or in some cases better than, fully fine-tuned models on various downstream tasks.
-
Applicability to Various LLMs: QLoRA is a versatile technique applicable to different language models, including RoBERTa, DeBERTa, GPT-2, and GPT-3, enabling researchers to explore parameter-efficient fine-tuning for various LLM architectures.
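To make the parameter-efficiency claim concrete, the sketch below counts the trainable parameters of the two low-rank factors that LoRA (and therefore QLoRA) trains for one weight matrix, compared with updating the full matrix. The layer shape and the rank are assumed values chosen for illustration, not settings taken from this project.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 8192  # assumed layer shape, for illustration
rank = 64            # assumed LoRA rank

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({100 * lora / full:.2f}% of full)")
```

For these values the adapter trains roughly 1.6% of the parameters of the full matrix, while the frozen base weights stay in 4-bit precision.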
Obtain the hf_token to use the model
To obtain an hf_token you need a Hugging Face account. Go to your profile and select Settings, then go to Access Tokens and create a new token with any name you like. In the Role option select read, then select Generate a Token and the token will be created.
With this token you can use the model in the private repository, provided you have access to it. To test the model, replace hf_token in the code below with the token you generated.
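If you prefer not to paste the token into the script, one option is to read it from an environment variable. The sketch below is only a suggestion: the helper names `get_hf_token` and `login_from_env` are made up for this example, and the use of an `HF_TOKEN` variable is an assumption about how you store the token. The `login` function is the same one imported from huggingface_hub in the test code.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Read the access token from an environment variable, failing loudly if it is missing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your access token")
    return token


def login_from_env(env_var: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face hub with the token stored in the environment."""
    # Imported here so get_hf_token can be used without huggingface_hub installed
    from huggingface_hub import login

    login(token=get_hf_token(env_var))
```

Calling `login_from_env()` would then take the place of a `login(token=...)` call with a hard-coded string.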
Test code to use the model
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    logging,
)
# The model that you want to get from the Hugging Face hub
model_name = "aquinovo/llama-2-7b-dexter-2k"  # change as required (e.g. aquinovo/llama-2-70b-dexter-13k, aquinovo/llama-2-70b-dexter-2k)
################################################################################
# bitsandbytes parameters
################################################################################
# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False
# Load the entire model on the GPU 0
device_map = {"": 0}
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
# Log in to the Hugging Face hub; replace "hf_token" with your own access token
from huggingface_hub import login
login(token="hf_token")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
# Ignore warnings
logging.set_verbosity(logging.CRITICAL)
# Run the text generation pipeline with the fine-tuned model
prompt = "At what stage does secondary cell wall deposition take place for cotton fiber development?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=2000)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
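By default the text-generation pipeline returns the prompt together with the completion, so the printed text repeats the question. A small helper can keep only the answer. This is a minimal sketch: the function names are made up for this example, and it assumes the `[/INST]` marker survives in the decoded text.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the [INST] ... [/INST] markers used in the test code."""
    return f"<s>[INST] {question} [/INST]"


def extract_answer(generated_text: str) -> str:
    """Return the text after the last [/INST] marker, or the whole text if it is absent."""
    _, marker, answer = generated_text.rpartition("[/INST]")
    return answer.strip() if marker else generated_text.strip()


# Placeholder string standing in for result[0]['generated_text']
sample_output = "[INST] What is X? [/INST] X is a placeholder answer."
print(extract_answer(sample_output))
```

In the test code, `extract_answer(result[0]['generated_text'])` would print only the model's reply.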