nyc-savvy-llama2-7b

Essentials:

Based on LLaMa2-7b-hf (version 2, 7B params)
Used QLoRA to fine-tune on 13k rows of /r/AskNYC formatted as Human/Assistant exchanges
Released the adapter weights
Merged quantized-then-dequantized LLaMa2 and the adapter weights to produce this full-sized model

Prompt options

Here is the template used in training. It starts with "### Human: " (with a trailing space), then the post title and content, then "### Assistant: " (no leading space, one trailing space).

### Human: Post title - post content### Assistant:

For example:

### Human: Where can I find a good bagel? - We are in Brooklyn### Assistant: Anywhere with fresh-baked bagels and lots of cream cheese options.

Based on QLoRA's Gradio example, it helps to prepend a more assistant-like prompt, especially if you follow their lead and use a chat format:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
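A minimal sketch of a prompt builder for the template above, with the chat preface as an optional prefix. The function name `build_prompt` and the constant `SYSTEM_PROMPT` are my own for illustration; joining the preface to the template with a newline follows the test snippet later in this README.

```python
# Chat preface from QLoRA's Gradio example (quoted above).
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(title, content, system_prompt=None):
    """Format a post as '### Human: title - content### Assistant: '."""
    # No space before '### Assistant:', one space after it, as in training.
    prompt = f"### Human: {title} - {content}### Assistant: "
    if system_prompt:
        # Assumption: preface and template are separated by a single newline.
        prompt = f"{system_prompt}\n{prompt}"
    return prompt


print(build_prompt("Where can I find a good bagel?", "We are in Brooklyn"))
```

Pass `system_prompt=SYSTEM_PROMPT` to get the chat-style version.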

Training data

Collected one month of posts to /r/AskNYC from each year 2015-2019 (no content after July 2019)
Downloaded from PushShift; kept comments only if their score was >= 3
Originally collected for my GPT-NYC model in spring 2021 - model / blog
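The filtering and formatting steps could be sketched roughly like this. It is not the original collection code: the input layout (`title`, `selftext`, `comments`, `body`, `score`) is a hypothetical shape for already-downloaded posts, and the `text` key is my assumption about what the `oasst1` dataset format in the QLoRA script reads.

```python
import json

MIN_SCORE = 3  # keep comments with score >= 3, as described above


def to_training_rows(posts, min_score=MIN_SCORE):
    """Turn posts and their comments into Human/Assistant training rows.

    `posts` is an iterable of dicts with hypothetical keys:
    'title', 'selftext', and 'comments' (a list of {'body', 'score'} dicts).
    """
    rows = []
    for post in posts:
        for comment in post["comments"]:
            if comment["score"] < min_score:
                continue  # drop low-scoring comments
            text = (
                f"### Human: {post['title']} - {post['selftext']}"
                f"### Assistant: {comment['body']}"
            )
            rows.append({"text": text})  # assumed key for the oasst1 format
    return rows


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```

Each qualifying comment becomes its own row, so a post with several good answers yields several training examples.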

Training script

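Takes about 2 hours on Colab once you get it right. The QLoRA script is driven by max_steps (when it is set, num_train_epochs is ignored), but I wanted to stop at 1 epoch, so max_steps is sized to roughly one pass over the data: 760 steps x 16 examples per step is about 12k rows.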

git clone https://github.com/artidoro/qlora
cd qlora

pip3 install -r requirements.txt --quiet

python3 qlora.py \
    --model_name_or_path ../llama-2-7b-hf \
    --use_auth \
    --output_dir ../nyc-savvy-llama2-7b \
    --logging_steps 10 \
    --save_strategy steps \
    --data_seed 42 \
    --save_steps 500 \
    --save_total_limit 40 \
    --dataloader_num_workers 1 \
    --group_by_length False \
    --logging_strategy steps \
    --remove_unused_columns False \
    --do_train \
    --num_train_epochs 1 \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_modules all \
    --double_quant \
    --quant_type nf4 \
    --bf16 \
    --bits 4 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type constant \
    --gradient_checkpointing \
    --dataset /content/gpt_nyc.jsonl \
    --dataset_format oasst1 \
    --source_max_len 16 \
    --target_max_len 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --max_steps 760 \
    --learning_rate 0.0002 \
    --adam_beta2 0.999 \
    --max_grad_norm 0.3 \
    --lora_dropout 0.1 \
    --weight_decay 0.0 \
    --seed 0
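To pick a max_steps value that lands near one epoch, the arithmetic is rows divided by the effective batch size (per-device batch size x gradient accumulation steps x number of devices). A small sketch follows; the helper name is mine, and 12,160 is simply the row count implied by 760 steps at 16 examples per step, not a figure reported anywhere above.

```python
import math


def steps_for_epochs(num_rows, per_device_batch_size, grad_accum_steps,
                     epochs=1.0, num_devices=1):
    """Approximate optimizer steps needed to see `num_rows` rows `epochs` times."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return math.ceil(epochs * num_rows / effective_batch)


# Batch size 1 and 16 accumulation steps, as in the command above.
print(steps_for_epochs(12_160, 1, 16))  # 760
print(steps_for_epochs(13_000, 1, 16))  # 813
```

The trainer may round slightly differently at the epoch boundary, so treat the result as a close estimate rather than an exact match.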

Merging it back

What you get in the output_dir is an adapter model, not a full model. Here's ours. Cool, but not as easy to drop into QLoRA's script.

Two options for merging:

The included peftmerger.py script merges the adapter and saves the model.
Chris Hayduk produced a script to quantize then de-quantize the base model before merging a QLoRA adapter. This requires bitsandbytes and a GPU.
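For reference, the first option can look something like the sketch below. This is not the contents of peftmerger.py, just a minimal version using PEFT's `PeftModel.from_pretrained` and `merge_and_unload`; the function name and the choice of float16 are my assumptions, and it does not do the quantize-then-dequantize step from the second option.

```python
def merge_adapter(base_path, adapter_path, out_path):
    """Load the base model, apply the LoRA adapter, and save a merged full model."""
    # Imports are inside the function so the heavy libraries load only when it runs.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: load the base weights in float16 to keep memory use down.
    base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)

    # Attach the adapter, then fold its weights into the base layers.
    model = PeftModel.from_pretrained(base, adapter_path)
    merged = model.merge_and_unload()

    # Save the merged model plus the tokenizer so the folder is self-contained.
    merged.save_pretrained(out_path)
    AutoTokenizer.from_pretrained(base_path).save_pretrained(out_path)
```

With the paths from the training command, a call would look like `merge_adapter("../llama-2-7b-hf", adapter_dir, merged_dir)`, where `adapter_dir` is the adapter checkpoint folder and `merged_dir` is wherever you want the full model written.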

Testing that the model is NYC-savvy

You might wonder whether the model actually learned anything about NYC or is the same old LLaMa2. Using a prompt that gives no NYC clues, try this from the pefttester.py script in this repo:

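from transformers import AutoModelForCausalLM, LlamaTokenizer, StoppingCriteriaList

# model_name: local path or Hub ID of the merged model
m = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")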
tok = LlamaTokenizer.from_pretrained(model_name)

messages = "A chat between a curious human and an assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
messages += "### Human: What museums should I visit? - My kids are aged 12 and 5"
messages += "### Assistant: "

# Move the prompt tokens to the same device as the model
input_ids = tok(messages, return_tensors="pt").input_ids.to(m.device)

# ... (elided; `stop`, the stopping criterion used below, must be defined here)

temperature = 0.7
top_p = 0.9
top_k = 0
repetition_penalty = 1.1

op = m.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    temperature=temperature,
    do_sample=temperature > 0.0,
    top_p=top_p,
    top_k=top_k,
    repetition_penalty=repetition_penalty,
    stopping_criteria=StoppingCriteriaList([stop]),
)
for line in op:
    print(tok.decode(line))
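The decoded text includes the prompt and special tokens, and generation may run on into a new "### Human:" turn. A small post-processing helper can pull out just the first reply. This is my own sketch, not part of pefttester.py, and it assumes the markers appear exactly as in the training template.

```python
def extract_reply(decoded):
    """Return the text between the first '### Assistant:' and the next '### Human:'."""
    # Drop everything up to and including the first assistant marker.
    _, _, reply = decoded.partition("### Assistant:")
    # Cut off any follow-up turn the model started on its own.
    reply, _, _ = reply.partition("### Human:")
    # Remove the end-of-sequence token if it was decoded, then trim whitespace.
    return reply.replace("</s>", "").strip()


sample = "<s> ### Human: What museums should I visit? - My kids are aged 12 and 5### Assistant: Try the AMNH.### Human: Thanks"
print(extract_reply(sample))  # Try the AMNH.
```

If the marker is missing, the function returns an empty string, which makes a malformed generation easy to spot.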