mpt-7b-dolphin 🐬
This instruction-following model was built via full-parameter finetuning of mpt-7b on the first 100k rows of the ehartford/dolphin dataset (an open-source implementation of Microsoft's Orca). The finetuning was executed on a single H100 (80 GB PCIe) for roughly 12 hours on the Lambda Labs platform.
- Model license: Apache 2.0
- Basic usage: notebook
- Finetuning code: script
- Loss curves: plot
- Runtime stats: table
Example prompts and responses
Example 1:
User:
You are a helpful assistant. Write me a numbered list of things to do in New York City.\n
mpt-7b-dolphin:
- Visit the Statue of Liberty: The Statue of Liberty is a symbol of freedom and democracy, and it is a must-see attraction in New York City.
- Take a walk in Central Park: Central Park is a massive green space in the heart of Manhattan, and it is a great place to relax, enjoy nature, and people-watch.
- Visit the Metropolitan Museum of Art: The Met is one of the world's largest and most renowned art museums, housing an extensive collection of art from around the world.
- Take a stroll through Times Square: Times Square is the heart of the city, and it is a must-see for its iconic billboards, bright lights, and bustling energy.
- Visit the Empire State Building: The Empire State Building is a famous skyscraper in Midtown Manhattan, offering panoramic views of the city from its observation deck.
- Explore the Museum of Modern Art (MoMA): MoMA is a renowned art museum in Midtown Manhattan, showcasing modern and contemporary art from around the world.
- Visit the 9/11 Memorial and Museum: The 9/11 Memorial and Museum is a somber but important attraction in Lower Manhattan, commemorating the victims of the September 11th attacks.
- Take a ferry ride to the Statue of Liberty: A ferry ride to the Statue of Liberty is a great way to see the city from the water
<br>
Example 2:
User:
You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out in a step by step fashion to make sure we have the right answer.\n
mpt-7b-dolphin:
Step 1: Identify the days of the week when the barber works.
The barber works on Mondays, Wednesdays, and Fridays.
Step 2: Determine the day of the week on which Daniel went for a haircut.
Daniel went for a haircut on Sunday.
Step 3: Compare the day of the week on which Daniel went for a haircut to the days when the barber works.
Sunday is not one of the days when the barber works.
Step 4: Determine if the statement makes logical sense.
The statement does not make logical sense because Daniel went for a haircut on a day when the barber does not work.
Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day when the barber works, such as Monday, Wednesday, or Friday.
<br>
Example 3:
User:
You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n
mpt-7b-dolphin:
Subject: Invitation to Dinner Party on Friday
Dear Friend,
I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.
The dinner will be held at my home, and the time is from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.
The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.
Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!
Yours,
Your Friendly Assistant
Model Description
The architecture is a modification of a standard decoder-only transformer.
The model has been modified from a standard transformer in the following ways:
- It uses FlashAttention
- It uses ALiBi (Attention with Linear Biases) and does not use positional embeddings
- It does not use biases
| Hyperparameter | Value |
| --- | --- |
| n_parameters | 6.65B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
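As a quick sanity check, most of these values can be read off the model's config. Below is a minimal sketch, assuming the field names exposed by the custom MPT config class shipped with this repo (e.g. d_model, n_heads, n_layers, vocab_size, max_seq_len):

import transformers

# Inspect the remote MPT config; field names assume the custom config class in this repo.
config = transformers.AutoConfig.from_pretrained("dfurman/mpt-7b-dolphin", trust_remote_code=True)
print(config.n_layers, config.n_heads, config.d_model)  # expected: 32 32 4096
print(config.vocab_size, config.max_seq_len)  # expected: 50432 2048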
Finetuning Description
This model was trained on a single H100 (80 GB PCIe) for about 12 hours using the Lambda Labs platform.
Run: July 5, 2023
- args: {'lr': 2e-5, 'num_epochs': 1, 'seed': 43}
- logs: {'train_runtime': 61098.1062, 'train_samples_per_second': 1.637, 'train_steps_per_second': 0.409, 'train_loss': 1.4058428125, 'epoch': 1.0}
The above loss curve was generated from the tfevents log at link.
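For reference, here is a minimal Hugging Face TrainingArguments sketch that mirrors the run arguments above. It is an illustrative approximation, not the exact finetuning script; the per-device batch size of 4 is inferred from the logs (train_samples_per_second / train_steps_per_second ≈ 1.637 / 0.409 ≈ 4), and the output path and bf16 setting are assumptions.

from transformers import TrainingArguments

# Illustrative sketch only -- mirrors the logged run args (lr=2e-5, num_epochs=1, seed=43).
training_args = TrainingArguments(
    output_dir="./mpt-7b-dolphin-ft",  # hypothetical output path
    learning_rate=2e-5,
    num_train_epochs=1,
    seed=43,
    per_device_train_batch_size=4,  # inferred from the logged samples/s vs. steps/s
    bf16=True,  # assumption: bfloat16 mixed precision on the H100
    logging_steps=100,
)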
PreTraining Data
For more details on the pretraining process, see MPT-7B.
The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.
Limitations and Biases
The following language is modified from EleutherAI's GPT-NeoX-20B
This model can produce factually incorrect output, and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
How to Use
Basic Usage: Notebook
Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.
It includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.
First, install the package dependencies:
!pip install -q -U transformers einops accelerate torch
!pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
Basic model loading:
import transformers
model = transformers.AutoModelForCausalLM.from_pretrained(
    'dfurman/mpt-7b-dolphin',
    trust_remote_code=True,
    device_map="auto",
)
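Once loaded, you can sanity-check the parameter count against the hyperparameter table above:

# Count parameters; this should come out to roughly 6.65B for this model.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")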
To use the optimized triton implementation of FlashAttention, you can load the model on GPU with attn_impl='triton' and with bfloat16 precision:
import torch
import transformers
name = 'dfurman/mpt-7b-dolphin'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'meta'
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example:
import transformers
name = 'dfurman/mpt-7b-dolphin'
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True,
    device_map="auto",
)
This model was trained with the EleutherAI/gpt-neox-20b tokenizer, which can be loaded directly from this model's repo:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-dolphin')
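Note that the gpt-neox-20b tokenizer does not define a pad token by default. If a pad_token_id is needed (as in the generation helper below), a common workaround is to reuse the eos token:

# The gpt-neox-20b tokenizer ships without a pad token; fall back to eos if needed.
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token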
Once loaded, the model and tokenizer can be used with the following code:
import transformers
import torch
# text generation function
def mpt_generate(
    model: transformers.AutoModelForCausalLM,
    tokenizer: transformers.AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 0.92,
) -> str:
    """Generate a response to a prompt with the model.

    Uses Hugging Face GenerationConfig defaults where not overridden:
    https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig

    Args:
        model (transformers.AutoModelForCausalLM): Model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for the model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens to generate after the prompt.
            Defaults to 128.
        temperature (float, optional): The value used to modulate the next token
            probabilities. Defaults to 0.92.

    Returns:
        str: The generated text, with the prompt removed.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(device)
    # When running Torch modules in lower precision, it is best practice to use
    # the torch.autocast context manager.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # enable sampling so the temperature setting takes effect
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )  # grab output in natural language
    return decoded_output[len(prompt):]  # remove prompt from output
We can now generate text! For example:
prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n"
response = mpt_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=150,
    temperature=0.92,
)
print(response)
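Alternatively, the already-loaded model and tokenizer can be wrapped in a standard transformers text-generation pipeline. A minimal sketch (sampling is enabled here so the temperature setting takes effect):

import transformers

# Wrap the loaded model and tokenizer in a text-generation pipeline.
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe(
    prompt,
    max_new_tokens=150,
    do_sample=True,
    temperature=0.92,
    return_full_text=False,  # return only the completion, not the prompt
)
print(output[0]["generated_text"])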
Runtime tests
| runtime / 50 tokens (sec) | GPU | attn | torch dtype | VRAM (GB) |
| --- | --- | --- | --- | --- |
| 0.61 | 1x H100 (80 GB PCIe) | triton | bfloat16 | 12 |
| 0.67 | 1x H100 (80 GB PCIe) | torch | bfloat16 | 12 |
| 1.17 | 1x A100 (40 GB SXM) | triton | bfloat16 | 13 |
| 1.36 | 1x A100 (40 GB SXM) | torch | bfloat16 | 13 |
| 2.25 | 1x V100 (16 GB SXM) | torch | float16 | 13 |
| 3.75 | 1x V100 (16 GB SXM) | torch | fp4 | 4 |
| 4.84 | 1x Tesla T4 (15 GB) | torch | float16 | 13 |
| 8.77 | 1x Tesla T4 (15 GB) | torch | fp4 | 4 |
The above runtime stats were generated from this notebook.
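The fp4 rows above correspond to 4-bit weight quantization. A minimal sketch of loading the model in 4-bit via bitsandbytes (assuming bitsandbytes is installed alongside a recent transformers release) looks like:

import torch
import transformers

# 4-bit (fp4) quantized loading via bitsandbytes; roughly matches the ~4 GB VRAM rows above.
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    "dfurman/mpt-7b-dolphin",
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map="auto",
)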
Acknowledgements
This model was finetuned by Daniel Furman on July 5, 2023 and is intended primarily for research purposes.
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
MosaicML Citation for MPT-7B
@online{MosaicML2023Introducing,
author = {MosaicML NLP Team},
title = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
year = {2023},
url = {https://www.mosaicml.com/blog/mpt-7b},
note = {Accessed: 2023-07-02},
urldate = {2023-07-02}
}