
mpt-7b-dolphin 🐬

This instruction model was built via full parameter finetuning of mpt-7b on the first 100k rows of ehartford/dolphin (an open-source implementation of Microsoft's Orca). Finetuning was executed on a single H100 (80 GB PCIe) for roughly 12 hours on the Lambda Labs platform.

Example prompts and responses

Example 1:

User:

You are a helpful assistant. Write me a numbered list of things to do in New York City.\n

mpt-7b-dolphin:

  1. Visit the Statue of Liberty: The Statue of Liberty is a symbol of freedom and democracy, and it is a must-see attraction in New York City.
  2. Take a walk in Central Park: Central Park is a massive green space in the heart of Manhattan, and it is a great place to relax, enjoy nature, and people-watch.
  3. Visit the Metropolitan Museum of Art: The Met is one of the world's largest and most renowned art museums, housing an extensive collection of art from around the world.
  4. Take a stroll through Times Square: Times Square is the heart of the city, and it is a must-see for its iconic billboards, bright lights, and bustling energy.
  5. Visit the Empire State Building: The Empire State Building is a famous skyscraper in Midtown Manhattan, offering panoramic views of the city from its observation deck.
  6. Explore the Museum of Modern Art (MoMA): MoMA is a renowned art museum in Midtown Manhattan, showcasing modern and contemporary art from around the world.
  7. Visit the 9/11 Memorial and Museum: The 9/11 Memorial and Museum is a somber but important attraction in Lower Manhattan, commemorating the victims of the September 11th attacks.
  8. Take a ferry ride to the Statue of Liberty: A ferry ride to the Statue of Liberty is a great way to see the city from the water


Example 2:

User:

You are a helpful assistant. Daniel is in need of a haircut. His barber works Mondays, Wednesdays, and Fridays. So, Daniel went in for a haircut on Sunday. Does this make logical sense? Let's work this out in a step by step fashion to make sure we have the right answer.\n

mpt-7b-dolphin:

Step 1: Identify the days of the week when the barber works.

The barber works on Mondays, Wednesdays, and Fridays.

Step 2: Determine the day of the week on which Daniel went for a haircut.

Daniel went for a haircut on Sunday.

Step 3: Compare the day of the week on which Daniel went for a haircut to the days when the barber works.

Sunday is not one of the days when the barber works.

Step 4: Determine if the statement makes logical sense.

The statement does not make logical sense because Daniel went for a haircut on a day when the barber does not work.

Answer: No, the statement does not make logical sense. Daniel should have gone for a haircut on a day when the barber works, such as Monday, Wednesday, or Friday.


Example 3:

User:

You are a helpful assistant. Write a short email inviting my friends to a dinner party on Friday. Respond succinctly.\n

mpt-7b-dolphin:

Subject: Invitation to Dinner Party on Friday

Dear Friend,

I hope this email finds you well. I am writing to invite you to a dinner party that I am hosting on Friday. The party will be a casual gathering, and I would love to have you join us.

The dinner will be held at my home, and the time is from 7:00 pm to 10:00 pm. The address is 123 Main Street, San Francisco, CA 94122.

The dinner will be a potluck, so please bring a dish to share. I will provide the main course, so you can focus on bringing a side dish, dessert, or appetizer.

Please RSVP by replying to this email or by calling me at 555-555-5555. I look forward to seeing you on Friday!

Yours,

Your Friendly Assistant

Model Description

The architecture is a modification of a standard decoder-only transformer.

The model has been modified from a standard transformer in the following ways: it uses FlashAttention, it uses ALiBi (Attention with Linear Biases) rather than positional embeddings, and it does not use biases. Key hyperparameters are listed below.

| Hyperparameter | Value |
|---|---|
| n_parameters | 6.65B |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 50432 |
| sequence length | 2048 |
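
These values can be read straight off the model config once it is downloaded. A minimal sketch (the attribute names such as n_layers, d_model, and max_seq_len are assumed from the custom MPT config class):

import transformers

# Load only the config; trust_remote_code is needed for the custom MPT architecture.
config = transformers.AutoConfig.from_pretrained('dfurman/mpt-7b-dolphin', trust_remote_code=True)

# Print the hyperparameters listed in the table above (attribute names assumed).
print(config.n_layers, config.n_heads, config.d_model, config.vocab_size, config.max_seq_len)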

Finetuning Description

This model was trained on a single H100 (80 GB PCIe) for about 12 hours using the Lambda Labs platform.

Run: July 5, 2023
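
The exact training script is not included in this card. As a rough sketch, the finetuning subset could be loaded as follows (the split-slicing call and the dataset's default file layout are assumptions; adjust data_files if the repository layout differs):

from datasets import load_dataset

# First 100k rows of the open-source Dolphin dataset (assumed default "train" split).
dataset = load_dataset("ehartford/dolphin", split="train[:100000]")

print(dataset[0])  # inspect a single instruction/response record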

(Figure: finetuning loss curve)

The above loss curve was generated from the run's tfevents log (link).

Pretraining Data

For more details on the pretraining process, see MPT-7B.

The data was tokenized using the EleutherAI/gpt-neox-20b tokenizer.

Limitations and Biases

The following language is modified from EleutherAI's GPT-NeoX-20B model card.

This model can produce factually incorrect output, and should not be relied on to produce factually accurate information. This model was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

How to Use

Basic Usage: Notebook

Note: This model requires that trust_remote_code=True be passed to the from_pretrained method. This is because we use a custom model architecture that is not yet part of the transformers package.

The custom architecture includes options for many training efficiency features such as FlashAttention (Dao et al. 2022), ALiBi, QK LayerNorm, and more.

First, install the package dependencies:

!pip install -q -U transformers einops accelerate torch
!pip install -q -U triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python

Basic model loading:

import transformers

model = transformers.AutoModelForCausalLM.from_pretrained(
  'dfurman/mpt-7b-dolphin',
  trust_remote_code=True,
  device_map="auto",
)

To use the optimized triton implementation of FlashAttention, you can load the model on GPU with attn_impl='triton' and with bfloat16 precision:

import torch
import transformers

name = 'dfurman/mpt-7b-dolphin'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'
config.init_device = 'meta'

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  device_map="auto",
)

Although the model was trained with a sequence length of 2048, ALiBi enables users to increase the maximum sequence length during finetuning and/or inference. Note that larger context windows require more available VRAM. For example:

import transformers

name = 'dfurman/mpt-7b-dolphin'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 4096 # (input + output) tokens can now be up to 4096

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  trust_remote_code=True,
  device_map="auto",
)

This model was trained with the EleutherAI/gpt-neox-20b tokenizer. It can be loaded directly from this model's repo:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('dfurman/mpt-7b-dolphin')
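
Depending on the tokenizer version, a pad token may not be defined out of the box; if so, a common workaround (an assumption here, not part of the original setup) is to reuse the EOS token:

# Fall back to the EOS token if no pad token is set (generate() below expects a pad_token_id).
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token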

Once loaded, the model and tokenizer can be used with the following code:

import transformers
import torch

# text generation function
def mpt_generate(
    model: transformers.AutoModelForCausalLM,
    tokenizer: transformers.AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 128,
    temperature: float = 0.92,
) -> str:
    """
    Initialize the pipeline
    Uses Hugging Face GenerationConfig defaults
        https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationConfig
    Args:
        model (transformers.AutoModelForCausalLM): Model for text generation
        tokenizer (transformers.AutoTokenizer): Tokenizer for model
        prompt (str): Prompt for text generation
        max_new_tokens (int, optional): Max new tokens after the prompt to generate.
            Defaults to 128.
        temperature (float, optional): The value used to modulate the next token probabilities.
            Defaults to 1.0
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False,
    ).to(device)

    # when running Torch modules in lower precision, it is best practice to use the torch.autocast context manager.
    with torch.autocast("cuda", dtype=torch.bfloat16):
        response = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,  # sampling must be enabled for temperature to take effect
            temperature=temperature,
            return_dict_in_generate=True,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )

    decoded_output = tokenizer.decode(
        response["sequences"][0],
        skip_special_tokens=True,
    )  # grab output in natural language

    return decoded_output[len(prompt) :]  # remove prompt from output

We can now generate text! For example:

prompt = "You are a helpful assistant. Here is a recipe for vegan banana bread:\n"

response = mpt_generate(
    model,
    tokenizer,
    prompt,
    max_new_tokens=150,
    temperature=0.92,
)

print(response)

Runtime tests

| runtime / 50 tokens (sec) | GPU | attn | torch dtype | VRAM (GB) |
|---|---|---|---|---|
| 0.61 | 1x H100 (80 GB PCIe) | triton | bfloat16 | 12 |
| 0.67 | 1x H100 (80 GB PCIe) | torch | bfloat16 | 12 |
| 1.17 | 1x A100 (40 GB SXM) | triton | bfloat16 | 13 |
| 1.36 | 1x A100 (40 GB SXM) | torch | bfloat16 | 13 |
| 2.25 | 1x V100 (16 GB SXM) | torch | float16 | 13 |
| 3.75 | 1x V100 (16 GB SXM) | torch | fp4 | 4 |
| 4.84 | 1x Tesla T4 (15 GB) | torch | float16 | 13 |
| 8.77 | 1x Tesla T4 (15 GB) | torch | fp4 | 4 |

The above runtime stats were generated from this notebook.
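
The fp4 rows correspond to 4-bit quantized loading via bitsandbytes. A minimal sketch of how such a model might be loaded (the exact quantization settings used in the notebook are not reproduced here; these values are assumptions):

import torch
import transformers

# 4-bit (fp4) loading via bitsandbytes; also requires `pip install -q -U bitsandbytes`.
quantization_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    'dfurman/mpt-7b-dolphin',
    quantization_config=quantization_config,
    trust_remote_code=True,
    device_map="auto",
)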

Acknowledgements

This model was finetuned by Daniel Furman on July 5, 2023, and is intended primarily for research purposes.

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

MosaicML Citation for MPT-7B

@online{MosaicML2023Introducing,
    author    = {MosaicML NLP Team},
    title     = {Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs},
    year      = {2023},
    url       = {www.mosaicml.com/blog/mpt-7b},
    note      = {Accessed: 2023-07-02},
    urldate   = {2023-07-02}
}