LLM LLaMA-13b

Special Notes

Due to the license restrictions of LLaMA, we are not allowed to release the accelerated parameters directly. We would like to work with the community to find a legal way to share lyraLLaMA. If you have any suggestions, please contact us at benbinwu@tencent.com.

Model Card for lyraLLaMA

lyraLLaMA is currently the fastest LLaMA-13b available. Its inference speed reaches 3000+ tokens/s on an A100, up to 6x faster than the torch version.

Among its main features are:

We use the LLaMA-13B model for measurement, but the optimized inference is applicable to LLaMA models of other sizes.

Speed

LLaMA-Ziya-13B

| Version (tokens/s) | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch LLaMA | 31.74 | 289.2 | 521.37 | 775.69 | OOM |
| lyraLLaMA fp16 | 73.2 | 565.6 | 1179.59 | 1795.63 | 3061.27 |
| lyraLLaMA MEMOPT | 104 | 770.5 | 1389.9 | 2390.4 | 3782.1 |

LLaMA-Vicuna-13B

| Version (tokens/s) | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch LLaMA | 24.65 | 167.3 | 322.97 | 407.99 | OOM |
| lyraLLaMA fp16 | 53.67 | 421.38 | 804.31 | 1519.28 | 2679.82 |
| lyraLLaMA MEMOPT | 79.81 | 603.15 | 1117.27 | 1966.52 | 3200.32 |
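
The speedups implied by the two tables can be checked with a few lines of Python. The sketch below hard-codes the figures from the tables, treating them as tokens/s in line with the headline number, and divides each lyraLLaMA figure by the Torch figure at the same batch size. Batch size 64 is skipped because the Torch baseline runs out of memory there. The names `TABLES`, `BATCH_SIZES` and `speedups` are illustrative and not part of lyraLLaMA.

```python
# Throughput figures copied from the tables above (assumed to be tokens/s).
# None marks the Torch out-of-memory (OOM) entry at batch size 64.
BATCH_SIZES = [1, 8, 16, 32, 64]
TABLES = {
    "LLaMA-Ziya-13B": {
        "Torch LLaMA": [31.74, 289.2, 521.37, 775.69, None],
        "lyraLLaMA fp16": [73.2, 565.6, 1179.59, 1795.63, 3061.27],
        "lyraLLaMA MEMOPT": [104, 770.5, 1389.9, 2390.4, 3782.1],
    },
    "LLaMA-Vicuna-13B": {
        "Torch LLaMA": [24.65, 167.3, 322.97, 407.99, None],
        "lyraLLaMA fp16": [53.67, 421.38, 804.31, 1519.28, 2679.82],
        "lyraLLaMA MEMOPT": [79.81, 603.15, 1117.27, 1966.52, 3200.32],
    },
}


def speedups(table, baseline="Torch LLaMA"):
    """Return {version: {batch_size: ratio}} relative to the baseline row."""
    base = table[baseline]
    result = {}
    for version, row in table.items():
        if version == baseline:
            continue
        result[version] = {
            bs: round(value / ref, 2)
            for bs, value, ref in zip(BATCH_SIZES, row, base)
            if ref is not None  # skip batch sizes where the baseline is OOM
        }
    return result


if __name__ == "__main__":
    for model, table in TABLES.items():
        print(model)
        for version, ratios in speedups(table).items():
            print(f"  {version}: {ratios}")
```

Comparing at equal batch size is only one way to read the tables; comparing the best lyraLLaMA figure against the best Torch figure that fits in memory gives a different ratio.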

Docker Environment Recommendation

docker pull nvcr.io/nvidia/pytorch:23.02-py3
docker run --rm -it --gpus all -v "$(pwd)":/lyraLLaMA nvcr.io/nvidia/pytorch:23.02-py3

pip install -r requirements.txt
python demo.py

Uses

from lyra_llama import lyraLLaMA

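# Paths to the converted model weights and the tokenizer directory
model_path = "./models/lamma-13b-1-gpu-fp16.bin"
tokenizer_path = "./models/"
dtype = 'fp16'
# English: "It is about 25 degrees today, with light rain and wind. I want to go
# for a walk outdoors; what clothes, trousers and shoes should I wear?"
prompt = "今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服 裤子鞋子搭配"
max_output_length = 512
memopt_mode = 0  # To use MEMOPT mode, set memopt_mode=1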

model = lyraLLaMA(model_path, tokenizer_path, dtype, memopt_mode)

# Wrap the prompt in the <human>/<bot> chat template
prompt = '<human>:' + prompt.strip() + '\n<bot>:'

# Replicate the prompt to form a batch of size bs
bs = 1
prompts = [prompt] * bs
output_texts = model.generate(
    prompts, output_length=max_output_length,
    top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
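
To get a rough tokens/s figure for a given batch size, the generate call can be wrapped in a timer. The sketch below is a minimal example and not the benchmark method behind the Speed tables, which this card does not describe. `build_prompts`, `measure_throughput` and `fake_generate` are hypothetical helpers; `fake_generate` stands in for `model.generate` so the code runs without a GPU. The estimate assumes every sequence produces exactly `output_length` tokens, so it is an upper bound.

```python
import time


def build_prompts(user_text, batch_size):
    """Wrap the text in the <human>/<bot> template used above and replicate it."""
    prompt = "<human>:" + user_text.strip() + "\n<bot>:"
    return [prompt] * batch_size


def measure_throughput(generate_fn, prompts, output_length):
    """Time one generate call and return (outputs, estimated tokens/s).

    Assumes each sequence emits exactly `output_length` tokens; sequences
    that stop early would make the real figure lower.
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts, output_length=output_length)
    elapsed = time.perf_counter() - start
    total_tokens = len(prompts) * output_length
    return outputs, total_tokens / elapsed


def fake_generate(prompts, output_length):
    """Hypothetical stand-in for model.generate; sleeps instead of decoding."""
    time.sleep(0.01)
    return ["..." for _ in prompts]


if __name__ == "__main__":
    prompts = build_prompts("Hello", batch_size=8)
    outputs, tps = measure_throughput(fake_generate, prompts, output_length=512)
    print(f"{len(outputs)} outputs, about {tps:.0f} tokens/s (stub timing)")
```

With a loaded model, `model.generate` could be passed in place of `fake_generate`, since the usage example calls it as `model.generate(prompts, output_length=...)`. A warm-up call before timing and an average over several runs would likely give steadier numbers.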

Demo Outputs

LLaMA-Ziya-13B

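#### input

今天天气大概 25度,有点小雨,吹着风,我想去户外散步,应该穿什么样的衣服 裤子鞋子搭配

(English: "It is about 25 degrees today, with light rain and wind. I want to go for a walk outdoors; what clothes, trousers and shoes should I wear?")

#### output

今天天气大概25度,有点小雨,吹着风。建议穿上轻便的运动服装,比如:

(English: "It is about 25 degrees today, with light rain and wind. Lightweight sportswear is recommended, for example:")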

LLaMA-Vicuna-13B

#### input

List 3 different machine learning algorithms and explain where they are applicable.

#### output

Outputs:

  1. Linear Regression: Linear regression is a simple, yet powerful algorithm that is used to predict the value of a continuous variable based on one or more input features. It is commonly used in prediction and forecasting tasks, such as predicting the price of a stock or the sales of a product.
  2. Decision Trees: Decision Trees are a type of supervised learning algorithm that can be used for both classification and regression tasks. They work by partitioning the feature space into smaller subspaces, with each subspace corresponding to a leaf node in the tree. Decision Trees are commonly used in applications such as credit risk assessment and customer segmentation.
  3. Support Vector Machines (SVMs): SVMs are a type of supervised learning algorithm that can be used for both classification and regression tasks. They work by finding the best hyperplane that separates the data into different classes. SVMs are commonly used in applications such as image classification and natural language processing.

TODO

  1. Support for int4
  2. Inference with longer contexts
  3. Streaming inference mode

Citation

@Misc{lyraLLaMA2023,
  author =       {Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraLLaMA: Accelerating LLaMA-13b(fp16) to 3000+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraLLaMA}},
  year =         {2023}
}

Report bug