Paper: https://arxiv.org/pdf/2310.06694.pdf
Code: https://github.com/princeton-nlp/LLM-Shearing
Models: Sheared-LLaMA-1.3B, Sheared-LLaMA-2.7B

License: Must comply with license of Llama2 since it's a model derived from Llama2.


Sheared-LLaMA-1.3B is a model pruned and further pre-trained from meta-llama/Llama-2-7b-hf. We dynamically load data from different domains in the RedPajama dataset to prune and contune pre-train the model. We use 0.4B tokens for pruning and 50B tokens for continued pre-training the pruned model. This model can be loaded with HuggingFace via

model = AutoModelForCausalLM.from_pretrained("princeton-nlp/Sheared-LLaMA-1.3B")

Downstream Tasks

We evaluate on an extensive set of downstream tasks including reasoning, reading comprehension, language modeling and knowledge intensive tasks. Our Sheared-LLaMA models outperform existing large language models.

Model # Pre-training Tokens Average Performance
LLaMA2-7B 2T 64.6

1.3B

Model # Pre-training Tokens Average Performance
OPT-1.3B 300B 48.2
Pythia-1.4B 300B 48.9
Sheared-LLaMA-1.3B 50B 51.0

3B

Model # Pre-training Tokens Average Performance
OPT-2.7B 300B 51.4
Pythia-2.8B 300B 52.5
INCITE-Base-3B 800B 54.7
Open-LLaMA-3B-v1 1T 55.1
Open-LLaMA-3B-v2 1T 55.7
Sheared-LLaMA-2.7B 50B 56.7

Bibtex

@article{xia2023sheared,
   title={Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning},
   author={Xia, Mengzhou and Gao, Tianyu, and Zeng, Zhiyuan and Chen, Danqi},
   year={2023}
}