text-generation-inference causal-lm int8 tensorrt ENOT-AutoDL

INT8 GPT-J 6B

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters.

This repository contains a GPT-J 6B ONNX model suitable for building TensorRT INT8+FP32 engines. The model was quantized with the ENOT-AutoDL framework. The code for building the TensorRT engines, along with usage examples, is published on GitHub.

Metrics:

| Metric | TensorRT INT8+FP32 | torch FP16 | torch FP32 |
|---|---|---|---|
| Lambada Acc | 78.46% | 79.53% | - |
| Model size (GB) | 8.5 | 12.1 | 24.2 |
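
As a rough sanity check of the "Model size (GB)" row, the FP16 and FP32 figures can be reproduced from the parameter count alone. The sketch below assumes the parameter count listed in the upstream GPT-J 6B model card (6,053,381,344; not stated in this repository) and 1 GB = 10^9 bytes. The INT8+FP32 figure cannot be derived this way, because it depends on which layers were quantized.

```python
# Back-of-the-envelope check of the "Model size (GB)" row.
# ASSUMPTION: parameter count taken from the upstream GPT-J 6B model card,
# not from this repository.
N_PARAMS = 6_053_381_344


def weights_size_gb(n_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in GB, assuming 1 GB = 1e9 bytes."""
    return n_params * bytes_per_param / 1e9


fp32_gb = weights_size_gb(N_PARAMS, 4)  # 4 bytes per FP32 weight
fp16_gb = weights_size_gb(N_PARAMS, 2)  # 2 bytes per FP16 weight

# Reported size of the mixed-precision model, copied from the table.
int8_fp32_gb = 8.5
reduction_vs_fp16 = 1 - int8_fp32_gb / fp16_gb

print(f"FP32 weights: {fp32_gb:.1f} GB")
print(f"FP16 weights: {fp16_gb:.1f} GB")
print(f"INT8+FP32 is {reduction_vs_fp16:.0%} smaller than FP16")
```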

Test environment

Latency:

| Input sequence length | Number of generated tokens | TensorRT INT8+FP32 (ms) | torch FP16 (ms) | Acceleration |
|---|---|---|---|---|
| 64 | 64 | 1040 | 1610 | 1.55 |
| 64 | 128 | 2089 | 3224 | 1.54 |
| 64 | 256 | 4236 | 6479 | 1.53 |
| 128 | 64 | 1060 | 1619 | 1.53 |
| 128 | 128 | 2120 | 3241 | 1.53 |
| 128 | 256 | 4296 | 6510 | 1.52 |
| 256 | 64 | 1109 | 1640 | 1.49 |
| 256 | 128 | 2204 | 3276 | 1.49 |
| 256 | 256 | 4443 | 6571 | 1.49 |
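
The Acceleration column appears to be the ratio of torch FP16 latency to TensorRT INT8+FP32 latency. The sketch below recomputes that ratio from the rounded millisecond values in the table, together with an average per-token latency. The per-token figure is an illustrative derived quantity, not a number reported by the authors. Because the inputs are rounded, the recomputed ratios may differ from the table in the last digit.

```python
# Rows copied from the latency table:
# (input length, generated tokens, TensorRT INT8+FP32 ms, torch FP16 ms)
ROWS = [
    (64, 64, 1040, 1610),
    (64, 128, 2089, 3224),
    (64, 256, 4236, 6479),
    (128, 64, 1060, 1619),
    (128, 128, 2120, 3241),
    (128, 256, 4296, 6510),
    (256, 64, 1109, 1640),
    (256, 128, 2204, 3276),
    (256, 256, 4443, 6571),
]

# Speedup = baseline latency / optimized latency.
speedups = [torch_ms / trt_ms for _, _, trt_ms, torch_ms in ROWS]

# Average time per generated token for the TensorRT engine
# (includes the cost of processing the prompt, amortized over the output).
trt_ms_per_token = [trt_ms / n_gen for _, n_gen, trt_ms, _ in ROWS]

for (in_len, n_gen, _, _), s, t in zip(ROWS, speedups, trt_ms_per_token):
    print(f"in={in_len:3d} gen={n_gen:3d} speedup={s:.2f} trt={t:.2f} ms/token")
```

Read this way, the speedup stays close to 1.5x across all configurations and shrinks slightly as the input sequence gets longer.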

Test environment

How to use

An inference example and an accuracy test are published on GitHub:

git clone https://github.com/ENOT-AutoDL/ENOT-transformers