Quantizations for Gryphe/MythoMax-L2-13b in the EXL2 format

| Quant          | VRAM estimate | Notes                                        |
|----------------|---------------|----------------------------------------------|
| 4k_hb8_b8      | 18 GB         | Recommended!                                 |
| 4k_hb6_b6      | 15 GB         |                                              |
| 4k_hb6_b5      | 13 GB         | Should fit on 12 GB cards with 2k context    |
| 2k_hb8_b8      | 16 GB         |                                              |
| 2k_hb6_b4.125  | 10 GB         | EXL2 defaults                                |

Breaking down the names:

- `2k` / `4k`: context length used for calibration (2048 or 4096 tokens)
- `hb6` / `hb8`: head bits, the bit depth of the output (head) layer
- `b4.125` through `b8`: average bits per weight for the rest of the model

All quantizations were calibrated with wikitext-2.

You can run a model calibrated at 2k with a 4k context or vice versa. The actual difference between 2k and 4k calibrations appears to be very small.
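
For example, the context length is set at load time rather than fixed by the calibration. A minimal sketch using the exllamav2 Python API (the model path is a hypothetical placeholder):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/MythoMax-L2-13b-exl2-2k_hb6_b6"  # placeholder path
config.prepare()            # reads the quant's config.json
config.max_seq_len = 4096   # run the 2k-calibrated quant with a 4k context

model = ExLlamaV2(config)
model.load()

tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)  # KV cache is allocated for max_seq_len
```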

VRAM estimates were taken with an extremely long chat log in the oobabooga web UI on a 7900 XTX, using nvtop to monitor PyTorch's usage only, then rounded up. Systems with lots of extra background processes may use more. NVIDIA-based systems with Flash Attention 2 will use less VRAM than estimated here.
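
The numbers above were read from nvtop; if you want a rough in-process check of what PyTorch itself has allocated (this also works on ROCm builds, where `torch.cuda` maps to the HIP backend), its allocator counters are one option. A sketch:

```python
import torch

def report_vram(device: int = 0) -> None:
    """Print PyTorch's own VRAM usage, ignoring other processes on the GPU."""
    gib = 1024 ** 3
    print(f"allocated now: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"reserved now:  {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
    print(f"peak alloc:    {torch.cuda.max_memory_allocated(device) / gib:.2f} GiB")

# Call after running a long chat through the model to capture the peak.
report_vram()
```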

The measurement files are provided in the main branch, so you can make your own quants at other bit depths without repeating the 2-3 hour measurement pass.
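
For reference, reusing a measurement with exllamav2's convert.py looks roughly like this (a sketch: the paths and the 5.5 bpw target are placeholders, and flags can differ between exllamav2 versions, so check `python convert.py --help`):

```sh
# Sketch only: quantize to a new bit depth from the provided measurement.
#   -i  directory with the original fp16 model (placeholder path)
#   -o  scratch/working directory
#   -cf output directory for the finished quant
#   -m  reuse measurement.json, skipping the measuring pass
#   -b  target average bits per weight
#   -hb head bits
python convert.py -i /models/MythoMax-L2-13b -o /tmp/exl2-work \
    -cf /models/MythoMax-L2-13b-exl2-b5.5 -m measurement.json -b 5.5 -hb 6
```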