65B LLaMA quantized to 4-bit on the old CUDA branch, no groupsize.