LLaMa text-generation-inference ggml

NOTE: DEPRECIATED, BETTER PEOPLE DO THIS NOW

LLaMa 65B converted to ggml via LLaMa.cpp, then quantized to 4bit.

Legacy is for llama.cpp setups older than https://github.com/ggerganov/llama.cpp/pull/1508, the regular is faster but does not work on old versions.

I recommend the following settings when running as a good starting point: main.exe -m ggml-LLaMa-65B-q4_0.bin -n -1 -t 32 -c 2048 --temp 0.7 --repeat_penalty 1.2 --mirostat 2 --interactive-first --color

Be aware that LLaMa is a text generation model, not a conversational one, and as such you will have to prompt it differently than, for example, Vicuna or ChatGPT.