The version dubbed Spoonbill Garuda uses the instruction-tuned sugiv/garuda-from-llama2-7B-chat as its language model. Spoonbill Garuda is also a vision-language model: it was additionally trained on visual instruction datasets (a limited subset of those from Otter).

@article{li2023mimicit,
  title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
  author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
  year={2023},
  eprint={2306.05425},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}