# Sharded BLIP-2 Model Card - flan-t5-xl

<a href="https://colab.research.google.com/gist/pszemraj/0822b7f28b14405f10cfd382296873de/blip2-flan-t5-xl-sharded-example.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> </a>

This is a sharded version of `blip2-flan-t5-xl`, a BLIP-2 model that leverages Flan-T5-XL for image-to-text tasks such as image captioning and visual question answering.

## Usage

Refer to the original model card for details, or see the accompanying blog post. Here is how you can use it on CPU:

### Install

Requires the current `main` branch of `transformers` (at the time of writing):

```bash
pip install accelerate git+https://github.com/huggingface/transformers.git -U -q
```

### Use

The example below runs on CPU; check out the original model card/blog for fp16 and int8 usage.

```python
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_name = "ethzanalytics/blip2-flan-t5-xl-sharded"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(model_name)

# load an example image
img_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")

# visual question answering: pass both the image and a question
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
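
For plain image captioning (rather than visual question answering), a minimal sketch is to call the processor with the image only, so the model generates a caption instead of an answer. This reuses `processor`, `model`, and `raw_image` from the snippet above:

```python
# image captioning: pass only the image, without a question
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```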
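
If you have a GPU, the sharded checkpoint can also be loaded in half precision or 8-bit, as covered in the original model card/blog. Below is a minimal sketch, assuming a CUDA device and, for the 8-bit path, `bitsandbytes` installed:

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_name = "ethzanalytics/blip2-flan-t5-xl-sharded"
processor = Blip2Processor.from_pretrained(model_name)

# fp16: load the weights in half precision and let accelerate place them
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# int8 alternative (requires bitsandbytes):
# model = Blip2ForConditionalGeneration.from_pretrained(
#     model_name, load_in_8bit=True, device_map="auto"
# )

# move the inputs to the same device/dtype before generating, e.g.:
# inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
```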