FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions

A framework designed to generate semantically rich image captions.

Running the model

Our BLIP-based model can be run with the following code:

import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the processor and the captioning model from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("noamrot/FuseCap")
model = BlipForConditionalGeneration.from_pretrained("noamrot/FuseCap").to(device)

# Download the example image and convert it to RGB.
img_url = 'https://huggingface.co/spaces/noamrot/FuseCap/resolve/main/bike.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Condition generation on a short text prompt.
text = "a picture of "
inputs = processor(raw_image, text, return_tensors="pt").to(device)

# Generate a caption with beam search (3 beams) and decode it to text.
out = model.generate(**inputs, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
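
To caption several images in one call, the snippet above can be wrapped in a small helper. The following is only a sketch: `load_fusecap` and `caption_images` are hypothetical names introduced here for illustration and are not part of the released code. It assumes every image shares the same prompt, so the tokenized prompts have equal length and no padding is needed.

```python
def load_fusecap(model_id="noamrot/FuseCap"):
    """Load the processor and model (hypothetical helper).

    Imports are local so that caption_images below has no hard dependencies.
    """
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(model_id).to(device)
    return processor, model, device


def caption_images(images, processor, model, device="cpu",
                   prompt="a picture of ", num_beams=3):
    """Caption a batch of RGB PIL images with one shared prompt (hypothetical helper)."""
    images = list(images)
    if not images:
        return []
    # One copy of the prompt per image; identical prompts tokenize to equal lengths.
    inputs = processor(images=images, text=[prompt] * len(images),
                       return_tensors="pt").to(device)
    out = model.generate(**inputs, num_beams=num_beams)
    # Decode every generated sequence in the batch back to text.
    return processor.batch_decode(out, skip_special_tokens=True)


# Example usage (downloads the model weights on first run):
# from PIL import Image
# processor, model, device = load_fusecap()
# paths = ["first.jpg", "second.jpg"]  # hypothetical local files
# images = [Image.open(p).convert("RGB") for p in paths]
# for caption in caption_images(images, processor, model, device):
#     print(caption)
```

Passing the processor and model in as arguments means they are loaded once and reused across batches, which avoids reloading the weights for every call.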

Upcoming Updates

The official codebase, datasets and trained models for this project will be released soon.

BibTeX

@article{rotstein2023fusecap,
  title={FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions},
  author={Rotstein, Noam and Bensaid, David and Brody, Shaked and Ganz, Roy and Kimmel, Ron},
  journal={arXiv preprint arXiv:2305.17718},
  year={2023}
}