image-to-text

lokibots/vit-patch16-1280-gpt2-large-image-summary

This model generates a text summary of a chart image. It accepts images of up to 1280x768 pixels and produces a summary describing their contents. Note that the model still requires further training.
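As a rough illustration of the size limit above, the following sketch computes dimensions that fit within 1280x768 while preserving aspect ratio. The helper name `fit_within` is hypothetical, and reading 1280x768 as width x height is an assumption. The feature extractor may already resize inputs on its own, so this is only useful for checking or pre-shrinking images.

```python
def fit_within(width, height, max_width=1280, max_height=768):
    """Return (new_width, new_height) scaled down to fit the limits.

    Hypothetical helper: assumes the 1280x768 limit means width x height.
    The aspect ratio is preserved and images are never upscaled.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Use the tighter of the two constraints; cap at 1.0 so small images stay unchanged.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 2560x1536 image maps to (1280, 768), while a 640x480 image is returned unchanged. With Pillow, the result could be passed on as `image.resize(fit_within(*image.size))`, since `image.size` is a (width, height) tuple.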

Sample inference code

from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, GPT2Tokenizer
from PIL import Image

# Load the encoder-decoder model and its feature extractor from the Hub.
model = VisionEncoderDecoderModel.from_pretrained("lokibots/vit-patch16-1280-gpt2-large-image-summary")
feature_extractor = ViTFeatureExtractor.from_pretrained("lokibots/vit-patch16-1280-gpt2-large-image-summary")
# The tokenizer is loaded from the base gpt2-large checkpoint.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

# Replace "image_file" with the path to a chart image.
image = Image.open("image_file").convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

# Beam search with 4 beams, generating at most 1024 tokens.
gen_kwargs = {"max_length": 1024, "num_beams": 4}
output_ids = model.generate(pixel_values, **gen_kwargs)
# preds is a list with one decoded summary string per input image.
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)