image-captioning image-to-text

<!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->

Image-caption-generator

This model is trained on Flickr8k dataset to generate captions given an image.

It achieves the following results on the evaluation set:

Running the model using transformers library

  1. Load the pre-trained model from the model hub

    from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer
    import torch
    from PIL import Image
    
    model_name = "bipin/image-caption-generator"
    
    # load model
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    
  2. Load the image for which the caption is to be generated

    img_name = "flickr_data.jpg"
    img = Image.open(img_name)
    if img.mode != 'RGB':
        img = img.convert(mode="RGB")
    
  3. Pre-process the image

    pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    
  4. Generate the caption

     max_length = 128
     num_beams = 4
    
     # get model prediction
     output_ids = model.generate(pixel_values, num_beams=num_beams, max_length=max_length)
    
     # decode the generated prediction
     preds = tokenizer.decode(output_ids[0], skip_special_tokens=True)
     print(preds)
    

Training procedure

The procedure used to train this model can be found here.

Training hyperparameters

The following hyperparameters were used during training:

Framework versions