<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>


UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

This is the model card of the Multilingual model (12 languages).

The model was trained on a balanced multilingual dataset.

If you need an English-only model, check this.

If you need support for more languages, check this.

## Evaluation

The following metrics were obtained with multimodal re-ranking:

### Monolingual

| Dataset                                    | Recall@1 | Recall@5 | Recall@10 |
| :----------------------------------------- | -------: | -------: | --------: |
| Zero-Shot Flickr                           |    0.558 |    0.813 |     0.874 |
| MS-COCO (train split was in training data) |    0.401 |    0.680 |     0.781 |

### Multilingual (XTD-10)

The metric is Recall@10.

| English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
| ------: | -----: | ------: | -----: | ------: | ------: | -------: | -----: | ------: | ------: | -----: |
|    96.3 |   92.6 |    94.5 |   94.4 |    94.4 |    90.4 |     88.3 |   92.5 |    94.4 |    93.6 |   95.0 |
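Recall@K counts the fraction of queries whose ground-truth match appears among the K highest-scoring candidates. A minimal, dependency-free sketch of the metric (not part of the library; it assumes a square similarity matrix where query `i`'s true match is candidate `i`):

```python
def recall_at_k(similarity, k):
    """similarity[i][j] is the score of candidate j for query i;
    the ground-truth match for query i is candidate i."""
    hits = 0
    for i, row in enumerate(similarity):
        # rank candidate indices by score, highest first
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# toy 3x3 matrix: query 1's true match is only ranked second
sim = [
    [0.9, 0.1, 0.2],
    [0.4, 0.5, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))  # → 0.666...
print(recall_at_k(sim, 2))  # → 1.0
```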

## Installation

```bash
pip install uform
```

## Usage

To load the model:

```python
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual')
```

To encode data:

```python
from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```

To get features:

```python
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
```

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

```python
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
```

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

### Cosine Similarity

```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The similarity will belong to the [-1, 1] range, where 1 means a perfect match.

Pros:

Cons:
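The same formula scales to ranking a whole gallery of candidates against one query. A dependency-free sketch of that ranking, using hypothetical 4-dimensional embeddings in place of real model outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical embeddings: one text query vs. two image candidates
text = [0.1, 0.9, 0.2, 0.4]
images = [[0.1, 0.8, 0.3, 0.5], [0.9, 0.1, 0.1, 0.0]]

# candidate indices, best match first
ranked = sorted(range(len(images)), key=lambda i: cosine(text, images[i]), reverse=True)
print(ranked)  # → [0, 1]
```

If the embeddings are L2-normalized beforehand, the denominator is 1 and ranking reduces to a plain dot product, which is what most vector-search engines exploit.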

### Matching Score

Unlike cosine similarity, unimodal embeddings are not enough: a joint embedding is needed, and the resulting score will belong to the [0, 1] range, where 1 means a perfect match.

```python
score = model.get_matching_scores(joint_embedding)
```

Pros:

Cons:
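The two scores are typically combined: the cheap cosine similarity shortlists candidates, and the more expensive matching score reorders the shortlist — the "multimodal re-ranking" used in the evaluation above. A schematic sketch of that two-stage pipeline with hypothetical precomputed scores (in a real pipeline, `matching_scores` would come from `encode_multimodal` plus `get_matching_scores` for each shortlisted candidate):

```python
def rerank(candidates, cosine_scores, matching_scores, k=2):
    """Two-stage retrieval: shortlist the top-k candidates by cheap
    cosine score, then reorder the shortlist by matching score."""
    shortlist = sorted(candidates, key=lambda c: cosine_scores[c], reverse=True)[:k]
    return sorted(shortlist, key=lambda c: matching_scores[c], reverse=True)

# hypothetical scores for four image ids
cosine_scores = {'a': 0.91, 'b': 0.88, 'c': 0.40, 'd': 0.10}
matching_scores = {'a': 0.55, 'b': 0.72, 'c': 0.95, 'd': 0.05}

print(rerank(['a', 'b', 'c', 'd'], cosine_scores, matching_scores))  # → ['b', 'a']
```

Note that `c` has the highest matching score but never reaches stage two: re-ranking can only reorder what the first stage retrieves, which is the usual trade-off of this design.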