<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>


UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

This is the model card of the Multilingual model (12 languages).

The model was trained on a balanced multilingual dataset.

If you need an English-only model, check this.

If you need support for more languages, check this.

## Evaluation

The following metrics were obtained with multimodal re-ranking:

### Monolingual

| Dataset                                    | Recall@1 | Recall@5 | Recall@10 |
| :----------------------------------------- | -------: | -------: | --------: |
| Zero-Shot Flickr                           |    0.558 |    0.813 |     0.874 |
| MS-COCO (train split was in training data) |    0.401 |    0.680 |     0.781 |

### Multilingual (XTD-10)

The metric is Recall@10.

| English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
| ------: | -----: | ------: | -----: | ------: | ------: | -------: | -----: | ------: | ------: | -----: |
|    96.3 |   92.6 |    94.5 |   94.4 |    94.4 |    90.4 |     88.3 |   92.5 |    94.4 |    93.6 |   95.0 |
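Recall@K counts the fraction of queries whose ground-truth match appears among the K highest-scoring candidates. A minimal, dependency-free sketch of the metric (not part of the library; it assumes a square similarity matrix where query `i`'s true match is candidate `i`):

```python
def recall_at_k(similarity, k):
    """similarity[i][j] is the score of candidate j for query i;
    the ground-truth match for query i is candidate i."""
    hits = 0
    for i, row in enumerate(similarity):
        # rank candidate indices by score, highest first
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# toy 3x3 matrix: query 1's true match is only ranked second
sim = [
    [0.9, 0.1, 0.2],
    [0.4, 0.5, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, 1))  # → 0.666...
print(recall_at_k(sim, 2))  # → 1.0
```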

## Installation

```bash
pip install uform
```

## Usage

To load the model:

```python
import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual')
```

To encode data:

```python
from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
```

To get features:

```python
image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)
```

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

```python
joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)
```

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

### Cosine Similarity

```python
import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The similarity will belong to the [-1, 1] range, where 1 means a perfect match.

Pros:

Cons:
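The same formula scales to ranking a whole gallery of candidates against one query. A dependency-free sketch of that ranking, using hypothetical 4-dimensional embeddings in place of real model outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# hypothetical embeddings: one text query vs. two image candidates
text = [0.1, 0.9, 0.2, 0.4]
images = [[0.1, 0.8, 0.3, 0.5], [0.9, 0.1, 0.1, 0.0]]

# candidate indices, best match first
ranked = sorted(range(len(images)), key=lambda i: cosine(text, images[i]), reverse=True)
print(ranked)  # → [0, 1]
```

If the embeddings are L2-normalized beforehand, the denominator is 1 and ranking reduces to a plain dot product, which is what most vector-search engines exploit.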

### Matching Score

Unlike cosine similarity, unimodal embeddings are not enough: a joint embedding is needed, and the resulting score will belong to the [0, 1] range, where 1 means a perfect match.

```python
score = model.get_matching_scores(joint_embedding)
```

Pros:

Cons:
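The two scores are typically combined: the cheap cosine similarity shortlists candidates, and the more expensive matching score reorders the shortlist — the "multimodal re-ranking" used in the evaluation above. A schematic sketch of that two-stage pipeline with hypothetical precomputed scores (in a real pipeline, `matching_scores` would come from `encode_multimodal` plus `get_matching_scores` for each shortlisted candidate):

```python
def rerank(candidates, cosine_scores, matching_scores, k=2):
    """Two-stage retrieval: shortlist the top-k candidates by cheap
    cosine score, then reorder the shortlist by matching score."""
    shortlist = sorted(candidates, key=lambda c: cosine_scores[c], reverse=True)[:k]
    return sorted(shortlist, key=lambda c: matching_scores[c], reverse=True)

# hypothetical scores for four image ids
cosine_scores = {'a': 0.91, 'b': 0.88, 'c': 0.40, 'd': 0.10}
matching_scores = {'a': 0.55, 'b': 0.72, 'c': 0.95, 'd': 0.05}

print(rerank(['a', 'b', 'c', 'd'], cosine_scores, matching_scores))  # → ['b', 'a']
```

Note that `c` has the highest matching score but never reaches stage two: re-ranking can only reorder what the first stage retrieves, which is the usual trade-off of this design.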