<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>


UForm is a Multi-Modal Inference package designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents into a shared vector space!

This is the model card of the multilingual model, which covers 21 languages.

The model was trained on a balanced multilingual dataset.

If you need an English model, check this.

Evaluation

For all evaluations, the multimodal part was used unless otherwise stated.

Monolingual

| Dataset | Recall@1 | Recall@5 | Recall@10 |
| :-- | --: | --: | --: |
| Zero-Shot Flickr | 0.558 | 0.813 | 0.874 |
| MS-COCO (train split was in training data) | 0.401 | 0.680 | 0.781 |
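
Recall@K, as reported throughout this evaluation, is commonly defined as the share of queries whose correct item appears among the K highest-scoring candidates. The sketch below illustrates that definition under the assumption of exactly one correct item per query; the function name and the toy score matrix are hypothetical, and the exact protocol used for the numbers here may differ.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item is among the top-k candidates."""
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Candidate indices sorted by descending score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores: rows are queries, columns are candidates (made-up values).
similarity = [
    [0.9, 0.1, 0.3],  # correct item 0 is ranked first
    [0.2, 0.4, 0.8],  # correct item 1 is ranked second
    [0.7, 0.6, 0.1],  # correct item 2 is ranked last
]
ground_truth = [0, 1, 2]

for k in (1, 2, 3):
    print(f"Recall@{k} = {recall_at_k(similarity, ground_truth, k):.3f}")
```

Raising K can only keep the metric the same or increase it, which is why Recall@10 is never lower than Recall@1 in the tables.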

Multilingual

XTD-10

The metric is Recall@10.

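| English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
| --: | --: | --: | --: | --: | --: | --: | --: | --: | --: | --: |
| 96.1 | 93.5 | 95.7 | 94.1 | 94.4 | 90.4 | 90.2 | 91.3 | 95.2 | 93.8 | 95.8 |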

COCO-SM

For this evaluation, only the unimodal part was used.

Recall

| Target Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
| :-- | --: | --: | --: | --: | --: | --: | --: |
| Arabic | 22.7 | 31.7 | 44.9 | 57.8 | 55.8 | 69.2 | 274 M |
| Armenian | 5.6 | 22.0 | 14.3 | 44.7 | 20.2 | 56.0 | 4 M |
| Chinese | 27.3 | 32.2 | 51.3 | 59.0 | 62.1 | 70.5 | 1'118 M |
| English | 37.8 | 37.7 | 63.5 | 65.0 | 73.5 | 75.9 | 1'452 M |
| French | 31.3 | 35.4 | 56.5 | 62.6 | 67.4 | 73.3 | 274 M |
| German | 31.7 | 35.1 | 56.9 | 62.2 | 67.4 | 73.3 | 134 M |
| Hebrew | 23.7 | 26.7 | 46.3 | 51.8 | 57.0 | 63.5 | 9 M |
| Hindi | 20.7 | 31.3 | 42.5 | 57.9 | 53.7 | 69.6 | 602 M |
| Indonesian | 26.9 | 30.7 | 51.4 | 57.0 | 62.7 | 68.6 | 199 M |
| Italian | 31.3 | 34.9 | 56.7 | 62.1 | 67.1 | 73.1 | 67 M |
| Japanese | 27.4 | 32.6 | 51.5 | 59.2 | 62.6 | 70.6 | 125 M |
| Korean | 24.4 | 31.5 | 48.1 | 57.8 | 59.2 | 69.2 | 81 M |
| Persian | 24.0 | 28.8 | 47.0 | 54.6 | 57.8 | 66.2 | 77 M |
| Polish | 29.2 | 33.6 | 53.9 | 60.1 | 64.7 | 71.3 | 41 M |
| Portuguese | 31.6 | 32.7 | 57.1 | 59.6 | 67.9 | 71.0 | 257 M |
| Russian | 29.9 | 33.9 | 54.8 | 60.9 | 65.8 | 72.0 | 258 M |
| Spanish | 32.6 | 35.6 | 58.0 | 62.8 | 68.8 | 73.7 | 548 M |
| Thai | 21.5 | 28.7 | 43.0 | 54.6 | 53.7 | 66.0 | 61 M |
| Turkish | 25.5 | 33.0 | 49.1 | 59.6 | 60.3 | 70.8 | 88 M |
| Ukrainian | 26.0 | 30.6 | 49.9 | 56.7 | 60.9 | 68.1 | 41 M |
| Vietnamese | 25.4 | 28.3 | 49.2 | 53.9 | 60.3 | 65.5 | 85 M |
| Mean | 26.5±6.4 | 31.8±3.5 | 49.8±9.8 | 58.1±4.5 | 60.4±10.6 | 69.4±4.3 | - |
| Google Translate | 27.4±6.3 | 31.5±3.5 | 51.1±9.5 | 57.8±4.4 | 61.7±10.3 | 69.1±4.3 | - |
| Microsoft Translator | 27.2±6.4 | 31.4±3.6 | 50.8±9.8 | 57.7±4.7 | 61.4±10.6 | 68.9±4.6 | - |
| Meta NLLB | 24.9±6.7 | 32.4±3.5 | 47.5±10.3 | 58.9±4.5 | 58.2±11.2 | 70.2±4.3 | - |
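
The Mean row appears to aggregate the 21 per-language rows as mean ± sample standard deviation. As a quick check, the sketch below recomputes the OpenCLIP @ 1 entry from the values listed in the table. That the sample (rather than population) standard deviation was used is an inference from the numbers, not something stated in this card, and how the three translator rows were aggregated is not covered here.

```python
import statistics

# OpenCLIP Recall@1 per language, copied from the table in row order
# (Arabic through Vietnamese, English included).
openclip_recall_at_1 = [
    22.7, 5.6, 27.3, 37.8, 31.3, 31.7, 23.7,
    20.7, 26.9, 31.3, 27.4, 24.4, 24.0, 29.2,
    31.6, 29.9, 32.6, 21.5, 25.5, 26.0, 25.4,
]

mean = statistics.mean(openclip_recall_at_1)
spread = statistics.stdev(openclip_recall_at_1)  # sample standard deviation

print(f"{mean:.1f}±{spread:.1f}")  # 26.5±6.4
```

The large spread for OpenCLIP is driven mostly by the Armenian row, which sits far below the other languages.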

NDCG@20

| Target Language | OpenCLIP NDCG | UForm NDCG |
| :-- | --: | --: |
| Arabic | 0.639 | 0.868 |
| Armenian | 0.204 | 0.691 |
| Chinese | 0.731 | 0.880 |
| French | 0.823 | 0.932 |
| German | 0.806 | 0.927 |
| Hebrew | 0.657 | 0.791 |
| Hindi | 0.616 | 0.879 |
| Indonesian | 0.733 | 0.870 |
| Italian | 0.811 | 0.930 |
| Japanese | 0.737 | 0.885 |
| Korean | 0.686 | 0.869 |
| Persian | 0.667 | 0.831 |
| Polish | 0.764 | 0.897 |
| Portuguese | 0.832 | 0.897 |
| Russian | 0.777 | 0.906 |
| Spanish | 0.849 | 0.939 |
| Thai | 0.606 | 0.822 |
| Turkish | 0.701 | 0.898 |
| Ukrainian | 0.704 | 0.851 |
| Vietnamese | 0.697 | 0.818 |
| Mean (all) | 0.716 ± 0.149 | 0.875 ± 0.064 |
| Mean (Google Translate) | 0.732 ± 0.145 | 0.869 ± 0.063 |
| Mean (Microsoft Translator) | 0.730 ± 0.149 | 0.869 ± 0.066 |
| Mean (NLLB) | 0.686 ± 0.158 | 0.888 ± 0.064 |
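
NDCG@20 rewards rankings that place the most relevant items near the top of the first 20 results. Below is a minimal sketch of the standard formula, assuming linear gain and a log2 position discount. How relevance grades were assigned in this evaluation is not stated in this card, so the `relevances` input and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering.

    `relevances` lists the relevance grade of every candidate in ranked order;
    the ideal ordering is the same grades sorted from highest to lowest.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([3, 2, 1], k=20))  # already ideally ordered
print(ndcg_at_k([0, 1], k=20))     # the relevant item is ranked second
```

A score of 1 means the ranking matches the ideal ordering; pushing a relevant item further down lowers the score through the logarithmic discount.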

Installation

pip install uform

Usage

To load the model:

import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')

To encode data:

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

To get features:

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

Cosine Similarity

import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)

The similarity lies in the [-1, 1] range, where 1 means a perfect match.
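
Because cosine similarity only needs unimodal embeddings, it can be used to rank a whole collection against a single query. The sketch below spells out the formula in plain Python and ranks a few candidates; the helper names and the toy 3-dimensional vectors are hypothetical stand-ins for real embeddings, not part of the UForm API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for a text embedding and three image embeddings.
text_query = [1.0, 0.0, 0.0]
image_collection = [
    [0.0, 1.0, 0.0],   # orthogonal to the query: similarity 0
    [2.0, 0.0, 0.0],   # same direction: similarity 1
    [-1.0, 0.0, 0.0],  # opposite direction: similarity -1
]

ranking = rank_by_cosine(text_query, image_collection)
print(ranking)  # [1, 0, 2]
```

Only the direction of the vectors matters: the second candidate scores 1 even though it is twice as long as the query.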

Pros:

Cons:

Matching Score

Unlike cosine similarity, the matching score cannot be computed from unimodal embeddings alone. It requires the joint embedding, and the resulting score lies in the [0, 1] range, where 1 means a perfect match.

score = model.get_matching_scores(joint_embedding)
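
Since the matching score needs a joint embedding for every image-text pair, one possible way to combine the two options is to shortlist candidates by cosine similarity and compute matching scores only for that shortlist. The sketch below shows this two-stage pattern with made-up numbers. `retrieve_and_rerank` and `matching_score_fn` are hypothetical names; in practice the callback would wrap `model.encode_multimodal` and `model.get_matching_scores`.

```python
def retrieve_and_rerank(coarse_scores, matching_score_fn, shortlist_size):
    """Shortlist by a cheap score, then reorder the shortlist by a costly one.

    coarse_scores: one cosine similarity per candidate.
    matching_score_fn: maps a candidate index to its matching score.
    """
    # Stage 1: keep the candidates with the highest cosine similarity.
    by_cosine = sorted(
        range(len(coarse_scores)), key=lambda i: coarse_scores[i], reverse=True
    )
    shortlist = by_cosine[:shortlist_size]
    # Stage 2: compute the matching score only for the shortlisted candidates.
    return sorted(shortlist, key=matching_score_fn, reverse=True)


# Made-up scores for four candidate images and one text query.
coarse_scores = [0.10, 0.80, 0.75, 0.30]
toy_matching_scores = {1: 0.40, 2: 0.95}  # only shortlisted items are scored

result = retrieve_and_rerank(
    coarse_scores, lambda i: toy_matching_scores[i], shortlist_size=2
)
print(result)  # [2, 1]
```

In this toy example the second stage swaps the order of the two shortlisted candidates, while the other two candidates never need a joint embedding.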

Pros:

Cons: