
<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>

UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

This is the model card for the multilingual model (21 languages), which consists of:

A 12-layer BERT text encoder (8 layers for unimodal encoding and the remaining 4 layers for multimodal encoding)
A ViT-B/16 image encoder (image resolution is 224x224)

The model was trained on a balanced multilingual dataset.

If you need an English model, check this.

Evaluation

For all evaluations, the multimodal part was used unless otherwise stated.

Monolingual

Dataset	Recall@1	Recall@5	Recall@10
Zero-Shot Flickr	0.558	0.813	0.874
MS-COCO (train split was in training data)	0.401	0.680	0.781
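For readers unfamiliar with the metric, Recall@K in retrieval is commonly computed as the fraction of queries whose correct item appears among the top K results. The sketch below assumes exactly one correct candidate per query, which may differ from the exact protocol behind the numbers above; the function name `recall_at_k` and the toy similarity matrix are illustrative and not part of UForm.

```python
def recall_at_k(similarity, targets, k):
    """Fraction of queries whose target index is among the k highest-scoring candidates.

    similarity[i][j] is the score between query i and candidate j;
    targets[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for scores, target in zip(similarity, targets):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


# Toy example: 3 text queries scored against 4 images.
similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image (0) is ranked first
    [0.4, 0.5, 0.6, 0.1],  # correct image (1) is ranked second
    [0.7, 0.6, 0.2, 0.5],  # correct image (2) is ranked last
]
targets = [0, 1, 2]

r1 = recall_at_k(similarity, targets, k=1)  # one of three queries hits
r2 = recall_at_k(similarity, targets, k=2)  # two of three queries hit
```

Raising K can only keep or increase the score, which is why Recall@10 is always at least as high as Recall@1 in the tables.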

Multilingual

XTD-10

The metric is Recall@10, reported in percent.

English	German	Spanish	French	Italian	Russian	Japanese	Korean	Turkish	Chinese	Polish
96.1	93.5	95.7	94.1	94.4	90.4	90.2	91.3	95.2	93.8	95.8

COCO-SM

For this evaluation, only the unimodal part was used.

Recall

Target Language	OpenCLIP @ 1	UForm @ 1	OpenCLIP @ 5	UForm @ 5	OpenCLIP @ 10	UForm @ 10	Speakers
Arabic	22.7	31.7	44.9	57.8	55.8	69.2	274 M
Armenian	5.6	22.0	14.3	44.7	20.2	56.0	4 M
Chinese	27.3	32.2	51.3	59.0	62.1	70.5	1'118 M
English	37.8	37.7	63.5	65.0	73.5	75.9	1'452 M
French	31.3	35.4	56.5	62.6	67.4	73.3	274 M
German	31.7	35.1	56.9	62.2	67.4	73.3	134 M
Hebrew	23.7	26.7	46.3	51.8	57.0	63.5	9 M
Hindi	20.7	31.3	42.5	57.9	53.7	69.6	602 M
Indonesian	26.9	30.7	51.4	57.0	62.7	68.6	199 M
Italian	31.3	34.9	56.7	62.1	67.1	73.1	67 M
Japanese	27.4	32.6	51.5	59.2	62.6	70.6	125 M
Korean	24.4	31.5	48.1	57.8	59.2	69.2	81 M
Persian	24.0	28.8	47.0	54.6	57.8	66.2	77 M
Polish	29.2	33.6	53.9	60.1	64.7	71.3	41 M
Portuguese	31.6	32.7	57.1	59.6	67.9	71.0	257 M
Russian	29.9	33.9	54.8	60.9	65.8	72.0	258 M
Spanish	32.6	35.6	58.0	62.8	68.8	73.7	548 M
Thai	21.5	28.7	43.0	54.6	53.7	66.0	61 M
Turkish	25.5	33.0	49.1	59.6	60.3	70.8	88 M
Ukrainian	26.0	30.6	49.9	56.7	60.9	68.1	41 M
Vietnamese	25.4	28.3	49.2	53.9	60.3	65.5	85 M

Mean	26.5±6.4	31.8±3.5	49.8±9.8	58.1±4.5	60.4±10.6	69.4±4.3	-
Google Translate	27.4±6.3	31.5±3.5	51.1±9.5	57.8±4.4	61.7±10.3	69.1±4.3	-
Microsoft Translator	27.2±6.4	31.4±3.6	50.8±9.8	57.7±4.7	61.4±10.6	68.9±4.6	-
Meta NLLB	24.9±6.7	32.4±3.5	47.5±10.3	58.9±4.5	58.2±11.2	70.2±4.3	-

NDCG@20

	Arabic	Armenian	Chinese	French	German	Hebrew	Hindi	Indonesian	Italian	Japanese	Korean	Persian	Polish	Portuguese	Russian	Spanish	Thai	Turkish	Ukrainian	Vietnamese	Mean (all)	Mean (Google Translate)	Mean (Microsoft Translator)	Mean (NLLB)
OpenCLIP NDCG	0.639	0.204	0.731	0.823	0.806	0.657	0.616	0.733	0.811	0.737	0.686	0.667	0.764	0.832	0.777	0.849	0.606	0.701	0.704	0.697	0.716 ± 0.149	0.732 ± 0.145	0.730 ± 0.149	0.686 ± 0.158
UForm NDCG	0.868	0.691	0.880	0.932	0.927	0.791	0.879	0.870	0.930	0.885	0.869	0.831	0.897	0.897	0.906	0.939	0.822	0.898	0.851	0.818	0.875 ± 0.064	0.869 ± 0.063	0.869 ± 0.066	0.888 ± 0.064
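As a reminder of what NDCG@20 measures, the sketch below uses the textbook definition with linear gains: the discounted cumulative gain of the returned ranking, divided by that of the ideal ordering. The source does not state how relevance grades were assigned in this benchmark, so the relevance values and helper names here are purely illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance grade of each retrieved item, listed in ranked order (toy values).
perfect = ndcg_at_k([3, 2, 1, 0], k=20)  # already ideally ordered
swapped = ndcg_at_k([0, 1, 2, 3], k=20)  # most relevant item ranked last
```

Unlike Recall@K, this metric rewards placing the most relevant items near the top of the list, not merely somewhere within the cutoff.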

Installation

pip install uform

Usage

To load the model:

import uform

model = uform.get_model('unum-cloud/uform-vl-multilingual-v2')

To encode data:

from PIL import Image

text = 'a small red panda in a zoo'
image = Image.open('red_panda.jpg')

# Preprocess the raw inputs for the model
image_data = model.preprocess_image(image)
text_data = model.preprocess_text(text)

# Unimodal embeddings, plus a joint embedding of the image-text pair
image_embedding = model.encode_image(image_data)
text_embedding = model.encode_text(text_data)
joint_embedding = model.encode_multimodal(image=image_data, text=text_data)

To get features:

image_features, image_embedding = model.encode_image(image_data, return_features=True)
text_features, text_embedding = model.encode_text(text_data, return_features=True)

These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:

joint_embedding = model.encode_multimodal(
    image_features=image_features,
    text_features=text_features,
    attention_mask=text_data['attention_mask']
)

There are two options to calculate semantic compatibility between an image and a text: Cosine Similarity and Matching Score.

Cosine Similarity

import torch.nn.functional as F

similarity = F.cosine_similarity(image_embedding, text_embedding)

The similarity lies in the [-1, 1] range, where 1 means a perfect match.

Pros:

Computationally cheap.
Only unimodal embeddings are required, and unimodal encoding is faster than joint encoding.
Suitable for retrieval in large collections.

Cons:

Takes into account only coarse-grained features.
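To illustrate why cosine similarity suits retrieval, the sketch below ranks a small collection of precomputed embeddings against a single query embedding. It is a dependency-free toy under stated assumptions: the vectors are made up, and `cosine` and `top_k` are hypothetical helpers rather than UForm functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, collection, k):
    """Return (index, similarity) pairs for the k most similar vectors."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(collection)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional stand-ins for precomputed image embeddings.
toy_images = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
]
# Toy stand-in for a text embedding.
toy_text = [1.0, 0.1, 0.0]

results = top_k(toy_text, toy_images, k=2)
```

Because the image embeddings do not depend on the query, they can be computed once and reused for every search; with real collections the loop would typically be replaced by a batched matrix product or a vector search index.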

Matching Score

Unlike cosine similarity, the matching score cannot be computed from unimodal embeddings alone. A joint embedding is required, and the resulting score lies in the [0, 1] range, where 1 means a perfect match.

score = model.get_matching_scores(joint_embedding)
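One way the two scores might be combined is to shortlist candidates with the cheap cosine similarity and then reorder only that shortlist with the costlier matching score. The sketch below is an assumption about usage, not a documented UForm workflow; `retrieve_then_rerank` is a hypothetical helper, and the precomputed toy scores stand in for real calls to `F.cosine_similarity` and `model.get_matching_scores`.

```python
def retrieve_then_rerank(candidates, cheap_score, expensive_score, shortlist_size, k):
    """Shortlist candidates with a cheap score, then reorder the shortlist with a costlier one."""
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:shortlist_size]
    return sorted(shortlist, key=expensive_score, reverse=True)[:k]


# Toy stand-ins: each candidate carries made-up, precomputed scores.
candidates = [
    {"id": "a", "cosine": 0.80, "matching": 0.40},
    {"id": "b", "cosine": 0.75, "matching": 0.95},
    {"id": "c", "cosine": 0.10, "matching": 0.99},  # dropped by the cheap stage
    {"id": "d", "cosine": 0.60, "matching": 0.70},
]

best = retrieve_then_rerank(
    candidates,
    cheap_score=lambda c: c["cosine"],
    expensive_score=lambda c: c["matching"],
    shortlist_size=3,
    k=2,
)
best_ids = [c["id"] for c in best]
```

The trade-off is visible in candidate "c": an item the cheap stage misses never reaches the reranker, so the shortlist size bounds how much the matching score can correct.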