<h1 align="center">UForm</h1> <h3 align="center"> Multi-Modal Inference Library<br/> For Semantic Search Applications<br/> </h3>
UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents into a shared vector space!

This is the repository of English and multilingual UForm models converted to the CoreML MLProgram format. Currently, only the unimodal parts of the models are converted.
## Description
Each model is separated into two parts: an `image-encoder` and a `text-encoder`:
- English image-encoder: `english.image-encoder.mlpackage`
- English text-encoder: `english.text-encoder.mlpackage`
- Multilingual image-encoder: `multilingual.image-encoder.mlpackage`
- Multilingual text-encoder: `multilingual.text-encoder.mlpackage`
- Multilingual-v2 image-encoder: `multilingual-v2.image-encoder.mlpackage`
- Multilingual-v2 text-encoder: `multilingual-v2.text-encoder.mlpackage`
- ONNX multilingual image-encoder: `multilingual.image-encoder.onnx`
- ONNX multilingual text-encoder: `multilingual.text-encoder.onnx`
Each checkpoint is a zip archive with an MLProgram of the corresponding encoder.
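A minimal Swift sketch of loading one of the CoreML checkpoints after unzipping it; the file path is a placeholder, and error handling is omitted:

```swift
import Foundation
import CoreML

// Compile the .mlpackage into an .mlmodelc bundle, then load it.
// The path below is a placeholder: point it at the unzipped checkpoint.
let packageURL = URL(fileURLWithPath: "english.text-encoder.mlpackage")
let compiledURL = try MLModel.compileModel(at: packageURL)
let textEncoder = try MLModel(contentsOf: compiledURL)
```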
Text encoders have the following input fields:

- `input_ids`: int32
- `attention_mask`: int32

and support a flexible batch size.
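As an illustration, a text embedding could be computed like this; the sequence length of 77 and the tokenization step are assumptions, since both depend on the tokenizer shipped with the model:

```swift
import CoreML

// A sketch of running the text encoder on one pre-tokenized string.
// `textEncoder` is the MLModel loaded above; the sequence length of 77
// is an assumption — check the model's input description for the real range.
let length = 77
let inputIDs = try MLMultiArray(shape: [1, NSNumber(value: length)], dataType: .int32)
let attentionMask = try MLMultiArray(shape: [1, NSNumber(value: length)], dataType: .int32)
// ... fill inputIDs with token IDs from the matching tokenizer,
//     and attentionMask with 1 for real tokens, 0 for padding ...

let inputs = try MLDictionaryFeatureProvider(dictionary: [
    "input_ids": MLFeatureValue(multiArray: inputIDs),
    "attention_mask": MLFeatureValue(multiArray: attentionMask),
])
let outputs = try textEncoder.prediction(from: inputs)
```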
Image encoders have a single input field, `image`: float32, and support only a batch of a single image (due to a CoreML bug).
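A corresponding sketch for the image encoder; the `[1, 3, 224, 224]` input shape and the preprocessing are assumptions that should be checked against the model's input description:

```swift
import CoreML

// A sketch of running the image encoder on one image.
// The [1, 3, 224, 224] shape is an assumption; CoreML only accepts a
// batch of one here, as noted above.
let image = try MLMultiArray(shape: [1, 3, 224, 224], dataType: .float32)
// ... fill `image` with the normalized RGB pixels of the input picture ...

let imageInputs = try MLDictionaryFeatureProvider(dictionary: [
    "image": MLFeatureValue(multiArray: image),
])
let imageOutputs = try imageEncoder.prediction(from: imageInputs)
```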
Both encoders return:

- `features`: float32
- `embeddings`: float32
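The `embeddings` output is what semantic search needs: read it from the prediction and compare text and image vectors, for example with cosine similarity. A minimal sketch, assuming both embeddings are flat float32 vectors of equal length:

```swift
import CoreML

// Read the embedding vectors from the two predictions above.
let textEmbedding = outputs.featureValue(for: "embeddings")!.multiArrayValue!
let imageEmbedding = imageOutputs.featureValue(for: "embeddings")!.multiArrayValue!

// Cosine similarity between two MLMultiArrays of the same length.
func cosineSimilarity(_ a: MLMultiArray, _ b: MLMultiArray) -> Float {
    precondition(a.count == b.count, "embedding sizes must match")
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in 0..<a.count {
        let x = a[i].floatValue, y = b[i].floatValue
        dot += x * y
        normA += x * x
        normB += y * y
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

let score = cosineSimilarity(textEmbedding, imageEmbedding)
```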
If you want to convert a model with other parameters (e.g., fp16 precision or a different batch size range), you can use `convert.py`.