shibing624/text2vec-base-multilingual
This is a CoSENT (Cosine Sentence) model: shibing624/text2vec-base-multilingual.
It maps sentences to a 384-dimensional dense vector space and can be used for tasks such as sentence embedding, text matching, or semantic search.
- training dataset: https://huggingface.co/datasets/shibing624/nli-zh-all/tree/main/text2vec-base-multilingual-dataset
- base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- max_seq_length: 256
- best epoch: 4
- sentence embedding dim: 384
Evaluation
For an automated evaluation of this model, see the Evaluation Benchmark: text2vec
Languages
Available languages are: de, en, es, fr, it, nl, pl, pt, ru, zh
Release Models
Arch | BaseModel | Model | ATEC | BQ | LCQMC | PAWSX | STS-B | SOHU-dd | SOHU-dc | Avg | QPS |
---|---|---|---|---|---|---|---|---|---|---|---|
Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 55.04 | 20.70 | 35.03 | 23769 |
SBERT | xlm-roberta-base | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 63.01 | 52.28 | 46.46 | 3138 |
Instructor | hfl/chinese-roberta-wwm-ext | moka-ai/m3e-base | 41.27 | 63.81 | 74.87 | 12.20 | 76.96 | 75.83 | 60.55 | 57.93 | 2980 |
CoSENT | hfl/chinese-macbert-base | shibing624/text2vec-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 70.27 | 50.42 | 51.61 | 3008 |
CoSENT | hfl/chinese-lert-large | GanymedeNil/text2vec-large-chinese | 32.61 | 44.59 | 69.30 | 14.51 | 79.44 | 73.01 | 59.04 | 53.12 | 2092 |
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-sentence | 43.37 | 61.43 | 73.48 | 38.90 | 78.25 | 70.60 | 53.08 | 59.87 | 3089 |
CoSENT | nghuyong/ernie-3.0-base-zh | shibing624/text2vec-base-chinese-paraphrase | 44.89 | 63.58 | 74.24 | 40.90 | 78.93 | 76.70 | 63.30 | 63.08 | 3066 |
CoSENT | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | shibing624/text2vec-base-multilingual | 32.39 | 50.33 | 65.64 | 32.56 | 74.45 | 68.88 | 51.17 | 53.67 | 4004 |
Notes:
- Evaluation metric: Spearman correlation coefficient.
- The shibing624/text2vec-base-chinese model is trained with the CoSENT method on Chinese STS-B data, starting from hfl/chinese-macbert-base, and achieves good results on the Chinese STS-B test set. Run examples/training_sup_text_matching_model.py to train the model. The model file has been uploaded to the HF model hub. Recommended for general-purpose Chinese semantic matching tasks.
- The shibing624/text2vec-base-chinese-sentence model is trained with the CoSENT method, starting from nghuyong/ernie-3.0-base-zh, on the manually selected Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, and achieves good results on various Chinese NLI test sets. Run examples/training_sup_text_matching_model_jsonl_data.py to train the model. The model file has been uploaded to the HF model hub. Recommended for Chinese s2s (sentence vs sentence) semantic matching tasks.
- The shibing624/text2vec-base-chinese-paraphrase model is trained with the CoSENT method, starting from nghuyong/ernie-3.0-base-zh, on the manually selected Chinese STS dataset shibing624/nli-zh-all/text2vec-base-chinese-paraphrase-dataset. Relative to shibing624/nli-zh-all/text2vec-base-chinese-sentence-dataset, this dataset adds s2p (sentence to paraphrase) data to strengthen long-text representation, and the evaluation on each Chinese NLI test set reached SOTA. Run [examples/training_sup_text_matching_model_jsonl_data.py](https://github.com/shibing624/text2vec/blob/master/examples/training_sup_text_matching_model_jsonl_data.py) to train the model. The model file has been uploaded to the HF model hub. Recommended for Chinese s2p (sentence vs paragraph) semantic matching tasks.
- The shibing624/text2vec-base-multilingual model is trained with the CoSENT method, starting from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, on the manually selected multilingual STS dataset shibing624/nli-zh-all/text2vec-base-multilingual-dataset. Its results on the Chinese and English test sets improve on the original model. Run examples/training_sup_text_matching_model_jsonl_data.py to train the model. The model file has been uploaded to the HF model hub. Recommended for multilingual semantic matching tasks.
- w2v-light-tencent-chinese is the Word2Vec model built from Tencent word vectors. It is loaded and run on CPU, and is suitable for Chinese text matching tasks and cold-start situations where data is missing.
- The GPU test environment for QPS is a Tesla V100 with 32GB memory.
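The Spearman coefficient used as the evaluation metric above can be sketched in a few lines of plain Python. This is a minimal illustration, not the evaluation script used by text2vec: the function names and the sample scores are hypothetical, and the table appears to report the coefficient multiplied by 100.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(predicted, gold):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(gold))


# Hypothetical cosine similarities and gold labels for four sentence pairs.
predicted_scores = [0.91, 0.15, 0.62, 0.40]
gold_labels = [1, 0, 1, 0]
print(round(spearman(predicted_scores, gold_labels), 4))
```

Because only the ranking of the predicted similarities matters, the metric does not depend on the absolute scale of the cosine scores.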
Model training experiment report: Experiment report
Usage (text2vec)
Using this model becomes easy when you have text2vec installed:
pip install -U text2vec
Then you can use the model like this:
from text2vec import SentenceModel
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']
model = SentenceModel('shibing624/text2vec-base-multilingual')
embeddings = model.encode(sentences)
print(embeddings)
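For text matching, the two embeddings are typically compared with cosine similarity. The sketch below assumes nothing about the text2vec API: `cosine_similarity` is a hypothetical helper, and the short vectors are toy stand-ins for the 384-dimensional embeddings returned by `model.encode`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real sentence embeddings.
emb_zh = [0.2, 0.7, 0.1]
emb_en = [0.25, 0.65, 0.05]
print(cosine_similarity(emb_zh, emb_en))
```

With real output, the same helper could be called as `cosine_similarity(embeddings[0], embeddings[1])`. A value close to 1 suggests the two sentences are semantically similar.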
Usage (HuggingFace Transformers)
Without text2vec, you can use the model like this:
First, you pass your input through the transformer model, then you apply the appropriate pooling operation on top of the contextualized word embeddings.
Install transformers:
pip install transformers
Then load model and predict:
from transformers import AutoTokenizer, AutoModel
import torch
# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/text2vec-base-multilingual')
model = AutoModel.from_pretrained('shibing624/text2vec-base-multilingual')
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
print("Sentence embeddings:")
print(sentence_embeddings)
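To make the role of the attention mask in `mean_pooling` concrete, here is a small worked example in plain Python for a single sentence. The helper `masked_mean` and the toy vectors are hypothetical; it only mirrors the arithmetic of the torch version, where padded positions (mask 0) are excluded from the average.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding (mask 0) is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vector[d]
    # Same guard against division by zero as torch.clamp(..., min=1e-9).
    count = max(count, 1e-9)
    return [t / count for t in total]


# Two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

Without the mask, the padding vector `[9.0, 9.0]` would pull the average away from the real tokens, which is why a plain mean over all positions is not used.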
Usage (sentence-transformers)
sentence-transformers is a popular library to compute dense vector representations for sentences.
Install sentence-transformers:
pip install -U sentence-transformers
Then load model and predict:
from sentence_transformers import SentenceTransformer
m = SentenceTransformer("shibing624/text2vec-base-multilingual")
sentences = ['如何更换花呗绑定银行卡', 'How to replace the Huabei bundled bank card']
sentence_embeddings = m.encode(sentences)
print("Sentence embeddings:")
print(sentence_embeddings)
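Once sentences are embedded, semantic search reduces to ranking corpus vectors by their similarity to a query vector. The following is a minimal sketch under that assumption: `normalize` and `search` are hypothetical helpers, not part of sentence-transformers, and the 2-dimensional vectors stand in for real 384-dimensional embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(corpus_vecs):
        c = normalize(vec)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, c))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy corpus and query embeddings.
corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [0.9, 0.1]
for idx, score in search(query, corpus):
    print(idx, round(score, 4))
```

For larger corpora, the corpus vectors would normally be normalized once and stored, so that each query needs only dot products.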
Full Model Architecture
CoSENT(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
)
Intended uses
Our model is intended to be used as a sentence and short paragraph encoder. Given an input text, it outputs a vector that captures the semantic information. The sentence vector may be used for information retrieval, clustering, or sentence similarity tasks.
By default, input text longer than 256 word pieces is truncated.
Training procedure
Pre-training
We use the pretrained sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model.
Please refer to the model card for more detailed information about the pre-training procedure.
Fine-tuning
We fine-tune the model using a contrastive objective. Formally, we compute the cosine similarity for each possible sentence pair in the batch, then apply a ranking loss that compares the similarities of true (positive) pairs against those of false (negative) pairs.
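The ranking objective described above can be sketched as follows. This is an illustrative plain-Python version of the loss as CoSENT is commonly formulated, `log(1 + sum(exp(scale * (cos_low - cos_high))))` over every pair of sentence pairs where the first has a higher label than the second. It is not the text2vec training code: the function name `cosent_loss` is hypothetical and the scale of 20 is an assumption.

```python
import math


def cosent_loss(cos_scores, labels, scale=20.0):
    """Ranking loss over sentence pairs.

    cos_scores[i] is the cosine similarity of sentence pair i and labels[i]
    is its gold label. Every combination where pair i has a higher label
    than pair j contributes exp(scale * (cos_j - cos_i)).
    """
    # Start with 0.0 so that logsumexp(terms) equals log(1 + sum(exp(...))).
    terms = [0.0]
    for cos_hi, label_hi in zip(cos_scores, labels):
        for cos_lo, label_lo in zip(cos_scores, labels):
            if label_hi > label_lo:
                terms.append(scale * (cos_lo - cos_hi))
    # Numerically stable log-sum-exp.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


# A true pair scored above a false pair gives a loss near zero;
# the reverse ordering is penalized heavily.
print(cosent_loss([0.9, 0.1], [1, 0]))
print(cosent_loss([0.1, 0.9], [1, 0]))
```

The loss depends only on the relative order of the cosine similarities, which matches the Spearman-based evaluation used for the released models.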
Citing & Authors
This model was trained by text2vec.
If you find this model helpful, feel free to cite:
@software{text2vec,
author = {Ming Xu},
title = {text2vec: A Tool for Text to Vector},
year = {2023},
url = {https://github.com/shibing624/text2vec},
}