sentence-transformers feature-extraction sentence-similarity transformers information-retrieval

<h1 align="center">MMLW-retrieval-roberta-large</h1>

MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish. This model is optimized for information retrieval tasks. It can transform queries and passages to 1024 dimensional vectors. The model was developed using a two-step procedure:

Usage (Sentence-Transformers)

⚠️ Our dense retrievers require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix "zapytanie: " ⚠️

You can use the model like this with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-retrieval-roberta-large")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.

Evaluation Results

The model achieves NDCG@10 of 58.15 on the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.

Acknowledgements

This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.