Sentence similarity / passage retrieval

Dense Passage Retrieval (DPR) is a set of tools for state-of-the-art open-domain question answering. It was initially developed by Facebook, which maintains an official repository. DPR retrieves the documents relevant to a given question and is composed of two models: one for encoding passages and one for encoding questions. This particular model is the passage encoder.

Regarding its use, this model vectorizes the passages (documents) of the retrieval database. At query time, the incoming question is encoded with the companion question encoder, and that question embedding is compared against the passage embeddings to find the most similar documents, which are then used either to extract the answer or to generate it.
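
As a rough illustration of that comparison step, here is a minimal sketch of dot-product ranking with placeholder tensors (the embedding dimension of 768 and the candidate count are assumptions; in practice the vectors come from the two encoders):

import torch

# Placeholder embeddings: in a real system the question vector comes from the
# question encoder and the passage vectors from this passage encoder.
question_embedding = torch.randn(1, 768)      # 768 is the usual BERT-base hidden size
passage_embeddings = torch.randn(100, 768)    # 100 pre-encoded database passages

# DPR ranks passages by the dot product between question and passage vectors.
scores = question_embedding @ passage_embeddings.T    # shape (1, 100)
top5 = torch.topk(scores, k=5, dim=1).indices         # indices of the 5 best passages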

For training the model, we used the Spanish version of SQuAD, SQuAD-ES, from which we created positive and negative question-passage examples for the model.
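
The exact pair-construction code is not shown here, but a simplified sketch of how positives and negatives could be derived from SQuAD-ES-style records follows. In DPR proper, hard negatives are usually high-scoring retrieved passages that do not contain the answer; pairing a question with another question's context, as below, is only an approximation:

import random

# Hypothetical SQuAD-ES-style records: each question with its gold context.
records = [
    {"question": "¿Qué medallas ganó Usain Bolt en 2012?",
     "context": "Usain Bolt ganó tres medallas de oro en los Juegos Olímpicos de Londres 2012."},
    {"question": "¿Quién escribió el Quijote?",
     "context": "El Quijote fue escrito por Miguel de Cervantes."},
]

pairs = []
for i, rec in enumerate(records):
    # Positive example: a question paired with its own gold context.
    pairs.append((rec["question"], rec["context"], "positive"))
    # Negative example: the same question paired with another question's context.
    other = random.choice([r for j, r in enumerate(records) if j != i])
    pairs.append((rec["question"], other["context"], "hard_negative"))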

Example of use:

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

model_str = "avacaondata/dpr-spanish-passage_encoder-squades-base"
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)

# Encode a passage; the resulting vector is what gets compared against
# question embeddings produced by the companion question encoder.
passage = "Usain Bolt ganó tres medallas de oro en los Juegos Olímpicos de Londres 2012."
input_ids = tokenizer(passage, return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output
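
To see where the passage encoder fits in the full retrieval loop, here is a hedged end-to-end sketch. The question-encoder checkpoint name below is an assumption (the companion model is not named on this card), so check the hub for the exact identifier:

import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

passage_str = "avacaondata/dpr-spanish-passage_encoder-squades-base"
# Assumed companion question-encoder id; replace with the actual checkpoint name.
question_str = "avacaondata/dpr-spanish-question_encoder-squades-base"

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(passage_str)
ctx_encoder = DPRContextEncoder.from_pretrained(passage_str)
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(question_str)
q_encoder = DPRQuestionEncoder.from_pretrained(question_str)

passages = [
    "Usain Bolt ganó tres medallas de oro en los Juegos Olímpicos de Londres 2012.",
    "El Quijote fue escrito por Miguel de Cervantes.",
]
question = "¿Qué medallas ganó Usain Bolt en 2012?"

with torch.no_grad():
    # Encode the candidate passages with this model (the passage encoder).
    ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    passage_emb = ctx_encoder(**ctx_inputs).pooler_output      # (num_passages, hidden_size)
    # Encode the question with the question encoder.
    q_inputs = q_tokenizer(question, return_tensors="pt")
    question_emb = q_encoder(**q_inputs).pooler_output         # (1, hidden_size)

# Rank passages by dot-product similarity and pick the best one.
scores = question_emb @ passage_emb.T
best_passage = passages[scores.argmax().item()]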

The full metrics of this model on the evaluation split of SQuAD-ES are:

eval_loss: 0.08608942725107592
acc: 0.9925325215819639
f1: 0.8805402320715237
acc_and_f1: 0.9365363768267438
average_rank: 0.27430093209054596

And the classification report:

                precision   recall    f1-score   support

hard_negative     0.9961    0.9961    0.9961    325878
     positive     0.8805    0.8805    0.8805     10514

     accuracy                         0.9925    336392
    macro avg     0.9383    0.9383    0.9383    336392
 weighted avg     0.9925    0.9925    0.9925    336392

Contributions

Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.