sentence similarity passage retrieval

Dense Passage Retrieval is a set of tools for performing state of the art open-domain question answering. It was initially developed by Facebook and there is an official repository. DPR is intended to retrieve the relevant documents to answer a given question, and is composed of 2 models, one for encoding passages and other for encoding questions. This concrete model is the one used for encoding passages.

Regarding its use, this model should be used to vectorize the documents in the database of a question answering system in Spanish. Then, when a new question enters, the question encoder should be used to encode it, and then we compare that encoding with the encodings of the database to find the most similar documents, which then should be used for either extracting the answer or generating it.

For training the model, we used the spanish version of SQUAD, SQUAD-ES, with which we created positive and negative examples for the model.

Example of use:

from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

model_str = "avacaondata/dpr-spanish-passage_encoder-squades-base"
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)

input_ids = tokenizer("Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output

The full metrics of this model on the evaluation split of SQUADES are:

evalloss: 0.08608942725107592
acc: 0.9925325215819639
f1: 0.8805402320715237
acc_and_f1: 0.9365363768267438
average_rank: 0.27430093209054596

And the classification report:

                precision   recall    f1-score   support

hard_negative     0.9961    0.9961    0.9961    325878
     positive     0.8805    0.8805    0.8805     10514

     accuracy                         0.9925    336392
    macro avg     0.9383    0.9383    0.9383    336392
 weighted avg     0.9925    0.9925    0.9925    336392

Contributions

Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.