Dense Passage Retrieval (DPR) is a set of tools for state-of-the-art open-domain question answering. It was initially developed by Facebook, and there is an official repository. DPR is intended to retrieve the documents relevant to a given question, and is composed of two models: one for encoding passages and the other for encoding questions. This particular model is the one used for encoding passages.
Regarding its use, this model should be used to vectorize the passages that make up the database of a Question Answering system. An incoming question is encoded with the companion question encoder, and that encoding is compared with the passage encodings to find the most similar documents, which are then used either to extract the answer or to generate it.
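The comparison step described above can be sketched as follows. This is a minimal illustration, assuming the embeddings are already computed and that similarity is the inner product, as in the original DPR setup. The 3-dimensional toy vectors and the `rank_passages` helper are hypothetical stand-ins and not part of this model's API.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def rank_passages(
    question_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the top_k best-scoring passages."""
    scores = [(i, dot(question_vec, p)) for i, p in enumerate(passage_vecs)]
    # Higher inner product means the passage is considered more relevant.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy embeddings (hypothetical values); real ones would be pooler_output rows.
question_vec = [0.9, 0.1, 0.0]
passage_vecs = [
    [0.1, 0.8, 0.1],  # passage 0
    [0.8, 0.2, 0.0],  # passage 1
    [0.0, 0.1, 0.9],  # passage 2
]

top = rank_passages(question_vec, passage_vecs, top_k=2)
print(top)
```

For a large passage collection, an approximate nearest-neighbour index would normally replace the exhaustive loop; the brute-force version is shown only because it makes the scoring explicit.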
To train the model, we used SQuAD-es, the Spanish version of SQuAD, from which we created positive and negative examples.
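How the examples were built is not described here, so the following is only a sketch under stated assumptions: each question's own context is taken as its positive passage, and negatives are sampled at random from the contexts of other questions. The `build_pairs` helper, the `question`/`context` field names, and the toy records are hypothetical. The `hard_negative` class in the classification report suggests that the actual negatives were mined with a harder strategy than random sampling.

```python
import random
from typing import Dict, List


def build_pairs(
    records: List[Dict[str, str]],
    negatives_per_question: int = 2,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Create labelled (question, passage) pairs from SQuAD-style records."""
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    contexts = sorted({r["context"] for r in records})
    pairs: List[Dict[str, str]] = []
    for r in records:
        # Assumption: the context the question was written for is the positive.
        pairs.append(
            {"question": r["question"], "passage": r["context"], "label": "positive"}
        )
        # Assumption: negatives are drawn at random from all other contexts.
        others = [c for c in contexts if c != r["context"]]
        k = min(negatives_per_question, len(others))
        for passage in rng.sample(others, k):
            pairs.append(
                {"question": r["question"], "passage": passage, "label": "negative"}
            )
    return pairs


# Toy records (hypothetical), shaped like flattened SQuAD entries.
records = [
    {"question": "¿Cuál es la capital de Francia?",
     "context": "París es la capital de Francia."},
    {"question": "¿Quién escribió Don Quijote?",
     "context": "Miguel de Cervantes escribió Don Quijote."},
    {"question": "¿Cuál es el río más largo de la península ibérica?",
     "context": "El Tajo es el río más largo de la península ibérica."},
]

pairs = build_pairs(records)
print(len(pairs))
```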
Example of use:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

model_str = "avacaondata/dpr-spanish-passage_encoder-squades-base"
# This checkpoint is the passage encoder, so it is loaded with the context (passage) classes.
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)
# Tokenize one passage and encode it into a single dense vector.
input_ids = tokenizer("Usain Bolt ganó tres medallas de oro en los Juegos Olímpicos de Londres 2012.", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output
The full metrics of this model on the evaluation split of SQuAD-es are:
evalloss: 0.08608942725107592
acc: 0.9925325215819639
f1: 0.8805402320715237
acc_and_f1: 0.9365363768267438
average_rank: 0.27430093209054596
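The relationship between these figures can be checked with a few lines. The metric definitions are not given here, so this is an inference from the numbers: `acc_and_f1` appears to be the arithmetic mean of `acc` and `f1`.

```python
# Values copied from the metrics listed above.
acc = 0.9925325215819639
f1 = 0.8805402320715237
reported_acc_and_f1 = 0.9365363768267438

# Assumption: acc_and_f1 is the plain average of accuracy and F1.
acc_and_f1 = (acc + f1) / 2

print(acc_and_f1)
print(abs(acc_and_f1 - reported_acc_and_f1) < 1e-12)
```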
And the classification report:
               precision    recall  f1-score   support
hard_negative     0.9961    0.9961    0.9961    325878
     positive     0.8805    0.8805    0.8805     10514
     accuracy                         0.9925    336392
    macro avg     0.9383    0.9383    0.9383    336392
 weighted avg     0.9925    0.9925    0.9925    336392
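As a consistency check on the report, the two average rows can be recomputed from the per-class rows. This sketch assumes the usual definitions: the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The per-class values are the rounded figures from the report, so the results only match to four decimals.

```python
# Per-class F1 and support, copied from the classification report.
report = {
    "hard_negative": {"f1": 0.9961, "support": 325878},
    "positive": {"f1": 0.8805, "support": 10514},
}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(row["f1"] for row in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = (
    sum(row["f1"] * row["support"] for row in report.values()) / total_support
)

print(total_support, round(macro_f1, 4), round(weighted_f1, 4))
```

Because hard negatives make up the large majority of the evaluation pairs, the weighted average sits close to the `hard_negative` score, while the macro average gives a better sense of performance on the much smaller `positive` class.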
Contributions
Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.