Dense Passage Retrieval (DPR) is a set of tools for performing state-of-the-art open-domain question answering. It was originally developed by Facebook, and there is an official repository. DPR is intended to retrieve the documents relevant to answering a given question, and it is composed of two models: one for encoding passages and another for encoding questions. This concrete model is the one used for encoding passages.
With this model and the question encoder, we introduce what is, to the best of our knowledge, the best passage retriever for Spanish to date, improving over the previous model we developed by training for longer and with more data.
Regarding its use, this model should be used to vectorize the passages in the document database of a Question Answering system. At query time, the incoming question is encoded with the question encoder, and that encoding is compared against the passage encodings to find the most similar documents, which are then used to either extract the answer or generate it.
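A minimal sketch of that workflow follows. The question-encoder checkpoint name (IIC/dpr-spanish-question_encoder-allqa-base) and the example passages and question are assumptions for illustration only:

import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Passage (context) encoder: this model.
ctx_name = "IIC/dpr-spanish-passage_encoder-allqa-base"
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained(ctx_name)
ctx_encoder = DPRContextEncoder.from_pretrained(ctx_name)

# Question encoder: assumed name of the companion checkpoint.
q_name = "IIC/dpr-spanish-question_encoder-allqa-base"
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_encoder = DPRQuestionEncoder.from_pretrained(q_name)

# Toy document database (invented examples).
passages = [
    "Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012",
    "El Quijote fue escrito por Miguel de Cervantes",
]
question = "¿Quién escribió El Quijote?"

with torch.no_grad():
    ctx_inputs = ctx_tokenizer(passages, padding=True, truncation=True, return_tensors="pt")
    passage_embs = ctx_encoder(**ctx_inputs).pooler_output  # (num_passages, hidden)
    q_inputs = q_tokenizer(question, return_tensors="pt")
    question_emb = q_encoder(**q_inputs).pooler_output      # (1, hidden)

# DPR ranks passages by dot-product similarity with the question embedding.
scores = question_emb @ passage_embs.T
best = scores.argmax(dim=1).item()
print(passages[best])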
For training the model, we used a collection of Question Answering datasets in Spanish:
- SQUAD-ES, the Spanish version of SQUAD
- SQAC (Spanish Question Answering Corpus)
- BioAsq22-ES, which we translated automatically using Transformers-based machine translation
With this combined dataset we created positive and negative examples for the model (for more information on the DPR training process, see the paper). We trained for 25 epochs with the same configuration as in the paper. The previous DPR model was trained for only 3 epochs with about 60% of the data.
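For illustration only, the official DPR repository consumes training data as JSON records roughly like the following (the question, answers, titles and passage texts here are invented; see that repository for the exact format):

# Illustrative (invented) training record in the style of the official DPR repo:
# a question paired with positive passages and mined hard negatives.
train_example = {
    "question": "¿Quién escribió El Quijote?",
    "answers": ["Miguel de Cervantes"],
    "positive_ctxs": [
        {"title": "Don Quijote", "text": "El Quijote fue escrito por Miguel de Cervantes..."}
    ],
    "negative_ctxs": [],
    "hard_negative_ctxs": [
        {"title": "Cervantes", "text": "Cervantes nació en Alcalá de Henares en 1547..."}
    ],
}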
Example of use:
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# Load the passage (context) encoder and its tokenizer.
model_str = "IIC/dpr-spanish-passage_encoder-allqa-base"
tokenizer = DPRContextEncoderTokenizer.from_pretrained(model_str)
model = DPRContextEncoder.from_pretrained(model_str)

# Tokenize a passage and obtain its dense embedding from the pooler output.
input_ids = tokenizer("Usain Bolt ganó varias medallas de oro en las Olimpiadas del año 2012", return_tensors="pt")["input_ids"]
embeddings = model(input_ids).pooler_output
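Each input is mapped to a single dense vector (768 dimensions for a base-sized encoder). In a retrieval system, the passage vectors are typically pre-computed and indexed, for example with FAISS, so that at query time only the question needs to be encoded.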
The full metrics of this model on the evaluation split of SQUAD-ES are:
eval_loss: 0.010779764448327261
eval_acc: 0.9982682224158297
eval_f1: 0.9446059155411182
eval_acc_and_f1: 0.9714370689784739
eval_average_rank: 0.11728500598392888
And the classification report:
|               | precision | recall | f1-score | support |
|---------------|-----------|--------|----------|---------|
| hard_negative | 0.9991    | 0.9991 | 0.9991   | 1104999 |
| positive      | 0.9446    | 0.9446 | 0.9446   | 17547   |
| accuracy      |           |        | 0.9983   | 1122546 |
| macro avg     | 0.9719    | 0.9719 | 0.9719   | 1122546 |
| weighted avg  | 0.9983    | 0.9983 | 0.9983   | 1122546 |
Contributions
Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.