Sentence Transformers

This model is a fork of sentence-transformers/all-MiniLM-L6-v2, chosen because its training data and intended use closely match our target dataset and use case. For more details, see the repository for the pre-trained model weights.
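As an illustration of the sentence-similarity use case, the sketch below embeds two sentences and compares them with cosine similarity. It assumes the `sentence-transformers` package is installed and loads the upstream model ID named above; the example sentences and the helper names `cosine_similarity` and `demo` are hypothetical and not part of this repository.

```python
import math
from typing import Sequence

# Upstream model named in this card; a fine-tuned fork would use its own ID.
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def demo() -> None:
    """Embed two example sentences and print their similarity."""
    # Imported here so cosine_similarity stays usable without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_ID)  # downloads weights on first use
    sentences = ["How do I reset my password?", "I forgot my login credentials."]
    embeddings = model.encode(sentences)  # one vector per sentence
    print(cosine_similarity(embeddings[0], embeddings[1]))
```

Calling `demo()` prints a single score; values closer to 1 indicate more similar sentences. The pure-Python helper is kept separate so the comparison step can be checked without downloading any weights.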

Fine-tuning

Hyperparameters

Datasets

| Dataset | Paper | Number of training tuples |
|---|---|---|
| Reddit comments (2015-2018) | paper | 726,484,430 |
| S2ORC Citation pairs (Abstracts) | paper | 116,288,806 |
| WikiAnswers Duplicate question pairs | paper | 77,427,422 |
| PAQ (Question, Answer) pairs | paper | 64,371,441 |
| S2ORC Citation pairs (Titles) | paper | 52,603,982 |
| S2ORC (Title, Abstract) | paper | 41,769,185 |
| Stack Exchange (Title, Body) pairs | - | 25,316,456 |
| Stack Exchange (Title+Body, Answer) pairs | - | 21,396,559 |
| Stack Exchange (Title, Answer) pairs | - | 21,396,559 |
| MS MARCO triplets | paper | 9,144,553 |
| GOOAQ: Open Question Answering with Diverse Answer Types | paper | 3,012,496 |
| Yahoo Answers (Title, Answer) | paper | 1,198,260 |
| Code Search | - | 1,151,414 |
| COCO Image captions | paper | 828,395 |
| SPECTER citation triplets | paper | 684,100 |
| Yahoo Answers (Question, Answer) | paper | 681,164 |
| Yahoo Answers (Title, Question) | paper | 659,896 |
| SearchQA | paper | 582,261 |
| Eli5 | paper | 325,475 |
| Flickr 30k | paper | 317,695 |
| Stack Exchange Duplicate questions (titles) | - | 304,525 |
| AllNLI (SNLI and MultiNLI) | paper SNLI, paper MultiNLI | 277,230 |
| Stack Exchange Duplicate questions (bodies) | - | 250,519 |
| Stack Exchange Duplicate questions (titles+bodies) | - | 250,460 |
| Sentence Compression | paper | 180,000 |
| Wikihow | paper | 128,542 |
| Altlex | paper | 112,696 |
| Quora Question Triplets | - | 103,663 |
| Simple Wikipedia | paper | 102,225 |
| Natural Questions (NQ) | paper | 100,231 |
| SQuAD2.0 | paper | 87,599 |
| TriviaQA | - | 73,346 |
| Total | | 1,170,060,424 |
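To get a feel for how unevenly the datasets contribute, the sketch below computes each dataset's share of the stated total. It uses only a handful of rows copied from the table above; the names `TUPLE_COUNTS`, `STATED_TOTAL`, and `share_of_total` are illustrative, and the shares are taken relative to the total as reported in the table.

```python
# Training-tuple counts copied from the table above (a subset, for brevity).
TUPLE_COUNTS = {
    "Reddit comments (2015-2018)": 726_484_430,
    "S2ORC Citation pairs (Abstracts)": 116_288_806,
    "WikiAnswers Duplicate question pairs": 77_427_422,
    "TriviaQA": 73_346,
}
STATED_TOTAL = 1_170_060_424  # the "Total" row of the table


def share_of_total(count: int, total: int = STATED_TOTAL) -> float:
    """Fraction of all training tuples contributed by one dataset."""
    return count / total


shares = {name: share_of_total(n) for name, n in TUPLE_COUNTS.items()}

# Print the selected datasets from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.2%}")
```

Under these numbers the Reddit comments alone account for well over half of the training tuples, while the smallest datasets contribute a tiny fraction of a percent each.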