
This model is T5-v1.1-large finetuned on the RSS dataset. It was finetuned as part of "How Optimal is Greedy Decoding for Extractive Question Answering?", while the RSS pretraining method was introduced in this paper.

Model description

The original T5-v1.1-large was pre-trained only on C4, excluding any supervised training. Our version is further trained with the Recurring Span Selection (RSS) scheme, using a sample from the dataset used to pretrain Splinter.

During training, the masked span is replaced with <extra_id_0> and the labels are formatted as <extra_id_0>span<extra_id_1>. Unlike Splinter, only one span is masked at a time.
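
The format described above can be illustrated with a minimal sketch. The helper `make_rss_example` is hypothetical and is not the authors' preprocessing code: it assumes the recurring span is already known, that it must occur at least twice in the context, and that the label closes with `<extra_id_1>` (the token used as the end-of-sequence marker in the generation example in this card).

```python
def make_rss_example(context: str, span: str, occurrence: int = 0):
    """Mask one occurrence of a recurring span (hypothetical helper)."""
    # Collect the start offset of every non-overlapping occurrence of the span.
    starts = []
    pos = context.find(span)
    while pos != -1:
        starts.append(pos)
        pos = context.find(span, pos + len(span))
    # Assumption: a span only qualifies if it recurs in the context.
    if len(starts) < 2:
        raise ValueError("span must occur at least twice in the context")
    start = starts[occurrence]
    # Replace the chosen occurrence with the sentinel token.
    masked = context[:start] + "<extra_id_0>" + context[start + len(span):]
    # Assumed label layout: sentinel, span text, closing sentinel.
    label = f"<extra_id_0>{span}<extra_id_1>"
    return masked, label


context = "Obama was born in Hawaii. Obama later moved to Chicago."
masked, label = make_rss_example(context, "Obama", occurrence=1)
print(masked)  # Obama was born in Hawaii. <extra_id_0> later moved to Chicago.
print(label)   # <extra_id_0>Obama<extra_id_1>
```

Only one occurrence is masked per example, so the other occurrence stays in the context and can be copied by the model.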

Intended uses & limitations

This model naturally fits tasks where a span from a context is intended to be copied, such as extractive question answering. This checkpoint is primarily intended for the zero-shot setting; further fine-tuning it on an annotated dataset gives results equal to those of the original T5-v1.1-large.

How to use

You can use this model directly, but it is recommended to format the input to match the training scheme, with the passage as text followed by the question:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('tau/t5-v1_1-large-rss')
tokenizer = AutoTokenizer.from_pretrained('tau/t5-v1_1-large-rss')

passage = 'Barack Hussein Obama II is an American politician and attorney who served as the 44th president of the United States from 2009 to 2017. '
question = 'When was Obama inaugurated?'
# The answer slot is marked with the first sentinel token, <extra_id_0>.
text = f'Text: {passage}.\nQuestion: {question}\nAnswer:{tokenizer.additional_special_tokens[0]}.'
encoded_input = tokenizer(text, return_tensors='pt')
# Greedy decoding (num_beams=1); generation stops at the second sentinel token, <extra_id_1>.
output_ids = model.generate(input_ids=encoded_input.input_ids, attention_mask=encoded_input.attention_mask,
                            eos_token_id=tokenizer.additional_special_tokens_ids[1], num_beams=1, max_length=512, min_length=3)
# Decode with special tokens kept so the sentinel tokens stay visible.
print(tokenizer.decode(output_ids[0]))

The generated answer is then "<pad><extra_id_0> 2009<extra_id_1>", while the one generated by the original T5-v1.1-large is "<pad><extra_id_0> On January 20, 2009<extra_id_1>" - a correct yet non-extractive answer.
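
To use the answer downstream, the sentinel tokens need to be stripped from the decoded string. The following is a minimal sketch; `extract_answer` is a hypothetical helper, not part of `transformers`, and it assumes the decoded output has the `<pad><extra_id_0> ...<extra_id_1>` shape shown above.

```python
import re


def extract_answer(decoded: str) -> str:
    """Return the text between <extra_id_0> and <extra_id_1> (hypothetical helper)."""
    # Stop at the closing sentinel, at </s>, or at the end of the string,
    # whichever comes first.
    match = re.search(r"<extra_id_0>(.*?)(?:<extra_id_1>|</s>|$)", decoded, flags=re.DOTALL)
    if match is None:
        return ""
    return match.group(1).strip()


print(extract_answer("<pad><extra_id_0> 2009<extra_id_1>"))                 # 2009
print(extract_answer("<pad><extra_id_0> On January 20, 2009<extra_id_1>"))  # On January 20, 2009
```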

Limitations and bias

Although using the model with greedy decoding tends toward extracted outputs, it may sometimes produce non-extracted ones, whether through different casing or through an entirely different string (or substring) that may carry a different meaning.
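
One way to detect such cases is to check whether the answer appears verbatim in the passage. This is a minimal sketch, not the evaluation procedure from the paper; `is_extractive` and its `ignore_case` flag are hypothetical names introduced here for illustration.

```python
def is_extractive(answer: str, passage: str, ignore_case: bool = False) -> bool:
    """True if the answer appears verbatim in the passage (hypothetical helper)."""
    answer = answer.strip()
    if not answer:
        return False
    if ignore_case:
        # Treat casing differences as still extractive.
        return answer.lower() in passage.lower()
    return answer in passage


passage = ("Barack Hussein Obama II is an American politician and attorney who "
           "served as the 44th president of the United States from 2009 to 2017.")
print(is_extractive("2009", passage))                                    # True
print(is_extractive("On January 20, 2009", passage))                     # False
print(is_extractive("barack hussein obama", passage))                    # False
print(is_extractive("barack hussein obama", passage, ignore_case=True))  # True
```

Running the check with and without `ignore_case` separates casing-only differences from answers that are not in the passage at all.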


The model was finetuned on 100,000 RSS examples for 3 epochs using the Adafactor optimizer with a constant learning rate of 5e-5.

Evaluation results

Evaluated over few-shot QA in a zero-shot setting (no finetuning on annotated examples):

| Model \ Dataset | SQuAD | TriviaQA | NaturalQs | NewsQA | SearchQA | HotpotQA | BioASQ | TextbookQA |
|---|---|---|---|---|---|---|---|---|
| T5 | 50.4 | 61.7 | 42.1 | 19.2 | 24.0 | 43.3 | 55.5 | 17.8 |
| T5-rss | 71.4 | 69.3 | 57.2 | 43.2 | 29.7 | 59.0 | 65.5 | 39.0 |
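
For a quick read of the table, the per-dataset difference and the unweighted average can be computed directly from the reported numbers. This is only arithmetic over the values above; the unweighted mean across datasets is a summary chosen here for illustration and is not a figure reported by the authors.

```python
datasets = ["SQuAD", "TriviaQA", "NaturalQs", "NewsQA",
            "SearchQA", "HotpotQA", "BioASQ", "TextbookQA"]
t5 = [50.4, 61.7, 42.1, 19.2, 24.0, 43.3, 55.5, 17.8]
t5_rss = [71.4, 69.3, 57.2, 43.2, 29.7, 59.0, 65.5, 39.0]

# Per-dataset gain of T5-rss over T5, rounded to the table's precision.
gains = {name: round(b - a, 1) for name, a, b in zip(datasets, t5, t5_rss)}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}")

# Unweighted averages across the eight datasets.
avg_t5 = sum(t5) / len(t5)
avg_rss = sum(t5_rss) / len(t5_rss)
print(f"average T5: {avg_t5:.2f}, average T5-rss: {avg_rss:.2f}")
```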

The gap between the two models diminishes as more training examples are introduced. For additional results, see the [paper](https://arxiv.org/abs/2108.05857).

