
<img src="https://huggingface.co/IIC/marimari-r2r-mlsum/resolve/main/marimariLogo.png"/>

This is a model for text summarization in Spanish, trained on the Spanish portion of mlsum. MariMari was created for that purpose. The name comes from the fact that it is an EncoderDecoder model built from the MarIA model, specifically the RoBERTa model from the MarIA project. To build the EncoderDecoder model we followed this paper, which has a direct implementation in transformers. Because there are no native encoder-decoder models for Spanish, such as BART or T5, we decided to leverage the RoBERTa model of the MarIA project: it has shown strong results on several NLU tasks, so it seemed natural to expect that it could also perform well on NLG tasks when trained properly.
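
As a rough illustration of the warm-starting idea described above, the sketch below builds a RoBERTa-to-RoBERTa EncoderDecoder model with transformers. The checkpoint name `PlanTL-GOB-ES/roberta-base-bne` and the helper `build_roberta2roberta` are assumptions made for this example; the card does not state which MarIA checkpoint or which exact configuration was used.

```python
# Assumed checkpoint name for the MarIA RoBERTa model (not stated in this card).
MARIA_CHECKPOINT = "PlanTL-GOB-ES/roberta-base-bne"


def build_roberta2roberta(checkpoint: str = MARIA_CHECKPOINT):
    """Warm-start an encoder-decoder model from one RoBERTa checkpoint.

    Hypothetical helper: the same checkpoint initializes both the encoder
    and the decoder; the decoder's cross-attention layers start untrained.
    """
    # Imported lazily so the heavy dependency is only needed when called.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        checkpoint, checkpoint
    )

    # Generation needs to know which tokens start, end and pad a sequence.
    model.config.decoder_start_token_id = tokenizer.cls_token_id
    model.config.eos_token_id = tokenizer.sep_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

A model built this way still has to be fine-tuned on a summarization dataset before it produces useful output.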

To tune the model's hyperparameters we used Optuna, with only 10 different trials and 7 initial random trials, because mlsum, the dataset chosen for training, is huge. The hyperparameter search space was the following:


    def hp_space(trial):
        return {
            "learning_rate": trial.suggest_float(
                "learning_rate", 3e-5, 7e-5, log=True
            ),
            "num_train_epochs": trial.suggest_categorical(
                "num_train_epochs", [7]
            ),
            "per_device_train_batch_size": trial.suggest_categorical(
                "per_device_train_batch_size", [16]),
            "per_device_eval_batch_size": trial.suggest_categorical(
                "per_device_eval_batch_size", [32]),
            "gradient_accumulation_steps": trial.suggest_categorical(
                "gradient_accumulation_steps", [2, 4, 8]),
            "warmup_steps": trial.suggest_categorical(
                "warmup_steps", [50, 100, 500, 1000]
            ),
            "weight_decay": trial.suggest_float(
                "weight_decay", 0.0, 0.1
            ),
        }

The reported results are on the test split of mlsum. On this split, MariMari-r2r-mlsum summarizes better than the previous best model for this task, beto2beto. The complete test metrics are:

{"rouge1": 28.7802, "rouge2": 10.6748, "rougeL": 23.0447, "rougeLsum": 23.4055, "gen_len": 25.7803}
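
For readers unfamiliar with ROUGE, the sketch below shows roughly what a `rouge1` score measures: the unigram overlap between a reference summary and a generated one, expressed as an F-measure on a 0 to 100 scale. It is a simplified illustration that assumes whitespace tokenization and no stemming, not the evaluation script used to obtain the numbers above.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure on a 0-100 scale (whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 100 * 2 * precision * recall / (precision + recall)


# Precision is 3/3 and recall is 3/5, so the F-measure is 75.
print(rouge1_f1("el gato duerme en casa", "el gato duerme"))
```

`rouge2` applies the same idea to bigrams, and `rougeL` is based on the longest common subsequence instead of n-gram counts.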

This model is easy to use. With the following lines of code you can start summarizing your documents in Spanish:

    from transformers import EncoderDecoderModel, AutoTokenizer

    text = "Hola esto es un ejemplo de texto a resumir. Poco hay que resumir aquí, pero es sólo de muestra."

    tokenizer = AutoTokenizer.from_pretrained("IIC/marimari-r2r-mlsum")
    model = EncoderDecoderModel.from_pretrained("IIC/marimari-r2r-mlsum")

    input_ids = tokenizer(text, return_tensors="pt").input_ids

    output_ids = model.generate(input_ids)[0]
    print(tokenizer.decode(output_ids, skip_special_tokens=True))
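
For longer inputs or several documents at once, a variant along the following lines may be more practical. It is only a sketch: the helper `summarize`, the 512-token input limit, the beam size and the maximum summary length are assumptions chosen for illustration, not settings documented for this model.

```python
from typing import List


def summarize(
    texts: List[str],
    model_name: str = "IIC/marimari-r2r-mlsum",
    max_input_tokens: int = 512,   # assumed encoder limit
    num_beams: int = 4,            # assumed beam size
    max_summary_tokens: int = 64,  # assumed cap on summary length
) -> List[str]:
    """Summarize a batch of Spanish texts (hypothetical helper)."""
    # Imported lazily so the heavy dependency is only needed when called.
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EncoderDecoderModel.from_pretrained(model_name)

    # Pad to a common length and truncate inputs that exceed the limit.
    batch = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_input_tokens,
    )
    output_ids = model.generate(
        batch.input_ids,
        attention_mask=batch.attention_mask,
        num_beams=num_beams,
        max_length=max_summary_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Passing the attention mask matters once inputs are padded, since otherwise the padding tokens would be treated as part of the document.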

Contributions

Thanks to @avacaondata, @alborotis, @albarji, @Dabs, @GuillemGSubies for adding this model.