
INESC-ID

A Semantic Search System for Supremo Tribunal de Justiça

Work developed as part of Project IRIS.

Thesis: A Semantic Search System for Supremo Tribunal de Justiça

stjiris/t5-portuguese-legal-summarization

A T5 model for summarizing Portuguese legal text, fine-tuned from the “unicamp-dl/ptt5-base-portuguese-vocab” checkpoint.

The training data consisted of jurisprudence documents paired with their summaries.

Usage (HuggingFace transformers)

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_checkpoint = "stjiris/t5-portuguese-legal-summarization"
# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t5_model = T5ForConditionalGeneration.from_pretrained(model_checkpoint).to(device)
t5_tokenizer = T5Tokenizer.from_pretrained(model_checkpoint)

# Placeholder input; replace it with the legal text to summarize
preprocess_text = "These are some big words and text and words and text, again, that we want to summarize"
# Prepend the "summarize: " task prefix
t5_prepared_text = "summarize: " + preprocess_text

tokenized_text = t5_tokenizer.encode(t5_prepared_text, return_tensors="pt").to(device)

# Summarize with beam search; min_length and max_length are counted in tokens
summary_ids = t5_model.generate(
    tokenized_text,
    num_beams=4,
    no_repeat_ngram_size=2,
    min_length=512,
    max_length=1024,
    early_stopping=True,
)

output = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("\n\nSummarized text: \n", output)
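
Court rulings are often much longer than the short placeholder used above. The following is a minimal sketch, not part of the original model card, of one way to normalize whitespace, split a long text into word-bounded chunks, and add the "summarize: " prefix to each chunk before tokenization. The helper name `prepare_inputs` and the `max_words` budget are hypothetical choices for illustration; the chunk size that works best for this model is not stated in the source.

```python
PREFIX = "summarize: "  # task prefix taken from the usage snippet


def prepare_inputs(text, max_words=350):
    """Return prefixed, word-bounded chunks of `text`.

    `max_words` is an assumed budget, not a value from the model card.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    # str.split() with no arguments collapses any run of whitespace
    words = text.split()
    chunks = [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]
    return [PREFIX + chunk for chunk in chunks]


if __name__ == "__main__":
    for prepared in prepare_inputs("O tribunal   decidiu\nque o recurso procede.", max_words=4):
        print(prepared)
```

Each returned string can then be passed to the tokenizer and to `generate` in place of the prepared text in the snippet above, and the partial summaries joined afterwards. Splitting on word counts ignores sentence and section boundaries, so this is only a starting point.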

Citing & Authors

Contributions

@rufimelo99

If you use this work, please cite:

@inproceedings{MeloSemantic,
	author = {Melo, Rui and Santos, Pedro Alexandre and Dias, Jo{\~a}o},
	title = {A {Semantic} {Search} {System} for {Supremo} {Tribunal} de {Justi}{\c c}a},
}

@article{ptt5_2020,
  title={PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data},
  author={Carmo, Diedre and Piau, Marcos and Campiotti, Israel and Nogueira, Rodrigo and Lotufo, Roberto},
  journal={arXiv preprint arXiv:2008.09144},
  year={2020}
}