UmBERTo Wikipedia Uncased

UmBERTo is a RoBERTa-based language model trained on large Italian corpora. It uses two innovative approaches: SentencePiece and Whole Word Masking. It is now available at github.com/huggingface/transformers

<p align="center"> <img src="https://user-images.githubusercontent.com/7140210/72913702-d55a8480-3d3d-11ea-99fc-f2ef29af4e72.jpg" width="700"> <br> Marco Lodola, Monument to Umberto Eco, Alessandria 2019 </p>

Dataset

UmBERTo-Wikipedia-Uncased is trained on a relatively small corpus (~7GB) extracted from Wikipedia-ITA.

Pre-trained model

| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| --- | --- | --- | --- | --- | --- | --- |
| umberto-wikipedia-uncased-v1 | YES | YES | SPM | 32K | 100k | Link |

This model was trained with SentencePiece and Whole Word Masking.
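
To illustrate the idea behind Whole Word Masking on SentencePiece output, here is a minimal sketch in plain Python. It is not the training code used for UmBERTo: the segmentation of the example sentence is hypothetical, the function names are made up for illustration, and the masking rule is simplified (each word is masked independently with probability `mask_prob`, without the random-replacement variants used in BERT-style pre-training). The only thing it relies on is that SentencePiece marks the first piece of each word with the "▁" character.

```python
import random

MASK = "<mask>"
WORD_START = "\u2581"  # SentencePiece prefixes the first piece of a word with "▁"


def group_into_words(pieces):
    """Group subword pieces into lists of indices, one list per whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])      # a new word starts here
        else:
            words[-1].append(i)    # continuation of the current word
    return words


def whole_word_mask(pieces, mask_prob=0.15, seed=0):
    """Mask every piece of a word together, never only part of a word."""
    rng = random.Random(seed)
    masked = list(pieces)
    for word in group_into_words(pieces):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = MASK
    return masked


# Hypothetical segmentation, not the actual output of the UmBERTo tokenizer.
pieces = ["▁umberto", "▁eco", "▁è", "▁stato", "▁un", "▁grande", "▁scri", "ttore"]

print(group_into_words(pieces))
print(whole_word_mask(pieces, mask_prob=0.5))
```

In this sketch the pieces "▁scri" and "ttore" always share the same fate: either both are replaced by `<mask>` or neither is, which is the property that distinguishes whole-word masking from masking individual subword tokens.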

Downstream Tasks

These results refer to the umberto-wikipedia-uncased model. All details are available on the UmBERTo Official Page.

Named Entity Recognition (NER)

| Dataset | F1 | Precision | Recall | Accuracy |
| --- | --- | --- | --- | --- |
| ICAB-EvalITA07 | 86.240 | 85.939 | 86.544 | 98.534 |
| WikiNER-ITA | 90.483 | 90.328 | 90.638 | 98.661 |

Part of Speech (POS)

| Dataset | F1 | Precision | Recall | Accuracy |
| --- | --- | --- | --- | --- |
| UD_Italian-ISDT | 98.563 | 98.508 | 98.618 | 98.717 |
| UD_Italian-ParTUT | 97.810 | 97.835 | 97.784 | 98.060 |
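
As a quick sanity check on the NER and POS tables, the sketch below recomputes F1 from the reported precision and recall. It assumes that the F1 column is the usual harmonic mean of the two; the source does not state how the scores were averaged, so treat this as an illustration rather than a description of the evaluation script. The helper name `f1_score` is introduced here for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the NER and POS tables.
reported = {
    "ICAB-EvalITA07": (85.939, 86.544, 86.240),
    "WikiNER-ITA": (90.328, 90.638, 90.483),
    "UD_Italian-ISDT": (98.508, 98.618, 98.563),
    "UD_Italian-ParTUT": (97.835, 97.784, 97.810),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.3f}, reported F1 = {f1:.3f}")
```

Because the precision and recall values are themselves rounded to three decimals, the recomputed F1 can differ from the reported one in the last digit.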

Usage

Load UmBERTo Wikipedia Uncased with AutoModel and AutoTokenizer:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-wikipedia-uncased-v1")

encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output

Predict masked token:

from transformers import pipeline

fill_mask = pipeline(
	"fill-mask",
	model="Musixmatch/umberto-wikipedia-uncased-v1",
	tokenizer="Musixmatch/umberto-wikipedia-uncased-v1"
)

result = fill_mask("Umberto Eco è <mask> un grande scrittore")
# {'sequence': '<s> umberto eco è stato un grande scrittore</s>', 'score': 0.5784581303596497, 'token': 361}
# {'sequence': '<s> umberto eco è anche un grande scrittore</s>', 'score': 0.33813193440437317, 'token': 269}
# {'sequence': '<s> umberto eco è considerato un grande scrittore</s>', 'score': 0.027196012437343597, 'token': 3236}
# {'sequence': '<s> umberto eco è diventato un grande scrittore</s>', 'score': 0.013716378249228, 'token': 5742}
# {'sequence': '<s> umberto eco è inoltre un grande scrittore</s>', 'score': 0.010662357322871685, 'token': 1030}
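
The pipeline returns a list of dictionaries like the ones shown in the comments above. A small post-processing step can make them easier to read. The sketch below works on a hard-coded copy of that output, so it runs without downloading the model; the helper names `clean_sequence` and `summarize` are hypothetical, and the exact keys and the presence of the `<s>` / `</s>` markers may differ across library versions.

```python
def clean_sequence(sequence):
    """Remove the <s> and </s> markers that wrap each predicted sequence."""
    return sequence.replace("<s>", "").replace("</s>", "").strip()


def summarize(predictions):
    """Return (sequence, score) pairs sorted by descending score."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(clean_sequence(p["sequence"]), p["score"]) for p in ranked]


# Copied from the example output above.
predictions = [
    {"sequence": "<s> umberto eco è stato un grande scrittore</s>", "score": 0.5784581303596497, "token": 361},
    {"sequence": "<s> umberto eco è anche un grande scrittore</s>", "score": 0.33813193440437317, "token": 269},
    {"sequence": "<s> umberto eco è considerato un grande scrittore</s>", "score": 0.027196012437343597, "token": 3236},
    {"sequence": "<s> umberto eco è diventato un grande scrittore</s>", "score": 0.013716378249228, "token": 5742},
    {"sequence": "<s> umberto eco è inoltre un grande scrittore</s>", "score": 0.010662357322871685, "token": 1030},
]

for sequence, score in summarize(predictions):
    print(f"{score:.3f}  {sequence}")

# Total probability assigned to the five returned candidates.
total = sum(p["score"] for p in predictions)
print(f"Probability mass covered by the top 5: {total:.3f}")
```

For this example the five candidates together account for roughly 97% of the probability mass, with "stato" and "anche" clearly dominating.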

Citation

All of the original datasets are publicly available or were released with the owners' grant. The datasets are all released under a CC0 or CCBY license.

@inproceedings{magnini2006annotazione,
	title = {Annotazione di contenuti concettuali in un corpus italiano: I-CAB},
	author = {Magnini, Bernardo and Cappelli, Amedeo and Pianta, Emanuele and Speranza, Manuela and Bartalesi Lenzi, V and Sprugnoli, Rachele and Romano, Lorenza and Girardi, Christian and Negri, Matteo},
	booktitle = {Proc. of SILFI 2006},
	year = {2006}
}
@inproceedings{magnini2006cab,
	title = {I-CAB: the Italian Content Annotation Bank.},
	author = {Magnini, Bernardo and Pianta, Emanuele and Girardi, Christian and Negri, Matteo and Romano, Lorenza and Speranza, Manuela and Lenzi, Valentina Bartalesi and Sprugnoli, Rachele},
	booktitle = {LREC},
	pages = {963--968},
	year = {2006},
	organization = {Citeseer}
}

Authors

Loreto Parisi: loreto at musixmatch dot com, loretoparisi

Simone Francia: simone.francia at musixmatch dot com, simonefrancia

Paolo Magnani: paul.magnani95 at gmail dot com, paulthemagno

About Musixmatch AI

We do Machine Learning and Artificial Intelligence @musixmatch. Follow us on Twitter and GitHub.