Perceiver IO masked language model (IMDb)
This model is a Perceiver IO masked language model fine-tuned with masked language modeling and whole word masking on the IMDb dataset. It serves as a training example for the perceiver-io library.
Model description
The pretrained model is specified in Section 4 (Table 1) and Appendix F (Table 11) of the Perceiver IO paper (UTF-8 bytes tokenization, vocabulary size of 262, 201M parameters). The fine-tuned model has the same architecture as the pretrained model. It cross-attends to the raw UTF-8 bytes of the input.
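As a quick sanity check of the byte-level tokenization, the tokenizer published with this model can be inspected directly. The snippet below is only a sketch; the exact ids and the special tokens added around a sequence depend on the tokenizer implementation.
from transformers import AutoTokenizer
from perceiver.model.text import mlm  # auto-class registration
tokenizer = AutoTokenizer.from_pretrained("krasserm/perceiver-io-mlm-imdb")
# "awesome" consists of 7 UTF-8 bytes, so it should map to 7 byte-level token ids
ids = tokenizer("awesome", add_special_tokens=False).input_ids
print(len(ids), ids)
# decoding the ids recovers the original text
print(tokenizer.decode(ids))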
Model training
The model was trained with masked language modeling and whole word masking on the unsupervised split of the IMDb dataset. Input data are tokenized with a UTF-8 bytes tokenizer (vocabulary size = 262). Word masking is done dynamically at data loading time, i.e. each epoch masks a different set of words. Training was done with PyTorch Lightning, and the resulting checkpoint was converted to this 🤗 model with a library-specific conversion utility.
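The actual masking logic lives in the perceiver-io data modules; the following is only a minimal, self-contained sketch of what dynamic whole word masking means at the byte level. Because masking happens when an example is loaded, each pass over the same review masks a different set of words.
import random

def mask_whole_words(text: str, mask_prob: float = 0.15) -> str:
    # illustrative only: every byte of a selected word is replaced with [MASK],
    # so the model has to reconstruct the whole word from its context
    masked = []
    for word in text.split():
        if random.random() < mask_prob:
            masked.append("[MASK]" * len(word.encode("utf-8")))
        else:
            masked.append(word)
    return " ".join(masked)

review = "I watched this film and it was awesome."
for _ in range(3):
    # each call masks different words, mimicking dynamic masking per epoch
    print(mask_whole_words(review, mask_prob=0.3))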
Intended use and limitations
The fine-tuned model can be used for downstream tasks related to movie reviews, such as sentiment analysis of movie reviews (example). Direct usage of the model is shown below.
Usage examples
To use this model you first need to install the perceiver-io library with extension text:
pip install perceiver-io[text]
Then the model can be used with PyTorch. Either use the model and tokenizer directly:
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from perceiver.model.text import mlm # auto-class registration
repo_id = "krasserm/perceiver-io-mlm-imdb"
model = AutoModelForMaskedLM.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."
encoding = tokenizer(masked_text, return_tensors="pt")
# get index of first and last mask token
_, mask_indices = torch.where(encoding.input_ids == tokenizer.mask_token_id)
mask_beg = mask_indices[0]
mask_end = mask_indices[-1]
outputs = model(**encoding)
# get predictions for the 5 [MASK] tokens
masked_token_predictions = outputs.logits[0, mask_beg : mask_end + 1].argmax(dim=-1)
print(tokenizer.decode(masked_token_predictions))
film
or use a fill-mask pipeline:
from transformers import pipeline
from perceiver.model.text import mlm # auto-class registration
repo_id = "krasserm/perceiver-io-mlm-imdb"
masked_text = "I watched this[MASK][MASK][MASK][MASK][MASK] and it was awesome."
filler_pipeline = pipeline("fill-mask", model=repo_id)
masked_token_predictions = filler_pipeline(masked_text)
print("".join([pred[0]["token_str"] for pred in masked_token_predictions]))
film
Checkpoint conversion
The krasserm/perceiver-io-mlm-imdb model has been created from a training checkpoint with:
from perceiver.model.text.mlm import convert_checkpoint
convert_checkpoint(
    save_dir="krasserm/perceiver-io-mlm-imdb",
    ckpt_url="https://martin-krasser.com/perceiver/logs-0.8.0/mlm/version_0/checkpoints/epoch=012-val_loss=1.165.ckpt",
    tokenizer_name="krasserm/perceiver-io-mlm",
    push_to_hub=True,
)