Model name: bert_en_cased_preprocess

Description adapted from TFHub

Overview

This SavedModel is a companion of BERT models to preprocess plain text inputs into the input format expected by BERT. Check the model documentation to find the correct preprocessing model for each particular BERT or other Transformer encoder model.

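BERT and its preprocessing were originally published by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova in "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2018).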

This model uses an English vocabulary extracted from Wikipedia and BooksCorpus (the same as in the models by the original BERT authors). Text inputs are normalized the "cased" way, meaning that the distinction between lower and upper case, as well as accent markers, is preserved.

This model has no trainable parameters and can be used in an input pipeline outside the training loop.

Prerequisites

This SavedModel uses TensorFlow operations defined by the TensorFlow Text library. On Google Colaboratory, it can be installed with

!pip install tensorflow_text
import tensorflow_text as text  # Registers the ops.

Usage

This SavedModel implements the preprocessor API for text embeddings with Transformer encoders, which offers several ways to go from one or more batches of text segments (plain text encoded as UTF-8) to the inputs for the Transformer encoder model.

Basic usage for single segments

Inputs with a single text segment can be mapped to encoder inputs like this:

Using TF Hub and HF Hub

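import tensorflow as tf
from huggingface_hub import snapshot_download
from tensorflow_hub import KerasLayer

# Download the SavedModel files from the HF Hub, then wrap them as a Keras layer.
model_path = snapshot_download(repo_id="Dimitre/bert_en_cased_preprocess")
preprocessor = KerasLayer(handle=model_path)
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)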

Using TF Hub fork

import tensorflow as tf

# pull_from_hub comes from the TF Hub fork named in the heading above.
preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)

The resulting encoder inputs have seq_length=128.

General usage

For pairs of input segments, to control the seq_length, or to modify tokenized sequences before packing them into encoder inputs, the preprocessor can be called like this:

import tensorflow as tf
import tensorflow_hub as hub

preprocessor = pull_from_hub(repo_id="Dimitre/bert_en_cased_preprocess")

# Step 1: tokenize batches of text inputs.
text_inputs = [tf.keras.layers.Input(shape=(), dtype=tf.string),
               ...] # This SavedModel accepts up to 2 text inputs.
tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized_inputs = [tokenize(segment) for segment in text_inputs]

# Step 2 (optional): modify tokenized inputs.
pass

# Step 3: pack input sequences for the Transformer encoder.
seq_length = 128  # Your choice here.
bert_pack_inputs = hub.KerasLayer(
    preprocessor.bert_pack_inputs,
    arguments=dict(seq_length=seq_length))  # Optional argument.
encoder_inputs = bert_pack_inputs(tokenized_inputs)

The call to tokenize() returns an int32 RaggedTensor of shape [batch_size, (words), (tokens_per_word)]. Correspondingly, the call to bert_pack_inputs() accepts a RaggedTensor of shape [batch_size, ...] with rank 2 or 3.
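
To make the two shapes concrete, here is a minimal plain-Python sketch that uses nested lists as a stand-in for the RaggedTensor. The token ids are made up for illustration, and flatten_wordpieces is a hypothetical helper, not part of the SavedModel. On a real tf.RaggedTensor, merge_dims(-2, -1) should produce the same rank-2 result.

```python
# Plain-Python stand-in for a tokenized batch; the token ids are made up.
# Shape: [batch_size, (words), (tokens_per_word)]
tokenized = [
    [[1142], [1110], [170, 2774]],  # 3 words, the last split into 2 wordpieces
    [[8667, 1362]],                 # 1 word split into 2 wordpieces
]


def flatten_wordpieces(batch):
    """Merge the (words) and (tokens_per_word) dimensions of each example."""
    return [[tok for word in example for tok in word] for example in batch]


# Shape: [batch_size, (tokens)], i.e. the rank-2 form.
flat = flatten_wordpieces(tokenized)
print(flat)
```

Either form can be passed on to packing; the rank-3 form is useful when a modification step needs to know where word boundaries are.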

Output details

The result of preprocessing is a batch of fixed-length input sequences for the Transformer encoder.

An input sequence starts with one start-of-sequence token, followed by the tokenized segments, each terminated by one end-of-segment token. Remaining positions up to seq_length, if any, are filled up with padding tokens. If an input sequence would exceed seq_length, the tokenized segments in it are truncated to prefixes of approximately equal sizes to fit exactly.

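The encoder_inputs are a dict of three int32 Tensors, all with shape [batch_size, seq_length], whose elements represent the batch of input sequences as follows:

"input_word_ids": has the token ids of the input sequences.

"input_mask": has value 1 at the position of all input tokens present before padding and value 0 for the padding tokens.

"input_type_ids": has the index of the input segment that gave rise to the input token at the respective position. The first input segment (index 0) includes the start-of-sequence token and its end-of-segment token. The second and later segments (if present) include their respective end-of-segment token. Padding tokens get index 0 again.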
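
The packing rules above can be illustrated for a single example with a short plain-Python sketch. This is not the SavedModel's implementation: pack_inputs is a hypothetical helper, the special token ids are placeholders (the real ones should be read from the preprocessor), and the round-robin budget is just one assumed way to obtain prefixes of approximately equal size. The output keys follow the names used by the TF Hub BERT preprocessors.

```python
def pack_inputs(segments, seq_length, cls_id=101, sep_id=102, pad_id=0):
    """Pack tokenized segments of ONE example into fixed-length encoder inputs.

    The ids are placeholder defaults. Assumes seq_length >= len(segments) + 1.
    """
    # Positions left for ordinary tokens after one start token and one
    # end-of-segment token per segment.
    budget = seq_length - 1 - len(segments)

    # Assumed truncation rule: hand out positions round-robin, so that long
    # segments are cut to prefixes of approximately equal size.
    keep = [0] * len(segments)
    while budget > 0 and any(k < len(s) for k, s in zip(keep, segments)):
        for i, seg in enumerate(segments):
            if budget > 0 and keep[i] < len(seg):
                keep[i] += 1
                budget -= 1

    word_ids, type_ids = [cls_id], [0]
    for i, seg in enumerate(segments):
        piece = seg[:keep[i]] + [sep_id]
        word_ids += piece
        type_ids += [i] * len(piece)

    # Pad up to seq_length; padding gets mask 0 and type id 0.
    num_pad = max(0, seq_length - len(word_ids))
    return {
        "input_word_ids": word_ids + [pad_id] * num_pad,
        "input_mask": [1] * len(word_ids) + [0] * num_pad,
        "input_type_ids": type_ids + [0] * num_pad,
    }


segments = [[5, 6, 7, 8, 9], [10, 11]]    # made-up token ids
padded = pack_inputs(segments, seq_length=12)
truncated = pack_inputs(segments, seq_length=8)
print(padded)
print(truncated)
```

With seq_length=12 both segments fit and two padding positions remain; with seq_length=8 the first segment is cut to a prefix so that the sequence fits exactly.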

Custom input packing and MLM support

The function

special_tokens_dict = preprocessor.tokenize.get_special_tokens_dict()

returns a dict of scalar int32 Tensors that report the tokenizer's "vocab_size" as well as the ids of certain special tokens: "padding_id", "start_of_sequence_id" (a.k.a. [CLS]), "end_of_segment_id" (a.k.a. [SEP]) and "mask_id". This allows users to replace preprocessor.bert_pack_inputs() with Python code such as text.combine_segments(), possibly text.mask_language_model(), and text.pad_model_inputs() from the TensorFlow Text library.
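
As a rough illustration of why these ids matter for MLM, the following plain-Python sketch masks a random subset of ordinary tokens while leaving special tokens alone. It is a simplified stand-in, not the TensorFlow Text implementation: mask_tokens is a hypothetical helper, the id values are placeholders for what get_special_tokens_dict() would report, and every selected token is replaced by the mask id (no random or unchanged replacements).

```python
import random

# Placeholder values; in practice read them from
# preprocessor.tokenize.get_special_tokens_dict().
special_tokens = {
    "padding_id": 0,
    "start_of_sequence_id": 101,
    "end_of_segment_id": 102,
    "mask_id": 103,
}


def mask_tokens(input_word_ids, special_tokens, mask_rate=0.15, rng=None):
    """Replace a random subset of ordinary tokens with the mask id.

    Returns (masked_ids, positions, labels), where labels holds the
    original ids found at the masked positions.
    """
    rng = rng or random.Random()
    protected = {
        special_tokens["padding_id"],
        special_tokens["start_of_sequence_id"],
        special_tokens["end_of_segment_id"],
    }
    # Only ordinary tokens are candidates for masking.
    candidates = [i for i, tok in enumerate(input_word_ids)
                  if tok not in protected]
    # Mask at least one token whenever there is anything to mask.
    num_to_mask = max(1, round(len(candidates) * mask_rate)) if candidates else 0
    positions = sorted(rng.sample(candidates, num_to_mask))

    masked = list(input_word_ids)
    for pos in positions:
        masked[pos] = special_tokens["mask_id"]
    labels = [input_word_ids[pos] for pos in positions]
    return masked, positions, labels


packed = [101, 5, 6, 7, 102, 10, 11, 102, 0, 0]  # made-up packed sequence
masked, positions, labels = mask_tokens(
    packed, special_tokens, rng=random.Random(0))
print(masked, positions, labels)
```

The positions and labels are what an MLM loss would need to score the encoder's predictions at the masked slots.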