
ID G2P BERT

ID G2P BERT is a phoneme de-masking model based on the BERT architecture. This model was trained from scratch on a modified Malay/Indonesian lexicon.

This model was trained using the Keras framework, and all training was done on Google Colaboratory. We adapted the BERT masked language modeling training script from the official Keras code examples.

Model

| Model | #params | Arch. | Training/Validation data |
| --- | --- | --- | --- |
| `id-g2p-bert` | 200K | BERT | Malay/Indonesian Lexicon |

Training Procedure

<details> <summary>Model Config</summary>

vocab_size: 32
max_len: 32
embed_dim: 128
num_attention_head: 2
feed_forward_dim: 128
num_layers: 2

</details>
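As a sanity check on the configuration above, the following is a rough parameter estimate. It is only a sketch: it assumes a standard encoder layout (learned token and position embeddings, per-head size `embed_dim // num_heads`, two layer norms per block, and a dense output head over the vocabulary), which may not match the trained model exactly. The function `estimate_params` and its argument names are hypothetical; `num_heads` and `ff_dim` stand in for `num_attention_head` and `feed_forward_dim`.

```python
def estimate_params(vocab_size, max_len, embed_dim, num_heads, ff_dim, num_layers):
    """Rough trainable-parameter count for a small BERT-style encoder (assumed layout)."""
    key_dim = embed_dim // num_heads          # assumed per-head projection size
    proj = num_heads * key_dim                # total width of the Q/K/V projections

    # token embeddings + position embeddings
    embeddings = vocab_size * embed_dim + max_len * embed_dim
    # Q, K, V projections plus the attention output projection (weights + biases)
    attention = 3 * (embed_dim * proj + proj) + (proj * embed_dim + embed_dim)
    # two dense layers: embed_dim -> ff_dim -> embed_dim
    feed_forward = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    # two layer norms per block, each with a scale and an offset vector
    layer_norms = 2 * (2 * embed_dim)
    # dense classification head over the vocabulary
    mlm_head = embed_dim * vocab_size + vocab_size

    return embeddings + num_layers * (attention + feed_forward + layer_norms) + mlm_head


config = dict(vocab_size=32, max_len=32, embed_dim=128,
              num_heads=2, ff_dim=128, num_layers=2)
total = estimate_params(**config)
print(f"{total:,} parameters")
```

Under these assumptions the estimate is 211,488 parameters, which is in line with the 200K reported in the table.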

<details> <summary>Training Setting</summary>

batch_size: 32
optimizer: "adam"
learning_rate: 0.001
epochs: 100

</details>

How to Use

<details> <summary>Tokenizers</summary>

id2token = {
    0: '',
    1: '[UNK]',
    2: 'a',
    3: 'n',
    4: 'ə',
    5: 'i',
    6: 'r',
    7: 'k',
    8: 'm',
    9: 't',
    10: 'u',
    11: 'g',
    12: 's',
    13: 'b',
    14: 'p',
    15: 'l',
    16: 'd',
    17: 'o',
    18: 'e',
    19: 'h',
    20: 'c',
    21: 'y',
    22: 'j',
    23: 'w',
    24: 'f',
    25: 'v',
    26: '-',
    27: 'z',
    28: "'",
    29: 'q',
    30: '[mask]'
}

token2id = {
    '': 0,
    "'": 28,
    '-': 26,
    '[UNK]': 1,
    '[mask]': 30,
    'a': 2,
    'b': 13,
    'c': 20,
    'd': 16,
    'e': 18,
    'f': 24,
    'g': 11,
    'h': 19,
    'i': 5,
    'j': 22,
    'k': 7,
    'l': 15,
    'm': 8,
    'n': 3,
    'o': 17,
    'p': 14,
    'q': 29,
    'r': 6,
    's': 12,
    't': 9,
    'u': 10,
    'v': 25,
    'w': 23,
    'y': 21,
    'z': 27,
    'ə': 4
}

</details>
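The two mappings above are inverses of each other, so they can also be built from a single ordered vocabulary list. The sketch below does that and adds two hypothetical helpers, `encode` and `decode`, that mirror the masking and padding used at inference time without loading the model. The `[UNK]` fallback for unseen characters and the truncation to `MAX_LEN` are assumptions of this sketch, not behavior of the original script.

```python
# Vocabulary in id order, copied from the id2token table.
VOCAB = ["", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s",
         "b", "p", "l", "d", "o", "e", "h", "c", "y", "j", "w", "f", "v",
         "-", "z", "'", "q", "[mask]"]

id2token = dict(enumerate(VOCAB))
token2id = {token: idx for idx, token in id2token.items()}

MAX_LEN = 32


def encode(word, mask_char="e"):
    """Map a word to MAX_LEN token ids, replacing every `mask_char` with [mask]."""
    ids = [
        token2id["[mask]"] if ch == mask_char
        else token2id.get(ch, token2id["[UNK]"])  # assumption: unseen chars -> [UNK]
        for ch in word
    ]
    ids = ids[:MAX_LEN]                                # assumption: truncate long words
    return ids + [token2id[""]] * (MAX_LEN - len(ids))  # pad with the empty token (id 0)


def decode(ids):
    """Map token ids back to a string, dropping padding (id 0)."""
    return "".join(id2token[i] for i in ids if i != 0)


print(encode("beli")[:6])
print(decode([13, 4, 15, 5, 0, 0]))
```

Here `encode("beli")` starts with `[13, 30, 15, 5]` followed by padding zeros, and `decode` simply reverses the lookup for whatever ids it is given.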

import keras
import tensorflow as tf
import numpy as np
from huggingface_hub import from_pretrained_keras

model = from_pretrained_keras("bookbot/id-g2p-bert")

MAX_LEN = 32
MASK_TOKEN_ID = 30

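# NOTE: id2token and token2id must be defined as in the Tokenizers section above.
def inference(sequence):
    """Mask every "e" in `sequence` and return the string with the model's predictions filled in."""
    # replace each "e" with the [mask] token; all other characters are kept as-is
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    tokens = [token2id[c] for c in sequence.split()]
    # pad with the empty token (id 0) up to MAX_LEN
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = tf.convert_to_tensor(np.array([tokens]))
    prediction = model.predict(input_ids)

    # find the positions of the masked tokens (column indices of the single batch row)
    masked_index = np.where(input_ids == MASK_TOKEN_ID)
    masked_index = masked_index[1]

    # keep the predictions at the masked positions only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with its predicted token id
    for i, idx in enumerate(masked_index):
        tokens[idx] = int(predicted_ids[i])

    # map ids back to characters, dropping the padding
    return "".join([id2token[t] for t in tokens if t != 0])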

inference("mengembangkannya")

Authors

ID G2P BERT was trained and evaluated by Ananto Joyoadikusumo, Steven Limcorn, and Wilson Wongso. All computation and development were done on Google Colaboratory.

Framework versions