ID G2P BERT

ID G2P BERT is a phoneme de-masking model based on the BERT architecture. This model was trained from scratch on a modified Malay/Indonesian lexicon.

This model was trained with the Keras framework, and all training was done on Google Colaboratory. The training script was adapted from the BERT masked language modeling example in the official Keras code examples.

Model

| Model | Parameters | Architecture | Training/Validation data |
| --- | --- | --- | --- |
| `id-g2p-bert` | 200K | BERT | Malay/Indonesian Lexicon |

Training Procedure

<details> <summary>Model Config</summary>

vocab_size: 32
max_len: 32
embed_dim: 128
num_attention_head: 2
feed_forward_dim: 128
num_layers: 2

</details>
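
As a rough sanity check on the 200K parameter figure, the sketch below counts the weights implied by this config. It assumes the layer layout of the Keras masked language modeling example (word and position embeddings, multi-head attention with `key_dim = embed_dim // num_heads`, two layer normalizations and a two-layer feed-forward block per encoder layer, and a dense softmax head). The function name `estimate_bert_params` is hypothetical, and the released model may differ in detail.

```python
def estimate_bert_params(vocab_size=32, max_len=32, embed_dim=128,
                         num_heads=2, feed_forward_dim=128, num_layers=2):
    """Estimate the weight count of a small BERT encoder (assumed layout)."""
    key_dim = embed_dim // num_heads

    # Token and position embedding tables.
    word_emb = vocab_size * embed_dim
    pos_emb = max_len * embed_dim

    # Query, key and value projections, each with a bias vector.
    qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
    # Projection of the concatenated heads back to embed_dim.
    attn_out = num_heads * key_dim * embed_dim + embed_dim
    # Two layer normalizations per encoder layer (scale and offset each).
    layer_norms = 2 * (2 * embed_dim)
    # Feed-forward block: embed_dim -> feed_forward_dim -> embed_dim.
    ffn = (embed_dim * feed_forward_dim + feed_forward_dim) \
        + (feed_forward_dim * embed_dim + embed_dim)

    per_layer = qkv + attn_out + layer_norms + ffn
    # Output head that scores every vocabulary entry at each position.
    head = embed_dim * vocab_size + vocab_size

    return word_emb + pos_emb + num_layers * per_layer + head


total = estimate_bert_params()
print(total)  # 211488 under the assumptions above, i.e. roughly 200K
```

Almost all of the weights sit in the two encoder layers; with a 32-entry vocabulary the embedding tables and output head contribute only a few thousand parameters each.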

<details> <summary>Training Setting</summary>

batch_size: 32
optimizer: "adam"
learning_rate: 0.001
epochs: 100

</details>

How to Use

<details> <summary>Tokenizers</summary>

id2token = {
    0: '',
    1: '[UNK]',
    2: 'a',
    3: 'n',
    4: 'ə',
    5: 'i',
    6: 'r',
    7: 'k',
    8: 'm',
    9: 't',
    10: 'u',
    11: 'g',
    12: 's',
    13: 'b',
    14: 'p',
    15: 'l',
    16: 'd',
    17: 'o',
    18: 'e',
    19: 'h',
    20: 'c',
    21: 'y',
    22: 'j',
    23: 'w',
    24: 'f',
    25: 'v',
    26: '-',
    27: 'z',
    28: "'",
    29: 'q',
    30: '[mask]'
}

token2id = {
    '': 0,
    "'": 28,
    '-': 26,
    '[UNK]': 1,
    '[mask]': 30,
    'a': 2,
    'b': 13,
    'c': 20,
    'd': 16,
    'e': 18,
    'f': 24,
    'g': 11,
    'h': 19,
    'i': 5,
    'j': 22,
    'k': 7,
    'l': 15,
    'm': 8,
    'n': 3,
    'o': 17,
    'p': 14,
    'q': 29,
    'r': 6,
    's': 12,
    't': 9,
    'u': 10,
    'v': 25,
    'w': 23,
    'y': 21,
    'z': 27,
    'ə': 4
}

</details>
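
The two tables above are inverses of each other, so only one of them needs to be written out by hand. The sketch below rebuilds both from a single list in id order and adds two hypothetical helpers, `encode` and `decode`, that are not part of the released code. Mapping unknown characters to `[UNK]` and lowercasing the input are assumptions made for illustration.

```python
# Vocabulary in id order, copied from the id2token table.
VOCAB = ["", "[UNK]", "a", "n", "ə", "i", "r", "k", "m", "t", "u", "g", "s",
         "b", "p", "l", "d", "o", "e", "h", "c", "y", "j", "w", "f", "v",
         "-", "z", "'", "q", "[mask]"]

id2token = dict(enumerate(VOCAB))
token2id = {token: idx for idx, token in id2token.items()}

MAX_LEN = 32


def encode(word, max_len=MAX_LEN):
    """Map each character to its id and right-pad with the empty token."""
    ids = [token2id.get(ch, token2id["[UNK]"]) for ch in word.lower()]
    if len(ids) > max_len:
        raise ValueError(f"word is longer than {max_len} characters")
    return ids + [token2id[""]] * (max_len - len(ids))


def decode(ids):
    """Turn ids back into a string, dropping the padding token."""
    return "".join(id2token[i] for i in ids if i != token2id[""])


ids = encode("makan")
print(ids[:5])      # [8, 2, 7, 2, 3]
print(decode(ids))  # makan
```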

import keras
import tensorflow as tf
import numpy as np
from huggingface_hub import from_pretrained_keras

model = from_pretrained_keras("bookbot/id-g2p-bert")

MAX_LEN = 32
MASK_TOKEN_ID = 30

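def inference(sequence):
    # Mask every "e" so the model predicts the phoneme at that position
    # (typically "e" or "ə").
    sequence = " ".join([c if c != "e" else "[mask]" for c in sequence])
    # Map characters outside the vocabulary to [UNK] instead of raising KeyError.
    tokens = [token2id.get(c, token2id["[UNK]"]) for c in sequence.split()]
    if len(tokens) > MAX_LEN:
        raise ValueError(f"input is longer than {MAX_LEN} characters")
    # Pad with the empty token (id 0) up to MAX_LEN.
    pad = [token2id[""] for _ in range(MAX_LEN - len(tokens))]

    tokens = tokens + pad
    input_ids = tf.convert_to_tensor(np.array([tokens]))
    prediction = model.predict(input_ids)

    # find the positions of the masked tokens
    masked_index = np.where(input_ids == MASK_TOKEN_ID)
    masked_index = masked_index[1]

    # keep the predictions at the masked positions only
    mask_prediction = prediction[0][masked_index]
    predicted_ids = np.argmax(mask_prediction, axis=1)

    # replace each mask with the predicted token
    for i, idx in enumerate(masked_index):
        tokens[idx] = int(predicted_ids[i])

    # drop padding and join the tokens back into a string
    return "".join([id2token[t] for t in tokens if t != 0])

print(inference("mengembangkannya"))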
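
The mask-filling step in `inference` can be tried without downloading the model. The sketch below swaps the model call for a hypothetical stand-in, `fake_predict`, that always scores "ə" highest. It does not reflect what the trained model predicts; it only shows how masked positions are located and filled. The output shape `(batch, MAX_LEN, vocab_size)` is assumed from the way `inference` indexes the prediction.

```python
import numpy as np

MAX_LEN = 32
VOCAB_SIZE = 32          # vocab_size from the model config
PAD_ID = 0               # ''
SCHWA_ID = 4             # 'ə'
M_ID = 8                 # 'm'
MASK_ID = 30             # '[mask]'


def fake_predict(input_ids):
    """Hypothetical stand-in for model.predict: always favours 'ə'."""
    scores = np.zeros((input_ids.shape[0], MAX_LEN, VOCAB_SIZE))
    scores[:, :, SCHWA_ID] = 1.0
    return scores


def fill_masks(tokens, predict_fn):
    """Replace every masked position with the highest-scoring token id."""
    input_ids = np.array([tokens])
    prediction = predict_fn(input_ids)

    # Column indices of the masked positions in the single batch row.
    masked_index = np.where(input_ids == MASK_ID)[1]
    predicted_ids = np.argmax(prediction[0][masked_index], axis=1)

    filled = list(tokens)
    for position, token_id in zip(masked_index, predicted_ids):
        filled[position] = int(token_id)
    return filled


# "me" with its "e" masked, followed by padding.
tokens = [M_ID, MASK_ID] + [PAD_ID] * (MAX_LEN - 2)
filled = fill_masks(tokens, fake_predict)
print(filled[:2])  # [8, 4]
```

Passing the prediction function as an argument keeps the post-processing testable on its own; in real use it would be `model.predict`.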

Authors

ID G2P BERT was trained and evaluated by Ananto Joyoadikusumo, Steven Limcorn, and Wilson Wongso. All computation and development were done on Google Colaboratory.

Framework versions

Keras 2.8.0
TensorFlow 2.8.0
