Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya:
Proposed Method:
<img src="data/proposed.png" height = "330" width ="760" >
The proposed method transfers a monolingual Transformer model to a new target language at the lexical level by learning new token embeddings. All implementations in this repository use XLNet as the source Transformer model; however, other Transformer models can be used in the same way.
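At a high level, the transfer keeps the pretrained Transformer body and replaces only its token-embedding matrix with vectors learned on the target language. The snippet below is a minimal sketch of that idea, not the repository's exact code; the word2vec file name, the xlnet-base-cased starting checkpoint, and the assumption that the word2vec vocabulary order matches the new tokenizer's token ids are all placeholders.

```python
import torch
from gensim.models import KeyedVectors
from transformers import XLNetModel

# Placeholder: Tigrinya word2vec vectors (see token_embeddings.ipynb) whose
# dimensionality matches the Transformer's hidden size and whose vocabulary
# order matches the new Tigrinya tokenizer's token ids.
word_vectors = KeyedVectors.load("tigrinya_word2vec.kv")

# Start from the pretrained monolingual (English) XLNet body.
model = XLNetModel.from_pretrained("xlnet-base-cased")
model.resize_token_embeddings(len(word_vectors.index_to_key))
embedding = model.get_input_embeddings()  # torch.nn.Embedding

# Overwrite the token-embedding matrix with target-language vectors;
# the Transformer body keeps its pretrained weights.
with torch.no_grad():
    for token_id, token in enumerate(word_vectors.index_to_key):
        embedding.weight[token_id] = torch.tensor(word_vectors[token])
```

The resulting model can then be fine-tuned on the target-language task, which is what train.ipynb does for Tigrinya sentiment analysis.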
Main files:
All files are Jupyter (IPython) notebooks that can be executed directly in Google Colab.
- train.ipynb : Fine-tunes XLNet (a monolingual Transformer) on the new target-language (Tigrinya) sentiment analysis dataset.
- token_embeddings.ipynb : Trains word2vec token embeddings for the Tigrinya language (a minimal sketch follows this list).
- process_Tigrinya_comments.ipynb : Extracts Tigrinya comments from mixed-language content.
- extract_YouTube_comments.ipynb : Downloads the available comments from a given YouTube channel ID.
- auto_labelling.ipynb : Automatically labels Tigrinya comments as positive or negative based on the sentiment of the emojis they contain.
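The word2vec training in token_embeddings.ipynb can be sketched roughly as follows, assuming a plain-text corpus with one tokenized Tigrinya sentence per line; the file names and hyperparameters here are placeholders, not the exact values used in the notebook.

```python
from gensim.models import Word2Vec

# Placeholder corpus file: one tokenized Tigrinya sentence per line.
with open("tigrinya_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

# vector_size should match the hidden size of the Transformer whose
# embedding layer the vectors will later replace (768 for XLNet-base).
w2v = Word2Vec(sentences, vector_size=768, window=5, min_count=2, workers=4)
w2v.wv.save("tigrinya_word2vec.kv")
```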
Tigrinya Tokenizer:
A SentencePiece-based tokenizer for Tigrinya has been released to the public and can be used as follows:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
tokenizer.tokenize("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ ሞ ብጣዕሚ ኢና ነመስግን ሓንቲ ክብላ ደልየ ዘሎኹ ሓደራኣኹም ኣብ ጊዜኹም ተረክቡ")
```
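The call above returns a list of SentencePiece subword pieces. To turn raw text into model inputs, the standard transformers call pattern applies (assuming a reasonably recent transformers version):

```python
inputs = tokenizer("ዋዋዋው እዛ ፍሊም ካብተን ዘድንቀን ሓንቲ ኢያ", return_tensors="pt")
```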
TigXLNet:
A new general-purpose Transformer model for the low-resource language Tigrinya has also been released to the public and can be used as follows:
```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("abryee/TigXLNet")
config.d_head = 64
model = AutoModel.from_pretrained("abryee/TigXLNet", config=config)
```
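For downstream sentiment classification (the task handled by train.ipynb), the same checkpoint can be loaded into a sequence-classification head. The snippet below is a generic sketch rather than the notebook's exact code; the number of labels, the example batch, and the training step are assumptions.

```python
import torch
from transformers import AutoConfig, AutoTokenizer, XLNetForSequenceClassification

# Assumed setup: binary sentiment labels (0 = negative, 1 = positive).
config = AutoConfig.from_pretrained("abryee/TigXLNet", num_labels=2)
config.d_head = 64
tokenizer = AutoTokenizer.from_pretrained("abryee/TigXLNet")
model = XLNetForSequenceClassification.from_pretrained("abryee/TigXLNet", config=config)

# Placeholder batch: one comment with a positive label.
batch = tokenizer(["ብጣዕሚ ኢና ነመስግን"], padding=True, return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # plug into any optimizer/training loop from here
```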
Evaluation:
The proposed method is evaluated using two datasets:
- A newly created sentiment analysis dataset for a low-resource language (Tigrinya).
<table>
<tr>
<td> <table>
<thead>
<tr>
<th><sub>Models</sub></th>
<th><sub>Configuration</sub></th>
<th><sub>F1-Score</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan=3><sub>BERT</sub></td>
<td rowspan=1><sub>+Frozen BERT weights</sub></td>
<td><sub>54.91</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>74.26</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>76.35</sub></td>
</tr>
<tr>
<td rowspan=3><sub>mBERT</sub></td>
<td rowspan=1><sub>+Frozen mBERT weights</sub></td>
<td><sub>57.32</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><sub>76.01</sub></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><sub>77.51</sub></td>
</tr>
<tr>
<td rowspan=3><sub>XLNet</sub></td>
<td rowspan=1><sub>+Frozen XLNet weights</sub></td>
<td><strong><sub>68.14</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Random embeddings</sub></td>
<td><strong><sub>77.83</sub></strong></td>
</tr>
<tr>
<td rowspan=1><sub>+Frozen token embeddings</sub></td>
<td><strong><sub>81.62</sub></strong></td>
</tr>
</tbody>
</table> </td>
<td><img src="data/effect_of_dataset_size.png" alt="3" width = 480px height = 280px></td>
</tr>
</table>
- The Cross-lingual Sentiment (CLS) dataset.
<table> <thead> <tr> <th rowspan=2><sub>Models</sub></th> <th rowspan=1 colspan=3><sub>English</sub></th> <th rowspan=1 colspan=3><sub>German</sub></th> <th rowspan=1 colspan=3><sub>French</sub></th> <th rowspan=1 colspan=3><sub>Japanese</sub></th> <th rowspan=2><sub>Average</sub></th> </tr> <tr> <th colspan=1><sub>Books</sub></th> <th colspan=1><sub>DVD</sub></th> <th colspan=1><sub>Music</sub></th> <th colspan=1><sub>Books</sub></th> <th colspan=1><sub>DVD</sub></th> <th colspan=1><sub>Music</sub></th> <th colspan=1><sub>Books</sub></th> <th colspan=1><sub>DVD</sub></th> <th colspan=1><sub>Music</sub></th> <th colspan=1><sub>Books</sub></th> <th colspan=1><sub>DVD</sub></th> <th colspan=1><sub>Music</sub></th> </tr> </thead> <tbody> <tr> <td colspan=1><sub>XLNet</sub></td> <td colspan=1><sub><strong>92.90</strong></sub></td> <td colspan=1><sub><strong>93.31</strong></sub></td> <td colspan=1><sub><strong>92.02</strong></sub></td> <td colspan=1><sub>85.23</sub></td> <td colspan=1><sub>83.30</sub></td> <td colspan=1><sub>83.89</sub></td> <td colspan=1><sub>73.05</sub></td> <td colspan=1><sub>69.80</sub></td> <td colspan=1><sub>70.12</sub></td> <td colspan=1><sub>83.20</sub></td> <td colspan=1><sub><strong>86.07</strong></sub></td> <td colspan=1><sub>85.24</sub></td> <td colspan=1><sub>83.08</sub></td> </tr> <tr> <td colspan=1><sub>mBERT</sub></td> <td colspan=1><sub>92.78</sub></td> <td colspan=1><sub>90.30</sub></td> <td colspan=1><sub>91.88</sub></td> <td colspan=1><sub><strong>88.65</strong></sub></td> <td colspan=1><sub><strong>85.85</strong></sub></td> <td colspan=1><sub><strong>90.38</strong></sub></td> <td colspan=1><sub><strong>91.09</strong></sub></td> <td colspan=1><sub><strong>88.57</strong></sub></td> <td colspan=1><sub><strong>93.67</strong></sub></td> <td colspan=1><sub><strong>84.35</strong></sub></td> <td colspan=1><sub>81.77</sub></td> <td colspan=1><sub><strong>87.53</strong></sub></td> <td colspan=1><sub><strong>88.90</strong></sub></td> </tr> </tbody> </table>
Dataset used for this paper:
We have constructed a new sentiment analysis dataset for the Tigrinya language; it can be found in the zip file (Tigrinya Sentiment Analysis Dataset).
Citing our paper:
Our paper can be accessed on arXiv (https://arxiv.org/abs/2006.07698); please consider citing our work:
```bibtex
@misc{tela2020transferring,
      title={Transferring Monolingual Model to Low-Resource Language: The Case of Tigrinya},
      author={Abrhalei Tela and Abraham Woubie and Ville Hautamaki},
      year={2020},
      eprint={2006.07698},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```