Dialectical-MSA-detection

Model description

This model was trained on 108,173 manually annotated sentences of user-generated content (UGC, e.g. tweets and online comments) to classify Arabic text into one of two categories: 'Dialectical' or 'MSA' (i.e. Modern Standard Arabic).
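
A minimal usage sketch, assuming the checkpoint loads with the `transformers` text-classification pipeline. The `model_id` argument is a placeholder (this card does not state the repository id), both helper names are illustrative, and the exact label strings returned depend on the model's configuration.

```python
from typing import Callable, Iterable, List, Tuple


def build_classifier(model_id: str) -> Callable:
    """Load a text-classification pipeline for the given checkpoint.

    `model_id` is a placeholder: pass the Hub repository id or a local
    path of the Dialectical-MSA-detection checkpoint.
    """
    # Imported lazily so label_texts() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def label_texts(classifier: Callable, texts: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Run the classifier and return (text, label, score) triples."""
    texts = list(texts)
    # For a list of strings, the pipeline returns one {"label", "score"} dict per input.
    results = classifier(texts)
    return [(t, r["label"], float(r["score"])) for t, r in zip(texts, results)]
```

Calling `label_texts(build_classifier(model_id), sentences)` would then give one predicted label and confidence score per input sentence.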

Training data

Dialectical-MSA-detection was trained on the manually annotated subset of the Arabic Online Commentary (AOC) dataset (Zaidan et al., 2011). The AOC dataset was created by crawling the websites of three Arabic newspapers and extracting online articles and readers' comments.

Training procedure

xlm-roberta-base was fine-tuned using the Hugging Face Trainer with the following hyperparameters:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",           # placeholder output path (not stated in the original card)
    num_train_epochs=4,               # total number of training epochs
    learning_rate=2e-5,               # initial learning rate
    per_device_train_batch_size=32,   # batch size per device during training
    per_device_eval_batch_size=4,     # batch size per device during evaluation
    warmup_steps=0,                   # number of warmup steps for the learning rate scheduler
    weight_decay=0.02,                # strength of weight decay
)
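
To make the effect of these settings more concrete, the sketch below estimates the number of optimizer steps and the learning rate at a given step. It assumes the Trainer's default linear schedule (the card does not set `lr_scheduler_type`), a single device, and no gradient accumulation; the function names are illustrative and not part of any library.

```python
import math


def total_training_steps(num_examples: int, batch_size: int = 32, epochs: int = 4) -> int:
    """Optimizer steps on one device with no gradient accumulation.

    The last, possibly smaller, batch of each epoch is kept, hence the ceiling.
    """
    return math.ceil(num_examples / batch_size) * epochs


def linear_decay_lr(step: int, total_steps: int,
                    base_lr: float = 2e-5, warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With `warmup_steps=0`, the rate under this assumption starts at the full 2e-5 on the first step and falls linearly to zero by the last one.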

Eval results

The model was evaluated on a held-out 10% of the sentences (90-10 train-dev split) and reached an accuracy of 0.88 on the dev set.
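
The evaluation setup can be sketched in plain Python as follows. This is an illustration only: the card does not say how the split was drawn, so the random shuffle, the `seed` parameter, and the absence of stratification are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def train_dev_split(items: Sequence, dev_fraction: float = 0.1,
                    seed: int = 0) -> Tuple[List, List]:
    """Shuffle the items and split them into (train, dev) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```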

Limitations and bias

The model was trained on sentences from the online commentary domain. Other forms of user-generated text (UGT), such as tweets, may differ in their degree of dialectness.

BibTeX entry and citation info

@article{saadany2022semi,
  title={A Semi-supervised Approach for a Better Translation of Sentiment in Dialectical Arabic UGT},
  author={Saadany, Hadeel and Orasan, Constantin and Mohamed, Emad and Tantawy, Ashraf},
  journal={arXiv preprint arXiv:2210.11899},
  year={2022}
}