NER Model Using RoBERTa

This markdown presents a RoBERTa (Robustly Optimized BERT Pretraining Approach) model trained on a combination of two datasets covering two languages: English and Persian. The English dataset is CoNLL 2003, while the Persian dataset, PEYMA-ARMAN-Mixed, is a fusion of the PEYMA and ARMAN datasets, both widely used for Named Entity Recognition (NER).

The model training pipeline involves the following steps:

1. Data Preparation: Cleaning, aligning, and mixing data from the two datasets.
2. Data Loading: Loading the prepared data for subsequent processing.
3. Tokenization: Tokenizing the text to prepare it for model input.
4. Token Splitting: Handling sub-word splitting (e.g., "jack" may become "_ja _ck") and assigning the label "-100" to sub-word continuation pieces and special tokens so they are ignored when the loss is computed (see the sketch after this list).
5. Model Reconstruction: Adapting the RoBERTa model for token classification in NER tasks.
6. Model Training: Training the reconstructed model on the combined dataset and evaluating its performance.
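The label-alignment step can be illustrated with a minimal sketch. It is not the exact training code: the `align_labels` helper and the numeric label ids are illustrative, and a fast tokenizer is assumed so that `word_ids()` is available. Only the first piece of each word keeps its label; continuation pieces and special tokens receive -100.

```python
from transformers import AutoTokenizer

# Illustrative sketch of the "-100" alignment step; the actual training pipeline may differ.
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")

def align_labels(words, word_labels):
    """Tokenize pre-split words and assign -100 to sub-word pieces and special tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned = []
    previous_word = None
    for word_id in encoding.word_ids():
        if word_id is None:              # special tokens such as <s> and </s>
            aligned.append(-100)
        elif word_id != previous_word:   # first piece of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:                            # remaining pieces are ignored by the loss
            aligned.append(-100)
        previous_word = word_id
    return encoding, aligned

# Example with illustrative label ids: "jack" may be split into several pieces,
# and only the first piece keeps the entity label.
encoding, labels = align_labels(["jack", "lives", "in", "London"], [1, 0, 0, 3])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(labels)
```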

The model's performance across four training epochs, shown in the table below, demonstrates promising results:

| Epoch | Training Loss | Validation Loss | F1 | Recall | Precision | Accuracy |
|------:|--------------:|----------------:|------:|---------:|----------:|---------:|
| 1 | 0.072600 | 0.038918 | 0.895 | 0.906680 | 0.883703 | 0.987799 |
| 2 | 0.027600 | 0.030184 | 0.923 | 0.933840 | 0.915573 | 0.991334 |
| 3 | 0.013500 | 0.030962 | 0.940 | 0.946840 | 0.933740 | 0.992702 |
| 4 | 0.006600 | 0.029897 | 0.948 | 0.955207 | 0.941990 | 0.993574 |

By the fourth epoch, the model reaches an F1-score of almost 95%.

To load the model, the following Python snippet can be used:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification

# Load the configuration, tokenizer, and model.
# AutoModelForTokenClassification also loads the NER classification head;
# plain AutoModel would return only the base encoder.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")
```

With these few lines, you can load the trained multilingual NER model and incorporate it into various Natural Language Processing tasks.
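As a usage sketch (assuming the checkpoint ships its label mapping in the config), the model can be wrapped in a `transformers` token-classification pipeline to extract entities. The example sentences are illustrative only:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Sketch: wrap the checkpoint in a token-classification pipeline.
# aggregation_strategy="simple" merges sub-word pieces back into whole entity spans.
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

print(ner("Jack lives in London."))        # English example
print(ner("علی در تهران زندگی می‌کند."))    # Persian example ("Ali lives in Tehran.")
```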