NER Model Using RoBERTa

This markdown presents a RoBERTa (Robustly Optimized BERT Pretraining Approach) model trained on a combination of two datasets covering two languages: English and Persian. The English dataset is CoNLL 2003, while the Persian dataset, PEYMA-ARMAN-Mixed, is a fusion of the PEYMA and ARMAN datasets, both widely used for Named Entity Recognition (NER).

The model training pipeline involves the following steps:

1. Data Preparation: Cleaning, aligning, and mixing data from the two datasets.
2. Data Loading: Loading the prepared data for subsequent processing.
3. Tokenization: Tokenizing the text to prepare it for model input.
4. Token Splitting: Handling sub-word splitting (e.g., "jack" may become "_ja _ck") and assigning the label "-100" to sub-word continuation pieces and special tokens so they are ignored when the loss is computed (see the sketch after this list).
5. Model Reconstruction: Adapting the RoBERTa model for token classification in NER tasks.
6. Model Training: Training the reconstructed model on the combined dataset and evaluating its performance.
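The label-alignment step can be illustrated with a minimal sketch. It is not the exact training code: the `align_labels` helper and the numeric label ids are illustrative, and a fast tokenizer is assumed so that `word_ids()` is available. Only the first piece of each word keeps its label; continuation pieces and special tokens receive -100.

```python
from transformers import AutoTokenizer

# Illustrative sketch of the "-100" alignment step; the actual training pipeline may differ.
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")

def align_labels(words, word_labels):
    """Tokenize pre-split words and assign -100 to sub-word pieces and special tokens."""
    encoding = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned = []
    previous_word = None
    for word_id in encoding.word_ids():
        if word_id is None:              # special tokens such as <s> and </s>
            aligned.append(-100)
        elif word_id != previous_word:   # first piece of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:                            # remaining pieces are ignored by the loss
            aligned.append(-100)
        previous_word = word_id
    return encoding, aligned

# Example with illustrative label ids: "jack" may be split into several pieces,
# and only the first piece keeps the entity label.
encoding, labels = align_labels(["jack", "lives", "in", "London"], [1, 0, 0, 3])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(labels)
```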

The model's performance across four training epochs, shown in the table below, demonstrates promising results:

| Epoch | Training Loss | Validation Loss | F1 | Recall | Precision | Accuracy |
|------:|--------------:|----------------:|------:|---------:|----------:|---------:|
| 1 | 0.072600 | 0.038918 | 0.895 | 0.906680 | 0.883703 | 0.987799 |
| 2 | 0.027600 | 0.030184 | 0.923 | 0.933840 | 0.915573 | 0.991334 |
| 3 | 0.013500 | 0.030962 | 0.940 | 0.946840 | 0.933740 | 0.992702 |
| 4 | 0.006600 | 0.029897 | 0.948 | 0.955207 | 0.941990 | 0.993574 |

By the fourth epoch, the model reaches an F1-score of almost 95%.

To load the model, the following Python snippet can be used:

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification

# Load the configuration, tokenizer, and model.
# AutoModelForTokenClassification also loads the NER classification head;
# plain AutoModel would return only the base encoder.
config = AutoConfig.from_pretrained("AliFartout/Roberta-fa-en-ner")
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")
```

With these few lines, you can load the trained multilingual NER model and incorporate it into various Natural Language Processing tasks.
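As a usage sketch (assuming the checkpoint ships its label mapping in the config), the model can be wrapped in a `transformers` token-classification pipeline to extract entities. The example sentences are illustrative only:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Sketch: wrap the checkpoint in a token-classification pipeline.
# aggregation_strategy="simple" merges sub-word pieces back into whole entity spans.
tokenizer = AutoTokenizer.from_pretrained("AliFartout/Roberta-fa-en-ner")
model = AutoModelForTokenClassification.from_pretrained("AliFartout/Roberta-fa-en-ner")

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)

print(ner("Jack lives in London."))        # English example
print(ner("علی در تهران زندگی می‌کند."))    # Persian example ("Ali lives in Tehran.")
```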