LanguageDetection
This model is a fine-tuned version of papluca/xlm-roberta-base-language-detection, trained on a personal dataset to identify text written in Roman Urdu, Urdu, and English.
Model Details
- Base Model: xlm-roberta-base
- Fine-Tuned Model: xlm-roberta-base-language-detection
- Languages Detected: Roman Urdu, Urdu, English
Model Training
The model was fine-tuned on a personal dataset for language detection. The training procedure involved the following steps:
Dataset
The training dataset consisted of text samples in Roman Urdu, Urdu, and English. The samples were collected from various sources and annotated with language labels.
Preprocessing
Text samples were tokenized and preprocessed using the Tokenizers library to prepare the inputs for the model.
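As a minimal sketch of this step (assuming the tokenizer shipped with the base checkpoint was reused; the card does not include the actual preprocessing script), tokenization looks like this:

from transformers import AutoTokenizer

# Assumption: the tokenizer from the base checkpoint was used for preprocessing.
tokenizer = AutoTokenizer.from_pretrained("papluca/xlm-roberta-base-language-detection")

# Tokenize a small batch with padding and truncation so all inputs share one shape
encodings = tokenizer(
    ["Abhi kuch log baqi hain", "This is English text"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)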
Hyperparameters
The model was trained with the following hyperparameters; a sketch of the corresponding training arguments follows the list:
- Learning Rate: 5e-05
- Batch Size: 32
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning Rate Scheduler: Linear with warmup steps
- Number of Epochs: 3
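These settings map onto transformers.TrainingArguments roughly as follows. This is a reconstruction from the list above, not the original training script; the output directory and the warmup step count (the card says "warmup steps" without a number) are placeholders:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="language-detection",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=500,                 # assumed value; the card does not specify one
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)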
Evaluation
The model was evaluated on a separate validation set to monitor its performance during training.
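The card does not state which metric was tracked. A minimal compute_metrics function for the Trainer, assuming accuracy was the monitored metric, would look like:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple supplied by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}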
Model Usage
You can use this model for language detection by providing a text input. The model will predict one of the following languages: Roman Urdu, Urdu, or English.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model (replace "your-model-name" with the model's Hub ID)
tokenizer = AutoTokenizer.from_pretrained("your-model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-model-name")

# Prepare input text
text = "Abhi kuch log baqi hain"

# Tokenize and predict language
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_id = outputs.logits.argmax(dim=-1).item()

# This label order must match the model's training configuration;
# model.config.id2label is the authoritative mapping.
languages = ["English", "Roman Urdu", "Urdu"]
predicted_language_label = languages[predicted_id]
print("Predicted Language:", predicted_language_label)
Framework versions
- Transformers 4.30.2
- PyTorch 2.0.0
- Datasets 2.1.0
- Tokenizers 0.13.3