LanguageDetection
This model is a fine-tuned version of papluca/xlm-roberta-base-language-detection, trained on a personal dataset to identify text written in Roman Urdu, Urdu, and English.
Model Details
- Base Model: xlm-roberta-base
- Fine-Tuned Model: xlm-roberta-base-language-detection
- Languages Detected: Roman Urdu, Urdu, English
Model Training
The model was fine-tuned on a personal dataset for language detection. The training procedure involved the following steps:
Dataset
The training dataset consisted of text samples in Roman Urdu, Urdu, and English. The samples were collected from various sources and annotated with language labels.
Preprocessing
Text samples were tokenized and preprocessed using the Tokenizers library to prepare the inputs for the model.
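As a minimal sketch of this step (assuming the tokenizer shipped with the base checkpoint was reused; the card does not include the actual preprocessing script), tokenization looks like this:

from transformers import AutoTokenizer

# Assumption: the tokenizer from the base checkpoint was used for preprocessing.
tokenizer = AutoTokenizer.from_pretrained("papluca/xlm-roberta-base-language-detection")

# Tokenize a small batch with padding and truncation so all inputs share one shape
encodings = tokenizer(
    ["Abhi kuch log baqi hain", "This is English text"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)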
Hyperparameters
The model was trained with the following hyperparameters; a sketch of the corresponding training arguments follows the list:
- Learning Rate: 5e-05
- Batch Size: 32
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning Rate Scheduler: Linear with warmup steps
- Number of Epochs: 3
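These settings map onto transformers.TrainingArguments roughly as follows. This is a reconstruction from the list above, not the original training script; the output directory and the warmup step count (the card says "warmup steps" without a number) are placeholders:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="language-detection",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_steps=500,                 # assumed value; the card does not specify one
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)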
Evaluation
The model was evaluated on a separate validation set to monitor its performance during training.
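The card does not state which metric was tracked. A minimal compute_metrics function for the Trainer, assuming accuracy was the monitored metric, would look like:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple supplied by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}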
Model Usage
You can use this model for language detection by providing a text input. The model will predict one of the following languages: Roman Urdu, Urdu, or English.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model (replace "your-model-name" with the model's Hub ID)
tokenizer = AutoTokenizer.from_pretrained("your-model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-model-name")

# Prepare input text
text = "Abhi kuch log baqi hain"

# Tokenize and predict language
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_id = outputs.logits.argmax(dim=-1).item()

# This label order must match the model's training configuration;
# model.config.id2label is the authoritative mapping.
languages = ["English", "Roman Urdu", "Urdu"]
predicted_language_label = languages[predicted_id]
print("Predicted Language:", predicted_language_label)
Framework versions
- Transformers 4.30.2
- PyTorch 2.0.0
- Datasets 2.1.0
- Tokenizers 0.13.3