
LanguageDetection

This model is a fine-tuned version of papluca/xlm-roberta-base-language-detection. It was trained on a personal dataset to classify text as Roman Urdu, Urdu, or English.

Model Details

Model Training

The model was fine-tuned on a personal dataset for language detection. The training procedure involved the following steps:

Dataset

The training dataset consisted of text samples in Roman Urdu, Urdu, and English, collected from various sources and annotated with language labels.
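The dataset itself is not public. As a rough sketch of what loading it might look like, assuming a CSV layout with a text column and a language label column (the file names and column names here are hypothetical):

from datasets import load_dataset

# Hypothetical file and column names; the personal dataset is not public.
# Assumed layout: one sample per row, e.g. {"text": "...", "label": "roman_urdu"}
dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "validation": "valid.csv"},
)
print(dataset["train"][0])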

Preprocessing

Text samples were tokenized and preprocessed using the Tokenizers library to prepare the input for the model.
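A minimal preprocessing sketch, continuing from the loading example above; the truncation length of 128 tokens is an assumption, not a documented setting:

from transformers import AutoTokenizer

# Tokenizer taken from the base checkpoint this model was fine-tuned from
tokenizer = AutoTokenizer.from_pretrained("papluca/xlm-roberta-base-language-detection")

def tokenize(batch):
    # Truncate long samples and pad so every input shares one length
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize, batched=True)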

Hyperparameters

The exact hyperparameter values used for fine-tuning are not recorded in this card.
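For orientation only, a hypothetical Trainer configuration is sketched below; every value is a placeholder, not the configuration actually used:

from transformers import TrainingArguments

# All values below are illustrative placeholders, not the actual settings
training_args = TrainingArguments(
    output_dir="language-detection",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)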

Evaluation

The model was evaluated on a held-out validation set to monitor its performance during training.
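The metrics tracked during evaluation are not documented; a typical compute_metrics callback for the Hugging Face Trainer might look like this (accuracy and macro-F1 are assumptions):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # Convert logits to class predictions and score them against the gold labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1_macro": f1_score(labels, predictions, average="macro"),
    }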

Model Usage

You can use this model for language detection by passing it a text input. The model predicts one of three languages: Roman Urdu, Urdu, or English. For example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("your-model-name")
model = AutoModelForSequenceClassification.from_pretrained("your-model-name")
model.eval()

# Prepare input text
text = "Abhi kuch log baqi hain"

# Tokenize and predict language
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_language = outputs.logits.argmax(dim=-1).item()

# Map the predicted class index to a label; model.config.id2label
# is the authoritative mapping and should agree with this list.
languages = ["English", "Roman Urdu", "Urdu"]
predicted_language_label = languages[predicted_language]

print("Predicted Language:", predicted_language_label)

Framework versions