nepali-nlp nepali-news-classificiation nlp transformers deep-learning pytorch transfer-learning

patrakar/ पत्रकार (Nepali News Classifier)

Last updated: September 2022

Model Details

patrakar is a pre-trained DistilBERT sequence-classification transformer model that classifies Nepali-language news into 9 newsgroup categories.

It was developed by Sahaj Raj Malla to be generally useful to the public, and so that others can explore it for commercial and scientific purposes. This model was trained on top of the Sakonii/distilgpt2-nepali model.

It achieves the following results on the test dataset:

Total number of samples    Accuracy (%)
5670                       95.475

Model date

September 2022

Model type

Sequence classification model

Model version


Model Usage

This model can be used directly with a pipeline for text classification:

from transformers import pipeline

# load the classifier from the model hub
model_name = "sahajrajmalla/patrakar"
classifier = pipeline('text-classification', model=model_name)

text = "नेकपा (एमाले)का नेता गोकर्णराज विष्टले सहमति र सहकार्यबाटै संविधान बनाउने तथा जनताको जीवनस्तर उकास्ने काम गर्नु नै अबको मुख्य काम रहेको बताएका छन् ।"

# run the classifier; the pipeline returns a list of dicts with 'label' and 'score' keys
print(classifier(text))
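
When several articles need classifying at once, the pipeline can be called with a list of strings. The following is a minimal sketch, assuming the pipeline returns one dict with `label` and `score` keys per input (the usual shape for a text-classification pipeline). The helper name `classify_batch` and the `min_score` threshold are hypothetical and not part of patrakar.

```python
from typing import Callable, Dict, List, Optional


def classify_batch(
    classifier: Callable[[List[str]], List[Dict]],
    texts: List[str],
    min_score: float = 0.5,
) -> List[Optional[str]]:
    """Return one predicted label per text, or None when the score is below min_score.

    `classifier` is assumed to behave like a transformers text-classification
    pipeline: given a list of strings, it returns a list of
    {"label": str, "score": float} dicts in the same order.
    `min_score` is a hypothetical cut-off chosen for illustration.
    """
    results = classifier(texts)
    return [r["label"] if r["score"] >= min_score else None for r in results]
```

With the `classifier` created above, `classify_batch(classifier, [text])` would return a one-element list holding the predicted label, or `None` if the model's score falls below the threshold.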


Here is how we can use the model to predict the newsgroup of a given text in PyTorch:

!pip install transformers torch

from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

import torch
import torch.nn.functional as F

# initializing model and tokenizer
model_name = "sahajrajmalla/patrakar"

# downloading tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# downloading model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["data"], padding="max_length", truncation=True)

# predicting with the model
sequence_i_want_to_predict = "राजनीतिक स्थिरता नहुँदा विकास निर्माणले गति लिन सकेन"

# initializing our labels
label_list = [

batch = tokenizer(sequence_i_want_to_predict, padding=True, truncation=True, max_length=512, return_tensors='pt')

with torch.no_grad():
    outputs = model(**batch)
    predictions = F.softmax(outputs.logits, dim=1)
    labels = torch.argmax(predictions, dim=1)

print(f"The sequence: \n\n {sequence_i_want_to_predict} \n\n is predicted to be of newsgroup {label_list[labels.item()]}")
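
To make the last two steps concrete, here is a small sketch in plain Python of what `F.softmax` and `torch.argmax` compute for a single row of logits, followed by a lookup of the label name. The function names, the example logits, and the placeholder label names are hypothetical; the real category names are whatever `label_list` holds.

```python
import math
from typing import Dict, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: List[float], id2label: Dict[int, str]) -> Tuple[str, float]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
    return id2label[best], probs[best]


# Hypothetical logits and placeholder label names, for illustration only.
example_logits = [1.0, 3.0, 0.5]
placeholder_labels = {0: "category_0", 1: "category_1", 2: "category_2"}
label, prob = predict_label(example_logits, placeholder_labels)
print(label, round(prob, 3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits picks the same label; the softmax step matters only when the probability itself is reported.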

Training data

This model was trained on 50,945 rows of a Nepali-language news-group dataset found on Kaggle, which was also used in the IT Meet 2022 Text challenge.

Framework versions