Satoken
This is a SetFit model trained on multilingual datasets (listed below) for sentiment classification.
The model has been trained using an efficient few-shot learning technique that involves:
- Fine-tuning a Sentence Transformer with contrastive learning.
- Training a classification head with features from the fine-tuned Sentence Transformer.
It is used by Germla in its feedback analysis tool (specifically the sentiment analysis feature).
For other (language-specific) models, check here
Usage
To use this model for inference, first install the SetFit library:
```shell
python -m pip install setfit
```
You can then run inference as follows:
```python
from setfit import SetFitModel

# Download from Hub and run inference
model = SetFitModel.from_pretrained("germla/satoken")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
```
Training Details
Training Data
Training Procedure
We made sure the dataset was balanced. The model was trained on only 35% (50% for Chinese) of the train split of each dataset.
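A balanced 35% subsample can be drawn by sampling the same fraction from each label; the following is a sketch of the idea, not the exact sampling code used.

```python
import random
from collections import defaultdict

def balanced_subsample(texts, labels, fraction, seed=42):
    """Take the same fraction from each label so the subsample stays balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    sampled_texts, sampled_labels = [], []
    for label, items in by_label.items():
        k = int(len(items) * fraction)
        for text in rng.sample(items, k):
            sampled_texts.append(text)
            sampled_labels.append(label)
    return sampled_texts, sampled_labels

texts = [f"example {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]  # 100 per class
sub_texts, sub_labels = balanced_subsample(texts, labels, 0.35)
# 35% of each 100-example class -> 70 examples total, 35 per label
```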
Preprocessing
- Basic cleaning (removal of duplicates, links, mentions, hashtags, etc.)
- Removal of stopwords using NLTK
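The cleaning steps above can be sketched as follows; the regexes and the inline stopword set are illustrative assumptions (the actual pipeline used NLTK's per-language stopword lists).

```python
import re

# Small illustrative stopword set; the real pipeline uses NLTK's stopword corpora.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def clean(text: str) -> str:
    """Strip links, mentions, and hashtags, then drop stopwords."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)

def dedupe(texts):
    """Remove duplicate examples while preserving order."""
    return list(dict.fromkeys(texts))

raw = [
    "Check this out https://t.co/x #spam @bot the app is great",
    "Check this out https://t.co/x #spam @bot the app is great",
    "an okay experience",
]
cleaned = [clean(t) for t in dedupe(raw)]
# -> ["Check this out app great", "okay experience"]
```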
Speeds, Sizes, Times
Training took 6 hours on an NVIDIA T4 GPU.
Evaluation
Testing Data, Factors & Metrics
Environmental Impact
- Hardware Type: NVIDIA T4 GPU
- Hours used: 6
- Cloud Provider: Amazon Web Services
- Compute Region: ap-south-1 (Mumbai)
- Carbon Emitted: 0.39 kg CO₂ eq.