Tags: transformers, argument-mining, opinion-mining, information-extraction, inference-extraction, Twitter

WRAP -- A Content Management System for Twitter

Introducing WRAP, a classification model built on AutoModelForSequenceClassification that assigns tweets to the four classes of the TACO dataset: Reason, Statement, Notification, and None. Designed for extracting information and inferences from Twitter data, the model builds on WRAPresentations, from which WRAP takes its name. WRAPresentations is an extension of the BERTweet-base architecture, whose pre-training was continued on augmented tweets using contrastive learning.

Class Semantics

The TACO framework revolves around the two key components of an argument, as defined by the Cambridge Dictionary. It encodes inference as "a guess that you make or an opinion that you form based on the information that you have", and it leverages the definition of information as "facts or details about a person, company, product, etc.".

Taken together, WRAP identifies specific classes of tweets in which inferences and information can be aggregated according to the components they contain. In its entirety, WRAP classifies tweets according to the following hierarchy:

<div align="center"> <img src="https://github.com/TomatenMarc/public-images/raw/main/Component_Space_WRAP.svg" alt="Component Space" width="100%"> </div>

Usage

Using this model is straightforward once transformers is installed:

pip install -U transformers

Then you can use the model to generate tweet classifications like this:

from transformers import pipeline

# Load the WRAP text-classification pipeline from the Hugging Face Hub
pipe = pipeline("text-classification", model="TomatenMarc/WRAP")

# Classify a (preprocessed) tweet; returns the predicted label and its score
prediction = pipe("Huggingface is awesome")

print(prediction)

<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;"> Notice: The tweets need to undergo <a href="https://github.com/TomatenMarc/TACO/blob/main/notebooks/classifier_cv.ipynb">preprocessing</a> before classification. </blockquote>
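Since WRAP builds on BERTweet, a typical preprocessing step is to normalize user mentions and URLs before classification. The snippet below is only a minimal sketch of such BERTweet-style normalization; the @USER and HTTPURL placeholders are an assumption, and the exact preprocessing used for TACO is documented in the linked notebook.

import re

def normalize_tweet(text: str) -> str:
    # Assumed BERTweet-style normalization; see the linked notebook for the exact steps
    text = re.sub(r"@\w+", "@USER", text)            # mask user mentions
    text = re.sub(r"https?://\S+", "HTTPURL", text)  # mask URLs
    return " ".join(text.split())                    # collapse whitespace

print(normalize_tweet("@elonmusk Tesla hits new highs https://t.co/xyz"))
# -> "@USER Tesla hits new highs HTTPURL"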

Training

The final model was trained on the entire shuffled TACO ground-truth dataset of 1,734 tweets. The topics of the dataset are distributed as follows: #abortion (25.9%), #brexit (29.0%), #got (11.0%), #lotrrop (12.1%), #squidgame (12.7%), and #twittertakeover (9.3%). For training, we used SimpleTransformers.

Additionally, the category and class distribution of the TACO dataset is as follows:

| Inference | No-Inference |
|-----------|--------------|
| 865 (49.88%) | 869 (50.12%) |

| Information | No-Information |
|-------------|----------------|
| 1081 (62.34%) | 653 (37.66%) |

| Reason | Statement | Notification | None |
|--------|-----------|--------------|------|
| 581 (33.50%) | 284 (16.38%) | 500 (28.84%) | 369 (21.28%) |

<blockquote style="border-left: 5px solid grey; background-color: #f0f5ff; padding: 10px;"> Notice: WRAP was trained to predict the four classes; the categories (Inference/Information) are aggregations of these classes based on whether they contain an inference or information component. </blockquote>
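For illustration, the category counts above are consistent with aggregating the four classes by their components (for example, 581 Reason + 284 Statement = 865 Inference tweets). The mapping below is a minimal sketch derived from these class semantics, not code from the TACO repository.

# Assumed mapping from classes to their argument components, consistent with the counts above
CLASS_COMPONENTS = {
    "Reason":       {"inference", "information"},
    "Statement":    {"inference"},
    "Notification": {"information"},
    "None":         set(),
}

def to_category(label: str, component: str) -> str:
    # Aggregate a class prediction into a binary category such as Inference/No-Inference
    name = component.capitalize()
    return name if component in CLASS_COMPONENTS[label] else f"No-{name}"

print(to_category("Statement", "inference"))    # -> "Inference"
print(to_category("Statement", "information"))  # -> "No-Information"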

Dataloader

"data_loader": {
    "type": "torch.utils.data.dataloader.DataLoader",
    "args": {
        "batch_size": 8,
        "sampler": "torch.utils.data.sampler.RandomSampler"
    }
}
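Expressed in plain PyTorch, this configuration corresponds roughly to the following sketch, where train_dataset is a stand-in for the tokenized TACO training split.

from torch.utils.data import DataLoader, RandomSampler

train_dataset = list(range(16))  # stand-in dataset; replace with the tokenized TACO training split

train_dataloader = DataLoader(
    train_dataset,
    batch_size=8,
    sampler=RandomSampler(train_dataset),
)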

Parameters of the fit()-Method:

{
    "epochs": 5,
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 4e-05
    },
    "scheduler": "WarmupLinear",
    "warmup_steps": 66,
    "weight_decay": 0.06
}
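For orientation, these hyperparameters translate roughly into the following Hugging Face TrainingArguments. This is only a sketch, not the authors' exact training setup, and "wrap-output" is a placeholder directory.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wrap-output",        # placeholder
    num_train_epochs=5,
    per_device_train_batch_size=8,
    learning_rate=4e-5,
    weight_decay=0.06,
    warmup_steps=66,
    max_grad_norm=1.0,
    lr_scheduler_type="linear",      # corresponds to the WarmupLinear scheduler
)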

Evaluation

To demonstrate WRAP's performance, we applied 6-fold (in-topic) cross-validation using the same dataset and parameters described in the Training section: the model was trained on k-1 folds and evaluated on the remaining fold.

Additionally, we assessed its ability to generalize across the 6 topics of TACO (cross-topic): each topic was used once for testing, while the remaining k-1 topics were used for training, as sketched below.
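Such a cross-topic split can be produced, for example, with scikit-learn's LeaveOneGroupOut, using each tweet's topic hashtag as its group label. The snippet is a minimal sketch with placeholder data, not code from the evaluation notebook.

from sklearn.model_selection import LeaveOneGroupOut

# Placeholder tweets, labels, and topic hashtags
texts  = ["tweet a", "tweet b", "tweet c", "tweet d"]
labels = ["Reason", "Statement", "Notification", "None"]
topics = ["#brexit", "#brexit", "#got", "#got"]

# One fold per topic: train on k-1 topics, test on the held-out topic
for train_idx, test_idx in LeaveOneGroupOut().split(texts, labels, groups=topics):
    held_out = {topics[i] for i in test_idx}
    print(f"test topic: {held_out}, train size: {len(train_idx)}, test size: {len(test_idx)}")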

Overall, the WRAP classifier performs as follows:

Content Management

| Macro-F1 | Inference | Information | Multiclass |
|----------|-----------|-------------|------------|
| In-Topic | 87.71% | 85.34% | 75.80% |
| Cross-Topic | 86.71% | 84.59% | 73.92% |

Classification

| Micro-F1 | Reason | Statement | Notification | None |
|----------|--------|-----------|--------------|------|
| In-Topic | 77.82% | 61.10% | 80.56% | 83.71% |
| Cross-Topic | 76.52% | 58.99% | 78.43% | 81.73% |
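The reported values are standard macro- and micro-averaged F1 scores; for reference, they can be computed with scikit-learn as in this minimal sketch with placeholder labels.

from sklearn.metrics import f1_score

# Placeholder gold labels and predictions for one evaluation fold
y_true = ["Reason", "Statement", "Notification", "None", "Reason"]
y_pred = ["Reason", "Notification", "Notification", "None", "Statement"]

print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("Micro-F1:", f1_score(y_true, y_pred, average="micro"))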

Environmental Impact

Licensing

WRAP © 2023 is licensed under CC BY-NC-SA 4.0