Contrastive user encoder (multi-post)

This model is a DistilBertModel obtained by fine-tuning distilbert-base-uncased with an author-based triplet loss.

Details

Training and evaluation details are provided in our EMNLP Findings paper:

Training

We fine-tuned DistilBERT on triplets consisting of:

- a set of anchor posts \(A\) written by a given author;
- a positive example \(p\): a further post by the same author;
- a negative example \(n\): a post by a different author.

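As a rough illustration of how such triplets could be assembled, the sketch below samples anchors, a positive, and a negative from a toy corpus grouped by author; the corpus, author names, and number of anchor posts are hypothetical and do not reflect the exact sampling procedure used in the paper.

```python
import random

# Hypothetical toy corpus: author id -> that author's Reddit posts.
posts_by_author = {
    "author_a": ["post a1", "post a2", "post a3", "post a4"],
    "author_b": ["post b1", "post b2"],
    "author_c": ["post c1", "post c2", "post c3"],
}

def sample_triplet(corpus, n_anchors=3):
    """Sample (anchor posts, positive post, negative post) for a random author."""
    eligible = [a for a, posts in corpus.items() if len(posts) > n_anchors]
    author = random.choice(eligible)
    drawn = random.sample(corpus[author], n_anchors + 1)
    anchors, positive = drawn[:-1], drawn[-1]                  # same author
    other = random.choice([a for a in corpus if a != author])
    negative = random.choice(corpus[other])                    # different author
    return anchors, positive, negative

anchors, positive, negative = sample_triplet(posts_by_author)
```
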
To compute the loss, we use the [CLS] encodings of the anchors, positive examples, and negative examples from the last layer of the DistilBERT encoder. We average the anchor post encodings feature-wise and minimize the triplet loss \(\max(||\overline{f(A)} - f(p)|| - ||\overline{f(A)} - f(n)|| + \alpha, 0)\)

where:

- \(f(\cdot)\) is the last-layer [CLS] encoding produced by the DistilBERT encoder;
- \(\overline{f(A)}\) is the feature-wise average of the anchor post encodings;
- \(f(p)\) and \(f(n)\) are the encodings of the positive and negative example;
- \(\alpha\) is the margin.

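A minimal sketch of this loss computation, assuming the Hugging Face transformers and PyTorch APIs; the example posts and the margin value are placeholders, not the training configuration used in the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def cls_encode(texts):
    """Return last-layer [CLS] encodings for a list of posts."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0, :]   # (n_texts, hidden_dim)

anchor_posts = ["anchor post 1", "anchor post 2", "anchor post 3"]
positive_post = "another post by the same author"
negative_post = "a post by a different author"

anchor_mean = cls_encode(anchor_posts).mean(dim=0)        # feature-wise average
f_p = cls_encode([positive_post]).squeeze(0)
f_n = cls_encode([negative_post]).squeeze(0)

alpha = 1.0  # margin (illustrative value)
loss = torch.clamp(
    torch.norm(anchor_mean - f_p) - torch.norm(anchor_mean - f_n) + alpha,
    min=0.0,
)
```
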
Evaluation and usage

The model yields performance advantages on downstream user-based classification tasks.
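
For example, a user-level representation for such tasks could be obtained by averaging the [CLS] encodings of a user's posts, as sketched below; the checkpoint identifier is a placeholder to be replaced with this model's name on the Hugging Face Hub.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint name: substitute the identifier of this model on the Hub.
checkpoint = "path/to/contrastive-user-encoder-multipost"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)

user_posts = ["first post by the user", "second post by the user"]

with torch.no_grad():
    batch = tokenizer(user_posts, padding=True, truncation=True, return_tensors="pt")
    cls_encodings = encoder(**batch).last_hidden_state[:, 0, :]
    user_embedding = cls_encodings.mean(dim=0)  # feature-wise average over posts

# user_embedding can now be fed to a downstream user-level classifier.
```
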

We encourage usage and benchmarking on tasks involving:

Limitations

Because they are trained exclusively on Reddit data, our models likely overfit to linguistic markers and traits that are relevant for characterizing the Reddit user population but less salient in the general population. Domain-specific fine-tuning may be required before deployment.

Furthermore, our self-supervised approach imposes little to no control over biases, which the models may actively exploit as part of their heuristics in contrastive and downstream tasks.