# Model Card for #Encoder
<!-- Provide a quick summary of what the model is/does. -->
#Encoder from HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding. The model encodes a tweet into a topic-level embedding, which can be used to estimate topic-level similarity between tweets.
## Model Details
#Encoder leverages hashtags to learn inter-post topic relevance (for retrieval) via contrastive learning over 179M tweets. It was pre-trained on pairwise posts, with the contrastive objective guiding it to capture topic relevance by learning to identify posts sharing the same hashtag. We randomly noise the hashtags to avoid trivial representations. Please refer to https://github.com/albertan017/HICL for more details.
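The hashtag noising mentioned above can be sketched as randomly replacing hashtags in the input text with a mask token. The replacement probability and the `<mask>` token choice here are assumptions for illustration; see the HICL repository for the actual scheme.

```python
import random
import re

HASHTAG = re.compile(r"#\w+")

def noise_hashtags(tweet, mask_token="<mask>", p=0.5, rng=random):
    # Randomly replace each hashtag so the model cannot solve the
    # pre-training objective by trivially matching hashtag strings.
    return HASHTAG.sub(
        lambda m: mask_token if rng.random() < p else m.group(0), tweet)
```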
### Model Description
<!-- Provide a longer summary of what this model is. -->
- Developed by: Hanzhuo Tan, Department of Computing, the Hong Kong Polytechnic University
- Model type: RoBERTa
- Language(s) (NLP): English
- License: n/a
- Finetuned from model [optional]: BERTweet
### Model Sources [optional]
<!-- Provide the basic links for the model. -->
- Repository: https://github.com/albertan017/HICL
- Paper [optional]: HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
```python
import torch
from transformers import AutoModel, AutoTokenizer

hashencoder = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")

tweet = "here's a sample tweet for encoding"
input_ids = torch.tensor([tokenizer.encode(tweet)])

with torch.no_grad():
    features = hashencoder(input_ids)  # features[0] is the last hidden state
```
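Since the model is meant to estimate topic-level similarity between tweets, here is a minimal sketch of comparing two tweets. Mean pooling over non-padding tokens and cosine similarity are our assumptions; check the HICL repository for the pooling actually used at retrieval time.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("albertan017/hashencoder")
tokenizer = AutoTokenizer.from_pretrained("albertan017/hashencoder")
model.eval()

def embed(tweet: str) -> torch.Tensor:
    # Mean-pool the last hidden state over non-padding tokens (an assumption;
    # the repository may pool differently).
    inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)     # (1, dim)

a = embed("training for the marathon this weekend")
b = embed("long run done, race day is coming up")
similarity = torch.nn.functional.cosine_similarity(a, b).item()
```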
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
We do not enforce semantic similarity: #Encoder is trained for topic-level relevance, so tweets it scores as similar may share a topic while differing in literal meaning.
## Training Details
### Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
#Encoder is pre-trained on 15 GB of plain text from 179 million tweets and 4 billion tokens. Following the practice used to pre-train BERTweet, the raw data was collected from the archived Twitter stream, containing 4TB of sampled tweets from January 2013 to June 2021. For data pre-processing, we ran the following steps. First, we employed fastText to extract English tweets and only kept tweets with hashtags. Then, low-frequency hashtags appearing in fewer than 100 tweets were filtered out to alleviate sparsity. After that, we obtained a large-scale dataset containing 179M tweets, each with at least one hashtag, covering 180K hashtags in total.
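The hashtag-frequency filtering step above can be sketched as follows (the language-identification step with fastText is omitted, and the function name is illustrative; the actual pipeline is in the HICL repository):

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")

def filter_by_hashtag_freq(tweets, min_freq=100):
    # Count in how many tweets each hashtag occurs (case-insensitive).
    freq = Counter(tag.lower() for t in tweets for tag in set(HASHTAG.findall(t)))
    kept = {tag for tag, n in freq.items() if n >= min_freq}
    # Keep only tweets that retain at least one frequent hashtag.
    return [t for t in tweets
            if any(tag.lower() in kept for tag in HASHTAG.findall(t))]
```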
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
To leverage hashtag-gathered context in pre-training, we exploit contrastive learning and train #Encoder to identify pairwise posts sharing the same hashtag for gaining topic relevance.
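A minimal sketch of such an objective is an in-batch contrastive (InfoNCE-style) loss over embeddings of post pairs sharing a hashtag. The temperature value and function name are assumptions; the exact loss used by HICL may differ.

```python
import torch
import torch.nn.functional as F

def hashtag_contrastive_loss(anchors, positives, temperature=0.05):
    # anchors[i] and positives[i] embed two posts sharing a hashtag;
    # every other row in the batch serves as an in-batch negative.
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature        # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0))      # the true pair sits on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings:
loss = hashtag_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```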
## Citation [optional]
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
BibTeX:
[More Information Needed]
APA:
[More Information Needed]