Model Card for MiniAM2
<!-- Provide a quick summary of what the model is/does. -->
Case-sensitive, multilingual, multi-attribute model to predict the gender of Twitter users as well as their organization status from text-only input data.
Model Details
Model Description
<!-- Provide a longer summary of what this model is. -->
MiniAM2 is an assemblage model that combines enriched distillation with weakly supervised learning. It is a multilingual model that detects the gender and organization status of Twitter users from their name, screen_name, and bio description; it is lighter than the state-of-the-art M3 model while outperforming its text-only counterpart (see the M2 comparison below). The model is obtained through a novel semi-supervised process that we call "assemblage": a multi-language strategy that enriches the distillation process with weakly supervised learning from a small amount of annotated data. The model can also adapt to a language similar to an existing one without annotated data in the target language. We provide the model so that social scientists can use it in their analyses. A link to the publication describing this process will be added once it is available.
The M3 model is named for its multilingual, multimodal, and multi-attribute characteristics. In our case, we discarded the multimodal part, focusing on text inputs only. When comparing results we also consider a so-called M2 model: the M3 model run without its computer-vision component. As our model is smaller thanks to the enriched distillation process, we call the final model MiniAM2, a lighter version of a multilingual, multi-attribute model trained with an assemblage process. MiniAM2 provides two key pieces of information: the probability that an observation is an organization or a human, and the probability that it is a male or a female.
- Developed by: Arnault Gombert, Borja Sánchez-López, Jesus Cerquides
- Shared by: Citibeats
- Model type: Text-classification
- Language(s) (NLP): EN, ES, FR
- License: This software is © 2023 The Social Coin, SL and is licensed under the OpenRAIL-M license. See license
Model Sources
<!-- Provide the basic links for the model. -->
Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Social media platforms offer an invaluable wealth of data to understand what is taking place in our society. However, social media data hides demographic biases related to characteristics such as gender or age. Therefore, considering social media data as representative of the population can lead to fallacious interpretations. For instance, in France in 2021, women represent 51.6% of the population, whereas on Twitter they represent only 33.5% of French users. With such a significant difference between social network user demographics and the actual population, detecting the gender or age before delving into a deeper analysis of social phenomena becomes a priority.
Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Bias may appear in every inference produced by a model. By knowing the empirical distributions present in the data, it is possible to adjust the predictions and reduce bias. We intend to use this model to capture gender distributions in Twitter data in order to quantify the potential bias in them and afterwards apply other techniques to minimize that bias as much as possible.
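As an illustration of this idea (our own sketch, not the authors' published method), predicted class probabilities can be reweighted with a simple prior correction, given the class distribution observed on Twitter and the one in the target population. The function name and priors below are hypothetical:

```python
import numpy as np

def prior_correct(probs, twitter_prior, population_prior):
    """Reweight class probabilities by the ratio of population to Twitter priors.

    probs: (n, k) array of predicted probabilities; both priors are length-k arrays.
    """
    w = np.asarray(population_prior) / np.asarray(twitter_prior)
    adjusted = np.asarray(probs) * w
    # Renormalize so each row sums to 1 again
    return adjusted / adjusted.sum(axis=1, keepdims=True)

# Example: correct man/woman probabilities using the French figures cited above
corrected = prior_correct([[0.6, 0.4]], twitter_prior=[0.665, 0.335], population_prior=[0.484, 0.516])
```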
Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
This model is not created to identify the gender or organization status of individuals, and that use is not allowed. We also do not allow using this model for the purpose of exploiting, harming, or attempting to exploit or harm minors in any way. This model is released under the OpenRAIL license; before using the model, the user should read and agree to our white paper and license conditions.
Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Since MiniAM2 is a deep learning model, it does contain bias. Although we studied the performance of MiniAM2 extensively, we did not work on bias control. For example, the model may tend to predict the Man label when the text inputs are too short or lack gender information, simply because men form the majority of Twitter users.
Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from huggingface_hub import from_pretrained_keras

miniam2 = from_pretrained_keras("CitibeatsAI/miniam2")
```
Obtain predictions by passing a list with the screen_name, name, and bio description columns of a dataframe to the model:
```python
predictions = miniam2([df["screen_name"], df["name"], df["bio"]])
```
The output is an array with n rows (one per user sent to the model) and 3 columns. The i-th row is the prediction for the i-th user. The first column is the probability of belonging to an organization, the second column is the probability of the man class, and the third column is the probability of the woman class.
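For convenience, a minimal sketch (ours, not part of the official API) of how the probability array can be turned into hard labels, following the column order described above:

```python
import numpy as np

# Column order: organization, man, woman (as described above)
labels = np.array(["organization", "man", "woman"])
predicted_labels = labels[np.argmax(np.asarray(predictions), axis=1)]
```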
Training Details
Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
Language | Training set size |
---|---|
English | 3,316,545 |
Spanish | 3,608,997 |
French | 1,086,762 |
Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Preprocessing
- A preprocessing step removes all punctuation, accents, and styled characters from the text (the text is not lowercased; the model is case sensitive). We provide an example preprocessing function named tokenizer_preprocess_multiple; it requires a dictionary of patterns and the corresponding substitutions, keyed by identifiers.
```python
import re

def tokenizer_preprocess_multiple(text):
    # Replace punctuation with spaces, then normalize accents and styled characters
    text = re.sub(r'[!\"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]', ' ', text)
    return multiple_replace(text, patterns, substitutions)

def search_code(match, patterns, subs):
    # Return the substitution whose pattern contains the matched character
    for pat, code in patterns.items():
        if match in pat:
            return subs[code]
    return match  # no matching pattern: keep the text unchanged

def multiple_replace(text, patterns, substitutions):
    # Create a single regular expression from the dictionary keys
    pat = '|'.join(patterns.keys())
    # For each match, look up the corresponding substitution
    return re.sub(pat, lambda mo: search_code(mo.group(0), patterns, substitutions), text)
```
```python
### Example of patterns and substitutions
patterns = {
    "[0123456789]": "0",
    u'[aàáâãäåăąǎǟǡǻȁȃȧ𝐚𝑎𝒂𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊𝝰𝞪]': "1",
    "!": "2",
}
substitutions = {
    "0": "numbers",
    "1": "a",
    "2": ", prove me wrong",
}

text = "The letter 'ã' is 𝕒s useful 𝓪s 5!"
print(multiple_replace(text, patterns, substitutions))
# The letter 'a' is as useful as numbers, prove me wrong
```
Training Hyperparameters
- Training regime: fp32 <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
The following hyperparameters were used during training:
Hyperparameters | Value |
---|---|
name | RMSprop |
weight_decay | None |
clipnorm | None |
global_clipnorm | None |
clipvalue | None |
use_ema | False |
ema_momentum | 0.99 |
ema_overwrite_frequency | 100 |
jit_compile | False |
is_legacy_optimizer | False |
learning_rate | 0.0010000000474974513 |
rho | 0.9 |
momentum | 0.0 |
epsilon | 1e-07 |
centered | False |
training_precision | float32 |
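For reference, a minimal sketch (assumed, not the authors' exact training script) of how the RMSprop configuration listed above could be instantiated in Keras:

```python
import tensorflow as tf

# Values taken from the hyperparameter table above; everything else is Keras defaults
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=1e-3,
    rho=0.9,
    momentum=0.0,
    epsilon=1e-7,
    centered=False,
)
```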
Speeds, Sizes, Times
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- Inference speed: ~2582 users per second
- Model size: 2.7 M parameters
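For reference, 2.7 M parameters stored in fp32 correspond to roughly 2.7 × 10⁶ × 4 bytes ≈ 11 MB of weights (our approximation, excluding any serialization overhead).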
Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
Testing Data, Factors & Metrics
Testing Data
<!-- This should link to a Data Card if possible. -->
Hard-labeled (human-annotated) data
Language | Test set size |
---|---|
English | 2,568 |
Spanish | 2,498 |
French | 2,136 |
<!--#### Factors
These are the things the evaluation is disaggregating by, e.g., subpopulations or domains.
-->
Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
We compared our model against M3 and against M2, the variant of M3 that accepts only text as input (exactly the same inputs as MiniAM2), on accuracy, recall, precision, loss, and F1.
Using all these metrics provides a wider and deeper comparison between the models.
Results
Model | Accuracy | Recall | Precision | Loss | F1 |
---|---|---|---|---|---|
English | |||||
M2 | 80.92 | 84.55 | 73.75 | 0.49 | 76.81 |
M3 | 86.4 | 85.04 | 83.56 | 0.39 | 84.2 |
MiniAM2 | 83.39 | 85.07 | 80.05 | 0.5 | 82.14 |
Spanish | |||||
M2 | 77.29 | 81.91 | 68.78 | 0.55 | 69.94 |
M3 | 88.44 | 86.18 | 86.55 | 0.39 | 86.3 |
MiniAM2 | 86.52 | 82.55 | 85.87 | 0.37 | 84.04 |
French | |||||
M2 | 68.45 | 74.19 | 63.79 | 0.77 | 63.64 |
M3 | 83.17 | 82.46 | 80.63 | 0.47 | 81.42 |
MiniAM2 | 81.86 | 78.71 | 80.07 | 0.5 | 79.33 |
In short, MiniAM2 outperforms M2 and even approaches the quality of M3 (which additionally processes images to make its predictions).
Summary
MiniAM2, a multilingual gender and organization-status detection model, is lighter, faster and more accurate than M2. According to the experiments, MiniAM2 closely follows the quality of M3, implying that our text-only model competes with models that also benefit from image input data.
In the future, we plan to add more languages given the cheap and fast process developed in our work.
Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: Intel Xeon Gold 6148 (TDP of 150W)
- Hours used: 30
- Cloud Provider: Amazon Web Services
- Compute Region: eu-north-1
- Carbon Emitted: ~0.05 kg CO2eq
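As a back-of-the-envelope check (ours, approximate): 150 W × 30 h ≈ 4.5 kWh of compute energy, which, combined with the low grid carbon intensity of eu-north-1, is consistent with the roughly 0.05 kg CO2eq reported above.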
Model Architecture and Objective
To process the text, MiniAM2 applies two types of tokenization, character-level and word-level (except for screen_name, which uses character-level tokenization only). The tokenized inputs are fed to embedding layers followed by two layers of feed-forward neural networks, referred to as a deep-learning (DL) component from now on. For every combination of input and tokenization, the model has one DL component, referred to as DL1, ..., DL5. MiniAM2 then concatenates the 5 deep-learning representations before a final softmax layer with 3 outputs: organization, man, and woman.
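A minimal, illustrative Keras sketch of this architecture follows. The vocabulary sizes, sequence lengths, pooling choice, and layer widths are our assumptions for readability, not the actual MiniAM2 configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dl_branch(seq_len, vocab_size, name):
    # One DL component: embedding followed by two feed-forward layers
    inp = layers.Input(shape=(seq_len,), dtype="int32", name=name)
    x = layers.Embedding(vocab_size, 32)(inp)
    x = layers.GlobalAveragePooling1D()(x)  # assumed pooling before the dense layers
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    return inp, x

# Five input/tokenization combinations: char-level for screen_name,
# char- and word-level for name and bio (DL1, ..., DL5)
inputs, representations = zip(*[
    dl_branch(32, 500, "screen_name_char"),
    dl_branch(32, 500, "name_char"),
    dl_branch(8, 20000, "name_word"),
    dl_branch(128, 500, "bio_char"),
    dl_branch(32, 20000, "bio_word"),
])

x = layers.Concatenate()(list(representations))
outputs = layers.Dense(3, activation="softmax", name="org_man_woman")(x)
model = tf.keras.Model(list(inputs), outputs)
```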
<!--
Compute Infrastructure
[More Information Needed]
Hardware
[More Information Needed]
Software
[More Information Needed] -->
Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
Coming soon...
BibTeX:
Coming soon...
APA:
Coming soon...
<!--
Glossary [optional]
If relevant, include terms and calculations in this section that can help readers understand the model or model card.
[More Information Needed] -->
Model Card Contact
- bsanchez@citibeats.com