Model Card for MiniAM2

<!-- Provide a quick summary of what the model is/does. -->

Case-sensitive, multilingual, multi-attribute model that predicts the gender of Twitter users, as well as their organization status, from text-only input data.

Model Details

Model Description

<!-- Provide a longer summary of what this model is. -->

MiniAM2 is an assemblage model built by enriching distillation with weakly supervised learning. It is a multilingual model that detects the gender and organizational status of Twitter users from their name, screen_name, and bio description; it is lighter than the state-of-the-art M3 model and outperforms it. The model is obtained with a novel semi-supervised process that we call "assemblage": a multi-language strategy based on enriching the distillation process and weakly supervised learning with a small amount of annotated data. This process can also adapt the model to a language similar to an existing one without any annotated data in the target language. We provide our model so that social scientists can use it in their analyses. A link to the publication describing this process will be added once it is available.

The M3 model is named for its multilingual, multimodal, and multi-attribute characteristics. In our case, we discarded the multimodal part, focusing on text inputs only. When comparing results we also consider a so-called M2 model: M3 run without its computer-vision component. As our model is smaller thanks to the enriched distillation process, we call the final model MiniAM2, a lighter version of a multilingual, multi-attribute model trained with the assemblage process. MiniAM2 provides two key pieces of information: the probability that an observation is an organization or a human, and the probability that it is a man or a woman.

Model Sources

<!-- Provide the basic links for the model. -->

Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Social media platforms offer an invaluable wealth of data for understanding what is taking place in our society. However, social media data hides demographic biases related to characteristics such as gender or age, so treating it as representative of the population can lead to fallacious interpretations. For instance, in France in 2021, women represented 51.6% of the population, whereas they accounted for only 33.5% of French Twitter users. With such a significant gap between social network user demographics and the actual population, detecting gender or age before delving into a deeper analysis of social phenomena becomes a priority.

Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Bias may appear in every inference a model provides. By knowing the empirical distributions present in the data, it is possible to adjust the predictions and reduce bias. We intend to use this model to capture gender distributions in Twitter data in order to quantify the potential bias and afterwards apply other techniques to minimize that bias as much as possible.

Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model is not created to identify the gender or organization status of specific individuals, and that use is not allowed. Nor do we allow using this model for the purpose of exploiting, harming, or attempting to exploit or harm minors in any way. This model is released under the OpenRAIL license; before using the model, the user should read and agree to our white paper and the license conditions.

Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Since MiniAM2 is a deep learning model, it does contain bias. Although we studied the performance of MiniAM2 extensively, we did not work on bias control. For example, the model may tend to predict the Man label when the text inputs are too short or lack gender information, simply because men form the majority of Twitter users.

Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

from huggingface_hub import from_pretrained_keras
miniam2 = from_pretrained_keras("CitibeatsAI/miniam2")

Obtain predictions by sending the model a list with the screen_name, name, and bio description columns of a dataframe:

predictions = miniam2([df["screen_name"], df["name"], df["bio"]])

The output will be an array with n rows (one per user sent to the model) and 3 columns. The i-th row is the prediction for the i-th user: the first column is the probability of belonging to an organization, the second the probability of the man class, and the third the probability of the woman class.
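
For instance, the probabilities can be decoded into discrete labels with an argmax over the three columns (a minimal sketch; the numpy decoding below is our own illustration, not part of the model's API):

import numpy as np

# Column order follows the description above: organization, man, woman
labels = np.array(["organization", "man", "woman"])
predicted_labels = labels[np.argmax(predictions, axis=1)]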

Training Details

Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

| Language | Training set |
|----------|--------------|
| English  | 3,316,545    |
| Spanish  | 3,608,997    |
| French   | 1,086,762    |

Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Preprocessing

  1. In a preprocessing step, eliminate all punctuation, accents, and styled Unicode variants from the text (do not lowercase; the model is case-sensitive). We provide an example preprocessing function named tokenizer_preprocess_multiple. It requires a dictionary of patterns and a corresponding dictionary of substitutions keyed by identifiers.
import re

def tokenizer_preprocess_multiple(text):
    # Replace ASCII punctuation with spaces, then normalize accented and
    # styled characters via the pattern/substitution dictionaries below.
    text = re.sub(r'[!\"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]', ' ', text)
    return multiple_replace(text, patterns, substitutions)

def search_code(match, patterns, subs):
    # Find which pattern the matched character belongs to and return the
    # substitution registered under that pattern's identifier.
    for pat, code in patterns.items():
        if match in pat:
            return subs[code]
    return match  # fallback; unreachable when the regex is built from the same keys

def multiple_replace(text, patterns, substitutions):
    # Build one alternation regex from the dictionary keys, then look up
    # the corresponding substitution for each match.
    pat = '|'.join(patterns.keys())
    return re.sub(pat, lambda mo: search_code(mo.group(0), patterns, substitutions), text)

# Example of patterns and substitutions
patterns = {
    "[0123456789]": "0",
    u'[aàáâãäåăąǎǟǡǻȁȃȧ𝐚𝑎𝒂𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊𝝰𝞪]': "1",
    "!": "2",
}

substitutions = {
    "0": "numbers",
    "1": "a",
    "2": ", prove me wrong",
}

text = "The letter 'ã' is 𝕒s useful 𝓪s 5!"
print(multiple_replace(text, patterns, substitutions))
# The letter 'a' is as useful as numbers, prove me wrong
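
In practice, the same preprocessing would be mapped over each text column before calling the model (a sketch; the column names follow the usage example above):

for col in ["screen_name", "name", "bio"]:
    df[col] = df[col].astype(str).map(tokenizer_preprocess_multiple)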

Training Hyperparameters

The following hyperparameters were used during training:

| Hyperparameter          | Value                  |
|-------------------------|------------------------|
| name                    | RMSprop                |
| weight_decay            | None                   |
| clipnorm                | None                   |
| global_clipnorm         | None                   |
| clipvalue               | None                   |
| use_ema                 | False                  |
| ema_momentum            | 0.99                   |
| ema_overwrite_frequency | 100                    |
| jit_compile             | False                  |
| is_legacy_optimizer     | False                  |
| learning_rate           | 0.0010000000474974513  |
| rho                     | 0.9                    |
| momentum                | 0.0                    |
| epsilon                 | 1e-07                  |
| centered                | False                  |
| training_precision      | float32                |
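
For reference, a minimal sketch of how this table maps onto a Keras RMSprop optimizer (illustrative only, not the original training script; the arguments follow the TF 2.x Keras optimizer API):

import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=1e-3,  # stored as 0.0010000000474974513 in float32
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=100,
    jit_compile=False,
)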

Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Testing Data, Factors & Metrics

Testing Data

<!-- This should link to a Data Card if possible. -->

Hard-labeled (human-annotated) data:

| Language | Test set |
|----------|----------|
| English  | 2,568    |
| Spanish  | 2,498    |
| French   | 2,136    |

<!--#### Factors

These are the things the evaluation is disaggregating by, e.g., subpopulations or domains.

-->

Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

We compared our model against M3 and against M2 (the M3 variant that accepts only text as input, i.e., exactly the same inputs as MiniAM2) on accuracy, recall, precision, loss, and F1.

Using all these metrics provides a broader and deeper comparison between the models; a sketch of how they can be computed is given below.
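
The scikit-learn usage below is our own illustration: y_true stands for the annotated labels, and the macro averaging is an assumption, since the averaging scheme is not specified here.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss

# y_true: annotated integer labels (0=organization, 1=man, 2=woman)
y_pred = predictions.argmax(axis=1)
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, average="macro")
precision = precision_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
loss = log_loss(y_true, predictions)  # cross-entropy on the softmax probabilities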

Results

| Language | Model   | Accuracy | Recall | Precision | Loss | F1    |
|----------|---------|----------|--------|-----------|------|-------|
| English  | M2      | 80.92    | 84.55  | 73.75     | 0.49 | 76.81 |
| English  | M3      | 86.4     | 85.04  | 83.56     | 0.39 | 84.2  |
| English  | MiniAM2 | 83.39    | 85.07  | 80.05     | 0.5  | 82.14 |
| Spanish  | M2      | 77.29    | 81.91  | 68.78     | 0.55 | 69.94 |
| Spanish  | M3      | 88.44    | 86.18  | 86.55     | 0.39 | 86.3  |
| Spanish  | MiniAM2 | 86.52    | 82.55  | 85.87     | 0.37 | 84.04 |
| French   | M2      | 68.45    | 74.19  | 63.79     | 0.77 | 63.64 |
| French   | M3      | 83.17    | 82.46  | 80.63     | 0.47 | 81.42 |
| French   | MiniAM2 | 81.86    | 78.71  | 80.07     | 0.5  | 79.33 |

In short, MiniAM2 outperforms M2 and even approaches the quality of M3 (which additionally processes images to make its predictions).

Summary

MiniAM2, a multilingual gender and organization-status detection model, is lighter, faster, and more accurate than M2. According to our experiments, MiniAM2 closely follows the quality of M3, implying that a model which processes only text can compete with models that also benefit from image input data.

In the future, we plan to add more languages, given the cheap and fast assemblage process developed in this work.

Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Model Architecture and Objective

To process the text, MiniAM2 applies two types of tokenization: character-level and word-level (except for screen_name, which uses character-level tokenization only). The tokenized inputs are fed to embedding layers followed by two feed-forward layers; we refer to this stack as a deep-learning (DL) component. There is one DL component for every combination of input and tokenization, referred to as DL1, ..., DL5. MiniAM2 then concatenates the 5 deep-learning representations before a final softmax layer with 3 outputs: organization, man, and woman.
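
A rough sketch of this layout in Keras (the layer widths, vocabulary sizes, sequence lengths, and pooling choice below are placeholders for illustration, not the trained model's actual values):

import tensorflow as tf
from tensorflow.keras import layers

def dl_component(seq_len, vocab_size, emb_dim=32, hidden=64):
    # One DL component: embedding followed by two feed-forward layers.
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, emb_dim)(inp)
    x = layers.GlobalAveragePooling1D()(x)  # assumption: pooling flattens the sequence
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dense(hidden, activation="relu")(x)
    return inp, x

# Five components: char- and word-level views of name and bio,
# plus a char-level view of screen_name (all sizes are placeholders).
specs = [(40, 300), (30, 20000), (40, 300), (30, 20000), (20, 300)]
components = [dl_component(s, v) for s, v in specs]
inputs = [inp for inp, _ in components]
reprs = [rep for _, rep in components]
concat = layers.Concatenate()(reprs)
outputs = layers.Dense(3, activation="softmax")(concat)  # organization, man, woman
model = tf.keras.Model(inputs, outputs)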

<!--

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed] -->

Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

Coming soon...

BibTeX:

Coming soon...

APA:

Coming soon...

<!--

Glossary [optional]

If relevant, include terms and calculations in this section that can help readers understand the model or model card.

[More Information Needed] -->

Model Card Contact