Model Card for MiniAM2

<!-- Provide a quick summary of what the model is/does. -->

Case-sensitive, multilingual, multi-attribute model that predicts the gender of Twitter users, as well as their organization status, from text-only input data.

Model Details

Model Description

<!-- Provide a longer summary of what this model is. -->

MiniAM2 is an assemblage model built by enriching distillation with weakly supervised learning. It is a multilingual model that detects the gender and organizational status of Twitter users from their name, screen_name, and bio description; it is lighter than the state-of-the-art M3 model and outperforms it. The model is obtained with a novel semi-supervised process that we call "assemblage": a multi-language strategy based on enriching the distillation process and weakly supervised learning with a small amount of annotated data. This process can also adapt the model to a language similar to an existing one without any annotated data in the target language. We provide our model so that social scientists can use it in their analyses. A link to the publication describing this process will be added once it is available.

The M3 model is named for its multilingual, multimodal, and multi-attribute characteristics. In our case, we discarded the multimodal part, focusing on text inputs only. When comparing results we also consider a so-called M2 model: M3 run without its computer-vision component. As our model is smaller thanks to the enriched distillation process, we call the final model MiniAM2, a lighter version of a multilingual, multi-attribute model trained with the assemblage process. MiniAM2 provides two key pieces of information: the probability that an observation is an organization or a human, and the probability that it is a man or a woman.

Model Sources

<!-- Provide the basic links for the model. -->

Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Social media platforms offer an invaluable wealth of data for understanding what is taking place in our society. However, social media data hides demographic biases related to characteristics such as gender or age, so treating it as representative of the population can lead to fallacious interpretations. For instance, in France in 2021, women represented 51.6% of the population, whereas they accounted for only 33.5% of French Twitter users. With such a significant gap between social network user demographics and the actual population, detecting gender or age before delving into a deeper analysis of social phenomena becomes a priority.

Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Bias may appear in every inference a model provides. By knowing the empirical distributions present in the data, it is possible to adjust the predictions and reduce bias. We intend to use this model to capture gender distributions in Twitter data in order to quantify the potential bias and afterwards apply other techniques to minimize that bias as much as possible.

Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model is not created to identify the gender or organization status of specific individuals, and that use is not allowed. Nor do we allow using this model for the purpose of exploiting, harming, or attempting to exploit or harm minors in any way. This model is released under the OpenRAIL license; before using the model, the user should read and agree to our white paper and the license conditions.

Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

Since MiniAM2 is a deep learning model, it does contain bias. Although we studied the performance of MiniAM2 extensively, we did not work on bias control. For example, the model may tend to predict the Man label when the text inputs are too short or lack gender information, simply because men form the majority of Twitter users.

Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

from huggingface_hub import from_pretrained_keras
miniam2 = from_pretrained_keras("CitibeatsAI/miniam2")

Obtain predictions by sending the model a list with the screen_name, name, and bio description columns of a dataframe:

predictions = miniam2([df["screen_name"], df["name"], df["bio"]])

The output will be an array with n rows (one per user sent to the model) and 3 columns. The i-th row is the prediction for the i-th user: the first column is the probability of belonging to an organization, the second the probability of the man class, and the third the probability of the woman class.
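
For instance, the probabilities can be decoded into discrete labels with an argmax over the three columns (a minimal sketch; the numpy decoding below is our own illustration, not part of the model's API):

import numpy as np

# Column order follows the description above: organization, man, woman
labels = np.array(["organization", "man", "woman"])
predicted_labels = labels[np.argmax(predictions, axis=1)]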

Training Details

Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

| Language | Training set |
|----------|--------------|
| English  | 3,316,545    |
| Spanish  | 3,608,997    |
| French   | 1,086,762    |

Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Preprocessing

  1. In a preprocessing step, eliminate all punctuation, accents, and styled Unicode variants from the text (do not lowercase; the model is case-sensitive). We provide an example preprocessing function named tokenizer_preprocess_multiple. It requires a dictionary of patterns and a corresponding dictionary of substitutions keyed by identifiers.
import re

def tokenizer_preprocess_multiple(text):
    # Replace ASCII punctuation with spaces, then normalize accented and
    # styled characters via the pattern/substitution dictionaries below.
    text = re.sub(r'[!\"#$%&\'()*+,-./:;<=>?@\[\\\]^_`{|}~]', ' ', text)
    return multiple_replace(text, patterns, substitutions)

def search_code(match, patterns, subs):
    # Find which pattern the matched character belongs to and return the
    # substitution registered under that pattern's identifier.
    for pat, code in patterns.items():
        if match in pat:
            return subs[code]
    return match  # fallback; unreachable when the regex is built from the same keys

def multiple_replace(text, patterns, substitutions):
    # Build one alternation regex from the dictionary keys, then look up
    # the corresponding substitution for each match.
    pat = '|'.join(patterns.keys())
    return re.sub(pat, lambda mo: search_code(mo.group(0), patterns, substitutions), text)

# Example of patterns and substitutions
patterns = {
    "[0123456789]": "0",
    u'[aàáâãäåăąǎǟǡǻȁȃȧ𝐚𝑎𝒂𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊𝝰𝞪]': "1",
    "!": "2",
}

substitutions = {
    "0": "numbers",
    "1": "a",
    "2": ", prove me wrong",
}

text = "The letter 'ã' is 𝕒s useful 𝓪s 5!"
print(multiple_replace(text, patterns, substitutions))
# The letter 'a' is as useful as numbers, prove me wrong
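
In practice, the same preprocessing would be mapped over each text column before calling the model (a sketch; the column names follow the usage example above):

for col in ["screen_name", "name", "bio"]:
    df[col] = df[col].astype(str).map(tokenizer_preprocess_multiple)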

Training Hyperparameters

The following hyperparameters were used during training:

| Hyperparameter          | Value                  |
|-------------------------|------------------------|
| name                    | RMSprop                |
| weight_decay            | None                   |
| clipnorm                | None                   |
| global_clipnorm         | None                   |
| clipvalue               | None                   |
| use_ema                 | False                  |
| ema_momentum            | 0.99                   |
| ema_overwrite_frequency | 100                    |
| jit_compile             | False                  |
| is_legacy_optimizer     | False                  |
| learning_rate           | 0.0010000000474974513  |
| rho                     | 0.9                    |
| momentum                | 0.0                    |
| epsilon                 | 1e-07                  |
| centered                | False                  |
| training_precision      | float32                |
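
For reference, a minimal sketch of how this table maps onto a Keras RMSprop optimizer (illustrative only, not the original training script; the arguments follow the TF 2.x Keras optimizer API):

import tensorflow as tf

optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=1e-3,  # stored as 0.0010000000474974513 in float32
    rho=0.9,
    momentum=0.0,
    epsilon=1e-07,
    centered=False,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=100,
    jit_compile=False,
)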

Speeds, Sizes, Times

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

Testing Data, Factors & Metrics

Testing Data

<!-- This should link to a Data Card if possible. -->

Hard-labeled (human-annotated) data:

| Language | Test set |
|----------|----------|
| English  | 2,568    |
| Spanish  | 2,498    |
| French   | 2,136    |

<!--#### Factors

These are the things the evaluation is disaggregating by, e.g., subpopulations or domains.

-->

Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

We compared our model against M3 and against M2 (the M3 variant that accepts only text as input, i.e., exactly the same inputs as MiniAM2) on accuracy, recall, precision, loss, and F1.

Using all these metrics provides a broader and deeper comparison between the models; a sketch of how they can be computed is given below.
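
The scikit-learn usage below is our own illustration: y_true stands for the annotated labels, and the macro averaging is an assumption, since the averaging scheme is not specified here.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, log_loss

# y_true: annotated integer labels (0=organization, 1=man, 2=woman)
y_pred = predictions.argmax(axis=1)
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, average="macro")
precision = precision_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
loss = log_loss(y_true, predictions)  # cross-entropy on the softmax probabilities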

Results

| Language | Model   | Accuracy | Recall | Precision | Loss | F1    |
|----------|---------|----------|--------|-----------|------|-------|
| English  | M2      | 80.92    | 84.55  | 73.75     | 0.49 | 76.81 |
| English  | M3      | 86.4     | 85.04  | 83.56     | 0.39 | 84.2  |
| English  | MiniAM2 | 83.39    | 85.07  | 80.05     | 0.5  | 82.14 |
| Spanish  | M2      | 77.29    | 81.91  | 68.78     | 0.55 | 69.94 |
| Spanish  | M3      | 88.44    | 86.18  | 86.55     | 0.39 | 86.3  |
| Spanish  | MiniAM2 | 86.52    | 82.55  | 85.87     | 0.37 | 84.04 |
| French   | M2      | 68.45    | 74.19  | 63.79     | 0.77 | 63.64 |
| French   | M3      | 83.17    | 82.46  | 80.63     | 0.47 | 81.42 |
| French   | MiniAM2 | 81.86    | 78.71  | 80.07     | 0.5  | 79.33 |

In short, MiniAM2 outperforms M2 and even approaches the quality of M3 (which additionally processes images to make its predictions).

Summary

MiniAM2, a multilingual gender and organization-status detection model, is lighter, faster, and more accurate than M2. According to our experiments, MiniAM2 closely follows the quality of M3, implying that a model which processes only text can compete with models that also benefit from image input data.

In the future, we plan to add more languages, given the cheap and fast assemblage process developed in this work.

Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Model Architecture and Objective

To process the text, MiniAM2 applies two types of tokenization: character-level and word-level (except for screen_name, which uses character-level tokenization only). The tokenized inputs are fed to embedding layers followed by two feed-forward layers; we refer to this stack as a deep-learning (DL) component. There is one DL component for every combination of input and tokenization, referred to as DL1, ..., DL5. MiniAM2 then concatenates the 5 deep-learning representations before a final softmax layer with 3 outputs: organization, man, and woman.
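
A rough sketch of this layout in Keras (the layer widths, vocabulary sizes, sequence lengths, and pooling choice below are placeholders for illustration, not the trained model's actual values):

import tensorflow as tf
from tensorflow.keras import layers

def dl_component(seq_len, vocab_size, emb_dim=32, hidden=64):
    # One DL component: embedding followed by two feed-forward layers.
    inp = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, emb_dim)(inp)
    x = layers.GlobalAveragePooling1D()(x)  # assumption: pooling flattens the sequence
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dense(hidden, activation="relu")(x)
    return inp, x

# Five components: char- and word-level views of name and bio,
# plus a char-level view of screen_name (all sizes are placeholders).
specs = [(40, 300), (30, 20000), (40, 300), (30, 20000), (20, 300)]
components = [dl_component(s, v) for s, v in specs]
inputs = [inp for inp, _ in components]
reprs = [rep for _, rep in components]
concat = layers.Concatenate()(reprs)
outputs = layers.Dense(3, activation="softmax")(concat)  # organization, man, woman
model = tf.keras.Model(inputs, outputs)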

<!--

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed] -->

Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

Coming soon...

BibTeX:

Coming soon...

APA:

Coming soon...

<!--

Glossary [optional]

If relevant, include terms and calculations in this section that can help readers understand the model or model card.

[More Information Needed] -->

Model Card Contact