Model Card for Geo-BERT-multilingual
<!-- Provide a quick summary of what the model is/does. -->
This model predicts the geolocation of short texts (under 500 words) in the form of two-dimensional distributions, also referred to as Gaussian Mixture Models (GMMs).
Model Details
Number of predicted points: 5
Custom transformers pipeline and result visualization: https://github.com/K4TEL/geo-twitter/tree/predict
Model Description
<!-- Provide a longer summary of what this model is. -->
This project aimed to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data. The suggested approach implements BERT-based neural networks for NLP to estimate location in the form of two-dimensional GMMs (longitude, latitude, weight, covariance). The base model was fine-tuned on a Twitter dataset containing the text content and metadata context of tweets.
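As a rough illustration of what a five-component two-dimensional GMM prediction represents, the sketch below evaluates the mixture density at a query point. The coordinates, weights, and the spherical-covariance simplification are illustrative assumptions for this sketch, not actual model outputs.

```python
import numpy as np

# Hypothetical example: each of the 5 predicted outcomes is a GMM component
# with a mean point (longitude, latitude), a mixture weight, and a covariance.
# Covariances are simplified to a single variance per component here.
means = np.array([[14.42, 50.08],
                  [16.61, 49.19],
                  [18.28, 49.83],
                  [15.05, 50.77],
                  [17.25, 49.59]])
weights = np.array([0.6, 0.2, 0.1, 0.05, 0.05])   # sum to 1
variances = np.array([1.0, 2.0, 2.0, 3.0, 3.0])   # in squared degrees

def gmm_density(point, means, weights, variances):
    """Evaluate the 2-D spherical-covariance GMM density at a (lon, lat) point."""
    diff = means - point                       # (5, 2) offsets to each mean
    sq_dist = (diff ** 2).sum(axis=1)          # squared distance to each mean
    norm = 1.0 / (2.0 * np.pi * variances)     # 2-D Gaussian normalization
    comp = norm * np.exp(-sq_dist / (2.0 * variances))
    return float((weights * comp).sum())

# A single "best" point is commonly taken as the highest-weight component mean.
best = means[np.argmax(weights)]
```

The density is highest near the dominant component's mean and decays with distance, which is why the highest-weight mean is a natural single-point summary of the mixture.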
- Developed by: Kateryna Lutsai
- Model type: regression
- Language(s) (NLP): multilingual
- Finetuned from model: bert-base-multilingual-cased
Model Sources
<!-- Provide the basic links for the model. -->
- Repository: https://github.com/K4TEL/geo-twitter
- Paper: https://arxiv.org/pdf/2303.07865.pdf
- Demo: https://github.com/K4TEL/geo-twitter/blob/predict/prediction.ipynb
Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Geotagging of big data
Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Per-tweet geolocation prediction
Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
Per-tweet geolocation prediction without "user" metadata is expected to show lower prediction accuracy.
Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
There is a risk of unethical use based on data that is not publicly available.
The text-length limitation is dictated by the BERT-based model's capacity of 512 tokens (roughly 500 words).
How to Get Started with the Model
Use the code below to get started with the model:
https://github.com/K4TEL/geo-twitter/tree/predict
A short startup guide is given in the repository branch description.
Training Details
Training Data
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The Twitter dataset contained tweets with their text content, metadata ("user" and "place") context, and geolocation coordinates.
Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Information about training the model on user-defined data can be found in the GitHub repository: https://github.com/K4TEL/geo-twitter
Training Hyperparameters
- Learning rate start: 1e-5
- Learning rate end: 1e-6
- Learning rate scheduler: cosine
- Number of epochs: 3
- Batch size: 10
- Optimizer: Adam
- Intra-feature loss: mean
- Inter-feature loss: mean
- Neg log-likelihood domain: positive
- Features: NON-GEO + GEO-ONLY
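The learning-rate schedule in the table above (cosine decay from 1e-5 to 1e-6 over training) can be sketched as a plain function; `cosine_lr` below is a hypothetical helper written for this card, not code from the repository.

```python
import math

def cosine_lr(step, total_steps, lr_start=1e-5, lr_end=1e-6):
    """Cosine decay from lr_start to lr_end over total_steps optimizer steps."""
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_end + (lr_start - lr_end) * cos_factor

# Example: 3 epochs of 100 steps each, matching the epoch count in the table
total = 3 * 100
schedule = [cosine_lr(s, total) for s in range(total + 1)]
```

The schedule starts at 1e-5, reaches the midpoint value halfway through training, and ends at 1e-6, which matches the start/end learning rates listed above.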
Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
All performance metrics and results are presented in the Results section of the article pre-print: https://arxiv.org/pdf/2303.07865.pdf
Testing Data, Factors & Metrics
Testing Data
<!-- This should link to a Data Card if possible. -->
Worldwide dataset of tweets with TEXT-ONLY and NON-GEO features
Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
- Spatial metrics: mean and median Simple Accuracy Error (SAE), Acc@161
- Probabilistic metrics: mean and median Cumulative Accuracy Error (CAE), mean and median Prediction Region Area (PRA) for the 95% density area, Coverage of PRA
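As a sketch of how the spatial metrics can be computed (function names are hypothetical; the paper's exact formulation may differ), SAE is the great-circle distance between predicted and true coordinates, and Acc@161 is the fraction of predictions within 161 km (100 miles) of the truth:

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two (lon, lat) points."""
    R = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def sae_and_acc161(pred, true):
    """Mean SAE, median SAE, and Acc@161 over paired (lon, lat) lists."""
    errs = sorted(haversine_km(*p, *t) for p, t in zip(pred, true))
    n = len(errs)
    mean = sum(errs) / n
    median = errs[n // 2] if n % 2 else (errs[n // 2 - 1] + errs[n // 2]) / 2
    acc161 = sum(e <= 161 for e in errs) / n
    return mean, median, acc161
```

A perfect predictor would score 0 km mean/median SAE and an Acc@161 of 1.0.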
Results
Tweet geolocation prediction task
- TEXT-ONLY: mean SAE 1588 km, median SAE 50 km, Acc@161 of 61%
- NON-GEO: mean SAE 800 km, median SAE 25 km, Acc@161 of 80%
User home geolocation prediction task
- TEXT-ONLY: mean SAE 892 km, median SAE 31 km, Acc@161 of 74%
- NON-GEO: mean SAE 567 km, median SAE 26 km, Acc@161 of 82%
Model Architecture and Objective
A wrapper layer implementing linear regression with a configurable number of output variables, operating on the classification ([CLS]) token produced by the base BERT model.
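The wrapper can be sketched as a single linear map applied to the [CLS] embedding. The dimensions below (hidden size 768, 5 outcomes × 4 values each) follow the card's description of bert-base-multilingual-cased and the GMM output format, but the weight initialization and output layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: BERT hidden size 768; 5 predicted outcomes, each with
# (longitude, latitude, weight, covariance) = 4 regression outputs.
HIDDEN, N_OUT, PER_OUT = 768, 5, 4

# Wrapper layer: one linear regression over the [CLS] token embedding.
W = rng.normal(0.0, 0.02, size=(HIDDEN, N_OUT * PER_OUT))
b = np.zeros(N_OUT * PER_OUT)

cls_embedding = rng.normal(size=HIDDEN)   # stand-in for BERT's [CLS] output
raw = cls_embedding @ W + b               # linear regression head
components = raw.reshape(N_OUT, PER_OUT)  # one row per predicted GMM component
```

In the real model the [CLS] vector comes from the fine-tuned BERT encoder rather than random noise, and the raw outputs are further transformed into valid mixture weights and covariances.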
Hardware
NVIDIA GeForce GTX 1080 Ti
Software
Python IDE
Model Card Contact
lutsai.k@gmail.com