# Jordan_Name_Disambiguation
This model is a fine-tuned version of distilbert-base-uncased for token classification. It achieves the following results on the evaluation set:
- Loss: 0.0025
- Precision: 0.9811
- Recall: 0.9811
- F1: 0.9811
- Accuracy: 0.9995
## Model Description
This model differentiates a mention of the country "Jordan" as a place where the veteran served from other uses of the word: "Jordan" as a person's name, "Jordan" as part of another location (e.g., West Jordan, Utah), or the country "Jordan" appearing only in boilerplate form language.
## Intended uses & limitations
This model is only intended to determine whether "Jordan" appears in the context of a service location.
It was trained on a limited amount of data for a narrow classification task.
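A minimal sketch of how the model's tag predictions might be interpreted downstream. The helper function and the commented-out pipeline call are illustrative assumptions, not part of the released model:

```python
def is_service_mention(tags):
    """Return True if any token tag marks part of a service location
    under the ["O", "B-SER", "I-SER"] scheme used by this model."""
    return any(tag in ("B-SER", "I-SER") for tag in tags)


# With the transformers library, predictions could be obtained roughly like:
#
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="<model-id>")  # model id assumed
#   tags = [p["entity"] for p in ner("Veteran served in Jordan in 1991.")]
#
# and then interpreted with the helper above:
print(is_service_mention(["O", "O", "O", "B-SER", "O"]))  # True
print(is_service_mention(["O", "O"]))                     # False
```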
## Training Data
The training data has two columns, "text" and "service_location". The "text" column contains a snippet of text containing the word "Jordan" in some context. The "service_location" column indicates whether that mention of "Jordan" references a service location (1) or not (0).
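A toy example of the expected row format. The snippet below is invented for illustration; the real data contains PII and is not shown:

```python
# One hypothetical training row; column names come from the description above.
row = {
    "text": "Veteran served in Jordan from 1990 to 1992.",  # invented snippet
    "service_location": 1,  # 1 = "Jordan" is a service location, 0 = it is not
}
print(sorted(row.keys()))  # ['service_location', 'text']
```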
NOTE: The training data contains PII and is accessible only to team members on S3.
### Data Analysis
The table below shows, for each split, the number of examples whose "Jordan" mention is a service location (1) versus not (0).
| Label | Train | Test |
|---|---|---|
| 0 | 668 | 166 |
| 1 | 408 | 104 |
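As a quick sanity check, the split sizes and class balance implied by the table can be computed directly (all numbers are taken from the table above):

```python
import math

train = {"0": 668, "1": 408}
test = {"0": 166, "1": 104}

train_total = sum(train.values())
test_total = sum(test.values())
positive_share = train["1"] / train_total

print(train_total)               # 1076
print(test_total)                # 270
print(round(positive_share, 3))  # 0.379

# With train_batch_size = 24 (see hyperparameters), this is also consistent
# with the 45 optimizer steps per epoch reported in the training results:
print(math.ceil(train_total / 24))  # 45
```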
## Training Procedure
### Preprocessing
The data went through the following preprocessing steps:
- Tokenize text into words
- Create word-level NER tags for "Jordan" (0s and 1s)
- Apply distilbert-base-uncased tokenizer and align tokens to labeling scheme ["O", "B-SER", "I-SER"]
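The steps above can be sketched as follows. This is a simplified illustration: the real pipeline uses the distilbert-base-uncased fast tokenizer's `word_ids()` to align subwords to words, which is mimicked here with a plain list:

```python
def make_word_tags(text, service_location):
    """Steps 1-2: whitespace-tokenize and tag 'Jordan' tokens 0/1."""
    words = text.split()
    tags = [service_location if w.strip(".,").lower() == "jordan" else 0
            for w in words]
    return words, tags


def align_labels(word_ids, word_tags):
    """Step 3 sketch: map subword pieces to ["O", "B-SER", "I-SER"].

    `word_ids` has one entry per subword token: the index of the word it
    came from, or None for special tokens ([CLS], [SEP]). None positions
    are returned as None here (in practice they get label id -100 so the
    loss ignores them).
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(None)
        elif word_tags[wid] == 0:
            labels.append("O")
        elif wid != prev:
            labels.append("B-SER")  # first subword of a tagged word
        else:
            labels.append("I-SER")  # continuation subword
        prev = wid
    return labels


words, tags = make_word_tags("Served in Jordan.", 1)
print(words, tags)  # ['Served', 'in', 'Jordan.'] [0, 0, 1]
# e.g. the third word split into two subwords by the tokenizer:
print(align_labels([None, 0, 1, 2, 2, None], tags))
# [None, 'O', 'O', 'B-SER', 'I-SER', None]
```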
### Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
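For reference, the same hyperparameters expressed as a plain config dict. The keys mirror the common transformers `TrainingArguments` names; this is a sketch, not the exact training script:

```python
# Hyperparameters from the list above, keyed by TrainingArguments-style names.
training_config = {
    "learning_rate": 5e-05,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,      # Adam betas=(0.9, 0.999)
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3,
}
print(training_config["num_train_epochs"])  # 3
```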
### Training Results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.0891 | 1.0 | 45 | 0.0069 | 0.934 | 0.934 | 0.934 | 0.9984 |
| 0.0036 | 2.0 | 90 | 0.0037 | 0.9902 | 0.9528 | 0.9712 | 0.9993 |
| 0.0012 | 3.0 | 135 | 0.0025 | 0.9811 | 0.9811 | 0.9811 | 0.9995 |
### Framework Versions
- Transformers 4.33.2
- Pytorch 2.0.1+cpu
- Datasets 2.14.5
- Tokenizers 0.13.3