# Jordan_Name_Disambiguation
This model is a fine-tuned version of distilbert-base-uncased for token classification. It achieves the following results on the evaluation set:
- Loss: 0.0025
- Precision: 0.9811
- Recall: 0.9811
- F1: 0.9811
- Accuracy: 0.9995
## Model Description
This model differentiates a mention of the country "Jordan" as a place where the veteran served from other uses of the word: "Jordan" as a person's name, "Jordan" as part of another location (e.g., West Jordan, Utah), or the country "Jordan" appearing only in boilerplate form language.
## Intended uses & limitations
This model is only intended to determine whether "Jordan" appears in the context of a service location.
It was trained on a limited amount of data for a narrow classification task.
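A minimal sketch of how the model's tag predictions might be interpreted downstream. The helper function and the commented-out pipeline call are illustrative assumptions, not part of the released model:

```python
def is_service_mention(tags):
    """Return True if any token tag marks part of a service location
    under the ["O", "B-SER", "I-SER"] scheme used by this model."""
    return any(tag in ("B-SER", "I-SER") for tag in tags)


# With the transformers library, predictions could be obtained roughly like:
#
#   from transformers import pipeline
#   ner = pipeline("token-classification", model="<model-id>")  # model id assumed
#   tags = [p["entity"] for p in ner("Veteran served in Jordan in 1991.")]
#
# and then interpreted with the helper above:
print(is_service_mention(["O", "O", "O", "B-SER", "O"]))  # True
print(is_service_mention(["O", "O"]))                     # False
```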
## Training Data
The training data has two columns, "text" and "service_location". The "text" column contains a snippet of text containing the word "Jordan" in some context. The "service_location" column indicates whether that mention of "Jordan" references a service location (1) or not (0).
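A toy example of the expected row format. The snippet below is invented for illustration; the real data contains PII and is not shown:

```python
# One hypothetical training row; column names come from the description above.
row = {
    "text": "Veteran served in Jordan from 1990 to 1992.",  # invented snippet
    "service_location": 1,  # 1 = "Jordan" is a service location, 0 = it is not
}
print(sorted(row.keys()))  # ['service_location', 'text']
```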
NOTE: The training data contains PII and is accessible only to team members on S3.
### Data Analysis
The table below shows, for each split, the number of examples whose "Jordan" mention is a service location (1) versus not (0).
| Label | Train | Test |
|---|---|---|
| 0 | 668 | 166 |
| 1 | 408 | 104 |
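As a quick sanity check, the split sizes and class balance implied by the table can be computed directly (all numbers are taken from the table above):

```python
import math

train = {"0": 668, "1": 408}
test = {"0": 166, "1": 104}

train_total = sum(train.values())
test_total = sum(test.values())
positive_share = train["1"] / train_total

print(train_total)               # 1076
print(test_total)                # 270
print(round(positive_share, 3))  # 0.379

# With train_batch_size = 24 (see hyperparameters), this is also consistent
# with the 45 optimizer steps per epoch reported in the training results:
print(math.ceil(train_total / 24))  # 45
```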
## Training Procedure
### Preprocessing
The data went through the following preprocessing steps:
- Tokenize text into words
- Create word-level NER tags for "Jordan" (0s and 1s)
- Apply distilbert-base-uncased tokenizer and align tokens to labeling scheme ["O", "B-SER", "I-SER"]
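The steps above can be sketched as follows. This is a simplified illustration: the real pipeline uses the distilbert-base-uncased fast tokenizer's `word_ids()` to align subwords to words, which is mimicked here with a plain list:

```python
def make_word_tags(text, service_location):
    """Steps 1-2: whitespace-tokenize and tag 'Jordan' tokens 0/1."""
    words = text.split()
    tags = [service_location if w.strip(".,").lower() == "jordan" else 0
            for w in words]
    return words, tags


def align_labels(word_ids, word_tags):
    """Step 3 sketch: map subword pieces to ["O", "B-SER", "I-SER"].

    `word_ids` has one entry per subword token: the index of the word it
    came from, or None for special tokens ([CLS], [SEP]). None positions
    are returned as None here (in practice they get label id -100 so the
    loss ignores them).
    """
    labels, prev = [], None
    for wid in word_ids:
        if wid is None:
            labels.append(None)
        elif word_tags[wid] == 0:
            labels.append("O")
        elif wid != prev:
            labels.append("B-SER")  # first subword of a tagged word
        else:
            labels.append("I-SER")  # continuation subword
        prev = wid
    return labels


words, tags = make_word_tags("Served in Jordan.", 1)
print(words, tags)  # ['Served', 'in', 'Jordan.'] [0, 0, 1]
# e.g. the third word split into two subwords by the tokenizer:
print(align_labels([None, 0, 1, 2, 2, None], tags))
# [None, 'O', 'O', 'B-SER', 'I-SER', None]
```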
### Training Hyperparameters
- learning_rate: 5e-05
- train_batch_size: 24
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
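For reference, the same hyperparameters expressed as a plain config dict. The keys mirror the common transformers `TrainingArguments` names; this is a sketch, not the exact training script:

```python
# Hyperparameters from the list above, keyed by TrainingArguments-style names.
training_config = {
    "learning_rate": 5e-05,
    "per_device_train_batch_size": 24,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,      # Adam betas=(0.9, 0.999)
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-08,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3,
}
print(training_config["num_train_epochs"])  # 3
```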
### Training Results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.0891 | 1.0 | 45 | 0.0069 | 0.934 | 0.934 | 0.934 | 0.9984 |
| 0.0036 | 2.0 | 90 | 0.0037 | 0.9902 | 0.9528 | 0.9712 | 0.9993 |
| 0.0012 | 3.0 | 135 | 0.0025 | 0.9811 | 0.9811 | 0.9811 | 0.9995 |
### Framework Versions
- Transformers 4.33.2
- Pytorch 2.0.1+cpu
- Datasets 2.14.5
- Tokenizers 0.13.3