
<!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->

# distilbert-base-uncased-finetuned-eoir_privacy

This model is a fine-tuned version of distilbert-base-uncased on the eoir_privacy dataset. It achieves the following results on the evaluation set (final epoch of the training results below):

- Loss: 0.3681
- Accuracy: 0.9053
- F1: 0.8088

## Model description

The model predicts whether names in a text should be replaced with pseudonyms. The input should be a paragraph in which the names have already been masked; the model then outputs whether a pseudonym should be used, i.e., whether the EOIR courts would not allow such private or sensitive information to become public unmasked.
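A minimal sketch of preparing input in the format described above: a paragraph whose personal names have been replaced by a placeholder before classification. The mask token and the `mask_names` helper are assumptions for illustration; check the eoir_privacy dataset for the actual masking convention used during training.

```python
import re

# Assumed placeholder token; the real token used in eoir_privacy may differ.
MASK = "[MASK]"

def mask_names(paragraph: str, names: list[str]) -> str:
    """Replace each known name in the paragraph with the mask token."""
    masked = paragraph
    for name in names:
        masked = re.sub(re.escape(name), MASK, masked)
    return masked

text = "John Doe appeared before the immigration judge with his attorney."
print(mask_names(text, ["John Doe"]))
# -> "[MASK] appeared before the immigration judge with his attorney."
```

The masked paragraph is then passed to the classifier, which predicts whether a pseudonym is required.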

## Intended uses & limitations

This model encodes a minimal privacy standard and will likely not work on out-of-distribution data.

## Training and evaluation data

We train on the EOIR Privacy dataset and evaluate further using sensitivity analyses.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| No log        | 1.0   | 395  | 0.3053          | 0.8789   | 0.7432 |
| 0.3562        | 2.0   | 790  | 0.2857          | 0.8976   | 0.7883 |
| 0.2217        | 3.0   | 1185 | 0.3358          | 0.8905   | 0.7550 |
| 0.1509        | 4.0   | 1580 | 0.3505          | 0.9040   | 0.8077 |
| 0.1509        | 5.0   | 1975 | 0.3681          | 0.9053   | 0.8088 |
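The table above can be inspected programmatically, for instance to pick the checkpoint with the best F1. A small sketch with the values copied from the table (note the trade-off: validation loss is lowest at epoch 2, while accuracy and F1 peak at epoch 5):

```python
# (epoch, validation_loss, accuracy, f1) rows copied from the results table.
results = [
    (1, 0.3053, 0.8789, 0.7432),
    (2, 0.2857, 0.8976, 0.7883),
    (3, 0.3358, 0.8905, 0.7550),
    (4, 0.3505, 0.9040, 0.8077),
    (5, 0.3681, 0.9053, 0.8088),
]

# Checkpoint with the highest F1 score.
best = max(results, key=lambda r: r[3])
print(best)  # -> (5, 0.3681, 0.9053, 0.8088)
```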

### Framework versions

## Citation

```bibtex
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```