
<!-- This model card has been generated automatically according to the information the Trainer had access to. You should probably proofread and complete it, then remove this comment. -->

# distilbert-base-uncased-finetuned-eoir_privacy

This model is a fine-tuned version of distilbert-base-uncased on the eoir_privacy dataset. It achieves the following results on the evaluation set (final epoch of the training results below):

- Loss: 0.3681
- Accuracy: 0.9053
- F1: 0.8088

## Model description

The model predicts whether names in a text should be replaced with pseudonyms. The input should be a paragraph in which the names have already been masked; the model then outputs whether a pseudonym should be used, i.e., whether the EOIR courts would not allow such private or sensitive information to become public unmasked.
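A minimal sketch of preparing input in the format described above: a paragraph whose personal names have been replaced by a placeholder before classification. The mask token and the `mask_names` helper are assumptions for illustration; check the eoir_privacy dataset for the actual masking convention used during training.

```python
import re

# Assumed placeholder token; the real token used in eoir_privacy may differ.
MASK = "[MASK]"

def mask_names(paragraph: str, names: list[str]) -> str:
    """Replace each known name in the paragraph with the mask token."""
    masked = paragraph
    for name in names:
        masked = re.sub(re.escape(name), MASK, masked)
    return masked

text = "John Doe appeared before the immigration judge with his attorney."
print(mask_names(text, ["John Doe"]))
# -> "[MASK] appeared before the immigration judge with his attorney."
```

The masked paragraph is then passed to the classifier, which predicts whether a pseudonym is required.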

## Intended uses & limitations

This model encodes a minimal privacy standard and will likely not work on out-of-distribution data.

## Training and evaluation data

We train on the EOIR Privacy dataset and evaluate further using sensitivity analyses.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| No log        | 1.0   | 395  | 0.3053          | 0.8789   | 0.7432 |
| 0.3562        | 2.0   | 790  | 0.2857          | 0.8976   | 0.7883 |
| 0.2217        | 3.0   | 1185 | 0.3358          | 0.8905   | 0.7550 |
| 0.1509        | 4.0   | 1580 | 0.3505          | 0.9040   | 0.8077 |
| 0.1509        | 5.0   | 1975 | 0.3681          | 0.9053   | 0.8088 |
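The table above can be inspected programmatically, for instance to pick the checkpoint with the best F1. A small sketch with the values copied from the table (note the trade-off: validation loss is lowest at epoch 2, while accuracy and F1 peak at epoch 5):

```python
# (epoch, validation_loss, accuracy, f1) rows copied from the results table.
results = [
    (1, 0.3053, 0.8789, 0.7432),
    (2, 0.2857, 0.8976, 0.7883),
    (3, 0.3358, 0.8905, 0.7550),
    (4, 0.3505, 0.9040, 0.8077),
    (5, 0.3681, 0.9053, 0.8088),
]

# Checkpoint with the highest F1 score.
best = max(results, key=lambda r: r[3])
print(best)  # -> (5, 0.3681, 0.9053, 0.8088)
```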

### Framework versions

## Citation

```bibtex
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```